I read some articles and posts regarding Multithreading in OpenCV:
But I couldn't find a consistent answer on which method of multithreading is the right way to go.
Regarding TBB, an answer from 2012 with 5 upvotes:
With WITH_TBB=ON, OpenCV tries to use several threads for some functions. The problem is that just a handful of functions are threaded with TBB at the moment (maybe a dozen). So, it is hard to see any speedup. The OpenCV philosophy here is that the application should be multi-threaded, not the OpenCV functions. [...]
Regarding multithreading at the application level, a comment from a moderator on answers.opencv.org:
please avoid using your own multithreading with opencv. a lot of functions are explicitly not thread-safe. rather rebuild the opencv libs with TBB or openmp support.
But another answer with 3 upvotes states:
The library itself is thread safe in that you can have multiple calls into the library at the same time, however the data is not always thread safe.
So I thought it was at least okay to use (multi)threading on application level. But I encountered strange performance problems when running my program for longer time periods.
After investigating these performance problems I created this minimal, complete, and verifiable example code:
#include "opencv2/opencv.hpp"
#include <vector>
#include <chrono>
#include <thread>
using namespace cv;
using namespace std;
using namespace std::chrono;
void blurSlowdown(void*) {
    Mat m1(360, 640, CV_8UC3);
    Mat m2(360, 640, CV_8UC3);
    medianBlur(m1, m2, 3);
}
int main()
{
    for (;;) {
        high_resolution_clock::time_point start = high_resolution_clock::now();
        for (int k = 0; k < 100; k++) {
            thread t(blurSlowdown, nullptr);
            t.join(); // INTENTIONALLY PUT HERE, READ PROBLEM DESCRIPTION
        }
        high_resolution_clock::time_point end = high_resolution_clock::now();
        cout << duration_cast<microseconds>(end - start).count() << endl;
    }
}
If the program is running for an extended period of time the time spans printed by
cout << duration_cast<microseconds>(end - start).count() << endl;
are getting larger and larger.
After running the program for around 10 minutes the printed time spans have doubled, which is not explainable by normal fluctuations.
The behavior I would expect is that the time spans stay pretty much constant, even though they might be longer than calling the function directly.
When calling the function directly:
[...]
for (int k = 0; k < 100; k++) {
    blurSlowdown(nullptr);
}
[...]
The printed time spans are staying constant.
When not calling the cv function:
void blurSlowdown(void*) {
    Mat m1(360, 640, CV_8UC3);
    Mat m2(360, 640, CV_8UC3);
    //medianBlur(m1, m2, 3);
}
The printed time spans are staying constant too. So there must be something wrong when using threading in combination with OpenCV functions.
Note that the problem is not specific to the blurSlowdown() function or the medianBlur() function, since it happens with other functions like erode() or blur() too.

First of all, thank you for the clarity of the question.
Q: Is it okay to use (multi)threading on application level with OpenCV?
A: Yes, it is totally okay to use multithreading at the application level with OpenCV, as long as you use functions that can take advantage of it, such as blurring or colour space conversion. For these you can split the image into multiple parts, apply the function to each part in parallel, and then recombine the parts to give the final output.
Some functions, such as Hough transforms or pca_analysis, cannot give correct results when they are applied to divided image sections and the results recombined. Applying application-level multithreading to such functions may give incorrect results, so it should not be done.
As πάντα ῥεῖ mentioned, your implementation of multithreading will not give you an advantage because you are joining the thread inside the for loop itself. I would suggest you use promise and future objects (if you want an example of how to, let me know in the comments and I will share the snippet).
The answer below took a lot of research; thanks for asking the question, it really helped me add to my multithreading knowledge :)
Q: If yes, why are the time spans printed by my program above GROWING over time?
A: After a lot of research I found out that creating and destroying threads takes a lot of CPU as well as memory resources. When we initialize a thread (in your code, with the line thread t(blurSlowdown, nullptr);), an identifier is written to the memory location this variable points to, and this identifier enables us to refer to the thread. Now, in your program you are creating and destroying threads at a very high rate. There is a thread pool allocated to the program, through which it can create and destroy threads, and this is what happens:

When you create a new thread right after the previous one is destroyed, the identifier of the new thread points to a new location in the thread pool (a location other than the previous thread's).

After repeatedly creating and destroying threads, the thread pool is exhausted, and the CPU is forced to slow down your program's cycles a bit so that the thread pool is freed again to make space for new threads.
Intel TBB and OpenMP are very good at thread pool management, so this problem may not occur when using them.
Q: Is TBB in 2019 now widely supported?
A: Yes, you can take advantage of TBB in your OpenCV program by turning on TBB support when building OpenCV.
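For reference, a typical CMake invocation for such a build might look like the following; the source path and extra flags are assumptions you would adapt to your own setup (WITH_TBB is the real OpenCV build option):

```shell
# Configure an OpenCV build with TBB enabled (paths are placeholders).
cmake -D WITH_TBB=ON -D CMAKE_BUILD_TYPE=Release ../opencv
make -j8
```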
Here is a program for TBB implementation in medianBlur:
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include <iostream>
#include <chrono>
using namespace cv;
using namespace std;
using namespace std::chrono;
class Parallel_process : public cv::ParallelLoopBody
{
private:
    cv::Mat img;
    cv::Mat& retVal;
    int size;
    int diff;

public:
    Parallel_process(cv::Mat inputImage, cv::Mat& outImage,
                     int sizeVal, int diffVal)
        : img(inputImage), retVal(outImage),
          size(sizeVal), diff(diffVal)
    {
    }

    virtual void operator()(const cv::Range& range) const
    {
        for (int i = range.start; i < range.end; i++)
        {
            /* divide the image into 'diff' horizontal strips
               and process them simultaneously */
            cv::Mat in(img, cv::Rect(0, (img.rows / diff) * i,
                                     img.cols, img.rows / diff));
            cv::Mat out(retVal, cv::Rect(0, (retVal.rows / diff) * i,
                                         retVal.cols, retVal.rows / diff));
            cv::medianBlur(in, out, size);
        }
    }
};
int main()
{
    VideoCapture cap(0);
    cv::Mat img, out;
    while (1)
    {
        cap.read(img);
        out = cv::Mat::zeros(img.size(), CV_8UC3);

        // create 8 threads and use TBB
        auto start1 = high_resolution_clock::now();
        cv::parallel_for_(cv::Range(0, 8), Parallel_process(img, out, 9, 8));
        //cv::medianBlur(img, out, 9); // uncomment to compare the time without TBB
        auto stop1 = high_resolution_clock::now();
        auto duration1 = duration_cast<microseconds>(stop1 - start1);
        auto time_taken1 = duration1.count() / 1000;
        cout << "TBB Time: " << time_taken1 << "ms" << endl;

        cv::imshow("image", img);
        cv::imshow("blur", out);
        cv::waitKey(1);
    }
    return 0;
}
On my machine, TBB implementation takes around 10ms and w/o TBB it takes around 40ms.
Q: If yes, what offers better performance, multithreading on the application level(if allowed) or TBB / OpenMP?
A: I would suggest using TBB/OpenMP over POSIX-style multithreading (pthread/std::thread), because TBB gives you better control over threads and a better structure for writing parallel code, and it manages the pthreads internally. If you use pthreads directly, you have to take care of synchronization, safety, etc. in your own code. Using these frameworks abstracts away the thread handling, which can otherwise get very complex.
Edit: I checked the comments regarding the incompatibility of the image dimensions with the number of threads into which you want to divide the processing. So here is a potential workaround (I haven't tested it, but it should work): scale the image resolution to compatible dimensions. For example, if your image resolution is 485 x 647, scale it to 488 x 648, pass it to Parallel_process, then scale the output back to the original size of 485 x 647.
For a comparison of TBB and OpenMP, check this answer.