How to properly Multithread in OpenCV in 2019?

Crigges · Jan 31, 2019 · Viewed 10.7k times

Background:

I read some articles and posts regarding Multithreading in OpenCV:

  • On the one hand, you can build OpenCV with TBB or OpenMP support, which parallelizes OpenCV's functions internally.
  • On the other hand, you can create multiple threads yourself and call the functions in parallel to realize multithreading on the application level.

But I couldn't get a consistent answer as to which method of multithreading is the right way to go.

Regarding TBB, an answer from 2012 with 5 upvotes:

With WITH_TBB=ON OpenCV tries to use several threads for some functions. The problem is that just a handsome of function are threaded with TBB at the moment (may be a dozen). So, it is hard to see any speedup. OpenCV philosophy here is that application should be multi-threaded, not OpenCV functions.[...]

Regarding multithreading on the application level, a comment from a moderator on answers.opencv.org:

please avoid using your own multithreading with opencv. a lot of functions are explicitly not thread-safe. rather rebuild the opencv libs with TBB or openmp support.

But another answer with 3 upvotes states:

The library itself is thread safe in that you can have multiple calls into the library at the same time, however the data is not always thread safe.

Problem Description:

So I thought it was at least okay to use (multi)threading on application level. But I encountered strange performance problems when running my program for longer time periods.

After investigating these performance problems I created this minimal, complete, and verifiable example code:

#include "opencv2/opencv.hpp"
#include <iostream>
#include <vector>
#include <chrono>
#include <thread>

using namespace cv;
using namespace std;
using namespace std::chrono;

void blurSlowdown(void*) {
    Mat m1(360, 640, CV_8UC3);
    Mat m2(360, 640, CV_8UC3);
    medianBlur(m1, m2, 3);
}

int main()
{
    for (;;) {
        high_resolution_clock::time_point start = high_resolution_clock::now();

        for (int k = 0; k < 100; k++) {
            thread t(blurSlowdown, nullptr);
            t.join(); //INTENTIONALLY PUT HERE READ PROBLEM DESCRIPTION
        }

        high_resolution_clock::time_point end = high_resolution_clock::now();
        cout << duration_cast<microseconds>(end - start).count() << endl;
    }
}

Actual Behavior:

If the program is running for an extended period of time the time spans printed by

cout << duration_cast<microseconds>(end - start).count() << endl;

are getting larger and larger.

After running the program for around 10 minutes, the printed time spans have doubled, which cannot be explained by normal fluctuations.

Expected Behavior:

I would expect the time spans to stay pretty much constant, even though they might be longer than when calling the function directly.

Notes:

When calling the function directly:

[...]
for (int k = 0; k < 100; k++) {
    blurSlowdown(nullptr);
}
[...]

The printed time spans are staying constant.

When not calling the cv function:

void blurSlowdown(void*) {
    Mat m1(360, 640, CV_8UC3);
    Mat m2(360, 640, CV_8UC3);
    //medianBlur(m1, m2, 3);
}

The printed time spans are staying constant too. So there must be something wrong when using threading in combination with OpenCV functions.

  • I know that the code above does NOT achieve actual multithreading: only one thread calling the blurSlowdown() function is active at any time.
  • I know that creating threads and cleaning them up afterwards does not come for free and will be slower than calling the function directly.
  • This is NOT about the code being slow in general. The problem is that the printed time spans get longer and longer over time.
  • The problem is not specific to the medianBlur() function, since it also happens with other functions such as erode() or blur().
  • The problem was reproduced on Mac with clang++; see the comment by @Mark Setchell.
  • The problem is amplified when using the debug library instead of the release library.

My testing environment:

  • Windows 10 64bit
  • MSVC compiler
  • Official OpenCV 3.4.2 binaries

My Questions:

  • Is it okay to use (multi)threading on application level with OpenCV?
  • If yes, why are the time spans printed by my program above GROWING over time?
  • If no, why is OpenCV then considered thread safe, and how should the statement from Kirill Kornyakov be interpreted instead?
  • Is TBB / OpenMP in 2019 now widely supported?
  • If yes, what offers better performance: multithreading on the application level (if allowed) or TBB / OpenMP?

Answer

FutureJJ · Mar 20, 2019

First of all, thank you for the clarity of the question.

Q: Is it okay to use (multi)threading on application level with OpenCV?

A: Yes, it is totally OK to use multithreading on the application level with OpenCV, as long as you are using functions that can take advantage of multithreading, such as blurring or colour space conversion. For these you can split the image into multiple parts, apply the function to each part, and then recombine the parts to give the final output.

Some functions, such as Hough or pca_analysis, cannot give correct results when they are applied to divided image sections that are then recombined. Application-level multithreading of this kind may give incorrect results for such functions and should therefore not be used with them.

As πάντα ῥεῖ mentioned, your implementation of multithreading will not give you an advantage because you are joining the thread inside the for loop itself. I would suggest you use promise and future objects (if you want an example of how to do that, let me know in the comments and I will share the snippet).
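A minimal sketch of the future-based approach, assuming std::async as the way to obtain the std::future objects (the answer itself does not include a snippet). The function name sumInParts is hypothetical, and summing integers only stands in for per-part image work such as a blur, so the sketch does not depend on OpenCV. The point it illustrates is that every task is launched before any result is awaited, unlike a join placed inside the launching loop.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical example: sums `data` by splitting it into `parts` chunks,
// with each chunk handled by its own asynchronous task.
long long sumInParts(const std::vector<int>& data, std::size_t parts) {
    assert(parts > 0);
    const std::size_t chunk = data.size() / parts;
    std::vector<std::future<long long>> pending;

    // Phase 1: launch all tasks without waiting on any of them.
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t begin = i * chunk;
        // The last chunk also takes the remainder elements.
        const std::size_t end = (i + 1 == parts) ? data.size() : begin + chunk;
        pending.push_back(std::async(std::launch::async, [&data, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            return std::accumulate(first, last, 0LL);
        }));
    }

    // Phase 2: collect the results; get() blocks until a task has finished.
    long long total = 0;
    for (auto& result : pending) {
        total += result.get();
    }
    return total;
}
```

Calling sumInParts(values, 4) on a vector of a thousand ones would run four tasks and return 1000. An explicit std::promise could be used instead where the result has to be set from an existing thread.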

The answer below took a lot of research. Thanks for asking the question; it really helped me add to my multithreading knowledge :)

Q: If yes, why are the time spans printed by my program above GROWING over time?

A: After a lot of research I found out that creating and destroying threads costs a lot of CPU time as well as memory. When we initialize a thread (in your code, with the line thread t(blurSlowdown, nullptr);), an identifier is written to the memory location this variable refers to, and that identifier lets us refer to the thread. Your program creates and destroys threads at a very high rate. A program is allocated a thread pool from which it can run and destroy threads. To keep it short, here is what happens:

  1. When you create a thread, an identifier is created that points to this thread.
  2. When you destroy the thread, this memory is freed.

BUT

  1. When you create a new thread right after the first one is destroyed, the identifier of the new thread points to a new location (different from that of the previous thread) in the thread pool.

  2. After repeatedly creating and destroying threads, the thread pool is exhausted, so the CPU is forced to slow down our program's cycles a bit until space in the thread pool is freed for a new thread.

Intel TBB and OpenMP are very good at thread pool management so this problem may not occur while using them.
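To illustrate the idea of reusing threads rather than creating one per call, here is a minimal sketch of a fixed-size worker pool built only on the C++ standard library. The class name WorkerPool and its submit/waitIdle methods are hypothetical, and this is not how TBB or OpenMP are implemented internally. Whether it removes the slowdown described in the question has not been measured here.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool: threads are created once and reused for every job.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t threadCount) {
        assert(threadCount > 0);
        for (std::size_t i = 0; i < threadCount; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& worker : workers_) {
            worker.join();  // workers drain the queue before exiting
        }
    }

    WorkerPool(const WorkerPool&) = delete;
    WorkerPool& operator=(const WorkerPool&) = delete;

    // Queue a job; one of the existing worker threads will pick it up.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
            ++unfinished_;
        }
        wake_.notify_one();
    }

    // Block until every submitted job has finished.
    void waitIdle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return unfinished_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) {
                    return;  // stopping and nothing left to do
                }
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can proceed
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --unfinished_;
            }
            idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::queue<std::function<void()>> jobs_;
    std::size_t unfinished_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

In the question's loop, a single WorkerPool created before the for (;;) loop could receive the 100 calls through submit(), followed by one waitIdle() per timing round, so that no thread is created or destroyed inside the measured section.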

Q: Is TBB in 2019 now widely supported?

A: Yes, you can take advantage of TBB in your OpenCV program, provided you turn on TBB support when building OpenCV.

Here is a program for TBB implementation in medianBlur:

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
#include <iostream>
#include <chrono>

using namespace cv;
using namespace std;
using namespace std::chrono;

class Parallel_process : public cv::ParallelLoopBody
{

private:
    cv::Mat img;
    cv::Mat& retVal;
    int size;
    int diff;

public:
    Parallel_process(cv::Mat inputImage, cv::Mat& outImage,
                     int sizeVal, int diffVal)
        : img(inputImage), retVal(outImage),
          size(sizeVal), diff(diffVal)
    {
    }

    virtual void operator()(const cv::Range& range) const
    {
        for(int i = range.start; i < range.end; i++)
        {
            /* divide image in 'diff' number
               of parts and process simultaneously */

            cv::Mat in(img, cv::Rect(0, (img.rows/diff)*i,
                                     img.cols, img.rows/diff));
            cv::Mat out(retVal, cv::Rect(0, (retVal.rows/diff)*i,
                                         retVal.cols, retVal.rows/diff));

            cv::medianBlur(in, out, size);
        }
    }
};

int main()
{
    VideoCapture cap(0);

    cv::Mat img, out;

    while(1)
    {
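        if (!cap.read(img) || img.empty())
            break; // stop when no frame could be grabbed
        out = cv::Mat::zeros(img.size(), img.type());

        // split the frame into 8 stripes; a TBB-enabled OpenCV build runs them in parallel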
        auto start1 = high_resolution_clock::now();
        cv::parallel_for_(cv::Range(0, 8), Parallel_process(img, out, 9, 8));
        //cv::medianBlur(img, out, 9); //Uncomment to compare time w/o TBB
        auto stop1 = high_resolution_clock::now();
        auto duration1 = duration_cast<microseconds>(stop1 - start1);

        auto time_taken1 = duration1.count()/1000;
        cout << "TBB Time: " <<  time_taken1 << "ms" << endl;

        cv::imshow("image", img);
        cv::imshow("blur", out);
        cv::waitKey(1);
    }

    return 0;
}

On my machine, the TBB implementation takes around 10 ms, and without TBB it takes around 40 ms.

Q: If yes, what offers better performance, multithreading on the application level(if allowed) or TBB / OpenMP?

A: I would suggest using TBB/OpenMP over POSIX multithreading (pthread/thread), because TBB gives you better control over threads and a better structure for writing parallel code, and it manages pthreads internally. If you use pthreads, you have to take care of synchronization, safety, etc. in your own code. These frameworks abstract away the thread handling, which can get very complex.

Edit: I checked the comments regarding image dimensions that are not evenly divisible by the number of threads among which you want to divide the processing. Here is a potential workaround (I haven't tested it, but it should work): scale the image resolution to compatible dimensions, like this:

If your image res is 485 x 647, scale it to 488 x 648, then pass it to Parallel_process, then scale the output back to the original size of 485 x 647.
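As an alternative to rescaling (an assumption on my part, not something the original answer proposes), the stripe boundaries themselves can absorb the remainder. The sketch below uses a hypothetical helper, stripeFor, that spreads the leftover rows over the first stripes so that every row is covered exactly once. It is plain C++ and does not call OpenCV.

```cpp
#include <cassert>

// Hypothetical helper type: the rows covered by one stripe.
struct Stripe {
    int firstRow;
    int rowCount;
};

// Rows of stripe `index` when `totalRows` rows are divided into `parts` stripes.
// The first (totalRows % parts) stripes each receive one extra row.
Stripe stripeFor(int totalRows, int parts, int index) {
    assert(totalRows >= 0 && parts > 0 && index >= 0 && index < parts);
    const int base = totalRows / parts;
    const int extra = totalRows % parts;
    const int firstRow = index * base + (index < extra ? index : extra);
    const int rowCount = base + (index < extra ? 1 : 0);
    return {firstRow, rowCount};
}
```

For 485 rows and 8 parts, the first five stripes get 61 rows and the remaining three get 60, which adds up to 485. Inside Parallel_process, the values could replace the fixed (img.rows/diff)*i and img.rows/diff terms passed to cv::Rect. A stripe can be empty when there are fewer rows than parts, so that case would need a guard.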

For comparison of TBB and OpenMP check this answer