I'm running a camera acquisition program that performs processing on acquired images, and I'm using simple OpenMP directives for this processing. So basically I wait for an image from the camera, and then process it.
After migrating to VC2010, I see a very strange performance problem: under VC2010 my app takes nearly 100% CPU, while it takes only 10% under VC2008.
If I benchmark only the processing code, I get no difference between VC2010 and VC2008; the difference only appears when the acquisition functions are used.
I have reduced the code needed to reproduce the problem to a simple loop that does the following:
for (int i = 0; i < 1000; ++i)
{
    GetImage(buffer); // wait for image
    Copy2Array(buffer, my_array);
    long long sum = 0; // do some simple OpenMP parallel loop
    #pragma omp parallel for reduction(+:sum)
    for (int j = 0; j < size; ++j)
        sum += my_array[j];
}
This loop eats 5% of CPU with 2008, and 70% with 2010.
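For anyone without the Matrox hardware, the loop above can be approximated by replacing the blocking GetImage() call with a sleep. This is only a sketch under assumptions: fake_get_image, sum_pixels and run_loop are names invented for illustration, the frame period is arbitrary, and <thread> needs a C++11 compiler (VC2010 does not ship it, so Sleep() from <windows.h> would have to play that role there). It is not confirmed here that a plain sleep triggers the same CPU load as the real acquisition call.

```cpp
#include <cassert>
#include <chrono>
#include <climits>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the blocking GetImage() call: sleeps for one
// frame period instead of waiting on real hardware, then fills the buffer
// with a repeating 0..255 pattern.
void fake_get_image(std::vector<unsigned char>& buffer, int frame_ms)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(frame_ms));
    for (std::size_t k = 0; k < buffer.size(); ++k)
        buffer[k] = static_cast<unsigned char>(k & 0xFF);
}

// Same reduction as in the original loop. Without OpenMP enabled the pragma
// is ignored and the loop simply runs serially.
long long sum_pixels(const std::vector<unsigned char>& data)
{
    // OpenMP 2.0 (the version VC2008/VC2010 implement) requires a signed
    // integer loop index, so the element count has to fit in an int.
    assert(data.size() <= static_cast<std::size_t>(INT_MAX));
    const int size = static_cast<int>(data.size());
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int j = 0; j < size; ++j)
        sum += data[j];
    return sum;
}

// Runs `frames` iterations of wait-then-process and returns the last sum.
long long run_loop(int frames, int frame_ms, std::size_t pixels)
{
    std::vector<unsigned char> buffer(pixels);
    long long last = 0;
    for (int i = 0; i < frames; ++i) {
        fake_get_image(buffer, frame_ms);
        last = sum_pixels(buffer);
    }
    return last;
}
```

Calling something like run_loop(1000, 40, 1 << 20) from main() and watching the process in a CPU monitor should show whether the worker threads stay busy during the simulated wait.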
I've done some profiling, which shows that in 2010 most of the time is spent in OpenMP's vcomp100.dll!_vcomp::PartialBarrierN::Block.
I have also done some concurrency profiling:
In 2008, the processing work is distributed over 3 worker threads, which are only lightly active because the processing time is much shorter than the image waiting time.
The same threads appear in 2010, but they are all 100% occupied by the PartialBarrierN::Block function. As I have four cores, they are eating 75% of the CPU, which is roughly what I see in the CPU usage.
So it looks like there is a conflict between OpenMP and the (proprietary) Matrox acquisition library. But is it a bug in VS2010 or in the Matrox library? Is there anything I can do? Using VC++2010 is mandatory for me, so I cannot just stick with 2008.
Big thanks
Using the new concurrency framework, as suggested by DeadMG, leads to 40% CPU. Profiling it shows that the time is spent in processing, so it doesn't show the bug I'm seeing with OpenMP, but performance in my case is much poorer than with OpenMP.
I have installed an evaluation version of latest Intel C++. It shows exactly the same performance problems!!
I cross-posted to the MSDN forum.
Tested on Windows 7 64-bit and XP 32-bit, with the exact same results (on the same machine).
In 2010 OpenMP, each worker thread spin-waits for about 200 ms after completing a task. In my case, with an I/O wait followed by a repetitive OpenMP task, this massively loads the CPU.
The solution is to change this behaviour. Intel C++ has an extension routine for this, kmp_set_blocktime(). However, Visual 2010 does not offer such a possibility.
In this Autodesk note, they talk about the problem for Intel C++. That compiler first introduced the behavior, but allows changing it (see above). Visual 2010 switched to it, but without a workaround like Intel's.
So to sum it up, switching to Intel C++ and using kmp_set_blocktime(0) solved it.
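As a rough illustration of that fix, the sketch below calls kmp_set_blocktime(0) once before any parallel region runs. The helper names disable_openmp_spin_wait and sum_frame are invented for this example, and the preprocessor guard is an assumption about how to keep the file compiling with other toolchains; kmp_set_blocktime itself is the Intel extension declared in Intel's omp.h.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: asks the Intel OpenMP runtime to put idle worker
// threads to sleep immediately instead of spinning after a parallel region.
// Returns false when built without OpenMP or with a non-Intel compiler, in
// which case nothing is changed.
bool disable_openmp_spin_wait()
{
#if defined(_OPENMP) && (defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER))
    kmp_set_blocktime(0); // Intel extension: blocktime in milliseconds
    return true;
#else
    return false;
#endif
}

// Hypothetical processing step, equivalent to the reduction in the question.
long long sum_frame(const std::vector<int>& frame)
{
    // The loop index is a signed int, so the element count must fit in one.
    assert(frame.size() <= static_cast<std::size_t>(INT_MAX));
    const int size = static_cast<int>(frame.size());
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int j = 0; j < size; ++j)
        sum += frame[j];
    return sum;
}
```

The idea is to call disable_openmp_spin_wait() once at startup, before the acquisition loop. With the Intel runtime, the KMP_BLOCKTIME environment variable (for example KMP_BLOCKTIME=0) should have the same effect without a code change.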
Thanks to John Lilley from DataLever Corporation on the other MSDN thread
The issue has been submitted to MS Connect, and received the "won't fix" feedback.