I know many examples when GPU is much faster than CPU. But exists algorithms (problems) which are very hard to parallelise. Could you give me some examples or tests when CPU can overcome GPU ?
Edit:
Thanks for suggestions! We can make a comparison between the most popular and the newest cpu's and gpu's, for example Core i5 2500k vs GeForce GTX 560 Ti.
I wonder how to compare SIMD model between them. For example: Cuda calls a SIMD model more precisely SIMT. But SIMT should be compared to the multhitreading on CPU's which is distributing threads (tasks) between MIMD cores (Core i5 2500k give as 4 MIMD cores). On the other hand each of these MIMD cores can implement SIMD model, but this is something else than SIMT and I don't know how to compare them. Finally a fermi architecture with concurrent kernel execution might be consider as MIMD cores with SIMT.
Based on my experience, I will summarize the key differences in terms of performance between parallel programs in CPUs and GPUs. Trust me, a comparison can be changed from generation to generation. So I will just point out what is good and is bad for CPUs and GPUs. Of course, if you make a program at the extreme, i.e., having only bad or good sides, it will run definitely faster on one platform. But a mixture of those requires very complicated reasoning.
Host program level
One key difference is memory transfer cost. GPU devices requires some memory transfers. This cost is non-trivial in some cases, for example when you have to frequently transfer some big arrays. In my experience, this cost can be minimized but pushing most of host code to device code. The only cases you can do so are when you have to interact with the host operating system in program, such as outputting to monitor.
Device program level
Now we come to see a complex picture that hasn't been fully revealed yet. What I mean is there are many mysterious scenes in GPUs that haven't been disclosed. But still, we have a lot of distinguish CPU and GPU (kernel code) in terms of performance.
There are few factors that I noticed those dramatically contribute to the difference.
GPUs, which consist of many execution units, are designed to handle massively parallel programs. If you have little of work, say a few sequential tasks, and put these tasks on a GPU, only a few of those many execution units are busy, thus will be slower than CPU. Because CPUs are, in other hand, better to handle short and sequential tasks. The reason is simple, CPUs are much more complicated and able to exploit instruction level parallelism, whereas GPUs exploit thread level parallelism. Well, I heard NVIDIA GF104 can do Superscalar, but I had no chance to experience with it though.
It is worth noting that, in GPUs, workload are divided into small blocks (or workgroups in OpenCL), and blocks are arranged in chunks, each of which is executed in one Streaming processor (I am using terminologies from NVIDIA). But in CPUs, those blocks are executed sequentially - I can't think of anything else than a single loop.
Thus, for programs that have small number of blocks, it will be likely to run faster on CPUs.
Branches are bad things to GPUs, always. Please bear in mind that GPUs prefer equal things. Equal blocks, equal threads within a blocks, and equal threads within a warp. But what matters the most?
***Branch divergences.***
Cuda/OpenCL programmers hate branch divergences. Since all the threads somehow are divided into sets of 32 threads, called a warp, and all threads within a warp execute in lockstep, a branch divergence will cause some threads in the warp to be serialized. Thus, the execution time of the warp will be accordingly multiplied.
Unlike GPUs, each cores in CPUs can follow their own path. Furthermore, branches can be efficiently executed because CPUs have branch prediction.
Thus, programs that have more warp divergences are likely to run faster on CPUs.
This REALLY is complicated enough so let's make it brief.
Remember that global memory accesses have very high latency (400-800 cycles). So in old generations of GPUs, whether memory accesses are coalesced was a critical matter. Now your GTX560 (Fermi) has more 2 level of caches. So global memory access cost can be reduced in many cases. However, caches in CPUs and GPUs are different, so their effects are also different.
What I can say is that it really really depends on your memory access pattern, your kernel code pattern (how memory accesses are interleaved with computation, the types of operations, etc., ) to tell if one runs faster on GPUs or CPUs.
But somehow you can expect a huge number of cache misses (in GPUs) has a very bad effect on GPUs (how bad? - it depends on your code).
Additionally, shared memory is an important feature of GPUs. Accessing to shared memory is as fast as accessing to GPU L1 cache. So kernels that make use of shared memory will have pretty much benefit.
Some other factors I haven't really mentioned but those can have big impact on the performance in many cases such as bank conflicts, size of memory transaction, GPU occupancy...