https://web.archive.org/web/20170227190422/http://hilbert-space.de/?p=22
On this site which is quite dated it shows that hand written asm would give a much greater improvement then the intrinsics. I am wondering if this is the current truth even now in 2012.
So has the compilation optimization improved for intrinsics using gnu cross compiler?
My experience is that the intrinsics haven't really been worth the trouble. It's too easy for the compiler to inject extra register unload/load steps between your intrinsics. The effort to get it to stop doing that is more complicated than just writing the stuff in raw NEON. I've seen this kind of stuff in pretty recent compilers (including clang 3.1).
At this level, I find you really need to control exactly what's happening. You can have all kinds of stalls if you do things in just barely the wrong order. Doing it in intrinsics feels like surgery with welder's gloves on. If the code is so performance critical that I need intrinsics at all, then intrinsics aren't good enough. Maybe others have difference experiences here.