I'm tried to improve performance of copy operation via SSE and AVX:
#include <immintrin.h>
const int sz = 1024;
float *mas = (float *)_mm_malloc(sz*sizeof(float), 16);
float *tar = (float *)_mm_malloc(sz*sizeof(float), 16);
float a=0;
std::generate(mas, mas+sz, [&](){return ++a;});
const int nn = 1000;//Number of iteration in tester loops
std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3;
//std::copy testing
start1 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
std::copy(mas, mas+sz, tar);
end1 = std::chrono::system_clock::now();
float elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();
//SSE-copy testing
start2 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
{
auto _mas = mas;
auto _tar = tar;
for(; _mas!=mas+sz; _mas+=4, _tar+=4)
{
__m128 buffer = _mm_load_ps(_mas);
_mm_store_ps(_tar, buffer);
}
}
end2 = std::chrono::system_clock::now();
float elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();
//AVX-copy testing
start3 = std::chrono::system_clock::now();
for(int i=0; i<nn; ++i)
{
auto _mas = mas;
auto _tar = tar;
for(; _mas!=mas+sz; _mas+=8, _tar+=8)
{
__m256 buffer = _mm256_load_ps(_mas);
_mm256_store_ps(_tar, buffer);
}
}
end3 = std::chrono::system_clock::now();
float elapsed3 = std::chrono::duration_cast<std::chrono::microseconds>(end3-start3).count();
std::cout<<"serial - "<<elapsed1<<", SSE - "<<elapsed2<<", AVX - "<<elapsed3<<"\nSSE gain: "<<elapsed1/elapsed2<<"\nAVX gain: "<<elapsed1/elapsed3;
_mm_free(mas);
_mm_free(tar);
It works. However, while the number of iterations in tester-loops - nn - increases, performance gain of simd-copy decreases:
nn=10: SSE-gain=3, AVX-gain=6;
nn=100: SSE-gain=0.75, AVX-gain=1.5;
nn=1000: SSE-gain=0.55, AVX-gain=1.1;
Can anybody explain what is the reason of mentioned performance decrease effect and is it advisable to manually vectorization of copy operation?
The problem is that your test does a poor job to migrate some factors in the hardware that make benchmarking hard. To test this, I've made my own test case. Something like this:
for blah blah:
sleep(500ms)
std::copy
sse
axv
output:
SSE: 1.11753x faster than std::copy
AVX: 1.81342x faster than std::copy
So in this case, AVX is a bunch faster than std::copy
. What happens when I change to test case to..
for blah blah:
sleep(500ms)
sse
axv
std::copy
Notice that absolutely nothing changed, except the order of the tests.
SSE: 0.797673x faster than std::copy
AVX: 0.809399x faster than std::copy
Woah! how is that possible? The CPU takes a while to ramp up to full speed, so tests that are run later have an advantage. This question has 3 answers now, including an 'accepted' answer. But only the one with the lowest amount of upvotes was on the right track.
This is one of the reasons why benchmarking is hard and you should never trust anyone's micro-benchmarks unless they've included detailed information of their setup. It isn't just the code that can go wrong. Power saving features and weird drivers can completely mess up your benchmark. One time i've measured an factor 7 difference in performance by toggling a switch in the bios that less than 1% of notebooks offer.