I am currently trying to vectorize a program and I have observed some odd behaviour.
It seems that a for loop is vectorized when using
#pragma simd
(262): (col. 3) remark: SIMD LOOP WAS VECTORIZED.
but it is not when I use
#pragma vector always
#pragma ivdep
(262): (col. 3) remark: loop was not vectorized: existence of vector dependence.
I always thought that both sets of directives request the same vectorization.
#pragma simd is an explicit vectorization tool that lets the developer enforce vectorization, as mentioned at https://software.intel.com/en-us/node/514582 while #pragma vector is a hint that tells the compiler the loop should be vectorized according to its argument(s). Here the argument is always, which means "ignore the compiler's cost/efficiency heuristics and go ahead with vectorization". It does not override the compiler's dependence analysis, so a loop with a proven vector dependence is still rejected. More information on #pragma vector is available at https://software.intel.com/en-us/node/514586. The fact that #pragma simd succeeds in vectorizing a loop where #pragma vector always failed does not mean that #pragma simd produces wrong results. When #pragma simd is used with the right set of clauses, it can vectorize the loop and still produce a correct result.
Below is a small code snippet which demonstrates that:
void foo(float *a, float *b, float *c, int N){
#pragma vector always
#pragma ivdep
//#pragma simd vectorlength(2)
for(int i = 2; i < N; i++)
a[i] = a[i-2] + b[i] + c[i];
return;
}
Compiling this code using ICC will produce the following vectorization report:
$ icc -c -vec-report2 test11.cc
test11.cc(5): (col. 1) remark: loop was not vectorized: existence of vector dependence
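By default ICC targets SSE2, which uses 128-bit XMM registers. One XMM register holds 4 floats, but a[i] depends on a[i-2], the value written two iterations earlier, so processing 4 iterations at once would read elements that have not been updated yet: there is a vector dependence. #pragma ivdep does not help here, because it only tells the compiler to ignore assumed dependences, not proven ones like this one. So what #pragma vector always reports is right. If we process just 2 floats at a time instead of 4, however, we can vectorize this loop without corrupting the results. The modified code and its vectorization report are shown below: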
void foo(float *a, float *b, float *c, int N){
//#pragma vector always
//#pragma ivdep
#pragma simd vectorlength(2)
for(int i = 2; i < N; i++)
a[i] = a[i-2] + b[i] + c[i];
return;
}
$ icc -c -vec-report2 test11.cc
test11.cc(5): (col. 1) remark: SIMD LOOP WAS VECTORIZED
But #pragma vector doesn't have a clause that explicitly specifies the vector length to use when vectorizing the loop. This is where #pragma simd can really come in handy.
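The claim that a vector length of 2 is safe for this loop while 4 is not can be checked without any vector hardware. The sketch below is only an emulation, not what ICC generates: it mimics a vector iteration by performing all loads of a chunk before any store, then compares the result with the plain scalar loop. The names foo_scalar, foo_emulated and matches_scalar are hypothetical helpers introduced for this illustration.

```cpp
#include <vector>

// Scalar reference: the loop from the example above, executed in order.
void foo_scalar(float *a, const float *b, const float *c, int N) {
    for (int i = 2; i < N; i++)
        a[i] = a[i - 2] + b[i] + c[i];
}

// Emulates vector execution with VL lanes: every lane of a chunk is
// computed from the old contents of a[] before any lane is stored.
void foo_emulated(float *a, const float *b, const float *c, int N, int VL) {
    std::vector<float> lane(VL);
    int i = 2;
    for (; i + VL <= N; i += VL) {
        for (int k = 0; k < VL; k++)      // "vector load + add"
            lane[k] = a[i + k - 2] + b[i + k] + c[i + k];
        for (int k = 0; k < VL; k++)      // "vector store"
            a[i + k] = lane[k];
    }
    for (; i < N; i++)                    // scalar remainder loop
        a[i] = a[i - 2] + b[i] + c[i];
}

// Returns true if emulated execution with VL lanes matches the scalar loop.
bool matches_scalar(int VL) {
    const int N = 16;
    std::vector<float> b(N, 1.0f), c(N, 1.0f);
    std::vector<float> ref(N, 0.0f), test(N, 0.0f);
    foo_scalar(ref.data(), b.data(), c.data(), N);
    foo_emulated(test.data(), b.data(), c.data(), N, VL);
    return ref == test;
}
```

With 2 lanes, each chunk only reads elements written by the previous chunk, so the result matches the scalar loop. With 4 lanes, lanes 2 and 3 read a[i] and a[i+1] before lanes 0 and 1 of the same chunk have stored them, which is exactly the dependence the compiler reports.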
When #pragma simd is used with the right clauses, the ones that best describe the computation in vector form, the compiler generates the requested vector code and that code does not produce wrong results. The Intel(R) Cilk(TM) Plus White Paper published at https://software.intel.com/sites/default/files/article/402486/intel-cilk-plus-white-paper.pdf has sections titled "Usage of #pragma simd vectorlength clause" and "Usage of #pragma simd reduction and private clause", which explain how to use #pragma simd with the right clauses. The clauses help the developer tell the compiler what they want to achieve, and the compiler generates the vector code accordingly. It is highly recommended to use #pragma simd with the relevant clauses wherever needed to best express the loop logic to the compiler.
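As a rough illustration of the reduction and private clauses mentioned above, here is a minimal sketch. It is not taken from the white paper; the function name sum_sq_diff is hypothetical, and the clause spellings assume the Intel #pragma simd syntax. A compiler that does not recognise the pragma simply ignores it and runs the loop as scalar code, so the result is the same either way.

```cpp
// Sum of squared differences between x and y.
// reduction(+:sum): each vector lane accumulates its own partial sum,
//                   and the partial sums are combined after the loop.
// private(t):       each lane gets its own copy of the temporary t.
float sum_sq_diff(const float *x, const float *y, int n) {
    float sum = 0.0f;
    float t;
#pragma simd private(t) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        t = x[i] - y[i];
        sum += t * t;
    }
    return sum;
}
```

Without the reduction clause the update of sum would be a loop-carried dependence, and without private the shared temporary t would be another one. The clauses are how the developer tells the compiler that both are safe to handle per lane.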
Also, inner loops are traditionally the target for vectorization, but #pragma simd can be used to vectorize outer loops too. More information on this is available at https://software.intel.com/en-us/articles/outer-loop-vectorization.
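A minimal sketch of what placing the pragma on an outer loop looks like is given below. The function row_sums and its row-major layout are assumptions made for this illustration, not code from the linked article. Whether the outer loop is actually vectorized depends on the compiler and target, so the vectorization report should be checked.

```cpp
// Computes the sum of each row of a rows x cols matrix stored row-major.
// The pragma is placed on the outer loop, so each vector lane would
// handle a different row while the inner loop runs inside every lane.
void row_sums(const float *m, float *out, int rows, int cols) {
#pragma simd
    for (int r = 0; r < rows; r++) {
        float s = 0.0f;                  // declared per iteration, so per lane
        for (int c = 0; c < cols; c++)
            s += m[r * cols + c];
        out[r] = s;
    }
}
```

The outer iterations are independent of each other (each writes only out[r]), which is what makes the outer loop a legal candidate here.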