Java can recognize SIMD advantages of CPU; or there is just optimization effect of loop unrolling

huseyin tugrul buyukisik picture huseyin tugrul buyukisik · Jul 4, 2013 · Viewed 7.2k times · Source

This part of code is from dotproduct method of a vector class of mine. The method does inner product computing for a target array of vectors(1000 vectors).

When vector length is an odd number(262145), compute time is 4.37 seconds. When vector length(N) is 262144(multiple of 8), compute time is 1.93 seconds.

     time1=System.nanotime();
     int count=0;
     for(int j=0;j<1000;i++)
     {

             b=vektors[i]; // selects next vector(b) to multiply as inner product.
                           // each vector has an array of float elements.

             if(((N/2)*2)!=N)
             {
                 for(int i=0;i<N;i++)
                 {
                     t1+=elements[i]*b.elements[i];
                 }
             }
             else if(((N/8)*8)==N)
             {
                 float []vek=new float[8];
                 for(int i=0;i<(N/8);i++)
                 {
                     vek[0]=elements[i]*b.elements[i];
                     vek[1]=elements[i+1]*b.elements[i+1];
                     vek[2]=elements[i+2]*b.elements[i+2];
                     vek[3]=elements[i+3]*b.elements[i+3];
                     vek[4]=elements[i+4]*b.elements[i+4];
                     vek[5]=elements[i+5]*b.elements[i+5];
                     vek[6]=elements[i+6]*b.elements[i+6];
                     vek[7]=elements[i+7]*b.elements[i+7];


                     t1+=vek[0]+vek[1]+vek[2]+vek[3]+vek[4]+vek[5]+vek[6]+vek[7];
                     //t1 is total sum of all dot products.
                 }
             }
     }
     time2=System.nanotime();
     time3=(time2-time1)/1000000000.0; //seconds

Question: Could the reduction of time from 4.37s to 1.93s (2x as fast) be JIT's wise decision of using SIMD instructions or just my loop-unrolling's positive effect?

If JIT cannot do SIMD optimizaton automatically, then in this example there is also no unrolling optimization done automatically by JIT, is this true?.

For 1M iterations(vectors) and for vector size of 64, speedup multiplier goes to 3.5X(cache advantage?).

Thanks.

Answer

tmyklebu picture tmyklebu · Jul 4, 2013

Your code has a bunch of problems. Are you sure you're measuring what you think you're measuring?

Your first loop does this, indented more conventionally:

 for(int j=0;j<1000;i++) {
     b=vektors[i]; // selects next vector(b) to multiply as inner product.
                   // each vector has an array of float elements.
 }

Your rolled loop involves a really long chain of dependent loads and stores. Your unrolled loop involves 8 separate chains of dependent loads and stores. The JVM can't turn one into the other if you're using floating-point arithmetic because they're fundamentally different computations. Breaking dependent load-store chains can lead to major speedups on modern processors.

Your rolled loop iterates over the whole vector. Your unrolled loop only iterates over the first (roughly) eighth. Thus, the unrolled loop again computes something fundamentally different.

I haven't seen a JVM generate vectorised code for something like your second loop, but I'm maybe a few years out of date on what JVMs do. Try using -XX:+PrintAssembly when you run your code and inspect the code opto generates.