NEON optimization using intrinsics

itisravi · Apr 19, 2011 · Viewed 8.7k times

Learning about ARM NEON intrinsics, I was timing a function that I wrote to double the elements in an array. The version that uses the intrinsics takes more time than the plain C version of the function.

Without NEON :

    void double_elements(unsigned int *ptr, unsigned int size)
    {
        unsigned int loop;

        for (loop = 0; loop < size; loop++)
            ptr[loop] <<= 1;
        return;
    }

With NEON:

    void double_elements(unsigned int *ptr, unsigned int size)
    {
        unsigned int i;
        uint32x4_t Q0, vector128Output;

        for (i = 0; i < (SIZE/4); i++)
        {
            Q0 = vld1q_u32(ptr);
            Q0 = vaddq_u32(Q0, Q0);
            vst1q_u32(ptr, Q0);
            ptr += 4;
        }
        return;
    }

I'm wondering whether the load/store operations between the array and the vector registers are consuming more time, offsetting the benefit of the parallel addition.

UPDATE: More info in response to Igor's reply.

1. The code is posted here:
   plain.c
   plain.s
   neon.c
   neon.s
   From the section (label) L7 in both assembly listings, I see that the NEON version has more assembly instructions (hence more time taken?).
2. I compiled with -mfpu=neon on arm-gcc, no other flags or optimizations. For the plain version, no compiler flags at all.
3. That was a typo; SIZE was meant to be size, both are the same.
4, 5. Tried on an array of 4000 elements. I timed with gettimeofday() before and after the function call (see the timing sketch after this list): NEON = 230us, ordinary = 155us.
6. Yes, I printed the elements in each case.
7. Did this, no improvement whatsoever.
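
For reference, a harness along these lines reproduces how the timing above was taken (a minimal sketch; the array contents, its static allocation, and the printf reporting are assumptions, not the original test code):

    #include <stdio.h>
    #include <sys/time.h>

    void double_elements(unsigned int *ptr, unsigned int size);

    int main(void)
    {
        static unsigned int data[4000];
        unsigned int i;
        struct timeval t0, t1;
        double us;

        for (i = 0; i < 4000; i++)
            data[i] = i;

        gettimeofday(&t0, NULL);
        double_elements(data, 4000);
        gettimeofday(&t1, NULL);

        /* elapsed wall-clock time in microseconds */
        us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("double_elements: %.0f us\n", us);
        return 0;
    }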

Answer

Guy Sirton · Jun 13, 2011

Something like this might run a bit faster.

    void double_elements(unsigned int *ptr, unsigned int size)
    {
        unsigned int i;
        uint32x4_t Q0, Q1, Q2, Q3;

        for (i = 0; i < (size/16); i++)
        {
            Q0 = vld1q_u32(ptr);
            Q1 = vld1q_u32(ptr+4);
            Q0 = vaddq_u32(Q0, Q0);
            Q2 = vld1q_u32(ptr+8);
            Q1 = vaddq_u32(Q1, Q1);
            Q3 = vld1q_u32(ptr+12);
            Q2 = vaddq_u32(Q2, Q2);
            vst1q_u32(ptr, Q0);
            Q3 = vaddq_u32(Q3, Q3);
            vst1q_u32(ptr+4, Q1);
            vst1q_u32(ptr+8, Q2);
            vst1q_u32(ptr+12, Q3);
            ptr += 16;
        }
        return;
    }

There are a few problems with the original code (some of these the optimizer may fix, but others it may not; you need to verify against the generated code):

  • The result of the add is only available in the N3 stage of the NEON pipeline, so the following store will stall.
  • Assuming the compiler is not unrolling the loop, there may be some overhead associated with the loop/branch.
  • It doesn't take advantage of the ability to dual-issue a load/store with another NEON instruction.
  • If the source data isn't in cache then the loads will stall. You can preload the data to speed this up with the __builtin_prefetch intrinsic (see the sketch after this list).
  • Also, as others have pointed out, the operation is fairly trivial; you'll see more gains for more complex operations.
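
As a rough illustration of the prefetch point, the unrolled loop above could be combined with __builtin_prefetch along these lines (a sketch only; the function name double_elements_prefetch and the prefetch distance of 64 elements ahead are assumptions that would need tuning on the target core):

    #include <arm_neon.h>

    void double_elements_prefetch(unsigned int *ptr, unsigned int size)
    {
        unsigned int i;
        uint32x4_t Q0, Q1, Q2, Q3;

        for (i = 0; i < (size/16); i++)
        {
            /* ask the core to start pulling a later block into the cache;
               64 elements (256 bytes) ahead is only a starting guess */
            __builtin_prefetch(ptr + 64);

            Q0 = vld1q_u32(ptr);
            Q1 = vld1q_u32(ptr+4);
            Q0 = vaddq_u32(Q0, Q0);
            Q2 = vld1q_u32(ptr+8);
            Q1 = vaddq_u32(Q1, Q1);
            Q3 = vld1q_u32(ptr+12);
            Q2 = vaddq_u32(Q2, Q2);
            vst1q_u32(ptr, Q0);
            Q3 = vaddq_u32(Q3, Q3);
            vst1q_u32(ptr+4, Q1);
            vst1q_u32(ptr+8, Q2);
            vst1q_u32(ptr+12, Q3);
            ptr += 16;
        }
    }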

If you were to write this with inline assembly you could also:

  • Use aligned loads/stores (which I don't think the intrinsics can generate) and ensure your pointer is always 128-bit aligned, e.g. vld1.32 {q0}, [r1 :128]
  • You could also use the post-increment version (which I'm also not sure the intrinsics will generate), e.g. vld1.32 {q0}, [r1 :128]! (see the inline-assembly sketch after this list)
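
A minimal inline-assembly sketch of those two points, assuming GCC extended asm, a pointer the caller keeps 16-byte aligned, a size that is a multiple of 4, and an assembler that accepts the :128 alignment notation:

    void double_elements_asm(unsigned int *ptr, unsigned int size)
    {
        unsigned int i;

        for (i = 0; i < (size/4); i++)
        {
            asm volatile (
                "vld1.32 {q0}, [%0:128]  \n\t"   /* aligned 128-bit load          */
                "vadd.i32 q0, q0, q0     \n\t"   /* double the four elements      */
                "vst1.32 {q0}, [%0:128]! \n\t"   /* aligned store, bump pointer   */
                : "+r" (ptr)
                :
                : "q0", "memory");
        }
    }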

95us for 4000 elements sounds pretty slow; on a 1GHz processor that's ~95 cycles per 128-bit chunk. You should be able to do better, assuming you're working from the cache. This figure is about what you'd expect if you're bound by the speed of external memory.
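
To spell that estimate out: 4000 elements is 1000 chunks of 128 bits, and 95us on a 1GHz core is roughly 95,000 cycles, so 95,000 / 1000 ≈ 95 cycles per chunk, far more than a cached load/double/store of a single chunk should cost.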