Learning about ARM NEON intrinsics, I was timing a function that I wrote to double the elements in an array.The version that used the intrinsics takes more time than a plain C version of the function.
Without NEON :
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int loop;
for( loop= 0; loop<size; loop++)
ptr[loop]<<=1;
return;
}
With NEON:
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int i;
uint32x4_t Q0,vector128Output;
for( i=0;i<(SIZE/4);i++)
{
Q0=vld1q_u32(ptr);
Q0=vaddq_u32(Q0,Q0);
vst1q_u32(ptr,Q0);
ptr+=4;
}
return;
}
Wondering if the load/store operations between the array and vector is consuming more time which offsets the benefit of the parallel addition.
UPDATE:More Info in response to Igor's reply.
1.The code is posted here:
plain.c
plain.s
neon.c
neon.s
From the section(label) L7 in both the assembly listings,I see that the neon version has more number of assembly instructions.(hence more time taken?)
2.I compiled using -mfpu=neon on arm-gcc, no other flags or optimizations.For the plain version, no compiler flags at all.
3.That was a typo, SIZE was meant to be size;both are same.
4,5.Tried on an array of 4000 elements. I timed using gettimeofday() before and after the function call.NEON=230us,ordinary=155us.
6.Yes I printed the elements in each case.
7.Did this, no improvement whatsoever.
Something like this might run a bit faster.
void double_elements(unsigned int *ptr, unsigned int size)
{
unsigned int i;
uint32x4_t Q0,Q1,Q2,Q3;
for( i=0;i<(SIZE/16);i++)
{
Q0=vld1q_u32(ptr);
Q1=vld1q_u32(ptr+4);
Q0=vaddq_u32(Q0,Q0);
Q2=vld1q_u32(ptr+8);
Q1=vaddq_u32(Q1,Q1);
Q3=vld1q_u32(ptr+12);
Q2=vaddq_u32(Q2,Q2);
vst1q_u32(ptr,Q0);
Q3=vaddq_u32(Q3,Q3);
vst1q_u32(ptr+4,Q1);
vst1q_u32(ptr+8,Q2);
vst1q_u32(ptr+12,Q3);
ptr+=16;
}
return;
}
There are a few problems with the original code (some of those the optimizer may fix but other it may not, you need to verify in the generated code):
If you were to write this with inline assembly you could also:
95us for 4000 elements sounds pretty slow, on a 1GHz processor that's ~95 cycles per 128bit chunk. You should be able to do better assuming you're working from the cache. This figure is about what you'd expect if you're bound by the speed of the external memory.