Im rather new to assembly and although the arm information center is often helpful sometimes the instructions can be a little confusing to a newbie. Basically what I need to do is sum 4 float values in a quadword register and store the result in a single precision register. I think the instruction VPADD can do what I need but I'm not quite sure.
You might try this (it's not in ASM, but you should be able to convert it easily):
float32x2_t r = vadd_f32(vget_high_f32(m_type), vget_low_f32(m_type));
return vget_lane_f32(vpadd_f32(r, r), 0);
In ASM it would be probably only VADD and VPADD.
I'm not sure if this is only one method to do this (and most optimal), but I haven't figured/found better one...
PS. I'm new to NEON too