I have two arrays and I want to copy one array into the other with some stride. For example, I have
A A A A A A A A ...
B B B B B B B B ...
and I want to copy the elements of B into every third element of A, to obtain
B A A B A A B A ...
From the post "Is there a standard, strided version of memcpy?", it seems that there is no such possibility in C.
However, I have observed that, in some cases, memcpy is faster than a for-loop-based copy.
My question is: is there any way to perform a strided memory copy efficiently in C++, performing at least as well as a standard for loop?
Thank you very much.
EDIT - CLARIFICATION OF THE PROBLEM
To make the problem clearer, let us denote the two arrays at hand by a and b. I have a function that performs only the following for loop
for (int i=0; i<NumElements; i++)
    a_[i] = b_[i];
where both []'s are overloaded operators (I'm using an expression templates technique), so that they can actually mean, for example,
a[3*i] = b[i];
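For illustration, here is a minimal sketch of what such an overloaded operator[] could look like. The names StridedView and copyElements are hypothetical and are not the actual expression-template code; the sketch only shows how a plain-looking a[i] = b[i] can end up writing to every third element:

#include <cstddef>

// Hypothetical strided view: operator[] maps index i to data[i * stride],
// so writing through a view with stride 3 touches every third element.
template <typename T>
struct StridedView {
    T*          data;
    std::size_t stride;
    T&       operator[](std::size_t i)       { return data[i * stride]; }
    const T& operator[](std::size_t i) const { return data[i * stride]; }
};

// The single for loop from above, written against such views:
template <typename Dst, typename Src>
void copyElements(Dst& a, const Src& b, std::size_t numElements)
{
    for (std::size_t i = 0; i < numElements; i++)
        a[i] = b[i];   // with stride 3 on a, this really means a.data[3*i] = b.data[i]
}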
This might be too specific an answer, but on an ARM platform that supports NEON, NEON vectorization can be used to make a strided copy even faster. This can be a lifesaver in an environment where resources are relatively limited, which is probably why ARM is used in that setting in the first place. A prominent example is Android, where most devices still use the ARMv7-A architecture, which supports NEON.
The following examples demonstrate this with a loop that copies the semi-planar UV plane of a YUV420sp image into the planar UV planes of a YUV420p image. The source and destination buffers are both 640*480/2 bytes. All of the examples are compiled with g++ 4.8 from Android NDK r9d and executed on a Samsung Exynos Octa 5420 processor:
Level 1: Regular
void convertUVsp2UVp(
unsigned char* __restrict srcptr,
unsigned char* __restrict dstptr,
int stride)
{
for(int i=0;i<stride;i++){
dstptr[i] = srcptr[i*2];
dstptr[i + stride] = srcptr[i*2 + 1];
}
}
Compiled with -O3
only, takes about 1.5 ms on average.
Level 2: Unrolled and squeezed a bit more with moving pointers
void convertUVsp2UVp(
unsigned char* __restrict srcptr,
unsigned char* __restrict dstptr,
int stride)
{
unsigned char* endptr = dstptr + stride;
while(dstptr<endptr){
*(dstptr + 0) = *(srcptr + 0);
*(dstptr + stride + 0) = *(srcptr + 1);
*(dstptr + 1) = *(srcptr + 2);
*(dstptr + stride + 1) = *(srcptr + 3);
*(dstptr + 2) = *(srcptr + 4);
*(dstptr + stride + 2) = *(srcptr + 5);
*(dstptr + 3) = *(srcptr + 6);
*(dstptr + stride + 3) = *(srcptr + 7);
*(dstptr + 4) = *(srcptr + 8);
*(dstptr + stride + 4) = *(srcptr + 9);
*(dstptr + 5) = *(srcptr + 10);
*(dstptr + stride + 5) = *(srcptr + 11);
*(dstptr + 6) = *(srcptr + 12);
*(dstptr + stride + 6) = *(srcptr + 13);
*(dstptr + 7) = *(srcptr + 14);
*(dstptr + stride + 7) = *(srcptr + 15);
srcptr+=16;
dstptr+=8;
}
}
Compiled with -O3
only, takes about 1.15 ms on average. This is probably as fast as it gets on a regular architecture, as per the other answer.
Level 3: Regular + GCC automatic NEON vectorization
void convertUVsp2UVp(
unsigned char* __restrict srcptr,
unsigned char* __restrict dstptr,
int stride)
{
for(int i=0;i<stride;i++){
dstptr[i] = srcptr[i*2];
dstptr[i + stride] = srcptr[i*2 + 1];
}
}
Compiled with -O3 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=1 -mfloat-abi=softfp
, takes about 0.6 ms on average. For reference, a memcpy
of 640*480
bytes, or double the amount of what's tested here, takes about 0.6 ms on average.
As a side note, the second version (unrolled with moving pointers) compiled with the NEON flags above takes about the same amount of time, 0.6 ms.
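For completeness, the same de-interleaving can also be written with explicit NEON intrinsics instead of relying on the auto-vectorizer. The following is an untested sketch of that idea (convertUVsp2UVp_neon is just an illustrative name, and it is not one of the variants timed above): vld2q_u8 loads 32 interleaved bytes and de-interleaves them into two 16-byte registers, which matches the stride-2 access pattern here.

#include <arm_neon.h>

void convertUVsp2UVp_neon(
    unsigned char* __restrict srcptr,
    unsigned char* __restrict dstptr,
    int stride)
{
    int i = 0;
    // Process 16 U/V pairs (32 interleaved source bytes) per iteration.
    for (; i + 16 <= stride; i += 16) {
        // vld2q_u8 de-interleaves: val[0] receives the even bytes,
        // val[1] the odd bytes.
        uint8x16x2_t uv = vld2q_u8(srcptr + 2 * i);
        vst1q_u8(dstptr + i,          uv.val[0]);
        vst1q_u8(dstptr + stride + i, uv.val[1]);
    }
    // Scalar tail for strides that are not a multiple of 16.
    for (; i < stride; i++) {
        dstptr[i]          = srcptr[2 * i];
        dstptr[i + stride] = srcptr[2 * i + 1];
    }
}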