Copying strided data in C++

Vitality picture Vitality · Jun 13, 2013 · Viewed 10.7k times · Source

I have two arrays and I want to copy one array into the other with some stride. For example, I have

A A A A A A A A ...

B B B B B B B B ...

and I want to copy every three elements of B to A to obtain

B A A B A A B A ...

From the post "Is there a standard, strided version of memcpy?", it seems that there is no such a possibility in C.

However, I have experienced that, in some cases, memcpy is faster than a for loop based copy.

My question is; Is there any way to efficiently perform strided memory copy in C++ performing at least as a standard for loop?

Thank you very much.

EDIT - CLARIFICATION OF THE PROBLEM

To make the problem clearer, let us denote the two arrays at hand by a and b. I have a function that performs the unique following for loop

for (int i=0; i<NumElements, i++)
    a_[i] = b_[i];

where both the []'s are overloaded operators (I'm using an expression templates technique) so that they can be actually mean, for example

 a[3*i]=b[i];

Answer

Ayberk &#214;zg&#252;r picture Ayberk Özgür · Oct 7, 2014

Might be a too specific answer, but on an ARM platform that supports NEON, NEON vectorization can be used to make strided copy even faster. This could be life-saving in an environment where resources are relatively more limited, which is probably why ARM is used in that setting in the first place. A prominent example is Android where most devices still use the ARM v7a architecture which supports NEON.

The following examples demonstrate this, it is a loop to copy the semi-planar UV plane of a YUV420sp image into the planar UV plane of a YUV420p image. The sizes of the source and destination buffers are both 640*480/2 bytes. All of the examples are compiled with the g++ 4.8 inside Android NDK r9d. They are executed on a Samsung Exynos Octa 5420 processor:

Level 1: Regular

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    for(int i=0;i<stride;i++){
        dstptr[i]           = srcptr[i*2];
        dstptr[i + stride]  = srcptr[i*2 + 1];
    }
}

Compiled with -O3 only, takes about 1.5 ms on average.

Level 2: Unrolled and squeezed a bit more with moving pointers

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    unsigned char* endptr = dstptr + stride;
    while(dstptr<endptr){
        *(dstptr + 0)             = *(srcptr + 0);
        *(dstptr + stride + 0)    = *(srcptr + 1);
        *(dstptr + 1)             = *(srcptr + 2);
        *(dstptr + stride + 1)    = *(srcptr + 3);
        *(dstptr + 2)             = *(srcptr + 4);
        *(dstptr + stride + 2)    = *(srcptr + 5);
        *(dstptr + 3)             = *(srcptr + 6);
        *(dstptr + stride + 3)    = *(srcptr + 7);
        *(dstptr + 4)             = *(srcptr + 8);
        *(dstptr + stride + 4)    = *(srcptr + 9);
        *(dstptr + 5)             = *(srcptr + 10);
        *(dstptr + stride + 5)    = *(srcptr + 11);
        *(dstptr + 6)             = *(srcptr + 12);
        *(dstptr + stride + 6)    = *(srcptr + 13);
        *(dstptr + 7)             = *(srcptr + 14);
        *(dstptr + stride + 7)    = *(srcptr + 15);
        srcptr+=16;
        dstptr+=8;
    } 
}

Compiled with -O3 only, takes about 1.15 ms on average. This is probably as fast as it gets on a regular architecture, as per the other answer.

Level 3: Regular + GCC automatic NEON vectorization

void convertUVsp2UVp(
    unsigned char* __restrict srcptr, 
    unsigned char* __restrict dstptr, 
    int stride)
{
    for(int i=0;i<stride;i++){
        dstptr[i]           = srcptr[i*2];
        dstptr[i + stride]  = srcptr[i*2 + 1];
    }
}

Compiled with -O3 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=1 -mfloat-abi=softfp, takes about 0.6 ms on average. For reference, a memcpy of 640*480 bytes, or double the amount of what's tested here, takes about 0.6 ms on average.

As a side note, the second code (unrolled and pointered) compiled with the NEON parameters above takes about the same amount of time, 0.6 ms.