I am copying N bytes from pSrc to pDest. This can be done in a single loop:
for (int i = 0; i < N; i++)
    *pDest++ = *pSrc++;
Why is this slower than memcpy or memmove? What tricks do they use to speed it up?
Because memcpy typically copies a whole machine word at a time through word pointers rather than one byte at a time through byte pointers. In addition, memcpy implementations are often written with SIMD instructions, which makes it possible to move 128 bits at a time.
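To illustrate the word-pointer idea, here is a minimal sketch, not how any particular library implements memcpy. The names copy_words and word_t are made up for this example, and the may_alias attribute is a GCC/Clang extension assumed here so that the 8-byte accesses are well defined. The word path is only taken when both pointers can be aligned together; otherwise the function falls back to the plain byte loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang. may_alias lets word_t access memory of any type. */
typedef uint64_t __attribute__((may_alias)) word_t;

/* Hypothetical helper: copies n bytes, 8 at a time where possible. */
void *copy_words(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* The word path needs both pointers to reach word alignment together. */
    if ((uintptr_t)d % sizeof(word_t) == (uintptr_t)s % sizeof(word_t)) {
        /* Head: copy single bytes until the pointers are word-aligned. */
        while (n > 0 && (uintptr_t)d % sizeof(word_t) != 0) {
            *d++ = *s++;
            n--;
        }

        /* Body: one 8-byte load and store per iteration. */
        word_t *dw = (word_t *)d;
        const word_t *sw = (const word_t *)s;
        while (n >= sizeof(word_t)) {
            *dw++ = *sw++;
            n -= sizeof(word_t);
        }
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Tail: remaining bytes, or the whole buffer if the word path was skipped. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dest;
}
```

Compared with the byte loop, the body loop runs roughly one eighth as many iterations for the same N.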
SIMD instructions are assembly instructions that perform the same operation on every element of a vector, here up to 16 bytes long (newer extensions such as AVX use wider vectors). That includes load and store instructions.
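As a rough sketch of the SIMD approach, assuming an x86 or x86-64 target with SSE2, the loop below moves 16 bytes per iteration using the unaligned load and store intrinsics from emmintrin.h. The function name copy_sse2 is made up for this example; real memcpy implementations are considerably more elaborate (alignment handling, size-specific paths, wider registers).

```c
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics; assumes an x86/x86-64 target */

/* Hypothetical helper: copies n bytes, 16 at a time where possible. */
void *copy_sse2(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Body: one 128-bit unaligned load and one 128-bit unaligned store. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        n -= 16;
    }

    /* Tail: fewer than 16 bytes remain, so copy them one by one. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dest;
}
```

Like memcpy, this sketch assumes the source and destination do not overlap.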