Faster way to zero memory than with memset?

maep picture maep · Sep 7, 2010 · Viewed 119k times · Source

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?

I assume that memset uses mov, however when zeroing memory most compilers use xor as it's faster, correct? edit1: Wrong, as GregS pointed out that only works with registers. What was I thinking?

Also I asked a person who knew of assembler more than me to look at the stdlib, and he told me that on x86 memset is not taking full advantage of the 32 bit wide registers. However at that time I was very tired, so I'm not quite sure I understood it correctly.

edit2: I revisited this issue and did a little testing. Here is what I tested:

    #include <stdio.h>
    #include <malloc.h>
    #include <string.h>
    #include <sys/time.h>

    #define TIME(body) do {                                                     \
        struct timeval t1, t2; double elapsed;                                  \
        gettimeofday(&t1, NULL);                                                \
        body                                                                    \
        gettimeofday(&t2, NULL);                                                \
        elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
        printf("%s\n --- %f ---\n", #body, elapsed); } while(0)                 \


    #define SIZE 0x1000000

    void zero_1(void* buff, size_t size)
    {
        size_t i;
        char* foo = buff;
        for (i = 0; i < size; i++)
            foo[i] = 0;

    }

    /* I foolishly assume size_t has register width */
    void zero_sizet(void* buff, size_t size)
    {
        size_t i;
        char* bar;
        size_t* foo = buff;
        for (i = 0; i < size / sizeof(size_t); i++)
            foo[i] = 0;

        // fixes bug pointed out by tristopia
        bar = (char*)buff + size - size % sizeof(size_t);
        for (i = 0; i < size % sizeof(size_t); i++)
            bar[i] = 0;
    }

    int main()
    {
        char* buffer = malloc(SIZE);
        TIME(
            memset(buffer, 0, SIZE);
        );
        TIME(
            zero_1(buffer, SIZE);
        );
        TIME(
            zero_sizet(buffer, SIZE);
        );
        return 0;
    }

results:

zero_1 is the slowest, except for -O3. zero_sizet is the fastest with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet. (twice as slow for -O3). one thing of interest is that at -O3 zero_1 was equally fast as zero_sizet. however the disassembled function had roughly four times as many instructions (I think caused by loop unrolling). Also, I tried optimizing zero_sizet further, but the compiler always outdid me, but no surprise here.

For now memset wins, previous results were distorted by CPU cache. (all tests were run on Linux) Further testing needed. I'll try assembler next :)

edit3: fixed bug in test code, test results are not affected

edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has a SSE optimized routine for zero. It will be hard to beat this.

Answer

Tim picture Tim · Sep 7, 2010

x86 is rather broad range of devices.

For totally generic x86 target, an assembly block with "rep movsd" could blast out zeros to memory 32-bits at time. Try to make sure the bulk of this work is DWORD aligned.

For chips with mmx, an assembly loop with movq could hit 64bits at a time.

You might be able to get a C/C++ compiler to use a 64-bit write with a pointer to a long long or _m64. Target must be 8 byte aligned for the best performance.

for chips with sse, movaps is fast, but only if the address is 16 byte aligned, so use a movsb until aligned, and then complete your clear with a loop of movaps

Win32 has "ZeroMemory()", but I forget if thats a macro to memset, or an actual 'good' implementation.