What's the most efficient way to compare two blocks of memory in the D language?

dsimcha picture dsimcha · Nov 6, 2009 · Viewed 10.2k times · Source

I need a comparison function for blocks of memory for doing binary searches on arrays of bytes in the D programming language. It does not need to have any useful semantics. It only needs to be fast and be a valid comparison function (one that produces a total ordering). The blocks of memory to be compared are already known to be the same length.

C's memcmp is actually pretty slow because it tries to preserve useful string comparison semantics, which I don't need. Below is the best I've come up with so far. Does anyone know of anything better, preferably w/o dipping into non-portable CPU-specific instructions?

// Faster than C's memcmp because it doesn't preserve any meaningful
// semantics.  It's just a completely arbitrary, but really fast,
// comparison function.
int memoryCompare(const(void)* lhs, const(void)* rhs, size_t n) {
    for(; n >= uint.sizeof; n -= uint.sizeof) {
        if( *(cast(uint*) lhs) < *(cast(uint*) rhs)) {
            return -1;
        } else if( *(cast(uint*) lhs) > *(cast(uint*) rhs)) {
            return 1;
        }
        lhs += uint.sizeof;
        rhs += uint.sizeof;
    }

    for(; n >= ubyte.sizeof; n -= ubyte.sizeof) {
        if( *(cast(ubyte*) lhs) < *(cast(ubyte*) rhs)) {
            return -1;
        } else if( *(cast(ubyte*) lhs) > *(cast(ubyte*) rhs)) {
            return 1;
        }
        lhs += ubyte.sizeof;
        rhs += ubyte.sizeof;
    }

    return 0;
}

Edit: I've read up on SSE and I don't want to use it for 3 reasons:

  1. It's not portable.
  2. It requires programming in ASM.
  3. The comparison instructions assume your data is floating points, which might be problematic if some data happens to match the pattern for a NaN.

Answer

rsp picture rsp · Nov 6, 2009

You could try:

  • check if uint is the largest type that fits in your target CPU (ulongs might fit a native register better)
  • use 2 pointers of that type
  • use 2 local variables using *p++ (dont dereference the pointers 2 times for 1 value)
  • devide the counter of the 1st loop up front (use while (counter--))
  • unroll the 2nd loop by replacing it with a switch (if sizeof(type that fits in a register) is known and will be always the same.)

Edit: if the first loop is the bottleneck, unrolling might be the answer. Combined with halving the number of conditions in case of equal values, for unrolling 4 times I get something like:

uint* lp = (uint*)lhs;
uint* rp = (uint*)rhs;
uint  l;
uint  r;
int count = (n / uint.sizeof) / 4;

while (count--) {
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
    if( (l = *lp++) != (r = *rp++) {
        return (l < r) ? -1 : 1;
    }
}

Of course that leaves (n / uint.sizeof) % 4 iterations left to do, which you can mix into this loop by interleaving a switch, I left that as exercise for the reader evil grin.