I need a comparison function for blocks of memory for doing binary searches on arrays of bytes in the D programming language. It does not need to have any useful semantics. It only needs to be fast and be a valid comparison function (one that produces a total ordering). The blocks of memory to be compared are already known to be the same length.

C's memcmp is actually pretty slow because it tries to preserve useful string comparison semantics, which I don't need. Below is the best I've come up with so far. Does anyone know of anything better, preferably w/o dipping into non-portable CPU-specific instructions?

// Faster than C's memcmp because it doesn't preserve any meaningful
// semantics.  It's just a completely arbitrary, but really fast,
// comparison function.
int memoryCompare(const(void)* lhs, const(void)* rhs, size_t n) {
    for(; n >= uint.sizeof; n -= uint.sizeof) {
        if( *(cast(uint*) lhs) < *(cast(uint*) rhs)) {
            return -1;
        } else if( *(cast(uint*) lhs) > *(cast(uint*) rhs)) {
            return 1;
        }
        lhs += uint.sizeof;
        rhs += uint.sizeof;
    }

    for(; n >= ubyte.sizeof; n -= ubyte.sizeof) {
        if( *(cast(ubyte*) lhs) < *(cast(ubyte*) rhs)) {
            return -1;
        } else if( *(cast(ubyte*) lhs) > *(cast(ubyte*) rhs)) {
            return 1;
        }
        lhs += ubyte.sizeof;
        rhs += ubyte.sizeof;
    }

    return 0;
}

Edit: I've read up on SSE and I don't want to use it for 3 reasons:

  1. It's not portable.
  2. It requires programming in ASM.
  3. The comparison instructions assume your data is floating point, which might be problematic if some of it happens to match the bit pattern of a NaN.
A: 

You could try:

  • check whether uint is the largest type that fits a native register on your target CPU (ulong might fit better on 64-bit machines)
  • use two pointers of that type
  • load into two local variables via *p++ (don't dereference the pointers twice for one value)
  • divide the counter for the first loop up front (then use while (count--))
  • unroll the second loop by replacing it with a switch (possible if the size of the register-width type is known and always the same)

Edit: if the first loop is the bottleneck, unrolling might be the answer. Combined with halving the number of comparisons when the values are equal, unrolling 4 times gives me something like:

uint* lp = cast(uint*) lhs;
uint* rp = cast(uint*) rhs;
uint  l;
uint  r;
int count = (n / uint.sizeof) / 4;

while (count--) {
    if ((l = *lp++) != (r = *rp++)) {
        return (l < r) ? -1 : 1;
    }
    if ((l = *lp++) != (r = *rp++)) {
        return (l < r) ? -1 : 1;
    }
    if ((l = *lp++) != (r = *rp++)) {
        return (l < r) ? -1 : 1;
    }
    if ((l = *lp++) != (r = *rp++)) {
        return (l < r) ? -1 : 1;
    }
}

Of course that leaves (n / uint.sizeof) % 4 iterations still to do, which you can fold into this loop by interleaving a switch. I leave that as an exercise for the reader *evil grin*.
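For what it's worth, that tail switch can be sketched like this in C (the helper name and its signature are mine, not from the answer; it assumes 4-byte words and a remaining count of 0 to 3):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tail handler: compares the final (n / 4) % 4 words by
 * falling through a switch instead of looping over them. */
static int compareTail(const uint32_t *lp, const uint32_t *rp, size_t words)
{
    uint32_t l, r;
    switch (words) {          /* words is 0..3 */
    case 3: if ((l = *lp++) != (r = *rp++)) return l < r ? -1 : 1; /* fall through */
    case 2: if ((l = *lp++) != (r = *rp++)) return l < r ? -1 : 1; /* fall through */
    case 1: if ((l = *lp++) != (r = *rp++)) return l < r ? -1 : 1; /* fall through */
    default: return 0;
    }
}
```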

rsp
1. I tried using ulongs; it was slower.
2. I tried taking out the second dereference. It wasn't any faster, probably b/c the compiler optimizes to the same thing anyhow, so I put the second deref back because it makes the code more readable.
3. I think most of the bottleneck is the first loop.
dsimcha
Can the person who voted -1 on all answers add a note as to why?
rsp
@rsp I assume it's just a troll. I don't think any of these answers deserves a downvote.
acidzombie24
A: 

I think memcmp is specified to do a byte-by-byte comparison, regardless of data type. Are you sure your compiler's implementation preserves string semantics? It shouldn't.

Crashworks
A: 

I don't know much about it, but there are vector instructions that can operate on many bytes at once. You can use their results to do a blazingly fast memcmp. I don't know exactly which instructions you would need, but if you look up the Larrabee New Instructions or check out this article you may find what you are looking for: http://www.ddj.com/architect/216402188

NOTE: That CPU isn't out yet, AFAIK.

Edit: I am positive there are instruction sets (look at SSE or SSE2) that can compare 16 bytes at once if they are aligned.
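SSE2 is reachable through C intrinsics rather than raw assembly, and its integer comparisons sidestep the NaN concern from the question. A minimal sketch, with the function name and the memcmp fallback being my own choices, assuming n is a multiple of 16:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Sketch: scan two blocks 16 bytes at a time; on the first mismatching
 * chunk, fall back to memcmp of just that chunk to get the ordering. */
static int sse2Compare(const void *lhs, const void *rhs, size_t n)
{
    const char *l = (const char *)lhs;
    const char *r = (const char *)rhs;
    for (size_t i = 0; i < n; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(l + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(r + i));
        /* movemask of the byte-equality result is 0xFFFF
         * exactly when all 16 bytes compared equal. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) != 0xFFFF)
            return memcmp(l + i, r + i, 16) < 0 ? -1 : 1;
    }
    return 0;
}
```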

You can try this pure C++ code:

template<class T>
int memoryCompare(const T* lhs, const T* rhs, size_t n) {
    const T* endLHS = lhs + n / sizeof(T);
    while (lhs < endLHS) {
        // Subtract in a wider signed type so unsigned wraparound
        // can't flip the sign (valid for T narrower than long long).
        long long diff = (long long)*lhs - (long long)*rhs;
        if (diff != 0) return diff > 0 ? 1 : -1;
        lhs++;
        rhs++;
    }
    // More code for the remaining bytes, or call
    // memoryCompare<char>(lhs, rhs, n % sizeof(T));
    return 0;
}

The advantage here is that you're incrementing the pointer, so you can dereference it directly rather than applying an index (it's *ptr vs ptr[index]). The above uses a template so you can do 64-bit loads on 64-bit machines. CMPs in assembly are really just subtracts that set the N and Z flags; instead of comparing against n and decrementing it, I just compare pointers in my version.

acidzombie24
It might help if you dereference lhs and rhs :-)
rsp
oops, heh XD. Fixed.
acidzombie24
A: 

Well, a lot depends upon your system and data. There are only so many assumptions we can make. What processor are you using? Does it have to be straight C code? How wide are the CPU registers? What is the cache structure of the CPU? And so on.

It also depends on how different your data is. If it is very unlikely that the first bytes of each buffer are the same, then the speed of the function is pretty meaningless because, logically, it won't get past the first comparison. If the first n-1 bytes are usually the same, it becomes much more important.

All in all, you are unlikely to see much change regardless of how you do the test.

Anyway, here is a little implementation of my own. It may, or may not, be faster than yours (or, given I just made it up as I went along, it may, or may not, work ;)):

#include <stdint.h>

int memoryCompare(const void* lhs, const void* rhs, size_t n)
{
    const uint8_t*  l8 = (const uint8_t*)lhs;
    const uint8_t*  r8 = (const uint8_t*)rhs;
    int64_t diff = 0;

    // Test leading bytes until the remaining count is a multiple of 4.
    while ((n & 0x3) != 0 && diff == 0)
    {
        diff = (int64_t)*l8++ - (int64_t)*r8++;
        n--;
    }

    // Test the rest as 32-bit integers, four bytes per comparison.
    const uint32_t* l32 = (const uint32_t*)l8;
    const uint32_t* r32 = (const uint32_t*)r8;
    while (n >= sizeof(uint32_t) && diff == 0)
    {
        diff = (int64_t)*l32++ - (int64_t)*r32++;
        n -= sizeof(uint32_t);
    }

    if (diff == 0)
        return 0;
    return (diff > 0) ? 1 : -1;
}
Goz
+1  A: 

Does the answer to this question help you at all?

If the compiler has support for implementing memcmp() as an intrinsic/builtin function, it seems that you would have a hard time besting it.

I have to admit to knowing little to nothing about D, so I have no idea whether the D compiler has support for intrinsics.
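For illustration, in GCC-flavored C the intrinsic is reachable explicitly as __builtin_memcmp, which the compiler may expand inline for small, known sizes instead of emitting a library call (a sketch; whether D compilers of the time do anything similar is a separate question, and the wrapper name is mine):

```c
#include <string.h>

/* With optimization enabled, GCC and Clang can expand
 * __builtin_memcmp inline when the size is a small constant. */
static int compareKeys16(const void *lhs, const void *rhs)
{
    return __builtin_memcmp(lhs, rhs, 16);
}
```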

sdtom
The OP already determined his alternative is faster than memcmp and asked for further improvements. If his compiler used an intrinsic, this probably would not be the case.
rsp