tags:

views:

25

answers:

1

Hello, I am looking at MurmurHash (sites.google.com/site/murmurhash/) I'm using it in a black box kind of a way and not trying to undersatand the maths at this stage.

However, I did have a little look at the code and got worried about how it seems to work... Here's the code:

uint64_t MurmurHash64A ( const void * key, int len, unsigned int seed )
{
    const uint64_t m = 0xc6a4a7935bd1e995;
    const int r = 47;

    uint64_t h = seed ^ (len * m);

    const uint64_t * data = (const uint64_t *)key;
    const uint64_t * end = data + (len/8);

    while(data != end)
    {
        uint64_t k = *data++;

        k *= m; 
        k ^= k >> r; 
        k *= m; 

        h ^= k;
        h *= m; 
    }

    const unsigned char * data2 = (const unsigned char*)data;

    switch(len & 7)
    {
    case 7: h ^= uint64_t(data2[6]) << 48;
    case 6: h ^= uint64_t(data2[5]) << 40;
    case 5: h ^= uint64_t(data2[4]) << 32;
    case 4: h ^= uint64_t(data2[3]) << 24;
    case 3: h ^= uint64_t(data2[2]) << 16;
    case 2: h ^= uint64_t(data2[1]) << 8;
    case 1: h ^= uint64_t(data2[0]);
            h *= m;
    };

    h ^= h >> r;
    h *= m;
    h ^= h >> r;

    return h;
} 

Note this is the 64bit version for 64bit machines. My problem is that I don't understand how it steps through the key you send it. If I for example send it a pointer to a string of characters "ABC". I can see that I would send it a pointer to the first character "A" and a length of 3.

My limited C++ knowledge tells me that it creates a pointer 'data' that points to the same location as the incoming pointer. But then in it calculates the end of the key by taking 'data' and adding to it the length of the string divided by 8. So if the key were less than 8 the while loop wouldn't get triggered and none of the first bit of maths stuff would get done. Does anyone know why it divides by 8?

Is it because the first maths bit only is only meant to happen to keys 8 characters and over (if so why)?

Thanks in advance. C

+2  A: 

The algorithm processes the data passed 8 bytes at a time (uint64_t is 8 bytes). The first loop will combine all set of 8 bytes to make a single key of 8 bytes. Then the switch will use the remaining bytes (all 3 bytes in your example passing "ABC") and processes it to take them into account in the final result.

Benoit Thiery
Thank you very much.
Columbo