views: 546
answers: 5

According to http://www.codeguru.com/forum/showthread.php?t=463663 , C#'s GetHashCode function for strings in .NET 3.5 is implemented as:

public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}

I am curious whether anyone can come up with a function that returns the same results but is faster. It is OK to increase the overall startup and resource overhead of the main application. Requiring a one-time initialization (per application execution, not per call or per string) is also OK.

Note that, unlike Microsoft, you can ignore considerations like "doing it this way will make everything else slower and has costs that make this method stupid!" So even assuming Microsoft's implementation is perfect, it may still be beatable by doing something "stupid."

This is purely an exercise in my own curiosity and will not be used in real code.

Examples of ideas I've thought of:

  • Using multiple cores (calculating num2 and num independently)
  • Using the GPU
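
For anyone who wants to experiment outside the CLR, here is my own sketch of the same algorithm in portable C (names are mine; it assumes little-endian word packing as on x86 and at least one trailing zero char, mirroring the NUL terminator that .NET strings provide, and has not been verified against actual CLR output):

```c
#include <stdint.h>

/* Sketch of the hash above in portable C. Assumes little-endian word
 * packing (as on x86) and that at least one zero char follows the data,
 * mirroring the NUL terminator of a .NET string. */
static int32_t clr_string_hash(const uint16_t *s, int len)
{
    uint32_t num = 0x15051505u, num2 = num;
    for (int i = len; i > 0; i -= 4) {
        /* pack two UTF-16 chars into one little-endian 32-bit word */
        uint32_t w0 = (uint32_t)s[0] | ((uint32_t)s[1] << 16);
        num = ((num << 5) + num + (num >> 0x1b)) ^ w0;
        if (i <= 2)
            break;
        uint32_t w1 = (uint32_t)s[2] | ((uint32_t)s[3] << 16);
        num2 = ((num2 << 5) + num2 + (num2 >> 0x1b)) ^ w1;
        s += 4;
    }
    return (int32_t)(num + num2 * 0x5d588b65u);
}
```

A baseline like this makes it easy to check that any "faster" variant still returns the same results.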
A: 

You could parallelize this; however, the problem you will run into is that threads, CUDA, etc. have overheads associated with them. Even if you use a thread pool, if your strings are not very large (say a typical string is 128-256 characters, probably less), each call to this function will probably still end up taking longer than it did originally.

Now, if you were dealing with very large strings, then yes, it would improve your time, because the simple algorithm is "embarrassingly parallel."
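
To illustrate the overhead point, here is a sketch (in C with POSIX threads, my own names) of the OP's "compute num and num2 independently" idea: the two accumulators consume alternating 32-bit words, so each chain can run on its own thread. Input is assumed pre-packed into 32-bit words with an even word count, to keep the sketch short; thread start/join costs will swamp the gain unless the string is very long.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

/* One accumulator chain. start is 0 for the "num" lane, 1 for "num2". */
typedef struct {
    const uint32_t *words;
    int nwords;
    int start;
    uint32_t acc;
} lane;

static void *run_lane(void *arg)
{
    lane *l = (lane *)arg;
    uint32_t h = 0x15051505u;
    for (int i = l->start; i < l->nwords; i += 2)
        h = ((h << 5) + h + (h >> 0x1b)) ^ l->words[i];
    l->acc = h;
    return NULL;
}

static int32_t hash_two_threads(const uint32_t *words, int nwords)
{
    lane a = { words, nwords, 0, 0 };
    lane b = { words, nwords, 1, 0 };
    pthread_t t;
    pthread_create(&t, NULL, run_lane, &b);  /* num2 chain on a helper thread */
    run_lane(&a);                            /* num chain on this thread */
    pthread_join(t, NULL);
    return (int32_t)(a.acc + b.acc * 0x5d588b65u);
}
```

Benchmarking this against the sequential loop on short strings should show the create/join cost dominating, which is exactly the answer's point.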

BobbyShaftoe
A: 

I think all of your suggested approaches are very inefficient compared to the current implementation.

Using the GPU: The string data needs to be transferred to the GPU and the result back, which takes a lot of time. GPUs are very fast, but mostly at floating-point calculations, which aren't used here. All operations here are on integers, for which x86 CPU performance is decent.

Using another CPU core: This would involve creating a separate thread, locking down memory, and synchronizing with the thread requesting the hash code. The overhead incurred simply outweighs the benefits of parallel processing.

If you wanted to calculate hash values for thousands of strings in one go, things might look a little different, but I can't imagine a scenario where that would justify implementing a faster GetHashCode().

Johannes Rudolph
If you want to calculate hash values of thousands of strings, you don't need to re-implement GetHashCode; just call the default GetHashCode on a different thread for each string.
Guillaume
True, but even if you had a thousand threads, you'd still want the basic code for the hashing to run as fast as you could make a sequential version of it run.
Ira Baxter
A: 

Given that strings are immutable, the first thing that I would consider is caching the return result.
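
The idea can be sketched as lazy caching with a sentinel, along the lines Kent spells out in the comments below his answer. This C version uses illustrative names and a stand-in hash; 0 serves as the "not yet computed" marker (a string whose real hash is 0 would simply be recomputed each call, which is rare and harmless):

```c
#include <stdint.h>

/* A string-like object carrying a cached hash. 0 means "not computed". */
typedef struct {
    const char *data;
    int len;
    int32_t cached_hash;
} mystring;

static int hash_computations;  /* just to observe caching behavior */

static int32_t compute_hash(const char *s, int len)
{
    uint32_t h = 0x15051505u;  /* stand-in for the real algorithm */
    for (int i = 0; i < len; i++)
        h = ((h << 5) + h + (h >> 27)) ^ (uint32_t)(unsigned char)s[i];
    hash_computations++;
    return (int32_t)h;
}

static int32_t get_hash_code(mystring *s)
{
    if (s->cached_hash == 0)   /* first call: compute and cache */
        s->cached_hash = compute_hash(s->data, s->len);
    return s->cached_hash;     /* subsequent calls are nearly free */
}
```

Immutability is what makes this safe: the cached value can never go stale.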

HTH, Kent

Kent Boogaart
It takes memory and it's useful only if GetHashCode is called multiple times on the same instance.
Guillaume
The question was about speed, not memory use ;)
Kent Boogaart
When we talk about primitive types, speed and memory are linked... But you are right, the question was only about speed of a particular algorithm. Anyway, it still helps only when GetHashCode is called multiple times on the same instance.
Guillaume
To cache the result, you have to look up the result based on the argument. So you'll need to hash the argument to look up the cached answer... oops, that was the problem we were trying to avoid. I don't think you can reasonably do this.
Ira Baxter
@Ira: er, there is no argument. The cache would be on the instance of the string. There is nothing to check except whether you've already calculated the result. if (_hashCode != 0) return _hashCode;
Kent Boogaart
who downvoted this; this is the sanest/most practical solution posted
Robert Fraser
@Kent: so your solution adds a cache data slot to the string data structure? OK, now I understand how you can avoid computing the hash value after it has been computed. The OP wanted a faster hashing function, period.
Ira Baxter
@Ira: as stated in the question, this was an exercise in curiosity, so I'm not sure why you're being so pedantic. The OP did not want a faster hashing function. He asked for a hash method that returns the same result but faster. My solution will do exactly that for subsequent calls. When is the last time you measured the performance of a method by executing it once?
Kent Boogaart
@Kent: At the point in time that the hash value is needed, somebody, somewhere had to compute it. Less work is less work, even if you cache the answer. I think we both understand the other viewpoint.
Ira Baxter
+1  A: 

Threads and the GPU will most certainly introduce overhead greater than the possible performance boost. The approach that could be justified is using a SIMD instruction set, such as SSE. However, it would require testing whether this particular instruction set is available, which may cost something, and it will bring a boost on long strings only.

If you want to try it, consider testing Mono's support for SIMD before diving into C or assembly. Read here about development possibilities and gotchas.

elder_george
+2  A: 

One way to make a function go faster is to take special cases into account. A function with variable-size inputs has special cases based on size.

Going parallel only makes sense when the cost of going parallel is smaller than the gain, and for this kind of computation the string would likely have to be fairly large to overcome the cost of forking a parallel thread. But implementing that isn't hard; basically you need a test for this.Length exceeding an empirically determined threshold, then you fork multiple threads to compute hashes on substrings, with a final step composing the subhashes into the final hash. Implementation left for the reader.

Modern processors also have SIMD instructions, which can process up to 32 (or 64) bytes in a single instruction. This would allow you to process the string in chunks of 16 two-byte characters, at one or two SIMD instructions per chunk, and then fold the wide accumulator into a single hashcode at the end. This is likely to be extremely fast for strings of any reasonable size. The implementation of this from C# is harder, because one doesn't expect a virtual machine to provide easy (or portable) access to the SIMD instructions that you need. Implementation also left for the reader. EDIT: Another answer notes that the Mono system does provide SIMD instruction access.
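
As a rough illustration of the SIMD idea, here is a sketch using SSE2 intrinsics (my own code, x86 only): it runs four independent hash chains, with lane j consuming words j, j+4, j+8, and so on. Note that this does NOT reproduce the original hash value, since the lane structure differs from the num/num2 pair; it only shows how the shift/add/xor step vectorizes. The word count is assumed to be a multiple of 4 to keep the sketch short.

```c
#include <emmintrin.h>   /* SSE2 intrinsics, baseline on x86-64 */
#include <stdint.h>

static int32_t hash_sse2(const uint32_t *words, int nwords)
{
    __m128i h = _mm_set1_epi32(0x15051505);
    for (int i = 0; i + 4 <= nwords; i += 4) {
        __m128i w  = _mm_loadu_si128((const __m128i *)(words + i));
        __m128i h5 = _mm_slli_epi32(h, 5);     /* h << 5  per lane */
        __m128i hr = _mm_srli_epi32(h, 27);    /* h >> 27 per lane */
        h = _mm_xor_si128(_mm_add_epi32(_mm_add_epi32(h5, h), hr), w);
    }
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, h);
    /* fold the four lanes; any fixed mixing would do for a sketch */
    return (int32_t)((lanes[0] + lanes[1] * 0x5d588b65u)
                   ^ (lanes[2] + lanes[3] * 0x5d588b65u));
}
```

The per-element work drops to a quarter, at the price of changing the hash function's output.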

Having said that, the particular implementation exhibited is pretty stupid. The key observation is that the loop checks the limit twice on every iteration. One can solve that problem by checking the end-condition cases in advance and executing a loop that does the correct number of iterations. One can do better than that by using Duff's device to jump into an unrolled loop of N iterations, which gets rid of the loop-limit checking overhead for N-1 of those iterations. That modification would be very easy and surely worth the effort to implement.
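
For readers unfamiliar with it, Duff's device interleaves a switch with an unrolled do/while so that control jumps into the middle of the loop body, handling the remainder iterations on the way in. A generic C sketch (a simple checksum stands in for the hash step; this is not the CLR algorithm):

```c
#include <stdint.h>

/* Classic Duff's device: the switch jumps into an 8x-unrolled do/while,
 * so remainder elements are handled on entry and the loop tests its
 * limit only once per 8 elements. A simple sum stands in for the hash. */
static uint32_t sum_duff(const uint32_t *p, int n)
{
    uint32_t acc = 0;
    if (n <= 0)
        return acc;
    int rounds = (n + 7) / 8;
    switch (n % 8) {
    case 0: do { acc += *p++;
    case 7:      acc += *p++;
    case 6:      acc += *p++;
    case 5:      acc += *p++;
    case 4:      acc += *p++;
    case 3:      acc += *p++;
    case 2:      acc += *p++;
    case 1:      acc += *p++;
               } while (--rounds > 0);
    }
    return acc;
}
```

C permits case labels inside the nested do/while; C# does not, which is why the answer falls back to loop peeling below.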

EDIT: You can also combine the SIMD idea with the loop unrolling idea, processing many chunks of 8/16 characters in a few SIMD instructions.

For languages that can't jump into loops, one can do the equivalent of Duff's device by simply peeling off the initial cases. A shot at recoding the original using the loop-peeling approach is the following:

    public override unsafe int GetHashCode()
    {
        fixed (char* str = ((char*) this))
        {
            const int N=3; // a power of two controlling number of loop iterations
            char* chPtr = str;
            int num = 0x15051505;
            int num2 = num;
            int* numPtr = (int*) chPtr;
            int count = this.Length;
            int unrolled_iterations = count >> (N+1); // could be 0 and that's OK
            for (int i = unrolled_iterations; i > 0; i--)
            {
               // repeat 2**N times
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[4];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[5]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[6];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[7]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[8];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[9]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[10];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[11]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[12];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[13]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[14];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[15]; }
               numPtr += 16;
            }
            if ((count & ((1<<N)-1)) != 0)
            {
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[4];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[5]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[6];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[7]; }
               numPtr += 8;
            }
            if ((count & ((1<<(N-1))-1)) != 0)
            {
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3]; }
               numPtr += 4;
            }
            if ((count & ((1<<(N-2))-1)) != 0)
            {
               { num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                 num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1]; }
               numPtr += 2;
            }
            // repeat N times and finally:
            if ((count & 1) != 0)
            {
               num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
               // numPtr += 1;
            }

            return (num + (num2 * 0x5d588b65));
        }
    }

I haven't compiled or tested this code, but the idea is right. It depends on the compiler doing reasonable constant folding and address arithmetic.

I tried to code this to preserve the exact hash value of the original, but IMHO that isn't really a requirement. It would be even simpler and a tiny bit faster if it didn't use the num/num2 stunt, but simply updated num for each character.
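
That simpler single-accumulator variant can be sketched as follows (in C, my own code; as noted, it does NOT return the same value as the CLR hash, which is the point of dropping the num/num2 pair):

```c
#include <stdint.h>

/* One accumulator updated per 16-bit character. Simpler than the
 * two-chain original, but produces a different hash value. */
static int32_t simple_hash(const uint16_t *s, int len)
{
    uint32_t h = 0x15051505u;
    for (int i = 0; i < len; i++)
        h = ((h << 5) + h + (h >> 0x1b)) ^ s[i];
    return (int32_t)h;
}
```

With a single live accumulator the whole computation fits comfortably in registers, which is the speedup being suggested.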


Corrected version (by Brian) as a static function:

    public static unsafe int GetHashCodeIra(string x)
    {
        fixed (char* str = x.ToCharArray())
        {
            const int N = 2; // a power of two controlling number of loop iterations
            char* chPtr = str;
            int num = 0x15051505;
            int num2 = num;
            int* numPtr = (int*)chPtr;
            int count = (x.Length+1) / 2;
            int unrolled_iterations = count >> (N+1); // could be 0 and that's OK
            for (int i = unrolled_iterations; i > 0; i--)
            {  // repeat 2**N times
                {
                    num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                    num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
                }
                {
                    num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
                    num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3];
                }
                {
                    num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[4];
                    num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[5];
                }
                {
                    num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[6];
                    num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[7];
                }
                numPtr += 8;
            }
            if (0 != (count & ((1 << N) )))
            {
                {
                    num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                    num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
                }
                {
                    num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[2];
                    num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[3];
                }
                numPtr += 4;
            }
            if (0 != (count & ((1 << (N - 1) ))))
            {
                {
                    num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                    num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
                }
                numPtr += 2;
            }
            // repeat N times and finally:
            if (1 == (count & 1))
            {
                {
                    num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
                    // numPtr += 1;
                }
            }

            return (num + (num2 * 0x5d588b65));
        }
    }
Ira Baxter
(after fixing it), I tested this. It made no difference, either in debug or compiled mode. It helped a little on really long strings (more than 20 characters), and hurt a little on really short strings (less than 5 characters). Not much difference, though.
Brian
Glad you were able to fix it; I didn't think it was wrong :-} One might try reversing the loop-peeling steps: do the unit step first and check to see if the string is short; then the two-step, then check for get-out, etc. This would make short strings run faster at virtually no impact for the longer ones. I'd also simplify the code to avoid the num/num2 dichotomy; the computation might run entirely in registers in that case.
Ira Baxter
... so what did you actually measure? The time to run a compuation in which hashing was involved, or the time to do the hashing itself?
Ira Baxter
... and I wouldn't consider 20 character strings as "long"; you'd never be able to parallelize this for 20 characters. 1000 characters is long. How does it do on 1000 character strings?
Ira Baxter
@Ira: Just the hashing itself.
Brian
@Ira: Here's a corrected copy. And yes, I know I reduced the unrolling by a factor of 2.
Brian
@Brian: 1) What's the reason for using char* str = x.ToCharArray() rather than the original (char*) cast? Are you sure ToCharArray() doesn't actually copy the original string (which would kill most of the performance gain)? 2) How much you unroll depends on how small you want to drive the loop overhead. I did lots of unrolled-loop coding like this on smaller machines and found that 16x-32x unrolls provided measurable differences when the operands got long enough.
Ira Baxter
@Ira: #1 No particular reason. #2: Yeah, I had no particular reason to change that.
Brian
@Brian: I'm serious, check that ToCharArray() doesn't do a a copy. Strings are immutable. Char arrays are not; how can ToCharArray return a reference to a mutable thing using the same storage as the string?
Ira Baxter
@Ira: 1: It can return a reference to a mutable thing because it is unsafe. 2: I can't change it to (char*) x because that gives a cast error. However, I used the same code in the "normal" version for testing, so my timing measurements are consistent.
Brian
@Brian... and how did it do on 1000 character strings, which is the point of your original question?
Ira Baxter