unique integer/long hash key generation over strings for faster compairson

views:

994

answers:

+8 Q:

unique integer/long hash key generation over strings for faster compairson

I'm curious how others have solved this problem, and what problems might lurk behind the naive solution:

I have a system which processes stock market data. There are tens of thousands of symbols, with associated prices/sizes, flowing into the system at the rate of several thousand every millisecond.

One of the basic operations that needs to happen on every tick is string comparison to see if the incoming matches the symbol we are interested in. At such high frequency, optimization of these string comparisons can make a measurable difference in the performance of the whole system.

I am thinking of generating a hash of the symbol string, and storing it with the record. For subsequent comparison, the system should use this hash (being an int or a long, the comparison should be a single operation, rather than iterating through each character of the string until a mismatch is found).

Let's ignore the cost of generating the hash itself (which, in reality, may actually be prohibitive). The only problem I can see is that with a large number of unique symbols, a hash collision (two separate symbols generate the same hash) would be devastating. Is there a hashing algorithm which guarantees that strings which match certain constraints (such as limit on the number of characters) are unique?

EDIT: I'll write this code in Java. Not sure of the (collision) quality of hashCode or the speed with which it is calculated.

+2 A:

If you use String.intern() or your own String pooling you can then use == rather than .equals() - I've done this in similar performance critical code and it has made a big difference. The default String already has a hashCode() which works fairly effectively.

I've just realised it wasn't a Java question, but the same applies. Yes, hashing and then using identity checking can save time. The java hashing algorithm uses:

     s[0] * 31^(n-1) + s[1] * 31^(n-2) + ... + s[n-1]

Nick Fortescue 2009-07-02 16:11:17

Not a Java question but my code will be in Java :) I should have mentioned Java, which does include a hashCode function.

Shahbaz 2009-07-02 17:04:41

+4 A:

Common cryptographic hash functions like SHA-1 outputs 20 bytes (160 bit). How long are your stock symbols? If we're talking about ticker symbols like "WMT" (Walmart), "KO" (Coca-Cola), etc, then they seem to be only a couple of bytes long -- thus it should be faster to compare them directly instead of dealing with a 20 byte hash. You mention hash collisions -- I wouldn't worry about them, especially not when the inputs are much smaller than the hash output.

You might be able to cast the bytes into an int or long depending on the programming language and platform and then do the comparison between these "numbers" in one CPU instruction. (I don't know if modern compilers can compare a bunch of bytes equally fast with a call to memcmp?)

Martin Geisler 2009-07-02 16:12:23

TMN 2009-07-02 18:26:00

+3 A:

You should consider using a Perfect hash function, I think it fits your requirements

Hasturkun 2009-07-02 16:12:23

+1 A:

You could generate the hash by treating the string as a Base-27 number (assuming the symbols contain only letters). This would generate the uniqueness you're looking for. For example:

(no letter) = 0, A = 1, B = 2, ... Z = 26

AA = (1 x 27¹) + (1 x 27⁰) = 28

AAA = (1 x 27²) + (1 x 27¹) + (1 x 27⁰) = 757

BBB = (2 x 27²) + (2 x 27¹) + (2 x 27⁰) = 1514

GOOG = (7 x 27³) + (15 x 27²) + (15 x 27¹) + (7 x 27⁰) = 149128

This will work fine up to 6 characters in a 32-bit int.

John Rasch 2009-07-02 16:22:42

Any decent hash function handles collisions well. Basically, if the hash results in a hit for which multiple answers exist, there's a linked list of potential solutions in that bucket, and of necessity, things slow down in finding the correct answer (if one exists).

But don't write your own hash function, use one that is out there.

Oh, and generating the hash should be done only once, I would think. Because you have a lookup table of things you are tracking, and the hash table should only have to change when you add a new "interesting" thing to scan for.

xcramps 2009-07-02 16:25:47

Edit: Better comments than my own were thrown on (and earlier), making mine redundant at best.

hythlodayr 2009-07-02 16:28:50

+6 A:

Maybe hash functions aren't the best approach here. If you're receiving a ticker symbol (and not the hash of the ticker symbol) you're going to have to compute the hash for it every single time it comes through. If its a hashing algorithm with no collisions, you'll need to look at every character of the symbol anyway. So you might as well directly compare the characters.

I suggest building a Trie data structure of all the tickers you're interested in. (see http://en.wikipedia.org/wiki/Trie). Traverse the tree for each symbol and if you reach the end of the ticker without finding a match, then its not an interesting ticker.

With hashing, you'll have to do this traversal anyway in the set of all hash values of the interesting tickers.

Jeremy Powell 2009-07-02 16:32:59

Good point about the cost of calculating the hash in the first place. Although I decided to ignore it for this question, it is a real concern...but one which I can answer by running tests. I do expect that I will store every incoming tick into a Map keyed by the symbol (so the most recent data will overwrite the old data). Elsewhere in my program, the Map will be used for looking up as often as new ticks come in. Since each time a bid or an offer comes in, it'll need to be combined with last sale price to create an aggregate tick. That's why precalculating hashes might be worth it.

Shahbaz 2009-07-02 17:11:17

Along the same lines as rethinking the hashcode solution, another way is simply to increment an atomic long every times a new symbol comes in and put it in a map. Obviously check the map before incrementing the counter. Right now I have no idea what the cpu cycle cost of this will be, but at least I can test it. Simpler solution and keeps me from worrying about hashcode collisions. Either way, this optimization will be hidden away from the public API

Shahbaz 2009-07-03 01:43:58

What you want is a fast hash function that has good discrimination power. For each string, compute the associated hash function and store it with the string. Then for a comparison, code: if (Hash(s1)==Hash(s2) && s1==s2) then { ... } The actual string compare won't occur unless the hashes match, which in practice is only when the strings match.

Some folks will tell you to implement a perfect hash. You can only do that when the set of strings you want to hash has bounded size, usually only 10-1000. You can't do that for an arbitrarily big vocabulary of strings. Since you can't do that, you actually have to compare the strings to determine equality.

Cryptographic hashes have great discrimination power but aren't designed to be fast. What is generally very fast and has good discrimination power are CRC functions, and most langauges have easily found libraries that compute these quickly (using a table lookup technique on bytes). We use CRC-32 and it is very effective for this (basically 1 chance in 2^32 that a hash collision will occur, when the strings don't match). You can use CRC-64, but the additional discrimination power it provides won't really add any real functionality.

Ira Baxter 2009-07-02 16:37:14

I second the above suggestion of a Trie structure as the best approach for this case. Computationally equivalent to a perfect hash, but conceptually much prettier. This is presuming your symbols are bounded in length.

Paul Milovanov 2009-07-02 17:27:24

FWIW, on the last high-data-volume project I was on, we found filtering, aggregating and pre-classifying data using some heavily-tuned C code was key. All our feeds went into this pre-processor and it took care of the simple data cleansing before passing the bulk of the data to our Java-based system for processing. Basically the pre-processor did just what you're asking: identifying records of interest, verifying they were complete and removing dups and empties. During peak times the pre-processor could eliminate up to 20% of the 8M or so records we'd get per hour (probably not quite the volume I imagine you get from stock market feeds). Our original Java version was lucky to get half that (but it was "elegant", at least!)

TMN 2009-07-02 18:42:54

+1 A:

If you are receiving 4-letter ticker symbols, then each letter should be representable as a single byte. Pack all 4 together into a 32-bit int, and voila, you have your "hash". You can now compare this against the reference using a single machine instruction.

If you weren't using Java, that is.

I really wouldn't suggest using Java for anything speed-critical, especially not thousands of string comparisons per millisecond.

edit: If you wanted to use 64-bit code, you could pack up to 8 letters per long int and then compare in 1 instruction.

muusbolla 2009-07-02 19:05:35

+1. But I doubt you need 64-bit code for stock ticker symbols -- each letter can be represented in 5 bits, meaning 6 letters sit comfortably in a 32-bit word. Packing like this is fast -- just a subtraction and bit shift per character.

j_random_hacker 2009-07-03 05:27:35

ansaurus

tags:

views:

answers:

unique integer/long hash key generation over strings for faster compairson

related questions