views:

205

answers:

6

Another question on SO brought up the facilities in some languages to hash strings to give them a fast lookup in a table. Two examples of this are Dictionary<> in .NET and the {} (dict) storage structure in Python. Other languages certainly support such a mechanism. C++ has its map, LISP has an equivalent, as do most other modern languages.

It was contended in the answers to that question that hash algorithms on strings can be computed in constant time, with one SO member with 25 years of programming experience claiming that anything can be hashed in constant time. My personal contention is that this is not true unless your particular application places a bound on the string length, i.e. some constant K dictates the maximal length of a string.

I am familiar with the Rabin-Karp algorithm which uses a hashing function for its operation, but this algorithm does not dictate a specific hash function to use, and the one the authors suggested is O(m), where m is the length of the hashed string.

I see some other pages such as this one (http://www.cse.yorku.ca/~oz/hash.html) that display some hash algorithms, but it seems that each of them iterates over the entire length of the string to arrive at its value.
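For concreteness, here is the djb2 hash from that linked page, sketched in Python (the 32-bit masking is my own addition to mimic C's unsigned overflow):

```python
def djb2(s: str) -> int:
    # The djb2 hash: one multiply-add per character, so O(m) in the length.
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF  # keep the value at 32 bits
    return h
```

The loop visits every character, which is exactly the O(m) behavior in question.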

From my comparatively limited reading on the subject, it appears that many associative arrays for string keys are actually implemented using a hash function together with a tree of some sort under the hood. This may be an AVL tree or a red/black tree that points to the location of the value element in the key/value pair.

Even with this tree structure, if we are to remain on the order of theta(log(n)), with n being the number of elements in the tree, we need to have a constant-time hash algorithm. Otherwise, we have the additive penalty of iterating over the string. Even though theta(m) would be eclipsed by theta(log(n)) for indexes containing many strings, we cannot ignore it if we are in such a domain that the texts we search against will be very large.

I am aware that suffix trees/arrays and Aho-Corasick can bring the search down to theta(m) for a greater expense in memory, but what I am asking specifically is whether a constant-time hash method exists for strings of arbitrary lengths, as was claimed by the other SO member.

Thanks.

+7  A: 

A hash function doesn't have to (and can't) return a unique value for every string.

You could use the first 10 characters to initialize a random number generator and then use that to pull out 100 random characters from the string, and hash that. This would be constant time.
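A minimal Python sketch of that sampling idea (the name `sampled_hash` and the use of Python's `random` module are my own choices, not from the answer):

```python
import random

def sampled_hash(s: str, samples: int = 100) -> int:
    # Constant time by construction: the work is bounded by `samples`,
    # not by len(s). Equal strings always produce equal hashes, because
    # the sampled positions depend only on the prefix and the length.
    if not s:
        return 0
    rng = random.Random(s[:10])  # seed the RNG from the first 10 characters
    picked = ''.join(s[rng.randrange(len(s))] for _ in range(samples))
    return hash((len(s), picked))
```

Note that Python's built-in `hash()` for strings is randomized per process, so the values are stable within a run but not across runs.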

You could also just return the constant value 1. Strictly speaking, this is still a hash function, although not a very useful one.

Mark Byers
Reminds me of http://xkcd.com/221/
Ron Warholic
The problem with that is that very similar strings would have a high probability of having identical hashes. In general, a single bit change should change all the bits in the hash, so that the likelihood of two strings colliding is independent of their similarity. That said, your idea would work if you didn't need to worry about close strings colliding.
Chad Okere
Admittedly, this works. But for something as general as a language construct that stores key/value pairs, it isn't something I'd use, as it doesn't yield consistent run-time results.
San Jacinto
+1  A: 

You can hope for asymptotically less than linear hashing time if you use ropes instead of strings and have sharing that allows you to skip some computations. But obviously a hash function cannot distinguish inputs it has not read, so I wouldn't take the claim that "everything can be hashed in constant time" too seriously.

There is always a trade-off between the quality of a hash function and the amount of computation it takes, and a hash function over long strings must have collisions in any case.

You have to determine if the strings that are likely to occur in your algorithm will collide too often if the hash function only looks at a prefix.

Pascal Cuoq
+3  A: 

In general, I believe that any complete string hash must use every character of the string and therefore would need to grow as O(n) for n characters. However I think for practical string hashes you can use approximate hashes that can easily be O(1).

Consider a string hash that always uses Min(n, 20) characters to compute a standard hash. Obviously this is O(1) in the string size. Will it work reliably? It depends on your domain...
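A sketch of that bounded-prefix idea in Python (the djb2-style inner loop is my own choice of "standard hash"):

```python
def prefix_hash(s: str, limit: int = 20) -> int:
    # Hash at most `limit` characters: min(len(s), limit) work, i.e. O(1).
    h = 5381
    for ch in s[:limit]:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h
```

The cost no longer depends on the full string length, but any two strings sharing a 20-character prefix collide, which is the domain-dependence the answer warns about.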

Ron Warholic
+1  A: 

Although I cannot imagine a fixed-time hash function for unlimited length strings, there is really no need for it.

The idea behind using a hash function is to generate a distribution of the hash values that makes it unlikely that many strings would collide - for the domain under consideration. This key would allow direct access into a data store. These two combined result in a constant time lookup - on average.

If such a collision ever occurs, the lookup algorithm falls back on a more flexible lookup sub-strategy.

xtofl
I agree, but in the case of a language construct like an associative array, wouldn't you want to be as close to guaranteeing uniqueness as possible?
San Jacinto
+3  A: 

You cannot easily achieve a general constant time hashing algorithm for strings without risking severe cases of hash collisions.

For it to be constant time, you will not be able to access every character in the string. As a simple example, suppose we take the first 6 characters. Then someone comes along and tries to hash an array of URLs. The hash function will see "http:/" for every single string.

Similar scenarios may occur for other character-selection schemes. You could pick characters pseudo-randomly based on the value of the previous character, but you still run the risk of failing spectacularly if the strings for some reason have the "wrong" pattern and many end up with the same hash value.
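The URL scenario above is easy to demonstrate (the `first6_hash` name is illustrative, and Python's built-in `hash()` stands in for whatever fixed-work hash you might use):

```python
def first6_hash(s: str) -> int:
    # Only the first six characters are ever read: constant work per call.
    return hash(s[:6])

urls = ["http://a.example", "http://b.example", "http://c.example"]
# Every URL begins with "http:/", so all of them land in the same bucket.
buckets = {first6_hash(u) for u in urls}
```

Here `buckets` collapses to a single value, turning the hash table into a linked list for this input distribution.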

Josef Grahn
A: 

Certainly this is doable, so long as you ensure all your strings are 'interned', before you pass them to something requiring hashing. Interning is the process of inserting the string into a string table, such that all interned strings with the same value are in fact the same object. Then, you can simply hash the (fixed length) pointer to the interned string, instead of hashing the string itself.
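A rough Python sketch of the interning idea (the `_interned` table and `intern_str` name are illustrative; CPython's `sys.intern` provides a built-in version for strings):

```python
# A minimal intern table: equal strings are mapped to one canonical object,
# so the object's identity can stand in for the string afterwards.
_interned: dict = {}

def intern_str(s: str) -> str:
    # The first insertion still hashes the whole string (O(len(s))),
    # but every subsequent hash of the interned object's identity is O(1).
    return _interned.setdefault(s, s)

a = intern_str("some fairly long key " * 20)
b = intern_str("some fairly long key " * 20)
# a and b are now the same object; id(a) is a fixed-size value.
```

Note the caveat raised in the comments below: the initial insertion into the table still reads the whole string, so the O(1) cost only applies to hashes after interning.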

Nick Johnson
A good idea, but it's worthwhile to note the process of inserting into a string table would add time proportional to the number of strings in the table, unless the table was hash-based, in which case the problem is reduced back to the original state.
Peter
Well, using a trie, the time to insert into it is proportional to the longest common prefix, which is another option. :)
Nick Johnson
@Nick Johnson you're misunderstanding me, I think. I'm looking for a constant-time way to uniquely identify strings. This means that if I present you with 2 new strings, you can "hash" them in constant time so that if one string is 500 characters and the next one is 5 characters, they take the same theoretical time to determine uniqueness.
San Jacinto
Yes, and obviously that's not possible if the strings are new each time (barring inanities like hashing only the first n characters), but if you need to do many comparisons on the same set of strings, interning reduces the time for each hash to O(1).
Nick Johnson