views:

136

answers:

8

I came across a situation where i had to count the number of occurences of each word in a string. I decided hashing would be the best way to do it (Find the hash value for each word that is encountered and increment the count at the position indexed by the hash value - assuming i use an array). What hashing algorithm can i use to ensure that the hash value generated for every string is unique?

And this led to a bigger question.. How do the language libraries (Java for example) implement data structures like hashmap that generate unique hash values in case of strings?

I want to know the mathematical construct involved behind the implementation of such an algorithm.

+1  A: 

You can't be 100% sure, a hash by definition can have collisions.

You can see on grepcode how a String is hashed in java. And basically HashMap (and other hash based structures) use the hashCode() method every time.

So if you want to count the number of iterations of a particular word, you should use a Map<String, Integer> (in java) and count from there.

For example :

Map<String, Integer> words = new HashMap<String, Integer>();
String word = "lol";

Integer count = words.get(word);
if(count == null){
    count = 0;
}
words.put(word, count + 1);
Colin Hebert
Wrong. See [Perfect Hashing](http://en.wikipedia.org/wiki/Perfect_Hashing).
SLaks
@SLaks, nice, I didn't knew this article. But as it's said, it's for a set S of values, and it's kind of hard (almost impossible) to apply this to "words".
Colin Hebert
I understand.. Are there any standard algorithms to accomplish this?
Raj Kumar
@user: No, you don't understand.
SLaks
+2  A: 

You can't have a hashing algorithm that guarantees uniqueness; it's the pigeonhole principle. Why not use a binary tree?

Oli Charlesworth
But its not possible to perform insertion and deletion operations on a binary tree in O(1), which is what i'm looking for..
Raj Kumar
@user441575: How many different words do you have? You may find that a binary-search for a small number of words is significantly more efficient than having to compute a hash every single time.
Oli Charlesworth
+1  A: 

Theoretically speaking, you cannot guarantee uniqueness for hashes - unless the length of your hash is always as long or longer as the original strings, which is kind of counterproductive.

For a comprehensive explanation on this, please see "Are Hash Codes Unique?" by Tom Archer.

jsalonen
+6  A: 

What hashing algorithm can i use to ensure that the hash value generated for every string is unique?

There is no such function. The space of strings is infinite, but the target space is finite (say you are using 32-bit integers). You can not injectively map an infinite space to a finite space; there must be collisions.

How do the language libraries (Java for example) implement data structures like hashmap that generate unique hash values in case of strings?

They don't; there is not a unique hashing function for strings per the above.

I came across a situation where i had to count the number of occurences of each word in a string. I decided hashing would be the best way to do it (Find the hash value for each word that is encountered and increment the count at the position indexed by the hash value - assuming i use an array).

You have the right idea. Just use a dictionary mapping strings to int. For example, in C# we would use a Dictionary<string, int>. Something similar to this exists in most modern languages. Let the language/framework handle the issue of collisions and what not for you and just focus on expressing your idea within that language/framework.

Jason
A: 

In Java, hashCode for String is implemented as follows:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)

Source: JavaDoc for java.lang.String

You might want to consider using a similar algorithm to make your hashCode bullet proof (mostly).

Mike
+2  A: 

Hashed cannot be a one-to-one function that provides an unique output for every input simply because, usually, the codomain of the function is smaller than the domain, so what you are asking is not possible.

Of course if the length of the string is limited and the set of all possible strings is lower than a precise bound you can have what's called perfect hash function.

You can just search for a good hashing function that has a low collision probability, just start from here and have fun!

side note: if I'm not wrong Java Hashtable doesn't use open addressing. Whenever a collision is found, the element is placed in the same, already occupied, cell through a list. So it's definitely the opposite of what you think.. implmentations doesn't try to guarantee uniqueness, they instead choose a good collision resolution strategy that minimizes some aspects

Jack
A: 

I think what you are looking for is Substring Index or String search. Am i missing something?

yadab