tags:

views:

1390

answers:

3
  • How that integer hash is generated by the GetHashCode() function? Is it a random value which is not unique?

  • In string, it is overridden to make sure that there exists only one hash code for a particular string. How to do that?

  • How searching for specific key in a hash table is speeded up using hash code?

  • What are the advantages of using hash code over using an index directly in the collection (like in arrays)?

Can someone help?

+7  A: 

Basically, hash functions use some generic function to digest data and generate a fingerprint (and integer number here) for that data. Unlike an index, this fingerprint depends ONLY on the data, and should be free of any predictable ordering based on the data. Any change to a single bit of the data should also change the fingerprint considerably.

Notice that nowhere does this guarantee that different data won't give the same hash. In fact, quite the opposite: this happens very often, and is called a collision. But, with an integer, the probability is roughly 1 in 4 billion against this (1 in 2^32). If a collision happens, you just compare the actual object you are hashing to see if they match.

This fingerprint can then be used as an index to an array (or arraylist) of stored values. Because the fingerprint is dependent only on the data, you can compute a hash for something and just check the array element for that hash value to see if it has been stored already. Otherwise, you'd have to go through the whole array checking if it matches an item.

You can also VERY quickly do associative arrays by using 2 arrays, one with Key values (indexed by hash), and a second with values mapped to those keys. If you use a hash, you just need to know the key's hash to find the matching value for the key. This is much faster than doing a binary search on a sorted key list, or a scan of the whole array to find matching keys.

There are MANY ways to generate a hash, and all of them have various merits, but few are simple. I suggest consulting the wikipedia page on hash functions for more info.

BobMcGee
The hash is not "more or less random"; it's just less. So less random as to not be random at all. A better word would be "arbitrary." And by saying the hash is "unique to that data," you DO "guarantee that different data won't give the same hash." And since that's obviously false, "unique" is not the right word.
Rob Kennedy
I mean random, as in there's no predictable order to the keys from a hashcode vs. indexes in a List being assigned in order. I'll try to clarify my points by rephrasing that.
BobMcGee
A: 

Answering each one of your questions directly:

How that integer hash is generated by the GetHashCode() function? Is it a random value which is not unique?

An integer hash is generated by whatever method is appropriate for the object. The generation method is not random but must follow consistent rules, ensuring that a hash generated for one particular object will equal the hash generated for an equivalent object. As an example, a hash function for an integer would be to simply return that integer.

In string, it is overridden to make sure that there exists only one hash code for a particular string. How to do that?

There are many ways this can be done. Here's an example I'm thinking of on the spot:

int hash = 0;
for(int i = 0; i < theString.Length; ++i)
{
    hash ^= theString[i];
}

This is a valid hash algorithm, because the same sequence of characters will always produce the same hash number. It's not a good hash algorithm (an extreme understatement), because many strings will produce the same hash. A valid hash algorithm doesn't have to guarantee uniqueness. A good hash algorithm will make a chance of two differing objects producing the same number extremely unlikely.

How searching for specific key in a hash table is speeded up using hash code? What are the advantages of using hash code over using an index directly in the collection (like in arrays)?

A hash code is typically used in hash tables. A hash table is an array, but each entry in the array is a "bucket" of items, not just one item. If you have an object and you want to know which bucket it belongs in, calculate

 hash_value MOD hash_table_size.

Then you simply have to compare the object with every item in the bucket. So a hash table lookup will most likely have a search time of O(1), as opposed to O(log(N)) for a sorted list or O(N) for an unsorted list.

Andrew Shepherd
+1  A: 

A hash code IS an index, and a hash table, at its very lowest level, IS an array. But for a given key value, we determine the index into in a hash table differently, to make for much faster data retrieval.

Example: You have 1,000 words and their definitions. You want to store them so that you can retrieve the definition for a word very, very quickly -- faster than a binary search, which is what you would have to do with an array.

So you create a hash table. You start with an array substantially bigger than 1,000 entries -- say 5,000 (the bigger, the more time-efficient).

The way you'll use your table is, you take the word to look up, and convert it to a number between 0 and 4,999. You choose the algorithm for doing this; that's the hashing algorithm. But you could doubtless write something that would be very fast.

Then you use the converted number as an index into your 5,000-element array, and insert/find your definition at that index. There's no searching at all: you've created the index directly from the search word.

All of the operations I've described are constant time; none of them takes longer when we increase the number of entries. We just need to make sure that there is sufficient space in the hash to minimize the chance of "collisions", that is, the chance that two different words will convert to the same integer index. Because that can happen with any hashing algorithm, we need to add checks to see if there is a collision, and do something special (if "hello" and "world" both hash to 1,234 and "hello" is already in the table, what will we do with "world"? Simplest is to put it in 1,235, and adjust our lookup logic to allow for this possibility.)

Edit: after re-reading your post: a hashing algorithm is most definitely not random, it must be deterministic. The index generated for "hello" in my example must be 1,234 every single time; that's the only way the lookup can work.

John Pirie