  • What are they and how do they work?
  • Where are they used?
  • When should I (not) use them?

I've heard the word over and over again, yet I don't know its exact meaning.

What I heard is that they allow associative arrays by sending the array key through a hash function that converts it into an int and then uses a regular array. Am I right with that?

(Note: This is not my homework; I go to school, but they teach us only the BASICs in informatics.)

+6  A: 

Wikipedia seems to have a pretty nice answer to what they are.

You should use them when you want to look up values by some index.

As for when you shouldn't use them... when you don't want to look up values by some index (for example, if all you want to ever do is iterate over them.)

Brabster
How in god's name can there be a hash function that always outputs the right integers, like 0, 1, 2, 3, when receiving "abc", "myCatIsFat" or "101010" as input?!
keg
@keg Hash functions usually return a more or less random-looking value, not sequential integers. Why are you asking?
Matti Virkkunen
@keg That would be the perfect hash function as described in the wiki article. Read it to see why such a function is hard to find and how to make things at least semi-optimal.
PeterMmm
It's very *very* hard to write a good hash function. It's very easy to write a bad one, and the literature is not great (it's much more extensive on cryptographic hash functions). The input language matters a lot too, and some hash functions are theoretically great but slow in practice.
Donal Fellows
@keg As an example, a hash function that, say, just returned the length of the string representation of the input would return integers for arbitrary input. Not a very good hash function, mind you; see @PeterMmm's comment.
Brabster
A simple way to get numbers from 0 to 9 or something would be to take the result modulo 10. That would limit the results to that range but, of course, raise the chance of collisions by a large margin. I don't think you would ever use a hash table with only 10 spots. If you don't know, modulus basically returns the remainder of a division operation. E.g. 23 / 10 = 2 remainder 3. Therefore, 23 % 10 = 3.
Chris Cooper
@Brabster: So... are hashes mostly for performance gain?
ItzWarty
@ItzWarty (making some guesses about what you mean) Well, yes. As the Wikipedia article says, 'In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure'
Brabster
+2  A: 

You've about got it. They're a very good way of mapping from arbitrary things (keys) to arbitrary things (values). The idea is that you apply a function (a hash function) that translates the key to an index into the array where you store the values. The hash function's running time is typically linear in the size of the key, which is great when key sizes are much smaller than the number of entries (i.e., the typical case), because the lookup cost then doesn't depend on how many entries the table holds.
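As a rough illustration of the key-to-index step, here is a minimal Python sketch. The name `bucket_index` and the table size of 8 are made up for the example, and Python's built-in `hash` merely stands in for whatever hash function a real table would use.

```python
def bucket_index(key, table_size):
    """Map an arbitrary hashable key to a slot in range(table_size)."""
    # hash() returns a large, more or less random-looking integer;
    # the modulo folds it into a valid array index.
    return hash(key) % table_size


table_size = 8  # hypothetical size, chosen only for illustration
slots = [None] * table_size

for key, value in [("abc", 1), ("myCatIsFat", 2), ("101010", 3)]:
    i = bucket_index(key, table_size)
    # Naive store: a later key that lands on the same slot would simply
    # overwrite this one. That is the collision problem.
    slots[i] = (key, value)
```

Note that the indices are not 0, 1, 2 in insertion order; they are scattered across the array, and two keys may well share a slot.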

The tricky bit is that hash functions are usually imperfect: two different keys can map to the same index. (Perfect hash functions exist, but tend to be very specific to particular applications and particular datasets; they're hardly ever worthwhile.) There are two approaches to dealing with this, and each requires storing the key with the value. One (open addressing) is to probe onward in a predetermined pattern from the slot the hash points to until a free slot is found; the other (chaining) is to store a linked list hanging off each entry in the array (so you do a linear search over what is hopefully a short list). All the production code whose source I've read has used chaining, with the hash table rebuilt dynamically when the load factor becomes excessive.
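A toy sketch of the chaining approach in Python, under some assumptions: the class name `ChainedHashTable`, the starting capacity of 8, and the 0.75 load-factor threshold are all invented for illustration, and plain Python lists stand in for the linked lists. It is not how any particular production table is written.

```python
class ChainedHashTable:
    """Toy hash table using chaining; not tuned for real use."""

    def __init__(self, capacity=8, max_load=0.75):
        # One chain (here: a Python list) per array slot.
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0
        self._max_load = max_load  # assumed threshold, for illustration

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:  # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))  # the key is stored with the value
        self._size += 1
        if self._size / len(self._buckets) > self._max_load:
            self._resize()

    def get(self, key):
        for k, v in self._bucket(key):  # linear scan of a short chain
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self):
        # Rebuild with twice as many slots and re-insert every pair,
        # since each key's index depends on the array length.
        old = [pair for bucket in self._buckets for pair in bucket]
        self._buckets = [[] for _ in range(2 * len(self._buckets))]
        for k, v in old:
            self._bucket(k).append((k, v))
```

Storing the key next to the value is what lets `get` tell apart two keys that collided into the same chain.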

Donal Fellows
+1 nice explanation
Brabster
A: 

Good hash functions are one-way functions that let you produce a well-distributed value from any given input, so different inputs will mostly get different values. They are also repeatable: the same input always generates the same output. (One-wayness matters for cryptographic uses; a hash table only needs the distribution and the repeatability.)

Examples of good hash functions are SHA-1 and SHA-256.

Let's say that you have a database table of users. The columns are id, last_name, first_name, telephone_number, and address.

While any of these columns could have duplicates, let's assume that no rows are exactly the same.

In this case, id is simply a unique primary key of our making (a surrogate key). The id field doesn't actually contain any user data because we couldn't find a natural key that was unique for users, but we use the id field for building foreign key relationships with other tables.

We could look up the user record like this from our database:

SELECT * FROM users
WHERE last_name = 'Adams'
AND first_name = 'Marcus'
AND address = '1234 Main St'
AND telephone_number = '555-1212';

We have to search through four different columns, using four different indexes, to find the record.

However, you could create a new "hash" column, and store the hash value of all four columns combined.

String myHash = myHashFunction("Marcus" + "Adams" + "1234 Main St" + "555-1212");

You might get a hash value like AE32ABC31234CAD984EA8.
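For the curious, here is one way `myHashFunction` might look in Python, assuming SHA-256 from the standard `hashlib` module. The name `row_hash` and the separator character are assumptions added for this sketch. The separator keeps inputs such as ("ab", "c") and ("a", "bc") from concatenating to the same string, which plain `+` would not.

```python
import hashlib


def row_hash(*fields):
    """Hash several text fields into one uppercase hex string."""
    # Join with a separator that is unlikely to appear in the data
    # (ASCII unit separator), so field boundaries stay unambiguous.
    joined = "\x1f".join(fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest().upper()


my_hash = row_hash("Marcus", "Adams", "1234 Main St", "555-1212")
```

A SHA-256 digest is 64 hex characters, so the real value would be longer than the shortened sample shown above.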

You store this hash value as a column in the database and index on that. You now only have to search one index.

SELECT * FROM users
WHERE hash_value = 'AE32ABC31234CAD984EA8';

Once we have the id for the requested user, we can use that value to look up related data in other tables.

The idea is that the hash function offloads work from the database server.

The caveat is that you will have collisions, which in this case means that more than one user ends up with the same hash even though their records are different. Therefore, once you execute the query, if more than one row is in the result set, you iterate through the rows until you find the one whose columns match the values that you're looking for.
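A minimal sketch of that collision check, using Python's standard `sqlite3` module with an in-memory database. The function name `find_user`, the index name, and the second user are invented for the example, and the two rows are deliberately given the same made-up hash value to force a collision.

```python
import sqlite3


def find_user(conn, hash_value, last_name, first_name, address, telephone_number):
    """Return the id of the matching user, or None if there is none."""
    rows = conn.execute(
        "SELECT id, last_name, first_name, address, telephone_number "
        "FROM users WHERE hash_value = ?",
        (hash_value,),
    ).fetchall()
    wanted = (last_name, first_name, address, telephone_number)
    for row in rows:
        # Several rows may share the hash; compare the real columns.
        if row[1:] == wanted:
            return row[0]
    return None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, last_name TEXT, "
    "first_name TEXT, address TEXT, telephone_number TEXT, hash_value TEXT)"
)
conn.execute("CREATE INDEX idx_users_hash ON users (hash_value)")
conn.executemany(
    "INSERT INTO users (last_name, first_name, address, telephone_number, hash_value) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Adams", "Marcus", "1234 Main St", "555-1212", "AE32"),
        ("Baker", "Jane", "99 Oak Ave", "555-3434", "AE32"),  # forced collision
    ],
)

user_id = find_user(conn, "AE32", "Baker", "Jane", "99 Oak Ave", "555-3434")
```

The query narrows the search to the rows sharing the hash, and the loop does the final exact comparison.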

Marcus Adams