I'm looking for an explanation of how a hashtable works - in plain English for a simpleton like me! For example, I know it takes the key, calculates the hash (how?) and then performs some kind of modulo to work out where in the array the value is stored, but that's where my knowledge stops.

Could anyone clarify the process?

Edit: I'm not asking specifically how hash codes are calculated, but for a general overview of how a hashtable works.

+1  A: 

How the hash is computed usually does not depend on the hashtable, but on the items added to it. In frameworks/base class libraries such as .NET and Java, each object has a GetHashCode() (or similar) method returning a hash code for that object. The ideal hash code algorithm and the exact implementation depend on the data represented by the object.
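For example, here is a rough Java sketch (the Point class is invented purely for illustration): a type derives its hash code from the same fields that define equality, so equal objects produce equal hashes.

// A minimal sketch of a type computing its own hash code from its fields.
public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // Combine the fields that participate in equals(); 31 is the
        // conventional multiplier used throughout the JDK.
        return 31 * x + y;
    }
}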

Lucero
+5  A: 

This is how it works, as I understand it:

Here's an example: picture the entire table as a series of buckets. Suppose you have an implementation with alpha-numeric hash-codes and have one bucket for each letter of the alphabet. This implementation puts each item whose hash code begins with a particular letter in the corresponding bucket.

Let's say you have 200 objects, but only 15 of them have hash codes that begin with the letter 'B.' The hash table would only need to look up and search through the 15 objects in the 'B' bucket, rather than all 200 objects.

As far as calculating the hash code, there is nothing magical about it. The goal is just to have different objects return different codes and for equal objects to return equal codes. You could write a class that always returns the same integer as a hash-code for all instances, but you would essentially destroy the usefulness of a hash-table, as it would just become one giant bucket.
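As a rough Java sketch of that pathological case (the class name is invented), a constant hash code is perfectly legal but funnels every instance into one bucket:

class BadKey {
    private final String name;
    BadKey(String name) { this.name = name; }

    @Override
    public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).name.equals(name);
    }

    @Override
    public int hashCode() {
        // Legal per the hash code contract, but every lookup now
        // degenerates to a linear scan of one giant bucket.
        return 42;
    }
}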

tehblanx
"series of buckets" -- or tubes?
Juliet
tubes? I'm intrigued! Please explain.
tehblanx
You lack geek culture.
toto
+1  A: 

It's even simpler than that.

A hashtable is nothing more than an array (usually a sparse one) of vectors which contain key/value pairs. The maximum size of this array is typically smaller than the number of items in the set of possible values for the type of data being stored in the hashtable.

The hash algorithm is used to generate an index into that array based on the values of the item that will be stored in the array.

This is where storing vectors of key/value pairs in the array comes in. Because the set of values that can serve as indexes into the array is typically smaller than the number of all possible values the type can have, it is possible that your hash algorithm will generate the same value for two separate keys. A good hash algorithm will minimize this as much as possible (which is why hashing is usually delegated to the type itself, since it has specific information which a general hash algorithm can't possibly know), but it's impossible to prevent entirely.

Because of this, you can have multiple keys that generate the same hash code. When that happens, the items in the vector are iterated through, and a direct comparison is done between the key in the vector and the key that is being looked up. If it is found, the value associated with the key is returned; otherwise, nothing is returned.
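A rough Java sketch of that lookup (the Entry type and bucket layout are illustrative, not any particular library's internals):

import java.util.List;

class Entry<K, V> {
    final K key;
    final V value;
    Entry(K key, V value) { this.key = key; this.value = value; }
}

class Lookup {
    // Hash the key, find the bucket (a vector of key/value pairs),
    // then compare keys directly to resolve collisions.
    static <K, V> V get(List<Entry<K, V>>[] buckets, K key) {
        int index = (key.hashCode() & 0x7FFFFFFF) % buckets.length;
        List<Entry<K, V>> bucket = buckets[index];
        if (bucket == null) return null;           // empty slot: no match
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) return e.value; // direct key comparison
        }
        return null;                               // same hash, different key
    }
}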

casperOne
+6  A: 

This turns out to be a pretty deep area of theory, but the basic outline is simple.

Essentially, a hash function is just a function that takes things from one space (say strings of arbitrary length) and maps them to a space useful for indexing (unsigned integers, say).

If you only have a small space of things to hash, you might get away with just interpreting those things as integers, and you're done (e.g. 4-byte strings).
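For instance, a toy Java sketch (not a production hash) that packs a string of at most four ASCII bytes into one int; distinct short strings always get distinct values, so this tiny key space needs no collision handling:

import java.nio.charset.StandardCharsets;

class TinyKeyHash {
    // "abcd" -> 0x61626364: the bytes of the string, read as one integer.
    static int pack(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        if (bytes.length > 4) throw new IllegalArgumentException("key too long");
        int h = 0;
        for (byte b : bytes) {
            h = (h << 8) | (b & 0xFF);
        }
        return h;
    }
}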

Usually, though, you've got a much larger space. If the space of things you allow as keys is bigger than the space of things you are using to index (your uint32's or whatever) then you can't possibly have a unique value for each one. When two or more things hash to the same result, you'll have to handle the redundancy in an appropriate way (this is usually referred to as a collision, and how you handle it or don't will depend a bit on what you are using the hash for).

This implies you want it to be unlikely for two keys to produce the same result, and you'd probably also really like the hash function to be fast.

Balancing these two properties (and a few others) has kept many people busy!

In practice you usually should be able to find a function that is known to work well for your application and use that.

Now to make this work as a hashtable: imagine you didn't care about memory usage. Then you can create an array as long as your indexing set (all uint32's, for example). As you add something to the table, you hash its key and look at the array at that index. If there is nothing there, you put your value there. If there is already something there, you add this new entry to a list of things at that address, along with enough information (your original key, or something clever) to find which entry actually belongs to which key.

So as you go along, every entry in your hashtable (the array) is either empty, or contains one entry or a list of entries. Retrieving is as simple as indexing into the array and either returning the value, or walking the list of values and returning the right one.

Of course in practice you typically can't do this; it wastes too much memory. So you do everything based on a sparse array (where the only entries are the ones you actually use; everything else is implicitly null).

There are lots of schemes and tricks to make this work better, but that's the basics.

simon
A: 

You take a bunch of things, and an array.

For each thing, you make up an index for it, called a hash. The important thing about the hash is that it 'scatters' a lot; you don't want two similar things to have similar hashes.

You put your things into the array at the position indicated by the hash. More than one thing can wind up at a given hash, so you store the things in arrays or something else appropriate, which we generally call a bucket.

When you're looking things up in the hash, you go through the same steps, figuring out the hash value, then seeing what's in the bucket at that location and checking whether it's what you're looking for.

When your hashing is working well and your array is big enough, there will only be a few things at most at any particular index in the array, so you won't have to look at very much.

For bonus points, make it so that when your hash table is accessed, it moves the thing found (if any) to the beginning of the bucket, so next time it's the first thing checked.
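A rough Java sketch of that move-to-front trick (all names invented for illustration):

import java.util.Iterator;
import java.util.LinkedList;

class MoveToFrontBucket {
    static final class Entry {
        final String key;
        final Object value;
        Entry(String key, Object value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry> entries = new LinkedList<>();

    void put(String key, Object value) {
        entries.addFirst(new Entry(key, value));
    }

    Object find(String key) {
        Iterator<Entry> it = entries.iterator();
        while (it.hasNext()) {
            Entry e = it.next();
            if (e.key.equals(key)) {
                it.remove();           // unlink from its current position...
                entries.addFirst(e);   // ...and move to the front, so the
                return e.value;        // next lookup finds it immediately
            }
        }
        return null;                   // not in this bucket
    }
}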

chaos
+5  A: 

Here's an explanation in layman's terms.

Let's assume you want to fill up a library of books, and not just stuff them in there, but you want to be able to easily find them again when you need them.

So, you decide that if the person that wants to read a book knows the title of the book, and the exact title to boot, then that's all it should take. With the title, the person, with the aid of the librarian, should be able to go find the book easily and quickly.

So, how can you do that? Well, obviously you can keep some kind of list of where you put each book, but then you have the same problem as searching the library: you need to search the list. Granted, the list would be smaller and easier to search, but still, you don't want to search sequentially from one end of the library (or list) to the other.

You want something that, with the title of the book, can give you the right spot at once, so all you have to do is just stroll over to the right shelf, and pick up the book.

But how can that be done? Well, with a bit of forethought when you fill up the library, and actually, a lot of work when you fill up the library.

Instead of just starting to fill up the library from one end to the other, you devise a clever little method. You take the title of the book, run it through a small computer program, which spits out a shelf number and a slot number on that shelf. This is where you place the book.

The beauty of this program is that later on, when a person comes back in to read the book, you feed the title through the program once more, and get back the same shelf number and slot number that you were originally given, and this is where the book is located.

The program, as others have already mentioned, is called a hash algorithm or hash computation, and usually works by taking the data fed into it (the title of the book in this case) and calculates a number from it.

For simplicity, let's say that it just converts each letter and symbol into a number, and sums them all up. In reality it's a lot more complicated than that, but let's leave it at that for now.

The beauty of such an algorithm is that if you feed the same input into it again and again, it will keep spitting out the same number each time.
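In code, that deliberately naive "sum the letters" program could be a few lines of Java (a toy sketch, not a serious hash function):

class TitleHash {
    // Sums the character codes of the title. The same title always
    // produces the same sum, which is the property the library relies on;
    // the modulus step described below then folds this large number into
    // the number of available slots.
    static int naiveHash(String title) {
        int sum = 0;
        for (char c : title.toCharArray()) {
            sum += c;
        }
        return sum;
    }
}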

Ok, so that's basically how a hash table works.

Technical stuff follows.

First, there's the size of the number. Usually, the output of such a hash algorithm is inside a range of some large number, typically much larger than the space you have in your table. For instance, let's say that we have room for exactly one million books in the library. The output of the hash calculation could be in the range of 0 to one billion, which is a lot higher.

So what do we do? We use something called modulus calculation, which basically says that if you counted to the number you wanted (i.e. the one-billion number) but wanted to stay inside a much smaller range, then each time you hit the limit of that smaller range you'd start back at 0, while keeping track of how far in the big sequence you've come.

Say that the output of the hash algorithm is in the range of 0 to 20, and you get the value 17 from a particular title. If the size of the library is only 7 books, you count 0, 1, 2, 3, 4, 5, 6, and when you get to 7, you start back at 0. Since we need to count 17 times, we have 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, and the final number is 3.

Of course modulus calculation isn't done like that, it's done with division, and remainder. The remainder of dividing 17 by 7 is 3 (7 goes 2 times into 17, to 14, and the difference between 17 and 14 is 3).

Thus, you put the book in slot nr. 3.

This leads to the next problem. Collisions. Since the algorithm has no way to space out the books so that they fill the library exactly (or the hash table if you will), it will invariably end up calculating a number that has been used before. In the library sense, when you get to the shelf and the slot number you wish to put a book in, there's already a book there.

Various collision handling methods exist, including running the data through yet another calculation to get another spot in the table, or simply finding a space close to the one you were given (i.e. right next to the previous book). This would mean that you have some digging to do when you try to find the book later, but it's still better than simply starting at one end of the library.

Finally, at some point, you might want to put more books into the library than the library allows; in other words, you need to build a bigger library. Since the exact spot in the library was calculated using the exact, and current, size of the library, it follows that if you resize the library, you might end up having to find new spots for all the books, since the calculation done to find their spots has changed.

I hope this explanation was a bit more down to earth than buckets and functions :)

Lasse V. Karlsen
A: 

All of the answers so far are good, and get at different aspects of how a hashtable works. Here is a simple example that might be helpful. Let's say we want to store some items with lower-case alphabetic strings as keys.

As simon explained, the hash function is used to map from a large space to a small space. A simple, naive implementation of a hash function for our example could take the first letter of the string, and map it to an integer, so "alligator" has a hash code of 0, "bee" has a hash code of 1, "zebra" would be 25, etc.

Next we have an array of 26 buckets (which could be ArrayLists in Java), and we put each item in the bucket that matches the hash code of its key. If we have more than one item whose key begins with the same letter, they will have the same hash code, so they all go in the bucket for that hash code, and a linear search of the bucket has to be made to find a particular item.

In our example, if we just had a few dozen items with keys spanning the alphabet, it would work very well. However, if we had a million items, or if all the keys started with 'a' or 'b', then our hash table would not be ideal. To get better performance, we would need a different hash function and/or more buckets.
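A rough Java sketch of this naive scheme, using the ArrayList buckets mentioned above (all names invented for illustration):

import java.util.ArrayList;
import java.util.List;

class FirstLetterTable {
    static final class Item {
        final String key;
        final Object value;
        Item(String key, Object value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final List<Item>[] buckets = new List[26]; // one bucket per letter

    private int hash(String key) {
        return key.charAt(0) - 'a';    // "alligator" -> 0, "zebra" -> 25
    }

    void put(String key, Object value) {
        int h = hash(key);
        if (buckets[h] == null) buckets[h] = new ArrayList<>();
        buckets[h].add(new Item(key, value));
    }

    Object get(String key) {
        List<Item> bucket = buckets[hash(key)];
        if (bucket == null) return null;
        for (Item item : bucket) {     // linear search within one bucket
            if (item.key.equals(key)) return item.value;
        }
        return null;
    }
}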

Greg Graham
+1  A: 

Usage and Lingo:

  1. Hash tables are used to quickly store and retrieve data (or records).
  2. Records are stored in buckets using hash keys.
  3. Hash keys are calculated by applying a hashing algorithm to a chosen value contained within the record. This chosen value must be a common value to all the records.
  4. Each bucket can have multiple records, which are organized in a particular order.

Real World Example:

Hash & Co., founded in 1803 and lacking any computer technology, had a total of 300 filing cabinets to keep the detailed information (the records) for their approximately 30,000 clients. Each filing cabinet was clearly identified with its unique number from 0 to 299.

The filing clerks of that time had to quickly fetch and store client records for the working staff. The staff had decided that it would be more efficient to use a hashing methodology to store and retrieve their records.

To file a client record, filing clerks would use the unique client number written on the folder. Using this client number, they would take it modulo 300 to compute the hash key, which identified the filing cabinet the folder belonged in. When they opened the filing cabinet they would discover that it contained many folders, ordered by client number. After identifying the correct location, they would simply slip it in.

To retrieve a client record, filing clerks would be given a client number on a slip of paper. Using this unique client number, they would take it modulo 300 (computing the hash key again) to determine which filing cabinet had the client's folder. When they opened the filing cabinet they would discover that it contained many folders, ordered by client number. Searching through the records, they would quickly find the client's folder and retrieve it.

In our real-world example, our buckets are filing cabinets and our records are file folders.
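In code, the clerks' rule boils down to one modulus operation (a sketch of the arithmetic only, with invented names):

class FilingRule {
    static final int CABINETS = 300;

    // Client 30127 goes to cabinet 30127 % 300 = 127, and the same
    // arithmetic locates the cabinet again at retrieval time.
    static int cabinetFor(int clientNumber) {
        return clientNumber % CABINETS;
    }
}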


An important thing to remember is that computers (and their algorithms) deal with numbers better than with strings. So accessing a large array using an index is significantly faster than searching sequentially.

As Simon has mentioned, and which I believe to be very important, the hashing part is there to transform a large space (of arbitrary length, usually strings, etc.) and map it onto a small space (of known size, usually numbers) for indexing. This is very important to remember!

So in the example above, the 30,000 possible clients or so are mapped to a smaller space.


The main idea in this is to divide your entire data set into segments so as to speed up the actual searching, which is usually time consuming. In our example above, each filing cabinet would (statistically) contain about 100 records. Searching (regardless of the order) through 100 records is much faster than having to deal with 30,000.

You may have noticed that some filing systems actually already do this. But instead of devising a hashing methodology to generate a hash key, they will in most cases simply use the first letter of the last name. So if you have 26 filing cabinets, each assigned a letter from A to Z, you have in theory just segmented your data and enhanced the filing and retrieval process.

Hope this helps,

Jeach!

Jeach
You describe a specific type of hash table collision handling strategy, called variably "open hashing" or "closed addressing" (yes, sad but true) or "chaining". There's another type which doesn't use list buckets but instead stores the items "inline".
Konrad Rudolph
+2  A: 

Short and sweet:

A hash table wraps up an array; let's call it internalArray. Items are inserted into the array in this way:

let insert key value =
    internalArray[hash(key) % internalArray.Length] <- (key, value)
    //oversimplified for educational purposes

Sometimes two keys will hash to the same index in the array, and you want to keep both values. I like to store both values in the same index, which is simple to code by making internalArray an array of linked lists:

let insert key value =
    internalArray[hash(key) % internalArray.Length].AddLast((key, value))

So, if I wanted to retrieve an item out of my hash table, I could write:

let get key =
    let linkedList = internalArray[hash(key) % internalArray.Length]
    // walk the bucket, comparing each stored key against the one requested
    linkedList
    |> Seq.tryFind (fun (testKey, _) -> testKey = key)
    |> Option.map snd

Delete operations are just as simple to write. As you can tell, inserts, lookups, and removals from our array of linked lists are nearly O(1).

When our internalArray gets too full, maybe at around 85% capacity, we can resize the internal array and move all of the items from the old array into the new array.

Juliet
A: 

Here's another way to look at it.

I assume you understand the concept of an array A. That's something that supports the operation of indexing, where you can get to the Ith element, A[I], in one step, no matter how large A is.

So, for example, if you want to store information about a group of people who all happen to have different ages, a simple way would be to have an array that is large enough, and to use each person's age as an index into the array. That way, you could have one-step access to any person's information.

But of course there could be more than one person with the same age, so what you put in the array at each entry is a list of all the people who have that age. So you can get to an individual person's information in one step plus a little bit of search in that list (called a "bucket"). It only slows down if there are so many people that the buckets get big. Then you need a larger array, and some other way to get more identifying information about the person, like the first few letters of their surname, instead of using age.

That's the basic idea. Instead of using age, any function of the person that produces a good spread of values can be used. That's the hash function. Like you could take every third bit of the ASCII representation of the person's name, scrambled in some order. All that matters is that you don't want too many people to hash to the same bucket, because the speed depends on the buckets remaining small.
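A rough Java sketch of the age-indexed idea (Person and the age bound are invented for illustration):

import java.util.ArrayList;
import java.util.List;

class PeopleByAge {
    static final class Person {
        final String name;
        final int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    // The "hash function" here is just the age itself; each slot holds
    // the bucket of people sharing that age.
    @SuppressWarnings("unchecked")
    private final List<Person>[] buckets = new List[130]; // assumes ages 0..129

    void add(Person p) {
        if (buckets[p.age] == null) buckets[p.age] = new ArrayList<>();
        buckets[p.age].add(p);      // one-step access, then a short list
    }

    List<Person> withAge(int age) {
        return buckets[age];        // null if nobody has this age
    }
}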

Mike Dunlavey
+1  A: 

You guys are very close to explaining this fully, but are missing a couple of things. The hashtable is just an array. The array itself will contain something in each slot. At a minimum you will store the hash value or the value itself in this slot. In addition to this you could also store a linked/chained list of values that have collided on this slot, or you could use the open addressing method. You can also store a pointer or pointers to other data you want to retrieve out of this slot.

It's important to note that the hash value itself generally does not indicate the slot into which to place the value. For example, a hash value might be a negative integer, and obviously a negative number cannot point to an array location. Additionally, hash values will often be larger numbers than the slots available. Thus another calculation needs to be performed by the hashtable itself to figure out which slot the value should go into. This is done with a modulus operation like:

int slotIndex = (hashValue & 0x7FFFFFFF) % hashTableSize; // strip the sign bit so negative hash values still map into the array

This value is the slot the value will go into. In open addressing, if the slot is already filled with another hash value and/or other data, the index is advanced to probe the next slot (linear probing):

slotIndex = (slotIndex + 1) % hashTableSize; // wrap around at the end of the array

I suppose there may be other more advanced methods for determining slot index, but this is the common one I've seen... would be interested in any others that perform better.

With the modulus method, if you have a table of, say, size 1000, any hash value between 0 and 999 maps directly to the corresponding slot. Negative values, and values of 1000 or greater, wrap around and can potentially collide with slots already in use. The chances of that happening depend both on your hashing method and on how many total items you add to the hash table. Generally, it's best practice to size the hashtable so that the total number of values added to it ends up at only about 70% of its size. If your hash function does a good job of even distribution, you will generally encounter very few to no bucket/slot collisions, and the table will perform very quickly for both lookup and write operations. If the total number of values to add is not known in advance, make a good estimate by whatever means, and then resize your hashtable once the number of elements added reaches 70% of capacity.
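A rough Java sketch of that resize-at-70% rule with linear probing (names invented; a real table would also track keys, deletions, and so on):

class ResizableTable {
    private Object[] slots = new Object[16];
    private int count = 0;

    private int slotFor(int hashValue, int tableSize) {
        return (hashValue & 0x7FFFFFFF) % tableSize; // strip the sign bit
    }

    void put(Object value) {
        if (count + 1 > slots.length * 0.7) resize(); // keep load under ~70%
        int i = slotFor(value.hashCode(), slots.length);
        while (slots[i] != null) {
            i = (i + 1) % slots.length;               // linear probing
        }
        slots[i] = value;
        count++;
    }

    private void resize() {
        Object[] old = slots;
        slots = new Object[old.length * 2];
        count = 0;
        for (Object value : old) {
            // every index depends on the table size via the modulus,
            // so each surviving entry must be rehashed into the new array
            if (value != null) put(value);
        }
    }
}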

I hope this has helped.

PS - In C#, the GetHashCode() method is pretty slow and results in actual value collisions under a lot of the conditions I've tested. For some real fun, build your own hash function and try to get it to NEVER collide on the specific data you are hashing, to run faster than GetHashCode, and to have a fairly even distribution. I've done this using long rather than int-sized hash code values, and it worked quite well on up to 32 million hash values in the hashtable, with 0 collisions. Unfortunately I can't share the code as it belongs to my employer... but I can reveal that it is possible for certain data domains. When you can achieve this, the hashtable is VERY fast. :)

Chris