tags:

views:

1512

answers:

11

I'm quite confused about the basic concepts of a Hash table. If I were to code a hash how would I even begin? What is the difference between a Hash table and just a normal array?

Basically if someone answered this question I think all my questions would be answered: If I had 100 randomly generated numbers (as keys), how would I implement a hash table and why would that be advantageous over an array?

Psuedo-code or Java would be appreciated as a learning tool...

+10  A: 

Hashtables are associative. This is a huge difference from arrays, which are just linear data structures. With an array, you might do something like this:

int[] arr = ...
for (int i = 0; i < arr.length; i++) {
    System.out.println(arr[i] + 1);
}

Notice how you are getting an element out of the array by specifying an exact memory offset (i). This contrasts with hashtables, which allow you to store key/value pairs, later retrieving the value based on the key:

Hashtable<String, Integer> table = new Hashtable<String, Integer>();
table.put("Daniel", 20);
table.put("Chris", 18);
table.put("Joseph", 16);

With the above table, we can make the following call:

int n = table.get("Chris");

...and be assured that n will be valued at 18.

I think this will probably answer most of your questions. The implementation of a hashtable is a fairly interesting topic, one which Wikipedia addresses passably well.

Daniel Spiewak
Okay, but in actual implementation, doesn't the table.get("Chris") still have to traverse the table to find Chris? How does it know Chris is at a "key" value? When it hashes, what is the actually happening to "Chris"?
kylex
Good question. Will address in a separate answer... If you're impatient, try checking out the Wikipedia article.
Daniel Spiewak
@me.yahoo.com: see my comment below for this (couldn't write here because of size limitation)
askgelal
NO. A hash table does not ever traverse. It computes a hash of "Chris" and that's the physical slot in the hash table that will have "Chris" as the key. The hash is a computation on the byte values (see MD5 algorithm for details.)
S.Lott
+2  A: 

You wouldn't want to use a hash table for 100 randomly generated numbers.

A good way to think about hash tables is to think about value pairs. Let's use students, and say everyone has a student ID number. In your program you store information on students (names, phone numbers, bills, etc). You want to find all of the information about a student using only basic information (name or student ID, for example).

Let's say you have 10,000 students. If you store them all in an array, then you have to loop through the entire array comparing each entry's student ID with the one you are looking for.

If, instead, you "hash" (see below) their student ID number to a position in the array, then you only have to search student's who's numbers have the same hash. Much less work to find what you wanted.

In this example, let's say student IDs are just 6 digit numbers. Our hash function could be use only the bottom 3 digits of the number as the "hash key". Thus 232145 is hashed to array location 145. So then you only need an array of 999 element (each element being a list of students).

That should be a good start for you. You should, of course, read a text book or wikipedia for this kind of info. But I assume you've already done that and are tired of reading.

SoapBox
Why not hash the whole student ID ?
Frederic Morin
Cause then it wouldn't really be a hash, it'd just be the student ID. You could use it as an array index at that point. I think "hash" the ID makes for a better example.
SoapBox
I am a beginner programmer, and I was reading your answer. What occurred to me was: when you are in a room of students, say 100 students, and you want to address one of them, you say his or her name. If you say Chris, not every student will stand up. Kris, Chris, Chris, and Christine (who goes by Chris sometimes) will stand up. This is because the keys |Kris|, |Chris|, and |Christine| all hash to the sound /Chris/! But if instead you said, "kay ar eye ess", only Kris would stand up. What does this all MEAN!? I don't understand what my own analogy implies about hashes and arrays...
Ziggy
A: 

I'll answer that part about the difference between a hash table and an array... but since I've never implemented a hashing algorithm of any import before, I'll leave that to somebody more knowledgeable :)

An array is just an ordered list of objects. The object itself doesn't really matter... what's important is that if you want to list the objects in order of insertion, it is always the same (meaning that the first element always has an index of 0).

As for a hashtable, that's indexed by keys, not order... I think that a basic search on hashing algorithms will give you a lot more insight than I can... Wikipedia has a very decent one... that determines "bucket" that the keys go into for quick retrieval on arbitrary objects used as keys.

As for advantages: If order of insertion is important, an array or some kind of ordered list is necessary. If fast look-up by arbitrary key (keyed by various hash functions) is important, then a hash table makes sense.

Jason Coco
Your answer is good, but has some factual holes. Arrays are actually random-access (you can insert arr[10] before you insert arr[0]). They are ordered in memory (as you said), but the insertion order is irrelevant. (I think you were thinking of a linked list)
Daniel Spiewak
To continue, not all associative tables use hashing. A binary-search tree is a very simple associative structure (key/value lookups) which actually maintains things in a perfectly sorted order.
Daniel Spiewak
It's interesting to note that hashtables are always implemented using arrays under the surface, further underscoring the fact that arrays do not enforce insert/access order.
Daniel Spiewak
@daniel Actually, no I wasn't... just not articulating well :) I wasn't talking about insertion order, just that retrieval order is known while in a hashtable order is not as important as arbitrary key retrieval... Thanks for clarifying for others tho!!
Jason Coco
A: 

[This is the reply to a comment made by me.yahoo.com/a above]

That depends on your hash function. Lets suppose that your hash function hashes a word as per the length of your word, the key for chris will be 5. Similarly, key for yahoo will also be 5. Now, both values (chris and yahoo) will go under 5 (i.e. in a 'bucket' keyed by 5). This way you don't have to make an array equal to the size of your data.

askgelal
+4  A: 

"I'm more interested in the way Hash Tables look up the key and how the key is generated."

  1. Hashing transforms a key object to a number. This is called "hashing" -- it makes a hash out of the object. See Hash Function. Summing the bytes of a string, for example, is a standard hash technique. You compute the sum modulo 232 to keep the hash to a manageable size. Hash always gives the same answer. This is O(1).

  2. The number gives you a "slot" in the HashTable. Given an arbitrary key object, the hash value computes a hash value. The hash value then gives you the slot in table. Usually mod( hash, table size ). This is O(1), also.

That's the general solution. Two numeric calculations and you've gone from arbitrary object as key to arbitrary object as value. Few things can be as fast.

The transformation from object to hash value happens in one of these common ways.

  1. If it's a "primitive" object of 4 bytes, then the object's native value is a number.

  2. The object's address is 4 bytes, then the object's address can be used as a hash value.

  3. A simple hash function (MD5, SHA1, whatever) accumulates the bytes of the object to create a 4-byte number. The advanced hashes aren't simple sums of bytes, a simple sum doesn't reflect all the original input bits fairly enough.

The slot in the hash table is mod( number, size of table ).

If that slot has the desired value, you're done. If that's not the desired value, you need to look somewhere else. There are several popular probing algorithms to look for a free spot in the table. Linear is a simple search for the next free spot. Quadratic is a non-linear hopping around looking for a free slot. A random number generator (with a fixed seed) can be used to generate a series of probes that will spread data evenly but arbitrarily.

The probing algorithms are not O(1). If the table's big enough, the odds of collision are low, and probes don't matter. If the table's too small, then collisions happen and probing happens. At that point, it becomes a matter of "tuning and tweaking" to balance probing and table size to optimize performance. Usually we just make the table bigger.

See Hash Table.

S.Lott
Thanks, all your answers are helping me quite a bit. But every answer leads to more questions. How does probing work? Linear probing seems straight-forward enough. Just go down the table until there's an open slot right? But how about Quadratic probing? How does that work and whyo or is that better?
kylex
Quadratic: "the interval between probes increases proportional to the hash value". Why is it better? Empirical data proves that it works better than linear. There's no more "why" than that.
S.Lott
+15  A: 

First, you have to understand a what a hash function is. A hash function is a function that takes a key (for example, an string of arbritrary length) and returns a number as unique as possible. The same key must always return the same hash. A really simple string hashing function in java might look like

public int stringHash(String s) {
    int h = s.length();
    for(char c : s.toCharArray()) {
        h ^= c;
    }
    return h;
}

You can study a good hash function at http://www.azillionmonkeys.com/qed/hash.html

Now, the hash map uses this hash value to place the value into an array. Simplistic java method:

public void put(String key, Object val) {
    int hash = stringHash(s) % array.length;
    if(array[hash] == null) {
        array[hash] = new LinkedList<Entry<String, Object> >();
    }
    for(Entry e : array[hash]) {
        if(e.key.equals(key)){
            e.value = val;
            return;
        }
    }
    array[hash].add(new Entry<String, Object>(key, val));
}

(This map enforces unique keys. Not all maps do.)

It is possible for two different keys to hash to the same value, or two different hashes to map to the same array index. There exists many techniques for dealing with this. The simplest is to use a linked list (or binary tree) for each array index. If the hash function is good enough, you will never need a linear search.

Now to look up a key:

public Object get(String key) {
    int hash = stringHash(key) % array.length;
    if(array[hash] != null) {
        for(Entry e : array[hash]) {
            if(e.key.equals(key))
                return e.value;
        }
    }

    return null;
}
gnud
Excellent! I wish I could vote you up multiple times. This is what I had been planning to write (in response to questions on my first answer) but didn't get the chance.
Daniel Spiewak
This is totally cool post, and good example for beginners.
mtasic
+2  A: 

Here is, in short, how a hash table works.

Imagine you have a library, full of books. If you were to store the books in an array, you would put each book on a spot on a shelf, and then when someone asked you to find a book, you'd look through all the shelves -- pretty slow. If someone said "book #12345", you could find it pretty easily, though.

Let's say instead you say, if the book title starts with 'A', it goes in row 1. If the second letter is 'B', it goes in row 1, rack 2. If the third letter is 'C', it goes in row 1, rack 2, shelf 3... and so on until you identify the book position. Then, based on the title of the book, you could know exactly where it should be.

Now, there are some problems in the simplistic "hashing" algorithm I described -- some shelves are going to be way overloaded while others stand empty, some books will be assigned to the same slot.. so the real hash functions are carefully constructed to try to avoid such problems.

But this is the basic idea.

SquareCog
+3  A: 

Something I didn't see specifically noted yet:

The point of using a hash table over an array is performance.

Iterating through an array would typically take anywhere from O(1) to O(x) where x is the number of items in the array. However the time to find your item will be extremely variable, expecially if we are talking about hundreds of thousands of items in the array.

A properly weighted hash table typically has an almost constant access time of just over O(1), no matter how many items are in the hash table.

rally25rs
Why is searching the hash faster? Don't we still have to traverse the list if looking for a value? Isn't the O(1) only if we know the key to begin with? In that case, if we know the key in an array, wouldn't the order also be O(1)?
kylex
A hash ALWAYS has a key, you trivially compute the slot and the lookup involves no actual search. The difference is a hash can use ANYTHING as a key. An array can only use an integer.
S.Lott
So basically if I do a search for the color "green" and "green" is a key that has been hashed to a number, say, 14... there is no search needed because green will always hash to 14?
kylex
yes, a string, for example "green" will ALWAYS hash to the same value. Also, Java will cache that value, so it only runs the hash algorithm once to generate it. That hash value is used to get a "bucket" which is basically an array. then it iteratively scans that. Ideally there is 1 item per bucket.
rally25rs
@rally25rs [citation needed] :-) Look at the source for hashCode() in most classes. It is generally computed on the fly and not memoized (to avoid thread issues). The Object#hashCode() implementation is also not cached, but it is a constant due to the way the memory model works.
Daniel Spiewak
rally -- I was with you until you mentioned caching.. how do you think caching works? To cache key-value pair, you... hash the key :-)
SquareCog
Java can cache string values to simplify things a little. It's irrelevant to hashing. It just happens to be true for Java.
S.Lott
I was under the impression that Java would store the hash key inside the class instance. I could be wrong though, just thought I read that somewhere.
rally25rs
This is what I was reffering to. guess its string specific (from Java docs): "Since JDK version 1.3, the class java.lang.String caches its hash code, i.e. it calculates the hash code only once and stores it in an instance variable and returns this value whenever the hashCode method is called.
rally25rs
+23  A: 

The answers so far have helped to define hash tables and explain some theory, but I think an example may help you get a better feeling for them.

What is the difference between a hash table and just a normal array?

A hash table and an array are both structures that allow you to store and retrieve data. Both allow you to specify an index and retrieve a value associated with it. The difference, as Daniel Spiewak noted, is that the indices of an array are sequential, while those of a hash table are based on the value of the data associated with them.

Why would I use a hash table?

A hash table can provide a very efficient way to search for items in large amounts of data, particularly data that is not otherwise easily searchable. ("Large" here means ginormous, in the sense that it would take a long time to perform a sequential search).

If I were to code a hash how would I even begin?

No problem. The simplest way is to invent an arbitrary mathematical operation that you can perform on the data, that returns a number N (usually an integer). Then use that number as the index into an array of "buckets" and store your data in bucket #N. The trick is in selecting an operation that tends to place values in different buckets in a way that makes it easy for your to find them later.

Example: A large mall keeps a database of its patrons' cars and parking locations, to help shoppers remember where they parked. The database stores make, color, license plate, and parking location. On leaving the store a shopper finds his car by entering the its make and color. The database returns a (relatively short) list of license plates and parking spaces. A quick scan locates the shopper's car.

You could implement this with an SQL query:

SELECT license, location FROM cars WHERE make="$(make)" AND color="$(color)"

If the data were stored in an array, which is essentially just a list, you can imagine implementing the query by scanning an array for all matching entries.

On the other hand, imagine a hash rule:

Add the ASCII character codes of all the letters in the make and color, divide by 100, and use the remainder as the hash value.

This rule will convert each item to a number between 0 and 99, essentially sorting the data into 100 buckets. Each time a customer needs to locate a car, you can hash the make and color to find the one bucket out of 100 that contains the information. You've immediately reduced the search by a factor of 100!

Now scale the example to huge amounts of data, say a database with millions of entries that is searched based on tens of criteria. A "good" hash function will distribute the data into buckets in a way that minimizes any additional searching, saving a significant amount of time.

Adam Liss
+1  A: 

Hello,

I'm new in C # and i have this exercise that i can't make Create a table that consists of 10 places. Insert the following table entries A1, I9, R18, S19, E5, O15, and Y25, where the integers are subscripts which represent the positions of these letters in the alphabet into the table by using the hashing function H(Ln) = n%10 And probe function P(Ln) = max (1, n/10).

Can anyone help me please?

Chris
A: 

The question, I believe, is answered quite clearly and in many different ways by now.

I would just like to add another perspective (which may confuse a new reader as well)

At a level of least abstraction, arrays are just contiguous block of memory. Given the starting address (startAddress), size (sizeOfElement) and the index of a single element, the address of element is computed as:

elementAddress = startAddress + sizeOfElement * index

The interesting thing to note here is that arrays can be abstracted/viewed as hash tables with index as key and the above function as a hash function which calculates the location of a value in O(1)

Omer Akhter