Hi,

Although I understand very well what a hash code is and what a hash table does, I have to admit I don't know how to use one (beyond a common dictionary). I want to implement my own hash table, so first I want to know the very basics of hashing:

  • I know I can get the hash code with hashCode() in Java and Scala. How is this number determined? (Just out of curiosity)
  • If I know the HashCode of an object, how can I access it? That is, how can I call that memory bucket?
  • Can I change/set a variable's hash code?

Now, I have a very big (about 10^9) list of Int. I am going to access some of them (from none to all) and I need to do it in the fastest way possible. Is a hash table THE BEST way to do it?

PS: I don't want to discuss it, I just want to know if the hash table is known to be the most efficient. If other good methods exist, maybe you can point me to them.

Thanks,

+6  A: 

The hash code is just a number that is guaranteed to be the same for every object that is "the same" as (equal to) the original object.

This means that returning "0" for every hash code call would be valid, but self-defeating. The point is there can (and in most cases will) be duplicates.
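To make that rule concrete, here is a minimal sketch of a class that defines its own hash code. Point is a made-up class for illustration, and Objects.hash is just one convenient way to combine fields, not the only one:

```java
import java.util.Objects;

// Hypothetical value class, used only for illustration.
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Point)) return false;
        Point p = (Point) other;
        return x == p.x && y == p.y;
    }

    // Equal points must return equal hash codes; unequal points may still collide.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

Two separately created points with the same coordinates are equal here, so they return the same hash code. Two different points may or may not share one.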

If you know the hash code of an object, you cannot necessarily access it. Per my example above, if all objects returned "0", you still couldn't ask which object has hash code 0. However, you could ask for ALL objects with hash code 0 and look through them. This is what a hash table does: it reduces the amount of iterating by fetching just the objects with the same hash code, then looks through those.
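If you want to roll your own, the bucket idea can be sketched roughly like this. TinyHashSet is a made-up name, it only stores strings, and it never resizes, so treat it as an illustration of the lookup step rather than a real implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal hash set of strings, for illustration only.
final class TinyHashSet {
    private final List<List<String>> buckets = new ArrayList<>();

    TinyHashSet(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Map a hash code to a bucket index; floorMod keeps negative hash codes in range.
    private List<String> bucketFor(String value) {
        return buckets.get(Math.floorMod(value.hashCode(), buckets.size()));
    }

    void add(String value) {
        List<String> bucket = bucketFor(value);
        if (!bucket.contains(value)) {
            bucket.add(value);
        }
    }

    // Only one bucket is scanned, and equals() settles any collisions inside it.
    boolean contains(String value) {
        return bucketFor(value).contains(value);
    }
}
```

With a single bucket this degrades to a plain list scan, which is the "everything returns 0" case described above.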

You cannot set (change) a hash code from outside. If you could, it would no longer be a hash code, because the value returned for an object in a given "state" must not change.

As for the "Best Way" to do it, the fewer distinct objects that share a hash code, the better your hash tables will perform. If you have a long list of "int", you can just use the int value as its hash code and you will have that rare perfect hash, where each object maps to exactly one hash code.

Note that hashtable isn't really appropriate for this situation of storing ints. It's better for situations where you are trying to store complex objects that are not so easy to uniquely identify or compare using other mechanisms.

The problem with your "List of Int" is that if you have the number 5 and you want to look it up in your table, you are just going to find a number 5 there.

Now, if you want to see if the number 5 exists in your table or not, that's a different matter.

For a set of numbers with few holes you could make a simple boolean array. If a[5] is true, then 5 is in the list. If your set of numbers is very sparse (1, 5, 10002930304), that wouldn't be a very good solution, since you'd store "False" in spots 2, 3 and 4 and then in a whole bunch of spots before the last number. It is a direct lookup, though: a single step that never takes any longer no matter how many numbers you add, so O(1).

You could make this type of storage MUCH denser by making it a bit lookup into a byte array (one bit per number instead of one boolean), but unless you're pretty good with bit manipulation, skip it. It would involve stuff that looks like this:

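public boolean doesNumberExist(int number) {
    // Byte number / 8 holds the flag for this number in bit number % 8 (number must be non-negative).
    return (bytes[number / 8] & (1 << (number % 8))) != 0;
}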

and this still runs out of memory if your highest number is really big.
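If you'd rather not do the bit twiddling by hand, java.util.BitSet does the same packing for you. A minimal sketch, assuming all your numbers are non-negative (BitSetMembership and its method names are made up for this example):

```java
import java.util.BitSet;

// Hypothetical helper, for illustration only.
final class BitSetMembership {
    // Sets one bit per value; BitSet.set rejects negative indexes.
    static BitSet fromValues(int[] values) {
        BitSet present = new BitSet();
        for (int v : values) {
            present.set(v);
        }
        return present;
    }

    // A negative number can never be in the set, so guard before calling get.
    static boolean contains(BitSet present, int number) {
        return number >= 0 && present.get(number);
    }
}
```

The memory concern is the same as with the byte array: the set grows with the highest number stored, not with how many numbers you store.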

So, for a large sparse list I'd use a sorted integer array instead of a lightly populated boolean array. Once it's sorted you just do a binary search: start in the middle of the sorted array. If the number you want is higher, check the middle of the upper half; if it is lower, check the middle of the lower half. Repeat until you find the number or run out of elements.

The sorted int array takes a few more steps but not too many more and it doesn't waste any memory for non-existent numbers.
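In Java you don't have to write the search yourself, since java.util.Arrays already has sort and binarySearch for int[]. A small sketch (SortedLookup and its method names are made up for illustration):

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class SortedLookup {
    // Returns a sorted copy so the caller's array is left untouched.
    static int[] prepare(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    // Arrays.binarySearch returns a non-negative index only when the key is found.
    static boolean contains(int[] sorted, int number) {
        return Arrays.binarySearch(sorted, number) >= 0;
    }
}
```

Sorting is a one-time cost. After that each lookup takes on the order of log2(n) steps, which is about 30 comparisons for 10^9 numbers.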

Bill K
A: 

A hashing function returns an integer. You use that integer (key) as an index to store your information. In Java, you can use java.util.Hashtable. You can always roll your own; it can be as simple as an array that uses the key as the index.

For your program, you really need to figure out how you need to access the elements. A hashtable offers super fast access to a specific item, but doesn't (shouldn't) offer sequential access.

If you're using Java, check out Hashtable and see if its methods are sufficient for your application:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/Hashtable.html

prelic
A: 

The big list of Int works as a look up table that I access through index. Then I guess the index would be the key and the list elements are values. Hope that clarifies it

In that case, a java.util.Hashtable is not better than a java.util.ArrayList. A Hashtable would consume at least twice the memory, while offering slightly slower access.

Even better than the ArrayList is a plain int[], as no Integer instances need to be created and stored. I estimate this will reduce memory consumption by a factor of 3.

However, keeping 10^9 int in memory remains a daunting proposition, as each int consumes 4 bytes of memory. That's 4 GB. You might wish to keep at least part of the list stored on disk rather than memory, and use, for instance, RandomAccessFile to seek to the index being looked up.
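A rough sketch of the disk-based lookup, assuming the file contains nothing but consecutive 4-byte ints in the big-endian format that writeInt produces (IntFileLookup is a made-up name, and the file layout is my assumption, not something from the question):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, for illustration only.
final class IntFileLookup {
    // Reads the int stored at the given index, assuming the file holds
    // consecutive 4-byte big-endian ints (the format writeInt produces).
    static int readAt(RandomAccessFile file, long index) throws IOException {
        file.seek(index * Integer.BYTES);
        return file.readInt();
    }
}
```

Each lookup then costs a seek plus a 4-byte read, so whether this is fast enough depends on your disk and on how much of the file the operating system manages to cache.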

meriton