views: 59
answers: 3
I understand that since allocation happens at run-time, there must be some house-keeping operations involved. But other than that, what is the overhead? Also, would it be wise to create a hashtable vs. an array when you need to store the number of times an integer element appears in an infinite stream of numbers?

+3  A: 

Theoretically, it depends on how many unique numbers there are in the stream. But in any real-life scenario I can imagine, an array would be ridiculously slower. The more unique numbers you process, the slower the array solution becomes.

A HashTable generally maintains the same access speed no matter how large it becomes. For an 'infinite stream', I can't imagine how a HashTable wouldn't be the better solution. How do you intend to search the array?

Seun Osewa
+1, and given an infinite stream of numbers, the program will be adding entries to the hashtable till the end of the world.
TheMachineCharmer
You're assuming that the 'infinite' stream has a large set of unique numbers. This is not necessarily true; consider frequencies of letters in Gutenberg texts, or frequencies of readings from a 10-bit A/D converter.
Brooks Moses
Okay, what if there is an equal distribution of repeated and unique values? You could have something like {1,2,3,10000,5594594,10000,559459410000,5594594}; obviously it is impossible to allocate such a large array. Does it then make sense to just use a hashtable, or would some other data structure be better here?
Phoenix
Phoenix: A hashtable is just fine here. One thing you can do is preallocate it if you have an educated guess at the count of unique numbers. Say you know that you'll have *at most* 1000 unique items: by preallocating the hash to this size, you will avoid rehashing as long as you have fewer than 1001 unique numbers. On the other hand, if you don't know the upper bound but you know that you'll have *at least* 1,000,000 unique numbers, why not preallocate to, say, 2,000,000? That way, you are good to go without rehashing for up to at least 2,000,000 unique numbers.
andras
...and you *most certainly* avoid rehashing for the first 1,000,000 insertions. ;-) (That much, you know for sure, will happen.)
andras
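
In C++, for example, the preallocation andras describes might look something like this (a minimal sketch assuming std::unordered_map; the figure of 2,000,000 is the guessed upper bound from the comment above, not a measured value):

#include <unordered_map>

int main() {
  std::unordered_map<int, int> counts;

  // reserve(n) sizes the bucket table so that at least n elements can be
  // inserted without triggering a rehash. 2,000,000 is the hypothetical
  // upper bound from the comment above.
  counts.reserve(2000000);

  // Counting now proceeds without any rehashing until the number of
  // unique keys exceeds the reserved capacity.
  counts[42]++;
}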
+3  A: 

As Neil's comment implies, the overhead depends heavily on the particular hash table implementation. Typically, though, there will be storage overhead from unused buckets, both storage and time overhead from dealing with hash collisions, and of course time overhead from computing the hash values themselves.
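
To make those costs concrete, here is a rough, non-production sketch of a separate-chaining table (C++ is assumed, to match the benchmark later in the thread); each overhead named above shows up as a comment:

#include <cstddef>
#include <functional>
#include <vector>

// Toy separate-chaining hash table for int -> int counts. It deliberately
// omits deletion and cleanup; it exists only to show where the overheads live.
struct ToyHashTable {
  struct Node { int key; int value; Node* next; };  // per-entry pointer: collision-handling storage overhead
  std::vector<Node*> buckets;                       // empty buckets are pure storage overhead

  explicit ToyHashTable(std::size_t n) : buckets(n, nullptr) {}

  int& operator[](int key) {
    // Time overhead 1: computing the hash and reducing it to a bucket index.
    std::size_t b = std::hash<int>{}(key) % buckets.size();

    // Time overhead 2: walking the collision chain within that bucket.
    for (Node* p = buckets[b]; p != nullptr; p = p->next)
      if (p->key == key) return p->value;

    // Not found: allocate a node at run time -- the "house-keeping" the question mentions.
    buckets[b] = new Node{key, 0, buckets[b]};
    return buckets[b]->value;
  }
};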

In answer to your second question, this depends very much on the details of your stream of numbers and on the other aspects of your program. Some questions to consider (a small sketch contrasting the two approaches follows this list):

  • Is the set of possible numbers large or small? (How big an array would you need to create?)

  • Out of the range of possible numbers, do you expect most of them to be used, or only a few? If you expect most of the possible numbers in the range to be used, then using a hash table will not save you much space.

  • Do you know the range of possible numbers before you start? Or is this unknown? Hash tables can deal with unknown ranges much more easily.

  • How important is saving storage space in this program? Can you easily afford to allocate an array of the necessary size? If you can easily afford to allocate the array, why bother with a hash table?

  • How important is runtime speed in this program? Arrays will usually be faster.
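
A rough sketch of the two extremes, assuming integer values as in the question (the 10-bit A/D example is borrowed from Brooks Moses's earlier comment):

#include <array>
#include <unordered_map>

// Small, known range (e.g. 10-bit A/D readings, 0..1023): a plain array
// is the simplest and fastest counter -- one slot per possible value.
std::array<int, 1024> adc_counts{};          // zero-initialized

void count_adc(int reading) {
  adc_counts[reading]++;                     // direct index: no hashing, no collisions
}

// Unknown or sparse range: a hash table stores only the values actually
// seen, at the cost of hashing (and possible collisions) on every access.
std::unordered_map<long long, int> sparse_counts;

void count_sparse(long long value) {
  sparse_counts[value]++;                    // creates the entry at 0 on first sight, then increments
}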

Brooks Moses
I do not know the range of numbers. Imagine that I have 10 web pages in my application (for the time being; I could add more tomorrow) and I have to maintain the number of times each page has been viewed. Would you say a hashtable is the best data structure in this situation, where I could keep a page counter against the URL of the page?
Phoenix
Yes, that's definitely a situation where a hashtable would be better than an array.
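
For concreteness, a minimal sketch of such a page-view counter keyed by URL (the names page_views and record_view are made up for illustration):

#include <string>
#include <unordered_map>

// Page-view counts keyed by URL. New pages can be added at any time;
// no range or page count needs to be known up front.
std::unordered_map<std::string, long> page_views;

void record_view(const std::string& url) {
  page_views[url]++;   // inserts the URL with a count of 0 on first view, then increments
}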
Brooks Moses
+1  A: 

Hashtables are pretty fast. Just as an experiment, I see about a 50x slowdown between a raw array and a C++ hash_map (compile with the #if toggled both ways and try it yourself).

#include <ext/hash_map>   // GCC's pre-C++11 hash map extension
using namespace __gnu_cxx;

int main() {
#if 0
  // Hash-map version: pre-insert the 256 keys so the loop below only measures lookups.
  hash_map<int,int> table;
  for (int i = 0; i < 256; i++) table[i] = 0;
#else
  // Raw-array version: zero-initialize so the counts start from 0, not garbage.
  int table[256] = {0};
#endif

  // Bump one of the 256 counters 100 million times.
  for (int i = 0; i < 100000000; i++) {
    table[i & 0xff]++;
  }
}
Keith Randall
This is so micro. Out of curiosity, do you have a comparison of what happens when the table size exceeds your processor's cache size?
andras
Yes, it is super micro. I'm not claiming a definitive result here, just getting a feel for the magnitude. As for table size > cache size, try it yourself!
Keith Randall