views: 59
answers: 3
I understand that since allocation happens at run-time, there must be some house-keeping operations involved. But other than that, what is the overhead? Also, would it be wise to create a hashtable vs. an array when you need to store the number of times an integer element appears in an infinite stream of numbers?

+3  A: 

Theoretically, it depends on how many unique numbers there are in the stream. But in any real-life scenario I can imagine, an array would be ridiculously slower. The more unique numbers you process, the slower the array solution becomes.

A HashTable generally maintains the same access speed no matter how large it becomes. For an 'infinite stream', I can't imagine how a HashTable wouldn't be the better solution. How do you intend to search the array?

Seun Osewa
+1, and given an infinite stream of numbers, the program will be adding entries to the hashtable till the end of the world.
TheMachineCharmer
You're assuming that the 'infinite' stream has a large set of unique numbers. This is not necessarily true; consider frequencies of letters in Gutenberg texts, or frequencies of readings from a 10-bit A/D converter.
Brooks Moses
Okay, what if there is an equal distribution of repeated and unique values? You could have something like {1,2,3,10000,5594594,10000,559459410000,5594594}; obviously it is impossible to allocate such a large array. Does it then make sense to just use a hashtable, or would some other data structure be better here?
Phoenix
Phoenix: A hashtable is just fine here. One thing you can do is preallocate it if you have an educated guess at the count of unique numbers. Say you know that you'll have *at most* 1000 unique items: by preallocating the hash to this size, you will avoid rehashing as long as you have fewer than 1001 unique numbers. On the other hand, if you don't know the upper bound but you know that you'll have *at least* 1,000,000 unique numbers, why not preallocate to, say, 2,000,000? That way, you are good to go without rehashing for up to at least 2,000,000 unique numbers.
andras
...and you *most certainly* avoid rehashing for the first 1,000,000 insertions. ;-) (That much, you know for sure, will happen.)
andras
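
In C++, for example, the preallocation andras describes might look something like this (a minimal sketch assuming std::unordered_map; the figure of 2,000,000 is the guessed upper bound from the comment above, not a measured value):

#include <unordered_map>

int main() {
  std::unordered_map<int, int> counts;

  // reserve(n) sizes the bucket table so that at least n elements can be
  // inserted without triggering a rehash. 2,000,000 is the hypothetical
  // upper bound from the comment above.
  counts.reserve(2000000);

  // Counting now proceeds without any rehashing until the number of
  // unique keys exceeds the reserved capacity.
  counts[42]++;
}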
+3  A: 

As Neil's comment implies, the overhead depends heavily on the particular hash table implementation. Typically, though, there will be storage overhead from unused buckets, both storage and time overhead from dealing with hash collisions, and of course time overhead from computing the hash values themselves.
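
To make those costs concrete, here is a rough, non-production sketch of a separate-chaining table (C++ is assumed, to match the benchmark later in the thread); each overhead named above shows up as a comment:

#include <cstddef>
#include <functional>
#include <vector>

// Toy separate-chaining hash table for int -> int counts. It deliberately
// omits deletion and cleanup; it exists only to show where the overheads live.
struct ToyHashTable {
  struct Node { int key; int value; Node* next; };  // per-entry pointer: collision-handling storage overhead
  std::vector<Node*> buckets;                       // empty buckets are pure storage overhead

  explicit ToyHashTable(std::size_t n) : buckets(n, nullptr) {}

  int& operator[](int key) {
    // Time overhead 1: computing the hash and reducing it to a bucket index.
    std::size_t b = std::hash<int>{}(key) % buckets.size();

    // Time overhead 2: walking the collision chain within that bucket.
    for (Node* p = buckets[b]; p != nullptr; p = p->next)
      if (p->key == key) return p->value;

    // Not found: allocate a node at run time -- the "house-keeping" the question mentions.
    buckets[b] = new Node{key, 0, buckets[b]};
    return buckets[b]->value;
  }
};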

In answer to your second question, this depends very much on the details of your stream of numbers and on the other aspects of your program. Some questions to consider (a small sketch contrasting the two approaches follows this list):

  • Is the set of possible numbers large or small? (How big an array would you need to create?)

  • Out of the range of possible numbers, do you expect most of them to be used, or only a few? If you expect most of the possible numbers in the range to be used, then using a hash table will not save you much space.

  • Do you know the range of possible numbers before you start? Or is this unknown? Hash tables can deal with unknown ranges much more easily.

  • How important is saving storage space in this program? Can you easily afford to allocate an array of the necessary size? If you can easily afford to allocate the array, why bother with a hash table?

  • How important is runtime speed in this program? Arrays will usually be faster.
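
A rough sketch of the two extremes, assuming integer values as in the question (the 10-bit A/D example is borrowed from Brooks Moses's earlier comment):

#include <array>
#include <unordered_map>

// Small, known range (e.g. 10-bit A/D readings, 0..1023): a plain array
// is the simplest and fastest counter -- one slot per possible value.
std::array<int, 1024> adc_counts{};          // zero-initialized

void count_adc(int reading) {
  adc_counts[reading]++;                     // direct index: no hashing, no collisions
}

// Unknown or sparse range: a hash table stores only the values actually
// seen, at the cost of hashing (and possible collisions) on every access.
std::unordered_map<long long, int> sparse_counts;

void count_sparse(long long value) {
  sparse_counts[value]++;                    // creates the entry at 0 on first sight, then increments
}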

Brooks Moses
I do not know the range of numbers. Imagine that I have 10 web pages in my application (for the time being; I could add more tomorrow) and I have to maintain the number of times each page has been viewed. Would you say a hashtable is the best data structure in this situation, where I could keep a page counter against the URL of the page?
Phoenix
Yes, that's definitely a situation where a hashtable would be better than an array.
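
For concreteness, a minimal sketch of such a page-view counter keyed by URL (the names page_views and record_view are made up for illustration):

#include <string>
#include <unordered_map>

// Page-view counts keyed by URL. New pages can be added at any time;
// no range or page count needs to be known up front.
std::unordered_map<std::string, long> page_views;

void record_view(const std::string& url) {
  page_views[url]++;   // inserts the URL with a count of 0 on first view, then increments
}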
Brooks Moses
+1  A: 

Hashtables are pretty fast. Just as an experiment, I see about a 50x slowdown between a raw array and a C++ hash_map (compile with the #if toggled both ways and try it yourself).

#include <ext/hash_map>   // GCC's pre-C++11 hash map extension
using namespace __gnu_cxx;

int main() {
#if 0
  // Hash-map version: pre-insert the 256 keys so the loop below only measures lookups.
  hash_map<int,int> table;
  for (int i = 0; i < 256; i++) table[i] = 0;
#else
  // Raw-array version: zero-initialize so the counts start from 0, not garbage.
  int table[256] = {0};
#endif

  // Bump one of the 256 counters 100 million times.
  for (int i = 0; i < 100000000; i++) {
    table[i & 0xff]++;
  }
}
Keith Randall
This is so micro. Out of curiosity, do you have a comparison of what happens when the table size exceeds your processor's cache size?
andras
Yes, it is super micro. I'm not claiming a definitive result here, just getting a feel for the magnitude. As for table size > cache size, try it yourself!
Keith Randall