views: 74

answers: 4

Usually (as in C++), the hash function returns any size_t value, so many different hash values are possible (2^32 for a 32-bit size_t).

That is why I always thought that when people talked about them being implemented as tables, that could not really be true in practice, because the table would be much too big (2^32 entries). Of course, my assumption that the table must be as big as the full range of hash values was wrong.

It seems that actual implementations are more sophisticated than I thought. What I always had in mind, which would be the naive implementation, is something like:

typedef list< pair<Key,Value> > Bucket;
typedef map< size_t, Bucket > Hashtable;

Now my question: how does this naive implementation differ from actual implementations in practice, in terms of complexity (runtime and memory)?

+2  A: 

Note that there are other ways to implement hash tables as Matthieu M points out. The remainder of this answer assumes that you want to use hashing with buckets made of some kind of list.

Assuming you are talking about time complexity.

Hash tables are expected to have O(1) best-case access. Your proposed implementation in the question uses a map<size_t, Bucket> to access the buckets, which results in O(log n) time complexity. You need something with O(1) access time, such as a vector<Bucket>, to comply with the expected time complexity of a hash table.
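
To make the difference concrete, here is a minimal sketch of the vector<Bucket> layout described above (assuming C++11's std::hash and string keys; all names are illustrative, not any particular library's implementation). Finding a bucket is a single modulo plus index operation:

#include <cstddef>
#include <functional>  // std::hash (C++11)
#include <list>
#include <string>
#include <utility>
#include <vector>

typedef std::list< std::pair<std::string, int> > Bucket;

struct NaiveHashTable {
    std::vector<Bucket> buckets;

    explicit NaiveHashTable(std::size_t bucket_count = 16)
        : buckets(bucket_count) {}

    // Reduce the full size_t hash value to a valid vector index: O(1).
    Bucket& bucket_for(const std::string& key) {
        return buckets[std::hash<std::string>()(key) % buckets.size()];
    }

    void insert(const std::string& key, int value) {
        bucket_for(key).push_back(std::make_pair(key, value));
    }

    int* find(const std::string& key) {
        Bucket& b = bucket_for(key);
        for (Bucket::iterator it = b.begin(); it != b.end(); ++it)
            if (it->first == key) return &it->second;
        return 0;  // not present
    }
};

Replacing the vector with a map<size_t, Bucket> would turn that single index operation into an O(log n) tree walk, which is the difference pointed out above.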

More details

Hash tables can vary between having excellent and poor time complexity, depending on how sparsely populated they are.

In the best case, each bucket has at most one entry, and access by key is O(1). This is the commonly quoted complexity for hash tables.

In the worst case, each key has the same hash value, and access by key is effectively searching a list which results in O(n) behaviour.
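
To see that worst case concretely, here is a minimal sketch using C++11's std::unordered_map with a deliberately degenerate hash function (the type and key names are illustrative): every key gets the same hash value, so all entries share one bucket and a lookup degrades to scanning a list.

#include <cstddef>
#include <string>
#include <unordered_map>

// Deliberately terrible hash: every key maps to the same value.
struct ConstantHash {
    std::size_t operator()(const std::string&) const { return 0; }
};

int main() {
    std::unordered_map<std::string, int, ConstantHash> worst_case;
    worst_case["a"] = 1;
    worst_case["b"] = 2;   // collides with "a" by construction
    worst_case.find("b");  // has to walk the single bucket's chain: O(n)
    return 0;
}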

Real-world use is usually somewhere between these extremes, hopefully closer towards O(1).

The accepted answer to your other question has some simplified code you can use to work through these two extremes to satisfy yourself that this is the case.

spong
@spong: Reading back, I agree I was a bit terse... typical software engineer issue, I think in 1s and 0s. The OP is asking about bucket implementations, but that's because he doesn't know about anything else. I agree that open addressing suffers from the same issue (well, in fact even more). I've seen hash tables with hash tables as buckets, though (the second being open-addressed), and with 2 independent hashing schemes you rarely have collisions.
Matthieu M.
Especially since it seems to be the latest trend on SO :/ But I don't downvote much, especially when people take the time to write down a real answer and not merely link away.
Matthieu M.
Thank you for the answer. The vector<Bucket> implementation keeps me wondering, though. So the access time might be close to O(1), but the insertion time will probably be much higher than in my proposal, because it must frequently resize that vector, i.e. an insertion takes O(n).
Albert
For this kind of implementation, you can't straightforwardly change the number of buckets (resize the `vector`) while there is data in the hash table. The reason behind this is that you've calculated something like `key.getHash() % num_buckets` for each key and with a given value of `num_buckets`. This gives you the bucket (`vector` index) to put the key/value pair into, and this value will be different if you change `num_buckets`.
spong
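
A tiny sketch of the effect described in the comment above (the hash value 12345 is made up; any hash function and key show the same thing): the bucket index depends on the current bucket count, so changing the count requires re-inserting every entry.

#include <cstddef>
#include <iostream>

int main() {
    std::size_t h = 12345;            // hypothetical hash value for some key
    std::cout << h % 8  << '\n';      // bucket 1 while there are 8 buckets
    std::cout << h % 16 << '\n';      // bucket 9 after growing to 16 buckets
    // The same key lands in a different bucket, so every stored entry must
    // be re-inserted (rehashed) whenever the bucket count changes.
    return 0;
}
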
A: 

The very issue with your question, Albert, is that there is not ONE hash table; there are many of them.

The heart of the problem here is the big-O complexity given for some operations. On average, a hash table should yield O(1) complexity to find an item. A binary tree yields O(log N) on average.

In terms of speed, it really depends on the size of N, because those are asymptotic complexities: they represent orders of magnitude when N is big (think millions), and the real speed may be very different for small collections.

So, instead of trying to elaborate more on your question, I think you should get a better grasp of hash tables. A quick overview:

  • Hash tables may or may not be implemented in terms of buckets: non-bucket implementations include open-addressing schemes (which are more cache-friendly, by the way); a minimal sketch follows this list.
  • Buckets may or may not be implemented in terms of a linked list. Other schemes include using another hash function (each bucket being a hash table itself) or a binary tree (map); the latter requires some ordering, though.
  • Reallocation may be done all at once (i.e. once you exceed the capacity, a new, bigger hash table is allocated and all the content is copied over), or via a linear reallocation scheme that smooths the reallocation cost and avoids a big hit from time to time.
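
A minimal sketch of the open-addressing idea from the first point (linear probing, assuming C++11's std::hash, string keys, no deletion, no resizing, and a table that never fills up; purely illustrative):

#include <cstddef>
#include <functional>  // std::hash (C++11)
#include <string>
#include <vector>

struct OpenAddressedTable {
    struct Slot {
        std::string key;
        int value;
        bool used;
        Slot() : value(0), used(false) {}
    };

    std::vector<Slot> slots;

    explicit OpenAddressedTable(std::size_t capacity = 64) : slots(capacity) {}

    void insert(const std::string& key, int value) {
        std::size_t i = std::hash<std::string>()(key) % slots.size();
        while (slots[i].used && slots[i].key != key)
            i = (i + 1) % slots.size();   // probe the next slot on collision
        slots[i].key = key;
        slots[i].value = value;
        slots[i].used = true;
    }

    int* find(const std::string& key) {
        std::size_t i = std::hash<std::string>()(key) % slots.size();
        while (slots[i].used) {
            if (slots[i].key == key) return &slots[i].value;
            i = (i + 1) % slots.size();
        }
        return 0;                         // hit an empty slot: not present
    }
};

All entries live in one contiguous array, which is what makes this scheme cache-friendly compared to chasing list nodes in separate buckets.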

Read the Wikipedia article on hash tables; it addresses these points and more.

Matthieu M.
A: 

It depends. If the hash function does a good job of producing an even distribution of hash keys and the table isn't too full, then you will get approximately O(1): the lookup finds the correct entry with relatively few collisions.

If the table is extensively chained (i.e. full), then the probing process will spend more time resolving collisions. In the theoretical worst case, all values map to the same hash key and the lookup spends all of its time tracing down the chain, which is O(n).

In practice, unless your hash function is really broken you should get O(1) for all practical purposes (note that you can take a modulus of a larger hash value for smaller tables). If you have a hash-table based container that can expand, then it may do an expand operation, which will be substantially more expensive(1).

A tree will be O(log n) rather than O(1), although if the tree is unbalanced then the search can turn into an effectively linear operation as well. Note that this is a problem in some common scenarios, such as when nodes are inserted in key order (imagine a shallow copy operation on a tree based collection). Typically, balanced tree algorithms such as red-black trees are used to keep the tree efficient. Another advantage of trees is that one can traverse the tree in order and produce an ordered list of keys without having to explicitly sort them.

  1. For an example of a hashing scheme with relatively cheap expansion costs, see Linear hashing (wikipedia.org).
ConcernedOfTunbridgeWells
A: 

An implementation can easily reduce an "oversize" hash key. For instance, it could use the following structure:

typedef list< pair<Key,Value> > Bucket;
const int HashSize = 511;
Bucket Hashtable[HashSize];                                    // fixed-size array of buckets
inline size_t HashIndex(Key k) { return hash(k) % HashSize; }  // hash() is whatever hash function is in use

In practice, of course, HashSize isn't a constant. That would cause the performance to drop drastically if you insert more than a few thousand elements. Also, it uses quite a bit of memory if there are fewer elements. Hence, implementations do smart things with that internal parameter. As a result, the number of values per bucket is O(1), and finding the right bucket is also O(1). This is how such an implementation can retrieve any value in O(1).
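
A sketch of one such "smart thing" (names and thresholds are purely illustrative, not what any particular library does): grow the bucket array and rehash once the average number of entries per bucket passes a threshold, so the per-bucket load, and thus lookup cost, stays O(1), while the occasional O(n) rehash amortizes to O(1) per insertion.

#include <cstddef>
#include <functional>  // std::hash (C++11)
#include <list>
#include <string>
#include <utility>
#include <vector>

typedef std::list< std::pair<std::string, int> > Bucket;

struct GrowingHashTable {
    std::vector<Bucket> buckets;
    std::size_t num_entries;

    GrowingHashTable() : buckets(16), num_entries(0) {}

    double load_factor() const {
        return static_cast<double>(num_entries) / buckets.size();
    }

    void insert(const std::string& key, int value) {
        if (load_factor() > 0.7)            // illustrative threshold
            rehash(buckets.size() * 2);     // grow rarely, but a lot
        std::size_t i = std::hash<std::string>()(key) % buckets.size();
        buckets[i].push_back(std::make_pair(key, value));
        ++num_entries;
    }

    // Re-insert every entry under the new modulus: O(n) now and then,
    // O(1) amortized per insertion.
    void rehash(std::size_t new_bucket_count) {
        std::vector<Bucket> new_buckets(new_bucket_count);
        for (std::size_t b = 0; b < buckets.size(); ++b)
            for (Bucket::iterator it = buckets[b].begin(); it != buckets[b].end(); ++it)
                new_buckets[std::hash<std::string>()(it->first) % new_bucket_count].push_back(*it);
        buckets.swap(new_buckets);
    }
};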

MSalters
So, in practice, how big is HashSize? I thought it was usually somewhere around 2^32. And when you resize HashSize dynamically and insert a lot of elements, won't insertion perform very badly because of the frequent resizes of the hashtable (each of which takes O(n))?
Albert
Typically it's about twice the number of elements in the container, not a constant (average number of elements per bucket = 0.5). This is not a fixed amount. The resize policy is similar to vector: grow rarely, but a lot. E.g. when the average number of elements per bucket exceeds 0.7, you pick a new `HashSize` that's about double the previous value, and the new average becomes 0.35 elements/bucket.
MSalters