views: 1298
answers: 4

People say it takes amortized O(1) to put into a hash table. Therefore, putting n elements must be O(n). That's not true for large n, however, since as an answerer said, "All you need to satisfy expected amortized O(1) is to expand the table and rehash everything with a new random hash function any time there is a collision."

So: what is the average running-time of inserting n elements into a hash table? I realize this is probably implementation-dependent, so mention what type of implementation you're talking about.

For example, if there are O(log n) collisions, one each time the table doubles in size, and each collision takes O(k) to resolve, where k is the current size of the hashtable, then you'd have this recurrence relation:

T(n) = T(n/2) + n/2 + n/2

(that is, you take the time to insert n/2 elements, then you have a collision, taking n/2 to resolve, then you do the remaining n/2 inserts without a collision). This still ends up being O(n), so yay. But is this reasonable?
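
For reference, the O(n) claim for this recurrence does check out; collapsing the two n/2 terms into a single n and unrolling (a quick check, not part of the original post):

    \begin{aligned}
    T(n) &= T(n/2) + n \\
         &= n + n/2 + n/4 + \dots + T(1) \\
         &\le 2n + T(1) = O(n).
    \end{aligned}

The geometric series keeps the total linear even though a single collision can cost up to n/2 on its own.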

+4  A: 

It depends entirely on how inefficient your rehashing is. Specifically, if you can properly estimate the expected size of your hashtable the second time around, your runtime still approaches O(n). Effectively, you have to specify how inaccurate your rehash size calculation is before you can determine the expected order.

McWafflestix
Note that in many implementations, you can specify the expected size of the full hashmap. So if n is known before you start filling the map, the expected runtime is still O(1).
gnud
@gnud, that was my exact point; rehashing is only necessary if you get the original size wrong (or get the subsequent size wrong and need to rehash again, etc.).
McWafflestix
Yes, I know -- you wrote about estimating size the second time. I thought I should mention that it's often possible to specify size the first time =)
gnud
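
To make the pre-sizing point from the comments above concrete, here is a minimal Java sketch. java.util.HashMap really does accept an initial capacity; the capacity needed to avoid every resize depends on the load factor, which defaults to 0.75.

    import java.util.HashMap;
    import java.util.Map;

    public class PresizedInsert {
        public static void main(String[] args) {
            int n = 1_000_000;

            // Sized so that n entries fit under the default 0.75 load factor,
            // which should avoid any rehashing during the inserts below.
            Map<Integer, Integer> presized = new HashMap<>((int) (n / 0.75f) + 1);

            // Default capacity (16): the backing table is rebuilt repeatedly as it grows.
            Map<Integer, Integer> growing = new HashMap<>();

            for (int i = 0; i < n; i++) {
                presized.put(i, i);
                growing.put(i, i);
            }
            // Both maps end up with the same contents; the difference is only
            // how many times the backing array had to be rebuilt along the way.
            System.out.println(presized.size() + " " + growing.size());
        }
    }
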
A: 

Why not just run a few tests on your system? Maybe if you post the source, we can test it on our systems, and we could really shape this into a very useful discussion.

It is not just the implementation, but the environment as well, that decides how much time the algorithm actually takes. You can, however, check whether any benchmarking samples are available. The problem is that my posting my results would be of no use, since people have no idea what else is running on my system, how much RAM is free right now, and so on. You can only ever have a broad idea, and that is about as good as what big-O gives you.

dirkgently
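
Along those lines, here is a rough Java sketch of such a test (purely illustrative; as noted above, the absolute numbers will depend on the JVM, the heap, and whatever else the machine is doing):

    import java.util.HashMap;
    import java.util.Map;

    public class InsertBenchmark {
        public static void main(String[] args) {
            // Double n and watch whether the total time roughly doubles too;
            // if it does, insertion is behaving like amortized O(1).
            for (int n = 1 << 16; n <= 1 << 22; n <<= 1) {
                Map<Integer, Integer> map = new HashMap<>();
                long start = System.nanoTime();
                for (int i = 0; i < n; i++) {
                    map.put(i, i);
                }
                long elapsed = System.nanoTime() - start;
                System.out.printf("n = %8d  total = %8.2f ms  per insert = %6.1f ns%n",
                        n, elapsed / 1e6, (double) elapsed / n);
            }
        }
    }
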
+3  A: 

People say it takes amortized O(1) to put into a hash table.

From a theoretical standpoint, it is expected amortized O(1).

Hash tables are fundamentally a randomized data structure, in the same sense that quicksort is a randomized algorithm. You need to generate your hash functions with some randomness, or else there exist pathological inputs which are not O(1).
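
One common way to get that randomness is to draw the hash function from a universal family when the table is created. A minimal Java sketch of the multiply-shift flavour (illustrative only; this is the general idea, not how any particular library implements it):

    import java.util.concurrent.ThreadLocalRandom;

    // Multiply-shift hashing: the multiplier is drawn at random when the table
    // is created, so an adversary who does not know it cannot precompute a
    // pathological input. Assumes 1 <= bucketsLog2 <= 30.
    final class RandomizedHash {
        private final long a;     // random odd multiplier, chosen per table
        private final int shift;  // 64 - log2(number of buckets)

        RandomizedHash(int bucketsLog2) {
            this.a = ThreadLocalRandom.current().nextLong() | 1L; // force odd
            this.shift = 64 - bucketsLog2;
        }

        int bucketOf(long key) {
            // The multiplication wraps mod 2^64; the top bucketsLog2 bits pick the bucket.
            return (int) ((a * key) >>> shift);
        }
    }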

You can achieve expected amortized O(1) using dynamic perfect hashing:

The naive idea I originally posted was to rehash with a new random hash function on every collision (see also perfect hash functions). The problem with this is that it requires O(n^2) space, by the birthday paradox: to make collisions among n keys unlikely, the table needs on the order of n^2 slots.

The solution is to have two hash tables, with the second table for collisions; resolve collisions on that second table by rebuilding it. That second table holds only O(sqrt(n)) elements, so applying the rehash-on-every-collision trick to it costs only O((sqrt(n))^2) = O(n) space.

In practice you often just use a fixed hash function, because you can assume your input won't be pathological (or you don't care if it is), much like you often quicksort without pre-randomizing the input.

Captain Segfault
So here is my question exactly. You say "All you need to satisfy expected amortized O(1) is to expand the table and rehash everything with a new random hash function any time there is a collision." Let's say that you do do this. If you have no collision with n insertions, then you have O(n), definitely. But, what's the expected number of collisions per n elements, and how long does each take to resolve? Then we can get a more accurate number for n insertions into a hash table. Something like O(n + #col * coltime) - maybe O(n + (log n)^2)?
Claudiu
Fixed. I'd forgotten that the trick was to have a second table; simply rehashing on every collision would require O(n^2) space because of the birthday paradox.
Captain Segfault
+1  A: 

All O(1) is saying is that the operation is performed in constant time, and it's not dependent on the number of elements in your data structure.

In simple words, this means that you'll have to pay the same cost no matter how big your data structure is.

In practical terms this means that simpler data structures such as trees are generally more effective when you don't have to store a lot of data. In my experience I find trees faster up to ~1k elements (32-bit integers), after which hash tables take over. But as usual, YMMV.

Nova
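
If you want to check where that crossover falls on your own machine, a rough Java sketch along these lines would do (the ~1k figure is the answerer's observation and will vary by system):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class TreeVsHash {
        static long timeInserts(Map<Integer, Integer> map, int n) {
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                map.put(i, i);
            }
            return System.nanoTime() - start;
        }

        public static void main(String[] args) {
            for (int n : new int[] {100, 1_000, 10_000, 100_000}) {
                long tree = timeInserts(new TreeMap<>(), n);  // balanced tree
                long hash = timeInserts(new HashMap<>(), n);  // hash table
                System.out.printf("n = %7d  TreeMap = %8.2f ms  HashMap = %8.2f ms%n",
                        n, tree / 1e6, hash / 1e6);
            }
        }
    }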