views: 4141
answers: 13

I've seen some interesting claims on SO re Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was brought up on, there must always exist a dataset that contains collisions.

In which case, the lookup would be O(n), not O(1).

Can someone explain whether they are O(1) and, if so, how they achieve this?

+10  A: 

In Java, HashMap works by using hashCode to locate a bucket. Each bucket is a list of the items that hash to it. The items are scanned, using equals for comparison. When adding items, the HashMap is resized once a certain load percentage is reached.

So, sometimes it will have to compare against a few items, but generally it's much closer to O(1) than O(n). For practical purposes, that's all you should need to know.
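
A rough sketch of that lookup path, for illustration only (SimpleHashMap is a made-up class; the real java.util.HashMap additionally spreads the hash bits, rehashes when it resizes, and on recent JDKs converts long chains into trees):

    import java.util.LinkedList;
    import java.util.List;

    // Simplified chained hash map: hashCode() picks the bucket, equals() finds the item.
    class SimpleHashMap<K, V> {
        private static class Entry<K, V> {
            final K key;
            V value;
            Entry(K key, V value) { this.key = key; this.value = value; }
        }

        private final List<Entry<K, V>>[] buckets;

        @SuppressWarnings("unchecked")
        SimpleHashMap(int capacity) {
            buckets = new List[capacity];
            for (int i = 0; i < capacity; i++) {
                buckets[i] = new LinkedList<>();
            }
        }

        private int indexFor(Object key) {
            // Reduce the 32-bit hash code to a bucket index.
            return Math.abs(key.hashCode() % buckets.length);
        }

        public void put(K key, V value) {
            for (Entry<K, V> e : buckets[indexFor(key)]) {
                if (e.key.equals(key)) { e.value = value; return; }   // overwrite existing key
            }
            buckets[indexFor(key)].add(new Entry<>(key, value));
        }

        public V get(K key) {
            // Scan only the one bucket the key hashes to, comparing with equals().
            for (Entry<K, V> e : buckets[indexFor(key)]) {
                if (e.key.equals(key)) return e.value;
            }
            return null;
        }
    }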

FogleBird
Well, since big-O is supposed to specify the limits, it makes no difference whether it's closer to O(1) or not. Even O(n/10^100) is still O(n). I get your point about efficiency bringing the ratio down, but that still puts the algorithm at O(n).
paxdiablo
Hash-map analysis is usually done on the average case, which is O(1) (with collisions). In the worst case you can have O(n), but that is usually not the case. Regarding the difference: O(1) means that you get the same access time regardless of the number of items in the table, and that is usually the case (as long as there is a good proportion between the size of the table and n).
Liran Orevi
It's also worth noting that it is still exactly O(1), even if the scanning of the bucket takes a while because there are some elements already in it. As long as the buckets have a fixed maximum size, this is just a constant factor irrelevant to the O() classification. But of course there can be even more elements with "similar" keys being added, so that these buckets overflow and you can't guarantee a constant anymore.
sth
A: 

Of course the performance of the hashmap will depend on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have very good performance (this is not strictly O(1) in every possible case, but it is in most cases).

One particular example: I believe that if you don't implement hashCode, it will use the memory address of the object. Since no two objects can be located at the same address (unless they are identical), under this particular circumstance you are guaranteed that there will be no collisions.

Caveat: of course, if you are using a 64-bit memory space you might get some collisions, given that the hash code is only 32 bits, but this should be very rare.
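
For illustration, a typical user-supplied hashCode/equals pair looks something like this (Point is a made-up example class, not anything from the question):

    import java.util.Objects;

    // Equal objects produce equal hash codes, and the fields are mixed so that
    // distinct objects rarely collide - "rarely", not "never".
    final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return Objects.hash(x, y);   // combines the fields into one 32-bit hash
        }
    }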

Cd-MaN
"it is in most cases". More specifically, the total time will tend towards K times N (where K is constant) as N tends towards infinity.
ChrisW
This is wrong. The index in the hash table is going to be determined via `hashCode % tableSize` which means there can certainly be collisions. You aren't getting full use of the 32-bits. That's kind of the point of hash tables... you reduce a large indexing space to a small one.
FogleBird
"you are guaranteed that there will be no collisions" No you're not because the size of the map is smaller than the size of the hash: for example if the size of the map is two, then a collision is guaranteed (not matter what the hash) if/when I try to insert three elements.
ChrisW
But how do you convert from a key to the memory address in O(1)? I mean like x = array["key"]. The key is not the memory address so it would still have to be an O(n) lookup.
paxdiablo
If your hashtable's key is the object's address, then you don't need a hashtable at all. The point is to have a relatively small table (255 buckets, for example), and to have *uniform distribution*. Then you can predict that each bucket will contain approximately the same number of items.
Groo
+5  A: 

I know this might not be an answer but I remember Wikipedia has a very good article about this.

Don't miss the performance analysis section.

victor hugo
Sorry, but I can't beat Wikipedia
victor hugo
+7  A: 

The only data type that has a MAXIMUM lookup time of O(1) is a fixed-length array of fixed-length structures (allowing for memory address calculation instead of list traversal).

What people may be referring to is an AVERAGE lookup time, but this isn't big O. This would simply depend on the efficiency of the hashing algorithm. The only way to have an O(1) lookup time in a hash table would be to have it of a fixed length and a zero-collision hashing algorithm, which (by definition) can't exist.
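
For contrast, a fixed-length array lookup really is a single address calculation (a trivial sketch; the class name is made up):

    public class ArrayLookupDemo {
        public static void main(String[] args) {
            // The element address is computed as base + index * elementSize,
            // so this access costs the same however long the array is.
            int[] table = new int[256];
            System.out.println(table[42]);   // constant time, no traversal
        }
    }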

Adam Robinson
As long as you're using a fixed-length dataset, you might as well analyze it and produce a perfect hash *for that data*. See http://www.gnu.org/software/gperf/ .
David Winslow
Average case has got nothing to do with big-O versus Theta. Theta simply means having an upper as well as lower bound, and can also apply to worst case.
Konrad Rudolph
@Konrad: Thanks for the clarification. I've edited my answer. I wasn't entirely sure on that, so I guess it's better to be thought a fool and say nothing than to open my mouth and remove all doubt ;)
Adam Robinson
Maybe I'm missing something, but I've seen amortized analysis of hash maps in O(1). http://www.cs.cornell.edu/courses/cs3110/2008fa/lectures/lec22_amort.html Am I missing something?
Liran Orevi
The linked amortized analysis assumes that lookup is constant time = 1, because they keep the tables load factor low. Practically this will usually be the case, but in theory all inserted elements could collide and lookup would still be O(n).
sth
The answer is still wrong. Average/worst/best case running time is independent of O, Theta and Omega. Average-case lookup time is indeed describable with O(something).
harms
+2  A: 

This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.

If there are no collisions present in the table, you only have to do a single look-up, so the running time is O(1). If there are collisions present, you have to do more than one look-up, which drives the performance down towards O(n).

Tobias Svensson
That assumes the running time is bounded by the lookup time. In practice you'll find a lot of situations where computing the hash function provides the bound (e.g. String keys).
Stephan Eggermont
+4  A: 

Remember that O(1) does not mean that each lookup only examines a single item - it means that the average number of items checked remains constant w.r.t. the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10,000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).

So collisions don't prevent the container from having O(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
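
A quick way to see this empirically (a rough simulation assuming uniformly random hash codes and a 0.75 load factor; the class name and numbers are illustrative, not a benchmark of java.util.HashMap):

    import java.util.Random;

    // Distribute n random keys over n / 0.75 buckets and estimate the average
    // number of equals() comparisons a successful lookup would need. The figure
    // stays roughly constant as n grows - which is what O(1) means here.
    public class AverageProbeCount {
        public static void main(String[] args) {
            Random rnd = new Random(42);
            for (int n : new int[] {1_000, 100_000, 1_000_000}) {
                int bucketCount = (int) (n / 0.75);
                int[] chain = new int[bucketCount];
                for (int i = 0; i < n; i++) {
                    chain[Math.floorMod(rnd.nextInt(), bucketCount)]++;
                }
                long comparisons = 0;
                for (int c : chain) {
                    comparisons += (long) c * (c + 1) / 2;   // cost of finding each item in this chain
                }
                System.out.printf("n=%,d  average comparisons per lookup=%.3f%n",
                        n, (double) comparisons / n);
            }
        }
    }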

Daniel James
+17  A: 

You seem to mix up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using perfect hashing), but this is rarely relevant in practice.

Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.

Konrad Rudolph
I've always thought upper bound was worst case, but it appears I was mistaken - you can have an upper bound for the average case. So it appears that people claiming O(1) should have made it clear that it was for the average case. The worst case is a data set with many collisions, making it O(n). That makes sense now.
paxdiablo
You should probably make it clear that when you use big O notation for the average case, you are talking about an upper bound on the expected runtime function, which is a clearly defined mathematical function. Otherwise your answer doesn't make much sense.
ldog
gmatt: I'm not sure that I understand your objection: big-O notation is an upper bound on the function *by definition*. What else could I therefore mean?
Konrad Rudolph
Well, usually in computer science literature you see big O notation representing an upper bound on the runtime or space complexity functions of an algorithm. In this case the upper bound is actually on the expectation, which is itself not a function but an operator on functions (random variables) and is in fact an integral (Lebesgue). The very fact that you can bound such a thing should not be taken for granted and is not trivial.
ldog
+1  A: 

It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining, then the worst-case scenario happens when every data element is hashed to the same value (a poor choice of hash function, for example). In that case, data lookup is no different from a linear search on a linked list, i.e. O(n). However, the probability of that happening is negligible, and the lookup's best and average cases remain constant, i.e. O(1).
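
A hedged sketch of that degenerate case (BadKey is a made-up class; exact timings vary by machine, and recent JDKs soften the blow by turning oversized bins into trees):

    import java.util.HashMap;
    import java.util.Map;

    // Every BadKey reports the same hash code, so all entries land in one bucket
    // and lookups have to fight through the collisions instead of being O(1).
    public class WorstCaseDemo {
        static final class BadKey {
            final int id;
            BadKey(int id) { this.id = id; }

            @Override
            public boolean equals(Object o) {
                return o instanceof BadKey && ((BadKey) o).id == id;
            }

            @Override
            public int hashCode() {
                return 42;   // deliberately terrible: every key collides
            }
        }

        public static void main(String[] args) {
            Map<BadKey, Integer> map = new HashMap<>();
            for (int i = 0; i < 20_000; i++) {
                map.put(new BadKey(i), i);
            }
            long start = System.nanoTime();
            for (int i = 0; i < 20_000; i++) {
                map.get(new BadKey(i));
            }
            System.out.printf("20,000 lookups took %.1f ms%n",
                    (System.nanoTime() - start) / 1e6);
        }
    }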

Nizar Grira
+14  A: 

A particular feature of a HashMap is that, unlike, say, balanced trees, its behavior is probabilistic. In these cases it's usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring. For a hash map, that of course is a collision, with respect to how full the map happens to be. A collision is pretty easy to estimate.

p_collision = n / capacity

So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation lets us do something more compelling. Observe that for any arbitrary, fixed constant k:

O(n) = O(k * n)

We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions.

p_collision x 2 = (n / capacity)^2

This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalize this to

p_collision x k = (n / capacity)^k

And now we can disregard some arbitrary number of collisions and end up with vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.

We talk about this by saying that the hash-map has O(1) access with high probability.
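
To put rough numbers on that, here is the (n / capacity)^k estimate evaluated for a table at 50% load (purely illustrative; the class name is made up):

    // The chance of a lookup having to step past k colliding entries shrinks
    // geometrically with k when the table is kept half full.
    public class CollisionOdds {
        public static void main(String[] args) {
            double load = 0.5;   // n / capacity
            for (int k = 1; k <= 10; k++) {
                System.out.printf("k = %2d   (n/capacity)^k ~ %.10f%n", k, Math.pow(load, k));
            }
        }
    }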

TokenMacGuy
There's a <sub> tag! I never knew...
Wim Coenen
Even with HTML, I'm still not really happy with the fractions. Clean them up if you can think of a nice way to do it.
TokenMacGuy
+1  A: 

We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's HashMap), this is technically O(1 + α) with a good hash function, where α is the table's load factor. It is still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.

It's also been explained that, strictly speaking, it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different from the average search time. Using chaining, this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α = 1.

If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
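
A quick simulation of the longest-chain figure (a sketch assuming uniformly random hashes with α = 1, i.e. as many keys as buckets; individual runs vary, and the Θ bound hides constant factors):

    import java.util.Random;

    // Throw n random keys into n buckets and compare the longest resulting chain
    // with ln n / ln ln n, the order of the expected maximum load.
    public class LongestChain {
        public static void main(String[] args) {
            Random rnd = new Random(1);
            for (int n : new int[] {1_000, 100_000, 10_000_000}) {
                int[] chain = new int[n];
                for (int i = 0; i < n; i++) {
                    chain[Math.floorMod(rnd.nextInt(), n)]++;
                }
                int longest = 0;
                for (int c : chain) longest = Math.max(longest, c);
                System.out.printf("n=%,d  longest chain=%d  ln n / ln ln n ~ %.1f%n",
                        n, longest, Math.log(n) / Math.log(Math.log(n)));
            }
        }
    }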

jtb
+1  A: 

If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n). As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list, for example), then lookup is O(n/b) = O(n).

The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n -> inf.
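
The same point in a small simulation (a sketch with b pinned at 1024 and uniformly random hashes; the class name and numbers are illustrative):

    import java.util.Random;

    // With the number of buckets b held fixed, the average chain length is n / b,
    // so the equals() comparisons per lookup grow linearly with n.
    public class FixedBucketCount {
        public static void main(String[] args) {
            final int b = 1024;   // fixed bucket count
            Random rnd = new Random(7);
            for (int n : new int[] {10_000, 100_000, 1_000_000}) {
                int[] chain = new int[b];
                for (int i = 0; i < n; i++) {
                    chain[Math.floorMod(rnd.nextInt(), b)]++;
                }
                long comparisons = 0;
                for (int c : chain) comparisons += (long) c * (c + 1) / 2;
                System.out.printf("n=%,d  average comparisons per lookup=%.1f%n",
                        n, (double) comparisons / n);
            }
        }
    }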

I. J. Kennedy
+1  A: 

It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.

Whether or not you need to grow the table when you add items is not relevant to the question, because it is about lookup time.

antti.huima
+1  A: 

Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise).

Ryan Emerle
Not in practical applications. As soon as you use a string as a key, you'll notice that not all hash functions are ideal, and some are really slow.
Stephan Eggermont