+12  Q: 

O(1) hash look ups?

I ran across an assertion that HashSet<T>.Contains() is an O(1) operation. This surprised me since every discussion of hashing I've encountered mentions the possibility of collisions, potentially leading to O(n) run time.

Being curious, I looked into the documentation for HashSet<T>.Contains and also Hashtable.Contains. The documentation for both methods makes that same claim.

When I look in Reflector, HashSet<T>.Contains() is implemented with a for loop, going through a list of slots containing values that have the same hash.

Now admittedly, those same discussions of hashing have also mentioned that a good hashing algorithm avoids collisions, and under those circumstances the lookup will indeed be O(1). But my understanding of Big O notation is that it describes the worst-case runtime, not the best case.

So is the O(1) claim incorrect? Or am I missing something?
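
For reference, the pattern described above (hash to a bucket, then loop only over that bucket's entries) looks roughly like the following minimal sketch. The names and layout here are invented for illustration and are not the actual HashSet<T> internals:

```
using System.Collections.Generic;

// A toy chained hash set, only to illustrate the bucket walk described above;
// this is not the decompiled HashSet<T> source.
class TinyChainedSet
{
    private readonly List<int>[] buckets = new List<int>[16];

    public void Add(int value)
    {
        int b = (value.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        if (buckets[b] == null) buckets[b] = new List<int>();
        buckets[b].Add(value);
    }

    public bool Contains(int value)
    {
        int b = (value.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
        List<int> chain = buckets[b];
        if (chain == null) return false;

        // This is the kind of loop a decompiler shows: with a good hash each
        // chain stays short (O(1) on average); if everything collides, the
        // loop degenerates into scanning the whole set (O(n)).
        foreach (int candidate in chain)
            if (candidate == value) return true;
        return false;
    }
}
```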

+7  A: 

In general, it's O(1).

SLaks
Even considering the known poor performance of the built-in `GetHashCode`? I wouldn't depend on it being O(1)...
Stephen Cleary
@Stephen: What are you talking about? Also, even if `GetHashCode` takes an hour to return, it's still O(1) – the performance of `GetHashCode` doesn't scale with the size of the set.
SLaks
@SLaks, I'd guess Stephen was referring to the default implementation's poor suitability for hashing. See http://stackoverflow.com/questions/720177/default-implementation-for-object-gethashcode/720196#720196
Ben M
@SLaks: Ben is correct. It's not that `GetHashCode` takes a long time to return, it's that it's a poor hashing algorithm. Which causes collisions. Which pushes the "O(1)" reflex answer towards incorrectness, because it's no longer true on average.
Stephen Cleary
`HashSet` uses the `GetHashCode` method of the `IEqualityComparer<T>` that you can specify in the constructor, which lets you influence the performance (for good or bad).
Jaroslav Jandek
Object.GetHashCode() is excellent.
Hans Passant
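
To make the comparer point above concrete, here is a minimal sketch of passing a custom `IEqualityComparer<T>` to the `HashSet<T>` constructor. The hand-rolled case-insensitive comparer is purely illustrative (in real code the built-in `StringComparer.OrdinalIgnoreCase` would do the same job), but it shows that the comparer's `GetHashCode` decides how entries are spread across buckets:

```
using System;
using System.Collections.Generic;

// Illustrative comparer: Equals and GetHashCode must agree, and GetHashCode
// determines how well the set distributes its entries.
class CaseInsensitiveStringComparer : IEqualityComparer<string>
{
    public bool Equals(string x, string y) =>
        string.Equals(x, y, StringComparison.OrdinalIgnoreCase);

    public int GetHashCode(string s) =>
        s.ToUpperInvariant().GetHashCode();
}

class ComparerDemo
{
    static void Main()
    {
        var set = new HashSet<string>(new CaseInsensitiveStringComparer());
        set.Add("Hello");
        Console.WriteLine(set.Contains("HELLO")); // True
    }
}
```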
+5  A: 

No, Big O does not define "worst case"; it defines a limit. Hash-based lookups (with good hashing algorithms that provide efficient value distribution and a low collision rate) trend toward a constant value as the number of items increases (they will never actually reach that constant value, but that's the point of it being a limit).

Adam Robinson
+2  A: 

I believe it means O(1) on average.

KennyTM
+8  A: 

But my understanding of Big O notation is that it describes the worst-case runtime, not the best case.

Unfortunately, there is no "standard" for Big-O when describing algorithms. Often, it's used to describe the general or average case - not the worst case.

From Wikipedia:

...this notation is now frequently also used in the analysis of algorithms to describe an algorithm's usage of computational resources: the worst case or average case...

In this case, it's describing a standard case, given proper hashing. If you have proper hashing in place, the limiting behavior will be constant for size N, hence O(1).

Reed Copsey
Yep. Another prominent example is Quicksort — O(n^2) worst case, but often considered to be O(n log n) as this is the average complexity.
KennyTM
When I learned it, big O was used to denote the limit, without regard to best/worst/average case; however, when the best, worst, and average cases have a significant disconnect, big O is typically used for average-case analysis. Use big theta for worst case.
Razor Storm
That's surprising; I would have expected worst case to be the more typical use, though (particularly for hashing) having the worst case come up frequently would probably be a motivation to look for a better algorithm. I can certainly see where the general/average case would be useful, though. In the case of hashing, I would expect O(1) most of the time.
ThatBlairGuy
Hash table lookup is only O(1) *amortized*, see [Staffan's answer](http://stackoverflow.com/questions/3301983/o1-hash-look-ups/3302044#3302044). And even that requires some assumptions.
Gilles
@Gilles: Most hashing structures are not amortized, however. They do not modify the original hash to prevent collisions in the future, which would be required. (There are some that do, but .NET's does not...)
Reed Copsey
@Gilles: It is possible, with a very bad hash, to **guarantee** worst case every time with a .NET hash. If you don't believe me, try using HashSet with a custom class that implements GetHashCode as `return 1;`
Reed Copsey
@Reed: yup, in fact it's possible to guarantee worst case with most hashes, see my comment below Staffan's answer. But if you override the hash calculation with a maximally bad `GetHashCode`, it's your fault if hashing is slow!
Gilles
A: 

My understanding of Big O is that the "worst case" generally refers to the number of elements involved. So if a function were to perform at O(n) with 10 elements, but O(n^2) with 100 or more (not sure such an algorithm actually exists), then the algorithm is considered O(n^2).

Nick
A: 

O(1) does not necessarily mean "worst case". For hashes, one usually says that the "expected" lookup time is O(1), since the probability of hash collisions is small.

Ferdinand Beyer
That's what surprised me -- the phrasing in the various places where I found references to the lookup didn't say "expected" or "typical." They said "is," which implies always.
ThatBlairGuy
+6  A: 

For a properly implemented hash table, lookups have amortized constant time complexity.

In practice, a single lookup can be O(n) in the case of collisions, as you say. However, if you perform a large number of lookups, the average time complexity per operation is constant.

Quoting wikipedia:

Amortized analysis differs from average-case performance in that probability is not involved; amortized analysis guarantees the time per operation over worst-case performance.

The method requires knowledge of which series of operations are possible. This is most commonly the case with data structures, which have state that persists between operations. The basic idea is that a worst case operation can alter the state in such a way that the worst case cannot occur again for a long time, thus "amortizing" its cost.

Staffan
+1, finally the all-important term "amortized".
Hans Passant
Indeed, amortized complexity must be mentioned in a good description of hash table complexity. But note that amortized O(1) complexity requires an assumption that the keys are sufficiently randomly distributed. If an attacker chooses the keys to add to the hash, he can force a collision every time. This could be avoided by using a cryptographic hash, but these are very expensive, so you'd get constant time with a prohibitively large constant. Another way is to include a random seed in the hash (perl did this at some point).
Gilles
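
The hash-randomization idea mentioned in the last comment can be sketched roughly as follows. This is a hypothetical illustration of seeding a hash, not how .NET's built-in comparers behave:

```
using System;
using System.Collections.Generic;

// Hypothetical sketch: mixing a per-process random seed into the hash makes it
// harder for an attacker to precompute a set of keys that all collide.
class SeededStringComparer : IEqualityComparer<string>
{
    private static readonly int Seed = new Random().Next();

    public bool Equals(string x, string y) => string.Equals(x, y);

    public int GetHashCode(string s)
    {
        unchecked
        {
            int h = Seed;
            foreach (char c in s)
                h = h * 31 + c;
            return h;
        }
    }
}

// Usage: var names = new HashSet<string>(new SeededStringComparer());
```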
+1  A: 

No, Big-O notation is not necessarily restricted to the worst case. Typically you will see Big-O published for the best case, average case, and worst case. It is just that most people tend to focus on the worst case. Except in the case of a hash table, the worst case rarely happens, so using the average case tends to be more useful.

Yes, a good hash function reduces the probability of a collision. A bad hash function may cause the clustering effect (where different values hash to the exact same value or close to the same value). It is easy to demonstrate that HashSet can indeed become O(n) by implementing the GetHashCode function in such a manner that it returns the same value all of the time.

In a nutshell, yes, HashSet and Dictionary can be described as having O(1) runtime complexity because the emphasis is on the average-case scenario.

By the way, Big-O can be used to analyze amortized complexity as well. Amortized complexity is how a sequence of separate (and sometimes even different) operations behaves when grouped together as if they were one big operation. For example, a splay tree is said to have amortized O(log(n)) search, insert, and delete complexity even though the worst case for each could be O(n) and the best case is O(1).

Brian Gideon
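
The degenerate case described in that answer is easy to reproduce with a deliberately constant GetHashCode; the class and the timing below are illustrative only:

```
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Every instance hashes to the same bucket, so each Add or Contains must walk
// the entire chain: the per-operation cost degrades from O(1) to O(n).
class AlwaysCollides
{
    public int Value;
    public override int GetHashCode() => 1;   // deliberately terrible hash
    public override bool Equals(object obj) =>
        obj is AlwaysCollides other && other.Value == Value;
}

class ConstantHashDemo
{
    static void Main()
    {
        var set = new HashSet<AlwaysCollides>();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10000; i++)
            set.Add(new AlwaysCollides { Value = i }); // each Add scans the chain
        sw.Stop();
        Console.WriteLine($"10,000 inserts with a constant hash: {sw.Elapsed}");
    }
}
```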
A: 

Hash tables not only have average-case performance of O(1), but if the hash function is random, then for any given percentage P < 100%, the performance that can be obtained P% of the time from a properly designed hash table is O(1). Although the extreme pathological cases become more and more severe as N increases, that is balanced by the fact that even moderately pathological cases become less and less likely.

supercat