views: 3391
answers: 13

When given a static set of objects (static in the sense that once loaded it seldom if ever changes) into which repeated concurrent lookups are needed with optimal performance, which is better, a HashMap or an array with a binary search using some custom comparator?

Is the answer a function of object or struct type? Hash and/or Equal function performance? Hash uniqueness? List size? Hashset size/set size?

The size of the set that I'm looking at can be anywhere from 500k to 10M, in case that information is useful.

While I'm looking for a C# answer, I think the true mathematical answer lies not in the language, so I'm not including that tag. However, if there are C# specific things to be aware of, that information is desired.

+6  A: 

Hash algorithms are O(1) while binary search is O(log n). So as n approaches infinity, hash performance improves relative to binary search. Your mileage will vary depending on n, your hash implementation, and your binary search implementation.

Interesting discussion on O(1). Paraphrased:

O(1) doesn't mean instantaneous. It means that the performance doesn't change as the size of n grows. You can design a hashing algorithm that's so slow no one would ever use it and it would still be O(1). I'm fairly sure .NET/C# doesn't suffer from cost-prohibitive hashing, however ;)

I agree with Maghis. Give both approaches a spin and let us know what you find.
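
For what it's worth, a rough harness for "giving both a spin" could look like the sketch below. It is not from the original thread; the key shapes, sizes, and names are made up for illustration, so the only numbers that matter are the ones you measure on your own data.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    class LookupBenchmark
    {
        static void Main()
        {
            const int n = 1_000_000;
            string[] keys = Enumerable.Range(0, n).Select(i => "key" + i).ToArray();

            // Option 1: sorted array + binary search with an explicit comparer
            string[] sorted = (string[])keys.Clone();
            Array.Sort(sorted, StringComparer.Ordinal);

            // Option 2: hash-based dictionary, sized up front
            var dict = new Dictionary<string, int>(n);
            for (int i = 0; i < n; i++) dict[keys[i]] = i;

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < n; i++)
                Array.BinarySearch(sorted, keys[i], StringComparer.Ordinal);
            Console.WriteLine($"binary search: {sw.ElapsedMilliseconds} ms");

            sw.Restart();
            for (int i = 0; i < n; i++)
                dict.TryGetValue(keys[i], out _);
            Console.WriteLine($"dictionary:    {sw.ElapsedMilliseconds} ms");
        }
    }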

Corbin March
Don't know why this was downvoted - good answer and an interesting point. +1.
xan
-1: Big O notation measures complexity, not speed relative to other algorithms. The claim that hashes are O(1) and therefore faster than O(log n) binary searches is not strictly correct.
Juliet
And not even practically correct. Default string hashes touch the whole string and can be a lot slower than compares.
Stephan Eggermont
@Stephan: Agreed! Good alternatives are string length + hash of first 8 characters or length + hash of first 4 + last 4. Anything but using the whole thing.
Zan Lynx
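
For illustration, a hypothetical comparer in the spirit of that suggestion, usable with C#'s Dictionary (the class name, the mixing constant, and the Payload/capacity placeholders in the usage line are invented here): hash the length plus at most the first 8 characters, and let full ordinal equality handle the rest.

    using System;
    using System.Collections.Generic;

    class PrefixHashComparer : IEqualityComparer<string>
    {
        public bool Equals(string x, string y) =>
            string.Equals(x, y, StringComparison.Ordinal);

        public int GetHashCode(string s)
        {
            int hash = s.Length;
            int limit = Math.Min(8, s.Length);
            for (int i = 0; i < limit; i++)
                hash = hash * 31 + s[i];   // mix in at most the first 8 characters
            return hash;
        }
    }

    // usage: var map = new Dictionary<string, Payload>(capacity, new PrefixHashComparer());
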
+20  A: 

A binary search is going to be O(log n), whereas a hash lookup will be O(1), amortized. You would have to have a pretty terrible hash function to get worse performance than a binary search.

EDIT: When I say "terrible hash", I mean something like:

public override int GetHashCode()
{
    return 0; // every key gets the same hash, so every entry lands in one bucket
}

Yeah, it's blazing fast itself, but causes your hash map to become a linked list.

Bill the Lizard
md5 would be totally inappropriate as a hash to look up values in a hash table. It's a cryptographic hash.
Bill the Lizard
Not 'totally inappropriate', just slow. And even good non-cryptographic hash functions can indeed be slower than binary search for small-ish sizes.
Nick Johnson
Yep, the default string hash is such a terrible hash function. If keys are long, the hash will be much slower than the average compare.
Stephan Eggermont
small correction - O(1) on _average_ for random data and good hash function. Not O(1) amortized.
orip
@orip: No, it's amortized. http://en.wikipedia.org/wiki/Amortized
Bill the Lizard
@Stephan: The default string hash for what language?
Bill the Lizard
C# GetHashCode()
Stephan Eggermont
Surely you don't mean that GetHashCode() is slower than md5? So, you must mean it produces a lot more collisions than md5. Hmmm... I think I might understand your earlier point now. md5 might not be totally bad for an input set that size.
Bill the Lizard
No, GetHashCode is slower than compare. A lot slower for long strings.
Stephan Eggermont
@bill - the worst case sequence of operations for a hash table is when all the keys hash to the same bucket, and operations are O(N). Amortized time refers to the average cost per operation in _worst case_ behavior (your wikipedia link says it too).
orip
@orip: The worst case for a hash table doesn't mean you assume the worst possible hash *function*.
Bill the Lizard
You get expected O(1) amortized if you pick a new (sufficiently random) hash function any time you have a collision and you keep the tables a constant factor larger than the number of entries.
Captain Segfault
+2  A: 

It depends on how you handle duplicates for hash tables (if at all). If you do want to allow hash key duplicates (no hash function is perfect), it remains O(1) for the primary key lookup, but the subsequent search for the "right" value may be costly. The answer, then, is that theoretically, most of the time, hashes are faster. YMMV depending on which data you put there...

Keltia
“no hash function is perfect” – no, that's wrong. There's such a thing as perfect hashing, with a very wide area of application. The simplest case is of course a degenerate hash function h(x) = x. Notice that this *is* a valid hash function and there are quite a few cases where this is used.
Konrad Rudolph
+1  A: 

The answers by Bobby, Bill and Corbin are wrong. O(1) is not necessarily faster than O(log n) for a fixed/bounded n:

log(n) is constant, so it depends on the constant factors.

And for a slow hash function, ever heard of md5?

The default string hashing algorithm probably touches all characters, and can be easily 100 times slower than the average compare for long string keys. Been there, done that.

You might be able to (partially) use a radix. If you can split your keys into 256 approximately equal-size blocks, you're looking at binary searches over 2k to 40k entries. That is likely to provide much better performance.
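
A minimal sketch of that kind of layout, assuming non-empty keys whose leading character fits in a single byte (the class and member names are illustrative): sort once with an ordinal comparer, record where each leading character's slice starts, and binary-search only within that slice.

    using System;
    using System.Collections.Generic;

    class RadixBucketedSearch
    {
        private readonly string[] sorted;
        private readonly int[] start = new int[257];   // start[c]..start[c+1] is the slice for first char c

        public RadixBucketedSearch(IEnumerable<string> keys)
        {
            sorted = new List<string>(keys).ToArray();
            Array.Sort(sorted, StringComparer.Ordinal); // ordinal sort keeps leading characters contiguous

            int k = 0;
            for (int c = 0; c < 256; c++)
            {
                start[c] = k;
                while (k < sorted.Length && sorted[k][0] == c) k++;
            }
            start[256] = sorted.Length;
        }

        public bool Contains(string key)
        {
            int c = key[0];                              // assumes key[0] < 256
            int index = Array.BinarySearch(sorted, start[c], start[c + 1] - start[c],
                                           key, StringComparer.Ordinal);
            return index >= 0;
        }
    }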

[Edit] Too many people voting down what they do not understand.

String compares for binary searching sorted sets have a very interesting property: they get slower the closer they get to the target. First they will break on the first character, in the end only on the last. Assuming a constant time for them is incorrect.

Stephan Eggermont
Valid point too.
Keltia
Corbin, please take a look at what big O notation means
Stephan Eggermont
@Stephan: We all three said O(1) is faster than O(log n). You also need to look at what big O notation means. It compares the relative resource usage of algorithms as the input size is changing. It's meaningless to talk about a fixed n.
Bill the Lizard
The question was "which is better". I assume that means "which takes less time". No need to quibble about whether n is constant.
Mike Dunlavey
Er... @Mike: n being constant matters a lot. O(log n) can be much faster than O(1) if n is constant and small and the constant-time operation in the O(1) algorithm takes a long time. But O(log n) is incredibly unlikely to be faster than O(1) if n is not constant.
Claudiu
Thanks, Claudiu. I just thought the question was pretty simple. N is bounded, so the only remaining question is what the fixed costs of lookup are. Unless the fixed cost of the hash is pretty bad, hash will be faster.
Mike Dunlavey
@Mike: You're right, if n is constant then only the fixed costs of hashing vs. lookup matter. However, n being constant is a bit different from n being bounded between 500k and 10M. 20:1 is enough room for log n to make a difference.
Bill the Lizard
I couldn't agree more.
Mike Dunlavey
I have found, in some cases (not necessarily this one) that if the dataset seldom changes, it is possible to generate and compile ad-hoc code to do at least part of the search, and shrink the lookup cost by a good factor.
Mike Dunlavey
BTW I'm upvoting this answer because it's not that bad, really :-) In fact the suggestion of using a radix is quite sensible.
Mike Dunlavey
If I could paraphrase Stephan's point: O(10M) = O(1)
Mike Dunlavey
+1, stephan is right: N <= 10 million means O(N) = O(1).
orip
@orip: That's just wrong. A binary search might be faster when n=500k, but be outperformed by a hash when n=10M. If the input size matters, you can't ignore the term.
Bill the Lizard
@Bill: the question was about a nearly-not changing set. Of course the hash could be faster, but it could also have 20 times more collisions. You have to compare actual implementations.
Stephan Eggermont
@Stephan: You don't use a hash table under the assumption that you have a lot of collisions. Any reasonable hash code implementation will give better performance than a binary tree for 500k - 10M objects.
Bill the Lizard
Let's just say I've seen some unreasonable hash implementations then.
Stephan Eggermont
I can agree with that. I saw one that could be simplified to "return 0;", after many complicated steps of combining and mixing.
Bill the Lizard
@bill - if you want to talk actual timing, time it and see. If you want to talk computational complexity: N <= (any constant) means O(N) = O(1). It's two different topics, really.
orip
I'm detecting violent agreement :-)
Mike Dunlavey
Actually the point about string compare getting slower as one gets closer to the target is not inherent in binary search, because it is possible to keep track of the common prefix as you narrow down the subset. (Not that anybody does.)
Mike Dunlavey
@orip: I already explained why that's wrong. We're comparing two algorithms, not just stating the complexity of one.
Bill the Lizard
+2  A: 

I'd say it depends mainly on the performance of the hash and compare methods. For example, when using string keys that are very long but random, a compare will always yield a very quick result, but a default hash function will process the entire string.

But in most cases the hash map should be faster.

Michael Borgwardt
there's no reason the hash function has to use the whole string.
Javier
Just a very practical one: you don't want all extensions of a string to end up in the same bucket (unless you use it as a kind of radix and remove the prefix from the bucket elements, converting it into a trie-like structure)
Stephan Eggermont
+3  A: 

Hashes are typically faster, although binary searches have better worst-case characteristics. A hash access is typically a calculation to get a hash value to determine which "bucket" a record will be in, and so the performance will generally depend on how evenly the records are distributed, and the method used to search the bucket. A bad hash function (leaving a few buckets with a whole lot of records) with a linear search through the buckets will result in a slow search. (On the third hand, if you're reading a disk rather than memory, the hash buckets are likely to be contiguous while the binary tree pretty much guarantees non-local access.)

If you want generally fast, use the hash. If you really want guaranteed bounded performance, you might go with the binary tree.

David Thornley
Again - downvoted :S. Good points, +1.
xan
trees also have degenerated cases that effectively turn into a list. most variations have strict invariants to avoid these, of course.
Javier
Misleading answer. The performance problem often breaking hashing in practice is the hash function, not the collisions.
Stephan Eggermont
A: 

Of course, hash is fastest for such a big dataset.

One way to speed it up even more, since the data seldom changes, is to programmatically generate ad-hoc code to do the first layer of search as a giant switch statement (if your compiler can handle it), and then branch off to search the resulting bucket.
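
A toy sketch of what generating that first layer could look like: emitting C# source for a switch over the leading character that maps a key straight to a bucket index. The method names and bucket scheme are invented here, and compiling the emitted source (e.g. with CodeDOM or Roslyn) is left out.

    using System.Collections.Generic;
    using System.Text;

    static string EmitFirstLayerSwitch(IEnumerable<char> leadingChars)
    {
        var sb = new StringBuilder();
        sb.AppendLine("static int FindBucket(string key)");
        sb.AppendLine("{");
        sb.AppendLine("    switch (key[0])");
        sb.AppendLine("    {");
        int bucket = 0;
        foreach (char c in leadingChars)                    // one case per leading character present in the data
            sb.AppendLine($"        case '{c}': return {bucket++};");
        sb.AppendLine("        default: return -1;");
        sb.AppendLine("    }");
        sb.AppendLine("}");
        return sb.ToString();
    }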

Mike Dunlavey
Special casing the first layer is definitely a thing to try.
Stephan Eggermont
I guess I've got a soft spot for code generation, if only because none of the major popular "methodologies" can tell you when it's a win.
Mike Dunlavey
A: 

I strongly suspect that in a problem set of size ~1M, hashing would be faster.

Just for the numbers:

a binary search would require ~ 20 compares (2^20 == 1M)

a hash lookup would require 1 hash calculation on the search key, and possibly a handful of compares afterwards to resolve possible collisions

Edit: the numbers:

    string c = "abcde", d = "rwerij";

    // one hash computation per simulated lookup
    for (int i = 0; i < 1000 * 1000; i++) {
        c.GetHashCode();
    }

    // ~20 comparisons per simulated lookup (2^20 == 1M)
    for (int i = 0; i < 1000 * 1000; i++) {
        for (int j = 0; j < 20; j++)
            c.CompareTo(d);
    }

times: c = "abcde", d = "rwerij" hashcode: 0.0012 seconds. Compare: 2.4 seconds.

disclaimer: Actually benchmarking a hash lookup versus a binary lookup might be better than this not-entirely-relevant test. I'm not even sure whether GetHashCode gets memoized under the hood.

Jimmy
With a decent optimizer the results should be 0 for both.
Stephan Eggermont
+6  A: 

Ok, I'll try to be short.

C# short answer:

Test the two different approaches.

.NET gives you the tools to change your approach with a line of code. Otherwise use System.Collections.Generic.Dictionary and be sure to initialize it with a large number as the initial capacity, or you'll spend the rest of your life inserting items because of the work the GC has to do to collect the old bucket arrays.
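
For example (the value type and the capacity are placeholders; the capacity here is just the upper end of the range mentioned in the question):

    var lookup = new Dictionary<string, Payload>(10_000_000);   // pre-size to avoid repeated bucket-array reallocations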

Longer answer:

A hashtable has ALMOST constant lookup times, and getting to an item in a hash table in the real world does not just require computing a hash.

To get to an item, your hashtable will do something like this:

  • Get the hash of the key
  • Get the bucket number for that hash (usually the map function looks like this: bucket = hash % bucketsCount)
  • Traverse the chain of items that starts at that bucket (basically a list of items sharing the same bucket; most hashtables handle bucket/hash collisions this way) and compare each key with the key of the item you are trying to add/delete/update/check for (see the sketch below)

Lookup times depend on how "good" (how evenly it spreads its output) and how fast your hash function is, on the number of buckets you are using, and on how fast the key comparer is; a hashtable is not always the best solution.
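
A bare-bones sketch of those steps with chaining (the Node type and bucket array are illustrative, not what System.Collections.Generic.Dictionary actually uses internally):

    class Node
    {
        public string Key;
        public object Value;
        public Node Next;
    }

    static bool Contains(Node[] buckets, string key)
    {
        int hash = key.GetHashCode() & 0x7FFFFFFF;              // 1. hash the key
        int bucket = hash % buckets.Length;                     // 2. map the hash to a bucket
        for (Node n = buckets[bucket]; n != null; n = n.Next)   // 3. walk the chain,
            if (n.Key.Equals(key))                              //    comparing keys as we go
                return true;
        return false;
    }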

A better and deeper explanation: http://en.wikipedia.org/wiki/Hash_table

Maghis
A: 

Here it's described how hashes are built. Because the universe of keys is reasonably big and hash functions are built to be "very injective", so that collisions rarely happen, the access time for a hash table is not actually O(1); it's based on probabilities. But it is reasonable to say that the access time of a hash is almost always less than O(log_2(n)).

xxxxxxx
+3  A: 

If your set of objects is truly static and unchanging, you can use a perfect hash to get O(1) performance guaranteed. I've seen gperf mentioned a few times, though I've never had occasion to use it myself.

Mark Ransom
A: 

I wonder why no one mentioned perfect hashing.

It's only relevant if your dataset is fixed for a long time, but what it does is analyze the data and construct a perfect hash function that ensures no collisions.

Pretty neat, if your data set is constant and the time to calculate the function is small compared to the application run time.
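
A toy illustration of the idea, only viable for a handful of keys and with names invented for this sketch (real tools such as gperf use far smarter constructions): brute-force a seed that makes a simple seeded hash collision-free over the fixed key set.

    using System;
    using System.Collections.Generic;

    static int FindPerfectSeed(string[] keys)
    {
        for (int seed = 1; seed < int.MaxValue; seed++)
        {
            var slots = new HashSet<int>();
            bool collisionFree = true;
            foreach (var key in keys)
            {
                int h = seed;
                foreach (char c in key)
                    h = unchecked(h * 31 + c);                   // simple seeded string hash
                if (!slots.Add((h & 0x7FFFFFFF) % keys.Length))  // slot already taken: collision
                {
                    collisionFree = false;
                    break;
                }
            }
            if (collisionFree)
                return seed;                                     // every key maps to a unique slot
        }
        throw new InvalidOperationException("No perfect seed found.");
    }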

orip
+1  A: 

Surprised nobody mentioned Cuckoo hashing, which provides guaranteed O(1) and, unlike perfect hashing, is capable of using all of the memory it allocates, whereas perfect hashing can end up with guaranteed O(1) but wasting the greater portion of its allocation. The caveat? Insertion time can be very slow, especially as the number of elements increases, since all of the optimization is performed during the insertion phase.
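
A compact sketch of the cuckoo scheme (illustrative names; string keys; no resizing, duplicate handling, or rehash-on-cycle): every key lives in one of exactly two candidate slots, so a lookup touches at most two positions, while an insert may have to evict and re-place existing keys.

    using System;

    class CuckooSet
    {
        private readonly string[] table1, table2;
        private readonly int capacity;

        public CuckooSet(int capacity)
        {
            this.capacity = capacity;
            table1 = new string[capacity];
            table2 = new string[capacity];
        }

        // Two hash functions; derived from the same GetHashCode here for brevity only.
        private int H1(string key) => (key.GetHashCode() & 0x7FFFFFFF) % capacity;
        private int H2(string key) => (unchecked(key.GetHashCode() * 31 + 17) & 0x7FFFFFFF) % capacity;

        // Guaranteed O(1): inspect at most two slots.
        public bool Contains(string key) =>
            key.Equals(table1[H1(key)]) || key.Equals(table2[H2(key)]);

        public void Add(string key)
        {
            // Evict-and-re-place loop; this is where all the work (and the slowness) happens.
            for (int i = 0; i < capacity; i++)
            {
                int slot = H1(key);
                (key, table1[slot]) = (table1[slot], key);   // place the key, pick up whatever was there
                if (key == null) return;                     // slot was empty: done

                slot = H2(key);
                (key, table2[slot]) = (table2[slot], key);
                if (key == null) return;
            }
            throw new InvalidOperationException("Insertion cycle; a real implementation would rehash or resize.");
        }
    }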

I believe some version of this is used in router hardware for IP lookups.


ApplePieIsGood