views:

4748

answers:

8

It's clear that a search performance of the generic HashSet<T> class is higher than of the generic List<T> class. Just compare the hash-based key with the linear approach in the List<T> class.

However calculating a hash key may itself take some CPU cycles, so for a small amount of items the linear search can be a real alternative to the HashSet<T>.

My question: where is the break-even?

To simplify the scenario (and to be fair) let's assume that the List<T> class uses the element's Equals() method to identify an item.

+1  A: 

It depends. If the exact answer really matters, do some profiling and find out. If you're sure you'll never have more than a certain number of elements in the set, go with a List. If the number is unbounded, use a HashSet.

Adam Rosenfield
+3  A: 

The breakeven will depend on the cost of computing the hash. Hash computations can be trivial, or not... :-) There is always the System.Collections.Specialized.HybridDictionary class to help you not have to worry about the breakeven point.

WaldenL
A: 

Depends on a lot of factors... List implementation, CPU architecture, JVM, loop semantics, complexity of equals method, etc... By the time the list gets big enough to effectively benchmark (1000+ elements), Hash-based binary lookups beat linear searches hands-down, and the difference only scales up from there.

Hope this helps!

Kyle
+1  A: 

Depends on what you're hashing. If your keys are integers you probably don't need very many items before the HashSet is faster. If you're keying it on a string then it will be slower, and depends on the input string.

Surely you could whip up a benchmark pretty easily?

Peter
A: 

The answer, as always, is "It depends". I assume from the tags you're talking about C#.

Your best bet is to determine

1) A Set of data 2) Usage requirements

and write some test cases.

It also depends on how you sort the list (if it's sorted at all), what kind of comparisons need to be made, how long the "Compare" operation takes for the particular object in the list, or even how you intend to use the collection.

Generally, the best one to choose isn't so much based on the size of data you're working with, but rather how you intend to access it. Do you have each piece of data associated with a particular string, or other data? A hash based collection would probably be best. Is the order of the data you're storing important, or are you going to need to access all of the data at the same time? A regular list may be better then.

Additional:

Of course, my above comments assume 'performance' means data access. Something else to consider: what are you looking for when you say "performance"? Is performance individual value look up? Is it management of large (10000, 100000 or more) value sets? Is it the performance of filling the data structure with data? Removing data? Accessing individual bits of data? Replacing values? Iterating over the values? Memory usage? Data copying speed? For example, If you access data by a string value, but your main performance requirement is minimal memory usage, you might have conflicting design issues.

Robert P
A: 

Whether to use a HashSet<> or List<> comes down to how you need to access your collection. If you need to guarantee the order of items, use a List. If you don't, use a HashSet. Let Microsoft worry about the implementation of their hashing algorithms and objects.

A HashSet will access items without having to enumerate the collection (complexity of O(1) or near it), and because a List guarantees order, unlike a HashSet, some items will have to be enumerated (complexity of O(n)).

Chris
+1  A: 

One factor your not taking into account is the robustness of the GetHashcode() function. With a perfect hash function the HashSet will clearly have better searching performance. But as the hash function diminishes so will the HashSet search time.

JaredPar
+1  A: 

You're looking at this wrong. Yes a linear search of a List will beat a HashSet for a small number of items. But the performance difference usually doesn't matter for collections that small. It's generally the large collections you have to worry about, and that's where you think in terms of Big-O. However, if you've measured a real bottleneck on HashSet performance, then you can try to create a hybrid List/HashSet, but you'll do that by conducting lots of empirical performance tests - not asking questions on SO.

Eloff