I have a large list of integers that are sent to my web service. Our business rules state that these values must be unique. What is the most performant way to find out whether there are any duplicates? I don't need to know which values are duplicated; I only need to know whether any two of the values are equal.

At first I was thinking about using a generic List of integers and the list.Exists() method, but that is an O(n) search.

Then I was thinking about using a Dictionary and the ContainsKey method. But I only need the keys, not the values, and I think this is a linear search as well.

Is there a better datatype to use to find uniqueness within a list? Or am I stuck with a linear search?

+15  A: 

Use a HashSet<T>:

The HashSet class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

HashSet<T> even exposes a constructor that accepts an IEnumerable<T>. By passing your List<T> to the HashSet<T>'s constructor you will end up with a reference to a new HashSet<T> that will contain a distinct sequence of items from your original List<T>.
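As a rough illustration of the constructor approach, here is a minimal sketch. The class and method names (`DuplicateCheck`, `HasDuplicatesByCount`) are made up for this example and are not part of the framework; the only framework pieces used are `HashSet<T>`, its `IEnumerable<T>` constructor, and the `Count` properties.

```csharp
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
public static class DuplicateCheck
{
    // Returns true when at least one value occurs more than once.
    public static bool HasDuplicatesByCount(List<int> values)
    {
        // The constructor keeps only one copy of each value, so a set
        // smaller than the input means the input held a duplicate.
        var distinct = new HashSet<int>(values);
        return distinct.Count != values.Count;
    }
}
```

This always walks the whole list once, so it is O(n) on average whether or not a duplicate exists.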

Andrew Hare
When inputList.Count != hashSet.Count, "Houston, we have duplicates!"
sixlettervariables
Which is still O(n), the best I think he can get.
Marc
@sixlettervariables - Excellent point!
Andrew Hare
@Andrew: he could add the items one by one to a HashSet and throw an exception as soon as hashSet.Contains(item) returns true. That would save going all the way through the list if there is a duplicate.
sixlettervariables
@sixlettervariables - Very true, at that point he would no longer need a `HashSet<T>` as any implementation of `IList<T>` has the `Contains` method.
Andrew Hare
+1  A: 

Sounds like a job for a HashSet...

Dan Diplo
A: 

If you are using .NET Framework 3.5 or later, you can use the HashSet collection.

Otherwise the best option is the Dictionary. The value of each item will be wasted, but that will give you the best performance.

If you check for duplicates while you add the items to the HashSet/Dictionary instead of counting them afterwards, you can stop at the first duplicate instead of processing the rest of the list. The worst case (no duplicates) is still O(n), but a list that does contain duplicates is rejected sooner.
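A minimal sketch of the early-exit idea, assuming a HashSet is available. The names `EarlyExitDuplicateCheck` and `HasDuplicates` are hypothetical; the sketch relies on `HashSet<T>.Add` returning false when the item is already in the set, which avoids a separate lookup.

```csharp
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
public static class EarlyExitDuplicateCheck
{
    // Returns true as soon as a repeated value is seen.
    public static bool HasDuplicates(IEnumerable<int> values)
    {
        var seen = new HashSet<int>();
        foreach (int value in values)
        {
            // Add returns false if the value was already present,
            // so the first duplicate ends the loop immediately.
            if (!seen.Add(value))
            {
                return true;
            }
        }
        return false; // every value was seen exactly once
    }
}
```

Returning a bool rather than throwing is just a choice made for this sketch; a web service that must reject the request could throw at the same point instead.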

Guffa
A: 

Dictionary would be sufficient from a purely runtime complexity perspective since ContainsKey is an O(1) operation, but as others have pointed out, HashSet is better since you do not need to store key-value pairs.

Brian Gideon
A: 

If the set of numbers is sparse, then as others suggest use a HashSet.

But if the set of numbers is mostly in sequence with occasional gaps, it would be a lot better to store the number set as a sorted array or binary tree of (begin, end) pairs. Then you could search for the pair with the largest begin value that is less than or equal to your search key and compare the key with that pair's end value to see whether it exists in the set.
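The lookup described above could be sketched roughly like this. Everything here is an assumption made for illustration: the `RangeSet` name, storing the pairs as two parallel arrays, and treating each range as inclusive, sorted by begin value, and non-overlapping. Only `Array.BinarySearch` is a framework call.

```csharp
using System;

// Hypothetical helper, named for illustration only.
public static class RangeSet
{
    // Range i covers begins[i]..ends[i] inclusive.
    // Assumes the ranges are sorted by begin and do not overlap.
    public static bool Contains(int[] begins, int[] ends, int key)
    {
        int index = Array.BinarySearch(begins, key);
        if (index >= 0)
        {
            return true; // key is exactly the start of a range
        }

        // For a miss, ~index is the position of the first begin value
        // larger than key, so the range just before it is the candidate.
        int candidate = ~index - 1;
        return candidate >= 0 && key <= ends[candidate];
    }
}
```

For example, with begins `{ 1, 10, 20 }` and ends `{ 5, 12, 20 }`, the key 3 falls inside the first range, while 7 falls in the gap between the first and second. Each lookup is O(log r) in the number of ranges r. Inserting a new value (merging or splitting ranges) is not covered by this sketch.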

Zan Lynx
A: 

What about doing:

list.Distinct().Count() != list.Count()

I wonder about the performance of this. I think it would be as good as O(n) but with less code and still easily readable.

Stephen B. Burris Jr.