views:

88

answers:

3

I have a very, very large unsorted string array and I need to check whether there are duplicates.

What is the most efficient method of checking this?

A: 

Loop through the list, and put each element in a sorted tree. This way, you can detect early whether there is a duplicate.
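
For example, a minimal sketch of this idea in C#, assuming .NET's `SortedSet<string>` (a balanced tree) as the sorted structure - its `Add` method returns `false` when the element is already present:

SortedSet<string> seen = new SortedSet<string>();
foreach (string x in strings)
{
    if (!seen.Add(x))
    {
        // x is already in the tree, so it's a duplicate - we can stop early
        break;
    }
}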

Sjoerd
That ends up being O(n log n) in the case where there isn't an early "out" though.
Jon Skeet
It is roughly the same as your second solution.
Sjoerd
@Sjoerd: Except that mine doesn't sort - leading to an O(n) solution instead of O(n log n). If you're only interested in equality, there's no advantage in sorting.
Jon Skeet
+5  A: 

The simplest way is probably:

if (strings.Length != strings.Distinct().Count())
{
    // There are duplicates
}

That will be O(n) - but it won't tell you which items were duplicated.

Alternatively:

HashSet<string> values = new HashSet<string>();
foreach (string x in strings)
{
    if (!values.Add(x))
    {
        // x was a duplicate
    }
}

Again, this should be amortized O(n).

Note that you can specify a different IEqualityComparer<string> if you want a case-insensitive comparison, or something like that.
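
For example, a case-insensitive version of the second approach is just a matter of passing a comparer to the constructor (a sketch using the framework's `StringComparer.OrdinalIgnoreCase`):

HashSet<string> values = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (string x in strings)
{
    if (!values.Add(x))
    {
        // x is a duplicate, ignoring case
    }
}

`Distinct` has an overload that accepts the same kind of comparer, so the first approach can be adapted in the same way.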

Jon Skeet
Your second method is better for this asker, who wants the 'most efficient' way. `.Distinct().Count()` will examine every element, even when the very first two are duplicates (in which case we're done).
AakashM
@AakashM: Indeed - as well as giving the duplicate elements.
Jon Skeet
A: 

Build a prefix tree (trie). This is O(nm), where n is the number of strings and m is the average string length. That is as good as it can get asymptotically, since even inserting a string into a hash set takes at least O(m) on average (the whole string has to be hashed).
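
A rough sketch of the idea in C# (the class and method names are just illustrative, not from any library; assumes ordinal character comparison and `System.Collections.Generic` in scope):

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsEndOfWord;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    // Walks/extends the path for s and reports whether s was already present.
    public bool AddAndCheck(string s)
    {
        TrieNode node = root;
        foreach (char c in s)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
        }
        bool duplicate = node.IsEndOfWord;
        node.IsEndOfWord = true;
        return duplicate;
    }
}

Checking the array is then a single pass that can stop at the first duplicate:

Trie trie = new Trie();
foreach (string x in strings)
{
    if (trie.AddAndCheck(x))
    {
        // x is a duplicate
        break;
    }
}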

Henrik
I suspect this will be significantly less efficient in memory though - and probably with a higher constant factor than the hash implementation. It also involves writing your own trie class, or finding a third-party one... using `HashSet<T>` has the advantage of simplicity, more efficient memory use, and I suspect it's not significantly worse in execution time either. Of course if the OP is absolutely desperate to find the fastest route, he could try both.
Jon Skeet
Probably you're right. I think for large and very large arrays, a HashSet would be OK. But as the OP has "very, very large" arrays ;-), it might be worth just trying both.
Henrik
For "very, very large arrays" it's going to potentially be even worse in terms of memory using a trie. I'd certainly try the simpler approach first and see if it worked well enough...
Jon Skeet
This depends very much on the actual data. If there are many strings with long common prefixes, a trie would be more compact.
Henrik