views:

88

answers:

3

I have a very, very large unsorted string array and I need to check whether there are duplicates.

What is the most efficient method of checking this?

A: 

Loop through the list, and put each element in a sorted tree. This way, you can detect early whether there is a duplicate.
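
For example, a minimal sketch of this idea in C#, assuming .NET's `SortedSet<string>` (a balanced tree) as the sorted structure - its `Add` method returns `false` when the element is already present:

SortedSet<string> seen = new SortedSet<string>();
foreach (string x in strings)
{
    if (!seen.Add(x))
    {
        // x is already in the tree, so it's a duplicate - we can stop early
        break;
    }
}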

Sjoerd
That ends up being O(n log n) in the case where there isn't an early "out" though.
Jon Skeet
It is roughly the same as your second solution.
Sjoerd
@Sjoerd: Except that mine doesn't sort - leading to an O(n) solution instead of O(n log n). If you're only interested in equality, there's no advantage in sorting.
Jon Skeet
+5  A: 

The simplest way is probably:

if (strings.Length != strings.Distinct().Count())
{
    // There are duplicates
}

That will be O(n) - but it won't tell you which items were duplicated.

Alternatively:

HashSet<string> values = new HashSet<string>();
foreach (string x in strings)
{
    if (!values.Add(x))
    {
        // x was a duplicate
    }
}

Again, this should be amortized O(n).

Note that you can specify a different IEqualityComparer<string> if you want a case-insensitive comparison, or something like that.
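
For example, a case-insensitive version of the second approach is just a matter of passing a comparer to the constructor (a sketch using the framework's `StringComparer.OrdinalIgnoreCase`):

HashSet<string> values = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
foreach (string x in strings)
{
    if (!values.Add(x))
    {
        // x is a duplicate, ignoring case
    }
}

`Distinct` has an overload that accepts the same kind of comparer, so the first approach can be adapted in the same way.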

Jon Skeet
Your second method is better for this asker, who wants the 'most efficient' way. `.Distinct().Count()` will examine every element, even when the very first two are duplicates (in which case we're done).
AakashM
@AakashM: Indeed - as well as giving the duplicate elements.
Jon Skeet
A: 

Build a prefix tree (trie). This is O(nm), where n is the number of strings and m is the average string length. That is as good as it can get asymptotically, since even inserting a string into a hash set takes at least O(m) on average (the whole string has to be hashed).
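
A rough sketch of the idea in C# (the class and method names are just illustrative, not from any library; assumes ordinal character comparison and `System.Collections.Generic` in scope):

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsEndOfWord;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    // Walks/extends the path for s and reports whether s was already present.
    public bool AddAndCheck(string s)
    {
        TrieNode node = root;
        foreach (char c in s)
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
        }
        bool duplicate = node.IsEndOfWord;
        node.IsEndOfWord = true;
        return duplicate;
    }
}

Checking the array is then a single pass that can stop at the first duplicate:

Trie trie = new Trie();
foreach (string x in strings)
{
    if (trie.AddAndCheck(x))
    {
        // x is a duplicate
        break;
    }
}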

Henrik
I suspect this will be significantly less efficient in memory though - and probably with a higher constant factor than the hash implementation. It also involves writing your own trie class, or finding a third-party one... using `HashSet<T>` has the advantage of simplicity, more efficient memory use, and I suspect it's not significantly worse in execution time either. Of course if the OP is absolutely desperate to find the fastest route, he could try both.
Jon Skeet
Probably you're right. I think for large and very large arrays, a HashSet would be OK. But as the OP has "very, very large" arrays ;-), it might be worth just trying both.
Henrik
For "very, very large arrays" it's going to potentially be even worse in terms of memory using a trie. I'd certainly try the simpler approach first and see if it worked well enough...
Jon Skeet
This depends very much on the actual data. If there are many strings with long common prefixes, a trie would be more compact.
Henrik