I have a very, very large unsorted string array and i need to check to see if there are duplicates.
What is the most efficient method of checking this?
I have a very, very large unsorted string array and i need to check to see if there are duplicates.
What is the most efficient method of checking this?
Loop through the list, and put each element in a sorted tree. This way, you can detect early whether there is a duplicate.
The simplest way is probably:
if (strings.Length != strings.Distinct().Count())
{
// There are duplicates
}
That will be O(n) - but it won't tell you which items were duplicated.
Alternatively:
HashSet<string> values = new HashSet<string>();
foreach (string x in strings)
{
if (!values.Add(x))
{
// x was a duplicate
}
}
Again, this should be amortized O(n).
Note that you can specify a different IEqualityComparer<string>
if you want a case-insensitive comparison, or something like that.
Build a prefix tree (trie). This is O(nm) where n is the number of strings and m is the average string length. This is as good as it can get, as e.g. inserting a string in a hash set takes at least O(m) on average.