I'm having issues finding the most efficient way to remove duplicates from a list of strings (List&lt;string&gt;).

My current implementation is a dual foreach loop that checks whether each object occurs only once, and removes the second occurrence otherwise.
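
Roughly, it looks something like this (a simplified sketch rather than the exact code; index-based loops stand in for the foreach pair, since the list is modified while scanning):

static void RemoveDuplicates(List<string> list)
{
    for (int i = 0; i < list.Count; i++)
    {
        // Walk backwards so RemoveAt doesn't shift the indexes still to be visited.
        for (int j = list.Count - 1; j > i; j--)
        {
            if (list[j] == list[i])
            {
                list.RemoveAt(j);
            }
        }
    }
}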

I know there are MANY other questions out there, but all the best solutions require something above .NET 2.0, which is the current build environment I'm working in. (GM and Chrysler are very resistant to changes ... :) )

This limits the possible solutions by not allowing any LINQ or HashSets.

The code I'm using is Visual C++, but a C# solution will work just fine as well.

Thanks!

+10  A: 

This probably isn't what you're looking for, but if you have control over this, the most efficient way would be to not add them in the first place...

Do you have control over this? If so, all you'd need to do is a myList.Contains(currentItem) check before you add the item, and you're set.
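
A minimal sketch of that guard, using the myList and currentItem names from above:

// Skip the add if the item is already in the list.
// Note: List<T>.Contains is a linear scan, so each check is O(n).
if (!myList.Contains(currentItem))
{
    myList.Add(currentItem);
}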

John
Hah, I never thought about that, I do have control over the initial list generation!
greggorob64
LOL. that is WIN!
Alan
Be aware this approach doesn't scale very well as the size of the list increases...
Lee
If size is a concern, I'd think you'd be fine using the same method as above, but with a SortedList as opposed to a standard List
John
It's O(n^2) since List<T>.Contains is O(n). You need to borrow Jared's dictionary to keep track of the items that you've added, giving O(1) checks and O(n) overall.
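
Something like this (illustrative names; the dictionary plays the role of a set):

Dictionary<string, object> seen = new Dictionary<string, object>();
List<string> myList = new List<string>();

// During list generation:
if (!seen.ContainsKey(currentItem))
{
    seen.Add(currentItem, null);   // remember the item: O(1) amortized
    myList.Add(currentItem);       // duplicates never make it into the list
}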
stevemegson
Luckily, scale isn't a concern in this particular situation
greggorob64
Steve, wouldn't the check be log(n) and not 1? That would be the same as using a SortedList over a List.
John
Dictionaries are constant time lookup. Log(n) is insertion time.
Alan
Oops; looks like I was a little confused on how hashtables work
John
+6  A: 

You could do the following.

List<string> list = GetTheList();
// Dictionary stands in for a HashSet in .NET 2.0: it records the strings seen so far.
Dictionary<string,object> map = new Dictionary<string,object>();
int i = 0;
while ( i < list.Count ) {
  string current = list[i];
  if ( map.ContainsKey(current) ) {
    // Duplicate: remove it; the next element shifts into slot i and is checked on the next pass.
    list.RemoveAt(i);
  } else {
    // First occurrence: record it and move on.
    i++;
    map.Add(current,null);
  }
}

This has the overhead of building a Dictionary<TKey,TValue> object, which will duplicate the set of unique values in the list. But it's fairly efficient speed-wise.

JaredPar
+1 The first thing that popped into mind was to compare each value to every other while removing duplicates as they're found, but the complexity of that is N^2. Jared's solution is much nicer since using a Dictionary data structure makes use of hashing and therefore gives very fast lookups. Complexity = N(log N)?
Paul Sasik
If speed matters, you'd be better off creating a new list of the unique values rather than removing the duplicates from the original list, since RemoveAt is O(n) but Add is O(1) when you know the maximum length in advance.
stevemegson
+1  A: 

I'm no Comp Sci PhD, but I'd imagine using a dictionary, with the items in your list as the keys, would be fast.

Since a dictionary doesn't allow duplicate keys, you'd only have unique strings at the end of iteration.
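
A rough sketch of that idea (names are illustrative; input is the original List<string>, and note the original ordering is not guaranteed to be preserved):

// Collect the strings as dictionary keys; a duplicate key is simply skipped.
Dictionary<string, object> keys = new Dictionary<string, object>();
foreach (string s in input)
{
    if (!keys.ContainsKey(s))
    {
        keys.Add(s, null);
    }
}

// The unique strings are now the dictionary's keys.
List<string> unique = new List<string>(keys.Keys);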

Alan
A: 

An easy expression that works in C# 2.0 to remove any duplicate strings from a list of strings would be:

oStrings.RemoveAll(delegate(string sTemp) { return oStrings.IndexOf(sTemp) != oStrings.LastIndexOf(sTemp); });
Stevo3000
-1: Not a big fan of O(2n^2) algorithms.
Juliet
Unfortunately, in-line delegates are not allowed in C++, but I'll give this a try and see how it works out.
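
For what it's worth, the same predicate can live in a named method so no inline delegate is needed; a rough C# sketch (the wrapper class and its names are made up for illustration):

// Holds a reference to the list so the predicate can be an ordinary named method.
class DuplicateChecker
{
    private List<string> _list;

    public DuplicateChecker(List<string> list) { _list = list; }

    // Same test as the inline delegate above; still O(n) per check.
    public bool IsDuplicate(string s)
    {
        return _list.IndexOf(s) != _list.LastIndexOf(s);
    }
}

// Usage:
// DuplicateChecker checker = new DuplicateChecker(oStrings);
// oStrings.RemoveAll(checker.IsDuplicate);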
greggorob64
@Juliet - There is a difference between not liking something and it being worth a down vote! This answered the question with working code! I haven't seen you offer a more efficient solution within the constraints that were given!
Stevo3000
+1  A: 

Just remember, when providing a custom class, to override the Equals() method in order for Contains() to function as required.

Example

List<CustomClass> clz = new List<CustomClass>();

public class CustomClass{

    public override bool Equals(object obj){
        //Put equality comparison code here...
    }

    // Override GetHashCode() as well whenever Equals() is overridden,
    // otherwise hash-based containers such as Dictionary misbehave.
    public override int GetHashCode(){
        //...
    }
}
Koekiebox
+1  A: 

If you're going the route of "just don't add duplicates", then checking "List.Contains" before adding an item works, but it's O(n^2) where n is the number of strings you want to add. It's no different from your current solution using two nested loops.

You'll have better luck using a hashset to store items you've already added, but since you're using .NET 2.0, a Dictionary can substitute for a hash set:

static List<T> RemoveDuplicates<T>(List<T> input)
{
    List<T> result = new List<T>(input.Count);
    // Dictionary used only for its keys, as a stand-in for HashSet<T> in .NET 2.0.
    Dictionary<T, object> hashSet = new Dictionary<T, object>();
    foreach (T s in input)
    {
        if (!hashSet.ContainsKey(s))
        {
            // First time this item has been seen: keep it and remember it.
            result.Add(s);
            hashSet.Add(s, null);
        }
    }
    return result;
}

This runs in O(n) and uses O(2n) space; it will generally work very well for up to 100K items. Actual performance depends on the average length of the strings -- if you really need maximum performance, you can exploit some more powerful data structures like tries to make inserts even faster.

Juliet
HashSets are .NET 3.5+, which is out of the scope of this question.
greggorob64
My code doesn't use HashSet; it uses a dictionary which substitutes for a HashSet.
Juliet
I should have read your code more thoroughly; I just saw the word HashSet and skipped over it.
greggorob64