I'm having issues finding the most efficient way to remove duplicates from a list of strings (List&lt;string&gt;).

My current implementation is a dual foreach loop that checks whether each object occurs only once, and removes the second occurrence otherwise.
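
Roughly, it looks something like this (a simplified sketch rather than the exact code; index-based loops stand in for the foreach pair, since the list is modified while scanning):

static void RemoveDuplicates(List<string> list)
{
    for (int i = 0; i < list.Count; i++)
    {
        // Walk backwards so RemoveAt doesn't shift the indexes still to be visited.
        for (int j = list.Count - 1; j > i; j--)
        {
            if (list[j] == list[i])
            {
                list.RemoveAt(j);
            }
        }
    }
}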

I know there are MANY other questions out there, but all the best solutions require something above .NET 2.0, which is the current build environment I'm working in. (GM and Chrysler are very resistant to changes ... :) )

This limits the possible solutions by not allowing any LINQ or HashSets.

The code I'm using is Visual C++, but a C# solution will work just fine as well.

Thanks!

+10  A: 

This probably isn't what you're looking for, but if you have control over this, the most efficient way would be to not add them in the first place...

Do you have control over this? If so, all you'd need to do is a myList.Contains(currentItem) check before you add the item, and you're set.
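
A minimal sketch of that guard, using the myList and currentItem names from above:

// Skip the add if the item is already in the list.
// Note: List<T>.Contains is a linear scan, so each check is O(n).
if (!myList.Contains(currentItem))
{
    myList.Add(currentItem);
}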

John
Hah, I never thought about that, I do have control over the initial list generation!
greggorob64
LOL. that is WIN!
Alan
Be aware this approach doesn't scale very well as the size of the list increases...
Lee
If size is a concern, I'd think you'd be fine using the same method as above, but with a SortedList as opposed to a standard List
John
It's O(n^2) since List<T>.Contains is O(n). You need to borrow Jared's dictionary to keep track of the items that you've added, giving O(1) checks and O(n) overall.
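
Something like this (illustrative names; the dictionary plays the role of a set):

Dictionary<string, object> seen = new Dictionary<string, object>();
List<string> myList = new List<string>();

// During list generation:
if (!seen.ContainsKey(currentItem))
{
    seen.Add(currentItem, null);   // remember the item: O(1) amortized
    myList.Add(currentItem);       // duplicates never make it into the list
}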
stevemegson
Luckily, scale isn't a concern in this particular situation
greggorob64
Steve, wouldn't the check be log(n) and not 1? That would be the same as using a SortedList over a List.
John
Dictionaries are constant time lookup. Log(n) is insertion time.
Alan
Oops; looks like I was a little confused on how hashtables work
John
+6  A: 

You could do the following.

List<string> list = GetTheList();
// Dictionary stands in for a HashSet in .NET 2.0: it records the strings seen so far.
Dictionary<string,object> map = new Dictionary<string,object>();
int i = 0;
while ( i < list.Count ) {
  string current = list[i];
  if ( map.ContainsKey(current) ) {
    // Duplicate: remove it; the next element shifts into slot i and is checked on the next pass.
    list.RemoveAt(i);
  } else {
    // First occurrence: record it and move on.
    i++;
    map.Add(current,null);
  }
}

This has the overhead of building a Dictionary<TKey,TValue> object, which will duplicate the set of unique values in the list. But it's fairly efficient speed-wise.

JaredPar
+1 The first thing that popped into mind was to compare each value to every other while removing duplicates as they're found, but the complexity of that is N^2. Jared's solution is much nicer since using a Dictionary data structure makes use of hashing and therefore gives very fast lookups. Complexity = N(log N)?
Paul Sasik
If speed matters, you'd be better off creating a new list of the unique values rather than removing the duplicates from the original list, since RemoveAt is O(n) but Add is O(1) when you know the maximum length in advance.
stevemegson
+1  A: 

I'm no Comp Sci PhD, but I'd imagine using a dictionary, with the items in your list as the keys, would be fast.

Since a dictionary doesn't allow duplicate keys, you'd only have unique strings at the end of iteration.
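
A rough sketch of that idea (names are illustrative; input is the original List<string>, and note the original ordering is not guaranteed to be preserved):

// Collect the strings as dictionary keys; a duplicate key is simply skipped.
Dictionary<string, object> keys = new Dictionary<string, object>();
foreach (string s in input)
{
    if (!keys.ContainsKey(s))
    {
        keys.Add(s, null);
    }
}

// The unique strings are now the dictionary's keys.
List<string> unique = new List<string>(keys.Keys);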

Alan
A: 

An easy expression that works in C# 2.0 to remove any duplicate strings from a list of strings would be:

oStrings.RemoveAll(delegate(string sTemp) { return oStrings.IndexOf(sTemp) != oStrings.LastIndexOf(sTemp); });
Stevo3000
-1: Not a big fan of O(2n^2) algorithms.
Juliet
Unfortunately, in-line delegates are not allowed in C++, but I'll give this a try and see how it works out.
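
For what it's worth, the same predicate can live in a named method so no inline delegate is needed; a rough C# sketch (the wrapper class and its names are made up for illustration):

// Holds a reference to the list so the predicate can be an ordinary named method.
class DuplicateChecker
{
    private List<string> _list;

    public DuplicateChecker(List<string> list) { _list = list; }

    // Same test as the inline delegate above; still O(n) per check.
    public bool IsDuplicate(string s)
    {
        return _list.IndexOf(s) != _list.LastIndexOf(s);
    }
}

// Usage:
// DuplicateChecker checker = new DuplicateChecker(oStrings);
// oStrings.RemoveAll(checker.IsDuplicate);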
greggorob64
@Juliet - There is a difference between not liking something and it being worth a down vote! This answered the question with working code! I haven't seen you offer a more efficient solution within the constraints that were given!
Stevo3000
+1  A: 

Just remember, when providing a custom class, to override the Equals() method in order for Contains() to function as required.

Example

List<CustomClass> clz = new List<CustomClass>();

public class CustomClass{

    public override bool Equals(object obj){
        //Put equality comparison code here...
    }

    // Override GetHashCode() as well whenever Equals() is overridden,
    // otherwise hash-based containers such as Dictionary misbehave.
    public override int GetHashCode(){
        //...
    }
}
Koekiebox
+1  A: 

If you're going the route of "just don't add duplicates", then checking "List.Contains" before adding an item works, but it's O(n^2) where n is the number of strings you want to add. It's no different from your current solution using two nested loops.

You'll have better luck using a hashset to store items you've already added, but since you're using .NET 2.0, a Dictionary can substitute for a hash set:

static List<T> RemoveDuplicates<T>(List<T> input)
{
    List<T> result = new List<T>(input.Count);
    // Dictionary used only for its keys, as a stand-in for HashSet<T> in .NET 2.0.
    Dictionary<T, object> hashSet = new Dictionary<T, object>();
    foreach (T s in input)
    {
        if (!hashSet.ContainsKey(s))
        {
            // First time this item has been seen: keep it and remember it.
            result.Add(s);
            hashSet.Add(s, null);
        }
    }
    return result;
}

This runs in O(n) and uses O(2n) space; it will generally work very well for up to 100K items. Actual performance depends on the average length of the strings -- if you really need maximum performance, you can exploit some more powerful data structures like tries to make inserts even faster.

Juliet
HashSets are .NET 3.5+, which is out of the scope of this question.
greggorob64
My code doesn't use HashSet; it uses a dictionary which substitutes for a HashSet.
Juliet
I should have read your code more thoroughly; I just saw the word HashSet and skipped over it.
greggorob64