views:

1016

answers:

5

I am trying to do what I think is a "de-intersect" (I'm not sure what the proper name is, but that's what Tim Sweeney of EpicGames called it in the old UnrealEd)

// foo and bar have some identical elements (given a case-insensitive match)
List‹string› foo = GetFoo();
List‹string› bar = GetBar();

// remove non matches
foo = foo.Where(x => bar.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();
bar = bar.Where(x => foo.Contains(x, StringComparer.InvariantCultureIgnoreCase)).ToList();

Then later on, I do another thing where I subtract the result from the original, to see which elements I removed. That's super-fast using .Except(), so no troubles there.

There must be a faster way to do this, because this one is pretty bad-performing with ~30,000 elements (of string) in either List. Preferably, a method to do this step and the one later on in one fell swoop would be nice. I tried using .Exists() instead of .Contains(), but it's slightly slower. I feel a bit thick, but I think it should be possible with some combination of .Except() and .Intersect() and/or .Union().

+3  A: 

This operation can be called a symmetric difference.

You need a different data structure, like a hash table. Add the intersection of both sets to it, then difference the intersection from each set.

UPDATE:

I got a bit of time to try this in code. I used HashSet<T> with a set of 50,000 strings, from 2 to 10 characters long with the following results:

Original: 79499 ms

Hashset: 33 ms

BTW, there is a method on HashSet called "SymmetricExceptWith" which I thought would do the work for me, but it actually adds the different elements from both sets to the set the method is called on. Maybe this is what you want, and the code would be more elegant.

Here is the code:

namespace ListSetTiming
{
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // foo and bar have some identical elements (given a case-insensitive match)
        List<string> foo = GetFoo();
        List<string> bar = GetBar();

        Stopwatch timer = new Stopwatch();

        timer.Start();
        // remove non matches
        var f = foo.Where(x => !bar.Contains(x)).ToList();
        var b = bar.Where(x => !foo.Contains(x)).ToList();
        timer.Stop();

        Debug.WriteLine(String.Format("Original: {0} ms", timer.ElapsedMilliseconds));

        timer.Reset();

        timer.Start();
        HashSet<String> intersect = new HashSet<String>(foo);
        intersect.IntersectWith(bar);

        HashSet<String> fSet = new HashSet<String>(foo);
        HashSet<String> bSet = new HashSet<String>(bar);

        fSet.ExceptWith(intersect);
        bSet.ExceptWith(intersect);
        timer.Stop();

        HashSet<String> fCheck = new HashSet<String>(f);
        HashSet<String> bCheck = new HashSet<String>(b);

        Debug.WriteLine(String.Format("Hashset: {0} ms", timer.ElapsedMilliseconds));

        Console.WriteLine("Sets equal? {0} {1}", fSet.SetEquals(fCheck), bSet.SetEquals(bCheck)); //bSet.SetEquals(set));
        Console.ReadKey();
    }

    static Random _rnd = new Random();

    private static List<String> GetBar()
    {
        return getRandomStrings();
    }

    private static List<String> GetFoo()
    {
        return getRandomStrings();
    }

    private const int Count = 50000;

    private static List<string> getRandomStrings() 
    {
        List<String> strings = new List<String>(Count);

        Char[] chars = new Char[10];

        for (int i = 0; i < Count; i++)
        {
            Int32 len = _rnd.Next(2, 10);

            for (int j = 0; j < len; j++)
            {
                Char c = (Char)_rnd.Next('a', 'z');
                chars[j] = c;
            }

            strings.Add(new String(chars, 0, len));
        }

        return strings;
    }
}

}

codekaizen
A: 

Contains on a list is an O(N) operation. If you had a different data structure, such as a sorted list or a Dictionary, you would dramatically reduce your time. Accessing a key in a sorted list is usually O(log N) time, and in a hash is usually O(1) time.

Robert P
+1  A: 

If the elements are unique within each list you should consider using an HashSet

The HashSet(T) class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

Luca Martinetti
+3  A: 

With intersect it would be done like this:

var matches = ((from f in foo 
                select f)
              .Intersect(
                  from b in bar 
                  select b, StringComparer.InvariantCultureIgnoreCase))
gcores
Wow, brilliant. 145 ms instead of 40 seconds is pretty good when processing two lists with ~28,000 entries each, I'd say.Maybe I'd gain more by using a Dictionary, but I'm perfectly happy with this!
J F
What's wrong with "var matches = foo.Intersect( bar, StringComparer.InvariantCultureIgnoreCase );"? No empty select needed.
Emperor XLII
@Emperor XLII: Nothing, that's a good way of writing it :)
gcores
+1  A: 

With sorted list, you can use binary search.