views: 193
answers: 5

I have the following arrays:

var original = new int[] { 2, 1, 3 };
var target = new int[] { 1, 3, 4 };
enum Operation { Added, Removed }

I would like to execute a LINQ query that would return the following:

{{2,Removed},{4,Added}}

Limitation: I would like LINQ to perform this efficiently and avoid any O(n^2)-style algorithms.

A: 

You are out of luck. If, as you stated in the comments, the lists are not sorted, you can't compute the difference you seek in a single forward pass. Consider:

{ 1, 2, 3, 4, 5, 6, 7, ...
{ 1, 2, 3, 6, 7, 8, 9, ...

At the point where the first difference is encountered (4 vs. 6), it's impossible to determine whether you are looking at the removal of 4 & 5 (as would be the case if both lists were monotonically increasing) or the insertion of 6, 7, 8, & 9 (as would be the case if the lists continued like so):

{ 1, 2, 3,             4, 5, 6, 7, 8, 9,...
{ 1, 2, 3, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9,...
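
(By contrast, if both lists are known to be sorted ascending, a single forward pass does work. A minimal sketch, assuming sorted, duplicate-free inputs and the Operation enum from the question; the SortedDiff name is invented for illustration:)

// Sketch only: assumes both inputs are sorted ascending and contain no duplicates.
// Requires System.Collections.Generic and the Operation enum from the question.
static IEnumerable<KeyValuePair<int, Operation>> SortedDiff(int[] original, int[] target)
{
    int i = 0, j = 0;
    while (i < original.Length && j < target.Length)
    {
        if (original[i] == target[j]) { i++; j++; }   // present in both, not part of the diff
        else if (original[i] < target[j])
            yield return new KeyValuePair<int, Operation>(original[i++], Operation.Removed);
        else
            yield return new KeyValuePair<int, Operation>(target[j++], Operation.Added);
    }
    while (i < original.Length)   // leftovers exist only in original
        yield return new KeyValuePair<int, Operation>(original[i++], Operation.Removed);
    while (j < target.Length)     // leftovers exist only in target
        yield return new KeyValuePair<int, Operation>(target[j++], Operation.Added);
}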
MarkusQ
sorry for changing the requirements ...
Sam Saffron
+3  A: 

Perhaps a LINQ solution is not the best option in this case.

This will produce a dictionary with the result that you want.

Dictionary<int, Operation> difference = new Dictionary<int, Operation>();
// Start by assuming every original value was removed...
foreach (int value in original) {
    difference.Add(value, Operation.Removed);
}
// ...then cancel out the values that also appear in target,
// and mark the rest of target as added.
foreach (int value in target) {
    if (difference.ContainsKey(value)) {
        difference.Remove(value);
    } else {
        difference.Add(value, Operation.Added);
    }
}

To keep the size of the dictionary down, perhaps it's possible to loop over the two enumerations in parallel. I'll have a look at that...

Edit:
Here it is:

Dictionary<int, Operation> difference = new Dictionary<int, Operation>();
// Walk both sequences in lockstep so the dictionary only holds
// the values that have not (yet) been matched by the other side.
IEnumerator<int> o = ((IEnumerable<int>)original).GetEnumerator();
IEnumerator<int> t = ((IEnumerable<int>)target).GetEnumerator();
bool oActive = true, tActive = true;
while (oActive || tActive) {
    if (oActive && (oActive = o.MoveNext())) {
        if (difference.ContainsKey(o.Current)) {
            difference.Remove(o.Current);
        } else {
            difference.Add(o.Current, Operation.Removed);
        }
    }
    if (tActive && (tActive = t.MoveNext())) {
        if (difference.ContainsKey(t.Current)) {
            difference.Remove(t.Current);
        } else {
            difference.Add(t.Current, Operation.Added);
        }
    }
}

Edit 2:
I did some performance testing. The first version runs 10%-20% faster, both with sorted lists and randomly ordered lists.

I made lists with numbers from 1 to 100000, randomly skipping 10% of the numbers. On my machine the first version of the code matches the lists in about 16 ms.
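
(The exact benchmark harness isn't shown; presumably the test lists were built along these lines, with a Random deciding which ~10% of values to skip. This is an assumption for illustration, not the answerer's actual code:)

// Hypothetical test-data generation: keep each number from 1 to 100000
// with roughly 90% probability. Requires System and System.Collections.Generic.
Random rnd = new Random();
List<int> list = new List<int>();
for (int n = 1; n <= 100000; n++)
    if (rnd.NextDouble() >= 0.1)
        list.Add(n);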

Guffa
Yep, it's pretty much the solution I'm going for, but I am going to try and encapsulate it nicely in an extension method
Sam Saffron
Right... To make it look like it's LINQ... ;)
Guffa
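
(For reference, a sketch of what such an extension-method wrapper might look like, built on the first dictionary version above. The DiffExtensions/Diff names are invented here, not taken from the thread:)

// Sketch only; requires System.Collections.Generic and the Operation enum from the question.
static class DiffExtensions
{
    public static Dictionary<int, Operation> Diff(this IEnumerable<int> original, IEnumerable<int> target)
    {
        var difference = new Dictionary<int, Operation>();
        foreach (int value in original)
            difference.Add(value, Operation.Removed);   // assume removed until seen in target
        foreach (int value in target)
        {
            if (difference.ContainsKey(value))
                difference.Remove(value);               // present in both: not part of the diff
            else
                difference.Add(value, Operation.Added);
        }
        return difference;
    }
}

// Usage: var diff = original.Diff(target);
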
+1  A: 
enum Operation { Added, Removed, }

static void Main(string[] args)
{
    var original = new int[] { 2, 1, 3 };
    var target = new int[] { 1, 3, 4 };

    var result = original.Except(target)
        .Select(i => new { Value = i, Operation = Operation.Removed, })
        .Concat(
            target.Except(original)
            .Select(i => new { Value = i, Operation = Operation.Added, })
            );

    foreach (var item in result)
        Console.WriteLine("{0}, {1}", item.Value, item.Operation);
}

I don't think you can do this with LINQ in a single pass using only the stock LINQ extension methods, but you might be able to code a custom extension method that will. Your trade-off will likely be the loss of deferred execution. It would be interesting to compare the relative performance of both.

Andrew Robinson
This is about 30% slower than the dictionary in my benchmark, but it is much more concise
Sam Saffron
A: 

This will achieve the result in a single pass; however, I'm not sure of the complexity of the GroupBy operation.

var original = new int[] { 1, 2, 3 };
var target = new int[] { 1, 3, 4 };

// Tag each value with the list it came from, group by value,
// and keep only the values that appear in exactly one list.
var output = original.Select( i => new { I = i, L = "o" } )
    .Concat( target.Select( i => new { I = i, L = "t" } ) )
    .GroupBy( i => i.I ).Where( i => i.Count() == 1 )
    .Select( i => new { I = i.Key, S = (i.ElementAt( 0 ).L == "o" ? Operation.Removed : Operation.Added) } );
bstoney
Seems like the group by stuff does not scale that well. This is about 8x slower than the dictionary solution.
Sam Saffron
I was worried that might be the case.
bstoney
How many items are you benchmarking against?
bstoney
Enumerable.Range(5, 10000); vs Enumerable.Range(0, 10000);
Sam Saffron
+1  A: 

Use a HashSet<int> and compare the performance with the other solutions.

// DiffResult and DiffOperation are the answerer's own types, along the lines of:
// enum DiffOperation { Add, Remove }
// class DiffResult { public int Number; public DiffOperation Operation; }

HashSet<int> originalSet = new HashSet<int> { 1, 2, 3 };
HashSet<int> targetSet = new HashSet<int> { 1, 3, 4 };

var removed = from x in originalSet.Except(targetSet)
              select new DiffResult { Number = x, Operation = DiffOperation.Remove };

var added = from y in targetSet.Except(originalSet)
            select new DiffResult { Number = y, Operation = DiffOperation.Add };

var result = removed.Union(added);

If the data container has to be an int array, then pass it to the HashSet constructor, i.e.

HashSet<int> originalSet = new HashSet<int>(originalArray);
Hasan Khan