views: 193
answers: 5

I have the following arrays:

var original = new int[] { 2, 1, 3 };
var target = new int[] { 1, 3, 4 };
enum Operation { Added, Removed }

I would like to execute a LINQ query that would return the following:

{{2,Removed},{4,Added}}

Limitation: I would like LINQ to perform this efficiently and avoid any O(n^2)-style algorithms.

A: 

You are out of luck. If, as you stated in the comments, the lists are not sorted, you can't compute the difference you seek in a single forward pass. Consider:

{ 1, 2, 3, 4, 5, 6, 7, ...
{ 1, 2, 3, 6, 7, 8, 9, ...

At the point where the first difference is encountered (4 vs. 6), it's impossible to determine whether you are looking at the removal of 4 & 5 (as would be the case if both lists were monotonically increasing) or the insertion of 6, 7, 8, & 9 (as would be the case if the lists continued like so):

{ 1, 2, 3,             4, 5, 6, 7, 8, 9,...
{ 1, 2, 3, 6, 7, 8, 9, 4, 5, 6, 7, 8, 9,...
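
(By contrast, if both lists are known to be sorted ascending, a single forward pass does work. A minimal sketch, assuming sorted, duplicate-free inputs and the Operation enum from the question; the SortedDiff name is invented for illustration:)

// Sketch only: assumes both inputs are sorted ascending and contain no duplicates.
// Requires System.Collections.Generic and the Operation enum from the question.
static IEnumerable<KeyValuePair<int, Operation>> SortedDiff(int[] original, int[] target)
{
    int i = 0, j = 0;
    while (i < original.Length && j < target.Length)
    {
        if (original[i] == target[j]) { i++; j++; }   // present in both, not part of the diff
        else if (original[i] < target[j])
            yield return new KeyValuePair<int, Operation>(original[i++], Operation.Removed);
        else
            yield return new KeyValuePair<int, Operation>(target[j++], Operation.Added);
    }
    while (i < original.Length)   // leftovers exist only in original
        yield return new KeyValuePair<int, Operation>(original[i++], Operation.Removed);
    while (j < target.Length)     // leftovers exist only in target
        yield return new KeyValuePair<int, Operation>(target[j++], Operation.Added);
}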
MarkusQ
sorry for changing the requirements ...
Sam Saffron
+3  A: 

Perhaps a LINQ solution is not the best option in this case.

This will produce a dictionary with the result that you want.

Dictionary<int, Operation> difference = new Dictionary<int, Operation>();
// Start by assuming every original value was removed...
foreach (int value in original) {
    difference.Add(value, Operation.Removed);
}
// ...then cancel out the values that also appear in target,
// and mark the rest of target as added.
foreach (int value in target) {
    if (difference.ContainsKey(value)) {
        difference.Remove(value);
    } else {
        difference.Add(value, Operation.Added);
    }
}

To keep the size of the dictionary down, perhaps it's possible to loop over the two enumerations in parallel. I'll have a look at that...

Edit:
Here it is:

Dictionary<int, Operation> difference = new Dictionary<int, Operation>();
// Walk both sequences in lockstep so the dictionary only holds
// the values that have not (yet) been matched by the other side.
IEnumerator<int> o = ((IEnumerable<int>)original).GetEnumerator();
IEnumerator<int> t = ((IEnumerable<int>)target).GetEnumerator();
bool oActive = true, tActive = true;
while (oActive || tActive) {
    if (oActive && (oActive = o.MoveNext())) {
        if (difference.ContainsKey(o.Current)) {
            difference.Remove(o.Current);
        } else {
            difference.Add(o.Current, Operation.Removed);
        }
    }
    if (tActive && (tActive = t.MoveNext())) {
        if (difference.ContainsKey(t.Current)) {
            difference.Remove(t.Current);
        } else {
            difference.Add(t.Current, Operation.Added);
        }
    }
}

Edit 2:
I did some performance testing. The first version runs 10%-20% faster, both with sorted lists and randomly ordered lists.

I made lists with numbers from 1 to 100000, randomly skipping 10% of the numbers. On my machine the first version of the code matches the lists in about 16 ms.
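
(The exact benchmark harness isn't shown; presumably the test lists were built along these lines, with a Random deciding which ~10% of values to skip. This is an assumption for illustration, not the answerer's actual code:)

// Hypothetical test-data generation: keep each number from 1 to 100000
// with roughly 90% probability. Requires System and System.Collections.Generic.
Random rnd = new Random();
List<int> list = new List<int>();
for (int n = 1; n <= 100000; n++)
    if (rnd.NextDouble() >= 0.1)
        list.Add(n);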

Guffa
Yep, it's pretty much the solution I'm going for, but I am going to try and encapsulate it nicely in an extension method
Sam Saffron
Right... To make it look like it's LINQ... ;)
Guffa
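
(For reference, a sketch of what such an extension-method wrapper might look like, built on the first dictionary version above. The DiffExtensions/Diff names are invented here, not taken from the thread:)

// Sketch only; requires System.Collections.Generic and the Operation enum from the question.
static class DiffExtensions
{
    public static Dictionary<int, Operation> Diff(this IEnumerable<int> original, IEnumerable<int> target)
    {
        var difference = new Dictionary<int, Operation>();
        foreach (int value in original)
            difference.Add(value, Operation.Removed);   // assume removed until seen in target
        foreach (int value in target)
        {
            if (difference.ContainsKey(value))
                difference.Remove(value);               // present in both: not part of the diff
            else
                difference.Add(value, Operation.Added);
        }
        return difference;
    }
}

// Usage: var diff = original.Diff(target);
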
+1  A: 
enum Operation { Added, Removed, }

static void Main(string[] args)
{
    var original = new int[] { 2, 1, 3 };
    var target = new int[] { 1, 3, 4 };

    var result = original.Except(target)
        .Select(i => new { Value = i, Operation = Operation.Removed, })
        .Concat(
            target.Except(original)
            .Select(i => new { Value = i, Operation = Operation.Added, })
            );

    foreach (var item in result)
        Console.WriteLine("{0}, {1}", item.Value, item.Operation);
}

I don't think you can do this with LINQ in a single pass using only the stock LINQ extension methods, but you might be able to code a custom extension method that will. Your trade-off will likely be the loss of deferred execution. It would be interesting to compare the relative performance of both.

Andrew Robinson
This is about 30% slower than the dictionary in my benchmark, but it is much more concise
Sam Saffron
A: 

This will achieve the result in a single pass; however, I'm not sure of the complexity of the GroupBy operation.

var original = new int[] { 1, 2, 3 };
var target = new int[] { 1, 3, 4 };

// Tag each value with the list it came from, group by value,
// and keep only the values that appear in exactly one list.
var output = original.Select( i => new { I = i, L = "o" } )
    .Concat( target.Select( i => new { I = i, L = "t" } ) )
    .GroupBy( i => i.I ).Where( i => i.Count() == 1 )
    .Select( i => new { I = i.Key, S = (i.ElementAt( 0 ).L == "o" ? Operation.Removed : Operation.Added) } );
bstoney
Seems like the group by stuff does not scale that well. This is about 8x slower than the dictionary solution.
Sam Saffron
I was worried that might be the case.
bstoney
How many items are you benchmarking against?
bstoney
Enumerable.Range(5, 10000); vs Enumerable.Range(0, 10000);
Sam Saffron
+1  A: 

Use a HashSet<int> and compare the performance with the other solutions.

// DiffResult and DiffOperation are the answerer's own types, along the lines of:
// enum DiffOperation { Add, Remove }
// class DiffResult { public int Number; public DiffOperation Operation; }

HashSet<int> originalSet = new HashSet<int> { 1, 2, 3 };
HashSet<int> targetSet = new HashSet<int> { 1, 3, 4 };

var removed = from x in originalSet.Except(targetSet)
              select new DiffResult { Number = x, Operation = DiffOperation.Remove };

var added = from y in targetSet.Except(originalSet)
            select new DiffResult { Number = y, Operation = DiffOperation.Add };

var result = removed.Union(added);

If the data container has to be an int array, then pass it to the HashSet constructor, i.e.

HashSet<int> originalSet = new HashSet<int>(originalArray);
Hasan Khan