views: 173
answers: 5

Hi, I would like to implement an efficient algorithm to handle the following case:

Let's assume we have two lists with the following elements:

Source: [a,b,c,d,e] New: [d,e,f,g]

Now I have to update Source with the new information. The algorithm should be able to find that 'f' and 'g' are new entries, that 'a', 'b' and 'c' have been removed, and that 'd' and 'e' have not been modified.

The operations involved are set-intersect operations between Source and New, and vice versa. I am looking for an efficient algorithm to implement in C# for arbitrary non-sorted enumerations.

Thanks in advance,

A: 

I think what you are looking for are set operations, i.e. union, etc. Take a look at this article: http://srtsolutions.com/public/item/251070

Chuck Conway
A union here would be [a, b, c, d, e, f, g]. About the only operation on the two that wasn't asked for.
Jon Hanna
That's fine, the Linq link I provided covered: Union, Distinct, Intersect and Except.
Chuck Conway
Yep. Don't think this deserves the downvote.
Jon Hanna
Agree with Jon Hanna, +1 to leave it at 0.
Red Knight
+5  A: 
var added = New.Except(Source);
var removed = Source.Except(New);
var notModified = Source.Intersect(New);

If you want to have an approach where you "show your workings", I'd suggest putting them each into HashSets, as that allows for a fast Contains check, compared with other enumerations.
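
For instance, a minimal sketch of that approach, assuming Source and New are IEnumerable<T> of a type with sensible Equals/GetHashCode, and with System.Linq in scope:

// Build each set once (O(n)); every Contains afterwards is roughly O(1).
var sourceSet = new HashSet<T>(Source);
var newSet = new HashSet<T>(New);

var added = New.Where(item => !sourceSet.Contains(item)).ToList();
var removed = Source.Where(item => !newSet.Contains(item)).ToList();
var notModified = Source.Where(item => newSet.Contains(item)).ToList();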

Edit:

Okay, if we're going for total speed at the cost of efficiency of expression, then with the following assumptions:

  1. We have a reasonably hash-able type of item (if not, but they can be absolutely sorted, then a SortedList might beat a hash-set).
  2. We cannot predict whether Source or New will be larger (in the example, there's a slight advantage to doing this the other way around to how I have it, but I'm assuming that is just by chance in the data and that we have to expect each with equal likelihood).

Then I would suggest:

HashSet<T> removed = Source as HashSet<T> ?? new HashSet<T>(Source);
LinkedList<T> added = new LinkedList<T>();
LinkedList<T> notModified = new LinkedList<T>();
foreach(T item in New)
    if(removed.Remove(item))
        notModified.AddLast(item);
    else
        added.AddLast(item);

In setting up removed I test whether it's already a hash set, to avoid wastefully building a new one (I assume the input is typed as IEnumerable<T>). Of course, reusing the caller's set makes this a destructive operation, so we may wish to avoid it anyway.

Note also that I modify the hash set while enumerating through it. HashSet allows this, but it's outside the guarantees given by the enumerators, so it is implementation-dependent. Still, with the current framework implementation it's more efficient to do so than to test and add to a separate removed collection.

I went for linked-lists for the two other collections, as they tend to come out well in terms of insertion cost (not just O(1), but a fast O(1) compared to using another set).

Now, if you want to go further still, there're probably micro-optimisations available in the implementation of hash-set if you roll your own.

Jon Hanna
Hi Hanna, the problem with this approach is that the algorithm performs 3 scans over the lists. When both lists are big enough, there is no reuse of known information between the passes. Nonetheless you have +1 because your solution is simpler than my current, not-efficient-enough version.
Red Knight
@Red Knight, I still like my first for most uses but I've added another approach.
Jon Hanna
@Jon Hanna, yep, for most uses the first is the way to go (reason for the +1 :) ). This version summarizes both the most common (not optimized) case and @supercat's approach. As the response is an IEnumerable<T>, the use of LinkedList<T> is a good decision, but it impacts tight loops. You can also use List<T> if needed, setting the capacity beforehand (we know that the size is bounded by the biggest collection size * 1.5, and we get contiguous allocation in memory). I will have to profile the cases to find the tipping point, but this version is good enough to give it a try.
Red Knight
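
As a rough sketch of that List<T> suggestion (assuming New is an ICollection<T> so Count is cheap to read; the capacity choices below are upper bounds, not part of the comment above):

// Same single pass as above, but with pre-sized List<T> outputs so the
// backing arrays are allocated contiguously up front.
HashSet<T> removed = Source as HashSet<T> ?? new HashSet<T>(Source);
List<T> added = new List<T>(New.Count);        // upper bound: everything in New could be new
List<T> notModified = new List<T>(New.Count);  // upper bound: everything in New could be unchanged
foreach (T item in New)
    if (removed.Remove(item))
        notModified.Add(item);
    else
        added.Add(item);
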
@Red Knight, also, if you know you're going to want a particular enumeration type because the next step uses it, it would probably be fastest to build that type directly, even if it is slower for insertion.
Jon Hanna
+1  A: 

You might use the set operations available in Linq.

string[] list1 = { "a","b","c","d","e"};
string[] list2 = { "d", "e", "f", "g" };

string[] newElements = list2.Except(list1).ToArray();
string[] commonElements = list2.Intersect(list1).ToArray();
string[] removedElements = list1.Except(list2).ToArray(); 

Note: The above code assumes that each of the lists is distinct, i.e. does not contain the same element more than once. For example, for the lists [a, b, c, c] and [a, b, c] the code won't detect the removed element.
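
If duplicates do matter, one option (a sketch only, not taken from the code above) is to compare element counts rather than sets:

// Count occurrences per element so repeated items are accounted for.
var oldCounts = list1.GroupBy(x => x).ToDictionary(g => g.Key, g => g.Count());
var newCounts = list2.GroupBy(x => x).ToDictionary(g => g.Key, g => g.Count());

foreach (var pair in oldCounts)
{
    int inNew;
    newCounts.TryGetValue(pair.Key, out inNew);   // leaves 0 if absent
    if (pair.Value > inNew)
        Console.WriteLine("Removed: {0} x{1}", pair.Key, pair.Value - inNew);
}

foreach (var pair in newCounts)
{
    int inOld;
    oldCounts.TryGetValue(pair.Key, out inOld);   // leaves 0 if absent
    if (pair.Value > inOld)
        Console.WriteLine("Added: {0} x{1}", pair.Key, pair.Value - inOld);
}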

0xA3
Nice example. And for those like me who like to spelunk through the framework using Reflector, it's good to know that those methods are pretty efficient as they do use set data structures internally.
Josh Einstein
@Josh: Well, they're pretty efficient because they use set data structures; but they still have to *create* those structures by copying elements from the underlying collection. So clearly this code would be faster if `list1` and `list2` were of type `HashSet<string>` to begin with.
Dan Tao
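
If you do control the input types, HashSet<T> also offers mutating set operations of its own that work against an existing set rather than building a fresh one (a sketch of that direction, not code from the answer; note they modify the set they're called on, hence the copies):

var set1 = new HashSet<string> { "a", "b", "c", "d", "e" };
var set2 = new HashSet<string> { "d", "e", "f", "g" };

var newElements = new HashSet<string>(set2);
newElements.ExceptWith(set1);        // in set2 but not set1: f, g

var removedElements = new HashSet<string>(set1);
removedElements.ExceptWith(set2);    // in set1 but not set2: a, b, c

var commonElements = new HashSet<string>(set1);
commonElements.IntersectWith(set2);  // in both: d, e
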
+3  A: 

I have not tested this for performance, but my gut feeling is that you should first sort the two lists. Then you can step through the lists, marking each removed, added or unchanged element as you progress.

1- Sort the Old and New lists
2- Set up a pointer for each list, let's call them p1 and p2
3- Step the pointers using the following algorithm
  a) If Old[p1] = New[p2] the items are unchanged, increment p1 and p2
  b) If Old[p1] < New[p2] then Old[p1] has been removed, increment p1
  c) If Old[p1] > New[p2] then New[p2] is a new element, increment p2
  d) If p1 > Old.ItemCount then break out of the loop, the rest of New contains new items
  e) If p2 > New.ItemCount then break out of the loop, the rest of the Old items have been removed
  f) If p1 < Old.ItemCount and p2 < New.ItemCount goto step **a**

That was just off the top of my head, but the basics should be correct. The key to this is that the lists are sorted of course.

Here is a quick and dirty demo. I included the sort for demonstration purposes; of course, in this case the data is already sorted.

static void Main(string[] args)
{
  string[] oldList = { "a", "b", "c", "d", "e" };
  string[] newList = { "d", "e", "f", "g" };      

  Array.Sort(oldList);
  Array.Sort(newList);

  int p1 = 0;
  int p2 = 0;

  while (p1 < oldList.Length && p2 < newList.Length)
  {
    if (string.Compare(oldList[p1], newList[p2]) == 0)
    {
      Console.WriteLine("Unchanged:\t{0}", oldList[p1]);
      p1++;
      p2++;
    }
    else if (string.Compare(oldList[p1], newList[p2]) < 0)
    {
      Console.WriteLine("Removed:\t{0}", oldList[p1]);
      p1++;
    }
    else if (string.Compare(oldList[p1], newList[p2]) > 0)
    {
      Console.WriteLine("Added:\t\t{0}", newList[p2]);
      p2++;
    }        
  }

  while (p1 < oldList.Length)
  {
    Console.WriteLine("Removed:\t{0}", oldList[p1]);
    p1++;
  }

  while (p2 < newList.Length)
  {
    Console.WriteLine("Added :\t\t{0}", newList[p2]);
    p2++;
  }

  Console.ReadKey();
}

The output from the above

Removed:        a
Removed:        b
Removed:        c
Unchanged:      d
Unchanged:      e
Added :         f
Added :         g
Chris Taylor
You've got the fastest way to do it with a list or vector-type collection (e.g. an array or `List<T>`), though hash-based or tree-based sets should be faster still. That said, with the small size of data here the list approach might not compare too badly, but it certainly would with much bigger inputs.
Jon Hanna
@Jon Hanna, sure, like I said this was not performance tested; it would be interesting. I would not wager a guess, it is the sort that potentially hurts here.
Chris Taylor
@Chris Taylor, as I told Jon, the problem with this approach is also that with big enough lists the time you spend sorting (n log(n) at best) starts to dominate. It also assumes that we have a vector-type collection, and one on which an order relationship exists. My current implementation, which is similar to @Jon Hanna's, has a running cost of 3n (order n); what I am trying to find out is whether an n or 2n solution exists (that would shave off quite a big amount of time). The solution is good nonetheless if you already have a sorted collection, though. +1
Red Knight
+1  A: 

Call the sets X and Y. If set X supports rapid lookups, and you have a convenient means of "tagging" and "untagging" items in it, you could start by tagging all the items in X, and then query X for each item in Y. If an item isn't found, the item is "new" in Y. If the item is found, it's common to both sets and you should untag it in X. Repeat for all items in Y. When you're done, any items in X that are still tagged are absent from Y, i.e. they have been "deleted".

This approach only requires one of the sets to support convenient queries and tagging. It requires querying one set for all the records in the other, and then grabbing from it all items that haven't generated hits. There is no requirement to sort either set.
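
A hash set can stand in for the tags if the collection has no tagging facility of its own; a rough sketch of the idea (the names X, Y and the result collections are placeholders, not part of the answer):

// 'tagged' starts as a copy of X, i.e. every item of X is tagged.
var tagged = new HashSet<T>(X);
var newItems = new List<T>();      // in Y but not X
var common = new List<T>();        // in both

foreach (T item in Y)
{
    if (tagged.Remove(item))       // found in X: untag it, it's common to both
        common.Add(item);
    else                           // not found in X: it's new in Y
        newItems.Add(item);
}

// Whatever is still tagged was in X but not in Y, i.e. "deleted".
var deleted = tagged;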

supercat
@supercat, even without convenient tagging in the collection itself you only have to perform 2n queries (far better than the 3n queries of the other approach). Looks promising!!!
Red Knight