views:

273

answers:

5

This is an exercise for the CS guys to shine with the theory.

Imagine you have 2 containers with elements. Folders, URLs, Files, Strings, it really doesn't matter.

What is AN algorithm to calculate the added and the removed?

Notice: If there are many ways to solve this problem, please post one per answer so it can be analysed and voted up.

Edit: All the answers solve the matter with 4 containers. Is it possible to use only the initial 2?

+1  A: 

I have not done this in a while but I believe the algorithm goes like this...

sort left-list and right-list
adds = {}
deletes = {}
get first right-item from right-list
get first left-item from left-list
while (either list has items)
  if left-item < right-item or right-list is empty
    add left-item to deletes
    get new left-item from left-list
  else if left-item > right-item or left-list is empty
    add right-item to adds
    get new right-item from right-list
  else
    get new right-item from right-list
    get new left-item from left-list

In regards to right-list's relation to left-list, deletes contains items removed and adds now contains new items.

Joe Skora
Even if I haven't tested your pseudo code into some proper code it makes sense. But It's a bit clunky, sorry...
Gustavo Carreno
+4  A: 

Assuming you have two lists of unique items, and the ordering doesn't matter, you can think of them both as sets rather than lists

If you think of a venn diagram, with list A as one circle and list B as the other, then the intersection of these two is the constant pool.

Remove all the elements in this intersection from both A and B, and and anything left in A has been deleted, whilst anything left in B has been added.

So, iterate through A looking for each item in B. If you find it, remove it from both A and B

Then A is a list of things that were deleted, and B is a list of things that were added

I think...

[edit] Ok, with the new "only 2 container" restriction, the same still holds:

foreach( A ) { 
  if( eleA NOT IN B ) {
    DELETED
  }
}
foreach( B ) {
  if( eleB NOT IN A ) {
    ADDED
  }
}

Then you aren't constructing a new list, or destroying your old ones...but it will take longer as with the previous example, you could just loop over the shorter list and remove the elements from the longer. Here you need to do both lists

An I'd argue my first solution didn't use 4 containers, it just destroyed two ;-)

tim_yates
I've accepted your answer because the Venn is something that everyone relates well. After that image one can get easily into some code.
Gustavo Carreno
I've added a constraint to the question, had to unaccept it...
Gustavo Carreno
I've added a bit to my answer... hope that's ok?
tim_yates
Indeed it is. I was looking for confirmation of looping on the shortest one like I've seen on another question.
Gustavo Carreno
HA, on your last remark: You are right. I should've mentioned that your solution never did imply 4 containers :)
Gustavo Carreno
A: 

What Joe said. And, if the lists are too large to fit into memory, use an external file sorting utility or a Merge sort.

Mark Ransom
A: 

Missing information: How do you define added/removed? E.g. if the lists (A and B) show the same directory on Server A and Server B, that is in sync. If I now wait for 10 days, generate the lists again and compare them, how can I tell if something has been removed? I cannot. I can only tell there are files on Server A not found on Server B and/or the other way round. Whether that is because a file has been added to Server A (thus the file is not found on B) or a file has been deleted on Server B (thus the file is not found on B anymore) is something I cannot determine by just having a list of file names.

For the solution I suggest, I will just assume that you have one list named OLD and one list named NEW. Everything found on OLD but not on NEW has been removed. Everything found on NEW, but not on OLD has been added (e.g. the content of the same directory on the same server, however lists have been created at different dates).

Further I will assume there are no duplicates. That means every item on either list is unique in the sense of: If I compare this item to any other item on the list (no matter how this compare works), I can always say the item is either smaller or bigger than the one I'm comparing it to, but never equal. E.g. when dealing with strings, I can compare them lexicographically and the same string is never twice in the list.

In that case the simplest (not necessarily best solution, though) is:

  1. Sort the OLD lists. E.g. if the list consists of strings, sort them alphabetically. Sorting is necessary, because it means I can use binary search to quickly find an object in the list, assuming it does exist there (or to quickly determine, it does not exist in the list at all). If the list is unsorted, finding the object has a complexity of O(n) (I need to look at every single item on the list). If the list is sorted, complexity is only O(log n), as after every try to match an item on the list I can always exclude 50% of the items on the list not being a match. Even if the list has 100 items, finding an item (or detecting that the item is not on the list) takes at most 7 tests (or is it 8? Anyway, far less than 100). The NEW list doesn't have to be sorted.

  2. Now we perform list elimination. For every item on the NEW list, try to find this item on the OLD list (using binary search). If the item is found, remove this item from the OLD list and also remove it from the NEW list. This also means the lists get smaller the further the elimination progresses and thus the lookups will become faster and faster. Since removing an item from the a list has no effect on the correct sort order of the lists, there is no need to ever resort the OLD list during the elimination phase.

  3. At the end of elimination, both lists might be empty, in which case they were equal. If they are not empty, all items still on the OLD list are items missing on the NEW list (otherwise we had removed them), hence these are the removed items. All items still on the NEW list are items that were not on the OLD list (again, we had removed them otherwise), hence these are the added items.

Mecki
A: 

Are the objects in the list "unique"? In this case I would first build two maps (hashmaps) and then scan the lists and lookup every object in the maps.

map1
map2
removedElements
addedElements

list1.each |item|
{
    map1.add(item)
}
list2.each |item|
{
    map2.add(item)
}
list1.each |item|
{
    removedElements.add(item) unless map2.contains?(item)
}
list2.each |item|
{
    addedElements.add(item) unless map1.contains?(item)
}

Sorry for the horrible meta-language mixing Ruby and Java :-P

In the end removedElements will contain the elements belonging to list1, but not to list2, and addedElements will contain the elements belonging to list2.

The cost of the whole operation is O(4*N) since the lookup in the map/dictionary may be considered constant. On the other hand linear/binary searching each elements in the lists will make that O(N^2).

EDIT: on a second thought moving the last check into the second loop you may remove one of the loops... but that's ugly... :)

list1.each |item|
{
    map1.add(item)
}
list2.each |item|
{
    map2.add(item)
    addedElements.add(item) unless map1.contains?(item)
}
list1.each |item|
{
    removedElements.add(item) unless map2.contains?(item)
}
Manrico Corazzi