views:

317

answers:

6

I'm pretty sure this must be in some kind of text book (or more likely in all of them) but I seem to be using the wrong keywords to search for it... :(

A common task I'm facing while programming is that I am dealing with lists of objects from different sources which I need to keep in sync somehow. Typically there's some sort of "master list" e.g. returned by some external API and then a list of objects I create myself each of which corresponds to an object in the master list. Sometimes the nature of the external API will not allow me to do a live sync: For instance the external list might not implement notifications about items being added or removed or it might notify me but not give me a reference to the actual item that was added or removed. Furthermore, refreshing the external list might return a completely new set of instances even though they still represent the same information so simply storing references to the external objects might also not always be feasible. Another characteristic of the problem is that both lists cannot be sorted in any meaningful way. You should also assume that initializing new objects in the "slave list" is expensive, i.e. simply clearing and rebuilding it from scratch is not an option.

So how would I typically tackle this? What's the name of the algorithm I should google for?

In the past I have implemented this in various ways (see below for an example) but it always felt like there should be a cleaner and more efficient way.

Here's an example approach:

  1. Iterate over the master list
  2. Look up each item in the "slave list"
  3. Add items that do not yet exist
  4. Somehow keep track of items that already exist in both lists (e.g. by tagging them or keeping yet another list)
  5. When done iterate once more over the slave list
  6. Remove all objects that have not been tagged (see 4.)


Update Thanks for all your responses so far! I will need some time to look at the links.

Maybe one more thing worthy of note: In many of the situations where I needed this the implementation of the "master list" is completely hidden from me. In the most extreme cases the only access I might have to the master list might be a COM-interface that exposes nothing but GetFirst-, GetNext-style methods. I'm mentioning this because of the suggestions to either sort the list or to subclass it both of which is unfortunately not practical in these cases unless I copy the elements into a list of my own and I don't think that would be very efficient.

I also might not have made it clear enough that the elements in the two lists are of different types, i.e. not assignment-compatible: Especially, the elements in the master list might be available as interface references only.

+1  A: 

In the C++ STL the algorithm is called set_union. Also, implementing the algorithm is likely to be a lot simpler if you do the union into a 3rd list.

ceretullis
This will likely just perform the algorithm the OP describes. Also, the OP is looking for a language agnostic solution to this problem, ie: an algorithmic solution.
Ben S
set_union requires sorted input; the question excludes that.
MSalters
@Ben S: the questioner said they were having trouble pulling up algorithms in google... and asked for the name to google for. My answer is a leaping off point for the questioner to narrow their googling.
ceretullis
A: 

Very brute-force and pure technical approach:

Inherit from your List class (sorry don't know what is your language). Override add/remove methods in your child list class. Use your class instead of the base one. Now you can track changes with your own methods and synchronize lists on-line.

Alexey Kalmykov
+2  A: 

The 2 typical solutions are: 1. Copy the master list to the sync list. 2. Do an O(N*N) comparison between all element pairs.

You've excluded the smart options already: shared identity, sorting and change notifications.

Note that it's not relevant whether the lists can be sorted in a meaningful way, or even completely. For instance, when comparing two string lists, it would be ideal to sort alphabetically. But the list comparison would still be more efficient if you'd sort both lists by string length! You'd still have to do a full pairwise comparison of strings of the same length, but that will probably be a much smaller nummber of pairs.

MSalters
+2  A: 

This looks like the set reconciliation problem i.e. the problem of synchronizing unordered data. A question on SO was asked on this: link text.

Most of the references on google are to technical paper abstracts.

Andrew
+1  A: 

Often the best solution to such problems is to not solve them directly.

IF you really can't use a sorted binary searchable container in your part of the code (like a set or even a sorted vector) then...

Are you very memory bound? If not then I'd just create a dictionary (an std::set for example) containing the contents of one of the lists and then just iterate over the second which I want o sync with the first.

This way you're doing n*logn to create the dictionary (or n*X for a hash dictionary depending on which will be more efficient) + m*logn operations to go over the second list and sync it (or just M*Y) - hard to beat if you really have to use lists in the first place - it's also good you do it only once when and if you need it and it's way better then keeping the lists sorted all the time which would be a n^2 task for both of them.

RnR
A: 

I had such problem in one project in the past.

That project had one master data source and several clients that update the data independently and in the end all of them have to have the latest and unified set of data that is the sum of them.

What I did was building something similar to the SVN protocol, in which every time I wanted to update the master database (which was accessible through a web service) I got the revision number. Updated my local data store to that revision and then commited the entities that aren't covered by any revision number to the database.

Every client has the ability to update it's local data store to the latest revision.

Karim