I'm currently writing a Python application that takes a directory of text files and parses them into custom Python objects based on the attributes specified in each text file. As part of my application, I compare the currently loaded object data set to a previous data set (same format) and scan it for possible duplicates, conflicts, updates, etc. However, since there can be ~10,000+ objects at a time, I'm not really sure how to approach this.

I'm currently storing the previous data set in a DB, as it's being used by another web app. As of now, my Python application loads the 'proposed' data set into memory (creating the rule objects) and then stores those objects in a dictionary (problem #1). Then, when it comes time to compare, I use a combination of SQL queries and failed inserts to determine new/existing and existing-but-updated entries (problem #2).

This is hackish and terrible at best. I'm looking for some advice on restructuring the application and handling the object storage/comparisons.

A: 

You can fake what Git does: load the entire set as essentially a single file and diff from there. The biggest issue is that dictionaries don't guarantee ordering (at least prior to Python 3.7), so your comparisons will not always be 1:1. A list of tuples will give you 1:1 comparisons. If a lot has changed, this will be difficult.

Here is a basic flow for how you can do this (a rough code sketch follows the list).

  • Start with both tuple lists at index 0.
  • Compare a hash of each tuple: hashlib.sha1(str(tuple1).encode()).hexdigest() == hashlib.sha1(str(tuple2).encode()).hexdigest()
  • If they are equal, record the matching indexes, add 1 to each index, and compare again
  • If they are unequal, search each side for a match and record the matching indexes
  • If there are no matches, you can assume an insert/update/delete is happening and come back to it later
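
A minimal sketch of that flow, assuming your tuples have stable str() forms; the function and variable names here are my own, not anything from your app:

    import hashlib

    def tuple_hash(t):
        # sha1 wants bytes, so encode the tuple's string form first
        return hashlib.sha1(str(t).encode("utf-8")).hexdigest()

    def align(old_rows, new_rows):
        # Walk both tuple lists in step, pairing indexes whose hashes match.
        # Anything that can't be paired is set aside as a possible
        # insert/update/delete to revisit later.
        old_h = [tuple_hash(t) for t in old_rows]
        new_h = [tuple_hash(t) for t in new_rows]
        matches, leftovers = [], []
        i = j = 0
        while i < len(old_h) and j < len(new_h):
            if old_h[i] == new_h[j]:
                matches.append((i, j))
                i += 1
                j += 1
            elif old_h[i] in new_h[j:]:
                j = new_h.index(old_h[i], j)   # rows inserted on the new side
            elif new_h[j] in old_h[i:]:
                i = old_h.index(new_h[j], i)   # rows removed from the old side
            else:
                leftovers.append((i, j))       # no match either way: revisit
                i += 1
                j += 1
        return matches, leftovers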

You can use your matching items as reference points to do further investigation into the ones that did not match. This technique can be applied at each level as you drill down, and you will end up with a map of what is different, right down to the individual values.

The nice thing is that each of the slices you create can be compared in parallel, since they do not depend on each other... unless you are moving things from one file to another.
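
For example, with the align() sketch above defined at module level, something like this could fan the slice pairs out across processes; the worker count and the shape of slice_pairs (a list of (old_slice, new_slice) tuples bounded by matched anchor points) are assumptions:

    from concurrent.futures import ProcessPoolExecutor

    def compare_slice(pair):
        # Each (old_slice, new_slice) pair sits between two matched anchors,
        # so it can be diffed independently of the others.
        old_slice, new_slice = pair
        return align(old_slice, new_slice)

    def compare_in_parallel(slice_pairs, workers=4):
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(compare_slice, slice_pairs))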

Then again, it may be easier to use a diff library to compare the two data sets. Might as well not reinvent the wheel, even if it might be a really shiny wheel.

Check out http://docs.python.org/library/difflib.html
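
For instance, if you serialize each object to one line, difflib.SequenceMatcher will hand you the changed regions directly; the sample rows below are made up for illustration:

    import difflib

    old_rows = [("rule1", "allow"), ("rule2", "deny")]                     # made-up data
    new_rows = [("rule1", "allow"), ("rule2", "allow"), ("rule3", "deny")]

    old_lines = [str(t) for t in old_rows]
    new_lines = [str(t) for t in new_rows]

    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            print(tag, old_lines[i1:i2], "->", new_lines[j1:j2])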

Scott