I'm currently writing a Python application that takes a directory of text files and parses them into custom Python objects based on the attributes specified in each file. As part of the application, I compare the currently loaded dataset to a previous dataset (same format) and scan it for possible duplicates, conflicts, updates, etc. Since there can be 10,000+ objects at a time, I'm not sure how to approach this.
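For context, the loading step looks roughly like this (the `Rule` class and its attributes are hypothetical stand-ins for my actual objects, and I'm assuming simple `key=value` text files here):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    # Hypothetical attributes; the real objects have more fields.
    name: str
    value: str

def load_rules(directory):
    """Parse every .txt file in `directory` into Rule objects."""
    rules = {}
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        path = os.path.join(directory, filename)
        with open(path) as f:
            # Assumes one "key=value" attribute per line.
            attrs = dict(line.strip().split("=", 1) for line in f if "=" in line)
        rule = Rule(name=attrs["name"], value=attrs["value"])
        rules[rule.name] = rule  # everything lives in this dict (problem #1 below)
    return rules
```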
I'm currently storing the previous dataset in a DB, as it's used by another web app. Right now, my Python application loads the 'proposed' dataset into memory (creating the rule objects) and stores those objects in a dictionary (problem #1). Then, when it comes time to compare, I use a combination of SQL queries and failed inserts to determine which entries are new, which already exist, and which exist but have been updated (problem #2).
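To make problem #2 concrete, what the comparison effectively needs to produce is the set arithmetic below (assuming each rule has a unique key, which my real data does), except my current version does it with per-row queries and caught insert errors:

```python
def diff_datasets(proposed, previous):
    """Compare two {key: Rule} dicts and bucket the differences."""
    proposed_keys = set(proposed)
    previous_keys = set(previous)

    new_keys = proposed_keys - previous_keys        # in proposed only
    removed_keys = previous_keys - proposed_keys    # in previous only
    common_keys = proposed_keys & previous_keys     # in both; may have changed

    updated_keys = {k for k in common_keys if proposed[k] != previous[k]}
    unchanged_keys = common_keys - updated_keys

    return new_keys, removed_keys, updated_keys, unchanged_keys
```

This sketch assumes I could pull the previous dataset out of the DB into a comparable `{key: Rule}` dict first, which is part of what I'm unsure about at ~10,000+ objects.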
This is hackish and terrible at best. I'm looking for advice on how to restructure the application and handle the object storage and comparisons.