Hey everyone, Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key value pairs of the form ((word, corpus) -> occurence_count) (everything is an integer) which I am storing in multiple python dictionaries (tuple->int). These values are spread across multiple files on the disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries, and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried , I ran into performance issues. I am currently trying to store the values in a DB (mysql), processing multiple dictionaries at a time, since mysql provides row level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries)
What are my options here? Is it a good idea to write a partially disk based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!