Hi folks,

I'm using MongoDB, a NoSQL database. As the result of a query I get a list of dicts, which themselves contain lists of dictionaries, and I need to work with all of it.

Unfortunately, processing this data in Python slows to a crawl when there is too much of it.


I have never had to deal with this problem, and it would be great if someone with experience could give a few suggestions. =)

+1  A: 

Are you loading all the data into memory at once? If so, you could be causing the OS to swap memory to disk, which can bring any system to a crawl. Dictionaries are hash tables, so even an empty dict carries a noticeable memory overhead, and from what you say you are creating a lot of them at once. I don't know the MongoDB API, but I presume there is a way of iterating through the results one at a time instead of reading in the entire result set at once - try using that. Or rewrite your query to return a subset of the data.
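To make the "one at a time" suggestion concrete, here is a minimal sketch. It assumes the query result can be consumed as a lazy iterable (a database cursor usually can); `fake_cursor`, `count_items`, and the `items`/`qty` field names are made up for illustration and are not part of any MongoDB API.

```python
def fake_cursor(n):
    """Hypothetical stand-in for a database cursor: yields documents lazily."""
    for i in range(n):
        yield {"_id": i, "items": [{"qty": 1}, {"qty": 2}]}


def count_items(docs):
    """Consume documents one by one, keeping only a running total in memory."""
    total = 0
    for doc in docs:  # each document can be garbage-collected after this pass
        for item in doc.get("items", []):
            total += item.get("qty", 0)
    return total


# The full result set is never materialised as a list.
total = count_items(fake_cursor(1000))
print(total)
```

The point is that `count_items` never calls `list(...)` on the results, so memory use stays roughly constant no matter how many documents come back.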

If disk swapping is not the problem then profile the code to see what the bottleneck is, or put some sample code in your question. Without more specific information it is hard to give a more specific answer.
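As a rough sketch of what profiling could look like, the standard library's `cProfile` and `pstats` modules can report where the time goes. The `process` function and the sample data below are hypothetical placeholders for your own code.

```python
import cProfile
import io
import pstats


def process(docs):
    """Hypothetical workload: sum a field across nested lists of dicts."""
    return sum(item["qty"] for doc in docs for item in doc["items"])


# Made-up sample data shaped like "a list of dicts containing lists of dicts".
docs = [{"items": [{"qty": i} for i in range(50)]} for _ in range(2000)]

profiler = cProfile.Profile()
profiler.enable()
result = process(docs)
profiler.disable()

# Collect the five most expensive entries by cumulative time into a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Whatever sits at the top of the report is the first place to look before changing anything else.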

Dave Kirby
+3  A: 

Do you really want all of that data back in your Python program? If so fetch it back a little at a time, but if all you want to do is summarise the data then use mapreduce in MongoDB to distribute the processing and just return the summarised data.

After all, the point about using a NoSQL database that cleanly shards all the data across multiple machines is precisely to avoid having to pull it all back onto a single machine for processing.
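To show the shape of the summarising idea, here is a pure-Python sketch of a map step and a reduce step run locally. It is only an illustration of the pattern, not MongoDB's mapreduce API; the documents, `map_doc`, and `reduce_values` are all invented for the example. In MongoDB the equivalent work would run on the server and only the summary would come back.

```python
from collections import defaultdict
from functools import reduce

# Made-up documents: dicts that contain lists of dicts.
docs = [
    {"user": "a", "events": [{"kind": "click"}, {"kind": "view"}]},
    {"user": "b", "events": [{"kind": "click"}]},
    {"user": "a", "events": [{"kind": "click"}]},
]


def map_doc(doc):
    """Map step: emit one (key, value) pair per nested event."""
    for event in doc["events"]:
        yield event["kind"], 1


def reduce_values(a, b):
    """Reduce step: combine two values emitted for the same key."""
    return a + b


# Group the emitted values by key, then reduce each group to one value.
grouped = defaultdict(list)
for doc in docs:
    for key, value in map_doc(doc):
        grouped[key].append(value)

summary = {key: reduce(reduce_values, values) for key, values in grouped.items()}
print(summary)
```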

Duncan
@Duncan: I agree with your point completely. I am a bit inexperienced with the MongoDB queries I can make, and am a bit limited because of that. For now I'm handling these dicts within Python, as it is pretty straightforward to do so. Thanks for your reply.
RadiantHex
+1  A: 

If CPU is your bottleneck (and your problem can be parallelized), you can also consider using Python's multiprocessing module, the Disco project or Parallel Python to utilize multiple cores and/or multiple machines.
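A minimal sketch of the multiprocessing option, assuming each document can be processed independently of the others. `doc_total`, `parallel_totals`, and the field names are hypothetical; only `multiprocessing.Pool` and its `map` method come from the standard library.

```python
from multiprocessing import Pool


def doc_total(doc):
    """Per-document work; must be a top-level function so it can be pickled."""
    return sum(item["qty"] for item in doc["items"])


def parallel_totals(docs, workers=2):
    """Spread the per-document work across a pool of worker processes."""
    with Pool(processes=workers) as pool:
        # chunksize batches documents to cut inter-process overhead.
        return pool.map(doc_total, docs, chunksize=100)


if __name__ == "__main__":
    # Made-up sample data; in practice this would come from the query.
    docs = [{"items": [{"qty": 1}, {"qty": 2}]} for _ in range(1000)]
    totals = parallel_totals(docs)
    print(sum(totals))
```

Documents are pickled and sent to the workers, so this only pays off when the per-document computation is expensive compared with the cost of copying the data.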

Ztyx
A: 

sqlite is your friend.
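One way to read this suggestion, sketched under assumptions: flatten the nested dicts into rows, load them into SQLite through the standard library's `sqlite3` module, and let SQL do the grouping. The table name, columns, and sample documents are invented for the example.

```python
import sqlite3

# Made-up documents: dicts that contain lists of dicts.
docs = [
    {"user": "a", "events": [{"kind": "click"}, {"kind": "view"}]},
    {"user": "b", "events": [{"kind": "click"}]},
    {"user": "a", "events": [{"kind": "click"}]},
]

conn = sqlite3.connect(":memory:")  # use a file path to keep data on disk
conn.execute("CREATE TABLE events (user TEXT, kind TEXT)")

# Flatten each nested event into one row; the generator avoids a big temp list.
rows = ((doc["user"], event["kind"]) for doc in docs for event in doc["events"])
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()

counts = conn.execute(
    "SELECT kind, COUNT(*) FROM events GROUP BY kind ORDER BY kind"
).fetchall()
conn.close()
print(counts)
```

With a file-backed database instead of `:memory:`, the data lives on disk and only the query results need to fit in RAM.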