I'm creating and processing a very large data set, with about 34 million data points, and I'm currently storing them in Python dictionaries in memory (about 22,500 dictionaries, 15 in each of 1588 class instances). While I'm able to manage this all in memory, I'm using up all of my RAM and most of my swap.

I need to be able to first generate all of this data, and then do analysis on select portions of it at a time. Would it be beneficial from an efficiency standpoint to write some of this data to file, or store it in a database? Or am I better off just taking the hit to efficiency that comes with using my swap space? If I should be writing to a file or a database, are there any Python tools that you would recommend for doing so?

+1  A: 

Because you will be looking at "select portions", your application will be able to make better use of core memory than virtual memory will. VM is convenient, but - by definition - kinda stupid about locality of reference.

Use a database.

I'd probably start with the sqlite3 module on the basis of simplicity, unless or until I find that it is a bottleneck.
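
Something along these lines is roughly what I have in mind - a minimal sketch with the standard-library sqlite3 module; the table and column names are invented for illustration, not taken from your data:

    import sqlite3

    # The database is a plain file on disk, so it survives after the program exits.
    conn = sqlite3.connect("analysis.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS points (instance_id INTEGER, key TEXT, value REAL)"
    )

    # Bulk-load rows; executemany is much faster than inserting one row at a time.
    rows = [(1, "sample_a", 0.25), (1, "sample_b", 0.75)]
    conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows)
    conn.commit()

    # Later (even in a separate run) pull back just the slice you need.
    subset = conn.execute(
        "SELECT key, value FROM points WHERE instance_id = ?", (1,)
    ).fetchall()
    conn.close()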

msw
Correct me if I'm wrong, but SQLite uses a temporary flat file as a database. Out of curiosity, would I be able to access the data I wrote to it after my program has terminated?
Wilduck
Yes, but the data format is not human readable -- you'd need to use SQLite (or a program interfacing to SQLite) to read the data.
Craig Trader
@Wilduck - I'm going to correct you by saying: SQLite uses a single file as a database. It's not a temporary file unless you've specified a temporary database. (http://www.sqlite.org/pragma.html)
gomad
Also, you don't necessarily need to constrain yourself to a single file with SQLite if you don't need relations that cross files. For unrelated data, it often performs better to use multiple SQLite database files and open more than one at a time than to put all of your application data in one database file.
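Roughly like this (an untested sketch; the file and table names are invented):

    import sqlite3

    # ATTACH lets a single connection see several database files, so unrelated
    # data can live in separate files and still be queried together.
    conn = sqlite3.connect("measurements.db")
    conn.execute("ATTACH DATABASE 'metadata.db' AS meta")
    conn.execute("CREATE TABLE IF NOT EXISTS meta.runs (run_id INTEGER, label TEXT)")

    # Tables in the attached file are addressed with the schema prefix.
    rows = conn.execute("SELECT run_id, label FROM meta.runs").fetchall()
    conn.close()
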
Nick Bastin
+1  A: 

Get a relational database, fast! Or a whole lot more RAM.

If you're using Python, then start with Python Database Programming. SQLite would be one choice, but I'd suggest MySQL given the amount of data you're dealing with. If you want an object-oriented approach to storing your data, you might want to look at SQLAlchemy, but you'll probably get more efficiency if you end up mapping each of your object classes to a table yourself and just coping with rows and columns.
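
For example, a rough SQLAlchemy sketch - the class and column names are invented to show the object-to-table mapping, not taken from the question:

    from sqlalchemy import create_engine, Column, Integer, String, Float
    from sqlalchemy.ext.declarative import declarative_base  # moved to sqlalchemy.orm in newer releases
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class DataPoint(Base):
        __tablename__ = "data_points"
        id = Column(Integer, primary_key=True)
        instance_id = Column(Integer, index=True)
        name = Column(String(50))
        value = Column(Float)

    # SQLite keeps the example self-contained; swap the URL for a MySQL one
    # (e.g. "mysql://user:password@host/dbname") once a server is available.
    engine = create_engine("sqlite:///dataset.db")
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)

    session = Session()
    session.add_all([DataPoint(instance_id=1, name="sample_a", value=0.25)])
    session.commit()

    # Pull back only the subset needed for a given analysis pass.
    subset = session.query(DataPoint).filter_by(instance_id=1).all()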

Craig Trader
A relational database should not be your first choice if you're merely trying to offload data from memory to disk. If the data's true layout is optimal in memory, it probably doesn't lend itself to being relational, meaning a relational database is giving you a bunch of features you'll never use, at a considerable performance cost.
Nick Bastin
@Nick, it's pretty obvious, from the way the OP described his data, that it would benefit from a table layout, and a relational database in particular, since he wanted to deal with subsets of his data instead of the whole database in one shot.
Craig Trader
@W. Craig: Dealing with subsets is not enough to require a relational database - all key-value stores excel at returning subsets, very quickly. You only want to bump up to a relational database if your "subsets" are generated on a lot of different keys (columns). If you're always generating subsets on the same index (say, date or id), then a relational database provides no benefit.
Nick Bastin
@Nick, if your only goal is to persist data from memory to disk and back again, at top velocity, then you are correct. Most applications have more than a single concern when persisting data, and no single data persistence method will meet all needs for all users. Without more detail about the OP's specific needs, neither you nor I can make a perfect recommendation. The best we can do is present alternatives, and hope that the OP learns enough from all the answers to make the right choice for his/her needs.
Craig Trader
+1  A: 

If you have this data in Python data structures already, assuming you're not doing a lot of in-memory indexing (more than the obvious dictionary keys index), you really don't want to use a relational database - you'll pay a considerable performance penalty for no particular benefit.

You just need to get your already key-value-pair data out of memory, not change its format. You should look into key-value stores like BDB, Voldemort, MongoDB, or Scalaris (just to name a few - some more involved and functional than others, but all should easily handle your dataset), or, for a dataset that you think might grow even larger or more complex, you can look into systems like Cassandra, Riak, or CouchDB (among others). All of these systems will offer you vastly superior performance to a relational database and map more directly to an in-memory data model.
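
As a stand-in for any of those, Python's built-in shelve module shows the same key-value idea in miniature - a minimal sketch, with the keys and values invented:

    import shelve

    # A dbm-backed, on-disk dictionary: values are pickled, so the existing
    # dictionaries can be stored unchanged.
    store = shelve.open("instances.shelf")

    # Keys must be strings.
    store["instance_0042"] = {"sample_a": 0.25, "sample_b": 0.75}
    store.sync()

    # Later, load only the instances the current analysis pass needs.
    wanted = store["instance_0042"]
    store.close()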

All that being said, of course, if your dataset really could be more performant by leveraging the benefits of a relational database (complex relationships, multiple views, etc.), then go for it, but you shouldn't use a relational database if all you're trying to do is get your data structures out of memory.

(It's also possible that just marshaling/pickling your data in segments and managing it yourself would offer better performance than a relational database, assuming your access pattern made paging in/out a relatively infrequent event. It's a long shot, but if you're just holding old data around and no one really looks at it, you might as well just throw that to disk yourself.)
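
A rough sketch of that do-it-yourself approach - segmenting by class instance and the file naming are invented for illustration:

    import pickle  # cPickle is faster on Python 2

    # One pickle file per class instance's dictionaries; load back only the
    # segments the current analysis needs.
    def dump_instance(instance_id, dicts):
        with open("instance_%04d.pkl" % instance_id, "wb") as f:
            pickle.dump(dicts, f, pickle.HIGHEST_PROTOCOL)

    def load_instance(instance_id):
        with open("instance_%04d.pkl" % instance_id, "rb") as f:
            return pickle.load(f)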

Nick Bastin
Not being very experienced in database design/management, I'm not sure what would qualify as complex relationships. Given that my dictionaries are attributes of many instances of a class, and that I'd be doing analysis a level higher, comparing different sets of these class instances, would this be considered complex relationships? I'm hesitant to use a non-relational database, but it's an ignorant hesitation.
Wilduck
@Wilduck: Ignorance is an important factor in any decision-making process... :-) Seriously, though - it is. If you're around a bunch of people who use SQL and can help you, and you don't have any better ideas, then you should probably use an SQL database, even if it isn't optimal for the dataset, because it's optimal for your entire situation. That being said, if you haven't put the same object in multiple dictionaries (so that the keys are different), then your data likely isn't relational.
Nick Bastin