I'm writing a data processing library in Python that reads data from a variety of sources into memory, manipulates it, then exports it into a variety of different formats. I had been loading this data entirely into memory, but some of the datasets I'm processing can be particularly large (over 4 GB).

I need an open source library for a backing store that can deal elegantly with large datasets. It needs the ability to alter the data structure dynamically (add, rename, and remove columns) and should support reasonably fast iteration. Ideally, it should handle arbitrary-sized strings and integers (just as Python does), but I can build that into the library if needed. It also needs to be able to handle missing values.

Does anyone have any suggestions?

A: 

PyTables might be the answer for you. Although I suspect it is mostly used for numerical data, it might fit your bill too (according to what I see on their homepage).
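
For illustration, here is a minimal PyTables sketch (the file name, table name, and column layout are made up for the example); note that PyTables columns are fixed-width, so arbitrary-sized strings would need extra handling on your side:

```python
import tables

# Hypothetical column layout -- adjust names, types, and sizes to your data.
class Record(tables.IsDescription):
    name = tables.StringCol(64)    # fixed-width string column
    value = tables.Int64Col()      # fixed-width integer column

h5file = tables.open_file("data.h5", mode="w")
table = h5file.create_table("/", "records", Record)

# Append rows without holding the whole dataset in memory.
row = table.row
for i in range(1000):
    row["name"] = "item-%d" % i
    row["value"] = i
    row.append()
table.flush()

# Iteration streams rows from disk instead of loading everything.
for r in table.iterrows():
    print(r["name"], r["value"])

h5file.close()
```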

Olivier
It might, but I couldn't see any way to alter a table after it has been defined. Their cheat sheet for SQL users has an example of renaming a column, but not of adding a new column to an existing table.
Chris B.
+3  A: 

A document-oriented database should cope fine with that kind of workload as long as you do not have complex joins.

Common representatives would be CouchDB or MongoDB.

They are both well suited for MapReduce-like algorithms (this includes iterating over all datasets). If you want to merge rows with new data, you will want to have the 'table' sorted or have fast access to single elements: both boil down to having an index.

Document-oriented DBs support multiple 'tables' by having documents with different schemas. They can query documents with a specific schema without a problem.

I do not think you will find a lightweight solution that handles multiple 4 GB datasets with the requirements you listed. Dynamic data structures in particular are difficult to make fast.
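
As a rough sketch of what that looks like with MongoDB via pymongo (the database, collection, and field names below are made up), adding, renaming, or removing a 'column' is just a bulk update over the documents:

```python
from pymongo import MongoClient

client = MongoClient()            # assumes a local mongod instance
coll = client["mydb"]["dataset"]

# Rows become documents; a missing value is simply an absent key.
coll.insert_many([
    {"name": "a", "value": 1},
    {"name": "b"},                # no "value" key -> missing value
])

# "Schema" changes are per-document updates.
coll.update_many({}, {"$rename": {"value": "amount"}})   # rename a column
coll.update_many({}, {"$set": {"flag": False}})          # add a column
coll.update_many({}, {"$unset": {"flag": ""}})           # remove a column

# Iterating over every row streams documents from the server.
for doc in coll.find():
    print(doc)
```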

ebo
That would probably work, but it seems a little like a mismatch. We're dealing with rows and columns of data, with a single database table for each dataset. I suppose that would work with one document per row, but it still seems a little odd.
Chris B.
Also, I don't need fast query abilities for large datasets; I need to iterate over every row, possibly multiple times. That doesn't seem to be what CouchDB or MongoDB is designed for.
Chris B.
+1  A: 

Give Metakit a try. It allows flexibility in schemas and has Python bindings. Though it doesn't get much press, it's been around for a while.
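
The Python bindings (Mk4py) are quite compact; a minimal sketch from memory looks roughly like this (treat the exact calls and the column layout as assumptions and check the Metakit documentation):

```python
import metakit  # Mk4py bindings -- verify the exact API against the docs

db = metakit.storage("data.mk", 1)        # 1 = open for read/write
vw = db.getas("rows[name:S,value:I]")     # define/attach a view ('table')

vw.append(name="a", value=1)              # add rows
vw.append(name="b", value=2)
db.commit()

for r in vw:                              # iterate over rows
    print(r.name, r.value)
```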

Vinay Sajip
The documentation's a little light. In my brief check of the website, I didn't see anything about how well it dealt with huge datasets.
Chris B.
+1  A: 

Another idea might be to use Hadoop for your backend. It has similarities with CouchDB, which someone mentioned before, but focuses more on efficient processing of big datasets with MapReduce algorithms.

In comparison to CouchDB, Hadoop isn't really suited for real-time applications or as a database behind a website, because it has high latency when accessing a single entry, but it really shines when iterating over all elements and computing over even petabytes of data.

So maybe you should give Hadoop a try. Of course, it might take some time to get used to MapReduce algorithms, but they are a great way to describe such problems. And you don't have to deal with storing the intermediate results yourself. A nice side effect is that your algorithm will still work when your dataset grows bigger; you might just have to add another server. :-)
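
As a rough sketch, with Hadoop Streaming the mapper and reducer can be plain Python scripts that read lines from stdin and write tab-separated key/value pairs (the field layout below is invented for the example):

```python
#!/usr/bin/env python
# mapper.py -- read one row per line from stdin, emit key<TAB>value.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if not fields or not fields[0]:
        continue                      # skip rows with a missing key field
    print("%s\t%s" % (fields[0], ",".join(fields[1:])))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop delivers the mapper output sorted by key, so rows
# sharing a key arrive consecutively and can be aggregated (here: counted).
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, _value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += 1

if current_key is not None:
    print("%s\t%d" % (current_key, count))
```

You would then submit both scripts with the hadoop-streaming jar; the exact invocation depends on your Hadoop version.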

There are also quite a lot of books and documentation about Hadoop and MapReduce available, and here is a nice tutorial which might help you get started with Hadoop and Python.

tux21b