I'm starting on a new scientific project with a lot of data (millions of entries) that I'd like to store in an easily and quickly accessible format. I've come across a number of different potential options, but I'm not sure how to pick amongst them. My data can probably just be stored as a dictionary, or potentially a dictionary of dictionaries. Some potential considerations:

  • Speed. I can't load all the data off disk every time I start a new script, and I'd like as quick access to random entries as possible.
  • Ease-of-use. This is Python. The storage should feel like Python.
  • Stability/maturity. I'd like something that's currently supported, although something that works well but is still in development would be fine.
  • Ease of installation. My sysadmin should be able to get this running on our cluster.

I don't really care that much about the size of the storage, but it could be a consideration if an option is really terrible on this front. Also, if it matters, I'll most likely be creating the database once, and thereafter only reading from it.

I've started looking at some potential options (see this post).

Any suggestions on which of these might be better for my purposes? Any better ideas? Some of these have a back-end; any suggestions on which file-system back-end would be best?

+7  A: 

An RDBMS.

Nothing is more reliable than using tables on a well-known RDBMS. PostgreSQL comes to mind.

That automatically gives you some options for the future, like clustering. You also get a lot of tools to administer your database, and you can use it from other software written in virtually any language.

It is really fast.

In the "feel like python" point, I might add that you can use an ORM. A strong name is sqlalchemy. Maybe with the elixir "extension".

Using SQLAlchemy you can let your user/sysadmin choose which database backend they want to use. Maybe they already have MySQL installed - no problem.
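
As a rough sketch of how this can look (the Entry model, its columns, and the sqlite:/// URL below are all illustrative, not from the question; swap the URL for a PostgreSQL or MySQL one and the rest stays the same):

```python
# Minimal sketch in SQLAlchemy's declarative style; names are illustrative.
# (In newer SQLAlchemy, declarative_base lives in sqlalchemy.orm instead.)
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Entry(Base):
    __tablename__ = 'entries'
    id = Column(Integer, primary_key=True)
    key = Column(String, index=True)   # indexed for fast random lookups
    value = Column(String)

engine = create_engine('sqlite:///data.db')  # or postgresql://..., mysql://...
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
session.add(Entry(key='k1', value='42'))
session.commit()

# Random access by key reads almost like a dictionary lookup:
print(session.query(Entry).filter_by(key='k1').first().value)
```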

RDBMSs are still the best choice for data storage.

nosklo
Okay, which interface should I use if I go this direction?
Noah
Interface, as in GUI? Or as in API - Application Programmers Interface?
nosklo
A good API for databases is SQLAlchemy - I've updated the answer to say a few words about it and added a link to the website. Let me know if you need more info.
nosklo
Cool, I'll look into that. Elixir looks interesting.
Noah
Also, using an RDBMS gives you the power to do things like JOINs and aggregate functions.
nosklo
Hrm, Elixir doesn't appear to work with the latest version of SQLAlchemy. If this kind of thing doesn't get fixed quickly, I'll probably just use straight-up SQLAlchemy.
Noah
@Noah: are you sure you need the latest SQLAlchemy features? That said, I always use straight-up SQLAlchemy on my projects, since I feel that its declarative option is good enough.
nosklo
I take it back -- there is a deprecation warning that the developers seem to be fully aware of and will address in the next version of Elixir.
Noah
+2  A: 

SQLite -- it comes with Python, is fast, widely available, and easy to maintain.
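
For instance, a minimal sketch with the standard-library sqlite3 module (the table layout and file name are just an illustration):

```python
import sqlite3

# Illustrative schema: one key/value table; the primary key gives
# fast random reads by key.
conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE IF NOT EXISTS entries (key TEXT PRIMARY KEY, value TEXT)')
conn.executemany('INSERT OR REPLACE INTO entries VALUES (?, ?)',
                 [('k1', 'v1'), ('k2', 'v2')])
conn.commit()

# Random access to a single entry:
row = conn.execute('SELECT value FROM entries WHERE key = ?', ('k1',)).fetchone()
print(row[0])
conn.close()
```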

Artyom
+1  A: 

If you only need simple (dict-like) access mechanisms and need efficiency for processing a lot of data, then HDF5 might be a good option. If you are going to be using numpy, it is really worth considering.
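
A minimal sketch with h5py (one of several Python HDF5 bindings; PyTables is another), assuming the data fits a numpy array; the file and dataset names are invented for illustration:

```python
import numpy as np
import h5py

# Write once: store a large numpy array as an HDF5 dataset.
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('values', data=np.arange(1000000, dtype=np.float64))

# Read later: random slices are read from disk on demand, so the whole
# array never has to be loaded at once.
with h5py.File('data.h5', 'r') as f:
    print(f['values'][123456])      # single random entry
    print(f['values'][1000:1010])   # a small slice
```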

Ants Aasma
+1  A: 

Go with an RDBMS: it is reliable, scalable, and fast.

If you need a more scalable solution and don't need the features of an RDBMS, you can go with a key-value store like CouchDB, which has a good Python API.
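
For example, with the couchdb-python package (this assumes a CouchDB server running on the default localhost port; the database name and document shape are illustrative):

```python
import couchdb

server = couchdb.Server()        # defaults to http://localhost:5984/
db = server.create('entries')    # or server['entries'] if it already exists

# Documents are plain Python dicts.
doc_id, _rev = db.save({'key': 'k1', 'value': 42})
print(db[doc_id]['value'])       # dictionary-style access by id
```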

Ferran
+10  A: 

You might want to give MongoDB a shot - the PyMongo library works with dictionaries and supports most Python types. It is easy to install, very performant, and scalable. MongoDB (and PyMongo) is also used in production at some big names.
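
A minimal PyMongo sketch (the database and collection names are made up, and this assumes a mongod running locally; older PyMongo versions spell some of these calls differently):

```python
from pymongo import MongoClient

client = MongoClient()              # mongod on localhost:27017
coll = client.science.entries       # database/collection names illustrative

# Documents are plain Python dicts, nesting included.
coll.insert_one({'key': 'k1', 'params': {'alpha': 0.1, 'beta': 2}})
coll.create_index('key')            # for fast random lookups by key

doc = coll.find_one({'key': 'k1'})
print(doc['params']['alpha'])
```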

mdirolf
+2  A: 

I'm working on such a project and I'm using SQLite.

SQLite stores everything in one file and is part of Python's standard library. Hence, installation and configuration are virtually free (ease of installation).

You can easily manage the database file with small Python scripts or via various tools. There is also a Firefox plugin (ease of installation / ease-of-use).

I find it very convenient to use SQL to filter/sort/manipulate/... the data, although I'm not an SQL expert. (ease-of-use)
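
For instance, filtering and sorting happen in the database rather than in Python (table and column names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect('results.db')
# Illustrative query: fetch only the top matches instead of loading
# everything into Python and sorting there.
rows = conn.execute(
    'SELECT key, value FROM entries WHERE value > ? ORDER BY value DESC LIMIT 10',
    (0.5,)
).fetchall()
conn.close()
```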

I'm not sure if SQLite is the fastest DB system for this kind of work, and it lacks some features you might need, e.g. stored procedures.

Anyway, SQLite works for me.

wierob
+1  A: 

The NEMO collaboration (building a cosmic neutrino detector underwater) faced many of the same problems, and they used MySQL and PostgreSQL without major issues.

Coffee on Mars
+4  A: 

If you really just need dictionary-like storage, some of the newer key/value or column stores like Cassandra or MongoDB might give you a lot more speed than you'd get with a relational database. Of course, if you decide to go with an RDBMS, SQLAlchemy is the way to go (disclaimer: I am its creator), but your desired feature list seems to lean in the direction of "I just want a dictionary that feels like Python" - if you aren't interested in relational queries or strong ACIDity, those facets of an RDBMS will probably feel cumbersome.
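
To make the "feels like a dictionary" point concrete, whichever backend you pick can be hidden behind a tiny mapping facade; a hypothetical sketch, where `backend_get` stands in for whatever lookup call your chosen store provides:

```python
class EntryStore:
    """Hypothetical dict-like facade over any key-value backend."""

    def __init__(self, backend_get):
        self._get = backend_get   # e.g. a closure over a DB connection

    def __getitem__(self, key):
        value = self._get(key)
        if value is None:
            raise KeyError(key)
        return value

# Usage (illustrative, reusing the PyMongo collection from above):
#   store = EntryStore(lambda k: coll.find_one({'key': k}))
#   store['k1']
```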

zzzeek
Thanks for the reply -- I think I'll use this as a learning experience to see if an RDBMS really is the way to go for future projects.
Noah
A: 

It really depends on what you're trying to do. An RDBMS is designed for relational data, so if your data is relational, use one of the various SQL options. But it sounds like your data is more oriented towards a key-value store with very fast random GET operations. If that's the case, compare the benchmarks of the various key-value stores, focusing on GET speed. The ideal key-value store will keep or cache entries in memory and be able to handle many GET requests concurrently. You may actually want to create your own benchmark suite so you can effectively compare random concurrent GET operations.

Why do you need a cluster? Is the size of each value very large? If not, you shouldn't need a cluster to handle storage of a million entries. But if you're storing large blobs of data, that matters, and you may need something that easily supports read slaves and/or transparent partitioning. Some of the key-value stores are document-oriented and/or optimized for storing larger values. Redis is technically less storage efficient for larger values due to the indexing overhead required for fast GETs, but that doesn't necessarily mean it's slower. In fact, the extra indexing makes lookups faster.

You're the only one who can truly answer this question, and I strongly recommend putting together a custom benchmark suite to test the available options with actual usage scenarios. The data you get from that will give you more insight than anything else.
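
A rough sketch of such a benchmark, assuming a `get(key)` callable for whichever store is under test (the function name, request count, and worker count are all illustrative):

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def bench_random_gets(get, keys, n_requests=10000, n_workers=8):
    """Time n_requests random concurrent GETs against an arbitrary store."""
    sample = [random.choice(keys) for _ in range(n_requests)]
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(get, sample))   # force all lookups to complete
    elapsed = time.time() - start
    return n_requests / elapsed       # GETs per second

# Usage (illustrative): print(bench_random_gets(store.__getitem__, all_keys))
```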

Jacob