I am interested in monitoring some objects. I expect to get about 10000 data points every 15 minutes. (Maybe not at first, but this is the 'general ballpark'). I would also like to be able to get daily, weekly, monthly and yearly statistics. It is not critical to keep the data in the highest resolution (15 minutes) for more than two months.

I am considering various ways to store this data, and have been looking at a classic relational database, or at a schemaless database (such as SimpleDB).

My question is, what is the best way to go about doing this? I would very much prefer an open-source (and free) solution to a proprietary, costly one.

Small note: I am writing this application in Python.

+2  A: 

PyTables is designed for dealing with very large data sets, and it is built on top of the HDF5 library, which is designed for exactly this purpose. For example, PyTables has automatic compression and supports NumPy.
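
A minimal sketch of what that could look like with the current PyTables API (the table layout, column names and file name are just assumptions for illustration):

    import tables

    class Reading(tables.IsDescription):
        object_id = tables.Int32Col()
        timestamp = tables.Float64Col()   # Unix time of the sample
        value     = tables.Float64Col()

    # Compressed HDF5 file holding one append-only table of readings.
    h5 = tables.open_file("monitoring.h5", mode="w")
    table = h5.create_table("/", "readings", Reading,
                            filters=tables.Filters(complevel=5, complib="zlib"))

    row = table.row
    for obj_id, ts, val in [(1, 1275000000.0, 42.0), (2, 1275000000.0, 3.14)]:
        row["object_id"] = obj_id
        row["timestamp"] = ts
        row["value"] = val
        row.append()
    table.flush()
    h5.close()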

tom10
This seems very interesting, I'll check it out.
lorg
+4  A: 

RRDTool by Tobi Oetiker, definitely! It's open source, and it was designed for exactly this kind of use case.

EDIT:

To provide a few highlights: RRDTool stores time-series data in a round-robin database. It keeps raw data for a given period of time, then condenses it in a configurable way, so you can have fine-grained data for, say, a month, data averaged over a week for the last 6 months, and data averaged over a month for the last 2 years. As a side effect, your database stays the same size all the time (so no sweating that your disk may fill up). That's the storage side. On the retrieval side, RRDTool offers data queries that are immediately turned into graphs (e.g. PNG) that you can readily include in documents and web pages. It's a rock-solid, proven solution, and a much more generalized form of its predecessor, MRTG (some might have heard of that). And once you get into it, you will find yourself re-using it over and over again.

For a quick overview and a list of who uses RRDTool, see also here. If you want to see what kinds of graphs you can produce, make sure you have a look at the gallery.
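
A minimal sketch of how a retention scheme like the one described above might be set up from Python, assuming the rrdtool Python bindings are installed (the DS/RRA parameters are illustrative, not tuned for the 10k-points case):

    import rrdtool  # Python bindings for RRDTool

    # One RRD per monitored object; 900 s = one 15-minute step.
    rrdtool.create(
        "object_0001.rrd",
        "--step", "900",
        "DS:value:GAUGE:1800:U:U",   # accept gaps of up to 30 minutes
        "RRA:AVERAGE:0.5:1:5856",    # raw 15-min samples for about 2 months
        "RRA:AVERAGE:0.5:96:366",    # daily averages for about a year
        "RRA:AVERAGE:0.5:672:520",   # weekly averages for about 10 years
    )

    # Feed it a reading; "N" means "use the current time".
    rrdtool.update("object_0001.rrd", "N:42.0")

One fixed-size file per object keeps every file small and bounded, at the cost of having roughly 10000 files to update each cycle.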

ThomasH
I was aware of RRDTool; it's good to have another "vote" for it. I will look into it more deeply. As an aside, do you know if you can interface with it from Python?
lorg
@lorg I haven't tried it myself, but the docs explicitly list Python bindings (http://oss.oetiker.ch/rrdtool/prog/rrdpython.en.html)
ThomasH
It has Python bindings, but last time I looked (long ago) they didn't work great. I ended up just wrapping the CLI with subprocess calls, like this class does: http://code.google.com/p/perfmetrics/source/browse/trunk/lib/rrd.py (see the sketch below).
Corey Goldberg
@Corey Right, that's how I've used RRDtool, and it's quite natural to do so.
ThomasH
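
For what it's worth, a minimal sketch of the CLI-wrapping approach mentioned in the comments above (it assumes the rrdtool binary is on the PATH; the file name and value are illustrative):

    import subprocess

    def rrd_update(rrd_file, value):
        """Shell out to the rrdtool CLI instead of using the Python bindings."""
        # "N" tells rrdtool to timestamp the sample with the current time.
        subprocess.check_call(["rrdtool", "update", rrd_file, "N:%s" % value])

    rrd_update("object_0001.rrd", 42.0)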
+1  A: 

Plain text files? It's not clear what your 10k data points per 15 minutes translate to in terms of bytes, but in any case text files are easy to store/archive/transfer/manipulate, and you can inspect them directly just by looking at them. They are fairly easy to work with from Python, too.
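
A minimal sketch of that approach, assuming one CSV file per day (the directory layout and column order are just one possibility):

    import csv
    import os
    import time

    DATA_DIR = "datapoints"   # hypothetical layout: one CSV file per day

    def append_readings(readings):
        """Append (object_id, value) pairs to today's file, timestamped now."""
        if not os.path.isdir(DATA_DIR):
            os.makedirs(DATA_DIR)
        now = time.time()
        day_file = os.path.join(DATA_DIR,
                                time.strftime("%Y-%m-%d.csv", time.gmtime(now)))
        with open(day_file, "a", newline="") as f:
            writer = csv.writer(f)
            for obj_id, value in readings:
                writer.writerow([now, obj_id, value])

    append_readings([(1, 42.0), (2, 3.14)])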

SilentGhost
+1  A: 

This is pretty standard data-warehousing stuff.

Lots of "facts", organized by a number of dimensions, one of which is time. Lots of aggregation.

In many cases, simple flat files that you process with simple aggregation algorithms based on defaultdict will work wonders -- fast and simple.
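
For instance, a minimal sketch of rolling 15-minute samples up into daily per-object averages with defaultdict (the input file name and column order are assumptions):

    import csv
    from collections import defaultdict
    from datetime import datetime, timezone

    sums = defaultdict(float)
    counts = defaultdict(int)

    # Hypothetical input: rows of (unix_timestamp, object_id, value).
    with open("2010-06-01.csv", newline="") as f:
        for ts, obj_id, value in csv.reader(f):
            day = datetime.fromtimestamp(float(ts), tz=timezone.utc).date()
            sums[(obj_id, day)] += float(value)
            counts[(obj_id, day)] += 1

    daily_averages = {key: sums[key] / counts[key] for key in sums}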

Look at http://stackoverflow.com/questions/665614/efficiently-storing-7-300-000-000-rows/665641#665641

http://stackoverflow.com/questions/629445/database-choice-for-large-data-volume

S.Lott