I want to build something to store and serve up time series data, which is coming in from a variety of sources at different time intervals. this includes both raw data and computed data. for example, let's say I want to log an every-30-seconds temperature reading, and a temperature forecast I'm calculating separately every 5 minutes.

I need to be able to query the data quickly, and I've found a relational database doesn't work well at all once it gets too big. so I was thinking about creating some sort of in-memory thing, but I'm sure it will crash at some point, so I'll need to persist the data to disk. so I was wondering, why not just make the whole thing disk-based, with some sort of caching for commonly requested data?

but I'm a bit clueless on how to go about this. I'm imagining data sources pushing update datasets to the server periodically, using some sort of string key/symbol to identify what the data is. the server gets the data, and then what? write it to some sort of binary file? could I write to one file per symbol? (assume over 100k symbols)
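something like this is roughly what I'm picturing for the per-symbol files (a completely untested sketch; all names are made up):

    // Rough sketch of the "one file per symbol" idea (invented names, untested).
    // Each sample is a fixed-size record: 8 bytes of ticks + an 8-byte double,
    // so a time-range lookup could later binary-search by file offset.
    using System;
    using System.IO;

    class SymbolLog
    {
        private readonly string dataDir;

        public SymbolLog(string dataDir) { this.dataDir = dataDir; }

        public void Append(string symbol, DateTime timestamp, double value)
        {
            string path = Path.Combine(dataDir, symbol + ".dat");
            using (var stream = new FileStream(path, FileMode.Append, FileAccess.Write))
            using (var writer = new BinaryWriter(stream))
            {
                writer.Write(timestamp.Ticks); // 8 bytes
                writer.Write(value);           // 8 bytes
            }
        }
    }

though with over 100k symbols I worry about having that many small files around, and about the open/close cost on every write.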

I think what I want is similar to google's BigTable, but on a much smaller scale. basically, a distributed hash table, mapping a string key to a time series of associated data, with very fast retrieval and the ability to retrieve a range query by time. and extra points for multidimensional data.
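to make that concrete, here's the rough in-memory shape I'm imagining (again just an untested sketch with invented names):

    // Hash table from symbol to a time-sorted series, with a range query.
    using System;
    using System.Collections.Generic;

    class TimeSeriesStore
    {
        private readonly Dictionary<string, SortedList<DateTime, double>> series =
            new Dictionary<string, SortedList<DateTime, double>>();

        public void Add(string symbol, DateTime t, double value)
        {
            SortedList<DateTime, double> s;
            if (!series.TryGetValue(symbol, out s))
            {
                s = new SortedList<DateTime, double>();
                series[symbol] = s;
            }
            s[t] = value; // last write wins on a duplicate timestamp
        }

        // Naive range query; a real version would binary-search s.Keys
        // for the start of the range instead of scanning.
        public IEnumerable<KeyValuePair<DateTime, double>> Range(
            string symbol, DateTime from, DateTime to)
        {
            SortedList<DateTime, double> s;
            if (!series.TryGetValue(symbol, out s)) yield break;
            foreach (var kv in s)
                if (kv.Key >= from && kv.Key <= to) yield return kv;
        }
    }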

oh, and this would (ideally) be a c#/windows project - it doesn't need to be that high performance.

A: 

If you use a database and take out the indexing and the relational part, you pretty much get what you described. However, I'm not sure how useful it would be. Could you give us a better idea of why a database did not work for you? What did you try that didn't work?

pbz
I was using sql server 2005, and it was slow. I had data in the form of (timestamp, key1, key2, key3, data1, data2, data3) where the key was hierarchical. so I would say "give me data1 where key1=x, key2=y, and key3=z, in timestamp range [a,b]".
I had indexes out to key3, but this made the index disk space very large, and made either inserts or deletes slow (don't remember). I guess I could revisit sql server, but it seemed sort of "wrong" to use a database as a data server, instead of just a data store...
How did you access the database each time? Creating new connections to the database can be slow, which is why most systems use a connection pool or keep the connection alive.
CookieOfFortune
I would recommend you try to make a small test program that you could share with us and see if we can help make it faster.
pbz
You cannot take out the indexing if you want the queries to be fast. You can of course throw away the relational aspect, since the data is not really relational.
grieve
@grieve True, not if you want to efficiently get the data out, but it sure helps with the insert part :-)
pbz
@pbz: If he has no real time requirement he could: 1) turn off indexing, 2) insert data, 3) turn on indexing, 4) query. But that doesn't sound like what he wants. :)
grieve
+1  A: 

I have to tell you that no "file system" approach (that I'm aware of) is going to be any faster than a relational database. And it will probably be a lot worse.

The problem with relational databases is not that they are inherently slow, but that it is very easy to put data into them without any regard for how the data are stored. A good index, even for millions of records, should produce sub-second results. It's more a matter of design than a problem of access: if you design it well, access will come.
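For instance, for the (timestamp, key1, key2, key3, data...) shape described in the comments above, a covering index might look like the one below. This is purely a sketch: the table and column names are invented, and it assumes Sql Server 2005 or later.

    using System.Data.SqlClient;

    static class IndexSetup
    {
        // Covers "data1 WHERE key1=x AND key2=y AND key3=z AND ts BETWEEN a AND b"
        // without a bookmark lookup per row.
        public static void CreateCoveringIndex(string connectionString)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                using (var cmd = conn.CreateCommand())
                {
                    cmd.CommandText =
                        @"CREATE INDEX IX_Readings_Keys_Ts
                          ON Readings (key1, key2, key3, ts)
                          INCLUDE (data1, data2, data3)";
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }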

edit: Also, if by "relational database" you mean Microsoft Access, then you're right; it is slow with lots of records. I would not go that route. Look into MySql if money is an issue or Oracle/Sql Server if money isn't.

Michael Todd
Relational databases do sit on top of a file system. They can work for small SCADA systems, but they do not scale well at all. They simply use too much disk space and, regardless of how good the index is, tend to fall over far too soon.
grieve
@grieve Sorry, didn't mean to imply that relational databases were somehow divested of a storage medium. I meant that the mechanisms built into relational databases are better (to my knowledge) than trying to rig up a similar system on top of the file system oneself. Wasn't aware of the scaling issues.
Michael Todd
@grieve What else would you recommend besides relational databases if scaling is an issue?
Michael Todd
@grieve Ah. Read your answer. Thanks.
Michael Todd
@Michael Todd: No worries. It is actually a very difficult and interesting problem, probably too much so to be easily answered here; rigging up our system took three years. Of course, we had a lot of constraints that bobsmith may not.
grieve
A: 

I'm not sure why you're down on a database for this. I've done real-time statistics on tables with tens of millions of rows. Further, you could periodically batch up the readings to turn hundreds of thousands of rows into hundreds of rows of compiled data, depending on your needs.
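As a sketch of the batching idea (table names invented; Sql Server date functions assumed), a periodic job could roll the raw 30-second readings up into 5-minute averages:

    using System.Data.SqlClient;

    static class Rollup
    {
        public static void CompileFiveMinuteAverages(string connectionString)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                using (var cmd = conn.CreateCommand())
                {
                    // Bucket timestamps into 5-minute intervals and average them.
                    cmd.CommandText =
                        @"INSERT INTO ReadingsCompiled (symbol, bucket_start, avg_value)
                          SELECT symbol,
                                 DATEADD(minute, DATEDIFF(minute, 0, ts) / 5 * 5, 0),
                                 AVG(value)
                          FROM Readings
                          GROUP BY symbol,
                                   DATEADD(minute, DATEDIFF(minute, 0, ts) / 5 * 5, 0)";
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }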

As for in-memory persistence and key-value pair access, you may want to look at memcachedb. It's based on memcached and offers excellent performance.

Also, after thinking about it more, you could easily run the thing as a hashtable in memory and then periodically serialize it out to the file system for persistence.
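A minimal sketch of that (invented names; the five-minute period is arbitrary):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading;

    class SnapshotStore
    {
        private readonly Dictionary<string, List<KeyValuePair<DateTime, double>>> table =
            new Dictionary<string, List<KeyValuePair<DateTime, double>>>();
        private readonly object sync = new object();
        private Timer timer; // kept in a field so it isn't garbage-collected

        public void StartSnapshots(string path)
        {
            timer = new Timer(_ => Save(path), null,
                              TimeSpan.FromMinutes(5), TimeSpan.FromMinutes(5));
        }

        private void Save(string path)
        {
            lock (sync)
            {
                // Write to a temp file and swap, so a crash mid-write
                // can't corrupt the last good snapshot.
                string tmp = path + ".tmp";
                using (var w = new BinaryWriter(File.Create(tmp)))
                {
                    w.Write(table.Count);
                    foreach (var kv in table)
                    {
                        w.Write(kv.Key);
                        w.Write(kv.Value.Count);
                        foreach (var p in kv.Value)
                        {
                            w.Write(p.Key.Ticks);
                            w.Write(p.Value);
                        }
                    }
                }
                File.Delete(path); // no-op if it doesn't exist yet
                File.Move(tmp, path);
            }
        }
    }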

bbrown
What would you hash against?
grieve
Oh, yeah. That hash space would be huge.
Michael Todd
The questioner stated that it would be key-value pairs. Given that (and his lack of specifics), I just suggested something that might fit. It may or may not be a good idea once details emerge.
bbrown
If you read a bit more closely, you can see that he is asking for key/data-stream pairs. So the key gets him to the data stream, and then he has to index across the data stream. So it is really going to take more than one data structure to solve.
grieve
A: 

I would agree with others that a database would be your best bet.

If you really are generating such a huge amount of data that it would cause a performance problem, you might want to create two tables - one as a "real time" source and another as an "archive".

Your system would insert new data into the real time table, and a batch job would periodically move data from there to the archive table. When performance is a concern, you would query only the smaller real time table. If you actually needed to query across all the data, you would query a view that UNIONs the real time and archive tables.
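A sketch of the batch move and the view (all names invented; Sql Server assumed):

    using System;
    using System.Data.SqlClient;

    static class ArchiveJob
    {
        // Move rows older than the cutoff from the real time table to the
        // archive table, in one transaction so nothing is lost or duplicated.
        public static void MoveOldRows(string connectionString, DateTime cutoff)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                using (var tx = conn.BeginTransaction())
                {
                    var cmd = conn.CreateCommand();
                    cmd.Transaction = tx;
                    cmd.CommandText =
                        @"INSERT INTO ReadingsArchive
                              SELECT * FROM ReadingsRealTime WHERE ts < @cutoff;
                          DELETE FROM ReadingsRealTime WHERE ts < @cutoff;";
                    cmd.Parameters.AddWithValue("@cutoff", cutoff);
                    cmd.ExecuteNonQuery();
                    tx.Commit();
                }
            }
        }
    }

    // Created once, for the rare query across everything:
    //   CREATE VIEW AllReadings AS
    //       SELECT * FROM ReadingsRealTime
    //       UNION ALL
    //       SELECT * FROM ReadingsArchive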

Eric Petroelje
+1  A: 

Sounds like a SCADA (Supervisory Control And Data Acquisition) type application, making use of the Data Acquisition part of the system. Have you looked at off-the-shelf solutions, such as Wonderware/IndustrialSQL or some competitor product?

Having said that, my present employer (The MetService, New Zealand) logs readings every 30 seconds, 1 minute or 1 hour from Automatic Weather Stations (temp, rainfall, wind, etc.), plus forecasts, to an Oracle DB. Minimal indexing: indexes slow down three of the four DML actions and speed up selects, and of course you need the three write actions to be fast, the insert in particular. Fast IO system, with very fast IO for the redo logs. We are moving to partitioned tables so that deletes are faster and generate less redo (drop the tablespace including contents rather than issue a delete). Serious thought given to light, fast transactions for inserts. Serious thought given to the performance of the machines doing the inserts and the network between them and the DB.
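As a sketch of what the partitioning buys (invented names, Oracle syntax; our actual tables and partitions differ), purging a day of data becomes a near-instant partition drop instead of millions of deletes and the redo they generate:

    // Illustrative DDL only, held as constants for a maintenance job to run.
    static class PartitionSql
    {
        public const string CreateTable =
            @"CREATE TABLE readings (
                  ts      TIMESTAMP NOT NULL,
                  station VARCHAR2(32),
                  temp    NUMBER
              )
              PARTITION BY RANGE (ts) (
                  PARTITION p20090331 VALUES LESS THAN (TIMESTAMP '2009-04-01 00:00:00'),
                  PARTITION p20090401 VALUES LESS THAN (TIMESTAMP '2009-04-02 00:00:00')
              )";

        // A metadata operation, not a row-by-row delete:
        public const string PurgeOldestDay =
            "ALTER TABLE readings DROP PARTITION p20090331";
    }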

Karl
+1  A: 

Sadly, I am forbidden by NDA agreements from telling you how to do this. I worked on the team that created a non-relational database that does exactly what you are trying to do. It is called Citadel. I can, however, point you to the link for what is publicly available, and it should give you some ideas of how it works.

http://zone.ni.com/devzone/cda/tut/p/id/6579

You could just buy the product, but it is rather expensive.

Also, as Karl points out, this is generally used in SCADA products like Wonderware, Lookout, and LabVIEW DSC.

A search for SCADA data storage turns up some interesting reading as well.


As an aside, a relational database can solve this problem if the amount of data is small. What tends to happen over time is that the data grows without bounds, and the relational database gets filled beyond its capacity. A good SCADA data storage system can easily handle 50,000 points polled once a second, though at some point even those start to become too large to handle easily.

grieve
+1  A: 

"RRDTool is the OpenSource industry standard, high performance data logging and graphing system for time series data."

It's in two parts: one that logs, stores and retrieves time series data, and a second part for graphing. There are many examples of its use.

Even if you don't use it, its design is definitely relevant.

Conor OG