I want to build something to store and serve up time series data coming in from a variety of sources at different intervals. this includes both raw data and computed data. for example, say I want to log a temperature reading every 30 seconds, plus a temperature forecast I'm calculating separately every 5 minutes.
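to make that concrete, here's the kind of record I'm picturing for a single reading (the names are all made up, this is just a sketch):

```csharp
using System;

// one data point - hypothetical names. both raw readings and computed
// values (like the forecast) could share this shape, just stored under
// different symbols.
public readonly struct Sample
{
    public readonly long TimestampUtcTicks; // when the reading applies
    public readonly double Value;           // e.g. temperature in celsius

    public Sample(long timestampUtcTicks, double value)
    {
        TimestampUtcTicks = timestampUtcTicks;
        Value = value;
    }
}
```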
I need to be able to query the data quickly, and I've found a relational database doesn't work well at all once it gets too big. so I was thinking about building some sort of in-memory store, but I'm sure it will crash at some point, so I'd need to persist the data to disk anyway. which made me wonder: why not just make the whole thing disk-based, with some sort of caching layer for commonly requested data?
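in case it helps explain what I mean by caching, here's a very rough sketch (class and method names are hypothetical, LoadFromDisk is just a stub for whatever the disk format ends up being, and the eviction is deliberately dumb):

```csharp
using System.Collections.Generic;

// rough sketch of the caching idea: recently requested series stay in
// memory, everything else gets loaded from disk on demand.
public class SeriesCache
{
    private readonly Dictionary<string, List<(long Ticks, double Value)>> _hot = new();
    private const int MaxCachedSymbols = 1000; // arbitrary cap, just for the sketch

    public List<(long Ticks, double Value)> Get(string symbol)
    {
        if (_hot.TryGetValue(symbol, out var series))
            return series;

        series = LoadFromDisk(symbol);
        if (_hot.Count >= MaxCachedSymbols)
            _hot.Clear(); // crude: dump everything and start over; a real version would do LRU
        _hot[symbol] = series;
        return series;
    }

    // placeholder - depends entirely on the on-disk format
    private List<(long Ticks, double Value)> LoadFromDisk(string symbol) => new();
}
```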
but I'm a bit clueless about how to go about this. I'm imagining data sources pushing update datasets to the server periodically, using some sort of string key/symbol to identify what the data is. the server gets the data, and then what? write it to some sort of binary file? could I write to one file per symbol? (assume over 100k symbols.)
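here's roughly what I imagine the write path looking like: a fixed 16-byte record (8-byte timestamp + 8-byte double) appended to one file per symbol, with symbols hashed into subdirectories so 100k+ files don't all land in one folder. all the names are made up, and real code would need to sanitize the symbol before using it in a path:

```csharp
using System;
using System.IO;

// sketch of the "one binary file per symbol" idea
public static class SymbolWriter
{
    public static string PathFor(string symbol, string root)
    {
        // simple deterministic hash (string.GetHashCode isn't stable
        // across processes, so it can't be used for file paths)
        uint h = 0;
        foreach (char c in symbol) h = h * 31 + c;

        // two-level fan-out: 256 subdirectories under the root
        return Path.Combine(root, (h % 256).ToString("x2"), symbol + ".ts");
    }

    public static void Append(string symbol, string root, long utcTicks, double value)
    {
        string path = PathFor(symbol, root);
        Directory.CreateDirectory(Path.GetDirectoryName(path)!);

        // append one fixed-size record to the end of the symbol's file
        using var fs = new FileStream(path, FileMode.Append, FileAccess.Write);
        using var bw = new BinaryWriter(fs);
        bw.Write(utcTicks); // 8 bytes
        bw.Write(value);    // 8 bytes
    }
}
```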
I think what I want is similar to google's BigTable, but on a much smaller scale: basically a distributed hash table mapping a string key to a time series of associated data, with very fast retrieval and the ability to query a range of data by time. and extra points for multidimensional data.
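for the time-range query, since data arrives in time order and the records are fixed-size, I'm guessing I could binary-search the file for the start of the range and then scan forward, something like:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// sketch of the range query over the per-symbol file written above:
// records are 16 bytes each and sorted by timestamp (append order),
// so we can seek straight to the start of the range.
public static class SymbolReader
{
    private const int RecordSize = 16; // 8-byte timestamp + 8-byte double

    public static List<(long Ticks, double Value)> Range(string path, long fromTicks, long toTicks)
    {
        var results = new List<(long, double)>();
        using var fs = File.OpenRead(path);
        using var br = new BinaryReader(fs);
        long count = fs.Length / RecordSize;

        // binary search for the first record with timestamp >= fromTicks
        long lo = 0, hi = count;
        while (lo < hi)
        {
            long mid = (lo + hi) / 2;
            fs.Position = mid * RecordSize;
            if (br.ReadInt64() < fromTicks) lo = mid + 1; else hi = mid;
        }

        // scan forward until we pass the end of the range
        fs.Position = lo * RecordSize;
        for (long i = lo; i < count; i++)
        {
            long t = br.ReadInt64();
            double v = br.ReadDouble();
            if (t > toTicks) break;
            results.Add((t, v));
        }
        return results;
    }
}
```

that would make a range query O(log n) to find the start plus a sequential read, which I think is about as fast as a disk-based approach gets for this access pattern.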
oh, and this would (ideally) be a c#/windows project - it doesn't need to be that high performance.