views: 104
answers: 4

Hello,

I'm designing an application that receives information from roughly 100k sensors that measure time-series data. Each sensor measures a single integer data point once every 15 minutes, saves a log of these values, and sends that log to my application once every 4 hours. My application should maintain about 5 years of historical data. The packet I receive once every 4 hours is of the following structure:

  • Date and time of the sequence start
  • Number of samples to arrive (assume this is fixed for the sake of simplicity, although in practice there may be partials)
  • The sequence of samples, each of exactly 4 bytes
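
For concreteness, here is a minimal parsing sketch in Python. The field layout follows the list above, but the exact encodings (Unix-timestamp header, big-endian unsigned 32-bit samples) are placeholder assumptions, not a fixed wire format:

    import struct
    from datetime import datetime, timedelta, timezone

    SAMPLE_INTERVAL = timedelta(minutes=15)

    def parse_packet(packet: bytes):
        """Decode one 4-hour packet into (timestamp, value) pairs.

        Assumed layout (placeholders, not a fixed wire format):
          8 bytes      sequence start, Unix timestamp, big-endian
          4 bytes      number of samples
          N x 4 bytes  the samples, big-endian unsigned integers
        """
        start_ts, count = struct.unpack_from(">QI", packet, 0)
        start = datetime.fromtimestamp(start_ts, tz=timezone.utc)
        values = struct.unpack_from(f">{count}I", packet, 12)
        return [(start + i * SAMPLE_INTERVAL, v)
                for i, v in enumerate(values)]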

My application's main usage scenario is showing graphs of composite signals at certain dates. When I say "composite" signals, I mean, for example, showing the result of adding Sensor A's signal to Sensor B's signal and subtracting Sensor C's signal.
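
To illustrate, here is the kind of query I have in mind, assuming for the moment one row per sample (option 1 below); the table and column names are just for illustration:

    import sqlite3

    # Assumed table: sample(sensor_id INTEGER, ts INTEGER, value INTEGER),
    # with timestamps normalized across sensors.
    COMPOSITE_SQL = """
        SELECT ts,
               SUM(CASE sensor_id
                     WHEN :a THEN  value
                     WHEN :b THEN  value
                     WHEN :c THEN -value
                   END) AS composite
        FROM sample
        WHERE sensor_id IN (:a, :b, :c)
          AND ts BETWEEN :start AND :end
        GROUP BY ts
        ORDER BY ts
    """

    conn = sqlite3.connect("sensors.db")
    points = conn.execute(COMPOSITE_SQL, {"a": 1, "b": 2, "c": 3,
                                          "start": 0, "end": 10**10}).fetchall()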

My dilemma is how to store this time-series data in my database. I see two options, assuming I use a relational database (both sketched in code after this list):

  1. Store every sample in a row of its own: when I receive a signal, break it into samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.
  2. Store every 4-hour signal as a separate row with its starting time. In this case, whenever a signal arrives, I just add it as a BLOB to the database.
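
For reference, the two shapes might look like this (a SQLite sketch; table and column names are illustrative):

    import sqlite3

    conn = sqlite3.connect("sensors.db")
    conn.executescript("""
        -- Option 1: one row per sample; trivially queryable, but roughly
        -- 17.5 billion rows over five years.
        CREATE TABLE IF NOT EXISTS sample (
            sensor_id INTEGER NOT NULL,
            ts        INTEGER NOT NULL,  -- normalized Unix timestamp
            value     INTEGER NOT NULL,
            PRIMARY KEY (sensor_id, ts)
        );

        -- Option 2: one row per 4-hour packet; 1/16th the row count,
        -- but the samples are opaque to SQL and must be decoded in app code.
        CREATE TABLE IF NOT EXISTS packet (
            sensor_id INTEGER NOT NULL,
            start_ts  INTEGER NOT NULL,  -- timestamp of the first sample
            samples   BLOB    NOT NULL,  -- 16 x 4-byte values
            PRIMARY KEY (sensor_id, start_ts)
        );
    """)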

There are obvious pros and cons for each of the options, including storage size, performance, and complexity of the code "above" the database.

I wondered if there are best practices for such cases.

Many thanks.

+2  A: 

Storing each sample in its own row sounds simple and logical to me. Don't be too hasty to optimize unless there is actually a good reason for it. Maybe you should do some tests with dummy data to see if any optimization is really necessary.

Tom Dalling
+1  A: 

I think storing the data in the form that makes it easiest to carry out your main goal is likely the least painful overall. In this case, it's likely the more efficient option as well.

Since your main goal appears to be to display the information in interesting and flexible ways, I'd go with separate rows for each data point. Most of the effort required to write this program well will likely be on the display side, so you should minimize the complexity on that side as much as possible.

acrosman
+1  A: 

Storing data in BLOBs is good if the content isn't relevant and you would never want to run queries against it. In this case, your data will be the contents of the database, and therefore very relevant.

I think you should:

1. Store every sample in a row of its own: when I receive a signal, break it into samples, and store each sample separately with its timestamp. Assume the timestamps can be normalized across signals.

Kieveli
+1  A: 

I see two database operations here: the first is to store the data as it comes in, and the second is to retrieve the data in a (potentially large) number of ways.

As Kieveli says, since you'll be using discrete parts of the data (as opposed to all of the data all at once), storing it as a blob won't help you when it comes time to read it. So for the first task, storing the data line by line would be optimal.
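
A minimal sketch of that write path, assuming the one-row-per-sample table from the question's sketches and samples already decoded into (timestamp, value) pairs:

    import sqlite3

    def store_packet(conn: sqlite3.Connection, sensor_id: int, samples):
        """samples: iterable of (unix_ts, value) pairs from one decoded packet."""
        with conn:  # commit the whole packet atomically
            conn.executemany(
                "INSERT INTO sample (sensor_id, ts, value) VALUES (?, ?, ?)",
                ((sensor_id, ts, v) for ts, v in samples))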

This might also be "good enough" when querying the data. However, performance may become an issue given the sheer volume involved: 100,000 sensors x 4 samples per hour x 24 hours = 9,600,000 rows per day, which over five years (1,826 days) comes to 9,600,000 x 1,826 = 17,529,600,000 rows. To my mind, if you want to write flexible queries against that kind of data, you'll want some form of star schema structure (as gets used in data warehouses).
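
As a rough illustration of the star-schema idea (SQLite syntax, invented names): a narrow fact table holds the measurements, and dimension tables hold the attributes you slice by:

    import sqlite3

    conn = sqlite3.connect("warehouse.db")
    conn.executescript("""
        -- Dimension: one row per sensor, with attributes to slice by.
        CREATE TABLE IF NOT EXISTS dim_sensor (
            sensor_key INTEGER PRIMARY KEY,
            name       TEXT,
            location   TEXT
        );

        -- Dimension: one row per 15-minute slot, pre-split into useful parts.
        CREATE TABLE IF NOT EXISTS dim_time (
            time_key   INTEGER PRIMARY KEY,  -- e.g. Unix timestamp of the slot
            day        TEXT,                 -- 'YYYY-MM-DD' for daily roll-ups
            hour       INTEGER,
            minute     INTEGER
        );

        -- Fact table: one narrow row per measurement.
        CREATE TABLE IF NOT EXISTS fact_sample (
            sensor_key INTEGER NOT NULL REFERENCES dim_sensor,
            time_key   INTEGER NOT NULL REFERENCES dim_time,
            value      INTEGER NOT NULL,
            PRIMARY KEY (sensor_key, time_key)
        );
    """)

Roll-ups (say, daily averages per location) then become joins against the dimensions rather than string manipulation on timestamps.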

Whether you load the data directly into the warehouse, or let it build up "row by row" to be added to the warehouse every day/week/month/whatever, depends on time, effort, available resources, and so on.

A final suggestion: when you set up a test environment for your new code, load it with several years of (dummy) data, to see how it will perform.
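
For example, a throwaway generator along these lines (the scale factors are made up; start smaller than the real 100,000 sensors and grow until the numbers stop being comfortable):

    import random
    import sqlite3

    conn = sqlite3.connect("loadtest.db")
    conn.execute("CREATE TABLE IF NOT EXISTS sample "
                 "(sensor_id INTEGER, ts INTEGER, value INTEGER)")

    STEP = 15 * 60        # one sample every 15 minutes -> 96 per day
    SENSORS = 1000        # scaled down from 100,000 for a first pass
    DAYS = 5 * 365        # five years

    for day in range(DAYS):
        base = day * 86400
        rows = ((s, base + i * STEP, random.randint(0, 2**31 - 1))
                for s in range(SENSORS) for i in range(96))
        with conn:        # one transaction per day keeps inserts fast
            conn.executemany("INSERT INTO sample VALUES (?, ?, ?)", rows)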

Philip Kelley