views: 68

answers: 3
I have a database which stores temperature-logging data from various instruments. Data may be logged as often as once per minute. One approach to designing a log table would be to put each log entry in its own row along with the device ID, a time stamp, and a sequence number (even if the clock on a device is changed, it should still be possible to sort entries in the order the measurements were actually taken). That seems grossly inefficient, however, since every 16-bit measurement would carry perhaps 16 bytes of other data, in addition to whatever the system adds for indexing. I recognize that it is often senseless to try to optimize every last byte out of a database, but expanding the data by a factor of nine or worse seems silly.
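
For concreteness, the arithmetic behind that estimate (a quick sketch; the field widths are my assumptions):

    measurement = 2          # one 16-bit reading
    overhead = 4 + 8 + 4     # device ID + timestamp + sequence number (assumed widths)
    print((measurement + overhead) / measurement)   # 9.0, before any index overhead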

At present, I aggregate the records into groups of equally-spaced readings and store one group per record in a variable-length opaque binary format, along with the device ID, the time stamp and sequence number of the first reading, and the interval between readings. This works nicely, and for all I know may be the best approach, but it doesn't allow for much in the way of queries.
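
In rough outline, each stored group looks something like this (a simplified sketch in Python; the real field widths and layout differ):

    import struct

    # Header stored once per group: device ID, Unix time and sequence
    # number of the first reading, interval in seconds, reading count.
    # (All field widths here are illustrative, not the actual format.)
    HEADER = struct.Struct("<IIIHB")

    def pack_group(device_id, first_ts, first_seq, interval, readings):
        body = struct.pack("<%dH" % len(readings), *readings)
        return HEADER.pack(device_id, first_ts, first_seq, interval,
                           len(readings)) + body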

Is there any nice approach for handling such data sets without excessive redundancy?

A: 

I believe you should use RRDtool to store such data; it is a round-robin database designed for exactly this kind of fixed-interval time series. See the Wikipedia article: https://en.wikipedia.org/wiki/RRDtool

lashtal
@lashtal: Although I didn't mention it in the requirements, the logging frequency may be changed, and there may be a need to keep historical data when that happens.
supercat
+7  A: 

Your data doesn't expand by a factor of 9. Your data stays roughly the same size, because you do not have a bare 16-bit measurement to start with. Your measurement is the Nth measurement from a specific device at a specific moment. So your data already has a sequence number, a device ID and a timestamp before you add it to the database, whether you're willing to account for them or not.

If you store data in a relational table (SQL), store it in a relational format: normalized, one record per row, in a queryable form. 'Aggregating' records into an opaque binary format makes your entire database useless, as the data cannot be queried, aggregated, or filtered. Your definition of 'this works nicely' is basically 'I can write the data and nobody can make any use of it', which is hardly 'nice'. You may just as well dump the data to /dev/null...

Store the data as proper records, using proper database types; don't use 'opaque blobs'. And 'data may be logged as often as once per minute' is not 'frequent' by any database standard. If you said '100 times per second', then we'd have something to talk about.
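
A sketch of what I mean, in SQLite via Python (table and column names are illustrative):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE measurement (
            device_id INTEGER NOT NULL,
            seq       INTEGER NOT NULL,  -- ordering survives clock changes
            taken_at  INTEGER NOT NULL,  -- Unix time
            temp      INTEGER NOT NULL,  -- raw 16-bit reading
            PRIMARY KEY (device_id, seq)
        )
    """)
    # Any question then becomes a one-line query, e.g. the extremes
    # for one device over an interval:
    db.execute("""
        SELECT MIN(temp), MAX(temp) FROM measurement
        WHERE device_id = ? AND taken_at BETWEEN ? AND ?
    """, (42, 1262304000, 1262390400))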

Remus Rusanu
@Remus Rusanu: Before the data goes into the database, it's kept by the logging measurement devices. Those devices store data in packed records of up to 32 measurements; the device ID and next-record sequence number need to be held only once per device, and the timestamp and record frequency only once per record. When looking at a record in isolation, the metadata is "real data", but when looking at consecutive records in aggregate, the amount of real data in the set is far less than the total for the records viewed in isolation.
supercat
@Remus Rusanu: There's a guarantee that no more than a day of data may be aggregated per record, so it is possible to find all records for a given interval by starting the search a day before the time of interest. The other desired type of query is to find the minimum and maximum temperature in a time interval; this could be assisted by adding fields for the minimum and maximum temperature within a record. Records within a day of the start or end of the interval of interest would have to be parsed, but those in the central part could be used as-is.
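
A sketch of the kind of query I have in mind (Python/SQLite; all names are hypothetical):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE packed_record (   -- hypothetical names
            device_id INTEGER, start_ts INTEGER,
            rec_min INTEGER, rec_max INTEGER, payload BLOB
        )
    """)
    # Records wholly inside [t0, t1] are answered from the summary
    # columns; only those within a day of either end need parsing.
    t0, t1 = 1262304000, 1262908800
    rows = db.execute("""
        SELECT start_ts, rec_min, rec_max, payload FROM packed_record
        WHERE device_id = ? AND start_ts BETWEEN ? - 86400 AND ?
    """, (42, t0, t1)).fetchall()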
supercat
@supercat: *Why* do you store the data in a *database*?
Remus Rusanu
@Remus Rusanu: Data is collected from a plurality of units. Storing stuff in a database seems better than using a separate file for each unit.
supercat
@supercat: Good developers usually give answers more along the lines of 'I need to query the data', 'I need to leverage the high availability and disaster recovery of an RDBMS' or 'I need to expose the data to be consumed by other stakeholders'.
Remus Rusanu
@Remus Rusanu: The end-user will likely use the data for three purposes: (1) to see whether the temperature has been doing anything unusual recently (e.g. view a graph to see if the thermostat cycling pattern changes); (2) to demonstrate to auditors that products are being stored in acceptable conditions; (3) if some time down the road (which could theoretically be years) a product is found to be spoiled, to demonstrate that it was kept in acceptable conditions while in the custody of the monitoring equipment's owner.
supercat
@supercat: Say your customers want to expose the graph on the intranet and they use Reporting Services for this. Would they prefer to see the data as a queryable table with one measurement per row, or as an 'opaque blob' they have to crack open in code to extract the information? Or say they need to look at the data 15 years down the road, after your app has moved on to run on the SuperPlatformOfTheFuture v. 5. All of a sudden they need to install that old app, based on an outdated runtime no one knows how to set up, just to *see* the data.
Remus Rusanu
@Remus Rusanu: The program supports some internal graphing features, and it supports an "export to CSV" function. I do share your concern, but redundantly storing all the metadata with every measurement doesn't seem appealing either. What I'd like is a nice way to associate one set of metadata with a plurality of measurements; that's what I was seeking ideas for, e.g. something like the sketch below.
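
A hypothetical two-table layout that might give the best of both (a sketch, not what the program does today):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- Shared metadata, stored once per group of readings...
        CREATE TABLE reading_group (
            group_id  INTEGER PRIMARY KEY,
            device_id INTEGER NOT NULL,
            start_ts  INTEGER NOT NULL,  -- Unix time of first reading
            interval  INTEGER NOT NULL,  -- seconds between readings
            first_seq INTEGER NOT NULL
        );
        -- ...while each measurement stays individually queryable.
        CREATE TABLE reading (
            group_id INTEGER NOT NULL REFERENCES reading_group,
            pos      INTEGER NOT NULL,   -- position within the group
            temp     INTEGER NOT NULL,   -- raw 16-bit value
            PRIMARY KEY (group_id, pos)
        );
    """)
    # Each reading's timestamp is reconstructed as start_ts + pos * interval:
    db.execute("""
        SELECT g.device_id, g.start_ts + r.pos * g.interval, r.temp
        FROM reading r JOIN reading_group g USING (group_id)
    """)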
supercat
+1  A: 

Is this really a problem? Suppose we overestimate a bit and say that there are 50 bytes of data+metadata per measurement. A quick back-of-the-envelope calculation suggests you won't have any problems unless you are in a really tight environment.
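
The back-of-the-envelope math:

    per_reading = 50                         # bytes, deliberately generous
    per_year = per_reading * 60 * 24 * 365   # one reading per minute
    print(per_year / 2**20)                  # ~25 MiB per device per year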

Clueless