views: 509 · answers: 5

I am trying to identify possible methods for storing 100 channels of 25 Hz floating-point data. This amounts to 78,840,000,000 data points per year.
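For reference, that figure works out as:

    100 channels x 25 samples/s x 60 s x 60 min x 24 h x 365 days = 78,840,000,000 data points/year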

Ideally, all of this data would be efficiently available to websites and tools such as SQL Server Reporting Services. We are aware that relational databases are poor at handling time series of this scale, but have yet to identify a convincing time-series-specific database.

The key issues are compression for efficient storage while still allowing easy and efficient querying, reporting and data mining.

  • How would you handle this data?

  • Are there features or table designs in SQL Server that could handle such a quantity of time-series data?

  • If not, are there any third-party extensions for SQL Server to efficiently handle mammoth time series?

  • If not, are there time-series databases that specialise in handling such data yet provide natural access through SQL, .NET, and SQL Server Reporting Services?

thanks!

+1  A: 

I'd partition the table by, say, date, to split the data into manageable chunks of 216,000,000 rows each (one day's worth).

Provided you don't need whole-year statistics, this is easily served by indexes.

A query like "give me the average for a given hour" will then be a matter of seconds.
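A minimal sketch of what date-based partitioning could look like in SQL Server 2005/2008 (table, column and partition names here are illustrative, not from the question):

    -- Partition function with one boundary per day (generate the full list by script);
    -- note that SQL Server 2005/2008 allow at most 1,000 partitions per table
    CREATE PARTITION FUNCTION pfDaily (datetime)
    AS RANGE RIGHT FOR VALUES ('2009-06-08', '2009-06-09', '2009-06-10');

    -- Map all partitions to one filegroup for simplicity
    CREATE PARTITION SCHEME psDaily
    AS PARTITION pfDaily ALL TO ([PRIMARY]);

    -- Sample table partitioned on the timestamp column
    CREATE TABLE dbo.ChannelSample
    (
        ChannelId  smallint NOT NULL,
        SampleTime datetime NOT NULL,
        Value      real     NOT NULL
    ) ON psDaily (SampleTime);

    -- "Average for a given hour" then touches a single partition
    SELECT AVG(Value)
    FROM dbo.ChannelSample
    WHERE ChannelId = 42
      AND SampleTime >= '2009-06-09 10:00' AND SampleTime < '2009-06-09 11:00';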

Quassnoi
Thanks for the answer. Using SQL Server partitioning or just multiple tables? If it's multiple tables, are there any design patterns for easily handling queries across tables?
Duncan
@Duncan: SQL Server partitioning would be best, but I don't know how many partitions it allows. With multiple tables, you can run a scheduled job to recreate a view as SELECT * FROM table_20090609 UNION ALL SELECT * FROM table_20090608, etc. Don't forget to include a partitioning column in the tables.
Quassnoi
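A rough sketch of the multiple-tables-plus-view pattern described above (all names illustrative); CHECK constraints on the date column let the optimizer skip tables that cannot contain the requested range:

    -- One table per day, with a CHECK constraint marking the range it holds
    CREATE TABLE dbo.ChannelSample_20090608
    (
        ChannelId  smallint NOT NULL,
        SampleTime datetime NOT NULL
            CHECK (SampleTime >= '2009-06-08' AND SampleTime < '2009-06-09'),
        Value      real     NOT NULL
    );

    CREATE TABLE dbo.ChannelSample_20090609
    (
        ChannelId  smallint NOT NULL,
        SampleTime datetime NOT NULL
            CHECK (SampleTime >= '2009-06-09' AND SampleTime < '2009-06-10'),
        Value      real     NOT NULL
    );
    GO

    -- A scheduled job re-creates the view as new daily tables appear
    CREATE VIEW dbo.ChannelSampleAll AS
        SELECT ChannelId, SampleTime, Value FROM dbo.ChannelSample_20090609
        UNION ALL
        SELECT ChannelId, SampleTime, Value FROM dbo.ChannelSample_20090608;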
+1  A: 

I suppose you need random access to the data series. An idea I've already used for a rainfall data table is to subdivide the entire dataset into smaller parts, creating an entry for every few minutes, or even every minute. You can then pull this (still large) array from the database and access the needed part directly, since there is a direct correlation between time offset and byte offset.
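A sketch of one way this could look, assuming one row per channel per minute with the samples packed into a blob (all names are illustrative); at a fixed 25 Hz with 4-byte floats, the byte offset of a sample inside the blob follows directly from its time offset:

    -- One blob per channel per minute: 60 s x 25 Hz x 4 bytes = 6,000 bytes per row
    CREATE TABLE dbo.ChannelMinute
    (
        ChannelId   smallint        NOT NULL,
        MinuteStart datetime        NOT NULL,
        Samples     varbinary(6000) NOT NULL,
        PRIMARY KEY (ChannelId, MinuteStart)
    );

    -- Fetch the first sample of second 37 without pulling the whole blob:
    -- byte offset = seconds-into-minute * 25 samples/s * 4 bytes (+1, SUBSTRING is 1-based)
    SELECT SUBSTRING(Samples, 37 * 25 * 4 + 1, 4) AS RawSampleBytes
    FROM dbo.ChannelMinute
    WHERE ChannelId = 42 AND MinuteStart = '2009-06-09 10:15';
    -- decoding the 4 bytes back to a float would typically happen client-side, e.g. via .NET's BitConverter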

Lopoc
Thanks for the answer. Using a blob for the big array? Are there any design approaches for making a blob easily queryable, e.g. views?
Duncan
+1  A: 

The feature set you're describing is that of an analysis cube. Check out Analysis Services from Microsoft if you're in that part of the tech world:

http://msdn.microsoft.com/en-us/library/ms175609(SQL.90).aspx

As for the model you're describing, you need to implement a Kimball model (the standard data-warehousing model) with a time dimension. I ran into this problem storing media log files a while back.
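A minimal sketch of what a Kimball-style star schema for this data might look like (my assumption of the layout; all names illustrative): a time dimension and a channel dimension, with a fact table at a pre-aggregated grain that Analysis Services can build a cube over:

    -- Time dimension at one-minute grain
    CREATE TABLE dbo.DimTime
    (
        TimeKey  bigint   NOT NULL PRIMARY KEY,  -- e.g. 200906091015 = 2009-06-09 10:15
        [Year]   smallint NOT NULL,
        [Month]  tinyint  NOT NULL,
        [Day]    tinyint  NOT NULL,
        [Hour]   tinyint  NOT NULL,
        [Minute] tinyint  NOT NULL
    );

    CREATE TABLE dbo.DimChannel
    (
        ChannelKey  smallint    NOT NULL PRIMARY KEY,
        ChannelName varchar(50) NOT NULL
    );

    -- Fact table pre-aggregated to the dimension grain
    CREATE TABLE dbo.FactSignal
    (
        TimeKey     bigint   NOT NULL REFERENCES dbo.DimTime (TimeKey),
        ChannelKey  smallint NOT NULL REFERENCES dbo.DimChannel (ChannelKey),
        AvgValue    real     NOT NULL,
        MinValue    real     NOT NULL,
        MaxValue    real     NOT NULL,
        SampleCount int      NOT NULL
    );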

Good luck.

Chris B. Behrens
Thanks for the answer. It's difficult to know where to get started with data warehousing. I have Googled and read around your link, but would benefit from something like a sample project that tackles a similar problem. Are you aware of anything like this?
Duncan
A: 

You have

A. 365 x 24 x 100 = 876,000 hourly signals (all channels) per year

B. each signal comprising 3,600 x 25 = 90,000 data points

How about storing the data as one row per signal, with columns for summary/query stats for the currently supported use cases, and a blob of the compressed signal for future ones?
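A sketch of that layout, assuming one row per channel per hour (all names illustrative): the stats columns serve today's queries, while the compressed blob keeps the 90,000 raw samples around for future needs:

    CREATE TABLE dbo.HourlySignal
    (
        ChannelId  smallint       NOT NULL,
        HourStart  datetime       NOT NULL,
        AvgValue   real           NOT NULL,
        MinValue   real           NOT NULL,
        MaxValue   real           NOT NULL,
        RawSamples varbinary(max) NOT NULL,  -- 90,000 floats, compressed; decompressed client-side when needed
        PRIMARY KEY (ChannelId, HourStart)
    );

    -- Reports and data mining touch only the stats columns, e.g. a daily roll-up for June
    -- (averaging hourly averages is exact here because every hour has the same sample count)
    SELECT CONVERT(char(10), HourStart, 120) AS [Day],
           AVG(AvgValue) AS DayAvg, MIN(MinValue) AS DayMin, MAX(MaxValue) AS DayMax
    FROM dbo.HourlySignal
    WHERE ChannelId = 42
      AND HourStart >= '2009-06-01' AND HourStart < '2009-07-01'
    GROUP BY CONVERT(char(10), HourStart, 120);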

bubaker
Thanks for the answer. I may not fully understand the suggestion. Is the suggestion for each row to look like (signalId, timeperiod, float ave, float min, float max, blob raw)? Are there any examples of making blob data easily queryable, e.g. views?
Duncan
Something along those lines, but I'm not sure about making the blob data queryable at all. My thinking was to limit queries to the additional stats columns as needed.
bubaker
A: 

You could check out Infobright Community or Enterprise Edition. It is column-oriented storage designed for analytics and big data (existing installations up to 30 TB now, as they say) with a good compression rate.

The data loader is also pretty fast, and connectors exist for ETL tools (Talend, Kettle and so on).

The Community Edition is available for free under GNU GPL terms, but only allows adding data via the native loader. The Enterprise Edition supports single-row inserts/updates via DML.

Another benefit is that you can use it with any tool that supports MySQL connections.

Column orientation also allows you, for example, to add columns for the date components at every aggregation level you need (I use date, week number, month and quarter) for better performance, but it works well without them too.
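A sketch of what such a table might look like in Infobright's MySQL dialect (my assumption; names are illustrative), with the date components pre-computed as extra columns as described above and loading done through the native loader:

    -- MySQL/Infobright dialect; ICE loads data via LOAD DATA INFILE only
    CREATE TABLE channel_sample
    (
        channel_id  SMALLINT NOT NULL,
        sample_time DATETIME NOT NULL,
        sample_date DATE     NOT NULL,  -- pre-computed date components for grouping
        week_no     TINYINT  NOT NULL,
        month_no    TINYINT  NOT NULL,
        qtr_no      TINYINT  NOT NULL,
        value       FLOAT    NOT NULL
    ) ENGINE=BRIGHTHOUSE;

    -- Bulk load; an external ETL step (Python/Talend/Kettle) fills the date columns
    LOAD DATA INFILE '/data/2009-06-09.csv'
    INTO TABLE channel_sample
    FIELDS TERMINATED BY ',';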

I use it for a (so far) relatively small amount of business-transaction data for analytics, with R as the data-analysis tool via the MySQL interface and Python (NumPy) scripts as a kind of ETL.

Cons: lack of official UTF-8 support, and aggregation by function values (e.g. SELECT MONTH(date) FROM ...) is not implemented yet (planned for July 2009, AFAIK), but I handle this in the ETL step.

Link: http://www.infobright.org/Download/ICE/

zzr
Thanks - I'll take a look.
Duncan
Please feel free to share your experience after exploring ICE :) I'm working on the architecture of our small analysis/reporting app with R, Infobright and Django as the report viewer, and I'm interested in new thoughts about storing/representing large data :)
zzr