I am evaluating options for efficient data storage in Java. The data set is time-stamped data values with a named primary key, e.g.

Name: A|B|C:D
Value: 124
TimeStamp: 01/06/2009 08:24:39,223
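
For reference, here is a minimal sketch of that record shape as a JPA/Hibernate entity; the class and column names are my own illustrative choices, not a fixed schema:

    import java.util.Date;
    import javax.persistence.*;

    // Minimal sketch of the record shape described above. Entity and
    // column names are illustrative assumptions, not a fixed schema.
    @Entity
    @Table(name = "data_point")
    public class DataPoint {
        @Id
        @GeneratedValue
        private Long id;

        // Named key, e.g. "A|B|C:D"
        @Column(name = "name", nullable = false)
        private String name;

        @Column(name = "data_value", nullable = false)
        private long value;

        // Millisecond-precision timestamp, e.g. 01/06/2009 08:24:39,223
        @Temporal(TemporalType.TIMESTAMP)
        @Column(name = "time_stamp", nullable = false)
        private Date timeStamp;

        // Getters and setters omitted for brevity.
    }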

This could be a stock price at a given point in time, so it is, I suppose, a classic time-series data pattern. However, I really need a generic RDBMS solution that will work with any reasonable JDBC-compatible database, as I would like to use Hibernate. Consequently, time-series extensions to databases like Oracle are not really an option, since I would like implementors to be able to use their own JDBC/Hibernate-capable database.

The challenge here is simply the massive volume of data that can accumulate in a short period of time. So far, my implementations have focused on defining periodic rollup and purge schedules in which raw data is aggregated into DAY, WEEK, MONTH, etc. tables. The downside is the early loss of granularity and the slight inconvenience of period mismatches between the different aggregates.
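
To make the rollup idea concrete, here is a rough plain-JDBC sketch of one daily pass; the table and column names, and the CAST-based day truncation, are assumptions for illustration and would need adjusting per database:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;

    // One rollup pass: aggregate raw rows older than a cutoff into a
    // DAY table, then purge them, in a single transaction. Table and
    // column names are assumptions; the day-truncation expression
    // varies by database.
    public class DailyRollup {
        public void rollup(Connection conn, Timestamp cutoff) throws Exception {
            conn.setAutoCommit(false);
            try (PreparedStatement agg = conn.prepareStatement(
                    "INSERT INTO data_point_day " +
                    "  (name, day, min_value, max_value, avg_value, sample_count) " +
                    "SELECT name, CAST(time_stamp AS DATE), MIN(data_value), " +
                    "       MAX(data_value), AVG(data_value), COUNT(*) " +
                    "FROM data_point WHERE time_stamp < ? " +
                    "GROUP BY name, CAST(time_stamp AS DATE)");
                 PreparedStatement purge = conn.prepareStatement(
                    "DELETE FROM data_point WHERE time_stamp < ?")) {
                agg.setTimestamp(1, cutoff);
                agg.executeUpdate();
                purge.setTimestamp(1, cutoff);
                purge.executeUpdate();
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }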

The options here are limited, since there is an absolute limit to how much data can be physically compressed while retaining its original granularity, and that limit is tightened further by the directive to use a relational database, and a generic JDBC-capable one at that.

Borrowing a notional concept from classic data compression algorithms, and leveraging the fact that many consecutive values for the same named key can be expected to be identical, I am wondering if there is a way I can seamlessly reduce the number of stored records by conflating repeating values into one logical row while also storing a counter that indicates, effectively, "the next n records have the same value". The implementation of just that seems simple enough, but the trade-off is that the data model is now hideously complicated to query against using standard SQL, especially when using any sort of aggregate SQL functions. This significantly reduces the usefulness of the data store, since only complex custom code can restore the data to a "decompressed" state, resulting in an impedance mismatch with hundreds of tools that will not be able to render the data properly.
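
A minimal sketch of that run-length idea, assuming the input for a single key is already ordered by timestamp (the CompressedRun holder is hypothetical):

    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    // Hypothetical holder for one run: "starting at this timestamp, the
    // next 'count' samples for this key all carry the same value".
    class CompressedRun {
        final String name;
        final long value;
        final Date firstTimeStamp;
        int count;

        CompressedRun(String name, long value, Date firstTimeStamp) {
            this.name = name;
            this.value = value;
            this.firstTimeStamp = firstTimeStamp;
            this.count = 1;
        }
    }

    class RunLengthCompressor {
        // Collapses consecutive identical values for a single key into
        // runs. Assumes values/stamps are parallel lists already ordered
        // by timestamp.
        static List<CompressedRun> compress(String name, List<Long> values,
                                            List<Date> stamps) {
            List<CompressedRun> runs = new ArrayList<CompressedRun>();
            CompressedRun current = null;
            for (int i = 0; i < values.size(); i++) {
                if (current != null && current.value == values.get(i)) {
                    current.count++;   // same value: extend the current run
                } else {
                    current = new CompressedRun(name, values.get(i), stamps.get(i));
                    runs.add(current); // value changed: start a new run
                }
            }
            return runs;
        }
    }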

I considered the possibility of defining custom Hibernate types that would basically "understand" the compressed data set, blow it back up, and return query results with dynamically created synthetic rows. (The database will be read-only to all clients except the tightly controlled input stream.) Several of the tools I had in mind will integrate with Hibernate/POJOs in addition to raw JDBC (e.g. JasperReports), but this does not really address the aggregate-functions issue and probably has a bunch of other issues as well.
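
The "blow it back up" step might look roughly like the following. Note that this sketch assumes a fixed sampling interval so per-row timestamps can be reconstructed, which is an assumption the real data may not satisfy:

    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    // Sketch of the decompression side: one stored run becomes 'count'
    // synthetic rows at read time. Row is a plain holder standing in
    // for whatever entity the custom Hibernate type would return.
    class RunExpander {
        static class Row {
            final String name;
            final long value;
            final Date timeStamp;

            Row(String name, long value, Date timeStamp) {
                this.name = name;
                this.value = value;
                this.timeStamp = timeStamp;
            }
        }

        // Assumes a fixed sampling interval so per-row timestamps can
        // be reconstructed; irregular sampling would need stored offsets.
        static List<Row> expand(String name, long value, Date firstStamp,
                                int count, long intervalMillis) {
            List<Row> rows = new ArrayList<Row>();
            for (int i = 0; i < count; i++) {
                rows.add(new Row(name, value,
                        new Date(firstStamp.getTime() + i * intervalMillis)));
            }
            return rows;
        }
    }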

So I am partway to resigning myself to possibly having to use a more proprietary [possibly non-SQL] data store (any suggestions appreciated) and then focusing on the possibly less complex task of writing a pseudo-JDBC driver to at least ease integration with external tools.

I heard reference to something called a "bit-packed file" as a mechanism to achieve this data compression, but I do not know of any databases that supply it, and the last thing I want to do (or can do, really...) is write my own database.

Any suggestions or insight?

+3  A: 
cletus
+1  A: 

I would look at a column-oriented database. It would be great for this sort of application.

Javamann
Thanks, Javamann. I knew about these but did not know there were so many decent open-source ones. (I don't want to force users onto a commercial app.) So I looked at LucidDB, and it looks like just the ticket. The efficiency, compression, user-defined transforms, and foreign tables get me what I want.
Nicholas
A: 

Thanks for the answers.

Cletus, I appreciate the outline, but one of the trade-offs I cannot make is abandoning DB flexibility and compatibility with JDBC/Hibernate, which allow the use of all the available tools. Moreover, although I did not clearly state this, I do not want to force my users into adopting a [possibly expensive] commercial solution. If they have Database Brand X, let 'em use it. If they don't care, we recommend open-source Database Brand Y. Basically, the application has multiple faces: one is a repository for incoming data, but another is a reporting source, and I really don't want to get into the business of writing report generators.

While I have not really load-tested it yet, I am very impressed with LucidDB. It is a column-oriented database, and it provides good query performance and seemingly good data compression. It has a JDBC driver, though no Hibernate dialect exists for it yet, as far as I can tell. It also supports user-defined transformations which, in short, I think will allow me to seamlessly implement my idea of compressing repeating, consecutive values into one "row" but blowing them back out into multiple "synthetic" rows at query time, all done invisibly to the query caller. Lastly, it supports a nifty feature called foreign tables, whereby tables in other JDBC-supporting databases can be fronted in LucidDB. I think this may be invaluable in providing some level of support for other databases.
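
For what it's worth, LucidDB's user-defined transform (UDX) pattern is, as I understand the docs, a static Java method that reads an input cursor and writes rows through a PreparedStatement. A rough, untested sketch of the decompression transform, with assumed column positions and a fixed sampling interval:

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    // Rough sketch of a LucidDB user-defined transform (UDX) that
    // expands run-length-compressed rows into synthetic per-sample
    // rows. Input columns assumed: name, data_value, first_ts,
    // run_count. The fixed sampling interval is an assumption.
    public class DecompressUdx {
        private static final long INTERVAL_MILLIS = 1000L;

        public static void execute(ResultSet inputSet,
                                   PreparedStatement resultInserter)
                throws SQLException {
            while (inputSet.next()) {
                String name = inputSet.getString(1);
                long value = inputSet.getLong(2);
                Timestamp first = inputSet.getTimestamp(3);
                int count = inputSet.getInt(4);
                for (int i = 0; i < count; i++) {
                    resultInserter.setString(1, name);
                    resultInserter.setLong(2, value);
                    resultInserter.setTimestamp(3,
                            new Timestamp(first.getTime() + i * INTERVAL_MILLIS));
                    resultInserter.executeUpdate();
                }
            }
        }
    }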

Thanks for the pointer, Javamann. It zeroed me in on LucidDB.

Nicholas
A: 

Many JDBC-capable database management systems provide compression in the physical storage engine. Oracle, for example, has the notion of a "compressed" table with no decompression overhead:

http://www.ardentperf.com/wp-content/uploads/2007/07/advanced-compression-datasheet.pdf
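
For example, basic table compression is just DDL, so it can be issued through any JDBC connection; the table definition below is illustrative only:

    import java.sql.Connection;
    import java.sql.Statement;

    // Sketch: Oracle's basic table compression is plain DDL, so it can
    // be issued through JDBC like any other statement. The table
    // definition is illustrative.
    public class CompressedTableSetup {
        public static void create(Connection conn) throws Exception {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute(
                    "CREATE TABLE data_point (" +
                    "  name VARCHAR2(64) NOT NULL," +
                    "  data_value NUMBER NOT NULL," +
                    "  time_stamp TIMESTAMP NOT NULL" +
                    ") COMPRESS");
            }
        }
    }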

Apocalisp
+1  A: 

You will probably find it interesting to listen to Michael Stonebraker's presentation at Money:Tech. He hits on a number of the things you mention needing, and he illustrates how the big three elephants (SQL Server, Oracle, and DB2) will never be able to suit the needs of tick stores (which it looks like you are building). He digs beyond column stores, which I agree is the right direction. He even discusses compression and speed, which are both issues for you.

Here are some more links you may find interesting:

JD Long