I am evaluating options for efficient data storage in Java. The data set is time-stamped data values with a named primary key, e.g.:
Name: A|B|C:D
Value: 124
TimeStamp: 01/06/2009 08:24:39,223
Could be a stock price at a given point in time, so it is, I suppose, a classic time-series data pattern. However, I really need a generic RDBMS solution that will work with any reasonable JDBC-compatible database, as I would like to use Hibernate. Consequently, time-series extensions to databases like Oracle are not really an option, as I would like the implementor to be able to use their own JDBC/Hibernate-capable database.
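For reference, the mapped entity is essentially the following (simplified, and the class/column names are just illustrative):

```java
import java.math.BigDecimal;
import java.util.Date;
import javax.persistence.*;

// One raw sample: a named key, a value, and a millisecond-precision timestamp.
@Entity
@Table(name = "RAW_SAMPLE",
       uniqueConstraints = @UniqueConstraint(columnNames = {"NAME", "SAMPLE_TIME"}))
public class RawSample {

    @Id
    @GeneratedValue
    private Long id;

    // e.g. "A|B|C:D"
    @Column(name = "NAME", nullable = false, length = 64)
    private String name;

    @Column(name = "SAMPLE_VALUE", nullable = false)
    private BigDecimal value;

    // e.g. 01/06/2009 08:24:39,223
    @Temporal(TemporalType.TIMESTAMP)
    @Column(name = "SAMPLE_TIME", nullable = false)
    private Date sampleTime;

    // getters/setters omitted
}
```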
The challenge here is simply the massive volume of data that can accumulate in a short period of time. So far, my implementations have focused on defining periodic rollup and purge schedules where raw data is aggregated into DAY, WEEK, MONTH, etc. tables, but the downsides are the early loss of granularity and the slight inconvenience of mismatches between the periods stored in the different aggregates.
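To make the rollup/purge idea concrete, the scheduled job boils down to something like this (plain JDBC, placeholder table and column names; the date-truncation expression varies by database):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class DailyRollupJob {

    // Aggregate raw samples older than the cutoff into a DAY table, then purge them.
    public void rollup(Connection con, Timestamp cutoff) throws SQLException {
        String aggregate =
            "INSERT INTO DAY_SAMPLE (NAME, SAMPLE_DATE, MIN_VALUE, MAX_VALUE, AVG_VALUE, SAMPLE_COUNT) " +
            "SELECT NAME, CAST(SAMPLE_TIME AS DATE), MIN(SAMPLE_VALUE), MAX(SAMPLE_VALUE), " +
            "       AVG(SAMPLE_VALUE), COUNT(*) " +
            "FROM RAW_SAMPLE WHERE SAMPLE_TIME < ? " +
            "GROUP BY NAME, CAST(SAMPLE_TIME AS DATE)";
        String purge = "DELETE FROM RAW_SAMPLE WHERE SAMPLE_TIME < ?";

        con.setAutoCommit(false);
        try (PreparedStatement agg = con.prepareStatement(aggregate);
             PreparedStatement del = con.prepareStatement(purge)) {
            agg.setTimestamp(1, cutoff);
            agg.executeUpdate();
            del.setTimestamp(1, cutoff);
            del.executeUpdate();
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}
```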
The options here are limited, since there is an absolute limit to how much data can be physically compressed while retaining the original granularity, and that limit is made tighter by the requirement to use a relational database, and a generic JDBC-capable one at that.
Borrowing a notional concept from classic data compression algorithms, and leveraging the fact that many consecutive values for the same named key can be expected to be identical, I am wondering if there is a way I can seamlessly reduce the number of stored records by conflating repeating values into one logical row, while also storing a counter that effectively says "the next n records have the same value". Implementing just that seems simple enough, but the trade-off is that the data model becomes hideously complicated to query against with standard SQL, especially when using any sort of aggregate SQL function. That significantly reduces the usefulness of the data store, since only complex custom code can restore the data to its "decompressed" state, resulting in an impedance mismatch with hundreds of tools that will not be able to render the data properly.
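A minimal sketch of the compressed row I have in mind (hypothetical schema and names), including the kind of manual weighting that every aggregate query would then need:

```java
import java.math.BigDecimal;
import java.util.Date;
import javax.persistence.*;

// One run-length-encoded row: "the next repeatCount samples all have this value".
//
// The catch: a plain AVG(SAMPLE_VALUE) over this table is now wrong; every aggregate
// has to be weighted by hand, e.g.
//   SELECT NAME, SUM(SAMPLE_VALUE * REPEAT_COUNT) / SUM(REPEAT_COUNT) AS TRUE_AVG
//   FROM RLE_SAMPLE GROUP BY NAME
@Entity
@Table(name = "RLE_SAMPLE")
public class RleSample {

    @Id
    @GeneratedValue
    private Long id;

    @Column(name = "NAME", nullable = false)
    private String name;

    // the value shared by every sample in the run
    @Column(name = "SAMPLE_VALUE", nullable = false)
    private BigDecimal value;

    // timestamp of the first sample in the run
    @Temporal(TemporalType.TIMESTAMP)
    @Column(name = "RUN_START", nullable = false)
    private Date runStart;

    // number of consecutive identical samples this row stands for
    @Column(name = "REPEAT_COUNT", nullable = false)
    private int repeatCount;

    // getters/setters omitted
}
```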
I have considered defining custom Hibernate types that would basically "understand" the compressed data set, blow it back up, and return query results with dynamically created synthetic rows. (The database will be read-only to all clients except the tightly controlled input stream.) Several of the tools I have in mind (e.g. JasperReports) will integrate with Hibernate/POJOs in addition to raw JDBC, but this does not really address the aggregate-functions issue and probably has a bunch of other problems as well.
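The "blow it back up" step would be roughly this kind of expansion (purely illustrative, and it assumes a fixed sampling interval, which the real feed may not have):

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class RunLengthExpander {

    // Expand one run-length-encoded row back into the synthetic raw rows it stands for,
    // assuming samples arrive at a fixed interval of intervalMillis.
    public List<RawSample> expand(RleSample run, long intervalMillis) {
        List<RawSample> rows = new ArrayList<>();
        long start = run.getRunStart().getTime();
        for (int i = 0; i < run.getRepeatCount(); i++) {
            RawSample row = new RawSample();
            row.setName(run.getName());
            row.setValue(run.getValue());
            row.setSampleTime(new Date(start + i * intervalMillis));
            rows.add(row);
        }
        return rows;
    }
}
```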
So I am partway to resigning myself to possibly having to use a more proprietary [possibly non-SQL] data store (any suggestions appreciated) and then focusing on the possibly less complex task of writing a pseudo-JDBC driver to at least ease integration with external tools.
I have heard reference to something called a "bit packed file" as a mechanism for achieving this kind of compression, but I do not know of any databases that supply it, and the last thing I want to do (or can do, really...) is write my own database.
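For what it is worth, my rough understanding of bit packing is something along these lines (a hand-rolled sketch, not tied to any actual database feature):

```java
import java.util.BitSet;

public class BitPacker {

    // Store each value in bitsPerValue bits rather than a full 32/64-bit column.
    // This only works if every value fits in that many bits; real schemes typically
    // pack deltas against the previous sample to keep the bit width small.
    public static BitSet pack(long[] values, int bitsPerValue) {
        BitSet bits = new BitSet(values.length * bitsPerValue);
        int pos = 0;
        for (long v : values) {
            for (int b = bitsPerValue - 1; b >= 0; b--) {
                if (((v >> b) & 1L) == 1L) {
                    bits.set(pos);
                }
                pos++;
            }
        }
        return bits;
    }
}
```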
Any suggestions or insight?