I am evaluating options for efficient data storage in Java. The data set is time-stamped data values with a named primary key, e.g.

Name: A|B|C:D
Value: 124
TimeStamp: 01/06/2009 08:24:39,223
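
For reference, here is a minimal sketch of that record shape as a JPA/Hibernate entity; the class and column names are my own illustrative choices, not a fixed schema:

    import java.util.Date;
    import javax.persistence.*;

    // Minimal sketch of the record shape described above. Entity and
    // column names are illustrative assumptions, not a fixed schema.
    @Entity
    @Table(name = "data_point")
    public class DataPoint {
        @Id
        @GeneratedValue
        private Long id;

        // Named key, e.g. "A|B|C:D"
        @Column(name = "name", nullable = false)
        private String name;

        @Column(name = "data_value", nullable = false)
        private long value;

        // Millisecond-precision timestamp, e.g. 01/06/2009 08:24:39,223
        @Temporal(TemporalType.TIMESTAMP)
        @Column(name = "time_stamp", nullable = false)
        private Date timeStamp;

        // Getters and setters omitted for brevity.
    }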

This could be a stock price at a given point in time, so it is, I suppose, a classic time-series data pattern. However, I really need a generic RDBMS solution that will work with any reasonable JDBC-compatible database, as I would like to use Hibernate. Consequently, time-series extensions to databases like Oracle are not really an option, since I would like implementors to be able to use their own JDBC/Hibernate-capable database.

The challenge here is simply the massive volume of data that can accumulate in a short period of time. So far, my implementations have focused on defining periodic rollup and purge schedules in which raw data is aggregated into DAY, WEEK, MONTH, etc. tables. The downside is the early loss of granularity and the slight inconvenience of period mismatches between the different aggregates.
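
To make the rollup idea concrete, here is a rough plain-JDBC sketch of one daily pass; the table and column names, and the CAST-based day truncation, are assumptions for illustration and would need adjusting per database:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;

    // One rollup pass: aggregate raw rows older than a cutoff into a
    // DAY table, then purge them, in a single transaction. Table and
    // column names are assumptions; the day-truncation expression
    // varies by database.
    public class DailyRollup {
        public void rollup(Connection conn, Timestamp cutoff) throws Exception {
            conn.setAutoCommit(false);
            try (PreparedStatement agg = conn.prepareStatement(
                    "INSERT INTO data_point_day " +
                    "  (name, day, min_value, max_value, avg_value, sample_count) " +
                    "SELECT name, CAST(time_stamp AS DATE), MIN(data_value), " +
                    "       MAX(data_value), AVG(data_value), COUNT(*) " +
                    "FROM data_point WHERE time_stamp < ? " +
                    "GROUP BY name, CAST(time_stamp AS DATE)");
                 PreparedStatement purge = conn.prepareStatement(
                    "DELETE FROM data_point WHERE time_stamp < ?")) {
                agg.setTimestamp(1, cutoff);
                agg.executeUpdate();
                purge.setTimestamp(1, cutoff);
                purge.executeUpdate();
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }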

The options here are limited, since there is an absolute limit to how much data can be physically compressed while retaining its original granularity, and that limit is tightened further by the directive to use a relational database, and a generic JDBC-capable one at that.

Borrowing a notional concept from classic data compression algorithms, and leveraging the fact that many consecutive values for the same named key can be expected to be identical, I am wondering if there is a way I can seamlessly reduce the number of stored records by conflating repeating values into one logical row while also storing a counter that indicates, effectively, "the next n records have the same value". The implementation of just that seems simple enough, but the trade-off is that the data model is now hideously complicated to query against using standard SQL, especially when using any sort of aggregate SQL functions. This significantly reduces the usefulness of the data store, since only complex custom code can restore the data to a "decompressed" state, resulting in an impedance mismatch with hundreds of tools that will not be able to render the data properly.
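
A minimal sketch of that run-length idea, assuming the input for a single key is already ordered by timestamp (the CompressedRun holder is hypothetical):

    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    // Hypothetical holder for one run: "starting at this timestamp, the
    // next 'count' samples for this key all carry the same value".
    class CompressedRun {
        final String name;
        final long value;
        final Date firstTimeStamp;
        int count;

        CompressedRun(String name, long value, Date firstTimeStamp) {
            this.name = name;
            this.value = value;
            this.firstTimeStamp = firstTimeStamp;
            this.count = 1;
        }
    }

    class RunLengthCompressor {
        // Collapses consecutive identical values for a single key into
        // runs. Assumes values/stamps are parallel lists already ordered
        // by timestamp.
        static List<CompressedRun> compress(String name, List<Long> values,
                                            List<Date> stamps) {
            List<CompressedRun> runs = new ArrayList<CompressedRun>();
            CompressedRun current = null;
            for (int i = 0; i < values.size(); i++) {
                if (current != null && current.value == values.get(i)) {
                    current.count++;   // same value: extend the current run
                } else {
                    current = new CompressedRun(name, values.get(i), stamps.get(i));
                    runs.add(current); // value changed: start a new run
                }
            }
            return runs;
        }
    }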

I considered the possibility of defining custom Hibernate types that would basically "understand" the compressed data set, blow it back up, and return query results with dynamically created synthetic rows. (The database will be read-only to all clients except the tightly controlled input stream.) Several of the tools I had in mind will integrate with Hibernate/POJOs in addition to raw JDBC (e.g. JasperReports), but this does not really address the aggregate-functions issue and probably has a bunch of other issues as well.
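
The "blow it back up" step might look roughly like the following. Note that this sketch assumes a fixed sampling interval so per-row timestamps can be reconstructed, which is an assumption the real data may not satisfy:

    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    // Sketch of the decompression side: one stored run becomes 'count'
    // synthetic rows at read time. Row is a plain holder standing in
    // for whatever entity the custom Hibernate type would return.
    class RunExpander {
        static class Row {
            final String name;
            final long value;
            final Date timeStamp;

            Row(String name, long value, Date timeStamp) {
                this.name = name;
                this.value = value;
                this.timeStamp = timeStamp;
            }
        }

        // Assumes a fixed sampling interval so per-row timestamps can
        // be reconstructed; irregular sampling would need stored offsets.
        static List<Row> expand(String name, long value, Date firstStamp,
                                int count, long intervalMillis) {
            List<Row> rows = new ArrayList<Row>();
            for (int i = 0; i < count; i++) {
                rows.add(new Row(name, value,
                        new Date(firstStamp.getTime() + i * intervalMillis)));
            }
            return rows;
        }
    }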

So I am partway to resigning myself to possibly having to use a more proprietary [possibly non-SQL] data store (any suggestions appreciated) and then focusing on the possibly less complex task of writing a pseudo-JDBC driver to at least ease integration with external tools.

I heard reference to something called a "bit-packed file" as a mechanism to achieve this data compression, but I do not know of any databases that supply it, and the last thing I want to do (or can do, really...) is write my own database.

Any suggestions or insight?

+3  A: 
cletus
+1  A: 

I would look at a column-oriented database. It would be great for this sort of application.

Javamann
Thanks, Javamann. I knew about these but did not know there were so many decent open-source ones. (I don't want to force users onto a commercial app.) So I looked at LucidDB, and it looks like just the ticket. The efficiency, compression, user-defined transforms, and foreign tables get me what I want.
Nicholas
A: 

Thanks for the answers.

Cletus, I appreciate the outline, but one of the trade-offs I cannot make is abandoning DB flexibility and compatibility with JDBC/Hibernate, which allow the use of all the available tools. Moreover, although I did not clearly state this, I do not want to force my users into adopting a [possibly expensive] commercial solution. If they have Database Brand X, let 'em use it. If they don't care, we recommend open-source Database Brand Y. Basically, the application has multiple faces: one is a repository for incoming data, but another is a reporting source, and I really don't want to get into the business of writing report generators.

While I have not really load-tested it yet, I am very impressed with LucidDB. It is a column-oriented database, and it provides good query performance and seemingly good data compression. It has a JDBC driver, though no Hibernate dialect exists for it yet, as far as I can tell. It also supports user-defined transformations which, in short, I think will allow me to seamlessly implement my idea of compressing repeating, consecutive values into one "row" but blowing them back out into multiple "synthetic" rows at query time, all done invisibly to the query caller. Lastly, it supports a nifty feature called foreign tables, whereby tables in other JDBC-supporting databases can be fronted in LucidDB. I think this may be invaluable in providing some level of support for other databases.
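
For what it's worth, LucidDB's user-defined transform (UDX) pattern is, as I understand the docs, a static Java method that reads an input cursor and writes rows through a PreparedStatement. A rough, untested sketch of the decompression transform, with assumed column positions and a fixed sampling interval:

    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    // Rough sketch of a LucidDB user-defined transform (UDX) that
    // expands run-length-compressed rows into synthetic per-sample
    // rows. Input columns assumed: name, data_value, first_ts,
    // run_count. The fixed sampling interval is an assumption.
    public class DecompressUdx {
        private static final long INTERVAL_MILLIS = 1000L;

        public static void execute(ResultSet inputSet,
                                   PreparedStatement resultInserter)
                throws SQLException {
            while (inputSet.next()) {
                String name = inputSet.getString(1);
                long value = inputSet.getLong(2);
                Timestamp first = inputSet.getTimestamp(3);
                int count = inputSet.getInt(4);
                for (int i = 0; i < count; i++) {
                    resultInserter.setString(1, name);
                    resultInserter.setLong(2, value);
                    resultInserter.setTimestamp(3,
                            new Timestamp(first.getTime() + i * INTERVAL_MILLIS));
                    resultInserter.executeUpdate();
                }
            }
        }
    }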

Thanks for the pointer, Javamann. It zeroed me in on LucidDB.

Nicholas
A: 

Many JDBC-capable database management systems provide compression in the physical storage engine. Oracle, for example, has the notion of a "compressed" table with no decompression overhead:

http://www.ardentperf.com/wp-content/uploads/2007/07/advanced-compression-datasheet.pdf
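
For example, basic table compression is just DDL, so it can be issued through any JDBC connection; the table definition below is illustrative only:

    import java.sql.Connection;
    import java.sql.Statement;

    // Sketch: Oracle's basic table compression is plain DDL, so it can
    // be issued through JDBC like any other statement. The table
    // definition is illustrative.
    public class CompressedTableSetup {
        public static void create(Connection conn) throws Exception {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute(
                    "CREATE TABLE data_point (" +
                    "  name VARCHAR2(64) NOT NULL," +
                    "  data_value NUMBER NOT NULL," +
                    "  time_stamp TIMESTAMP NOT NULL" +
                    ") COMPRESS");
            }
        }
    }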

Apocalisp
+1  A: 

You will probably find it interesting to listen to Michael Stonebraker's presentation at Money:Tech. He hits on a number of the things you mention needing, and he illustrates how the big three elephants (SQL Server, Oracle, and DB2) will never be able to suit the needs of tick stores (which it looks like you are building). He digs beyond column stores, which I agree is the right direction. He even discusses compression and speed, which are both issues for you.

Here are some more links you may find interesting:

JD Long