I've really been struggling to make SQL Server into something that, quite frankly, it will never be. I need a database engine for my analytical work. The DB needs to be fast and does NOT need all the logging and other overhead found in typical databases (SQL Server, Oracle, DB2, etc.).

Yesterday I listened to Michael Stonebraker speak at the Money:Tech conference and I kept thinking, "I'm not really crazy. There IS a better way!" He talks about using column stores instead of row-oriented databases. I went to the Wikipedia page for column stores and I see a few open source projects (which I like) and a few commercial/open source projects (which I don't fully understand).

My question is this: In an applied analytical environment, how do the different column-based DBs differ? How should I be thinking about them? Anyone have practical experience with multiple column-based systems? Can I leverage my SQL experience with these DBs, or am I going to have to learn a new language?

I am ultimately going to be pulling data into R for analysis.

EDIT: I was asked to clarify what exactly I am trying to do. So, here's an example of what I would like to do: Create a table that has 4 million rows and 20 columns (5 dims, 15 facts). Create 5 aggregation tables that calculate max, min, and average for each of the facts. Join those 5 aggregations back to the starting table. Now calculate the percent deviation from mean, percent deviation from min, and percent deviation from max for each row and add them to the original table. This table does not get new rows each day; it gets TOTALLY replaced and the process is repeated. Heaven forbid if the process must be stopped. And the logs... ohhhhh the logs! :)
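For concreteness, here is a minimal sketch of one such pass expressed in R (where the data ends up anyway), with made-up column names (dim1, fact1), one dimension, one fact, and a tiny row count; it only illustrates the aggregate/join/deviation pattern, not my real schema:

    # Toy stand-in for the real table (4M rows, 5 dims, 15 facts)
    set.seed(1)
    dat <- data.frame(dim1  = sample(letters[1:10], 1e5, replace = TRUE),
                      fact1 = rnorm(1e5))

    # Aggregation table: max, min, and mean of the fact per dimension value
    agg <- aggregate(fact1 ~ dim1, data = dat,
                     FUN = function(x) c(max = max(x), min = min(x), mean = mean(x)))
    agg <- do.call(data.frame, agg)   # flatten into fact1.max, fact1.min, fact1.mean

    # Join the aggregates back, then compute percent deviations row by row
    dat <- merge(dat, agg, by = "dim1")
    dat$pct_dev_mean <- (dat$fact1 - dat$fact1.mean) / dat$fact1.mean
    dat$pct_dev_min  <- (dat$fact1 - dat$fact1.min)  / dat$fact1.min
    dat$pct_dev_max  <- (dat$fact1 - dat$fact1.max)  / dat$fact1.max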

A: 

It looks like an implementation change (2-D array in column-major order, instead of row-major order), rather than an interface change.

Think "strategy" pattern, rather than being an entire paradigm shift. Of course, I've never used these products, so they may in fact force a paradigm shift down your throat. I don't know why, though.

Roboprog
A: 

We might be better able to help you reach an informed decision if you described [1] your specific goal and [2] the issues you're running into with SQL Server.

Jason Weber
edit added.. thanks for reading!
JD Long
+1  A: 

Hi! I have some experience with Infobright Community Edition, a column-oriented DB based on MySQL.

Pro:

  • you can use MySQL interfaces/ODBC MySQL drivers, from R too (see the sketch at the end of this answer)
  • fast queries when selecting big chunks of data (because of the Knowledge Grid & data packs)
  • very fast native data loader and connectors for ETL tools (Talend, Kettle)
  • optimized for exactly the operations I (and I think most of us) use (selection by factor levels, joining, etc.)
  • special "lookup" option for optimized storage of R factor variables ;) (ok, char/varchar variables whose number of distinct levels is small relative to the number of rows)
  • FOSS
  • paid support option
  • ?

Cons:

  • no INSERT/UPDATE operations in the Community Edition (yet?); data loading only via the native data loader/ETL connectors
  • no official UTF-8 support (collation/sorting, etc.), planned for Q3 2009
  • no functions in aggregate queries yet (e.g. SELECT MONTH(date) FROM ...), planned for July(?) 2009; but because of the column storage, I prefer to simply create a date column for every aggregation level I need (week number, month, ...)
  • cannot be installed on an existing MySQL server as a storage engine (because it has its own optimizer, if I understood correctly), but you can install Infobright & MySQL on different ports if you need both
  • ?

In short: a good FOSS solution for daily analytical tasks and, I think, for your tasks as well.
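To make the "from R too" point concrete, here is a hedged sketch using the DBI/RMySQL packages (Infobright speaks the MySQL protocol); the host, port, credentials, and table/column names are placeholders you'd adjust for your own install:

    library(DBI)
    library(RMySQL)

    # Connection details are illustrative placeholders, not Infobright defaults I can vouch for
    con <- dbConnect(MySQL(), host = "localhost", port = 5029,
                     user = "analyst", password = "secret", dbname = "warehouse")

    # Pull an aggregate straight into an R data frame
    agg <- dbGetQuery(con,
      "SELECT dim1, MAX(fact1) AS max_f1, MIN(fact1) AS min_f1, AVG(fact1) AS avg_f1
         FROM facts GROUP BY dim1")

    dbDisconnect(con)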

zzr
+3  A: 

The short answer is that for analytic data, a column store will tend to be faster, with less tuning required.

A row store, the traditional database architecture, is good at inserting small numbers of rows, updating rows in place, and querying small numbers of rows. In a row store, these operations can be done with one or two disk block I/Os.

Analytic databases typically load thousands of records at a time; sometimes, as in your case, they reload everything. They tend to be denormalized, so have a lot of columns. And at query time, they often read a high proportion of the rows in the table, but only a few of these columns. So, it makes sense from an I/O standpoint to store values of the same column together.

Turns out that this gives the database a huge opportunity to do value compression. For instance, if a string column has an average length of 20 bytes but has only 25 distinct values, the database can compress to about 5 bits per value. Column store databases can often operate without decompressing the data.
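As a back-of-the-envelope illustration (my own sketch, not any particular engine's on-disk format), the trick is the same dictionary encoding R uses for factors: store each distinct string once and keep a small integer code per row, which can then be bit-packed:

    # A column with a million rows but only 25 distinct ~20-byte strings
    vals <- paste0("customer_segment_", sprintf("%02d", 1:25))
    x <- sample(vals, 1e6, replace = TRUE)

    # A factor is dictionary encoding: 25 strings stored once as levels,
    # plus one integer code per row pointing into that dictionary
    f <- factor(x)
    nlevels(f)                  # 25
    ceiling(log2(nlevels(f)))   # 5 -- bits per value if the codes are bit-packed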

Often in computer science there is an I/O versus CPU time tradeoff, but in column stores the I/O improvements also improve locality of reference, reduce cache paging activity, and allow greater compression factors, so the CPU gains as well.

Column store databases also tend to have other analytic-oriented features like bitmap indexes (yet another case where better organization allows better compression, reduces I/O, and allows algorithms that are more CPU-efficient), partitions, and materialized views.
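For the bitmap-index part, a rough sketch of the idea (again just an illustration, not any product's implementation): keep one bit vector per distinct value of a low-cardinality column, so predicates become cheap bitwise operations:

    # One logical (bit) vector per distinct value of a low-cardinality column
    region  <- sample(c("east", "west", "north", "south"), 1e5, replace = TRUE)
    bitmaps <- lapply(unique(region), function(v) region == v)
    names(bitmaps) <- unique(region)

    # A predicate like region IN ('east', 'west') is an OR of two bitmaps
    hits <- bitmaps[["east"]] | bitmaps[["west"]]
    sum(hits)   # number of matching rows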

The other factor is whether to use a massively parallel processing (MPP) database. There are MPP row-store and column-store databases. MPP databases can scale up to hundreds or thousands of nodes, and allow you to store humongous amounts of data, but sometimes have compromises like a weaker notion of transactions or a not-quite-SQL query language.

I'd recommend that you give LucidDB a try. (Disclaimer: I'm a committer to LucidDB.) It is an open-source column-store database, optimized for analytic applications, and it also has other features such as bitmap indexes. It currently only runs on one node, but it utilizes several cores effectively and can handle reasonable volumes of data without much effort.

What's the easiest to use ETL tool for LucidDB? Kettle?
JD Long
+2  A: 

4 million rows times 20 columns times 8 bytes for a double is 640 MB. Following the rule of thumb that R creates three temporary copies for every object, we get to around 2 GB. That is not a lot by today's standards.

So this should be doable in memory on a suitable 64-bit machine with a 'decent' amount of RAM (say 8 GB or more). Installing Ubuntu or Debian (possibly in the server version) can be done in a few minutes.
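Spelling out the arithmetic as a quick sanity check (same numbers as above, nothing new):

    rows <- 4e6; cols <- 20; bytes_per_double <- 8
    total <- rows * cols * bytes_per_double
    total / 1e6        # 640 (MB) for one copy of the table
    3 * total / 1e9    # ~1.9 (GB) with the three-temporary-copies rule of thumb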

Dirk Eddelbuettel
Damn you Dirk, you actually did the math! ;) I anticipate the data growing in size, but you may be right that just going to 64-bit will allow me to scale just fine.
JD Long