views: 364
answers: 6
I am building out some reporting for our website (a decent-sized site that gets several million pageviews a day), and am wondering if there are any good free/open-source data warehousing systems out there.

Specifically, I am looking only for something to store the data--I plan to build a custom front end/UI on top of it to show the information we care about. However, I don't want to have to build a custom database for this, and while I'm pretty sure a plain SQL database would not work here, I'm not sure what to use exactly. Any pointers to helpful articles would also be appreciated.

Edit: I should mention--one DB I have looked at briefly is MongoDB. It seems like it might work, but their "Use Cases" page specifically lists data warehousing as "Less Well Suited": http://www.mongodb.org/display/DOCS/Use+Cases . In general, it doesn't seem to be targeted at data warehousing.

+3  A: 

I may not understand the problem correctly -- however, if you find some time to (re)visit Kimball's "The Data Warehouse Toolkit", you will find that all it takes for a basic DW is a plain-vanilla SQL database; in other words, you could build a decent DW with MySQL using MyISAM as the storage engine. The question is only the desired granularity of information -- what you want to keep and for how long.

If your reports are mostly periodic and you implement a report store or cache, then you don't need to store pre-calculated aggregations (no need for cubes). In other words, a Kimball star schema with cached reporting can provide decent performance in many cases; a rough sketch of such a schema follows.

You could also look at the community edition of the "Pentaho BI Suite" (open source) to get a quick start with ETL, analytics and reporting, and experiment a bit to evaluate performance before diving into custom development. Although this may not be what you were expecting, it may be worth considering.
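To make that concrete, here is a minimal sketch of a star schema for pageview reporting -- all table and column names here are hypothetical illustrations, not anything from Kimball's book:

    -- Hypothetical star schema: one narrow fact table plus small dimensions.
    -- MyISAM gives up transactions in exchange for cheap bulk loads and reads.
    CREATE TABLE dim_date (
        date_key    INT PRIMARY KEY,     -- e.g. 20091108
        full_date   DATE NOT NULL,
        month       TINYINT NOT NULL,
        year        SMALLINT NOT NULL
    ) ENGINE=MyISAM;

    CREATE TABLE dim_page (
        page_key    INT AUTO_INCREMENT PRIMARY KEY,
        url         VARCHAR(255) NOT NULL,
        section     VARCHAR(64)
    ) ENGINE=MyISAM;

    CREATE TABLE fact_pageview (
        date_key    INT NOT NULL,
        page_key    INT NOT NULL,
        view_count  INT NOT NULL,        -- rolled up to one row per page per day
        KEY (date_key),
        KEY (page_key)
    ) ENGINE=MyISAM;

    -- A typical periodic report, whose result you would cache:
    SELECT d.year, d.month, p.section, SUM(f.view_count) AS views
    FROM fact_pageview f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_page p ON p.page_key = f.page_key
    GROUP BY d.year, d.month, p.section;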

Damir Sudarevic
+5  A: 

http://www.hypertable.org/ might be what you are looking for -- going by your description above, something to store large amounts of logged data without much normalization, i.e. a visitor log.

Hypertable is based on Google's BigTable design. See http://code.google.com/p/hypertable/wiki/PerformanceTestAOLQueryLog for benchmarks.

You lose the relational capabilities of SQL-based DBs, but you gain a lot in performance. You could easily use Hypertable to store millions of rows per hour (hard drive space notwithstanding). A rough sketch of such a flat log table is below.
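Purely for flavor, here is roughly what such a denormalized visitor log looks like when written as a single flat table in SQL DDL (all names hypothetical) -- in Hypertable itself you would model this as a row key plus column families rather than a relational table:

    -- Hypothetical flat visitor log: one wide row per hit, no joins.
    -- In a BigTable-style store the natural row key might be
    -- "<url>#<timestamp>" so hits cluster by page and time.
    CREATE TABLE visitor_log (
        hit_time    TIMESTAMP,
        url         VARCHAR(255),
        referrer    VARCHAR(255),
        user_agent  VARCHAR(255),
        ip_address  VARCHAR(45),
        session_id  VARCHAR(64)
    );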

Hope that helps.

Mike Valstar
Thanks--this is the kind of thing I'm looking for.
Sam Lee
Looks interesting, but it raises a bit of a warning flag that they show stats on AOL data instead of standardized tests from www.tpc.org (which has datasets very similar to AOL's).
alecco
+2  A: 

In addition to Mike's answer of hypertable, you may want to take a look at Apache's Hadoop project:

http://hadoop.apache.org/

They provide a number of tools which may be useful for your application, including HBase, another implementation of the BigTable concept. I'd imagine that for reporting you might find their MapReduce implementation useful as well.

Paul McMillan
More specifically HBase (http://hadoop.apache.org/hbase/), which runs on the Hadoop core.
Mike Valstar
A: 

Hi, I have a similar problem and am thinking of using plain MyISAM with http://www.jitterbit.com/ as the data access layer. Jitterbit (or another similar free tool) seems very nice for this sort of transformation.

Hope this helps a bit,

Ivan

Ivanhoe
+1  A: 

Pentaho Mondrian

  • Open source
  • Uses a standard relational database
  • MDX queries (think pivot tables -- see the SQL sketch below)

I use this.
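As a flavor of what "pivot table" means here: an MDX rollup in Mondrian is ultimately answered by SQL aggregates against the underlying star schema. A hedged sketch, reusing the hypothetical star schema tables from the first answer, of the crosstab such a pivot reduces to:

    -- Views by section, pivoted across months (illustrative only;
    -- Mondrian generates this kind of SQL for you from MDX).
    SELECT p.section,
           SUM(CASE WHEN d.month = 1 THEN f.view_count ELSE 0 END) AS jan,
           SUM(CASE WHEN d.month = 2 THEN f.view_count ELSE 0 END) AS feb,
           SUM(CASE WHEN d.month = 3 THEN f.view_count ELSE 0 END) AS mar
    FROM fact_pageview f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_page p ON p.page_key = f.page_key
    WHERE d.year = 2009
    GROUP BY p.section;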

Ramon
+2  A: 

It all depends on the data and how you plan to access it. MonetDB is a column-oriented database engine from one of the most innovative teams in database technology; they just received VLDB's 10-year best paper award. The DB is open source and there are plenty of reviews online praising it.
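The win of column orientation is easy to see in query shape: analytic scans touch a few columns of a wide table, and a column store reads only those columns from disk. A hedged example with a hypothetical table:

    -- Imagine a clickstream table with 50+ columns per row.
    -- A column store reads only hit_time and country for this query;
    -- a row store would drag every column off disk.
    SELECT country, COUNT(*) AS hits
    FROM clickstream
    WHERE hit_time >= '2009-01-01'
    GROUP BY country;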

Perhaps you should have a look at TPC, see which of their benchmark datasets best matches your case, and work from there.

Also consider the need for concurrency: it adds a big overhead to any kind of approach and sometimes is not really required. For example, you can pre-digest some summary or index data and have only that protected for high concurrency (a sketch follows). Profiling your data queries is the next step.
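A minimal sketch of that pre-digesting idea, with hypothetical names and MySQL-flavored syntax -- a periodic job rolls raw hits up into a small summary table, and concurrent report traffic reads only the summary:

    -- Small summary table; the only thing report queries need to touch.
    CREATE TABLE daily_page_views (
        day    DATE NOT NULL,
        url    VARCHAR(255) NOT NULL,
        views  INT NOT NULL,
        PRIMARY KEY (day, url)
    );

    -- Run nightly (or hourly) to fold yesterday's raw hits into the summary.
    INSERT INTO daily_page_views (day, url, views)
    SELECT DATE(hit_time), url, COUNT(*)
    FROM raw_hits
    WHERE hit_time >= CURRENT_DATE - INTERVAL 1 DAY
      AND hit_time <  CURRENT_DATE
    GROUP BY DATE(hit_time), url;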

About SQL: I don't like it either, but I don't think it's smart to rule out an engine just because of its front-end language.

alecco