views: 313
answers: 3

Hi,

We have an architecture where we provide each customer (an internet merchant) Business Intelligence-like services for their website. Now I need to analyze that data internally (for algorithmic improvement, performance tracking, etc.), and it is potentially quite heavy: we have up to millions of rows per customer per day, and I may want to know how many queries we had in the last month, compare week over week, and so on. That is on the order of billions of entries, if not more.

The way it is currently done is quite standard: daily scripts that scan the databases and generate big CSV files. I don't like this solution for several reasons:

  • as is typical of those kinds of scripts, they fall into the write-once-and-never-touched-again category
  • tracking things in "real time" is necessary (we have a separate toolset to query the last few hours at the moment)
  • it is slow and not very "agile"

Although I have some experience dealing with huge datasets for scientific use, I am a complete beginner as far as traditional RDBMSs go. It seems that using a column-oriented database for the analytics could be a solution (the analytics don't need most of the data we have in the app database), but I would like to know what other options are available for this kind of issue.
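
To make the column-store intuition concrete, here is a toy sketch in Python (the customer/day/queries fields and the numbers are made up, and this is in no way a real column store): an aggregate such as "total queries last month" only has to scan one column instead of every full row.

```python
# Toy illustration only: the same three "rows" held row-wise vs. column-wise.
rows = [
    {"customer": "a", "day": "2010-06-01", "queries": 120},
    {"customer": "b", "day": "2010-06-01", "queries": 300},
    {"customer": "a", "day": "2010-06-02", "queries": 150},
]

# Column-oriented layout: each attribute stored contiguously.
columns = {
    "customer": ["a", "b", "a"],
    "day": ["2010-06-01", "2010-06-01", "2010-06-02"],
    "queries": [120, 300, 150],
}

# A "how many queries" style aggregate touches only the queries column;
# a row store has to read every full row to get at the same numbers.
total_row_wise = sum(r["queries"] for r in rows)
total_col_wise = sum(columns["queries"])
assert total_row_wise == total_col_wise == 570
```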

+2  A: 

You will want to google Star Schema. The basic idea is to model a separate data warehouse / OLAP instance of your existing OLTP system in a way that is optimized to provide the type of aggregations you describe. This instance is composed of facts and dimensions.

In the example below, sales 'facts' are modeled to provide analytics based on customer, store, product, time and other 'dimensions'.

[Diagram: example star schema — a central sales fact table linked to customer, store, product, and time dimension tables]

You will find Microsoft's Adventure Works sample databases instructive, in that they provide both the OLTP and OLAP schemas along with representative data.
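
As a rough sketch of the pattern (hypothetical table and column names, and SQLite purely so the example is self-contained; a real warehouse would sit on an analytic database), the fact table stays narrow and numeric, and the questions from the post become joins plus GROUP BY:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables: the descriptive attributes you slice and group by.
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day TEXT, week INTEGER);
    -- Fact table: one narrow row per measurement, keyed by the dimensions.
    CREATE TABLE fact_query (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        date_id     INTEGER REFERENCES dim_date(date_id),
        query_count INTEGER
    );
""")

con.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                [(1, "shop-a"), (2, "shop-b")])
con.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, "2010-06-01", 22), (2, "2010-06-08", 23)])
con.executemany("INSERT INTO fact_query VALUES (?, ?, ?)",
                [(1, 1, 120), (1, 2, 150), (2, 1, 300)])

# "Weekly queries per customer" becomes a join plus GROUP BY over the facts.
for row in con.execute("""
        SELECT c.name, d.week, SUM(f.query_count)
        FROM fact_query f
        JOIN dim_customer c ON c.customer_id = f.customer_id
        JOIN dim_date d     ON d.date_id     = f.date_id
        GROUP BY c.name, d.week
        ORDER BY c.name, d.week"""):
    print(row)
```

Rows in the fact table would typically be loaded in bulk from the OLTP system on a schedule, or streamed more frequently if near-real-time numbers are needed.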

Ryan Cox
Thanks. Would you have any suggestion for a good introduction (a book?) to this kind of stuff? I am more interested in principles than in vendor-specific solutions.
David Cournapeau
The MS example is complete and well documented, and thus useful even if you are deploying on a different architecture. In terms of a good book, I have enjoyed Joe Celko's SQL books. It looks like he has one with an OLAP slant that contains relevant details, though the reviews on Amazon are a bit mixed: http://www.amazon.com/Celkos-Analytics-Kaufmann-Management-Systems/dp/0123695120/
Ryan Cox
Also, as an aside, I was curious what schema the web analytics project Piwik uses, thinking it might be another useful example. I was surprised to see that they are not doing any kind of rollup / fact / dimension schema; see: http://dev.piwik.org/trac/browser/trunk/misc/db-schema.png
Ryan Cox
+1  A: 

Hi Ryan

The canonical handbook on star-schema-style data warehouses is Ralph Kimball's "The Data Warehouse Toolkit". (There is also "Clickstream Data Warehousing" in the same series, but that is from 2002, I think, and somewhat dated; a newer edition of the Kimball book would probably serve you better.) If you google for "web analytics data warehouse" there are a bunch of sample schemas available to download and study.

On the other hand, a lot of the NoSQL work that happens in real life is based around mining clickstream data, so it might be worth seeing what the Hadoop/Cassandra/[latest-cool-thing] community has in the way of case studies, to see whether your use case matches well with what they can do.
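
For a flavour of what that looks like, here is a toy single-machine sketch in Python with made-up log fields; Hadoop and friends essentially spread the same map/reduce shape over many machines. It groups raw query events by (customer, day) and counts them.

```python
from collections import Counter

raw_events = [
    # (timestamp, customer, url) -- stand-ins for real clickstream records
    ("2010-06-01T09:12:01", "shop-a", "/search?q=shoes"),
    ("2010-06-01T09:12:05", "shop-a", "/search?q=boots"),
    ("2010-06-01T10:00:00", "shop-b", "/search?q=hats"),
    ("2010-06-02T08:30:00", "shop-a", "/search?q=socks"),
]

def map_phase(events):
    """Emit one ((customer, day), 1) pair per raw event."""
    for ts, customer, _url in events:
        yield (customer, ts[:10]), 1

def reduce_phase(pairs):
    """Sum the counts for each (customer, day) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals

daily_counts = reduce_phase(map_phase(raw_events))
print(daily_counts)  # Counter({('shop-a', '2010-06-01'): 2, ...})
```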

Jamie
+1  A: 

There are specialized databases for analytics such as Greenplum, Aster Data, Vertica, Netezza, Infobright and others. You can read about these databases on this site: http://www.dbms2.com/

TTT