What type of architecture, design and software would one need to provide something similar to the excellent custom report functionality of Google Analytics? To be more specific, we want the user to be able to pick dimensions and metrics from a list and generate a report.

  • Do we need a data warehouse?
  • Do we need OLAP?
  • Would the data access layer require an ORM, dynamic SQL or stored procedures?
  • Are there any third-party or open-source products that can get us part of the way there?

Is there anybody else (company, developer) out there who has accomplished this functionality at the level of Google? Any examples?

Note

I'm not interested in building a Google Analytics competitor. I'm looking to apply the same ease of reporting to our own unique datasets.

Thanks

A: 

You might want to check out http://haveamint.com/. It's not free, but as a product it lets you host your own analytics.

Piwik (http://piwik.org/) is a great open source implementation.

As for building a Google Analytics competitor, the analytics will not be the difficult part. If your service catches on, the biggest difficulty will be scaling the database.

sestocker
+1  A: 

You definitely need a data warehouse with lots of ETL jobs, plus aggregation and pre-aggregation processes running at off-peak hours. OLAP cubes don't really scale for high-volume web analytics.
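To make the "pick dimensions and metrics, get a report" idea concrete, here is a minimal sketch of a report query built against a pre-aggregated table. Everything in it is an assumption for illustration: the `daily_agg` table, its columns, the `DIMENSIONS` and `METRICS` whitelists, and the use of SQLite in place of a real warehouse. The one point it is meant to show is that user choices are mapped through fixed whitelists, so no user-supplied text ever reaches the SQL string.

```python
import sqlite3

# Hypothetical whitelists: the only names a user may choose from.
DIMENSIONS = {"day": "day", "country": "country", "page": "page"}
METRICS = {"visits": "SUM(visits)", "pageviews": "SUM(pageviews)"}


def build_report_sql(dimensions, metrics, table="daily_agg"):
    """Build a GROUP BY query from whitelisted dimension and metric names."""
    if not metrics:
        raise ValueError("at least one metric is required")
    # Unknown names raise KeyError instead of being pasted into the query.
    dim_cols = [DIMENSIONS[d] for d in dimensions]
    metric_cols = [f"{METRICS[m]} AS {m}" for m in metrics]
    sql = f"SELECT {', '.join(dim_cols + metric_cols)} FROM {table}"
    if dim_cols:
        cols = ", ".join(dim_cols)
        sql += f" GROUP BY {cols} ORDER BY {cols}"
    return sql


def run_report(conn, dimensions, metrics):
    """Run the generated query and return all rows."""
    return conn.execute(build_report_sql(dimensions, metrics)).fetchall()


def demo():
    """Load a tiny made-up aggregate table and report visits by country."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_agg "
        "(day TEXT, country TEXT, page TEXT, visits INTEGER, pageviews INTEGER)"
    )
    conn.executemany(
        "INSERT INTO daily_agg VALUES (?, ?, ?, ?, ?)",
        [
            ("2009-01-01", "US", "/home", 10, 25),
            ("2009-01-01", "DE", "/home", 4, 9),
            ("2009-01-02", "US", "/about", 3, 5),
        ],
    )
    return run_report(conn, ["country"], ["visits", "pageviews"])
```

Calling `demo()` groups the sample rows by country. In a real system the aggregate table would be filled by the off-peak ETL jobs rather than by inline inserts, and the same whitelist idea could sit in front of stored procedures instead of generated SQL.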

For data collection, you'll probably also want to use MSMQ or similar, along with hardware load balancing. Disk I/O is a typical bottleneck, so working in memory and doing some pre-aggregation certainly helps. At my previous job at Microsoft, some of our legacy data collection systems logged directly to log files instead of a database. We used Log Parser and ran ETL and aggregation pretty much around the clock.
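As a rough illustration of in-memory pre-aggregation, the sketch below counts hits per hour and page in memory and only hands summarized rows to storage on flush. The `HitBuffer` class, its method names, and the ISO-style timestamp format are all hypothetical; this is not how the systems described above were built, just one simple way to cut the number of disk writes.

```python
from collections import Counter


class HitBuffer:
    """Accumulate hit counts in memory, keyed by (hour, page)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, timestamp, page):
        # Assumes timestamps like '2009-01-01T13:45:10';
        # the first 13 characters identify the hour bucket.
        hour = timestamp[:13]
        self.counts[(hour, page)] += 1

    def flush(self):
        """Return aggregated (hour, page, hits) rows and reset the buffer."""
        rows = sorted((hour, page, n) for (hour, page), n in self.counts.items())
        self.counts.clear()
        return rows
```

A collector would call `record` for each incoming hit and call `flush` on a timer, writing the returned rows to the warehouse in one batch instead of one write per hit.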

Data collection is at the heart of it, and you're going to have to build a state-of-the-art data warehouse if you intend to scale. We relied almost exclusively on stored procedures. We had thousands of them, some dauntingly complex and heavily optimized. Besides raw performance, scalability is a big concern.

aleemb