Does anybody know how the data in Google Analytics is organized? They run complex selections over large amounts of data very, very fast. What database structure do they use?

+1  A: 

Many applications in the Google portfolio use the MapReduce programming model to process large quantities of data.

See the Google Research Publications on MapReduce for further information and also have a look at page 4 and page 5 of this Baseline article.
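To give a rough idea of the MapReduce model, here is a minimal single-process sketch that counts pageviews per page. The hit records, `map_phase`, `shuffle`, `reduce_phase` and `pageviews` are all made-up names for illustration. This is not Google Analytics' actual schema or code, and a real MapReduce run distributes these phases over many machines.

```python
from collections import defaultdict

# Hypothetical hit records: (page, visitor_id). Illustrative data only.
HITS = [
    ("/home", "a"),
    ("/home", "b"),
    ("/pricing", "a"),
    ("/home", "a"),
]

def map_phase(hit):
    """Emit one (key, value) pair per hit: the page and a count of 1."""
    page, _visitor = hit
    yield page, 1

def shuffle(pairs):
    """Group all values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def pageviews(hits):
    """Run map, shuffle and reduce in sequence on an in-memory list of hits."""
    pairs = (pair for hit in hits for pair in map_phase(hit))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(pageviews(HITS))
```

Because each map call looks at one hit and each reduce call looks at one key, both phases can be split across machines, which is where the speed on large data sets comes from.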

Kosi2801
+2  A: 

I'd assume they use their Bigtable: http://en.wikipedia.org/wiki/BigTable

Jens Schauder
+3  A: 

AFAIK Google Analytics is derived from Urchin. As has been said, now that Analytics is part of the Google family it may well be using MapReduce/BigTable. I assume Google has integrated the old Urchin DB format with the new BigTable/MapReduce infrastructure.

I found these links, which talk about the Urchin DB. Some of these things are probably still in use.

http://www.advanced-web-metrics.com/blog/2007/10/16/what-is-urchin/

This says:

[snip] ...still use a proprietary database to store reporting data, which makes ad-hoc queries a bit more limited, since you have to use Urchin-developed tools rather than the more flexible SQL tools.

http://www.urchinexperts.com/software/faq/#ques45

What type of database does Urchin use?

Urchin uses a proprietary flat file database for report data storage. The high-performance database architecture handles very high traffic sites efficiently. Some of the benefits of the database architecture include:

* Small database footprint approximately 5-10% of raw logfile size
* Small number of database files required per profile (9 per month of historical reporting)
* Support for parallel processing of load-balanced webserver logs for increased performance
* Databases are standard files that are easy to back up and restore using native operating system utilities
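One plausible reason a report database can be much smaller than the raw logs is that it stores pre-aggregated counts instead of individual hits. Below is a minimal sketch of that idea. The log layout, the tab-separated output and the names `aggregate` and `to_flat_file` are my own assumptions for illustration, not Urchin's real file format.

```python
from collections import Counter

# Hypothetical raw log lines in the form "date path status". Illustrative only.
RAW_LOG = """\
2007-10-16 /home 200
2007-10-16 /home 200
2007-10-16 /faq 200
2007-10-17 /home 200
"""

def aggregate(raw_log):
    """Collapse raw hits into one count per (month, path)."""
    counts = Counter()
    for line in raw_log.splitlines():
        date, path, _status = line.split()
        month = date[:7]  # keep only "YYYY-MM"
        counts[(month, path)] += 1
    return counts

def to_flat_file(counts):
    """Serialize the summary as tab-separated lines, one row per (month, path)."""
    rows = sorted(counts.items())
    return "".join(f"{month}\t{path}\t{n}\n" for (month, path), n in rows)

SUMMARY = to_flat_file(aggregate(RAW_LOG))
print(SUMMARY, end="")
```

A report query then reads a few summary rows instead of rescanning every hit, and the summary is an ordinary file that can be copied or backed up like any other.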

More info about Urchin

http://www.google.com/support/urchin45/bin/answer.py?answer=28737

A long time ago I used a tracker whose site had a discussion of data normalization: http://www.2enetworx.com/dev/articles/statisticus5.asp

There you can find some information on how to reduce the amount of data in the DB; it may be a good starting point for research.
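As a rough illustration of what normalization means for tracker data (my own sketch, not taken from the linked article), repeated strings such as URLs and user agents can be stored once in lookup tables and referenced by small integer IDs. The names `get_id` and `normalize` and the sample hits are hypothetical.

```python
# Hypothetical hits as (url, user_agent) pairs. Illustrative data only.
HITS = [
    ("/home", "Firefox"),
    ("/home", "IE"),
    ("/faq", "Firefox"),
]

def get_id(table, value):
    """Return the integer ID for value, assigning the next free ID if it is new."""
    return table.setdefault(value, len(table))

def normalize(hits):
    """Split hits into two lookup tables plus rows of integer IDs."""
    urls, agents = {}, {}
    rows = [(get_id(urls, url), get_id(agents, agent)) for url, agent in hits]
    return urls, agents, rows

urls, agents, rows = normalize(HITS)
print(urls, agents, rows)
```

Each long string is stored once, and every hit row shrinks to a pair of integers, which is also cheaper to index and compare.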

dawez