Does anyone know of any articles that discuss how Google Analytics stores and processes the data that comes in from the urchin calls? I'm curious about the architecture.
Thanks!
I think Analytics is totally closed. However, if you haven't read about Facebook's Scribe, it is probably worth checking out. It is also an extreme case of scalable, distributed logging and analysis.
Hi,
I don't know about Analytics specifically, but in general Google uses (and, I believe, invented) Map/Reduce.
There are several open source databases that support Map/Reduce queries, e.g. CouchDB, which is a document-oriented database.
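To make the Map/Reduce idea concrete, here is a minimal single-machine sketch in Python that counts pageviews per page. This only illustrates the pattern and is not how Analytics actually does it; the hit records and the function names (`map_hit`, `shuffle`, `reduce_counts`, `pageviews`) are made up for the example.

```python
from collections import defaultdict

# Hypothetical tracking hits; the field names are assumptions for illustration.
HITS = [
    {"page": "/home", "visitor": "a"},
    {"page": "/about", "visitor": "a"},
    {"page": "/home", "visitor": "b"},
]


def map_hit(hit):
    """Map step: emit one (key, value) pair per hit."""
    return (hit["page"], 1)


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(values):
    """Reduce step: collapse all values for one key into a single number."""
    return sum(values)


def pageviews(hits):
    """Run map, shuffle and reduce over the hits and return counts per page."""
    grouped = shuffle(map(map_hit, hits))
    return {page: reduce_counts(values) for page, values in grouped.items()}


if __name__ == "__main__":
    print(pageviews(HITS))
```

In a real cluster the map and reduce steps would run on many machines in parallel, with the framework handling the grouping in between; the structure of the two functions stays the same.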
These types of applications use geolocation to determine the user's location based on the IP address. Additional information is collected via the JavaScript objects window.navigator (userAgent, platform, language, ...) and screen (dimensions, color depth).
edit:
There is evidence that Google uses its BigTable database engine (a storage system that is commonly used together with MapReduce) for Reader, Maps & YouTube.
On dbms2.com, they even say that Analytics uses MapReduce (could be categorized as "insider knowledge").
Their own docs on "How Data Is Calculated" give you a pretty good idea of what data they collect and how they calculate their metrics:
http://code.google.com/apis/analytics/docs/concepts/gaConceptsOverview.html#howDataIsCalculated
As you mentioned, these calculations are distributed across many machines using Google's homegrown architecture, which includes Map/Reduce: