views:

424

answers:

6

Can anyone point me to a reference or provide a high level overview of how companies like Facebook, Yahoo, Google, etc al perform the large scale (e.g. multi-TB range) log analysis that they do for operations and especially web analytics?

Focusing on web analytics in particular, I'm interested in two closely-related aspects: query performance and data storage.

I know that the general approach is to use map reduce to distribute each query over a cluster (e.g. using Hadoop). However, what's the most efficient storage format to use? This is log data, so we can assume each event has a time stamp, and that in general the data is structured and not sparse. Most web analytics queries involve analyzing slices of data between two arbitrary timestamps and retrieving aggregate statistics or anomalies in that data.

Would a column-oriented DB like Big Table (or HBase) be an efficient way to store, and more importantly, query such data? Does the fact that you're selecting a subset of rows (based on timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, eg. a reverse index?

+2  A: 

Have a look at the paper Interpreting the Data: Parallel Analysis with Sawzall by Google. This is a paper on the tool Google uses for log analysis.

marcog
+3  A: 

Unfortunately there is no one size fits all answer.

I am currently using Cascading, Hadoop, S3, and Aster Data to process 100's Gigs a day through a staged pipeline inside of AWS.

Aster Data is used for the queries and reporting since it provides a SQL interface to the massive data sets cleaned and parsed by Cascading processes on Hadoop. Using the Cascading JDBC interfaces, loading Aster Data is quite a trivial process.

Keep in mind tools like HBase and Hypertable are Key/Value stores, so don't do ad-hoc queries and joins without the help of a MapReduce/Cascading app to perform the joins out of band, which is a very useful pattern.

in full disclosure, I am a developer on the Cascading project.

http://www.asterdata.com/

http://www.cascading.org/

cwensel
+5  A: 

The book Hadoop: The definitive Guide by O'Reilly has a chapter which discusses how hadoop is used at two real-world companies.

http://my.safaribooksonline.com/9780596521974/ch14

caskey
+1  A: 

There is a good case study here http://www.rightscale.com/bi about how CrowdStar is doing large scale log analytics.

A: 

Datameer (www.datameer.com) offers a solution for simplifying log file aggregation.

Teresa Wingfield
A: 

To learn more about Facebook's analytics infrastructure, you can watch this video from the 2010 Hadoop Summit.

Jeff Hammerbacher