views: 590
answers: 2

Hello,

I've been looking at MapReduce for a while, and it seems to be a very good way to implement fault-tolerant distributed computing. I read a lot of papers and articles on that topic, installed Hadoop on an array of virtual machines, and did some very interesting tests. I really think I understand the Map and Reduce steps.

But here is my problem: I can't figure out how it can help with HTTP server log analysis.

My understanding is that big companies (Facebook, for instance) use MapReduce to process their HTTP logs in order to speed up the extraction of audience statistics. The company I work for, while smaller than Facebook, has a large volume of web logs to process every day (100 GB, growing between 5 and 10 percent every month). Right now we process these logs on a single server, and it works just fine. But distributing the computing jobs immediately comes to mind as a soon-to-be-useful optimization.

Here are the questions I can't answer right now; any help would be greatly appreciated:

  • Can the MapReduce concept really be applied to web log analysis?
  • Is MapReduce the most clever way of doing it?
  • How would you split the web log files between the various computing instances?

Thank you.
Nicolas

A: 
  • Can the MapReduce concept really be applied to web log analysis?

Sure. What sort of data are you storing?

  • Is MapReduce the most clever way of doing it?

It would allow you to query across many commodity machines at once, so yes it can be useful. Alternatively, you could try Sharding.

  • How would you split the web log files between the various computing instances?

Generally you would distribute your data using a consistent hashing algorithm, so you can easily add more instances later. You should hash by whatever would be your primary key in an ordinary database: a user ID, an IP address, a referrer, a page, an advert; whatever is the topic of your logging.
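
To make that concrete, here is a minimal Python sketch of a consistent-hashing ring (the class name, the worker labels, and the replica count are placeholders for illustration, not an existing library): each node is hashed to several positions on a ring, each key is hashed to one position, and the key is routed to the first node found clockwise from it, so adding a node only remaps a small share of the keys.

  import bisect
  import hashlib

  class ConsistentHashRing:
      def __init__(self, nodes, replicas=100):
          # Several virtual points per node give a smoother balance on the ring.
          self.replicas = replicas
          self.ring = {}          # ring position -> node name
          self.positions = []     # sorted ring positions
          for node in nodes:
              self.add_node(node)

      def _hash(self, key):
          return int(hashlib.md5(key.encode()).hexdigest(), 16)

      def add_node(self, node):
          for i in range(self.replicas):
              position = self._hash("%s#%d" % (node, i))
              self.ring[position] = node
              bisect.insort(self.positions, position)

      def get_node(self, key):
          # Walk clockwise to the first node position at or after the key's hash.
          position = self._hash(key)
          index = bisect.bisect(self.positions, position) % len(self.positions)
          return self.ring[self.positions[index]]

  ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
  print(ring.get_node("192.168.1.1"))   # route a log line by its source IP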

Nick Retallack
You can find an excellent explanation of consistent hashing here: http://michaelnielsen.org/blog/?p=613
tuinstoel
+3  A: 

Can the MapReduce concept really be applied to web log analysis?

Yes.

You can split your huge logfile into chunks of, say, 10,000 or 1,000,000 lines (whatever is a good chunk size for your type of logfile; for Apache logfiles I'd go for a larger number), feed them to some mappers that would extract something specific (like browser, IP address, ..., username, ...) from each log line, then reduce by counting the number of times each one appeared (simplified):

  192.168.1.1,FireFox x.x,username1
  192.168.1.1,FireFox x.x,username1
  192.168.1.2,FireFox y.y,username1
  192.168.1.7,IE 7.0,username1

You can extract browsers, ignoring version, using a map operation to get this list:

  FireFox
  FireFox
  FireFox
  IE

Then reduce to get this:

  FireFox,3
  IE,1
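
As an illustration (not part of the original answer), here is a minimal mapper/reducer pair in the Hadoop Streaming style, assuming comma-separated lines like the sample above; the field positions are an assumption, so adapt the parsing to your real log format:

  #!/usr/bin/env python
  # mapper.py: emit a tab-separated (browser, 1) pair for each log line.
  import sys

  for line in sys.stdin:
      fields = line.strip().split(",")
      if len(fields) < 2:
          continue
      browser = fields[1].split()[0]   # "FireFox x.x" -> "FireFox", "IE 7.0" -> "IE"
      print("%s\t1" % browser)

and the matching reducer (Hadoop Streaming sorts the mapper output by key before the reducer sees it):

  #!/usr/bin/env python
  # reducer.py: sum the counts per browser from the sorted mapper output.
  import sys
  from itertools import groupby

  def parse(lines):
      for line in lines:
          key, value = line.rstrip("\n").split("\t", 1)
          yield key, int(value)

  for browser, pairs in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
      print("%s,%d" % (browser, sum(count for _, count in pairs)))

You would run these with Hadoop Streaming along the lines of: hadoop jar hadoop-streaming.jar -input logs/ -output browser-counts/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the jar location and the paths are placeholders for your own installation).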

Is MapReduce the most clever way of doing it?

It's clever, but you would need to be very big in order to gain any benefit from it: think splitting petabytes of logs.

To do this kind of thing, I would prefer to use message queues and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push results to another queue; jobs that are not completed within some timeframe are made available again for others to process. These clients would be small programs that each do something specific.

You could start with 1 client, and expand to 1000... You could even have a client that runs as a screensaver on all the PCs on a LAN, and run 8 clients on your 8-core servers, 2 on your dual core PCs...

With pull, you could have 10 or 100 clients working; multicore machines could run multiple clients, and whatever a client finishes becomes available for the next step. You don't need to do any hashing or assignment for the work to be distributed: it's 100% dynamic.
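
A minimal sketch of such a pull-based worker, assuming Redis lists as the work and result queues (the queue names and the log format are illustrative assumptions, not something prescribed by the answer):

  import json
  from collections import Counter

  import redis

  r = redis.Redis()

  def process_chunk(chunk):
      # The "something specific" this client does: count browsers in one chunk.
      counts = Counter()
      for line in chunk.splitlines():
          fields = line.split(",")
          if len(fields) >= 2:
              counts[fields[1].split()[0]] += 1
      return counts

  while True:
      item = r.brpop("log-chunks", timeout=5)   # block until a chunk is available
      if item is None:
          continue                              # nothing queued yet, keep polling
      _, chunk = item
      counts = process_chunk(chunk.decode())
      r.lpush("partial-counts", json.dumps(counts))

A real deployment would also re-queue chunks whose results do not arrive within the timeframe mentioned above; that part is left out to keep the sketch short.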

How would you split the web log files between the various computing instances?

By number of elements or lines if it's a text-based logfile.
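
For instance, a simple way to cut a text logfile into fixed-size chunks before handing them out (the chunk size and file names here are arbitrary):

  from itertools import islice

  def split_logfile(path, lines_per_chunk=100000):
      # Write path.part0000, path.part0001, ... each holding lines_per_chunk lines.
      with open(path) as logfile:
          chunk_no = 0
          while True:
              chunk = list(islice(logfile, lines_per_chunk))
              if not chunk:
                  break
              with open("%s.part%04d" % (path, chunk_no), "w") as out:
                  out.writelines(chunk)
              chunk_no += 1

  split_logfile("access.log")   # placeholder file name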

In order to test MapReduce, I'd like to suggest that you play with Hadoop.

Osama ALASSIRY
First of all, sorry for the delay. Thanks a lot for your very high-quality answer. It helps a lot!
Nicolas
As an alternative to splitting the log files, you could parallelize your "log analysis" script across n cores. And if you were to run this script on a virtualized cluster (of, say, 96 cores), your code would run flawlessly without any changes. You need to identify and isolate the "smallest" unit of work that is side-effect free and deals with immutable data; this may require you to redesign your code. Besides, Hadoop is comparatively harder to set up (and where I live, expertise is harder to find). A sketch of this approach follows below.
imran.fanaswala
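
A minimal sketch of the approach described in that comment, using Python's multiprocessing module; the side-effect-free unit of work is counting browsers in one chunk of lines, and the file name and field positions are assumptions:

  from collections import Counter
  from multiprocessing import Pool

  def count_browsers(chunk):
      # Side-effect-free unit of work: takes a chunk of lines, returns its counts.
      counts = Counter()
      for line in chunk:
          fields = line.split(",")
          if len(fields) >= 2:
              counts[fields[1].split()[0]] += 1
      return counts

  def chunks(lines, size=100000):
      for i in range(0, len(lines), size):
          yield lines[i:i + size]

  if __name__ == "__main__":
      with open("access.log") as f:          # placeholder file name
          lines = f.readlines()
      with Pool() as pool:                   # one worker process per core by default
          partials = pool.map(count_browsers, list(chunks(lines)))
      total = sum(partials, Counter())       # merge the per-chunk counts
      print(total.most_common())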