ansaurus

Question

How is MapReduce a good method to analyse http server logs ?

Answer 1

A:

Can the MapReduce concept really be applied to weblogs analysis ?

Sure. What sort of data are you storing?

Is MapReduce the most clever way of doing it ?

It would allow you to query across many commodity machines at once, so yes it can be useful. Alternatively, you could try Sharding.

How would you split the web log files between the various computing instances ?

Generally you would distribute your data using a consistent hashing algorithm, so you can easily add more instances later. You should hash by whatever would be your primary key in an ordinary database. It could be a user id, an ip address, referer, page, advert; whatever is the topic of your logging.

Nick Retallack 2009-06-02 12:12:42

Here you find an excellent explanation of consistent hashing: http://michaelnielsen.org/blog/?p=613

tuinstoel 2009-07-24 18:38:21

Answer 2

+3 A:

Can the MapReduce concept really be applied to weblogs analysis ?

Yes.

You can split your hudge logfile into chunks of say 10,000 or 1,000,000 lines (whatever is a good chunk for your type of logfile - for apache logfiles I'd go for a larger number), feed them to some mappers that would extract something specific (like Browser,IP Address, ..., Username, ... ) from each log line, then reduce by counting the number of times each one appeared (simplified):

  192.168.1.1,FireFox x.x,username1
  192.168.1.1,FireFox x.x,username1
  192.168.1.2,FireFox y.y,username1
  192.168.1.7,IE 7.0,username1

You can extract browsers, ignoring version, using a map operation to get this list:

FireFox
FireFox
FireFox
IE

Then reduce to get this : FireFox,3 IE,1

Is MapReduce the most clever way of doing it ?

It's clever, but you would need to be very big in order to gain any benefit... Splitting PETABYTES of logs.

To do this kind of thing, I would prefer to use Message Queues, and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push results to another queue, with jobs not being executed in some timeframe made available for others to process. These clients would be small programs that do something specific.

You could start with 1 client, and expand to 1000... You could even have a client that runs as a screensaver on all the PCs on a LAN, and run 8 clients on your 8-core servers, 2 on your dual core PCs...

With Pull: You could have 100 or 10 clients working, multicore machines could have multiple clients running, and whatever a client finishes would be available for the next step. And you don't need to do any hashing or assignment for the work to be done. It's 100% dynamic.

How would you split the web log files between the various computing instances ?

By number of elements or lines if it's a text-based logfile.

In order to test MapReduce, I'd like to suggest that you play with Hadoop.

Osama ALASSIRY 2009-06-02 12:26:32

First of all, sorry for the delay.Thanks a lot for your very high-quality answer. It helps a lot !

Nicolas 2009-06-11 10:01:39

As an alternative to splitting the log files, you could parallelize your "log analysis" script across n cores. And if you were to run this script on a virtualized cluster (of say, 96 cores), your code will run flawlessly without any changes. You need to identify and isolate the "smallest" unit of work that is side-effect free and deals with immutable data. This may require you to re-design code, possibly.Besides Hadoop is comparatively harder to setup (and where I live, expertise is harder to find).

imran.fanaswala 2010-04-10 20:27:52

ansaurus

tags:

views:

answers:

How is MapReduce a good method to analyse http server logs ?

related questions