We have started using a third-party platform (GigaSpaces) that helps us with distributed computing. One of the major problems we are now trying to solve is how to manage our log files in this distributed environment. Our current setup is as follows.

Our platform is distributed over 8 machines. On each machine we have 12-15 processes that log to separate log files using java.util.logging. On top of this platform we have our own applications, which use log4j and log to separate files. We also redirect stdout to a separate file to catch thread dumps and similar output.

This results in about 200 different log files.

As of now we have no tooling to assist in managing these files. In the following cases this causes us serious headaches.

  • Troubleshooting when we do not know beforehand in which process the problem occurred. In this case we currently log in to each machine using ssh and start grepping.

  • Trying to be proactive by regularly checking the logs for anything out of the ordinary. In this case we also currently log in to all machines and look at different logs using less and tail.

  • Setting up alerts. We are looking to set up alerts on events over a threshold, which looks like it will be a pain with 200 log files to check.

Today we have only about 5 log events per second, but that will increase as we migrate more and more code to the new platform.

I would like to ask the community the following questions.

  • How have you handled similar cases with many log files distributed over several machines logged through different frameworks?
  • Why did you choose that particular solution?
  • How did your solutions work out? What did you find good and what did you find bad?

Many thanks.

+1  A: 

I'd suggest taking a look at a log aggregation tool like Splunk or Scribe.

(Also, I think this is more of a ServerFault question, as it has to do with administration of your app and its data, not so much with creating the app.)

matt b
Thank you for your suggestions. What are your experiences with those tools? Might indeed be better at ServerFault, agreed.
Kristoffer E
I have used Splunk to watch logs from about 40 servers. It worked really nicely. The only downside was that the front end was a little heavy (the JavaScript magic crashed Firefox on Ubuntu), but it has likely improved since then.
bwawok
Personally I haven't yet dealt with having Splunk pick up log data automatically, only with manually importing data into it - but its frontend and analysis tools look fantastic.
matt b
A: 

The only piece of advice I can give you is to make sure you pass a transaction ID through your code and include it whenever you log, so that you can later correlate the different calls with each other.
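A minimal sketch of one way to do this with log4j 1.x's MDC; the handleRequest entry point and the "txId" key are only illustrative:

    import java.util.UUID;

    import org.apache.log4j.Logger;
    import org.apache.log4j.MDC;

    public class RequestHandler {

        private static final Logger log = Logger.getLogger(RequestHandler.class);

        // Illustrative entry point: put the transaction ID into the MDC at the
        // start of the call and remove it again when the call is finished.
        public void handleRequest(Object request) {
            String txId = UUID.randomUUID().toString();
            MDC.put("txId", txId);
            try {
                log.info("started processing request");
                // ... actual work; all logging on this thread carries the same txId ...
            } finally {
                MDC.remove("txId");
            }
        }
    }

With a pattern layout that includes %X{txId}, the ID appears on every log line, so grepping for one ID across machines pulls out a single call chain.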

Romain Hippeau
+2  A: 

I would recommend piping all your Java logging to the Simple Logging Facade for Java (SLF4J) and then redirecting all logs from SLF4J to Logback. SLF4J has special support for handling all the popular legacy APIs (log4j, commons-logging, java.util.logging, etc.); see here.
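As a rough sketch, and assuming a reasonably recent jul-to-slf4j bridge is on the classpath, the java.util.logging side of that redirection boils down to one install call at process start-up (the log4j and commons-logging sides need no code at all, just swapping in the log4j-over-slf4j and jcl-over-slf4j jars):

    import org.slf4j.bridge.SLF4JBridgeHandler;

    public class LoggingBootstrap {

        // Call once, early at start-up, before any java.util.logging calls.
        public static void init() {
            // Drop the handlers java.util.logging installs by default...
            SLF4JBridgeHandler.removeHandlersForRootLogger();
            // ...and route all j.u.l. records to SLF4J, and from there to Logback.
            SLF4JBridgeHandler.install();
        }
    }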

Once you have your logs in Logback you can use one of its many appenders to aggregate logs over several machines; for details, see the manual section about appenders. Socket, JMS and SMTP seem to be the most obvious candidates.

Logback also has built-in support for monitoring for special conditions in the log stream and for filtering the events sent to a particular appender. So you could, for example, set up an SMTP appender to send you an e-mail every time there is an ERROR-level event in the logs.
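To make that concrete, here is a rough programmatic sketch of the appender wiring (most setups would express the same thing in logback.xml instead): a SocketAppender ships events to a central collector, and a ThresholdFilter ensures only WARN and above leave the machine. The host name and port are placeholders:

    import ch.qos.logback.classic.Logger;
    import ch.qos.logback.classic.LoggerContext;
    import ch.qos.logback.classic.filter.ThresholdFilter;
    import ch.qos.logback.classic.net.SocketAppender;

    import org.slf4j.LoggerFactory;

    public class CentralLogShipping {

        public static void init() {
            LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

            // Ship serialized logging events to a central collector,
            // e.g. Logback's SimpleSocketServer running on that machine.
            SocketAppender socket = new SocketAppender();
            socket.setContext(context);
            socket.setRemoteHost("loghost.example.com"); // placeholder host
            socket.setPort(4560);                        // placeholder port

            // Forward only WARN and above to keep network traffic down.
            ThresholdFilter threshold = new ThresholdFilter();
            threshold.setLevel("WARN");
            threshold.start();
            socket.addFilter(threshold);

            socket.start();

            Logger root = context.getLogger(org.slf4j.Logger.ROOT_LOGGER_NAME);
            root.addAppender(socket);
        }
    }

An SMTPAppender attached to the root logger in the same way would cover the "e-mail on ERROR" case, since its default evaluator triggers on ERROR-level events.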

Finally, to ease troubleshooting, be sure to add some sort of request ID to all your incoming "requests"; see my answer to this question for details.

EDIT: you could also implement your own custom Logback appender and redirect all logs to Scribe.
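A skeleton for such an appender could look like the following; the nested ScribeClient interface is a hypothetical stand-in for whatever Scribe/Thrift client you actually use:

    import ch.qos.logback.classic.spi.ILoggingEvent;
    import ch.qos.logback.core.AppenderBase;

    // Sketch of a custom Logback appender that forwards every event to Scribe.
    public class ScribeAppender extends AppenderBase<ILoggingEvent> {

        // Hypothetical minimal client contract; a real Scribe/Thrift client sits behind it.
        public interface ScribeClient {
            void log(String category, String message);
            void close();
        }

        private ScribeClient client;
        private String category = "platform-logs";

        public void setClient(ScribeClient client) {
            this.client = client;
        }

        public void setCategory(String category) {
            this.category = category;
        }

        @Override
        protected void append(ILoggingEvent event) {
            // Forward the logger name and the rendered message to the Scribe category.
            client.log(category, event.getLoggerName() + " - " + event.getFormattedMessage());
        }

        @Override
        public void stop() {
            if (client != null) {
                client.close();
            }
            super.stop();
        }
    }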

Neeme Praks
It is worth pointing out that redirecting everything to Scribe would essentially create a single point of failure in the system, e.g. when the Scribe daemon is down.
Eugene Kuleshov
Well, the fault-tolerance aspect is typically very deployment-specific and, as such, left as an exercise to the architect responsible for the final solution. Still, it is something to keep in mind.
Neeme Praks
I would like to add that it is not a real "single point of failure" in the sense that the rest of the system is unaffected if the Scribe central node goes down - individual Scribe nodes just queue the log records locally until the central node is back up again. Scribe downtime only affects the availability of the logging subsystem.
Neeme Praks
+1  A: 

An interesting option to explore would be to run a Hadoop cluster on those nodes and write a custom MapReduce job for searching and aggregating results specific to your applications.
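For what it is worth, the "distributed grep" part of such a job can stay very small. Here is a rough sketch using the standard Hadoop Mapper API; the configuration key log.grep.pattern is made up for this example:

    import java.io.IOException;
    import java.util.regex.Pattern;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits every log line that matches a configured regular expression.
    public class LogGrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        private Pattern pattern;

        @Override
        protected void setup(Context context) {
            // "log.grep.pattern" is an illustrative configuration key.
            pattern = Pattern.compile(context.getConfiguration().get("log.grep.pattern", "ERROR"));
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (pattern.matcher(line.toString()).find()) {
                context.write(line, NullWritable.get());
            }
        }
    }

A reducer is optional for plain searching; for aggregation (e.g. counting events per process and hour) you would emit a composite key and sum the counts in a reducer.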

Eugene Kuleshov
A: 

I would transfer the files to a centralized machine and run an analysis mechanism on them there. Maybe you can use a Hadoop cluster for that and run map/reduce jobs to do the analysis - copy the files to the Hadoop cluster every 5 minutes, etc. I'm not sure whether this fits your needs. In that context it might also be a good idea to look at Scribe, as already mentioned.

khmarbaise