My team has inherited support for 100+ applications. The applications don't have any kind of common architecture, so the ones that do logging usually do it with custom code to local files or a local database, and it's all unmanaged. We want to change that.

We're slowly migrating the applications over to using log4net and standardising the types of things that are logged. The next question becomes: where should we send the logs?

I was thinking that it would be good to use a central SQL Server dedicated to receiving all the logs, which would provide easy maintenance (one place for backups/archiving) and provide the future possibility of some data mining and trend analysis.

Is that the best practice for this kind of thing, or is there some dedicated application logging server we should be looking at instead?

Update: I should have been more clear than just casually mentioning log4net and SQL Server: we're a Microsoft house, with most things written in .NET. UNIX solutions are no good for us.

Thanks!

A: 

If you're running on *nix machines, the traditional solution is syslog.

John Paulett
A: 

On Unix, there's syslog.
Also, you might want to check out this case study.

luvieere
+1  A: 

If you have log4net log to the local Windows event log, you can mine these logs on a Windows 2008 box; see this centralized auditing article.

On that box, you can then easily import these events and provide some management and mining tools on top of it.

Wim Hollebrandse
Alas, we're a government shop so everything is Windows 2003. :-/ Thanks anyway.
Stewart Johnson
+1  A: 

As the other responses have pointed out, the closest thing to an industry standard is syslog. But don't despair because you're living in a Windows world. Kiwi have a syslog daemon which runs on Windows, and it is free. Find out more.

APC
+2  A: 

SQL would work, but I've used Splunk (www.splunk.com) to aggregate logs. I was able to find some surprising information based on the way Splunk allows you to set up indexes on your data, and then use their query tools to make some nice graphs. You can download a basic version of it for free too.

Nathan
+6  A: 

One world of caution: at 100+ apps in a big shop, with hundreds or perhaps thousands of hosts running those apps, steer clear of anything that introduces tight coupling. This pretty much rules out connecting directly to SQL Server or any other database, because your application logging would then depend on the availability of the log repository.

Availability of the central repository is a little more complicated than just 'if you can't connect, don't log it', because the most interesting events usually occur when there are problems, not when things go smoothly. If your logging drops entries exactly when things turn interesting, it will never be trusted to solve incidents, and as such it will fail to gain traction and support from other stakeholders (i.e. the application owners).
If you decide to implement retention and retry of failed log delivery on your own, you are facing an uphill battle: it is not a trivial task and is much more complex than it sounds, starting with efficient and reliable storage of the retained information and ending with good retry and intelligent fallback logic.
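To make the retention-and-retry point concrete, here is a minimal store-and-forward sketch. It is written in Python purely for illustration, not as a log4net appender; the `SpoolingLogger` class, the `send` callable and the SQLite spool file are all assumptions of this sketch. Every entry is written to a local spool first and deleted only after the central repository has accepted it.

```python
import json
import sqlite3
import time


class SpoolingLogger:
    """Write every entry to a local spool first, then forward what we can."""

    def __init__(self, spool_path, send):
        # `send` is a hypothetical callable that delivers one entry to the
        # central repository and raises an exception on failure.
        self._send = send
        self._db = sqlite3.connect(spool_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS spool ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self._db.commit()

    def log(self, level, message):
        # Persist locally before any network call, so an outage loses nothing.
        entry = {"ts": time.time(), "level": level, "message": message}
        self._db.execute(
            "INSERT INTO spool (body) VALUES (?)", (json.dumps(entry),)
        )
        self._db.commit()

    def flush(self):
        """Deliver spooled entries in order; stop at the first failure."""
        delivered = 0
        rows = self._db.execute(
            "SELECT id, body FROM spool ORDER BY id"
        ).fetchall()
        for row_id, body in rows:
            try:
                self._send(json.loads(body))
            except Exception:
                break  # repository unavailable: keep the rest for the next retry
            self._db.execute("DELETE FROM spool WHERE id = ?", (row_id,))
            self._db.commit()
            delivered += 1
        return delivered

    def pending(self):
        """Number of entries still waiting for delivery."""
        return self._db.execute("SELECT COUNT(*) FROM spool").fetchone()[0]
```

Even this toy version leaves the hard parts open: it gives at-least-once delivery (a crash between send and delete produces a duplicate), it has no cap on spool size, no backoff, and no answer for a full disk. That is roughly why an off-the-shelf transport tends to be the safer choice.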

You also must have an answer to the problems of authentication and security. Large orgs have multiple domains with various trust relations, employees venture in via VPN or Direct Access from home, some applications run unattended, some services are configured to run as local users, some machines are not joined to the domain, and so on. You had better have an answer to the question of how the logging module of each application, everywhere it is deployed, is going to authenticate with the central repository (and which situations are going to be unsupported).

Ideally you would use an out-of-the-box delivery mechanism for your logging module. MSMQ is probably the most appropriate fit: robust, asynchronous, reliable delivery (at least to the extent of most use cases), available on every Windows host once it is installed (it is an optional component). That is the major pain point: your applications will take a dependency on a non-default OS component.

The central repository storage has to be able to deliver the information requested by its various consumers, perhaps:

  • the application developers investigating incidents
  • customer support team investigating a lost transaction reported by a customer complaint
  • the security org doing forensics
  • the business managers demanding statistics, trends and aggregated info (BI).

The only storage capable of delivering this for any serious org (size, lifetime) is a relational engine, so probably SQL Server. Doing analysis over text files is really not going to go the distance.
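As a rough illustration of why a relational store suits these consumers, the sketch below uses Python's built-in `sqlite3` module as a stand-in for SQL Server. The `log_entry` table, its columns and the sample rows are invented for this example. Once entries are rows, a trend question such as "errors per application per day" becomes a single query.

```python
import sqlite3

# In-memory SQLite database standing in for the central SQL Server repository.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE log_entry ("
    " id INTEGER PRIMARY KEY,"
    " logged_at TEXT NOT NULL,"    # ISO 8601 timestamp
    " host TEXT NOT NULL,"
    " application TEXT NOT NULL,"
    " level TEXT NOT NULL,"
    " message TEXT NOT NULL)"
)

# Made-up sample data, for illustration only.
sample_rows = [
    ("2009-12-01T09:15:00", "web01", "billing", "ERROR", "timeout"),
    ("2009-12-01T09:20:00", "web02", "billing", "ERROR", "timeout"),
    ("2009-12-01T10:00:00", "web01", "billing", "INFO", "started"),
    ("2009-12-02T11:00:00", "app01", "payroll", "ERROR", "null reference"),
]
db.executemany(
    "INSERT INTO log_entry (logged_at, host, application, level, message)"
    " VALUES (?, ?, ?, ?, ?)",
    sample_rows,
)


def errors_per_day(conn):
    """Count ERROR entries per application per calendar day."""
    return conn.execute(
        "SELECT application, substr(logged_at, 1, 10) AS day, COUNT(*)"
        " FROM log_entry"
        " WHERE level = 'ERROR'"
        " GROUP BY application, day"
        " ORDER BY application, day"
    ).fetchall()


for application, day, count in errors_per_day(db):
    print(application, day, count)
```

The same table can serve the incident investigator (filter by host and time window) and the BI user (aggregate by application), which is much harder to do over scattered text files.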

So I would recommend a messaging-based log transport/delivery (MSMQ) and a relational central repository (SQL Server), perhaps with an analytical component on top of it (Analysis Services Data Mining). As you can see, this is clearly no small feat, and it covers slightly more than just configuring log4net.

As for what to log, you say you have already given it some thought, but I'd like to chime in with my extra 2c: oftentimes, especially during incident investigation, you will want the ability to request extra information. This means you would like to know the content of certain files on the incident machine, or some registry keys, or some performance counter values, or a full process dump. It is very useful to be able to request this information from the central repository interface, but it is impractical to always collect it just in case it is needed. This implies there has to be some sort of bidirectional communication between the application and the central repository: when the application reports an incident, it can be asked to add extra information (e.g. a dump of the process at fault). There has to be a lot of infrastructure in place for something like this to occur, from the protocol between the application logging and the central repository, to the ability of the central repository to recognize a repeat of an incident, to the capacity of the logging library to collect the extra information required, and not least the ability of an operator to mark incidents as needing extra information on the next occurrence.
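The bidirectional exchange described above can be sketched in a few lines. This is an in-process toy in Python, not a real protocol: the `Repository` class, the signature scheme and the `collectors` mapping are all assumptions made for illustration. An operator flags an incident signature, and the next time the application reports that signature it is asked for the extra data and uploads it.

```python
import hashlib


class Repository:
    """Toy central repository: counts repeats and tracks operator requests."""

    def __init__(self):
        self.counts = {}   # signature -> number of occurrences seen
        self.wanted = {}   # signature -> names of extra items to collect
        self.extras = {}   # signature -> list of uploaded extra data

    def request_extra(self, signature, items):
        # Called by an operator: ask for extra data on the next occurrence.
        self.wanted[signature] = list(items)

    def report(self, signature, message):
        # Called by the application; the reply lists the extras requested.
        self.counts[signature] = self.counts.get(signature, 0) + 1
        return self.wanted.pop(signature, [])

    def attach(self, signature, extra):
        self.extras.setdefault(signature, []).append(extra)


def signature_of(application, error_type, location):
    """Stable identifier so that repeats of one incident can be recognized."""
    raw = "|".join((application, error_type, location))
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()[:12]


def report_incident(repo, application, error_type, location, message, collectors):
    """Report an incident and upload whatever extras the repository asks for.

    `collectors` maps an item name to a zero-argument function that gathers
    it (hypothetical examples: a registry key reader, a process dump writer).
    """
    sig = signature_of(application, error_type, location)
    requested = repo.report(sig, message)
    extra = {name: collectors[name]() for name in requested if name in collectors}
    if extra:
        repo.attach(sig, extra)
    return sig
```

In a real deployment the reply would travel over whatever transport carries the logs, and the collectors would need limits on size, time and permissions; none of that is modeled here.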

I understand this answer probably seems like overkill at the moment, but I was involved with this problem space for quite a while. I looked at many online crash reports from Dr. Watson back in the day when I was with MS, and I can tell you that these requirements exist, they are valid concerns, and when implemented the solution helps tremendously. Ultimately, you can't fix what you cannot measure. A large organisation depends on good management and monitoring of its application stock, including logging and auditing.

There are some third-party vendors that offer solutions, some even integrated with log4net, such as bugcollect.com (full disclosure: that's my own company), Error Traffic Controller, Exceptioneer and others.

Remus Rusanu
Was your 'One **world** of caution..' a deliberate pun? I mean, it clearly is more of a world than a word. ;-)
Wim Hollebrandse
@Wim: honest typo, but I'll leave it as is, it's more fun that way
Remus Rusanu