I'm currently working on a parser for our internal log files (generated by log4php, log4net and log4j). So far I have a nice regular expression to parse the logs, except for one annoying bit: Some log messages span multiple lines, which I can't get to match properly. The regex I have now is this:

(?<date>\d{2}/\d{2}/\d{2})\s(?<time>\d{2}:\d{2}:\d{2},\d{3})\s(?<message>.+)

The log format (which I use for testing the parser) is this:

07/23/08 14:17:31,321 log 
message
spanning
multiple
lines
07/23/08 14:17:31,321 log message on one line

When I run the parser right now, I get only the line the log starts on. If I change it to span multiple lines, I get only one result (the whole log file).

Help please ;-)


@samjudson:

You need to pass the RegexOptions.Singleline flag in to the regular expression, so that "." matches all characters, not just all characters except new lines (which is the default).

I tried that, but then it matches the whole file. I also tried to set the message-group to .+? (non-greedy), but then it matches a single character (which isn't what I'm looking for either).

The problem is that the pattern for the message matches on the date-group as well, so when it doesn't break on a new-line it just goes on and on and on.


I use this regex for the message group now. It works, unless there's a pattern IN the log message which is the same as the start of the log message.

(?<message>(.(?!\d{2}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2},\d{3}\s\[\d{4}\]))+)
A: 

You need to pass the RegexOptions.Singleline flag in to the regular expression, so that "." matches all characters, not just all characters except new lines (which is the default).

samjudson
+1  A: 

The problem you have is that you need to terminate the regex pattern so it knows when one message ends and the next starts.

When you were running in default mode, the newline was working as an implicit terminator.

The problem is that once you let the pattern span multiple lines (RegexOptions.Singleline, so that "." matches newlines), there's no terminator, so the pattern will gobble up the whole file. A non-greedy match takes as few characters as possible, which will be just one.

Now, if you use the date of the next message as the terminator, I think your parser will only get every other entry, because the match consumes that date.

Is there something else in the file you could use to terminate the pattern?
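
The three behaviours described above can be reproduced with a small sketch. It is written in Python rather than .NET, so some details are assumptions of the translation: named groups use `(?P<name>...)` instead of `(?<name>...)`, and `re.DOTALL` stands in for RegexOptions.Singleline. The sample text is the log from the question; `HEAD` and `messages` are hypothetical names used only for this illustration.

```python
import re

# Sample log from the question (note the trailing space after "log").
LOG = (
    "07/23/08 14:17:31,321 log \n"
    "message\n"
    "spanning\n"
    "multiple\n"
    "lines\n"
    "07/23/08 14:17:31,321 log message on one line\n"
)

# Date and time part of the question's regex, in Python named-group syntax.
HEAD = r"(?P<date>\d{2}/\d{2}/\d{2})\s(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s"


def messages(message_pattern, flags=0):
    """Return the message group of every match of HEAD + message_pattern."""
    regex = re.compile(HEAD + message_pattern, flags)
    return [m.group("message") for m in regex.finditer(LOG)]


# Default mode: "." stops at the newline, so only the first line is captured.
default_mode = messages(r"(?P<message>.+)")

# "." matches newlines and the quantifier is greedy: one match, rest of file.
greedy_dotall = messages(r"(?P<message>.+)", re.DOTALL)

# "." matches newlines and the quantifier is lazy: one character per entry.
lazy_dotall = messages(r"(?P<message>.+?)", re.DOTALL)

print(default_mode)
print(len(greedy_dotall))
print(lazy_dotall)
```

In this sketch the default mode yields the first line of each entry, the greedy variant yields a single match that swallows the second entry, and the lazy variant yields a one-character message for each entry.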

Dave Webb
+2  A: 

Obviously, "message lines" need to be distinguishable from "log lines": if you allow a message line to start with a date and time after a newline, there is simply no way to determine what is part of a message and what is not. So, instead of the dot, you need an expression that matches anything except a newline followed by a date and time.

Personally, however, I would not use a regular expression to parse the whole log entry. I prefer to iterate over the lines in my own loop and use one simple regular expression to decide whether a line starts a new entry. I would also prefer this for readability.
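
A minimal sketch of such a loop, in Python rather than C# (the same idea carries over to a .NET Regex). It assumes the timestamp layout from the question; `ENTRY_START` and `parse_entries` are hypothetical names, and lines that appear before the first timestamp are simply skipped.

```python
import re

# A line that begins with "dd/dd/dd hh:mm:ss,mmm " starts a new entry.
ENTRY_START = re.compile(r"(\d{2}/\d{2}/\d{2})\s(\d{2}:\d{2}:\d{2},\d{3})\s(.*)")


def parse_entries(lines):
    """Group lines into (date, time, message) tuples.

    A line starting with a timestamp opens a new entry; any other line is
    appended to the message of the entry currently being built.
    """
    entries = []
    current = None  # [date, time, list of message lines]
    for raw in lines:
        line = raw.rstrip("\r\n")
        start = ENTRY_START.match(line)
        if start:
            if current is not None:
                entries.append((current[0], current[1], "\n".join(current[2])))
            current = [start.group(1), start.group(2), [start.group(3)]]
        elif current is not None:
            current[2].append(line)
        # Lines before the first timestamp are ignored in this sketch.
    if current is not None:
        entries.append((current[0], current[1], "\n".join(current[2])))
    return entries


sample_lines = [
    "07/23/08 14:17:31,321 log \n",
    "message\n",
    "spanning\n",
    "multiple\n",
    "lines\n",
    "07/23/08 14:17:31,321 log message on one line\n",
]

entries = parse_entries(sample_lines)
for date, time, message in entries:
    print(date, time, repr(message))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so the whole log never has to be held in memory at once.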

mweerden
+3  A: 

This will only work if the log message doesn't contain a date at the beginning of the line, but you could try adding a negative look-ahead assertion for a date in the "message" group:

(?<date>\d{2}/\d{2}/\d{2})\s(?<time>\d{2}:\d{2}:\d{2},\d{3})\s(?<message>(.(?!^\d{2}/\d{2}/\d{2}))+)

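Note that this requires the RegexOptions.Multiline flag, so that "^" matches at the start of every line. The message can only continue past a line break if "." also matches newlines, so combine it with RegexOptions.Singleline.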
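
As a rough check of this idea, here is the same pattern translated to Python. The translation makes a few assumptions: `(?P<name>...)` replaces `(?<name>...)`, `re.MULTILINE` plays the role of the multiline option, and `re.DOTALL` is added so that "." can cross line breaks. The sample text is the log from the question, and the trailing line break of the last message is stripped afterwards.

```python
import re

# Sample log from the question.
LOG = (
    "07/23/08 14:17:31,321 log \n"
    "message\n"
    "spanning\n"
    "multiple\n"
    "lines\n"
    "07/23/08 14:17:31,321 log message on one line\n"
)

# Each message character must not be followed by a date at the start of a
# line, so the message stops just before the newline that precedes the
# next entry.
ENTRY = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2})\s"
    r"(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s"
    r"(?P<message>(?:.(?!^\d{2}/\d{2}/\d{2}))+)",
    re.MULTILINE | re.DOTALL,
)

parsed = [
    (m.group("date"), m.group("time"), m.group("message").rstrip("\r\n"))
    for m in ENTRY.finditer(LOG)
]

for date, time, message in parsed:
    print(date, time, repr(message))
```

The caveat from the answer still applies: a message line that itself begins with a date in this format would be treated as the start of a new entry.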

Jeff Hillman
A: 

You might find it a lot easier to parse the file with a proper parser generator: ANTLR can generate one in C#. Context-free parsers only seem hard until you "get" them; after that, they are much simpler and friendlier to use than regular expressions.

Daren Thomas