I'm struggling with a regex for splitting log files into log sequences so that I can match patterns inside those sequences. The log format is:

timestamp fieldA fieldB fieldn log message1 
timestamp fieldA fieldB fieldn log message2
log message2bis
timestamp fieldA fieldB fieldn log message3 

The timestamp regex is known.

I want to extract every log sequence (potentially multiline) between timestamps, and I want to keep the timestamp.

At the same time, I want to keep an exact count of lines.

What I need to know is how to decorate the timestamp pattern so that it splits my log file into log sequences. I cannot split the whole file as a String, since the file content is provided in a CharBuffer.

Here is a sample method that will use this log sequence matcher:

private void matches(File f, CharBuffer cb) {
    Matcher sequenceBreak = sequencePattern.matcher(cb);    // sequence matcher
    int lines = 1;
    int sequences = 0;

    while (sequenceBreak.find()) {
        sequences++;

        String sequence = sequenceBreak.group();
        if (filter.accept(sequence)) {
            System.out.println(f + ":" + lines + ":" + sequence);                
        }

        //count lines
        Matcher lineBreak = LINE_PATTERN.matcher(sequence);
        while (lineBreak.find()) {
            lines++;
        }

        if (sequenceBreak.end() == cb.limit()) {
            break;
        }
    }        
}
A: 

I don't see any regex in your code, but here's a hint:

By default, the dot . in a regex matches everything except a newline. If you want it to match a newline as well, pass Pattern.DOTALL as an argument to Pattern.compile(str, flags).

Another way to match newlines is to use the predefined character class \s, which matches [ \t\n\x0B\f\r]
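A minimal sketch of the difference the flag makes. The class name, method name, and sample text are made up for illustration; only Pattern.compile and Pattern.DOTALL come from the hint above.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns true if "first", anything, then "second" can be found in text.
    // With flags == 0 the dot stops at a line terminator; with DOTALL it does not.
    static boolean crossesLines(String text, int flags) {
        return Pattern.compile("first.*second", flags).matcher(text).find();
    }
}
```

Calling crossesLines("first line\nsecond line", 0) fails to find a match, while the same call with Pattern.DOTALL succeeds, because only then may .* run across the newline.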

Bozho
You may also need the flag Pattern.MULTILINE
M. Jessup
+1  A: 

It sounds like you want the regex to match the entire log sequence, from the timestamp to the end of the last line, including the line separator. Assuming every log sequence but the last one is followed immediately by another log sequence, you should be able to use a lookahead for a timestamp to find the end of the sequence.

Pattern sequencePattern = Pattern.compile(
    "^timestamp.*?(?=timestamp|\\z)",
    Pattern.DOTALL | Pattern.MULTILINE);

If that's not fast or accurate enough, this should work better:

Pattern sequencePattern = Pattern.compile(
    "^timestamp.*+(?:(?:\r\n|[\r\n])(?!timestamp).*+)*+(?:\r\n|[\r\n])?",
    Pattern.MULTILINE);

Of course, I'm assuming you'll replace timestamp with the real timestamp regex. Just out of curiosity, have you considered using Scanner's findWithinHorizon method for this? Seems to me it could save you a lot of work.
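To make the lookahead approach concrete, here is a sketch with a stand-in timestamp. The format "yyyy-MM-dd HH:mm:ss" is only an assumption, since the question does not show the real timestamp regex, and the class and method names are hypothetical. As a small variation on the first pattern, the lookahead is also anchored with ^ so that a timestamp appearing in the middle of a message does not end the sequence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadSequences {
    // Assumed timestamp format, e.g. "2010-05-12 10:15:00"; substitute the real regex.
    static final String TS = "\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}";

    // A timestamp at the start of a line, then lazily everything up to
    // the next line-initial timestamp or the end of the input.
    static final Pattern SEQUENCE = Pattern.compile(
        "^" + TS + ".*?(?=^" + TS + "|\\z)",
        Pattern.DOTALL | Pattern.MULTILINE);

    // Accepts any CharSequence, so a CharBuffer can be passed directly.
    static List<String> sequences(CharSequence log) {
        List<String> out = new ArrayList<>();
        Matcher m = SEQUENCE.matcher(log);
        while (m.find()) {
            out.add(m.group()); // timestamp plus message lines, line separators included
        }
        return out;
    }
}
```

Because each match keeps its line separators, counting line breaks inside every returned sequence, as the question's loop does, still gives the running line number.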

Alan Moore
Thanks Alan, I'm glad you understood the sense of my question, since it looks quite obfuscated even to my own eye... As you suggested, I've dropped the regex in favor of a Scanner; the code is simpler and works fine.
Guillaume
+1  A: 

If I understand your question correctly, you want to split a file using a regular expression, but you can't use Java's built-in split() method. In that case, just write your own split() method.

Iterate over all the regex matches. For the first match, store the timestamp and the ending position of the match. For subsequent matches, take the text between the stored ending position of the previous match and the starting position of the present match and associate that with the previous match. Then store the timestamp and ending position of the present match. After the loop, take the text between the stored ending position of the last match and the end of the file and associate that with the last match.

Using a regex that matches just the timestamps and using a bit of procedural code to get the text between the timestamps will be (far) more efficient than trying to come up with a regex that matches the timestamp and everything up to the next timestamp.
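The procedure described above might look roughly like this. It is a simplified sketch, not a definitive implementation: it remembers the start of each timestamp rather than its end, so every returned sequence includes its timestamp, and it drops any text that precedes the first timestamp. The timestamp format and all names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimestampSplitter {
    // Assumed timestamp format, e.g. "2010-05-12 10:15:00"; substitute the real regex.
    static final Pattern TIMESTAMP = Pattern.compile(
        "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}", Pattern.MULTILINE);

    // Returns each log sequence, timestamp included, in input order.
    // Works on any CharSequence, including a CharBuffer.
    static List<String> split(CharSequence log) {
        List<String> sequences = new ArrayList<>();
        Matcher m = TIMESTAMP.matcher(log);
        int start = -1; // start of the previous timestamp; -1 until one is found
        while (m.find()) {
            if (start >= 0) {
                // Everything from the previous timestamp up to this one is one sequence.
                sequences.add(log.subSequence(start, m.start()).toString());
            }
            start = m.start();
        }
        if (start >= 0) {
            // The last sequence runs to the end of the input.
            sequences.add(log.subSequence(start, log.length()).toString());
        }
        return sequences;
    }
}
```

The regex only ever has to recognize a timestamp, so there is no lazy quantifier or lookahead to backtrack through; the loop does the rest.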

Jan Goyvaerts
Thanks Jan, I was thinking of something like that to solve this problem, but I was hoping for the 'magical regex that will do the job'.
Guillaume
Alan's answer is as close to "magic" as you'll get. But if performance and maintainability are important to you, I'd recommend using the simple "timestamp" regex and letting the procedural code do the work, as I described in my answer. Regexes that do "magic" are the reason why some people think they're evil.
Jan Goyvaerts