I need to parse a log in the following format:

===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
===== Item 5483/14800 (Update 2/3) =====
This is the item title
Info: some more notes
===== Item 5483/14800 (Update 3/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800  =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8

I only need to extract each item's title (the line right after ===== Item 5484/14800 =====) and its result.
So I need to keep only the line with the item title and the result for that title, and discard everything else.
The issue is that sometimes an item has notes (at most 3) and sometimes the result is displayed without additional notes, which makes it tricky.
Any help would be appreciated. I'm writing the parser in Python, but I don't need the actual code, just some pointers on how I could achieve this.

Later edit: The result I'm looking for is to discard everything else and get something like:

('This is the item title','Foo')
then
('This is this items title','Bar')
+5  A: 
1) Loop through every line in the log

    a) If the line matches the appropriate regex:

      Display/store the next line as the item title.
      Look for the next line containing "Result XXXX."
      and parse out that result to include it in the
      result set.

EDIT: added a bit more now that I see the result you're looking for.
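
The steps above could be sketched in Python roughly as follows. This is only one possible reading of the outline: the names extract, HEADER and RESULT are made up for illustration, and the patterns assume every header starts with "===== Item" and every result line starts with "Test finished." as in the sample log.

```python
import re

# Hypothetical patterns, based only on the sample log in the question.
HEADER = re.compile(r"^===== Item \d+/\d+")
RESULT = re.compile(r"^Test finished\. Result ([^.]+)\.")

def extract(lines):
    """Return (title, result) pairs from an iterable of log lines."""
    pairs = []
    title = None
    lines = iter(lines)
    for line in lines:
        if HEADER.match(line):
            # The title is the line right after the header.
            title = next(lines, "").strip()
            continue
        m = RESULT.match(line)
        if m and title is not None:
            pairs.append((title, m.group(1)))
    return pairs

sample = """===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7"""

for pair in extract(sample.splitlines()):
    print(pair)
```

Sections that only carry an "Info:" note never reach the append, so the repeated "Update" headers do not produce duplicate entries.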

Ian Jacobs
+1 ... the OP asked for it this way "Any help would be appreciated. I'm doing the parser in python but don't need the actual code but some pointing in how could i achive this?"
basszero
Good thing he didn't want code, I don't know python worth a damn :)
Ian Jacobs
A: 

Parsing is not done using regex. If you have reasonably well-structured text (which it looks like you do), you can use faster tests (e.g. line.startswith() or similar). A list of dictionaries seems to be a suitable data type for such key-value pairs. Not sure what else to tell you. This seems pretty trivial.


OK, so the regexp way proved to be more suitable in this case:

import re
re.findall("=\n(.*)\n", s)

is faster than the list comprehension:

[item.split('\n', 1)[0] for item in s.split('=\n')]

Here's what I got:

>>> len(s)
337000000
>>> test(get1, s) #list comprehensions
0:00:04.923529
>>> test(get2, s) #re.findall()
0:00:02.737103

Lesson learned.

Armandas
+1  A: 

Maybe something like (log.log is your file):

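def doOutput(s):  # process or store data
    print(s)

s = ''
with open('log.log') as log:
    for line in log:
        if line.startswith('====='):
            # A header line ends the previous block; flush it if non-empty.
            if s:
                doOutput(s)
                s = ''
        else:
            s += line
# Flush the last block, which has no header after it.
if s:
    doOutput(s)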
RSabet
+1  A: 

I would recommend starting a loop that looks for the "===" in the line. Let that key you off to the Title which is the next line. Set a flag that looks for the results, and if you don't find the results before you hit the next "===", say no results. Else, log the results with the title. Reset your flag and repeat. You could store the results with the Title in a dictionary as well, just store "No Results" when you don't find the results between the Title and the next "===" line.

This looks pretty simple to do based on the output.
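
A rough Python sketch of the flag-and-dictionary idea, for illustration only. The function name collect_results and the variable names are invented here, and it assumes result lines start with "Test finished." as in the sample. Because the "Update" sections repeat the same title, "No Results" is only stored when the title has no entry yet, and a real result overwrites it.

```python
def collect_results(lines):
    """Map each title to its result, or to "No Results" if none was found."""
    results = {}
    title = None
    expect_title = False  # True right after a "=====" line
    found = True          # nothing is pending before the first header
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("====="):
            # Close the previous section before starting a new one.
            if not found and title is not None:
                results.setdefault(title, "No Results")
            title, expect_title, found = None, True, False
        elif expect_title:
            title, expect_title = line, False
        elif line.startswith("Test finished.") and "Result " in line:
            # Take the text between "Result " and the next full stop.
            results[title] = line.split("Result ", 1)[1].split(".", 1)[0]
            found = True
    # The last section has no header after it, so close it here.
    if not found and title is not None:
        results.setdefault(title, "No Results")
    return results

log = """===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This title never gets a result
Info: some note"""

print(collect_results(log.splitlines()))
```

Keying the dictionary by title only works if titles are unique across items; keying by item number would be safer if they are not.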

DoxaLogos
A: 

You could try something like this (in C-like pseudocode, since I don't know Python):

string line=getline();
regex boundary="^===== [^=]+ =====$";
regex info="^Info: (.*)$";
regex test_data="Test ([^.]*)\. Result ([^.]*)\. Time ([^.]*)\.$";
regex stats="Stats: (.*)$";
while(!eof())
{
  // sanity check
  test line against boundary, if they don't match, throw exception

  string title=getline();

  while(1)
  {  
    // end the loop if we finished the data
    if(eof()) break;

    line=getline();
    test line against boundary, if they match, break
    test line against info, if they match, load the first matched group into "info"
    test line against test_data, if they match, load the first matched group into "test_result", load the 2nd matched group into "result", load the 3rd matched group into "time"
    test line against stats, if they match, load the first matched group into "statistics"
  }

  // at this point you can use the variables set above to do whatever with a line
  // for example, you want to use title and, if set, test_result/result/time.

}
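
For readers who want to try this in Python, the pseudocode above might translate into something like the following sketch. It is an interpretation, not a line-by-line port: parse_records and the dictionary keys are names chosen here, the boundary pattern uses five "=" signs to match the sample log, and the loop is flattened into a single pass with a generator.

```python
import re

BOUNDARY = re.compile(r"^===== [^=]+ =====$")
INFO = re.compile(r"^Info: (.*)$")
TEST_DATA = re.compile(r"^Test ([^.]*)\. Result ([^.]*)\. Time ([^.]*)\.$")
STATS = re.compile(r"^Stats: (.*)$")

def parse_records(lines):
    """Yield one dict per '=====' section of the log."""
    record = None
    want_title = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if BOUNDARY.match(line):
            # A boundary closes the previous section and opens a new one.
            if record is not None:
                yield record
            record, want_title = {}, True
            continue
        if record is None:
            # Sanity check: the log must start with a boundary line.
            raise ValueError("expected a boundary line, got %r" % line)
        if want_title:
            record["title"], want_title = line, False
            continue
        m = INFO.match(line)
        if m:
            record["info"] = m.group(1)
        m = TEST_DATA.match(line)
        if m:
            record["test_result"], record["result"], record["time"] = m.groups()
        m = STATS.match(line)
        if m:
            record["statistics"] = m.group(1)
    if record is not None:
        yield record

log = """===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7"""

# Keep only the sections that actually carry a result.
pairs = [(r["title"], r["result"])
         for r in parse_records(log.splitlines()) if "result" in r]
print(pairs)
```

Sections without a "Test finished." line still produce a record, just without the "result" key, which is why the final list comprehension filters on it.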
Blindy
A: 

Here's some not-so-good-looking Perl code that does the job. Perhaps you can find it useful in some way. It's a quick hack, and there are other ways of doing it (I feel that this code needs defending).

#!/usr/bin/perl -w
#
# $Id$
#

use strict;
use warnings;

my @ITEMS;
my $item;
my $state = 0;

open(FD, "< data.txt") or die "Failed to open file.";
while (my $line = <FD>) {
    $line =~ s/(\r|\n)//g;
    if ($line =~ /^===== Item (\d+)\/\d+/) {
        my $item_number = $1;
        if ($item) {
            # Just to make sure we don't have two lines that seem to be a headline in a row.
            # If we have an item but haven't set the title it means that there are two in a row that match.
            die "Something seems to be wrong, better safe than sorry. Line $. : $line\n" if (not $item->{title});
            # If we have a new item number, add the previous item and create a new one.
            if ($item_number != $item->{item_number}) {
                push(@ITEMS, $item);
                $item = {};
                $item->{item_number} = $item_number;
            }
        } else {
            # First entry, don't have an item.
            $item = {}; # Create new item.
            $item->{item_number} = $item_number;
        }
        $state = 1;
    } elsif ($state == 1) {
        die "Data must start with a headline." if (not $item);
        # If we already have a title make sure it matches.
        if ($item->{title}) {
            if ($item->{title} ne $line) {
                die "Title doesn't match for item " . $item->{item_number} . ", line $. : $line\n";
            }
        } else {
            $item->{title} = $line;
        }
        $state++;
    } elsif (($state == 2) && ($line =~ /^Info:/)) {
        # Just make sure that for state 2 we have a line that match Info.
        $state++;
    } elsif (($state == 3) && ($line =~ /^Test finished\. Result ([^.]+)\. Time \d+ secunds{0,1}\.$/)) {
        $item->{status} = $1;
        $state++;
    } elsif (($state == 4) && ($line =~ /^Stats:/)) {
        $state++; # After Stats we must have a new item or we should fail.
    } else {
        die "Invalid data, line $.: $line\n";
    }
}
# Need to take care of the last item too.
push(@ITEMS, $item) if ($item);
close FD;

# Loop our items and print the info we stored.
for $item (@ITEMS) {
    print $item->{item_number} . " (" . $item->{status} . ") " . $item->{title} . "\n";
}
Johan Soderberg
+5  A: 

I know you didn't ask for real code but this is too great an opportunity for a generator-based text muncher to pass up:

# data is a multiline string containing your log, but this
# function could be easily rewritten to accept a file handle.
def get_stats(data):

   title = ""
   grab_title = False

   for line in data.split('\n'):
      if line.startswith("====="):
         grab_title = True
      elif grab_title:
         grab_title = False
         title = line
      elif line.startswith("Test finished."):
         start = line.index("Result") + 7
         end   = line.index("Time")   - 2
         yield (title, line[start:end])


for d in get_stats(data):
   print(d)


# Prints:
# ('This is the item title', 'Foo')
# ('This is this items title', 'Bar')
# ('This is the title of this item', 'FooBar')

Hopefully this is straightforward enough. Do ask if you have questions on how exactly the above works.

Triptych
+1  A: 

Regular expression with group matching seems to do the job in python:

import re

data = """===== Item 5483/14800  =====
This is the item title
Info: some note
===== Item 5483/14800 (Update 1/3) =====
This is the item title
Info: some other note
===== Item 5483/14800 (Update 2/3) =====
This is the item title
Info: some more notes
===== Item 5483/14800 (Update 3/3) =====
This is the item title
Info: some other note
Test finished. Result Foo. Time 12 secunds.
Stats: CPU 0.5 MEM 5.3
===== Item 5484/14800  =====
This is this items title
Info: some note
Test finished. Result Bar. Time 4 secunds.
Stats: CPU 0.9 MEM 4.7
===== Item 5485/14800  =====
This is the title of this item
Info: some note
Test finished. Result FooBar. Time 7 secunds.
Stats: CPU 2.5 MEM 2.8"""


p = re.compile(r"^=====[^=]*=====\n(.*)$\nInfo: .*\n.*Result ([^.]*)\.",
               re.MULTILINE)
for m in p.finditer(data):
    print("title:", m.group(1), "result:", m.group(2))

If you need more info about regular expressions, check the Python docs.

maciejka
Nice use of multiline. The only issue is that it doesn't scale very well (you need to keep the entire file in memory at once).
Triptych
How about if he used itertools.groupby to go through the items?
oylenshpeegul
It is more a suggestion than a complete solution. It will scale better if it reads into a buffer until it encounters a line beginning with '====='. The buffer can then be parsed with the regex above.
maciejka
+1  A: 

This is sort of a continuation of maciejka's solution (see the comments there). If the data is in the file daniels.log, then we could go through it item by item with itertools.groupby, and apply a multi-line regexp to each item. This should scale fine.

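import itertools, re

p = re.compile(r"Result ([^.]*)\.", re.MULTILINE)
with open('daniels.log') as log:
    # groupby alternates between runs of header lines and runs of body lines.
    for sep, item in itertools.groupby(log,
                                       lambda x: x.startswith('===== Item ')):
        if not sep:
            # The first line of each body run is the title.
            title = next(item).strip()
            m = p.search(''.join(item))
            if m:
                print((title, m.group(1)))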
oylenshpeegul