I have a log file that looks like the following:

2010-05-12 12:23:45 Some sort of log entry
2010-05-12 01:45:12 Request XML: <RootTag>
<Element>Value</Element>
<Element>Another Value</Element>
</RootTag>
2010-05-12 01:45:32 Response XML: <ResponseRoot>
<Element>Value</Element>
</ResponseRoot>
2010-05-12 01:45:49 Another log entry

What I want to do is extract the Request and Response XML (and ultimately dump them into their own single files). I had a similar parser that used egrep but the XML was all on one line, not multiple ones like above.

The log files are also somewhat large, hitting 500-600 megs a log. Smaller logs I would read in via a PHP script and use regex matching, but the amount of memory required for such a large file would more than likely kill the script.

Is there an easy way using the built-in tools on a Linux box (CentOS in this case) to extract multiple lines or am I going to have to bite the bullet and use Perl or PHP to read in the entire file to extract it?

+1  A: 

Your question implies you're not thinking right; if there's a way to do what you're asking in one language (there is) ... then you can do it in any language.

There's no reason to read the entire log into memory. You just read it line by line and extract the information you want. All you need is to keep track of where you are (outside any tag, inside RootTag, inside ResponseRoot, etc.) and process the data accordingly.
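The line-by-line approach described above can be sketched with awk, which ships with CentOS. This is only a sketch, assuming every log entry starts with a YYYY-MM-DD timestamp and that XML continuation lines never do; extract_xml is an illustrative name, not something from the question.

```shell
#!/bin/sh
# Hypothetical helper (extract_xml is an illustrative name).
# Prints every Request/Response XML fragment from the log file given as $1,
# with a blank line after each fragment. awk holds one line at a time,
# so the size of the log does not matter.
extract_xml() {
    awk '
        # Assumption: a line starting with a timestamp begins a new log entry,
        # which ends any fragment currently in progress.
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] / {
            if (in_xml) print ""
            in_xml = 0
        }
        # A Request/Response line starts a fragment; drop the log prefix.
        /(Request|Response) XML: / {
            sub(/^.*XML: /, "")
            in_xml = 1
        }
        # While inside a fragment, print the (possibly trimmed) line.
        in_xml { print }
    ' "$1"
}
```

Called as extract_xml yourlog > fragments.txt, the only state carried between lines is the single in_xml flag.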

Brian Roach
I'm not saying I don't want to do it in a proper language, I just know that sometimes you can cobble together some nice things with the built-in tools :) If it's not possible then I'll be coding something.
dragonmantank
+2  A: 
# Example usage:
# perl script.pl data.xml RootTag > RootTag.xml

use strict;
use warnings;

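# The last command-line argument is the tag name; the rest are input files.
my $tag = pop;

while (<>){
    # Flip-flop: true from the line with the opening tag through the line
    # with the closing tag. Each substitution also trims text outside the tag.
    if ( s/.*(<$tag>)/$1/ .. s/(<(\/)$tag>).*/$1/ ){
        print;
        # $2 is only set once the closing tag has matched: stop after the first fragment.
        last if $2;
    }
}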

See the perlop docs (under "Range Operators") for details on the flip-flop operator.

FM
Nice, neat, and simple. I extended it to dump each report to a file in case there are multiple XML sets, but this was more than enough to get me started.
dragonmantank
In your case you probably need to write $tag=qr/(?:RootTag|ResponseRoot)/; the qr creates a compiled regexp and the (?:...) means that the parentheses do not capture the match. The qr is not strictly necessary, but the non-capturing group matters here: a capturing group would set $2 as soon as the opening tag matched and trigger the last on the first line. Also, the assumption here is that those 2 tags are only used as roots for their respective XML fragments (which is likely).
mirod
+2  A: 

Sounds like a job for sed (I was so tempted to say SuperSed ;-)

sed -n '/^<.\+>/H; /\(Request\|Response\) XML/{s/^.*</</;x;p}; ${x;p}' xmllog

where xmllog is your log file's name. You'll get a blank line at the beginning, but that can be filtered out with egrep '.+' or even just tail -n +2.

By way of explanation, sed is a little interpreter for programs that consist of a list of matching conditions and corresponding actions. sed runs through a file line by line (hence the name, "stream editor" -> "sed") and for each line, for each condition in the program that matches the text on the line, it applies the corresponding action. In this case:

/^<.\+>/

is a regular expression condition that matches any line which starts with < (the ^ anchors the match to the beginning of the line) followed by any character (.) repeated one or more times (\+) followed by > - basically any line that begins with an XML tag. The associated action is H which appends the line to a "hold buffer". The other condition is

/\(Request\|Response\) XML/

which, of course, is a regexp that matches either Request or Response followed by a space and then XML. The corresponding action is

{s/^.*</</;x;p}

which first does a substitution (s) of the beginning of the line (^) followed by any character (.) repeated any number of times (*) followed by <, with just <. Because .* is greedy, that strips everything before the last < on the line, which in these logs is the start of the only XML tag on it. Then it switches (x) the line just read with the "hold buffer" (which contains the XML of the previous log message) and prints (p) the stuff that was just swapped in from the hold buffer. Finally,

$

matches the last line of the input, and {x;p} again just swaps the contents of the hold buffer into the pattern space and then prints it, so the final XML fragment is not lost.
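To see the pieces working together, here is the command run against the sample log from the question. This is a sketch assuming GNU sed (the \+ and \| escapes are GNU extensions, which is what CentOS provides); sample_log and extract_with_sed are illustrative names only.

```shell
#!/bin/sh
# Sample data copied from the question, written to a temporary file.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
2010-05-12 12:23:45 Some sort of log entry
2010-05-12 01:45:12 Request XML: <RootTag>
<Element>Value</Element>
<Element>Another Value</Element>
</RootTag>
2010-05-12 01:45:32 Response XML: <ResponseRoot>
<Element>Value</Element>
</ResponseRoot>
2010-05-12 01:45:49 Another log entry
EOF

# Hypothetical wrapper around the sed command from the answer (GNU sed assumed).
# tail -n +2 drops the blank line printed when the still-empty hold buffer
# is swapped in for the first time.
extract_with_sed() {
    sed -n '/^<.\+>/H; /\(Request\|Response\) XML/{s/^.*</</;x;p}; ${x;p}' "$1" | tail -n +2
}

extract_with_sed "$sample_log"
```

On the sample this prints the four lines of the RootTag fragment followed directly by the three lines of the ResponseRoot fragment, with the timestamps and log text removed.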

You can alter the command to suit your needs, for example if you need something to delimit the different records, this'll put a blank line between them:

sed -n '/^<.\+>/H; /\(Request\|Response\) XML/{s/^.*</\n</;x;p}; ${x;p}' xmllog

(in that case, of course, don't use egrep to filter out the blank line at the beginning).
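With a blank line between records, getting each fragment into its own file is a short step further. Here is a sketch using awk's paragraph mode (RS set to the empty string); split_records, the output directory argument, and the fragment_NNN.xml naming are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (split_records is an illustrative name).
# $1: file containing blank-line-separated XML fragments
# $2: existing directory that receives one file per fragment
split_records() {
    awk -v dir="$2" '
        # Paragraph mode: records are separated by one or more blank lines,
        # and blank lines at the start of the input are ignored.
        BEGIN { RS = "" }
        {
            file = sprintf("%s/fragment_%03d.xml", dir, NR)
            print > file
            close(file)   # avoid running out of file handles on large logs
        }
    ' "$1"
}
```

For example, redirecting the sed output to a file named fragments.txt and then running split_records fragments.txt outdir would leave fragment_001.xml, fragment_002.xml, and so on in outdir.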

David Zaslavsky
Really, really inventive. I need to learn more sed and awk commands.
dragonmantank
It actually took me a little while reading the manual to piece that together... but sed and awk are really handy things to at least be familiar with. (FWIW my first thought was Perl)
David Zaslavsky