ansaurus

Question

How to extract block of XML from a log file on Linux

Answer 1

+1 A:

Your question implies you're not thinking right; if there's a way to do what you're asking in one language (there is) ... then you can do it in any language.

There's no reason to read the entire log into memory. You just read it line by line and extract the information you want. You just need to keep a state as to where you are (not in tag, inside RootTag, inside ResponseRoot, etc) and process the data as you wish.
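As a rough sketch of that line-by-line state machine, here is one way it might look with awk. The function name extract_block, the sample log layout, and the Child and Status elements are assumptions made up for illustration; only the RootTag and ResponseRoot names come from this thread.

```shell
#!/bin/sh
# Sketch only: assumes each XML block starts on a line containing <Tag>
# and ends on a line containing </Tag>. extract_block is a hypothetical name.
extract_block() {
    # $1 = root tag name, $2 = log file
    awk -v tag="$1" '
        # State "outside": look for the opening tag and drop any log prefix.
        !inside && index($0, "<" tag ">") {
            inside = 1
            $0 = substr($0, index($0, "<" tag ">"))
        }
        # State "inside": emit the line (text after a closing tag is kept).
        inside { print }
        # Closing tag seen: go back to the "outside" state.
        inside && index($0, "</" tag ">") { inside = 0 }
    ' "$2"
}

# Hypothetical sample log to try it on.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2010-05-14 02:00:01 INFO Request XML: <RootTag>
<Child>a</Child>
</RootTag>
2010-05-14 02:00:02 INFO processing
2010-05-14 02:00:03 INFO Response XML: <ResponseRoot>
<Status>ok</Status>
</ResponseRoot>
EOF

extract_block RootTag "$sample"
rm -f "$sample"
```

Only one line is held at a time, so the size of the log doesn't matter. Something like extract_block ResponseRoot your.log > ResponseRoot.xml would pull out the other fragment.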

Brian Roach 2010-05-14 02:08:14

I'm not saying I don't want to do it in a proper language, I just know that sometimes you can cobble together some nice things with the built-in tools :) If it's not possible then I'll be coding something.

dragonmantank 2010-05-14 02:47:47

Answer 2

+2 A:

# Example usage:
# perl script.pl data.xml RootTag > RootTag.xml

use strict;
use warnings;

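# The tag name is the last command-line argument; whatever remains
# in @ARGV is read as input by <>.
my $tag = pop;

while (<>){
    # Flip-flop: true from the line holding <$tag> through the line
    # holding </$tag>; the substitutions trim text outside the tags.
    if ( s/.*(<$tag>)/$1/ .. s/(<(\/)$tag>).*/$1/ ){
        print;
        last if $2;   # $2 is "/" only once the closing tag has matched
    }
}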

See the docs for details on the flip-flop operator.

FM 2010-05-14 02:48:31

Nice, neat, and simple. I extended it to dump each report to a file in case there are multiple XML sets, but this was more than enough to get me started.

dragonmantank 2010-05-14 03:35:44

In your case you probably need to write $tag=qr/(?:RootTag|ResponseRoot)/; the qr compiles a regexp and the (?:...) means that the parentheses group without capturing, so the alternation stays together inside <...> and the numbering of $1 and $2 in the script is unchanged. Either one alone would keep the alternation grouped, but it's probably good practice to use both. Also, the assumption here is that those 2 tags are only used as roots for their respective XML fragments (which is likely).

mirod 2010-05-14 07:48:54

Answer 3

+2 A:

Sounds like a job for sed (I was so tempted to say SuperSed ;-)

sed -n '/^<.\+>/H; /\(Request\|Response\) XML/{s/^.*</</;x;p}; ${x;p}' xmllog

where xmllog is your log file's name. You'll get a blank line at the beginning, but that can be filtered out with egrep '.+' or even just tail -n +2.

By way of explanation, sed is a little interpreter for programs that consist of a list of matching conditions and corresponding actions. sed runs through a file line by line (hence the name, "stream editor" -> "sed") and for each line, for each condition in the program that matches the text on the line, it applies the corresponding action. In this case:

/^<.\+>/

is a regular expression condition that matches any line which begins with < (the ^ anchors the match to the start of the line) followed by any character (.) repeated one or more times (\+) followed by > - basically any line that starts with an XML tag. The associated action is H which appends the line to a "hold buffer". The other condition is

/\(Request\|Response\) XML/

which, of course, is a regexp that matches either Request or Response followed by a space and then XML. The corresponding action is

{s/^.*</</;x;p}

which first does a substitution (s) of the beginning of the line (^) followed by any character (.) repeated any number of times (*) followed by <, with just <. Because .* is greedy, that removes everything up to the last < on the line, so it gets rid of the log prefix as long as the line holds a single XML tag. Then it switches (x) the line just read with the "hold buffer" (which contains the XML of the previous log message) and prints (p) the stuff that was just swapped in from the hold buffer. Finally,

$

matches the last line of the input, and {x;p} again just swaps the contents of the hold buffer into the "print buffer" and then prints it.

You can alter the command to suit your needs, for example if you need something to delimit the different records, this'll put a blank line between them:

sed -n '/^<.\+>/H; /\(Request\|Response\) XML/{s/^.*</\n</;x;p}; ${x;p}' xmllog

(in that case, of course, don't use egrep to filter out the blank line at the beginning).
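To see the hold-buffer dance in action, here is a small sketch that runs the first command against a made-up log. The log layout, the timestamps, and the Child and Status elements are assumptions for illustration, and extract_all is a hypothetical wrapper name. Note that \+ and \| are GNU sed extensions, which is what you get on Linux.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed command from this answer;
# tail -n +2 drops the blank first line that the first x;p prints.
extract_all() {
    sed -n '/^<.\+>/H; /\(Request\|Response\) XML/{s/^.*</</;x;p}; ${x;p}' "$1" |
        tail -n +2
}

# Made-up log: each message starts on a "Request XML" or "Response XML"
# line and continues on lines that begin with a tag.
xmllog=$(mktemp)
cat > "$xmllog" <<'EOF'
2010-05-14 02:00:01 INFO Request XML: <RootTag>
<Child>a</Child>
</RootTag>
2010-05-14 02:00:02 INFO processing
2010-05-14 02:00:03 INFO Response XML: <ResponseRoot>
<Status>ok</Status>
</ResponseRoot>
EOF

extract_all "$xmllog"
rm -f "$xmllog"
```

With that input the two fragments come out back to back, each starting at its root tag; the "processing" line is dropped because it matches neither condition.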

David Zaslavsky 2010-05-14 03:18:08

Really, really inventive. I need to learn more sed and awk commands.

dragonmantank 2010-05-14 03:36:17

It actually took me a little while reading the manual to piece that together... but sed and awk are really handy things to at least be familiar with. (FWIW my first thought was Perl)

David Zaslavsky 2010-05-14 06:26:38
