views:

47

answers:

1

I'm trying to extract data from log files in XML format. As these are huge, I am using XML::Twig to extract the relevant data from a buffer instead of the whole file(s).

As this is concatenated data from STDIN, the XML is far from well formed, so the parser frequently stops with an error. How can I get the XML parser to ignore the errors and only extract the tags I am interested in? Do I have to fall back to regular-expression parsing (start tag to end tag)?

+3  A: 

I would actually just accumulate the data between <message></message> tags and then parse that string, assuming the content of each message is small:

#!/usr/bin/perl

use strict; use warnings;

use XML::Simple;
use Data::Dumper;

my $message;

LOGENTRY:
while ( my $line = <DATA> ) {
    # The flip-flop operator is true from the line that opens a
    # <message> element through the line that closes it, inclusive.
    if ( $line =~ /^<message/ .. $line =~ m{</message>$} ) {
        $message .= $line;
        next LOGENTRY;
    }
    # First line past a message: parse what we have collected so far.
    if ( $message ) {
        process_message($message);
        $message = '';
    }
}

# Flush a message that ends on the very last line of input.
process_message($message) if $message;

sub process_message {
    my ($message) = @_;

    my $xml = XMLin(
        $message,
        ForceArray => 1,
    );
    print Dumper $xml;
}

__DATA__
ldksj
lskdfj
lksd

sdfk

<message sender="1">Hi</message>

sdk
dkj

<message sender="2">Hi yourself!</message>

sd

Output:

$VAR1 = {
          'sender' => '1',
          'content' => 'Hi'
        };
$VAR1 = {
          'sender' => '2',
          'content' => 'Hi yourself!'
        };
Sinan Ünür
Thanks for the (very elegant!) suggested solution. However, I am afraid that each <message> element will not necessarily sit on its own line; a new <message> can start on the same line, etc., which makes it (slightly) harder to parse! I was hoping XML::Twig would save me the work of setting it all up, maintaining buffers, etc.
goorj
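If elements can share a line or span several lines, one option is to drop the line-by-line flip-flop and instead match complete <message>…</message> spans with a single non-greedy, multi-line regex over a buffer. A minimal sketch (the sample input below is invented for illustration; a real log would be buffered from STDIN rather than slurped):

```perl
#!/usr/bin/perl
use strict; use warnings;

# Slurp the input into one string. The /s modifier lets . cross
# newlines, so a <message> element may span lines or share a line
# with surrounding junk and other elements.
my $log = do { local $/; <DATA> };

while ( $log =~ m{<message\b[^>]*>(.*?)</message>}gs ) {
    my $content = $1;
    print "got: $content\n";
}

__DATA__
junk <message sender="1">Hi</message> noise <message
sender="2">Hi
yourself!</message> trailing junk
```

Each matched span could then be handed to XML::Simple or XML::Twig for proper parsing, as in the answer above; the regex only carves well-formed fragments out of the surrounding noise.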