I have a bunch of XML files that are about 1-2 megabytes in size. Actually, more than a bunch: there are millions. They're all well-formed, and many are even validated against their schema (confirmed with libxml2).

All were created by the same app, so they're in a consistent format (though this could theoretically change in the future).

I want to check the values of one element in each file from within a Perl script. Speed is important (I'd like to take less than a second per file) and as noted I already know the files are well-formed.

I am sorely tempted to simply 'open' the files in Perl and scan through until I see the element I am looking for, grab the value (which is near the start of the file), and close the file.

On the other hand, I could use an XML parser (which might protect me from future changes to the XML formatting) but I suspect it will be slower than I'd like.

Can anyone recommend an appropriate approach and/or parser?

Thanks in advance.

Update

Here's the structure/complexity of the data I am trying to pull out:

<doc>
  ...
  <someparentnode attrib="notme" attrib2="5">
    <node>Not this one</node>
  </someparentnode>
  <someparentnode attrib="pickme" attrib2="5">
    <node>This is the data I want</node>
  </someparentnode>
  <someparentnode attrib="notme" 
     attrib2="reallyreallylonglineslikethisonearewrapped">
    <node>Not this one either and it may be 
      wrapped too.</node>
  </someparentnode>
  ...    
</doc>

The hierarchy goes several levels deeper than that, but I think that covers the sorts of things I am trying to do.
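
For reference, the hand-rolled scan I'm tempted by would be something along these lines (a rough, untested sketch that slurps each file and matches against the element and attribute names in the sample above):

#!/usr/bin/perl
use strict;
use warnings;

# Rough sketch only: slurp each file and grab the <node> text from the
# <someparentnode> whose attrib is "pickme". Slurping sidesteps the
# wrapped-line problem, since the regex can match across newlines.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    my $xml = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    if ($xml =~ m{<someparentnode[^>]*\battrib="pickme"[^>]*>.*?<node>(.*?)</node>}s) {
        print "$file: $1\n";
    }
}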

A: 

Awk

awk 'BEGIN{
 RS="</doc>"
 FS="</someparentnode>"
}

{
  for(i=1;i<=NF;i++){
     if( $i~/pickme/){
        m=split($i,a,"</node>")
        for(o=1;o<=m;o++){
          if(a[o]~/<node>/){
            gsub(/.*<node>/,"",a[o])
            print a[o]
          }
        }
     }
  }
}' file

Perl

#!/usr/bin/perl
$/ = '</doc>';
$FS = '</someparentnode>';
while (<>) {
    chomp;
    @F = split $FS;
    for ($i=0;$i<=$#F; $i++) {
        if ($F[$i] =~ /pickme/) {
            $M=(@a=split('</node>', $F[$i]));
            for ($o=0; $o<$M; $o++) {
                if ($a[$o]=~/<node>/) {
                    $a[$o] =~ s/.*<node>//sg;
                    print $a[$o], "\n";
                }
            }
        }
    }
}

output

$ perl script.pl file
This is the data I want

$ ./shell.sh
This is the data I want
ghostdog74
There is a problem, which is that the tag in question is nested and repeated, and I have to pick the right instance of the element based on an attribute of its parent element (on a previous line), so that probably won't work. Sadly.
Anon Guy
Then show some examples of your XML file and the things you want to get.
ghostdog74
The issues mentioned by Anon Guy are exactly why you do not parse XML with regular expressions.
Svante
Wrong. The OP's data is well formatted, which means there's a structure. That's why you CAN use regex.
ghostdog74
Example added above. It's actually the way lines are wrapped, rather than the XML structure, that is causing me to pause before jumping in and coding a solution with grep/regex.
Anon Guy
+7  A: 

2 stand-alone XML-aware options (which I wrote, so I might be biased ;--) are xml_grep (included with XML::Twig) and xml_grep2 (in App::xml_grep2).

You would write xml_grep -t '*[@attrib="pickme"]' *.xml or xml_grep2 -t '//*[@attrib="pickme"]' *.xml (the -t option gives you the result as text instead of XML). In both cases, all of the documents will be parsed in full, but the next version of xml_grep will add an option to limit the number of results per file and to stop parsing each file as soon as that number is reached.

Otherwise, if you need speed and the code needs to be integrated, you can use XML::Twig, with a handler triggered on the element(s) you want, and a call to finish_now when you've found it, which will abort parsing and let you go on to the next file.

XML::LibXML is also an option, although you will then have to either parse each document completely and use XPath (easy, but might be slower), use SAX (may be faster, but painful to code), or use the pull parser (probably the best option, but I have never used it).
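
For the first route, a minimal XPath sketch (using the element and attribute names from the sample in the question) could look like this:

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

# parse each document completely, then pick out the wanted element with XPath
my $parser= XML::LibXML->new;

foreach my $file (@ARGV)
  { my $doc= $parser->parse_file( $file);
    foreach my $node ($doc->findnodes( '//someparentnode[@attrib="pickme"]/node'))
      { print $node->textContent, "\n"; }
  }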

Update after your update: the code with XML::Twig would look like this:

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my $twig= XML::Twig->new( twig_handlers => { '*[@attrib="pickme"]' => \&pickme });

foreach my $file (@ARGV)
  { $twig->parsefile( $file); }

sub pickme
  { my( $twig, $node)= @_;
    print $node->text, "\n";
    $twig->finish_now;
  }
mirod
I think this will do; the only problem I have is that CSW on Solaris comes with XML::Twig 1.13 from January 2003 (!), and that doesn't support the finish_now call. I'll try with a simple $twig->finish and upgrade the module if that's not fast enough. Thank you.
Anon Guy
If you don't have finish_now, you can always simulate it: wrap the call to parsefile in an eval (eval { $twig->parsefile( $file); }) and die in the handler. To tell your early exit apart from real errors, die "found pickme" in the handler and check that $@ starts with "found pickme" after the eval.
mirod
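
A minimal sketch of that eval/die workaround, adapting the script from the answer above (the "found pickme" marker comes from the comment; creating a fresh twig per file is an extra precaution so an aborted parse leaves no stale state):

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

foreach my $file (@ARGV)
  { # a fresh twig per file, so an aborted parse leaves no stale state behind
    my $twig= XML::Twig->new( twig_handlers => { '*[@attrib="pickme"]' => \&pickme });
    eval { $twig->parsefile( $file); };
    # "found pickme" marks our deliberate early exit; anything else is a real error
    die $@ if $@ && $@ !~ /^found pickme/;
  }

sub pickme
  { my( $twig, $node)= @_;
    print $node->text, "\n";
    die "found pickme\n";   # abort parsing, instead of finish_now
  }
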
A: 

If you want to do it fast, I would recommend you use XML::Bare instead of XML::Simple or XML::Twig.

I'm using it to parse through several 2-5Mb XML files and the speedup is amazing: 0.2 seconds vs 4 minutes, in some cases. Details here: http://darkpan.com/files/xml-parsing-perl-gripes.txt.
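
Access is roughly like this (a sketch only, based on XML::Bare's documented hash-tree layout and the sample structure from the question; adjust the keys to the real documents):

#!/usr/bin/perl
use strict;
use warnings;

use XML::Bare;

foreach my $file (@ARGV) {
    my $bare = XML::Bare->new( file => $file );
    my $root = $bare->parse;

    # repeated elements come back as an array ref, a single one as a hash ref
    my $parents = $root->{doc}{someparentnode};
    $parents = [ $parents ] if ref $parents eq 'HASH';

    foreach my $parent (@$parents) {
        next unless $parent->{attrib} && $parent->{attrib}{value} eq 'pickme';
        print $parent->{node}{value}, "\n";
    }
}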

mfontani
In the document you reference, which parser were you using with XML::Simple? By default it uses XML::SAX::PurePerl which is indeed very slow, but you can make it use XML::Parser or XML::LibXML as its SAX parser, and you should see a big improvement.
mirod
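
For instance, XML::Simple documents a $XML::Simple::PREFERRED_PARSER variable (and an XML_SIMPLE_PREFERRED_PARSER environment variable) for picking the SAX parser:

use XML::Simple;

# Ask XML::Simple for a faster SAX parser than the default XML::SAX::PurePerl
# (documented PREFERRED_PARSER setting; XML::LibXML's SAX driver also works).
$XML::Simple::PREFERRED_PARSER = 'XML::Parser';

my $data = XMLin('file.xml');
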
Updated link as per suggestion by mirod: still 0.2 seconds vs either 30+ or 15+ seconds using XML::LibXML or XML::Parser; thanks for the downvotes :)
mfontani
Two things: XML::Bare is not really an XML parser, so I am not sure how much of an improvement it is over pure regexps in terms of future-proofing the code. And in your benchmark, using simplify in XML::Twig is probably not the fastest way to get the results you want. It's hard to tell without the XML. If you publish a benchmark, it would be nice to make enough data available for others to reproduce and improve the results.
mirod