I have a bunch of XML files that are about 1-2 megabytes in size. Actually, more than a bunch: there are millions. They're all well-formed, and many are even validated against their schema (confirmed with libxml2).

All were created by the same app, so they're in a consistent format (though this could theoretically change in the future).

I want to check the values of one element in each file from within a Perl script. Speed is important (I'd like to take less than a second per file) and as noted I already know the files are well-formed.

I am sorely tempted to simply 'open' the files in Perl and scan through until I see the element I am looking for, grab the value (which is near the start of the file), and close the file.

On the other hand, I could use an XML parser (which might protect me from future changes to the XML formatting) but I suspect it will be slower than I'd like.

Can anyone recommend an appropriate approach and/or parser?

Thanks in advance.

Update

Here's the structure/complexity of the data I am trying to pull out:

<doc>
  ...
  <someparentnode attrib="notme" attrib2="5">
    <node>Not this one</node>
  </someparentnode>
  <someparentnode attrib="pickme" attrib2="5">
    <node>This is the data I want</node>
  </someparentnode>
  <someparentnode attrib="notme" 
     attrib2="reallyreallylonglineslikethisonearewrapped">
    <node>Not this one either and it may be 
      wrapped too.</node>
  </someparentnode>
  ...    
</doc>

The hierarchy goes several levels deeper than that, but I think that covers the sorts of things I am trying to do.
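
For reference, the hand-rolled scan I'm tempted by would be something along these lines (a rough, untested sketch that slurps each file and matches against the element and attribute names in the sample above):

#!/usr/bin/perl
use strict;
use warnings;

# Rough sketch only: slurp each file and grab the <node> text from the
# <someparentnode> whose attrib is "pickme". Slurping sidesteps the
# wrapped-line problem, since the regex can match across newlines.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    my $xml = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    if ($xml =~ m{<someparentnode[^>]*\battrib="pickme"[^>]*>.*?<node>(.*?)</node>}s) {
        print "$file: $1\n";
    }
}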

A: 

Awk

awk 'BEGIN{
 RS="</doc>"
 FS="</someparentnode>"
}

{
  for(i=1;i<=NF;i++){
     if( $i~/pickme/){
        m=split($i,a,"</node>")
        for(o=1;o<=m;o++){
          if(a[o]~/<node>/){
            gsub(/.*<node>/,"",a[o])
            print a[o]
          }
        }
     }
  }
}' file

Perl

#!/usr/bin/perl
$/ = '</doc>';
$FS = '</someparentnode>';
while (<>) {
    chomp;
    @F = split $FS;
    for ($i=0;$i<=$#F; $i++) {
        if ($F[$i] =~ /pickme/) {
            $M=(@a=split('</node>', $F[$i]));
            for ($o=0; $o<$M; $o++) {
                if ($a[$o]=~/<node>/) {
                    $a[$o] =~ s/.*<node>//sg;
                    print $a[$o], "\n";
                }
            }
        }
    }
}

output

$ perl script.pl file
This is the data I want

$ ./shell.sh
This is the data I want
ghostdog74
There is a problem, which is that the tag in question is nested and repeated, and I have to pick the right instance of the element based on an attribute of its parent element (on a previous line), so that probably won't work. Sadly.
Anon Guy
Then show some examples of your XML file and the things you want to get.
ghostdog74
The issues mentioned by Anon Guy are exactly why you do not parse XML with regular expressions.
Svante
Wrong. The OP's data is well formatted, which means there's a structure. That's why you CAN use regex.
ghostdog74
Example added above. It's actually the way lines are wrapped, rather than the XML structure, that is causing me to pause before jumping in and coding a solution with grep/regex.
Anon Guy
+7  A: 

2 stand-alone XML-aware options (which I wrote, so I might be biased ;--) are xml_grep (included with XML::Twig) and xml_grep2 (in App::xml_grep2).

You would write xml_grep -t '*[@attrib="pickme"]' *.xml or xml_grep2 -t '//*[@attrib="pickme"]' *.xml (the -t option gives you the result as text instead of XML). In both cases, all of the documents will be parsed in full, but the next version of xml_grep will add an option to limit the number of results per file and to stop parsing each file as soon as that number is reached.

Otherwise, if you need speed and the code needs to be integrated, you can use XML::Twig, with a handler triggered on the element(s) you want, and a call to finish_now when you've found it, which will abort parsing and let you go on to the next file.

XML::LibXML is also an option, although you will then have to either parse each document completely and use XPath (easy, but might be slower), use SAX (may be faster, but painful to code), or use the pull parser (probably the best option, but I have never used it).
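
For the first route, a minimal XPath sketch (using the element and attribute names from the sample in the question) could look like this:

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

# parse each document completely, then pick out the wanted element with XPath
my $parser= XML::LibXML->new;

foreach my $file (@ARGV)
  { my $doc= $parser->parse_file( $file);
    foreach my $node ($doc->findnodes( '//someparentnode[@attrib="pickme"]/node'))
      { print $node->textContent, "\n"; }
  }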

Update after your update: the code with XML::Twig would look like this:

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my $twig= XML::Twig->new( twig_handlers => { '*[@attrib="pickme"]' => \&pickme });

foreach my $file (@ARGV)
  { $twig->parsefile( $file); }

sub pickme
  { my( $twig, $node)= @_;
    print $node->text, "\n";
    $twig->finish_now;
  }
mirod
I think this will do; the only problem I have is that CSW on Solaris comes with XML::Twig 1.13 from January 2003 (!), and that doesn't support the finish_now call. I'll try with a simple $twig->finish and upgrade the module if that's not fast enough. Thank you.
Anon Guy
If you don't have finish_now, you can always simulate it: wrap the call to parsefile in an eval (eval { $twig->parsefile( $file); }) and die in the handler. To tell your early exit apart from real errors, die "found pickme" in the handler and check that $@ starts with "found pickme" after the eval.
mirod
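
A minimal sketch of that eval/die workaround, adapting the script from the answer above (the "found pickme" marker comes from the comment; creating a fresh twig per file is an extra precaution so an aborted parse leaves no stale state):

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

foreach my $file (@ARGV)
  { # a fresh twig per file, so an aborted parse leaves no stale state behind
    my $twig= XML::Twig->new( twig_handlers => { '*[@attrib="pickme"]' => \&pickme });
    eval { $twig->parsefile( $file); };
    # "found pickme" marks our deliberate early exit; anything else is a real error
    die $@ if $@ && $@ !~ /^found pickme/;
  }

sub pickme
  { my( $twig, $node)= @_;
    print $node->text, "\n";
    die "found pickme\n";   # abort parsing, instead of finish_now
  }
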
A: 

If you want to do it fast, I would recommend you use XML::Bare instead of XML::Simple or XML::Twig.

I'm using it to parse through several 2-5Mb XML files and the speedup is amazing: 0.2 seconds vs 4 minutes, in some cases. Details here: http://darkpan.com/files/xml-parsing-perl-gripes.txt.
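
Access is roughly like this (a sketch only, based on XML::Bare's documented hash-tree layout and the sample structure from the question; adjust the keys to the real documents):

#!/usr/bin/perl
use strict;
use warnings;

use XML::Bare;

foreach my $file (@ARGV) {
    my $bare = XML::Bare->new( file => $file );
    my $root = $bare->parse;

    # repeated elements come back as an array ref, a single one as a hash ref
    my $parents = $root->{doc}{someparentnode};
    $parents = [ $parents ] if ref $parents eq 'HASH';

    foreach my $parent (@$parents) {
        next unless $parent->{attrib} && $parent->{attrib}{value} eq 'pickme';
        print $parent->{node}{value}, "\n";
    }
}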

mfontani
In the document you reference, which parser were you using with XML::Simple? By default it uses XML::SAX::PurePerl which is indeed very slow, but you can make it use XML::Parser or XML::LibXML as its SAX parser, and you should see a big improvement.
mirod
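
For instance, XML::Simple documents a $XML::Simple::PREFERRED_PARSER variable (and an XML_SIMPLE_PREFERRED_PARSER environment variable) for picking the SAX parser:

use XML::Simple;

# Ask XML::Simple for a faster SAX parser than the default XML::SAX::PurePerl
# (documented PREFERRED_PARSER setting; XML::LibXML's SAX driver also works).
$XML::Simple::PREFERRED_PARSER = 'XML::Parser';

my $data = XMLin('file.xml');
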
Updated link as per suggestion by mirod: still 0.2 seconds vs either 30+ or 15+ seconds using XML::LibXML or XML::Parser; thanks for the downvotes :)
mfontani
Two things: XML::Bare is not really an XML parser, so I am not sure how much of an improvement it is over pure regexps in terms of future-proofing the code. And in your benchmark, using simplify in XML::Twig is probably not the fastest way to get the results you want. It's hard to tell without the XML. If you publish a benchmark, it would be nice to make enough data available for others to reproduce and improve the results.
mirod