ansaurus

Question

In Perl, how can I parse an XML file that is too large to fit in available memory?

Answer 1

+1 A:

Try the XML::Parser module. Should be what you need.


KLee1 2010-07-08 23:31:36

Great link, thanks.

Paul Tomblin 2010-07-08 23:59:17

It's generally not recommended to write new code against the XML::Parser API. XML::SAX is a similar API but allows the use of objects for tracking the state of the parse (rather than globals) and works with a number of different parser libraries. http://perl-xml.sourceforge.net/perl-sax/

Grant McLean 2010-07-09 21:08:01

Answer 2

+1 A:

You should use a streaming parser, such as XML::Parser (which in turn is a layer above expat). You will have to register handlers for the tags you are interested in, and do the book-keeping yourself. As with other streaming models, such as SAX, you do not get a whole view of the file at once (except for the subset you explicitly consume in your code).

Yann Ramin 2010-07-08 23:32:13

The problem is I don't even know what tags exist. That's why I want to write the program.

Paul Tomblin 2010-07-08 23:37:28

You can use the start and end handlers in XML::Parser to see what tags are where.

KLee1 2010-07-08 23:38:39

Answer 3

+2 A:

You want to use a SAX parser such as XML::SAX. Implement start_element and end_element methods to build your node tree.
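As a rough sketch of what such a handler might look like: rather than building a full tree (which could itself outgrow memory), this version only records each distinct element path and how often it occurs. The class name TagCounter and the hash keys path and seen are made up for illustration; XML::SAX::Base, XML::SAX::ParserFactory and parse_string come from the XML::SAX distribution.

```perl
use strict;
use warnings;

package TagCounter;    # hypothetical handler class
use parent 'XML::SAX::Base';

# Called for every opening tag: push the name and count the current path.
sub start_element {
    my ($self, $el) = @_;
    push @{ $self->{path} }, $el->{Name};
    $self->{seen}{ join '/', @{ $self->{path} } }++;
}

# Called for every closing tag: leave the current element.
sub end_element {
    my ($self) = @_;
    pop @{ $self->{path} };
}

package main;
use XML::SAX::ParserFactory;

my $handler = TagCounter->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);

# Tiny inline document for demonstration only.
$parser->parse_string('<a><b/><b><c/></b></a>');
print "$_: $handler->{seen}{$_}\n" for sort keys %{ $handler->{seen} };
```

For a file that does not fit in memory you would presumably call parse_uri or parse_file instead of parse_string. The parse state lives in the handler object rather than in globals, and memory use grows with the number of distinct paths, not with the size of the file.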

falconcreek 2010-07-08 23:41:10

Answer 4

+8 A:

See "Processing an XML document chunk by chunk" in the XML::Twig documentation.
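A minimal sketch of that chunk-by-chunk style, not taken from the linked documentation. It borrows the Rwy and codeSurface names mentioned elsewhere in this thread and assumes codeSurface is a child element of Rwy, which may not match the real file. The function name surface_counts is made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Tally codeSurface values found under Rwy elements.
# Assumption: codeSurface is a child element of Rwy (it could just as well
# be an attribute in the real file; the thread does not say).
sub surface_counts {
    my ($source) = @_;    # XML string or open filehandle
    my %counts;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called each time a complete Rwy element has been parsed.
            Rwy => sub {
                my ($t, $rwy) = @_;
                my $surface = $rwy->first_child_text('codeSurface');
                $counts{$surface}++ if length $surface;
                $t->purge;    # discard everything parsed so far
            },
        },
    );
    $twig->parse($source);
    return \%counts;
}

# Small inline document for demonstration; for a large file on disk,
# parsefile($path) would take the place of parse().
my $demo = '<Apt><Rwy><codeSurface>ASPH</codeSurface></Rwy>'
         . '<Rwy><codeSurface>CONC</codeSurface></Rwy>'
         . '<Rwy><codeSurface>ASPH</codeSurface></Rwy></Apt>';
my $counts = surface_counts($demo);
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Only one Rwy subtree is held at a time because the handler purges after each one, so memory use should stay roughly flat however long the file is.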

Sinan Ünür 2010-07-09 00:35:07

Answer 5

+1 A:

Here's a solution using XML::Parser. Comments welcome.

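use strict;
use warnings;
use XML::Parser;

my %elemMap;    # parent element name => { child element name => count }
my @context;    # stack of the element names that are currently open

# Start handler: count this element under its parent, then push it.
sub on_start {
    my ($p, $elemName, @alist) = @_;
    my $parent = $context[-1];
    if (defined $parent) {
        $elemMap{$parent}{$elemName}++;
    }
    push(@context, $elemName);
}

# End handler: the element is closed, so pop it off the stack.
sub on_end {
    pop(@context);
}

my $p = XML::Parser->new(Handlers => {Start => \&on_start, End => \&on_end});
$p->parse(\*STDIN);

# Print every parent > child pair along with how often it occurred.
while (my ($elem, $childElems) = each(%elemMap)) {
    while (my ($childElem, $count) = each(%{$childElems})) {
        print "$elem > $childElem: $count\n";
    }
}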

Owen S. 2010-07-09 00:39:17

Not too different from what I wrote.

Paul Tomblin 2010-07-09 11:23:04

Answer 6

A:

When you are first trying to figure out the structure of an unknown XML file, open it in less or more and start paging through it. Don't use an editor that tries to load the entire file into memory unless you like waiting for your machine a lot.

Building a parser when you have no idea how the data is structured is going to be very frustrating, so don't jump into coding first. Explore the file until you know enough to begin coding.

Greg 2010-07-09 06:27:58

Really? You can "page through" a 13 million line file and remember all the nodes you saw, and which ones had AptUids, and what values of codeSurface you saw under the Rwy tag, and everything? Yeah, right.

Paul Tomblin 2010-07-09 11:25:39

No, but I can get an idea of the layout of the XML file and make informed decisions about WHERE to start coding, rather than jump in. The nice thing about XML is that it is usually very repetitive, as the data it represents is usually the same sort of thing over and over. Learn how it repeats and the problem reduces itself.

Greg 2010-07-11 01:50:02

The problem does not appear to be understanding the layout, but handling the file programmatically.

Thorbjørn Ravn Andersen 2010-08-18 23:40:31
