I have a very large XML file (if you care, it's an AIXM file from EAD, but that's not important). To figure out how it is used, I want to write a simple script that goes through it and, for every node, records which subnodes occur below it and how many times. That way I can see which nodes contain <AptUid>, whether most <Rdn> nodes have a <GeoLat> node or not, that sort of thing.

I tried to just load the whole thing into a hashref using XML::Simple, but it's too big to fit into memory. Is there an XML parser that will allow me to just look at the file a piece at a time?

+1  A: 

Try the XML::Parser module. Should be what you need.

KLee1
Great link, thanks.
Paul Tomblin
It's generally not recommended to write new code against the XML::Parser API. XML::SAX is a similar API but allows the use of objects for tracking the state of the parse (rather than globals) and works with a number of different parser libraries. http://perl-xml.sourceforge.net/perl-sax/
Grant McLean
+1  A: 

You should use a streaming parser, such as XML::Parser (which in turn is a layer above expat). You will have to register handlers for the tags you are interested in, and do the book-keeping yourself. As with other streaming models, such as SAX, you do not get a whole view of the file at once (except for the subset you explicitly consume in your code).

Yann Ramin
The problem is I don't even know what tags exist. That's why I want to write the program.
Paul Tomblin
You can use the start and end handlers in XML::Parser to see what tags are where.
KLee1
+2  A: 

You want to use a SAX parser such as XML::SAX. Implement start_element and end_element methods to build your node tree.
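
A minimal sketch of what such a handler might look like, assuming XML::SAX is installed. The package name ChildCounter and the helper count_children are made up for illustration, and instead of building a full tree it only tallies parent/child element names, which is what the question asks for.

```perl
use strict;
use warnings;

# Hypothetical handler class: all parse state lives in the object.
package ChildCounter;
use base 'XML::SAX::Base';

sub new {
    my ($class, @args) = @_;
    my $self = $class->SUPER::new(@args);
    $self->{stack}  = [];    # names of the currently open elements
    $self->{counts} = {};    # parent name => { child name => occurrences }
    return $self;
}

sub start_element {
    my ($self, $element) = @_;
    my $name   = $element->{Name};
    my $parent = $self->{stack}[-1];    # undef for the root element
    $self->{counts}{$parent}{$name}++ if defined $parent;
    push @{ $self->{stack} }, $name;
}

sub end_element {
    my ($self) = @_;
    pop @{ $self->{stack} };
}

package main;
use XML::SAX::ParserFactory;

# Hypothetical helper: parse an XML string and return the tally.
sub count_children {
    my ($xml) = @_;
    my $handler = ChildCounter->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_string($xml);
    return $handler->{counts};
}
```

For a file that does not fit in memory, calling parse_uri with the file name in place of parse_string should let the parser read the input incrementally.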

falconcreek
+8  A: 

See Processing an XML document chunk by chunk in XML::Twig.
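
As a rough illustration of the chunk-by-chunk approach, here is a sketch that assumes XML::Twig is installed and that <GeoLat>, when present, is a direct child of <Rdn> (a guess based on the question, not something checked against AIXM). The function name count_rdn_with_geolat is made up for this example.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: count <Rdn> elements and how many of them have a
# direct <GeoLat> child, discarding each chunk once it has been examined.
sub count_rdn_with_geolat {
    my ($source) = @_;    # XML string or open filehandle
    my ($total, $with_geolat) = (0, 0);

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called each time an <Rdn> element has been completely parsed.
            Rdn => sub {
                my ($t, $rdn) = @_;
                $total++;
                $with_geolat++ if $rdn->first_child('GeoLat');
                $t->purge;    # free the memory used by what was parsed so far
            },
        },
    );
    $twig->parse($source);
    return ($total, $with_geolat);
}
```

To read from a file by name, parsefile can be used in place of parse. Because the handler purges after every <Rdn>, memory use should stay roughly proportional to the largest single chunk rather than to the whole document.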

Sinan Ünür
+1  A: 

Here's a solution using XML::Parser. Comments welcome.

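use strict;
use warnings;
use XML::Parser;

my %elemMap;    # parent element name => { child element name => count }
my @context;    # stack of the currently open element names

# Called for every start tag: count this element under its parent.
sub on_start {
    my ($p, $elemName, @attrs) = @_;
    my $parent = $context[-1];    # undef for the root element
    if (defined $parent) {
        $elemMap{$parent}{$elemName}++;
    }
    push @context, $elemName;
}

# Called for every end tag: the element is closed, so drop it from the stack.
sub on_end {
    pop @context;
}

my $p = XML::Parser->new(Handlers => {Start => \&on_start, End => \&on_end});
$p->parse(\*STDIN);

# Print the counts in a stable, sorted order.
for my $elem (sort keys %elemMap) {
    my $childElems = $elemMap{$elem};
    for my $childElem (sort keys %$childElems) {
        print "$elem > $childElem: $childElems->{$childElem}\n";
    }
}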
Owen S.
Not too different than what I wrote.
Paul Tomblin
A: 

When you are first trying to figure out the structure of an unknown XML file, open it in less or more and start paging through it. Don't use an editor that tries to load the entire file into memory unless you like waiting for your machine a lot.

Building a parser when you have no idea how the data is structured is going to be very frustrating, so don't jump straight into coding. Explore the file until you know enough to begin.

Greg
Really? You can "page through" a 13 million line file and remember all the nodes you saw, and which ones had AptUids, and what values of codeSurface you saw under the Rwy tag, and everything? Yeah, right.
Paul Tomblin
No, but I can get an idea of the layout of the XML file and make informed decisions about WHERE to start coding, rather than jump in. The nice thing about XML is that it is usually very repetitive, as the data it represents is usually the same sort of thing over and over. Learn how it repeats and the problem reduces itself.
Greg
The problem does not appear to be understanding the layout, but handling the file programmatically.
Thorbjørn Ravn Andersen