Does reading XML data like in the following code create the DOM tree in memory?

use XML::Simple;

my $xml = XML::Simple->new;

my $data = $xml->XMLin($blast_output, ForceArray => 1);

For large XML files should I use a SAX parser, with handlers, etc.?

+3  A: 

I have not used the XML::Simple module before, but from the documentation it appears to build a nested hash in memory that holds the whole document. This is not a full DOM tree, but may well be enough for your requirements.
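To see what that in-memory structure looks like, here is a minimal sketch. The `<report>`/`<hit>`/`<score>` document is a made-up stand-in for the real BLAST output, and `KeyAttr => []` is my addition so the hits stay in a plain array rather than being folded into a hash keyed on `id`.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical miniature document standing in for real BLAST output.
my $xml_text = <<'XML';
<report>
  <hit id="h1"><score>42</score></hit>
  <hit id="h2"><score>17</score></hit>
</report>
XML

# KeyAttr => [] stops XML::Simple from folding the hits into a hash keyed on "id".
my $data = XMLin($xml_text, ForceArray => 1, KeyAttr => []);

# The whole document is now a nested structure of hashes and arrays.
my $type      = ref $data;
my $hit_count = scalar @{ $data->{hit} };
my $first     = $data->{hit}[0]{score}[0];

print "top level is a $type reference with $hit_count hits\n";
print "first score: $first\n";
```

Every element ends up in that structure at once, which is why memory use grows with the size of the file.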

For large XML files, a SAX parser would be faster and have a smaller memory footprint, but it again depends on your needs. If you just need to process the data serially, XML::SAX would probably suit you. If you need to manipulate the whole tree, something like XML::LibXML may be a better fit.

It is all horses for courses, I'm afraid.

Xetius
+7  A: 

For large XML files, you can use either XML::LibXML (in DOM mode if the document fits in memory, or in pull mode via XML::LibXML::Reader if it does not) or XML::Twig (which I wrote, so I'm biased, but it generally works well for files that are too big to fit in memory).
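As a rough sketch of the XML::Twig approach, assuming the file is a long run of repeated records: the `<hit>` and `<score>` element names and the inline sample are invented for illustration. The handler fires once per complete record, and `purge` then discards what has already been processed.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input: a small stand-in for a large file of <hit> records.
my $xml_text = <<'XML';
<report>
  <hit id="h1"><score>42</score></hit>
  <hit id="h2"><score>17</score></hit>
</report>
XML

my @scores;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <hit> element has been parsed.
        hit => sub {
            my ($t, $hit) = @_;
            push @scores, $hit->first_child_text('score');
            # Discard everything parsed so far so memory use stays flat.
            $t->purge;
        },
    },
);

$twig->parse($xml_text);    # use parsefile($path) for a real file on disk

print "$_\n" for @scores;
```

Only one record is held in memory at a time, so the same code should cope with files far larger than available RAM.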

I am not a fan of SAX, which is hard to use and in fact quite slow.

mirod
I'm using `XML::Twig` for large files
Ivan Nevostruev
+1  A: 

I would say yes to both. XML::Simple builds the entire tree in memory, and that tree takes up a large multiple of the file's size. For many applications, if your XML is over 100MB or so, it will be practically impossible to load it entirely into memory in Perl. A SAX parser instead gives you "events", or notifications, as the file is read and tags are opened or closed.

Depending on your usage patterns, either a SAX or a DOM-based parser could be faster. If you are trying to handle just a few nodes, or every node, in a large file, SAX is probably best; for example, reading a large RSS feed and parsing every item in it.
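A minimal sketch of what such an event handler might look like for the RSS case. The `TitleCollector` package name and the tiny inline feed are hypothetical; the handler subclasses XML::SAX::Base and gets its parser from XML::SAX::ParserFactory.

```perl
use strict;
use warnings;

# Hypothetical handler that collects the text of every <title> element.
package TitleCollector;
use base 'XML::SAX::Base';

sub start_element {
    my ($self, $el) = @_;
    if ($el->{LocalName} eq 'title') {
        $self->{in_title} = 1;
        $self->{buffer}   = '';
    }
}

sub characters {
    my ($self, $chars) = @_;
    # Text may arrive in several chunks, so accumulate it.
    $self->{buffer} .= $chars->{Data} if $self->{in_title};
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{LocalName} eq 'title') {
        push @{ $self->{titles} }, $self->{buffer};
        $self->{in_title} = 0;
    }
}

package main;
use XML::SAX::ParserFactory;

# Made-up feed standing in for a large RSS file.
my $feed = <<'XML';
<rss><channel>
  <item><title>First post</title></item>
  <item><title>Second post</title></item>
</channel></rss>
XML

my $handler = TitleCollector->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($feed);    # parse_uri or parse_file for real input

my @titles = @{ $handler->{titles} || [] };
print "$_\n" for @titles;
```

Note that the handler has to track its own state (`in_title`, `buffer`) between events, because the parser never hands it a complete element.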

On the other hand, if you need to cross-reference one part of the file with another part, a DOM parser or accessing via XPath will make more sense - writing it in the "inside-out" manner that a SAX parser requires will be clumsy and tricky.
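For contrast, here is a small sketch of the cross-referencing case using XML::LibXML in DOM mode with XPath. The `<library>`, `<author>`, and `<book>` elements are invented for the example; the point is that a book can look up its author elsewhere in the tree with a single query.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document in which books refer to authors by id.
my $xml_text = <<'XML';
<library>
  <author id="a1"><name>Ada</name></author>
  <author id="a2"><name>Grace</name></author>
  <book author="a2"><title>Compilers</title></book>
</library>
XML

# The whole document is parsed into a DOM tree in memory.
my $doc = XML::LibXML->load_xml(string => $xml_text);

my @lines;
for my $book ($doc->findnodes('/library/book')) {
    my $author_id = $book->getAttribute('author');
    # Jump to a different part of the tree to resolve the reference.
    my $name = $doc->findvalue(qq{/library/author[\@id="$author_id"]/name});
    push @lines, $book->findvalue('title') . " by $name";
}

print "$_\n" for @lines;
```

Doing the same lookup in a SAX handler would mean remembering every author seen so far, and it would only work if authors happen to appear before the books that refer to them.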

I recommend trying a SAX parser at least once, because the event-driven thinking required to do so is good exercise.

I've had good success with XML::SAX::Machines to set up SAX parsing in Perl. If you want multiple filters and pipelines, it's easy to set up. For simpler setups (i.e. 99% of the time) you just need a single SAX filter (look at XML::Filter::Base) and tell XML::SAX::Machines to just parse the file (or read from a filehandle) using your filter. Here's a thorough article.

Doug Treder