Does reading XML data like in the following code create the DOM tree in memory?
my $xml = XML::Simple->new;
my $data = $xml->XMLin($blast_output, ForceArray => 1);
For large XML files should I use a SAX parser, with handlers, etc.?
I have not used the XML::Simple module before, but from the documentation it appears to create a simple hash in memory. This is not a full DOM tree, but may well be enough for your requirements.
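To see what that in-memory structure looks like, here is a minimal sketch. The sample XML is made up for illustration (real BLAST output has a different shape), and `KeyAttr => []` is my addition so that the `<hit>` list is not folded into a hash keyed on `id`.

```perl
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# Hypothetical sample input; real BLAST output will look different.
my $xml_text = <<'XML';
<hits>
  <hit id="h1"><score>42</score></hit>
  <hit id="h2"><score>17</score></hit>
</hits>
XML

# KeyAttr => [] (an assumption, not in the question) keeps <hit> as a plain list
# instead of folding it into a hash keyed on the "id" attribute.
my $data = XML::Simple->new->XMLin($xml_text, ForceArray => 1, KeyAttr => []);

# The whole document now lives in memory as nested Perl hashes and arrays.
print ref($data), "\n";                  # type of the top-level reference
print scalar @{ $data->{hit} }, "\n";    # number of <hit> elements
print $data->{hit}[0]{score}[0], "\n";   # ForceArray wraps child elements in arrays

# Dump the full structure to inspect it.
$Data::Dumper::Sortkeys = 1;
print Dumper($data);
```

Everything is parsed before `XMLin` returns, so memory use grows with the size of the document.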
For large XML files, a SAX parser would be faster and have a smaller memory footprint, but the right choice depends on what you need to do. If you just need to process the data serially, XML::SAX would probably be enough. If you need to manipulate the whole tree, something like XML::LibXML may be a better fit.
It is all horses for courses, I'm afraid.
For large XML files, you can use XML::LibXML in DOM mode if the document fits in memory. If it does not, use either its pull mode (see XML::LibXML::Reader) or XML::Twig (which I wrote, so I'm biased, but it generally works well for files that are too big to fit in memory).
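As a rough sketch of the XML::Twig approach: a handler fires each time an element of interest has been fully parsed, and `purge` then frees what has been read so far. The element names and sample data here are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input standing in for a file too large to load whole.
my $xml_text = <<'XML';
<hits>
  <hit id="h1"><score>42</score></hit>
  <hit id="h2"><score>17</score></hit>
  <hit id="h3"><score>99</score></hit>
</hits>
XML

my @seen;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <hit> element has been parsed.
        hit => sub {
            my ($t, $hit) = @_;
            push @seen, [ $hit->att('id'), $hit->first_child_text('score') ];
            $t->purge;    # free the part of the tree parsed so far
        },
    },
);

# For a real file, parsefile($path) would replace parse($string).
$twig->parse($xml_text);

printf "%s => %s\n", @$_ for @seen;
```

Only one `<hit>` subtree needs to be held at a time, which is what keeps memory flat on big inputs.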
I am not a fan of SAX, which is hard to use and in fact quite slow.
I would say yes to both. The XML::Simple library will create the entire tree in memory, and its memory use is a large multiple of the size of the file. For many applications, if your XML is over 100MB or so, it will be practically impossible to load it entirely into memory in Perl. A SAX parser is a way of getting "events" or notifications as the file is read and tags are opened or closed.
Depending on your usage patterns, either a SAX or a DOM based parser could be faster. If you are making a single pass over a large file, whether you handle just a few nodes or every node, SAX is probably best: for example, reading a large RSS feed and parsing every item in it.
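To make the "events" idea concrete, here is a minimal sketch of a SAX handler built on XML::SAX::Base. The `ScoreHandler` class, the element names, and the sample data are all hypothetical.

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;

# Hypothetical handler: counts <hit> elements and collects <score> text.
package ScoreHandler;
use base 'XML::SAX::Base';

# Fired when an opening tag is read.
sub start_element {
    my ($self, $el) = @_;
    $self->{hits}++ if $el->{LocalName} eq 'hit';
    if ($el->{LocalName} eq 'score') {
        $self->{in_score} = 1;
        $self->{buffer}   = '';
    }
}

# Fired for text; a parser may deliver one text node in several chunks,
# so the pieces are accumulated in a buffer.
sub characters {
    my ($self, $chars) = @_;
    $self->{buffer} .= $chars->{Data} if $self->{in_score};
}

# Fired when a closing tag is read.
sub end_element {
    my ($self, $el) = @_;
    if ($el->{LocalName} eq 'score') {
        push @{ $self->{scores} }, $self->{buffer};
        $self->{in_score} = 0;
    }
}

package main;

# Hypothetical sample input.
my $xml_text = <<'XML';
<hits>
  <hit id="h1"><score>42</score></hit>
  <hit id="h2"><score>17</score></hit>
  <hit id="h3"><score>99</score></hit>
</hits>
XML

my $handler = ScoreHandler->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($xml_text);

my @scores = @{ $handler->{scores} || [] };
print "hits: $handler->{hits}\n";
print "scores: @scores\n";
```

The handler keeps only the state it needs (a counter, a flag, and a buffer), so memory use does not depend on the size of the input.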
On the other hand, if you need to cross-reference one part of the file with another part, a DOM parser or accessing via XPath will make more sense - writing it in the "inside-out" manner that a SAX parser requires will be clumsy and tricky.
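A small sketch of that kind of cross-referencing with XML::LibXML and XPath. The document layout is invented for the example: each `<hit>` points at a `<sequence>` elsewhere in the file by id.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document in which <hit> elements refer to <sequence> elements by id.
my $xml_text = <<'XML';
<report>
  <sequences>
    <sequence id="s1" name="alpha"/>
    <sequence id="s2" name="beta"/>
  </sequences>
  <hits>
    <hit seq="s2" score="42"/>
    <hit seq="s1" score="17"/>
  </hits>
</report>
XML

# Parse the whole document into a DOM tree.
my $doc = XML::LibXML->load_xml(string => $xml_text);

my @resolved;
for my $hit ($doc->findnodes('/report/hits/hit')) {
    my $seq_id = $hit->getAttribute('seq');
    # Look up the referenced <sequence> in another part of the tree.
    my $name = $doc->findvalue(qq{/report/sequences/sequence[\@id="$seq_id"]/\@name});
    push @resolved, [ $name, $hit->getAttribute('score') ];
}

printf "%s scored %s\n", @$_ for @resolved;
```

With SAX you would have to remember every `<sequence>` yourself as it streamed past; here the tree is simply there to query.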
I recommend trying a SAX parser at least once, because the event-driven thinking required to do so is good exercise.
I've had good success with XML::SAX::Machines to set up SAX parsing in Perl. If you want multiple filters and pipelines, it's easy to set up. For simpler setups (i.e. 99% of the time) you just need a single SAX filter (look at XML::Filter::Base) and tell XML::SAX::Machines to just parse the file (or read from a filehandle) using your filter. Here's a thorough article.