I know there are some very good Perl XML parsers like XML::Xerces, XML::Parser::Expat, XML::Simple, XML::RapidXML, XML::LibXML, XML::Liberal, etc.

Which XML parser would you select for parsing large files, and on what criteria would you decide between them? If the one you would choose is not in the list, please suggest it.

+9  A: 

With a 15 GB file, your parser would have to be SAX based because with such file sizes, simply being able to process the data is your first task.

I recommend you read XML::SAX::Intro.
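
For illustration, a minimal XML::SAX handler might look something like this (the element name 'record' and the file name are made up, not taken from the question):

    package MyRecordCounter;
    use strict;
    use warnings;
    use base qw(XML::SAX::Base);

    sub start_element {
        my ($self, $el) = @_;
        $self->{records}++ if $el->{Name} eq 'record';
        return $self->SUPER::start_element($el);
    }

    sub end_document {
        my ($self, $doc) = @_;
        print "records seen: ", $self->{records} || 0, "\n";
        return $self->SUPER::end_document($doc);
    }

    package main;
    use XML::SAX::ParserFactory;

    my $parser = XML::SAX::ParserFactory->parser(
        Handler => MyRecordCounter->new,
    );
    $parser->parse_uri('snapshot.xml');   # streams the file; never builds a full tree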

Sinan Ünür
I do not have any other criteria; the only consideration is the size of the data. I will be receiving XML files that large as snapshots, and the frequency will vary a lot, so down the line I will have multiple snapshots (snapshot id 1, snapshot id 2, ...). Parsing has to be done only after the file transfer is complete, so there will be no parsing of the XML data while it streams in; parsing takes place only once the XML file has been completely transferred.
Rachel
That may be true, but just because parsing must be done only with a complete file doesn't mean that streaming parsing is out of the question. It's still a good idea to use a streaming parser with a very large file, even when the entire document is sitting on a hard drive.
jprete
No, it does not have to be SAX based. E.g., XML::Twig is not a SAX parser.
runrig
+14  A: 

If you're parsing files of that size, you'll want to avoid any parser that tries to load the entire document into memory and construct a DOM (Document Object Model).

Instead, look for a SAX-style parser - one that treats the input file as a stream, raising events as elements and attributes are encountered. This approach allows you to process the file gradually, without having to hold the entire thing in memory at once.

Bevan
To add meat to this answer, avoid XML::Simple like the plague for any large data sets.
DVK
Also, any reason why it's essentially a repeat of Sinan's earlier answer? :)
DVK
I suspect that Sinan and I gave the same answer at the same time - when I first saw this question, there was no answer at all; while I was writing this, Sinan wrote his.
Bevan
@Bevan: Exactly. It took me a while before I realized you had posted an answer as well because I was so pre-occupied with tracking down links to the modules mentioned in Rachel's post. By the time I realized we had essentially posted the same answer, I had a few upvotes and following on the heels of this morning's serial downvoting of a few of my answers, I selfishly wanted to hold on to my points. ;-) @DVK thanks for noticing.
Sinan Ünür
Agreed - my personal favourite is XML::Twig. I've processed XML streams with XML::Twig, i.e. theoretically infinite GB :)
Mark Aufflick
Sinan: I noticed the downvotes you got yesterday. But it's your own fault, you messed with the mighty Krish/Kirsh/Joe.
innaM
+3  A: 

You could also consider using a database with XML extensions (see here for an example). You could do a bulk load of XML data into the database, then you can do SQL queries (or XQueries) on that data.
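
As one rough sketch of the idea: MySQL (5.1 and later) ships limited XML functions such as ExtractValue(), so once the raw XML is stored in the database you can query into it with XPath expressions. The table, column, and connection details below are made up:

    use strict;
    use warnings;
    use DBI;

    # Assumes the raw XML for each snapshot has been bulk-loaded into a
    # LONGTEXT column; MySQL's ExtractValue() evaluates a limited XPath
    # expression against it. Table and column names are illustrative.
    my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'password',
                           { RaiseError => 1 });

    my $sth = $dbh->prepare(q{
        SELECT snapshot_id,
               ExtractValue(xml_data, 'count(/snapshot/item)')
        FROM   snapshots
    });
    $sth->execute;
    while (my ($id, $item_count) = $sth->fetchrow_array) {
        print "snapshot $id has $item_count items\n";
    }
    $dbh->disconnect;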

tster
I am using a MySQL database, and I am not sure whether it has this feature of loading XML directly into MySQL and querying the XML back out of the database.
Rachel
+3  A: 

For parsing such files I have always used XML::Parser. It is simple, available everywhere, and works well.
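
For illustration, a minimal stream-style sketch with XML::Parser might look like this (the tag counting and file name are made up):

    use strict;
    use warnings;
    use XML::Parser;

    my %count;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $tag, %attrs) = @_;
                $count{$tag}++;
            },
            # End and Char handlers would go here for real processing
        },
    );
    $parser->parsefile('snapshot.xml');   # expat streams the file

    print "$_: $count{$_}\n" for sort keys %count;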

depesz
+2  A: 

I'm going for a mutated version of tster's answer above. Load the bloody thing into a DB (if possible via direct XML import; if not, by using a SAX parser to parse the file and produce loadable data sets). Then use the DB as the data store. At 15 GB, you are pushing way beyond the size of data that should be manipulated outside of a database.
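
A rough sketch of the loading step, assuming the records have already been pulled out of the XML by one of the streaming parsers discussed above (the table, columns, and sample records are all made up):

    use strict;
    use warnings;
    use DBI;

    # @records would be filled by the streaming parse; hard-coded here
    # only so the sketch is self-contained.
    my @records = ( [ 1, 'first item' ], [ 2, 'second item' ] );

    my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 0 });

    my $ins = $dbh->prepare('INSERT INTO items (item_id, name) VALUES (?, ?)');
    $ins->execute(@$_) for @records;

    $dbh->commit;       # one transaction for the whole load
    $dbh->disconnect;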

DVK
In fact, this is my first piece of advice to anyone who might have to manipulate and/or query the same data set multiple times. In this case, my focus would be on getting any intermediaries out of the way between the raw data and the database.
Sinan Ünür
+5  A: 

A SAX parser is one option. Other options that don't involve loading the entire doc into memory are XML::Twig and XML::Rules.

runrig
+3  A: 

As you would expect, I would suggest XML::Twig, which will let you process the file chunk by chunk. This of course assumes that you can process your file this way. It will probably be easier to use than SAX, as you can process the tree for each chunk with DOM-like methods.
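
For illustration, the usual pattern looks something like this (the element name 'item' is made up):

    use strict;
    use warnings;
    use XML::Twig;

    # Chunk-by-chunk processing: each 'item' element is handled with
    # DOM-like methods, then purged to keep memory use flat.
    my $twig = XML::Twig->new(
        twig_handlers => {
            item => sub {
                my ($t, $item) = @_;
                print $item->att('id'), "\t",
                      $item->first_child_text('name'), "\n";
                $t->purge;    # drop everything parsed so far
            },
        },
    );
    $twig->parsefile('snapshot.xml');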

An alternative would be to use the pull parser mode, which is a little similar to what XML::Twig offers.
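
One pull-style interface is XML::LibXML::Reader; a rough sketch (again with a made-up 'item' element) might look like this:

    use strict;
    use warnings;
    use XML::LibXML::Reader;

    # Pull parsing: the application asks for the next node instead of
    # receiving callbacks for every event.
    my $reader = XML::LibXML::Reader->new(location => 'snapshot.xml')
        or die "cannot open snapshot.xml";

    while ($reader->read) {
        next unless $reader->nodeType == XML_READER_TYPE_ELEMENT
                and $reader->name eq 'item';
        my $item = $reader->copyCurrentNode(1);   # deep copy of this element only
        print $item->findvalue('@id'), "\n";
        $reader->next;                            # skip the rest of its subtree
    }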

mirod