I know there are some very good Perl XML parsers like XML::Xerces, XML::Parser::Expat, XML::Simple, XML::RapidXML, XML::LibXML, XML::Liberal, etc.

Which XML parser would you select for parsing large files, and on what criteria would you decide between them? If the one you would choose is not in the list, please suggest it.

+9  A: 

With a 15 GB file, your parser would have to be SAX based because with such file sizes, simply being able to process the data is your first task.

I recommend you read XML::SAX::Intro.
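
For illustration, a minimal XML::SAX handler might look something like this (the element name 'record' and the file name are made up, not taken from the question):

    package MyRecordCounter;
    use strict;
    use warnings;
    use base qw(XML::SAX::Base);

    sub start_element {
        my ($self, $el) = @_;
        $self->{records}++ if $el->{Name} eq 'record';
        return $self->SUPER::start_element($el);
    }

    sub end_document {
        my ($self, $doc) = @_;
        print "records seen: ", $self->{records} || 0, "\n";
        return $self->SUPER::end_document($doc);
    }

    package main;
    use XML::SAX::ParserFactory;

    my $parser = XML::SAX::ParserFactory->parser(
        Handler => MyRecordCounter->new,
    );
    $parser->parse_uri('snapshot.xml');   # streams the file; never builds a full tree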

Sinan Ünür
I do not have any other criteria; the only consideration is the size of the data. I will be receiving XML files that large as snapshots, and the frequency will vary a lot, so down the line I will have multiple snapshots (snapshot id 1, snapshot id 2, ...). Parsing has to be done only after the file transfer is complete, so there will be no parsing of the XML data while it streams in; parsing takes place only once the XML file has been completely transferred.
Rachel
That may be true, but just because parsing must be done only with a complete file doesn't mean that streaming parsing is out of the question. It's still a good idea to use a streaming parser with a very large file, even when the entire document is sitting on a hard drive.
jprete
No, it does not have to be SAX based. E.g., XML::Twig is not a SAX parser.
runrig
+14  A: 

If you're parsing files of that size, you'll want to avoid any parser that tries to load the entire document into memory and construct a DOM (Document Object Model).

Instead, look for a SAX-style parser - one that treats the input file as a stream, raising events as elements and attributes are encountered. This approach allows you to process the file gradually, without having to hold the entire thing in memory at once.

Bevan
To add meat to this answer, avoid XML::Simple like the plague for any large data sets.
DVK
Also, any reason why it's essentially a repeat of Sinan's earlier answer? :)
DVK
I suspect that Sinan and I gave the same answer at the same time - when I first saw this question, there was no answer at all; while I was writing this, Sinan wrote his.
Bevan
@Bevan: Exactly. It took me a while before I realized you had posted an answer as well because I was so pre-occupied with tracking down links to the modules mentioned in Rachel's post. By the time I realized we had essentially posted the same answer, I had a few upvotes and following on the heels of this morning's serial downvoting of a few of my answers, I selfishly wanted to hold on to my points. ;-) @DVK thanks for noticing.
Sinan Ünür
Agreed - my personal favourite is XML::Twig. I've processed XML streams with XML::Twig, i.e. theoretically infinite GB :)
Mark Aufflick
Sinan: I noticed the downvotes you got yesterday. But it's your own fault, you messed with the mighty Krish/Kirsh/Joe.
innaM
+3  A: 

You could also consider using a database with XML extensions (see here for an example). You could do a bulk load of XML data into the database, then you can do SQL queries (or XQueries) on that data.
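
As one rough sketch of the idea: MySQL (5.1 and later) ships limited XML functions such as ExtractValue(), so once the raw XML is stored in the database you can query into it with XPath expressions. The table, column, and connection details below are made up:

    use strict;
    use warnings;
    use DBI;

    # Assumes the raw XML for each snapshot has been bulk-loaded into a
    # LONGTEXT column; MySQL's ExtractValue() evaluates a limited XPath
    # expression against it. Table and column names are illustrative.
    my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'password',
                           { RaiseError => 1 });

    my $sth = $dbh->prepare(q{
        SELECT snapshot_id,
               ExtractValue(xml_data, 'count(/snapshot/item)')
        FROM   snapshots
    });
    $sth->execute;
    while (my ($id, $item_count) = $sth->fetchrow_array) {
        print "snapshot $id has $item_count items\n";
    }
    $dbh->disconnect;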

tster
I am using a MySQL database, and I am not sure whether it has this feature of loading XML directly into MySQL and querying the XML back out of the database.
Rachel
+3  A: 

For parsing such files I have always used XML::Parser. It is simple, available everywhere, and works well.
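
For illustration, a minimal stream-style sketch with XML::Parser might look like this (the tag counting and file name are made up):

    use strict;
    use warnings;
    use XML::Parser;

    my %count;
    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $tag, %attrs) = @_;
                $count{$tag}++;
            },
            # End and Char handlers would go here for real processing
        },
    );
    $parser->parsefile('snapshot.xml');   # expat streams the file

    print "$_: $count{$_}\n" for sort keys %count;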

depesz
+2  A: 

I'm going for a mutated version of tster's answer above. Load the bloody thing into a DB (if possible via direct XML import; if not, by using a SAX parser to parse the file and produce loadable data sets). Then use the DB as the data store. At 15 GB, you are pushing way beyond the size of data that should be manipulated outside of a database.
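
A rough sketch of the loading step, assuming the records have already been pulled out of the XML by one of the streaming parsers discussed above (the table, columns, and sample records are all made up):

    use strict;
    use warnings;
    use DBI;

    # @records would be filled by the streaming parse; hard-coded here
    # only so the sketch is self-contained.
    my @records = ( [ 1, 'first item' ], [ 2, 'second item' ] );

    my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 0 });

    my $ins = $dbh->prepare('INSERT INTO items (item_id, name) VALUES (?, ?)');
    $ins->execute(@$_) for @records;

    $dbh->commit;       # one transaction for the whole load
    $dbh->disconnect;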

DVK
In fact, this is my first piece of advice to anyone who might have to manipulate and/or query the same data set multiple times. In this case, my focus would be on getting any intermediaries out of the way between the raw data and the database.
Sinan Ünür
+5  A: 

A SAX parser is one option. Other options that don't involve loading the entire doc into memory are XML::Twig and XML::Rules.

runrig
+3  A: 

As you would expect, I would suggest XML::Twig, which will let you process the file chunk by chunk. This of course assumes that you can process your file this way. It will probably be easier to use than SAX, as you can process the tree for each chunk with DOM-like methods.
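
For illustration, the usual pattern looks something like this (the element name 'item' is made up):

    use strict;
    use warnings;
    use XML::Twig;

    # Chunk-by-chunk processing: each 'item' element is handled with
    # DOM-like methods, then purged to keep memory use flat.
    my $twig = XML::Twig->new(
        twig_handlers => {
            item => sub {
                my ($t, $item) = @_;
                print $item->att('id'), "\t",
                      $item->first_child_text('name'), "\n";
                $t->purge;    # drop everything parsed so far
            },
        },
    );
    $twig->parsefile('snapshot.xml');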

An alternative would be to use the pull parser mode, which is a little similar to what XML::Twig offers.
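
One pull-style interface is XML::LibXML::Reader; a rough sketch (again with a made-up 'item' element) might look like this:

    use strict;
    use warnings;
    use XML::LibXML::Reader;

    # Pull parsing: the application asks for the next node instead of
    # receiving callbacks for every event.
    my $reader = XML::LibXML::Reader->new(location => 'snapshot.xml')
        or die "cannot open snapshot.xml";

    while ($reader->read) {
        next unless $reader->nodeType == XML_READER_TYPE_ELEMENT
                and $reader->name eq 'item';
        my $item = $reader->copyCurrentNode(1);   # deep copy of this element only
        print $item->findvalue('@id'), "\n";
        $reader->next;                            # skip the rest of its subtree
    }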

mirod