ansaurus

Question

Processing large XML file with libxml-ruby chunk by chunk

Answer 1

+1 A:

When processing XML, two common options are tree-based and event-based parsing. The tree-based approach typically reads the entire XML document into memory, which can consume a large amount of it. The event-based approach keeps very little of the document in memory, but it does nothing unless you write your own handler logic.

SAX-style parsers and their derivatives use the event-based model.
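To make the handler idea concrete, here is a minimal sketch using REXML's stream (SAX-style) interface rather than libxml-ruby. The `TitleListener` class, the `SAMPLE_XML` data and the `title` element name are made up for illustration; only the current title's text is buffered at any time.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical handler: collects the text of every <title> element.
class TitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @buffer = nil # non-nil only while we are inside a <title> element
  end

  # Called for every opening tag.
  def tag_start(name, _attrs)
    @buffer = +'' if name == 'title'
  end

  # Called for every run of character data.
  def text(data)
    @buffer << data if @buffer
  end

  # Called for every closing tag.
  def tag_end(name)
    return unless name == 'title' && @buffer

    @titles << @buffer.strip
    @buffer = nil
  end
end

# Made-up sample data; a real run would pass an open File instead of a String.
SAMPLE_XML = <<~XML
  <dblp>
    <article><title>First paper</title><year>2009</year></article>
    <article><title>Second paper</title><year>2010</year></article>
  </dblp>
XML

listener = TitleListener.new
REXML::Document.parse_stream(SAMPLE_XML, listener)
puts listener.titles.inspect
```

The parser drives the code here: it calls `tag_start`, `text` and `tag_end` as it goes, so any state you need between callbacks has to live in the listener.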

Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html

REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html

tadman 2010-01-04 15:19:47

I am aware of tree-based vs. stream-based parsing. According to the API documentation, `XML::Reader` parses the stream and models a cursor, which is used through `next` and `expand`. However, the documentation lacks a good example of how to use it for big files.

Christian Lindig 2010-01-04 21:12:23

Examples are always a problem, yeah. I prefer tree-based parsers, they're usually much easier to use, but for instances like this you're stuck using something more SAXy. The good news is that a lot of Java code examples, which are built around the SAX method, are fairly portable to Ruby. Looks like paradigmatic has a better solution, though.

tadman 2010-01-05 18:35:25

Answer 2

+2 A:

When processing big XML files, you should use a stream parser to avoid loading everything in memory. There are two common approaches:

Push parsers like SAX, where you react to encountered tags as you get them (see tadman's answer).
Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up, go down, etc.

I think that push parsers are nice to use if you want to retrieve just some fields, but they are generally messy to use for complex data extraction and are often implemented with `case ... when` constructs.

Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article in Dr. Dobb's Journal about pull parsers with REXML.
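As a rough illustration of the pull style, the sketch below uses REXML's `PullParser` to hand back one record at a time. The helper names `each_record` and `read_record`, the sample data and the `article` tag are assumptions for the example, and it assumes flat records (child elements containing only text).

```ruby
require 'rexml/parsers/pullparser'

# Made-up sample data; a real run would pass an open File instead of a String.
SAMPLE_RECORDS = <<~XML
  <dblp>
    <article><title>First paper</title><year>2009</year></article>
    <article><title>Second paper</title><year>2010</year></article>
  </dblp>
XML

# Hypothetical helper: pulls events until the record's end tag is reached
# and returns the record's child elements as a Hash of name => text.
def read_record(parser, record_tag)
  fields = {}
  current = nil # name of the child element we are currently inside
  while parser.has_next?
    event = parser.pull
    if event.start_element?
      current = event[0]
    elsif event.text? && current
      fields[current] = (fields[current] || '') + event[0]
    elsif event.end_element?
      return fields if event[0] == record_tag

      current = nil
    end
  end
  fields
end

# Hypothetical helper: yields each record as soon as it has been read,
# so only one record is held in memory at a time.
def each_record(source, record_tag)
  parser = REXML::Parsers::PullParser.new(source)
  while parser.has_next?
    event = parser.pull
    yield read_record(parser, record_tag) if event.start_element? && event[0] == record_tag
  end
end

records = []
each_record(SAMPLE_RECORDS, 'article') { |record| records << record }
puts records.inspect
```

Unlike the push style, the calling code decides when to ask for the next event, which is why the per-record logic can sit in an ordinary method instead of being spread over callbacks.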

paradigmatic 2010-01-04 19:28:14

Thanks for the pointer. `XML::Reader` is indeed a pull parser based on a cursor that is advanced using `next` and that can read an entire sub-tree using `expand`. My code is working except that it leaks memory, and I suspect that this is caused by some basic misunderstanding about how to use it on big files. Would any `XML::Reader` expert like to comment?

Christian Lindig 2010-01-04 21:24:50

Answer 3

A:

Hello Christian,

Have you made any progress on this or heard from others? I'm having the same issue with a segfault. The XML file is ~327 MB and 4.5 million lines.

/home/dyoung/projects/aop/xml_parse.rb:46: [BUG] Segmentation fault ruby 1.9.1p243 (2009-07-16 revision 24175) [i486-linux] .... .... ....

2010-01-25 23:46:33

No progress, I'm sorry. I moved to the most recent versions of Ruby 1.8 and libxml on Debian Testing and Ruby still segfaulted on me.

Christian Lindig 2010-01-27 14:48:29

Hmm, this is a major bummer. On smaller files with the same content everything works great, but when I try to parse the 327 MB+ file, libxml-ruby just pukes after a while. Please don't say I need to use Java... yuk.

2010-01-27 18:50:08

Christian, how large are your XML files (MB and `wc -l` count), and what OS are you running on?

2010-01-27 19:53:08

Christian, I ended up running an awk script that split the monster file into 3-4K smaller ones. Then I loop over them in a Ruby block and it seems to work fine. It's actually faster, since my XPath queries are not trying to comb through a 300 MB+ file.
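The awk script itself isn't shown, but the same splitting idea can be sketched in plain Ruby. This is only a line-based sketch under strong assumptions: each record's opening and closing tags sit on lines that hold no other record and no root tag, and anything outside a record (XML declaration, root tags) can be dropped. The helper name `split_records`, its parameters and the sample file are all hypothetical.

```ruby
require 'tmpdir'

# Hypothetical helper: streams a big file line by line and copies its
# records into numbered chunk files, each wrapped in its own root element.
def split_records(path, out_dir, record_tag:, root_tag:, per_chunk:)
  open_re = /<#{Regexp.escape(record_tag)}[\s>]/
  close_str = "</#{record_tag}>"
  chunk_paths = []
  out = nil      # currently open chunk file, if any
  count = 0      # records written to the current chunk
  inside = false # true while we are between a record's open and close tag

  File.foreach(path) do |line|
    inside = true if !inside && line =~ open_re
    next unless inside # skip lines outside any record

    if out.nil?
      chunk_paths << File.join(out_dir, format('chunk_%05d.xml', chunk_paths.size))
      out = File.open(chunk_paths.last, 'w')
      out.puts "<#{root_tag}>"
    end
    out.write(line)

    next unless line.include?(close_str)

    inside = false
    count += 1
    next if count < per_chunk

    # The chunk is full: close it so the next record starts a new file.
    out.puts "</#{root_tag}>"
    out.close
    out = nil
    count = 0
  end

  if out
    out.puts "</#{root_tag}>"
    out.close
  end
  chunk_paths
end

# Small made-up input to show the behaviour.
record_counts = Dir.mktmpdir do |dir|
  big = File.join(dir, 'big.xml')
  File.write(big, <<~XML)
    <dblp>
    <article><title>A</title></article>
    <article>
      <title>B</title>
    </article>
    <article><title>C</title></article>
    </dblp>
  XML
  chunks = split_records(big, dir, record_tag: 'article', root_tag: 'dblp', per_chunk: 2)
  chunks.map { |chunk| File.read(chunk).scan('</article>').size }
end
puts record_counts.inspect
```

Each chunk is a small, well-formed document, so it can be loaded with a tree-based parser and queried with XPath without holding the whole input in memory.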

2010-02-01 19:07:25

The file is 660 MB. I've just tried two things again: (1) Just reading the file using `next` works. (2) But when I also try to `expand` nodes I run out of memory, even when I call `node.remove!` explicitly. Luckily, the segfaults are gone after I updated to the most recent versions on Debian Testing.

Christian Lindig 2010-03-26 20:34:47

Answer 4

A:

I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like:

my_node = dblp.expand
# ... do what you have to do with my_node ...
dblp.next
my_node.remove!

Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node to the Reader, and you have to call Node#remove! to free it.

Memory usage wasn't great, even with this hack, but at least it didn't keep on growing.

Naofumi 2010-02-11 13:36:27

Thanks. It still does not work for me as I run out of memory. However, reading the file in a loop calling next without using expand does work. I suspect a memory leak in the expand method.

Christian Lindig 2010-03-26 20:36:08
