views: 755

answers: 4
I'd like to read a large XML file containing over a million small bibliographic records (like <article>...</article>) using libxml in Ruby. I have tried the Reader class in combination with the expand method to read record by record, but I am not sure this is the right approach, since my code eats up memory. Hence, I'm looking for a recipe for conveniently processing the file record by record with constant memory usage. Below is my main loop:

    File.open('dblp.xml') do |io|
      dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
      pubFactory = PubFactory.new

      i = 0
      while dblp.read
        case dblp.name
        when 'article', 'inproceedings', 'book'
          pub = pubFactory.create(dblp.expand)
          i += 1
          puts pub
          pub = nil
          $stderr.puts i if i % 10000 == 0
          dblp.next
        when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
          # ignore for now
          dblp.next
        else
          # nothing
        end
      end
    end

The key here is that dblp.expand reads an entire subtree (like an <article> record), which I then pass to a factory for further processing. Is this the right approach?

Within the factory method I then use high-level XPath-like expressions to extract the content of elements, as shown below. Again, is this viable?

    def first(root, path)
      x = root.find(path).first
      x ? x.content : nil
    end

    pub.pages = first(node, 'pages')   # node is the expanded record from dblp.expand
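
For context, here is a minimal sketch of how the factory method might use this helper on an expanded record. The Pub struct and the field names are hypothetical, not taken from my actual code; only the find/content logic comes from the helper above.

    # Hypothetical factory sketch: build a simple record object from an
    # expanded node using the first() helper defined above.
    Pub = Struct.new(:title, :pages, :year)

    class PubFactory
      def create(node)
        Pub.new(first(node, 'title'),
                first(node, 'pages'),
                first(node, 'year'))
      end
    end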
+1  A: 

When processing XML, two common options are tree-based and event-based parsing. The tree-based approach typically reads in the entire XML document and can consume a large amount of memory. The event-based approach uses very little additional memory, but doesn't do anything unless you write your own handler logic.

The event-based model is employed by SAX-style parsers and their derivative implementations.

Example with REXML: http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch08s01.html

REXML: http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html
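
For illustration, a minimal REXML stream-listener sketch along these lines; the element name is assumed from the dblp.xml records in the question, and this handler only counts records rather than building them:

    require 'rexml/document'
    require 'rexml/streamlistener'
    require 'rexml/parsers/streamparser'

    # Event-based (SAX-style) handling: the listener is called back for
    # each tag, so only the current event is held in memory.
    class PubListener
      include REXML::StreamListener

      def initialize
        @count = 0
      end

      def tag_end(name)
        if name == 'article'
          @count += 1
          $stderr.puts @count if @count % 10_000 == 0
        end
      end
    end

    File.open('dblp.xml') do |io|
      REXML::Parsers::StreamParser.new(io, PubListener.new).parse
    end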

tadman
I am aware of tree-based vs. stream-based parsing. According to the API documentation, XML::Reader parses the stream and models a cursor, which is advanced with `next` and `expand`. However, the documentation lacks a good example of how to use it for big files.
Christian Lindig
Examples are always a problem, yeah. I prefer tree-based parsers; they're usually much easier to use, but for cases like this you're stuck using something more SAXy. The good news is that a lot of Java code examples, which are built around the SAX approach, are fairly portable to Ruby. Looks like paradigmatic has a better solution, though.
tadman
+2  A: 

When processing big XML files, you should use a stream parser to avoid loading everything into memory. There are two common approaches:

  • Push parsers like SAX, where you react to tags as they are encountered (see tadman's answer).
  • Pull parsers, where you control a "cursor" in the XML file that you can move with simple primitives like go up, go down, etc.

I think push parsers are nice to use if you want to retrieve just some fields, but they are generally messy to use for complex data extraction and are often implemented with `case ... when ...` constructs.

Pull parsers are, in my opinion, a good middle ground between a tree-based model and a push parser. You can find a nice article in Dr. Dobb's Journal about pull parsers with REXML.
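
As a rough illustration of the pull style (assuming the dblp.xml structure from the question), a REXML pull-parser loop might look like this:

    require 'rexml/parsers/pullparser'

    # Pull-parser sketch: the caller drives the cursor with pull and
    # inspects one event at a time, so memory stays constant.
    File.open('dblp.xml') do |io|
      parser = REXML::Parsers::PullParser.new(io)
      while parser.has_next?
        event = parser.pull
        if event.start_element? && event[0] == 'article'
          # collect the fields of this record from the following events
        end
      end
    end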

paradigmatic
Thanks for the pointer. `XML::Reader` is indeed a pull parser based on a cursor that is advanced using `next` and that can read an entire sub-tree using `expand`. My code is working, except that it leaks memory, and I suspect this is caused by some basic misunderstanding about how to use it on big files. Would any XML::Reader expert like to comment?
Christian Lindig
A: 

Hello Christian,

Have you made any progress on this or heard from others? I'm having the same issue with a seg fault. The XML file is ~327 MB and 4.5 million lines:

    /home/dyoung/projects/aop/xml_parse.rb:46: [BUG] Segmentation fault
    ruby 1.9.1p243 (2009-07-16 revision 24175) [i486-linux]
    .... .... ....

No progress, I'm sorry. I moved to the most recent versions of Ruby 1.8 and libxml on Debian Testing and Ruby still segfaulted on me.
Christian Lindig
Hmm, this is a major bummer. On smaller files with the same content everything works great, but when I try to parse the 327 MB+ file, libxml-ruby just pukes after a while. Please don't say I need to use Java... yuck.
Christian, how large are your XML files (MB and `wc -l` count), and what OS are you running on?
Christian, I ended up running an awk script that split the monster file into 3-4K smaller ones. Then I loop over them in a Ruby block and it seems to work fine. It's actually faster, since my XPath queries are not trying to comb through a 300 MB+ file.
The file is 660 MB. I've just tried two things again: (1) just reading the file using `next` works. (2) But when I also try to expand nodes, I run out of memory, even when I call `node.remove!` explicitly. Luckily, the seg faults are gone after I updated to the most recent versions on Debian Testing.
Christian Lindig
A: 

I had the same problem, but I think I solved it by calling Node#remove! on the expanded node. In your case, I think you should do something like:

    my_node = dblp.expand
    # do what you have to do with my_node
    dblp.next
    my_node.remove!

Not really sure why this works, but if you look at the source for LibXML::XML::Reader#expand, there's a comment about freeing the node. I am guessing that Reader#expand associates the node with the Reader, and you have to call Node#remove! to free it.

Memory usage wasn't great, even with this hack, but at least it didn't keep on growing.
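
Plugged into the loop from your question (with the PubFactory assumed there), that would look roughly like this:

    File.open('dblp.xml') do |io|
      dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
      pub_factory = PubFactory.new

      while dblp.read
        case dblp.name
        when 'article', 'inproceedings', 'book'
          node = dblp.expand      # materialize the current record as a subtree
          puts pub_factory.create(node)
          dblp.next               # move the reader past the record first...
          node.remove!            # ...then detach the expanded subtree
        when 'proceedings', 'incollection', 'phdthesis', 'mastersthesis'
          dblp.next               # skip records we ignore
        end
      end
    end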

Naofumi
Thanks. It still does not work for me, as I run out of memory. However, reading the file in a loop calling `next` without using `expand` does work. I suspect a memory leak in the `expand` method.
Christian Lindig