I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>
) using libxml in Ruby. I have tried the Reader class in combination with the expand
method to read record by record but I am not sure this is the right approach since my code eats up memory. Hence, I'm looking for a recipe how to conveniently process record by record with constant memory usage. Below is my main loop:
File.open('dblp.xml') do |io|
dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
pubFactory = PubFactory.new
i = 0
while dblp.read do
case dblp.name
when 'article', 'inproceedings', 'book':
pub = pubFactory.create(dblp.expand)
i += 1
puts pub
pub = nil
$stderr.puts i if i % 10000 == 0
dblp.next
when 'proceedings','incollection', 'phdthesis', 'mastersthesis':
# ignore for now
dblp.next
else
# nothing
end
end
end
The key here is that dblp.expand
reads an entire subtree (like an <article>
record) and passes it as an argument to a factory for further processing. Is this the right approach?
Within the factory method I then use high-level XPath-like expression to extract the content of elements, like below. Again, is this viable?
def first(root, node)
x = root.find(node).first
x ? x.content : nil
end
pub.pages = first(node,'pages') # node contains expanded node from dblp.expand