tags:
views: 68
answers: 2
I am using Nokogiri, which works well for small documents, but for a 180 KB HTML file I have to increase the process stack size (via ulimit -s), and parsing takes a long time, to say nothing of XPath queries.

Are there faster alternatives available in a stock Ruby 1.8 distribution?

I am getting used to XPath, but the alternative does not necessarily need to support XPath.

Criteria are just:

  1. fast to write
  2. fast execution
  3. a robust resulting parser
A: 

You may find that DOM parsing is not very performant for larger XML documents, because the parser has to build an in-memory map of the entire document's structure.

The other approach that generally requires a smaller memory footprint is to use an event-driven SAX parser.

Nokogiri has full support for SAX.

Steve Weet
+3  A: 

Nokogiri is based on libxml2, which is one of the fastest XML parsers in any language. It is written in C, but there are bindings in many languages.

The problem is that the more complex the file, the longer it takes to build a complete DOM structure in memory. XPath relies on that DOM structure, so XPath queries are the slowest way to extract data from an XML document.

SAX is often what people turn to for speed. It is event driven: the parser notifies you of a start element, end element, and so on, and you write handlers that react to those events. It can be a pain because you end up tracking state yourself (e.g. which elements you are currently "inside").

There is a middle ground: some parsers offer "pull parsing", which gives you cursor-like navigation. You still visit each node sequentially, but you can fast-forward past elements you are not interested in. It has nearly the speed of SAX with a friendlier interface for many uses. I don't know whether Nokogiri can do this for HTML, but I would look into its Reader API if you're interested.

Mark Thomas