I have a 2.4 MB XML file, an export from Microsoft Project (hey I'm the victim here!) from which I am requested to extract certain details for re-presentation. Ignoring the intelligence or otherwise of the request, which library should I try first from a Ruby perspective?

I'm aware of the following (in no particular order):

I'd prefer something packaged as a Ruby gem, which I suspect the Chilkat library is not.

Performance isn't a major issue - I don't expect the thing to need to run more than once a day (once a week is more likely). I'm more interested in something that's as easy to use as anything XML-related is able to get.

EDIT: I tried the gemified ones:

hpricot is, by a country mile, the easiest. For example, to extract the content of the SaveVersion tag in this XML (saved in a file called, say, 'test.xml'):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Project xmlns="http://schemas.microsoft.com/project">
    <SaveVersion>12</SaveVersion>
</Project>

takes something like this:

require 'hpricot'
doc = Hpricot.XML(open('test.xml'))
version = (doc/:Project/:SaveVersion).first.inner_html

hpricot seems relatively unconcerned with namespaces. In this example that's fine, since there's only one, but it could be a problem with a complex document. Since hpricot is also very slow, I rather imagine this is a problem that would solve itself.

libxml-ruby is an order of magnitude faster, understands namespaces (it took me a good couple of hours to figure this out) and is altogether much closer to the XML metal - XPath queries and all the other stuff are in there. This is not necessarily a Good Thing if, like me, you open up an XML document only under conditions of extreme duress. The helper module was mostly helpful in providing examples of how to handle a default namespace effectively. This is roughly what I ended up with (I'm not in any way asserting its beauty, correctness or other value, it's just where I am right now):

# Build an XPath query with every step prefixed, e.g. "/p:Project/p:SaveVersion".
# Defined first so it exists before it is called below.
def xpath_qry(tags, scope = :in_node)
  "#{@scopes[scope]}" + tags.split(/\//).collect{ |tag| "#{@ns_prefix}:#{tag}"}.join('/')
end

xml_parser = XML::Parser.new
xml_parser.string = File.read(path)
doc = xml_parser.parse
@root = doc.root
@scopes = { :in_node => '', :in_root => '/', :in_doc => '//' }
# Bind the document's default namespace to a prefix of our own choosing.
@ns_prefix = 'p'
@ns = "#{@ns_prefix}:#{@root.namespace[0].href}"
version = @root.find_first(xpath_qry("Project/SaveVersion", :in_root), @ns).content.to_i
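The trick in the snippet above (bind the default namespace to a prefix you pick yourself, then prefix every step of the path) is not specific to libxml. Here is a minimal sketch of the same idea using REXML from the standard Ruby distribution rather than libxml-ruby, purely so it runs without extra gems. The helper name prefixed_path and the prefix 'p' are my own choices for illustration, not part of either library.

```ruby
require 'rexml/document'

# Same sample document as earlier, inlined so the sketch is self-contained.
xml = <<~XML
  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <Project xmlns="http://schemas.microsoft.com/project">
      <SaveVersion>12</SaveVersion>
  </Project>
XML

doc = REXML::Document.new(xml)

# Read the default namespace URI off the root element and bind it to an
# arbitrary prefix ('p' is an assumption; the document declares no prefix).
ns = { 'p' => doc.root.namespace }

# Hypothetical helper mirroring xpath_qry: prefix each step of the path.
def prefixed_path(tags, prefix = 'p')
  '/' + tags.split('/').map { |tag| "#{prefix}:#{tag}" }.join('/')
end

# Query with the prefixed path and the prefix-to-URI mapping.
node = REXML::XPath.first(doc, prefixed_path('Project/SaveVersion'), ns)
version = node.text.to_i
```

The point is only that the prefix lives in the query, not in the document, so any prefix works as long as it maps to the right URI.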

I'm still debating the pros and cons: libxml for its extra rigour, hpricot for the sheer style of _why's code.

EDIT again, somewhat later: I discovered HappyMapper ('gem install happymapper') which is hugely promising, if still at an early stage. It's declarative and mostly works, although I have spotted a couple of edge cases that I don't have fixes for yet. It lets you do stuff like this, which parses my Google Reader OPML:

module OPML
  class Outline
    include HappyMapper
    tag 'outline'
    attribute :title, String
    attribute :text, String
    attribute :type, String
    attribute :xmlUrl, String
    attribute :htmlUrl, String
    has_many :outlines, Outline
  end
end

xml_string = File.read("google-reader-subscriptions.xml")

sections = OPML::Outline.parse(xml_string)

I already love it, even though it's not perfect yet.

+1  A: 

I have used libxml before for XML parsing; it has a nice API and is fast.

MatthewFord
A: 

Maybe you could distill the XML with an XSLT stage before processing it in Ruby?

epatel
I took a course on XSLT a few years back - I still wake up screaming some nights. My ageing brain doesn't equate it with "easy to use" I'm afraid.
Mike Woodhouse
Ha...I agree, but some people like it.
epatel
you get a vote for that comment! :)
epatel
A: 

Take the one that offers full XPath support and has some samples that get you started immediately ;)

VVS
+4  A: 

RubyInside had an article about that recently. Check it out.

webmat
+3  A: 

Hpricot is probably the best tool for you: it is easy to use and should handle a 2 MB file with no problem.

Speed-wise, libxml should be the best. I used the libxml2 binding for Python a few months ago (at that time rb-libxml was stale). The streaming interface worked best for me (LibXML::XML::Reader in the Ruby gem). It lets you process a file while it is still downloading, is a bit more user-friendly than SAX, and allowed me to load data from a 30 MB XML file on the internet into a MySQL database in a little more than a minute.
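To give a feel for what pull-style streaming looks like, here is a rough sketch. It uses REXML's pull parser from the standard Ruby distribution as a stand-in for LibXML::XML::Reader, so it is not the libxml API and will not be as fast, but the shape of the loop is similar: you ask for the next event and react to it, without building the whole tree. The Tasks/Task/Name elements are made-up sample data, and the StringIO stands in for a file or socket.

```ruby
require 'rexml/parsers/pullparser'
require 'stringio'

# Illustrative sample data (an assumption, not taken from a real export).
xml = <<~XML
  <Project xmlns="http://schemas.microsoft.com/project">
    <Tasks>
      <Task><Name>Design</Name></Task>
      <Task><Name>Build</Name></Task>
    </Tasks>
  </Project>
XML

names = []
in_name = false

# The parser reads from any IO-like object and hands back one event at a time.
parser = REXML::Parsers::PullParser.new(StringIO.new(xml))

while parser.has_next?
  event = parser.pull
  if event.start_element? && event[0] == 'Name'
    in_name = true            # entered a <Name> element
  elsif event.end_element? && event[0] == 'Name'
    in_name = false           # left it again
  elsif event.text? && in_name
    names << event[0]         # text content inside <Name>
  end
end
```

In a real import you would write each record to the database as soon as its closing tag arrives instead of collecting everything in an array.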

dimus
+2  A: 

Nokogiri wraps libxml2 and libxslt with a clean, Rubyish API that supports namespaces, XPath and CSS3 queries. Fast, too. http://nokogiri.org/

Thomas