Hi everyone,

I know that Hpricot is still a standard, but I remember hearing about a faster, more expressive HTML parser for Ruby.

Does anybody know what it's called, and whether it is worth switching to from Hpricot?

Thanks in advance

+6  A: 

You are probably thinking about Nokogiri. I have not used it myself, but "everyone" is talking about it and the benchmarks do look interesting:

hpricot:html:doc  48.930000 3.640000 52.570000 ( 52.900035)
hpricot2:html:doc  4.500000 0.020000  4.520000 (  4.518984)
nokogiri:html:doc  3.640000 0.130000  3.770000 (  3.770642)
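
The figures above appear to follow the layout printed by Ruby's standard `Benchmark` module: user CPU, system CPU, and total CPU time, then real elapsed time in parentheses, all in seconds. As a rough sketch of how such a report is produced (the `fake_parse` helper and the `fake:html:doc` label are hypothetical stand-ins, not part of the original benchmark):

```ruby
require 'benchmark'

# Hypothetical stand-in for "parse this document"; swap in a real parser call.
def fake_parse(html)
  html.count('<')
end

html = '<p>hello</p>' * 1_000

# Benchmark.bm prints one line per report and returns the measurements.
results = Benchmark.bm(20) do |x|
  x.report('fake:html:doc') { 100.times { fake_parse(html) } }
end
```

Replacing the body of the `report` block with a call into each parser, on the same input document, would give numbers comparable to the ones quoted.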
Wes Oldenbeuving
The other nice thing about Nokogiri is that it was built with Hpricot in mind, so the syntax is almost identical. Switching shouldn't be a big deal at all for you.
mwilliams
+2  A: 

There is also Rubyful Soup, which sells itself as a lightweight, quick-and-dirty parser. I found the interface very intuitive and 'Ruby-ish' when using it for a project in the past, which is perhaps a little surprising given that it is a Python port.

Edit: it looks like it's no longer maintained, unfortunately, so it's probably not the one you were looking for. Nokogiri is likely the one you've been hearing about.

Andy
+1  A: 

Have a look at scrubyt, which was recently released and looks promising. One added advantage is that it can even crawl pages that require you to log in; an example would be scraping a Google Analytics page using a Ruby script.

andHapp
+1  A: 

Don't use regular expressions -- Ruby's regex handling is way too slow. Hpricot is awesome and Nokogiri looks promising, though I've not used it directly yet.

sammich