views: 1478

answers: 4
I have been looking at XML and HTML libraries on RubyForge for a simple way to pull data out of a web page. For example, if I want to parse a user page on Stack Overflow, how can I get the data into a usable format?

Say I want to parse my own user page for my current reputation score and badge listing. I tried to convert the source retrieved from my user page into XML, but the conversion failed due to a missing div. I know I could do a string compare and find the text I'm looking for, but there has to be a much better way to do this.

I want to incorporate this into a simple script that spits out my user data at the command line, and possibly expand it into a GUI application.

+3  A: 

Try Hpricot. It's, well... awesome.

I've used it several times for screen scraping.

ethyreal
+15  A: 

Unfortunately, Stack Overflow claims to be XML but actually isn't. Hpricot, however, can parse this tag soup into a tree of elements for you.

require 'hpricot'
require 'open-uri'

# Fetch the user page and let Hpricot build an element tree from the tag soup
doc = Hpricot(open("http://stackoverflow.com/users/19990/armin-ronacher"))
# Select the reputation cell, strip out everything but digits, and convert to an integer
reputation = (doc / "td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i

And so forth.
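The gsub at the end of that snippet is what turns the scraped cell text into a number. As a minimal sketch of just that cleanup step, with a made-up sample of the kind of whitespace-padded text Hpricot's .text might return:

```ruby
# Hypothetical text as returned from the summary cell (the value is an example)
raw = "\n      4,807\n      Reputation\n    "

# Strip everything that isn't a digit (whitespace, the thousands comma,
# the word "Reputation"), then convert the remainder to an integer
reputation = raw.gsub(/[^\d]+/, "").to_i
puts reputation  # => 4807
```

The same pattern works for any numeric field you pull out of a page, as long as the text contains exactly one number.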

Armin Ronacher
A: 

I always really like what Ilya Grigorik writes, and he wrote up a nice post about using Hpricot.

I also read another post a while back that looks like it would be useful for you.

I haven't tried either approach myself, so YMMV, but both seem pretty useful.

Cameron Booth
A: 

Something I ran into when trying this before is that few web pages are well-formed XML documents. Hpricot may be able to deal with that (I haven't used it), but when I was doing a similar project in the past (using Python and its standard library's parsing functions), it helped to have a pre-processor clean up the HTML first. I used the Python bindings for HTML Tidy for this, and it made life a lot easier. Ruby bindings exist as well, but I haven't tried them.
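A quick way to see the problem: Ruby's built-in REXML parser, like the XML conversion the question describes, rejects tag soup outright. The HTML fragment here is a made-up example:

```ruby
require 'rexml/document'

# A typical fragment of real-world tag soup: the <p> and the outer <div>
# are never closed, so this is not well-formed XML
soup = "<div><p>unclosed paragraph<div>another</div>"

begin
  REXML::Document.new(soup)
  puts "parsed cleanly"
rescue REXML::ParseException
  puts "not well-formed XML -- run it through HTML Tidy first"
end
```

This is exactly why a Tidy pass (or a forgiving parser like Hpricot) before strict XML parsing makes life easier.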

Good luck!

Atiaxi