views:

4123

answers:

2

I am trying to parse an HTML table using Hpricot but am stuck, not able to select a table element from the page which has a specified id.

Here is my ruby code:-

require 'rubygems'
require 'mechanize'
require 'hpricot'

agent = WWW::Mechanize.new

page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')

form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])

doc = Hpricot(page.body)

puts doc.to_html # Here the doc contains the full HTML page

puts doc.search("//table[@id='gvw_offices']").first # This is NIL

Can anyone help me to identify what's wrong with this.

Thanks so much

+1  A: 

Mechanize will use hpricot internally (it's mechanize's default parser). What's more, it'll pass the hpricot stuff on to the parser, so you don't have to do it yourself:

require 'rubygems'
require 'mechanize'

#You don't really need this if you don't use hpricot directly
require 'hpricot'

agent = WWW::Mechanize.new

page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')

form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])

puts page.parser.to_html # page.parser returns the hpricot parser

puts page.at("//table[@id='gvw_offices']") # This passes through to hpricot

Also note that page.search("foo").first is equivalent to page.at("foo").

Pesto
+1  A: 

Note that Mechanize no longer uses Hpricot (it uses Nokogiri) by default in the later versions (0.9.0) and you have to explicitly specify Hpricot to continue using with:

  WWW::Mechanize.html_parser = Hpricot

Just like that, no quotes or anything around Hpricot - there's probably a module you can specify for Hpricot, because it won't work if you put this statement inside your own module declaration. Here's the best way to do it at the top of your class (before opening module or class)

require 'mechanize'
require 'hpricot'

# Later versions of Mechanize no longer use Hpricot by default
# but have an attribute we can set to use it
begin
  WWW::Mechanize.html_parser = Hpricot
rescue NoMethodError
  # must be using an older version of Mechanize that doesn't
  # have the html_parser attribute - just ignore it since 
  # this older version will use Hpricot anyway
end

By using the rescue block you ensure that if they do have an older version of mechanize, it won't barf on the nonexistent html_parser attribute. (Otherwise you need to make your code dependent on the latest version of Mechanize)

Also in the latest version, WWW::Mechanize::List was deprecated. Don't ask me why because it totally breaks backward compatibility for statements like

page.forms.name('form1').first

which used to be a common idiom that worked because Page#forms returned a mechanize List which had a "name" method. Now it returns a simple array of Forms.

I found this out the hard way, but your usage will work because you're using find which is a method of array.

But a better method for finding the first form with a given name is Page#form so your form finding line becomes

form = page.form('form1')

this method works with old an new versions.

Rhubarb