views: 936

answers: 4
I am trying to screen scrape a web page (using Mechanize) that displays records in a paginated grid. I am able to read the values on the first page, but now I need to navigate to the next pages to read their values. The paging links in the grid look like this:

<tr>
    <td><span>1</span></td>
    <td><a href="javascript:__doPostBack('gvw_offices','Page$2')">2</a></td>
    <td><a href="javascript:__doPostBack('gvw_offices','Page$3')" >3</a></td>
    <td><a href="javascript:__doPostBack('gvw_offices','Page$4')" >4</a></td>
    <td><a href="javascript:__doPostBack('gvw_offices','Page$5')" >5</a></td>
    <td><a href="javascript:__doPostBack('gvw_offices','Page$6')">6</a></td>
    <td><a href="javascript:__doPostBack('gvw_offices','Page$7')" >7</a></td>
    <td><a href="javascript:__doPostBack('gvw_offices','Page$8')">8</a></td>
    <td><a href="javascript:__doPostBack('gvw_offices','Page$9')" >9</a></td>
    <td><a href="javascript:__doPostBack('gvw_offices','Page$10')" >10</a></td>
    <td><a href="javascript:__doPostBack('gvw_offices','Page$11')">...</a></td>
</tr>

I am able to iterate through all the links, but when I try this:

links = (row/"a")
links.each do |link|
    agent.click link.attributes['href']   # This fails 
    agent.click link   # This also fails
end

The reason is that agent.click expects a URL as its argument, and here the href is a JavaScript postback rather than a URL.

Is there a way to read all the values when they are displayed page by page? If not, how can I perform such a click action when the href is a postback and not a URL?

+4  A: 

Mechanize cannot handle JavaScript, so basically you have two options:

  • use scrubyt and FireWatir: that way you script the browser itself (so Firefox handles the JavaScript part)
  • manually work out the base URL and dynamically add the page number

something like:

base_url = 'http://example.com/gvw_offices&page='
links.each do |link|
  page_number = ... # get the page number from link
  agent.get base_url + page_number
end
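
For the page number itself, one possibility (a sketch assuming the Hpricot-style link objects from the question) would be to pull it out of the href with a regular expression:

# href looks like "javascript:__doPostBack('gvw_offices','Page$2')"
page_number = link.attributes['href'][/Page\$(\d+)/, 1]   # => "2"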
Gaetan Dubar
It's a good solution!
Geo
The problem is that this page uses an ASP.NET grid to display records page by page, so the link for each page number is a postback and does not have a direct URL. Are you saying that if we add the grid name and page number to the URL we can trigger that postback (it didn't work when I tried it)?
MOZILLA
I'm not familiar with ASP.NET, but a postback is basically a POST request to the current page, isn't it? So you may try something like agent.post current_url, {"page_number" => page_number}
Gaetan Dubar
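
In the ASP.NET case the parameter names are fixed: __doPostBack fills in the hidden __EVENTTARGET and __EVENTARGUMENT fields of the page's form and submits it, and the __VIEWSTATE fields have to be sent back as well. A rough, untested Mechanize sketch along those lines (the URL is made up, and it assumes the usual ASP.NET hidden fields are present in the form):

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/offices.aspx')   # hypothetical URL

(2..10).each do |n|
  form = page.forms.first                   # ASP.NET pages normally have a single form
  form['__EVENTTARGET']   = 'gvw_offices'   # what __doPostBack would set
  form['__EVENTARGUMENT'] = "Page$#{n}"
  page = agent.submit(form)                 # __VIEWSTATE etc. travel with the form
  # ... read the grid rows from this page ...
end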
A: 

I'd use something like webscarab to see where the POST requests made by the JavaScript are actually going. Especially for the AJAX stuff, they are just HTTP requests anyway.
Just start it and set it as a proxy in Firefox. Most of the time you can spot a pattern and scrape those URLs directly.

Marc Seeger
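
Once the proxy shows what the postback actually sends, replaying it from Mechanize is just an ordinary POST with the captured parameter names, rather than going through the form object. A hedged sketch (the URL is made up and the values stand in for whatever was captured; some pages also require __EVENTVALIDATION to be echoed back):

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/offices.aspx')         # hypothetical URL
viewstate = page.at('input[name="__VIEWSTATE"]')['value']    # ASP.NET expects this back

page2 = agent.post('http://example.com/offices.aspx',
                   '__EVENTTARGET'   => 'gvw_offices',
                   '__EVENTARGUMENT' => 'Page$2',
                   '__VIEWSTATE'     => viewstate)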
A: 

You could try using Celerity under JRuby and passing the page to an HTML parsing library. Celerity is supposed to be API-compatible with Watir and is a wrapper around HtmlUnit. I was using Mechanize for data gathering but had to switch to this for a few of the sites that were generated with JS.

http://celerity.rubyforge.org/

Tyler
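
For reference, a minimal Celerity sketch (JRuby only; the URL, link text, and use of Nokogiri for parsing are assumptions):

# run under JRuby
require 'celerity'
require 'nokogiri'

browser = Celerity::Browser.new
browser.goto('http://example.com/offices.aspx')   # hypothetical URL

browser.link(:text, '2').click                    # HtmlUnit executes the __doPostBack
doc  = Nokogiri::HTML(browser.html)               # hand the rendered page to an HTML parser
rows = doc.search('tr')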
A: 

I have tried all of the solutions above for a good length of time (especially Celerity), but my conclusion is that they are all horrible and have serious shortcomings that make life very difficult, since they are all based on the same HtmlUnit engine for handling JavaScript.

Celerity is not a screen scraping tool per se, it is lacking in window management, and it is based on the HtmlUnit engine, which is not at all great at handling JavaScript. However, it is fast for sites that use a minimal to medium amount of JavaScript and AJAX requests. It is Ruby-based, which will be a relief for those who dislike Java.

Your best bet is to use the Selenium WebDriver API. This requires an X display on your Linux server, and it is slower than HtmlUnit, but it will not nag you with the many problems you will have using anything derived from or wrapping HtmlUnit. There is an option to use HtmlUnit, but then you sacrifice accuracy and consistency for speed; HtmlUnit is a whole lot faster for scraping.

However, speed is not always a good thing when scraping sites you do not own, since it usually invites an IP ban.

My personal advice is to steer clear of anything using the HtmlUnit engine and to use Selenium, which directly remote-controls the browser of your choice, for maximum accuracy and reliability.

Kim Jong Woo
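
For completeness, a minimal selenium-webdriver sketch of the same pagination loop (the URL and locator are assumptions; a real browser such as Firefox is required, plus an X display or Xvfb on a headless server):

require 'selenium-webdriver'
require 'nokogiri'

driver = Selenium::WebDriver.for :firefox
driver.get('http://example.com/offices.aspx')      # hypothetical URL

(2..10).each do |n|
  driver.find_element(:link_text, n.to_s).click    # the real browser runs __doPostBack
  doc = Nokogiri::HTML(driver.page_source)
  # ... read the grid rows for page n from doc ...
end

driver.quit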