I am wondering how to use Ruby to scrape a website, with the goal of launching a new browser with the destination page loaded. This is needed because the destination page is not stateless and requires a number of session parameters.

For an example of the flow, see how Kayak.com does this:

1. Go to Kayak.com and search for a hotel in Chicago, checking in on 1/21/2010 and checking out on 1/22/2010.
2. Select the first result and choose Orbitz.
3. Kayak takes you to the booking page on Orbitz. To do so, it has to build a session, since Orbitz does not have permalinks to its booking pages.

Any thoughts on how to do this with Ruby?

+2  A: 

Take a look at this library: http://mechanize.rubyforge.org/mechanize/

bobbywilson0
+1, and watch the Mechanize screencast at http://railscasts.com/episodes/191-mechanize
khelll
A: 

You may want to check out Mechanize, a Ruby gem for scraping that acts like a browser and persists the session; here you can find a good screencast.
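
For reference, a minimal sketch of driving a site with Mechanize: fetch a page, find a form, fill it in, and submit it. The URL and field name below are hypothetical placeholders.

    require 'mechanize'

    agent = Mechanize.new
    page = agent.get('http://www.example.com/search')  # hypothetical entry page

    form = page.forms.first            # grab the search form on the page
    form['q'] = 'hotel chicago'        # hypothetical field name
    results = agent.submit(form)       # Mechanize keeps cookies across requests

    puts results.uri                   # the URL Mechanize ended up on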

makevoid
Thanks, I am already using Mechanize. I should have been clearer: what I want to do is use Mechanize to get to a certain point on a website, then redirect the user to that page. For example, my Rails site would take some moving-truck rental search parameters, run the search on Uhaul.com, and then redirect the user to the search results page on Uhaul.com. Any idea how to do that with Mechanize and the redirect_to_url method?
DougB
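
What DougB describes might look roughly like the controller sketch below. The form and parameter names are hypothetical, and one caveat applies: the user's browser will not share Mechanize's cookie jar, so handing the user off only works if the target site carries its session state in the URL rather than in cookies.

    # Hypothetical Rails controller action: walk the remote site with
    # Mechanize, then send the user to the URL Mechanize ended up on.
    class TrucksController < ApplicationController
      def search
        agent = Mechanize.new
        page = agent.get('http://www.uhaul.com/')    # assumed entry point

        form = page.forms.first                      # hypothetical form
        form['pickup_location'] = params[:pickup]    # hypothetical field names
        form['dropoff_location'] = params[:dropoff]
        results = agent.submit(form)

        # Hand the user off to wherever the search landed.
        redirect_to results.uri.to_s
      end
    end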
It's hard because the truck search form seems to do something with JavaScript, so you can't Mechanize it the usual way, but you can try posting the params manually to build the request: agent.post(url, {:param => "value"}). Good luck!
makevoid
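
A sketch of makevoid's suggestion, posting the parameters directly when the form is assembled by JavaScript. The URL and parameter names are hypothetical; the real ones would have to be dug out of the page source or a browser's network inspector.

    require 'mechanize'

    agent = Mechanize.new
    # Mechanize#post sends a form-encoded POST with the given parameters.
    results = agent.post('http://www.example.com/truck-search',
                         'pickup'  => 'Chicago, IL',
                         'dropoff' => 'Denver, CO')

    puts results.uri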
A: 

The art of scraping a web page lies in identifying which parameters on the page are used to create a given response, finding them in the raw page source, and then scraping with every available combination of those parameters. You probably don't want a session variable as such, because most sites will discard old sessions after a certain time; instead you want to be able to create the search string that redirects to the relevant results page, or just a straight URL for the results page in question.
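
For illustration, once the parameter names have been identified, such a "straight URL" for a results page can be assembled with the Ruby standard library. The host, path, and parameter names here are hypothetical.

    require 'uri'

    params = {
      'city'     => 'Chicago',
      'checkin'  => '2010-01-21',
      'checkout' => '2010-01-22'
    }
    url = 'http://www.example.com/hotels?' + URI.encode_www_form(params)
    # => "http://www.example.com/hotels?city=Chicago&checkin=2010-01-21&checkout=2010-01-22"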

I would expect to need some kind of configuration for each site you want to scrape data from, as they will all vary in design and parameter names. Some may offer a partner web service to make your job easier; it is well worth using one where possible, as it is likely to be more reliable and less susceptible to changes in the site's design.

Even with tools like Mechanize, as mentioned above, expect quite a lot of somewhat dirty manual configuration to get everything working nicely: many of the sites you work with are unlikely to have clean HTML, and there is a good chance you will need to hunt down JavaScript or AJAX links around the place as well.

glenatron