views:

1784

answers:

7

Assume I have the entire HTML of a Google search results page. Does anyone know of any existing code (Ruby?) to scrape/parse the first page of Google search results? Ideally it would handle the Shopping Results and Video Results sections that can spring up anywhere.

If not, what's the best Ruby-based tool for screen scraping in general?

To clarify: I'm aware that it's difficult/impossible to get Google search results programmatically/API-wise AND that simply CURLing results pages has a lot of issues. There's consensus on both of these points here on Stack Overflow. My question is different.

+1  A: 

This should be a very simple thing. Have a look at the "Screen Scraping with ScrAPI" screencast by Ryan Bates. You can also do without a dedicated scraping library and just stick to something simple like Nokogiri.

Update:

From nokogiri's documentation:

  require 'nokogiri'
  require 'open-uri'

  # Get a Nokogiri::HTML::Document for the page we’re interested in...

  # URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs
  doc = Nokogiri::HTML(URI.open('http://www.google.com/search?q=tenderlove'))

  # Do funky things with it using Nokogiri::XML::Node methods...

  ####
  # Search for nodes by css
  doc.css('h3.r a.l').each do |link|
    puts link.content
  end

  ####
  # Search for nodes by xpath
  doc.xpath('//h3/a[@class="l"]').each do |link|
    puts link.content
  end

  ####
  # Or mix and match.
  doc.search('h3.r a.l', '//h3/a[@class="l"]').each do |link|
    puts link.content
  end
khelll
A: 

You should be able to accomplish your goal easily with Mechanize.

Edit: Actually, if you already have the results, all you need is Hpricot or Nokogiri.

Avdi
Thanks so much!
Jimbo
You're welcome! And see my update: if you already have the results, Mechanize may be overkill.
Avdi
+3  A: 

I'm unclear as to why you want to be screen scraping in the first place. Perhaps the REST search API would be more appropriate? It will return the results in JSON format, which will be much easier to parse, and save on bandwidth. For example, if your search was 'foo bar', you could just send a GET request to http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=foo+bar and handle the response.

For more information, see this blog post or the official documentation.
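
As a rough sketch of the approach, the request URL can be built and a JSON body unpacked with Ruby's standard library alone. The host, path, and `v`/`q` parameters come from the URL above; the helper names and the sample body are hypothetical, and the `responseData`, `results`, `titleNoFormatting`, and `url` keys are assumptions about the response shape that should be checked against the official documentation.

```ruby
require 'json'
require 'uri'

# Build the search URL for a query, using the endpoint quoted above.
def search_url(query)
  URI::HTTP.build(
    host: 'ajax.googleapis.com',
    path: '/ajax/services/search/web',
    query: URI.encode_www_form([['v', '1.0'], ['q', query]])
  ).to_s
end

# Return [title, url] pairs from a JSON response body.
# ASSUMPTION: the key names below; verify them against a real response.
def extract_results(json_body)
  data = JSON.parse(json_body)
  results = data.dig('responseData', 'results') || []
  results.map { |r| [r['titleNoFormatting'], r['url']] }
end

# Hypothetical response body, made up for illustration only.
SAMPLE_BODY = <<~JSON
  {"responseData": {"results": [
    {"titleNoFormatting": "Foo Bar", "url": "http://example.com/foo"},
    {"titleNoFormatting": "Baz", "url": "http://example.com/baz"}
  ]}, "responseStatus": 200}
JSON

puts search_url('foo bar')
extract_results(SAMPLE_BODY).each { |title, url| puts "#{title}: #{url}" }
```

Fetching the URL itself is left out so the sketch runs offline; any HTTP client can supply the body passed to `extract_results`.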

pkaeding
It does not return the same results sadly. See: http://code.google.com/p/google-ajax-apis/issues/detail?id=43
Anders Rune Jensen
A: 

@pkaeding -

The issue with your solution is that the AJAX version returns only 4 results by default and 8 at most. It's a limit built into that particular solution.

There are Ruby libraries for gdata (gem install gdata) that will give you full access to the gdata API, but those are a lot more complicated than the Nokogiri solution that @khelll provided.

tfrey
A: 

Check out this professional article about Google scraping. It covers how to scrape Google, how to avoid detection, what happens when you are detected, and how to get more than the usual 1000 results out of Google.

It includes an advanced Google scraper in PHP that works out of the box, with proxy support (and an easy way to get proxies). It is everything you need to get started professionally.

Scrape Google
A: 

You can find two Ruby examples here: http://snippets.dzone.com/posts/show/4133. One of the examples uses Hpricot, the other uses scRUBYt!.

z3cko
A: 

I would suggest HTTParty plus the Google AJAX Search API.

knoopx