Hello all. I'm trying to scrape the Yellow Pages website, specifically this page: http://www.yellowpages.com/santa-barbara-ca/restaurants. My code works perfectly except for one small problem: the "Next" link that leads to the next page of restaurants is a relative link, and Scrubyt's "next_page" function doesn't follow it. Apparently it only accepts full URLs. Does anyone know of a workaround for this? It's limiting what I can scrape.

My code is as follows:

require 'rubygems'
require 'scrubyt'

yellowpages_data = Scrubyt::Extractor.define do

  # Grab the page
  fetch 'http://www.yellowpages.com/santa-barbara-ca/restaurants'

  listing "//div[@class='listing_content']" do

    # Scrape the data from the page
    name "Pascucci"
    street "//span[@class='street-address']"
    city "//span[@class='locality']"
    state "//span[@class='region']"
    zip_code "//span[@class='postal-code']"
    phone "//span[@class='business-phone phone']"

    next_page "Next", :limit => 2 # go to the next page
  end
end

puts yellowpages_data.to_xml.write($stdout, 1)
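
To illustrate the relative-link issue described above, here is a minimal sketch, using only Ruby's standard library, of how a relative href can be turned into a full URL with URI.join. The absolute_url helper and the next_href value are hypothetical; the real href of the "Next" link may look different, and whether Scrubyt's next_page can be handed the result is not something this sketch confirms.

```ruby
require 'uri'

# Hypothetical helper: resolve a possibly-relative href against the URL
# of the page it was found on. An already-absolute href is returned as-is.
def absolute_url(base_url, href)
  URI.join(base_url, href).to_s
end

base = 'http://www.yellowpages.com/santa-barbara-ca/restaurants'

# Assumed value for illustration only; inspect the page for the real href.
next_href = '/santa-barbara-ca/restaurants?page=2'

puts absolute_url(base, next_href)
```

If the resolved URL is correct, one option might be to fetch each page by its full URL in a loop instead of relying on next_page.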