Hello all. I'm trying to scrape the Yellow Pages website, specifically this page: http://www.yellowpages.com/santa-barbara-ca/restaurants. My code works except for one problem. The "Next" link that leads to the next page of restaurants is a relative link, and Scrubyt's "next_page" function doesn't follow it; apparently it only accepts full URLs. Does anyone know of a workaround for this? It's really limiting what I can scrape.
My code is as follows:
require 'rubygems'
require 'scrubyt'

yellowpages_data = Scrubyt::Extractor.define do
  # Grab the page
  fetch 'http://www.yellowpages.com/santa-barbara-ca/restaurants'

  listing "//div[@class='listing_content']" do
    # Scrape the data from the page
    name "Pascucci"
    street "//span[@class='street-address']"
    city "//span[@class='locality']"
    state "//span[@class='region']"
    zip_code "//span[@class='postal-code']"
    phone "//span[@class='business-phone phone']"
    next_page "Next", :limit => 2 # go to the next page
  end
end

puts yellowpages_data.to_xml.write($stdout, 1)
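
For reference, the relative-to-absolute conversion itself is simple to do outside of Scrubyt with Ruby's standard URI library. This is only a minimal sketch of that one step: the helper name absolute_url and the "?page=2" href are made up for illustration (I don't know what the real "Next" href looks like), and I have not verified any way to hand the resolved URL back to Scrubyt's next_page.

```ruby
require 'uri'

# Hypothetical helper (not part of Scrubyt): resolve an href that may be
# relative against the URL of the page it was found on.
def absolute_url(page_url, href)
  URI.join(page_url, href).to_s
end

base = 'http://www.yellowpages.com/santa-barbara-ca/restaurants'

# The href below is an assumed example; the real "Next" link may differ.
puts absolute_url(base, '/santa-barbara-ca/restaurants?page=2')

# An href that is already absolute is left unchanged.
puts absolute_url(base, 'http://www.yellowpages.com/santa-barbara-ca/hotels')
```

If something like this is needed, one option might be to call fetch on each resolved URL in turn instead of relying on next_page, but that is an assumption on my part and not something I have tested.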