views: 114

answers: 2

I have a starting page of http://www.example.com/startpage which has 1220 listings, broken up by pagination in the standard way, e.g. 20 results per page.

I have code working that parses the first page of results and follows links that contain "example_guide/paris_shops" in their URL. I then use Nokogiri to pull specific data from that final page. All works well, and the 20 results are written to a file.

However, I can't seem to figure out how to get Anemone to also crawl to the next page of results (http://www.example.com/startpage?page=2), parse that page, then continue to the 3rd page (http://www.example.com/startpage?page=3), and so on.

So I'd like to ask if anyone knows how I can get Anemone to start on a page, parse all the links on that page (and go one level deeper for the specific data), but then follow the pagination to the next page of results so Anemone can start parsing again, and so on. Since the pagination links are different from the links in the results, Anemone of course doesn't follow them.

At the moment I am loading the URL for the first page of results, letting that finish, and then pasting in the URL for the 2nd page of results, and so on. Very manual and inefficient, especially for getting hundreds of pages.

Any help would be much appreciated.

require 'rubygems'
require 'anemone'
require 'nokogiri'
require 'open-uri'

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do |page|
    doc = Nokogiri::HTML(open(page.url))

    name    = doc.at_css("#top h2").text unless doc.at_css("#top h2").nil?
    address = doc.at_css(".info tr:nth-child(3) td").text unless doc.at_css(".info tr:nth-child(3) td").nil?
    website = doc.at_css("tr:nth-child(5) a").text unless doc.at_css("tr:nth-child(5) a").nil?

    open('savedwebdata.txt', 'a') { |f|
      f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}"
    }
  end
end
A: 

Without actual HTML or a real site to hit, it's hard to give exact examples. I've done what you're trying to do many times, and you really only need open-uri and nokogiri.

There are a bunch of different ways to figure out how to move from one page to the next, but when you know how many results are on a page and how many pages there are, I'd use a simple loop over the 1220 / 20 = 61 pages. The gist of the routine looks like:

require 'open-uri'
require 'nokogiri'

1.upto(61) do |page_num|
  doc = Nokogiri::HTML(open("http://www.example.com/startpage?page=#{page_num}"))
  # ... grab the data you want ...
  # ... sleep n seconds to be nice to the server ...
end
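
If it helps, here is a rough sketch of how your existing Nokogiri code might slot into that loop. The CSS selectors and the "example_guide/paris_shops" pattern are lifted from your script; the assumption that each results page links to the shop pages with hrefs containing that pattern is mine:

require 'open-uri'
require 'nokogiri'
require 'uri'

1.upto(61) do |page_num|
  listing = Nokogiri::HTML(open("http://www.example.com/startpage?page=#{page_num}"))

  # Follow each shop link on this results page (pattern taken from the Anemone filter).
  listing.css('a[href*="example_guide/paris_shops"]').each do |link|
    shop_url = URI.join("http://www.example.com/", link['href']).to_s
    doc      = Nokogiri::HTML(open(shop_url))

    name    = doc.at_css("#top h2") && doc.at_css("#top h2").text
    address = doc.at_css(".info tr:nth-child(3) td") && doc.at_css(".info tr:nth-child(3) td").text
    website = doc.at_css("tr:nth-child(5) a") && doc.at_css("tr:nth-child(5) a").text

    open('savedwebdata.txt', 'a') { |f| f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}" }
  end

  sleep 3 # be nice to the server
end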

You might want to look into using Mechanize to crawl the site. It's not a crawler per se; instead it's a toolkit that makes it easy to navigate a site, fill in and submit forms, deal with authentication, sessions, etc. It uses Nokogiri internally, so it's easy to walk the document and extract things using regular Nokogiri syntax.
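
If you do try Mechanize, a minimal sketch of the same pagination loop might look something like this (the page-number query string and the link pattern are assumptions carried over from above):

require 'mechanize'

agent = Mechanize.new

1.upto(61) do |page_num|
  page = agent.get("http://www.example.com/startpage?page=#{page_num}")

  # Mechanize wraps Nokogiri, so you can search with CSS/XPath or grab links directly.
  page.links_with(:href => /example_guide\/paris_shops/).each do |link|
    shop_page = link.click # fetches the shop page, resolving relative URLs for you
    # ... pull name/address/website out of shop_page with shop_page.at("#top h2"), etc. ...
  end

  sleep 3
end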

Greg
Thanks Greg - that should help get me started. I used Anemone for the ease of grabbing pages that match via its "on_pages_like". I'm guessing Nokogiri can probably do this too, so I'll poke around the docs and see what I get. Once again - thanks for your help
ginga
From your description of how the pages are laid out, you don't need to look for the next-page information; they're using a standard next-page link, so just stuff the page number into a string and retrieve it. Nokogiri can easily locate the next-page link, but it doesn't look like that's necessary, so don't waste time on it if you can avoid it. This page on Stack Overflow might help: http://stackoverflow.com/questions/2807500/following-a-link-using-nokogiri-for-scraping
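
For reference, finding a next-page link with Nokogiri might look something like this (the rel="next" attribute and the "Next" link text are assumptions about the site's markup):

require 'open-uri'
require 'nokogiri'
require 'uri'

url = "http://www.example.com/startpage"
doc = Nokogiri::HTML(open(url))

# Prefer a rel="next" link; fall back to a link whose text says "Next".
next_link = doc.at_css('a[rel="next"]') || doc.css('a').find { |a| a.text =~ /next/i }
next_url  = next_link && URI.join(url, next_link['href']).to_s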
Greg
A: 

Actually, Anemone has the Nokogiri doc built into it. If you call page.doc, that is a Nokogiri doc, so there's no need to create a second one.
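
So inside the on_pages_like block you can work straight from the page Anemone has already fetched and parsed, rather than re-downloading it with open-uri. A sketch using the original selectors:

require 'anemone'

Anemone.crawl("http://www.example.com/startpage", :delay => 3) do |anemone|
  anemone.on_pages_like(/example_guide\/paris_shops\/[^?]*$/) do |page|
    doc = page.doc # Anemone already parsed this page with Nokogiri

    name    = doc.at_css("#top h2") && doc.at_css("#top h2").text
    address = doc.at_css(".info tr:nth-child(3) td") && doc.at_css(".info tr:nth-child(3) td").text
    website = doc.at_css("tr:nth-child(5) a") && doc.at_css("tr:nth-child(5) a").text

    open('savedwebdata.txt', 'a') { |f|
      f.puts "#{name}\t#{address}\t#{website}\t#{Time.now}"
    }
  end
end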

Davinj