ansaurus

Question

Help needed with screen scraping using anemone and nokogiri

Answer 1

A:

Without having actual HTML or a real site to hit it's hard to give exact examples. I've done what you're trying to do many times, and you really only need open-uri and nokogiri.

There are a bunch of different ways to determine how to move from one page to another, but when you know how many elements are on a page and how many pages there are I'd use a simple loop of 1200 / 20 = 60 pages. The gist of the routine looks like:

require 'open-uri'
require 'nokogiri'

1.upto(60) do |page_num|
  doc = Nokogiri::HTML(open("http://www.example.com/startpage?page=#{page_num}"))
  ... grab the data you want ...
  ... sleep n seconds to be nice ...
end

You might want to look into using Mechanize to crawl the site. It's not a crawler per se, but instead is a toolkit making it easy to navigate a site, fill in forms and submit them, deal with authentication, sessions, etc. It uses Nokogiri internally and makes it easy to walk the document and extract things using regular Nokogiri syntax.

Greg 2010-10-01 06:12:45

Thanks Greg - that should help get me started. I used Anemone for its ease of grabbing pages that match thru its "on_pages_like". I'm guessing Nokogiri can probably also do this so will poke around the docs and see what I get. Once again - thx for your help

ginga 2010-10-02 11:53:47

From your description of how the pages are laid out, you don't need to look for the next page information; They're using a standard next-page link, so just stuff the page number into a string and retrieve it. Nokogiri can easily locate the next-page link, but it doesn't look like that is necessary so don't waste time if you can avoid it. This page on Stack Overflow might help: http://stackoverflow.com/questions/2807500/following-a-link-using-nokogiri-for-scraping

Greg 2010-10-02 16:49:13

Answer 2

A:

actually Anemone has the nokogiri doc built into it. if you call page.doc that is a nokogiri doc so no need to have two nokogiri docs

Davinj 2010-10-04 05:25:01

ansaurus

tags:

views:

answers:

Help needed with screen scraping using anemone and nokogiri

related questions