Hi.
I don't need to crawl the whole internet; I just need to open a few URLs, extract other URLs from them, and then save some pages in a way that they can be browsed on disk later. What library would be appropriate for programming that?
Mechanize is very good for that sort of thing.
http://mechanize.rubyforge.org/mechanize/
In particular this page will help:
http://mechanize.rubyforge.org/mechanize/GUIDE_rdoc.html
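For example, a minimal Mechanize sketch of this workflow might look like the following; the URL and the output filename are placeholders:
require 'mechanize'
agent = Mechanize.new
page  = agent.get('http://some.web.site')   # placeholder URL
# Every anchor on the page; hrefs may be relative to the page's URL
puts page.links.map(&:href)
# Write the fetched HTML to disk so it can be opened in a browser later
page.save('some_web_site.html')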
Under the covers Mechanize uses Nokogiri to parse the document. Here's a simple version using Open-URI and Nokogiri to read a page, extract all the links, and write out the HTML:
require 'open-uri'
require 'nokogiri'
# Kernel#open no longer handles URLs on Ruby 3.0+, so use URI.open from open-uri
doc = Nokogiri::HTML(URI.open('http://some.web.site'))
Accessing the links is easy. This uses CSS accessors:
hrefs = doc.css('a[href]').map { |a| a['href'] }
This uses XPath to do the same thing:
hrefs = doc.xpath('//a[@href]').map { |a| a['href'] }
Saving the content is easy. Create a file, and ask Nokogiri to spit it out as HTML:
File.open('some_web_site.html', 'w') { |fo| fo.puts doc.to_html }
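Putting those pieces together for the original question (open a few URLs, collect the links they contain, and save each page to disk), a rough sketch along the same lines could look like this; the seed list and the filename scheme are placeholders, and links inside the saved files are left as-is, so they still point at the live site rather than at the local copies:
require 'open-uri'
require 'nokogiri'
# Placeholder seed URLs; replace with the pages you actually want to fetch
seeds = ['http://some.web.site/']
found = []
seeds.each do |url|
  doc = Nokogiri::HTML(URI.open(url))
  # Collect every href on the page (these may be relative to the page's URL)
  found.concat(doc.css('a[href]').map { |a| a['href'] })
  # Derive a simple filename from the URL and write the HTML to disk
  name = url.gsub(/[^A-Za-z0-9]+/, '_') + '.html'
  File.open(name, 'w') { |fo| fo.puts doc.to_html }
end
puts found.uniq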