
Hi.

I don't need to crawl the whole internet; I just need to open a few URLs, extract other URLs, and then save some pages in a way that they can be browsed on disk later. What library would be appropriate for programming that?

+1  A: 

Mechanize is very good for that sort of thing.

http://mechanize.rubyforge.org/mechanize/

In particular this page will help:

http://mechanize.rubyforge.org/mechanize/GUIDE_rdoc.html


Under the covers Mechanize uses Nokogiri to parse the document. Here's a simple version using Open-URI and Nokogiri to read a page, extract all links and write the HTML.

Added example:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open('http://some.web.site'))

Accessing the links is easy. This uses CSS accessors:

hrefs = (doc/'a[href]').map{ |a| a['href'] }

This uses XPath to do the same thing:

hrefs = (doc/'//a[@href]').map{ |a| a['href'] }

Saving the content is easy. Create a file, and ask Nokogiri to spit it out as HTML:

File.open('some_web_site.html', 'w') { |fo| fo.puts doc.to_html }
Greg
I had a look at your guide; it explains how to access a page, but I didn't find anything about how to save the document as a file.
Guillaume Coté
Saving the document as a file isn't part of Mechanize. It's a very simple thing YOU write once you have the content, which is available from Mechanize. An alternative is to use Open-URI to get the URL, Nokogiri to extract the links in question, then File.open() to write the content. Your question doesn't give enough information to write anything beyond rudimentary example code, but I'll add something to my answer.
Greg
Actually, the page object has an inherited save method, but the Ruby docs are less clear about those methods than Javadoc, so it took me a while to find it.
Guillaume Coté
I'm not surprised it has file capability, the people who wrote it are official "Smart People".
Greg