Hi.
I don't need to crawl the whole internet; I just need to open a few URLs, extract other URLs from them, and then save some pages in a way that they can be browsed on disk later. What library would be appropriate for programming that?
Mechanize is very good for that sort of thing.
http://mechanize.rubyforge.org/mechanize/
In particular this page will help:
http://mechanize.rubyforge.org/mechanize/GUIDE_rdoc.html
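For example, a minimal Mechanize sketch of this workflow might look like the following; the URL and the output filename are placeholders:
require 'mechanize'
agent = Mechanize.new
page  = agent.get('http://some.web.site')   # placeholder URL
# Every anchor on the page; hrefs may be relative to the page's URL
puts page.links.map(&:href)
# Write the fetched HTML to disk so it can be opened in a browser later
page.save('some_web_site.html')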
Under the covers Mechanize uses Nokogiri to parse the document. Here's a simple version using Open-URI and Nokogiri to read a page, extract all the links, and write out the HTML:
require 'open-uri'
require 'nokogiri'
# Kernel#open no longer handles URLs on Ruby 3.0+, so use URI.open from open-uri
doc = Nokogiri::HTML(URI.open('http://some.web.site'))
Accessing the links is easy. This uses CSS accessors:
hrefs = doc.css('a[href]').map { |a| a['href'] }
This uses XPath to do the same thing:
hrefs = doc.xpath('//a[@href]').map { |a| a['href'] }
Saving the content is easy. Create a file, and ask Nokogiri to spit it out as HTML:
File.open('some_web_site.html', 'w') { |fo| fo.puts doc.to_html }
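Putting those pieces together for the original question (open a few URLs, collect the links they contain, and save each page to disk), a rough sketch along the same lines could look like this; the seed list and the filename scheme are placeholders, and links inside the saved files are left as-is, so they still point at the live site rather than at the local copies:
require 'open-uri'
require 'nokogiri'
# Placeholder seed URLs; replace with the pages you actually want to fetch
seeds = ['http://some.web.site/']
found = []
seeds.each do |url|
  doc = Nokogiri::HTML(URI.open(url))
  # Collect every href on the page (these may be relative to the page's URL)
  found.concat(doc.css('a[href]').map { |a| a['href'] })
  # Derive a simple filename from the URL and write the HTML to disk
  name = url.gsub(/[^A-Za-z0-9]+/, '_') + '.html'
  File.open(name, 'w') { |fo| fo.puts doc.to_html }
end
puts found.uniq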