I want to process every link on a whole website except the external ones. Is there an easy way to tell that a link is external and skip it?
My code so far looks like this (the site URL is passed as a command-line argument):
I am using Mechanize (0.9.3) and Ruby 1.8.6 (2008-08-11 patchlevel 287) [i386-mswin32].
Please note that the website can use relative paths, so a link may carry no host/domain at all, which makes this a bit more complicated.
require 'mechanize'

def process_page(page)
  puts
  puts page.title
  STDIN.gets
  page.links.each do |link|
    process_page($agent.get(link.href))
  end
end

$agent = WWW::Mechanize.new
$agent.user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.4) Gecko/20091016 Firefox/3.5.4'
process_page($agent.get(ARGV[0]))
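
To illustrate the relative-path problem described above, here is a minimal sketch of the kind of check I have in mind. It uses only Ruby's standard `uri` library; `external_link?` is a hypothetical helper name, not part of Mechanize. The idea is to resolve each href against the URL of the page it came from, so relative links inherit that page's host, and then compare hosts. I am assuming that anything without a matching host (other domains, `mailto:` links, missing or unparseable hrefs) should be skipped.

```ruby
require 'uri'

# Hypothetical helper (not a Mechanize API): returns true when href,
# resolved against base_url, does not point at the same host as base_url.
def external_link?(base_url, href)
  # Links without an href cannot be followed, so treat them as external.
  return true if href.nil?

  base   = URI.parse(base_url)
  # URI.join resolves relative hrefs against the base, so a relative
  # link ends up with the base page's host.
  target = URI.join(base_url, href)

  # Compare hosts case-insensitively; a nil host (e.g. mailto:) never matches.
  target.host.to_s.downcase != base.host.to_s.downcase
rescue URI::InvalidURIError
  # Assumption: an href that cannot be parsed is skipped like an external one.
  true
end
```

Inside the loop this might be used as `next if external_link?(page.uri.to_s, link.href)`, assuming `page.uri` returns the URL the page was fetched from. Note that this compares hosts exactly, so `www.example.com` and `example.com` would count as different sites.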