ansaurus

Question

process all links but external ones (ruby + mechanize)

Answer 1

+1 A:

Use the link's uri method:

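  page.links.each do |link|
    uri = link.uri
    next if uri.nil?  # link has no href
    # Relative links have no host, so treat a nil host as internal.
    next unless uri.host.nil? || uri.host.match(/\A(www\.)?thissite\.com\z/)
    process_page($agent.get(link.href))
  end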

CodeJoust 2010-04-27 03:44:32

@CodeJoust: it looks good, but I get `in `process_page': undefined method `url' for #<WWW::Mechanize::Page::Link "" "Statement.html"> (NoMethodError)`

Radek 2010-04-27 03:51:44

OK, it looks like the method is actually .uri, but a link can also be a relative path within the web server, and then I get `undefined method 'match' for nil:NilClass (NoMethodError)` because there is no host.

Radek 2010-04-27 04:22:24

Answer 2

+1 A:

URI has some methods that make it pretty easy to see whether you are looking at a local URL or one on another site.

This is a minor modification from the URI .route_to() docs example:

require 'uri'

URI.parse('/main.rbx?page=1').host # => nil
URI.parse('main.rbx?page=1').host  # => nil

Relative URLs have no host, so I'd parse the URLs in question and check whether they have a host. If not, the link is internal to the site.
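As a rough sketch of that first check, the helper below parses an href and reports whether it has no host. The name `hostless?` is made up for illustration and is not part of URI or Mechanize. Links with other schemes, such as mailto:, may also come back without a host, so a real crawler would likely need to filter those separately.

```ruby
require 'uri'

# Hypothetical helper: true when the href carries no host, which means
# it is a relative reference that can only resolve within the current site.
def hostless?(href)
  URI.parse(href).host.nil?
end

hostless?('/main.rbx?page=1')                    # => true
hostless?('Statement.html')                      # => true
hostless?('http://another.com/main.rbx?page=1')  # => false
```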

A URL pointing to an external site will return a value for the host, but so will a full (absolute) URL for the site in question, so you have to do some more massaging.

uri = URI.parse('http://my.example.com')

uri.route_to('http://my.example.com/main.rbx?page=1').host  # => nil
uri.route_to('http://another.com/main.rbx?page=1').host # => "another.com"

If it has a host, see whether that host matches your starting URL's host. You can do that with a substring search or a regex match, but both of those can return false positives when a substring happens to match.
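To make the false-positive risk concrete, here is a small sketch using an unanchored pattern for a placeholder site, thissite.com. The constant `LOOSE` and the helper `loose_internal?` are names invented for this example, and the unrelated hosts are made up as well.

```ruby
# Unanchored pattern: it matches anywhere inside the host string.
LOOSE = /(www\.)?thissite\.com/

# Hypothetical helper: true when the loose pattern matches the host.
def loose_internal?(host)
  !(host =~ LOOSE).nil?
end

loose_internal?('thissite.com')              # => true
loose_internal?('www.thissite.com')          # => true
loose_internal?('notthissite.com')           # => true (false positive)
loose_internal?('thissite.com.example.org')  # => true (false positive)
```

The last two hosts belong to other sites, yet the loose match accepts them.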

Instead, I'd use URI's methods to avoid those false positives: use route_to() to try to build a relative path to the URL. If the result has a .host value, then it's external.
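Putting the pieces together, one possible sketch looks like the following. The helper name `external?` is hypothetical. It assumes the base URL is absolute and resolves relative hrefs with URI.join first, because route_to() raises an error when given a relative URI. Note that a different scheme (http versus https) or a different host name (example.com versus www.example.com) will be treated as external here, and non-HTTP schemes such as mailto: are not handled.

```ruby
require 'uri'

# Hypothetical helper: true when href points away from the site base_url is on.
# base_url must be absolute; href may be relative or absolute.
def external?(base_url, href)
  base   = URI.parse(base_url)
  target = URI.join(base_url, href)  # resolve relative hrefs against the base
  # route_to keeps the host in its result only when the target is on another site.
  !base.route_to(target).host.nil?
end

base = 'http://my.example.com'

external?(base, '/main.rbx?page=1')                       # => false
external?(base, 'Statement.html')                         # => false
external?(base, 'http://my.example.com/main.rbx?page=1')  # => false
external?(base, 'http://another.com/main.rbx?page=1')     # => true
```

In a Mechanize loop, a check like this could replace the regex test on link.uri.host, with the page's own URI as the base.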

Greg 2010-04-27 05:48:44

very good answer. Thank you for that.

Radek 2010-04-27 06:51:46

Thanks. It comes from having slammed into the wall a bunch of times doing it ways that I *thought* would work, but didn't. There's no guarantee this will cover every situation, but using URI helps to weed out a lot of unexpected problems. :-)

Greg 2010-04-27 19:44:35
