+1  Q: 

Ruby Regex Help

I want to extract the members' home site links from a site. They look like this:

<a href="http://www.ptop.se" target="_blank">

I tested the following pattern on this site:

http://www.rubular.com/

<a href="(.*?)" target="_blank">

It should output http://www.ptop.se.

Here is the code:

    require 'open-uri'
    url = "http://itproffs.se/forumv2/showprofile.aspx?memid=2683"
    open(url) { |page| content = page.read()
    links = content.scan(/<a href="(.*?)" target="_blank">/)
    links.each {|link| puts #{link} 
    }
    }

If you run this, it doesn't work. Why not?

+1  A: 

Several issues with your code

  1. #{link} is string interpolation, and it only works inside a double-quoted string, i.e. "#{link}". Outside of quotes, the '#' character starts a comment, so your puts is called with no arguments and only prints a blank line.
  2. String#scan accepts a block. Use it to loop through the matches.
  3. The page you are trying to access does not return any links that the regex would match anyway.
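Points 1 and 2 can be tried without any network access. The following is only a minimal sketch: the sample markup and the helper name `extract_links` are made up for illustration and are not part of the original code.

```ruby
# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = <<~HTML
  <a href="http://www.ptop.se" target="_blank">ptop</a>
  <a href="/internal/page">not a home site</a>
  <a href="http://example.com" target="_blank">example</a>
HTML

# extract_links is an illustrative helper name, not a library method.
def extract_links(html)
  links = []
  # With one capture group, scan yields a one-element array per match;
  # |(href)| destructures that array into the captured string.
  html.scan(/<a href="(.*?)" target="_blank">/) { |(href)| links << href }
  links
end

extract_links(SAMPLE_HTML).each do |link|
  # Interpolation only happens inside a double-quoted string.
  puts "found #{link}"
end
```

Without the block, `scan` returns an array of arrays (one inner array per match), which is why the original loop would have received arrays rather than plain strings.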

Here's something that would work:

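    require 'open-uri'

    url = "http://itproffs.se/forumv2/"
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
    URI.open(url) do |page|
      content = page.read
      # Each match is an array holding the single captured group.
      content.scan(/<a href="(.*?)" target="_blank">/) do |match|
        match.each { |link| puts link }
      end
    end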

There are better ways to do it, I am sure, but this should work.

Hope it helps

dmondark
Your first point is not true. You probably *should* use do/end for clarity, but multi-line blocks can use curly braces.
Ed Swangren
I did not know that. You're right. I apologize for the misinformation. Corrected.
dmondark
+1  A: 

I would suggest using one of the good Ruby HTML/XML parsing libraries, e.g. Hpricot or Nokogiri.

If you need to log in on the site you might be interested in a library like WWW::Mechanize.

Code example:

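    require "open-uri"
    require "hpricot"
    require "nokogiri"

    url = "http://itproffs.se/forumv2"

    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.

    # Using Hpricot: print the inner HTML of every link that opens in a new window
    doc = Hpricot(URI.open(url))
    doc.search("//a[@target='_blank']").each { |user| puts "found #{user.inner_html}" }

    # Using Nokogiri: print the text of every link that opens in a new window
    doc = Nokogiri::HTML(URI.open(url))
    doc.xpath("//a[@target='_blank']").each { |user| puts "found #{user.text}" }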
sris