Good afternoon,

I'm learning to use regular expressions in Ruby and have hit a point where I need some assistance. I am trying to extract zero or more URLs from a string.

This is the code I'm using:

sStrings = ["hello world: http://www.google.com", "There is only one url in this string http://yahoo.com . Did you get that?", "The first URL in this string is http://www.bing.com and the second is http://digg.com","This one is more complicated http://is.gd/12345 http://is.gd/4567?q=1", "This string contains no urls"]
sStrings.each  do |s|
  x = s.scan(/((http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.[\w-]*)?)/ix)
  x.each do |url|
    puts url
  end
end

This is what is returned:

http://www.google.com
http
.google
nil
nil
http://yahoo.com
http
nil
nil
nil
http://www.bing.com
http
.bing
nil
nil
http://digg.com
http
nil
nil
nil
http://is.gd/12345
http
nil
/12345
nil
http://is.gd/4567
http
nil
/4567
nil

What is the best way to extract only the full URLs and not the parts of the RegEx?

Thanks

Jim

+4  A: 

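You could use non-capturing groups (?:...) instead of capturing groups (...). When a pattern contains capturing groups, String#scan returns the captured groups for each match rather than the whole match, which is why the pieces (and the nils for groups that did not participate) show up in your output.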
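To make the difference concrete, here is a minimal sketch. The sample string and the simplified pattern are made up for illustration and are not meant as a real URL matcher.

```ruby
# Sample input, made up for illustration.
text = "see http://a.example.com and https://b.example.org"

# With capturing groups, scan returns one array of group values per match.
with_groups = text.scan(/(https?):\/\/(\w+(?:\.\w+)*)/)

# With only non-capturing groups, scan returns the full matched strings.
without_groups = text.scan(/(?:https?):\/\/(?:\w+(?:\.\w+)*)/)

p with_groups
p without_groups
```

The first call yields pairs of scheme and host, while the second yields the complete URLs as plain strings.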

I see that you are doing this in order to learn regular expressions, but if you really want to extract URLs from a string, take a look at URI.extract, which extracts URIs from a string (require "uri" in order to use it).
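A minimal sketch of that approach, using one of the strings from the question. The scheme list passed as the second argument is my own choice for this example. As far as I know, recent Ruby versions mark URI.extract as obsolete (it warns in verbose mode), so check the documentation for your Ruby version before relying on it.

```ruby
require "uri"

text = "The first URL in this string is http://www.bing.com and the second is http://digg.com"

# Restrict extraction to http and https; without a scheme list,
# URI.extract may also pick up other scheme-like tokens.
urls = URI.extract(text, %w[http https])

p urls
```

This returns the full URLs as an array of strings, with no regex groups to filter out.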

dominikh
+1  A: 

You can create a non-capturing group using (?:SUB_PATTERN). Here's an illustration, with some additional simplifications thrown in. Also, since you're using the /x option, take advantage of it by laying out your regex in a readable way.

sStrings = [
    "hello world: http://www.google.com",
    "There is only one url in this string http://yahoo.com . Did you get that?",
    "... is http://www.bing.com and the second is http://digg.com",
    "This one is more complicated http://is.gd/12345 http://is.gd/4567?q=1",
    "This string contains no urls",
]

sStrings.each do |s|
    x = s.scan(/
        https?:\/\/             # scheme
        \w+                     # first host label
        (?: [.-]\w+ )*          # further labels, joined by dots or hyphens
        (?:                     # optional numeric path
            \/
            [0-9]{1,5}
            (?: \? [\w=]* )?    # optional query string
        )?
    /ix)

    # No capturing groups, so x is a flat array of full matches.
    p x
end

This is fine for learning, but don't match URLs this way in real code. There are existing tools for that, such as the URI.extract method mentioned in the other answer.

FM