I am getting completely different results from string.scan and several regex testers...

I am just trying to grab the domain from the string, it is the last word.

The regex in question:

/([a-zA-Z0-9\-]*\.)*\w{1,4}$/

The string (1 single line, verified in Ruby's runtime btw)

str = 'Show more results from software.informer.com'

The regex testers work fine, but in Ruby...

irb(main):050:0> str.scan /([a-zA-Z0-9\-]*\.)*\w{1,4}$/
=> [["informer."]]

I would think that I would get a match on software.informer.com, which is my goal.

+2  A: 

You are getting a match on software.informer.com. Check the value of $&. The return of scan is an array of the captured groups. Add capturing parentheses around the suffix, and you'll get the .com as part of the return value from scan as well.

The regex testers and Ruby are not disagreeing about the fundamental issue (the regex itself). Rather, their interfaces are differing in what they are emphasizing. When you run scan in irb, the first thing you'll see is the return value from scan (an Array of the captured subpatterns), which is not the same thing as the matched text. Regex testers are most likely oriented toward displaying the matched text.
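To make the distinction concrete, here is a minimal sketch using the pattern and string from the question. The helper name `match_parts` is made up for illustration; it only wraps `String#match` so the whole match and the first group can be seen side by side.

```ruby
# Returns [whole_match, group_1] for the first match, or nil if there is none.
# match_parts is a hypothetical helper name, not something from the thread.
def match_parts(str, pattern)
  m = str.match(pattern)
  m && [m[0], m[1]]
end

str = 'Show more results from software.informer.com'
re  = /([a-zA-Z0-9\-]*\.)*\w{1,4}$/

whole, group1 = match_parts(str, re)
puts whole      # software.informer.com  (the matched text, what $& holds)
puts group1     # informer.              (last repetition of group 1)
p str.scan(re)  # [["informer."]]        (groups only, one array per match)
```

The match itself covers the full domain; scan simply reports the captured group instead of the matched text.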

FM
Hm, I'm new to regex :/... but I still don't get why the regex testers and ruby vary, even the "ruby regex tester" is failing me. Hm, and also I want 1 match, not several. This method gets me more matches...?
Zombies
+3  A: 

Your regex is correct, the result has to do with the way String#scan behaves. From the official documentation:

"If the pattern contains groups, each individual result is itself an array containing one entry per group."

Basically, if you put parentheses around the whole regex, the first element of each array in your results will be what you expect.
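A short sketch of that suggestion, assuming the same string as in the question. The method name `scan_with_outer_group` is hypothetical; the only change to the regex is the extra pair of parentheses around everything before the anchor.

```ruby
# Wraps the question's pattern in an outer capturing group so that
# group 1 is the whole domain and group 2 is the repeated inner group.
# scan_with_outer_group is a hypothetical name used for illustration.
def scan_with_outer_group(str)
  str.scan(/(([a-zA-Z0-9\-]*\.)*\w{1,4})$/)
end

result = scan_with_outer_group('Show more results from software.informer.com')
p result                 # [["software.informer.com", "informer."]]
puts result.first.first  # software.informer.com
```

The inner group still shows up as the second element, so the caller has to pick the first entry of each result.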

Alex Reisner
Interesting... but to me, parentheses seem unavoidable, and yet they affect the way scan works. Any tips...?
Zombies
Parentheses here are a little confusing because they have two distinct functions: grouping a sub-expression for repetition, and forming the output of `scan`. One could fix this by introducing another symbol for controlling scan's output, but I think the parentheses usually work pretty well (you often end up with what you want, naturally) and introducing externally-dependent (method-related) symbols into regular expressions does not seem like a good idea.
Alex Reisner
A: 

How about doing this:

/([a-zA-Z0-9\-]*\.*\w{1,4})$/

This returns

informer.com

On your test string.

http://rubular.com/regexes/13670

marcgg
+2  A: 

It does not look as if you expect more than one result (especially as the regex is anchored). In that case there is no reason to use scan.

'Show more results from software.informer.com'[ /([a-zA-Z0-9\-]*\.)*\w{1,4}$/ ]
#=> "software.informer.com"

If you do need to use scan (in which case you obviously need to remove the anchor), you can use (?:) to create non-capturing groups.

'foo.bar.baz lala software.informer.com'.scan( /(?:[a-zA-Z0-9\-]*\.)*\w{1,4}/ )
#=> ["foo.bar.baz", "lala", "software.informer.com"]
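For the original single-domain case, the non-capturing group can also be combined with the anchor. This is only a sketch: the method name `trailing_domain_matches` is made up, and it assumes the domain is the last word of the string, as in the question.

```ruby
# Non-capturing group plus the $ anchor: with no capturing groups in the
# pattern, scan returns the matched text itself rather than arrays of groups.
# trailing_domain_matches is a hypothetical name used for illustration.
def trailing_domain_matches(str)
  str.scan(/(?:[a-zA-Z0-9\-]*\.)*\w{1,4}$/)
end

p trailing_domain_matches('Show more results from software.informer.com')
# ["software.informer.com"]
```

Because the pattern is anchored, scan yields at most one element here, so indexing the string with the regex as shown above remains the simpler choice.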
sepp2k