tags:

views:

974

answers:

5

I have the following text

tooooooooooooon

According to this book I'm reading, when the ? follows after any quantifier, it becomes non greedy.

My regex to*?n is still returning tooooooooooooon.

It should return ton shouldn't it?

Any idea why?

+1  A: 

The string you are searching in (the haystack as it were) does not contain the substring "ton".

It does however contain the substring "tooooooooooooon".

Hank
As a further to this, your regular expression can only return something that it can find. It cannot find 'ton' in your string, because it doesn't exist, but it CAN find 'tooooooooooooon'.
Matthew Scharley
Great comment, I'll add that for clarity.
Hank
+19  A: 

A regular expression can only match a fragment of text that actually exists.

Because the substring 'ton' doesn't exist anywhere in your string, it can't be the result of a match. A match will only return a substring of the original string

EDIT: To be clear, if you were using the string below, with an extra 'n'

toooooooonoooooon

this regular expression (which doesn't specify 'o's)

t.*n

would match the following (as many characters as possible before an 'n')

toooooooonoooooon

but the regular expression

t.*?n

would only match the following (as few characters as possible before an 'n')

toooooooon
Gareth
you could also use a regx to replace, but it will only replace matches.
Ady
Very clear explanation! I think your last example should be changed from "t.*?n" to "to*?n" -- as it is, "t.*?n" matches the entire string. See my answer for details.
Adam Liss
Adam: no, try it for yourself in your favourite language -- I used http://www.rubular.com/regexes/4760 -- yours should be the same. All *? does is find the shortest possible match rather than the longest.
Gareth
+3  A: 

Regexps try to match everything in them. Because there are no less 'o's to match than every o in toooon to match the n, everything is matched. Also, because you are using o*? instead of o+? you are not requiring an o to be present.

Example, in Perl

$a = "toooooo";
$b = "toooooon";

if ($a =~ m/(to*?)/) {
        print $1,"\n";
}
if ($b =~ m/(to*?n)/) {
        print $1,"\n";
}

~>perl ex.pl
t
toooooon
Vinko Vrsalovic
I would use the same string to test against. Makes it clearer what your trying to highlight.
Brad Gilbert
+3  A: 

A regular expression es always eager to match.

Your expression says this:

A 't', followed by *as few as possible* 'o's, followed by a 'n'.

That means any o's necessary will be matched, because there is an 'n' at the end, which the expression is eager to reach. Matching all the o's is it's only possibility to succeed.

Tomalak
A: 

The Regex always does its best to match. The only thing you are doing in this case would be slowing your parser down, by having it backtrack into the /o*?/ node. Once for every single 'o' in "tooooon". Whereas with normal matching, it would take as many 'o's, as it can, the first time through. Since the next element to match against is 'n', which won't be matched by 'o', there is little point in trying to use minimal matching. Actually, when the normal matching fails, it would take quite a while for it to fail. It has to backtrack through every 'o', until there is none left to backtrack through. In this case I would actually use maximal matching /to*+n/. The 'o' would take all it could, and never give any of it back. This would make it so that when it fails it fails quickly.

Minimal RE succeeding:

'toooooon' ~~ /to*?n/

 t  o  o  o  o  o  o  n       
{t}                           match [t]
[t]                           match [o] 0 times
[t]<n>                        fail to match [n] -> retry [o]
[t]{o}                        match [o] 1 times
[t][o]<n>                     fail to match [n] -> retry [o]
[t][o]{o}                     match [o] 2 times
[t][o][o]<n>                  fail to match [n] -> retry [o]

. . . .

[t][o][o][o][o]{o}            match [o] 5 times
[t][o][o][o][o][o]<n>         fail to match [n] -> retry [o]
[t][o][o][o][o][o]{o}         match [o] 6 times
[t][o][o][o][o][o][o]{n}      match [n]

Normal RE succeeding:

(NOTE: Similar for Maximal RE)

'toooooon' ~~ /to*n/

 t  o  o  o  o  o  o  n       
{t}                           match [t]
[t]{o}{o}{o}{o}{o}{o}         match [o] 6 times
[t][o][o][o][o][o][o]{n}      match [n]

Failure of Minimal RE:

'toooooo' ~~ /to*?n/

 t  o  o  o  o  o  o

. . . .

. . . .

[t][o][o][o][o]{o}            match [o] 5 times
[t][o][o][o][o][o]<n>         fail to match [n] -> retry [o]
[t][o][o][o][o][o]{o}         match [o] 6 times
[t][o][o][o][o][o][o]<n>      fail to match [n] -> retry [o]
[t][o][o][o][o][o][o]<o>      fail to match [o] 7 times -> match failed

Failure of Normal RE:

'toooooo' ~~ /to*n/

 t  o  o  o  o  o  o       
{t}                           match [t]
[t]{o}{o}{o}{o}{o}{o}         match [o] 6 times
[t][o][o][o][o][o][o]<n>      fail to match [n] -> retry [o]
[t][o][o][o][o][o]            match [o] 5 times
[t][o][o][o][o][o]<n>         fail to match [n] -> retry [o]

. . . .

[t][o]                        match [o] 1 times
[t][o]<o>                     fail to match [n] -> retry [o]
[t]                           match [o] 0 times
[t]<n>                        fail to match [n] -> match failed

Failure of Maximal RE:

'toooooo' ~~ /to*+n/

 t  o  o  o  o  o  o
{t}                           match [t]
[t]{o}{o}{o}{o}{o}{o}         match [o] 6 times
[t][o][o][o][o][o][o]<n>      fail to match [n] -> match failed
Brad Gilbert