tags:

views:

142

answers:

3

Given an alternation like /(foo|foobar|foobaz)/ does Perl 5.8 or 5.10 make any promises about which of the three will be used first, and if it does where in the documentation does it make that promise?

See the related question Does Perl 6 make any promises about the order alternations will be used?

+3  A: 

There seems to be a promise made in perldoc perlrequick:

To match dog or cat , we form the regex dog|cat . As before, perl will try to match the regex at the earliest possible point in the string. At each character position, perl will first try to match the first alternative, dog . If dog doesn't match, perl will then try the next alternative, cat . If cat doesn't match either, then the match fails and perl moves to the next position in the string.

perldoc perlretut seems to make the promise in an even stronger way (but with a caveat):

"cats"          =~ /c|ca|cat|cats/; # matches "c"
"cats"          =~ /cats|cat|ca|c/; # matches "cats"

Here, all the alternatives match at the first string position, so the first alternative is the one that matches. If some of the alternatives are truncations of the others, put the longest ones first to give them a chance to match.

"cab" =~ /a|b|c/ # matches "c"
                 # /a|b|c/ == /[abc]/

The last example points out that character classes are like alternations of characters. At a given character position, the first alternative that allows the regexp match to succeed will be the one that matches.

Chas. Owens
? but with character classes, either the class matches or it doesn't; there isn't any difference between how "0"=~/[\d0]/ and "0"=~/[0\d]/ match
ysth
Similarly there it doesn't matter how "0" =~ /(?:0|1|2|3|4|5|6|7|8|9)/ matches, either the alternation matches or it doesn't. The trick is what happens with (a|ab) vs (ab|a). I couldn't remember if Perl made a promise about how it would resolve or not. If no promise is made then the regex engine can be optimized to run matches in parallel and whichever one matches the whole regex first would win. Of course, that could be bad since it would lead to non-deterministic behaviour.
Chas. Owens
+4  A: 

http://perldoc.perl.org/perlre.html#Combining-RE-Pieces:

if we match a regular expression a|ab against "abc" , will it match substring "a" or "ab" ? One way to describe which substring is actually matched is the concept of backtracking (see "Backtracking"). However, this description is too low-level and makes you think in terms of a particular implementation.

Another description starts with notions of "better"/"worse"

Again, for elementary pieces there is no such question, since at most one match at a given position is possible. This section describes the notion of better/worse for combining operators. In the description below S and T are regular subexpressions.

...

  • S|T

When S can match, it is a better match than when only T can match.

(In context, this is qualified to only when the match using S matches at least as early in the string as that using T.)

ysth
Nice, I missed that because they didn't use either alternation or alternations.
Chas. Owens
+1, but man I find that final quoted sentence unclear -- why didn't they just say "When S can match, it is always a better match than T"?
j_random_hacker
@j_random_hacker I believe the answer to that question is "patches are welcome" (grin). There is also this bit at the end: "This document varies from difficult to understand to completely and utterly opaque. The wandering prose riddled with jargon is hard to fathom in several places."
Chas. Owens
@Chas: Thanks :) Unfortunately, my answer to "patches are welcome" is "the laziness is strong in j_random_hacker"... :)
j_random_hacker
+1  A: 

In general, the default regex engine in Perl tries to make the leftest and longest match, in that order. If it can match the leftmost alternation possibility and still satisfy the rest of the regex it will.

However, you can change the regex engine.

brian d foy
Yeah, I saw a comment on http://stackoverflow.com/questions/765910 and I couldn't remember if the behaviour was just an artifact of how the default engine works or a promise made in the docs. I was also remembering stuff from Perl 6 where alternation is changing, so I figured it would be a good set of questions to have solid pointers to the docs for. Trying to keep 5.8, 5.10, and 6 separate in my head is proving difficult. It is almost as bad as when I learned C and C++ at the same time.
Chas. Owens