I am trying to write a regex to recognize a single line of text, with underscore ( _ ) recognized as a line continuation character. For example, "foo_\nbar" should be considered a single line, because the "foo" ends in an underscore. I am trying:

$txt = "foo_\nbar";
print "$&\n" if $txt =~ /.*(_\n.*)*/;

However, this prints only:

foo_

This seems to violate the "leftmost longest" rule for Perl regexes!

Interestingly, if I remove the last star (*) in the regex, i.e.:

$txt = "foo_\nbar";
print "$&\n" if $txt =~ /.*(_\n.*)/;

it does print:

foo_
bar

But I need the star to recognize "0 or more" continuations!

What am I doing wrong?

+4  A: 

Perl doesn't do "leftmost longest"; instead, each regex feature has a well-defined way of acting. Your initial .* will match as many times as possible, so long as the rest of the regex can match at all. Since (_\n.*)* is allowed to match zero times, the .* keeps the _ and the match ends before the newline. To prevent it from swallowing the _, do something like:

/(.*(?!(?<=_)\n)_\n)*.*/
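To see the difference side by side, here is a minimal sketch that runs the two patterns from the question and the pattern above against the same string. The variable names and labels are made up for illustration, and the group from the question is written as non-capturing so that an outer capture can stand in for `$&`; that should not change what is matched.

```perl
use strict;
use warnings;

my $txt = "foo_\nbar";

# Pattern from the question: the greedy .* takes "foo_" (it stops at the
# newline), and the starred group is then satisfied with zero repetitions.
my ($greedy) = $txt =~ /(.*(?:_\n.*)*)/;

# Same pattern without the trailing star: the group is now mandatory, so
# .* has to give back the underscore to let "_\n" match.
my ($forced) = $txt =~ /(.*(?:_\n.*))/;

# Pattern from this answer: consume the continued lines first, then the
# final line.
my ($joined) = $txt =~ /((?:.*(?!(?<=_)\n)_\n)*.*)/;

for my $pair ([original => $greedy], [no_star => $forced], [answer => $joined]) {
    my ($label, $match) = @$pair;
    $match =~ s/\n/\\n/g;    # make newlines visible in the output
    print "$label => $match\n";
}
```

Only the first pattern stops at `foo_`; the other two carry on across the continuation.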
ysth
Wow... that is some heavy regex magic...
JoelFan
Not really: `.*` matches non-newlines; `(?!` but don't end with; `(?<=_)` something preceded by a `_`; `\n` that is a newline; `)*` repeated for as many lines as possible; `.*` and get the following line too.
ysth
ZyX's is much nicer but a less literal translation of the defined problem.
ysth
+6  A: 

Why this is happening was explained by @ysth. To fix it you may use the following regex:

/([^_\n]|_.)*/s
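A small sketch of how this pattern behaves on a few inputs. The helper name `first_logical_line`, the `\A` anchor, and the sample strings are assumptions added for the demonstration, not part of the pattern itself.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first logical line of $text, treating
# "_" followed by a newline as a continuation. \A pins the match to the
# start of the string; /s lets "." match the newline after an underscore.
sub first_logical_line {
    my ($text) = @_;
    my ($line) = $text =~ /\A((?:[^_\n]|_.)*)/s;
    return $line;
}

my @samples = ("foo_\nbar", "foo\nbar", "a_b\nc", "one_\ntwo_\nthree\nfour");
for my $sample (@samples) {
    (my $in  = $sample) =~ s/\n/\\n/g;                      # visible newlines
    (my $out = first_logical_line($sample)) =~ s/\n/\\n/g;
    print "$in => $out\n";
}
```

One thing to keep in mind: because `_.` needs a character after the underscore, an underscore that is the very last character of the string is left out of the match.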
ZyX
A: 

There are two basic flavors of regular expression designs:

POSIX defines the leftmost-longest flavor. For example: changing any "a|b" to "b|a" does nothing to the full match.

Perl defines the left-biased flavor. Each "a|b" tries the left branch "a" first, and "b" is tried only if "a" (or the rest of the pattern after it) fails to match. Thus "a|b" is rarely the same as "b|a". Here a* is like ()|a|aa|aaa|aaaa|...
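The effect of alternation order can be checked directly in Perl. This is a minimal sketch; the sample string and variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $text = "ab";

# Perl tries the alternatives left to right and keeps the first one that
# lets the overall match succeed, even if a later alternative is longer.
my ($short_first) = $text =~ /(a|ab)/;
my ($long_first)  = $text =~ /(ab|a)/;

print "a|ab matched '$short_first'\n";
print "ab|a matched '$long_first'\n";
```

A leftmost-longest (POSIX-style) matcher would be expected to report the same, longer match for both orderings.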

Chris Kuklewicz
no, a* is like ...|aaaa|aaa|aa|a|(). a*? is like ()|a|aa|aaa|aaaa|....
ysth
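The greedy versus non-greedy distinction in that comment can be sketched as follows (sample strings and variable names are illustrative only):

```perl
use strict;
use warnings;

# Greedy: a* tries the longest run first and gives characters back only
# if the rest of the pattern needs them.
my ($greedy) = "aaa" =~ /(a*)/;

# Non-greedy: a*? tries the empty match first and grows only when forced.
my ($lazy) = "aaa" =~ /(a*?)/;

# Here the following "b" forces the non-greedy quantifier to grow.
my ($lazy_forced) = "aab" =~ /(a*?)b/;

print "a*    matched '$greedy'\n";
print "a*?   matched '$lazy'\n";
print "a*?b  captured '$lazy_forced'\n";
```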