I have a very simple substitution:

my $s = "<a>test</a> <a>test</a>";
$s =~ s{ <a> .+? </a> $ }{WHAT}x;

print "$s\n";

that prints:

WHAT

But I was expecting:

<a>test</a> WHAT

What am I misunderstanding about how the end-of-string anchor interacts with the non-greedy quantifier?


So, I was wrong about the regexp engine. Indeed, don't humanize code: it does exactly what you wrote, not what you think you wrote.

It simply finds the first <a>, then looks for </a>$. That first attempt succeeds, so the pattern matches.
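To see this behaviour directly, here is a small sketch that captures what the lazy part actually consumes. The helper name `lazy_capture` is made up for illustration and is not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the match start offset and the text
# consumed by the lazy .+? part, or an empty list if nothing matches.
sub lazy_capture {
    my ($s) = @_;
    return $s =~ m{ <a> (.+?) </a> $ }x ? ($-[0], $1) : ();
}

my ($offset, $captured) = lazy_capture("<a>test</a> <a>test</a>");
print "match starts at offset: $offset\n";   # 0, i.e. the first <a>
print "lazy part consumed: $captured\n";     # test</a> <a>test
```

The match starts at the first <a>, and the lazy quantifier is expanded character by character until </a>$ can finally succeed, so it ends up spanning both elements.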

The right pattern must be something like:

$s =~ s{ <a> (?! .* <a> ) .* </a> }{WHAT}x;

which correctly gives me

<a>test</a> WHAT

because now I really am asking the regexp for the last <a>.

I think it's less efficient than [^<]+, but more flexible.
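As a rough illustration of the flexibility point, the sketch below compares the two patterns on the original string and on a second, made-up input with a nested <b> tag. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the negative-lookahead pattern from above.
sub replace_with_lookahead {
    my ($s) = @_;
    $s =~ s{ <a> (?! .* <a> ) .* </a> }{WHAT}x;
    return $s;
}

# Hypothetical helper: the negated character class variant, anchored at the end.
sub replace_with_negated_class {
    my ($s) = @_;
    $s =~ s{ <a> [^<]+ </a> $ }{WHAT}x;
    return $s;
}

for my $input ("<a>test</a> <a>test</a>", "<a>test</a> <a>te<b>s</b>t</a>") {
    print "lookahead:     ", replace_with_lookahead($input), "\n";
    print "negated class: ", replace_with_negated_class($input), "\n";
}
```

Both variants agree on the original input. On the nested input only the lookahead version replaces the last element, because [^<]+ stops at the inner <b>.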

+4  A: 

The non-greedy modifier (like regexes in general) works from left to right. The engine starts at the first <a> and then looks for the shortest string after it that is followed by a </a> at the end of the string. Since a match is already possible from that first <a>, it never tries a later starting position.

This does what you would expect:

my $s="<a>test</a> <a>test</a>";
$s =~ s#<a>[^<>]+</a>$#WHAT#;

print "$s\n";

What is the problem you're trying to solve?

szbalint
There's no point in using the non-greedy modifier here, because `[^<>]` can never match `<`, which is the next character in the regex.
cjm
You're of course right.
szbalint
Alternatively, though much less efficiently, `s#^.*\K<a>.*</a>$#WHAT#`; sometimes this approach is needed.
ysth
"What is the problem you're trying to solve" - I'm trying to understand WHAT happened...
Meettya
Why [^<>]+? Just [^<]+ should be enough.
Meettya
@Meettya: yeah, `[^<]+` should be enough. I only added the other angle bracket for clarity.
szbalint
@ysth: what is "\K" in s#^.*\K<a>.*</a>$#WHAT# ?
Meettya
@Meettya: it's a 5.10ism; it tells the engine not to include anything matched before it in the part of the string that is considered matched
ysth
@ysth: Oh! I see, thank you for the new knowledge.
Meettya
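For anyone who wants to try the \K variant from the comments above, here is a minimal sketch. It assumes Perl 5.10 or later, and the wrapper name `replace_last_with_K` is made up for illustration.

```perl
use strict;
use warnings;
use 5.010;   # \K is available from Perl 5.10 onwards

# Hypothetical wrapper around the substitution quoted in the comments.
sub replace_last_with_K {
    my ($s) = @_;
    # The greedy ^.* runs to the end and backtracks to the last <a>;
    # \K then drops everything matched so far from the replaced part.
    $s =~ s#^.*\K<a>.*</a>$#WHAT#;
    return $s;
}

print replace_last_with_K("<a>test</a> <a>test</a>"), "\n";   # <a>test</a> WHAT
```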
+5  A: 

This is one of the reasons you don't use a regex to match HTML. Try using a parser instead. See this question and its answers for more reasons not to use a regex, and this question and its answers for examples of how to use an HTML parser.

Chas. Owens
Yes, yes, yes... I know about parsers and the trouble with regexes. But! I have NEVER EVER heard about a priority between the end-of-string anchor and the non-greedy option. Change the tags to simple letters and you don't have an HTML parser anymore. Theoretically there is no way to choose which one wins, so it would be logical for it to work like this: 1. go to the END of the string; 2. find the </a> tag directly at the end; 3. find the FIRST <a> tag. I mean that in this case it would be logical to change the 'leftmost' rule to a mirrored 'rightmost' one. Wouldn't it?
Meettya
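The "mirrored" search described in the comment above can be approximated by reversing the string, so that the ordinary leftmost rule does the work. This is only a sketch of that idea, not something Perl's engine does itself; the function name `replace_last_by_reversing` is made up, and the reversed tag literals are written out by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: emulate a right-to-left search by matching
# against the reversed string and reversing the result back.
sub replace_last_by_reversing {
    my ($s, $replacement) = @_;
    my $r = reverse $s;   # scalar context: reverses the characters
    # In the reversed text "</a>" reads ">a/<" and "<a>" reads ">a<".
    # Anchored at the start, the lazy .+? stops at the nearest ">a<".
    $r =~ s{ \A >a/< .+? >a< }{ scalar reverse $replacement }xe;
    return scalar reverse $r;
}

print replace_last_by_reversing("<a>test</a> <a>test</a>", "WHAT"), "\n";
# prints: <a>test</a> WHAT
```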