This is more of a puzzle question for my curiosity than anything else. I'm looking for a single regular expression substitution that will convert entity-escaped ampersands to unescaped ampersands, but only within href attributes in an HTML file. For example:

<a href="http://example.com/index.html?foo=bar&amp;baz=qux&amp;frotz=frobnitz">
Me, myself &amp; I</a>

Would convert to:

<a href="http://example.com/index.html?foo=bar&baz=qux&frotz=frobnitz">
Me, myself &amp; I</a>

Now, I can do this in several statements, but I'm curious whether any Perl regex gurus can do it in one.

The closest I've come so far is the following regex, which doesn't work because lookbehinds can't be of variable length. Of course, it might not work even if they were allowed; I'm not sure.

s/(?<=href=".*?)&amp;(?=.*?")/&/g;

Thanks.

+3  A: 

Adapting your close approximation:

while (s/(?<=href=")([^"]*?)&amp;/$1&/) {}

This is a cheat, since the substitution runs in a loop, but it is a single regex. The key part is the non-greedy scan for characters that are not a closing double quote, followed by the &amp; string. The other observation to make is that given the input:

<a href="http://example.com/index.html?x=y&amp;amp;y=z">

You will get out:

<a href="http://example.com/index.html?x=y&y=z">

You have to decide whether that matters.

The difficulty with any non-iterative solution is that once you've read the 'href="' in the first match, you won't be seeing it again for subsequent matches.
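
A minimal sketch of the loop approach, to make the repeated-unescaping behaviour concrete. The sub name unescape_href_loop and the sample input are made up for illustration; only the regex comes from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is not from the answer): keep applying the
# substitution until no &amp; is left inside an href="..." value.
sub unescape_href_loop {
    my ($html) = @_;
    # Each pass rescans from the start of the string and rewrites one
    # &amp; that follows href=" with no closing quote in between.
    1 while $html =~ s/(?<=href=")([^"]*?)&amp;/$1&/;
    return $html;
}

print unescape_href_loop('<a href="?x=y&amp;amp;y=z">Me, myself &amp; I</a>'), "\n";
# prints: <a href="?x=y&y=z">Me, myself &amp; I</a>
```

Because every pass looks at the already-modified string, a doubly escaped &amp;amp; collapses all the way down to a bare &, while the &amp; in the link text is left alone.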

Jonathan Leffler
Jonathan, I love your answer. I'd come up with a similar regex when I was playing around with the problem but didn't think to drop it in a while loop. I'm still curious though if there's a way with just a single regex. Thanks!
Hans Lawrenz
+1  A: 

Don't try to parse non-regular languages with regular expressions. Get an HTML parser from CPAN, then operate just on the element you need.

Svante
My goal here is just to learn if this is possible. I'm not so concerned with the *correct* way to work with HTML. The HTML is really just for example sake. I appreciate your answer though.
Hans Lawrenz
@hrwl then consider this an important lesson: a regex is not appropriate for parsing HTML. You don't learn to use a screwdriver by using it to drive nails.
Chas. Owens
I'd more likely say the lesson is to use more abstract examples. The goal was never to learn about parsing HTML.
Hans Lawrenz
+2  A: 

This regex will do what you want in a single line of Perl code, without the inefficient while loop (which makes the regex begin from the start each time) or the lookbehind:

s/((href="|\G)[^"]*?&)amp;/$1/g;

The trick is to use \G to make the regex "remember" that it was inside an href attribute.

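This regex also correctly replaces &amp;amp; with &amp; because each escaped ampersand is unescaped only once.

The only imperfection is that if &amp; occurs at the very start of the subject string (before any double quote), it'll be replaced too, because \G also matches at the start of the string on the first attempt. If you want to avoid that, use: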

s/((href="|\G(?!\A))[^"]*?&)amp;/$1/g;
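
A small sketch for trying the guarded version on a couple of inputs. The sub name unescape_href_g and the sample strings are invented for illustration; the regex is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical wrapper (the name is not from the answer) around the
# \G-based substitution with the (?!\A) guard.
sub unescape_href_g {
    my ($html) = @_;
    # A match starts either at href=" or exactly where the previous match
    # ended (\G); [^"]*? cannot cross the closing quote of the attribute.
    $html =~ s/((href="|\G(?!\A))[^"]*?&)amp;/$1/g;
    return $html;
}

print unescape_href_g('<a href="?foo=bar&amp;baz=qux">Me &amp; I</a>'), "\n";
# prints: <a href="?foo=bar&baz=qux">Me &amp; I</a>

print unescape_href_g('&amp; first, then <a href="?a=1&amp;amp;b=2">x</a>'), "\n";
# prints: &amp; first, then <a href="?a=1&amp;b=2">x</a>
```

The second call shows both points made above: the leading &amp; survives thanks to the guard, and &amp;amp; inside the href is reduced by one level only, since s///g never rescans text it has already replaced.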
Jan Goyvaerts
+1  A: 

OK. First of all - the &amp; in hrefs is perfectly fine, so I don't understand why you want to change it - actually, HTML with a bare & in hrefs wouldn't be valid!

Second - if you need it for anything - you really should use some sensible HTML parser.

Third, what you want can be done quite easily, but not really nicely:

s{href="([^"]*)"}{my $q=$1; $q =~ s/\&amp;/&/g; 'href="' . $q . '"'}eg;

But, please: the fact that it is technically possible doesn't mean that you should use it.
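
For experimenting, here is a sketch of the same nested-substitution idea wrapped in a sub. The name unescape_href_eval is made up, and the inner substitution uses the /r flag (which, as far as I know, needs Perl 5.14 or later) instead of the temporary variable; treat it as one possible variant rather than the form given above.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is not from the answer): the outer s///e
# finds each href="..." and runs Perl code to build the replacement.
sub unescape_href_eval {
    my ($html) = @_;
    # The inner s///gr returns a modified copy of $1 and leaves $1 itself
    # untouched, so no temporary variable is needed.
    $html =~ s{href="([^"]*)"}{ 'href="' . ($1 =~ s/&amp;/&/gr) . '"' }eg;
    return $html;
}

print unescape_href_eval('<a href="?a=1&amp;b=2&amp;c=3">Me &amp; I</a>'), "\n";
# prints: <a href="?a=1&b=2&c=3">Me &amp; I</a>
```

Strictly speaking this is two regexes in one statement, which is probably why it reads as "not really nicely".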

depesz