ansaurus

Question

Single perl regex for removing escaped ampersands from inside href attributes but not elsewhere

Answer 1

+3 A:

Adapting your close approximation:

while (s/(?<=href=")([^"]*?)&amp;/$1&/) {}

This is a cheat, but it is a single regex. The key part is the non-greedy scan for characters that are not a closing double quote, followed by the &amp; string. The other observation to make is that given the input:

<a href="http://example.com/index.html?x=y&amp;amp;amp;amp;y=z">

You will get out:

<a href="http://example.com/index.html?x=y&y=z">

You have to decide whether that matters.

The difficulty with any non-iterative solution is that once you've read the 'href="' in the first match, you won't be seeing it again for subsequent matches.
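A minimal sketch of the loop in action, assuming a hypothetical helper named unescape_href_loop (the name and the sample input are illustrative, not from the question). It shows both points made above: text outside the href is left alone, and repeated escapes inside it collapse all the way down to a bare &.

```perl
use strict;
use warnings;

# Hypothetical helper: reapplies the substitution until no &amp;
# remains inside any href="..." value.
sub unescape_href_loop {
    my ($html) = @_;
    # Each pass rewrites exactly one &amp; that follows href=" with
    # no closing double quote in between.
    1 while $html =~ s/(?<=href=")([^"]*?)&amp;/$1&/;
    return $html;
}

my $in = '<p>Q&amp;A</p> <a href="http://example.com/?x=y&amp;amp;y=z&amp;a=b">';
print unescape_href_loop($in), "\n";
# prints: <p>Q&amp;A</p> <a href="http://example.com/?x=y&y=z&a=b">
```

Because every pass restarts the search from the beginning of the string, the cost grows with the number of escaped ampersands; for short snippets that is unlikely to matter.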

Jonathan Leffler 2009-05-19 03:07:26

Jonathan, I love your answer. I'd come up with a similar regex when I was playing around with the problem but didn't think to drop it in a while loop. I'm still curious though if there's a way with just a single regex. Thanks!

Hans Lawrenz 2009-05-19 13:33:40

Answer 2

+1 A:

Don't try to parse non-regular languages with regular expressions. Get an HTML parser from CPAN, then operate just on the element you need.

Svante 2009-05-19 07:14:25

My goal here is just to learn if this is possible. I'm not so concerned with the *correct* way to work with HTML. The HTML is really just for example's sake. I appreciate your answer though.

Hans Lawrenz 2009-05-19 13:31:55

@hrwl then consider this an important lesson: a regex is not appropriate for parsing HTML. You don't learn to use a screwdriver by using it to drive nails.

Chas. Owens 2009-07-01 02:00:58

I'd more likely say the lesson is to use more abstract examples. The goal was never to learn about parsing HTML.

Hans Lawrenz 2009-07-01 19:04:14

Answer 3

+2 A:

This regex will do what you want in a single line of Perl code, without the inefficient while loop (which makes the regex begin from the start each time) or the lookbehind:

s/((href="|\G)[^"]*?&)amp;/$1/g;

The trick is to use \G to make the regex "remember" that it was inside an href attribute.

This regex also unescapes each occurrence exactly once, so &amp;amp; correctly becomes &amp; rather than collapsing to a bare &.

The only imperfection is that if &amp; occurs at the very start of the subject string (before the first double quote), it'll be replaced too, because \G also matches at the start of the string. If you want to avoid that, use:

s/((href="|\G(?!\A))[^"]*?&)amp;/$1/g;
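To make the difference between the two variants concrete, here is a small sketch that wraps each regex in a hypothetical helper (the sub names and sample input are assumptions for illustration only).

```perl
use strict;
use warnings;

# First variant: \G also matches at the very start of the string,
# so an &amp; before the first double quote is rewritten as well.
sub unescape_href_loose {
    my ($html) = @_;
    $html =~ s/((href="|\G)[^"]*?&)amp;/$1/g;
    return $html;
}

# Second variant: (?!\A) stops \G from matching at the start of the
# string, so only text following href=" is touched.
sub unescape_href_strict {
    my ($html) = @_;
    $html =~ s/((href="|\G(?!\A))[^"]*?&)amp;/$1/g;
    return $html;
}

my $in = 'Q&amp;A <a href="?a=1&amp;b=2&amp;amp;c=3" title="x&amp;y">';
print unescape_href_loose($in),  "\n";
# prints: Q&A <a href="?a=1&b=2&amp;c=3" title="x&amp;y">
print unescape_href_strict($in), "\n";
# prints: Q&amp;A <a href="?a=1&b=2&amp;c=3" title="x&amp;y">
```

In both variants the title attribute is left alone: once [^"]*? reaches the closing quote of the href, \G can no longer continue the chain, and a new match has to start at another href=".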

Jan Goyvaerts 2009-07-01 01:33:41

Answer 4

+1 A:

OK. First of all, &amp; in hrefs is perfectly fine, so I don't understand why you want to change it. In fact, HTML with a bare & in hrefs wouldn't be valid!

Second, if you need this for anything real, you should use a proper HTML parser.

Third, what you want can be done quite easily, but not very nicely:

s{href="([^"]*)"}{my $q=$1; $q =~ s/\&amp;/&/g; 'href="' . $q . '"'}eg;

But, please: the fact that it is technically possible doesn't mean that you should use it.
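For comparison, the same nested-substitution idea can be sketched with the non-destructive /r modifier, which requires Perl 5.14 or later. The helper name unescape_href_eval is hypothetical, and this is an illustrative variation rather than the code from the answer.

```perl
use 5.014;    # the /r modifier needs Perl 5.14+
use strict;
use warnings;

# Hypothetical helper: the outer match captures one whole href value,
# and the /e replacement unescapes &amp; inside that value only.
sub unescape_href_eval {
    my ($html) = @_;
    # s///r returns the modified copy and leaves $1 itself untouched.
    $html =~ s{href="([^"]*)"}{ 'href="' . ($1 =~ s/&amp;/&/gr) . '"' }eg;
    return $html;
}

print unescape_href_eval('Q&amp;A <a href="?a=1&amp;b=2&amp;amp;c=3">'), "\n";
```

As with the \G approach, each &amp; is unescaped once, so a doubly escaped &amp;amp; ends up as &amp; rather than a bare &.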

depesz 2009-07-01 07:46:39
