My goal is to replace every trailing - with a trailing + inside tag brackets. Let's assume the line to be processed looks like this:

<h> aa- aa- </h> <h> ba- ba- </h> 

and should afterwards look like

<h> aa+ aa+ </h> <h> ba+ ba+ </h>

First I tried this expression:

s/<h>(.*?)-(.*?)<\/h>/<h>$1+$2<\/h>/g;

which yielded this output:

<h> aa+ aa- </h> <h> ba+ ba- </h>

The g option does lead to more than one substitution per line, but only for the first instance within each pair of tags (and only if both parenthesized groups contain the non-greedy question mark).

To narrow down the problem, I then tried to achieve the substitution while disregarding the tags. The expression

s/(.*?)-(.*?)/$1+$2/g; 

leads indeed to the desired result

<h> aa+ aa+ </h> <h> ba+ ba+ </h>

This will substitute outside of the tag brackets as well, of course.

So what is the problem with my first expression, and how can I achieve my goal of complete substitution within tag brackets?

A: 

Here's one way to do it: split the string into tagged bits and non-tagged bits, and perform the replacement only on the tagged bits.

$_ = join("", map {
         if (/^<h>/) {            # if it's a tagged bit...
             s/-($|\s|<)/+$1/g;   # ...replace all trailing '-'s
         }
         $_;
     } split m!(<h>.*?</h>)!);    # split into tagged and non-tagged bits
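As a rough, self-contained sketch of the same split-and-join idea, the script below runs it on the sample line from the question. The untagged `cc-` in the middle and the variable names `$line`, `@parts` and `$result` are assumptions added for illustration; they are not part of the original question.

```perl
use strict;
use warnings;

# Sample line from the question, with an extra untagged "cc-" (assumed here)
# to show that text outside <h>...</h> is left alone.
my $line = '<h> aa- aa- </h> cc- <h> ba- ba- </h>';

# The capturing parentheses make split return the tagged bits as well
# as the text between them.
my @parts = split m!(<h>.*?</h>)!, $line;

for my $part (@parts) {
    next unless $part =~ /^<h>/;       # skip untagged text
    $part =~ s/-($|\s|<)/+$1/g;        # trailing '-' becomes '+'
}

my $result = join '', @parts;
print "$result\n";
```

Because the loop variable aliases the array elements, the substitution changes `@parts` in place before the pieces are joined back together.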
Gilles
A: 

Since you're parsing XML with regular expressions (not a good idea in the general case), I assume you are willing to make a few assumptions about your input. If so, the following substitution might be good enough.

It replaces minus signs with plus signs, provided that the minus sign is: (a) at a word boundary, and (b) followed by some optional non-left-angle-bracket-text and then a close tag. No need to worry about the start tag if we can assume a valid document. The second condition is enforced with a look-ahead assertion so that the regular expression won't consume the string, allowing you to replace all such minus signs.

s/ \b- (?= [^<]* <\/h>) /+/xg;
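A minimal sketch of that substitution applied to the sample line from the question. The second input, with an untagged `cc-` between the tags, is an assumption added to show that a minus sign followed by an opening tag fails the look-ahead; the variable names are likewise illustrative.

```perl
use strict;
use warnings;

my $line  = '<h> aa- aa- </h> <h> ba- ba- </h>';
my $mixed = '<h> aa- </h> cc- <h> ba- </h>';   # hypothetical input with untagged text

# Copy each string, then substitute on the copy. The look-ahead checks that
# only non-'<' text separates the minus sign from the next </h>.
(my $fixed_line  = $line)  =~ s/ \b- (?= [^<]* <\/h>) /+/xg;
(my $fixed_mixed = $mixed) =~ s/ \b- (?= [^<]* <\/h>) /+/xg;

print "$fixed_line\n";
print "$fixed_mixed\n";
```

Note that `\b-` only requires a word character before the minus sign, so a minus inside a word such as `a-b` would also be replaced; if that matters, an extra look-ahead such as `(?=\s|<)` could restrict the match to trailing minus signs.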

Another option is to run your regex until it fails to replace anything. In a scalar context a global substitution returns the number of replacements made, which can serve as your test for when to stop processing a line:

my $n = 1;
$n = s/YOUR_REGEX/YOUR_REPLACE/g while $n;
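For illustration, here is a sketch of that loop with the regex from the question filled in. Each pass replaces one minus sign per tag pair, because a match consumes the whole `<h>...</h>` span and the g option resumes scanning after it. The `$passes` counter is an assumption added only to make the repetition visible.

```perl
use strict;
use warnings;

my $line   = '<h> aa- aa- </h> <h> ba- ba- </h>';
my $passes = 0;

# s///g returns the number of replacements, which is false once nothing
# matches any more, so the loop stops on its own.
while ($line =~ s/<h>(.*?)-(.*?)<\/h>/<h>$1+$2<\/h>/g) {
    $passes++;
}

print "$line\n";
print "passes that replaced something: $passes\n";
```

One caveat: `.*?` can run past a closing tag to reach a later minus sign, so this sketch assumes there are no minus signs in the text between tag pairs.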
FM