$a = "<no> 3232 </no> "

$a =~ s/<no>(.*)</no>/000/gi ;

I am expecting that $a becomes "<no> 000 </no> ", but it is not working.

+3  A: 

Firstly, the / in </no> is being interpreted as the end of your pattern, and that's causing syntax errors. Choose a different delimiter for your substitution operator:

s|<no>.*</no>|000|gi;

But then you have a set of capturing brackets and you're not using what they are capturing. Which makes me think that perhaps even fixing the syntax won't give you the behaviour you want. You don't want to replace the tags, so you can add those to the replacement:

s|<no>.*</no>|<no>000</no>|gi;

Or not replace them at all by using lookarounds so they aren't part of the matched text:

s|(?<=<no>).*(?=</no>)|000|gi;
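To see how the three substitutions above differ, you can run each one against a copy of the question's string. This is only a sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $input = "<no> 3232 </no> ";

# 1. Syntax fixed only: the tags are part of the match, so they are replaced too.
(my $whole = $input) =~ s|<no>.*</no>|000|gi;

# 2. Tags written back into the replacement text.
(my $retagged = $input) =~ s|<no>.*</no>|<no>000</no>|gi;

# 3. Lookarounds: the tags are asserted but never consumed by the match.
(my $lookaround = $input) =~ s|(?<=<no>).*(?=</no>)|000|gi;

print "[$whole]\n";        # [000 ]
print "[$retagged]\n";     # [<no>000</no> ]
print "[$lookaround]\n";   # [<no>000</no> ]
```

Note that the last two also swallow the spaces around 3232, since those spaces sit between the tags.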

But given that "it's not working" isn't a very good description of the problem, I don't know what you're expecting to see.

davorg
This solution also removes the `<no>` and `</no>` tags.
mobrule
Sure it does. But it's not a solution. It's reimplementing what the original poster had but without the syntax errors. Then, perhaps, we can start to discuss what he really requires :-)
davorg
Now davorg fixes that :) I think the OP probably is doing something a lot more complicated and oversimplifying it for us.
brian d foy
+4  A: 

If you just want to replace the text between the tags, then you may want to look at lookahead and lookbehind assertions. And you need to either use a regex delimiter other than "/" or escape the "/" in the regex:

$a = "<no> 3232 </no> ";
$a =~ s#(?<=<no>).*?(?=</no>)# 000 #gi;
print "$a\n";
runrig
+9  A: 

You need look-around assertions.

$a =~ s|(?<=<no> ).*(?= </no>)|000|gi;
# $a is now "<no> 000 </no> "

Have you considered reading a Perl book or two? You are not learning effectively if you have to come to Stack Overflow to ask the sort of question that can easily be answered by reading the fine documentation.

daxim
Using a greedy match (`.*` instead of `.*?`) will almost certainly produce undesired behaviour in the presence of multiple or nested tags (you seem to have expected multiple tags, since you specified the `g` flag). Even a lazy match (`.*?`) will produce undesired behaviour in the presence of nested tags. At least limit the damage: `s/<no>[^<]*<\/no>/<no> 000 <\/no>/g` or `s/(?<=<no>)[\s\d]*(?=<\/no>)/ 000 /g`
vladr
+4  A: 

You could forgo the fancy lookahead and lookbehind assertions and come up with a slightly longer regular expression:

$str =~ s|<no>.*?</no>|<no>000</no>|gi;

It might be a little easier to read, but it's slightly counter-intuitive in that you're replacing <no>whatever</no> with <no>000</no>, i.e. you aren't just replacing the things between the <no></no>, you're replacing the whole string with another string that just so happens to have <no> and </no> in it.
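One place where that distinction becomes visible is the `i` flag: because the tags are retyped in the replacement, whatever case they had in the input is lost, whereas a lookaround version leaves them as they were. A small sketch, assuming a made-up input with upper-case tags:

```perl
use strict;
use warnings;

my $input = "<NO>3232</NO>";   # hypothetical input, not from the question

# Whole-element replacement: the tags in the output come from the replacement text.
(my $rewritten = $input) =~ s|<no>.*?</no>|<no>000</no>|gi;

# Lookaround replacement: only the text between the tags is touched.
(my $preserved = $input) =~ s|(?<=<no>).*?(?=</no>)|000|gi;

print "$rewritten\n";   # <no>000</no>
print "$preserved\n";   # <NO>000</NO>
```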

CanSpice
+1  A: 

Firstly, the / in the closing </no> tag is being treated as the end of the regular expression. Either backslash it:

$a =~ s/<no>(.*)<\/no>/000/gi;

or use a delimiter other than / for your regex:

$a =~ s~<no>(.*)</no>~000~gi;
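As a quick check that the delimiter is purely a matter of notation, here is a sketch that runs the same substitution three ways. The paired-brace form is an extra variant not mentioned above, and the variable names are only illustrative.

```perl
use strict;
use warnings;

my $input = "<no> 3232 </no> ";

# Escaped slash inside the default / delimiters.
(my $escaped = $input) =~ s/<no>(.*)<\/no>/000/gi;

# A different single-character delimiter, so / needs no escaping.
(my $tilde = $input) =~ s~<no>(.*)</no>~000~gi;

# Paired delimiters are allowed as well.
(my $braces = $input) =~ s{<no>(.*)</no>}{000}gi;

print "[$escaped] [$tilde] [$braces]\n";   # [000 ] [000 ] [000 ]
```

All three give the same result, and all three still replace the tags along with their content.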

Secondly, I'm guessing you're trying to parse an XML document with this and change data. I'm also guessing that you have many <no>...</no> sections in your document. The problem with the regular expression you gave is that the (.*) will match as much as possible, i.e. everything between the first <no> and the last </no> in your document, including any other tags in between. It also replaces the <no> and </no>.

You can use a non-greedy match, that is one that will match as little as possible. You can put a question mark after the * like so:

$a =~ s~<no>(.*?)</no>~000~gi;

Since this still replaces the <no>...</no>, you will probably want to put those back in:

$a =~ s~<no>(.*?)</no>~<no>000</no>~gi;
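The difference between the greedy and non-greedy forms only shows up once there is more than one element. A minimal sketch, assuming a made-up input with two <no> elements and another tag between them:

```perl
use strict;
use warnings;

# Hypothetical input, not from the question.
my $input = "<no>1</no><b>keep</b><no>2</no>";

# Greedy: .* runs from the first <no> to the last </no>.
(my $greedy = $input) =~ s~<no>(.*)</no>~<no>000</no>~gi;

# Non-greedy: .*? stops at the nearest </no>, so each element is handled on its own.
(my $lazy = $input) =~ s~<no>(.*?)</no>~<no>000</no>~gi;

print "$greedy\n";   # <no>000</no>
print "$lazy\n";     # <no>000</no><b>keep</b><no>000</no>
```

The greedy version silently eats the <b> element in the middle, which is the failure described above.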

In the case where your <no> is instead a regular expression, you can't just put it into your substitution string. You can either use lookarounds as suggested by others, or just capture it and put it back in using $1..$9, like so:

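$a =~ s~(<no>)(.*?)(</no>)~${1}000$3~gi;

The braces in ${1} are needed because $1000 would otherwise be read as capture group number 1000, which doesn't exist. Why $3? Because $2 is whatever you captured with (.*?). Of course, since you don't actually care about what you've captured, you can just do this:

$a =~ s~(<no>).*?(</no>)~${1}000$2~gi;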

which is probably about as efficient as you're going to get for this problem.

As an aside, it is normally a bad idea to try to parse XML with regular expressions, because XML is too varied for regular expressions to parse. I quite like XML::LibXML for processing XML documents, but it is not at all simple to get into. However, if you are confident about the precise format of your XML (or in fact it's not XML but just looks a bit like it) then regular expressions are OK as a local hack.

This is all covered in the perlre manpage, which is a must-read if you're going to do anything even remotely non-trivial with Perl regular expressions.

$ perldoc perlre

Hope all the examples help clarify things a bit.

Peter Corlett
+1  A: 

Just to keep this as simple as possible: you have a number of problems, so let's eliminate the obvious ones first.

First, you can't use the slash character ("/") by itself inside the pattern, because the slash is the delimiter that separates the parts of the s/// operator. (It has no special meaning in an ordinary quoted string; there it is the backslash that is special, as in "\n" for a newline.) When you want a literal slash in the pattern, escape it with a backslash to tell perl you really do want a slash character and not the end of the pattern. So your original code would be better written like this:

$a = "<no> 3232 </no> ";
$a =~ s/<no>(.*)<\/no>/000/gi;

Now perl will interpret the <\/no> in the pattern as </no>.

Secondly, your regex is wrong. The s/// operator tells perl to replace whatever the pattern in the first section matches with the text in the second section. As written, your instruction tells perl to replace everything the pattern matches, tags included, with "000" and store the result back in $a.
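To make that concrete, here is a small sketch of what the syntax-corrected version of the original statement actually does. The names $text and $count are only illustrative.

```perl
use strict;
use warnings;

my $text = "<no> 3232 </no> ";

# s/// edits $text in place; its return value is the number of
# substitutions that were made.
my $count = ($text =~ s/<no>(.*)<\/no>/000/gi);

print "count=$count text=[$text]\n";   # count=1 text=[000 ]
```

The tags are gone because they were part of what the pattern matched.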

The brackets you used in the regex allow you to break the expression into smaller pieces and re-arrange things. You haven't used them, but you are on the right track. To re-use the parts of the expression in the first set of slashes that you want to keep, you place brackets around them. In the second part of the expression you can refer to those "pieces" by using $1, $2 etc. to refer to the stuff within each set of brackets.

Keeping this in mind you might be tempted to come up with something like:

$a = "<no> 3232 </no> ";
$a =~ s/(<no>).*(<\/no>)/$1000$2/gi;

This is close, as suggested above, but testing will reveal that it is still not quite right; even more mystifying, the output you get this time is just </no>. This is because perl reads the replacement as $1000 followed by $2, and $1000 does not refer to anything. Putting a space or something else after the $1 will correct the problem. (The more correct way to terminate the $1 is to write it with braces, as ${1}000.)
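Perl lets you end a variable name explicitly by wrapping it in braces, which avoids the $1000 problem without adding a space. A minimal sketch against the question's string (the name $text is just for illustration):

```perl
use strict;
use warnings;

my $text = "<no> 3232 </no> ";

# ${1} ends the variable name explicitly, so the digits that follow are
# literal text rather than part of "$1000".
$text =~ s/(<no>).*(<\/no>)/${1}000$2/gi;

print "[$text]\n";   # [<no>000</no> ]
```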

The following expression will work, but you'll get a space after the first <no>, so your output will be <no> 000</no>

$a = "<no> 3232 </no> ";
$a =~ s/(<no>).*(<\/no>)/$1 000$2/gi;

My preference would be to use a variable in place of the string "000" and for that reason my code would probably look something like this:

$a = "<no> 3232 </no> ";
$b = "000";
$a =~ s/(<no>).*?(<\/no>)/$1$b$2/gi;

Using a variable makes things a bit clearer in my opinion (although they could be better named!) and also allows the replacement text (the "000") to be changed easily without having to mess with the regex. The ? in the regex is meant to ensure that the regex doesn't get "greedy" if there is more than one set of <no> elements in the string: it causes the .* to stop matching as soon as it encounters the following pattern, in this case "</no>".
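Here is a sketch of the same idea wrapped in a small subroutine and run on a string with two elements. The helper name replace_no_content and the sample input are made up for illustration, and the variables are renamed because $a and $b are also used by Perl's sort.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the content of every <no> element in $text.
sub replace_no_content {
    my ($text, $replacement) = @_;
    # $1 is followed by another variable and $2 by the end of the
    # replacement, so there is no "$1000" ambiguity here.
    $text =~ s/(<no>).*?(<\/no>)/$1$replacement$2/gi;
    return $text;
}

my $result = replace_no_content("<no> 3232 </no> <no> 77 </no>", "000");
print "$result\n";   # <no>000</no> <no>000</no>
```

With the greedy .* in place of .*? the two elements would be merged into a single <no>000</no>.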

Auctionitis