I have a piece of Perl code (pattern matching) like this,

$var = "<AT>this is an at command</AT>";

if ($var =~ /<AT>([\s\w]*)<\/AT>/i)
{
    print "Matched in AT command\n";
    print "$var\n\n";
}

It works fine if the content between the tags has no hyphen. It does not match if a hyphen appears in the string between the tags, like this: <AT>this is an at-command</AT>.

Can anyone fix this regex so that it also matches when a hyphen is present?

Please help.

Senthil

+4  A: 

You can just add a hyphen to the character class:

if ($var =~ /<AT>([\s\w-]*)<\/AT>/i)

Also since your regex has a / in it you can use a different delimiter, this way you can avoid escaping /:

if ($var =~ m{<AT>([\s\w-]*)</AT>}i)
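
As a minimal sketch of the two forms above (the sample string is taken from the question; the variable names `$with_slashes` and `$with_braces` are only illustrative), both delimiters should capture the same text:

```perl
use strict;
use warnings;

# Sample input from the question; the hyphen is what the original class rejected.
my $var = '<AT>this is an at-command</AT>';

# The same pattern written with both delimiters.
# In list context the match returns the captured group.
my ($with_slashes) = $var =~ /<AT>([\s\w-]*)<\/AT>/i;
my ($with_braces)  = $var =~ m{<AT>([\s\w-]*)</AT>}i;

print "slashes: $with_slashes\n";
print "braces:  $with_braces\n";
```

The hyphen sits last in the class, so it is taken literally rather than as a range operator.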
codaddict
Thanks, codaddict. It works fine. (I think I have properly accepted the answer.)
Senthil kumar
A: 

If you want to capture everything between <AT> and </AT>, you can use

if ($var =~ /<AT>((?:(?!<AT>).)*)<\/AT>/i)

And it's ungreedy.

Colin Hebert
Your pattern is in fact greedy, but it's forced to give back what it took to satisfy the match. If, for instance, the "</AT>" is followed by 10,000 "x"s, the capture will match all 10,000, then give them up one by one until it gives up the "</AT>" and can then match the end of the pattern. `/<AT>((?:(?!<\/AT>).)*)<\/AT>/i` or `/<AT>((?:(?!<\/?AT>).)*)<\/AT>/i` will prevent it from overmatching and then backtracking. A more efficient way to write that is `/<AT>((?:[^<]*|<(?!\/?AT>))*)<\/AT>/`; it avoids testing the negative lookahead for each character that is about to be matched.
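
A small sketch of the difference described in this comment, assuming a made-up input with a stray second closing tag (the names `$loose` and `$strict` are illustrative):

```perl
use strict;
use warnings;

# Hypothetical input: one command followed by a stray closing tag.
my $text = '<AT>one</AT> junk </AT>';

# The lookahead only rejects "<AT>", so the loop runs past the first "</AT>".
my ($loose)  = $text =~ /<AT>((?:(?!<AT>).)*)<\/AT>/i;

# The lookahead rejects "</AT>", so the loop stops at the first closing tag.
my ($strict) = $text =~ /<AT>((?:(?!<\/AT>).)*)<\/AT>/i;

print "loose:  [$loose]\n";
print "strict: [$strict]\n";
```

On this input the first pattern should capture `one</AT> junk ` and the second just `one`.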
Ven'Tatsu
+4  A: 

On character class

Your pattern contains this subpattern:

[\s\w]*

The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.

\s is the shorthand for the whitespace character class; \w is the shorthand for the word character class. Neither contains the hyphen.

The * is the zero-or-more repetition specifier.

Now you should understand why this pattern does not match a hyphen: it matches zero or more characters, each of which is either a whitespace character or a word character. If you want to match a hyphen, you can include it in the character class.

[\s\w-]*

If you also want to include the period, question mark, and exclamation mark, for example, then you can simply add them in as well:

[\s\w.!?-]*
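
To see how each class behaves, here is a rough sketch that tries the three classes from this answer against a few sample strings. The helper `matching_classes`, the labels, and the third sample are made up for illustration:

```perl
use strict;
use warnings;

# The three character classes discussed above, each with an illustrative label.
my @classes = (
    [ 'plain',       qr{<AT>([\s\w]*)</AT>}i ],
    [ 'hyphen',      qr{<AT>([\s\w-]*)</AT>}i ],
    [ 'punctuation', qr{<AT>([\s\w.!?-]*)</AT>}i ],
);

# Hypothetical helper: returns the labels of the classes that match $s.
sub matching_classes {
    my ($s) = @_;
    return map { $_->[0] } grep { $s =~ $_->[1] } @classes;
}

for my $s (
    '<AT>this is an at command</AT>',
    '<AT>this is an at-command</AT>',
    '<AT>is this an at-command?</AT>',
) {
    my @hits = matching_classes($s);
    print "$s => @hits\n";
}
```

The first sample should match all three classes, the second only the last two, and the third only the punctuation class.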

Special note on hyphen

BE CAUTIOUS when including the hyphen in a character class. It is used as a regex metacharacter in character class definition to define character range. For example,

[a-z]

matches any one character in the range from 'a' to 'z', inclusive. By contrast,

[az-]

matches one of exactly 3 characters, 'a', 'z', and '-'. When you put - as the last element in a character class, it becomes a literal hyphen instead of range definition. You can also put it as the first element, or escape it (by preceding with backslash, which is the way you escape all other regex metacharacters too).

That is, the following 3 character classes are equivalent:

[az-]         [-az]         [a\-z]
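
A quick way to check this, sketched with two made-up test words (`a-z` and `amz`; the helper `check` is hypothetical):

```perl
use strict;
use warnings;

# Three spellings of "a, z, or a literal hyphen", and one real range.
my @literal = (qr/^[az-]+$/, qr/^[-az]+$/, qr/^[a\-z]+$/);
my $range   = qr/^[a-z]+$/;

# Hypothetical helper: yes/no for each literal class, then for the range.
sub check {
    my ($word) = @_;
    return map { $word =~ $_ ? 'yes' : 'no' } (@literal, $range);
}

for my $word ('a-z', 'amz') {
    my @r = check($word);
    printf "%-4s literal classes: %s %s %s | range: %s\n", $word, @r;
}
```

The word `a-z` should match all three literal classes but not the range, while `amz` should match only the range.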

polygenelubricants
Shouldn't `[a-\z]` be `[a\-z]`?
codaddict
@codaddict: Correct, thanks for pointing that out.
polygenelubricants
This explanation is pretty good. Thank you, buddy.
Senthil kumar
A: 

You need to add more characters to your class like [\s\w-]* (as codaddict told you).

Moreover, you may want to use a lookahead to match the end of your command ("I want to match this only if it is followed by the ending statement"), like this:

if ($var =~ /<AT>([^<]*)(?=<\/AT>)/i)

[^<] stands for "any character (including the hyphen) except <".

You could even add a lookbehind:

if ($var =~ /(?<=<AT>)([^<]*)(?=<\/AT>)/i)
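
As an illustrative sketch (the variable names are made up, and the pattern is written with the delimiters around the whole expression), the practical effect of the lookarounds is that the tags are checked but not consumed, so the overall match `$&` contains only the content:

```perl
use strict;
use warnings;

my $var = '<AT>this is an at-command</AT>';
my ($lookaround_match, $plain_match);

# With lookbehind and lookahead, the tags are asserted but not part of the match.
if ($var =~ /(?<=<AT>)([^<]*)(?=<\/AT>)/i) {
    $lookaround_match = $&;
}

# Without lookarounds, the tags are consumed and appear in the overall match.
if ($var =~ /<AT>([^<]*)<\/AT>/i) {
    $plain_match = $&;
}

print "with lookarounds:    $lookaround_match\n";
print "without lookarounds: $plain_match\n";
```

In both cases `$1` holds the same captured text; only the overall match differs.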

For more complex things (since you seem to want a small parser), you should look at grammar theory and at lex/yacc.

Elenaher
+1  A: 

Use \S instead of \w.

if ($var =~ /<AT>([\s\S]*)<\/AT>/i) {
Divya Saxena
Thanks for this modified code.
Senthil kumar