$line = " TEST: asdas :asd asdasad s";

if ($line =~ /(.*):(.*)/
{
  print "$1  = $2 "
}

I was expecting TEST =asdas :asd asdasad s

but it's not working. What is the issue?

+11  A: 

The correct way would be:

/([^:]+):(.*)/

or

/(.+?):(.*)/

This way, you're not matching "anything" on the left, you're matching "one or more non-colon characters" in the first example, or "matching the shortest possible string of any characters followed by a colon" in the second.
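
To see the difference side by side, here is a small sketch that runs the original greedy pattern and the two alternatives against the string from the question. The `captures` helper and the labels are made up for illustration; only the patterns come from the discussion above.

```perl
use strict;
use warnings;

my $line = " TEST: asdas :asd asdasad s";

# Hypothetical helper: apply a pattern and return its two captures
# (an empty list if the pattern does not match).
sub captures {
    my ($re, $text) = @_;
    return $text =~ $re ? ($1, $2) : ();
}

my @patterns = (
    [ 'greedy',        qr/(.*):(.*)/    ],
    [ 'negated class', qr/([^:]+):(.*)/ ],
    [ 'non-greedy',    qr/(.+?):(.*)/   ],
);

for my $p (@patterns) {
    my ($name, $re) = @$p;
    my ($left, $right) = captures($re, $line);
    print "$name: [$left] = [$right]\n";
}
# Prints:
# greedy: [ TEST: asdas ] = [asd asdasad s]
# negated class: [ TEST] = [ asdas :asd asdasad s]
# non-greedy: [ TEST] = [ asdas :asd asdasad s]
```

The brackets are only there to make the leading and trailing spaces in each capture visible.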

The even better way is to NOT use a regex. Use split.

my ($left,$right) = split( /:/, $line, 2 );

The ,2 says "I want at most two fields".
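
A slightly fuller sketch of the split approach, assuming you also want the surrounding whitespace trimmed so the output looks like what the question expected. The `split_key_value` name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split "key: value" at the first colon and trim both parts.
sub split_key_value {
    my ($line) = @_;
    my ($key, $value) = split /:/, $line, 2;   # limit 2: later colons stay in $value
    return unless defined $value;              # no colon in the input
    s/^\s+|\s+$//g for $key, $value;           # trim leading/trailing whitespace
    return ($key, $value);
}

my ($key, $value) = split_key_value(" TEST: asdas :asd asdasad s");
print "$key = $value\n";   # prints "TEST = asdas :asd asdasad s"
```

Without the explicit limit, assigning to two variables would cut the string at the second colon as well and throw away the rest, so the 2 matters here.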

Andy Lester
Yep, Andy's right. '.*' is "greedy": it matches as many characters as possible.
OMG_peanuts
You don't have to *both* make the left part non-greedy *and* match only non-colons, either one is sufficient.
mscha
You're right, @mscha, I've added two alternative ways of doing it.
Andy Lester
+1 for `split`.
Sinan Ünür
+2  A: 

Two problems:

  1. you need a closing parenthesis, ), at the end of your if condition
  2. you want a non-greedy expression, so that it matches as little as possible before the first colon (:)

Try $line =~ m/(.*?):(.*)/ - note the .*? - this means match the minimum required. Normally .* means match the maximum possible.

PP
+1  A: 

Making the first .* non-greedy will also work:

if ($line =~ /(.*?):(.*)/) {
  print "$1  = $2 "
}
codaddict
+1  A: 
$line = " TEST: asdas :asd asdasad s";

if ($line =~ /(.*?):(.*)/)
{
    print "$1  = $2 "
}

Use the above instead. Here (.*?) means non-greedy matching. So it will match till it finds the first ':'

Sujoy
+3  A: 

The issue is, as others have said, that you're matching everything but the line ending greedily (.*). But what they don't tell you is that when the regex engine matches everything up to the end of the line, it has to backtrack in order to satisfy the ':' condition. So after it has swallowed up all the non-linefeed characters, it starts backing up. As it is now going in reverse, the first colon it finds is the ':' right before 'asd'. The colon having been matched, it applies the second group to all the remaining non-linefeed characters, which succeeds.
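
One way to check this is to ask Perl where the colon ended up, using the `@+` match-offset array. This is only a sketch: the `matched_colon_offset` helper is a name invented here, and it reports the result of the match, not the individual backtracking steps.

```perl
use strict;
use warnings;

my $line = " TEST: asdas :asd asdasad s";

# Hypothetical helper: offset of the colon the pattern settled on.
# $+[1] is the offset just past the end of capture group 1, which is
# exactly where the literal ':' was matched. Returns -1 on no match.
sub matched_colon_offset {
    my ($re, $text) = @_;
    return $text =~ $re ? $+[1] : -1;
}

printf "first colon at %d, last colon at %d\n",
    index($line, ':'), rindex($line, ':');
printf "greedy (.*) settled on the colon at %d\n",
    matched_colon_offset(qr/(.*):(.*)/, $line);
printf "[^:]+ settled on the colon at %d\n",
    matched_colon_offset(qr/([^:]+):(.*)/, $line);
```

The greedy pattern reports the same offset as `rindex` (the last colon), while the negated character class reports the same offset as `index` (the first colon).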

Whenever you can, you want to avoid backtracking in regexes. Since you want it to match the first colon, everything else before it should not be a colon. So the non-backtracking, deterministic expression would be:

([^:]+):(.*)

Once you've seen the first colon, the greedy match is fine. However, if you had a string of spaces and non-spaces and you wanted to match up until the last non-space--thus trimming the string--you can't really specify that in a manner that won't backtrack, because you know whether you want an individual character only as a result of understanding where the character is as a part of the whole.

([^:]+):(.*\S)

When it gets to the end of input, it backtracks for the non-space that it still hasn't matched. And when it finds that, it terminates the capture.
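
A quick sketch of that trimming behaviour, using an input with trailing spaces that I made up for the purpose (the helper name is also hypothetical):

```perl
use strict;
use warnings;

# The trailing \S forces the second capture to end on a non-space character,
# so .* has to give back any trailing whitespace it grabbed.
sub key_and_trimmed_value {
    my ($text) = @_;
    return $text =~ /([^:]+):(.*\S)/ ? ($1, $2) : ();
}

my ($key, $value) = key_and_trimmed_value("TEST: asdas :asd   ");
print "[$key] = [$value]\n";   # prints "[TEST] = [ asdas :asd]"
```

Note that only the trailing spaces are dropped; the space right after the colon is still part of the capture. If the value consists of nothing but spaces, the pattern does not match at all.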

Of course this is a trivial example, and alternative expressions can reduce backtracking. You might know that only single space characters will be accepted, so you can craft an expression that will at most backtrack once, but only to conclude the match:

([^:]+):((?:\S| \S)+)

Here it looks at the next character: if it's not a space, no problem; if it is a space, then only one more character needs to be read in order to determine whether it's a keeper. And as the space-with-following-non-space is the last option, it fails and completes the match.
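
To make the single-space assumption concrete, here is a sketch that applies this pattern to the string from the question and to two made-up inputs, one with a double space and one with a trailing space. The helper name is hypothetical.

```perl
use strict;
use warnings;

# Each iteration of the group consumes either one non-space character
# or one space that is immediately followed by a non-space character.
sub key_and_single_spaced_value {
    my ($text) = @_;
    return $text =~ /([^:]+):((?:\S| \S)+)/ ? ($1, $2) : ();
}

for my $input (" TEST: asdas :asd asdasad s", "a: b  c", "a: b c ") {
    my ($key, $value) = key_and_single_spaced_value($input);
    print "[$key] = [$value]\n";
}
# Prints:
# [ TEST] = [ asdas :asd asdasad s]
# [a] = [ b]
# [a] = [ b c]
```

The second input shows the capture stopping at the double space, and the third shows a trailing space being left out of the capture.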

This post from Regex Guru has a little more on this.

Axeman
+1 for shunning backtracking, but that last regex needs work. According to RegexBuddy, it takes 48 steps to match the OP's test string. Compare that to `([^:]+):((?: \S+)+)`, which takes only 17 steps. It isn't backtracking that's hurting you, it's the alternation in `(?:\S| \S)+`, matching one or two characters per iteration of the `+`.
Alan Moore
+1: Non-greedy matching will get the job done, but it's better to use a negated character class (e.g., `[^:]`) when possible. Not only does it tend to be more efficient, it also more explicitly conveys your meaning to future programmers. ("I want non-colon characters" vs. "I want any characters, I don't care what they are".)
Dave Sherohman