ansaurus

Question

Answer 1

+11 A:

The correct way would be:

/([^:]+):(.*)/

or

/(.+?):(.*)/

This way, you're not matching "anything" on the left, you're matching "one or more non-colon characters" in the first example, or "matching the shortest possible string of any characters followed by a colon" in the second.
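A minimal sketch of both patterns, using the sample line that appears elsewhere in this thread. The helper name `first_colon_parts` and the two-line input are made up for illustration. On a single line the two patterns agree; they diverge once the input contains a newline, because `[^:]` matches a newline and `.` (without `/s`) does not.

```perl
use strict;
use warnings;

# Hypothetical helper: apply a pattern with two capture groups and
# return both captures, or an empty list when the match fails.
sub first_colon_parts {
    my ($line, $re) = @_;
    return $line =~ $re ? ($1, $2) : ();
}

my $line = " TEST: asdas :asd asdasad s";

# On a single line, both patterns stop at the first colon.
for my $re (qr/([^:]+):(.*)/, qr/(.+?):(.*)/) {
    my ($left, $right) = first_colon_parts($line, $re);
    print "[$left] = [$right]\n";
}

# With a newline in the input the left-hand captures differ:
# the negated class runs across the newline, the lazy dot cannot.
my @negated = first_colon_parts("a\nb:c", qr/([^:]+):(.*)/);   # left is "a\nb"
my @lazy    = first_colon_parts("a\nb:c", qr/(.+?):(.*)/);     # left is "b"
```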

The even better way is to NOT use a regex. Use split.

my ($left,$right) = split( /:/, $line, 2 );

The ,2 says "I want at most two fields".
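To see what the limit does, here is a small sketch using the sample line from elsewhere in this thread (the no-colon input is made up for illustration). One thing to keep in mind: unlike a regex match, `split` does not report failure, so a line without a colon simply leaves the second variable undefined.

```perl
use strict;
use warnings;

my $line = " TEST: asdas :asd asdasad s";

# With a limit of 2, everything after the first colon stays in $right.
my ($left, $right) = split /:/, $line, 2;
print "[$left] = [$right]\n";

# Without the limit, every colon separates a field.
my @fields = split /:/, $line;
print scalar(@fields), " fields\n";

# A line with no colon yields a single field, so $value is undef here.
my ($key, $value) = split /:/, "no colon here", 2;
print "no separator found\n" unless defined $value;
```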

Andy Lester 2010-10-20 15:50:13

Yep, Andy's right. '.*' is "greedy": it matches as many characters as possible.

OMG_peanuts 2010-10-20 15:53:19

You don't have to *both* make the left part non-greedy *and* match only non-colons; either one is sufficient.

mscha 2010-10-20 16:12:04

You're right, @mscha, I've added two alternative ways of doing it.

Andy Lester 2010-10-20 16:14:06

+1 for `split`.

Sinan Ünür 2010-10-21 14:52:21

Answer 2

+2 A:

Two problems:

1. You needed a closing parenthesis, ), at the end of your if statement.
2. You want a non-greedy expression, so that it matches the least amount before the first colon (:).

Try $line =~ m/(.*?):(.*)/ - note the .*? - this means match the minimum required. Normally .* means match the maximum possible.

PP 2010-10-20 15:51:38

Answer 3

+1 A:

Making the first .* non-greedy will also work:

if ($line =~ /(.*?):(.*)/) {
  print "$1  = $2 "
}

codaddict 2010-10-20 15:54:21

Answer 4

+1 A:

$line = " TEST: asdas :asd asdasad s";

if ($line =~ /(.*?):(.*)/)
{
    print "$1  = $2 "
}

Use the above instead. Here (.*?) means non-greedy matching, so it matches only up to the first ':'.

Sujoy 2010-10-20 15:56:05

Answer 5

+3 A:

The issue is, as others have said, that you're matching everything but the line ending greedily with (.*). What they don't tell you is that when the regex engine matches everything up to the end of the line, it has to backtrack in order to satisfy the ':' condition. So after it has swallowed up all the non-linefeed characters, it starts backing up. As it is now going in reverse, the first colon it finds is the ':' right before 'asd'. The colon having been matched, it applies the second group to all the remaining non-linefeed characters, which it satisfies.
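A rough way to observe this, assuming the original pattern was `(.*):(.*)` with both groups greedy (the question's exact code isn't shown here). The sketch compares where group 1 ends, via `$+[1]`, against the positions of the first and last colon.

```perl
use strict;
use warnings;

my $line = " TEST: asdas :asd asdasad s";

my ($greedy_end, $lazy_end);

# Assumed original pattern: the first group is greedy.
if ($line =~ /(.*):(.*)/) {
    $greedy_end = $+[1];   # offset just past the end of group 1
    print "greedy: [$1] = [$2]\n";
}

# Same pattern with a lazy first group.
if ($line =~ /(.*?):(.*)/) {
    $lazy_end = $+[1];
    print "lazy:   [$1] = [$2]\n";
}

# After backing up from the end of the line, the greedy group stops at
# the last colon; the lazy group stops at the first one.
print "greedy group ends at the last colon\n" if $greedy_end == rindex($line, ':');
print "lazy group ends at the first colon\n"  if $lazy_end == index($line, ':');
```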

Whenever you can, you want to avoid backtracking in regexes. Since you want it to match the first colon, everything before it should not be a colon. So the non-backtracking, deterministic expression would be:

([^:]+):(.*)

Once you've seen the first colon, the greedy match is fine. However, if you had a string of spaces and non-spaces and you wanted to match up until the last non-space--thus trimming the string--you can't really specify that in a manner that won't backtrack, because you know whether you want an individual character only as a result of understanding where the character is as a part of the whole.

([^:]+):(.*\S)

When it gets to the end of input, it backtracks for the non-space that it still hasn't matched. And when it finds that, it terminates the capture.
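A quick sketch of that trimming pattern. The trailing spaces on the sample line are added here for illustration. Note that only the right-hand end is trimmed; the space directly after the colon is still captured.

```perl
use strict;
use warnings;

# Sample line from this thread, with trailing spaces added for illustration.
my $line = " TEST: asdas :asd asdasad s   ";

my ($left, $trimmed);
if ($line =~ /([^:]+):(.*\S)/) {
    # .* first runs to the end of the line, then gives back characters
    # until \S can match the final non-space.
    ($left, $trimmed) = ($1, $2);
    print "[$left] = [$trimmed]\n";
}
```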

Of course this is a trivial example, and alternative expressions can reduce backtracking. You might know that only single space characters will be accepted, so you can craft an expression that will at most backtrack once, but only to conclude the match:

([^:]+):((?:\S| \S)+)

Here it looks at the next character: if it's not a space, no problem; if it is a space, then only one more character needs to be read in order to determine whether it's a keeper. Since space-followed-by-non-space is the last alternative, when that fails too the repetition ends and the match completes.
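A small sketch of how that expression behaves. The helper name `value_part` and the second input are made up for illustration. It shows why the single-space assumption matters: on a run of two spaces the capture simply stops.

```perl
use strict;
use warnings;

# Hypothetical helper: return the second capture of the alternation
# pattern, or undef when the line does not match.
sub value_part {
    my ($line) = @_;
    return $line =~ /([^:]+):((?:\S| \S)+)/ ? $2 : undef;
}

# Single spaces only: trailing spaces are dropped, the rest is kept.
my $single = value_part(" TEST: asdas :asd asdasad s   ");
print "[$single]\n";

# Two spaces in a row: neither alternative matches at the first of
# them, so the capture ends before "two".
my $double = value_part("key: one  two");
print "[$double]\n";
```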

This post from Regex Guru has a little more on this.

Axeman 2010-10-20 23:48:46

+1 for shunning backtracking, but that last regex needs work. According to RegexBuddy, it takes 48 steps to match the OP's test string. Compare that to `([^:]+):((?: \S+)+)`, which takes only 17 steps. It isn't backtracking that's hurting you, it's the alternation in `(?:\S| \S)+`, matching one or two characters per iteration of the `+`.

Alan Moore 2010-10-21 02:16:34

+1: Non-greedy matching will get the job done, but it's better to use a negated character class (e.g., `[^:]`) when possible. Not only does it tend to be more efficient, it also more explicitly conveys your meaning to future programmers. ("I want non-colon characters" vs. "I want any characters, I don't care what they are".)

Dave Sherohman 2010-10-21 06:08:08


how to match only once in regex perl