ansaurus

Question

A Question of Greedy vs. Negated Character Classes in Regex

Answer 1

A:

Not a direct answer to the question, but why not a different approach altogether, since you know the format of the lines already? For example, you could use a regex on the whitespace between the fields, or avoid regex altogether and split() on the whitespace, which is generally going to be faster than any regular expression, depending on the language you're using.

Jay 2008-10-03 19:39:11

cuz i'm looking to find out about the use of greediness vs negated character classes. doing something else wouldn't answer the question.

Keng 2008-10-03 19:42:12

Answer 2

+3 A:

[^\D]{1,} and \d+ is exactly the same. The regex parser will compile [^\D] and \d into character classes with the equal content, and + is just short for {1,}.

If you want lazy repetition you can add a ? at the end.

\d+?

The character classes are usually compiled into bitmaps for ASCII-characters. For Unicode (>=256) it is implementation dependent. One way could be to create a list of ranges, and use binary search on it.

For ASCII the lookup time is constant over the size. For Unicode it is logarithmic or linear.

MizardX 2008-10-03 19:42:14

well, the "+" isn't short for {1} because the + is greedy thus it goes all the way through and then backtracks when it fails thus comparing a 15 character line possibly upto 30 full characters.

Keng 2008-10-03 19:45:03

the ? also does backtracking thus slowing down the process.

Keng 2008-10-03 19:45:57

Answer 3

+1 A:

My initial tests show that [^\D{1,} is a bit slower than \d+, on a 184M file the former takes 9.6 seconds while the latter takes 8.2

Without capturing (the ()'s) both are about 1 second faster, but the difference between the two is about the same.

I also did a more extensive test where the captured value is printed to /dev/null, with a third version splitting on the space, results:

([^\D]{1,}): ~18s
(\d+): ~17s
(split / /)[1]: ~17s

Edit: split version improved and time decreased to be the same or lower than (\d+)

Fastest version so far (can anyone improve?):

while (<>)
{
    if ($foo = (split / /)[1])
    {
        print $foo . "\n";
    }
}

thelsdj 2008-10-03 19:44:22

OOOOOOooo...testing! very nice. great idea.

Keng 2008-10-03 19:47:00

can you try it with out the () that jj33 mentions?

Keng 2008-10-03 19:47:41

Answer 4

+1 A:

This is kind of a trick question as written because (\d)+ takes slightly longer due to the overhead of the capturing parentheses. If you change it to \d+ they take the same amount of time in my perl/system.

jj33 2008-10-03 19:44:25

oooooo.....good point!

Keng 2008-10-03 19:48:11

Answer 5

+1 A:

Yeah, I agree with MizardX... these two expressions are semantically equivalent. Although the grouping could require additional resources. That's not what you were asking about.

harpo 2008-10-03 19:46:50

Answer 6

+2 A:

Both your expressions have the same greediness. As others have said here, except for the capturing group they will execute in the same way.

Also in this case greediness won't matter much at the execution speed since you don't have anything following \d*. In this case the expression will simply process all the digits it can find and stop when the space is encountered. No backtracking should occur with these expressions.

To make it more explicit, backtracking should occur if you have an expression like this:

\d*123

In this case the parser will engulf all the digits, then backtrack to match the three following digits.

rslite 2008-10-03 19:57:05

Very nicely put. I didn't think about the no-more-char-after-the-digits affecting the backtracking.

Keng 2008-10-03 20:12:09

ansaurus

tags:

views:

answers:

A Question of Greedy vs. Negated Character Classes in Regex

related questions