tags:

views:

379

answers:

6

I have a very large file that looks like this (see below). I have two basic choices of regex to use on it (I know there may be others but I'm really trying to compare Greedy and Negated Char Class) methods.

ftp: [^\D]{1,}
ftp: (\d)+
ftp: \d+

Note: what if I took off the parense around the \d?

Now + is greedy which forces backtracking but the Negated Char Class require a char-by-char comparison. Which is more efficient? Assume the file is very-very large so minute differences in processor usage will become exaggerated due to the length of the file.

Now that you've answered that, What if my Negated Char Class was very large, say 18 different characters? Would that change your answer?

Thanks.

ftp: 1117 bytes
ftp: 5696 bytes
ftp: 3207 bytes
ftp: 5696 bytes
ftp: 7200 bytes

A: 

Not a direct answer to the question, but why not a different approach altogether, since you know the format of the lines already? For example, you could use a regex on the whitespace between the fields, or avoid regex altogether and split() on the whitespace, which is generally going to be faster than any regular expression, depending on the language you're using.

Jay
cuz i'm looking to find out about the use of greediness vs negated character classes. doing something else wouldn't answer the question.
Keng
+3  A: 

[^\D]{1,} and \d+ is exactly the same. The regex parser will compile [^\D] and \d into character classes with the equal content, and + is just short for {1,}.

If you want lazy repetition you can add a ? at the end.

\d+?

The character classes are usually compiled into bitmaps for ASCII-characters. For Unicode (>=256) it is implementation dependent. One way could be to create a list of ranges, and use binary search on it.

For ASCII the lookup time is constant over the size. For Unicode it is logarithmic or linear.

MizardX
well, the "+" isn't short for {1} because the + is greedy thus it goes all the way through and then backtracks when it fails thus comparing a 15 character line possibly upto 30 full characters.
Keng
the ? also does backtracking thus slowing down the process.
Keng
+1  A: 

My initial tests show that [^\D{1,} is a bit slower than \d+, on a 184M file the former takes 9.6 seconds while the latter takes 8.2

Without capturing (the ()'s) both are about 1 second faster, but the difference between the two is about the same.

I also did a more extensive test where the captured value is printed to /dev/null, with a third version splitting on the space, results:

([^\D]{1,}): ~18s
(\d+): ~17s
(split / /)[1]: ~17s

Edit: split version improved and time decreased to be the same or lower than (\d+)

Fastest version so far (can anyone improve?):

while (<>)
{
    if ($foo = (split / /)[1])
    {
        print $foo . "\n";
    }
}
thelsdj
OOOOOOooo...testing! very nice. great idea.
Keng
can you try it with out the () that jj33 mentions?
Keng
+1  A: 

This is kind of a trick question as written because (\d)+ takes slightly longer due to the overhead of the capturing parentheses. If you change it to \d+ they take the same amount of time in my perl/system.

jj33
oooooo.....good point!
Keng
+1  A: 

Yeah, I agree with MizardX... these two expressions are semantically equivalent. Although the grouping could require additional resources. That's not what you were asking about.

harpo
+2  A: 

Both your expressions have the same greediness. As others have said here, except for the capturing group they will execute in the same way.

Also in this case greediness won't matter much at the execution speed since you don't have anything following \d*. In this case the expression will simply process all the digits it can find and stop when the space is encountered. No backtracking should occur with these expressions.

To make it more explicit, backtracking should occur if you have an expression like this:

\d*123

In this case the parser will engulf all the digits, then backtrack to match the three following digits.

rslite
Very nicely put. I didn't think about the no-more-char-after-the-digits affecting the backtracking.
Keng