tags:

views:

72

answers:

2
+1  Q: 

python regex speed

Hi guys,

regarding regex (specifically python re), if we ignore the way the expression is written, is the length of the text the only factor for the time required to process the document? Or are there other factors (like how the text is structured) that play important roles too?

+2  A: 

Both the length of the text and its contents are important.

As an example the regular expression a+b will fail to match quickly on a string containing one million bs but more slowly on a string containing one million as. This is because more backtracking will be required in the second case.

import timeit
x = "re.search('a+b', s)"
print timeit.timeit(x, "import re;s='a'*10000", number=10)
print timeit.timeit(x, "import re;s='b'*10000", number=10)

Results:

6.85791902323
0.00795443275612
Mark Byers
`'regexp' in title and exists('Mark Byers')` => `True`.
OTZ
@Mark, are long repeating words in the text such as 'spam spam spam ....' significant? My regex basically looks for tabs and replace with whitespace str = re.sub("\t", " ",str) but for this particular piece of text, it seems forever. Based on your answer, in my case, it shouldn't matter.
goh
@goh: For that particular regular expression it will make no difference. That regular expression is very simple so there is not much backtracking.
Mark Byers
@Mark, I suppose for the regex above, the sheer length of the text determines the huge amount of time required. However, for one document with around 190000 characters it takes >5 mins where another with 20000 characters takes less than 1 min, it doesn't seem to fit in linearly though...
goh
@goh: The first example should be O(n^2) and the second O(n). The first has worse performance because it backtracks.
Mark Byers
+2  A: 

One important consideration can also be whether the text actually matches the regular expression. Take (as a contrived example) the regex (x+x+)+y from this regex tutorial.

When applied to xxxxxxxxxxy it matches, taking the regex engine 7 steps. When applied to xxxxxxxxxx, it fails (of course), but it takes the engine 2558 steps to arrive at this conclusion.

For xxxxxxxxxxxxxxy vs. xxxxxxxxxxxxxx it's already 7 vs 40958 steps, and so on exponentially...

This happens especially easily with nested repetitions or regexes where the same text can be matched by two or more different parts of the regex, forcing the engine to try all permutations before being able to declare failure. This is then called catastrophic backtracking.

Tim Pietzcker