ansaurus

Question

python regex speed

Answer 1

+2 A:

Both the length of the text and its contents are important.

As an example the regular expression a+b will fail to match quickly on a string containing one million bs but more slowly on a string containing one million as. This is because more backtracking will be required in the second case.

import timeit
x = "re.search('a+b', s)"
print timeit.timeit(x, "import re;s='a'*10000", number=10)
print timeit.timeit(x, "import re;s='b'*10000", number=10)

Results:

6.85791902323
0.00795443275612

Mark Byers 2010-09-27 06:37:17

`'regexp' in title and exists('Mark Byers')` => `True`.

OTZ 2010-09-27 06:39:47

@Mark, are long repeating words in the text such as 'spam spam spam ....' significant? My regex basically looks for tabs and replace with whitespace str = re.sub("\t", " ",str) but for this particular piece of text, it seems forever. Based on your answer, in my case, it shouldn't matter.

goh 2010-09-27 06:48:58

@goh: For that particular regular expression it will make no difference. That regular expression is very simple so there is not much backtracking.

Mark Byers 2010-09-27 06:53:00

@Mark, I suppose for the regex above, the sheer length of the text determines the huge amount of time required. However, for one document with around 190000 characters it takes >5 mins where another with 20000 characters takes less than 1 min, it doesn't seem to fit in linearly though...

goh 2010-09-27 07:50:20

@goh: The first example should be O(n^2) and the second O(n). The first has worse performance because it backtracks.

Mark Byers 2010-09-27 08:44:59

Answer 2

+2 A:

One important consideration can also be whether the text actually matches the regular expression. Take (as a contrived example) the regex (x+x+)+y from this regex tutorial.

When applied to xxxxxxxxxxy it matches, taking the regex engine 7 steps. When applied to xxxxxxxxxx, it fails (of course), but it takes the engine 2558 steps to arrive at this conclusion.

For xxxxxxxxxxxxxxy vs. xxxxxxxxxxxxxx it's already 7 vs 40958 steps, and so on exponentially...

This happens especially easily with nested repetitions or regexes where the same text can be matched by two or more different parts of the regex, forcing the engine to try all permutations before being able to declare failure. This is then called catastrophic backtracking.

Tim Pietzcker 2010-09-27 06:45:50

ansaurus

tags:

views:

answers:

python regex speed

related questions