I've written a Python utility to scan log files for known error patterns.
I was trying to speed up the search by giving the regex engine additional pattern information. For example, instead of just looking for lines containing "gold", I also require that such a line start with an underscore, so: "^_.*gold" instead of "gold".
Since 99% of the lines do not start with an underscore, I expected a big performance payoff, because the regex engine could abort reading a line after just one character. I was surprised to find the opposite.
The following program illustrates the problem:
import re
from time import time

def main():
    line = 'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"^_")  # requires an underscore as the first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)
    for p in patterns:
        start = time()
        for i in range(1000 * 1000):
            match = re.search(p, line)
        end = time()
        print('Elapsed: ' + str(end - start))

main()
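For reference, the same measurement can also be written with the stdlib timeit module and direct p.search(line) calls (a sketch; as far as I can tell, calling re.search(p, line) with an already-compiled pattern adds a small per-call dispatch cost, so method calls on the compiled object keep the loop cleaner):

```python
import re
import timeit

line = 'I do not start with an underscore 123456789012345678901234567890'

p_anchor = re.compile(r'^_')  # anchored: can only match at position 0
p_literal = re.compile(r'abcdefghijklmnopqrstuvwxyz')  # plain literal

# Neither pattern matches the sample line, so both searches fail;
# what is being measured is how long each *failed* search takes.
for p in (p_anchor, p_literal):
    elapsed = timeit.timeit(lambda: p.search(line), number=100_000)
    print(p.pattern, elapsed)
```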
I've tried reviewing sre_compile.py looking for an explanation, but its code was too hairy for me.
Could the observed performance be explained by the inclusion of the start-of-line anchor turning the regex engine's work from a simple substring search into a much more complicated backtracking state-machine operation, thereby outweighing any benefit of aborting the search after the first character?
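One way to probe this is to move the anchoring out of the pattern and into the call: match() only ever attempts the pattern at position 0, so no "^" is needed in the pattern itself (a small sketch, not an explanation of the engine's internals):

```python
import re

line = 'I do not start with an underscore 123456789012345678901234567890'

p_caret = re.compile(r'^_')  # anchor expressed inside the pattern
p_plain = re.compile(r'_')   # no anchor in the pattern at all

# search() with '^_' must fail: the line does not start with '_'.
print(p_caret.search(line))  # None
# match() tries the pattern only at position 0, so it gives the same
# answer without a '^' in the pattern.
print(p_plain.match(line))   # None
```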
Thinking so, I tried making the line 8 times longer, expecting the start-of-line search to shine, but instead the gap only grew wider (22 s vs. 6 s).
I'm puzzled :o Am I missing something here?
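For completeness, the early abort I was hoping for can also be approximated outside the regex engine with a plain-string pre-check; "gold" here stands in for any of my error patterns:

```python
import re

gold = re.compile(r'gold')

def scan(lines):
    # Cheap pre-filter: reject the ~99% of lines that do not start
    # with '_' before the regex engine ever sees them.
    return [ln for ln in lines if ln.startswith('_') and gold.search(ln)]

print(scan(['_gold bar', 'no underscore gold', '_no treasure']))  # ['_gold bar']
```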