I've got a fairly large string (~700k characters) against which I need to run 10 regexes and count all the matches of any of those regexes. My quick-and-dirty implementation was to do something like re.search('(expr1)|(expr2)|...'), but I was wondering whether we'd see any performance gains by matching in a loop instead.
In other words, I want to compare the performance of:
import re

def CountMatchesInBigstring(bigstring, my_regexes):
    """Counts how many of the expressions in my_regexes match bigstring."""
    count = 0
    # Join everything into one big alternation, one capture group per regex.
    combined_expr = '|'.join(['(%s)' % r for r in my_regexes])
    matches = re.search(combined_expr, bigstring)
    if matches:
        count += NumMatches(matches)  # NumMatches: helper (not shown) that turns a match into a count
    return count
vs
def CountMatchesInBigstring(bigstring, my_regexes):
    """Counts how many of the expressions in my_regexes match bigstring."""
    count = 0
    # Run each regex separately against the big string.
    for reg in my_regexes:
        matches = re.search(reg, bigstring)
        if matches:
            count += NumMatches(matches)
    return count
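(Side note, since it affects the counting: re.search only ever returns the first match, so for a straight count of every occurrence re.finditer seems like the more natural tool. A rough sketch of what I mean, with the counting done inline instead of via NumMatches:)

def CountAllOccurrences(bigstring, my_regexes):
    """Counts every non-overlapping occurrence of each regex in bigstring."""
    count = 0
    for reg in my_regexes:
        # finditer yields one match object per non-overlapping occurrence
        count += sum(1 for _ in re.finditer(reg, bigstring))
    return count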
I'll stop being lazy and run some tests tomorrow (and post the results), but I wondered whether the answer would jump out to someone who actually understands how regexes work :)
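For the timing itself, here's roughly the harness I have in mind (just a sketch: the patterns and the ~700k string are made-up stand-ins for my real data, and the NumMatches bookkeeping is stripped out since it's the searching that's being timed):

import re
import timeit

# Made-up stand-ins for the real patterns and the real ~700k-character string.
my_regexes = [r'foo\d+', r'bar[a-z]{3}', r'baz(?:qux)?']
bigstring = 'some filler text with foo123 and barbaz sprinkled in ' * 13000

def count_combined(bigstring, regexes):
    # One big alternation, searched once.
    combined = '|'.join('(%s)' % r for r in regexes)
    return re.search(combined, bigstring) is not None

def count_looped(bigstring, regexes):
    # One search per regex, counting how many of them match at all.
    return sum(1 for r in regexes if re.search(r, bigstring))

print(timeit.timeit(lambda: count_combined(bigstring, my_regexes), number=10))
print(timeit.timeit(lambda: count_looped(bigstring, my_regexes), number=10))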