views: 459
answers: 4

I've written a Python utility that scans log files for known error patterns.
I tried to speed up the search by giving the regex engine additional pattern info. For example, instead of just looking for lines containing "gold", I also require that such a line start with an underscore, so: "^_.*gold" instead of "gold".
Since 99% of the lines do not start with an underscore, I expected a big performance payoff, because the regex engine could abort reading the line after just one character. I was surprised to learn it's the other way around.

The following program illustrates the problem:

import re
from time import time

def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"^_")  # requires an underscore as the first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000 * 1000):
            match = re.search(p, line)
        end = time()
        print 'Elapsed: ' + str(end - start)

main()

I've tried reviewing sre_compile.py looking for an explanation, but its code was too hairy for me.
Could the observed performance be explained by the fact that including the start-of-line anchor turns the regex engine's operation from a simple substring scan into a much more complicated backtracking state-machine operation, thereby outweighing any benefit of aborting the search after the first character?
Thinking so, I tried multiplying the line's length by 8, expecting the start-of-line search to shine, but instead the gap only grew wider (22 s vs. 6 s).

I'm puzzled :o am I missing something here?
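For reference, here is the same comparison written in Python 3 syntax with the stdlib `timeit` module, which avoids timing the loop overhead by hand (variable names and the repeat count are my own choices, not from the original program):

```python
import re
import timeit

line = 'I do not start with an underscore 123456789012345678901234567890'
anchored = re.compile(r"^_")                       # must start with an underscore
plain = re.compile(r"abcdefghijklmnopqrstuvwxyz")  # plain substring pattern

for pat in (anchored, plain):
    # Neither pattern matches this line; we are timing the failure path.
    elapsed = timeit.timeit(lambda: pat.search(line), number=100_000)
    print(pat.pattern, elapsed)
```

Note that both searches fail on this line, so this measures exactly the "reject the line" path the question is about.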

+1  A: 

How about

if line[0] == "_" and "gold" in line:
    print "Yup, it starts with an underscore"
else:
    print "Nope it doesn't"

Seriously, don't overuse regex

Robus
I can't decide whether this answer is more patronizing or insulting
Eli Bendersky
Probably `line.startswith('_')` would be better.
Daenyth
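Daenyth's `startswith` variant also sidesteps the `IndexError` that `line[0]` raises on an empty line; a quick Python 3 sketch (the sample strings are mine):

```python
# startswith handles the empty string safely, unlike line[0].
for line in ("_gold bar found", "no gold here", ""):
    if line.startswith("_") and "gold" in line:
        print("hit:", line)
    else:
        print("miss:", line)
```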
I appreciate the suggestion to keep it simple. It's just that my real needs are more complex. I do need real complicated regexps, I've narrowed it down in the above for the sake of the question.
Gili Nachum
+1  A: 

Regexes don't always behave as you'd expect. I don't understand the internals, so I can't precisely explain the behavior, but one thing to note: if you change from search to match, the patterns switch which one is faster (though that isn't exactly what you want).

You're doing the right thing: measure and use the technique that is empirically faster.

Ned Batchelder
That's what I'll end up doing. That and SO recommendations :) Thanks.
Gili Nachum
+2  A: 

You're actually doing two things wrong: if you want to look at the beginning of the string, use match, not search. Also, don't call re.match(pattern, line); compile the pattern once and call pattern.match(line).

import re
from time import time

def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_")  # with match(), this requires an underscore as the first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000 * 1000):
            match = p.match(line)
        end = time()
        print 'Elapsed: ' + str(end - start)

main()

You will see that you get the expected behavior now: both patterns take exactly the same time.

THC4k
but you've changed the semantics of the search, so what's the point?
Eli Bendersky
This is not quite correct - you have to change the line defining `p2` to `p2 = re.compile(r".*abcdefghijklmnopqrstuvwxyz")`, or the functionality is different. After all, `p2` should be searched for anywhere in the line.
Tim Pietzcker
@Eli Bendersky: The point is that if you want the pattern to fail on the first comparison, you need to use match, not search. @Tim Pietzcker: Right. Btw: `r".*abcdefghijklmnopqrstuvwxyz"` will take very long because of the backtracking re engine (first `.*` matches everything, then it starts backtracking a lot). `r".-abcdefghijklmnopqrstuvwxyz"` is the well-behaved pattern for this case.
THC4k
@THC4k: Yes, I know the difference between search and match; I plan to use whichever method proves fastest. search() has the advantage that the patterns are less verbose. I accept the compile remarks - I'm sure re.search doesn't recompile a compiled pattern, but it did save time! End of day: switching to match() lowered the match time for the ^_ pattern considerably (0.71 s vs. 1.96 s before), but at the expense of increasing the match time for .*abcdefghijklmnopqrstuvwxyz.* (1.06 s vs. 0.69 s before). I guess search() is optimized for simple patterns while match() is optimized for more complex ones.
Gili Nachum
@Gili Nachum: write `r".*?abcdefghijklmnopqrstuvwxyz"`. That's what I meant before; for some reason I thought non-greedy matches are `-`, not `*?`.
THC4k
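Putting Tim Pietzcker's correction and the non-greedy fix together: on single-line input, `pattern.match` with a `.*?` prefix agrees in truth value with `pattern.search` on the bare pattern (a Python 3 sketch with sample strings of my own; note that `.` does not match newlines by default, so this equivalence only holds for single-line strings):

```python
import re

needle = "abcdefghijklmnopqrstuvwxyz"
searcher = re.compile(needle)
matcher = re.compile(r".*?" + needle)  # non-greedy prefix, as THC4k suggests

samples = [
    "prefix abcdefghijklmnopqrstuvwxyz suffix",
    "no alphabet run here",
    "abcdefghijklmnopqrstuvwxyz",
]
for s in samples:
    # search anywhere vs. anchored match with a lazy prefix: same verdict
    assert bool(searcher.search(s)) == bool(matcher.match(s))
```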
+2  A: 

Interesting observation! I've played with it a bit. My guess would be that the regexp engine scans the entire string for an underscore, and only then checks whether the match sits at a line beginning. Perhaps this has to do with keeping behaviour uniform with re.MULTILINE.

If you use re.match instead of re.search for the underscore pattern, both seem to be equally fast, i.e.

import re
from time import time

def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_.*")  # requires an underscore as the first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    start = time()
    for i in xrange(1000 * 1000):
        match = re.match(p1, line)
    end = time()
    print 'Elapsed: ' + str(end - start)

    start = time()
    for i in xrange(1000 * 1000):
        match = re.search(p2, line)
    end = time()
    print 'Elapsed: ' + str(end - start)

In this case, match requires the pattern to match at the beginning of the string.
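To make that match/search relationship concrete: on a single-line string, `re.match(r"_", s)` and `re.search(r"^_", s)` agree, because match anchors at position 0 and `^` anchors search at the start (a small sketch with my own sample strings; under re.MULTILINE they can differ, since `^` then also matches after each newline):

```python
import re

for s in ("_starts with underscore", "does not start with one"):
    # match is implicitly anchored; ^ makes search explicitly anchored
    assert bool(re.match(r"_", s)) == bool(re.search(r"^_", s))
print("match/search agreement verified")
```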

Also, be aware that calling search on the precompiled pattern seems to be faster:

for p in patterns:
    start = time()
    for i in xrange(1000 * 1000):
        match = p.search(line)
    end = time()
    print 'Elapsed: ' + str(end - start)

But the speed difference remains...

Ivo van der Wijk
It's indeed useful to separate the time spent compiling the pattern (which can be substantial in some cases) and the time spent matching it. However, I suspect this can probably be best described as a bug introduced either into PCRE or - more likely - into the conversion of it for Python; other RE engines seem to do just fine detecting whether to use anchored matching just from the REs themselves. (RE engines are *really* tricky code; there are some utterly Codethulhu-level evil gotchas in there.)
Donal Fellows