
Is there any benefit in using compile for regular expressions in Python?

h = re.compile('hello')
h.match('hello world')

vs

re.match('hello', 'hello world')
+1  A: 

With the second version, the regular expression is compiled each time you call it. If you are going to execute it many times, it is definitely better to compile it first. If not, compiling on the fly for one-offs is fine.

Adam Peck
A: 

My understanding is that those two examples are effectively equivalent. The only difference is that in the first, you can reuse the compiled regular expression elsewhere without causing it to be compiled again.

Here's a reference for you: (http://diveintopython.org/refactoring/refactoring.html)

Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)
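The equivalence described above can be checked directly; a small sketch:

```python
import re

# re.match compiles the pattern and calls the resulting pattern
# object's match method for you, so both forms give the same result.
h = re.compile('hello')
m1 = h.match('hello world')
m2 = re.match('hello', 'hello world')
assert m1.group(0) == m2.group(0) == 'hello'
```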

Matthew Maravillas
I didn't downvote you, but technically this is wrong: Python won't recompile anyway
Triptych
+2  A: 

FWIW:

$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop

so, if you're going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don't think you really lose much clarity/straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.

dF
Major problems with your methodology here, since the setup argument is NOT included in the timing. Thus, you've removed the compilation time from the second example, and merely averaged it out in the first. This doesn't mean the first example compiles every time.
Triptych
Yes, I agree that this is not a fair comparison of the two cases.
Kiv
I see what you mean, but isn't that exactly what would happen in an actual application where the regexp is used many times?
dF
@dF: You're right, IF you only care about performance in one particular part of your code, and you're able to pre-compile the regex in another part. Otherwise, you need to time the re.compile call and include that in the second number for it to be a fair comparison.
Carl Meyer
@Triptych, @Kiv: The whole point of compiling regexps separately from use *is* to minimize compilation; removing it from the timing is exactly what dF should have done, because it represents real-world use most accurately. Compilation time is especially irrelevant with the way timeit.py does its timings here; it does several runs and only reports the shortest one, at which point the compiled regexp is cached. The extra cost you're seeing here is not the cost of compiling the regexp, but the cost of looking it up in the compiled regexp cache (a dictionary).
jemfinch
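
The fairness debate above can be made concrete with timeit itself; a sketch that puts the compile call inside the timed statement rather than in setup, so its cost (mostly a cache lookup, per the comment above) is included in the measurement:

```python
import timeit

# Compile inside the timed statement: includes the cache-lookup cost.
t_inline = timeit.timeit("re.compile('hello').match('hello world')",
                         setup="import re", number=100000)

# Module-level re.match: also hits the internal pattern cache each call.
t_module = timeit.timeit("re.match('hello', 'hello world')",
                         setup="import re", number=100000)

# Pre-compiled in setup: no per-call lookup at all.
t_precompiled = timeit.timeit("h.match('hello world')",
                              setup="import re; h = re.compile('hello')",
                              number=100000)
```

On typical runs the first two come out close to each other, with the pre-compiled version fastest, though the exact numbers depend on the machine.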
+46  A: 

Good question

I've had a lot of experience running a compiled regex thousands of times versus compiling on-the-fly, and have not noticed any perceivable difference. Obviously, this is anecdotal, and certainly not a great argument against compiling, but I've found the difference to be negligible.

EDIT: After a quick glance at the actual Python 2.5 library code, I see that Python internally compiles AND CACHES regexes whenever you use them anyway (including calls to re.match()), so you're really only changing WHEN the regex gets compiled, and shouldn't be saving much time at all - only the time it takes to check the cache (a key lookup on an internal dict type).

From module re.py (comments are mine):

def match(pattern, string, flags=0):
    return _compile(pattern, flags).match(string)

def _compile(*key):

    # Does cache check at top of function
    cachekey = (type(key[0]),) + key
    p = _cache.get(cachekey)
    if p is not None: return p

    # ...
    # Does actual compilation on cache miss
    # ...

    # Caches compiled regex
    if len(_cache) >= _MAXCACHE:
        _cache.clear()
    _cache[cachekey] = p
    return p

I think the bottom line is that compiling is fine to do, perhaps even preferable despite the minor readability hit, but you shouldn't expect any massive gains from pre-compilation.
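
The caching can be observed directly: in CPython, re.compile itself goes through the same internal cache, so compiling an identical pattern twice hands back the very same pattern object (an implementation detail, not a documented guarantee):

```python
import re

# Both calls hit the internal compiled-pattern cache, so they
# return the identical object (CPython implementation detail).
p1 = re.compile('hello')
p2 = re.compile('hello')
assert p1 is p2
```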

Triptych
Great answer, I didn't realize it cached them this way :)
Kiv
thanks kiv. I had a hunch, but this question finally made me look for myself.
Triptych
Your conclusion is inconsistent with your answer. If regexes are compiled and cached automatically, there is no need in most cases to do it by hand.
J.F. Sebastian
J. F. Sebastian, it serves as a signal to the programmer that the regexp in question will be used a lot and is not meant to be a throwaway.
kaleissin
I can only add that _MAXCACHE = 100 in 2.5+ and 3.0.
Constantin
I see the main advantage of using a compiled regex when you're reusing the same regex multiple times, thereby reducing the possibility of typos. If you're just calling it once, uncompiled is more readable.
monkut
So, the main difference arises when you are using lots of different regexes (more than _MAXCACHE), some of them just once and others many times... then it's important to keep compiled expressions for those that are used more, so they're not flushed out of the cache when it fills up.
fortran
@J.F. - Also, if you rely on compile-and-cache, who knows when the cache may be cleared, and then your regex will have to be recompiled.
Chris Lutz
If you're using Python < 2.7 or 3.1, re.sub lacks a 'flags' parameter. So if, say, you want to do a case-insensitive re.sub, you're stuck doing ``re.compile("...", re.I).sub(...)``.
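
On later versions (2.7+/3.1+), re.sub does accept flags; a quick sketch of both spellings (modern Python syntax):

```python
import re

# Modern Python: re.sub accepts a flags argument directly.
assert re.sub('foo', 'X', 'FoO bar', flags=re.IGNORECASE) == 'X bar'

# The older workaround described above, which still works:
assert re.compile('foo', re.IGNORECASE).sub('X', 'FoO bar') == 'X bar'
```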
+1  A: 

This is a good question. You often see people use re.compile without reason, and it lessens readability. But there certainly are plenty of times when pre-compiling the expression is called for, such as when you use it repeatedly in a loop.

It's like everything about programming (everything in life actually). Apply common sense.

PEZ
As far as I can tell from my brief flick through, *Python in a Nutshell* doesn't mention use without re.compile(), which made me curious.
Mat
+14  A: 

For me, the biggest benefit to re.compile isn't any kind of premature optimization (which is the root of all evil, anyway). It's being able to separate definition of the regex from its use.

Even a simple expression such as 0|[1-9][0-9]* (integer in base 10 without leading zeros) can be complex enough that you'd rather not have to retype it, check if you made any typos, and later have to recheck if there are typos when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]+.

It's certainly possible to store strings and pass them to re.match; however, that's less readable:

num = "..."
# then, much later:
m = re.match(num, input)

Versus compiling:

num = re.compile("...")
# then, much later:
m = num.match(input)

Though it is fairly close, the last line of the second feels more natural and simpler when used repeatedly.
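For instance, a sketch of the named-pattern style with the integer regex (here using the `0|[1-9][0-9]*` form):

```python
import re

# Integer in base 10 without leading zeros, bound to a
# descriptive name once and reused by that name thereafter.
num_b10 = re.compile(r'0|[1-9][0-9]*')

assert num_b10.match('0')
assert num_b10.match('42')
assert num_b10.match('x7') is None
```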

Roger Pate
Ten bonus points if you spotted the typo in the repeated regex before reading this comment! :P
Roger Pate
I agree with this answer; oftentimes using re.compile results in more, not less readable code.
Carl Meyer
R.Pate: the first 0, because you said "without leading zeros"?
Adriano Varoli Piazza
No, the leading "0|" is to allow the value "0". The error is + instead of *.
Roger Pate
+1  A: 

Interestingly, compiling does prove more efficient for me (Python 2.5.2 on Win XP):

import re
import time

rgx = re.compile(r'(\w+)\s+[0-9_]?\s+\w*')
s = "average    2 never"
a = 0

t = time.time()

for i in xrange(1000000):
    if re.match(r'(\w+)\s+[0-9_]?\s+\w*', s):
    #~ if rgx.match(s):
        a += 1

print time.time() - t

Running the above code once as is, and once with the two if lines commented the other way around, the compiled regex is twice as fast.

Eli Bendersky
Same issue as with dF's performance comparison. It's not really fair unless you include the performance cost of the compile statement itself.
Carl Meyer
Carl, I disagree. The compile is only executed once, while the matching loop is executed a million times
Eli Bendersky
@eliben: I agree with Carl Meyer. The compilation takes place in both cases. Triptych mentions that caching is involved, so in an optimal case (re stays in cache) both approaches are O(n+1), although the +1 part is kind of hidden when you don't use re.compile explicitly.
paprika
Don't write your own benchmarking code. Learn to use timeit.py, which is included in the standard distribution.
jemfinch
+3  A: 

In general, I find it easier to use flags (at least easier to remember how), like re.I, when compiling patterns than to use flags inline.

foo_pat = re.compile('foo',re.I)

foo_pat.findall('some string FoO bar')
>> ['FoO']

vs

re.findall('(?i)foo','some string FoO bar')
>> ['FoO']
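
Flags can also be combined with the | operator; a small sketch assuming re.I and re.M together:

```python
import re

# Combine case-insensitive and multiline matching in one compile call.
pat = re.compile(r'^foo', re.I | re.M)
assert pat.findall('Foo bar\nfOO baz') == ['Foo', 'fOO']
```
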
ptone
A: 

(months later) it's easy to add your own cache around re.match, or anything else for that matter --

""" Re.py: Re.match = re.match + cache  
    efficiency: re.py does this already (but what's _MAXCACHE ?)
    readability, inline / separate: matter of taste
"""

import re

cache = {}
_re_type = type( re.compile( "" ))

def match( pattern, str, *opt ):
    """ Re.match = re.match + cache re.compile( pattern ) 
    """
    if type(pattern) == _re_type:
        cpat = pattern
    elif pattern in cache:
        cpat = cache[pattern]
    else:
        cpat = cache[pattern] = re.compile( pattern, *opt )
    return cpat.match( str )

# def search ...

A wibni (wouldn't it be nice if): cachehint( size= ), cacheinfo() -> size, hits, nclear ...

Denis
A: 

I ran this test before stumbling upon the discussion here. However, having run it I thought I'd at least post my results.

I stole and bastardized the example in Jeff Friedl's "Mastering Regular Expressions". This is on a MacBook running OS X 10.6 (2 GHz Intel Core 2 Duo, 4 GB RAM). Python version is 2.6.1.

Run 1 - using re.compile

import re 
import time 
import fpformat
Regex1 = re.compile('^(a|b|c|d|e|f|g)+$') 
Regex2 = re.compile('^[a-g]+$')
TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    Regex1.search(TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    Regex2.search(TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.299 seconds
Character Class takes 0.107 seconds

Run 2 - Not using re.compile

import re 
import time 
import fpformat

TimesToDo = 1000
TestString = "" 
for i in range(1000):
    TestString += "abababdedfg"
StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^(a|b|c|d|e|f|g)+$',TestString) 
Seconds = time.time() - StartTime 
print "Alternation takes " + fpformat.fix(Seconds,3) + " seconds"

StartTime = time.time() 
for i in range(TimesToDo):
    re.search('^[a-g]+$',TestString) 
Seconds = time.time() - StartTime 
print "Character Class takes " + fpformat.fix(Seconds,3) + " seconds"

Alternation takes 2.508 seconds
Character Class takes 0.109 seconds
netricate
+1  A: 

I just tried this myself. For the simple case of parsing a number out of a string and summing it, using a compiled regular expression object is about twice as fast as using the re methods.

As others have pointed out, the re methods (including re.compile) look up the regular expression string in a cache of previously compiled expressions. Therefore, in the normal case, the extra cost of using the re methods is simply the cost of the cache lookup.

However, examination of the code shows the cache is limited to 100 expressions. This raises the question: how painful is it to overflow the cache? The code contains an internal interface to the regular expression compiler, re.sre_compile.compile. If we call it, we bypass the cache. It turns out to be about two orders of magnitude slower for a basic regular expression, such as r'\w+\s+([0-9_]+)\s+\w*'.

Here's my test:

#!/usr/bin/env python
import re
import time

def timed(func):
    def wrapper(*args):
        t = time.time()
        result = func(*args)
        t = time.time() - t
        print '%s took %.3f seconds.' % (func.func_name, t)
        return result
    return wrapper

regularExpression = r'\w+\s+([0-9_]+)\s+\w*'
testString = "average    2 never"

@timed
def noncompiled():
    a = 0
    for x in xrange(1000000):
        m = re.match(regularExpression, testString)
        a += int(m.group(1))
    return a

@timed
def compiled():
    a = 0
    rgx = re.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiled():
    a = 0
    rgx = re.sre_compile.compile(regularExpression)
    for x in xrange(1000000):
        m = rgx.match(testString)
        a += int(m.group(1))
    return a


@timed
def compiledInLoop():
    a = 0
    for x in xrange(1000000):
        rgx = re.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

@timed
def reallyCompiledInLoop():
    a = 0
    for x in xrange(10000):
        rgx = re.sre_compile.compile(regularExpression)
        m = rgx.match(testString)
        a += int(m.group(1))
    return a

r1 = noncompiled()
r2 = compiled()
r3 = reallyCompiled()
r4 = compiledInLoop()
r5 = reallyCompiledInLoop()
print "r1 = ", r1
print "r2 = ", r2
print "r3 = ", r3
print "r4 = ", r4
print "r5 = ", r5

And here is the output on my machine:

$ regexTest.py 
noncompiled took 4.555 seconds.
compiled took 2.323 seconds.
reallyCompiled took 2.325 seconds.
compiledInLoop took 4.620 seconds.
reallyCompiledInLoop took 4.074 seconds.
r1 =  2000000
r2 =  2000000
r3 =  2000000
r4 =  2000000
r5 =  20000

The 'reallyCompiled' methods use the internal interface, which bypasses the cache. Note the one that compiles on each loop iteration is only iterated 10,000 times, not one million.

George