Is there a Pythonic 'standard' for how regular expressions should be used?

What I typically do is run a bunch of re.compile calls at the top of my module and store the resulting pattern objects in global variables, then use them later within my functions and classes.

I could define the regexes within the functions where I use them, but then they would be recompiled every time the function runs.

Or I could forgo re.compile completely, but if I am using the same regex many times, it seems like recompiling would incur unnecessary overhead.

+1  A: 

I personally use your first approach, where the expressions I'll reuse are compiled early on and made available globally to the functions and methods that need them. In my experience this is reliable, and each pattern is compiled only once rather than on every call.
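
For illustration, here is a minimal sketch of that approach. The pattern names and helper functions are made up for the example and are not from the question:

```python
import re

# Hypothetical patterns, compiled once when the module is imported.
WORD_RE = re.compile(r"\w+")
NUMBER_RE = re.compile(r"\d+")


def count_words(text):
    """Return the number of word-like tokens in text."""
    return len(WORD_RE.findall(text))


def extract_numbers(text):
    """Return every run of digits in text as an int."""
    return [int(digits) for digits in NUMBER_RE.findall(text)]
```

Both functions share the module-level pattern objects, so nothing is compiled inside the function bodies.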

g.d.d.c
+4  A: 

I also tend to use your first approach but I've never benchmarked this. One thing to note, from the documentation, is that:

The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
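
As a rough sketch of what that means in practice, the two hypothetical helpers below should behave identically. The first passes the pattern string on every call and relies on the cache described in the docs, the second reuses an explicitly compiled object:

```python
import re


def is_error_uncompiled(line):
    # The pattern string is looked up in the re module's internal cache,
    # so it is normally compiled only the first time it is seen.
    return re.match(r"error:\s*(.+)", line) is not None


ERROR_RE = re.compile(r"error:\s*(.+)")


def is_error_precompiled(line):
    # Reuses the compiled object directly, skipping the cache lookup.
    return ERROR_RE.match(line) is not None
```

Whether the difference is measurable depends on the workload, so it is worth timing with your own patterns before drawing conclusions.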

One worry is that you could have regular expressions that don't get used. If you compile all expressions at module load time you could be incurring the cost of compiling the expression but never benefiting from that "optimization". I don't suppose this would matter unless you compile lots of regular expressions that never get used.
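
If that cost ever did matter, one possible workaround (my own sketch, not something from the docs) is to compile lazily behind a small memoized helper. The names get_pattern and find_years are hypothetical:

```python
import re
from functools import lru_cache


@lru_cache(maxsize=None)
def get_pattern(source, flags=0):
    """Compile source on first request and reuse the result afterwards."""
    return re.compile(source, flags)


def find_years(text):
    # Nothing is compiled until this function is first called.
    return get_pattern(r"\b(?:19|20)\d{2}\b").findall(text)
```

A pattern that is never requested is never compiled, at the price of one extra function call per use.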

One thing I do recommend is to use the re.VERBOSE (or re.X) flag and include comments and white space to make anything beyond the most trivial regular expression more readable.
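
A small example of what that looks like, using a made-up date pattern:

```python
import re

# Hypothetical pattern for dates such as 2010-07-15.
DATE_RE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)

match = DATE_RE.search("Posted on 2010-07-15 by a user")
if match:
    print(match.group("year"), match.group("month"), match.group("day"))
```

With re.VERBOSE, whitespace in the pattern is ignored and everything after an unescaped # is treated as a comment, so a literal space or # has to be escaped or put in a character class.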

Andrew Walker
The reason I don't like my first approach is that it clogs up the namespace, and the regex definitions end up far away from the code that actually uses them. I wish there were a way to make the code easier to read.
orangeoctopus
If you want to make code easier to read, don't use regex. Of course that will probably complicate your code if you're using a lot of regexes.
Wayne Werner
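One way to (semi)avoid clogging up the namespace is to put a single underscore before your module-level variables, which marks them as private by convention and keeps them from being exported by "from module import *". Double underscores are not useful here, since name mangling only applies inside class bodies.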
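
A quick sketch of the underscore convention, with a hypothetical pattern and helper:

```python
import re

# Leading underscore: private by convention, and skipped by
# "from module import *" when the module defines no __all__.
_TOKEN_RE = re.compile(r"[A-Za-z_]\w*")


def identifiers(source):
    """Return the identifier-like tokens found in source."""
    return _TOKEN_RE.findall(source)
```

The name is still reachable as module._TOKEN_RE, so this is a convention rather than real hiding.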
Andrew Walker
+5  A: 

One way that would be a lot cleaner is using a dictionary:

import re

PATTERNS = {'pattern1': re.compile(r'foo.*baz'),
            'snake': re.compile(r'python'),
            'knight': re.compile(r'[Aa]rthur|[Bb]edevere|[Ll]auncelot')}

That would solve your problem of having a polluted namespace, plus it's pretty obvious to anyone looking at your code what PATTERNS is and will be used for, and it follows the ALL_CAPS convention for module-level constants. In addition, you can easily call PATTERNS[pattern].match(text), or whatever it is your logic calls for.
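
To make the lookup concrete, here is a short sketch built on the same dictionary. The helper matching_names is just an illustration, not part of any library:

```python
import re

PATTERNS = {
    "pattern1": re.compile(r"foo.*baz"),
    "snake": re.compile(r"python"),
    "knight": re.compile(r"[Aa]rthur|[Bb]edevere|[Ll]auncelot"),
}


def matching_names(text):
    """Return the names of all patterns that occur anywhere in text."""
    return [name for name, regex in PATTERNS.items() if regex.search(text)]
```

Note that the compiled object's own methods are used here (regex.search(text)), since each dictionary value is already a pattern object.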

Wayne Werner
I like this a lot! Thanks!
orangeoctopus