ansaurus

Question

I'm using Python regexes in a criminally inefficient manner

Answer 1

+2 A:

Never create your own programming language. Ever. (I used to have an exception to this rule, but not any more.)

There is always an existing language you can use which suits your needs better. If you elaborate on your use case, people may be able to help you select a suitable language.

JesperE 2008-09-28 20:19:40

Domain-specific language creation is a perfectly OK thing to do, if it's appropriate. Never say never.

skaffman 2008-09-28 20:23:03

That may be, but there are already a zillion good choices for text templating languages in Python...

Dan 2008-09-28 20:27:12

Part of the reason I'm doing this is to increase my skills. If this was something I was getting paid for, I would follow your suggestion and do it the quickest way. As it is I'm trying to maximize learning, not programmer efficiency.

Schof 2008-09-28 20:31:01

Designing a language for self-education purposes was actually one of the exceptions I mentioned. Regarding DSLs: you can often get away with implementing a well-designed class library or API instead.

JesperE 2008-09-29 15:22:12

Answer 2

+1 A:

You can match both kinds of quotes in one go with r"(\"|')(.*?)\1" - the \1 refers to the first group, so it will only match when the closing quote is the same character as the opening quote.
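A minimal sketch of that backreference pattern in use. The helper name strip_quotes is made up for illustration and is not from the question's code:

```python
import re

# Group 1 captures the opening quote; \1 demands the same character to close.
QUOTED = re.compile(r"(\"|')(.*?)\1")

def strip_quotes(text):
    """Return the content of the first matched-quote pair, or text unchanged."""
    m = QUOTED.search(text)
    return m.group(2) if m else text

print(strip_quotes('"hello"'))       # double-quoted input
print(strip_quotes("'it\"s'"))       # single-quoted, with a double quote inside
print(strip_quotes("\"mismatch'"))   # mismatched quotes: no match, returned as-is
```

Because the closing quote must equal whatever group 1 captured, a string that opens with one kind of quote and closes with the other is left untouched.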

JacquesB 2008-09-28 20:20:03

Answer 3

A:

Don't call search twice in a row (in the loop conditional, and the first statement in the loop). Call (and cache the result) once before the loop, and then in the final statement of the loop.
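Roughly, the loop shape being suggested looks like the sketch below. The question's code is not reproduced on this page, so the TAG pattern and the strip_tags function are hypothetical stand-ins:

```python
import re

# Hypothetical pattern standing in for whatever the original loop searched for.
TAG = re.compile(r"<%(.*?)%>")

def strip_tags(text):
    """Remove every <%...%> tag, calling search exactly once per iteration."""
    found = []
    m = TAG.search(text)              # one search before the loop
    while m:
        found.append(m.group(1))
        # Cut the matched tag out of the text.
        text = text[:m.start()] + text[m.end():]
        m = TAG.search(text)          # final statement: search for the next pass
    return text, found
```

The match object is reused for both the loop test and the body, so each tag costs one search instead of two.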

eduffy 2008-09-28 20:22:31

Answer 4

+1 A:

You're calling re.compile quite a bit. A global variable for these wouldn't hurt here.

eduffy 2008-09-28 20:23:40

Answer 5

+9 A:

The first thing that may improve things is to move the re.compile calls outside the function. Compiled patterns are cached, but every call still pays for the cache lookup that checks whether the pattern has already been compiled.
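As a rough illustration of the difference, here is a sketch with two hypothetical helpers, strip_inline and strip_global. It is written for Python 3, so re.LOCALE is left out (Python 3 rejects that flag for str patterns), and the quote pattern is anchored for the sake of the example:

```python
import re
import timeit

def strip_inline(item):
    # re.compile runs on every call; the pattern comes from re's internal
    # cache, but the lookup itself is repeated each time.
    return re.compile(r"^(['\"])(.*)\1$").sub(r"\2", item)

# Compiled once, when the module is loaded.
_QUOTES = re.compile(r"^(['\"])(.*)\1$")

def strip_global(item):
    return _QUOTES.sub(r"\2", item)

if __name__ == "__main__":
    # Timings vary by machine and Python version, so none are quoted here.
    for fn in (strip_inline, strip_global):
        print(fn.__name__, timeit.timeit(lambda: fn('"abc"'), number=100000))
```

Both functions return the same result; only the per-call overhead differs.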

Another possibility is to use a single regex as below:

MatchedQuotes = re.compile(r"(['\"])(.*)\1", re.LOCALE)
item = MatchedQuotes.sub(r'\2', item, 1)

Finally, you can combine this into the regex in processVariables. Taking Torsten Marek's suggestion to use a function for re.sub, this improves and simplifies things dramatically.

VariableDefinition = re.compile(r'<%(["\']?)(.*?)\1=(["\']?)(.*?)\3%>', re.LOCALE)
VarRepl = re.compile(r'<%(["\']?)(.*?)\1%>', re.LOCALE)

def processVariables(item):
    vars = {}
    def findVars(m):
        vars[m.group(2).upper()] = m.group(4)
        return ""

    item = VariableDefinition.sub(findVars, item)
    return VarRepl.sub(lambda m: vars[m.group(2).upper()], item)

print processVariables('<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>')

Here are my timings for 100000 runs:

Original       : 13.637
Global regexes : 12.771
Single regex   :  9.095
Final version  :  1.846

[Edit] Add missing non-greedy specifier

[Edit2] Added .upper() calls so case insensitive like original version

Brian 2008-09-28 20:31:44

Nice work, gets my vote!

Torsten Marek 2008-09-28 21:23:22

Answer 6

+3 A:

sub can take a callable as its argument rather than a simple string. Using that, you can replace all variables with one function call:

>>> import re
>>> var_matcher = re.compile(r'<%(.*?)%>', re.LOCALE)
>>> string = '<%"TITLE"%> <%"SHMITLE"%>'
>>> values = {'"TITLE"': "I am a title.", '"SHMITLE"': "And I am a shmitle."}
>>> var_matcher.sub(lambda m: values[m.group(1)], string)
'I am a title. And I am a shmitle.'

Follow eduffy.myopenid.com's advice and keep the compiled regexes around.

The same recipe can be applied to the first loop, only there you need to store the value of the variable first, and always return "" as replacement.

Torsten Marek 2008-09-28 20:36:43

Answer 7

+1 A:

If a regexp only contains one .* wildcard and literals, then you can use find and rfind to locate the opening and closing delimiters.

If it contains only a series of .*? wildcards and literals, then you can just use a series of find calls to do the work.

If the code is time-critical, switching away from regexps altogether might give a little more speed.
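A sketch of both cases, assuming single-line input (a regex . does not match newlines by default, while find and rfind do not care). The function names between and between_lazy are invented for this example:

```python
def between(text, opener, closer):
    """Greedy equivalent of opener(.*)closer, using find and rfind."""
    start = text.find(opener)
    if start == -1:
        return None
    start += len(opener)
    end = text.rfind(closer)          # last closer, like a greedy .*
    if end < start:
        return None
    return text[start:end]

def between_lazy(text, opener, closer):
    """Non-greedy equivalent of opener(.*?)closer, using two find calls."""
    start = text.find(opener)
    if start == -1:
        return None
    start += len(opener)
    end = text.find(closer, start)    # first closer after the opener
    if end == -1:
        return None
    return text[start:end]
```

For example, on 'a<%x%>b<%y%>c' with '<%' and '%>' as delimiters, between spans from the first opener to the last closer, while between_lazy stops at the first closer.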

Also, it looks to me like this is an LL-parsable language. You could look for a library that can already parse such things for you. You could also use recursive calls to do a one-pass parse -- for example, you could implement your processVariables function to only consume up to the first quote, and then call a quote-matching function to consume up to the next quote, etc.
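One way such a one-pass parse might look is sketched below. The tag syntax is inferred from the example string in Brian's answer, only double quotes are handled, and read_quoted and render are hypothetical names. Unlike the two-pass regex version, a variable has to be defined before it is used:

```python
def read_quoted(text, pos):
    """Consume a double-quoted string at pos; return (value, position after it)."""
    if not text.startswith('"', pos):
        raise ValueError("expected opening quote at %d" % pos)
    end = text.index('"', pos + 1)    # raises ValueError if unterminated
    return text[pos + 1:end], end + 1

def render(text):
    """Single left-to-right pass over <%"NAME"="VALUE"%> and <%"NAME"%> tags."""
    out, variables, pos = [], {}, 0
    while True:
        start = text.find("<%", pos)
        if start == -1:
            out.append(text[pos:])    # no more tags: keep the tail
            return "".join(out)
        out.append(text[pos:start])   # literal text before the tag
        name, pos = read_quoted(text, start + 2)
        if text.startswith("=", pos):
            # Definition: remember the value and emit nothing.
            value, pos = read_quoted(text, pos + 1)
            variables[name.upper()] = value
        else:
            # Reference: emit the stored value (KeyError if undefined).
            out.append(variables[name.upper()])
        if not text.startswith("%>", pos):
            raise ValueError("expected %%> at %d" % pos)
        pos += 2
```

Each character is visited once, and the quote handling lives in one small function, which is the division of labour described above.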

Tyler 2008-09-28 20:47:53

Answer 8

+2 A:

Creating a templating language is all well and good, but shouldn't one of the goals of the templating language be easy readability and efficient parsing? The example you gave seems to be neither.

As Jamie Zawinski famously said:

Some people, when confronted with a problem, think "I know, I'll use regular expressions!" Now they have two problems.

If regular expressions are a solution to a problem you have created, the best bet is not to write a better regular expression, but to redesign your approach to eliminate their use entirely. Regular expressions are complicated, expensive, hugely difficult to maintain, and (ideally) should only be used for working around a problem someone else created.

Dan Udey 2008-09-28 21:05:37

In principle, I agree. We should go to uservoice and force them to show this quote every time somebody writes a question and gives it the "regex" tag.

Torsten Marek 2008-09-28 21:25:43

I'm curious why you think the example has neither easy readability nor efficient parsing. What could I change to make it more readable and easier to parse?

Schof 2008-09-30 04:31:20

Answer 9

A:

Why not use XML and XSLT instead of creating your own template language? What you want to do is pretty easy in XSLT.

WildJoe 2008-09-28 21:27:53

Answer 10

+1 A:

Why not use Mako? Seriously. What feature do you require that Mako doesn't have? Perhaps you can adapt or extend something that already works.

S.Lott 2008-09-28 22:07:43
