views:

838

answers:

10

My goal here is to create a very simple template language. At the moment, I'm working on replacing a variable with a value, like this:

This input:

<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>

Should produce this output:

The Web This Is A Test Variable

I've got it working. But looking at my code, I'm running multiple identical regexes on the same strings -- that just offends my sense of efficiency. There's got to be a better, more Pythonic way. (It's the two "while" loops that really offend.)

This does pass the unit tests, so if this is silly premature optimization, tell me -- I'm willing to let this go. There may be dozens of these variable definitions and uses in a document, but not hundreds. But I suspect there's obvious (to other people) ways of improving this, and I'm curious what the StackOverflow crowd will come up with.

def stripMatchedQuotes(item):
    MatchedSingleQuotes = re.compile(r"'(.*)'", re.LOCALE)
    MatchedDoubleQuotes = re.compile(r'"(.*)"', re.LOCALE)
    item = MatchedSingleQuotes.sub(r'\1', item, 1)
    item = MatchedDoubleQuotes.sub(r'\1', item, 1)
    return item




def processVariables(item):
    VariableDefinition = re.compile(r'<%(.*?)=(.*?)%>', re.LOCALE)
    VariableUse = re.compile(r'<%(.*?)%>', re.LOCALE)
    Variables={}

    while VariableDefinition.search(item):
        VarName, VarDef = VariableDefinition.search(item).groups()
        VarName = stripMatchedQuotes(VarName).upper().strip()
        VarDef = stripMatchedQuotes(VarDef.strip())
        Variables[VarName] = VarDef
        item = VariableDefinition.sub('', item, 1)

    while VariableUse.search(item):
        VarName = stripMatchedQuotes(VariableUse.search(item).group(1).upper()).strip()
        item = VariableUse.sub(Variables[VarName], item, 1)

    return item
+2  A: 

Never create your own programming language. Ever. (I used to have an exception to this rule, but not any more.)

There is always an existing language you can use which suits your needs better. If you elaborated on your use-case, people may help you select a suitable language.

JesperE
Domain-specific language creation is a perfectly OK thing to do, if its appropriate. Never say never.
skaffman
That may be, but there are already a zillion good choices for text templating languages in Python...
Dan
Part of the reason I'm doing this is to increase my skills. If this was something I was getting paid for, I would follow your suggestion and do it the quickest way. As it is I'm trying to maximize learning, not programmer efficiency.
Schof
Designing a language for self-education purposes was actually one of the exceptions I mentioned. Regarding DSLs: you can often get away by implementing a well-designed class-library or API instead.
JesperE
+1  A: 

You can match both kind of quotes in one go with r"(\"|')(.*?)\1" - the \1 refers to the first group, so it will only match matching quotes.

JacquesB
A: 

Don't call search twice in a row (in the loop conditional, and the first statement in the loop). Call (and cache the result) once before the loop, and then in the final statement of the loop.

eduffy
+1  A: 

You're calling re.compile quite a bit. A global variable for these wouldn't hurt here.

eduffy
+9  A: 

The first thing that may improve things is to move the re.compile outside the function. The compilation is cached, but there is a speed hit in checking this to see if its compiled.

Another possibility is to use a single regex as below:

MatchedQuotes = re.compile(r"(['\"])(.*)\1", re.LOCALE)
item = MatchedQuotes.sub(r'\2', item, 1)

Finally, you can combine this into the regex in processVariables. Taking Torsten Marek's suggestion to use a function for re.sub, this improves and simplifies things dramatically.

VariableDefinition = re.compile(r'<%(["\']?)(.*?)\1=(["\']?)(.*?)\3%>', re.LOCALE)
VarRepl = re.compile(r'<%(["\']?)(.*?)\1%>', re.LOCALE)

def processVariables(item):
    vars = {}
    def findVars(m):
        vars[m.group(2).upper()] = m.group(4)
        return ""

    item = VariableDefinition.sub(findVars, item)
    return VarRepl.sub(lambda m: vars[m.group(2).upper()], item)

print processVariables('<%"TITLE"="This Is A Test Variable"%>The Web <%"TITLE"%>')

Here are my timings for 100000 runs:

Original       : 13.637
Global regexes : 12.771
Single regex   :  9.095
Final version  :  1.846

[Edit] Add missing non-greedy specifier

[Edit2] Added .upper() calls so case insensitive like original version

Brian
Nice work, gets my vote!
Torsten Marek
+3  A: 

sub can take a callable as it's argument rather than a simple string. Using that, you can replace all variables with one function call:

>>> import re
>>> var_matcher = re.compile(r'<%(.*?)%>', re.LOCALE)
>>> string = '<%"TITLE"%> <%"SHMITLE"%>'
>>> values = {'"TITLE"': "I am a title.", '"SHMITLE"': "And I am a shmitle."}
>>> var_matcher.sub(lambda m: vars[m.group(1)], string)
'I am a title. And I am a shmitle.

Follow eduffy.myopenid.com's advice and keep the compiled regexes around.

The same recipe can be applied to the first loop, only there you need to store the value of the variable first, and always return "" as replacement.

Torsten Marek
+1  A: 

If a regexp only contains one .* wildcard and literals, then you can use find and rfind to locate the opening and closing delimiters.

If it contains only a series of .*? wildcards, and literals, then you can just use a series of find's to do the work.

If the code is time-critical, this switch away from regexp's altogether might give a little more speed.

Also, it looks to me like this is an LL-parsable language. You could look for a library that can already parse such things for you. You could also use recursive calls to do a one-pass parse -- for example, you could implement your processVariables function to only consume up the first quote, and then call a quote-matching function to consume up to the next quote, etc.

Tyler
+2  A: 

Creating a templating language is all well and good, but shouldn't one of the goals of the templating language be easy readability and efficient parsing? The example you gave seems to be neither.

As Jamie Zawinsky famously said:

Some people, when confronted with a problem, think "I know, I'll use regular expressions!" Now they have two problems.

If regular expressions are a solution to a problem you have created, the best bet is not to write a better regular expression, but to redesign your approach to eliminate their use entirely. Regular expressions are complicated, expensive, hugely difficult to maintain, and (ideally) should only be used for working around a problem someone else created.

Dan Udey
In principle, I agree. We should go to uservoice and force them to show this quote every time somebody writes a question and gives it the "regex" tag.
Torsten Marek
I'm curious why you think the example has neither easy readability or efficient parsing? What could I change to make it more readable and easier to parse?
Schof
A: 

Why not use XML and XSLT instead of creating your own template language? What you want to do is pretty easy in XSLT.

WildJoe
+1  A: 

Why not use Mako? Seriously. What feature do you require that Mako doesn't have? Perhaps you can adapt or extend something that already works.

S.Lott