For instance, say I wanted a function to escape a string for use in HTML (as in Django's escape filter):

    def escape(string):
        """
        Returns the given string with ampersands, quotes and angle 
        brackets encoded.
        """
        return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

This works, but it gets ugly quickly and appears to have poor algorithmic performance (in this example, the string is traversed five separate times). What would be better is something like this:

    def escape(string):
        """
        Returns the given string with ampersands, quotes and angle 
        brackets encoded.
        """
        # Note that ampersands must be escaped first; the rest can be escaped in 
        # any order.
        return replace_multi(string.replace('&', '&amp;'),
                             {'<': '&lt;', '>': '&gt;', 
                              "'": '&#39;', '"': '&quot;'})

Does such a function exist, or is the standard Python idiom to use what I wrote before?

+5  A: 

That's what Django does:

from django.utils.safestring import mark_safe
from django.utils.encoding import force_unicode

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))
Ken
+3  A: 

Apparently it's pretty common to implement that via regex. You can find an example of this at ASPN and here.
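
A minimal sketch of the idea behind such recipes (illustrative, not the linked recipe itself): build one alternation out of the escaped search strings, then substitute every match in a single pass via a dictionary lookup.

import re

def replace_multi(string, replacements):
    # One pattern that matches any of the search strings...
    pattern = re.compile('|'.join(re.escape(key) for key in replacements))
    # ...and one pass over the string, replacing each match from the dict.
    return pattern.sub(lambda match: replacements[match.group(0)], string)

For example, replace_multi('<b>', {'<': '&lt;', '>': '&gt;'}) returns '&lt;b&gt;'.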

bebraw
+6  A: 

I prefer something clean like:

substitutions = [
    ('<', '&lt;'),
    ('>', '&gt;'),
    ...]

for search, replacement in substitutions:
    string = string.replace(search, replacement)
mkotechno
"clean"? Could you define "clean" as part of this answer?
S.Lott
This still has essentially the same performance as the first example, though.
meeselet
Also, `for replace in replaces: (search, replacement) = replace` is silly. The parentheses aren't needed, and neither is the assignment. We can just do `for search, replacement in replaces:`
Chris Lutz
I know the parentheses are not needed, but explicit is better than implicit, and I could also use *args or comprehensions, but I prefer to give an easy example.
mkotechno
+4  A: 

In accordance with bebraw's suggestion, here is what I ended up using (in a separate module, of course):

import re

class Subs(object):
    """
    A container holding strings to be searched for and replaced in
    replace_multi().

    Holds little relation to the sandwich.
    """
    def __init__(self, needles_and_replacements):
        """
        Returns a new instance of the Subs class, given a dictionary holding 
        the keys to be searched for and the values to be used as replacements.
        """
        self.lookup = needles_and_replacements
        self.regex = re.compile('|'.join(map(re.escape,
                                             needles_and_replacements)))

def replace_multi(string, subs):
    """
    Replaces the given items in the string efficiently, in a single pass.

    "string" should be the string to be searched.
    "subs" can be either:
        A.) a dictionary containing as its keys the items to be
            searched for and as its values the items to be replaced.
        or B.) a pre-compiled instance of the Subs class from this module
               (which may have slightly better performance if this is
                called often).
    """
    if not isinstance(subs, Subs): # Assume dictionary if not our class.
        subs = Subs(subs)
    lookup = subs.lookup
    return subs.regex.sub(lambda match: lookup[match.group(0)], string)

Example usage:

def escape(string):
    """
    Returns the given string with ampersands, quotes and angle 
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in 
    # any order.
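    # NB: this rebuilds the Subs object (and its regex) on every call;
    # creating it once at module level would avoid that overhead.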
    escape.subs = Subs({'<': '&lt;', '>': '&gt;', "'": '&#39;', '"': '&quot;'})
    return replace_multi(string.replace('&', '&amp;'), escape.subs)

Much better :). Thanks for the help.

Edit

Never mind, Mike Graham was right. I benchmarked it, and the replacement actually ends up being much slower.

Code:

from urllib2 import urlopen
import timeit

def escape1(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

def escape2(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in
    # any order.
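    # NB: as above, the Subs object is rebuilt on every call, so that
    # overhead is included in the timing below.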
    escape2.subs = Subs({'<': '&lt;', '>': '&gt;', "'": '&#39;', '"': '&quot;'})
    return replace_multi(string.replace('&', '&amp;'), escape2.subs)

# An example test on the stackoverflow homepage.
request = urlopen('http://stackoverflow.com')
test_string = request.read()
request.close()

test1 = timeit.Timer('escape1(test_string)',
                     setup='from __main__ import escape1, test_string')
test2 = timeit.Timer('escape2(test_string)',
                     setup='from __main__ import escape2, test_string')
print 'multi-pass:', test1.timeit(2000)
print 'single-pass:', test2.timeit(2000)

Output:

multi-pass: 15.9897229671
single-pass: 66.5422530174

So much for that.

meeselet
You have an `IndentationError`. (You also have an insane API for `replace_multi`. Typechecking to simulate overloading is bad.)
Mike Graham
By the way, seeing what you are escaping, you might want to take a look at cgi.escape. It seems to do the job except for the quote parts. :)
bebraw
i do not want to sound rude, but IMHO this solution is incredibly over-engineered. also, using a dictionary for replacements is the wrong solution whenever a search text appears in one of the replacement texts: because you cannot predict which substitution will occur first, using a dictionary can result in substitutions being done twice (e.g. '>' is first turned into '&gt;', and then the ampersand is escaped again, giving '&amp;gt;').
flow
@flow: not exactly... it's a simple class and a function, pretty easily abstracted to another module. But it IS a redacted solution after all. Also, the partial ordering is taken into account, which is why the ampersand is not replaced via the dictionary but directly on the string first (as the comment in the code says).
meeselet
+12  A: 

Do you have an application that is running too slowly, and did you profile it and find that a line like this snippet is causing it to be slow? Bottlenecks occur in unexpected places.


The current snippet traverses the string five times, doing one thing each time. You are suggesting traversing it once, probably doing five things each time (or at least doing something each time). It isn't clear to me that this will automatically do a better job. Currently the algorithm used is O(n*m) (assuming the length of the string is longer than the stuff in the rules), where n is the length of the string and m is the number of substitution rules. You could, I think, reduce the algorithmic complexity to something like O(n*log(m)) and, in the specific case we're in (where the original search strings are all single characters, though not in the case of multiple replace calls in general), to O(n); but this doesn't matter, since m is 5 while n is unbounded.

If m is held constant, then the complexity of both solutions really goes to O(n). It is not clear to me that it is a worthwhile task to try to turn five simple passes into one complex one, the actual running time of which I cannot guess at the moment. If there were something about it that could make it scale better, I would have thought it a much more worthwhile task.

Doing everything in one pass rather than in consecutive passes also raises questions about what to do with conflicting rules and how to apply them. With a chain of replace calls, the resolution of these questions is obvious.
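
For instance, rule ordering changes the result, and a chain of replace calls makes the chosen order explicit:

    >>> '<'.replace('<', '&lt;').replace('&', '&amp;')  # ampersand escaped last
    '&amp;lt;'
    >>> '<'.replace('&', '&amp;').replace('<', '&lt;')  # ampersand escaped first
    '&lt;'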

Mike Graham
While I agree mostly, 1 traversal of the string would be better for memory locality assuming that the length of the string is much larger than the length of the replace list.
tster
+2  A: 

You can use reduce:

reduce(lambda s,r: s.replace(*r),
       [('&', '&amp;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ("'", '&#39;'),
        ('"', '&quot;')],
       string)
Stephen
This code is less readable than a for loop or typed-out `replace` chain.
Mike Graham
I know that's the Python Union answer ("reduce is less readable than a for-loop") but I'm in favor of the discretionary use of reduce.
Stephen
+8  A: 

How about we just test various ways of doing this and see which comes out faster (assuming we only care about the fastest way to do it)?

def escape1(input):
    return input.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

translation_table = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    "'": '&#39;',
    '"': '&quot;',
}

def escape2(input):
    return ''.join(translation_table.get(char, char) for char in input)

import re
_escape3_re = re.compile(r'[&<>\'"]')
def _escape3_repl(x):
    s = x.group(0)
    return translation_table.get(s, s)
def escape3(x):
    return _escape3_re.sub(_escape3_repl, x)

# unicode.translate expects a mapping of ordinals (ints) to replacement
# strings, so the table must be re-keyed with ord() for translate to work.
unicode_translation_table = dict((ord(key), value)
                                 for key, value in translation_table.items())

def escape4(x):
    return unicode(x).translate(unicode_translation_table)

test_strings = (
    'Nothing in there.',
    '<this is="not" a="tag" />',
    'Something & Something else',
    'This one is pretty long. ' * 50
)

import time

for test_i, test_string in enumerate(test_strings):
    print repr(test_string)
    for func in escape1, escape2, escape3, escape4:
        start_time = time.time()
        for i in xrange(1000):
            x = func(test_string)
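        # With 1000 iterations, the elapsed seconds are numerically equal
        # to the per-call time in milliseconds, hence the "ms" label below.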
        print '\t%s done in %.3fms' % (func.__name__, (time.time() - start_time))
    print

Running this gives you:

'Nothing in there.'
    escape1 done in 0.002ms
    escape2 done in 0.009ms
    escape3 done in 0.001ms
    escape4 done in 0.005ms

'<this is="not" a="tag" />'
    escape1 done in 0.002ms
    escape2 done in 0.012ms
    escape3 done in 0.009ms
    escape4 done in 0.007ms

'Something & Something else'
    escape1 done in 0.002ms
    escape2 done in 0.012ms
    escape3 done in 0.003ms
    escape4 done in 0.007ms

'This one is pretty long. <snip>'
    escape1 done in 0.008ms
    escape2 done in 0.386ms
    escape3 done in 0.011ms
    escape4 done in 0.310ms

Looks like just replacing them one after another goes the fastest.

Edit: Running the tests again with 1000000 iterations gives the following for the first three strings (the fourth would take too long on my machine for me to wait =P):

'Nothing in there.'
    escape1 done in 0.001ms
    escape2 done in 0.008ms
    escape3 done in 0.002ms
    escape4 done in 0.005ms

'<this is="not" a="tag" />'
    escape1 done in 0.002ms
    escape2 done in 0.011ms
    escape3 done in 0.009ms
    escape4 done in 0.007ms

'Something & Something else'
    escape1 done in 0.002ms
    escape2 done in 0.011ms
    escape3 done in 0.003ms
    escape4 done in 0.007ms

The numbers are pretty much the same. In the first case they are actually even more consistent, as the direct string replacement is now the fastest.

Mike Boers
Using the timeit module will give much more accurate timing results - some of these results are going to be at the level of background noise.
Dave Kirby
@Dave if your issue is with the number of iterations, I increased it 1000x and it comes out pretty much the same.
Mike Boers
+1: Always measure first before doing optimizations
J.F. Sebastian
+1  A: 

If you work with non-Unicode strings in Python < 3.0, try this alternative to the translate method:

# Python < 3.0
import itertools

def escape(a_string):
    replacer= dict( (chr(c),chr(c)) for c in xrange(256))
    replacer.update(
        {'&': '&amp;',
         '<': '&lt;',
         '>': '&gt;',
         '"': '&quot;',
         "'": '&#39;'}
    )
    return ''.join(itertools.imap(replacer.__getitem__, a_string))

if __name__ == "__main__":
    print escape('''"Hello"<i> to George's friend&co.''')

$ python so2484156.py 
&quot;Hello&quot;&lt;i&gt; to George&#39;s friend&amp;co.

This is closer to a "single scan" of the input string, as per your wish.

EDIT

My intention was to create a unicode.translate equivalent that was not restricted to single-character replacements, so I came up with the answer above. I then got a comment by user "flow" that was almost completely out of context, with a single correct point: the code above, as is, is intended to work with byte strings and not unicode strings. There is an obvious update (i.e. unichr() … xrange(sys.maxunicode+1)) which I strongly dislike, so I came up with another function that works on both unicode and byte strings, given that Python guarantees:

all( (chr(i)==unichr(i) and hash(chr(i))==hash(unichr(i)))
    for i in xrange(128)) is True

The new function follows:

def escape(a_string):
    replacer= {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;',
    }
    return ''.join(
        itertools.starmap(
            replacer.get, # .setdefault *might* be faster
            itertools.izip(a_string, a_string)
        )
    )

Notice the use of starmap with a sequence of (char, char) tuples: for any character not in the replacer dict, the second element of the pair serves as the default, so the character is returned unchanged.
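
In other words, each call amounts to replacer.get(char, char), with the character itself supplied as the fallback:

>>> replacer = {'&': '&amp;'}
>>> replacer.get('&', '&')
'&amp;'
>>> replacer.get('x', 'x')
'x'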

ΤΖΩΤΖΙΟΥ
sorry but this solution is (1) incorrect because using a dictionary misses the partial ordering of substitutions that must be observed. also it is (2) by far too obfuscated: using ‘imap’ to write a simple loop is not good. iteration can only give you something if you access intermediate results. (3) the solution is highly inefficient because it performs 256 substitutions where only 5 were called for (so performance-wise the code is 5120% off the mark). (4) the restriction to byte strings is deplorable and exactly not what you want when programming for a multilingual, global network. sorry.
flow
Um. (1) sorry, but a partial ordering of substitutions is not required here. Are you sure you read the answer? Did you try the example? (2) I do access the intermediate results in the `.join` operation. What are you talking about? (3) Which 256 substitutions are you talking about? This is equivalent to a `.translate` call (part of the reason for using `itertools.imap` and `replacer.__getitem__` is taking advantage of C speed). (4) You are partly correct about the restriction, but the OP just said string; however, a UTF-8 encoded *byte string* is also acceptable as an argument to the function.
ΤΖΩΤΖΙΟΥ
This is the first time I wish one could mark a comment as 'totally missing the point'.
ΤΖΩΤΖΙΟΥ
@flow: did you read and/or try the answer before commenting, or did you give it a quick glance before coming to a mistaken conclusion?
ΤΖΩΤΖΙΟΥ
flow
@flow: I'm sorry, you are incredibly wrong, wrong, wrong. If you gave me an 'E' because you *thought* my code is wrong, you would prove that you shouldn't be a teacher at all. Now, you might have read my code, but you didn't understand it. My code is correct. It's a shame you think that my code *might* escape an ampersand twice, because it doesn't. The convolution and the hairiness is in the eye of the beholder, especially if the beholder is you. You only had to try the code; you didn't. You're not a better interpreter of Python than Python itself, sorry.
ΤΖΩΤΖΙΟΥ
@flow: eventually, if you try to prove *my code* wrong, you will understand *you* were wrong. Even if you did, though, I bet it's hard for you to accept publicly the fact that you were wrong. So you'll ignore your error and insist on the convolution and hairiness. I'll understand.
ΤΖΩΤΖΙΟΥ
orestis
@orestis: Of course what you suggest is fine (and also one of Mike Boers' suggestions). Mine is essentially the same, just using itertools as a different approach.
ΤΖΩΤΖΙΟΥ
@ΤΖΩΤΖΙΟΥ: maybe i was a little quick in claiming your solution will (sporadically) produce wrong results. i am currently running and analyzing your code and will be back with results in a short while.
flow
+1  A: 

ok so i sat down and did the math. please do not get mad at me; this answer specifically discusses ΤΖΩΤΖΙΟΥ’s solution, but that would be somewhat hard to shoehorn into a comment, so let me do it this way. i will, in fact, also air some considerations that are relevant to the OP’s question.

first up, i have been discussing with ΤΖΩΤΖΙΟΥ the elegance, correctness, and viability of his approach. it turns out that the proposal, while it does use an (inherently unordered) dictionary as a register to store the substitution pairs, does in fact consistently return correct results, where i had claimed it wouldn’t. this is because the call to itertools.starmap() in line 12 of the listing below gets as its second argument an iterator over pairs of single characters/bytes (more on that later) that looks like [ ( 'h', 'h', ), ( 'e', 'e', ), ( 'l', 'l', ), ... ]. these pairs of characters/bytes are what the first argument, replacer.get, is repeatedly called with. there is not a chance to run into a situation where first '>' is transformed into '&gt;' and then inadvertently again into '&amp;gt;', because each character/byte is considered only once for substitution. so this part is in principle fine and algorithmically correct.

the next question is viability, and that includes a look at performance. if a vital task gets correctly completed in 0.01s using awkward code but in 1s using awesome code, then awkward might be considered preferable in practice (but only if the 1 second loss is in fact intolerable). here is the code i used for testing; it includes a number of different implementations. it is written in python 3.1 so we can use unicode greek letters for identifiers, which in itself is awesome (zip in py3k returns the same as itertools.izip does in py2):

import itertools                                                                  #01
                                                                                  #02
_replacements = {                                                                 #03
  '&': '&amp;',                                                                   #04
  '<': '&lt;',                                                                    #05
  '>': '&gt;',                                                                    #06
  '"': '&quot;',                                                                  #07
  "'": '&#39;', }                                                                 #08
                                                                                  #09
def escape_ΤΖΩΤΖΙΟΥ( a_string ):                                                  #10
  return ''.join(                                                                 #11
    itertools.starmap(                                                            #12
      _replacements.get,                                                          #13
      zip( a_string, a_string ) ) )                                               #14
                                                                                  #15
def escape_SIMPLE( text ):                                                        #16
  return ''.join( _replacements.get( chr, chr ) for chr in text )                 #17
                                                                                  #18
def escape_SIMPLE_optimized( text ):                                              #19
  get = _replacements.get                                                         #20
  return ''.join( get( chr, chr ) for chr in text )                               #21
                                                                                  #22
def escape_TRADITIONAL( text ):                                                   #23
  return text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')\    #24
    .replace("'", '&#39;').replace('"', '&quot;')                                 #25
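
(the timing harness itself is not shown here; a minimal driver along these lines, with illustrative names, would produce output of the shape below. the ‘in-situ’ row presumably times the chained replace calls written out inline rather than wrapped in a function.)

import time

def bench( label, fn, text, n = 100000 ):
  # hypothetical driver, not part of the original answer
  t0 = time.time()
  for _ in range( n ):
    fn( text )
  print( 'escaping with {0} took {1:.8f}sec for {2} items'.format(
    label, time.time() - t0, n ) )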

these are the timing results:

escaping with SIMPLE            took 5.74664253sec for 100000 items
escaping with SIMPLE_optimized  took 5.11457801sec for 100000 items
escaping TRADITIONAL in-situ    took 0.57543013sec for 100000 items
escaping with TRADITIONAL       took 0.62347413sec for 100000 items
escaping a la ΤΖΩΤΖΙΟΥ          took 2.66592320sec for 100000 items

turns out the original poster’s concern that the ‘traditional’ method ‘gets ugly quickly and appears to have poor algorithmic performance’ is partially unwarranted when put into this context. it actually performs best; when stashed away into a function call, we do get to see an 8% performance penalty (‘calling methods is expensive’, but in general you should still do it). in comparison, ΤΖΩΤΖΙΟΥ’s implementation takes around 5 times as long as the traditional method, which, given its higher complexity that has to compete with python’s long-honed, optimized string methods, is no surprise.

there is yet another algorithm here, the SIMPLE one. as far as i can see, it does very much exactly what ΤΖΩΤΖΙΟΥ’s method does: it iterates over the characters/bytes in the text, performs a lookup for each, then joins all the characters/bytes together and returns the resulting escaped text. you can see that where one way to do that involves a fairly lengthy and mysterious formulation, the SIMPLE implementation is actually understandable at a glance.

what really trips me up here, though, is how badly the SIMPLE approach performs: it is around 10 times as slow as the traditional one, and also twice as slow as ΤΖΩΤΖΙΟΥ’s method. i am completely at a loss here; maybe someone can come up with an idea why this should be so. it uses only the most basic building blocks of python and works with two implicit iterations, so it avoids building throw-away lists and everything, but it is still slow, and i don’t know why.

let me conclude this code review with a remark on the merit of ΤΖΩΤΖΙΟΥ’s solution. i have made it sufficiently clear that i find the code hard to read and too overblown for the task at hand. more critical than that, however, i find the way he treats characters and makes sure that for a given small range of characters they will behave in a byte-like fashion a little irritating. sure it works for the task at hand, but as soon as i iterate e.g. over the bytestring 'ΤΖΩΤΖΙΟΥ', what i do is iterate over adjacent bytes representing single characters. in most situations this is exactly what you should avoid; this is precisely the reason why in py3k ‘strings’ are now the ‘unicode objects’ of old, and the ‘strings’ of old have become ‘bytes’ and ‘bytearray’. if i were to nominate the one feature of py3k that could warrant a possibly expensive migration of code from the 2 series to the 3 series, it would be this single property of py3k. 98% of all my encoding issues have just dissolved ever since, period, and there is no clever hack that could have me seriously doubt my move. said algorithm is not ‘conceptually 8-bit-clean and unicode-safe’, which to me is a serious shortcoming, given this is 2010.

flow