ok so i sat down and did the math. please do not get mad at me; this answer specifically discusses ΤΖΩΤΖΙΟΥ’s solution, but that would have been somewhat hard to shoehorn into a comment, so let me do it this way. i will, in fact, also air some considerations that are relevant to the OP’s question.
first up, i have been discussing with ΤΖΩΤΖΙΟΥ the elegance, correctness, and viability of his approach. turns out it looks like the proposal, while it does use an (inherently unordered) dictionary as a register to store the substitution pairs, does in fact consistently return correct results, where i had claimed it wouldn’t. this is because the call to itertools.starmap()
in line 12, below, gets as its second argument an iterator over pairs of single characters/bytes (more on that later) that looks like [ ( 'h', 'h' ), ( 'e', 'e' ), ( 'l', 'l' ), ... ]. these pairs of characters/bytes are what the first argument, _replacements.get, is repeatedly called with. there is no chance of running into a situation where first '>' is transformed into '&gt;' and then inadvertently transformed again into '&amp;gt;', because each character/byte is considered only once for substitution. so this part is in principle fine and algorithmically correct.
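to make the double-escaping hazard concrete, here is a minimal sketch of mine (not part of any answer under discussion); the badly ordered sequential variant shows exactly what the per-character lookup cannot do by construction:

table = { '&': '&amp;', '>': '&gt;' }

# per-character lookup: each input character is considered exactly once,
# so the '&' inside a freshly produced '&gt;' can never be escaped again
print( ''.join( table.get( c, c ) for c in 'a > b' ) )           # a &gt; b

# chained .replace() calls in a bad order: the '&' produced by the first
# replacement gets re-escaped by the second one
print( 'a > b'.replace( '>', '&gt;' ).replace( '&', '&amp;' ) )  # a &amp;gt; b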
the next question is viability, and that includes a look at performance. if a vital task gets correctly completed in 0.01s using awkward code but in 1s using awesome code, then awkward might be considered preferable in practice (but only if the ~1 second loss is in fact intolerable). here is the code i used for testing; it includes a number of different implementations. it is written in python 3.1 so we can use unicode greek letters for identifiers, which in itself is awesome (zip in py3k returns the same as itertools.izip in py2):
import itertools                                                      #01
                                                                      #02
_replacements = {                                                     #03
    '&': '&amp;',                                                     #04
    '<': '&lt;',                                                      #05
    '>': '&gt;',                                                      #06
    '"': '&quot;',                                                    #07
    "'": '&#39;', }                                                   #08
                                                                      #09
def escape_ΤΖΩΤΖΙΟΥ( a_string ):                                      #10
    return ''.join(                                                   #11
        itertools.starmap(                                            #12
            _replacements.get,                                        #13
            zip( a_string, a_string ) ) )                             #14
                                                                      #15
def escape_SIMPLE( text ):                                            #16
    return ''.join( _replacements.get( chr, chr ) for chr in text )   #17
                                                                      #18
def escape_SIMPLE_optimized( text ):                                  #19
    get = _replacements.get                                           #20
    return ''.join( get( chr, chr ) for chr in text )                 #21
                                                                      #22
def escape_TRADITIONAL( text ):                                       #23
    return ( text.replace( '&', '&amp;' ).replace( '<', '&lt;' )      #24
                 .replace( '>', '&gt;' ).replace( "'", '&#39;' )      #25
                 .replace( '"', '&quot;' ) )                          #26
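the exact timing harness and sample text are not shown here; just to fix ideas, a loop roughly shaped like this (my reconstruction, with names of my own) would produce output in the format below:

import time

def time_escaper( fn, label, text, n = 100000 ):
    # crude wall-clock timing; prints in the same format as the results below
    t0 = time.time()
    for _ in range( n ):
        fn( text )
    print( 'escaping with %s took %.8fsec for %s items' % ( label, time.time() - t0, n ) )

# e.g. time_escaper( escape_SIMPLE, 'SIMPLE', some_sample_text )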
these are the timing results:
escaping with SIMPLE took 5.74664253sec for 100000 items
escaping with SIMPLE_optimized took 5.11457801sec for 100000 items
escaping TRADITIONAL in-situ took 0.57543013sec for 100000 items
escaping with TRADITIONAL took 0.62347413sec for 100000 items
escaping a la ΤΖΩΤΖΙΟΥ took 2.66592320sec for 100000 items
turns out the original poster’s concern that the ‘traditional’ method gets ‘ugly quickly and appears to have poor algorithmic performance’ appears partially unwarranted when put into this context. it actually performs best; when stashed away into a function call, we do get to see an 8% performance penalty (‘calling methods is expensive’, but in general you should still do it). in comparison, ΤΖΩΤΖΙΟΥ’s implementation takes around 5 times as long as the traditional method, which, given its higher complexity that has to compete with python’s long-honed, optimized string methods, is no surprise.
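a note on the ‘in-situ’ row above: as i read that label, it simply means the chained replaces written straight at the point of use, with no wrapper function around them, i.e. something like:

# chained replaces inlined at the call site; skipping the wrapper function
# is what buys back the ~8% mentioned above (my reading of the 'in-situ' label)
escaped = text.replace( '&', '&amp;' ).replace( '<', '&lt;' ).replace( '>', '&gt;' ) \
              .replace( "'", '&#39;' ).replace( '"', '&quot;' )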
there is yet another algorithm here, the SIMPLE one. as far as i can see, it does exactly what ΤΖΩΤΖΙΟΥ’s method does: it iterates over the characters/bytes in the text, performs a lookup for each, joins all the characters/bytes together, and returns the resulting escaped text. you can see that where one way of doing that involves a fairly lengthy and mysterious formulation, the SIMPLE implementation is actually understandable at a glance.
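as a quick sanity check that SIMPLE, ΤΖΩΤΖΙΟΥ and TRADITIONAL really agree (the sample string here is mine):

sample = 'if x < 3 & y > 5: print( "what\'s up?" )'
assert escape_SIMPLE( sample ) == escape_ΤΖΩΤΖΙΟΥ( sample ) == escape_TRADITIONAL( sample )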
what really trips me up here, though, is how badly the SIMPLE approach performs: it is around 10 times as slow as the traditional one, and also twice as slow as ΤΖΩΤΖΙΟΥ’s method. i am completely at a loss here; maybe someone can come up with an idea why this should be so. it uses only the most basic building blocks of python and works with two implicit iterations, so it avoids building throw-away lists and everything, but it is still slow, and i don’t know why.
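for whoever wants to dig into this, a focused micro-benchmark along these lines (my own sketch, separate from the timings above) would at least isolate the generator expression from the starmap pipeline on identical input:

import timeit

setup = """
import itertools
table = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' }
get = table.get
text = 'if x < 3 & y > 5 then "quote" it' * 50
"""
# generator expression vs starmap over zipped pairs, same table, same text
print( timeit.timeit( "''.join( get( c, c ) for c in text )", setup, number = 10000 ) )
print( timeit.timeit( "''.join( itertools.starmap( get, zip( text, text ) ) )", setup, number = 10000 ) )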
let me conclude this code review with a remark on the merit of ΤΖΩΤΖΙΟΥ’s solution. i have made it sufficiently clear that i find the code hard to read and too overblown for the task at hand. more critical than that, however, i find the way he treats characters and makes sure that for a given small range of characters they will behave in a byte-like fashion a little irritating. sure it works for the task at hand, but as soon as i iterate e.g. over the bytestring 'ΤΖΩΤΖΙΟΥ', what i actually iterate over is single bytes, and any character encoded with more than one byte gets torn apart. in most situations this is exactly what you should avoid; it is precisely why in py3k the ‘strings’ of old have become ‘bytes’ and ‘bytearray’, and ‘strings’ are now the ‘unicode objects’ of old. if i were to nominate the one feature of py3k that could warrant a possibly expensive migration of code from the 2 series to the 3 series, it would be this single property of py3k. 98% of all my encoding issues have just dissolved ever since, period, and there is no clever hack that could have me seriously doubt my move. said algorithm is not ‘conceptually 8-bit-clean and unicode safe’, which to me is a serious shortcoming, given this is 2010.
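to make the characters-vs-bytes point concrete, a quick py3 illustration (in py2, iterating over the bytestring would yield 1-byte strs instead of ints):

s = 'ΤΖΩΤΖΙΟΥ'
print( list( s )[ :3 ] )    # ['Τ', 'Ζ', 'Ω'], whole characters
b = s.encode( 'utf-8' )
print( list( b )[ :6 ] )    # [206, 164, 206, 150, 206, 169], raw bytes
# each greek letter spans two utf-8 bytes, so byte-wise iteration tears characters apart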