views:

896

answers:

4

I have a list of tuples, each containing a find/replace value that I would like to apply to a string. What would be the most efficient way to do so? I will be applying this iteratively, so performance is my biggest concern.

More concretely, what would the innards of processThis() look like?

x = 'find1, find2, find3'
y = [('find1', 'replace1'), ('find2', 'replace2'), ('find3', 'replace3')]

def processThis(str,lst):
     # Do something here
     return something

>>> processThis(x,y)
'replace1, replace2, replace3'

Thanks, all!

A: 
x = 'find1, find2, find3'
y = [('find1', 'replace1'), ('find2', 'replace2'), ('find3', 'replace3')]

def processThis(str,lst):
    for find, replace in lst:
        str = str.replace(find, replace)

    return str

>>> processThis(x,y)
'replace1, replace2, replace3'
Unknown
You beat me to it by about 5 seconds :)
Matt
Thanks, Unknown! This is the obvious solution which works well, but I'm concerned about performance. If len(y) is 100, and I apply it 10 times on a page, we're doing 1000 discrete str.replace() calls in one page. This seems suboptimal to me; wouldn't regex be better suited for this?
cpharmston
No, you would still be doing the same re.sub with a regex. If you want absolute fastest speed, you probably need to drop down to the C level to implement your own stream based Regex that does replacements based on a switch table.
Unknown
I retract my original comment. Mhawke's solution *might* be faster. I'm not sure if re.sub will only iterate over the string once, but this solution goes into C mode more often. You should either test this one out to see if it is fast enough or test Mhawke's solution just to make sure. I am sure that they are either about the same, or one is faster under different circumstances.
Unknown
"Unknown" is wrong. Doing it with a regex containing all of the source strings to match--"find1|find2|find3"--will not scan the string repeatedly. See mhawke's answer.
Glenn Maynard
Quick tests:- In very small tests where len(y) = 3, run three times on a 30 character string with 2 substitutions, str.replace() is marginally (~8%) faster- In medium-sized tests where len(y) = 20, run 20 times on a 200 character string with 10 substitutions, re.sub() is substantially (~22%) faster- In large tests where len(y) = 500, run 500 times on a 5000 character string with 25 substitutions, re.sub is much (~60%) fasterThanks for the great discussion, all!
cpharmston
+5  A: 

You could consider using re.sub:

import re
REPLACEMENTS = dict([('find1', 'replace1'),
                     ('find2', 'replace2'),
                     ('find3', 'replace3')])

def replacer(m):
    return REPLACEMENTS[m.group(0)]

x = 'find1, find2, find3'
r = re.compile('|'.join(REPLACEMENTS.keys()))
print r.sub(replacer, x)
mhawke
Thanks mhawke! How would this compare performance-wise to Unknown's solution?
cpharmston
@cpharmston: using replace() requires that it be called once for each substitution. If there are many substitutions and/or the working string is lengthy, this will be inefficient. re.sub() should process the working string once, but has some setup overhead.
mhawke
You can easily cache the compiled regex.
Glenn Maynard
Python already maintains a cache of compiled regexes. `fgrep cache your_python_directory/Lib/re.py`
John Machin
@mhawke: re.sub steps through each position in the text and tests if a "find" matches at that position -- no automaton. Time is O((size of text)*(number of "finds")*(average size of a "find")). Multiple uses of str.replace(): same. However: str.replace uses a Boyer-Moore variant to skip rapidly through text but traverses text multiple times, possibly trashing memory cache, and will chop up memory because it creates a new replacement string each time a "find" has to be replaced. re.sub traverses the text once with no skipping and creates repl string only once. re.sub prob winner; do benchmarks.
John Machin
Thanks, John. Does this mean anything to you? http://bit.ly/14E983 :)
cpharmston
May be you can replace r assignment with r = re.compile('|'.join(re.split(re.compile(', +'),x)))
siri
@cpharmston: Of course. It's my pi formula. What's the relevance?
John Machin
A: 
s = reduce(lambda x, repl: str.replace(x, *repl), lst, s)
Glenn Maynard
+1  A: 

A couple notes:

  1. The boilerplate argument about premature optimization, benchmarking, bottlenecks, 100 is small, etc.
  2. There are cases where the different solutions will return different results. if y = [('one', 'two'), ('two', 'three')] and x = 'one' then mhawke's solution gives you 'two' and Unknown's gives 'three'.
  3. Testing this out in a silly contrived example mhawke's solution was a tiny bit faster. It should be easy to try it with your data though.
jtb
If you'd be willing to oblige me, I'd love to hear the boilerplate argument. I'm still a noob at heart :)
cpharmston
+1 for pointing out that the results differ. This is important!
mhawke
@cpharmston: Whenever questions like this come up, on StackOverflow or elsewhere, somebody is bound to complain that performance isn't the most important consideration unless it's actually the case that this function is a bottleneck in your application. They'll suggest that it's better to worry about edge case correctness, readability/maintainability or even programming time. I hate to be that guy (off the clock) since this discussion is interesting, but since this one is a wash asymptotically and there were correctness differences I figured I'd at least hint at the argument.
jtb
Search StackOverflow for "premature optimization" to see plenty of discussion about those issues; even including a rather meta question about how to respond to people whining about premature optimization :-) http://stackoverflow.com/questions/438158/how-should-we-handle-premature-optimization-discussion-in-optimization-questions
jtb
Ah, thanks jtb. The argument against worrying about performance makes sense, but I have two counterpoints: (1) This will eventually end up being code an a reusable Django app, so in this case, it's probably prudent to plan for the worst case scenario. (2) Part of the reason I made it an academic exercise as I did was to help me learn Python. Even if the pragmatic difference is negligible, there is usually a right way to do things. The discussion (or disagreement) about the right way will usually lead to enlightening conversations, as this was. Thanks everyone!
cpharmston