ansaurus

Question

Iterative find/replace from a list of tuples in Python

Answer 1

A:

x = 'find1, find2, find3'
y = [('find1', 'replace1'), ('find2', 'replace2'), ('find3', 'replace3')]

def processThis(str,lst):
    for find, replace in lst:
        str = str.replace(find, replace)

    return str

>>> processThis(x,y)
'replace1, replace2, replace3'

Unknown 2009-07-24 02:53:08

You beat me to it by about 5 seconds :)

Matt 2009-07-24 02:54:14

Thanks, Unknown! This is the obvious solution which works well, but I'm concerned about performance. If len(y) is 100, and I apply it 10 times on a page, we're doing 1000 discrete str.replace() calls in one page. This seems suboptimal to me; wouldn't regex be better suited for this?

cpharmston 2009-07-24 03:01:03

No, you would still be doing the same re.sub with a regex. If you want absolute fastest speed, you probably need to drop down to the C level to implement your own stream based Regex that does replacements based on a switch table.

Unknown 2009-07-24 03:13:20

I retract my original comment. Mhawke's solution *might* be faster. I'm not sure if re.sub will only iterate over the string once, but this solution goes into C mode more often. You should either test this one out to see if it is fast enough or test Mhawke's solution just to make sure. I am sure that they are either about the same, or one is faster under different circumstances.

Unknown 2009-07-24 03:22:20

"Unknown" is wrong. Doing it with a regex containing all of the source strings to match--"find1|find2|find3"--will not scan the string repeatedly. See mhawke's answer.

Glenn Maynard 2009-07-24 03:28:07

Quick tests:- In very small tests where len(y) = 3, run three times on a 30 character string with 2 substitutions, str.replace() is marginally (~8%) faster- In medium-sized tests where len(y) = 20, run 20 times on a 200 character string with 10 substitutions, re.sub() is substantially (~22%) faster- In large tests where len(y) = 500, run 500 times on a 5000 character string with 25 substitutions, re.sub is much (~60%) fasterThanks for the great discussion, all!

cpharmston 2009-07-24 03:32:17

Answer 2

+5 A:

You could consider using re.sub:

import re
REPLACEMENTS = dict([('find1', 'replace1'),
                     ('find2', 'replace2'),
                     ('find3', 'replace3')])

def replacer(m):
    return REPLACEMENTS[m.group(0)]

x = 'find1, find2, find3'
r = re.compile('|'.join(REPLACEMENTS.keys()))
print r.sub(replacer, x)

mhawke 2009-07-24 02:55:05

Thanks mhawke! How would this compare performance-wise to Unknown's solution?

cpharmston 2009-07-24 03:00:15

@cpharmston: using replace() requires that it be called once for each substitution. If there are many substitutions and/or the working string is lengthy, this will be inefficient. re.sub() should process the working string once, but has some setup overhead.

mhawke 2009-07-24 03:13:11

You can easily cache the compiled regex.

Glenn Maynard 2009-07-24 03:37:14

Python already maintains a cache of compiled regexes. `fgrep cache your_python_directory/Lib/re.py`

John Machin 2009-07-24 03:45:59

@mhawke: re.sub steps through each position in the text and tests if a "find" matches at that position -- no automaton. Time is O((size of text)*(number of "finds")*(average size of a "find")). Multiple uses of str.replace(): same. However: str.replace uses a Boyer-Moore variant to skip rapidly through text but traverses text multiple times, possibly trashing memory cache, and will chop up memory because it creates a new replacement string each time a "find" has to be replaced. re.sub traverses the text once with no skipping and creates repl string only once. re.sub prob winner; do benchmarks.

John Machin 2009-07-24 04:11:55

Thanks, John. Does this mean anything to you? http://bit.ly/14E983 :)

cpharmston 2009-07-24 04:24:17

May be you can replace r assignment with r = re.compile('|'.join(re.split(re.compile(', +'),x)))

siri 2009-07-24 04:32:38

@cpharmston: Of course. It's my pi formula. What's the relevance?

John Machin 2009-07-24 07:08:39

Answer 3

A:

s = reduce(lambda x, repl: str.replace(x, *repl), lst, s)

Glenn Maynard 2009-07-24 03:05:45

Answer 4

+1 A:

A couple notes:

The boilerplate argument about premature optimization, benchmarking, bottlenecks, 100 is small, etc.
There are cases where the different solutions will return different results. if y = [('one', 'two'), ('two', 'three')] and x = 'one' then mhawke's solution gives you 'two' and Unknown's gives 'three'.
Testing this out in a silly contrived example mhawke's solution was a tiny bit faster. It should be easy to try it with your data though.

jtb 2009-07-24 03:40:13

If you'd be willing to oblige me, I'd love to hear the boilerplate argument. I'm still a noob at heart :)

cpharmston 2009-07-24 03:49:06

+1 for pointing out that the results differ. This is important!

mhawke 2009-07-24 04:14:24

@cpharmston: Whenever questions like this come up, on StackOverflow or elsewhere, somebody is bound to complain that performance isn't the most important consideration unless it's actually the case that this function is a bottleneck in your application. They'll suggest that it's better to worry about edge case correctness, readability/maintainability or even programming time. I hate to be that guy (off the clock) since this discussion is interesting, but since this one is a wash asymptotically and there were correctness differences I figured I'd at least hint at the argument.

jtb 2009-07-24 04:55:05

Search StackOverflow for "premature optimization" to see plenty of discussion about those issues; even including a rather meta question about how to respond to people whining about premature optimization :-) http://stackoverflow.com/questions/438158/how-should-we-handle-premature-optimization-discussion-in-optimization-questions

jtb 2009-07-24 05:01:47

Ah, thanks jtb. The argument against worrying about performance makes sense, but I have two counterpoints: (1) This will eventually end up being code an a reusable Django app, so in this case, it's probably prudent to plan for the worst case scenario. (2) Part of the reason I made it an academic exercise as I did was to help me learn Python. Even if the pragmatic difference is negligible, there is usually a right way to do things. The discussion (or disagreement) about the right way will usually lead to enlightening conversations, as this was. Thanks everyone!

cpharmston 2009-07-24 05:14:01

ansaurus

tags:

views:

answers:

Iterative find/replace from a list of tuples in Python

related questions