ansaurus

Question

Efficiently carry out multiple string replacements in Python

Answer 1

+1 A:

You may find that it is faster to create a regex and do all the replacements at once.

Also a good idea to move the replacement code out to a function so that you can memoize if you are likely to have duplicates in the list

>>> import re
>>> [re.sub('[aeiou]','',s) for s in strings if len(s) > 2]
['a', 'lst', 'of', 'strngs']


>>> def replacer(s, memo={}):
...   if s not in memo:
...     memo[s] = re.sub('[aeiou]','',s)
...   return memo[s]
... 
>>> [replacer(s) for s in strings if len(s) > 2]
['a', 'lst', 'of', 'strngs']

gnibbler 2010-07-29 23:46:59

Are you able to expand on this? Would `re.sub('[aeiou]', '', s)` replace all at the same time? If it's checking char by char, I was worried about the immutability of the Python string..

Tim McNamara 2010-07-29 23:53:34

@Tim, Yes, `[aeiou]` replaces all the vowels at once. Immutability is not a problem as you are creating new strings.

gnibbler 2010-07-29 23:56:17

@Tim The brackes `[]` are a character class, it matches any single one of them and replaces that char with the replacement string. This solution is nice if: all replacements are the same, and if you are only matching single chars.

zdav 2010-07-29 23:58:08

I didn't see your edits - you just had the initial answer about a regex to start off. The the memo arg is a nice touch! Thanks for your help.

Tim McNamara 2010-07-30 00:02:05

Answer 2

+3 A:

The specific example you give (deleting single characters) is perfect for the translate method of strings, as is substitution of single characters with single characters. If the input string is a Unicode one, then, as well as the two above kinds of "substitution", substitution of single characters with multiple character strings is also fine with the translate method (not if you need to work on byte strings, though).

If you need to replace substrings of multiple characters, then I would also recommend using a regular expression -- though not in the way @gnibbler's answer recommends; rather, I'd build the regex from r'onestring|another|yetanother|orthis' (join the substrings you want to replace with vertical bars -- be sure to also re.escape them if they contain special characters, of course) and write a simple substituting-function based on a dict.

I'm not going to offer a lot of code at this time since I don't know which of the two paragraphs applies to your actual needs, but (when I later come back home and check SO again;-) I'll be glad to edit to add a code example as necessary depending on your edits to your question (more useful than comments to this answer;-).

Edit: in a comment the OP says he wants a "more general" answer (without clarifying what that means) then in an edit of his Q he says he wants to study the "tradeoffs" between various snippets all of which use single-character substrings (and check presence thereof, rather than replacing as originally requested -- completely different semantics, of course).

Given this utter and complete confusion all I can say is that to "check tradeoffs" (performance-wise) I like to use python -mtimeit -s'setup things here' 'statements to check' (making sure the statements to check have no side effects to avoid distorting the time measurements, since timeit implicitly loops to provide accurate timing measurements).

A general answer (without any tradeoffs, and involving multiple-character substrings, so completely contrary to his Q's edit but consonant to his comments -- the two being entirely contradictory it is of course impossible to meet both):

import re

class Replacer(object):

  def __init__(self, **replacements):
    self.replacements = replacements
    self.locator = re.compile('|'.join(re.escape(s) for s in replacements))

  def _doreplace(self, mo):
    return self.replacements[mo.group()]

  def replace(self, s):
    return self.locator.sub(self._doreplace, s)

Example use:

r = Replacer(zap='zop', zip='zup')
print r.replace('allazapollezipzapzippopzip')

If some of the substrings to be replaced are Python keywords, they need to be passed in a tad less directly, e.g., the following:

r = Replacer(abc='xyz', def='yyt', ghi='zzq')

would fail because def is a keyword, so you need e.g.:

r = Replacer(abc='xyz', ghi='zzq', **{'def': 'yyt'})

or the like.

I find this a good use for a class (rather than procedural programming) because the RE to locate the substrings to replace, the dict expressing what to replace them with, and the method performing the replacement, really cry out to be "kept all together", and a class instance is just the right way to perform such a "keeping together" in Python. A closure factory would also work (since the replace method is really the only part of the instance that needs to be visible "outside") but in a possibly less-clear, harder to debug way:

def make_replacer(**replacements):
  locator = re.compile('|'.join(re.escape(s) for s in replacements))

  def _doreplace(mo):
    return replacements[mo.group()]

  def replace(s):
    return locator.sub(_doreplace, s)

  return replace

r = make_replacer(zap='zop', zip='zup')
print r('allazapollezipzapzippopzip')

The only real advantage might be a very modestly better performance (needs to be checked with timeit on "benchmark cases" considered significant and representative for the app using it) as the access to the "free variables" (replacements, locator, _doreplace) in this case might be minutely faster than access to the qualified names (self.replacements etc) in the normal, class-based approach (whether this is the case will depend on the Python implementation in use, whence the need to check with timeit on significant benchmarks!).

Alex Martelli 2010-07-29 23:58:15

Thanks for your detailed answer Alex. I was indeed looking for a more general answer. Sorry for being unclear with the question. I will edit the Q to reflect that.

Tim McNamara 2010-07-30 00:09:29

ansaurus

tags:

views:

answers:

Efficiently carry out multiple string replacements in Python

related questions