views:

186

answers:

5

I have a unicode string in Python and basically need to go through, character by character and replace certain ones based on a list of rules. One such rule is that a is changed to ö if a is after n. Also, if there are two vowel characters in a row, they get replaced by one vowel character and :. So if I have the string "natarook", what is the easiest and most efficient way of getting "nötaro:k"? Using Python 2.6 and CherryPy 3.1 if that matters.

edit: two vowels in a row does mean the same vowels (oo, aa, ii)

+1  A: 

Given your rules, I'd say you really want a simple state machine. Hmm, on second thought, maybe not; you can just look back in the string as you go.

I have a unicode string in Python and basically need to go through, character by character and replace certain ones based on a list of rules. One such rule is that a is changed to ö if a is after n. Also, if there are two vowel characters in a row, they get replaced by one vowel character and :. So if I have the string , what is the easiest and most efficient way of getting "nötaro:k"? Using Python 2.6 and CherryPy 3.1 if that matters.

vowel_set = frozenset(['a', 'e', 'i', 'o', 'u', 'ö'])

def fix_the_string(s):
    lst = []
    for i, ch in enumerate(s):
        if ch == 'a' and lst and lst[-1] == 'n':
            lst.append('ö')
        else if ch in vowel_set and lst and lst[-1] in vowel_set:
            lst[-1] = 'a' # "replaced by one vowel character", not sure what you want
            lst.append(':')
        else
            lst.append(ch)
    return "".join(lst)

print fix_the_string("natarook")

EDIT: Now that I saw the answer by @Anon. I think that's the simplest approach. This might actually be faster once you get a whole bunch of rules in play, as it makes one pass over the string; but maybe not, because the regexp stuff in Python is fast C code.

But simpler is better. Here is actual Python code for the regexp approach:

import re
pat_na = re.compile(r'na')
pat_double_vowel = re.compile(r'([aeiou])[aeiou]')

def fix_the_string(s):
    s = re.sub(pat_na, r'nö', s)
    s = re.sub(pat_double_vowel, r'\1:', s)
    return s

print fix_the_string("natarook") # prints "nötaro:k"
steveha
Couldn't `lst` simply be a string?
RedGlyph
Instead of lst we could use a string. But strings are immutable in Python, and we can't just replace the last character; we would have to do something like `s_new = s_new[:-1] + new_ch`. The real problem with that is that the performance really sucks. Appending to a list or replacing an element is a fast operation, but appending to a string involves copying the string to a new string and then adding the new character; this gives O(N**2) performance, very bad. Recent versions of Python have optimized this specific case, at least some of the time, but this is the traditional way to do it.
steveha
@steveha: CAUTION, you are testing against the last output character (`lst[-1]`) instead of against the previous input character -- depending on what else is in the OP's full set of rules, this may be a good idea or a bad idea.
John Machin
@steveha: you've got a point, indeed there's nothing in the problem description that says it's a small string. Nice `re` alternate solution too.
RedGlyph
+2  A: 

"I know, I'll use regular expressions!"

But seriously, regexes are really good for string manipulation.

You could write one per rule, like so:

s/na/nö/g
s/([aeiou])$1/$1:/g

Or you could generate them at runtime from some other source which lists them all.

Anon.
Never seen this syntax in Python, which module are you using?
RedGlyph
That syntax is typical of the vi text editor; it expresses the idea but it's not working Python code.
steveha
This is Perl syntax for regular expressions - replace the $ signs with backslashes to get the Python syntax.
Anon.
And of course, this is just the sample expressions - the boilerplate Python needed to actually perform the replacement is additional.
Anon.
My mistake. vi wouldn't use `$1` for replacement text; it uses `\1`. I was confused for a moment there.
steveha
couldn't this cause problems when running sequentially though? What if applying one of the first rules here? what happens with the string naa ?
GSto
+2  A: 

focus on easy and correct first, then consider efficiency if profiling indicates its a bottleneck.

The simple approach is:

prev = None
for ch in string:
  if ch == 'a':
    if prev == 'n':
      ...
  prev = ch
Will
+6  A: 
# -*- coding: utf-8 -*-

def subpairs(s, prefix, suffix):
    def sub(i, sentinal=object()):
        r = prefix.get(s[i:i+2], sentinal)
        if r is not sentinal: return r

        r = suffix.get(s[i-1:i+1], sentinal)
        if r is not sentinal: return r
        return s[i]

    s = '\0'+s+'\0'
    return ''.join(sub(i) for i in xrange(1,len(s)))

vowels = [(v+v, u':') for v in 'aeiou']

prefix = {}
suffix = {'na':u'ö'}
suffix.update(vowels)
print subpairs('natarook', prefix, suffix)
# prints: nötaro:k

prefix = {'na':u'ö'}
suffix = dict(vowels)
print subpairs('natarook', prefix, suffix)
# prints: öataro:k
Shane Holloway
That's an original idea, but maybe you want to explain it to the OP ;-)
RedGlyph
That's a good point! Dispatch is a table of (previous+current) character pairs. Subpairs iterates pairwise over the string using \0 to signify the string start. Each pair is looked up in the dispatch table for the substitution to use. If nothing is found, get returns the current char "c". Algorithm is linear complexity - O(n).
Shane Holloway
Also, how could I get it to do before rules as well? So for 'na':u'ö', change 'n' to 'ö' instead of 'a'
Brandon
Ok, I incorporated prefix and suffix mappings, as well as fixed vowel sequence to what was intended by OP. Algorithm remains O(n) complexity. Note that the inner sub() method could be replaced by a composite of algorithms.
Shane Holloway
+1  A: 

It might be simpler to do with a handmade list of regular expressions, rather than progmatically gererating them. I recommend the following code.

import re
# regsubs is a dictionary of regular expressions as keys, 
# and the replacement regexps as values
regsubs = {'na':u'nö',
           '([aeiou])\\1': '\\1:'}

def makesubs(s):
    for pattern, repl in regsubs.iteritems():
        s = re.sub(pattern, repl, s)
    return s

print makesubs('natarook')
# prints: nötaro:k
ironchefpython