If this were PHP, I would probably do something like this:

function no_more_half_widths($string){
  // full-width digits to look for
  $foo = array('１','２','３','４','５','６','７','８','９','０');
  // their half-width (ASCII) replacements
  $bar = array('1','2','3','4','5','6','7','8','9','0');
  return str_replace($foo, $bar, $string);
}

I have tried the .translate function in Python, but it complains that the two arrays are not the same size. I assume this is because the individual characters are encoded in UTF-8, so each full-width character occupies several bytes. Any suggestions?

+3  A: 

I don't think there's a built-in function to do multiple replacements in one pass, so you'll have to do it yourself.

One way to do it:

>>> src = (u'１',u'２',u'３',u'４',u'５',u'６',u'７',u'８',u'９',u'０')
>>> dst = ('1','2','3','4','5','6','7','8','9','0')
>>> string = u'a１２３'
>>> for i, j in zip(src, dst):
...     string = string.replace(i, j)
... 
>>> string
u'a123'

Or using a dictionary:

>>> trans = {u'１': '1', u'２': '2', u'３': '3', u'４': '4', u'５': '5', u'６': '6', u'７': '7', u'８': '8', u'９': '9', u'０': '0'}
>>> string = u'a１２３'
>>> for i, j in trans.iteritems():
...     string = string.replace(i, j)
...     
>>> string
u'a123'

Or finally, using regex (and this might actually be the fastest):

>>> import re
>>> trans = {u'１': '1', u'２': '2', u'３': '3', u'４': '4', u'５': '5', u'６': '6', u'７': '7', u'８': '8', u'９': '9', u'０': '0'}
>>> lookup = re.compile(u'|'.join(trans.keys()), re.UNICODE)
>>> string = u'a１２３'
>>> lookup.sub(lambda x: trans[x.group()], string)
u'a123'
Max Shawabkeh
+3  A: 

Using the unicode.translate method:

>>> table = dict(zip(map(ord,u'０１２３４５６７８９'),map(ord,u'0123456789')))
>>> print u'１２３'.translate(table)
123

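It requires a mapping keyed by code points (integers), not by characters. Also, writing the strings as u'unicode literals' keeps them as code points instead of UTF-8-encoded bytes, so each digit is a single element on both sides.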
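
For readers on Python 3, here is a minimal sketch of the same idea. It assumes `str.maketrans` is used to build the code-point table; the names `FULLWIDTH_DIGITS`, `ASCII_DIGITS`, `DIGIT_TABLE` and `to_ascii_digits` are made up for illustration and are not part of the answer above.

```python
# Python 3 sketch (an assumption: the thread itself uses Python 2).
# str.maketrans builds the {code point: code point} dict that
# str.translate expects, so no manual ord() calls are needed.

# Full-width digits U+FF10..U+FF19, built from code points so the
# source file stays pure ASCII.
FULLWIDTH_DIGITS = "".join(chr(cp) for cp in range(0xFF10, 0xFF1A))
ASCII_DIGITS = "0123456789"

# Both arguments must have the same length (10 characters each here).
DIGIT_TABLE = str.maketrans(FULLWIDTH_DIGITS, ASCII_DIGITS)


def to_ascii_digits(text):
    """Replace full-width digits with ASCII digits; leave other characters alone."""
    return text.translate(DIGIT_TABLE)


print(to_ascii_digits("a\uff11\uff12\uff13"))
```

The table is keyed by integers, e.g. `DIGIT_TABLE[0xFF10]` is `ord("0")`, which is the same shape as the `dict(zip(map(ord, ...)))` table above.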

jleedev
Nice! I didn't know `unicode` had a `translate()` method different from pure `str`, though in retrospect it makes perfect sense.
Max Shawabkeh
+4  A: 

The built-in unicodedata module can do it:

>>> import unicodedata
>>> foo = u'１２３４５６７８９０'
>>> unicodedata.normalize('NFKC', foo)
u'1234567890'

Note that it also normalizes all sorts of other things at the same time, like separate accent marks and Roman numeral symbols.
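
To make that side effect concrete, here is a small Python 3 sketch (an assumption, since the snippet above is Python 2) that runs NFKC over a few hand-picked sample characters. The `samples` and `results` names are illustrative only.

```python
import unicodedata

# Hand-picked samples, written as escapes so the file stays ASCII.
samples = {
    "full-width digits": "\uff11\uff12\uff13",  # U+FF11..U+FF13
    "full-width letter": "\uff21",              # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A
    "roman numeral four": "\u2163",             # U+2163 ROMAN NUMERAL FOUR
    "fi ligature": "\ufb01",                    # U+FB01 LATIN SMALL LIGATURE FI
    "e + combining acute": "e\u0301",           # two code points before normalizing
}

# NFKC = compatibility decomposition followed by canonical composition.
results = {
    label: unicodedata.normalize("NFKC", text)
    for label, text in samples.items()
}

for label, normalized in results.items():
    # ascii() shows any remaining non-ASCII characters as escapes.
    print(label, "->", ascii(normalized))
```

If only the digits should change and everything else must stay byte-for-byte as typed, a targeted translation table or regex is the more conservative choice.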

Daniel Newby
A: 

Regex approach

>>> re.sub(u"[\uff10-\uff19]",lambda x:chr(ord(x.group(0))-0xfee0),u"４５６")
u'456'
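
The constant 0xfee0 is the distance between each full-width digit and its ASCII counterpart (0xFF10 - 0x30). A rough Python 3 rendering of the same one-liner, with the hypothetical names `OFFSET`, `FULLWIDTH_DIGIT` and `digits_to_ascii` introduced only for readability:

```python
import re

# Distance between U+FF10 (full-width zero) and U+0030 (ASCII zero).
OFFSET = 0xFF10 - 0x30  # equals 0xFEE0

# Character class covering the ten full-width digits U+FF10..U+FF19.
FULLWIDTH_DIGIT = re.compile("[\uff10-\uff19]")


def digits_to_ascii(text):
    """Shift every full-width digit down by OFFSET to get its ASCII form."""
    return FULLWIDTH_DIGIT.sub(
        lambda match: chr(ord(match.group(0)) - OFFSET),
        text,
    )


print(digits_to_ascii("\uff14\uff15\uff16"))
```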
S.Mark