I'm a Python beginner, and I have a UTF-8 problem.

I have a UTF-8 string and I would like to replace all German umlauts with ASCII replacements (in German, u-umlaut 'ü' may be rewritten as 'ue').

u-umlaut has Unicode code point 252, so I tried this:

>>> str = unichr(252) + 'ber'
>>> print repr(str)
u'\xfcber'
>>> print repr(str).replace(unichr(252), 'ue')
u'\xfcber'

I expected the last string to be u'ueber'.

What I ultimately want to do is replace all u-umlauts in a file with 'ue':

import sys
import codecs      
f = codecs.open(sys.argv[1],encoding='utf-8')
for line in f: 
    print repr(line).replace(unichr(252), 'ue')

Thanks for your help! (I'm using Python 2.3.)

+4  A: 

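repr(str) returns a quoted version of str that, when printed, is something you could type back into Python to get the original string. So it is a string that literally contains the characters \xfcber (a backslash, then x, f, c, and so on), not a string that contains über. There is no ü in it for replace to find, which is why your call returns it unchanged.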

You can just use str.replace(unichr(252), 'ue') to replace the ü with ue.

If you need to get a quoted version of the result of that, though I don't believe you should need it, you can wrap the entire expression in repr:

repr(str.replace(unichr(252), 'ue'))
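
To make the difference concrete, here is a minimal sketch. It assumes Python 3, where unichr is spelled chr and ascii() plays the role that repr() plays for unicode strings in Python 2; the variable names are only for illustration.

```python
# Python 3 sketch; ascii() stands in for Python 2's repr() on a unicode string.
s = chr(252) + 'ber'        # the real string: u-umlaut followed by 'ber'
quoted = ascii(s)           # quoted form: contains a backslash, 'x', 'f', 'c'

# The quoted form has no u-umlaut character in it, so nothing is replaced.
unchanged = quoted.replace(chr(252), 'ue')

# Replacing on the string itself works as expected.
fixed = s.replace(chr(252), 'ue')

print(unchanged)
print(fixed)
```

The first print shows the escaped form untouched; the second shows the replaced string.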
Brian Campbell
A: 

I think it's easier and clearer to do it in a more straightforward way, writing 'ü' directly as a unicode literal rather than using unichr(252).

>>> s = u'über'
>>> s.replace(u'ü', 'ue')
u'ueber'

There's no need to use repr, since it gives the 'Python representation' of the string; you just need to print the readable string.

You will also need to include the following line at the beginning of the .py file, if it's not already present, to declare the encoding of the file:

#-*- coding: UTF-8 -*-

Added: Of course, the declared coding must be the same as the actual encoding of the file. Please check that, as there can be problems (I had problems with Eclipse on Windows, for example, as it saves files as cp1252 by default). It should also match the encoding your system uses, which could be utf-8, latin-1 or others.
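
As a rough illustration of why the declaration and the real file encoding have to agree, this sketch shows what happens to the bytes of 'ü' when they are read with the wrong codec. It is written with escapes so the snippet itself is pure ASCII; the variable names are made up for the example.

```python
# Sketch: the escape below keeps this snippet pure ASCII.
u_umlaut = u'\u00fc'                        # u-umlaut

utf8_bytes = u_umlaut.encode('utf-8')       # two bytes: 0xC3 0xBC
latin1_bytes = u_umlaut.encode('latin-1')   # one byte: 0xFC

# UTF-8 bytes read as cp1252 turn into two unrelated characters.
misread = utf8_bytes.decode('cp1252')

# Latin-1 bytes read as UTF-8 are not valid input at all.
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(repr(misread))
print(decode_failed)
```

Either mismatch means u'ü' in the source no longer stands for a single u-umlaut character, so the replace silently matches nothing or the file fails to load.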


Also, don't use str as a variable name, as it is the name of a built-in Python type. Shadowing it could cause problems later.

(I tried this on Python 2.6; I think the result is the same in Python 2.3.)
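
A small sketch of the kind of problem meant here. The function name shadow_demo is hypothetical and only keeps the shadowing local to the example.

```python
def shadow_demo():
    """Show what happens once the name str refers to a string value."""
    str = u'\xfcber'    # this local name now hides the built-in type
    try:
        # Intended as a conversion, but str is now a string, not a type.
        return str(42)
    except TypeError as exc:
        return 'TypeError: %s' % exc

print(shadow_demo())
```

Picking a name such as s or text avoids the issue.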

Khelben
An important point: actually save the source file in the encoding declared in the coding declaration. Many people forget that.
Mark Tolonen
You have to actually be able to edit this as UTF-8. If your whole system is set to Latin-1, for example, it won't work, whatever encoding declaration is in the file. Which means that the original way of using unichr is better if you can't guarantee UTF-8 use.
jae
I've added some info in response to the comments. @mark: Good point, I've had some problems with Eclipse for that reason. @jae: I think the best way would be to declare a consistent encoding and then write the character as u'ü'. Writing it as the Unicode code point number seems quite confusing in code. I agree that it can be unavoidable in unusual situations, but in my experience it is usually easy to use the proper encoding.
Khelben
A: 

You can avoid all that source-file encoding stuff and its problems. Use the Unicode names; then it's screamingly obvious what you are doing, and the code can be read and modified anywhere.

I don't know of any language where the only accented Latin letter is lower-case-u-with-umlaut-aka-diaeresis, so I've added code to loop over a table of translations under the assumption that you'll need it.

# coding: ascii

translations = (
    (u'\N{LATIN SMALL LETTER U WITH DIAERESIS}', u'ue'),
    (u'\N{LATIN SMALL LETTER O WITH DIAERESIS}', u'oe'),
    # et cetera
    )

test = u'M\N{LATIN SMALL LETTER O WITH DIAERESIS}ller von M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen'

out = test
for from_str, to_str in translations:
    out = out.replace(from_str, to_str)
print out

output:

Moeller von Muenchen
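
The same table can be applied to the file from the question. This is only a sketch: transliterate and transliterate_file are hypothetical helper names, and it assumes Python 2.6+ or 3 for io.open and the with statement (on Python 2.3, codecs.open as used in the question does the decoding instead).

```python
import io

TRANSLATIONS = (
    (u'\N{LATIN SMALL LETTER U WITH DIAERESIS}', u'ue'),
    (u'\N{LATIN SMALL LETTER O WITH DIAERESIS}', u'oe'),
    # extend as needed
)

def transliterate(text, table=TRANSLATIONS):
    """Apply each (from_str, to_str) pair to text, in table order."""
    for from_str, to_str in table:
        text = text.replace(from_str, to_str)
    return text

def transliterate_file(path, encoding='utf-8'):
    """Decode the file at path and return its lines with replacements applied."""
    with io.open(path, encoding=encoding) as f:
        return [transliterate(line) for line in f]
```

Note that the replace is called on each decoded line itself, not on its repr.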
John Machin