This Python script transliterates Russian letters:

s = u'Код Обмена Информацией, 8 бит'.encode('koi8-r')
print ''.join([chr(ord(c) & 0x7F) for c in s]) # kOD oBMENA iNFORMACIEJ, 8 BIT

That works, but I want to modify it to take user input. I'm stuck at this:

s = raw_input("Enter a string you want to translit: ")

s = unicode(s)
s = s.encode('koi8-r')

print ''.join([chr(ord(c) & 0x7F) for c in s])

I end up with this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

What's wrong?

+2  A: 

s = unicode(s) decodes with the ASCII codec by default, so it fails on the first non-ASCII byte. You need to supply the encoding your input is in, e.g. s = unicode(s, 'utf-8').
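
As a minimal sketch of the point above, written for Python 3 (an assumption; the thread uses Python 2, where the equivalent calls are unicode(raw) and unicode(raw, 'utf-8')): decoding non-ASCII bytes as ASCII fails, while naming the real encoding works. The sample bytes assume a UTF-8 terminal, which the question does not confirm.

```python
# Python 3 sketch. The bytes below simulate what a UTF-8 terminal
# would deliver (an assumption, not taken from the question).
raw = 'Код'.encode('utf-8')

try:
    raw.decode('ascii')          # same failure mode as unicode(s) in Python 2
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

text = raw.decode('utf-8')       # explicit encoding, like unicode(s, 'utf-8')

print(ascii_ok)                  # False
print(text == 'Код')             # True
```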

laalto
That's very sad, btw. They should've used the locale default.
alamar
Oh, I don't know @alamar - I find any time I'm using or talking to anyone about character encodings, failure to be explicit on both ends causes problems, and eventually there's an edge case where you have to supply the information anyhow - better to train people to do it all the time! :-)
Blair Conrad
Well, the docs also don't specify what the default is - even worse.
alamar
+1  A: 

Try unicode(s, encoding), where encoding is whatever encoding your terminal uses.
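
One way this could look, sketched in Python 3 rather than the Python 2 of the question: the hypothetical helper translit below takes raw terminal bytes plus an optional encoding, and falls back to sys.stdin.encoding and then UTF-8 when none is given. The helper name and the fallback order are assumptions for illustration only.

```python
import sys

def translit(raw, encoding=None):
    """Decode terminal bytes, re-encode as KOI8-R, and clear the high bit."""
    # Hypothetical fallback chain: explicit argument, stdin encoding, UTF-8.
    encoding = encoding or getattr(sys.stdin, 'encoding', None) or 'utf-8'
    koi8 = raw.decode(encoding).encode('koi8-r')
    # Iterating over bytes yields ints in Python 3, so no ord() is needed.
    return ''.join(chr(b & 0x7F) for b in koi8)

sample = 'Код Обмена Информацией, 8 бит'
print(translit(sample.encode('utf-8'), 'utf-8'))    # kOD oBMENA iNFORMACIEJ, 8 BIT
print(translit(sample.encode('cp1251'), 'cp1251'))  # same result from a cp1251 terminal
```

Passing the encoding explicitly keeps the result independent of the machine the script runs on.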

alamar
What's your terminal encoding?
alamar
A: 

Looking at the error messages that you are seeing, it seems to me that your terminal encoding is probably set to KOI8-R, in which case you don't need to perform any decoding on the input data. If this is the case then all you need is:

>>> s = raw_input("Enter a string you want to translit: ")
>>> print ''.join([chr(ord(c) & 0x7F) for c in s])
kOD oBMENA iNFORMACIEJ, 8 BIT

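You can double-check this by calling s.decode('koi8-r'), which should return the equivalent unicode string. Note that KOI8-R assigns a character to every byte value, so the call succeeds on any input; what matters is that the result reads as the text you typed.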
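
A small Python 3 sketch of that check, with the terminal input simulated by encoding a known string (an assumption, since the real bytes depend on the asker's terminal). It strips the high bit from KOI8-R bytes directly, then compares a KOI8-R decode of the right and the wrong byte sequence.

```python
original = 'Код Обмена Информацией, 8 бит'

# Simulated input from a KOI8-R terminal: no decoding step is needed.
raw = original.encode('koi8-r')
result = ''.join(chr(b & 0x7F) for b in raw)
print(result)                                # kOD oBMENA iNFORMACIEJ, 8 BIT

# Decoding the KOI8-R bytes gives back the typed text.
round_trip = raw.decode('koi8-r')
print(round_trip == original)                # True

# UTF-8 bytes also decode without an error, but the text is wrong.
wrong = original.encode('utf-8').decode('koi8-r')
print(wrong == original)                     # False
```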

mhawke