views:

1631

answers:

4

In python:

u'\u3053\n'

Is it utf-16?

I'm not really familiar with Unicode and encodings, but values like this keep coming up in my dataset, for example a = u'\u3053\n'.

print raises an exception, and so does decoding.

a.encode("utf-16") > '\xff\xfeS0\n\x00'
a.encode("utf-8") > '\xe3\x81\x93\n'

print a.encode("utf-8") > πüô
print a.encode("utf-16") >  ■S0

What's going on here?

+1  A: 

Here's the Unicode HowTo Doc for Python 2.6.2:

http://docs.python.org/howto/unicode.html

Also see the links in the Reference section of that document for other explanations, including one by Joel Spolsky.

Anon
+3  A: 

It's a Unicode character that doesn't seem to be displayable in your terminal's encoding. print tries to encode the unicode object in the encoding of your terminal, and if this can't be done you get an exception.

On a terminal that can display utf-8 you get:

>>> print u'\u3053'
こ

Your terminal doesn't seem to be able to display UTF-8; otherwise at least the print a.encode("utf-8") line would produce the correct character.
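
A minimal sketch of what print runs into. It assumes the console uses code page 437; the question does not name the code page, so that is an assumption, although the garbled output shown there is consistent with it. The variable names are illustrative, and the code is written to run on both Python 2 and Python 3.

```python
a = u'\u3053\n'

# Assumption: the console encoding is cp437 (not stated in the question).
console_encoding = 'cp437'

# print has to encode the unicode object for the console. cp437 has no
# HIRAGANA LETTER KO, so the encode step raises UnicodeEncodeError.
try:
    a.encode(console_encoding)
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# Printing the UTF-8 bytes instead sends three raw bytes, which a cp437
# console would interpret as three separate characters.
mojibake = a.encode('utf-8').decode(console_encoding)

# A lossy fallback: substitute '?' for anything the console cannot show.
fallback = a.encode(console_encoding, 'replace')
```

Under that assumption, mojibake holds the same three characters (πüô) that appear in the question's output.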

sth
Thanks, yes. PowerShell, and even PowerShell ISE, doesn't seem "compatible" (for lack of a better understanding) with Unicode in Python. http://stackoverflow.com/questions/2105022/unicode-in-powershell-with-python-alternative-shells-in-windows
8steve8
A: 

Character U+3053 "HIRAGANA LETTER KO".
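
As a quick check, the standard unicodedata module can look the name up. This is only a sketch; the variable names are illustrative. It also shows that u'\u3053' is a code point rather than an encoding such as UTF-16.

```python
import unicodedata

ch = u'\u3053'

# Official Unicode name of the character.
name = unicodedata.name(ch)

# The code point written in the usual U+XXXX notation.
code_point = u'U+%04X' % ord(ch)
```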

The \xff\xfe bit at the start of the UTF-16 binary format is the encoded byte order mark (U+FEFF), then "S0" is \x53\x30 (the character U+3053 itself), then \n\x00 is the \n from the original string. (Each of the characters has its bytes "reversed" as it's using little endian UTF-16 encoding.)
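
The byte layout can be checked with the explicit-endianness codecs. This is a sketch with illustrative variable names; it avoids the plain "utf-16" codec in the comparison because that codec uses the machine's native byte order.

```python
import codecs

a = u'\u3053\n'

# Explicit little-endian encoding: no BOM, low byte of each code unit first.
le = a.encode('utf-16-le')

# Explicit big-endian encoding, for comparison.
be = a.encode('utf-16-be')

# The plain 'utf-16' codec prepends a BOM; on a little-endian machine its
# output equals the little-endian BOM followed by the bytes above.
with_bom_le = codecs.BOM_UTF16_LE + le
```

Here le is the four bytes 53 30 0a 00, and the first two happen to be the ASCII codes for "S" and "0", which is why they show up that way in the repr.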

The UTF-8 form represents the same Hiragana character in three bytes, with the bit pattern as documented here.
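
To make the three-byte pattern concrete, here is a sketch that builds the bytes by hand for this one code point and compares them with the built-in codec. The layout 1110xxxx 10xxxxxx 10xxxxxx is the one UTF-8 uses for code points from U+0800 to U+FFFF; the variable names are illustrative.

```python
cp = 0x3053  # code point of the character

# Three-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx
b1 = 0xE0 | (cp >> 12)           # top 4 bits of the code point
b2 = 0x80 | ((cp >> 6) & 0x3F)   # middle 6 bits
b3 = 0x80 | (cp & 0x3F)          # low 6 bits

# bytes(bytearray(...)) gives a byte string on both Python 2 and Python 3.
manual = bytes(bytearray([b1, b2, b3]))
builtin = u'\u3053'.encode('utf-8')
```

Both manual and builtin come out as \xe3\x81\x93, matching the a.encode("utf-8") result in the question.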

Now, as for whether you should really have it in your data set... where is this data coming from? Is it reasonable for it to have Hiragana characters in it?

Jon Skeet
+2  A: 
Alex Martelli