Hi,

In my text file, I used a character with a value larger than 127, for example 0xDC. I then loaded the text file onto a device and read the character back, but it had changed to the two bytes 0xC3 and 0x9C. Why did it change into two characters?

Thanks

+2  A: 

Because that's the sequence for the character when encoded in UTF-8:

>>> '\xc3\x9c'.decode('utf-8')
u'\xdc'
Ignacio Vazquez-Abrams
Yeah, exactly, but I'm wondering why it is encoded as a two-byte character.
sasayins
Because it's between 0x80 and 0x07FF. Code points below that range encode to 1 byte; code points above it encode to 3 or 4 bytes.
Ignacio Vazquez-Abrams
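To make the two-byte case concrete, here is a minimal Python 3 sketch of how a code point in the 0x80 to 0x07FF range is packed into a lead byte (110xxxxx) and a continuation byte (10xxxxxx). The function name `encode_two_byte` is made up for illustration; in practice you would just call `str.encode('utf-8')`.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in 0x80..0x7FF as two UTF-8 bytes (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    # Lead byte: the marker bits 110 followed by the top 5 bits of the code point.
    lead = 0b11000000 | (code_point >> 6)
    # Continuation byte: the marker bits 10 followed by the low 6 bits.
    trail = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, trail])


encoded = encode_two_byte(0xDC)
print(encoded.hex())  # c39c
# Cross-check against Python's built-in encoder.
print(encoded == "\u00dc".encode("utf-8"))  # True
```

For 0xDC (binary 11011100) the top bits are 00011 and the low six bits are 011100, which gives 11000011 (0xC3) and 10011100 (0x9C), the two bytes seen in the question.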
+1  A: 

From Wikipedia:

"UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters."

Tanner
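As a rough illustration of the 1 to 4 octet rule, the Python 3 sketch below prints the encoded length of one sample character from each size class. The helper name `utf8_length` and the sample characters are chosen here for illustration. The last line assumes that the original file stored 0xDC as a single Latin-1 byte and that the device re-saved the text as UTF-8; the question does not confirm this, but it would produce exactly the bytes reported.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8 (illustrative helper)."""
    return len(ch.encode("utf-8"))


# One sample per size class: ASCII, Latin-1 supplement, euro sign, emoji.
for ch in ["A", "\u00dc", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {ch.encode('utf-8').hex()}")

# Assumption: the file held the single Latin-1 byte 0xDC and was converted to UTF-8.
converted = b"\xdc".decode("latin-1").encode("utf-8")
print(converted.hex())  # c39c
```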