ansaurus

Question

Answer 1

+2 A:

Because that's the two-byte sequence (0xC3 0x9C) for the character Ü (U+00DC) when it is encoded in UTF-8:

>>> '\xc3\x9c'.decode('utf-8')
u'\xdc'
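
To see where those two bytes come from, here is a minimal sketch that builds them by hand. It assumes Python 3 (unlike the Python 2 session above), and the variable names are only illustrative:

```python
# Python 3 sketch: build the two-byte UTF-8 form of U+00DC by hand.
code_point = 0xDC  # U+00DC, LATIN CAPITAL LETTER U WITH DIAERESIS

# Two-byte layout is 110xxxxx 10xxxxxx, i.e. 11 payload bits in total.
lead = 0b11000000 | (code_point >> 6)          # marker 110 + top 5 bits
cont = 0b10000000 | (code_point & 0b00111111)  # marker 10 + low 6 bits

manual = bytes([lead, cont])

# The hand-built bytes match what the built-in encoder produces.
assert manual == '\u00dc'.encode('utf-8')
print(manual)  # b'\xc3\x9c'
```

The lead byte carries the high bits of the code point and the continuation byte carries the low six, which is why a single character above 0x7F shows up as more than one byte.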

Ignacio Vazquez-Abrams 2010-02-10 08:40:55

Yeah, exactly, but I'm wondering why it is encoded as a two-byte character.

sasayins 2010-02-10 08:48:45

Because its code point (0xDC) is between 0x80 and 0x07FF, and UTF-8 encodes that range as 2 bytes. Code points below 0x80 encode to 1 byte; code points above 0x07FF encode to 3 or 4 bytes.
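
A rough sketch of those ranges, assuming Python 3; `utf8_length` is a hypothetical helper written for illustration, not a standard library function, and it ignores the surrogate range, which is not valid in UTF-8:

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for a given code point."""
    if code_point < 0x80:
        return 1  # ASCII range
    if code_point < 0x800:
        return 2  # 0x80 to 0x07FF, where U+00DC falls
    if code_point < 0x10000:
        return 3  # rest of the Basic Multilingual Plane
    return 4      # supplementary planes, up to 0x10FFFF

# Cross-check the helper against Python's own encoder.
for cp in (0x41, 0xDC, 0x20AC, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode('utf-8'))
```

The sample code points are arbitrary picks, one from each range.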

Ignacio Vazquez-Abrams 2010-02-10 08:51:46

Answer 2

+1 A:

From Wikipedia:

"UTF-8 encodes each character (code point) in 1 to 4 octets (8-bit bytes), with the single octet encoding used only for the 128 US-ASCII characters."

Tanner 2010-02-10 08:54:36

UTF-8 character change to two character
