ansaurus

Question

Emacs, xterm, mousepad, C, Unicode and UTF-8: Trying to make sense of it all

Answer 1

+6 A:

Ok, so your problem here is due to mixing old-school C library calls (getc, printf %c) and UTF-8. Your code is correctly reading the three bytes which make up '€' - 226, 130 and 172 in decimal - but these values individually are not valid UTF-8 encoded characters.

If you look at the UTF-8 encoding, integer values 0..127 are the encodings for the original US-ASCII character set. However, bytes in 128..255 (i.e. all of yours) only appear as part of a multibyte UTF-8 sequence, and so don't correspond to a valid UTF-8 character individually.

In other words, the single byte '226' doesn't mean anything on its own (it's the prefix for a 3-byte character - as expected). The printf call prints it as a single byte, which is invalid in the UTF-8 encoding, so each program copes with the invalid value in its own way.

Assuming you just want to 'see' what bytes a UTF-8 character is made of, I suggest you stick to the integer output you already have (or maybe use hex if that is more sensible) - as your >127 bytes aren't valid UTF-8 on their own, you're unlikely to get consistent results across different programs.

Dave Rigby 2009-07-17 22:22:52

Excellent answer, thank you. Now, as you correctly pointed out, how do I show the resulting character in xterm?

Dervin Thunk 2009-07-17 23:49:22

@Dervin: As I implied, there isn't a valid 'resulting character' - 226, 130 and 172 as single bytes don't make sense in UTF-8 - they aren't even printable characters in basic US-ASCII. Why exactly do you want to print them out?

Dave Rigby 2009-07-18 00:05:47

Try the techniques outlined at www.i18nguy.com/unicode/c-unicode.html to use Unicode with plain C.

DaveE 2009-07-20 04:45:10

Answer 2

+2 A:

The UTF-8 encoding says that the three bytes together in a string form the euro sign, or '€'. But single bytes, like the ones produced by your C program, don't make sense in a UTF-8 stream. That is why they are replaced with the U+FFFD "REPLACEMENT CHARACTER", or '�'.

Emacs is smart: it knows that the single bytes are invalid data for the output stream, and replaces each one with a visible escape representation of the byte. ~~Mousepad output is really broken, I can't make any sense of it.~~ Mousepad is falling back to the CP1252 Windows codepage, where the individual bytes represent characters. The "comma" is not a comma, it is a low curved quote.

Juliano 2009-07-17 22:24:45

Mosepad is probably falling back to the standard Code Page. In CP1252 (Windows), decimal 130 = '‚', decimal 226 = 'â', and decimal 172 = '¬'.

DaveE 2009-07-17 22:41:38

s/Mosepad/Mousepad/sheesh!

DaveE 2009-07-17 22:42:25

Re. Mousepad being broken: Firefox renders a text file with the same chars in the same way Mousepad does. Any clue there?

Dervin Thunk 2009-07-17 22:45:22

DaveE 2009-07-17 23:07:49

DaveE: So... THAT's why... It is not a comma, it is a special quote. I was wondering where it got a comma (0x2C) from, and I presumed that Mousepad was broken. I didn't know that Windows had an encoding in which 0x82 (130) looks so much like a comma.

Juliano 2009-07-17 23:41:29

Dervin: Firefox is falling back to the CP1252 codepage, as mentioned by DaveE.

Juliano 2009-07-17 23:43:49

Answer 3

+1 A:

The first thing you posted:

Character: � Integer: 226
Character: �, Integer: 130
Character: �, Integer: 172

is the "correct" answer. When you print character 226 and the terminal expects utf8, there is nothing the terminal can do; you gave it invalid data. The sequence "226" "space" is an error. The � character is a nice way of showing you that there is malformed data somewhere.

If you want to replicate your second example, you need to properly encode the character.

Imagine two functions: decode, which takes a character encoding and an octet stream and produces a list of characters; and encode, which takes an encoding and a list of characters and produces an octet stream. encode/decode should be reversible when your data is valid: encode( 'utf8', decode( 'utf8', "..." ) ) == "...".

Anyway, in the second example, the application ("mousepad?") is treating each octet in the three octet representation of the euro character as an individual latin1 character. It gets the octet, decodes it from latin-1 to some internal representation of a "character" (not octet or byte), and then encodes that character as utf8 and writes that to the terminal. That's why it works.

If you have GNU Recode, try this:

$ recode latin1..utf8
<three-octet representation of the euro character> <control-D>
â¬

What this did was treat each octet of the utf-8 representation as a latin1 character, and then convert each of those characters into something your terminal can understand. Perhaps running this through hd makes it clearer:

$ cat | hd
€
00000000  e2 82 ac 0a               |....|
00000004

As you can see, it's 3 octets for the utf-8 representation of the character, and then a newline.

Running through recode:

$ recode latin1..utf8 | hd
€
00000000  c3 a2 c2 82 c2 ac 0a      |.......|
00000007

This is the utf-8 representation of the "latin1" input string; something your terminal can display. The idea is that if you output the octets e2 82 ac to your terminal, you'll see the euro sign. If you output a lone octet such as e2, you get nothing; that's not valid. Finally, if you output c3 a2 c2 82 c2 ac, you get the "garbage" that is the "utf-8 representation" of the character.

If this seems confusing, it is. You should never have to worry about the internal representation like this; if you are working with characters and you need to print them to a utf-8 terminal, you always have to encode to utf-8. If you are reading from a utf-8 encoded file, you need to decode the octets into characters before processing them in your application.

jrockway 2009-07-18 00:22:48
