Disclaimer: My apologies for all the text below (for a single simple question), but I sincerely think that every bit of information is relevant to the question. I'd be happy to learn otherwise. I can only hope that, if successful, the question(s) and the answers may help others lost in Unicode madness. Here goes.

I have read all the usual highly-regarded websites about UTF-8 (this one in particular is very good for my purposes), and I've read the classics too, like those mentioned in other similar questions on SO. However, I still lack the knowledge about how to integrate it all in my virtual lab. I use Emacs with

;; Internationalization
(prefer-coding-system 'utf-8)
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)

in my .emacs, xterm started with

 LC_CTYPE=en_US.UTF-8 xterm -geometry 91x58 \
-fn '-Misc-Fixed-Medium-R-SemiCondensed--13-120-75-75-C-60-ISO10646-1'

and my locale reads:

LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

My questions are the following (some of what I observe may simply be the expected behavior of the applications, but I still need to make sense of it, so bear with me):

Supposing the following C program:

#include <stdio.h>

int main(void) {
  int c;
  while((c=getc(stdin))!=EOF) {
    if(c!='\n') {
      printf("Character: %c, Integer: %d\n", c, c);
    }
  }
  return 0;
}

If I run this in my xterm I get:

€
Character: �, Integer: 226
Character: �, Integer: 130
Character: �, Integer: 172

(in case they don't render here: the characters I get are a white question mark inside a black circle). The ints are the decimal representations of the 3 bytes needed to encode €, but I am not exactly sure why xterm does not display them properly.

Instead, Mousepad, for example, prints

Character: â, Integer: 226
Character: ,, Integer: 130 (a comma, standing for U+0082 <control>, why?!)
Character: ¬, Integer: 172

Meanwhile, Emacs displays

Character: \342, Integer: 226
Character: \202, Integer: 130
Character: \254, Integer: 172

QUESTION: The most general question I can ask is: How do I get everything to print the same character? But I am certain there will be follow-ups.
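
For reference, the three bytes can be confirmed with a tiny throwaway program (a sketch only; it assumes the source file itself is saved as UTF-8 so the string literal contains the raw bytes):

#include <stdio.h>
#include <string.h>

int main(void) {
  /* "€" as a UTF-8 string literal; print each of its bytes. */
  const char *euro = "€";
  for (size_t i = 0; i < strlen(euro); i++) {
    printf("byte %zu: 0x%02x (%d)\n",
           i, (unsigned char)euro[i], (unsigned char)euro[i]);
  }
  return 0;
}

which prints 0xe2 (226), 0x82 (130) and 0xac (172), matching the integers above.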

Thanks again, and apologies for all the text.

+6  A: 

Ok, so your problem here is due to mixing old-school C library calls (getc, printf %c) and UTF-8. Your code is correctly reading the three bytes which make up '€' - 226, 130 and 172 in decimal - but these values individually are not valid UTF-8 encoded characters.

If you look at the UTF-8 encoding, integer values 0..127 are the encodings of the original US-ASCII character set. However, 128..255 (i.e. all of your bytes) can only occur as part of a multibyte UTF-8 character, and so don't correspond to a valid UTF-8 character individually.

In other words, the single byte '226' doesn't mean anything on its own (it is the prefix of a 3-byte character, as expected). The printf call prints it as a single byte, which is invalid in the UTF-8 encoding, so each program copes with the invalid value in its own way.

Assuming you just want to 'see' what bytes a UTF-8 character is made of, I suggest you stick to the integer output you already have (or maybe use hex if that is more sensible) - as your >127 bytes aren't valid UTF-8 on their own, you're unlikely to get consistent results across different programs.
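
For example, a minimal sketch of that suggestion - the loop from the question, switched to hex/decimal byte output instead of %c:

#include <stdio.h>

int main(void) {
  int c;
  while ((c = getc(stdin)) != EOF) {
    if (c != '\n') {
      /* Show each byte numerically only; a lone byte > 127 is not a
         valid UTF-8 character, so don't try to print it with %c. */
      printf("Byte: 0x%02x (%d)\n", c, c);
    }
  }
  return 0;
}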

Dave Rigby
Excellent answer, thank you. Now, given what you pointed out, how do I show the resulting character in xterm?
Dervin Thunk
@Dervin: As I implied, there isn't a valid 'resulting character' - 226, 130 and 172 as single bytes don't make sense in UTF-8 - they aren't even printable characters in basic US-ASCII. Why exactly do you want to print them out?
Dave Rigby
Try the techniques outlined at www.i18nguy.com/unicode/c-unicode.html to use Unicode with plain C.
DaveE
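
As an illustration of the wide-character approach mentioned in the comment above (a rough sketch, not taken from that page; it assumes the program runs under a UTF-8 locale such as en_US.UTF-8):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
  /* Adopt the environment's locale so the wide-character functions know
     how to decode multibyte input and re-encode output. */
  setlocale(LC_ALL, "");
  wint_t wc;
  while ((wc = getwc(stdin)) != WEOF) {
    if (wc != L'\n') {
      /* %lc re-encodes the whole character, so the terminal receives a
         complete UTF-8 sequence and can display it. */
      wprintf(L"Character: %lc, Code point: %d\n", wc, (int)wc);
    }
  }
  return 0;
}

For '€' this prints the character itself and the single code point 8364 (U+20AC) instead of three separate bytes.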
+2  A: 

The UTF-8 encoding says that the three bytes together in a string form the euro sign, or '€'. But single bytes, like the ones produced by your C program, don't make sense in a UTF-8 stream. That is why they are replaced with the U+FFFD "REPLACEMENT CHARACTER", or '�'.

Emacs is smart: it knows that the single bytes are invalid data for the output stream, and replaces each one with a visible escape representation of the byte. Mousepad's output only looks broken: it is falling back to the CP1252 Windows codepage, where the individual bytes do represent characters. The "comma" is not a comma, it is a low curved quote.

Juliano
Mousepad is probably falling back to the standard code page. In CP1252 (Windows), decimal 130 = ',', decimal 226 = 'â', and decimal 172 = '¬'.
DaveE
Re. Mousepad being broken: Firefox renders a text file with the same chars in the same way Mousepad does. Any clue there?
Dervin Thunk
DaveE: So... THAT's why... It is not a comma, it is a special quote. I was wondering where it got a comma (0x2C) from; I presumed that Mousepad was broken. I didn't know that Windows had an encoding in which 0x82 (130) looks so much like a comma.
Juliano
Dervin: Firefox is falling back to the CP1252 codepage, as mentioned by DaveE.
Juliano
+1  A: 

The first thing you posted:

Character: �, Integer: 226
Character: �, Integer: 130
Character: �, Integer: 172

Is the "correct" answer. When you print byte 226 and the terminal expects UTF-8, there is nothing the terminal can do; you gave it invalid data. Byte 226 announces a three-byte sequence, but what follows it in your output (a comma and a space) is not a continuation byte, so the sequence is an error. The � character is a nice way of showing you that there is malformed data somewhere.

If you want to replicate your second example, you need to properly encode the character.

Imagine two functions: decode, which takes a character encoding and an octet stream and produces a list of characters; and encode, which takes an encoding and a list of characters and produces an octet stream. encode/decode should be reversible when your data is valid: encode( 'utf8', decode( 'utf8', "..." ) ) == "...".

Anyway, in the second example, the application (Mousepad?) is treating each octet in the three-octet representation of the euro character as an individual Latin-1 character. It gets the octet, decodes it from Latin-1 to some internal representation of a "character" (not an octet or byte), and then encodes that character as UTF-8 and writes that to the terminal. That's why each byte shows up as a printable (if wrong) character.
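
That decode-from-Latin-1, re-encode-as-UTF-8 step can be sketched in C with POSIX iconv (illustrative only; buffer sizes and error handling are kept to a minimum):

#include <iconv.h>
#include <stdio.h>

int main(void) {
  /* The three octets of the UTF-8 encoding of '€', deliberately
     (mis)interpreted as three Latin-1 characters. */
  char in[] = "\xe2\x82\xac";
  char out[16];
  char *inp = in, *outp = out;
  size_t inleft = sizeof in - 1, outleft = sizeof out;

  iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
  if (cd == (iconv_t)-1
      || iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
    perror("iconv");
    return 1;
  }
  iconv_close(cd);

  /* out now holds c3 a2 c2 82 c2 ac - the same bytes the recode example
     below produces. */
  fwrite(out, 1, sizeof out - outleft, stdout);
  putchar('\n');
  return 0;
}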

If you have GNU Recode, try this:

$ recode latin1..utf8
<three-octet representation of the euro character> <control-D>
â¬

What this does is treat each octet of the UTF-8 representation as a Latin-1 character, and then convert each of those characters into something your terminal can understand. Perhaps running this through hd makes it clearer:

$ cat | hd
€
00000000  e2 82 ac 0a               |....|
00000004

As you can see, it's 3 octets for the utf-8 representation of the character, and then a newline.

Running through recode:

$ recode latin1..utf8 | hd
€
00000000  c3 a2 c2 82 c2 ac 0a      |.......|
00000007

This is the UTF-8 representation of the "Latin-1" input string; something your terminal can display. The idea is: if you output e2 82 ac to your terminal, you'll see the euro sign. If you output just one of those bytes on its own, you get nothing useful; that's not valid. Finally, if you output c3 a2 c2 82 c2 ac, you get the "garbage" that is the "UTF-8 representation" of the three Latin-1 characters.

If this seems confusing, it is. You should never need to worry about the internal representation like this; if you are working with characters and you need to print them to a UTF-8 terminal, you always have to encode to UTF-8. If you are reading from a UTF-8 encoded file, you need to decode the octets into characters before processing them in your application.
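
To tie this back to the original program, here is one way to make xterm display the character (a sketch only; utf8_len is a hypothetical helper and continuation bytes are not validated): collect all the bytes of each UTF-8 sequence and write them out together, so the terminal always receives a complete, valid character.

#include <stdio.h>

/* Sequence length implied by a UTF-8 lead byte; 0 for a byte that can
   only be a continuation byte (or is otherwise invalid as a lead). */
static int utf8_len(int c) {
  if (c < 0x80) return 1;
  if (c >= 0xC2 && c <= 0xDF) return 2;
  if (c >= 0xE0 && c <= 0xEF) return 3;
  if (c >= 0xF0 && c <= 0xF4) return 4;
  return 0;
}

int main(void) {
  int c;
  while ((c = getc(stdin)) != EOF) {
    if (c == '\n') continue;
    int len = utf8_len(c);
    if (len == 0) {
      printf("Stray byte: %d\n", c);
      continue;
    }
    char buf[5] = { (char)c };
    for (int i = 1; i < len; i++) {
      int next = getc(stdin);
      if (next == EOF) return 0;
      buf[i] = (char)next;
    }
    buf[len] = '\0';
    /* Write the whole sequence at once: only the complete sequence is
       valid UTF-8, so the terminal can display '€' rather than '�'. */
    printf("Character: %s, Bytes:", buf);
    for (int i = 0; i < len; i++) {
      printf(" %d", (unsigned char)buf[i]);
    }
    printf("\n");
  }
  return 0;
}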

jrockway