Hi all!

If you are on Mac OS X 10.6, and you are familiar with character encoding AND the terminal please do this:

Open a terminal and type the following commands:

echo sÃ¸rensen > test.txt

iconv -f UTF8 -t ISO-8859-1 test.txt

You will see the output: "sørensen". Can somebody explain what is going on?

A: 

UTF-8 is a multibyte encoding. The character ø is encoded as two bytes: C3 B8. In the encoding of your terminal (ISO-8859-1) these two bytes are decoded as Ã¸. Then you convert those characters to ISO-8859-1, which gives the single-byte code for ø. Any questions?

Andrey
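The byte values mentioned in this answer can be checked from the shell. This is only a minimal sketch, assuming a POSIX shell with `iconv`, `od` and `tr` available. The character is written with octal escapes (`\303\270` is C3 B8) so the result does not depend on how the terminal or the script file is encoded.

```shell
# Emit the UTF-8 encoding of ø (bytes C3 B8) and dump it as hex.
utf8_hex=$(printf '\303\270' | od -An -tx1 | tr -d ' \n')

# Convert the same two bytes to ISO-8859-1, where ø is a single byte.
latin1_hex=$(printf '\303\270' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')

echo "UTF-8:      $utf8_hex"
echo "ISO-8859-1: $latin1_hex"
```

Run as written, this should print `c3b8` for UTF-8 and `f8` for ISO-8859-1.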
Thanks a lot for your reply. I am trying to understand this, so please bear with me :-) So "Ã¸", when typed into the terminal, is treated as ISO-8859-1? And when it is converted from UTF-8 to ISO-8859-1, the terminal can suddenly display it correctly because the two characters are read as UTF-8 and converted to ISO-8859-1. Does that mean that if the terminal was set to UTF-8, the two characters would be displayed as "ø"?
jriff
I just checked and my terminal is set to UTF-8.
jriff
If my terminal is set to UTF-8 and I cat the file, shouldn't I be getting "ø" and not "Ã¸"?
jriff
If your terminal is set to UTF-8 you should see ø. But I am sure that something is still configured wrong. The conversion Ã¸ -> bytes -> UTF-8 gave me EXACTLY ø.
Andrey
Can the file itself have an encoding?
jriff
And how do you convert it to bytes?
jriff
I used Notepad++, which can switch encodings. A file doesn't have an encoding itself. There is the concept of a BOM (byte order mark) that can indicate the encoding, and some formats like XML have an internal declaration of the encoding.
Andrey
And that BOM or internal specification tells the software reading the file how to read the bytes in the file. What if there is none? Trial and error?
jriff
Actually, on Mac OS X the file *can* have an extended attribute (com.apple.TextEncoding) that can be used by the application creating the file to record the encoding. BUT, the command-line level utilities like "cat" and "echo" don't do anything with it, so this probably isn't relevant to your situation. You can view the attribute (if present) with a command like "xattr -p com.apple.TextEncoding test.txt". Applications that support this attribute include TextEdit, Safari, and Mail.app.
David Gelhar
It depends on the software. Custom file formats usually carry the encoding, and software that works with them can use it. In the case of a plain text file, yes, it is possible not to know the encoding. You open it in an advanced text editor and pick one.
Andrey
http://en.wikipedia.org/wiki/Byte_order_mark
Andrey
Thanks a lot for your help - I knew about the extended attribute but as you say - it means nothing to the utilities. I am just wondering how to write software that accepts an arbitrary encoding. I have always had the luxury of knowing the encoding so this is a whole new consideration for me :-|
jriff
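For the "trial and error" case discussed in these comments, one common heuristic is to trust a BOM if there is one, otherwise test whether the bytes are valid UTF-8, and only then fall back to a single-byte encoding. The sketch below illustrates that idea. The function name `guess_encoding` and the ISO-8859-1 fallback are assumptions for illustration, not something any of the tools in this thread provide, and the result is a guess, not a proof.

```shell
# guess_encoding FILE: print a best-guess label for FILE's encoding.
# Heuristic only: a UTF-8 BOM is trusted, then strict UTF-8 decoding is tried,
# and anything else is assumed (not proven) to be ISO-8859-1.
guess_encoding() {
    # First three bytes as hex, e.g. "efbbbf" for a UTF-8 BOM.
    bom=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
    if [ "$bom" = "efbbbf" ]; then
        echo "UTF-8 (BOM)"
        return
    fi
    # iconv exits non-zero when the input cannot be decoded as UTF-8.
    # The target encoding does not matter here; the output is discarded.
    if iconv -f UTF-8 -t UTF-16 "$1" >/dev/null 2>&1; then
        echo "UTF-8"
    else
        echo "ISO-8859-1 (assumed)"
    fi
}
```

Pure ASCII files are reported as UTF-8, since ASCII is a subset of it. Any byte sequence is valid ISO-8859-1, so the fallback branch can never be confirmed by decoding alone.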
A: 

I tried the "iconv" command from one file to another, looking at the data with "od -txC" with the following results:

Input:  c3  83  c2  b8         [ 2 utf8-chars Capital A tilde; Cedilla ]

Command: iconv -f utf-8 -t ISO-8859-1 < in.txt > out.txt

Output:  c3  b8    [ 2 ISO-8859-1 characters, Capital A tilde; Cedilla ]

So, the iconv conversion is correct.

But, if you instead treat the converted data as utf-8 (which Terminal is apparently doing), C3-B8 is "ø" (o-slash).

If you change your character encoding in Terminal (Preferences // Advanced // Character Encoding) to "Western (ISO Latin 1)" you'll see C3-B8 as "Ã¸".

David Gelhar
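The experiment in this answer can be reproduced without typing any non-ASCII characters. This is a sketch, assuming a POSIX-style shell with `mktemp`, `iconv` and `od`. The scratch directory and the variable names are mine, and `od -An -tx1` is used as a portable spelling of the hex dump.

```shell
# Work in a scratch directory so no existing files are touched.
dir=$(mktemp -d)

# in.txt holds "Ã¸" encoded as UTF-8: C3 83 (Ã) followed by C2 B8 (¸).
printf '\303\203\302\270' > "$dir/in.txt"

# Same conversion as in the answer.
iconv -f utf-8 -t ISO-8859-1 < "$dir/in.txt" > "$dir/out.txt"

# Dump both files as hex, with the whitespace removed.
in_hex=$(od -An -tx1 "$dir/in.txt" | tr -d ' \n')
out_hex=$(od -An -tx1 "$dir/out.txt" | tr -d ' \n')
echo "in:  $in_hex"
echo "out: $out_hex"
```

The output should be `c383c2b8` for the input and `c3b8` for the output, matching the bytes listed in the answer.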
David, thanks a lot. If I do echo ø > test.txt and cat it I get 'ø'. What encoding is that character in?
jriff
Can't say: all that tells you is that the input and output encodings of your terminal are the same. If you do `od -txC test.txt` you can see the raw (hex) value that's stored in the file, then you can deduce what encoding was used.
David Gelhar
If I created a file that contained just the two bytes c3 and b8 and ran that through iconv, it would be 100% up to iconv how to interpret those two bytes. If I used "-f UTF8" it would read the two bytes as the multibyte character 'ø'. If I used "-f ISO-8859-1" it would read the two bytes as the two single-byte characters 'Ã' and '¸'.
jriff
When using "-t UTF8" iconv would take the character(s) that it read and generate the bytes needed for that specific target encoding. So if it read the two bytes as an 'ø' it would return the same two bytes. But if it read the two bytes as 'Ã' and '¸' it would return four bytes - two for each of the characters?
jriff
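The two readings described in the comments above can be compared directly. This is a small sketch, assuming `iconv` and `od` are available. It feeds the same two bytes (C3 B8) to iconv under each source encoding and dumps what comes out with UTF-8 as the target.

```shell
# Read C3 B8 as UTF-8: one character (ø), written back as the same two bytes.
as_utf8=$(printf '\303\270' | iconv -f UTF-8 -t UTF-8 | od -An -tx1 | tr -d ' \n')

# Read C3 B8 as ISO-8859-1: two characters (Ã and ¸), two bytes each in UTF-8.
as_latin1=$(printf '\303\270' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "read as UTF-8:      $as_utf8"
echo "read as ISO-8859-1: $as_latin1"
```

The first line should show `c3b8` (two bytes) and the second `c383c2b8` (four bytes), which is the double-encoded form seen elsewhere in this thread.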
David Gelhar
Thanks - you have been so helpful - I think I get it now.
jriff
The real problem: A webservice posts some data to my Rails application. The header doesn't contain a charset or encoding parameter (and I have read that this means the default is ISO-8859-1). The problem is that the webservice actually posts UTF-8 encoded text. Rails processes the text as ISO-8859-1 and I get 'Ã' and '¸' instead of 'ø'. I can't do anything about the webservice, and as far as I can see I now have a UTF-8 string containing the two characters 'Ã' and '¸'. What to do? Can I get back the 'ø'?
jriff
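In principle the 'ø' can be recovered: converting the mis-decoded UTF-8 text to ISO-8859-1 maps 'Ã' and '¸' back to the bytes C3 and B8, which are the original UTF-8 for 'ø'. The sketch below shows this with iconv, plus a guard so that text which was not double-encoded is left alone. `repair_mojibake` is a hypothetical helper name, and it assumes exactly one round of UTF-8 data being misread as ISO-8859-1. In Ruby 1.9 or later the analogous steps would presumably be `String#encode` followed by `String#force_encoding`, but that is not shown here.

```shell
# repair_mojibake: read UTF-8 text on stdin, write the repaired text on stdout.
# Hypothetical helper: it undoes one UTF-8 -> ISO-8859-1 mislabeling and
# leaves the input unchanged when that reversal is not possible.
repair_mojibake() {
    original=$(cat)
    # Step 1: map each character back to the single byte it was misread from.
    # Step 2: check that the resulting bytes decode as valid UTF-8.
    if repaired=$(printf '%s' "$original" | iconv -f UTF-8 -t ISO-8859-1 2>/dev/null) &&
       printf '%s' "$repaired" | iconv -f UTF-8 -t UTF-16 >/dev/null 2>&1; then
        printf '%s\n' "$repaired"
    else
        printf '%s\n' "$original"
    fi
}

# Usage: "sÃ¸rensen" as UTF-8 bytes goes in, "sørensen" as UTF-8 bytes comes out.
printf 's\303\203\302\270rensen' | repair_mojibake | od -An -tx1
```

The guard is only a heuristic. Text that legitimately contains the character pair 'Ã¸' would also be "repaired", so this is best applied only to input known to come from the misbehaving source.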