ansaurus

Question

Problem with iconv

Answer 1

A:

UTF-8 is a multibyte encoding. The character ø is encoded as two bytes: C3 B8. In your terminal's encoding (ISO-8859-1), these bytes are decoded as Ã¸. Then you convert those bytes to the ISO-8859-1 code for ø. Any questions?
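To make the byte-level picture concrete, here is a minimal sketch in Python (my choice of language, not something the question uses). It relies only on the built-in `str.encode` and `bytes.decode` methods; the variable names are made up for illustration.

```python
# Encode "ø" as UTF-8 and look at the raw bytes.
char = "\u00f8"  # ø
utf8_bytes = char.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))  # c3 b8

# Decode the same two bytes as ISO-8859-1: every byte becomes one character,
# so the single "ø" turns into the two characters "Ã" and "¸".
misread = utf8_bytes.decode("iso-8859-1")
print(misread)  # Ã¸
```

The bytes never change here; only the rule used to read them does.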

Andrey 2010-03-26 13:59:32

Thanks a lot for your reply. I am trying to understand this, so please bear with me :-) Are the characters "Ã¸", when typed into the terminal, treated as ISO-8859-1? If they are converted from UTF-8 to ISO-8859-1, the terminal can suddenly display them correctly because the two characters are read as UTF-8 and converted to ISO-8859-1. Does that mean that if the terminal was set to UTF-8 the two characters would be displayed as a "ø"?

jriff 2010-03-26 14:05:24

I just checked and my terminal is set to UTF-8.

jriff 2010-03-26 14:07:32

If my terminal is set to UTF-8 and I cat the file shouldn't I be getting an "ø" and not "Ã¸"?

jriff 2010-03-26 14:11:51

If your terminal is set to UTF-8 you should see ø, but I am sure that something is still configured wrong. The conversion Ã¸ -> bytes -> UTF-8 gave me EXACTLY ø.

Andrey 2010-03-26 14:42:47

Can the file itself have an encoding?

jriff 2010-03-26 14:50:01

And how do you convert it to bytes?

jriff 2010-03-26 14:50:42

I used Notepad++, which can switch encodings. A file doesn't have an encoding itself. There is the concept of a BOM (byte order mark) that can indicate the encoding, and some formats such as XML have an internal declaration of the encoding.
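As a rough illustration of how software can look for a BOM, here is a small Python sketch. The function name `sniff_bom` is hypothetical; the `codecs.BOM_*` constants are from the Python standard library. It only recognizes the Unicode BOMs listed, and a file without one tells you nothing.

```python
import codecs

# Known BOMs. The UTF-32 entries come first because the UTF-32 LE mark
# starts with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# A UTF-8 file that starts with a BOM, followed by the two bytes for "ø".
encoding, skip = sniff_bom(b"\xef\xbb\xbf\xc3\xb8")
print(encoding, skip)  # utf-8 3
```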

Andrey 2010-03-26 15:05:17

And that BOM or internal specification tells the software reading the file how to read the bytes in the file. What if there is none? Trial and error?

jriff 2010-03-26 15:20:57

Actually, on Mac OS X the file *can* have an extended attribute (com.apple.TextEncoding) that can be used by the application creating the file to record the encoding. BUT, the command-line level utilities like "cat" and "echo" don't do anything with it, so this probably isn't relevant to your situation. You can view the attribute (if present) with a command like "xattr -p com.apple.TextEncoding test.txt". Applications that support this attribute include TextEdit, Safari, and Mail.app.

David Gelhar 2010-03-26 15:22:30

It depends on the software. Custom file formats usually carry the encoding, and software that works with them can use it. In the case of a plain text file, yes, it is possible not to know the encoding. You open it in an advanced text editor and pick one.

Andrey 2010-03-26 15:28:44

http://en.wikipedia.org/wiki/Byte_order_mark

Andrey 2010-03-26 15:29:16

Thanks a lot for your help - I knew about the extended attribute but as you say - it means nothing to the utilities. I am just wondering how to write software that accepts an arbitrary encoding. I have always had the luxury of knowing the encoding so this is a whole new consideration for me :-|
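One common heuristic for input of unknown encoding is to try strict UTF-8 first and fall back to a single-byte encoding. The Python sketch below is only that, a heuristic under stated assumptions: the function name `decode_with_fallback` and the candidate list are made up for illustration, and ISO-8859-1 has to come last because it accepts every possible byte sequence.

```python
def decode_with_fallback(data: bytes, candidates=("utf-8", "iso-8859-1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError("none of the candidate encodings could decode the data")

print(decode_with_fallback(b"\xc3\xb8"))  # valid UTF-8, read as one character
print(decode_with_fallback(b"\xf8"))      # invalid UTF-8, falls back to ISO-8859-1
```

This cannot tell whether the bytes C3 B8 were meant as "ø" in UTF-8 or as "Ã¸" in ISO-8859-1; it simply prefers UTF-8 whenever the bytes happen to be valid UTF-8.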

jriff 2010-03-26 17:39:48

Answer 2

A:

I tried the "iconv" command from one file to another, looking at the data with "od -txC" with the following results:

Input:  c3  83  c2  b8         [ 2 utf8-chars Capital A tilde; Cedilla ]

Command: iconv -f utf-8 -t ISO-8859-1 < in.txt > out.txt

Output:  c3  b8    [ 2 ISO-8859-1 characters, Capital A tilde; Cedilla ]

So, the iconv conversion is correct.

But, if you instead treat the converted data as utf-8 (which Terminal is apparently doing), C3-B8 is "ø" (o-slash).

If you change your character encoding in Terminal (Preferences // Advanced // Character Encoding) to "Western (ISO Latin 1)" you'll see C3-B8 as "Ã¸"

David Gelhar 2010-03-26 14:26:34

David, thanks a lot. If I do echo ø > test.txt and cat it I get 'ø'. What encoding is that character in?

jriff 2010-03-26 14:47:42

Can't say: all that tells you is that the input and output encodings of your terminal are the same. If you do `od -txC test.txt` you can see the raw (hex) value that's stored in the file, then you can deduce what encoding was used.

David Gelhar 2010-03-26 15:07:21

If I created a file that contained just the two bytes c3 and b8 and ran that through iconv it would be 100% up to iconv how to interpret those two bytes. If I used "-f UTF8" it would read the two bytes as the multibyte character 'ø'. If I used "-f ISO-8859-1" it would read the two bytes as the two single-byte characters 'Ã' and '¸'.

jriff 2010-03-26 15:18:39

When using "-t UTF8" iconv would take the character(s) that it read and generate the bytes needed for that specific target encoding. So if it read the two bytes as an 'ø' it would return the same two bytes. But if it read the two bytes as 'Ã' and '¸' it would return four bytes - two for each of the characters?
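The two cases can be checked with a small Python sketch. The helper `reencode` is a made-up name that stands in for `iconv -f source -t target` by decoding and then encoding; it is an approximation of what iconv does, not its implementation.

```python
raw = bytes([0xC3, 0xB8])

def reencode(data: bytes, source: str, target: str) -> bytes:
    """Rough stand-in for `iconv -f source -t target`: decode, then encode."""
    return data.decode(source).encode(target)

# Read as UTF-8: one character, written back as the same two bytes.
as_utf8 = reencode(raw, "utf-8", "utf-8")
print(as_utf8.hex())    # c3b8

# Read as ISO-8859-1: two characters, each needing two bytes in UTF-8.
as_latin1 = reencode(raw, "iso-8859-1", "utf-8")
print(as_latin1.hex())  # c383c2b8
```

The four-byte result is the same c3 83 c2 b8 sequence shown as the input in the answer above.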

jriff 2010-03-26 15:19:15

David Gelhar 2010-03-26 15:30:49

Thanks - you have been so helpful - I think I get it now.

jriff 2010-03-26 17:35:39

The real problem: A webservice posts some data to my Rails application. The header doesn't contain a charset or encoding parameter (and I have read that that means to default to ISO-8859-1). The problem is that the webservice actually posts UTF-8 encoded text. Rails processes the text using ISO-8859-1 and I get the 'Ã' and '¸' instead of a 'ø'. I can't do anything about the webservice and as far as I can see I now have a UTF-8 string with two bytes each for 'Ã' and '¸'. What to do? Can I get back the 'ø'?
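In principle the misreading can be undone by reversing it: turn the characters back into the ISO-8859-1 bytes they were read from, then decode those bytes as UTF-8. The sketch below shows the idea in Python rather than Ruby, so treat it as an illustration and not as Rails code. The function name `repair_mojibake` is hypothetical, and it assumes the wrong decoding really was ISO-8859-1 (not, say, Windows-1252).

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-ISO-8859-1 mistake; return text unchanged if it does not apply."""
    try:
        # Step 1: recover the original bytes. Step 2: read them as UTF-8.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside ISO-8859-1, or the recovered
        # bytes are not valid UTF-8, so it was probably not misdecoded this way.
        return text

print(repair_mojibake("\u00c3\u00b8"))  # the two characters Ã¸ become ø
```

Because the function falls back to the input when the round trip fails, text that was already correct (such as a plain "ø") passes through untouched, but a string that happens to form valid UTF-8 after step 1 would be altered, so it is best applied only to data known to come from the misbehaving source.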

jriff 2010-03-26 17:37:23
