Hi,

Before anyone recommends that I do a Google search on this, I have. I just need a bit more clarity around what code pages and encodings are.

If I use UTF-8 encoding, and use an Italian code page and then a French code page, does this mean I'll get different characters even though the bytes haven't changed?

+8  A: 

Joel has a nice summary of this:
http://www.joelonsoftware.com/articles/Unicode.html

And no, if I understand your question correctly, it doesn't mean that. When you convert UTF-8 text to a specific code page, it is possible that only some of the characters can be converted. What happens to the ones that can't be converted depends on how you call the conversion. One possible result is that characters which cannot be mapped to the code page are replaced with question marks.
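
To make the question-mark case concrete, here is a minimal Python sketch. The sample string and the choice of Latin-1 as the target code page are my own assumptions for illustration; other languages and libraries expose the same idea through different calls.

```python
# Start from UTF-8 bytes, as they might arrive from a file or the network.
utf8_bytes = "perché €5".encode("utf-8")

# Decode to Unicode text first, then re-encode into the target code page.
text = utf8_bytes.decode("utf-8")

# Latin-1 (ISO-8859-1) has 'é' but no euro sign. With errors="replace",
# Python substitutes '?' for every character the target cannot represent.
converted = text.encode("latin-1", errors="replace")

print(converted)  # b'perch\xe9 ?5'
```

Note that the bytes do change during the conversion: 'é' is two bytes in UTF-8 but a single byte (0xE9) in Latin-1.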

shoosh
+5  A: 

An encoding is simply a mapping between numerical values and "characters".

US-ASCII maps the number 65 to the letter A, 32 to a space and 49 to the digit "1". (How these things are rendered is another matter.) In fact, UTF-8 does the same! But there are other values which UTF-8 treats differently to ASCII. UTF-8 is a variable-length encoding, i.e. a character may be encoded with 1, 2, 3, or 4 bytes; common characters generally consume fewer bytes.
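
A quick way to see the variable length is to encode a few characters and count the bytes. This is just a sketch in Python; the four sample characters are arbitrary picks of mine.

```python
# One sample character for each possible UTF-8 length.
samples = ["A", "é", "€", "😀"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    # Show the character, its byte count, and the raw bytes in hex.
    print(ch, len(encoded), encoded.hex())

# For plain ASCII characters, the ASCII and UTF-8 encodings are identical.
same_as_ascii = "A".encode("ascii") == "A".encode("utf-8")
print(same_as_ascii)  # True
```

The byte counts come out as 1, 2, 3 and 4 respectively.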

Plain text files, including web pages, are stored and transmitted as sequences of bytes. These bytes are supposed to represent something textual. Software applications (like text editors and web browsers) are responsible for rendering the information within these files on the screen. Usually they make use of library or OS functions.

If the software assumes a different encoding to the software that created the file, the wrong characters may be displayed!
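
Here is a small Python sketch of that mismatch. The sample character and the choice of Windows-1252 as the "wrong" assumption are mine, purely for illustration.

```python
# The writer saves 'è' as UTF-8: two bytes, 0xC3 0xA8.
data = "è".encode("utf-8")

# A reader that assumes UTF-8 gets the original character back.
right = data.decode("utf-8")

# A reader that assumes Windows-1252 treats each byte as its own character.
wrong = data.decode("cp1252")

print(data)   # b'\xc3\xa8'
print(right)  # è
print(wrong)  # Ã¨
```

The bytes are identical in both cases; only the reader's assumption about the encoding differs.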

Note that it is possible to convert between different encodings; however if you convert to an encoding that does not contain a certain character, the software must make a choice as to what to use instead. This conversion often happens transparently (when you save a file with a certain encoding, whatever you've typed must be changed into that encoding).
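
As an illustration of that "choice", Python lets the caller pick the policy through the `errors` argument of `str.encode`. This is only a sketch of one language's options; other tools make the choice differently, and sometimes silently.

```python
text = "€"  # not representable in ASCII

# Each policy decides what to emit for an unmappable character.
results = {
    policy: text.encode("ascii", errors=policy)
    for policy in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}
for policy, encoded in results.items():
    print(policy, encoded)

# The default policy ("strict") refuses to guess and raises instead.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print("strict raised:", strict_failed)
```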

Artelius
A: 

UTF-8 includes all the characters from your French and Italian code pages, but the language-specific code pages do not include all of each other's characters.

So you can take input from each language and convert it to UTF-8 for storage, but you cannot be certain that you will get the right characters if you take Italian input and display it using the French code page.

Use UTF-8 all the way if you can.

idstam
Why was this downvoted? The question wasn't worded very clearly, but this answer seems correct to me.
Alan Moore