Hi,

Having some issues getting my head around the differences between UTF-8, UTF-16, ASCII and ANSI. After doing some research I have some idea but it would be really useful if someone could explain exactly the difference between them (including the byte representation of a typical character from each).

I guess my question boils down to:

1) How does each of the above store characters as bytes?
2) What are the differences between the above standards?
3) What is a code page?
4) How do I convert characters between the various encodings?

Many many thanks :)

+11  A: 

I've found Joel's article on Unicode to explain this very well. Specifically it covers the history (essential for this subject), encodings (UTF-8/16 etc.) and code pages.

Brian Agnew
A: 

The O'Reilly book CJKV Information Processing contains a lot of background on character sets and character encodings, with particular attention to CJKV data, of course. I found it useful to get my understanding beyond "how do I get a !*!**#@ Euro symbol to show up properly?".

araqnid
A: 

On Unix, use the program named recode or iconv to convert text files to another encoding, or use the iconv function (man 3 iconv) in your C or C++ program.

If you use Perl, use the Encode module for conversion (e.g. use Encode; print encode("utf-8", "\xabfoo")). If you use Python 2, use unicode.encode and/or str.decode (e.g. print u'\xabfoo'.encode('utf-8')).
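
The Python one-liner above uses Python 2 syntax. As a rough sketch of the same idea in Python 3 (where str holds Unicode text and bytes holds encoded data), a conversion goes through decode and encode. The variable names here are just for illustration, and Latin-1 is an assumed source encoding:

```python
# Python 3: text is Unicode; encoding turns it into bytes.
text = "\xabfoo"                          # U+00AB followed by "foo"

utf8_bytes = text.encode("utf-8")         # two bytes for U+00AB, then "foo"
latin1_bytes = text.encode("latin-1")     # one byte for U+00AB, then "foo"

# Converting between encodings: decode with the source encoding,
# then encode with the target encoding.
converted = latin1_bytes.decode("latin-1").encode("utf-8")

print(utf8_bytes)    # b'\xc2\xabfoo'
print(latin1_bytes)  # b'\xabfoo'
```

The intermediate step is always Unicode text, so you need to know the source encoding before you can convert anything.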

pts
+3  A: 

To quickly attempt to answer your specific questions.

1: Each encoding maps a character to a specific sequence of bits. Depending on the encoding, a single character may be stored in one byte or in several.

2: Brief information on and differences between the encodings you mentioned.

ASCII
Includes definitions for 128 characters, using 7 bits per character (values 0 to 127).

ANSI
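Has more characters than ASCII, but each character still fits in an octet (256 possible values). On Windows, "ANSI" is typically a loose name for the system's 8-bit code page, such as Windows-1252, so you need to know which code page is in use to interpret the bytes.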

UTF-8
This can be used to represent any Unicode character. There are many many more Unicode characters than there are ASCII ones. It stores each character in one to four octets of data.

UTF-16
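Similar to UTF-8 in that it can represent any Unicode character, but the basic unit is 16 bits, so each character takes either two or four octets. If you're just using English then you're spending an extra 8 bits on every character compared with UTF-8.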
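
To make the byte representations concrete, here is a small sketch that prints the bytes of a few characters in each encoding. It assumes cp1252 (Windows-1252) as a stand-in for "ANSI" and uses big-endian UTF-16 without a byte order mark so the output is easy to read. The helper name show_bytes is made up for this example:

```python
def show_bytes(ch: str, encoding: str) -> str:
    """Return the encoded bytes of ch as hex, or a note if it cannot be encoded."""
    try:
        return " ".join(f"{b:02x}" for b in ch.encode(encoding))
    except UnicodeEncodeError:
        return "not representable"

samples = ["A", "\u00e9", "\u20ac"]       # A, e with acute accent, euro sign
encodings = ["ascii", "cp1252", "utf-8", "utf-16-be"]

for ch in samples:
    for enc in encodings:
        print(f"U+{ord(ch):04X} {enc:10} {show_bytes(ch, enc)}")
```

For the euro sign (U+20AC) this gives "not representable" in ASCII, 80 in cp1252, e2 82 ac in UTF-8 and 20 ac in UTF-16, while the letter A is 41 everywhere except UTF-16, where it is 00 41.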

3: A code page is a table that tells the computer which byte value refers to which character. Unicode does not need code pages since each character has its own unique code point. ANSI has code pages because a single byte only gives it 256 available values, which is not enough to cover every language at once. For example, if you were on an Arabic computer you would have Arabic set as the code page, and Arabic characters could be displayed.
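
As a quick illustration of why the code page matters, the sketch below decodes the same single byte under three Windows code pages (Western European, Cyrillic and Arabic). The byte value 0xE4 is an arbitrary choice for the example:

```python
raw = bytes([0xE4])                        # one byte, no encoding attached yet

# cp1252 = Western European, cp1251 = Cyrillic, cp1256 = Arabic
code_pages = ["cp1252", "cp1251", "cp1256"]

# The same byte becomes a different character under each code page.
decoded = {cp: raw.decode(cp) for cp in code_pages}

for cp, ch in decoded.items():
    print(f"{cp}: U+{ord(ch):04X}")
```

Under cp1252 the byte is U+00E4 (a with diaeresis), while under cp1251 it is U+0434 (a Cyrillic letter). The byte itself carries no hint about which reading is intended.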

4: The method of conversion depends on the character sets you are converting from and to, and on the code pages used (if any). Some conversions may not be possible, because the target encoding may have no representation for a character. UTF-8 is backward compatible with ASCII, meaning that if your text only includes the 128 ASCII characters, it's byte-for-byte the same as the same text in ASCII encoding.
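
Both claims in point 4 can be checked with a few lines of Python 3. This is only a sketch; the sample strings are arbitrary:

```python
ascii_text = "hello"

# Pure ASCII text produces identical bytes in ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

euro = "\u20ac"                            # euro sign, not part of ASCII

# A conversion that is not possible raises an error by default...
try:
    euro.encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# ...unless you explicitly accept a lossy result.
lossy = euro.encode("ascii", errors="replace")

print(same_bytes, encodable, lossy)        # True False b'?'
```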

This answer was ad-hoc and there may be mistakes, corrections welcome.

CiscoIPPhone
A: 

A couple of random points that are useful to know:

  • An interesting thing about UTF-8 and ASCII is that the 128 ASCII characters are encoded in exactly the same way in UTF-8 (this is not the case for UTF-16 or UTF-32, which use at least two and four bytes per character respectively). In other words, within the ASCII range of characters, ASCII and UTF-8 are totally interchangeable.

    The way this comes about is that UTF-8 is variable length; the first 128 characters are represented by a single byte each. Beyond that, it starts using multiple bytes. How does a decoder know whether to interpret a byte as a single ASCII character or as part of a multi-byte sequence? Because the bits at the beginning of the byte follow certain patterns: a zero bit at the start means it's a single-byte character, n 1 bits followed by a 0 (with n from 2 to 4) means this byte is the beginning of an n-byte sequence, and the bits 10 mark a continuation byte inside such a sequence.

  • Also, different programming languages will convert their native strings into different encodings when you output them, for example when you print them to a file or to the screen. Therefore, if you're interested in interchangeability between languages and platforms, you should always specify how you'd like your language's string types to be output. Otherwise you will get strange and unexpected errors!

  • UTF-8 is also the default encoding for XML.
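
Going back to the first point, the leading-bit patterns of UTF-8 can be sketched in a few lines of Python 3. The function name utf8_sequence_length is made up for this example, and it only looks at the first byte; a real decoder also has to validate the continuation bytes:

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its first byte."""
    if (first_byte & 0b1000_0000) == 0b0000_0000:   # 0xxxxxxx: single byte (ASCII)
        return 1
    if (first_byte & 0b1110_0000) == 0b1100_0000:   # 110xxxxx: 2-byte sequence
        return 2
    if (first_byte & 0b1111_0000) == 0b1110_0000:   # 1110xxxx: 3-byte sequence
        return 3
    if (first_byte & 0b1111_1000) == 0b1111_0000:   # 11110xxx: 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else is not valid here.
    raise ValueError("not the first byte of a UTF-8 sequence")

for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {utf8_sequence_length(encoded[0])} byte(s)")
```

For the four sample characters this reports 1, 2, 3 and 4 bytes, which matches the length of their actual UTF-8 encodings.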