  • does it use setlocale()?
  • does it assume UTF-8 for all input strings when in a UTF-8 locale?
  • I understand what Unicode is and how it relates to UTF-8, but how does one "convert to it" internally for all of one's strings?

How does it convert all input strings to UTF-8? Does it use a C library function?

Does the current working locale have to be a UTF-8 locale?

UPDATE: if your answer could include specific technical details, that would be great, as that is more along the lines of what I'm looking for. I already understand the reasons for using UTF-8 internally and why it makes dealing with multiple locales much simpler.

UPDATE: an answer mentioned simply using iconv and/or ICU; however, how do strcmp() and all the other routines know to compare the strings as UTF-8 then? Does setlocale() have to be called? Or does it not matter?

+1  A: 

A string is really an abstract concept. But inside a computer, any string will have a concrete representation as a number of bytes using a specific character encoding.

So "converting everything to UTF-8 internally" means that an application uses UTF-8 for all strings internally, has all its logic written to operate on UTF-8 strings, and converts every outside input from whatever encoding that string uses to UTF-8. It may or may not allow you to choose which encoding to use for output.

This is really the most sensible way to write an app capable of handling text in different encodings, at least if the internal logic can work efficiently on UTF-8 (i.e. doesn't require random access).
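
Concretely, the conversion at the boundary is usually done with a library such as iconv(3) or ICU. A minimal sketch with iconv, assuming the outside input happens to be ISO-8859-1 (the encoding names, buffer sizes and error handling here are illustrative, not a complete recipe):

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char in[] = "caf\xE9";                /* "café" in ISO-8859-1: 4 bytes */
        char out[32];

        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1) {
            perror("iconv_open");
            return 1;
        }

        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof(out) - 1;
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            perror("iconv");
            return 1;
        }
        *outp = '\0';                         /* out now holds "caf\xC3\xA9": 5 UTF-8 bytes */

        printf("%zu UTF-8 bytes\n", strlen(out));
        iconv_close(cd);
        return 0;
    }

From that point on, the application only ever touches the UTF-8 copy; the original bytes are discarded.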

Michael Borgwardt
Can you give more technical details? I understand the high-level details you have given me already.
xyld
+2  A: 

Uh... I guess you're asking about things such as libiconv or ICU, but... they're just libraries for converting character sets...

EDIT:

You can't use the standard C string-handling functions, since you're not dealing with standard C strings. UTF-8-capable versions of them are available in libraries such as glib or ICU.
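
For instance, glib treats gchar* strings as UTF-8 and provides UTF-8-aware counterparts of the usual routines. A small sketch (g_utf8_collate compares according to the current locale's collation rules):

    #include <glib.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *s = "na\xC3\xAFve";   /* "naïve" as UTF-8: 6 bytes, 5 characters */

        if (!g_utf8_validate(s, -1, NULL)) {
            fprintf(stderr, "not valid UTF-8\n");
            return 1;
        }

        printf("strlen:        %zu bytes\n", strlen(s));                       /* 6 */
        printf("g_utf8_strlen: %ld characters\n", (long)g_utf8_strlen(s, -1)); /* 5 */

        /* locale-sensitive comparison of two UTF-8 strings */
        printf("g_utf8_collate: %d\n", g_utf8_collate(s, "naive"));
        return 0;
    }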

Ignacio Vazquez-Abrams
I guess you're right... heh
xyld
Sure you can use the standard functions. One of the special charms of UTF-8 is that functions like strlen and strcmp work correctly on the simple cases.
bmargulies
And if you assume that every case is a simple case then you will soon run into bugs left and right.
Ignacio Vazquez-Abrams
A: 

If you want to compare two strings in C, they both have to be in the same encoding. strcmp() is just a memcmp() (or a byte-wise compare) that stops at a value of 0; there is no conversion whatsoever in C's strcmp(). If you have to deal with different encodings (CP850, UTF-8, ANSI, Windows, Mac), you have to be very careful about what you compare, or else you will be comparing apples with pears.

The libraries mentioned above have their own implementations of strcmp(), which know about and handle the encoding, but you always have to know and provide the encoding yourself.
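
To make the byte-wise behaviour concrete with plain standard C, here is a sketch; the locale name "en_US.UTF-8" is only an example and must actually be installed for setlocale to accept it:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Both strings are already UTF-8 byte sequences. */
        const char *a = "\xC3\xA9clair";      /* "éclair" */
        const char *b = "zebra";

        /* strcmp() compares raw bytes: 0xC3 > 'z', so "éclair" sorts after "zebra". */
        printf("strcmp:  %d\n", strcmp(a, b) > 0 ? 1 : -1);

        /* strcoll() uses the LC_COLLATE locale's rules; in a UTF-8 locale,
           "éclair" collates near "eclair" and therefore before "zebra". */
        if (setlocale(LC_COLLATE, "en_US.UTF-8") != NULL)
            printf("strcoll: %d\n", strcoll(a, b) > 0 ? 1 : -1);

        return 0;
    }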

If you deal with XML, you may use libxml, which converts to the correct internal representation for you, based on the encoding declared in the XML header.
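
A rough sketch with libxml2 ("input.xml" is a placeholder filename): whatever encoding the file declares, the text handed back by the API is UTF-8, libxml2's internal representation.

    #include <libxml/parser.h>
    #include <libxml/tree.h>
    #include <stdio.h>

    int main(void)
    {
        /* The parser reads the encoding from the <?xml ... encoding="..."?> declaration. */
        xmlDocPtr doc = xmlReadFile("input.xml", NULL, 0);
        if (doc == NULL)
            return 1;

        xmlNodePtr root = xmlDocGetRootElement(doc);
        if (root != NULL) {
            xmlChar *text = xmlNodeGetContent(root);  /* UTF-8, regardless of the file's encoding */
            printf("%s\n", (const char *)text);
            xmlFree(text);
        }

        xmlFreeDoc(doc);
        xmlCleanupParser();
        return 0;
    }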

Encodings and character tables are one of the messiest areas of C, dating back to the old days when character bytes were 7 bits long and the computing world only took place in the USA (so no umlauts, accents, euro sign, etc.).

Peter Miehle
+4  A: 

It's a little hard to tell where to start here, since there are a lot of assumptions in play.

In C as we know and love it, there is a 'char' datatype. In all commonly-used implementations, that datatype holds an 8-bit byte.

In the language, as opposed to any library functions you use, these things are just two's-complement integers. They have no 'character' semantics whatsoever.

As soon as you start calling functions from the standard library with 'str' or 'is' in their names (e.g. strcmp, isalnum), you are dealing with character semantics.

C programs need to cope with the giant mess made of character semantics before the invention of Unicode. Various organizations invented a very large number of encoding standards. Some use one byte per character. Some use multiple bytes per character. In some, it's always safe to ask if (charvalue == 'a'). In others, that can give the wrong answer due to a multi-byte sequence.

In just about every modern environment, the semantics of the standard library are determined by the locale setting.

Where does UTF-8 come in? Quite some time ago, the Unicode Consortium was founded to try to bring order out of all this chaos. Unicode defines a character value (in a 32-bit character space) for many, many, many characters. The intent is to cover all the characters of practical use.

If you want your code to work in English, and Arabic, and Chinese, and Sumerian Cuneiform, you want Unicode character semantics, rather than writing code that ducks and weaves around different character encodings.

Conceptually, the easiest way to do this would be to use 32-bit characters (UTF-32), so that you'd have one item per logical character. Most people have decided that this is impractical. Note that in modern versions of gcc the data type wchar_t is a 32-bit character, but Microsoft Visual Studio does not agree, defining that data type as 16-bit values (UTF-16 or UCS-2, depending on your point of view).
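
You can check what your own toolchain does with a one-liner (typically 4 with gcc/glibc, 2 with Microsoft Visual C++):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        printf("sizeof(wchar_t) = %lu\n", (unsigned long)sizeof(wchar_t));
        return 0;
    }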

Most non-Windows C programs are much too invested in 8-bit characters to change. And so, the Unicode standard includes UTF-8, a representation of Unicode text as a sequence of 8-bit bytes. In UTF-8, each logical character is between 1 and 4 bytes in length. The basic ISO-646 ('ascii') characters 'play themselves', so simple operations on simple characters work as expected.
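
A small illustration of those lengths, writing the code points out as explicit UTF-8 byte sequences; note that strlen counts bytes, not logical characters:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *ascii   = "a";                  /* U+0061:  1 byte  */
        const char *e_acute = "\xC3\xA9";           /* U+00E9 'é': 2 bytes */
        const char *euro    = "\xE2\x82\xAC";       /* U+20AC '€': 3 bytes */
        const char *g_clef  = "\xF0\x9D\x84\x9E";   /* U+1D11E musical G clef: 4 bytes */

        printf("%zu %zu %zu %zu\n",
               strlen(ascii), strlen(e_acute), strlen(euro), strlen(g_clef)); /* 1 2 3 4 */
        return 0;
    }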

If your environment includes locales for UTF-8, then you can set the locale to a UTF-8 locale, and all the standard lib functions will just work. If your environment does not include locales for UTF-8, you'll need an add-on, like ICU or ICONV.
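
A sketch of the "just works" case, assuming the environment really does provide a UTF-8 locale (e.g. LANG=en_US.UTF-8):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        /* Pick up the locale from the environment; the multibyte functions
           will only treat char* data as UTF-8 if that locale is UTF-8. */
        setlocale(LC_ALL, "");

        const char *s = "na\xC3\xAFve";   /* "naïve" in UTF-8: 6 bytes */
        wchar_t wbuf[16];
        size_t nchars = mbstowcs(wbuf, s, 16);
        if (nchars == (size_t)-1) {
            fprintf(stderr, "input is not valid in this locale's encoding\n");
            return 1;
        }

        printf("bytes: %zu, characters: %zu\n", strlen(s), nchars);  /* 6, 5 */
        return 0;
    }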

This whole discussion has stuck, so far, to data sitting in variables in memory. You also have to deal with reading and writing it. If you call open(2) or the Windows moral equivalent, you'll get the raw bytes from the file. If those are not in UTF-8, you'll have to convert them if you want to work in UTF-8.

If you call fopen(3), then the standard library may try to do you a favor and perform a conversion between its idea of the default encoding of files and its idea of what you want in memory. If you need, for example, to run a program on a system in a Greek locale and read in a file of Chinese in Big5, you'll need to be careful with the options you pass to fopen, or you'll perhaps want to avoid it. And you'll need ICONV or ICU to convert to and from UTF-8.

Your question mentions 'input strings.' Those could be a number of things. In a UTF-8 locale, argv will be UTF-8, and file descriptor 0 will be UTF-8. If the shell is not running in a UTF-8 locale, and you call setlocale to a UTF-8 locale, you will not necessarily get valid UTF-8 in argv. If you connect the contents of a file to a file descriptor, you will get whatever is in the file, in whatever encoding it happens to be in.
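
A portable (POSIX) way to find out which encoding the current locale actually implies, before deciding whether argv and standard input can be treated as UTF-8 (the exact codeset string returned can vary slightly between systems):

    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        setlocale(LC_ALL, "");
        const char *codeset = nl_langinfo(CODESET);
        printf("locale codeset: %s\n", codeset);

        if (strcmp(codeset, "UTF-8") != 0)
            fprintf(stderr, "argv/stdin are probably not UTF-8; convert with iconv or ICU\n");
        return 0;
    }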

bmargulies
UTF-8 code points range from 1-6 bytes, not 1-4, because each successive byte has one less bit.
Ioan
@Ioan, not any more. Representing surrogate pairs as two sets of three is no longer considered acceptable.
bmargulies
@bmargulies, Not sure what you mean. I thought the reason it took 1-6 bytes was because 8b+7b+6b+5b+4b+3b = 33 bits to contain a possible 32-bit code point?
Ioan
@Ioan, That's not how it works. See page 30 of The Unicode Standard 5.0, and then page 77. The maximum length is 4. They don't keep shrinking. Post a question and I'll type in the whole table :-)
bmargulies
@bmargulies, I suppose you're right. :-) I thought maybe they would allow up to 32-bit values to better ensure room for languages... you know... just in case.
Ioan
Unicode is currently defined as using values up to 0x10FFFF; that translates to at most 4 bytes in UTF-8. The original UTF-8 design could represent code points up to 0x7FFFFFFF with up to 6 bytes, though - it's just not needed, at least for now.
iconiK
@iconiK I beg to disagree. The UTC changed from supporting surrogate pairs as two three-byte sequences to insisting on using the 4-byte version for security reasons related to iDNS.
bmargulies
@bmargulies, I didn't know that. Thanks for informing me!
iconiK
Surrogate pairs were never used in UTF-8. However, code that misinterprets UTF-16 as UCS-2 will encode non-BMP characters that way. The official name for that encoding is CESU-8.
dan04
+1  A: 

ICU uses UTF-16 internally (which is a good format to work with internally), but it has convenience routines for comparing UTF-8. You tell it which locale you want to use for the comparison, or it can use the untailored UCA if you specify the locale "root".
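
A sketch of that API, assuming a reasonably recent ICU: ucol_strcollUTF8 was added in ICU 50, and older versions require converting to UTF-16 first and using ucol_strcoll.

    #include <unicode/ucol.h>
    #include <stdio.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;

        /* "root" gives the untailored UCA ordering; pass a locale like "de"
           for language-specific tailoring. */
        UCollator *coll = ucol_open("root", &status);
        if (U_FAILURE(status))
            return 1;

        /* Compare two UTF-8 strings directly, without converting them yourself. */
        UCollationResult r = ucol_strcollUTF8(coll,
                                              "\xC3\xA9clair", -1,   /* "éclair" */
                                              "eclair", -1,
                                              &status);
        if (U_SUCCESS(status))
            printf("%s\n", r == UCOL_LESS ? "less" :
                           r == UCOL_GREATER ? "greater" : "equal");

        ucol_close(coll);
        return 0;
    }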

Steven R. Loomis