Intra-Unicode "lean" Encoding Converters

+1 Q:

Windows provides encoding conversion functions ("MultiByteToWideChar" and "WideCharToMultiByte") which are capable of UTF-8 to/from UTF-16 conversions, among other things. But I've seen people offer home-grown 30- to 40-line functions that also claim to perform UTF-8 / UTF-16 encoding conversions.

My question is, how reliable are such tiny converters? Can such a tiny amount of code handle problems such as converting a UTF-16 surrogate pair (such as <D800 DC00>) into a UTF-8 single four byte sequence (rather than making the mistake of converting into a pair of three byte sequences)? Can they correctly spot "unpaired" surrogate input, and provide an error?

In short, are such tiny converters mere toys, or can they be taken seriously? For that matter, why does unicode.org seemingly offer no advice on an algorithm for accomplishing such conversions?

+2 A:

Converting between UTF-8, -16 and -32 is a pretty simple process. It is simple because they all work with the same "character set", and just use different encodings to represent each code point.

The tricky part is converting to/from a non-UTF format. MultiByteToWideChar can do that. A 15-line conversion function can't.
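
As a small illustration of the "same character set, different encodings" point, the sketch below uses Python's built-in codecs (not any converter mentioned in this thread) to encode one code point three ways. The choice of U+10402 is only an example; it is the code point behind the surrogate pair D801 DC02.

```python
# One code point, three encodings. Only the byte layout differs;
# no lookup table is involved, unlike a legacy code page conversion.
ch = "\U00010402"

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")   # big-endian, no byte order mark
utf32 = ch.encode("utf-32-be")

print(utf8.hex())    # f0909082
print(utf16.hex())   # d801dc02
print(utf32.hex())   # 00010402

# All three decode back to the same single code point.
decoded = {utf8.decode("utf-8"), utf16.decode("utf-16-be"), utf32.decode("utf-32-be")}
print(decoded == {ch})  # True
```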

jalf 2010-06-08 23:32:47

It's not just that they use the same character set. It's that UTF-8 was specifically designed to be easy to convert to and from Unicode code points. GB18030 can represent every valid Unicode character, but it's nontrivial to convert.

dan04 2010-06-09 03:16:40

UTF-8 and UTF-16 were also specifically designed to be able to losslessly convert data between each other. It is possible to convert a UTF-8 sequence directly to UTF-16, and vice versa, without decoding to UTF-32 in between.

Remy Lebeau - TeamB 2010-06-09 08:10:45

@Remy Lebeau: it's that mistaken assumption that leads to broken 30-line functions. Please show how you would encode `0xD801, 0xDC02` without going through `0x00010402`, following all applicable Unicode rules of course
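
For reference, the arithmetic for that pair can be written out. The following is a minimal sketch in Python, assuming the input arrives as a list of 16-bit integers; the name `utf16_to_utf8` is made up for illustration. It forms the code point from each surrogate pair, emits UTF-8 from it, and raises on unpaired surrogates.

```python
def utf16_to_utf8(units):
    """Convert a sequence of UTF-16 code units (ints) to UTF-8 bytes.

    Hypothetical helper for illustration. Raises ValueError on input
    that is not well-formed UTF-16.
    """
    out = bytearray()
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if not 0 <= u <= 0xFFFF:
            raise ValueError(f"not a 16-bit code unit at index {i}")
        i += 1
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: a low surrogate must follow immediately.
            if i == n or not 0xDC00 <= units[i] <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at index {i - 1}")
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i - 1}")
        else:
            cp = u
        # Emit 1 to 4 bytes depending on the size of the code point.
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)


# 0xD801, 0xDC02 -> 0x10000 + (0x001 << 10) + 0x002 = 0x10402
print(utf16_to_utf8([0xD801, 0xDC02]).hex())  # f0909082
```

The surrogate branch is the only part that differs from a per-unit encoder, so the whole function still fits in roughly the line count being discussed.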

MSalters 2010-06-09 09:01:00

There used to be a sample converter in C at the Unicode web site at ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/ but it was removed. I have no idea why as it was very useful and had a non-restrictive license - you would have to ask them.

It was pretty small and I have used it. I believe it did handle surrogate pairs properly but as I don't have the code in front of me I can't swear by it. I'm sure you can find copies of it elsewhere on the web though.

The downside is that it's of no use if you have to convert to or from a non-Unicode character set, as it only converts between UTF variants.

Robert Tuck 2010-06-09 00:00:26

+2 A:

The open source ICU library has 113 lines of code for ucnv_fromUnicode_UTF8 (source/common/ucnv_u8.c). Error checking included, proper surrogate handling, some comments. You should only consider using something else if you don't like the naming conventions.

Hans Passant 2010-06-09 01:55:42

Yes, production quality functions can be that short. I've written full-strength, error checking, defensive, pedantic, understandable, bulletproof conversions for UTF-8 -> UTF-32 and UTF-32 to UTF-8 in about 50 lines each, with comments (but not including the unit tests). There are denser coding styles that could probably do the same in 30-40 lines for each function. There are also shortcuts you can take transcoding UTF-8 to/from UTF-16 directly without UTF-32 in between.
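
To give a sense of the size involved, here is a rough Python sketch of a checked UTF-8 / UTF-32 pair. It is not the code described above; the names `utf32_to_utf8` and `utf8_to_utf32` are hypothetical, and code points are assumed to be plain integers. The decoder rejects invalid lead bytes, truncated sequences, bad continuation bytes, overlong forms, surrogates, and values above U+10FFFF.

```python
def utf32_to_utf8(code_points):
    """Encode an iterable of code points (ints) as UTF-8 bytes."""
    out = bytearray()
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError(f"code point out of range: {cp:#x}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate is not a valid scalar value: {cp:#x}")
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes(out)


def utf8_to_utf32(data):
    """Decode UTF-8 bytes into a list of code points (ints)."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # Lead byte decides the length, payload bits, and smallest legal value.
        if b0 < 0x80:
            length, cp, minimum = 1, b0, 0
        elif 0xC2 <= b0 <= 0xDF:        # 0xC0 and 0xC1 are always overlong
            length, cp, minimum = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            length, cp, minimum = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF4:        # 0xF5 and up exceed U+10FFFF
            length, cp, minimum = 4, b0 & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte {b0:#04x} at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"encoded surrogate at offset {i}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point above U+10FFFF at offset {i}")
        out.append(cp)
        i += length
    return out


print(utf8_to_utf32(b"\xf0\x90\x90\x82") == [0x10402])  # True
print(utf32_to_utf8([0x10402]).hex())                   # f0909082
```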

Adrian McCarthy 2010-06-09 02:40:14

+1 A:

You are correct - most "copy/paste" routines you can find on the Internet don't perform validity checks at all.

If you want a small library that performs those checks, take a look at UTF8-CPP. It has both "checked" and "unchecked" versions of the conversion functions.
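
To make the difference concrete, the sketch below shows what an unchecked routine can produce. `naive_utf16_to_utf8` is a made-up example of the copy/paste style, written in Python for brevity; it is not taken from UTF8-CPP or any other library. It encodes each 16-bit unit on its own, so a surrogate pair becomes two three-byte sequences instead of one four-byte sequence, and a strict decoder rejects the result.

```python
def naive_utf16_to_utf8(units):
    """Unchecked: treats every 16-bit unit as a complete code point."""
    out = bytearray()
    for u in units:
        if u < 0x80:
            out.append(u)
        elif u < 0x800:
            out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        else:
            # Surrogates land here and are encoded as if they were characters.
            out += bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)])
    return bytes(out)


bad = naive_utf16_to_utf8([0xD800, 0xDC00])
good = "\U00010000".encode("utf-8")  # Python's built-in codec, for comparison

print(bad.hex())   # eda080edb080
print(good.hex())  # f0908080

# Python's strict UTF-8 decoder refuses the naive output.
try:
    bad.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
print(accepted)    # False
```

For input confined to the Basic Multilingual Plane the naive routine gives the same bytes as a checked one, which is why the bug tends to go unnoticed until a supplementary character shows up.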

Nemanja Trifunovic 2010-06-09 15:44:23
