How Do You Write Code That Is Safe for UTF-8?

+8 Q:

We have a set of applications that were developed for the ASCII character set. Now we're trying to install them in Iceland, and we are running into problems where the Icelandic characters are getting screwed up.

We are working through our issues, but I was wondering: Is there a good "guide" out there for writing C++ code that is designed for 8-bit characters and which will work properly when UTF-8 data is given to it?

I can't expect everyone to read the whole Unicode standard, but if there is something more digestible available, I'd like to share it with the team so we don't run into these issues again.

Re-writing all the applications to use wchar_t or some other string representation is not feasible at this time. I'll also note that these applications communicate over networks with servers and devices that use 8-bit characters, so even if we did Unicode internally, we'd still have issues with translation at the boundaries. For the most part, these applications just pass data around; they don't "process" the text in any way other than copying it from place to place.

The operating systems used are Windows and Linux. We use std::string and plain-old C strings. (And don't ask me to defend any of the design decisions. I'm just trying to help fix the mess.)

Here is a list of what has been suggested:

You may want to use wide characters (wchar_t instead of char and std::wstring instead of std::string). This doesn't automatically solve 100% of your problems, but it is a good first step.

Also use string functions that are Unicode-aware (refer to the documentation). If a function manipulates wide characters or wide strings, it is generally aware that they are wide.

phjr 2008-09-25 16:39:12

Re-writing all the applications to use different character representations is not feasible.

Kristopher Johnson 2008-09-25 16:46:57

+2 A:

This looks like a comprehensive quick guide:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

Mark Ransom 2008-09-25 16:45:07

+1 A:

Be aware that full Unicode doesn't fit in 16-bit characters, so either use 32-bit chars or a variable-width encoding (UTF-8 is the most popular).

Javier 2008-09-25 16:59:05

Icelandic uses ISO Latin 1, so eight bits should be enough. We need more details to figure out what's happening.

2008-09-25 17:05:49

I'm not asking anyone to help me figure out what's wrong. I'm looking for general guidance and "best practices" for dealing with UTF-8.

Kristopher Johnson 2008-10-17 17:18:16

+1 A:

UTF-8 was designed exactly with your problems in mind. One thing I would be careful about is that ASCII is really a 7-bit encoding, so if any part of your infrastructure is using the 8th bit for other purposes, that may be tricky.

Nemanja Trifunovic 2008-09-25 17:13:41

Yes, that is why we are surprised that UTF-8 has led to problems. We aren't doing anything special with the eighth bit, but it does appear that we are doing things in a few places that cause the text to be misinterpreted or modified in some way.

Kristopher Johnson 2008-09-25 17:25:23

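Note that ASCII is 1 byte per character. UTF-8 uses multiple bytes per character for anything that is not ASCII (so Icelandic counts). Any method that assumes 1 byte per character will not work as expected, e.g. length(), which returns the number of bytes rather than the number of characters.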
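
As an illustration of the byte-versus-character difference, here is a minimal sketch that counts characters (code points) in a std::string holding UTF-8. The function name utf8_code_point_count is made up for this example, and the code assumes the input is already well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte
// that is NOT a continuation byte (10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is done.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 bytes C3 9E C3 B3 72 (the name "Þór"), size() and length() report 5, while this function reports 3.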

Martin York 2008-09-25 17:27:46

+9 A:

Just be 8-bit clean, for the most part. However, you will have to be aware that any non-ASCII character splits across multiple bytes, so you must take account of this if line-breaking or truncating text for display.

UTF-8 has the advantage that you can always tell where you are in a multi-byte character: if bit 7 is set and bit 6 is reset (byte is 0x80-0xBF), this is a trailing byte; if bits 7 and 6 are set and bit 5 is reset (0xC0-0xDF), it is a lead byte with one trailing byte; if bits 7, 6 and 5 are set and bit 4 is reset (0xE0-0xEF), it is a lead byte with two trailing bytes; and so on. The number of consecutive set bits, counting down from the most significant bit of the lead byte, is the total number of bytes making up the character. That is:

110x xxxx = two-byte character
1110 xxxx = three-byte character
1111 0xxx = four-byte character
etc
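
The bit patterns above can be turned into a small truncation helper. This is only a sketch: the names is_utf8_trail_byte, utf8_sequence_length and utf8_truncate are invented for this example, and the code assumes well-formed UTF-8 (it does not reject overlong or otherwise invalid sequences).

```cpp
#include <cstddef>
#include <string>

// True for trailing (continuation) bytes: 10xxxxxx, i.e. 0x80-0xBF.
inline bool is_utf8_trail_byte(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

// Total number of bytes in the character introduced by lead byte c.
// Returns 0 if c is a trail byte or not a valid lead byte pattern.
inline std::size_t utf8_sequence_length(unsigned char c) {
    if (c < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Returns at most max_bytes bytes of s, backing up so that the cut never
// lands in the middle of a multi-byte character.
std::string utf8_truncate(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t cut = max_bytes;
    // If the first dropped byte is a trail byte, the cut is mid-character:
    // move back until the first dropped byte is a lead byte or ASCII.
    while (cut > 0 && is_utf8_trail_byte(static_cast<unsigned char>(s[cut]))) {
        --cut;
    }
    return s.substr(0, cut);
}
```

Truncating the four bytes 61 C3 B0 62 ("aðb") to two bytes keeps only "a", because a plain substr(0, 2) would leave a dangling lead byte at the end of the string.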

The Icelandic alphabet is all contained in ISO 8859-1 and hence Windows-1252. If this is a console-mode application, be aware that the console uses IBM codepages, so (depending on the system locale) it might display in 437, 850, or 861. Windows has no native display support for UTF-8; you must transform to UTF-16 and use Unicode APIs.

Calling SetConsoleCP and SetConsoleOutputCP, specifying codepage 1252, will help with your problem, if it is a console-mode application. Unfortunately the console font selected has to be a font that supports the codepage, and I can't see a way to set the font. The standard bitmap fonts only support the system default OEM codepage.
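
A minimal sketch of those two calls, assuming a console is attached to the process. The wrapper name use_windows_1252_console is hypothetical; on non-Windows builds it is compiled as a no-op so the same source still builds on Linux.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper: switches console input and output to codepage 1252.
// Returns true if both calls succeed. On non-Windows platforms this does
// nothing and reports success.
bool use_windows_1252_console() {
#ifdef _WIN32
    return SetConsoleCP(1252) != 0 && SetConsoleOutputCP(1252) != 0;
#else
    return true;
#endif
}
```

This changes only how the console interprets bytes; the font limitation described above still applies.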

Mike Dimmick 2008-09-25 17:54:24

+1 A:

You might want to check out ICU. It may have functions available that would make working with UTF-8 strings easier.

Brett Hall 2008-09-25 18:11:11

+1 A:

Another introductory article from Joel Spolsky

Xavier Nodet 2008-09-25 19:36:11

Icelandic, like French, German, and most other languages of Western Europe, can be supported using an 8-bit character set (CP1252 on Windows, ISO 8859-1 aka Latin1 on *x). This was the standard approach before Unicode was invented, and is still quite common. As you say, you have a constraint that you can't rewrite your app to use wchar_t, and you don't need to.

You shouldn't be surprised that UTF-8 is causing problems; UTF-8 encodes the non-ASCII characters (e.g. the accented Latin characters, thorn, eth, etc) as TWO BYTES each.

The only general advice that can be given is quite simple (in theory): (1) decide what character set you are going to support (Unicode, Latin1, CP1252, ...) in your system (2) if you are being supplied data encoded in some other fashion (e.g. UTF-8) then transcode it to your standard (e.g. CP1252) at the system border (3) if you need to supply data encoded in some other fashion, ...
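
Step (2), transcoding at the border, might look roughly like the following sketch. It assumes the internal standard is ISO 8859-1 rather than CP1252 (the two differ in the 0x80-0x9F range), and the function names utf8_to_latin1 and latin1_to_utf8 are invented for this example. Anything that cannot be represented in Latin-1 is replaced with '?', which is one possible policy, not the only one.

```cpp
#include <cstddef>
#include <string>

// Converts UTF-8 to ISO 8859-1. Code points above U+00FF, and any byte
// sequence this simple decoder does not accept, become '?'.
std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);  // ASCII passes through unchanged
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < in.size() &&
                   (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
            // Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF.
            unsigned char t = static_cast<unsigned char>(in[i + 1]);
            out += static_cast<char>(((c & 0x03) << 6) | (t & 0x3F));
            ++i;
        } else {
            out += '?';
            // Skip the trail bytes of the character we could not represent.
            while (i + 1 < in.size() &&
                   (static_cast<unsigned char>(in[i + 1]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}

// Converts ISO 8859-1 to UTF-8: each byte >= 0x80 becomes two bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With this in place, the rest of the application keeps seeing one byte per character, and only the code at the boundary needs to know that UTF-8 exists.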

John Machin 2009-06-29 13:20:02

UTF-8 uses 3 bytes for Chinese characters, actually, and might for rare characters even require 4 bytes. Better fix it properly if you're addressing it. The first byte will tell you how many follow: 110xxxxx means 2 byte char, 1110xxxx means 3 byte char, and 11110xxx means 4 byte char.
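
To see where those lead-byte patterns come from, here is a small encoder sketch. The name utf8_encode is hypothetical, and returning an empty string for values that cannot be encoded is just a choice made for this example.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-8. Returns an empty string for
// values that are not encodable (surrogates, or above U+10FFFF).
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Icelandic thorn (U+00FE) comes out as two bytes, the Chinese character U+4E2D as three, and a non-BMP code point such as U+1F600 as four.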

MSalters 2009-06-29 13:29:03

UTF-8 uses three bytes for characters from U+0800 to U+FFFF, actually ... covering not only Chinese, but the scripts used in several countries/languages: India, Sri Lanka, Myanmar aka Burma, Thai, Lao, Tibetan, Georgian, Korean, etc etc. My reference to "TWO BYTES" related to characters used in Icelandic. Read his lips: he's not going to rewrite this app to support characters wider than 8 bits. So he can't support Chinese, period. Hong Kong with its NOT-rare non-BMP HKSCS characters is definitely out of the question.

John Machin 2009-06-29 14:15:56
