You would think this would be readily available, but I'm having a hard time finding a simple library function that will convert a C or C++ string from ISO-8859-1 encoding to UTF-8. I'm reading data in 8-bit ISO-8859-1 encoding, but need to convert it to a UTF-8 string for use in an SQLite database and, eventually, an Android app.

I found one commercial product, but it's beyond my budget at this time.

  • Doug G
+1  A: 

The C++ standard does not provide functions to directly convert between charsets.

Depending on your OS, you can use iconv() on Linux or MultiByteToWideChar() and friends on Windows. A library with broad support for string conversion is ICU, which is open source.
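For illustration, here's a minimal iconv() sketch, assuming a POSIX system (the buffer size and error handling are simplified):

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char src[] = "caf\xe9";            /* "café" in ISO-8859-1 */
    char dst[16] = {0};
    char *in = src, *out = dst;
    size_t inleft = strlen(src), outleft = sizeof dst - 1;

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    /* iconv advances the pointers and decrements the counts as it converts */
    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
        perror("iconv");
    iconv_close(cd);

    printf("%s\n", dst);               /* the UTF-8 bytes: 63 61 66 c3 a9 */
    return 0;
}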

cytrinox
+1  A: 

ISO-8859-1 to UTF-8 involves nothing more than the encoding algorithm, because the 256 ISO-8859-1 code points coincide with the first 256 Unicode code points. So you already have the Unicode code points; for example, ISO-8859-1 0xE9 (é, U+00E9) encodes as the two UTF-8 bytes 0xC3 0xA9. Check Wikipedia for the algorithm.

The C++ aspects -- integrating that with iostreams -- are much harder.

I suggest you walk around that mountain instead of trying to drill through it or climb it, that is, implement a simple string to string converter.

Cheers & hth.,

Alf P. Steinbach
The algorithm is not entirely trivial, especially since novice to intermediate C coders often mistakenly use `char *` where `unsigned char *` is needed. More significant nontrivialities are in the definition of UTF-8 itself, specifically that you need to reject surrogate code points and out-of-range values. Thankfully those won't come up in an encoder that only needs to handle ISO-8859-1 input, but if you write such a limited encoder, it's likely someone will end up misusing it for a larger input range later without adding any checks.
R..
+5  A: 

If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

/* in: NUL-terminated ISO-8859-1 input; out: buffer with room for
   up to twice the input length, plus one byte for the terminator */
unsigned char *in, *out;
while (*in)
    if (*in < 128) *out++ = *in++;
    else *out++ = 0xc2 + (*in > 0xbf), *out++ = (*in++ & 0x3f) + 0x80;
*out = 0;

For safety, ensure that the output buffer is at least twice as large as the input (plus one byte for the terminator), or else include a size limit and check it in the loop condition.
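A bounds-checked version of the same loop might look like this; the function name latin1_to_utf8 and its return convention are illustrative, not part of the original answer:

#include <stddef.h>

/* Convert NUL-terminated ISO-8859-1 `in` to UTF-8 in `out` (capacity
   `outsize`, including the terminator). Returns the number of bytes
   written, excluding the NUL, or (size_t)-1 if the buffer is too small. */
size_t latin1_to_utf8(const unsigned char *in, unsigned char *out,
                      size_t outsize)
{
    size_t n = 0;
    while (*in) {
        size_t need = (*in < 128) ? 1 : 2;
        if (n + need >= outsize) return (size_t)-1;  /* leave room for NUL */
        if (*in < 128) out[n++] = *in++;
        else {
            out[n++] = 0xc2 + (*in > 0xbf);
            out[n++] = (*in++ & 0x3f) + 0x80;
        }
    }
    out[n] = 0;
    return n;
}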

R..
Wow, this is very helpful! I wasn't looking forward to yet another table-lookup algorithm. Now for ANSEL-to-UTF-8...
gordonwd
This certainly answers the question. But as I said in a comment above, people *will* send you CP-1252 mislabelled as ISO-8859-1. Web servers are the example I tripped over that persuaded me the problem is real, and text editors that claim to be saving as "Latin-1" when they aren't are another. That "if your source encoding will always be ISO-8859-1" is a pretty big "if", and it might be hard to track down and eliminate the miscreant responsible.
Steve Jessop
@Steve: You could add an `else if (*in<160) goto error;` case to error-out on encountering any of the C1 control codes 0x80-0x9F (which are probably misencoded Windows-1252 characters, and not useful characters anyway).
R..
@gordon: I'm not familiar with ANSEL, but you should be aware that ISO-8859-1 is the **only** legacy encoding that's this easy to convert to UTF-8. Everything else will require lookup tables. As Steve said, my "if..." is a **big** if.
R..
+1  A: 

The Unicode folks publish tables that might help if you're faced with Windows-1252 instead of true ISO-8859-1. The definitive one seems to be the CP1252.TXT mapping (unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT), which maps every code point in CP1252 to a code point in Unicode. Encoding the Unicode as UTF-8 is a straightforward exercise.

It would not be difficult to parse that table directly and form a lookup table from it at compile time.
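For instance, here's a hand-transcribed sketch of the interesting part of that table, assuming the CP1252.TXT mapping above (the names cp1252_high and put_utf8 are illustrative). Only bytes 0x80-0x9F differ from ISO-8859-1, so only they need table entries:

#include <stddef.h>
#include <stdint.h>

/* Unicode code points for CP1252 bytes 0x80-0x9F (0 = undefined in CP1252);
   every other CP1252 byte coincides with ISO-8859-1, i.e. with Unicode. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178,
};

/* Encode one BMP code point as UTF-8; returns the number of bytes written.
   (The table above never yields a surrogate, so no check is needed here.) */
static size_t put_utf8(uint16_t cp, unsigned char *out)
{
    if (cp < 0x80)  { out[0] = (unsigned char)cp; return 1; }
    if (cp < 0x800) { out[0] = 0xc0 | (cp >> 6);
                      out[1] = 0x80 | (cp & 0x3f); return 2; }
    out[0] = 0xe0 | (cp >> 12);
    out[1] = 0x80 | ((cp >> 6) & 0x3f);
    out[2] = 0x80 | (cp & 0x3f);
    return 3;
}

A full converter would then index cp1252_high for input bytes in 0x80-0x9F, treat the 0 entries as errors, and pass every other byte straight through as its own code point.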

RBerteig