+2  A: 

If you need to decode UTF-8, you need to write a UTF-8 parser. UTF-8 is a variable-length encoding (1 to 4 bytes per code point), so you really have to write a parser that is compliant with the standard: see Wikipedia for example.
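
For reference, these are the encoding forms the parser has to recognize:

    Code point range        UTF-8 byte sequence
    U+0000  .. U+007F       0xxxxxxx
    U+0080  .. U+07FF       110xxxxx 10xxxxxx
    U+0800  .. U+FFFF       1110xxxx 10xxxxxx 10xxxxxx
    U+10000 .. U+10FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx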

If you do not want to write your own parser, I suggest using a library. You will find one in glib, for example (I personally have used Glib::ustring, the C++ wrapper around glib), but also in any good general-purpose library.
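
As a minimal sketch of the Glib::ustring approach (assuming glibmm is installed; the string literal is just demo data), iterating a Glib::ustring yields whole code points (gunichar), not bytes:

#include <glibmm/ustring.h>
#include <iostream>

int main()
{
    Glib::ustring s("h\xC3\xA9llo");  // "héllo" encoded as UTF-8 bytes

    // each iteration step decodes one full UTF-8 sequence
    for (Glib::ustring::iterator it = s.begin(); it != s.end(); ++it)
    {
        gunichar cp = *it;  // a 32-bit Unicode code point
        std::cout << "U+" << std::hex << cp << '\n';
    }
    return 0;
}

Build it with the flags from pkg-config (package glibmm-2.4).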

Edit:

I think that C++0x will include UTF-8 support too, but I'm no specialist...

my2c

neuro
+3  A: 

You have to decode the UTF-8 byte sequence to its unencoded UTF-32 representation. If you want the actual Unicode codepoint, then a wchar_t is NOT guaranteed to be large enough to hold it (it is only 16 bits on Windows); use an unsigned int/long instead, i.e.:

// u_char and u_long are not standard types; on POSIX systems they come from
// <sys/types.h>. They are declared here so the snippet is self-contained.
typedef unsigned char u_char;
typedef unsigned long u_long;

#define IS_IN_RANGE(c, f, l)    (((c) >= (f)) && ((c) <= (l)))

u_long readNextChar (char*& p) 
{  
    // TODO: since UTF-8 is a variable-length
    // encoding, you should pass in the input
    // buffer's actual byte length so that you
    // can determine if a malformed UTF-8
    // sequence would exceed the end of the buffer...

    u_char c1, c2, *ptr = (u_char*) p;
    u_long uc = 0;
    int seqlen;
    // int datalen = ... available length of p ...;    

    /*
    if( datalen < 1 )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */

    c1 = ptr[0];

    if( (c1 & 0x80) == 0 )
    {
        uc = (u_long) (c1 & 0x7F);
        seqlen = 1;
    }
    else if( (c1 & 0xE0) == 0xC0 )
    {
        uc = (u_long) (c1 & 0x1F);
        seqlen = 2;
    }
    else if( (c1 & 0xF0) == 0xE0 )
    {
        uc = (u_long) (c1 & 0x0F);
        seqlen = 3;
    }
    else if( (c1 & 0xF8) == 0xF0 )
    {
        uc = (u_long) (c1 & 0x07);
        seqlen = 4;
    }
    else
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }

    /*
    if( seqlen > datalen )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */

    for(int i = 1; i < seqlen; ++i)
    {
        c1 = ptr[i];

        if( (c1 & 0xC0) != 0x80 )
        {
            // malformed data, do something !!!
            return (u_long) -1;
        }
    }

    switch( seqlen )
    {
        case 2:
        {
            c1 = ptr[0];

            if( !IS_IN_RANGE(c1, 0xC2, 0xDF) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }

            break;
        }

        case 3:
        {
            c1 = ptr[0];
            c2 = ptr[1];

            if( ((c1 == 0xE0) && !IS_IN_RANGE(c2, 0xA0, 0xBF)) ||
                ((c1 == 0xED) && !IS_IN_RANGE(c2, 0x80, 0x9F)) ||
                (!IS_IN_RANGE(c1, 0xE1, 0xEC) && !IS_IN_RANGE(c1, 0xEE, 0xEF)) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }

            break;
        }

        case 4:
        {
            c1 = ptr[0];
            c2 = ptr[1];

            if( ((c1 == 0xF0) && !IS_IN_RANGE(c2, 0x90, 0xBF)) ||
                ((c1 == 0xF4) && !IS_IN_RANGE(c2, 0x80, 0x8F)) ||
                !IS_IN_RANGE(c1, 0xF1, 0xF3) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }

            break;
        }
    }

    for(int i = 1; i < seqlen; ++i)
    {
        uc = ((uc << 6) | (u_long)(ptr[i] & 0x3F));
    }

    p += seqlen;
    return uc; 
}

Use a wchar_t only when dealing with UTF-16 code units instead.
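
For illustration, a minimal driver for readNextChar() might look like this (a sketch only; the sample bytes are demo data, and real code should pass the buffer length as the TODO above says):

#include <stdio.h>

// assumes the typedefs, IS_IN_RANGE and readNextChar() from above
int main()
{
    char buf[] = "A\xC3\xA9\xE2\x82\xAC";  // "A", "é", "€" encoded as UTF-8
    char* p = buf;

    while (*p != '\0')
    {
        u_long cp = readNextChar(p);  // advances p past the decoded sequence
        if (cp == (u_long) -1)
            break;  // malformed input
        printf("U+%04lX\n", cp);
    }
    return 0;
}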

Remy Lebeau - TeamB
Perfect, thanks!
Jen
sbi
I just compiled this code using g++ 3.3.4, and I'm pretty impressed: the compiler moved all the code from the large `switch` statement up to where the `seqlen` variable gets set. Restructuring the original code that way might make it more readable, too.
Roland Illig
@sbi: It's 16-bit in MSVC, and the Windows API expects 16-bit, pretty much enforcing 16-bit in Windows programming.
DeadMG
@DeadMG: I'm must be missing your psychic abilities to draw these conclusions from the question.
sbi
@sbi: I didn't, I drew them from your comment. I'm saying that wchar_t cannot be relied on to hold a Unicode character, since it may not be 32-bit, and in fact in a major compiler/operating system it is 16-bit. Hence, the OP cannot rely on a wchar_t to hold his Unicode character if he wishes to retain Windows compatibility.
DeadMG
@DeadMG: I can see now that my comment might be misleading. I was struggling (and obviously failed) to say that I have no idea how you know the OP needs this for Windows.
sbi
@sbi: Ah, right. I didn't; I merely pointed out that assuming that wchar_t is 32-bit isn't portable in a major way, as opposed to "isn't portable" like OpenMP isn't portable but virtually all major compilers support it, as far as I'm aware.
DeadMG
@DeadMG: So I misunderstood your answer which was based on your misunderstanding of my answer which I gave because I misunderstood your comment? Oops. I'm afraid I got lost there somewhere...
sbi
@sbi: Perhaps we should just agree that we don't know what the hell we're talking about and walk away.
DeadMG
@DeadMG: Great idea. `:)`
sbi
+1  A: 

Also, is wchar_t the proper type to store a single Unicode character?

On Linux, yes. On Windows, wchar_t represents a UTF-16 code unit, which isn't necessarily a character.

The upcoming C++0x standard will provide the char16_t and char32_t types designed to represent UTF-16 and UTF-32.

If you're on a system where char32_t is unavailable and wchar_t is inadequate, use uint32_t to store Unicode characters.
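
For instance, under the C++0x draft one could write the following (a sketch; compiler support for these literals was still incomplete at the time of writing):

#include <stdio.h>

int main()
{
    char32_t cp = U'\U0001D11E';      // MUSICAL SYMBOL G CLEF: one code point
    char16_t pair[] = u"\U0001D11E";  // the same character as a UTF-16 surrogate pair

    printf("U+%06X -> UTF-16 %04X %04X\n",
           (unsigned) cp, (unsigned) pair[0], (unsigned) pair[1]);
    return 0;
}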

dan04
+1  A: 

Here is a quick macro that computes the byte length of a UTF-8 sequence from its lead byte

#define UTF8_CHAR_LEN( byte ) ((( 0xE5000000u >> ((( byte ) >> 3 ) & 0x1e )) & 3 ) + 1)

This will help you determine the size of each UTF-8 character for easier parsing. Note that it only inspects the lead byte: it does no validation, returning 1 for stray continuation bytes (0x80-0xBF) and 4 for the invalid lead bytes 0xF8-0xFF.
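
For example (a small sketch assuming the macro above; the sample lead bytes are just demo values):

#include <stdio.h>

int main()
{
    /* lead bytes of 'A' (1 byte), 'é' (2), '€' (3), U+1D11E (4) */
    unsigned char leads[] = { 0x41, 0xC3, 0xE2, 0xF0 };

    for (int i = 0; i < 4; ++i)
        printf("lead 0x%02X -> %d byte(s)\n",
               (unsigned) leads[i], (int) UTF8_CHAR_LEN(leads[i]));

    return 0;
}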

gbrandt
A: 

This is my solution, in pure ANSI C, including a unit test for the corner cases.

Beware that int must be at least 32 bits wide. Otherwise you have to change the definition of codepoint.
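
If C99's <stdint.h> is available, a fixed-width type avoids that caveat; a possible alternative to the typedef below, not part of the original:

#include <stdint.h>

typedef uint32_t codepoint;  /* exactly 32 bits wherever uint32_t exists */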

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

typedef unsigned char byte;
typedef unsigned int codepoint;

/**
 * Reads the next UTF-8-encoded character from the byte array ranging
 * from {@code *pstart} up to, but not including, {@code end}. If the
 * conversion succeeds, the {@code *pstart} iterator is advanced,
 * the codepoint is stored into {@code *pcp}, and the function returns
 * 0. Otherwise the conversion fails, {@code errno} is set to
 * {@code EILSEQ} and the function returns -1.
 */
int
from_utf8(const byte **pstart, const byte *end, codepoint *pcp) {
        size_t len, i;
        codepoint cp, min;
        const byte *buf;

        buf = *pstart;
        if (buf == end)
                goto error;

        if (buf[0] < 0x80) {
                len = 1;
                min = 0;
                cp = buf[0];
        } else if (buf[0] < 0xC0) {
                goto error;
        } else if (buf[0] < 0xE0) {
                len = 2;
                min = 1 << 7;
                cp = buf[0] & 0x1F;
        } else if (buf[0] < 0xF0) {
                len = 3;
                min = 1 << (5 + 6);
                cp = buf[0] & 0x0F;
        } else if (buf[0] < 0xF8) {
                len = 4;
                min = 1 << (4 + 6 + 6);
                cp = buf[0] & 0x07;
        } else {
                goto error;
        }

        if (buf + len > end)
                goto error;

        for (i = 1; i < len; i++) {
                if ((buf[i] & 0xC0) != 0x80)
                        goto error;
                cp = (cp << 6) | (buf[i] & 0x3F);
        }

        if (cp < min)
                goto error;

        if (0xD800 <= cp && cp <= 0xDFFF)
                goto error;

        if (0x110000 <= cp)
                goto error;

        *pstart += len;
        *pcp = cp;
        return 0;

error:
        errno = EILSEQ;
        return -1;
}

static void
assert_valid(const byte **buf, const byte *end, codepoint expected) {
        codepoint cp;

        if (from_utf8(buf, end, &cp) == -1) {
                fprintf(stderr, "invalid unicode sequence for codepoint %u\n", expected);
                exit(EXIT_FAILURE);
        }

        if (cp != expected) {
                fprintf(stderr, "expected %u, got %u\n", expected, cp);
                exit(EXIT_FAILURE);
        }
}

static void
assert_invalid(const char *name, const byte **buf, const byte *end) {
        const byte *p;
        codepoint cp;

        p = *buf + 1;
        if (from_utf8(&p, end, &cp) == 0) {
                fprintf(stderr, "unicode sequence \"%s\" unexpectedly converts to %#x.\n", name, cp);
                exit(EXIT_FAILURE);
        }
        *buf += (*buf)[0] + 1;
}

static const byte valid[] = {
        0x00, /* first ASCII */
        0x7F, /* last ASCII */
        0xC2, 0x80, /* first two-byte */
        0xDF, 0xBF, /* last two-byte */
        0xE0, 0xA0, 0x80, /* first three-byte */
        0xED, 0x9F, 0xBF, /* last before surrogates */
        0xEE, 0x80, 0x80, /* first after surrogates */
        0xEF, 0xBF, 0xBF, /* last three-byte */
        0xF0, 0x90, 0x80, 0x80, /* first four-byte */
        0xF4, 0x8F, 0xBF, 0xBF /* last codepoint */
};

static const byte invalid[] = {
        1, 0x80,
        1, 0xC0,
        1, 0xC1,
        2, 0xC0, 0x80,
        2, 0xC2, 0x00,
        2, 0xC2, 0x7F,
        2, 0xC2, 0xC0,
        3, 0xE0, 0x80, 0x80,
        3, 0xE0, 0x9F, 0xBF,
        3, 0xED, 0xA0, 0x80,
        3, 0xED, 0xBF, 0xBF,
        4, 0xF0, 0x80, 0x80, 0x80,
        4, 0xF0, 0x8F, 0xBF, 0xBF,
        4, 0xF4, 0x90, 0x80, 0x80
};

int
main() {
        const byte *p, *end;

        p = valid;
        end = valid + sizeof valid;
        assert_valid(&p, end, 0x000000);
        assert_valid(&p, end, 0x00007F);
        assert_valid(&p, end, 0x000080);
        assert_valid(&p, end, 0x0007FF);
        assert_valid(&p, end, 0x000800);
        assert_valid(&p, end, 0x00D7FF);
        assert_valid(&p, end, 0x00E000);
        assert_valid(&p, end, 0x00FFFF);
        assert_valid(&p, end, 0x010000);
        assert_valid(&p, end, 0x10FFFF);

        p = invalid;
        end = invalid + sizeof invalid;
        assert_invalid("80", &p, end);
        assert_invalid("C0", &p, end);
        assert_invalid("C1", &p, end);
        assert_invalid("C0 80", &p, end);
        assert_invalid("C2 00", &p, end);
        assert_invalid("C2 7F", &p, end);
        assert_invalid("C2 C0", &p, end);
        assert_invalid("E0 80 80", &p, end);
        assert_invalid("E0 9F BF", &p, end);
        assert_invalid("ED A0 80", &p, end);
        assert_invalid("ED BF BF", &p, end);
        assert_invalid("F0 80 80 80", &p, end);
        assert_invalid("F0 8F BF BF", &p, end);
        assert_invalid("F4 90 80 80", &p, end);

        return 0;
}
Roland Illig
Epic fail. The poster is in C++, and you violated so many C++ idioms, I can't even begin to count.
DeadMG
So I will count for you. (1) I included the C headers instead of the C++ headers. (2) I used pointers instead of references. (3) I did not use namespaces, but instead declared my functions `static`. (4) I declared the loop variable with a function-wide scope. But on the other hand, I didn't invent weird type names (`u_long`, `u_char`) and use them inconsistently (`u_char` vs. `uchar`) without declaring them. I also managed to completely avoid any type cast (which the accepted answer uses a lot, and that's C-style, too).
Roland Illig
@Roland: Erm, by looking at the function's signature? @DeadMG: To be fair, this was announced to be a C solution.
sbi
@sbi: It was. But the guy uses C++. If he asked for a C++ solution and I wrote one in Lua, Ruby or Python, would you consider that a good answer? Edit @ Roland Illig: I didn't suggest that your answer was worse than the accepted. I don't think much of that answer either.
DeadMG
@DeadMG: At least I wouldn't complain that your Lua solution is bad C++ code. (I would, however, complain about the fact my C++ compiler can't compile it.)
sbi
@sbi: I could abuse the preprocessor and compile it that way.
DeadMG
@DeadMG: Yeah, you could, indeed. So?
sbi
@sbi: The point is that such an answer is neither useful nor acceptable if it's not written in the OP's language.
DeadMG
@DeadMG: Which is why a Lua answer is much less helpful than a C or preprocessor answer.
sbi