ansaurus

Question

Validate Unicode String and Escape if Unicode is Invalid (C/C++)

Answer 1

A:

If you have character encoding conversion functions available (such as MultiByteToWideChar or WideCharToMultiByte on Windows), you can try converting the string from UTF-8 to UTF-16/32 and back again. If the original string was valid UTF-8, you'll get back the same string. If not, you'll either get an error (useful if you just want a validity check) or have the invalid UTF-8 bytes replaced by something valid (which is what you ultimately want).

dan04 2010-06-04 06:02:41

In the case of MultiByteToWideChar(), you don't need to call WideCharToMultiByte() to validate the result. MultiByteToWideChar() has an MB_ERR_INVALID_CHARS flag that causes the function to fail, with GetLastError() returning ERROR_NO_UNICODE_TRANSLATION, if it detects an invalid character. As for converting invalid characters, it can only convert them to a default character (usually '?'), which is a lossy operation: you can't get the original string back from the result. If you need the conversion to be lossless, you would have to encode the original manually.
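
As a rough illustration of the flag described above, here is a minimal, Windows-only sketch of a pure validity check. The helper name IsValidUtf8Win32 is made up for this example; the API call and flag are the ones named in the comment. The #ifdef guard is only there so the file still builds on other platforms, and strings longer than INT_MAX bytes are not handled.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical helper: returns true if `s` converts cleanly from UTF-8.
// No output buffer is passed, so the call only measures and validates.
bool IsValidUtf8Win32(const std::string &s)
{
    if( s.empty() )
        return true; // a zero-length input makes the API fail, so handle it here

    int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                s.data(), (int) s.size(), NULL, 0);

    // 0 means failure. For invalid input, GetLastError() is expected to be
    // ERROR_NO_UNICODE_TRANSLATION; other error codes indicate other problems.
    return n != 0;
}
#endif
```

A caller that needs to tell "invalid UTF-8" apart from other failures would check GetLastError() immediately after a false result.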

Remy Lebeau - TeamB 2010-06-04 20:06:19

Answer 2

+1 A:

The following code is based on an IRI library I have been working on for a while. Section 3.2 ("Converting URIs to IRIs") of RFC 3987 deals with converting invalid UTF-8 octets to valid UTF-8.

#define IS_IN_RANGE(c, f, l)    (((c) >= (f)) && ((c) <= (l)))

int UTF8BufferToUTF32Buffer(char *Data, int DataLen, unsigned long *Buffer, int BufLen, int *Eaten)
{
    if( Eaten )
    {
        *Eaten = 0;
    }

    int Result = 0;

    unsigned char b, b2;
    unsigned char *ptr = (unsigned char*) Data;
    unsigned long uc;

    int i = 0;
    int seqlen;

    while( i < DataLen )
    {
        if( (Buffer) && (!BufLen) )
            break;

        b = ptr[i];

        if( (b & 0x80) == 0 )
        {
            uc = (unsigned long)(b & 0x7F);
            seqlen = 1;
        }
        else if( (b & 0xE0) == 0xC0 )
        {
            uc = (unsigned long)(b & 0x1F);
            seqlen = 2;
        }
        else if( (b & 0xF0) == 0xE0 )
        {
            uc = (unsigned long)(b & 0x0F);
            seqlen = 3;
        }
        else if( (b & 0xF8) == 0xF0 )
        {
            uc = (unsigned long)(b & 0x07);
            seqlen = 4;
        }
        else
        {
            // 0x80-0xBF (stray continuation byte) or 0xF8-0xFF (never a valid lead byte)
            return -1;
        }

        if( (i+seqlen) > DataLen )
        {
            return -1;
        }

        for(int j = 1; j < seqlen; ++j)
        {
            b = ptr[i+j];

            if( (b & 0xC0) != 0x80 )
            {
                return -1;
            }
        }

        switch( seqlen )
        {
            case 2:
            {
                b = ptr[i];

                if( !IS_IN_RANGE(b, 0xC2, 0xDF) )
                {
                    return -1;
                }

                break;
            }

            case 3:
            {
                b = ptr[i];
                b2 = ptr[i+1];

                // b is 0xE0-0xEF here. Only 0xE0 (overlong forms) and 0xED
                // (UTF-16 surrogates) restrict the second byte further.
                if( ((b == 0xE0) && !IS_IN_RANGE(b2, 0xA0, 0xBF)) ||
                    ((b == 0xED) && !IS_IN_RANGE(b2, 0x80, 0x9F)) )
                {
                    return -1;
                }

                break;
            }

            case 4:
            {
                b = ptr[i];
                b2 = ptr[i+1];

                // b is 0xF0-0xF7 here. 0xF0 must not be overlong, 0xF4 must
                // not exceed U+10FFFF, and 0xF5-0xF7 are never valid.
                if( ((b == 0xF0) && !IS_IN_RANGE(b2, 0x90, 0xBF)) ||
                    ((b == 0xF4) && !IS_IN_RANGE(b2, 0x80, 0x8F)) ||
                    (b > 0xF4) )
                {
                    return -1;
                }

                break;
            }
        }

        for(int j = 1; j < seqlen; ++j)
        {
            uc = ((uc << 6) | (unsigned long)(ptr[i+j] & 0x3F));
        }

        if( Buffer )
        {
            *Buffer++ = uc;
            --BufLen;
        }

        ++Result;
        i += seqlen;
    }

    if( Eaten )
    {
        *Eaten = i;
    }

    return Result;
}

{
    std::string filename = "...";

    unsigned long ch;
    int eaten;

    std::string::size_type i = 0;
    while( i < filename.length() )
    {
        if( UTF8BufferToUTF32Buffer(&filename[i], filename.length()-i, &ch, 1, &eaten) == 1 )
        {
            i += eaten;
        }
        else
        {
            // replace the character at filename[i] with your chosen
            // escaping, and then increment i by the number of
            // characters used...
        }
    }
}

In your case, all you have to do is decide what kind of escaping you want to use. URIs/IRIs use percent-encoding ("%NN", where "NN" is the 2-digit hex value of an octet).
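
If percent-encoding is the escaping chosen, one possible way to apply it is sketched below. This is not taken from the library code above: Utf8SeqLen and EscapeInvalidUtf8 are hypothetical names, the validator is a compact stand-in so the example compiles on its own, and the output goes to a second string rather than being edited in place. A literal '%' in the input is copied unchanged, so the result is only unambiguous if the input contains none.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length (1-4) of the well-formed UTF-8 sequence
// starting at s[i], or 0 if the bytes there are not well-formed.
static std::size_t Utf8SeqLen(const std::string &s, std::size_t i)
{
    const unsigned char b = (unsigned char) s[i];
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;     // allowed range for the 2nd byte

    if( b < 0x80 )
        return 1;
    else if( b >= 0xC2 && b <= 0xDF )
        len = 2;
    else if( b >= 0xE0 && b <= 0xEF )
    {
        len = 3;
        if( b == 0xE0 ) lo = 0xA0;          // reject overlong forms
        if( b == 0xED ) hi = 0x9F;          // reject UTF-16 surrogates
    }
    else if( b >= 0xF0 && b <= 0xF4 )
    {
        len = 4;
        if( b == 0xF0 ) lo = 0x90;          // reject overlong forms
        if( b == 0xF4 ) hi = 0x8F;          // reject values above U+10FFFF
    }
    else
        return 0;                           // 0x80-0xC1 and 0xF5-0xFF never start a sequence

    if( i + len > s.size() )
        return 0;                           // truncated sequence

    const unsigned char b2 = (unsigned char) s[i+1];
    if( b2 < lo || b2 > hi )
        return 0;

    for( std::size_t j = 2; j < len; ++j )
    {
        if( ((unsigned char) s[i+j] & 0xC0) != 0x80 )
            return 0;                       // not a continuation byte
    }

    return len;
}

// Copies `in` to a new string, replacing every byte that is not part of a
// well-formed UTF-8 sequence with "%NN" (two uppercase hex digits).
std::string EscapeInvalidUtf8(const std::string &in)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(in.size());

    std::size_t i = 0;
    while( i < in.size() )
    {
        const std::size_t len = Utf8SeqLen(in, i);
        if( len )
        {
            out.append(in, i, len);         // valid sequence: copy as-is
            i += len;
        }
        else
        {
            const unsigned char b = (unsigned char) in[i];
            out += '%';
            out += hex[b >> 4];
            out += hex[b & 0x0F];
            ++i;                            // escape one byte, then resynchronize
        }
    }

    return out;
}
```

Escaping one byte at a time means a broken multi-byte sequence becomes several "%NN" groups, which keeps the transformation reversible as long as '%' itself is handled separately.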

Remy Lebeau - TeamB 2010-06-04 20:27:16

Very cool. Thanks. I'll munch on this. Haven't decided about the encoding; %NN works, but it's non-standard. Your code gets a bit simpler if you process from one string to a second and rely on a C++ vector for the manipulations...

vy32 2010-06-05 02:17:41

vy32 2010-06-06 07:52:53

That is not strictly a bug, just an implementation detail. The UTF8BufferToUTF32Buffer() function itself is what I copied from my actual library code. The rest is just an example demonstrating its usage for you. I don't usually use std::string in my code, but I do use other string classes where such usage of the '[]' operator is safe.

Remy Lebeau - TeamB 2010-06-07 23:45:43
