I have a program that reads arbitrary data from a file system and outputs results in Unicode. The problem I am having is that sometimes filenames are valid UTF-8 and sometimes they aren't. So I want a function (in C or C++) that can validate a string and tell me whether it is a valid UTF-8 encoding. If it is not, I want the invalid bytes escaped so that the result is valid UTF-8. This is different from escaping for XML, which I also need to do. But first I need to be sure that the Unicode is right.

I've seen some code from which I could hack this, but I would rather use some working code if it exists.

A: 

If you have character encoding conversion functions (like MultiByteToWideChar or WideCharToMultiByte on Windows), you can try converting the string from UTF-8 to UTF-16/32 and back again. If your original string was UTF-8, then you'll get back the same string. If not, you'll either get an error (useful if you just want a validity check), or have invalid UTF-8 bytes replaced by something valid (which is what you ultimately want).
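Where no platform conversion functions are available, the same round-trip idea can be sketched with a hand-written converter. This is only an illustration under stated assumptions: `DecodeLossy`, `Encode` and `RoundTripsAsUtf8` are hypothetical names, not part of any library, and the decoder here chooses to replace each invalid byte with U+FFFD (a real platform converter may substitute differently).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 to UTF-32, substituting U+FFFD for each
// byte that does not start a well-formed sequence.
std::u32string DecodeLossy(const std::string &in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;      // stays 0 for an invalid lead byte
        char32_t cp = 0, min = 0; // min = smallest value this length may encode

        if (b < 0x80)                { len = 1; cp = b;        min = 0;       }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80;    }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800;   }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }

        bool ok = (len != 0) && (i + len <= in.size());
        for (std::size_t j = 1; ok && j < len; ++j)
        {
            unsigned char c = static_cast<unsigned char>(in[i + j]);
            ok = ((c & 0xC0) == 0x80);
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        ok = ok && cp >= min && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

        out.push_back(ok ? cp : char32_t(0xFFFD));
        i += ok ? len : 1;
    }
    return out;
}

// Hypothetical helper: encode UTF-32 code points back to UTF-8.
std::string Encode(const std::u32string &in)
{
    std::string out;
    for (char32_t cp : in)
    {
        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// The input was valid UTF-8 exactly when the round trip reproduces it.
bool RoundTripsAsUtf8(const std::string &s)
{
    return Encode(DecodeLossy(s)) == s;
}
```

With this sketch, `Encode(DecodeLossy(s))` is always valid UTF-8, and comparing it with `s` gives the validity check. Each invalid byte grows into the three-byte encoding of U+FFFD, so an invalid input can never compare equal to its round trip.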

dan04
In the case of MultiByteToWideChar(), you don't need to call WideCharToMultiByte() to validate the result. MultiByteToWideChar() has an MB_ERR_INVALID_CHARS flag that makes the function fail, with GetLastError() returning ERROR_NO_UNICODE_TRANSLATION, if it detects an invalid character. As for converting invalid characters, it can only replace them with a default character (usually '?'), which is a lossy operation: you can't get back the original string from the result. If you need the conversion to be lossless, you would have to escape the original bytes manually.
Remy Lebeau - TeamB
A: 

The following code is based on an IRI library I have been working on for a while. Section 3.2 ("Converting URIs to IRIs") of RFC 3987 deals with converting invalid UTF-8 octets to valid UTF-8.

#define IS_IN_RANGE(c, f, l)    (((c) >= (f)) && ((c) <= (l)))

int UTF8BufferToUTF32Buffer(char *Data, int DataLen, unsigned long *Buffer, int BufLen, int *Eaten)
{
    if( Eaten )
    {
        *Eaten = 0;
    }

    int Result = 0;

    unsigned char b, b2;
    unsigned char *ptr = (unsigned char*) Data;
    unsigned long uc;

    int i = 0;
    int seqlen;

    while( i < DataLen )
    {
        if( (Buffer) && (!BufLen) )
            break;

        b = ptr[i];

        if( (b & 0x80) == 0 )
        {
            uc = (unsigned long)(b & 0x7F);
            seqlen = 1;
        }
        else if( (b & 0xE0) == 0xC0 )
        {
            uc = (unsigned long)(b & 0x1F);
            seqlen = 2;
        }
        else if( (b & 0xF0) == 0xE0 )
        {
            uc = (unsigned long)(b & 0x0F);
            seqlen = 3;
        }
        else if( (b & 0xF8) == 0xF0 )
        {
            uc = (unsigned long)(b & 0x07);
            seqlen = 4;
        }
        else
        {
            uc = 0;
            return -1;
        }

        if( (i+seqlen) > DataLen )
        {
            return -1;
        }

        for(int j = 1; j < seqlen; ++j)
        {
            b = ptr[i+j];

            if( (b & 0xC0) != 0x80 )
            {
                return -1;
            }
        }

        switch( seqlen )
        {
            case 2:
            {
                b = ptr[i];

                if( !IS_IN_RANGE(b, 0xC2, 0xDF) )
                {
                    return -1;
                }

                break;
            }

            case 3:
            {
                b = ptr[i];
                b2 = ptr[i+1];

                // b is already known to be 0xE0..0xEF and b2 to be 0x80..0xBF;
                // only E0 (overlong forms) and ED (surrogates) narrow b2 further.
                if( ((b == 0xE0) && !IS_IN_RANGE(b2, 0xA0, 0xBF)) ||
                    ((b == 0xED) && !IS_IN_RANGE(b2, 0x80, 0x9F)) )
                {
                    return -1;
                }

                break;
            }

            case 4:
            {
                b = ptr[i];
                b2 = ptr[i+1];

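                // b is 0xF0..0xF7 here; lead bytes above 0xF4 would encode
                // values beyond U+10FFFF, and F0/F4 narrow the range of b2.
                if( ((b == 0xF0) && !IS_IN_RANGE(b2, 0x90, 0xBF)) ||
                    ((b == 0xF4) && !IS_IN_RANGE(b2, 0x80, 0x8F)) ||
                    (b > 0xF4) )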
                {
                    return -1;
                }

                break;
            }
        }

        for(int j = 1; j < seqlen; ++j)
        {
            uc = ((uc << 6) | (unsigned long)(ptr[i+j] & 0x3F));
        }

        if( Buffer )
        {
            *Buffer++ = uc;
            --BufLen;
        }

        ++Result;
        i += seqlen;
    }

    if( Eaten )
    {
        *Eaten = i;
    }

    return Result;
}

{
    std::string filename = "...";

    unsigned long ch;
    int eaten;

    std::string::size_type i = 0;
    while( i < filename.length() )
    {
        if( UTF8BufferToUTF32Buffer(&filename[i], filename.length()-i, &ch, 1, &eaten) == 1 )
        {
            i += eaten;
        }
        else
        {
            // replace the character at filename[i] with your chosen
            // escaping, and then increment i by the number of
            // characters used...
        }
    }
}

In your case, all you have to do is decide what kind of escaping you want to use. URIs/IRIs use percent-encoding ("%NN", where "NN" is the 2-digit hex value of an octet).
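If percent-encoding is the escaping chosen, the loop might look roughly like the following. This is a self-contained sketch rather than the library code above: `ValidSequenceLength` and `PercentEscapeInvalid` are hypothetical names, the validity check is a compact restatement of the well-formed byte ranges, and the output is built in a second string instead of editing the input in place.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: length of the well-formed UTF-8 sequence starting at
// s[i], or 0 if the bytes there are not well-formed. Requires i < s.size().
std::size_t ValidSequenceLength(const std::string &s, std::size_t i)
{
    auto at = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    unsigned char b = at(i);
    std::size_t len;
    unsigned char lo = 0x80, hi = 0xBF; // allowed range of the second byte

    if (b < 0x80)
        return 1;
    else if (b >= 0xC2 && b <= 0xDF)
        len = 2;
    else if (b >= 0xE0 && b <= 0xEF)
    {
        len = 3;
        if (b == 0xE0) lo = 0xA0; // excludes overlong forms
        if (b == 0xED) hi = 0x9F; // excludes surrogates
    }
    else if (b >= 0xF0 && b <= 0xF4)
    {
        len = 4;
        if (b == 0xF0) lo = 0x90; // excludes overlong forms
        if (b == 0xF4) hi = 0x8F; // excludes values above U+10FFFF
    }
    else
        return 0; // stray continuation byte, C0/C1, or F5..FF

    if (i + len > s.size())
        return 0; // truncated sequence
    if (at(i + 1) < lo || at(i + 1) > hi)
        return 0;
    for (std::size_t j = 2; j < len; ++j)
        if ((at(i + j) & 0xC0) != 0x80)
            return 0;
    return len;
}

// Copies well-formed sequences unchanged and writes every other byte as %NN.
std::string PercentEscapeInvalid(const std::string &in)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        std::size_t len = ValidSequenceLength(in, i);
        if (len != 0)
        {
            out.append(in, i, len);
            i += len;
        }
        else
        {
            unsigned char b = static_cast<unsigned char>(in[i]);
            out += '%';
            out += hex[b >> 4];
            out += hex[b & 0x0F];
            ++i; // always advance, so the loop cannot stall on a bad byte
        }
    }
    return out;
}
```

For example, `PercentEscapeInvalid("a\xFFz")` yields `a%FFz`. The sketch leaves a literal '%' in the input untouched, so the result is valid UTF-8 but not reversible; if the original bytes must be recoverable, '%' itself would need escaping too.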

Remy Lebeau - TeamB
Very cool. Thanks. I'll munch on this. Haven't decided about the encoding; %NN works, but it's non-standard. Your code gets a bit simpler if you process from one string into a second and rely on a C++ vector for the manipulations...
vy32
That is not strictly a bug, just an implementation detail. The UTF8BufferToUTF32Buffer() function itself is what I copied from my actual library code. The rest is just an example demonstrating its usage for you. I don't usually use std::string in my code, but I do use other string classes where such usage of the '[]' operator is safe.
Remy Lebeau - TeamB