I am looking for a code snippet in plain old C that detects whether a given string is encoded in UTF-8. I know the regex solution, but for various reasons it would be better to avoid anything but plain C in this particular case.

The regex solution looks like this (warning: various checks omitted):

#define UTF8_DETECT_REGEXP  "^([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$"

pcre       *utf8_re;   /* compiled pattern */
pcre_extra *utf8_pe;   /* study data, may be NULL */
const char *error;
int         error_off;
int         rc;
int         vect[100];

utf8_re = pcre_compile(UTF8_DETECT_REGEXP, PCRE_CASELESS, &error, &error_off, NULL);
utf8_pe = pcre_study(utf8_re, 0, &error);

rc = pcre_exec(utf8_re, utf8_pe, str, len, 0, 0, vect, sizeof(vect)/sizeof(vect[0]));

if (rc > 0) {
    printf("string is in UTF8\n");
} else {
    printf("string is not in UTF8\n");
}
+3  A: 

You'd have to parse the string as UTF-8; see http://www.rfc-editor.org/rfc/rfc3629.txt. It's very simple: if the parsing fails, it's not UTF-8. There are several simple UTF-8 libraries around that can do this.

It could perhaps be simplified if you know the string is either plain old ASCII or contains characters outside ASCII that are UTF-8 encoded. In that case you often don't need to care about the difference: UTF-8 was designed so that existing programs that could handle ASCII could, in most cases, transparently handle UTF-8.

Keep in mind that ASCII is encoded in UTF-8 as itself, so ASCII is valid UTF-8.
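
As a minimal sketch of that point: if no byte has its high bit set, the buffer is pure ASCII and therefore already valid UTF-8, so no further parsing is needed. The function name is_ascii and its (buffer, length) signature are assumptions made for illustration, not something from the question.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte in buf[0..len) is 7-bit
   ASCII (high bit clear), 0 otherwise. An empty buffer counts as ASCII. */
int is_ascii(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] & 0x80)      /* high bit set: not ASCII */
            return 0;
    }
    return 1;
}
```

A result of 0 only means the buffer needs a real UTF-8 parse; it says nothing yet about which encoding it is.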

A C string can be anything. Is the problem you need to solve that you don't know whether the content is ASCII, GB 2312, CP437, UTF-16, or any of the other dozen character encodings that make a programmer's life hard?

nos
+3  A: 

You can use the UTF-8 detector integrated into Firefox. It is found in the universal charset detector, and it's pretty much a standalone C++ library. It should be extremely easy to find the class that recognizes UTF-8 and take only that.
What this class basically does is detect character sequences that are unique to UTF-8.

  • get the latest firefox trunk
  • go to \mozilla\extensions\universalchardet\
  • find the UTF-8 detector class (I don't quite remember what its exact name is)
shoosh
+5  A: 

You cannot detect whether a given string (or byte sequence) is UTF-8 encoded text, because, for example, every series of UTF-8 octets is also a valid (if nonsensical) series of Latin-1 (or some other encoding) octets. However, not every series of valid Latin-1 octets is a valid UTF-8 series. So you can rule out strings that do not conform to the UTF-8 encoding scheme:

U+0000-U+007F    0xxxxxxx
U+0080-U+07FF    110yyyxx    10xxxxxx
U+0800-U+FFFF    1110yyyy    10yyyyxx    10xxxxxx
U+10000-U+10FFFF 11110zzz    10zzyyyy    10yyyyxx    10xxxxxx
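
The table can be turned into a rough structural check, sketched below under some assumptions: the names utf8_seq_len and utf8_shape_ok are made up for illustration, and the buffer is passed with an explicit length. The sketch tests only the bit patterns shown above. It does not reject overlong forms, surrogates, or code points above U+10FFFF, so it is looser than a full RFC 3629 validator.

```c
#include <stddef.h>

/* Hypothetical helper: sequence length implied by the lead byte's bit
   pattern, or 0 if the byte cannot start a sequence (10xxxxxx, 11111xxx). */
int utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Hypothetical helper: returns 1 if buf[0..len) consists only of sequences
   shaped like the table rows, 0 otherwise. Structural check only. */
int utf8_shape_ok(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        int n = utf8_seq_len(buf[i]);
        int k;
        if (n == 0 || (size_t)n > len - i)  /* bad lead byte or truncated */
            return 0;
        for (k = 1; k < n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) /* must be 10xxxxxx */
                return 0;
        }
        i += (size_t)n;
    }
    return 1;
}
```

A Latin-1 string such as "été" (bytes E9 74 E9) fails this check because 0xE9 announces a 3-byte sequence but is followed by a non-continuation byte, which is exactly the kind of ruling-out described above.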
Stefan Gehrig
+1 It's all about guessing. "This is probably utf8"
Magnus Skog
+6  A: 

Here's a (hopefully bug-free) implementation of this expression in plain C:

_Bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    /* The && chains below short-circuit: bytes[n+1] is only read after
       bytes[n] matched a range that excludes 0, so a NUL-terminated
       string is never read past its terminator. */
    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII: tab, LF, CR and printable 0x20-0x7E only
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        // no pattern matched: not valid UTF-8 (or a disallowed control byte)
        return 0;
    }

    return 1;
}
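
For buffers that are not NUL-terminated (the regex version in the question takes str and len), a length-bounded variant might look like the sketch below. The name is_utf8_n and its signature are assumptions for illustration. One deliberate difference, also an assumption about what is wanted: it accepts every byte from 0x00 to 0x7F as RFC 3629 does, whereas the function above only allows tab, LF, CR and printable ASCII.

```c
#include <stddef.h>

/* Hypothetical length-bounded check: returns 1 if s[0..len) is well-formed
   UTF-8 (no overlongs, no surrogates, nothing above U+10FFFF), else 0. */
int is_utf8_n(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd byte */
        size_t n, k;                        /* n = continuation bytes */

        if (b <= 0x7F) { i++; continue; }   /* ASCII, including controls */
        else if (b >= 0xC2 && b <= 0xDF) { n = 1; }
        else if (b == 0xE0) { n = 2; lo = 0xA0; }   /* no overlongs */
        else if (b == 0xED) { n = 2; hi = 0x9F; }   /* no surrogates */
        else if (b >= 0xE1 && b <= 0xEF) { n = 2; }
        else if (b == 0xF0) { n = 3; lo = 0x90; }   /* no overlongs */
        else if (b >= 0xF1 && b <= 0xF3) { n = 3; }
        else if (b == 0xF4) { n = 3; hi = 0x8F; }   /* max U+10FFFF */
        else return 0;                      /* 0x80-0xC1, 0xF5-0xFF */

        if (n > len - i - 1)                /* sequence is truncated */
            return 0;
        if (s[i + 1] < lo || s[i + 1] > hi)
            return 0;
        for (k = 2; k <= n; k++) {
            if (s[i + k] < 0x80 || s[i + k] > 0xBF)
                return 0;
        }
        i += n + 1;
    }
    return 1;
}
```

Because every read is guarded by the length test first, the sketch does not rely on a terminator at all, which also makes it usable on buffers that contain embedded NUL bytes.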
Christoph
Very nice. I was just hacking my nested ifs, but you were faster. I have not tested your solution but it looks good to me.
Ludwig Weinzierl
Since you're reading byte +1, +2, +3 and only check that byte != 0, this code can read past the end of the string. Even if it's zero terminated.
Lucas
Christoph