Hello!

I have a database with lots of strings. Some of them are correctly UTF-8 encoded, some are not. I've therefore set up a script that selects 100 strings from the database. The following function decides whether a string contains UTF-8 sequences or not (regardless of whether the text they encode is correct):

function detectUTF8($text) {
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs',
    $text);
}

The script outputs each string that contains UTF-8 and, after a line break, its utf8_decode() result. Since some strings are double-encoded, I decode all of them, as you can see there.

The result is a list in which some entries have 2 strings each: one is correct, the other is wrong. You can see it here. But how do I determine which one is correct?

I hope you can help me. Thanks in advance!

+1  A: 

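You could run utf8_decode on the string and check with your detectUTF8 function whether the result is still valid UTF-8. If it is, the string was double-encoded and the decoded version is the correct one; if not, keep the original.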
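A minimal sketch of that check, assuming the detectUTF8 function from the question. The helper name pickCorrect is made up for illustration. Note that utf8_decode is deprecated as of PHP 8.2, and that this heuristic can misfire on correctly encoded text that happens to look like mojibake (for example a literal "Ã©").

```php
<?php
// detectUTF8() from the question, repeated so the sketch runs on its own.
function detectUTF8($text) {
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs',
    $text);
}

// Hypothetical helper (the name is an assumption, not from the question).
function pickCorrect($text) {
    // utf8_decode() is deprecated as of PHP 8.2.
    $decoded = utf8_decode($text);
    // If the decoded result still contains UTF-8 sequences, the input was
    // most likely double-encoded, so the decoded string is the better one.
    if (detectUTF8($decoded)) {
        return $decoded;
    }
    // Otherwise decoding turned the text into Latin-1; keep the original.
    return $text;
}

// Double-encoded "é" is decoded once; correctly encoded "é" is left alone.
var_dump(pickCorrect("\xC3\x83\xC2\xA9") === "\xC3\xA9");
var_dump(pickCorrect("\xC3\xA9") === "\xC3\xA9");
```

Strings that were encoded more than twice would need the check applied repeatedly; this sketch only peels off one layer.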

Gumbo
Thank you, so simple, but I didn't think of it! :D It seems to work, doesn't it? http://bit.ly/wZPZm

+2  A: 

mb_detect_encoding($text, "UTF-8");

You may have to build PHP with --enable-mbstring or install the php-mbstring package with yum/apt, but PHP can help you detect a multibyte string's encoding.
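A short sketch of how this might be used as a validity check, assuming the mbstring extension is loaded. The helper name isValidUtf8 is made up for illustration. Without the strict flag (third argument), mb_detect_encoding returns the closest match from the candidate list, so with a single candidate it may still report UTF-8 for malformed input; mb_check_encoding avoids that ambiguity.

```php
<?php
// Hypothetical helper: true only if the whole string is well-formed UTF-8.
// Unlike the question's detectUTF8(), plain ASCII also counts as valid here.
function isValidUtf8(string $text): bool {
    return mb_check_encoding($text, 'UTF-8');
}

$samples = [
    "abc",        // plain ASCII
    "\xC3\xA9",   // "é" encoded as UTF-8
    "\xE9",       // "é" as a single Latin-1 byte, not valid UTF-8
];

foreach ($samples as $sample) {
    // Strict detection returns the encoding name, or false if nothing matches.
    $detected = mb_detect_encoding($sample, 'UTF-8', true);
    echo bin2hex($sample), ': ',
        isValidUtf8($sample) ? 'valid' : 'invalid', ', strict detect = ',
        var_export($detected, true), "\n";
}
```

This only says whether the bytes are well-formed UTF-8; it cannot tell a double-encoded string from a correct one, since both are valid UTF-8.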

simplemotives