Hello,

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO-8859-1, or perhaps WINDOWS-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this? For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    } else {
        return $string;
    }
}

Thanks very much, Brian

A: 

Not sure if this would achieve the same thing, but couldn't you just use utf8_encode() on all text without worrying about detection? If the text is already UTF-8, it won't hurt it. And if it's not, it will be converted. If you've already thought about doing this, is there a reason this wouldn't work for you?

Marc W
utf8_encode is not idempotent for byte sequences that are already UTF-8. It converts them to UTF-8 as if they were ISO-8859-1, so already-encoded text gets double-encoded: you'll get e.g. 'é' instead of 'é'.
bobince
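A minimal demonstration of the pitfall bobince describes, with the bytes written out explicitly so the example doesn't depend on the file's own encoding:

```php
<?php
// 'é' encoded as UTF-8 is the two bytes 0xC3 0xA9.
$already_utf8 = "\xC3\xA9";

// utf8_encode() assumes ISO-8859-1 input, so it re-encodes each byte
// individually: 0xC3 ('Ã') and 0xA9 ('©') each become two UTF-8 bytes.
$double_encoded = utf8_encode($already_utf8);

var_dump($double_encoded === "\xC3\x83\xC2\xA9"); // bool(true): "Ã©", not "é"
```

(As of PHP 8.2, utf8_encode() is deprecated; mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1') is the direct replacement.)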
+3  A: 

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings in which any byte is valid, so it is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, since the only bytes that decode differently are in the range 0x80-0x9F. In Windows-1252 these decode to various characters like smart quotes and the euro sign, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
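To illustrate the 0x80-0x9F difference, a small sketch using mb_convert_encoding (which supports both encodings): the same raw byte decodes to the euro sign under Windows-1252 but to an invisible C1 control character under ISO-8859-1.

```php
<?php
$byte = "\x80"; // one raw byte in the contested 0x80-0x9F range

// Under Windows-1252, 0x80 is the euro sign (U+20AC).
echo bin2hex(mb_convert_encoding($byte, 'UTF-8', 'Windows-1252')), "\n"; // e282ac

// Under ISO-8859-1, 0x80 is the control character U+0080.
echo bin2hex(mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1')), "\n";  // c280
```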

would this code ensure that a string is safe to insert into a UTF-8 encoded document

You would certainly want to set the optional 'strict' parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences: the function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding guessed UTF-8 incorrectly, though I don't know whether that can still happen in strict mode.
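For illustration, here is the question's function with strict mode switched on; the array form of the encoding list and the CP1252 fallback are my additions, so treat it as a sketch rather than a guarantee:

```php
<?php
function make_safe_for_utf8_use($string) {
    // The third argument TRUE enables strict detection, which rejects
    // byte sequences that are not actually valid in a candidate encoding.
    $encoding = mb_detect_encoding($string, ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252'], true);

    if ($encoding === 'UTF-8') {
        return $string;
    }
    // mb_detect_encoding() returns FALSE when nothing matched; fall back
    // to CP1252 in that case rather than failing outright.
    return iconv($encoding === false ? 'CP1252' : $encoding, 'UTF-8//TRANSLIT', $string);
}
```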

If you want to be sure, do it yourself using the W3-recommended regex:

if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string)) {
    return $string;
} else {
    return iconv('CP1252', 'UTF-8', $string);
}
bobince
Thanks very much. I know developers always comment on the slowness of regexes - how careful should I be using this in big loops with lots of text? For example, a loop that iterates 200 times and cleanses text of 10,000 characters on each iteration.
Brian
Whilst I'm not a fan of regex, in this case it shouldn't be that bad. Regex gets slow when you have successive or nested `?`/`*`/`+` sequences that can cause it to have to backtrack looking for different ways to match. That won't happen in this case.
bobince
Excellent. So when using iconv as you describe above, if I specify CP1252 as the input charset, and the string is something other than CP1252 or ISO-8859-1, it will return a UTF-8 safe string, although some characters may be lost. Is that correct?
Brian
It will return a UTF-8-safe string, yes. Non-ASCII characters will come out as the wrong characters, but not dangerous ones.
bobince
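A sketch of that guarantee. Note that ISO-8859-1 as the iconv source charset maps every possible byte, whereas a handful of bytes (such as 0x81 and 0x8D) are undefined in CP1252 and may make iconv fail on some platforms:

```php
<?php
// Arbitrary non-UTF-8 input: 'caf' followed by the ISO-8859-1 byte for 'é'.
$garbage = "caf\xE9";

$safe = iconv('ISO-8859-1', 'UTF-8', $garbage);

// Whatever the input bytes were, the output is structurally valid UTF-8.
var_dump($safe === "caf\xC3\xA9"); // bool(true)
```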
+1  A: 

Have a look at http://www.phpwact.org/php/i18n/charsets for a guide to character sets. That page links to a page specifically about UTF-8.

Martijn
A: 

There is a bug when strlen($string) is greater than about 5000: on long strings PCRE can exceed its backtracking limits and preg_match() returns false. Consider this as a fix:

define('_is_utf8_split', 5000);

function is_utf8($string) { // v1.01
    $length = strlen($string);
    if ($length > _is_utf8_split) {
        // Based on: http://mobile-website.mobi/php-utf8-vs-iso-8859-1-59
        // Validate in chunks to stay under PCRE's limits. Every chunk must
        // be valid, and a chunk boundary must not fall inside a multi-byte
        // sequence, so back up past any continuation bytes (0x80-0xBF).
        for ($start = 0; $start < $length; $start += $chunk) {
            $chunk = min(_is_utf8_split, $length - $start);
            while ($start + $chunk < $length
                    && (ord($string[$start + $chunk]) & 0xC0) === 0x80) {
                $chunk--;
                if ($chunk <= 0) return false; // a long run of continuation bytes is invalid anyway
            }
            if (!is_utf8(substr($string, $start, $chunk))) {
                return false;
            }
        }
        return true;
    }
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return (bool) preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string);
}
velcrow
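As an aside, where the mbstring extension is available, mb_check_encoding gives the same validity test without any regex or chunking, so neither the backtracking limit nor the split boundary is a concern:

```php
<?php
// mb_check_encoding() validates the byte sequence directly.
var_dump(mb_check_encoding("\xC3\xA9", 'UTF-8')); // bool(true)  - valid UTF-8
var_dump(mb_check_encoding("\xE9", 'UTF-8'));     // bool(false) - a bare ISO-8859-1 byte

// Long strings are fine too; there is no PCRE limit involved.
var_dump(mb_check_encoding(str_repeat('a', 100000), 'UTF-8')); // bool(true)
```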