Hello,

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO-8859-1, or perhaps WINDOWS-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this? For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    } else {
        return $string;
    }
}

Thanks very much, Brian

A: 

Not sure if this would achieve the same thing, but couldn't you just use utf8_encode() on all text without worrying about detection? If the text is already UTF-8, it won't hurt it. And if it's not, it will be converted. If you've already thought about doing this, is there a reason this wouldn't work for you?

Marc W
utf8_encode is not idempotent for byte sequences that are already UTF-8. It converts them to UTF-8 as if they were ISO-8859-1, so already-encoded text gets double-encoded: you'll get e.g. 'é' instead of 'é'.
bobince
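A minimal demonstration of the pitfall bobince describes, with the bytes written out explicitly so the example doesn't depend on the file's own encoding:

```php
<?php
// 'é' encoded as UTF-8 is the two bytes 0xC3 0xA9.
$already_utf8 = "\xC3\xA9";

// utf8_encode() assumes ISO-8859-1 input, so it re-encodes each byte
// individually: 0xC3 ('Ã') and 0xA9 ('©') each become two UTF-8 bytes.
$double_encoded = utf8_encode($already_utf8);

var_dump($double_encoded === "\xC3\x83\xC2\xA9"); // bool(true): "Ã©", not "é"
```

(As of PHP 8.2, utf8_encode() is deprecated; mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1') is the direct replacement.)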
+3  A: 

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings in which any byte is valid, so it is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, since the only bytes that decode differently are in the range 0x80-0x9F. In Windows-1252 these decode to various characters like smart quotes and the euro sign, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
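To illustrate the 0x80-0x9F difference, a small sketch using mb_convert_encoding (which supports both encodings): the same raw byte decodes to the euro sign under Windows-1252 but to an invisible C1 control character under ISO-8859-1.

```php
<?php
$byte = "\x80"; // one raw byte in the contested 0x80-0x9F range

// Under Windows-1252, 0x80 is the euro sign (U+20AC).
echo bin2hex(mb_convert_encoding($byte, 'UTF-8', 'Windows-1252')), "\n"; // e282ac

// Under ISO-8859-1, 0x80 is the control character U+0080.
echo bin2hex(mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1')), "\n";  // c280
```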

would this code ensure that a string is safe to insert into a UTF-8 encoded document

You would certainly want to set the optional 'strict' parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences: the function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding guessed UTF-8 incorrectly, though I don't know whether that can still happen in strict mode.
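For illustration, here is the question's function with strict mode switched on; the array form of the encoding list and the CP1252 fallback are my additions, so treat it as a sketch rather than a guarantee:

```php
<?php
function make_safe_for_utf8_use($string) {
    // The third argument TRUE enables strict detection, which rejects
    // byte sequences that are not actually valid in a candidate encoding.
    $encoding = mb_detect_encoding($string, ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252'], true);

    if ($encoding === 'UTF-8') {
        return $string;
    }
    // mb_detect_encoding() returns FALSE when nothing matched; fall back
    // to CP1252 in that case rather than failing outright.
    return iconv($encoding === false ? 'CP1252' : $encoding, 'UTF-8//TRANSLIT', $string);
}
```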

If you want to be sure, do it yourself using the W3-recommended regex:

if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string)) {
    return $string;
} else {
    return iconv('CP1252', 'UTF-8', $string);
}
bobince
Thanks very much. I know developers always comment on the slowness of regexes - how careful should I be using this in big loops with lots of text? For example, a loop that iterates 200 times and cleanses text of 10,000 characters on each iteration.
Brian
Whilst I'm not a fan of regex, in this case it shouldn't be that bad. Regex gets slow when you have successive or nested `?`/`*`/`+` sequences that can cause it to have to backtrack looking for different ways to match. That won't happen in this case.
bobince
Excellent. So when using iconv as you describe above, if I specify CP1252 as the input charset, and the string is something other than CP1252 or ISO-8859-1, it will return a UTF-8 safe string, although some characters may be lost. Is that correct?
Brian
It will return a UTF-8-safe string, yes. Non-ASCII characters will come out as the wrong characters, but not dangerous ones.
bobince
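A sketch of that guarantee. Note that ISO-8859-1 as the iconv source charset maps every possible byte, whereas a handful of bytes (such as 0x81 and 0x8D) are undefined in CP1252 and may make iconv fail on some platforms:

```php
<?php
// Arbitrary non-UTF-8 input: 'caf' followed by the ISO-8859-1 byte for 'é'.
$garbage = "caf\xE9";

$safe = iconv('ISO-8859-1', 'UTF-8', $garbage);

// Whatever the input bytes were, the output is structurally valid UTF-8.
var_dump($safe === "caf\xC3\xA9"); // bool(true)
```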
+1  A: 

Have a look at http://www.phpwact.org/php/i18n/charsets for a guide to character sets. That page links to a page specifically about UTF-8.

Martijn
A: 

There is a bug when strlen($string) is greater than about 5000: on long strings PCRE can exceed its backtracking limits and preg_match() returns false. Consider this as a fix:

define('_is_utf8_split', 5000);

function is_utf8($string) { // v1.01
    $length = strlen($string);
    if ($length > _is_utf8_split) {
        // Based on: http://mobile-website.mobi/php-utf8-vs-iso-8859-1-59
        // Validate in chunks to stay under PCRE's limits. Every chunk must
        // be valid, and a chunk boundary must not fall inside a multi-byte
        // sequence, so back up past any continuation bytes (0x80-0xBF).
        for ($start = 0; $start < $length; $start += $chunk) {
            $chunk = min(_is_utf8_split, $length - $start);
            while ($start + $chunk < $length
                    && (ord($string[$start + $chunk]) & 0xC0) === 0x80) {
                $chunk--;
                if ($chunk <= 0) return false; // a long run of continuation bytes is invalid anyway
            }
            if (!is_utf8(substr($string, $start, $chunk))) {
                return false;
            }
        }
        return true;
    }
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return (bool) preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string);
}
velcrow
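As an aside, where the mbstring extension is available, mb_check_encoding gives the same validity test without any regex or chunking, so neither the backtracking limit nor the split boundary is a concern:

```php
<?php
// mb_check_encoding() validates the byte sequence directly.
var_dump(mb_check_encoding("\xC3\xA9", 'UTF-8')); // bool(true)  - valid UTF-8
var_dump(mb_check_encoding("\xE9", 'UTF-8'));     // bool(false) - a bare ISO-8859-1 byte

// Long strings are fine too; there is no PCRE limit involved.
var_dump(mb_check_encoding(str_repeat('a', 100000), 'UTF-8')); // bool(true)
```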