I would like to make sure that everything I know about UTF-8 is correct. I have been trying to use UTF-8 for a while now but I keep stumbling across more and more bugs and other weird things that make it seem almost impossible to have a 100% UTF-8 site. There is always a gotcha somewhere that I seem to miss. Perhaps someone here can correct my list or OK it so I don't miss anything important.

Database

Every site has to store their data somewhere. No matter what your PHP settings are you must also configure the DB. If you can't access the config files then make sure to "SET NAMES 'utf8'" as soon as you connect. Also, make sure to use utf8_unicode_ci on all of your tables. This assumes MySQL for a database, you will have to change for others.

Regex

I do a LOT of regex that is more complex than your average search-replace. I have to remember to use the "/u" modifier so that PCRE doesn't corrupt my strings. Yet, even then there are still problems apparently.
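
As an illustration of what the "/u" modifier changes, here is a minimal sketch. The variable names are made up for the example, and the bytes are written as escapes so the result does not depend on how the file is saved.

```php
<?php
// "é" is one character but two bytes (0xC3 0xA9) in UTF-8.
$e_acute = "\xC3\xA9";

// Without /u the dot matches a single byte, so ^.$ cannot match two bytes.
$without_u = preg_match('/^.$/', $e_acute);
var_dump($without_u);   // int(0)

// With /u the dot matches a whole code point.
$with_u = preg_match('/^.$/u', $e_acute);
var_dump($with_u);      // int(1)

// A subject that is not valid UTF-8 (here ISO-8859-1 "ä") makes the call
// fail outright instead of silently mangling the string.
$result = preg_match('/./u', "\xE4");
$error  = preg_last_error();
var_dump($result);                          // bool(false)
var_dump($error === PREG_BAD_UTF8_ERROR);   // bool(true)
```

So with "/u" a badly encoded subject shows up as a FALSE return value, which is worth checking for explicitly.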

String Functions

All of the default string functions (strlen(), strpos(), etc.) should be replaced with Multibyte String Functions that look at the character instead of the byte.
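
A small sketch of the difference, assuming the mbstring extension is loaded. The sample word is arbitrary and is spelled with byte escapes so the file's own encoding does not matter.

```php
<?php
// "héllo": the é is the two bytes 0xC3 0xA9 in UTF-8.
$word = "h\xC3\xA9llo";

var_dump(strlen($word));                       // int(6), counts bytes
var_dump(mb_strlen($word, 'UTF-8'));           // int(5), counts characters

var_dump(strpos($word, 'l'));                  // int(3), byte offset
var_dump(mb_strpos($word, 'l', 0, 'UTF-8'));   // int(2), character offset

// substr() can cut a character in half; mb_substr() cannot.
$byte_cut = substr($word, 0, 2);               // "h" plus a dangling 0xC3
$char_cut = mb_substr($word, 0, 2, 'UTF-8');   // "hé"
var_dump(bin2hex($byte_cut));                  // string(4) "68c3"
var_dump(bin2hex($char_cut));                  // string(6) "68c3a9"
```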

Headers

You should make sure that your server is returning the correct header for the browser to know what charset you are trying to use (just like you must tell MySQL).

header('Content-Type: text/html; charset=utf-8');

It is also a good idea to put the correct <meta> tag in the page head, though the actual header will override this should they differ.

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

Questions

Do I need to convert everything that I receive from the user agent (HTML forms & URI) to UTF-8 when the page loads, or can I just leave the strings/values as they are and still run them through these functions without a problem?

If I do need to convert everything to UTF-8 - then what steps should I take? mb_detect_encoding seems to be built for this but I keep seeing people complain that it doesn't always work. mb_check_encoding also seems to have a problem telling a good UTF-8 string from a malformed one.

Does PHP store strings in memory differently depending on what encoding it is using (like file types) or is it still stored like a regular string with some of the chars being interpreted differently (like &amp; vs & in HTML)? chazomaticus answers this question:

In PHP (up to PHP5, anyway), strings are just sequences of bytes. There is no implied or explicit character set associated with them; that's something the programmer must keep track of.

If I give a non-UTF-8 string to a mb_* function will it ever cause a problem?

If a UTF string is improperly encoded will something go wrong (like a parsing error in regex?) or will it just mark an entity as bad (html)? Is there ever a chance that improperly encoded strings will result in a function returning FALSE because the string is bad?

I have heard that you should mark your forms as UTF-8 also (accept-charset="UTF-8") but I am not sure what the benefit is..?

Was UTF-16 written to address a limit in UTF-8? Like did UTF-8 run out of space for characters? (Y2(UTF)k?)

Functions

Here are a couple of the custom PHP functions I have found but I haven't any way to verify that they actually work. Perhaps someone has an example which I can use. First is convertToUTF8() and then seems_utf8 from wordpress.

function seems_utf8($str) {
    $length = strlen($str);
    for ($i=0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0; # 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

function is_utf8($str) {
    $c=0; $b=0;
    $bits=0;
    $len=strlen($str);
    for($i=0; $i<$len; $i++){
        $c=ord($str[$i]);
        if($c >= 128){ // bytes 0x80 and above are never plain ASCII
            if(($c >= 254)) return false;
            elseif($c >= 252) $bits=6;
            elseif($c >= 248) $bits=5;
            elseif($c >= 240) $bits=4;
            elseif($c >= 224) $bits=3;
            elseif($c >= 192) $bits=2;
            else return false;
            if(($i+$bits) > $len) return false;
            while($bits > 1){
                $i++;
                $b=ord($str[$i]);
                if($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
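
One way to check validators like these is to run them against a fixed set of byte strings whose validity is known. The sketch below is only an assumption about how such a check could look: the function names and the list of vectors are hypothetical, and the reference validator relies on PCRE rejecting malformed subjects when "/u" is set.

```php
<?php
// Hypothetical test vectors: label => [raw bytes, is it valid UTF-8?].
function utf8_test_vectors(): array
{
    return [
        'ascii'             => ["hello", true],
        'two-byte e-acute'  => ["\xC3\xA9", true],
        'three-byte euro'   => ["\xE2\x82\xAC", true],
        'four-byte U+1F600' => ["\xF0\x9F\x98\x80", true],
        'lone continuation' => ["\x80", false],
        'truncated 3-byte'  => ["\xE2\x82", false],
        'iso-8859-1 text'   => ["\xE4\xF6\xFC", false],
        'overlong <'        => ["\xC0\xBC", false],
    ];
}

// Reference check: preg_match() returns false for a malformed subject under /u.
function pcre_is_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Returns the labels of the vectors that the given validator gets wrong.
function failing_vectors(callable $validator): array
{
    $failed = [];
    foreach (utf8_test_vectors() as $label => [$bytes, $expected]) {
        if ((bool) $validator($bytes) !== $expected) {
            $failed[] = $label;
        }
    }
    return $failed;
}

var_dump(failing_vectors('pcre_is_utf8'));   // an empty array
```

Calling failing_vectors('seems_utf8') or failing_vectors('is_utf8') runs the same vectors against the functions above. A purely structural check that only counts continuation bytes should be expected to let the overlong vector through.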

If anyone is interested I found a great example page to use when testing UTF-8.

A: 

UTF-8 is fine, and doesn't have any limits that UTF-16 solves. PHP doesn't change the way it stores strings in memory (unlike Python). If the entire data flow uses UTF-8 (web forms receive UTF-8 data, tables use the utf8 encoding, you issue SET NAMES utf8, and data is stored without being altered, i.e. no charset conversion), that should be fine.

Adrián
By the way, you should use utf8_general_ci in your DB. That way you won't have any of the problems that may derive from using utf8_unicode_ci.
Adrián
A: 

For user input from forms I add this attribute to my form tags: accept-charset="utf-8". This way the data you receive should always be utf-8 encoded.

p4bl0
I'm afraid this isn't reliable, as bobince correctly mentioned. You should set a header or meta-tag to force the browser into utf-8. This will automatically force the forms on the page to submit data as utf-8.
Martijn Heemels
+2  A: 

database/mysql: If you're using SET NAMES and e.g. php/mysql you're leaving mysql_real_escape_string() in the dark about the change in character encoding. This may lead to wrong results. So, if you're relying on an escape function like mysql_real_escape_string (because you're not using prepared statements) SET NAMES is a suboptimal solution. That's why mysql_set_charset() has been introduced or why gentoo applies a patch that adds the config parameter mysql.connect_charset for both php/mysql and php/mysqli.
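
For comparison, here is a sketch of setting the connection charset through the client API rather than with a SET NAMES query. It assumes the mysqli extension (the text above names mysql_set_charset() for the older mysql extension); the function name is hypothetical and the host, credentials and database name are placeholders.

```php
<?php
// Sketch only: every argument is a placeholder supplied by the caller.
function open_utf8_connection(string $host, string $user, string $pass, string $name): mysqli
{
    // Make mysqli throw exceptions instead of returning false on errors.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $db = new mysqli($host, $user, $pass, $name);

    // Unlike a "SET NAMES utf8" query, set_charset() also tells the client
    // library about the encoding, so real_escape_string() escapes correctly.
    $db->set_charset('utf8');

    return $db;
}
```

Nothing connects until the function is called, for example open_utf8_connection('localhost', 'app_user', 'secret', 'app_db') with your own values.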

The client usually doesn't indicate the encoding of the parameters it sends. If you expect utf-8 encoded data and treat it as such there may be encoding errors (byte sequences that are invalid in utf-8). So the data may not display as expected or a parser may abort the parsing. But at least the user input cannot "escape" and do more harm e.g. in an inline sql statement or html output. E.g. take the script (saved as iso-8859-1 or utf-8, doesn't matter)

<?php
$s = 'abcxyz';
var_dump(htmlspecialchars($s, ENT_QUOTES, 'utf-8'));
// adding the byte sequence for äöü in iso-8859-1
$s = 'abc'. chr(0xE4) . chr(0xF6) . chr(0xFC). 'xyz';
var_dump(htmlspecialchars($s, ENT_QUOTES, 'utf-8'));

prints

string(6) "abcxyz"
string(0) ""

E4F6FC is not a valid utf-8 byte sequence, therefore htmlspecialchars returns an empty string. Other functions may return ? or another "special" character. But at least they will not "mistake" a character as a malicious control character - as long as they all stick to the "proper" encoding (utf-8 in this case).

accept-charset doesn't guarantee that you will receive only data with that encoding. For all you know the client may not even have "used"/parsed your html document containing the form element. It may help and there's no reason why you shouldn't set that attribute. But it's not "dependable".

VolkerK
Regarding SET NAMES: So basically, prior to PHP 5.2.3, mysql_real_escape_string was useless if you couldn't change the server configuration and it didn't fit what you needed? That really sounds like something that needs to be written explicitly in the PHP docs - and it also sounds like I should get around to updating my DB code, just to be on the safe side...
Michael Madsen
While http://php.net/mysql_set_charset doesn't explain why SET NAMES can be bad at least it says "Using mysql_query() to execute SET NAMES .. is not recommended."
VolkerK
+8  A: 

Most of what you are doing now should be correct.

Some notes: any utf8_* collation in MySQL would store your data correctly as UTF-8, the only difference between them is the collation (alphabetical order) applied when sorting.

You can tell Apache and PHP to issue the correct charset headers by setting AddDefaultCharset utf-8 in httpd.conf/.htaccess and default_charset = "utf-8" in php.ini respectively.

You can tell the mbstring extension to take care of the string functions. This works for me:

mbstring.internal_encoding=utf-8
mbstring.http_output=UTF-8
mbstring.encoding_translation=On
mbstring.func_overload=6

(this leaves the mail() function untouched - I found setting it to 7 played havoc with my mail headers)

For charset conversion take a look at https://sourceforge.net/projects/phputf8/.

PHP doesn't care at all about what's in the variable, it just stores and retrieves blindly its content.

You'll have unexpected results if you declare one mbstring.internal_encoding and supply to a mb_* function strings in another encoding. You can anyway safely send ASCII to utf-8 functions.
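
One way to avoid depending on mbstring.internal_encoding at all is to name the encoding in every call and convert once at the boundary. A minimal sketch, assuming the mbstring extension and using ISO-8859-1 bytes as the sample input:

```php
<?php
// "äöü" as ISO-8859-1 bytes, one byte per character.
$latin1 = "\xE4\xF6\xFC";

// Name the encoding explicitly instead of relying on the ini setting.
var_dump(mb_strlen($latin1, 'ISO-8859-1'));   // int(3)

// Convert once, then treat the result as UTF-8 everywhere else.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump(bin2hex($utf8));                     // string(12) "c3a4c3b6c3bc"
var_dump(mb_strlen($utf8, 'UTF-8'));          // int(3)

// Plain ASCII is byte-for-byte identical in UTF-8, so it is always safe.
var_dump(mb_strlen('abcxyz', 'UTF-8'));       // int(6)
```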

If you're worried about somebody posting incorrectly encoded stuff on purpose I believe you should consider HTML Purifier to filter GET/POST data before processing.

Accept-charset has been in the specs since forever, but its real-world support in browsers is more or less zero. The browser will typically use the encoding of the page containing the form.

UTF-16 is not the big brother of UTF-8, it just serves a different purpose.

djn
+9  A: 

Do I need to convert everything that I receive from the user agent (HTML form's & URI) to UTF-8 when the page loads

No. The user agent should be submitting data in UTF-8 format; if not you are losing the benefit of Unicode.

The way to ensure a user-agent submits in UTF-8 format is to serve the page containing the form it's submitting in UTF-8 encoding. Use the Content-Type header (and meta http-equiv too if you intend the form to be saved and work standalone).

I have heard that you should mark you forms as UTF-8 also (accept-charset="UTF-8")

Don't. It was a nice idea in the HTML standard, but IE never got it right. It was supposed to state an exclusive list of allowable charsets, but IE treats it as a list of additional charsets to try, on a per-field basis. So if you have an ISO-8859-1 page and an “accept-charset="UTF-8"” form, IE will first try to encode a field as ISO-8859-1, and if there's a non-8859-1 character in there, then it'll resort to UTF-8.

But since IE does not tell you whether it has used ISO-8859-1 or UTF-8, that's of absolutely no use to you. You would have to guess, for each field separately, which encoding was in use! Not useful. Omit the attribute and serve your pages as UTF-8; that's the best you can do at the moment.

If a UTF string is improperly encoded will something go wrong

If you let such a sequence get through to the browser you could be in trouble. There are ‘overlong sequences’ which encode a low-numbered codepoint in a longer sequence of bytes than is necessary. This means if you are filtering ‘<’ by looking for that ASCII character in a sequence of bytes, you could miss one, and let a script element into what you thought was safe text.
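
To make the overlong case concrete, here is a small sketch; the payload string is invented for illustration. It shows a byte-level search missing the disguised ‘<’, and a UTF-8 validity check (PCRE with the "/u" modifier) rejecting the input.

```php
<?php
// "\xC0\xBC" is an overlong, and therefore invalid, two-byte encoding of "<" (U+003C).
$payload = "\xC0\xBCscript>";

// A byte-level search for "<" finds nothing, so a naive filter lets this through.
$naive_filter_sees_lt = strpos($payload, '<') !== false;
var_dump($naive_filter_sees_lt);   // bool(false)

// Validating the input as UTF-8 first rejects it: preg_match() with /u
// returns false when the subject is not well-formed UTF-8.
$is_valid = preg_match('//u', $payload) === 1;
var_dump($is_valid);               // bool(false)
```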

Overlong sequences were banned back in the early days of Unicode, but it took Microsoft a very long time to get their shit together: IE would interpret the byte sequence ‘\xC0\xBC’ as a ‘<’ up until IE6 Service Pack 1. Opera also got it wrong up to (about, I think) version 7. Luckily these older browsers are dying out, but it's still worth filtering overlong sequences in case those browsers are still about now (or new idiot browsers make the same mistake in future). You can do this, and fix other bad sequences, with a regex that allows only proper UTF-8 through, such as this one from W3.
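
A PHP sketch of that kind of regex follows. It is adapted from the commonly cited W3C pattern rather than copied from the linked page, so treat it as an approximation; the function name is hypothetical. It works on raw bytes (no "/u"), and the possessive `*+` only serves to avoid backtracking on long input.

```php
<?php
// Returns true only for well-formed UTF-8 without overlongs, surrogates,
// or control characters other than tab, LF and CR.
function is_strict_utf8(string $s): bool
{
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII: tab, LF, CR, printable
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*+\z/x';

    return preg_match($pattern, $s) === 1;
}

var_dump(is_strict_utf8("caf\xC3\xA9"));   // bool(true)
var_dump(is_strict_utf8("\xC0\xBC"));      // bool(false), overlong "<"
```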

If you are using mb_ functions in PHP, you might be insulated from these issues. I can't say for sure as mb_* was unusably fragile when I was still writing PHP.

In any case, this is also a good time to remove control characters, which are a large and generally unappreciated source of bugs. I would remove chars 9 and 13 from submitted string in addition to the others the W3 regex takes out; it is also worth removing plain newlines for strings you know aren't supposed to be multiline textboxes.
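
A possible shape for that clean-up step, as a sketch: the helper name and the $multiline flag are made up for the example. The bytes being removed never occur inside a UTF-8 multibyte sequence, so a byte-level replace leaves valid non-ASCII characters intact.

```php
<?php
// Hypothetical helper: drop ASCII control characters from submitted text.
function strip_control_chars(string $s, bool $multiline = false): string
{
    // Multiline fields keep LF (0x0A); everything else in 0x00-0x1F plus 0x7F goes,
    // including tab (9) and CR (13).
    $pattern = $multiline
        ? '/[\x00-\x09\x0B-\x1F\x7F]/'
        : '/[\x00-\x1F\x7F]/';

    return (string) preg_replace($pattern, '', $s);
}

var_dump(strip_control_chars("a\tb\r\nc"));         // string(3) "abc"
var_dump(strip_control_chars("a\tb\r\nc", true));   // string(4), "ab", LF, "c"
```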

Was UTF-16 written to address a limit in UTF-8?

No, UTF-16 is an encoding built on two-byte code units that's used to make indexing Unicode strings easier in-memory (it dates from the days when all of Unicode would fit in two bytes; systems like Windows and Java still do it that way, and characters added since then take two code units, a surrogate pair). Unlike UTF-8 it is not compatible with ASCII, and is of little-to-no use on the Web. But you occasionally meet it in saved files, usually ones saved by Windows users who have been misled by Windows's description of UTF-16LE as “Unicode” in Save-As menus.
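
The ASCII incompatibility is easy to see by converting a couple of characters, as in this sketch (it assumes the mbstring extension; the sample characters are arbitrary):

```php
<?php
$a     = 'A';                  // U+0041
$smile = "\xF0\x9F\x98\x80";   // U+1F600, outside the Basic Multilingual Plane

$a16     = mb_convert_encoding($a, 'UTF-16LE', 'UTF-8');
$smile16 = mb_convert_encoding($smile, 'UTF-16LE', 'UTF-8');

var_dump(bin2hex($a16));       // string(4) "4100": two bytes, not plain ASCII
var_dump(strlen($smile16));    // int(4): two 16-bit units (a surrogate pair)
var_dump(strlen($smile));      // int(4): the same character in UTF-8
```

Neither encoding can represent more characters than the other; they only differ in how the bytes are laid out.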

seems_utf8

This is very inefficient compared to the regex!

Also, make sure to use utf8_unicode_ci on all of your tables.

You can actually sort of get away without this, treating MySQL as a store for nothing but bytes and only interpreting them as UTF-8 in your script. The advantage of using utf8_unicode_ci is that it will collate (sort and do case-insensitive compares) with knowledge about non-ASCII characters, so eg. ‘ŕ’ and ‘Ŕ’ are the same character. If you use a non-UTF8 collation you should stick to binary (case-sensitive) matching.

Whichever you choose, do it consistently: use the same character set for your tables as you do for your connection. What you want to avoid is a lossy character set conversion between your scripts and the database.

bobince
Thanks for the link to the W3 function. I found a PHP version in the docs http://us3.php.net/manual/en/function.mb-detect-encoding.php#68607
Xeoncross
You said "don't use accept-charset on forms" because it doesn't work correctly in IE for non-UTF8 forms. Is there a benefit of adding `accept-charset="UTF-8"` (which I haven't heard contains a problem) if your page is already in UTF-8?
philfreo
@philfreo: Nope, adding `accept-charset="UTF-8"` on a page that is already UTF-8 will have no effect (neither in browsers following the standard nor in IE).
bobince
Okay, thanks. I asked this as a specific question which you may want to answer, here: http://stackoverflow.com/questions/3719974/is-there-any-benefit-to-adding-accept-charsetutf-8-to-html-forms-if-the-page/3720010, and a related question here: http://stackoverflow.com/questions/3715264/how-to-handle-user-input-of-invalid-utf-8-characters
philfreo