It often happens that characters such as é get transformed to Ã©, even though the collation for the MySQL DB, table and field is set to utf8_general_ci. The encoding in the Content-Type for the page is also set to UTF-8.

I know about utf8_encode/decode, but I'm not quite sure about where and how to use it.

I have read the "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" article, but I need some MySQL / PHP specific pointers.

Question: How do I ensure that user-entered data containing international characters doesn't get corrupted?

+2  A: 

Not much to be said that isn't covered by this article

http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet

Peter Bailey
+9  A: 

At first glance, http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet seems to be missing one important thing (perhaps I overlooked it). Depending on your MySQL installation and/or configuration, you have to set the connection encoding so that MySQL knows what encoding you're expecting on the client side (meaning the client side of the MySQL connection, which should be your PHP script). You can do this by manually issuing a

SET NAMES utf8

query prior to any other query you send to the MySQL server.

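If you're using PDO on the PHP side, you can set up the connection to automatically issue this query on every (re)connect by passing it as a driver option to the constructor (setting it afterwards with setAttribute() has no effect, because the connection is already open by then)

$db = new PDO($dsn, $user, $pass, array(PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"));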

when initializing your db connection.
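
For code that uses mysqli rather than PDO, roughly the same thing can be sketched with mysqli::set_charset(). The helper name open_utf8_connection and its parameters are made up for illustration, and the sketch assumes the mysqli extension is loaded and a reachable server:

```php
<?php
// Hypothetical helper: open a mysqli connection and switch the
// connection encoding to UTF-8 before any other query is sent.
function open_utf8_connection(string $host, string $user, string $pass, string $dbname): mysqli
{
    $db = new mysqli($host, $user, $pass, $dbname);
    if ($db->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $db->connect_error);
    }
    // set_charset() changes the connection encoding and also informs
    // the client library, which a plain "SET NAMES" query does not do.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $db->error);
    }
    return $db;
}
```

Nothing is connected until the function is called, e.g. open_utf8_connection('localhost', 'user', 'secret', 'mydb') with your own credentials.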

Stefan Gehrig
It's mentioned in the comments somewhere, but yes, it is difficult to miss!
Jrgns
For anyone just reading this (as of March 2010), the article referenced can now be found at http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
bdl
A: 

For better Unicode correctness, you should use utf8_unicode_ci (though the documentation is a little vague on the differences). You should also make sure the following MySQL options are set correctly:

  • default-character-set=utf8
  • skip-character-set-client-handshake //Important so the client doesn't enforce another encoding

Those can be set in the MySQL configuration file (under the [mysqld] section) or at run time by sending the appropriate queries.

Eran Galperin
+2  A: 

Things you should do:

  • Make sure Apache puts out UTF-8 content. Do this in your httpd.conf, or use PHP's header() function to do it manually.
  • Make sure your database connection is UTF8. "SET NAMES utf8" does the trick.
  • Make sure all your tables are set to UTF8.
  • Make sure all your PHP and template files are encoded as UTF8 if you store international characters in them.

You usually don't have to do too much with the mbstring or utf8_encode/decode functions when you do this.

Vegard Larsen
+5  A: 

Collation and charset are not the same thing. Your collation needs to match the charset, so if your charset is utf-8, the collation should be a utf-8 one as well. Picking the wrong collation won't garble your data, though; it just makes string comparison and sorting work incorrectly.

That said, there are several places where you can set charset settings in PHP. I would recommend that you use utf-8 throughout, if possible. Places that need a charset specified are:

  • The database. This can be set on database, table and field level, and even on a per-query level.
  • Connection between PHP and database.
  • HTTP output; Make sure that the HTTP-header Content-Type specifies utf-8. You can set default values in PHP and in Apache, or you can use PHP's header function.
  • HTTP input. Generally forms will be submitted in the same charset as the page was served in, but to make sure, you should specify the accept-charset attribute on the form. Also make sure that URLs are utf-8 encoded, or avoid using non-ASCII characters in URLs (and GET parameters).
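
The HTTP output and input points above could look roughly like this. It's only a sketch: utf8_content_type and render_name_form are hypothetical helper names, and the form field is invented for the example.

```php
<?php
// Build the Content-Type header line that declares UTF-8 output.
function utf8_content_type(string $mime = 'text/html'): string
{
    return 'Content-Type: ' . $mime . '; charset=utf-8';
}

// Hypothetical form renderer: accept-charset asks the browser to submit
// UTF-8, and htmlspecialchars() is told that the value is UTF-8.
function render_name_form(string $currentName): string
{
    $value = htmlspecialchars($currentName, ENT_QUOTES, 'UTF-8');
    return '<form method="post" accept-charset="utf-8">'
         . '<input type="text" name="name" value="' . $value . '">'
         . '</form>';
}

// Send the header only if output has not started yet.
if (!headers_sent()) {
    header(utf8_content_type());
}
echo render_name_form("Zo\xC3\xAB"); // "Zoë" written as UTF-8 bytes
```

Writing the name as escaped bytes keeps the example independent of how the PHP file itself is saved.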

utf8_encode/decode functions are a little strangely named. They specifically convert between latin1 (ISO-8859-1) and utf-8. If everything in your application is utf-8, you won't have to use them much.
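
To see how the é to Ã© corruption comes about, here is a small sketch of what happens when UTF-8 bytes are mistaken for latin1 and converted a second time. It uses mb_convert_encoding() instead of utf8_encode/decode (which are deprecated as of PHP 8.2) and assumes the mbstring extension is installed:

```php
<?php
// "é" encoded as UTF-8 is the two bytes C3 A9.
$utf8 = "\xC3\xA9";

// Treating those bytes as ISO-8859-1 and converting them to UTF-8
// encodes each byte separately, which displays as "Ã©".
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($doubleEncoded), "\n"; // c383c2a9

// Converting in the other direction undoes the damage.
$repaired = mb_convert_encoding($doubleEncoded, 'ISO-8859-1', 'UTF-8');
echo bin2hex($repaired), "\n"; // c3a9
```

If you see Ã© in your output, something in the chain (usually the database connection) is treating utf-8 data as latin1.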

There are at least two gotchas in regards to utf-8 and PHP. The first is that PHP's builtin string functions expect strings to be single-byte. For a lot of operations, this doesn't matter, but it means that you can't rely on strlen and other functions. There is a good run-down of the limitations at this page. Usually, it's not a big problem, but especially when using third-party libraries, you need to be aware that things could blow up on this. One option is to use the mbstring extension, which has the option to replace all troublesome functions with utf-8 aware alternatives. It's still not a 100% bulletproof solution, but it'll work for most cases.
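
A quick illustration of the byte-versus-character difference, assuming the mbstring extension is available:

```php
<?php
$word = "na\xC3\xAFve"; // "naïve" encoded as UTF-8: 5 characters, 6 bytes

echo strlen($word), "\n";             // 6 (counts bytes)
echo mb_strlen($word, 'UTF-8'), "\n"; // 5 (counts characters)

// substr() works on bytes and cuts between the two bytes of "ï".
$cut = substr($word, 0, 3);
echo bin2hex($cut), "\n"; // 6e61c3

// mb_substr() works on characters and keeps "ï" intact.
$safe = mb_substr($word, 0, 3, 'UTF-8');
echo bin2hex($safe), "\n"; // 6e61c3af
```

The half-cut string from substr() is no longer valid utf-8, which is the kind of thing that blows up later in a template or a database insert.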

Another problem is that some installations of PHP still have the magic_quotes setting turned on. This problem is orthogonal to utf-8, but can lead to some head scratching. Turn it off, for your own sanity's sake.

troelskn
A: 

Regardless of the language it's written in, if you were to create an app that allows a wide array of encodings, handle it in pieces:

  • Identify the encoding
    • somehow you want to find out what kind of encoding you're dealing with, otherwise, it's pretty pointless to consider it further. You'll end up with junk chars.
  • Handle your bytes
    • think of these strings less like 'strings' of characters, and more like lists of bytes
    • PHP is especially sneaky. Don't let it truncate your data on-the-fly. If you're regexing a UTF-8 string, make sure you identify it as such
  • Store for the LCD
    • Again, you don't want to truncate data. If you're storing a sentence in English, can you also store a set of Mandarin glyphs? How about Arabic? Which of these is going to require the most space? Account for it.
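
In PHP terms, the three steps above might be sketched like this. It assumes the mbstring extension is installed, and the sample strings are only for illustration:

```php
<?php
$utf8   = "\xC3\xA9"; // "é" as UTF-8
$latin1 = "\xE9";     // "é" as ISO-8859-1

// Identify: a lone E9 byte is not valid UTF-8.
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Handle your bytes: without the /u modifier "." matches a single byte,
// so the two-byte "é" does not count as one character.
var_dump(preg_match('/^.$/', $utf8));  // int(0)
var_dump(preg_match('/^.$/u', $utf8)); // int(1)

// Store for the LCD: budget for bytes, not characters.
$han = "\xE4\xB8\xAD"; // the single character "中"
echo strlen($han), "\n"; // 3
```

mb_check_encoding() can only tell you whether a string is valid in a given encoding, not which encoding it was meant to be, so knowing the encoding up front is still the safer route.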
Pete Karl II