I have the following test script on my server:

<?php
echo "Test is: " . $_GET['test'];
?>

If I call it with a URL like example.com/script.php?test=ɿ (ɿ being a multibyte character), the resulting page looks like this:

Test is: É¿

If I try to do anything with the value in $_GET['test'], such as save it to a MySQL database, I have the same problem. What do I need to do to make PHP handle this value correctly?

+3  A: 

Have you told the user agent your HTTP response is UTF-8?

header ('Content-type: text/html; charset=utf-8');

You might also want to ensure your HTML markup declares the encoding, e.g.

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
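Putting the header and the meta element together, the test script from the question might look roughly like the sketch below. The helper name render_test_page() is made up for illustration, and the htmlspecialchars() call is an addition of mine (it is not part of the original script) so that the echoed value is escaped as UTF-8.

```php
<?php
// Hypothetical helper (not from the original script): builds the page and
// escapes the value, treating it as UTF-8.
function render_test_page(string $value): string
{
    $safe = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');

    return '<!DOCTYPE html><html><head>'
         . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
         . '</head><body>Test is: ' . $safe . '</body></html>';
}

// Declare the encoding in the HTTP header; this must happen before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

echo render_test_page($_GET['test'] ?? '');
```

The header covers normal HTTP delivery; the meta element covers the case where the page is saved and reopened without its headers.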

For your database, are your tables and mysql client settings set up for UTF-8? If you check your database using a mysql command line client, is your terminal environment set up to expect UTF-8?

In a nutshell, you must check every step: from the raw source data, the code which touches it, the storage systems which retain it, and the tools you use to display and debug it.
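To see why the page shows É¿, it can help to look at the raw bytes. The sketch below assumes the mbstring extension is loaded; the variable names are only for illustration. It takes the two UTF-8 bytes of ɿ (U+027F) and reads them as ISO-8859-1, which is roughly what a browser does when it is not told the page is UTF-8.

```php
<?php
// U+027F encoded as UTF-8 is the two bytes 0xC9 0xBF.
$utf8 = "\xC9\xBF";

// The bytes PHP received are valid UTF-8, so the input itself is not damaged.
$isValidUtf8 = mb_check_encoding($utf8, 'UTF-8');

// Treat each byte as one ISO-8859-1 character and re-encode as UTF-8:
// 0xC9 becomes "É" and 0xBF becomes "¿", the garbled output from the question.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";
echo $garbled, "\n";
```

If bin2hex() shows c9bf at a given step, the data is still intact there and the problem lies in a later step (display, connection charset, or storage).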

Paul Dixon
What's the point in repeating the header?
Alix Axel
If the document is stored in another retrieval system, the original HTTP headers are lost - for example, if you save the HTML to a local hard disc.
Paul Dixon
Yeah, I mean what's the point in using the first `header()` call? The meta tag does the same.
Alix Axel
If the default_charset ini parameter is set, PHP sends a Content-Type header that includes the charset. HTTP clients (usually) prefer the HTTP header over the http-equiv setting. So you might want to avoid ambiguities/errors caused by different ini settings and make the charset explicit in both the HTTP header and the meta/http-equiv element.
VolkerK
I would say simply "because you can", but there may be more justification beyond that :) One thing it does allow you to do is probe the content type of the response via a HEAD request.
Paul Dixon
VolkerK put it better than I, +1 to that!
Paul Dixon
Adding that header causes it to display correctly on the resulting page, but doesn't help my database problem. What do I need to do other than set the collation to utf8_unicode_ci on the table and database? (and column)
takteek
Oh, didn't think of HEAD requests. Good point.
VolkerK
@takteek: that depends a bit on the API you're using to connect to the MySQL server. If you're using mysql_connect() (i.e. the php-mysql extension), search Stack Overflow for mysql_set_charset()
VolkerK
Ah, mysql_set_charset('utf8') fixed everything. Thanks. I think this was probably a case where I would have found that if I just looked for 5 more minutes. I got impatient since it's 5:30 AM. :)
takteek
After a restful ...nap you might be interested in http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html to learn more about what mysql_set_charset() does and why `SET NAMES 'utf8'` is not the whole story when using mysql_query() (it doesn't notify the client lib about the change of character encoding, which may, in rare cases, lead to wrong results from mysql_real_escape_string()). I _guess_ `SET NAMES` is safe when using prepared statements (mysqli, pdo).
VolkerK
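For readers on mysqli rather than the old mysql_* functions (which were removed in PHP 7), the same idea can be sketched as below. This is only an illustration under assumptions: the function names, the table test_values and the column content are made up, and the 'utf8' default simply mirrors the thread (on MySQL 5.5.3 and later, 'utf8mb4' is the charset that covers all of UTF-8).

```php
<?php
// Sketch only: credentials, table and column names are placeholders.
function open_utf8_connection(
    string $host,
    string $user,
    string $pass,
    string $db,
    string $charset = 'utf8'
): mysqli {
    // Make mysqli throw exceptions instead of returning false on errors.
    mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);

    $link = new mysqli($host, $user, $pass, $db);

    // Unlike a plain SET NAMES query, set_charset() also informs the client
    // library, so real_escape_string() escapes for the right encoding.
    $link->set_charset($charset);

    return $link;
}

function save_test_value(mysqli $link, string $value): void
{
    // Prepared statement: the value travels separately from the SQL text.
    $stmt = $link->prepare('INSERT INTO test_values (content) VALUES (?)');
    $stmt->bind_param('s', $value);
    $stmt->execute();
    $stmt->close();
}
```

Usage would be along the lines of save_test_value(open_utf8_connection('localhost', 'user', 'secret', 'app'), $_GET['test']), with real credentials substituted.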
@VolkerK: +1 Wow, my foundations just got shattered... Mind commenting on http://stackoverflow.com/questions/1933411/mysql-and-utf-8 please?
Alix Axel
@Alix: To glue the shards of your foundation back together again, I'm not even sure if this can be exploited when switching the conn.charset from latin1 to _utf8_. It may be, but I guess it's not time for _panic(!)_, yet. ;-) Chris Shiflett used the GBK (simplified Chinese) charset for his demo. Take a look at http://ilia.ws/archives/103-mysql_real_escape_string-versus-Prepared-Statements.html which is a "reply" to the addslashes vs. real_escape_string() article. mysql_set_charset() was introduced with PHP 5.2.3 (31-May-2007), after both articles (the latter was published January 22, 2006).
VolkerK
@VolkerK: Thanks! From what I've understood it seems that it's safe to use UTF-8, they also only seem to mention the danger of the `SET CHARACTER SET` query, they don't go into much detail about `SET NAMES`.
Alix Axel
Neither SQL statement informs the client lib. I only have a basic understanding of UTF-8 and don't _know for sure_ whether this can be exploited when switching from latin1 to utf8 or not. But since calling set_charset() instead of mysql_query('SET...) doesn't introduce more complexity and closes a potential hole, I'd definitely prefer the safe route here. I prefer prepared statements anyway ;-)
VolkerK
+1  A: 

UTF-8 all the way through…


Follow the steps, specifically:

  • SET NAMES 'utf8' upon connection to the MySQL DB
  • <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in your HTML
Alix Axel
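If you connect through PDO instead, one alternative to issuing SET NAMES yourself is to name the charset in the DSN (the charset option is honoured from PHP 5.3.6 on). A rough sketch, with hypothetical helper names and placeholder connection details:

```php
<?php
// Hypothetical helper: builds a MySQL DSN that names the connection charset.
function build_mysql_dsn(string $host, string $db, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $db, $charset);
}

// Sketch only: opens a PDO connection whose charset is set via the DSN.
function open_utf8_pdo(string $host, string $db, string $user, string $pass): PDO
{
    return new PDO(build_mysql_dsn($host, $db), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // throw on errors
    ]);
}

echo build_mysql_dsn('localhost', 'app'), "\n";
```

The 'utf8' default mirrors the steps above; pass 'utf8mb4' as the third argument if your server and tables use it.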