I've made a test program that is basically just a textarea that I can enter characters into and when I click submit the characters are written to a MySQL test table (using PHP).

The test table's collation is UTF-8.

The script works fine: if I write an é or ú to the database, it is stored correctly. But then if I add the following meta statement to the <head> area of my page:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

...the characters start becoming scrambled.

My theory is that the server is imposing some encoding that works well, but when I add the UTF-8 directive it overrides this server encoding, and that this UTF-8 encoding doesn't include characters such as é and ú. But I thought that UTF-8 encoded all (bar Klingon etc) characters.

Basically my program works but I want to know why when I add the directive it doesn't. I think I'm missing something.

Any help/teaching most appreciated.

Thanks in advance.

+1  A: 

Firstly, PHP generally doesn't handle the Unicode character set or UTF-8 character encoding. With the exception of (careful use of) mb_... functions, it just treats strings as binary data.

Secondly, you need to tell the MySQL client library what character set / encoding you're working with. The 'SET NAMES' SQL command does the job, and different MySQL clients (mysql, mysqli etc..) provide access to it in different ways, e.g. http://www.php.net/manual/en/mysqli.set-charset.php

Your browser, and MySQL client, are probably both defaulting to latin1, and coincidentally matching. MySQL then knows to convert the latin1 binary data into UTF-8. When you set the browser charset/encoding to UTF-8, the MySQL client is interpreting that UTF-8 data as latin1, and incorrectly transcoding it.
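
To make the mismatch concrete, here is a minimal sketch of what happens to the bytes. It is written in Python rather than PHP purely so it runs without a database; Python's latin-1 codec stands in for MySQL's latin1 connection charset, which is an assumption for illustration, and the variable names are made up.

```python
# Simulate the mismatch: the browser submits UTF-8, but the MySQL client
# connection is still assumed to be latin1 (illustrative stand-in only).
text = "\u00e9"  # é

# What the browser sends once the page is declared as UTF-8.
sent_by_browser = text.encode("utf-8")        # two bytes: C3 A9

# MySQL believes every byte is one latin1 character...
misread = sent_by_browser.decode("latin-1")   # two characters: Ã and ©

# ...and transcodes those two characters into the UTF-8 table.
stored = misread.encode("utf-8")              # four bytes: C3 83 C2 A9

# If the connection charset matches the input (utf8), nothing is transcoded.
stored_correctly = sent_by_browser

print(sent_by_browser.hex(), stored.hex(), stored_correctly.hex())
```

The single é ends up stored as the two characters Ã©, which is the "scrambled" output described in the question. Declaring the connection charset as utf8 avoids the extra transcoding step.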

So the solution is to set the MySQL client to a charset matching the input to PHP from the browser.

Note also that table collation isn't the same as table character set - collation refers to how strings are compared and sorted. Confusing stuff, hope this helps!

Paul Annesley
Thanks very much. I've had a look and I think my problems (at least my UTF-8 related problems) are more basic than this. If I forget the DB for the time being and just put this little page on my server: <html><head></head><body>é</body></html> ...all is well and an é appears. But if I do this: <html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>é</body></html> ...I get question marks. The same thing happens in both IE and FF.
Columbo
Sorry, that didn't come out too well.
Columbo
Perhaps your text editor is saving the file as a non-UTF-8 encoding?
Paul Annesley
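
A rough sketch of why a mis-saved file shows question marks, again in Python for convenience. It assumes the editor saved the file in a single-byte Windows encoding (cp1252 is used here as a guess at Notepad's "ANSI" default on a Western-locale machine) while the page declares UTF-8.

```python
# The page body contains a single é.
page_body = "\u00e9"

# Assumption: the editor saved the file as cp1252 ("ANSI"), one byte per character.
saved_as_ansi = page_body.encode("cp1252")    # one byte: E9

# The browser is told the page is UTF-8. A lone E9 byte is not valid UTF-8,
# so it is shown as the replacement character (often drawn as a question mark).
shown = saved_as_ansi.decode("utf-8", errors="replace")

# Saving the same file as UTF-8 makes the bytes match the declaration.
saved_as_utf8 = page_body.encode("utf-8")     # two bytes: C3 A9
shown_ok = saved_as_utf8.decode("utf-8")

print(saved_as_ansi.hex(), repr(shown), saved_as_utf8.hex(), repr(shown_ok))
```

Without the meta tag the browser falls back to a single-byte default, which happens to match the file, so the é appears to work.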
Thanks Paul, you were correct: when I saved my code as UTF-8 in Notepad, the simple page above worked. But I still had the original database issue, so I went back to your main post and it has solved the issue for me. I added the MySQL query SET NAMES utf8 from PHP (mysql_query("SET NAMES utf8");) before my main query, telling MySQL what character set to expect, and it works. Your third paragraph explains what was happening; I was wrongly assuming collation was all I needed to worry about. Thanks for the knowledge.
Columbo