I'm completely confused by what I've read about character sets. I'm developing an interface to store French text formatted in HTML inside a MySQL database.

What I understood was that the safe way to have all French special characters displayed properly would be to store them as utf8. So I've created a MySQL database with utf8 specified for the database and each table. I can see through phpMyAdmin that the characters are stored exactly the way they are supposed to be. But outputting these characters via PHP gives me erratic results: accented characters are replaced by meaningless characters. Why is that?

Do I have to utf8_encode or utf8_decode them? Note: the HTML page character encoding is set to utf8.

More generally, what is the safe way to store this data? Should I combine htmlentities, addslashes, and utf8_encode when saving, and stripslashes, html_entity_decode and utf8_decode when I output?

+7  A: 

MySQL converts text on the fly between the charset the data is stored in and something called the connection charset. You can specify this charset using the SQL statement

SET NAMES utf8

or use a specific API function such as mysql_set_charset():

mysql_set_charset("utf8", $conn);

If this is done correctly there's no need to use functions such as utf8_encode() and utf8_decode().
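
As a rough sketch of the same idea on a current PHP version (the old mysql_* extension was removed in PHP 7), the connection charset can be set with mysqli right after connecting. The helper name open_utf8_connection and its parameters are made up for illustration, and the mysqli extension is assumed to be available:

```php
<?php
// Hypothetical helper (not from the original answer): open a mysqli
// connection and set the connection charset before any query runs.
function open_utf8_connection(
    string $host,
    string $user,
    string $password,
    string $database,
    string $charset = 'utf8'
): mysqli {
    $conn = new mysqli($host, $user, $password, $database);
    if ($conn->connect_error) {
        throw new RuntimeException('Connection failed: ' . $conn->connect_error);
    }
    // Same effect as mysql_set_charset() above: from here on, MySQL sends
    // and expects data in this charset on this connection.
    if (!$conn->set_charset($charset)) {
        throw new RuntimeException('Could not set charset: ' . $conn->error);
    }
    return $conn;
}
```

Calling it would look like open_utf8_connection('localhost', 'user', 'secret', 'mydb'), with placeholder credentials. Doing this in one place means no later query can run with the wrong charset.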

You also have to make sure that the browser uses the same encoding. This is usually done using a simple header:

header('Content-type: text/html;charset=utf-8');

(Note that the charset is called utf-8 in the browser but utf8 in MySQL.)

In most cases the connection charset and web charset are the only things that you need to keep track of, so if it still doesn't work there's probably something else you're doing wrong. Try experimenting with it a bit; it usually takes a while to fully understand.

Emil H
Thanks. I run this SET NAMES query systematically before inserting and selecting the data. That didn't help. My question is really more about how PHP and the end browser manipulate the SQL data and how to control it.
pixeline
A: 

In addition to what Emil H said, you also need this inside your page's head tag:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
dr Hannibal Lecter
+1  A: 

It is useful to consider the PHP-generated front end and the MySQL backend as separate components. MySQL should not have to worry about display logic, nor should PHP assume that the backend does any sort of preprocessing on the data.

My advice would be to store the data as plain characters using utf8 encoding, and escape any dangerous characters with MySQL's own escaping functions. PHP then reads the utf8-encoded data from the database, processes it (with htmlentities(), most often), and displays it via whichever template you choose to use.
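
One detail that matters for the htmlentities() step: the function has to be told which charset the input is in, otherwise multi-byte characters can be split into separate entities. A small sketch of the difference, using a sample string chosen here for illustration:

```php
<?php
// "café" with the "é" written as raw UTF-8 bytes, so the example does not
// depend on the encoding this file itself is saved in.
$text = "caf\xC3\xA9";

// Charset named explicitly: the two bytes are read as one character.
$good = htmlentities($text, ENT_QUOTES, 'UTF-8');

// Wrong charset: each byte is treated as a separate Latin-1 character.
$bad = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

echo $good, "\n"; // caf&eacute;
echo $bad, "\n";  // caf&Atilde;&copy;
```

Keep in mind that htmlentities() also escapes markup such as < and >, so it suits plain text fields; text that is deliberately stored as HTML would have its tags shown literally.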

Emil H. correctly suggested using

 SET NAMES utf8

which should be the first thing you call after making a MySQL connection. This makes MySQL treat all input and output as utf8.

Note that if you have to use the utf8_encode or utf8_decode functions, you are not setting the HTML character encoding correctly. It is easiest to require that every component of your system uses utf8, since that way you should never have to do manual encoding/decoding, which can cause hard-to-track issues later on.
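
To illustrate why manual encoding tends to backfire, here is a minimal sketch of what happens when text that is already UTF-8 gets encoded a second time. It uses mb_convert_encoding() from the mbstring extension (assumed to be installed) to do the same Latin-1 to UTF-8 conversion that utf8_encode() does, since utf8_encode() is deprecated as of PHP 8.2:

```php
<?php
// "é" already encoded as UTF-8 (two bytes: C3 A9).
$utf8 = "\xC3\xA9";

// Same conversion as utf8_encode(): treat every byte as a Latin-1
// character and re-encode it as UTF-8.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3a9
echo bin2hex($double), "\n"; // c383c2a9, which a UTF-8 page shows as "Ã©"

// Going the other way (what utf8_decode() does) undoes the damage.
$restored = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $utf8); // bool(true)
```

Garbage like "Ã©" in place of "é" is the usual sign that some layer read UTF-8 bytes as Latin-1, and a mismatched connection charset produces the same symptom.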

Jukka Dahlbom
+2  A: 

I strongly recommend reading the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky, to understand what you are doing and why.

Luis Melgratti