You should be using UTF-8 all the way through. Check that:
your connection to the database is UTF-8 (using mysql_set_charset);
the pages you're outputting are marked as UTF-8 (<meta http-equiv="Content-Type" content="text/html;charset=utf-8">);
when you output strings from the database, you HTML-encode them using htmlspecialchars() and not htmlentities().
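As a rough sketch, the three checks above might fit together as follows. The render_page() helper and the sample row are hypothetical. The connection step is shown only as comments because it needs a live server, and it assumes the mysqli extension (mysqli_set_charset) rather than the older mysql_set_charset.

```php
<?php
// Hypothetical helper: builds a complete UTF-8 page from strings read from the database.
function render_page(array $rows): string
{
    $items = '';
    foreach ($rows as $row) {
        // Escape only the HTML-special characters; non-ASCII bytes pass through untouched.
        $items .= '<li>' . htmlspecialchars($row, ENT_QUOTES, 'UTF-8') . "</li>\n";
    }

    // Mark the page as UTF-8 so the browser decodes the raw bytes correctly.
    return "<!DOCTYPE html>\n<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\">\n"
        . "</head>\n<body>\n<ul>\n" . $items . "</ul>\n</body>\n</html>\n";
}

// Connection step (assumption: mysqli, placeholder credentials):
// $db = mysqli_connect('localhost', 'user', 'password', 'example_db');
// mysqli_set_charset($db, 'utf8mb4'); // MySQL's name for full UTF-8

// Stand-in for a row fetched from the database: curly quotes as UTF-8 bytes.
$rows = ["\xE2\x80\x9CFish & chips\xE2\x80\x9D"];
$page = render_page($rows);
echo $page;
```

Sending a matching Content-Type HTTP header as well as the meta tag is a common extra precaution, though the meta tag alone matches what the checklist asks for.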
htmlentities HTML-encodes every non-ASCII character that has a named entity, and by default (in PHP versions before 5.4) assumes you are passing it bytes in ISO-8859-1. So if you pass it “ encoded as UTF-8 (bytes 0xE2, 0x80, 0x9C), each byte is treated as a separate character and you'd get mojibake such as â€œ, instead of the expected &ldquo; or &#8220;. This can be fixed by passing in utf-8 as the optional $charset argument.
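The difference can be seen in a small sketch like the one below. It passes the charset explicitly in both calls, so it does not depend on which default your PHP version uses. The variable names are illustrative only.

```php
<?php
// U+201C LEFT DOUBLE QUOTATION MARK, written out as its three UTF-8 bytes.
$quote = "\xE2\x80\x9C";

// Declaring the input as ISO-8859-1 makes htmlentities() treat each byte
// as its own character, so 0xE2 alone becomes the entity for "a circumflex".
$wrong = htmlentities($quote, ENT_QUOTES, 'ISO-8859-1');

// Declaring the input as UTF-8 lets the three bytes be decoded as one character.
$right = htmlentities($quote, ENT_QUOTES, 'UTF-8');

echo bin2hex($wrong), "\n"; // hex dump, since the result contains stray bytes
echo $right, "\n";          // &ldquo;
```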
However, it's usually easier to just use htmlspecialchars() instead, as this leaves non-ASCII characters alone, as raw bytes instead of HTML entity references. This results in smaller page output, so it is preferable as long as you're sure the HTML you're producing will keep its charset information (which you can usually rely on, except in contexts such as sending snippets of HTML in an email).
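To illustrate the size point, here is a minimal comparison of the two functions on the same UTF-8 string. The sample text is made up for the example.

```php
<?php
// Sample text with curly quotes (UTF-8 bytes) and an ampersand.
$text = "\xE2\x80\x9CTom & Jerry\xE2\x80\x9D";

// Only the HTML-special characters are escaped; the quotes stay as raw UTF-8 bytes.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Every character with a named entity is escaped, including the curly quotes.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";  // “Tom &amp; Jerry”
echo $entities, "\n"; // &ldquo;Tom &amp; Jerry&rdquo;

// The htmlspecialchars() output is the shorter of the two in bytes.
var_dump(strlen($special) < strlen($entities));
```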
htmlspecialchars() does have an optional $charset argument too, but setting it to utf-8 is not critical, since that results in no change of behaviour over the default ISO-8859-1 charset. If you are producing output in old-school multibyte encodings like Shift-JIS you do have to worry about setting this argument correctly, but today that's quite rare as most sane people use UTF-8 in preference.
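A quick way to check this claim for yourself is sketched below. It assumes the input is well-formed UTF-8; malformed byte sequences may be handled differently when utf-8 is specified, depending on PHP version and flags.

```php
<?php
// Well-formed UTF-8 input containing both special and non-ASCII characters.
$text = "\xE2\x80\x9Ca < b & c\xE2\x80\x9D";

$asUtf8   = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
$asLatin1 = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');

// Both calls escape only <, >, &, and quotes, and pass bytes >= 0x80 through unchanged.
var_dump($asUtf8 === $asLatin1);
```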