Okay, there is a ton of material out there on sanitizing strings, but very little that I can find on the best way to prepare user input (like what I'm typing now) for insertion into a content management system, and then how to filter it on the way out.
I'm building two multilingual CMSs (Japanese, English, and several Romance languages) and I'm having a heck of a time getting special characters like ® and ™ to display alongside Japanese characters.
I continue to get very inconsistent results.
I have everything set to UTF-8:
web page: and
.htaccess file: AddDefaultCharset UTF-8
and (to force the issue) after each db connection: mysql_query("SET NAMES 'UTF8'");
Each database, table, and field is also set to utf8_general_ci.
Magic quotes are off. I preprocess user input with the default settings of HTML Purifier, then run this function on it:
function html_encode($var) {
    // Encodes HTML safely for UTF-8. Use instead of htmlentities.
    $var = htmlentities($var, ENT_QUOTES, 'UTF-8');
    // convert pesky special characters to unicode
    $look = array('™', '™','®','®');
    $safe = array('™', '™', '®', '®');
    $var = str_replace($look, $safe, $var);
    $var = mysql_real_escape_string($var);
    return $var;
}
That gets it into the database.
I return it from the database by filtering everything with this function:
function decodeit($var) {
    return html_entity_decode(stripcslashes($var), ENT_QUOTES, 'UTF-8');
}
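To take the database out of the picture, the entity encode/decode pair can be exercised on its own. The following is only a sketch: `round_trips` is a hypothetical helper name and the sample string is made up for illustration. It deliberately leaves out `mysql_real_escape_string` and `stripcslashes`, since those depend on a live connection, and it assumes the script file itself is saved as UTF-8.

```php
<?php
// Hypothetical helper: checks whether a string survives the same
// htmlentities / html_entity_decode steps used in the functions above,
// with no database involved.
function round_trips($text) {
    $encoded = htmlentities($text, ENT_QUOTES, 'UTF-8');
    $decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
    return $decoded === $text;
}

// Made-up sample input: symbols, Japanese text, and both quote styles.
$sample = "Acme® ウィジェット™ \"it's a test\"";
var_dump(round_trips($sample));
```

If this reports true for the strings that misbehave, the entity functions are probably not where the characters are getting lost, which would point at the connection, the storage, or the page output instead.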
Unfortunately, after all this I STILL get inconsistent results. Most often the ® symbols become little diamonds.
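The diamonds look like the replacement character a browser shows for bytes that are not valid UTF-8. In ISO-8859-1, ® is the single byte 0xAE, which is not valid UTF-8 on its own, while in UTF-8 it is the two bytes 0xC2 0xAE. A rough way to see which form is actually coming back is to dump the raw bytes. This is a sketch under assumptions: `describe_bytes` is a hypothetical helper name, and the validity check relies on `preg_match` with the `u` modifier failing on invalid UTF-8 input.

```php
<?php
// Hypothetical helper: reports a string's raw bytes as hex and whether
// those bytes form valid UTF-8.
function describe_bytes($text) {
    return array(
        'hex'        => bin2hex($text),
        // With the /u modifier, preg_match returns false (not 1)
        // when the subject is not valid UTF-8.
        'valid_utf8' => preg_match('//u', $text) === 1,
    );
}

$latin1_reg = "\xAE";      // ® as a single ISO-8859-1 byte
$utf8_reg   = "\xC2\xAE";  // ® as the two-byte UTF-8 sequence

print_r(describe_bytes($latin1_reg));
print_r(describe_bytes($utf8_reg));
```

Running the same helper on a value straight out of the database, before `decodeit` touches it, should show whether the stored bytes are already in the single-byte form.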
I've searched all over for a good tutorial on this but can't seem to find out what the best methods are...