Okay, there is a ton of material out there on sanitizing strings, but very little that I can find on the best way to prepare user input (like what I'm typing now) for insertion into a content management system, and then how to filter it on the way out.

I'm building two multilingual (Japanese, English, and some Romance languages) CMSs and having a heck of a time getting special characters like ® and ™ to display alongside Japanese characters.

I continue to get very inconsistent results.

I have everything set to UTF-8:

web page: and

.htaccess file: AddDefaultCharset UTF-8 AND (to force the issue)

after each db connection: mysql_query("SET NAMES 'UTF8'");

each database, table, and field is also set to utf8_general_ci

Magic quotes are off. I preprocess user input first with the default settings of htmlpurifier, then I run this function on it:

function html_encode($var) {

     // Encodes HTML safely for UTF-8. Use instead of htmlentities.
     $var = htmlentities($var, ENT_QUOTES, 'UTF-8');

     // convert pesky special characters to unicode
     $look = array('™', '™','®','®');
     $safe = array('™', '™', '®', '®'); 

     $var = str_replace($look, $safe, $var);

     $var = mysql_real_escape_string($var); 

     return $var; 
}

That gets it into the database.

I return it from the database by filtering everything with this function:

function decodeit($var) {

     return html_entity_decode(stripcslashes($var), ENT_QUOTES, 'UTF-8');
}

Unfortunately, after all this I STILL get inconsistent results. Most often the ® symbols become little diamonds.

I've searched all over for a good tutorial on this but can't seem to find what the best methods are...

+1  A: 

Sorry, the web page headers got scrubbed by the WYSIWYG editor. For clarity's sake:

Web page headers are:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

And

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
fred
A: 

http://us3.php.net/utf8_encode http://us3.php.net/utf8-decode

That should help.

J.J.
A: 

Everything is already encoded as UTF-8. Decoding it to ISO-8859-1 would merely wreck any Japanese.

fred
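
To illustrate the point above, here is a minimal sketch, assuming the mbstring extension is available. The sample string is made up for illustration, and mb_convert_encoding() is used here in place of utf8_decode() to perform the same UTF-8 to ISO-8859-1 conversion.

```php
<?php
// Hypothetical sample: Japanese text followed by the registered-trademark sign.
$utf8 = "日本語®";

// Convert UTF-8 to ISO-8859-1. Characters with no Latin-1 equivalent
// are replaced by mbstring's substitute character.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// The Japanese characters cannot be represented in ISO-8859-1,
// so converting back does not restore them.
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($roundTrip === $utf8); // bool(false)

// In Latin-1 the ® sign is the single byte 0xAE, which on its own
// is not a valid UTF-8 sequence.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
```

A lone 0xAE byte served on a page declared as UTF-8 is the kind of thing browsers typically render as a replacement glyph, which may be related to the diamonds described in the question.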
A: 

I once had an issue with encoding that came down to the encoding of the php files themselves. So basically make sure the files themselves are encoded to utf-8. In vim you can do :e ++enc=

sofia
the text got wrangled. It's :e ++enc=utf-8
sofia
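
A rough way to check the same thing from PHP itself, assuming the mbstring extension is available. The helper name is_valid_utf8() and the file name in the commented usage line are hypothetical; this is a sketch, not a complete diagnostic.

```php
<?php
// Returns true when the given bytes form valid UTF-8.
function is_valid_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// Hypothetical usage: check a source file on disk.
// $ok = is_valid_utf8(file_get_contents('index.php'));

var_dump(is_valid_utf8("\xC2\xAE")); // "®" as UTF-8 bytes: bool(true)
var_dump(is_valid_utf8("\xAE"));     // "®" as an ISO-8859-1 byte: bool(false)
```

Note that a pure-ASCII file passes this check too, so a passing result only rules out invalid byte sequences; it does not prove the editor saved the file as UTF-8.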
+1  A: 

Don't put HTML entities in your database! Never call htmlentities(); it should be deprecated from PHP. Use htmlspecialchars(), but do it when you display the text, not before you put it in the database. The point is to prevent your data from being treated as HTML. There is no point in translating trademark or copyright symbols, because they pose no risk. The only HTML characters you need to worry about are: > < & ' "

John C
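
A minimal sketch of the escape-on-output approach described in the answer above. The function name escape_for_html() and the sample string are hypothetical; the assumption is that the database holds the raw UTF-8 text and escaping happens only at display time.

```php
<?php
// Escape text at display time only; the stored value stays raw UTF-8.
function escape_for_html(string $text): string
{
    // Only < > & " ' are converted; ®, ™ and Japanese text pass through unchanged.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Hypothetical value as read back from the database.
$stored = '<b>日本語</b> ® ™ & "quoted"';

echo escape_for_html($stored), "\n";
// prints: &lt;b&gt;日本語&lt;/b&gt; ® ™ &amp; &quot;quoted&quot;
```

Because the symbols are never turned into entities, nothing needs to be decoded again when the text is read back.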