I am using HTML Purifier in my PHP project and am having trouble getting it to work properly with user input.

I have users enter HTML using a WYSIWYG editor (TinyMCE), but whenever a user enters the HTML entity &nbsp; (non-breaking space), it gets saved into the database as a weird foreign character (Â).

However, when I edit the saved entry in the WYSIWYG editor, it is displayed properly as &nbsp;. It also behaves correctly when the page is displayed, except that in the source code it appears as a real space rather than as the non-breaking space entity.

Also, in the MySQL database it displays as the weird foreign character.

I read the documentation on Unicode and HTML Purifier and changed my database and web page encoding to UTF-8, but I am still having problems with the non-breaking space character being mangled. The other HTML entities, such as &lt; and &gt;, get saved as < and >, so why not &nbsp;?

A: 

The non-breaking space isn't being saved in your database as one weird foreign character; it's being saved as two characters. The Unicode non-breaking space character (U+00A0) is encoded in UTF-8 as the two bytes 0xC2 0xA0, which, read as ISO-8859-1, look like "Â " (i.e. the weird foreign character Â followed by a non-breaking space).
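The byte-level claim above can be checked in a few lines. This is only an illustrative sketch in Python rather than PHP; the variable names are made up for the example, and the last step assumes the misread text ends up in a UTF-8 column, which may not match your exact setup.

```python
# U+00A0 is the Unicode no-break space that &nbsp; stands for.
nbsp = "\u00a0"

# Encoded as UTF-8 it becomes two bytes.
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes)  # b'\xc2\xa0'

# Reading those same two bytes as ISO-8859-1 yields two characters:
# U+00C2 (A with circumflex) followed by U+00A0 (no-break space).
misread = utf8_bytes.decode("iso-8859-1")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00C2', 'U+00A0']

# Assumption: the misread text is then stored in a UTF-8 column,
# so each of the two characters is encoded again.
stored = misread.encode("utf-8")
print(stored)  # b'\xc3\x82\xc2\xa0'
```

If the stored value in your table looks like the last line (four bytes for one non-breaking space), the connection character set is the likely culprit rather than HTML Purifier.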

You're probably forgetting to run SET NAMES 'utf8' on your database connection, so MySQL interprets the data PHP sends as ISO-8859-1 (the default).

Have a look at "UTF-8 all the way through…" to see how to properly set up UTF-8 when using PHP and MySQL.

mercator
A: 

It may also help to know that &#160; is the numeric equivalent of &nbsp;, which you will likely need if you ever output any human-readable XML ;)
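To illustrate why the numeric form matters for XML, here is a small sketch using Python's standard library (not PHP); the sample markup is invented for the example. A plain XML parser with no DTD knows only the five predefined entities, so it accepts &#160; but rejects &nbsp;.

```python
import html
import xml.etree.ElementTree as ET

# The numeric character reference is understood by any XML parser.
elem = ET.fromstring("<p>a&#160;b</p>")
print(ascii(elem.text))  # 'a\xa0b'

# The named entity is not predefined in XML, so parsing fails
# unless a DTD that declares it is supplied.
try:
    ET.fromstring("<p>a&nbsp;b</p>")
    named_ok = True
except ET.ParseError:
    named_ok = False
print(named_ok)  # False

# In HTML, both spellings decode to the same character, U+00A0.
print(html.unescape("&nbsp;") == html.unescape("&#160;"))  # True
```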

Jay