I'm using PHP against a UTF-8 compliant database. Here's how input goes in:

  1. user types input into textarea
  2. textarea encoded with javascript escape()
  3. passed via HTTP post
  4. decoded with PHP rawurldecode()
  5. passed through HTMLPurifier with default settings
  6. escaped for MySQL and stored in database

It comes back out in the usual way, and I run unescape() on page load. This is so people can, say, copy and paste directly from a Word document and have the smart quotes show up.

But HTMLPurifier seems to be clobbering special characters that escape() turns into a simple single-byte % expression, like Ö, which escapes to %D6. Smart quotes, by contrast, escape to something like %u201C and go into the database that way. HTMLPurifier takes out both the special character and the one immediately following it.

I need to change something in this process. Perhaps I need to change multiple things.

What can I do to not get special characters clobbered?

+4  A: 
  2. textarea encoded with javascript escape()

escape() isn't safe for non-ASCII characters. Use encodeURIComponent() instead.
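To make the difference concrete, here is a minimal sketch comparing the two functions on the characters from the question. The helper name `compareEncodings` is made up for illustration; `escape` and `encodeURIComponent` are the standard globals.

```javascript
// Compare the legacy escape() with encodeURIComponent() on the two kinds of
// characters mentioned in the question.
const samples = ['Ö', '\u201C']; // O-umlaut and a left smart quote

// Hypothetical helper: returns both encodings of a single character.
function compareEncodings(ch) {
  return {
    legacy: escape(ch),           // Latin-1 style %XX, or non-standard %uXXXX
    utf8: encodeURIComponent(ch), // percent-encoded UTF-8 bytes
  };
}

for (const ch of samples) {
  const { legacy, utf8 } = compareEncodings(ch);
  console.log(ch, legacy, utf8);
}
// Ö  ->  %D6      vs  %C3%96
// “  ->  %u201C   vs  %E2%80%9C
```

A lone %D6 decodes to the single byte 0xD6, which is not valid UTF-8 on its own, so it is plausible that a UTF-8 aware filter drops it along with the byte that follows.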

  3. passed via HTTP post

I assume that you use XMLHttpRequest? If not, make sure that the page containing the form is served as UTF-8.

  4. decoded with PHP rawurldecode()

If you access the value through $_POST, you should not decode it, since PHP has already done that for you. Decoding a second time will mess up the data.
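A rough illustration of why the second decode is harmful, using JavaScript's decodeURIComponent() as a stand-in for the server side (an assumption for the sake of a runnable sketch; in PHP the first decode happens when $_POST is filled and the second would be the explicit rawurldecode() call). The sample input is invented.

```javascript
// Hypothetical user input that happens to contain a percent sequence.
const typed = 'Save %41 today';

// What the browser sends: the literal % is itself encoded as %25.
const onTheWire = encodeURIComponent(typed);

// First decode: stands in for the decoding already done before you read $_POST.
const decodedOnce = decodeURIComponent(onTheWire);

// Second decode: stands in for an extra rawurldecode() call.
const decodedTwice = decodeURIComponent(decodedOnce);

console.log(onTheWire);    // Save%20%2541%20today
console.log(decodedOnce);  // Save %41 today   (matches what the user typed)
console.log(decodedTwice); // Save A today     (the user's text has been altered)
```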

  6. escaped for MySQL and stored in database

Make sure you don't have magic quotes turned on. Make sure that the database stores tables as UTF-8 (the encoding and the collation must both be UTF-8). Make sure that the connection between PHP and MySQL is UTF-8 (use SET NAMES utf8 if you don't use PDO).

Finally, make sure that the page is served as UTF-8 when you output the string again.

troelskn
The system as it was could handle smart quotes just fine, so I didn't doubt my UTF chops. encodeURIComponent gets it through HTMLPurifier just fine, thanks. But I'm seeing %20s in $_POST. Weirdly enough, decodeURIComponent complains about invalid URIs when I try to run it on the resulting page, though unescape works just fine. Any ideas?
Glazius
Working smart quotes don't guarantee working UTF-8. Some clients will interpret iso-8859-1 as cp-1252 if it contains smart quotes. You need to double check the entire pipeline. Try getting it to work with a plain HTML form before adding JS on top of it. Try testing with Chinese characters, since they don't exist in either cp-1252 or iso-8859-1.
troelskn
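On the "invalid URI" complaint raised in the comments: one likely explanation is that the stored text still contains escape()-style sequences. The sketch below (the helper `tryDecode` is a made-up name) shows that decodeURIComponent() rejects them while unescape() accepts them, and uses a Chinese character as the round-trip test suggested above.

```javascript
// Hypothetical helper: decode, returning null instead of throwing on bad input.
function tryDecode(s) {
  try {
    return decodeURIComponent(s);
  } catch (e) {
    if (e instanceof URIError) return null; // malformed percent sequence
    throw e;
  }
}

// Output of the old escape(): not valid percent-encoded UTF-8.
console.log(tryDecode('%D6'));     // null (lone byte 0xD6 is incomplete UTF-8)
console.log(tryDecode('%u201C'));  // null (%uXXXX is not standard percent-encoding)
console.log(unescape('%D6'));      // Ö
console.log(unescape('%u201C'));   // “

// Output of encodeURIComponent(): round-trips cleanly, including Chinese.
const chinese = '中';
const encoded = encodeURIComponent(chinese);
console.log(encoded);              // %E4%B8%AD
console.log(tryDecode(encoded) === chinese); // true
```

If rows saved under the old escape() scheme are still in the database, they would need unescape() (or a one-off migration) while new rows use decodeURIComponent(); mixing the two on the same page is what produces the error.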