Driving me nuts...

Page with form is encoded as Unicode (UTF-8) via:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

entry column in database is text utf8_unicode_ci

copying text from a Word document with " in it, like this: “1922.” is insta-fail and ends up in the database as â��1922.â�� (typing new data into the form, including " works fine... it's cut and pasting from Word...)

PHP steps behind the scenes are:

  • grab value from POST
  • run through HTML Purifier default settings
  • run through mysql_real_escape_string
  • insert query into dbase

Help?

+1  A: 

Call mysql_set_charset to let the database know you are going to be sending it UTF-8 encoded strings.
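A rough sketch of what that looks like in practice. The `mysql_*` functions from the question were removed in PHP 7, so this swaps in the `mysqli` extension, whose `set_charset()` method serves the same purpose. The helper name `open_utf8_connection` and its parameters are made up for illustration.

```php
<?php
// Hypothetical helper: open a connection and declare that the client
// sends and expects UTF-8. Uses mysqli because mysql_* is gone in PHP 7+.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $link = new mysqli($host, $user, $pass, $db);

    // On PHP 8.1+ mysqli throws on failure by default; these checks
    // cover older setups where errors are reported via properties.
    if ($link->connect_error) {
        throw new RuntimeException('Connect failed: ' . $link->connect_error);
    }

    // Same purpose as mysql_set_charset('utf8') in the old API.
    if (!$link->set_charset('utf8')) {
        throw new RuntimeException('Could not set charset: ' . $link->error);
    }

    return $link;
}
```

The idea is to call this once at startup, before any escaping or queries happen, so everything on the connection agrees on the charset.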

typing new data into the form, including " works fine...

Well " is a normal ASCII quote. “ and ” aren't; they're smart quotes, which are non-ASCII characters. Whether they come from Word is unimportant; all your non-ASCII characters will be treated the same.
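To make the difference concrete, here is a small sketch comparing the two kinds of quote at the byte level. It assumes the `mbstring` extension is available, and it assumes the garbling comes from the UTF-8 bytes being read as Windows-1252 (which is what MySQL's `latin1` is based on); the actual cause on a given setup may differ.

```php
<?php
// Byte escapes are used so the result does not depend on how this file is saved.
$ascii = '"';              // U+0022, plain ASCII double quote
$smart = "\xE2\x80\x9C";   // U+201C, left smart quote, encoded as UTF-8

$asciiBytes = strlen($ascii);   // 1 byte
$smartBytes = strlen($smart);   // 3 bytes
$smartHex   = bin2hex($smart);  // "e2809c"

// Treat those three bytes as three separate Windows-1252 characters
// and convert each one to UTF-8. This reproduces the classic garbling.
$mojibake = mb_convert_encoding($smart, 'UTF-8', 'Windows-1252');

echo $asciiBytes, ' ', $smartBytes, ' ', $smartHex, ' ', $mojibake, "\n";
// prints: 1 3 e2809c â€œ
```

The ASCII quote survives any such mix-up because it is a single byte with the same meaning in both encodings, which is why typed quotes worked and pasted ones did not.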

  • grab value from POST
  • run through HTML Purifier default settings

That's a bad idea. HTML Purifier should be run over strings that are HTML and that you intend to output as HTML, for the relatively rare case where you need to let users submit HTML.

It is totally the wrong thing to run over all input text. Normally you should be allowing any old text, and then when you output that text inside HTML you should be calling htmlspecialchars() over it.

Otherwise you're breaking the ability of users to enter < and & like I am in this post, and you still risk cross-site-scripting when you are outputting processed or non-input-sourced data.
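A minimal sketch of escaping at output time instead. The short helper name `h` and the sample value are just illustrations; the text is stored exactly as typed and only escaped when it is dropped into HTML.

```php
<?php
// Hypothetical short helper: escape a string for HTML text or attribute values.
function h(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$name = 'Tom & <Jerry>';   // stored as-is, no stripping on input

$html = '<p>Hello ' . h($name) . '!</p>';
echo $html, "\n";
// prints: <p>Hello Tom &amp; &lt;Jerry&gt;!</p>
```

With this approach the user can still type < and & freely, and the browser shows them literally rather than interpreting them as markup.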

bobince
Hi Bob! I was using HTML Purifier to strip out all HTML from that form field, as it does get displayed on the site. Is that still a bad practice?
Andrew Heath
If text content from the database is getting output as raw HTML, that's a really bad thing. You need to fix it on the output end by calling `htmlspecialchars()` every time you drop a string into HTML, for example: `<p>Hello <?php echo htmlspecialchars($name); ?>!</p>` or `echo "<p>Hello ".htmlspecialchars($name)."!</p>"`; *never* `echo "<p>Hello $name!</p>";`. (You can make a function with a shorter name to avoid so much typing.) You can't fix this properly on the input end at all, HTMLPurifier or no.
bobince
I'm currently running output through htmlentities before display. I don't also need to use htmlspecialchars, do I?
Andrew Heath
You should use `htmlspecialchars()` as a better version of `htmlentities()`. `htmlentities` needlessly tries to encode all non-ASCII characters, and (before PHP 5.4) defaults to treating them as ISO-8859-1, so if you're using any other charset like UTF-8, it will totally screw them over unless you remember to pass the `charset` argument in each call. `htmlspecialchars()` only encodes the few characters like `<` that really need it. It's almost always the better function; it's a shame that so many crappy PHP tutorials, if they bother to mention HTML-escaping at all, jump for nasty `htmlentities()`.
bobince
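A quick sketch of the difference described in the comment above. The sample string is made up, and the ISO-8859-1 argument is passed explicitly to reproduce the old default, since PHP 5.4 and later default to UTF-8.

```php
<?php
$text = "caf\xC3\xA9 <b>";   // "café <b>" encoded as UTF-8

// Only the markup-significant characters are touched; the é passes through.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
// "café &lt;b&gt;"

// Reading the UTF-8 bytes as ISO-8859-1 turns the two bytes of é
// into two separate entities.
$wrong = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');
// "caf&Atilde;&copy; &lt;b&gt;"

echo $special, "\n", $wrong, "\n";
```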
Thanks for sticking with me Bob. I'll make the adjustments and test it tonight.
Andrew Heath
Everything is coming out OK now, Bob. Adding mysql_query('SET NAMES utf8'); to my init.php include sorted it out in the end. Thank you for the related help with htmlspecialchars/htmlentities!
Andrew Heath
Yes, `SET NAMES utf8` does the same thing as `mysql_set_charset`, albeit in a marginally less efficient way.
bobince
+1  A: 

“1922.” and "1922." are two different strings.
The quotes pasted from Word are not ASCII double quotes: “ != "

The column that you describe is text utf8_unicode_ci. utf8_unicode_ci is the collation; make sure the charset on that column is also set to utf8.

Then I would make sure that you set up the correct encoding for each connection using SET NAMES utf8 COLLATE utf8_unicode_ci...

If you've done that and it's still not saved properly, make sure your PHP has the mbstring extension enabled and try working with the mb_ functions.
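As a small illustration of why the mb_ functions matter here, the sketch below checks that mbstring is loaded and compares byte counts with character counts for the string from the question. The variable names are only for illustration.

```php
<?php
$hasMb = extension_loaded('mbstring');   // true when mbstring is enabled

$quoted = "\xE2\x80\x9C1922.\xE2\x80\x9D";   // “1922.” encoded as UTF-8

$bytes = strlen($quoted);               // 11: counts bytes
$chars = mb_strlen($quoted, 'UTF-8');   // 7: counts characters

// Validate that incoming text really is well-formed UTF-8.
$valid  = mb_check_encoding($quoted, 'UTF-8');      // true
$broken = mb_check_encoding("\xE2\x80", 'UTF-8');   // false: truncated sequence

var_dump($hasMb, $bytes, $chars, $valid, $broken);
```

Running a check like `mb_check_encoding()` on the POST value is one way to confirm the form data arrives as valid UTF-8, which would point the finger at the database connection rather than the page.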

There are many possible root causes, but I think setting the charset on the column and SET NAMES ... should solve it.

michal kralik
calling mysql_query('SET NAMES utf8'); in the init.php include was the final piece of the puzzle, thank you
Andrew Heath