views: 528
answers: 3

I have an HTML form, and some users are copy/pasting text from MS Word. When there are single quotes or double quotes, they get translated into funny characters like:

â€™ and ’

The database column is collation utf8_general_ci.

How do I get the appropriate characters to show up?

Edit: Problem solved. Here's how I fixed it:

Ran mysql_query("SET NAMES 'utf8'"); before adding to/retrieving from the database (thanks to Donal's comment below).

Somewhat oddly, the PHP function urlencode($text) was being applied when displaying the text, so that had to be removed.

I also made sure that the headers for the page and the ajax request/response were all utf8.
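The steps above can be sketched as follows (a minimal sketch, not the asker's exact code; the mysql_query() call needs a live MySQL connection, so it is left as a comment):

```php
// Sketch of the fix described above:
//
//   mysql_query("SET NAMES 'utf8'");                   // before reads/writes
//   header('Content-Type: text/html; charset=UTF-8');  // before any output
//
// The stray urlencode() was the other culprit: it percent-escapes every
// byte of a multi-byte character, so a curly quote reaches the browser
// as literal %-sequences instead of text:
echo urlencode("\xE2\x80\x99");  // the UTF-8 bytes of ’ → %E2%80%99
```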

+2  A: 

Check the encoding that the page uses. Encode it using UTF-8 as well, and add a meta tag describing the encoding:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Marius
A: 

We have a PHP function that tries to clean up the mess with smart quotes. It's somewhat messy itself, since it grew organically as cases popped up during prototype development. It may be of some help, though:

function convert_smart_quotes($string) {
    // UTF-8 byte sequences for Word's "smart" punctuation
    $search = array(chr(0xe2) . chr(0x80) . chr(0x98),  // ‘ left single quote
                    chr(0xe2) . chr(0x80) . chr(0x99),  // ’ right single quote
                    chr(0xe2) . chr(0x80) . chr(0x9c),  // “ left double quote
                    chr(0xe2) . chr(0x80) . chr(0x9d),  // ” right double quote
                    chr(0xe2) . chr(0x80) . chr(0x93),  // – en dash
                    chr(0xe2) . chr(0x80) . chr(0x94),  // — em dash
                    // mojibake forms (the same bytes misread as Windows-1252)
                    'â€™', 'â€œ', 'â€' . chr(0x9d), 'â€"',
                    '  ');                              // double space

    $replace = array("'", "'", '"', '"', ' - ', ' - ',
                     "'", '"', '"', ' - ', ' ');

    return str_replace($search, $replace, $string);
}
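The core mapping can be checked in isolation (the sample input is made up, with the smart quotes written as explicit UTF-8 bytes):

```php
// Map the UTF-8 smart-quote bytes to their ASCII equivalents.
$search  = array("\xE2\x80\x98", "\xE2\x80\x99",   // ‘ ’
                 "\xE2\x80\x9C", "\xE2\x80\x9D");  // “ ”
$replace = array("'", "'", '"', '"');

// “Hello,” it’s done  →  "Hello," it's done
echo str_replace($search, $replace,
                 "\xE2\x80\x9CHello,\xE2\x80\x9D it\xE2\x80\x99s done");
```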
Mike A.
I've done this myself, but I think it's a bad idea. If you have a text process or any other kind of process that corrupts your data, fix the process so it doesn't corrupt the data, don't just make piecemeal corrections to the output.
d__
+1  A: 

This looks like a classic case of Unicode (most likely UTF-8) characters being interpreted as iso-8859-1. There are several places along the way where the characters can get corrupted.

First, the client's browser sends the data; it can corrupt the data if it can't properly convert the characters to the page's character encoding. Then the server reads the data and decodes the bytes into characters; if the client and server disagree about the encoding used, the characters will be corrupted. Then the data is stored in the database, where there is again potential for corruption. Finally, when the data is written to the page for display, the browser may misinterpret the bytes if the page doesn't adequately indicate its encoding.
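That misinterpretation can be reproduced directly (assuming the mbstring extension is available; browsers usually treat iso-8859-1 as its superset Windows-1252, which is what makes € and ™ appear):

```php
// The UTF-8 bytes of ’ (U+2019), read as if they were Windows-1252,
// become the three characters â € ™ — the classic mojibake.
$utf8Quote = "\xE2\x80\x99";
echo mb_convert_encoding($utf8Quote, 'UTF-8', 'Windows-1252');  // â€™
```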

You need to ensure that you are using UTF-8 throughout. The default for web pages is iso-8859-1, so your pages should be served with a Content-Type header that names UTF-8, or with the meta tag

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

(make sure you really are serving the text in that encoding).
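One quick sanity check on the bytes you are about to serve (assumes the mbstring extension):

```php
// Valid UTF-8 passes; a lone Windows-1252 quote byte does not.
var_dump(mb_check_encoding("\xE2\x80\x99", 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding("\x92", 'UTF-8'));          // bool(false)
```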

By using UTF-8 throughout the process you will avoid these problems with any working web browser and database.

Mr. Shiny and New
+1, there's no one local fix for these problems, the important thing is the mindset of being encoding-aware wherever you're transmitting or storing text.
d__