views:

1149

answers:

4

I have started debugging my RSS feed because it has some strange characters in it (i.e. the missing-character glyph). I started with two excellent beginner resources:

The reason I believe our RSS feed is having problems is because users are copy&pasteing MS Word documents into a textarea on the site and our PHP pages are using the "iso-8859-1" charset which is incompatible with the special "Windows-1252" encodings for things like bullet points and smart quotes used by MS Word.

So I'm hoping to fix the issue, all I'll need to do is start using "utf-8" in the pages that take/give user input??. I.e. set the following in the HEAD section:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

The real reason I'm raising this question though, is because my DB fields that store my user input are in "latin1_swedish_ci" and I want to know whether I NEED to convert them to "utf8_general_ci"? MySQL doesn't really care about the charset does it? It just sees a bunch of bytes and If I put Unicode into a field collated as Latin it'll still come back out as Unicode right? Changing the field will be tiresome because the field is part of a FULLTEXT index where the other fields will also need their collation changing which means dropping the index and rebuilding it (which is no small task when there's large amounts of TEXT involved).

A: 

No - definitively not. As MySQL posseses the ability to transform strings from one character set into another on the fly, it's important though that your MySQL server knows what character set you're working with on the client side (client side = PHP script, NOT the client accessing your webpage). This can be done by issuing the query

SET NAMES 'utf8';

prior to any other query you send to the server. MySQL will then do the appropriate conversions from your client character set into the internal MySQL character set into the table and/or column character set and all the way back. So generally you only have to worry about setting the correct client character set. This character set must match the character set you use to output your data to the webserver.

Please have a look at the MySQL manual:

Stefan Gehrig
A: 

In HTTP the character encoding is declared by the charset parameter in the Content-Type header field of the HTTP response. Other declaration are overwritten by the declaration in the HTTP header:

[…] user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):

  1. An HTTP "charset" parameter in a "Content-Type" field.
  2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
  3. The charset attribute set on an element that designates an external resource.

Additionally you should explicitly declare the accepted character encoding with the accept-charset attribute in the form element. Otherwise the user agent may take (but must not) the character encoding used in the form document to encode the input data:

The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.

This should give you the best chance that the incoming data is encoded correctly. But it’s not guarateed. So better check if the data is acutally encoded with UTF-8 (there are functions/algorithms to do this).

Gumbo
+3  A: 

The real reason I'm raising this question though, is because my DB fields that store my user input are in "latin1_swedish_ci" and I want to know whether I NEED to convert them to "utf8_general_ci"?

No. latin1_swedish_ci and utf8_general_ci are collations - not charsets. The collation won't affect the way that characters are stored or input/output. It only controls how sorting functions order their results. The collation - to work as expected - should match the storage charset. So if your tables are stored in utf8, you should use a utf8 collation.

The storage charset for mysql is not directly tied to the charset in php. You can use utf8 as the storage characterset for Mysql, while using iso-8859-1 in php. In that case, you need to tell Mysql about it, by setting the charset on the connection (set names XXX). Mysql will then convert as needed. If you don't use the same charset on Mysql and php, you'll end up with the charset capacity that is the lowest dommon denominator, so even though strings are stored in utf8, you'll not have the full unicode range of characters available. Therefore you should use utf8 in both Mysql and php.

troelskn
A: 

To save someone some time searching for how to change the mysql connection charset nicely with pdo/mysql here's how i do it:

$dbc = new pdo('mysql:dbname=DBNAME;host=DBHOST', $user, $pw, array(PDO::MYSQL_ATTR_INIT_COMMAND => sprintf( "SET NAMES %s", $charset ) ) );
Galen