views:

392

answers:

5

We have an application hosted on both a local development server and a live site. We are experiencing UTF-8 corruption issues and are looking to figure out how to resolve them.

The system is run using symfony 1.0 with Propel.

On our development server, we are running PHP 5.2.0 and MySQL 5.0.32. We do not experience corrupted UTF-8 characters there.

On our live site, PHP 5.2.10 and MySQL 5.0.81 is running. On that server, certain characters such as ố and Σ are corrupted once they are stored in the database. The corrupted characters are showing up as either question marks or approximations of the original character with adjacent question marks.

Examples of corruption:

Uncorrupted: ố Corrupted: ô?

Uncorrupted: Σ Corrupted: ?

We are currently using the following techniques on both development and live servers:

  1. Executing the following queries prior to execution of any other queries:

    SET NAMES 'utf8' COLLATE 'utf8_unicode_ci'
    SET CHARSET 'utf8'
    
  2. Setting the <meta> Content-Type value to:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
  3. Adding the following to our .htaccess file:

    AddDefaultCharset utf-8
    
  4. Using mb_* (multibyte) PHP functions where necessary.

  5. Being sure to set database columns to use utf8_unicode_ci collation.

    These techniques are sufficient for our development site, but do not work on the live site.

On the live site I've also tried adding mysql_set_encoding('ut8', $mysql_connection) but this does not help either. I have found some evidence that newer versions of PHP and MySQL are mishandling UTF-8 character encodings.

A: 

Are the chars corrupted only once they are stored in the DB? What about PHP?

Try adding the accept-charset="UTF-8" attribute to your form elements.


Also, make sure your .php files are UTF-8 encoded without BOM.

Alix Axel
Hi. Yes, the characters are corrupted once they are stored in the DB. Again, our application works just fine end-to-end on our development server, so the issue is very likely present in PHP or MySQL itself.
jkndrkn
+1  A: 

Note that if you are using Doctrine or Propel mysql_set_encoding() is going to have no effect as both ORMs are PDO based (Propel < 1.3 is Creole/Mysqli based).

prodigitalson
Thanks, that is helpful. I'm currently using symfony 1.0 with Propel. I've run some test and it does seem that my application is using the Creole mysql and not the mysqli driver.
jkndrkn
Really? I didnt know here was a mysql driver i thought it only supported mysqli. Its been so long :-) Thanks for the tip!
prodigitalson
+3  A: 

Take a minimal example - submit a form with just say a Σ character, and isolate where in the path from browser to database it's being corrupted. Print the value of bin2hex($str) to a log in your PHP app, as soon as possible after the request is received from the browser, the last thing before you hand it off to the database, and anywhere else you suspect might be a problem area - for Σ it should print out cea3. In the database, run charset(col), hex(col) on the saved data - if it's all working it should print out ["utf8", "cea3"] (assuming that the tables are utf8).

The clearer you can be about how the characters are corrupted - do they show up as blank? as mojibake? as empty "tofu" boxes? is the string truncated at the problem character? what are the expected and observed byte representations? - the more clues you'll have about what might have caused it.

d__
Thanks. I've added examples of how the characters are corrupted. I'll run through your detailed isolation procedure today at some point.
jkndrkn
My suspicion, which you'll be able to confirm very quickly, is that you've accidentally made the tables latin1 instead of utf8, which would be consistent with the sigma being lost, while the ô is preserved but its following combining acute is lost. Do SHOW CREATE TABLE ... to confirm.
d__
My initial impression is Donal's last comment, be sure to double-check all of the encodings on your database, tables, and individual columns.
Cryo
Nope, charset is still utf8 and collation is still utf8_unicode_ci. Your suggestion does provide a clue to the possibility of some link in the data flow being misconfigured to use latin1, though. I'm currently working with our hosting provider to see if they can make some adjustments to our my.cnf file and see if that helps. Thanks again.
jkndrkn
+1  A: 

Our experienced system administrator discovered a fix:

alter database DB_NAME character set utf8;

This solved our problems completely.

jkndrkn
I was going to ask… are your columns in the DB set to utf8? Be aware that MySQL 5.x does not handle non-BMP characters at all, and those characters (such as a musical G-clef) will be lost.
MtnViewMark
Oh and look: StackOverflow seems to handle that character nicely!
MtnViewMark
My columns are, indeed set to utf8. If they were not, my application would not have worked properly on my development server.
jkndrkn
Heh! Spoke too soon: while the nice "live" AJAX handled the character, upon loading the page later I see thier DB munged the character!
MtnViewMark
A: 

I had the same problem, but it wasn't MySQL since Doctrine is taken care of everything, from DB collation to tables and connection in UTF-8.

I found substr() to be causing troubles with umlauts(ü) so I switched to mb_substr(), mb_strlen() etc. for user input data. Do not forget about the ini setting or it won't work:

ini_set('mbstring.internal_encoding','UTF-8');

OR

mbstring.internal_encoding = UTF-8; (in php.ini)

Luis Merino