I am in the process of fixing some bad UTF8 encoding. I am currently using PHP 5 and MySQL.

In my database I have a few instances of bad encodings that print like: î

  • The database collation is utf8_general_ci
  • PHP is using a proper UTF8 header
  • Notepad++ is set to use UTF8 without BOM
  • database management is handled in phpMyAdmin
  • not all cases of accented characters are broken

What I need is some sort of function that will help me map the instances of î, í, ü and others like them to their proper accented UTF8 characters.

A: 

It looks like your UTF-8 is being interpreted as ISO-8859-1 or Windows-1252 at some point.

When you say "In my database I have a few instances of bad encodings", how did you check this? Through your app, phpMyAdmin, or the command-line client? Are all UTF-8 characters showing up like this, or only some? Is it possible that the encoding was misidentified, so that data which was already UTF-8 was converted from ISO-8859-1 to UTF-8 a second time?

teambob
I use phpmyadmin for database management. And no, not all cases are badly encoded.
Jayrox
+13  A: 

I've had to try to 'fix' a number of broken UTF-8 situations in the past, and unfortunately it's never easy and often close to impossible.

Unless you can determine exactly how it was broken, and it was always broken in that exact same way, then it's going to be hard to 'undo' the damage.

If you want to try to undo the damage, your best bet would be to start writing some sample code, where you attempt numerous variations on calls to mb_convert_encoding() to see if you can find a combination of 'from' and 'to' that fixes your data. In the end, it's often best to not even bother worrying about fixing the old data because of the pain levels involved, but instead to just fix things going forward.
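As a starting point for that kind of experiment, here is a minimal sketch. It assumes the damage is a single round of "UTF-8 bytes misread as some single-byte encoding and then re-encoded as UTF-8". The helper name try_repair() and the candidate list are hypothetical, not something from this thread. An attempt is kept only when the recovered bytes are well-formed UTF-8, so strings that were correct to begin with are normally left alone.

```php
<?php
// Hypothetical helper: try to undo one round of "UTF-8 read as <encoding>" damage.
// $candidates lists the encodings the text might have been misread as (an assumption).
function try_repair($garbled, array $candidates = ['ISO-8859-1', 'Windows-1252'])
{
    $results = [];
    foreach ($candidates as $wrong) {
        // Map each character back to the single byte it had in $wrong.
        $bytes = mb_convert_encoding($garbled, $wrong, 'UTF-8');
        // Keep the attempt only if it changed something and is well-formed UTF-8.
        if ($bytes !== $garbled && mb_check_encoding($bytes, 'UTF-8')) {
            $results[$wrong] = $bytes;
        }
    }
    return $results;
}

// "î" is C3 AE in UTF-8; misread as ISO-8859-1 and re-encoded it becomes C3 83 C2 AE.
$garbled = "\xC3\x83\xC2\xAE";
$repairs = try_repair($garbled);
foreach ($repairs as $encoding => $fixed) {
    echo $encoding, ': ', bin2hex($fixed), "\n";
}
```

Characters that cannot be represented in a candidate encoding are replaced with a substitute character by mbstring, so any proposed repair should still be reviewed by hand before it is written back to the database.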

However, before doing this, you need to make sure that you fix everything that is causing this issue in the first place. You've already mentioned that your DB table collation and editors are set properly. But there are more places where you need to check to make sure that everything is properly UTF-8:

  • Make sure that you are serving your HTML as UTF-8:
    • header("Content-Type: text/html; charset=utf-8");
  • Change your PHP default charset to utf-8:
    • ini_set("default_charset", 'utf-8');
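  • If your database doesn't ALWAYS talk in utf-8, then you may need to tell it on a per-connection basis to ensure it's in utf-8 mode. In the mysql command-line client you do that by issuing:
    • charset utf8
    • (from PHP, the equivalent is to run the query SET NAMES utf8 on the connection, or to call mysqli_set_charset())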
  • You may need to tell your webserver to always try to talk in UTF8, in Apache this command is:
    • AddDefaultCharset UTF-8
  • Finally, you need to ALWAYS make sure that you are using PHP functions that are properly UTF-8 compliant. This means always using the mb_* styled 'multibyte aware' string functions. It also means that when calling functions such as htmlspecialchars(), you include the appropriate 'utf-8' charset parameter at the end to make sure that it doesn't encode them incorrectly.
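The last point in the list can be illustrated with a short sketch. The sample string is an assumption chosen for the demonstration, written with explicit byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "Fédération" written with explicit UTF-8 bytes.
$text = "F\xC3\xA9d\xC3\xA9ration";

// Byte-oriented functions count bytes; mb_* functions count characters.
$byteLength = strlen($text);
$charLength = mb_strlen($text, 'UTF-8');

// substr() can cut a multi-byte character in half; mb_substr() cuts between characters.
$byteCut = substr($text, 0, 2);
$charCut = mb_substr($text, 0, 2, 'UTF-8');

// Passing the charset explicitly, as the list above recommends.
$escaped = htmlspecialchars($text . ' <b>', ENT_QUOTES, 'UTF-8');

var_dump($byteLength, $charLength);
var_dump(mb_check_encoding($byteCut, 'UTF-8'), mb_check_encoding($charCut, 'UTF-8'));
echo $escaped, "\n";
```

Here substr() returns "F" followed by half of the "é", which is no longer valid UTF-8, while mb_substr() returns the first two whole characters.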

If you slip up on any one step through your whole process, the encoding can be mangled and problems arise. Once you get in the 'groove' of doing utf-8 though, this all becomes second nature. And of course, PHP6 is supposed to be fully Unicode compliant from the get-go, which will make lots of this easier (hopefully).

Eli
Thank you very much! Because there are also many correctly encoded strings in the DB, which makes the problem worse, I chose to str_replace the strings I know are corrupt with their correct characters. It works great. I have already implemented most of your tips regarding PHP and server setup, but it is a great summary, so I would choose this as the answer, because my solution is not really beautiful.
Paul Weber
One important note on this advice: Do NOT add 'utf-8' as the charset argument to the function htmlspecialchars(). Without the argument, that function does the correct thing with UTF-8 strings, since it ignores all bytes with the high bit set and passes them through unchanged, which preserves them. With 'utf-8', htmlspecialchars() interprets the UTF-8 string but doesn't handle characters outside the BMP (those with code points U+10000 and above, encoded in four bytes). It incorrectly encodes those that happen to match the specials mod 65536. The behavior is both slower and wrong.
MtnViewMark
+1  A: 

I know this isn't very elegant, but after it was mentioned that the strings may be double encoded, I made this function:

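function fix_double_encoding($string)
{
    // Characters to repair (not exhaustive). This file must be saved as UTF-8.
    $utf8_chars = explode(' ', 'À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö');
    // Build the form each character takes after two extra passes through utf8_encode().
    $utf8_double_encoded = array();
    foreach ($utf8_chars as $utf8_char)
    {
        $utf8_double_encoded[] = utf8_encode(utf8_encode($utf8_char));
    }
    // Replace every garbled sequence with the character it stands for.
    return str_replace($utf8_double_encoded, $utf8_chars, $string);
}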

This seems to work perfectly to remove the double encoding I am experiencing. I am probably missing some of the characters that could be an issue to others. However, for my needs it is working perfectly.

Jayrox
+1  A: 

The way to do it is to convert to binary and then to the correct encoding.

Dan
+1  A: 

As Dan pointed out: you need to convert them to binary and then convert/correct the encoding.

E.g., for utf8 stored as latin1 the following SQL will fix it:

UPDATE table
   SET field = CONVERT( CAST(field AS BINARY) USING utf8)
 WHERE $broken_field_condition
blueyed
interesting; i'll remember this if i have the issue again. thanks
Jayrox
+1  A: 

If you apply utf8_encode() to an already UTF8 string it will return a garbled UTF8 output.
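A small sketch of that effect follows. It uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode(), which is deprecated as of PHP 8.2. The helper name latin1_to_utf8() is hypothetical.

```php
<?php
// Stand-in for utf8_encode(): treat the input as ISO-8859-1 and convert it to UTF-8.
function latin1_to_utf8($s)
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "\xE9";                   // "é" as a single ISO-8859-1 byte
$once   = latin1_to_utf8($latin1);  // correct UTF-8 for "é"
$twice  = latin1_to_utf8($once);    // the already-UTF-8 bytes encoded again

echo bin2hex($once), "\n";   // c3a9
echo bin2hex($twice), "\n";  // c383c2a9
```

The second result is what a browser shows as "Ã©": each byte of the correct UTF-8 sequence has been treated as its own Latin-1 character and encoded separately.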

I made a function that addresses all these issues. It's called forceUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1) or UTF8, or the string can have a mix of the two. forceUTF8() will convert everything to UTF8.

I did it because a service was giving me a feed of data all messed up, mixing UTF8 and Latin1 in the same string.

Usage:

$utf8_string = forceUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = forceLatin1($utf8_or_latin1_or_mixed_string);
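For readers who want to see the general idea, here is a minimal sketch of one way to handle a mixed string. It is not the source of forceUTF8(), and the name to_utf8_mixed() is hypothetical. It assumes that any byte sequence that already looks like a UTF-8 multi-byte character should be kept, and that every other high byte is a Latin1 character.

```php
<?php
// Hypothetical sketch: keep sequences that look like UTF-8, convert stray high bytes from Latin1.
function to_utf8_mixed($bytes)
{
    // Group 1: a loosely well-formed UTF-8 multi-byte sequence.
    // Group 2: any other single byte with the high bit set.
    $pattern = '/(
          [\xC2-\xDF][\x80-\xBF]
        | [\xE0-\xEF][\x80-\xBF]{2}
        | [\xF0-\xF4][\x80-\xBF]{3}
        ) | ([\x80-\xFF])/x';

    return preg_replace_callback($pattern, function ($m) {
        if ($m[1] !== '') {
            return $m[1]; // already UTF-8, keep as-is
        }
        // Stray high byte: treat it as ISO-8859-1 and convert it.
        return mb_convert_encoding($m[2], 'UTF-8', 'ISO-8859-1');
    }, $bytes);
}

// "café" with a Latin1 "é" (E9) next to "naïve" with a UTF-8 "ï" (C3 AF).
$mixed = "caf\xE9 na\xC3\xAFve";
echo bin2hex(to_utf8_mixed($mixed)), "\n";
```

The approach is inherently ambiguous: a genuine Latin1 pair such as "Ã" followed by "©" has the same bytes as a UTF-8 "é" and will be kept as "é".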

I've included another function, fixUTF8(), which will fix every UTF8 string that looks garbled.

Usage:

$utf8_string = fixUTF8($garbled_utf8_string);

Examples:

echo fixUTF8("Fédération Camerounaise de Football");
echo fixUTF8("Fédération Camerounaise de Football");
echo fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

Download:

http://dl.dropbox.com/u/186012/PHP/forceUTF8.zip

Sebastián Grignoli