I have received a database full of people's names and other data in French, which means it uses characters such as é, è, ö, û, etc. There are around 3,000 entries.
Apparently, the data inside has sometimes been encoded with utf8_encode() and sometimes not. This results in messed-up output: in some places the characters show up fine, in others they don't.
At first I tried to track down every place in the UI where these issues arise and apply utf8_decode() where necessary, but that's really not a practical solution.
I did some testing and there is no reason to use utf8_encode() in the first place, so I'd rather remove all of that and just work in UTF-8 everywhere - at the browser, middleware and database levels. So I need to clean the database, replacing all mis-encoded data with its cleaned-up version.
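For what it's worth, once I have a reliable conversion function (which is what this question is about), the cleanup itself would just be a one-off script along these lines. This is only a sketch: the table and column names ("people", "id", "name") are made up, I'm assuming a MySQL database accessed through PDO, and fix_double_encoding() stands for the function I'm looking for:

    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');
    $pdo->exec('SET NAMES utf8'); // talk to MySQL in UTF-8

    $rows   = $pdo->query('SELECT id, name FROM people')->fetchAll(PDO::FETCH_ASSOC);
    $update = $pdo->prepare('UPDATE people SET name = :name WHERE id = :id');

    foreach ($rows as $row) {
        $fixed = fix_double_encoding($row['name']); // the function I'm asking about
        if ($fixed !== $row['name']) {
            $update->execute(array('name' => $fixed, 'id' => $row['id']));
        }
    }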
Question: would it be possible to write a function in PHP that checks whether a UTF-8 string is correctly encoded (stored without utf8_encode()) or not (stored with utf8_encode()), and, if it was double-encoded, converts it back to its original state?
In other words: I would like to know how I could tell UTF-8 content that has been utf8_encode()d apart from UTF-8 content that has not been utf8_encode()d.
**UPDATE: EXAMPLE**
Here is a good example: take a string full of special characters, then take a copy of that string and utf8_encode() it. The function I'm dreaming of takes both strings, leaves the first one untouched, and turns the second one into the same string as the first.
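To put that in code, the contract I'm after looks like this; fix_double_encoding() is just a placeholder name for the function I'm looking for:

    $good = "éèöûêïà";          // correctly stored UTF-8
    $bad  = utf8_encode($good); // the same text, double-encoded

    var_dump(fix_double_encoding($good) === $good); // should print bool(true): left untouched
    var_dump(fix_double_encoding($bad)  === $good); // should print bool(true): converted back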
I tried this:
    $loc_fr = setlocale(LC_ALL, 'fr_BE.UTF8', 'fr_BE@euro', 'fr_BE', 'fr', 'fra', 'fr_FR');

    $str1 = "éèöûêïà ";
    $str2 = utf8_encode($str1);

    function convert_charset($str) {
        $charset = mb_detect_encoding($str);
        if ($charset == "UTF-8") {
            return utf8_decode($str);
        }
        else {
            return $str;
        }
    }

    function correctString($str) {
        echo "\nbefore: $str";
        $str = convert_charset($str);
        echo "\nafter: $str";
    }

    correctString($str1);
    echo('<hr/>'."\n");
    correctString($str2);
And that gives me:
before: éèöûêïà after: �������
before: éèöûêïà after: éèöûêïà
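One idea I'm toying with, though I have no idea how robust it is: a string that was utf8_encode()d on top of UTF-8 should still look like valid UTF-8 after a single utf8_decode(), whereas a correctly encoded string with accented characters generally shouldn't. Something like this sketch:

    function fix_double_encoding($str) {
        $decoded = utf8_decode($str);
        // if the string still validates as UTF-8 after one utf8_decode(),
        // it was presumably utf8_encode()d on top of UTF-8, so keep the decoded version
        if (mb_check_encoding($decoded, 'UTF-8')) {
            return $decoded;
        }
        // otherwise it looks like it was already correct UTF-8: leave it alone
        return $str;
    }

Would that be reliable for data like mine, or are there cases where it misfires (plain ASCII values, strings mixing both encodings, etc.)?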
Thanks,
Alex