I imported some data using LOAD DATA INFILE into a MySQL database. The table itself and its columns use the UTF8 character set, but the default character set of the database is latin1. Because the database default is latin1, and I used LOAD DATA INFILE without specifying a character set, it interpreted the file as latin1, even though the data in the file was UTF8. Now I have a bunch of badly encoded data in my UTF8 column. I found this article which seems to address a similar problem, which is "UTF8 inserted in cp1251", but my problem is "Latin1 inserted in UTF8". I've tried editing the queries there to convert the latin1 data to UTF8, but can't get it to work. Either the data comes out the same, or even more mangled than before. Just as an example, the word Québec is showing as QuÃ©bec.

[ADDITIONAL INFO]

When Selecting the data wrapped in HEX(), Québec has the value 5175C383C2A9626563.
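
As a sanity check on that hex value, here is a small Python sketch that decodes the bytes. It only illustrates what the stored bytes contain and is not part of the MySQL setup; the variable names are made up for the example.

```python
# Bytes reported by HEX(City) for the mangled row.
stored = bytes.fromhex("5175C383C2A9626563")

# Decoded as UTF-8 (the column's character set), this is the mangled text.
as_utf8 = stored.decode("utf-8")
print(as_utf8)  # QuÃ©bec

# A correctly stored "Québec" would be shorter: é is the two bytes C3 A9.
expected = "Québec".encode("utf-8")
print(expected.hex().upper())  # 5175C3A9626563
```

The extra C3 83 C2 A9 sequence is what you would expect if the two bytes of é (C3 A9) were each treated as a separate Latin-1 character and then encoded to UTF-8 again.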

The (shortened) CREATE TABLE statement for this table is:

CREATE TABLE MyDBName.`MyTableName`
(
`ID` INT NOT NULL AUTO_INCREMENT, 
.......
`City` CHAR(32) NULL, 
.......
`)) ENGINE InnoDB CHARACTER SET utf8;
A: 

Converting latin1 to UTF8 is not what you want to do; you need roughly the opposite.

If what really happened was this:

  1. UTF-8 strings were interpreted as Latin-1 and transcoded to UTF-8, mangling them.
  2. You are now, or could be, reading UTF-8 strings with no further interpretation.

then what you must do now is:

  1. Read the "UTF-8" data with no transcoding.
  2. Convert it to Latin-1. Now you should actually have the original UTF-8 bytes.
  3. Put those bytes in your "UTF-8" column with no further conversion.
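
The round trip described above can be sketched at the byte level in Python. This is only an illustration of the mechanism, not a MySQL fix: the helper names `mangle` and `repair` are hypothetical, and Python's `latin-1` codec is assumed to be close enough to MySQL's latin1 for this example (MySQL's latin1 is based on cp1252, so bytes in the 0x80-0x9F range may behave differently).

```python
def mangle(original: str) -> str:
    """Reproduce the bad import: UTF-8 bytes misread as Latin-1."""
    return original.encode("utf-8").decode("latin-1")


def repair(mangled: str) -> str:
    """Undo it: back to Latin-1 bytes, then read those bytes as UTF-8."""
    return mangled.encode("latin-1").decode("utf-8")


broken = mangle("Québec")
print(broken)                                # QuÃ©bec
print(broken.encode("utf-8").hex().upper())  # 5175C383C2A9626563
print(repair(broken))                        # Québec
```

In MySQL the same round trip is commonly written as `CONVERT(CAST(CONVERT(City USING latin1) AS BINARY) USING utf8)`, with `City` being the column from the question. Treat that as a suggestion to try in a SELECT on a copy of the table before running any UPDATE.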
DigitalRoss
+1  A: 

LOAD DATA INFILE lets you specify the encoding the file is supposed to be in, using its CHARACTER SET clause:

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

FractalizeR
Yeah, I wish I had realized this beforehand, but now the data is already mangled. I wanted to know if I could fix it without reimporting it.
Kibbee
+2  A: 

I've had cases like this in old WordPress installations, where the data itself was already UTF-8 but sat in a Latin1 database (due to the WordPress default charset). That means the data itself needs no conversion; only the database and table definitions do. In my experience things get messed up during the dump because, as I understand it, MySQL uses the client's default character set, which in many cases is now UTF-8. So it is very important to export with the same encoding the database is declared in. For a Latin1 database that actually holds UTF-8 data:

$ mysqldump --default-character-set=latin1 --databases wordpress > m.sql

Then replace the Latin1 references in the exported dump before reimporting it into a new UTF-8 database, along these lines:

$ replace "CHARSET=latin1" "CHARSET=utf8" \
    "SET NAMES latin1" "SET NAMES utf8" < m.sql > m2.sql
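
If the `replace` utility is not available on your system, the same substitution can be sketched in Python. This is a minimal sketch, not a tested migration script: `retarget_dump` is a hypothetical helper name, and it works on raw bytes on the assumption that the dump should not be re-encoded while it is being edited.

```python
def retarget_dump(dump: bytes) -> bytes:
    """Rewrite latin1 charset declarations in a dump to utf8.

    Operates on bytes so the row data is passed through untouched.
    """
    replacements = {
        b"CHARSET=latin1": b"CHARSET=utf8",
        b"SET NAMES latin1": b"SET NAMES utf8",
    }
    for old, new in replacements.items():
        dump = dump.replace(old, new)
    return dump


# Made-up two-line sample standing in for the contents of m.sql.
sample = (
    b"/*!40101 SET NAMES latin1 */;\n"
    b"CREATE TABLE t (c CHAR(32)) ENGINE=InnoDB DEFAULT CHARSET=latin1;\n"
)
print(retarget_dump(sample).decode("ascii"))
```

For a real dump you would read m.sql in binary mode, pass the contents through the function, and write the result to m2.sql, also in binary mode.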

In my case this link was of great help. I commented on it here, in Spanish.

luison
+1  A: 

I wrote http://code.google.com/p/mysqlutf8convertor/ to convert a Latin database to a UTF-8 database. It changes all tables and fields to UTF-8.

saturngod