views:

370

answers:

1

We're often faced with the need to send a data file to one of our clients with data from the database he/she needs to translate. Most of the time this export is CSV or XLS. Most of the time we create a csv dump with phpmyadmin and get an xls file in return with the translated data. The problem is that most of the time the data is UTF8 and when the file is returned as xls each and every time we load the data into mysql again we end up with utf8 problems, characters not being displayed properly, etc ...

We've already doublechecked everything in mysql from my.conf to column charactersets and everything is set correctly to UTF8.

My question is not how to fix the encoding issue since that's been solved but how we would best proceed in the future handling this situation? What export format should we hand over? How should we import (just mysql load data infile or our own processing scripts). What is the general consensus on how to handle this situation?

We would like to continue using excel if possible since that's the format almost everybody expects including our clients' translation agencies. Our clients' ease of use is the most important factor here, without overloading us with major issues each time. The best of both worlds :)

+1  A: 

The application I am currently working on includes the functionality of data import as well. The data is mostly encoded in utf-8.

My approach is to preprocess the imported CSV (or tab delimited)(in any encoding) file to a correct utf-8 encoded temporary CSV file in client script (Python) and load the contents of the file using LOAD DATA INFILE statement.

The encoding of the file is controled by character_set_database system variable (the variable should be set on the server level) and starting from MySQL 5.1.17 can be overridden by the CHARACTER SET clause of the LOAD DATA INFILE.

The only thing one should know is that MySQL stores up to 3 bytes for each character instead of 4 (that might be a problem for orient languages).

To export lots of data efficienly you can use SELECT ... INTO OUTFILE statement.

newtover
So you actually know how/what format and encoding you'll receive when the client returns the dump? That unfortunately isn't the case :(
ChrisR
@ChrisRamakers: what is the question then? do you want an automated way of understanding the encoding of the imported file?
newtover