I received a bunch of CSV files from a client (that appear to be a database dump), and many of the columns have weird characters like this:

  • Alain Lefèvre
  • Angèle Dubeau & La Pietà

That seems like an awful lot of characters to represent an é. Does anyone know what encoding would produce that many characters for é? I have no idea where they're getting these CSV files from, but assuming I can't get them in a better format, how would I convert them to something like UTF-8?

+5  A: 

It seems like it's double-re-misdecoded UTF-8. It may be possible to recover the data by opening it as UTF-8, saving it as Latin-1 (perhaps), and opening it as UTF-8 again.
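A minimal Python sketch of one such round trip, assuming the garbled text has already been read in as UTF-8. The helper name `undo_one_round` is made up for illustration. For the sample in the question, Latin-1 appears not to be enough, because the garbled text contains characters such as ƒ that Latin-1 cannot represent; cp1252 can.

```python
def undo_one_round(text, codec):
    """Re-encode text with a single-byte codec, then decode the bytes as UTF-8."""
    return text.encode(codec).decode("utf-8")

# "Lef" + the eight garbled characters from the question + "vre",
# written with escapes so the example does not depend on file encoding.
garbled = "Lef\u00c3\u0192\u00c6\u2019\u00c3\u201a\u00c2\u00a8vre"

try:
    undo_one_round(garbled, "latin-1")
except UnicodeEncodeError as exc:
    # U+0192 (ƒ) and the curly punctuation have no Latin-1 byte.
    print("latin-1 failed:", exc.reason)

# cp1252 does have bytes for those characters, so one layer comes off.
print(undo_one_round(garbled, "cp1252"))
```

One pass only removes one layer of damage, so the result here is still garbled, just shorter.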

Tordek
+1 for double-re-misdecoded :)
Pekka
I'll raise you one, it's triple-re-misread-as-cp1252 UTF-8 :)
d__
A: 

That seems like an awful lot of characters to represent an é.

Remember, character ≠ byte. What you're seeing in the output is characters; you'll need to do something unusual to actually see the bytes. (I suggest ‘xxd’, a tool that is installed with the Vim application; or ‘od’, one of the core utilities of the GNU operating system.)
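If neither tool is at hand, the same distinction can be seen from a Python prompt. This is only a sketch; the helper `describe` is a made-up name, and the single faulty step it simulates (UTF-8 bytes misread as cp1252) is an assumption about what happened to the data.

```python
def describe(text):
    """Return (number of characters, UTF-8 bytes as hex) for a string."""
    return len(text), text.encode("utf-8").hex()

clean = "\u00e9"  # the single character é
print(describe(clean))  # (1, 'c3a9')

# Simulate one faulty step: the UTF-8 bytes are misread as cp1252.
once = clean.encode("utf-8").decode("cp1252")
print(describe(once))  # (2, 'c383c2a9')
```

One character became two, and two bytes became four; repeating the faulty step makes the text grow again each time.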

Does anyone know what encoding would produce that

One tool that is good at guessing the character encoding of a byte stream is ‘enca’, the Extremely Naive Charset Analyser.

bignose
+3  A: 

It looks like the data has been through a corruption process in which it was written as UTF-8 but read in as cp1252, and this happened three times. This might be recoverable (I don't know if it will work for every character, but at least for some) by putting the corrupted data through the reverse transformation: read in as UTF-8, write out as cp1252, repeat. There are plenty of ways of doing that kind of conversion: using a text editor as Tordek suggests, using command-line tools as below, or using the encoding features built into your database or programming language.

unix shell prompt> echo Alain Lefèvre | 
iconv -f utf-8 -t cp1252 | 
iconv -f utf-8 -t cp1252 | 
iconv -f utf-8 -t cp1252

Alain Lefèvre

unix shell prompt>
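The same repair can be sketched in a programming language. The Python below is an illustration, not a tested recipe for the client's files: `fix_mojibake` and `corrupt` are made-up names, and the loop simply stops when a further round would fail, on the assumption that the text is then either clean or not recoverable this way. Python's cp1252 codec leaves a few byte values undefined, so some characters may not round-trip.

```python
def corrupt(text, rounds=3):
    """Simulate the damage: UTF-8 bytes misread as cp1252, several times."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("cp1252")
    return text

def fix_mojibake(text, max_rounds=5):
    """Undo repeated 'UTF-8 read as cp1252' damage, one layer per round."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no further layer can be removed
        if repaired == text:
            break  # pure ASCII or otherwise unchanged: nothing left to do
        text = repaired
    return text

garbled = corrupt("Alain Lef\u00e8vre")
print(len(garbled))           # longer than the 13-character original
print(fix_mojibake(garbled))  # Alain Lefèvre
```

For a CSV file, one option is to open it with `encoding="utf-8"`, pass each line through the function, and write the result to a new file that is also opened as UTF-8.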
d__
Awesome! (15 chars)
Tordek