I have an ANSI encoded text file that should not have been encoded as ANSI, as it contains accented characters that ANSI does not support. I would rather work with UTF-8.

Can the data be decoded correctly or is it lost in transcoding?

What tools could I use?

Here is a sample of what I have:

Ã§ Ã©

I can tell from context (cafÃ© should be café) that these should be these two characters:

ç é
A: 

vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename

HTH

Zsolt Botykai
A: 

Do you know the original encoding of the file (assuming it was converted at some point from one charset to another)? If so, you should be able to map from the resulting characters back to the original characters by using tables like this one.
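
You can even build such a table yourself by replaying the corruption (a sketch assuming the damage was UTF-8 bytes read as cp1252, run in a UTF-8 terminal):

printf 'café' | iconv -f cp1252 -t utf-8
# prints cafÃ©, i.e. é gets mangled into the pair Ã©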

If you don't know the original encoding, you could probably work it out using a probabilistic approach, based on the frequency of different words in the language you're working with. But you may not be willing to put in the work that would require.

gregory
Unfortunately, no I do not know the original encoding. It is a common problem when clients send you files made on a variety of systems. They may not know what a character encoding is. Note that the growing adoption of Linux desktops using UTF-8 by default could reduce this problem transparently.
Liam
I totally agree. UTF-8 is definitely the most reasonable encoding to use in most situations, but you can hardly expect clients to understand or act on that, unfortunately.
gregory
+2  A: 

Use iconv - see http://stackoverflow.com/questions/64860/best-way-to-convert-text-files-between-character-sets
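
For instance, assuming the source file really is cp1252 (the usual "ANSI" on western Windows systems; the file names are placeholders):

iconv -f cp1252 -t utf-8 input.txt > output.txt

iconv -l lists the encoding names your build understands.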

Troels Arvin
Will a simple conversion assume the data is correct and keep the bad data?
Liam
Yes, it will. I think people are misunderstanding the question. The problem is that the data is already corrupted, so you need a remedial solution.
gregory
A: 

And then there is the somewhat older recode program.
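
A minimal sketch, using recode's before..after request syntax and assuming cp1252 as the source (recode converts the file in place, so work on a copy):

recode cp1252..utf-8 filename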

unbeknown
A: 

If you see question marks in the file or if the accents are already lost, going back to UTF-8 will not help your cause. E.g. if café became cafe, changing the encoding alone will not help (you'll need the original data).

Can you paste some text here? That will help us answer for sure.

A: 

There are programs, such as chardet, that try to detect the encoding of a file. Then you could convert it to a different encoding using iconv. But that requires that the original text is still intact and no information has been lost (for example, by removing accents or whole accented letters).
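
As a sketch of that pipeline (newer releases of chardet install a chardetect command-line tool; the file name and the detected charset below are illustrative):

chardetect mystery.txt
# e.g. mystery.txt: windows-1252 with confidence 0.73
iconv -f windows-1252 -t utf-8 mystery.txt > mystery.utf8.txt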

unbeknown
+1  A: 

When you see character sequences like Ã§ and Ã©, it's usually an indication that a UTF-8 file has been opened by a program that reads it in as ANSI (or similar). Unicode characters such as these:

U+00C2 Latin capital letter A with circumflex
U+00C3 Latin capital letter A with tilde
U+0082 Break permitted here
U+0083 No break here

tend to show up in ANSI text because of the variable-byte strategy that UTF-8 uses. This strategy is explained very well here.
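
You can watch the mechanism at the byte level (assuming a UTF-8 terminal and the common hexdump tool):

printf 'é' | hexdump -C   # shows the two bytes c3 a9

Read as cp1252, 0xC3 is Ã and 0xA9 is ©, so é appears as Ã©; every character in the U+0080 to U+00FF range turns into a pair starting with Â or Ã.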

The advantage for you is that the appearance of these odd characters makes it relatively easy to find, and thus replace, instances of incorrect conversion.

I believe that, since ANSI always uses 1 byte per character, you can handle this situation with a simple search-and-replace operation. Or more conveniently, with a program that includes a table mapping between the offending sequences and the desired characters, like these:

â€œ -> “ # should be an opening double curly quote
â€? -> ” # should be a closing double curly quote

Any given text, assuming it's in English, will have a relatively small number of different types of substitutions.
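
A sketch of that idea with plain sed (the mappings and file names are illustrative; extend the table as you spot more sequences, and keep a backup, since this replaces blindly):

sed -e 's/Ã§/ç/g' \
    -e 's/Ã©/é/g' \
    -e 's/â€œ/“/g' \
    broken.txt > fixed.txt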

Hope that helps.

gregory
excellent, thanks!
Liam
+2  A: 

EDIT: A simple possibility to eliminate before getting into more complicated solutions: have you tried setting the character set to utf8 in the text editor in which you're reading the file? This could just be a case of somebody sending you a utf8 file that you're reading in an editor set to, say, cp1252.

Just taking the two examples, this is a case of utf8 being read through the lens of a single-byte encoding, likely one of iso-8859-1, iso-8859-15, or cp1252. If you can post examples of other problem characters, it should be possible to narrow that down more.

As visual inspection of the characters can be misleading, you'll also need to look at the underlying bytes: the § you see on screen might be either 0xa7 or 0xc2a7, and that will determine the kind of character set conversion you have to do.
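
For instance, with the common hexdump tool (suspect.txt is a placeholder for your file):

hexdump -C suspect.txt | head

If the § was written as the single byte a7, you have genuine latin-1/cp1252 data; if it shows up as the pair c2 a7, the bytes are already UTF-8 and only the program displaying them is wrong.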

Can you assume that all of your data has been distorted in exactly the same way - that it's come from the same source and gone through the same sequence of transformations, so that, for example, there isn't a single é in your text, it's always Ã©? If so, the problem can be solved with a sequence of character set conversions. If you can be more specific about the environment you're in and the database you're using, somebody here can probably tell you how to perform the appropriate conversion.
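
If that assumption holds and the mojibake has itself been re-saved as UTF-8 (the double-encoding case), the whole repair can be a single round trip with iconv (file names are placeholders; try it on a copy first):

iconv -f utf-8 -t cp1252 mangled.txt > recovered.txt
# the two characters Ã© become the bytes 0xC3 0xA9, which is UTF-8 for é

Usefully, iconv stops with an error if it meets a character that has no cp1252 equivalent, which is a sign this assumption doesn't hold for your file.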

Otherwise, if the problem characters are only occurring in some places in your data, you'll have to take it instance by instance, based on assumptions along the lines of "no author intended to put Ã§ in their text, so whenever you see it, replace by ç". The latter option is more risky, firstly because those assumptions about the intentions of the authors might be wrong, secondly because you'll have to spot every problem character yourself, which might be impossible if there's too much text to visually inspect or if it's written in a language or writing system that's foreign to you.

d__
thanks Donal, any suggestions for viewing the bytes?
Liam
Plenty of options, depending on where you are: hd -c filename, opening it in vi and looking at the "weird" character escapes, bin2hex in php, hex(fieldname) in mysql.
d__
Thanks, this seems to be the best solution. Understanding the underlying bytes and intelligently replacing them seems like the smartest option, and I'll develop a script as I go to automate the changes.
Liam
A: 

I found a simple way to auto-detect file encodings: change the file to a text file (on a Mac, rename the file extension to .txt) and drag it to a Mozilla Firefox window (or use File -> Open). Firefox will detect the encoding - you can see what it came up with under View -> Character Encoding.

I changed my file's encoding using TextMate once I knew the correct encoding: File -> Reopen using encoding and choose your encoding. Then File -> Save As and change the encoding to UTF-8 and the line endings to LF (or whatever you want).

Mark Robinson