I've copied certain files from a Windows machine to a Linux machine, so all the Windows-encoded (windows-1252) files need to be converted to UTF-8. Files that are already in UTF-8 should not be changed. I'm planning to use the "recode" utility for that. How can I specify that "recode" should only convert windows-1252-encoded files and leave the UTF-8 files alone?

Example usage of recode:

recode windows-1252.. myfile.txt

This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know whether myfile.txt is actually windows-1252-encoded and not UTF-8-encoded; otherwise, I believe, this would corrupt the file.

+3  A: 

How would you expect recode to know that a file is Windows-1252? In theory, I believe any file is a valid Windows-1252 file, as it maps every possible byte to a character.

Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.

One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.

I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding. If you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences), it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.
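If recode turns out not to support this, GNU iconv can serve as the validity check instead: asked to convert from UTF-8 to UTF-8, it fails on the first invalid byte sequence. A minimal sketch, assuming GNU iconv's exit-code behaviour (other implementations may differ):

# Exit status 0 means every byte sequence in the file is valid UTF-8.
if iconv -f UTF-8 -t UTF-8 myfile.txt > /dev/null 2>&1; then
    echo "myfile.txt is valid UTF-8 (or plain ASCII)"
else
    echo "myfile.txt is not valid UTF-8 - possibly windows-1252"
fi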

Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.

Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.

Jon Skeet
There are a few bytes which cp1252 doesn't map to a character: 0x81, 0x8D, 0x8F, 0x90, 0x9D. The point stands, however. I wouldn't try to bulk-convert encodings of files from multiple different sources.
bobince
Thanks for pointing that out - I really thought *everything* was mapped in 1252. I'm sure it's the case for some other encodings :)
Jon Skeet
ISO-8859-1 maps every byte to a character, with the `80..9F` range being the C1 control characters. In Java I can decode every byte in the range `00..FF` to a String using ISO-8859-1, then re-encode it to get the original bytes back. When I try that with windows-1252 I get garbage for the values bobince listed. That surprised me; I thought it would fill those gaps with the corresponding control characters from ISO-8859-1.
Alan Moore
A: 

Use the iconv command.

To make sure a file is in Windows-1252, open it in Notepad (under Windows), then click Save As. Notepad suggests the current encoding as the default; if it's Windows-1252 (or any single-byte codepage, for that matter), it will say "ANSI".

Seva Alekseyev
Opening each file by hand would be a tedious process; I want to do the conversion for a large number of files. Is there any other way I could do this?
Sam
What language are the files in? The difference between Windows-1252 and UTF-8 only manifests in non-ASCII characters, i.e. national ones. Any file is a valid Windows-1252 file, but without looking at the content and checking whether the characters make sense in the target language, you cannot tell if it's really Windows-1252. If the file has no extended characters, then the conversion is trivial anyway, and you don't have to bother.
Seva Alekseyev
Addition: you can validate UTF-8, though. Even iconv can do that: convert a file from UTF-8 to UTF-16 and back; if the result is not identical to the original, then it was not UTF-8. This is easy to do with some creative pipelining (see the sketch below).
Seva Alekseyev
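That round trip fits in a single pipeline. A sketch, assuming GNU iconv; note that -t UTF-16 prepends a byte-order mark, which the reverse conversion consumes again, so valid input comes back byte-identical:

# Round-trip through UTF-16; cmp compares the result with the original.
if iconv -f UTF-8 -t UTF-16 myfile.txt | iconv -f UTF-16 -t UTF-8 | cmp -s - myfile.txt; then
    echo "myfile.txt is valid UTF-8"
else
    echo "myfile.txt is not valid UTF-8"
fi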
And before you start, gather some statistics: how many of the files actually require conversion?
Seva Alekseyev
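One way to gather those stats, assuming a version of file(1) that supports --mime-encoding (older versions offer similar output via file -i); its guesses are heuristic, but fine for a first overview:

# Tally file(1)'s encoding guesses across all .txt files.
file --mime-encoding *.txt | awk -F': ' '{print $2}' | sort | uniq -c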
A: 

I'm using vim: I do :set encoding=utf-8, then save the file back.

Alternatively, you can use iconv:

iconv -f WINDOWS-1252 -t UTF-8 filename.txt
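To handle many files at once, the validity check and the conversion can be combined. A sketch, assuming GNU iconv and assuming that anything that is not valid UTF-8 is in fact Windows-1252 (which is precisely the assumption discussed in the other answers):

for f in *.txt; do
    # Valid UTF-8 (or plain ASCII) files are left untouched.
    if ! iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
        # Not valid UTF-8: assume windows-1252 and convert in place.
        iconv -f WINDOWS-1252 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    fi
done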

Gregory Pakosz
encoding=utf-8 only ensures that vim uses an encoding that can represent all Unicode characters. To ensure that a file is treated as UTF-8, with latin1 as a fallback, you need to set fileencodings correctly, e.g. to something like 'utf-8,latin1'.
Charles Bailey
+1  A: 

There's no general way to tell whether a file is encoded with a specific encoding. Remember that an encoding is nothing more than an "agreement" on how the bytes in a file should be mapped to characters.

If you don't know which of your files are already encoded in UTF-8 and which ones are encoded in windows-1252, you will have to inspect all files and find out yourself. In the worst case, that could mean opening every single one of them with each of the two encodings and seeing whether it "looks" correct, i.e. whether all characters are displayed correctly. Of course, you can use tool support for that: for instance, if you know for sure that the files contain certain characters that map differently in windows-1252 and UTF-8, you could grep for them after running the files through iconv, as mentioned by Seva Alekseyev (see the sketch below).
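For example, if you know the texts are German and should contain the word "für", a spot check could look like this (the word and the file name are placeholders, to be replaced with something you know occurs in your data):

# If decoding as windows-1252 yields the expected word, the guess was probably right;
# a UTF-8 file decoded this way would show mojibake instead and not match.
iconv -f WINDOWS-1252 -t UTF-8 myfile.txt | grep -q 'für' && echo "looks like windows-1252"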

Another lucky case would be if you know that the files contain only characters that are encoded identically in UTF-8 and windows-1252, i.e. plain ASCII. In that case, of course, you're done already.
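Spotting that case is easy, since the two encodings agree exactly on the 7-bit ASCII range. A sketch, assuming GNU grep built with PCRE support (-P); LC_ALL=C makes grep treat the input as raw bytes:

# Exit status 1 (no match) means no bytes above 0x7F, i.e. pure ASCII.
if ! LC_ALL=C grep -qP '[\x80-\xff]' myfile.txt; then
    echo "myfile.txt is plain ASCII - no conversion needed"
fi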

kleiba