views: 866
answers: 3

I have a collection of files encoded in ANSI or UTF-16LE. I would like Python to open the files using the correct encoding. The problem is that the ANSI files do not raise any sort of exception when decoded as UTF-16LE, and vice versa.

Is there a straightforward way to open up the files using the correct file encoding?

+4  A: 

Use the chardet library to detect the encoding.
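A minimal sketch of this approach (chardet is a third-party package, so the import guard below is an assumption that lets the snippet degrade gracefully when it is not installed; `detect_encoding` is a hypothetical helper name):

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None  # fall back gracefully if chardet is not installed

def detect_encoding(raw: bytes):
    """Return chardet's best guess for the encoding of `raw`,
    or None if chardet is unavailable or cannot decide."""
    if chardet is None:
        return None
    # chardet.detect returns a dict like {"encoding": ..., "confidence": ...}
    return chardet.detect(raw).get("encoding")

guess = detect_encoding("“Hello”".encode("utf-16-le"))
```

Note that chardet inspects the whole buffer statistically, which is why it can be slow on large batches of files (as the comment below observes).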

RichieHindle
chardet is not perfect, but if files with different encodings get mixed up, this is your best bet.
nosklo
Chardet works but it takes way too long to process all the files
PCBEEF
A: 

You can check for the BOM at the beginning of the file to check whether it's UTF.

Then unicode.decode accordingly (using one of the standard encodings).

EDIT: Or, alternatively, try s.decode('ascii') (where s is your string). If it raises UnicodeDecodeError, decode it as 'utf_16_le' instead.
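A sketch of this answer's two ideas combined (`decode_with_bom` is a hypothetical helper; note the caveats raised in the comments below, that not every file carries a BOM and that windows-1252 will happily decode almost any byte sequence):

```python
import codecs

def decode_with_bom(raw: bytes) -> str:
    """Decode as UTF-16LE if a BOM is present; otherwise try a strict
    ASCII decode and fall back to UTF-16LE on failure."""
    if raw.startswith(codecs.BOM_UTF16_LE):  # b"\xff\xfe"
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    try:
        # Strict ASCII decode raises on any byte >= 0x80
        return raw.decode("ascii")
    except UnicodeDecodeError:
        return raw.decode("utf-16-le")
```

As the comments point out, this fails for BOM-less files whose ANSI content includes bytes above 0x7F.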

Mike Hordecki
Not all files contain a BOM header
kgiannakakis
It's not ASCII, it's ANSI, which I believe is windows-1252. Python does not throw any exceptions when I try to decode a UTF-16LE file using windows-1252.
PCBEEF
UnicodeDecodeError happens when string contains non-ANSI characters. No exceptions means your string doesn't happen to have those characters. Are you sure your string contains non-ANSI characters? What does your string look like before and after the conversion?
Mike Hordecki
If I have a file that is encoded in utf-16le containing the text "๑۩۞۩๑", it will decode under windows-1252, however, when printing the result it will give me "ÿþQéÞéQ". Python does not throw any exceptions.
PCBEEF
@Mike H, there's no such thing as a "non-ANSI character" (as you put it). Every byte from 0..255 maps to a character in windows-1252 (which extends ISO-8859-1). Some are control characters, which will print out as question marks or boxes, but they're all valid.
Alan Moore
@Alan Moore: **WRONG**. There are 5 bytes that are invalid in `cp1252` which does NOT "extend" ISO-8859-1; it replaces the 32 C1 control characters with 27 letters, symbols etc plus 5 x undefined. See `http://en.wikipedia.org/wiki/Windows-1252`. BTW "ANSI" is NOT always cp1252; it's cp125X where X depends on the locale.
John Machin
@John Machin: Yeah, "extends" isn't the right word for that relationship; I've learned to use more precise language since I wrote that comment. As for the five un-remapped code points, implementations vary: some (like Java) follow Microsoft's specification and unassign them, while others (like Microsoft) do the sensible thing and leave the original mappings in place. But my saying "they're all valid" was definitely wrong.
Alan Moore
@Alan Moore: Your newly-precise language implies that Microsoft doesn't follow Microsoft's specification! What is MS's spec? Relevant URLs: http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT and http://msdn.microsoft.com/en-us/library/cc195054.aspx (8 undefined; EURO and Zz with caron added later) and http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx (5 undefined) -- the 1st and 3rd are referenced by IANA (http://www.iana.org/assignments/charset-reg/windows-1252).
John Machin
@Alan Moore: "sensible ... leave the original mappings in place"?? Let's look at the original mappings: 81: not even named in Unicode; 8D: REVERSE LINE FEED (useless); 8F: SINGLE SHIFT 3 (useless especially as 8E (SINGLE SHIFT 2) was remapped to CAPITAL Z with caron); 90: DEVICE CONTROL STRING (useless); 9D: OPERATING SYSTEM COMMAND (wow). In the real world, those 5 positions should be treated as undefined, just as all of 80-9F should be treated as undefined in ISO-8859-1.
John Machin
A: 

What's in the files? If it's plain text in a Latin-based alphabet, almost every other byte in the UTF-16LE files will be zero. In the windows-1252 files, on the other hand, I wouldn't expect to see any zeros at all. For example, here's “Hello” in windows-1252:

93 48 65 6C 6C 6F 94

...and in UTF-16LE:

1C 20 48 00 65 00 6C 00 6C 00 6F 00 1D 20

Aside from the curly quotes, each character maps to the same value, with the addition of a trailing zero byte. In fact, that's true for every character in the ISO-8859-1 character set (windows-1252 extends ISO-8859-1 to add mappings for several printing characters—like curly quotes—to replace the control characters in the range 0x80..0x9F).

If you know all the files are either windows-1252 or UTF-16LE, a quick scan for zeroes should be all you need to figure out which is which. There's a good reason why chardet is so slow and complex, but in this case I think you can get away with quick and dirty.
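The zero-byte scan described above can be sketched as follows (`sniff_encoding` and the 1 KiB sample size are assumptions for illustration; the heuristic only holds for Latin-alphabet text, per the answer):

```python
def sniff_encoding(raw: bytes, sample_size: int = 1024) -> str:
    """Heuristic: Latin-alphabet text in UTF-16LE has a NUL high byte
    for almost every character, while windows-1252 text should contain
    no NUL bytes at all."""
    sample = raw[:sample_size]
    if b"\x00" in sample:
        return "utf-16-le"
    return "windows-1252"

sniff_encoding("“Hello”".encode("utf-16-le"))    # NUL bytes present
sniff_encoding("“Hello”".encode("windows-1252")) # no NUL bytes
```

Scanning only the first kilobyte or so keeps this far cheaper than chardet's full statistical analysis, which is the trade-off the answer is making.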

Alan Moore