views: 866
answers: 3

I have a collection of files encoded in ANSI or UTF-16LE. I would like Python to open the files using the correct encoding. The problem is that the ANSI files do not raise any sort of exception when decoded as UTF-16LE, and vice versa.

Is there a straightforward way to open up the files using the correct file encoding?

+4  A: 

Use the chardet library to detect the encoding.
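A minimal sketch of this approach (chardet is a third-party package, so the import guard below is an assumption that lets the snippet degrade gracefully when it is not installed; `detect_encoding` is a hypothetical helper name):

```python
try:
    import chardet  # third-party: pip install chardet
except ImportError:
    chardet = None  # fall back gracefully if chardet is not installed

def detect_encoding(raw: bytes):
    """Return chardet's best guess for the encoding of `raw`,
    or None if chardet is unavailable or cannot decide."""
    if chardet is None:
        return None
    # chardet.detect returns a dict like {"encoding": ..., "confidence": ...}
    return chardet.detect(raw).get("encoding")

guess = detect_encoding("“Hello”".encode("utf-16-le"))
```

Note that chardet inspects the whole buffer statistically, which is why it can be slow on large batches of files (as the comment below observes).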

RichieHindle
chardet is not perfect, but if files with different encodings get mixed up, this is your best bet.
nosklo
Chardet works but it takes way too long to process all the files
PCBEEF
A: 

You can check for the BOM at the beginning of the file to check whether it's UTF.

Then unicode.decode accordingly (using one of the standard encodings).

EDIT: Or, alternatively, try s.decode('ascii') (where s is your string). If it raises UnicodeDecodeError, decode it as 'utf_16_le' instead.
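A sketch of this answer's two ideas combined (`decode_with_bom` is a hypothetical helper; note the caveats raised in the comments below, that not every file carries a BOM and that windows-1252 will happily decode almost any byte sequence):

```python
import codecs

def decode_with_bom(raw: bytes) -> str:
    """Decode as UTF-16LE if a BOM is present; otherwise try a strict
    ASCII decode and fall back to UTF-16LE on failure."""
    if raw.startswith(codecs.BOM_UTF16_LE):  # b"\xff\xfe"
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    try:
        # Strict ASCII decode raises on any byte >= 0x80
        return raw.decode("ascii")
    except UnicodeDecodeError:
        return raw.decode("utf-16-le")
```

As the comments point out, this fails for BOM-less files whose ANSI content includes bytes above 0x7F.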

Mike Hordecki
Not all files contain a BOM header
kgiannakakis
It's not ASCII, it's ANSI, which I believe is windows-1252. Python does not throw any exceptions when I try to decode a UTF-16LE file using windows-1252.
PCBEEF
UnicodeDecodeError happens when string contains non-ANSI characters. No exceptions means your string doesn't happen to have those characters. Are you sure your string contains non-ANSI characters? What does your string look like before and after the conversion?
Mike Hordecki
If I have a file that is encoded in utf-16le containing the text "๑۩۞۩๑", it will decode under windows-1252, however, when printing the result it will give me "ÿþQéÞéQ". Python does not throw any exceptions.
PCBEEF
@Mike H, there's no such thing as a "non-ANSI character" (as you put it). Every byte from 0..255 maps to a character in windows-1252 (which extends ISO-8859-1). Some are control characters, which will print out as question marks or boxes, but they're all valid.
Alan Moore
@Alan Moore: **WRONG**. There are 5 bytes that are invalid in `cp1252` which does NOT "extend" ISO-8859-1; it replaces the 32 C1 control characters with 27 letters, symbols etc plus 5 x undefined. See `http://en.wikipedia.org/wiki/Windows-1252`. BTW "ANSI" is NOT always cp1252; it's cp125X where X depends on the locale.
John Machin
@John Machin: Yeah, "extends" isn't the right word for that relationship; I've learned to use more precise language since I wrote that comment. As for the five un-remapped code points, implementations vary: some (like Java) follow Microsoft's specification and unassign them, while others (like Microsoft) do the sensible thing and leave the original mappings in place. But my saying "they're all valid" was definitely wrong.
Alan Moore
@Alan Moore: Your newly-precise language implies that Microsoft doesn't follow Microsoft's specification! What is MS's spec? Relevant URLs: http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT and http://msdn.microsoft.com/en-us/library/cc195054.aspx (8 undefined; EURO and Zz with caron added later) and http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx (5 undefined) -- the 1st and 3rd are referenced by IANA (http://www.iana.org/assignments/charset-reg/windows-1252).
John Machin
@Alan Moore: "sensible ... leave the original mappings in place"?? Let's look at the original mappings: 81: not even named in Unicode; 8D: REVERSE LINE FEED (useless); 8F: SINGLE SHIFT 3 (useless especially as 8E (SINGLE SHIFT 2) was remapped to CAPITAL Z with caron); 90: DEVICE CONTROL STRING (useless); 9D: OPERATING SYSTEM COMMAND (wow). In the real world, those 5 positions should be treated as undefined, just as all of 80-9F should be treated as undefined in ISO-8859-1.
John Machin
A: 

What's in the files? If it's plain text in a Latin-based alphabet, almost every other byte in the UTF-16LE files will be zero. In the windows-1252 files, on the other hand, I wouldn't expect to see any zeros at all. For example, here's “Hello” in windows-1252:

93 48 65 6C 6C 6F 94

...and in UTF-16LE:

1C 20 48 00 65 00 6C 00 6C 00 6F 00 1D 20

Aside from the curly quotes, each character maps to the same value, with the addition of a trailing zero byte. In fact, that's true for every character in the ISO-8859-1 character set (windows-1252 extends ISO-8859-1 to add mappings for several printing characters—like curly quotes—to replace the control characters in the range 0x80..0x9F).

If you know all the files are either windows-1252 or UTF-16LE, a quick scan for zeroes should be all you need to figure out which is which. There's a good reason why chardet is so slow and complex, but in this case I think you can get away with quick and dirty.
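The zero-byte scan described above can be sketched as follows (`sniff_encoding` and the 1 KiB sample size are assumptions for illustration; the heuristic only holds for Latin-alphabet text, per the answer):

```python
def sniff_encoding(raw: bytes, sample_size: int = 1024) -> str:
    """Heuristic: Latin-alphabet text in UTF-16LE has a NUL high byte
    for almost every character, while windows-1252 text should contain
    no NUL bytes at all."""
    sample = raw[:sample_size]
    if b"\x00" in sample:
        return "utf-16-le"
    return "windows-1252"

sniff_encoding("“Hello”".encode("utf-16-le"))    # NUL bytes present
sniff_encoding("“Hello”".encode("windows-1252")) # no NUL bytes
```

Scanning only the first kilobyte or so keeps this far cheaper than chardet's full statistical analysis, which is the trade-off the answer is making.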

Alan Moore