ansaurus

Question

Answer 1

+8 A:

What you're seeing is a UTF-8 encoded BOM, or "Byte Order Mark". The BOM is not usually used for UTF-8 files, so the best way to handle it might be to open the file with a UTF-8 codec, and skip over the U+FEFF character if present.
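A minimal sketch of that approach, assuming the file holds one line of comma-separated integers; the helper name read_first_line and the path argument are illustrative and not part of the original answer:

```python
import io


def read_first_line(path):
    """Return the first line of a UTF-8 file without a leading BOM (hypothetical helper)."""
    # Decode as plain UTF-8; a BOM, if present, arrives as U+FEFF at the start.
    with io.open(path, "r", encoding="utf-8") as f:
        line = f.readline()
    # Skip the U+FEFF character only when it is actually there.
    if line.startswith(u"\ufeff"):
        line = line[1:]
    return line
```

The numbers can then be parsed with something like a, b, c = map(int, read_first_line(path).split(",")).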

Greg Hewgill 2010-03-01 23:22:37

Answer 2

+5 A:

import codecs

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
with codecs.open(file, "r", "utf-8-sig") as f:
    a, b, c = map(int, f.readline().split(","))

This works in Python 2.6.4. The codecs.open call opens the file and returns its contents as unicode, decoding from UTF-8 and skipping the initial BOM if one is present.

ΤΖΩΤΖΙΟΥ 2010-03-02 00:01:27

Thanks. This works on my UTF-8 files but fails on the Unicode and Unicode big endian ones. Is there a foolproof way of opening a file in any encoding and getting those numbers, or would I have to specify the encoding explicitly?

Ηλίας 2010-03-02 09:40:23

AFAIK you have to specify the encoding. Obviously, you can write a small function that does the three tests and returns an appropriately decoded file.
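One way such a function might look, as a sketch only: it assumes the "three tests" are checks for the UTF-8, UTF-16 little-endian and UTF-16 big-endian BOMs, and that a file without a BOM is plain UTF-8. The name open_by_bom and its default parameter are hypothetical.

```python
import codecs
import io

# (BOM bytes, codec name) pairs. Both "utf-8-sig" and "utf-16" consume the BOM,
# and "utf-16" uses it to pick the byte order.
_BOM_CODECS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def open_by_bom(path, default="utf-8"):
    """Open path as text, choosing the codec from its BOM (hypothetical helper)."""
    with io.open(path, "rb") as raw:
        head = raw.read(3)  # the longest BOM tested here is 3 bytes
    for bom, codec in _BOM_CODECS:
        if head.startswith(bom):
            return io.open(path, "r", encoding=codec)
    # No BOM found: fall back to the caller's guess. This is an assumption,
    # not detection.
    return io.open(path, "r", encoding=default)
```

The returned file object can be used in a with statement just like the codecs.open call in the answer. A file with no BOM in some other encoding would still need the encoding given explicitly through default.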

ΤΖΩΤΖΙΟΥ 2010-03-02 09:52:38

Great. I found the chardet module, which does exactly this: http://chardet.feedparser.org/

Ηλίας 2010-03-02 10:20:15

Minor error in your code above: a, b, c= map(int,f.readline().split(","))

Ηλίας 2010-03-03 11:12:02

@iKarampa: thank you very much!

ΤΖΩΤΖΙΟΥ 2010-03-03 22:37:21


Dealing with UTF-8 numbers in Python
