I've set up a script that basically does a large-scale find-and-replace on a plain text document.

At the moment it works fine with ASCII-, UTF-8-, and UTF-16-encoded documents (and possibly others, but I've only tested these three), so long as the encoding is specified inside the script (the example code below specifies UTF-16).

Is there a way to make the script automatically detect which of these character encodings the input file uses, and then write the output file in that same encoding?

findreplace = [
    ('term1', 'term2'),
]

inF = open(infile, 'rb')
s = unicode(inF.read(), 'utf-16')
inF.close()

for couple in findreplace:
    outtext = s.replace(couple[0], couple[1])
    s = outtext

outF = open(outFile, 'wb')
outF.write(outtext.encode('utf-16'))
outF.close()

Thanks!

+1  A: 

No, there isn't. You have to encode that knowledge inside the file itself or obtain it from an outside source.

There are heuristics that can guess a file's encoding through statistical analysis of byte frequencies, but I wouldn't rely on them for any mission-critical data.
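
For the "inside the file itself" case, a byte-order mark (BOM) at the start of the file is the most common marker. Here is a minimal sketch, assuming Python and a hypothetical input file name:

import codecs

raw = open('input.txt', 'rb').read(4)   # 'input.txt' is a placeholder name
if raw.startswith(codecs.BOM_UTF8):
    encoding = 'utf-8-sig'               # UTF-8 with a BOM
elif raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
    encoding = 'utf-16'                  # the BOM tells the codec the byte order
else:
    encoding = None   # no BOM: fall back to an outside source or a heuristic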

Lie Ryan
+2  A: 

From the link J.F. Sebastian posted: try chardet.

Keep in mind that, in general, it's impossible to detect the character encoding of every input file with 100% reliability. In other words, there are possible input files that could be interpreted equally well as any of several character encodings, and there may be no way to tell which one is actually being used. chardet uses some heuristic methods and gives you a confidence level indicating how "sure" it is that the character encoding it reports is actually correct.
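
A minimal sketch of typical chardet usage (the file name is a placeholder, and the 0.9 threshold is an arbitrary choice, not something chardet prescribes):

import chardet

raw = open('input.txt', 'rb').read()
guess = chardet.detect(raw)   # e.g. {'encoding': 'UTF-16', 'confidence': 1.0}
if guess['encoding'] is not None and guess['confidence'] > 0.9:
    text = raw.decode(guess['encoding'])
else:
    raise ValueError("couldn't detect the encoding with enough confidence")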

David Zaslavsky
+1  A: 

Some observations and questions:

(1) ASCII is a subset of UTF-8 in the sense that if a file can be decoded successfully using ASCII, then it can be decoded successfully using UTF-8. So you can cross ASCII off your list.

(2) Are the two terms in findreplace ever going to include non-ASCII characters? Note that an answer of "yes" would indicate that the goal of writing an output file in the same character set as the input may be difficult/impossible to achieve.

(3) Why not write ALL output files in the SAME handle-all-Unicode-characters encoding, e.g. UTF-8? (See the sketch after this list.)

(4) Do the UTF-8 files have a BOM?

(5) What other character sets do you reasonably expect to need to handle?

(6) Which of the four possibilities (UTF-16LE / UTF-16BE) x (BOM / no BOM) are you calling UTF-16? Note that I'm deliberately not trying to infer anything from the presence of 'utf-16' in your code.

(7) Note that chardet doesn't detect UTF-16LE/BE without a BOM, and it has other blind spots with older and less common character sets.
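
As a sketch of observation (3): writing every output file in UTF-8, regardless of the input encoding, could look like this (file name and variable names are placeholders; use 'utf-8-sig' instead of 'utf-8' if you want a BOM at the start):

import codecs

outF = codecs.open('output.txt', 'w', encoding='utf-8')
outF.write(outtext)   # outtext must be a unicode object, as in the question's code
outF.close()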

Update: Here are some code snippets that you can use to determine what "ANSI" is, and to try decoding using a restricted list of encodings. Note that this presumes a Windows environment.

# determine "ANSI"
import locale
ansi = locale.getdefaultlocale()[1] # produces 'cp1252' on my Windows box.

f = open("input_file_path", "rb")
data = f.read()
f.close()

if data.startswith("\xEF\xBB\xBF"): # UTF-8 "BOM"
    encodings = ["utf-8-sig"]
elif data.startswith(("\xFF\xFE", "\xFE\xFF")): # UTF-16 BOMs
    encodings = ["utf16"]
else:
    encodings = ["utf8", ansi, "utf-16le"]
# ascii is a subset of both "ANSI" and "UTF-8", so you don't need it.
# ISO-8859-1 aka latin1 defines all 256 bytes as valid codepoints; so it will
# decode ANYTHING; so if you feel that you must include it, put it LAST.
# It is possible that a utf-16le file may be decoded without exception
# by the "ansi" codec, and vice versa.
# Checking that your input text makes sense, always a very good idea, is very 
# important when you are guessing encodings.

for enc in encodings:
    try:
        udata = data.decode(enc)
        break
    except UnicodeDecodeError:
        pass
else:
    raise Exception("unknown encoding")

# udata is your file contents as a unicode object
# When writing the output file, use 'utf-8-sig' as the encoding if you
# want a BOM at the start.
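
A follow-up sketch, reusing the names from the question and from the snippet above: once udata and enc are known, the find-and-replace can be applied and the result written back in the same encoding that was detected, which is what the question asked for.

for old, new in findreplace:
    udata = udata.replace(old, new)

outF = open(outFile, 'wb')
# Note: for 'utf16' the BOM/byte order written will be the platform default,
# not necessarily the same byte order as the input file.
outF.write(udata.encode(enc))
outF.close()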
John Machin
Sorry, I'm a bit of a newbie to all this! As you suggest in (3), I've changed it now so that it always outputs in UTF-8. As the people who'd be using this are on Windows machines, I'd imagine that the UTF-8 files would have a BOM. I plan to distribute the script among non-technophiles, so it basically just needs to accept the various default character sets, nothing special. So in response to (5), it would probably need ANSI, ASCII, UTF-8, UTF-16, and ISO 8859-1.
Haidon
@Haidon: Please answer Q2 and Q6. When asked for clarification, edit your question instead of commenting. What makes you think that ISO-8859-1 is a "default character set on Windows"? How would the average Windows user create such a file?
John Machin
@detly: Thanks; fixed.
John Machin