Some observations and questions:
(1) ASCII is a subset of UTF-8 in the sense that if a file can be decoded successfully using ASCII, then it can be decoded successfully using UTF-8. So you can cross ASCII off your list.
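A one-line sanity check of this claim (plain Python, no assumptions about your files):

```python
# Any pure-ASCII byte string decodes to the same text under both codecs,
# so a separate ASCII decoding attempt is redundant.
data = b"plain ASCII text\n"
assert data.decode("ascii") == data.decode("utf-8")
```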
(2) Are the two terms in findreplace ever going to include non-ASCII characters? Note that an answer of "yes" would indicate that the goal of writing an output file in the same character set as the input may be difficult/impossible to achieve.
(3) Why not write ALL output files in the SAME handle-all-Unicode-characters encoding e.g. UTF-8?
(4) Do the UTF-8 files have a BOM?
(5) What other character sets do you reasonably expect to need to handle?
(6) Which of the four possibilities (UTF-16LE / UTF-16BE) x (BOM / no BOM) are you calling UTF-16? Note that I'm deliberately not trying to infer anything from the presence of 'utf-16' in your code.
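If it helps, the distinction between those four possibilities can be seen with the BOM constants in the codecs module (a quick sketch):

```python
import codecs

# The BOM bytes that distinguish the UTF-16 flavours:
# UTF-16LE BOM is b'\xff\xfe', UTF-16BE BOM is b'\xfe\xff'.
print(codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The "utf-16" codec emits a BOM when encoding and honours one when
# decoding; the "utf-16-le" / "utf-16-be" codecs use no BOM at all.
assert u"abc".encode("utf-16").startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
assert u"abc".encode("utf-16-le") == b"a\x00b\x00c\x00"
```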
(7) Note that chardet doesn't detect UTF-16xE without a BOM; it has other blind spots with non-*x and older charsets.
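One reason BOM-less UTF-16 is hard to detect heuristically: for ASCII-range text it is just the characters interleaved with NUL bytes, which a decode-anything 8-bit codec such as latin-1 will happily accept. A small illustration:

```python
# BOM-less UTF-16LE of ASCII text: each character followed by a NUL byte.
raw = u"Hello".encode("utf-16-le")   # b'H\x00e\x00l\x00l\x00o\x00'

# latin-1 maps all 256 byte values to codepoints, so it "succeeds" too,
# producing garbage with embedded NULs -- no decode error to catch.
assert raw.decode("latin-1") == "H\x00e\x00l\x00l\x00o\x00"
```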
Update: Here are some code snippets that you can use to determine what "ANSI" is, and to try decoding using a restricted list of encodings. Note: this presumes a Windows environment.
# determine "ANSI"
import locale
ansi = locale.getdefaultlocale()[1] # produces 'cp1252' on my Windows box.

with open("input_file_path", "rb") as f:
    data = f.read()

if data.startswith(b"\xEF\xBB\xBF"): # UTF-8 "BOM"
    encodings = ["utf-8-sig"]
elif data.startswith((b"\xFF\xFE", b"\xFE\xFF")): # UTF-16 BOMs
    encodings = ["utf-16"]
else:
    encodings = ["utf-8", ansi, "utf-16-le"]
# ascii is a subset of both "ANSI" and "UTF-8", so you don't need it.
# ISO-8859-1 aka latin1 defines all 256 bytes as valid codepoints; so it will
# decode ANYTHING; so if you feel that you must include it, put it LAST.
# It is possible that a utf-16le file may be decoded without exception
# by the "ansi" codec, and vice versa.
# Checking that your input text makes sense, always a very good idea, is very
# important when you are guessing encodings.
for enc in encodings:
    try:
        udata = data.decode(enc)
        break
    except UnicodeDecodeError:
        pass
else:
    raise Exception("unknown encoding")
# udata is your file contents as a unicode object
# When writing the output file, use 'utf-8-sig' as the encoding if you
# want a BOM at the start.
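To make that last point concrete, here is a hedged sketch of writing the output with a BOM (the "output_file_path" name is just a placeholder, like "input_file_path" above):

```python
import codecs

udata = u"example text\n"  # stands in for your decoded file contents

# 'utf-8-sig' writes the three UTF-8 BOM bytes first; plain 'utf-8' does not.
with codecs.open("output_file_path", "w", encoding="utf-8-sig") as f:
    f.write(udata)

with open("output_file_path", "rb") as f:
    assert f.read() == b"\xef\xbb\xbf" + udata.encode("utf-8")
```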