views: 1464
answers: 4

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).

When the BOM (Byte Order Mark) is there, I have no problem. I know if the file is UTF-8 or UTF-16 BE or LE.

I wanted to assume that when there was no BOM, the file was ANSI. But I have found that the files I am dealing with are often missing their BOM. Therefore, no BOM may mean the file is ANSI, UTF-8, UTF-16 BE, or UTF-16 LE.

When the file has no BOM, what would be the best way to scan some of the file and most accurately guess its encoding? I'd like to be right close to 100% of the time if the file is ANSI, and in the high 90s if it is a UTF format.

I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009, which is Unicode-aware and has a TEncoding class, so something specific to that would be a bonus.
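
To make the problem concrete, here is a minimal sketch of the BOM-sniffing step I already have, written in Python for brevity rather than as my actual Delphi code (Delphi 2009's TEncoding can do this part too, via TEncoding.GetBufferEncoding if I remember right). Everything that falls through to 'unknown' is the case this question is about:

    def sniff_bom(data: bytes) -> str:
        # The three BOMs mentioned above; anything else falls through.
        if data.startswith(b'\xef\xbb\xbf'):
            return 'utf-8'
        if data.startswith(b'\xff\xfe'):
            return 'utf-16-le'
        if data.startswith(b'\xfe\xff'):
            return 'utf-16-be'
        return 'unknown'   # no BOM: ANSI, UTF-8, or BOM-less UTF-16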


Answer:

ShreevatsaR's answer led me to search Google for "universal encoding detector delphi", which surprised me by listing this post in the #1 position after it had been alive for only about 45 minutes! That is fast googlebotting!! It is also amazing that Stack Overflow gets into first place so quickly.

The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.

I found the mention of Delphi on that page, and it led me straight to the free, open-source ChsDet charset detector on SourceForge, written in Delphi and based on Mozilla's i18n component.

Fantastic! Thank you to all those who answered (all +1), thank you ShreevatsaR, and thank you again, Stack Overflow, for helping me find my answer in less than an hour!

+1  A: 

ASCII? No modern OS uses ASCII any more. They all use 8-bit codes at least, meaning it's either UTF-8, ISOLatinX, WinLatinX, MacRoman, Shift-JIS, or whatever else is out there.

The only test I know of is to check for invalid UTF-8 sequences. If you find any, then you know it can't be UTF-8. The same is probably possible for UTF-16. But when it's not a Unicode encoding, it will be hard to tell which Windows code page it might be.

Most editors I know deal with this by letting the user choose a default from the list of all possible encodings.

There is code out there for checking validity of UTF chars.
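
For instance, in a language with a strict UTF-8 decoder built in, that validity check can be as small as this (a Python sketch; the helper name is made up):

    def is_valid_utf8(data: bytes) -> bool:
        # A strict decode rejects any byte sequence that is not well-formed
        # UTF-8, so an exception here means the file cannot be UTF-8.
        try:
            data.decode('utf-8', errors='strict')
            return True
        except UnicodeDecodeError:
            return False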

Thomas Tempelmann
Sorry, I meant ANSI, not ASCII. I'll edit that out.
lkessler
Windows still has device drivers. If your kernel code isn't 7-bit clean, you'll regret it.
Windows programmer
@Windows programmer: what do you mean kernel code needs to be 7-bit clean? Most (all?) drivers need to deal with Unicode - although sometimes the problem is correctly converting from MBCS to Unicode (do I use OEM or the default codepage?, etc).
Michael Burr
OK, code that handles filenames has to copy and convert character strings in variables (PUNICODE etc.), but the source code still has to be 7-bit clean in order to compile properly.
Windows programmer
+3  A: 

My guess is:

  • First, check if the file has byte values less than 32 (except for tab/newlines). If it does, it can't be ANSI or UTF-8, so it must be UTF-16; you just have to figure out the endianness. For this you should probably use some table of valid Unicode character codes. If you encounter invalid codes, try the other endianness and see if that fits. If either fits (or neither does), check which one has the larger percentage of alphanumeric codes. You might also try searching for line breaks and determining the endianness from them. Other than that, I have no ideas for how to check the endianness.
  • If the file contains no values less than 32 (apart from said whitespace), it's probably ANSI or UTF-8. Try parsing it as UTF-8 and see if you get any invalid Unicode characters. If you do, it's probably ANSI. (A rough sketch of these first two checks follows this list.)
  • If you expect documents in non-English single-byte or multi-byte non-Unicode encodings, then you're out of luck. The best thing you can do is something like Internet Explorer, which makes a histogram of character values and compares it to histograms of known languages. It works pretty often, but sometimes fails too. And you'll have to have a large library of letter histograms for every language.
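
Here is that sketch in Python (the zero-byte comparison is only one possible way to guess the endianness, and 'ansi' just means "some single-byte code page"):

    WHITESPACE = {0x09, 0x0A, 0x0D}   # tab, LF, CR

    def guess_encoding(data: bytes) -> str:
        # Byte values below 32 (other than tab/CR/LF) suggest UTF-16, because
        # the high byte of most Latin-script characters is 0x00.
        if any(b < 32 and b not in WHITESPACE for b in data):
            even_zeros = data[0::2].count(0)   # zero bytes at even offsets
            odd_zeros = data[1::2].count(0)    # zero bytes at odd offsets
            # For mostly-Latin text, UTF-16 LE puts the zero byte second (odd
            # offsets) and UTF-16 BE puts it first (even offsets).
            return 'utf-16-le' if odd_zeros > even_zeros else 'utf-16-be'
        # No suspicious control bytes: try UTF-8, otherwise assume ANSI.
        try:
            data.decode('utf-8')
            return 'utf-8'
        except UnicodeDecodeError:
            return 'ansi'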
Vilx-
Hmmm, I often see bytes with values less than 32 in my text files. Things like \n, \r and \t. Rarely some other ones, too.
Michael Burr
ASCII, most ANSI code pages, and UTF-8 understand characters such as carriage return, line feed, horizontal tab, null character, etc., which have byte values less than 32.
Windows programmer
Fair point. I'll modify the post.
Vilx-
I meant to say ANSI, not ASCII in the question. I've modified the question now. You might want to modify your answer to reflect this.
lkessler
+4  A: 

Here is how Notepad does that (it relies on the Windows IsTextUnicode API).

There is also the Python Universal Encoding Detector, which you can check out.
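
If you want to experiment with the same API Notepad uses outside of C, IsTextUnicode (exported by Advapi32) can also be called from a script; here is a ctypes sketch, assuming you only care about the standard little-endian tests (Windows only, and the helper name is illustrative):

    import ctypes

    IS_TEXT_UNICODE_UNICODE_MASK = 0x000F   # ASCII16 | STATISTICS | CONTROLS | SIGNATURE

    def is_probably_utf16le(data: bytes) -> bool:
        # lpiResult selects the tests to run on input and receives the tests
        # that passed on output.
        tests = ctypes.c_int(IS_TEXT_UNICODE_UNICODE_MASK)
        return bool(ctypes.windll.advapi32.IsTextUnicode(
            data, len(data), ctypes.byref(tests)))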

Igal Serban
MS hid the facts
Windows programmer
IsTextUnicode is a good first step. Then it says it uses http://www.ietf.org/rfc/rfc2279.txt?number=2279 for the UTF-8 definition, but that doesn't say what to test.
lkessler
Actually, WP, it's http://en.wikipedia.org/wiki/Bush_hid_the_facts (some jokes do have to be explained).
Alan Moore
Actually my version is "MS hid the facts" (without quotation marks of course). Try it.
Windows programmer
+6  A: 

Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character encoding detection used by Firefox, and it is used by many different applications. Useful links: Mozilla's code, the research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), a short explanation, a detailed explanation.
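
Calling it is nearly a one-liner; a small sketch using chardet's detect() function (the file name is just an example):

    import chardet

    with open('mystery.txt', 'rb') as f:
        raw = f.read()

    guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
    print(guess['encoding'], guess['confidence'])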

ShreevatsaR
Ooooh. That's exactly the type of algorithm I'm looking for. Now if I could figure out how it works, or just find a Delphi equivalent ...
lkessler
According to the docs, it's a Python port of Mozilla's C++ code. The latter is located at http://mxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/ (no idea which incarnation is easier to port, though!).
moodforaday
(contd.) The C++ version seems to be more amply commented, which might help in porting.
moodforaday