Is there any way to discover what charset encoding a file is using?
There's no way to do this with 100% reliability. You have to decide which cost vs accuracy tradeoffs you are comfortable with. I discuss many possible algorithms (with pros & cons) in this reply: http://stackoverflow.com/questions/1077634/powershell-search-script-that-ignores-binary-files/1078277#1078277
As Richard indicated, there's no completely reliable way to do this. However, here are some potentially helpful links:
http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=469
http://msdn.microsoft.com/en-us/netframework/aa569610.aspx#Question2
See this: Detecting File Encodings in .NET
From MSDN:
There is no great way to detect an arbitrary ANSI code page, though there have been some attempts to do this based on the probability of certain byte sequences in the middle of text. We don't try that in StreamReader. A few file formats like XML or HTML have a way of specifying the character set on the first line in the file, so Web browsers, databases, and classes like XmlTextReader can read these files correctly. But many text files don't have this type of information built in.
The only way to reliably do this is to look for a byte order mark at the start of the text file. (This marker identifies the encoding used - e.g. UTF-8, UTF-16, UTF-32 - and, for the multi-byte encodings, its endianness.) Unfortunately, this method only works for Unicode-based encodings that write a BOM; for anything older, much less reliable methods must be used.
The StreamReader type supports detecting these marks to determine the encoding; you simply pass true for the detectEncodingFromByteOrderMarks parameter:
new System.IO.StreamReader("path", true)
You can then check the value of streamReader.CurrentEncoding to determine the encoding used by the file. Note, however, that CurrentEncoding only reflects the detected encoding once the first read has occurred, and that if no byte order marks exist it simply reports the fallback encoding (UTF-8 for this constructor overload), not the file's actual encoding.
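A minimal C# sketch of this (the file name is a placeholder); since the BOM is only examined on the first read, force a read before checking CurrentEncoding:

    using System;
    using System.IO;

    class BomDetectionExample
    {
        static void Main()
        {
            // Passing true sets the detectEncodingFromByteOrderMarks parameter,
            // which asks StreamReader to sniff a byte order mark.
            using (var reader = new StreamReader("file.txt", true))
            {
                // The BOM is examined on the first read, so force one before
                // inspecting CurrentEncoding.
                reader.Peek();
                Console.WriteLine(reader.CurrentEncoding.EncodingName);
            }
        }
    }

If the file has no BOM, CurrentEncoding simply reports the fallback (UTF-8 here); a different fallback can be supplied via the StreamReader(path, encoding, detectEncodingFromByteOrderMarks) overload.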
I coded this a while ago in C++, and it got pretty complex. Here's what I do (accepting the first check that matches):
- Look for Byte Order Marks
- Check if the text is valid UTF-32 BE/LE
- Check if the text is valid UTF-16 BE/LE
- Check if the text is valid UTF-8
- Assume current code page
This copes with the many BOM-less text files that are out there, but does not help with text stored with custom ANSI code pages.
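The original code was C++; the following is only an illustrative C# sketch of that order, with the UTF-32/UTF-16 validity checks omitted for brevity (they would need strict decoder checks analogous to the UTF-8 test):

    using System;
    using System.IO;
    using System.Text;

    static class TextEncodingGuesser
    {
        // Detection order from the list above: BOM first, then a strict
        // UTF-8 validity check, then fall back to the current code page.
        public static Encoding Guess(byte[] bytes)
        {
            // 1. Byte order marks (test UTF-32 LE before UTF-16 LE, because
            //    its BOM starts with the same two bytes FF FE).
            if (bytes.Length >= 4 && bytes[0] == 0xFF && bytes[1] == 0xFE && bytes[2] == 0x00 && bytes[3] == 0x00)
                return new UTF32Encoding(bigEndian: false, byteOrderMark: true);
            if (bytes.Length >= 4 && bytes[0] == 0x00 && bytes[1] == 0x00 && bytes[2] == 0xFE && bytes[3] == 0xFF)
                return new UTF32Encoding(bigEndian: true, byteOrderMark: true);
            if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
                return new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
            if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
                return Encoding.Unicode;            // UTF-16 LE
            if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
                return Encoding.BigEndianUnicode;   // UTF-16 BE

            // 2./3. UTF-32/UTF-16 validity checks omitted here; strict UTF-8
            //       check: a throwing decoder rejects invalid byte sequences.
            try
            {
                new UTF8Encoding(false, throwOnInvalidBytes: true).GetString(bytes);
                return new UTF8Encoding(false);
            }
            catch (DecoderFallbackException)
            {
                // Not valid UTF-8.
            }

            // 4. Assume the current ANSI code page.
            return Encoding.Default;
        }

        static void Main()
        {
            byte[] data = File.ReadAllBytes("sample.txt"); // placeholder path
            Console.WriteLine(Guess(data).EncodingName);
        }
    }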
For files in legacy ANSI code pages, no deterministic detection is possible. For example, a file saved with an "eastern European" encoding and loaded on a computer whose default code page is "western European" will be garbled.
The only way to help in this case is to let the user select the code page (for the best user experience, let the user change the assumed encoding once the text is displayed).
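A tiny illustration of the problem (a single byte chosen as an example; the comments assume the standard Windows-1250/1252 tables), showing the same bytes decoding differently under two code pages:

    using System;
    using System.Text;

    class CodePageDemo
    {
        static void Main()
        {
            // On .NET Core / .NET 5+, legacy code pages require:
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            byte[] bytes = { 0xE8 };
            Console.WriteLine(Encoding.GetEncoding(1250).GetString(bytes)); // "č" in Windows-1250
            Console.WriteLine(Encoding.GetEncoding(1252).GetString(bytes)); // "è" in Windows-1252
        }
    }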
The detection steps above work OK on a test set, though misinterpretations are of course possible, if unlikely.
Code pages could in principle be determined by statistical analysis of the text (e.g. the frequency of character pairs and triplets containing non-ASCII characters, or word lists in different languages), but I haven't found a suitable approach that tries this.
The Win32 IsTextUnicode function is notoriously bad: it checks only for UTF-16, and it is probably the culprit behind the "bush hid the facts" bug in Notepad.
As peterchen wrote, type "bush hid the facts" into Notepad.exe, save the file, and reopen it to see how difficult it is to detect the encoding.