I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.

What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
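
For reference, the current detection is roughly this (a simplified sketch, not the editor's actual code; function names and the exact buffer handling are just illustrative):

    #include <cstddef>
    #include <cstring>
    #include <string>

    // Returns the name of the encoding indicated by a BOM, or "" if none is found.
    // Order matters: the UTF-32 BOMs must be tested before UTF-16, because the
    // UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
    std::string DetectBom(const unsigned char* buf, std::size_t len)
    {
        if (len >= 4 && std::memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
        if (len >= 4 && std::memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
        if (len >= 3 && std::memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
        if (len >= 2 && std::memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
        if (len >= 2 && std::memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
        return "";
    }

    // If no BOM is found, scan the first 256 bytes for anything above 0x7F.
    bool LooksLikePlainAscii(const unsigned char* buf, std::size_t len)
    {
        std::size_t limit = len < 256 ? len : 256;
        for (std::size_t i = 0; i < limit; ++i)
            if (buf[i] > 0x7F) return false;
        return true;
    }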

+1  A: 

I don't know about encodings, but make sure it can handle the different line-ending conventions! (\n vs. \r\n)
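
Detecting which convention a file uses can be as simple as counting the sequences and taking the majority; a rough sketch (the function name is just illustrative):

    #include <cstddef>
    #include <string>

    // Count CRLF vs. bare LF vs. bare CR and report whichever dominates.
    // (Sketch only: a real editor may want to preserve mixed line endings.)
    std::string DetectLineEndings(const char* buf, std::size_t len)
    {
        std::size_t crlf = 0, lf = 0, cr = 0;
        for (std::size_t i = 0; i < len; ++i) {
            if (buf[i] == '\r') {
                if (i + 1 < len && buf[i + 1] == '\n') { ++crlf; ++i; }
                else ++cr;
            } else if (buf[i] == '\n') {
                ++lf;
            }
        }
        if (crlf >= lf && crlf >= cr) return crlf ? "\r\n" : "\n";
        return lf >= cr ? "\n" : "\r";
    }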

If you haven't checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/

Specifically this article may be useful: http://blogs.msdn.com/michkap/archive/2007/04/22/2239345.aspx

mletterle
It supports both, don't worry.
George Edison
There's also the Unicode line separator, U+2028, but I've never seen it in the wild.
xan
+4  A: 

Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.

As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
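
In practice the scan is a UTF-8 well-formedness check: if a decent-sized sample decodes cleanly and actually contains multi-byte sequences, UTF-8 is a very safe bet. A minimal sketch of such a check (per RFC 3629; not production code):

    #include <cstddef>

    // Returns true if buf[0..len) is well-formed UTF-8 (RFC 3629):
    // no overlong forms, no surrogate code points, nothing above U+10FFFF.
    bool IsValidUtf8(const unsigned char* buf, std::size_t len)
    {
        std::size_t i = 0;
        while (i < len) {
            unsigned char b = buf[i];
            if (b < 0x80) { ++i; continue; }                         // ASCII byte

            std::size_t extra;                                       // continuation bytes expected
            unsigned long cp;                                        // code point being assembled
            if      ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1Fu; }
            else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0Fu; }
            else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07u; }
            else return false;                                       // stray continuation byte or 0xF8..0xFF

            if (i + extra >= len) return false;                      // sequence truncated at end of buffer
            for (std::size_t k = 1; k <= extra; ++k) {
                if ((buf[i + k] & 0xC0) != 0x80) return false;       // not a continuation byte
                cp = (cp << 6) | (buf[i + k] & 0x3Fu);
            }

            // Reject overlong encodings, UTF-16 surrogates, and values past U+10FFFF.
            if (extra == 1 && cp < 0x80)      return false;
            if (extra == 2 && cp < 0x800)     return false;
            if (extra == 3 && cp < 0x10000)   return false;
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;
            if (cp > 0x10FFFF)                return false;

            i += extra + 1;
        }
        return true;
    }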

Steve Emmerson
A: 

There is no reliable way to detect an arbitrary encoding. The best you could do is something like IE does and rely on letter distributions in different languages, as well as characters that are standard for a given language. But that's a long shot at best.

I would advise getting your hands on a large library of character sets (check out projects like iconv) and making all of those available to the user. But don't bother auto-detecting. Simply allow the user to select a preferred default charset, with UTF-8 as the initial default.
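
If you do go the iconv route, the conversion side is roughly this (a sketch against the POSIX iconv API, with the charset name supplied by the user; error handling is trimmed and the function name is illustrative):

    #include <iconv.h>
    #include <cerrno>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Convert raw bytes from a user-selected charset to UTF-8 using POSIX iconv.
    // Returns false if the charset name is unknown or the data does not decode cleanly.
    bool ConvertToUtf8(const std::string& input, const char* fromCharset, std::string& output)
    {
        iconv_t cd = iconv_open("UTF-8", fromCharset);
        if (cd == (iconv_t)-1) return false;                 // unsupported charset name

        std::vector<char> inBuf(input.begin(), input.end());
        char* inPtr = inBuf.empty() ? NULL : &inBuf[0];
        size_t inLeft = inBuf.size();

        output.clear();
        std::vector<char> outBuf(4096);
        while (inLeft > 0) {
            char* outPtr = &outBuf[0];
            size_t outLeft = outBuf.size();
            size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
            output.append(&outBuf[0], outBuf.size() - outLeft);
            if (rc == (size_t)-1) {
                if (errno == E2BIG) continue;                // output buffer full: flush and loop
                iconv_close(cd);                             // EILSEQ/EINVAL: input invalid in this charset
                return false;
            }
        }
        iconv_close(cd);
        return true;
    }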

Vilx-
Well, I could do that, but I don't think an external library is an option.
George Edison
Not an external library. Character encoding tables. Mappings between Unicode and other character sets. Although an external library would make the conversions WAY easier. Do I understand correctly that you are writing this text editor yourself?
Vilx-
Yes, I am writing it myself.
George Edison
A: 

Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.

Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.
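
The alternating-zero-bytes test can be as simple as counting NUL bytes at even versus odd offsets; a rough sketch (the thresholds are arbitrary guesses you would want to tune):

    #include <cstddef>

    // Heuristic for BOM-less UTF-16: mostly-Latin text encoded as UTF-16 has a zero
    // byte in every other position. Zeros at even offsets suggest UTF-16BE, zeros at
    // odd offsets suggest UTF-16LE.
    enum Utf16Guess { NOT_UTF16, UTF16_LE, UTF16_BE };

    Utf16Guess GuessUtf16(const unsigned char* buf, std::size_t len)
    {
        if (len < 4) return NOT_UTF16;
        std::size_t evenZeros = 0, oddZeros = 0;
        for (std::size_t i = 0; i + 1 < len; i += 2) {
            if (buf[i] == 0)     ++evenZeros;
            if (buf[i + 1] == 0) ++oddZeros;
        }
        std::size_t pairs = len / 2;
        // Require many zeros on one side and almost none on the other (40% / 5% cutoffs).
        if (evenZeros > pairs * 2 / 5 && oddZeros < pairs / 20) return UTF16_BE;
        if (oddZeros > pairs * 2 / 5 && evenZeros < pairs / 20) return UTF16_LE;
        return NOT_UTF16;
    }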

xan
Point taken. But checking the whole file when it is > 5 MB or so is ridiculous and pointless.
George Edison
What I'm thinking is to just support UTF-16 and UTF-8. Standard ASCII characters are the same in ASCII and UTF-8, and the other 128 characters could probably be ignored.
George Edison
A: 

Latin-1 (ISO-8859-1) and its Windows extension CP-1252 definitely have to be supported for Western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB-18030, and remember there are Japanese, Russian, and Greek users too, who all have their own encodings besides UTF-8-encoded Unicode.

As for detection, most encodings are not safely detectable. In some single-byte encodings, certain byte values are unassigned or essentially never occur in text (the C1 control range 0x80-0x9F in Latin-1, for example). In UTF-8, almost any byte value can occur, but not every sequence of byte values is valid. In practice, however, you would not do the decoding yourself, but use an encoding/decoding library, try to decode, and catch errors. So why not support all encodings that this library supports?

You could also develop heuristics, like decoding with a specific encoding and then testing the result for strange characters, unusual character combinations, or the frequency of such characters. But this would never be completely safe, and I agree with Vilx- that you shouldn't bother. In my experience, people usually know that a file has a certain encoding, or that only two or three are possible. So if they see you chose the wrong one, they can easily adapt. And have a look at other editors: the cleverest solution is not always the best, especially if people are used to other programs.
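
For what it's worth, the strange-character test can be as crude as counting control characters and replacement characters after a trial decode; a sketch (the 1% cutoff is entirely ad hoc):

    #include <cstddef>
    #include <string>

    // Score a trial-decoded wide string: count control characters (other than tab,
    // newline and carriage return) and the replacement character U+FFFD. A high
    // ratio suggests the guessed encoding was wrong.
    bool LooksLikeText(const std::wstring& decoded)
    {
        if (decoded.empty()) return true;
        std::size_t suspicious = 0;
        for (std::size_t i = 0; i < decoded.size(); ++i) {
            unsigned long cp = static_cast<unsigned long>(decoded[i]);
            bool control = (cp < 0x20 && cp != 0x09 && cp != 0x0A && cp != 0x0D)
                        || (cp >= 0x7F && cp <= 0x9F);
            if (control || cp == 0xFFFD) ++suspicious;
        }
        return suspicious * 100 < decoded.size();   // fewer than 1% suspicious characters
    }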

thieger
I'm using wxWidgets, which supports UTF-16 and UTF-8 if I'm not mistaken.
George Edison
I read the documentation, and wxWidgets supports quite a few others like UTF-32. Not that it is very common, mind you.
George Edison
+1  A: 

UTF-16 is not very common in plain text files. UTF-8 is much more common because it is backward compatible with ASCII and is specified in standards like XML.

1) Check for the BOM of the various Unicode encodings. If one is found, use that encoding.
2) If there is no BOM, check whether the file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (since many files are almost all ASCII but may have a few accented characters or smart quotes) or the file ends. If it is valid UTF-8, use UTF-8 (see the sketch after this list).
3) If it is not Unicode, it is probably the current platform's default codepage.
4) Some encodings are easy to detect; for example, Japanese Shift-JIS will make heavy use of the prefix bytes 0x82 and 0x83, which introduce hiragana and katakana.
5) Give the user the option to change the encoding if the program's guess turns out to be wrong.
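
Putting steps 1-3 together, the skeleton might look like the sketch below. DetectBom and IsValidUtf8 stand in for whatever BOM sniffer and UTF-8 validity check you use, and the "ANSI" fallback label is illustrative only:

    #include <cstddef>
    #include <string>

    // Assumed helpers (illustrative names): a BOM sniffer that returns the encoding
    // name or "" when no BOM is present, and a UTF-8 well-formedness check.
    std::string DetectBom(const unsigned char* buf, std::size_t len);
    bool IsValidUtf8(const unsigned char* buf, std::size_t len);

    // Steps 1-3: BOM first, then a UTF-8 validity scan, then the locale codepage.
    std::string GuessEncoding(const unsigned char* buf, std::size_t len)
    {
        std::string bom = DetectBom(buf, len);       // step 1: trust an explicit BOM
        if (!bom.empty()) return bom;

        if (IsValidUtf8(buf, len)) return "UTF-8";   // step 2: well-formed UTF-8 sample

        return "ANSI";                               // step 3: fall back to the locale's
                                                     // default codepage (e.g. CP-1252)
    }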

Joseph Boyle