I'm looking for a way to detect character sets within documents. I've been reading the Mozilla character set detection implementation here:
I've also found a Java implementation of this called jCharDet:
Both of these are based on research carried out using a set of static data. What I'm wondering is whether anybody has used any other implementation successfully and if so what? Did you roll your own approach and if so what was the algorithm you used to detect the character set?
Any help would be appreciated. I'm not looking for a list of existing approaches via Google, nor am I looking for a link to the Joel Spolsky article - just to clarify : )
UPDATE: I did a bunch of research into this and ended up finding a framework called cpdetector that uses a pluggable approach to character detection, see:
This provides BOM, chardet (Mozilla approach) and ASCII detection plugins. It's also very easy to write your own. There's also another framework, which provides much better character detection that the Mozilla approach/jchardet etc...
It's quite easy to write your own plugin for cpdetector that uses this framework to provide a more accurate character encoding detection algorithm. It works better than the Mozilla approach.