After some research, I found that there are a few encoding detection projects in the Java world, for cases where getEncoding in InputStreamReader does not work:

  1. juniversalchardet
  2. jchardet
  3. cpdetector
  4. ICU4J

However, I really do not know which of them is the best. Can anyone with hands-on experience tell me which one is the best for Java?
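For context, here is a minimal sketch of why getEncoding does not help with detection: as far as I understand the API, InputStreamReader reports the charset it was constructed with (or the platform default), not one inferred from the bytes. The class and method names below are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any of the libraries listed above.
final class GetEncodingDemo {

    // Returns the charset an InputStreamReader reports after the caller
    // has told it which charset to decode with.
    static Charset reportedCharset(byte[] data, Charset assumed) throws IOException {
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(data), assumed)) {
            // getEncoding() may return a historical name such as "ISO8859_1",
            // so resolve it back to a Charset before comparing.
            return Charset.forName(reader.getEncoding());
        }
    }

    public static void main(String[] args) throws IOException {
        // "cafe" with an accented e, encoded as UTF-8.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // The bytes are UTF-8, but the reader just echoes what it was given.
        System.out.println(reportedCharset(utf8Bytes, StandardCharsets.ISO_8859_1));
    }
}
```

So the reader never inspects the content to guess a charset, which is why a separate detection library is needed.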

+1  A: 

I've personally used jchardet in our project (juniversalchardet wasn't available back then) just to check if a stream was UTF-8 or not.

It was easier to integrate with our application than the others and yielded great results.
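If the only question is "UTF-8 or not", a strict decode with the standard library can serve as a rough check. To be clear, this is not how jchardet works and it is not what we ran; it is just a sketch using java.nio.charset, and the names Utf8Check and isValidUtf8 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jchardet or any other detector.
final class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed
    // sequence; returns false as soon as the strict decoder rejects them.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

One caveat: pure ASCII input also passes, since ASCII is valid UTF-8, so this only tells you that the bytes are consistent with UTF-8, not what the author intended.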

antispam
A: 

I found an answer online:

http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html

It says something valuable:

The strength of a character encoding detector lies in whether or not its focus is on statistical analysis or HTML META and XML prolog discovery. If you are processing HTML files that have META, use cpdetector. Otherwise, your best bet is either monq.stuff.EncodingDetector or com.sun.syndication.io.XmlReader.

So that's why I am using cpdetector now. I will update this post with the results.

Winston Chen
Do you only care about files that already are tagged with the charset via XML or META? That test is very, very suspect (so much so that I ran it myself). The test files it uses are not real content, but they are code charts. I.e., they are not "text in encoding X" but "text in English with a list of the code points in encoding X". However, all test files are tagged with the encoding. A comparison should be done, but not with these test files.
Steven R. Loomis
Further testing: I ran the test case in that blog against the same detectors (latest versions) on untagged data. ONLY ICU detected: euc-jp, iso-2022-jp, koi8-r, iso-2022-cn, iso-2022-kr... Only ICU and Mozilla jchardet detected: shift-jis, gb18030, big5... I used samples from http://source.icu-project.org/repos/icu/icu/trunk/source/extra/uconv/samples/ and the utf-8 directory (some converted from files there into the target codepage).
Steven R. Loomis
+1  A: 

Your other answer made me do some research, and (biased as I am) I now think that ICU's detector may be the best for untagged data.

Steven R. Loomis
Cool!! Thank you so much for this. I will then focus on both ICU and cpdetector, do some experiments, and see which best serves my needs!!
Winston Chen
You can try out the ICU detector via Java Web Start at http://icu-project.org/icu4jdemos.html using either the "Web Start Demo" (which can only detect on URLs) or the "Downloadable Demo Jar". Once started, just click the DetectingViewer.
Steven R. Loomis