Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

  • No additional meta-data is available. The byte array is literally the only available input.
  • The detection algorithm will obviously not be 100% correct. If the algorithm is correct in more than, say, 80% of cases, that is good enough.
+1  A: 

Check out jchardet
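
For context, here is a minimal sketch of how jchardet is typically driven, modeled on the example code shipped with the library (treat the exact class and method names as assumptions to verify against your jchardet version):

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;

public static String guessWithJchardet(byte[] bytes) {
    nsDetector detector = new nsDetector(nsPSMDetector.ALL);
    final String[] detected = new String[1];
    // The observer is called back as soon as the detector is confident.
    detector.Init(new nsICharsetDetectionObserver() {
        public void Notify(String charset) {
            detected[0] = charset;
        }
    });
    detector.DoIt(bytes, bytes.length, false);
    detector.DataEnd();
    if (detected[0] != null) {
        return detected[0];
    }
    // No confident answer: fall back to the most probable candidate, if any.
    String[] probable = detector.getProbableCharsets();
    return (probable != null && probable.length > 0) ? probable[0] : null;
}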

Chi
Please elaborate - why do you consider jchardet to be the best library around?
knorv
A: 

There should already be libraries available for this.

A Google search turned up ICU4J

or

http://jchardet.sourceforge.net/
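
For ICU4J in particular, detection goes through com.ibm.icu.text.CharsetDetector. A minimal sketch (requires the icu4j jar on the classpath):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public static String guessWithIcu4j(byte[] bytes) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    // detect() returns the highest-confidence match, or null if none is found.
    CharsetMatch match = detector.detect();
    return match == null ? null : match.getName();
}

CharsetMatch also exposes a 0-100 confidence score via getConfidence(), which maps nicely onto the questioner's 80% threshold.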

gomesla
I kind of know how to use Google, but the question specifically asks for "what is the best way [..]". So which is best, icu4j, jchardet or some other library?
knorv
+1  A: 

Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:

http://www.joelonsoftware.com/articles/Unicode.html

Roughly speaking, all the assumed-to-be-text is parsed in every encoding imaginable, and whichever parse best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention it just in case.
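
To make the idea concrete, here is a toy illustration of frequency-style scoring, not IE's actual algorithm: decode the bytes in every candidate encoding and keep whichever result looks most like natural-language text.

import java.nio.charset.Charset;

// Toy scorer: a real detector would compare the decoded text against
// per-language letter and word frequency tables instead of this crude
// "plausible characters" ratio.
public static Charset guessByFrequency(byte[] bytes, Charset[] candidates) {
    Charset best = null;
    double bestScore = -1.0;
    for (Charset candidate : candidates) {
        // Malformed byte sequences decode to U+FFFD, which lowers the score.
        String decoded = new String(bytes, candidate);
        int plausible = 0;
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            if (Character.isLetterOrDigit(c) || Character.isWhitespace(c)
                    || ".,;:!?'\"()-".indexOf(c) >= 0) {
                plausible++;
            }
        }
        double score = decoded.length() == 0 ? 0.0
                : plausible / (double) decoded.length();
        if (score > bestScore) {
            bestScore = score;
            best = candidate;
        }
    }
    return best;
}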

Rooke
A: 

Without an encoding indicator, you will never know for sure. However, you can make some intelligent guesses. See my answer to this question:

http://stackoverflow.com/questions/887148/how-to-determine-if-a-string-contains-invalid-encoded-characters

Use the validUTF8() method. If it returns true, treat the data as UTF-8; otherwise, treat it as Latin-1.
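
The linked validUTF8() implementation is not reproduced here, but an equivalent strict check can be written against the JDK's CharsetDecoder, for example:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Strict UTF-8 validity check: decoding fails on any malformed sequence.
public static boolean validUTF8(byte[] bytes) {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        decoder.decode(ByteBuffer.wrap(bytes));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}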

ZZ Coder
What about the cases where it is not UTF-8?
knorv
+8  A: 

The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

public static String guessEncoding(byte[] bytes) {
    final String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    // Feed the entire buffer to the detector, then signal end of input.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // getDetectedCharset() returns null when no encoding could be detected.
    String encoding = detector.getDetectedCharset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    detector.reset();
    return encoding;
}

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.

knorv
+2  A: 

Here's my favorite: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

It works like this (a code sketch follows below):

  • If there's a UTF-8 or UTF-16 BOM, return that encoding.
  • If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
  • If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
  • Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).

It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
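
The linked GuessEncoding class is not reproduced here, but the decision chain above is straightforward to sketch; the class and method names below are illustrative, not Glaforge's actual API:

import java.nio.charset.Charset;

public class SimpleEncodingGuesser {

    public static Charset guess(byte[] b) {
        // 1. Byte-order marks.
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return Charset.forName("UTF-8");
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Charset.forName("UTF-16BE");
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Charset.forName("UTF-16LE");
        }

        // 2. No high-order bits set: plain ASCII.
        boolean highBitSeen = false;
        for (byte value : b) {
            if ((value & 0x80) != 0) {
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) {
            return Charset.forName("US-ASCII");
        }

        // 3. High bits present, but only in valid UTF-8 multi-byte patterns.
        if (looksLikeUtf8(b)) {
            return Charset.forName("UTF-8");
        }

        // 4. Otherwise fall back to the platform default (e.g. windows-1252).
        return Charset.defaultCharset();
    }

    // Structural UTF-8 check: every lead byte must be followed by the
    // right number of 10xxxxxx continuation bytes.
    private static boolean looksLikeUtf8(byte[] b) {
        for (int i = 0; i < b.length; ) {
            int c = b[i] & 0xFF;
            int extra;
            if (c < 0x80) extra = 0;
            else if (c >= 0xC2 && c <= 0xDF) extra = 1;
            else if (c >= 0xE0 && c <= 0xEF) extra = 2;
            else if (c >= 0xF0 && c <= 0xF4) extra = 3;
            else return false; // illegal lead byte
            if (i + extra >= b.length) return false;
            for (int j = 1; j <= extra; j++) {
                if ((b[i + j] & 0xC0) != 0x80) return false;
            }
            i += extra + 1;
        }
        return true;
    }
}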

Alan Moore
+1  A: 

There is also Apache Tika, a content analysis toolkit. It can guess both the MIME type and the encoding, and the guess is usually correct with very high probability.
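
A minimal sketch of both detections with Tika (the charset detector lives in the tika-parsers jar and is a fork of ICU4J's CharsetDetector; treat the exact coordinates as assumptions against your Tika version):

import org.apache.tika.Tika;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public static void detectWithTika(byte[] bytes) {
    // MIME type detection via the Tika facade class (tika-core).
    String mimeType = new Tika().detect(bytes);

    // Encoding detection via Tika's bundled CharsetDetector (tika-parsers).
    CharsetDetector detector = new CharsetDetector();
    detector.setText(bytes);
    CharsetMatch match = detector.detect();
    String encoding = (match == null) ? "unknown" : match.getName();

    System.out.println(mimeType + " / " + encoding);
}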

Thomas Mueller