ansaurus

Question

Guessing the encoding of text represented as byte[] in Java

Answer 1

+1 A:

Check out jchardet

Chi 2009-11-05 00:24:57

Please elaborate - why do you consider jchardet to be the best library around?

knorv 2009-11-05 05:51:02

Answer 2

A:

There should be libraries for this already. A Google search turned up icu4j and jchardet:

http://jchardet.sourceforge.net/

gomesla 2009-11-05 01:00:10

I kind of know how to use Google, but the question specifically asks for "what is the best way [..]". So which is best, icu4j, jchardet or some other library?

knorv 2009-11-05 05:50:02

Answer 3

+1 A:

Chi's answer seems most promising for real use. I just want to add that, according to Joel Spolsky, Internet Explorer used a frequency-based guessing algorithm in its day:

http://www.joelonsoftware.com/articles/Unicode.html

Roughly speaking, everything assumed to be text is copied and parsed in every encoding imaginable. Whichever parse best fits a language's average word (and letter?) frequency profile wins. I cannot quickly tell whether jchardet uses the same kind of approach, so I thought I'd mention this just in case.

Rooke 2009-11-05 01:01:06

Answer 4

A:

Without an encoding indicator, you will never know for sure. However, you can make some intelligent guesses. See my answer to this question:

http://stackoverflow.com/questions/887148/how-to-determine-if-a-string-contains-invalid-encoded-characters

Use the validUTF8() method. If it returns true, treat the data as UTF-8; otherwise, treat it as Latin-1.
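For readers who don't want to follow the link, here is a minimal sketch of the same idea using only the JDK. It is not the validUTF8() code from the linked answer; the class and method names (Utf8OrLatin1, isValidUtf8, decode) are made up for illustration. It relies on a strict CharsetDecoder that reports malformed input instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library mentioned in this thread.
final class Utf8OrLatin1 {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes as UTF-8 when the bytes are valid UTF-8, otherwise as Latin-1. */
    static String decode(byte[] bytes) {
        return new String(bytes,
                isValidUtf8(bytes) ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1);
    }
}
```

Pure ASCII input is also valid UTF-8, so it takes the UTF-8 branch. Latin-1 can decode any byte sequence, so the fallback never fails, but it will produce wrong characters if the data is actually in some other 8-bit or multi-byte encoding.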

ZZ Coder 2009-11-05 01:28:44

What about the cases where it is not UTF-8?

knorv 2009-11-05 05:45:47

Answer 5

+8 A:

The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

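// Requires juniversalchardet on the classpath.
public static String guessEncoding(byte[] bytes) {
    final String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    // Feed the whole buffer, then signal end of input so the detector can decide.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // Fall back to the default when the detector could not determine an encoding.
    String encoding = detector.getDetectedCharset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    detector.reset();
    return encoding;
}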

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.

knorv 2009-11-05 07:04:48

Answer 6

+2 A:

Here's my favorite: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

It works like this:

If there's a UTF-8 or UTF-16 BOM, return that encoding.
If none of the bytes have the high-order bit set, return ASCII (or you can force it to return a default 8-bit encoding instead).
If there are bytes with the high bit set but they're arranged in the correct patterns for UTF-8, return UTF-8.
Otherwise, return the platform default encoding (e.g., windows-1252 on an English-locale Windows system).

It may sound overly simplistic, but in my day-to-day work it's well over 90% accurate.
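The four steps above can be sketched with the JDK alone. This is my own illustration of the described logic, not the GuessEncoding source; the class name SimpleEncodingGuesser and the explicit fallback parameter are assumptions (pass Charset.defaultCharset() to get the platform default).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the BOM / ASCII / UTF-8 / fallback heuristic.
final class SimpleEncodingGuesser {

    static Charset guess(byte[] b, Charset fallback) {
        // Step 1: look for a UTF-8 or UTF-16 byte order mark.
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b.length >= 2) {
            int b0 = b[0] & 0xFF;
            int b1 = b[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }

        // Step 2: no byte has the high-order bit set, so the data is plain ASCII.
        boolean highBitSeen = false;
        for (byte x : b) {
            if (x < 0) {          // Java bytes are signed: negative means >= 0x80
                highBitSeen = true;
                break;
            }
        }
        if (!highBitSeen) return StandardCharsets.US_ASCII;

        // Step 3: high-bit bytes are present; accept UTF-8 only if well-formed.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(b));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Step 4: give up and use the caller-supplied default.
            return fallback;
        }
    }
}
```

Note that the sketch only reports the charset; it does not strip the BOM, so the caller still has to skip those leading bytes (or the resulting U+FEFF character) when decoding.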

Alan Moore 2009-11-05 12:46:35

Answer 7

+1 A:

There is also Apache Tika, a content analysis toolkit. It can guess both the MIME type and the encoding, and the guess is usually correct.

Thomas Mueller 2010-09-20 12:38:05
