views:

11884

answers:

8

With reference to the following thread: http://stackoverflow.com/questions/498636/java-app-unable-to-read-iso-8859-1-encoded-file-correctly

What is the best way to programmatically determine the correct charset encoding of an InputStream/file?

I have tried using the following:

  File in =  new File(args[0]);
  InputStreamReader r = new InputStreamReader(new FileInputStream(in));
  System.out.println(r.getEncoding());

But on a file which I know to be encoded with ISO8859_1, the above code yields ASCII, which is not correct and does not allow me to render the content of the file back to the console correctly.

A: 

Can you pick the appropriate charset in the constructor:

new InputStreamReader(new FileInputStream(in), "ISO8859_1");
Kevin
The point here was to see whether the charset could be determined programmatically.
Joel
No, it won't guess it for you. You have to supply it.
Kevin
There may be a heuristic method, as suggested by some of the answers here http://stackoverflow.com/questions/457655/java-charset-and-windows/457849#457849
Joel
+18  A: 

You cannot determine the encoding of an arbitrary byte stream. That is the nature of encodings: an encoding is a mapping between byte values and the characters they represent, so every encoding "could" be the right one.

The getEncoding() method returns the encoding the stream was set up with (read the JavaDoc); it will not guess the encoding for you.
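For example, here is a small self-contained illustration (not from the original post; the class name is made up): whatever the underlying bytes actually are, getEncoding() simply reports the charset the reader was given.

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class ShowEncoding {
        public static void main(String[] args) throws IOException {
            // getEncoding() echoes the charset the reader was constructed with
            // (or the platform default); it does not inspect the bytes at all.
            InputStreamReader r = new InputStreamReader(
                    new FileInputStream(args[0]), "ISO-8859-1");
            System.out.println(r.getEncoding()); // prints the historical name, e.g. ISO8859_1
        }
    }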

Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.

Anyway, you could try to guess an encoding on your own if you have to. Every language has a typical frequency for each character: in English the character 'e' appears very often, while 'ê' is very rare. An ISO-8859-1 stream usually contains no 0x00 bytes, whereas a UTF-16 stream has a lot of them.
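For illustration, here is a minimal sketch of that zero-byte heuristic (not from the original answer; the class name, the 8 KB sample size and the 10% threshold are arbitrary choices):

    import java.io.IOException;
    import java.io.InputStream;

    public class NaiveEncodingGuesser {

        // Very rough heuristic: UTF-16 text contains many 0x00 bytes,
        // while ASCII/ISO-8859-1 text normally contains none.
        static String guess(InputStream in) throws IOException {
            int zeroBytes = 0;
            int totalBytes = 0;
            int b;
            while ((b = in.read()) != -1 && totalBytes < 8192) {
                totalBytes++;
                if (b == 0x00) {
                    zeroBytes++;
                }
            }
            // If more than 10% of the sampled bytes are zero, assume UTF-16.
            if (totalBytes > 0 && zeroBytes * 10 > totalBytes) {
                return "UTF-16";
            }
            return "ISO-8859-1"; // fall back to an 8-bit encoding
        }
    }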

Or: you could ask the user. I've already seen applications which present you a snippet of the file in different encodings and ask you to select the "correct" one.

Eduard Wirch
+2  A: 

If you don't know the encoding of your data, it is not easy to determine, but you could try to use a library to guess it. There is also a similar question.

Fabian Steeg
+4  A: 

You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results, e.g. do you know beforehand if the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.
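As an illustration (not code from the answer itself), such a validity check with CharsetDecoder could look roughly like this; the class and method names are made up:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class CharsetValidator {

        // Returns true if the bytes decode under the given charset without
        // malformed-input or unmappable-character errors.
        static boolean isValid(byte[] bytes, String charsetName) {
            CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(bytes));
                return true;
            } catch (CharacterCodingException e) {
                return false; // the bytes are not valid in this charset
            }
        }
    }

Keep in mind that a true result only means the bytes are legal in that charset, not that it is the charset the author intended.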

Zach Scrivena
+1  A: 

For ISO8859_1 files there is no easy way to distinguish them from ASCII. For Unicode files, however, this can generally be detected from the first few bytes of the file.

UTF-8 and UTF-16 files may include a Byte Order Mark (BOM) at the very beginning of the file. The BOM is the character U+FEFF, a zero-width no-break space, encoded in the file's own encoding.

Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad will check the BOM and use the appropriate encoding. On Unix or under Cygwin, you can check the BOM with the file command. For example:

$ file sample2.sql 
sample2.sql: Unicode text, UTF-16, big-endian

For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding
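For illustration only (this is not the code behind that link), a minimal BOM-sniffing sketch could look like this; the class and method names are made up, and only the three common BOMs are handled:

    import java.io.IOException;
    import java.io.PushbackInputStream;

    public class BomSniffer {

        // Inspects the first bytes of the stream and returns the charset implied
        // by a BOM, or null if no BOM is present. Non-BOM bytes are pushed back.
        static String detectBom(PushbackInputStream in) throws IOException {
            byte[] head = new byte[3];
            int n = in.read(head, 0, 3);
            String charset = null;
            int bomLength = 0;
            if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
                charset = "UTF-8";    bomLength = 3;  // EF BB BF
            } else if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
                charset = "UTF-16BE"; bomLength = 2;  // FE FF
            } else if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
                charset = "UTF-16LE"; bomLength = 2;  // FF FE
            }
            if (n > bomLength) {
                in.unread(head, bomLength, n - bomLength); // keep non-BOM bytes readable
            }
            return charset;
        }
    }

You would wrap the file in a PushbackInputStream with a pushback buffer of at least 3 bytes, e.g. new PushbackInputStream(new FileInputStream(file), 3), and pass a non-null result to the InputStreamReader constructor.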

brianegge
+2  A: 

I found a nice third-party library which can detect the actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

I didn't test it extensively but it seems to work.

falcon
+2  A: 

The libraries mentioned above are simple BOM detectors which, of course, only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/, which actually scans the text.

Lorrat
A: 

Check out http://site.icu-project.org/ (ICU4J). It has libraries for detecting the charset of an IOStream. It can be as simple as this:

BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();

if (cm != null) {
    reader = cm.getReader();
    charset = cm.getName();
} else {
    // fall back to a default charset
}
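As far as I know, ICU4J's CharsetMatch also exposes a confidence score via getConfidence(), and CharsetDetector.detectAll() returns every plausible match, which is useful when the single best guess turns out to be wrong.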