ansaurus

Question

How can I identify different encodings without the use of a BOM?

Answer 1

+2 A:

In general, you cannot identify the character encoding of a data stream with 100% accuracy. The best you can do is try to decode using a limited set of expected encodings, and then apply some heuristics to the decoded result to see if it "looks like" text in the expected language. (But any heuristic will give false positives and false negatives for certain data streams.) Alternatively, put a human in the loop to decide which decoding makes the most sense.
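As a rough illustration of the "try a limited set of encodings, then apply a heuristic" approach, here is a minimal Python sketch. The function names, the default candidate list, and the control-character heuristic are all assumptions made for this example, and a real system would likely need a language-aware check.

```python
def looks_like_text(text):
    """Crude heuristic (an assumption for this sketch): reject decoded
    results that contain control characters other than common whitespace."""
    return not any(ord(ch) < 32 and ch not in "\t\n\r" for ch in text)


def decode_with_candidates(data, candidates=("utf-8", "cp1252")):
    """Try each expected encoding in order, using strict decoding.

    Returns (text, encoding) for the first candidate that both decodes
    without error and passes the heuristic, or (None, None) otherwise.
    """
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # byte sequence is invalid in this encoding
        if looks_like_text(text):
            return text, encoding
    return None, None
```

The order of the candidates matters: permissive single-byte encodings accept almost any input, so stricter encodings such as UTF-8 should usually be tried first. The result is still only a guess, with the false positives and false negatives described above.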

A better solution is to redesign your protocol so that whatever supplies the data must also supply the encoding scheme used for it. (And if you cannot, blame whoever is responsible for designing / implementing a system that cannot give you an encoding scheme!)

EDIT: from your comments on the question, the data files are being delivered via HTTP. In that case, you should arrange for your HTTP server to capture the "Content-Type" header of the POST requests delivering the data, extract the character set / encoding from the header, and save it in a way / place that your file parser can deal with.
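Extracting the charset from a Content-Type header value might look something like the following Python sketch. The function name and the `default` parameter are hypothetical. How the header value reaches this function depends on your server framework, which is not shown here.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lower-cased charset parameter of a Content-Type header
    value, or `default` if the header does not declare one."""
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_content_charset()  # None when no charset parameter
    return charset if charset is not None else default
```

For example, `charset_from_content_type("text/plain; charset=UTF-8")` yields `"utf-8"`, which could then be stored alongside the uploaded file for the parser to use.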

Stephen C 2009-08-28 00:50:13

Answer 2

A:

This will cause you headaches down the road, no doubt about it. You can check for alternating zero bytes in the simplistic cases (ASCII-only text encoded as UTF-16, in either byte order), but the minute you start getting a stream of characters above the 0x7f code point, that method becomes useless.
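For reference, the alternating-zero-byte check could be sketched in Python roughly as follows. The function name is made up for this example, and the check deliberately gives up (returns `None`) as soon as the pattern breaks, which is exactly the limitation described above.

```python
def guess_utf16_byte_order(data):
    """Guess the byte order of BOM-less UTF-16 data from its zero bytes.

    Returns "utf-16-le" or "utf-16-be" when every other byte is zero,
    otherwise None. This only works for text in a very restricted range.
    """
    if len(data) < 2 or len(data) % 2:
        return None
    even, odd = data[0::2], data[1::2]
    if all(even) and not any(odd):
        return "utf-16-le"  # low byte first, high byte always zero
    if all(odd) and not any(even):
        return "utf-16-be"  # high byte (zero) first
    return None
```

A single character such as "€" (U+20AC) is enough to defeat the check, because neither of its two bytes is zero.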

If you have the file handle, the best bet is to save the current file pointer, seek to the start, read the BOM then seek back to the original position.
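A minimal Python sketch of that save, seek, read, and restore sequence is shown below. It assumes a seekable handle opened in binary mode. The function name and the table of BOMs are illustrative choices, not part of the original answer.

```python
import codecs

# Longer BOMs come first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def encoding_from_bom(handle):
    """Read the BOM at the start of a seekable binary file handle and
    restore the original file position afterwards.

    Returns the encoding name implied by the BOM, or None if there is none.
    """
    position = handle.tell()      # save the current file pointer
    try:
        handle.seek(0)            # go to the start of the file
        head = handle.read(4)     # the longest BOM is 4 bytes
    finally:
        handle.seek(position)     # seek back to where we were
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

This only helps when the file actually starts with a BOM. If it returns `None`, the caller still has to fall back on some other source of encoding information.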

Either that or remember the BOM somehow.

Relying on the data contents is a bad idea unless you're absolutely certain the character range will be restricted for all inputs.

paxdiablo 2009-08-28 00:50:56

Relying on the BOM is a worse idea unless you're absolutely certain that the file will have one.

dan04 2010-08-14 21:54:47

"The first bit of data written to it has the BOM available" was in the question so I _was_ absolutely certain :-)

paxdiablo 2010-08-15 00:48:42

Answer 3

A:

This question contains a few options for character detection which don't appear to require a BOM.

My project is currently using jCharDet but I might need to look at some of the other options listed there as jCharDet is not 100% reliable.

jwaddell 2009-08-28 05:15:07

@jwaddell: No character detection scheme is going to be 100% reliable.

Stephen C 2009-08-28 05:29:05
