We have a CMS which has several thousand text/html files in it. It turns out that users have been uploading text/html files in various character encodings (UTF-8, UTF-8 with BOM, windows-1252, ISO-8859-1).

When these files are read in and written to the response, our CMS's framework forces charset=UTF-8 on the response's Content-Type header.

Because of this, any non-UTF-8 content is displayed to the user with mangled characters (?, black diamonds, etc., when there isn't a correct character translation from the "native" encoding to UTF-8). There is also no metadata attached to these documents that indicates their charset. As far as I know, the only way to tell what charset they are is to open them in a text rendering app (Firefox, Notepad++, etc.) and "look" at the content to see if it "looks" right.

Does anyone know how to automatically/intelligently convert files of unknown encoding to UTF-8? I've read this can be accomplished with statistical modeling, but that's above my head.

Thoughts on how to best approach the problem?

Thanks

+1  A: 

Try to decode it as UTF-8. If this fails then look for \x92, and decode as CP1252 if found. Otherwise, decode as Latin-1.
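A rough Java sketch of that idea, in case it helps (the class and method names are mine, and it assumes you've already read the file into a byte array):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class FallbackDecoder {
    // Try strict UTF-8 first; on failure, decode as cp1252 if the bytes
    // contain \x92 (curly apostrophe in cp1252), otherwise as Latin-1.
    static String decode(byte[] bytes) {
        try {
            return Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            for (byte b : bytes) {
                if ((b & 0xFF) == 0x92) {
                    return new String(bytes, Charset.forName("windows-1252"));
                }
            }
            return new String(bytes, Charset.forName("ISO-8859-1"));
        }
    }
}
```

The strict decoder is important: the default `new String(bytes, "UTF-8")` silently replaces bad bytes instead of failing, so you'd never reach the fallback.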

Ignacio Vazquez-Abrams
If it's not valid UTF-8, you might as well go straight to cp1252. It only makes a difference for bytes `\x80` to `\x9F`, but it's vanishingly unlikely anyone ever used the characters ISO-8859-1 specified for those bytes (they're all useless control codes).
bobince
Why would you check for only *one* of the cp1252 extension characters anyway? What if the text contains curly *double* quotes (`\x93`, `\x94`) but no curly single quotes (`\x91`, `\x92`)? But like @bobince said, if it's valid ISO-8859-1, you can safely assume it's valid cp1252.
Alan Moore
@bobince, Alan: A far more interesting distinction is between cp1252 and ISO-8859-15, which is quite likely what some of those "ISO-8859-1" documents really are - the euro symbol isn't exactly irrelevant these days.
Michael Borgwardt
@Michael: I have to say I've yet to meet an 8859-15 document in the wild. I think it came about a bit too late to see widespread uptake: everyone who cared about standard charsets was already headed towards UTF-8, and everyone else stuck with cp1252.
bobince
+3  A: 

You can use ICU4J's CharsetDetector.
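Something along these lines (just a sketch, assuming ICU4J is on the classpath and that the input/output paths come in as command-line arguments):

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DetectAndConvert {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));

        // Ask ICU4J for its best statistical guess at the charset.
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        CharsetMatch match = detector.detect();

        System.out.println("Detected " + match.getName()
                + " (confidence " + match.getConfidence() + "/100)");

        // Re-encode the document as UTF-8 using the detected charset.
        String text = new String(bytes, Charset.forName(match.getName()));
        Files.write(Paths.get(args[1]), text.getBytes(Charset.forName("UTF-8")));
    }
}
```

The confidence score is worth logging so you can review the low-confidence files by hand.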

axtavt
A: 

In general, there is no way to tell. The byte sequence 63 61 66 C3 A9 is equally valid as "cafÃ©" in windows-1252, "caf├⌐" in IBM437, or "café" in UTF-8. The last is statistically more likely, though.

If you don't want to deal with statistical methods, an approach that works much of the time is to assume that anything that looks like UTF-8 is, and that anything else is in windows-1252.

Or if UTF-16 is a possibility, look for FE FF or FF FE at the beginning of the file.
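A quick sketch of that BOM check in Java (helper name and 0xEF BB BF handling for UTF-8-with-BOM are my additions):

```java
public class BomSniffer {
    // Returns a charset name if a known byte-order mark is present,
    // or null if there is no BOM to go on.
    static String sniffBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";     // EF BB BF: UTF-8 with BOM
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";  // FE FF: big-endian UTF-16
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";  // FF FE: little-endian UTF-16
        }
        return null;
    }
}
```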

dan04