We have a CMS which has several thousand text/html files in it. It turns out that users have been uploading text/html files in various character encodings (UTF-8, UTF-8 with BOM, windows-1252, ISO-8859-1).

When these files are read in and written to the response, our CMS's framework forces charset=UTF-8 on the response's Content-Type header.

Because of this, any non-UTF-8 content is displayed to the user with mangled characters (?, black diamonds, etc., when there isn't a correct character translation from the "native" encoding to UTF-8). There is also no metadata attached to these documents that indicates their charset. As far as I know, the only way to tell what charset they are is to open them in a text rendering app (Firefox, Notepad++, etc.) and "look" at the content to see if it "looks" right.

Does anyone know how to automatically/intelligently convert files of unknown encoding to UTF-8? I've read this can be accomplished with statistical modeling, but that's above my head.

Thoughts on how to best approach the problem?

Thanks

+1  A: 

Try to decode it as UTF-8. If this fails then look for \x92, and decode as CP1252 if found. Otherwise, decode as Latin-1.
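A rough Java sketch of that idea, in case it helps (the class and method names are mine, and it assumes you've already read the file into a byte array):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class FallbackDecoder {
    // Try strict UTF-8 first; on failure, decode as cp1252 if the bytes
    // contain \x92 (curly apostrophe in cp1252), otherwise as Latin-1.
    static String decode(byte[] bytes) {
        try {
            return Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            for (byte b : bytes) {
                if ((b & 0xFF) == 0x92) {
                    return new String(bytes, Charset.forName("windows-1252"));
                }
            }
            return new String(bytes, Charset.forName("ISO-8859-1"));
        }
    }
}
```

The strict decoder is important: the default `new String(bytes, "UTF-8")` silently replaces bad bytes instead of failing, so you'd never reach the fallback.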

Ignacio Vazquez-Abrams
If it's not valid UTF-8, you might as well go straight to cp1252. It only makes a difference for bytes `\x80` to `\x9F`, but it's vanishingly unlikely anyone ever used the characters ISO-8859-1 specified for those bytes (they're all useless control codes).
bobince
Why would you check for only *one* of the cp1252 extension characters anyway? What if the text contains curly *double* quotes (`\x93`, `\x94`) but no curly single quotes (`\x91`, `\x92`)? But like @bobince said, if it's valid ISO-8859-1, you can safely assume it's valid cp1252.
Alan Moore
@bobince, Alan: A far more interesting distinction is between cp1252 and ISO-8859-15, which is quite likely what some of those "ISO-8859-1" documents really are - the euro symbol isn't exactly irrelevant these days.
Michael Borgwardt
@Michael: I have to say I've yet to meet an 8859-15 document in the wild. I think it came about a bit too late to see widespread uptake: everyone who cared about standard charsets was already headed towards UTF-8, and everyone else stuck with cp1252.
bobince
+3  A: 

You can use ICU4J's CharsetDetector.
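Something along these lines (just a sketch, assuming ICU4J is on the classpath and that the input/output paths come in as command-line arguments):

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DetectAndConvert {
    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));

        // Ask ICU4J for its best statistical guess at the charset.
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        CharsetMatch match = detector.detect();

        System.out.println("Detected " + match.getName()
                + " (confidence " + match.getConfidence() + "/100)");

        // Re-encode the document as UTF-8 using the detected charset.
        String text = new String(bytes, Charset.forName(match.getName()));
        Files.write(Paths.get(args[1]), text.getBytes(Charset.forName("UTF-8")));
    }
}
```

The confidence score is worth logging so you can review the low-confidence files by hand.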

axtavt
A: 

In general, there is no way to tell. The byte sequence 63 61 66 C3 A9 is equally valid as "cafÃ©" in windows-1252, "caf├⌐" in IBM437, or "café" in UTF-8. The last is statistically more likely, though.

If you don't want to deal with statistical methods, an approach that works much of the time is to assume that anything that looks like UTF-8 is, and that anything else is in windows-1252.

Or if UTF-16 is a possibility, look for FE FF or FF FE at the beginning of the file.
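A quick sketch of that BOM check in Java (helper name and 0xEF BB BF handling for UTF-8-with-BOM are my additions):

```java
public class BomSniffer {
    // Returns a charset name if a known byte-order mark is present,
    // or null if there is no BOM to go on.
    static String sniffBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";     // EF BB BF: UTF-8 with BOM
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";  // FE FF: big-endian UTF-16
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";  // FF FE: little-endian UTF-16
        }
        return null;
    }
}
```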

dan04