We have a CMS containing several thousand text/html files. It turns out that users have been uploading these files in a variety of character encodings (UTF-8, UTF-8 with BOM, windows-1252, ISO-8859-1).
When these files are read in and written to the response, our CMS's framework forces charset=UTF-8 on the response's Content-Type header.
Because of this, any non-UTF-8 content is displayed to the user with mangled characters (question marks, black diamonds, etc., wherever bytes don't translate cleanly from the file's "native" encoding to UTF-8). There is also no metadata attached to these documents indicating their charset. As far as I know, the only way to tell what charset a file is in is to open it in a text-rendering app (Firefox, Notepad++, etc.) and eyeball whether the content "looks" right.
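The closest I've come to checking this programmatically is a crude sniff that looks for a BOM and tests whether the bytes decode cleanly as UTF-8. A rough sketch (in Python, which isn't what our CMS runs, just to illustrate the idea):

```python
def sniff_encoding(raw: bytes) -> str:
    """Cheap heuristic -- cannot distinguish windows-1252 from ISO-8859-1."""
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"      # UTF-8 BOM is unambiguous
    try:
        raw.decode("utf-8")     # bytes that decode cleanly are almost certainly UTF-8
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"        # some legacy single-byte encoding
```

That still leaves all the legacy single-byte files lumped together as "unknown".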
Does anyone know how to automatically/intelligently convert files of unknown encoding to UTF-8? I've read this can be accomplished with statistical modeling, but that's a bit above my head.
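From what I've read, libraries like Python's chardet or ICU's CharsetDetector do this kind of statistical guessing. Something along these lines is what I'm picturing (a rough sketch only; the cms_export directory is made up for illustration, and I haven't tested how reliable the detection is on short files):

```python
from pathlib import Path
import chardet  # statistical encoding detector

def convert_to_utf8(path: Path) -> None:
    raw = path.read_bytes()
    guess = chardet.detect(raw)              # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
    encoding = guess["encoding"] or "utf-8"  # fall back to UTF-8 if detection fails
    text = raw.decode(encoding, errors="replace")
    text = text.lstrip("\ufeff")             # drop a UTF-8 BOM if one survived decoding
    path.write_text(text, encoding="utf-8")

# "cms_export" is a placeholder for wherever the CMS stores these files
for html_file in Path("cms_export").rglob("*.html"):
    convert_to_utf8(html_file)
```

I'm not sure how much to trust the confidence values, or whether I should flag low-confidence files for manual review, which is partly why I'm asking.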
Thoughts on how to best approach the problem?
Thanks