My Java program extracts text from RTF files using RTFEditorKit. Some of the files contain Cyrillic (Russian) characters, and depending on the RTF version, the extracted text is either fine or gibberish. When it is gibberish, I can recover the correct text like this:
String text = ... // extracted text
String decodedText = new String(text.getBytes("ISO-8859-1"), "cp1251");
The problem is that I can't find a way to automatically detect a file's encoding, i.e. whether the extracted text needs to be re-decoded or not. Does anybody know how to do this? Thanks in advance!
EDIT: In the first lines of the RTF files I see what looks like an encoding declaration:
- Files where I get gibberish: {\rtf1\ansi\ansicpg1251\deff0\deflang1049
- Files with okay text: {\rtf1\ansi\ansicpg1251\deff0
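For reference, here is a rough sketch of how those header control words could be read programmatically, in case the decision can be based on them. The class and method names (`RtfHeaderSniffer`, `readHeader`, `ansiCodePage`, `defaultLanguage`) are made up for illustration, and I am assuming the control words show up within the first few hundred bytes of the file. Whether `\deflang` or anything else in the header is actually a reliable signal for "needs re-decoding" is exactly what I don't know.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: only inspects the header, it does not decide anything.
public class RtfHeaderSniffer {

    // Reads up to maxBytes from the start of the file (readNBytes needs Java 11+).
    // The header is treated as ASCII, since the control words themselves are ASCII.
    static String readHeader(Path file, int maxBytes) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return new String(in.readNBytes(maxBytes), StandardCharsets.US_ASCII);
        }
    }

    // Finds the first "\<word><number>" in the header, e.g. \ansicpg1251 -> 1251.
    private static OptionalInt controlWordValue(String header, String word) {
        Matcher m = Pattern.compile("\\\\" + Pattern.quote(word) + "(-?\\d+)").matcher(header);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    // Value of \ansicpgN, if present.
    static OptionalInt ansiCodePage(String header) {
        return controlWordValue(header, "ansicpg");
    }

    // Value of \deflangN, if present.
    static OptionalInt defaultLanguage(String header) {
        return controlWordValue(header, "deflang");
    }
}
```

With the two headers above, both files report code page 1251, and only the gibberish one has a `\deflang` value (1049), so a check based on `\ansicpg` alone would not tell them apart.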