My Java program extracts text from RTF files using the RTFEditorKit. Some of the RTF files contain Cyrillic characters (Russian), and depending on the RTF version, the extracted text is either okay or gibberish. When it's gibberish, I can use this to get normal text:

String text = ... // extracted text

String decodedText = new String(text.getBytes("ISO-8859-1"), "cp1251");

Now the problem is that I couldn't find a way to automatically detect the encoding of the file, i.e. whether the extracted text must be decoded or not. Does anybody know how to do this? Thanks in advance!

EDIT: In the first lines of the RTF files I see something that looks like an encoding:

  • Files where I get gibberish: {\rtf1\ansi\ansicpg1251\deff0\deflang1049
  • Files with okay text: {\rtf1\ansi\ansicpg1251\deff0
+1  A: 

I don't believe Java has anything within the standard libraries to do this.

Check out the ICU component. It has a Java variant and you can use the CharsetDetector to get the document encoding.

Jeff Foster
+1  A: 

I don't believe the file itself has an encoding. From the Wikipedia page:

RTF is an 8-bit format. That would limit it to ASCII, but RTF can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and Unicode escapes. In a code page escape, two hexadecimal digits following an apostrophe are used for denoting a character taken from a Windows code page. For example, if control codes specifying Windows-1256 are present, the sequence \'c8 will encode the Arabic letter beh (ب).

If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode codepoint number.

so I suspect you'll have to extract the text yourself and then parse further using the above rules.
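
As a rough illustration of the two escape types quoted above, here is a minimal sketch in Java. The class name RtfEscapeDecoder and the method decode are made up for this example. It assumes a single-byte code page (such as windows-1251) and assumes that the fallback after a Unicode escape is at most one optional space and one '?'; a real RTF reader also has to honour the skip count control word, nested groups, and other control words, none of which is handled here.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RtfEscapeDecoder {
    // Group 1: two hex digits of a code page escape (backslash, apostrophe, hh).
    // Group 2: signed decimal number of a Unicode escape, followed by an
    // optional space and an optional '?' fallback (an assumption of this sketch).
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\'([0-9a-fA-F]{2})|\\\\u(-?\\d{1,5}) ?\\??");

    /** Replaces both escape types in rtfText; codePage must be single-byte. */
    public static String decode(String rtfText, Charset codePage) {
        Matcher m = ESCAPE.matcher(rtfText);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the literal text between the previous escape and this one.
            out.append(rtfText, last, m.start());
            if (m.group(1) != null) {
                // Code page escape: interpret the byte in the given code page.
                byte b = (byte) Integer.parseInt(m.group(1), 16);
                out.append(new String(new byte[] { b }, codePage));
            } else {
                // Unicode escape: the value is a signed 16-bit integer,
                // so mask it to obtain the UTF-16 code unit.
                int value = Integer.parseInt(m.group(2));
                out.append((char) (value & 0xFFFF));
            }
            last = m.end();
        }
        out.append(rtfText, last, rtfText.length());
        return out.toString();
    }
}
```

Calling RtfEscapeDecoder.decode on the raw RTF body with Charset.forName("windows-1251") would turn a run of code page escapes into Cyrillic letters, while plain ASCII text passes through unchanged.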

Brian Agnew
+1  A: 

Internet Explorer uses character frequency counts to guess the language and the encoding of a page. It sort of works. You could do something similar.
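
A very crude version of that idea, sketched in Java for the specific case in the question: text that should be Russian but was decoded as ISO-8859-1 ends up full of characters in the U+00C0 to U+00FF range and contains no Cyrillic at all. The class and method names are hypothetical, and the threshold (more Latin-1 accented letters than Cyrillic letters) is an arbitrary assumption. It will misfire on genuine Western European text, so it only makes sense when the documents are known to be Russian.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeHeuristic {
    /** Guesses whether text is windows-1251 content mis-decoded as ISO-8859-1. */
    public static boolean looksMisdecoded(String text) {
        int latin1High = 0;
        int cyrillic = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x00C0 && c <= 0x00FF) {
                latin1High++;   // accented Latin-1 letters, typical of the gibberish
            } else if (c >= 0x0400 && c <= 0x04FF) {
                cyrillic++;     // real Cyrillic letters, typical of correct text
            }
        }
        return latin1High > 0 && latin1High > cyrillic;
    }

    /** Applies the re-decoding from the question only when the guess says so. */
    public static String repairIfNeeded(String text) {
        if (!looksMisdecoded(text)) {
            return text;
        }
        byte[] raw = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("windows-1251"));
    }
}
```

Passing every extracted string through MojibakeHeuristic.repairIfNeeded leaves already-correct Russian text alone and re-decodes the rest.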

Hamish Grubijan
+1  A: 

RTF files begin with two control words. The first, \rtf, gives the RTF major version (this is not the revision of the standard, and it is almost always \rtf1). The second specifies the character set, which is one of \ansi (usual), \mac, \pc, or \pca (almost never encountered). Immediately after this, it is possible to specify a code page with \ansicpg (for example \ansicpg1251 for Windows-1251), which modifies the default interpretation of characters.
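
To read that declaration programmatically, something along these lines might do; it is only a sketch. The class RtfHeaderCodePage and its method are invented for illustration, and the mapping from the number to a Java charset name ("windows-" plus the number) is an assumption that holds for the 125x Windows code pages but not for code pages in general. Note also that in the question both the good and the bad files declare \ansicpg1251, so the header alone may not explain the difference.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RtfHeaderCodePage {
    // Matches the ansicpg control word followed by its numeric parameter.
    private static final Pattern ANSICPG = Pattern.compile("\\\\ansicpg(\\d{1,5})");

    /** Returns the charset named by the first ansicpg control word, if Java knows it. */
    public static Optional<Charset> declaredCodePage(String rtfSource) {
        Matcher m = ANSICPG.matcher(rtfSource);
        if (!m.find()) {
            return Optional.empty();
        }
        // Assumption: Windows code pages are exposed as "windows-<number>".
        String name = "windows-" + m.group(1);
        return Charset.isSupported(name)
                ? Optional.of(Charset.forName(name))
                : Optional.empty();
    }
}
```

The returned charset can then be used to decode the code page escapes in the body, falling back to a default when the Optional is empty.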

I can't find a whole lot of documentation on this. Try looking at http://msdn.microsoft.com/en-us/library/aa140301(office.10).aspx. The nice folks on the AbiWord developers' mailing list have also spent a lot of time deciphering the various RTF specs.

Charles Stewart