Is anybody familiar with the RTF document format and with parsing it using any Java libraries? The standard way people have done this is with the RTFEditorKit in the JDK Swing API:

Swing RTFEditorKit API

but it isn't that accurate when it comes to parsing RTF documents. In fact there's a comment in the API:

The RTF support was not written by the Swing team. In the future we hope to improve the support provided.

I don't think I'm going to wait for this to happen :)
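For reference, the RTFEditorKit route looks roughly like the sketch below. It only pulls the plain text out of a document. The class name `RtfTextExtractor` and the sample RTF string are made up for illustration; the calls themselves (`createDefaultDocument`, `read`, `getText`) are the standard Swing text API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class, not part of the JDK.
public class RtfTextExtractor {

    // Reads RTF from the stream into a Swing Document and returns its plain text.
    public static String extractText(InputStream in)
            throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(in, doc, 0); // 0 = insert at the start of the empty document
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Tiny hand-written sample; real-world files are where the kit tends to struggle.
        String rtf = "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}"
                + "\\f0\\pard Hello from RTF.\\par}";
        try (InputStream in = new ByteArrayInputStream(
                rtf.getBytes(StandardCharsets.US_ASCII))) {
            System.out.println(extractText(in).trim());
        }
    }
}
```

This is fine for simple files, but formatting, tables and less common control words are where the accuracy problems show up.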

The other approach taken is to define a grammar using JavaCC and generate a parser. This works better, but I'm having trouble finding a complete grammar. I've tried:

PMD Applied JavaCC Grammar

which is OK, and the following, which is the best so far:

Koders RTFParserDelegate and ETranslate Grammar

There are various implementations of the ETranslate grammar about (I know the Nutch API may use this). Does anybody know which is the most accurate grammar or whether there is a better approach to this?

I could start ploughing through the JavaCC docs to understand the .jj files and test it against the RTF files... this is my current approach, but it's taking a while... any help would be appreciated

A: 

Presumably, the source of OpenOffice contains what you're looking for.

QuickRecipesOnSymbianOS
I've already looked at OpenOffice and at submitting documents to it with JODExtractor. It's a good way of parsing the documents, but a rather heavyweight solution, since you need a server with the X libraries installed, etc. I haven't ruled it out yet and am still investigating, but I'm looking at more "lightweight" solutions.
Jon