We need to extract the contents of RTF documents as plain text.

We were using RTFEditorKit for this, but many of the RTF documents we need to handle contain \headerf or other header fields, and RTFEditorKit doesn't parse these (it silently ignores them).

Is there another lightweight way to convert these documents to plain text?

A: 

The RTF format is pretty simple; it shouldn't take long to write your own parser. Otherwise, just copy the source code from the JDK and add support for the missing elements (I say copy because, in my experience, many useful JDK classes can't be extended).

[EDIT] To keep this from becoming a maintenance nightmare, copy the sources into a separate project in your VCS. Tag the version accordingly (so you can easily find it again when the next Java release arrives).

Then create a second project that depends on the first. Branch the first project and make the small changes you need in order to extend the original classes. Keep these changes small: for example, make methods and fields public/protected and remove final. This keeps the changes easy to track, since you only modify lines in place and never add or remove any.

Merging with the next version will then be easy. All the heavy lifting must be done in your own (second) project.
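
To give a feel for how small the core of a hand-written parser can be, here is a rough tokenizer sketch. It is written in C++ only to match the other code in this thread; the same loop ports to Java directly. The names RtfToken, RtfTokenKind and tokenizeRtf are made up for this example, and it ignores things a complete parser would need (\bin data, \uN Unicode escapes, code pages).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token model for this sketch; not taken from any RTF library.
enum class RtfTokenKind { GroupOpen, GroupClose, ControlWord, ControlSymbol, Text };

struct RtfToken {
    RtfTokenKind kind;
    std::string value;  // control word name, symbol character, or text run
    std::string param;  // numeric parameter ("24" in \fs24) or hex digits of \'hh
};

static bool isAlphaChar(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
static bool isDigitChar(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

std::vector<RtfToken> tokenizeRtf(const std::string& rtf) {
    std::vector<RtfToken> tokens;
    std::size_t i = 0;
    const std::size_t n = rtf.size();
    while (i < n) {
        const char c = rtf[i];
        if (c == '{') {
            tokens.push_back({RtfTokenKind::GroupOpen, "{", ""});
            ++i;
        } else if (c == '}') {
            tokens.push_back({RtfTokenKind::GroupClose, "}", ""});
            ++i;
        } else if (c == '\r' || c == '\n') {
            ++i;  // raw line breaks in the source are not content
        } else if (c == '\\') {
            ++i;
            if (i >= n) break;  // trailing backslash: nothing left to read
            if (isAlphaChar(rtf[i])) {
                std::string name, param;
                while (i < n && isAlphaChar(rtf[i])) name += rtf[i++];
                // optional numeric parameter, possibly negative
                if (i + 1 < n && rtf[i] == '-' && isDigitChar(rtf[i + 1])) param += rtf[i++];
                while (i < n && isDigitChar(rtf[i])) param += rtf[i++];
                if (i < n && rtf[i] == ' ') ++i;  // one delimiting space belongs to the control word
                tokens.push_back({RtfTokenKind::ControlWord, name, param});
            } else {
                const char symbol = rtf[i++];
                std::string param;
                if (symbol == '\'') {  // \'hh: two hex digits follow
                    param = rtf.substr(i, 2);
                    i += param.size();
                }
                tokens.push_back({RtfTokenKind::ControlSymbol, std::string(1, symbol), param});
            }
        } else {
            std::string text;
            while (i < n && rtf[i] != '{' && rtf[i] != '}' && rtf[i] != '\\' &&
                   rtf[i] != '\r' && rtf[i] != '\n') {
                text += rtf[i++];
            }
            tokens.push_back({RtfTokenKind::Text, text, ""});
        }
    }
    return tokens;
}
```

With tokens like these, a header such as {\headerf Title} shows up as a group whose first token is the control word headerf, so the code that walks the tokens can decide for itself whether to keep or drop that group's text.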

Aaron Digulla
pvgoddijn
A: 

This could be part of your solution: a (C++) method that computes the length of the plain text. Instead of incrementing the counter, you can append each character to another string.

Short translation: klammern = brackets (German); here it tracks the nesting depth of the curly braces.

int Global::GetRtfPlainLength(const CString str)
{
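int klammern = 0;     // current brace nesting depth
bool command = false; // true while inside a control word such as \par
int length = 0;       // plain-text characters counted so far
int i = 0;            // read position in str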

//TRACE("\n%s\n",str);

while(i < str.GetLength())
{
 switch(str[i])
 {
 case '{': 
  klammern++;
  break;

 case '}': 
  klammern--;
  break;

 case '\\':
  if(!command && i + 1 < str.GetLength()) // only relevant outside a command; guards against a trailing backslash
  {
   switch(str[i + 1])
   {
   case '\'': // special chars: \'XX -> count only 1
    i += 3;
    length++;
    break;
   case '{': // escaped brace or backslash: one literal character
   case '}':
   case '\\':
    length++;
    i++;
    break;
   default: // beginning of a command
    command = true;
    i++;
    break;
   } // switch
  }
  break;

 case ' ': 
  if(klammern == 1) // top-level group only: a space either ends a command or is text
  {
   if(command)
    command = false;
   else 
    length++;
  }
  break;

 case 10: // raw LF and CR in the RTF source are not part of the text
 case 13:
  break;

 default:
  if(!command)
   length++;
  break;
 } // switch

 i++;
} // while

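// corrections: each \line and \par is counted as two characters
// (presumably CR+LF); FindCount is a helper that is not shown here
length += FindCount(str,"\\line ") * 2;
length += FindCount(str,"\\par ") * 2;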

return length;
}
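
As a rough illustration of the "copy instead of count" variant mentioned above, here is a sketch using std::string rather than CString. It is not a line-by-line port of the function above: the name rtfToPlainText is made up, control words are ended by any non-letter instead of only by a space, and it assumes that groups starting with \fonttbl, \colortbl, \stylesheet, \info or \* hold no visible text. Bytes from \'XX are copied as-is (no code page conversion), and \uN escapes are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>

// Sketch only. Assumption: groups starting with \fonttbl, \colortbl,
// \stylesheet, \info or \* contain no visible text and are skipped.
std::string rtfToPlainText(const std::string& rtf) {
    std::string out;
    int depth = 0;      // brace nesting depth (the role of "klammern")
    int skipFrom = -1;  // depth of the group being skipped, -1 if none
    std::size_t i = 0;
    const std::size_t n = rtf.size();
    while (i < n) {
        const char c = rtf[i++];
        const bool skipping = skipFrom >= 0;
        if (c == '{') {
            ++depth;
        } else if (c == '}') {
            if (depth == skipFrom) skipFrom = -1;  // the skipped group ends here
            if (depth > 0) --depth;
        } else if (c == '\r' || c == '\n') {
            // raw line breaks in the RTF source are not text
        } else if (c != '\\') {
            if (!skipping) out += c;  // ordinary character: copy instead of count
        } else if (i < n && std::isalpha(static_cast<unsigned char>(rtf[i]))) {
            std::string word;
            while (i < n && std::isalpha(static_cast<unsigned char>(rtf[i]))) word += rtf[i++];
            // skip an optional (possibly negative) numeric parameter
            if (i + 1 < n && rtf[i] == '-' &&
                std::isdigit(static_cast<unsigned char>(rtf[i + 1]))) ++i;
            while (i < n && std::isdigit(static_cast<unsigned char>(rtf[i]))) ++i;
            if (i < n && rtf[i] == ' ') ++i;  // one delimiting space belongs to the command
            if (word == "fonttbl" || word == "colortbl" ||
                word == "stylesheet" || word == "info") {
                if (!skipping) skipFrom = depth;
            } else if (!skipping && (word == "par" || word == "line")) {
                out += "\r\n";  // two characters, matching the correction above
            } else if (!skipping && word == "tab") {
                out += '\t';
            }
        } else if (i < n) {
            const char symbol = rtf[i++];
            if (symbol == '\'') {  // \'XX: one byte given as two hex digits
                const std::string hex = rtf.substr(i, 2);
                i += hex.size();
                if (!skipping) out += static_cast<char>(std::strtol(hex.c_str(), nullptr, 16));
            } else if (symbol == '*') {
                if (!skipping) skipFrom = depth;
            } else if (!skipping && (symbol == '{' || symbol == '}' || symbol == '\\')) {
                out += symbol;  // escaped brace or backslash
            }
        }
    }
    return out;
}
```

Because \headerf is not in the skip list, the text of a {\headerf ...} group ends up in the output, directly followed by the body text. If headers should be separated from the body or dropped, that decision would go where the control word is recognized.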

HTH a little.

dwo