




I was looking for a way to extract the plain text from an RTF string and I found the following regex:


However, the resulting string still contains two stray closing curly braces "}":

Before: {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }

After: } can u send me info for the call pls }

Any thoughts on how to improve the regex?

Edit: A more complicated string such as this one does not work: {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\test\\myapp\\Apps\\\{3423234-283B-43d2-BCE6-A324B84CC70E\}\par }

+2  A: 

According to RegexPal, the two }'s are the ones bolded below:

{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }

I was able to fix the first curly brace by adding a plus sign to the regex:

     plus sign added here

And to fix the curly brace at the end, I did this:

         this checks if there is a curly brace at the end

I don't know the RTF format very well so this might not work in all cases, but it works on your example...

+2  A: 

I've used this before and it worked for me:


You will probably want to trim the ends of the result to get rid of the extra spaces left over.

John Chuckran
+1  A: 

In RTF, { and } mark a group. Groups can be nested. \ marks the beginning of a control word. Control words end with either a space or a non-alphabetic character. A control word can be followed by a numeric parameter, with no delimiter in between. Some control words also take text parameters, separated by ';'. Those control words are usually in their own groups.
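To make these rules concrete, here is a minimal Python sketch that splits a single control word into its name and optional numeric parameter. The helper name `parse_control_word` is hypothetical, and the sketch only handles control words made of letters; control symbols such as `\*` or `\'xx` are out of scope.

```python
import re

# A control word: backslash, letters, optional signed number,
# then at most one space acting as the delimiter.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)?[ ]?")

def parse_control_word(text, pos=0):
    """Return (name, parameter, end) for the control word at pos, or None."""
    match = CONTROL_WORD.match(text, pos)
    if match is None:
        return None
    name, param = match.group(1), match.group(2)
    return name, (int(param) if param is not None else None), match.end()

print(parse_control_word(r"\fs20 can u send"))   # ('fs', 20, 6)
print(parse_control_word(r"\pard\tx720"))        # ('pard', None, 5)
```

The returned `end` index includes the delimiting space, which is why the space after `\fs20` does not show up in the text that follows.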

I think I have managed to make a pattern that takes care of most of the cases.

\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?

It leaves a few spaces behind when run on your example, though.
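For reference, a minimal sketch of applying the pattern above in Python with `re.sub` and trimming the leftover spaces. The function name `strip_rtf_regex` is made up for this example; the sample input is the "Before" string from the question.

```python
import re

# Groups that start with a control word, stray braces, and plain control words.
RTF_PATTERN = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")

def strip_rtf_regex(rtf):
    """Remove groups, braces and control words, then trim leftover spaces."""
    return RTF_PATTERN.sub("", rtf).strip()

sample = (r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
          r"{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} "
          r"{\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit;}"
          r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }")

print(strip_rtf_regex(sample))  # can u send me info for the call pls
```

Escaped characters are not handled by this approach: in the second example from the question, `\\SOFTWARE` would be consumed as if `\SOFTWARE` were a control word.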

Going through (some of) the RTF specification, I see that there are a lot of pitfalls for pure regex-based strippers. The most obvious one is that some groups should be ignored (headers, footers, etc.), while others should be rendered (formatting).

I have written a Python script that should work better than my regex above:

import re

def striprtf(text):
    pattern = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)
    # Control words which specify a "destination".
    # NOTE: the original list is missing from this post. The entries below are
    # only a small illustrative subset of the destinations in the RTF spec.
    destinations = frozenset((
        'colortbl', 'fonttbl', 'footer', 'footnote', 'header', 'info',
        'pict', 'stylesheet',
    ))
    # Translation of some special characters.
    specialchars = {
        'par': '\n',
        'sect': '\n\n',
        'page': '\n\n',
        'line': '\n',
        'tab': '\t',
        'emdash': '\u2014',
        'endash': '\u2013',
        'emspace': '\u2003',
        'enspace': '\u2002',
        'qmspace': '\u2005',
        'bullet': '\u2022',
        'lquote': '\u2018',
        'rquote': '\u2019',
        'ldblquote': '\u201C',
        'rdblquote': '\u201D',
    }
    stack = []
    ignorable = False       # Whether this group (and all inside it) are "ignorable".
    ucskip = 1              # Number of ASCII characters to skip after a unicode character.
    curskip = 0             # Number of ASCII characters left to skip.
    out = []                # Output buffer.
    for match in pattern.finditer(text):
        word, arg, hexcode, char, brace, tchar = match.groups()
        if brace:
            curskip = 0
            if brace == '{':
                # Push state
                stack.append((ucskip, ignorable))
            elif brace == '}':
                # Pop state
                ucskip, ignorable = stack.pop()
        elif char:  # \x (not a letter)
            curskip = 0
            if char == '~':
                if not ignorable:
                    out.append('\xA0')  # \~ is a non-breaking space
            elif char in '{}\\':
                if not ignorable:
                    out.append(char)    # escaped literal brace or backslash
            elif char == '*':
                ignorable = True
        elif word:  # \foo
            curskip = 0
            if word in destinations:
                ignorable = True
            elif ignorable:
                pass
            elif word in specialchars:
                out.append(specialchars[word])
            elif word == 'uc':
                ucskip = int(arg)
            elif word == 'u':
                c = int(arg)
                if c < 0:
                    c += 0x10000        # \u parameters are signed 16-bit values
                out.append(chr(c))      # Python 3: chr() covers all code points
                curskip = ucskip
        elif hexcode:  # \'xx
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                out.append(chr(int(hexcode, 16)))
        elif tchar:
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                out.append(tchar)
    return ''.join(out)

It works by parsing the RTF code and skipping any groups that have a "destination" specified, as well as all "ignorable" groups ({\*...}). I also added handling of some special characters.

There are lots of features missing to make this a full parser, but it should be enough for simple documents.
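The group-skipping idea described above can also be looked at in isolation. The following is a pared-down sketch, not the script itself: it tracks only the `ignorable` flag on a stack, uses a deliberately tiny, hypothetical destination set, assumes balanced braces, and ignores escapes, `\u` and `\'xx`.

```python
import re

# Tokens: control word (optional parameter), the \* symbol, a brace, or any other character.
TOKEN = re.compile(r"\\([a-z]+)(?:-?\d+)?[ ]?|(\\\*)|([{}])|(.)", re.I | re.S)

# Illustrative subset only; real RTF defines many more destinations.
DESTINATIONS = frozenset(("fonttbl", "colortbl", "header", "footer"))

def visible_text(rtf):
    stack = []          # saved 'ignorable' flags of the enclosing groups
    ignorable = False   # True while inside a group that should not be rendered
    out = []
    for word, star, brace, char in (m.groups() for m in TOKEN.finditer(rtf)):
        if brace == "{":
            stack.append(ignorable)      # entering a group: remember outer state
        elif brace == "}":
            ignorable = stack.pop()      # leaving a group: restore outer state
        elif star or word in DESTINATIONS:
            ignorable = True             # {\*...} or a destination: skip this group
        elif char and not ignorable:
            out.append(char)             # ordinary text in a rendered group
    return "".join(out)

print(visible_text(r"{\rtf1{\fonttbl{\f0 Arial;}}{\*\generator X;}{\b bold} text}"))  # bold text
```

Because the flag is saved on `{` and restored on `}`, everything nested inside a skipped group stays skipped, while a formatting group such as `{\b ...}` is still rendered.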


None of the answers were sufficient, so my solution was to use the RichTextBox control (yes, even in a non-Winform app) to extract text from RTF


The Oct 9 suggestion did the trick for me, but I replaced \ with \\ for it to work in Drupal/PHP.

+1  A: 
    FareRule = Encoding.ASCII.GetString(FareRuleInfoRS.Data);
    System.Windows.Forms.RichTextBox rtf = new System.Windows.Forms.RichTextBox();
    rtf.Rtf = FareRule;
    FareRule = rtf.Text;

So far, we haven't found a good answer to this either, other than using a RichTextBox control:

    /// <summary>
    /// Strip RichTextFormat from the string
    /// </summary>
    /// <param name="rtfString">The string to strip RTF from</param>
    /// <returns>The string without RTF</returns>
    public static string StripRTF(string rtfString)
    {
        string result = rtfString;

        if (IsRichText(rtfString))
        {
            // Put body into a RichTextBox so we can strip RTF
            using (System.Windows.Forms.RichTextBox rtfTemp = new System.Windows.Forms.RichTextBox())
            {
                rtfTemp.Rtf = rtfString;
                result = rtfTemp.Text;
            }
        }
        else
        {
            // Not RTF: return the input unchanged
            result = rtfString;
        }

        return result;
    }

    /// <summary>
    /// Checks testString for RichTextFormat
    /// </summary>
    /// <param name="testString">The string to check</param>
    /// <returns>True if testString is in RichTextFormat</returns>
    public static bool IsRichText(string testString)
    {
        // The second half of this condition is missing from the post; checking for
        // the "{\rtf" header that RTF documents start with is an assumption.
        if ((testString != null) &&
            testString.TrimStart().StartsWith("{\\rtf"))
        {
            return true;
        }

        return false;
    }

Edit: Added IsRichText method.