I am trying to split an RTF file into lines (in my code) and I am not quite getting it right, mostly because I am not really grokking the entirety of the RTF format. It seems that lines can be split by \par or \pard or \par\pard or any number of fun combinations.

I am looking for a piece of code, in any language really, that splits the file into lines.

+1  A: 

You could try the RTF specification (version 1.9.1); see External Links on the Wikipedia page, which also links to examples and modules in several programming languages.

That would most likely give you an idea of the control "words" that insert line breaks, so you can split the file into lines using a well-defined set of rules rather than guessing.
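
As a rough illustration of what splitting by well-defined rules could look like, here is a minimal Python sketch of a tokenizer. It assumes the control-word syntax from the spec (a backslash, letters, an optional numeric parameter, and one optional delimiting space), and it assumes that `\par`, `\line`, and a backslash followed by a raw newline end a line, while `\pard` only resets paragraph formatting and is not a break. The names `split_rtf_lines`, `CONTROL_WORD`, and `BREAK_WORDS` are made up for this example, and it ignores group nesting, `\bin` data, and destinations such as the font table.

```python
import re

# Control word: backslash, letters, optional signed number, and one optional
# space that only delimits the word (it is not part of the text).
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)? ?")

# Assumption: these control words end a line; \pard is deliberately absent.
BREAK_WORDS = {"par", "line"}


def split_rtf_lines(rtf):
    """Split raw RTF into pieces at line-ending control words (hypothetical helper)."""
    lines, current = [], []
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            match = CONTROL_WORD.match(rtf, i)
            if match and match.group(1) in BREAK_WORDS:
                lines.append("".join(current))
                current = []
                i = match.end()
            elif match:
                current.append(match.group(0))  # keep other control words as-is
                i = match.end()
            elif rtf[i + 1 : i + 2] in ("\r", "\n"):
                lines.append("".join(current))  # backslash + newline acts like \par
                current = []
                i += 2
            else:
                current.append(rtf[i : i + 2])  # control symbol such as \\ \{ \}
                i += 2
        elif ch in "\r\n":
            i += 1  # raw newlines in the source are only there for readability
        else:
            current.append(ch)
            i += 1
    if current:
        lines.append("".join(current))
    return lines
```

The returned pieces are still RTF markup (braces and formatting control words are kept), so this only answers "where do the lines end", not "what is the plain text".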

Matthew Iselin
+1  A: 

Have you come across O'Reilly's RTF Pocket Guide, by Sean M. Burke?

On page 13, it says

Here are some rules of thumb for putting linebreaks in RTF:

  • Put a newline before every \pard or \par (commands that are explained in the "Paragraphs" section).
  • Put a newline before and after the RTF font-table, stylesheet, and other similar constructs (like the color table, described later).
  • You can put a newline after every Nth space, {, or }. (Alternatively: put a newline after every space, {, or } that's after the 60th column.)

Or were you thinking of extracting the plaintext as lines, and doing it whatever the language of the plaintext?
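
For the writing side, the first rule of thumb quoted above might be sketched in Python like this. It is only an illustration: the name `add_source_newlines` is invented here, and the sketch assumes that a raw newline in RTF source is ignored by readers unless a backslash comes right before it, which is why an escaped backslash (`\\`) is matched first and left alone. It does not handle `\bin` data or the other two rules.

```python
import re

# Match an escaped backslash first so that "\\par" (literal text) is skipped,
# then \par or \pard as a whole control word (not e.g. \pararsid).
_PAR_OR_ESCAPE = re.compile(r"\\\\|(\\pard?(?![A-Za-z]))")


def add_source_newlines(rtf):
    """Insert a readability newline before every \\par or \\pard (hypothetical helper)."""
    return _PAR_OR_ESCAPE.sub(
        lambda m: m.group(0) if m.group(1) is None else "\n" + m.group(1),
        rtf,
    )
```

These newlines only change how the source looks in a text editor, which is why they cannot be relied on when decoding.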

pavium
You are on the right track, but the book you are quoting from is likely talking about encoding in RTF, not decoding.
AngryHacker
Per Microsoft's RTF spec (page 7 - Basic Entities), line breaks are placed for readability.
AngryHacker
Actually, I have only seen Part I: RTF Tutorial of the O'Reilly book - only 25 pages and a bit light on the details. The 296-page RTF 1.9.1 spec scares me, but is obviously a far better reference.
pavium
+1  A: 

I coded up a quick and dirty routine and it seems to work for pretty much anything I've been able to throw at it. It's in VB6, but easily translatable into anything else.

Private Function ParseRTFIntoLines(ByVal strSource As String) As Collection
    Dim colReturn As Collection
    Dim lngPosStart As Long
    Dim strLine As String
    Dim sSplitters(1 To 4) As String
    Dim nIndex As Long

    ' return collection of lines '

    ' The lines can be split by the following '
    ' "\par \pard"                            '
    ' "\par\pard"                             '
    ' "\par "                                 '
    ' "\par"                                  '

    ' Add these splitters longest first so that we do not miss '
    ' any possible split combos, for instance, "\par\pard" is added before "\par" '
    ' because if we look for "\par" first, we will never match "\par\pard" '
    sSplitters(1) = "\par \pard"
    sSplitters(2) = "\par\pard"
    sSplitters(3) = "\par "
    sSplitters(4) = "\par"

    Set colReturn = New Collection

    ' We have to find each variation '
    ' We will look for \par and then evaluate which type of separator is there '

    Do
        lngPosStart = InStr(1, strSource, "\par", vbTextCompare)
        If lngPosStart > 0 Then
            strLine = Left$(strSource, lngPosStart - 1)

            For nIndex = 1 To 4
                If StrComp(sSplitters(nIndex), Mid$(strSource, lngPosStart, Len(sSplitters(nIndex))), vbTextCompare) = 0 Then
                    ' remove the 1st line from strSource '
                    strSource = Mid$(strSource, lngPosStart + Len(sSplitters(nIndex)))

                    ' add to collection '
                    colReturn.Add strLine

                    ' get out of here '
                    Exit For
                End If
            Next
        End If

    Loop While lngPosStart > 0

    ' check to see whether there is a last line '
    If Len(strSource) > 0 Then colReturn.Add strSource

    Set ParseRTFIntoLines = colReturn
End Function
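
For anyone translating this, a rough Python equivalent might look like the sketch below. It is not a line-for-line port: the name `parse_rtf_into_lines` and the `SEPARATOR` pattern are choices made for this example. The pattern only accepts `\par` when no further letter follows, since a plain substring search for `\par` also hits a standalone `\pard` (or any longer control word that starts with `\par`) and would leave the remaining letters at the start of the next line. Unlike the VB6 version, it also swallows the delimiter space after `\pard`. It does not recognise an escaped backslash before the text "par", so treat it as a starting point only.

```python
import re

# "\par" not followed by another letter, an optional delimiter space,
# then an optional "\pard" (again with an optional space).
SEPARATOR = re.compile(r"\\par(?![A-Za-z]) ?(?:\\pard(?![A-Za-z]) ?)?")


def parse_rtf_into_lines(source):
    """Split RTF source at \\par separators; drop a trailing empty piece."""
    lines = SEPARATOR.split(source)
    if lines and lines[-1] == "":
        lines.pop()  # mirrors the VB6 check for a non-empty last line
    return lines
```

A standalone `\pard` stays inside its line, for example `four\pard five` is returned as one piece.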
AngryHacker
Thanks, Mark. I'll have to remember the closing comment.
AngryHacker