views:

1332

answers:

2

Hello,

I have approx. 12000 cells in excel containing RTF (including formatting tags). I need to parse them to get to the unformatted text.

This is the example of one of the cells with text:

{\rtf1\ansi\deflang1060\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset238
Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs24\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
\paperw11908\paperh16833\margl1800\margr1800\margt1440\margb1440\headery720\footery720
\deftab720\formshade\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn11908\pghsxn16833\marglsxn1800\margrsxn1800\margtsxn1440\margbsxn1440
\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs24\pard TPR 0160 000\par IPR 0160 000\par OB-R-02-28\par}

And all I really need is this:

TPR 0160 000
IPR 0160 000
OB-R-02-28

The problem with simple looping over the cells and removing unnecessary formatting is, that not everything in those 12000 cells is as straightforward as this is. So I would need to manually inspect many different versions and write several variations; and still at the end there would be a lot of manual work to do.

But if I copy the contents of one cell to empty text document and save it as RTF, then open it with MS Word, it instantly parses the text and I get exactly what I want. Unfortunately it's extremely inconvenient to do so for a 12000 cells.

So I was thinking about VBA macro, to move cell contents to Word, force parsing and then copy the result back to the originating cell. Unfortunately I'm not really sure how to do it.

Does anybody has any idea? Or a different approach? I will be really grateful for a solution or a push in the right direction.

TNX!

A: 

You can try to parse every cell with regular expression and leave only the content you need.

Every RTF control code start with "\" and ends with space, without any additional space between. "{}" are use for grouping. If your text won't contain any, you can just remove them (the same for ";"). So now you stay with your text and some unnecessary words as "Arial", "Normal" etc. You can build the dictionary to remove them also. After some tweaking, you will stay with only the text you need.

Look at http://www.regular-expressions.info/ for more information and great tool to write RegExp's (RegexBuddy - unfortunately it isn't free, but it's worth the money. AFAIR there is also trial).

UPDATE: Of course, I don't encourage you to do it manually for every cell. Just iterate through active range: Refer this thread: SO: About iterating through cells in VBA

Personally, I'll give a try to this idea:

Sub Iterate()
   For Each Cell in ActiveSheet.UsedRange.Cells
      'Do something
   Next
End Sub

And how to use RegExp's in VBA (Excel)?

Refer: Regex functions in Excel and Regex in VBA

Basically you've to use VBScript.RegExp object through COM.

juckobee
Yes, that is a possibility. But I would really like to avoid several passes if possible. This table is infact an export from SQL DB and I will have to do this parsing for quite a few times before the end of the year. I already have a partial solution. I have a working VBA script to remove large portion of RTF formatting, but leaves some (just to be on the safe side). Then I have to Find/Replace very many strange tags and stuff.So, a complete solution would be very handy. Doing manual procedure once is a bit boring. Doing it 5 or even 10 times is absolutely nerve-wrecking.
I didn't mention to do it manually for each cell. Just iterate through all cells and parse every cell with your custom script. I'll update my comment with more thoughts.
juckobee
Oh and about several passes. If you split your task into stages you have to code them in your custom cell parser and run all during one pass! I don't know your data, but I think it'll be enough to have a cunning regular expression so you'll be able to finish it during one stage/pass.
juckobee
+1  A: 

If you did want to go down the route of using Word to parse the text, this function should help you out. As the comments suggest, you'll need a reference to the MS Word Object Library.

Function ParseRTF(strRTF As String) As String
Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library'
Dim f     As Integer       'Variable to store the file I/O number'

'File path for a temporary .rtf file'
Const strFileTemp = "C:\TempFile_ParseRTF.rtf"

'Obtain the next valid file I/O number'
f = FreeFile

'Open the temp file and save the RTF string in it'
Open strFileTemp For Output As #f
    Print #f, strRTF
Close #f

'Open the .rtf file as a Word.Document'
Set wdDoc = GetObject(strFileTemp)

'Read the now parsed text from the Word.Document'
ParseRTF = wdDoc.Range.Text

'Delete the temporary .rtf file'
Kill strFileTemp

'Close the Word connection'
wdDoc.Close False
Set wdDoc = Nothing
End Function

You could call it for each of your 12,000 cells using something similar to this:

Sub ParseAllRange()
Dim rngCell As Range
Dim strRTF  As String

For Each rngCell In Range("A1:A12000")

    'Parse the cell contents'
    strRTF = ParseRTF(CStr(rngCell))

    'Output to the cell one column over'
    rngCell.Offset(0, 1) = strRTF
Next
End Sub

The ParseRTF function takes about a second to run (on my machine at least), so for 12,000 cells this will work out at about three and a half hours.


Having thought about this problem over the weekend, I was sure there was a better (quicker) solution for this.

I remembered the RTF capabilities of the clipboard, and realised that a class could be created that would copy RTF data to the clipboard, paste to a word doc, and output the resulting plain text. The benefit of this solution is that the word doc object would not have to be opened and closed for each rtf string; it could be opened before the loop and closed after.

Below is the code to achieve this. It is a Class module named clsRTFParser.

Private Declare Function GlobalAlloc Lib "kernel32" _
                (ByVal wFlags&, ByVal dwBytes As Long) As Long
Private Declare Function GlobalLock Lib "kernel32" _
                (ByVal hMem As Long) As Long
Private Declare Function GlobalUnlock Lib "kernel32" _
                (ByVal hMem As Long) As Long
Private Declare Function lstrcpy Lib "kernel32" _
                (ByVal lpString1 As Any, ByVal lpString2 As Any) As Long

Private Declare Function OpenClipboard Lib "user32" _
                (ByVal Hwnd As Long) As Long
Private Declare Function EmptyClipboard Lib "user32" () As Long
Private Declare Function RegisterClipboardFormat Lib "user32" Alias _
                "RegisterClipboardFormatA" (ByVal lpString As String) As Long
Private Declare Function SetClipboardData Lib "user32" _
                (ByVal wFormat As Long, ByVal hMem As Long) As Long
Private Declare Function CloseClipboard Lib "user32" () As Long

'---'

Dim wdDoc As Word.Document 'Ref: Microsoft Word 11.0 Object Library'

Private Sub Class_Initialize()
Set wdDoc = New Word.Document
End Sub

Private Sub Class_Terminate()
wdDoc.Close False
Set wdDoc = Nothing
End Sub

'---'

Private Function CopyRTF(strCopyString As String) As Boolean
Dim hGlobalMemory  As Long
Dim lpGlobalMemory As Long
Dim hClipMemory    As Long
Dim lngFormatRTF   As Long

'Allocate and copy string to memory'
hGlobalMemory = GlobalAlloc(&H42, Len(strCopyString) + 1)
lpGlobalMemory = GlobalLock(hGlobalMemory)
lpGlobalMemory = lstrcpy(lpGlobalMemory, strCopyString)

'Unlock the memory and then copy to the clipboard'
If GlobalUnlock(hGlobalMemory) = 0 Then
    If OpenClipboard(0&) <> 0 Then
        Call EmptyClipboard

        'Save the data as Rich Text Format'
        lngFormatRTF = RegisterClipboardFormat("Rich Text Format")
        hClipMemory = SetClipboardData(lngFormatRTF, hGlobalMemory)

        CopyRTF = CBool(CloseClipboard)
    End If
End If
End Function

'---'

Private Function PasteRTF() As String
Dim strOutput As String

'Paste the clipboard data to the wdDoc and read the plain text result'
wdDoc.Range.Paste
strOutput = wdDoc.Range.Text

'Get rid of the new lines at the beginning and end of the document'
strOutput = Left(strOutput, Len(strOutput) - 2)
strOutput = Right(strOutput, Len(strOutput) - 2)

PasteRTF = strOutput
End Function

'---'

Public Function ParseRTF(strRTF As String) As String
If CopyRTF(strRTF) Then
    ParseRTF = PasteRTF
Else
    ParseRTF = "Error in copying to clipboard"
End If
End Function

You could call it for each of your 12,000 cells using something similar to this:

Sub CopyParseAllRange()
Dim rngCell As Range
Dim strRTF  As String

'Create new instance of clsRTFParser'
Dim RTFParser As clsRTFParser
Set RTFParser = New clsRTFParser

For Each rngCell In Range("A1:A12000")

    'Parse the cell contents'
    strRTF = RTFParser.ParseRTF(CStr(rngCell))

    'Output to the cell one column over'
    rngCell.Offset(0, 1) = strRTF
Next
End Sub

I have simulated this using example RTF strings on my machine. For 12,000 cells it took two and a half minutes, a much more reasonable time frame!

Nossidge