I'm trying to convert lines in an RTF file to a series of unicode strings, and then do a regex match on the lines. (I need them to be unicode so that I can output them to another file.)

However, my regex match isn't working - I think because they aren't being converted into unicode properly.

Here's my code:

import re

usefulLines = []
textData = {}

# the regex pattern for an entry in the db (e.g. SUF 76,22): it's sufficient for us to match on three upper-case characters plus a space
entryPattern = r'^([A-Z]{3})\s.*$'

f = open('textbase_1a.rtf', 'Ur')
fileLines = f.readlines()

# get the matching line numbers, and store in usefulLines
for i, line in enumerate(fileLines):
    #line = line.decode('utf-16be') # this causes an error: I don't really know what file encoding the RTF file is in...
    line = line.decode('mac_roman')
    print line
    if re.match(entryPattern, line):
        # now retrieve the following lines, all the way up until we get a blank line
        print "match: " + str(i)
        usefulLines.append(i)

At the moment, this prints all the lines, but never prints "match", even though some lines should match. Also, the lines are being printed with '\par' at the start, for some reason. When I try printing them to an output file, they look very strange.

Part of the problem is that I don't know what encoding to specify. How can I find this out?

If I use entryPattern = '^.*$' then I do get matches.

Can anyone help?

A:

You did not even decode the RTF file. RTFs are not just simple text files. A file containing "äöü", for example, contains this:

{\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}

{\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\f0\fs20\'e4\'f6\'fc\par

}

when opened in a text editor. So the characters "äöü" are encoded as windows-1252 as declared at the beginning of the file (äöü = 0xE4 0xF6 0xFC).
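To make that concrete, here is a minimal Python 3 sketch of reading the declared code page and decoding the \'xx escapes. The helper names (declared_codec, decode_hex_escapes) are made up for illustration, and the sample string is a shortened version of the file above. It assumes a single-byte code page whose number maps directly to a Python "cpNNNN" codec, which will not hold for every RTF file.

```python
import re

# Shortened copy of the sample file above (raw string keeps the backslashes).
rtf = r"{\rtf1\ansi\ansicpg1252\deff0\deflang1031 \pard\f0\fs20\'e4\'f6\'fc\par}"


def declared_codec(rtf_source, default="cp1252"):
    """Guess a Python codec name from the \\ansicpgN control word (assumption:
    the number maps to a codec called cpN; otherwise fall back to `default`)."""
    m = re.search(r"\\ansicpg(\d+)", rtf_source)
    return "cp" + m.group(1) if m else default


def decode_hex_escapes(rtf_source, codec):
    """Replace each \\'xx escape with the character that byte encodes in `codec`.
    Only valid for single-byte code pages."""
    return re.sub(
        r"\\'([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]).decode(codec),
        rtf_source,
    )


codec = declared_codec(rtf)
decoded = decode_hex_escapes(rtf, codec)
print(codec)
print(decoded)
```

This only answers the "which encoding?" part of the question: the control words (\pard, \fs20, \par and so on) are still in the string afterwards.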

For reading RTF you'll first need something that converts RTF to text (already asked here).
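This is also why the pattern never matches: each raw line starts with a control word such as \par, not with three upper-case letters. The rough Python 3 sketch below shows the idea of converting first and matching second. The function rough_rtf_to_lines is a hypothetical helper that only copes with flat, simple RTF (no nested groups, font tables or escaped characters), and the sample text other than "SUF 76,22" is invented. For real files, use a proper RTF-to-text converter instead.

```python
import re

ENTRY_PATTERN = re.compile(r"^([A-Z]{3})\s.*$")

# Invented sample: only the "SUF 76,22" entry format comes from the question.
sample = r"{\rtf1\ansi\pard\f0\fs20 SUF 76,22\par first line of the entry\par \par ABC 12,3\par}"


def rough_rtf_to_lines(rtf_source):
    """Very rough RTF-to-text pass for flat, simple documents only."""
    # Paragraph marks become newlines (\b stops this from matching \pard).
    text = re.sub(r"\\par\b ?", "\n", rtf_source)
    # Drop the remaining control words, with their numeric argument and
    # the single space that terminates them.
    text = re.sub(r"\\[a-z]+-?\d* ?", "", text)
    # Drop group braces.
    text = text.replace("{", "").replace("}", "")
    # Keep non-empty lines, trimmed.
    return [line.strip() for line in text.splitlines() if line.strip()]


lines = rough_rtf_to_lines(sample)
matches = [line for line in lines if ENTRY_PATTERN.match(line)]
print(matches)
```

Once the text has been extracted this way, the original three-letter pattern works unchanged on the resulting lines.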

AndiDog
OK, I didn't know that. Thank you.
AP257