I'm trying to convert the lines of an RTF file into a series of Unicode strings and then run a regex match on each line. (I need them to be Unicode so that I can write them to another file.)
However, my regex match isn't working, I think because the lines aren't being converted to Unicode properly.
Here's my code:
import re

usefulLines = []
textData = {}
# the regex pattern for an entry in the db (e.g. SUF 76,22): it's sufficient for us to match on three upper-case characters plus a space
entryPattern = r'^([A-Z]{3})[\s].*$'
f = open('textbase_1a.rtf', 'Ur')
fileLines = f.readlines()
# get the matching line numbers, and store in usefulLines
for i, line in enumerate(fileLines):
    #line = line.decode('utf-16be') # this causes an error: I don't really know what file encoding the RTF file is in...
    line = line.decode('mac_roman')
    print line
    if re.match(entryPattern, line):
        # now retrieve the following lines, all the way up until we get a blank line
        print "match: " + str(i)
        usefulLines.append(i)
At the moment, this prints every line but never prints a "match: ..." line, even though some lines should match. Also, the lines are printed with '/par' at the start, for some reason, and when I write them to an output file they look very strange.
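A minimal sketch of what may be going on, assuming the prefix is really the RTF paragraph control word `\par` (with a backslash): a line that starts with a backslash can never satisfy a pattern anchored on three upper-case letters. The sample line and the helper `strip_leading_controls` are made up for illustration and only handle simple control words at the start of a line. The sketch is written so that it also runs on Python 3.

```python
import re

# Same pattern as in the code above, written as a raw string.
entry_pattern = re.compile(r'^([A-Z]{3})\s.*$')

# Hypothetical helper pattern: one or more RTF control words at the start
# of a line (e.g. \par, \f0, \fs24), each optionally followed by one space.
leading_controls = re.compile(r'^(?:\\[a-z]+-?\d* ?)+')

def strip_leading_controls(line):
    """Remove leading RTF control words from a single line of text."""
    return leading_controls.sub('', line)

# Assumed shape of a line as it appears in the raw RTF file.
raw_line = '\\par SUF 76,22'

# The raw line starts with a backslash, so the anchored pattern fails.
print(entry_pattern.match(raw_line))

# After removing the leading control word, the entry code can be captured.
cleaned = strip_leading_controls(raw_line)
print(entry_pattern.match(cleaned).group(1))
```

This only covers the leading control words. Real RTF also contains groups in braces and escaped characters, so a proper RTF-to-text conversion step would be more robust than a regex like this.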
Part of the problem is that I don't know what encoding to specify. How can I find this out?
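One possible way to check, sketched under the assumption that the file has a standard RTF header: the header normally declares a character set with one of the control words `\ansi`, `\mac`, `\pc` or `\pca`, often followed by a code page in `\ansicpgN`. The function name `declared_charset` and the sample header are hypothetical, not taken from my file.

```python
import re

def declared_charset(header):
    """Return (charset_keyword, code_page) found in an RTF header string.

    Either element is None when the header does not declare it.
    """
    # Character set keyword; the lookahead stops \ansi from matching
    # inside the longer control word \ansicpg.
    keyword = re.search(r'\\(ansi|mac|pca|pc)(?![a-z])', header)
    # Optional code page declaration, e.g. \ansicpg1252.
    code_page = re.search(r'\\ansicpg(\d+)', header)
    return (keyword.group(1) if keyword else None,
            int(code_page.group(1)) if code_page else None)

# Made-up header for illustration. For the real file, the first couple of
# hundred bytes could be read in binary mode and decoded as ASCII.
sample_header = r'{\rtf1\mac\ansicpg10000\cocoartf824'
print(declared_charset(sample_header))
```

If the header says `\mac`, Python's `mac_roman` codec would be the matching choice, and `\ansicpg1252` would correspond to `cp1252`. Note that RTF files are usually plain 7-bit ASCII, with non-ASCII characters written as escapes such as `\'e9`, so decoding the raw lines may change very little on its own.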
If I use entryPattern = '^.*$' instead, then I do get matches.
Can anyone help?