Ideally, I'd like a module or library that doesn't require superuser access to install; I have limited privileges in my working environment.
Have you checked out pyrtf-ng?
Update: The parsing functionality is available if you do a Subversion checkout, but I'm not sure how full-featured it is. (Look in the rtfng.parser.base module.)
OpenOffice has an RTF reader. You can script OpenOffice from Python; see here for more info.
You could probably use the magic COM object on Windows to read anything that smells like an MS binary format. I wouldn't recommend that, though.
Actually, parsing the raw data probably won't be very hard; see this example written in .bat/QBasic.
DocFrac is a free, open-source converter between RTF, HTML and text. Windows, Linux, ActiveX and DLL versions are available. It will probably be pretty easy to wrap it up in Python.
RTF::TEXT::Converter - Perl extension for converting RTF into text (in case you have problems with DocFrac).
A discussion on this subject on python mailing list.
Official Rich Text Format (RTF) Specifications, version 1.7, by Microsoft.
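To make the "wrap it up in Python" idea a bit more concrete, here is a minimal sketch of calling an external converter through subprocess. The helper name convert_with_cli is made up for illustration, and the sketch assumes the tool prints its result to standard output. The actual DocFrac command line is deliberately not shown, because I'm not going to guess its options; pass whatever argument list its documentation gives you.

```python
import subprocess


def convert_with_cli(argv, timeout=60):
    """Run an external converter and return what it prints to stdout.

    argv is the full command as a list, e.g. [converter, option, input_path].
    ASSUMPTION: the converter writes its output to stdout. If it writes to a
    file instead, read that file after the call returns.
    """
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    # Decode leniently; the real output encoding depends on the tool.
    return completed.stdout.decode("utf-8", errors="replace")
```

Since this only needs the standard library and a converter binary you can run from your home directory, it doesn't require superuser access.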
Good luck (with the limited privileges in your working environment).
Hi! I ran into the same thing and was trying to code it myself. It's not that easy, but here is what I had when I decided to go for a command-line app instead. It's Ruby, but you can adapt it to Python very easily. There is some header garbage to clean up, but you can more or less see the idea.
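f = File.open('r.rtf', 'r')
b = 0      # current brace nesting depth
p = false  # true while we are inside a control word
str = ''
begin
  while (char = f.readchar)
    if char.chr == '{'
      b += 1
      next
    end
    if char.chr == '}'
      b -= 1
      next
    end
    if char.chr == '\\'
      p = true
      next
    end
    # whitespace ends a control word (double quotes so that \n, \t, \r are real escapes)
    if p == true && (char.chr == ' ' or char.chr == "\n" or char.chr == "\t" or char.chr == "\r")
      p = false
      next
    end
    if p == true && (char.chr == '\'')
      # this is the source of my headaches. you need to read the code page from the header and encode this.
      p = false
      str << '#'
      next
    end
    next if b > 2  # skip deeply nested groups (font tables and the like)
    next if p
    str << char.chr
  end
rescue EOFError
  # readchar raises EOFError at the end of the file
end
f.close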
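For what it's worth, here is a rough Python sketch of the same brace-counting idea, with one attempt at the code page headache: it reads \ansicpgN from the header and uses that code page to decode \'hh escapes. Everything here is an assumption on my part rather than a complete RTF parser: the name rtf_to_text, the hand-picked set of skipped destinations, and the cp1252 fallback are all illustrative, and \uN Unicode escapes are not handled.

```python
import codecs
import re

# One token per match: control word, \'hh escape, control symbol, brace,
# raw line break (ignored in RTF), or a literal character.
TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"   # control word, optional number, optional space
    r"|\\'([0-9a-fA-F]{2})"      # hex-escaped byte
    r"|\\([^a-zA-Z])"            # control symbol such as \* \{ \} \\
    r"|([{}])"                   # group start / end
    r"|[\r\n]+"                  # raw line breaks carry no meaning
    r"|(.)",                     # anything else is text
    re.DOTALL,
)

# ASSUMPTION: a small, hand-picked subset of non-text destinations.
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}
BREAKS = {"par": "\n", "line": "\n", "tab": "\t"}


def rtf_to_text(data, default_codec="cp1252"):
    """Very rough RTF-to-text conversion; a sketch, not a full parser."""
    text = data.decode("latin-1") if isinstance(data, bytes) else data
    codec = default_codec
    out = []
    stack = []        # "skipping" state saved at each open brace
    skipping = False  # true inside groups whose content is not body text
    for match in TOKEN.finditer(text):
        word, param, hexbyte, symbol, brace, char = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif word:
            if word == "ansicpg" and param:
                try:
                    codecs.lookup("cp" + param)
                    codec = "cp" + param
                except LookupError:
                    pass  # unknown code page: keep the fallback
            elif word in SKIP_DESTINATIONS:
                skipping = True
            elif word in BREAKS and not skipping:
                out.append(BREAKS[word])
        elif symbol:
            if symbol == "*":
                skipping = True  # {\*...} marks an ignorable destination
            elif symbol in "\\{}" and not skipping:
                out.append(symbol)  # escaped literal character
        elif hexbyte:
            if not skipping:
                byte = bytes([int(hexbyte, 16)])
                out.append(byte.decode(codec, errors="replace"))
        elif char and not skipping:
            out.append(char)
    return "".join(out)
```

You would call it as rtf_to_text(open('r.rtf', 'rb').read()). Saving and restoring the skip flag per group replaces the "b>2" depth check, so the font table and similar header garbage are dropped no matter how deeply they are nested.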
I've been working on a library called Pyth, which can do this:
http://pypi.python.org/pypi/pyth/
Converting an RTF file to plaintext looks something like this:
from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter
doc = Rtf15Reader.read(open('sample.rtf', 'rb'))
print(PlaintextWriter.write(doc).getvalue())
Pyth can also generate RTF files, read and write XHTML, generate documents from Python markup a la Nevow's stan, and has limited experimental support for LaTeX and PDF output. Its RTF support is pretty robust -- we use it in production to read RTF files generated by various versions of Word, OpenOffice, Mac TextEdit, EIOffice, and others.