for working with MS word files in python, there is python win32 extensions, which can be used in windows. How do I do the same in linux? Is there any library?
Take a look at how the doc format works and create word document using PHP in linux. The former is especially useful. Abiword is my recommended tool. There are limitations though:
However, if the document has complicated tables, text boxes, embedded spreadsheets, and so forth, then it might not work as expected. Developing good MS Word filters is a very difficult process, so please bear with us as we work on getting Word documents to open correctly. If you have a Word document which fails to load, please open a Bug and include the document so we can improve the importer.
I'm not sure if you're going to have much lock without using COM. The .doc format is ridiculously complex, and is often called a "memory dump" of Word at the time of saving!
At Swati, that's in HTML, which is fine and dandy, but most word documents aren't so nice!
You could make a subprocess call to antiword. Antiword is a linux commandline utility for dumping text out of a word doc. Works pretty well for simple documents (obviously it loses formatting). It's available through apt, and probably as RPM, or you could compile it yourself.
I know this is an old question, but I was recently trying to find a way to extract text from MS word files, and the best solution by far I found was with wvLib:
http://wvware.sourceforge.net/
After installing the library, using it in Python is pretty easy:
import commands
exe = 'wvText ' + word_file + ' ' + output_txt_file
out = commands.getoutput(exe)
exe = 'cat ' + output_txt_file
out = commands.getoutput(exe)
And that's it. Pretty much, what we're doing is using the commands.getouput function to run a couple of shell scripts, namely wvText (which extracts text from a Word document, and cat to read the file output). After that, the entire text from the Word document will be in the out variable, ready to use.
Hopefully this will help anyone having similar issues in the future.
(Note: I posted this on this question as well, but it seems relevant here, so please excuse the repost.)
Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction. Obviously to use this in a Qt program you'd have to spawn a process for it etc, but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]>//g'*: Remove everything inside tags
grep -v '^[[:space:]]$'*: Remove blank lines
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despit being a bit of an ugly hack ;)
If your intention is to use purely python modules without calling a subprocess, you can use the zipfile python modude.
content = ""
# Load DocX into zipfile
docx = zipfile.ZipFile('/home/whateverdocument.docx')
# Unpack zipfile
unpacked = docx.infolist()
# Find the /word/document.xml file in the package and assign it to variable
for item in unpacked:
if item.orig_filename == 'word/document.xml':
content = docx.read(item.orig_filename)
else:
pass
Your content string however needs to be cleaned up, one way of doing this is:
# Clean the content string from xml tags for better search
fullyclean = []
halfclean = content.split('<')
for item in halfclean:
if '>' in item:
bad_good = item.split('>')
if bad_good[-1] != '':
fullyclean.append(bad_good[-1])
else:
pass
else:
pass
# Assemble a new string with all pure content
content = " ".join(fullyclean)
But there is surely a more elegant way to clean up the string, probably using the re module. Hope this helps.
benjamin's answer is a pretty good one. I have just consolidated...
import zipfile, re
docx = zipfile.ZipFile('/path/to/file/mydocument.docx')
content = docx.read('word/document.xml')
cleaned = re.sub('<(.|\n)*?>','',content)
print cleaned
Use the native Python docx module. Here's how to extract all the text from a doc:
document = opendocx('Hello world.docx')
# This location is where most document content lives
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]
# Extract all text
print getdocumenttext(document)
See http://github.com/mikemaccana/python-docx
Parsing XML with regexs invokes cthulu. Don't do it!
You can load DOC, DOCX, RTF, WordML or HTML into Aspose.Words and then just use Document.GetText().
Aspose.Words is available as a library for .NET and for Java. You can use either one but with a corresponding way of invoking it.