How to parse just the text from a Word Doc using Python?

views:

250

answers:

How to parse just the text from a Word Doc using Python?

When you try opening a MS Word document or for that matter most Windows file formats, you will see gibberish as given below broken intermittently by the actual text. I need to extract the text that goes in and want to ignore the gibberish -- which is something like given below. How do I extract only the text that matters, and ignore rest of the stuff. Please advise.

Here's a sample of open("sample.doc",r").read() of a word doc. Thanks

00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00

In an Interesting news,his is the first time we polled Indian channel community for their preferred memory supplier. Transcend came a close second, was seen to be more popular among class A city based resellers, was also the most recalled memory brand among customers according to resellers. However Transcend channels complained of parallel imports and constant unavailability of the products in grey

x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x

use open("sample.doc","rb").read()

Srinivas Reddy Thatiparthy 2010-02-14 16:02:33

Well, that gives me a long with the content a lot of rubbish, as explained in the question. the ff\xff!\x00\x00\x0 chars...

ramdaz 2010-02-14 16:04:58

Please see this question on SO..http://stackoverflow.com/questions/125222/extracting-text-from-ms-word-files-in-python

Srinivas Reddy Thatiparthy 2010-02-14 16:10:20

There is no generic why to extract information from every file format. You need to know the format to know how to extract the information.

Just wanted to state that first. So what you should look for is libraries and software that can convert/extract the information you want. And as mentioned by Ofir MicroSoft have tools for that for their formats.

But if you can not do this and want to take the chance that there is text visible in the file that you think is interesting to read you could do a normal read and look for sequences of bytes that will build text. Then comes the question, what languages/charset should I support support in my hunt for text. Is it multi-byte text?

The easy start is to loop through the data and look for sequences of [a-zA-z0-9_- ] to find the text. But word is probably multi-byte. So you should scan double byte as one char.

Note: some of the new formats like open office and docx is multiple files in a compressed container. So you need to de-compress the file first, and scan XML documents after the text you looking for.

jpyllman 2010-02-14 16:20:41

+1 A:

doc is a binary format, it's not a markup language or something. Specs: http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx

myfreeweb 2010-02-14 16:23:58

+2 A:

The tool that seems the most viable, particularly if you need an all python solution is OleFileIO.

Regis Frey 2010-02-14 16:26:46

The above is for .DOC files. .DOCX files are simply zipped xml files.

Regis Frey 2010-02-14 16:39:56

Word doc is a compressed format. You need to uncompress it first to get the real data (try open a doc file in a program like winrar, and you'll see it contains multiple files.

It even seems to be XML, so reading the format should not be that hard, although I'm not sure if you get all the data this way.

Ikke 2010-02-14 16:34:12

I had a similar problem, needing to query hundreds of Word documents. I converted the Word files to text files and used normal text parsing tools. Worked well.

keith 2010-02-14 18:59:53

ansaurus

tags:

views:

answers:

How to parse just the text from a Word Doc using Python?

related questions