Hi,

I need a Windows program to convert a Word file (.doc) into text, something like "antiword" but for Windows.

I need it because I want to convert Word files into text and index them with Lucene, and I am in a Windows environment :(

Thanks for all your help!!!

+1  A: 

Yes. That program is called MS Word.

Open the file in Word via COM, and save it as text programmatically. On the other hand, is Lucene not able to read Word documents natively?
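As a rough illustration of the COM route, here is a minimal Python sketch. It assumes a Windows machine with MS Word installed and the third-party pywin32 package; the function names `doc_to_text` and `text_path_for` are made up for this example, and the same calls could be driven from VBScript or PHP's COM extension instead.

```python
from pathlib import Path

WD_FORMAT_TEXT = 2  # numeric value of Word's wdFormatText constant


def text_path_for(doc_path):
    """Return the absolute .txt path that sits next to the given .doc file."""
    return Path(doc_path).resolve().with_suffix(".txt")


def doc_to_text(doc_path):
    """Open a .doc in Word via COM and save it as plain text (hypothetical helper)."""
    # Imported here so the module still loads on machines without pywin32.
    import win32com.client

    source = Path(doc_path).resolve()  # Word expects absolute paths
    target = text_path_for(source)

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(str(source))
        try:
            doc.SaveAs(str(target), WD_FORMAT_TEXT)
        finally:
            doc.Close(False)  # close without saving changes to the original
    finally:
        word.Quit()
    return target
```

Calling `doc_to_text("report.doc")` should leave a `report.txt` next to the original, which can then be handed to the indexer.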

Tomalak
Sorry, I am using the Lucene implementation in the PHP Zend Framework. Any ideas? Thanks :)
noobplusplus
@anoob: I would start by looking for a Lucene add-in that lets you index Word documents natively. I'm sure they exist. Once you have one, install it, and your problem is gone. No need to develop anything in PHP. If you find none, since you are on Windows, install a copy of MS Word and work with it via COM, as suggested.
Tomalak
@tomalak: http://lucene.apache.org/tika/ for example :-)
plutext
A: 

If you really need a program, here's one. I have not tried it, but you can give it a shot. Otherwise, you can just use COM / VBScript.

ghostdog74
A: 

Using POI (http://poi.apache.org/) you should be able to index the old binary DOC formats. Relevant code snippets can be found on http://kalanir.blogspot.com/2008/08/how-to-index-microsoft-format-documents.html.

As for DOCX, since that's basically a ZIP file containing a bunch of XML and resource files, it should be relatively easy to find the XML file with the actual text (I think it's word/document.xml) and index the text it contains after stripping out the XML markup.
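The DOCX idea can be sketched in a few lines of Python using only the standard library. This is an illustration, not a complete converter: it assumes the main text lives in word/document.xml, keeps only the text runs (`w:t` elements) of each paragraph (`w:p`), and ignores headers, footnotes and the like. The function name `docx_to_text` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path_or_file):
    """Return the plain text of a DOCX file, one line per paragraph."""
    # A DOCX file is a ZIP archive; the body text is in word/document.xml.
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join the text runs of this paragraph, dropping all markup.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The resulting string can be passed to whatever indexer you use. Documents with text boxes nest paragraphs inside paragraphs, so this simple loop may repeat such text.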

wimvds
A: 

You can use the OpenXML SDK to easily strip the text out of DOCX files. Does not work with .doc though--you probably need to use MS Word and COM for that.

jle