I'm trying to read a .doc file into a database so that I can index its contents. Is there an easy way for PHP on Linux to read .doc files? Failing that, is it possible to convert .doc files to RTF, PDF or some other 'open' format that is easy to read?

Note, I am not interested in .docx files.

A: 

Microsoft published the specification for the .DOC format a while ago.

J D OConal
+1  A: 

You can use antiword or AbiWord to pull the text out and feed it to your favorite full-text indexer. AbiWord is probably more effective for your purposes because it can convert into RTF, PDF and other formats (yes, it's a GUI word processor, but it also supports command-line usage).
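As a rough illustration of the antiword route, here is a minimal sketch in Python (the same call could be made from PHP with exec or shell_exec). It relies on antiword printing the extracted text to standard output when given a .doc path. The function names, the `runner` parameter and the choice to decode the output as UTF-8 are assumptions made for this example; the actual output encoding depends on how antiword is configured.

```python
import subprocess


def antiword_command(doc_path):
    """Build the argument list; antiword prints the extracted text to stdout."""
    return ["antiword", doc_path]


def extract_text(doc_path, runner=subprocess.run):
    """Run antiword on doc_path and return the extracted plain text.

    `runner` is injectable (hypothetical helper parameter) so the function
    can be exercised without antiword installed.
    """
    # Passing a list (no shell) avoids quoting problems with odd file names.
    result = runner(antiword_command(doc_path), capture_output=True, check=True)
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string can then be handed to whatever full-text indexer is in use.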

Nicholas Riley
A: 

It's not PHP, but there is a doc2rtf utility out there that you can use. From there you can just open the RTF file as a text document, write some string replacement routines to remove the RTF formatting codes, and have a glob of text suitable for indexing.
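To give an idea of what those replacement routines might look like, here is a deliberately naive sketch in Python (the same logic ports to PHP's preg functions). It drops control words, skips a few header groups such as the font table, and keeps paragraph breaks. The list of skipped groups and the cp1252 decoding of \'hh escapes are assumptions for this example, and it does not handle \u Unicode escapes or embedded binary data, so treat it as a starting point rather than a real RTF parser.

```python
import re

_TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"   # control word, optional number, optional delimiter space
    r"|\\'([0-9a-fA-F]{2})"      # hex-escaped byte, e.g. \'e9
    r"|\\(.)"                    # control symbol such as \\ \{ \} \*
    r"|([{}])"                   # group open / close
    r"|([^\\{}]+)",              # run of plain text
    re.DOTALL,
)

# Assumption: groups that hold metadata rather than body text.
_SKIPPED_GROUPS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}
_SPECIAL = {"par": "\n", "line": "\n", "tab": "\t"}


def strip_rtf(rtf):
    """Return the visible text of an RTF string, dropping formatting codes."""
    out = []
    stack = []        # "skipping" state saved at each enclosing group
    skipping = False
    for match in _TOKEN.finditer(rtf):
        word, _param, hexbyte, symbol, brace, text = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif word:
            if word in _SKIPPED_GROUPS:
                skipping = True
            elif not skipping and word in _SPECIAL:
                out.append(_SPECIAL[word])
        elif symbol:
            if symbol == "*":          # \* marks an ignorable destination
                skipping = True
            elif not skipping and symbol in "\\{}":
                out.append(symbol)     # escaped literal character
        elif hexbyte:
            if not skipping:
                out.append(bytes.fromhex(hexbyte).decode("cp1252", errors="replace"))
        elif text and not skipping:
            # Raw line breaks in an RTF file are not part of the text.
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out)
```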

Alternatively, you can get OpenOffice, open the MS Word documents and just use File > Save As > RTF.

Nathan Strong
+2  A: 

There seems to be a library for accessing Word documents, but I'm not sure how to access it from PHP. I think the best solution would be to call its wv command from PHP.

Swaroop C H
This seems ideal. I need to test it on some docs, but so far the wvText function seems to do what I need.
Conor
A: 

DOC files are stored in a binary format, and there aren't any classes written purely in PHP for dealing with them.

RTF files are much easier to parse. Being mostly text, they can simply be opened with fopen and their contents read.

I would suggest using RTF if you can, as there really is not a sound solution for DOC files yet.

Cetra
+6  A: 

Conor, I'd suggest looking at the OpenOffice command-line interface and calling macros. It can convert between many file formats, so you can pick something much more parseable than MS .doc.

For instance, to convert to PDF, the command line is (assuming a SaveAsPDF macro has been defined in Module1 of the Standard library):

/usr/lib/ooo-2.0/program/soffice.bin -norestore -nofirststart -nologo -headless -invisible   "macro:///Standard.Module1.SaveAsPDF(demo.doc)"
Ivan Krechetov
hey that's a nice tip: do you have a link to a reference for other macros like that?
nickf
Try this: http://www.tinybutstrong.com/tbsooo.php
Ivan Krechetov
+1  A: 

phpLiveDocx is a Zend Framework component that can read and write DOC and RTF files in PHP on Linux, Windows and Mac. Furthermore, you can use it to generate PDF files and even merge data from PHP into template files created with MS Word or OpenOffice!

See the project web site at:

http://www.phplivedocx.org

+1  A: 

I found a unoconv package in Ubuntu. It converts between all the formats supported by OpenOffice. You should be able to use exec in PHP to run this utility.
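A minimal sketch of driving unoconv from a script, written in Python here (from PHP the same command line would go through exec). It uses unoconv's -f option to pick the target format. The function names and the `runner` parameter are made up for this example, and the returned path assumes unoconv writes the result next to the input with the format name as its extension, which should hold for formats like pdf and rtf.

```python
import subprocess
from pathlib import Path


def unoconv_command(doc_path, target_format="pdf"):
    """Build the unoconv argument list for converting doc_path."""
    return ["unoconv", "-f", target_format, str(doc_path)]


def convert(doc_path, target_format="pdf", runner=subprocess.run):
    """Convert doc_path with unoconv and return the expected output path.

    `runner` is injectable (hypothetical helper parameter) so the function
    can be exercised without unoconv installed.
    """
    runner(unoconv_command(doc_path, target_format), check=True)
    # Assumption: output lands beside the input, extension == format name.
    return Path(doc_path).with_suffix("." + target_format)
```

For example, convert("demo.doc", "rtf") would be expected to leave demo.rtf beside the original, ready to be read as text.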

Loke