A customer is asking me to build a module for his live web app that can load docx files and extract data based on the headings found in the document. I know a docx is just a zip file and most of what I need can be found in word/document.xml, though I'm not looking forward to parsing lists, styles, images, tables, and whatever else needs to be translated from OOXML to HTML.

Are there any PHP libraries for this format? I do need some flexibility, though: a plain OOXML-to-HTML converter is not going to cut it, because I need to break the document up into parts.

A: 

Codeplex has a number of libraries that can work with MS Office documents:

With the exception of PHPExcel, I do not know how mature those projects are. If there is nothing to help you out there, you can still use DOM.
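If you do fall back to DOM, a minimal sketch could look like the following. It assumes PHP's zip and dom extensions are available; `docxParagraphs` is a hypothetical helper name, and it only looks at top-level body paragraphs (no tables, headers, or footnotes).

```php
<?php
// Hypothetical helper: returns the plain text of each top-level paragraph
// found in word/document.xml of a .docx file.
function docxParagraphs(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found');
    }

    $dom = new DOMDocument();
    $dom->loadXML($xml);

    $xpath = new DOMXPath($dom);
    // WordprocessingML main namespace, conventionally bound to the "w" prefix.
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $paragraphs = [];
    foreach ($xpath->query('//w:body/w:p') as $p) {
        // A paragraph's text is spread over one or more w:t elements inside runs.
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}
```

Calling `docxParagraphs('report.docx')` (the file name is just an example) gives you an array of strings, one per paragraph. Lists, tables, and images need their own handling on top of this.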

Gordon
+2  A: 

If it's purely docx, you can try phpdocx... don't know if it reads or only writes. PHPWord doesn't yet read, only writes (though I'm working on it).

If you only need the properties information, then you'll find it all within the /docProps/core.xml file within the zip (and possibly in /docProps/app.xml, depending on exactly which properties you need), so you can bypass most of the files that hold text, style, images, etc. To verify the file names, [Content_Types].xml lists the part names of the core and app properties files under the content types application/vnd.openxmlformats-package.core-properties+xml and application/vnd.openxmlformats-officedocument.extended-properties+xml respectively.
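As a rough sketch of that lookup, assuming the zip and dom extensions are loaded: the helper names `docxPartName` and `docxCoreProperties` are made up for illustration, and the set of properties read is only a sample. The part name is resolved through [Content_Types].xml rather than hard-coded.

```php
<?php
// Hypothetical helper: finds the part registered under a given content type
// by scanning the Override entries of [Content_Types].xml.
function docxPartName(ZipArchive $zip, string $contentType): ?string
{
    $xml = $zip->getFromName('[Content_Types].xml');
    if ($xml === false) {
        return null;
    }
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('ct', 'http://schemas.openxmlformats.org/package/2006/content-types');
    foreach ($xpath->query('//ct:Override') as $override) {
        if ($override->getAttribute('ContentType') === $contentType) {
            // Part names start with "/", zip entry names do not.
            return ltrim($override->getAttribute('PartName'), '/');
        }
    }
    return null;
}

// Hypothetical helper: reads a few Dublin Core properties from the core part.
function docxCoreProperties(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }
    $part = docxPartName($zip, 'application/vnd.openxmlformats-package.core-properties+xml');
    $xml = $part === null ? false : $zip->getFromName($part);
    $zip->close();
    if ($xml === false) {
        return []; // no core properties part in this package
    }

    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');
    $xpath->registerNamespace('dcterms', 'http://purl.org/dc/terms/');

    $props = [];
    foreach (['dc:title', 'dc:creator', 'dcterms:created', 'dcterms:modified'] as $name) {
        $node = $xpath->query('//' . $name)->item(0);
        if ($node !== null) {
            $props[$name] = $node->textContent;
        }
    }
    return $props;
}
```

The result is keyed by the prefixed element name, e.g. `$props['dc:title']`; properties that are absent from the file are simply left out.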

EDIT: If you need headings, then you will need to parse the document, not just the properties. That will mean identifying the heading styles, and parsing the text for entities with those styles.
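A minimal sketch of that approach, under a few assumptions: heading styles are recognised by the style name "heading 1" to "heading 9" in word/styles.xml, only top-level body paragraphs are considered, and each section is reduced to plain text. The names `wXPath`, `docxHeadingStyleIds`, and `docxSectionsByHeading` are hypothetical.

```php
<?php
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Builds an XPath object with the WordprocessingML namespace bound to "w".
function wXPath(string $xml): DOMXPath
{
    $dom = new DOMDocument();
    $dom->loadXML($xml);
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);
    return $xpath;
}

// Maps style IDs to heading levels, based on the style names in styles.xml.
function docxHeadingStyleIds(ZipArchive $zip): array
{
    $ids = [];
    $xml = $zip->getFromName('word/styles.xml');
    if ($xml === false) {
        return $ids;
    }
    $xpath = wXPath($xml);
    foreach ($xpath->query('//w:style[@w:type="paragraph"]') as $style) {
        $name = $xpath->evaluate('string(w:name/@w:val)', $style);
        if (preg_match('/^heading ([1-9])$/i', $name, $m)) {
            $ids[$style->getAttributeNS(W_NS, 'styleId')] = (int) $m[1];
        }
    }
    return $ids;
}

// Splits the body into sections, starting a new one at every heading paragraph.
function docxSectionsByHeading(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a zip archive");
    }
    $headingIds = docxHeadingStyleIds($zip);
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found');
    }

    $xpath = wXPath($xml);
    $sections = [];
    // Text before the first heading lands in a section with a null heading.
    $current = ['heading' => null, 'level' => 0, 'paragraphs' => []];
    foreach ($xpath->query('//w:body/w:p') as $p) {
        $styleId = $xpath->evaluate('string(w:pPr/w:pStyle/@w:val)', $p);
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        if (isset($headingIds[$styleId])) {
            if ($current['heading'] !== null || $current['paragraphs']) {
                $sections[] = $current;
            }
            $current = ['heading' => $text, 'level' => $headingIds[$styleId], 'paragraphs' => []];
        } else {
            $current['paragraphs'][] = $text;
        }
    }
    if ($current['heading'] !== null || $current['paragraphs']) {
        $sections[] = $current;
    }
    return $sections;
}
```

To hand whole parts to an HTML converter instead, the same loop could collect the `w:p` DOM nodes per section rather than their text.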

Mark Baker
I need all proper styling, just need to break up the document based on found headings. I only need read, no write... and phpdocx only writes.
Daniel
Response to edit: I know I'll need to parse the document ;) I'm just looking for libraries that'll give me an easier job at doing so. Preferably I want to pass in PARTS of the document that get translated to html content.
Daniel
Aside from the two I've mentioned, I'm not aware of any other PHP libraries that work with docx format files. If you need to develop this yourself, I can point you to the documentation on the format; if you find any reader libraries, please share. There is always the fallback option of a Windows server running Word, and using PHP COM.
Mark Baker
Ended up using COM. +1 and accepted for mentioning it and for taking the time/effort to reply.
Daniel
Sorry I couldn't help more... I'm working on a pure PHP solution for MS Word files with PHPWord, but with my day job, real life, and my other FOSS projects (PHPExcel and PHPPowerPoint), it takes time.
Mark Baker