So I have files....

.doc
.docx
.xls
.xlsx
and .pdf

that are on my server.

Is it possible (and if it is, how) to extract the metadata from those files using PHP? I'm looking for things like author, keywords, title, etc.

In office documents it's the information stored along with the document properties (File...Properties...Summary for 2003, Prepare...Properties for 2007).

In PDFs it's information found in Document Properties.

This is not on a Windows server.

+2  A: 

I managed to extract a lot of meta information using Xpdf on a Linux system a few years back. Nowadays, though, I would say Zend_PDF is your best bet. I haven't used it myself, but it looks good and promises everything you need. It seems to have no library dependencies, either.
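
For the Xpdf route, here is a minimal sketch of how it could look from PHP. It assumes the `pdfinfo` tool that ships with Xpdf is installed and on the PATH, and that it prints one "Key: value" pair per line. The function names `parsePdfinfoOutput` and `readPdfMeta` are made up for this example.

```php
<?php
// Turn the "Key:   value" lines printed by pdfinfo into an associative array.
function parsePdfinfoOutput(string $output): array
{
    $meta = [];
    foreach (preg_split('/\R/', $output) as $line) {
        // Split on the first colon only, because values such as dates contain colons.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $meta[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $meta;
}

// Hypothetical wrapper: assumes a `pdfinfo` binary is available on the PATH.
function readPdfMeta(string $path): array
{
    $output = shell_exec('pdfinfo ' . escapeshellarg($path));
    return is_string($output) ? parsePdfinfoOutput($output) : [];
}
```

Something like `readPdfMeta('/path/to/file.pdf')['Author'] ?? null` would then give the author if the PDF has one set. Which keys show up depends on the file.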

For Word .doc files, if you don't find a better way, plug into an OpenOffice server instance or its command line and convert the files to ODT, which is XML-based and parseable. It should also be possible to extract the metadata with a macro, but I don't know how much work that is. This OpenOffice Forum entry gives a ton of starting points for automated conversion.
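
Once a file has been converted to ODT, reading the metadata could look roughly like the sketch below. It assumes the ODT is a ZIP archive with a `meta.xml` entry using the standard OpenDocument namespaces, and that PHP's zip and DOM extensions are enabled. `parseOdtMetaXml` and `readOdtMeta` are made-up names, and the conversion step itself is not shown.

```php
<?php
// Extract title, author and keywords from the contents of an ODT meta.xml.
function parseOdtMetaXml(string $xml): array
{
    if ($xml === '') {
        return [];
    }
    $doc = new DOMDocument();
    if (!@$doc->loadXML($xml)) {
        return [];
    }
    $xp = new DOMXPath($doc);
    $xp->registerNamespace('office', 'urn:oasis:names:tc:opendocument:xmlns:office:1.0');
    $xp->registerNamespace('meta', 'urn:oasis:names:tc:opendocument:xmlns:meta:1.0');
    $xp->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');

    // ODT stores each keyword in its own meta:keyword element.
    $keywords = [];
    foreach ($xp->query('//office:meta/meta:keyword') as $node) {
        $keywords[] = trim($node->textContent);
    }
    return [
        'title'    => $xp->evaluate('string(//office:meta/dc:title)'),
        'author'   => $xp->evaluate('string(//office:meta/meta:initial-creator)'),
        'keywords' => $keywords,
    ];
}

// Hypothetical wrapper: reads meta.xml through the zip:// stream wrapper.
function readOdtMeta(string $path): array
{
    $xml = @file_get_contents('zip://' . $path . '#meta.xml');
    return $xml === false ? [] : parseOdtMetaXml($xml);
}
```

Whether the original Word properties end up in `meta.xml` depends on the conversion filter, so check a converted file by hand first.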

The ...X formats (.docx, .xlsx) are ZIP packages of XML files, so it should be easy to fetch the metadata from them. Alternatively, you should be able to use OpenOffice's conversion filters here as well, provided they carry the metadata over.
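
For .docx and .xlsx, a rough sketch of reading the properties directly. It assumes the file is a ZIP package containing `docProps/core.xml` with the usual core-properties and Dublin Core namespaces, and that PHP's zip and DOM extensions are enabled. `parseCoreProperties` and `readOoxmlMeta` are made-up names.

```php
<?php
// Extract title, author and keywords from the contents of docProps/core.xml.
function parseCoreProperties(string $xml): array
{
    if ($xml === '') {
        return [];
    }
    $doc = new DOMDocument();
    if (!@$doc->loadXML($xml)) {
        return [];
    }
    $xp = new DOMXPath($doc);
    $xp->registerNamespace('cp', 'http://schemas.openxmlformats.org/package/2006/metadata/core-properties');
    $xp->registerNamespace('dc', 'http://purl.org/dc/elements/1.1/');

    // Missing elements simply come back as empty strings.
    return [
        'title'    => $xp->evaluate('string(/cp:coreProperties/dc:title)'),
        'author'   => $xp->evaluate('string(/cp:coreProperties/dc:creator)'),
        'keywords' => $xp->evaluate('string(/cp:coreProperties/cp:keywords)'),
    ];
}

// Hypothetical wrapper: pulls docProps/core.xml out of the package with ZipArchive.
function readOoxmlMeta(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return [];
    }
    $xml = $zip->getFromName('docProps/core.xml');
    $zip->close();
    return $xml === false ? [] : parseCoreProperties($xml);
}
```

Here the keywords come back as one string, exactly as the user typed them, so splitting them is up to you.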

Pekka
So far, so good: Zend_PDF did the trick for PDFs. Next up are the office docs.
Jason
Nice! Be sure to keep us updated, I'm sure it will come in handy for a lot of people. Maybe this is of additional interest, or can give you some pointers. http://meta-extractor.sourceforge.net/
Pekka