I need to get the page count of Word documents. I've tested many libraries and scripts (Apache POI, Perl scripts, some Linux applications, and more), and the only working solution was to install Microsoft Office under Wine and drive it through OLE from Perl. I got that working, but it seems I can't use it on a server because of licensing problems...

The problem with Apache POI and other tools that read Word document info is that some documents are incomplete: the pageCount property in the document summary is sometimes missing (often the case with ODT documents saved as .doc, and with older documents).

Is there any way to actually count the pages (not just read the value from the summary) without installing Microsoft Office on the server?

+1  A: 

I was going to say wvSummary, but I think this uses the metadata you're referring to. I'm not sure there is a way to get the page count without actually laying out the document. So you might have to resort to using APIs to drive a real Office-compatible application like OpenOffice or AbiWord.

Matthew Flaschen
I've tried wvSummary and yes, it uses the document summary. I forgot to mention OpenOffice: it has Python and Java APIs, and getting the actual page count is fairly easy. The only problem with this approach is how OpenOffice opens .doc files: the result can sometimes differ from the same file opened in MS Office.
Yes, OpenOffice does sometimes differ. Remember there's no standard for .doc. So really the answer is, there is no answer. There is no standard way to count pages, because there is no standard way to render a doc file. If you must get the same number of pages Word does, then you obviously have to use Word.
Matthew Flaschen
A: 

If you trust the document summary, then instead of using wvSummary you can just open the file and run a regex search for "nofpages(\d+)". Groups[1] will contain the number of pages.

Since Word always saves the summary when it saves, I think this is pretty safe if you know the document was last saved with Word, which in my experience is 99% of the time.
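
A minimal Python sketch of that regex search, for illustration only. The function names are hypothetical. As far as I know, `\nofpages` is an RTF control word, so this assumes the file is RTF text (possibly saved with a .doc extension); a binary .doc may not contain the string at all, in which case the function returns None. The value is whatever was stored at the last save, not a freshly computed count.

```python
import re

# Same pattern as in the answer above, applied to raw bytes so that
# the file's text encoding does not matter.
NOFPAGES = re.compile(rb"nofpages(\d+)")


def summary_page_count(data):
    """Return the page count stored in the summary, or None if absent."""
    match = NOFPAGES.search(data)
    if match is None:
        return None
    # group(1) is the equivalent of Groups[1] in .NET.
    return int(match.group(1))


def summary_page_count_from_file(path):
    """Read a file from disk and look for the stored page count."""
    with open(path, "rb") as handle:
        return summary_page_count(handle.read())
```

A None result is the signal to fall back to another method, such as laying the document out in OpenOffice.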

PRMan