Is there a pre-existing library to extract plain text from Open XML file formats (e.g. docx, pptx, and xlsx)?

I need this to populate a Lucene.Net index.

I've found this example, which extracts text from docx and seems to work okay. But before building my own solution based on it, I was wondering whether something is already available for the other file formats.
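
For context, the docx approach looks like it should generalise, since docx, pptx, and xlsx files are all zip packages of XML parts. Below is a minimal sketch of the idea, written in Python for brevity rather than .NET. The part names in `TEXT_PARTS` and the rule "collect every element whose local name is `t`" are my assumptions about the usual package layout, not a complete reader; `extract_text` is a hypothetical helper name.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Parts assumed to hold the user-visible text in each format.
TEXT_PARTS = re.compile(
    r"^(word/document\.xml"          # docx main body
    r"|ppt/slides/slide\d+\.xml"     # pptx slides
    r"|xl/sharedStrings\.xml)$"      # xlsx shared string table
)


def extract_text(source):
    """Return the text runs found in a docx, pptx, or xlsx package.

    `source` is a file path or a binary file-like object.
    Each text run ends up on its own line, which is crude but
    usually enough for feeding a full-text index.
    """
    chunks = []
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            if not TEXT_PARTS.match(name):
                continue
            root = ET.fromstring(package.read(name))
            for element in root.iter():
                # Tags look like "{namespace}t"; compare the local name only.
                if element.tag.rsplit("}", 1)[-1] == "t" and element.text:
                    chunks.append(element.text)
    return "\n".join(chunks)
```

Usage would be something like `extract_text("report.docx")` (hypothetical file name). Note that for xlsx this only picks up the shared string table; numeric cells and inline strings live in the worksheet parts and are not covered here.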

A: 

Take a look at aspose.com; they have a good library that handles both ppt and pptx.

Yaroslav Yakovlev
Looks like it will do the job, but I'd need to tell the client the price and see if they bite. Considering I only need to read the XML-based formats, I'd be buying a lot more than I need.
Myster
It also means it wouldn't be very suitable for an open source project (which is not a current requirement, but it would be a bonus).
Myster
+1  A: 

Before spending cash, it may be worth looking at the IFilter interface - these were/are designed to do exactly what you want.

http://msdn.microsoft.com/en-us/library/ms691105

http://www.codeproject.com/KB/cs/IFilter.aspx

(There are some more links at the bottom of the CodeProject article.)

Microsoft provides IFilters for the Office file types: http://www.microsoft.com/downloads/details.aspx?familyid=60c92a37-719c-4077-b5c6-cac34f4227cc&displaylang=en

I know that we use this technology to index PDFs with Lucene, but I did not write the actual code, so I'm afraid I can't be of much more help.

If your Google-fu is strong I am sure you can dig up more examples of using IFilters to do exactly what you want.

Chris F
+1 to IFilter. I'm also using it to populate our Lucene.Net indexes. We're using it to search large file stores with great success (indexing takes lots of time, though). IFilter is an industry standard, so you can find IFilters for any imaginable content type.
buru
That's a reasonable solution; however, I'm a bit wary of COM interop, particularly if others wish to use the solution in various unknown environments. (Or am I being paranoid?)
Myster