ASP.NET library to extract plain text from Open XML file formats

+3 Q:

ASP.NET library to extract plain text from Open XML file formats

Is there a pre-existing library to extract plain text from Open XML files (e.g. docx, pptx, and xlsx)?

I need this to populate a Lucene.Net index.

I've found this example, which extracts text from docx, and it seems to work okay. Before building my own solution based on it, though, I was wondering whether something is already available for the other file formats.
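
For context, examples of that kind generally rely on the fact that an Open XML file is a ZIP package whose text sits in a few well-known XML parts: `word/document.xml` for docx, `ppt/slides/slideN.xml` for pptx, and `xl/sharedStrings.xml` for xlsx. The sketch below is not the linked example and not a library API; it is a minimal illustration under those assumptions, written in Python only for brevity. The name `extract_text` is hypothetical, and the same steps should map onto the .NET zip and XML classes.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# Parts that normally carry the visible text in each package type.
# Assumption: headers, footers, notes, and comments are out of scope here.
TEXT_PARTS = re.compile(
    r"^(word/document\.xml|ppt/slides/slide\d+\.xml|xl/sharedStrings\.xml)$"
)


def extract_text(source):
    """Return plain text from a docx, pptx, or xlsx file.

    `source` may be a file path or a binary file-like object.
    """
    chunks = []
    with zipfile.ZipFile(source) as package:
        # Sorting is lexicographic (slide10 comes before slide2), which is
        # acceptable for indexing but not for faithful reading order.
        for name in sorted(package.namelist()):
            if not TEXT_PARTS.match(name):
                continue
            root = ET.fromstring(package.read(name))
            for element in root.iter():
                # Text runs are <w:t>, <a:t>, or <t>; strip the namespace
                # and compare the local name only.
                if element.tag.rsplit("}", 1)[-1] == "t" and element.text:
                    chunks.append(element.text)
    # Runs are joined with spaces, so a word stored across several runs
    # will be split; a stricter version would join per paragraph.
    return " ".join(chunks)
```

Calling `extract_text("report.docx")` (file name hypothetical) would return a single string suitable for handing to an indexer. Note that in xlsx only shared strings are picked up; numeric cell values live in the worksheet parts and are skipped by this sketch.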

Take a look at aspose.com; they have a good library that handles both ppt and pptx.

Yaroslav Yakovlev 2010-06-27 04:50:45

Looks like it will do the job, but I'd need to tell the client the price and see if they bite. Considering I'm only looking to read the XML versions, I'd be buying a lot more than I need.

Myster 2010-07-07 00:19:52

It also means it wouldn't be very suitable for an open source project (which is not a current requirement, but it would be a bonus).

Myster 2010-07-19 22:06:12

+1 A:

Before spending cash, it may be worth looking at the IFilter interface; IFilters were designed to do exactly what you want.

http://msdn.microsoft.com/en-us/library/ms691105

http://www.codeproject.com/KB/cs/IFilter.aspx

(There are some links at the bottom of the CodeProject article.)

Microsoft provides IFilters for Office file types: http://www.microsoft.com/downloads/details.aspx?familyid=60c92a37-719c-4077-b5c6-cac34f4227cc&displaylang=en

I know that we use this technology to index PDFs with Lucene, but I did not write the actual code, so I cannot be of much more use, I am afraid.

If your Google-fu is strong I am sure you can dig up more examples of using IFilters to do exactly what you want.

Chris F 2010-07-08 20:00:11

+1 to IFilter. I'm also using it to populate our Lucene.Net indexes. We're using it to search large file stores with great success (indexing takes a lot of time, though). IFilter is an industry standard, so you can find IFilters for almost any imaginable content type.

buru 2010-07-14 09:21:46

That's a reasonable solution; however, I'm a bit wary of COM interop, particularly if others wish to use the solution in various unknown environments. (Or am I being paranoid?)

Myster 2010-07-19 22:04:58
