I'd like to be able to read the content of Office documents (for a custom crawler).

The Office versions that need to be readable range from 2000 to 2007. I mainly want to crawl Word, Excel and PowerPoint documents.

I don't want to retrieve the formatting, only the text.

The crawler is written in C# and based on Lucene.NET, if that helps.

I already use iTextSharp for parsing PDFs.

+1  A: 

There is an excellent open source project, Apache POI. Its only drawback is that it is written for Java, and the .NET port is still very much in beta.

Drejc
+1  A: 

Here's a nice little post on c-sharpcorner by Krishnan LN that gives basic code to grab the text from a Word document using the Word Primary Interop Assemblies.

Basically, you select the entire document with "WholeStory", copy the selection to the clipboard, then pull it back from the clipboard while converting it to text format. The clipboard step is presumably done to strip out formatting.

For PowerPoint, you do a similar thing, but you need to loop through the slides, then for each slide loop through the shapes, and grab the "TextFrame.TextRange.Text" property in each shape.
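The PowerPoint loop described above could be sketched roughly as follows. This is only a minimal sketch: it assumes PowerPoint and its Primary Interop Assemblies are installed on the crawling machine, and the class and method names (`PowerPointTextExtractor`, `ExtractText`) are made up for illustration.

```csharp
using System;
using System.Text;
using Microsoft.Office.Core;
using PowerPoint = Microsoft.Office.Interop.PowerPoint;

// Hypothetical helper, not part of any library.
static class PowerPointTextExtractor
{
    public static string ExtractText(string path)
    {
        var text = new StringBuilder();
        var app = new PowerPoint.Application();
        PowerPoint.Presentation presentation = null;
        try
        {
            // Arguments: ReadOnly = true, Untitled = false, WithWindow = false.
            presentation = app.Presentations.Open(
                path, MsoTriState.msoTrue, MsoTriState.msoFalse, MsoTriState.msoFalse);

            foreach (PowerPoint.Slide slide in presentation.Slides)
            {
                foreach (PowerPoint.Shape shape in slide.Shapes)
                {
                    // Pictures, lines and similar shapes have no text frame, so skip them.
                    if (shape.HasTextFrame == MsoTriState.msoTrue &&
                        shape.TextFrame.HasText == MsoTriState.msoTrue)
                    {
                        text.AppendLine(shape.TextFrame.TextRange.Text);
                    }
                }
            }
        }
        finally
        {
            // Always shut the PowerPoint process down, even if extraction fails.
            if (presentation != null) presentation.Close();
            app.Quit();
        }
        return text.ToString();
    }
}
```

Starting the Office application for every file is slow, so a crawler would probably want to reuse one application instance across many documents.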

For Excel, since Excel can be an OleDb data source, it's easiest to use ADO.NET. Here's a good post by Laurent Bugnion that walks through this technique.
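A rough sketch of the ADO.NET approach, not taken from the linked post: it assumes the Jet OLE DB provider is available (32-bit process, .xls files), and the names `ExcelTextExtractor`, `ExtractText` and `FlattenTable` are hypothetical. Excel 2007 .xlsx files would presumably need the ACE provider and a different connection string.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.OleDb;
using System.Text;

// Hypothetical helper, not part of any library.
static class ExcelTextExtractor
{
    public static string ExtractText(string path)
    {
        // HDR=No treats the first row as data; IMEX=1 reads mixed-type columns as text.
        string connectionString =
            "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + path +
            ";Extended Properties=\"Excel 8.0;HDR=No;IMEX=1\"";

        var text = new StringBuilder();
        using (var connection = new OleDbConnection(connectionString))
        {
            connection.Open();

            // Each worksheet (and named range) is exposed as a table.
            DataTable sheets = connection.GetOleDbSchemaTable(OleDbSchemaGuid.Tables, null);
            foreach (DataRow sheet in sheets.Rows)
            {
                string name = (string)sheet["TABLE_NAME"];
                using (var adapter = new OleDbDataAdapter("SELECT * FROM [" + name + "]", connection))
                {
                    var table = new DataTable();
                    adapter.Fill(table);
                    text.Append(FlattenTable(table));
                }
            }
        }
        return text.ToString();
    }

    // Joins the non-empty cells of each row with spaces, one row per line.
    public static string FlattenTable(DataTable table)
    {
        var text = new StringBuilder();
        foreach (DataRow row in table.Rows)
        {
            var cells = new List<string>();
            foreach (object value in row.ItemArray)
            {
                if (value == null || value is DBNull) continue;
                cells.Add(value.ToString());
            }
            if (cells.Count > 0) text.AppendLine(string.Join(" ", cells.ToArray()));
        }
        return text.ToString();
    }
}
```

The advantage over Interop here is that Excel itself does not need to be started for each file.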

Guy Starbuck
+1  A: 

Here is a good list of various tools for converting Word documents to plain text, which you can then process however you like.

Adam Rosenfield
+2  A: 

If you're already using Lucene.NET, you might just want to take advantage of the various IFilters already available for doing this. Take a look at the open source SeekAFile project. It will show you how to use an IFilter to open and extract this information from any file type for which an IFilter is available. There are IFilters for Word, Excel, PowerPoint, PDF, and most of the other common document types.

Paul Mrozowski
A: 

Thanks for the answers. I'd like to accept all of them, but I have to choose one.

ceetheman
A: 

You might also consider checking out DtSearch (www.DtSearch.com). Although it is primarily a searching tool, it does a great job of extracting text from a large number of file types and is considerably cheaper than other options like the Oracle/Stellent OutsideIn technology or the equivalent from Autonomy.

I've been using DtSearch for years and find it indispensable for this type of task.

JohnFx