ansaurus

Question

What is the best way to parse Microsoft Office and PDF documents?

Answer 1

+4 A:

You can, like the Windows Desktop Search, use components implementing the IFilter interface.

David Tischler 2009-01-21 13:47:34

If I can just add to this - for the love of all that is sacred, use the Foxit PDF IFilter. The 32-bit version is free, and it's much faster and more stable than the Adobe one: http://www.foxitsoftware.com/pdf/ifilter/index.html

Ryan Ische 2009-04-09 19:30:20

Answer 2

+1 A:

I can only talk about MS Office documents here. There are several ways to do this:

1. Using COM automation
2. Using converters which output the document in a more accessible format
3. Using 3rd-party libraries
4. Using Microsoft's OpenXML SDK

COM automation has the disadvantage of not always being reliable, mainly because applications tend to hang due to modal popup dialogs.

Converters are available for Word. Check out the Text Converter SDK from Microsoft, which lets you use the document converters that ship with Word in a stand-alone application. It requires some C coding, but since you are using the same conversion engines as Office, you get high-fidelity results. The SDK can be obtained from http://support.microsoft.com/kb/111716.

For the third option, third-party libraries, have a look at Apache POI or the b2xtranslator project on SourceForge. The latter provides a C# library which lets you extract the text from binary Word documents. Its PowerPoint support is still at an early stage, but text extraction should already be working.

The last option is Microsoft's OpenXML SDK, which reads the Office 2007 Open XML formats (.docx, .xlsx, .pptx). This might be the preferred and easiest way. Search Google for samples. You can also handle the older binary documents by first converting them with the Office Compatibility Pack (download and install from Microsoft):

Word:

"C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme <input file> <output file>

Excel:

"C:\Program Files\Microsoft Office\Office12\excelcnv.exe" -oice <input file> <output file>

PowerPoint:

"C:\Program Files\Microsoft Office\Office12\ppcnvcom.exe" -oice <input file> <output file>

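Once a document is in Open XML form (natively, or after running one of the converters above), it is an ordinary ZIP package containing XML parts, so plain text can be pulled out even without the SDK. The following is only a minimal sketch in Python using the standard library, not the OpenXML SDK's own API. The function name extract_docx_text is made up for this example, and the sketch assumes the main document part sits at the conventional path word/document.xml and uses the Transitional WordprocessingML namespace. It ignores tabs, line breaks, headers, footers, and footnotes, and text inside text boxes may be repeated.

```python
import zipfile
import xml.etree.ElementTree as ET

# Transitional WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(path_or_file):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; accepts a path or a file object.
    """
    with zipfile.ZipFile(path_or_file) as package:
        # Assumption: the main part is at the conventional location.
        # A robust reader would resolve it through _rels/.rels instead.
        xml_bytes = package.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text lives in <w:t> elements.
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling extract_docx_text("report.docx") (the file name is a placeholder) returns the paragraph text as a single string. For anything beyond plain text, such as styles, tables, or spreadsheets, the SDK or a library like Apache POI is the better fit.
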
0xA3 2009-01-21 13:53:39

Answer 3

+2 A:

For PDF you can use my company's .NET PDF Reader component that features text extraction.

This is all the code you need to write to extract the text from a PDF:

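// Reads the text of every page from a PDF supplied as a stream.
public string ReadTextFromPages(Stream s)
{
    // Disposing the document releases its resources when done.
    using (PdfTextDocument doc = new PdfTextDocument(s))
    {
        PdfTextReader rdr = doc.GetPdfTextReader();
        return rdr.ReadToEnd();
    }
}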

plinth 2009-01-21 13:55:46
