I'm developing a desktop search engine using VB9 (VS2008) and Lucene.NET. The indexer in Lucene.NET accepts only raw text, and it is not possible to directly extract raw text from Microsoft Office (DOC, DOCX, PPT, PPTX) and PDF documents. What is the best way to extract raw text from such files?
You can, like the Windows Desktop Search, use components implementing the IFilter interface.
I can only talk about MS Office documents here. There are several ways to do this:
- Using COM automation
- Using converters which output the document in a more accessible format
- Using 3rd-party libraries
- Using Microsoft's OpenXML SDK
COM automation has the disadvantage of not always being reliable, mainly because applications tend to hang due to modal popup dialogs.
Converters are available for Word. You could check out the Text Converter SDK from Microsoft, which allows you to use the document converters that ship with Word in a stand-alone application. This requires some C coding, but because you are using the same conversion engines as Office, you will get high-fidelity results. The SDK can be obtained from http://support.microsoft.com/kb/111716.
For the third option, third-party libraries, you might want to have a look at Apache POI (a Java library) or the b2xtranslator project on SourceForge. The latter provides a C# library which allows you to extract the text from binary Word documents. Its PowerPoint support is still at an early stage, but text extraction should already be working.
The last option is Microsoft's OpenXML SDK, which is probably the easiest route for the newer Open XML formats (DOCX, XLSX, PPTX); samples are easy to find online. You could also handle binary documents by first converting them using the Office Compatibility Pack (download and install from Microsoft):
Word:
"C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme <input file> <output file>
Excel:
"C:\Program Files\Microsoft Office\Office12\excelcnv.exe" -oice <input file> <output file>
PowerPoint:
"C:\Program Files\Microsoft Office\Office12\ppcnvcom.exe" -oice <input file> <output file>
For PDF you can use my company's .NET PDF Reader component that features text extraction.
This is all the code you need to write to extract the text from a PDF:
    public String ReadTextFromPages(Stream s)
    {
        // Dispose the document once the text has been read.
        using (PdfTextDocument doc = new PdfTextDocument(s))
        {
            PdfTextReader rdr = doc.GetPdfTextReader();
            return rdr.ReadToEnd();
        }
    }
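For the DOCX side (the OpenXML option mentioned earlier), here is a minimal sketch of what text extraction can look like. Note that it does not use the OpenXML SDK itself: a DOCX file is a ZIP package whose main part is word/document.xml, so the sketch reads that part directly and collects the text of the w:t elements. The class and method names (DocxText, Extract, ExtractFromDocumentXml) are made up for illustration, and it assumes a .NET version where System.IO.Compression.ZipArchive is available (4.5 or later), which is newer than VS2008; on .NET 3.5 the System.IO.Packaging classes would be the closest counterpart. It only covers the document body and ignores tabs, line breaks, headers, footers and footnotes.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of any SDK.
public static class DocxText
{
    // WordprocessingML main namespace: <w:t> holds run text, <w:p> is a paragraph.
    private const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Opens the DOCX package (a ZIP file) and extracts text from its main part.
    public static string Extract(Stream docx)
    {
        using (ZipArchive zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");
            using (Stream part = entry.Open())
            {
                return ExtractFromDocumentXml(part);
            }
        }
    }

    // Streams through the XML, appending the content of every <w:t> element
    // and a line break at the end of every <w:p> element.
    public static string ExtractFromDocumentXml(Stream xml)
    {
        StringBuilder sb = new StringBuilder();
        bool inText = false;
        using (XmlReader reader = XmlReader.Create(xml))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.NamespaceURI == W && reader.LocalName == "t"
                            && !reader.IsEmptyElement)
                            inText = true;
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        if (inText) sb.Append(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        if (reader.NamespaceURI != W) break;
                        if (reader.LocalName == "t") inText = false;
                        else if (reader.LocalName == "p") sb.AppendLine();
                        break;
                }
            }
        }
        return sb.ToString();
    }
}
```

You would call DocxText.Extract with a stream opened on the .docx file and hand the returned string to the indexer. Files produced by the Compatibility Pack converters are ordinary DOCX packages, so they should go through the same path.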