How can I extract text from specific binary file formats?

+1 Q:

In .NET, what would be the best way to extract all the text from several binary file formats: PDF, Word, Excel, and PowerPoint?

It doesn't need to be formatted, just a big dump of the text in the file.

Code would be great, but I really just need to get pointed to some best practices or patterns on it.

+1 A:

Well, the same as in any other language/environment: Understand the file format enough to extract strings.

And yes, for many file formats this means writing at least half a parser for the format. PDF is especially icky: it has no spaces per se, since a word gap is just a convention of how far apart the glyphs are placed. Furthermore, PDF can contain compressed streams, so simply searching the file for printable strings doesn't yield anything of value.
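
To make the compressed-stream point concrete, here is a minimal sketch in Python (chosen only for illustration; the same logic ports to .NET). It builds a toy PDF-like blob, not a valid PDF, and the helper names printable_strings and inflate_streams are made up for this example. It assumes the stream uses the Flate (zlib) filter and ignores everything else a real PDF parser has to handle, such as font encodings and glyph positioning.

```python
import re
import zlib

def printable_strings(data: bytes, min_len: int = 4) -> list:
    """Return every run of printable ASCII at least min_len bytes long."""
    return re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)

def inflate_streams(data: bytes) -> list:
    """Decompress each 'stream ... endstream' body that is valid zlib data."""
    bodies = re.findall(rb"stream\n(.*?)\nendstream", data, re.DOTALL)
    out = []
    for body in bodies:
        try:
            out.append(zlib.decompress(body))
        except zlib.error:
            pass  # not zlib data (or another filter is in use); skip it
    return out

# Toy blob: a PDF-style header plus one Flate-compressed content stream.
content = b"BT (Hello from a compressed stream) Tj ET\n" * 20
blob = (b"%PDF-1.4\n<< /Filter /FlateDecode >>\nstream\n"
        + zlib.compress(content) + b"\nendstream\n")

naive = printable_strings(blob)
inflated = inflate_streams(blob)

# The naive scan sees the header and dictionary but not the text itself.
print(any(b"Hello" in s for s in naive))
# Inflating the stream first makes the text visible again.
print(any(b"Hello" in s for s in inflated))
```

Even after inflating, what you get is a sequence of drawing operators rather than clean prose, which is why a library or existing tool is usually the saner route.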

Naturally, you can look for a library or another tool which already does this. I've seen a document repository which simply passed PDF files through pdf2ascii and fed the resulting text to Lucene.

Joey 2010-01-15 16:20:53

+1 A:

You'd probably have to implement a different handler for each file type. There is a lot of sample code around for reading these formats, using Office Interop etc. You could then write a method that looks at the first few bytes (or the extension) to work out what format the document is, and sends it to the specific reader for that type of document.
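
A rough sketch of that "look at the first few bytes" dispatcher, written in Python for illustration (the idea carries over to .NET unchanged). The signatures are the well-known ones: PDF files start with %PDF-, the legacy binary Office formats (.doc, .xls, .ppt) share the OLE2 compound-file header, and the newer Office formats are ZIP containers. The function names and the readers mapping are hypothetical; you would plug in whatever per-format extractor you end up using.

```python
# Leading-byte signatures, checked in order.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (bytes.fromhex("D0CF11E0A1B11AE1"), "ole2"),  # legacy .doc/.xls/.ppt
    (b"PK\x03\x04", "zip"),                       # ZIP container (newer Office files)
]

def sniff_format(head: bytes) -> str:
    """Classify a file by its leading bytes; 'unknown' if nothing matches."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return "unknown"

def extract_text(path: str, readers: dict) -> str:
    """Sniff the file at path and hand it to the matching reader.

    readers maps a format name ('pdf', 'ole2', 'zip') to a callable that
    takes the path and returns text. The callables are supplied by you.
    """
    with open(path, "rb") as f:
        kind = sniff_format(f.read(8))
    reader = readers.get(kind)
    if reader is None:
        raise ValueError(f"no reader registered for format {kind!r}")
    return reader(path)
```

Note that the OLE2 header alone does not tell Word, Excel, and PowerPoint apart; for that you would either fall back on the extension or look inside the container.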

Michael Baldry 2010-01-15 16:21:35

Look into Office Interop using .NET for the Office ones. For PDF, see here.

BlueRaja - Danny Pflughoeft 2010-01-15 16:22:38

Is Office Interop meant to be used server-side? I know MS has frowned on that in the past.

Deane 2010-01-15 16:32:03

+2 A:

I'm surprised no one has mentioned IFilters. IFilters are what Microsoft uses to index documents in Windows. You'll have to do some googling to find IFilters for the specific formats you're looking for, but you should find most of what you need. A word of caution, though: IFilters aren't perfect. They have issues...

Here's a CodeProject article to get you started: http://www.codeproject.com/KB/cs/IFilter.aspx

BFree 2010-01-15 16:39:25

Old post but THANK YOU!! I couldn't find an effective way to extract text from binary PDFs but the IFilters are doing the trick perfectly. Much thanks!!!!!!

J. Farray 2010-10-28 19:53:22

+1 A:

Check out Apache Tika.

It supports:

Microsoft Excel
Microsoft Word
Microsoft PowerPoint
Microsoft Visio
Microsoft Outlook
Portable Document Format (PDF)
OpenDocument
Plain text
Rich Text Format
gzip compression
bzip2 compression
MP3 Audio
MIDI audio
Wave audio
XML
HTML
Java class files
Java jar archives
tar archive
ZIP archive
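
Since Tika is a Java library, one way to use it from another environment is to shell out to its command-line app. The sketch below (Python, for illustration only) assumes you have downloaded the tika-app jar and have java on your PATH; the jar path is a placeholder you supply, and the function names are made up for this example. The --text and --encoding options are taken from the Tika app's command-line help.

```python
import subprocess

def tika_text_command(tika_jar: str, document: str) -> list:
    """Build the command line that asks the Tika app for plain text."""
    return ["java", "-jar", tika_jar, "--text", "--encoding=UTF-8", document]

def extract_with_tika(tika_jar: str, document: str) -> str:
    """Run the Tika app on one document and return its text output.

    Raises subprocess.CalledProcessError if the java process exits non-zero.
    """
    result = subprocess.run(
        tika_text_command(tika_jar, document),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Starting a JVM per document is slow, so for anything beyond a handful of files you would probably want a long-running process instead.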

Nick 2010-01-15 17:04:31
