Is there a reliable way to extract text from a PDF? My first thought is that a PDF may have multiple columns, so the extraction mechanism needs to know the logical structure somehow. I understand that some PDF documents are "tagged", but I'd need to support pretty much any PDF document.

Are there any third-party components that could help here?

+5  A: 

Please see: Extracting text from PDFs in C#

Mitch Wheat
+2  A: 

Some PDFs are scans, so OCR would be required (not easy, to say the least).

Some PDFs are compressed; others (more rarely) are bare, uncompressed PDFs.
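
To illustrate the compression point: PDF content streams are commonly stored with the Flate filter, which is zlib/deflate. The sketch below is only a toy, assuming a hand-written stream body rather than a real file; the helper name `decode_flate_stream` is made up for this example, and a real extractor would still have to locate stream objects and handle other filters.

```python
import zlib

# A tiny, hand-written content stream using standard PDF text operators
# (BT/ET = begin/end text, Tf = font, Td = position, Tj = show string).
raw_stream = b"BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET"

# Roughly what a writer stores when the stream uses the Flate filter.
compressed = zlib.compress(raw_stream)


def decode_flate_stream(data: bytes) -> bytes:
    """Hypothetical helper: undo Flate (zlib/deflate) on a raw stream body."""
    return zlib.decompress(data)


decoded = decode_flate_stream(compressed)
print(decoded.decode("latin-1"))
```

Even after decoding, all you have is drawing operators with coordinates, not paragraphs or columns.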

The PDF file format itself is well-documented, but extracting the right "structure" from anything but a simple one-column document is a tall order. Internally, a PDF resembles what HTML would look like if every line of text were placed in its own DIV with absolute positioning.
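
A rough sketch of what "recovering structure" from absolutely positioned text involves, assuming you already have each line as a fragment with coordinates. The `Fragment` type, the `reading_order` function, and the `column_gap` threshold are all hypothetical names for this illustration, not part of any PDF library. For simplicity `y` is measured from the top of the page here, whereas default PDF user space has `y` growing upward.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float   # left edge of the line, in points
    y: float   # vertical position, measured from the top (assumption)
    text: str


def reading_order(fragments, column_gap=50.0):
    """Naive heuristic: cluster lines into columns by their left edge,
    then read each column top to bottom, left column first."""
    columns = []
    for frag in sorted(fragments, key=lambda f: f.x):
        # Start a new column when the left edge jumps by more than column_gap.
        if columns and frag.x - columns[-1][-1].x < column_gap:
            columns[-1].append(frag)
        else:
            columns.append([frag])
    lines = []
    for col in columns:
        lines.extend(f.text for f in sorted(col, key=lambda f: f.y))
    return lines


# Two columns whose lines share the same vertical positions.
fragments = [
    Fragment(320, 100, "right 1"),
    Fragment(72, 100, "left 1"),
    Fragment(72, 114, "left 2"),
    Fragment(320, 114, "right 2"),
]
print(reading_order(fragments))
```

Sorting by `y` alone would interleave the two columns, which is the multi-column problem from the question. This heuristic breaks quickly on indented lines, tables, or full-width headings, so treat it as a demonstration of the difficulty rather than a solution.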

richardtallent
A: 

You can use "CZ-Pdf2Txt COM". It can extract text from PDF files and convert PDF files to delimited table text or CSV files.

This tool can preserve the original document layout, and it supports both a command line and a COM interface, so you can call it from your application. You can get a demo version and more information from http://www.convertzone.com/pdf2txtcom/help.htm

regards
flyaga

flyaga
Is this more reliable than the one mentioned by the earlier poster (using the iTextSharp library)?
DotnetDude
It handles lines of text positioned the way absolutely positioned DIVs would be, and you can download the demo version from http://www.convertzone.com/pdf2txtcom/demo/czpdf2txtcom.exe to try it.
flyaga