Extracting text from PDF document - C#


Is there a reliable way to extract text from a PDF? My first concern is that a PDF may have multiple columns, so the extraction mechanism needs to know the logical structure somehow. I understand that some PDF documents are "tagged", but I'd need to support pretty much any PDF document.

Are there any third-party components that can help here?

+5 A:

Please see: Extracting text from PDFs in C#

Mitch Wheat 2010-02-19 15:06:01

+2 A:

Some PDFs are scans, so OCR would be required (not easy, to say the least).

Some PDFs store their content streams compressed (typically with Flate/deflate); others, more rarely, store them uncompressed.

The PDF file format itself is well documented, but extracting the right "structure" from anything other than a simple one-column document is a tall order. Internally, a PDF looks roughly like HTML would if every line of text were placed in its own DIV with absolute positioning.
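To illustrate the absolute-positioning point, here is a minimal sketch in C#. It assumes some PDF parser has already produced text fragments with page coordinates; the `Fragment` type, the `ToLines` helper, and the `yTolerance` value are hypothetical names and numbers made up for this example, not part of any library. It groups fragments into lines by baseline and sorts each line left to right.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: one run of text plus the position a parser reported for it.
public sealed class Fragment
{
    public double X { get; }
    public double Y { get; }
    public string Text { get; }
    public Fragment(double x, double y, string text) { X = x; Y = y; Text = text; }
}

public static class LayoutSketch
{
    // Groups fragments into lines by baseline (Y), then orders each line by X.
    // In PDF user space Y grows upward, so a larger Y is higher on the page.
    // yTolerance is an assumed value; a real extractor would derive it from font size.
    public static List<string> ToLines(IEnumerable<Fragment> fragments, double yTolerance = 2.0)
    {
        var lines = new List<List<Fragment>>();
        foreach (var fragment in fragments.OrderByDescending(f => f.Y))
        {
            var last = lines.LastOrDefault();
            if (last != null && Math.Abs(last[0].Y - fragment.Y) <= yTolerance)
                last.Add(fragment);                        // same baseline: same line
            else
                lines.Add(new List<Fragment> { fragment }); // new baseline: new line
        }
        return lines
            .Select(line => string.Join(" ", line.OrderBy(f => f.X).Select(f => f.Text)))
            .ToList();
    }

    public static void Main()
    {
        // Fragments in arbitrary order, as they might appear in a content stream.
        var fragments = new[]
        {
            new Fragment(300, 700, "Right column, line 1"),
            new Fragment(50, 686, "Left column, line 2"),
            new Fragment(50, 700, "Left column, line 1"),
            new Fragment(300, 686, "Right column, line 2"),
        };
        foreach (var line in ToLines(fragments))
            Console.WriteLine(line);
    }
}
```

With this two-column input, each output line contains the left-column text followed by the right-column text from the same baseline, so the columns come out interleaved. That is the multi-column problem raised in the question: recovering reading order needs an extra step, such as detecting wide horizontal gaps, that this sketch deliberately leaves out.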

richardtallent 2010-02-19 15:10:45

You can use "CZ-Pdf2Txt COM". It can extract text from PDF files and convert PDF files to delimited table text or CSV files.

This tool can preserve the original document layout and supports both a command line and a COM interface, so you can call it from your application. You can get the demo version and more information from http://www.convertzone.com/pdf2txtcom/help.htm

regards
flyaga

flyaga 2010-02-20 05:36:49

Is this more reliable than the one mentioned by the earlier poster (using the iTextSharp library)?

DotnetDude 2010-02-20 15:25:38

It handles text where each line is positioned absolutely (as in the DIV analogy above) without problems. You can download the demo version from http://www.convertzone.com/pdf2txtcom/demo/czpdf2txtcom.exe to try it.

flyaga 2010-02-22 04:06:26
