ansaurus

Question

Answer 1

+4 A:

You may take a look at this article. It's based on the excellent iTextSharp library.
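For reference, a minimal sketch of what page-by-page extraction might look like. This assumes iTextSharp 5.x, where `PdfTextExtractor` lives in the `iTextSharp.text.pdf.parser` namespace; it is not taken from the linked article, and the `PdfTextDump` class and `ExtractAllText` method names are hypothetical.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextDump
{
    // Hypothetical helper: concatenates the extracted text of every page.
    // Assumes iTextSharp 5.x is referenced by the project.
    public static string ExtractAllText(string path)
    {
        var reader = new PdfReader(path);
        try
        {
            var sb = new StringBuilder();
            // PDF page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
            return sb.ToString();
        }
        finally
        {
            reader.Close();
        }
    }
}
```

The quality of the result depends heavily on how the PDF was produced, so treat the output as a best effort rather than an exact copy of the visible text.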

Darin Dimitrov 2010-01-22 10:10:28

Yeah, this is the one I was using. It was pretty good, although not amazingly reliable. However, Tarydon's answer below explains why, and in fact it's probably the best I'm going to find! Cheers

Duncan Tait 2010-01-22 10:43:35

Answer 2

+2 A:

There may be some difficulty in doing this reliably. The problem is that PDF is a presentation format which attaches importance to good typography. Suppose you just wanted to output a single word: Tap.

A PDF rendering engine might output this as two separate calls, as shown in this pseudo-code:

moveto (x1, y); output ("T")
moveto (x2, y); output ("ap")

This would be done because the default kerning (inter-letter spacing) between the letters T and a might not be acceptable to the rendering engine, or because it is adding or removing some micro space between characters to get a fully justified line. The end result is that the text fragments actually stored in a PDF are very often not full words, but pieces of them.
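One common way to cope with this is to stitch fragments back together by looking at the horizontal gap between them. The following is only a sketch under stated assumptions: the `TextFragment` type, its fields, and the 1.5-unit threshold are all hypothetical, and the fragments are assumed to lie on the same baseline already. A real extractor would have to derive positions and widths from the PDF's font metrics.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical type: one positioned string, as emitted by a single output call.
public sealed class TextFragment
{
    public double X { get; set; }      // left edge of the fragment
    public double Width { get; set; }  // rendered width of the fragment
    public string Text { get; set; }
}

public static class FragmentMerger
{
    // Joins fragments that share a baseline, inserting a space only when the
    // horizontal gap between neighbours exceeds spaceThreshold (an assumed,
    // tunable value; small or negative gaps are treated as kerning).
    public static string MergeLine(IEnumerable<TextFragment> fragments, double spaceThreshold)
    {
        var sb = new StringBuilder();
        TextFragment previous = null;
        foreach (var current in fragments.OrderBy(frag => frag.X))
        {
            if (previous != null)
            {
                double gap = current.X - (previous.X + previous.Width);
                if (gap > spaceThreshold)
                {
                    sb.Append(' ');
                }
            }
            sb.Append(current.Text);
            previous = current;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var fragments = new List<TextFragment>
        {
            new TextFragment { X = 100.0, Width = 6.0,  Text = "T" },
            new TextFragment { X = 105.2, Width = 11.0, Text = "ap" },    // kerned: slightly overlaps "T"
            new TextFragment { X = 120.0, Width = 30.0, Text = "water" }  // clear gap: a real word break
        };
        Console.WriteLine(FragmentMerger.MergeLine(fragments, 1.5)); // prints "Tap water"
    }
}
```

A fixed threshold is a heuristic and will misjudge some documents, which is consistent with extraction being "not amazingly reliable" in practice.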

Tarydon 2010-01-22 10:16:39

Answer 3

A:

You can use "CZ-Pdf2Txt COM". It can extract text from PDF files and convert them to delimited table text or CSV files.

This tool can preserve the original document layout, and it supports both a command-line and a COM interface, so you can call it from your application. A demo version and more information are available at http://www.convertzone.com/pdf2txtcom/help.htm

regards
flyaga

flyaga 2010-02-22 04:01:01

Extracting text from PDFs in C#