I'm looking for a utility or library for extracting text from PDFs and formatting it as plain text while keeping as much of the original layout as possible (e.g. tables, columns, etc.).

We're currently using pdftotext but I was wondering if there's anything better. It has to be a command-line tool or a library we can link into our app.

Is pdftotext as good as it gets, or is there something better?

A: 

AbiWord had a SoC project for this a while back. IIRC, it did a pretty good job at recreating multicolumn documents, tables, and figures. There is a command line interface, as well.

eduffy
A: 

There are lots of tools to do text or XML extraction. There is a blog article at http://pdf.jpedal.org/java-pdf-blog/bid/12670/PDF-text which might help explain things.

A: 

For the benefit of others with the same problem: We ended up staying with pdftotext despite its drawbacks (like producing garbage output sometimes when font subsets are used).
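For anyone wiring pdftotext into an app, here is a minimal sketch of how it can be called from Python. It assumes the `pdftotext` binary is on the PATH; the helper names `build_pdftotext_command` and `extract_text` are made up for illustration. The `-layout` flag asks pdftotext to preserve the physical layout (columns, tables) as far as it can, and `-enc UTF-8` sets the output encoding.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, layout=True):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout: keep the original physical layout as closely as possible
        cmd.append("-layout")
    # -enc UTF-8: write the output as UTF-8
    cmd += ["-enc", "UTF-8", str(pdf_path), str(txt_path)]
    return cmd


def extract_text(pdf_path, txt_path, layout=True):
    """Run pdftotext and return the extracted text.

    Assumes pdftotext is installed and on the PATH; raises
    subprocess.CalledProcessError if the tool exits with an error.
    """
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, layout), check=True)
    # errors="replace" guards against the odd undecodable byte in the output
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

This does nothing about the font-subset garbage mentioned above; it only shows the invocation.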

See also: http://www.glyphandcog.com/textext.html

AndrewR
A: 

Part of the problem, I think, is that some of the simpler PDF creation/manipulation tools don't embed real text at all; they store each page as a static image inside the PDF. For those files you would have to use OCR.
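One rough way to spot such files is to look at what pdftotext returns: by default it separates pages with a form feed character, so a page that comes back with no visible characters is a likely candidate for OCR. The sketch below is only a heuristic under that assumption; the function name `pages_needing_ocr` and the `min_chars` threshold are made up for illustration.

```python
def pages_needing_ocr(extracted_text, min_chars=1):
    """Return 1-based page numbers whose extracted text looks empty.

    Assumes the text came from pdftotext with its default page breaks
    (a form feed, "\f", after each page).
    """
    pages = extracted_text.split("\f")
    # The output normally ends with a form feed, which leaves an empty tail
    if pages and pages[-1] == "":
        pages.pop()
    return [
        number
        for number, page in enumerate(pages, start=1)
        # Count only non-whitespace characters on the page
        if len("".join(page.split())) < min_chars
    ]
```

A page with a few stray characters can still be a scanned image, so raising `min_chars` may be worth trying on real documents.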

Roman A. Taycher