views: 82
answers: 4

Hello, can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in regions of the document known in advance, so the API will need to give us positional information for each element on the page.

We would like that data to be output in XML or JSON format. We're currently looking at PdfTextStream, which seems pretty good, but we would like to hear other people's experiences and suggestions. We're willing to pay, or to use open source, depending on how good the tool is.

Thanks for any ideas.

A: 

The best thing I can currently think of (among the "simple" tools) is Ghostscript (the current version is v8.71) together with the PostScript utility program ps2ascii.ps, which Ghostscript ships in its lib subdirectory. Try this (on Windows):

gswin32c.exe ^
   -q ^
   -sFONTPATH=c:/windows/fonts ^
   -dNODISPLAY ^
   -dSAFER ^
   -dDELAYBIND ^
   -dWRITESYSTEMDICT ^
   -dCOMPLEX ^
   -f ps2ascii.ps ^
   -dFirstPage=3 ^
   -dLastPage=7 ^
   input.pdf ^
   -dQUIET ^
   -c quit

This command processes pages 3-7 of input.pdf. Read the comments in the ps2ascii.ps file itself to see what the "weird" numbers and additional information mean (they indicate strings, positions, widths, colors, pictures, rectangles, fonts and page breaks...). To get a "simple" text output, replace -dCOMPLEX with -dSIMPLE.
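
For what it's worth, the same switches should work on Linux/Unix with the gs binary and backslash continuations. A sketch of the -dSIMPLE variant (the FONTPATH value is an assumption; point it at wherever your fonts live):

gs \
   -q \
   -sFONTPATH=/usr/share/fonts \
   -dNODISPLAY \
   -dSAFER \
   -dDELAYBIND \
   -dWRITESYSTEMDICT \
   -dSIMPLE \
   -f ps2ascii.ps \
   -dFirstPage=3 \
   -dLastPage=7 \
   input.pdf \
   -dQUIET \
   -c quit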

pipitas
As you would guess, this only outputs ASCII text. While free, it's not a great option for software that you plan to use with languages other than English.
userx
@userx: As you could guess, this is Free software, so the source code is available. It would be possible to extend it to support non-ASCII output...
pipitas
@userx: today I discovered 'TET', the Text Extraction Toolkit from pdflib.com. See my other answer.
pipitas
A: 

If the PDF contains structured text ( http://www.jpedal.org/PDFblog/?p=410 is a short tutorial on how to tell), you can get an almost perfect XML version. Otherwise, PDF can be very unstructured internally. I wrote a blog post explaining some of the issues with text extraction at http://www.jpedal.org/PDFblog/?p=277
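
As a quick command-line check for whether a given file is a tagged (structured) PDF, poppler's pdfinfo reports it directly (a sketch; assumes poppler-utils is installed):

pdfinfo input.pdf | grep -i tagged
# prints "Tagged: yes" for a structured (tagged) PDF, "Tagged: no" otherwise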

mark stephens
A: 

QuickPDF seems to be a capable library that should do what you want, at a reasonable price.

http://www.quickpdflibrary.com/ - They have a 30-day trial.

Andrew Cash
+1  A: 

As of today, I know it: the best thing for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.

PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".

TET's first incarnation is a library. That one can probably do everything Budda006 wanted, including positional information about every element on the page. Oh, and it can also extract images. It recombines images that have been fragmented into pieces.

pdflib.com also offers another incarnation of this technology, the TET plugin for Acrobat. And the third incarnation is the PDFlib TET iFilter, a standalone tool for user desktops. Both of these are free (as in beer) for private, non-commercial use.

And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) spit out only garbage.

I just tested the standalone desktop tool, and what they say on their webpage is true. It has a very good command line, and it handled some of my "problematic" PDF test files to my full satisfaction.
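
To give a flavor of the command line, invocations look something like the following (a sketch from memory; check the TET manual or tet --help for the exact options on your version):

tet input.pdf
# plain text extraction, writes input.txt

tet --tetml wordplus input.pdf
# writes input.tetml, an XML format that includes the coordinates of each word,
# which matches the original requirement for positional output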

From now on, this will be my recommendation for all sophisticated and challenging PDF text extraction requirements.

TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and the contents of each table cell separately. It deals very well with hyphenation: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...

Give it a try.

pipitas