Is there a free Java library for extracting text from PDF files that is compatible with Google App Engine?

I've read about PDFJet, but it can't read PDF files, can it?

Is there perhaps another way to extract text from a PDF? I tried http://www.pdfdownload.org/, but unfortunately it doesn't handle non-English characters correctly.

A: 

I know there is Apache PDFBox: http://pdfbox.apache.org/index.html

Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents.

but I've never tested it.

Pierre
+1  A: 

iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.

Kevin Day
iText uses certain classes (like java.awt.geom.AffineTransform) that are not available on GAE. See this page for more details: http://groups.google.com/group/google-appengine-java/web/will-it-play-in-app-engine
Miroslav Bajtoš
Hmm. The parser library certainly doesn't use AffineTransform (I actually implemented my own matrix transformations for the parser). I know that iText *supports* affine transforms when generating PDF files, but I doubt that they are required for parsing. Post the class and method that are causing problems on App Engine and I'll take a look.
Kevin Day
+1  A: 

PdfBox does not run on GAE because it uses Java classes that GAE does not allow.
(GAE only permits the classes listed here: http://code.google.com/appengine/docs/java/jrewhitelist.html)

I have partially modified a very old version of PdfBox (0.7.3) to be GAE compliant. Now I'm able to extract text from a PDF (a whole page or a rectangular area). I modified only the minimum needed for PDF text extraction, not the whole of PdfBox. :)
The idea was to remove references to java.awt.Rectangle and related classes by using my own "rectangle" class.
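
To give a rough idea of what such a replacement might look like, here is a minimal sketch. It is not the actual code from the modified PdfBox: the class name TextRegion, its fields, and the half-open edge convention are all assumptions made for illustration. It depends only on java.lang, so it needs nothing from java.awt.

```java
// Hypothetical stand-in for java.awt.Rectangle; the name and API are
// illustrative assumptions, not taken from PdfBox or its GAE port.
final class TextRegion {
    private final float x;      // left edge
    private final float y;      // top edge
    private final float width;
    private final float height;

    TextRegion(float x, float y, float width, float height) {
        if (width < 0 || height < 0) {
            throw new IllegalArgumentException("width and height must be non-negative");
        }
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // True when the point lies inside the region. The left and top edges
    // are inclusive, the right and bottom edges exclusive.
    boolean contains(float px, float py) {
        return px >= x && px < x + width
            && py >= y && py < y + height;
    }

    public static void main(String[] args) {
        TextRegion region = new TextRegion(10f, 10f, 100f, 50f);
        // A text extractor could keep only the characters whose position
        // falls inside the region.
        System.out.println(region.contains(15f, 15f));
        System.out.println(region.contains(5f, 15f));
    }
}
```

A text extractor limited to a rectangular area would then test each character position against such a region instead of calling into java.awt.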

More info: http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html

Fabrizio