What is the easiest way to get the text (words) of a PDF document as one long String or an array of Strings?

I have tried PDFBox, but it is not working for me.

+3  A: 

iText.

matt b
I thought iText didn't do text extraction? It's more for building PDFs. Can you post how you're using it?
Sam Barnum
iText does have text extraction. For example, the following code extracts the text from page 3 of the input PDF:

PdfTextExtractor parser = new PdfTextExtractor(new PdfReader("C:/Text.pdf"));
parser.getTextFromPage(3);
Kushal Paudyal
+1  A: 

JPedal and Multivalent also offer text extraction in Java, or you could call xpdf using Runtime.exec.
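
A minimal sketch of the xpdf route, not the answerer's own code. It assumes the xpdf command-line tool pdftotext is installed and on the PATH, and it uses ProcessBuilder in place of Runtime.exec because it takes the arguments as a list. The class name PdfToText and its method names are hypothetical, chosen for illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class PdfToText {
    /** Builds the command line; "-" as the output file makes pdftotext write to stdout. */
    public static List<String> buildCommand(String pdfPath) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    /** Runs pdftotext (assumed to be on the PATH) and returns the document text as one String. */
    public static String extractText(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfPath))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // pass tool errors through to our stderr
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            // Read all of stdout before waiting, so the child cannot block on a full pipe.
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    /** Splits extracted text on runs of whitespace to get an array of words. */
    public static String[] splitWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfToText <file.pdf>");
            return;
        }
        String text = extractText(args[0]);
        System.out.println(splitWords(text).length + " words extracted");
    }
}
```

extractText covers the "one long String" case from the question and splitWords the "array of Strings" case. Splitting on whitespace is a simplification and leaves punctuation attached to the words.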

mark stephens
+1  A: 

PDFBox barfs on many newer PDFs, especially those with embedded PNG images.

I was very impressed with PDFTextStream.

Sam Barnum
A: 

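Use iText. The following snippet, for example, extracts the text of page 3:

PdfTextExtractor parser = new PdfTextExtractor(new PdfReader("C:/Text.pdf"));
// Pages are numbered from 1; keep the returned text instead of discarding it.
String text = parser.getTextFromPage(3);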
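
The snippet above reads a single page, while the question asks for the whole document. The following is a rough sketch of looping over every page and joining the results. To keep it runnable without iText on the classpath, the page source is passed in as a function. The class PdfPages and its methods are hypothetical helpers, and the iText wiring appears only in a comment, on the assumption that the installed iText version exposes the same PdfTextExtractor API as the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class PdfPages {
    /** Collects the text of pages 1..pageCount (PDF pages are 1-based) into a list. */
    public static List<String> collectPages(int pageCount, IntFunction<String> pageText) {
        List<String> pages = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            pages.add(pageText.apply(page));
        }
        return pages;
    }

    /** Joins the per-page texts into one long String, separated by newlines. */
    public static String joinPages(List<String> pages) {
        return String.join("\n", pages);
    }

    public static void main(String[] args) {
        // Stand-in page source so the sketch runs on its own.
        // With iText (assumption: same API as the snippet above) the call would look like
        //   collectPages(reader.getNumberOfPages(), page -> parser.getTextFromPage(page))
        // with any checked exception declared by getTextFromPage handled inside the lambda.
        List<String> pages = collectPages(3, page -> "text of page " + page);
        System.out.println(joinPages(pages));
    }
}
```

The list from collectPages gives one String per page, and joinPages turns it into a single String.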

Kushal Paudyal