I need to convert a PDF to plain text (it's the "statement of votes" from our county registrar). The files are big (2000 pages or so) and mostly contain tables. Once I have the text, I'm going to parse it with a program I'm writing and put the data into a database. I've tried the 'Save as text' function in Adobe Reader, but the output isn't as precise as I'd like, especially when it comes to delimiting the table data as CSV. So, any recommendations for tools or Java libraries that would do the trick?

+3  A: 

Well, there is iText. I have only limited experience with it, but I think it can do what you want.

Actually, come to think of it, I don't think iText is meant for reading. I believe PDFBox handles the reading part. Its site lists "PDF to text extraction" as its top feature.

EDIT: In fact, there's an ExtractText class specifically for this. And there's also a PDFBox Text Extraction Guide, too!

Michael Myers
iText can do some reading, I think, but there may be better tools for that (PDFBox, as you mentioned, perhaps).
Knobloch
OK, just tried this out. It worked pretty well on the table data; however, the column headers were messed up, probably because they are vertically aligned text.
Gary Kephart
A: 

Use a text (line) printer to print to file.

dirkgently
+1  A: 

I have always found the xpdf tools very useful.

We successfully use its PDF-to-text converter (pdftotext) to convert PDF business documents for use in EDI. The option to preserve layout keeps the text positioned well enough to parse in a program.

Jarod Elliott
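Since the question asks about Java, here is a minimal sketch of driving that conversion from a Java program. It assumes the xpdf `pdftotext` binary is installed and on the PATH; the class name, method names, and file names are hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

final class PdfToTextRunner {

    // Builds the command line: pdftotext -layout <input.pdf> <output.txt>
    // "-layout" asks pdftotext to keep the original physical layout of the page.
    static List<String> buildCommand(Path pdf, Path txt) {
        return List.of("pdftotext", "-layout", pdf.toString(), txt.toString());
    }

    // Runs pdftotext and returns its exit code (0 means success).
    // Assumption: the pdftotext executable can be found on the PATH.
    static int run(Path pdf, Path txt) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, txt))
                .inheritIO() // pass any error messages from pdftotext through to the console
                .start();
        return process.waitFor();
    }
}
```

Calling `PdfToTextRunner.run(Paths.get("votes.pdf"), Paths.get("votes.txt"))` would then write the layout-preserved text to `votes.txt`, ready for a separate parsing step.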
This worked well for me. The -layout flag helped keep the tables in a usable format in the text file.
Tim Perry
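Once the tables survive as aligned columns in the text file, one possible way to get CSV is to treat runs of two or more spaces as column gaps. This is only a sketch under that assumption: the class and method names are hypothetical, the two-space threshold may need tuning for a given document, and a row with an empty cell will come out with its columns shifted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class LayoutColumns {

    // Assumption: two or more consecutive spaces separate columns,
    // while a single space belongs to the cell text (e.g. "PRECINCT 101").
    private static final Pattern COLUMN_GAP = Pattern.compile(" {2,}");

    // Splits one layout-preserved line into its cells.
    static List<String> splitColumns(String line) {
        List<String> cells = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return cells; // blank line: no cells
        }
        for (String cell : COLUMN_GAP.split(trimmed)) {
            cells.add(cell);
        }
        return cells;
    }

    // Joins cells into one CSV record, quoting cells that contain
    // a comma, a double quote, or a line break.
    static String toCsv(List<String> cells) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            String cell = cells.get(i);
            if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
                sb.append('"').append(cell.replace("\"", "\"\"")).append('"');
            } else {
                sb.append(cell);
            }
        }
        return sb.toString();
    }
}
```

For a made-up line such as `PRECINCT 101    1,204     583`, `splitColumns` yields three cells, and `toCsv` quotes the middle one because it contains a comma.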
A: 

There's a list of possible solutions here:

Java libraries to read and write PDF files.

nzpcmad
A: 

I use iText and I've been really happy with it. I've used xmlpdf before, and iText is far superior in my opinion.

SacramentoJoe
A: 

Without knowing the layout of the pages in your PDF it is difficult to say.

I would suggest downloading and trying both iText and PDFBox. You will find text extraction examples for both on their websites; you should have an extractor running in under 30 minutes, assuming you know your way around Java.

Start with PDFBox, as its text extraction abilities are better than iText's.

Someone else has mentioned xpdf, and that may be useful for you. It's a C library with some command-line tools built around it. It has a number of text extractors, and you may be able to format the output easily enough. Again, it really depends on your page layout.

Steve Claridge
A: 

PDFTextStream is our Java + .NET library for extracting content from PDF documents; you might give it a shot. It also provides some rudimentary table data extraction utilities, which sit on top of PDFTextStream's table detection capabilities. It's by no means a general solution (though we're working on one of those, too!), but if the tabular data is clearly defined (e.g. rows and columns bounded by lines), then you may find that what's there now does the job.

Chas Emerick