PDF to text tool or Java library?

tags: java, pdf

views: 1006

+1 Q:

I need to convert a PDF to normal text (it's the "statement of votes" from our county registrar). The files are big (2000 pages or so) and mostly contain tables. Once I get it into text, then I'm going to use a program I'm writing to parse it and put the data into a database. I've tried the 'Save as text' function in Adobe Reader, but it is not as precise as I'd like it, especially in delimiting the table data into CSV. So, any recommendations for tools or Java libraries that would do the trick?

+3 A:

Well, there is iText. I have only limited experience with it, but I think it can do what you want.

Actually, come to think of it, I don't think iText was for reading. I believe PDFBox handled the reading part. Its site does mention "PDF to text extraction" as its top feature.

EDIT: In fact, there's an ExtractText class specifically for this. And there's also a PDFBox Text Extraction Guide, too!

Michael Myers 2009-02-24 21:11:14

iText can do some reading, I think, but there may be better tools for that (PDFBox, as you mentioned, perhaps)...

Knobloch 2009-02-24 21:14:40

OK, just tried this out. It worked pretty well on the table data. However, the column headers were messed up, probably because they are vertically aligned text.

Gary Kephart 2009-02-24 23:22:53

Use a text (line) printer to print to file.

dirkgently 2009-02-24 21:11:55

+1 A:

I have always found the xpdf tools very useful.

We successfully use its pdftotext tool to convert PDF business documents for use in EDI. The option to preserve layout keeps things positioned well enough for parsing in a program.

Jarod Elliott 2009-02-24 21:14:40
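
Since the asker plans to parse the output from a Java program, here is a minimal sketch of driving the xpdf command-line tool from Java. It assumes the pdftotext executable is installed and on the PATH; the class and method names (PdfToTextRunner, buildCommand, convert) are made up for illustration, and -layout is the preserve-layout option mentioned above.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: shells out to xpdf's pdftotext (assumed to be on the PATH).
final class PdfToTextRunner {

    // Builds the argument list for: pdftotext -layout <input.pdf> <output.txt>
    static List<String> buildCommand(Path pdf, Path txt) {
        return List.of("pdftotext", "-layout", pdf.toString(), txt.toString());
    }

    // Runs pdftotext and returns its exit code (0 means success).
    static int convert(Path pdf, Path txt) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, txt))
                .inheritIO() // let pdftotext's own messages reach the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToTextRunner <input.pdf> <output.txt>");
            return;
        }
        int exit = convert(Path.of(args[0]), Path.of(args[1]));
        System.out.println("pdftotext exited with code " + exit);
    }
}
```

For a 2000-page file it is probably simpler to let pdftotext write the whole text file once and then parse that file separately, rather than re-running the conversion on every parser change.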

This worked well for me. The -layout flag helped keep the tables in a usable format in the text file.

Tim Perry 2010-07-07 23:14:38
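
Once the tables are in a layout-preserved text file, turning each row into CSV can be fairly mechanical. The sketch below is only one possible approach and rests on an assumption that may not hold for every document: columns are separated by at least two spaces, while text inside a cell never contains two spaces in a row. An empty cell in the middle of a row would shift the later columns left, so fixed column offsets may be the safer choice for sparse tables. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: converts one line of layout-preserved text to a CSV line.
final class LayoutToCsv {

    // Splits a line into cells on runs of two or more whitespace characters.
    static List<String> splitColumns(String line) {
        List<String> cells = new ArrayList<>();
        for (String cell : line.trim().split("\\s{2,}")) {
            cells.add(cell);
        }
        return cells;
    }

    // Wraps a cell in double quotes if it contains a comma, quote, or line break,
    // doubling any embedded quotes.
    static String quote(String cell) {
        if (cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Converts one text line into one CSV line.
    static String toCsvLine(String line) {
        List<String> quoted = new ArrayList<>();
        for (String cell : splitColumns(line)) {
            quoted.add(quote(cell));
        }
        return String.join(",", quoted);
    }

    public static void main(String[] args) {
        // Made-up sample row; real rows would come from the converted text file.
        System.out.println(toCsvLine("  Precinct 12    1,204     85  "));
    }
}
```

The quoting step matters for vote counts printed with thousands separators, since a bare "1,204" would otherwise be read as two CSV fields.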

There's a list of possible solutions here:

Java libraries to read and write PDF files.

nzpcmad 2009-02-24 21:19:09

I use iText and I've been really happy with it. I've used xmlpdf before, and iText is far superior in my opinion.

SacramentoJoe 2009-02-24 23:25:37

Without knowing the layout of the pages in your PDF it is difficult to say.

I would suggest downloading and trying both iText and PDFBox. You will find text extraction examples for both on their websites. You should have an extractor running in under 30 minutes, assuming you know your way around Java.

Start with PDFBox, as its text extraction abilities are better than iText's.

Someone else has mentioned xpdf, and that may be useful for you. It's a C++ library with some command-line tools built around it. It has a number of text extractors, and you may be able to format the output easily enough. Again, it really depends on your page layout.

Steve Claridge 2009-02-24 23:58:12

PDFTextStream is our Java + .NET library for extracting content from PDF documents; you might give it a shot. Additionally, it does provide some rudimentary table data extraction utilities, which sit on top of PDFTextStream's table detection capabilities. It's by no means a general solution (though we're working on one of those, too!), but if the tabular data is clearly defined (e.g. rows and columns bounded by lines, etc), then you may find what's there now a proper solution.

Chas Emerick 2009-12-07 13:47:56
