I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).

So far I've found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which was unable to read most of my files. Both libraries do provide exactly the functionality I was looking for, though.

My question: Have I missed something? Is there a tool that is better suited (faster and more reliable) to solve my problem?

A: 

Here are some options:

http://en.wikipedia.org/wiki/List_of_PDF_software

From that link, and from searching SourceForge, there are a couple of command-line utilities that might do what you want, like this one: http://pdftohtml.sourceforge.net/

Depending on your requirements and what the PDFs look like, you could look at using the Google Docs API (uploading the PDF and then downloading it as text), or could also try something like gocr. I've had a lot of luck parsing image text with gocr in the past, and you'd just have to bounce out to the shell to do it, like gocr -i whatever.pdf (I think it works with PDFs).

The downside to all of these is that they're not pure-Ruby implementations, but lots of the good (and free) OCR projects seem to be done that way.

Terry
Why would I need OCR ("optical character recognition") to read a PDF that doesn't consist of scanned text? Wouldn't that needlessly slow down the whole process?
Javier
No. OCR is the process of converting images to text. PDF readers and PDF toolkits utilize this concept to convert an image (the same that is output from, say, a scanner) to text.
Terry
So basically you're saying that all text inside a PDF consists of an image that needs to be recognized as text first?
Javier
I'd be lying if I said yes or no. I just know I've had success with OCRs.
Terry
+1  A: 

You could use JRuby and a Java PDF parsing library such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.

insane.dreamer
Or iText, http://www.lowagie.com/iText/.
James McMahon
That sounds like an interesting alternative. Have you seen an implementation or an example somewhere?
Javier
@nemo: iText? I'm trying to read PDFs, not generate them.
Javier
A: 

After trying different methods, I'm using PDF-Toolkit now. It's quite old, but it's fast, stable and reliable. Besides, it really doesn't need to be new, because it just wraps the xpdf commandline utilities.
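For anyone who wants to skip the gem, here is a minimal sketch of what such a wrapper boils down to. It assumes the pdftotext binary (from xpdf or Poppler) is on the PATH. The helper names are made up for illustration, and this is not PDF-Toolkit's actual code.

```ruby
require "open3"

# Build the argument list for pdftotext (hypothetical helper).
# "-" as the output file makes pdftotext write the text to stdout.
def pdftotext_command(path, first_page: nil, last_page: nil)
  cmd = ["pdftotext", "-enc", "UTF-8"]
  cmd += ["-f", first_page.to_s] if first_page
  cmd += ["-l", last_page.to_s] if last_page
  cmd + [path, "-"]
end

# Run pdftotext and return the extracted text; raise if it exits non-zero.
# Arguments are passed as an array, so no shell is involved and
# file names containing spaces need no quoting.
def extract_text(path, **page_range)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(path, **page_range))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end

# pdftotext separates pages with a form feed character ("\f"),
# so the output can be split back into one string per page.
def split_pages(text)
  text.split("\f")
end
```

Something like split_pages(extract_text("report.pdf", first_page: 1, last_page: 50)) would then give one string per page. For very large files, extracting a page range at a time keeps memory use down. No OCR is involved here; pdftotext reads the text that is already embedded in the PDF.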

Javier
A: 

If you just need to get the text content out of a PDF file, pdftohtml (on SourceForge) is efficient. It is not suited for dealing with images.

Alexis Perrier
+1  A: 

You might find Docsplit useful:

Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)

pw
Javier: do take a look at Docsplit. It wraps the Apache PDFBox library for text extraction, because we've had better quality results with PDFBox than with pdftotext.
jashkenas