Ruby: Reading PDF files

views:

2435

+6 Q:

I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).

So far I've found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which was unable to read most of my files, even though both libraries offer exactly the functionality I was looking for.

My question: Have I missed something? Is there a tool that is better suited (faster and more reliable) to solve my problem?

Here are some options:

http://en.wikipedia.org/wiki/List_of_PDF_software

From that link, and from searching SourceForge, there are a couple of command-line utilities that might do what you want, like this one: http://pdftohtml.sourceforge.net/

Depending on your requirements and what the PDFs look like, you could look at using the Google Docs API (uploading the PDF and then downloading it as text), or could also try something like gocr. I've had a lot of luck parsing image text with gocr in the past, and you'd just have to bounce out to the shell to do it, like gocr -i whatever.pdf (I think it works with PDFs).

The downside to all of these is that they're not pure-Ruby implementations, but lots of the good (and free) OCR projects seem to be done that way.

Terry 2009-04-21 19:14:09

Why would I need OCR ("optical character recognition") to read a PDF that doesn't consist of scanned text? Wouldn't that needlessly slow down the whole process?

Javier 2009-04-25 00:14:22

No. OCR is the process of converting images to text. PDF readers and PDF toolkits utilize this concept to convert an image (the same that is output from, say, a scanner) to text.

Terry 2009-04-25 04:00:59

So basically you're saying that all text inside a PDF consists of an image that needs to be recognized as text first?

Javier 2009-04-25 22:58:37

I'd be lying if I said yes or no. I just know I've had success with OCRs.

Terry 2009-04-26 01:30:51

+1 A:

You could use JRuby and a Java PDF parsing library such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.

insane.dreamer 2009-04-21 21:19:36

Or iText, http://www.lowagie.com/iText/.

James McMahon 2009-04-25 00:12:55

That sounds like an interesting alternative. Have you seen an implementation or an example somewhere?

Javier 2009-04-25 00:15:29

@nemo: iText? I'm trying to read PDFs, not generate them.

Javier 2009-04-25 00:16:44

After trying different methods, I'm using PDF-Toolkit now. It's quite old, but it's fast, stable and reliable. Besides, it really doesn't need to be new, because it just wraps the xpdf command-line utilities.

Javier 2009-04-27 12:47:27
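
For illustration, the wrapper approach described in the comment above can be sketched in a few lines of plain Ruby. This is a minimal sketch, not PDF-Toolkit's actual code: it assumes the pdftotext binary (from xpdf or poppler-utils) is on the PATH, and the method names pdftotext_command and extract_pdf_text are made up for this example.

```ruby
require "open3"

# Build the argument list for the pdftotext command-line tool.
# Passing "-" as the output file asks pdftotext to write to standard output.
# (Method name is hypothetical; it is not part of any library.)
def pdftotext_command(pdf_path, layout: false)
  cmd = ["pdftotext", "-enc", "UTF-8"]
  cmd << "-layout" if layout # try to preserve the physical page layout
  cmd + [pdf_path.to_s, "-"]
end

# Run pdftotext and return the extracted text as a String.
# Raises Errno::ENOENT if the binary is not installed, and a RuntimeError
# if pdftotext exits with a non-zero status.
def extract_pdf_text(pdf_path, layout: false)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, layout: layout))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end
```

Calling extract_pdf_text("whatever.pdf") should then return the document's text. Because the arguments are passed as an array, no shell is involved, so file names with spaces or quotes need no escaping.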

If you just need to get the text content out of a PDF file, pdftohtml at SourceForge is efficient. It is not suited for dealing with images.

Alexis Perrier 2010-02-12 10:16:22

+1 A:

You might find Docsplit useful:

Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)

pw 2010-03-03 13:49:19

Javier: do take a look at Docsplit. It wraps the Apache PDFBox library for text extraction -- because we've had better quality results with PDFBox than pdftotext.

jashkenas 2010-06-15 13:57:26
