views:

90

answers:

5

I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.

Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:

Name | Address | Cash Reported | Year Reported | Holder Name

Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.

Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?

UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.

+1  A: 

At the very least, could anyone point me to a Ruby PDF library for this task?

If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.

I'm not sure if any of these solutions is actually suitable for your problem, especially that you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.

Yaser Sulaiman
A: 

Maybe the Prawn ruby library? link text

Nisanio
Nope: Prawn is a PDF *writing* library.
Yaser Sulaiman
A: 

This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text in predetermined locations. It may not be possible to determine what are rows and what are columns, but it may depend on the PDF itself.

The java libraries are the most robust, and may do more than just extract text. So I would look into JRuby and iText or PDFbox.

Mark Thomas
A: 

Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410

If not, you will need to build it.

mark stephens
+1  A: 

I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?

Before trying to parse and extract text from such monster files programmatically (even if it's 200 MByte only -- for simple text in tables that's huuuuge, unless you have 200000 pages...), I would proceed like this:

  1. Try to sanitize the file first by re-distilling it.
  2. Try with different CLI tools to extract the text into a .txt file.

This is a matter of minutes. Writing a Ruby program to do this certainly is a matter of hours, days or weeks (depending on your knowledge about the PDF fileformat internals... I suspect you don't have much experience of that yet).

If "2." works, you may halfway be done already. If it works, you also know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.

Sanitize the 'Monster.pdf':

I suggest to use Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.

gswin32c.exe ^
  -o Monster-PDF-sanitized ^
  -sDEVICE=pdfwrite ^
  -f Monster.pdf

(I'm curious how much that single command will make your output PDF shrink if compared to the input.)

Extract text from PDF:

I suggest to first try pdftotext.exe (from the XPDF folks). There are other, a bit more inconvenient methods available too, but this might do the job already:

pdftotext.exe ^
   -f 1 ^
   -l 10 ^
   -layout ^
   -eol dos ^
   -enc Latin1 ^
   -nopgbrk ^
   Monster-PDF-sanitized.pdf ^
   first-10-pages-from-Monster-PDF-sanitized.txt

This will not extract all pages but only 1-10 (for proof of concept, to see if it works at all). To extract from every page, just leave off the -f 1 -l 10 parameter. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).

If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses "custom encoding vector") you should ask a new question, describing the details of your findings so far. Then you need to resort bigger calibres to shoot down the problem.

pipitas