Hi All,

Any recommendations as to which is the best PDF reading library/gem (free/open source, of course) in Ruby?

I found a list at http://rubyforge.org/search/?type_of_search=soft&words=PDF&Search=Search but want to tap people's experience to filter it.

I mainly want to read input PDF files, extract the text within, parse it, and structure it into the schema I need.

Cheerio, mataal.

+1  A: 

I don't think there is a native Ruby way to do this. But you could use a utility like pdftohtml to convert the PDF to XML or HTML, then use Hpricot to parse from there. Unfortunately, PDFs have a very location-based layout rather than a flowing layout (like standard HTML), which makes them difficult to parse even after such a conversion.
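To illustrate the location-based layout problem, here is a minimal sketch of one way to rebuild text lines from converted output. It assumes the XML came from something like `pdftohtml -xml`, with `<page>` elements containing `<text top="..." left="...">` fragments. It uses REXML from Ruby's standard distribution rather than Hpricot, and the helper names (`inner_text`, `lines_from_pdf_xml`) and the `tolerance` parameter are made up for this example.

```ruby
require 'rexml/document'

# Collect all text inside a node, including text nested in <b>, <i>, etc.
def inner_text(node)
  node.children.map do |child|
    case child
    when REXML::Text then child.value
    when REXML::Element then inner_text(child)
    else ''
    end
  end.join
end

# Group positioned fragments into lines, page by page.
# Fragments whose "top" values differ by at most `tolerance` units
# are treated as one line (an assumption; tune it per document).
def lines_from_pdf_xml(xml, tolerance: 2)
  doc = REXML::Document.new(xml)
  result = []

  doc.root.each_element do |page|
    next unless page.name == 'page'

    fragments = []
    page.each_element do |el|
      next unless el.name == 'text'
      fragments << {
        top:  el.attributes['top'].to_i,
        left: el.attributes['left'].to_i,
        text: inner_text(el).strip
      }
    end

    lines = []
    fragments.sort_by { |f| [f[:top], f[:left]] }.each do |frag|
      if lines.any? && (frag[:top] - lines.last[:top]).abs <= tolerance
        lines.last[:frags] << frag
      else
        lines << { top: frag[:top], frags: [frag] }
      end
    end

    # Within a line, order fragments left to right.
    lines.each do |line|
      result << line[:frags].sort_by { |f| f[:left] }.map { |f| f[:text] }.join(' ')
    end
  end

  result
end
```

Even with this kind of grouping, multi-column pages and tables usually need extra, document-specific rules, which is where most of the effort tends to go.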

Tristan Havelick
A: 

Dr.Fred is right: don't be afraid to use backticks where required.
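As a rough sketch of the backtick approach, the snippet below shells out to the `pdftotext` command-line tool (from Xpdf/Poppler), assuming it is installed and on the PATH. The method names `pdftotext_command` and `extract_text` are just for illustration.

```ruby
require 'shellwords'

# Build the command line; the path is escaped so spaces and
# shell metacharacters in file names are safe.
# "-layout" keeps the physical layout, "-" writes to stdout.
def pdftotext_command(path)
  "pdftotext -layout #{Shellwords.escape(path)} -"
end

# Run the command with backticks and return the extracted text.
# Raises if pdftotext exits with a non-zero status.
def extract_text(path)
  output = `#{pdftotext_command(path)}`
  raise "pdftotext failed for #{path}" unless $?.success?
  output
end

# Usage (needs a real PDF and pdftotext installed):
#   text = extract_text('timetable.pdf')
#   text.each_line { |line| puts line }
```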

For an idea of the complexity required, look at this guy's script to parse every British Rails Timetable and keep in mind: this is one script on one collection of uniform and well-formatted documents.

mrflip
+2  A: 

This looks ok.

http://github.com/yob/pdf-reader/tree/master

railsninja