I need to extract tables from PDF documents programmatically, preferably using Perl. I can cut and paste them into Excel, but the table then needs quite a bit of manual editing once the data is imported.

I've done some searching, but so far most forums suggest that the available APIs are very primitive.

+2  A: 

The best module I know of for dealing with PDFs in Perl is PDF::API2. However, without knowing more about the manipulation you need to do, it's hard to give a further recommendation. Another possibility is to use Excel's built-in VB functionality, so that when you copy the tables into your Excel spreadsheet it fires off a macro that performs the formatting for you.

stocherilac
All I need is to process the text that is in the table, keeping in mind that a cell may contain empty fields, multiple lines, spaces, commas, etc. If I cut and paste, that makes it hard to decide which delimiter to tell Excel to use.
Face
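On the delimiter problem raised in the comment above: one common way around it is to quote the cells rather than hunt for a safe delimiter. The following is a minimal sketch in core Perl, not something taken from either answer. The names csv_field and csv_row and the sample row are made up for illustration. It wraps any cell containing a comma, a double quote, or a line break in double quotes and doubles embedded quotes, which is the convention Excel generally accepts when opening a .csv file.

```perl
use strict;
use warnings;

# Quote one cell in the usual CSV style: wrap it in double quotes when it
# contains a comma, a double quote, or a line break, and double any
# embedded quotes. An undefined (empty) cell becomes an empty string.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    if ($value =~ /[",\r\n]/) {
        $value =~ s/"/""/g;
        return qq{"$value"};
    }
    return $value;
}

# Join a list of cell values into one CSV record (no trailing newline).
sub csv_row {
    my @cells = @_;
    return join ',', map { csv_field($_) } @cells;
}

# Hypothetical table row: an empty cell, a multi-line cell, and a cell
# that contains a comma.
print csv_row('Widget', undef, "line one\nline two", 'red, large'), "\n";
```

For anything beyond a quick script, the Text::CSV module on CPAN handles the same quoting rules and more edge cases.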
+1  A: 

I think the best CPAN module for this would probably be CAM::PDF.

However, I've not used the module, so I cannot confirm it will (easily) do what you require, but it is a PDF manipulation library and the module's author does answer questions about CAM::PDF here on SO.

Also see this previous question: How can I extract text from a PDF file in Perl?

/I3az/

draegtun
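For the CAM::PDF route suggested above, a rough sketch of pulling page text and splitting it into cells might look like the following. It uses CAM::PDF's new, numPages and getPageText methods. The helper names split_columns and extract_rows are hypothetical, and so is the key assumption that columns in the extracted text are separated by runs of two or more spaces. That depends entirely on how the PDF was produced and may well not hold, so treat this as a starting point rather than a working table extractor.

```perl
use strict;
use warnings;

# Split one line of extracted text into cells, treating a run of two or
# more whitespace characters as a column gap (an assumption about the
# layout of the extracted text, not something CAM::PDF guarantees).
sub split_columns {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    return split /\s{2,}/, $line;
}

# Read the text of every page with CAM::PDF and return a list of array
# refs, one per non-blank line, each holding that line's cells.
sub extract_rows {
    my ($file) = @_;
    require CAM::PDF;    # CPAN module, loaded only when actually needed
    my $pdf = CAM::PDF->new($file) or die "Cannot open $file\n";
    my @rows;
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        next unless defined $text;
        for my $line (split /\n/, $text) {
            next unless $line =~ /\S/;
            push @rows, [ split_columns($line) ];
        }
    }
    return @rows;
}

# Usage: perl extract_table.pl report.pdf   (file name is a placeholder)
if (@ARGV) {
    print join("\t", @$_), "\n" for extract_rows($ARGV[0]);
}
```

Cells that span several lines, or that are empty, will not survive this kind of whitespace splitting, so some manual checking of the output is still likely to be needed.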