How can I extract the first paragraph of a PDF document using Perl's CAM::PDF?

A: 

PDF is not a markup language: text is drawn at specific coordinates on the page, so the file itself has no notion of a "paragraph". There is something called Tagged PDF, which records logical structure, and if your documents are tagged, your job might be easier.

If the text in your PDF is stored as text and not as images, I would be inclined to run the documents through a PDF-to-text converter and grab the first chunk of text from the output.
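A minimal sketch of that approach, assuming the `pdftotext` command-line tool (from Poppler or Xpdf) is installed and on the PATH. The helper names `pdf_to_text` and `first_chunk` are made up for illustration, and treating "blank-line-delimited block" as "paragraph" is only a heuristic: depending on the document, the first chunk may well be a title or a running header.

```perl
use strict;
use warnings;

# Return the first blank-line-delimited chunk of $text with internal
# whitespace collapsed, or an empty string if there is no text.
sub first_chunk {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;      # normalise line endings
    $text =~ s/\f/\n\n/g;       # treat form feeds (page breaks) as paragraph breaks
    for my $chunk (split /\n[ \t]*\n/, $text) {
        $chunk =~ s/\s+/ /g;    # join wrapped lines into one line
        $chunk =~ s/^ | $//g;   # trim the single leading/trailing space left over
        return $chunk if length $chunk;
    }
    return '';
}

# Run pdftotext on page 1 of $path and return its output.
# The list form of open avoids passing the file name through a shell.
sub pdf_to_text {
    my ($path) = @_;
    open my $fh, '-|', 'pdftotext', '-f', '1', '-l', '1', '-enc', 'UTF-8', $path, '-'
        or die "Cannot run pdftotext: $!";
    binmode $fh, ':encoding(UTF-8)';
    local $/;                   # slurp mode: read everything at once
    my $text = <$fh>;
    close $fh or die "pdftotext failed with status $?";
    return $text // '';
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print first_chunk(pdf_to_text($ARGV[0])), "\n";
}
```

Run it as `perl first_chunk.pl file.pdf` (the script name is arbitrary). If the PDF contains only scanned images, `pdftotext` will produce little or no text and this prints an empty line.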

Sinan Ünür
A: 
print CAM::PDF->new('file.pdf')->getPageText(1);

will get you all of the text from page 1. However, CAM::PDF is definitely not the best tool for this particular job (I'm the author). I added text extraction on a whim, just to see if I could do it.
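If you do want to stay within CAM::PDF, here is a slightly more defensive sketch built on the same `getPageText` call plus `numPages`. The helper name `first_page_with_text` is hypothetical, and the final split assumes that paragraphs come out separated by blank lines, which depends entirely on how the PDF was produced and may not hold for your files.

```perl
use strict;
use warnings;

# Return the text of the first page that yields any non-whitespace text,
# or undef if no page does. $pdf is any object offering numPages() and
# getPageText($n), as CAM::PDF does.
sub first_page_with_text {
    my ($pdf) = @_;
    for my $n (1 .. $pdf->numPages) {
        my $text = $pdf->getPageText($n);
        return $text if defined $text && $text =~ /\S/;
    }
    return;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    require CAM::PDF;    # loaded lazily so the helper above has no hard dependency
    my $pdf = CAM::PDF->new($ARGV[0])
        or die "Cannot open $ARGV[0]\n";
    my $text = first_page_with_text($pdf)
        // die "No extractable text found\n";
    # Assumption: paragraphs are separated by blank lines in the extracted text.
    my ($first) = grep { /\S/ } split /\n\s*\n/, $text;
    print "$first\n";
}
```

Scanning forward past empty pages helps with documents whose first page is a cover image, but it does nothing for the underlying limitation: the extracted text follows drawing order, not reading order.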

Chris Dolan