How can I extract the first paragraph of a PDF document using Perl's CAM::PDF?
views: 155
answers: 2
A:
Plain PDF is not a markup language: text is drawn at specific positions on the page, so the file itself has no notion of paragraphs. There is a variant called Tagged PDF, and if your documents are tagged, your job might be easier.
If the text is stored as text in your PDF rather than as images, I would run the documents through a PDF-to-text converter and grab the first chunk of text from the output.
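A minimal sketch of that approach in Perl, assuming the `pdftotext` utility (from Poppler or Xpdf) is installed and on the PATH. The helper names `first_chunk` and `pdf_first_page_text` are made up for illustration, and treating blank lines as paragraph breaks is only a heuristic, so a title or page header may come out first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first non-empty block of text, treating blank lines as
# paragraph separators. This is a heuristic: PDF has no real notion
# of paragraphs, so the "first chunk" may be a title or a header.
sub first_chunk {
    my ($text) = @_;
    for my $block (split /\n\s*\n/, $text) {
        $block =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next if $block eq '';
        $block =~ s/\s*\n\s*/ /g;    # join wrapped lines into one
        return $block;
    }
    return;                          # nothing but whitespace
}

# Run pdftotext on page 1 only; "-" sends the output to stdout.
sub pdf_first_page_text {
    my ($file) = @_;
    open my $fh, '-|', 'pdftotext', '-f', 1, '-l', 1, $file, '-'
        or die "Cannot run pdftotext: $!\n";
    local $/;                        # slurp the whole output
    my $text = <$fh>;
    close $fh or die "pdftotext failed on $file\n";
    return $text;
}

# Only touch the external tool when a file name is given.
if (@ARGV) {
    my $chunk = first_chunk(pdf_first_page_text($ARGV[0]));
    print defined $chunk ? "$chunk\n" : "No text found (scanned images?)\n";
}
```

Saved under a name of your choosing, this would be run as `perl first_chunk.pl file.pdf`. If the PDF contains only scanned images, `pdftotext` produces no usable text and the script says so.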
Sinan Ünür
2009-10-23 15:03:25
+1
A:
use CAM::PDF;
print CAM::PDF->new('file.pdf')->getPageText(1);
will get you all of the text from the first page. However, CAM::PDF is definitely not the best tool for this particular job (I'm the author). I added text extraction on a whim, just to see if I could do it.
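A slightly more defensive sketch of the same call, for cases where the file may fail to open or the first page holds no extractable text. The helper `first_page_with_text` is a hypothetical name introduced here, and skipping ahead to the first page with text is an assumption about what is wanted, not something CAM::PDF does on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Walk pages 1..$num_pages and return (page number, text) for the first
# page whose extracted text contains a non-whitespace character.
# $get_text is a code ref, so the loop does not depend on CAM::PDF.
sub first_page_with_text {
    my ($num_pages, $get_text) = @_;
    for my $n (1 .. $num_pages) {
        my $text = $get_text->($n);
        return ($n, $text) if defined $text && $text =~ /\S/;
    }
    return;    # no page produced any text
}

if (@ARGV) {
    no warnings 'once';    # $CAM::PDF::errstr is only named once here
    require CAM::PDF;      # loaded lazily; available from CPAN
    my $pdf = CAM::PDF->new($ARGV[0])
        or die "Cannot open $ARGV[0]: $CAM::PDF::errstr\n";
    my ($page, $text) = first_page_with_text(
        $pdf->numPages(), sub { $pdf->getPageText($_[0]) });
    die "No extractable text found\n" unless defined $page;
    print "Text from page $page:\n$text\n";
}
```

The returned text is still the whole page, so picking out a first paragraph would need a further heuristic applied to it.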
Chris Dolan
2009-10-28 02:46:24