I have a "searchable PDF", i.e. a PDF of scanned page images with an invisible but selectable text layer. (When I open this file in Acrobat, I am alerted: "You are viewing this document in PDF/A mode.")

I need to extract the bounding rectangle of each word in this document. Which toolkits can do this, and which of their methods give access to the bounding boxes of the "invisi-text" words?

I would prefer tools in Java, but I appreciate any suggestions.

A: 

Check out the iText library: http://www.lowagie.com/iText/

Richard
iText is primarily for generating PDF documents. I don't see anything in the API for extracting bounding box information for text in loaded PDFs.
jedierikb
Yep, you are correct. Sorry about the bum steer. Perhaps http://support.idrsolutions.com/default.asp?W17 is a better bet?
Richard
A: 

Acrobat's JavaScript libraries look to be the most straightforward, especially:

getPageNthWordQuads

which works on a "searchable pdf".

It would be nice if the Acrobat JavaScript library were available as Java calls...
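
Once the quads have been exported from Acrobat (for example, dumped to a file by a script), turning them into plain bounding rectangles on the Java side is simple. Below is a minimal sketch, assuming each quad arrives as 8 numbers (four x/y corner pairs) in PDF user space and that a word may come back with more than one quad. The class name `QuadBounds`, the method `boundingRect`, and the sample coordinates are made up for illustration; they are not part of any Acrobat or Java API.

```java
// Sketch: collapse a word's quad(s) into one axis-aligned bounding rectangle.
// Assumption: each quad is 8 numbers (x1, y1, x2, y2, x3, y3, x4, y4).
final class QuadBounds {

    // Returns {minX, minY, maxX, maxY} covering every corner of every quad.
    static double[] boundingRect(double[][] quads) {
        if (quads == null || quads.length == 0) {
            throw new IllegalArgumentException("expected at least one quad");
        }
        double minX = Double.POSITIVE_INFINITY;
        double minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY;
        double maxY = Double.NEGATIVE_INFINITY;
        for (double[] quad : quads) {
            if (quad == null || quad.length != 8) {
                throw new IllegalArgumentException("a quad must have 8 numbers");
            }
            // Even indices hold x coordinates, odd indices hold y coordinates.
            for (int i = 0; i < 8; i += 2) {
                minX = Math.min(minX, quad[i]);
                maxX = Math.max(maxX, quad[i]);
                minY = Math.min(minY, quad[i + 1]);
                maxY = Math.max(maxY, quad[i + 1]);
            }
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    public static void main(String[] args) {
        // Hypothetical quad for a single word; the values are invented.
        double[][] quads = {{72, 700, 110, 700, 72, 688, 110, 688}};
        System.out.println(java.util.Arrays.toString(boundingRect(quads)));
    }
}
```

Because it takes the min and max over all corners, the order in which the corners are listed does not matter. Note that for rotated text the result is the enclosing upright rectangle, not the rotated quad itself.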

jedierikb
A: 

PDFBox and JPedal also offer text extraction methods.

I downloaded the JPedal demo jar, but (1) the XML it exported did not have bounding box info; and (2) when I did plain text extraction, it did not return the "searchable"/invisi-text (I assume it tried to do OCR?)
jedierikb
Did you ask on the JPedal forums?