ansaurus

Question

Toolkit & methods for extracting text bounds in a 'searchable PDF'

Answer 1

A:

Check out the iText library: http://www.lowagie.com/iText/

Richard 2009-02-24 08:44:26

iText is primarily for generating PDF documents. I don't see anything in the API for extracting bounding-box information for text in loaded PDFs.

jedierikb 2009-02-24 12:33:51

Yep, you are correct. Sorry about the bum steer. Perhaps http://support.idrsolutions.com/default.asp?W17 is a better bet?

Richard 2009-02-24 17:09:09

Answer 2

A:

Acrobat's JavaScript libraries look to be the most straightforward, especially:

getPageNthWordQuads

which works on a "searchable pdf".

It would be nice if the Acrobat JavaScript library were available as Java calls...
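For anyone who gets quads out of Acrobat and then wants an ordinary bounding box on the Java side, here is a rough sketch. It assumes each quad arrives as eight numbers, i.e. four (x, y) corner points in page coordinates; that layout is my assumption about the data, and the class and method names (QuadBounds, boundsOf) are made up for illustration, not part of any Acrobat or Java API.

```java
import java.util.Arrays;

// Hypothetical helper: not part of Acrobat, iText, PDFBox or JPedal.
class QuadBounds {

    // Assumption: a quad is 8 numbers, x0,y0,x1,y1,x2,y2,x3,y3 (four corners).
    // Returns the axis-aligned box as {minX, minY, maxX, maxY}.
    static double[] boundsOf(double[] quad) {
        if (quad == null || quad.length != 8) {
            throw new IllegalArgumentException("expected 8 numbers (4 corner points)");
        }
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        // Walk the corners pairwise; min/max makes the result independent of corner order.
        for (int i = 0; i < 8; i += 2) {
            minX = Math.min(minX, quad[i]);
            maxX = Math.max(maxX, quad[i]);
            minY = Math.min(minY, quad[i + 1]);
            maxY = Math.max(maxY, quad[i + 1]);
        }
        return new double[] {minX, minY, maxX, maxY};
    }

    public static void main(String[] args) {
        // Made-up sample quad for one word, not taken from a real PDF.
        double[] quad = {10, 20, 50, 20, 10, 8, 50, 8};
        System.out.println(Arrays.toString(boundsOf(quad)));
    }
}
```

Taking the min and max over all four corners means rotated or skewed text still gets a sensible enclosing rectangle, although the rectangle will be larger than the word itself in that case.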

jedierikb 2009-02-24 14:33:05

Answer 3

A:

PDFBox and JPedal also offer text extraction methods.

2009-02-26 09:02:02

I downloaded the JPedal demo jar, but (1) the XML it exported did not have bounding-box info; and (2) when I did plain text extraction, it did not return the "searchable"/invisible text (I assume it tried to do OCR?).

jedierikb 2009-02-26 13:28:38

Did you ask on the JPedal forums?

2009-02-28 09:16:24
