I want to be able to determine the bounding box of areas of text, images and paths on a PDF page, similar to what is shown here:

http://www.windjack.com/products/screenshot/pdfcanscreenshot2.html

Looking at the PDF spec, I can see how to determine the bounding boxes of paths and images, but I can't see how to arrive at them for text. Do I have to calculate it by hand, determining the height and width of the glyphs from the font size, etc., or is there a more straightforward way?

+2  A: 

You may be able to start with the solution to "How do I get character offset information from a pdf document?" That will give you x, y, width and height for characters and/or substrings in the document. From there, the harder part is grouping the characters into spatially distinct regions. There's no guarantee that text that sits close together on the page will also be close together in the syntax of the file format...
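As a rough illustration of that grouping step (not code from the linked answer), here is a minimal Python sketch. It assumes you already have one `(x, y, width, height)` box per character; the names `union`, `near`, `group_boxes` and the `gap` threshold are all hypothetical. Boxes are merged purely by position on the page, so the order in which they appear in the file does not matter.

```python
from typing import List, Tuple

# (x, y, width, height) of one character or region, in page units (assumed layout)
Box = Tuple[float, float, float, float]


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    x0 = min(a[0], b[0])
    y0 = min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def near(a: Box, b: Box, gap: float) -> bool:
    """True when the boxes overlap or lie within `gap` of each other on both axes."""
    return (a[0] - gap <= b[0] + b[2] and b[0] - gap <= a[0] + a[2]
            and a[1] - gap <= b[1] + b[3] and b[1] - gap <= a[1] + a[3])


def group_boxes(boxes: List[Box], gap: float) -> List[Box]:
    """Merge nearby boxes repeatedly until no two regions are within `gap`."""
    regions = list(boxes)
    changed = True
    while changed:
        changed = False
        merged: List[Box] = []
        for box in regions:
            for i, region in enumerate(merged):
                if near(box, region, gap):
                    merged[i] = union(box, region)
                    changed = True
                    break
            else:
                merged.append(box)
        regions = merged
    return regions
```

Choosing `gap` is the judgment call: too small and every word becomes its own region, too large and neighbouring columns merge. A value tied to the font size is a plausible starting point, but that is an assumption to tune against real pages.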

Chris Dolan
Thank you, Chris. I don't speak Perl (and it's not available on the platform I'm targeting), but from my limited comprehension it looks like you are determining the width of text strings by examining the actual font metrics character by character; I take it there's no higher-level approach than that? Thank you also for the warning about the unstructuredness of the PDF format!
hatfinch
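For readers who cannot use the Perl code, the character-by-character calculation discussed above can be sketched in Python. This is a minimal sketch under stated assumptions, not the method from the linked answer: it handles a simple (single-byte) font, takes glyph widths in thousandths of a text-space unit as in a PDF font's width table, and returns the box in text space only, so the text matrix and current transformation matrix still have to be applied to get page coordinates. The function name `text_bbox`, its parameters, and the sample metrics are hypothetical.

```python
from typing import Dict, Tuple


def text_bbox(text: str,
              widths: Dict[str, float],
              font_size: float,
              ascent: float,
              descent: float,
              char_spacing: float = 0.0,
              word_spacing: float = 0.0,
              h_scale: float = 1.0,
              missing_width: float = 0.0) -> Tuple[float, float, float, float]:
    """Return (x, y, width, height) of `text` in text space, origin on the baseline.

    widths, ascent and descent are in 1/1000 of a text-space unit;
    descent is negative for glyphs that drop below the baseline.
    """
    advance = 0.0
    for ch in text:
        step = widths.get(ch, missing_width) / 1000.0 * font_size
        step += char_spacing
        if ch == " ":
            # Word spacing is added only for the space character.
            step += word_spacing
        advance += step * h_scale
    y0 = descent / 1000.0 * font_size
    height = (ascent - descent) / 1000.0 * font_size
    return (0.0, y0, advance, height)


# Illustrative sample metrics (assumed values, not read from a real PDF).
SAMPLE_WIDTHS = {"H": 722, "e": 556, "l": 222, "o": 556, " ": 278}

box = text_bbox("Hello", SAMPLE_WIDTHS, font_size=12, ascent=718, descent=-207)
```

The height here comes from the font-wide ascent and descent, so it bounds the line rather than the exact ink of each glyph; tighter boxes would need per-glyph outlines.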