Determining "boxes of interest" on a PDF page | ansaurus

Q:

Determining "boxes of interest" on a PDF page

I want to be able to determine the bounding box of areas of text, images and paths on a PDF page, similar to what is shown here:

http://www.windjack.com/products/screenshot/pdfcanscreenshot2.html

Looking at the PDF spec, I can see how to determine the bounding boxes of paths and images, but I can't see how to arrive at them for text. Do I have to calculate it by hand, determining the height and width of the glyphs from the font size, etc., or is there a more straightforward way?

+2 A:

You may be able to start with the solution to "How do I get character offset information from a pdf document?" That will give you x, y, width and height for characters and/or substrings in the document. From there, the harder part is grouping those characters into spatially distinct regions. There's no guarantee that text that sits close together on the page will also be close together in the syntax of the file format...

Chris Dolan 2009-06-18 02:07:56
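
For illustration only, here is a minimal Python sketch of the grouping step described above. It assumes you already have per-character (or per-substring) boxes as x, y, width and height in a single page coordinate system. The names `Box`, `near` and `group_boxes` and the `gap` threshold are hypothetical, not part of any PDF library, and the proximity rule is just one simple heuristic among many. Because it compares every pair of boxes, it does not care what order the boxes appear in the file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle: lower-left corner plus size (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def x1(self) -> float:
        return self.x + self.width

    @property
    def y1(self) -> float:
        return self.y + self.height


def near(a: Box, b: Box, gap: float) -> bool:
    """True if a and b overlap or lie within `gap` units on both axes."""
    dx = max(a.x - b.x1, b.x - a.x1, 0.0)
    dy = max(a.y - b.y1, b.y - a.y1, 0.0)
    return dx <= gap and dy <= gap


def union(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    x0, y0 = min(a.x, b.x), min(a.y, b.y)
    x1, y1 = max(a.x1, b.x1), max(a.y1, b.y1)
    return Box(x0, y0, x1 - x0, y1 - y0)


def group_boxes(boxes: list[Box], gap: float) -> list[Box]:
    """Merge boxes that are transitively within `gap` of each other."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) pairwise check; fine for a sketch, slow for very dense pages.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if near(boxes[i], boxes[j], gap):
                parent[find(i)] = find(j)

    regions: dict[int, Box] = {}
    for i, box in enumerate(boxes):
        root = find(i)
        regions[root] = union(regions[root], box) if root in regions else box
    return sorted(regions.values(), key=lambda b: (b.y, b.x))
```

Choosing `gap` is the hard part in practice: something tied to the font size tends to work better than a fixed number, but that is a judgment call and depends on the documents you are processing.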

Thank you, Chris. I don't speak Perl (and it's not available on the platform I'm targeting), but from my limited comprehension it looks like you are determining the width of text strings by examining the actual font metrics character by character; I take it there's no higher-level approach than that? Thank you also for the warning about the unstructuredness of the PDF format!

hatfinch 2009-06-18 12:23:19
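
As a rough sketch of the character-by-character approach discussed above, the following Python function sums glyph advances the way the PDF text-showing operators do for a simple horizontal font: each glyph advances by its width (in thousandths of a text space unit) times the font size, plus character spacing, plus word spacing for a space character, all multiplied by the horizontal scaling. The function name `text_width`, its parameters and the `glyph_widths` dictionary are hypothetical; in a real document the widths would come from the font's width table, and the figures used below are made up.

```python
def text_width(
    text: str,
    glyph_widths: dict[str, float],
    font_size: float,
    char_spacing: float = 0.0,
    word_spacing: float = 0.0,
    horizontal_scale: float = 100.0,
    default_width: float = 0.0,
) -> float:
    """Advance width of `text` in unscaled text space units.

    glyph_widths maps each character to its width in 1/1000 units
    (hypothetical input; a real font supplies these values).
    """
    th = horizontal_scale / 100.0  # horizontal scaling is given as a percentage
    total = 0.0
    for ch in text:
        w0 = glyph_widths.get(ch, default_width) / 1000.0
        # Word spacing is added only for the space character.
        tw = word_spacing if ch == " " else 0.0
        total += (w0 * font_size + char_spacing + tw) * th
    return total


# Made-up widths, for illustration only.
widths = {"A": 600, "B": 400, " ": 200}
width_at_10pt = text_width("A B", widths, font_size=10)
```

The result is still in text space, so it has to be run through the text matrix and the current transformation matrix before it can be compared with path or image boxes on the page. The sketch also ignores kerning adjustments in arrays passed to `TJ`, vertical writing and composite fonts. For the height of the box, the font descriptor's ascent and descent (or its font bounding box) are the usual starting point.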
