I'd like to write some Java code that takes a PDF document, and creates named destinations from all of the bookmarks. I'm thinking that the iText API is the easiest way of doing this, but I have never used the API before.

How would you go about writing this sort of code with the iText API?

Open, find bookmarks, create destinations, save, close.

Or is there a different API that would be better?

+2  A: 

I'll just warn you up front that you may be disappointed with this. iText isn't really intended to be used as a parser. It's really more for creating entirely new PDF documents, but you can take a whack at it.

To start, iText won't let you modify the existing PDF document in place. What you can do, though, is make a copy with the additional features that you want. (If somebody else knows better, please let me know; this drives me crazy.)

What you will want to do is create a PdfReader object from an input stream on your source file. Then create a PdfCopy object (which is just an extended PdfWriter that makes getting data from an existing source more convenient) for your destination.

As far as I can tell, the bookmarks cannot be obtained from iText at all. Another library may be needed. I think jpedal may have the ability to extract them (it can get them as an XML document, which you may then have to parse to get what you want). However you get them, you can then add them to a java.util.List and set that list as the outlines on the PdfCopy. The bookmarks themselves are just HashMaps with a particular set of keys. I'm not sure what all of the values are, but they include "Title", "Action" (which seems to be where you'd specify that this is a named destination, though I don't know what that value would be), and "URI" (which is used if this is an external link; I suspect that this would specify the name of the named destination that you're linking to). Again, the specifics are hard to find.
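To make the "list of HashMaps" idea concrete, here is a minimal sketch that uses only java.util and no PDF library. It assumes the key names that, as far as I know, iText's SimpleBookmark uses ("Action" set to "GoTo", plus "Page", "Named" and "Kids"); check those against your iText version before relying on them. The class name, the method name and the "dest1", "dest2" naming scheme are made up for illustration. It rewrites every bookmark that jumps to an explicit page so that it points at a name instead, and returns the name-to-page mapping that you would still have to register in the copy as named destinations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BookmarkDestinations {

    /**
     * Rewrites each "GoTo" bookmark that targets an explicit page so that it
     * targets a generated name instead. Returns name -> original page target.
     * The key names are an assumption about iText's bookmark maps.
     */
    public static Map<String, String> toNamedDestinations(
            List<HashMap<String, Object>> bookmarks) {
        Map<String, String> names = new LinkedHashMap<>();
        convert(bookmarks, names);
        return names;
    }

    @SuppressWarnings("unchecked")
    private static void convert(List<HashMap<String, Object>> bookmarks,
                                Map<String, String> names) {
        for (HashMap<String, Object> bookmark : bookmarks) {
            Object page = bookmark.get("Page");
            if ("GoTo".equals(bookmark.get("Action")) && page != null) {
                // Hypothetical naming scheme: dest1, dest2, ...
                String name = "dest" + (names.size() + 1);
                names.put(name, page.toString());
                bookmark.remove("Page");
                bookmark.put("Named", name);
            }
            // Child bookmarks are assumed to live under "Kids" as a nested list.
            Object kids = bookmark.get("Kids");
            if (kids instanceof List) {
                convert((List<HashMap<String, Object>>) kids, names);
            }
        }
    }

    public static void main(String[] args) {
        HashMap<String, Object> chapter = new HashMap<>();
        chapter.put("Title", "Chapter 1");
        chapter.put("Action", "GoTo");
        chapter.put("Page", "1 Fit");

        List<HashMap<String, Object>> bookmarks = new ArrayList<>();
        bookmarks.add(chapter);

        // Prints the generated name -> page mapping and the rewritten bookmark.
        System.out.println(toNamedDestinations(bookmarks));
        System.out.println(chapter);
    }
}
```

Bookmarks that are not plain page jumps (for example external links) are left untouched, and nested bookmarks are handled by the recursion over "Kids".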

Then iterate over the pages of the reader, importing each page to the PdfCopy. This page may help you.

Sorry I'm not more helpful to you. Good luck.

P.S. If anybody else knows of a better tool that's either (L)GPL or BSD licensed, I'd love to hear about it.

Ian McLaird
Thanks, that gives me enough to move forward with.
Chris Carruthers
+3  A: 

Followup: I submitted a patch to iText a few months ago (it has now been accepted and is part of HEAD) that adds text parsing capabilities to iText. PdfBox (mentioned below) has (had?) problems with reading newer PDFs that use xref streams instead of the older xref table format.


Another library that is very good at parsing existing PDF files is PdfBox. It can also be used for modifying an existing PDF. FYI - this is the text parser that Lucene uses.

I will also mention that iText does have the ability to parse a PDF file; it's just not great at parsing the text content on each page. If you want to access the higher-level PDF constructs (dictionaries and so on) that are used for storing bookmarks and similar data, and you don't mind getting your hands a little dirty reading the PDF spec, you can absolutely do what you are asking about (we do it quite a bit ourselves).

The PDF Spec is big, but readable for the most part, and you don't have to worry about the bulk of it (which is geared towards actual page content and rendering) if all you are trying to do is extract bookmarks.
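For a feel of what the spec describes here: bookmarks are stored as an outline tree in which each item dictionary has a /Title, a /First entry pointing at its first child and a /Next entry pointing at its next sibling (plus /Dest or /A for the target). The sketch below models only that shape with a plain Java class; OutlineItem and flatten are hypothetical names, and no PDF library is involved, so reading the real dictionaries out of a file is left to whichever library you pick.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlineWalk {

    /** Hypothetical in-memory stand-in for one outline item dictionary. */
    public static final class OutlineItem {
        public final String title;   // models /Title
        public OutlineItem first;    // models /First (first child)
        public OutlineItem next;     // models /Next (next sibling)

        public OutlineItem(String title) {
            this.title = title;
        }
    }

    /** Collects titles depth-first, indenting two spaces per nesting level. */
    public static List<String> flatten(OutlineItem firstTopLevel) {
        List<String> out = new ArrayList<>();
        walk(firstTopLevel, 0, out);
        return out;
    }

    private static void walk(OutlineItem item, int depth, List<String> out) {
        // Follow the sibling chain, descending into children along the way.
        for (OutlineItem cur = item; cur != null; cur = cur.next) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");
            }
            out.add(line.append(cur.title).toString());
            walk(cur.first, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        OutlineItem chapter1 = new OutlineItem("Chapter 1");
        OutlineItem section = new OutlineItem("Section 1.1");
        OutlineItem chapter2 = new OutlineItem("Chapter 2");
        chapter1.first = section;
        chapter1.next = chapter2;

        for (String line : flatten(chapter1)) {
            System.out.println(line);
        }
    }
}
```

The same first-child/next-sibling walk is what you would run over the real dictionaries, collecting each item's target at the point where this sketch collects the title.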

Kevin Day