ansaurus

Question

PDF Text Extraction at hyperlink locations

Answer 1

A:

iText (Java & C#) could do it, though not "out of the box". You'd have to do some low-level PDF object manipulation (and some math), for example to work out where on the destination page to start looking.

The good news is that there's a text extraction "strategy" that will only extract text from a given bounding box. The code might look something like this:

http://www.itextpdf.com/examples/iia.php?id=279

I don't know of a handy example for getting the destination from a link. You'll have to take a look at the PDF Specification (Adobe has a free copy available... it's mentioned in several other PDF-tagged questions, but I don't have the link handy on this machine).

Mark Storer 2010-10-22 05:39:01

Answer 2

A:

This is harder than it sounds - you might need to rethink your question. Intra-document hyperlinks are typically done through a link annotation with the destination set to a "Goto View" action. That view does not necessarily include bounds or even a point. Sometimes it is just a page (at current zoom) or a page (fit width) or a page (at the top, specific zoom). And it's even more complicated than that, because a link destination may be a tree of actions to take in order, with each action being one of 18 different possible action types, including JavaScript, which could be used to drive the viewer to a particular destination.

I think you will also have trouble with "at the point where the link takes you."
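To make the "a view is not a point" problem concrete, here is a rough sketch in plain Java (no PDF library involved) of how the view part of an explicit destination could be turned into a region of the page to extract text from. The view names (XYZ, Fit, FitH, FitV, FitR, FitB, FitBH, FitBV) come from the PDF spec; the class DestinationRegion, its Rect type, and the choice to model the destination as a List with names as plain strings are all made up for illustration. The rule "take everything below and to the right of the given corner" is also just my assumption about what you would want.

```java
import java.util.Arrays;
import java.util.List;

class DestinationRegion {
    /** Rectangle in default PDF user space: origin bottom-left, y grows upward. */
    static final class Rect {
        final double left, bottom, right, top;
        Rect(double left, double bottom, double right, double top) {
            this.left = left; this.bottom = bottom; this.right = right; this.top = top;
        }
        @Override public String toString() {
            return "[" + left + " " + bottom + " " + right + " " + top + "]";
        }
    }

    /** Reads view.get(i) as a number; a missing or null entry means "leave unchanged". */
    static double num(List<?> view, int i, double fallback) {
        if (i >= view.size() || view.get(i) == null) return fallback;
        return ((Number) view.get(i)).doubleValue();
    }

    /**
     * view is the destination array without its leading page reference,
     * e.g. ["XYZ", left, top, zoom]. Returns the part of the page that a
     * text extractor could reasonably be limited to.
     */
    static Rect regionFor(List<?> view, Rect page) {
        String fit = (String) view.get(0);
        switch (fit) {
            case "XYZ":   // left, top, zoom: start at the given top-left corner
                return new Rect(num(view, 1, page.left), page.bottom,
                                page.right, num(view, 2, page.top));
            case "FitH":
            case "FitBH": // only a top coordinate is given
                return new Rect(page.left, page.bottom, page.right, num(view, 1, page.top));
            case "FitV":
            case "FitBV": // only a left coordinate is given
                return new Rect(num(view, 1, page.left), page.bottom, page.right, page.top);
            case "FitR":  // left, bottom, right, top: an actual rectangle
                return new Rect(num(view, 1, page.left), num(view, 2, page.bottom),
                                num(view, 3, page.right), num(view, 4, page.top));
            default:      // Fit, FitB: no position at all, so the whole page
                return page;
        }
    }

    public static void main(String[] args) {
        Rect letter = new Rect(0, 0, 612, 792);
        System.out.println(regionFor(Arrays.asList("XYZ", 72, 500, null), letter));
        System.out.println(regionFor(Arrays.asList("Fit"), letter));
    }
}
```

Only FitR hands you real bounds. For everything else you are guessing how much text after the target position belongs to "the point where the link takes you".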

You can do a lot of this task in C# using Atalasoft dotAnnotate and the PDF Text extraction add on (disclaimer, I work for Atalasoft, wrote the PDF->annotations importer, and used to work for Adobe on Acrobat v 1, 2, & 3). And no, I'm sorry, it's not free software.

Here's how I'd do it (disclaimer - this is right off the top of my head):

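// KeyValuePair<,> is a struct and can't be inherited from, so use a small
// class that exposes the same Key/Value shape.
class PageAnnots
{
    public PageAnnots(int key, List<PdfLinkData> value) { Key = key; Value = value; }
    public int Key { get; private set; }                  // zero-based page number
    public List<PdfLinkData> Value { get; private set; }  // links on that page
}

public List<PageAnnots> GetPageLinkDestinations(Stream stm)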
{
    PdfAnnotationDataImporter importer = new PdfAnnotationDataImporter(stm);
    List<PageAnnots> pageAnnots = new List<PageAnnots>();

    try {
        importer.Load();
        // this gets all annotations on all pages.  On long docs, this will be time consuming
        AnnotationDataCollection allAnnots = importer.Import();
        int pageNo = 0;
        // allAnnots is a collection of LayerData, each LayerData object being a collection
        // of annots for a page.  The collection is empty if there are no annots
        foreach (AnnotationData pageOfAnnots in allAnnots) {
            List<PdfLinkData> linkAnnots = new List<PdfLinkData>();
            LayerData pageLayer = pageOfAnnots as LayerData;
            if (pageLayer != null) {
                // filter out each annot that is a link
                foreach (AnnotationData annot in pageLayer.Items) {
                    PdfLinkData link = annot as PdfLinkData;
                    if (link != null)
                        linkAnnots.Add(link);
                }
            }
            if (linkAnnots.Count > 0) {
                pageAnnots.Add(new PageAnnots(pageNo, linkAnnots));
            }
            pageNo++;
        }
    }
    catch (Exception err) {
        // keep it?  drop it?
    }

    return pageAnnots;
}

At this point, we've reduced this to a collection of key value pairs, each key being a page number and each value being a non-empty list of PdfLinkData objects representing links on that page.

From there, you could iterate over this collection and try to figure the destination like this:

private int PageFromDestination(PdfDestination dest)
{
    PdfIndexedPageReference pageRef = dest.Page as PdfIndexedPageReference;
    return pageRef == null ? -1 : pageRef.PageIndex;
}

public void FigureDestination(PdfLinkData link)
{
    PdfActionList actions = link.ClickAction;
    foreach (PdfAction action in actions) {
        PdfGoToViewAction gotoView = action as PdfGoToViewAction;
        if (gotoView == null)
            continue;
        // this only pulls the page from the destination.  The dest
        // may also contain information about the view.  I'm assuming you
        // only want the page number
        int page = PageFromDestination(gotoView.Destination);
        if (page >= 0) {
            // here's where you step in - the click action could be
            // a long chain of things including several GoToView actions.
            // it's up to you to decide what you want to do.  Handle only
            // action lists of length 1?  Stop at first GoToView?
            // aggregate them all?
        }
    }
}

And when you look at this code, you're going to wonder why on earth there is this level of abstraction in terms of indexed page references and action types and action lists? The answer is that a GoToView action could also refer to another document - cross document links are valid in PDF. While dotAnnotate doesn't support them right now, it is poised to be able to support them in the future. Similarly, the action could indicate going to a view in an embedded PDF document (yes, you can embed PDF in PDF).

You need to be aware that dotAnnotate gives you a limited set of fairly high level objects and doesn't require you to know and understand the PDF specification (too much). We have tried, in the past, to release very granular APIs into things like TIFF and found that our customers didn't find them palatable. So we tried to guess what our customers are likely to want and need and create APIs that are easier to digest.

iText and iTextSharp give you very fine-grained control over the PDF objects, but you will need to understand the PDF spec to get at what you need.

For example, to do the annotation extraction, you will have to open the document, get the document catalog, walk the page tree, and find every page dictionary that has an /Annots key. Then walk each Annots array and look for dictionaries with the key /Subtype set to /Link (and, if present, /Type set to /Annot). For each link, pull out the value of the key /Dest if present; if there is none, look at the key /A and start walking the action tree (each action can chain to more through /Next) to find an action with the key /S set to /GoTo, whose /D entry holds the destination, and go from there.
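The last part of that walk (from one annotation dictionary to its destinations) can be sketched in plain Java. This is not iText's API: dictionaries are modeled as Map, arrays as List and names as strings without the slash, and the class LinkDestinationWalker is hypothetical. The key names it checks (/Subtype, /Dest, /A, /S, /D, /Next) are the ones I read in the PDF spec, so verify them against your copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LinkDestinationWalker {
    /** Returns every raw destination (array or name) a link annotation can reach, in order. */
    static List<Object> destinationsOf(Map<?, ?> annot) {
        List<Object> found = new ArrayList<Object>();
        if (!"Link".equals(annot.get("Subtype"))) return found;  // not a link
        Object dest = annot.get("Dest");
        if (dest != null) {               // direct /Dest entry: no actions to walk
            found.add(dest);
            return found;
        }
        // Identity-based set so a /Next chain that loops back cannot recurse forever.
        Set<Object> seen = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
        collect(annot.get("A"), found, seen);
        return found;
    }

    /** Depth-first walk; /Next may hold one action dictionary or an array of them. */
    private static void collect(Object node, List<Object> found, Set<Object> seen) {
        if (node instanceof List) {
            for (Object child : (List<?>) node) collect(child, found, seen);
            return;
        }
        if (!(node instanceof Map) || !seen.add(node)) return;
        Map<?, ?> action = (Map<?, ?>) node;
        if ("GoTo".equals(action.get("S")) && action.get("D") != null) {
            found.add(action.get("D"));
        }
        collect(action.get("Next"), found, seen);
    }

    public static void main(String[] args) {
        Map<String, Object> goTo = new HashMap<String, Object>();
        goTo.put("S", "GoTo");
        goTo.put("D", Arrays.asList("page 3", "XYZ", 72, 500, null));
        Map<String, Object> script = new HashMap<String, Object>();
        script.put("S", "JavaScript");
        script.put("Next", goTo);         // the GoTo only runs after the script
        Map<String, Object> annot = new HashMap<String, Object>();
        annot.put("Type", "Annot");
        annot.put("Subtype", "Link");
        annot.put("A", script);
        System.out.println(destinationsOf(annot));
    }
}
```

Returning a list instead of a single value leaves the "stop at the first GoTo or aggregate them all?" decision to the caller. Each entry still has to be resolved to a page and view, and the other action types are skipped entirely.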

A destination may be a direct destination or it may be a named destination. If it is a named destination, you will have to go back to the document catalog and pull out the name tree and search it for the name in the named destination and when you find it, pull out the information there.
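The name tree search could look roughly like the following. Same caveats as before: this is plain Java over Map/List stand-ins for PDF dictionaries and arrays, not a real library API, and NameTreeLookup and its leaf helper are invented for the example. As far as I recall, the tree hangs off /Names then /Dests in the document catalog, leaf nodes carry a /Names array of alternating key/value entries, intermediate nodes carry /Kids, and non-root nodes carry /Limits with their smallest and largest key. The sketch also assumes ASCII names, so Java's string ordering matches the byte ordering PDF uses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NameTreeLookup {
    /** Returns the destination stored under name, or null if the tree has no such key. */
    static Object find(Map<?, ?> node, String name) {
        List<?> names = (List<?>) node.get("Names");
        if (names != null) {              // leaf: [key1 value1 key2 value2 ...]
            for (int i = 0; i + 1 < names.size(); i += 2) {
                if (name.equals(names.get(i))) return unwrap(names.get(i + 1));
            }
        }
        List<?> kids = (List<?>) node.get("Kids");
        if (kids != null) {               // intermediate node: descend where the name fits
            for (Object k : kids) {
                Map<?, ?> kid = (Map<?, ?>) k;
                if (inLimits(kid, name)) {
                    Object hit = find(kid, name);
                    if (hit != null) return hit;
                }
            }
        }
        return null;
    }

    /** True if name falls inside the node's /Limits range, or the node has none. */
    static boolean inLimits(Map<?, ?> node, String name) {
        List<?> limits = (List<?>) node.get("Limits");
        if (limits == null || limits.size() < 2) return true;
        return name.compareTo((String) limits.get(0)) >= 0
            && name.compareTo((String) limits.get(1)) <= 0;
    }

    /** A value may be the destination array itself or a dictionary holding it under /D. */
    static Object unwrap(Object value) {
        if (value instanceof Map && ((Map<?, ?>) value).containsKey("D")) {
            return ((Map<?, ?>) value).get("D");
        }
        return value;
    }

    /** Test helper (hypothetical): builds a leaf node from alternating key/value pairs. */
    static Map<String, Object> leaf(String lo, String hi, Object... pairs) {
        Map<String, Object> m = new HashMap<String, Object>();
        m.put("Limits", Arrays.asList(lo, hi));
        m.put("Names", Arrays.asList(pairs));
        return m;
    }

    public static void main(String[] args) {
        Map<String, Object> root = new HashMap<String, Object>();
        root.put("Kids", Arrays.asList(
            leaf("appendix", "chapter2", "appendix", "destA", "chapter2", "destB"),
            leaf("index", "toc", "index", "destC", "toc", "destD")));
        System.out.println(find(root, "toc"));
    }
}
```

Older files may keep named destinations in a plain /Dests dictionary in the catalog instead of a name tree, in which case the lookup is a single dictionary get.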

So yeah, you can use iText or another similar PDF parser, but you will need to do all of these steps unless one of the library creators was kind enough to do that for you.

plinth 2010-10-26 20:25:48
