I am working on a project with searchable PDF documents.

Once a search has found the relevant text, I want to be able to show a small image "snippet" of that text.

Can anyone point me in the direction of any resources or toolkits that will enable me to do this?

Roger Somerset UK

A: 

To show a small snippet of any part of a PDF file, you will need to render the PDF to an image format and display that. There are a few different ways to show only the small area of the page that contains the matching text.

  1. Find an SDK that lets you extract all of the text from a PDF document along with the co-ordinates of the individual words. Then search through the extracted text for the matching text and retrieve its co-ordinates.
  2. Alternatively, find an SDK that does the searching for you, provided it also gives you the co-ordinates of the matching words.
  3. Once you have the co-ordinates of the matching word, crop the page to an area around that word (you can make this area as big or small as you want) and then render that page as an image. Only the cropped area will be rendered, and that will be your "snippet".
  4. Cropping and rendering a page every time you want to display a matching search result might be slow in some cases, so you can also experiment with rendering the full page once, then cropping the resulting image to the necessary co-ordinates in your programming language of choice and displaying the cropped image.
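
To make the search-and-crop idea above a little more concrete, here is a minimal, library-agnostic sketch in Python. It is only an illustration: the `Word` record, the `find_hits` and `snippet_box` helpers, the padding values, and the sample data are all hypothetical, and a real SDK will return word positions in its own format. It assumes each box is given as left, bottom, right, top in page units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One extracted word and its bounding box on the page (hypothetical shape)."""
    text: str
    left: float
    bottom: float
    right: float
    top: float


def find_hits(words, query):
    """Return every word whose text matches the query, ignoring case."""
    needle = query.lower()
    return [w for w in words if w.text.lower() == needle]


def snippet_box(hit, page_width, page_height, pad_x=100.0, pad_y=30.0):
    """Grow the hit's box by the padding and clamp it to the page edges."""
    left = max(0.0, hit.left - pad_x)
    bottom = max(0.0, hit.bottom - pad_y)
    right = min(page_width, hit.right + pad_x)
    top = min(page_height, hit.top + pad_y)
    return (left, bottom, right, top)


# Made-up sample data standing in for an SDK's text extraction output.
words = [
    Word("Invoice", 72.0, 700.0, 120.0, 712.0),
    Word("total", 125.0, 700.0, 150.0, 712.0),
]
hits = find_hits(words, "TOTAL")
box = snippet_box(hits[0], page_width=612.0, page_height=792.0)
print(box)  # (25.0, 670.0, 250.0, 742.0)
```

The resulting box is what you would hand to whatever crop call your toolkit provides. Matching on exact word text is deliberately simple; with an imperfect OCR layer you may want looser matching.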

So the key requirements for you are:

  • Extract text with co-ordinates
  • Crop page in PDF
  • Render a PDF
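
If you go the route of rendering the full page and cropping the image afterwards, the word co-ordinates have to be converted from page units to pixels. The sketch below assumes an unrotated page whose co-ordinates are in points (1/72 inch) with the origin at the bottom-left corner, which is the PDF default; some toolkits report positions from the top-left instead, in which case the vertical flip is not needed. The function name `points_to_pixels` and the sample numbers are made up for illustration.

```python
def points_to_pixels(box, page_height, dpi=150):
    """Convert (left, bottom, right, top) in points to (left, upper, right, lower) in pixels.

    Assumes the page origin is bottom-left and the image origin is top-left,
    so the vertical axis is flipped.
    """
    scale = dpi / 72.0  # 72 points per inch
    left, bottom, right, top = box
    return (
        round(left * scale),
        round((page_height - top) * scale),
        round(right * scale),
        round((page_height - bottom) * scale),
    )


# A hypothetical snippet area on a 612 x 792 point page rendered at 144 DPI.
print(points_to_pixels((36.0, 648.0, 252.0, 756.0), page_height=792.0, dpi=144))
# (72, 72, 504, 288)
```

The output order (left, upper, right, lower) is the one most image libraries expect for a crop rectangle, for example `Image.crop` in Pillow.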

As for toolkits that can do this, it depends entirely on what programming language you're using. Add a comment with your programming language and I'll update my answer with some suggestions.

Rowan
C# and ASP.NET, to sit within our website. I have played about with a product called dtSearch, which will index my PDFs and, when searched, return an object containing what I think are word offsets within the document. There is an option to output these hits as an XML document which, when sent to Acrobat Reader, will highlight the hits. This is great for the document, but I would like to show the snippet. One of the main reasons for this is that the OCRed text layer may not be 100% accurate, even though it is good enough for the search to find matches.
Roger Maynard
As a follow-up to this, I have found a very comprehensive library for PDF manipulation: http://www.quickpdf.org/. It is a commercial product, but really inexpensive given its feature list.
Roger Maynard