I am looking for a PDF content extractor (a console tool or a library).

It will be used on a server to produce online e-books from uploaded PDF files.

I need to extract the following:

  1. text with fonts and styles;
  2. images;
  3. audio and video;
  4. links and hotspots;
  5. page snapshots and thumbnails;
  6. general PDF information, e.g. book layout, number of pages, etc.

I am looking at Adobe PDF Library ($5000, though), BCL SDK (?), PDFlib (€795), and QuickPDF ($250).

We currently use the open source pdf2xml (extracts text, images and links) and Ghostscript (snapshots and thumbnails). What remains is:

  1. fonts;
  2. multimedia;
  3. hotspots;
  4. page info.

We are hesitating between paying a lot of money (and possibly making a mistake by choosing the wrong solution) and using free/open source solutions.

Which solution would you recommend as the best for extracting nearly everything from a PDF?

Any comments will be much appreciated.

+1  A: 

A: Font: I don't think fonts can be extracted.

B: Not sure about multimedia

C: What are hotspots?

D: Have a look at iTextSharp (open source), you might be able to extract more page info.

Mark Redman
> A: Font: I don't think fonts can be extracted.
We need at least the proper font names, so that we can use system fonts.

> B: Not sure about multimedia
As far as I know, multimedia is stored in annotation objects in the PDF, so a solution should be able to iterate through them to extract it, right?

> C: What are hotspots?
A hotspot is a kind of rectangular link, for example over part of an image.

> Have a look at iTextSharp (open source), you might be able to extract more page info.
Thank you, will give it a try.
Max
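To illustrate the point above about iterating through annotations, here is a minimal sketch. It assumes the third-party pypdf package is installed and that links and hotspots appear as `/Link` annotations while multimedia appears under subtypes such as `/Screen`, `/Movie`, `/Sound` or `/RichMedia`. The function names `classify_annotation` and `iter_annotations` are made up for this example, and real files may need more cases.

```python
# Annotation subtypes that usually carry multimedia (an assumption for
# this sketch; a real file may use others).
MULTIMEDIA_SUBTYPES = {"/Screen", "/Movie", "/Sound", "/RichMedia", "/3D"}


def classify_annotation(annot):
    """Return (kind, rect) for one annotation dictionary.

    kind is "link" (links and hotspots), "multimedia" or "other";
    rect is the annotation rectangle, or None if it is missing.
    """
    subtype = annot.get("/Subtype")
    rect = annot.get("/Rect")
    if subtype == "/Link":
        return "link", rect
    if subtype in MULTIMEDIA_SUBTYPES:
        return "multimedia", rect
    return "other", rect


def iter_annotations(pdf_path):
    """Yield (page_number, annotation_dict) for every annotation in the file."""
    from pypdf import PdfReader  # third-party; assumed to be installed

    reader = PdfReader(pdf_path)
    for page_number, page in enumerate(reader.pages, start=1):
        annots = page.get("/Annots")
        if annots is None:
            continue
        # Entries may be indirect references, so resolve each one.
        for ref in annots.get_object():
            yield page_number, ref.get_object()
```

Usage would look like `for page_no, annot in iter_annotations("book.pdf"): print(page_no, classify_annotation(annot))`, where `book.pdf` is a placeholder. This only locates the annotations; getting the actual audio or video stream out of them is a further step that is not shown here.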
It seems to me that iText is for generating PDF files, not for extracting their content, isn't it?
Max
Yes, it is mostly for generating PDF files; I just thought you might be able to extract some info, like the number of pages, page sizes and possibly page info. Also have a look at http://www.tallcomponents.com/ as they have some decent tools too.
Mark Redman
+2  A: 

Sounds like with a few days or weeks of effort you can adapt the open source tools to your needs. Fonts and everything else can certainly be extracted; this is something that every PDF reader must do anyway to display the document.

You should probably estimate your programmer cost ($/hr) and multiply it by the estimated time it would take to add the needed functionality to the open source tools (60-80 hours?). If the result is greater than or close to $5000 anyway, you might consider just buying the commercial software.

Otherwise, with the help of the (quite good) PDF reference, you should be well on your way.

One more thing: you might find Poppler to be of help. It is for rendering PDF, but that is closely related to what you are trying to do.
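For example, Poppler ships command-line utilities, including `pdfinfo` and `pdffonts`, that already report page info and font names. A minimal sketch of calling them from a server-side script follows. It assumes both tools are installed and on the PATH, that `pdfinfo` prints `Key: value` lines, and that `pdffonts` prints two header lines followed by one row per font with the name in the first column. The Python helper names are made up for this example.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn 'Key:   value' lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def parse_pdffonts(text):
    """Return the first column (font name) of each row after the two header lines."""
    rows = text.splitlines()[2:]
    return [row.split()[0] for row in rows if row.strip()]


def run_tool(tool, pdf_path):
    """Run a Poppler utility on a file and return its standard output."""
    result = subprocess.run(
        [tool, pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def pdf_summary(pdf_path):
    """Collect general document info and font names for one PDF."""
    info = parse_pdfinfo(run_tool("pdfinfo", pdf_path))
    fonts = parse_pdffonts(run_tool("pdffonts", pdf_path))
    return info, fonts
```

Calling `pdf_summary("book.pdf")` (a placeholder path) would return a dict with entries such as `Pages` and `Page size`, plus a list of font names. Subset-embedded fonts typically carry a prefix such as `ABCDEF+`, which would need stripping before matching against system fonts.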

Adam Goode
The difficulty here is that even a commercial SDK will require programming effort. In their feature summaries everything looks great; however, looking at the samples, it is still unclear how to extract, for example, a video to an external file. They just dump annotation information (talking about PDFlib pCOS).
Max
Yeah, you'd have to factor that into the cost.
Adam Goode