I'm looking for a PDF content extractor (console tool or library).
It will run on a server to produce online e-books from uploaded PDF files.
We need to extract the following:
- text with fonts and styles;
- images;
- audio and video;
- links and hotspots;
- page snapshots and thumbnails;
- general PDF information, e.g. book layouts, number of pages etc.
We are considering Adobe PDF Library ($5000, though), BCL SDK (?), PDFlib (€795) and QuickPDF ($250).
Currently we use the open-source pdf2xml (extracts text, images and links) and Ghostscript (snapshots and thumbnails). What is still missing:
- fonts;
- multimedia;
- hotspots;
- page info.
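
For context, the Ghostscript half of the current setup can be sketched roughly as below. This is a minimal illustration, not our exact script: the function name, the output file pattern and the default DPI are made up for the example, and the executable name `gs` is an assumption (on Windows it is typically `gswin64c`). The flags themselves are standard Ghostscript options.

```python
from pathlib import Path


def ghostscript_png_args(pdf_path, out_dir, dpi=150, first_page=None, last_page=None):
    """Build a Ghostscript argument list that renders PDF pages to PNG files.

    Nothing is executed here; pass the result to subprocess.run().
    The function name and defaults are hypothetical, chosen for this sketch.
    """
    # Ghostscript expands %03d in the output name to a zero-padded counter.
    out_pattern = Path(out_dir) / "page-%03d.png"
    args = [
        "gs",               # assumed executable name
        "-dSAFER",          # restrict file access while rendering
        "-dBATCH",          # exit after the last file
        "-dNOPAUSE",        # do not wait for a key press between pages
        "-sDEVICE=png16m",  # 24-bit colour PNG output
        f"-r{dpi}",         # resolution: high for snapshots, low for thumbnails
    ]
    if first_page is not None:
        args.append(f"-dFirstPage={first_page}")
    if last_page is not None:
        args.append(f"-dLastPage={last_page}")
    args.append(f"-sOutputFile={out_pattern}")
    args.append(str(pdf_path))
    return args


# Example (requires Ghostscript to be installed and the "thumbs" directory to exist):
#   import subprocess
#   subprocess.run(ghostscript_png_args("book.pdf", "thumbs", dpi=24), check=True)
```

The same call with a higher `dpi` gives full page snapshots. None of this helps with fonts, multimedia, hotspots or page info, which is why we are looking at other tools.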
We are torn between paying a lot of money (and possibly choosing the wrong product) and sticking with free/open-source solutions.
Which solution would you recommend as the best for extracting nearly everything from a PDF?
Any comments will be much appreciated.