views:

180

answers:

3

Hello,

I was wondering if there is a way to remove text and images from a PDF. My goal is to create two images, one being just the text and the other the images.

I tried looking at the following SDKs but didn't find what I was looking for:

  • Pegasus PDFXpress
  • Tall Components
  • LeadTools

Thanks!

A: 

You could use PDFsharp. I think you can do that with this library.

Gambrinus
A: 

A traditional PDF file comprises a number of instructions for a PostScript printer or display driver to draw some shapes. You might interpret the resulting shapes as being words, but when you see a capital D, say, there's no reason why the glyph that makes up the vertical bar on the left of the letter and the glyph that makes up the curve on the right have to be adjacent to each other in the PDF, or even close to each other.

There are tools (and toolkits) that would allow you to OCR the document to get the text.

But the question is essentially meaningless: there is no text in a PDF, only instructions to draw things.

Edit: Tagged PDF files do include text. Traditional PDFs do not, and they have no concept of a logical reading order or any differentiation between kinds of content. Concluding that all PDFs contain text because one has once seen text in a PDF would be bad logic.

amaca
What? Of course there's text in PDFs, I've extracted it and edited it with a hex editor myself!
hova
I think amaca is arguing that any solution to this problem will not be 100% perfect, since the line between text and images in a PDF is not always totally solid from a computational standpoint. But it's a silly argument.
Brian
amaca: You're aware that PDF files contain *fonts*, right? The most straightforward way of including text in PDF is by giving a string of glyphs to be set in some font, not by drawing random Bézier curves (and there are of course mappings from glyphs to Unicode characters).
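To make that concrete, here is a minimal Python sketch of what separating the two kinds of content could look like at the content-stream level. It assumes an already-decoded stream, and the sample stream, the resource names `/F1` and `/Im1`, and the helper `split_text_and_xobjects` are all made up for illustration. It only handles the `Tj` text-showing operator and the `Do` operator, and it ignores nested or escaped parentheses, `TJ` arrays, and hex strings, so treat it as a toy rather than a real parser.

```python
import re

# Hypothetical, hand-written content stream. Real streams are usually
# compressed (e.g. with FlateDecode) and must be decoded first.
SAMPLE_STREAM = """
q
200 0 0 150 50 500 cm
/Im1 Do
Q
BT
/F1 12 Tf
72 720 Td
(Hello, world) Tj
ET
"""

# A token is either a literal string "(...)" without nested or escaped
# parentheses, or any other run of non-whitespace characters.
TOKEN = re.compile(r"\([^()]*\)|\S+")
NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def split_text_and_xobjects(stream):
    """Return (strings shown with Tj, names of XObjects painted with Do)."""
    texts, xobjects, operands = [], [], []
    for token in TOKEN.findall(stream):
        if token == "Tj" and operands:
            texts.append(operands[-1][1:-1])  # strip the parentheses
            operands.clear()
        elif token == "Do" and operands:
            xobjects.append(operands[-1].lstrip("/"))
            operands.clear()
        elif token[0] in "(/" or NUMBER.fullmatch(token):
            operands.append(token)  # operand: string, name, or number
        else:
            operands.clear()  # any other operator consumes its operands
    return texts, xobjects


texts, xobjects = split_text_and_xobjects(SAMPLE_STREAM)
print(texts)     # strings passed to Tj
print(xobjects)  # XObject names passed to Do
```

Two caveats: the bytes passed to `Tj` are character codes in the current font's encoding, so turning them into Unicode still needs the glyph-to-Unicode mapping mentioned above, and `Do` paints any XObject, which may be an image or a form, so you would have to look up its subtype in the page resources.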
Arthur Reutenauer
I am, Arthur. Are you aware that PDFs don't need to contain fonts? Or that even if referencing glyphs there is no implied read order?
amaca
If you're willing to go there I can only point you to Brian's answer. What are you trying to achieve by denying that there are simple solutions that would work in the vast majority of cases? You have no idea of the use case here.
Arthur Reutenauer
A: 

The LEADTOOLS SDK can perform the task described above using our auto-zoning functionality, which detects text, graphic, and table zones without actually performing OCR on the text. We have posted a sample application with this functionality at the link below. We also provide an OCR engine capable of extracting the text and saving it to a searchable PDF if that ever becomes part of your requirements.

http://support.leadtools.com/CS/forums/28142/ShowPost.aspx#28142

LEADTOOLS Support
LEAD Technologies Inc.