PDF Text and Image Removal

tags: .net, pdf

Hello,

I was wondering if there is a way to remove text and images from a PDF. My goal is to create two images, one being just the text and the other the images.

I tried looking at the following SDKs but didn't find what I was looking for:

Pegasus PDFXpress
Tall Components
LeadTools

Thanks!

You could use PDFsharp. I think you can do that with this library.

Gambrinus 2009-03-11 17:21:05

A traditional PDF file comprises a number of instructions for a PostScript printer or display driver to draw some shapes. You might interpret the resulting shapes as being words, but when you see a capital D, say, there's no reason why the glyph that makes up the vertical bar on the left and the glyph that makes up the curve on the right have to be adjacent to each other in the PDF, or even close to each other.

There are tools (and toolkits) that would allow you to OCR the document to get the text.

But the question is essentially meaningless: there is no text in a PDF, only instructions to draw things.

Edit: Tagged PDF files do include text; traditional PDFs do not, and they have no concept of a logical reading order or of any differentiation between kinds of content. But concluding that all PDFs contain text because one has once seen text in a PDF would be bad logic.

amaca 2009-03-11 17:28:08

What? Of course there's text in PDFs, I've extracted it and edited it with a hex editor myself!

hova 2009-03-11 17:54:23

I think amaca is arguing that no solution to this problem will be 100% perfect, since the line between text and images in a PDF is not always clear-cut from a computational standpoint. But it's a silly argument.

Brian 2009-03-11 19:08:59

amaca: You're aware that PDF files contain *fonts*, right? The most straightforward way of including text in PDF is by giving a string of glyphs to be set in some font, not by drawing random Bézier curves (and there are of course mappings from glyphs to Unicode characters).

Arthur Reutenauer 2009-03-12 12:22:38
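To make the point about text operators concrete, here is a minimal Python sketch, not a real PDF parser. It assumes a hypothetical, already-decompressed page content stream (the `CONTENT` sample, the font name `/F1` and the XObject name `/Im1` are all made up for illustration) and separates the strings shown with `Tj` inside `BT`/`ET` text objects from the XObjects painted with `Do`. In a real file the stream is usually compressed, `Do` may paint a form rather than an image, and the string bytes are character codes in the font's encoding rather than Unicode.

```python
import re

# Hypothetical, uncompressed page content stream. Real streams are usually
# Flate-compressed and have to be decoded before they look like this.
CONTENT = rb"""
q
200 0 0 100 50 600 cm
/Im1 Do
Q
BT
/F1 12 Tf
72 700 Td
(Hello) Tj
ET
BT
/F1 12 Tf
72 680 Td
(World) Tj
ET
"""

# Simplified tokenizer: literal strings (no nested parentheses), names,
# numbers and operators. Arrays, hex strings and inline images are ignored.
TOKEN = re.compile(
    rb"\((?:\\.|[^\\()])*\)"   # literal string such as (Hello)
    rb"|/[^\s/()<>\[\]]+"      # name such as /F1 or /Im1
    rb"|[-+]?\d*\.?\d+"        # number
    rb"|[A-Za-z*']+"           # operator such as BT, Tj, Do
)
OPERATOR = re.compile(rb"[A-Za-z*']+")


def split_text_and_xobjects(stream):
    """Return (text strings shown with Tj, names of XObjects painted with Do)."""
    text_runs, xobjects, operands = [], [], []
    in_text = False
    for tok in TOKEN.findall(stream):
        if not OPERATOR.fullmatch(tok):
            operands.append(tok)  # operands precede their operator
            continue
        if tok == b"BT":
            in_text = True
        elif tok == b"ET":
            in_text = False
        elif tok == b"Tj" and in_text and operands:
            # Strip the parentheses. latin-1 is only a stand-in here: the
            # bytes are really codes in the font's own encoding.
            text_runs.append(operands[-1][1:-1].decode("latin-1"))
        elif tok == b"Do" and operands:
            # Could be an image or a form; check /Subtype in a real file.
            xobjects.append(operands[-1][1:].decode("ascii"))
        operands.clear()
    return text_runs, xobjects


print(split_text_and_xobjects(CONTENT))
```

Dropping either group of operators and rendering what is left is roughly how the "text only" and "images only" pages from the question could be produced, at least for PDFs that set their text this way.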

I am, Arthur. Are you aware that PDFs don't need to contain fonts? Or that even if referencing glyphs there is no implied read order?

amaca 2009-03-13 08:32:42
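On the reading-order point: since the order of text runs in the file need not match the order a reader sees, extraction tools commonly fall back on position. The Python sketch below is only a heuristic under stated assumptions: the `Run` type, the `reading_order` function and the `line_tolerance` value are hypothetical, and it assumes each run's position has already been recovered from the page. It groups runs into lines by their y coordinate (the PDF y axis points up) and sorts each line left to right.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """A piece of text and the position where it was drawn (hypothetical type)."""
    x: float
    y: float
    text: str


def reading_order(runs, line_tolerance=2.0):
    """Group runs into lines by y, top to bottom, then sort each line by x."""
    lines = []
    for run in sorted(runs, key=lambda r: -r.y):  # higher y is nearer the top
        if lines and abs(lines[-1][0].y - run.y) <= line_tolerance:
            lines[-1].append(run)  # close enough to the current line
        else:
            lines.append([run])    # start a new line
    return [
        " ".join(r.text for r in sorted(line, key=lambda r: r.x))
        for line in lines
    ]


# Runs listed in an arbitrary order, as they might appear in the file.
runs = [
    Run(120, 700, "world"),
    Run(72, 680, "Second"),
    Run(72, 700, "Hello"),
    Run(130, 680.5, "line"),
]
print(reading_order(runs))
```

This will not cope with multi-column layouts or rotated text, which is the kind of case where no positional heuristic is guaranteed to work.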

If you're willing to go there, I can only point you to Brian's answer. What are you trying to achieve by denying that there are simple solutions that would work in the vast majority of cases? You have no idea of the use case here.

Arthur Reutenauer 2009-03-17 11:39:07

The LEADTOOLS SDK can perform the task described above using our auto-zoning functionality. Auto-zoning detects text, graphic, and table zones without actually performing OCR on the text. We have posted a sample application with this functionality at the link below. We also provide an OCR engine capable of extracting the text and saving it to a searchable PDF, if that ever becomes part of your requirements.

http://support.leadtools.com/CS/forums/28142/ShowPost.aspx#28142

LEADTOOLS Support
LEAD Technologies Inc.

2009-04-20 20:21:52
