ansaurus

Question

Full Text Search in PDF - converted using VB .net Code

Answer 1

+2 A:

That depends on how the conversion is done.

PDF is a fairly graphics-format-neutral page description language. That means the pixels you see on screen in the final PDF may come from a number of different types of operators.

In the best case, text is represented as, well, text. You can figure this out easily by opening up your PDF and searching for the text "/Type /Page" (or "/Type/Page"). This will show you a page dictionary that describes one page in your document. Within that dictionary (delimited by << and >>), you will see something like "/Resources 15 0 R". This tells you where the page-level resources are described. "15 0 R" means "a reference to object 15, generation 0". So now search for "15 0 obj" (or, more precisely, whatever object number appears in your file) and it will take you to another dictionary. In the resource dictionary, you should see something like this:

15 0 obj << /Font <</F1 6 0 R /F2 8 0 R>>
   /XObject <</Im0 10 0 R>>
>> endobj

This means that there are two fonts used by this page (F1 and F2) and one external object (Im0).

If you have no /Font entry in the resource dictionary, then there are no fonts used by this page and it is therefore not searchable. If there is a conspicuous single XObject named something like IM or Im or Image, then chances are your document is a single image that makes up the page.
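If you would rather not hunt through the file by hand, the lookup described above can be roughed out in a few lines of Python. This is only a sketch under some loud assumptions: the function name page_resource_report is made up for illustration, the file's objects are stored uncompressed (no object streams), and regular expressions stand in for a real PDF parser. Treat the result as a hint, not a verdict.

```python
import re

# Heuristic patterns; a real PDF parser would be far more careful.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.S)
PAGE_RE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")   # excludes /Pages
RES_REF_RE = re.compile(rb"/Resources\s+(\d+)\s+(\d+)\s+R")
FONT_RE = re.compile(rb"/Font(?![A-Za-z])")           # excludes /FontDescriptor etc.
XOBJ_RE = re.compile(rb"/XObject(?![A-Za-z])")


def page_resource_report(pdf):
    """Report, per page object, whether its resources mention /Font or /XObject.

    Assumes uncompressed objects; pages inside object streams are not seen.
    """
    objects = {(int(n), int(g)): body for n, g, body in OBJ_RE.findall(pdf)}
    report = []
    for (num, _gen), body in objects.items():
        if not PAGE_RE.search(body):
            continue
        ref = RES_REF_RE.search(body)
        if ref:
            # Indirect reference such as "/Resources 15 0 R".
            key = (int(ref.group(1)), int(ref.group(2)))
            resources = objects.get(key, b"")
        else:
            # Assume the resource dictionary is written inline in the page.
            resources = body
        report.append({
            "page_object": num,
            "has_font": bool(FONT_RE.search(resources)),
            "has_xobject": bool(XOBJ_RE.search(resources)),
        })
    return report
```

You would call it with the raw bytes of the file, for example page_resource_report(open("converted.pdf", "rb").read()), where converted.pdf is a placeholder name. A page that reports no font but does report an XObject is a likely candidate for the "one big image" case.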

So in sum, a PDF page can be painted in a number of different ways. It's possible to have text represented by actual fonts. It's possible for text to be painted with path operators (i.e., a series of Bézier curves). It's possible for text to be painted with one or more images. Only the first will carry the full connotation of text.
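To make the three cases concrete, here is a rough sketch that looks at a page's content stream and reports which kinds of painting operators it contains. The function name classify_content_stream is hypothetical, and the sketch assumes you already have the decoded stream bytes (streams are usually compressed; for /FlateDecode, zlib.decompress is typically the first step). The tokenizer is naive: it skips simple string literals but does not handle nested parentheses or inline image data.

```python
import re

# Strip string operands so words inside them are not mistaken for operators.
LITERAL_STRING = re.compile(rb"\((?:\\.|[^\\()])*\)", re.S)
HEX_STRING = re.compile(rb"<[0-9A-Fa-f\s]*>")

TEXT_OPS = {b"Tj", b"TJ", b"'", b'"'}           # text-showing operators
PATH_PAINT_OPS = {b"S", b"s", b"f", b"F", b"f*",
                  b"B", b"B*", b"b", b"b*"}     # stroke/fill a path
XOBJECT_OPS = {b"Do", b"BI"}                    # paint an XObject / inline image


def classify_content_stream(stream):
    """Return which operator families appear in a decoded content stream."""
    stripped = LITERAL_STRING.sub(b" ", stream)
    stripped = HEX_STRING.sub(b" ", stripped)
    # Array brackets can touch operators, e.g. "[(A)]TJ".
    stripped = stripped.replace(b"[", b" ").replace(b"]", b" ")
    tokens = set(stripped.split())
    return {
        "text": bool(tokens & TEXT_OPS),
        "paths": bool(tokens & PATH_PAINT_OPS),
        "xobjects": bool(tokens & XOBJECT_OPS),
    }
```

A stream that reports only "xobjects" is probably an image-only page. One that reports "paths" but no "text" may have had its glyphs converted to outlines. Note that Do paints any XObject, including form XObjects that can contain text of their own, so this is a first guess only.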

Chances are, your conversion program just prints the Word document to an image and encodes the image in the PDF.

And just to be complete for the next person who searches through this document - it is possible to place invisible text inside of PDF documents. I've written code that takes raw images, runs them through an OCR engine, places the image on the PDF and then lays down invisible text returned by the OCR engine. Invisible text is both selectable and searchable.
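For the curious, invisible text is normally done with text rendering mode 3 (the Tr operator), which neither fills nor strokes the glyphs. The sketch below is not the code mentioned above; it only builds the content-stream fragment for one run of invisible text. The function names, the font resource name F1, and the coordinates are illustrative assumptions, and it only covers text that a simple single-byte font can encode.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text, x, y, font_name="F1", font_size=12):
    """Build content-stream operators that place invisible text at (x, y).

    font_name must match an entry in the page's /Font resource dictionary.
    """
    return (
        "BT\n"                                    # begin text object
        f"/{font_name} {font_size} Tf\n"          # select font and size
        "3 Tr\n"                                  # rendering mode 3: invisible
        f"{x} {y} Td\n"                           # move to the text position
        f"({escape_pdf_string(text)}) Tj\n"       # show the string
        "ET\n"                                    # end text object
    )
```

In an OCR workflow you would emit one such fragment per recognized word or line, positioned over the matching region of the page image, so that selection and search land in the right place.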

For more information, look in the PDF reference (I cross-checked the current spec from Adobe) in chapter 8 (Graphics) and chapter 9 (Text).

plinth 2010-01-04 14:51:26
