I am confused about which argument I should pass to the CGPDFDictionaryGetString function for "key". I want to extract text and images from a PDF file.

A: 

Good question. I find this confusing too.

mitul shah
Sagar
+1  A: 

The function you mention is normally used for extracting a String COS object, and will probably be of little direct use in getting the text off the PDF page. COS objects are stored in a tree rooted at the PDF's document catalog, and they come in several types (Dictionary, Array, Number, String, Stream, etc.). Each entry in a dictionary is identified by a key, which is simply the entry's name. In Core Graphics you pass that name as a C string without the leading slash used in PDF syntax (for example "Title" for /Title), and you retrieve the entry with the getter that matches its type:

CGPDFDictionaryGetString(dict, key, &value)
CGPDFDictionaryGetNumber(dict, key, &value)
CGPDFDictionaryGetDictionary(dict, key, &value)
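To make the "key" argument concrete, here is a minimal sketch in C. The helper key_from_pdf_name and the functions print_pdf_title and get_page_objects are hypothetical names made up for illustration; the CGPDF* calls are the Core Graphics ones, and that part is guarded so it only compiles on Apple platforms. It assumes you already have an open CGPDFDocumentRef.

```c
#include <stdio.h>

#ifdef __APPLE__
#include <CoreGraphics/CoreGraphics.h>
#endif

/* Hypothetical helper: the PDF spec writes names with a leading slash
 * (/Title), while the Core Graphics getters expect the bare name ("Title"). */
const char *key_from_pdf_name(const char *name) {
    return (name[0] == '/') ? name + 1 : name;
}

#ifdef __APPLE__
/* Prints the Title entry of the document's Info dictionary, if present.
 * The raw bytes are printed as-is; a title stored as UTF-16 would need
 * converting first (CGPDFStringCopyTextString does that). */
void print_pdf_title(CGPDFDocumentRef doc) {
    CGPDFDictionaryRef info = CGPDFDocumentGetInfo(doc);
    CGPDFStringRef title = NULL;
    if (info != NULL &&
        CGPDFDictionaryGetString(info, key_from_pdf_name("/Title"), &title)) {
        printf("%.*s\n", (int)CGPDFStringGetLength(title),
               (const char *)CGPDFStringGetBytePtr(title));
    }
}

/* Fetches the first page's Contents stream and Resources dictionary.
 * Returns false if either is missing from the page dictionary itself,
 * e.g. when Contents is an array of streams or Resources is inherited
 * from a parent Pages node. */
bool get_page_objects(CGPDFDocumentRef doc,
                      CGPDFStreamRef *contents,
                      CGPDFDictionaryRef *resources) {
    CGPDFPageRef page = CGPDFDocumentGetPage(doc, 1); /* pages are 1-based */
    if (page == NULL) return false;
    CGPDFDictionaryRef dict = CGPDFPageGetDictionary(page);
    return CGPDFDictionaryGetStream(dict, "Contents", contents) &&
           CGPDFDictionaryGetDictionary(dict, "Resources", resources);
}
#endif
```

Each getter returns false when the key is absent or the entry has a different type, so checking the return value is how you find out you picked the wrong getter for a given key.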

I've never had the need to extract the on-page text myself, but looking over a simple PDF file, the on-page text seems to be in the page's "Contents" stream.

So in your case you probably want to do something like: 1) get the document catalog, 2) get the "Pages" dictionary, 3) get the page you are interested in, 4) get that page's "Contents" stream and parse it for the text.
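For step 4, a rough idea of what "parse it for the text" can mean, as a sketch in plain C. It assumes the stream has already been decompressed (in Core Graphics, CGPDFStreamCopyData gives you the decoded bytes) and that the text is drawn with literal strings followed by the Tj operator. The function name extract_tj_text is made up for this example. It ignores TJ arrays, hex strings and font encodings, so treat it as an illustration rather than a real extractor.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copies every literal string "(...)" that is immediately followed by the
 * Tj operator into `out`, one string per line. Returns the number of
 * strings copied. Backslash escapes are simplified: the character after
 * the backslash is kept literally, which is right for \( \) and \\ but
 * not for \n or octal codes.
 */
size_t extract_tj_text(const char *stream, char *out, size_t out_size) {
    size_t count = 0, used = 0;
    const char *p = stream;

    if (out_size == 0) return 0;
    out[0] = '\0';

    while (*p != '\0') {
        char text[256];
        size_t n = 0;
        int depth = 1;

        if (*p != '(') { p++; continue; }
        p++; /* skip the opening parenthesis */

        /* Read up to the matching ')', allowing balanced nested parens. */
        while (*p != '\0') {
            char c = *p++;
            if (c == '\\' && *p != '\0') {
                c = *p++; /* keep the escaped character literally */
            } else if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (--depth == 0) break;
            }
            if (n + 1 < sizeof text) text[n++] = c;
        }
        text[n] = '\0';

        /* The operator follows its operand, separated by whitespace. */
        while (*p == ' ' || *p == '\n' || *p == '\r' || *p == '\t') p++;
        if (p[0] == 'T' && p[1] == 'j') {
            if (used + n + 2 <= out_size) { /* text + '\n' + '\0' */
                memcpy(out + used, text, n);
                used += n;
                out[used++] = '\n';
                out[used] = '\0';
                count++;
            }
            p += 2;
        }
    }
    return count;
}
```

On iOS you would more likely let CGPDFScanner do the tokenizing and register callbacks for Tj and TJ in a CGPDFOperatorTable, instead of scanning the bytes by hand like this.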

Images are normally stored under the page's "Resources" dictionary (which resides at the same level as the "Contents" stream), as image XObjects listed in its "XObject" subdictionary.

If you want to get a better understanding of the COS object tree and its structure, you can view it for the currently viewed PDF using Acrobat's "Preflight" utility. Under the Advanced menu: Preflight... | options | Browse Internal PDF structure...

And of course, flipping through the official PDF specification is a good idea.

Hope that helps!

Michael Marsella