I am checking whether a PDF document is searchable by trying to extract text from every single page.

But checking every page takes forever when the PDF has somewhere between 500 and 2,000 pages.

Is it possible for a PDF to contain text on one page but not on the rest? What I would like to do is treat a PDF as searchable if its first page contains text, and as not searchable otherwise.

A: 

Try this version of Searcharoo, which lets you search Word and PDF documents.

Chris Ballance
@Chris: A "searchable PDF" is one whose text you can search *within* the PDF itself, not one you can find by searching the file system.
Sung Meister
+1  A: 

Yes, it is very possible for a PDF to contain text on one page but not on the rest. You could very well have a 500-page PDF that contains only images on the first 499 pages but text on the last page. So checking only the first page is not a reliable test.

Unless you want to open the PDF file yourself and scan it for text/text operations, you will need to use an existing third-party PDF library that allows you to extract text from a PDF.

Also, see Ferruccio's response to a related question, which is to use the IFilter interface, specifically made for search indexing and text extraction.
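Since any page may be the only one with text, one way to avoid extracting everything is to stop at the first page that yields text. Here is a minimal sketch in Python of that early-exit idea. It assumes some library hands you each page's text lazily; the function name `has_any_text` and the `min_chars` threshold are made up for illustration, and the pypdf usage in the comment is an assumption about which library you might pick, not something from the question.

```python
from typing import Iterable


def has_any_text(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True as soon as one page yields at least `min_chars`
    non-whitespace characters; return False if no page does.

    `page_texts` should be a lazy iterable (e.g. a generator) so that
    pages after the first hit are never extracted.
    """
    for text in page_texts:
        # Ignore whitespace so a page of blank lines does not count as text.
        if len("".join(text.split())) >= min_chars:
            return True
    return False


# Hypothetical usage with pypdf (assumes `pip install pypdf`):
#
#   from pypdf import PdfReader
#   reader = PdfReader("document.pdf")
#   searchable = has_any_text(
#       (page.extract_text() or "") for page in reader.pages
#   )
```

Note that this only speeds up the "yes" case. A PDF made up entirely of scanned images still forces a pass over every page before you can answer "no", and a PDF with a single text page near the end will be almost as slow.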

Tarsier