I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one large image, even where the whole page is entirely text. Others were scanned with OCR and contain images and searchable text where text is present. In many cases even words in the images were made searchable.

I want to make an automated process to recognize the text in all of the scanned documents using OCR, with Acrobat 8 Pro, but I don't want to re-OCR the files that have already been through the OCR process in the past. Does anyone know if there is a way to tell which ones contain only images, and which ones already contain searchable text?

I'm planning on doing this in C# or VB.NET, but I don't think telling the two kinds of files apart is language-dependent.

+1  A: 

Various PDF tools can tell you whether a file contains text. Some are available as COM controls, and there may even be native .NET ones.
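
For a quick check without a dedicated tool, here is a rough sketch in Python using only the standard library (the logic should port to C# or VB.NET). It inflates each Flate-compressed stream in the file and looks for text-showing operators (`Tj` or `TJ` between `BT` and `ET`), which OCR'd pages normally contain even when the text is hidden. The name `has_text_operators` is made up for illustration. It assumes unencrypted files with Flate-compressed content streams, silently skips everything else, and can in principle give a false positive on binary data, so treat the result as a hint rather than a definitive answer.

```python
import re
import zlib

# Raw bytes between "stream" and "endstream" (the lookbehind avoids "endstream").
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)
# A text object (BT ... ET) that contains a text-showing operator (Tj or TJ).
TEXT_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def iter_inflated_streams(pdf_bytes: bytes):
    """Yield the decompressed body of every Flate-compressed stream."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (e.g. a JPEG image); skip it


def has_text_operators(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any inflated stream appears to draw text."""
    return any(TEXT_RE.search(body) is not None
               for body in iter_inflated_streams(pdf_bytes))
```

To use it, read each file with `open(path, "rb").read()` and send only the files where `has_text_operators` returns `False` to the OCR step.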

Steven Sudit
Can you recommend one that you know works, or that I should try?
Bratch
http://www.leadtools.com/
Steven Sudit
A: 

Apago's pdfspy extracts information from PDF into an XML file. It includes information about the document including images and text. For your project, the useful information includes image count & size and where there is OCR (hidden) text.

http://www.apagoinc.com/pdfspy

Dwight Kelly
+1  A: 

Open the document in Acrobat. Go to File -> Properties, look in the "Advanced" section, and find the PDF Producer. If it reads something like "Paper Capture...", the file has been OCR'd.

Hope this helps.
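
The same check can be scripted for a batch of files. This is a minimal Python sketch, not Acrobat's API: it reads the `/Producer` entry straight from the file bytes. The helper names `read_producer` and `looks_ocred` are hypothetical, and it assumes the Info dictionary is stored uncompressed and unencrypted with `/Producer` written as a literal `(...)` string. A PDF library would be more reliable for files that do not meet those assumptions.

```python
import re


def read_producer(pdf_bytes: bytes):
    """Return the /Producer literal string found in the bytes, or None.

    Assumption: the Info dictionary is uncompressed and unencrypted, and
    /Producer is a literal string "(...)", not a hex string.
    """
    match = re.search(rb"/Producer\s*\(", pdf_bytes)
    if match is None:
        return None
    out = bytearray()
    depth = 1            # we are just past the opening "("
    i = match.end()
    while i < len(pdf_bytes):
        ch = pdf_bytes[i:i + 1]
        if ch == b"\\":  # escape: keep the next byte as-is (escapes not decoded)
            out += pdf_bytes[i + 1:i + 2]
            i += 2
            continue
        if ch == b"(":   # balanced parentheses may appear unescaped
            depth += 1
        elif ch == b")":
            depth -= 1
            if depth == 0:
                return out.decode("latin-1")
        out += ch
        i += 1
    return None          # unterminated string


def looks_ocred(pdf_bytes: bytes) -> bool:
    """True if the producer mentions Paper Capture, as described above."""
    producer = read_producer(pdf_bytes)
    return producer is not None and "paper capture" in producer.lower()
```

Note that this flags only files produced by Acrobat's Paper Capture; a file with any other producer, or none, is reported as not OCR'd.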

Bob
Right, in my sample sets the image-based PDFs have a blank PDF Producer, and the ones that were OCR'd show "Adobe Acrobat 8.16 Paper Capture Plug-in". But I found another one with selectable text whose producer is "Acrobat Distiller 5.0.5 (Windows)", and another with "http://createpdf.adobe.com v5.1". Others with text show "Microsoft Office Word 2007" and "GPL Ghostscript 8.54". It seems the producer is blank for image-based PDFs and has some other value for PDFs that contain text.
Bratch
+1  A: 
pipitas