answers:

5

What is the best way to programmatically check whether a PDF file consists entirely of scanned pages? I have iText and PDFBox at my disposal. I can check whether a PDF file contains text and decide from the result whether the file has been OCRed, but this solution is not 100% accurate. I'd like to know whether there is another way to tackle the problem.

As you will understand, the solution must be Java-based.

+1  A: 

IMHO you cannot decide that for sure. But you can try a few things: look for text, OCR the PDF and decide based on the amount of recognized text, or look for some basic scanning errors like fade-outs or paper/book margins.

Gabriel Ščerbák
"... basic scanning errors like fade-outs or paper/book margins." seems to be a good idea.
Alex
A: 

Do you have any knowledge of how the document would have been scanned, if it was? While the "Creator" metadata item is not mandatory, it could possibly be a useful clue if your scanner sets it.
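A minimal sketch of how such a metadata clue could be checked in Java, assuming the "Creator" value (and the similar "Producer" value from the same PDF info dictionary) has already been read with iText or PDFBox. The class name `CreatorHint` and the keyword list are hypothetical; the keywords are placeholders to be replaced with whatever your own scanner actually writes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class CreatorHint {
    // Hypothetical keyword list: replace with the strings your scanners really set.
    static final List<String> SCANNER_HINTS = Arrays.asList("scan", "twain", "copier");

    /** Returns true if either metadata value contains one of the hint keywords. */
    static boolean suggestsScanner(String creator, String producer) {
        return containsHint(creator) || containsHint(producer);
    }

    private static boolean containsHint(String value) {
        if (value == null) {
            return false; // both entries are optional, so a missing value tells us nothing
        }
        String lower = value.toLowerCase(Locale.ROOT);
        for (String hint : SCANNER_HINTS) {
            if (lower.contains(hint)) {
                return true;
            }
        }
        return false;
    }
}
```

A `true` result is only a clue, and `false` proves nothing, since the entry may be absent or rewritten by later processing.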

Matthew Flynn
+1  A: 

Your best bet might be to check whether it has text and also whether it contains a large page-sized image or lots of tiled images that cover the page. If you also check the metadata, this should cover most cases.
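The image check could be sketched roughly as follows, assuming the rectangles where images are drawn on a page (in page units) and the number of extracted text characters have already been obtained with iText or PDFBox. The names `PageCoverage`, `Rect`, `coveredFraction` and `looksScanned`, the 100x100 sampling grid and the 0.9 threshold are all assumptions chosen for illustration. Sampling a grid instead of summing areas keeps overlapping tiles from being counted twice.

```java
import java.util.List;

final class PageCoverage {
    /** Axis-aligned rectangle in page units; hypothetical helper type. */
    static final class Rect {
        final double x, y, w, h;
        Rect(double x, double y, double w, double h) {
            this.x = x; this.y = y; this.w = w; this.h = h;
        }
        boolean contains(double px, double py) {
            return px >= x && px < x + w && py >= y && py < y + h;
        }
    }

    /** Fraction of the page (0..1) covered by at least one image rectangle. */
    static double coveredFraction(double pageW, double pageH, List<Rect> images) {
        final int n = 100; // grid resolution, an arbitrary choice
        int covered = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double cx = (i + 0.5) * pageW / n; // centre of grid cell (i, j)
                double cy = (j + 0.5) * pageH / n;
                for (Rect r : images) {
                    if (r.contains(cx, cy)) {
                        covered++;
                        break; // count each cell once, even if tiles overlap
                    }
                }
            }
        }
        return covered / (double) (n * n);
    }

    /** Assumed rule: no extractable text and images covering at least 90% of the page. */
    static boolean looksScanned(int textChars, double coverage) {
        return textChars == 0 && coverage >= 0.9;
    }
}
```

A scanned page that carries a hidden OCR text layer would fail the `textChars == 0` test, so the rule may need loosening if such files should count as scanned.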

mark stephens
Mark, your answer is the closest one to what I had in mind. Combining it with Rowan's suggestion of checking for text/font resources and Gabriel's fade-outs or paper/book margins seems to be a good starting point for me. Thanks, Alex
Alex
A: 

I simply judge that by size. Scanned documents are unreasonably large. For documents of up to 1000 pages, my rule of thumb is: a true text PDF is 1-20 MB, while a scanned one can be 30 to 100 MB.
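One way to turn this rule of thumb into code is to normalize by page count, which is an extrapolation rather than part of the rule itself: 20 MB over 1000 pages is about 20 KB per page, and 30 MB over 1000 pages is about 30 KB per page. The sketch below assumes decimal megabytes and that the page count comes from iText or PDFBox; `SizeHeuristic` and its thresholds are hypothetical names and values.

```java
final class SizeHeuristic {
    enum Guess { LIKELY_TEXT, UNCLEAR, LIKELY_SCANNED }

    // Assumed thresholds, derived from 20 MB and 30 MB spread over 1000 pages.
    static final long TEXT_MAX_BYTES_PER_PAGE = 20_000L;
    static final long SCAN_MIN_BYTES_PER_PAGE = 30_000L;

    /** Classifies a document by its average number of bytes per page. */
    static Guess guess(long fileBytes, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        long perPage = fileBytes / pageCount;
        if (perPage <= TEXT_MAX_BYTES_PER_PAGE) {
            return Guess.LIKELY_TEXT;
        }
        if (perPage >= SCAN_MIN_BYTES_PER_PAGE) {
            return Guess.LIKELY_SCANNED;
        }
        return Guess.UNCLEAR; // between the two ranges the rule says nothing
    }
}
```

The file size itself can be read with `java.nio.file.Files.size(path)`. Heavily compressed scans or text PDFs with many embedded fonts and figures can land on the wrong side of these thresholds.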

dgg32
A: 

You can check whether a PDF has any font resources (a pretty good indication of whether or not the document contains any text) using the HasFontResources function in Quick PDF Library Lite, a free ActiveX component, which you could theoretically use from Java with the assistance of a third-party add-on.

Checking for text/font resources is the most accurate method for determining whether a PDF may have been generated by a scanning process, especially when coupled with Mark Stephens' suggestion of looking for a large page-sized image, etc.

But unfortunately, there isn't any 100% guaranteed accurate method for checking to see if a PDF was scanned.
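Since the question is about Java, the font check could be applied per page and then summarized, so that a fully scanned file can be told apart from one that mixes scanned and text pages. The sketch assumes one boolean per page, obtained from each page's resource dictionary with iText or PDFBox; `FontSignal` and its verdict names are hypothetical.

```java
final class FontSignal {
    enum Verdict { NO_FONTS_ANYWHERE, MIXED, FONTS_ON_EVERY_PAGE }

    /**
     * Summarizes per-page font presence.
     * pageHasFonts[i] is true if page i declares at least one font resource.
     */
    static Verdict summarize(boolean[] pageHasFonts) {
        if (pageHasFonts.length == 0) {
            throw new IllegalArgumentException("document has no pages");
        }
        int withFonts = 0;
        for (boolean hasFonts : pageHasFonts) {
            if (hasFonts) {
                withFonts++;
            }
        }
        if (withFonts == 0) {
            return Verdict.NO_FONTS_ANYWHERE;   // candidate for "totally scanned"
        }
        if (withFonts == pageHasFonts.length) {
            return Verdict.FONTS_ON_EVERY_PAGE; // text-based, or scanned with an OCR layer
        }
        return Verdict.MIXED;
    }
}
```

Only `NO_FONTS_ANYWHERE` points towards a fully scanned document, and even then it remains an indication rather than proof.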

Rowan