I have 10,000 PDFs. Some have been fully OCRed, and some have only one page OCRed while the rest of the pages have not. How can I go through all the PDFs and OCR only the pages that haven't already been done?
A:
Why don't you re-OCR everything? The amount of time you spend agonizing over repeated work probably exceeds the time taken for the work itself.
dar7yl
2009-10-13 17:18:37
A:
In response to dar7yl: It takes a very, very long time to OCR these 600+ page documents. It must be faster to recognize it as "Already done" and move to the next.
Djokol
2009-10-13 17:31:31
A:
If by OCRed you mean that they contain the text in machine-readable form, you could use a library like Apache PDFBox to try to extract the text from the second page of the document. If it throws an error or returns garbage, it's most likely not OCRed.
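To make the "error or garbage" check concrete, here is a rough sketch of the decision step only. It assumes you have already pulled the text out of each page with some extractor (PDFBox or anything else) and collected it into a list, one string per page, with None where extraction failed. The function names and the thresholds (min_chars, min_ratio) are made up for illustration and would need tuning on your own documents.

```python
import string


def looks_like_text(extracted, min_chars=50, min_ratio=0.7):
    """Heuristic guess: does this extracted string resemble real page text?

    min_chars and min_ratio are illustrative assumptions, not tested defaults.
    """
    if extracted is None:  # extraction failed or returned nothing
        return False
    stripped = "".join(extracted.split())  # drop all whitespace
    if len(stripped) < min_chars:  # too little text to count as OCRed
        return False
    # Count characters that are letters, digits, or ASCII punctuation.
    readable = sum(
        1 for ch in stripped if ch.isalnum() or ch in string.punctuation
    )
    return readable / len(stripped) >= min_ratio


def pages_needing_ocr(page_texts, **kwargs):
    """Return the 1-based numbers of pages whose text fails the heuristic."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not looks_like_text(text, **kwargs)
    ]
```

You would then send only the page numbers returned by pages_needing_ocr to the OCR engine. Pages that legitimately contain very little text (a title page, a full-page figure) will be flagged too, so the worst case is some redundant OCR rather than a missed page.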
mooware
2009-10-13 17:34:41