Batch OCRing PDFs that haven't already been OCR'd. | ansaurus

tags:

ocr
pdf

answers:

3

Q:

Batch OCRing PDFs that haven't already been OCR'd.

I have 10,000 PDFs. Some have been fully OCRed, and some have only one page OCRed while the rest of the pages have not. How can I go through all the PDFs and OCR only the pages that haven't already been done?

A:

Why don't you re-OCR everything? The amount of time you spend agonizing over repeated work probably exceeds the time taken for the work itself.

dar7yl 2009-10-13 17:18:37

A:

In response to dar7yl: It takes a very, very long time to OCR these 600+ page documents. It must be faster to recognize it as "Already done" and move to the next.

Djokol 2009-10-13 17:31:31

A:

If by OCRed you mean that they contain the text in machine-readable form, you could use a library like Apache PDFBox to try to extract the text from the second page of the document. If it throws an error or returns garbage, it's most likely not OCRed.
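The "error or garbage" check above could be sketched roughly as follows. This is only a minimal sketch in Python, not PDFBox code: it assumes the text of each page has already been extracted into a list of strings by whatever extractor you use (with PDFBox, `PDFTextStripper` can be limited to a single page via `setStartPage` and `setEndPage`). The function names and both thresholds are hypothetical and would need tuning against a sample of real documents.

```python
import string

# Hypothetical thresholds; tune them against a sample of your own documents.
MIN_CHARS = 20            # fewer non-space characters than this counts as "no text layer"
MIN_READABLE_RATIO = 0.8  # share of characters that must be letters, digits, or punctuation


def has_text_layer(page_text):
    """Return True if the extracted text looks like real, readable text."""
    # Drop all whitespace so blank or near-blank pages are easy to spot.
    stripped = "".join(page_text.split())
    if len(stripped) < MIN_CHARS:
        return False
    # Count characters that plausibly belong to ordinary text.
    readable = sum(1 for ch in stripped if ch.isalnum() or ch in string.punctuation)
    return readable / len(stripped) >= MIN_READABLE_RATIO


def pages_needing_ocr(page_texts):
    """Return 1-based page numbers whose extracted text is missing or garbage."""
    return [n for n, text in enumerate(page_texts, start=1) if not has_text_layer(text)]


if __name__ == "__main__":
    # Page 1 has a text layer, page 2 is empty, page 3 is undecodable junk.
    sample = [
        "This page already has a proper text layer from an earlier OCR run.",
        "",
        "\ufffd\x01\x02\ufffd\x03\x04" * 5,
    ]
    print(pages_needing_ocr(sample))  # [2, 3]
```

Checking every page rather than only the second one covers the case in the question where a single page has been OCRed and the rest have not. A document whose list comes back empty can be skipped entirely.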

mooware 2009-10-13 17:34:41
