Reliably extracting identity fields from scanned documents / images?

I have to pull two pre-printed (not hand-written) fields out of a paper form so that it can be automatically routed after being scanned. The fields contain batch and item identifiers, like "GG-9192" or "EPN/245G".

I've tried the following software:

Tesseract-OCR
Cuneiform
Canon ImageRunner built-in OCR
Asprise OCR Java API (demo)

I've tried the following settings:

Scanning at resolutions of 300 dpi and 600 dpi
Printing the fields in different fonts, including OCR-A and OCR-B

In all cases output was pretty much all over the place. I can kick back documents for which I can't properly extract the necessary information, but I'm thinking it's going to be at least half of them. I considered some sort of fuzzy logic based on known values in a database, but sometimes these identifiers can differ by a single character, like "123G" and "123C".
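To make the fuzzy-matching idea concrete, here is a minimal sketch in Java. It assumes the set of valid identifiers can be loaded from the database, and it only auto-corrects when exactly one known identifier is a single substituted character away; anything ambiguous is kicked back. The class and method names (`IdentifierMatcher`, `resolve`) are made up for illustration, not taken from any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: resolves raw OCR text against identifiers known to the database.
final class IdentifierMatcher {

    private final Set<String> knownIds;

    IdentifierMatcher(Set<String> knownIds) {
        this.knownIds = knownIds;
    }

    // Returns the matched identifier, or empty if the document should be kicked back.
    Optional<String> resolve(String ocrText) {
        String candidate = ocrText.trim().toUpperCase(Locale.ROOT);
        if (knownIds.contains(candidate)) {
            return Optional.of(candidate); // exact hit
        }
        // Collect every known identifier that differs by exactly one character.
        List<String> near = new ArrayList<>();
        for (String id : knownIds) {
            if (differsByOneSubstitution(candidate, id)) {
                near.add(id);
            }
        }
        // Accept only an unambiguous correction; "123G" vs "123C" style ties are rejected.
        return near.size() == 1 ? Optional.of(near.get(0)) : Optional.empty();
    }

    // True when both strings have the same length and differ in exactly one position.
    static boolean differsByOneSubstitution(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i) && ++diff > 1) {
                return false;
            }
        }
        return diff == 1;
    }

    public static void main(String[] args) {
        IdentifierMatcher m = new IdentifierMatcher(Set.of("GG-9192", "123G", "123C"));
        System.out.println(m.resolve("GG-9I92")); // one near match: corrected to GG-9192
        System.out.println(m.resolve("123O"));    // near both 123G and 123C: kicked back
        System.out.println(m.resolve("123G"));    // exact hit
    }
}
```

This does not solve the real problem: if the OCR engine reads "123G" as "123C" and both exist, the result is an exact hit on the wrong identifier, and nothing in this check can detect that.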

Is this a lost cause? Perhaps OCR just isn't mature enough to handle a requirement of this nature? What other techniques might you recommend? Barcodes?

Edit: the containing application is in Java, so recommendations that come with free or cheap Java-based APIs would help most.

Edit 2: if anyone is interested... without any special tuning, Cuneiform for Linux and the Canon ImageRunner worked best, with Tesseract-OCR and the Asprise Java API producing the worst results. None of the four was acceptable for anything but standard document-search-grade OCR. I'm beginning to think that this isn't going to work out.

+2 A:

If you have control over the fields, why use a human-readable format in the first place? For scanning, it seems like a QR Code, or something similar would be best. It is marked for orientation, and has some built-in error correction.

http://en.wikipedia.org/wiki/QR_Code

Jeff B 2009-11-17 22:26:02

Thanks. I guess I don't have complete control. One of the identifiers is typed in before printing. I'd have to muck something up so that the QR code would be generated and printed in Word when the document was initially prepared.

Boden 2009-11-17 23:00:04

From a quick Google search, it seems there are already some solutions for inserting QR codes and other barcodes into a Word document. Not sure about the expense, but QR codes are an "open" format, so you can find code to generate your own, maybe even with a Visual Basic script.

Jeff B 2009-11-18 00:33:32

+1 A:

It looks like you tried several not-very-good OCR tools and have now come to the conclusion that OCR as a whole is not mature. Why didn't you try a leading OCR engine like ABBYY?

Tomato 2009-11-20 14:50:39

I concluded nothing, I simply asked whether it's possible or if this is what is to be expected of OCR technology; e.g. "Is this a lost cause?" All of the OCR tools I tried would have been acceptable for uses in which exact text extraction isn't necessary (i.e. searchable documents). Anyhow, I appreciate the recommendation, and I'll give it a try. (I'm not familiar with leading engines, which is one of the reasons I'm asking here).

Boden 2009-11-20 15:58:29

+1 A:

I started digging for products, starting with Tomato's suggestion. I tried ABBYY and CVISION. Both have products that can automate OCR.

In addition, ABBYY has SDKs for various platforms, and CVISION has an SDK that appears to work with at least VB/VC++.

I haven't tried either SDK yet, and am not sure it's necessary for my project. All I need is PDFs coming in that I can extract the text from. I did however try CVISION's server product and with the OCR on its most accurate settings, it worked really well. I haven't tried ABBYY's server product yet because I have to go through a reseller to get a trial. I'm in the process of doing so, but if it starts getting annoying I'm probably going to go with CVISION. I did try ABBYY's FineReader standalone product, and it worked very well, so I assume that their server product would also.

Boden 2009-11-25 20:34:41
