How to give OCR software the best chance of success?

+3 Q:

I am using Tesseract OCR (via pytesser) and PIL (Python Imaging Library) for automated testing of an application.

I check that the displayed text is correct by taking a screenshot and extracting the text with Tesseract.

I had some issues in the beginning, and it seems to work better since I increased the size of the screenshot using PIL's bicubic interpolation.

Unfortunately, I still get some mistakes, such as confusion between '0' and 'O'. I expect to run into other similar issues in the future.

I would like to know if there are techniques for preparing an image in order to help the OCR. Any ideas are welcome.

Thanks in advance

For distinguishing between 0 and O, one simple solution is to choose a font that distinguishes between the two (e.g. a 0 with a slash or dot in its middle). Would that be acceptable in your application?

Another solution is to apply a dictionary-based step after the character-by-character analysis of the text - feeding the recognized text into some form of spell-checker or validator to differentiate between difficult characters.

For instance, a round symbol followed by other digits is most likely a zero, while the same symbol followed by letters is most likely a capital O. It's a trivial example, but it shows how context is necessary to make a more reliable OCR system.
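A minimal sketch of that context rule, assuming plain Python post-processing of the string returned by the OCR step. The function name `fix_zero_oh` and the majority-vote rule per token are illustrative assumptions, not part of Tesseract or pytesser:

```python
import re

def fix_zero_oh(text):
    """Resolve '0' / 'O' confusion token by token, using the other characters."""
    def fix_token(match):
        token = match.group(0)
        # Characters that give context: everything except the ambiguous ones.
        others = [c for c in token if c not in "0O"]
        digits = sum(c.isdigit() for c in others)
        letters = sum(c.isalpha() for c in others)
        if digits > letters:
            return token.replace("O", "0")   # mostly digits: treat as a number
        if letters > digits:
            return token.replace("0", "O")   # mostly letters: treat as a word
        return token                         # no usable context: leave as-is
    return re.sub(r"[A-Za-z0-9]+", fix_token, text)

print(fix_zero_oh("Total: 1O5 items in R0W 3"))
```

This will mis-correct genuinely mixed tokens (identifiers such as "A0", for example), so a real validator would more likely check tokens against the list of strings the application is expected to display.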

Kena 2009-08-26 15:36:09

Unfortunately, I don't have control over the font. Can you please explain a little more what you mean by the dictionary-based step?

luc 2009-08-26 15:49:28

Even under the best conditions OCR variants will sneak up on you. Your best option will be to design your tests to be aware of them.
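One way to make a test aware of such variants is to compare the expected and recognized strings only after mapping easily confused characters to a single representative. The following is a rough sketch; the `CONFUSABLE` table and the helper names are hypothetical and would need tuning against the mistakes actually seen in your runs:

```python
# Hypothetical table of characters the OCR step tends to confuse; each one is
# mapped to a single representative before comparing.
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "|": "1"})

def canonical(text):
    """Map confusable characters to one form and collapse runs of whitespace."""
    return " ".join(text.translate(CONFUSABLE).split())

def ocr_matches(expected, recognized):
    """True when the two strings differ only by known OCR confusions."""
    return canonical(expected) == canonical(recognized)

print(ocr_matches("Order 100 OK", "0rder  1OO 0K"))
print(ocr_matches("File saved", "File moved"))
```

The trade-off is that such a test can no longer catch a real bug where the application displays 'O' instead of '0', so the table should stay as small as possible.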

mlk 2009-08-26 15:44:59

+1 A:

Shameless plug and disclaimer: my company packages Tesseract for use in .NET

Tesseract is an OK OCR engine. It can miss a lot and gets readily confused by non-text. The best thing you can do for it is to make sure it gets text only. The next best thing is to give it something sanely binarized (adaptive or dynamic threshold to get there) or grayscale and let it try to do binarization.
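To illustrate what an adaptive threshold does, here is a small pure-Python sketch that works on a grayscale image given as rows of 0-255 values. That representation, the function name, and the default `radius` and `offset` values are assumptions made for the example; in practice you would run an equivalent operation from an imaging library on the cropped text region before handing it to Tesseract:

```python
def adaptive_threshold(gray, radius=1, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 values.

    A pixel becomes black (0) when it is darker than the mean of its
    neighbourhood minus `offset`; otherwise it becomes white (255).
    """
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clip the window at the image borders.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(row)
    return result

bright = [[200, 200, 200], [200, 50, 200], [200, 200, 200]]
dim = [[90, 90, 90], [90, 10, 90], [90, 90, 90]]
print(adaptive_threshold(bright))
print(adaptive_threshold(dim))
```

Because each pixel is compared with its own neighbourhood rather than with one global cut-off, the dark stroke is separated from the background in both the bright and the dim sample.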

plinth 2009-08-26 18:56:15

I agree with that. It was confused by a dialog box edge, which it converted to an 'I'. When it gets text-only images, it does a good job. Binarization is also a good idea. Thanks.

luc 2009-08-27 07:01:07
