I am using Tesseract OCR (via pytesser) and PIL (Python Imaging Library) for automated testing of an application.

I check that the displayed text is correct by taking a screenshot and extracting the text from it with Tesseract.

I had some issues in the beginning, and it seems to work better since I started enlarging the screenshot with PIL's bicubic interpolation.

Unfortunately, I still get some mistakes, such as confusion between '0' and 'O'. I expect to run into other similar issues in the future.

I would like to know if there are techniques for preparing an image to help the OCR. Any ideas are welcome.

Thanks in advance

A: 

For distinguishing between 0 and O, one simple solution is to choose a font that distinguishes between the two (e.g. one where 0 has a slash or dot in its middle). Would that be acceptable in your application?

Another solution is to apply a dictionary-based step after the character-by-character analysis of the text: feed the recognized text into some form of spell-checker or validator to differentiate between difficult characters.

For instance, a round symbol surrounded by digits is most likely a zero, while the same symbol surrounded by letters is most likely a capital O. It's a trivial example, but it shows how context is necessary to make a more reliable OCR system.
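
To make the idea concrete, here is a minimal sketch of such a post-processing step, using only the Python standard library. The function names (`fix_token`, `fix_text`, `snap_to_vocabulary`), the majority rule and the 0.8 cutoff are assumptions made for illustration; they are not part of Tesseract or pytesser. The first part uses neighbouring characters to decide between 0 and O, and the second part snaps a word to the closest entry in a list of strings your application is expected to display.

```python
import difflib
import re


def fix_token(token):
    """Resolve 0/O inside one token from its unambiguous neighbours."""
    # Count only characters that cannot be confused with each other.
    digits = sum(c.isdigit() and c != "0" for c in token)
    letters = sum(c.isalpha() and c != "O" for c in token)
    if digits > letters:
        return token.replace("O", "0")  # mostly digits: treat O as zero
    if letters > digits:
        return token.replace("0", "O")  # mostly letters: treat 0 as capital O
    return token  # no clear majority: leave the token untouched


def fix_text(text):
    """Apply fix_token to every whitespace-separated token."""
    return re.sub(r"\S+", lambda m: fix_token(m.group()), text)


def snap_to_vocabulary(word, vocabulary, cutoff=0.8):
    """Return the closest known word, or the input if nothing is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


if __name__ == "__main__":
    print(fix_text("Total: 1O5 items"))
    print(snap_to_vocabulary("Cance1", ["Cancel", "Save", "Open"]))
```

The majority rule only handles capital O and is deliberately conservative: mixed tokens such as "A1" are left alone. The vocabulary step works best when the set of expected strings is small and known, which is often the case for UI labels.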

Kena
Unfortunately, I don't have control of the font. Can you please explain a little more what you mean by the dictionary-based step?
luc
A: 

Even under the best conditions OCR variants will sneak up on you. Your best option will be to design your tests to be aware of them.
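
One way to make a test aware of such variants is to compare the expected and recognized strings after mapping commonly confused characters to a single representative. The sketch below is only an illustration: the `CONFUSABLE` table and the helper names are hypothetical, and the table should be adjusted to the confusions you actually observe.

```python
# Hypothetical confusion classes: each character maps to one representative.
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1", "|": "1"})


def normalize(text):
    """Collapse confusable characters and runs of whitespace."""
    return " ".join(text.translate(CONFUSABLE).split())


def ocr_equal(expected, recognized):
    """Compare two strings while ignoring known OCR confusions."""
    return normalize(expected) == normalize(recognized)


if __name__ == "__main__":
    print(ocr_equal("Order 1001", "0rder l0O1"))
    print(ocr_equal("Save", "Open"))
```

The trade-off is that such a comparison can no longer catch a genuine bug where the application displays 'O' instead of '0', so it is worth keeping the confusion table as small as possible.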

mlk
+1  A: 

Shameless plug and disclaimer: my company packages Tesseract for use in .NET

Tesseract is an OK OCR engine. It can miss a lot and gets readily confused by non-text. The best thing you can do for it is to make sure it gets text only. The next best thing is to give it something sanely binarized (adaptive or dynamic threshold to get there) or grayscale and let it try to do binarization.
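
As a rough illustration of adaptive thresholding, the sketch below binarizes a grayscale image given as a plain list of rows of 0-255 values, so it does not depend on any imaging library. It uses a local-mean rule; the function name, the window `radius` and the `offset` are assumptions chosen for the example, and loading the pixels from the screenshot is left out.

```python
def adaptive_threshold(gray, radius=1, offset=10):
    """Binarize a grayscale image with a local-mean threshold.

    gray is a list of rows of ints in 0..255. A pixel becomes 0 (ink) when
    it is darker than the mean of its neighbourhood minus offset, and 255
    (background) otherwise.
    """
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clip the window at the image borders.
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(row)
    return result


if __name__ == "__main__":
    image = [
        [200, 200, 200],
        [200, 50, 200],
        [200, 200, 200],
    ]
    for line in adaptive_threshold(image):
        print(line)
```

Because the threshold follows the local brightness, text on a dark panel and text on a light panel can both come out as black on white, which a single global threshold often cannot achieve. This pure-Python loop is slow on full screenshots; it is meant to show the rule, not to be used as is.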

plinth
I agree with that. It was confused by a dialog box edge, which it converted to an 'I'. When it gets text-only images, it does a good job. Binarization is also a good idea. Thanks.
luc