Hi all,

I’m trying to extract numbers ranging from 1 to 99 from a picture. I’ve tried several OCR methods using PHP, but my script eventually fails because the numbers are occasionally rotated 5% to the left or right, which makes them unrecognizable.

View example of my numbers: http://d.imagehost.org/view/0373/screendump

I’ve now installed Ocropus (http://code.google.com/p/ocropus/) as a test. Unfortunately, it does not give me the correct numbers every time, which leads me to think that my pictures are not optimized enough.

Does anyone have tips or ideas on how to improve the readability of the numbers?

I would also be grateful for ideas on how to locate the numbers in the picture.

Regards

A: 

Is it acceptable to use an external (web-based) API for your solution? If so, please consider http://www.wisetrend.com/wisetrend_ocr_cloud.shtml (a REST API for OCR)

It can automatically correct for image rotation; try tweaking the Deskew and AnalysisMode parameters described in http://www.wisetrend.com/WiseTREND_Online_OCR_API_v2.0.htm

(Also, when using the API, make sure that the image resolution is correctly set in the input image header - it can make all the difference in recognition quality).

Eugene Osovetsky
A: 

It seems that Tesseract / Ocropus are getting confused by the skew, and it could be that having multiple skewed numbers on the same line is what confuses them.

Are you passing in the whole image as a grid of numbers? Have you tried sending each box (number) individually as a separate image to the OCR engine? You may find you get better results.
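To illustrate the per-box idea, here is a minimal sketch that computes one crop rectangle per cell. It assumes the numbers sit in a regular grid whose row and column counts are known; the function name `cell_boxes` and the `margin` parameter are made up for this example. The boxes use (left, top, right, bottom) order, which is the order Pillow's `Image.crop` accepts, but any imaging library with a crop call would do.

```python
def cell_boxes(width, height, rows, cols, margin=0):
    """Return (left, top, right, bottom) boxes for a regular rows x cols grid.

    Assumption: cells are evenly spaced across the image. `margin` shrinks
    each box on every side so that grid lines are not passed to the OCR engine.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols + margin
            top = r * height // rows + margin
            right = (c + 1) * width // cols - margin
            bottom = (r + 1) * height // rows - margin
            boxes.append((left, top, right, bottom))
    return boxes


# Example: a 100x50 pixel image holding 2 rows of 4 numbers.
for box in cell_boxes(100, 50, rows=2, cols=4, margin=2):
    print(box)
```

Each box can then be cropped out and sent to the OCR engine as its own image, so one skewed number cannot disturb the line detection for its neighbours.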

Have you tried any other OCR engines? Do you require it to be open source?

I ran the image through a cheaper commercial OCR engine and all numbers were recognised correctly. So another option is to wrap a commercial OCR engine quite quickly with C# or C++ code and an interface to deliver improved results.

Andrew Cash
I'm passing the images one by one. I ended up making a PHP function that rotates the image a couple of degrees and retries the recognition process. This has made the process succeed 96% of the time.
kris
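For readers who want to try the same rotate-and-retry idea, here is a rough sketch in Python rather than PHP. It is not kris's actual function: the angle list, the name `recognize_with_rotation`, and the two callables `rotate(image, angle)` and `recognize(image)` are all assumptions that you would back with your own imaging library and OCR engine.

```python
def recognize_with_rotation(image, rotate, recognize,
                            angles=(0, -2, 2, -4, 4, -6, 6)):
    """Try OCR at several small rotations and return the first valid result.

    `rotate(image, angle)` and `recognize(image)` are caller-supplied
    (hypothetical) functions. A result counts as valid only if it is a
    whole number from 1 to 99, the range given in the question.
    Returns (number, angle) on success, or (None, None) if every angle fails.
    """
    for angle in angles:
        candidate = image if angle == 0 else rotate(image, angle)
        text = recognize(candidate).strip()
        if text.isdecimal() and 1 <= int(text) <= 99:
            return int(text), angle
    return None, None


if __name__ == "__main__":
    # Stand-in "image": just its skew in degrees. The fake OCR only
    # succeeds when the remaining skew is at most one degree.
    fake_rotate = lambda skew, angle: skew + angle
    fake_ocr = lambda skew: "42" if abs(skew) <= 1 else "??"
    print(recognize_with_rotation(4, fake_rotate, fake_ocr))
```

Checking the result against the known 1-99 range is what lets the loop tell a failed read from a good one without a human looking at it.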
Try playing around with the size of the white space border around the outside of the image that you pass in for recognition if you are using Tesseract directly. You may find there is an optimal size with Tesseract. If you are using Ocropus to zone the field then it probably won't change much. I haven't used Tesseract much but I know other engines can be sensitive to the whitespace border size around the characters.
Andrew Cash
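A small sketch of the border experiment, assuming the cropped number is available as a grayscale image stored as a list of pixel rows with 255 meaning white. The helper name `add_border` is made up for this example; most imaging libraries offer an equivalent padding call.

```python
def add_border(pixels, border, fill=255):
    """Return a copy of `pixels` surrounded by `border` pixels of `fill`.

    Assumption: `pixels` is a rectangular list of rows of grayscale values.
    """
    width = len(pixels[0]) if pixels else 0
    blank = [fill] * (width + 2 * border)
    padded = [list(blank) for _ in range(border)]          # top rows
    for row in pixels:
        padded.append([fill] * border + list(row) + [fill] * border)
    padded.extend(list(blank) for _ in range(border))      # bottom rows
    return padded


# Build a few variants to feed to the OCR engine and compare results.
glyph = [[0, 0], [0, 0]]
variants = {size: add_border(glyph, size) for size in (0, 2, 5, 10)}
for size, image in variants.items():
    print(size, len(image), "x", len(image[0]))
```

Running the same set of known numbers through each variant and counting correct reads should show whether a particular border size helps.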
BTW: Well done on the PHP rotation. It would be interesting to see some of the 4% that failed. Are you able to post an image of some failed numbers and the result the OCR engine gives? We may be able to comment on why they are failing.
Andrew Cash