ansaurus

Question

OCR error correction: How to combine three erroneous results to reduce errors.

Answer 1

+1 A:

The quality of the results you can expect depends on the OCR engines you are using. A higher-quality OCR engine that reports confidence levels and bounding boxes would give you much better raw results in the first place, plus extra information that could be used to determine the correct result.

Using Linux will restrict the possible OCR engines available to you. Personally I would rate Tesseract as 6.5/10 compared to commercial OCR engines available under Windows.

http://www.abbyy.com/ocr_sdk_linux/overview/ - The SDK may not be cheap though.

http://irislinktest.iriscorporate.com/c2-1637-189/iDRS-14-------Recognition--Image-preprocessing--Document-formatting-and-more.aspx - Available for Linux

http://www.rerecognition.com/ - Is available as a Linux version. This engine is used by many other companies.

All of the engines above should give you confidence levels, bounding boxes and better results than Tesseract OCR.

https://launchpad.net/cuneiform-linux - Cuneiform, now open sourced and running under Linux. This is likely one of the three engines you are using. If not, you should probably look at adding it.

Also you may want to look at http://tev.fbk.eu/OCR/Products.html for more options.

Can you post a sample or two of typical images and the OCR results from the engines? There are other ways to improve OCR recognition, but they would depend on the images.

Andrew Cash 2010-09-10 09:45:29

At the moment I am using Tesseract, Cuneiform and OCRAD. Images will come from different sources. Some will be photos, some scans and some screen prints. Some have background noise, some have non-black characters ...

tobltobs 2010-09-10 10:56:36

What is causing your OCR errors? Is it the zoning algorithms you are using to find the text on the images? Also, are the images skewed? Have you tried to despeckle them? You can also try thresholding the photos to remove the background color in some cases.

Andrew Cash 2010-09-10 11:04:57

Providing good bounding boxes to the OCR engines for reading can also improve recognition rates. Reading mixed documents like yours is going to be a challenge.

Andrew Cash 2010-09-10 11:09:28

I already do image preprocessing (deskewing, despeckling, thresholding and some more). I don't think that this part can be improved much more.

tobltobs 2010-09-10 13:31:47

Sounds like you are doing most of what is possible. The only ways to make things work better are to improve the recognition engines used in order to remove noisy errors, or to start analysing other data such as bounding boxes, character confidence, dictionaries, trigram analysis etc. I would try to determine why the engines are making the errors they are in the first place (low resolution? noise? coloured text?) and see if that can be improved. With another engine or some bounding box info you may get your original proposal working. I would need to see samples and OCR results to make further comments.
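As a rough illustration of how character or word confidence and a dictionary could feed into the choice between engines, here is a minimal Python sketch. It is not taken from any of the engines mentioned. It assumes each engine can report a confidence between 0 and 1 for each word (not every engine does), and the function name `pick_word` and the `bonus` weight are hypothetical choices made for this example only.

```python
from collections import defaultdict

def pick_word(readings, dictionary=frozenset(), bonus=0.25):
    """Choose one word from (text, confidence) pairs, one pair per engine.

    Assumption: confidences lie in [0, 1]. 'bonus' is an arbitrary,
    illustrative weight added once to candidates found in the dictionary.
    """
    scores = defaultdict(float)
    # Readings that agree pool their confidence.
    for text, confidence in readings:
        scores[text] += confidence
    # Candidates that are real words get a small boost.
    for text in list(scores):
        if text.lower() in dictionary:
            scores[text] += bonus
    return max(scores, key=scores.get)

# Two engines agree on "ipsum"; one is confident about "ipsurn".
print(pick_word([("ipsum", 0.6), ("ipsurn", 0.9), ("ipsum", 0.5)]))  # ipsum
```

The weights would need tuning against real output, since confidence values from different engines are not necessarily comparable.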

Andrew Cash 2010-09-11 00:48:03

You may find this question useful. - http://stackoverflow.com/questions/3271174/software-to-improve-ocr-results-based-on-output-from-multiple-ocr-software-packag

Andrew Cash 2010-09-12 03:12:34

Answer 2

+1 A:

Maybe repeat the "longest common substring" until all results are the same. For your example, you would get the following in the next step:

or m_ip u
or m_ip u
or m_ip u

Or run the "longest common substring" algorithm on the first and second string, and then again on that result and the third string. That way you get the same result, or m_ip u, more easily.

You can then assume that these letters are correct. Now look at the gaps. Before "or" there is "l" twice and "X" once, so choose "l". Between "or" and "m_ip" there is "e" twice and "XY" once, so choose "e". And so on.
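A minimal Python sketch of this idea, under some assumptions: it applies the longest common substring to all three readings at once, splits each reading around that shared anchor, and recurses on the left and right parts. Where nothing is shared any more, it takes a majority vote on the remaining fragments. The function names and the sample readings are made up for illustration and are not from the question.

```python
from collections import Counter

def longest_common_substring(strings):
    """Return the longest substring present in every string ('' if none)."""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

def merge(strings):
    """Merge OCR readings: keep shared anchors, majority-vote the gaps."""
    anchor = longest_common_substring(strings)
    if not anchor:
        # Nothing in common: take the most frequent fragment.
        # On a tie, the earliest reading wins.
        return Counter(strings).most_common(1)[0][0]
    lefts, rights = [], []
    for s in strings:
        i = s.find(anchor)  # first occurrence only (see note below)
        lefts.append(s[:i])
        rights.append(s[i + len(anchor):])
    return merge(lefts) + anchor + merge(rights)

# Hypothetical readings from three engines.
readings = ["lorem ipsum", "lorem ipsurn", "1orem ipsum"]
print(merge(readings))  # lorem ipsum
```

Two limitations are worth noting. Using the first occurrence of the anchor can misalign readings that contain repeated text, and when all three engines disagree on a gap the vote simply falls back to the first reading.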

rudi-moore 2010-09-10 10:11:32

Answer 3

A:

I'm new to OCR, but so far I have found that those systems are built to work from a dictionary of words rather than letter by letter. So, if your images don't contain real words, you may have to look more closely at the letter recognition and training part of the systems you are using.

dole 2010-09-15 04:32:23
