The problem

I am trying to improve the result of an OCR process by combining the output from three different OCR systems (tesseract, cuneiform, ocrad). I already do image preprocessing (deskewing, despeckling, thresholding and some more), and I don't think that this part can be improved much more. Usually the text to recognize is between one and six words long. The language of the text is unknown, and it quite often contains fantasy words. I am on Linux, and my preferred language would be Python.

What I have so far

Often each result has one or two errors, but at different characters/positions. The errors are either a misrecognized character or an inserted character that does not exist in the text; less often, a character is dropped entirely.

An example might look like this:

Xorem_ipsum
lorXYm_ipsum
lorem_ipuX

An X stands for a misrecognized character and a Y for a character which does not exist in the text. Spaces are replaced by "_" for better readability.

In cases like this I try to combine the different results. By repeatedly applying the "longest common substring" algorithm to the three pairs, I can derive the following structure for the given example:

or m_ipsum
lor m_ip u
orem_ip u
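
For reference, the pairwise step can be done with Python's difflib; its SequenceMatcher.find_longest_match is effectively a longest-common-substring search. A minimal sketch (the helper name lcs is mine):

    from difflib import SequenceMatcher

    def lcs(a, b):
        # longest common substring of a and b via difflib's block matcher
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        return a[m.a:m.a + m.size]

    print(lcs("Xorem_ipsum", "lorXYm_ipsum"))  # m_ipsum
    print(lcs("Xorem_ipsum", "lorem_ipuX"))    # orem_ip
    print(lcs("lorXYm_ipsum", "lorem_ipuX"))   # m_ip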

But here I am stuck. I am not able to combine those pieces into a final result.

The questions

Do you have

  • an idea how to combine the different longest common substrings?
  • or a better idea how to solve this problem?
+1  A: 

How good a result you can expect depends largely on the OCR engines you are using. You may find that a higher-quality OCR engine that reports confidence levels and bounding boxes gives you much better raw results in the first place, plus extra information that can be used to determine the correct result.

Using Linux will restrict the OCR engines available to you. Personally I would rate Tesseract at 6.5/10 compared to the commercial OCR engines available under Windows.

http://www.abbyy.com/ocr_sdk_linux/overview/ - The SDK may not be cheap though.

http://irislinktest.iriscorporate.com/c2-1637-189/iDRS-14-------Recognition--Image-preprocessing--Document-formatting-and-more.aspx - Available for Linux

http://www.rerecognition.com/ - Available as a Linux version. This engine is used by many other companies.

All of the engines above should give you confidence levels, bounding boxes and better results than Tesseract OCR.

https://launchpad.net/cuneiform-linux - Cuneiform, now open sourced and running under Linux. This is likely one of the three engines you are using. If not, you should probably look at adding it.

Also you may want to look at http://tev.fbk.eu/OCR/Products.html for more options.

Can you post a sample or two of typical images along with the OCR results from the engines? There are other ways to improve OCR recognition, but they would depend on the images.

Andrew Cash
At the moment I am using tesseract, cuneiform and OCRAD. The images will come from different sources: some will be photos, some scans and some screen prints. Some have background noise, some have non-black characters ...
tobltobs
What is causing your OCR errors? Is it the zoning algorithms you are using to find the text in the images? Also, are the images skewed? Have you tried to despeckle them? You can also try thresholding the photos to remove the background color in some cases.
Andrew Cash
Providing good bounding boxes to the OCR engines can also improve recognition rates. Reading mixed documents like yours is going to be a challenge.
Andrew Cash
I already do image preprocessing (deskewing, despeckling, thresholding and some more). I don't think that this part can be improved much more.
tobltobs
Sounds like you are doing most of what is possible. The only ways to make things work better are to improve the recognition engines used in order to remove noisy errors, or to start analysing other data such as bounding boxes, character confidence, dictionaries, trigram analysis etc. I would try to determine why the engines are making the errors they are in the first place - low resolution? Noise? Coloured backgrounds? - and see if that can be improved. With another engine or some bounding box info you may get your original proposal working. I would need to see samples and OCR results to make further comments.
Andrew Cash
You may find this question useful. - http://stackoverflow.com/questions/3271174/software-to-improve-ocr-results-based-on-output-from-multiple-ocr-software-packag
Andrew Cash
+1  A: 

Maybe repeat the "longest common substring" step until all results are the same. For your example, the next step would give:

or m_ip u
or m_ip u
or m_ip u

Or run the "longest common substring" algorithm on the first and second string, and then again on that result and the third string. That way you arrive at the same result, or m_ip u, more easily.

So you can assume that the letters in the common structure are correct. Now look at the gaps between them: before or there is twice an l and once an X, so choose l. Between or and m_ip there is twice an e and once XY, so choose e. And so on.
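
A minimal sketch of this combination in Python, assuming difflib for the longest-common-substring step. The helper names lcs, anchors and vote_gaps are made up for illustration; greedy first-occurrence splitting and min_len=1 (which keeps even single-character anchors like the u) are simplifications that may misalign on longer texts:

    from collections import Counter
    from difflib import SequenceMatcher

    def lcs(a, b):
        # longest common substring via difflib's block matcher
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        return a[m.a:m.a + m.size]

    def anchors(strings, min_len=1):
        # fold the longest common substring over all strings, then recurse
        # on the pieces left and right of it; returns the ordered anchor
        # pieces shared by every string
        common = strings[0]
        for s in strings[1:]:
            common = lcs(common, s)
        if len(common) < min_len:
            return []
        lefts = [s.split(common, 1)[0] for s in strings]
        rights = [s.split(common, 1)[1] for s in strings]
        return anchors(lefts, min_len) + [common] + anchors(rights, min_len)

    def vote_gaps(strings, pieces):
        # cut each string at the shared anchors, then majority-vote the
        # gap segments in between (ties go to the first result seen)
        per_string = []
        for s in strings:
            gaps, rest = [], s
            for p in pieces:
                gap, rest = rest.split(p, 1)
                gaps.append(gap)
            gaps.append(rest)  # trailing gap after the last anchor
            per_string.append(gaps)
        out = []
        for i, column in enumerate(zip(*per_string)):
            out.append(Counter(column).most_common(1)[0][0])
            if i < len(pieces):
                out.append(pieces[i])
        return "".join(out)

    results = ["Xorem_ipsum", "lorXYm_ipsum", "lorem_ipuX"]
    pieces = anchors(results)          # ['or', 'm_ip', 'u']
    print(vote_gaps(results, pieces))  # lorem_ipsum

With three engines a two-against-one vote settles most positions; with only two results you would need an extra tie-breaker, e.g. per-character confidence values from the engines.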

rudi-moore
A: 

I'm new to OCR, but so far I have found that these systems are built to work from a dictionary of words rather than letter by letter. So if your images don't contain real words, you may have to look more closely at the letter recognition & training parts of the systems you are using.

dole