The problem
I am trying to improve the result of an OCR process by combining the output from three different OCR systems (tesseract, cuneinform, ocrad). I already do image preprocessing (deskewing, despeckling, threholding and some more). I don't think that this part can be improved much more. Usually the text to recognize is between one and 6 words long. The lanuage of the text is unknown and quite often they contain fantasy words. I am on Linux. Preferred language would be Python.
What I have so far
Often every result has one or two errors. But they have errors at different characters/positions. Errors could be that they recognize a wrong character or that they include a non existing character. Not so often they ignore a character.
An example might look in the following way:
Xorem_ipsum
lorXYm_ipsum
lorem_ipuX
A X is a wrong recognized character and an Y is a character which does not exist in the text. Spaces are replaced by "_" for better readibilty.
In cases like this I try to combine the different results. Using repeatedly the "longest common substring" algorithm between the three pairs I am able to get the following structure for the given example
or m_ipsum
lor m_ip u
orem_ip u
But here I am stuck now. I am not able to combine those pieces to a result.
The questions
Do you have
- an idea how to combine the different common longest substrings?
- Or do you have a better idea how to solve this problem?