views: 243
answers: 2
According to Wikipedia, "The accurate recognition of Latin-script, typewritten text is now considered largely a solved problem on applications where clear imaging is available such as scanning of printed documents." However, it gives no citation.

My question is: is this true? Is the current state-of-the-art so good that - for a good scan of English text - there aren't any major improvements left to be made?

Or, a less subjective form of this question is: how accurate are modern OCR systems at recognising English text for good quality scans?

+2  A: 

I think that it is indeed a solved problem. Just have a look at the plethora of OCR technology articles and libraries for C#, C++, Java, etc.

Of course, the article does stress that the script needs to be typewritten and clear. That makes recognition a relatively trivial task, whereas OCR of noisy scans or of handwriting (where strokes diffuse and vary) gets trickier, as there are more things to tune correctly.
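To show how little work a clean, typewritten scan usually needs, here is a minimal sketch using the open-source Tesseract engine through the pytesseract wrapper. The file name clean_scan.png is just a placeholder, and this is only an illustration of the "plug in an off-the-shelf engine" workflow, not anyone's specific setup:

    # Minimal OCR sketch: assumes Tesseract and pytesseract are installed,
    # and that clean_scan.png is a well-lit scan of printed English text.
    from PIL import Image
    import pytesseract

    image = Image.open("clean_scan.png")

    # For clean printed text the default settings are usually enough;
    # 'eng' selects the English language model.
    text = pytesseract.image_to_string(image, lang="eng")
    print(text)

For clear input like this, the engine's defaults typically get you most of the way without any tuning.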

_NT
+1  A: 

Considered narrowly as breaking up a sufficiently high-quality 2D bitmap into rectangles, each containing an identified Latin character from a set of well-behaved, prespecified fonts (cf. Omnifont), it is a solved problem.

Start to play about with those parameters (eccentric unknown fonts, noisy scans, Asian characters, and so on) and it starts to become somewhat flaky or to require additional input. Many well-known Omnifont systems do not handle ligatures well.
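To give a feel for the kind of "additional input" a noisy scan can need, here is a rough sketch of typical preprocessing (denoising plus Otsu binarization) before handing the page to an OCR engine. The file name noisy_scan.png and the blur kernel size are illustrative assumptions, not prescribed values:

    # Preprocessing sketch for a noisy scan before OCR.
    # Assumes OpenCV (cv2), pytesseract and Tesseract are installed;
    # noisy_scan.png is a placeholder file name.
    import cv2
    import pytesseract

    image = cv2.imread("noisy_scan.png", cv2.IMREAD_GRAYSCALE)

    # Median blur knocks out salt-and-pepper noise from the scanner.
    denoised = cv2.medianBlur(image, 3)

    # Otsu's method picks a global threshold automatically,
    # separating dark ink from the lighter paper background.
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    text = pytesseract.image_to_string(binary, lang="eng")
    print(text)

Once you have to pick blur kernels, thresholds, or deskewing steps per document, the problem stops being "solved out of the box" and becomes an exercise in tuning.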

And the main problem with OCR is making sense of the output. If that were a solved problem, Google Books would give flawless results.

Charles Stewart