I am very new to OCR and know almost nothing about the algorithms used to recognize words. I am just getting familiar with the field.

Could anybody please advise on the typical method used to recognize and separate individual characters in connected script (I mean a word where all the letters are linked together)? Forget about handwriting: supposing the letters are joined using a known font, what is the best method to determine each individual character in a word? When characters are written separately there is no problem, but when they are joined, we need to know where every single character starts and ends before going on to the next step and matching each one with a letter. Is there any known algorithm for that?

+2  A: 

The standard term for this process is "character segmentation" - segmentation is the image processing term for breaking an image into regions that are grouped for recognition. Searching for "Arabic character segmentation" returns a lot of hits on Google Scholar if you want to learn more.
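To make the idea of character segmentation concrete, here is a minimal sketch of one simple approach, a vertical projection profile. It assumes the word is already a binary image (a list of rows, 1 for ink and 0 for background) and that the strokes joining letters are thinner than the letters themselves. The function names and the `max_link_ink` threshold are made up for illustration, and real connected scripts usually need something more robust than this.

```python
def column_ink(image):
    """Count ink pixels (value 1) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def segment_columns(image, max_link_ink=1):
    """Split a word image into column ranges [start, end) of candidate characters.

    Columns with at most `max_link_ink` ink pixels are treated as gaps or thin
    connecting strokes, and are used as cut points (an assumption of this sketch).
    """
    profile = column_ink(image)
    segments = []
    start = None
    for x, ink in enumerate(profile):
        if ink > max_link_ink and start is None:
            start = x                      # a character body begins here
        elif ink <= max_link_ink and start is not None:
            segments.append((start, x))    # the body ended at the previous column
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


# Two "letters" joined by a one-pixel baseline stroke.
word = [
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
]
print(segment_columns(word))                  # [(1, 3), (6, 8)]
print(segment_columns(word, max_link_ink=0))  # [(1, 8)]
```

With the threshold at 0 the joining stroke counts as ink and the word comes back as one piece, which is exactly the difficulty with connected text.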

I'd encourage you to look at Tesseract - an open source OCR implementation - and especially at its documentation.

The entry for "Feature" in the glossary has a bit on this, but there is a ton of information here.

Basically Tesseract solves the problem (from How Tesseract Works) by looking at blobs (not letters) then combining those blobs into words. This avoids the problem you describe, while creating new problems.
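As a rough illustration of what "looking at blobs" means, here is a small connected-component sketch. This is not Tesseract's actual code, just a plain flood fill over a binary image (1 for ink, 0 for background) that returns one bounding box per blob. The function name `find_blobs` and the choice of 8-connectivity are my own assumptions.

```python
from collections import deque


def find_blobs(image):
    """Return bounding boxes (left, top, right, bottom) of 8-connected ink blobs.

    Coordinates are inclusive. `image` is a list of rows of 0/1 values.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] == 0 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cy, cx = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)  # left-to-right order


page = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
]
print(find_blobs(page))  # [(0, 0, 1, 1), (4, 0, 5, 2)]
```

For separated print each blob is roughly a letter (or part of one), but in a fully connected word the whole word comes out as a single blob, so the grouping into words has to happen at a different level.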

For Arabic (as you point out) Tesseract doesn't work. I don't know much about this area, but this paper seems to imply that Dynamic Time Warping (DTW) is a useful technique. It tries to stretch words to match them against known words, and again works in word space rather than letter space.
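To give a feel for DTW, here is a sketch of the textbook dynamic programming version applied to one-dimensional sequences. I am assuming each word image has been reduced to a list of numbers (for example, ink count per column); the paper may well use different features. `dtw_distance`, `closest_word`, and the template values are illustrative only.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two numeric sequences, using |x - y| as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b: b[j-1] also matches a[i-1]
                cost[i][j - 1],      # stretch a: a[i-1] also matches b[j-1]
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


def closest_word(profile, templates):
    """Return the template name whose profile has the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(profile, templates[name]))


# Made-up column profiles for two known words.
templates = {"ab": [1, 3, 3, 1], "cd": [3, 1, 1, 3]}

# A horizontally stretched rendering of "ab" still matches with zero cost.
print(dtw_distance([1, 1, 3, 3, 3, 1], templates["ab"]))  # 0.0
print(closest_word([1, 1, 3, 3, 3, 1], templates))        # ab
```

The point is that the stretched word is matched as a whole, so nothing ever has to decide where one letter ends and the next begins.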

Nick Fortescue
Tesseract is unlikely to be able to handle connected scripts like Arabic. It will take some specialized algorithms to handle this case, and right now it doesn't have them. code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
Maysam
Fair enough. I assumed you were talking about connected English (i.e. cursive). Hopefully the ideas are useful though. I'll add another answer for Arabic.
Nick Fortescue