I am working on handwriting recognition and related tasks on the Visual Studio platform, using the OpenCV libraries. Input is in the form of binary scanned .tif images.

I have currently hit a roadblock trying to figure out a way to recognize struck-out words, i.e. words that have been cancelled with a straight or curved line drawn through them. I am not going to do individual character recognition because that would be a waste of computation power.

Is there an alternative way to recognize such occurrences?

Following are two ideas I've come up with, but I am not sure about either (rough sketches of both appear after the second idea below).

1> Use a mask like < 0 0 0, 1 1 1, 0 0 0 > to help find all horizontal lines. But this would be a very big assumption: the strike-out lines can be wavy and in any orientation.

2> Skeletonize the input and look for intersections. This will give me quite a few intersections, including the ones caused by the line used to strike out the word. Using some approximation such as least squares I could then fit an approximate line. But the problem is that intersections also occur in many other places, e.g. two intersections in a 'b'.
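
Here is a minimal sketch of idea 1, using a morphological opening with a wide, flat kernel instead of a hand-written convolution mask; the image path and the kernel width are placeholder values:

```
// Idea 1 (sketch): keep only long, roughly horizontal runs of ink.
// Assumes the strike-out is close to horizontal, which is exactly the
// big assumption mentioned above; wavy or slanted lines will be missed.
#include <opencv2/opencv.hpp>

int main()
{
    // Placeholder path to one of the scanned .tif pages.
    cv::Mat img = cv::imread("page.tif", cv::IMREAD_GRAYSCALE);
    if (img.empty()) return 1;

    // Make the ink white on black so morphology works on the strokes.
    cv::Mat bin;
    cv::threshold(img, bin, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);

    // Opening with a wide, 1-pixel-tall rectangle removes everything that is
    // not a long horizontal run. The width (25) needs tuning to the resolution.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(25, 1));
    cv::Mat horizontal;
    cv::morphologyEx(bin, horizontal, cv::MORPH_OPEN, kernel);

    cv::imwrite("horizontal_strokes.png", horizontal);
    return 0;
}
```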
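
And a rough sketch of idea 2, with a simple morphological skeleton (in a real project cv::ximgproc::thinning from opencv_contrib may be preferable), branch-point detection, and a least-squares line fit via cv::fitLine; as noted, ordinary letters also produce branch points, so this only shows the mechanical part of the idea:

```
// Idea 2 (sketch): skeletonize, collect intersection (branch) points,
// fit a line through them with least squares.
#include <opencv2/opencv.hpp>
#include <vector>

// Simple iterative morphological skeleton.
static cv::Mat skeletonize(const cv::Mat& bin)
{
    cv::Mat skel = cv::Mat::zeros(bin.size(), CV_8UC1);
    cv::Mat img = bin.clone(), eroded, opened, removed;
    cv::Mat elem = cv::getStructuringElement(cv::MORPH_CROSS, cv::Size(3, 3));
    while (cv::countNonZero(img) > 0) {
        cv::erode(img, eroded, elem);
        cv::dilate(eroded, opened, elem);
        cv::subtract(img, opened, removed);   // pixels lost by the opening
        cv::bitwise_or(skel, removed, skel);
        img = eroded.clone();
    }
    return skel;
}

int main()
{
    cv::Mat img = cv::imread("page.tif", cv::IMREAD_GRAYSCALE);  // placeholder path
    if (img.empty()) return 1;
    cv::Mat bin;
    cv::threshold(img, bin, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);

    cv::Mat skel = skeletonize(bin);

    // Branch points: skeleton pixels with more than two skeleton neighbours.
    std::vector<cv::Point2f> branches;
    for (int y = 1; y < skel.rows - 1; ++y)
        for (int x = 1; x < skel.cols - 1; ++x) {
            if (!skel.at<uchar>(y, x)) continue;
            int n = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if ((dy || dx) && skel.at<uchar>(y + dy, x + dx)) ++n;
            if (n > 2) branches.push_back(cv::Point2f((float)x, (float)y));
        }

    if (branches.size() >= 2) {
        cv::Vec4f line;
        cv::fitLine(branches, line, cv::DIST_L2, 0, 0.01, 0.01);
        // line = (vx, vy, x0, y0): direction and a point of the fitted line.
    }
    return 0;
}
```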

Any suggestions?

A: 

Why not process contours? You could take advantage of polygonal (Teh-Chin) approximation and analyze only the few vectors resulting from the chain reconstruction. If you want to do more, use a mixed pyramid/contour scheme in order to get vector approximations at different levels of detail, from the roughest resolution up to the finest.
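
If it helps, here is a minimal sketch of that contour step; OpenCV exposes the Teh-Chin chain approximation as the CHAIN_APPROX_TC89_L1 / CHAIN_APPROX_TC89_KCOS flags of findContours, and the approxPolyDP tolerance (2.0 below) is just a placeholder:

```
// Extract contours with Teh-Chin (TC89) chain approximation, then optionally
// simplify them further to get a handful of vectors per glyph.
#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
    cv::Mat img = cv::imread("page.tif", cv::IMREAD_GRAYSCALE);  // placeholder path
    if (img.empty()) return 1;
    cv::Mat bin;
    cv::threshold(img, bin, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);

    // Teh-Chin approximation keeps only the dominant points of each contour.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(bin, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_TC89_L1);

    // Further polygonal simplification (Douglas-Peucker) with a tunable tolerance.
    std::vector<std::vector<cv::Point>> polys(contours.size());
    for (size_t i = 0; i < contours.size(); ++i)
        cv::approxPolyDP(contours[i], polys[i], 2.0, true);

    return 0;
}
```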

Stop the refinement when you get a "reasonable" number of unique segments, apply normalization (see moments, in particular Hu's moments) to compute a fingerprint of your sample, and finally adopt a robust classification system.

I suggest you look at the ML (machine learning) part of the OpenCV suite for more on this latter part. For raster data, Haar wavelets + Hidden Markov Models work well; for vectors you could perhaps use something easier to set up (SOM, kNN, k-means).
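
A rough sketch of the fingerprint + classifier part, using Hu moments as a 7-value descriptor and OpenCV's kNN (cv::ml::KNearest); the labels (0 = normal word, 1 = struck-out word) and the training data itself are assumptions you would have to supply from hand-labelled samples:

```
// Hu-moment fingerprint of a contour plus kNN classification (sketch).
#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>
#include <cmath>
#include <vector>

// 7 Hu moments as a 1x7 float row, log-scaled so magnitudes are comparable.
static cv::Mat huDescriptor(const std::vector<cv::Point>& contour)
{
    double hu[7];
    cv::HuMoments(cv::moments(contour), hu);
    cv::Mat row(1, 7, CV_32F);
    for (int i = 0; i < 7; ++i)
        row.at<float>(0, i) = (float)(-std::copysign(1.0, hu[i]) *
                                      std::log10(std::abs(hu[i]) + 1e-30));
    return row;
}

int main()
{
    // One descriptor row per hand-labelled contour; labels are 0 or 1 (CV_32S).
    // Building this training set is the part left out of the sketch.
    cv::Mat trainSamples, trainLabels;

    cv::Ptr<cv::ml::KNearest> knn = cv::ml::KNearest::create();
    knn->setDefaultK(3);
    if (!trainSamples.empty()) {
        knn->train(trainSamples, cv::ml::ROW_SAMPLE, trainLabels);
        // To classify a new contour:
        // cv::Mat result;
        // knn->findNearest(huDescriptor(someContour), 3, result);
    }
    return 0;
}
```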

ZZambia
A: 

I would go with individual character recognition. It may be a waste of computing power, but it could give the best results. Find a way to get a score from the character recognizer that indicates how confidently each character was recognized, then find a threshold below which something is probably not a clean character. The cancelling will damage the characters in a way that gives the recognizer trouble, and you may be able to use exactly this fact to find the cancelled characters. To improve the results, look for many badly recognized characters in the same region of the text: often whole words are cancelled, so the bad recognition results will cluster.

If performance turns out to be too poor in the end, you can always come back and improve the algorithm later on.
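
Just to make the clustering idea concrete, here is a tiny sketch; the per-character confidence values are purely hypothetical stand-ins for whatever score your recognizer can produce, and only the thresholding / run-detection step is shown:

```
// Flag runs of consecutive badly recognized characters as possible strike-outs.
#include <cstdio>
#include <vector>

int main()
{
    // Hypothetical confidences (0..1) for the characters of one text line.
    std::vector<float> conf = {0.92f, 0.88f, 0.31f, 0.25f, 0.18f, 0.29f, 0.90f};
    const float threshold = 0.5f;   // below this a character counts as "bad"
    const int   minRun    = 3;      // require several bad characters in a row

    int runStart = -1;
    for (int i = 0; i <= (int)conf.size(); ++i) {
        bool bad = i < (int)conf.size() && conf[i] < threshold;
        if (bad && runStart < 0) runStart = i;
        if (!bad && runStart >= 0) {
            if (i - runStart >= minRun)
                std::printf("possible struck-out region: chars %d..%d\n",
                            runStart, i - 1);
            runStart = -1;
        }
    }
    return 0;
}
```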

Janusz
A: 

Have you considered using the Hough transform to detect the strike lines?

Here's an illustration of the use of the Hough transform on handwriting that will give you an intuition for the approach: (image: lines detected in handwriting)

You can quickly test it with OpenCV; the function is called cvHoughLines2.
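
For a quick test with the C++ API (cv::HoughLinesP is the probabilistic counterpart of the C-API cvHoughLines2), something like the sketch below; the vote threshold, minimum line length and maximum gap are guesses that need tuning so strike-outs are found but ordinary character strokes are not:

```
// Probabilistic Hough transform to find long line segments (candidate strike-outs).
#include <opencv2/opencv.hpp>
#include <vector>

int main()
{
    cv::Mat img = cv::imread("page.tif", cv::IMREAD_GRAYSCALE);  // placeholder path
    if (img.empty()) return 1;
    cv::Mat bin;
    cv::threshold(img, bin, 0, 255, cv::THRESH_BINARY_INV | cv::THRESH_OTSU);

    // A strike-out should be noticeably longer than individual character strokes.
    std::vector<cv::Vec4i> lines;
    cv::HoughLinesP(bin, lines, 1, CV_PI / 180, 80 /*votes*/,
                    60 /*minLineLength*/, 5 /*maxLineGap*/);

    cv::Mat vis;
    cv::cvtColor(img, vis, cv::COLOR_GRAY2BGR);
    for (const cv::Vec4i& l : lines)
        cv::line(vis, cv::Point(l[0], l[1]), cv::Point(l[2], l[3]),
                 cv::Scalar(0, 0, 255), 2);
    cv::imwrite("detected_lines.png", vis);
    return 0;
}
```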

Ivan