Hello All

For my project I am writing an image preprocessing library for scanned documents. I am currently stuck on the line-removal feature.

Problem Description: A sample scanned form:

Name*  : ______________________________
Age* : ______________________________

Email-ID: |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|

Note: The following further conditions apply:

  • The scanned document may contain many more vertical and horizontal guiding lines.
  • The thickness of the lines may exceed 1px.
  • The document itself is not printed properly and might have noise in the form of ink bloating or uneven line thickness.
  • The document might have a colored background or colored lines.

What I am trying to do is detect these lines and remove them without losing the handwritten content.

Solution so far: The current solution is implemented in Java.

I detect the lines using a combination of Canny/Sobel edge detectors and a threshold filter (to make the image bitonal). This gives me a black-and-white array of pixels. I traverse the array and check whether the luminosity of each pixel falls below a specified bin value. If I find 30 such pixels in a row (30 being the minimum line length in pixels), I remove them. I repeat the same for vertical lines, taking into account that they will have cuts where the horizontal lines were removed.
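To make the scanning step concrete, here is a minimal sketch of the horizontal pass only. It assumes the image has already been thresholded into a boolean array where `true` means a dark (ink) pixel. The class name `HorizontalRunRemover` and the method `removeHorizontalRuns` are made up for illustration and are not the actual project code.

```java
import java.util.Arrays;

// Hypothetical helper: illustrates the horizontal run-length pass only.
class HorizontalRunRemover {

    /**
     * Returns a copy of ink in which every horizontal run of at least
     * minRunLength consecutive ink pixels has been cleared.
     * ink[y][x] == true is assumed to mean "dark pixel".
     */
    static boolean[][] removeHorizontalRuns(boolean[][] ink, int minRunLength) {
        boolean[][] out = new boolean[ink.length][];
        for (int y = 0; y < ink.length; y++) {
            out[y] = ink[y].clone();
            int width = ink[y].length;
            int runStart = 0; // first pixel of the run currently being scanned
            // x == width acts as a sentinel so a run touching the right edge is closed too
            for (int x = 0; x <= width; x++) {
                boolean isInk = x < width && ink[y][x];
                if (!isInk) {
                    if (x - runStart >= minRunLength) {
                        Arrays.fill(out[y], runStart, x, false);
                    }
                    runStart = x + 1;
                }
            }
        }
        return out;
    }
}
```

Calling `HorizontalRunRemover.removeHorizontalRuns(ink, 30)` corresponds to the 30-pixel minimum mentioned above. The vertical pass would be the same loop with the roles of x and y swapped.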

Although the solution seems to work, there are problems:

  • Characters that overlap a line are removed along with it.
  • If characters in the image are not properly spaced, they are also treated as a line.
  • The output image from edge detection is in black and white.
  • It is a bit slow: it normally takes around 40 seconds for an image of 2480×3508 pixels.

Kindly guide me on how to do this properly and efficiently. If there is an open-source library for this, please point me to it.

Thanks

A: 

First, I want to mention that I know nothing about image processing in general, and about OCR in particular.

Still, a very simple heuristic comes to mind:

  1. Separate the pixels in the image into connected components.
  2. For each connected component, decide whether it is a line or not using one or more of the following heuristics:
    1. Is it longer than the average letter length?
    2. Does it appear near other letters? (To remove ink bloats or artifacts).
    3. Are its X gradient and Y gradient both large enough? This would indicate that the connected component contains more than just a horizontal line.
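A rough sketch of step 1 combined with the length heuristic only, assuming a rectangular boolean array where `true` is an ink pixel. The class `ComponentLineFilter`, its methods, and the `maxLetterLength` parameter are hypothetical names, and the cutoff is supplied by the caller rather than computed from an average letter size. The other two heuristics are not implemented here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: connected components plus the "too long to be a letter" test.
class ComponentLineFilter {

    /**
     * Returns a copy of ink that keeps only the 8-connected components whose
     * bounding box is at most maxLetterLength pixels in both directions.
     */
    static boolean[][] removeLongComponents(boolean[][] ink, int maxLetterLength) {
        int h = ink.length;
        int w = h == 0 ? 0 : ink[0].length;
        boolean[][] out = new boolean[h][w];
        boolean[][] visited = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!ink[y][x] || visited[y][x]) continue;
                List<int[]> pixels = collectComponent(ink, visited, x, y);
                int minX = w, maxX = -1, minY = h, maxY = -1;
                for (int[] p : pixels) {
                    minX = Math.min(minX, p[0]);
                    maxX = Math.max(maxX, p[0]);
                    minY = Math.min(minY, p[1]);
                    maxY = Math.max(maxY, p[1]);
                }
                int length = Math.max(maxX - minX, maxY - minY) + 1;
                if (length <= maxLetterLength) {
                    // Short enough to be a letter or a stroke: keep it.
                    for (int[] p : pixels) out[p[1]][p[0]] = true;
                }
            }
        }
        return out;
    }

    /** Breadth-first flood fill; returns the {x, y} pixels of one component. */
    static List<int[]> collectComponent(boolean[][] ink, boolean[][] visited,
                                        int startX, int startY) {
        List<int[]> pixels = new ArrayList<>();
        ArrayDeque<int[]> queue = new ArrayDeque<>();
        visited[startY][startX] = true;
        queue.add(new int[] {startX, startY});
        while (!queue.isEmpty()) {
            int[] p = queue.poll();
            pixels.add(p);
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int nx = p[0] + dx;
                    int ny = p[1] + dy;
                    if (ny < 0 || ny >= ink.length || nx < 0 || nx >= ink[ny].length) continue;
                    if (!ink[ny][nx] || visited[ny][nx]) continue;
                    visited[ny][nx] = true;
                    queue.add(new int[] {nx, ny});
                }
            }
        }
        return pixels;
    }
}
```

Because a component is kept or dropped as a whole, any letter that touches a line ends up in the same component as that line and shares its fate.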

The only problem I can see is when somebody writes letters across a horizontal line, like so:

   /\     ___
  /  \   /   \
  |__|   |___/
 -|--|---|---|------------------
  |  |    \__/

In that case the line would remain, but you have to handle this case anyhow.

As I mentioned, I'm by no means an image processing expert, but sometimes very simple tricks work.

Elazar Leibovich