Hello all,
For my project I am writing an image pre-processing library for scanned documents. At the moment I am stuck on the line removal feature.
Problem Description: A sample scanned form:
Name* : ______________________________
Age* : ______________________________
Email-ID: |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
Note: The following further conditions apply:
- The scanned document may contain many more vertical and horizontal guiding lines.
- The thickness of the lines may exceed 1px.
- The document itself may not be printed properly and might have noise in the form of ink bloating or uneven line thickness.
- The document might have a colored background or colored lines.
What I am trying to do is detect these lines and remove them without losing the handwritten content.
Solution so far: The current solution is implemented in Java.
I detect the lines using a combination of Canny/Sobel edge detectors and a threshold filter (to make the image bitonal). This gives me a black-and-white array of pixels. I then traverse the array and check whether the luminosity of each pixel falls below a specified bin value. If I find 30 such consecutive pixels (30 being the minimum line length in pixels), I remove them. I repeat the same for vertical lines, taking into account that they will have gaps where the horizontal lines were removed.
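To make the horizontal pass concrete, here is a minimal sketch of the run-removal step described above. It is not my exact code: the class name, the method name, the `int[][]` luminance layout (0 = black, 255 = white) and the threshold value of 128 in the usage note are all assumptions made for illustration. Only the minimum run length of 30 comes from the description.

```java
import java.util.Arrays;

final class HorizontalRunRemover {

    /**
     * Whitens every horizontal run of at least minLength dark pixels.
     * gray[y][x] holds a luminance value in 0..255 (assumed layout);
     * a pixel counts as dark when its value is below darkThreshold.
     */
    static void removeHorizontalRuns(int[][] gray, int darkThreshold, int minLength) {
        for (int[] row : gray) {
            int runStart = -1; // start index of the current dark run, -1 if none
            // Iterate one step past the end so a run touching the right edge is closed.
            for (int x = 0; x <= row.length; x++) {
                boolean dark = x < row.length && row[x] < darkThreshold;
                if (dark) {
                    if (runStart < 0) {
                        runStart = x;
                    }
                } else if (runStart >= 0) {
                    // The run covers [runStart, x); erase it only if it is long enough.
                    if (x - runStart >= minLength) {
                        Arrays.fill(row, runStart, x, 255);
                    }
                    runStart = -1;
                }
            }
        }
    }
}
```

Calling `HorizontalRunRemover.removeHorizontalRuns(gray, 128, 30)` would clear every dark run of 30 or more pixels in each row and leave shorter runs alone. The vertical pass would be the same loop over columns. Because it looks at one row at a time and ignores what crosses the run, a sketch like this also erases any character strokes that overlap a line.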
Although the solution seems to work, there are problems such as:
- Characters that overlap a line are removed along with it.
- If the characters in the image are not properly spaced, they are also treated as a line.
- The output image from edge detection is black and white.
- It is a bit slow: it normally takes around 40 seconds for an image of 2480*3508 pixels.
Please advise how to do this properly and efficiently. If there is an open-source library for this, please point me to it.
Thanks