I have 55 000 image files (in both JPG and TIFF format) which are pictures from a book.

The structure of each page is this:

some text

--- (horizontal line) ---

a number

some text

--- (horizontal line) ---

another number

some text

There can be from zero to four horizontal lines on any given page.

I need to find the number just below each horizontal line.

However, the numbers are strictly consecutive, starting at one on page one, so I don't need to read them to find them: I could just detect the presence of horizontal lines, which should be both easier and safer than trying to OCR the page to detect the numbers.

The algorithm would be, basically:

for each image
  count horizontal lines
  print image name, number of horizontal lines
  next image
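
Once the per-page counts exist, the numbers themselves follow from a running total. Here is a minimal sketch in Python of that bookkeeping, assuming the pages are processed in book order, every horizontal line is followed by exactly one number, and numbering starts at 1. The function name and the sample file names are made up for illustration.

```python
from typing import Iterable, List, Tuple


def numbers_on_pages(
    line_counts: Iterable[Tuple[str, int]],
) -> List[Tuple[str, List[int]]]:
    """Assign consecutive numbers to pages from per-page line counts.

    Assumes pages arrive in book order, each horizontal line is followed
    by exactly one number, and numbering starts at 1.
    """
    result = []
    next_number = 1
    for name, count in line_counts:
        # A page with `count` lines carries the next `count` numbers.
        numbers = list(range(next_number, next_number + count))
        result.append((name, numbers))
        next_number += count
    return result


if __name__ == "__main__":
    # Hypothetical file names and counts, for illustration only.
    sample = [("page_0001.tif", 2), ("page_0002.tif", 0), ("page_0003.tif", 1)]
    for name, numbers in numbers_on_pages(sample):
        print(name, numbers)
```

A single miscounted page shifts every later number, so spot-checking a few pages against the printed numbers would be a sensible safeguard.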

The question is: what would be the best image library/language to do the "count horizontal lines" part?

A: 

You might want to try John Resig's "OCR and Neural Nets in JavaScript".

pageman
Why would he need OCR when all he needs is to count the lines?
kigurai
The question was how to solve the problem without OCR
Ivan
@kigurai @Ivan If doing OCR is trivial, why not? He's assuming there's a way "which should be both easier and safer than trying to OCR the page to detect the numbers."
pageman
@pageman I can assure you that counting lines will be a lot easier than doing OCR. Ivan's suggestion of using the Hough transform for lines in OpenCV is as close to a complete answer as it gets.
kigurai
+3  A: 

Probably the easiest way to detect your lines is using the Hough transform in OpenCV (which has wrappers for many languages).

The OpenCV Hough transform will detect all lines in the image and return their angle or, with the probabilistic variant, their start/stop coordinates. You should keep only the ones whose angle is close to horizontal and which are of adequate length.
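
The filtering step might look something like the sketch below. It is not a complete pipeline: it assumes the segments are already available as (x1, y1, x2, y2) tuples, which is the layout the probabilistic Hough transform (`cv2.HoughLinesP` in the Python wrapper) produces once its output is reshaped to N rows of 4. The function name, the thresholds, and the merging of segments that lie close together vertically are my own assumptions and would need tuning on real pages.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def count_horizontal_lines(
    segments: Sequence[Segment],
    image_width: int,
    max_angle_deg: float = 2.0,     # assumed tolerance for "close to horizontal"
    min_length_ratio: float = 0.5,  # assumed: a rule spans at least half the page
    merge_gap_px: float = 10.0,     # assumed: segments this close in y are one rule
) -> int:
    """Count distinct long, near-horizontal rules among detected segments."""
    ys: List[float] = []
    for x1, y1, x2, y2 in segments:
        dx, dy = x2 - x1, y2 - y1
        # Discard segments that are too short to be a page rule.
        if math.hypot(dx, dy) < min_length_ratio * image_width:
            continue
        # Fold the angle into [0, 90] degrees so segment direction is irrelevant.
        angle = math.degrees(math.atan2(abs(dy), abs(dx)))
        if angle > max_angle_deg:
            continue
        ys.append((y1 + y2) / 2.0)

    # A thick rule can yield several nearly coincident segments;
    # count a new rule only when the vertical gap is large enough.
    ys.sort()
    count = 0
    last_y = None
    for y in ys:
        if last_y is None or y - last_y > merge_gap_px:
            count += 1
        last_y = y
    return count
```

Without the merging step a single printed rule could be counted two or three times, which matters here because every miscount shifts all the numbers that follow. A rule that the detector breaks into several short pieces would fall under the length threshold, so the gap-bridging parameter of the Hough call and `min_length_ratio` have to be chosen together.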

O'Reilly's Learning OpenCV explains the function's input and output in detail (p. 156).

Ivan