So I have what is essentially a spreadsheet in TIFF format. There is some uniformity to it: for example, all the column widths are the same. I want to split the sheet along those known column widths and create lots of little image files, one per cell, run OCR on each, and store the results in a database. The problem is that the rows are not all the same height, so I need some graphics library call to check whether every pixel across a row is the same color (i.e. black). If so, I know I've reached the horizontal border of a cell. How would I go about doing that? (I'm using RMagick.)

A: 

Use Image#get_pixels: http://www.simplesystems.org/RMagick/doc/image2.html#get_pixels Warning: those docs are old, so the API may have changed in newer versions. Check your own rdocs with $ gem server, assuming the gem ships with them.

image#rows gives you the height of the image, then you can do something like (untested):

def black_line?(pixels)
  pixels.each do |pixel| 
    unless pixel.red == 0 && pixel.green == 0 && pixel.blue == 0
      return false
    end
  end
  true
end 

black_line_heights = []
height = image.rows
width = image.columns
height.times do |y|
  pixels = image.get_pixels(0, y, width, 1)
  black_line_heights << y if black_line?(pixels)
end 

Please keep in mind that I'm not sure about the API: I'm working from older docs and can't test this right now, but it looks like the general approach you would take. BTW, it assumes the row borders are 1 pixel thick. If they are thicker, changing the 1 to the actual thickness might be enough to make it work as you expect, though the loop would then have to stop early so the requested region never extends past the bottom of the image.
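Once the black rows are collected, they still have to be turned into cell boundaries. Here is a minimal sketch of that step in plain Ruby, with no RMagick calls. The helper names border_runs and cell_bands are made up for illustration, and the sketch assumes the row list looks like what the loop above produces, where a border thicker than one pixel shows up as several consecutive y values:

```ruby
# Group row indices into runs of consecutive values.
# Each run is one horizontal border, however many pixels thick.
def border_runs(black_rows)
  black_rows.sort
            .slice_when { |a, b| b != a + 1 }
            .map { |run| (run.first..run.last) }
end

# The rows lying strictly between two neighbouring borders form one cell band.
def cell_bands(black_rows)
  border_runs(black_rows).each_cons(2).map do |upper, lower|
    (upper.last + 1)..(lower.first - 1)
  end
end

# Hypothetical input: borders at rows 0-1, 40-42 and 95.
rows = [0, 1, 40, 41, 42, 95]
p border_runs(rows) # => [0..1, 40..42, 95..95]
p cell_bands(rows)  # => [2..39, 43..94]
```

Each band gives the top row and height of one spreadsheet row, which together with the known column widths should be enough to crop out the individual cells.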

ehsanul
A: 

Ehsanul had the right idea. The call to use is get_pixels, which takes x, y, w, h as arguments and returns an array of the pixels in that rectangle. If the region is 1 pixel thick, the array is simply that row (or column) of pixels in order.

Since the black in a document can vary, I altered Ehsanul's method a little to check whether consecutive pixels are roughly the same color. After 100 or so such pixels, it's probably a line:

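  # True if every pixel is close to the previous one and darker than black_val.
  # opt[:threshold] is the largest allowed per-channel jump between neighbours.
  def solid_line?(pixels, opt = {}, black_val = 10)
    # fetch works in plain Ruby; blank? needs ActiveSupport
    thresh = opt.fetch(:threshold, 4)
    last_pixel = nil

    pixels.each do |pix|
      pixel = [pix.red, pix.green, pix.blue]
      if last_pixel
        # Compare channel by channel; looking channels up with index(p)
        # picks the wrong channel when two channels share a value.
        same = pixel.zip(last_pixel).all? do |cur, prev|
          (cur - prev).abs < thresh && cur < black_val
        end
        return false unless same
      end
      last_pixel = pixel
    end
    true
  end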
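One thing to watch with black_val: RMagick reports channel values on a scale from 0 to QuantumRange, which depends on how ImageMagick was built (255 for an 8-bit build, 65535 for a 16-bit one). A cutoff of 10 only means "nearly black" on the 8-bit scale. A minimal sketch of rescaling it, assuming the cutoff was chosen with 8-bit values in mind (scale_to_quantum is a made-up helper name):

```ruby
# Convert an 8-bit channel value (0..255) to a scale that runs 0..quantum_range.
def scale_to_quantum(value_8bit, quantum_range)
  (value_8bit * quantum_range / 255.0).round
end

p scale_to_quantum(10, 255)   # => 10
p scale_to_quantum(10, 65535) # => 2570

# With RMagick loaded, the cutoff could then be passed in as, for example:
#   solid_line?(pixels, {}, scale_to_quantum(10, Magick::QuantumRange))
```

The same scaling would apply to the :threshold option, since it is compared against differences between the same channel values.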
Zando