Hello, I was wondering if anyone could give me pointers to image recognition packages that would help me recognize "text" (not OCR, just something that looks like text) and a black box frame. So, suppose:

text
+----------+
|          |
|   text1  |
|          |
|          |
+----------+
     text

How do I recognize that "text" boxes are text, and that, say, text1 is inside the box?

Apologies for the vague question... I wouldn't know where to start. This is not homework, btw.

+1  A: 

You can apply any border detection algorithm to detect the box. And since the color of the text is different from the color of the background, you can even use a linear search to find the black pixels of 'text'. I may be wrong, sorry about that.
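
A minimal sketch of that linear search, assuming the image is already a grid of grayscale values and that some border detector has already returned the frame's coordinates. The function name, the `box` tuple layout and the threshold of 128 are illustrative assumptions, not part of any package:

```python
def split_dark_pixels(img, box, threshold=128):
    """Linear scan: sort dark pixels into inside / outside the box frame.

    img is a list of rows of grayscale values (0 = black, 255 = white);
    box is (top, left, bottom, right), the coordinates of the frame itself.
    """
    top, left, bottom, right = box
    inside, outside = [], []
    for y, row in enumerate(img):
        for x, value in enumerate(row):
            if value >= threshold:
                continue  # background pixel
            if top < y < bottom and left < x < right:
                inside.append((y, x))
            elif y < top or y > bottom or x < left or x > right:
                outside.append((y, x))
            # dark pixels lying on the frame itself are ignored
    return inside, outside


# Toy 8x12 image: a frame from (1, 2) to (6, 9), marks inside and below it.
img = [[255] * 12 for _ in range(8)]
for x in range(2, 10):
    img[1][x] = img[6][x] = 0
for y in range(1, 7):
    img[y][2] = img[y][9] = 0
img[3][5] = img[3][6] = 0   # stands in for "text1"
img[7][4] = img[7][5] = 0   # stands in for the caption below the box

inside, outside = split_dark_pixels(img, (1, 2, 6, 9))
print("inside:", inside)
print("outside:", outside)
```

This only says where the dark pixels are relative to the frame. It does not decide whether they are actually text.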

Trickster
+2  A: 

[This is of interest to us.] I am assuming your input is effectively a bitmap - a rectangular matrix of pixels. The first question is whether it is aligned with the axes - if it's been scanned it's probably not. You may need deskewing algorithms (rather dated but it's a useful start: http://www.eecs.berkeley.edu/~fateman/kathey/node11.html)

The classic line-detection method is the Hough transform (http://en.wikipedia.org/wiki/Hough%5Ftransform), though our current collaborators do better than this for simple boxes by projecting pixels onto different viewpoints, similar to tomography. Rotate the image and count the density/histogram of points on the projection lines. For simple boxes that gives a clear signal.
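
A minimal sketch of the projection idea, restricted to the two axis-aligned viewpoints (0 and 90 degrees) and assuming the image is already deskewed and available as a grid of grayscale values. The function names, the threshold of 128 and the `min_count` cut-offs are illustrative assumptions, and this is not the collaborators' actual method:

```python
def projection_profiles(img, threshold=128):
    """Count dark pixels along every row and every column of the image."""
    row_counts = [sum(1 for value in row if value < threshold) for row in img]
    col_counts = [sum(1 for row in img if row[x] < threshold)
                  for x in range(len(img[0]))]
    return row_counts, col_counts


def peaks(profile, min_count):
    """Indices whose dark-pixel count is high enough to be a straight line."""
    return [i for i, count in enumerate(profile) if count >= min_count]


# Toy 8x12 image: a frame from (1, 2) to (6, 9) with a small mark inside.
img = [[255] * 12 for _ in range(8)]
for x in range(2, 10):
    img[1][x] = img[6][x] = 0
for y in range(1, 7):
    img[y][2] = img[y][9] = 0
img[3][5] = img[3][6] = 0

row_counts, col_counts = projection_profiles(img)
h_lines = peaks(row_counts, min_count=6)   # rows holding a horizontal edge
v_lines = peaks(col_counts, min_count=5)   # columns holding a vertical edge
print(h_lines, v_lines)
```

The long edges of the frame dominate the histograms, while the short mark inside the box stays below the cut-off. Handling skew would mean repeating the projection at other angles, which is left out here.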

For the text I suspect you either have to have a set of likely fonts or use machine learning. In the latter case you have to devise features and then select a series of images that are classified by humans as text and not-text. Your algorithm (and there are many: neural nets, maximum entropy, etc.) is then trained against these.

The quality of the pixel map makes a great deal of difference. Documents from 20 years ago are much harder than bitmaps of documents created through drawing programs and dumped as PDF (of course, if you can interpret the text in the PDF, that helps a good deal).

peter.murray.rust
My documents are simple... they are gif images, so they are clean.
Dervin Thunk
@Dervin GIF is simply a transfer format for pixels. They could hold very messy text (e.g. the captchas in SO) or fairly clean text (e.g. the fonts in SO itself). But many images are not clean when analysed in detail, as they may include antialiasing.
peter.murray.rust
Peter, the image would be closer to this: http://images.freshmeat.net/editorials/r_intro/images/line-graph-1.jpg
Dervin Thunk
Thanks, Peter. I agree it will never be 100%, so there will always be some manual intervention.
Dervin Thunk
A: 

A very simple algorithm would be to scan left-to-right and top-to-bottom, looking for the three black pixels that make up an upper-left corner of a box (and then continuing to scan for the three pixels that would make up the matching lower-right corner). Once you've identified each box in the image in this way, you could scan the inner portion and assume that any non-white pixels mean there is some text in the box. Of course, this would not differentiate between text and images inside the box, but that would be a much more difficult problem anyway.
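
A rough sketch of that scan, assuming a clean grayscale grid with one-pixel-wide, axis-aligned frames. Following the top and left edges to locate the matching lower-right corner is one assumed way to do the "continuing to scan" step, and all names and the threshold of 128 are illustrative:

```python
def is_dark(img, y, x, threshold=128):
    """True if (y, x) lies inside the image and is darker than threshold."""
    return 0 <= y < len(img) and 0 <= x < len(img[0]) and img[y][x] < threshold


def find_boxes(img):
    """Return (top, left, bottom, right) for every frame the scan accepts."""
    boxes = []
    for y in range(len(img)):
        for x in range(len(img[0])):
            # Upper-left corner: the pixel, its right and its lower neighbour.
            if not (is_dark(img, y, x) and is_dark(img, y, x + 1)
                    and is_dark(img, y + 1, x)):
                continue
            right = x
            while is_dark(img, y, right + 1):    # follow the top edge
                right += 1
            bottom = y
            while is_dark(img, bottom + 1, x):   # follow the left edge
                bottom += 1
            # Lower-right corner: the pixel, its left and its upper neighbour.
            if (is_dark(img, bottom, right) and is_dark(img, bottom, right - 1)
                    and is_dark(img, bottom - 1, right)):
                boxes.append((y, x, bottom, right))
    return boxes


def box_has_content(img, box):
    """True if any pixel strictly inside the frame is dark."""
    top, left, bottom, right = box
    return any(is_dark(img, y, x)
               for y in range(top + 1, bottom)
               for x in range(left + 1, right))


# Toy 8x12 image: a frame from (1, 2) to (6, 9) with a small mark inside.
img = [[255] * 12 for _ in range(8)]
for x in range(2, 10):
    img[1][x] = img[6][x] = 0
for y in range(1, 7):
    img[y][2] = img[y][9] = 0
img[3][5] = img[3][6] = 0

boxes = find_boxes(img)
print(boxes, [box_has_content(img, b) for b in boxes])
```

On anything but a clean synthetic image this will produce false corners, so treat it as a starting point only.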

MusiGenesis
Sorry about my naive question, but what happens if your document has a T at a small y coordinate? Wouldn't that be confused with the upper-left corner?
Dervin Thunk
You cannot assume there are exactly 3 pixels - it depends on the line width, registration with the rasterisation program, antialiasing and a lot more.
peter.murray.rust
@Dervin: you could rule out a "T" by checking the pixel to the left, and you could rule out a "+" by checking to the left and above, but all of this assumes a relatively simple image. My algorithm here wouldn't work very well with the sample image you posted below Peter's comment. It wouldn't pick up the lower-right corner of the graph's box, it would falsely recognize the upper-left of the "5"s and the sideways "D" in "DJIA" as corners, etc.
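
A small sketch of those two extra checks, assuming one-pixel-wide strokes on a clean grayscale grid. The helper names and the `#`/`.` test shapes are made up for illustration:

```python
def is_dark(img, y, x, threshold=128):
    """True if (y, x) lies inside the image and is darker than threshold."""
    return 0 <= y < len(img) and 0 <= x < len(img[0]) and img[y][x] < threshold


def naive_corner(img, y, x):
    """Three-pixel test: the pixel, its right and its lower neighbour."""
    return is_dark(img, y, x) and is_dark(img, y, x + 1) and is_dark(img, y + 1, x)


def strict_corner(img, y, x):
    """Same test, but the pixels to the left and above must be white."""
    return (naive_corner(img, y, x)
            and not is_dark(img, y, x - 1)    # rules out the top of a "T"
            and not is_dark(img, y - 1, x))   # with the line above, rules out "+"


def from_art(lines):
    """Build a grayscale grid from strings: '#' is black, anything else white."""
    return [[0 if ch == "#" else 255 for ch in line] for line in lines]


t_shape = from_art(["#####", "..#..", "..#..", "..#.."])
plus_shape = from_art(["..#..", "..#..", "#####", "..#..", "..#.."])
box_corner = from_art(["####", "#...", "#..."])

print(naive_corner(t_shape, 0, 2), strict_corner(t_shape, 0, 2))
print(naive_corner(plus_shape, 2, 2), strict_corner(plus_shape, 2, 2))
print(naive_corner(box_corner, 0, 0), strict_corner(box_corner, 0, 0))
```

The junction of the "T" and the centre of the "+" pass the three-pixel test but fail the stricter one, while a genuine corner passes both.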
MusiGenesis
@Dervin: by the way, your sample graph in your comment to peter's answer caused me actual physical pain. This answer is why: http://stackoverflow.com/questions/1538235/what-problems-have-you-solved-using-genetic-algorithms-genetic-programming/1538464#1538464
MusiGenesis