ansaurus

Question

Separate image of text into component character images

Answer 1

+3 A:

This is not an easy task, especially if the background is not uniform. If you already have a binary image like the example, it is somewhat simpler.

If your image is not binary, start by applying a thresholding algorithm (Otsu's method, which picks the threshold automatically, works well).
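To make the thresholding step concrete, here is a minimal sketch of Otsu's method in plain Python. It assumes the image is a list of rows of integer gray levels in 0..255 and that the text is darker than the background; the function names `otsu_threshold` and `binarize` are just illustrative, not from any library.

```python
def otsu_threshold(gray):
    """Return the gray level that maximises the between-class variance."""
    # Histogram of all intensities in the image.
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * n for i, n in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0  # pixel count and intensity sum at or below t
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0 or w_light == 0:
            continue  # one class is empty, no split to evaluate
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(gray, t):
    """1 for text pixels (at or below the threshold), 0 for background."""
    return [[1 if v <= t else 0 for v in row] for row in gray]
```

In practice a library call is likely a better choice than hand-rolled code; OpenCV, for example, offers Otsu thresholding through its threshold function.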

After that, you can use a labelling algorithm to identify each 'island' of pixels that forms one of your shapes (each character, in this case).
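One simple way to do the labelling is a flood fill from every unlabelled text pixel. The sketch below assumes a binary image as a list of rows (truthy means text) and treats diagonal neighbours as connected; `label_components` is an illustrative name.

```python
from collections import deque


def label_components(binary):
    """Label 8-connected regions. Returns (labels, count); 0 is background."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or labels[y][x]:
                continue
            # Found an unlabelled text pixel: start a new island here.
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

Each island ends up with its own number, so saving a character is then a matter of copying out the pixels that carry a given label.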

The problem arises when you have noise: shapes that were labelled but are of no interest to you. In that case you can use some heuristic to decide whether a shape is a character or not (normalized area, the position of the object if your text is in a well-defined place, etc.). If that is not enough, you will need to deal with more complex stuff, such as shape feature extraction algorithms and some sort of pattern recognition algorithm, like multilayer perceptrons.
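As a rough illustration of the normalized-area heuristic, the sketch below keeps only the shapes whose area is a plausible fraction of the image. It assumes a label image (0 for background, one positive integer per shape); the name `plausible_labels` and the default limits are made up for the example and would need tuning on real data.

```python
def plausible_labels(labels, min_frac=0.001, max_frac=0.5):
    """Return the set of labels whose area / image area lies within the limits."""
    h, w = len(labels), len(labels[0])
    # Count the pixels belonging to each label.
    areas = {}
    for row in labels:
        for lab in row:
            if lab:
                areas[lab] = areas.get(lab, 0) + 1
    image_area = h * w
    return {lab for lab, area in areas.items()
            if min_frac <= area / image_area <= max_frac}
```

Specks below `min_frac` are treated as noise, and anything above `max_frac` is assumed to be a border or a stain rather than a letter.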

To finish: this looks like an easy task, but depending on the quality of your image it can get harder. The algorithms mentioned here are easy to find on the internet, and are also implemented in libraries such as OpenCV.

If you need any more help, just ask, and I'll help if I can ;)

Andres 2009-12-29 00:37:25

Thanks for your response! At this point I'm only interested in processing simple images like the sample I provided, black text on solid white. The other considerations I might build in later, so thanks for the tips. A labelling algorithm, then? A quick google gets me cvBlobsLib from the OpenCV library, which seems like it might do the job of finding the shapes. I'm not sure how to then go about saving them, but I'll give it a go.

blork 2009-12-29 01:01:25

Answer 2

+1 A:

I've been playing around with OCRopus recently, an open-source text analysis and OCR preprocessing tool. As part of its workflow, it also creates the images you want. Maybe this helps you, although no Python magic is involved.

moritz 2009-12-29 00:41:06

Answer 3

+1 A:

Norman Ramsey 2009-12-29 02:27:03

Answer 4

+2 A:

You could start with a simple connected components analysis (CCA) algorithm, which can be implemented quite efficiently with a scanline algorithm (you just keep track of merged regions and relabel at the end). This would give you separately numbered "blobs" for each continuous region, which would work for most (but not all) letters. Then you can simply take the bounding box of each connected blob, and that will give you the outline for each. You can even maintain the bounding box as you apply CCA for efficiency.

So in your example, the first word from the left after CCA would result in something like:

1111111  2         3
   1     2
   1     2 4444    5  666
   1     22    4   5 6
   1     2     4   5  666
   1     2     4   5     6
   1     2     4   5  666

with equivalence classes of 4=2.
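Here is a minimal sketch of the two-pass scanline idea, run on the diagram above. It assumes 8-connectivity (which is what makes 4 and 2 meet in the 'h') and a binary image as a list of rows; `find` and `bounding_boxes` are illustrative names. For simplicity it collects the bounding boxes in a second pass instead of maintaining them during the scan.

```python
def find(parent, a):
    """Follow the equivalence links to the representative label."""
    while parent[a] != a:
        parent[a] = parent[parent[a]]  # path halving
        a = parent[a]
    return a


def bounding_boxes(binary):
    """Return a sorted list of (x0, y0, x1, y1), one per connected blob."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    parent = [0]  # parent[i] == i means label i is a representative
    for y in range(h):
        for x in range(w):
            if not binary[y][x]:
                continue
            # Neighbours already visited by the scan: left and the three above.
            neigh = []
            for ny, nx in ((y, x - 1), (y - 1, x - 1), (y - 1, x), (y - 1, x + 1)):
                if ny >= 0 and 0 <= nx < w and labels[ny][nx]:
                    neigh.append(find(parent, labels[ny][nx]))
            if not neigh:
                parent.append(len(parent))  # brand new label
                labels[y][x] = len(parent) - 1
            else:
                root = min(neigh)
                labels[y][x] = root
                for n in neigh:
                    parent[n] = root  # record the equivalence, e.g. 4=2
    boxes = {}
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                root = find(parent, labels[y][x])
                x0, y0, x1, y1 = boxes.get(root, (x, y, x, y))
                boxes[root] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return sorted(boxes.values())


ART = [
    "1111111  2         3",
    "   1     2",
    "   1     2 4444    5  666",
    "   1     22    4   5 6",
    "   1     2     4   5  666",
    "   1     2     4   5     6",
    "   1     2     4   5  666",
]
width = max(len(row) for row in ART)
binary = [[ch != " " for ch in row.ljust(width)] for row in ART]
boxes = bounding_boxes(binary)
```

On this input you get five boxes: one each for the T, the h and the s, plus separate boxes for the dot and the stem of the i.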

Then the bounding box of each blob gives you the area around the letter. You will run into problems with letters such as i and j, but they can be special-cased. You could look for a region smaller than a certain size which sits above another region of a certain width (as a rough heuristic).
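A rough sketch of that special case, assuming bounding boxes given as (x0, y0, x1, y1) tuples with y growing downwards. The name `merge_dots` and the limits `max_dot` and `max_gap` are made up for illustration; a small box is merged into the nearest box that starts below it and overlaps it horizontally.

```python
def merge_dots(boxes, max_dot=3, max_gap=5):
    """Merge small boxes into the box directly below them (dots of i and j)."""
    def is_dot(b):
        # Small in both directions, so a thin but tall stem does not qualify.
        return b[2] - b[0] < max_dot and b[3] - b[1] < max_dot

    dots = [b for b in boxes if is_dot(b)]
    bodies = [b for b in boxes if not is_dot(b)]
    for d in dots:
        # Candidates start below the dot, close to it, and overlap it in x.
        below = [b for b in bodies
                 if 0 < b[1] - d[3] <= max_gap and b[0] <= d[2] and d[0] <= b[2]]
        if not below:
            bodies.append(d)  # e.g. a full stop: keep it as its own box
            continue
        target = min(below, key=lambda b: b[1] - d[3])
        merged = (min(d[0], target[0]), d[1], max(d[2], target[2]), target[3])
        bodies[bodies.index(target)] = merged
    return sorted(bodies)
```

This will misfire on things like a speck of noise above a letter, so treat it as a starting point only.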

The cvBlobsLib library for OpenCV should do most of this for you.

gavinb 2009-12-29 02:51:54

Answer 5

+2 A:

Um, this is actually very easy for the sample you provided:

start at left edge
  go right 1 column at a time until the current column contains black (a letter)
  this is the start of the character
  go right again till no black at all in current column
  end of character
repeat till end of image

(Incidentally, this also works for splitting a paragraph into lines.)
If the letters overlap or share columns, it gets a little more ~~difficult~~ interesting.
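The column scan above could be sketched in Python like this, assuming a binary image as a list of rows where truthy means black; `split_columns` is an illustrative name. A column counts as black if any pixel in it is black, from top to bottom.

```python
def split_columns(binary):
    """Return (start, end) column ranges, end exclusive, one per character."""
    h, w = len(binary), len(binary[0])
    # Look at the whole of each column, not just a single row.
    inked = [any(binary[y][x] for y in range(h)) for x in range(w)]
    spans, start = [], None
    for x, black in enumerate(inked):
        if black and start is None:
            start = x  # first black column: start of a character
        elif not black and start is not None:
            spans.append((start, x))  # first all-white column: end of it
            start = None
    if start is not None:
        spans.append((start, w))  # character touching the right edge
    return spans
```

Cropping the image to each returned range gives one image per character. Running the same function on the transposed image (rows swapped with columns) splits a paragraph into lines.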

Edit:

@Andres, no, it works fine for 'U': you have to look at the whole of each column, top to bottom.

 U   U
 U   U
 U   U
 U   U
  UUU
 01234

Columns 0 and 4: black in every row but the bottom one
Columns 1-3: black only in the bottom row

David X 2009-12-29 05:42:01

There's a problem with this approach. The step 'go right again till no black, end of character' does not always hold. If you are processing a 'U' or even an 'h', the end of the black doesn't mean the end of the character, because those letters form two columns of pixels with white space in between.

Andres 2009-12-29 12:01:22
