views: 693
answers: 5

This is part of the OCR process:

How do I segment the sentences into words, and then into characters?

What are candidate algorithms for this task?

+1  A: 

First, NIST (the National Institute of Standards and Technology) published a protocol known as the NIST Form-Based Handwriting Recognition System about 15 years ago for this exact question, i.e., extracting and preparing text-as-image data for input to machine learning algorithms for OCR. Members of this group at NIST also published a number of papers on this system.

The performance of their classifier was demonstrated on data also published with the algorithm (the "NIST Handwriting Sample Forms").

Each of the half-dozen or so OCR data sets I have downloaded and used references the data extraction/preparation protocol used by NIST to prepare the data for input to their algorithm. In particular, I am pretty sure this is the methodology relied on to prepare the Boston University Handwritten Digit Database, which is regarded as benchmark reference data for OCR.

So even if the NIST protocol is not a genuine standard, it is at least a proven methodology for preparing text-as-image data for input to an OCR algorithm. I would suggest starting there, and using that protocol to prepare your data unless you have a good reason not to.

In sum, the NIST data was prepared by extracting 32 x 32 pixel normalized bitmaps directly from a pre-printed form.

Here's an example:

00000000000001100111100000000000
00000000000111111111111111000000
00000000011111111111111111110000
00000000011111111111111111110000
00000000011111111101000001100000
00000000011111110000000000000000
00000000111100000000000000000000
00000001111100000000000000000000
00000001111100011110000000000000
00000001111100011111000000000000
00000001111111111111111000000000
00000001111111111111111000000000
00000001111111111111111110000000
00000001111111111111111100000000
00000001111111100011111110000000
00000001111110000001111110000000
00000001111100000000111110000000
00000001111000000000111110000000
00000000000000000000001111000000
00000000000000000000001111000000
00000000000000000000011110000000
00000000000000000000011110000000
00000000000000000000111110000000
00000000000000000001111100000000
00000000001110000001111100000000
00000000001110000011111100000000
00000000001111101111111000000000
00000000011111111111100000000000
00000000011111111111000000000000
00000000011111111110000000000000
00000000001111111000000000000000
00000000000010000000000000000000

I believe that the BU data-prep technique subsumes the NIST technique but adds a few steps at the end, not with higher fidelity in mind but to reduce file size. In particular, the BU group (a minimal sketch follows this list):

  • began with the 32 x 32 bitmaps; then
  • divided each 32 x 32 bitmap into non-overlapping 4 x 4 blocks;
  • counted the number of activated pixels in each block ("1" is activated, "0" is not);
  • the result is an 8 x 8 input matrix in which each element is an integer in the range 0-16.
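
To make the block-counting step concrete, here is a minimal sketch in Python/NumPy; the function name and the placeholder parsing of the example bitmap are mine, not part of the NIST or BU protocols:

    import numpy as np

    def block_counts(bitmap):
        """Downsample a 32 x 32 binary bitmap to an 8 x 8 matrix of block counts.

        Each output element is the number of activated ("1") pixels in the
        corresponding non-overlapping 4 x 4 block, so values range from 0 to 16.
        """
        bitmap = np.asarray(bitmap, dtype=np.uint8).reshape(32, 32)
        # Regroup rows/columns into an 8 x 8 grid of 4 x 4 blocks, then sum each block.
        return bitmap.reshape(8, 4, 8, 4).sum(axis=(1, 3))

    # Usage: `rows` would be the 32 bit-strings shown in the example above.
    rows = ["0" * 32] * 32  # placeholder; substitute the actual rows
    grid = np.array([[int(c) for c in row] for row in rows])
    print(block_counts(grid))  # 8 x 8 matrix of integers in 0-16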
doug
No, the input is an image with a sentence on it. I want to split it into images each containing a single word, and then split those into images each containing a single character.
I've edited my answer in light of this clarification.
doug
I can't find any implementation details in your answer :(
I'll edit my answer this evening so that it provides a step-by-step guide.
doug
I'm looking forward to that!
+1  A: 

As a first pass:

  • process the text into lines;
  • process each line into segments (connected parts);
  • find the largest white band that can be placed between each pair of segments;
  • look at the sequence of widths and select "large" widths as white space;
  • everything between white space is a word.

Now all you need is a good enough definition of "large".
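
Here is a rough Python/NumPy sketch of this gap-width idea, assuming the line image is already binarized (ink = 1, background = 0); the function name and the median-based definition of "large" are my own assumptions, not part of the answer:

    import numpy as np

    def split_line_into_words(line_img, large_gap=None):
        # Columns that contain any ink; runs of False between them are the "white bands".
        ink_cols = line_img.sum(axis=0) > 0
        padded = np.concatenate(([False], ink_cols, [False]))
        starts = np.flatnonzero(~padded[:-1] & padded[1:])   # first column of each inked segment
        ends = np.flatnonzero(padded[:-1] & ~padded[1:])     # one past the last column of each segment
        if len(starts) == 0:
            return []
        gaps = starts[1:] - ends[:-1]                        # widths of the white bands between segments
        if large_gap is None:
            # Crude definition of "large": wider than the median inter-segment gap.
            large_gap = np.median(gaps) if len(gaps) else 0
        words, word_start = [], starts[0]
        for i, gap in enumerate(gaps):
            if gap > large_gap:                              # band wide enough to count as a space
                words.append(line_img[:, word_start:ends[i]])
                word_start = starts[i + 1]
        words.append(line_img[:, word_start:ends[-1]])
        return words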

BCS
But a space-based solution won't work well with CJK characters: http://en.wikipedia.org/wiki/CJK_characters
In that case, OCR a few characters in isolation and take a guess at the alphabet. If it's English, do the above; if not, do something else.
BCS
I tried to implement it following your steps, but I can't manage to get started...
A: 

I am assuming U are using the Image Processing Toolbox in MATLAB.

To distinguish text in an image, U might want to follow these steps:

[STEP 1] Convert to grayscale (speeds things up greatly).

[STEP 2] Contrast enhancement.

[STEP 3] Erode the image lightly to remove noise (scratches/blips).

[STEP 4] Dilation (heavy).

[STEP 5] Edge detection (or ROI calculation).

With trial and error, U'll get the proper coefficients such that the image U obtain after Step 5 contains convex regions surrounding each letter/word/line/paragraph.
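
The steps above refer to MATLAB's Image Processing Toolbox; purely as an illustration of the same pipeline, here is a hedged sketch using OpenCV in Python (the file name, kernel sizes, and Otsu thresholding are my assumptions, and findContours is called with its OpenCV 4.x signature):

    import cv2
    import numpy as np

    img = cv2.imread("page.png")                               # hypothetical scanned page
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)               # [STEP 1] grayscale
    gray = cv2.equalizeHist(gray)                              # [STEP 2] contrast enhancement
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    binary = cv2.erode(binary, np.ones((2, 2), np.uint8))      # [STEP 3] light erosion removes specks
    blobs = cv2.dilate(binary, np.ones((5, 15), np.uint8))     # [STEP 4] heavy dilation merges characters into blobs
    contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL,   # [STEP 5] ROI extraction from the blob image
                                   cv2.CHAIN_APPROX_SIMPLE)
    rois = [cv2.boundingRect(c) for c in contours]             # one (x, y, w, h) box per detected region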

NOTE:

(i) Essentially, the more you dilate, the larger the elements U get; i.e., the least dilation is useful for identifying letters, whereas comparatively heavy dilation is needed to identify lines and paragraphs.
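
Continuing the sketch above, the structuring-element size picks the level of structure the blobs correspond to; the kernel sizes below are illustrative guesses, not values from the answer:

    import cv2
    import numpy as np

    gray = cv2.cvtColor(cv2.imread("page.png"), cv2.COLOR_BGR2GRAY)  # hypothetical input
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

    letters    = cv2.dilate(binary, np.ones((2, 2),  np.uint8))   # minimal dilation: characters stay separate
    words      = cv2.dilate(binary, np.ones((3, 9),  np.uint8))   # wider kernel bridges gaps inside a word
    lines      = cv2.dilate(binary, np.ones((3, 40), np.uint8))   # bridges inter-word spaces into whole lines
    paragraphs = cv2.dilate(binary, np.ones((15, 40), np.uint8))  # also bridges vertically between lines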

(ii) Online MATLAB Image Processing Toolbox docs: http://www.mathworks.com/access/helpdesk/help/toolbox/images/

Please check out the "Examples in Documentation" section in the online docs, or refer to the Image Processing Toolbox documentation in the MATLAB Help menu.

The examples given there will guide you as to the proper functions to call and their various formats.

Sample CODE (not mine):

http://www.ele.uri.edu/~hansenj/projects/ele585/OCR/

GOOD LUCK!!

PS: Incidentally, this was my final-year project in B.E. ;-)

CVS-2600Hertz
There can be multiple contour components after your **STEP 5**; how do you deal with them?
http://www.ele.uri.edu/~hansenj/projects/ele585/OCR/
CVS-2600Hertz
@CVS, it's not a contour-based solution.
Someone has posted a similar question here: http://stackoverflow.com/questions/789527/ocr-convert-edge-into-a-vector-path
CVS-2600Hertz
The key here is how to process the **multiple** chain codes; I've commented on your post.
U know what would be great? If U could actually take the one extra keystroke necessary to transform "U" to "you."
Seth Johnson
A: 

Please give me information on non-overlapping sequence detector coding in MATLAB.

kkbhavsar
A: 

For finding a binary sequence like 101000000000000000010000001, detect the sequences 0000, 0001, 001, 01, 1.

kkbhavsar