I have this image:

alt text

I want to read it into a string using Python, which I didn't think would be that hard. I came across Tesseract, and then a Python wrapper for it.

So I started reading images, and it did great until I tried to read this one. Am I going to have to train it to read that specific font? Any ideas on what that font is? Or is there a better OCR engine I could use with Python to get this job done?

Edit: Perhaps I could trace some sort of vector outline around the digits, then redraw them at a larger size? The larger the images are, the better Tesseract seems to read them (no surprise, lol).
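A simpler variant of the idea in the edit is to upscale the bitmap before handing it to the OCR engine, without vectorizing anything. The following is only a rough sketch, assuming the third-party Pillow library is installed; the function names, the file paths and the factor of 3 are illustrative choices, not something tested against this particular image.

```python
def scaled_size(size, factor):
    """Return the (width, height) tuple multiplied by an integer factor."""
    width, height = size
    return (width * factor, height * factor)


def upscale(src_path, dst_path, factor=3):
    """Write an enlarged copy of the image at src_path to dst_path.

    Assumes Pillow is available; it is imported here so that
    scaled_size() stays usable without it.
    """
    from PIL import Image  # third-party dependency (assumption)

    with Image.open(src_path) as img:
        # LANCZOS resampling keeps edges smoother than nearest-neighbour.
        bigger = img.resize(scaled_size(img.size, factor), Image.LANCZOS)
        bigger.save(dst_path)
```

Calling `upscale("number.png", "number_big.png")` (hypothetical file names) would then produce the file to pass to Tesseract. Whether a factor of 2, 3 or more helps is something to find out by experiment.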

A: 

That looks like the Eurostile font. Yes, you will have to train the engine for each font used in your source images.

Michael Dillon
+4  A: 

Just train the engine on the 10 digits and '.'. That should do it. And make sure you convert your image to grayscale before running OCR on it.
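To make the grayscale step concrete, here is a minimal pure-Python sketch. It assumes the image has already been loaded as rows of `(r, g, b)` tuples with values from 0 to 255, and it uses the common ITU-R 601 luma weights; an imaging library would normally do this for you, so the helper names here are purely illustrative.

```python
def to_gray(pixel):
    """Convert one (r, g, b) pixel to a single 0-255 luminance value."""
    r, g, b = pixel
    # ITU-R 601 luma weights, kept in integer arithmetic.
    return (299 * r + 587 * g + 114 * b) // 1000


def grayscale(rows):
    """Convert a list of rows of (r, g, b) pixels to rows of gray values."""
    return [[to_gray(pixel) for pixel in row] for row in rows]
```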

debayan
A: 

There has been a lot of traffic on this topic in the Tesseract OCR discussion group lately. You will need to use a "language" of just numbers. Many people have trained the engine that way before. It looks like you're trying to outwit a CAPTCHA data protection scheme... tsk, tsk.

sventech
Not me specifically, it's more for a client, but that's the basis of it, yes. I believe information should be free anyway, though that's a whole 'nother argument.
Codygman
I agree information should be free, but I was thinking that what you're doing might jeopardize the privacy of personal data, which I believe should be protected (though with SSL cracked that's not long for this world).
sventech
+2  A: 

Training is hard and is not what is really needed here. Distinguishing O from 0 and l from 1 is going to be hard, no matter the script. Restricting the OCR engine to numeric digits greatly simplifies the problem, if the context permits it.
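One way to restrict the character set without touching the source is Tesseract's `tessedit_char_whitelist` configuration variable. The sketch below assumes a Tesseract 3.x or later command-line binary on the PATH that accepts `-c name=value`; support for the whitelist varies by version and recognition engine, and the helper names are illustrative only.

```python
import subprocess

DIGITS = "0123456789."


def build_command(image_path, allowed=DIGITS):
    """Build a tesseract command line that only permits the given characters."""
    return [
        "tesseract", image_path, "stdout",
        "-c", f"tessedit_char_whitelist={allowed}",
    ]


def read_number(image_path):
    """Run tesseract on the image and return the recognised text, stripped."""
    result = subprocess.run(
        build_command(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```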

My interest in Tesseract is in processing lots of numbers from old government reports. In that case, as in the case in question, the character set will be something like '0123456789.'. Following a comment by eric_taj on 2007-03-21 in the old (SourceForge) newsgroup for Tesseract, you can modify Templates->IndexFor and Templates->ClassIdFor in classify/intproto.cpp to mask off characters that are not allowed. I modified that approach a bit to read the allowed character set at runtime from an environment variable, so that I can adjust the permitted set on the fly.
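The actual change lives in Tesseract's C++ classifier and is not reproduced here. As a rough Python illustration of the same idea only, the sketch below reads a permitted character set from an environment variable and masks off any candidate outside it. The variable name `OCR_ALLOWED_CHARS`, the function names and the `(character, score)` candidate format are all assumptions made up for this example.

```python
import os


def allowed_chars(default="0123456789."):
    """Read the permitted character set from the environment at call time."""
    # OCR_ALLOWED_CHARS is a hypothetical variable name for this sketch.
    return set(os.environ.get("OCR_ALLOWED_CHARS", default))


def best_allowed(candidates, allowed):
    """Return the highest-scoring candidate character that is permitted.

    candidates is a list of (character, score) pairs; returns None if
    every candidate is masked off.
    """
    permitted = [(ch, score) for ch, score in candidates if ch in allowed]
    if not permitted:
        return None
    return max(permitted, key=lambda pair: pair[1])[0]
```

With the default set, a classifier that slightly prefers the letter `O` over the digit `0` would still yield `0`, because `O` is never considered.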

cboe