Is it possible to limit the set of characters that tesseract is looking for (e.g. search only for letters a-z)? That would improve my results greatly.

+2  A: 

You should probably look into preparing some training files. Have a look at this tool

epatel
Looks nice, but regularly crashes with an unhandled exception error... Is there an alternative?
danilo
Sorry, not that I can recall. It has been some time since I used it. I used it to scan so-called OCR numbers here in Sweden with the iSight on Macs. I trained it to recognize only the special numbers: http://www.memention.com/mye/
epatel
I found this page: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
Looks a little complicated, but doable. Thanks for the hint with the training files.
danilo
Yes, I remember that page. It took a while to get the hang of it, but afterwards it was pretty straightforward. If you figure out a good sequence of steps, why not put them as an update in your question :)
epatel
+2  A: 

This tutorial details the steps required to train Tesseract. I found it very useful.

Buzzy
+2  A: 

Create a config file (e.g. "letters") in the tessdata/configs directory, usually /usr/share/tesseract/tessdata/configs.
Add this line to the config file:

tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz  

...or maybe [a-z] works.. dunno :-)
Then call tesseract similar to this:

tesseract input.tif output nobatch letters  

That will limit tesseract to recognizing only the wanted characters.
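
The steps above could be scripted roughly like this. It is only a sketch: the helper names `write_whitelist_config` and `build_tesseract_command` are made up for illustration, the config directory is whatever your installation uses, and the command line simply mirrors the one shown above (other Tesseract versions may expect different arguments). The characters are written out explicitly rather than as a range like `[a-z]`.

```python
from pathlib import Path
import string


def write_whitelist_config(configs_dir, name="letters",
                           chars=string.ascii_lowercase):
    """Write a one-line config file that whitelists `chars`.

    `configs_dir` is assumed to be the tessdata/configs directory
    of the local installation (hypothetical parameter name).
    """
    path = Path(configs_dir) / name
    # Spell out every character; no range syntax is used here.
    path.write_text(f"tessedit_char_whitelist {chars}\n")
    return path


def build_tesseract_command(image, output_base, config_name="letters"):
    """Return the argument list for the call shown in the answer."""
    return ["tesseract", image, output_base, "nobatch", config_name]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run() to execute it.
    print(" ".join(build_tesseract_command("input.tif", "output")))
```

Writing into /usr/share/tesseract/tessdata/configs will usually need root permissions, so you may prefer to generate the file elsewhere and copy it over.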

Blomman
Sorry for the late answer - this helped. Thank you :) By the way, the regex did not work. It was probably interpreted literally.
danilo
tessedit_char_whitelist 0123456789 - I did this to fetch numbers from an image, but out of 20 digits only 4 were correct. Any help would be greatly appreciated! Thank you.
SWATI
SWATI: What kind of image is it? Try cleaning up the source image, for example using ImageMagick.
danilo
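
One possible way to do such a cleanup, as a rough sketch: convert the image to grayscale and apply a fixed threshold with ImageMagick before handing it to tesseract. The comment above does not say which cleanup steps to use, so the grayscale-plus-threshold choice, the 50% default, and the helper name `build_cleanup_command` are all assumptions. The `convert` binary is the ImageMagick 6 name; newer installs may call it `magick`.

```python
def build_cleanup_command(src, dst, threshold_percent=50, binary="convert"):
    """Return an ImageMagick argument list that grayscales and binarizes `src`.

    The threshold value is a guess and will need tuning per image.
    """
    return [
        binary, src,
        "-colorspace", "Gray",                  # drop color information
        "-threshold", f"{threshold_percent}%",  # pure black/white output
        dst,
    ]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run() to execute it.
    print(" ".join(build_cleanup_command("scan.png", "input.tif")))
```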