Is it possible to limit the set of characters that tesseract is looking for (e.g. search only for letters a-z)? That would improve my results greatly.
+2
A:
You should probably look into preparing some training files. Have a look at this tool
epatel
2010-03-02 13:51:27
Looks nice, but regularly crashes with an unhandled exception error... Is there an alternative?
danilo
2010-03-02 16:19:14
Sorry, not that I can recall. It has been some time since I used it. I used it to scan so-called OCR numbers here in Sweden with the iSight on Macs. I trained it to recognize only the special numbers: http://www.memention.com/mye/
epatel
2010-03-02 23:40:15
I found this page: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
It looks a little complicated, but doable. Thanks for the hint about the training files.
danilo
2010-03-03 14:20:18
Yes, I remember that page. It took a while to get the hang of it, but afterwards it was pretty straightforward. If you figure out a good sequence of steps, why not add them as an update to your question :)
epatel
2010-03-03 16:04:27
+2
A:
Create a config file (e.g. "letters") in the tessdata/configs directory, usually /usr/share/tesseract/tessdata/configs.
Add this line to the config file:
tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz
...or maybe [a-z] works.. dunno :-)
Then call tesseract similar to this:
tesseract input.tif output nobatch letters
That will limit tesseract to recognizing only the wanted characters.
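For anyone who wants to script the steps above, here is a minimal Python sketch. It only writes the config file and assembles the command line as an argument list; it does not run tesseract. The function names and the temporary directory standing in for tessdata/configs are assumptions for illustration, not part of Tesseract.

```python
import string
import tempfile
from pathlib import Path


def write_whitelist_config(configs_dir, name, allowed_chars):
    """Write a config file that sets tessedit_char_whitelist to allowed_chars."""
    path = Path(configs_dir) / name
    # Every allowed character is spelled out explicitly, because a range
    # such as [a-z] is reportedly taken literally rather than expanded.
    path.write_text(f"tessedit_char_whitelist {allowed_chars}\n")
    return path


def build_command(image, output_base, config_name):
    """Assemble the tesseract call from the answer as an argument list."""
    return ["tesseract", image, output_base, "nobatch", config_name]


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for tessdata/configs here.
    with tempfile.TemporaryDirectory() as configs_dir:
        config = write_whitelist_config(
            configs_dir, "letters", string.ascii_lowercase
        )
        print(config.read_text(), end="")
        print(" ".join(build_command("input.tif", "output", "letters")))
```

To use it for real, point `configs_dir` at your actual tessdata/configs directory and pass the list from `build_command` to `subprocess.run`.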
Blomman
2010-06-06 06:08:44
Sorry for the late answer - this helped. Thank you :) By the way, the regex did not work. It was probably interpreted literally.
danilo
2010-07-11 09:09:04
I used tessedit_char_whitelist 0123456789 to fetch numbers from an image, but out of 20 digits only 4 were correct. Any help would be greatly appreciated! Thank you.
SWATI
2010-10-01 10:50:51
SWATI: What kind of image is it? Try cleaning up the source image first, for example using ImageMagick.
danilo
2010-10-21 12:27:22
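As a rough illustration of the cleanup suggested above, this Python sketch builds an ImageMagick `convert` command that converts the image to grayscale, enlarges it, and thresholds it to black and white before OCR. The function name, file names, and the default scale and threshold values are assumptions; which steps actually help depends on the image.

```python
def build_cleanup_command(source, cleaned, scale_percent=200, threshold_percent=50):
    """Assemble an ImageMagick call that prepares an image for OCR."""
    return [
        "convert", source,
        "-colorspace", "Gray",                     # drop color information
        "-resize", f"{scale_percent}%",            # enlarge small text
        "-threshold", f"{threshold_percent}%",     # force pure black and white
        cleaned,
    ]


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run it.
    print(" ".join(build_cleanup_command("scan.png", "scan_clean.tif")))
```

The resulting list can be passed to `subprocess.run`, and the cleaned file then fed to tesseract in place of the original.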