I have been using tesseract-ocr (in .NET), which has been working well. The images I feed it contain only ASCII letters and digits (A-Z, a-z, 0-9). Is there a way I can tell it not to use special characters?

A: 

There's a new thread about this question over at the Google forum linked above. The first answer concludes that it probably isn't possible.

As far as I know, this is correct, if you're using the language data files that are packaged with Tesseract. You can, however, very easily limit the output characters if you're training on your own box files. It's practically automatic: if unicharset_extractor doesn't find any non-ASCII characters in the box files, you'll never see non-ASCII characters in the output.
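Before training, it can help to confirm that your box files really contain only the characters you expect. Here is a minimal sketch in Python, assuming the usual box file layout of one symbol per line followed by its coordinates (and a page number in newer versions). The helper names `box_symbols` and `unexpected_symbols` are my own, not part of Tesseract.

```python
import re

# Symbols we want to allow in the training data (assumption: ASCII
# letters and digits only, as in the question).
ALLOWED = re.compile(r"[A-Za-z0-9]")


def box_symbols(box_text):
    """Return the set of symbols found in the first field of each box line."""
    symbols = set()
    for line in box_text.splitlines():
        fields = line.split()
        # A box line is "<symbol> <left> <bottom> <right> <top> [<page>]".
        if len(fields) >= 5:
            symbols.add(fields[0])
    return symbols


def unexpected_symbols(box_text):
    """Return a sorted list of symbols that are not plain ASCII alphanumerics."""
    return sorted(s for s in box_symbols(box_text) if not ALLOWED.fullmatch(s))


# Made-up sample data: three expected symbols and one interpunct.
sample = "A 10 20 30 40 0\nb 35 20 50 40 0\n· 55 25 58 28 0\n7 60 20 75 40 0\n"
print(unexpected_symbols(sample))  # ['·']
```

If the list comes back empty, the box files contain nothing but the allowed characters, so the extracted character set should be limited to them as well.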

I was similarly frustrated by all the interpuncts and other unusual characters in my output when I first started using Tesseract, and training on my own box files solved the problem. You can even use the Tesseract training data as a starting point.

Travis Brown