views: 229
answers: 2
I am looking for a way to determine the most "different" or "recognizable" N ASCII characters. For example, if N = 10, what would be the N most different characters in the ASCII range 0x21 to 0x7E?

Obviously the character "X" is very different from "O" (the letter), but "O" (the letter) is very similar to "0" (zero). Assume a restricted OCR character subset, so that zero and the letter O would only ever be detected as one or the other, and one wouldn't have to worry about which it actually was. Under that assumption, what are the N most different characters that typical OCR engines (for example Tesseract) recognize easily from a poor-quality input image? Similar assumptions can be made for other easily confused pairs, such as "+" and "t": each input character, whether it is "+" or "t", would correspond to only one of the two.

Thanks, Ben

+1  A: 

Only one way to answer this question: test it. Create a set of samples for each character, run OCR on each sample, and tally the results. The characters that OCR gets right most often are the most "recognizable"; the ones it gets wrong or confuses with others most often are the ones to leave out of your subset.
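A rough harness for that test might look like the sketch below. It assumes pytesseract and Pillow are installed, Tesseract is on the PATH, and a DejaVuSans.ttf font file is available, and it imitates a poor-quality scan with random blur and jitter rather than real scanner damage, so treat it as a starting point only.

    # A rough sketch of the empirical approach, assuming pytesseract, Pillow, and a
    # Tesseract install are available; the font file, canvas size, and blur/jitter
    # used to imitate a poor scan are all placeholder assumptions to tune.
    import random
    import string
    from collections import Counter

    from PIL import Image, ImageDraw, ImageFilter, ImageFont
    import pytesseract

    CANDIDATES = string.ascii_uppercase + string.digits   # restrict to your OCR subset
    FONT = ImageFont.truetype("DejaVuSans.ttf", 48)       # assumed font file on disk
    TRIALS = 20

    def render(ch):
        """Render one character with random jitter and blur to imitate a poor scan."""
        img = Image.new("L", (96, 96), 255)
        x, y = 24 + random.randint(-4, 4), 16 + random.randint(-4, 4)
        ImageDraw.Draw(img).text((x, y), ch, font=FONT, fill=0)
        return img.filter(ImageFilter.GaussianBlur(random.uniform(0.5, 2.0)))

    hits, confusions = Counter(), Counter()
    for ch in CANDIDATES:
        for _ in range(TRIALS):
            # --psm 10 asks Tesseract to treat the image as a single character
            out = pytesseract.image_to_string(render(ch), config="--psm 10").strip()
            if out == ch:
                hits[ch] += 1
            elif out:
                confusions[(ch, out)] += 1

    print("most recognizable:", [c for c, _ in hits.most_common(10)])
    print("worst confusions:", confusions.most_common(10))

Characters that keep high hit counts as you increase the degradation are the ones worth keeping; the confusion counts show which pairs (such as O/0 or +/t) should be collapsed into a single symbol, as the question already suggests.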

MusiGenesis
+3  A: 

Unfortunately I don't think there will be a single unique answer for this.

It will depend on the font: compare how characters like 0, f, and s are drawn across different typefaces, and whether stylistic flourishes are present.

It will also depend on the type of damage the characters suffer before being scanned: some may be more resilient to smudging, others to cuts, others to over-writing.

If you're looking for a representation that's best at surviving being printed, scanned and OCRed, then maybe a 1D or 2D barcode would be a better choice?
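As a concrete illustration of the barcode suggestion, a 2D code with built-in error correction can be produced in a couple of lines; this sketch assumes the third-party qrcode package (with Pillow) and a hypothetical payload, not anything specific to your data.

    # A minimal sketch, assuming the third-party `qrcode` package (pip install "qrcode[pil]").
    # QR codes carry Reed-Solomon error correction, so a smudged or poorly scanned
    # label can still decode exactly, unlike individually OCRed glyphs.
    import qrcode

    img = qrcode.make("BEN-0001")   # hypothetical payload to encode
    img.save("label.png")           # print this label instead of raw characters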

Douglas Leeder