You should try implementing a modified version of a Naive Bayes spam filter. For example, in ordinary spam detection you estimate, for each word, the probability that a message containing it is spam, and then combine the individual word probabilities to decide whether the whole message is spam.
Similarly, you could download a word list, and compute the probability that a pair of letters belongs to a real word.
E.g., create a 26x26 table, say T. Let the 5th row represent the letter e, and let entry T(5,1) be the number of times ea appeared in your word list. Once you're done counting, divide each element in each row by the sum of that row, so that T(5,1) is now the fraction of letter pairs starting with e in your word list that are ea.
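The counting and row normalization could be sketched roughly like this in Python. The names build_pair_table, ALPHABET and INDEX and the four-word toy list are made up for illustration; a real run would read a full word list from a file. Note that Python indexes from 0, so the T(5,1) entry described above is T[4][0] here.

```python
import string

ALPHABET = string.ascii_lowercase
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}  # 'a' -> 0, ..., 'z' -> 25


def build_pair_table(words):
    """Return a 26x26 table where table[i][j] is the fraction of letter
    pairs starting with letter i that continue with letter j."""
    counts = [[0] * 26 for _ in range(26)]
    for word in words:
        word = word.lower()
        # Walk over adjacent letter pairs: "each" -> ea, ac, ch
        for a, b in zip(word, word[1:]):
            if a in INDEX and b in INDEX:  # skip hyphens, apostrophes, etc.
                counts[INDEX[a]][INDEX[b]] += 1

    table = []
    for row in counts:
        total = sum(row)
        # Rows for letters that never start a pair stay all zeros.
        table.append([c / total if total else 0.0 for c in row])
    return table


# Toy word list (an assumption; substitute your downloaded list).
words = ["eat", "each", "tea", "jim"]
T = build_pair_table(words)
print(T[INDEX["e"]][INDEX["a"]])  # fraction of e-pairs that are "ea"
```

With a word list this small most entries are zero, which is one reason a large word list (or some smoothing) helps in practice.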
Now you can use the individual pair probabilities (e.g. for Jimy that would be {Ji, im, my}) to check whether Jimy is an acceptable name or not. You'll probably have to determine the right probability to threshold at, but try it out; it's not that hard to implement.
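Here is one possible way to turn the pair probabilities into a yes/no decision. How to combine them is not fixed by the idea itself; this sketch assumes a geometric mean, so that long names are not penalized just for having more pairs, and it uses an arbitrary placeholder threshold of 0.01 that you would need to tune. The function names and the uniform toy table are hypothetical; in practice you would pass in the normalized 26x26 table built from your word list.

```python
import math
import string

INDEX = {ch: i for i, ch in enumerate(string.ascii_lowercase)}


def letter_pairs(name):
    """Split a name into its adjacent letter pairs, lowercased."""
    name = name.lower()
    return [a + b for a, b in zip(name, name[1:])]


def name_score(name, table):
    """Geometric mean of the pair probabilities (an assumed scoring rule).
    Returns 0.0 if the name has no pairs, or any pair is unseen or
    contains a non-letter."""
    pairs = letter_pairs(name)
    if not pairs:
        return 0.0
    log_sum = 0.0
    for a, b in pairs:
        if a not in INDEX or b not in INDEX:
            return 0.0
        p = table[INDEX[a]][INDEX[b]]
        if p <= 0.0:
            return 0.0
        log_sum += math.log(p)  # sum logs to avoid underflow on long names
    return math.exp(log_sum / len(pairs))


def is_acceptable(name, table, threshold=0.01):
    """Threshold is a placeholder; tune it on known good and bad names."""
    return name_score(name, table) >= threshold


# Stand-in table where every pair is equally likely (illustration only).
uniform = [[1 / 26] * 26 for _ in range(26)]
print(letter_pairs("Jimy"))  # ['ji', 'im', 'my']
print(is_acceptable("Jimy", uniform))
```

To pick the threshold, you could score a set of names you know are fine and a set of random strings, then choose a cutoff that separates the two groups reasonably well.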