I'm trying to check spelling accuracy of text samples using the Stanford NLP. It's just a metric of the text, not a filter or anything, so if it's off by a bit it's fine, as long as the error is uniform.
My first idea was to check if the word is known by the lexicon:
private static LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
@Analyze(weight=25, name="Spelling")
public double spelling() {
int result = 0;
for (List<? extends HasWord> list : sentences) {
for (HasWord w : list) {
if (! lp.getLexicon().isKnown(w.word())) {
System.out.format("misspelled: %s\n", w.word());
result++;
}
}
}
return result / sentences.size();
}
However, this produces quite a lot of false positives:
misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus
misspelled: Camus
misspelled: foandf
misspelled: foandf
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: Camus
misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus
Any ideas on how to do this better?