I'm trying to check the spelling accuracy of text samples using the Stanford NLP parser. It's just a metric of the text, not a filter or anything, so it's fine if it's off by a bit, as long as the error is uniform.

My first idea was to check if the word is known by the lexicon:

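import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

private static LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");

// Average number of unknown words per sentence.
@Analyze(weight=25, name="Spelling")
public double spelling() {
    int result = 0;

    for (List<? extends HasWord> list : sentences) {
        for (HasWord w : list) {
            // Count every token the parser's lexicon has not seen.
            if (!lp.getLexicon().isKnown(w.word())) {
                System.out.format("misspelled: %s\n", w.word());
                result++;
            }
        }
    }

    // Cast first: int / int would truncate before the conversion to double.
    return (double) result / sentences.size();
}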

However, this produces quite a lot of false positives:

misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus
misspelled: Camus
misspelled: foandf
misspelled: foandf
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: babby
misspelled: formd
misspelled: gurl
misspelled: pregnent
misspelled: Camus
misspelled: Sincerity
misspelled: Sisyphus
misspelled: Sisyphus
misspelled: fidelity
misspelled: negates
misspelled: gods
misspelled: henceforth
misspelled: atom
misspelled: flake
misspelled: Sisyphus

Any ideas on how to do this better?

A: 

It looks like your errors fall into three groups: proper names, real words (which I assume don't exist in the lexicon), and true misspellings. The false positive on "Sincerity" also suggests that capitalization might be throwing it off, though you'd hope it'd be smart enough not to. Worth checking anyway. Plurals shouldn't be an issue either, but "gods" is flagged too. Does it correctly identify "god"?

Since you're trying to check spelling, why check it indirectly? What is lp.getLexicon().isKnown(w.word()) doing internally? Doesn't it depend on the loaded corpus? Why not just load a dictionary into a big hash with the case normalized, and do a "contains" check? Since you're in an NLP context, it should also be reasonably easy to strip out proper names, especially given that you're not looking for 100% accuracy.
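
A minimal sketch of that dictionary approach, under stated assumptions: the class name DictionarySpelling and its methods are hypothetical, the word list is passed in directly (in practice you would read one word per line from whatever dictionary file you have), and punctuation tokens are simply skipped. Proper-name stripping is not covered here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Stanford NLP.
final class DictionarySpelling {

    // Lower-cased dictionary words for constant-time "contains" checks.
    private final Set<String> dictionary = new HashSet<>();

    DictionarySpelling(Iterable<String> words) {
        for (String w : words) {
            dictionary.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /** True if the token, ignoring case, is in the dictionary. */
    boolean isKnown(String token) {
        return dictionary.contains(token.toLowerCase(Locale.ROOT));
    }

    /** Fraction of purely alphabetic tokens that are not in the dictionary. */
    double misspelledRatio(List<String> tokens) {
        int checked = 0;
        int unknown = 0;
        for (String t : tokens) {
            // Skip punctuation, numbers, and empty tokens.
            if (t.isEmpty() || !t.chars().allMatch(Character::isLetter)) {
                continue;
            }
            checked++;
            if (!isKnown(t)) {
                unknown++;
            }
        }
        return checked == 0 ? 0.0 : (double) unknown / checked;
    }

    public static void main(String[] args) {
        // Tiny stand-in word list; a real dictionary would be far larger.
        DictionarySpelling spelling =
            new DictionarySpelling(Arrays.asList("sincerity", "gods", "atom"));
        List<String> tokens = Arrays.asList("Sincerity", "gods", ".", "pregnent", "atom");
        System.out.println(spelling.misspelledRatio(tokens)); // prints 0.25
    }
}
```

Dividing by the number of checked tokens rather than the number of sentences keeps the metric comparable across samples with different sentence lengths; that is a design choice of this sketch, not something the question requires.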

Steve B.
A:

Using the isKnown(String) method of the parser's lexicon as a spellchecker isn't a viable use of the parser. The method is correct: "false" means that this word was not seen (with the given capitalization) in the approximately 1 million words of text the parser was trained on. But 1 million words just isn't enough text to train a comprehensive spellchecker in a data-driven manner. People would typically use at least two orders of magnitude more text, and might well add some cleverness to handle capitalization. The parser includes some of this cleverness to handle words that were unseen in the training data, but this isn't reflected in what the isKnown(String) method returns.
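
To illustrate the capitalization point, here is a small sketch contrasting an exact-case lookup with one that falls back to the lower-cased form. It is not the parser's internal logic; the class CaseFallbackVocabulary and its methods are hypothetical, and the "training tokens" are a toy stand-in for a real corpus vocabulary.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration, not part of Stanford NLP.
final class CaseFallbackVocabulary {

    // Surface forms exactly as they appeared in the (toy) training text.
    private final Set<String> seen;

    CaseFallbackVocabulary(String... trainingTokens) {
        seen = new HashSet<>(Arrays.asList(trainingTokens));
    }

    /** Exact-case lookup: "Sincerity" is unknown if only "sincerity" was seen. */
    boolean isKnownExact(String word) {
        return seen.contains(word);
    }

    /** Also accepts a word whose lower-cased form was seen (e.g. sentence-initial words). */
    boolean isKnownWithFallback(String word) {
        return seen.contains(word) || seen.contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        CaseFallbackVocabulary vocab = new CaseFallbackVocabulary("sincerity", "Camus");
        System.out.println(vocab.isKnownExact("Sincerity"));        // prints false
        System.out.println(vocab.isKnownWithFallback("Sincerity")); // prints true
        System.out.println(vocab.isKnownWithFallback("pregnent"));  // prints false
    }
}
```

The fallback only goes from capitalized to lower case, so a name seen as "Camus" does not make "camus" acceptable. It does nothing about vocabulary size, which remains the main limitation described above.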

Christopher Manning