ansaurus

Question

Answer 1

A:

It looks like your errors fall into three groups: proper names, real words (which I assume aren't in the lexicon), and true misspellings. The false negative on "Sincerity" also suggests that capitalization might be throwing it off; you'd hope it would be smart enough to handle that, but it's worth checking anyway. Plurals shouldn't be an issue either, yet there's a false negative on "gods". Does it correctly identify "god"?

Since you're trying to check spelling, why check it indirectly? What is lp.getLexicon().isKnown(w.word()) doing internally? Doesn't it depend on the loaded corpus? Why not just load a dictionary into a big hash set with the case normalized, and do a "contains" check? Since you're in an NLP context, it should also be reasonably easy to strip out proper names, especially given that you're not looking for 100% accuracy.
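A minimal sketch of that dictionary approach, in case it helps. The class name DictionarySpellCheck and the three-word list are made up for illustration; in practice you'd read a full word list from a file, and none of this is part of the parser's API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Stanford parser.
final class DictionarySpellCheck {
    private final Set<String> words = new HashSet<>();

    // Store every dictionary entry in lower case so lookups ignore capitalization.
    DictionarySpellCheck(Iterable<String> dictionary) {
        for (String entry : dictionary) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                words.add(trimmed.toLowerCase(Locale.ROOT));
            }
        }
    }

    // True if the word, ignoring case, appears in the dictionary.
    boolean isKnown(String word) {
        return words.contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Tiny in-memory word list (an assumption); a real list would come from a file.
        DictionarySpellCheck checker =
                new DictionarySpellCheck(Arrays.asList("sincerity", "god", "gods"));
        System.out.println(checker.isKnown("Sincerity")); // true
        System.out.println(checker.isKnown("gods"));      // true
        System.out.println(checker.isKnown("sincerty"));  // false
    }
}
```

Lower-casing everything is the simplest normalization, but it also means "god" and "God" are treated alike, so proper names would still need to be filtered separately.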

Steve B. 2009-12-06 19:05:06

Answer 2

+2 A:

Using the parser's lexicon's isKnown(String) method as a spellchecker isn't a viable use case for the parser. The method is correct: "false" means that this word was not seen (with the given capitalization) in the approximately 1 million words of text the parser is trained on. But 1 million words just isn't enough text to train a comprehensive spellchecker in a data-driven manner. People would typically use at least two orders of magnitude more text, and might well add some cleverness to handle capitalization. The parser includes some of this cleverness to handle words that were unseen in the training data, but this isn't reflected in what the isKnown(String) method returns.
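To illustrate the behaviour described above, here is a toy stand-in for a "seen in training data" lexicon. It is only a sketch under loud assumptions: SeenWordLexicon is a hypothetical class, the one-sentence "corpus" is invented, and this is not how the parser's lexicon is implemented. It just shows why an exact-match lookup over a small corpus reports false for correctly spelled words.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a training-data lexicon; not the parser's own code.
final class SeenWordLexicon {
    private final Set<String> seen = new HashSet<>();

    // Record each whitespace-separated token exactly as written, case included.
    SeenWordLexicon(String trainingText) {
        for (String token : trainingText.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                seen.add(token);
            }
        }
    }

    // True only if this exact string, with this capitalization, was seen.
    boolean isKnown(String word) {
        return seen.contains(word);
    }

    public static void main(String[] args) {
        // Invented one-sentence corpus, purely for illustration.
        SeenWordLexicon lexicon =
                new SeenWordLexicon("he spoke with sincerity about the god of the sea");
        System.out.println(lexicon.isKnown("sincerity")); // true
        System.out.println(lexicon.isKnown("Sincerity")); // false: capitalization differs
        System.out.println(lexicon.isKnown("gods"));      // false: plural never seen
    }
}
```

With a corpus this small, nearly every lookup is a miss; a million words narrows the gap but still leaves many ordinary words unseen.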

Christopher Manning 2009-12-22 00:33:48

Java Stanford NLP: Spell checking
