ansaurus

Question

Extracting a set of words with Python/NLTK, then comparing it to a standard English dictionary.

Answer 1

If your English dictionary is indeed a set (hopefully of lowercased words),

set(vocab) - english_dictionary

gives you the set of words which are in the vocab set but not in the english_dictionary one. (It's a pity that you turned vocab into a list with that sorted call, since you need to turn it back into a set to perform operations such as this set difference!)
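A minimal sketch of that set difference, assuming small made-up sample values (the real vocab would come from your NLTK text, and the real english_dictionary from your word list):

```python
# Hypothetical sample data; only the names come from the discussion above.
vocab = sorted({"the", "cat", "sat", "blorf", "zxcv"})  # a list, as sorted(...) returns
english_dictionary = {"the", "cat", "sat", "mat"}       # a set of lowercased words

# Turn the list back into a set, then subtract the dictionary.
unknown_words = set(vocab) - english_dictionary

# Sort only for display, since sets have no defined order.
print(sorted(unknown_words))
```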

If your English dictionary is in some different format, not really a set or not comprised only of lowercased words, you'll have to tell us what that format is for us to be able to help!-)

Edit: since the OP's edit shows that both words (what was previously called vocab) and englishwords (what I previously called english_dictionary) are in fact lists of lowercased words,

newwords = set(words) - set(englishwords)

or

newwords = set(words).difference(englishwords)

are two ways to express "the set of words that are not englishwords". The former is slightly more concise. The latter is perhaps a bit more readable, since it uses the word "difference" explicitly instead of a minus sign, and perhaps a bit more efficient, since it doesn't explicitly transform the list englishwords into a set. If speed is crucial, though, this needs to be checked by measurement, since "internally" difference still needs to do some kind of "transformation-to-set"-like operation.
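If you do want to measure, something like the following sketch with the standard timeit module could be a starting point. The sample lists are invented stand-ins for your words and englishwords, and the timings will vary with your machine and data, so treat the output as indicative only:

```python
import timeit

# Hypothetical stand-ins for the two lists of lowercased words.
words = ["word%d" % i for i in range(2000)]
englishwords = ["word%d" % i for i in range(0, 2000, 2)]

def with_minus():
    # Builds a set from each list, then subtracts.
    return set(words) - set(englishwords)

def with_difference():
    # Builds one set and passes the second list directly.
    return set(words).difference(englishwords)

# Both spellings must produce the same result.
assert with_minus() == with_difference()

# Run each version 200 times and report the total elapsed time.
t_minus = timeit.timeit(with_minus, number=200)
t_difference = timeit.timeit(with_difference, number=200)
print("minus:      %.4f s" % t_minus)
print("difference: %.4f s" % t_difference)
```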

If you're keen to have a list as the result instead of a set, sorted(newwords) will give you an alphabetically sorted list (list(newwords) would give you a list a bit faster, but in totally arbitrary order, and I suspect you'd rather wait a tiny extra amount of time and get, in return, a nicely alphabetized result;-).
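To illustrate the two conversions, here is a small sketch with invented sample lists (the names words, englishwords and newwords follow the ones used above; the values are assumptions):

```python
# Hypothetical sample lists of lowercased words.
words = ["the", "cat", "sat", "zxcv", "blorf", "blorf"]
englishwords = ["the", "cat", "sat", "mat"]

newwords = set(words) - set(englishwords)

# Alphabetically sorted list: ['blorf', 'zxcv']
alphabetical = sorted(newwords)

# Same elements, but in whatever order the set happens to iterate.
arbitrary = list(newwords)

print(alphabetical)
```

Note that the duplicate "blorf" disappears along the way, since a set keeps each word only once.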

Alex Martelli 2010-08-06 22:41:41

Amended the question slightly to reflect this new info.

old Ixfoxleigh 2010-08-07 00:58:25

That's exactly what I needed. Thank you, Alex!

old Ixfoxleigh 2010-08-07 03:24:29

@tsimotki, you're welcome. Note that with your current reputation you can "upvote" answers you like (whether to your own questions or others) -- indeed, it's really strange (for anybody with sufficient rep) to accept an answer without upvoting it (accepting means it was the most helpful one to solve your problem, not upvoting means you didn't really like it much... unusual combination;-).

Alex Martelli 2010-08-07 04:31:56

Sorry. My first introduction to SE was through MO, where it took me lots of effort to even acquire suffrage. =)

old Ixfoxleigh 2010-08-07 11:53:06
