views:

67

answers:

1

I have:

from __future__ import division
import nltk, re, pprint
f = open('/home/a/Desktop/Projects/FinnegansWake/JamesJoyce-FinnegansWake.txt')
raw = f.read()
tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)
words = [w.lower() for w in text]

f2 = open('/home/a/Desktop/Projects/FinnegansWake/catted-several-long-Russian-novels-and-the-NYT.txt')
englishraw = f2.read()
englishtokens = nltk.wordpunct_tokenize(englishraw)
englishtext = nltk.Text(englishtokens)
englishwords = [w.lower() for w in englishwords]

which is straight from the NLTK manual. What I want to do next is to compare vocab to an exhaustive set of English words, like the OED, and extract the difference -- the set of Finnegans Wake words that have not, and probably never will, be in the OED. I'm much more of a verbal person than a math-oriented person, so I haven't figured out how to do that yet, and the manual goes into way too much detail about stuff I don't actually want to do. I'm assuming it's just one or two more lines of code, though.

+1  A: 

If your English dictionary is indeed a set (hopefully of lowercased words),

set(vocab) - english_dictionary

gives you the set of words which are in the vocab set but not in the english_dictionary one. (It's a pity that you turned vocab into a list by that sorted, since you need to turn it back into a set to perform operations such as this set difference!).

If your English dictionary is in some different format, not really a set or not comprised only of lowercased words, you'll have to tell us what that format is for us to be able to help!-)

Edit: given the OP's edit shows that both words (what was previously called vocab) and englishwords (what I previously called english_dictionary) are in fact lists of lowercased words, then

newwords = set(words) - set(englishwords)

or

newwords = set(words).difference(englishwords)

are two ways to express "the set of words that are not englishwords". The former is slightly more concise, the latter perhaps a bit more readable (since it uses the word "difference" explicitly, instead of a minus sign) and perhaps a bit more efficient (since it doesn't explicitly transform the list englishwords into a set -- though, if speed is crucial this needs to be checked by measurement, since "internally" difference still needs to do some kind of "transformation-to-set"-like operation).

If you're keen to have a list as the result instead of a set, sorted(newwords) will give you an alphabetically sorted list (list(newwords) would give you a list a bit faster, but in totally arbitrary order, and I suspect you'd rather wait a tiny extra amount of time and get, in return, a nicely alphabetized result;-).

Alex Martelli
Amended the question slightly to reflect this new info.
old Ixfoxleigh
That's exactly what I needed. Thank you, Alex!
old Ixfoxleigh
@tsimotki, you're welcome. Note that with your current reputation you can "upvote" answers you like (whether to your own questions or others) -- indeed, it's really strange (for anybody with sufficient rep) to accept an answer without upvoting it (accepting means it was the most helpful one to solve your problem, not upvoting means you didn't really like it much... unusual combination;-).
Alex Martelli
Sorry. My first introduction to SE was through MO, where it took me lots of effort to even acquire suffrage. =)
old Ixfoxleigh