This gives me the frequency of words in a text:

 import re
 import operator
 from collections import defaultdict

 fullWords = re.findall(r'\w+', allText)
 d = defaultdict(int)
 for word in fullWords:
     d[word] += 1
 finalFreq = sorted(d.iteritems(), key=operator.itemgetter(1), reverse=True)
 self.response.out.write(finalFreq)

This also counts useless words like "the", "an", "a".

My question is: is there a stop-words library available in Python that can remove these common words? I want to run this on Google App Engine.

A: 
stopwords = set(['an', 'a', 'the'])  # etc...
finalFreq = sorted(((k, v) for k, v in d.iteritems() if k not in stopwords),
                   key=operator.itemgetter(1), reverse=True)

This will filter out any keys which are in the stopwords set.
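The filtering step above can be sketched end-to-end as follows. This is a minimal Python 3 adaptation (`items()` instead of the Python 2 `iteritems()`); the sample text and stopword set are made up for illustration:

```python
import re
from collections import defaultdict

# Hypothetical input text and a tiny stopword set for demonstration.
all_text = "the cat sat on the mat with a cat"
stopwords = set(['an', 'a', 'the', 'on', 'with'])

# Count every word, stopwords included.
d = defaultdict(int)
for word in re.findall(r'\w+', all_text):
    d[word] += 1

# Filter stopwords out while building the sorted (word, count) list.
final_freq = sorted(((k, v) for k, v in d.items() if k not in stopwords),
                    key=lambda kv: kv[1], reverse=True)
print(final_freq)  # [('cat', 2), ('sat', 1), ('mat', 1)]
```

Note that the generator expression must be parenthesized when `sorted()` is also given keyword arguments.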

Amber
See my comment on DavidZ's answer, yours has the same problem.
John Machin
It's not really a problem - performance wise, you're trading a set lookup for each resultant key for a set lookup for each word your regex matches. Which is more efficient will depend on the parameters of the problem set. You're already iterating over the set of result keys to output, anyways, so the generator expression for filtering doesn't involve much additional overhead - there's no extra lists being created, and the dict isn't being modified (so you're not actually "ripping them out"; just filtering them so that they never make it into the sorted list).
Amber
A: 

There's an easy way to handle this by slightly modifying the code you have (edited to reflect John's comment):

stopWords = set(['a', 'an', 'the', ...])
fullWords = re.findall(r'\w+', allText)
d = defaultdict(int)
for word in fullWords:
    if word not in stopWords:
        d[word] += 1
finalFreq = sorted(d.iteritems(), key=lambda t: t[1], reverse=True)
self.response.out.write(finalFreq)

This approach filters out any words in your desired list of "stop words" during the counting loop itself (the list has been converted to a set for efficient membership tests), then sorts the remaining entries by frequency.
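As an aside, on Python 2.7+ the count-and-sort pair can also be condensed with the standard library's `collections.Counter`, whose `most_common()` returns the entries already sorted by count. A small sketch with made-up input (Python 3 syntax):

```python
import re
from collections import Counter

# Hypothetical stopword set and text, for illustration only.
stop_words = set(['a', 'an', 'the', 'on'])
all_text = "the quick fox and the lazy dog on a log"

# Filter stopwords before counting; Counter tallies the rest.
words = (w for w in re.findall(r'\w+', all_text) if w not in stop_words)
final_freq = Counter(words).most_common()
print(final_freq)
```

For words with equal counts, `most_common()` preserves first-seen order.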

David Zaslavsky
Ummmm: why insert the stopwords and then rip them out again? Two lines to fix: ` if word not in stopwords: d[word] += 1` followed by a simple `finalFreq = d.items()`
John Machin
@John: I missed that. Although the number of stopwords is by definition limited, so it's not such a big deal.
David Zaslavsky
@DavidZ: re your latest edit: you don't need the `[]` (`sorted()` takes any iterable), and `(k,v) for k,v in d.iteritems()` is just `d.iteritems()`
John Machin
@John: missed that too. I was editing in a bit of a hurry.
David Zaslavsky
+2  A: 

You can download lists of stopwords as files in various formats, e.g. from here -- all Python needs to do is to read the file (and these are in csv format, easily read with the csv module), make a set, and use membership in that set (probably with some normalization, e.g., lowercasing) to exclude words from the count.
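The read-the-file step might look like the following sketch. The CSV contents are inlined via `io.StringIO` here purely for illustration; a real run would `open()` the downloaded file instead:

```python
import csv
import io

# Stand-in for a downloaded stopword file in CSV format (hypothetical contents).
csv_text = "a,an,the\nof,in,on\n"

# Build a set for fast membership tests, lowercasing each entry to normalize.
stopwords = set()
for row in csv.reader(io.StringIO(csv_text)):
    stopwords.update(cell.strip().lower() for cell in row)

print(sorted(stopwords))  # ['a', 'an', 'in', 'of', 'on', 'the']
```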

Alex Martelli
A: 

I know that NLTK has a corpus package that includes stopword lists for many languages, including English; see here for more information. NLTK also has a word frequency counter (`FreqDist`); it's a nice toolkit for natural language processing that you should consider using.
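A hedged sketch of the NLTK route: it requires `pip install nltk` plus a one-time `nltk.download('stopwords')`, so this example falls back to a tiny hand-written stopword set when NLTK or its data is unavailable. The sample text is made up:

```python
import re
from collections import Counter

try:
    # NLTK's bundled stopword corpus (needs nltk.download('stopwords') once).
    from nltk.corpus import stopwords as nltk_stopwords
    stop_set = set(nltk_stopwords.words('english'))
except Exception:
    # Minimal fallback so the sketch still runs without NLTK data.
    stop_set = set(['a', 'an', 'the', 'is', 'on'])

text = "the cat is on the mat"
# Lowercase before the membership test, since the stopword lists are lowercase.
words = [w for w in re.findall(r'\w+', text.lower()) if w not in stop_set]
freq = Counter(words)
print(freq.most_common())
```

NLTK's own `FreqDist` plays the same role as `Counter` here and could be swapped in.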

Tarantula