ansaurus

Question

Python NLTK code snippet to train a classifier (naive bayes) using feature frequency

Answer 1

+1 A:

Hi,

In the link you sent, it says this function is a feature extractor that simply checks whether each of these words is present in a given document.

Here is the whole code with numbers for each line:

1     all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
2     word_features = [w for (w, _) in all_words.most_common(2000)]

3     def document_features(document): 
4          document_words = set(document) 
5          features = {}
6          for word in word_features:
7               features['contains(%s)' % word] = (word in document_words)
8          return features

Line 1 builds a frequency distribution (an nltk.FreqDist) of all lowercased words in the movie_reviews corpus.

Line 2 takes the 2000 most frequent words.

Line 3 is the definition of the function.

Line 4 converts the document (I think it must be a list of words) to a set.

Line 5 declares a dictionary.

Line 6 iterates over all of the 2000 most frequent words.

Line 7 fills the dictionary: the key is 'contains(theword)' and the value is either True or False. It is True if the word is present in the document, False otherwise.

Line 8 returns the dictionary, which shows whether the document contains each of the 2000 most frequent words.
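
To make the walkthrough concrete, here is a small sketch of the same extractor run on a toy document. It avoids the movie_reviews corpus so it runs without any downloads; the four-word feature list and the sample review are made up for illustration and stand in for the real 2000-word list.

```python
# Toy stand-in for the 2000 most frequent corpus words (made up for illustration).
word_features = ["good", "bad", "plot", "boring"]

def document_features(document):
    """Map a tokenized document to presence (True/False) features."""
    document_words = set(document)  # duplicates are discarded here
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

# Hypothetical tokenized review; "good" occurs three times.
doc = ["the", "plot", "was", "good", "good", "good"]
features = document_features(doc)
print(features)
```

Note that "good" occurs three times in the toy review, but the feature only records True: this extractor captures presence, not frequency.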

Does this answer your question?

elif 2010-01-29 22:26:03

Unfortunately not. I understand what it is doing, but I am unsure whether this is the sort of code people mean when they say they used "frequency presence" to train a classifier. Furthermore, if I were to train using the "frequency feature" method, what changes would I have to make?

Rahul 2010-01-30 09:19:05

To me it sounds plausible to call this frequency/presence, because you find the most frequent words in the corpus and then check whether each one is present in the document. Can you give references where it was called "frequency presence"? The term is not used in the page you sent.

elif 2010-02-01 15:42:56

Actually, in the page you sent, the actual training happens in this code:

     featuresets = [(document_features(d), c) for (d,c) in documents]
     train_set, test_set = featuresets[100:], featuresets[:100]
     classifier = nltk.NaiveBayesClassifier.train(train_set)

The document_features function preprocesses the data: it selects the features to train your classifier (NaiveBayesClassifier) on. See http://en.wikipedia.org/wiki/Data_mining#Pre-processing . When people say they used "frequency presence" to train a classifier, they must mean that they used frequency-based feature selection to preprocess the data.

elif 2010-02-01 15:54:53

What do you mean by the "frequency feature" method? Do you mean the "word frequency"? Do you want to use the word frequency as features to train a classifier?

elif 2010-02-01 15:59:44
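
If the answer is yes, one possible change is to store how often each word occurs instead of whether it occurs. The following is only a sketch under assumptions: the function name document_count_features, the 'count(...)' key format, the cap parameter and the toy word list are all hypothetical and not taken from the linked page. As far as I know, NLTK's NaiveBayesClassifier treats every feature value as a discrete label, so a count of 3 is just a different value from 2 rather than "more" of it; capping the count keeps the number of distinct values small.

```python
from collections import Counter

# Toy stand-in for the 2000 most frequent corpus words (made up for illustration).
word_features = ["good", "bad", "plot", "boring"]

def document_count_features(document, cap=3):
    """Map a tokenized document to capped word-count features."""
    counts = Counter(document)  # word -> number of occurrences
    features = {}
    for word in word_features:
        # Counts above `cap` are merged into one bucket (an assumed design choice).
        features['count(%s)' % word] = min(counts[word], cap)
    return features

# Hypothetical tokenized review; "good" occurs four times.
doc = ["the", "plot", "was", "good", "good", "good", "good"]
count_features = document_count_features(doc)
print(count_features)
```

Dictionaries like this can be paired with labels and used in place of document_features(d) when building the featuresets list quoted earlier in this thread.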

Answer 2

+2 A:

For training, create appropriate FreqDists that you can use to create ProbDists, which can then be passed in to the NaiveBayesClassifier. But the classification actually works on feature sets, which use boolean values, not frequencies. So if you want to classify based on a FreqDist, you'll have to implement your own classifier that does not use the NLTK feature sets.

Jacob 2010-02-09 01:18:35
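
As a rough illustration of what "implement your own classifier" could look like, here is a minimal multinomial naive Bayes sketch in plain Python that uses word counts directly. It is not NLTK code: the class name, the method names and the toy training data are invented for this example, and add-one (Laplace) smoothing is an assumed choice.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Tiny multinomial naive Bayes over word counts (illustrative only)."""

    def train(self, labeled_docs):
        # labeled_docs: list of (list_of_tokens, label) pairs
        self.label_counts = Counter()            # label -> number of documents
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()
        for tokens, label in labeled_docs:
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.label_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word, count in Counter(tokens).items():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                # Add-one smoothed estimate of P(word | label)
                p = (self.word_counts[label][word] + 1) / (total_words + vocab_size)
                score += count * math.log(p)  # each occurrence contributes
            scores[label] = score
        return scores

    def classify(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Made-up training data for illustration.
train_data = [
    (["good", "good", "plot"], "pos"),
    (["great", "good", "acting"], "pos"),
    (["bad", "boring", "plot"], "neg"),
    (["bad", "bad", "acting"], "neg"),
]
clf = MultinomialNB().train(train_data)
prediction = clf.classify(["good", "good", "acting"])
print(prediction)
```

Here a word that appears twice contributes twice to the score, which is the main difference from the presence features used in the first answer.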
