I am reading this book (NLTK) and it is confusing. Entropy is defined as:

Entropy is the sum of the probability of each label times the log probability of that same label

How can I apply entropy and maximum entropy to text mining? Can someone give me an easy, simple (visual) example?

+46  A: 

I assume they mention entropy in the context of building decision trees.

To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names, each labeled with either m or f, we want to learn a model (a decision tree) that fits the data and can be used to predict the gender of a new, unseen first name.

name       gender
-----------------         The question we want to answer is:
Ashley        f               what is the gender of "Amro"? [my name :) !!]
Brian         m
Caroline      f
David         m

Now the first step is deciding which features of the data are relevant to the target class we want to predict. Some example features include: first/last letter, length, number of vowels, whether it ends with a vowel, etc. So after feature extraction, our data looks like:

# name    ends-vowel  num-vowels   length   gender
# ------------------------------------------------
Ashley        1         3           6        f
Brian         0         2           5        m
Caroline      1         4           8        f
David         0         2           5        m
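To make the feature-extraction step concrete, here is a minimal Python sketch (the function name is made up, and counting 'y' as a vowel is an assumption made so the values match the table above):

# Illustrative sketch only: the function name is hypothetical, and 'y' is
# treated as a vowel so the numbers agree with the table above.
def extract_features(name):
    vowels = set("aeiouy")
    name = name.lower()
    return {
        "ends-vowel": 1 if name[-1] in vowels else 0,
        "num-vowels": sum(1 for ch in name if ch in vowels),
        "length": len(name),
    }

# extract_features("Ashley") -> {'ends-vowel': 1, 'num-vowels': 3, 'length': 6}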

The goal is to build a decision tree. An example of a tree would be:

length<7
|   num-vowels<3: male
|   num-vowels>=3
|   |   ends-vowel=1: female
|   |   ends-vowel=0: male
length>=7: male

Basically, each node represents a test performed on a single attribute, and we go left or right depending on the result of the test. We keep traversing the tree until we reach a leaf node which contains the class prediction (m or f).

So if we run the name Amro down this tree, we start by testing "is the length<7?" and the answer is yes, so we go down that branch. Next we test "is the number of vowels<3?", which is again true. This leads to a leaf node labeled m, and thus the prediction is male (which I am, by the way :)).
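For readers who prefer code, the example tree above can be written as nested conditionals. This is just a sketch (not the original tree-building code); `features` is a dict like the one produced by the hypothetical extract_features() sketch above:

# The example tree above as nested conditionals (illustrative sketch only).
def classify(features):
    if features["length"] < 7:
        if features["num-vowels"] < 3:
            return "m"
        return "f" if features["ends-vowel"] == 1 else "m"
    return "m"  # the length>=7 leaf of the example tree

# "Amro": length=4, num-vowels=2, ends-vowel=1
print(classify({"length": 4, "num-vowels": 2, "ends-vowel": 1}))  # m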

The tree is built in a top-down fashion, but the question is: how do you choose which attribute to split at each node? The answer is to find the feature that best splits the target class into the purest possible child nodes (i.e., nodes that don't contain a mix of both male and female, but rather pure nodes with only one class).

This measure of purity is called the information. It represents the expected amount of information that would be needed to specify whether a new instance (a first name) should be classified as male or female, given that the example reached that node. We calculate it based on the number of male and female instances at the node.

Entropy, on the other hand, is a measure of impurity (the opposite). For a binary class with values a/b, it is defined as:

Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))
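In code this is just a few lines. Here is a minimal Python sketch of that formula, assuming base-2 logarithms (the formula above doesn't fix a base, so measuring in bits is an assumption):

import math

# Binary entropy of a two-class distribution, in bits (base-2 logs assumed).
def entropy(p_a, p_b):
    h = 0.0
    for p in (p_a, p_b):
        if p > 0:            # treat 0*log(0) as 0
            h -= p * math.log2(p)
    return h

print(entropy(0.5, 0.5))   # 1.0 -> maximally impure (50/50 mix)
print(entropy(1.0, 0.0))   # 0.0 -> perfectly pure node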

Let's look at an example. Imagine that at some point, we were considering the following split:

     ends-vowel
      [9m,5f]          <--- these [..,..] represent the class distribution
    /          \            of instances that reached that node
   =1          =0
 -------     -------
 [3m,4f]     [6m,1f]

Therefore, as you can see, before the split we had 9 males and 5 females, i.e. P(m)=9/14 and P(f)=5/14. According to the definition:

Entropy_before = -5/14*log(5/14) - 9/14*log(9/14) = ...

Now, after the split, in the branch of ends-vowel=1 we have:

Entropy_left = -3/7*log(3/7) - 4/7*log(4/7) = ...

and in the branch of ends-vowel=0 we have:

Entropy_right = -6/7*log(6/7) - 1/7*log(1/7) = ...

We combine the two using the number of instances down each branch as a weighting factor (7 instances went left and 7 went right), and therefore we get the entropy after the split:

Entropy_after = 7/14*Entropy_left + 7/14*Entropy_right = ...

Now by comparing the entropy before and after the split, we obtain a measure of information gain, or how much information we gained by doing the split using that particular feature:

Information_Gain = Entropy_before - Entropy_after = ...
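If you want to check the arithmetic, here is a small Python sketch of the whole worked example, again assuming base-2 logarithms (the approximate values in the comments follow from that assumption):

import math

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

entropy_before = entropy([9, 5])   # ~0.940 bits
entropy_left   = entropy([3, 4])   # ~0.985 bits  (ends-vowel=1 branch)
entropy_right  = entropy([6, 1])   # ~0.592 bits  (ends-vowel=0 branch)

# weight each branch by the fraction of instances that reached it
entropy_after = 7/14 * entropy_left + 7/14 * entropy_right   # ~0.788 bits

information_gain = entropy_before - entropy_after            # ~0.152 bits
print(information_gain)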

At each node of the tree, this calculation is done for every feature, and the feature with the largest information gain is chosen for the split. The process is then applied recursively to each child node until the nodes are pure enough (or some other stopping criterion is met).
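As a sketch of that selection step (my own illustration, not the original tree-learning code), here is how the gain of a few candidate tests could be compared on the four-name toy dataset above:

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, test):
    """Gain of splitting (rows, labels) with a boolean test on a row."""
    left  = [lab for row, lab in zip(rows, labels) if test(row)]
    right = [lab for row, lab in zip(rows, labels) if not test(row)]
    after = (len(left) / len(labels)) * entropy(left) \
          + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - after

# the toy dataset from the feature table above
rows = [
    {"ends-vowel": 1, "num-vowels": 3, "length": 6},   # Ashley
    {"ends-vowel": 0, "num-vowels": 2, "length": 5},   # Brian
    {"ends-vowel": 1, "num-vowels": 4, "length": 8},   # Caroline
    {"ends-vowel": 0, "num-vowels": 2, "length": 5},   # David
]
labels = ["f", "m", "f", "m"]

# candidate tests at the root; the one with the largest gain is chosen
candidates = {
    "ends-vowel=1": lambda r: r["ends-vowel"] == 1,
    "num-vowels<3": lambda r: r["num-vowels"] < 3,
    "length<7":     lambda r: r["length"] < 7,
}
for name, test in candidates.items():
    print(name, information_gain(rows, labels, test))
# ends-vowel=1 and num-vowels<3 both split this toy data perfectly (gain = 1.0 bit)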

Note that I skipped over some details on how to handle numeric features, missing values, etc.

Amro
thanks, great answer.
TIMEX
That is a seriously good and in-depth answer, I'd give you more than a +1 if I could!!
Matt Warren
A: 

When I was implementing an algorithm to calculate the entropy of an image, I found these links (see here and here).

This is the pseudo-code I used; you'll need to adapt it to work with text rather than images, but the principles should be the same.

//Loop over image array elements and count occurrences of each possible
//pixel to pixel difference value. Store these values in prob_array
for j = 0, ysize-1 do $
    for i = 0, xsize-2 do begin
       diff = array(i+1,j) - array(i,j)
       if diff lt (array_size+1)/2 and diff gt -(array_size+1)/2 then begin
            prob_array(diff+(array_size-1)/2) = prob_array(diff+(array_size-1)/2) + 1
       endif
    endfor

//Convert values in prob_array to probabilities and compute entropy
n = total(prob_array)

entrop = 0
for i = 0, array_size-1 do begin
    prob_array(i) = prob_array(i)/n

    //Base 2 log of x is Ln(x)/Ln(2). Take Ln of array element
    //here and divide final sum by Ln(2)
    if prob_array(i) ne 0 then begin
        entrop = entrop - prob_array(i)*alog(prob_array(i))
    endif
endfor

entrop = entrop/alog(2)

I got this code from somewhere, but I can't dig out the link.
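For the text case the original poster asked about, the same idea can be sketched in a few lines of Python (my adaptation, not the code above): estimate a probability for each symbol from its frequency, then sum -p*log2(p).

import math
from collections import Counter

# Shannon entropy (bits per character) of a string, estimated from the
# character frequencies; adapt the "symbols" to words, labels, etc. as needed.
def text_entropy(text):
    counts = Counter(text)
    n = len(text)
    h = 0.0
    for c in counts.values():
        p = c / n
        h -= p * math.log2(p)
    return h

print(text_entropy("aaaa"))          # 0.0   (one symbol, no uncertainty)
print(text_entropy("abab"))          # 1.0   (two equally likely symbols)
print(text_entropy("hello world"))   # ~2.85 bits per character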

Matt Warren
+2  A: 
Beta
+2  A: 

I really recommend you read about information theory, Bayesian methods and MaxEnt. The place to start is this (freely available online) book by David MacKay:

http://www.inference.phy.cam.ac.uk/mackay/itila/

Those inference methods are far more general than just text mining, and I can't really see how one would learn to apply this to NLP without learning some of the general basics contained in this book or in other introductory books on machine learning and MaxEnt Bayesian methods.

The connection between entropy and probability theory and information processing and storage is really deep. To give a taste of it, there's a theorem due to Shannon stating that the maximum amount of information you can pass without error through a noisy communication channel (the channel capacity) is determined by the entropy of the noise process. There's also a theorem that connects how much you can compress a piece of data, so that it occupies the minimum possible memory in your computer, to the entropy of the process that generated the data.
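As a rough illustration of that compression result (a sketch, not a formal statement of the theorem), you can compare the empirical entropy of data drawn from a known source with what an off-the-shelf compressor such as zlib achieves; for roughly i.i.d. data the compressor cannot do appreciably better than the entropy:

import math, random, zlib
from collections import Counter

# Data drawn i.i.d. from a 4-symbol source with entropy 1.75 bits/symbol.
random.seed(0)
symbols, weights = "abcd", [0.5, 0.25, 0.125, 0.125]
data = "".join(random.choices(symbols, weights, k=100_000))

# Empirical entropy estimate (bits per symbol).
n = len(data)
h = -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# Bits per symbol actually used by zlib at maximum compression.
compressed = zlib.compress(data.encode("ascii"), 9)
bits_per_symbol = 8 * len(compressed) / n

print(h)                 # ~1.75 bits/symbol (the theoretical floor)
print(bits_per_symbol)   # a bit above the entropy, never appreciably below it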

I don't think it's really necessary that you learn all those theorems of communication theory, but it's not possible to learn this without learning the basics of what entropy is, how it's calculated, what its relationship with information and inference is, etc.

Rafael S. Calsaverini