I assume they mention entropy in the context of building decision trees.
To illustrate, imagine the task of learning to classify first names into male/female groups. That is, given a list of names each labeled with either m or f, we want to learn a model (a decision tree) that fits the data and can be used to predict the gender of a new, unseen first name.
name       gender
-----------------        The question we want to answer is:
Ashley        f          what is the gender of "Amro"? [my name :) !!]
Brian         m
Caroline      f
David         m
Now the first step is deciding which features of the data are relevant to the target class we want to predict. Some example features include: first/last letter, length, number of vowels, whether it ends with a vowel, etc. After feature extraction, our data looks like:
# name      ends-vowel  num-vowels  length  gender
# ------------------------------------------------
Ashley          1           3         6       f
Brian           0           2         5       m
Caroline        1           4         8       f
David           0           2         5       m
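If it helps, here is a small Python sketch of that feature-extraction step (the function name is my own, and I'm counting 'y' as a vowel so the numbers match the table above):

def extract_features(name):
    # 'y' is treated as a vowel here so the counts match the table above
    vowels = set("aeiouy")
    name = name.lower()
    return {
        "ends-vowel": 1 if name[-1] in vowels else 0,
        "num-vowels": sum(1 for ch in name if ch in vowels),
        "length": len(name),
    }

for name, gender in [("Ashley", "f"), ("Brian", "m"), ("Caroline", "f"), ("David", "m")]:
    print(name, extract_features(name), gender)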
The goal is to build a decision tree. An example of a tree would be:
length<7
|   num-vowels<3: male
|   num-vowels>=3
|   |   ends-vowel=1: female
|   |   ends-vowel=0: male
length>=7
|   ends-vowel=1: female
|   ends-vowel=0: male
Basically, each node represents a test performed on a single attribute, and we go left or right depending on the result of the test. We keep traversing the tree until we reach a leaf node, which contains the class prediction (m or f).
So if we run the name Amro down this tree, we start by testing "is length<7?" and the answer is yes, so we go down that branch. Next we test "is the number of vowels<3?", and again the answer is yes. This leads to a leaf node labeled m, and thus the prediction is male (which I am, by the way :)).
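Here is the same illustrative tree hand-coded in Python, just to make the traversal explicit (extract_features is the sketch from above, and the tree itself is hard-coded, not learned):

def extract_features(name):
    vowels = set("aeiouy")          # 'y' counted as a vowel, as in the table
    name = name.lower()
    return {"ends-vowel": 1 if name[-1] in vowels else 0,
            "num-vowels": sum(1 for ch in name if ch in vowels),
            "length": len(name)}

def predict(name):
    f = extract_features(name)
    if f["length"] < 7:                               # test at the root
        if f["num-vowels"] < 3:
            return "m"
        return "f" if f["ends-vowel"] == 1 else "m"
    return "f" if f["ends-vowel"] == 1 else "m"       # the length>=7 branch

print(predict("Amro"))   # -> m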
The tree is built in a top-down fashion, but the question is: how do you choose which attribute to split on at each node? The answer is to find the feature that best splits the target class into the purest possible child nodes (i.e., nodes that don't contain a mix of both male and female, but rather pure nodes with only one class).
This measure of purity is called the information. It represents the expected amount of information that would be needed to specify whether a new instance (a first name) should be classified as male or female, given that it reached that node. We calculate it based on the number of male and female classes at the node.
Entropy, on the other hand, is a measure of impurity (the opposite). It is defined (for a binary class with values a/b) as:
Entropy = - p(a)*log(p(a)) - p(b)*log(p(b))
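In code, a small sketch of the same formula (taking counts per class rather than probabilities; the log is base 2, so entropy comes out in bits):

from math import log2

def entropy(counts):
    # counts = number of instances of each class at the node, e.g. [males, females]
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)  # 0*log(0) treated as 0

print(entropy([5, 5]))    # a 50/50 split is maximally impure -> 1.0 bit
print(entropy([10, 0]))   # a pure node -> 0.0
print(entropy([9, 5]))    # a mixed node -> roughly 0.94 bits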
Let's look at an example. Imagine that at some point during the construction of the tree, we were considering the following split:
     ends-vowel
      [9m,5f]          <--- the [..,..] notation represents the class
    /          \            distribution of instances that reached a node
   =1          =0
 -------     -------
 [3m,4f]     [6m,1f]
Therefore, as you can see, before the split we had 9 males and 5 females, i.e. P(m)=9/14 and P(f)=5/14. According to the definition:
Entropy_before = -5/14*log(5/14) - 9/14*log(9/14) = ...
Now, after the split, in the branch of ends-vowel=1, we have:
Entropy_left = -3/7*log(3/7) - 4/7*log(4/7) = ...
and in the branch of ends-vowel=0, we have:
Entropy_right = -6/7*log(6/7) - 1/7*log(1/7) = ...
We combine the two using the number of instances down each branch as a weighting factor (7 instances went left and 7 went right), and so the entropy after the split is:
Entropy_after = 7/14*Entropy_left + 7/14*Entropy_right = ...
Now by comparing the entropy before and after the split, we obtain a measure of information gain, or how much information we gained by doing the split using that particular feature:
Information_Gain = Entropy_before - Entropy_after = ...
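If you want to actually plug in the numbers for this example, here is a quick sketch (the entropy helper and variable names are mine; log base 2, so the values are in bits):

from math import log2

def entropy(counts):
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)

# class distributions from the split above, as [males, females]
parent, left, right = [9, 5], [3, 4], [6, 1]

entropy_before = entropy(parent)                   # ~0.940
n = sum(parent)
entropy_after = (sum(left) / n) * entropy(left) \
              + (sum(right) / n) * entropy(right)  # ~0.788
information_gain = entropy_before - entropy_after  # ~0.152

print(entropy_before, entropy_after, information_gain)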
This calculation is performed at every node of the tree, for every candidate feature, and the feature with the largest information gain is chosen for the split. The process is then applied recursively to each child node, top-down, until the nodes are pure enough (or some other stopping criterion is met).
Note that I skipped over some details on how to handle numeric features, missing values, and so on.
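To tie it together, here is a rough sketch of the feature-selection step described above (the helper names are mine, and every feature is treated as categorical, which conveniently sidesteps the numeric-feature handling I just mentioned):

from math import log2
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return sum(-c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    # group the labels by the value this feature takes
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    entropy_after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - entropy_after

rows = [{"ends-vowel": 1, "num-vowels": 3, "length": 6},
        {"ends-vowel": 0, "num-vowels": 2, "length": 5},
        {"ends-vowel": 1, "num-vowels": 4, "length": 8},
        {"ends-vowel": 0, "num-vowels": 2, "length": 5}]
labels = ["f", "m", "f", "m"]

# choose the feature with the largest information gain for this node;
# a full tree builder would then recurse on each child's subset of rows
best = max(rows[0], key=lambda feat: information_gain(rows, labels, feat))
print(best)   # ties go to the first feature tried on this tiny dataset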