views:

32

answers:

2

In ID3 implementation, at which point the recursion in Algorithm should stop.

+2  A: 

A branch stops when there are no examples left to classify or there are no attributes left to classify with. The algorithm description on Wikipedia is pretty easy to follow and there's a bunch of links to examples and discussions on there too.

Brabster
+1 equivalently stated, you can stop when all examples at the node have the same target class (no split needed)
Amro
A: 

Well you continue to split (form two new nodes from an extant one) as long as the splitting criterion is satisfied.

The splitting criterion is usually a negative value for the difference between the parent node Information Gain, aka entropy, (or variance if the variable is discrete rather than categorical) and the weighted average IG of the putative child nodes the weighted average Information Gain:

if weighted_mean(IG_child1, IG_child2) < IG_parent :
    createNodes(IG_child1, IG_child2)
else :
    continue 

So this is the trivial answer, but there's likely a more sophisticated intention behind your question, which, if you don't mind, i'll re-word slightly as should you continue to create nodes as long as the splitting criterion is satisfied?

As you might have just found out if you are coding an ID3 algorithm, applying the splitting criterion without constraint will often cause over-fitting (i.e., the tree that you've build from the training data doesn't generalize well because it hasn't distinguished the noise from the genuine patterns).

So this is more likely the answer to your question: the techniques to 'constrain' node splitting (and therefore deal with the over-fitting problem) all fall into one of two categories--top down or bottom up. An example of top down: set some threshold (e.g., if the weighted mean of the child nodes is less than 5% below, then do not split)?

An example of bottom up: pruning. Pruning means to let the algorithm split as long as the splitting criterion is satisfied then after it has stopped, start from the bottom layer of nodes and 'unsplit' any nodes in which the IG difference between the child nodes and the parent is less than some threshold.

These two approaches do not have the same effect, and in fact pruning is the superior technique. The reason: if you enforce a splitting threshold top-down, then of course some splitting will be prevented; however, if it had been allowed to occur, the next split (one or both of the two child nodes into grandchildren) might have been a valid split (i.e., above the threshold) but that split would never occur. Pruning of course accounts for this.

doug