Decision Trees
Alice Gao
November 2, 2021
Contents
1 Learning Goals
6 Real-Valued Features
6.1 Jeeves dataset with real-valued temperatures
6.2 Handling a discrete feature
6.3 Handling a real-valued feature
6.4 Choosing a split point for a real-valued feature
7 Over-fitting
8 Practice Problems
1 Learning Goals
By the end of the lecture, you should be able to
• Describe the components of a decision tree.
• Construct a decision tree given an order of testing the features.
• Determine the prediction accuracy of a decision tree on a test set.
• Compute the entropy of a probability distribution.
• Compute the expected information gain for selecting a feature.
• Trace the execution of and implement the ID3 algorithm.
Example: Jeeves is a valet to Bertie Wooster. On some days, Bertie likes to play
tennis and asks Jeeves to lay out his tennis things and book the court. Jeeves would
like to predict whether Bertie will play tennis (and so be a better valet). Each morning
over the last two weeks, Jeeves has recorded whether Bertie played tennis on that day
and various attributes of the weather (training set).
Jeeves would like to evaluate the classifier he has come up with for predicting whether
Bertie will play tennis. Each morning over the next two weeks, Jeeves records the
following data (test set).
Example: Consider tracing a person through an example decision tree for deciding
whether to use emoji in a message. Suppose the person tells us the following:
• I am 30 years old.
• I am an accountant.
Following the tree, we answer no (not under 20 years old), no (not over 65 years old),
yes (work related), no (not working in tech), and no (not trying to get fired). The leaf
we reach is a thumbs down, meaning we should not use emoji.
Problem: Construct a (full) decision tree for the Jeeves data set using the following
order of testing features.
Solution: Here is the process for generating the decision tree using the given testing order.
We have 9 positive and 5 negative examples. They are not in the same class, so we
will have to choose a feature to test.
Based on the given order, we will test Outlook first. Outlook has three values: Sunny,
Overcast, and Rain. We split the examples into three branches.
The three sets look like this. Example 1 has Outlook equal to Sunny, so it goes into
the left branch. Example 3 has Outlook equal to Overcast, so it goes into the middle
branch, etc.
In the middle branch, all the examples are positive. There is no need to test another
feature, and so we may make a decision. We create a leaf node with the label Yes, and
we are done with this branch.
Looking at the left branch next, there are two positive and three negative examples.
We have to test another feature. Based on the given order, we will test Temp next.
Temp has three values — Hot, Mild, and Cool. We create three branches again. The
five examples are split between these branches.
We will repeat this process at every node. First, check if all the examples are in the
same class. If they are, create a leaf node with the class label and stop. Otherwise,
choose the next feature to test and split the examples based on the chosen feature.
The above is the final decision tree. Each internal node represents a test of a feature;
not all of these tests are Boolean. Beside each node, the training examples are partitioned
into two classes based on whether Bertie played tennis that day: Yes (+) and No (−).
Problem: I took our training set and added a few examples. It now has 17 instead of
14 examples. For this modified training set, let’s construct one branch of the decision
tree where Outlook is Sunny, Temperature is Mild, and Humidity is High.
Solution: After following the branch where Humidity is High, what should we do next? We
have tested three of the four input features. To continue, testing Wind is the only option. When
Wind is Strong, we have one positive example, and the decision is Yes. When Wind is
Weak, we have three examples: one positive and two negative examples. The examples
are mixed, so we cannot make a decision. However, we have tested all four features, so
there are no more features to test. What should we do in this case?
Let’s take another look at our data set. There are three examples when Wind is Weak.
Note that, for these three examples, the values of all input features are the same
(Sunny, Mild, High, and Weak), but they have different labels: No for examples 8 and 15
and Yes for example 16. This is an example of a noisy data set. With noisy data, even if we know the
values of all the input features, we are still unable to make a deterministic decision.
One reason for having a noisy data set is that the decision may be influenced by
some features that we do not observe. Perhaps, another factor not related to weather
influences Bertie’s decision, but we don’t know what that factor is.
What should we do when we run out of features to test? There are a few options.
One option is to predict the majority class. In this case, the majority decision is No.
Another option is to make a randomized decision. We can think of the three examples
as a probability distribution. There is a 1/3 probability of predicting Yes and a 2/3
probability of predicting No. We will make the decision based on a random draw from
the distribution. For this example, let’s use the majority decision.
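To make these two options concrete, here is a tiny sketch (my own illustration, not from the lecture), assuming the labels of the leftover examples at the node are collected in a list:

```python
import random
from collections import Counter

def majority_decision(labels):
    """Predict the most common label among the examples at this node."""
    return Counter(labels).most_common(1)[0][0]

def randomized_decision(labels):
    """Predict a label drawn from the empirical distribution of the examples."""
    return random.choice(labels)

leftover = ["No", "No", "Yes"]        # the three noisy examples at this node
print(majority_decision(leftover))    # -> "No"
print(randomized_decision(leftover))  # -> "Yes" with probability 1/3, "No" with probability 2/3
```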
Problem: Let’s consider another modified Jeeves training set. Complete one branch
of the decision tree where Temperature is Hot, Wind is Weak, and Humidity is High.
Solution: After testing the three features, we have three examples left: one positive
and two negative. The only feature left is Outlook. Outlook has three values. Let’s
split up the examples into three branches.
If Outlook is Sunny, we have two negative examples, and the decision is No. If Outlook
is Overcast, we have one positive example, and the decision is Yes. Finally, if Outlook
is Rain, there are no examples left. What should we do here?
Before we decide on what to do, let’s ask a related question. Why did we encounter
this case? Let’s take another look at the modified data set. The case we are looking
for is: Temperature is Hot, Wind is Weak, Humidity is High, and Outlook is Rain.
After going through the data set, you will realize that no example in this data set has
this combination of input feature values. This is the reason that we had no examples
left. If we never observe a combination of feature values, we don’t know how to predict
it.
Now that we understand why this happened, how should we handle this case? One
idea is to try to find some examples that are close to this case. If we go up to the
parent node, the parent has some examples for different values of Outlook. Arguably,
these are the closest examples to our case. Therefore, we can use the examples at the
parent node to make a decision. At the parent node, the examples are likely to be
mixed. Most likely, we cannot make a deterministic decision. Similar to the previous
case, we can either make the majority decision or decide based on a random draw from
a probability distribution.
Here is the pseudocode for the algorithm. Since a tree is recursive, we will naturally use a
recursive algorithm to build it.
The algorithm starts with three base cases.
1. If all the examples are in the same class, we will return the class label.
2. If there are no features left, we have noisy data. We can either return the majority
decision or make a decision probabilistically.
3. If there are no examples left, then some combination of input features is absent in the
training set. We can use the examples at the parent node to make a decision. If the
examples are not in the same class, we can either return the majority decision or make
a decision probabilistically.
Next, we have the recursive part. Suppose that we have a pre-specified order of testing the
input features. At each step, we will split up the examples based on the chosen feature’s
values. We will label the edges with the feature’s value. Each subtree only has examples
where the value of the feature corresponds to the value on the edge.
There’s one crucial step left. So far, we have assumed that a pre-defined order of testing the
input features. Where does this order come from? In practice, we have to choose this order
ourselves.
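The pseudocode above translates almost directly into a recursive function. Below is a minimal sketch in Python, not the lecture's official implementation: it assumes each example is a dictionary with a "label" key, `order` is the pre-specified list of feature names, and `domains` maps each feature to its set of possible values. A leaf is represented as a bare label and an internal node as a (feature, subtrees) pair.

```python
from collections import Counter

def majority_label(examples):
    """Return the most common label among the examples."""
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def learn_tree(examples, order, domains, parent_examples=None):
    """order: feature names to test, in the given order.
    domains: dict mapping each feature name to its set of possible values."""
    # Base case 3: no examples left -- decide using the parent's examples.
    if not examples:
        return majority_label(parent_examples)
    labels = {e["label"] for e in examples}
    # Base case 1: all examples are in the same class -- return that label.
    if len(labels) == 1:
        return labels.pop()
    # Base case 2: no features left -- noisy data, return the majority decision.
    if not order:
        return majority_label(examples)
    # Recursive case: split on the next feature, one subtree per feature value.
    feature, rest = order[0], order[1:]
    subtrees = {}
    for value in domains[feature]:
        subset = [e for e in examples if e[feature] == value]
        subtrees[value] = learn_tree(subset, rest, domains, examples)
    return (feature, subtrees)

def predict(tree, example):
    """Walk the tree: internal nodes are (feature, subtrees) pairs, leaves are labels."""
    while isinstance(tree, tuple):
        feature, subtrees = tree
        tree = subtrees[example[feature]]
    return tree
```

Replacing the fixed testing order with a per-node choice of the feature with the greatest expected information gain (discussed next) turns this sketch into the ID3 algorithm.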
Problem: What is the entropy of the uniform binary distribution ⟨0.5, 0.5⟩?
(A) 0.2
(B) 0.4
(C) 0.6
(D) 0.8
(E) 1
You might be wondering, is one bit of entropy a lot or very little? You will get a better
idea of this after question 2. For now, believe me when I say that there is a lot of
uncertainty in this binary distribution.
This should make intuitive sense. This distribution is uniform. A uniform distribution
has a lot of uncertainty since every outcome is equally likely to become true.
Problem: Next, consider a very skewed binary distribution, where almost all of the
probability is on the second outcome. What is its entropy?
(A) 0.02
(B) 0.04
(C) 0.06
(D) 0.08
(E) 0.1
Compared to the previous distribution, this distribution has very little uncertainty.
Again, this result makes intuitive sense. Looking at the probabilities, we almost know
for sure that the second outcome will become true. This means that we don’t really
have that much uncertainty. If you are making a bet, you will probably bet on the
second outcome and you will likely win the bet.
What have we learned about entropy from these two questions? Let me summarize some
important points.
Solution: For any binary distribution ⟨p, 1 − p⟩, the entropy achieves its maximum value of
one when the distribution is uniform, and achieves its minimum value of zero when
the distribution is a point mass, i.e., one outcome has a probability of 1.
When p is 0 or 1, calculating the entropy is a bit tricky, since log2(0) is undefined. In
this case, let's define the term 0 · log2(0) to be 0. This definition is reasonable since the limit of
x · log2(x) is 0 as x approaches 0.
Finally, here is the plot of the distribution’s entropy with respect to p. As p increases
from 0 to 1, the entropy increases first, reaches the maximum value when p is 0.5, and
then decreases after that.
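As a quick sanity check of these properties, here is a small sketch (my own, not from the notes) of the entropy of a binary distribution ⟨p, 1 − p⟩, with the 0 · log2(0) = 0 convention handled explicitly:

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of the distribution <p, 1 - p>, with 0*log2(0) defined as 0."""
    total = 0.0
    for q in (p, 1 - p):
        if q > 0:                      # skip the 0*log2(0) term
            total -= q * math.log2(q)
    return total

print(binary_entropy(0.5))   # 1.0   -- uniform distribution, maximum uncertainty
print(binary_entropy(0.01))  # ~0.08 -- heavily skewed, very little uncertainty
print(binary_entropy(1.0))   # 0.0   -- point mass, no uncertainty
```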
Here is a practice question for you. I've summarized some properties of entropy for a binary
distribution. What about a distribution with more than two outcomes? For example, what
are the maximum and minimum entropy values for a distribution with three outcomes?
Recall that the information content of a feature is the entropy of the examples before testing
the feature minus the entropy of the examples after testing the feature. Formally, we call this
the expected information gain of testing this feature.
Let me write down the formula for calculating the expected information gain.
Before testing the feature, we have p positive and n negative examples. Let’s convert this
to a binary distribution: Yes has a probability of p / (p + n), and No has a probability of
n / (p + n). We can calculate the entropy of this distribution using the formula we saw.
\[
I_{\text{before}} = I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) \tag{2}
\]
After testing the feature, we have k distributions, one for each of the k feature values. We
have a problem here: for a new example, we don't know in advance which of the k distributions
it will fall into. How do we calculate the entropy of k distributions? The solution is to calculate the expected
entropy of these distributions. We will first calculate the entropy of each distribution. Then,
we will take the expected value of the k entropies. In the expected value, the weight of each
distribution is the fraction of examples in that distribution.
For example, for the feature value $v_i$, there are $p_i$ positive examples and $n_i$ negative examples.
The entropy is given by $I(p_i/(p_i + n_i), n_i/(p_i + n_i))$. The fraction of examples in this
distribution is $(p_i + n_i)/(p + n)$; this is the weight of the $i$-th entropy in the expected value.
The expected entropy after testing the feature is
\[
EI_{\text{after}} = \sum_{i=1}^{k} \frac{p_i + n_i}{p + n} \cdot I\!\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right) \tag{3}
\]
Finally, the expected information gain or entropy reduction is the entropy before testing the
feature minus the expected entropy after testing the feature.
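Putting equations (2) and (3) together, here is a small sketch of the calculation (my own illustration, which assumes the split is represented as a list of (p_i, n_i) pairs, one per feature value):

```python
import math

def entropy(p, n):
    """I(p/(p+n), n/(p+n)): entropy of p positive and n negative examples."""
    total = 0.0
    for count in (p, n):
        if count > 0:                       # 0*log2(0) is taken to be 0
            fraction = count / (p + n)
            total -= fraction * math.log2(fraction)
    return total

def information_gain(split):
    """Expected information gain of a split, given as a list of (p_i, n_i) pairs."""
    p = sum(pi for pi, ni in split)
    n = sum(ni for pi, ni in split)
    before = entropy(p, n)
    after = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in split)
    return before - after

# A hypothetical split of 4 positive and 4 negative examples into (3+, 1-) and (1+, 3-):
print(round(information_gain([(3, 1), (1, 3)]), 3))  # -> 0.189
```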
Problem: What is the entropy of the examples before we select a feature for the
root node of the tree?
(A) 0.54
(B) 0.64
(C) 0.74
(D) 0.84
(E) 0.94
Solution: There are 14 examples: 9 positive, 5 negative. Applying the formula gives
\[
\begin{aligned}
H_{\text{before}} &= -\left(\frac{9}{14}\log_2\frac{9}{14} + \frac{5}{14}\log_2\frac{5}{14}\right) \\
&= -\left(\frac{9}{14}(-0.637) + \frac{5}{14}(-1.485)\right) \\
&= -(-0.939) \\
&\approx 0.94.
\end{aligned}
\]
(In an exam, you could calculate $\log_2 x = \frac{\ln x}{\ln 2}$.)
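As a quick check, the hypothetical entropy helper sketched earlier gives the same number:

```python
print(round(entropy(9, 5), 2))  # entropy of 9 positive and 5 negative examples -> 0.94
```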
Problem: What is the expected information gain if we select Outlook as the root
node of the tree?
(A) 0.237
(B) 0.247
(C) 0.257
(D) 0.267
(E) 0.277
Problem: What is the expected information gain if we select Humidity as the root
node of the tree?
(A) 0.151
(B) 0.251
(C) 0.351
(D) 0.451
(E) 0.551
Comparing Outlook and Humidity, the expected information gain of testing Outlook first is
greater than that of testing Humidity first.
You may also notice that the expected information gain at a node is always computed with
respect to the same entropy before testing at that node. Therefore, when comparing features
to test at that node, we could simply ignore the entropy before testing and select the feature
with the lowest expected entropy after testing.
Example: Here is the continuation of the example. The calculations below have been
condensed, but you should verify them yourself.
For the root node, Outlook has the greatest expected information gain, so we test Outlook
at the root. For the Sunny branch, we can then choose from the three remaining features
(Temp, Humidity, and Wind) to test:
\[
\begin{aligned}
\text{Gain(Temp)} &= I\!\left(\frac{2}{5}, \frac{3}{5}\right) - \left[\frac{2}{5} \cdot I\!\left(\frac{0}{2}, \frac{2}{2}\right) + \frac{2}{5} \cdot I\!\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{1}{5} \cdot I\!\left(\frac{1}{1}, \frac{0}{1}\right)\right] \\
&= 0.97 - \left[\frac{2}{5}(0) + \frac{2}{5}(1) + \frac{1}{5}(0)\right] \\
&= 0.97 - 0.4 \\
&= 0.57
\end{aligned}
\]
\[
\text{Humidity} =
\begin{cases}
\text{High:} & +:\ \text{(none)} \qquad -:\ 1, 2, 8 \\
\text{Normal:} & +:\ 9, 11 \qquad -:\ \text{(none)}
\end{cases}
\]
\[
\begin{aligned}
\text{Gain(Humidity)} &= I\!\left(\frac{2}{5}, \frac{3}{5}\right) - \left[\frac{3}{5} \cdot I\!\left(\frac{0}{3}, \frac{3}{3}\right) + \frac{2}{5} \cdot I\!\left(\frac{2}{2}, \frac{0}{2}\right)\right] \\
&= 0.97 - \left[\frac{3}{5}(0) + \frac{2}{5}(0)\right] \\
&= 0.97 - 0 \\
&= 0.97
\end{aligned}
\]
\[
\text{Wind} =
\begin{cases}
\text{Weak:} & +:\ 9 \qquad -:\ 1, 8 \\
\text{Strong:} & +:\ 11 \qquad -:\ 2
\end{cases}
\]
\[
\begin{aligned}
\text{Gain(Wind)} &= I\!\left(\frac{2}{5}, \frac{3}{5}\right) - \left[\frac{3}{5} \cdot I\!\left(\frac{1}{3}, \frac{2}{3}\right) + \frac{2}{5} \cdot I\!\left(\frac{1}{2}, \frac{1}{2}\right)\right] \\
&= 0.97 - \left[\frac{3}{5}(0.918) + \frac{2}{5}(1)\right] \\
&= 0.97 - 0.951 \\
&= 0.019
\end{aligned}
\]
We pick Humidity since it has the greatest expected information gain. This makes
sense since the positive and negative examples are completely separated after testing
Humidity.
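These gains can be double-checked with the hypothetical information_gain helper sketched earlier, reading the (p_i, n_i) counts off the partitions above:

```python
# Sunny branch: 2 positive and 3 negative examples in total.
print(round(information_gain([(0, 2), (1, 1), (1, 0)]), 3))  # Temp: Hot, Mild, Cool -> 0.571
print(round(information_gain([(0, 3), (2, 0)]), 3))          # Humidity: High, Normal -> 0.971
print(round(information_gain([(1, 2), (1, 1)]), 3))          # Wind: Weak, Strong -> 0.02
                                                             # (the 0.019 above used rounded intermediates)
```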
Next, consider the Rain branch of the root node, where we can again choose from Temp,
Humidity, and Wind:
\[
\begin{aligned}
\text{Gain(Temp)} &= I\!\left(\frac{3}{5}, \frac{2}{5}\right) - \left[\frac{3}{5} \cdot I\!\left(\frac{2}{3}, \frac{1}{3}\right) + \frac{2}{5} \cdot I\!\left(\frac{1}{2}, \frac{1}{2}\right)\right] \\
&= 0.97 - \left[\frac{3}{5}(0.918) + \frac{2}{5}(1)\right] \\
&= 0.97 - 0.951 \\
&= 0.019
\end{aligned}
\]
\[
\text{Humidity} =
\begin{cases}
\text{High:} & +:\ 4 \qquad -:\ 14 \\
\text{Normal:} & +:\ 5, 10 \qquad -:\ 6
\end{cases}
\]
\[
\begin{aligned}
\text{Gain(Humidity)} &= I\!\left(\frac{3}{5}, \frac{2}{5}\right) - \left[\frac{2}{5} \cdot I\!\left(\frac{1}{2}, \frac{1}{2}\right) + \frac{3}{5} \cdot I\!\left(\frac{2}{3}, \frac{1}{3}\right)\right] \\
&= 0.97 - \left[\frac{2}{5}(1) + \frac{3}{5}(0.918)\right] \\
&= 0.97 - 0.951 \\
&= 0.019
\end{aligned}
\]
\[
\text{Wind} =
\begin{cases}
\text{Weak:} & +:\ 4, 5, 10 \qquad -:\ \text{(none)} \\
\text{Strong:} & +:\ \text{(none)} \qquad -:\ 6, 14
\end{cases}
\]
\[
\begin{aligned}
\text{Gain(Wind)} &= I\!\left(\frac{3}{5}, \frac{2}{5}\right) - \left[\frac{3}{5} \cdot I\!\left(\frac{3}{3}, \frac{0}{3}\right) + \frac{2}{5} \cdot I\!\left(\frac{0}{2}, \frac{2}{2}\right)\right] \\
&= 0.97 - \left[\frac{3}{5}(0) + \frac{2}{5}(0)\right] \\
&= 0.97 - 0 \\
&= 0.97
\end{aligned}
\]
We pick Wind since it has the greatest expected information gain. This makes sense
since the positive and negative examples are completely separated after testing Wind.
[Figure: the final decision tree, with Outlook at the root. The Sunny branch tests Humidity, the Overcast branch is a Yes leaf, and the Rain branch tests Wind.]
Recall that we wanted a shallow, small tree, and that’s exactly what we got.
6 Real-Valued Features
6.1 Jeeves dataset with real-valued temperatures
So far, the decision tree learner algorithm only works when all of the features have discrete
values. In the real world, we will encounter many data sets where features have continuous,
real-numbered values. Let's see how decision trees can handle them.
Instead of treating Temperature as a discrete feature with three values (Mild, Hot, and
Cool), we will use actual numbers for Temperature.
For convenience, we will reorder the data set based on the value of Temperature.
Example: With the modified Jeeves data set from before, there are 11 possible split
points using the naïve method.
7. Determine the expected information gain for each possible split point and choose the
split point with the largest gain.
Example: Using the modified Jeeves data set from before, consider the midway point
of X = 23.9 and Y = 26.6. Then LX = {Yes} and LY = {No}. Taking Yes = a ∈ LX
and No = b ∈ LY, we have a ≠ b, so (X + Y)/2 = 25.25 is a possible split point.
Problem: For the Jeeves training set, is the midpoint between 20.0 and 20.6 a
possible split point?
Solution: This is not a possible split point: L20.0 = {Yes} and L20.6 = {Yes}.
Problem: For the Jeeves training set, is the midpoint between 21.1 and 21.7 a
possible split point?
Solution: This is a possible split point: L21.1 = {Yes} and L21.7 = {No}.
Problem: For the Jeeves training set, is the midpoint between 21.7 and 22.2 a
possible split point?
Solution: This is a possible split point: L21.7 = {No} and L22.2 = {No, Yes}.
Intuitively, you can understand this procedure as considering local changes at each midway
point. If the target feature doesn’t change locally, that point probably isn’t a valuable split
point.
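Here is one way this procedure might be sketched (my own illustration, assuming the data for one feature arrive as (value, label) pairs; the example list below is a hypothetical subset of the temperatures discussed above):

```python
from collections import defaultdict

def candidate_split_points(examples):
    """examples: (value, label) pairs for one real-valued feature.
    Returns the midpoints between adjacent distinct values that are possible split points."""
    # Group the labels observed at each distinct feature value.
    labels_by_value = defaultdict(set)
    for value, label in examples:
        labels_by_value[value].add(label)
    values = sorted(labels_by_value)
    splits = []
    for x, y in zip(values, values[1:]):
        # The midpoint is possible if some label at x differs from some label at y.
        if len(labels_by_value[x] | labels_by_value[y]) > 1:
            splits.append(round((x + y) / 2, 2))
    return splits

data = [(20.0, "Yes"), (20.6, "Yes"), (21.1, "Yes"), (21.7, "No"), (22.2, "No"), (22.2, "Yes")]
print(candidate_split_points(data))  # -> [21.4, 21.95]
```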
7 Over-fitting
7.1 Corrupted data in the Jeeves dataset
Over-fitting is a common problem when we're learning a decision tree. As a model for
supervised machine learning, a decision tree has several nice properties. Decision trees are
simple: they are easy to understand and easy to interpret.
When it’s important to explain a model to another human being, a decision tree is a good
choice. In contrast, a neural network is often complex and difficult or impossible to interpret.
Also, decision trees can produce a reasonable model even if you have a tiny data set. In
contrast, a neural network often does not work at all with a small data set, and it is extremely
prone to over-fitting.
However, even though decision trees have these nice properties, over-fitting is still a
common problem when we're constructing a decision tree.
Example: The Jeeves data set has only 14 data points. Even though this is an extremely
small data set, we can still learn a reasonable decision tree from it.
[Figure: the decision tree learned from the original training set, with Outlook at the root.]
Example: Suppose that you sent the training set and the test set to your friend.
For some reason, the data set gets corrupted during the transmission. Your friend gets
a corrupted training set where only one data point is different: the label of the third
data point changed from Yes to No.
This is a tiny change to the training set, but how would this change affect the decision
tree that we generate?
[Figure: the decision tree learned from the corrupted training set; the middle (Overcast) branch now contains an entire subtree instead of a single Yes leaf.]
We grew an entire middle sub-tree to accommodate one corrupted data point, and
this small corruption caused a dramatic change to the tree. This new tree is more
complicated and it likely won’t generalize well to unseen data.
The decision tree learner algorithm is a perfectionist. The algorithm will keep growing
the tree until it perfectly classifies all the examples in the training set. However, this
is not necessarily a desirable behavior and this can easily lead to over-fitting.
3. Minimum information gain: We can decide not to split the examples if the benefit of
splitting at that node is not large enough. We can measure the benefit by calculating
the expected information gain. In other words, do not split examples if the expected
information gain is less than the threshold.
4. Reduction in training error: We can decide not to split the examples at a node if the
reduction in training error is less than a predefined threshold value.
For pre-pruning, we will split the examples at a node only when it’s useful. We can use
pre-pruning with any of the criteria above.
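As a sketch of how the minimum-information-gain criterion might plug into the learner (my own illustration, reusing the hypothetical information_gain helper from earlier; the 0.05 threshold is an arbitrary choice): before splitting at a node, compare the best achievable gain against the threshold and turn the node into a majority leaf if it falls short.

```python
def should_split(split, threshold=0.05):
    """Pre-pruning with a minimum information gain criterion.
    split: the (p_i, n_i) counts produced by the best candidate feature at this node."""
    return information_gain(split) >= threshold

# At a node with 1 positive and 2 negative examples, splitting into (1+, 1-) and (0+, 1-):
print(should_split([(1, 1), (0, 1)]))  # gain ~0.25, so we would still split
```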
Post-pruning: grow a full tree first and then trim it afterwards.
Post-pruning is particularly useful when no individual feature is informative on its own,
but multiple features working together are very informative.
Example: Consider a data set with two input features, both of which are binary.
The target feature is the XOR of the two input features. That
means the target feature is true if and only if the two input features have different
values.
For this data set, each input feature alone tells us nothing about the value of the target
feature. However, if you know both input features then you know the target feature
for certain.
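Reusing the hypothetical information_gain helper from earlier, a quick check of this claim on the four XOR rows:

```python
# XOR data set: (A, B, label); the label is Yes exactly when A != B.
xor_rows = [(0, 0, "No"), (0, 1, "Yes"), (1, 0, "Yes"), (1, 1, "No")]

# Splitting on A alone: A=0 gives (1+, 1-) and A=1 gives (1+, 1-).
print(information_gain([(1, 1), (1, 1)]))  # -> 0.0, A alone is uninformative
# Within the A=0 branch, splitting on B: B=0 gives (0+, 1-), B=1 gives (1+, 0-).
print(information_gain([(0, 1), (1, 0)]))  # -> 1.0, the second feature separates the classes
```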
• With pre-pruning,
We will test neither of the two input features, since testing either feature alone gives
us zero information. So we'll end up with a tree that has only the root node, and the
tree will predict the target feature randomly.
• With post-pruning,
We will end up growing the full tree first, which means testing both input features
in some order. Afterwards, we will try to see if we can prune any of the nodes.
It turns out it's not a good idea to prune any node, because the second input
feature in the order gives us all the information that we need.
So we will end up growing a full tree, and the full tree will predict the target
feature perfectly.
Post-pruning helps us recognize the cases where multiple features working together are
very informative, even though none of the features is very useful individually. We can apply
post-pruning with any of the criteria we have described above. Note that we will only decide
to post-prune a node if it has only leaf nodes as its descendants.
First of all, we will restrict our attention to nodes that have only leaf nodes as their
descendants. At such a node, if the expected information gain is less than a
predefined threshold value, we will delete the node's children (which are all leaf nodes)
and then convert the node itself into a leaf node.
There must be examples with different labels at this node (possibly both positive and
negative examples), so we can make a majority decision at this node.
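To make this concrete, here is a rough post-pruning sketch (my own, not the lecture's implementation), reusing the hypothetical tree representation, information_gain, and majority_label helpers from the earlier sketches and assuming Yes/No labels. It prunes bottom-up and only considers collapsing a node once all of its children are leaves.

```python
def post_prune(tree, examples, threshold=0.05):
    """Collapse a node whose children are all leaves when the expected
    information gain of its split is below the threshold."""
    if not isinstance(tree, tuple):              # a leaf: nothing to prune
        return tree
    feature, subtrees = tree
    # Prune the subtrees bottom-up first.
    pruned = {}
    for value, subtree in subtrees.items():
        subset = [e for e in examples if e[feature] == value]
        pruned[value] = post_prune(subtree, subset, threshold)
    # Only consider pruning a node whose children are now all leaves.
    if all(not isinstance(child, tuple) for child in pruned.values()):
        split = []
        for value in pruned:
            subset = [e for e in examples if e[feature] == value]
            p = sum(1 for e in subset if e["label"] == "Yes")   # assumes Yes/No labels
            split.append((p, len(subset) - p))
        if information_gain(split) < threshold:
            return majority_label(examples)      # replace the node with a majority leaf
    return (feature, pruned)
```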
8 Practice Problems
1. When learning the tree, we chose a feature to test at each step by maximizing the ex-
pected information gain. Does this approach allow us to generate the optimal decision
tree? Why or why not?
2. Consider a data set with real-valued features. Suppose that we perform only binary
splits when building a decision tree.
Is it possible to encounter the “no features left” base case? Why?
Is it possible to encounter the “no examples left” base case? Why?
3. What is the main advantage of post-pruning over pre-pruning?