Decision Trees
Assume we are given the following data:
The task at hand is to develop an application that advises whether or not we should wait for a table at
a restaurant. To this end, we are going to assume that the given data represents the true
concept WillWait, and we will further assume that all future data will be described by the same
input attributes, also known as features. A description of these features is provided in Figure-2.
Figure 2: The available input attributes (features) and their brief descriptions.
We can view this problem as playing a 20 questions game, where every question we ask:
• should help us in narrowing down the value of the target WillWait
• depends on the previous questions that we may have already asked
Here, features represent questions and the answer is the specific feature value for a data instance.
What happens when we ask a question?
• Depending upon the number of possible answers the data is split into multiple subsets
• If a subset has the same value for the target concept (e.g., WillWait ) then we have our
answer
• If the subset has different values for the target concept then we need to ask more questions
Figure 3: Possible splits of the data based on different features. The root shows the feature being
tested, and its child nodes show how the data is split. (a) Split using the Alternate? feature, (b)
split using the Patrons? feature.
Figure-3 shows what happens when we ask two different questions. For instance, in the case of
Figure-3a we have chosen to ask whether an alternate restaurant is available nearby. We can see
that this question is not informative with regard to answering our question: no matter what the
answer is, we have a fifty-fifty chance of being correct, which is as good as deciding randomly
(flipping a fair coin).
However, if we ask about the number of patrons in the restaurant (Figure-3b), then for two out of
the three possible answers we can predict the target concept with 100% confidence.
If we keep asking questions until we reach nodes in which we can make predictions with an acceptable
chance of error, or until we run out of questions, we obtain a decision tree built from the training data.
One possible decision tree for the WillWait problem is shown in Figure-4.
1 Decision Trees
A decision tree can be viewed as a function that maps a vector valued input to a single output or
“decision” value. The output/decision value is real-valued or continuous when the tree is used for
regression and is discrete valued in the case of a classification tree.
Figure 4: A decision tree for the WillWait task
• For a given data instance, each internal node of the tree tests a single feature value (later we
will discuss how this can be generalized to utilize multiple features) to select one of its child
nodes.
• This process continues till we reach a leaf node which assigns the final decision value to the
instance.
Predicting the label for a data instance using a decision tree: Given an instance:
x1 = ('Yes', 'No', 'Yes', 'Yes', 'Full', '$$$', 'No', 'No', 'French', '10-30')
and the decision tree in Figure-4, we can predict the label as:
• Start at the root and follow the branch corresponding to the value of the feature that the
root tests (Patrons). Since the given instance has the value 'Full' for this feature, we follow
the corresponding branch and reach the next node, which tests the WaitEstimate feature.
• Repeat this process until a leaf node is reached. For the given instance this results in a decision
of 'No'.
Remark: It should be noted that by changing the order in which the features are tested we
can end up with an entirely different tree, which may be shorter or longer than the one in Figure-4.
Similarly, not all features need to be used in the tree; for example, the tree does not test the
Type or Price features.
Data: Training data with m features and n instances
Result: Induced Decision Tree
Begin with an empty tree;
while stopping criterion is not satisfied do
    select the best feature;
    split the training data on the best feature;
    repeat the last two steps for the child nodes resulting from the split;
end
Algorithm 1: A greedy algorithm for inducing a decision tree from given training data
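To make the pseudocode concrete, here is a minimal Python sketch of the greedy procedure (not part of the original notes). It assumes the training data is given as a list of (feature-dictionary, label) pairs and, purely for illustration, scores candidate features by the weighted misclassification error of the resulting split; better selection criteria are discussed in Section 2.4.

from collections import Counter

def majority(labels):
    """Most common class label in a list."""
    return Counter(labels).most_common(1)[0][0]

def split_error(data, feature):
    """Weighted misclassification error after a multiway split on `feature`."""
    groups = {}
    for x, y in data:
        groups.setdefault(x[feature], []).append(y)
    n = len(data)
    return sum(
        len(ys) / n * (1 - Counter(ys).most_common(1)[0][1] / len(ys))
        for ys in groups.values()
    )

def build_tree(data, features, max_depth=5):
    """Greedy decision-tree induction; data is a list of (feature_dict, label) pairs."""
    labels = [y for _, y in data]
    # Stopping criteria: pure node, no features left, or depth limit reached.
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return majority(labels)                      # leaf node
    best = min(features, key=lambda f: split_error(data, f))
    tree = {"feature": best, "default": majority(labels), "children": {}}
    for value in set(x[best] for x, _ in data):      # multiway split
        subset = [(x, y) for x, y in data if x[best] == value]
        remaining = [f for f in features if f != best]
        tree["children"][value] = build_tree(subset, remaining, max_depth - 1)
    return tree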
2.4 Tree splitting
In Figure-4, each internal node of the decision tree defines a split of the training data based on a
single feature. Splits based on only one feature are also known as univariate splits. For example,
the root node splits the training data into three subsets corresponding to the different values that
the Patrons feature can take, so that the left-most child has all the instances whose feature value
is 'None'. The type of split that a feature defines depends on how many possible values the feature
can take. A binary feature will split the data into two sub-groups, while a feature with k distinct
values will define a k-ary (multiway) split resulting in multiple sub-groups.
Let p_{qk} denote the proportion of instances at node q (containing N_q instances) that belong to class k:

p_{qk} = \frac{1}{N_q} \sum_{i=1}^{N_q} I(y_i = k)    (1)

where I(\cdot) is an indicator function which is 1 when its argument evaluates to true and 0 otherwise.
In the above equation the indicator function is used simply to count the number of instances
belonging to class k. We classify all the instances in q to the majority class, i.e.,
class(q) = \arg\max_k p_{qk}.
At node q we need to select the next best feature which will define the next split. One possible
criterion for feature selection would be to measure the classification error of the resulting split.
Error(q) = \frac{1}{N_q} \sum_{i=1}^{N_q} I(y_i \neq class(q))    (2)
A feature that minimizes the classification error would be considered as the best feature to split
node q.
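As a small illustration (not part of the original notes), the majority class and the misclassification error of Equation (2) can be computed directly from the list of class labels at a node; the 6/6 label split below mirrors the 50% impure node of the Bar example discussed next.

from collections import Counter

def node_error(labels):
    """Majority class and misclassification error (Eq. 2) of a node."""
    majority_class, majority_count = Counter(labels).most_common(1)[0]
    return majority_class, 1 - majority_count / len(labels)

# A node with 6 'Yes' and 6 'No' instances:
print(node_error(['Yes'] * 6 + ['No'] * 6))   # -> ('Yes', 0.5); ties broken arbitrarily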
For example, consider the training data shown in Figure-1, and suppose we decide that Bar is the
feature to use for our first split (the root of the decision tree). This would produce a binary
split resulting in two new nodes, Bar_no = {x1, x2, x4, x5, x8, x11} and
Bar_yes = {x3, x6, x7, x9, x10, x12}. Before splitting the data the misclassification error was 50%,
and for both Bar_no and Bar_yes it is still 50%. So Bar is a bad choice, as it has no effect on the
classification error.
Although classification error can be used to grow a decision tree, it has a shortcoming: it
does not favor pure nodes, i.e., nodes that have instances belonging to only one class. As an example,
assume that we have a binary classification task with 50 instances in each class, represented as
(50, 50). Consider two splits, one which results in (10, 40) and (40, 10) and another that produces
(50, 20) and (0, 30). In both cases the classification error is 20%, but the latter split produces a
pure node, which should be preferred over the former. Therefore, instead of looking at classification
error we use other measures that characterize the purity/impurity of a node, so that our goal is
to prefer splits that produce pure nodes. It should be noted here that the impurity of a node is
measured in terms of the target variable, i.e., the class labels, and not the feature itself.
Figure 5: (a) The entropy for a binary classification task and (b) its comparison to other node
impurity measures. The x-axis in both cases is the probability that an instance belongs to class 1,
and in (b) the entropy function has been scaled down for ease of comparison.
Entropy: The entropy of a node q is defined as

H(q) = - \sum_{k=1}^{K} p_{qk} \log_2 p_{qk}    (3)

where K is the number of classes. Entropy measures the impurity of a given node: if all the instances
belong to one class then the entropy is zero (for this we need to adopt the convention that 0 \log_2 0 = 0).
Similarly, for a uniform distribution of class labels, i.e., when an equal number of instances
belongs to each class, entropy reaches its maximum value. This can be seen for the case of binary
classification in Figure-5a, where the entropy function has its maximum at p = 0.5.
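A short Python helper (an illustrative sketch, not from the notes) that computes the entropy of Equation (3) from a list of class labels:

import math

def entropy(labels):
    """Entropy (Eq. 3) of the class distribution at a node; 0*log2(0) := 0."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy(['Yes'] * 6 + ['No'] * 6))   # maximally impure binary node -> 1.0
print(entropy(['Yes'] * 4))                # pure node -> 0.0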
In order to select the best feature to further split a node we want the maximum reduction in
entropy, i.e., the child nodes should be as pure as possible (or less impure than their parent). If
we split node q based on feature V , that has |V | distinct values (resulting in a |V |-way split) the
reduction in entropy is calculated as:
IG(q, V) = H(q) - \sum_{i=1}^{|V|} \frac{N_i}{N_q} H(i)    (4)
where N_i is the number of instances in the ith child node and H(i) is its entropy calculated using (3).
Equation (4) is known as the information gain, and we select the feature with the maximum information
gain to split node q.
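Continuing the sketch above, the information gain of a candidate split can be computed from the parent labels and the label lists of the child nodes (reusing entropy() from the previous sketch). The hypothetical split below is in the spirit of the Patrons? example of Figure-3b and yields roughly 0.54 bits.

def information_gain(parent_labels, child_label_lists):
    """Information gain (Eq. 4) of a split of the parent node's labels."""
    n = len(parent_labels)
    weighted = sum(len(c) / n * entropy(c) for c in child_label_lists)
    return entropy(parent_labels) - weighted

parent = ['Yes'] * 6 + ['No'] * 6                # 6 Yes / 6 No at the parent node
children = [['No'] * 2,                          # e.g., Patrons = None
            ['Yes'] * 4,                         # e.g., Patrons = Some
            ['Yes'] * 2 + ['No'] * 4]            # e.g., Patrons = Full
print(information_gain(parent, children))        # approx. 0.541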
Gini Index: The Gini index for a node q having N_q instances is defined as:

Gini(q) = \sum_{k=1}^{K} p_{qk} (1 - p_{qk})    (5)
where K is the number of classes. The Gini index is another impurity measure that is extensively used
and is similar to entropy, as can be seen from Figure-5b. It reaches its maximum value when the
instances are equally distributed among all classes, and has a value of zero when all instances belong
to one class. Similar to information gain (4), we can measure the gain in Gini index when splitting
node q on a feature V:
Gain_{Gini}(q, V) = Gini(q) - \sum_{i=1}^{|V|} \frac{N_i}{N_q} Gini(i)    (6)
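A matching sketch for the Gini index and Gini gain (illustrative, following the same conventions as the entropy helpers above):

def gini(labels):
    """Gini index (Eq. 5) of the class distribution at a node."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(p * (1 - p) for p in probs)

def gini_gain(parent_labels, child_label_lists):
    """Gain in Gini index (Eq. 6) for a candidate split."""
    n = len(parent_labels)
    return gini(parent_labels) - sum(len(c) / n * gini(c) for c in child_label_lists)

print(gini(['Yes'] * 6 + ['No'] * 6))   # maximum for two classes -> 0.5
print(gini(['Yes'] * 4))                # pure node -> 0.0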
• Ordinal Features: The values taken by an ordinal feature are discrete but can be ordered.
The features Price, Patrons and WaitEstimate are examples of ordinal features. To create
a binary split based on an ordinal feature x(i) with ’k’ distinct values, we can select one of
the feature values as the threshold θ and then split the data into two subsets corresponding
to x(i) < θ and x(i) ≥ θ. Figure-6 shows an example of creating a binary split for an ordinal
feature.
Figure 6: Defining a binary split for an ordinal variable using the Price feature from Example-1.
The left branch contains feature values that are < the testing value, and the right branch has all
values that are ≥ the testing value.
• Numerical Features: A multiway split for a continuous feature, can result in an N -ary split,
where N is the total number of training instances, so that each child has only one instance.
To avoid this situation we transform a continuous feature into a binary feature by defining a
suitable threshold θ, and then splitting the data as x(i) < θ and x(i) ≥ θ. Let x_{1i}, x_{2i}, ..., x_{Ni}
be the sorted values of the ith feature. Then the optimal threshold can be found by setting
the threshold to each of the N − 1 mid-points, i.e., \frac{x_{ji} + x_{(j+1)i}}{2}, ∀j ∈ {1, 2, ..., N − 1},
and identifying the threshold value that maximizes the gain (a small code sketch of this search
follows this list).
• Nominal (Categorical) Features: Defining a binary split for a nominal feature entails parti-
tioning the set of all possible feature values into two non-empty and non-overlapping subsets.
For a nominal feature A ∈ {a_1, a_2, ..., a_v} this requires evaluating 2^{v-1} − 1 possible binary
partitions. For example, consider the Type feature from Example-1. An exhaustive search
over the possible binary partitions would entail evaluating the following splits:
1. {Burger},{French,Thai}
2. {French},{Burger,Thai}
3. {Thai},{Burger,French}
One possible strategy to avoid an exhaustive search is to replace the original feature by
v dummy boolean features, where the ith dummy feature represents the indicator function
I(x_{ji} = a_i), ∀j ∈ {1, 2, ..., N}.
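The mid-point search for numerical features can be sketched as follows (illustrative; it reuses gini_gain from the sketch above, and the tiny one-dimensional data set at the end is hypothetical):

def best_threshold(values, labels, score=gini_gain):
    """Evaluate the N-1 mid-points of the sorted feature values and return the
    binary split  x < theta  /  x >= theta  with the highest gain."""
    pairs = sorted(zip(values, labels))
    best_theta, best_gain = None, float('-inf')
    for j in range(len(pairs) - 1):
        if pairs[j][0] == pairs[j + 1][0]:
            continue                               # skip duplicate values
        theta = (pairs[j][0] + pairs[j + 1][0]) / 2.0
        left  = [y for x, y in pairs if x < theta]
        right = [y for x, y in pairs if x >= theta]
        gain = score(labels, [left, right])
        if gain > best_gain:
            best_theta, best_gain = theta, gain
    return best_theta, best_gain

xs = [1.0, 2.0, 3.0, 8.0, 9.0]
ys = ['No', 'No', 'No', 'Yes', 'Yes']
print(best_threshold(xs, ys))   # theta = 5.5 separates the two classes perfectly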
Figure 8: Overfitting.
2.5 Overfitting
Overfitting occurs when the decision tree has a low classification error on the training data, but
a considerably higher classification error on test (previously unseen) data. Figure-8 shows an example
of overfitting: as the number of nodes in the tree grows, its accuracy on the training data increases,
but its accuracy on the test data decreases. As the size of the tree grows, the decision tree starts
to capture the peculiarities of the training data itself, such as measurement noise, instead of
learning a general trend that better characterizes the underlying target concept.
There are a number of methods that can be used to avoid growing larger and deeper trees, for
example placing limits on the maximum depth of the tree or on the minimum number of instances
required before a node is split.
Remark: Care should be taken while setting these limits, because stringent limits can result
in a shallow tree that is unable to adequately represent the target concept we are trying to
learn. There are no general rules for how these limits should be set; they are usually
dataset specific.
Data: Validation data D_vd and fully grown tree T_full
Result: Pruned decision tree T_pruned
T = T_full;
while not all inner nodes of T have been tested do
    select an untested inner node t ∈ T;
    construct T': replace t in T with a leaf node labeled using the training data;
    if error(T', D_vd) ≤ error(T, D_vd) then
        T = T';
    end
end
Algorithm 2: Algorithm for reduced-error post-pruning of decision trees.
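A minimal sketch of this procedure, assuming trees in the nested-dictionary format produced by the build_tree sketch given earlier (illustrative; this is the common bottom-up variant, and the validation instances are routed down the tree so that each node is evaluated only on the instances that reach it):

def predict(tree, x):
    """Classify a feature-dict x with a tree built by the build_tree sketch."""
    while isinstance(tree, dict):
        value = x.get(tree["feature"])
        tree = tree["children"].get(value, tree["default"])
    return tree

def error(tree, data):
    """Fraction of (x, y) pairs in `data` that the tree misclassifies."""
    return sum(predict(tree, x) != y for x, y in data) / len(data)

def prune(tree, val_data):
    """Reduced-error post-pruning: an inner node is replaced by a leaf
    (its training-data majority class) whenever this does not increase
    the error on the validation instances reaching that node."""
    if not isinstance(tree, dict) or not val_data:
        return tree
    # Prune the subtrees first (bottom-up), routing validation data down.
    for value, child in tree["children"].items():
        subset = [(x, y) for x, y in val_data if x.get(tree["feature"]) == value]
        tree["children"][value] = prune(child, subset)
    leaf = tree["default"]          # majority class stored during tree building
    if error(leaf, val_data) <= error(tree, val_data):
        return leaf
    return tree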
The pruning strategy is then to refine each rule by removing any preconditions (antecedents)
whose removal does not increase the error on the validation data. So, in the first step, for the rule
given above, the following two candidate rules, each obtained by removing one precondition, will be tested:
If any of these rules has an error that is lower than that of the original rule, then we replace
the original rule with the new pruned rule. The pruning process recurses over the new
(shorter) rules and halts when all of the pruned rules are worse than their un-pruned versions.
At the end of the pruning procedure all the rules are sorted based on their accuracy over the
validation data, and are used in this sequence to classify new instances.
Missing feature values: There are several strategies for handling missing feature values:
• throw away the feature if a very high proportion of instances (e.g., 70%) have missing values
for it
• fill in a missing value with the mean of that feature computed over the training instances
belonging to the same class (see the sketch after this list)
• in certain cases a missing value has its own specific meaning, for example in a medical test
where a certain hormone is not detected; here we can create a new feature value to represent
that the value is missing (not measured)
• if we reach an internal node and the feature value is missing, we can distribute the instance
to all the child nodes with diminished weights that reflect the proportion of training instances
going down to each child
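As an illustration of the class-mean strategy (a hypothetical sketch, assuming missing values are encoded as None and the feature is numerical):

def impute_by_class_mean(data, feature):
    """Replace missing (None) values of a numerical feature with the mean of
    that feature over training instances that share the same class label."""
    sums = {}
    for x, y in data:
        if x[feature] is not None:
            sums.setdefault(y, []).append(x[feature])
    class_means = {y: sum(vals) / len(vals) for y, vals in sums.items()}
    for x, y in data:
        if x[feature] is None:
            x[feature] = class_means[y]
    return data

# Hypothetical training data with one missing WaitEstimate value:
data = [({"WaitEstimate": 10}, "Yes"),
        ({"WaitEstimate": 30}, "Yes"),
        ({"WaitEstimate": None}, "Yes"),
        ({"WaitEstimate": 60}, "No")]
impute_by_class_mean(data, "WaitEstimate")   # the None becomes (10 + 30) / 2 = 20.0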
                              Predicted Class
                              P                        N
Actual      P        True Positive (TP)       False Negative (FN)
Class       N        False Positive (FP)      True Negative (TN)
Figure 9: Confusion matrix for a binary classification task. We have two classes positive (P) and
negative (N).
So far we have dealt with uniform costs, i.e., the cost of predicting a false positive or a false negative
is equal (unit cost: every mistake counts as one). However, in domains where some errors are costlier
than others, we need to incorporate this information into the tree induction process, so that the
final decision tree pays more attention to reducing the types of error that have a high cost. To this
end we can specify the misclassification costs for a k-class classification problem using a (k × k)
loss/cost matrix defined as follows:
L = \begin{pmatrix}
0      & L_{12} & L_{13} & \dots  & L_{1k} \\
L_{21} & 0      & L_{23} & \dots  & L_{2k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
L_{k1} & L_{k2} & L_{k3} & \dots  & 0
\end{pmatrix}    (7)
where L_{ij} represents the cost of misclassifying an instance belonging to the ith class as belonging to
class j. The diagonal entries represent the cost associated with predicting the correct label, which is
zero. These costs can be integrated into the Gini index by observing that we can equivalently represent
(5) as Gini(q) = \sum_{k \neq k'} p_{qk} p_{qk'} (homework problem), and then modify it as:

Gini(q) = \sum_{k \neq k'} L_{kk'} p_{qk} p_{qk'}    (8)
This approach, however, is effective only for classification problems with more than two classes:
for two classes the modified index reduces to (L_{12} + L_{21}) p_{q1} p_{q2}, which merely rescales
the ordinary Gini index and therefore does not change which splits are preferred.
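A small sketch of the cost-weighted Gini index in Equation (8) (illustrative; the class proportions and loss matrix below are made up):

def gini_with_costs(probs, loss):
    """Cost-weighted Gini index (Eq. 8): probs[k] is p_qk and loss[k][j]
    is the cost L_kj of predicting class j for an instance of class k."""
    return sum(loss[k][j] * probs[k] * probs[j]
               for k in range(len(probs))
               for j in range(len(probs)) if k != j)

# A hypothetical 3-class node where confusing class 0 with class 2 is 5x costlier:
probs = [0.5, 0.3, 0.2]
loss = [[0, 1, 5],
        [1, 0, 1],
        [1, 1, 0]]
print(gini_with_costs(probs, loss))   # 1.02, versus 0.62 for the unweighted Gini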
3 Regression trees
Let's say that for each node m, χ_m is the set of data points reaching that node.
We first estimate a predicted value per tree node:
g_m = \frac{\sum_{t \in \chi_m} y_t}{|\chi_m|}

E_m = \frac{\sum_{t \in \chi_m} (y_t - g_m)^2}{|\chi_m|}
How do we choose the next split? If E_m < θ, then we stop splitting. Otherwise we choose the split that
realizes the maximum drop in error over all branches. Say we are considering a feature X with
branches x_1, x_2, ..., x_k, and let χ_{mj} denote the subset of χ_m for which X = x_j.
g_{mj} = \frac{\sum_{t \in \chi_{mj}} y_t}{|\chi_{mj}|}

E'_m(X) = \frac{\sum_j \sum_{t \in \chi_{mj}} (y_t - g_{mj})^2}{|\chi_m|}
We shall choose the feature X for which E'_m(X) is minimized, or equivalently for which the drop
in error is maximized.
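The split error E'_m(X) can be computed directly from the target values in each branch, as in the short sketch below (illustrative; the two branches at the end are hypothetical):

def regression_split_error(branch_targets):
    """Weighted squared error E'_m(X): branch_targets[j] holds the target
    values y_t of the instances in chi_mj (i.e., the instances with X = x_j)."""
    n = sum(len(ys) for ys in branch_targets)
    total = 0.0
    for ys in branch_targets:
        if not ys:
            continue
        g = sum(ys) / len(ys)                    # per-branch prediction g_mj
        total += sum((y - g) ** 2 for y in ys)
    return total / n

print(regression_split_error([[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]]))   # approx. 0.667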
Figure 10: Regression tree
Figure 11: Regression fit
More generally, a binary linear multivariate split at node m can take the form

w_1 x_1 + w_2 x_2 + \dots + w_d x_d + w_0 > 0

where d is the number of features.
Such splits can be extremely powerful (if the data is linearly separable, a single split at the root can
create a perfect classification); even more complex splits can be obtained using nonlinear split
functions.
However, finding a good multivariate split is no longer a matter of brute force: there
are 2^d \binom{N}{d} possible splits (or hyperplanes). Later on in the course we will discuss linear
classification and how good hyperplanes can be obtained without an exhaustive search.
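For completeness, a multivariate linear split is just a thresholded weighted sum of the features; the weights in this tiny sketch are hypothetical and correspond to a boundary of the form x1 + x2 - 1 > 0, similar in spirit to the concept in Figure-13:

def linear_split(x, w, w0):
    """Multivariate split test: route x to one child iff w.x + w0 > 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0 > 0

print(linear_split([0.8, 0.9], w=[1.0, 1.0], w0=-1.0))   # True  (above the line)
print(linear_split([0.1, 0.2], w=[1.0, 1.0], w0=-1.0))   # False (below the line)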
Figure 12: Real-valued information gain and decision tree
Figure 13: The target concept that we need to learn is a linear function of both x1 and x2 .
A univariate decision tree (top) is unable to capture the target concept. On the other hand a
multivariate decision tree (bottom) learns the linear decision boundary using only a single split.