Decision Tree
Prerequisite-
- Probability and statistics.
Objectives-
- Understand prerequisite terms such as Gini index, entropy, information gain and pruning.
- Understand the decision tree as a CART algorithm.
- Tuning parameters of a decision tree.
- Advantages and disadvantages of decision trees.
Decision Tree
A decision tree is one of the most popular and effective supervised learning techniques for classification problems, and it works equally well with both categorical and quantitative variables. It is a
graphical representation of all the possible solutions to a decision based on certain conditions. In
this algorithm, the training sample points are split into two or more sets based on a split condition over
the input variables. A simple example of a decision tree: a person has to decide whether to go to
sleep or to a restaurant based on parameters such as whether he is hungry or has $25 in his pocket.
Types of decision trees
- Categorical variable decision tree: A tree with a categorical response variable is known as a categorical variable decision tree.
- Continuous variable decision tree: A tree with a continuous response variable is known as a continuous variable decision tree.
Terminology
Before moving into the details of decision trees and how they work, let's understand the
meaning of some terminology associated with them.
Root node- Represents the entire population, which gets further divided into sets based on
splitting decisions.
Decision node- These are the internal nodes of the tree; they are expressed through conditional
expressions on the input attributes.
Leaf node- Nodes that do not split further are known as leaf nodes or terminal nodes.
Splitting- The process of dividing a node into two or more sub nodes.
Pruning- The reverse of splitting, where sub nodes are removed.
The accuracy of the tree is heavily affected by the split points chosen at the decision nodes. Decision trees use different
criteria to decide how to split a decision node into two or more sub nodes. The resulting sub nodes must
increase the homogeneity of the data points, also known as the purity of the nodes, with respect to the target
variable. The split decision is tested on all available variables, and the split that produces the purest sub
nodes is selected.
Measures of Impurity:
Decision trees recursively split the features with respect to the purity of the target variable. The algorithm is designed to
optimize each split so that purity is maximized. Impurity can be measured in several ways, such as
Gini impurity, entropy and information gain.
Gini Impurity-
The Gini index measures how often a randomly chosen element from the set would be
incorrectly labelled if it were labelled randomly according to the distribution of labels in the set. Mathematically, the impurity of a set can be expressed as:
$$\text{Gini impurity} = 1 - \sum_{i} p_i^2$$
This can be understood with a simple example: suppose a bag contains 4 red balls and 0 blue
balls. The Gini impurity would be $1 - (4/4)^2 - (0/4)^2 = 0$, i.e. the set is perfectly pure.
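The formula is easy to verify in a few lines of Python; the gini_impurity helper below is only an illustrative sketch, not part of any library:

```python
# Illustrative sketch of the Gini impurity formula (assumed helper, not a library function).
from collections import Counter

def gini_impurity(labels):
    """Gini impurity = 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

# Bag with 4 red balls and 0 blue balls: a pure set, impurity 0.
print(gini_impurity(["red", "red", "red", "red"]))    # 0.0
# A perfectly mixed bag (2 red, 2 blue) gives the maximum impurity of 0.5 for two classes.
print(gini_impurity(["red", "red", "blue", "blue"]))  # 0.5
```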
Example- Suppose we are given data on some students, with a target variable indicating whether a
student is an athlete or not. The input attributes are gender (Male/Female) and education (UG/PG).
Case 1- First, make a split with respect to the gender variable (the figure shows the split).
Step 1- Calculate the Gini score of each sub node:
$$\text{gini}_{male} = 1 - \{P(\text{yes})^2 + P(\text{no})^2\} = 1 - \left\{\left(\tfrac{3}{4}\right)^2 + \left(\tfrac{1}{4}\right)^2\right\} = 0.375$$
$$\text{gini}_{female} = 1 - \{P(\text{yes})^2 + P(\text{no})^2\} = 1 - \left\{\left(\tfrac{3}{6}\right)^2 + \left(\tfrac{3}{6}\right)^2\right\} = 0.5$$
Step 2- Calculate the Gini for the split as the weighted average of the sub node Gini scores.
$$\text{gini}_{gender} = \frac{4}{10}\,\text{gini}_{male} + \frac{6}{10}\,\text{gini}_{female} = 0.4(0.375) + 0.6(0.5) = \mathbf{0.45}$$
Case 2- Now make a split with respect to the education variable (the figure shows the split).
Step 1- Calculate the Gini score of each sub node:
$$\text{gini}_{UG} = 1 - \{P(\text{yes})^2 + P(\text{no})^2\} = 1 - \left\{\left(\tfrac{5}{5}\right)^2 + \left(\tfrac{0}{5}\right)^2\right\} = 0.0$$
$$\text{gini}_{PG} = 1 - \{P(\text{yes})^2 + P(\text{no})^2\} = 1 - \left\{\left(\tfrac{1}{5}\right)^2 + \left(\tfrac{4}{5}\right)^2\right\} = 0.32$$
Step 2- Calculate the Gini for the split as the weighted average of the sub node Gini scores.
$$\text{gini}_{education} = \frac{5}{10}\,\text{gini}_{UG} + \frac{5}{10}\,\text{gini}_{PG} = 0.5(0.0) + 0.5(0.32) = \mathbf{0.16}$$
As is clear from the calculations, the Gini score for the education split (0.16) is lower than that for the gender split (0.45), so
education is the best variable for making a split (it gives purer sub nodes).
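The same comparison can be sketched in Python. The helper functions below are illustrative, and the per-node class counts are assumptions taken from the worked example above (male 3/1, female 3/3, UG 5/0, PG 1/4):

```python
def gini(yes, no):
    """Gini impurity of a single node given its class counts."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def weighted_gini(children):
    """Gini of a split: weighted average of the sub node impurities."""
    total = sum(yes + no for yes, no in children)
    return sum((yes + no) / total * gini(yes, no) for yes, no in children)

# Class counts assumed from the worked example above.
print(weighted_gini([(3, 1), (3, 3)]))  # gender split    -> 0.45
print(weighted_gini([(5, 0), (1, 4)]))  # education split -> 0.16 (purer, so preferred)
```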
Entropy- In layman's terms, entropy is a measure of disorder. We can also think of it as a
measure of impurity. The mathematical formula for entropy is as follows:
$$E = \sum_{i=1}^{c} -p_i \log_2 p_i$$
For the student data above, with a 6/4 split between the two classes of the target variable, the entropy of the root node is:
$$\text{Entropy} = -\frac{6}{10}\log_2\left(\frac{6}{10}\right) - \frac{4}{10}\log_2\left(\frac{4}{10}\right) = 0.442 + 0.529 = 0.971$$
The entropy comes out to be 0.971, which is considered high and means a low level of
purity. For a two-class problem, entropy lies between 0 and 1.
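A minimal Python sketch of this entropy calculation, using the 6/4 class split from the example (the entropy helper is illustrative, not a library function):

```python
import math

def entropy(labels):
    """Entropy E = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    proportions = (labels.count(c) / n for c in set(labels))
    return sum(-p * math.log2(p) for p in proportions if p > 0)

# 10 students with a 6/4 split between the two target classes, as above.
labels = ["yes"] * 6 + ["no"] * 4
print(round(entropy(labels), 3))  # 0.971
```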
Information Gain
Information gain measures the reduction in entropy and decides which attribute should be
selected as the decision node. It is calculated as the entropy of the decision
node minus the weighted average of the entropies of its children. That is, for m
points in the first child node and n points in the second child node, the information gain is:
$$IG = \text{Entropy}(\text{Decision node}) - \frac{m}{m+n}\,\text{Entropy}(\text{First child}) - \frac{n}{m+n}\,\text{Entropy}(\text{Second child})$$
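A minimal sketch of this calculation in Python, reusing the class counts assumed from the Gini example (parent 6/4, UG child 5/0, PG child 1/4):

```python
import math

def entropy_from_counts(counts):
    """Entropy of a node given a list of class counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """IG = entropy of the parent minus the weighted average entropy of the children."""
    total = sum(sum(child) for child in children_counts)
    weighted = sum(sum(child) / total * entropy_from_counts(child)
                   for child in children_counts)
    return entropy_from_counts(parent_counts) - weighted

# Education split: parent 6/4, UG child 5/0, PG child 1/4 (counts assumed from the example).
print(round(information_gain([6, 4], [[5, 0], [1, 4]]), 3))  # approx. 0.61
```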
Gini vs Entropy-
Both Gini impurity and entropy measure the impurity of a node and usually lead to very similar splits. Gini is slightly cheaper to compute because it avoids the logarithm, which is why many implementations use it as the default, while entropy (through information gain) is preferred when a more information-theoretic criterion is desired. For a two-class node, Gini ranges from 0 to 0.5 and entropy from 0 to 1.
Pruning-
One of the problems with decision trees is that they easily overfit the training sample and become
too large and complex. A complex, large tree generalizes poorly to new sample data, whereas a very
small tree fails to capture the information in the training sample data.
Pruning may be defined as shortening the branches of the tree: the process of reducing the size of the tree
by turning some branch nodes into leaf nodes and removing the leaf nodes under the original branch.
Pruning is very useful because a decision tree may fit the training data very well yet perform
poorly on test or new data. By removing branches we reduce the complexity of the tree, which
helps reduce overfitting.
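As one possible way to prune in practice, the sketch below uses scikit-learn's cost-complexity pruning via the ccp_alpha parameter; the library, the breast-cancer data set and the ccp_alpha value are assumptions for illustration, not something prescribed by this material:

```python
# Sketch of post-pruning with scikit-learn's cost-complexity pruning (ccp_alpha).
# The breast-cancer data set here is only a stand-in for any training data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree usually fits the training data almost perfectly but overfits.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A larger ccp_alpha removes more branches, trading training fit for generalization.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(name, "leaves:", tree.get_n_leaves(),
          "test accuracy:", round(tree.score(X_test, y_test), 3))
```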
Hyperparameters-
To generate decision trees that generalize well to new problems, we can tune different aspects of the
tree. We call these aspects of a decision tree "hyperparameters". Some of the important hyperparameters
used in decision trees are as follows:
Maximum depth- The maximum depth of a decision tree is simply the length of the longest path from the root to a
leaf. A tree of maximum depth k can have at most 2**k leaves.
Minimum number of samples per leaf- While splitting a node, one could run into the problem of having
990 samples in one child and 10 in the other. This does not take us far in the process and would
be a waste of resources and time. To avoid this, we can set a minimum for the number of
samples allowed in each leaf.
Maximum number of features- We may have too many features to build a decision tree efficiently. At
every split, we would have to evaluate the entire data set on each of the features, which can be very expensive.
A solution is to limit the number of features considered at each split. If this number is large
enough, we are very likely to find a good feature among the ones we look at (although maybe not the
perfect one), and if it is smaller than the total number of features, it speeds up the calculations
significantly.
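A minimal sketch of tuning these hyperparameters with scikit-learn's GridSearchCV (the library, the iris data set and the candidate values are assumptions for illustration):

```python
# Sketch of tuning the hyperparameters above via a grid search (values are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    "max_depth": [2, 3, 4, 5],          # maximum depth of the tree
    "min_samples_leaf": [1, 5, 10],     # minimum number of samples per leaf
    "max_features": [None, "sqrt", 2],  # maximum number of features tried at each split
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the best combination found by cross-validation
```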
Example-
The decision tree for the example discussed in the Gini calculation is shown in the figure.
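As a complement to the figure, the sketch below fits a tree on a hypothetical 10-student data set whose class counts are consistent with the Gini calculations above (the individual records are assumed, not given in the text), and prints the learned splits:

```python
# Hypothetical 10-student data set consistent with the counts used in the Gini example:
# male 3 athletes / 1 non-athlete, female 3 / 3, UG 5 / 0, PG 1 / 4.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features encoded as [gender (0 = male, 1 = female), education (0 = UG, 1 = PG)].
X = [[0, 0], [0, 0], [0, 0], [0, 1],   # 4 males: 3 UG athletes, 1 PG non-athlete
     [1, 0], [1, 0], [1, 1],           # 3 female athletes: 2 UG, 1 PG
     [1, 1], [1, 1], [1, 1]]           # 3 female non-athletes, all PG
y = ["yes", "yes", "yes", "no",
     "yes", "yes", "yes",
     "no", "no", "no"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
# The first split chosen is education, matching the Gini comparison above.
print(export_text(tree, feature_names=["gender", "education"]))
```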
Advantages of Decision Tree
- It is simple to understand and interpret, and the tree can be visualized.
- It requires relatively little data preparation and works with both categorical and quantitative variables.
Disadvantages of Decision Tree
- A small change in the data set can result in a large change in the structure of the decision tree,
causing instability in the model.
- It can require more time to train than many other algorithms.
- The calculations can be far more expensive than for other algorithms.
- A single decision tree is often not the best choice for regression or predicting continuous values.
*********