Decision Tree

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

Decision Tree- CART

Prerequisite-
- Probability and statistics.

Objectives-
- Understand the prerequisite term such as gini index, entropy, Information gain and pruning.
- Understand Decision tree as CART algorithm.
- Tuning parameter of decision tree.
- Advantages and disadvantages of Decision Tree.

Decision Tree
A decision Tree is one of most popular and effective supervised learning technique for
classification problem that equally works well with both categorical and quantitative variables. It is a
graphical representation of all the possible solution to a decision that is based on certain condition. In
this algorithm, the training sample points are split into two or more sets based on the split condition over
input variables. A simple example of decision tree can be as a person has to take a decision for going to
sleep or restaurant based on parameters like he is hungry or have 25$ in his pocket.

Figure 1 Example of Decision tree

Types of Decision tree


- Categorical variable decision tree : The type of decision tree is classified based in the
response/target variable. A tree with qualitative or categorical response variable is known as
Categorical variable decision tree.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
1
- Continuous variable decision tree: A tree with continuous response variable is known as
continuous variable decision tree.

Terminology
Before moving forward into the details of decision tree and its working lets understand the
meaning of some terminology associated with it.
Root node- Represent the entire set of the population which gets further divided into sets based on
splitting decisions.
Decision node- These are the internal nodes of the tree, These nodes are expressed through conditional
expression for input attributes.
Leaf node- Nodes which do not split further are known as leaf nodes or terminal nodes.
Splitting- The process of dividing a node into one or more sub nodes.
Pruning- It is the reverse process of splitting where the sub nodes are removed.

Figure 2 Decision Tree Terminology

The tree accuracy is heavily affected by the split point at decision node. Decision trees use different
criteria to decide split on decision node to get two or more sub nodes. The resultant sub nodes must
increase in the homogeneity of data points also known as the purity of nodes with respect to target
variable. The split decision is tested on all available variables and then the split with maximum purity sub
nodes is get selected.

Measures of Impurity:
Decision trees recursively split feature about to their target variable’s purity. The algorithm is designed to
optimize each split such the purity will be maximized. Impurity can be measured in many ways such as
Gini impurity, Entropy and information gain.

Gini Impurity-
Gini index is the measure of how often a randomly chosen element from the set would be
incorrectly labelled. Mathematically the impurity of a set can be expressed as:

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
2
𝑔𝑖𝑛𝑖 𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦 = 1 − . 𝑝/0
/

This can be understood by a simple example of let a bag contains some balls (4 red balls and 0 blue
ball). The gini index would be:

𝑔𝑖𝑛 𝑖𝑚𝑝𝑢𝑟𝑖𝑡𝑦 = 1 − {𝑃(𝑅𝑒𝑑)0 + 𝑃(𝐵𝑙𝑢𝑒)0 } = 1 − { 10 + 00 } = 0


The impurity measurement is zero because no matter what ball we take out we would never incorrectly
label it. Similarly if we consider 2 red and 2 blue balls, the gini index would be 0.5. Gini primarily works
with categorical variables and perform binary split. Lower the gini index shows higher homogeneity. Split
selection can be done in two simple steps.
1. Calculate the gini impurity using the above formula for each sub nodes.
2. Calculate the gini for the split using weighted approach of gini scores of each node of split.

Example- Let we have been given with a data of some students with a target variable that whether a
student is athlete or not. The input attributes are gender (Male/Female) and education (UG/PG).

Person Gender Education Athlete


Student1 Male UG Yes
Student2 Female PG No
Student3 Male UG Yes
Student4 Male PG No
Student5 Female UG Yes
Student6 Female UG Yes
Student7 Female PG No
Student8 Male UG Yes
Student9 Female PG Yes
Student10 Female PG No

Case 1- Let first make a split with respect to gender variable: Figure is showing the split-

Step 1- calculate the gini impurity for each sub node


3 0 1 0
𝑔𝑖𝑛𝑖=>?@ = 1 − {𝑃(𝑦𝑒𝑠)0 + 𝑃(𝑁𝑜) 0}
= 1 − D E H + E H I = 0.375
4 4

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
3
3 0 3 0
𝑔𝑖𝑛𝑖M@=>?@ = 1 − {𝑃(𝑦𝑒𝑠)0 + 𝑃(𝑁𝑜) 0}
= 1 − D E H + E H I = 0.5
6 6
Step 2- Calculate the gini for split using weights of each gini score.
4 6
𝑔𝑖𝑛𝑖O@PQ@R = (𝑔𝑖𝑛𝑖=>?@ ) + S𝑔𝑖𝑛𝑖M@=>?@ T = 0.4 (0.375) + 0.6(0.5) = 𝟎. 𝟒𝟓
10 10

Case 2- Let first make second split with respect to the education variable: Figure is showing the split-

Step 1- calculate the gini impurity for each sub node


5 0 0 0
𝑔𝑖𝑛𝑖XY = 1 − {𝑃(𝑦𝑒𝑠)0 + 𝑃(𝑁𝑜)0 } = 1 − D E H + E H I = 0.0
5 5

0 0}
1 0 4 0
𝑔𝑖𝑛𝑖ZY =1− {𝑃(𝑦𝑒𝑠) + 𝑃(𝑁𝑜) = 1 − D E H + E H I = 0.32
5 5
Step 2- Calculate the gini for split using weights of each gini score.
5 3
𝑔𝑖𝑛𝑖@Q\]>^/_P = (𝑔𝑖𝑛𝑖XY ) + (𝑔𝑖𝑛𝑖ZY ) = 0.5 (0.0) + 0.3(0.32) = 𝟎. 𝟎𝟗𝟔
10 10

As it is clear from the calculations that the gini score for education is lower than for gender variable so
the best variable for making a split is education. ( gives more pure sub nodes)

Entropy and Information gain

Entropy- In Layman term, Entropy is nothing but the measure of disorder. We can also think it as a
measure of purity. The mathematical formula of Entropy is as follows-
]
𝐸 = . − 𝑝/ 𝑙𝑜𝑔0 𝑝/
/cd

Where pi is the probability of class i.


For the given example we have two labels for athlete class (Yes/No). Therefore the athlete could be
either Yes or No. So, the entropy for the given set is defined as:
6 4
𝑃(𝑌𝑒𝑠) = 𝑎𝑛𝑑 𝑃(𝑁𝑜) =
10 10

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
4
6 6 4 4
𝐸𝑛𝑡𝑟𝑜𝑝𝑦 = − log 0 E H − log 0 E H = 0.442 + 0.529 = 0.971
10 10 10 10

The entropy comes out to be 0.971 which is considered a high entropy, Which means low level of
Purity. Entropy is measured between 0 and 1.

Information Gain

Information Gain measures the reduction in Entropy and decides which attribute would be
selected as decision node. In general, information gain has again calculated the subtraction of decision
node entropy to the weighted average of the entropies for the children of decision node. That is, for m
points in the first child node and n points in the second child node, the information gain is:

𝒎 𝒏
𝑰𝑮 = 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑫𝒆𝒄𝒊𝒔𝒊𝒐𝒏𝒏𝒐𝒅𝒆) − 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑭𝒊𝒓𝒔𝒕𝑪𝒉𝒊𝒍𝒅) − 𝑬𝒏𝒕𝒓𝒐𝒑𝒚(𝑺𝒆𝒄𝒐𝒏𝒅𝑪𝒉𝒊𝒍𝒅)
𝒎+𝒏 𝒎+𝒏

A Simple Graph between a class and Entropy would be like:

Figure 3 Entropy graph w.r.t impurity

Gini vs Entropy-

Figure 4 Gini vs Entropy Graph

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
5
Pruning-

One of the problems with decision tree is it gets easily overfit with training sample and becomes
too large and complex. A complex and large tree poorly generalizes the new samples data whereas a
small tree fails to capture the information of training sample data.
Pruning may be defined as shortening the branches of tree. The process of reducing the size of the tree
by turning some branch node into leaf node and removing the leaf node under the original branch.

Figure 5 Pruning in Decision Tree

Pruning is very useful in decision tree because sometime what happens is that the decision tree
may fit the training data very well but performs very poorly in testing or new data. So, by removing
branches we can reduce the complexity of tree which help in reducing the over fitting of tree.

Hyperparameters for Decision Trees

To generate decision trees that will generalize to new problems well, we can tune different aspect about
trees. We call these aspects of decision tree “hyperparameters”. Some of Important Hyperparameters
used in decision trees are as follows:
Maximum Depth- The maximum depth of decision tree is simply the largest length between the root to
leaf. A tree of maximum length k can have at most 2**k leaves.

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
6
Minimum number of samples per leaf- While splitting a node, one could run into the problem of having
990 samples in one of them, and 10 on the other. This will not take us too far in our process, and would
be a waste of resources and time. If we want to avoid this, we can set a minimum for the number of
samples we allow on each leaf.

Maximum number of feature- We can have too many features to build a decision tree. While splitting,
in every split, we have to check the entire data-set on each of the features. This can be very expensive.
A solution for this is to limit the number of features that one looks for in each split. If this number is large
enough, we're very likely to find a good feature among the ones we look for (although maybe not the
perfect one). However, if it's not as large as the number of features, it will speed up our calculations
significantly.

Example-

The decision tree for the example discussed in gini calculation will be-

Figure 6 Decision Tree

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
7
Advantages of Decision tree

- Decision tree does not require normalization and scaling of data.


- Missing value in the data-set also does not affect the process of making the decision tree of
the particular dataset.
- It is very easy to explain decision tree to anyone. It does not require much knowledge of
statistics and visualization also helps very much.
- Decision tree requires less time and effort during data pre-processing than other algorithms.

Disadvantages of Decision Tree

- A small change in the data-set can result in large change in the structure of the decision tree
causing instability in the model.
- It require more time to train a model in decision tree than any other algorithm present out
there.
- In decision tree calculation can be far more expensive than the other algorithm.
- It is not advised to apply decision tree for regression or predicting continuous values.

*********

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
8

You might also like