Decision Tree



With the increasing use of Machine Learning algorithms to solve industry-level problems, the demand for more sophisticated and interpretable algorithms has grown. The Decision Tree algorithm is one such algorithm, used to solve both Regression and Classification problems.

Tree-based learning algorithms are considered to be among the best and most widely used supervised learning methods. Tree-based methods empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well, and they are adept at solving any kind of problem at hand (classification or regression).

Methods like decision trees, random forests and gradient boosting are popularly used in all kinds of data science problems.
What is a Decision Tree?

A decision tree is a type of Supervised Learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It works for both categorical and continuous input and output variables.

In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.
Why Decision Tree Algorithm?

“A Decision Tree is a Supervised Machine Learning algorithm which looks like an inverted tree, wherein each node represents a predictor variable (feature), the link between the nodes represents a decision and each leaf node represents an outcome (response variable).”

1. It is considered to be the most understandable Machine Learning algorithm and it can be easily interpreted.

2. It can be used for both classification and regression problems.

3. Unlike most Machine Learning algorithms, it works effectively with non-linear data.

4. Constructing a Decision Tree is a very quick process since it uses only one feature per node to split the data.
Decision Tree – CART

It is also known as CART (Classification & Regression Trees).

Some real-life applications of Decision Trees include:

1. Credit scoring models in which the criteria that cause an applicant to be rejected need to be
clearly documented and free from bias

2. Marketing studies of customer behavior such as satisfaction or churn, which will be shared with
management or advertising agencies

3. Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression
How It Works?

Let’s say that you hosted a huge party and you want to know how many of your guests were non-vegetarians.

To solve this problem, let’s create a simple Decision Tree.
Technical Terminology
A Decision Tree has the following structure:

1. Root Node: The root node is the starting point of a tree. At this point, the first split is
performed.

2. Internal Nodes: Each internal node represents a decision point (predictor variable) that
eventually leads to the prediction of the outcome.

3. Leaf/ Terminal Nodes: Leaf nodes represent the final class of the outcome and therefore
they’re also called terminating nodes.

4. Branches: Branches are the connections between nodes and are represented as arrows. Each branch represents a response such as yes or no.
How Does The Decision Tree Algorithm Work?

The decision tree algorithm works by repeatedly partitioning the data into multiple sub-spaces so that the outcomes in each final sub-space are as homogeneous as possible. This approach is technically called recursive partitioning. The result is a set of rules used for predicting the outcome variable, which can be either:
• a continuous variable, for regression trees
• a categorical variable, for classification trees

The decision rules generated by the CART (Classification & Regression Trees) predictive model are generally visualized as a binary tree.

The Decision Tree algorithm follows the steps below:

Step 1: Select the feature (predictor variable) that best classifies the data set into the desired classes and assign that feature to the root node.

Step 2: Traverse down from the root node, making relevant decisions at each internal node such that each internal node best classifies the data.

Step 3: Return to step 1 and repeat until you assign a class to the input data.

The above steps represent the general workflow of a Decision Tree used for classification purposes. A minimal code sketch of this workflow is shown below.
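To make recursive partitioning concrete, here is a minimal sketch in R using the rpart package on the built-in iris data; the package and data set are illustrative choices, not part of the original case studies.

# Minimal recursive-partitioning sketch with rpart on the built-in iris data.
library(rpart)

# rpart repeatedly selects the predictor and split point that best separate the
# classes (steps 1-3 above), growing the tree until its stopping rules are met.
fit <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
             data = iris, method = "class")

print(fit)                                    # the learned decision rules
pred <- predict(fit, newdata = iris, type = "class")
table(pred, iris$Species)                     # confusion matrix on the training data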
How a Decision Tree Works

Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now we want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate the students who play cricket in their leisure time based on the most significant input variable among the three.

This is where a decision tree helps: it segregates the students based on all values of the three variables and identifies the variable which creates the best homogeneous sets of students (which are heterogeneous to each other). As the split calculations that follow show, the variable Gender identifies the best homogeneous sets compared to the other two variables.
Types of Decision Trees
The type of decision tree depends on the type of target variable. There are two types:

• Categorical Variable Decision Tree: A decision tree with a categorical target variable is called a categorical variable decision tree. Example: in the student problem above, the target variable was “Student will play cricket or not”, i.e. YES or NO.

• Continuous Variable Decision Tree: A decision tree with a continuous target variable is called a continuous variable decision tree.

Example: let’s say we have a problem of predicting whether a customer will pay their renewal premium with an insurance company (yes/no). Here we know that the income of the customer is a significant variable, but the insurance company does not have income details for all customers. Since we know this is an important variable, we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values of a continuous variable.
Important Terminology (Additional)

Root Node: It represents the entire population or sample, and it is further divided into two or more homogeneous sets.

Splitting: The process of dividing a node into two or more sub-nodes.

Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.

Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.

Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It is the opposite of splitting.

Branch/Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.

Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
Advantages
1. Easy to understand: Decision tree output is very easy to understand, even for people from a non-analytical background. It does not require any statistical knowledge to read and interpret. Its graphical representation is very intuitive and users can easily relate it to their hypotheses.

2. Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. It can also be used in the data exploration stage. For example, if we are working on a problem where information is available in hundreds of variables, a decision tree will help to identify the most significant ones.

3. Less data cleaning required: It requires less data cleaning compared to some other modeling techniques. It is fairly robust to outliers and missing values.

4. Data type is not a constraint: It can handle both numerical and categorical variables.

5. Non-parametric method: A decision tree is considered a non-parametric method. This means that decision trees make no assumptions about the space distribution or the classifier structure.
Disadvantages

1. Overfitting: Overfitting is one of the most practical difficulties for decision tree models. This problem is addressed by setting constraints on model parameters and by pruning (discussed in detail below).

2. Not ideal for continuous variables: While working with continuous numerical variables, the decision tree loses information when it categorizes variables into different categories.
Regression Trees vs Classification Trees

1. Regression trees are used when the dependent variable is continuous. Classification trees are used when the dependent variable is categorical.

2. In the case of a regression tree, the value assigned to a terminal node is the mean response of the training observations falling in that region. Thus, if an unseen observation falls in that region, its prediction is made with the mean value.

3. In the case of a classification tree, the value (class) assigned to a terminal node is the mode of the training observations falling in that region. Thus, if an unseen observation falls in that region, its prediction is made with the mode value.

4. Both types of tree divide the predictor space (independent variables) into distinct and non-overlapping regions. For the sake of simplicity, you can think of these regions as high-dimensional boxes.

5. Both types of tree follow a top-down greedy approach known as recursive binary splitting. We call it ‘top-down’ because it begins at the top of the tree, when all the observations are in a single region, and successively splits the predictor space into two new branches down the tree. It is ‘greedy’ because the algorithm only looks for the best variable available for the current split, and not for future splits that might lead to a better tree.

6. The splitting process continues until a user-defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node becomes less than 50.

7. In both cases, the splitting process results in fully grown trees until the stopping criterion is reached. However, a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings in ‘pruning’, one of the techniques used to tackle overfitting. A short code sketch contrasting the two tree types follows.
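As a quick illustration of the two tree types, here is a short R sketch using the rpart package; the data sets (iris, mtcars) and the minsplit value are illustrative choices rather than part of the original example.

# Classification vs regression trees with rpart (illustrative data sets).
library(rpart)

# Classification tree: each terminal node predicts the modal (most frequent) class.
cls_fit <- rpart(Species ~ ., data = iris, method = "class")

# Regression tree: each terminal node predicts the mean response of its region.
# minsplit acts as a user-defined stopping criterion (cf. point 6 above).
reg_fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
                 control = rpart.control(minsplit = 10))

print(cls_fit)
print(reg_fit)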
How does a tree decide where to split?

The choice of strategic splits heavily affects a tree’s accuracy, and the decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide how to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, the purity of the node increases with respect to the target variable. The decision tree considers splits on all available variables and then selects the split which results in the most homogeneous sub-nodes.

The choice of algorithm also depends on the type of target variable. Let’s look at the most commonly used splitting criteria in decision trees:
1. Gini
The Gini measure says that if we select two items from a population at random, then they must be of the same class; the probability of this is 1 if the population is pure.

• It works with a categorical target variable (“Success” or “Failure”).

• It performs only binary splits.

• The higher the value of Gini, the higher the homogeneity.

• CART (Classification and Regression Trees) uses the Gini method to create binary splits.

Steps to calculate Gini for a split:

1. Calculate Gini for each sub-node, using the formula: sum of the squares of the probabilities of success and failure (p^2 + q^2).

2. Calculate Gini for the split using the weighted Gini score of each node of that split.
Example: referring to the example above, where we want to segregate the students based on the target variable (playing cricket or not), we split the population using the two input variables Gender and Class, and identify which split produces the more homogeneous sub-nodes using Gini.

Split on Gender:
1. Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
2. Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
3. Weighted Gini for split on Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59

Split on Class:
1. Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
2. Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
3. Weighted Gini for split on Class = (14/30)*0.51 + (16/30)*0.51 = 0.51

The Gini score for the split on Gender is higher than for the split on Class; hence, the node split will take place on Gender. This calculation is reproduced in the code sketch below.
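The same Gini arithmetic can be reproduced in base R; the snippet below uses the exact fractions from the 30-student example, so the results match the rounded values above.

# Gini (p^2 + q^2) and weighted Gini for the two candidate splits.
gini <- function(p) p^2 + (1 - p)^2

# Split on Gender: 10 females (2 play cricket), 20 males (13 play cricket)
g_female <- gini(2 / 10)                                  # 0.68
g_male   <- gini(13 / 20)                                 # ~0.55
gini_gender <- (10 / 30) * g_female + (20 / 30) * g_male  # ~0.59

# Split on Class: 14 in Class IX (6 play), 16 in Class X (9 play)
g_ix <- gini(6 / 14)                                      # ~0.51
g_x  <- gini(9 / 16)                                      # ~0.51
gini_class <- (14 / 30) * g_ix + (16 / 30) * g_x          # ~0.51

c(gender = gini_gender, class = gini_class)               # Gender gives the higher score
# Gini Impurity of a node is simply 1 - gini(p), as defined in the next section.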
Gini Impurity

The term ‘Gini Impurity’ refers to the value obtained by subtracting the Gini value from 1. Mathematically,

Gini Impurity = 1 - Gini


2. Chi-Square
It is an algorithm that finds the statistical significance of the differences between sub-nodes and the parent node. It is measured as the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.

1. It works with a categorical target variable (“Success” or “Failure”).

2. It can perform two or more splits.

3. The higher the value of Chi-square, the higher the statistical significance of the differences between the sub-node and the parent node.

4. The Chi-square of each node is calculated using the formula: Chi-square = sqrt((Actual - Expected)^2 / Expected).

5. It generates a tree called CHAID (Chi-square Automatic Interaction Detector).

Steps to calculate Chi-square for a split:

• Calculate the Chi-square for each individual node by computing the deviations for both Success and Failure.

• Calculate the Chi-square of the split as the sum of all Success and Failure Chi-square values of the nodes in that split.
Example – Chi Square
Split on Gender:
1. First, populate the actual values of “Play Cricket” and “Not Play Cricket” for the Female node; here these are 2 and 8 respectively.
2. Calculate the expected values of “Play Cricket” and “Not Play Cricket”; here both are 5, because the parent node has a probability of 50% and we apply the same probability to the Female count (10).
3. Calculate the deviations using Actual - Expected: for “Play Cricket” (2 - 5 = -3) and for “Not Play Cricket” (8 - 5 = 3).
4. Calculate the Chi-square of the node for “Play Cricket” and “Not Play Cricket” using the formula sqrt((Actual - Expected)^2 / Expected).
5. Add all the Chi-square values to obtain the Chi-square for the split on Gender.
Example – Chi Square

Split on Class:
Perform the same calculations for the split on Class.

The Chi-square values also identify the Gender split as more significant than the Class split, as the sketch below reproduces.
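The Chi-square arithmetic can be reproduced in base R using the slide’s formula sqrt((Actual - Expected)^2 / Expected); the counts are taken from the 30-student example.

# Chi-square contribution of one node, summing the Play and Not-Play cells.
chi_node <- function(actual_play, n) {
  expected <- n / 2                               # parent node probability is 50%
  actual   <- c(play = actual_play, not_play = n - actual_play)
  sum(sqrt((actual - expected)^2 / expected))
}

# Split on Gender: Female node (2 of 10 play), Male node (13 of 20 play)
chi_gender <- chi_node(2, 10) + chi_node(13, 20)   # ~4.58

# Split on Class: Class IX (6 of 14 play), Class X (9 of 16 play)
chi_class <- chi_node(6, 14) + chi_node(9, 16)     # ~1.46

c(gender = chi_gender, class = chi_class)          # Gender split is more significant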
3. Information Gain
Look at the image below (three nodes A, B and C) and think about which node can be described most easily. Your answer is surely C, because it requires less information: all of its values are similar. On the other hand, B requires more information to describe, and A requires the maximum information. In other words, C is a pure node, B is less impure and A is more impure.

We can therefore conclude that a less impure node requires less information to describe it, and a more impure node requires more information. Information theory defines a measure of this degree of disorganization in a system, known as Entropy. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided (50% – 50%), it has an entropy of one.
Entropy

Entropy can be calculated using the formula:

Entropy = -p log2(p) - q log2(q)

Here p and q are the probabilities of success and failure, respectively, in that node. Entropy is also used with a categorical target variable. The algorithm chooses the split which has the lowest entropy compared to the parent node and the other splits. The lower the entropy, the better.

Steps to calculate entropy for a split:

1. Calculate the entropy of the parent node.
2. Calculate the entropy of each individual node of the split, and calculate the weighted average of all sub-nodes in the split.
Example - Entropy
Example: let’s use this method to identify the best split for the student example.

Entropy for the parent node = -(15/30) log2(15/30) - (15/30) log2(15/30) = 1. Here 1 shows that it is an impure node.

Entropy for the Female node = -(2/10) log2(2/10) - (8/10) log2(8/10) = 0.72, and for the Male node = -(13/20) log2(13/20) - (7/20) log2(7/20) = 0.93.

Entropy for the split on Gender = weighted entropy of the sub-nodes = (10/30)*0.72 + (20/30)*0.93 = 0.86

Entropy for the Class IX node = -(6/14) log2(6/14) - (8/14) log2(8/14) = 0.99, and for the Class X node = -(9/16) log2(9/16) - (7/16) log2(7/16) = 0.99.

Entropy for the split on Class = (14/30)*0.99 + (16/30)*0.99 = 0.99

Above, you can see that the entropy for the split on Gender is the lowest, so the tree will split on Gender. The information gain is the entropy of the parent node minus the weighted entropy of the split (here 1 - Entropy, since the parent entropy is 1).
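The entropy calculation can likewise be reproduced in base R with the counts from the student example.

# Entropy of a node with success probability p, and weighted entropy of each split.
entropy <- function(p) {
  q <- 1 - p
  ifelse(p == 0 | q == 0, 0, -p * log2(p) - q * log2(q))
}

e_parent <- entropy(15 / 30)                                            # 1.00 (impure node)

e_gender <- (10 / 30) * entropy(2 / 10) + (20 / 30) * entropy(13 / 20)  # ~0.86
e_class  <- (14 / 30) * entropy(6 / 14) + (16 / 30) * entropy(9 / 16)   # ~0.99

# Information gain = parent entropy - weighted entropy of the split
c(gain_gender = e_parent - e_gender, gain_class = e_parent - e_class)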
What are the key parameters of tree modeling and
how can we avoid over-fitting in decision trees?

Overfitting is one of the key challenges faced while modeling decision trees. If no limit is set on the size of a decision tree, it will give you 100% accuracy on the training set, because in the worst case it will end up making one leaf per observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in two ways:

1. Setting constraints on tree size

2. Tree pruning
Setting Constraints on Tree Size

1. Minimum samples for a node split
   • Defines the minimum number of samples (or observations) required in a node for it to be considered for splitting.
   • Used to control overfitting. Higher values prevent the model from learning relations which might be highly specific to the particular sample selected for a tree.
   • Values that are too high can lead to underfitting; hence, this parameter should be tuned using cross-validation (CV).

2. Minimum samples for a terminal node (leaf)
   • Defines the minimum number of samples (or observations) required in a terminal node or leaf.
   • Used to control overfitting, similar to the minimum samples for a split (min_samples_split).
   • Generally, lower values should be chosen for imbalanced class problems, because the regions in which the minority class is in the majority will be very small.

3. Maximum depth of tree (vertical depth)
   • The maximum depth of a tree.
   • Used to control overfitting, as a higher depth allows the model to learn relations very specific to a particular sample.
   • Should be tuned using CV.

4. Maximum number of terminal nodes
   • The maximum number of terminal nodes or leaves in a tree.
   • Can be defined in place of maximum depth. Since binary trees are created, a depth of ‘n’ produces a maximum of 2^n leaves.

5. Maximum features to consider for a split
   • The number of features to consider while searching for the best split. These are randomly selected.
   • As a rule of thumb, the square root of the total number of features works well, but it is worth checking up to 30-40% of the total number of features.
   • Higher values can lead to overfitting, but this depends on the case.

A short code sketch of these constraints is shown below.
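As a sketch of how such constraints are set in practice, the rpart package exposes minsplit, minbucket and maxdepth, which correspond roughly to the first three parameters above; limits on the number of leaves or on the features considered per split belong to other implementations (for example scikit-learn’s max_leaf_nodes and max_features) rather than to rpart. The values below are illustrative.

library(rpart)

ctrl <- rpart.control(minsplit  = 50,   # minimum observations in a node to attempt a split
                      minbucket = 20,   # minimum observations allowed in a leaf
                      maxdepth  = 4,    # maximum vertical depth of the tree
                      cp        = 0.01) # complexity parameter; larger values grow smaller trees

fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)
printcp(fit)   # cross-validated complexity table, useful for tuning these values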
Are tree based models better than linear models?

“If I can use logistic regression for classification problems and linear regression for regression problems, why is there a need to use trees?” Many of us have this question, and it is a valid one.

Actually, you can use either; it depends on the type of problem you are solving. Let’s look at some key factors which will help you decide which algorithm to use:

• If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will outperform a tree-based model.

• If there is a highly non-linear and complex relationship between the dependent and independent variables, a tree model will outperform a classical regression method.

• If you need to build a model which is easy to explain to people, a decision tree model will always do better than a linear model. Decision tree models are even simpler to interpret than linear regression!
Decision Tree in R
Case Study 1
Problem: a car seat manufacturing company is interested in knowing which segments or attributes lead to high sales.

Approach

A decision tree can be built with Sales as the target variable (we will first convert it into a categorical variable), with all other variables treated as independent variables in the analysis.

The data is a simulated data set containing sales of child car seats at 400 different stores: a data frame with 400 observations on the following 11 variables.

1. Sales - Unit sales (in thousands) at each location


2. CompPrice - Price charged by competitor at each location
3. Income - Community income level (in thousands of dollars)
4. Advertising - Local advertising budget for company at each location (in thousands of dollars)
5. Population - Population size in region (in thousands)
6. Price - Price company charges for car seats at each site
7. ShelveLoc - A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the
car seats at each site
8. Age - Average age of the local population
9. Education - Education level at each location
10. Urban - A factor with levels No and Yes to indicate whether the store is in an urban or rural location
11. US - A factor with levels No and Yes to indicate whether the store is in the US or not
The workflow shown in the accompanying R screenshots is:

1. Find the range of the Sales variable.
2. Create a categorical variable based on Sales: if Sales is higher than 8, label it Yes, else No.
3. Remove the original target variable, Sales, from the predictors.
4. Fit the tree. Although the training summary is quite satisfactory, with a fairly low misclassification error, we also look at the results on the test data set.
5. Check the model accuracy parameters on the test set.
6. Use cross-validation to find the tree size at the point of lowest deviance, and prune to that size.

A code sketch of this workflow follows.
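The sketch below assumes the Carseats data from the ISLR package and the tree package, which are commonly used for this exercise; the 70/30 split and the random seed are illustrative choices rather than details from the slides.

library(ISLR)
library(tree)

data(Carseats)
range(Carseats$Sales)                      # step 1: range of the Sales variable

# Steps 2-3: categorical target High (Yes if Sales > 8, else No), then drop Sales.
Carseats$High <- as.factor(ifelse(Carseats$Sales > 8, "Yes", "No"))
carseats <- subset(Carseats, select = -Sales)

# Train/test split
set.seed(1)
train_idx <- sample(seq_len(nrow(carseats)), size = 0.7 * nrow(carseats))
train <- carseats[train_idx, ]
test  <- carseats[-train_idx, ]

# Step 4: fit the classification tree and inspect the training summary
fit <- tree(High ~ ., data = train)
summary(fit)                               # residual deviance, misclassification error

# Step 5: accuracy on the test data set
pred <- predict(fit, newdata = test, type = "class")
mean(pred == test$High)

# Step 6: cross-validation to find the size with the lowest deviance, then prune
cv_fit <- cv.tree(fit, FUN = prune.misclass)
best_size <- cv_fit$size[which.min(cv_fit$dev)]
pruned <- prune.misclass(fit, best = best_size)
plot(pruned); text(pruned, pretty = 0)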
Case Study 2 – Decision Tree with the Life Insurance Promotion Dataset

The dataset contains information about credit card holders who have accepted or rejected various promotional offerings. We are trying to infer the likelihood of different card holders accepting or rejecting a life insurance promotion. Information about each customer’s age, income and gender, and whether the customer took advantage of the life insurance promotion, is included in the dataset.
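A sketch of the analysis might look like the following; the file name and column names (age, income, gender, life_insurance_promo) are hypothetical stand-ins for the actual dataset, and the rpart and rpart.plot packages are assumed to be available.

library(rpart)
library(rpart.plot)

# Hypothetical file and column names; replace with the real dataset.
promo <- read.csv("life_insurance_promotion.csv", stringsAsFactors = TRUE)

# Classification tree: will a card holder accept the life insurance promotion?
fit <- rpart(life_insurance_promo ~ age + income + gender,
             data = promo, method = "class")

rpart.plot(fit)                                  # visualize the decision rules
table(predicted = predict(fit, type = "class"),  # confusion matrix on the training data
      actual    = promo$life_insurance_promo)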
