Advanced Machine Learning: Neural Networks – Decision Trees – Random Forest – XGBoost


Advanced Machine Learning

Neural Networks – Decision Trees – Random Forest – XGBoost

By Oday Mourad


Neural Networks

Input layer x (price, shipping cost, marketing, material quality) → Hidden layer (affordability, awareness, perceived quality) → Output layer a → ŷ

We could say that neural networks automatically engineer the data to extract new features and control how much each feature contributes to the prediction.
Neural Networks Notation

X = A^[0] → A^[1] → A^[2] → A^[3] → A^[4]
(input layer)  (layer 1)  (layer 2)  (layer 3)  (layer 4)

Every unit (neuron) contains its own parameters w and b, and an activation function like sigmoid, tanh, ReLU, …

Let's zoom in on the third layer.
Neural Networks Notation

Zooming in on layer 3, where g is the activation function:

A1^[3] = g(W1^[3] · A^[2] + b1^[3])   (parameters W1^[3], b1^[3])
A2^[3] = g(W2^[3] · A^[2] + b2^[3])   (parameters W2^[3], b2^[3])
A3^[3] = g(W3^[3] · A^[2] + b3^[3])   (parameters W3^[3], b3^[3])

In general, for unit j of layer l:

Aj^[l] = g(Wj^[l] · A^[l-1] + bj^[l])
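The per-unit formula above can be vectorized across a whole layer; here is a minimal NumPy sketch (the zero weights and the 2-input, 3-unit layer are illustrative assumptions, not values from the slides):

```python
import numpy as np

def dense(A_prev, W, b, g):
    """One fully connected layer: A = g(A_prev @ W + b)."""
    return g(A_prev @ W + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy layer with 2 inputs and 3 units.
A0 = np.array([[1.0, 2.0]])   # one example, shape (1, 2)
W1 = np.zeros((2, 3))         # shape (n_inputs, n_units)
b1 = np.zeros(3)
A1 = dense(A0, W1, b1, sigmoid)  # shape (1, 3); all 0.5 with zero weights
```

Stacking calls to `dense` (feeding each layer's output into the next) gives the full forward pass X = A^[0] → A^[1] → … → A^[4].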
Neural Networks Notation (Matrix Shapes)

Activation shapes (m: # of training examples, n: # of input features):

X = A^[0]: (m, n)   A^[1]: (m, 4)   A^[2]: (m, 5)   A^[3]: (m, 3)   A^[4]: (m, 1)

Weight shapes: W^[1]: (n, 4)   W^[2]: (4, 5)   W^[3]: (5, 3)   W^[4]: (3, 1)
Activation Function Types (Sigmoid)

g(z) = Sigmoid(z) = 1 / (1 + e^(-z))

Binary classification. It is commonly used for models where we have to predict a probability as the output. Since the probability of anything exists only in the range 0 to 1, sigmoid is the right choice because of its range (it crosses 0.5 at z = 0).
Activation Function Types (ReLU)

g(z) = ReLU(z) = max(0, z)

Regression. It is commonly used for models where we have to predict any positive numeric value as an output (as in regression problems like house price), and ReLU is the most common choice for hidden layers because of its low compute time and its non-flat part.
Activation Function Types (Linear Activation Function)

g(z) = z

Regression. It is commonly used for models where we have to predict any numeric value, positive or negative, as an output (as in regression problems like total profit).

Note: There are many other activation function types, like tanh, leaky ReLU, and SoftMax, and we will talk about them later.
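The three activation functions above can be sketched in a few lines of NumPy (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Squashes any z into (0, 1); sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Zero for negative z, identity for positive z."""
    return np.maximum(0.0, z)

def linear(z):
    """Identity: passes z through unchanged."""
    return z

z = np.array([-2.0, 0.0, 3.0])
sigmoid(z)  # values in (0, 1), with 0.5 at z = 0
relu(z)     # negative inputs clipped to 0
linear(z)   # unchanged
```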
Binary Classification NN Example

We want to build a neural network to predict whether a handwritten image is the digit 0 or 1.

• The input is the image pixels.
• The activation function for the output neuron is sigmoid, to predict a probability between 0 and 1.
• Threshold: predict 1 if ŷ ≥ 0.5, otherwise 0.
• We will use the cross-entropy loss function because we are trying to predict one of two classes.
Binary Classification NN Example

We will build it using TensorFlow:

1. The first step is building and stringing together the NN layers.
2. Then we compile the model and select the loss function.
3. The last step is to train the NN.
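A minimal sketch of those three steps with the Keras API (the layer sizes, optimizer, and the tiny random stand-in dataset are illustrative assumptions, not the slide's exact code):

```python
import numpy as np
import tensorflow as tf

# Step 1: build and string together the NN layers.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output probability
])

# Step 2: compile the model and select the loss function.
model.compile(loss="binary_crossentropy", optimizer="adam")

# Step 3: train the NN (random toy data standing in for image pixels).
X = np.random.rand(8, 4).astype("float32")
y = np.random.randint(0, 2, size=(8, 1)).astype("float32")
model.fit(X, y, epochs=1, verbose=0)

p = model.predict(X, verbose=0)  # probabilities; threshold at 0.5 to classify
```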


Multiclass Classification

What if we have more than two classes?

• Recognize all numbers (0–9).
• Recognize whether a patient has one of three or five different diseases.
• Recognize animal categories.

With 2 classes, the model outputs p(y = 0 | x) and p(y = 1 | x). How do we apply this to more classes? With SoftMax.
SoftMax (4 Possible Outputs)

Logistic Regression:
z = w · x + b
a1 = g(z) = 1 / (1 + e^(-z)) = p(y = 1 | x)
a2 = 1 - a1 = p(y = 0 | x)

SoftMax Regression:
z1 = w1 · x + b1   a1 = e^z1 / (e^z1 + e^z2 + e^z3 + e^z4)
z2 = w2 · x + b2   a2 = e^z2 / (e^z1 + e^z2 + e^z3 + e^z4)
z3 = w3 · x + b3   a3 = e^z3 / (e^z1 + e^z2 + e^z3 + e^z4)
z4 = w4 · x + b4   a4 = e^z4 / (e^z1 + e^z2 + e^z3 + e^z4)
SoftMax (N Possible Outputs)

aj = e^zj / Σ_{k=1..N} e^zk = P(y = j | x)

zj = wj · x + bj,   j = 1, 2, …, N

Note: a1 + a2 + … + aN = 1
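The N-output formula can be sketched in NumPy; subtracting the max before exponentiating is a standard stability trick that does not change the result:

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores z into probabilities that sum to 1."""
    z = z - z.max()  # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

a = softmax(np.array([2.0, 1.0, 0.1]))  # largest score -> largest probability
```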
Cost

Logistic Regression:
z = w · x + b
a1 = g(z) = 1 / (1 + e^(-z)) = p(y = 1 | x)
a2 = 1 - a1 = p(y = 0 | x)

loss = -y log(a1) - (1 - y) log(1 - a1)
J(w, b) = average loss

SoftMax Regression:
a1 = e^z1 / (e^z1 + … + e^zN) = p(y = 1 | x)
⋮
aN = e^zN / (e^z1 + … + e^zN) = p(y = N | x)

loss(a1, …, aN, y) = -log(a1) if y = 1
                     -log(a2) if y = 2
                     ⋮
                     -log(aN) if y = N
Multiclass Classification NN Example

We want to build a neural network to recognize a handwritten digit (0–9).

• The input is the image pixels.
• The hidden layers use ReLU.
• The output layer has 10 units (one per digit), with SoftMax as the activation function to predict a probability between 0 and 1 for each class.
Multiclass Classification NN Example

We will build it using TensorFlow:

1. The first step is building and stringing together the NN layers.
2. Then we compile the model and select the loss function.
3. The last step is to train the NN.


Multiclass Classification NN Example

More numerically accurate implementation of the loss: instead of computing SoftMax in the output layer and then the cross-entropy loss separately, use a linear output layer and compute the loss directly from the logits; this reduces numerical round-off error.
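A sketch of why this matters, in plain NumPy. Computing softmax first and then -log can overflow; the log-sum-exp identity -log(softmax(z)_j) = -(z_j - m) + log Σ_k e^(z_k - m), with m = max(z), computes the same quantity stably (the example values are illustrative):

```python
import numpy as np

def unstable_nll(z, j):
    """-log(softmax(z)[j]) computed the direct way; overflows for large z."""
    a = np.exp(z) / np.exp(z).sum()
    return -np.log(a[j])

def stable_nll(z, j):
    """Same quantity via the log-sum-exp trick; safe for any magnitude of z."""
    m = z.max()
    return -(z[j] - m) + np.log(np.exp(z - m).sum())

z_small = np.array([1.0, 2.0, 3.0])
z_big = np.array([1000.0, 0.0, -1000.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive = unstable_nll(z_big, 0)  # nan: np.exp(1000) overflows to inf
stable = stable_nll(z_big, 0)       # tiny finite number, as it should be
```

This is the same idea behind letting the framework's loss function work on logits rather than on already-softmaxed probabilities.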
Advanced Optimization Algorithm

Adam (Adaptive Moment Estimation)

• If every step of gradient descent keeps going in the same direction: let's go faster (increase α).
• If every step keeps going in a different direction (oscillating): let's fix it (decrease α).

So instead of having a single learning rate for all parameters, Adam uses a different one for every parameter and changes it automatically through training.
Machine Learning Advice

Debugging a learning algorithm

You've implemented regularized linear regression on housing prices:

J(w, b) = (1/2m) Σ_{i=1..m} (f_{w,b}(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1..n} w_j^2

But it makes unacceptably large errors in predictions. What do you try next?

• Get more training examples
• Try smaller sets of features
• Try getting additional features
• Try a more complex model
• Try increasing λ
• Try decreasing λ
Evaluating a Model

Size | Price
2104 | 400
1600 | 330
2400 | 369
1416 | 232
3000 | 540
1985 | 300
1534 | 315
1427 | 199
1380 | 212
1494 | 243

We need to know how our model fits training and testing data, so we split the dataset (e.g. 70% train data, 30% test data) and compute the cost function for the train data and for the test data.

Sometimes the model fits the training data very well but fails to generalize to new examples that are not in the training set.
Model Selection
(choosing the right model)

Suppose we have to select one of these models:

d = 1  ⇒ f_{w,b}(x) = w1·x1 + b                          → fit w^<1>, b^<1>  ⇒ j_test(w^<1>, b^<1>)
d = 2  ⇒ f_{w,b}(x) = w1·x1 + w2·x1^2 + b                → fit w^<2>, b^<2>  ⇒ j_test(w^<2>, b^<2>)
⋮
d = 10 ⇒ f_{w,b}(x) = w1·x1 + w2·x1^2 + … + w10·x1^10 + b → fit w^<10>, b^<10> ⇒ j_test(w^<10>, b^<10>)

After calculating the test cost for every model, we found that j_test(w^<5>, b^<5>) has the lowest value. Should we select the fifth model?
Model Selection
(choosing the right model)

The problem is that j_test(w^<5>, b^<5>) is likely to be an optimistic estimate of the generalization error, i.e. an extra parameter, d (the degree of the polynomial), was chosen using the test set.

So we need an extra, independent set for model selection … What is its name?

The Cross-Validation Set (also called the validation set, development set, or dev set).


Model Selection
(choosing the right model)

Size | Price

60% Train Data:
2104 | 400
1600 | 330
2400 | 369
1416 | 232
3000 | 540
1985 | 300

20% Cross-Validation Data:
1534 | 315
1427 | 199

20% Test Data:
1380 | 212
1494 | 243
Model Selection
(choosing the right model)

d = 1  ⇒ f_{w,b}(x) = w1·x1 + b                          → fit w^<1>, b^<1>  ⇒ j_cv(w^<1>, b^<1>)
d = 2  ⇒ f_{w,b}(x) = w1·x1 + w2·x1^2 + b                → fit w^<2>, b^<2>  ⇒ j_cv(w^<2>, b^<2>)
⋮
d = 10 ⇒ f_{w,b}(x) = w1·x1 + w2·x1^2 + … + w10·x1^10 + b → fit w^<10>, b^<10> ⇒ j_cv(w^<10>, b^<10>)

We will select the model that has the lowest validation error, then calculate the estimated generalization error using the test set.
Model Selection
(choosing the right model)

1. Estimate the parameters using the Train Set.
2. Select the right model using the Validation Set.
3. Calculate the estimated generalization error using the Test Set.
Bias / Variance (Model Complexity)

• Underfitting (High Bias): j_train is high, j_cv is high
• Sweet spot ("just right"): j_train is low, j_cv is low
• Overfitting (High Variance): j_train is low, j_cv is high
Bias / Variance with Model Complexity

Plotting j_train(w, b) and j_cv(w, b) against the degree of the polynomial:

• High Bias (Underfit): j_train will be high, and j_train ≈ j_cv
• Just right: both j_train and j_cv are low
• High Variance (Overfit): j_cv ≫ j_train (j_train may be low)

Note: Sometimes we may have high bias AND high variance (j_train will be high AND j_cv ≫ j_train). This happens when the model overfits part of the input (a very complicated model) and, for the other part of the input, does not fit the data well (underfits).
Bias / Variance (with Regularization)

• Underfitting (High Bias): λ is too high ⇒ w will be ≈ zero
• Sweet spot ("just right"): λ is right
• Overfitting (High Variance): λ is too low ⇒ no regularization
Bias / Variance with Regularization

Plotting j_train(w, b) and j_cv(w, b) against λ:

• High Bias (Underfit): j_train will be high, and j_train ≈ j_cv (λ too large)
• Just right: both j_train and j_cv are low
• High Variance (Overfit): j_cv ≫ j_train, and j_train may be low (λ too small)

Note: The regularization curve is the horizontal flip of the polynomial-degree curve.
Establishing a baseline level of performance

Suppose we want to build a speech recognition application (audio to text). After building the system we got these errors:

Training error: j_train = 10.8%
Cross-validation error: j_cv = 14.8%

How do we know if these error values are good or bad? We need to establish a baseline level of performance to compare with.

Human-level performance = 10.6%
Training error j_train = 10.8%        (gap: 0.2%)
Cross-validation error j_cv = 14.8%   (gap: 4.0%)

Now we can see that the training error is close to the human error, but the cross-validation error is much higher than the training error ⇒ high variance (overfitting).
Establishing a baseline level of performance

Human-level performance = 10.6%
Training error j_train = 15.0%        (gap: 4.4%)
Cross-validation error j_cv = 15.5%   (gap: 0.5%)

The training error and cross-validation error are much higher than the human error ⇒ high bias (underfitting).

Human-level performance = 10.6%
Training error j_train = 15.0%        (gap: 4.4%)
Cross-validation error j_cv = 19.7%   (gap: 4.7%)

The training error is much higher than the human error, AND the cross-validation error is much higher than the training error ⇒ high bias AND high variance.
Learning Curves

Plotting j_train and j_cv against m (the training set size), with human-level performance as the baseline:

• Underfitting (high bias): both j_train and j_cv flatten out well above human-level performance, with only a small gap between them.
• Sweet spot ("just right"): both curves approach human-level performance as m grows.
• Overfitting (high variance): there is a large gap between j_cv and j_train, with j_train possibly even below human-level performance.

If a learning algorithm suffers from high bias, getting more training data will not (by itself) help much. If a learning algorithm suffers from high variance, getting more training data is likely to help.
Debugging a learning algorithm

You've implemented regularized linear regression on housing prices:

J(w, b) = (1/2m) Σ_{i=1..m} (f_{w,b}(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1..n} w_j^2

But it makes unacceptably large errors in predictions. What do you try next?

• Get more training examples → fixes high variance
• Try smaller sets of features → fixes high variance
• Try getting additional features → fixes high bias
• Try a more complex model → fixes high bias
• Try increasing λ → fixes high bias
• Try decreasing λ → fixes high variance
Neural Networks Bias / Variance

Neural networks are low-bias models.

1. Does it do well on the training set? If no ⇒ use a bigger network and train again.
2. If yes: does it do well on the cross-validation set? If no ⇒ get more data and go back to step 1.
3. If yes ⇒ done!
Iterative Loop of ML Development

Start here ⇒ Choose architecture (model, data, etc.) ⇒ Train model ⇒ Diagnostics (bias, variance, and error analysis) ⇒ back to choosing the architecture.
Data Augmentation

Augmentation: modifying existing training examples to create new training examples.

This technique is used especially for image and audio data, where it can increase your training set size significantly.

The distortion introduced should be representative of the type of noise/distortion in the test set.
Full Cycle of a Machine Learning Project

1. Scope the project (define the project).
2. Collect data (define and collect the data).
3. Train the model (training, error analysis & iterative improvement).
4. Deploy in production (deploy, monitor, and maintain the system).
Decision Tree Model

Cat Classification Example

Ear Shape | Face Shape | Whiskers | Cat
Pointy | Round | Present | 1
Floppy | Not Round | Present | 1
Floppy | Round | Absent | 0
Pointy | Not Round | Present | 0
Pointy | Round | Present | 1
Pointy | Round | Absent | 1
Floppy | Not Round | Absent | 0
Pointy | Round | Absent | 1
Floppy | Round | Absent | 0
Floppy | Round | Absent | 0
Cat Classification Example

Root node: Ear Shape
• Pointy ⇒ decision node: Face Shape
  • Round ⇒ leaf node: Cat
  • Not Round ⇒ leaf node: Not Cat
• Floppy ⇒ decision node: Whiskers
  • Present ⇒ leaf node: Cat
  • Absent ⇒ leaf node: Not Cat
Decision Tree Learning

Decision 1: How to choose what feature to split on at each node?

Maximize purity (or minimize impurity).

• Ear Shape: Pointy ⇒ 4/5 cats, Floppy ⇒ 1/5 cats ("4/5 cats" means there are 5 examples with pointy ear shapes, and 4 of them are cats)
• Face Shape: Round ⇒ 4/7 cats, Not Round ⇒ 1/3 cats
• Cat DNA (a hypothetical perfect feature): Yes ⇒ 5/5 cats, No ⇒ 0/5 cats — this feature maximizes the purity (best information gain)
Decision Tree Learning
(But how do we calculate the impurity?)

Entropy as a measure of impurity: H(p1) = -p1 log2(p1) - (1 - p1) log2(1 - p1)

p1 = 3/6 = 0.5  ⇒ -0.5 log2(0.5) - (1 - 0.5) log2(1 - 0.5) = 1
p1 = 5/6 ≈ 0.83 ⇒ -0.83 log2(0.83) - (1 - 0.83) log2(1 - 0.83) ≈ 0.66
p1 = 6/6 = 1    ⇒ -1 log2(1) - (1 - 1) log2(1 - 1) = 0 (taking 0 log2(0) = 0)
Information Gain

Information Gain: the reduction of entropy compared to if we hadn't split at all.

Information Gain = H(p1^root) - (w^left · H(p1^left) + w^right · H(p1^right))

Example (Ear Shape): p1^root = 5/10 = 0.5; left branch (Pointy): p1^left = 4/5 = 0.8, w^left = 5/10; right branch (Floppy): p1^right = 1/5 = 0.2, w^right = 5/10.

Comparing the three candidate splits (the root entropy is H(0.5) = 1 in all cases):

Ear Shape: 4/5 cats vs 1/5 cats ⇒ H(0.8) = 0.72, H(0.2) = 0.72
  IG = H(0.5) - (5/10 · H(0.8) + 5/10 · H(0.2)) = 0.28 ← best information gain

Face Shape: 4/7 cats vs 1/3 cats ⇒ H(0.57) = 0.99, H(0.33) = 0.92
  IG = H(0.5) - (7/10 · H(0.57) + 3/10 · H(0.33)) = 0.03

Whiskers: 3/4 cats vs 2/6 cats ⇒ H(0.75) = 0.81, H(0.33) = 0.92
  IG = H(0.5) - (4/10 · H(0.75) + 6/10 · H(0.33)) = 0.12
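The three information-gain calculations above can be reproduced with a small Python sketch:

```python
import math

def entropy(p):
    """Binary entropy, with 0*log2(0) taken as 0."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(p_root, left, right):
    """left and right are (weight, p1) pairs for the two branches."""
    (w_l, p_l), (w_r, p_r) = left, right
    return entropy(p_root) - (w_l * entropy(p_l) + w_r * entropy(p_r))

ear      = information_gain(0.5, (5/10, 4/5), (5/10, 1/5))  # ≈ 0.28
face     = information_gain(0.5, (7/10, 4/7), (3/10, 1/3))  # ≈ 0.03
whiskers = information_gain(0.5, (4/10, 3/4), (6/10, 2/6))  # ≈ 0.12
```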
Decision Tree Learning

Decision 2: When do you stop splitting?

• When a node is 100% one class.
• When splitting a node would result in the tree exceeding a maximum depth.
• When improvements in the purity score are below a threshold.
• When the number of examples in a node is below a threshold.
Decision Tree Learning

1. Start with all examples at the root node.
2. Calculate the information gain for all the possible features and pick the one with the highest information gain.
3. Split the dataset according to the selected feature, creating left and right branches of the tree.
4. Keep repeating the splitting process until a stopping criterion is met:
   • A node is 100% one class.
   • Splitting a node would result in the tree exceeding a maximum depth.
   • The information gained from additional splits is below a threshold.
   • The number of examples in a node is below a threshold.
One-Hot Encoding of Categorical Features

We saw that all features in the decision tree had only 2 classes (Ear Shape: Pointy or Floppy; Face Shape: Round or Not Round; …).

What if we have a feature that has more than 2 classes? We apply One-Hot Encoding. Here Ear has 3 classes (Pointy, Floppy, Round), so it becomes 3 binary columns:

Before:
Ear | Face | Whiskers | Is Cat
Pointy | Round | 1 | 1
Floppy | Not Round | 1 | 1
Round | Round | 0 | 0
Pointy | Not Round | 0 | 0

After:
Pointy | Floppy | Round Ear | Face | Whiskers | Is Cat
1 | 0 | 0 | Round | 1 | 1
0 | 1 | 0 | Not Round | 1 | 1
0 | 0 | 1 | Round | 0 | 0
1 | 0 | 0 | Not Round | 0 | 0
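A minimal, dependency-free sketch of one-hot encoding a single column (the sample values are illustrative):

```python
def one_hot(values):
    """One-hot encode a categorical column: one binary column per class."""
    classes = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in classes}

ears = ["Pointy", "Floppy", "Round", "Pointy"]
enc = one_hot(ears)
# enc["Pointy"] is [1, 0, 0, 1]: a 1 wherever the original value was "Pointy"
```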
Continuous Features

Weight | Is Cat
6 | 1
20 | 0
9 | 1
16 | 0
12 | 1
11.5 | 0
8 | 1
15 | 0
10 | 1
17 | 0
7 | 1

We will try many thresholds to split the range of the continuous feature, and take the threshold that minimizes the impurity.
Continuous Features

One way to do this:

1. Sort the values in ascending order: 6, 7, 8, 9, 10, 11.5, 12, 15, 16, 17, 20.
2. Calculate the average of every pair of adjacent values: 6.5, 7.5, 8.5, 9.5, 10.75, 11.75, 13.5, 15.5, 16.5, 18.5.
3. Calculate the impurity if we select each average as a threshold.
4. Set the average with the minimum impurity as the threshold.
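The four steps above can be sketched in Python, using weighted entropy as the impurity measure (the helper names are my own; the data is the Weight table from the slide):

```python
import math

def entropy(p):
    """Binary entropy, with 0*log2(0) taken as 0."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_threshold(xs, ys):
    """Try the midpoint of every adjacent pair of sorted values and
    return the one with the lowest weighted entropy (highest purity)."""
    pairs = sorted(zip(xs, ys))
    vals = [x for x, _ in pairs]
    best = None
    for lo, hi in zip(vals, vals[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        w = (len(left) / len(pairs)) * entropy(sum(left) / len(left)) \
          + (len(right) / len(pairs)) * entropy(sum(right) / len(right))
        if best is None or w < best[1]:
            best = (t, w)
    return best[0]

weights = [6, 20, 9, 16, 12, 11.5, 8, 15, 10, 17, 7]
is_cat  = [1,  0, 1,  0,  1,  0,   1,  0,  1,  0, 1]
best_threshold(weights, is_cat)  # the midpoint that best separates cats
```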
Multiple Decision Trees

One of the weaknesses of using a single decision tree is that it may be sensitive to small changes in the data. What is the solution?

The solution is to build a lot of decision trees (a tree ensemble). For a new example, run it through every tree and let them vote:

Tree 1 → Cat
Tree 2 → Not Cat    ⇒ Vote: Not Cat
Tree 3 → Not Cat

How to build a lot of decision trees? Random Forest, XGBoost, …
Random Forest - XGBoost

Random Forest Algorithm

Random forest combines the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy.

These awesome Random Forest slides were taken from the StatQuest YouTube channel; the link is in the description.
Step 1: Create a "bootstrapped" dataset

To create a bootstrapped dataset that is the same size as the original, we randomly select samples from the original dataset. The important detail is that we can pick the same sample more than once (sampling with replacement).

Original Dataset:
Chest Pain | Good Blood Circ | Blocked Arteries | Weight | Heart Disease
0 | 0 | 0 | 125 | 0
1 | 1 | 1 | 180 | 1
1 | 1 | 0 | 210 | 0
1 | 0 | 1 | 167 | 1

Bootstrapped Dataset:
1 | 1 | 1 | 180 | 1
0 | 0 | 0 | 125 | 0
1 | 0 | 1 | 167 | 1
1 | 0 | 1 | 167 | 1   ← the same sample, picked twice
Step 2: Create a decision tree using the bootstrapped dataset, but only use a random subset of variables (or columns) at each step (in this example we will only consider 2 variables at each step).

At the root: randomly select 2 variables, then choose the variable that does the best job of separating the samples (assume that Good Blood Circ did).
Just like for the root, at each subsequent node we randomly select 2 variables from the remaining features and choose the best one.
We built a tree:
1. Using a bootstrapped dataset.
2. Only considering a random subset of variables at each step.

Now go back to Step 1 and repeat: make a new bootstrapped dataset and build a tree, considering a random subset of variables at each step.

Note: Bootstrapping the data plus using the aggregate (the vote) to make a decision is called Bagging.
Note: After generating the trees, we run a new example through all of them and take the majority vote.
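The two ingredients of a random forest (bootstrapping and per-split feature subsets) can be sketched with the standard library (the example rows mirror the heart-disease table above; the fixed seed is just for reproducibility):

```python
import random

def bootstrap_sample(dataset, rng):
    """Sampling with replacement: same size as the original,
    and the same row may appear more than once."""
    return [rng.choice(dataset) for _ in dataset]

def random_feature_subset(features, k, rng):
    """At each split, only k randomly chosen features are considered."""
    return rng.sample(features, k)

rng = random.Random(0)
data = [(0, 0, 0, 125, 0), (1, 1, 1, 180, 1),
        (1, 1, 0, 210, 0), (1, 0, 1, 167, 1)]
boot = bootstrap_sample(data, rng)  # 4 rows, possibly with repeats
cols = random_feature_subset(
    ["ChestPain", "GoodBloodCirc", "BlockedArteries", "Weight"], 2, rng)
```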
XGBoost

Given a training set of size m.

For b = 1 to B:

1. Use sampling with replacement to create a new training set of size m. But instead of picking from all examples with equal (1/m) probability, make it more likely to pick examples that the previously trained trees misclassified.
2. Train a decision tree on the new dataset.
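The weighted sampling in step 1 can be sketched with `random.choices` (the example names and the 3× weight for misclassified examples are illustrative assumptions):

```python
import random

def boosted_sample(examples, weights, rng):
    """Sampling with replacement where misclassified examples carry
    higher weight, so they are more likely to be picked again."""
    return rng.choices(examples, weights=weights, k=len(examples))

rng = random.Random(0)
examples = ["e1", "e2", "e3", "e4"]
misclassified = {"e2"}  # hypothetical: previous trees got e2 wrong
weights = [3.0 if e in misclassified else 1.0 for e in examples]
sample = boosted_sample(examples, weights, rng)  # size m, biased toward e2
```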


Decision Trees vs Neural Networks

Decision trees and tree ensembles:
• Work well on tabular (structured) data.
• Not recommended for unstructured data (images, audio, text).
• Fast to train.
• Small decision trees may be human-interpretable.

Neural networks:
• Work well on all types of data, including tabular (structured) and unstructured data.
• May be slower than decision trees.
• Work with transfer learning.
• When building a system of multiple models working together, it might be easier to string together multiple neural networks.
Oday Mourad

I am happy to share Data Science and Machine Learning basics with my wonderful LinkedIn friends. Please follow me and let's learn together.
