Advanced Machine Learning: Neural Networks – Decision Trees – Random Forest – XGBoost


Advanced Machine Learning

Neural Networks – Decision Trees – Random Forest – XGBoost

By Oday Mourad


Neural Networks

Input layer x (price, shipping cost, marketing, material quality) → Hidden layer (affordability, awareness, perceived quality) → Output layer a → ŷ

We could say that neural networks automatically engineer the data to extract new features and control how much each feature contributes to the prediction.
Neural Networks Notation

X = A^[0] → A^[1] → A^[2] → A^[3] → A^[4]
(input layer)  (layer 1)  (layer 2)  (layer 3)  (layer 4)

Every unit (neuron) contains its own parameters w and b, and an activation function like sigmoid, tanh, ReLU, …

Let's zoom in on the third layer.
Neural Networks Notation

Zooming in on layer 3, where g is the activation function:

A1^[3] = g(W1^[3] · A^[2] + b1^[3])   (parameters W1^[3], b1^[3])
A2^[3] = g(W2^[3] · A^[2] + b2^[3])   (parameters W2^[3], b2^[3])
A3^[3] = g(W3^[3] · A^[2] + b3^[3])   (parameters W3^[3], b3^[3])

In general, for unit j of layer l:

Aj^[l] = g(Wj^[l] · A^[l-1] + bj^[l])
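The per-unit formula above can be vectorized across a whole layer; here is a minimal NumPy sketch (the zero weights and the 2-input, 3-unit layer are illustrative assumptions, not values from the slides):

```python
import numpy as np

def dense(A_prev, W, b, g):
    """One fully connected layer: A = g(A_prev @ W + b)."""
    return g(A_prev @ W + b)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A toy layer with 2 inputs and 3 units.
A0 = np.array([[1.0, 2.0]])   # one example, shape (1, 2)
W1 = np.zeros((2, 3))         # shape (n_inputs, n_units)
b1 = np.zeros(3)
A1 = dense(A0, W1, b1, sigmoid)  # shape (1, 3); all 0.5 with zero weights
```

Stacking calls to `dense` (feeding each layer's output into the next) gives the full forward pass X = A^[0] → A^[1] → … → A^[4].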
Neural Networks Notation (Matrix Shapes)

Activation shapes (m: # of training examples, n: # of input features):

X = A^[0]: (m, n)   A^[1]: (m, 4)   A^[2]: (m, 5)   A^[3]: (m, 3)   A^[4]: (m, 1)

Weight shapes: W^[1]: (n, 4)   W^[2]: (4, 5)   W^[3]: (5, 3)   W^[4]: (3, 1)
Activation Function Types (Sigmoid)

g(z) = Sigmoid(z) = 1 / (1 + e^(-z))

Binary classification. It is commonly used for models where we have to predict a probability as the output. Since the probability of anything exists only in the range 0 to 1, sigmoid is the right choice because of its range (it crosses 0.5 at z = 0).
Activation Function Types (ReLU)

g(z) = ReLU(z) = max(0, z)

Regression. It is commonly used for models where we have to predict any positive numeric value as an output (as in regression problems like house price), and ReLU is the most common choice for hidden layers because of its low compute time and its non-flat part.
Activation Function Types (Linear Activation Function)

g(z) = z

Regression. It is commonly used for models where we have to predict any numeric value, positive or negative, as an output (as in regression problems like total profit).

Note: There are many other activation function types, like tanh, leaky ReLU, and SoftMax, and we will talk about them later.
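The three activation functions above can be sketched in a few lines of NumPy (the sample inputs are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Squashes any z into (0, 1); sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Zero for negative z, identity for positive z."""
    return np.maximum(0.0, z)

def linear(z):
    """Identity: passes z through unchanged."""
    return z

z = np.array([-2.0, 0.0, 3.0])
sigmoid(z)  # values in (0, 1), with 0.5 at z = 0
relu(z)     # negative inputs clipped to 0
linear(z)   # unchanged
```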
Binary Classification NN Example

We want to build a neural network to predict whether a handwritten image is the digit 0 or 1.

• The input is the image pixels.
• The activation function for the output neuron is sigmoid, to predict a probability between 0 and 1.
• Threshold: predict 1 if ŷ ≥ 0.5, otherwise 0.
• We will use the cross-entropy loss function because we are trying to predict one of two classes.
Binary Classification NN Example

We will build it using TensorFlow:

1. The first step is building and stringing together the NN layers.
2. Then we compile the model and select the loss function.
3. The last step is to train the NN.
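A minimal sketch of those three steps with the Keras API (the layer sizes, optimizer, and the tiny random stand-in dataset are illustrative assumptions, not the slide's exact code):

```python
import numpy as np
import tensorflow as tf

# Step 1: build and string together the NN layers.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output probability
])

# Step 2: compile the model and select the loss function.
model.compile(loss="binary_crossentropy", optimizer="adam")

# Step 3: train the NN (random toy data standing in for image pixels).
X = np.random.rand(8, 4).astype("float32")
y = np.random.randint(0, 2, size=(8, 1)).astype("float32")
model.fit(X, y, epochs=1, verbose=0)

p = model.predict(X, verbose=0)  # probabilities; threshold at 0.5 to classify
```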


Multiclass Classification

What if we have more than two classes?

• Recognize all numbers (0–9).
• Recognize whether a patient has one of three or five different diseases.
• Recognize animal categories.

With 2 classes, the model outputs p(y = 0 | x) and p(y = 1 | x). How do we apply this to more classes? With SoftMax.
SoftMax (4 Possible Outputs)

Logistic Regression:
z = w · x + b
a1 = g(z) = 1 / (1 + e^(-z)) = p(y = 1 | x)
a2 = 1 - a1 = p(y = 0 | x)

SoftMax Regression:
z1 = w1 · x + b1   a1 = e^z1 / (e^z1 + e^z2 + e^z3 + e^z4)
z2 = w2 · x + b2   a2 = e^z2 / (e^z1 + e^z2 + e^z3 + e^z4)
z3 = w3 · x + b3   a3 = e^z3 / (e^z1 + e^z2 + e^z3 + e^z4)
z4 = w4 · x + b4   a4 = e^z4 / (e^z1 + e^z2 + e^z3 + e^z4)
SoftMax (N Possible Outputs)

aj = e^zj / Σ_{k=1..N} e^zk = P(y = j | x)

zj = wj · x + bj,   j = 1, 2, …, N

Note: a1 + a2 + … + aN = 1
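The N-output formula can be sketched in NumPy; subtracting the max before exponentiating is a standard stability trick that does not change the result:

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores z into probabilities that sum to 1."""
    z = z - z.max()  # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()

a = softmax(np.array([2.0, 1.0, 0.1]))  # largest score -> largest probability
```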
Cost

Logistic Regression:
z = w · x + b
a1 = g(z) = 1 / (1 + e^(-z)) = p(y = 1 | x)
a2 = 1 - a1 = p(y = 0 | x)

loss = -y log(a1) - (1 - y) log(1 - a1)
J(w, b) = average loss

SoftMax Regression:
a1 = e^z1 / (e^z1 + … + e^zN) = p(y = 1 | x)
⋮
aN = e^zN / (e^z1 + … + e^zN) = p(y = N | x)

loss(a1, …, aN, y) = -log(a1) if y = 1
                     -log(a2) if y = 2
                     ⋮
                     -log(aN) if y = N
Multiclass Classification NN Example

We want to build a neural network to recognize a handwritten digit (0–9).

• The input is the image pixels.
• The hidden layers use ReLU.
• The output layer has 10 units (one per digit), with SoftMax as the activation function to predict a probability between 0 and 1 for each class.
Multiclass Classification NN Example

We will build it using TensorFlow:

1. The first step is building and stringing together the NN layers.
2. Then we compile the model and select the loss function.
3. The last step is to train the NN.


Multiclass Classification NN Example

More numerically accurate implementation of the loss: instead of computing SoftMax in the output layer and then the cross-entropy loss separately, use a linear output layer and compute the loss directly from the logits; this reduces numerical round-off error.
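A sketch of why this matters, in plain NumPy. Computing softmax first and then -log can overflow; the log-sum-exp identity -log(softmax(z)_j) = -(z_j - m) + log Σ_k e^(z_k - m), with m = max(z), computes the same quantity stably (the example values are illustrative):

```python
import numpy as np

def unstable_nll(z, j):
    """-log(softmax(z)[j]) computed the direct way; overflows for large z."""
    a = np.exp(z) / np.exp(z).sum()
    return -np.log(a[j])

def stable_nll(z, j):
    """Same quantity via the log-sum-exp trick; safe for any magnitude of z."""
    m = z.max()
    return -(z[j] - m) + np.log(np.exp(z - m).sum())

z_small = np.array([1.0, 2.0, 3.0])
z_big = np.array([1000.0, 0.0, -1000.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive = unstable_nll(z_big, 0)  # nan: np.exp(1000) overflows to inf
stable = stable_nll(z_big, 0)       # tiny finite number, as it should be
```

This is the same idea behind letting the framework's loss function work on logits rather than on already-softmaxed probabilities.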
Advanced Optimization Algorithm

Adam (Adaptive Moment Estimation)

• If every step of gradient descent keeps going in the same direction: let's go faster (increase α).
• If every step keeps going in a different direction (oscillating): let's fix it (decrease α).

So instead of having a single learning rate for all parameters, Adam uses a different one for every parameter and changes it automatically through training.
Machine Learning Advice

Debugging a learning algorithm

You've implemented regularized linear regression on housing prices:

J(w, b) = (1/2m) Σ_{i=1..m} (f_{w,b}(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1..n} w_j^2

But it makes unacceptably large errors in predictions. What do you try next?

• Get more training examples
• Try smaller sets of features
• Try getting additional features
• Try a more complex model
• Try increasing λ
• Try decreasing λ
Evaluating a Model

Size | Price
2104 | 400
1600 | 330
2400 | 369
1416 | 232
3000 | 540
1985 | 300
1534 | 315
1427 | 199
1380 | 212
1494 | 243

We need to know how our model fits training and testing data, so we split the dataset (e.g. 70% train data, 30% test data) and compute the cost function for the train data and for the test data.

Sometimes the model fits the training data very well but fails to generalize to new examples that are not in the training set.
Model Selection
(choosing the right model)

Suppose we have to select one of these models:

d = 1  ⇒ f_{w,b}(x) = w1·x1 + b                          → fit w^<1>, b^<1>  ⇒ j_test(w^<1>, b^<1>)
d = 2  ⇒ f_{w,b}(x) = w1·x1 + w2·x1^2 + b                → fit w^<2>, b^<2>  ⇒ j_test(w^<2>, b^<2>)
⋮
d = 10 ⇒ f_{w,b}(x) = w1·x1 + w2·x1^2 + … + w10·x1^10 + b → fit w^<10>, b^<10> ⇒ j_test(w^<10>, b^<10>)

After calculating the test cost for every model, we found that j_test(w^<5>, b^<5>) has the lowest value. Should we select the fifth model?
Model Selection
(choosing the right model)

The problem is that j_test(w^<5>, b^<5>) is likely to be an optimistic estimate of the generalization error, i.e. an extra parameter, d (the degree of the polynomial), was chosen using the test set.

So we need an extra, independent set for model selection … What is its name?

The Cross-Validation Set (also called the validation set, development set, or dev set).


Model Selection
(choosing the right model)

Size | Price

60% Train Data:
2104 | 400
1600 | 330
2400 | 369
1416 | 232
3000 | 540
1985 | 300

20% Cross-Validation Data:
1534 | 315
1427 | 199

20% Test Data:
1380 | 212
1494 | 243
Model Selection
(choosing the right model)

d = 1  ⇒ f_{w,b}(x) = w1·x1 + b                          → fit w^<1>, b^<1>  ⇒ j_cv(w^<1>, b^<1>)
d = 2  ⇒ f_{w,b}(x) = w1·x1 + w2·x1^2 + b                → fit w^<2>, b^<2>  ⇒ j_cv(w^<2>, b^<2>)
⋮
d = 10 ⇒ f_{w,b}(x) = w1·x1 + w2·x1^2 + … + w10·x1^10 + b → fit w^<10>, b^<10> ⇒ j_cv(w^<10>, b^<10>)

We will select the model that has the lowest validation error, then calculate the estimated generalization error using the test set.
Model Selection
(choosing the right model)

1. Estimate the parameters using the Train Set.
2. Select the right model using the Validation Set.
3. Calculate the estimated generalization error using the Test Set.
Bias / Variance (Model Complexity)

• Underfitting (High Bias): j_train is high, j_cv is high
• Sweet spot ("just right"): j_train is low, j_cv is low
• Overfitting (High Variance): j_train is low, j_cv is high
Bias / Variance with Model Complexity

Plotting j_train(w, b) and j_cv(w, b) against the degree of the polynomial:

• High Bias (Underfit): j_train will be high, and j_train ≈ j_cv
• Just right: both j_train and j_cv are low
• High Variance (Overfit): j_cv ≫ j_train (j_train may be low)

Note: Sometimes we may have high bias AND high variance (j_train will be high AND j_cv ≫ j_train). This happens when the model overfits part of the input (a very complicated model) and, for the other part of the input, does not fit the data well (underfits).
Bias / Variance (with Regularization)

• Underfitting (High Bias): λ is too high ⇒ w will be ≈ zero
• Sweet spot ("just right"): λ is right
• Overfitting (High Variance): λ is too low ⇒ no regularization
Bias / Variance with Regularization

Plotting j_train(w, b) and j_cv(w, b) against λ:

• High Bias (Underfit): j_train will be high, and j_train ≈ j_cv (λ too large)
• Just right: both j_train and j_cv are low
• High Variance (Overfit): j_cv ≫ j_train, and j_train may be low (λ too small)

Note: The regularization curve is the horizontal flip of the polynomial-degree curve.
Establishing a baseline level of performance

Suppose we want to build a speech recognition application (audio to text). After building the system we got these errors:

Training error: j_train = 10.8%
Cross-validation error: j_cv = 14.8%

How do we know if these error values are good or bad? We need to establish a baseline level of performance to compare with.

Human-level performance = 10.6%
Training error j_train = 10.8%        (gap: 0.2%)
Cross-validation error j_cv = 14.8%   (gap: 4.0%)

Now we can see that the training error is close to the human error, but the cross-validation error is much higher than the training error ⇒ high variance (overfitting).
Establishing a baseline level of performance

Human-level performance = 10.6%
Training error j_train = 15.0%        (gap: 4.4%)
Cross-validation error j_cv = 15.5%   (gap: 0.5%)

The training error and cross-validation error are much higher than the human error ⇒ high bias (underfitting).

Human-level performance = 10.6%
Training error j_train = 15.0%        (gap: 4.4%)
Cross-validation error j_cv = 19.7%   (gap: 4.7%)

The training error is much higher than the human error, AND the cross-validation error is much higher than the training error ⇒ high bias AND high variance.
Learning Curves

Plotting j_train and j_cv against m (the training set size), with human-level performance as the baseline:

• Underfitting (high bias): both j_train and j_cv flatten out well above human-level performance, with only a small gap between them.
• Sweet spot ("just right"): both curves approach human-level performance as m grows.
• Overfitting (high variance): there is a large gap between j_cv and j_train, with j_train possibly even below human-level performance.

If a learning algorithm suffers from high bias, getting more training data will not (by itself) help much. If a learning algorithm suffers from high variance, getting more training data is likely to help.
Debugging a learning algorithm

You've implemented regularized linear regression on housing prices:

J(w, b) = (1/2m) Σ_{i=1..m} (f_{w,b}(x^(i)) - y^(i))^2 + (λ/2m) Σ_{j=1..n} w_j^2

But it makes unacceptably large errors in predictions. What do you try next?

• Get more training examples → fixes high variance
• Try smaller sets of features → fixes high variance
• Try getting additional features → fixes high bias
• Try a more complex model → fixes high bias
• Try increasing λ → fixes high bias
• Try decreasing λ → fixes high variance
Neural Networks Bias / Variance

Neural networks are low-bias models.

1. Does it do well on the training set? If no ⇒ use a bigger network and train again.
2. If yes: does it do well on the cross-validation set? If no ⇒ get more data and go back to step 1.
3. If yes ⇒ done!
Iterative Loop of ML Development

Start here ⇒ Choose architecture (model, data, etc.) ⇒ Train model ⇒ Diagnostics (bias, variance, and error analysis) ⇒ back to choosing the architecture.
Data Augmentation

Augmentation: modifying existing training examples to create new training examples.

This technique is used especially for image and audio data, where it can increase your training set size significantly.

The distortion introduced should be representative of the type of noise/distortion in the test set.
Full Cycle of a Machine Learning Project

1. Scope the project (define the project).
2. Collect data (define and collect the data).
3. Train the model (training, error analysis & iterative improvement).
4. Deploy in production (deploy, monitor, and maintain the system).
Decision Tree Model

Cat Classification Example

Ear Shape | Face Shape | Whiskers | Cat
Pointy | Round | Present | 1
Floppy | Not Round | Present | 1
Floppy | Round | Absent | 0
Pointy | Not Round | Present | 0
Pointy | Round | Present | 1
Pointy | Round | Absent | 1
Floppy | Not Round | Absent | 0
Pointy | Round | Absent | 1
Floppy | Round | Absent | 0
Floppy | Round | Absent | 0
Cat Classification Example

Root node: Ear Shape
• Pointy ⇒ decision node: Face Shape
  • Round ⇒ leaf node: Cat
  • Not Round ⇒ leaf node: Not Cat
• Floppy ⇒ decision node: Whiskers
  • Present ⇒ leaf node: Cat
  • Absent ⇒ leaf node: Not Cat
Decision Tree Learning

Decision 1: How to choose what feature to split on at each node?

Maximize purity (or minimize impurity).

• Ear Shape: Pointy ⇒ 4/5 cats, Floppy ⇒ 1/5 cats ("4/5 cats" means there are 5 examples with pointy ear shapes, and 4 of them are cats)
• Face Shape: Round ⇒ 4/7 cats, Not Round ⇒ 1/3 cats
• Cat DNA (a hypothetical perfect feature): Yes ⇒ 5/5 cats, No ⇒ 0/5 cats — this feature maximizes the purity (best information gain)
Decision Tree Learning
(But how do we calculate the impurity?)

Entropy as a measure of impurity: H(p1) = -p1 log2(p1) - (1 - p1) log2(1 - p1)

p1 = 3/6 = 0.5  ⇒ -0.5 log2(0.5) - (1 - 0.5) log2(1 - 0.5) = 1
p1 = 5/6 ≈ 0.83 ⇒ -0.83 log2(0.83) - (1 - 0.83) log2(1 - 0.83) ≈ 0.66
p1 = 6/6 = 1    ⇒ -1 log2(1) - (1 - 1) log2(1 - 1) = 0 (taking 0 log2(0) = 0)
Information Gain

Information Gain: the reduction of entropy compared to if we hadn't split at all.

Information Gain = H(p1^root) - (w^left · H(p1^left) + w^right · H(p1^right))

Example (Ear Shape): p1^root = 5/10 = 0.5; left branch (Pointy): p1^left = 4/5 = 0.8, w^left = 5/10; right branch (Floppy): p1^right = 1/5 = 0.2, w^right = 5/10.

Comparing the three candidate splits (the root entropy is H(0.5) = 1 in all cases):

Ear Shape: 4/5 cats vs 1/5 cats ⇒ H(0.8) = 0.72, H(0.2) = 0.72
  IG = H(0.5) - (5/10 · H(0.8) + 5/10 · H(0.2)) = 0.28 ← best information gain

Face Shape: 4/7 cats vs 1/3 cats ⇒ H(0.57) = 0.99, H(0.33) = 0.92
  IG = H(0.5) - (7/10 · H(0.57) + 3/10 · H(0.33)) = 0.03

Whiskers: 3/4 cats vs 2/6 cats ⇒ H(0.75) = 0.81, H(0.33) = 0.92
  IG = H(0.5) - (4/10 · H(0.75) + 6/10 · H(0.33)) = 0.12
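The three information-gain calculations above can be reproduced with a small Python sketch:

```python
import math

def entropy(p):
    """Binary entropy, with 0*log2(0) taken as 0."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(p_root, left, right):
    """left and right are (weight, p1) pairs for the two branches."""
    (w_l, p_l), (w_r, p_r) = left, right
    return entropy(p_root) - (w_l * entropy(p_l) + w_r * entropy(p_r))

ear      = information_gain(0.5, (5/10, 4/5), (5/10, 1/5))  # ≈ 0.28
face     = information_gain(0.5, (7/10, 4/7), (3/10, 1/3))  # ≈ 0.03
whiskers = information_gain(0.5, (4/10, 3/4), (6/10, 2/6))  # ≈ 0.12
```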
Decision Tree Learning

Decision 2: When do you stop splitting?

• When a node is 100% one class.
• When splitting a node would result in the tree exceeding a maximum depth.
• When improvements in the purity score are below a threshold.
• When the number of examples in a node is below a threshold.
Decision Tree Learning

1. Start with all examples at the root node.
2. Calculate the information gain for all the possible features and pick the one with the highest information gain.
3. Split the dataset according to the selected feature, creating left and right branches of the tree.
4. Keep repeating the splitting process until a stopping criterion is met:
   • A node is 100% one class.
   • Splitting a node would result in the tree exceeding a maximum depth.
   • The information gained from additional splits is below a threshold.
   • The number of examples in a node is below a threshold.
One-Hot Encoding of Categorical Features

We saw that all features in the decision tree had only 2 classes (Ear Shape: Pointy or Floppy; Face Shape: Round or Not Round; …).

What if we have a feature that has more than 2 classes? We apply One-Hot Encoding. Here Ear has 3 classes (Pointy, Floppy, Round), so it becomes 3 binary columns:

Before:
Ear | Face | Whiskers | Is Cat
Pointy | Round | 1 | 1
Floppy | Not Round | 1 | 1
Round | Round | 0 | 0
Pointy | Not Round | 0 | 0

After:
Pointy | Floppy | Round Ear | Face | Whiskers | Is Cat
1 | 0 | 0 | Round | 1 | 1
0 | 1 | 0 | Not Round | 1 | 1
0 | 0 | 1 | Round | 0 | 0
1 | 0 | 0 | Not Round | 0 | 0
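A minimal, dependency-free sketch of one-hot encoding a single column (the sample values are illustrative):

```python
def one_hot(values):
    """One-hot encode a categorical column: one binary column per class."""
    classes = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in classes}

ears = ["Pointy", "Floppy", "Round", "Pointy"]
enc = one_hot(ears)
# enc["Pointy"] is [1, 0, 0, 1]: a 1 wherever the original value was "Pointy"
```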
Continuous Features

Weight | Is Cat
6 | 1
20 | 0
9 | 1
16 | 0
12 | 1
11.5 | 0
8 | 1
15 | 0
10 | 1
17 | 0
7 | 1

We will try many thresholds to split the range of the continuous feature, and take the threshold that minimizes the impurity.
Continuous Features

One way to do this:

1. Sort the values in ascending order: 6, 7, 8, 9, 10, 11.5, 12, 15, 16, 17, 20.
2. Calculate the average of every pair of adjacent values: 6.5, 7.5, 8.5, 9.5, 10.75, 11.75, 13.5, 15.5, 16.5, 18.5.
3. Calculate the impurity if we select each average as a threshold.
4. Set the average with the minimum impurity as the threshold.
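The four steps above can be sketched in Python, using weighted entropy as the impurity measure (the helper names are my own; the data is the Weight table from the slide):

```python
import math

def entropy(p):
    """Binary entropy, with 0*log2(0) taken as 0."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_threshold(xs, ys):
    """Try the midpoint of every adjacent pair of sorted values and
    return the one with the lowest weighted entropy (highest purity)."""
    pairs = sorted(zip(xs, ys))
    vals = [x for x, _ in pairs]
    best = None
    for lo, hi in zip(vals, vals[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        w = (len(left) / len(pairs)) * entropy(sum(left) / len(left)) \
          + (len(right) / len(pairs)) * entropy(sum(right) / len(right))
        if best is None or w < best[1]:
            best = (t, w)
    return best[0]

weights = [6, 20, 9, 16, 12, 11.5, 8, 15, 10, 17, 7]
is_cat  = [1,  0, 1,  0,  1,  0,   1,  0,  1,  0, 1]
best_threshold(weights, is_cat)  # the midpoint that best separates cats
```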
Multiple Decision Trees

One of the weaknesses of using a single decision tree is that it may be sensitive to small changes in the data. What is the solution?

The solution is to build a lot of decision trees (a tree ensemble). For a new example, run it through every tree and let them vote:

Tree 1 → Cat
Tree 2 → Not Cat    ⇒ Vote: Not Cat
Tree 3 → Not Cat

How to build a lot of decision trees? Random Forest, XGBoost, …
Random Forest - XGBoost

Random Forest Algorithm

Random forest combines the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy.

These awesome Random Forest slides were taken from the StatQuest YouTube channel; the link is in the description.
Step 1: Create a "bootstrapped" dataset

To create a bootstrapped dataset that is the same size as the original, we randomly select samples from the original dataset. The important detail is that we can pick the same sample more than once (sampling with replacement).

Original Dataset:
Chest Pain | Good Blood Circ | Blocked Arteries | Weight | Heart Disease
0 | 0 | 0 | 125 | 0
1 | 1 | 1 | 180 | 1
1 | 1 | 0 | 210 | 0
1 | 0 | 1 | 167 | 1

Bootstrapped Dataset:
1 | 1 | 1 | 180 | 1
0 | 0 | 0 | 125 | 0
1 | 0 | 1 | 167 | 1
1 | 0 | 1 | 167 | 1   ← the same sample, picked twice
Step 2: Create a decision tree using the bootstrapped dataset, but only use a random subset of variables (or columns) at each step (in this example we will only consider 2 variables at each step).

At the root: randomly select 2 variables, then choose the variable that does the best job of separating the samples (assume that Good Blood Circ did).
Just like for the root, at each subsequent node we randomly select 2 variables from the remaining features and choose the best one.
We built a tree:
1. Using a bootstrapped dataset.
2. Only considering a random subset of variables at each step.

Now go back to Step 1 and repeat: make a new bootstrapped dataset and build a tree, considering a random subset of variables at each step.

Note: Bootstrapping the data plus using the aggregate (the vote) to make a decision is called Bagging.
Note: After generating the trees, we run a new example through all of them and take the majority vote.
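The two ingredients of a random forest (bootstrapping and per-split feature subsets) can be sketched with the standard library (the example rows mirror the heart-disease table above; the fixed seed is just for reproducibility):

```python
import random

def bootstrap_sample(dataset, rng):
    """Sampling with replacement: same size as the original,
    and the same row may appear more than once."""
    return [rng.choice(dataset) for _ in dataset]

def random_feature_subset(features, k, rng):
    """At each split, only k randomly chosen features are considered."""
    return rng.sample(features, k)

rng = random.Random(0)
data = [(0, 0, 0, 125, 0), (1, 1, 1, 180, 1),
        (1, 1, 0, 210, 0), (1, 0, 1, 167, 1)]
boot = bootstrap_sample(data, rng)  # 4 rows, possibly with repeats
cols = random_feature_subset(
    ["ChestPain", "GoodBloodCirc", "BlockedArteries", "Weight"], 2, rng)
```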
XGBoost

Given a training set of size m.

For b = 1 to B:

1. Use sampling with replacement to create a new training set of size m. But instead of picking from all examples with equal (1/m) probability, make it more likely to pick examples that the previously trained trees misclassified.
2. Train a decision tree on the new dataset.
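The weighted sampling in step 1 can be sketched with `random.choices` (the example names and the 3× weight for misclassified examples are illustrative assumptions):

```python
import random

def boosted_sample(examples, weights, rng):
    """Sampling with replacement where misclassified examples carry
    higher weight, so they are more likely to be picked again."""
    return rng.choices(examples, weights=weights, k=len(examples))

rng = random.Random(0)
examples = ["e1", "e2", "e3", "e4"]
misclassified = {"e2"}  # hypothetical: previous trees got e2 wrong
weights = [3.0 if e in misclassified else 1.0 for e in examples]
sample = boosted_sample(examples, weights, rng)  # size m, biased toward e2
```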


Decision Trees vs Neural Networks

Decision trees and tree ensembles:
• Work well on tabular (structured) data.
• Not recommended for unstructured data (images, audio, text).
• Fast to train.
• Small decision trees may be human-interpretable.

Neural networks:
• Work well on all types of data, including tabular (structured) and unstructured data.
• May be slower than decision trees.
• Work with transfer learning.
• When building a system of multiple models working together, it might be easier to string together multiple neural networks.
Oday Mourad

I am happy to share Data Science and Machine Learning basics with my wonderful LinkedIn friends. Please follow me and let's learn together.
