Advanced Machine Learning: Neural Networks, Decision Trees, Random Forest, XGBoost
[Figure: a neural network. Inputs such as price, shipping cost, and marketing feed the input layer; the hidden layers compute intermediate features such as affordability and awareness; the output layer produces the prediction $\hat{y}$. Every unit (neuron) contains an activation function such as sigmoid, tanh, or ReLU. Let's zoom in on the third layer.]
Neural Networks Notation
Zooming in on the third layer, which takes $\vec{A}^{[2]}$ as input: each unit $j$ has parameters $\vec{W}_j^{[3]}, b_j^{[3]}$, and $g$ is the activation function.

$A_1^{[3]} = g(\vec{W}_1^{[3]} \cdot \vec{A}^{[2]} + b_1^{[3]})$
$A_2^{[3]} = g(\vec{W}_2^{[3]} \cdot \vec{A}^{[2]} + b_2^{[3]})$
$A_3^{[3]} = g(\vec{W}_3^{[3]} \cdot \vec{A}^{[2]} + b_3^{[3]})$
[Figure: shapes of the input and of the weight matrices for each layer, e.g. $W^{[1]}$ has shape $(4, 5)$; $m$ denotes the number of training examples and $n$ the number of input features.]
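To make the notation concrete, here is a minimal NumPy sketch of one dense layer. The layer sizes and the convention that each column of W holds one unit's weights are illustrative assumptions, not the slide's exact shapes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(a_prev, W, b, g=sigmoid):
    """One dense layer: column j of W holds the weights w_j of unit j,
    so a_j = g(w_j . a_prev + b_j)."""
    z = a_prev @ W + b          # one value per unit in this layer
    return g(z)

# Illustrative shapes only: layer 2 has 2 units, layer 3 has 3 units.
a2 = np.array([0.3, 0.7])       # activations coming from layer 2
W3 = np.random.randn(2, 3)      # 2 inputs -> 3 units
b3 = np.zeros(3)
a3 = dense(a2, W3, b3)          # a^[3], one activation per unit
print(a3)
```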
Activation Function Types ( Sigmoid )
$g(z) = \text{sigmoid}(z) = \frac{1}{1 + e^{-z}}$
Binary classification: the output is squashed between 0 and 1, so it can be read as a probability.
[Figure: the sigmoid curve, rising from 0 to 1.]
Activation Function Types ( ReLU )
$g(z) = \max(0, z)$
Regression: commonly used for models that predict a non-negative numeric value (as in a regression problem like house price). ReLU is also the most common choice for hidden layers because of its low compute time and because it is flat on only one side, which keeps gradient descent fast.
[Figure: the ReLU curve, 0 for $z < 0$ and equal to $z$ for $z \ge 0$.]
Activation Function Types ( Linear Activation Function )
$g(z) = z$
Regression: commonly used for models that predict any numeric value, positive or negative (as in a regression problem like total profit).
[Figure: the identity line $g(z) = z$.]
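A minimal NumPy sketch of the three activation functions discussed above; the sample inputs are just for illustration.

```python
import numpy as np

def sigmoid(z):          # output in (0, 1) -- binary classification output layer
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):             # output in [0, inf) -- common default for hidden layers
    return np.maximum(0.0, z)

def linear(z):           # output in (-inf, inf) -- regression output layer
    return z

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), linear(z))
```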
Binary Classification NN Example
[Figure: the input is, for example, image pixels; the output is 0 or 1, using the threshold $\hat{y} \ge 0.5$. We use the cross-entropy loss function because we are trying to predict one of two classes, i.e. the network outputs $p(y = 1|\vec{x})$ and $p(y = 0|\vec{x})$.]
That covers 2 classes. For a multiclass problem, how do we apply it? With softmax.
SoftMax (4 Possible outputs)
For comparison, logistic regression (2 classes):
$z = \vec{w} \cdot \vec{x} + b, \quad a_1 = g(z) = \frac{1}{1 + e^{-z}} = p(y = 1|\vec{x}), \quad a_2 = 1 - a_1 = p(y = 0|\vec{x})$

Softmax with 4 possible outputs:
$z_1 = \vec{w}_1 \cdot \vec{x} + b_1, \quad a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$
$z_2 = \vec{w}_2 \cdot \vec{x} + b_2, \quad a_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$
$z_3 = \vec{w}_3 \cdot \vec{x} + b_3, \quad a_3 = \frac{e^{z_3}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$
$z_4 = \vec{w}_4 \cdot \vec{x} + b_4, \quad a_4 = \frac{e^{z_4}}{e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4}}$
SoftMax (N Possible outputs)
$z_j = \vec{w}_j \cdot \vec{x} + b_j, \quad j = 1, 2, \dots, N$
$a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}} = P(y = j|\vec{x})$
Note: $a_1 + a_2 + \dots + a_N = 1$
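A minimal NumPy sketch of the softmax formula above; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something the slide requires.

```python
import numpy as np

def softmax(z):
    """a_j = exp(z_j) / sum_k exp(z_k); shifting by max(z) avoids overflow
    without changing the result."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

a = softmax(np.array([2.0, 1.0, 0.1, -1.0]))
print(a, a.sum())   # the probabilities sum to 1
```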
Cost

Logistic Regression:
$z = \vec{w} \cdot \vec{x} + b, \quad a_1 = g(z) = \frac{1}{1 + e^{-z}} = p(y = 1|\vec{x}), \quad a_2 = 1 - a_1 = p(y = 0|\vec{x})$
$loss = -y \log a_1 - (1 - y)\log(1 - a_1)$

Softmax Regression:
$a_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2} + \dots + e^{z_N}} = p(y = 1|\vec{x}), \;\; \dots, \;\; a_N = \frac{e^{z_N}}{e^{z_1} + e^{z_2} + \dots + e^{z_N}} = p(y = N|\vec{x})$
$loss(a_1, a_2, \dots, a_N, y) = \begin{cases} -\log a_1 & \text{if } y = 1 \\ -\log a_2 & \text{if } y = 2 \\ \;\;\vdots \\ -\log a_N & \text{if } y = N \end{cases}$
$J(\vec{w}, b) = \text{average loss over the training set}$
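A minimal NumPy sketch of both losses above; the example probabilities are made up. (In practice, frameworks such as TensorFlow/Keras can compute this loss directly from the logits for better numerical stability.)

```python
import numpy as np

def logistic_loss(a1, y):
    """Binary cross-entropy: -y*log(a1) - (1-y)*log(1-a1)."""
    return -y * np.log(a1) - (1 - y) * np.log(1 - a1)

def softmax_loss(a, y):
    """Multiclass cross-entropy: -log(a_y), with y = 1..N as on the slide."""
    return -np.log(a[y - 1])

a = np.array([0.7, 0.2, 0.05, 0.05])   # softmax output for 4 classes
print(logistic_loss(0.9, 1))            # small loss: confident and correct
print(softmax_loss(a, y=1))             # -log(0.7): correct class
print(softmax_loss(a, y=3))             # -log(0.05): large loss, wrong class
```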
Multiclass Classification NN Example
[Figure: a network for classifying an input into one of several classes (e.g. recognizing handwritten digits). The activation function for the output layer is softmax, which predicts a probability between 0 and 1 for each class.]
Machine Learning Advice
Debugging a learning algorithm
$J(\vec{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$
Size, Price — split 70% / 30%:
Train Data (70%): (2104, 400), (1600, 330), (2400, 369), (1416, 232), (3000, 540), (1985, 300), (1534, 315)
Test Data (30%): (1427, 199), (1380, 212), (1494, 243)

We need to know how well our model fits the training and testing data, so we compute the cost function on the train data and on the test data. Sometimes the model fits the training data very well but fails to generalize to new examples that are not in the training set.
Model Selection
(choosing the right model)
After calculating the test cost for every candidate model, we found that $j_{test}(w^{<5>}, b^{<5>})$ has the lowest value. Should we select the fifth model?
Model Selection
(choosing the right model)
No: if we use the test set to pick the model, $j_{test}$ becomes an optimistically biased estimate of the generalization error. So we need an extra, separate set just for model selection… What is its name?
Size, Price — split 60% / 20% / 20%:
Train Data (60%): (2104, 400), (1600, 330), (2400, 369), (1416, 232), (3000, 540), (1985, 300)
Cross-Validation Data (20%): (1534, 315), (1427, 199)
Test Data (20%): (1380, 212), (1494, 243)
Model Selection
(choosing the right model)
We select the model with the lowest cross-validation error, then estimate the generalization error using the test set.
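A possible way to produce the 60% / 20% / 20% split, assuming scikit-learn is available; the two-step use of train_test_split is one common pattern, and the arrays simply reuse the (size, price) table above. Because of shuffling, the exact rows in each split will differ from the slide's grouping.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([[2104], [1600], [2400], [1416], [3000],
              [1985], [1534], [1427], [1380], [1494]], dtype=float)
y = np.array([400, 330, 369, 232, 540, 300, 315, 199, 212, 243], dtype=float)

# First split off 20% for the test set, then split the remainder 75/25
# so the final proportions are 60% train / 20% cv / 20% test.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_cv), len(X_test))   # 6, 2, 2
```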
Model Selection
(choosing the right model)
[Figure: $j_{train}(\vec{w}, b)$ and $j_{cv}(\vec{w}, b)$ plotted against the degree of polynomial.]
High Bias (Underfit): $j_{train}$ will be high and $j_{train} \approx j_{cv}$
Just right: both $j_{train}$ and $j_{cv}$ are low
High Variance (Overfit): $j_{cv} \gg j_{train}$ ($j_{train}$ may be low)
Note: Sometimes we may have high bias and high variance at the same time ($j_{train}$ will be high AND $j_{cv} \gg j_{train}$). This happens when the model overfits part of the input (a very complicated model) and does not fit the data well (underfits) for the other part of the input.
Bias / Variance (with regularization)
[Figure: examples of underfitting, the sweet spot ("just right"), and overfitting, with $j_{train}(\vec{w}, b)$ and $j_{cv}(\vec{w}, b)$ plotted against the regularization parameter $\lambda$.]
High Bias (Underfit, $\lambda$ too large): $j_{train}$ will be high and $j_{train} \approx j_{cv}$
Just right
High Variance (Overfit, $\lambda$ too small): $j_{cv} \gg j_{train}$ ($j_{train}$ may be low)
Note: The regularization curve is the horizontal flip of the polynomial-degree curve.
Establishing a baseline level of performance
[Figure: three cases comparing human-level performance (the baseline), $j_{train}$, and $j_{cv}$. A large gap between the baseline and $j_{train}$ indicates high bias; a large gap between $j_{train}$ and $j_{cv}$ indicates high variance; small gaps in both places mean the model is doing about as well as it can.]
$J(\vec{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$
[Flowchart: train a model, then ask — does it do well on the training set? If no, diagnose and fix it (high bias), then retrain. If yes, does it do well on the cross-validation set? If no, diagnose and fix it (high variance), then retrain. If yes, done!]
[Flowchart: the iterative loop of development — start here: choose the architecture (model, data, etc.) → train the model → run diagnostics (bias, variance, and error analysis) → go back and adjust the architecture.]
Data Augmentation
This technique is used especially for image and audio data, and it can increase your training set size significantly.
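A rough sketch of two very simple image augmentations in NumPy (a horizontal flip and a small shift); the 28×28 image and the specific transforms are illustrative assumptions — real pipelines typically use library helpers instead.

```python
import numpy as np

def augment(image, shift=2):
    """Return simple variants of a (H, W) grayscale image: the original,
    a horizontal flip, and a copy shifted a few pixels to the right."""
    flipped = image[:, ::-1]
    shifted = np.roll(image, shift, axis=1)
    shifted[:, :shift] = 0          # zero-fill the wrapped-around columns
    return [image, flipped, shifted]

img = np.random.rand(28, 28)        # stand-in for one training image
augmented = augment(img)
print(len(augmented))               # each original image now yields 3 examples
```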
[Flowchart: the full cycle of a machine learning project — Scope project → Collect data → Train model → Deploy in production.]
Decision Tree Learning
[Figure: three candidate features for the split at the root node, with the resulting leaf nodes. Splitting on Ear Shape gives leaves with 4/5 cats and 1/5 cats; splitting on Face Shape gives 4/7 cats and 1/3 cats; splitting on Cat DNA gives 5/5 cats and 0/5 cats (perfectly pure leaves).]
Entropy measures the impurity of a node: $H(p_1) = -p_1 \log_2(p_1) - (1 - p_1)\log_2(1 - p_1)$, where $p_1$ is the fraction of cats (taking $0\log_2 0 = 0$).
$p_1 = \frac{3}{6} = 0.5 \;\Rightarrow\; -0.5\log_2 0.5 - (1 - 0.5)\log_2(1 - 0.5) = 1$
$p_1 = \frac{5}{6} \approx 0.83 \;\Rightarrow\; -0.83\log_2 0.83 - (1 - 0.83)\log_2(1 - 0.83) = 0.66$
$p_1 = \frac{6}{6} = 1 \;\Rightarrow\; -1\log_2 1 - (1 - 1)\log_2(1 - 1) = 0$
[Figure: the entropy curve $H(p_1)$ for $p_1$ from 0 to 1, peaking at 1 when $p_1 = 0.5$.]
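A minimal sketch of the entropy formula above in NumPy, reproducing the three values just computed.

```python
import numpy as np

def entropy(p1):
    """H(p1) = -p1*log2(p1) - (1 - p1)*log2(1 - p1), with 0*log2(0) taken as 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

print(entropy(3 / 6))   # 1.0
print(entropy(5 / 6))   # ~0.65 (the slide rounds p1 to 0.83 and gets 0.66)
print(entropy(6 / 6))   # 0.0
```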
Information Gain
$\text{Information Gain} = H(p_1^{root}) - \left(w^{left} H(p_1^{left}) + w^{right} H(p_1^{right})\right)$
For the Ear Shape split: $p_1^{left} = 4/5 = 0.8$, $p_1^{right} = 1/5 = 0.2$, with $w^{left} = w^{right} = 5/10$ and $p_1^{root} = 0.5$.

Comparing the three candidate splits (Ear Shape, Face Shape, Whiskers):
Ear Shape: leaves 4/5 and 1/5 cats; $p_1 = 0.8$ and $0.2$; $H(0.8) = 0.72$, $H(0.2) = 0.72$; gain $= H(0.5) - \left(\frac{5}{10}H(0.8) + \frac{5}{10}H(0.2)\right) = \mathbf{0.28}$
Face Shape: leaves 4/7 and 1/3 cats; $p_1 = 0.57$ and $0.33$; $H(0.57) = 0.99$, $H(0.33) = 0.92$; gain $= H(0.5) - \left(\frac{7}{10}H(0.57) + \frac{3}{10}H(0.33)\right) = \mathbf{0.03}$
Whiskers: leaves 3/4 and 2/6 cats; $p_1 = 0.75$ and $0.33$; $H(0.75) = 0.81$, $H(0.33) = 0.92$; gain $= H(0.5) - \left(\frac{4}{10}H(0.75) + \frac{6}{10}H(0.33)\right) = \mathbf{0.12}$
Ear Shape gives the best information gain, so it is chosen for the split.
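A minimal sketch of the information-gain computation, reusing the entropy helper; the (cats, examples) pairs mirror the slide's three candidate splits.

```python
import numpy as np

def entropy(p1):
    if p1 in (0, 1):
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(p_root, left, right):
    """left/right are (num_cats, num_examples) at each child node."""
    n = left[1] + right[1]
    w_left, w_right = left[1] / n, right[1] / n
    h_children = w_left * entropy(left[0] / left[1]) + w_right * entropy(right[0] / right[1])
    return entropy(p_root) - h_children

# The three candidate splits from the slide (the root node has 5/10 cats):
print(information_gain(0.5, (4, 5), (1, 5)))   # Ear Shape  -> ~0.28
print(information_gain(0.5, (4, 7), (1, 3)))   # Face Shape -> ~0.03
print(information_gain(0.5, (3, 4), (2, 6)))   # Whiskers   -> ~0.12
```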
Decision Tree Learning
So far, every feature in the decision tree had only 2 possible values (Ear Shape: pointy or floppy; Face Shape: round or not round; …). What if a feature has more than 2 possible values? We apply One-Hot Encoding.
Before encoding (Ear Shape has 3 classes; Face Shape and Whiskers have 2 each):
Ear Shape | Face Shape | Whiskers | Is Cat
Pointy | Round | 1 | 1
Floppy | Not Round | 1 | 1
Round | Round | 0 | 0
Pointy | Not Round | 0 | 0

After one-hot encoding Ear Shape into three 0/1 columns:
Pointy | Floppy | Round | Face Shape | Whiskers | Is Cat
1 | 0 | 0 | Round | 1 | 1
0 | 1 | 0 | Not Round | 1 | 1
0 | 0 | 1 | Round | 0 | 0
1 | 0 | 0 | Not Round | 0 | 0
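A possible one-hot encoding of the table above, assuming pandas is available; the column names are my own labels for the slide's features.

```python
import pandas as pd

df = pd.DataFrame({
    "EarShape":  ["Pointy", "Floppy", "Round", "Pointy"],
    "FaceShape": ["Round", "Not Round", "Round", "Not Round"],
    "Whiskers":  [1, 1, 0, 0],
    "IsCat":     [1, 1, 0, 0],
})

# One-hot encode only the 3-class feature; 2-class features can stay as 0/1.
encoded = pd.get_dummies(df, columns=["EarShape"])
print(encoded)
```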
Continuous Feature
Weight | Is Cat
6 | 1
20 | 0
9 | 1
16 | 0
12 | 1
11.5 | 0
15 | 0
10 | 1
17 | 0
7 | 1
[Figure: the weights plotted against Is Cat, with candidate split thresholds marked on the weight axis.]
We try many thresholds to split the range of the continuous feature, and take the threshold that minimizes the impurity (i.e. maximizes the information gain).
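A rough sketch of that threshold search, reusing the entropy helper and trying the midpoints between consecutive sorted values as candidate thresholds (one common choice, not necessarily the slide's). The weight data reuses the table above.

```python
import numpy as np

def entropy(p1):
    if p1 in (0, 1):
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def best_threshold(values, labels):
    """Return (threshold, information gain) for the best split of a continuous feature."""
    values, labels = np.asarray(values, float), np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    h_root = entropy(labels.mean())
    best = (None, -1.0)
    for t in (values[:-1] + values[1:]) / 2:          # candidate midpoints
        left, right = labels[values <= t], labels[values > t]
        if len(left) == 0 or len(right) == 0:
            continue
        w_left = len(left) / len(labels)
        gain = h_root - (w_left * entropy(left.mean()) +
                         (1 - w_left) * entropy(right.mean()))
        if gain > best[1]:
            best = (t, gain)
    return best

weights = [6, 20, 9, 16, 12, 11.5, 15, 10, 17, 7]
is_cat  = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
print(best_threshold(weights, is_cat))
```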
One of the weaknesses of using a single decision tree is that it can be highly sensitive to small changes in the data. What is the solution? Build many trees (a tree ensemble) and let them vote.
[Figure: a new example is run through every tree (e.g. Tree 1 predicts Cat, Tree 2 predicts Not Cat, …) and the majority vote decides the final prediction, here Not Cat.]
A random forest combines the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy.
These awesome Random Forest slides were taken from the StatQuest YouTube channel; the link is in the description.
Step 1: Create a "bootstrapped" dataset
To create a bootstrapped dataset that is the same size as the original, we randomly select samples from the original dataset. The important detail is that we can pick the same sample more than once (sampling with replacement).

Original dataset:
0 0 0 125 0
1 1 1 180 1
1 1 0 210 0
1 0 1 167 1

Bootstrapped dataset (sampled with replacement):
1 1 1 180 1
0 0 0 125 0
1 0 1 167 1
1 0 1 167 1 ← the same row was picked twice
Step 2: Create a decision tree using the bootstrapped dataset, but only use a random subset of variables (or columns) at each step (in this example we will only consider 2 variables at each step).
[Figure: building the tree on the bootstrapped rows, choosing the best of 2 randomly selected columns at each split.]
We built a tree:
1. Using a bootstrapped dataset.
2. Only considering a random subset of variables at each step.
Now go back to Step 1 and repeat: make a new bootstrapped dataset and build another tree, again considering a random subset of variables at each step.
Note: Bootstrapping the data plus using the aggregate of the trees' predictions to make a decision is called Bagging.
Note: After generating the trees, we run a new example through every tree and take the majority vote.
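A possible random-forest sketch using scikit-learn, assuming it is installed; the first four data rows follow the bootstrapping example above, while the remaining rows and all hyperparameter values are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# First 4 rows follow the slide's example; the rest are made-up filler.
X = np.array([[0, 0, 0, 125], [1, 1, 1, 180], [1, 1, 0, 210], [1, 0, 1, 167],
              [0, 1, 0, 140], [1, 0, 0, 190], [0, 1, 1, 155], [1, 1, 1, 172]])
y = np.array([0, 1, 0, 1, 0, 0, 1, 1])

# n_estimators = number of bootstrapped trees; max_features = size of the
# random subset of columns considered at each split (2, as in the example).
forest = RandomForestClassifier(n_estimators=100, max_features=2, random_state=0)
forest.fit(X, y)

new_example = np.array([[1, 0, 1, 168]])
print(forest.predict(new_example))        # majority vote of the 100 trees
```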
XGBoost
XGBoost builds a boosted tree ensemble. For b = 1 to B: build a new tree, but instead of sampling every training example with equal probability, put more emphasis on the examples that the trees built so far misclassify.
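A possible usage sketch with the xgboost package's scikit-learn-style API, assuming the package is installed; the data and hyperparameter values are made up for illustration.

```python
import numpy as np
from xgboost import XGBClassifier   # assumes the xgboost package is installed

# Illustrative toy data (not from the slides).
X = np.array([[0, 0, 0, 125], [1, 1, 1, 180], [1, 1, 0, 210], [1, 0, 1, 167],
              [0, 1, 0, 140], [1, 0, 0, 190], [0, 1, 1, 155], [1, 1, 1, 172]])
y = np.array([0, 1, 0, 1, 0, 0, 1, 1])

# Each new tree (up to n_estimators) focuses on the examples the previous
# trees got wrong; learning_rate scales each tree's contribution.
model = XGBClassifier(n_estimators=50, learning_rate=0.1, max_depth=3)
model.fit(X, y)
print(model.predict(np.array([[1, 0, 1, 168]])))
```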
Neural Networks:
• Work well on all types of data, including tabular (structured) and unstructured data.
• May be slower than decision trees.
• Work with transfer learning.
• When building a system of multiple models working together, it may be easier to string together multiple neural networks.
Oday Mourad