Fundamentals of Deep Learning Chapter 2
Multilayer Perceptrons

(Figure: an MLP with an input layer, two hidden layers (hidden layer 1, hidden layer 2), and an output layer, each made up of units.)

Notation: bold type denotes a vector or matrix; regular type denotes a scalar.

(Figure: the computation of one unit – the vector of inputs is combined with the matrix of weights of the $l^{th}$ layer and the vector of biases (one bias per unit, e.g. the $j^{th}$ unit), then passed through the activation function to produce the vector of outputs.)
Multilayer Perceptrons

$w^{l}_{ij}$ is the weight of the connection between the $i^{th}$ unit of the $(l-1)^{th}$ layer and the $j^{th}$ unit of the $l^{th}$ layer.
$b^{l}_{i}$ is the bias of the $i^{th}$ unit of the $l^{th}$ layer.
$\mathbf{W}$ – matrix of weights.
$\mathbf{b}$ – vector of biases.
Multilayer Perceptrons

Vector of inputs: $\mathbf{x} = [x_0\ x_1\ x_2\ x_3\ x_4]^T$
Vector of weights: $\mathbf{w} = [w_0\ w_1\ w_2\ w_3\ w_4]^T$
Dimension $d = 4$ (the input has 4 dimensions; the bias input $x_0 = 1$ is not counted).
Unit output: $z$
Activation function: $\mathrm{sign}$
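A minimal sketch of this single unit in Python/NumPy, assuming the convention above ($x_0 = 1$ carries the bias, sign activation); the numerical values are made up for illustration:

```python
import numpy as np

def unit_output(x, w):
    s = np.dot(w, x)      # weighted sum s = w^T x
    return np.sign(s)     # sign activation

x = np.array([1.0, 0.5, -1.2, 0.3, 2.0])    # hypothetical input, x0 = 1 for the bias
w = np.array([0.1, -0.4, 0.8, 0.2, -0.3])   # hypothetical weights [w0 ... w4]
z = unit_output(x, w)                        # unit output z
```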
Example: a network with input $\mathbf{x} = [x_1\ x_2]^T$, one hidden layer with two units $n_1, n_2$ and activation function $f_1$, and an output layer with activation function $f_2$:

$a_1 = f_1(w^{0}_{1,1} x_1 + w^{0}_{2,1} x_2 + b^{0}_{1})$
$a_2 = f_1(w^{0}_{1,2} x_1 + w^{0}_{2,2} x_2 + b^{0}_{2})$
$\hat{y} = f_2(w^{1}_{1,1} a_1 + w^{1}_{2,1} a_2 + b^{1}_{1})$
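A sketch of this forward pass in NumPy. The weight values are made up, and sigmoid and identity are used as hypothetical stand-ins for $f_1$ and $f_2$ (the slides do not fix them):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# W0[i, j] connects input i to hidden unit j (the slide's w^0_{i,j} convention).
W0 = np.array([[0.5, -0.3],
               [0.2,  0.8]])          # hypothetical hidden-layer weights w^0
b0 = np.array([0.1, -0.1])            # hidden-layer biases b^0
W1 = np.array([[1.0],
               [-0.7]])               # hypothetical output-layer weights w^1
b1 = np.array([0.05])                 # output-layer bias b^1

x = np.array([0.6, 0.9])              # input [x1, x2]
a = sigmoid(x @ W0 + b0)              # a1, a2 = f1(w^0 x + b^0), f1 assumed sigmoid
y_hat = (a @ W1 + b1)[0]              # y_hat = f2(w^1 a + b^1), f2 assumed identity
```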
Backpropagation
Backpropagation, short for "backward propagation of errors," is an algorithm for
supervised learning of artificial neural networks using gradient descent.
The gradient of the loss (error) function is propagated backward through the network, i.e. the derivative travels in the reverse direction of the forward pass. The mean squared error is

$E(\mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
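As a small illustration, the mean squared error and the gradient of the loss with respect to the predictions (the quantity backpropagation starts from) can be written directly in NumPy; this is a sketch, not code from the slides:

```python
import numpy as np

def mse(y_hat, y):
    # E(X, theta) = 1/(2N) * sum_i (y_hat_i - y_i)^2
    n = len(y)
    return np.sum((y_hat - y) ** 2) / (2 * n)

def mse_grad_wrt_pred(y_hat, y):
    # dE/dy_hat_i = (y_hat_i - y_i) / N -- backpropagation starts from this error signal
    return (y_hat - y) / len(y)
```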
Gradient Descent
Local minimum: solving $\dot{f}(x) = 0$ gives $x^*$.
At any $x$, $\dot{f}(x)$ is the slope of the tangent.
Gradient Descent
Our objective is to find $\boldsymbol{\theta}$ such that the loss function $E(\mathbf{X}, \boldsymbol{\theta})$ is minimized:

$E(\mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
Gradient Descent
If $\dot{f}(x(t)) > 0$ then $x(t)$ is on the right of $x^*$.
If $\dot{f}(x(t)) < 0$ then $x(t)$ is on the left of $x^*$.
Then the new value of $x$ is obtained as:

$x(t+1) = x(t) - \eta\,\dot{f}(x(t))$
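A minimal one-dimensional gradient descent sketch; the quadratic $f$, the starting point, and the learning rate are arbitrary choices for illustration:

```python
# Minimize f(x) = (x - 3)^2, whose derivative is f'(x) = 2(x - 3).
def f_dot(x):
    return 2.0 * (x - 3.0)

eta = 0.1      # learning rate
x = -5.0       # initial guess
for t in range(100):
    x = x - eta * f_dot(x)    # x(t+1) = x(t) - eta * f'(x(t))
# x ends up close to the minimizer x* = 3
```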
Gradient Descent
The weights $\boldsymbol{\theta}$ are updated as follows:

$\boldsymbol{\theta}(t+1) = \boldsymbol{\theta}(t) - \eta\,\nabla_{\boldsymbol{\theta}} E(\mathbf{X}, \boldsymbol{\theta}(t))$

$\eta$ – the learning rate

$E(\mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$

For the linear model this takes the form

$J(\mathbf{w}) = \frac{1}{2N}\,\|\mathbf{y} - \bar{\mathbf{X}}\mathbf{w}\|^2$
Gradient Descent
Numerical Gradient

Backward difference: $\dot{f}(x_0) \approx \dfrac{f(x_0) - f(x_0 - \varepsilon)}{\varepsilon}$

Forward difference: $\dot{f}(x_0) \approx \dfrac{f(x_0 + \varepsilon) - f(x_0)}{\varepsilon}$

Central difference: $\dot{f}(x_0) \approx \dfrac{f(x_0 + \varepsilon) - f(x_0 - \varepsilon)}{2\varepsilon}$
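A sketch of checking an analytic derivative with the central-difference formula; the test function and evaluation point are arbitrary:

```python
def numerical_grad(f, x0, eps=1e-6):
    # central difference: (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
    return (f(x0 + eps) - f(x0 - eps)) / (2 * eps)

f = lambda x: x ** 3
analytic = 3 * 2.0 ** 2                   # f'(x) = 3 x^2, evaluated at x0 = 2
numeric = numerical_grad(f, 2.0)
assert abs(analytic - numeric) < 1e-4     # the two estimates should agree closely
```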
• In typical Gradient Descent optimization, such as Batch Gradient Descent, the batch is taken to be the whole dataset.
• Using the whole dataset is useful for reaching the minimum in a less noisy, less random manner, but it becomes a problem when the dataset gets big.
• If you have a million samples, a typical Gradient Descent technique has to process all of them for every single update, which is expensive.
• SGD instead uses only a single sample, i.e. a batch size of one, to perform each iteration (see the sketch below).
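A generic SGD loop sketch, assuming a per-sample gradient function grad_i(theta, x_i, y_i) is provided by the model; the names and defaults here are illustrative, not from the slides:

```python
import numpy as np

def sgd(theta, X, y, grad_i, eta=0.01, epochs=10):
    # Stochastic gradient descent: one randomly chosen sample (batch size 1) per update.
    n = len(y)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            theta = theta - eta * grad_i(theta, X[i], y[i])
    return theta
```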
Linear Regression
The figure illustrates the relationship between weight and height in the dataset.
Linear Regression
$N = 13$, $d = 1$
$\mathbf{X} = [x_1, \ldots, x_N] = [147, 150, 153, 158, 163, 165, 168, ..., 183]$ (height, cm)
$\mathbf{y} = [y_1, \ldots, y_N] = [49, 50, 51, 54, 58, 59, 60, ..., 68]$ (weight, kg)

Augmented input $\bar{\mathbf{x}} = [x_0, x_1]$ with $x_0 = 1$, weights $\mathbf{w} = [w_0\ w_1]^T$:

$\hat{y} = f(\mathbf{w}^T \bar{\mathbf{x}}) = f(x_1 w_1 + w_0) = x_1 w_1 + w_0$
Linear Regression
The loss function:

$E(\mathbf{w}) = \frac{1}{2N}\,\|\mathbf{y} - \bar{\mathbf{X}}\mathbf{w}\|^2$

where $\bar{\mathbf{X}}$ is the matrix whose rows are the augmented inputs $\bar{\mathbf{x}}_i$.

Train error, validation error, and test error are monitored during training:
Underfitting – the train error is still high.
Overfitting – the model is only good on the training data.
Fitting – the model also performs well on unseen data.

The gradient is calculated as:

$\nabla_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{N}\,\bar{\mathbf{X}}^T(\bar{\mathbf{X}}\mathbf{w} - \mathbf{y})$

The weights are updated by:

$\mathbf{w} = \mathbf{w} - \eta\,\nabla_{\mathbf{w}} E(\mathbf{w})$

Loop until the loss converges (see the sketch below).
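A sketch of this training loop in NumPy. Only the data points printed explicitly on the slide are used here (the full dataset has 13 points); the feature is standardized so that a moderate learning rate converges, and the learning rate and iteration count are arbitrary:

```python
import numpy as np

# First 7 (height, weight) pairs shown on the slide; replace with the full 13-point dataset.
heights = np.array([147, 150, 153, 158, 163, 165, 168], dtype=float)
weights = np.array([49, 50, 51, 54, 58, 59, 60], dtype=float)

x = (heights - heights.mean()) / heights.std()   # standardize for stable gradient descent
X_bar = np.column_stack([np.ones_like(x), x])    # augmented inputs: x0 = 1, x1 = scaled height
y = weights

w = np.zeros(2)                                  # [w0, w1]
eta = 0.1
for _ in range(1000):
    grad = X_bar.T @ (X_bar @ w - y) / len(y)    # (1/N) * X^T (X w - y)
    w = w - eta * grad                           # gradient descent update
# w[0] + w[1] * (h - heights.mean()) / heights.std() predicts the weight for height h
```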
Binary Classification
The two classes of data points are linearly separable.
Binary Classification
Matrix of inputs $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{d \times N}$
$N$ – number of data points
$d$ – dimension of a point
Vector of outputs $\mathbf{y} = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{N \times 1}$
If $\mathbf{x}_i$ belongs to the blue class, $y_i = 1$; otherwise $y_i = -1$.
Binary Classification

(Figure: a single unit with inputs $x_0, x_1, x_2$, weights $w_0, w_1, w_2$, and output $\hat{y}$.)

Why?
Binary Classification

$\hat{y} = \mathrm{sgn}(x_1 w_1 + x_2 w_2 + w_0) = \mathrm{sgn}(\mathbf{w}^T \bar{\mathbf{x}})$

The 1st loss function simply counts the misclassified points, where $\mathbb{M}$ is the set of misclassified points:

$E_1(\mathbf{w}) = \sum_{\bar{\mathbf{x}}_i \in \mathbb{M}} \left(-y_i\,\mathrm{sgn}(\mathbf{w}^T \bar{\mathbf{x}}_i)\right)$

Dropping the sgn gives a loss whose gradient carries useful information:

$E(\mathbf{w}) = \sum_{\bar{\mathbf{x}}_i \in \mathbb{M}} \left(-y_i\,\mathbf{w}^T \bar{\mathbf{x}}_i\right)$

The gradient of the loss at a point, and the corresponding weight update:

$\nabla_{\mathbf{w}} E(\mathbf{w}; \bar{\mathbf{x}}_i, y_i) = -y_i \bar{\mathbf{x}}_i$
$\mathbf{w} = \mathbf{w} + \eta\,y_i \bar{\mathbf{x}}_i$
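A sketch of the resulting perceptron learning rule in NumPy, on hypothetical linearly separable 2-D data (not the slides' dataset); labels are ±1 and the augmented input has $x_0 = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: class +1 clustered around (2, 2), class -1 around (-2, -2).
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
X_bar = np.column_stack([np.ones(len(y)), X])    # prepend x0 = 1

w = np.zeros(3)
eta = 1.0
for _ in range(1000):
    misclassified = [i for i in range(len(y)) if np.sign(X_bar[i] @ w) != y[i]]
    if not misclassified:                        # stop when every point is classified correctly
        break
    i = misclassified[0]
    w = w + eta * y[i] * X_bar[i]                # w <- w + eta * y_i * x_bar_i
```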
Logistic Regression
Example: Probability of passing an exam versus hours of study.
A group of 20 students spends between 0 and 6 hours studying for an exam.
How does the number of hours spent studying affect the probability of the student
passing the exam?
Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
Pass 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1
Logistic Regression
Sigmoid function:

$f(s) = \dfrac{1}{1 + e^{-s}}$

Gradient:

$\dot{f}(s) = \dfrac{e^{-s}}{(1 + e^{-s})^2} = \dfrac{1}{1 + e^{-s}} \cdot \dfrac{e^{-s}}{1 + e^{-s}} = f(s)\,\bigl(1 - f(s)\bigr)$

$\hat{y}_1 = f(\mathbf{w}^T \bar{\mathbf{x}}_i)$ – probability that the $i^{th}$ point belongs to class 1 (pass)
$\hat{y}_2 = 1 - f(\mathbf{w}^T \bar{\mathbf{x}}_i)$ – probability that the $i^{th}$ point belongs to class 0 (fail)
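A small sketch of the sigmoid and its gradient, verified against the central-difference approximation from the Numerical Gradient slide; the evaluation point is arbitrary:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_grad(s):
    return sigmoid(s) * (1.0 - sigmoid(s))     # f'(s) = f(s) (1 - f(s))

s0, eps = 0.7, 1e-6
numeric = (sigmoid(s0 + eps) - sigmoid(s0 - eps)) / (2 * eps)
assert abs(numeric - sigmoid_grad(s0)) < 1e-8  # analytic and numerical gradients agree
```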
Logistic Regression
With $z_i = f(\mathbf{w}^T \bar{\mathbf{x}}_i)$, the probability of observing the label $y_i$ is

$P(\mathbf{w}; \bar{\mathbf{x}}_i, y_i) = z_i^{y_i} (1 - z_i)^{1 - y_i}$

If $y_i = 1$ then $P(\mathbf{w}; \bar{\mathbf{x}}_i, y_i) = z_i$.
If $y_i = 0$ then $P(\mathbf{w}; \bar{\mathbf{x}}_i, y_i) = 1 - z_i$.

$\mathbf{w} = \underset{\mathbf{w}}{\operatorname{argmax}} \prod_{i=1}^{N} P(\mathbf{w}; \bar{\mathbf{x}}_i, y_i)$

Taking the negative logarithm of the likelihood turns this maximization into minimizing the cross-entropy loss:

$E(\mathbf{w}) = -\sum_{i=1}^{N} \bigl[\, y_i \log z_i + (1 - y_i) \log(1 - z_i) \,\bigr]$
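A sketch of fitting this model with stochastic gradient descent on the hours/pass table above. The per-sample gradient of the cross-entropy loss, $(z_i - y_i)\bar{\mathbf{x}}_i$, is a standard result that the slides do not derive; the learning rate and epoch count are arbitrary:

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

X_bar = np.column_stack([np.ones_like(hours), hours])   # augmented inputs, x0 = 1
w = np.zeros(2)
eta, epochs = 0.05, 2000

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
for _ in range(epochs):
    for i in rng.permutation(len(passed)):               # SGD: one sample per update
        z_i = sigmoid(X_bar[i] @ w)
        w = w - eta * (z_i - passed[i]) * X_bar[i]        # per-sample cross-entropy gradient

p_pass_3h = sigmoid(np.array([1.0, 3.0]) @ w)             # predicted P(pass) after 3 hours of study
```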