Fundamentals of Deep Learning Chapter 2
Multilayer Perceptrons

(Figure: an MLP with an input layer, two hidden layers (hidden layer 1, hidden layer 2), and an output layer, each made up of units.)

Notation: bold type denotes a vector or matrix; regular type denotes a scalar.

(Figure: the computation of one unit – the vector of inputs is combined with the matrix of weights of the $l^{th}$ layer and the vector of biases (one bias per unit, e.g. the $j^{th}$ unit), then passed through the activation function to produce the vector of outputs.)
Multilayer Perceptrons

$w^{l}_{ij}$ is the weight of the connection between the $i^{th}$ unit of the $(l-1)^{th}$ layer and the $j^{th}$ unit of the $l^{th}$ layer.
$b^{l}_{i}$ is the bias of the $i^{th}$ unit of the $l^{th}$ layer.
$\mathbf{W}$ – matrix of weights.
$\mathbf{b}$ – vector of biases.
Multilayer Perceptrons

Vector of inputs: $\mathbf{x} = [x_0\ x_1\ x_2\ x_3\ x_4]^T$
Vector of weights: $\mathbf{w} = [w_0\ w_1\ w_2\ w_3\ w_4]^T$
Dimension $d = 4$ (the input has 4 dimensions; the bias input $x_0 = 1$ is not counted).
Unit output: $z$
Activation function: $\mathrm{sign}$
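A minimal sketch of this single unit in Python/NumPy, assuming the convention above ($x_0 = 1$ carries the bias, sign activation); the numerical values are made up for illustration:

```python
import numpy as np

def unit_output(x, w):
    s = np.dot(w, x)      # weighted sum s = w^T x
    return np.sign(s)     # sign activation

x = np.array([1.0, 0.5, -1.2, 0.3, 2.0])    # hypothetical input, x0 = 1 for the bias
w = np.array([0.1, -0.4, 0.8, 0.2, -0.3])   # hypothetical weights [w0 ... w4]
z = unit_output(x, w)                        # unit output z
```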
Example: a network with input $\mathbf{x} = [x_1\ x_2]^T$, one hidden layer with two units $n_1, n_2$ and activation function $f_1$, and an output layer with activation function $f_2$:

$a_1 = f_1(w^{0}_{1,1} x_1 + w^{0}_{2,1} x_2 + b^{0}_{1})$
$a_2 = f_1(w^{0}_{1,2} x_1 + w^{0}_{2,2} x_2 + b^{0}_{2})$
$\hat{y} = f_2(w^{1}_{1,1} a_1 + w^{1}_{2,1} a_2 + b^{1}_{1})$
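A sketch of this forward pass in NumPy. The weight values are made up, and sigmoid and identity are used as hypothetical stand-ins for $f_1$ and $f_2$ (the slides do not fix them):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# W0[i, j] connects input i to hidden unit j (the slide's w^0_{i,j} convention).
W0 = np.array([[0.5, -0.3],
               [0.2,  0.8]])          # hypothetical hidden-layer weights w^0
b0 = np.array([0.1, -0.1])            # hidden-layer biases b^0
W1 = np.array([[1.0],
               [-0.7]])               # hypothetical output-layer weights w^1
b1 = np.array([0.05])                 # output-layer bias b^1

x = np.array([0.6, 0.9])              # input [x1, x2]
a = sigmoid(x @ W0 + b0)              # a1, a2 = f1(w^0 x + b^0), f1 assumed sigmoid
y_hat = (a @ W1 + b1)[0]              # y_hat = f2(w^1 a + b^1), f2 assumed identity
```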
Backpropagation
Backpropagation, short for "backward propagation of errors," is an algorithm for
supervised learning of artificial neural networks using gradient descent.
The gradient of the loss (error) function is propagated backward through the network, i.e. the derivative travels in the reverse direction of the forward pass. The mean squared error is

$E(\mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
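As a small illustration, the mean squared error and the gradient of the loss with respect to the predictions (the quantity backpropagation starts from) can be written directly in NumPy; this is a sketch, not code from the slides:

```python
import numpy as np

def mse(y_hat, y):
    # E(X, theta) = 1/(2N) * sum_i (y_hat_i - y_i)^2
    n = len(y)
    return np.sum((y_hat - y) ** 2) / (2 * n)

def mse_grad_wrt_pred(y_hat, y):
    # dE/dy_hat_i = (y_hat_i - y_i) / N -- backpropagation starts from this error signal
    return (y_hat - y) / len(y)
```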
Gradient Descent
Local minimum: solving $\dot{f}(x) = 0$ gives $x^*$.
At any $x$, $\dot{f}(x)$ is the slope of the tangent.
Gradient Descent
Our objective is to find $\boldsymbol{\theta}$ such that the loss function $E(\mathbf{X}, \boldsymbol{\theta})$ is minimized:

$E(\mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
Gradient Descent
If $\dot{f}(x(t)) > 0$ then $x(t)$ is on the right of $x^*$.
If $\dot{f}(x(t)) < 0$ then $x(t)$ is on the left of $x^*$.
Then the new value of $x$ is obtained as:

$x(t+1) = x(t) - \eta\,\dot{f}(x(t))$
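A minimal one-dimensional gradient descent sketch; the quadratic $f$, the starting point, and the learning rate are arbitrary choices for illustration:

```python
# Minimize f(x) = (x - 3)^2, whose derivative is f'(x) = 2(x - 3).
def f_dot(x):
    return 2.0 * (x - 3.0)

eta = 0.1      # learning rate
x = -5.0       # initial guess
for t in range(100):
    x = x - eta * f_dot(x)    # x(t+1) = x(t) - eta * f'(x(t))
# x ends up close to the minimizer x* = 3
```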
Gradient Descent
The weights $\boldsymbol{\theta}$ are updated as follows:

$\boldsymbol{\theta}(t+1) = \boldsymbol{\theta}(t) - \eta\,\nabla_{\boldsymbol{\theta}} E(\mathbf{X}, \boldsymbol{\theta}(t))$

$\eta$ – the learning rate

$E(\mathbf{X}, \boldsymbol{\theta}) = \frac{1}{2N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$

For the linear model this takes the form

$J(\mathbf{w}) = \frac{1}{2N}\,\|\mathbf{y} - \bar{\mathbf{X}}\mathbf{w}\|^2$
Gradient Descent
Numerical Gradient

Backward difference: $\dot{f}(x_0) \approx \dfrac{f(x_0) - f(x_0 - \varepsilon)}{\varepsilon}$

Forward difference: $\dot{f}(x_0) \approx \dfrac{f(x_0 + \varepsilon) - f(x_0)}{\varepsilon}$

Central difference: $\dot{f}(x_0) \approx \dfrac{f(x_0 + \varepsilon) - f(x_0 - \varepsilon)}{2\varepsilon}$
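A sketch of checking an analytic derivative with the central-difference formula; the test function and evaluation point are arbitrary:

```python
def numerical_grad(f, x0, eps=1e-6):
    # central difference: (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
    return (f(x0 + eps) - f(x0 - eps)) / (2 * eps)

f = lambda x: x ** 3
analytic = 3 * 2.0 ** 2                   # f'(x) = 3 x^2, evaluated at x0 = 2
numeric = numerical_grad(f, 2.0)
assert abs(analytic - numeric) < 1e-4     # the two estimates should agree closely
```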
• In typical Gradient Descent optimization, such as Batch Gradient Descent, the batch is taken to be the whole dataset.
• Using the whole dataset is useful for reaching the minimum in a less noisy, less random manner, but it becomes a problem when the dataset gets big.
• If you have a million samples, a typical Gradient Descent technique has to process all of them for every single update, which is expensive.
• SGD instead uses only a single sample, i.e. a batch size of one, to perform each iteration (see the sketch below).
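A generic SGD loop sketch, assuming a per-sample gradient function grad_i(theta, x_i, y_i) is provided by the model; the names and defaults here are illustrative, not from the slides:

```python
import numpy as np

def sgd(theta, X, y, grad_i, eta=0.01, epochs=10):
    # Stochastic gradient descent: one randomly chosen sample (batch size 1) per update.
    n = len(y)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            theta = theta - eta * grad_i(theta, X[i], y[i])
    return theta
```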
Linear Regression
The figure illustrates the relationship between weight and height in the dataset.
Linear Regression
$N = 13$, $d = 1$
$\mathbf{X} = [x_1, \ldots, x_N] = [147, 150, 153, 158, 163, 165, 168, ..., 183]$ (height, cm)
$\mathbf{y} = [y_1, \ldots, y_N] = [49, 50, 51, 54, 58, 59, 60, ..., 68]$ (weight, kg)

Augmented input $\bar{\mathbf{x}} = [x_0, x_1]$ with $x_0 = 1$, weights $\mathbf{w} = [w_0\ w_1]^T$:

$\hat{y} = f(\mathbf{w}^T \bar{\mathbf{x}}) = f(x_1 w_1 + w_0) = x_1 w_1 + w_0$
Linear Regression
The loss function:

$E(\mathbf{w}) = \frac{1}{2N}\,\|\mathbf{y} - \bar{\mathbf{X}}\mathbf{w}\|^2$

where $\bar{\mathbf{X}}$ is the matrix whose rows are the augmented inputs $\bar{\mathbf{x}}_i$.

Train error, validation error, and test error are monitored during training:
Underfitting – the train error is still high.
Overfitting – the model is only good on the training data.
Fitting – the model also performs well on unseen data.

The gradient is calculated as:

$\nabla_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{N}\,\bar{\mathbf{X}}^T(\bar{\mathbf{X}}\mathbf{w} - \mathbf{y})$

The weights are updated by:

$\mathbf{w} = \mathbf{w} - \eta\,\nabla_{\mathbf{w}} E(\mathbf{w})$

Loop until the loss converges (see the sketch below).
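A sketch of this training loop in NumPy. Only the data points printed explicitly on the slide are used here (the full dataset has 13 points); the feature is standardized so that a moderate learning rate converges, and the learning rate and iteration count are arbitrary:

```python
import numpy as np

# First 7 (height, weight) pairs shown on the slide; replace with the full 13-point dataset.
heights = np.array([147, 150, 153, 158, 163, 165, 168], dtype=float)
weights = np.array([49, 50, 51, 54, 58, 59, 60], dtype=float)

x = (heights - heights.mean()) / heights.std()   # standardize for stable gradient descent
X_bar = np.column_stack([np.ones_like(x), x])    # augmented inputs: x0 = 1, x1 = scaled height
y = weights

w = np.zeros(2)                                  # [w0, w1]
eta = 0.1
for _ in range(1000):
    grad = X_bar.T @ (X_bar @ w - y) / len(y)    # (1/N) * X^T (X w - y)
    w = w - eta * grad                           # gradient descent update
# w[0] + w[1] * (h - heights.mean()) / heights.std() predicts the weight for height h
```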
Binary Classification
The two classes of data points are linearly separable.
Binary Classification
Matrix of inputs $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{d \times N}$
$N$ – number of data points
$d$ – dimension of a point
Vector of outputs $\mathbf{y} = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{N \times 1}$
If $\mathbf{x}_i$ belongs to the blue class, $y_i = 1$; otherwise $y_i = -1$.
Binary Classification

(Figure: a single unit with inputs $x_0, x_1, x_2$, weights $w_0, w_1, w_2$, and output $\hat{y}$.)

Why?
Binary Classification

$\hat{y} = \mathrm{sgn}(x_1 w_1 + x_2 w_2 + w_0) = \mathrm{sgn}(\mathbf{w}^T \bar{\mathbf{x}})$

The 1st loss function simply counts the misclassified points, where $\mathbb{M}$ is the set of misclassified points:

$E_1(\mathbf{w}) = \sum_{\bar{\mathbf{x}}_i \in \mathbb{M}} \left(-y_i\,\mathrm{sgn}(\mathbf{w}^T \bar{\mathbf{x}}_i)\right)$

Dropping the sgn gives a loss whose gradient carries useful information:

$E(\mathbf{w}) = \sum_{\bar{\mathbf{x}}_i \in \mathbb{M}} \left(-y_i\,\mathbf{w}^T \bar{\mathbf{x}}_i\right)$

The gradient of the loss at a point, and the corresponding weight update:

$\nabla_{\mathbf{w}} E(\mathbf{w}; \bar{\mathbf{x}}_i, y_i) = -y_i \bar{\mathbf{x}}_i$
$\mathbf{w} = \mathbf{w} + \eta\,y_i \bar{\mathbf{x}}_i$
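A sketch of the resulting perceptron learning rule in NumPy, on hypothetical linearly separable 2-D data (not the slides' dataset); labels are ±1 and the augmented input has $x_0 = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: class +1 clustered around (2, 2), class -1 around (-2, -2).
X = np.vstack([rng.normal(2, 0.5, (20, 2)), rng.normal(-2, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
X_bar = np.column_stack([np.ones(len(y)), X])    # prepend x0 = 1

w = np.zeros(3)
eta = 1.0
for _ in range(1000):
    misclassified = [i for i in range(len(y)) if np.sign(X_bar[i] @ w) != y[i]]
    if not misclassified:                        # stop when every point is classified correctly
        break
    i = misclassified[0]
    w = w + eta * y[i] * X_bar[i]                # w <- w + eta * y_i * x_bar_i
```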
Logistic Regression
Example: Probability of passing an exam versus hours of study.
A group of 20 students spends between 0 and 6 hours studying for an exam.
How does the number of hours spent studying affect the probability of the student
passing the exam?
Hours 0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
Pass 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1
Logistic Regression
Sigmoid function:

$f(s) = \dfrac{1}{1 + e^{-s}}$

Gradient:

$\dot{f}(s) = \dfrac{e^{-s}}{(1 + e^{-s})^2} = \dfrac{1}{1 + e^{-s}} \cdot \dfrac{e^{-s}}{1 + e^{-s}} = f(s)\,\bigl(1 - f(s)\bigr)$

$\hat{y}_1 = f(\mathbf{w}^T \bar{\mathbf{x}}_i)$ – probability that the $i^{th}$ point belongs to class 1 (pass)
$\hat{y}_2 = 1 - f(\mathbf{w}^T \bar{\mathbf{x}}_i)$ – probability that the $i^{th}$ point belongs to class 0 (fail)
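A small sketch of the sigmoid and its gradient, verified against the central-difference approximation from the Numerical Gradient slide; the evaluation point is arbitrary:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sigmoid_grad(s):
    return sigmoid(s) * (1.0 - sigmoid(s))     # f'(s) = f(s) (1 - f(s))

s0, eps = 0.7, 1e-6
numeric = (sigmoid(s0 + eps) - sigmoid(s0 - eps)) / (2 * eps)
assert abs(numeric - sigmoid_grad(s0)) < 1e-8  # analytic and numerical gradients agree
```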
Logistic Regression
With $z_i = f(\mathbf{w}^T \bar{\mathbf{x}}_i)$, the probability of observing the label $y_i$ is

$P(\mathbf{w}; \bar{\mathbf{x}}_i, y_i) = z_i^{y_i} (1 - z_i)^{1 - y_i}$

If $y_i = 1$ then $P(\mathbf{w}; \bar{\mathbf{x}}_i, y_i) = z_i$.
If $y_i = 0$ then $P(\mathbf{w}; \bar{\mathbf{x}}_i, y_i) = 1 - z_i$.

$\mathbf{w} = \underset{\mathbf{w}}{\operatorname{argmax}} \prod_{i=1}^{N} P(\mathbf{w}; \bar{\mathbf{x}}_i, y_i)$

Taking the negative logarithm of the likelihood turns this maximization into minimizing the cross-entropy loss:

$E(\mathbf{w}) = -\sum_{i=1}^{N} \bigl[\, y_i \log z_i + (1 - y_i) \log(1 - z_i) \,\bigr]$
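A sketch of fitting this model with stochastic gradient descent on the hours/pass table above. The per-sample gradient of the cross-entropy loss, $(z_i - y_i)\bar{\mathbf{x}}_i$, is a standard result that the slides do not derive; the learning rate and epoch count are arbitrary:

```python
import numpy as np

hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
passed = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])

X_bar = np.column_stack([np.ones_like(hours), hours])   # augmented inputs, x0 = 1
w = np.zeros(2)
eta, epochs = 0.05, 2000

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
for _ in range(epochs):
    for i in rng.permutation(len(passed)):               # SGD: one sample per update
        z_i = sigmoid(X_bar[i] @ w)
        w = w - eta * (z_i - passed[i]) * X_bar[i]        # per-sample cross-entropy gradient

p_pass_3h = sigmoid(np.array([1.0, 3.0]) @ w)             # predicted P(pass) after 3 hours of study
```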