Neural Network Notes
Andres Mendez-Vazquez
Outline
1 Introduction
The Long Path
The Problem of Image Processing
Multilayer Neural Network Classification
Drawbacks
Possible Solution
2 Convolutional Networks
History
Local Connectivity
Sharing Parameters
3 Layers
Convolutional Layer
Convolutional Architectures
A Little Bit of Notation
Deconvolution Layer
Alternating Minimization
Non-Linearity Layer
Fixing the Problem, ReLu function
Back to the Non-Linearity Layer
Rectification Layer
Local Contrast Normalization Layer
Sub-sampling and Pooling
Strides
Normalization Layer AKA Batch Normalization
Finally, The Fully Connected Layer
4 An Example of CNN
The Proposed Architecture
Backpropagation
Deriving $w_{r,s,k}$
Deriving the Kernel Filters
The Long Path [1]
[Timeline figure: the evolution of CNN architectures. Early attempts: the Neocognitron (1979), ConvNet (1989), and LeNet (1998). Early 2000s: max-pooling, GPUs (2006), NVIDIA hardware (2007), and ImageNet (2010). Spatial exploitation and depth: AlexNet (2012), ZfNet, VGG with small-size filters and an effective receptive field, and GoogLeNet/Inception with block parallelism (2014), followed by Inception-V2/V3/V4 and Inception-ResNet with factorization and bottlenecks. Skip connections and the depth revolution: Highway Net and ResNet (2015). Multi-path and width exploitation: DenseNet, FractalNet, and ResNeXt (2016-2017). Feature-map and channel exploitation with attention: CMPE-SE, Residual Attention Modules, CBAM, and Channel Boosted CNN (2018 and beyond), moving toward Transformer-CNN hybrids.]
Digital Images as pixels in a digitized matrix [2]
[Figure: an illumination source lights a scene; the reflected light is captured and digitized into the output image.]
Further [2]
Something Notable
Remember that digitization implies that a digital image is an approximation of a real scene.
Images
Therefore, we have the following process
Example
Edge Detection
Then
Example
Object Recognition
Therefore
Multilayer Neural Network Classification
Drawbacks of previous neural networks
A large number N of inputs (one per pixel) leads to an unmanageable number of weights.
In addition, there is little or no invariance to shifting, scaling, and other forms of distortion.
For Example
We have, for a 32 × 32 input image:
Black and white patterns: $2^{32 \times 32} = 2^{1024}$
Gray-scale patterns: $256^{32 \times 32} = 256^{1024}$
Possible Solution
Problem!!!
Training time
Network size
Free parameters
Hubel/Wiesel Architecture
They commented
The visual cortex consists of a hierarchy of simple, complex, and hyper-complex cells.
Something Like
We have
Feature Hierarchy
Hyper-complex cells
Complex cells
Simple cells
History
[Figure: a learned feature hierarchy, from patterns of local contrast, to face features, to faces, to the OUTPUT.]
About CNNs
Something Notable
CNNs were neurobiologically motivated by the findings of locally sensitive and orientation-selective nerve cells in the visual cortex.
In addition
They designed a network structure that implicitly extracts relevant features.
Properties
Convolutional Neural Networks are a special kind of multi-layer neural network.
About CNNs
In addition
A CNN is a feed-forward network that can extract topological properties from an image.
Like almost every other neural network, they are trained with a version of the back-propagation algorithm.
Convolutional Neural Networks are designed to recognize visual patterns directly from pixel images with minimal preprocessing.
They can recognize patterns with extreme variability.
Local Connectivity
Input Image
We decide to connect the neurons only in a local way:
Each hidden unit is connected only to a subregion (patch) of the input image.
It is connected to all channels.
Example
Input Image
Thus

$y_i = f\left(\sum_{i \in L_p,\, c} w_i x_i^c\right)$    (3)
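To make equation (3) concrete, here is a minimal sketch of a single locally connected hidden unit; the patch location, the weight shape, and the tanh activation are illustrative assumptions rather than choices taken from these notes.

```python
import numpy as np

def local_hidden_unit(image, weights, patch_row, patch_col, f=np.tanh):
    """One hidden unit connected only to a local patch of the input, across all channels.

    image   : array of shape (H, W, C)   -- the input image
    weights : array of shape (ph, pw, C) -- one weight per pixel of the patch, per channel
    """
    ph, pw, _ = weights.shape
    patch = image[patch_row:patch_row + ph, patch_col:patch_col + pw, :]
    # y_i = f( sum over the patch and over the channels of w * x )
    return f(np.sum(weights * patch))

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
weights = rng.standard_normal((5, 5, 3)) * 0.1
y = local_hidden_unit(image, weights, patch_row=10, patch_col=12)
```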
Solving the following problems...
First
A fully connected hidden layer would have an unmanageable number of parameters.
Second
Computing the linear activation of the hidden units would be quite expensive.
How this looks in the image...
We have
Receptive Field
Parameter Sharing
Second Idea
Share a matrix of parameters across certain units.
Example
Now, in our notation
Digital Images
Many times we want to eliminate noise in an image
This is defined as a moving average, which can be generalized to 2D images as follows.
Moving average in 2D
Basically, in 2D
We can define different types of filters using the idea of a weighted average:

$(I \ast K)(i, j) = \sum_{s=-m}^{m} \sum_{l=-m}^{m} I(i - s, j - l) \times K(s, l)$    (5)
Another Example
Convolution
Definition
Let $I : [a, b] \times [c, d] \rightarrow [0..255]$ be the image and $K : [e, f] \times [h, i] \rightarrow \mathbb{R}$ be the kernel. The output of convolving $I$ with $K$, denoted $I \ast K$, is

$(I \ast K)[x, y] = \sum_{s=-n}^{n} \sum_{l=-n}^{n} I(x - s, y - l) \times K(s, l)$
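As an illustration of this definition, here is a minimal NumPy sketch of the discrete 2D convolution; the function name, the zero padding at the borders, and the 3 × 3 box-filter example are assumptions made for the sketch, not part of the notes.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Compute (I * K)[x, y] = sum_{s,l} I(x - s, y - l) * K(s, l) for a square
    kernel of odd side 2n + 1, zero-padding the image so the output keeps its size."""
    n = kernel.shape[0] // 2
    padded = np.pad(image.astype(float), n, mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for x in range(image.shape[0]):
        for y in range(image.shape[1]):
            window = padded[x:x + 2 * n + 1, y:y + 2 * n + 1]
            # flipping the kernel implements the (x - s, y - l) indexing of convolution
            out[x, y] = np.sum(window * kernel[::-1, ::-1])
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
box = np.ones((3, 3)) / 9.0          # the moving-average filter of equation (5)
smoothed = conv2d(image, box)
```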
Now, why not expand this idea?
Imagine that a three-channel image is split into three feature maps
Feature Maps
Mathematically, we have the following
Map $o$

$(I \ast k)[x, y, o] = \sum_{c=1}^{3} \sum_{l=-n}^{n} \sum_{s=-n}^{n} I(x - l, y - s, c) \times k(l, s, c, o)$

Therefore
The convolution works as a
Filter
Encoder
Decoder
etc.
For Example, Encoder
Notation
Therefore
We can see the convolutional layer as a fusion of information from different feature maps:

$\sum_{j=1}^{m_1^{(l-1)}} Y_j^{(l-1)} \ast K_{ij}^{(l)}$
Thus, we have
Given a specific layer $l$, the $i$th feature map in that layer is equal to

$Y_i^{(l)}(x, y) = B_i^{(l)}(x, y) + \sum_{j=1}^{m_1^{(l-1)}} \sum_{s=-k_s}^{k_s} \sum_{t=-k_s}^{k_s} Y_j^{(l-1)}(x - s, y - t)\, K_{ij}^{(l)}(s, t)$

Where
$Y_i^{(l)}$ is the $i$th feature map in layer $l$.
$B_i^{(l)}$ is the bias matrix for output $i$.
$K_{ij}^{(l)}$ is the filter of size $\left[2h_1^{(l)} + 1\right] \times \left[2h_2^{(l)} + 1\right]$.
Thus
The input of layer $l$ comprises $m_1^{(l-1)}$ feature maps from the previous layer, each of size $m_2^{(l-1)} \times m_3^{(l-1)}$.
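The following NumPy sketch walks through this sum directly (a naive loop, not an efficient implementation); the square odd-sized kernels, the "same"-size zero padding, and the array shapes are assumptions made for the sketch, since the notes treat border effects separately.

```python
import numpy as np

def conv_layer_forward(prev_maps, kernels, biases):
    """Forward pass of a convolutional layer following the formula above.

    prev_maps : array of shape (m1_prev, H, W)            -- feature maps Y^(l-1)
    kernels   : array of shape (m1, m1_prev, 2k+1, 2k+1)  -- filters K_ij^(l)
    biases    : array of shape (m1, H, W)                 -- bias matrices B_i^(l)
    Returns the feature maps Y^(l), shape (m1, H, W).
    """
    m1, m1_prev, kh, kw = kernels.shape
    k = kh // 2
    H, W = prev_maps.shape[1:]
    padded = np.pad(prev_maps, ((0, 0), (k, k), (k, k)))
    out = biases.astype(float)
    for i in range(m1):                       # output feature map i
        for j in range(m1_prev):              # sum over input feature maps j
            for x in range(H):
                for y in range(W):
                    window = padded[j, x:x + kh, y:y + kw]
                    # flipped kernel implements Y_j(x - s, y - t) K_ij(s, t)
                    out[i, x, y] += np.sum(window * kernels[i, j, ::-1, ::-1])
    return out
```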
Therefore
Something Notable
$m_2^{(l)}$ and $m_3^{(l)}$ are influenced by border effects.
Therefore, when the convolutional sum is defined properly (no padding), the output feature maps have size

$m_2^{(l)} = m_2^{(l-1)} - 2h_1^{(l)} \quad \text{and} \quad m_3^{(l)} = m_3^{(l-1)} - 2h_2^{(l)}$

Why? The Border
Example
[Figure: the convolutional maps shrink because the filter cannot be centered on border pixels.]
Special Case
When l = 1
The input is a single image I consisting of one or more channels.
Thus
We have
Each feature map $Y_i^{(l)}$ in layer $l$ consists of $m_2^{(l)} \cdot m_3^{(l)}$ units arranged in a two-dimensional array:

$\left(Y_i^{(l)}\right)_{x,y} = \left(B_i^{(l)}\right)_{x,y} + \sum_{j=1}^{m_1^{(l-1)}} \left(K_{ij}^{(l)} \ast Y_j^{(l-1)}\right)_{x,y} = \left(B_i^{(l)}\right)_{x,y} + \sum_{j=1}^{m_1^{(l-1)}} \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{ij}^{(l)}\right)_{k,t} \left(Y_j^{(l-1)}\right)_{x-k,\,y-t}$
Here, an interesting case
In [8]
Here
Then, we can generalize this cost function to the total set of $I$ images (the mini-batch):

$C_l(y) = \frac{\lambda}{2} \sum_{k=1}^{I} \sum_{i=1}^{m_1^{(l-1)}} \left\| \sum_{j=1}^{m_1^{(l)}} g_{ij}^{(l)}\, \left(Y_j^{(l,k)} \ast K_{ij}^{(l)}\right) - Y_i^{(l-1,k)} \right\|_2^2 + \sum_{k=1}^{I} \sum_{j=1}^{m_1^{(l)}} \left| Y_j^{(l,k)} \right|^p$

Here, we have
$Y_i^{(l-1,k)}$ are the feature maps from the previous layer.
$g_{ij}^{(l)}$ is a fixed binary matrix that determines the connectivity between feature maps at different layers, i.e., whether $Y_j^{(l,k)}$ is connected to certain $Y_i^{(l-1,k)}$ elements.
This can be seen as
[Figure: each previous-layer map reconstructed as a sum of feature maps convolved with their filters.]
They noticed some drawbacks
We have that
An interesting use of an auxiliary variable/layer $X_j^{(l,k)}$:

$C_l(y) = \frac{\lambda}{2} \sum_{k=1}^{I} \sum_{i=1}^{m_1^{(l-1)}} \left\| \sum_{j=1}^{m_1^{(l)}} g_{ij}^{(l)}\, \left(Y_j^{(l,k)} \ast K_{ij}^{(l)}\right) - Y_i^{(l-1,k)} \right\|_2^2 + \frac{\beta}{2} \sum_{k=1}^{I} \sum_{j=1}^{m_1^{(l)}} \left\| Y_j^{(l,k)} - X_j^{(l,k)} \right\|_2^2 + \sum_{k=1}^{I} \sum_{j=1}^{m_1^{(l)}} \left| X_j^{(l,k)} \right|^p$

This is based on
Alternately fixing the values of $Y_j^{(l,k)}$ and $X_j^{(l,k)}$.
They call these two stages the Y and X sub-problems...
Y sub-problem
Taking the derivative with respect to $Y_j^{(l,k)}$:

$\frac{\partial C_l(y)}{\partial Y_j^{(l,k)}} = \lambda \sum_{i=1}^{m_1^{(l-1)}} F_{ij}^{(l)T} \left( \sum_{t=1}^{m_1^{(l)}} F_{it}^{(l)} Y_t^{(l,k)} - Y_i^{(l-1,k)} \right) + \beta \left( Y_j^{(l,k)} - X_j^{(l,k)} \right) = 0$

Where $F_{ij}^{(l)}$ is a sparse convolution matrix:

$F_{ij}^{(l)} = \begin{cases} \text{the matrix equivalent to convolving with } K_{ij}^{(l)} & \text{if } g_{ij}^{(l)} = 1 \\ 0 & \text{if } g_{ij}^{(l)} = 0 \end{cases}$

Therefore
$F_{ij}^{(l)}$, as a sparse convolution matrix, is equivalent to convolving with $K_{ij}^{(l)}$.
As in a Multilayer Perceptron
We use a non-linearity
However, there is a drawback when using Back-Propagation with a sigmoid function:

$s(x) = \frac{1}{1 + e^{-x}}$

$f(x) = \frac{ds(x)}{dx} = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2}$

Therefore
After setting $\frac{df(x)}{dx} = 0$, we find that the maximum of the derivative is at $x = 0$, where it is only $1/4$.
A vanishing derivative
This makes it quite difficult to train a deeper network using this activation function.
Thus
A smooth alternative is the softplus, $f(x) = \frac{\ln\left(1 + e^{kx}\right)}{k}$, which approaches the ReLu $\max(0, x)$ as $k$ grows.
Therefore, we have
When k = 1
[Figure: the Softplus and the ReLu plotted together; for k = 1 the Softplus is a visibly smoothed version of the ReLu.]
Increase k
When k = 10^4
[Figure: for k = 10^4 the Softplus and ReLu curves are indistinguishable.]
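A quick numerical check of this convergence (a sketch; the sample points and the use of np.logaddexp for numerical stability are choices made here, not taken from the notes):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus_k(x, k):
    # ln(1 + e^{kx}) / k, computed stably via logaddexp(0, kx) = ln(e^0 + e^{kx})
    return np.logaddexp(0.0, k * x) / k

x = np.linspace(-3.0, 3.0, 601)
print(np.max(np.abs(softplus_k(x, 1.0) - relu(x))))    # noticeable gap for k = 1 (about 0.69 at x = 0)
print(np.max(np.abs(softplus_k(x, 1e4) - relu(x))))    # essentially zero for k = 10^4
```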
Non-Linearity Layer
Thus

$Y_i^{(l)} = f\left(Y_i^{(l-1)}\right)$

Where
$f$ is the activation function used in layer $l$ and operates point-wise.
Rectification Layer, $R_{abs}$
Then, the absolute value of each component of the feature maps is computed:

$Y_i^{(l)} = \left| Y_i^{(l-1)} \right|$    (10)
Thus
We have that
Experiments show that rectification plays a central role in achieving good performance.
Remark
Rectification could be included in the non-linearity layer, but it can also be seen as an independent layer.
Given that we are using Backpropagation
We need a smooth approximation of $f(x) = |x|$
For this, we have

$\frac{\partial f}{\partial x} = \operatorname{sgn}(x)$ when $x \neq 0$. Why?

We get the following situation
Something Notable
[Figure: a smooth approximation of $|x|$ near the origin, where the derivative of the absolute value is not defined.]
Normalizing
Subtractive Normalization
With

$K_{G(\sigma)}(x, y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{x^2 + y^2}{2\sigma^2} \right\}$    (12)
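A minimal sketch of this Gaussian kernel and of a subtractive local-contrast normalization built on it; the kernel size, the unit-sum renormalization of the kernel, and the use of scipy.ndimage.convolve are assumptions, not details given in the notes.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Sampled Gaussian K_G(sigma) on a size x size grid (size odd),
    renormalized to sum to 1 so that convolving with it gives a local weighted mean."""
    n = size // 2
    x, y = np.meshgrid(np.arange(-n, n + 1), np.arange(-n, n + 1))
    k = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def subtractive_normalization(feature_map: np.ndarray, sigma: float = 1.0, size: int = 9) -> np.ndarray:
    # Subtract from each unit the Gaussian-weighted mean of its neighborhood.
    local_mean = convolve(feature_map, gaussian_kernel(size, sigma), mode="nearest")
    return feature_map - local_mean
```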
Brightness Normalization
Where
$\kappa$, $\mu$ and $\lambda$ are hyperparameters which can be set using a validation set.
Sub-sampling Layer
Motivation
The motivation for subsampling the feature maps obtained by previous layers is robustness to noise and distortions.
How?
Normally, in traditional Convolutional Networks, subsampling is done by applying skipping factors.
However, it is possible to combine subsampling with pooling and do it in a separate layer.
Sub-sampling
How is sub-sampling implemented?
[Figure: the feature map is evaluated only on a coarser grid of positions, keeping one value per skipping step.]
There are also other ways of doing this
What is Pooling?
Pooling
Spatial pooling is a way to compute an image representation based on encoded local features.
Pooling Operation
It operates by placing windows at non-overlapping positions in each feature map and keeping one value per window, so that the feature maps are sub-sampled.
Thus
Each unit in one of the $m_1^{(l)} = 4$ output feature maps represents the average or the maximum within a fixed window of the corresponding feature map in layer $(l - 1)$.
Examples of pooling
Average pooling
When using a boxcar filter, the operation is called average pooling and the layer is denoted by $P_A$. For the 4 × 4 input below, the 2 × 2 non-overlapping windows give:

4 5 1 1
2 6 2 6        →        4.25  2.5
3 5 7 3                 4.5   3.25
1 9 2 1

Max pooling
For max pooling, the maximum value of each window is taken. The layer is denoted by $P_M$:

4 5 1 1
2 6 2 6        →        6  6
3 5 7 3                 9  7
1 9 2 1
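A short sketch of both operations on the example above (the reshape trick and the fixed 2 × 2 window size are implementation choices for the sketch):

```python
import numpy as np

def pool2x2(feature_map: np.ndarray, mode: str = "max") -> np.ndarray:
    """Non-overlapping 2x2 pooling: P_M for mode='max', P_A for mode='average'."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

Y = np.array([[4, 5, 1, 1],
              [2, 6, 2, 6],
              [3, 5, 7, 3],
              [1, 9, 2, 1]], dtype=float)
print(pool2x2(Y, "average"))   # [[4.25 2.5 ] [4.5  3.25]]
print(pool2x2(Y, "max"))       # [[6. 6.] [9. 7.]]
```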
An interesting property
Other Poolings
Springenberg et al. [12]
Example
Horizontal Stride
[Figure: the window sliding across the image with a horizontal stride, skipping intermediate positions.]
There are attempts to understand its effects
Please read Kolda's paper [13] before you get into the other one [14].
You need a little bit of notation...
Here, the people at Google [15], around 2015
They claim
Training on mini-batches forces each layer to adapt to changes in the distribution of its inputs, which impacts the learning capabilities of the network.
They gave the following reasons
Consider a layer with input $u$ that adds the learned bias $b$.
Then, it normalizes the result by subtracting the mean of the activation over the training data:

$\hat{x} = x - E[x]$

where $x = u + b$, $X = \{x_1, ..., x_N\}$ are the training values of $x$, and $E[x] = \frac{1}{N}\sum_{i=1}^{N} x_i$.
Finally
If the gradient step ignores the dependence of $E[x]$ on $b$, the update $b \leftarrow b + \Delta b$ leaves the output $u + b - E[u + b]$ unchanged, so $b$ can grow indefinitely while the loss stays fixed.
Then
Therefore
We need to integrate the normalization into the process of training.
Normalization via Mini-Batch Statistic
Definition of Whitening
Whitening
Suppose $X$ is a random (column) vector with non-singular covariance matrix $\Sigma$ and mean $0$.
Then
The transformation $Y = WX$, with a whitening matrix $W$ satisfying the condition $W^T W = \Sigma^{-1}$, yields the whitened random vector $Y$ with unit diagonal covariance.
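A small sketch of one such whitening matrix, built from the eigendecomposition of the covariance; the symmetric choice $W = \Sigma^{-1/2}$ and the synthetic data are assumptions made for the example.

```python
import numpy as np

def whiten(X: np.ndarray) -> np.ndarray:
    """X has shape (n_samples, n_features) and is assumed zero-mean.
    Builds W with W^T W = Sigma^{-1}, so Y = X W^T has (approximately) identity covariance."""
    sigma = np.cov(X, rowvar=False)                        # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(sigma)               # Sigma = U diag(eigvals) U^T
    W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T     # W = Sigma^{-1/2}
    return X @ W.T

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                           [0.5, 1.0, 0.0],
                                           [0.0, 0.3, 0.5]])
X -= X.mean(axis=0)
print(np.round(np.cov(whiten(X), rowvar=False), 2))        # close to the identity matrix
```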
Such Normalization
Therefore
Given a mini-batch $B = \{x_1, ..., x_m\}$, compute its statistics:

$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_i - \mu_B\right)^2$

The Transformation

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

Finally

$y_i = \gamma \hat{x}_i + \beta \equiv BN_{\gamma, \beta}(x_i)$
Backpropagation
Using the chain rule over the transform $y_i = \gamma \hat{x}_i + \beta$:

1. $\frac{\partial l}{\partial \hat{x}_i} = \frac{\partial l}{\partial y_i} \cdot \gamma$
2. $\frac{\partial l}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i} \cdot \left(x_i - \mu_B\right) \cdot \frac{-1}{2}\left(\sigma_B^2 + \epsilon\right)^{-3/2}$
3. $\frac{\partial l}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial l}{\partial \sigma_B^2} \cdot \frac{\sum_{i=1}^{m} -2\left(x_i - \mu_B\right)}{m}$
4. $\frac{\partial l}{\partial x_i} = \frac{\partial l}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial l}{\partial \sigma_B^2} \cdot \frac{2\left(x_i - \mu_B\right)}{m} + \frac{\partial l}{\partial \mu_B} \cdot \frac{1}{m}$
5. $\frac{\partial l}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i} \cdot \hat{x}_i$
6. $\frac{\partial l}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i}$
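A compact sketch of these gradients for a single activation over a mini-batch; the array shapes and the value of $\epsilon$ are assumptions made for the example.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: mini-batch of one activation, shape (m,). Returns y = BN_{gamma,beta}(x) and a cache.
    mu = x.mean()
    var = x.var()
    x_hat = (x - mu) / np.sqrt(var + eps)
    y = gamma * x_hat + beta
    return y, (x, x_hat, mu, var, gamma, eps)

def batchnorm_backward(dy, cache):
    # dy = dl/dy_i. Implements equations 1-6 above; returns dl/dx, dl/dgamma, dl/dbeta.
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    dx_hat = dy * gamma                                                                      # (1)
    dvar = np.sum(dx_hat * (x - mu)) * -0.5 * (var + eps) ** -1.5                            # (2)
    dmu = np.sum(dx_hat * -1.0 / np.sqrt(var + eps)) + dvar * np.sum(-2.0 * (x - mu)) / m    # (3)
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m                   # (4)
    dgamma = np.sum(dy * x_hat)                                                              # (5)
    dbeta = np.sum(dy)                                                                       # (6)
    return dx, dgamma, dbeta
```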
Training Batch Normalization Networks

Input: Network $N$ with trainable parameters $\Theta$; subset of activations $\{x^{(k)}\}_{k=1}^{K}$
Output: Batch-normalized network for inference, $N_{BN}^{inf}$

1.  $N_{BN}^{tr} = N$    // Training BN network
2.  for $k = 1...K$ do
3.      Add the transformation $y^{(k)} = BN_{\gamma^{(k)}, \beta^{(k)}}\left(x^{(k)}\right)$ to $N_{BN}^{tr}$
4.      Modify each layer in $N_{BN}^{tr}$ with input $x^{(k)}$ to take $y^{(k)}$ instead
5.  Train $N_{BN}^{tr}$ to optimize the parameters $\Theta \cup \{\gamma^{(k)}, \beta^{(k)}\}_{k=1}^{K}$
6.  $N_{BN}^{inf} = N_{BN}^{tr}$    // Inference BN network with frozen parameters
7.  for $k = 1...K$ do
8.      Process multiple training mini-batches $B$, each of size $m$, and average over them:
9.          $E[x] = E_B[\mu_B]$ and $Var[x] = \frac{m}{m - 1} E_B\left[\sigma_B^2\right]$
10.     In $N_{BN}^{inf}$, replace the transform $y = BN_{\gamma, \beta}(x)$ with
11.         $y = \frac{\gamma}{\sqrt{Var[x] + \epsilon}} \times x + \left(\beta - \frac{\gamma E[x]}{\sqrt{Var[x] + \epsilon}}\right)$
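Step 11 is just a fixed affine map of the activation; a short sketch (the parameter names are placeholders, not names used in the notes):

```python
import numpy as np

def batchnorm_inference(x, gamma, beta, mean, var, eps=1e-5):
    # Frozen inference transform of step 11:
    # y = gamma / sqrt(Var[x] + eps) * x + (beta - gamma * E[x] / sqrt(Var[x] + eps))
    scale = gamma / np.sqrt(var + eps)
    return scale * x + (beta - scale * mean)
```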
However
They found something quite interesting
Actually Batch Normalization
They actually have a connected result
Fully Connected Layer
If layer $(l - 1)$ is also fully connected:

$z_i^{(l)} = \sum_{k=0}^{m^{(l-1)}} w_{i,k}^{(l)} y_k^{(l-1)}, \quad \text{thus} \quad y_i^{(l)} = f\left(z_i^{(l)}\right)$

Otherwise
Layer $l$ expects $m_1^{(l-1)}$ feature maps of size $m_2^{(l-1)} \times m_3^{(l-1)}$ as input.
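A one-function sketch of this layer; the ReLu default activation and the flattening of incoming feature maps are assumptions for the example.

```python
import numpy as np

def fully_connected_forward(y_prev, W, f=lambda z: np.maximum(0.0, z)):
    """z_i = sum_k w_{i,k} * y_k, then y_i = f(z_i).

    y_prev : activations of layer l-1, shape (m_prev,); if the previous layer
             produced feature maps, flatten them first with y_prev.ravel().
    W      : weight matrix of shape (m, m_prev).
    f      : point-wise activation (ReLu used here as an assumed default).
    """
    z = W @ y_prev
    return f(z)
```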
Then
Here
Where $w_{i,j,r,s}^{(l)}$
It denotes the weight connecting the unit at position $(r, s)$ in the $j$th feature map of layer $(l - 1)$ and the $i$th unit in layer $l$.
Something Notable
In practice, convolutional layers are used to learn a feature hierarchy, and one or more fully connected layers are used for classification purposes based on the computed features.
Basically

$L(W) = \sum_{n=1}^{N} E_n(W) = \sum_{n=1}^{N} \sum_{k=1}^{K} \left( y_{nk}^{(l)} - t_{nk} \right)^2 \quad \text{(Sum of Squared Error)}$

$L(W) = \sum_{n=1}^{N} E_n(W) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \log y_{nk}^{(l)} \quad \text{(Cross-Entropy Error)}$
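Both losses in a few NumPy lines; the one-hot targets and softmax-style probabilities assumed for the cross-entropy are choices made for the example.

```python
import numpy as np

def sum_of_squared_error(y, t):
    # y, t : arrays of shape (N, K) -- network outputs and targets
    return np.sum((y - t) ** 2)

def cross_entropy_error(y, t):
    # y : predicted class probabilities (e.g., softmax outputs), t : one-hot targets
    return -np.sum(t * np.log(y))
```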
About this
We have the following Architecture
Therefore, we have
Layer $l = 1$
This layer is using a ReLu $f$ with 3 channels:

$\left(Y_1^{(l)}\right)_{x,y} = \left(B_1^{(l)}\right)_{x,y} + \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{11}^{(l)}\right)_{k,t} \left(Y_1^{(l-1)}\right)_{x-k,\,y-t}$

$\left(Y_2^{(l)}\right)_{x,y} = \left(B_2^{(l)}\right)_{x,y} + \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{21}^{(l)}\right)_{k,t} \left(Y_1^{(l-1)}\right)_{x-k,\,y-t}$

$\left(Y_3^{(l)}\right)_{x,y} = \left(B_3^{(l)}\right)_{x,y} + \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{31}^{(l)}\right)_{k,t} \left(Y_1^{(l-1)}\right)_{x-k,\,y-t}$
Layer $l = 2$
Then, you repeat the previous process
The fully connected layer
We have, for simplicity's sake
After collecting all input/output
Therefore
We have using sum of squared errors (loss function):

$L = \frac{1}{2}\left(y_1^{(6)} - t_1^{(6)}\right)^2$
Therefore
With

$y_1^{(6)} = ReLu\left(z_1^{(5)}\right)$
Therefore
We have

$\frac{\partial y_1^{(6)}}{\partial w_{r,s,k}^{(5)}} = \frac{\partial f\left(z_1^{(5)}\right)}{\partial z_1^{(5)}} \times \frac{\partial z_1^{(5)}}{\partial w_{r,s,k}^{(5)}}$
Now, we need to derive $\frac{\partial z_1^{(5)}}{\partial w_{r,s,k}^{(5)}}$
We know that

$z_1^{(5)} = \sum_{k=1}^{9} \sum_{r=1}^{m_2^{(4)}} \sum_{s=1}^{m_3^{(4)}} w_{r,s,k}^{(5)} \left(Y_k^{(4)}\right)_{r,s}$

Finally

$\frac{\partial z_1^{(5)}}{\partial w_{r,s,k}^{(5)}} = \left(Y_k^{(4)}\right)_{r,s}$
Maxpooling
This is not derived after all; instead, we go directly for the max term.
Assume you get the max element for $f = 1, 2, ..., 9$ and $j = 1$:

$\left(Y_f^{(3)}\right)_{x,y} = \left(B_f^{(3)}\right)_{x,y} + \sum_{k=-h_1^{(3)}}^{h_1^{(3)}} \sum_{t=-h_2^{(3)}}^{h_2^{(3)}} \left(K_{f1}^{(3)}\right)_{k,t} \left(Y_1^{(2)}\right)_{x-k,\,y-t}$
Therefore
We have then

$\frac{\partial L}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \frac{1}{2} \times \frac{\partial \left(y_1^{(6)} - t_1^{(6)}\right)^2}{\partial \left(K_{f1}^{(3)}\right)_{k,t}}$

$\frac{\partial L}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \left(y_1^{(6)} - t_1^{(6)}\right) \times \frac{\partial f\left(z_1^{(5)}\right)}{\partial z_1^{(5)}} \times \frac{\partial z_i^{(5)}}{\partial \left(Y_f^{(4)}\right)_{x,y}} \times \frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(K_{f1}^{(3)}\right)_{k,t}}$
Therefore
We have

$\frac{\partial z_i^{(5)}}{\partial \left(Y_f^{(3)}\right)_{x,y}} = w_{x,y,f}^{(5)}$
Therefore
We have

$\frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(Y_f^{(3)}\right)_{x,y}} \times \frac{\partial \left(Y_f^{(3)}\right)_{x,y}}{\partial \left(K_{f1}^{(3)}\right)_{k,t}}$

Then

$\frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(Y_f^{(3)}\right)_{x,y}} = f'\left(\left(Y_f^{(3)}\right)_{x,y}\right)$
Finally, we have
The equation

$\frac{\partial \left(Y_f^{(3)}\right)_{x,y}}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \left(Y_1^{(2)}\right)_{x-k,\,y-t}$
The Other Equations
[1] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, “A survey of the
recent architectures of deep convolutional neural networks,” Artificial
Intelligence Review, vol. 53, no. 8, pp. 5455–5516, 2020.
[2] R. Szeliski, Computer Vision: Algorithms and Applications.
Berlin, Heidelberg: Springer-Verlag, 1st ed., 2010.
[3] S. Haykin, Neural Networks and Learning Machines.
No. v. 10 in Neural networks and learning machines, Prentice Hall,
2009.
[4] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction
and functional architecture in the cat’s visual cortex,” The Journal of
physiology, vol. 160, no. 1, p. 106, 1962.
[5] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., “Gradient-based
learning applied to document recognition,” Proceedings of the IEEE,
vol. 86, no. 11, pp. 2278–2324, 1998.
[6] W. Zhang, K. Itoh, J. Tanida, and Y. Ichioka, “Parallel distributed
processing model with local space-invariant interconnections and its
optical architecture,” Appl. Opt., vol. 29, pp. 4790–4797, Nov 1990.
[7] J. J. Weng, N. Ahuja, and T. S. Huang, “Learning recognition and
segmentation of 3-d objects from 2-d images,” in 1993 (4th)
International Conference on Computer Vision, pp. 121–128, IEEE,
1993.
[8] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus,
“Deconvolutional networks,” in 2010 IEEE Computer Society
Conference on computer vision and pattern recognition,
pp. 2528–2535, IEEE, 2010.
[9] D. Krishnan and R. Fergus, “Fast image deconvolution using
hyper-laplacian priors,” Advances in neural information processing
systems, vol. 22, pp. 1033–1041, 2009.
[10] Y. Wang, J. Yang, W. Yin, and Y. Zhang, “A new alternating
minimization algorithm for total variation image reconstruction,”
SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 248–272, 2008.
[11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.
The MIT Press, 2016.
[12] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller,
“Striving for simplicity: The all convolutional net,” 2015.
[13] T. G. Kolda and B. W. Bader, “Tensor decompositions and
applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
[14] C. Kong and S. Lucey, “Take it in your stride: Do we need striding in
cnns?,” arXiv preprint arXiv:1712.02502, 2017.
[15] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.
[16] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch
normalization help optimization?,” in Advances in Neural Information
Processing Systems, pp. 2483–2493, 2018.