Neural Network Notes


Introduction to Neural Networks and Deep Learning

Introduction to the Convolutional Network

Andres Mendez-Vazquez

March 28, 2021

1 / 148
Outline
1 Introduction
The Long Path
The Problem of Image Processing
Multilayer Neural Network Classification
Drawbacks
Possible Solution

2 Convolutional Networks
History
Local Connectivity
Sharing Parameters

3 Layers
Convolutional Layer
Convolutional Architectures
A Little Bit of Notation
Deconvolution Layer
Alternating Minimization
Non-Linearity Layer
Fixing the Problem, ReLu function
Back to the Non-Linearity Layer
Rectification Layer
Local Contrast Normalization Layer
Sub-sampling and Pooling
Strides
Normalization Layer AKA Batch Normalization
Finally, The Fully Connected Layer

4 An Example of CNN
The Proposed Architecture
Backpropagation
Deriving wr,s,k
Deriving the Kernel Filters
2 / 148
Outline
1 Introduction
The Long Path
The Problem of Image Processing
Multilayer Neural Network Classification
Drawbacks
Possible Solution

2 Convolutional Networks
History
Local Connectivity
Sharing Parameters

3 Layers
Convolutional Layer
Convolutional Architectures
A Little Bit of Notation
Deconvolution Layer
Alternating Minimization
Non-Linearity Layer
Fixing the Problem, ReLu function
Back to the Non-Linearity Layer
Rectification Layer
Local Contrast Normalization Layer
Sub-sampling and Pooling
Strides
Normalization Layer AKA Batch Normalization
Finally, The Fully Connected Layer

4 An Example of CNN
The Proposed Architecture
Backpropagation
Deriving wr,s,k
Deriving the Kernel Filters
3 / 148
The Long Path [1]
[Timeline figure: a brief history of CNN architectures, from early attempts (1979 Neocognitron, 1989 ConvNet, 1998 LeNet), through CNN stagnation and the programming/GPU era (early 2000s, 2006 max-pooling, 2006-2007 GPU/NVIDIA, 2010 ImageNet), the depth and spatial-exploitation revolution (2012 AlexNet, 2013 ZFNet and 3D CNNs, 2014 VGG with small-size filters and effective receptive fields, GoogLeNet and the Inception block, Inception-V2/V3/V4, Inception-ResNet), skip connections and multi-path connectivity (2015 ResNet and Highway Net, 2016 DenseNet, 2017 FractalNet, ResNeXt, WideResNeXt, PolyNet, PyramidalNet), up to feature-map and channel exploitation and the attention revolution (2018 SE-Net, CMPE-SE, CBAM, Residual Attention Module, Channel Boosted CNN, and, beyond 2018, Transformer-CNN hybrids).]
4 / 148
Digital Images as pixels in a digitized matrix [2]

[Figure: an illumination source lighting a scene and an imaging system producing the digitized output matrix.]

6 / 148
Further [2]

Pixel values typically represent


Gray levels, colors, heights, opacities etc

Something Notable
Remember digitization implies that a digital image is an
approximation of a real scene

7 / 148
Images

Common image formats include


One sample per pixel (B&W or Grayscale)
Three samples/pixel per point (Red, Green, and Blue)
Four samples/pixel per point (Red, Green, Blue, and “Alpha”)

8 / 148
Therefore, we have the following process

Low Level Process

Image → Noise Removal → Sharpening

9 / 148
Example

Edge Detection

10 / 148
Then

Mid Level Process

Input (Image) → Processes (Segmentation, Object Recognition) → Output (Attributes)

11 / 148
Example

Object Recognition

12 / 148
Therefore

It would be nice to automate all these processes

We would solve a lot of headaches when setting up such a process

Why not use the data sets?

By using a Neural Network that replicates the process.

13 / 148
Multilayer Neural Network Classification

We have the following classification [3]

15 / 148
Drawbacks of previous neural networks

The number of trainable parameters becomes extremely large

Large N

17 / 148
Drawbacks of previous neural networks
In addition, little or no invariance to shifting, scaling, and other forms
of distortion

Large N

18 / 148
Drawbacks of previous neural networks
In addition, little or no invariance to shifting, scaling, and other forms
of distortion

Large N

Shift to the Left

19 / 148
Drawbacks of previous neural networks

The topology of the input data is completely ignored

20 / 148
For Example

We have
Black and white patterns: $2^{32 \times 32} = 2^{1024}$
Gray scale patterns: $256^{32 \times 32} = 256^{1024}$

21 / 148
For Example

If we have an element that the network has never seen

22 / 148
Possible Solution

We can minimize these drawbacks given that

A fully connected network of sufficient size can produce outputs that are invariant with respect to such variations.

Problem!!!
Training time
Network size
Free parameters

23 / 148
Hubel/Wiesel Architecture

Something Notable [4]


D. Hubel and T. Wiesel (1959, 1962, Nobel Prize 1981)

They commented
The visual cortex consists of a hierarchy of simple, complex, and
hyper-complex cells

25 / 148
Something Like

We have
Feature Hierarchy

Hyper-complex cells

Complex cells

Simple cells

26 / 148
History

Convolutional Neural Networks (CNN) were invented by [5]


In 1989, Yann LeCun and Yoshua Bengio introduced the concept of Convolutional Neural Networks.

[Figure: a network mapping the input image through hidden layers that detect patterns of local contrast, then face features, then faces, to the output.]

27 / 148
About CNN’s

Something Notable
CNNs were neurobiologically motivated by the findings of locally sensitive and orientation-selective nerve cells in the visual cortex.

In addition
They designed a network structure that implicitly extracts relevant
features.

Properties
Convolutional Neural Networks are a special kind of multi-layer neural
networks.

28 / 148
About CNN’s

In addition
CNN is a feed-forward network that can extract topological properties
from an image.
Like almost every other neural network, they are trained with a version of the back-propagation algorithm.
Convolutional Neural Networks are designed to recognize visual
patterns directly from pixel images with minimal preprocessing.
They can recognize patterns with extreme variability.

29 / 148
Local Connectivity

We have the following idea [6]


Instead of using a full connectivity...

Input Image

We would have something like this


$$y_i = f\left(\sum_{i=1}^{n} w_i x_i\right) \qquad (1)$$

31 / 148
Local Connectivity

We decide only to connect the neurons in a local way


Each hidden unit is connected only to a subregion (patch) of the
input image.
It is connected to all channels:

32 / 148
Example

For gray scale, we get something like this

Input Image

Then, our formula changes


 
X
yi = f  wi x i  (2)
i∈Lp

33 / 148
Example

In the case of the 3 channels

Input Image

Thus

$$y_i = f\left(\sum_{i \in L_p,\, c} w_i x_i^c\right) \qquad (3)$$

34 / 148
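To make Equation (3) concrete, here is a minimal NumPy sketch (not from the slides): one hidden unit connected to a 3 × 3 patch of a 3-channel image. The patch location, the sigmoid non-linearity, and every variable name are choices made only for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 3-channel input image, shape (channels, height, width).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(3, 8, 8)).astype(float) / 255.0

# Local receptive field L_p: a 3x3 patch anchored at (row, col) = (2, 2),
# connected to all 3 channels, as in Equation (3).
patch = image[:, 2:5, 2:5]               # shape (3, 3, 3)
weights = rng.normal(size=patch.shape)   # one weight per connected input

# y_i = f( sum over the patch and the channels of w * x )
y_i = sigmoid(np.sum(weights * patch))
print(y_i)
```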
Solving the following problems...

First
A fully connected hidden layer would have an unmanageable number of parameters

Second
Computing the linear activation of the hidden units would be quite expensive

35 / 148
How this looks in the image...

We have

Receptive Field

36 / 148
Parameter Sharing

Second Idea
Share a matrix of parameters across certain units.

These units are organized into

The same feature “map”
Where the units share the same parameters (for example, the same mask)

38 / 148
Example

We have something like this


Feature Map 1 Feature Map 2 Feature Map 3

39 / 148
Now, in our notation

We have a collection of matrices representing this connectivity


$W_{ij}$ is the connection matrix connecting the $i$th input channel with the $j$th feature map.
In each cell of these matrices is the weight to be multiplied with the local input to the local neuron.

And now, why the name convolution?

Yes!!! The definition is coming now.

40 / 148
Digital Images

In computer vision [2, 7]


We usually operate on digital (discrete) images:
Sample the 2D space on a regular grid.
Quantize each sample (round to nearest integer).

The image can now be represented as a matrix of integer values,


$$I : [a, b] \times [c, d] \to [0..255]$$

$$I = \begin{pmatrix}
79 & 5 & 6 & 90 & 12 & 34 & 2 & 1\\
8 & 90 & 12 & 34 & 26 & 78 & 34 & 5\\
8 & 1 & 3 & 90 & 12 & 34 & 11 & 61\\
77 & 90 & 12 & 34 & 200 & 2 & 9 & 45\\
1 & 3 & 90 & 12 & 20 & 1 & 6 & 23
\end{pmatrix}$$

(rows indexed by $i$ going down, columns by $j$ going right)
42 / 148
Many times we want to eliminate noise in an image

For example, a moving average

43 / 148
This is defined as

This last moving average can be seen as

$$(I * K)(i) = \sum_{j=-n}^{n} I(i - j) \times K(j) = \frac{1}{N} \sum_{j=-m}^{m} I(i - j) \qquad (4)$$

With $I(j)$ representing the value of the pixel at position $j$, and

$$K(j) = \begin{cases} \frac{1}{N} & \text{if } j \in \{-m, -m+1, \ldots, -1, 0, 1, \ldots, m-1, m\} \\ 0 & \text{else} \end{cases}$$

with $0 < m < n$ and $N = 2m + 1$.

44 / 148
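A small sketch of the 1-D moving average of Equation (4), assuming a window of size N = 2m + 1; the toy signal and the use of `np.convolve` are choices made here for illustration.

```python
import numpy as np

def moving_average(signal, m):
    """1D moving average written as the convolution of Equation (4):
    K(j) = 1/N for j in {-m, ..., m} and 0 otherwise, with N = 2m + 1."""
    N = 2 * m + 1
    kernel = np.full(N, 1.0 / N)
    # mode="same" keeps the output length equal to the input length.
    return np.convolve(signal, kernel, mode="same")

noisy = np.array([4.0, 5, 1, 1, 2, 6, 2, 6, 3, 5, 7, 3])
print(moving_average(noisy, m=1))
```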
This can be generalized to 2D images

Left I and Right I ∗ K

45 / 148
Moving average in 2D

Basically in 2D
We can define different types of filters using the idea of a weighted average

$$(I * K)(i, j) = \sum_{s=-m}^{m} \sum_{l=-m}^{m} I(i - s, j - l) \times K(s, l) \qquad (5)$$

For example, the Box Filter

$$K = \frac{1}{9} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \qquad \text{``The Box Filter''} \qquad (6)$$

49 / 148
Another Example

The Gaussian Filter


 
$$K = \begin{pmatrix}
0 & 1 & 2 & 1 & 0\\
1 & 3 & 5 & 3 & 1\\
2 & 5 & 9 & 5 & 2\\
1 & 3 & 5 & 3 & 1\\
0 & 1 & 2 & 1 & 0
\end{pmatrix}$$

Thus, we can define the concept of convolution


Yes, using the previous ideas

50 / 148
Convolution

Definition
Let $I : [a, b] \times [c, d] \to [0..255]$ be the image and $K : [e, f] \times [h, i] \to \mathbb{R}$ be the kernel. The output of convolving $I$ with $K$, denoted $I * K$, is

$$(I * K)[x, y] = \sum_{s=-n}^{n} \sum_{l=-n}^{n} I(x - s, y - l) \times K(s, l)$$

51 / 148
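Below is a direct, loop-based sketch of this definition, applied to the box filter of Equation (6); border positions are simply skipped (a 'valid' output), which is one possible convention among several.

```python
import numpy as np

def convolve2d(I, K):
    """Direct implementation of (I * K)[x, y] = sum_s sum_l I(x-s, y-l) K(s, l)
    for a (2n+1) x (2n+1) kernel; border positions are skipped ('valid')."""
    n = K.shape[0] // 2
    H, W = I.shape
    out = np.zeros((H - 2 * n, W - 2 * n))
    for x in range(n, H - n):
        for y in range(n, W - n):
            acc = 0.0
            for s in range(-n, n + 1):
                for l in range(-n, n + 1):
                    acc += I[x - s, y - l] * K[s + n, l + n]
            out[x - n, y - n] = acc
    return out

I = np.arange(36, dtype=float).reshape(6, 6)
box = np.full((3, 3), 1.0 / 9.0)     # the box filter of Equation (6)
print(convolve2d(I, box))
```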
Now, why not expand this idea
Imagine that a three-channel image is split into three feature maps
Feature Maps

52 / 148
Mathematically, we have the following

Map $i$

$$(I * k)[x, y, o] = \sum_{c=1}^{3} \sum_{l=-n}^{n} \sum_{s=-n}^{n} I(x - l, y - s, c) \times k(l, s, c, o)$$

Therefore
The convolution works as a
Filter
Encoder
Decoder
etc

53 / 148
For Example, Encoder

We have the following situation

54 / 148
Notation

We have the following

$Y_j^{(l)}$ is a matrix representing the $j$th feature map of layer $l$.
$K_{ij}^{(l)}$ is the kernel filter connecting the $j$th input feature map to the $i$th output feature map of layer $l$.

Therefore
We can see the convolution as a fusion of information from different feature maps:

$$\sum_{j=1}^{m_1^{(l-1)}} Y_j^{(l-1)} * K_{ij}^{(l)}$$

55 / 148
Thus, we have
Given a specific layer $l$, the $i$th feature map in that layer is equal to

$$Y_i^{(l)}(x, y) = B_i^{(l)}(x, y) + \sum_{j=1}^{m_1^{(l-1)}} \sum_{s=-k_s}^{k_s} \sum_{l'=-k_s}^{k_s} Y_j^{(l-1)}(x - s, y - l')\, K_{ij}^{(l)}(s, l')$$

Where
$Y_i^{(l)}$ is the $i$th feature map in layer $l$.
$B_i^{(l)}$ is the bias matrix for output map $i$.
$K_{ij}^{(l)}$ is the filter of size $\left[2h_1^{(l)} + 1\right] \times \left[2h_2^{(l)} + 1\right]$.

Thus
The input of layer $l$ comprises $m_1^{(l-1)}$ feature maps from the previous layer, each of size $m_2^{(l-1)} \times m_3^{(l-1)}$.
56 / 148
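As a rough sketch of the feature-map formula above (again with 'valid' borders, and zero bias maps in the example call), the following NumPy code sums the convolutions of all input maps for each output map; the shapes and names are assumptions made only for this example.

```python
import numpy as np

def conv_layer_forward(Y_prev, K, B):
    """Y_prev: previous feature maps, shape (m1_prev, H, W).
    K: kernels, shape (m1, m1_prev, 2k+1, 2k+1).
    B: bias maps, shape (m1, H-2k, W-2k).
    Returns the m1 feature maps of layer l ('valid' borders, no padding)."""
    m1, m1_prev, kh, kw = K.shape
    k = kh // 2
    H, W = Y_prev.shape[1:]
    out = B.copy()
    for i in range(m1):                     # output feature map i
        for j in range(m1_prev):            # sum over input feature maps j
            for x in range(k, H - k):
                for y in range(k, W - k):
                    patch = Y_prev[j, x - k:x + k + 1, y - k:y + k + 1]
                    # flip the patch so the sum matches Y(x-s, y-l') K(s, l')
                    out[i, x - k, y - k] += np.sum(patch[::-1, ::-1] * K[i, j])
    return out

rng = np.random.default_rng(1)
Y_prev = rng.normal(size=(3, 8, 8))
K = rng.normal(size=(4, 3, 3, 3))
B = np.zeros((4, 6, 6))
print(conv_layer_forward(Y_prev, K, B).shape)   # (4, 6, 6)
```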
Therefore

The output of layer $l$

It consists of $m_1^{(l)}$ feature maps of size $m_2^{(l)} \times m_3^{(l)}$

Something Notable
$m_2^{(l)}$ and $m_3^{(l)}$ are influenced by border effects.
Therefore, when the convolution sum is defined properly, the output feature maps have size

$$m_2^{(l)} = m_2^{(l-1)} - 2h_1^{(l)}, \qquad m_3^{(l)} = m_3^{(l-1)} - 2h_2^{(l)}$$
57 / 148
Why? The Border

Example
Convolutional Maps

58 / 148
Special Case

When l = 1
The input is a single image I consisting of one or more channels.

59 / 148
Thus

We have
Each feature map $Y_i^{(l)}$ in layer $l$ consists of $m_2^{(l)} \cdot m_3^{(l)}$ units arranged in a two-dimensional array.

Thus, the unit at position $(x, y)$ computes

$$\left(Y_i^{(l)}\right)_{x,y} = \left(B_i^{(l)}\right)_{x,y} + \sum_{j=1}^{m_1^{(l-1)}} \left(K_{ij}^{(l)} * Y_j^{(l-1)}\right)_{x,y} = \left(B_i^{(l)}\right)_{x,y} + \sum_{j=1}^{m_1^{(l-1)}} \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{ij}^{(l)}\right)_{k,t} \left(Y_j^{(l-1)}\right)_{x-k,\, y-t}$$
60 / 148
Here, an interesting case

Only a Historical Note


The foundations for deconvolution came from Norbert Wiener of the
Massachusetts Institute of Technology in his book “Extrapolation,
Interpolation, and Smoothing of Stationary Time Series” (1949)

Basically, it tries to solve the following equation, with $Y^{(l)}$ the unknown layer that we want to recover:

$$Y_i^{(l)} * K_{ij}^{(l)} = Y_j^{(l-1)}$$
61 / 148
In [8]

They proposed a sparsity idea to start the implementation as

$$C_l(y) = \sum_{i=1}^{m_1^{(l-1)}} \left\| \sum_{j=1}^{m_1^{(l)}} Y_j^{(l)} * K_{ij}^{(l)} - Y_i^{(l-1)} \right\|_2^2 + \sum_{j=1}^{m_1^{(l)}} \left| Y_j^{(l)} \right|^p$$

Typically, $p = 1$, although other values are possible.

They look for the arguments that minimize the cost function over a set of images $y = \left\{ y^1, \ldots, y^I \right\}$

$$\arg\min_{Y_j^{(l)},\, K_{ij}^{(l)}} C_l(y)$$

62 / 148
Here

Then, we can generalize such a cost function for the total set of images (minibatch)

$$C_l(y) = \frac{\lambda}{2} \sum_{k=1}^{I} \sum_{i=1}^{m_1^{(l-1)}} \left\| \sum_{j=1}^{m_1^{(l)}} g_{ij}^{(l)} \left( Y_j^{(l,k)} * K_{ij}^{(l)} \right) - Y_i^{(l-1,k)} \right\|_2^2 + \sum_{k=1}^{I} \sum_{j=1}^{m_1^{(l)}} \left| Y_j^{(l,k)} \right|^p$$

Here, we have
$Y_i^{(l-1,k)}$ are the feature maps from the previous layer.
$g_{ij}^{(l)}$ is a fixed binary matrix that determines the connectivity between feature maps at different layers, i.e. whether $Y_j^{(l,k)}$ is connected to certain $Y_i^{(l-1,k)}$ elements.

63 / 148
This can be seen as

We have the following layer

[Figure: the feature maps of layer $l$, each convolved with its kernel, summed to reconstruct a map of layer $l-1$.]

64 / 148
They noticed some drawbacks

Using the following optimizations


Direct Gradient Descent
Iterative Reweighted Least Squares
Stochastic Gradient Descent

All of them presented problems!!!


They solved it using a new cost function

65 / 148
We have that

An interesting use of an auxiliary variable/layer $X_j^{(l,k)}$

$$C_l(y) = \frac{\lambda}{2} \sum_{k=1}^{I} \sum_{i=1}^{m_1^{(l-1)}} \left\| \sum_{j=1}^{m_1^{(l)}} g_{ij}^{(l)} \left( Y_j^{(l,k)} * K_{ij}^{(l)} \right) - Y_i^{(l-1,k)} \right\|_2^2 + \frac{\beta}{2} \sum_{k=1}^{I} \sum_{j=1}^{m_1^{(l)}} \left\| Y_j^{(l,k)} - X_j^{(l,k)} \right\|_2^2 + \sum_{k=1}^{I} \sum_{j=1}^{m_1^{(l)}} \left| X_j^{(l,k)} \right|^p$$

This can be solved using

Alternating minimization...
66 / 148
This is based on

Fixing in turn the values of $Y_j^{(l,k)}$ and $X_j^{(l,k)}$
They call these two stages the Y and X sub-problems...

Therefore, they noticed

These terms introduce the sparsity constraint and give numerical stability [9, 10]

67 / 148
Y sub-problem

Taking the derivative with respect to $Y_j^{(l,k)}$

$$\frac{\partial C_l(y)}{\partial Y_j^{(l,k)}} = \lambda \sum_{i=1}^{m_1^{(l-1)}} F_{ij}^{(l)T} \left( \sum_{t=1}^{m_1^{(l)}} F_{tj}^{(l)} Y_j^{(l,k)} - Y_j^{(l-1,k)} \right) + \beta \left[ Y_j^{(l,k)} - X_j^{(l,k)} \right] = 0$$

Where

$$F_{ij}^{(l)} = \begin{cases} \text{a sparse convolution matrix} & \text{if } g_{ij}^{(l)} = 1 \\ 0 & \text{if } g_{ij}^{(l)} = 0 \end{cases}$$

68 / 148
Therefore

$F_{ij}^{(l)}$ as a sparse convolution matrix
It is equivalent to convolving with $K_{ij}^{(l)}$

Actually, if you fix $i$, you finish with a linear system $Ax = 0$

Please take a look at the paper... it is interesting
Actually, this seems to be the implementation in the TensorFlow framework
69 / 148
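The following is only a toy, one-dimensional illustration of the alternating scheme, not the paper's implementation: the convolution is replaced by a small dense matrix `F` (standing in for the sparse convolution matrix mentioned above), a single image is used, and p = 1. The Y update solves its quadratic sub-problem in closed form; the X update is the soft-thresholding that minimizes the absolute-value term. All values and names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
F = rng.normal(size=(20, 30))        # stands in for the (sparse) convolution matrix
y_prev = rng.normal(size=20)         # known feature map of layer l-1
lam, beta = 1.0, 5.0

y = np.zeros(30)                     # unknown feature map Y of layer l
x = np.zeros(30)                     # auxiliary variable X

for _ in range(50):
    # Y sub-problem: quadratic, solve (lam F^T F + beta I) y = lam F^T y_prev + beta x
    A = lam * F.T @ F + beta * np.eye(30)
    b = lam * F.T @ y_prev + beta * x
    y = np.linalg.solve(A, b)
    # X sub-problem: soft-thresholding, the closed form for the |x|_1 term
    x = np.sign(y) * np.maximum(np.abs(y) - 1.0 / beta, 0.0)

print(np.round(x[:5], 3))
```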
As in a Multilayer Perceptron
We use a non-linearity
However, there is a drawback when using Back-Propagation with a sigmoid function

$$s(x) = \frac{1}{1 + e^{-x}}$$

Because if we imagine a Convolutional Network as a composition of layer functions $f_i$

$$y(A) = f_t \circ f_{t-1} \circ \cdots \circ f_2 \circ f_1(A)$$

with $f_t$ the last layer.

Therefore, we finish with a sequence of derivatives

$$\frac{\partial y(A)}{\partial w_{1i}} = \frac{\partial f_t(f_{t-1})}{\partial f_{t-1}} \cdot \frac{\partial f_{t-1}(f_{t-2})}{\partial f_{t-2}} \cdots \frac{\partial f_2(f_1)}{\partial f_1} \cdot \frac{\partial f_1(A)}{\partial w_{1i}}$$
71 / 148
Therefore

Given the commutativity of the product

You could group the derivatives of the sigmoids

$$f(x) = \frac{ds(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}$$

Therefore, deriving again

$$\frac{df(x)}{dx} = -\frac{e^{-x}}{(1 + e^{-x})^2} + \frac{2\left(e^{-x}\right)^2}{(1 + e^{-x})^3}$$

After setting $\frac{df(x)}{dx} = 0$
We have that the maximum is at $x = 0$

72 / 148
Therefore

The maximum of the derivative of the sigmoid

$$f(0) = 0.25$$

Therefore, given a Deep Convolutional Network

We could finish with

$$\lim_{k \to \infty} \left( \frac{ds(x)}{dx} \right)^k \le \lim_{k \to \infty} (0.25)^k \to 0$$

A vanishing derivative
Making it quite difficult to train a deeper network using this activation function
73 / 148
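A quick numerical check of this argument (not from the slides), using the standard identity $s'(x) = s(x)(1 - s(x))$; the depths chosen below are arbitrary.

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)           # equals e^{-x} / (1 + e^{-x})^2

print(sigmoid_derivative(0.0))      # 0.25, the maximum possible value

# Even in this best case, a product of k such factors collapses quickly.
for k in (5, 10, 20, 50):
    print(k, 0.25 ** k)
```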
Thus

The need to introduce a new function

$$f(x) = x^+ = \max(0, x)$$

It is called ReLU or Rectifier

With a smooth approximation (Softplus function)

$$f(x) = \frac{\ln\left(1 + e^{kx}\right)}{k}$$

74 / 148
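A minimal sketch of the two functions; the value k = 10 used below is only an example of a "large" k.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x, k=1.0):
    # log1p(exp(k x)) / k; for large k x it approaches max(0, x)
    return np.log1p(np.exp(k * x)) / k

x = np.linspace(-3, 3, 7)
print(relu(x))
print(softplus(x, k=1.0))
print(softplus(x, k=10.0))   # larger k hugs the ReLU more closely
```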
Therefore, we have

When k = 1
[Figure: the Softplus and ReLU curves plotted together over roughly $[-3, 3]$; with $k = 1$ the Softplus is a smooth version of the ReLU around the origin.]
75 / 148
Increase k

When $k = 10^4$

[Figure: at this zoomed-in scale the Softplus curve essentially coincides with the ReLU.]
76 / 148
Non-Linearity Layer

If layer $l$ is a non-linearity layer

Its input is given by $m_1^{(l-1)}$ feature maps.

What about the output

Its output comprises again $m_1^{(l)} = m_1^{(l-1)}$ feature maps

Each of them of size

$$m_2^{(l-1)} \times m_3^{(l-1)} \qquad (7)$$

with $m_2^{(l)} = m_2^{(l-1)}$ and $m_3^{(l)} = m_3^{(l-1)}$.

77 / 148
Thus

With the final output

$$Y_i^{(l)} = f\left(Y_i^{(l-1)}\right) \qquad (8)$$

Where
$f$ is the activation function used in layer $l$ and operates point-wise.

You can also add a gain to compensate

$$Y_i^{(l)} = g_i\, f\left(Y_i^{(l-1)}\right) \qquad (9)$$

78 / 148
Rectification Layer, Rabs

Now a rectification layer

Then its input comprises $m_1^{(l)}$ feature maps of size $m_2^{(l-1)} \times m_3^{(l-1)}$.

Then, the absolute value of each component of the feature maps is computed

$$Y_i^{(l)} = \left| Y_i^{(l-1)} \right| \qquad (10)$$

Where the absolute value

It is computed point-wise, such that the output consists of $m_1^{(l)} = m_1^{(l-1)}$ feature maps unchanged in size.

80 / 148
Thus

We have that
Experiments show that rectification plays a central role in achieving good
performance.

You can find this in


K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Computer Vision, International Conference on, pages 2146–2153, 2009.

Remark
Rectification could be included in the non-linearity layer.
But also it can be seen as an independent layer.

81 / 148
Given that we are using Backpropagation
We need a smooth approximation to $f(x) = |x|$
For this, we have

$$\frac{\partial f}{\partial x} = \operatorname{sgn}(x)$$

when $x \neq 0$. Why?

We can use the following approximation

$$\operatorname{sgn}(x) \approx 2\left( \frac{\exp\{kx\}}{1 + \exp\{kx\}} \right) - 1$$

Therefore, by integration and working out the constant $C$, we have

$$f(x) = \frac{2}{k} \ln\left(1 + \exp\{kx\}\right) - x - \frac{2}{k} \ln(2)$$
82 / 148
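A small sketch of this smooth approximation of |x|; the value of k and the test points are arbitrary. Away from 0 the approximation differs from |x| by the constant 2 ln(2)/k, which vanishes as k grows.

```python
import numpy as np

def smooth_sign(x, k=50.0):
    # 2 * (e^{kx} / (1 + e^{kx})) - 1, the approximation to sgn(x) above
    return 2.0 / (1.0 + np.exp(-k * x)) - 1.0

def smooth_abs(x, k=50.0):
    # (2/k) ln(1 + e^{kx}) - x - (2/k) ln 2, the antiderivative with f(0) = 0
    return (2.0 / k) * np.log1p(np.exp(k * x)) - x - (2.0 / k) * np.log(2.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(smooth_sign(x, k=50.0))
print(smooth_abs(x, k=50.0))   # close to |x| away from 0, up to 2 ln(2)/k
print(np.abs(x))
```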
We get the following situation

Something Notable

[Figure: the smooth approximation of $|x|$ plotted near the origin; it is differentiable at $x = 0$ and follows $|x|$ away from it.]
83 / 148
Normalizing

Contrast normalization layer


The task of a local contrast normalization layer:
To enforce local competitiveness between adjacent units within a feature map.
To enforce competitiveness between units at the same spatial location across feature maps.

We have two types of operations


Subtractive Normalization.
Brightness Normalization.

85 / 148
Subtractive Normalization

Given $m_1^{(l-1)}$ feature maps of size $m_2^{(l-1)} \times m_3^{(l-1)}$
The output of layer $l$ comprises $m_1^{(l)} = m_1^{(l-1)}$ feature maps unchanged in size.

With the operation

$$Y_i^{(l)} = Y_i^{(l-1)} - K_{G(\sigma)} * \sum_{j=1}^{m_1^{(l-1)}} Y_j^{(l-1)} \qquad (11)$$

With

$$\left( K_{G(\sigma)} \right)_{x,y} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{x^2 + y^2}{2\sigma^2} \right\} \qquad (12)$$

86 / 148
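A rough sketch of Equation (11), under the assumptions that the Gaussian kernel is truncated to a small radius and normalized to sum to one, and that the borders are zero-padded; the radius, σ and all names are choices made only for this example.

```python
import numpy as np

def gaussian_kernel(radius, sigma):
    # (K_G(sigma))_{x,y} proportional to exp(-(x^2 + y^2) / (2 sigma^2)),
    # normalized here so the weights sum to one
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    K = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return K / K.sum()

def subtractive_normalization(Y_prev, sigma=1.0, radius=2):
    """Y_prev: feature maps of layer l-1, shape (m1, H, W).
    Subtract from every map a Gaussian-weighted local mean computed over the
    sum of all maps, roughly Equation (11); zero padding at the borders."""
    K = gaussian_kernel(radius, sigma)
    summed = Y_prev.sum(axis=0)                       # sum over the feature maps
    padded = np.pad(summed, radius, mode="constant")
    H, W = summed.shape
    local_mean = np.zeros_like(summed)
    for x in range(H):
        for y in range(W):
            local_mean[x, y] = np.sum(padded[x:x + 2 * radius + 1,
                                             y:y + 2 * radius + 1] * K)
    return Y_prev - local_mean                        # broadcast over the m1 maps

rng = np.random.default_rng(3)
print(subtractive_normalization(rng.normal(size=(3, 6, 6))).shape)
```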
Brightness Normalization

An alternative is to normalize the brightness in combination with the rectified linear units

$$\left( Y_i^{(l)} \right)_{x,y} = \frac{\left( Y_i^{(l-1)} \right)_{x,y}}{\left( \kappa + \lambda \sum_{j=1}^{m_1^{(l-1)}} \left( \left( Y_j^{(l-1)} \right)_{x,y} \right)^2 \right)^{\mu}} \qquad (13)$$

Where
$\kappa$, $\mu$ and $\lambda$ are hyperparameters which can be set using a validation set.
87 / 148
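A minimal sketch of Equation (13); the values chosen below for κ, λ and µ are placeholders only, since the slide leaves them as hyperparameters to be set on a validation set.

```python
import numpy as np

def brightness_normalization(Y_prev, kappa=2.0, lam=1e-4, mu=0.75):
    """Y_prev: feature maps of layer l-1, shape (m1, H, W).
    Divide each activation by (kappa + lam * sum_j Y_j(x, y)^2) ** mu,
    the sum running over the feature maps, as in Equation (13)."""
    denom = (kappa + lam * np.sum(Y_prev ** 2, axis=0)) ** mu
    return Y_prev / denom            # denom broadcasts over the m1 maps

rng = np.random.default_rng(4)
print(brightness_normalization(np.abs(rng.normal(size=(3, 4, 4)))).shape)
```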
Sub-sampling Layer

Motivation
The motivation of subsampling the feature maps obtained by previous
layers is robustness to noise and distortions.

How?
Normally, in traditional Convolutional Networks, subsampling is done by applying skipping factors!!!
However, it is possible to combine subsampling with pooling and do it in a separate layer

89 / 148
Sub-sampling

The subsampling layer


It seems to be acting as the well know sub-sampling pyramid

90 / 148
How is sub-sampling implemented?

We know that Image Pyramids


They were designed to find:
1 filter-based representations to decompose images into information at
multiple scales,
2 To extract features/structures of interest,
3 To attenuate noise.

Example of the usage of these filters


The SURF and SIFT filters

91 / 148
There are also other ways of doing this

Subsampling can be done using so-called skipping factors $s_1^{(l)}$ and $s_2^{(l)}$

The basic idea is to skip a fixed number of pixels

Therefore, the size of the output feature map is given by

$$m_2^{(l)} = \frac{m_2^{(l-1)} - 2h_1^{(l)}}{s_1^{(l)} + 1} \qquad \text{and} \qquad m_3^{(l)} = \frac{m_3^{(l-1)} - 2h_2^{(l)}}{s_2^{(l)} + 1}$$

92 / 148
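A tiny helper that evaluates this output-size formula (integer division is assumed here when the division is not exact):

```python
def output_size(m_prev, h, s):
    """Spatial size of a feature map after a (2h+1) kernel with skipping factor s,
    following the slide: (m_prev - 2h) / (s + 1)."""
    return (m_prev - 2 * h) // (s + 1)

# e.g. a 28-pixel-wide map, a 5x5 kernel (h = 2), skipping factor 1
print(output_size(28, h=2, s=1))   # 12
```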
What is Pooling?

Pooling
Spatial pooling is a way to compute an image representation based on encoded local features.

93 / 148
Pooling

Let l be a pooling layer


Its output comprises $m_1^{(l)} = m_1^{(l-1)}$ feature maps of reduced size.

Pooling Operation
It operates by placing windows at non-overlapping positions in each feature map and keeping one value per window, such that the feature maps are sub-sampled.

94 / 148
Thus

In the previous example


All feature maps are pooled and sub-sampled individually.

Each unit
In one of the $m_1^{(l)} = 4$ output feature maps represents the average or the maximum within a fixed window of the corresponding feature map in layer $(l-1)$.

95 / 148
Examples of pooling

Average pooling
When using a boxcar filter, the operation is called average pooling and the layer is denoted by $P_A$.

Input ($4 \times 4$):

4 5 1 1
2 6 2 6
3 5 7 3
1 9 2 1

Pooled output ($2 \times 2$), as shown in the figure:

4.5 5
9 6.5
96 / 148
Examples of pooling

Max pooling
For max pooling, the maximum value of each window is taken. The layer is denoted by $P_M$.

Input ($4 \times 4$):

4 5 1 1
2 6 2 6
3 5 7 3
1 9 2 1

Pooled output ($2 \times 2$), as shown in the figure:

5 6
9 7

97 / 148
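A short sketch of both pooling operations over non-overlapping 2 × 2 windows, applied to the 4 × 4 input used in the figures; the windowing convention here is an assumption and may not match the figures exactly.

```python
import numpy as np

def pool2d(Y, window=2, mode="max"):
    """Non-overlapping pooling over window x window blocks of one feature map.
    The map size is assumed to be divisible by the window."""
    H, W = Y.shape
    blocks = Y.reshape(H // window, window, W // window, window)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))     # average pooling

Y = np.array([[4, 5, 1, 1],
              [2, 6, 2, 6],
              [3, 5, 7, 3],
              [1, 9, 2, 1]], dtype=float)
print(pool2d(Y, mode="average"))
print(pool2d(Y, mode="max"))
```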
An interesting property

Something notable depending on the pooling area

“In all cases, pooling helps to make the representation become approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.”
Page 342, Ian Goodfellow et al., Deep Learning, 2016 [11].

The small amount

In the case of the previous examples, 1 pixel

98 / 148
Other Poolings

There are other types of pooling


L2 norm of a rectangular neighborhood
Weighted average based on the distance from the central pixel

However, we have another way of doing pooling


Striding!!!

99 / 148
Springenberg et al. [12]

They started talking about substituting max-pooling for something called a stride on the convolution

$$\left( Y_i^{(l)} \right)_{x,y} = \left( B_i^{(l)} \right)_{x,y} + \sum_{j=1}^{m_1^{(l-1)}} \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left( K_{ij}^{(l)} \right)_{k,t} \left( Y_j^{(l-1)} \right)_{x-k,\, y-t}$$

This is a Heuristic...

Basically, you jump around by a factor $r$ and $t$ for the width and height of the layer
It was proposed to decrease memory usage...
100 / 148
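A minimal sketch of a strided convolution (written as a cross-correlation, skipping the kernel flip): the output is evaluated only every r rows and t columns, which is what removes the need for a separate sub-sampling step. The toy input and kernel are assumptions made for the example.

```python
import numpy as np

def strided_conv2d(I, K, r=2, t=2):
    """'Valid' cross-correlation evaluated only every r rows and t columns,
    i.e. convolution with strides instead of a separate pooling step."""
    kh, kw = K.shape
    H, W = I.shape
    rows = range(0, H - kh + 1, r)
    cols = range(0, W - kw + 1, t)
    out = np.zeros((len(rows), len(cols)))
    for oi, x in enumerate(rows):
        for oj, y in enumerate(cols):
            out[oi, oj] = np.sum(I[x:x + kh, y:y + kw] * K)
    return out

I = np.arange(36, dtype=float).reshape(6, 6)
K = np.ones((3, 3)) / 9.0
print(strided_conv2d(I, K, r=2, t=2).shape)   # (2, 2)
```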
Example

Horizontal Stride

101 / 148
There are attempts to understand its effects

At Convolution Level and using Tensors [13]


“Take it in your stride: Do we need striding in CNNs?” by Chen
Kong, Simon Lucey [14]

Please read Kolda’s Paper before you get into the other
You need a little bit of notation...

104 / 148
Here, the people at Google [15] around 2015

They commented on the “Internal Covariate Shift” phenomenon

Due to the change in the distribution of each layer’s input

They claim
The mini-batch forces those changes, which impact the learning capabilities of the network.

In Neural Networks, they define this

Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.

106 / 148
They gave the following reasons
Consider a layer with the input $u$ that adds the learned bias $b$
Then, it normalizes the result by subtracting the mean of the activation over the training data:

$$\hat{x} = x - E[x]$$

with $X = \{x_1, \ldots, x_N\}$ the data samples and $E[x] = \frac{1}{N} \sum_{i=1}^{N} x_i$

Now, if the gradient ignores the dependence of $E[x]$ on $b$

Then $b \leftarrow b + \Delta b$, where $\Delta b \propto -\frac{\partial l}{\partial \hat{x}}$

Finally

$$u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b]$$
107 / 148
107 / 148
Then

The following will happen
The update to b by \Delta b leads to no change in the output of the layer, nor in the loss, while b keeps growing without bound.

Therefore
We need to integrate the normalization into the process of training.

108 / 148
Normalization via Mini-Batch Statistics

It is possible to describe the normalization as a transformation layer

\hat{x} = Norm(x, X)

which depends not only on the given sample x but on all the training samples X, each of which depends on the layer parameters.

For back-propagation, we will need to compute the following terms

\frac{\partial Norm(x, X)}{\partial x} \quad \text{and} \quad \frac{\partial Norm(x, X)}{\partial X}

109 / 148
Definition of Whitening

Whitening
Suppose X is a random (column) vector with non-singular covariance
matrix Σ and mean 0.

Then
Then the transformation Y = W X with a whitening matrix W
satisfying the condition W T W = Σ−1 yields the whitened random
vector Y with unit diagonal covariance.

110 / 148
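A minimal NumPy sketch of this definition (PCA whitening; the variable names and the random test data are my own, not from the slides): given zero-mean data with covariance \Sigma = U \Lambda U^T, the matrix W = \Lambda^{-1/2} U^T satisfies W^T W = \Sigma^{-1}, and the whitened samples have approximately identity covariance.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.7]])
X = rng.normal(size=(1000, 3)) @ A          # correlated samples
X = X - X.mean(axis=0)                      # center (mean 0, as in the definition)

Sigma = np.cov(X, rowvar=False)             # empirical covariance
eigvals, U = np.linalg.eigh(Sigma)          # Sigma = U diag(eigvals) U^T
W = np.diag(1.0 / np.sqrt(eigvals)) @ U.T   # whitening matrix with W^T W = Sigma^{-1}

Y = X @ W.T                                 # whitened vectors Y = W X (row-wise)
print(np.round(np.cov(Y, rowvar=False), 2)) # approximately the identity matrix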
Such Normalization

It could be used for all layers
But whitening the layer inputs is expensive, as it requires computing the covariance matrix

Cov[x] = E_{x \in X}\left[x x^T\right] - E[x]E[x]^T

as well as its inverse square root, to produce the whitened activations.

111 / 148
Therefore

A better option: we can normalize each input dimension of a layer

\hat{x}^{(k)} = \frac{x^{(k)} - \mu}{\sigma}

with \mu = E\left[x^{(k)}\right] and \sigma^2 = Var\left[x^{(k)}\right]

This allows us to speed up convergence
However, simply normalizing each input of a layer may change what the layer can represent.

So, we need to insert a transformation in the network
Which can represent the identity transform

112 / 148
The Transformation

The linear transformation

y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}

The parameters \gamma^{(k)}, \beta^{(k)}
They allow the layer to recover the identity by setting \gamma^{(k)} = \sqrt{Var\left[x^{(k)}\right]} and \beta^{(k)} = E\left[x^{(k)}\right], if necessary.

113 / 148
Finally

Batch Normalizing Transform
Input: Values of x over a mini-batch: B = \{x_{1 \ldots m}\}; parameters to be learned: \gamma, \beta
Output: \{y_i = BN_{\gamma,\beta}(x_i)\}
1  \mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i
2  \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2
3  \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
4  y_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)

114 / 148
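A minimal NumPy sketch of the transform above, applied per dimension over a mini-batch (the function name, the returned cache, and the default \epsilon are my own choices, not from the slides):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: mini-batch activations, shape (m, d); gamma, beta: learned parameters, shape (d,)
    mu = x.mean(axis=0)                      # step 1: mini-batch mean
    var = x.var(axis=0)                      # step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 3: normalize
    y = gamma * x_hat + beta                 # step 4: scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)  # kept for the backward pass below
    return y, cache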
Backpropagation

We have the following equations by using the loss function l

1  \frac{\partial l}{\partial \hat{x}_i} = \frac{\partial l}{\partial y_i} \cdot \gamma
2  \frac{\partial l}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \left(-\frac{1}{2}\right) \left(\sigma_B^2 + \epsilon\right)^{-3/2}
3  \frac{\partial l}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial l}{\partial \sigma_B^2} \cdot \frac{\sum_{i=1}^{m} -2(x_i - \mu_B)}{m}
4  \frac{\partial l}{\partial x_i} = \frac{\partial l}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial l}{\partial \sigma_B^2} \cdot \frac{2(x_i - \mu_B)}{m} + \frac{\partial l}{\partial \mu_B} \cdot \frac{1}{m}
5  \frac{\partial l}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i} \cdot \hat{x}_i
6  \frac{\partial l}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i}

115 / 148
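A minimal NumPy sketch of equations 1-6, written as a backward pass paired with batch_norm_forward above (the function name is my own):

import numpy as np

def batch_norm_backward(dy, cache):
    # dy: upstream gradient dl/dy, shape (m, d); cache: tuple returned by batch_norm_forward
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    dx_hat = dy * gamma                                                        # eq. 1
    dvar = np.sum(dx_hat * (x - mu), axis=0) * (-0.5) * (var + eps) ** (-1.5)  # eq. 2
    dmu = np.sum(dx_hat * (-1.0) / np.sqrt(var + eps), axis=0) \
          + dvar * np.sum(-2.0 * (x - mu), axis=0) / m                         # eq. 3
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m     # eq. 4
    dgamma = np.sum(dy * x_hat, axis=0)                                        # eq. 5
    dbeta = np.sum(dy, axis=0)                                                 # eq. 6
    return dx, dgamma, dbeta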
Training Batch Normalization Networks

Input: Network N with trainable parameters \Theta; subset of activations \{x^{(k)}\}_{k=1}^{K}
Output: Batch-normalized network for inference, N_{BN}^{inf}
1   N_{BN}^{tr} = N  // Training BN network
2   for k = 1 ... K do
3       Add the transformation y^{(k)} = BN_{\gamma^{(k)},\beta^{(k)}}(x^{(k)}) to N_{BN}^{tr}
4       Modify each layer in N_{BN}^{tr} with input x^{(k)} to take y^{(k)} instead
5   Train N_{BN}^{tr} to optimize the parameters \Theta \cup \{\gamma^{(k)}, \beta^{(k)}\}_{k=1}^{K}
6   N_{BN}^{inf} = N_{BN}^{tr}  // Inference BN network with frozen parameters
7   for k = 1 ... K do
8       Process multiple training mini-batches B, each of size m, and average over them:
9           E[x] = E_B[\mu_B] and Var[x] = \frac{m}{m-1} E_B\left[\sigma_B^2\right]
10      In N_{BN}^{inf}, replace the transform y = BN_{\gamma,\beta}(x) with
11          y = \frac{\gamma}{\sqrt{Var[x] + \epsilon}} \cdot x + \left(\beta - \frac{\gamma E[x]}{\sqrt{Var[x] + \epsilon}}\right)

116 / 148
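A minimal sketch of the frozen-statistics transform in lines 10-11, once the population estimates E[x] and Var[x] have been accumulated (the function and argument names are my own):

import numpy as np

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    # y = gamma / sqrt(Var[x] + eps) * x + (beta - gamma * E[x] / sqrt(Var[x] + eps))
    scale = gamma / np.sqrt(pop_var + eps)
    shift = beta - scale * pop_mean
    return scale * x + shift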
However

Santurkar et al. [16]
They found that it is not the internal covariate shift that Batch Normalization actually affects!!!

Santurkar et al. recognize that
Batch normalization has been arguably one of the most successful architectural innovations in deep learning.

They used a standard very deep convolutional network
Trained on CIFAR-10 with and without BatchNorm

117 / 148
They found something quite interesting

The supporting plots are in [16]; the facts they establish are summarized next.

118 / 148
Actually Batch Normalization

It does not do anything to the Internal Covariate Shift
Instead, it smooths the optimization landscape
And it is not the only way to achieve this!!!

They suggest that
"This suggests that the positive impact of BatchNorm on training might be somewhat serendipitous."

119 / 148
They actually have a connected result

To the analysis of gradient clipping!!!
They are the same group

Theorem (The effect of BatchNorm on the Lipschitzness of the loss)
For a BatchNorm network with loss \hat{L} and an identical non-BN network with (identical) loss L,

\left\| \nabla_{y_j} \hat{L} \right\|^2 \leq \frac{\gamma^2}{\sigma_j^2} \left( \left\| \nabla_{y_j} L \right\|^2 - \frac{1}{m} \left\langle \mathbf{1}, \nabla_{y_j} L \right\rangle^2 - \frac{1}{\sqrt{m}} \left\langle \nabla_{y_j} L, \hat{y}_j \right\rangle^2 \right)

120 / 148
Finally, The Fully Connected Layer

121 / 148
Fully Connected Layer

If a layer l is a fully connected layer
If layer (l - 1) is also fully connected, we compute the output of the i-th unit at layer l as:

z_i^{(l)} = \sum_{k=0}^{m^{(l-1)}} w_{i,k}^{(l)} y_k^{(l-1)}, \quad \text{thus} \quad y_i^{(l)} = f\left(z_i^{(l)}\right)

Otherwise
Layer l expects m_1^{(l-1)} feature maps of size m_2^{(l-1)} \times m_3^{(l-1)} as input.

122 / 148
Then

Thus, the i-th unit in layer l computes

y_i^{(l)} = f\left(z_i^{(l)}\right), \quad z_i^{(l)} = \sum_{j=1}^{m_1^{(l-1)}} \sum_{r=1}^{m_2^{(l-1)}} \sum_{s=1}^{m_3^{(l-1)}} w_{i,j,r,s}^{(l)} \left(Y_j^{(l-1)}\right)_{r,s}

123 / 148
Here

(l)
Where wi,j,r,s
It denotes the weight connecting the unit at position (r, s) in the j th
feature map of layer (l − 1) and the ith unit in layer l.

Something Notable
In practice, Convolutional Layers are used to learn a feature hierarchy
and one or more fully connected layers are used for classification
purposes based on the computed features.

124 / 148
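A minimal NumPy sketch of this fully connected layer over feature maps, using the weight tensor w_{i,j,r,s} just described (the function name, the ReLU default for f, and the tensor layout are my own choices):

import numpy as np

def fully_connected(Y_prev, w, f=lambda z: np.maximum(z, 0.0)):
    # Y_prev: feature maps of layer l-1, shape (m1, m2, m3)
    # w: weights of layer l, shape (num_units, m1, m2, m3);
    #    w[i, j, r, s] connects position (r, s) of map j to unit i
    z = np.tensordot(w, Y_prev, axes=([1, 2, 3], [0, 1, 2]))  # z_i = sum_{j,r,s} w_{i,j,r,s} (Y_j)_{r,s}
    return f(z)                                               # y_i = f(z_i)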
Basically

We can use a loss function at the output of such a layer

L(W) = \sum_{n=1}^{N} E_n(W) = \sum_{n=1}^{N} \sum_{k=1}^{K} \left(y_{nk}^{(l)} - t_{nk}\right)^2  (Sum of Squared Error)

L(W) = \sum_{n=1}^{N} E_n(W) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \log\left(y_{nk}^{(l)}\right)  (Cross-Entropy Error)

Assuming W is the tensor used to represent all the possible weights
We can use the Backpropagation idea as long as we can apply the corresponding derivatives.

125 / 148
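A minimal NumPy sketch of the two losses above for a batch of N outputs with K classes (the function names and the small \epsilon guard inside the logarithm are my own choices, not from the slides):

import numpy as np

def sum_of_squared_error(y, t):
    # y: network outputs, t: targets, both of shape (N, K)
    return np.sum((y - t) ** 2)

def cross_entropy_error(y, t, eps=1e-12):
    # y: predicted class probabilities, t: one-hot targets, both of shape (N, K)
    return -np.sum(t * np.log(y + eps))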
About this

As part of the seminar


We are preparing a series of slides about Loss Functions...

126 / 148
An Example of CNN: The Proposed Architecture

127 / 148
We have the following Architecture

Simplified architecture following Yann LeCun's "Backpropagation applied to handwritten zip code recognition"

128 / 148
Therefore, we have

Layer l = 1
This layer uses a ReLU non-linearity f and three convolution kernels K_{11}, K_{21}, K_{31} to produce 3 channels:

\left(Y_1^{(l)}\right)_{x,y} = \left(B_1^{(l)}\right)_{x,y} + \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{11}^{(l)}\right)_{k,t} \left(Y_1^{(l-1)}\right)_{x-k,y-t}

\left(Y_2^{(l)}\right)_{x,y} = \left(B_2^{(l)}\right)_{x,y} + \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{21}^{(l)}\right)_{k,t} \left(Y_1^{(l-1)}\right)_{x-k,y-t}

\left(Y_3^{(l)}\right)_{x,y} = \left(B_3^{(l)}\right)_{x,y} + \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{31}^{(l)}\right)_{k,t} \left(Y_1^{(l-1)}\right)_{x-k,y-t}

129 / 148
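A minimal sketch of one of these channels followed by the ReLU, in plain loop form with a "valid" output region (the border handling and the function name are my assumptions; the slides do not specify how borders are treated):

import numpy as np

def conv_channel_relu(Y_in, K, B):
    # Y_in: input map Y^(l-1) of shape (H, W); K: kernel of shape (2*h1+1, 2*h2+1); B: bias
    h1, h2 = K.shape[0] // 2, K.shape[1] // 2
    H, W = Y_in.shape
    out = np.zeros((H - 2 * h1, W - 2 * h2))
    for x in range(h1, H - h1):
        for y in range(h2, W - h2):
            patch = Y_in[x - h1:x + h1 + 1, y - h2:y + h2 + 1]
            out[x - h1, y - h2] = np.sum(K[::-1, ::-1] * patch)  # sum_{k,t} K[k,t] * Y[x-k, y-t]
    return np.maximum(out + B, 0.0)                              # add bias, then ReLU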
Layer l = 2

We have a max-pooling of size 2 \times 2

\left(Y_i^{(l)}\right)_{x',y'} = \max\left\{ \left(Y_i^{(l-1)}\right)_{x,y}, \left(Y_i^{(l-1)}\right)_{x+1,y}, \left(Y_i^{(l-1)}\right)_{x,y+1}, \left(Y_i^{(l-1)}\right)_{x+1,y+1} \right\}

130 / 148
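A minimal sketch of this 2 x 2 max-pooling, assuming a stride of 2 and an input map with even height and width (the stride is my assumption; the slide only gives the window size):

def maxpool_2x2(Y):
    # Y: NumPy input map of shape (H, W) with H, W even
    H, W = Y.shape
    return Y.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))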
Then, you repeat the previous process

Thus, after a second round of convolution and max-pooling, we obtain the reduced feature maps Y_k^{(4)} from the maps Y_f^{(3)}.
We then use those as the inputs to the fully connected layer.

131 / 148
The fully connected layer

Now, assuming a single output neuron k = 1

y_1^{(6)} = f\left(z_1^{(5)}\right)

z_1^{(5)} = \sum_{k=1}^{9} \sum_{r=1}^{m_2^{(4)}} \sum_{s=1}^{m_3^{(4)}} w_{r,s,k}^{(5)} \left(Y_k^{(4)}\right)_{r,s}

132 / 148
We have, for simplicity's sake

That our final cost function is equal to

L = \frac{1}{2}\left(y_1^{(6)} - t_1^{(6)}\right)^2

133 / 148
Backpropagation

134 / 148
After collecting all input/output

Therefore
We have, using the sum of squared errors as loss function:

L = \frac{1}{2}\left(y_1^{(6)} - t_1^{(6)}\right)^2

Therefore, we can obtain

\frac{\partial L}{\partial w_{r,s,k}^{(5)}} = \frac{1}{2} \times \frac{\partial \left(y_1^{(6)} - t_1^{(6)}\right)^2}{\partial w_{r,s,k}^{(5)}}

135 / 148
Therefore

We get in the first part of the equation

\frac{\partial \left(y_1^{(6)} - t_1^{(6)}\right)^2}{\partial w_{r,s,k}^{(5)}} = 2\left(y_1^{(6)} - t_1^{(6)}\right) \frac{\partial y_1^{(6)}}{\partial w_{r,s,k}^{(5)}}

where the factor 2 cancels the \frac{1}{2} in L.

With

y_1^{(6)} = ReLU\left(z_1^{(5)}\right)

136 / 148
Therefore

We have

\frac{\partial y_1^{(6)}}{\partial w_{r,s,k}^{(5)}} = \frac{\partial f\left(z_1^{(5)}\right)}{\partial z_1^{(5)}} \times \frac{\partial z_1^{(5)}}{\partial w_{r,s,k}^{(5)}}

Therefore, if we use the smooth (sigmoid) approximation of the ReLU derivative

\frac{\partial f\left(z_1^{(5)}\right)}{\partial z_1^{(5)}} = \frac{e^{k z_1^{(5)}}}{1 + e^{k z_1^{(5)}}}

137 / 148
Now, we need to derive \frac{\partial z_1^{(5)}}{\partial w_{r,s,k}^{(5)}}

We know that

z_1^{(5)} = \sum_{k=1}^{9} \sum_{r=1}^{m_2^{(4)}} \sum_{s=1}^{m_3^{(4)}} w_{r,s,k}^{(5)} \left(Y_k^{(4)}\right)_{r,s}

Finally

\frac{\partial z_1^{(5)}}{\partial w_{r,s,k}^{(5)}} = \left(Y_k^{(4)}\right)_{r,s}

138 / 148
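A small finite-difference check of the chain assembled so far, namely \partial L / \partial w_{r,s,k}^{(5)} = \left(y_1^{(6)} - t_1^{(6)}\right) f'\left(z_1^{(5)}\right) \left(Y_k^{(4)}\right)_{r,s}. To keep the check exact, this sketch uses the soft-plus \frac{1}{k}\ln\left(1 + e^{k z}\right) as f, whose derivative is exactly the sigmoid approximation above; the array names and test values are my own.

import numpy as np

rng = np.random.default_rng(1)
Y4 = rng.normal(size=(9, 4, 4))            # the nine pooled feature maps Y_k^(4)
w = rng.normal(size=(9, 4, 4)) * 0.1       # weights w_{r,s,k}, stored here as an array indexed [k, r, s]
t1, kslope = 0.5, 5.0                      # target and slope of the smooth ReLU

f = lambda z: np.log1p(np.exp(kslope * z)) / kslope   # soft-plus; f'(z) = sigmoid(kslope * z)
loss = lambda w_: 0.5 * (f(np.sum(w_ * Y4)) - t1) ** 2

z1 = np.sum(w * Y4)
analytic = (f(z1) - t1) / (1.0 + np.exp(-kslope * z1)) * Y4[0, 1, 2]   # chain rule at indices (k, r, s) = (0, 1, 2)

eps = 1e-6
w_pert = w.copy()
w_pert[0, 1, 2] += eps
numeric = (loss(w_pert) - loss(w)) / eps
print(analytic, numeric)                   # the two values should agree closely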
Maxpooling

The max-pooling is not differentiated as such; we go directly to the max term
Assume the max element comes from feature map f = 1, 2, ..., 9 with j = 1:

\left(Y_f^{(3)}\right)_{x,y} = \left(B_f^{(3)}\right)_{x,y} + \sum_{k=-h_1^{(3)}}^{h_1^{(3)}} \sum_{t=-h_2^{(3)}}^{h_2^{(3)}} \left(K_{f1}^{(3)}\right)_{k,t} \left(Y_1^{(2)}\right)_{x-k,y-t}

139 / 148
Therefore

We have then

\frac{\partial L}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \frac{1}{2} \times \frac{\partial \left(y_1^{(6)} - t_1^{(6)}\right)^2}{\partial \left(K_{f1}^{(3)}\right)_{k,t}}

We have the following chain of derivations, given \left(Y_f^{(4)}\right)_{x,y} = f\left(\left(Y_f^{(3)}\right)_{x,y}\right):

\frac{\partial L}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \left(y_1^{(6)} - t_1^{(6)}\right) \frac{\partial f\left(z_1^{(5)}\right)}{\partial z_1^{(5)}} \times \frac{\partial z_1^{(5)}}{\partial \left(Y_f^{(4)}\right)_{x,y}} \times \frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(K_{f1}^{(3)}\right)_{k,t}}

140 / 148
Therefore

We have

\frac{\partial z_1^{(5)}}{\partial \left(Y_f^{(4)}\right)_{x,y}} = w_{x,y,f}^{(5)}

Then, assuming that

\left(Y_f^{(3)}\right)_{x,y} = \left(B_f^{(3)}\right)_{x,y} + \sum_{k=-h_1^{(3)}}^{h_1^{(3)}} \sum_{t=-h_2^{(3)}}^{h_2^{(3)}} \left(K_{f1}^{(3)}\right)_{k,t} \left(Y_1^{(2)}\right)_{x-k,y-t}

141 / 148
Therefore

We have

\frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(Y_f^{(3)}\right)_{x,y}} \times \frac{\partial \left(Y_f^{(3)}\right)_{x,y}}{\partial \left(K_{f1}^{(3)}\right)_{k,t}}

Then

\frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(Y_f^{(3)}\right)_{x,y}} = f'\left(\left(Y_f^{(3)}\right)_{x,y}\right)

142 / 148
Finally, we have

The equation

\frac{\partial \left(Y_f^{(3)}\right)_{x,y}}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \left(Y_1^{(2)}\right)_{x-k,y-t}

143 / 148
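Putting the last three results together, the full kernel gradient accumulates the chain over every output position (x, y). A minimal loop-form sketch, where delta3[x, y] stands for \partial L / \partial \left(Y_f^{(3)}\right)_{x,y}, i.e. all upstream factors of the chain already multiplied together (the names delta3, Y2, h1, h2 are my own):

import numpy as np

def kernel_gradient(delta3, Y2, h1, h2):
    # delta3: upstream gradient on the map Y_f^(3), shape (H, W)
    # Y2: input map Y_1^(2), same shape; h1, h2: kernel half-sizes
    dK = np.zeros((2 * h1 + 1, 2 * h2 + 1))
    H, W = delta3.shape
    for k in range(-h1, h1 + 1):
        for t in range(-h2, h2 + 1):
            for x in range(h1, H - h1):           # positions where the convolution is defined
                for y in range(h2, W - h2):
                    dK[k + h1, t + h2] += delta3[x, y] * Y2[x - k, y - t]
    return dK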
The Other Equations

I will leave you to derive them
They follow the same repetitive procedure.

The interesting case is the average pooling
The remaining ones are the stride and the deconvolution.

144 / 148
[1] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, “A survey of the
recent architectures of deep convolutional neural networks,” Artificial
Intelligence Review, vol. 53, no. 8, pp. 5455–5516, 2020.
[2] R. Szeliski, Computer Vision: Algorithms and Applications.
Berlin, Heidelberg: Springer-Verlag, 1st ed., 2010.
[3] S. Haykin, Neural Networks and Learning Machines.
No. v. 10 in Neural networks and learning machines, Prentice Hall,
2009.
[4] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction
and functional architecture in the cat’s visual cortex,” The Journal of
physiology, vol. 160, no. 1, p. 106, 1962.
[5] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., “Gradient-based
learning applied to document recognition,” Proceedings of the IEEE,
vol. 86, no. 11, pp. 2278–2324, 1998.

145 / 148
[6] W. Zhang, K. Itoh, J. Tanida, and Y. Ichioka, “Parallel distributed
processing model with local space-invariant interconnections and its
optical architecture,” Appl. Opt., vol. 29, pp. 4790–4797, Nov 1990.
[7] J. J. Weng, N. Ahuja, and T. S. Huang, “Learning recognition and
segmentation of 3-d objects from 2-d images,” in 1993 (4th)
International Conference on Computer Vision, pp. 121–128, IEEE,
1993.
[8] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus,
“Deconvolutional networks,” in 2010 IEEE Computer Society
Conference on computer vision and pattern recognition,
pp. 2528–2535, IEEE, 2010.
[9] D. Krishnan and R. Fergus, “Fast image deconvolution using
hyper-laplacian priors,” Advances in neural information processing
systems, vol. 22, pp. 1033–1041, 2009.

146 / 148
[10] Y. Wang, J. Yang, W. Yin, and Y. Zhang, “A new alternating
minimization algorithm for total variation image reconstruction,”
SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 248–272, 2008.
[11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.
The MIT Press, 2016.
[12] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller,
“Striving for simplicity: The all convolutional net,” 2015.
[13] T. G. Kolda and B. W. Bader, “Tensor decompositions and
applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
[14] C. Kong and S. Lucey, “Take it in your stride: Do we need striding in
cnns?,” arXiv preprint arXiv:1712.02502, 2017.
[15] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.

147 / 148
[16] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch
normalization help optimization?,” in Advances in Neural Information
Processing Systems, pp. 2483–2493, 2018.

148 / 148
