Neural Network Notes


Introduction to Neural Networks and Deep Learning

Introduction to the Convolutional Network

Andres Mendez-Vazquez

March 28, 2021

1 / 148
Outline
1 Introduction
The Long Path
The Problem of Image Processing
Multilayer Neural Network Classification
Drawbacks
Possible Solution

2 Convolutional Networks
History
Local Connectivity
Sharing Parameters

3 Layers
Convolutional Layer
Convolutional Architectures
A Little Bit of Notation
Deconvolution Layer
Alternating Minimization
Non-Linearity Layer
Fixing the Problem, ReLu function
Back to the Non-Linearity Layer
Rectification Layer
Local Contrast Normalization Layer
Sub-sampling and Pooling
Strides
Normalization Layer AKA Batch Normalization
Finally, The Fully Connected Layer

4 An Example of CNN
The Proposed Architecture
Backpropagation
Deriving wr,s,k
Deriving the Kernel Filters
2 / 148
Outline
1 Introduction
The Long Path
The Problem of Image Processing
Multilayer Neural Network Classification
Drawbacks
Possible Solution

2 Convolutional Networks
History
Local Connectivity
Sharing Parameters

3 Layers
Convolutional Layer
Convolutional Architectures
A Little Bit of Notation
Deconvolution Layer
Alternating Minimization
Non-Linearity Layer
Fixing the Problem, ReLu function
Back to the Non-Linearity Layer
Rectification Layer
Local Contrast Normalization Layer
Sub-sampling and Pooling
Strides
Normalization Layer AKA Batch Normalization
Finally, The Fully Connected Layer

4 An Example of CNN
The Proposed Architecture
Backpropagation
Deriving wr,s,k
Deriving the Kernel Filters
3 / 148
The Long Path [1]
[Timeline figure: a brief history of CNN architectures, from early attempts (1979 Neocognitron, 1989 ConvNet, 1998 LeNet), through CNN stagnation and the programming/GPU era (early 2000s, 2006 max-pooling, 2006-2007 GPU/NVIDIA, 2010 ImageNet), the depth and spatial-exploitation revolution (2012 AlexNet, 2013 ZFNet and 3D CNNs, 2014 VGG with small-size filters and effective receptive fields, GoogLeNet and the Inception block, Inception-V2/V3/V4, Inception-ResNet), skip connections and multi-path connectivity (2015 ResNet and Highway Net, 2016 DenseNet, 2017 FractalNet, ResNeXt, WideResNeXt, PolyNet, PyramidalNet), up to feature-map and channel exploitation and the attention revolution (2018 SE-Net, CMPE-SE, CBAM, Residual Attention Module, Channel Boosted CNN, and, beyond 2018, Transformer-CNN hybrids).]
4 / 148
Digital Images as pixels in a digitized matrix [2]

[Figure: an illumination source lighting a scene and an imaging system producing the digitized output matrix.]

6 / 148
Further [2]

Pixel values typically represent


Gray levels, colors, heights, opacities etc

Something Notable
Remember digitization implies that a digital image is an
approximation of a real scene

7 / 148
Images

Common image formats include


One sample per pixel (B&W or Grayscale)
Three samples/pixel per point (Red, Green, and Blue)
Four samples/pixel per point (Red, Green, Blue, and “Alpha”)

8 / 148
Therefore, we have the following process

Low Level Process

Image → Noise Removal → Sharpening

9 / 148
Example

Edge Detection

10 / 148
Then

Mid Level Process

Input (Image) → Processes (Segmentation, Object Recognition) → Output (Attributes)

11 / 148
Example

Object Recognition

12 / 148
Therefore

It would be nice to automate all these processes

We would solve a lot of headaches when setting up such a process

Why not use the data sets?

By using a Neural Network that replicates the process.

13 / 148
Multilayer Neural Network Classification

We have the following classification [3]

15 / 148
Drawbacks of previous neural networks

The number of trainable parameters becomes extremely large

Large N

17 / 148
Drawbacks of previous neural networks
In addition, little or no invariance to shifting, scaling, and other forms
of distortion

Large N

18 / 148
Drawbacks of previous neural networks
In addition, little or no invariance to shifting, scaling, and other forms
of distortion

Large N

Shift to the Left

19 / 148
Drawbacks of previous neural networks

The topology of the input data is completely ignored

20 / 148
For Example

We have
Black and white patterns: $2^{32 \times 32} = 2^{1024}$
Gray scale patterns: $256^{32 \times 32} = 256^{1024}$

21 / 148
For Example

If we have an element that the network has never seen

22 / 148
Possible Solution

We can minimize these drawbacks given that

A fully connected network of sufficient size can produce outputs that are invariant with respect to such variations.

Problem!!!
Training time
Network size
Free parameters

23 / 148
Hubel/Wiesel Architecture

Something Notable [4]


D. Hubel and T. Wiesel (1959, 1962, Nobel Prize 1981)

They commented
The visual cortex consists of a hierarchy of simple, complex, and
hyper-complex cells

25 / 148
Something Like

We have
Feature Hierarchy

Hyper-complex cells

Complex cells

Simple cells

26 / 148
History

Convolutional Neural Networks (CNN) were invented by [5]


In 1989, Yann LeCun and Yoshua Bengio introduced the concept of Convolutional Neural Networks.

[Figure: a network mapping the input image through hidden layers that detect patterns of local contrast, then face features, then faces, to the output.]

27 / 148
About CNN’s

Something Notable
CNNs were neurobiologically motivated by the findings of locally sensitive and orientation-selective nerve cells in the visual cortex.

In addition
They designed a network structure that implicitly extracts relevant
features.

Properties
Convolutional Neural Networks are a special kind of multi-layer neural
networks.

28 / 148
About CNN’s

In addition
CNN is a feed-forward network that can extract topological properties
from an image.
Like almost every other neural network, they are trained with a version of the back-propagation algorithm.
Convolutional Neural Networks are designed to recognize visual
patterns directly from pixel images with minimal preprocessing.
They can recognize patterns with extreme variability.

29 / 148
Local Connectivity

We have the following idea [6]


Instead of using a full connectivity...

Input Image

We would have something like this


$$y_i = f\left(\sum_{i=1}^{n} w_i x_i\right) \qquad (1)$$

31 / 148
Local Connectivity

We decide only to connect the neurons in a local way


Each hidden unit is connected only to a subregion (patch) of the
input image.
It is connected to all channels:

32 / 148
Example

For gray scale, we get something like this

Input Image

Then, our formula changes


 
X
yi = f  wi x i  (2)
i∈Lp

33 / 148
Example

In the case of the 3 channels

Input Image

Thus

$$y_i = f\left(\sum_{i \in L_p,\, c} w_i x_i^c\right) \qquad (3)$$

34 / 148
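To make Equation (3) concrete, here is a minimal NumPy sketch (not from the slides): one hidden unit connected to a 3 × 3 patch of a 3-channel image. The patch location, the sigmoid non-linearity, and every variable name are choices made only for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 3-channel input image, shape (channels, height, width).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(3, 8, 8)).astype(float) / 255.0

# Local receptive field L_p: a 3x3 patch anchored at (row, col) = (2, 2),
# connected to all 3 channels, as in Equation (3).
patch = image[:, 2:5, 2:5]               # shape (3, 3, 3)
weights = rng.normal(size=patch.shape)   # one weight per connected input

# y_i = f( sum over the patch and the channels of w * x )
y_i = sigmoid(np.sum(weights * patch))
print(y_i)
```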
Solving the following problems...

First
A fully connected hidden layer would have an unmanageable number of parameters

Second
Computing the linear activation of the hidden units would be quite expensive

35 / 148
How this looks in the image...

We have

Receptive Field

36 / 148
Parameter Sharing

Second Idea
Share a matrix of parameters across certain units.

These units are organized into

The same feature “map”
Where the units share the same parameters (for example, the same mask)

38 / 148
Example

We have something like this


Feature Map 1 Feature Map 2 Feature Map 3

39 / 148
Now, in our notation

We have a collection of matrices representing this connectivity


$W_{ij}$ is the connection matrix connecting the $i$th input channel with the $j$th feature map.
In each cell of these matrices is the weight to be multiplied with the local input to the local neuron.

And now, why the name convolution?

Yes!!! The definition is coming now.

40 / 148
Digital Images

In computer vision [2, 7]


We usually operate on digital (discrete) images:
Sample the 2D space on a regular grid.
Quantize each sample (round to nearest integer).

The image can now be represented as a matrix of integer values,


$$I : [a, b] \times [c, d] \to [0..255]$$

$$I = \begin{pmatrix}
79 & 5 & 6 & 90 & 12 & 34 & 2 & 1\\
8 & 90 & 12 & 34 & 26 & 78 & 34 & 5\\
8 & 1 & 3 & 90 & 12 & 34 & 11 & 61\\
77 & 90 & 12 & 34 & 200 & 2 & 9 & 45\\
1 & 3 & 90 & 12 & 20 & 1 & 6 & 23
\end{pmatrix}$$

(rows indexed by $i$ going down, columns by $j$ going right)
42 / 148
Many times we want to eliminate noise in an image

For example, a moving average

43 / 148
This is defined as

This last moving average can be seen as

$$(I * K)(i) = \sum_{j=-n}^{n} I(i - j) \times K(j) = \frac{1}{N} \sum_{j=-m}^{m} I(i - j) \qquad (4)$$

With $I(j)$ representing the value of the pixel at position $j$, and

$$K(j) = \begin{cases} \frac{1}{N} & \text{if } j \in \{-m, -m+1, \ldots, -1, 0, 1, \ldots, m-1, m\} \\ 0 & \text{else} \end{cases}$$

with $0 < m < n$ and $N = 2m + 1$.

44 / 148
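A small sketch of the 1-D moving average of Equation (4), assuming a window of size N = 2m + 1; the toy signal and the use of `np.convolve` are choices made here for illustration.

```python
import numpy as np

def moving_average(signal, m):
    """1D moving average written as the convolution of Equation (4):
    K(j) = 1/N for j in {-m, ..., m} and 0 otherwise, with N = 2m + 1."""
    N = 2 * m + 1
    kernel = np.full(N, 1.0 / N)
    # mode="same" keeps the output length equal to the input length.
    return np.convolve(signal, kernel, mode="same")

noisy = np.array([4.0, 5, 1, 1, 2, 6, 2, 6, 3, 5, 7, 3])
print(moving_average(noisy, m=1))
```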
This can be generalized to 2D images

Left I and Right I ∗ K

45 / 148
Moving average in 2D

Basically in 2D
We can define different types of filters using the idea of a weighted average

$$(I * K)(i, j) = \sum_{s=-m}^{m} \sum_{l=-m}^{m} I(i - s, j - l) \times K(s, l) \qquad (5)$$

For example, the Box Filter

$$K = \frac{1}{9} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix} \qquad \text{``The Box Filter''} \qquad (6)$$

49 / 148
Another Example

The Gaussian Filter


 
$$K = \begin{pmatrix}
0 & 1 & 2 & 1 & 0\\
1 & 3 & 5 & 3 & 1\\
2 & 5 & 9 & 5 & 2\\
1 & 3 & 5 & 3 & 1\\
0 & 1 & 2 & 1 & 0
\end{pmatrix}$$

Thus, we can define the concept of convolution


Yes, using the previous ideas

50 / 148
Convolution

Definition
Let $I : [a, b] \times [c, d] \to [0..255]$ be the image and $K : [e, f] \times [h, i] \to \mathbb{R}$ be the kernel. The output of convolving $I$ with $K$, denoted $I * K$, is

$$(I * K)[x, y] = \sum_{s=-n}^{n} \sum_{l=-n}^{n} I(x - s, y - l) \times K(s, l)$$

51 / 148
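Below is a direct, loop-based sketch of this definition, applied to the box filter of Equation (6); border positions are simply skipped (a 'valid' output), which is one possible convention among several.

```python
import numpy as np

def convolve2d(I, K):
    """Direct implementation of (I * K)[x, y] = sum_s sum_l I(x-s, y-l) K(s, l)
    for a (2n+1) x (2n+1) kernel; border positions are skipped ('valid')."""
    n = K.shape[0] // 2
    H, W = I.shape
    out = np.zeros((H - 2 * n, W - 2 * n))
    for x in range(n, H - n):
        for y in range(n, W - n):
            acc = 0.0
            for s in range(-n, n + 1):
                for l in range(-n, n + 1):
                    acc += I[x - s, y - l] * K[s + n, l + n]
            out[x - n, y - n] = acc
    return out

I = np.arange(36, dtype=float).reshape(6, 6)
box = np.full((3, 3), 1.0 / 9.0)     # the box filter of Equation (6)
print(convolve2d(I, box))
```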
Now, why not expand this idea
Imagine that a three-channel image is split into three feature maps
Feature Maps

52 / 148
Mathematically, we have the following

Map $i$

$$(I * k)[x, y, o] = \sum_{c=1}^{3} \sum_{l=-n}^{n} \sum_{s=-n}^{n} I(x - l, y - s, c) \times k(l, s, c, o)$$

Therefore
The convolution works as a
Filter
Encoder
Decoder
etc

53 / 148
For Example, Encoder

We have the following situation

54 / 148
Notation

We have the following

$Y_j^{(l)}$ is a matrix representing the $j$th feature map of layer $l$.
$K_{ij}^{(l)}$ is the kernel filter connecting the $j$th input feature map to the $i$th output feature map of layer $l$.

Therefore
We can see the convolution as a fusion of information from different feature maps:

$$\sum_{j=1}^{m_1^{(l-1)}} Y_j^{(l-1)} * K_{ij}^{(l)}$$

55 / 148
Thus, we have
Given a specific layer $l$, the $i$th feature map in that layer is equal to

$$Y_i^{(l)}(x, y) = B_i^{(l)}(x, y) + \sum_{j=1}^{m_1^{(l-1)}} \sum_{s=-k_s}^{k_s} \sum_{l'=-k_s}^{k_s} Y_j^{(l-1)}(x - s, y - l')\, K_{ij}^{(l)}(s, l')$$

Where
$Y_i^{(l)}$ is the $i$th feature map in layer $l$.
$B_i^{(l)}$ is the bias matrix for output map $i$.
$K_{ij}^{(l)}$ is the filter of size $\left[2h_1^{(l)} + 1\right] \times \left[2h_2^{(l)} + 1\right]$.

Thus
The input of layer $l$ comprises $m_1^{(l-1)}$ feature maps from the previous layer, each of size $m_2^{(l-1)} \times m_3^{(l-1)}$.
56 / 148
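As a rough sketch of the feature-map formula above (again with 'valid' borders, and zero bias maps in the example call), the following NumPy code sums the convolutions of all input maps for each output map; the shapes and names are assumptions made only for this example.

```python
import numpy as np

def conv_layer_forward(Y_prev, K, B):
    """Y_prev: previous feature maps, shape (m1_prev, H, W).
    K: kernels, shape (m1, m1_prev, 2k+1, 2k+1).
    B: bias maps, shape (m1, H-2k, W-2k).
    Returns the m1 feature maps of layer l ('valid' borders, no padding)."""
    m1, m1_prev, kh, kw = K.shape
    k = kh // 2
    H, W = Y_prev.shape[1:]
    out = B.copy()
    for i in range(m1):                     # output feature map i
        for j in range(m1_prev):            # sum over input feature maps j
            for x in range(k, H - k):
                for y in range(k, W - k):
                    patch = Y_prev[j, x - k:x + k + 1, y - k:y + k + 1]
                    # flip the patch so the sum matches Y(x-s, y-l') K(s, l')
                    out[i, x - k, y - k] += np.sum(patch[::-1, ::-1] * K[i, j])
    return out

rng = np.random.default_rng(1)
Y_prev = rng.normal(size=(3, 8, 8))
K = rng.normal(size=(4, 3, 3, 3))
B = np.zeros((4, 6, 6))
print(conv_layer_forward(Y_prev, K, B).shape)   # (4, 6, 6)
```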
Therefore

The output of layer $l$

It consists of $m_1^{(l)}$ feature maps of size $m_2^{(l)} \times m_3^{(l)}$

Something Notable
$m_2^{(l)}$ and $m_3^{(l)}$ are influenced by border effects.
Therefore, when the convolution sum is defined properly, the output feature maps have size

$$m_2^{(l)} = m_2^{(l-1)} - 2h_1^{(l)}, \qquad m_3^{(l)} = m_3^{(l-1)} - 2h_2^{(l)}$$
57 / 148
Why? The Border

Example
Convolutional Maps

58 / 148
Special Case

When l = 1
The input is a single image I consisting of one or more channels.

59 / 148
Thus

We have
Each feature map $Y_i^{(l)}$ in layer $l$ consists of $m_2^{(l)} \cdot m_3^{(l)}$ units arranged in a two-dimensional array.

Thus, the unit at position $(x, y)$ computes

$$\left(Y_i^{(l)}\right)_{x,y} = \left(B_i^{(l)}\right)_{x,y} + \sum_{j=1}^{m_1^{(l-1)}} \left(K_{ij}^{(l)} * Y_j^{(l-1)}\right)_{x,y} = \left(B_i^{(l)}\right)_{x,y} + \sum_{j=1}^{m_1^{(l-1)}} \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{ij}^{(l)}\right)_{k,t} \left(Y_j^{(l-1)}\right)_{x-k,\, y-t}$$
60 / 148
Here, an interesting case

Only a Historical Note


The foundations for deconvolution came from Norbert Wiener of the
Massachusetts Institute of Technology in his book “Extrapolation,
Interpolation, and Smoothing of Stationary Time Series” (1949)

Basically, it tries to solve the following equation, with $Y^{(l)}$ the unknown layer that we want to recover:

$$Y_i^{(l)} * K_{ij}^{(l)} = Y_j^{(l-1)}$$
61 / 148
In [8]

They proposed a sparsity idea to start the implementation as

$$C_l(y) = \sum_{i=1}^{m_1^{(l-1)}} \left\| \sum_{j=1}^{m_1^{(l)}} Y_j^{(l)} * K_{ij}^{(l)} - Y_i^{(l-1)} \right\|_2^2 + \sum_{j=1}^{m_1^{(l)}} \left| Y_j^{(l)} \right|^p$$

Typically, $p = 1$, although other values are possible.

They look for the arguments that minimize the cost function over a set of images $y = \left\{ y^1, \ldots, y^I \right\}$

$$\arg\min_{Y_j^{(l)},\, K_{ij}^{(l)}} C_l(y)$$

62 / 148
Here

Then, we can generalize such a cost function for the total set of images (minibatch)

$$C_l(y) = \frac{\lambda}{2} \sum_{k=1}^{I} \sum_{i=1}^{m_1^{(l-1)}} \left\| \sum_{j=1}^{m_1^{(l)}} g_{ij}^{(l)} \left( Y_j^{(l,k)} * K_{ij}^{(l)} \right) - Y_i^{(l-1,k)} \right\|_2^2 + \sum_{k=1}^{I} \sum_{j=1}^{m_1^{(l)}} \left| Y_j^{(l,k)} \right|^p$$

Here, we have
$Y_i^{(l-1,k)}$ are the feature maps from the previous layer.
$g_{ij}^{(l)}$ is a fixed binary matrix that determines the connectivity between feature maps at different layers, i.e. whether $Y_j^{(l,k)}$ is connected to certain $Y_i^{(l-1,k)}$ elements.

63 / 148
This can be seen as

We have the following layer

[Figure: the feature maps of layer $l$, each convolved with its kernel, summed to reconstruct a map of layer $l-1$.]

64 / 148
They noticed some drawbacks

Using the following optimizations


Direct Gradient Descent
Iterative Reweighted Least Squares
Stochastic Gradient Descent

All of them presented problems!!!


They solved it using a new cost function

65 / 148
We have that

An interesting use of an auxiliary variable/layer $X_j^{(l,k)}$

$$C_l(y) = \frac{\lambda}{2} \sum_{k=1}^{I} \sum_{i=1}^{m_1^{(l-1)}} \left\| \sum_{j=1}^{m_1^{(l)}} g_{ij}^{(l)} \left( Y_j^{(l,k)} * K_{ij}^{(l)} \right) - Y_i^{(l-1,k)} \right\|_2^2 + \frac{\beta}{2} \sum_{k=1}^{I} \sum_{j=1}^{m_1^{(l)}} \left\| Y_j^{(l,k)} - X_j^{(l,k)} \right\|_2^2 + \sum_{k=1}^{I} \sum_{j=1}^{m_1^{(l)}} \left| X_j^{(l,k)} \right|^p$$

This can be solved using

Alternating minimization...
66 / 148
This is based on

Fixing in turn the values of $Y_j^{(l,k)}$ and $X_j^{(l,k)}$
They call these two stages the Y and X sub-problems...

Therefore, they noticed

These terms introduce the sparsity constraint and give numerical stability [9, 10]

67 / 148
Y sub-problem

Taking the derivative with respect to $Y_j^{(l,k)}$

$$\frac{\partial C_l(y)}{\partial Y_j^{(l,k)}} = \lambda \sum_{i=1}^{m_1^{(l-1)}} F_{ij}^{(l)T} \left( \sum_{t=1}^{m_1^{(l)}} F_{tj}^{(l)} Y_j^{(l,k)} - Y_j^{(l-1,k)} \right) + \beta \left[ Y_j^{(l,k)} - X_j^{(l,k)} \right] = 0$$

Where

$$F_{ij}^{(l)} = \begin{cases} \text{a sparse convolution matrix} & \text{if } g_{ij}^{(l)} = 1 \\ 0 & \text{if } g_{ij}^{(l)} = 0 \end{cases}$$

68 / 148
Therefore

$F_{ij}^{(l)}$ as a sparse convolution matrix
It is equivalent to convolving with $K_{ij}^{(l)}$

Actually, if you fix $i$, you finish with a linear system $Ax = 0$

Please take a look at the paper... it is interesting
Actually, this seems to be the implementation in the TensorFlow framework
69 / 148
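The following is only a toy, one-dimensional illustration of the alternating scheme, not the paper's implementation: the convolution is replaced by a small dense matrix `F` (standing in for the sparse convolution matrix mentioned above), a single image is used, and p = 1. The Y update solves its quadratic sub-problem in closed form; the X update is the soft-thresholding that minimizes the absolute-value term. All values and names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
F = rng.normal(size=(20, 30))        # stands in for the (sparse) convolution matrix
y_prev = rng.normal(size=20)         # known feature map of layer l-1
lam, beta = 1.0, 5.0

y = np.zeros(30)                     # unknown feature map Y of layer l
x = np.zeros(30)                     # auxiliary variable X

for _ in range(50):
    # Y sub-problem: quadratic, solve (lam F^T F + beta I) y = lam F^T y_prev + beta x
    A = lam * F.T @ F + beta * np.eye(30)
    b = lam * F.T @ y_prev + beta * x
    y = np.linalg.solve(A, b)
    # X sub-problem: soft-thresholding, the closed form for the |x|_1 term
    x = np.sign(y) * np.maximum(np.abs(y) - 1.0 / beta, 0.0)

print(np.round(x[:5], 3))
```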
As in a Multilayer Perceptron
We use a non-linearity
However, there is a drawback when using Back-Propagation with a sigmoid function

$$s(x) = \frac{1}{1 + e^{-x}}$$

Because if we imagine a Convolutional Network as a composition of layer functions $f_i$

$$y(A) = f_t \circ f_{t-1} \circ \cdots \circ f_2 \circ f_1(A)$$

with $f_t$ the last layer.

Therefore, we finish with a sequence of derivatives

$$\frac{\partial y(A)}{\partial w_{1i}} = \frac{\partial f_t(f_{t-1})}{\partial f_{t-1}} \cdot \frac{\partial f_{t-1}(f_{t-2})}{\partial f_{t-2}} \cdots \frac{\partial f_2(f_1)}{\partial f_1} \cdot \frac{\partial f_1(A)}{\partial w_{1i}}$$
71 / 148
Therefore

Given the commutativity of the product

You could group the derivatives of the sigmoids

$$f(x) = \frac{ds(x)}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}$$

Therefore, deriving again

$$\frac{df(x)}{dx} = -\frac{e^{-x}}{(1 + e^{-x})^2} + \frac{2\left(e^{-x}\right)^2}{(1 + e^{-x})^3}$$

After setting $\frac{df(x)}{dx} = 0$
We have that the maximum is at $x = 0$

72 / 148
Therefore

The maximum of the derivative of the sigmoid

$$f(0) = 0.25$$

Therefore, given a Deep Convolutional Network

We could finish with

$$\lim_{k \to \infty} \left( \frac{ds(x)}{dx} \right)^k \le \lim_{k \to \infty} (0.25)^k \to 0$$

A vanishing derivative
Making it quite difficult to train a deeper network using this activation function
73 / 148
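A quick numerical check of this argument (not from the slides), using the standard identity $s'(x) = s(x)(1 - s(x))$; the depths chosen below are arbitrary.

```python
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)           # equals e^{-x} / (1 + e^{-x})^2

print(sigmoid_derivative(0.0))      # 0.25, the maximum possible value

# Even in this best case, a product of k such factors collapses quickly.
for k in (5, 10, 20, 50):
    print(k, 0.25 ** k)
```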
Thus

The need to introduce a new function

$$f(x) = x^+ = \max(0, x)$$

It is called ReLU or Rectifier

With a smooth approximation (Softplus function)

$$f(x) = \frac{\ln\left(1 + e^{kx}\right)}{k}$$

74 / 148
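A minimal sketch of the two functions; the value k = 10 used below is only an example of a "large" k.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x, k=1.0):
    # log1p(exp(k x)) / k; for large k x it approaches max(0, x)
    return np.log1p(np.exp(k * x)) / k

x = np.linspace(-3, 3, 7)
print(relu(x))
print(softplus(x, k=1.0))
print(softplus(x, k=10.0))   # larger k hugs the ReLU more closely
```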
Therefore, we have

When k = 1
[Figure: the Softplus and ReLU curves plotted together over roughly $[-3, 3]$; with $k = 1$ the Softplus is a smooth version of the ReLU around the origin.]
75 / 148
Increase k

When $k = 10^4$

[Figure: at this zoomed-in scale the Softplus curve essentially coincides with the ReLU.]
76 / 148
Non-Linearity Layer

If layer $l$ is a non-linearity layer

Its input is given by $m_1^{(l-1)}$ feature maps.

What about the output

Its output comprises again $m_1^{(l)} = m_1^{(l-1)}$ feature maps

Each of them of size

$$m_2^{(l-1)} \times m_3^{(l-1)} \qquad (7)$$

with $m_2^{(l)} = m_2^{(l-1)}$ and $m_3^{(l)} = m_3^{(l-1)}$.

77 / 148
Thus

With the final output

$$Y_i^{(l)} = f\left(Y_i^{(l-1)}\right) \qquad (8)$$

Where
$f$ is the activation function used in layer $l$ and operates point-wise.

You can also add a gain to compensate

$$Y_i^{(l)} = g_i\, f\left(Y_i^{(l-1)}\right) \qquad (9)$$

78 / 148
Rectification Layer, Rabs

Now a rectification layer

Then its input comprises $m_1^{(l)}$ feature maps of size $m_2^{(l-1)} \times m_3^{(l-1)}$.

Then, the absolute value of each component of the feature maps is computed

$$Y_i^{(l)} = \left| Y_i^{(l-1)} \right| \qquad (10)$$

Where the absolute value

It is computed point-wise, such that the output consists of $m_1^{(l)} = m_1^{(l-1)}$ feature maps unchanged in size.

80 / 148
Thus

We have that
Experiments show that rectification plays a central role in achieving good
performance.

You can find this in


K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Computer Vision, International Conference on, pages 2146–2153, 2009.

Remark
Rectification could be included in the non-linearity layer.
But also it can be seen as an independent layer.

81 / 148
Given that we are using Backpropagation
We need a smooth approximation to $f(x) = |x|$
For this, we have

$$\frac{\partial f}{\partial x} = \operatorname{sgn}(x)$$

when $x \neq 0$. Why?

We can use the following approximation

$$\operatorname{sgn}(x) \approx 2\left( \frac{\exp\{kx\}}{1 + \exp\{kx\}} \right) - 1$$

Therefore, by integration and working out the constant $C$, we have

$$f(x) = \frac{2}{k} \ln\left(1 + \exp\{kx\}\right) - x - \frac{2}{k} \ln(2)$$
82 / 148
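A small sketch of this smooth approximation of |x|; the value of k and the test points are arbitrary. Away from 0 the approximation differs from |x| by the constant 2 ln(2)/k, which vanishes as k grows.

```python
import numpy as np

def smooth_sign(x, k=50.0):
    # 2 * (e^{kx} / (1 + e^{kx})) - 1, the approximation to sgn(x) above
    return 2.0 / (1.0 + np.exp(-k * x)) - 1.0

def smooth_abs(x, k=50.0):
    # (2/k) ln(1 + e^{kx}) - x - (2/k) ln 2, the antiderivative with f(0) = 0
    return (2.0 / k) * np.log1p(np.exp(k * x)) - x - (2.0 / k) * np.log(2.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(smooth_sign(x, k=50.0))
print(smooth_abs(x, k=50.0))   # close to |x| away from 0, up to 2 ln(2)/k
print(np.abs(x))
```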
We get the following situation

Something Notable

[Figure: the smooth approximation of $|x|$ plotted near the origin; it is differentiable at $x = 0$ and follows $|x|$ away from it.]
83 / 148
Normalizing

Contrast normalization layer


The task of a local contrast normalization layer:
To enforce local competitiveness between adjacent units within a feature map.
To enforce competitiveness between units at the same spatial location across feature maps.

We have two types of operations


Subtractive Normalization.
Brightness Normalization.

85 / 148
Subtractive Normalization

Given $m_1^{(l-1)}$ feature maps of size $m_2^{(l-1)} \times m_3^{(l-1)}$
The output of layer $l$ comprises $m_1^{(l)} = m_1^{(l-1)}$ feature maps unchanged in size.

With the operation

$$Y_i^{(l)} = Y_i^{(l-1)} - K_{G(\sigma)} * \sum_{j=1}^{m_1^{(l-1)}} Y_j^{(l-1)} \qquad (11)$$

With

$$\left( K_{G(\sigma)} \right)_{x,y} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{x^2 + y^2}{2\sigma^2} \right\} \qquad (12)$$

86 / 148
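A rough sketch of Equation (11), under the assumptions that the Gaussian kernel is truncated to a small radius and normalized to sum to one, and that the borders are zero-padded; the radius, σ and all names are choices made only for this example.

```python
import numpy as np

def gaussian_kernel(radius, sigma):
    # (K_G(sigma))_{x,y} proportional to exp(-(x^2 + y^2) / (2 sigma^2)),
    # normalized here so the weights sum to one
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    K = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return K / K.sum()

def subtractive_normalization(Y_prev, sigma=1.0, radius=2):
    """Y_prev: feature maps of layer l-1, shape (m1, H, W).
    Subtract from every map a Gaussian-weighted local mean computed over the
    sum of all maps, roughly Equation (11); zero padding at the borders."""
    K = gaussian_kernel(radius, sigma)
    summed = Y_prev.sum(axis=0)                       # sum over the feature maps
    padded = np.pad(summed, radius, mode="constant")
    H, W = summed.shape
    local_mean = np.zeros_like(summed)
    for x in range(H):
        for y in range(W):
            local_mean[x, y] = np.sum(padded[x:x + 2 * radius + 1,
                                             y:y + 2 * radius + 1] * K)
    return Y_prev - local_mean                        # broadcast over the m1 maps

rng = np.random.default_rng(3)
print(subtractive_normalization(rng.normal(size=(3, 6, 6))).shape)
```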
Brightness Normalization

An alternative is to normalize the brightness in combination with the rectified linear units

$$\left( Y_i^{(l)} \right)_{x,y} = \frac{\left( Y_i^{(l-1)} \right)_{x,y}}{\left( \kappa + \lambda \sum_{j=1}^{m_1^{(l-1)}} \left( \left( Y_j^{(l-1)} \right)_{x,y} \right)^2 \right)^{\mu}} \qquad (13)$$

Where
$\kappa$, $\mu$ and $\lambda$ are hyperparameters which can be set using a validation set.
87 / 148
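A minimal sketch of Equation (13); the values chosen below for κ, λ and µ are placeholders only, since the slide leaves them as hyperparameters to be set on a validation set.

```python
import numpy as np

def brightness_normalization(Y_prev, kappa=2.0, lam=1e-4, mu=0.75):
    """Y_prev: feature maps of layer l-1, shape (m1, H, W).
    Divide each activation by (kappa + lam * sum_j Y_j(x, y)^2) ** mu,
    the sum running over the feature maps, as in Equation (13)."""
    denom = (kappa + lam * np.sum(Y_prev ** 2, axis=0)) ** mu
    return Y_prev / denom            # denom broadcasts over the m1 maps

rng = np.random.default_rng(4)
print(brightness_normalization(np.abs(rng.normal(size=(3, 4, 4)))).shape)
```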
Sub-sampling Layer

Motivation
The motivation of subsampling the feature maps obtained by previous
layers is robustness to noise and distortions.

How?
Normally, in traditional Convolutional Networks, subsampling is done by applying skipping factors!!!
However, it is possible to combine subsampling with pooling and do it in a separate layer

89 / 148
Sub-sampling

The subsampling layer


It seems to be acting as the well know sub-sampling pyramid

90 / 148
How is sub-sampling implemented?

We know that Image Pyramids


They were designed to find:
1 filter-based representations to decompose images into information at
multiple scales,
2 To extract features/structures of interest,
3 To attenuate noise.

Example of the usage of these filters


The SURF and SIFT filters

91 / 148
There are also other ways of doing this

Subsampling can be done using so-called skipping factors $s_1^{(l)}$ and $s_2^{(l)}$

The basic idea is to skip a fixed number of pixels

Therefore, the size of the output feature map is given by

$$m_2^{(l)} = \frac{m_2^{(l-1)} - 2h_1^{(l)}}{s_1^{(l)} + 1} \qquad \text{and} \qquad m_3^{(l)} = \frac{m_3^{(l-1)} - 2h_2^{(l)}}{s_2^{(l)} + 1}$$

92 / 148
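A tiny helper that evaluates this output-size formula (integer division is assumed here when the division is not exact):

```python
def output_size(m_prev, h, s):
    """Spatial size of a feature map after a (2h+1) kernel with skipping factor s,
    following the slide: (m_prev - 2h) / (s + 1)."""
    return (m_prev - 2 * h) // (s + 1)

# e.g. a 28-pixel-wide map, a 5x5 kernel (h = 2), skipping factor 1
print(output_size(28, h=2, s=1))   # 12
```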
What is Pooling?

Pooling
Spatial pooling is a way to compute an image representation based on encoded local features.

93 / 148
Pooling

Let l be a pooling layer


Its output comprises $m_1^{(l)} = m_1^{(l-1)}$ feature maps of reduced size.

Pooling Operation
It operates by placing windows at non-overlapping positions in each feature map and keeping one value per window, such that the feature maps are sub-sampled.

94 / 148
Thus

In the previous example


All feature maps are pooled and sub-sampled individually.

Each unit
In one of the $m_1^{(l)} = 4$ output feature maps represents the average or the maximum within a fixed window of the corresponding feature map in layer $(l-1)$.

95 / 148
Examples of pooling

Average pooling
When using a boxcar filter, the operation is called average pooling and the layer is denoted by $P_A$.

Input ($4 \times 4$):

4 5 1 1
2 6 2 6
3 5 7 3
1 9 2 1

Pooled output ($2 \times 2$), as shown in the figure:

4.5 5
9 6.5
96 / 148
Examples of pooling

Max pooling
For max pooling, the maximum value of each window is taken. The layer is denoted by $P_M$.

Input ($4 \times 4$):

4 5 1 1
2 6 2 6
3 5 7 3
1 9 2 1

Pooled output ($2 \times 2$), as shown in the figure:

5 6
9 7

97 / 148
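A short sketch of both pooling operations over non-overlapping 2 × 2 windows, applied to the 4 × 4 input used in the figures; the windowing convention here is an assumption and may not match the figures exactly.

```python
import numpy as np

def pool2d(Y, window=2, mode="max"):
    """Non-overlapping pooling over window x window blocks of one feature map.
    The map size is assumed to be divisible by the window."""
    H, W = Y.shape
    blocks = Y.reshape(H // window, window, W // window, window)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))     # average pooling

Y = np.array([[4, 5, 1, 1],
              [2, 6, 2, 6],
              [3, 5, 7, 3],
              [1, 9, 2, 1]], dtype=float)
print(pool2d(Y, mode="average"))
print(pool2d(Y, mode="max"))
```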
An interesting property

Something notable depending on the pooling area

“In all cases, pooling helps to make the representation become approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change.”
Page 342, Ian Goodfellow et al., Deep Learning, 2016 [11].

The small amount

In the case of the previous examples, 1 pixel

98 / 148
Other Poolings

There are other types of pooling


L2 norm of a rectangular neighborhood
Weighted average based on the distance from the central pixel

However, we have another way of doing pooling


Striding!!!

99 / 148
Springenberg et al. [12]

They started talking about substituting max-pooling for something called a stride on the convolution

$$\left( Y_i^{(l)} \right)_{x,y} = \left( B_i^{(l)} \right)_{x,y} + \sum_{j=1}^{m_1^{(l-1)}} \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left( K_{ij}^{(l)} \right)_{k,t} \left( Y_j^{(l-1)} \right)_{x-k,\, y-t}$$

This is a Heuristic...

Basically, you jump around by a factor $r$ and $t$ for the width and height of the layer
It was proposed to decrease memory usage...
100 / 148
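A minimal sketch of a strided convolution (written as a cross-correlation, skipping the kernel flip): the output is evaluated only every r rows and t columns, which is what removes the need for a separate sub-sampling step. The toy input and kernel are assumptions made for the example.

```python
import numpy as np

def strided_conv2d(I, K, r=2, t=2):
    """'Valid' cross-correlation evaluated only every r rows and t columns,
    i.e. convolution with strides instead of a separate pooling step."""
    kh, kw = K.shape
    H, W = I.shape
    rows = range(0, H - kh + 1, r)
    cols = range(0, W - kw + 1, t)
    out = np.zeros((len(rows), len(cols)))
    for oi, x in enumerate(rows):
        for oj, y in enumerate(cols):
            out[oi, oj] = np.sum(I[x:x + kh, y:y + kw] * K)
    return out

I = np.arange(36, dtype=float).reshape(6, 6)
K = np.ones((3, 3)) / 9.0
print(strided_conv2d(I, K, r=2, t=2).shape)   # (2, 2)
```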
Example

Horizontal Stride

101 / 148
There are attempts to understand its effects

At Convolution Level and using Tensors [13]


“Take it in your stride: Do we need striding in CNNs?” by Chen
Kong, Simon Lucey [14]

Please read Kolda’s Paper before you get into the other
You need a little bit of notation...

104 / 148
Here, the people at Google [15] around 2015

They commented on the “Internal Covariate Shift” phenomenon

Due to the change in the distribution of each layer’s input

They claim
The mini-batch forces those changes, which impact the learning capabilities of the network.

In Neural Networks, they define this

Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training.

106 / 148
They gave the following reasons
Consider a layer with the input $u$ that adds the learned bias $b$
Then, it normalizes the result by subtracting the mean of the activation over the training data:

$$\hat{x} = x - E[x]$$

with $X = \{x_1, \ldots, x_N\}$ the data samples and $E[x] = \frac{1}{N} \sum_{i=1}^{N} x_i$

Now, if the gradient ignores the dependence of $E[x]$ on $b$

Then $b \leftarrow b + \Delta b$, where $\Delta b \propto -\frac{\partial l}{\partial \hat{x}}$

Finally

$$u + (b + \Delta b) - E[u + (b + \Delta b)] = u + b - E[u + b]$$
107 / 148
107 / 148
Then

The following will happen
The update to b by \Delta b leads to no change in the output of the layer, nor in the loss, while b keeps growing without bound.

Therefore
We need to integrate the normalization into the process of training.

108 / 148
Normalization via Mini-Batch Statistics

It is possible to describe the normalization as a transformation layer

\hat{x} = Norm(x, X)

which depends not only on the given sample x but on all the training samples X, each of which depends on the layer parameters.

For back-propagation, we will need to compute the following terms

\frac{\partial Norm(x, X)}{\partial x} \quad \text{and} \quad \frac{\partial Norm(x, X)}{\partial X}

109 / 148
Definition of Whitening

Whitening
Suppose X is a random (column) vector with non-singular covariance
matrix Σ and mean 0.

Then
Then the transformation Y = W X with a whitening matrix W
satisfying the condition W T W = Σ−1 yields the whitened random
vector Y with unit diagonal covariance.

110 / 148
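A minimal NumPy sketch of this definition (PCA whitening; the variable names and the random test data are my own, not from the slides): given zero-mean data with covariance \Sigma = U \Lambda U^T, the matrix W = \Lambda^{-1/2} U^T satisfies W^T W = \Sigma^{-1}, and the whitened samples have approximately identity covariance.

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.7]])
X = rng.normal(size=(1000, 3)) @ A          # correlated samples
X = X - X.mean(axis=0)                      # center (mean 0, as in the definition)

Sigma = np.cov(X, rowvar=False)             # empirical covariance
eigvals, U = np.linalg.eigh(Sigma)          # Sigma = U diag(eigvals) U^T
W = np.diag(1.0 / np.sqrt(eigvals)) @ U.T   # whitening matrix with W^T W = Sigma^{-1}

Y = X @ W.T                                 # whitened vectors Y = W X (row-wise)
print(np.round(np.cov(Y, rowvar=False), 2)) # approximately the identity matrix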
Such Normalization

It could be used for all layers
But whitening the layer inputs is expensive, as it requires computing the covariance matrix

Cov[x] = E_{x \in X}\left[x x^T\right] - E[x]E[x]^T

as well as its inverse square root, to produce the whitened activations.

111 / 148
Therefore

A better option: we can normalize each input dimension of a layer

\hat{x}^{(k)} = \frac{x^{(k)} - \mu}{\sigma}

with \mu = E\left[x^{(k)}\right] and \sigma^2 = Var\left[x^{(k)}\right]

This allows us to speed up convergence
However, simply normalizing each input of a layer may change what the layer can represent.

So, we need to insert a transformation in the network
Which can represent the identity transform

112 / 148
The Transformation

The linear transformation

y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}

The parameters \gamma^{(k)}, \beta^{(k)}
They allow the layer to recover the identity by setting \gamma^{(k)} = \sqrt{Var\left[x^{(k)}\right]} and \beta^{(k)} = E\left[x^{(k)}\right], if necessary.

113 / 148
Finally

Batch Normalizing Transform
Input: Values of x over a mini-batch: B = \{x_{1 \ldots m}\}; parameters to be learned: \gamma, \beta
Output: \{y_i = BN_{\gamma,\beta}(x_i)\}
1  \mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i
2  \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2
3  \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
4  y_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)

114 / 148
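A minimal NumPy sketch of the transform above, applied per dimension over a mini-batch (the function name, the returned cache, and the default \epsilon are my own choices, not from the slides):

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: mini-batch activations, shape (m, d); gamma, beta: learned parameters, shape (d,)
    mu = x.mean(axis=0)                      # step 1: mini-batch mean
    var = x.var(axis=0)                      # step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # step 3: normalize
    y = gamma * x_hat + beta                 # step 4: scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)  # kept for the backward pass below
    return y, cache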
Backpropagation

We have the following equations by using the loss function l

1  \frac{\partial l}{\partial \hat{x}_i} = \frac{\partial l}{\partial y_i} \cdot \gamma
2  \frac{\partial l}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \left(-\frac{1}{2}\right) \left(\sigma_B^2 + \epsilon\right)^{-3/2}
3  \frac{\partial l}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial l}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial l}{\partial \sigma_B^2} \cdot \frac{\sum_{i=1}^{m} -2(x_i - \mu_B)}{m}
4  \frac{\partial l}{\partial x_i} = \frac{\partial l}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial l}{\partial \sigma_B^2} \cdot \frac{2(x_i - \mu_B)}{m} + \frac{\partial l}{\partial \mu_B} \cdot \frac{1}{m}
5  \frac{\partial l}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i} \cdot \hat{x}_i
6  \frac{\partial l}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial l}{\partial y_i}

115 / 148
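A minimal NumPy sketch of equations 1-6, written as a backward pass paired with batch_norm_forward above (the function name is my own):

import numpy as np

def batch_norm_backward(dy, cache):
    # dy: upstream gradient dl/dy, shape (m, d); cache: tuple returned by batch_norm_forward
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    dx_hat = dy * gamma                                                        # eq. 1
    dvar = np.sum(dx_hat * (x - mu), axis=0) * (-0.5) * (var + eps) ** (-1.5)  # eq. 2
    dmu = np.sum(dx_hat * (-1.0) / np.sqrt(var + eps), axis=0) \
          + dvar * np.sum(-2.0 * (x - mu), axis=0) / m                         # eq. 3
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mu) / m + dmu / m     # eq. 4
    dgamma = np.sum(dy * x_hat, axis=0)                                        # eq. 5
    dbeta = np.sum(dy, axis=0)                                                 # eq. 6
    return dx, dgamma, dbeta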
Training Batch Normalization Networks

Input: Network N with trainable parameters \Theta; subset of activations \{x^{(k)}\}_{k=1}^{K}
Output: Batch-normalized network for inference, N_{BN}^{inf}
1   N_{BN}^{tr} = N  // Training BN network
2   for k = 1 ... K do
3       Add the transformation y^{(k)} = BN_{\gamma^{(k)},\beta^{(k)}}(x^{(k)}) to N_{BN}^{tr}
4       Modify each layer in N_{BN}^{tr} with input x^{(k)} to take y^{(k)} instead
5   Train N_{BN}^{tr} to optimize the parameters \Theta \cup \{\gamma^{(k)}, \beta^{(k)}\}_{k=1}^{K}
6   N_{BN}^{inf} = N_{BN}^{tr}  // Inference BN network with frozen parameters
7   for k = 1 ... K do
8       Process multiple training mini-batches B, each of size m, and average over them:
9           E[x] = E_B[\mu_B] and Var[x] = \frac{m}{m-1} E_B\left[\sigma_B^2\right]
10      In N_{BN}^{inf}, replace the transform y = BN_{\gamma,\beta}(x) with
11          y = \frac{\gamma}{\sqrt{Var[x] + \epsilon}} \cdot x + \left(\beta - \frac{\gamma E[x]}{\sqrt{Var[x] + \epsilon}}\right)

116 / 148
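A minimal sketch of the frozen-statistics transform in lines 10-11, once the population estimates E[x] and Var[x] have been accumulated (the function and argument names are my own):

import numpy as np

def batch_norm_inference(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    # y = gamma / sqrt(Var[x] + eps) * x + (beta - gamma * E[x] / sqrt(Var[x] + eps))
    scale = gamma / np.sqrt(pop_var + eps)
    shift = beta - scale * pop_mean
    return scale * x + shift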
However

Santurkar et al. [16]
They found that it is not the internal covariate shift that Batch Normalization actually affects!!!

Santurkar et al. recognize that
Batch normalization has been arguably one of the most successful architectural innovations in deep learning.

They used a standard very deep convolutional network
Trained on CIFAR-10 with and without BatchNorm

117 / 148
They found something quite interesting

The supporting plots are in [16]; the facts they establish are summarized next.

118 / 148
Actually Batch Normalization

It does not do anything to the Internal Covariate Shift
Instead, it smooths the optimization landscape
And it is not the only way to achieve this!!!

They suggest that
"This suggests that the positive impact of BatchNorm on training might be somewhat serendipitous."

119 / 148
They actually have a connected result

To the analysis of gradient clipping!!!
They are the same group

Theorem (The effect of BatchNorm on the Lipschitzness of the loss)
For a BatchNorm network with loss \hat{L} and an identical non-BN network with (identical) loss L,

\left\| \nabla_{y_j} \hat{L} \right\|^2 \leq \frac{\gamma^2}{\sigma_j^2} \left( \left\| \nabla_{y_j} L \right\|^2 - \frac{1}{m} \left\langle \mathbf{1}, \nabla_{y_j} L \right\rangle^2 - \frac{1}{\sqrt{m}} \left\langle \nabla_{y_j} L, \hat{y}_j \right\rangle^2 \right)

120 / 148
Finally, The Fully Connected Layer

121 / 148
Fully Connected Layer

If a layer l is a fully connected layer
If layer (l - 1) is also fully connected, we compute the output of the i-th unit at layer l as:

z_i^{(l)} = \sum_{k=0}^{m^{(l-1)}} w_{i,k}^{(l)} y_k^{(l-1)}, \quad \text{thus} \quad y_i^{(l)} = f\left(z_i^{(l)}\right)

Otherwise
Layer l expects m_1^{(l-1)} feature maps of size m_2^{(l-1)} \times m_3^{(l-1)} as input.

122 / 148
Then

Thus, the i-th unit in layer l computes

y_i^{(l)} = f\left(z_i^{(l)}\right), \quad z_i^{(l)} = \sum_{j=1}^{m_1^{(l-1)}} \sum_{r=1}^{m_2^{(l-1)}} \sum_{s=1}^{m_3^{(l-1)}} w_{i,j,r,s}^{(l)} \left(Y_j^{(l-1)}\right)_{r,s}

123 / 148
Here

(l)
Where wi,j,r,s
It denotes the weight connecting the unit at position (r, s) in the j th
feature map of layer (l − 1) and the ith unit in layer l.

Something Notable
In practice, Convolutional Layers are used to learn a feature hierarchy
and one or more fully connected layers are used for classification
purposes based on the computed features.

124 / 148
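A minimal NumPy sketch of this fully connected layer over feature maps, using the weight tensor w_{i,j,r,s} just described (the function name, the ReLU default for f, and the tensor layout are my own choices):

import numpy as np

def fully_connected(Y_prev, w, f=lambda z: np.maximum(z, 0.0)):
    # Y_prev: feature maps of layer l-1, shape (m1, m2, m3)
    # w: weights of layer l, shape (num_units, m1, m2, m3);
    #    w[i, j, r, s] connects position (r, s) of map j to unit i
    z = np.tensordot(w, Y_prev, axes=([1, 2, 3], [0, 1, 2]))  # z_i = sum_{j,r,s} w_{i,j,r,s} (Y_j)_{r,s}
    return f(z)                                               # y_i = f(z_i)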
Basically

We can use a loss function at the output of such a layer

L(W) = \sum_{n=1}^{N} E_n(W) = \sum_{n=1}^{N} \sum_{k=1}^{K} \left(y_{nk}^{(l)} - t_{nk}\right)^2  (Sum of Squared Error)

L(W) = \sum_{n=1}^{N} E_n(W) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \log\left(y_{nk}^{(l)}\right)  (Cross-Entropy Error)

Assuming W is the tensor used to represent all the possible weights
We can use the Backpropagation idea as long as we can apply the corresponding derivatives.

125 / 148
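A minimal NumPy sketch of the two losses above for a batch of N outputs with K classes (the function names and the small \epsilon guard inside the logarithm are my own choices, not from the slides):

import numpy as np

def sum_of_squared_error(y, t):
    # y: network outputs, t: targets, both of shape (N, K)
    return np.sum((y - t) ** 2)

def cross_entropy_error(y, t, eps=1e-12):
    # y: predicted class probabilities, t: one-hot targets, both of shape (N, K)
    return -np.sum(t * np.log(y + eps))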
About this

As part of the seminar


We are preparing a series of slides about Loss Functions...

126 / 148
An Example of CNN: The Proposed Architecture

127 / 148
We have the following Architecture

Simplified architecture following Yann LeCun's "Backpropagation applied to handwritten zip code recognition"

128 / 148
Therefore, we have

Layer l = 1
This layer uses a ReLU non-linearity f and three convolution kernels K_{11}, K_{21}, K_{31} to produce 3 channels:

\left(Y_1^{(l)}\right)_{x,y} = \left(B_1^{(l)}\right)_{x,y} + \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{11}^{(l)}\right)_{k,t} \left(Y_1^{(l-1)}\right)_{x-k,y-t}

\left(Y_2^{(l)}\right)_{x,y} = \left(B_2^{(l)}\right)_{x,y} + \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{21}^{(l)}\right)_{k,t} \left(Y_1^{(l-1)}\right)_{x-k,y-t}

\left(Y_3^{(l)}\right)_{x,y} = \left(B_3^{(l)}\right)_{x,y} + \sum_{k=-h_1^{(l)}}^{h_1^{(l)}} \sum_{t=-h_2^{(l)}}^{h_2^{(l)}} \left(K_{31}^{(l)}\right)_{k,t} \left(Y_1^{(l-1)}\right)_{x-k,y-t}

129 / 148
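A minimal sketch of one of these channels followed by the ReLU, in plain loop form with a "valid" output region (the border handling and the function name are my assumptions; the slides do not specify how borders are treated):

import numpy as np

def conv_channel_relu(Y_in, K, B):
    # Y_in: input map Y^(l-1) of shape (H, W); K: kernel of shape (2*h1+1, 2*h2+1); B: bias
    h1, h2 = K.shape[0] // 2, K.shape[1] // 2
    H, W = Y_in.shape
    out = np.zeros((H - 2 * h1, W - 2 * h2))
    for x in range(h1, H - h1):
        for y in range(h2, W - h2):
            patch = Y_in[x - h1:x + h1 + 1, y - h2:y + h2 + 1]
            out[x - h1, y - h2] = np.sum(K[::-1, ::-1] * patch)  # sum_{k,t} K[k,t] * Y[x-k, y-t]
    return np.maximum(out + B, 0.0)                              # add bias, then ReLU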
Layer l = 2

We have a max-pooling of size 2 \times 2

\left(Y_i^{(l)}\right)_{x',y'} = \max\left\{ \left(Y_i^{(l-1)}\right)_{x,y}, \left(Y_i^{(l-1)}\right)_{x+1,y}, \left(Y_i^{(l-1)}\right)_{x,y+1}, \left(Y_i^{(l-1)}\right)_{x+1,y+1} \right\}

130 / 148
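A minimal sketch of this 2 x 2 max-pooling, assuming a stride of 2 and an input map with even height and width (the stride is my assumption; the slide only gives the window size):

def maxpool_2x2(Y):
    # Y: NumPy input map of shape (H, W) with H, W even
    H, W = Y.shape
    return Y.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))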
Then, you repeat the previous process

Thus, after a second round of convolution and max-pooling, we obtain the reduced feature maps Y_k^{(4)} from the maps Y_f^{(3)}.
We then use those as the inputs to the fully connected layer.

131 / 148
The fully connected layer

Now, assuming a single output neuron k = 1

y_1^{(6)} = f\left(z_1^{(5)}\right)

z_1^{(5)} = \sum_{k=1}^{9} \sum_{r=1}^{m_2^{(4)}} \sum_{s=1}^{m_3^{(4)}} w_{r,s,k}^{(5)} \left(Y_k^{(4)}\right)_{r,s}

132 / 148
We have, for simplicity's sake

That our final cost function is equal to

L = \frac{1}{2}\left(y_1^{(6)} - t_1^{(6)}\right)^2

133 / 148
Backpropagation

134 / 148
After collecting all input/output

Therefore
We have, using the sum of squared errors as loss function:

L = \frac{1}{2}\left(y_1^{(6)} - t_1^{(6)}\right)^2

Therefore, we can obtain

\frac{\partial L}{\partial w_{r,s,k}^{(5)}} = \frac{1}{2} \times \frac{\partial \left(y_1^{(6)} - t_1^{(6)}\right)^2}{\partial w_{r,s,k}^{(5)}}

135 / 148
Therefore

We get in the first part of the equation

\frac{\partial \left(y_1^{(6)} - t_1^{(6)}\right)^2}{\partial w_{r,s,k}^{(5)}} = 2\left(y_1^{(6)} - t_1^{(6)}\right) \frac{\partial y_1^{(6)}}{\partial w_{r,s,k}^{(5)}}

where the factor 2 cancels the \frac{1}{2} in L.

With

y_1^{(6)} = ReLU\left(z_1^{(5)}\right)

136 / 148
Therefore

We have

\frac{\partial y_1^{(6)}}{\partial w_{r,s,k}^{(5)}} = \frac{\partial f\left(z_1^{(5)}\right)}{\partial z_1^{(5)}} \times \frac{\partial z_1^{(5)}}{\partial w_{r,s,k}^{(5)}}

Therefore, if we use the smooth (sigmoid) approximation of the ReLU derivative

\frac{\partial f\left(z_1^{(5)}\right)}{\partial z_1^{(5)}} = \frac{e^{k z_1^{(5)}}}{1 + e^{k z_1^{(5)}}}

137 / 148
Now, we need to derive \frac{\partial z_1^{(5)}}{\partial w_{r,s,k}^{(5)}}

We know that

z_1^{(5)} = \sum_{k=1}^{9} \sum_{r=1}^{m_2^{(4)}} \sum_{s=1}^{m_3^{(4)}} w_{r,s,k}^{(5)} \left(Y_k^{(4)}\right)_{r,s}

Finally

\frac{\partial z_1^{(5)}}{\partial w_{r,s,k}^{(5)}} = \left(Y_k^{(4)}\right)_{r,s}

138 / 148
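A small finite-difference check of the chain assembled so far, namely \partial L / \partial w_{r,s,k}^{(5)} = \left(y_1^{(6)} - t_1^{(6)}\right) f'\left(z_1^{(5)}\right) \left(Y_k^{(4)}\right)_{r,s}. To keep the check exact, this sketch uses the soft-plus \frac{1}{k}\ln\left(1 + e^{k z}\right) as f, whose derivative is exactly the sigmoid approximation above; the array names and test values are my own.

import numpy as np

rng = np.random.default_rng(1)
Y4 = rng.normal(size=(9, 4, 4))            # the nine pooled feature maps Y_k^(4)
w = rng.normal(size=(9, 4, 4)) * 0.1       # weights w_{r,s,k}, stored here as an array indexed [k, r, s]
t1, kslope = 0.5, 5.0                      # target and slope of the smooth ReLU

f = lambda z: np.log1p(np.exp(kslope * z)) / kslope   # soft-plus; f'(z) = sigmoid(kslope * z)
loss = lambda w_: 0.5 * (f(np.sum(w_ * Y4)) - t1) ** 2

z1 = np.sum(w * Y4)
analytic = (f(z1) - t1) / (1.0 + np.exp(-kslope * z1)) * Y4[0, 1, 2]   # chain rule at indices (k, r, s) = (0, 1, 2)

eps = 1e-6
w_pert = w.copy()
w_pert[0, 1, 2] += eps
numeric = (loss(w_pert) - loss(w)) / eps
print(analytic, numeric)                   # the two values should agree closely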
Maxpooling

The max-pooling is not differentiated as such; we go directly to the max term
Assume the max element comes from feature map f = 1, 2, ..., 9 with j = 1:

\left(Y_f^{(3)}\right)_{x,y} = \left(B_f^{(3)}\right)_{x,y} + \sum_{k=-h_1^{(3)}}^{h_1^{(3)}} \sum_{t=-h_2^{(3)}}^{h_2^{(3)}} \left(K_{f1}^{(3)}\right)_{k,t} \left(Y_1^{(2)}\right)_{x-k,y-t}

139 / 148
Therefore

We have then

\frac{\partial L}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \frac{1}{2} \times \frac{\partial \left(y_1^{(6)} - t_1^{(6)}\right)^2}{\partial \left(K_{f1}^{(3)}\right)_{k,t}}

We have the following chain of derivations, given \left(Y_f^{(4)}\right)_{x,y} = f\left(\left(Y_f^{(3)}\right)_{x,y}\right):

\frac{\partial L}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \left(y_1^{(6)} - t_1^{(6)}\right) \frac{\partial f\left(z_1^{(5)}\right)}{\partial z_1^{(5)}} \times \frac{\partial z_1^{(5)}}{\partial \left(Y_f^{(4)}\right)_{x,y}} \times \frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(K_{f1}^{(3)}\right)_{k,t}}

140 / 148
Therefore

We have

\frac{\partial z_1^{(5)}}{\partial \left(Y_f^{(4)}\right)_{x,y}} = w_{x,y,f}^{(5)}

Then, assuming that

\left(Y_f^{(3)}\right)_{x,y} = \left(B_f^{(3)}\right)_{x,y} + \sum_{k=-h_1^{(3)}}^{h_1^{(3)}} \sum_{t=-h_2^{(3)}}^{h_2^{(3)}} \left(K_{f1}^{(3)}\right)_{k,t} \left(Y_1^{(2)}\right)_{x-k,y-t}

141 / 148
Therefore

We have

\frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(Y_f^{(3)}\right)_{x,y}} \times \frac{\partial \left(Y_f^{(3)}\right)_{x,y}}{\partial \left(K_{f1}^{(3)}\right)_{k,t}}

Then

\frac{\partial f\left(\left(Y_f^{(3)}\right)_{x,y}\right)}{\partial \left(Y_f^{(3)}\right)_{x,y}} = f'\left(\left(Y_f^{(3)}\right)_{x,y}\right)

142 / 148
Finally, we have

The equation

\frac{\partial \left(Y_f^{(3)}\right)_{x,y}}{\partial \left(K_{f1}^{(3)}\right)_{k,t}} = \left(Y_1^{(2)}\right)_{x-k,y-t}

143 / 148
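Putting the last three results together, the full kernel gradient accumulates the chain over every output position (x, y). A minimal loop-form sketch, where delta3[x, y] stands for \partial L / \partial \left(Y_f^{(3)}\right)_{x,y}, i.e. all upstream factors of the chain already multiplied together (the names delta3, Y2, h1, h2 are my own):

import numpy as np

def kernel_gradient(delta3, Y2, h1, h2):
    # delta3: upstream gradient on the map Y_f^(3), shape (H, W)
    # Y2: input map Y_1^(2), same shape; h1, h2: kernel half-sizes
    dK = np.zeros((2 * h1 + 1, 2 * h2 + 1))
    H, W = delta3.shape
    for k in range(-h1, h1 + 1):
        for t in range(-h2, h2 + 1):
            for x in range(h1, H - h1):           # positions where the convolution is defined
                for y in range(h2, W - h2):
                    dK[k + h1, t + h2] += delta3[x, y] * Y2[x - k, y - t]
    return dK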
The Other Equations

I will leave you to derive them
They follow the same repetitive procedure.

The interesting case is the average pooling
The remaining ones are the stride and the deconvolution.

144 / 148
[1] A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, “A survey of the
recent architectures of deep convolutional neural networks,” Artificial
Intelligence Review, vol. 53, no. 8, pp. 5455–5516, 2020.
[2] R. Szeliski, Computer Vision: Algorithms and Applications.
Berlin, Heidelberg: Springer-Verlag, 1st ed., 2010.
[3] S. Haykin, Neural Networks and Learning Machines.
No. v. 10 in Neural networks and learning machines, Prentice Hall,
2009.
[4] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction
and functional architecture in the cat’s visual cortex,” The Journal of
physiology, vol. 160, no. 1, p. 106, 1962.
[5] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., “Gradient-based
learning applied to document recognition,” Proceedings of the IEEE,
vol. 86, no. 11, pp. 2278–2324, 1998.

145 / 148
[6] W. Zhang, K. Itoh, J. Tanida, and Y. Ichioka, “Parallel distributed
processing model with local space-invariant interconnections and its
optical architecture,” Appl. Opt., vol. 29, pp. 4790–4797, Nov 1990.
[7] J. J. Weng, N. Ahuja, and T. S. Huang, “Learning recognition and
segmentation of 3-d objects from 2-d images,” in 1993 (4th)
International Conference on Computer Vision, pp. 121–128, IEEE,
1993.
[8] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus,
“Deconvolutional networks,” in 2010 IEEE Computer Society
Conference on computer vision and pattern recognition,
pp. 2528–2535, IEEE, 2010.
[9] D. Krishnan and R. Fergus, “Fast image deconvolution using
hyper-laplacian priors,” Advances in neural information processing
systems, vol. 22, pp. 1033–1041, 2009.

146 / 148
[10] Y. Wang, J. Yang, W. Yin, and Y. Zhang, “A new alternating
minimization algorithm for total variation image reconstruction,”
SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 248–272, 2008.
[11] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.
The MIT Press, 2016.
[12] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller,
“Striving for simplicity: The all convolutional net,” 2015.
[13] T. G. Kolda and B. W. Bader, “Tensor decompositions and
applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
[14] C. Kong and S. Lucey, “Take it in your stride: Do we need striding in
cnns?,” arXiv preprint arXiv:1712.02502, 2017.
[15] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.

147 / 148
[16] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch
normalization help optimization?,” in Advances in Neural Information
Processing Systems, pp. 2483–2493, 2018.

148 / 148
