CNN PPT Unit IV
• Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep Learning
Overview
1. Overview of Convolutional Networks
2. Traditional versus Convolutional Networks
3. Topics in CNNs
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
Runtime of Traditional vs Convolutional Networks
• What convolution is
• Motivation behind using convolution in a neural network
• Pooling, which almost all convolutional networks employ
• Usually the operation used in a convolutional neural network
does not correspond precisely to convolution in math
• We describe several variants on convolution function used in
practice
• Making convolution more efficient
• Convolution networks stand out as an example of
neuroscientific principles in deep learning
• Very deep convolutional network architectures
Topics in Convolutional Networks
• Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep Learning
Topics
1. What is convolution?
2. Convolution: continuous and discrete cases
3. Convolution in two dimensions
4. Discrete convolution viewed as matrix multiplication
What is Convolution?
Convolution of an input f[t] with a kernel g[t]:
(f ∗ g)[t] = ∑_τ f[τ] g[t − τ]
Computation of 1-D Discrete Convolution
Parameters of convolution:
Kernel size (F)
Padding (P)
Stride (S)
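To make the sum above concrete, here is a minimal NumPy sketch (not from the slides; the function name conv1d and the test arrays are illustrative) that evaluates (f ∗ g)[t] = ∑_τ f[τ] g[t − τ] directly and checks it against np.convolve.

```python
import numpy as np

def conv1d(f, g):
    """Full 1-D discrete convolution: (f*g)[t] = sum_tau f[tau] * g[t - tau]."""
    n, k = len(f), len(g)
    out = np.zeros(n + k - 1)
    for t in range(n + k - 1):
        for tau in range(n):
            if 0 <= t - tau < k:          # g is only defined on indices 0..k-1
                out[t] += f[tau] * g[t - tau]
    return out

f = np.array([1.0, 2.0, 3.0])             # input signal f[t]
g = np.array([0.0, 1.0, 0.5])             # kernel g[t]
assert np.allclose(conv1d(f, g), np.convolve(f, g))
```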
Two-Dimensional Convolution
S(i, j) = (I ∗ K)(i, j) = ∑_m ∑_n I(m, n) K(i − m, j − n)
Figures: a sharply peaked kernel K for edge detection; kernels K1-K4 for line detection (input I(i, j), kernel K(i, j), output S(i, j))
Commutativity of Convolution
• Convolution is commutative. We can equivalently write:
S(i, j) = (K ∗ I)(i, j) = ∑_m ∑_n I(i − m, j − n) K(m, n)
• This formula is easier to implement in an ML library since there is less variation in the range of valid values of m and n
• Commutativity arises because we have flipped the kernel
relative to the input
• As m increases, index to the input increases, but index to the kernel
decreases
Cross-Correlation
• Same as convolution, but without flipping the kernel
S(i, j) = (K ∗ I)(i, j) = ∑_m ∑_n I(i + m, j + n) K(m, n)
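A small NumPy sketch (illustrative, not from the slides) showing that 2-D convolution is just cross-correlation with the kernel flipped along both axes; cross_correlate2d and the test arrays are hypothetical names.

```python
import numpy as np

def cross_correlate2d(I, K):
    """S(i, j) = sum_m sum_n I(i+m, j+n) K(m, n) over the valid region (no kernel flip)."""
    kh, kw = K.shape
    H, W = I.shape
    S = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

I = np.arange(16.0).reshape(4, 4)
K = np.array([[0.0, 1.0],
              [2.0, 3.0]])
corr = cross_correlate2d(I, K)              # cross-correlation
conv = cross_correlate2d(I, K[::-1, ::-1])  # convolution = cross-correlation with flipped kernel
```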
Discrete Convolution Viewed as Matrix Multiplication
For an input f[t] and kernel g(t), each output is y_t = ∑_τ f[τ] g[t − τ], giving one equation per output, y_1 through y_8 for an 8-sample input.
We can also write the equations in terms of elements of a general 8 ✕ 8 weight matrix W as y = W f, where W_{t,τ} = g[t − τ]; the entries are zero wherever t − τ falls outside the kernel, so W is a sparse, banded matrix.
http://colah.github.io/posts/2014-07-Understanding-Convolutions/
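As a sketch of this matrix-multiplication view (the helper conv_matrix and the 8-sample input below are illustrative assumptions), each row of W holds a shifted copy of the flipped kernel, so W @ f reproduces the convolution:

```python
import numpy as np

def conv_matrix(g, n):
    """Banded matrix W whose rows contain shifted (flipped) copies of the kernel g,
    so that W @ f equals the 'valid' convolution of an n-sample signal f with g."""
    k = len(g)
    W = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        W[i, i:i + k] = g[::-1]    # flipped kernel, shifted one column per row
    return W

f = np.arange(8.0)                  # 8-sample input
g = np.array([1.0, 0.0, -1.0])      # small kernel
W = conv_matrix(g, len(f))
assert np.allclose(W @ f, np.convolve(f, g, mode="valid"))
```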
Sparse Connectivity, Viewed from Below
Figure: when s3 is formed by (dense) matrix multiplication, connectivity is no longer sparse, so all of the inputs affect s3
Maintaining Performance with Reduced Connections
• How is it possible to obtain good performance while keeping the number of connections k several orders of magnitude smaller than the number of inputs m?
• In a deep neural network, units in deeper layers may indirectly interact with
a larger portion of the input
• A sparsely connected network requires only k × n parameters and O(k × n) runtime, where n is the number of outputs (a quick numeric sketch follows this list)
• Receptive field in deeper layers is larger than the receptive field in shallow layers
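A quick numeric sketch of the k × n vs. m × n comparison (the numbers m, n, and k below are made up for illustration):

```python
# Dense layer: every one of n outputs connects to all m inputs.
# Sparse (convolution-style) layer: every output connects to only k inputs.
m, n, k = 320 * 280, 320 * 280, 9      # e.g. image-sized input/output, 3x3 receptive field

dense_params = m * n                   # O(m x n) parameters and runtime
sparse_params = k * n                  # O(k x n) parameters and runtime

print(f"dense:  {dense_params:,} parameters")
print(f"sparse: {sparse_params:,} parameters  ({dense_params // sparse_params:,}x fewer)")
```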
Figure (parameter sharing), panel 2, fully connected model: a single black arrow indicates use of the central element of the weight matrix; the model has no parameter sharing, so that parameter is used only once
• What is Pooling?
• Three stages of CNNs
• Two terminologies: simple layers, complex layers
• Types of Pooling functions: Max, Average
• Translation invariance
• Rotation invariance
• Pooling with downsampling
• ConvNet architectures
• Shortcoming of pooling
What is Pooling?
• Pooling in a CNN is a subsampling step
• It replaces output at a location with a summary statistic of nearby outputs
• E.g., max pooling reports the maximum output within a rectangular neighborhood
https://deepai.org/machine-learning-glossary-and-terms/max-pooling
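A minimal sketch of max pooling as a summary statistic over a neighborhood (NumPy only; the array and window size are made up for illustration):

```python
import numpy as np

x = np.array([0.1, 0.9, 0.3, 0.2, 0.8, 0.4])   # detector-stage outputs
window = 2

# Non-overlapping max pooling: replace each window with its maximum value.
pooled = x.reshape(-1, window).max(axis=1)
print(pooled)   # [0.9 0.3 0.8]
```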
The Pooling Stage in a CNN
• Typical layer of a CNN consists of
three stages
• Stage 1:
• perform several convolutions in
parallel to produce a set of linear
activations
• Stage 2 (Detector):
• each linear activation is run through
a nonlinear activation function such
as ReLU
• Stage 3 (Pooling):
• Use a pooling function to modify
output of the layer further
Pooling Layer in Keras
• MaxPooling1D
• Arguments
• pool_size: Integer, size of the max pooling windows.
• strides: Integer, or None. Factor by which to downscale. E.g. 2 will halve
the input. If None, it will default to pool_size.
• padding: One of "valid" or "same" (case-insensitive).
• data_format: A string, one of channels_last (default) or channels_first.
The ordering of the dimensions in the inputs. channels_last
corresponds to inputs with shape (batch, steps, features) while
channels_first corresponds to inputs with shape (batch, features, steps)
• MaxPooling2D
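A short usage sketch of the arguments above (assumes TensorFlow's Keras; the toy inputs are made up for illustration):

```python
import numpy as np
import tensorflow as tf

# (batch, steps, features) for channels_last
x = np.arange(8, dtype="float32").reshape(1, 8, 1)

pool1d = tf.keras.layers.MaxPooling1D(pool_size=2, strides=None, padding="valid")
print(pool1d(x).numpy().squeeze())   # [1. 3. 5. 7.] -- strides=None defaults to pool_size, halving the 8 steps

# (batch, height, width, channels) for MaxPooling2D
img = np.arange(16, dtype="float32").reshape(1, 4, 4, 1)
pool2d = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))
print(pool2d(img).numpy().squeeze())  # 2x2 output, each entry the max of a 2x2 window
```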
Typical Subsampling in a Deep Network
Figure: outputs of max pooling, computed over the outputs of the nonlinearity (detector stage)
• The same network after the input has been shifted by one pixel: every input value has changed, but only half the values of the output have changed, because max pooling units are sensitive only to the maximum value in the neighborhood, not its exact location
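A toy NumPy demo of this effect (illustrative only; width-3, stride-1 max pooling over a made-up detector output and its shifted copy):

```python
import numpy as np

def max_pool_1d(x, width=3):
    """Stride-1 max pooling: each output is the max over a window of `width` inputs."""
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.7, 0.3])
shifted = np.roll(detector, 1)   # circularly shifted by one position (toy stand-in for a one-pixel shift)

print(max_pool_1d(detector))     # some pooled outputs are identical...
print(max_pool_1d(shifted))      # ...even though every input position changed
```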
Importance of Translation Invariance
• Architectures shown are illustrative and not designed for real use
• Real networks have branching structures
• Chain structures are shown for simplicity
A ConvNet architecture
INPUT: 32x32x3 holds raw pixel values: an image of width 32, height 32 and 3 color channels RGB
CONV layer will compute the output of neurons connected to local regions in the input, each computing a dot product between their weights and the small region of the input volume they are connected to. This may result in a volume such as 32x32x12 if we used 12 filters.
POOL layer will perform a down-sampling operation along the spatial dimensions (width, height), resulting in a volume such as 16x16x12
Source: https://www.cs.toronto.edu/~frossard/post/vgg16/
VGG-16 pre-trained model for Keras
https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3
Pooling, Invariance, Equivariance
• Pooling is supposed to provide positional, orientational, proportional, or rotational invariance
• However, it is a very crude approach: in reality it discards all sorts of positional information
• This leads to detecting the right image in Fig. 1 as a correct ship
Fig. 1: (1) disfiguration transformation, (2) proportional transformation
• Equivariance lets the network understand a rotation or proportion change and adapt itself accordingly, so that the spatial positioning inside an image is not lost
• A proposed solution is capsule networks
Source: https://hackernoon.com/what-is-a-capsnet-or-capsule-network-2bfbe48769cc
Definition of Equivariance
• Generalizes the concept of invariance
• Invariance is a property which remains unchanged when
transformations of a certain type are applied to the objects
• Area and perimeter of a triangle are invariants
• Translating/rotating a triangle does not change its area or
perimeter
• Triangle centers, such as the centroid and the circumcenter, are not invariant
• Because moving a triangle will cause its centers to move
• Instead, these centers are equivariant
• Applying any Euclidean congruence (a combination of a translation
and rotation) to a triangle, and then constructing its center, produces
the same point as constructing the center first, and then applying the
same congruence
With O = triangle and t = translation: Area(O) = Area(t(O)), but Center(t(O)) ≠ Center(O).
The center is instead equivariant: writing g(x) = center(x) and f(x) = translation of x, we have g(f(x)) = f(g(x)), i.e., translating the triangle and then constructing its center gives the same point as constructing the center first and then translating it.
Topics in Convolutional Networks
• Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep Learning
Topics in Infinitely Strong Prior
• Weak and Strong Priors
• Convolution as an infinitely strong prior
• Pooling as an infinitely strong prior
• Under-fitting with convolution and pooling
• Permutation invariance
Prior Parameter Distribution
• The role of a prior probability distribution over the parameters of a model is to encode our belief as to what models are reasonable, before seeing the data
Weak and Strong Priors
• A weak prior is a distribution with high entropy, e.g., a Gaussian with high variance; the data can move the parameters more or less freely
• A strong prior has very low entropy, e.g., a Gaussian with low variance; such a prior plays a more active role in determining where the parameters end up
Infinitely Strong Prior
• An infinitely strong prior places zero probability on some
parameters
• It says that some parameter values are forbidden regardless of
their support from data
• With an infinitely strong prior, no amount of data can change it: the forbidden parameter values remain forbidden
Convolutional Network
• Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers
Convolution as Infinitely Strong Prior
• Convolutional net is similar to a fully connected net but with an
infinitely strong prior over its weights
• It says that the weights for one hidden unit must be identical to the
weights of its neighbor, but shifted in space
• Prior also says that the weights must be zero, except for in the small
spatially contiguous receptive field assigned to that hidden unit
Multichannel Convolution
• Because we are dealing with multichannel convolution, the linear operations are not usually commutative, even if kernel flipping is used
• These multi-channel operations are only commutative if
each operation has the same number of output channels as
input channels.
Definition of 4-D kernel tensor
• Assume we have a 4-D kernel tensor K with element K_{i,j,k,l} giving the connection strength between
• a unit in channel i of the output and
• a unit in channel j of the input,
• with an offset of k rows and l columns between the output and input units
• Assume our input consists of observed data V with element V_{i,j,k} giving the value of the input unit
• within channel i, at row j and column k
• Assume our output consists of Z with the same format as V
• If Z is produced by convolving K across V without flipping K, then
Z_{i,j,k} = ∑_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n}
where the sum is over all valid values of l, m, and n
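A direct (slow) NumPy transcription of this formula, as a sketch only (the shapes and the function name multi_channel_conv are assumptions; 0-based indices replace the 1-based ones above):

```python
import numpy as np

def multi_channel_conv(K, V):
    """Z[i, j, k] = sum over l, m, n of V[l, j+m, k+n] * K[i, l, m, n]
    K: (out_channels, in_channels, kH, kW), V: (in_channels, H, W)."""
    out_ch, in_ch, kH, kW = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_ch, H - kH + 1, W - kW + 1))
    for i in range(out_ch):
        for j in range(H - kH + 1):
            for k in range(W - kW + 1):
                Z[i, j, k] = np.sum(V[:, j:j + kH, k:k + kW] * K[i])
    return Z

K = np.random.randn(4, 3, 3, 3)   # 4 output channels, 3 input channels, 3x3 kernels
V = np.random.randn(3, 8, 8)      # 3-channel 8x8 input
print(multi_channel_conv(K, V).shape)   # (4, 6, 6)
```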
Convolution with a Stride: Definition
• We may want to skip over some positions in the kernel to
reduce computational cost
• At the cost of not extracting fine features
• We can think of this as down-sampling the output of the full
convolution function
• If we want to sample only every s pixels in each direction of the output, then we can define a down-sampled convolution function c such that
Z_{i,j,k} = c(K, V, s)_{i,j,k} = ∑_{l,m,n} V_{l, (j−1)·s+m, (k−1)·s+n} K_{i,l,m,n}
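A sketch of the same idea in one dimension (NumPy only; names are illustrative): the strided result can equivalently be obtained by computing the full stride-1 output and keeping every s-th value.

```python
import numpy as np

def strided_corr1d(x, w, s):
    """Cross-correlation that only evaluates every s-th output position."""
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(0, len(x) - k + 1, s)])

x = np.arange(10.0)
w = np.array([1.0, 0.0, -1.0])
s = 2

full = np.array([np.dot(x[i:i + len(w)], w) for i in range(len(x) - len(w) + 1)])
assert np.allclose(strided_corr1d(x, w, s), full[::s])   # stride = full conv, then down-sample
```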
Figure: convolution with no padding vs. with zero padding (source: DeepLizard)
Effect of Zero-padding on Network Size
Convolutional net with a kernel of width 6 at every layer; no pooling, so only convolution shrinks the network size.
• We do not use any implicit zero padding, which causes the representation to shrink by five pixels at each layer
• Starting from an input of 16 pixels we are only able to have 3 convolutional layers, and the last layer never moves the kernel, so only two layers are truly convolutional
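The shrinkage can be checked with the usual output-size formula, out = ⌊(n + 2P − F)/S⌋ + 1; a quick sketch with the slide's numbers (F = 6, P = 0, S = 1, 16-pixel input):

```python
def conv_output_size(n, F, P=0, S=1):
    """Spatial output size of a convolution: floor((n + 2P - F) / S) + 1."""
    return (n + 2 * P - F) // S + 1

n = 16
for layer in range(1, 4):
    n = conv_output_size(n, F=6)             # width-6 kernel, no zero padding
    print(f"after layer {layer}: {n} pixels")  # 11, 6, 1 -- shrinks by 5 each time
```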
Tiled convolution has a set of t different kernels (here t = 2).
Traditional convolution is equivalent to tiled convolution with t = 1: there is only one kernel and it is applied everywhere.
Operations to Implement CNNs
Number vs volume
3. In this paper, the authors claim that Batch Norm makes the loss surface smoother (i.e., it bounds the magnitude of the gradients much more tightly).
https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8
ImageNet
https://www.image-net.org
https://machinelearning.wtf/terms/synset/
Example of different data types for CNNs
When Not to Use Convolution
• The use of convolution for processing variably sized inputs
makes sense only for inputs that have variable size because
they contain varying amounts of observation of the same kind of
thing—different lengths of recordings over time, different widths
of observations over space, etc.
• Convolution does not make sense if the input has variable size
because it can optionally include different kinds of observations.
1962, 1968...: Hubel and Wiesel, "Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex"
(Cat image by CNX OpenStax, licensed under CC BY 4.0; changes made)
A bit of history: the human brain cortex
Hierarchical organization
A bit of history: Neocognitron [Fukushima 1980]
Topics in Convolutional Networks
Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep Learning
A bit of history...
• The Mark I Perceptron machine was the first implementation of the perceptron algorithm; it recognized letters of the alphabet
• Update rule: w_i(t+1) = w_i(t) + α (d_j − y_j(t)) x_{j,i}
• Widrow and Hoff, ~1960: Adaline/Madaline
These figures are reproduced from Widrow 1960, Stanford Electronics Laboratories Technical Report, with permission from Stanford University Special Collections.
LeNet-5
A bit of history...
• "AlexNet" (Krizhevsky, Sutskever, and Hinton, 2012) reinvigorated research in deep learning
Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
First strong results
Acoustic Modeling using Deep Belief Networks
Abdel-rahman Mohamed, George Dahl, Geoffrey Hinton, 2010
Context-Dependent Pre-trained Deep Neural Networks
for Large Vocabulary Speech Recognition
George Dahl, Dong Yu, Li Deng, Alex Acero, 2012
Figures copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
https://theaisummer.com/cnn-architectures/
Progress of CNN Architectures on ImageNet
https://paperswithcode.com/sota/image-classification-on-imagenet
Accuracy vs G-FLOPs
• A greater number of parameters does not mean greater accuracy!
• With more layers (depth) one can capture richer and more complex features, but such models are hard to train (due to vanishing gradients)
Compound scaling:
Open question(s)!!!
LeNet-5
AlexNet
VGG-16
Inception-v1
ResNet-50