CNN PPT Unit IV


Topics in Convolutional Networks

• Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep
Learning
Overview

1. Overview of Convolutional
Networks
2. Traditional versus
Convolutional Networks
3. Topics in CNNs

Arguably the most widely used type of deep learning network


Overview of Convolutional Networks
• Convolutional networks, also known as convolutional neural networks, CNNs, or ConvNets, are a specialized kind of neural network
• Used for processing data that has a known grid-like topology
• Ex: time-series data, which is a 1-D grid of samples taken at regular time intervals
• Image data, which is a 2-D grid of pixels
• Utilize convolution, which is a specialized kind of linear operation
Neural Net Matrix Multiplication
• Each layer produces values that are obtained from the previous layer by performing a matrix multiplication
[Figure: augmented network]
• In an unaugmented network:
• The hidden layer produces values z = h(W(1)^T x + b(1))
• The output layer produces values y = σ(W(2)^T z + b(2))
• Note: W(1) and W(2) are matrices rather than vectors
• Example with D=3, M=3, x = [x1, x2, x3]^T
• We have two weight matrices W(1) and W(2)

[Figure: the first network layer, the network layer output, and the same computation in matrix multiplication notation]
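A minimal NumPy sketch of this forward pass (illustrative sizes; h taken as tanh and σ as the logistic sigmoid, which are assumptions rather than choices made on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, K = 3, 3, 1                      # D inputs, M hidden units, K outputs (illustrative)

x  = np.array([0.5, -1.0, 2.0])        # x = [x1, x2, x3]^T
W1 = rng.normal(size=(D, M))           # weight matrix W(1)
b1 = np.zeros(M)
W2 = rng.normal(size=(M, K))           # weight matrix W(2)
b2 = np.zeros(K)

z = np.tanh(W1.T @ x + b1)                     # hidden layer: z = h(W(1)^T x + b(1))
y = 1.0 / (1.0 + np.exp(-(W2.T @ z + b2)))     # output layer: y = sigma(W(2)^T z + b(2))
print(z, y)
```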


Matrix Multiplication for 2D Convolution

Far fewer weights needed than full matrix multiplication!


Convolutional Layer for Image Recognition
CNN - Neural Network with a Convolutional Layer
• CNNs are simply neural networks that use convolution in place
of general matrix multiplication in at least one of their layers
• Convolution can be viewed as multiplication by a matrix

https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
Runtime of Traditional vs Convolutional Networks

• Traditional neural network layers use matrix multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit:
s = g(W^T x)
• With m inputs and n outputs, matrix multiplication requires m × n parameters and O(m × n) runtime per example
• This means every output unit interacts with every input unit
• Convolutional network layers have sparse interactions

• If we limit the number of connections for each output to k, we need only k × n parameters and O(k × n) runtime
Topics in Convolutional Networks

• What convolution is
• Motivation behind using convolution in a neural network
• Pooling, which almost all convolutional networks employ
• Usually the operation used in a convolutional neural network
does not correspond precisely to convolution in math
• We describe several variants on convolution function used in
practice
• Making convolution more efficient
• Convolution networks stand out as an example of
neuroscientific principles in deep learning
• Very deep convolutional network architectures
Topics in Convolutional Networks
• Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep
Learning
Topics

1. What is convolution?
2. Convolution: continuous and discrete cases
3. Convolution in two dimensions
4. Discrete convolution viewed as matrix
multiplication
What is Convolution?

• Convolution is an operation on two functions of a real-valued argument and can be considered an averaging process
• An example of the two functions in one dimension:
• Tracking the location of a spaceship with a laser sensor
• The laser sensor provides a single output x(t), the position of the spaceship at time t
• w is a function of a real-valued argument
• If the laser sensor is noisy, we want a weighted average that gives more weight to recent observations
• The weighting function is w(a), where a is the age of a measurement
• Convolution is the weighted average, or smoothed estimate, of the position of the spaceship
• It is a new function s:
s(t) = ∫ x(a) w(t − a) da
Definition of Convolution of Input and Kernel

• The convolution function s is a weighted average of x:
s(t) = ∫ x(a) w(t − a) da
• The operation is typically denoted with an asterisk: s(t) = (x ∗ w)(t)
• w needs to be a valid probability density function, or the output is not a weighted average
• w needs to be 0 for negative arguments, or we will look into the future
• In convolutional network terminology, the first function x is referred to as the input, and the second function w is referred to as the kernel
• The output s is referred to as the feature map
• Properties: commutative, associative, distributive
[Figure: convolution of f(u) and g(u)]
Performing 1-D Convolution
• One-dimensional continuous case
• Input f(t) is convolved with a kernel g(t):
(f ∗ g)(t) ≡ ∫_{−∞}^{+∞} f(τ) g(t − τ) dτ
• Note that (f ∗ g) = (g ∗ f)
1. Express each function in terms of a dummy variable τ
2. Reflect one of the functions: g(τ) → g(−τ)
3. Add a time offset t, which allows g(t − τ) to slide along the τ axis
4. Start t at −∞ and slide all the way to +∞
• Wherever the two functions intersect, find the integral of their product
Source: https://en.wikipedia.org/wiki/Convolution
Convolution with Discrete Variables
• Ex: Laser sensor may only provide data at regular intervals
• Time index t takes on only integer values
• x and w are defined only on integer t

s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{+∞} x(a) w(t − a)

• In ML applications, the input is a multidimensional array of data and the kernel is a multidimensional array of parameters that are adapted by the learning algorithm
• These arrays are referred to as tensors
• Input and kernel are explicitly stored separately
• The functions are zero everywhere except at these points
Convolution in Discrete Case
• Here we have discrete functions f and g:
(f ∗ g)[t] = Σ_{τ=−∞}^{+∞} f[τ] · g[t − τ]
[Figure: f[t] and g[t − τ]]
Computation of 1-D Discrete Convolution

Parameters of convolution:
• Kernel size (F)
• Padding (P)
• Stride (S)
[Figure: f[t], g[t − τ], and (f ∗ g)[t]]
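A minimal sketch of 1-D discrete convolution parameterized by padding P and stride S (the function name and example values are illustrative; the kernel size F is simply len(g)):

```python
import numpy as np

def conv1d(f, g, padding=0, stride=1):
    """Discrete convolution (f * g)[t] = sum_tau f[tau] g[t - tau],
    computed on a zero-padded signal, keeping every `stride`-th output."""
    f = np.pad(f, padding)                   # padding P (zeros on both sides)
    F = len(g)                               # kernel size F
    g_flipped = g[::-1]                      # flipping the kernel realizes g[t - tau]
    out = [np.dot(f[t:t + F], g_flipped) for t in range(len(f) - F + 1)]
    return np.array(out)[::stride]           # stride S

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
g = np.array([1.0, 0.0, -1.0])
print(conv1d(f, g, padding=1, stride=1))     # [ 2.  2.  2.  2. -4.]
print(np.convolve(f, g, mode="same"))        # same values, as a cross-check
```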
Two-Dimensional Convolution

• Convolution can be performed over more than one axis
• For a 2-D image I as input and a 2-D kernel K we have:
S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n)

[Figures: a sharply peaked kernel K for edge detection; kernels K1–K4 for line detection; example of I(i, j), K(i, j), and S(i, j)]
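A minimal NumPy sketch of this 2-D formula, keeping only outputs where the kernel lies entirely within the image; the function and the toy arrays are illustrative:

```python
import numpy as np

def conv2d_valid(I, K):
    """S(i, j) = sum_m sum_n I(m, n) K(i - m, j - n), restricted to the
    region where the kernel fits entirely inside the image.  Implemented
    by flipping the kernel and sliding it (the commuted form of the sum)."""
    Kf = K[::-1, ::-1]                                   # flip kernel on both axes
    kh, kw = K.shape
    H, W = I.shape
    S = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * Kf)
    return S

I = np.arange(25, dtype=float).reshape(5, 5)             # toy 5x5 "image"
K = np.array([[0.0, 1.0], [2.0, 3.0]])                   # toy 2x2 kernel
print(conv2d_valid(I, K))                                # 4x4 feature map
```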
Commutativity of Convolution
• Convolution is commutative. We can equivalently write:
S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n)
• This formula is easier to implement in a ML library since
there is less variation in the range of valid values of m and n
• Commutativity arises because we have flipped the kernel
relative to the input
• As m increases, index to the input increases, but index to the kernel
decreases
Cross-Correlation
• Same as convolution, but without flipping the kernel
S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n)

• Both are often referred to as convolution, regardless of whether the kernel is flipped
• In ML, the learning algorithm will simply learn the appropriate values of the kernel in the appropriate place
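The relationship can be checked numerically with SciPy (assuming scipy is available): cross-correlation with a flipped kernel reproduces true convolution.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

rng = np.random.default_rng(0)
I = rng.normal(size=(6, 6))                          # toy image
K = rng.normal(size=(3, 3))                          # toy kernel

conv  = convolve2d(I, K, mode="valid")               # convolution (flips the kernel)
xcorr = correlate2d(I, np.flip(K), mode="valid")     # cross-correlation with flipped kernel

print(np.allclose(conv, xcorr))                      # True: flipping converts one into the other
```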
Example of 2D convolution
• Convolution without kernel
flipping applied to a 2D
tensor
• Output is restricted to case
where kernel is situated
entirely within the image
• Arrows show how upper-
left of input tensor is used
to form upper-left of output
tensor
Discrete Convolution Viewed as Matrix
Multiplication
• Convolution can be viewed as multiplication by a matrix
• However the matrix has several entries constrained to be zero
• Or constrained to be equal to other elements
• For univariate discrete convolution, this matrix is a Toeplitz matrix:
• Each row is a shifted version of the previous row
• In the 2-D case, it is a doubly block circulant matrix
• Either way, the matrix corresponds to convolution
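A small sketch (illustrative sizes) that builds this Toeplitz-structured matrix for a 1-D kernel and checks that multiplying by it reproduces the convolution:

```python
import numpy as np

def conv_matrix(w, n_in):
    """Build the (n_in - len(w) + 1) x n_in matrix whose rows are shifted
    copies of the flipped kernel, so that C @ x equals the valid convolution."""
    k = len(w)
    C = np.zeros((n_in - k + 1, n_in))
    for r in range(C.shape[0]):
        C[r, r:r + k] = w[::-1]          # each row is the previous row shifted right
    return C

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])
C = conv_matrix(w, len(x))

print(C)                                 # sparse, Toeplitz-structured matrix
print(C @ x)                             # convolution via matrix multiplication
print(np.convolve(x, w, mode="valid"))   # same values from np.convolve
```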
Topics in Convolutional Networks
• Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep Learning
Overview of Convolutional Networks
• Convolutional networks, also known as convolutional neural networks (CNNs or ConvNets), are a specialized kind of neural network
• Used for processing data that has a known grid-like topology
• Time-series data, which is a 1-D grid of samples taken at regular intervals
• Image data, which is a 2-D grid of pixels
• Utilize convolution as a specialized kind of linear
operation
• The convolution operation is used in place of a general
matrix multiplication in at least one layer
Motivation for Using Convolution Networks

1. Convolution leverages three important ideas to improve ML systems:
   1. Sparse interactions
   2. Parameter sharing
   3. Equivariant representations
2. Convolution also allows for working with inputs of variable size
Sparse Connectivity Due to Image Convolution
• The input image may have millions of pixels,
• but we can detect edges with kernels of only hundreds of pixels
• If we limit the number of connections for each output to k,
• we now need only k × n parameters and O(k × n) runtime
• It is possible to get good performance with k ≪ m
• Convolutional networks therefore have sparse interactions
• This is accomplished by making the kernel smaller than the input (the next slide shows a graphical depiction)
Neural Network for 1-D Convolution

• The kernel g(t) and the input f[t] are shown in the figure
• The figure gives the equations for the outputs of this network, etc. up to y8
• We can also write the equations in terms of the elements of a general 8 × 8 weight matrix W, whose entries are defined in the figure
http://colah.github.io/posts/2014-07-Understanding-Convolutions/
Sparse Connectivity, Viewed from Below

• Highlight one input x3 and the output units s that are affected by it
• Top: when s is formed by
convolution with a kernel of
width 3, only three outputs are
affected by x3
• Bottom: when s is formed by
matrix multiplication,
connectivity is no longer
sparse
• Then all outputs are affected
by x3
Sparse Connectivity, Viewed from Above
• Highlight one output s3 and inputs x that affect this unit
• These units are known as the receptive field of s3

• When s3 is formed by convolution with a kernel of width 3 (top)
• When s3 is formed by matrix multiplication (bottom)
Maintaining Performance with Reduced Connections
• How is it possible to obtain good performance while keeping the number of connections k several orders of magnitude less than the number of inputs m?
• In a deep neural network, units in deeper layers may indirectly interact with a larger portion of the input
• A sparsely connected network requires only k × n parameters and O(kn) runtime, where n is the number of outputs
• The receptive field in deeper layers is larger than the receptive field in shallow layers
• Reasoning: the deep network can efficiently describe complicated interactions between many variables from simple building blocks that each have only sparse interactions
Convolution with a Stride s

• The receptive field in deeper layers is larger than the receptive field in shallow layers
• This effect increases if the network includes architectural features like strided convolution or pooling
• The stride s denotes how many steps the kernel is moved at each application of the convolution filter
Parameter Sharing
• Parameter sharing refers to using the same parameter for
more than one function in a model
• In a traditional neural net each element of the weight
matrix is used exactly once when computing the output of
a layer
• It is multiplied by one element of the input and never revisited
• Parameter sharing is synonymous with tied weights
• Value of the weight applied to one input is tied to a weight applied
elsewhere
• In a convolutional net, each member of the kernel is used
in every position of the input (except at the boundary–
subject to design decisions)
Efficiency of Parameter Sharing

• Parameter sharing through the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set
• This does not affect the runtime of forward propagation, which is still O(k × n)
• But it further reduces the storage requirements to k parameters
• k is orders of magnitude less than the number of inputs m
• Since the m inputs and n outputs are usually roughly the same size, k is much smaller than m × n
How Parameter Sharing Works
• Black arrows: connections that use a particular parameter
1. Convolutional model: black arrows indicate uses of the central element of a 3-element kernel
2. Fully connected model: a single black arrow indicates the use of the central element of the weight matrix
• This model has no parameter sharing, so the parameter is used only once
• Thus sparse connectivity and parameter sharing can dramatically improve the efficiency of image edge detection, as shown on the next slide
Efficiency of Convolution for Edge Detection

• The image on the right is formed by taking each pixel of the input image and subtracting the value of its neighboring pixel on the left
• This is a measure of all the vertically oriented edges in the input image, which is useful for object detection
• Both images are 280 pixels tall; the input image is 320 pixels wide and the output image is 319 pixels wide
• The transformation can be described by a convolution kernel containing two elements and requires 319×280×3 = 267,960 flops (two multiplies and one add per output pixel)
• The same transformation described by matrix multiplication would require 320×280×319×280, i.e., over 8 billion, entries in the matrix
• Convolution is about 4 billion times more efficient at representing this transformation
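A minimal sketch of this edge detector: subtracting each pixel's left neighbor (equivalently, a two-element kernel); the random image is only there to show the shapes and the operation count.

```python
import numpy as np

H, W = 280, 320                            # image height and width from the slide
img = np.random.default_rng(0).random((H, W))

# Each output pixel is the input pixel minus its neighbor on the left:
edges = img[:, 1:] - img[:, :-1]           # shape (280, 319): vertical-edge strength
print(edges.shape)

# Described as a 2-element convolution kernel, the cost is about
# 319 * 280 * 3 operations (two multiplies and one add per output pixel):
print(319 * 280 * 3)                       # 267960
```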
Equivariance of Convolution to Translation
• This particular form of parameter sharing leads to
equivariance to translation
• Equivariant means that if the input changes, the output changes
in the same way
• A function f (x) is equivariant to a function g if f (g (x))=g (f (x))
• If g is a function that translates the input, i.e., that shifts it, then
the convolution function is equivariant to g
• I(x,y) is image brightness at point (x,y)
• I’=g(I) is image function with I’(x,y)=I(x-1,y), i.e., shifts every pixel of I
one unit to the right
• If we apply g to I and then apply convolution, the output will be the same as if we applied convolution to I, then applied the transformation g to the output
Example of Equivariance

• With 2-D images, convolution creates a map of where certain features appear in the input
• If we move the object in the input, its representation will move the same amount in the output
• This is useful for detecting edges in the first layer of a convolutional network
• The same edges appear everywhere in the image, so it is practical to share parameters across the entire image
Absence of Equivariance
• In some cases, we may not wish to share parameters
across an entire image
• If image is cropped to be centered on a face, we may want
different features from different parts of the face
• Part of the network processing the top of the face looks for
eyebrows
• Part of the network processing the bottom of the face looks for the
chin

• Convolution is not equivariant to certain other image operations, such as changes in scale or rotation
• Other mechanisms are needed to handle such transformations
Topics in Convolutional Networks
• Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep
Learning
Topics in Pooling

• What is Pooling?
• Three stages of CNNs
• Two terminologies: simple layers, complex layers
• Types of Pooling functions: Max, Average
• Translation invariance
• Rotation invariance
• Pooling with downsampling
• ConvNet architectures
• Shortcoming of pooling
What is Pooling?
• Pooling in a CNN is a subsampling step
• It replaces output at a location with a summary statistic of nearby outputs
• E.g., max pooling reports the maximum output within a rectangular neighborhood

https://deepai.org/machine-learning-glossary-and-terms/max-pooling
The Pooling Stage in a
CNN
• Typical layer of a CNN consists of
three stages
• Stage 1:
• perform several convolutions in
parallel to produce a set of linear
activations
• Stage 2 (Detector):
• each linear activation is run through
a nonlinear activation function such
as ReLU
• Stage 3 (Pooling):
• Use a pooling function to modify
output of the layer further

Pooling Layer in Keras
• MaxPooling 1D

• Arguments
• pool_size: Integer, size of the max pooling windows.
• strides: Integer, or None. Factor by which to downscale. E.g. 2 will halve
the input. If None, it will default to pool_size.
• padding: One of "valid" or "same" (case-insensitive).
• data_format: A string, one of channels_last (default) or channels_first.
The ordering of the dimensions in the inputs. channels_last
corresponds to inputs with shape (batch, steps, features) while
channels_first corresponds to inputs with shape (batch, features, steps)

• MaxPooling 2D
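A short sketch of these layers using TensorFlow's bundled Keras (assumed available); the tensor shapes are illustrative.

```python
import tensorflow as tf

# 1-D max pooling: pool_size=2 halves the 'steps' axis (strides defaults to pool_size).
x1d = tf.random.normal((4, 100, 8))                       # (batch, steps, features)
pool1d = tf.keras.layers.MaxPooling1D(pool_size=2, strides=None, padding="valid")
print(pool1d(x1d).shape)                                  # (4, 50, 8)

# 2-D max pooling over the spatial dimensions of an image batch.
x2d = tf.random.normal((4, 32, 32, 3))                    # (batch, height, width, channels)
pool2d = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), padding="valid")
print(pool2d(x2d).shape)                                  # (4, 16, 16, 3)
```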
Typical Subsampling in a Deep Network

• The input image is filtered by four 5×5 convolutional kernels, which create 4 feature maps
• The feature maps are subsampled by max pooling
• The next layer applies ten 5×5 convolutional kernels to these
subsampled images and again we pool the feature maps.
• The final layer is a fully connected layer where all generated
features are combined and used in the classifier
(essentially logistic regression).
Two Terminologies for a Typical CNN Layer

1. CNN viewed as a small number of complex layers, each layer having many stages
• 1-1 mapping between kernel tensors and network layers
2. CNN viewed as a larger number of simple layers
• Every processing step is a layer in its own right
• Not every layer has parameters
Effect of Pooling

• Pooling is performed for two reasons:
1. Dimensionality reduction
2. Invariance to transformations such as rotation and translation
Some Types of Linear Transformations
Types of Pooling functions
• A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs
• Popular pooling functions:
1. max pooling operation reports the maximum output within a
rectangular neighborhood
2. Average of a rectangular neighborhood
3. L2 norm of a rectangular neighborhood
4. Weighted average based on the distance from the central
pixel
Pooling Causes Translation Invariance
• In all cases, pooling helps make the representation become
approximately invariant to small translations of the input
• If we translate the input by a small amount, the values of most of the outputs do not change (example on the next slide)
• Pooling can be viewed as adding a strong prior that the function the
layer learns must be invariant to small translations
Max Pooling Introduces Invariance to Translation

• View of the middle of the output of a convolutional layer (figure labels: outputs of max pooling, outputs of the nonlinearity)
• The same network after the input has been shifted by one pixel
• Every input value has changed, but only half the values of the output have changed, because max pooling units are only sensitive to the maximum value in the neighborhood, not its exact location
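A small NumPy sketch of this effect, assuming a toy 1-D detector output pooled over a width-3 neighborhood with stride 1: shifting the input changes every detector value, but most of the pooled values stay the same.

```python
import numpy as np

def maxpool1d(x, width=3):
    """Max over each width-3 neighborhood, stride 1."""
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.3, 0.2, 1.0, 0.1])
shifted  = np.roll(detector, 1)            # shift every detector value by one (with wrap-around)

print(maxpool1d(detector))                 # [1.  1.  0.3 0.3 1.  1. ]
print(maxpool1d(shifted))                  # [1.  1.  1.  0.3 0.3 1. ] -- most entries unchanged
```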
Importance of Translation Invariance

• Invariance to translation is important if we care about whether a feature is present rather than exactly where it is
• Ex: for detecting a face we just need to know that an eye is present in a region, not its exact location
• In other contexts, it is more important to preserve the location of a feature
• Ex: to detect a corner we need to know whether two edges are present and test whether they meet
Learning other Invariances

• Pooling over spatial regions produces invariance to translation
• But if we pool over the outputs of separately parameterized convolutions, the features can learn which transformations to become invariant to
Learning Invariance to Rotation

• A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input
• In the figure, an input tilted left gets a large response from the unit tuned to left-tilted images, and an input tilted right gets a large response from the unit tuned to right-tilted images; the pooling unit responds in either case
Using Fewer Pooling Units than Detector Units

• Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units
• This is done by reporting summary statistics for pooling regions spaced k pixels apart rather than one pixel apart
• This improves computational efficiency because the next layer has roughly k times fewer inputs to process
• An example is given next
Pooling with Downsampling
• Max-pooling with a pool width of three and a stride between
pools of two

• This reduces the representation size by a factor of two,
• which reduces the computational burden on the next layer
• The rightmost pooling region has a smaller size, but must be included if we don't want to ignore some of the detector units (see the sketch below)
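A minimal sketch of the example above, assuming a 1-D detector output of six values: pooling with width 3 and stride 2 roughly halves the representation, and the last (smaller) region is still included.

```python
import numpy as np

def maxpool_downsample(x, width=3, stride=2):
    """Max pooling with stride; the rightmost, possibly smaller, window is
    included so that no detector unit is ignored."""
    out, i = [], 0
    while i < len(x):
        out.append(x[i:i + width].max())
        i += stride
    return np.array(out)

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.3, 0.2])
print(maxpool_downsample(detector))        # 6 detector units -> 3 pooled units
```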

• Downsampling makes a digital audio signal smaller by lowering its sampling rate or sample size (bits per sample). It decreases the bit rate when transmitting over a limited bandwidth or converting to a more limited audio format
Subsampling as Average Pooling
Theoretical Guidance on Pooling
• Which kind of pooling should one use in different situations
• Possible to dynamically pool features together
• Run a clustering algorithm on locations of interesting features
• Yields a different set of pooling regions for each image
• Another approach: learn a single pooling structure
• Pooling can complicate architectures that use top-down
information
• E.g., Boltzmann machines and autoencoders
Examples of Architectures for Classification with CNNs


• Left: a CNN that processes a fixed-size image; middle: a CNN that processes a variable-sized image; right: a CNN that does not have any fully-connected layer
• The architectures shown are illustrative and not designed for real use
• Real networks have branching structures
• Chain structures are shown for simplicity
A ConvNet architecture
INPUT: 32x32x3 holds the raw pixel values: an image of width 32, height 32, and 3 color channels (RGB)
CONV layer computes the output of neurons connected to local regions in the input, each computing a dot product between their weights and the small region of the input volume they are connected to. This may result in a volume such as 32x32x12 if we use 12 filters
POOL layer performs a down-sampling operation along the spatial dimensions (width, height), resulting in a volume such as 16x16x12

Activations of an example ConvNet architecture:
• The initial volume stores the raw image pixels (left) and the last volume stores the class scores (right)
• Each volume of activations along the processing path is shown as a column
• Since it is difficult to visualize 3-D volumes, each volume's slices are laid out in rows
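A hedged Keras sketch of the INPUT → CONV → POOL pattern described above (12 filters on a 32x32x3 input, then 2x2 max pooling, then class scores); the kernel size, activation, and number of classes are illustrative choices, not taken from the slide.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                  # INPUT: 32x32x3 RGB image
    layers.Conv2D(12, kernel_size=3, padding="same",
                  activation="relu"),                   # CONV: 12 filters -> 32x32x12
    layers.MaxPooling2D(pool_size=(2, 2)),              # POOL: downsample -> 16x16x12
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),             # class scores
])
model.summary()
```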
VGG (Visual Geometry Group) Net

• VGG is a convolutional neural network model


• K. Simonyan and A. Zisserman, University of Oxford
• “Very Deep Convolutional Networks for Large-Scale Image Recognition”
• The model achieves 93% top-5 test accuracy in ImageNet
• which is a dataset of over 14 million images belonging to 1000 classes.
VGG-16

Source: https://www.cs.toronto.edu/~frossard/post/vgg16/
VGG-16 pre-trained model for Keras

https://gist.github.com/baraldilorenzo/07d7802847aaad0a35d3
Pooling, Invariance, Equivariance
• Pooling is supposed to provide positional, orientational, proportional, or rotational invariance
• However, it is a very crude approach
• In reality it discards all sorts of positional information
• This leads to detecting the right-hand image in Fig. 1 as a correct ship
[Figures: 1. Disfiguration transformation; 2. Proportional transformation]
• Equivariance makes network understand the rotation or
proportion change and adapt itself accordingly so that the
spatial positioning inside an image is not lost
• A solution is capsule nets
Source: https://hackernoon.com/what-is-a-capsnet-or-capsule-network-2bfbe48769cc
Definition of Equivariance
• Generalizes the concept of invariance
• Invariance is a property which remains unchanged when
transformations of a certain type are applied to the objects
• Area and perimeter of a triangle are invariants
• Translating/rotating a triangle does not change its area or
perimeter
• Triangle centers such as the centroid, circumcenter, are not
invariant
• Because moving a triangle will cause its centers to move
• Instead, these centers are equivariant
• Applying any Euclidean congruence (a combination of a translation and rotation) to a triangle, and then constructing its center, produces the same point as constructing the center first and then applying the same congruence to the center: f(g(x)) = g(f(x))
• Area(O) = Area(t(O)), but Center(O) ≠ Center(t(O))
• where O = triangle, t = translation, g(x) = center(x), f(x) = translation(x)
Topics in Convolutional Networks
• Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep Learning
Topics in Infinitely Strong Prior

• Weak and strong priors
• Convolution as an infinitely strong prior
• Pooling as an infinitely strong prior
• Under-fitting with convolution and pooling
• Permutation invariance
Prior Parameter Distribution
• Role of a prior probability distribution over the
parameters of a model is to encode our
belief as to what models are reasonable
before seeing the data
Weak and Strong Priors
• A weak prior
• A distribution with high entropy
• e.g., Gaussian with high variance
• The data can move
parameters more or less freely
• A strong prior
• Very low entropy
• E.g., a Gaussian with low
variance
• Such a prior plays a more active
role in determining where the
parameters end up
Infinitely Strong Prior
• An infinitely strong prior places zero probability on some
parameters
• It says that some parameter values are completely forbidden, regardless of how much support the data gives them
• No amount of data can overturn the restrictions imposed by an infinitely strong prior
Convolutional Network

• Convolutional
networks are simply
neural networks that
use convolution in
place of general
matrix multiplication
in at least one of their
layers
Convolution as Infinitely Strong Prior
• Convolutional net is similar to a fully connected net but with an
infinitely strong prior over its weights
• It says that the weights for one hidden unit must be identical to the
weights of its neighbor, but shifted in space
• The prior also says that the weights must be zero except in the small, spatially contiguous receptive field assigned to that hidden unit

[Figure: convolution with a kernel of width 3; s3 is a hidden unit with 3 weights, which are the same as those of s4]

• Convolution introduces an infinitely strong prior probability distribution over the parameters of a layer
• This prior says that the function the layer should learn contains only local interactions and is equivariant to translation
Pooling as an Infinitely Strong Prior
• The use of pooling is an infinitely strong prior
such that each unit should be invariant to
small translations
• Maxpooling example:
Implementing as a Prior
• Implementing a convolutional net as a fully
connected net with an infinitely strong prior
would be computationally wasteful
• However, thinking of a convolutional net as a
fully connected net with an infinitely strong
prior can give us insights into how
convolutional nets work
Key Insight: Underfitting
• Convolution and pooling can cause under-fitting
• Under-fitting happens when the model has high bias
• Convolution and pooling are useful only when the assumptions made by the prior are reasonably accurate
• Pooling may be inappropriate in some cases
• If the task relies on preserving spatial information, using pooling on all features can increase training error
• Side note: high bias / under-fitting can be countered by: 1. adding hidden layers, 2. increasing hidden units per layer, 3. decreasing the regularization parameter λ, 4. adding features
When Pooling may be Inappropriate
• Some convolutional architectures are designed to use pooling on some channels but not on others
• This yields both highly invariant features and features that will not under-fit when the translation invariance prior is incorrect
• When a task involves incorporating information from very distant locations in the input, the prior imposed by convolution may be inappropriate
Comparing Model Performance with/without
Convolution
• Convolutional models have built in spatial relationships
• In benchmarks of statistical learning performance we
should only compare convolutional models to other
convolutional models – since they have
hard coded knowledge of spatial relationships
• Models without convolution will be able to learn even if
we permuted all pixels in the image
• Permutation invariance: f(x1, x2, x3) = f(x2, x1, x3) = f(x3, x1, x2)
• There are separate benchmarks for models that are
permutation invariant
Topics in Convolutional Networks
• Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep Learning
Topics in Variants of Convolution Functions
• Neural net convolution is not the same as mathematical convolution
• How convolution in neural networks is different
• Multichannel convolution due to image color and batches
• Convolution with a stride
• Locally connected layers (unshared convolution)
• Tiled convolution
• Implementation of a convolutional network
Neural Net Convolution is Different
• Convolution in the context of neural networks does not refer
exactly to the standard convolution operation in mathematics
• The functions used are slightly different
• Next describe the differences in detail and highlight their
useful properties
Convolution Operation in Neural Networks

• Refers to an operation that consists of many applications of convolution in parallel
• This is because convolution with a single kernel can only extract one kind of feature, albeit at many locations
• We usually want to extract many kinds of features at many locations
• The input is usually not just a grid of real values
• Rather, it is a grid of vector-valued observations
• E.g., a color image has R, G, B values at each pixel
• The input to the next layer is the output of the first layer, which has many different convolutions at each position
• When working with images, the input and output are 3-D tensors
Four Indices for Weights in Image Algorithms
1. One index for the channel
2. Two indices for the spatial coordinates within each channel
3. A fourth index for different samples in a batch
• We omit the batch axis for simplicity of discussion

w(i, j, k, l)
Multichannel Convolution
• Because we are dealing with multichannel convolution, the linear operations are not usually commutative, even if kernel flipping is used
• These multi-channel operations are only commutative if
each operation has the same number of output channels as
input channels.
Definition of 4-D kernel tensor
• Assume we have a 4-D kernel tensor K with
element
K i , j , k , l giving the connection strength between
• a unit in channel i of the output and
• a unit in channel j of the input,
• with an offset of k rows and l columns between output and input units
• Assume our input consists of observed data V with element
V i , j , k giving the value of the input unit
• within channel i at row j and column k.
• Assume our output consists of Z with the same format as V
• If Z is produced by convolving K across V without flipping K, then
Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} K_{i,l,m,n}
where the sum runs over all valid values of l, m, and n
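A direct NumPy translation of this definition (zero-based indices, keeping only the output positions where the kernel fits inside the input); the tensor shapes are illustrative.

```python
import numpy as np

def multichannel_conv(V, K):
    """Z[i, j, k] = sum over l, m, n of V[l, j + m, k + n] * K[i, l, m, n]
    (the formula above with zero-based indices, no kernel flipping)."""
    out_ch, in_ch, kh, kw = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_ch, H - kh + 1, W - kw + 1))
    for i in range(out_ch):                        # output channel
        for j in range(Z.shape[1]):                # output row
            for k in range(Z.shape[2]):            # output column
                Z[i, j, k] = np.sum(V[:, j:j + kh, k:k + kw] * K[i])
    return Z

V = np.random.default_rng(0).normal(size=(3, 8, 8))     # input: 3 channels, 8x8
K = np.random.default_rng(1).normal(size=(4, 3, 3, 3))  # kernel: 4 out ch, 3 in ch, 3x3
print(multichannel_conv(V, K).shape)                    # (4, 6, 6)
```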
Convolution with a Stride: Definition
• We may want to skip over some positions in the kernel to
reduce computational cost
• At the cost of not extracting fine features
• We can think of this as down-sampling the output of the full
convolution function
• If we want to sample only every s pixels in each direction of the output, we can define a down-sampled convolution function c such that
c(K, V, s)_{i,j,k} = Σ_{l,m,n} V_{l, (j−1)·s+m, (k−1)·s+n} K_{i,l,m,n}
• We call s the stride; it is possible to define a different stride for each direction
Strided convolution video
Convolution with Stride: Implementation
• Here we use a stride of 2
• Convolution with a stride of length two, implemented in a single operation
• Convolution with a stride greater than one pixel is mathematically equivalent to convolution with a unit stride followed by downsampling
• The two-step approach is computationally wasteful, because it computes many values that are then discarded (see the sketch below)

• Downsampling makes a digital audio signal smaller by lowering its sampling rate or sample size (bits per sample). It decreases the bit rate when transmitting over a limited bandwidth or converting to a more limited audio format
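A small NumPy sketch of this equivalence on a 1-D signal: convolution with stride 2 gives the same values as unit-stride convolution followed by discarding every other output, without computing the discarded values.

```python
import numpy as np

x = np.arange(10, dtype=float)              # toy 1-D input
w = np.array([1.0, -2.0, 1.0])              # toy kernel

full = np.convolve(x, w, mode="valid")      # unit-stride convolution
downsampled = full[::2]                     # then throw away every other output

# Strided version: compute only the outputs that are kept.
wf = w[::-1]                                # flip once so a dot product realizes convolution
strided = np.array([np.dot(x[t:t + len(w)], wf)
                    for t in range(0, len(x) - len(w) + 1, 2)])

print(np.allclose(downsampled, strided))    # True, but 'strided' did about half the work
```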
Padding to CNN Sides and Size
Add zero-valued units at the edges of each layer's input to reduce shrinking of the representation with depth, so as to maintain performance

No padding

Zero padding

DeepLizard
Effect of Zero-padding on Network Size
Convolutional net with a kernel of width 6 at every layer
No pooling, so only convolution shrinks network size
• Top: we do not use any implicit zero padding
• This causes the representation to shrink by five pixels at each layer
• Starting from an input of 16 pixels, we are only able to have 3 convolutional layers, and the last layer does not ever move the kernel, so arguably only two of the layers are truly convolutional
• Bottom: by adding five implicit zeros to each layer, we prevent the representation from shrinking with depth
• This allows us to make an arbitrarily deep convolutional network
Padding video
Locally Connected Layer
• In some cases, we do not actually want to use convolution, but
rather locally connected layers
• adjacency matrix in the graph of our MLP is the same, but every
connection has its own weight, specified by a 6-D tensor W.
• The indices into W are respectively:
• i, the output channel,
• j, the output row,
• k, the output column,
• l, the input channel,
• m, the row offset within the input, and
• n, the column offset within the input.
• The linear part of a locally connected layer is then given by
Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m−1, k+n−1} w_{i,j,k,l,m,n}
• This is also called unshared convolution


Local Connections, Convolution, vs full Connections
A locally connected layer with a patch size
of two pixels. Each edge is labeled with a
unique letter to show that each edge is
associated with its own weight parameter.
A convolutional layer with a kernel width
of two pixels. Has exactly the same
connectivity as the locally connected
layer. The difference lies not in which
units interact with each other, but in how
the parameters are shared. Locally
connected layer has no parameter
sharing. The convolutional layer uses the
same two weights repeatedly across the
entire input, as indicated by the repetition
of the letters labeling each edge.

A fully connected layer resembles a locally connected layer in the sense that each edge has its own parameter (there are too many to label explicitly with letters in this diagram). It does not, however, have the restricted connectivity of the locally connected layer.
Use of Locally Connected Layers
• Locally connected layers are useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space
• Ex: if we want to tell whether an image is a picture of a face, we only need to look for the mouth in the bottom half of the image
Constraining Outputs
• Constrain each output channel i to be a
function of only a subset of the input channels l
• Make the first m output channels connect to only
the first n input channels,
• The second m output channels connect to only the
second n input channels, etc
• Modeling interactions between few channels
allows fewer parameters to:
• Reduce memory, increase statistical efficiency,
reduce computation for forward/back-propagation.
• It accomplishes these goals without reducing
number of hidden units.
Network with Further Restricted Connectivity

A convolutional network with the first two output channels connected to only the first two input channels, and the second two output channels connected to only the second two input channels
Tiled Convolution
• Compromise between a convolutional layer and a
locally connected layer.
• Rather than learning a separate set of weights at every spatial location, we learn a set of kernels that we cycle through as we move through space
• This means that immediately neighboring locations will have different filters, as in a locally connected layer
• Now the memory requirements for storing the parameters
will increase only by a factor of the size of this set of kernels
rather than the size of the entire output feature map
Comparison of locally connected layers, tiled
convolution and standard convolution

• A locally connected layer has no sharing at all; each connection has its own weight
• Tiled convolution has a set of t different kernels (here t = 2) that are cycled through across positions
• Traditional convolution is equivalent to tiled convolution with t = 1: there is only one kernel and it is applied everywhere
Operations to Implement CNNs

• Besides convolution, other operations are necessary to implement a convolutional network
• To perform learning, need to compute gradient with respect to
the kernel, given the gradient with respect to the outputs.
• In some simple cases, this operation can be performed using
the convolution operation, but with a stride greater than 1,
there are issues.
Implementation of Convolution
• Convolution is a linear operation and can thus be
described as a matrix multiplication
• First reshape the input tensor into a flat vector
• Matrix is then a function of the convolution kernel
• Matrix is sparse and each element of the kernel is copied to several
elements of the matrix.
• This view helps us to derive some of the other
operations needed to implement a convolutional
network
Topics in Convolutional Networks
• Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep Learning
CNNs to Output a High Dimensional Object

• CNNs can be used to output a high-dimensional structured object, rather than just predicting a class label for a classification task or a real value for a regression task
• Typically this object is just a tensor, emitted by a standard convolutional layer
• For example, the model might emit a tensor S, where S_{i,j,k} is the probability that pixel (j, k) of the input to the network belongs to class i
• This allows the model to label every pixel in an image and draw precise masks that follow the outlines of individual objects
• The architecture must be changed to handle such large-dimensional outputs
Recurrent CNN for Pixel Labeling
Topics in Convolutional Networks
Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep
Learning
Data for CNNs

• The data used with a convolutional network usually consist of several channels, each channel being the observation of a different quantity at some point in space or time

• Data sizes and types: number vs. volume
• 1 billion tweets – about 130 GB (140 bytes per tweet)
• ImageNet – over 14 million labeled high-resolution images belonging to roughly 22,000 categories – about 150 GB

• Normalize the data to [0, 1]: deep learning likes normalized inputs
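A minimal sketch of this normalization, assuming 8-bit image data scaled to [0, 1]; the per-feature min-max scaling for generic inputs is an illustrative variant.

```python
import numpy as np

# 8-bit images: divide by the maximum possible pixel value.
images = np.random.default_rng(0).integers(0, 256, size=(4, 32, 32, 3))
images01 = images.astype(np.float32) / 255.0
print(images01.min(), images01.max())               # values now lie in [0, 1]

# Generic features: per-column min-max scaling to [0, 1].
X = np.random.default_rng(1).normal(size=(100, 5)) * [1, 10, 100, 1000, 0.1]
X01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X01.min(axis=0), X01.max(axis=0))
```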
Benefits of Normalization
• Normalization has always been an active area of research in deep learning. Normalization techniques can decrease a model's training time by a huge factor.

1. It normalizes each feature so that every feature maintains its contribution, since some features have higher numerical values than others. This way the network is not biased towards high-valued features.

2. It reduces internal covariate shift, the change in the distribution of network activations due to the change in network parameters during training. To improve training, we seek to reduce this shift.

3. In the referenced paper, the authors claim that batch norm makes the loss surface smoother (i.e., it bounds the magnitude of the gradients much more tightly).

4. It makes optimization faster because normalization doesn't allow weights to explode all over the place and restricts them to a certain range.

5. An unintended benefit of normalization is that it helps the network with regularization (only slightly, not significantly).
https://medium.com/techspace-usict/normalization-techniques-in-deep-neural-networks-9121bf100d8
ImageNet
https://www.image-net.org

A synset is WordNet's terminology for a synonym ring. WordNet is a database of English words grouped into sets of synonyms. WordNet's synsets are often useful in information retrieval and natural language processing tasks to discover when two different words can mean similar things.

https://machinelearning.wtf/terms/synset/
Example of different
data types for CNNs
When Not to Use Convolution
• The use of convolution for processing variably sized inputs makes sense only for inputs that have variable size because they contain varying amounts of observation of the same kind of thing: different lengths of recordings over time, different widths of observations over space, etc.

• Convolution does not make sense if the input has variable size because it can optionally include different kinds of observations.

• Example: if we are processing college applications, and our features consist of both grades and standardized test scores, but not every applicant took the standardized test, then it does not make sense to convolve the same weights over the features corresponding to the grades as well as the features corresponding to the test scores.
Topics in Convolutional Networks
Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep
Learning
Efficient Convolution
• When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. When the kernel is separable, naive convolution is inefficient.

• Convolving with a separable kernel is equivalent to composing d one-dimensional convolutions, one with each of these vectors. The composed approach is significantly faster than performing one d-dimensional convolution with their outer product.

• The kernel also takes fewer parameters to represent as vectors. If the kernel is w elements wide in each dimension, then naive multidimensional convolution requires O(w^d) runtime and parameter storage space, while separable convolution requires O(w × d) runtime and parameter storage space. Not every convolution can be represented in this way.
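A small NumPy/SciPy sketch (assuming scipy is available): a 2-D kernel built as the outer product of two vectors gives the same output whether applied as one 2-D convolution or as two composed 1-D convolutions, and the composed form does far less work per output.

```python
import numpy as np
from scipy.signal import convolve2d, convolve

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32))
u = rng.normal(size=5)
v = rng.normal(size=5)
K = np.outer(u, v)                        # separable 5x5 kernel: stored as 10 numbers, not 25

direct = convolve2d(img, K, mode="valid")                                     # one 2-D convolution

rows = np.apply_along_axis(lambda r: convolve(r, v, mode="valid"), 1, img)    # 1-D conv along rows
both = np.apply_along_axis(lambda c: convolve(c, u, mode="valid"), 0, rows)   # then along columns

print(np.allclose(direct, both))          # True: composed 1-D convolutions match the 2-D result
```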
Topics in Convolutional Networks
Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep
Learning
Reduced Cost of Training
• Typically, the most expensive part of convolutional network training is learning the features

• There are three basic strategies for obtaining convolution kernels without supervised training:
1. Initialize them randomly
2. Design them by hand, for example, by setting each kernel to detect edges at a certain orientation or scale
3. Learn the kernels with an unsupervised criterion
Topics in Convolutional Networks
Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep
Learning
A bit of history: Hubel & Wiesel

1959: "Receptive fields of single neurones in the cat's striate cortex"

1962: "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex"

1968...
Cat image by CNX OpenStax is licensed
under CC BY 4.0; changes made

A bit of history: the human brain

• Topographical mapping in the cortex: nearby cells in cortex represent nearby regions in the visual field
[Figure: visual cortex]

Retinotopy images courtesy of Jesse Gomez in the Stanford Vision & Perception Neuroscience Lab.

Hierarchical organization

Illustration of hierarchical organization in early visual pathways by Lane McIntosh, copyright CS231n 2017

A bit of history: Neocognitron [Fukushima 1980]

• "Sandwich" architecture (SCSCSC…)
• Simple cells: modifiable parameters
• Complex cells: perform pooling

Topics in Convolutional Networks
Overview
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep
Learning
A bit of history...
• The Mark I Perceptron machine was the first implementation of the perceptron algorithm.

• The machine was connected to a camera that used 20×20 cadmium sulfide photocells to produce a 400-pixel image.

• It recognized letters of the alphabet.

• Update rule: w_i ← w_i + η (y − ŷ) x_i

Frank Rosenblatt, ~1957: Perceptron


This image by Rocky Acosta is licensed under CC-BY 3.0

A bit of history...

Widrow and Hoff, ~1960: Adaline/Madaline

These figures are reproduced from Widrow 1960, Stanford Electronics Laboratories Technical Report, with permission from Stanford University Special Collections.

A bit of history...

recognizable math

Illustration of Rumelhart et al., 1986 by Lane McIntosh, copyright CS231n 2017

Rumelhart et al., 1986: the first time back-propagation became popular


A bit of history: "Gradient-based learning applied to document recognition" [LeCun, Bottou, Bengio, Haffner 1998]

LeNet-5

A bit of history...

[Hinton and Salakhutdinov 2006]

Reinvigorated research in
Deep Learning

Illustration of Hinton and Salakhutdinov 2006 by Lane McIntosh, copyright CS231n 2017

A bit of history: "ImageNet Classification with Deep Convolutional Neural Networks" [Krizhevsky, Sutskever, Hinton, 2012]

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

“AlexNet”

First strong results
Acoustic Modeling using Deep Belief Networks
Abdel-rahman Mohamed, George Dahl, Geoffrey Hinton, 2010
Context-Dependent Pre-trained Deep Neural Networks
for Large Vocabulary Speech Recognition
George Dahl, Dong Yu, Li Deng, Alex Acero, 2012

"Imagenet classification with deep convolutional neural networks" by Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012

Illustration of Dahl et al. 2012 by Lane McIntosh, copyright CS231n 2017

Figures copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.

Best deep CNN architectures and their
principles: from AlexNet to EfficientNet

Nikolas Adaloglou (1/21/21)

https://theaisummer.com/cnn-architectures/
Progress of CNN Architectures on ImageNet

https://paperswithcode.com/sota/image-classification-on-imagenet
Accuracy vs G-FLOPs

• Number of parameters does not mean greater accuracy!
[Figure: accuracy vs. G-FLOPs and number of parameters; Bianco '18, arXiv]


Focus on Scaling
Types of Scaling
Individual scaling:

• With more layers (depth) one can capture richer and more complex features, but such models are hard to train (due to vanishing gradients)

• Wider networks are much easier to train. They tend to be able to capture more fine-grained features but saturate quickly.

• By training with higher resolution images, convnets are in theory able to capture more fine-grained details. Again, the accuracy gain diminishes for quite high resolutions.

Compound scaling:

• Scale up network depth (more layers), width (more channels per layer), and resolution (input image size) simultaneously
Compound Scaling – EfficientNet-B0
Summary of ConvNet Results

Big nets can win


Topics in Convolutional Networks
1. The Convolution Operation
2. Motivation
3. Pooling
4. Convolution and Pooling as an Infinitely Strong Prior
5. Variants of the Basic Convolution Function
6. Structured Outputs
7. Data Types
8. Efficient Convolution Algorithms
9. Random or Unsupervised Features
10. The Neuroscientific Basis for Convolutional Networks
11. Convolutional Networks and the History of Deep Learning
Why are CNNs so Good?

• Maybe they're not
• Classification problems are overrated
• They are idiot savant problems (you just need a good AI idiot savant)
• We do not have autonomous driving with only a CNN
• Most serious AI problems are stateful; CNNs are not
• So are many stateful problems in physics, chemistry, biology, economics and engineering
• But many stateful problems have stateless components
• What is the difference and how do we know what to use?
• Open question(s)!!!
• How do we mix stateful and stateless problems in a meaningful way?
• We see this already with CNNs combined with RNNs
Popular CNNs

LeNet-5

AlexNet

VGG-16

Inception-v1

ResNet-50
