
DSE 3151 DEEP LEARNING

Convolutional Neural Networks

Dr. Rohini Rao & Dr. Abhilash K Pai


Dept. of Data Science and Computer Applications
MIT Manipal
The Convolution Operation - 1D
▪ Convolution is a linear operation on two functions of a real-valued argument: one function (the filter) is slid over the other, and at each position the overlapping values are multiplied elementwise and summed.

▪ Example: consider a discrete signal $x_t$ which represents the position of a spaceship at time $t$, recorded by a laser sensor.

▪ Now, suppose that this sensor is noisy.

▪ To obtain a less noisy measurement, we would like to average several measurements.

▪ Considering that the most recent measurements are more important, we take a weighted average over $x_t$. The new estimate at time $t$ is computed as the convolution:

$s_t = \sum_{a=0}^{\infty} x_{t-a} \, w_{-a} = (x \ast w)_t$

where $x$ is the input and $w$ is the filter (also called the mask or kernel).
The Convolution Operation - 1D

▪ In practice, we would sum only over a small window. For example:

$s_t = \sum_{a=0}^{6} x_{t-a} \, w_{-a}$

▪ We just slide the filter over the input and compute the value of $s_t$ based on a window around $x_t$. For the weights $w_{-6}, \ldots, w_0$ below, the first three window positions give $s = 1.80, 1.96, 2.11$:

w    0.01  0.01  0.02  0.02  0.04  0.40  0.50
x    1.00  1.10  1.20  1.40  1.70  1.80  1.90  2.10  2.20
s    1.80  1.96  2.11

▪ Use cases of 1-D convolution: audio signal processing, stock market analysis, time series analysis, etc.

Content adapted from: CS7015 Deep Learning, Dept. of CSE, IIT Madras
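A minimal NumPy sketch of this windowed 1-D convolution (the arrays are taken from the example above; the loop-based implementation is purely illustrative):

```python
import numpy as np

# Filter weights w_{-6} ... w_0 and the noisy signal from the example.
w = np.array([0.01, 0.01, 0.02, 0.02, 0.04, 0.40, 0.50])
x = np.array([1.00, 1.10, 1.20, 1.40, 1.70, 1.80, 1.90, 2.10, 2.20])

# s_t = sum_{a=0}^{6} x_{t-a} w_{-a}: slide a 7-wide window over x and take
# the dot product with the weights (the newest sample gets weight 0.5).
s = np.array([np.dot(x[t:t + len(w)], w) for t in range(len(x) - len(w) + 1)])
print(np.round(s, 2))  # [1.81 1.97 2.11] -- the slide's values, up to rounding
```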
Convolution in 2-D using Images : What is an Image?

What we see

What a computer sees


Convolution in 2-D using Images : What is an Image?

▪ An image can be represented mathematically as a function f(x,y) which gives the intensity value at position (x,y), where f(x,y) ∈ {0, 1, …, Imax − 1} and x, y ∈ {0, 1, …, N − 1}.

▪ The larger the value of N, the greater the clarity of the picture (higher resolution), but also the more data to be analyzed in the image.

▪ If the image is gray-scale (8 bits per pixel), it requires N² bytes of storage.

▪ If the image is color (RGB), each pixel requires 3 bytes of storage space.

Here N is the resolution of the image and Imax is the number of discretized brightness levels.
Convolution in 2-D using Images : What is an Image?

Digital camera

[Source: D. Hoiem]
Convolution in 2-D using Images : What is an Image?

▪ Sample the 2-D space on a regular grid.


▪ Quantize each sample, i.e., the photons arriving at each active cell are
integrated and then digitized.

[Source: D. Hoiem]

Convolution in 2-D using Images : What is an Image?

▪ A grid (matrix) of intensity values.

255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 0 0 255 255 255 255 255 255 255
255 255 255 75 75 75 255 255 255 255 255 255
255 255 75 95 95 75 255 255 255 255 255 255
255 255 96 127 145 175 255 255 255 255 255 255
255 255 127 145 175 175 175 255 95 255 255 255
255 255 127 145 200 200 175 175 95 255 255 255
255 255 127 145 145 175 127 127 95 47 255 255
255 255 127 145 145 175 127 127 95 47 255 255
255 255 74 127 127 127 95 95 95 47 255 255
255 255 255 74 74 74 74 74 74 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255

The Convolution Operation - 2D

▪ Images are good examples of 2-D inputs.

▪ A 2-D convolution of an image $I$ using a filter $K$ of size $m \times n$ is defined as (looking at preceding pixels):

$S_{ij} = (I \ast K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i-a,\,j-b} \, K_{a,b}$

▪ In practice, one way is to look at the succeeding pixels instead:

$S_{ij} = (I \ast K)_{ij} = \sum_{a=0}^{m-1} \sum_{b=0}^{n-1} I_{i+a,\,j+b} \, K_{a,b}$
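A minimal sketch of the second ("succeeding pixels", i.e. cross-correlation) form in NumPy; deep learning frameworks implement this far more efficiently, so the nested loops here are only for illustration:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution in the 'succeeding pixels' form:
    S[i, j] = sum_{a,b} I[i+a, j+b] * K[a, b]."""
    m, n = kernel.shape
    H, W = image.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Elementwise product of the window with the kernel, then sum.
            out[i, j] = np.sum(image[i:i + m, j:j + n] * kernel)
    return out
```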
The Convolution Operation - 2D

▪ Another way is to consider the center pixel as the reference pixel, and then look at its surrounding pixels:

$S_{ij} = (I \ast K)_{ij} = \sum_{a=-\lfloor m/2 \rfloor}^{\lfloor m/2 \rfloor} \; \sum_{b=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} I_{i-a,\,j-b} \, K_{\lfloor m/2 \rfloor + a,\, \lfloor n/2 \rfloor + b}$

[Figure: a 5x5 binary image with the pixel of interest at the center of the filter window]

0 1 0 0 1
0 0 1 1 0
1 0 0 0 1
0 1 0 0 1
0 0 1 0 1

Content adapted from: CS7015 Deep Learning, Dept. of CSE, IIT Madras
The Convolution Operation - 2D

[Animation: a filter sliding step by step over the input image to produce the convolved feature map. Source: https://developers.google.com/]
The Convolution Operation - 2D

Smoothing Filter
The Convolution Operation - 2D

Sharpening Filter

The Convolution Operation - 2D

Filter for edge detection
The Convolution Operation – 2D : Various filters (edge detection)

Prewitt:
Sx =          Sy =
-1 0 1         1  1  1
-1 0 1         0  0  0
-1 0 1        -1 -1 -1

Sobel:
Sx =          Sy =
-1 0 1         1  2  1
-2 0 2         0  0  0
-1 0 1        -1 -2 -1

Laplacian:    Roberts:
0  1 0        Sx =      Sy =
1 -4 1         0 1       1  0
0  1 0        -1 0       0 -1

[Figure: the input image alongside the results of applying the horizontal and vertical edge detection filters]
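As a sketch, the Sobel pair above can be applied with the `conv2d` helper defined earlier (the toy image and the gradient-magnitude step are illustrative assumptions):

```python
import numpy as np

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])    # responds to horizontal intensity changes
sobel_y = np.array([[ 1,  2,  1],
                    [ 0,  0,  0],
                    [-1, -2, -1]])  # responds to vertical intensity changes

# Toy image: dark left half, bright right half -> one vertical edge.
img = np.zeros((5, 5))
img[:, 2:] = 255.0

gx = conv2d(img, sobel_x)           # large values along the vertical edge
gy = conv2d(img, sobel_y)           # ~0 here: there are no horizontal edges
magnitude = np.sqrt(gx**2 + gy**2)  # combined edge strength
```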
The Convolution Operation - 2D

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

stride = 1

Input image (6 x 6):
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Taking the dot product of the filter with the first two 3x3 windows gives 3 and -1.

Note: the stride is the number of units the kernel is shifted per slide over the rows/columns.
The Convolution Operation - 2D

With the same Filter 1 and input image, if stride = 2, the window jumps two columns at a time, and the first two dot products along the top row are 3 and -3.

Note: the stride is the number of units the kernel is shifted per slide over the rows/columns.
The Convolution Operation - 2D

Convolving the 6 x 6 input image with Filter 1 at stride 1 yields a 4 x 4 feature map:

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
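A sketch that reproduces these numbers, extending the earlier `conv2d` with a stride parameter:

```python
import numpy as np

def conv2d_strided(image, kernel, stride=1):
    """Like conv2d above, but the window jumps `stride` units per step."""
    m, n = kernel.shape
    out_h = (image.shape[0] - m) // stride + 1
    out_w = (image.shape[1] - n) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(image[r:r + m, c:c + n] * kernel)
    return out

img = np.array([[1,0,0,0,0,1],
                [0,1,0,0,1,0],
                [0,0,1,1,0,0],
                [1,0,0,0,1,0],
                [0,1,0,0,1,0],
                [0,0,1,0,1,0]])
filter1 = np.array([[ 1,-1,-1],
                    [-1, 1,-1],
                    [-1,-1, 1]])

print(conv2d_strided(img, filter1, stride=1))  # the 4x4 map above
print(conv2d_strided(img, filter1, stride=2))  # 2x2 map; top row [3, -3]
```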
The Convolution Operation - 2D

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1

stride = 1

Repeat for each filter! Convolving the same 6 x 6 input image with Filter 2 gives a second 4 x 4 feature map:

-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

The two 4 x 4 feature maps are stacked, forming a 4 x 4 x 2 matrix.
The Convolution Operation – RGB Images

Apply the filter to the R, G, and B channels of the image and combine (sum) the resulting feature maps to obtain a single 2-D feature map.

Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science
The Convolution Operation – RGB Images: multiple filters

[Figure: K different 3x3 filters (Filter 1, Filter 2, …, Filter K) applied to the same input image]

K filters = K feature maps

Depth of feature map = no. of feature maps = no. of filters
The Convolution Operation : Terminologies

[Figure: the 6 x 6 input image and a 3 x 3 filter from the earlier example]

1. Depth of an input image = no. of channels in the input image = depth of a filter

2. Assuming square filters, the spatial extent (F) of a filter is the size of the filter
The Convolution Operation : Zero Padding

A 3x3 convolution of a 4x4 input produces only a 2x2 feature map. Pad the input with zeros and then convolve to obtain a feature map whose dimension equals the input image dimension.
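A quick sketch of the effect with NumPy's `np.pad` and the earlier `conv2d` helper (the averaging filter is an illustrative choice):

```python
import numpy as np

x = np.arange(16, dtype=float).reshape(4, 4)  # a 4x4 input
k = np.ones((3, 3)) / 9.0                     # a 3x3 averaging filter

print(conv2d(x, k).shape)              # (2, 2): 'valid' convolution shrinks the map
x_pad = np.pad(x, 1, mode="constant")  # add a one-pixel border of zeros (P = 1)
print(conv2d(x_pad, k).shape)          # (4, 4): same size as the input
```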
The Convolution Operation : Zero Padding

Input image size: 5x5
Feature map size: 5x5

Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science
Convolutional Neural Network (CNN) : At a glance

Input → Convolution → Pooling → … (can repeat many times) … → Flatten → Fully connected feedforward network → cat | dog

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Pooling

Filter 1 and Filter 2 from earlier produce the two 4 x 4 feature maps below; pooling then downsamples each map:

 3 -1 -3 -1        -1 -1 -1 -1
-3  1  0 -3        -1 -1 -2  1
-3 -3  0  1        -1 -1 -2  1
 3 -2 -2 -1        -1  0 -4  3

• Max Pooling
• Average Pooling

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Pooling

Max pooling keeps the maximum value in each window; average pooling keeps the mean. Like convolution, pooling has a stride, which determines how far the window moves per step.
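A minimal sketch of both pooling variants on the first feature map above (the 2x2 window and stride of 2 are the usual defaults, assumed here):

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Slide a size x size window with the given stride and reduce it."""
    reduce_fn = np.max if mode == "max" else np.mean
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reduce_fn(fmap[r:r + size, c:c + size])
    return out

fmap = np.array([[ 3, -1, -3, -1],
                 [-3,  1,  0, -3],
                 [-3, -3,  0,  1],
                 [ 3, -2, -2, -1]])
print(pool2d(fmap, mode="max"))  # [[3. 0.] [3. 1.]]
print(pool2d(fmap, mode="avg"))  # [[ 0.   -1.75] [-1.25 -0.5 ]]
```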
Why Pooling ?

▪ Subsampling pixels will not change the object: a subsampled bird is still a bird.

▪ We can subsample the pixels to make the image smaller.

▪ Therefore, fewer parameters are needed to characterize the image.

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Relation between i/p size, feature map size, filter size

Input image: W1 x H1 x D1
Filter: F x F x D1 (spatial extent F, depth equal to the input depth)
No. of filters = K, stride length = S, padding = P

Output feature map: W2 x H2 x D2, where

$W_2 = \dfrac{W_1 - F + 2P}{S} + 1$,  $H_2 = \dfrac{H_1 - F + 2P}{S} + 1$,  $D_2 = K$
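A small helper that evaluates these formulas (the AlexNet numbers in the second call are a standard sanity check, not taken from this slide):

```python
def conv_output_shape(W1, H1, F, K, S, P):
    """Feature map dimensions W2 x H2 x D2 from the formulas above."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K   # depth of the output = number of filters

# 6x6 input, 3x3 filter, stride 1, no padding -> the 4x4 map seen earlier:
print(conv_output_shape(6, 6, F=3, K=1, S=1, P=0))        # (4, 4, 1)
# AlexNet conv1: 227x227 input, 11x11 filters, 96 of them, stride 4:
print(conv_output_shape(227, 227, F=11, K=96, S=4, P=0))  # (55, 55, 96)
```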
Important properties of CNN

▪ Sparse Connectivity

▪ Shared weights

▪ Equivariant representation

Properties of CNN

[Figure: the 6 x 6 image is flattened into inputs 1–36; one output neuron computed with Filter 1 connects only to the 9 inputs in its 3 x 3 window, producing the value 3; the next window produces -1]

Fewer parameters! Each output connects to only 9 inputs, not to all 36: this is Sparse Connectivity.

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Properties of CNN

Is sparse connectivity good?

Ian Goodfellow et al., 2016
Properties of CNN

[Figure: the same network, but now the 9 filter weights are reused at every window position]

Even fewer parameters! The same weights are used at every window position: these are Shared Weights.

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Equivariance to translation

▪ A function f is equivariant to a function g if f(g(x)) = g(f(x)), i.e., if the output changes in the same way as the input.

▪ In a CNN this is achieved by the concept of weight sharing.

▪ Since the same weights are shared across the whole image, an object produces the same filter responses wherever it occurs, so it will be detected irrespective of its position in the image.

Source: Translational Invariance Vs Translational Equivariance | by Divyanshu Mishra | Towards Data Science
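A small numerical check of f(g(x)) = g(f(x)) using the earlier `conv2d` helper, where g is a 2-pixel shift to the right (the blob image and shift amount are illustrative; the identity holds exactly as long as nothing is shifted out of the frame):

```python
import numpy as np

img = np.zeros((8, 8))
img[2:5, 2:5] = 1.0                 # a small square "object"
filt = np.ones((3, 3))              # any filter works

def shift_right(a, k=2):
    """Translate an array k pixels to the right, filling with zeros."""
    out = np.zeros_like(a)
    out[:, k:] = a[:, :-k]
    return out

lhs = conv2d(shift_right(img), filt)   # f(g(x)): convolve the shifted image
rhs = shift_right(conv2d(img, filt))   # g(f(x)): shift the convolved image
print(np.allclose(lhs, rhs))           # True: the response moves with the object
```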
CNN vs Fully Connected NN

▪ A CNN compresses the fully connected NN in two ways:

▪ Reducing the number of connections

▪ Shared weights

▪ Max pooling further reduces the parameters to characterize an image.

Convolutional Neural Network (CNN) : Non-linearity with activation

Input → Convolution + ReLU → Pooling → … (can repeat many times) … → Flatten → Fully connected feedforward network → cat | dog

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
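This pipeline is only a few lines in a modern framework. A minimal tf.keras sketch (the input size, filter counts, and dense widths are illustrative assumptions, not taken from the slide):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),          # RGB input image
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # convolution + ReLU
    tf.keras.layers.MaxPooling2D(2),                   # pooling
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # repeat
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),                         # flatten
    tf.keras.layers.Dense(64, activation="relu"),      # fully connected
    tf.keras.layers.Dense(2, activation="softmax"),    # cat | dog
])
model.summary()
```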
LeNet-5 Architecture for handwritten text recognition

Layer-wise hyperparameters and parameter counts:

C1 (conv, tanh): S=1, F=5, K=6, P=2 → #Param. = ((5*5*1)+1) * 6 = 156
S2 (pool): S=2, F=2, K=6, P=0 → #Param. = 0
C3 (conv, tanh): S=1, F=5, K=16, P=0 → #Param. = ((5*5*6)+1) * 16 = 2416
S4 (pool): S=2, F=2, K=16, P=0 → #Param. = 0
C5 (fully connected): #Param. = (5*5*16)*120 + 120 = 48120
F6 (fully connected): #Param. = 84*120 + 84 = 10164
Output (sigmoid): #Param. = 84*10 + 10 = 850

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
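These counts follow directly from "(filter volume + 1 bias) x number of filters" for convolution layers and "inputs x outputs + biases" for fully connected layers, as a quick sketch verifies:

```python
def conv_params(F, D_in, K):
    """((F*F*D_in) + 1) * K: weights plus one bias per filter."""
    return ((F * F * D_in) + 1) * K

def dense_params(n_in, n_out):
    """Fully connected layer: one weight per input-output pair, plus biases."""
    return n_in * n_out + n_out

print(conv_params(5, 1, 6))           # C1: 156
print(conv_params(5, 6, 16))          # C3: 2416
print(dense_params(5 * 5 * 16, 120))  # C5: 48120
print(dense_params(120, 84))          # F6: 10164
print(dense_params(84, 10))           # output: 850
```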
LeNet-5 Architecture for handwritten number recognition

Source: http://yann.lecun.com/

ImageNet Dataset

More than 14 million images, 22,000 image categories.

Deng, Jia, et al. "ImageNet: A large-scale hierarchical image database." IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2009.
ImageNet Large Scale Visual Recognition Challenge

• 1000 ImageNet categories

[Chart: classification error of the winning entries by year; ZFNet marked]
AlexNet (2012)

▪ Used the ReLU activation function instead of sigmoid and tanh.

▪ Used data augmentation techniques that consisted of image translations, horizontal reflections, and patch extractions.

▪ Implemented dropout layers.
AlexNet Architecture

Parameter counts of the convolution layers (the pooling layers have 0 parameters):

Conv1: ((11*11*3)+1) * 96 = 34944
Conv2: ((5*5*96)+1) * 256 = 614656
Conv3: ((3*3*256)+1) * 384 = 885120
Conv4: ((3*3*384)+1) * 384 = 1327488
Conv5: ((3*3*384)+1) * 256 = 884992

Total #Param. (including the fully connected layers): ~62M

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.
ZFNet Architecture (2013)

• Used filters of size 7x7 instead of 11x11 in AlexNet.

• Used a Deconvnet to visualize the intermediate results.

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833). Springer, Cham.
ZFNet

Visualizing and Understanding Deep Neural Networks by Matt Zeiler - YouTube

VGGNet Architecture (2014)

Image Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

• Used filters of size 3x3 in all the convolution layers.

• 3 conv layers back-to-back have an effective receptive field of 7x7 (see the sketch below).

• Also called VGG-16, as it has 16 layers.

• This work reinforced the notion that convolutional neural networks need a deep stack of layers for the hierarchical representation of visual data to work.

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations (ICLR14).
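A quick sketch of the receptive-field claim: with stride 1, each additional FxF layer grows the receptive field by F − 1 pixels:

```python
def receptive_field(n_layers, F=3):
    """Receptive field of n stacked FxF convolutions with stride 1."""
    r = 1                    # one output pixel sees itself to start with
    for _ in range(n_layers):
        r += F - 1           # each layer adds (F - 1) pixels of context
    return r

print(receptive_field(1))    # 3
print(receptive_field(2))    # 5
print(receptive_field(3))    # 7 -> three 3x3 layers see a 7x7 region
```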
GoogleNet Architecture (2014)

• Most of the architectures discussed till now apply one of the following after each convolution operation:
  • Max pooling
  • 3x3 convolution
  • 5x5 convolution

• Idea: why can't we apply them all together at the same time and concatenate the feature maps?

• Problem: this would result in a large number of computations.

• Specifically, each element of the output requires O(F x F x D) computations.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).
GoogleNet Architecture (2014)

• Solution: apply 1x1 convolutions.

• A 1x1 convolution aggregates along the depth.

• So, if we apply D1 1x1 convolutions (D1 < D), we will get an output of size W x H x D1.

• The number of computations per output element then reduces to O(F x F x D1).

• We can then apply the subsequent 3x3 and 5x5 filters on this reduced output.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).
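The saving is easy to quantify with a back-of-the-envelope count of multiply-accumulate operations (all sizes below are illustrative assumptions, not GoogleNet's actual layer dimensions):

```python
# Producing K feature maps of size W x H from an input of depth D:
W, H, D, D1, F, K = 28, 28, 256, 64, 5, 128

direct  = W * H * K * (F * F * D)       # 5x5 convolution on the full depth
reduced = (W * H * D1 * D               # 1x1 convolution: depth D -> D1
           + W * H * K * (F * F * D1))  # then 5x5 on the thin volume

print(direct, reduced, round(direct / reduced, 1))  # ~3.7x fewer operations
```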
GoogleNet Architecture (2014)

• Also, we might want to use different dimensionality reductions (1x1 convolutions with different numbers of filters) before the 3x3 and 5x5 filters.

• We can also add a max-pooling layer followed by a 1x1 convolution.

• After this, we concatenate all these layers. This is called the Inception module.

• GoogleNet contains many such Inception modules.

[Figure: The Inception module]

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).
GoogleNet Architecture (2014)

[Figure: the full GoogleNet, ending in global average pooling]

• 12 times fewer parameters and 2 times more computations than AlexNet.

• Used global average pooling instead of flattening.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'15).
ResNet Architecture (2015)

[Figure: effect of increasing the layers of a shallow CNN on the CIFAR dataset — the "shallow CNN + additional layers" network ends up with higher error than the shallow CNN alone]

Source: Residual Networks (ResNet) - Deep Learning - GeeksforGeeks

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
ResNet Architecture (2015)

[Figure: the ResNet-34 architecture with skip (residual) connections]

Source: Residual Networks (ResNet) - Deep Learning - GeeksforGeeks

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
ResNet Architecture (2015)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
