03 Convolutional Neural Networks
Convolutional layers

[Figure: inputs x1–x5 feed hidden units h111, h112, h113 through local connections with shared weights g(.); a further layer f(.) produces outputs y1 and y2. The input index cannot be permuted: spatial order matters.]

Idea: (1) Features are local, (2) Their presence/absence is stationary, (3) GPU implementation for inexpensive super-computing

Concept by Yann LeCun. Examples: LeNet, AlexNet.
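A minimal NumPy sketch of both ideas (my own illustration, not from the slides): the same small weight vector g is applied at every position, so a local feature is detected wherever it occurs.

```python
import numpy as np

def conv1d_layer(x, g, bias=0.0):
    """Slide the shared kernel g over x (valid mode): h[i] = g . x[i:i+len(g)] + bias."""
    k = len(g)
    return np.array([np.dot(g, x[i:i + k]) + bias for i in range(len(x) - k + 1)])

x = np.array([0., 0., 1., 1., 0., 0., 1., 1., 0.])  # the same local pattern appears twice
g = np.array([1., 1.])                              # one shared local feature detector
print(conv1d_layer(x, g))  # responds equally wherever the pattern occurs (stationarity)
```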
Receptive fields of neurons
• Levine and Shefner (1991) define a receptive field as an "area in which stimulation leads to response of a particular sensory neuron" (p. 671).
Source: http://psych.hanover.edu/Krantz/receptive/
The concept of the best stimulus
• Depending on excitatory and inhibitory connections, there is an optimal stimulus that falls only in the excitatory region
• An on-center retinal ganglion cell example is shown here
Source: http://psych.hanover.edu/Krantz/receptive/
On-center vs. off-center
Source: https://en.wikipedia.org/wiki/Receptive_field
Bar detection example
Source: http://psych.hanover.edu/Krantz/receptive/
Gabor filters model simple cells in the visual cortex
Source: https://en.wikipedia.org/wiki/Gabor_filter
Modeling oriented edges using Gabor filters
Source: https://en.wikipedia.org/wiki/Gabor_filter
Feature maps using Gabor filters
Source: https://en.wikipedia.org/wiki/Gabor_filter
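As a concrete reference, a sketch (mine, using the standard parameterization from the Wikipedia article cited above) of generating a Gabor kernel and an orientation bank:

```python
import numpy as np

def gabor_kernel(size=21, sigma=4.0, theta=0.0, lam=10.0, psi=0.0, gamma=0.5):
    """Real part of a Gabor filter: a Gaussian envelope times an oriented sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)     # rotate coordinates by theta
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + gamma**2 * y_t**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / lam + psi)
    return envelope * carrier

# A bank of orientations, as used for the oriented-edge feature maps above
bank = [gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```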
Haar filters
Source: http://www.cosy.sbg.ac.at/~hegenbart/
More feature maps
Source: http://www.cosy.sbg.ac.at/~hegenbart/
Convolution
• Classical definitions
– Continuous: (f ∗ g)(t) = ∫_{−∞}^{∞} f(τ) g(t − τ) dτ
– Discrete: (f ∗ g)[n] = Σ_{x = −∞}^{∞} f[x] g[n − x]
Source: http://bmia.bmt.tue.nl/education/courses/fev/course/notebooks/triangleblockconvolution.gif
Convolution in 2-D (sharpening filter)
Source: https://upload.wikimedia.org/wikipedia/commons/4/4f/3D_Convolution_Animation.gif
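A quick sketch (my own, not from the slides): np.convolve implements the discrete 1-D definition above, and SciPy applies the classic 3x3 sharpening kernel in 2-D.

```python
import numpy as np
from scipy.signal import convolve2d

# Discrete 1-D convolution, matching the summation definition above (full mode)
print(np.convolve([1, 2, 3], [0, 1, 0.5]))

# 2-D sharpening: the center weight amplifies the pixel, neighbors are subtracted
sharpen = np.array([[0, -1,  0],
                    [-1,  5, -1],
                    [0, -1,  0]], dtype=float)
image = np.random.rand(8, 8)
sharpened = convolve2d(image, sharpen, mode='same', boundary='symm')
```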
Let the network learn conv kernels
Source: "Gradient-based learning applied to document recognition" by Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, in Proc. IEEE, Nov. 1998.
Types of pooling
• Two popular types of pooling methods
– Average
– Max
• Why? Pooling summarizes each local neighborhood, shrinking the feature map and adding a small amount of translation invariance
Source: "Gradient-based learning applied to document recognition" by Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, in Proc. IEEE, Nov. 1998.
Fully connected layers
Source: "Gradient-based learning applied to document recognition" by Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, in Proc. IEEE, Nov. 1998.
Visualizing weights, conv layer 1
Source: http://cs231n.github.io/understanding-cnn/
Visualizing feature map, conv layer 1
Source: http://cs231n.github.io/understanding-cnn/
Visualizing weights, conv layer 2
Source: http://cs231n.github.io/understanding-cnn/
Visualizing feature map, conv layer 2
Source: http://cs231n.github.io/understanding-cnn/
CNN for speech processing
Source: "Convolutional neural networks for speech recognition" by Ossama Abdel-Hamid et al., in IEEE/ACM Trans. ASLP, Oct, 2014
CNN for DNA-protein binding
Source: "Convolutional neural network architectures for predicting DNA–protein binding” by Haoyang Zeng et al., Bioinformatics 2016, 32 (12)
Layer design: Deep, but smaller convolutions
• Starting with VGGNet, most conv nets use small filter sizes
– Why? Stacking small filters detects spatial co-occurrence of nearby features, while depth grows the effective receptive field
• Only the first conv layer may have a larger filter size
– Why? In raw image space, complex low-level features span a larger neighborhood
Source: “Deep Residual Learning for Image Recognition” by He, Zhang, Ren, Sun, https://arxiv.org/pdf/1512.03385v1.pdf
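A quick parameter count (my own arithmetic) illustrating why stacked small filters win: two stacked 3x3 conv layers see a 5x5 receptive field but use fewer weights than a single 5x5 layer.

```python
C = 64                          # channels in and out, chosen for illustration
two_3x3 = 2 * (3 * 3 * C * C)   # 73,728 weights, 5x5 effective receptive field
one_5x5 = 5 * 5 * C * C         # 102,400 weights, same 5x5 receptive field
print(two_3x3, one_5x5)         # deeper stack: fewer parameters, more nonlinearity
```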
Residual skip connections
• Train H(x) of the form F(x) + x
• F(x) can be thought of as a residual
Source: “Deep Residual Learning for Image Recognition” by He, Zhang, Ren, Sun, https://arxiv.org/pdf/1512.03385v1.pdf
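A minimal sketch of the identity-skip idea (assuming PyTorch; the paper's actual blocks also use batch norm and downsampling variants):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes H(x) = F(x) + x, so the conv layers only learn the residual F."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.relu(self.conv1(x)))  # the residual F(x)
        return self.relu(f + x)                   # identity skip connection

y = ResidualBlock(16)(torch.randn(1, 16, 8, 8))
```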
Smaller errors with deeper ResNets
Source: “Deep Residual Learning for Image Recognition” by He, Zhang, Ren, Sun, https://arxiv.org/pdf/1512.03385v1.pdf
Convolution and pooling revisited
[Figure: Input Image → Convolutional Layer (∗, ReLU) → Feature Map → Pooling Layer (Max) → Feature Map → FC Layer → Class Probability. Inputs can be padded so that the input and output sizes match.]
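A short sketch (mine, assuming PyTorch) showing how padding preserves spatial size:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)
no_pad = nn.Conv2d(3, 8, kernel_size=3, padding=0)
same_pad = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # pad 1 pixel on each side
print(no_pad(x).shape)    # torch.Size([1, 8, 30, 30]): shrinks by kernel_size - 1
print(same_pad(x).shape)  # torch.Size([1, 8, 32, 32]): spatial size preserved
```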
Variations of the convolutional filter achieve various purposes
• N-D convolutions generalize over 2-D
• Stride variation leads to pooling
• Atrous (dilated) convolutions cover more area with fewer parameters
• Transposed convolution increases the feature map size
• Layer-wise convolutions reduce parameters
• 1x1 convolutions reduce feature maps
• Separable convolutions reduce parameters
• Network-in-network learns a nonlinear conv
Convolutions in 3-D
Convolutions with stride > 1
Atrous (dilated) convolutions can increase the receptive field without increasing the number of weights
[Figure: image pixels convolved with a 5x5 kernel, a 3x3 kernel, and a 5x5 dilated kernel that has only 3x3 trainable weights]
Transposed (de-)convolution increases the feature map size
[Figure: transposed convolution spreading each input value over the output to produce a larger feature map]
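To make the shape effects of stride, dilation, and transposition concrete, a hedged PyTorch sketch (the layer choices are my own):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 16, 16)
strided = nn.Conv2d(8, 8, 3, stride=2, padding=1)    # acts like learned pooling
dilated = nn.Conv2d(8, 8, 3, dilation=2, padding=2)  # 5x5 reach, only 3x3 weights
up = nn.ConvTranspose2d(8, 8, 2, stride=2)           # doubles the feature map size
print(strided(x).shape)  # torch.Size([1, 8, 8, 8])
print(dilated(x).shape)  # torch.Size([1, 8, 16, 16])
print(up(x).shape)       # torch.Size([1, 8, 32, 32])
```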
MobileNet filters each feature map separately
[Figure: depthwise convolution: each input feature map is convolved with its own separate kernel]
Source: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, 2017
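A sketch of the depthwise-separable idea (assuming PyTorch; groups=in_channels gives one filter per feature map, followed by a 1x1 pointwise mix):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 28, 28)
depthwise = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=32)  # one 3x3 filter per map
pointwise = nn.Conv2d(32, 64, kernel_size=1)                        # 1x1 mixes the maps
y = pointwise(depthwise(x))
# Weight count: 32*3*3 + 32*64 = 2,336 vs. a dense 3x3 conv's 32*64*3*3 = 18,432
print(y.shape)  # torch.Size([1, 64, 28, 28])
```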
Using 1x1 convolutions is equivalent to having a fully connected layer
• This way, a fully convolutional network can be constructed from a regular CNN such as VGG11
[Figure: a 1x1 convolution (∗) followed by ReLU replacing the fully connected layers]
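A hedged sketch (mine) of the equivalence: an FC layer over a flattened HxW feature map matches a convolution whose kernel spans the whole map, and subsequent FC layers then become 1x1 convs.

```python
import torch
import torch.nn as nn

C, H, W, N = 4, 6, 6, 10
x = torch.randn(1, C, H, W)

fc = nn.Linear(C * H * W, N)
conv = nn.Conv2d(C, N, kernel_size=(H, W))         # kernel spans the whole feature map
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(N, C, H, W))  # reuse the FC weights as a kernel
    conv.bias.copy_(fc.bias)

print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-6))  # True
```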
Separable convolutions
[Figure: a 2-D convolution factored into two successive 1-D convolutions]
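For instance (my own example), the Sobel kernel is separable into an outer product of two 1-D filters, cutting a k×k convolution down to two k-length ones:

```python
import numpy as np

col = np.array([1., 2., 1.])         # smoothing along one axis
row = np.array([1., 0., -1.])        # derivative along the other
sobel = np.outer(col, row)           # [[1,0,-1],[2,0,-2],[1,0,-1]]
print(np.linalg.matrix_rank(sobel))  # 1: rank-1 kernels are exactly the separable ones
```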
Network in network
• Instead of a linear filter with a nonlinear squashing function, N-i-N uses an MLP in a convolutional (sliding) fashion
Source: “Network in Network” by Min Lin, Qiang Chen, Shuicheng Yan, https://arxiv.org/pdf/1312.4400v3.pdf
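A sliding MLP can be written as one spatial conv followed by 1x1 convs; a sketch under that common interpretation (assuming PyTorch, with layer sizes of my choosing):

```python
import torch.nn as nn

# mlpconv block: the 1x1 convs act as a tiny MLP applied at every spatial position
nin_block = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(),
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(),
)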
A standard architecture on a large image with global average pooling
[Figure: convolutional layers on a large image ending in a global average pooling (GAP) layer]
Semantic segmentation is labeling pixels according to their classes
[Figure: a CNN maps an input image through convolution and upsampling to a segmentation map giving P_foreground(x, y) per pixel]
Upsampling layer
[Figure: Input Image → Convolutional layer → Feature Map → Pooling (downsampling) layer → Feature Map → Upsampling layer via transposed convolution]
Downsampling and upsampling lead to an hour-glass structure
[Figure: Input Image → Convolutional layer → Feature Map → Upsampling layer → Segmentation Map, narrowing then widening]
Let us rearrange the layers horizontally
[Figure: Input Image → Convolutional layer → Feature Map → Convolutional layer → Feature Map → Upsampling layer → Feature Map → Upsampling layer → Feature Map]
More layers can be added
[Figure: the same horizontal pipeline with additional convolutional and upsampling layers between the Input Image and the Segmentation Map]
Visually rearrange layers in a big U
[Figure: the downsampling path descends on the left and the upsampling path ascends on the right, from Input Image to Segmentation Map]

Concatenate previous feature maps for finer spatial context
[Figure: feature maps from the downsampling path are concatenated with the corresponding upsampling feature maps across the U]
U-Net is based on the ideas described in the previous slides
[Figure: U-Net architecture mapping an input sub-image to a segmentation map. Legend: Conv 3x3 + ReLU; Copy and Crop; Max Pool 2x2; Up-conv 2x2; Conv 1x1]
Source: “U-Net: Convolutional Networks for Biomedical Image Segmentation” Olaf Ronneberger, Philipp Fischer, Thomas Brox, 2015
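A toy sketch of the U idea (mine, assuming PyTorch; the real U-Net has four resolution levels and unpadded convolutions with crop-and-copy):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One down step, one up step, and a skip concatenation."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 1))  # 1x1 conv gives per-pixel logits

    def forward(self, x):
        skip = self.enc(x)                      # full-resolution features
        mid = self.mid(self.down(skip))         # coarse features
        up = self.up(mid)                       # back to full resolution
        return self.dec(torch.cat([up, skip], dim=1))  # concatenate for fine context

print(TinyUNet()(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])
```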
A sample output for nucleus segmentation in pathology
[Figure: example nucleus segmentation output]

[Figure: localization as regression: regular pooling, a linear last activation, and an (x, y, h, w) box output]
Faster R-CNN architecture
[Figure: conv layers produce feature maps; a Region Proposal Network slides a window over them into a 256-d intermediate layer; RoI pooling feeds the classifier]
Source: “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks” Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, 2017
Classification and regression on region proposals
[Figure: the RPN's proposals receive a class label and refined box coordinates]
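RoI pooling itself is available off the shelf; a hedged sketch (assuming torchvision; the feature-map and box values are mine) of cropping proposals into fixed-size features:

```python
import torch
from torchvision.ops import roi_pool

fmap = torch.randn(1, 256, 50, 50)   # backbone feature maps
# Each box is (batch_index, x1, y1, x2, y2) in feature-map coordinates
boxes = torch.tensor([[0., 5., 5., 20., 20.],
                      [0., 10., 30., 40., 45.]])
pooled = roi_pool(fmap, boxes, output_size=(7, 7))
print(pooled.shape)  # torch.Size([2, 256, 7, 7]): fixed size per proposal
```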
Case study: YOLO
• Problem: Simultaneous detection, localization, and classification of multiple objects in a fast manner
Source: "You Only Look Once: Unified, Real-Time Object Detection" Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi http://pjreddie.com/yolo/
Case study: YOLO
Output size: S × S × (B×5 + C)
#Bounding boxes per cell = B; each box ∈ R^5, i.e. (x, y, w, h, confidence)
#Classes = C
Source: “You Only Look Once:Unified, Real-Time Object Detection” Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi http://pjreddie.com/yolo/
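With the paper's PASCAL VOC setting (S = 7, B = 2, C = 20), the output tensor works out to 7 × 7 × 30:

```python
S, B, C = 7, 2, 20          # grid size, boxes per cell, classes (YOLOv1 on VOC)
output_size = (S, S, B * 5 + C)
print(output_size)          # (7, 7, 30)
```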
CNN Revisited
[Figure: Input Image → Convolutional Layer (∗, ReLU) → Feature Map → Pooling Layer → Feature Map → Convolutional Layer → Feature Map → Pooling Layer (Max) → Feature Map → FC Layer → FC Layer → Class Probability]
These features are transferable and can be used in an SVM, for example
[Figure: inputs x_i and x_j pass through the convolutional layers; their feature maps feed the kernel of an SVM]
Source: "How transferable are features in deep neural networks?" by Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson
Properties of a kernel
• It is a similarity metric (a distance is the inverse of a similarity)
• It is symmetric
• It is positive semi-definite
• Learning the kernel is called metric learning
Shared weights
[Figure: a test image and a reference image pass through two network branches with shared weights; a metric compares the resulting features. Example application: face verification]
Some distance and similarity measures
• Distance examples
– ||f(xi) − f(xj)||_2^2 (squared Euclidean)
– |f(xi) − f(xj)|_1 (Manhattan)
• Similarity examples
– Dot product: f(xi)^T f(xj), i.e. f(xi) · f(xj)
– Arc cosine: f(xi) · f(xj) / (||f(xi)|| ||f(xj)||)
– Radial basis function (RBF): exp(−||xi − xj||^2 / σ^2)
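A compact NumPy sketch (my own) of the measures listed above:

```python
import numpy as np

def sq_euclidean(a, b):
    return np.sum((a - b) ** 2)          # ||a - b||_2^2

def manhattan(a, b):
    return np.sum(np.abs(a - b))         # |a - b|_1

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

fi, fj = np.array([1., 2., 3.]), np.array([2., 2., 1.])
print(sq_euclidean(fi, fj), manhattan(fi, fj),
      fi @ fj, cosine_similarity(fi, fj), rbf(fi, fj))
```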