Convolutional Neural Networks in Computer Vision: Jochen Lang


Convolutional Neural Networks in Computer Vision
Jochen Lang

[email protected]

Faculté de génie | Faculty of Engineering


Jochen Lang, EECS
[email protected]
Convolutional neural networks

• Convolutional neural networks (CNNs)
– "Classic layers": convolutional, pooling and fully-connected layers
– Visualizing CNNs

Convolutional Networks

• Yann LeCun et al., LeNet [1998]
– The paper describes a network architecture first introduced in 1989. It defines the principles of deep learning for OCR and speech recognition.

©IEEE, 1998

LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD. Backpropagation
applied to handwritten zip code recognition. Neural computation. 1989;1(4):541-51.
Convolutional Network Layers

• Images are arguably (very) high dimensional, e.g., a 1080p image has about 2 million pixels, i.e., over 6 million dimensions with three color channels
• Dimensions may however not be independent
• Convolutional layers help to summarize the image
– Easily understood as linear FIR filters or 2D convolutions
– The filter coefficients are the weights of the neural network layer
– The filters of a layer however slide (as in classical image filtering) over the input image
– The outputs of many different filters and of the same filter at different locations are combined deeper into the network

Example: Convolutional Layer

– Padded input in RGB, 7x7x3
– Filter W0, 3x3x3, applied at stride 2 (move the filter by two pixels after each application)
– Filter W1, 3x3x3, stride 2
– Combine output into 3x3x2

Image source: cs231n.github.io, "Convolutional Neural Networks for Visual Recognition", Karpathy et al., Stanford
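A minimal NumPy sketch of this layer (an illustrative reimplementation of the sliding-window arithmetic; the random input, filter and bias values are placeholders, not the values from the cs231n demo):

```python
import numpy as np

def conv_layer(x, filters, biases, stride):
    """Naive convolutional layer: x is HxWxC (already zero-padded),
    each filter is kh x kw x C, one bias per filter."""
    h, w, _ = x.shape
    kh, kw, _ = filters[0].shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w, len(filters)))
    for f, (k, b) in enumerate(zip(filters, biases)):
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw, :]
                out[i, j, f] = np.sum(patch * k) + b  # inner product + bias
    return out

# 5x5 RGB input, zero-padded to 7x7x3; two 3x3x3 filters at stride 2
x = np.pad(np.random.rand(5, 5, 3), ((1, 1), (1, 1), (0, 0)))
W0, W1 = np.random.rand(3, 3, 3), np.random.rand(3, 3, 3)
out = conv_layer(x, [W0, W1], [0.0, 1.0], stride=2)
print(out.shape)  # (3, 3, 2)
```

The two filters produce the two output channels, combined into the 3x3x2 output of the example.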

Convolution: a Closer Look

• Inner product, i.e., multiplying the pixel values with the weights
• Same as image correlation! (The "convolution" in CNNs does not flip the kernel.)

Image source: cs231n.github.io, "Convolutional Neural Networks for Visual Recognition", Karpathy et al., Stanford

Main Concepts in CNNs
• Convolutional layers introduce the following core ideas into
machine learning
– sparse interaction
• the filter kernels are chosen smaller than the input
image
• the deeper layers in the network are indirectly
connected to many inputs
– parameter sharing
• the same filter kernels are applied at many (or all)
input locations
– equivariant representation
• the filter kernel does not change over the image and
hence if we shift the input image, the output will
shift correspondingly
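The equivariance property can be checked numerically; a single-channel sketch with a random image and filter:

```python
import numpy as np

def conv_valid(f, g):
    """Single-channel 'valid' convolution (correlation, as in CNNs)."""
    kh, kw = g.shape
    out = np.zeros((f.shape[0] - kh + 1, f.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i+kh, j:j+kw] * g)
    return out

f = np.random.rand(6, 6)
g = np.random.rand(3, 3)
# Shifting the input down by one row shifts the output down by one row
# (comparison excludes the row that np.roll wraps around)
shifted_out = conv_valid(np.roll(f, 1, axis=0), g)
equivariant = np.allclose(shifted_out[1:], conv_valid(f, g)[:-1])
print(equivariant)  # True
```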

Convolutional Layers

• As in the previous example, a convolutional layer combines many filter kernels
– E.g., the first layer of VGG-16
• An input RGB image of size 224x224 can be viewed as a tensor of size 224x224x3
• The first convolution will produce 64 output channels, i.e., 64 multi-channel kernels are applied to produce an output of 224x224x64
– Multi-channel kernels mean that the filters sum over image area and color, e.g., in the VGG net the first filter kernel size is 3x3x3; in the next layer they sum over the feature map size, here still 224x224, and its channels, here 64
Multichannel Convolution

• Multiple input and output channels add further summations to our convolution operator
• Let the output of the convolution be Z with entries Z_{i,j,k}, with the input V (an image or the features of the previous layer) and the kernel K; then the convolution function is:

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n}

Note that we have four indices for the kernel: m, n for the spatial dimensions, i for the output channel and l for the input channel. The indices j, k are the location of the output.
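The sum above translates directly into NumPy loops; a sketch with 0-based indices and "valid" output (the tensor shapes and random values are illustrative assumptions):

```python
import numpy as np

def multichannel_conv(V, K):
    """Z[i,j,k] = sum over l,m,n of V[l, j+m, k+n] * K[i,l,m,n] (0-based)."""
    C_out, C_in, kh, kw = K.shape
    _, H, W = V.shape
    Z = np.zeros((C_out, H - kh + 1, W - kw + 1))
    for i in range(C_out):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j:j+kh, k:k+kw] * K[i])
    return Z

V = np.random.rand(3, 6, 6)      # channels x height x width
K = np.random.rand(4, 3, 3, 3)   # out-channels x in-channels x kh x kw
Z = multichannel_conv(V, K)
print(Z.shape)  # (4, 4, 4)
```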

Padding

• Just as in image filtering, convolutional kernels have to be set up to deal with the boundary. The most common strategies are "valid" and "same"
• "valid" means no padding, i.e., each output is only calculated from actual input pixels or features
• "same" means zero padding is used to calculate an output for all input pixels or features
[Figure: a filter g applied to an image f with "valid" (no padding) and "same" (zero padding) boundary handling. Image source: S. Lazebnik]
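A single-channel sketch of the two strategies (implemented as correlation, as in CNNs; the box filter is a placeholder):

```python
import numpy as np

def conv2d(f, g, padding="valid"):
    """2D correlation of image f with kernel g, 'valid' or 'same' padding."""
    kh, kw = g.shape
    if padding == "same":  # zero-pad so the output matches the input size
        f = np.pad(f, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    H, W = f.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i+kh, j:j+kw] * g)
    return out

f = np.arange(25.0).reshape(5, 5)
g = np.ones((3, 3)) / 9.0  # box filter
print(conv2d(f, g, "valid").shape)  # (3, 3)
print(conv2d(f, g, "same").shape)   # (5, 5)
```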

Tensor Indices
• The indices in the equations are 1-indexed (according to
mathematical convention).
• Equation assumes padding has been applied to input
• Example: First convolutional layer in VGG
– Input is V with dimension 3x224x224 (depth first); "same" (zero) padding for a 3x3 filter means that the padded input has spatial dimensions 226x226 but the output is the same size as the unpadded input, i.e., 224x224. Input depth is 3, output depth is 64.

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n}

with i = 1…64, j, k = 1…224, l = 1…3 and m, n = 1…3.
– Note that with "valid" padding the indices would be j, k = 1…222.
Stride

• Image filters are typically applied with stride 1, i.e., the filter is moved over by 1 pixel at a time
• We can use a larger motion or stride
– The output will be smaller than the input
• In classical convolutional networks the number of channels increases but the spatial resolution decreases deeper in the network
• The same kernel size can summarize a larger input region
• An alternative to increasing the stride is to apply pooling
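The resulting spatial output size follows the usual formula; a small sketch (the VGG-style numbers are examples):

```python
def conv_output_size(n, k, p, s):
    """Spatial output size for input size n, kernel k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(224, 3, 1, 1))  # 224 ("same" padding, stride 1)
print(conv_output_size(224, 3, 0, 1))  # 222 ("valid" padding)
print(conv_output_size(224, 3, 1, 2))  # 112 (stride 2 halves the resolution)
```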

Multichannel Convolution with Stride

• Let the output of the convolution with stride s be Z with entries Z_{i,j,k}, with the input V (an image or the features of the previous layer) and the kernel K; then the convolution function with stride s is:

Z_{i,j,k} = Σ_{l,m,n} V_{l, (j-1)s+m, (k-1)s+n} K_{i,l,m,n}

Note that we still have four indices for the kernel: m, n for the spatial dimensions, i for the output channel and l for the input channel. But the indices j, k for the location of the output are multiplied with the stride in the input image.
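A sketch of this strided form, which also checks the defining property that a stride-s convolution equals the stride-1 convolution sampled every s pixels (shapes and values are illustrative):

```python
import numpy as np

def conv(V, K, s=1):
    """Z[i,j,k] = sum over l,m,n of V[l, j*s+m, k*s+n] * K[i,l,m,n] (0-based)."""
    C_out, _, kh, kw = K.shape
    _, H, W = V.shape
    Z = np.zeros((C_out, (H - kh) // s + 1, (W - kw) // s + 1))
    for i in range(C_out):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j*s:j*s+kh, k*s:k*s+kw] * K[i])
    return Z

V, K = np.random.rand(3, 7, 7), np.random.rand(2, 3, 3, 3)
# stride 2 output == stride 1 output sampled every 2 pixels
assert np.allclose(conv(V, K, 2), conv(V, K, 1)[:, ::2, ::2])
```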

Example: Stride of One and Two

• Feature map and kernel with same padding
• Stride 1 and stride 2

Image source: Vincent Dumoulin and Francesco Visin, A guide to convolution arithmetic for deep learning, 2018.

Relationship to Classic Hidden Layers

• Classic hidden layers connect every input to every output
• A convolutional layer can be implemented as a classic hidden layer where all weights outside the kernel footprint are zero and the weights at corresponding positions of parallel hidden units are shared.
– Leads to the concept of an infinitely strong prior
• Forced zero weights and forced shared weights
– Clearly backpropagation should still work
– Introduce extra sums in backpropagation to include all inputs and outputs influenced by the shared weights and the sensitivity of the output.

Backpropagation with CNNs
• Let Z with entries Z_{i,j,k} be the output of a convolutional layer with kernel K and multichannel image V
• The output will be a tensor with spatial indices j, k and channel index i
• The overall network will have a loss J(V, K) for a given image and kernel
– Then we need to take the gradient tensor

G_{i,j,k} = ∂J / ∂Z_{i,j,k}

from the output to calculate the influence of the kernel weights, i.e., the partials ∂J / ∂K_{i,l,m,n}, and to backpropagate the loss further, i.e., ∂J / ∂V_{l,j,k}
Influence of Weights

• Given the gradient tensor at the layer output

G_{i,j,k} = ∂J / ∂Z_{i,j,k}

we need to calculate the partials

∂J / ∂K_{i,l,m,n} = Σ_{j,k} G_{i,j,k} V_{l, (j-1)s+m, (k-1)s+n}

Note that we have four indices for the kernel: m, n for the spatial dimensions, i for the output channel and l for the input channel.
The equation assumes 1-based indexing and a stride s; the sum over j, k is over the output dimensions.
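A sketch of this kernel gradient, verified against a finite difference for the simple loss J = sum(Z) (the function names and shapes are illustrative assumptions):

```python
import numpy as np

def conv(V, K, s):
    """Strided multichannel convolution (0-based indices)."""
    C_out, _, kh, kw = K.shape
    _, H, W = V.shape
    Z = np.zeros((C_out, (H - kh) // s + 1, (W - kw) // s + 1))
    for i in range(C_out):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Z[i, j, k] = np.sum(V[:, j*s:j*s+kh, k*s:k*s+kw] * K[i])
    return Z

def kernel_grad(G, V, K_shape, s):
    """dJ/dK[i,l,m,n] = sum over j,k of G[i,j,k] * V[l, j*s+m, k*s+n]."""
    C_out, C_in, kh, kw = K_shape
    dK = np.zeros(K_shape)
    for i in range(C_out):
        for j in range(G.shape[1]):
            for k in range(G.shape[2]):
                dK[i] += G[i, j, k] * V[:, j*s:j*s+kh, k*s:k*s+kw]
    return dK

# Check one entry against a finite difference for J = sum(Z)
V, K, s = np.random.rand(2, 7, 7), np.random.rand(2, 2, 3, 3), 2
G = np.ones_like(conv(V, K, s))   # dJ/dZ for J = sum(Z)
dK = kernel_grad(G, V, K.shape, s)
eps = 1e-6
Kp = K.copy(); Kp[0, 0, 1, 1] += eps
fd = (conv(V, Kp, s).sum() - conv(V, K, s).sum()) / eps
assert abs(fd - dK[0, 0, 1, 1]) < 1e-4
```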

Backpropagation through the Layer
• Given the gradient tensor at the layer output

G_{i,j,k} = ∂J / ∂Z_{i,j,k}

we need to calculate the partials

∂J / ∂V_{l,j,k} = Σ_{m,j' s.t. (j'-1)s+m = j} Σ_{n,k' s.t. (k'-1)s+n = k} Σ_q K_{q,l,m,n} G_{q,j',k'}

Note that we have the indices j, k for the spatial dimensions of the input and l for the input channel.
The two outer summations are over all convolutions that operate on the input at these locations.
All output channels q need to be summed up.
The equation assumes 1-based indexing and a stride s
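The same partials can be computed by scattering each output gradient back through the kernel; a sketch (the function name and shapes are illustrative assumptions):

```python
import numpy as np

def input_grad(G, K, V_shape, s):
    """dJ/dV[l,j,k]: scatter each output gradient back through the kernel.
    Equivalent to summing K[q,l,m,n] * G[q,j',k'] over all (j',m) with
    j'*s + m == j and all (k',n) with k'*s + n == k (0-based), over all q."""
    dV = np.zeros(V_shape)
    C_out, C_in, kh, kw = K.shape
    for q in range(C_out):
        for jp in range(G.shape[1]):
            for kp in range(G.shape[2]):
                dV[:, jp*s:jp*s+kh, kp*s:kp*s+kw] += G[q, jp, kp] * K[q]
    return dV

V_shape = (2, 7, 7)
K = np.random.rand(2, 2, 3, 3)
G = np.ones((2, 3, 3))  # e.g. dJ/dZ for J = sum(Z), stride-2 output is 3x3
dV = input_grad(G, K, V_shape, 2)
print(dV.shape)  # (2, 7, 7)
```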

Example: Back-propagation

• Feature map and kernel with valid padding and stride, and its back-propagation

Image source: Vincent Dumoulin and Francesco Visin, A guide to convolution arithmetic for deep learning, 2018.

“Deconvolution”

• Deconvolution in CNNs refers to convolutions that increase the spatial dimensions of the output relative to the input. But "deconvolution" is a misnomer.
• Deconvolution as a mathematical operation is defined as recovering a signal that has undergone a convolution.
– Consider f ∗ g = h, where f is the input image, g is the filter and h is the output image (see the lecture on image processing).
– Deconvolution is recovering f given g and h
– This operation is linear in the Fourier domain, where a division is performed for each Fourier coefficient.

“Deconvolution” in CNNs

• Goal of "deconvolution"
– In many architectures (in particular, autoencoders), we would like the output to be the same size as the input
– We need to go from a "minimal representation" back to the input image size
– This is the same as in backpropagation, when we distribute the loss from the output to the input of a convolutional layer
This can be understood as fractionally-strided convolution

Fractionally-Strided Convolution

Recall that in the strided convolution for the output map Z_{i,j,k} with stride s, the index (j-1)s+m strides the filter over the input indices. Hence if the stride s < 1 then the filter will stride "slower" than the input.
• Stride with natural numbers, i.e., s = 1, 2, 3, …, to reduce the output size.
• Stride with fractions of natural numbers, i.e., s = 1/2, 1/3, …, to increase the output size.
Fractionally-strided convolution with s = 1/t can be realized by adding t - 1 zero rows and columns in-between the entries of the input map before filtering with stride 1.
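A single-channel sketch of this zero-insertion scheme, assuming stride 1/t and "same" padding (the function name and sizes are illustrative):

```python
import numpy as np

def fractionally_strided_conv(V, K, t):
    """Upsampling convolution with stride 1/t: insert t-1 zeros between
    the entries of V, then convolve with stride 1 and 'same' padding."""
    H, W = V.shape
    up = np.zeros((t * (H - 1) + 1, t * (W - 1) + 1))
    up[::t, ::t] = V                       # original entries, zeros in-between
    kh, kw = K.shape
    up = np.pad(up, ((kh // 2,) * 2, (kw // 2,) * 2))
    out = np.zeros((t * (H - 1) + 1, t * (W - 1) + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(up[i:i+kh, j:j+kw] * K)
    return out

V = np.random.rand(3, 3)
out = fractionally_strided_conv(V, np.ones((3, 3)), 2)
print(out.shape)  # (5, 5) -- the spatial size has increased
```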

Fractionally-Strided Convolution

Example:

Example: Strided and Fractionally-
strided Convolution
• Convolution of a feature map with a kernel with same padding and a stride greater than one leads to a smaller output.
• A corresponding "deconvolution" uses a fractional stride and padding.

Image source: Vincent Dumoulin and Francesco Visin, A guide to convolution arithmetic for deep learning, 2018.

Convolution as Matrix Operation

• Convolution is a linear operation
• With the appropriate matrix shape, a convolutional layer can be expressed as a matrix multiply, and back-propagation is then a multiplication with the matrix transpose
• Example:
A 5x5 feature map (after padding) convolved with a 3x3 kernel leads to a 9x25 matrix multiplying the feature map reshaped row-major as a 25-vector for each channel. The first three rows of the matrix are:

k_{1,1} k_{1,2} k_{1,3} 0 0 k_{2,1} k_{2,2} k_{2,3} 0 0 k_{3,1} k_{3,2} k_{3,3} 0 0 0 0 0 0 0 0 0 0 0 0
0 k_{1,1} k_{1,2} k_{1,3} 0 0 k_{2,1} k_{2,2} k_{2,3} 0 0 k_{3,1} k_{3,2} k_{3,3} 0 0 0 0 0 0 0 0 0 0 0
0 0 k_{1,1} k_{1,2} k_{1,3} 0 0 k_{2,1} k_{2,2} k_{2,3} 0 0 k_{3,1} k_{3,2} k_{3,3} 0 0 0 0 0 0 0 0 0 0
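A sketch that builds this matrix for a "valid" 3x3 convolution over a 5x5 map and applies it as a matrix multiply (random values are placeholders):

```python
import numpy as np

def conv_matrix(K, H, W):
    """Build the (out_h*out_w) x (H*W) matrix of a 'valid' convolution
    with kernel K over a row-major flattened HxW feature map."""
    kh, kw = K.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    M = np.zeros((out_h * out_w, H * W))
    for i in range(out_h):
        for j in range(out_w):
            for m in range(kh):
                for n in range(kw):
                    M[i * out_w + j, (i + m) * W + (j + n)] = K[m, n]
    return M

K = np.random.rand(3, 3)
f = np.random.rand(5, 5)
M = conv_matrix(K, 5, 5)   # 9 x 25, as in the example above
z = M @ f.reshape(-1)      # the convolution as a matrix multiply
print(M.shape)  # (9, 25)
```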

Sparse Matrix Operation

• Back-propagation through a layer and deconvolution are simple to see if matrices are used.
• Given z = W v, then ∂J/∂v = Wᵀ g, where g = ∂J/∂z is the gradient tensor of the output written as a vector.

Special Cases of Convolution
• Locally connected layer (or unshared convolution)
– Weights are specific to each location, i.e., we use a different kernel at each location:

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m-1, k+n-1} W_{i,j,k,l,m,n}

• Tiled convolution
– Is a compromise between regular convolution and unshared convolution
– Neighboring input regions or tiles use different kernels but distant tiles use the same kernels again, i.e., we rotate through the kernels. Expressed with modulo t, we get:

Z_{i,j,k} = Σ_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n, j%t+1, k%t+1}
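A single-channel sketch of tiled convolution, rotating through t x t kernels with the output location modulo t (shapes and values are illustrative):

```python
import numpy as np

def tiled_conv(V, Ks, t):
    """Tiled convolution: neighboring output locations use different
    kernels, locations t apart reuse the same kernel (single channel)."""
    H, W = V.shape
    kh, kw = Ks[0][0].shape
    Z = np.zeros((H - kh + 1, W - kw + 1))
    for j in range(Z.shape[0]):
        for k in range(Z.shape[1]):
            K = Ks[j % t][k % t]   # rotate through the t*t kernels
            Z[j, k] = np.sum(V[j:j+kh, k:k+kw] * K)
    return Z

V = np.random.rand(6, 6)
Ks = [[np.random.rand(3, 3) for _ in range(2)] for _ in range(2)]
print(tiled_conv(V, Ks, 2).shape)  # (4, 4)
```

With t = 1 this reduces to regular convolution; with t equal to the output size it becomes an unshared convolution.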

Other Layers in CNNs

• Other layers are needed besides our classic hidden layer, which is referred to as a fully-connected layer in CNNs, and the convolutional layer
• Pooling layer
– Combines different outputs and summarizes the result, e.g., max pooling (select the highest value)
• Activation layer
– Separates the linear weighting of the inputs and the activation function into separate layers, e.g., a ReLU layer
• Already discussed

Convolutional Layer in Context

Image source: I. Goodfellow et al., Deep Learning, MIT Press, 2016

Pooling Layers

• Pooling makes the output invariant to small translations
– E.g., max pooling
• The output of the max pooling operator is the maximum input over the input neighborhood, e.g., the 2x2 neighborhood (0.3, 0.7, 0.2, 0.1) pools to 0.7

• The output is the same independent of where the maximum occurs
• Most pooling operations have no parameters to learn,
e.g., max pooling
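A sketch of non-overlapping 2x2 max pooling on the example values above:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Non-overlapping max pooling (for size == stride)."""
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[0.3, 0.7],
              [0.2, 0.1]])
print(max_pool(x))  # [[0.7]]
```

Note that there are no learnable parameters: the only operation is taking the maximum over each neighborhood.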

Pooling

• While max pooling seems to be used most frequently, other options can be used
– Median pooling
– Mean pooling
– L2-norm pooling
• The translational invariance of pooling is only appropriate if we don't care about the exact location of an output
– Pooling breaks the spatial connection over the input neighborhood to the output

Alternatives to Pooling

• Pooling is sometimes regarded as unfavorable
• Instead a fully-convolutional design can be used
– The pooling operation can be replaced by a 1x1 convolution as it sums over the channels
– It is less arbitrary as learnable weights are used
– It keeps the architecture more homogeneous, potentially giving a speed advantage
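A sketch of a 1x1 convolution as a learned sum over channels (the channel counts are illustrative):

```python
import numpy as np

def conv1x1(V, W):
    """1x1 convolution: V is C_in x H x W, W is C_out x C_in.
    Each output channel is a learned weighted sum over the input channels,
    applied independently at every spatial location."""
    return np.einsum('oc,chw->ohw', W, V)

V = np.random.rand(64, 7, 7)   # feature map with 64 channels
W = np.random.rand(16, 64)     # learnable weights, 64 -> 16 channels
print(conv1x1(V, W).shape)  # (16, 7, 7)
```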

Visualizing CNNs

• The filters in CNNs are multi-channel filters, and all but the first convolutional layer can therefore not be viewed directly.
• Max pooling introduces spatial uncertainty about where an output activation originates
• Deconvnet by Zeiler and Fergus, ECCV 2014, introduces a way to visualize classic convolutional networks beyond the first convolutional layer
– Essentially a form of backpropagation to the input to trace back where high activations originate in the image

Structure of Deconvnet
[Figure: Deconvnet (left) attached to a Convnet (right)]

Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.

Depiction of Deconvnet Operation

• Key ideas:
– Attach a separate net which operates in reverse
– Replace max pooling with max location switches

Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.

Convnet

• The convnet used by Zeiler and Fergus

Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.

Visualization of First Layer

• Maximum activations for all kernels in layer 1 (left) and, for 9 kernels on the right, the 9 input images creating the top activations

Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.

Layer 2

• Similar to traditional corner or feature detectors


• Responding to shape, texture, structure, etc.

Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.

Layer 3
• More complex groupings, i.e., textures or patterns
• Even face textures identified

Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.

Layer 4
• More high level
groupings
• Not yet class label or
object specific
• E.g., dog faces,
animal legs,
foreground water

Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.

Layer 5
• Close to final output, i.e., object classes
• Object-specific with
large variation
including pose
variations

Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.

Summary

• Convolutional Layers
– Multichannel convolution
– Backpropagation
• Other Layers
– Pooling
– Fully-connected layers
– Activation functions
• Deconvnet
– Visualizing activations

