Convolutional Neural Networks in Computer Vision: Jochen Lang
©IEEE, 1998
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD. Backpropagation applied to handwritten zip code recognition. Neural Computation. 1989;1(4):541-51.
Jochen Lang, EECS
[email protected]
Convolutional Network Layers
– Padded input in RGB, 7x7x3
– Filter W0, 3x3x3, applied at stride 2 (move the filter by two pixels after each application)
– Filter W1, 3x3x3, also at stride 2
– Combine the outputs into a 3x3x2 volume (see the sketch below)
Image source: cs231n.github.io, “Convolutional Neural Networks for Visual Recognition”, Karpathy et al., Stanford
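A minimal sketch (not from the lecture, assuming NumPy/SciPy) that reproduces the shapes of this example: a 5x5x3 input zero-padded to 7x7x3, two 3x3x3 filters at stride 2, giving a 3x3x2 output. The stride is obtained by slicing a stride-1 cross-correlation, which is what convolution layers actually compute; all variable names are illustrative.

```python
import numpy as np
from scipy.signal import correlate

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 3))           # original RGB input, 5x5x3
x_pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero padding -> 7x7x3
W = rng.standard_normal((2, 3, 3, 3))        # two filters (W0, W1), each 3x3x3
b = rng.standard_normal(2)                   # one bias per filter
stride = 2

maps = []
for f in range(2):
    # correlate over all three channels; 'valid' collapses depth -> 5x5x1
    full = correlate(x_pad, W[f], mode='valid')[..., 0]
    maps.append(full[::stride, ::stride] + b[f])  # keep every 2nd row/col -> 3x3

out = np.stack(maps, axis=-1)
print(out.shape)  # (3, 3, 2)
```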
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,j+m-1,\,k+n-1}\, K_{i,l,m,n}$$
Note that the kernel $K$ has four indices: $m, n$ for the spatial dimensions of the kernel, $i$ for the output channel and $l$ for the input channel. The indices $j, k$ give the location of the output.
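A direct loop translation of this equation, as a sketch (assuming NumPy; names are illustrative). Python is 0-based, so the $j+m-1$ of the 1-based formula becomes `j + m`.

```python
import numpy as np

def conv_forward(V, K):
    # V[l, row, col]: input channel l;  K[i, l, m, n]: weight connecting
    # input channel l to output channel i at spatial offset (m, n)
    C_in, H, W = V.shape
    C_out, _, kH, kW = K.shape
    Z = np.zeros((C_out, H - kH + 1, W - kW + 1))   # "valid" output size, stride 1
    for i in range(C_out):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                for l in range(C_in):
                    for m in range(kH):
                        for n in range(kW):
                            Z[i, j, k] += V[l, j + m, k + n] * K[i, l, m, n]
    return Z
```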
Image source: S. Lazebnik
With zero padding and a stride $s$ the convolution becomes
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,(j-1)s+m-p,\,(k-1)s+n-p}\, K_{i,l,m,n}$$
with $s$ the stride and $p$ the amount of zero padding.
– Note that with “valid” padding the indices would be
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,(j-1)s+m,\,(k-1)s+n}\, K_{i,l,m,n}$$
Note that we still have four indices on the kernel: two for its spatial dimensions, one for the output channel and one for the input channel. But the indices for the location of the output are now multiplied by the stride when indexing the input image.
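The indexing above fixes how large the output can be. A small sketch of the resulting output-size arithmetic (the helper name is illustrative): with input size $n$, kernel size $k$, padding $p$ and stride $s$, the number of output positions per dimension is $\lfloor (n + 2p - k)/s \rfloor + 1$.

```python
def conv_output_size(n, k, s=1, p=0):
    # number of kernel placements along one spatial dimension
    return (n + 2 * p - k) // s + 1

print(conv_output_size(5, 3, s=2, p=1))  # 3 -> the padded 7x7 example gives a 3x3 output
print(conv_output_size(5, 3, s=2, p=0))  # 2 -> "valid" padding shrinks the output
```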
• Stride (see figure)
Image source: Vincent Dumoulin and Francesco Visin, A guide to convolution arithmetic for deep learning, 2018.
• We use the gradient of the loss from the output,
$$G_{i,j,k} = \frac{\partial E}{\partial Z_{i,j,k}},$$
to calculate the influence of the kernel weights, i.e., the partials $\frac{\partial E}{\partial K_{i,l,m,n}}$, and to backpropagate the loss further, i.e., the partials $\frac{\partial E}{\partial V_{l,j,k}}$.
• We need to calculate the partials
$$\frac{\partial E}{\partial K_{i,l,m,n}} = \sum_{j,k} G_{i,j,k}\, V_{l,\,(j-1)s+m,\,(k-1)s+n}$$
Note that we again have four indices on the kernel: two for its spatial dimensions, one for the output channel and one for the input channel. The equation assumes 1-based indexing and a stride $s$; the indices $j, k$ run over the output dimensions.
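A minimal sketch of this kernel-weight gradient (assuming NumPy; `G`, `V`, `dK` follow the notation above, names are illustrative). The $(j-1)s+m$ of the 1-based formula becomes `j * s + m` in 0-based code.

```python
import numpy as np

def conv_backward_kernel(V, G, kH, kW, s=1):
    # G[i, j, k] = dE/dZ[i, j, k] is the upstream gradient
    C_in = V.shape[0]
    C_out, oH, oW = G.shape
    dK = np.zeros((C_out, C_in, kH, kW))
    for i in range(C_out):
        for l in range(C_in):
            for m in range(kH):
                for n in range(kW):
                    for j in range(oH):
                        for k in range(oW):
                            dK[i, l, m, n] += G[i, j, k] * V[l, j * s + m, k * s + n]
    return dK
```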
, ,
, ,
We need to calculate the
partials
,, , ,,
, ,
, ,
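A sketch of the input gradient (assuming NumPy; names are illustrative): instead of matching index constraints explicitly, each output position scatters its gradient back through the same index map `j * s + m`, which is the transpose of the forward pass and ties into the “deconvolution” discussion below.

```python
import numpy as np

def conv_backward_input(G, K, in_shape, s=1):
    # G[i, j, k] = dE/dZ[i, j, k];  in_shape = (C_in, H, W)
    C_out, oH, oW = G.shape
    _, C_in, kH, kW = K.shape
    dV = np.zeros(in_shape)
    for i in range(C_out):
        for j in range(oH):
            for k in range(oW):
                for l in range(C_in):
                    for m in range(kH):
                        for n in range(kW):
                            dV[l, j * s + m, k * s + n] += G[i, j, k] * K[i, l, m, n]
    return dV
```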
Image source: Vincent Dumoulin and Francesco Visin, A guide to convolution arithmetic for deep learning, 2018.
• Goal of “Deconvolution”
– In many architectures (in particular, autoencoders), we would like the output to be the same size as the input
– We need to go from a “minimal representation” back to the input image size
– This is the same as in backpropagation, when we distribute the loss from the output to the input of a convolutional layer
– This can be understood as a fractionally-strided convolution (see the sketch after the example below)
• Example:
Image source: Vincent Dumoulin and Francesco Visin, A guide to convolution arithmetic for deep learning, 2018.
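A sketch of fractionally-strided convolution by zero-insertion (assuming NumPy/SciPy; the sizes follow the classic 2x2-to-5x5 example in Dumoulin and Visin, and the helper name is illustrative): insert s−1 zeros between the input pixels, pad, then run an ordinary stride-1 convolution.

```python
import numpy as np
from scipy.signal import correlate2d

def fractionally_strided_conv(x, k, s=2, pad=2):
    H, W = x.shape
    dil = np.zeros((s * (H - 1) + 1, s * (W - 1) + 1))
    dil[::s, ::s] = x                    # zero-insertion = the "fractional stride"
    dil = np.pad(dil, pad)               # pad so the output grows
    return correlate2d(dil, k, mode='valid')

x = np.arange(4.0).reshape(2, 2)         # 2x2 feature map
k = np.ones((3, 3))                      # a 3x3 kernel (illustrative)
print(fractionally_strided_conv(x, k).shape)  # (5, 5): 2x2 upsampled to 5x5
```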
– The convolution can be written as a matrix multiplication of the flattened input with a sparse matrix built from the kernel weights, e.g., for a 3x3 kernel $k$ on a 5x5 input:
$$\begin{bmatrix}
k_{1,1} & k_{1,2} & k_{1,3} & 0 & 0 & k_{2,1} & k_{2,2} & k_{2,3} & 0 & 0 & k_{3,1} & k_{3,2} & k_{3,3} & 0 & 0 & 0 & \cdots & 0\\
0 & k_{1,1} & k_{1,2} & k_{1,3} & 0 & 0 & k_{2,1} & k_{2,2} & k_{2,3} & 0 & 0 & k_{3,1} & k_{3,2} & k_{3,3} & 0 & 0 & \cdots & 0\\
0 & 0 & k_{1,1} & k_{1,2} & k_{1,3} & 0 & 0 & k_{2,1} & k_{2,2} & k_{2,3} & 0 & 0 & k_{3,1} & k_{3,2} & k_{3,3} & 0 & \cdots & 0\\
 & & & & & & & & \ddots & & & & & & & & &
\end{bmatrix}$$
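A sketch of this matrix view (assuming NumPy; the helper name is illustrative): flattening a 5x5 input into a 25-vector, a 3x3 “valid” convolution becomes a 9x25 matrix C, and multiplying by C (or by its transpose for the backward/“deconvolution” direction) reproduces the convolution.

```python
import numpy as np

def conv_as_matrix(k, n=5):
    kH, kW = k.shape
    o = n - kH + 1                      # output side length for a "valid" convolution
    C = np.zeros((o * o, n * n))
    for r in range(o):                  # output row
        for c in range(o):              # output column
            for m in range(kH):
                for q in range(kW):
                    C[r * o + c, (r + m) * n + (c + q)] = k[m, q]
    return C

k = np.arange(1.0, 10.0).reshape(3, 3)
C = conv_as_matrix(k)
print(C.shape)                          # (9, 25)
x = np.random.default_rng(1).standard_normal((5, 5))
z = C @ x.ravel()                       # same values as a valid 3x3 cross-correlation of x
```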
– In unshared convolution (a locally connected layer) every output location has its own weights:
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,j+m-1,\,k+n-1}\, w_{i,j,k,l,m,n}$$
• Tiled convolution
– Is a compromise between regular convolution and unshared convolution
– Neighboring input regions, or tiles, use different kernels, but distant tiles use the same kernels again, i.e., we rotate through a set of $t$ kernels. Expressed with modulo $t$, we get (see the sketch below)
$$Z_{i,j,k} = \sum_{l,m,n} V_{l,\,j+m-1,\,k+n-1}\, K_{i,l,m,n,\,j\%t+1,\,k\%t+1}$$
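A sketch of tiled convolution (assuming NumPy; names are illustrative): a bank of t x t kernels, where the kernel used at output position (j, k) depends on (j mod t, k mod t), so nearby positions use different kernels but the pattern repeats every t pixels.

```python
import numpy as np

def tiled_conv(V, K, t):
    # V: (C_in, H, W);  K: (C_out, C_in, kH, kW, t, t)
    C_in, H, W = V.shape
    C_out, _, kH, kW, _, _ = K.shape
    Z = np.zeros((C_out, H - kH + 1, W - kW + 1))
    for i in range(C_out):
        for j in range(Z.shape[1]):
            for k in range(Z.shape[2]):
                Kjk = K[i, :, :, :, j % t, k % t]   # kernel chosen by tile index
                Z[i, j, k] = np.sum(V[:, j:j + kH, k:k + kW] * Kjk)
    return Z
```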
• Key ideas:
– Attach a separate net which operates in reverse
– Replace max pooling with max location switches (see the sketch below)
Zeiler and Fergus, Visualizing and Understanding Convolutional Networks, ECCV 2014.
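A sketch of the “max location switches” idea (assuming NumPy; helper names are illustrative): during 2x2 max pooling record where each maximum came from, and during the deconvnet pass place the value back at exactly that location, leaving all other positions zero.

```python
import numpy as np

def max_pool_with_switches(x, s=2):
    H, W = x.shape
    pooled = np.zeros((H // s, W // s))
    switches = np.zeros((H // s, W // s), dtype=int)
    for j in range(H // s):
        for k in range(W // s):
            window = x[j * s:(j + 1) * s, k * s:(k + 1) * s]
            switches[j, k] = np.argmax(window)       # flat index inside the window
            pooled[j, k] = window.ravel()[switches[j, k]]
    return pooled, switches

def unpool(pooled, switches, s=2):
    H, W = pooled.shape
    out = np.zeros((H * s, W * s))
    for j in range(H):
        for k in range(W):
            m, n = divmod(switches[j, k], s)         # unravel the recorded location
            out[j * s + m, k * s + n] = pooled[j, k]
    return out
```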
• Convolutional Layers
– Multichannel convolution
– Backpropagation
• Other Layers
– Pooling
– Fully-connected layers
– Activation functions
• Deconvnet
– Visualizing activations