CV Lec6


Convolutional Neural Networks
• Convolutional Neural Networks are very similar to ordinary Neural Networks.

• They are made up of neurons that have learnable weights and biases.

• Each neuron receives some inputs, performs a dot product, and optionally follows it with
a non-linearity (see the sketch after this list).

• The whole network still expresses a single differentiable score function: from the raw
image pixels on one end to class scores at the other.

• And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer
and all the tips/tricks we developed for learning regular Neural Networks still apply.
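
As a minimal sketch of this computation (NumPy, with illustrative names and shapes rather than any particular library's API):

```python
import numpy as np

# One neuron: dot product of learnable weights with the inputs, plus a
# bias, optionally followed by a non-linearity (ReLU here).
x = np.random.randn(10)     # inputs
w = np.random.randn(10)     # learnable weights
b = 0.1                     # learnable bias

z = np.dot(w, x) + b        # dot product + bias
a = np.maximum(0.0, z)      # ReLU non-linearity
```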
So what changes?
• ConvNet architectures make the explicit assumption that the inputs are images,
which allows us to encode certain properties into the architecture.

• These then make the forward function more efficient to implement and vastly
reduce the number of parameters in the network.
Convnets are everywhere
Layers used to build convnets
• A simple ConvNet is a sequence of layers, and every layer of a ConvNet
transforms one volume of activations to another through a differentiable
function.

• We use three main types of layers to build ConvNet architectures:


• Convolutional Layer
• Pooling Layer
• Fully-Connected Layer

• We will stack these layers to form a full ConvNet architecture.


Convolutional Layer
• The CONV layer’s parameters consist of a set of learnable filters.

• Every filter is small spatially (along width and height) but extends through the full depth
of the input volume.

• For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5
pixels width and height, and 3 because images have depth 3, the color channels).

• During the forward pass, we slide (more precisely, convolve) each filter across the width
and height of the input volume and compute dot products between the entries of the
filter and the input at any position.
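
One step of this sliding dot product can be sketched as follows (shapes are made up to match the 5x5x3 example above):

```python
import numpy as np

x = np.random.randn(32, 32, 3)    # input volume (height x width x depth)
f = np.random.randn(5, 5, 3)      # one filter, extending the full depth
b = 0.0                           # bias for this filter

# Response of the filter at one spatial position (the top-left corner):
patch = x[0:5, 0:5, :]            # the 5x5x3 region under the filter
response = np.sum(patch * f) + b  # dot product between filter and patch
```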
Convolutional Layer
• As we slide the filter over the width and height of the input volume, we will produce a 2-
dimensional activation map that gives the responses of that filter at every spatial
position.

• Intuitively, the network will learn filters that activate when they see some type of visual
feature such as an edge of some orientation or a blotch of some color on the first layer,
or eventually entire honeycomb or wheel-like patterns on higher layers of the network.

• Now, we will have an entire set of filters in each CONV layer (e.g. 12 filters), and each of
them will produce a separate 2-dimensional activation map.

• We will stack these activation maps along the depth dimension and produce the output
volume.
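
A naive, loop-based sketch of the whole forward pass (real libraries implement convolution far more efficiently, but this makes the stacking along depth explicit):

```python
import numpy as np

def conv_forward_naive(x, filters, biases, stride=1):
    """Naive sketch of a conv layer forward pass (no padding).

    x:       input volume, shape (H, W, C)
    filters: K filters, shape (K, F, F, C)
    biases:  shape (K,)
    Returns an output volume of shape (H_out, W_out, K): one 2-D
    activation map per filter, stacked along the depth dimension.
    """
    H, W, C = x.shape
    K, F, _, _ = filters.shape
    H_out = (H - F) // stride + 1
    W_out = (W - F) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):                        # one activation map per filter
        for i in range(H_out):
            for j in range(W_out):
                patch = x[i*stride:i*stride+F, j*stride:j*stride+F, :]
                out[i, j, k] = np.sum(patch * filters[k]) + biases[k]
    return out

x = np.random.randn(32, 32, 3)
w = np.random.randn(12, 5, 5, 3)              # e.g. 12 filters of size 5x5x3
b = np.zeros(12)
out = conv_forward_naive(x, w, b)             # shape (28, 28, 12)
```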
Convolutional Layer
• Local Connectivity
• When dealing with high-dimensional inputs such as images, it is impractical,
as we saw above, to connect neurons to all neurons in the previous volume.

• Instead, we will connect each neuron to only a local region of the input
volume. The spatial extent of this connectivity is a hyperparameter called the
receptive field of the neuron (equivalently this is the filter size).
A ConvNet is a sequence of Convolution Layers, interleaved with activation
functions
What do convolutional filters learn?
A closer look at spatial dimensions
Spatial arrangement
• We have explained the connectivity of each neuron in the Conv Layer
to the input volume, but we haven’t yet discussed how many neurons
there are in the output volume or how they are arranged.

• Three hyperparameters control the size of the output volume:


• depth
• stride
• zero-padding
Remember back to…
E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially
(32 -> 28 -> 24 ...). Shrinking too fast is not good and doesn't work well; the quick check below illustrates this.
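
```python
# Spatial size after a 5x5 conv with stride 1 and no padding: (W - 5) + 1.
size = 32
for _ in range(3):
    size = (size - 5) // 1 + 1
    print(size)   # prints 28, 24, 20
```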
In practice: Common to zero-pad the border
Conv Layer: summary
• Let’s assume input is W1 x H1 x C
• Conv layer needs 4 hyperparameters:
• Number of filters K
• The filter size F
• The stride S
• The zero padding P

• This will produce an output of W2 x H2 x K


• where:
• W2 = (W1 - F + 2P)/S + 1
• H2 = (H1 - F + 2P)/S + 1

• Number of parameters: F²·C·K weights (F·F·C per filter, times K filters) and K biases
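
These formulas translate directly into a small helper (a sketch; the example numbers match the slides that follow):

```python
def conv_output_size(W1, H1, F, S, P):
    """Spatial output size of a conv layer, per the formulas above."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2

def conv_num_params(F, C, K):
    """F*F*C weights per filter, times K filters, plus K biases."""
    return F * F * C * K + K

# E.g. a 32x32x3 input with 10 5x5 filters, stride 1, pad 2:
print(conv_output_size(32, 32, F=5, S=1, P=2))  # (32, 32)
print(conv_num_params(F=5, C=3, K=10))          # 760
```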


Examples
Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size?

(32+2*2-5)/1+1 = 32 spatially, so
32x32x10
Examples
Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?

Each filter has 5*5*3 + 1 = 76 params (+1 for the bias)
=> 76*10 = 760

Real-world example
• The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted
images of size [227x227x3].

• On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4
and no zero padding P=0.

• Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer
output volume had size [55x55x96]. Each of the 55*55*96 neurons in this volume was
connected to a region of size [11x11x3] in the input volume.

• Moreover, all 96 neurons in each depth column are connected to the same [11x11x3]
region of the input, but with different weights!!
Parameter Sharing
• Using the real-world example above, we see that there are 55*55*96 = 290,400 neurons
in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias. Together, this
adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet
alone. Clearly, this number is very high.

• It turns out that we can dramatically reduce the number of parameters by making one
reasonable assumption: That if one feature is useful to compute at some spatial position
(x,y), then it should also be useful to compute at a different position (x2,y2).

• With this parameter sharing scheme, the first Conv Layer in our example would now have
only 96 unique sets of weights (one for each depth slice), for a total of 96*11*11*3 =
34,848 unique weights, or 34,944 parameters (+96 biases).
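
The arithmetic above, spelled out:

```python
# Without sharing: every one of the 55*55*96 neurons has its own weights.
neurons = 55 * 55 * 96                            # 290,400
params_no_sharing = neurons * (11 * 11 * 3 + 1)   # 105,705,600

# With sharing: one set of weights per depth slice (i.e., per filter).
params_sharing = 96 * (11 * 11 * 3) + 96          # 34,848 weights + 96 biases = 34,944
```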
Parameter Sharing
• Parameter sharing is a fundamental concept in Convolutional Neural Networks
(CNNs) and refers to the idea of using the same set of weights (parameters) for
multiple units in a layer.
Pooling Layer
• It is common to periodically insert a Pooling layer in-between successive Conv layers in a
ConvNet architecture.

• Its function is to progressively reduce the spatial size of the representation to reduce the
number of parameters and computation in the network.

• The Pooling Layer operates independently on every depth slice of the input and resizes it
spatially.

• The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2,
which downsamples every depth slice in the input by 2 along both width and height.
Pooling Layer
• makes the representations smaller and more manageable
• operates over each activation map independently
Pooling Layer
Pooling Layer: summary
• Let’s assume input is W1 x H1 x C
• The Pooling layer needs 2 hyperparameters:
• The spatial extent F
• The stride S

• This will produce an output of W2 x H2 x C where:


• W2 = (W1 - F )/S + 1
• H2 = (H1 - F)/S + 1

• Number of parameters: 0
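
A naive sketch of max pooling under the common F=2, S=2 setting (note there is nothing to learn):

```python
import numpy as np

def max_pool_naive(x, F=2, S=2):
    """2x2/stride-2 max pooling: operates on each depth slice
    independently and has no learnable parameters."""
    H, W, C = x.shape
    H_out = (H - F) // S + 1
    W_out = (W - F) // S + 1
    out = np.zeros((H_out, W_out, C))
    for i in range(H_out):
        for j in range(W_out):
            patch = x[i*S:i*S+F, j*S:j*S+F, :]
            out[i, j, :] = patch.max(axis=(0, 1))  # max within each slice
    return out

x = np.random.randn(32, 32, 10)
print(max_pool_naive(x).shape)  # (16, 16, 10)
```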
Fully Connected Layer (FC layer)
• Contains neurons that connect to the entire input volume, as in
ordinary Neural Networks.
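
A minimal sketch, assuming illustrative shapes (e.g. a 7x7x64 volume feeding 10 class scores):

```python
import numpy as np

# Fully-connected layer: flatten the input volume, then give every
# output neuron a weight for every input value.
x = np.random.randn(7, 7, 64)        # e.g. the last conv/pool volume
W = np.random.randn(10, 7 * 7 * 64)  # one row of weights per class score
b = np.zeros(10)

scores = W @ x.reshape(-1) + b       # one dot product per output neuron
```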
