ResNet 2
Backward (gradient flow)
Is optimization as easy as stacking layers?
Vanishing Gradients
• The phenomenon is caused by the multiplicative accumulation of gradients across layers (see the sketch after this list).
• This can be addressed by:
① Normalized initialization
② Batch Normalization
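To make the multiplicative accumulation concrete, here is the standard chain-rule sketch for a plain (non-residual) network; the notation x_l for the l-th activation and H_l for the l-th layer is assumed here, not taken from the slides.

```latex
% Plain network: x_{l+1} = H_l(x_l). Backpropagation from the loss E
% multiplies one Jacobian per layer:
\[
\frac{\partial E}{\partial x_l}
  = \frac{\partial E}{\partial x_L}
    \prod_{i=l}^{L-1} \frac{\partial H_i(x_i)}{\partial x_i}
\]
% If the Jacobian norms stay below 1, the product shrinks exponentially
% with depth (vanishing gradients); if they stay above 1, it grows
% exponentially (exploding gradients). Normalized initialization and BN
% keep these per-layer factors close to 1.
```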
(Figure: residual block with an identity shortcut)
Smooth Forward Propagation
• Any x_l is directly forward-propagated to any deeper x_L, plus the accumulated residuals.
• Any x_L is therefore an additive outcome, in contrast to the multiplicative form produced by a plain network (see the equations after this list).
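In symbols, following He et al.'s identity-mapping analysis (F denotes the residual function of block i and W_i its weights):

```latex
% Residual network with identity shortcuts: additive composition.
\[
x_L = x_l + \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i)
\]
% Contrast with a plain network, where (ignoring BN and nonlinearities)
% the signal is a product of per-layer transforms:
\[
x_L = \prod_{i=l}^{L-1} W_i \, x_l
\]
```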
Smooth Backward Propagation
• The gradient also flows in an additive form (see the equation after this list).
• The gradient at any layer is therefore unlikely to vanish, again in contrast to the multiplicative accumulation in a plain network.
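The corresponding backward form from the same analysis, with E the loss:

```latex
% Differentiating x_L = x_l + sum_i F(x_i, W_i) with respect to x_l:
\[
\frac{\partial E}{\partial x_l}
  = \frac{\partial E}{\partial x_L}
    \left( 1 + \frac{\partial}{\partial x_l}
               \sum_{i=l}^{L-1} \mathcal{F}(x_i, W_i) \right)
\]
% The additive "1" carries the gradient from x_L to x_l directly, so the
% gradient can vanish only if the second term is exactly -1, which is
% unlikely to happen for every sample in a mini-batch.
```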
What if the shortcut mapping h(x) ≠ identity?
Scaling the shortcut
• If h is a scaling, e.g. h(x) = λx, the forward and backward passes pick up a multiplicative factor, as written out below.
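Written out, with per-block scaling factors λ_i on the shortcut (and F̂ absorbing the scaling into the residual branch), following He et al.:

```latex
\[
x_L = \Bigl( \prod_{i=l}^{L-1} \lambda_i \Bigr) x_l
      + \sum_{i=l}^{L-1} \hat{\mathcal{F}}(x_i, W_i)
\]
\[
\frac{\partial E}{\partial x_l}
  = \frac{\partial E}{\partial x_L}
    \Bigl( \prod_{i=l}^{L-1} \lambda_i
           + \frac{\partial}{\partial x_l}
             \sum_{i=l}^{L-1} \hat{\mathcal{F}}(x_i, W_i) \Bigr)
\]
% For lambda > 1 the shortcut factor explodes exponentially with depth;
% for lambda < 1 it vanishes, so the signal is forced through the
% residual branches, which hampers optimization.
```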
Choices of the after-addition function f (the two extremes are sketched below):
• f(x) = ReLU (original ResNet)
• f(x) = BN + ReLU
• f(x) = identity (pre-activation ResNet)
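A minimal sketch of the two extremes, using PyTorch as an assumed framework (layer names and the fixed channel count are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

class OriginalBlock(nn.Module):
    """Original ResNet block: conv-BN-ReLU-conv-BN, then add, then ReLU (f = ReLU)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(x + out)      # f = ReLU sits on the shortcut path

class PreActBlock(nn.Module):
    """Pre-activation block (ResNet v2): BN-ReLU-conv-BN-ReLU-conv, then add (f = identity)."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out                  # f = identity: the shortcut path stays clean
```

The only structural difference is where BN and ReLU sit; in the pre-activation block the output of the addition is passed on unchanged, so the shortest path is kept clean.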
ReLU vs. BN + ReLU
• Placing BN after the addition can block propagation along the shortcut path.
• Keep the shortest path as smooth as possible.
ReLU vs. identity
• ReLU after the addition can block propagation when the network is very deep.
• Pre-activation eases the difficulty of optimization.
ImageNet Results
Conclusions from He et al.
• Keep the shortest path as smooth (clean) as possible!
• This is achieved by making both h(x) and f(x) identity mappings, so that forward and backward signals flow directly along this path.
Example of the unrolling
• Each residual block computes y_{l+1} = y_l + f_{l+1}(y_l).
• We take L = 3 and l = 0 as an example of unrolling (written out below).
• The data flows along exponentially many paths from input to output.
• We infer that a residual network with n blocks has 2^n paths (its multiplicity).
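Unrolled for L = 3, l = 0 (as in Veit et al.'s "unraveled view"):

```latex
\[
\begin{aligned}
y_3 &= y_2 + f_3(y_2) \\
    &= \bigl[\, y_1 + f_2(y_1) \,\bigr] + f_3\bigl( y_1 + f_2(y_1) \bigr) \\
    &= \bigl[\, y_0 + f_1(y_0) + f_2\bigl( y_0 + f_1(y_0) \bigr) \,\bigr]
       + f_3\bigl( y_0 + f_1(y_0) + f_2\bigl( y_0 + f_1(y_0) \bigr) \bigr)
\end{aligned}
\]
% Every distinct term in the fully expanded expression corresponds to one
% of the 2^3 = 8 paths from y_0 to y_3: each block is either entered or
% skipped via its shortcut.
```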
Different from a traditional NN
• In a traditional NN, each layer depends only on the previous layer, so the signal follows a single path through the full depth.
• This suggests that depth may not be the only key idea in deep learning.
Lesion Study
• Experiment 1: Deleting individual layers from neural networks.
• The distribution of path lengths is binomial, p(l = k) = C(n, k) · p^k · (1 − p)^(n − k), since each of the n residual blocks is either entered or skipped along a path; deleting one block therefore removes only the half of the paths that pass through it.
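A structural sketch of Experiment 1 on an untrained toy model (PyTorch assumed; dimensions and block count are illustrative): deleting a residual block still yields a well-formed forward pass because the identity shortcut carries the signal onward.

```python
# Untrained toy model: the numbers are not meaningful; the point is that a
# residual network keeps a valid computation path when a block is removed.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)            # identity shortcut plus residual

blocks = nn.ModuleList([ResBlock(16) for _ in range(6)])
x = torch.randn(4, 16)

def forward(x, skip=None):
    # "Lesion": drop the block at index `skip` entirely at test time.
    for i, block in enumerate(blocks):
        if i != skip:
            x = block(x)
    return x

full = forward(x)
lesioned = forward(x, skip=3)           # still produces a well-formed output
print((full - lesioned).norm() / full.norm())
```

In a plain feed-forward network the analogous deletion is far more destructive, since every path must pass through every layer.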
Vanishing gradients in ResNet
• Data flows along all the paths in a ResNet, but not all paths carry the same amount of gradient.
• We sample individual paths of a given length and measure the norm of the gradient that arrives at the input.
• The gradient magnitude of a path decreases exponentially with the number of modules it passes through.
ResNets: exponential ensembles of relatively shallow networks
• We multiply the frequency of each path length with its expected gradient magnitude (a toy version of this calculation is sketched below).
• Almost all of the gradient updates come from relatively shallow paths.
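A toy version of that calculation (the per-module gradient decay factor below is an assumed illustrative value, not the measured one):

```python
# Combine (i) how many paths have each length with (ii) how strongly the
# gradient decays along a path of that length.
from math import comb

n = 54          # number of residual blocks (e.g. a 110-layer CIFAR ResNet)
decay = 0.75    # assumed per-module gradient attenuation factor (illustrative only)

# (i) Path-length distribution: each block is entered or skipped, so the
#     fraction of paths of length k is C(n, k) / 2^n.
freq = [comb(n, k) / 2**n for k in range(n + 1)]

# (ii) Expected gradient magnitude of a length-k path decays exponentially in k.
grad = [decay**k for k in range(n + 1)]

# (iii) Total expected gradient contributed by paths of each length.
contrib = [f * g for f, g in zip(freq, grad)]
total = sum(contrib)

# Report which path lengths account for ~90% of the total gradient signal.
acc = 0.0
for k, c in enumerate(contrib):
    acc += c
    if acc / total >= 0.9:
        print(f"~90% of the gradient comes from paths of length <= {k} "
              f"(network has {n} blocks)")
        break
```

With the measured decay rates, Veit et al. reach the same qualitative conclusion: the effective paths are much shorter than the nominal depth of the network.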
Discussion
• Removing residual modules mostly removes long paths
• The paths that contribute gradient are very short compared to the
overall depth of the network.
• ResNet in ResNet
"ResNet in ResNet: Generalizing Residual Architectures", Sasha Targ et al., arXiv, March 2016.