L2 - UCLxDeepMind DL2020


WELCOME TO THE

UCL x DeepMind
lecture series
In this lecture series, research scientists from the leading AI research lab DeepMind will give
12 lectures on an exciting selection of topics in Deep Learning, ranging from the fundamentals
of training neural networks, via advanced ideas around memory, attention, and generative
modelling, to the important topic of responsible innovation.
Please join us for a deep dive into Deep Learning!

#UCLxDeepMind
General information

Exits: At the back, the way you came in
Wifi: UCL guest
TODAY’S SPEAKER

Wojciech Czarnecki
Wojciech Czarnecki is a Research Scientist at DeepMind. He obtained his PhD from the
Jagiellonian University in Cracow, during which he worked on the intersection of machine
learning, information theory and cheminformatics. Since joining DeepMind in 2016, Wojciech
has worked mainly on deep reinforcement learning, with a focus on multi-agent systems such as
the recent Capture the Flag project and AlphaStar, the first AI to reach the highest league of
human players in a widespread professional esport without simplification of the game.
Neural networks are the models responsible for the deep learning revolution since 2006,
but their foundations go back as far as the 1960s. In this lecture we will go through the
basics of how these models operate, learn and solve problems. We will also establish various
terminology and naming conventions to prepare attendees for further, more advanced talks.
Finally, we will briefly touch upon more research-oriented directions of neural network
design and development.
TODAY’S LECTURE

Neural Networks
Foundations
Wojciech Czarnecki

UCL x DeepMind Lectures


Plan for this Lecture Private & Confidential

01 02 03
Overview Neural Networks Learning

04 05 06
Pieces of the puzzle Practical issues Bonus:
Multiplicative interactions
What is not covered in this lecture Private & Confidential

01 02 03
“Old school” Biologically plausible Other

- (Restricted) - Spiking networks - Capsules


Boltzmann - Physical Simulators - Graph networks
Machines - Neural
- Deep Differential
Belief Equations
Networks - Convolutional
- Hopfield Networks Networks
- Self Organising - Recurrent
Maps Neural
Networks
1 Overview

Computer Vision, Text and Speech, Control

Compute, Data, Modularity
The deep learning puzzle

[Diagram: Data flows through a graph of Nodes into a Loss, which also receives the Target.]

Each node has to answer two questions:
- What to output?
- How to adjust this input, if my output needs to change?
  (i.e. the node has to be differentiable wrt. its inputs)
2 Neural
networks

UCL x DeepMind Lectures


Real neuron

Want to learn more?
Hodgkin AL, Huxley AF. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4): 500-544 (1952)

[Diagram: a neuron with Soma, Axon and Dendrites labelled.]

- Connected to others
- Represents simple computation
- Has inhibition and excitation connections
- Has a state
- Outputs spikes

The human brain is estimated to contain around 86,000,000,000 such neurons.
Each is connected to thousands of other neurons.
Artificial neuron

Want to learn more?
McCulloch, Warren S.; Pitts, Walter. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4): 115-133 (1943)

[Diagram: an artificial neuron with "Soma", "Axon" and "Dendrite" analogues labelled.]

- Easy to compose
- Represents simple computation
- Has inhibition and excitation connections
- Is stateless wrt. time
- Outputs real values

The goal of simple artificial neuron models is to reflect some neurophysiological observations, not to reproduce their dynamics.
Linear layer

Want to learn more?
Jouppi, Norman P. et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. 44th International Symposium on Computer Architecture (ISCA) (2017)

- Easy to compose
- Collection of artificial neurons
- Can be efficiently vectorised
- Fits highly optimised hardware (GPU/TPU)

In Machine Learning, "linear" really means affine.
Neurons in a layer are often called units.
Parameters are often called weights.
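As a concrete, editor-added illustration of the above, here is a minimal NumPy sketch of a linear layer acting on a batch of inputs; the shapes and random initialisation are assumptions made for the example, not part of the lecture.

import numpy as np

def linear_layer(x, W, b):
    # x: [batch, d_in], W: [d_in, d_out], b: [d_out]
    # "Linear" in machine learning really means affine: y = xW + b.
    return x @ W + b

# Hypothetical sizes, chosen only for the sketch.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 784))           # a batch of 32 flattened 28x28 images
W = rng.normal(size=(784, 10)) * 0.01    # weights (the layer's parameters)
b = np.zeros(10)                         # biases
y = linear_layer(x, W, b)                # shape (32, 10): one unit per output class

Because the whole layer is a single matrix multiplication plus a broadcasted addition, it vectorises trivially and maps directly onto GPU/TPU hardware.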
Isn't this just linear regression?

[Diagram: Data → Node → Node → Node → Loss ← Target]
Single layer neural networks

[Diagram: Data → Linear → Node → Loss ← Target]
Sigmoid activation function

Want to learn more?
Hinton G. Deep belief networks. Scholarpedia, 4(5): 5947 (2009)

- Introduces non-linear behaviour
- Produces probability estimate
- Has simple derivatives
- Saturates
- Derivatives vanish

Activation functions are often called non-linearities.
Activation functions are applied point-wise.
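A short NumPy sketch (editor's addition) of the sigmoid and its derivative, illustrating both the simple closed-form derivative and the saturation / vanishing-derivative behaviour listed above:

import numpy as np

def sigmoid(x):
    # Applied point-wise; squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # The derivative has the simple closed form s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, 0.0, 10.0])
print(sigmoid(xs))             # ~[0.00005, 0.5, 0.99995]
print(sigmoid_derivative(xs))  # ~[0.00005, 0.25, 0.00005]  (vanishes when saturated)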
[Diagram: Data → Linear → Sigmoid → Loss ← Target]
Cross entropy

Want to learn more?
Murphy, Kevin. Machine Learning: A Probabilistic Perspective (2012)

- Encodes negation of logarithm of probability of correct classification
- Composable with sigmoid
- Numerically unstable

Cross entropy loss is also called negative log likelihood or logistic loss.
The simplest "neural" classifier

[Diagram: Data → Linear → Sigmoid → Cross entropy ← Target]

Want to learn more?
Cramer, J. S. The origins of logistic regression (Technical report). 119. Tinbergen Institute. pp. 167-178 (2002)

- Encodes negation of logarithm of probability of entirely correct classification
- Equivalent to logistic regression model
- Numerically unstable

Cross entropy loss is also called negative log likelihood or logistic loss.
Being additive over samples allows for efficient learning.
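To make the equivalence with logistic regression concrete, here is a hedged sketch (editor's addition) of the Linear → Sigmoid → Cross entropy pipeline for binary targets t in {0, 1}; the clipping constant is one crude way of handling the numerical instability mentioned above.

import numpy as np

def binary_cross_entropy(p, t, eps=1e-7):
    # Negative log likelihood of the correct class.
    p = np.clip(p, eps, 1.0 - eps)       # guard against log(0)
    return -np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

def simplest_neural_classifier(x, w, b, t):
    logits = x @ w + b                   # Linear
    p = 1.0 / (1.0 + np.exp(-logits))    # Sigmoid
    return binary_cross_entropy(p, t)    # Cross entropy

This is exactly the logistic regression model, and because the loss is an average over samples it can be minimised efficiently on mini-batches.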
Softmax

Want to learn more?
Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Softmax Units for Multinoulli Output Distributions. Deep Learning. MIT Press. pp. 180-184 (2016)

- Multi-dimensional generalisation of sigmoid
- Produces probability estimate
- Has simple derivatives
- Saturates
- Derivatives vanish

Softmax is the most commonly used final activation in classification.
It can also be used as a smooth version of the maximum.
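A numerically stable softmax in NumPy (editor's sketch; the max-subtraction trick is the standard way to avoid the overflow that motivates the combined softmax + cross entropy node discussed next):

import numpy as np

def softmax(x, axis=-1):
    # Subtracting the per-row maximum leaves the result unchanged but
    # keeps the exponentials from overflowing.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
print(softmax(logits))   # second row is [1/3, 1/3, 1/3], with no overflow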
Softmax + Cross entropy

[Diagram: Data → Linear → Softmax → Cross entropy ← Target]

Want to learn more?
Martins, Andre, and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. International Conference on Machine Learning (2016)

- Encodes negation of logarithm of probability of entirely correct classification
- Equivalent to multinomial logistic regression model
- Numerically stable combination

Widely used not only in classification but also in RL.
Cannot represent sparse outputs (see sparsemax).
Does not scale too well with the number of classes k.
Uses

- Handwritten digit recognition at the 92% level.
- High-dimensional spaces are surprisingly easy to shatter with hyperplanes.
- Widely used in commercial applications.
- For a long time a crucial model for Natural Language Processing under the name of MaxEnt (Maximum Entropy Classifier).
... and limitations
Two layer neural networks

[Diagram: Data → Linear → Sigmoid → Linear → Softmax → Cross entropy ← Target]
1-hidden layer network vs XOR

[Diagram: Data → Linear → Sigmoid → Linear → Softmax → Cross entropy ← Target]

Want to learn more?
Blum, E. K. Approximation of Boolean functions by sigmoidal networks: Part I: XOR and other two-variable functions. Neural Computation 1.4: 532-540 (1989)

- With just 2 hidden neurons we solve XOR
- The hidden layer allows us to bend and twist the input space
- We use a linear model on top to do the classification

The hidden layer provides a non-linear input space transformation so that the final linear layer can classify.
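To make the "2 hidden neurons solve XOR" claim tangible, here is a hand-constructed network (editor's addition; the weights are chosen by hand rather than learned, and the scale k just pushes the sigmoids towards 0/1):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

k = 10.0
W1 = np.array([[k, k],
               [k, k]])                 # both hidden units look at x1 + x2
b1 = np.array([-0.5 * k, -1.5 * k])     # unit 1: "at least one on", unit 2: "both on"
w2 = np.array([k, -k])                  # fire if unit 1 is on but unit 2 is not
b2 = -0.5 * k

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
h = sigmoid(X @ W1 + b1)                # hidden layer bends the input space
y = sigmoid(h @ w2 + b2)                # a linear model on top does the classification
print(np.round(y, 2))                   # ~[0, 1, 1, 0]  -- XOR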
http://playground.tensorflow.org/ by Daniel Smilkov and Shan Carter
Universal Approximation Theorem

Want to learn more?
Cybenko, G. Approximations by superpositions of sigmoidal functions. Mathematics of Control, Signals, and Systems, 2(4): 303-314 (1989)

For any continuous function from a hypercube [0,1]^d to real numbers, and every positive epsilon, there exists a sigmoid-based, 1-hidden layer neural network that obtains at most epsilon error in functional space.

- One of the most important theoretical results for Neural Networks
- Shows that they are extremely expressive
- Tells us nothing about learning
- Size of the network grows exponentially

A big enough network can approximate, but not represent, any smooth function.
The math trick is to show that networks are dense in the space of target functions.
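Written out formally (a standard statement of Cybenko's result, restated here since the slide's formula did not survive extraction): for every continuous function on the hypercube and every epsilon there is a finite sum of shifted, scaled sigmoids within epsilon of it,

\forall f \in C([0,1]^d),\ \forall \varepsilon > 0,\ \exists N,\ \{v_i, b_i \in \mathbb{R},\ \mathbf{w}_i \in \mathbb{R}^d\}_{i=1}^{N} :\quad \sup_{\mathbf{x} \in [0,1]^d} \Big| f(\mathbf{x}) - \sum_{i=1}^{N} v_i\,\sigma(\mathbf{w}_i^{\top}\mathbf{x} + b_i) \Big| < \varepsilon .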
Universal Approximation Theorem

Want to learn more?
Kurt Hornik. Approximation Capabilities of Multilayer Feedforward Networks. Neural Networks, 4(2): 251-257 (1991)

For any continuous function from a hypercube [0,1]^d to real numbers, any non-constant, bounded and continuous activation function f, and every positive epsilon, there exists a 1-hidden layer neural network using f that obtains at most epsilon error in functional space.

- One of the most important theoretical results for Neural Networks
- Shows that they are extremely expressive
- Tells us nothing about learning
- Size of the network grows exponentially

A big enough network can approximate, but not represent, any smooth function.
The math trick is to show that networks are dense in the space of target functions.
Universal Approximation Theorem Intuition

[Figures: step-by-step intuition slides; images not included in this transcript.]
http://playground.tensorflow.org/ by Daniel Smilkov and Shan Carter

Deep neural networks

[Diagram: Data → Linear → Node → Linear → Node → Linear → Node → Linear → Node → Loss ← Target]
Rectified Linear Unit (ReLU)

Want to learn more?
Hahnloser, R.; Sarpeshkar, R.; Mahowald, M. A.; Douglas, R. J.; Seung, H. S. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405: 947-951 (2000)

- Introduces non-linear behaviour
- Creates piecewise linear functions
- Derivatives do not vanish
- Dead neurons can occur
- Technically not differentiable at 0

One of the most commonly used activation functions.
Made math analysis of networks much simpler.
[Diagram: Data → Linear → ReLU → Linear → ReLU → Linear → ReLU → Linear → Softmax → Loss ← Target, with successive stages labelled: lines/corners detection, shapes detection, object detection, class detection.]
Depth

Want to learn more?
Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio. On the Number of Linear Regions of Deep Neural Networks. arXiv (2014)

- Expressing symmetries and regularities is much easier with a deep model than a wide one.
- A deep model means many non-linear compositions, and thus harder learning.

The number of linear regions grows exponentially with depth, and polynomially with width.
Neural networks as computational graphs

[Diagram: Data → Linear → ReLU → Linear → ReLU → Linear → ReLU → Linear → Softmax → Cross Entropy ← Target]
3 Learning

UCL x DeepMind Lectures


Linear algebra recap

Gradient and Jacobian (definitions below)
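The two definitions behind that slide (standard multivariable calculus, added here because the slide's equations were lost in extraction): for a scalar-valued f : \mathbb{R}^n \to \mathbb{R} the gradient is

\nabla f(\mathbf{x}) = \Big[ \tfrac{\partial f}{\partial x_1}(\mathbf{x}), \ldots, \tfrac{\partial f}{\partial x_n}(\mathbf{x}) \Big]^{\top},

and for a vector-valued f : \mathbb{R}^n \to \mathbb{R}^m the Jacobian is the m \times n matrix of all partial derivatives,

[\mathbf{J}_f(\mathbf{x})]_{ij} = \tfrac{\partial f_i}{\partial x_j}(\mathbf{x}).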
Gradient descent recap

Want to learn more?
Kingma, Diederik P., and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

- Works for any "smooth enough" function
- Can be used on non-smooth targets, but with fewer guarantees
- Converges to a local optimum

Choice of learning rate is critical.
Main learning algorithm behind deep learning.
Many modifications: Adam, RMSProp, ...
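The basic update behind all of this is theta ← theta − alpha * ∇L(theta); a minimal sketch (editor's addition, with a toy quadratic objective as the example):

import numpy as np

def gradient_descent(grad_fn, theta0, learning_rate=0.1, num_steps=100):
    # grad_fn(theta) must return the gradient of the loss at theta.
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - learning_rate * grad_fn(theta)
    return theta

# Toy example: minimise f(theta) = ||theta||^2, whose gradient is 2 * theta.
print(gradient_descent(lambda th: 2.0 * th, theta0=[3.0, -2.0]))  # -> close to [0, 0]

Adam, RMSProp and friends keep the same loop but replace the raw gradient with running, per-parameter rescaled estimates.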
Neural networks as computational graphs - API

Each node exposes a forward pass (compute outputs from inputs) and a backward pass (map the gradient wrt. outputs to gradients wrt. inputs and parameters).
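A sketch of that API in Python (editor's addition; this conveys the idea of the interface, not any particular library's):

import numpy as np

class Node:
    def forward(self, x):
        # Compute the output from the input; cache whatever the
        # backward pass will need.
        raise NotImplementedError

    def backward(self, grad_output):
        # Map dLoss/dOutput to dLoss/dInput (and accumulate
        # dLoss/dParameters if the node has any).
        raise NotImplementedError

class Sigmoid(Node):
    def forward(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y

    def backward(self, grad_output):
        return grad_output * self.y * (1.0 - self.y)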
Gradient descent and computational graph

Want to learn more?
Abadi, Martín, et al. Tensorflow: A system for large-scale machine learning. 12th Symposium on Operating Systems Design and Implementation (2016)
Chain rule, backprop and automatic differentiation
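The equations on those slides were lost in extraction; the underlying result is just the chain rule applied backwards through the graph. With h_0 = \mathbf{x}, h_k = f_k(h_{k-1}) and L = \ell(h_n),

\frac{\partial L}{\partial h_{k-1}} = \mathbf{J}_{f_k}(h_{k-1})^{\top}\,\frac{\partial L}{\partial h_k},

so reverse-mode automatic differentiation (backprop) computes these vector-Jacobian products starting from \partial L / \partial h_n and walking back to the inputs and parameters, reusing every intermediate result exactly once.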
Linear layer as a computational graph

[Diagram: inputs x and weights W feed a "dot" node, the bias b is added ("+"), producing y.]

- Note the symmetry between weights and inputs.
- Note that the backward pass is a computational graph itself.
- Biases are adjusted proportionally to the error.
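A NumPy sketch of both passes (editor's addition; shapes are assumed as x: [batch, d_in], W: [d_in, d_out]):

import numpy as np

def linear_forward(x, W, b):
    return x @ W + b                      # y = xW + b

def linear_backward(x, W, grad_y):
    # grad_y is dLoss/dy with shape [batch, d_out].
    grad_x = grad_y @ W.T                 # note the symmetry between W and x
    grad_W = x.T @ grad_y
    grad_b = grad_y.sum(axis=0)           # biases are adjusted proportionally to the error
    return grad_x, grad_W, grad_b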
ReLU as a computational graph

[Diagram: x → relu → y]

- Can be seen as gating the incoming gradients: the ones going through neurons that were active are passed through, and the rest are zeroed.
- We usually put the "gradient" at zero to be equal to zero.
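The corresponding two functions in NumPy (editor's sketch), using the convention of a zero "gradient" at 0:

import numpy as np

def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward(x, grad_y):
    # Gate the incoming gradient: pass it where the unit was active,
    # zero it elsewhere (including at exactly 0).
    return grad_y * (x > 0)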
Softmax as a computational graph

[Diagram: x → exp → div → y, with a sum node feeding the division.]

- The backward pass is essentially a difference between the incoming gradient and our output.
- Since exponents of big numbers will cause overflow, it is rarely written explicitly like this.
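A sketch of the forward pass and of the vector-Jacobian product the backward pass computes (editor's addition; the Jacobian of softmax is diag(y) − y yᵀ):

import numpy as np

def softmax_forward(x):
    z = x - np.max(x)                     # stabilised, as discussed earlier
    e = np.exp(z)
    return e / e.sum()

def softmax_backward(y, grad_y):
    # y is the softmax output; this is (diag(y) - y y^T) applied to grad_y.
    return y * (grad_y - np.dot(grad_y, y))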
Cross entropy as a computational graph

[Diagram: predictions p → log; targets t and log p → dot → neg → L]

- Dividing by p can be numerically unstable.
- We can also backprop into the labels themselves.
- Even though it is a loss, we could still multiply its backward pass by another incoming error.
Cross entropy with logits as a computational graph

[Diagram: logits x → exp → sum → div → log; targets t and log-probabilities → dot → neg → L]

- Simplifies extremely! (the gradient with respect to the logits becomes p − t)
- We can also backprop into the labels themselves.
- For numerical stability it is usually a single operation in a computational graph.
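A stable sketch of the fused operation (editor's addition), using the log-sum-exp trick; note how the gradient with respect to the logits collapses to "probabilities minus targets":

import numpy as np

def cross_entropy_with_logits(logits, t):
    # t is a one-hot (or soft) target distribution over the classes.
    z = logits - np.max(logits)                  # log-sum-exp trick
    log_softmax = z - np.log(np.sum(np.exp(z)))
    loss = -np.dot(t, log_softmax)
    grad_logits = np.exp(log_softmax) - t        # p - t
    return loss, grad_logits

loss, grad = cross_entropy_with_logits(np.array([2.0, -1.0, 0.5]),
                                       np.array([1.0, 0.0, 0.0]))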
Example - 3 layer MLP with ReLU activations

[Diagram: x → dot → + → relu → dot → + → relu → dot → + → exp → sum → div → log → dot (with t) → neg → L; i.e. two Linear+ReLU blocks, a final Linear layer, a softmax built from exp/sum/div/log, and cross entropy against the target t. Parameters W1, b1, W2, b2, W3, b3 enter at the dot and + nodes.]

[Diagram, second version: the same graph, but W1, b1, ..., W3, b3 are obtained by slicing a single flat parameter vector θ.]
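Putting the pieces together, a hedged sketch (editor's addition) of the forward pass of the pictured graph; the backward pass would simply chain the backward functions defined above, and as in the second diagram the parameters could equally be slices of one flat vector θ:

import numpy as np

def mlp_forward(x, params):
    # params is a list of (W, b) pairs: [(W1, b1), (W2, b2), (W3, b3)].
    (W1, b1), (W2, b2), (W3, b3) = params
    h = np.maximum(x @ W1 + b1, 0.0)     # dot, +, relu
    h = np.maximum(h @ W2 + b2, 0.0)     # dot, +, relu
    logits = h @ W3 + b3                 # dot, +
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)                        # exp, sum, div: softmax output
    return e / e.sum(axis=-1, keepdims=True)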
4 Pieces of the
puzzle

UCL x DeepMind Lectures


Max as a computational graph

[Diagram: x → max → y]

- Gradients only flow through the selected element; consequently we are not learning how to select.
- Used in max pooling.
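A minimal sketch of the gating behaviour (editor's addition):

import numpy as np

def max_forward(x):
    return np.max(x)

def max_backward(x, grad_y):
    # The gradient flows only through the selected element; the
    # selection mechanism itself receives no learning signal.
    grad_x = np.zeros_like(x)
    grad_x[np.argmax(x)] = grad_y
    return grad_x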
Conditional execution as a computational graph

[Diagram: inputs x and gates p → mul → y]

- Let's assume p is a probability distribution (e.g. one hot).
- The backward pass is gated in the same way the forward one is.
- We can learn the conditionals themselves too, just use softmax.
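A minimal sketch (editor's addition), with x a vector of candidate values and p a probability distribution over them:

import numpy as np

def gated_forward(x, p):
    return np.dot(p, x)

def gated_backward(x, p, grad_y):
    grad_x = p * grad_y     # gated the same way the forward pass was
    grad_p = x * grad_y     # we can also learn the conditional itself
    return grad_x, grad_p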
Quadratic loss as a computational graph

[Diagram: predictions and targets t → sqr → sum → L]

- The backward pass is just a difference in predictions.
- Learning the targets is analogous.
- Typical loss for all regression problems (e.g. value function fitting).
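A minimal sketch (editor's addition); up to the factor of 2 from the square, the backward pass is just the difference between predictions and targets:

import numpy as np

def quadratic_loss(y, t):
    return np.sum((y - t) ** 2)

def quadratic_loss_backward(y, t):
    grad_y = 2.0 * (y - t)    # difference in predictions (times 2)
    grad_t = -grad_y          # learning the targets is analogous
    return grad_y, grad_t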
5 Practical
issues

UCL x DeepMind Lectures


Overfitting and regularisation

Want to learn more?
Vapnik, Vladimir. The nature of statistical learning theory. Springer Science & Business Media (2013)

Regularisation techniques:
- Lp regularisation
- Dropout
- Noising data
- Early stopping
- Batch/Layer norm

Figure from Belkin et al. (2019)

- As your model gets more powerful, it can create extremely complex hypotheses, even if they are not needed.
- Keeping things simple guarantees that if the training error is small, so will the test error be.

Classical results from statistics and Statistical Learning Theory analyse the worst case scenario.
Overfitting and regularisation

Want to learn more?
Belkin, Mikhail, et al. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences 116.32 (2019)

Figure from Belkin et al. (2019)

- As models grow, their learning dynamics changes, and they become less prone to overfitting.
- New, exciting theoretical results, also mapping these huge networks onto Gaussian Processes.

New results take learning effects into consideration.
Overfitting and regularisation

Want to learn more?
Nakkiran, Preetum, et al. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292 (2019)

Figure from Nakkiran et al. (2019)

- Even big models still need (can benefit from) regularisation techniques.
- We need new notions of effective complexity of our hypothesis classes.

Model complexity is not as simple as the number of parameters.
Diagnosing and debugging

Want to learn more?
Karpathy A. A Recipe for Training Neural Networks. http://karpathy.github.io/2019/04/25/recipe/ (2019)

- Initialisation matters
- Overfit a small sample
- Monitor training loss
- Monitor weights norms and NaNs
- Add shape asserts
- Start with Adam
- Change one thing at a time

It is always worth spending time on verifying correctness.
Be suspicious of good results more than bad ones.
Experience is key, just keep trying!
6 Bonus:
Multiplicative
interactions

UCL x DeepMind Lectures


What can MLPs not do?

f(x,z) = 〈x,z〉
Multiplicative interactions

Want to learn more?
Siddhant M. Jayakumar et al. Multiplicative Interactions and Where to Find Them. Proceedings of the International Conference on Learning Representations (2019)

- Multiplicative units unify attention, metric learning and many others.
- They enrich the hypothesis space of regular neural networks in a meaningful way.

Being able to approximate something is not the same as being able to represent it.
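As a concrete illustration (editor's addition) of why this matters: a single multiplicative (bilinear) unit xᵀWz represents f(x, z) = ⟨x, z⟩ exactly (take W = I), whereas a plain MLP can only approximate it:

import numpy as np

def bilinear_unit(x, z, W):
    # A single multiplicative interaction: x^T W z.
    return x @ W @ z

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(bilinear_unit(x, z, np.eye(3)), np.dot(x, z))   # both print 4.5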
If you want to do research on the fundamental building blocks of Neural Networks, do not seek to marginally improve the way they behave by finding a new activation function.

Ask yourself what current modules cannot represent or guarantee right now, and propose a module that can.
Thank you
Questions
