L2 - UCLxDeepMind DL2020


WELCOME TO THE

UCL x DeepMind
lecture series
In this lecture series, research scientists from the leading AI research lab DeepMind will give
12 lectures on an exciting selection of topics in Deep Learning, ranging from the fundamentals
of training neural networks, via advanced ideas around memory, attention, and generative
modelling, to the important topic of responsible innovation.
Please join us for a deep dive into Deep Learning!

#UCLxDeepMind
General information

Exits: At the back, the way you came in
Wifi: UCL guest
TODAY’S SPEAKER

Wojciech Czarnecki
Wojciech Czarnecki is a Research Scientist at DeepMind. He obtained his PhD from the
Jagiellonian University in Cracow, during which he worked on the intersection of machine
learning, information theory and cheminformatics. Since joining DeepMind in 2016, Wojciech
has worked mainly on deep reinforcement learning, with a focus on multi-agent systems such as
the recent Capture the Flag project and AlphaStar, the first AI to reach the highest league of
human players in a widespread professional esport without simplification of the game.
Neural networks are the models responsible for the deep learning revolution since 2006,
but their foundations go back as far as the 1960s. In this lecture we will go through the
basics of how these models operate, learn and solve problems. We will also establish various
terminology and naming conventions to prepare attendees for further, more advanced talks.
Finally, we will briefly touch upon more research-oriented directions of neural network
design and development.
TODAY’S LECTURE

Neural Networks
Foundations
Wojciech Czarnecki

UCL x DeepMind Lectures


Plan for this Lecture Private & Confidential

01 02 03
Overview Neural Networks Learning

04 05 06
Pieces of the puzzle Practical issues Bonus:
Multiplicative interactions
What is not covered in this lecture Private & Confidential

01 02 03
“Old school” Biologically plausible Other

- (Restricted) - Spiking networks - Capsules


Boltzmann - Physical Simulators - Graph networks
Machines - Neural
- Deep Differential
Belief Equations
Networks - Convolutional
- Hopfield Networks Networks
- Self Organising - Recurrent
Maps Neural
Networks
1 Overview

Computer Vision, Text and Speech, Control

Compute, Data, Modularity
The deep learning puzzle

[Diagram: Data flows through a graph of Nodes into a Loss, which also receives the Target.]

Each node has to answer two questions:
- What to output?
- How to adjust this input, if my output needs to change?
  (i.e. the node has to be differentiable wrt. its inputs)
2 Neural
networks

UCL x DeepMind Lectures


Real neuron

Want to learn more?
Hodgkin AL, Huxley AF. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4): 500-544 (1952)

[Diagram: a neuron with Soma, Axon and Dendrites labelled.]

- Connected to others
- Represents simple computation
- Has inhibition and excitation connections
- Has a state
- Outputs spikes

The human brain is estimated to contain around 86,000,000,000 such neurons.
Each is connected to thousands of other neurons.
Artificial neuron

Want to learn more?
McCulloch, Warren S.; Pitts, Walter. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4): 115-133 (1943)

[Diagram: an artificial neuron with "Soma", "Axon" and "Dendrite" analogues labelled.]

- Easy to compose
- Represents simple computation
- Has inhibition and excitation connections
- Is stateless wrt. time
- Outputs real values

The goal of simple artificial neuron models is to reflect some neurophysiological observations, not to reproduce their dynamics.
Linear layer

Want to learn more?
Jouppi, Norman P. et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. 44th International Symposium on Computer Architecture (ISCA) (2017)

- Easy to compose
- Collection of artificial neurons
- Can be efficiently vectorised
- Fits highly optimised hardware (GPU/TPU)

In Machine Learning, "linear" really means affine.
Neurons in a layer are often called units.
Parameters are often called weights.
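As a concrete, editor-added illustration of the above, here is a minimal NumPy sketch of a linear layer acting on a batch of inputs; the shapes and random initialisation are assumptions made for the example, not part of the lecture.

import numpy as np

def linear_layer(x, W, b):
    # x: [batch, d_in], W: [d_in, d_out], b: [d_out]
    # "Linear" in machine learning really means affine: y = xW + b.
    return x @ W + b

# Hypothetical sizes, chosen only for the sketch.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 784))           # a batch of 32 flattened 28x28 images
W = rng.normal(size=(784, 10)) * 0.01    # weights (the layer's parameters)
b = np.zeros(10)                         # biases
y = linear_layer(x, W, b)                # shape (32, 10): one unit per output class

Because the whole layer is a single matrix multiplication plus a broadcasted addition, it vectorises trivially and maps directly onto GPU/TPU hardware.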
Isn't this just linear regression?

[Diagram: Data → Node → Node → Node → Loss ← Target]
Single layer neural networks

[Diagram: Data → Linear → Node → Loss ← Target]
Sigmoid activation function

Want to learn more?
Hinton G. Deep belief networks. Scholarpedia, 4(5): 5947 (2009)

- Introduces non-linear behaviour
- Produces probability estimate
- Has simple derivatives
- Saturates
- Derivatives vanish

Activation functions are often called non-linearities.
Activation functions are applied point-wise.
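A short NumPy sketch (editor's addition) of the sigmoid and its derivative, illustrating both the simple closed-form derivative and the saturation / vanishing-derivative behaviour listed above:

import numpy as np

def sigmoid(x):
    # Applied point-wise; squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # The derivative has the simple closed form s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, 0.0, 10.0])
print(sigmoid(xs))             # ~[0.00005, 0.5, 0.99995]
print(sigmoid_derivative(xs))  # ~[0.00005, 0.25, 0.00005]  (vanishes when saturated)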
[Diagram: Data → Linear → Sigmoid → Loss ← Target]
Cross entropy

Want to learn more?
Murphy, Kevin. Machine Learning: A Probabilistic Perspective (2012)

- Encodes negation of logarithm of probability of correct classification
- Composable with sigmoid
- Numerically unstable

Cross entropy loss is also called negative log likelihood or logistic loss.
The simplest "neural" classifier

[Diagram: Data → Linear → Sigmoid → Cross entropy ← Target]

Want to learn more?
Cramer, J. S. The origins of logistic regression (Technical report). 119. Tinbergen Institute. pp. 167-178 (2002)

- Encodes negation of logarithm of probability of entirely correct classification
- Equivalent to logistic regression model
- Numerically unstable

Cross entropy loss is also called negative log likelihood or logistic loss.
Being additive over samples allows for efficient learning.
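To make the equivalence with logistic regression concrete, here is a hedged sketch (editor's addition) of the Linear → Sigmoid → Cross entropy pipeline for binary targets t in {0, 1}; the clipping constant is one crude way of handling the numerical instability mentioned above.

import numpy as np

def binary_cross_entropy(p, t, eps=1e-7):
    # Negative log likelihood of the correct class.
    p = np.clip(p, eps, 1.0 - eps)       # guard against log(0)
    return -np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

def simplest_neural_classifier(x, w, b, t):
    logits = x @ w + b                   # Linear
    p = 1.0 / (1.0 + np.exp(-logits))    # Sigmoid
    return binary_cross_entropy(p, t)    # Cross entropy

This is exactly the logistic regression model, and because the loss is an average over samples it can be minimised efficiently on mini-batches.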
Softmax

Want to learn more?
Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Softmax Units for Multinoulli Output Distributions. Deep Learning. MIT Press. pp. 180-184 (2016)

- Multi-dimensional generalisation of sigmoid
- Produces probability estimate
- Has simple derivatives
- Saturates
- Derivatives vanish

Softmax is the most commonly used final activation in classification.
It can also be used as a smooth version of the maximum.
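A numerically stable softmax in NumPy (editor's sketch; the max-subtraction trick is the standard way to avoid the overflow that motivates the combined softmax + cross entropy node discussed next):

import numpy as np

def softmax(x, axis=-1):
    # Subtracting the per-row maximum leaves the result unchanged but
    # keeps the exponentials from overflowing.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
print(softmax(logits))   # second row is [1/3, 1/3, 1/3], with no overflow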
Softmax + Cross entropy

[Diagram: Data → Linear → Softmax → Cross entropy ← Target]

Want to learn more?
Martins, Andre, and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. International Conference on Machine Learning (2016)

- Encodes negation of logarithm of probability of entirely correct classification
- Equivalent to multinomial logistic regression model
- Numerically stable combination

Widely used not only in classification but also in RL.
Cannot represent sparse outputs (see sparsemax).
Does not scale too well with the number of classes k.
Uses

- Handwritten digit recognition at the 92% level.
- High-dimensional spaces are surprisingly easy to shatter with hyperplanes.
- Widely used in commercial applications.
- For a long time a crucial model for Natural Language Processing under the name of MaxEnt (Maximum Entropy Classifier).
... and limitations
Two layer neural networks

[Diagram: Data → Linear → Sigmoid → Linear → Softmax → Cross entropy ← Target]
1-hidden layer network vs XOR

[Diagram: Data → Linear → Sigmoid → Linear → Softmax → Cross entropy ← Target]

Want to learn more?
Blum, E. K. Approximation of Boolean functions by sigmoidal networks: Part I: XOR and other two-variable functions. Neural Computation 1.4: 532-540 (1989)

- With just 2 hidden neurons we solve XOR
- The hidden layer allows us to bend and twist the input space
- We use a linear model on top to do the classification

The hidden layer provides a non-linear input space transformation so that the final linear layer can classify.
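To make the "2 hidden neurons solve XOR" claim tangible, here is a hand-constructed network (editor's addition; the weights are chosen by hand rather than learned, and the scale k just pushes the sigmoids towards 0/1):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

k = 10.0
W1 = np.array([[k, k],
               [k, k]])                 # both hidden units look at x1 + x2
b1 = np.array([-0.5 * k, -1.5 * k])     # unit 1: "at least one on", unit 2: "both on"
w2 = np.array([k, -k])                  # fire if unit 1 is on but unit 2 is not
b2 = -0.5 * k

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
h = sigmoid(X @ W1 + b1)                # hidden layer bends the input space
y = sigmoid(h @ w2 + b2)                # a linear model on top does the classification
print(np.round(y, 2))                   # ~[0, 1, 1, 0]  -- XOR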
http://playground.tensorflow.org/ by Daniel Smilkov and Shan Carter
Universal Approximation Theorem

Want to learn more?
Cybenko, G. Approximations by superpositions of sigmoidal functions. Mathematics of Control, Signals, and Systems, 2(4): 303-314 (1989)

For any continuous function from a hypercube [0,1]^d to real numbers, and every positive epsilon, there exists a sigmoid-based, 1-hidden layer neural network that obtains at most epsilon error in functional space.

- One of the most important theoretical results for Neural Networks
- Shows that they are extremely expressive
- Tells us nothing about learning
- Size of the network grows exponentially

A big enough network can approximate, but not represent, any smooth function.
The math trick is to show that networks are dense in the space of target functions.
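Written out formally (a standard statement of Cybenko's result, restated here since the slide's formula did not survive extraction): for every continuous function on the hypercube and every epsilon there is a finite sum of shifted, scaled sigmoids within epsilon of it,

\forall f \in C([0,1]^d),\ \forall \varepsilon > 0,\ \exists N,\ \{v_i, b_i \in \mathbb{R},\ \mathbf{w}_i \in \mathbb{R}^d\}_{i=1}^{N} :\quad \sup_{\mathbf{x} \in [0,1]^d} \Big| f(\mathbf{x}) - \sum_{i=1}^{N} v_i\,\sigma(\mathbf{w}_i^{\top}\mathbf{x} + b_i) \Big| < \varepsilon .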
Universal Approximation Theorem

Want to learn more?
Kurt Hornik. Approximation Capabilities of Multilayer Feedforward Networks. Neural Networks, 4(2): 251-257 (1991)

For any continuous function from a hypercube [0,1]^d to real numbers, any non-constant, bounded and continuous activation function f, and every positive epsilon, there exists a 1-hidden layer neural network using f that obtains at most epsilon error in functional space.

- One of the most important theoretical results for Neural Networks
- Shows that they are extremely expressive
- Tells us nothing about learning
- Size of the network grows exponentially

A big enough network can approximate, but not represent, any smooth function.
The math trick is to show that networks are dense in the space of target functions.
Universal Approximation Theorem Intuition

[Figures: step-by-step intuition slides; images not included in this transcript.]
http://playground.tensorflow.org/ by Daniel Smilkov and Shan Carter

Deep neural networks

[Diagram: Data → Linear → Node → Linear → Node → Linear → Node → Linear → Node → Loss ← Target]
Rectified Linear Unit (ReLU)

Want to learn more?
Hahnloser, R.; Sarpeshkar, R.; Mahowald, M. A.; Douglas, R. J.; Seung, H. S. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405: 947-951 (2000)

- Introduces non-linear behaviour
- Creates piecewise linear functions
- Derivatives do not vanish
- Dead neurons can occur
- Technically not differentiable at 0

One of the most commonly used activation functions.
Made math analysis of networks much simpler.
[Diagram: Data → Linear → ReLU → Linear → ReLU → Linear → ReLU → Linear → Softmax → Loss ← Target, with successive stages labelled: lines/corners detection, shapes detection, object detection, class detection.]
Depth

Want to learn more?
Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio. On the Number of Linear Regions of Deep Neural Networks. arXiv (2014)

- Expressing symmetries and regularities is much easier with a deep model than a wide one.
- A deep model means many non-linear compositions, and thus harder learning.

The number of linear regions grows exponentially with depth, and polynomially with width.
Neural networks as computational graphs

[Diagram: Data → Linear → ReLU → Linear → ReLU → Linear → ReLU → Linear → Softmax → Cross Entropy ← Target]
3 Learning

UCL x DeepMind Lectures


Linear algebra recap

Gradient and Jacobian (definitions below)
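The two definitions behind that slide (standard multivariable calculus, added here because the slide's equations were lost in extraction): for a scalar-valued f : \mathbb{R}^n \to \mathbb{R} the gradient is

\nabla f(\mathbf{x}) = \Big[ \tfrac{\partial f}{\partial x_1}(\mathbf{x}), \ldots, \tfrac{\partial f}{\partial x_n}(\mathbf{x}) \Big]^{\top},

and for a vector-valued f : \mathbb{R}^n \to \mathbb{R}^m the Jacobian is the m \times n matrix of all partial derivatives,

[\mathbf{J}_f(\mathbf{x})]_{ij} = \tfrac{\partial f_i}{\partial x_j}(\mathbf{x}).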
Gradient descent recap

Want to learn more?
Kingma, Diederik P., and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

- Works for any "smooth enough" function
- Can be used on non-smooth targets, but with fewer guarantees
- Converges to a local optimum

Choice of learning rate is critical.
Main learning algorithm behind deep learning.
Many modifications: Adam, RMSProp, ...
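The basic update behind all of this is theta ← theta − alpha * ∇L(theta); a minimal sketch (editor's addition, with a toy quadratic objective as the example):

import numpy as np

def gradient_descent(grad_fn, theta0, learning_rate=0.1, num_steps=100):
    # grad_fn(theta) must return the gradient of the loss at theta.
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - learning_rate * grad_fn(theta)
    return theta

# Toy example: minimise f(theta) = ||theta||^2, whose gradient is 2 * theta.
print(gradient_descent(lambda th: 2.0 * th, theta0=[3.0, -2.0]))  # -> close to [0, 0]

Adam, RMSProp and friends keep the same loop but replace the raw gradient with running, per-parameter rescaled estimates.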
Neural networks as computational graphs - API

Each node exposes a forward pass (compute outputs from inputs) and a backward pass (map the gradient wrt. outputs to gradients wrt. inputs and parameters).
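A sketch of that API in Python (editor's addition; this conveys the idea of the interface, not any particular library's):

import numpy as np

class Node:
    def forward(self, x):
        # Compute the output from the input; cache whatever the
        # backward pass will need.
        raise NotImplementedError

    def backward(self, grad_output):
        # Map dLoss/dOutput to dLoss/dInput (and accumulate
        # dLoss/dParameters if the node has any).
        raise NotImplementedError

class Sigmoid(Node):
    def forward(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y

    def backward(self, grad_output):
        return grad_output * self.y * (1.0 - self.y)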
Gradient descent and computational graph

Want to learn more?
Abadi, Martín, et al. Tensorflow: A system for large-scale machine learning. 12th Symposium on Operating Systems Design and Implementation (2016)
Chain rule, backprop and automatic differentiation
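The equations on those slides were lost in extraction; the underlying result is just the chain rule applied backwards through the graph. With h_0 = \mathbf{x}, h_k = f_k(h_{k-1}) and L = \ell(h_n),

\frac{\partial L}{\partial h_{k-1}} = \mathbf{J}_{f_k}(h_{k-1})^{\top}\,\frac{\partial L}{\partial h_k},

so reverse-mode automatic differentiation (backprop) computes these vector-Jacobian products starting from \partial L / \partial h_n and walking back to the inputs and parameters, reusing every intermediate result exactly once.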
Linear layer as a computational graph

[Diagram: inputs x and weights W feed a "dot" node, the bias b is added ("+"), producing y.]

- Note the symmetry between weights and inputs.
- Note that the backward pass is a computational graph itself.
- Biases are adjusted proportionally to the error.
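A NumPy sketch of both passes (editor's addition; shapes are assumed as x: [batch, d_in], W: [d_in, d_out]):

import numpy as np

def linear_forward(x, W, b):
    return x @ W + b                      # y = xW + b

def linear_backward(x, W, grad_y):
    # grad_y is dLoss/dy with shape [batch, d_out].
    grad_x = grad_y @ W.T                 # note the symmetry between W and x
    grad_W = x.T @ grad_y
    grad_b = grad_y.sum(axis=0)           # biases are adjusted proportionally to the error
    return grad_x, grad_W, grad_b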
ReLU as a computational graph

[Diagram: x → relu → y]

- Can be seen as gating the incoming gradients: the ones going through neurons that were active are passed through, and the rest are zeroed.
- We usually put the "gradient" at zero to be equal to zero.
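The corresponding two functions in NumPy (editor's sketch), using the convention of a zero "gradient" at 0:

import numpy as np

def relu_forward(x):
    return np.maximum(x, 0.0)

def relu_backward(x, grad_y):
    # Gate the incoming gradient: pass it where the unit was active,
    # zero it elsewhere (including at exactly 0).
    return grad_y * (x > 0)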
Softmax as a computational graph

[Diagram: x → exp → div → y, with a sum node feeding the division.]

- The backward pass is essentially a difference between the incoming gradient and our output.
- Since exponents of big numbers will cause overflow, it is rarely written explicitly like this.
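A sketch of the forward pass and of the vector-Jacobian product the backward pass computes (editor's addition; the Jacobian of softmax is diag(y) − y yᵀ):

import numpy as np

def softmax_forward(x):
    z = x - np.max(x)                     # stabilised, as discussed earlier
    e = np.exp(z)
    return e / e.sum()

def softmax_backward(y, grad_y):
    # y is the softmax output; this is (diag(y) - y y^T) applied to grad_y.
    return y * (grad_y - np.dot(grad_y, y))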
Cross entropy as a computational graph

[Diagram: predictions p → log; targets t and log p → dot → neg → L]

- Dividing by p can be numerically unstable.
- We can also backprop into the labels themselves.
- Even though it is a loss, we could still multiply its backward pass by another incoming error.
Cross entropy with logits as a computational graph

[Diagram: logits x → exp → sum → div → log; targets t and log-probabilities → dot → neg → L]

- Simplifies extremely! (the gradient with respect to the logits becomes p − t)
- We can also backprop into the labels themselves.
- For numerical stability it is usually a single operation in a computational graph.
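A stable sketch of the fused operation (editor's addition), using the log-sum-exp trick; note how the gradient with respect to the logits collapses to "probabilities minus targets":

import numpy as np

def cross_entropy_with_logits(logits, t):
    # t is a one-hot (or soft) target distribution over the classes.
    z = logits - np.max(logits)                  # log-sum-exp trick
    log_softmax = z - np.log(np.sum(np.exp(z)))
    loss = -np.dot(t, log_softmax)
    grad_logits = np.exp(log_softmax) - t        # p - t
    return loss, grad_logits

loss, grad = cross_entropy_with_logits(np.array([2.0, -1.0, 0.5]),
                                       np.array([1.0, 0.0, 0.0]))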
Example - 3 layer MLP with ReLU activations

[Diagram: x → dot → + → relu → dot → + → relu → dot → + → exp → sum → div → log → dot (with t) → neg → L; i.e. two Linear+ReLU blocks, a final Linear layer, a softmax built from exp/sum/div/log, and cross entropy against the target t. Parameters W1, b1, W2, b2, W3, b3 enter at the dot and + nodes.]

[Diagram, second version: the same graph, but W1, b1, ..., W3, b3 are obtained by slicing a single flat parameter vector θ.]
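Putting the pieces together, a hedged sketch (editor's addition) of the forward pass of the pictured graph; the backward pass would simply chain the backward functions defined above, and as in the second diagram the parameters could equally be slices of one flat vector θ:

import numpy as np

def mlp_forward(x, params):
    # params is a list of (W, b) pairs: [(W1, b1), (W2, b2), (W3, b3)].
    (W1, b1), (W2, b2), (W3, b3) = params
    h = np.maximum(x @ W1 + b1, 0.0)     # dot, +, relu
    h = np.maximum(h @ W2 + b2, 0.0)     # dot, +, relu
    logits = h @ W3 + b3                 # dot, +
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)                        # exp, sum, div: softmax output
    return e / e.sum(axis=-1, keepdims=True)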
4 Pieces of the
puzzle

UCL x DeepMind Lectures


Max as a computational graph

[Diagram: x → max → y]

- Gradients only flow through the selected element; consequently we are not learning how to select.
- Used in max pooling.
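A minimal sketch of the gating behaviour (editor's addition):

import numpy as np

def max_forward(x):
    return np.max(x)

def max_backward(x, grad_y):
    # The gradient flows only through the selected element; the
    # selection mechanism itself receives no learning signal.
    grad_x = np.zeros_like(x)
    grad_x[np.argmax(x)] = grad_y
    return grad_x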
Conditional execution as a computational graph

[Diagram: inputs x and gates p → mul → y]

- Let's assume p is a probability distribution (e.g. one hot).
- The backward pass is gated in the same way the forward one is.
- We can learn the conditionals themselves too, just use softmax.
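A minimal sketch (editor's addition), with x a vector of candidate values and p a probability distribution over them:

import numpy as np

def gated_forward(x, p):
    return np.dot(p, x)

def gated_backward(x, p, grad_y):
    grad_x = p * grad_y     # gated the same way the forward pass was
    grad_p = x * grad_y     # we can also learn the conditional itself
    return grad_x, grad_p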
Quadratic loss as a computational graph

[Diagram: predictions and targets t → sqr → sum → L]

- The backward pass is just a difference in predictions.
- Learning the targets is analogous.
- Typical loss for all regression problems (e.g. value function fitting).
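A minimal sketch (editor's addition); up to the factor of 2 from the square, the backward pass is just the difference between predictions and targets:

import numpy as np

def quadratic_loss(y, t):
    return np.sum((y - t) ** 2)

def quadratic_loss_backward(y, t):
    grad_y = 2.0 * (y - t)    # difference in predictions (times 2)
    grad_t = -grad_y          # learning the targets is analogous
    return grad_y, grad_t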
5 Practical
issues

UCL x DeepMind Lectures


Overfitting and regularisation

Want to learn more?
Vapnik, Vladimir. The nature of statistical learning theory. Springer Science & Business Media (2013)

Regularisation techniques:
- Lp regularisation
- Dropout
- Noising data
- Early stopping
- Batch/Layer norm

Figure from Belkin et al. (2019)

- As your model gets more powerful, it can create extremely complex hypotheses, even if they are not needed.
- Keeping things simple guarantees that if the training error is small, so will the test error be.

Classical results from statistics and Statistical Learning Theory analyse the worst case scenario.
Overfitting and regularisation

Want to learn more?
Belkin, Mikhail, et al. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences 116.32 (2019)

Figure from Belkin et al. (2019)

- As models grow, their learning dynamics changes, and they become less prone to overfitting.
- New, exciting theoretical results, also mapping these huge networks onto Gaussian Processes.

New results take learning effects into consideration.
Overfitting and regularisation

Want to learn more?
Nakkiran, Preetum, et al. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292 (2019)

Figure from Nakkiran et al. (2019)

- Even big models still need (can benefit from) regularisation techniques.
- We need new notions of effective complexity of our hypothesis classes.

Model complexity is not as simple as the number of parameters.
Diagnosing and debugging

Want to learn more?
Karpathy A. A Recipe for Training Neural Networks. http://karpathy.github.io/2019/04/25/recipe/ (2019)

- Initialisation matters
- Overfit a small sample
- Monitor training loss
- Monitor weights norms and NaNs
- Add shape asserts
- Start with Adam
- Change one thing at a time

It is always worth spending time on verifying correctness.
Be suspicious of good results more than bad ones.
Experience is key, just keep trying!
6 Bonus:
Multiplicative
interactions

UCL x DeepMind Lectures


What can MLPs not do?

f(x,z) = 〈x,z〉
Multiplicative interactions

Want to learn more?
Siddhant M. Jayakumar et al. Multiplicative Interactions and Where to Find Them. Proceedings of the International Conference on Learning Representations (2019)

- Multiplicative units unify attention, metric learning and many others.
- They enrich the hypothesis space of regular neural networks in a meaningful way.

Being able to approximate something is not the same as being able to represent it.
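As a concrete illustration (editor's addition) of why this matters: a single multiplicative (bilinear) unit xᵀWz represents f(x, z) = ⟨x, z⟩ exactly (take W = I), whereas a plain MLP can only approximate it:

import numpy as np

def bilinear_unit(x, z, W):
    # A single multiplicative interaction: x^T W z.
    return x @ W @ z

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
print(bilinear_unit(x, z, np.eye(3)), np.dot(x, z))   # both print 4.5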
If you want to do research on the fundamental building blocks of Neural Networks, do not seek to marginally improve the way they behave by finding a new activation function.

Ask yourself what current modules cannot represent or guarantee right now, and propose a module that can.
Thank you
Questions
