L2 - UCLxDeepMind DL2020
UCL x DeepMind
lecture series
In this lecture series, research scientists
from the leading AI research lab DeepMind will give
12 lectures on an exciting selection of topics
in Deep Learning, ranging from the fundamentals
of training neural networks via advanced ideas
around memory, attention, and generative
modelling to the important topic of responsible
innovation.
Please join us for a deep dive lecture series
into Deep Learning!
#UCLxDeepMind
General information
Exits: at the back, the way you came in
Wifi: UCL guest
TODAY’S SPEAKER
Wojciech Czarnecki
Wojciech Czarnecki is a Research Scientist
at DeepMind. He obtained his PhD from the
Jagiellonian University in Cracow, during which
he worked on the intersection of machine learning,
information theory and cheminformatics. Since joining
DeepMind in 2016, Wojciech has been mainly working
on deep reinforcement learning, with a focus on
multi-agent systems, such as the recent Capture the Flag
project or AlphaStar, the first AI to reach the highest
league of human players in a widespread professional
esport without simplification of the game.
Neural networks are the models responsible
for the deep learning revolution since 2006,
but their foundations go back as far as the
1960s. In this lecture we will go through the
basics of how these models operate, learn
and solve problems. We will also establish various
terminology and naming conventions to prepare
attendees for further, more advanced talks.
Finally, we will briefly touch upon more
research oriented directions of neural
network design and development.
TODAY’S LECTURE
Neural Networks Foundations
Wojciech Czarnecki
01 Overview
02 Neural Networks
03 Learning
04 Pieces of the puzzle
05 Practical issues
06 Bonus: Multiplicative interactions
What is not covered in this lecture
01 "Old school"
02 Biologically plausible
03 Other
The deep learning puzzle
[Diagram: Data → Node → Node → Node → Loss ← Target. Each node must be differentiable wrt. its inputs; the open question is what each node should output.]
2 Neural networks
Artificial neuron

Real neuron (axon, dendrite):
- Connected to others
- Represents simple computation
- Has inhibition and excitation connections

Artificial neuron (with its "axon" and "dendrite" analogues):
- Easy to compose
- Represents simple computation
- Has inhibition and excitation connections
- Is stateless wrt. time
- Outputs real values

The goal of simple artificial neuron models is to reflect some neurophysiological observations, not to reproduce their dynamics.

Want to learn more?
McCulloch, Warren S.; Pitts, Walter. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5 (4): 115–133 (1943)
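To make the bullet points above concrete, here is a minimal numpy sketch of a single artificial neuron (my own illustration, not code from the lecture; the weights and inputs below are made up):

```python
import numpy as np

def artificial_neuron(x, w, b, activation=np.tanh):
    """A single artificial neuron: a weighted sum of real-valued inputs
    plus a bias, passed through a point-wise activation function.
    It is stateless wrt. time: the output depends only on the current input."""
    pre_activation = np.dot(w, x) + b   # positive weights excite, negative weights inhibit
    return activation(pre_activation)

# Example usage with made-up numbers.
x = np.array([0.5, -1.0, 2.0])   # inputs coming from other neurons
w = np.array([0.1, -0.4, 0.3])   # connection weights
print(artificial_neuron(x, w, b=0.05))
```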
Linear layer
- Easy to compose
- Collection of artificial neurons
- Can be efficiently vectorised
- Fits highly optimised hardware (GPU/TPU)

In Machine Learning linear really means affine.
Neurons in a layer are often called units.
Parameters are often called weights.

Want to learn more?
Jouppi, Norman P. et al. In-Datacenter Performance Analysis of a Tensor Processing Unit™. 44th International Symposium on Computer Architecture (ISCA) (2017)
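A linear (really affine) layer is just the vectorised form of a whole collection of such neurons. A small sketch, with arbitrary shapes chosen only for illustration:

```python
import numpy as np

def linear_layer(X, W, b):
    """A linear (affine) layer: a collection of artificial neurons computed
    at once as a matrix product, which is what vectorises so efficiently
    on GPU/TPU hardware."""
    # X: [batch, n_in], W: [n_in, n_out] (the "weights"), b: [n_out]
    return X @ W + b

# Example: a batch of 4 inputs with 3 features mapped to 2 units.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
b = np.zeros(2)
print(linear_layer(X, W, b).shape)   # (4, 2)
```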
Isn’t this just linear regression?
Single layer neural networks
[Diagram: Data → Linear → Node → Loss ← Target]
Sigmoid activation function
- Introduces non-linear behaviour
- Produces probability estimate
- Has simple derivatives
- Saturates
- Derivatives vanish

Activation functions are often called non-linearities.
Activation functions are applied point-wise.

Want to learn more?
Hinton, G. Deep belief networks. Scholarpedia 4 (5): 5947 (2009)

[Diagram: Data → Linear → Sigmoid → Loss ← Target]
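A short numpy sketch of the sigmoid and its simple derivative, illustrating the saturation / vanishing-derivative bullet points (my own example, not lecture code):

```python
import numpy as np

def sigmoid(z):
    """Point-wise sigmoid: squashes any real value into (0, 1),
    so the output can be read as a probability estimate."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Simple derivative: sigma'(z) = sigma(z) * (1 - sigma(z)).
    For large |z| the sigmoid saturates and this derivative vanishes."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))   # 0.5, 0.25
print(sigmoid_derivative(10.0))                # ~4.5e-05: vanishing gradient
```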
Cross entropy
- Encodes negation of logarithm of probability of correct classification
- Composable with sigmoid
- Numerically unstable

Cross entropy loss is also called negative log likelihood or logistic loss.

Want to learn more?
Murphy, Kevin. Machine Learning: A Probabilistic Perspective (2012)
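The "numerically unstable" and "composable with sigmoid" bullets can be seen side by side in a small sketch. This is a common stable formulation (working directly on logits with a log-sum-exp style rearrangement), shown as an illustration rather than the lecture's own derivation:

```python
import numpy as np

def binary_cross_entropy(p, t):
    """Negative log-probability of the correct class, from a probability p.
    Numerically unstable when p is rounded to 0 or 1."""
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

def bce_with_logits(z, t):
    """Cross entropy composed directly with the sigmoid, written stably:
    the max/abs rearrangement avoids overflow and keeps tiny losses."""
    return np.maximum(z, 0) - z * t + np.log1p(np.exp(-np.abs(z)))

z, t = -40.0, 0.0
p = 1.0 / (1.0 + np.exp(-z))
print(binary_cross_entropy(p, t))   # 0.0: the tiny true loss is rounded away
print(bce_with_logits(z, t))        # ~4.2e-18: the stable composition keeps it
```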
The simplest “neural” classifier
[Diagram: Data → Linear → Sigmoid → Cross entropy ← Target]
- Encodes negation of logarithm of probability of entirely correct classification
- Equivalent to logistic regression model

Want to learn more?
Cramer, J. S. The origins of logistic regression (Technical report). 119. Tinbergen Institute. pp. 167–178 (2002)
Softmax
- Multi-dimensional generalisation of sigmoid
- Produces probability estimate
- Has simple derivatives

Widely used not only in classification but also in RL.
Cannot represent sparse outputs (sparsemax).
Does not scale too well with k.

Softmax + Cross entropy
- Encodes negation of logarithm of probability of entirely correct classification
- Equivalent to multinomial logistic regression model
- Numerically stable combination

Uses
- Handwritten digit recognition at the 92% level.
- For a long time a crucial model for Natural Language Processing under the name of MaxEnt (Maximum Entropy Classifier).
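A small numpy sketch of the softmax and its numerically stable combination with cross entropy (the shift by the maximum and the log-sum-exp trick are standard tricks, added here for illustration; they are not spelled out on the slide):

```python
import numpy as np

def softmax(z):
    """Multi-dimensional generalisation of the sigmoid: maps k real scores
    to a probability distribution over k classes."""
    z = z - np.max(z)              # softmax is shift-invariant; shifting avoids overflow
    e = np.exp(z)
    return e / np.sum(e)

def softmax_cross_entropy(z, t):
    """Negative log-probability of the correct class, computed from logits z
    and a one-hot target t via a stable log-softmax."""
    shifted = z - np.max(z)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -np.dot(t, log_probs)

z = np.array([2.0, 1.0, -1.0])
t = np.array([1.0, 0.0, 0.0])      # "class 0 is correct"
print(softmax(z), softmax_cross_entropy(z, t))
```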
... and limitations
Two layer neural networks
[Diagram: the generic Data → Linear → Node → Loss template instantiated as Data → Linear → Sigmoid → Linear → Softmax → Cross entropy ← Target]
1-hidden layer network vs XOR
[Diagram: Data → Linear → Sigmoid → Linear → Softmax → Cross entropy ← Target, fitted to the XOR problem]

Want to learn more?
Blum, E. K. Approximation of Boolean functions by sigmoidal networks: Part I: XOR and other two-variable functions. Neural Computation 1.4: 532–540 (1989)
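As a concrete companion to the Blum reference, here is a hand-constructed (not learned) 1-hidden-layer sigmoid network that represents XOR. All weights below are hand-picked for illustration; the lecture trains such a network instead of writing it down:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights: hidden unit 1 approximates OR(x1, x2), hidden unit 2
# approximates AND(x1, x2), and the output computes roughly "OR and not AND",
# which is XOR.
W1 = np.array([[20.0, 20.0],
               [20.0, 20.0]])
b1 = np.array([-10.0, -30.0])
W2 = np.array([20.0, -20.0])
b2 = -10.0

def xor_net(x):
    h = sigmoid(x @ W1 + b1)       # hidden layer: [~OR, ~AND]
    return sigmoid(h @ W2 + b2)    # output layer

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(float(xor_net(np.array(x, dtype=float))), 3))
# (0,0) -> ~0, (0,1) -> ~1, (1,0) -> ~1, (1,1) -> ~0
```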
Rectified Linear Unit (ReLU)
- Introduces non-linear behaviour
- Creates piecewise linear functions
- Derivatives do not vanish
- Dead neurons can occur
- Technically not differentiable at 0

One of the most commonly used activation functions.
Made math analysis of networks much simpler.

Want to learn more?
Hahnloser, R.; Sarpeshkar, R.; Mahowald, M. A.; Douglas, R. J.; Seung, H. S. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405: 947–951 (2000)
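A short numpy sketch of ReLU and the usual derivative convention at 0 (my own illustration of the bullet points above, not code from the lecture):

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: max(0, z), applied point-wise.
    Stacked with linear layers it produces piecewise linear functions."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """The derivative is 1 for z > 0 and 0 for z < 0; it does not vanish
    for positive inputs. At z = 0 ReLU is technically not differentiable,
    so a convention (here: 0) is used. A unit whose pre-activation stays
    negative receives zero gradient everywhere: a "dead neuron"."""
    return (z > 0).astype(z.dtype)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), relu_derivative(z))
```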
[Diagram: Data → Linear → ReLU → Linear → ReLU → Linear → ReLU → Linear → Softmax → Loss ← Target, with successive stages performing lines/corners detection, shapes detection, object detection and class detection]
Depth
- Expressing symmetries and regularities is much easier with a deep model than a wide one.

Want to learn more?
Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio. On the Number of Linear Regions of Deep Neural Networks. arXiv (2014)
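The diagram above simply stacks Linear + ReLU blocks before a final Linear + Softmax. A minimal numpy sketch of such a forward pass (layer sizes and weights below are arbitrary placeholders, not from the lecture):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def deep_mlp(x, params):
    """Forward pass of the network from the diagram:
    Data -> [Linear -> ReLU] x 3 -> Linear -> Softmax.
    Each extra block can build more abstract features (lines/corners ->
    shapes -> objects -> classes) on top of the previous one."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)

# Example usage with small random layers (sizes are arbitrary).
rng = np.random.default_rng(0)
sizes = [8, 16, 16, 16, 4]          # input dim 8, three hidden layers, 4 classes
params = [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=(2, 8))         # a batch of 2 inputs
print(deep_mlp(x, params).sum(axis=-1))   # each row sums to 1
```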
Neural networks as computational graphs
3 Learning
Gradient Jacobian
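The slide above recalls the gradient and the Jacobian. For reference, here are the standard definitions, reconstructed by me rather than copied from the slide:

```latex
% Gradient of a scalar function f : R^n -> R (vector of partial derivatives):
\[
  \nabla_{\mathbf{x}} f
  = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)
\]
% Jacobian of a vector-valued function F : R^n -> R^m (an m x n matrix):
\[
  \mathbf{J}_F
  = \begin{pmatrix}
      \frac{\partial F_1}{\partial x_1} & \cdots & \frac{\partial F_1}{\partial x_n} \\
      \vdots & \ddots & \vdots \\
      \frac{\partial F_m}{\partial x_1} & \cdots & \frac{\partial F_m}{\partial x_n}
    \end{pmatrix}
\]
```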
Gradient descent recap
Want to learn more?
Kingma, Diederik P., and Jimmy Ba. Adam:
A method for stochastic optimization
arXiv preprint arXiv:1412.6980 (2014).
Forward pass
Backward pass
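A tiny sketch of the plain gradient descent update being recapped here; Adam (the cited reference) refines this basic rule with per-parameter adaptive step sizes, which is not shown below:

```python
import numpy as np

def sgd_step(params, grads, learning_rate=0.1):
    """One step of (stochastic) gradient descent:
    theta <- theta - learning_rate * dL/dtheta."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

# Example: minimising f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = [np.array(0.0)]
for _ in range(50):
    grad = [2.0 * (theta[0] - 3.0)]
    theta = sgd_step(theta, grad, learning_rate=0.1)
print(theta[0])   # close to 3.0
```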
Neural networks as computational graphs - API
Forward pass
Backward pass
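A minimal sketch of what a forward/backward node API can look like. The class and method names below are my own; they only mirror the idea that every node knows how to compute its output (forward) and how to map the gradient wrt. its output into gradients wrt. its inputs and parameters (backward):

```python
import numpy as np

class Node:
    """Minimal computational-graph node API: forward and backward passes."""
    def forward(self, x):
        raise NotImplementedError
    def backward(self, grad_output):
        raise NotImplementedError

class ReLU(Node):
    def forward(self, x):
        self.mask = x > 0                 # remember which units were active
        return np.maximum(0.0, x)
    def backward(self, grad_output):
        return grad_output * self.mask    # gradient flows only through active units

class Linear(Node):
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, x):
        self.x = x                        # cache the input for the backward pass
        return x @ self.W + self.b
    def backward(self, grad_output):
        self.dW = self.x.T @ grad_output       # gradient wrt. weights
        self.db = grad_output.sum(axis=0)      # gradient wrt. bias
        return grad_output @ self.W.T          # gradient wrt. the input

# Chaining: forward left-to-right, backward right-to-left.
rng = np.random.default_rng(0)
lin, act = Linear(rng.normal(size=(3, 2)), np.zeros(2)), ReLU()
out = act.forward(lin.forward(rng.normal(size=(4, 3))))
grad_in = lin.backward(act.backward(np.ones_like(out)))
```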
Gradient descent and computational graph
[Diagram: the linear layer as a computational graph, y = dot(x, W) + b. Note the symmetry between weights and inputs.]
[Diagram: ReLU as a computational graph, y = relu(x).]
[Diagram: softmax followed by cross entropy as a computational graph: p = div(exp(x), sum(exp(x))) and L = neg(dot(t, log(p))). Dividing by p can be numerically unstable.]
[Diagram: the fused softmax + cross entropy graph. Simplifies extremely!]
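The "simplifies extremely" remark refers to the standard fact that the gradient of the fused softmax + cross entropy block wrt. the logits is just softmax(z) minus the target. A small numpy check of that identity against a finite-difference estimate (my own illustration):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def softmax_xent(z, t):
    """L = -t . log softmax(z), for a one-hot (or probability) target t."""
    return -np.dot(t, np.log(softmax(z)))

# The combined graph simplifies: dL/dz = softmax(z) - t.
z = np.array([1.5, -0.3, 0.7])
t = np.array([0.0, 1.0, 0.0])
analytic = softmax(z) - t

# Finite-difference check of the same gradient.
eps = 1e-6
numeric = np.array([
    (softmax_xent(z + eps * np.eye(3)[i], t)
     - softmax_xent(z - eps * np.eye(3)[i], t)) / (2 * eps)
    for i in range(3)])
print(np.allclose(analytic, numeric, atol=1e-6))   # True
```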
Example - 3 layer MLP with ReLU activations
[Diagram: the parameter vector θ is sliced into W1, b1, W2, b2, W3, b3; three linear layers with ReLU activations feed the softmax + cross entropy loss L against the target t.]
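Putting the pieces together, here is a hedged numpy sketch of the forward and backward pass of the 3-layer ReLU MLP from the diagram, using the simplified softmax + cross entropy gradient. Shapes and parameter values are arbitrary; the list of (W, b) pairs plays the role of the sliced θ:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp3_loss_and_grads(x, t, params):
    """Forward and backward pass of a 3-layer ReLU MLP with a
    softmax + cross entropy loss. params = [(W1, b1), (W2, b2), (W3, b3)]."""
    # Forward pass, caching the input of every layer.
    activations, h = [x], x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
        activations.append(h)
    W3, b3 = params[-1]
    p = softmax(h @ W3 + b3)
    loss = -np.sum(t * np.log(p)) / x.shape[0]

    # Backward pass: start from the simplified softmax + xent gradient (p - t).
    grads = []
    delta = (p - t) / x.shape[0]
    for i in range(len(params) - 1, -1, -1):
        W, b = params[i]
        h_in = activations[i]
        grads.append((h_in.T @ delta, delta.sum(axis=0)))   # dW, db for layer i
        if i > 0:
            delta = (delta @ W.T) * (activations[i] > 0)    # ReLU gating of the gradient
    return loss, list(reversed(grads))

# Example usage with arbitrary shapes.
rng = np.random.default_rng(0)
shapes = [(5, 8), (8, 8), (8, 3)]
params = [(rng.normal(scale=0.3, size=s), np.zeros(s[1])) for s in shapes]
x = rng.normal(size=(4, 5))
t = np.eye(3)[rng.integers(0, 3, size=4)]        # one-hot targets
loss, grads = mlp3_loss_and_grads(x, t, params)
print(loss, [dW.shape for dW, db in grads])
```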
4 Pieces of the puzzle
[Diagram: max(x, y) as a computational graph.]
- Backwards pass is gated in the same way the forward one is.

[Diagram: multiplication, mul(x, p), as a computational graph.]
- Let's assume p is a probability distribution (e.g. one hot).
- We can learn the conditionals themselves too, just use softmax.
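A small sketch of the gating idea for both nodes (my own illustration, not lecture code): the max node routes the gradient to the input it picked on the forward pass, and multiplying by a one-hot p routes both the output and the gradient through the selected branch.

```python
import numpy as np

def max_forward_backward(x, y, grad_output):
    """max(x, y): the forward pass picks the larger input, and the backward
    pass routes the gradient to that same input (gated the same way)."""
    picked_x = x > y
    out = np.where(picked_x, x, y)
    return out, grad_output * picked_x, grad_output * ~picked_x

def gated_mul_forward_backward(x, p, grad_output):
    """x * p with p a probability distribution (e.g. one hot): a conditional.
    Producing p with a softmax makes the selection itself learnable."""
    out = x * p
    return out, grad_output * p, grad_output * x   # grads wrt. x and p

x, y = np.array([1.0, -2.0]), np.array([0.5, 3.0])
print(max_forward_backward(x, y, grad_output=np.ones(2)))
x = np.array([1.0, -2.0, 0.5])
p = np.array([0.0, 1.0, 0.0])                      # select the second element
print(gated_mul_forward_backward(x, p, grad_output=np.ones(3)))
```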
Quadratic loss as a computational graph
[Diagram: L = sum(sqr(prediction − t)).]
- Backwards pass is just a difference in predictions.
- Typical loss for all regression problems (e.g. value function fitting).
- Learning targets is analogous.
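A two-line sketch of the quadratic loss and its gradient; the "just a difference in predictions" statement holds up to the constant factor 2, which is often absorbed by using half the sum of squares instead:

```python
import numpy as np

def quadratic_loss(prediction, t):
    """Typical regression loss: L = sum((prediction - t)^2)."""
    return np.sum(np.square(prediction - t))

def quadratic_loss_grad(prediction, t):
    """The backward pass is just a (scaled) difference in predictions."""
    return 2.0 * (prediction - t)

prediction = np.array([1.0, 2.0, 3.0])
t = np.array([1.5, 2.0, 2.0])
print(quadratic_loss(prediction, t), quadratic_loss_grad(prediction, t))
```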
5 Practical issues
Overfitting and regularisation

Regularisation techniques:
- Dropout
- Noising data
- Early stopping
- Batch/Layer norm

- As your model gets more powerful, it can create extremely complex hypotheses, even if they are not needed.
- Keeping things simple guarantees that if the training error is small, so will the test error be.
- Classical results from statistics and Statistical Learning Theory, which analyse the worst case scenario.

[Figure from Belkin et al. (2019)]

Want to learn more?
Belkin, Mikhail, et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116.32 (2019)
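As one concrete example of the techniques listed above, here is a common "inverted dropout" formulation, sketched for illustration (the lecture only names the technique, so details such as the rescaling convention are my own):

```python
import numpy as np

def dropout(h, drop_prob=0.5, training=True, rng=None):
    """Inverted dropout: during training, zero each unit with probability
    drop_prob and rescale the survivors so the expected activation is
    unchanged; at test time it is the identity."""
    if not training or drop_prob == 0.0:
        return h
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) >= drop_prob
    return h * mask / (1.0 - drop_prob)

h = np.ones((2, 6))
print(dropout(h, drop_prob=0.5))
```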
Multiplicative units unify attention, metric learning and many others.