Yann LeCun
NYU - Courant Institute & Center for Data Science
Facebook AI Research
http://yann.lecun.com
AT&T, 2020-10-27
Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]
Backprop for the weight gradients:
[Diagram: multi-layer network with weight matrices, input x, desired output y]
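As a minimal sketch (not from the slides), here is the weight-gradient computation backprop performs for a single linear layer with a squared loss; the sizes and data are made up for illustration.

```python
import numpy as np

# Minimal sketch of backprop for the weight gradients of one linear layer
# with a squared loss; layer sizes and data are arbitrary.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 1))          # input
y = rng.normal(size=(3, 1))          # desired output
W = rng.normal(size=(3, 5))          # weight matrix

y_hat = W @ x                        # forward pass
loss = 0.5 * np.sum((y_hat - y) ** 2)

dL_dyhat = y_hat - y                 # gradient w.r.t. the layer output
dL_dW = dL_dyhat @ x.T               # weight gradient: outer product of
                                     # output gradient and input
W -= 0.1 * dL_dW                     # one gradient-descent step
```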
Convolutional network: multiple stages of [filter bank (convolutions) + non-linearity → pooling/subsampling]
[Fukushima 1982] [LeCun 1989, 1998] [Riesenhuber 1999]
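A minimal PyTorch sketch of this repeated [filter bank + non-linearity + pooling] stage; the channel counts, kernel sizes, and input resolution are arbitrary, not taken from the slide.

```python
import torch
import torch.nn as nn

# Sketch of a ConvNet as repeated stages of
# [filter bank (convolutions) -> non-linearity -> pooling/subsampling].
convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # stage 1
    nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # stage 2
    nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),  # stage 3
)

features = convnet(torch.randn(1, 3, 64, 64))  # e.g. a 64x64 RGB image
print(features.shape)
```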
Graph transformer network trained to read check amounts (AT&T, 1995).
Trained globally with a negative log-likelihood loss.
Content filtering: hate speech, calls to violence, weapon sales, terrorist propaganda, ...
ConvNets in neuroscience
[Eickenberg et al. NeuroImage 2016]
Today, Facebook, Google, Amazon and others are built around DL.
Take deep learning out, and the companies fold
Much of the computation is in data centers on regular CPUs
Highly-optimized code.
Increasingly, they will run on dedicated hardware.
More power efficient
Soon, ConvNets and other DL systems will be everywhere.
Smartphones, AR glasses, VR goggles, cars, medical imaging systems,
vacuum cleaners, cameras, toys, and almost all consumer electronics.
Attention as a soft associative memory: the input (address) $X$ is compared to the keys $K_i$ by dot products, and the resulting softmax coefficients weight the values $V_i$:

$$Y = \sum_i C_i V_i, \qquad C_i = \frac{e^{K_i^\top X}}{\sum_j e^{K_j^\top X}}$$
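A toy NumPy version of this soft associative-memory lookup; the key, value, and address dimensions are arbitrary.

```python
import numpy as np

def attention_lookup(X, K, V):
    """Soft associative memory: Y = sum_i C_i V_i with
    C_i = exp(K_i^T X) / sum_j exp(K_j^T X)."""
    scores = K @ X                      # dot products K_i^T X
    scores -= scores.max()              # for numerical stability
    C = np.exp(scores) / np.exp(scores).sum()
    return C @ V                        # convex combination of the values

rng = np.random.default_rng(0)
K = rng.normal(size=(10, 4))            # 10 keys of dimension 4
V = rng.normal(size=(10, 6))            # 10 values of dimension 6
X = rng.normal(size=4)                  # input (address)
Y = attention_lookup(X, K, V)
```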
Transformer Architecture
Multi-head attention [Vaswani et al., arXiv:1706.03762]
10 to 60 stages
BERT model [Devlin et al., arXiv:1810.04805]: trained to fill in missing words
IPAM workshop:
http://www.ipam.ucla.edu/programs/workshops/new-deep-learning-techniques/
Lessons learned #3
3.1: Dynamic networks are gaining in popularity (e.g. for NLP)
Dynamicity breaks many assumptions of current hardware
Can’t optimize the compute graph distribution at compile time.
Can’t do batching easily!
3.2: Large-Scale Memory-Augmented Networks & Transformers will require efficient associative memory / nearest-neighbor search
3.3: Graph ConvNets are very promising for many applications (a minimal sketch follows this list)
Say goodbye to matrix multiplications? Say goodbye to tensors?
3.4: Large Neural Nets may have sparse activity
How to exploit sparsity in hardware?
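As a rough illustration of what a graph-convolution layer computes (promised in 3.3 above), the sketch below uses the common normalized-adjacency formulation; the graph, features, and sizes are made up and this is not a specific design from the talk.

```python
import numpy as np

def graph_conv(A, X, W):
    """One graph-convolution layer: aggregate neighbor features with a
    normalized adjacency matrix, then apply a shared weight matrix."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)  # ReLU

# Toy graph: 4 nodes, 3 input features per node, 2 output features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = np.random.randn(4, 3)
W = np.random.randn(3, 2)
H = graph_conv(A, X, W)
```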
How do humans and animals learn so quickly?
Not supervised. Not reinforced.
[Chart: concepts acquired by infants as a function of age in months (0-14): biological motion; gravity, inertia; physics (stability, support, conservation of momentum); proto-imitation; emotional contagion; crawling; walking]
Photos courtesy of Emmanuel Dupoux
Self-Supervised Learning
Capture dependencies.
Predict everything from everything else.
Example: the blue dots are data points in the (x, y) plane.
Energy function F(x, y): takes low values on the data points and higher values elsewhere.
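A toy illustration (not from the slides) of using an energy function to capture the dependency between x and y: F(x, y) is low when y is consistent with x, and prediction amounts to minimizing F over y. The "data manifold" y = sin(x) is an assumption made purely for illustration.

```python
import torch

# Toy energy function: low when y is consistent with x.
# The data manifold y = sin(x) is assumed for illustration only.
def F(x, y):
    return (y - torch.sin(x)) ** 2

# Inference: given x, find the y that minimizes the energy by gradient descent.
x = torch.tensor(1.0)
y = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.SGD([y], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    F(x, y).backward()
    opt.step()
print(y.item(), torch.sin(x).item())   # y converges toward sin(x)
```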
Masked Auto-Encoder: BERT [Devlin 2018], RoBERTa [Ott 2019]
Corrupt the input text by masking words; the network predicts the missing words: Pred(x) → h → Dec(h) → softmax over the vocabulary; a latent variable turns the softmax vector(s) into the observed word(s).
Example: "This is a [...] of text extracted from a large set of [...] articles" → "This is a piece of text extracted from a large set of news articles"
Issues:
The latent variables are in the output space.
No abstract latent variable to control the output.
How to cover the space of corruptions?
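A minimal PyTorch sketch of the masked auto-encoder objective (corrupt the text by masking tokens, train to fill in the missing words); the vocabulary, sizes, and single-layer encoder are placeholders, not the BERT configuration.

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID, DIM = 1000, 0, 64   # toy vocabulary and sizes

# Toy masked auto-encoder: corrupt the input by masking tokens,
# then train the network to predict the missing words.
embed = nn.Embedding(VOCAB, DIM)
encoder = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
to_vocab = nn.Linear(DIM, VOCAB)        # softmax over the vocabulary
opt = torch.optim.Adam([*embed.parameters(), *encoder.parameters(),
                        *to_vocab.parameters()], lr=1e-3)

tokens = torch.randint(1, VOCAB, (8, 16))          # a batch of "sentences"
mask = torch.rand(tokens.shape) < 0.15             # corrupt 15% of positions
corrupted = tokens.masked_fill(mask, MASK_ID)

logits = to_vocab(encoder(embed(corrupted)))       # predict every position
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
opt.step()
```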
Transformer Architecture
Multi-head attention (associative memory) [Vaswani et al., arXiv:1706.03762]
10 to 60 stages
BERT model [Devlin et al., arXiv:1810.04805]: trained to fill in missing words
Contrastive Joint Embedding: y is a corrupted version of x; two encoders Enc(x) → h and Enc(y) → h'; the cost C(h, h') measures the compatibility of the two embeddings.
Bootstrap Your Own Latent [Grill, arXiv:2006.07733]: the target encoder's weights w' are a moving average of the online encoder's weights w.
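A rough sketch of the joint-embedding idea: an encoder maps x and its corrupted version to h and h', and the cost pulls matching pairs together. The cosine-similarity cost and the additive-noise corruption are illustrative choices; BYOL additionally uses a predictor head and an exponential-moving-average target encoder, omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Joint embedding sketch: Enc(x) -> h, Enc(corrupt(x)) -> h',
# cost C(h, h') pulls the two embeddings together.
enc = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

def corrupt(x):
    return x + 0.1 * torch.randn_like(x)    # toy corruption: additive noise

x = torch.randn(128, 32)
h, h_prime = enc(x), enc(corrupt(x))
# On its own this objective can collapse; contrastive negative pairs or
# BYOL's predictor + moving-average target encoder prevent that.
cost = 1.0 - F.cosine_similarity(h, h_prime, dim=-1).mean()
cost.backward()
```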
Prediction with a latent variable: h = Pred(x), ȳ = Dec(z, h), cost C(y, ȳ).
The latent variable z represents the independent explanatory factors of variation of the prediction.
The information capacity of the latent variable must be minimized; otherwise all the information for the prediction could come from the latent variable.
Examples of the latent regularizer R(z): effective dimension; quantization/discretization; L0 norm (# of non-zero components).
Variational Auto-Encoder: F(y) is approximated by sampling and/or a variational method.
The encoder Enc(y, h) performs amortized inference [Gregor & LeCun, ICML 2010].
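A minimal variational-auto-encoder sketch showing the amortized inference step (the encoder maps y directly to the parameters of the latent distribution) and the sampling approximation; the sizes are arbitrary and the conditioning on h from the diagram is omitted.

```python
import torch
import torch.nn as nn

# Minimal VAE sketch: the encoder performs amortized inference by mapping y
# to the parameters of q(z|y); the objective is approximated by sampling.
enc = nn.Linear(32, 2 * 8)          # outputs mean and log-variance of z
dec = nn.Linear(8, 32)

y = torch.randn(64, 32)
mu, logvar = enc(y).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
y_hat = dec(z)

recon = ((y_hat - y) ** 2).sum(dim=-1).mean()
# The KL term limits the information capacity of the latent variable.
kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=-1).mean()
loss = recon + kl
loss.backward()
```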
Self-supervised prediction under uncertainty
Training the forward model: observe frames, compute h, predict z̄ with the encoder, decode with Dec(z, h), regularize the latent with R(z).
World-model rollout: perception provides the initial state s[t]; the forward model predicts s[t+1] = G(s[t], a[t], z[t]), unrolled over time with actions a[t] and latent variables z[t].
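A rough sketch of using a learned forward model G(s, a, z) for model-based control: unroll the model over a few time steps and optimize the action sequence by gradient descent on a cost. G, the cost, and all sizes are stand-ins for illustration, not the trained model from the talk.

```python
import torch

STATE, ACTION, LATENT, HORIZON = 8, 2, 4, 5

# Stand-in forward model s[t+1] = G(s[t], a[t], z[t]); in practice G is the
# trained predictor, here a fixed random map for illustration.
Ws = torch.randn(STATE, STATE) * 0.1
Wa = torch.randn(STATE, ACTION)
Wz = torch.randn(STATE, LATENT)
def G(s, a, z):
    return torch.tanh(s @ Ws.T + a @ Wa.T + z @ Wz.T)

def cost(s):
    return (s ** 2).sum()               # toy cost: drive the state to zero

s0 = torch.randn(1, STATE)              # initial state from perception
actions = torch.zeros(HORIZON, 1, ACTION, requires_grad=True)
opt = torch.optim.SGD([actions], lr=0.1)

for _ in range(50):                     # optimize the action sequence
    opt.zero_grad()
    s, total = s0, 0.0
    for t in range(HORIZON):
        z = torch.zeros(1, LATENT)      # latent held fixed (zero) here
        s = G(s, actions[t], z)
        total = total + cost(s)
    total.backward()
    opt.step()
```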
Conclusions
SSL is the future
Learning hierarchical features in a task-invariant way
Plenty of data, massive networks
Learning Forward Models for Model-Based Control
Challenge: handling uncertainty in the prediction: energy-based models
Reasoning through vector representations and energy minimization
Energy-Based Models with latent variables
Replace symbols by vectors and logic by continuous functions.
Learning hierarchical representations of action plans
No idea how to do that!
There is no such thing as AGI. Intelligence is always specialized.
We should talk about rat-level, cat-level, or human-level AI (HLAI).
Conjectures
Self-Supervised Learning to learn models of the world.
Models of the world: learning with uncertainty in the prediction
Perhaps common sense will emerge from learning world models
Emotions are (often) anticipations of outcomes
According to predictions from the model of the world
Reasoning is finding actions that optimize outcomes
Constraint satisfaction/cost minimization rather than logic
Consciousness may be the deliberate configuration of our world-model engine?
We only have one (configurable) model of the world
If our brains had infinite capacity, we would not need consciousness
Thank You!