
AI & Machine Learning:

Past, Present & Future

Yann LeCun
NYU - Courant Institute & Center for Data Science
Facebook AI Research
http://yann.lecun.com
AT&T, 2020-10-27

Supervised Learning works but requires many labeled samples


Training a machine by showing examples instead of programming it
When the output is wrong, tweak the parameters of the machine
Works well for:
Speech → words
Image → categories (e.g., CAR, PLANE)
Portrait → name
Photo → caption
Text → topic
…

Traditional Machine Learning → Deep Learning


Traditional machine learning & pattern recognition:
hand-engineered feature extractor → feature vector → trainable classifier → predicted output

Deep Learning: learning hierarchical representations

Trainable module → trainable module → trainable module
Low-level representation → high-level representation → predicted output

Multilayer Architectures == Compositional Structure of Data


Natural data is compositional ⇒ it is efficiently representable hierarchically

Low-level features → mid-level features → high-level features → trainable classifier

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

(Deep) Multi-Layer Neural Nets


Multiple layers of simple units
Each unit computes a weighted sum of its inputs
The weighted sum is passed through a non-linear function, e.g. ReLU(x) = max(x, 0)
The learning algorithm changes the weights

[Figure: a stack of weight matrices and non-linearities mapping an input image to the output "Ceci est une voiture" ("This is a car")]

Computing Gradients by Back-Propagation



A practical application of the chain rule

Backprop for the state gradients
Backprop for the weight gradients

(x: input, y: desired output)
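For reference, a sketch of the two standard recurrences behind these bullets, written for a feed-forward stack $x_k = f_k(x_{k-1}, w_k)$ with cost $C$ (the layer/weight indexing is my notation, not taken from the slide):

$\dfrac{\partial C}{\partial x_{k-1}} = \dfrac{\partial C}{\partial x_k}\,\dfrac{\partial f_k(x_{k-1}, w_k)}{\partial x_{k-1}}$   (state gradients, propagated backward through each module)

$\dfrac{\partial C}{\partial w_k} = \dfrac{\partial C}{\partial x_k}\,\dfrac{\partial f_k(x_{k-1}, w_k)}{\partial w_k}$   (weight gradients, used by the gradient-based update)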

What is Deep Learning?


Definition: Deep Learning is building a system by assembling parameterized modules into a (possibly dynamic) computation graph, and training it to perform a task by optimizing the parameters using a gradient-based method.

The graph can be defined dynamically by input-dependent programs: differentiable programming.

The output may be computed through a complex (non feed-forward) process, e.g. by minimizing some energy function: relaxation, constraint satisfaction, structured prediction, …

Learning paradigms and objective functions are up to the designer: supervised, reinforced, self-supervised/unsupervised, classification, prediction, reconstruction, …

Note: the limitations of supervised learning are sometimes mistakenly seen as intrinsic limitations of DL.

[Diagram: modules Pred(x), Enc(y,h), Dec(z,h), regularizer R(z), and cost C(y,ȳ) assembled into a computation graph over x, y, h, z]
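As an illustration of "assembling parameterized modules into a (possibly dynamic) computation graph", here is a minimal PyTorch sketch (module names and sizes are illustrative, not from the talk); the graph depends on the input through the while-loop, yet gradients still flow:

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    """A tiny differentiable program: the number of refinement steps
    depends on the input, so the compute graph is built on the fly."""
    def __init__(self, dim=16):
        super().__init__()
        self.embed = nn.Linear(8, dim)
        self.refine = nn.Linear(dim, dim)   # reused a data-dependent number of times
        self.readout = nn.Linear(dim, 1)

    def forward(self, x):
        h = torch.relu(self.embed(x))
        steps = 0
        while h.norm() > 1.0 and steps < 10:   # input-dependent control flow
            h = torch.relu(self.refine(h)) * 0.5
            steps += 1
        return self.readout(h)

net = DynamicNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
x, y = torch.randn(8), torch.tensor([0.3])
loss = (net(x) - y).pow(2).mean()   # scalar objective
loss.backward()                     # gradients flow through whatever graph was built
opt.step()
```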

I was at AT&T from 1988 to 1996


Larry D. Jackel

2002 lab reunion



Adaptive Systems Research Dept (BL11315, 1985-1996)


New device fabrication → neural net hardware → ML research
Holmdel, NJ

Hubel & Wiesel's Model of the Architecture of the Visual Cortex


[Hubel & Wiesel 1962]: simple cells detect local features; complex cells "pool" the outputs of simple cells within a retinotopic neighborhood.
Multiple stages of convolutions and pooling/subsampling: [Fukushima 1982], [LeCun 1989, 1998], [Riesenhuber 1999], …
[Figure after Thorpe & Fabre-Thorpe 2001: the visual pathway as alternating "simple cells" and "complex cells"]

Convolutional Network Architecture [LeCun et al. NIPS 1989]

Filter bank + non-linearity → pooling → filter bank + non-linearity → pooling → filter bank + non-linearity

Inspired by [Hubel & Wiesel 1962] & [Fukushima 1982] (Neocognitron):
simple cells detect local features
complex cells "pool" the outputs of simple cells within a retinotopic neighborhood

Convolutional Network (LeNet5, vintage 1990)


Filters-tanh → pooling → filters-tanh → pooling → filters-tanh
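A minimal PyTorch sketch of such a stack (channel counts and the 32x32 input size are illustrative, not necessarily the original LeNet-5 dimensions):

```python
import torch
import torch.nn as nn

# Filters + tanh -> pooling, repeated, then a small classifier head.
lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),    # filter bank + non-linearity
    nn.AvgPool2d(2),                              # pooling
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),
    nn.AvgPool2d(2),
    nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(), # last filter bank sees a 5x5 map
    nn.Flatten(),
    nn.Linear(120, 10),                           # class scores
)

scores = lenet_like(torch.randn(1, 1, 32, 32))    # -> shape (1, 10)
```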

LeNet character recognition demo 1992


Running on an AT&T DSP32C (floating-point DSP, 20 MFLOPS)

ConvNets can recognize multiple objects


All layers are convolutional
The network performs simultaneous segmentation and recognition
[LeCun, Bottou, Bengio, Haffner, Proc. IEEE 1998]

1986-1996 Neural Net Hardware at Bell Labs, Holmdel


1986: 12x12 resistor array
Fixed resistor values; e-beam lithography, 6x6 microns
1988: 54x54 neural net
Programmable ternary weights; on-chip amplifiers and I/O
1991: Net32k: 256x128 net
Programmable ternary weights; 320 GOPS, 1-bit convolver
1992: ANNA: 64x64 net
ConvNet accelerator: 4 GOPS; 6-bit weights, 3-bit activations
Check Reader (AT&T, 1995)

Graph transformer network trained to read check amounts.
Trained globally with negative log-likelihood loss.
50% correct, 49% reject, 1% error (detectable later in the process).
Fielded in 1996, used in many banks in the US and Europe.
Processed an estimated 10% to 20% of all the checks written in the US in the early 2000s.
[LeCun, Bottou, Bengio, Haffner 1998]
The Deep Learning
Revolution

Since 2010 or so.



Deep ConvNets for Object Recognition (on GPU)


AlexNet [Krizhevsky et al. NIPS 2012], OverFeat [Sermanet et al. 2013]
1 to 10 billion connections, 10 million to 1 billion parameters, 8 to 20 layers.

Applications of Deep Learning


Medical image analysis
Self-driving cars
Accessibility
Face recognition
Language translation
Virtual assistants*
Content understanding for: filtering, selection/ranking, search
Games
Security, anomaly detection
Diagnosis, prediction
Science!
[Figure credits: Mnih 2015; MobilEye; Geras 2017; Esteva 2017]

Supervised DL works amazingly well, when you have data


And services like Facebook, Instagram, Google, YouTube, … are built around it.
Content understanding, filtering, ranking, translation, accessibility, …

Deep Learning Saves Lives


Automated emergency braking systems
Reduce collisions by 40%
Use convolutional nets

Tumor detection in mammograms
[Wu et al. arXiv:1903.08297]
https://github.com/nyukat/breast_cancer_classifier

Content filtering
Hate speech, calls to violence, weapon sales, terrorist propaganda, …

FastMRI (NYU+FAIR): 4x-8x speed up for MRI data acquisition

MRI images subsampled (in k-space) by 4x and 8x
[Zbontar et al. arXiv:1811.08839]
U-Net architecture
[Figure: reconstructions at 4-fold and 8-fold acceleration, with the corresponding k-space masks]

ConvNets in neuroscience
[Eickenberg et al. NeuroImage 2016]

ConvNets in Astrophysics [He et al. PNAS 07/2019]

1. Train a coarse-grained 3D U-Net to approximate a fine-grained simulation on a small volume
2. Use it for a simulation on a large volume (the early universe)

Deep Learning in Science

Protein design / molecular dynamics
Protein structure/function prediction
Material science / molecular dynamics
Prediction of material properties
High-energy physics
Jet filtering / analysis [Komiske arXiv:1612.01551]
Cosmology / astrophysics
Inferring constants from observations
Statistical studies of galaxies
Dark matter through gravitational lensing
Others…

Deep Learning at the edge

Today, Facebook, Google, Amazon and others are built around DL.
Take deep learning out, and the companies fold
Much of the computation is in data centers on regular CPUs
Highly-optimized code.
Increasingly, they will run on dedicated hardware.
More power efficient
Soon, ConvNets and other DL systems will be everywhere.
Smartphones, AR glasses, VR goggles, cars, medical imaging systems,
vacuum cleaners, cameras, toys, and almost all consumer electronics.

Three challenges for AI & Machine Learning


1. Learning with fewer labeled samples and/or fewer trials
Supervised and reinforcement learning require too many samples/trials
Self-supervised learning: learning dependencies, learning to fill in the blanks
Learning to represent the world in a non-task-specific way

2. Learning to reason, like Daniel Kahneman's "System 2"
Beyond feed-forward, "System 1" subconscious computation
Making reasoning compatible with learning

3. Learning to plan complex action sequences
Learning hierarchical representations of action plans
New Deep Learning Architectures
Attention, memory, dynamic architectures, hypernetworks

Differentiable Associative Memory == “soft RAM”


Memory Networks, Transformer networks, ELMo, GPT, BERT, GPT-2, RoBERTa, XLM-R, …
Used very widely in NLP
Essentially a "soft" RAM or hash table
Memory Networks [Weston et al. 2014] (FAIR)
Stack-Augmented Recurrent Neural Net [Joulin & Mikolov 2014] (FAIR)
Neural Turing Machine [Graves 2014]
Differentiable Neural Computer [Graves 2016]

Given an input (address) X, keys K_i and values V_i:

$Y = \sum_i c_i V_i, \qquad c_i = \frac{e^{K_i^\top X}}{\sum_j e^{K_j^\top X}}$
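A minimal Python/PyTorch sketch of this soft lookup (single query, single head; names and sizes are illustrative):

```python
import torch

def soft_lookup(x, keys, values):
    """Differentiable associative memory: softmax over key/query dot
    products, then a convex combination of the values.
    x: (d,) query/address, keys: (n, d), values: (n, m)."""
    scores = keys @ x                     # K_i^T X for every memory slot i
    c = torch.softmax(scores, dim=0)      # c_i = exp(K_i^T X) / sum_j exp(K_j^T X)
    return c @ values                     # Y = sum_i c_i V_i

x = torch.randn(64)
keys, values = torch.randn(100, 64), torch.randn(100, 32)
y = soft_lookup(x, keys, values)          # (32,), fully differentiable
```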

Transformer Architecture

Multi-head attention [Vaswani et al. arXiv:1706.03762]
10 to 60 stages
BERT model [Devlin et al. arXiv:1810.04805]
Trained to fill in missing words

DETR: ConvNet → Transformer for object detection


DETR [Carion et al. ArXiv:2005.12872]
https://github.com/facebookresearch/detr
ConvNet → Transformer
Object-based visual reasoning
Transformer: dynamic networks through "attention"

DETR: results on panoptic segmentation



Networks produced by other networks

2D image to 3D model [Littwin & Wolf arXiv:1908.06277]
Net1 → weights of Net2: implicit function for 3D shape

ConvNets on Graphs (fixed and data-dependent)


Graphs can represent: natural language, social networks, chemistry, physics, communication networks, …

Review paper: "Geometric deep learning: going beyond Euclidean data", M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, P. Vandergheynst, IEEE Signal Processing Magazine 34(4), 18-42, 2017 [arXiv:1611.08097]

Spectral ConvNets / Graph ConvNets

Regular grid graph → standard ConvNet
Fixed irregular graph → spectral ConvNet
Dynamic irregular graph → graph ConvNet

IPAM workshop:
http://www.ipam.ucla.edu/programs/workshops/new-deep-learning-techniques/

Lessons learned #3
3.1: Dynamic networks are gaining in popularity (e.g. for NLP)
Dynamicity breaks many assumptions of current hardware
Can’t optimize the compute graph distribution at compile time.
Can’t do batching easily!
3.2: Large-Scale Memory-Augmented Networks & Transformers...
...Will require efficient associative memory/nearest-neighbor search
3.3: Graph ConvNets are very promising for many applications
Say goodbye to matrix multiplications?
Say goodbye to tensors?
3.4: Large Neural Nets may have sparse activity
How to exploit sparsity in hardware?
How do humans and animals learn so quickly?
Not supervised. Not reinforced.

When infants learn how the world works [after Emmanuel Dupoux]

[Chart (age in months, 0-14) of when infants acquire concepts:
Perception: biological motion, face tracking
Production: emotional contagion, proto-imitation, pointing, crawling, walking
Physics: gravity, inertia, stability, support, conservation of momentum
Objects: object permanence, solidity, rigidity, shape constancy, natural kind categories
Actions: rational, goal-directed actions
Social / Communication: helping vs hindering, false perceptual beliefs]

How do Human and Animal Babies Learn?


How do they learn how the world works?
Largely by observation, with remarkably little interaction (initially).
They accumulate enormous amounts of background knowledge
About the structure of the world, like intuitive physics.
Perhaps common sense emerges from this knowledge?

Photos courtesy of
Emmanuel Dupoux
Self-Supervised
Learning
Capture dependencies.
Predict everything from everything else.

Self-Supervised Learning = Learning to Fill in the Blanks


Reconstruct the input or predict missing parts of the input (along time or space).

Two Uses for Self-Supervised Learning

1. Learning hierarchical representations of the world
SSL pre-training precedes a supervised or RL phase

2. Learning predictive (forward) models of the world
Learning models for model-predictive control, policy learning for control, or model-based RL

Question: how to represent uncertainty/multi-modality in the prediction?
Energy-Based
Models
Capture dependencies through
an energy function.

Energy-Based Models (EBM)


Energy function F(x,y): scalar-valued
Takes low values when y is compatible with x, higher values when y is less compatible with x
Inference: find values of y that make F(x,y) small; there may be multiple solutions
Note: the energy is used for inference, not for learning

[Figure: blue dots are data points in the (x, y) plane, with the corresponding energy surface]

Energy-Based Model: gradient-based inference


If y is continuous, we can use a gradient-based method for inference:
start from an initial y and follow the negative gradient of F(x,y) with respect to y.
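A minimal sketch of gradient-based inference on a toy energy (the quadratic F below is purely illustrative):

```python
import torch

def F(x, y):
    # Toy energy: low when y is compatible with x (here, y close to sin(x)).
    return ((y - torch.sin(x)) ** 2).sum()

x = torch.tensor([1.0, 2.0])
y = torch.zeros(2, requires_grad=True)       # initial guess for y

opt = torch.optim.SGD([y], lr=0.1)
for _ in range(100):                          # descend the energy w.r.t. y only
    opt.zero_grad()
    energy = F(x, y)
    energy.backward()
    opt.step()
# y now approximates argmin_y F(x, y), i.e. sin(x)
```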

Training an Energy-Based Model


Parameterize F(x,y)
Training samples: x[i], y[i]
Shape F(x,y) so that:
F(x[i], y[i]) is strictly smaller than F(x[i], y) for all y different from y[i]
Keep F smooth (max-likelihood probabilistic methods break that!)

Two classes of learning methods:
1. Contrastive methods: push down on F(x[i], y[i]), push up on other points F(x[i], y')
2. Regularized/architectural methods: build F(x,y) so that the volume of low-energy regions is limited or minimized through regularization

Contrastive Methods vs Regularized/Architectural Methods


Contrastive: [these are all different ways to pick which points to push up]
C1: push down on the energy of data points, push up everywhere else: max likelihood (needs a tractable partition function or a variational approximation)
C2: push down on the energy of data points, push up on chosen locations: max likelihood with MC/MCMC/HMC, contrastive divergence, metric learning / Siamese nets, ratio matching, noise contrastive estimation, minimum probability flow, adversarial generators / GANs
C3: train a function that maps points off the data manifold to points on the data manifold: denoising auto-encoder, masked auto-encoder (e.g. BERT)

Regularized/Architectural: [different ways to limit the information capacity of the latent representation]
A1: build the machine so that the volume of low-energy space is bounded: PCA, K-means, Gaussian mixture model, square ICA, normalizing flows, …
A2: use a regularization term that measures the volume of space that has low energy: sparse coding, sparse auto-encoder, LISTA, variational auto-encoders, discretization/VQ/VQ-VAE
A3: F(x,y) = C(y, G(x,y)); make G(x,y) as "constant" as possible with respect to y: contracting auto-encoder, saturating auto-encoder
A4: minimize the gradient and maximize the curvature around data points: score matching
Denoising AE: discrete outputs
[Vincent et al. JMLR 2008]

Masked auto-encoders: BERT [Devlin 2018], RoBERTa [Ott 2019]
Latent variables ("switches") turn softmax vectors into observed word(s)

Issues:
Latent variables are in the output space
No abstract latent variable to control the output
How to cover the space of corruptions?

Example: "This is a [...] of text extracted [...] a large set of [...] articles" → "This is a piece of text extracted from a large set of news articles"

[Diagram: y → corruption → x → Pred(x) → h → Dec(h) → ȳ, with reconstruction cost C(y, ȳ)]
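A minimal sketch of this masked, "fill in the blanks" training signal, BERT-style, in PyTorch (the vocabulary, sizes, and tiny encoder are illustrative placeholders):

```python
import torch
import torch.nn as nn

vocab, d = 1000, 64
MASK = 0                                   # token id used for corruption

embed = nn.Embedding(vocab, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
decoder = nn.Linear(d, vocab)              # softmax over words at each position

y = torch.randint(1, vocab, (8, 32))       # batch of clean token sequences
mask = torch.rand(y.shape) < 0.15          # corrupt 15% of the positions
x = y.masked_fill(mask, MASK)

h = encoder(embed(x))                      # contextual representations
logits = decoder(h)                        # predictions for every position
loss = nn.functional.cross_entropy(        # only the masked positions are scored
    logits[mask], y[mask])
loss.backward()
```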

Transformer Architecture
Multi-head attention (associative memory) [Vaswani et al. arXiv:1706.03762]
10 to 60 stages
BERT model [Devlin et al. arXiv:1810.04805]
Trained to fill in missing words

Multilingual Transformer Architecture: XLM-R
[Lample & Conneau arXiv:1901.07291]

Supervised Symbol Manipulation

Solving integrals and differential equations symbolically with a transformer architecture
[Lample & Charton arXiv:1912.01412]
[Table in figure: accuracy on various problems]

Natural language understanding & generation [MMBlenderbot]


Denoising AE: continuous outputs
Image inpainting [Pathak 17]
Doesn't quite work for feature learning
Most current approaches do not have latent variables
Contrastive Joint Embedding

Distance measured in feature space: C(h, h')
Multiple "predictions" through feature invariance
Siamese nets, metric learning [Bromley NIPS'93], [Chopra CVPR'05], [Hadsell CVPR'06]
Advantage: no pixel-level reconstruction
Difficulty: hard negative mining
Positive pair: make F small; negative pair: make F large

Successful examples for images:
DeepFace [Taigman et al. CVPR'14]
PIRL [Misra et al. arXiv:1912.01991]
MoCo [He et al. arXiv:1911.05722]
SimCLR [Chen et al. arXiv:2002.05709]
Speech:
wav2vec 2.0 [Baevski et al. arXiv:2006.11477]
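A minimal sketch of the contrastive idea with a shared ("Siamese") encoder and an InfoNCE-style objective; the encoder and batch construction are placeholders, not any specific published method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))   # shared weights

def contrastive_loss(xa, xb, temperature=0.1):
    """xa[i] and xb[i] are two views of the same image (positive pair);
    every other pairing in the batch serves as a negative pair."""
    ha = F.normalize(enc(xa), dim=1)          # h  in feature space
    hb = F.normalize(enc(xb), dim=1)          # h'
    logits = ha @ hb.t() / temperature        # similarities between all pairs
    targets = torch.arange(len(xa))           # the matching index is the positive
    return F.cross_entropy(logits, targets)   # pull positives together, push negatives apart

xa, xb = torch.randn(16, 3, 32, 32), torch.randn(16, 3, 32, 32)
loss = contrastive_loss(xa, xb)
loss.backward()
```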
Non-Contrastive Embedding

Advantage: no pixel-level reconstruction
Eliminates hard negative mining
Siamese nets with slightly different weights: w' is a (moving) average of w, with a predictor Pred on one branch
Bootstrap Your Own Latent (BYOL) [Grill arXiv:2006.07733]

Using cluster centers as targets:
DeepCluster [Caron arXiv:1807.05520]
SwAV [Caron arXiv:2006.09882]

Non-Contrastive Methods for Latent Variable Models?


Latent variables parameterize the set of predictions.

Ideally, the latent variable represents independent explanatory factors of variation of the prediction.
The information capacity of the latent variable must be minimized; otherwise all the information for the prediction will go into it.

[Diagram: observation x → Pred(x) → h; latent variable z; Dec(z,h) → ȳ; cost C(y, ȳ) against the desired prediction y]

Regularized Latent Variable EBM


Regularizer R(z) limits the information capacity of z.
Without regularization, every y may be reconstructed exactly (flat energy surface).

Examples of R(z):
Effective dimension
Quantization / discretization
L0 norm (number of non-zero components)
L1 norm with decoder normalization
Maximize lateral inhibition / competition
Add noise to z while limiting its L2 norm (VAE)
<your_information_throttling_method_goes_here>

RLVEBM: Regularized or Variational Auto-Encoder


A2: regularize the volume of the low-energy regions
Regularized auto-encoder, sparse AE, LISTA

Variational AE:
F(y) approximated by sampling and/or a variational method
Encoder performs amortized inference [Gregor & YLC, ICML 2010]

[Diagram: y → Enc(y,h) → z̄; latent z with regularizer R(z) and divergence D(z, z̄); Dec(z,h) → ȳ; cost C(y, ȳ)]
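A minimal sketch of the variational-AE instance of this recipe: the encoder does amortized inference, noise is added to z, and a KL term plays the role of R(z). The sizes and the factorized Gaussian choice are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(784, 2 * 16)          # amortized inference: y -> (mu, log_var) of z
dec = nn.Linear(16, 784)              # Dec(z) -> reconstruction of y

def vae_loss(y, beta=1.0):
    mu, log_var = enc(y).chunk(2, dim=1)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()        # noisy z (reparameterization)
    y_hat = dec(z)
    recon = F.mse_loss(y_hat, y, reduction="sum")                # C(y, y_hat)
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum()  # R(z): limits information in z
    return recon + beta * kl

loss = vae_loss(torch.rand(32, 784))
loss.backward()
```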

Convolutional Sparse Auto-Encoder on Natural Images


Filters and basis functions obtained with a linear (convolutional) decoder and 1, 2, 4, 8, 16, 32, and 64 filters [Kavukcuoglu NIPS 2010]
[Figure: encoder filters and decoder filters]

Convolutional Sparse Auto-Encoder on Natural Images

Trained on CIFAR-10 (32x32 color images)
Architecture: linear decoder, LISTA recurrent encoder
[Figure: sparse codes (z) from the encoder; 9x9 decoder kernels]


Learning World Models
with
Regularized Latent-Variable
Energy-Based Models

Self-supervised prediction
under uncertainty

Conditional Regularized Latent-Variable EBM

Regularized latent-variable EBM for video prediction
The predictor captures the useful information from the past in h
The regularized latent variable captures the unpredictable information in the output
The regularizer ensures the latent variable does not capture all the information
The encoder performs amortized inference

[Diagram: x → Pred(x) → h; Enc(y,h) → z̄; latent z with R(z) and D(z, z̄); Dec(z,h) → ȳ; cost C(y, ȳ)]
Conditional VAE + Dropout

Training:
Observe frames x, compute h
Predict z̄ from the encoder
Sample z with $P(z \mid \bar z) \propto \exp[-\beta\,(D(z, \bar z) + R(z))]$
Half the time, set z = 0
Predict the next frame ȳ and backprop

Actual, Deterministic, VAE+Dropout Predictor/encoder



Forward Model for Model-Predictive Control


Forward model: s[t+1] = G(s[t], a[t], z[t])
Cost/Energy: f[t] = C(s[t])
Latent variable z sampled from q(z) proportional to exp(-R(z))
Optimize (a[1], a[2], …, a[T]) = argmin Σt C(s[t]) through backprop (== Kelley-Bryson adjoint state method)

[Diagram: the forward model G(s,a,z) unrolled over time, with per-step latents z[t] ~ R(z), actions a[t], costs C(s), and perception providing the initial state s[t]]
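A minimal sketch of this model-predictive control loop with a stand-in (pretend already-trained) forward model; G, C, and all sizes are placeholders:

```python
import torch
import torch.nn as nn

state_dim, action_dim, latent_dim, T = 8, 2, 4, 10
G = nn.Linear(state_dim + action_dim + latent_dim, state_dim)   # stand-in for s[t+1] = G(s,a,z)
goal = torch.zeros(state_dim)

def C(s):                                  # per-step cost C(s[t])
    return ((s - goal) ** 2).sum()

s0 = torch.randn(state_dim)                # initial state, from perception
actions = torch.zeros(T, action_dim, requires_grad=True)
opt = torch.optim.SGD([actions], lr=0.05)

for _ in range(200):                       # optimize the action sequence by backprop
    opt.zero_grad()
    s, total = s0, 0.0
    for t in range(T):
        z = torch.randn(latent_dim)        # z[t] sampled from q(z) ~ exp(-R(z)); Gaussian here
        s = G(torch.cat([s, actions[t], z]))
        total = total + C(s)
    total.backward()                       # gradients w.r.t. a[1..T] through the unrolled model
    opt.step()
```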

Forward Model for Gradient-Based Policy Learning


Forward model: s[t+1] = G(s[t], a[t], z[t])
Cost/Energy: f[t] = C(s[t],a[t])
Latent variable z sampled from q(z) proportional to exp(-R(z))
Policy: a[t] = P(s[t])
Learn P through backprop (== Kelley-Bryson adjoint state method)

[Diagram: the forward model G(s,a,z) unrolled over time, with the policy network P(s) producing a[t] at each step, per-step latents z[t], and costs C(s,a)]

Driving an Invisible Car in “Real” Traffic



Conclusions
SSL is the future
Learning hierarchical features in a task-invariant way
Plenty of data, massive networks
Learning Forward Models for Model-Based Control
Challenge: handling uncertainty in the prediction: energy-based models
Reasoning through vector representations and energy minimization
Energy-Based Models with latent variables
Replace symbols by vectors and logic by continuous functions.
Learning hierarchical representations of action plans
No idea how to do that!
There is no such thing as AGI. Intelligence is always specialized.
We should talk about rat-level, cat-level, or human-level AI (HLAI).

The way to Human-Level AI?


World model: predicts future states
Critic: predicts expected objective
Cost: computes the objective
Perception: estimates the world state
Actor: computes actions
Configurator: configures the world model engine for the situation at hand

We only have one world model engine!
Is this what consciousness is?

[Diagram: Perception → Model of the World ↔ Actor, with Critic and Cost; the Configurator sets the configuration]

Conjectures
Self-Supervised Learning to learn models of the world.
Models of the world: learning with uncertainty in the prediction
Perhaps common sense will emerge from learning world models
Emotions are (often) anticipations of outcomes
According to predictions from the model of the world
Reasoning is finding actions that optimize outcomes
Constraint satisfaction/cost minimization rather than logic
Consciousness may be the deliberate configuration of our world model engine?
We only have one (configurable) model of the world
If our brains had infinite capacity, we would not need consciousness
Thank You!
