
AI & Machine Learning:

Past, Present & Future

Yann LeCun
NYU - Courant Institute & Center for Data Science
Facebook AI Research
http://yann.lecun.com
AT&T, 2020-10-27

Supervised Learning works but requires many labeled samples


Training a machine by showing examples instead of programming it
When the output is wrong, tweak the parameters of the machine
Works well for:
Speech → words
Image → categories (e.g., CAR, PLANE)
Portrait → name
Photo → caption
Text → topic
…

Traditional Machine Learning → Deep Learning


Traditional machine learning & pattern recognition:
hand-engineered feature extractor → feature vector → trainable classifier → predicted output

Deep Learning: learning hierarchical representations

Trainable module → trainable module → trainable module
Low-level representation → high-level representation → predicted output

Multilayer Architectures == Compositional Structure of Data


Natural data is compositional ⇒ it is efficiently representable hierarchically

Low-level features → mid-level features → high-level features → trainable classifier

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

(Deep) Multi-Layer Neural Nets


Multiple layers of simple units
Each unit computes a weighted sum of its inputs
The weighted sum is passed through a non-linear function, e.g. ReLU(x) = max(x, 0)
The learning algorithm changes the weights

[Figure: a stack of weight matrices and non-linearities mapping an input image to the output "Ceci est une voiture" ("This is a car")]

Computing Gradients by Back-Propagation



A practical application of the chain rule

Backprop for the state gradients
Backprop for the weight gradients

(x: input, y: desired output)
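For reference, a sketch of the two standard recurrences behind these bullets, written for a feed-forward stack $x_k = f_k(x_{k-1}, w_k)$ with cost $C$ (the layer/weight indexing is my notation, not taken from the slide):

$\dfrac{\partial C}{\partial x_{k-1}} = \dfrac{\partial C}{\partial x_k}\,\dfrac{\partial f_k(x_{k-1}, w_k)}{\partial x_{k-1}}$   (state gradients, propagated backward through each module)

$\dfrac{\partial C}{\partial w_k} = \dfrac{\partial C}{\partial x_k}\,\dfrac{\partial f_k(x_{k-1}, w_k)}{\partial w_k}$   (weight gradients, used by the gradient-based update)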

What is Deep Learning?


Definition: Deep Learning is building a system by assembling parameterized modules into a (possibly dynamic) computation graph, and training it to perform a task by optimizing the parameters using a gradient-based method.

The graph can be defined dynamically by input-dependent programs: differentiable programming.

The output may be computed through a complex (non feed-forward) process, e.g. by minimizing some energy function: relaxation, constraint satisfaction, structured prediction, …

Learning paradigms and objective functions are up to the designer: supervised, reinforced, self-supervised/unsupervised, classification, prediction, reconstruction, …

Note: the limitations of supervised learning are sometimes mistakenly seen as intrinsic limitations of DL.

[Diagram: modules Pred(x), Enc(y,h), Dec(z,h), regularizer R(z), and cost C(y,ȳ) assembled into a computation graph over x, y, h, z]
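As an illustration of "assembling parameterized modules into a (possibly dynamic) computation graph", here is a minimal PyTorch sketch (module names and sizes are illustrative, not from the talk); the graph depends on the input through the while-loop, yet gradients still flow:

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    """A tiny differentiable program: the number of refinement steps
    depends on the input, so the compute graph is built on the fly."""
    def __init__(self, dim=16):
        super().__init__()
        self.embed = nn.Linear(8, dim)
        self.refine = nn.Linear(dim, dim)   # reused a data-dependent number of times
        self.readout = nn.Linear(dim, 1)

    def forward(self, x):
        h = torch.relu(self.embed(x))
        steps = 0
        while h.norm() > 1.0 and steps < 10:   # input-dependent control flow
            h = torch.relu(self.refine(h)) * 0.5
            steps += 1
        return self.readout(h)

net = DynamicNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
x, y = torch.randn(8), torch.tensor([0.3])
loss = (net(x) - y).pow(2).mean()   # scalar objective
loss.backward()                     # gradients flow through whatever graph was built
opt.step()
```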

I was at AT&T from 1988 to 1996


Larry D. Jackel

2002 lab reunion



Adaptive Systems Research Dept (BL11315, 1985-1996)


New device fabrication → neural net hardware → ML research
Holmdel, NJ

Hubel & Wiesel's Model of the Architecture of the Visual Cortex


[Hubel & Wiesel 1962]: simple cells detect local features; complex cells "pool" the outputs of simple cells within a retinotopic neighborhood.
Multiple stages of convolutions and pooling/subsampling: [Fukushima 1982], [LeCun 1989, 1998], [Riesenhuber 1999], …
[Figure after Thorpe & Fabre-Thorpe 2001: the visual pathway as alternating "simple cells" and "complex cells"]

Convolutional Network Architecture [LeCun et al. NIPS 1989]

Filter bank + non-linearity → pooling → filter bank + non-linearity → pooling → filter bank + non-linearity

Inspired by [Hubel & Wiesel 1962] & [Fukushima 1982] (Neocognitron):
simple cells detect local features
complex cells "pool" the outputs of simple cells within a retinotopic neighborhood

Convolutional Network (LeNet5, vintage 1990)


Filters-tanh → pooling → filters-tanh → pooling → filters-tanh
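A minimal PyTorch sketch of such a stack (channel counts and the 32x32 input size are illustrative, not necessarily the original LeNet-5 dimensions):

```python
import torch
import torch.nn as nn

# Filters + tanh -> pooling, repeated, then a small classifier head.
lenet_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),    # filter bank + non-linearity
    nn.AvgPool2d(2),                              # pooling
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),
    nn.AvgPool2d(2),
    nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(), # last filter bank sees a 5x5 map
    nn.Flatten(),
    nn.Linear(120, 10),                           # class scores
)

scores = lenet_like(torch.randn(1, 1, 32, 32))    # -> shape (1, 10)
```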

LeNet character recognition demo 1992


Running on an AT&T DSP32C (floating-point DSP, 20 MFLOPS)

ConvNets can recognize multiple objects


All layers are convolutional
The network performs simultaneous segmentation and recognition
[LeCun, Bottou, Bengio, Haffner, Proc. IEEE 1998]

1986-1996 Neural Net Hardware at Bell Labs, Holmdel


1986: 12x12 resistor array
Fixed resistor values; e-beam lithography, 6x6 microns
1988: 54x54 neural net
Programmable ternary weights; on-chip amplifiers and I/O
1991: Net32k: 256x128 net
Programmable ternary weights; 320 GOPS, 1-bit convolver
1992: ANNA: 64x64 net
ConvNet accelerator: 4 GOPS; 6-bit weights, 3-bit activations
Check Reader (AT&T, 1995)

Graph transformer network trained to read check amounts.
Trained globally with negative log-likelihood loss.
50% correct, 49% reject, 1% error (detectable later in the process).
Fielded in 1996, used in many banks in the US and Europe.
Processed an estimated 10% to 20% of all the checks written in the US in the early 2000s.
[LeCun, Bottou, Bengio, Haffner 1998]
The Deep Learning
Revolution

Since 2010 or so.



Deep ConvNets for Object Recognition (on GPU)


AlexNet [Krizhevsky et al. NIPS 2012], OverFeat [Sermanet et al. 2013]
1 to 10 billion connections, 10 million to 1 billion parameters, 8 to 20 layers.

Applications of Deep Learning


Medical image analysis
Self-driving cars
Accessibility
Face recognition
Language translation
Virtual assistants*
Content understanding for: filtering, selection/ranking, search
Games
Security, anomaly detection
Diagnosis, prediction
Science!
[Figure credits: Mnih 2015; MobilEye; Geras 2017; Esteva 2017]

Supervised DL works amazingly well, when you have data


And services like Facebook, Instagram, Google, YouTube, … are built around it.
Content understanding, filtering, ranking, translation, accessibility, …

Deep Learning Saves Lives


Automated emergency braking systems
Reduce collisions by 40%
Use convolutional nets

Tumor detection in mammograms
[Wu et al. arXiv:1903.08297]
https://github.com/nyukat/breast_cancer_classifier

Content filtering
Hate speech, calls to violence, weapon sales, terrorist propaganda, …

FastMRI (NYU+FAIR): 4x-8x speed up for MRI data acquisition

MRI images subsampled (in k-space) by 4x and 8x
[Zbontar et al. arXiv:1811.08839]
U-Net architecture
[Figure: reconstructions at 4-fold and 8-fold acceleration, with the corresponding k-space masks]

ConvNets in neuroscience
[Eickenberg et al. NeuroImage 2016]

ConvNets in Astrophysics [He et al. PNAS 07/2019]

1. Train a coarse-grained 3D U-Net to approximate a fine-grained simulation on a small volume
2. Use it for a simulation on a large volume (the early universe)

Deep Learning in Science

Protein design / molecular dynamics
Protein structure/function prediction
Material science / molecular dynamics
Prediction of material properties
High-energy physics
Jet filtering / analysis [Komiske arXiv:1612.01551]
Cosmology / astrophysics
Inferring constants from observations
Statistical studies of galaxies
Dark matter through gravitational lensing
Others…

Deep Learning at the edge

Today, Facebook, Google, Amazon and others are built around DL.
Take deep learning out, and the companies fold
Much of the computation is in data centers on regular CPUs
Highly-optimized code.
Increasingly, they will run on dedicated hardware.
More power efficient
Soon, ConvNets and other DL systems will be everywhere.
Smartphones, AR glasses, VR goggles, cars, medical imaging systems,
vacuum cleaners, cameras, toys, and almost all consumer electronics.

Three challenges for AI & Machine Learning


1. Learning with fewer labeled samples and/or fewer trials
Supervised and reinforcement learning require too many samples/trials
Self-supervised learning: learning dependencies, learning to fill in the blanks
Learning to represent the world in a non-task-specific way

2. Learning to reason, like Daniel Kahneman's "System 2"
Beyond feed-forward, "System 1" subconscious computation
Making reasoning compatible with learning

3. Learning to plan complex action sequences
Learning hierarchical representations of action plans
New Deep Learning Architectures
Attention, memory, dynamic architectures, hypernetworks

Differentiable Associative Memory == “soft RAM”


Memory Networks, Transformer networks, ELMo, GPT, BERT, GPT-2, RoBERTa, XLM-R, …
Used very widely in NLP
Essentially a "soft" RAM or hash table
Memory Networks [Weston et al. 2014] (FAIR)
Stack-Augmented Recurrent Neural Net [Joulin & Mikolov 2014] (FAIR)
Neural Turing Machine [Graves 2014]
Differentiable Neural Computer [Graves 2016]

Given an input (address) X, keys K_i and values V_i:

$Y = \sum_i c_i V_i, \qquad c_i = \frac{e^{K_i^\top X}}{\sum_j e^{K_j^\top X}}$
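A minimal Python/PyTorch sketch of this soft lookup (single query, single head; names and sizes are illustrative):

```python
import torch

def soft_lookup(x, keys, values):
    """Differentiable associative memory: softmax over key/query dot
    products, then a convex combination of the values.
    x: (d,) query/address, keys: (n, d), values: (n, m)."""
    scores = keys @ x                     # K_i^T X for every memory slot i
    c = torch.softmax(scores, dim=0)      # c_i = exp(K_i^T X) / sum_j exp(K_j^T X)
    return c @ values                     # Y = sum_i c_i V_i

x = torch.randn(64)
keys, values = torch.randn(100, 64), torch.randn(100, 32)
y = soft_lookup(x, keys, values)          # (32,), fully differentiable
```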

Transformer Architecture

Multi-head attention [Vaswani et al. arXiv:1706.03762]
10 to 60 stages
BERT model [Devlin et al. arXiv:1810.04805]
Trained to fill in missing words

DETR: ConvNet → Transformer for object detection


DETR [Carion et al. ArXiv:2005.12872]
https://github.com/facebookresearch/detr
ConvNet → Transformer
Object-based visual reasoning
Transformer: dynamic networks through "attention"

DETR: results on panoptic segmentation



Networks produced by other networks

2D image to 3D model [Littwin & Wolf arXiv:1908.06277]
Net1 → weights of Net2: implicit function for 3D shape

ConvNets on Graphs (fixed and data-dependent)


Graphs can represent: natural language, social networks, chemistry, physics, communication networks, …

Review paper: "Geometric deep learning: going beyond Euclidean data", M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, P. Vandergheynst, IEEE Signal Processing Magazine 34(4), 18-42, 2017 [arXiv:1611.08097]

Spectral ConvNets / Graph ConvNets

Regular grid graph → standard ConvNet
Fixed irregular graph → spectral ConvNet
Dynamic irregular graph → graph ConvNet

IPAM workshop:
http://www.ipam.ucla.edu/programs/workshops/new-deep-learning-techniques/

Lessons learned #3
3.1: Dynamic networks are gaining in popularity (e.g. for NLP)
Dynamicity breaks many assumptions of current hardware
Can’t optimize the compute graph distribution at compile time.
Can’t do batching easily!
3.2: Large-Scale Memory-Augmented Networks & Transformers...
...Will require efficient associative memory/nearest-neighbor search
3.3: Graph ConvNets are very promising for many applications
Say goodbye to matrix multiplications?
Say goodbye to tensors?
3.4: Large Neural Nets may have sparse activity
How to exploit sparsity in hardware?
How do humans and animals learn so quickly?
Not supervised. Not reinforced.

When infants learn how the world works [after Emmanuel Dupoux]

[Chart (age in months, 0-14) of when infants acquire concepts:
Perception: biological motion, face tracking
Production: emotional contagion, proto-imitation, pointing, crawling, walking
Physics: gravity, inertia, stability, support, conservation of momentum
Objects: object permanence, solidity, rigidity, shape constancy, natural kind categories
Actions: rational, goal-directed actions
Social / Communication: helping vs hindering, false perceptual beliefs]

How do Human and Animal Babies Learn?


How do they learn how the world works?
Largely by observation, with remarkably little interaction (initially).
They accumulate enormous amounts of background knowledge
About the structure of the world, like intuitive physics.
Perhaps common sense emerges from this knowledge?

Photos courtesy of
Emmanuel Dupoux
Self-Supervised
Learning
Capture dependencies.
Predict everything from everything else.

Self-Supervised Learning = Learning to Fill in the Blanks


Reconstruct the input or predict missing parts of the input (along time or space).

Two Uses for Self-Supervised Learning

1. Learning hierarchical representations of the world
SSL pre-training precedes a supervised or RL phase

2. Learning predictive (forward) models of the world
Learning models for model-predictive control, policy learning for control, or model-based RL

Question: how to represent uncertainty/multi-modality in the prediction?
Energy-Based
Models
Capture dependencies through
an energy function.

Energy-Based Models (EBM)


Energy function F(x,y): scalar-valued
Takes low values when y is compatible with x, higher values when y is less compatible with x
Inference: find values of y that make F(x,y) small; there may be multiple solutions
Note: the energy is used for inference, not for learning

[Figure: blue dots are data points in the (x, y) plane, with the corresponding energy surface]

Energy-Based Model: gradient-based inference


If y is continuous, we can use a gradient-based method for inference:
start from an initial y and follow the negative gradient of F(x,y) with respect to y.
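A minimal sketch of gradient-based inference on a toy energy (the quadratic F below is purely illustrative):

```python
import torch

def F(x, y):
    # Toy energy: low when y is compatible with x (here, y close to sin(x)).
    return ((y - torch.sin(x)) ** 2).sum()

x = torch.tensor([1.0, 2.0])
y = torch.zeros(2, requires_grad=True)       # initial guess for y

opt = torch.optim.SGD([y], lr=0.1)
for _ in range(100):                          # descend the energy w.r.t. y only
    opt.zero_grad()
    energy = F(x, y)
    energy.backward()
    opt.step()
# y now approximates argmin_y F(x, y), i.e. sin(x)
```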

Training an Energy-Based Model


Parameterize F(x,y)
Training samples: x[i], y[i]
Shape F(x,y) so that:
F(x[i], y[i]) is strictly smaller than F(x[i], y) for all y different from y[i]
Keep F smooth (max-likelihood probabilistic methods break that!)

Two classes of learning methods:
1. Contrastive methods: push down on F(x[i], y[i]), push up on other points F(x[i], y')
2. Regularized/architectural methods: build F(x,y) so that the volume of low-energy regions is limited or minimized through regularization

Contrastive Methods vs Regularized/Architectural Methods


Contrastive: [these are all different ways to pick which points to push up]
C1: push down on the energy of data points, push up everywhere else: max likelihood (needs a tractable partition function or a variational approximation)
C2: push down on the energy of data points, push up on chosen locations: max likelihood with MC/MCMC/HMC, contrastive divergence, metric learning / Siamese nets, ratio matching, noise contrastive estimation, minimum probability flow, adversarial generators / GANs
C3: train a function that maps points off the data manifold to points on the data manifold: denoising auto-encoder, masked auto-encoder (e.g. BERT)

Regularized/Architectural: [different ways to limit the information capacity of the latent representation]
A1: build the machine so that the volume of low-energy space is bounded: PCA, K-means, Gaussian mixture model, square ICA, normalizing flows, …
A2: use a regularization term that measures the volume of space that has low energy: sparse coding, sparse auto-encoder, LISTA, variational auto-encoders, discretization/VQ/VQ-VAE
A3: F(x,y) = C(y, G(x,y)); make G(x,y) as "constant" as possible with respect to y: contracting auto-encoder, saturating auto-encoder
A4: minimize the gradient and maximize the curvature around data points: score matching
Denoising AE: discrete outputs
[Vincent et al. JMLR 2008]

Masked auto-encoders: BERT [Devlin 2018], RoBERTa [Ott 2019]
Latent variables ("switches") turn softmax vectors into observed word(s)

Issues:
Latent variables are in the output space
No abstract latent variable to control the output
How to cover the space of corruptions?

Example: "This is a [...] of text extracted [...] a large set of [...] articles" → "This is a piece of text extracted from a large set of news articles"

[Diagram: y → corruption → x → Pred(x) → h → Dec(h) → ȳ, with reconstruction cost C(y, ȳ)]
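A minimal sketch of this masked, "fill in the blanks" training signal, BERT-style, in PyTorch (the vocabulary, sizes, and tiny encoder are illustrative placeholders):

```python
import torch
import torch.nn as nn

vocab, d = 1000, 64
MASK = 0                                   # token id used for corruption

embed = nn.Embedding(vocab, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
decoder = nn.Linear(d, vocab)              # softmax over words at each position

y = torch.randint(1, vocab, (8, 32))       # batch of clean token sequences
mask = torch.rand(y.shape) < 0.15          # corrupt 15% of the positions
x = y.masked_fill(mask, MASK)

h = encoder(embed(x))                      # contextual representations
logits = decoder(h)                        # predictions for every position
loss = nn.functional.cross_entropy(        # only the masked positions are scored
    logits[mask], y[mask])
loss.backward()
```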

Transformer Architecture
Multi-head attention (associative memory) [Vaswani et al. arXiv:1706.03762]
10 to 60 stages
BERT model [Devlin et al. arXiv:1810.04805]
Trained to fill in missing words

Multilingual Transformer Architecture: XLM-R
[Lample & Conneau arXiv:1901.07291]

Supervised Symbol Manipulation

Solving integrals and differential equations symbolically with a transformer architecture
[Lample & Charton arXiv:1912.01412]
[Table in figure: accuracy on various problems]

Natural language understanding & generation [MMBlenderbot]


Denoising AE: continuous outputs
Image inpainting [Pathak 17]
Doesn't quite work for feature learning
Most current approaches do not have latent variables
Contrastive Joint Embedding

Distance measured in feature space: C(h, h')
Multiple "predictions" through feature invariance
Siamese nets, metric learning [Bromley NIPS'93], [Chopra CVPR'05], [Hadsell CVPR'06]
Advantage: no pixel-level reconstruction
Difficulty: hard negative mining
Positive pair: make F small; negative pair: make F large

Successful examples for images:
DeepFace [Taigman et al. CVPR'14]
PIRL [Misra et al. arXiv:1912.01991]
MoCo [He et al. arXiv:1911.05722]
SimCLR [Chen et al. arXiv:2002.05709]
Speech:
wav2vec 2.0 [Baevski et al. arXiv:2006.11477]
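A minimal sketch of the contrastive idea with a shared ("Siamese") encoder and an InfoNCE-style objective; the encoder and batch construction are placeholders, not any specific published method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))   # shared weights

def contrastive_loss(xa, xb, temperature=0.1):
    """xa[i] and xb[i] are two views of the same image (positive pair);
    every other pairing in the batch serves as a negative pair."""
    ha = F.normalize(enc(xa), dim=1)          # h  in feature space
    hb = F.normalize(enc(xb), dim=1)          # h'
    logits = ha @ hb.t() / temperature        # similarities between all pairs
    targets = torch.arange(len(xa))           # the matching index is the positive
    return F.cross_entropy(logits, targets)   # pull positives together, push negatives apart

xa, xb = torch.randn(16, 3, 32, 32), torch.randn(16, 3, 32, 32)
loss = contrastive_loss(xa, xb)
loss.backward()
```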
Non-Contrastive Embedding

Advantage: no pixel-level reconstruction
Eliminates hard negative mining
Siamese nets with slightly different weights: w' is a (moving) average of w, with a predictor Pred on one branch
Bootstrap Your Own Latent (BYOL) [Grill arXiv:2006.07733]

Using cluster centers as targets:
DeepCluster [Caron arXiv:1807.05520]
SwAV [Caron arXiv:2006.09882]

Non-Contrastive Methods for Latent Variable Models?


Latent variables parameterize the set of predictions.

Ideally, the latent variable represents independent explanatory factors of variation of the prediction.
The information capacity of the latent variable must be minimized; otherwise all the information for the prediction will go into it.

[Diagram: observation x → Pred(x) → h; latent variable z; Dec(z,h) → ȳ; cost C(y, ȳ) against the desired prediction y]

Regularized Latent Variable EBM


Regularizer R(z) limits the information capacity of z.
Without regularization, every y may be reconstructed exactly (flat energy surface).

Examples of R(z):
Effective dimension
Quantization / discretization
L0 norm (number of non-zero components)
L1 norm with decoder normalization
Maximize lateral inhibition / competition
Add noise to z while limiting its L2 norm (VAE)
<your_information_throttling_method_goes_here>

RLVEBM: Regularized or Variational Auto-Encoder


A2: regularize the volume of the low-energy regions
Regularized auto-encoder, sparse AE, LISTA

Variational AE:
F(y) approximated by sampling and/or a variational method
Encoder performs amortized inference [Gregor & YLC, ICML 2010]

[Diagram: y → Enc(y,h) → z̄; latent z with regularizer R(z) and divergence D(z, z̄); Dec(z,h) → ȳ; cost C(y, ȳ)]
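A minimal sketch of the variational-AE instance of this recipe: the encoder does amortized inference, noise is added to z, and a KL term plays the role of R(z). The sizes and the factorized Gaussian choice are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(784, 2 * 16)          # amortized inference: y -> (mu, log_var) of z
dec = nn.Linear(16, 784)              # Dec(z) -> reconstruction of y

def vae_loss(y, beta=1.0):
    mu, log_var = enc(y).chunk(2, dim=1)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()        # noisy z (reparameterization)
    y_hat = dec(z)
    recon = F.mse_loss(y_hat, y, reduction="sum")                # C(y, y_hat)
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum()  # R(z): limits information in z
    return recon + beta * kl

loss = vae_loss(torch.rand(32, 784))
loss.backward()
```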

Convolutional Sparse Auto-Encoder on Natural Images


Filters and basis functions obtained with a linear (convolutional) decoder and 1, 2, 4, 8, 16, 32, and 64 filters [Kavukcuoglu NIPS 2010]
[Figure: encoder filters and decoder filters]

Convolutional Sparse Auto-Encoder on Natural Images

Trained on CIFAR-10 (32x32 color images)
Architecture: linear decoder, LISTA recurrent encoder
[Figure: sparse codes (z) from the encoder; 9x9 decoder kernels]


Learning World Models
with
Regularized Latent-Variable
Energy-Based Models

Self-supervised prediction
under uncertainty

Conditional Regularized Latent-Variable EBM

Regularized latent-variable EBM for video prediction
The predictor captures the useful information from the past in h
The regularized latent variable captures the unpredictable information in the output
The regularizer ensures the latent variable does not capture all the information
The encoder performs amortized inference

[Diagram: x → Pred(x) → h; Enc(y,h) → z̄; latent z with R(z) and D(z, z̄); Dec(z,h) → ȳ; cost C(y, ȳ)]
Conditional VAE + Dropout

Training:
Observe frames x, compute h
Predict z̄ from the encoder
Sample z with $P(z \mid \bar z) \propto \exp[-\beta\,(D(z, \bar z) + R(z))]$
Half the time, set z = 0
Predict the next frame ȳ and backprop

Actual, Deterministic, VAE+Dropout Predictor/encoder



Forward Model for Model-Predictive Control


Forward model: s[t+1] = G(s[t], a[t], z[t])
Cost/Energy: f[t] = C(s[t])
Latent variable z sampled from q(z) proportional to exp(-R(z))
Optimize (a[1], a[2], …, a[T]) = argmin Σt C(s[t]) through backprop (== Kelley-Bryson adjoint state method)

[Diagram: the forward model G(s,a,z) unrolled over time, with per-step latents z[t] ~ R(z), actions a[t], costs C(s), and perception providing the initial state s[t]]
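A minimal sketch of this model-predictive control loop with a stand-in (pretend already-trained) forward model; G, C, and all sizes are placeholders:

```python
import torch
import torch.nn as nn

state_dim, action_dim, latent_dim, T = 8, 2, 4, 10
G = nn.Linear(state_dim + action_dim + latent_dim, state_dim)   # stand-in for s[t+1] = G(s,a,z)
goal = torch.zeros(state_dim)

def C(s):                                  # per-step cost C(s[t])
    return ((s - goal) ** 2).sum()

s0 = torch.randn(state_dim)                # initial state, from perception
actions = torch.zeros(T, action_dim, requires_grad=True)
opt = torch.optim.SGD([actions], lr=0.05)

for _ in range(200):                       # optimize the action sequence by backprop
    opt.zero_grad()
    s, total = s0, 0.0
    for t in range(T):
        z = torch.randn(latent_dim)        # z[t] sampled from q(z) ~ exp(-R(z)); Gaussian here
        s = G(torch.cat([s, actions[t], z]))
        total = total + C(s)
    total.backward()                       # gradients w.r.t. a[1..T] through the unrolled model
    opt.step()
```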

Forward Model for Gradient-Based Policy Learning


Forward model: s[t+1] = G(s[t], a[t], z[t])
Cost/Energy: f[t] = C(s[t],a[t])
Latent variable z sampled from q(z) proportional to exp(-R(z))
Policy: a[t] = P(s[t])
Learn P through backprop (== Kelley-Bryson adjoint state method)

[Diagram: the forward model G(s,a,z) unrolled over time, with the policy network P(s) producing a[t] at each step, per-step latents z[t], and costs C(s,a)]

Driving an Invisible Car in “Real” Traffic



Conclusions
SSL is the future
Learning hierarchical features in a task-invariant way
Plenty of data, massive networks
Learning Forward Models for Model-Based Control
Challenge: handling uncertainty in the prediction: energy-based models
Reasoning through vector representations and energy minimization
Energy-Based Models with latent variables
Replace symbols by vectors and logic by continuous functions.
Learning hierarchical representations of action plans
No idea how to do that!
There is no such thing as AGI. Intelligence is always specialized.
We should talk about rat-level, cat-level, or human-level AI (HLAI).

The way to Human-Level AI?


World model: predicts future states
Critic: predicts expected objective
Cost: computes the objective
Perception: estimates the world state
Actor: computes actions
Configurator: configures the world model engine for the situation at hand

We only have one world model engine!
Is this what consciousness is?

[Diagram: Perception → Model of the World ↔ Actor, with Critic and Cost; the Configurator sets the configuration]

Conjectures
Self-Supervised Learning to learn models of the world.
Models of the world: learning with uncertainty in the prediction
Perhaps common sense will emerge from learning world models
Emotions are (often) anticipations of outcomes
According to predictions from the model of the world
Reasoning is finding actions that optimize outcomes
Constraint satisfaction/cost minimization rather than logic
Consciousness may be the deliberate configuration of our world model engine?
We only have one (configurable) model of the world
If our brains had infinite capacity, we would not need consciousness
Thank You!
