Lecun 20240124 Uw Lyttle


Lytle Lecture 2023-2024

Objective-Driven AI
Towards AI systems that can learn,
remember, reason, plan,
have common sense,
yet are steerable and safe

Yann LeCun
New York University
Meta – Fundamental AI Research

University of Washington
Lytle Lecture
2024-01-24
M-51, HSO
Y. LeCun

Machine Learning sucks! (compared to humans and animals)

Supervised learning (SL) requires large numbers of labeled samples.


Reinforcement learning (RL) requires insane amounts of trials.
Self-Supervised Learning (SSL) works great but...
Generative prediction only works for text and other discrete modalities

Animals and humans:


Can learn new tasks very quickly.
Understand how the world works
Can reason and plan
Humans and animals have common sense
Their behavior is driven by objectives (drives)

We Need Human-Level AI for Intelligent Assistant

Smart glasses
Communicates through voice, vision, display,
electro-myogram interfaces (EMG)
Intelligent Assistant
Can answer all of our questions
Helps us in our daily lives
Knows our preferences and interests
[Image: “Her” (2013)]

For this, we need machines with common sense


Machines that understand how the world works
Machines that can remember, reason, plan.

Future AI Assistants need Human-Level AI


AI assistants will require (super-)human-level intelligence
Like having a staff of smart “people” working for us
But, we are nowhere near human-level AI today
Any 17 year-old can learn to drive in 20 hours of training
Any 10 year-old can learn to clear the dinner table in one shot
Any house cat can plan complex actions

What are we missing?


Learning how the world works (not just from text)
World models. Common sense
Memory, Reasoning, Hierarchical Planning

Desiderata for AMI (Advanced Machine Intelligence)

Systems that learn world models from sensory inputs
E.g. learn intuitive physics from video
Systems that have persistent memory
Large-scale associative memories
Systems that can plan actions
So as to fulfill an objective
Systems that are controllable & safe
By design, not by fine-tuning.

Objective-Driven AI Architecture
Self-Supervised Learning
has taken over the world

For understanding and generating text, images, video, 3D models, speech, proteins,...

Self-Supervised Learning via Denoising / Reconstruction


Denoising Auto-Encoder [Vincent 2008], BERT [Devlin 2018], RoBERTa [Ott 2019]

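[Diagram: the input is corrupted by masking; the network computes a learned representation and reconstructs the original text]
Corrupted input: This is a [...] of text extracted [...] a large set of [...] articles
Reconstruction: This is a piece of text extracted from a large set of news articles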

No Language Left Behind (NLLB)


Language translation between 202 languages
in any of the 40602 directions
Training set: 18 billion pairs of sentences for 2440 language directions
Most pairs have less than 1 million sentences
https://ai.facebook.com/research/no-language-left-behind/
A single neural net with 54 billion parameters
Performance gets better as more languages are added
Relies on Self-Supervised Learning and back-translation.
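
The direction count on this slide can be checked with a line of arithmetic: each ordered (source, target) pair of distinct languages is one direction. The sketch below is only that check; the function name is made up for illustration.

```python
def translation_directions(n_languages):
    # Every ordered (source, target) pair with source != target.
    return n_languages * (n_languages - 1)

directions = translation_directions(202)
# Share of directions that have parallel training data, using the slide's 2440.
covered_share = 2440 / directions

print(directions)
print(f"{covered_share:.1%}")
```

So only a small share of the directions has direct training pairs, which is where self-supervised learning and back-translation come in.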


SeamlessM4T
Speech or text input: 100 languages
Text output: 100 languages
Speech output: 35 languages
Seamless Expressive: real-time, preserves voice & expression
https://ai.meta.com/blog/seamless-m4t/

Deep Learning Connects People to knowledge & to each other

Meta (FB, Instagram), Google, YouTube, Amazon, are built around Deep Learning
Take Deep Learning out of them, and they crumble.
DL helps us deal with the information deluge
Search, retrieval, ranking, question-answering
Requires machines to understand content
Translation / transcription / accessibility
language ↔ language; text ↔ speech; image → text
People speak thousands of different languages
3 billion people can’t use technology today.
800 million are illiterate, 300 million are visually impaired

On-Line Content Moderation


Filtering out illegal and dangerous content
What constitutes acceptable content?
Meta doesn’t see itself as having the legitimacy to decide
But in the absence of regulations, it has to do it.
Types of objectionable content on Facebook
(with % taken down preemptively & prevalence, Q1 2022)
Hate Speech (95.6%, 0.02%)
Violence incitement (98.1%, 0.03%)
Violence (99.5%, 0.04%)
Bullying/Harassment (67%, 0.09%)
Child endangerment (96.4%)
Suicide/Self-Injury (98.8%)
Nudity (96.7%, 0.04%)
Terrorism (16M pieces)
Fake accounts (1.5B)
Spam (1.8B)
https://transparency.fb.com/data/community-standards-enforcement
AI is the solution, not the problem

Hate speech suppression/down-ranking on Facebook

Of the violating content we actioned for hate speech, how much did we find and action before people reported it?
https://transparency.fb.com/reports/community-standards-enforcement/hate-speech/facebook/
[Chart: proactive rate for hate speech; labeled values 95.6% and 23.6%]

Protein folding and inverse folding (protein design)


ESMfold, ESMfold-2 (FAIR); AlphaFold, AlphaFold-2 (DeepMind)
Protein Folding: from a sequence of amino acids to 3D structure
[Jumper 21, Rives 19]
Protein Generation
[Lin et al. 2021]
Protein Design: from 3D structure to sequences of amino acids
For drug design
[Lin & al. BioRxiv:2022.07.20.500902]

ESM Metagenomic Atlas (FAIR+NYU)


615 million proteins with predicted 3D structure
Interactive website: https://esmatlas.com/
Paper: [Lin et al. 2022] Evolutionary-scale prediction of atomic level protein structure with a language model
https://www.biorxiv.org/content/10.1101/2022.07.20.500902
Code: https://github.com/facebookresearch/esm
Generative AI and
Auto-Regressive
Large Language Models

Auto-Regressive Generative Architectures

Outputs one “token” after another


Tokens may represent words, image patches, speech segments...

[Diagram: an encoder and a stochastic predictor. The prompt x[t-3], x[t-2], x[t-1], x[t] yields the predicted token x[t+1]; the context then shifts by one token to x[t-2], ..., x[t+1] and the next prediction is x[t+2].]
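
The token-by-token loop described on this slide can be sketched in a few lines. This is an illustration only: `toy_predictor` and its `BIGRAMS` table are hypothetical stand-ins for a trained network, and the window size of 4 simply mirrors the x[t-3] ... x[t] context in the diagram.

```python
import random

def generate(predict_next, prompt, n_new, context_size=4, seed=0):
    """Append n_new tokens, one at a time, each sampled from the predictor."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        context = tokens[-context_size:]      # sliding window over past tokens
        probs = predict_next(context)         # dict: token -> probability
        choices, weights = zip(*probs.items())
        # Stochastic step: sample the next token, then feed it back as input.
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens

# Hypothetical toy predictor: a hand-written bigram table, not a real model.
BIGRAMS = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"the": 1.0},
}

def toy_predictor(context):
    return BIGRAMS[context[-1]]

out = generate(toy_predictor, ["the"], n_new=5)
print(out)
```

Each produced token becomes part of the input for the next step, which is what makes the architecture auto-regressive.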

Auto-Regressive Large Language Models (AR-LLMs)


Outputs one text token after another
Tokens may represent words or subwords
Encoder/predictor is a transformer architecture
With billions of parameters: typically from 1B to 500B
Training data: 1 to 2 trillion tokens
LLMs for dialog/text generation:
Open: BlenderBot, Galactica, LLaMA, Llama-2, Code Llama (FAIR), Mistral-7B (Mistral), Falcon (UAE), Alpaca (Stanford), Yi (01.AI)…
Proprietary: Meta AI (Meta), LaMDA/Bard (Google), Chinchilla (DeepMind),
ChatGPT (OpenAI) …
Performance is amazing … but … they make stupid mistakes
Factual errors, logical errors, inconsistency, limited reasoning, toxicity...
LLMs have limited knowledge of the underlying reality
They have no common sense & they can’t plan their answer

Llama-2: https://ai.meta.com/llama/

Open source code / free & open models / can be used commercially
Available on Azure, AWS, HuggingFace,….

Meta AI: free public chatbot based on Llama-2 technology

Connect with “Meta AI” in Messenger app, and WhatsApp.


28 specialized Facebook chatbots: e.g. Snoop Dogg as Dungeon Master.

Auto-Regressive Generative Models Suck!

Auto-Regressive LLMs are doomed.


They cannot be made factual, non-toxic, etc.
They are not controllable
Probability e that any produced token takes us outside of the set of correct answers
Probability that answer of length n is correct: P(correct) = (1-e)^n
This diverges exponentially.
[Diagram: the tree of “correct” answers inside the tree of all possible token sequences]
It’s not fixable (without a major redesign).

See also [Dziri...Choi, ArXiv:2305.18654]
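
To get a feel for the formula P(correct) = (1-e)^n, the short sketch below evaluates it for a few per-token error rates and answer lengths. It takes the slide's formula at face value, which treats the per-token error as independent across tokens; the specific values of e and n are illustrative choices, not figures from the talk.

```python
def p_correct(e, n):
    """Probability that an n-token answer stays inside the set of correct answers."""
    return (1.0 - e) ** n

for e in (0.001, 0.01, 0.05):
    row = [round(p_correct(e, n), 4) for n in (10, 100, 1000)]
    print(f"e={e}: n=10, 100, 1000 -> {row}")
```

Even a 1% per-token error rate leaves only about a one-in-three chance that a 100-token answer is still correct under this model.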



Auto-Regressive Generative Models Suck!


AR-LLMs
Have a constant number of computational steps between input and
output. Weak representational power.
Do not really reason. Do not really plan. Have no common sense.
Noema Magazine, August 2023

Limitations of LLMs: no planning!


Auto-Regressive LLMs (at best)
approximate the functions of the
Wernicke and Broca areas in the brain.
What about the pre-frontal cortex?

ArXiv:2301.06627 ArXiv:2206.10498

Auto-Regressive LLMs Suck !

Auto-Regressive LLMs are good for


Writing assistance, first draft generation, stylistic polishing.
Code writing assistance
What they are not good for:
Producing factual and consistent answers (hallucinations!)
Taking into account recent information (posterior to the last training)
Behaving properly (they mimic behaviors from the training set)
Reasoning, planning, math
Using “tools”, such as search engines, calculators, database queries…
We are easily fooled by their fluency.
But they don’t know how the world works.

Current AI Technology is (still) far from Human Level


Machines do not learn how the world works, like animals and humans
Auto-Regressive LLMs can not approach human-level intelligence
Fluency, but limited world model, limited planning, limited reasoning.
Most human and animal knowledge is non-verbal.
We are still missing major advances to reach animal intelligence
AI is super-human in some narrow domains

There is no question that machines will eventually surpass human intelligence in all domains
Humanity’s total intelligence will increase
We should welcome that, not fear it.

We are missing something really big!


Never mind humans, cats and dogs can do amazing feats
Robot intelligence doesn’t come anywhere close
Any 10 year-old can learn to clear up the dinner table and fill up
the dishwasher in minutes.
We do not have robots that can do that.
Any 17 year-old can learn to drive a car in 20 hours of practice
We still don’t have unlimited Level-5 autonomous driving
Any house cat can plan complex actions

We keep bumping into Moravec’s paradox


Things that are easy for humans are difficult for AI and vice versa.

Data bandwidth and volume: LLM vs child.


LLM
Trained on 1.0E13 tokens (0.75E13 words). Each token is 2 bytes.
Data volume: 2.0E13 bytes.
Would take 170,000 years for a human to read (8h/day, 250 w/minute)

Human child
16,000 wake hours in the first 4 years (30 minutes of YouTube uploads)
2 million optic nerve fibers, carrying about 10 bytes/sec each.
Data volume: 1.1E15 bytes

A four year-old child has seen 50 times more data than an LLM !
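
The figures on this slide can be reproduced with back-of-the-envelope arithmetic. The sketch below uses only the slide's numbers, plus one assumption the slide does not state: a 365-day reading year.

```python
# LLM side (numbers from the slide)
tokens = 1.0e13
words = 0.75e13
bytes_per_token = 2
llm_bytes = tokens * bytes_per_token

# Time for a human to read the same text at 250 words/minute, 8 hours/day.
minutes = words / 250
years = minutes / 60 / 8 / 365          # assumption: 365 reading days per year

# Child side (numbers from the slide)
wake_seconds = 16_000 * 3600
fibers = 2e6
bytes_per_sec_per_fiber = 10
child_bytes = wake_seconds * fibers * bytes_per_sec_per_fiber

ratio = child_bytes / llm_bytes

print(f"LLM data volume:   {llm_bytes:.1e} bytes")
print(f"Reading time:      {years:,.0f} years")
print(f"Child data volume: {child_bytes:.2e} bytes")
print(f"Child / LLM ratio: {ratio:.0f}x")
```

The ratio comes out a little above 50, consistent with the rounded figure on the slide.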

Three challenges for AI & Machine Learning


1. Learning representations and predictive models of the world
Using Self-supervised learning from video and other sensory inputs
Learning to represent the world in a non task-specific way
Learning predictive world models for planning and control
2. Learning to reason, like Daniel Kahneman’s “System 2”
Beyond feed-forward, System 1 subconscious computation.
Making reasoning compatible with learning.
Reasoning and planning as energy minimization.
3. Learning to plan complex actions to satisfy objectives
Learning hierarchical representations of action plans

What are we missing?


Systems that learn world models from
sensory inputs
E.g. learn intuitive physics from video
Systems that have persistent memory
Large-scale associative memories
Systems that can plan actions
So as to fulfill an objective
Reason like “System 2” in humans
Systems that are controllable & safe
By design, not by fine-tuning.
Objective-Driven AI Architecture
Objective-Driven AI Systems
AI that can learn, reason, plan,
Yet is safe and controllable

“A path towards autonomous machine intelligence”


https://openreview.net/forum?id=BZ5a1r-kVsf

[various versions of this talk on YouTube]



Modular Cognitive Architecture for Objective-Driven AI


Configurator: Configures other modules for task
Perception: Estimates state of the world
World Model: Predicts future world states
Cost: Compute “discomfort”
Actor: Find optimal action sequences
Short-Term Memory: Stores state-cost episodes
[Diagram: configurator, perception, world model, actor, critic, intrinsic cost, cost, and short-term memory modules, with percept as input and action as output]

Objective-Driven AI
Perception: Computes an abstract representation of the state of the world
Possibly combined with previously-acquired information in memory
World Model: Predict the state resulting from an imagined action sequence
Task Objective: Measures divergence to goal
Guardrail Objective: Immutable objective terms that ensure safety
Operation: Finds an action sequence that minimizes the objectives
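[Diagram: Perception yields the initial world state representation; the World Model combines it with an action sequence to produce the predicted state sequence representation, which is scored by the Task Objective and the Guardrail Objective. A memory block is also shown.]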

Objective-Driven AI: Multistep/Recurrent World Model


Same world model applied at multiple time steps
Guardrail costs applied to entire state trajectory
This is identical to Model Predictive Control (MPC)
Action inference by minimization of the objectives
Using gradient-based method, graph search, DP, MCTS,….

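[Diagram: Perception yields the world state representation; the same World Model is applied twice, with action0 and then action1, giving the predicted state representation and the final state representation. Guardrail Costs score the predicted states and a Task Cost scores the final state.]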
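
Action inference by minimizing the objectives can be illustrated with a toy planner. Everything in this sketch is an assumption made for illustration: the one-dimensional `world_model`, the quadratic task cost, the `ceiling` guardrail, and the use of finite-difference gradients in place of backpropagation through a learned model.

```python
def world_model(state, action):
    """Hypothetical toy dynamics: the position moves by the action."""
    return state + action

def rollout(s0, actions):
    """Apply the same world model at every time step."""
    states = [s0]
    for a in actions:
        states.append(world_model(states[-1], a))
    return states

def objective(s0, actions, goal, ceiling=3.5, weight=10.0):
    states = rollout(s0, actions)
    task = (states[-1] - goal) ** 2                          # task cost on the final state
    guard = sum(max(0.0, s - ceiling) ** 2 for s in states)  # guardrail on the whole trajectory
    return task + weight * guard

def plan(s0, goal, horizon=5, steps=500, lr=0.05, eps=1e-5):
    """Gradient descent on the action sequence (finite-difference gradients)."""
    actions = [0.0] * horizon
    for _ in range(steps):
        base = objective(s0, actions, goal)
        grad = []
        for i in range(horizon):
            bumped = actions[:]
            bumped[i] += eps
            grad.append((objective(s0, bumped, goal) - base) / eps)
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions

actions = plan(s0=0.0, goal=3.0)
states = rollout(0.0, actions)
print([round(a, 3) for a in actions], round(states[-1], 3))
```

In a model-predictive control loop, only the first action would be executed before replanning from the newly observed state.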

Objective-Driven AI: Non-Deterministic World Model


The world is not deterministic or fully predictable
Latent variables parameterize the set of plausible predictions
Can be sampled from a prior or swept through a set.
Planning can be done for worst case or average case
Uncertainty in outcome can be predicted and quantified

[Diagram: Perception yields the world state representation; the World Model is applied twice, with action0 and then action1, and each application also takes a latent variable. Guardrail Costs score the predicted states and a Task Cost scores the final state representation.]
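
A small sketch of planning under a non-deterministic world model: latent variables are sampled from a prior, each candidate action sequence is rolled out under every sample, and the candidates are compared on average-case and worst-case cost. The additive-noise dynamics, the uniform prior, and the three candidate plans are all hypothetical choices for illustration.

```python
import random

def world_model(state, action, z):
    """Hypothetical dynamics: the latent z perturbs the outcome of the action."""
    return state + action + z

def evaluate(s0, actions, goal, latent_samples):
    """Return (average, worst-case) final cost over sampled latent sequences."""
    costs = []
    for zs in latent_samples:
        s = s0
        for a, z in zip(actions, zs):
            s = world_model(s, a, z)
        costs.append((s - goal) ** 2)
    return sum(costs) / len(costs), max(costs)

rng = random.Random(0)
horizon = 3
# Sample latent sequences from an assumed uniform prior.
latent_samples = [[rng.uniform(-0.2, 0.2) for _ in range(horizon)] for _ in range(200)]

candidates = [[a] * horizon for a in (0.5, 1.0, 1.5)]
best_avg = min(candidates, key=lambda acts: evaluate(0.0, acts, 3.0, latent_samples)[0])
best_worst = min(candidates, key=lambda acts: evaluate(0.0, acts, 3.0, latent_samples)[1])
avg_cost, worst_cost = evaluate(0.0, best_avg, 3.0, latent_samples)
print(best_avg, best_worst, round(avg_cost, 4), round(worst_cost, 4))
```

The spread between the average and worst-case costs is one simple way to quantify the uncertainty in the outcome.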

Objective-Driven AI: Hierarchical Planning


Hierarchical World Model and Planning
Higher levels make longer-term predictions in more abstract representations
Predicted states at higher levels define subtask objectives for lower level
Guardrail objectives ensure safety at every level

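[Diagram: two-level hierarchy. Upper level: Enc1(x) gives the initial s1; Pred1 is applied twice with latents z1 and Guardrail2 costs, and the resulting s1 feeds the Task Objective. Lower level: Enc0(x) gives the initial s0; Pred0 is applied twice with actions a0, a1, latents z0 and Guardrail1 costs, and the resulting s0 feeds the Subtask Objective.]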

Objective-Driven AI: Hierarchical Planning


Hierarchical Planning: going from NYU to Paris
[Diagram: the same two-level hierarchy, annotated. Upper level: initial state “At NYU”; objective “Distance to Paris”; latent questions “Taxi or train? EWR or JFK? Which airline?”. Lower level: initial state “Sitting in my NYU office”; objective “Distance to airport”; latent questions “Hail or call? Obstacles? Traffic?”; actions a0 “Go down in the street” and a1 “Grab a taxi to airport”.]

Objective-Driven AI: Hierarchical Planning


Multiple levels of world models
Predicted state at level k determines subtask for level k-1
Gradient-based optimization can be applied to action variables at all levels
Sampling can be applied to latent variables at all levels.
[Diagram: three-level hierarchy with encoders Enc0(x), Enc1(x), Enc2(s[0]); predictors Pred0, Pred1, Pred2; latents z0, z1, z2; actions a0, a1, a2; initial and final states s0, s1, s2; and costs C(s0,s1), C(s1,s2), C(s2)]
How could Machines
Learn World Models
from Sensory Input?

with
Self-Supervised Learning
How could machines learn like animals and humans?
[Chart, credited to Emmanuel Dupoux: the age in months (0 to 14) at which infants acquire various concepts, split into Perception and Production.
Social Communication: pointing, helping vs hindering, false perceptual beliefs
Actions: face tracking, rational goal-directed actions, biological motion
Physics: gravity, inertia, stability, support, conservation of momentum
Objects: object permanence, shape constancy, solidity, rigidity, natural kind categories
Production: proto-imitation, crawling, walking, emotional contagion]
How do babies learn how the world works?

Generative World Models with Self-Supervised Training?


Generative world model architecture
[Diagram: the representation of the state of the world at time t, together with a masking or action input, is used to produce a prediction of the state of the world at time t+1]
Masked input: This is a [...] of text extracted [...] a large set of [...] articles
Prediction: This is a piece of text extracted from a large set of news articles

Generative Architectures DO NOT Work for Images


Because the world is only partially predictable
A predictive model should represent multiple predictions
Probabilistic models are intractable in high-dim continuous domains.
Generative Models must predict every detail of the world
[Mathieu, Couprie, LeCun ICLR 2016]

My solution: Joint-Embedding Predictive Architecture

[Henaff, Canziani, LeCun ICLR 2019]



Joint Embedding World Model: Self-Supervised Training


Joint Embedding Predictive Architecture [LeCun 2022], [Assran 2023]

[Diagram: the representation of the state of the world at time t, together with a transformation or action input, is used to produce a prediction of the representation of the state of the world at time t+1]

Architectures: Generative vs Joint Embedding


Generative: predicts y (with all the details, including irrelevant ones)
Joint Embedding: predicts an abstract representation of y

a) Generative Architecture b) Joint Embedding Architecture


Examples: VAE, MAE...

Joint Embedding Architectures


Computes abstract representations for x and y
Tries to make them equal or predictable from each other.

a) Joint Embedding Architecture (JEA)
Examples: Siamese Net, PIRL, MoCo, SimCLR, Barlow Twins, VICReg, I-JEPA…
b) Deterministic Joint Embedding Predictive Architecture (DJEPA)
Examples: BYOL, VICRegL, I-JEPA
c) Joint Embedding Predictive Architecture (JEPA)
Examples: Equivariant VICReg

Architecture for the world model: JEPA


JEPA: Joint Embedding Predictive Architecture.
x: observed past and present
y: future
a: action
z: latent variable (unknown)
D( ): prediction cost
C( ): surrogate cost
JEPA predicts a representation of the future Sy from a representation of the past and present Sx
Energy-Based Models

Capturing dependencies through an energy function



Energy-Based Models: Implicit function


The only way to formalize & understand all model types
Gives low energy to compatible pairs of x and y
Gives higher energy to incompatible pairs
[Diagram: the energy landscape F(x,y) plotted over x and y, with x and y laid out along time or space]
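
The idea of an implicit function can be made concrete with a toy energy. In the sketch below the energy function is hand-written rather than learned, and it is deliberately built so that two different values of y are compatible with the same x, something an explicit function y = f(x) cannot express. Inference is a search for the y with the lowest energy; the grid search is an assumption made for simplicity.

```python
def energy(x, y):
    """Hypothetical energy: low when y is close to x**2 or to -x**2."""
    return min((y - x * x) ** 2, (y + x * x) ** 2)

def infer(x, candidates):
    """Inference: pick the candidate y with the lowest energy for the observed x."""
    return min(candidates, key=lambda y: energy(x, y))

grid = [i / 10 for i in range(-50, 51)]   # candidate y values from -5.0 to 5.0
y_star = infer(2.0, grid)

print(y_star, energy(2.0, y_star))
print(energy(2.0, 4.0), energy(2.0, -4.0), energy(2.0, 0.0))
```

Both y = 4 and y = -4 receive zero energy for x = 2, while incompatible values such as y = 0 receive a higher energy.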

Training Energy-Based Models: Collapse Prevention

A flexible energy surface can take any shape.


We need a loss function that shapes the energy surface so that:
Data points have low energies
Points outside the regions of high data density have higher energies.
[Figure panels: Collapse! / Contrastive Method / Regularized Methods]

EBM Training: two categories of methods

Contrastive methods
Push down on energy of training samples
Pull up on energy of suitably-generated contrastive samples
Scales very badly with dimension

Regularized Methods
Regularizer minimizes the volume of space that can take low energy

[Figure panels, each plotted in the (x, y) plane: Contrastive Method (training samples, contrastive samples, low energy region) and Regularized Method (training samples)]

Recommendations:

Abandon generative models in favor of joint-embedding architectures
Abandon probabilistic models in favor of energy-based models
Abandon contrastive methods in favor of regularized methods
Abandon Reinforcement Learning in favor of model-predictive control
Use RL only when planning doesn’t yield the predicted outcome, to adjust the world model or the critic.

Training a JEPA with Regularized Methods

Four terms in the cost
Maximize information content in representation of x
Maximize information content in representation of y
Minimize prediction error
Minimize information content of latent variable z
[Diagram: JEPA annotated with “Maximize Information Content” on both representations, “Minimize Prediction Error” on the predictor output, and “Minimize Information Content” on the latent variable]

VICReg: Variance, Invariance, Covariance Regularization

Variance: Maintains variance of components of representations
Covariance: Decorrelates components of representations (pushes the off-diagonal terms of their covariance matrix toward zero)
Invariance: Minimizes prediction error.
Barlow Twins [Zbontar et al. ArXiv:2103.03230], VICReg [Bardes, Ponce, LeCun arXiv:2105.04906, ICLR 2022],
VICRegL [Bardes et al. NeurIPS 2022], MCR2 [Yu et al. NeurIPS 2020][Ma, Tsao, Shum, 2022]
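
The three VICReg terms can be written out in plain Python for a small batch of embeddings. This is a minimal sketch, not the reference implementation: the helper names are made up, batches are lists of lists, and the weights (25, 25, 1), the target standard deviation `gamma=1.0`, and `eps=1e-4` are commonly cited defaults that should be treated as assumptions here.

```python
import math

def centered(z):
    """Subtract the per-dimension batch mean. z is a list of n rows of d floats."""
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in z]

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge loss that keeps the std of every dimension above gamma."""
    n, d, c = len(z), len(z[0]), centered(z)
    total = 0.0
    for j in range(d):
        var = sum(row[j] ** 2 for row in c) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d

def covariance_term(z):
    """Sum of squared off-diagonal covariances, divided by the dimension."""
    n, d, c = len(z), len(z[0]), centered(z)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum(row[i] * row[j] for row in c) / (n - 1)
                total += cov ** 2
    return total / d

def invariance_term(zx, zy):
    """Mean squared error between the embeddings of the two branches."""
    n, d = len(zx), len(zx[0])
    return sum((a - b) ** 2 for rx, ry in zip(zx, zy) for a, b in zip(rx, ry)) / (n * d)

def vicreg_loss(zx, zy, lam=25.0, mu=25.0, nu=1.0):
    return (lam * invariance_term(zx, zy)
            + mu * (variance_term(zx) + variance_term(zy))
            + nu * (covariance_term(zx) + covariance_term(zy)))

collapsed = [[1.0, 1.0] for _ in range(4)]                    # every sample maps to the same point
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]   # varied, decorrelated dimensions
print(vicreg_loss(collapsed, collapsed), vicreg_loss(spread, spread))
```

The collapsed batch has zero invariance error yet is heavily penalized by the variance term, which is how the regularizer prevents collapse without negative samples.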


SSL-Pretrained Joint Embedding for Image Recognition


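Training a supervised linear head
JEA pretrained with VICReg
[Diagram, left (pretraining): inputs x and y pass through feature extractors FeX(x) and FeX(y) (ConvNext) to give hx and hy (d=2048), then projections Proj(hx) and Proj(hy) (d=8192), which feed the costs. Right (evaluation): x passes through FeX(x) (ConvNet) to give hx, then a linear classifier trained with cross entropy against the label, e.g. “polar bear”.]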

VICReg: Results with linear head and semi-supervised.



VICReg: Results with transfer tasks.



VICRegL: local matching latent variable for segmentation

Latent variable optimization:


Finds a pairing between local feature vectors of the two images
[Bardes, Ponce, LeCun, NeurIPS 2022, arXiv:2210.01571]


Distillation Methods

Modified Siamese nets
Predictor head eliminates variation of representations due to distortions
Examples:
Bootstrap Your Own Latents [Grill arXiv:2006.07733]
SimSiam [Chen & He arXiv:2011.10566]
DINOv2 [Oquab arXiv:2304.07193]
Advantages
No negative samples
[Diagram: student branch x → FeX(x) → hx (d=2048) → Proj(hx) → Pred( ) (d=256) → Cost; teacher branch y → FeX(y) → hy → Proj(hy) → Cost; the weights w are linked between the branches by an EMA]
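
The EMA link between the two branches is usually an exponential moving average of the student weights, as in BYOL. The sketch below shows that update on a flat list of weights; the momentum value and the flat-list representation are assumptions for illustration, and not every method on this slide uses an EMA teacher.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

teacher = [1.0, 0.0]
student = [0.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

The teacher receives no gradient; it only follows the student slowly, which provides a stable prediction target.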

DINOv2: image foundation model


self-supervised generic image features
Demo: https://dinov2.metademolab.com/
Paper: [Oquab et al. ArXiv:2304.07193]
Classification
86.5% on IN1k with frozen features and linear head.
Fine-grained classification
Depth estimation
Semantic segmentation
Instance Retrieval
Dense & sparse feature matching


DINOv2: Joint Embedding Architecture

SSL by distillation

cross-ent

classify quantize

DINOv2

Feature visualization: RGB = top 3 principal components



DINOv2

Feature extraction, depth estimation, segmentation



Canopy Height Map using DINOv2


Estimates tree canopy height from satellite images using DINOv2 features
Using ground truth from Lidar images
0.5 meter resolution images
[ArXiv:2304.07213] Tolan et al.: Sub-meter resolution canopy height maps using self-supervised learning and a vision transformer trained on Aerial and GEDI Lidar

Image-JEPA: uses masking & transformer architectures


“SSL from images with a JEPA”
[M. Assran et al arxiv:2301.08243]
Jointly embeds a context and a number of neighboring patches.
Uses predictors
Uses only masking

I-JEPA Results

Training is fast

Non-generative method beats reconstruction-based generative methods such as Masked Auto-Encoder (with a frozen trunk).

I-JEPA Results on ImageNet

JEPA better than generative architecture on pixels.
Closing the gap with methods that use data augmentation
[Chart legend: methods with only masking (no data augmentation); methods with data augmentation (similar to SimCLR)]

I-JEPA Results on ImageNet with 1% training

JEPA better than generative architecture on pixels.
Closing the gap with methods that use data augmentation
[Chart legend: methods with only masking; methods with data augmentation]

I-JEPA: Visualizing Predicted Representations

original context predictions original context predictions



MC-JEPA: Motion & Content JEPA

[Bardes, Ponce, LeCun 23]


Simultaneous SSL for
Image recognition
Motion estimation
Trained on
ImageNet 1k
Various video datasets
Uses VCReg to prevent
collapse
ConvNext-T backbone

MC-JEPA: Motion & Content JEPA

Motion estimation architecture uses a top-down hierarchical predictor that “warps” feature maps.

MC-JEPA: Optical Flow Estimation Results



Problems to Solve

JEPA with regularized latent variables


Learning and planning in non-deterministic environments
Planning algorithms in the presence of uncertainty
Gradient-based methods and combinatorial search methods
Learning Cost Modules (Inverse RL)
Energy-based approach: give low cost to observed trajectories
Planning with inaccurate world models
Preventing bad plans in uncertain parts of the space
Exploration to adjust world models
Intrinsic objectives for curiosity

Things we are working on

Self-Supervised Learning from Video


Hierarchical video JEPA trained with SSL

LLMs that can reason & plan, driven by objectives


Dialog systems that plan in representation space and use AR-LLM to
turn representations into text

Learning hierarchical planning


Training a multi-timescale H-JEPA on toy planning problems.

Points
Computing power
AR-LLMs use a fixed amount of computation per token
Objective-Driven AI is Turing complete (inference == optimization)
We are still missing essential concepts to reach human-level AI
Scaling up auto-regressive LLMs will not take us there
We need machines to learn how the world works
Learning World Models with Self-Supervised Learning and JEPA
Non-generative architecture, predicts in representation space
Objective-Driven AI Architectures
Can plan their answers
Must satisfy objectives: are steerable & controllable
Guardrail objectives can make them safe by construction.

Future Universal Virtual Assistant

All of our interactions with the digital world will be mediated by AI assistants.
They will constitute a repository of all
human knowledge and culture
They will constitute a shared infrastructure
Like the Internet today.
These AI platforms MUST be open source
Otherwise, our culture will be controlled by a few companies
on the West Coast of the US or in China.
Training them will have to be crowd-sourced
Open source AI platforms are necessary

What does this vision mean for industrial policy?

AI systems will become a common platform


The platforms (foundation models) will need to be open
They will condense all of human knowledge
Guardrail objectives will be shared for safety
Training and fine-tuning will be crowd-sourced
Linguistic, cultural, and interest groups will fine-tune base models to
cater to their interests.
Proprietary systems for vertical applications will be built on top
When everyone has an AI assistant, we will need
Massive computing infrastructure for inference: efficient inference chips.
Move as much as possible to the edge.

Questions
How long is it going to take to reach human-level AI?
Years to decades. Many problems to solve on the way.
Before we get to HLAI, we will get to cat-level AI, dog-level AI,...
What is AGI?
There is no such thing. Intelligence is highly multidimensional
Intelligence is a collection of skills + ability to learn new skills quickly
Even humans can only accomplish a tiny subset of all tasks
Will machines surpass human intelligence?
Yes, they already do in some narrow domains.
There is no question that machines will eventually surpass human
intelligence in all domains where humans are intelligent (and more)

Questions
Are there short-term risks associated with powerful AI?
Yes, as with every technology.
Disinformation, propaganda, hate, spam,...: AI is the solution!
Concentration of information sources
All those risks can be mitigated
Are there long-term risks with (super-)human-level AI?
Robots will not take over the world! That fear is a mistaken projection of human nature onto machines
Intelligence is not correlated with a desire to dominate, even in humans
Objective-Driven AI systems will be made subservient to humans
AI will not be a “species” competing with us.
We will design its goals and guardrails.

Why the doomers are wrong


The speculations about the probability of human extinction p(doom) are
just that: speculations.
There are infinite ways to build dangerous and unreliable AI, and only a
few ways to do it right. But a few good ways is all we need.
There are infinite ways to build unreliable turbojets,…
… but safe and reliable turbojets do exist. They are the ones we use.
All doom scenarios assume that there is no way to build safe AI systems
Some scenarios assume that the slightest mistake will spell doom.
But this is not how technology development works.
Developing safe and reliable AI systems will take time
Safer AI is simply better AI with the proper objectives and guardrails.
This will take years (decades?) of careful engineering
Just like the design of safe, reliable, and efficient turbojets.

Questions
How to solve the alignment problem?
Through trial and error and testing in sand-boxed systems
We are very familiar with designing objectives for human and
superhuman entities. It’s called law making.
What if bad people get their hands on powerful AI?
Their Evil AI will be taken down by the Good Guys’ AI police.
What are the benefits of human-level AI?
AI will amplify human intelligence, progress will accelerate
As if everyone had a super-smart staff working for them
The effect on society may be as profound as the printing press
By amplifying human intelligence, AI will bring a new
era of enlightenment, a new renaissance for humanity.
Thank you!
