The Little Book of Deep Learning


The Little Book

of
Deep Learning

François Fleuret
François Fleuret is a professor of computer science at the University of Geneva, Switzerland.

The cover illustration is a schematic of the Neocognitron by Fukushima [1980], a key ancestor of deep neural networks.
Contents

List of figures 7

Foreword 8

I Foundations 10
1 Machine Learning 11
1.1 Learning from data . . . . . . . 12
1.2 Basis function regression . . . . 14
1.3 Under and overfitting . . . . . . 16
1.4 Categories of models . . . . . . 18
2 Efficient computation 20
2.1 GPUs, TPUs, and batches . . . . 21
2.2 Tensors . . . . . . . . . . . . . . 23
3 Training 26
3.1 Losses . . . . . . . . . . . . . . 27
3.2 Autoregressive models . . . . . 30
3.3 Gradient descent . . . . . . . . 33
3.4 Backpropagation . . . . . . . . 38
3.5 Training protocols . . . . . . . 43

3.6 The benefits of scale . . . . . . 46

II Deep models 51
4 Model components 52
4.1 The notion of layer . . . . . . . 53
4.2 Linear layers . . . . . . . . . . . 55
4.3 Activation functions . . . . . . 64
4.4 Pooling . . . . . . . . . . . . . . 67
4.5 Dropout . . . . . . . . . . . . . 70
4.6 Normalizing layers . . . . . . . 73
4.7 Skip connections . . . . . . . . 77
4.8 Attention layers . . . . . . . . . 80
4.9 Token embedding . . . . . . . . 87
4.10 Positional encoding . . . . . . . 88
5 Architectures 90
5.1 Multi-Layer Perceptrons . . . . 91
5.2 Convolutional networks . . . . 93
5.3 Attention models . . . . . . . . 100

III Applications 107


6 Prediction 108
6.1 Image denoising . . . . . . . . . 109
6.2 Image classification . . . . . . . 111
6.3 Object detection . . . . . . . . . 112
6.4 Semantic segmentation . . . . . 117
6.5 Speech recognition . . . . . . . 120
6.6 Text-image representations . . . 122
7 Synthesis 125
7.1 Text generation . . . . . . . . . 126
7.2 Image generation . . . . . . . . 128

The missing bits 132

Afterword 138

Bibliography 139

Index 148

List of Figures

1.1 Kernel regression . . . . . . . . . . 14


1.2 Overfitting of kernel regression . . 16

3.1 Causal autoregressive model . . . . 32


3.2 Gradient descent . . . . . . . . . . . 34
3.3 Back-propagation . . . . . . . . . . 38
3.4 Train and validation losses . . . . . 44
3.5 Scaling laws . . . . . . . . . . . . . 47
3.6 Model training costs . . . . . . . . . 49

4.1 1d convolution . . . . . . . . . . . . 57
4.2 2d convolution . . . . . . . . . . . . 58
4.3 Stride, padding, and dilation . . . . 59
4.4 Receptive field . . . . . . . . . . . . 62
4.5 Activation functions . . . . . . . . . 65
4.6 Max pooling . . . . . . . . . . . . . 68
4.7 Dropout . . . . . . . . . . . . . . . . 71
4.8 Batch normalization . . . . . . . . . 74
4.9 Skip connections . . . . . . . . . . . 78
4.10 Interpretation of the attention operator 81
4.11 Attention operator . . . . . . . . . . 83

4.12 Multi-Head Attention layer . . . . . 85

5.1 Multi-Layer Perceptron . . . . . . . 91


5.2 LeNet-like convolutional model . . 94
5.3 Residual block . . . . . . . . . . . . 95
5.4 Downscaling residual block . . . . . 96
5.5 ResNet-50 . . . . . . . . . . . . . . . 97
5.6 Self and cross-attention blocks . . . 101
5.7 Transformer . . . . . . . . . . . . . 102
5.8 GPT model . . . . . . . . . . . . . . 104
5.9 ViT model . . . . . . . . . . . . . . 105

6.1 Convolutional object detector . . . 113


6.2 Object detection with SSD . . . . . 114
6.3 Semantic segmentation with PSP . . 118
6.4 CLIP zero-shot prediction . . . . . . 124

7.1 Denoising diffusion . . . . . . . . . 129

Foreword

The current period of progress in artificial intelligence was triggered when Krizhevsky et al. [2012] showed that an artificial neural network with a simple structure, which had been known for more than twenty years [LeCun et al., 1989], could beat complex state-of-the-art image recognition methods by a huge margin, simply by being a hundred times larger, and trained on a data set similarly scaled up.

This breakthrough was made possible thanks to Graphical Processing Units (GPUs), mass-market highly parallel computing devices developed for real-time image synthesis and repurposed for artificial neural networks.

Since then, under the umbrella term of “deep learning,” innovations in the structures of these networks, the strategies to train them, and dedicated hardware have allowed for an exponential increase in both their size and the quantity of training data they take advantage of [Sevilla et al., 2022]. This has resulted in a wave of successful applications across technical domains, from computer vision and robotics, to speech, and natural language processing.

Although the bulk of deep learning is not particularly difficult to understand, it combines diverse components, which makes it complicated to learn. It involves multiple branches of mathematics such as calculus, probabilities, optimization, linear algebra, and signal processing, and it is also deeply anchored in computer science, programming, algorithmic, and high-performance computing. Instead of going into detail and trying to be exhaustive, this little book is limited to the necessary background and technical tools to understand a few important models.

If you did not get this book from its official URL

https://fleuret.org/public/lbdl.pdf

please do so, so that I can estimate the number of readers.

François Fleuret,
May 21, 2023

Part I

Foundations

Chapter 1

Machine Learning

Deep learning belongs historically to the larger field of statistical machine learning, as it fundamentally concerns methods able to learn representations from data. The techniques involved come originally from artificial neural networks, and the “deep” qualifier highlights that models are long compositions of mappings, now known to achieve greater performance.

The modularity of deep models, their versatility, and scaling qualities, have resulted in a plethora of specific mathematical methods and software development tools that have established deep learning as a separate and vast technical field.

1.1 Learning from data
The simplest use case for a model trained from
data is when a signal x is accessible, for instance,
the picture of a license plate, from which one
wants to predict a quantity y, such as the string
of characters written on the plate.

In many real-world situations where x is a high-dimension signal captured in an uncontrolled environment, it is too complicated to come up with an analytical recipe that relates x and y.

What one can do is to collect a large training set 𝒟 of pairs (x_n, y_n), and devise a parametric model f, a piece of computer code that incorporates trainable parameters w that modulate its behavior, and such that, with the proper values w∗, it is a good predictor. “Good” here means that if an x is given to this piece of code, the value ŷ = f(x; w∗) it computes is a good estimate of the y that would have been associated to x in the training set had it been there.

This notion of goodness is usually formalized with a loss ℒ(w) which is small when f(·; w) is good on 𝒟. Then, training the model consists of computing a value w∗ that minimizes ℒ(w∗).

Most of the content of this book is about the definition of f which, in realistic scenarios, is a complex combination of pre-defined sub-modules.

The trainable parameters that compose w are often referred to as weights, by analogy with the synaptic weights of biological neural networks. In addition to these parameters, models usually depend on meta-parameters, which are set according to domain prior knowledge, best practices, or resource constraints. They may also be optimized in some way, but with techniques different from those used to optimize w.

1.2 Basis function regression
We can illustrate the training of a model in a sim-
ple case where xn and yn are two real numbers,
the loss is the mean squared error

ℒ(w) = (1/N) ∑_{n=1}^{N} (y_n − f(x_n; w))^2,   (1.1)

and f(·; w) is a linear combination of a pre-defined basis of functions f_1, ..., f_K, with w = (w_1, ..., w_K):

f(x; w) = ∑_{k=1}^{K} w_k f_k(x).

Since f(x_n; w) is linear with respect to the w_k's and ℒ(w) is quadratic with respect to f(x_n; w), the loss ℒ(w) is quadratic with respect to the w_k's, and finding w∗ that minimizes it boils down to solving a linear system. See Figure 1.1 for an example with Gaussian kernels as f_k.

Figure 1.1: Given a basis of functions (blue curves) and a training set (black dots), we can compute an optimal linear combination of the former (red curve) to approximate the latter for the mean squared error.
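As an illustration, here is a minimal sketch in Python/NumPy of this procedure. It is not taken from the book: the Gaussian kernel centers, bandwidth, and toy data are arbitrary choices, and the point is only that minimizing the quadratic loss reduces to a linear least-squares system.

```python
import numpy as np

# A minimal sketch of basis function regression with Gaussian kernels.
# Centers, bandwidth, and data are illustrative, not from the book.
def gaussian_basis(x, centers, sigma=0.3):
    # N x K design matrix with entries f_k(x_n).
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=50)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=50)

centers = np.linspace(0, 1, 8)              # K = 8 basis functions
Phi = gaussian_basis(x_train, centers)      # f_k(x_n) for all n, k
# Minimizing the quadratic loss boils down to a linear least-squares system.
w_star, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

def f(x, w):
    return gaussian_basis(np.atleast_1d(x), centers) @ w
```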

1.3 Under and overfitting
A key element is the interplay between the capacity of the model, that is its flexibility and ability to fit diverse data, and the amount and quality of the training data. When the capacity is insufficient, the model cannot fit the data and the error during training is high. This is referred to as underfitting.

On the contrary, when the amount of data is insufficient, as illustrated with an example in Figure 1.2, the performance during training can be excellent, but unrelated to the actual fit to the data structure, as in that case the model will often learn random noise present in the signal.

Figure 1.2: If the amount of training data is small compared to the capacity of the model, the performance during training reflects poorly the actual fit to the underlying data structure, and consequently the usefulness for prediction.

This is overfitting.

So, a large part of the art of applied machine learning is to design models that are not too flexible yet still able to fit the data. This is done by crafting the right inductive bias in a model, which means that its structure corresponds to the underlying structure of the data at hand.

Even though this classical perspective is relevant for reasonably-sized deep models, things get confusing with large ones that have a very large number of trainable parameters and extreme capacity yet still perform well for prediction. We will come back to this in § 3.5.

1.4 Categories of models
We can organize the use of machine learning
models into three broad categories:

• Regression consists of predicting a continuous-valued vector y ∈ ℝ^K, for instance, a geometrical position of an object, given an input signal X. This is a multi-dimensional generalization of the setup we saw in § 1.2. The training set is composed of pairs of an input signal and a ground truth value.

• Classification aims at predicting a value from a finite set {1, ..., C}, for instance, the label Y of an image X. As for regression, the training set is composed of pairs of input signal, and ground truth quantity, here a label from that set. The standard way of tackling this is to predict one score per potential class, such that the correct class has the maximum score.

• Density modeling has as its objective to model the probability density function of the data µ_X itself, for instance, images. In that case, the training set is composed of values x_n without associated quantities to predict, and the trained model should allow either the evaluation of the probability density function, or sampling from the distribution, or both.
Both regression and classification are generally referred to as supervised learning, since the value to be predicted, which is required as a target during training, has to be provided, for instance, by human experts. On the contrary, density modeling is usually seen as unsupervised learning, since it is sufficient to take existing data, without the need for producing an associated ground-truth.

These three categories are not disjoint; for instance, classification can be cast as class-score regression, or discrete sequence density modeling as iterated classification. Furthermore, they do not cover all cases. One may want to predict compounded quantities, or multiple classes, or model a density conditional on a signal.

Chapter 2

Efficient computation

From an implementation standpoint, deep learning is about executing heavy computations with large amounts of data. The Graphical Processing Units (GPUs) have been instrumental in the success of the field by allowing such computations to be run on affordable hardware.

The importance of their use, and the resulting technical constraints on the computations that can be done efficiently, force the research in the field to constantly balance mathematical soundness and implementability of novel methods.

2.1 GPUs, TPUs, and batches
Graphical Processing Units were originally designed for real-time image synthesis, which requires highly parallel architectures that happen to be fitting for deep models. As their usage for AI has increased, GPUs have been equipped with dedicated sub-components referred to as tensor cores, and deep-learning specialized chips such as Google’s Tensor Processing Units (TPUs) have been produced.

A GPU possesses several thousands of parallel units, and its own fast memory. The limiting factor is usually not the number of computing units but the read-write operations to memory. The slowest link is between the CPU memory and the GPU memory, and consequently one should avoid copying data across devices. Moreover, the structure of the GPU itself involves multiple levels of cache memory, which are smaller but faster, and computation should be organized to avoid copies between these different caches.

This is achieved in particular by organizing the computation in batches of samples that can fit entirely in the GPU memory and are processed in parallel. When an operator combines a sample and model parameters, both have to be moved to the cache memory near the actual computing units. Proceeding by batches allows for copying the model parameters only once, instead of doing it for every sample. In practice, a GPU processes a batch that fits in memory almost as quickly as a single sample.
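As a small illustration, and assuming a PyTorch setup (the book names PyTorch but gives no code), the sketch below moves the model parameters to the device once and then processes the data batch by batch; the sizes are arbitrary.

```python
import torch

# A minimal sketch of batched processing on a GPU: the model parameters
# are copied to the device once, then each batch is moved over and all
# its samples are processed in parallel.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(256, 10).to(device)   # parameters moved to GPU memory once
data = torch.randn(10_000, 256)               # full data set, kept in CPU memory

batch_size = 512
with torch.no_grad():
    for batch in data.split(batch_size):
        batch = batch.to(device)              # one CPU-to-GPU copy per batch
        y = model(batch)                      # whole batch processed in parallel
```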

A standard GPU has a theoretical peak performance of 10^13–10^14 floating point operations (FLOPs) per second, and its memory typically ranges from 8 to 80 gigabytes. The standard FP32 encoding of float numbers is on 32 bits, but empirical results show that using encoding on 16 bits, or even less for some operands, does not degrade performance.

We come back in § 3.6 to the very large size of deep architectures.

2.2 Tensors
GPUs and deep learning frameworks such as PyTorch or JAX manipulate the quantities to be processed by organizing them as tensors, which are series of scalars arranged along several discrete axes. They are elements of ℝ^{N_1×···×N_D} that generalize the notion of vector and matrix.

Tensors are used to represent both the signals to process, the trainable parameters of the models, and the intermediate quantities they compute. The latter are called activations, in reference to neuronal activations.

For instance, a time series is naturally encoded as a T×D tensor, or, for historical reasons, as a D×T tensor, where T is its duration and D is the dimension of the feature representation at every time step, often referred to as the number of channels. Similarly, a 2d-structured signal can be represented as a D×H×W tensor, where H and W are its height and width. An RGB image would correspond to D = 3, but the number of channels can grow up to several thousands in large models.

Adding more dimensions allows for the representation of series of objects. Fifty RGB images of resolution 32×24 can, for instance, be encoded as a 50×3×24×32 tensor.
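A small sketch in PyTorch of the tensor shapes mentioned above; the values are random and only the shapes matter.

```python
import torch

# Illustrative tensors matching the shapes discussed in the text.
time_series = torch.randn(3, 100)        # D=3 channels, T=100 time steps (D x T)
image = torch.randn(3, 24, 32)           # one RGB image as a D x H x W tensor
batch = torch.randn(50, 3, 24, 32)       # fifty RGB images of resolution 32 x 24

# Reshaping and transposing only change the shape metadata, not the storage.
flat = batch.view(50, -1)                # 50 x 2304, no coefficient copying
transposed = image.permute(0, 2, 1)      # D x W x H view of the same storage
print(batch.shape, flat.shape, transposed.shape)
```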

Deep learning libraries all provide a large number of operations that encompass standard linear algebra, complex reshaping and extraction, and deep-learning specific operations, some of which we will see in Chapter 4. The implementation of tensors separates the shape representation from the storage layout of the coefficients in memory, which allows many reshaping, transposing, and extraction operations to be done without coefficient copying, hence extremely rapidly.

In practice, virtually any computation can be decomposed into elementary tensor operations, which avoids non-parallel loops at the language level and poor memory management.

Besides being convenient tools, tensors are instrumental in achieving computational efficiency. All the people involved in designing the complex object that is an operational deep model, from the researchers and software developers designing the model, the libraries, and the drivers, to the engineers designing the computers and the computing chips themselves, know that the data will be manipulated as tensors. The resulting constraints on locality and block decomposability allow all the actors in this chain to optimize their designs.

Chapter 3

Training

As introduced in § 1.1, training a model consists of minimizing a loss ℒ(w) which reflects the performance of the predictor f(·; w) on a training set 𝒟. Since the models are usually extremely complex, and their performance is directly related to how well the loss is minimized, this minimization is a key challenge, which involves both computational and mathematical difficulties.

3.1 Losses
The example of the mean squared error of Equation 1.1 is a standard loss for predicting a continuous value.

For classification, the usual strategy is that the output of the model is a vector with one component f(x; w)_y per class y, interpreted as the logarithm of a non-normalized probability, or logit. With X the input signal and Y the class to predict, we can then compute from f an estimate of the posterior probabilities:

P̂(Y = y | X = x) = exp f(x; w)_y / ∑_z exp f(x; w)_z .

This expression is generally referred to as the softmax, or more adequately, the softargmax, of the logits.

To be consistent with this interpretation, the model should be trained to maximize the probability of the true classes, hence to minimize the cross-entropy, expressed as

ℒ_ce(w) = (1/N) ∑_{n=1}^{N} −log ( exp f(x_n; w)_{y_n} / ∑_z exp f(x_n; w)_z ),

where the summand is the per-sample term L_ce(f(x_n; w), y_n).
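A minimal sketch in PyTorch, not from the book, relating logits, the softargmax, and the cross-entropy loss; the batch size and number of classes are arbitrary.

```python
import torch
import torch.nn.functional as F

# Relating logits, softargmax, and cross-entropy for N samples and C classes.
N, C = 4, 10
logits = torch.randn(N, C)                    # f(x_n; w), one row per sample
targets = torch.randint(C, (N,))              # true classes y_n

probs = logits.softmax(dim=1)                 # estimated posterior probabilities
manual = -probs[torch.arange(N), targets].log().mean()

# The library routine computes the same quantity directly from the logits.
assert torch.allclose(manual, F.cross_entropy(logits, targets))
```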

For density modeling, the standard loss is the
likelihood of the data. If f (x;w) is to be inter-
preted as a normalized log-probability or density,
the loss is the opposite of the sum of its value
over training samples.

In certain setups, even though the value to be predicted is continuous, the supervision takes the form of ranking constraints. The typical domain where this is the case is metric learning, where the objective is to learn a measure of distance between samples such that two samples from the same semantic class, e.g., two pictures of the same person, are closer to each other than to a sample from another class, e.g., any picture of someone else.

The standard approach for such cases is to minimize a contrastive loss, in that case, for instance, the sum over triplets (x_a, x_b, x_c), such that y_a = y_b ≠ y_c, of

max(0, 1 − f(x_a, x_c; w) + f(x_a, x_b; w)).

This quantity will be strictly positive unless

f(x_a, x_c; w) ≥ 1 + f(x_a, x_b; w).
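As a small illustration, not from the book, here is the triplet term above in PyTorch; the distance f is assumed here to be a plain squared distance between embeddings, whereas in practice it would be a trainable function of w.

```python
import torch

# Illustrative triplet-based contrastive loss; f is a stand-in distance.
def f(x, y):
    return (x - y).pow(2).sum(dim=-1)         # squared Euclidean distance

# Three triplets of 16-dimensional embeddings: x_a and x_b share a class,
# x_c does not.
xa, xb, xc = torch.randn(3, 16), torch.randn(3, 16), torch.randn(3, 16)

# Zero only when f(x_a, x_c) exceeds f(x_a, x_b) by a margin of at least 1.
loss = torch.clamp(1 - f(xa, xc) + f(xa, xb), min=0).mean()
```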

It is also possible to add terms to the loss that depend on the trainable parameters of the model themselves to favor certain configurations.
The weight decay regularization, for instance, consists of adding to the loss a term proportional to the sum of the squared parameters. It can be interpreted as having a Gaussian Bayesian prior on the parameters, which favors smaller values and reduces the influence of the data. This degrades performance on the training set, but reduces the gap between the performance in training and that on new, unseen data.
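In practice, and as an assumption not stated in the book, deep learning frameworks usually let one add this term through the optimizer rather than by modifying the loss by hand, as in the following PyTorch sketch.

```python
import torch

# Weight decay passed directly to the optimizer; the coefficient 1e-4 is
# an arbitrary illustrative value.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)
```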

Usually, the loss to minimize is not the actual quantity one wants to optimize ultimately, but a proxy for which finding the best model parameters is easier. For instance, cross-entropy is the standard loss for classification, even though the actual performance measure is a classification error rate, because the latter has no informative gradient, a key requirement as we will see in § 3.3.

3.2 Autoregressive models
Many spectacular applications in computer vision and natural language processing have been tackled by modeling the distribution of a high-dimension discrete vector with the chain rule:

P(X_1 = x_1, X_2 = x_2, ..., X_T = x_T) =
  P(X_1 = x_1)
  × P(X_2 = x_2 | X_1 = x_1)
  ...
  × P(X_T = x_T | X_1 = x_1, ..., X_{T−1} = x_{T−1}).

Although it is valid for any type of random quantity, this decomposition finds its most efficient use when the signal of interest can be encoded into a sequence of discrete tokens from a finite vocabulary {1, ..., K}.

With the convention that the additional token ∅ stands for an “unknown” quantity, we can represent the event {X_1 = x_1, ..., X_t = x_t} as the vector (x_1, ..., x_t, ∅, ..., ∅).

Then, given a model

f(x_1, ..., x_{t−1}, ∅, ..., ∅; w) = log P̂(X_t | X_1 = x_1, ..., X_{t−1} = x_{t−1}),
the chain rule states that one can sample a full sequence of length T by sampling the x_t's one after another, each according to the predicted posterior distribution, given the x_1, ..., x_{t−1} already sampled. This is an autoregressive generative model.

Training such a model could be achieved naively by minimizing the sum across training sequences x and time steps t of

L_ce( f(x_1, ..., x_{t−1}, ∅, ..., ∅; w), x_t ),

however such an approach is inefficient, as most computations done for t < t′ have to be repeated for t′.

The standard strategy to address this issue is to design a model that predicts the distributions of all the x_t of the sequence at once, but with a structure such that the prediction of x_t's logits depends only on the input values x_1, ..., x_{t−1}. Such a model is called causal, since it corresponds in the case of temporal series to not letting the future influence the past, as illustrated in Figure 3.1. As we will see in § 7.1, it can be trained with the cross-entropy summed over all the time steps for every sequence processed.
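A minimal sketch, not from the book, of the sampling procedure with a hypothetical causal model `model` that maps a 1×t sequence of token indices to 1×t×K logits, each position depending only on earlier ones.

```python
import torch

# Autoregressive sampling: draw x_t from the predicted posterior, one
# token at a time, conditioning on everything sampled so far.
def sample(model, start_token, T):
    x = torch.full((1, 1), start_token, dtype=torch.long)
    for _ in range(T - 1):
        logits = model(x)[:, -1]                  # logits of the next token
        probs = logits.softmax(dim=-1)
        next_token = torch.multinomial(probs, 1)  # sample from P_hat(X_t | ...)
        x = torch.cat([x, next_token], dim=1)
    return x
```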

Figure 3.1: An autoregressive model f is causal if a time step x_t of the input sequence can only modulate a predicted y_s = P̂(X_s | X_{t<s}) for s > t.

One important technical detail is that when dealing with language, the representation as tokens can be done in multiple ways, from the finest granularity of individual symbols to entire words. The conversion to and from the token representation is done by a separate algorithm called a tokenizer.
enizer

A standard method is the Byte Pair Encoding (BPE) [Sennrich et al., 2015] that constructs tokens by hierarchically merging groups of characters, trying to get tokens that represent fragments of words of various lengths but of similar frequencies, allocating tokens to long frequent fragments, as well as to rare individual symbols.

3.3 Gradient descent
Except in specific cases like the linear regression we saw in § 1.2, the optimal parameters w∗ do not have a closed-form expression. In the general case, the tool of choice to minimize a function is gradient descent. It consists of initializing the parameters with a random w_0, and then improving this estimate by iterating gradient steps, each consisting of computing the gradient of the loss with respect to the parameters, and subtracting a fraction of it:

w_{n+1} = w_n − η ∇ℒ|_w(w_n).   (3.1)

This procedure corresponds to moving the current estimate a bit in the direction corresponding locally to the maximum decrease of ℒ(w), as illustrated in Figure 3.2.

The meta-parameter η is referred to as the learning rate. It is a positive value that modulates how quickly the minimization is done, and must be chosen carefully. If it is too small, the optimization will be slow at best, and may be trapped in a local minimum early. If it is too large, the optimization may bounce around a good minimum and never descend into it. As we will see in § 3.5, it can depend on the iteration number n.

Figure 3.2: At every point w, the gradient ∇ℒ|_w(w) is in the direction that maximizes the increase of ℒ, orthogonal to the level curves (top). The gradient descent minimizes ℒ(w) iteratively by subtracting a fraction of the gradient at every step, resulting in a trajectory that follows the steepest descent (bottom).

As with many algorithms, intuition tends to
break down in very high dimensions, and al-
though it may seem that this procedure would
be easily trapped in a local minimum, in reality,
due to the number of parameters, the design of
the models, and the stochasticity of the data, its
efficiency is far greater than one might expect.

All the losses used in practice can be expressed as an average of a per-sample, or per small group of samples, loss

ℒ(w) = (1/N) ∑_{n=1}^{N} 𝓁_n(w),

where 𝓁_n(w) = L(f(x_n; w), y_n) for some L, and the gradient is then

∇ℒ|_w(w) = (1/N) ∑_{n=1}^{N} ∇𝓁_n|_w(w).   (3.2)

The resulting gradient descent would compute exactly the sum in 3.2, which is usually computationally heavy, and then update the parameters according to 3.1. However, under reasonable assumptions of exchangeability, for instance, if the samples have been properly shuffled, any partial sum of 3.2 is an unbiased estimator of the full sum, albeit noisy. So, updating the parameters from partial sums corresponds to doing more gradient steps for the same computational budget, with noisier estimates of the gradient. Due to the redundancy in the data, this happens to be a far more efficient strategy.

We saw in § 2.1 that processing a batch of samples small enough to fit in the computing device’s memory is generally as fast as processing a single one. Hence, the standard approach is to split the full set 𝒟 into batches, and to update the parameters from the estimate of the gradient computed from each. This is referred to as mini-batch stochastic gradient descent, or stochastic gradient descent (SGD) for short.
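As an illustration, and not as the book's own code, here is a minimal PyTorch sketch of mini-batch SGD on the quadratic loss of § 1.2 with an arbitrary linear model and synthetic data.

```python
import torch

# Mini-batch stochastic gradient descent on a mean squared error loss.
N, D = 1000, 16
X, y = torch.randn(N, D), torch.randn(N)

model = torch.nn.Linear(D, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # learning rate eta

for epoch in range(10):
    perm = torch.randperm(N)                  # shuffle so batches are exchangeable
    for idx in perm.split(100):               # batches of 100 samples
        loss = (model(X[idx]).squeeze(1) - y[idx]).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()                       # gradient of the batch loss
        optimizer.step()                      # w <- w - eta * gradient estimate
```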

It is important to note that this process is extremely gradual, and that the number of mini-batches and gradient steps are typically of the order of several millions.

Plenty of variations of this standard strategy have been proposed. The most popular one is Adam [Kingma and Ba, 2014], which keeps running estimates of the mean and variance of each component of the gradient, and normalizes them automatically, avoiding scaling issues and different training speeds in different parts of a model.

3.4 Backpropagation
Using gradient descent requires a technical means to compute ∇𝓁|_w(w) where 𝓁 = L(f(x; w), y). Given that f and L are both compositions of standard tensor operations, as for any mathematical expression, the chain rule allows us to get an expression of it.

For the sake of making notation lighter–which, unfortunately, will be needed in what follows–we do not specify at which point gradients are computed, since the context makes it clear.

Figure 3.3: Given a model f = f_D ∘ ··· ∘ f_1, the forward pass (top) consists of computing the outputs x^(d) of the mappings f_d in order. The backward pass (bottom) computes the gradients of the loss with respect to the activation x^(d) and the parameters w_d backward by multiplying them by the Jacobians.

Forward and backward passes

Consider the simple case of a composition of mappings

f = f_D ∘ f_{D−1} ∘ ··· ∘ f_1.

The output of f(x; w) can be computed by starting with x^(0) = x and applying iteratively

x^(d) = f_d(x^(d−1); w_d),

with x^(D) as the final value.

The individual scalar values of these intermediate results x^(d) are traditionally called activations in reference to neuron activations, the value D is the depth of the model, the individual mappings f_d are referred to as layers, as we will see in § 4.1, and their sequential evaluation is the forward pass (see Figure 3.3, top).

Conversely, the gradient ∇𝓁|_{x^(d−1)} of the loss with respect to the output x^(d−1) of f_{d−1} is the product of the gradient ∇𝓁|_{x^(d)} with respect to the output of f_d multiplied by the Jacobian J_{f_d}|_x of f_d with respect to its first variable x. Thus, the gradients with respect to the outputs of all the f_d's can be computed recursively backward, starting with ∇𝓁|_{x^(D)} = ∇L|_x.
And the gradient that we are interested in for training, that is ∇𝓁|_{w_d}, is the gradient with respect to the output of f_d multiplied by the Jacobian J_{f_d}|_w of f_d with respect to the parameters.

This iterative computation of the gradients with respect to the intermediate activations, combined with that of the gradients with respect to the layers’ parameters, is the backward pass (see Figure 3.3, bottom). The combination of this computation with the procedure of gradient descent is called backpropagation.
tion

In practice, the implementation details of the forward and backward passes are hidden from programmers. Deep learning frameworks are able to automatically construct the sequence of operations to compute gradients. A particularly convenient algorithm is Autograd [Baydin et al., 2015], which tracks tensor operations and builds, on the fly, the combination of operators for gradients. Thanks to this, a piece of imperative programming that manipulates tensors can automatically compute the gradient of any quantity with respect to any other.
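A minimal PyTorch sketch of this, not from the book: the forward computation is ordinary imperative code, and the backward pass is produced automatically.

```python
import torch

# Autograd records the tensor operations of the forward pass and computes
# gradients on demand.
w = torch.randn(3, requires_grad=True)
x, y = torch.randn(3), torch.tensor(1.0)

y_hat = (w * x).sum()           # forward pass, operations are tracked
loss = (y_hat - y) ** 2
loss.backward()                 # backward pass: gradient of loss w.r.t. w

print(w.grad)                   # equals 2 * (y_hat - y) * x
```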

Resource usage
Regarding the computational cost, as we will see, the bulk of the computation goes into linear operations that require one matrix product for the forward pass and two for the products by the Jacobians for the backward pass, making the latter roughly twice as costly as the former.

The memory requirement during inference is roughly equal to that of the most demanding individual layer. For training, however, the backward pass requires keeping the activations computed during the forward pass to compute the Jacobians, which results in a memory usage that grows proportionally to the model’s depth. Techniques exist to trade the memory usage for computation by either relying on reversible layers [Gomez et al., 2017], or using checkpointing, which consists of storing activations for some layers only and recomputing the others on the fly with partial forward passes during the backward pass [Chen et al., 2016].
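As an illustration, and assuming a recent PyTorch installation, the following sketch uses the library's checkpointing utility so that the activations of `block` are recomputed during the backward pass rather than stored.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Gradient checkpointing: trade computation for memory by not storing the
# activations of `block` during the forward pass.
block = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
x = torch.randn(8, 64, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```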

Vanishing gradient
A key historical issue when training a large network is that when the gradient propagates backwards through an operator, it may be rescaled by a multiplicative factor, and consequently decrease or increase exponentially when it traverses many layers. When it decreases exponentially, this is called the vanishing gradient, and it may make the training impossible, or, in its milder form, cause different parts of the model to be updated at different speeds, degrading their co-adaptation [Glorot and Bengio, 2010].

As we will see in Chapter 4, multiple techniques have been developed to prevent this from happening, reflecting a change in perspective that was crucial to the success of deep-learning: instead of trying to improve generic optimization methods, the effort shifted to engineering the models themselves to make them optimizable.

3.5 Training protocols
Training a deep network requires defining a pro-
tocol to make the most of computation and data,
and ensure that performance will be good on
new data.

As we saw in § 1.3, the performance on the training samples may be misleading, so in the simplest setup one needs at least two sets of samples: one is a training set, used to optimize the model parameters, and the other is a test set, to estimate the performance of the trained model.

Additionally, there are usually meta-parameters to adapt, in particular, those related to the model architecture, the learning rate, and the regularization terms in the loss. In that case, one needs a validation set that is disjoint from both the training set and the test set to assess the best configuration.

The full training is usually decomposed into epochs, each of them corresponding to going through all the training examples once. The usual dynamic of the losses is that the train loss decreases as long as the optimization runs, while the validation loss may reach a minimum after a certain number of epochs and then start to increase, reflecting an overfitting regime, as introduced in § 1.3 and illustrated in Figure 3.4.

Figure 3.4: As training progresses, a model’s performance is usually monitored through losses. The train loss is the one driving the optimization process and goes down, while the validation loss is estimated on another set of examples to assess the overfitting of the model. This phenomenon appears when the model starts to take into account random structures specific to the training set at hand, resulting in the validation loss starting to increase.

Paradoxically, although they should suffer from severe overfitting due to their capacity, large models usually continue to improve as training progresses. This may be due to the inductive bias of the model becoming the main driver of optimization when performance is near perfect on the training set [Belkin et al., 2018].

An important design choice is the learning rate schedule during training. The general policy is that the learning rate should be initially large to avoid having the optimization being trapped in a bad local minimum early, and that it should get smaller so that the optimized parameter values do not bounce around, and reach a good minimum in a narrow valley of the loss landscape.
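As a small illustration, and as an assumption about common practice rather than the book's own recipe, such a schedule can be expressed with an optimizer-attached scheduler in PyTorch; the step size and decay factor below are arbitrary.

```python
import torch

# A decreasing learning rate schedule: start large, then shrink it.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... one pass over the training set would go here ...
    optimizer.step()          # placeholder for the epoch's gradient steps
    scheduler.step()          # divide the learning rate by 10 every 30 epochs
```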

The training of extremely large models may take months on thousands of powerful GPUs and have a financial cost of several million dollars. At this scale, the training may involve many manual interventions informed, in particular, by the dynamics of the loss evolution.

3.6 The benefits of scale
There is an accumulation of empirical results showing that performance, for instance, estimated through the loss on test data, improves with the amount of data according to remarkable scaling laws, as long as the model size increases correspondingly [Kaplan et al., 2020], see Figure 3.5.

Benefiting from these scaling laws in the multi-billion samples regime is possible in part thanks to the structural plasticity of models, which allows them to be scaled up arbitrarily, as we will see, by increasing the number of layers or feature dimensions. But it is also made possible by the distributed nature of the computation implemented by these models and by stochastic gradient descent, which requires only a tiny fraction of the data at a time and can operate with data sets whose size is orders of magnitude greater than that of the computing device’s memory. This has resulted in an exponential growth of the models, as illustrated in Figure 3.6.

Typical vision models have 10–100 million trainable parameters and require 10^18–10^19 FLOPs for training [He et al., 2015; Sevilla et al., 2022]. Language models have from 100 million to hundreds of billions of trainable parameters and require 10^20–10^23 FLOPs for training [Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022; Sevilla et al., 2022]. The latter require machines with multiple high-end GPUs.

Figure 3.5: Test loss of a language model vs. the amount of computation in petaflop/s-day, the data set size in number of tokens, that is, fragments of words, and the model size in number of parameters [Kaplan et al., 2020].

Dataset        Year  Nb. of images  Size
ImageNet       2012  1.2M           150Gb
Cityscape      2016  25K            60Gb
LAION-5B       2022  5.8B           240Tb

Dataset        Year  Nb. of books   Size
WMT-18-de-en   2018  14M            8Gb
The Pile       2020  1.6B           825Gb
OSCAR          2020  12B            6Tb

Table 3.1: Some examples of publicly available datasets. The equivalent number of books is an indicative estimate for 250 pages of 2000 characters per book.

Training these large models is impossible with datasets of moderate size with a detailed ground-truth expensive to produce. Instead, it is done with datasets automatically produced by combining data available on the internet with minimal curation, if any. These sets may combine multiple modalities, such as text and images from web pages, or sound and images from videos, which can be used for large-scale supervised training.

Figure 3.6: Training costs in number of FLOP of some landmark models [Sevilla et al., 2023]. The colors indicate the domains of application: Computer Vision (blue), Natural Language Processing (red), or other (black). The dashed lines correspond to the energy consumption using A100s SXM in 16 bits precision.

The most impressive current successes of artificial intelligence rely on very large language models, which we will see in § 5.3 and § 7.1, trained on extremely large text datasets, see Table 3.1.

Part II

Deep models

Chapter 4

Model components

A deep model is nothing more than a complex tensorial computation that can be decomposed ultimately into standard mathematical operations from linear algebra and analysis. Over the years, the field has developed a large collection of high-level modules that have a clear semantic, and complex models combining these modules, which have proven to be effective in specific application domains.

Empirical evidence and theoretical results show that greater performance is achieved with deeper architectures, that is, long compositions of mappings. As we saw in section § 3.4, training such a model is challenging due to the vanishing gradient, and multiple important technical contributions have mitigated this problem.

4.1 The notion of layer
We call layers standard complex compounded tensor operations that have been designed and empirically identified as being generic and efficient. They often incorporate trainable parameters and correspond to a convenient level of granularity for designing and describing large deep models. The term is inherited from simple multi-layer neural networks, even though modern models may take the form of a complex graph of such modules, incorporating multiple parallel pathways.

[Example model depiction: a 32×32 input X processed by a layer f replicated ×K, followed by a layer g with meta-parameter n=4, producing a 4×4 output Y.]

In the following pages, I try to stick to the convention for model depiction illustrated above:

• operators / layers are depicted as boxes,

• darker coloring indicates that they embed trainable parameters,

• non-default valued meta-parameters are added in blue on their right,

• a dashed outer frame with a multiplicative factor indicates that a group of layers is replicated in series, each with its own set of trainable parameters if any, and

• the dimension of their output is specified on the right when it differs from their input.

Additionally, layers that have a complex internal structure are depicted with a greater height.

4.2 Linear layers
Linear layers are the most important modules in terms of computation and number of parameters. They benefit from decades of research and engineering in algorithmic and chip design for matrix operations.

Fully connected layers

The most basic one is the fully connected layer, parameterized by w = (W, b), where W is a D′×D weight matrix, and b is a bias vector of dimension D′. It implements a matrix/vector product generalized to arbitrary tensor shapes. Given an input X of dimension D_1 ×···× D_K × D, it computes an output Y of dimension D_1 ×···× D_K × D′ with

∀ d_1, ..., d_K,   Y[d_1, ..., d_K] = W X[d_1, ..., d_K] + b.

While at first sight such an affine operation seems limited to geometric transformations such as rotations or symmetries, it can implement far more than that. In particular, projections for dimension reduction or signal filtering, but also, from the perspective of the dot product being a measure of similarity, a matrix-vector product can be interpreted as computing matching scores between a query, as encoded by the vector, and keys, as encoded by the matrix rows.
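A minimal PyTorch sketch, not from the book, of a fully connected layer applied to an arbitrarily shaped input; only the last dimension is transformed.

```python
import torch

# Fully connected layer generalized to arbitrary tensor shapes.
D, D_out = 32, 8
layer = torch.nn.Linear(D, D_out)        # holds W (D_out x D) and b (D_out)

X = torch.randn(50, 10, D)               # shape D_1 x D_2 x D, here D_1=50, D_2=10
Y = layer(X)
print(Y.shape)                           # torch.Size([50, 10, 8])
```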

As we saw in § 3.3, the gradient descent starts with the parameters’ random initialization. If this is done too naively, as seen in § 3.4, the network may suffer from exploding or vanishing activations and gradients [Glorot and Bengio, 2010]. Deep learning frameworks implement initialization methods that modulate the random parameters’ scales according to the tensor shape to prevent pathological behaviors of the signal during the forward and backward passes.

Convolutional layers

A linear layer can take as input an arbitrarily-shaped tensor by reshaping it into a vector, as long as it has the correct number of coefficients. However, such a layer is poorly adapted to dealing with large tensors, since the number of parameters and number of operations are proportional to the product of the input and output dimensions. For instance, to process an RGB image of size 256×256 as input and compute a result of the same size, it would require approximately 4×10^10 parameters and multiplications.

Figure 4.1: A 1d convolution (left) takes as input a D×T tensor X, applies the same affine mapping ϕ(·; w) to every sub-tensor of shape D×K, and stores the resulting D′×1 tensors into Y. A 1d transposed convolution (right) takes as input a D×T tensor, applies the same affine mapping ψ(·; w) to every sub-tensor of shape D×1, and sums the shifted resulting D′×K tensors. Both can process inputs of different size.

Figure 4.2: A 2d convolution (left) takes as input a D×H×W tensor X, applies the same affine mapping ϕ(·; w) to every sub-tensor of shape D×K×L, and stores the resulting D′×1×1 tensors into Y. A 2d transposed convolution (right) takes as input a D×H×W tensor, applies the same affine mapping ψ(·; w) to every D×1×1 sub-tensor, and sums the shifted resulting D′×K×L tensors into Y.

Besides these practical issues, most of the high-dimension signals are strongly structured. For instance, images exhibit short-term correlations and statistical stationarity to translation, scaling, and certain symmetries. This is not reflected in the inductive bias of a fully connected layer, which completely ignores the signal structure.

To leverage these regularities, the tool of choice is convolutional layers, which are also affine, but process time-series or 2d signals locally, with the same operator everywhere.

Figure 4.3: Beside its kernel size and number of input / output channels, a convolution admits three meta-parameters: the stride s (left) modulates the step size when going through the input tensor, the padding p (top right) specifies how many zero entries are added around the input tensor before processing it, and the dilation d (bottom right) parameterizes the index count between coefficients of the filter.

A 1d convolution is mainly defined by three meta-parameters: its kernel size K, its number of input channels D, its number of output channels D′, and by the trainable parameters w of an affine mapping ϕ(·; w) : ℝ^{D×K} → ℝ^{D′×1}.

It can process any tensor X of size D×T with T ≥ K, and applies ϕ(·; w) to every sub-tensor D×K of X, storing the results in a tensor Y of size D′×(T−K+1), as pictured in Figure 4.1 (left).

A 2d convolution is similar but has a K×L kernel and takes as input a D×H×W tensor, see Figure 4.2 (left).

Both operators have for trainable parameters those of ϕ that can be envisioned as D′ filters of size D×K or D×K×L respectively, and a bias vector of dimension D′.

They also admit three additional meta-parameters, illustrated on Figure 4.3:

• The padding specifies how many zero coefficients should be added around the input tensor before processing it, particularly to maintain the tensor size when the kernel size is greater than one. Its default value is 0.

• The stride specifies the step used when going through the input, allowing one to reduce the output size geometrically by using large steps. Its default value is 1.

• The dilation specifies the index count between the filter coefficients of the local affine operator. Its default value is 1, and greater values correspond to inserting zeros between the coefficients, which increases the filter / kernel size while keeping the number of trainable parameters unchanged.

Except for the number of channels, a convolution’s output is usually strictly smaller than its input by roughly the size of the kernel, or even by a scaling factor if the stride is greater than one.
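A minimal PyTorch sketch, not from the book, of a 2d convolution with these meta-parameters; the channel counts, kernel size, and input resolution are arbitrary.

```python
import torch

# A 2d convolution and its meta-parameters; the output is smaller than the
# input by roughly the kernel size, and further shrunk by the stride.
conv = torch.nn.Conv2d(in_channels=3, out_channels=16,
                       kernel_size=5, stride=2, padding=1, dilation=1)

X = torch.randn(1, 3, 64, 64)            # a batch of one D x H x W image
Y = conv(X)
print(Y.shape)                           # torch.Size([1, 16, 31, 31])
```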

Given an activation computed by a convolutional layer, or the vector of values for all the channels at a certain location, the portion of the input signal that it depends on is called its receptive field (see Figure 4.4). One of the H×W sub-tensors corresponding to a single channel of a D×H×W activation tensor is referred to as an activation map.

Figure 4.4: Given an activation in a series of convolution layers, here in red, its receptive field is the area in the input signal, in blue, that modulates its value. Each intermediate convolutional layer increases the width and height of that area by roughly those of the kernel.

Convolutions are used to recombine information, generally to reduce the spatial size of the representation, trading it for a greater number of channels, which translates into a richer local representation. They can implement differential operators such as edge-detectors, or template matching mechanisms. A succession of such layers can also be envisioned as a compositional and hierarchical representation [Zeiler and Fergus, 2014], or as a diffusion process in which information can be transported by half the kernel size when passing through a layer.

A converse operation is the transposed convolution that also consists of a localized affine operator, defined by similar meta and trainable parameters as the convolution, but which applies, for instance, in the 1d case, an affine mapping ψ(·; w) : ℝ^{D×1} → ℝ^{D′×K}, to every D×1 sub-tensor of the input, and sums the shifted D′×K resulting tensors to compute its output. Such an operator increases the size of the signal and can be understood intuitively as a synthesis process (see Figure 4.1, right and Figure 4.2, right).
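A small PyTorch sketch, not from the book, showing that a transposed convolution increases the spatial size, here roughly by the stride factor; all values are illustrative.

```python
import torch

# A 2d transposed convolution upscaling an 8 x 8 feature map to 16 x 16.
tconv = torch.nn.ConvTranspose2d(in_channels=16, out_channels=3,
                                 kernel_size=4, stride=2, padding=1)

X = torch.randn(1, 16, 8, 8)
Y = tconv(X)
print(Y.shape)                           # torch.Size([1, 3, 16, 16])
```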

A series of convolutional layers is the usual architecture to map a large-dimension signal, such as an image or a sound sample, to a low-dimension tensor. That can be, for instance, to get class scores for classification or a compressed representation. Transposed convolution layers are used the opposite way to build a large-dimension signal from a compressed representation, either to assess that the compressed representation contains enough information to build back the signal or for synthesis, as it is easier to learn a density model over a low-dimension representation. We will come back to this in § 5.2.

4.3 Activation functions
If a network were combining only linear components, it would itself be a linear operator, so it is essential to have non-linear operations. They are implemented in particular with activation functions, which are layers that transform each component of the input tensor individually through a mapping, resulting in a tensor of the same shape.

There are many different activation functions, but the most used is the Rectified Linear Unit (ReLU) [Glorot et al., 2011], which sets negative values to zero and keeps positive values unchanged (see Figure 4.5, top right):

relu(x) = { 0 if x < 0,
            x otherwise.

Given that the core training strategy of deep-learning relies on the gradient, it may seem problematic to have a mapping that is not differentiable at zero and constant on half the real line. However, the main property gradient descent requires is that the gradient is informative on average. Parameter initialization and data normalization make half of the activations positive when the training starts, ensuring that this is the case.

Figure 4.5: Activation functions (Tanh, ReLU, Leaky ReLU, and GELU).

Before the generalization of ReLU, the standard activation function was Tanh (see Figure 4.5, top left) which saturates exponentially fast on both the negative and the positive sides, aggravating the vanishing gradient.

Other popular activation functions follow the same idea of keeping positive values unchanged and squashing the negative values. Leaky ReLU [Maas et al., 2013] applies a small positive multiplying factor to the negative values (see Figure 4.5, bottom left):

leakyrelu(x) = { ax if x < 0,
                 x otherwise.

And GELU [Hendrycks and Gimpel, 2016] is defined with the cumulative distribution function of the Gaussian distribution, that is

gelu(x) = x P(Z ≤ x),

where Z ∼ 𝒩(0, 1). It roughly behaves like a smooth ReLU (see Figure 4.5, bottom right).

The choice of an activation function, in particular among the variants of ReLU, is generally driven by empirical performance.
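A minimal PyTorch sketch, not from the book, evaluating these activation functions on a few values; each acts component-wise on its input tensor.

```python
import torch
import torch.nn.functional as F

# Component-wise activation functions applied to the same input values.
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.tanh(x))
print(F.relu(x))
print(F.leaky_relu(x, negative_slope=0.01))   # the small positive factor a
print(F.gelu(x))
```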

4.4 Pooling
A classical strategy to reduce the signal size is to use a pooling operation that combines multiple activations into one that ideally summarizes the information. The most standard operation of this class is the max pooling layer, which, similarly to convolution, can operate in 1d and 2d, and is defined by a kernel size.

This layer computes the maximum activation per channel, over non-overlapping sub-tensors of spatial size equal to the kernel size. These values are stored in a result tensor with the same number of channels as the input, and whose spatial size is divided by the kernel size. As with the convolution, this operator has three meta-parameters: padding, stride, and dilation, with the stride being equal to the kernel size by default.

The max operation can be intuitively interpreted as a logical disjunction, or, when it follows a series of convolutional layers that compute local scores for the presence of parts, as a way of encoding that at least one instance of a part is present. It loses precise location, making it invariant to local deformations.

A standard alternative is the average pooling layer that computes the average instead of the maximum over the sub-tensors. This is a linear operation, whereas max pooling is not.

Figure 4.6: A 1d max pooling takes as input a D×T tensor X, computes the max over non-overlapping 1×L sub-tensors and stores the values in a resulting D×(T/L) tensor Y.
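A small PyTorch sketch, not from the book, of 1d max pooling and average pooling with a kernel size of 2, halving the temporal dimension.

```python
import torch

# Max and average pooling over non-overlapping windows of size 2.
X = torch.randn(1, 4, 10)                     # batch x D x T
max_pool = torch.nn.MaxPool1d(kernel_size=2)
avg_pool = torch.nn.AvgPool1d(kernel_size=2)
print(max_pool(X).shape, avg_pool(X).shape)   # both 1 x 4 x 5
```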

4.5 Dropout
Some layers have been designed to explicitly facilitate training or improve the quality of the learned representations.

One of the main contributions of that sort was dropout [Srivastava et al., 2014]. Such a layer has no trainable parameters, but one meta-parameter, p, and takes as input a tensor of arbitrary shape.

It is usually switched off during testing, in which case its output is equal to its input. When it is active, it has a probability p to set to zero each activation of the input tensor independently, and it re-scales all the activations by a factor of 1/(1−p) to maintain the expected value unchanged (see Figure 4.7).

The motivation behind dropout is to favor meaningful individual activation and discourage group representation. Since the probability that a group of k activations remains intact through a dropout layer is (1−p)^k, joint representations become unreliable, which makes the training procedure avoid them. It can also be seen as a noise injection that makes the training more robust.
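A minimal PyTorch sketch, not from the book, of dropout's behavior in training and test modes.

```python
import torch

# Dropout behaves differently in training and test modes.
dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()
print(dropout(x))     # roughly half the entries zeroed, the rest scaled by 1/(1-p)

dropout.eval()
print(dropout(x))     # identity during testing
```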

70 155
Figure 4.7: Dropout can process a tensor of arbitrary shape. During training (left), it sets activations at random to zero with probability p and applies a multiplying factor to keep the expected values unchanged. During test (right), it keeps all the activations unchanged.

When dealing with images and 2d tensors, the short-term correlation of the signals and the resulting redundancy negates the effect of dropout, since activations set to zero can be inferred from their neighbors. Hence, dropout for 2d tensors sets entire channels to zero instead of individual activations.

Although dropout is generally used to improve training and is inactive during inference, it can be used in certain setups as a randomization strategy, for instance, to estimate empirically confidence scores [Gal and Ghahramani, 2015].

4.6 Normalizing layers
An important class of operators to facilitate the training of deep architectures are the normalizing layers, which force the empirical mean and variance of groups of activations.

The main layer in that family is batch normalization [Ioffe and Szegedy, 2015], which is the only standard layer to process batches instead of individual samples. It is parameterized by a meta-parameter D and two series of trainable scalar parameters β1,...,βD and γ1,...,γD.

Given a batch of B samples x1,...,xB of dimension D, it first computes for each of the D components an empirical mean m̂d and variance v̂d across the batch:

\hat{m}_d = \frac{1}{B} \sum_{b=1}^{B} x_{b,d}

\hat{v}_d = \frac{1}{B} \sum_{b=1}^{B} (x_{b,d} - \hat{m}_d)^2,

from which it computes for every component xb,d a normalized value zb,d, with empirical mean 0 and variance 1, and from it the final result value yb,d with mean βd and standard deviation γd:

z_{b,d} = \frac{x_{b,d} - \hat{m}_d}{\sqrt{\hat{v}_d + \epsilon}}

y_{b,d} = \gamma_d z_{b,d} + \beta_d.

Figure 4.8: Batch normalization normalizes across the sample index dimension B and all spatial dimensions if any, so B,H,W for a B×D×H×W batch tensor, and scales/shifts according to D, which is implemented as a component-wise product by γ and a sum with β of the corresponding sub-tensors (left). Layer normalization normalizes across D and spatial dimensions, and scales/shifts according to the same (right).

Because this normalization is defined across a batch, it is done only during training. During testing, the layer transforms individual samples according to the m̂d and v̂d estimated with a moving average over the full training set, which boils down to a fixed affine transformation per component.

The motivation behind batch normalization was to prevent a change of scaling in an early layer of the network during training from impacting all the layers that follow, which would then have to adapt their trainable parameters accordingly. Although the actual mode of action may be more complicated than this initial motivation, this layer considerably facilitates the training of deep models.

In the case of 2d tensors, to follow the prin-


ciple of convolutional layers of processing all
locations similarly, the normalization is done
per-channel across all 2d positions, and β and
γ remain vectors of dimension D so that the
scaling/shift does not depend on the 2d posi-
tion. Hence, if the tensor to process is of shape
B ×D×H ×W , the layer computes (m̂d ,v̂d ), for
d = 1,...,D from the corresponding B ×H ×W
slice, normalizes it accordingly, and finally scales
and shifts its components with the trainable pa-
rameters βd and γd .

So, given a B×D tensor, batch normalization normalizes it across B and scales/shifts it according to D, which can be implemented as a component-wise product by γ and a sum with β. Given a B×D×H×W tensor, it normalizes across B,H,W and scales/shifts according to D (see Figure 4.8, left).
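A minimal sketch of this training-time computation for a B×D×H×W tensor, written in PyTorch purely for illustration (ε and the tensor names are arbitrary; a standard implementation such as torch.nn.BatchNorm2d additionally maintains the moving averages used at test time):

    import torch

    B, D, H, W = 16, 8, 32, 32
    x = torch.randn(B, D, H, W)
    gamma = torch.ones(D)                  # trainable scale
    beta = torch.zeros(D)                  # trainable shift
    eps = 1e-5

    # Empirical moments per channel, across the B, H, W dimensions.
    m_hat = x.mean(dim=(0, 2, 3), keepdim=True)
    v_hat = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)

    z = (x - m_hat) / torch.sqrt(v_hat + eps)                # mean 0, variance 1 per channel
    y = gamma.view(1, D, 1, 1) * z + beta.view(1, D, 1, 1)   # mean beta_d, std gamma_d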

This can be generalized depending on these dimensions. For instance, layer normalization [Ba et al., 2016] computes moments and normalizes across all components of individual samples, and scales and shifts components individually (see Figure 4.8, right). So, given a B×D tensor, it normalizes across D and scales/shifts also according to D. Given a B×D×H×W tensor, it normalizes it across D,H,W and scales/shifts according to the same.

Contrary to batch normalization, since it pro-


cesses samples individually, it behaves the same
during training and testing.

4.7 Skip connections

Another technique that mitigates the vanishing gradient and allows the training of deep architectures is the use of skip connections [Long et al., 2014; Ronneberger et al., 2015]. They are not layers per se, but an architectural design in which outputs of some layers are transported as-is to other layers further in the model, bypassing processing in-between. This unmodified signal can be concatenated or added to the input of the layer the connection branches into (see Figure 4.9). A particular type of skip connection is the residual connection, which combines the signal with a sum, and usually skips only a few layers (see Figure 4.9, right).

The most desirable property of this design is to ensure that, even in the case of gradient-killing processing at a certain stage, the gradient will still propagate through the skip connections. Residual connections, in particular, allow for the building of deep models with up to several hundred layers, and key models, such as the residual networks [He et al., 2015] in computer vision, see § 5.2, and the Transformers [Vaswani et al., 2017] in natural language processing, see § 5.3, are entirely composed of blocks of layers with residual connections.

Figure 4.9: Skip connections, highlighted in red on this figure, transport the signal unchanged across multiple layers. Some architectures (center) that downscale and re-upscale the representation size to operate at multiple scales have skip connections to feed outputs from the early parts of the network to later layers operating at the same scales [Long et al., 2014; Ronneberger et al., 2015]. The residual connections (right) are a special type of skip connection that sums the original signal to the transformed one, and are usually short-term, bypassing at most a handful of layers [He et al., 2015].

Their role can also be to facilitate multi-scale rea-
soning in models that reduce the signal size be-
fore re-expanding it, by connecting layers with
compatible sizes. In the case of residual con-
nections, they may also facilitate learning by
simplifying the task to finding a differential im-
provement instead of a full update.
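In code, a residual connection is simply an addition around a block of layers. A minimal PyTorch sketch, for illustration only (the wrapped module f is an arbitrary placeholder, not a specific architecture):

    import torch
    import torch.nn as nn

    class Residual(nn.Module):
        """Wraps any module f whose output has the same shape as its input."""
        def __init__(self, f):
            super().__init__()
            self.f = f

        def forward(self, x):
            # The identity path lets the gradient bypass f entirely.
            return x + self.f(x)

    block = Residual(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))
    y = block(torch.randn(8, 64))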

4.8 Attention layers

In many applications, there is a need for an operation that is able to combine local information at locations far apart in a tensor. For instance, this could be distant details for coherent and realistic image synthesis, or words at different positions in a paragraph to make a grammatical or semantic decision in natural language processing.

Fully-connected layers cannot process large-dimension signals, nor signals of variable size, and convolutional layers are not able to propagate information quickly. Strategies that aggregate the results of convolutions, for instance, by averaging them over large spatial areas, suffer from mixing multiple signals into a limited number of dimensions.

Attention layers specifically address this problem by computing an attention score for each component of the resulting tensor to each component of the input tensor, without locality constraints, and averaging features across the full tensor accordingly [Vaswani et al., 2017].

Even though they are substantially more complicated than other layers, they have become a standard element in many recent models. They are, in particular, the key building block of Transformers, the dominant architecture for Large Language Models. See § 5.3 and § 7.1.
Figure 4.10: The attention operator can be interpreted as matching every query Qq with all the keys K1,...,K_{N^KV} to get normalized attention scores Aq,1,...,Aq,N^KV (left, and Equation 4.1), and then averaging the values V1,...,V_{N^KV} with these scores to compute the resulting Yq (right, and Equation 4.2).


Attention operator

Given

• a tensor Q of queries of size N^Q × D^QK,
• a tensor K of keys of size N^KV × D^QK, and
• a tensor V of values of size N^KV × D^V,

the attention operator computes a tensor

Y = att(Q,K,V)

of dimension N^Q × D^V. To do so, it first computes for every query index q and every key index k an
attention score Aq,k as the softargmax of the dot products between the query Qq and the keys:

A_{q,k} = \frac{\exp\left(\frac{1}{\sqrt{D^{QK}}} Q_q^\top K_k\right)}{\sum_l \exp\left(\frac{1}{\sqrt{D^{QK}}} Q_q^\top K_l\right)},   (4.1)

where the scaling factor 1/√D^QK keeps the range of values roughly unchanged even for large D^QK.

Then a retrieved value is computed for each query by averaging the values according to the attention scores:

Y_q = \sum_k A_{q,k} V_k.   (4.2)

So if a query Qn matches one key Km far more


than all the others, the corresponding attention
score An,m will be close to one, and the retrieved
value Yn will be the value Vm associated to that
key. But, if it matches several keys equally, then
Yn will be the average of the associated values.

This can be implemented as

att(Q,K,V) = \underbrace{\mathrm{softargmax}\left(\frac{QK^\top}{\sqrt{D^{QK}}}\right)}_{A} V.
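For illustration, a minimal PyTorch sketch of this operator following Equations 4.1 and 4.2 (the tensor names and sizes are arbitrary; what the text calls the softargmax corresponds to torch.softmax):

    import math
    import torch

    def att(Q, K, V):
        # Q: (N_Q, D_QK), K: (N_KV, D_QK), V: (N_KV, D_V)
        A = torch.softmax(Q @ K.T / math.sqrt(Q.size(-1)), dim=-1)  # (N_Q, N_KV)
        return A @ V                                                # (N_Q, D_V)

    Q = torch.randn(5, 16)
    K = torch.randn(10, 16)
    V = torch.randn(10, 32)
    Y = att(Q, K, V)    # shape (5, 32)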

This operator is usually extended in two ways, as depicted in Figure 4.11.
Figure 4.11: The attention operator Y = att(Q,K,V) computes first an attention matrix A as the per-query softargmax of QK^⊤, which may be masked by a constant matrix M before the normalization. This attention matrix goes through a dropout layer before being multiplied by V to get the resulting Y. This operator can be made causal by taking M full of 1s below the diagonal and zero above.

First, the attention matrix can be masked by multiplying it before the softargmax normalization by a Boolean matrix M. This allows, for instance, making the operator causal by taking M full of 1s below the diagonal and zero above, preventing Yq from depending on keys and values of indices k greater than q. Second, the attention matrix is processed by a dropout layer (see § 4.5) before being multiplied by V, providing the usual benefits during training.
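For illustration, such a causal mask can be built and applied as follows in PyTorch (a sketch only; applying the mask as −∞ before the softmax is equivalent to zeroing the exponentials before normalization, and the dropout step is omitted):

    import math
    import torch

    T = 6
    Q = K = V = torch.randn(T, 16)

    M = torch.tril(torch.ones(T, T, dtype=torch.bool))   # 1s below and on the diagonal
    S = Q @ K.T / math.sqrt(Q.size(-1))
    S = S.masked_fill(~M, float("-inf"))                 # masked scores vanish after softmax
    A = torch.softmax(S, dim=-1)
    Y = A @ V                                            # Y[q] depends only on positions <= q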

Multi-head Attention Layer

This parameterless attention operator is the key element in the Multi-Head Attention layer depicted in Figure 4.12. This layer has for meta-parameters a number H of heads, and the shapes of three series of H trainable weight matrices:

• W^Q of size H × D × D^QK,
• W^K of size H × D × D^QK, and
• W^V of size H × D × D^V,

to compute respectively the queries, the keys, and the values from the input, and a final weight matrix W^O of size HD^V × D to aggregate the per-head results.

Figure 4.12: The Multi-head Attention layer applies for each of its h = 1,...,H heads a parametrized linear transformation to individual elements of the input sequences X^Q, X^K, X^V to get sequences Q,K,V that are processed by the attention operator to compute Yh. These H sequences are concatenated along features, and individual elements are passed through one last linear operator to get the final result sequence Y.

It takes as input three sequences
• X^Q of size N^Q × D,
• X^K of size N^KV × D, and
• X^V of size N^KV × D,

from which it computes, for h = 1,...,H,

Yh = att(X^Q W^Q_h, X^K W^K_h, X^V W^V_h).

These sequences Y1,...,YH are concatenated along the feature dimension and each individual element of the resulting sequence is multiplied by W^O to get the final result

Y = (Y1 | ··· | YH) W^O.
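A minimal sketch of this layer in PyTorch, for illustration (names, initialization, and dimensions are arbitrary; the standard library equivalent is torch.nn.MultiheadAttention):

    import math
    import torch
    import torch.nn as nn

    class MultiHeadAttention(nn.Module):
        def __init__(self, D, H, D_qk, D_v):
            super().__init__()
            self.W_q = nn.Parameter(torch.randn(H, D, D_qk) / math.sqrt(D))
            self.W_k = nn.Parameter(torch.randn(H, D, D_qk) / math.sqrt(D))
            self.W_v = nn.Parameter(torch.randn(H, D, D_v) / math.sqrt(D))
            self.W_o = nn.Parameter(torch.randn(H * D_v, D) / math.sqrt(H * D_v))

        def forward(self, X_q, X_k, X_v):
            # Per-head projections, shape (H, N, D_qk) and (H, N, D_v).
            Q = torch.einsum("nd,hde->hne", X_q, self.W_q)
            K = torch.einsum("nd,hde->hne", X_k, self.W_k)
            V = torch.einsum("nd,hde->hne", X_v, self.W_v)
            A = torch.softmax(Q @ K.transpose(-1, -2) / math.sqrt(Q.size(-1)), dim=-1)
            Y = A @ V                                          # (H, N_Q, D_v)
            Y = Y.transpose(0, 1).reshape(X_q.size(0), -1)     # concatenate heads along features
            return Y @ self.W_o

    mha = MultiHeadAttention(D=64, H=4, D_qk=16, D_v=16)
    X = torch.randn(10, 64)
    Y = mha(X, X, X)    # self-attention: the three input sequences are the same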

As we will see in § 5.3 and in Figure 5.6, this layer is used to build two model sub-structures: self-attention blocks, in which the three input sequences X^Q, X^K, and X^V are the same, and cross-attention blocks, where X^K and X^V are the same.

It is noteworthy that the attention operator, and


consequently the multi-head attention layer, is
invariant to a permutation of the keys and values,
and equivariant to a permutation of the queries,
as it would permute the resulting tensor simi-
larly.

4.9 Token embedding

In many situations, we need to convert discrete tokens into vectors. This can be done with an embedding layer, which consists of a lookup table that directly maps integers to vectors.

Such a layer is defined by two meta-parameters:


the number N of possible token values, and the
dimension D of the output vectors, and one train-
able N ×D weight matrix M .

Given as input an integer tensor X of dimension D1 ×···× DK and values in {0,...,N−1}, such a layer returns a real-valued tensor Y of dimension D1 ×···× DK × D with

∀d1,...,dK,  Y[d1,...,dK] = M[X[d1,...,dK]].
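This corresponds directly to torch.nn.Embedding in PyTorch; a short illustrative sketch (the sizes are arbitrary):

    import torch
    import torch.nn as nn

    N, D = 1000, 64                   # number of token values and output dimension
    embed = nn.Embedding(N, D)        # holds the trainable N x D weight matrix M

    X = torch.randint(0, N, (2, 7))   # integer tensor of token indices, shape (2, 7)
    Y = embed(X)                      # real-valued tensor of shape (2, 7, D)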

4.10 Positional encoding

While the processing of a fully-connected layer is specific to both the positions of the features in the input tensor and to the position of the resulting activation in the output tensor, convolutional layers and multi-head attention layers are oblivious to the absolute position in the tensor. This is key to their strong invariance and inductive bias, which is beneficial for dealing with a stationary signal.

However, this can be an issue in certain situ-


ations where proper processing has to access
the absolute positioning. This is the case, for
instance, for image synthesis, where the statis-
tics of a scene are not totally stationary, or in
natural language processing, where the relative
positions of words strongly modulate the mean-
ing of a sentence.

The standard way of coping with this problem is to add or concatenate a positional encoding, which is a feature vector that depends on the location, to the feature representation at every position. This positional encoding can be learned as other layer parameters, or defined analytically.

For instance, in the original Transformer model, for a series of vectors of dimension D, Vaswani et al. [2017] add an encoding of the sequence index as a series of sines and cosines at various frequencies:

pos-enc[t,d] =
\begin{cases}
\sin\left(t / T^{d/D}\right) & \text{if } d \in 2\mathbb{N} \\
\cos\left(t / T^{(d-1)/D}\right) & \text{otherwise,}
\end{cases}

with T = 10^4.
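A sketch of this analytical encoding in PyTorch, for illustration (the function name and arguments are arbitrary):

    import torch

    def pos_enc(t_max, D, T=10_000.0):
        t = torch.arange(t_max).unsqueeze(1)       # (t_max, 1)
        d = torch.arange(D).unsqueeze(0)           # (1, D)
        # Exponent d/D for even d, (d-1)/D for odd d, as in the formula above.
        angles = t / T ** ((d - d % 2) / D)
        pe = torch.where(d % 2 == 0, torch.sin(angles), torch.cos(angles))
        return pe                                  # (t_max, D), added to the embeddings

    P = pos_enc(t_max=128, D=64)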

Chapter 5

Architectures

The field of deep learning has developed, over the years and for each application domain, multiple deep architectures that exhibit good trade-offs with respect to multiple criteria of interest, e.g. ease of training, accuracy of prediction, memory footprint, computational cost, and scalability.

5.1 Multi-Layer Perceptrons

The simplest deep architecture is the Multi-Layer Perceptron (MLP), which takes the form of a succession of fully-connected layers separated by activation functions. See an example in Figure 5.1. For historical reasons, in such a model, the number of hidden layers refers to the number of linear layers, excluding the last one.

A key theoretical result is the universal approximation theorem [Cybenko, 1989], which states that, if the activation function σ is not polynomial, any continuous function f can be approximated arbitrarily well, uniformly on a compact set, by a model of the form l2 ∘ σ ∘ l1 where l1 and l2 are affine. Such a model is an MLP with a single hidden layer, and this result implies that it can approximate anything of practical value. However, this approximation holds only if the dimension of the first linear layer's output can be arbitrarily large.

Figure 5.1: This multi-layer perceptron takes as input a one-dimension tensor of size 50, is composed of three fully-connected layers with outputs of dimensions respectively 25, 10, and 2, the first two followed by ReLU layers.

In spite of their simplicity, MLPs remain an im-


portant tool when the dimension of the signal
to be processed is not too large.
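For illustration, the MLP of Figure 5.1 can be written in a few lines of PyTorch (a sketch, not code from the book):

    import torch
    import torch.nn as nn

    mlp = nn.Sequential(
        nn.Linear(50, 25), nn.ReLU(),
        nn.Linear(25, 10), nn.ReLU(),
        nn.Linear(10, 2),
    )

    y = mlp(torch.randn(16, 50))   # maps a batch of dimension-50 inputs to 2 outputs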

5.2 Convolutional networks

The standard architecture for processing images is a convolutional network, or convnet, that combines multiple convolutional layers, either to reduce the signal size before it can be processed by fully-connected layers, or to output a 2d signal also of large size.

LeNet-like

The original LeNet model for image classification [LeCun et al., 1998] combines a series of 2d convolutional layers and max pooling layers that play the role of feature extractor, with a series of fully-connected layers which act like an MLP and perform the classification per se. See Figure 5.2 for an example.

This architecture was the blueprint for many


models that share its structure and are simply
larger, such as AlexNet [Krizhevsky et al., 2012]
or the VGG family [Simonyan and Zisserman,
2014].

Figure 5.2: Example of a small LeNet-like network for classifying 28×28 grayscale images of handwritten digits [LeCun et al., 1998]. Its first half is convolutional, and alternates convolutional layers per se and max pooling layers, reducing the signal dimension from 28×28 scalars to 256. Its second half processes this 256-dimension feature vector through a one-hidden-layer perceptron to compute 10 logit scores corresponding to the ten possible digits.

Residual networks

Standard convolutional neural networks that follow the architecture of the LeNet family are not easily extended to deep architectures and suffer from the vanishing gradient problem. The residual networks, or ResNets, proposed by He et al. [2015] explicitly address the issue of the vanishing gradient with residual connections (see § 4.7), which allow hundreds of layers. They have become standard architectures for computer vision applications, and exist in multiple versions depending on the number of layers. We are going to look in detail at the architecture of the ResNet-50 for classification.

Figure 5.3: A residual block.

Figure 5.4: A downscaling residual block. It admits a
meta-parameter S, the stride of the first convolution
layer, which modulates the reduction of the tensor size.

Like other ResNets, it is composed of a series of residual blocks, each combining several convolutional layers, batch norm layers, and ReLU layers, wrapped in a residual connection. Such a block is pictured in Figure 5.3.

A key requirement for high performance with real images is to propagate a signal with a large number of channels, to allow for a rich representation.

Figure 5.5: Structure of the ResNet-50 [He et al., 2015].

However, the parameter count of a
convolutional layer, and its computational cost,
are quadratic with the number of channels. This
residual block mitigates this problem by first re-
ducing the number of channels with a 1×1 con-
volution, then operating spatially with a 3×3
convolution on this reduced number of chan-
nels, and then upscaling the number of channels,
again with a 1×1 convolution.
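A sketch of such a bottleneck residual block in PyTorch, for illustration (it follows the structure of Figure 5.3; the exact ordering of operations can differ slightly across implementations):

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, C):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(C, C // 2, kernel_size=1), nn.BatchNorm2d(C // 2), nn.ReLU(),
                nn.Conv2d(C // 2, C // 2, kernel_size=3, padding=1), nn.BatchNorm2d(C // 2), nn.ReLU(),
                nn.Conv2d(C // 2, C, kernel_size=1), nn.BatchNorm2d(C),
            )

        def forward(self, x):
            return torch.relu(x + self.f(x))   # residual sum followed by a final ReLU

    y = ResBlock(C=256)(torch.randn(1, 256, 56, 56))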

The network reduces the dimensionality of the signal to finally compute the logits for the classification. This is done thanks to an architecture composed of several sections, each starting with a downscaling residual block that halves the height and width of the signal and doubles the number of channels, followed by a series of residual blocks. Such a downscaling residual block has a structure similar to a standard residual block, except that it requires a residual connection that changes the tensor shape. This is achieved with a 1×1 convolution with a stride of two (see Figure 5.4).

The overall structure of the ResNet-50 is pre-


sented in Figure 5.5. It starts with a 7×7 convo-
lutional layer that converts the three-channel in-
put image to a 64-channel image of half the size,
followed by four sections of residual blocks. Surprisingly, in the first section, there is no down-
scaling, only an increase of the number of chan-
nels by a factor of 4. The output of the last resid-
ual block is 2048×7×7, which is converted to a
vector of dimension 2048 by an average pooling
of kernel size 7×7, and then processed through
a fully-connected layer to get the final logits,
here for 1000 classes.

5.3 Attention models

As stated in § 4.8, many applications, in particular from natural language processing, greatly benefit from models that include attention mechanisms. The architecture of choice for such tasks, which has been instrumental in recent advances in deep learning, is the Transformer proposed by Vaswani et al. [2017].

Transformer
The original Transformer, pictured in Figure 5.7,
was designed for sequence-to-sequence trans-
lation. It combines an encoder that processes
the input sequence to get a refined representa-
tion, and an autoregressive decoder that gener-
ates each token of the result sequence, given the
encoder’s representation of the input sequence
and the output tokens generated so far. As with the residual convolutional networks of § 5.2, both the encoder and the decoder of the Transformer are sequences of compounded blocks built with residual connections.

The self-attention block, pictured on the left of Figure 5.6, combines a Multi-Head Attention layer, see § 4.8, that recombines information globally, allowing any position to collect information from any other positions, with a one-hidden-layer MLP that updates representations at every position separately.

Figure 5.6: Self-attention block (left) and cross-attention block (right). These specific structures proposed by Radford et al. [2018] differ slightly from the original architecture of Vaswani et al. [2017], in particular by having the layer normalization first in the residual blocks.
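For illustration, a sketch of the pre-norm self-attention block of Figure 5.6 (left) in PyTorch; the dimensions are arbitrary, the multi-head attention is taken from the standard library, and the dropout layers are omitted for brevity:

    import torch
    import torch.nn as nn

    class SelfAttentionBlock(nn.Module):
        def __init__(self, D, H):
            super().__init__()
            self.ln1 = nn.LayerNorm(D)
            self.mha = nn.MultiheadAttention(D, H, batch_first=True)
            self.ln2 = nn.LayerNorm(D)
            self.mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

        def forward(self, x):                                   # x: (B, T, D)
            a = self.ln1(x)
            x = x + self.mha(a, a, a, need_weights=False)[0]    # attention sub-block
            x = x + self.mlp(self.ln2(x))                       # one-hidden-layer MLP sub-block
            return x

    y = SelfAttentionBlock(D=64, H=4)(torch.randn(2, 10, 64))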

The cross-attention block, pictured on the right of Figure 5.6, is similar except that it takes as input two sequences, one to compute the queries and one to compute the keys and values.

Figure 5.7: Original encoder-decoder Transformer model for sequence-to-sequence translation [Vaswani et al., 2017].

The encoder of the Transformer (see Figure 5.7, bottom) recodes the input sequence of discrete tokens X1,...,XT with an embedding layer, see § 4.9, and adds a positional encoding, see § 4.10, before processing it with several self-attention blocks to generate a refined representation Z1,...,ZT.

The decoder (see Figure 5.7, top) takes as input the sequence Y1,...,YS−1 of result tokens produced so far, similarly recodes them through an embedding layer, adds a positional encoding, and processes it through alternating causal self-attention blocks and cross-attention blocks to produce the logits predicting the next tokens. These cross-attention blocks compute their keys and values from the encoder's result representation Z1,...,ZT, which allows the resulting sequence to be a function of the original sequence X1,...,XT.

As we saw in § 3.2, being causal means that, for a given s, the logits for P̂(Ys | Yt<s) it computes depend only on the tokens Yt, t < s, in the input sequence (see Figure 3.1). This ensures that, given a full input sequence, the output at every position is the output that would have been obtained if the input had only been available until just before that position.

Figure 5.8: GPT model [Radford et al., 2018].

Generative Pre-trained Transformer

The Generative Pre-trained Transformer (GPT) [Radford et al., 2018, 2019], pictured in Figure 5.8, is a pure autoregressive model that consists of a succession of causal self-attention blocks, hence a causal version of the original Transformer encoder. This class of models scales extremely well, up to hundreds of billions of trainable parameters [Brown et al., 2020].


Figure 5.9: Vision Transformer model [Dosovitskiy et al., 2020].

Vision Transformer

Transformers have been put to use for image classification with the Vision Transformer (ViT) model [Dosovitskiy et al., 2020], see Figure 5.9.

It splits the three-channel input image into M patches of resolution P×P, which are then flattened to create a sequence of vectors X1,...,XM of shape M×3P^2. This sequence is multiplied by a trainable matrix W^E of shape 3P^2×D to map it to an M×D sequence, to which is concatenated one trainable vector E0. The resulting (M+1)×D sequence E0,...,EM is then processed through multiple self-attention blocks. See § 5.3 and Figure 5.6.
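A sketch of this patchification and embedding step in PyTorch, for illustration (P, D, and the names are arbitrary):

    import torch
    import torch.nn as nn

    B, P, D = 1, 16, 256
    img = torch.randn(B, 3, 224, 224)

    # Split into M = (224/P)^2 patches of resolution P x P and flatten them to 3*P*P values.
    patches = img.unfold(2, P, P).unfold(3, P, P)                       # (B, 3, 14, 14, P, P)
    X = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, 3 * P * P)     # (B, M, 3P^2)

    W_E = nn.Linear(3 * P * P, D, bias=False)      # the trainable 3P^2 x D matrix
    E = W_E(X)                                     # (B, M, D)

    E0 = nn.Parameter(torch.zeros(1, 1, D))        # trainable readout (CLS-like) token
    E = torch.cat([E0.expand(B, -1, -1), E], dim=1)   # (B, M + 1, D)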

The first element Z0 in the resultant sequence, which corresponds to E0 and is not associated with any part of the image, is finally processed by a two-hidden-layer MLP to get the final C logits. Such a token, added for a readout of a class prediction, was introduced by Devlin et al. [2018] in the BERT model and is referred to as a CLS token.

Part III

Applications

Chapter 6

Prediction

A first category of applications, such as face


recognition, sentiment analysis, object detection,
or speech recognition, requires predicting an un-
known value from an available signal.

6.1 Image denoising
A direct application of deep models to image
processing is to recover from degradation by
utilizing the redundancy in the statistical struc-
ture of images. The petals of a sunflower on a
grayscale picture can be colored with high confi-
dence, and the texture of a geometric shape such
as a table on a low-light grainy picture can be
corrected by averaging it over a large area likely
to be uniform.

A denoising autoencoder is a model that takes as input a degraded signal X̃ and computes an estimate of the original one X.

Such a model is trained by collecting a large num-


ber of clean samples paired with their degraded
inputs. The latter can be captured in degraded
conditions, such as low-light or inadequate fo-
cus, or generated algorithmically, for instance,
by converting the clean sample to grayscale, re-
ducing its size, or compressing it aggressively
with a lossy compression method.

The standard training procedure for denoising


autoencoders uses the MSE loss, in which case
the model aims at computing E(X | X̃). This
quantity may be problematic when X is not com-
pletely determined by X̃, in which case some
parts of the generated signal may be an unreal-
istic, blurry average.

6.2 Image classification
Image classification is the simplest strategy for extracting semantics from an image and consists of predicting a class from a finite, predefined number of classes, given an input image.

The standard models for this task are convolu-


tional networks, such as ResNets, see § 5.2, and
attention-based models such as ViT, see § 5.3.
Those models generate a vector of logits with as
many dimensions as there are classes.

The training procedure simply minimizes the cross-entropy loss, see § 3.1. Usually, performance can be improved with data augmentation, which consists of modifying the training samples with hand-designed random transformations that do not change the semantic content of the image, such as cropping, scaling, mirroring, or color changes.

6.3 Object detection
A more complex task for image understanding is object detection, in which the objective is, given an input image, to predict the classes and positions of objects of interest.

An object position is formalized as the four coor-


dinates (x1 ,y1 ,x2 ,y2 ) of a rectangular bounding
box, and the ground truth associated with each
training image is a list of such bounding boxes,
each labeled with the class of the object in it.

The standard approach to solve this task, for instance, by the Single Shot Detector (SSD) [Liu et al., 2015], is to use a convolutional neural network that produces a sequence of image representations Zs of size Ds×Hs×Ws, s = 1,...,S, with decreasing spatial resolution Hs×Ws down to 1×1 for s = S (see Figure 6.1). Each of those tensors covers the input image in full, so the h,w indices correspond to a partitioning of the image lattice into regular squares that gets coarser when s increases. As seen in § 4.2, and illustrated in Figure 4.4, due to the succession of convolutional layers, a feature vector (Zs[0,h,w],...,Zs[Ds−1,h,w]) is a descriptor of an area of the image, called its receptive field, that is larger than this square but centered on it.

Figure 6.1: A convolutional object detector processes the input image to generate a sequence of representations of decreasing resolutions. It computes for every h,w, at every scale s, a pre-defined number of bounding boxes whose centers are in the image area corresponding to that cell, and whose sizes are such that they fit in its receptive field. Each prediction takes the form of the estimates (x̂1, x̂2, ŷ1, ŷ2), represented by the red boxes above, and a vector of C+1 logits for the C classes of interest and an additional "no object" class.

Figure 6.2: Examples of object detection with the Single-
Shot Detector [Liu et al., 2015].

This results in a non-ambiguous matching of any bounding box (x1,x2,y1,y2) to a s,h,w, determined respectively by max(x2−x1, y2−y1), (y1+y2)/2, and (x1+x2)/2.

Detection is achieved by adding S convolutional layers, each processing a Zs and computing, for every pair of tensor indices h,w, the coordinates of a bounding box and the associated logits. If there are C object classes, there are C+1 logits, the additional one standing for "no object." Hence, each additional convolution layer has 4+C+1 output channels. The SSD algorithm in particular generates several bounding boxes per s,h,w, each dedicated to a hard-coded range of aspect ratios.
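For illustration, a sketch of one such detection head: a convolution applied to a scale-s representation Zs that outputs, at every spatial position, four box coordinates and C+1 logits, here for a single box per cell (names and values are arbitrary, not the exact SSD configuration):

    import torch
    import torch.nn as nn

    C = 20                                   # number of object classes
    D_s, H_s, W_s = 512, 8, 8
    Z_s = torch.randn(1, D_s, H_s, W_s)

    head = nn.Conv2d(D_s, 4 + C + 1, kernel_size=3, padding=1)
    out = head(Z_s)                          # (1, 4 + C + 1, H_s, W_s)

    boxes = out[:, :4]                       # bounding-box coordinate estimates per cell
    logits = out[:, 4:]                      # C classes plus the "no object" class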

Training sets for object detection are costly to create, since the labeling with bounding boxes requires a slow human intervention. To mitigate this issue, the standard approach is to start with a convolutional model that has been pre-trained on a large classification data set such as VGG-16 for the original SSD, and to replace its final fully-connected layers with additional convolutional ones. Surprisingly, models trained for classification only have learned feature representations that can be repurposed for object detection, even though that task involves the regression of geometric quantities.

During training, every ground truth bounding


box is associated with its s,h,w, and induces a
loss term composed of a cross-entropy loss for
the logits, and a regression loss such as MSE
for the bounding box coordinates. Every other
s,h,w free of bounding-box match induces a
cross-entropy only penalty to predict the class
“no object”.

6.4 Semantic segmentation
The finest-grain prediction task for image understanding is semantic segmentation, which consists of predicting, for every pixel, the class of the object to which it belongs. This can be achieved with a standard convolutional neural network that outputs a convolutional map with as many channels as classes, carrying the estimated logits for every pixel.

While a standard residual network, for instance, can generate a dense output of the same resolution as its input, as for object detection, this task requires operating at multiple scales. This is necessary so that any object, or sufficiently informative sub-part, regardless of its size, is captured somewhere in the model by the feature representation at a single tensor position. Hence, standard architectures for that task downscale the image with a series of convolutional layers to increase the receptive field of the activations, and re-upscale it with a series of transposed convolutional layers, or other upscaling methods such as bilinear interpolation, to make the prediction at high resolution.

However, a strict downscaling-upscaling architecture does not allow for operating at a fine grain when making the final prediction, since all the signal has been transmitted through a low-resolution representation at some point. Models that apply such downscaling-upscaling serially mitigate these issues with skip connections from layers at a certain resolution, before downscaling, to layers at the same resolution, after upscaling [Long et al., 2014; Ronneberger et al., 2015]. Models that do it in parallel, after a convolutional backbone, concatenate the resulting multi-scale representation after upscaling, before making the final per-pixel prediction [Zhao et al., 2016].

Figure 6.3: Semantic segmentation results with the Pyramid Scene Parsing Network [Zhao et al., 2016].

Training is achieved with a standard cross-entropy summed over all the pixels. As for object detection, training can start from a network pre-trained on a large-scale image classification data set to compensate for the limited availability of segmentation ground truth.

6.5 Speech recognition
Speech recognition consists of converting a sound sample into a sequence of words. There have been plenty of approaches to this problem historically, but a conceptually simple and recent one proposed by Radford et al. [2022] consists of casting it as a sequence-to-sequence translation and then solving it with a standard attention-based Transformer, as described in § 5.3.

Their model first converts the sound signal into a spectrogram, which is a one-dimensional series T×D, that encodes at every time step a vector of energies in D frequency bands. The associated text is encoded with the BPE tokenizer, see § 3.2.

The spectrogram is processed through a few 1d convolutional layers, and the resulting representation is fed into the encoder of the Transformer. The decoder directly generates a discrete sequence of tokens that correspond to one of the possible tasks considered during training. Multiple objectives are considered for training: transcription of English or non-English text, translation from any language to English, or detection of non-speech sequences, such as background music or ambient noise.

This approach allows leveraging extremely large


data sets that combine multiple types of sound
sources with diverse ground truth.

It is noteworthy that even though the ultimate


goal of this approach is to produce a transla-
tion as deterministic as possible given the input
signal, it is formally the sampling of a text dis-
tribution conditioned on a sound sample, hence
a synthesis process. The decoder is in fact ex-
tremely similar to the generative model of § 7.1.

6.6 Text-image representations
A powerful approach to image understanding
consists of learning consistent image and text
representations.

The Contrastive Language-Image Pre-training (CLIP) proposed by Radford et al. [2021] combines an image encoder f, which is a ViT, and a text encoder g, which is a GPT. See § 5.3 for both. To repurpose a GPT as a text encoder, instead of a standard autoregressive model, they add to the input sequence an "end of sentence" token, and use the representation of this token in the last layer as the embedding. Both embeddings have the same dimension, which, depending on the configuration, is between 512 and 1024.

Those two models are trained from scratch using a data set of 400 million image-text pairs (ik, tk) collected from the internet. The training procedure follows the standard mini-batch stochastic gradient descent approach but relies on a contrastive loss. The embeddings are computed for every image and every text of the N pairs in the mini-batch, and a cosine similarity measure is computed not only between text and image embeddings from each pair, but also across pairs, resulting in an N×N matrix of similarity scores

lm,n = f(im)^⊤ g(tn),  m = 1,...,N, n = 1,...,N.

The model is trained with cross-entropy so that, ∀n, the values l1,n,...,lN,n interpreted as logit scores predict n, and similarly for ln,1,...,ln,N. This means that ∀n,m s.t. n ≠ m, the similarity ln,n is unambiguously greater than both ln,m and lm,n.
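A sketch of this contrastive objective, for illustration (it assumes the two embeddings have already been computed and L2-normalized; the names are arbitrary, and details of CLIP such as the learned temperature are omitted):

    import torch
    import torch.nn.functional as F

    N, D = 8, 512
    img_emb = F.normalize(torch.randn(N, D), dim=1)    # f(i_1), ..., f(i_N)
    txt_emb = F.normalize(torch.randn(N, D), dim=1)    # g(t_1), ..., g(t_N)

    l = img_emb @ txt_emb.T                            # N x N matrix of similarity scores
    targets = torch.arange(N)                          # the matching pair is on the diagonal

    # Cross-entropy over rows (image -> text) and over columns (text -> image).
    loss = 0.5 * (F.cross_entropy(l, targets) + F.cross_entropy(l.T, targets))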

When it has been trained, this model can be used to do zero-shot prediction, that is, classifying a signal in the absence of training examples by defining a series of candidate classes with text descriptions, and computing the similarity of the embedding of an image with the embedding of each of those descriptions (see Figure 6.4).

Additionally, since the textual descriptions are of-


ten detailed, such a model has to capture a richer
representation of images and pick up cues over-
looked by classifier networks. This translates to
excellent performance on challenging datasets
such as ImageNet Adversarial [Hendrycks et al.,
2019] which was specifically designed to degrade
or erase cues on which standard predictors rely.

Figure 6.4: The CLIP text-image embedding [Radford et al., 2021] allows zero-shot prediction by predicting which class description embedding is the most consistent with the image embedding.

Chapter 7

Synthesis

A second category of applications distinct from


prediction is synthesis. It consists of fitting a
density model to training samples and providing
means to sample from this model.

7.1 Text generation
The standard approach to text synthesis is to use an attention-based, autoregressive model. The most successful in this domain is the GPT [Radford et al., 2018], which we described in § 5.3.

The encoding into tokens and the decoding are done with the BPE tokenizer, see § 3.2.
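For illustration, a sketch of the generation loop used with such a model: the next token is sampled from the predicted distribution and appended to the sequence. The model object and its interface are placeholders, not a specific library API; it is assumed to return one row of logits per position.

    import torch

    def generate(model, prompt_tokens, n_new, temperature=1.0):
        # prompt_tokens: (1, T) integer tensor of token indices
        tokens = prompt_tokens
        for _ in range(n_new):
            logits = model(tokens)[:, -1]                     # logits for the next token
            probs = torch.softmax(logits / temperature, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            tokens = torch.cat([tokens, next_token], dim=1)   # autoregressive extension
        return tokens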

When it has been trained on very large datasets, a Large Language Model (LLM) exhibits extremely powerful properties. Besides the syntactic and grammatical structure of the language, it has to integrate very diverse knowledge, e.g. to predict the word following "The capital of Japan is", "if water is heated to 100 Celsius degrees it turns into", or "because her puppy was sick, Jane was".

This results in particular in the ability to solve zero-shot prediction, where no training example is available and the objective is defined in natural language, e.g. "In the following sentences, indicate which ones are aggressive." More surprisingly, when such a model is put in a statistical context by a carefully crafted "prompt", it can exhibit abilities for question answering, problem solving, and chain-of-thought that appear eerily close to high-level reasoning [Chowdhery et al., 2022; Bubeck et al., 2023].

Due to these remarkable capabilities, these models are sometimes referred to as foundation models [Bommasani et al., 2021].

7.2 Image generation
Multiple deep methods have been developed to model and sample from a high-dimensional density. A powerful approach for image synthesis relies on inverting a diffusion process.

The principle consists of defining analytically


a process that gradually degrades any sample,
and consequently transforms the complex and
unknown density of the data into a simple and
well-known density such as a normal, and train-
ing a deep architecture to invert this degradation
process [Ho et al., 2020].

In practice, given a fixed T, the diffusion process defines a probability distribution over series of T+1 images as follows: sample x0 uniformly in the data set, and then go on sampling xt+1 ∼ p(xt+1 | xt), where the conditional distribution p is defined analytically, and such that it gradually erases the structure that was in x0. The setup should be such that the distribution p(xT) of xT has a simple, known form, so in particular it does not depend on the complicated data distribution p(x0), and can be sampled.

For instance, Ho et al. [2020] normalize the data to have a mean of 0 and a variance of 1, and their diffusion process consists of adding a bit of white noise and re-normalizing the variance to 1. This process exponentially reduces the importance of x0, and xt's density can rapidly be approximated with a normal.

Figure 7.1: Image synthesis with denoising diffusion [Ho et al., 2020]. Each sample starts as white noise xT (top), and is gradually de-noised by sampling iteratively xt−1 | xt ∼ 𝒩(xt + f(xt,t;w), σt).

The denoiser f is a deep architecture that should model, and allow sampling from, f(xt−1, xt, t; w) ≃ p(xt−1 | xt). It can be shown, thanks to a variational bound, that if this one-step reverse process is accurate enough, sampling xT ∼ p(xT) and denoising T steps with f results in an x0 that follows p(x0).

Training f can be achieved by generating a large number of sequences x_0^{(n)},...,x_T^{(n)}, picking a t_n in each, and maximizing

\sum_n \log f\left(x^{(n)}_{t_n-1}, x^{(n)}_{t_n}, t_n; w\right).

Given their diffusion process, Ho et al. [2020] have a denoising of the form

xt−1 | xt ∼ 𝒩(xt + f(xt, t; w), σt),   (7.1)

where σt is defined analytically.
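For illustration, a sketch of the corresponding sampling loop. The denoiser f, the number of steps T, and the schedule sigma are placeholders; a real implementation such as that of Ho et al. [2020] uses a specific parameterization of f and of the noise schedule.

    import torch

    def sample(f, shape, T, sigma):
        # Start from white noise x_T and denoise for T steps, following Equation 7.1.
        x = torch.randn(shape)
        for t in range(T, 0, -1):
            mean = x + f(x, t)
            noise = torch.randn_like(x) if t > 1 else 0.0   # no noise on the final step
            x = mean + sigma[t] * noise                     # x_{t-1} | x_t ~ N(x_t + f(x_t, t), sigma_t)
        return x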

In practice, such a model initially hallucinates structures by pure luck in the random noise, and then gradually builds more elements that emerge from the noise by reinforcing the most likely continuation of the image obtained thus far.

This approach can be extended to text-


conditioned synthesis, to generate images
that match a description. For instance, Nichol
et al. [2021] add to the mean of the denoising
distribution of Equation 7.1 a bias that goes in
the direction of increasing the CLIP matching
score (see § 6.6) between the produced image
and the conditioning text description.

The missing bits

For the sake of concision, this volume skips many


important topics, in particular:

Recurrent Neural Networks

Before attention models showed greater performance, Recurrent Neural Networks (RNNs) were the standard approach for dealing with temporal sequences such as text or sound samples. These architectures possess an internal hidden state that gets updated every time a component of the sequence is processed. Their main components are layers such as LSTM [Hochreiter and Schmidhuber, 1997] or GRU [Cho et al., 2014].

Training a recurrent architecture amounts to unfolding it in time, which results in a long composition of operators. This has historically prompted the design of key techniques now used for deep architectures, such as rectifiers and gating, a form of skip connection which is modulated dynamically.

Autoencoder

An autoencoder is a model that maps an input signal, possibly of high dimension, to a low-dimension latent representation, and then maps it back to the original signal, ensuring that information has been preserved. We saw it in § 6.1 for denoising, but it can also be used to automatically discover a meaningful low-dimension parameterization of the data manifold.

The Variational Autoencoder (VAE) proposed by Kingma and Welling [2013] is a generative model with a similar structure. It imposes, through the loss, a pre-defined distribution on the latent representation, so that, after training, it allows for the generation of new samples by sampling the latent representation according to this imposed distribution and then mapping back through the decoder.

Generative Adversarial Networks

Another approach to density modeling is the Generative Adversarial Network (GAN) introduced by Goodfellow et al. [2014]. This method combines a generator, which takes a random input following a fixed distribution as input and produces a structured signal such as an image, and a discriminator, which takes as input a sample and predicts whether it comes from the training set or if it was generated by the generator.

Training optimizes the discriminator to mini-


mize a standard cross-entropy loss, and the gen-
erator to maximize the discriminator’s loss. It
can be shown that at equilibrium the gener-
ator produces samples indistinguishable from
real data. In practice, when the gradient flows
through the discriminator to the generator, it
informs the latter about the cues that the dis-
criminator uses that should be addressed.

Reinforcement Learning

Many problems require a model to estimate an accumulated long-term reward given action choices and an observable state, and what actions to choose to maximize that reward. Reinforcement Learning (RL) is the standard framework to formalize such problems, and strategy games or robotic control, for instance, can be formulated within it. Deep models, particularly convolutional neural networks, have demonstrated excellent performance for this class of tasks [Mnih et al., 2015].

Fine-tuning

As we saw in § 6.3 for object detection, or in § 6.4 for semantic segmentation, fine-tuning deep architectures is an efficient strategy to deal with small training sets. Furthermore, due to the dramatic increase in the size of architectures, particularly that of Large Language Models, training a single model can cost several millions of dollars, and fine-tuning is a crucial, and often the only, way to achieve high performance on a specific task.

Graph Neural Networks

Many applications require processing signals which are not organized regularly on a grid. For instance, molecules, proteins, 3D meshes, or geographic locations are more naturally structured as graphs. Standard convolutional networks or even attention models are poorly adapted to process such data, and the tool of choice for such a task is the Graph Neural Network (GNN) [Scarselli et al., 2009].

These models are composed of layers that compute activations at each vertex by combining linearly the activations located at its immediate neighboring vertices. This operation is very similar to a standard convolution, except that the data structure does not reflect any geometrical information associated with the feature vectors they carry.

Self-supervised training

As stated in § 7.1, even though they are trained only to predict the next word, Large Language Models trained on large unlabeled data sets such as GPT (see § 5.3) are able to solve various tasks such as identifying the grammatical role of a word, answering questions, or even translating from one language to another [Radford et al., 2019].

Such models constitute one category of a larger class of methods that fall under the name of self-supervised learning, and try to take advantage of unlabeled data sets [Balestriero et al., 2023]. The key principle of these methods is to define a task that does not require labels but necessitates feature representations which are useful for the real task of interest, for which a small labeled data set exists. In computer vision, for instance, a standard approach consists of optimizing image features so that they are invariant to data transformations that do not change the semantic content of the image, while being statistically uncorrelated [Zbontar et al., 2021].

Afterword

Recent developments in Artificial Intelligence


have been incredibly exciting, and it is difficult
to comment on them without being overly dra-
matic. There are few doubts that these technolo-
gies will cause fundamental changes in how we
work, how we interact with knowledge and in-
formation, and that they will force us to rethink
concepts as fundamental as intelligence, under-
standing, and sentience.

In spite of its weaknesses, particularly its sheer


brutality and its computational cost, deep learn-
ing is likely to remain an important component
of AI systems for the foreseeable future and, as
such, a key element of this new era.

Bibliography

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer


Normalization. CoRR, abs/1607.06450, 2016.
[pdf]. 76

R. Balestriero, M. Ibrahim, V. Sobal, et al. A


Cookbook of Self-Supervised Learning. CoRR,
abs/2304.12210, 2023. [pdf]. 136

A. Baydin, B. Pearlmutter, A. Radul, and


J. Siskind. Automatic differentiation in
machine learning: a survey. CoRR,
abs/1502.05767, 2015. [pdf]. 40

M. Belkin, D. Hsu, S. Ma, and S. Mandal. Rec-


onciling modern machine learning and the
bias-variance trade-off. CoRR, abs/1812.11118,
2018. [pdf]. 45

R. Bommasani, D. Hudson, E. Adeli, et al. On


the Opportunities and Risks of Foundation
Models. CoRR, abs/2108.07258, 2021. [pdf].
127
T. Brown, B. Mann, N. Ryder, et al. Lan-
guage Models are Few-Shot Learners. CoRR,
abs/2005.14165, 2020. [pdf]. 48, 104

S. Bubeck, V. Chandrasekaran, R. Eldan, et al.


Sparks of Artificial General Intelligence:
Early experiments with GPT-4. CoRR,
abs/2303.12712, 2023. [pdf]. 127

T. Chen, B. Xu, C. Zhang, and C. Guestrin. Train-


ing Deep Nets with Sublinear Memory Cost.
CoRR, abs/1604.06174, 2016. [pdf]. 41

K. Cho, B. van Merrienboer, Ç. Gülçehre,


et al. Learning Phrase Representations using
RNN Encoder-Decoder for Statistical Machine
Translation. CoRR, abs/1406.1078, 2014. [pdf].
132

A. Chowdhery, S. Narang, J. Devlin, et al. PaLM:


Scaling Language Modeling with Pathways.
CoRR, abs/2204.02311, 2022. [pdf]. 48, 126

G. Cybenko. Approximation by superpositions


of a sigmoidal function. Mathematics of Con-
trol, Signals, and Systems, 2(4):303–314, De-
cember 1989. [pdf]. 91

J. Devlin, M. Chang, K. Lee, and K. Toutanova.


BERT: Pre-training of Deep Bidirectional

Transformers for Language Understanding.
CoRR, abs/1810.04805, 2018. [pdf]. 48, 106

A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al.


An Image is Worth 16x16 Words: Transform-
ers for Image Recognition at Scale. CoRR,
abs/2010.11929, 2020. [pdf]. 105, 106

K. Fukushima. Neocognitron: A self-organizing


neural network model for a mechanism of
pattern recognition unaffected by shift in po-
sition. Biological Cybernetics, 36(4):193–202,
April 1980. [pdf]. 2

Y. Gal and Z. Ghahramani. Dropout as


a Bayesian Approximation: Representing
Model Uncertainty in Deep Learning. CoRR,
abs/1506.02142, 2015. [pdf]. 72

X. Glorot and Y. Bengio. Understanding the dif-


ficulty of training deep feedforward neural
networks. In International Conference on Arti-
ficial Intelligence and Statistics (AISTATS), 2010.
[pdf]. 42, 56

X. Glorot, A. Bordes, and Y. Bengio. Deep Sparse


Rectifier Neural Networks. In International
Conference on Artificial Intelligence and Statis-
tics (AISTATS), 2011. [pdf]. 64

A. Gomez, M. Ren, R. Urtasun, and R. Grosse.
The Reversible Residual Network: Backprop-
agation Without Storing Activations. CoRR,
abs/1707.04585, 2017. [pdf]. 41
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza,
et al. Generative Adversarial Networks. CoRR,
abs/1406.2661, 2014. [pdf]. 133
K. He, X. Zhang, S. Ren, and J. Sun. Deep Resid-
ual Learning for Image Recognition. CoRR,
abs/1512.03385, 2015. [pdf]. 46, 77, 78, 95, 97
D. Hendrycks and K. Gimpel. Gaussian Error
Linear Units (GELUs). CoRR, abs/1606.08415,
2016. [pdf]. 66
D. Hendrycks, K. Zhao, S. Basart, et al. Natural
Adversarial Examples. CoRR, abs/1907.07174,
2019. [pdf]. 123
J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion
Probabilistic Models. CoRR, abs/2006.11239,
2020. [pdf]. 128, 129, 130
S. Hochreiter and J. Schmidhuber. Long Short-
Term Memory. Neural Computation, 9(8):1735–
1780, 1997. [pdf]. 132
S. Ioffe and C. Szegedy. Batch Normalization: Ac-
celerating Deep Network Training by Reduc-
ing Internal Covariate Shift. In International
Conference on Machine Learning (ICML), 2015.
[pdf]. 73

J. Kaplan, S. McCandlish, T. Henighan, et al. Scal-


ing Laws for Neural Language Models. CoRR,
abs/2001.08361, 2020. [pdf]. 46, 47

D. Kingma and J. Ba. Adam: A Method for


Stochastic Optimization. CoRR, abs/1412.6980,
2014. [pdf]. 36

D. P. Kingma and M. Welling. Auto-Encoding


Variational Bayes. CoRR, abs/1312.6114, 2013.
[pdf]. 133

A. Krizhevsky, I. Sutskever, and G. Hinton. Ima-


geNet Classification with Deep Convolutional
Neural Networks. In Neural Information Pro-
cessing Systems (NIPS), 2012. [pdf]. 8, 93

Y. LeCun, B. Boser, J. S. Denker, et al. Back-


propagation applied to handwritten zip code
recognition. Neural Computation, 1(4):541–
551, 1989. [pdf]. 8

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.


Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11):
2278–2324, 1998. [pdf]. 93, 94

W. Liu, D. Anguelov, D. Erhan, et al. SSD: Single
Shot MultiBox Detector. CoRR, abs/1512.02325,
2015. [pdf]. 112, 114

J. Long, E. Shelhamer, and T. Darrell. Fully Con-


volutional Networks for Semantic Segmenta-
tion. CoRR, abs/1411.4038, 2014. [pdf]. 77, 78,
118

A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rec-


tifier nonlinearities improve neural network
acoustic models. In proceedings of the ICML
Workshop on Deep Learning for Audio, Speech
and Language Processing, 2013. [pdf]. 65

V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. [pdf]. 134

A. Nichol, P. Dhariwal, A. Ramesh, et al. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. CoRR, abs/2112.10741, 2021. [pdf]. 131

A. Radford, J. Kim, C. Hallacy, et al. Learning Transferable Visual Models From Natural Language Supervision. CoRR, abs/2103.00020, 2021. [pdf]. 122, 124

A. Radford, J. Kim, T. Xu, et al. Robust Speech Recognition via Large-Scale Weak Supervision. CoRR, abs/2212.04356, 2022. [pdf]. 120

A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving Language Understanding by Generative Pre-Training, 2018. [pdf]. 101, 104, 126

A. Radford, J. Wu, R. Child, et al. Language Models are Unsupervised Multitask Learners, 2019. [pdf]. 104, 136

O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention, 2015. [pdf]. 77, 78, 118

F. Scarselli, M. Gori, A. C. Tsoi, et al. The Graph Neural Network Model. IEEE Transactions on Neural Networks (TNN), 20(1):61–80, 2009. [pdf]. 135

R. Sennrich, B. Haddow, and A. Birch. Neural Machine Translation of Rare Words with Subword Units. CoRR, abs/1508.07909, 2015. [pdf]. 32

J. Sevilla, L. Heim, A. Ho, et al. Compute Trends Across Three Eras of Machine Learning. CoRR, abs/2202.05924, 2022. [pdf]. 9, 46, 48

J. Sevilla, P. Villalobos, J. F. Cerón, et al. Parameter, Compute and Data Trends in Machine Learning, May 2023. [web]. 49

K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1556, 2014. [pdf]. 93

N. Srivastava, G. Hinton, A. Krizhevsky, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958, 2014. [pdf]. 70

A. Vaswani, N. Shazeer, N. Parmar, et al. Attention Is All You Need. CoRR, abs/1706.03762, 2017. [pdf]. 77, 80, 88, 100, 101, 102

J. Zbontar, L. Jing, I. Misra, et al. Barlow Twins: Self-Supervised Learning via Redundancy Reduction. CoRR, abs/2103.03230, 2021. [pdf]. 137

M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. In European Conference on Computer Vision (ECCV), 2014. [pdf]. 62

H. Zhao, J. Shi, X. Qi, et al. Pyramid Scene Parsing Network. CoRR, abs/1612.01105, 2016. [pdf]. 118, 119

Index

1d convolution, 60
2d convolution, 60

activation, 23, 39
activation function, 64, 91
activation map, 61
Adam, 36
artificial neural network, 8, 11
attention layer, 80
attention operator, 81
autoencoder, 133
Autograd, 40
autoregressive model, 31, 126
average pooling, 67

backpropagation, 40
backward pass, 40
basis function regression, 14
batch, 21, 36
batch normalization, 73
bias vector, 55, 60
BPE, 32, 120, 126
Byte Pair Encoding, 32

cache memory, 21
capacity, 16
causal, 32, 83
causal model, 31, 84, 103
channel, 23
checkpointing, 41
classification, 18
CLIP, 122
CLS token, 106
computational cost, 41
contrastive loss, 28, 122
convnet, 93
convolutional layer, 58, 67, 80, 88, 93, 96, 112,
117, 120
convolutional network, 93
cross-attention block, 86, 101
cross-entropy, 27

data augmentation, 111
deep learning, 8, 11
denoising autoencoder, 109
density modeling, 18
depth, 39
diffusion process, 128
dilation, 61, 67
discriminator, 134
downscaling residual block, 98
dropout, 70, 84

embedding layer, 87, 103
epoch, 43

filter, 60
fine tuning, 135
flops, 22
forward pass, 39
foundation models, 127
FP32, 22
framework, 23
fully-connected layer, 55, 80, 88, 91, 93

GAN, 133
GELU, 66
Generative Adversarial Networks, 133
generator, 133
GNN, 135
GPT, 104, 122, 126, 136
GPU, 8, 20
gradient descent, 33, 35, 38
gradient step, 33
Graph Neural Network, 135
Graphical Processing Unit, 8, 20
ground truth, 18

hidden layer, 91
hidden state, 132

image classification, 111
image processing, 93
image synthesis, 80, 128
inductive bias, 17, 44, 58

kernel size, 60, 67
key, 81

Large Language Model, 81, 126, 135
Large Language Models, 136
layer, 53
layer normalization, 76
layers, 39
Leaky ReLU, 65
learning rate, 33
learning rate schedule, 45
LeNet, 93, 94
linear layer, 55
LLM, 126
local minimum, 33
logit, 27
loss, 12

max pooling, 67
mean squared error, 14, 27
memory requirement, 41
memory speed, 21
meta-parameter, 13, 43
metric learning, 28
MLP, 91, 101
model, 12
Multi-Head Attention, 84, 100
multi-layer perceptron, 91

natural language processing, 80
NLP, 80
non-linearity, 64
normalizing layer, 73

object detection, 112
overfitting, 17, 43

padding, 60, 67
parameter, 12
parametric model, 12
peak performance, 22
pooling, 67
positional encoding, 88, 103
posterior probability, 27
pre-trained model, 115, 119

query, 81

random initialization, 56
receptive field, 61, 62, 112
rectified linear unit, 64, 132
recurrent neural network, 132
regression, 18
reinforcement learning, 134
ReLU, 64
residual block, 96
residual connection, 77, 95
residual network, 77, 95
ResNet, 77, 95
ResNet-50, 95
reversible layer, 41
RL, 134
RNN, 132

scaling laws, 46
self-attention block, 86, 100, 101
self-supervised learning, 136
semantic segmentation, 117
SGD, 36
Single Shot Detector, 112
skip connection, 77, 118, 132
softargmax, 27, 82
softmax, 27
speech recognition, 120
SSD, 112
stochastic gradient descent, 36, 46
stride, 61, 67
supervised learning, 19

tanh, 65
tensor, 23
tensor cores, 21
Tensor Processing Units, 21
test set, 43
text synthesis, 126
tokenizer, 32, 120, 126
tokens, 30
TPU, 21
trainable parameter, 12, 23, 46
training, 12
training set, 12, 26, 43
Transformer, 77, 81, 88, 100, 102, 120
transposed convolution, 63, 117

under-fitting, 16
universal approximation theorem, 91
unsupervised learning, 19

VAE, 133
validation set, 43
value, 81
vanishing gradient, 42, 52
variational autoencoder, 133
variational bound, 130
Vision Transformer, 106
ViT, 106, 122
vocabulary, 30

weight, 13
weight decay, 29
weight matrix, 55

zero-shot prediction, 123, 126


This book is licensed under the Creative Commons BY-NC-SA 4.0 International License.

preprint-2023.05.21
