The Little Book of Deep Learning
François Fleuret
François Fleuret is a professor of computer science at the University of Geneva, Switzerland.
List of figures 7
Foreword 8
I Foundations 10
1 Machine Learning 11
1.1 Learning from data . . . . . . . 12
1.2 Basis function regression . . . . 14
1.3 Under and overfitting . . . . . . 16
1.4 Categories of models . . . . . . 18
2 Efficient computation 20
2.1 GPUs, TPUs, and batches . . . . 21
2.2 Tensors . . . . . . . . . . . . . . 23
3 Training 26
3.1 Losses . . . . . . . . . . . . . . 27
3.2 Autoregressive models . . . . . 30
3.3 Gradient descent . . . . . . . . 33
3.4 Backpropagation . . . . . . . . 38
3.5 Training protocols . . . . . . . 43
3.6 The benefits of scale . . . . . . 46
II Deep models 51
4 Model components 52
4.1 The notion of layer . . . . . . . 53
4.2 Linear layers . . . . . . . . . . . 55
4.3 Activation functions . . . . . . 64
4.4 Pooling . . . . . . . . . . . . . . 67
4.5 Dropout . . . . . . . . . . . . . 70
4.6 Normalizing layers . . . . . . . 73
4.7 Skip connections . . . . . . . . 77
4.8 Attention layers . . . . . . . . . 80
4.9 Token embedding . . . . . . . . 87
4.10 Positional encoding . . . . . . . 88
5 Architectures 90
5.1 Multi-Layer Perceptrons . . . . 91
5.2 Convolutional networks . . . . 93
5.3 Attention models . . . . . . . . 100
Afterword 138
Bibliography 139
Index 148
List of Figures
4.1 1d convolution . . . . . . . . . . . . 57
4.2 2d convolution . . . . . . . . . . . . 58
4.3 Stride, padding, and dilation . . . . 59
4.4 Receptive field . . . . . . . . . . . . 62
4.5 Activation functions . . . . . . . . . 65
4.6 Max pooling . . . . . . . . . . . . . 68
4.7 Dropout . . . . . . . . . . . . . . . . 71
4.8 Batch normalization . . . . . . . . . 74
4.9 Skip connections . . . . . . . . . . . 78
4.10 Interpretation of the attention operator 81
4.11 Attention operator . . . . . . . . . . 83
4.12 Multi-Head Attention layer . . . . . 85
Foreword
If you did not get this book from its official URL
https://fleuret.org/public/lbdl.pdf
François Fleuret,
May 21, 2023
Part I
Foundations
Chapter 1
Machine Learning
Deep learning belongs historically to the larger field of statistical machine learning, as it fundamentally concerns methods able to learn representations from data. The techniques involved come originally from artificial neural networks, and the “deep” qualifier highlights that models are long compositions of mappings, now known to achieve greater performance.
1.1 Learning from data
The simplest use case for a model trained from
data is when a signal x is accessible, for instance,
the picture of a license plate, from which one
wants to predict a quantity y, such as the string
of characters written on the plate.
1.2 Basis function regression
We can illustrate the training of a model in a simple case where x_n and y_n are two real numbers, the loss is the mean squared error

ℒ(w) = \frac{1}{N} \sum_{n=1}^{N} \bigl(y_n − f(x_n; w)\bigr)^2,   (1.1)
the loss ℒ(w) is quadratic with respect to the w_k's, and finding w^* that minimizes it boils down to solving a linear system. See Figure 1.1 for an example with Gaussian kernels as the f_k.
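As a minimal sketch of this procedure, assuming Gaussian basis functions on a toy dataset (all values here are made up for illustration), the quadratic minimization reduces to a linear least-squares solve:

```python
import numpy as np

# Toy data: y = sin(x) plus noise (made up for the example).
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=100)
y = np.sin(x) + 0.1 * rng.normal(size=100)

# Gaussian basis functions f_k centered on a regular grid.
centers = np.linspace(0, 6, 10)
sigma = 0.5
Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))

# Since the MSE of Equation 1.1 is quadratic in w, the minimizer w*
# is the solution of a linear least-squares problem.
w_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Predictions of the fitted model f(x; w) = sum_k w_k f_k(x).
y_hat = Phi @ w_star
print("training MSE:", np.mean((y - y_hat) ** 2))
```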
1.3 Under and overfitting
A key element is the interplay between the capacity of the model, that is, its flexibility and ability to fit diverse data, and the amount and quality of the training data. When the capacity is insufficient, the model cannot fit the data, and the error during training is high. This is referred to as underfitting.
This is overfitting.
1.4 Categories of models
We can organize the use of machine learning
models into three broad categories:
• Regression consists of predicting a continuous-valued vector y ∈ R^K, for instance, a geometrical position of an object, given an input signal X. This is a multi-dimensional generalization of the setup we saw in § 1.2. The training set is composed of pairs of an input signal and a ground-truth value.
• Classification aims at predicting a value from a finite set {1,...,C}, for instance, the label Y of an image X. As for regression, the training set is composed of pairs of an input signal and a ground-truth quantity, here a label from that set. The standard way of tackling this is to predict one score per potential class, such that the correct class has the maximum score.
• Density modeling has as its objective to model the probability density function of the data µ_X itself, for instance, images. In that case, the training set is composed of values x_n without associated quantities to predict, and the trained model should allow either the evaluation of the probability density function, or sampling from the distribution, or both.
Both regression and classification are generally referred to as supervised learning, since the value to be predicted, which is required as a target during training, has to be provided, for instance, by human experts. On the contrary, density modeling is usually seen as unsupervised learning, since it is sufficient to take existing data, without the need for producing an associated ground truth.
Chapter 2
Efficient computation
2.1 GPUs, TPUs, and batches
Graphical Processing Units were originally designed for real-time image synthesis, which requires highly parallel architectures that happen to be fitting for deep models. As their usage for AI has increased, GPUs have been equipped with dedicated sub-components referred to as tensor cores, and deep-learning specialized chips such as Google's Tensor Processing Units (TPUs) have been produced.
to the cache memory near the actual computing
units. Proceeding by batches allows for copying
the model parameters only once, instead of doing
it for every sample. In practice, a GPU processes
a batch that fits in memory almost as quickly as
a single sample.
(FLOPs) per second, and its memory typically ranges from 8 to 80 gigabytes. The standard FP32 encoding of float numbers is on 32 bits, but empirical results show that using encoding on 16 bits, or even less for some operands, does not degrade performance.
2.2 Tensors
GPUs and deep learning frameworks such as PyTorch or JAX manipulate the quantities to be processed by organizing them as tensors, which are series of scalars arranged along several discrete axes. They are elements of R^{N_1 × ··· × N_D} that generalize the notion of vector and matrix.
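For instance, a minimal illustration with PyTorch (the shapes are arbitrary):

```python
import torch

# A batch of 16 RGB images of resolution 32x32, stored as a
# 16x3x32x32 tensor of 32-bit floats.
x = torch.zeros(16, 3, 32, 32)
print(x.shape, x.dtype)   # torch.Size([16, 3, 32, 32]) torch.float32

# A series of 100 vectors of dimension 8 is a 100x8 tensor,
# generalizing the notion of matrix.
v = torch.randn(100, 8)
```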
their designs.
Chapter 3
Training
3.1 Losses
The example of the mean squared error of Equation 1.1 is a standard loss for predicting a continuous value.
For classification, the standard approach is to interpret the components of f(x; w) as logits and to map them to posterior probabilities with a softargmax,

P̂(Y = y | X = x) = \frac{\exp f(x; w)_y}{\sum_z \exp f(x; w)_z},

and the associated loss, the cross-entropy, is the average negative log-probability of the correct class over the training samples.
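A short sketch of how this is computed in practice with PyTorch, whose cross_entropy combines the softargmax and the negative log-probability of the correct class (the logits below are made up):

```python
import torch
import torch.nn.functional as F

# Logits f(x; w) for a batch of 4 samples and C = 3 classes.
logits = torch.tensor([[ 2.0, -1.0, 0.5],
                       [ 0.1,  0.2, 1.5],
                       [-0.3,  2.2, 0.0],
                       [ 1.0,  1.0, 1.0]])
targets = torch.tensor([0, 2, 1, 0])  # ground-truth class indices

# Posterior probabilities P(Y = y | X = x) via softargmax.
probs = logits.softmax(dim=1)

# Cross-entropy: average negative log-probability of the correct class.
loss = F.cross_entropy(logits, targets)
print(probs[0], loss.item())
```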
For density modeling, the standard loss is the
likelihood of the data. If f (x;w) is to be inter-
preted as a normalized log-probability or density,
the loss is the opposite of the sum of its value
over training samples.
3.2 Autoregressive models
Many spectacular applications in computer vi-
sion and natural language processing have been
tackled by modeling the distribution of a high-
dimension discrete vector with the chain rule:
P(X_1 = x_1, X_2 = x_2, ..., X_T = x_T) = P(X_1 = x_1) \prod_{t=2}^{T} P(X_t = x_t \mid X_1 = x_1, ..., X_{t−1} = x_{t−1}).
3.3 Gradient descent
Except in specific cases like the linear regression
we saw in § 1.2, the optimal parameters w∗ do
not have a closed-form expression. In the general
case, the tool of choice to minimize a function
is gradient descent. It consists of initializing the parameters with a random w_0, and then improving this estimate by iterating gradient steps, each consisting of computing the gradient of the loss with respect to the parameters, and subtracting a fraction of it:

w_{n+1} = w_n − η ∇ℒ|_w(w_n),

where η > 0 is the learning rate.
As with many algorithms, intuition tends to
break down in very high dimensions, and al-
though it may seem that this procedure would
be easily trapped in a local minimum, in reality,
due to the number of parameters, the design of
the models, and the stochasticity of the data, its
efficiency is far greater than one might expect.
where

𝓁_n(w) = L(f(x_n; w), y_n)

for some L, and the gradient is then

∇ℒ|_w(w) = \frac{1}{N} \sum_{n=1}^{N} ∇𝓁_n|_w(w).   (3.2)
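A minimal sketch of the resulting stochastic gradient descent loop in PyTorch; the model, data, mini-batch size, and learning rate are arbitrary placeholders:

```python
import torch

# Placeholder model and data: a linear model on 10-dimensional inputs.
model = torch.nn.Linear(10, 1)
x = torch.randn(256, 10)
y = torch.randn(256, 1)
criterion = torch.nn.MSELoss()

eta = 1e-2        # learning rate
batch_size = 32   # mini-batches approximate the full gradient of (3.2)

for step in range(100):
    idx = torch.randint(0, x.size(0), (batch_size,))
    loss = criterion(model(x[idx]), y[idx])

    model.zero_grad()
    loss.backward()                      # compute the gradient of the loss

    with torch.no_grad():                # gradient step: w <- w - eta * grad
        for p in model.parameters():
            p -= eta * p.grad
```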
ent training speeds in different parts of a model.
3.4 Backpropagation
Using gradient descent requires a technical means to compute ∇𝓁|_w(w) where 𝓁 = L(f(x; w); y). Given that f and L are both compositions of standard tensor operations, as for any mathematical expression, the chain rule allows us to get an expression of it.

[Figure: during the forward pass, each mapping f_d(·; w_d) turns x^(d−1) into x^(d); during the backward pass, ∇𝓁|_{x^(d−1)} and ∇𝓁|_{w_d} are obtained from ∇𝓁|_{x^(d)} through products with the Jacobians J_{f_d}|_x and J_{f_d}|_w.]
Forward and backward passes
Consider the simple case of a composition of
mappings
f = f1 ◦f2 ◦···◦fD .
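In practice these passes are not coded by hand: a framework records the operations of the forward pass and applies the chain rule automatically. A minimal sketch with PyTorch's autograd, for an arbitrary two-mapping composition:

```python
import torch

# Parameters of two mappings f1 and f2 (chosen arbitrarily).
w1 = torch.randn(5, 10, requires_grad=True)
w2 = torch.randn(1, 5, requires_grad=True)

x = torch.randn(10)
y = torch.tensor([0.0])

# Forward pass: f = f2 o f1, then the loss.
h = torch.tanh(w1 @ x)          # x^(1) = f1(x; w1)
y_hat = w2 @ h                  # x^(2) = f2(x^(1); w2)
loss = (y_hat - y).pow(2).sum() # l = L(f(x; w), y)

# Backward pass: autograd applies the chain rule and fills .grad.
loss.backward()
print(w1.grad.shape, w2.grad.shape)
```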
Resource usage
Regarding the computational cost, as we will see, the bulk of the computation goes into linear operations that require one matrix product for the forward pass and two for the products by the Jacobians for the backward pass, making the latter roughly twice as costly as the former.
The memory requirement during inference is roughly equal to that of the most demanding individual layer. For training, however, the backward pass requires keeping the activations computed during the forward pass to compute the Jacobians, which results in a memory usage that grows proportionally to the model's depth. Techniques exist to trade the memory usage for computation by either relying on reversible layers [Gomez et al., 2017], or using checkpointing, which consists of storing activations for some layers only and recomputing the others on the fly with partial forward passes during the backward pass [Chen et al., 2016].
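As an illustration of the checkpointing idea, a sketch using PyTorch's torch.utils.checkpoint; the block below is an arbitrary stand-in:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# An arbitrary block whose activations we prefer not to store.
block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

x = torch.randn(64, 512, requires_grad=True)

# Standard call: intermediate activations are kept for the backward pass.
y = block(x)

# Checkpointed call: only the input is kept; the activations are
# recomputed with a partial forward pass during the backward pass.
y_ckpt = checkpoint(block, x, use_reentrant=False)

y_ckpt.sum().backward()
```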
Vanishing gradient
A key historical issue when training a large net-
work is that when the gradient propagates back-
wards through an operator, it may be rescaled
by a multiplicative factor, and consequently decrease or increase exponentially when it traverses many layers. When it decreases exponentially, this is called the vanishing gradient, and it may make the training impossible, or, in its milder form, cause different parts of the model to be updated at different speeds, degrading their co-adaptation [Glorot and Bengio, 2010].
3.5 Training protocols
Training a deep network requires defining a pro-
tocol to make the most of computation and data,
and ensure that performance will be good on
new data.
[Figure: typical evolution of the training and validation losses as a function of the number of epochs.]
on the training set [Belkin et al., 2018].
3.6 The benefits of scale
There is an accumulation of empirical results showing that performance, for instance, estimated through the loss on test data, improves with the amount of data according to remarkable scaling laws, as long as the model size increases correspondingly [Kaplan et al., 2020], see Figure 3.5.
[Figure 3.5: test loss as a function of compute (peta-FLOP/s-day) and as a function of the number of parameters.]
[Plot: training cost (FLOP) vs. year for landmark models, from VGG16, ResNet-152, and GPT to BERT-Large, GPT-2, GPT-3, AlphaZero, ViT-H/14, CLIP-ViT-L/14, Whisper, LaMDA, and PaLM-540B, with reference energy levels at 1 KWh, 1 MWh, and 1 TWh.]
Figure 3.6: Training costs in number of FLOP of some landmark models [Sevilla et al., 2023]. The colors indicate the domains of application: Computer Vision (blue), Natural Language Processing (red), or other (black). The dashed lines correspond to the energy consumption using A100s SXM in 16 bits precision.
ficial intelligence rely on very large language
models, which we will see in § 5.3 and § 7.1,
trained on extremely large text datasets, see Ta-
ble 3.1.
Part II
Deep models
Chapter 4
Model components
4.1 The notion of layer
We call layers standard complex compounded tensor operations that have been designed and empirically identified as being generic and efficient. They often incorporate trainable parameters and correspond to a convenient level of granularity for designing and describing large deep models. The term is inherited from simple multi-layer neural networks, even though modern models may take the form of a complex graph of such modules, incorporating multiple parallel pathways.
4.2 Linear layers
Linear layers are the most important modules in terms of computation and number of parameters. They benefit from decades of research and engineering in algorithmic and chip design for matrix operations.
Fully-connected layers

The most basic one is the fully-connected layer, parameterized by w = (W, b), where W is a D′×D weight matrix and b is a bias vector of dimension D′. It implements a matrix/vector product generalized to arbitrary tensor shapes. Given an input X of dimension D_1 × ··· × D_K × D, it computes an output Y of dimension D_1 × ··· × D_K × D′ with

∀d_1, ..., d_K,  Y[d_1, ..., d_K] = W X[d_1, ..., d_K] + b.
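A quick sketch with PyTorch's nn.Linear, which implements this affine map applied over the last dimension (the sizes are arbitrary):

```python
import torch
from torch import nn

D, D_out = 32, 8
fc = nn.Linear(D, D_out)           # weight matrix of size D_out x D, bias of size D_out

# Input of shape D1 x D2 x D: the layer is applied to every
# size-D sub-vector, leaving the leading dimensions unchanged.
X = torch.randn(16, 10, D)
Y = fc(X)
print(Y.shape)                     # torch.Size([16, 10, 8])
```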
Convolutional layers
A linear layer can take as input an arbitrarily-
shaped tensor by reshaping it into a vector, as
long as it has the correct number of coefficients.
However, such a layer is poorly adapted to deal-
ing with large tensors, since the number of pa-
rameters and number of operations are propor-
tional to the product of the input and output
dimensions. For instance, to process an RGB
image of size 256×256 as input and compute a
result of the same size, it would require approxi-
mately 4×1010 parameters and multiplications.
Figure 4.1: A 1d convolution (left) takes as input
a D×T tensor X, applies the same affine mapping
ϕ(·;w) to every sub-tensor of shape D×K, and stores
the resulting D′ ×1 tensors into Y . A 1d transposed
convolution (right) takes as input a D×T tensor, ap-
plies the same affine mapping ψ(·;w) to every sub-
tensor of shape D×1, and sums the shifted resulting
D′ ×K tensors. Both can process inputs of different
size.
Figure 4.2: A 2d convolution (left) takes as input a
D×H ×W tensor X, applies the same affine map-
ping ϕ(·;w) to every sub-tensor of shape D×K ×L,
and stores the resulting D′ ×1×1 tensors into Y . A
2d transposed convolution (right) takes as input a
D×H ×W tensor, applies the same affine mapping
ψ(·;w) to every D×1×1 sub-tensor, and sums the
shifted resulting D′ ×K ×L tensors into Y .
Figure 4.3: Beside its kernel size and number of input/output channels, a convolution admits three meta-parameters: the stride s (left) modulates the step size when going through the input tensor, the padding p (top right) specifies how many zero entries are added around the input tensor before processing it, and the dilation d (bottom right) parameterizes the index count between coefficients of the filter.
same operator everywhere.
A 1d convolution is mainly defined by three meta-parameters: its kernel size K, its number of input channels D, its number of output channels D′, and by the trainable parameters w of an affine mapping ϕ(·;w) : R^{D×K} → R^{D′×1}.
A 2d convolution is similar but has a K×L kernel and takes as input a D×H×W tensor, see Figure 4.2 (left).
• The padding specifies how many zero coefficients should be added around the input tensor before processing it, particularly to maintain the tensor size when the kernel size is greater than one. Its default value is 0.
• The dilation specifies the index count between the filter coefficients of the local affine operator. Its default value is 1, and greater values correspond to inserting zeros between the coefficients, which increases the filter/kernel size while keeping the number of trainable parameters unchanged.
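A short sketch with PyTorch's convolution modules, illustrating the kernel size, stride, padding, and dilation meta-parameters (the sizes are arbitrary):

```python
import torch
from torch import nn

# 1d convolution: D = 16 input channels, D' = 32 output channels, K = 5.
conv1d = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=5)
x = torch.randn(8, 16, 100)          # batch of 8, D x T = 16 x 100
print(conv1d(x).shape)               # torch.Size([8, 32, 96])

# 2d convolution with stride 2, padding 1, and dilation 2.
conv2d = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, dilation=2)
img = torch.randn(8, 3, 32, 32)
print(conv2d(img).shape)             # torch.Size([8, 64, 15, 15])
```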
Figure 4.4: Given an activation in a series of convolution layers, here in red, its receptive field is the area in the input signal, in blue, that modulates its value. Each intermediate convolutional layer increases the width and height of that area by roughly those of the kernel.
A converse operation is the transposed convolution that also consists of a localized affine operator, defined by similar meta and trainable parameters as the convolution, but which applies, for instance, in the 1d case, an affine mapping ψ(·;w) : R^{D×1} → R^{D′×K}, to every D×1 sub-tensor of the input, and sums the shifted D′×K resulting tensors to compute its output. Such an operator increases the size of the signal and can be understood intuitively as a synthesis process (see Figure 4.1, right, and Figure 4.2, right).
4.3 Activation functions
If a network were combining only linear components, it would itself be a linear operator, so it is essential to have non-linear operations. They are implemented in particular with activation functions, which are layers that transform each component of the input tensor individually through a mapping, resulting in a tensor of the same shape.
And GELU
GELU [Hendrycks and Gimpel, 2016] is de-
fined with the cumulative distribution function
of the Gaussian distribution, that is
gelu(x) = xP (Z ≤ x),
4.4 Pooling
A classical strategy to reduce the signal size is to use a pooling operation that combines multiple activations into one that ideally summarizes the information. The most standard operation of this class is the max pooling layer, which, similarly to convolution, can operate in 1d and 2d, and is defined by a kernel size.

[Figure 4.6: 1d max pooling.]
layer that computes the average instead of the
maximum over the sub-tensors. This is a linear
operation, whereas max pooling is not.
4.5 Dropout
Some layers have been designed to explicitly
facilitate training or improve the quality of the
learned representations.
Figure 4.7: Dropout can process a tensor of arbitrary
shape. During training (left), it sets activations at ran-
dom to zero with probability p and applies a multiply-
ing factor to keep the expected values unchanged. Dur-
ing test (right), it keeps all the activations unchanged.
confidence scores [Gal and Ghahramani, 2015].
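A brief sketch of this behavior with PyTorch's nn.Dropout, which rescales by 1/(1−p) during training so that expected values are unchanged (the tensor content is arbitrary):

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

drop.train()                 # training mode: random zeroing + rescaling by 1/(1-p)
print(drop(x))               # roughly half the entries are 0, the others are 2

drop.eval()                  # test mode: the input is passed through unchanged
print(drop(x))
```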
4.6 Normalizing layers
An important class of operators to facilitate the training of deep architectures are the normalizing layers, which force the empirical mean and variance of groups of activations.
[Figure 4.8: batchnorm and layernorm both normalize as (x − m̂)/√(v̂ + ϵ) and then rescale as x ⊙ γ + β; they differ in the groups of activations over which the statistics m̂ and v̂ are computed.]
viation γ_d:

z_{b,d} = \frac{x_{b,d} − m̂_d}{\sqrt{v̂_d + ϵ}},
y_{b,d} = γ_d z_{b,d} + β_d.
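A small sketch of this computation with PyTorch's BatchNorm1d (the batch is arbitrary):

```python
import torch
from torch import nn

bn = nn.BatchNorm1d(num_features=4)   # one (gamma_d, beta_d) pair per channel
x = torch.randn(32, 4) * 3.0 + 5.0    # batch of B = 32 samples, D = 4

bn.train()
y = bn(x)                             # normalize per channel, then scale and shift
print(y.mean(dim=0), y.std(dim=0))    # approximately 0 and 1 at initialization
```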
4.7 Skip connections
Another technique that mitigates the vanishing gradient and allows the training of deep architectures are skip connections [Long et al., 2014; Ronneberger et al., 2015]. They are not layers per se, but an architectural design in which outputs of some layers are transported as-is to other layers further in the model, bypassing processing in-between. This unmodified signal can be concatenated or added to the input to the layer the connection branches into (see Figure 4.9). A particular type of skip connection is the residual connection, which combines the signal with a sum, and usually skips only a few layers (see Figure 4.9, right).
Their role can also be to facilitate multi-scale rea-
soning in models that reduce the signal size be-
fore re-expanding it, by connecting layers with
compatible sizes. In the case of residual con-
nections, they may also facilitate learning by
simplifying the task to finding a differential im-
provement instead of a full update.
4.8 Attention layers
In many applications, there is a need for an operation that is able to combine local information at locations far apart in a tensor. For instance, this could be distant details for coherent and realistic image synthesis, or words at different positions in a paragraph to make a grammatical or semantic decision in natural language processing.
Fully-connected layers cannot process large-dimension signals, nor signals of variable size, and convolutional layers are not able to propagate information quickly. Strategies that aggregate the results of convolutions, for instance, by averaging them over large spatial areas, suffer from mixing multiple signals into a limited number of dimensions.
Attention layers specifically address this problem by computing an attention score for each component of the resulting tensor to each component of the input tensor, without locality constraints, and averaging features across the full tensor accordingly [Vaswani et al., 2017].
They are the main building block of Transformers, the dominant architecture for Large Language Models. See § 5.3 and § 7.1.
Attention operator
Given query, key, and value tensors Q, K, and V, the attention operator computes a tensor

Y = att(K, Q, V).

[Figure 4.11: the attention operator computes an attention matrix A from Q and K with a softargmax normalization, optionally masked by M, applies dropout to it, and averages the values V accordingly.]
matrix can be masked by multiplying it before the softargmax normalization by a Boolean matrix M. This allows, for instance, making the operator causal by taking M full of 1s below the diagonal and zero above, preventing Y_q from depending on keys and values of indices k greater than q. Second, the attention matrix is processed by a dropout layer (see § 4.5) before being multiplied by V, providing the usual benefits during training.
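A compact sketch of this operator, a generic scaled dot-product attention with optional causal masking and dropout (this is an illustrative rendering, not the book's code):

```python
import math
import torch

def att(K, Q, V, causal=False, drop_p=0.0, training=False):
    # Q: (NQ, Dqk), K: (NKV, Dqk), V: (NKV, Dv)
    A = Q @ K.t() / math.sqrt(K.size(1))         # attention scores
    if causal:
        NQ, NKV = A.shape
        mask = torch.ones(NQ, NKV, dtype=torch.bool).tril()
        A = A.masked_fill(~mask, float("-inf"))  # forbid k > q
    A = A.softmax(dim=1)                          # softargmax over the keys
    A = torch.nn.functional.dropout(A, p=drop_p, training=training)
    return A @ V                                  # average the values

Y = att(torch.randn(6, 8), torch.randn(4, 8), torch.randn(6, 16), causal=True)
print(Y.shape)                                    # torch.Size([4, 16])
```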
• W Q of size H ×D×DQK ,
• W K of size H ×D×DQK , and
• W V of size H ×D×DV ,
Figure 4.12: The Multi-head Attention layer applies
for each of its h = 1,...,H heads a parametrized lin-
ear transformation to individual elements of the input
sequences X Q ,X K ,X V to get sequences Q,K,V that
are processed by the attention operator to compute Yh .
These H sequences are concatenated along features,
and individual elements are passed through one last
linear operator to get the final result sequence Y .
• X Q of size N Q ×D,
• X K of size N KV ×D, and
• X V of size N KV ×D,
Y = (Y1 | ··· | YH )W O .
cross-attention blocks, where X^K and X^V are the same.
4.9 Token embedding
In many situations, we need to convert discrete tokens into vectors. This can be done with an embedding layer, which consists of a lookup table that directly maps integers to vectors.

∀d_1, ..., d_K,  Y[d_1, ..., d_K] = M[X[d_1, ..., d_K]].
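A minimal sketch with PyTorch's nn.Embedding (vocabulary size and dimensions are arbitrary):

```python
import torch
from torch import nn

vocab_size, dim = 1000, 64
embed = nn.Embedding(vocab_size, dim)     # lookup table M of size 1000 x 64

# A batch of 2 token sequences of length 5 (integer indices).
X = torch.randint(0, vocab_size, (2, 5))
Y = embed(X)                              # Y[d1, d2] = M[X[d1, d2]]
print(Y.shape)                            # torch.Size([2, 5, 64])
```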
4.10 Positional encoding
While the processing of a fully-connected layer is specific to both the positions of the features in the input tensor and to the position of the resulting activation in the output tensor, convolutional layers and multi-head attention layers are oblivious to the absolute position in the tensor. This is key to their strong invariance and inductive bias, which is beneficial for dealing with a stationary signal.
pos-enc[t, d] = \begin{cases} \sin\bigl(t / T^{d/D}\bigr) & \text{if } d \in 2\mathbb{N} \\ \cos\bigl(t / T^{(d-1)/D}\bigr) & \text{otherwise,} \end{cases}

with T = 10^4.
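A short sketch computing this table, as a generic rendering of the formula above:

```python
import torch

def pos_enc(T_max, D, T=1e4):
    t = torch.arange(T_max).unsqueeze(1).float()   # positions t
    d = torch.arange(D).unsqueeze(0).float()       # feature indices d
    angles = t / T ** ((d - d % 2) / D)            # exponent d/D, with d rounded down to even
    pe = torch.where(d % 2 == 0, torch.sin(angles), torch.cos(angles))
    return pe                                       # shape (T_max, D)

print(pos_enc(50, 16).shape)                        # torch.Size([50, 16])
```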
Chapter 5
Architectures
5.1 Multi-Layer Perceptrons
The simplest deep architecture is the Multi-Layer Perceptron (MLP), which takes the form of a succession of fully-connected layers separated by activation functions. See an example in Figure 5.1. For historical reasons, in such a model, the number of hidden layers refers to the number of linear layers, excluding the last one.
Figure 5.1: This multi-layer perceptron takes as input a one-dimension tensor of size 50, is composed of three fully-connected layers with outputs of dimensions respectively 25, 10, and 2, the first two followed by ReLU layers.
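A direct transcription of the model of Figure 5.1 as a PyTorch module (a sketch, not the book's code):

```python
import torch
from torch import nn

mlp = nn.Sequential(
    nn.Linear(50, 25), nn.ReLU(),   # first hidden layer
    nn.Linear(25, 10), nn.ReLU(),   # second hidden layer
    nn.Linear(10, 2),               # output layer, no activation
)

x = torch.randn(16, 50)
print(mlp(x).shape)                 # torch.Size([16, 2])
```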
mial, any continuous function f can be approxi-
mated arbitrarily well uniformly on a compact
by a model of the form l2 ◦σ◦l1 where l1 and l2
are affine. Such a model is a MLP with a single
hidden layer, and this result implies that it can
approximate anything of practical value. How-
ever, this approximation holds if the dimension
of the first linear layer’s output can be arbitrarily
large.
5.2 Convolutional networks
The standard architecture for processing images is a convolutional network, or convnet, that combines multiple convolutional layers, either to reduce the signal size before it can be processed by fully-connected layers, or to output a 2d signal also of large size.
LeNet-like
The original LeNet model for image classification [LeCun et al., 1998] combines a series of 2d convolutional layers and max pooling layers that play the role of feature extractor, with a series of fully-connected layers which act like an MLP and perform the classification per se. See Figure 5.2 for an example.
Residual networks
Standard convolutional neural networks that fol-
low the architecture of the LeNet family are not
easily extended to deep architectures and suffer
Figure 5.2: Example of a small LeNet-like network for classifying 28×28 grayscale images of handwritten digits [LeCun et al., 1998]. Its first half is convolutional, and alternates convolutional layers per se and max pooling layers, reducing the signal dimension from 28×28 scalars to 256. Its second half processes this 256-dimension feature vector through a one hidden layer perceptron to compute 10 logit scores corresponding to the ten possible digits.
[Diagram of the residual block: the input X of size C×H×W goes through conv-2d k=1 (reducing to C/2 channels), batchnorm, relu, conv-2d k=3 p=1, batchnorm, relu, conv-2d k=1 (back to C channels), and batchnorm; the result is added to X, and a final relu produces Y.]
Figure 5.3: A residual block.
Figure 5.4: A downscaling residual block. It admits a
meta-parameter S, the stride of the first convolution
layer, which modulates the reduction of the tensor size.
[Diagram: the input X of size 3×224×224 goes through conv-2d k=7 s=2 p=3 (to 64×112×112), batchnorm, relu, and maxpool k=3 s=2 p=1 (to 64×56×56); then four sections of residual blocks: dresblock S=1 and resblock ×2 at 256×56×56, dresblock S=2 and resblock ×3 at 512×28×28, dresblock S=2 and resblock ×5 at 1024×14×14, and dresblock S=2 and resblock ×2 at 2048×7×7; an avgpool k=7 and a reshape produce a vector of dimension 2048, and a final fully-connected layer computes the 1000 logits P̂(Y).]
Figure 5.5: Structure of the ResNet-50 [He et al., 2015].
sentation. However, the parameter count of a
convolutional layer, and its computational cost,
are quadratic with the number of channels. This
residual block mitigates this problem by first re-
ducing the number of channels with a 1×1 con-
volution, then operating spatially with a 3×3
convolution on this reduced number of chan-
nels, and then upscaling the number of channels,
again with a 1×1 convolution.
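As an illustrative sketch, the residual block of Figure 5.3 could be written as follows (this transcription is an assumption based on the figure, not the book's code):

```python
import torch
from torch import nn

class ResBlock(nn.Module):
    def __init__(self, C):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(C, C // 2, kernel_size=1), nn.BatchNorm2d(C // 2), nn.ReLU(),
            nn.Conv2d(C // 2, C // 2, kernel_size=3, padding=1), nn.BatchNorm2d(C // 2), nn.ReLU(),
            nn.Conv2d(C // 2, C, kernel_size=1), nn.BatchNorm2d(C),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.f(x))   # residual connection: add the input back

x = torch.randn(4, 64, 56, 56)
print(ResBlock(64)(x).shape)              # torch.Size([4, 64, 56, 56])
```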
prisingly, in the first section, there is no down-
scaling, only an increase of the number of chan-
nels by a factor of 4. The output of the last resid-
ual block is 2048×7×7, which is converted to a
vector of dimension 2048 by an average pooling
of kernel size 7×7, and then processed through
a fully-connected layer to get the final logits,
here for 1000 classes.
5.3 Attention models
As stated in § 4.8, many applications, in particular from natural language processing, greatly benefit from models that include attention mechanisms. The architecture of choice for such tasks, which has been instrumental in recent advances in deep learning, is the Transformer proposed by Vaswani et al. [2017].
Transformer
The original Transformer, pictured in Figure 5.7,
was designed for sequence-to-sequence trans-
lation. It combines an encoder that processes
the input sequence to get a refined representa-
tion, and an autoregressive decoder that gener-
ates each token of the result sequence, given the
encoder’s representation of the input sequence
and the output tokens generated so far. As the
residual convolutional networks of § 5.2, both
the encoder and the decoder of the Transformer
are sequences of compounded blocks built with
residual connections.
The self-attention block, pictured on the left of Figure 5.6, combines a Multi-Head Attention layer, see § 4.8, that recombines information globally, allowing any position to collect information from any other position, with a one-hidden-layer MLP that processes each position separately.
Figure 5.6: Self-attention block (left) and cross-attention block (right). These specific structures proposed by Radford et al. [2018] differ slightly from the original architecture of Vaswani et al. [2017], in particular by having the layer normalization first in the residual blocks.
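As a sketch, the pre-norm self-attention block of Figure 5.6 (left) could look like the following; the hidden dimension, the dropout rate, and the use of nn.MultiheadAttention as a stand-in for the Multi-Head Attention layer of § 4.8 are assumptions:

```python
import torch
from torch import nn

class SelfAttentionBlock(nn.Module):
    def __init__(self, D, H, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(D)
        self.mha = nn.MultiheadAttention(D, H, batch_first=True)
        self.ln2 = nn.LayerNorm(D)
        self.mlp = nn.Sequential(
            nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D), nn.Dropout(dropout)
        )

    def forward(self, x):                       # x: (batch, sequence, D)
        h = self.ln1(x)                         # layer normalization first (pre-norm)
        x = x + self.mha(h, h, h)[0]            # residual connection around attention
        x = x + self.mlp(self.ln2(x))           # residual connection around the MLP
        return x

y = SelfAttentionBlock(D=64, H=4)(torch.randn(2, 10, 64))
print(y.shape)                                  # torch.Size([2, 10, 64])
```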
The cross-attention block, pictured on the right of Figure 5.6, is similar except that it takes as
[Figure 5.7: the original encoder-decoder Transformer. The encoder embeds the input tokens X_1,...,X_T, adds a positional encoding, and applies N self-attention blocks to produce Z_1,...,Z_T. The decoder embeds the shifted result tokens 0, Y_1,...,Y_{S−1}, adds a positional encoding, and applies N blocks combining causal self-attention and cross-attention with the encoder's output; a final fully-connected layer produces the logits P̂(Y_1),...,P̂(Y_S | Y_{s<S}).]
input two sequences, one to compute the queries
and one to compute the keys and values.
[Diagram of a decoder-only (GPT-like) model: the shifted tokens 0, X_1,...,X_{T−1} are embedded, a positional encoding is added, N causal self-attention blocks are applied, and a final fully-connected layer produces the logits P̂(X_1),...,P̂(X_T | X_{t<T}).]
[Diagram of the Vision Transformer (Figure 5.9): the M image patches X_1,...,X_M are mapped by a linear embedding W^E, a learnt token E_0 is prepended, a positional encoding is added, N self-attention blocks are applied, and an MLP readout computes the C class logits P̂(Y) from the first output token Z_0.]
Vision Transformer
Transformers have been put to use for image classification with the Vision Transformer (ViT) model [Dosovitskiy et al., 2020], see Figure 5.9.
Part III
Applications
Chapter 6
Prediction
6.1 Image denoising
A direct application of deep models to image
processing is to recover from degradation by
utilizing the redundancy in the statistical struc-
ture of images. The petals of a sunflower on a
grayscale picture can be colored with high confi-
dence, and the texture of a geometric shape such
as a table on a low-light grainy picture can be
corrected by averaging it over a large area likely
to be uniform.
A denoising autoencoder is a model that takes as input a degraded signal X̃ and computes an estimate of the original one X.
6.2 Image classification
Image classification is the simplest strategy for extracting semantics from an image and consists of predicting a class from a finite, predefined number of classes, given an input image.
6.3 Object detection
A more complex task for image understanding is object detection, in which the objective is, given an input image, to predict the classes and positions of objects of interest.
Figure 6.2: Examples of object detection with the Single-
Shot Detector [Liu et al., 2015].
it. This results in a non-ambiguous matching of any bounding box (x_1, x_2, y_1, y_2) to an (s, h, w), determined respectively by max(x_2 − x_1, y_2 − y_1), (y_1 + y_2)/2, and (x_1 + x_2)/2.
metric quantities.
6.4 Semantic segmentation
The finest-grain prediction task for image understanding is semantic segmentation, which consists of predicting, for every pixel, the class of the object to which it belongs. This can be achieved with a standard convolutional neural network that outputs a convolutional map with as many channels as classes, carrying the estimated logits for every pixel.
Figure 6.3: Semantic segmentation results with the
Pyramid Scene Parsing Network [Zhao et al., 2016].
backbone, concatenate the resulting multi-scale
representation after upscaling, before making
the final per-pixel prediction [Zhao et al., 2016].
6.5 Speech recognition
Speech recognition consists of converting a sound sample into a sequence of words. There have been plenty of approaches to this problem historically, but a conceptually simple and recent one proposed by Radford et al. [2022] consists of casting it as a sequence-to-sequence translation and then solving it with a standard attention-based Transformer, as described in § 5.3.
6.6 Text-image representations
A powerful approach to image understanding
consists of learning consistent image and text
representations.
resulting in an N ×N matrix of similarity score
Figure 6.4: The CLIP text-image embedding [Radford et al., 2021] allows zero-shot prediction by predicting which class description embedding is the most consistent with the image embedding.
Chapter 7
Synthesis
7.1 Text generation
The standard approach to text synthesis is to use an attention-based, autoregressive model. The most successful in this domain is the GPT [Radford et al., 2018], which we described in § 5.3.
7.2 Image generation
Multiple deep methods have been developed to model and sample from a high-dimensional density. A powerful approach for image synthesis relies on inverting a diffusion process.
noise and re-normalizing the variance to 1. This
process exponentially reduces the importance of
x0 , and xt ’s density can rapidly be approximated
with a normal.
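As an illustration of this forward process, a generic sketch of one diffusion step with an arbitrary noise schedule (the notation follows the usual formulation, not necessarily the book's):

```python
import torch

def diffusion_step(x_prev, beta):
    # Mix the signal with a bit of Gaussian noise and re-normalize
    # so that the variance stays 1: x_t = sqrt(1-beta) x_{t-1} + sqrt(beta) eps.
    eps = torch.randn_like(x_prev)
    return (1 - beta) ** 0.5 * x_prev + beta ** 0.5 * eps

x = torch.randn(3, 64, 64)      # x_0, here a random stand-in for an image
for t in range(1000):
    x = diffusion_step(x, beta=0.02)
# After many steps, x is close to a sample from a standard normal.
```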
The missing bits
Autoencoder
An autoencoder is a model that maps an input signal, possibly of high dimension, to a low-dimension latent representation, and then maps it back to the original signal, ensuring that information has been preserved. We saw it in § 6.1 for denoising, but it can also be used to automatically discover a meaningful low-dimension parameterization of the data manifold.
The Variational Autoencoder (VAE) proposed by Kingma and Welling [2013] is a generative model with a similar structure. It imposes, through the loss, a pre-defined distribution to the latent representation, so that, after training, it allows for the generation of new samples by sampling the latent representation according to this imposed distribution and then mapping back through the decoder.
put following a fixed distribution as input and produces a structured signal such as an image, and a discriminator, which takes as input a sample and predicts whether it comes from the training set or if it was generated by the generator.
Reinforcement Learning
Many problems require a model to estimate an accumulated long-term reward given action choices and an observable state, and what actions to choose to maximize that reward. Reinforcement Learning (RL) is the standard framework to formalize such problems, and strategy games or robotic control, for instance, can be formulated within it. Deep models, particularly convolutional neural networks, have demonstrated excellent performance for this class of tasks [Mnih et al., 2015].
Fine-tuning
As we saw in § 6.3 for object detection, or in § 6.4 for semantic segmentation, fine-tuning deep architectures is an efficient strategy to deal with small training sets. Furthermore, due to the dramatic increase in the size of architectures, particularly that of Large Language Models, training a single model can cost several millions of dollars, and fine-tuning is a crucial, and often the only, way to achieve high performance on a specific task.
ilar to a standard convolution, except that the
data structure does not reflect any geometrical
information associated with the feature vectors
they carry.
Self-supervised training
As stated in § 7.1, even though they are trained only to predict the next word, Large Language Models trained on large unlabeled data sets such as GPT (see § 5.3) are able to solve various tasks such as identifying the grammatical role of a word, answering questions, or even translating from one language to another [Radford et al., 2019].
uncorrelated [Zbontar et al., 2021].
Afterword
Bibliography
J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805, 2018. [pdf]. 48, 106
A. Gomez, M. Ren, R. Urtasun, and R. Grosse.
The Reversible Residual Network: Backprop-
agation Without Storing Activations. CoRR,
abs/1707.04585, 2017. [pdf]. 41
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza,
et al. Generative Adversarial Networks. CoRR,
abs/1406.2661, 2014. [pdf]. 133
K. He, X. Zhang, S. Ren, and J. Sun. Deep Resid-
ual Learning for Image Recognition. CoRR,
abs/1512.03385, 2015. [pdf]. 46, 77, 78, 95, 97
D. Hendrycks and K. Gimpel. Gaussian Error
Linear Units (GELUs). CoRR, abs/1606.08415,
2016. [pdf]. 66
D. Hendrycks, K. Zhao, S. Basart, et al. Natural
Adversarial Examples. CoRR, abs/1907.07174,
2019. [pdf]. 123
J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion
Probabilistic Models. CoRR, abs/2006.11239,
2020. [pdf]. 128, 129, 130
S. Hochreiter and J. Schmidhuber. Long Short-
Term Memory. Neural Computation, 9(8):1735–
1780, 1997. [pdf]. 132
S. Ioffe and C. Szegedy. Batch Normalization: Ac-
celerating Deep Network Training by Reduc-
ing Internal Covariate Shift. In International
Conference on Machine Learning (ICML), 2015.
[pdf]. 73
W. Liu, D. Anguelov, D. Erhan, et al. SSD: Single
Shot MultiBox Detector. CoRR, abs/1512.02325,
2015. [pdf]. 112, 114
A. Radford, J. Kim, T. Xu, et al. Robust Speech
Recognition via Large-Scale Weak Supervi-
sion. CoRR, abs/2212.04356, 2022. [pdf]. 120
J. Sevilla, L. Heim, A. Ho, et al. Compute Trends
Across Three Eras of Machine Learning. CoRR,
abs/2202.05924, 2022. [pdf]. 9, 46, 48
ropean Conference on Computer Vision (ECCV),
2014. [pdf]. 62
Index
1d convolution, 60
2d convolution, 60
activation, 23, 39
activation function, 64, 91
activation map, 61
Adam, 36
artificial neural network, 8, 11
attention layer, 80
attention operator, 81
autoencoder, 133
Autograd, 40
autoregressive model, 31, 126
average pooling, 67
backpropagation, 40
backward pass, 40
basis function regression, 14
batch, 21, 36
batch normalization, 73
bias vector, 55, 60
BPE, 32, 120, 126
Byte Pair Encoding, 32
cache memory, 21
capacity, 16
causal, 32, 83
causal model, 31, 84, 103
channel, 23
checkpointing, 41
classification, 18
CLIP, 122
CLS token, 106
computational cost, 41
contrastive loss, 28, 122
convnet, 93
convolutional layer, 58, 67, 80, 88, 93, 96, 112,
117, 120
convolutional network, 93
cross-attention block, 86, 101
cross-entropy, 27
filter, 60
fine tuning, 135
flops, 22
forward pass, 39
foundation models, 127
FP32, 22
framework, 23
fully-connected layer, 55, 80, 88, 91, 93
GAN, 133
GELU, 66
Generative Adversarial Networks, 133
generator, 133
GNN, 135
GPT, 104, 122, 126, 136
GPU, 8, 20
gradient descent, 33, 35, 38
gradient step, 33
Graph Neural Network, 135
Graphical Processing Unit, 8, 20
ground truth, 18
hidden layer, 91
hidden state, 132
image classification, 111
image processing, 93
image synthesis, 80, 128
inductive bias, 17, 44, 58
max pooling, 67
mean squared error, 14, 27
memory requirement, 41
memory speed, 21
meta-parameter, 13, 43
metric learning, 28
MLP, 91, 101
model, 12
Multi-Head Attention, 84, 100
multi-layer perceptron, 91
padding, 60, 67
parameter, 12
parametric model, 12
peak performance, 22
pooling, 67
positional encoding, 88, 103
posterior probability, 27
pre-trained model, 115, 119
query, 81
random initialization, 56
receptive field, 61, 62, 112
rectified linear unit, 64, 132
recurrent neural network, 132
regression, 18
reinforcement learning, 134
ReLU, 64
residual block, 96
residual connection, 77, 95
residual network, 77, 95
ResNet, 77, 95
ResNet-50, 95
reversible layer, 41
RL, 134
RNN, 132
scaling laws, 46
self-attention block, 86, 100, 101
self-supervised learning, 136
semantic segmentation, 117
SGD, 36
Single Shot Detector, 112
skip connection, 77, 118, 132
softargmax, 27, 82
softmax, 27
speech recognition, 120
SSD, 112
stochastic gradient descent, 36, 46
stride, 61, 67
supervised learning, 19
tanh, 65
tensor, 23
tensor cores, 21
Tensor Processing Units, 21
test set, 43
text synthesis, 126
tokenizer, 32, 120, 126
tokens, 30
TPU, 21
trainable parameter, 12, 23, 46
training, 12
training set, 12, 26, 43
Transformer, 77, 81, 88, 100, 102, 120
transposed convolution, 63, 117
under-fitting, 16
universal approximation theorem, 91
unsupervised learning, 19
VAE, 133
validation set, 43
value, 81
vanishing gradient, 42, 52
variational autoencoder, 133
variational bound, 130
Vision Transformer, 106
ViT, 106, 122
vocabulary, 30
weight, 13
weight decay, 29
weight matrix, 55