6.390 Lecture Notes, Spring 2024
1 Introduction
  1.1 Problem class
    1.1.1 Supervised learning
      1.1.1.1 Regression
      1.1.1.2 Classification
    1.1.2 Unsupervised learning
      1.1.2.1 Clustering
      1.1.2.2 Density estimation
      1.1.2.3 Dimensionality reduction
    1.1.3 Sequence learning
    1.1.4 Reinforcement learning
    1.1.5 Other settings
  1.2 Assumptions
  1.3 Evaluation criteria
  1.4 Model type
    1.4.1 Non-parametric models
    1.4.2 Parametric models
  1.5 Model class and parameter fitting
  1.6 Algorithm
2 Regression
  2.1 Problem formulation
  2.2 Regression as an optimization problem
  2.3 Linear regression
  2.4 A gloriously simple linear regression algorithm
  2.5 Analytical solution: ordinary least squares
  2.6 Regularization
    2.6.1 Regularization and linear regression
    2.6.2 Ridge regression
  2.7 Evaluating learning algorithms
    2.7.1 Evaluating hypotheses
    2.7.2 Evaluating learning algorithms
      2.7.2.1 Validation
      2.7.2.2 Cross validation
      2.7.2.3 Hyperparameter tuning
3 Gradient Descent
  3.1 Gradient descent in one dimension
  3.2 Multiple dimensions
  3.3 Application to regression
    3.3.1 Ridge regression
  3.4 Stochastic gradient descent
4 Classification
  4.1 Classification
  4.2 Linear classifiers
    4.2.1 Linear classifiers: definition
  4.3 Linear logistic classifiers
    4.3.1 Linear logistic classifiers: definition
    4.3.2 Learning linear logistic classifiers
  4.4 Gradient descent for logistic regression
    4.4.1 Convexity of the NLL Loss Function
  4.5 Handling multiple classes
  4.6 Prediction accuracy and validation
5 Feature representation
  5.1 Gaining intuition about feature transformations
  5.2 Systematic feature construction
    5.2.1 Polynomial basis
    5.2.2 Radial basis functions
  5.3 Hand-constructing features for real domains
    5.3.1 Discrete features
    5.3.2 Text
    5.3.3 Numeric values
6 Neural Networks
  6.1 Basic element
  6.2 Networks
    6.2.1 Single layer
    6.2.2 Many layers
  6.3 Choices of activation function
  6.4 Loss functions and activation functions
  6.5 Error back-propagation
    6.5.1 First, suppose everything is one-dimensional
    6.5.2 The general case
    6.5.3 Derivations for the general case
    6.5.4 Reflecting on backpropagation
  6.6 Training
  6.7 Optimizing neural network parameters
    6.7.1 Batches
    6.7.2 Adaptive step-size
  6.8 Regularization
    6.8.1 Methods related to ridge regression
    6.8.2 Dropout
    6.8.3 Batch normalization
8 Transformers
  8.1 Vector embeddings and tokens
  8.2 Query, key, value, and attention
    8.2.1 Self Attention
  8.3 Transformers
    8.3.1 Learned embedding
    8.3.2 Variations and training
9 Non-parametric methods
  9.1 Nearest Neighbor
  9.2 Tree Models
    9.2.1 Regression
      9.2.1.1 Building a tree
      9.2.1.2 Pruning
    9.2.2 Classification
    9.2.3 Bagging
    9.2.4 Random Forests
    9.2.5 Tree variants and tradeoffs
11 Reinforcement learning
  11.1 Reinforcement learning algorithms overview
  11.2 Model-free methods
    11.2.1 Q-learning
    11.2.2 Function approximation: Deep Q learning
    11.2.3 Fitted Q-learning
    11.2.4 Policy gradient
  11.3 Model-based RL
  11.4 Bandit problems
Index
Introduction
The main focus of machine learning (ML) is making decisions or predictions based on data.
There are a number of other fields with significant overlap in technique, but difference in focus: in economics and psychology, the goal is to discover underlying causal processes, and in statistics it is to find a model that fits a data set well. (This description is paraphrased from a post on 9/4/12 at andrewgelman.com.) In those fields, the end product is a model. In machine learning, we often fit models, but as a means to the end of making good predictions or decisions.
As ML methods have improved in their capability and scope, ML has become arguably the best way (measured in terms of speed, human engineering time, and robustness) to approach many applications. Great examples are face detection, speech recognition, and many kinds of language-processing tasks. Almost any application that involves understanding data or signals that come from the real world can be nicely addressed using machine learning.
One crucial (and often undervalued) aspect of machine learning approaches to solving problems is that human engineering plays an important role. A human still has to frame the problem: acquire and organize data, design a space of possible solutions, select a learning algorithm and its parameters, apply the algorithm to the data, validate the resulting solution to decide whether it's good enough to use, try to understand the impact on the people who will be affected by its deployment, etc. These steps are of great importance.
The conceptual basis of learning from data is the problem of induction: Why do we think that previously seen data will help us predict the future? This is a serious long-standing philosophical problem. We will operationalize it by making assumptions, such as that all training data are so-called i.i.d. (independent and identically distributed), and that queries will be drawn from the same distribution as the training data, or that the answer comes from a set of possible answers known in advance. (Being i.i.d. means that the elements in the set are related in the sense that they all come from the same underlying probability distribution, but not in any other ways.)

In general, we need to solve these two problems:

• estimation: When we have data that are noisy reflections of some underlying quantity of interest, we have to aggregate the data and make estimates or predictions about the quantity. How do we deal with the fact that, for example, the same treatment may end up with different results on different trials? How can we predict how well an estimate may compare to future results?

• generalization: How can we predict results of a situation or experiment that we have never encountered before in our data set?
We can describe problems and their solutions using six characteristics, three of which
characterize the problem and three of which characterize the solution:
1. Problem class: What is the nature of the training data and what kinds of queries will
be made at testing time?
2. Assumptions: What do we know about the source of the data or the form of the
solution?
3. Evaluation criteria: What is the goal of the prediction or estimation system? How will the answers to individual queries be evaluated? How will the overall performance of the system be measured?
4. Model type: Will an intermediate model of the world be made? What aspects of the
data will be modeled in different variables/parameters? How will the model be used
to make predictions?
5. Model class: What particular class of models will be used? What criterion will we
use to pick a particular model from the model class?
6. Algorithm: What computational process will be used to fit the model to the data
and/or to make predictions?
Without making some assumptions about the nature of the process generating the data, we cannot perform generalization. In the following sections, we elaborate on these ideas. (Don't feel you have to memorize all these kinds of learning, etc. We just want you to have a very high-level view of part of the breadth of the field.)

1.1 Problem class

There are many different problem classes in machine learning. They vary according to what kind of data is provided and what kind of conclusions are to be drawn from it. Five standard problem classes are described below, to establish some notation and terminology.

In this course, we will focus on classification and regression (two examples of supervised learning), and we will touch on reinforcement learning, sequence learning, and clustering.
1.1.1.1 Regression

For a regression problem, the training data Dn is in the form of a set of n pairs:

$$\mathcal{D}_n = \left\{\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)\right\},$$

where x(i) represents an input, most typically a d-dimensional vector of real and/or discrete values, and y(i) is the output to be predicted, in this case a real number. The y values are sometimes called target values. (Many textbooks use xi and ti instead of x(i) and y(i). We find that notation somewhat difficult to manage when x(i) is itself a vector and we need to talk about its elements. The notation we are using is standard in some other parts of the ML literature.)

The goal in a regression problem is ultimately, given a new input value x(n+1), to predict the value of y(n+1). Regression problems are a kind of supervised learning, because the desired output y(i) is specified for each of the training examples x(i).
1.1.1.2 Classification
A classification problem is like regression, except that the values that y(i) can take do not
have an order. The classification problem is binary or two-class if y(i) (also known as the
class) is drawn from a set of two possible values; otherwise, it is called multi-class.
1.1.2.1 Clustering
Given samples x(1) , . . . , x(n) ∈ Rd , the goal is to find a partitioning (or “clustering”) of
the samples that groups together similar samples. There are many different objectives,
depending on the definition of the similarity between samples and exactly what criterion
is to be used (e.g., minimize the average distance between elements inside a cluster and
maximize the average distance between elements across clusters). Other methods perform
a “soft” clustering, in which samples may be assigned 0.9 membership in one cluster and
0.1 in another. Clustering is sometimes used as a step in the so-called density estimation
(described below), and sometimes to find useful structure or influential features in data.
(Note that it's standard practice in reinforcement learning to use s and a instead of x and y to denote the machine learning model's input and output. The subscript t denotes the timestep, and captures the sequential nature of the problem.)

• The agent observes the current state st .

• It selects an action at .

• It receives a reward, rt , which typically depends on st and possibly at .

• The environment transitions probabilistically to a new state, st+1 , with a distribution that depends only on st and at .

• The agent observes the current state, st+1 .

• ...

The goal is to find a policy π, mapping s to a (that is, states to actions), such that some long-term sum or average of rewards r is maximized.

This setting is very different from either supervised learning or unsupervised learning, because the agent's action choices affect both its reward and its ability to observe the environment. It requires careful consideration of the long-term effects of actions, as well as all of the other issues that pertain to supervised learning.
1.2 Assumptions
The kinds of assumptions that we can make about the data source or the solution include:
• The data are generated by a Markov chain (i.e., outputs depend only on the current state, with no additional memory).
• The “true” model that is generating the data can be perfectly described by one of
some particular set of hypotheses.
The effect of an assumption is often to reduce the “size” or “expressiveness” of the space of
possible hypotheses and therefore reduce the amount of data required to reliably identify
an appropriate hypothesis.
• 0-1 Loss applies to predictions drawn from finite domains. (If the actual values are drawn from a continuous distribution, the probability they would ever be equal to some predicted g is 0, except for some weird cases.)

$$L(g, a) = \begin{cases} 0 & \text{if } g = a \\ 1 & \text{otherwise} \end{cases}$$

• Squared loss

$$L(g, a) = (g - a)^2$$

• Absolute loss

$$L(g, a) = |g - a|$$

• Asymmetric loss: Consider a situation in which you are trying to predict whether someone is having a heart attack. It might be much worse to predict “no” when the answer is really “yes”, than the other way around.

$$L(g, a) = \begin{cases} 1 & \text{if } g = 1 \text{ and } a = 0 \\ 10 & \text{if } g = 0 \text{ and } a = 1 \\ 0 & \text{otherwise} \end{cases}$$
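As a concrete illustration of these loss functions (this code is an added sketch, not part of the original notes), in Python:

```python
def zero_one_loss(g, a):
    # 1 when the guess differs from the actual value, 0 when they match
    return 0.0 if g == a else 1.0

def squared_loss(g, a):
    return (g - a) ** 2

def absolute_loss(g, a):
    return abs(g - a)

def asymmetric_loss(g, a):
    # The heart-attack example above: predicting "no" (g = 0) when the
    # answer is "yes" (a = 1) costs ten times as much as the reverse error.
    if g == 1 and a == 0:
        return 1.0
    if g == 0 and a == 1:
        return 10.0
    return 0.0
```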
Any given prediction rule will usually be evaluated based on multiple predictions and
the loss of each one. At this level, we might be interested in:
• Minimizing expected loss over all the predictions (also known as risk)
• Minimizing or bounding regret: how much worse this predictor performs than the
best one drawn from some class
• Characterizing asymptotic behavior: how well the predictor will perform in the limit
of infinite training data
• Finding algorithms that are probably approximately correct: they probably generate
a hypothesis that is right most of the time.
There is a theory of rational agency that argues that you should always select the action that minimizes the expected loss. This strategy will, for example, make you the most money in the long run, in a gambling setting. (Of course, there are other models for action selection, and it's clear that people do not always, or maybe even often, select actions that follow this rule.) As mentioned above, expected loss is also sometimes called risk in the ML literature, but that term means other things in economics or other parts of decision theory, so be careful... it's risky to use it. We will, most of the time, concentrate on this criterion.
1.4 Model type
Recall that the goal of an ML system is typically to estimate or generalize, based on the data provided. Below, we examine the role of model-making in machine learning.
1.6 Algorithm
Once we have described a class of models and a way of scoring a model given data, we
have an algorithmic problem: what sequence of computational instructions should we run
in order to find a good model from our class? For example, determining the parameter
vector which minimizes the training error might be done using a familiar least-squares
minimization algorithm, when the model h is a function being fit to some data x.
Sometimes we can use software that was designed, generically, to perform optimization. In many other cases, we use algorithms that are specialized for ML problems, or for particular hypothesis classes. Some algorithms are not easily seen as trying to optimize a
particular criterion. In fact, a historically important method for finding linear classifiers,
the perceptron algorithm, has this character.
Regression

Regression is an important machine-learning problem that provides a good starting point for diving deeply into the field. (“Regression,” in common parlance, means moving backwards. But this is forward progress!)
2.1 Problem formulation
A hypothesis h is employed as a model for solving the regression problem, in that it maps
inputs x to outputs y,
x→ h →y ,
where x ∈ Rd (i.e., a length d column vector of real numbers), and y ∈ R (i.e., a real
number). Real life rarely gives us vectors of real numbers; the x we really want to take as
input is usually something like a song, image, or person. In that case, we’ll have to define
a function φ(x), whose range is Rd, where φ represents features of x, like a person's height or the amount of bass in a song, and then let h map φ(x) to R. In much of the following,
we’ll omit explicit mention of φ and assume that the x(i) are in Rd , but you should always
have in mind that some additional process was almost surely required to go from the actual
input examples to their feature representation, and we’ll talk a lot more about features later
in the course.
Regression is a supervised learning problem, in which we are given a training dataset of
the form
$$\mathcal{D}_n = \left\{\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)\right\},$$
which gives examples of input values x(i) and the output values y(i) that should be
associated with them. Because y values are real-valued, our hypotheses will have the form
h : Rd → R .
This is a good framework when we want to predict a numerical quantity, like height, stock
value, etc., rather than to divide the inputs into discrete categories.
What makes a hypothesis useful? That it works well on new data; that is, that it makes good predictions on examples it hasn't seen. But we don't know exactly what data this hypothesis might be tested on when we use it in the real world. So, we have to assume a connection between the training data and testing data; typically, the assumption is that they (the training and testing data) are drawn independently from the same probability distribution. (My favorite analogy is to problem sets. We evaluate a student's ability to generalize by putting questions on the exam that were not on the homework, i.e., the training set.)
To make this discussion more concrete, we have to provide a loss function, to say how
unhappy we are when we guess an output g given an input x for which the desired output
was a.
Given a training set Dn and a hypothesis h with parameters Θ, we can define the training error of h to be the average loss on the training data:

$$\mathcal{E}_n(h; \Theta) = \frac{1}{n} \sum_{i=1}^{n} L(h(x^{(i)}; \Theta), y^{(i)}) , \tag{2.1}$$

The training error of h gives us some idea of how well it characterizes the relationship between x and y values in our data, but it isn't the quantity that we most care about. What we most care about is test error:

$$\mathcal{E}(h) = \frac{1}{n'} \sum_{i=n+1}^{n+n'} L(h(x^{(i)}), y^{(i)})$$

on n′ new examples that were not used in the process of finding the hypothesis. (It might be worthwhile to stare at the two errors and think about what the difference is. For example, notice how Θ is no longer a variable in the test error; this is because in evaluating the test error, the parameters will have been “picked”/“fixed” already.)

For now, we will try to find a hypothesis with small training error (later, with some added criteria) and try to make some design choices so that it generalizes well to new data, meaning that it also has a small test error.

2.2 Regression as an optimization problem

Given data, a loss function, and a hypothesis class, we need a method for finding a good
framing the machine learning problem as an optimization problem. One reason for taking
this approach is that there is a rich area of math and algorithms studying and developing
efficient methods for solving optimization problems, and lots of very good software imple-
mentations of these methods. So, if we can turn our problem into one of these problems,
then there will be a lot of work already done for us!
We begin by writing down an objective function J(Θ), where Θ stands for all the parameters in our model (i.e., all possible choices over parameters). We often write J(Θ; D) to make clear the dependence on the data D. (Don't be too perturbed by the semicolon where you expected to see a comma! It's a math way of saying that we are mostly interested in this as a function of the arguments before the “;”, but we should remember that there's a dependence on the stuff after it, as well.)

The objective function describes how we feel about possible hypotheses Θ: we will generally look for values for parameters Θ that minimize the objective function:

$$\Theta^* = \arg\min_{\Theta} J(\Theta) .$$

(You can think about Θ* here as “the theta that minimizes J”: arg min_x f(x) means the value of x for which f(x) is the smallest. Sometimes we write arg min_{x∈X} f(x) when we want to explicitly specify the set X of values of x over which we want to minimize.)

A very common form for a machine-learning objective is

$$J(\Theta) = \underbrace{\frac{1}{n} \sum_{i=1}^{n} L(h(x^{(i)}; \Theta), y^{(i)})}_{\text{loss}} \; + \; \underbrace{\lambda}_{\text{non-negative constant}} R(\Theta) . \tag{2.2}$$

The loss tells us how unhappy we are about the prediction h(x(i); Θ) that Θ makes for (x(i), y(i)). Minimizing this loss makes the prediction better. The regularizer R(Θ) is an additional term that encourages the prediction to remain general, and the constant λ adjusts the balance between reproducing seen examples, and being able to generalize to unseen examples. We will return to discuss this balance, and more about the idea of regularization, in Section 2.6.
h(x; θ, θ0 ) = θT x + θ0 , (2.3)
with model parameters Θ = (θ, θ0 ). In one dimension (d = 1) this has the same familiar
slope-intercept form as y = mx+b; in higher dimensions, this model describes the so-called
hyperplanes.
We define a loss function to describe how to evaluate the quality of the predictions our
hypothesis is making, when compared to the “target” y values in the data set. The choice
of loss function is part of modeling your domain. In the absence of additional information
about a regression problem, we typically use squared loss:
L(g, a) = (g − a)2 .
where g = h(x) is our "guess" from the hypothesis, and a is the "actual" observation (in
other words, here a is being used equivalently as y). With this choice of squared loss, the
average loss as generally defined in 2.1 will become the so-called mean squared error (MSE),
which we’ll study closely very soon.
The squared loss penalizes guesses that are too high the same amount as it penalizes guesses that are too low, and has a good mathematical justification in the case that your data are generated from an underlying linear hypothesis with the so-called Gaussian-distributed noise added to the y values. (We won't get into the details of the Gaussian distribution in our class; but it's one of the most important distributions and well worth studying closely at some point. One obvious fact about the Gaussian is that it's symmetric; this is in fact one of the reasons squared loss works well under Gaussian settings, as the loss is also symmetric.) But there are applications in which other losses would be better, and much of the framework we discuss can be applied to different loss functions, although this one has a form that also makes it particularly computationally convenient.

Our objective in linear regression will be to find a hyperplane that goes as close as possible, on average, to all of our training data.

Applying the general optimization framework to the linear regression hypothesis class of Eq. 2.3 with squared loss and no regularization, our objective is to find values for Θ = (θ, θ0) that minimize the MSE:

$$J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right)^2 , \tag{2.4}$$
For one-dimensional data (d = 1), this becomes the familiar problem of fitting a line
to data. For d > 1, this hypothesis may be visualized as a d-dimensional hyperplane
embedded in a (d + 1)-dimensional space (that consists of the input dimension and the y
dimension). For example, in the left plot below, we can see data points with labels y and
input dimensions x1 and x2 . In the right plot below, we see the result of fitting these points
with a two-dimensional plane that resides in three dimensions. We interpret the plane as
representing a function that provides a y value for any input (x1 , x2 ).
RANDOM-REGRESSION(D, k)
1  For i in 1 . . . k: Randomly generate hypothesis θ^(i), θ0^(i)
2  Let i = arg min_i J(θ^(i), θ0^(i); D)
3  Return θ^(i), θ0^(i)
This seems kind of silly, but it’s a learning algorithm, and it’s not completely useless.
Study Question: If your data set has n data points, and the dimension of the x val-
ues is d, what is the size of an individual θ(i) ?
Study Question: How do you think increasing the number of guesses k will change
the training error of the resulting hypothesis?
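For concreteness, here is a minimal Python sketch of this algorithm (an added illustration, not from the notes), assuming squared loss and hypotheses drawn from a standard normal distribution:

```python
import numpy as np

def random_regression(X, y, k, seed=0):
    """Return the best of k randomly generated linear hypotheses (theta, theta_0),
    scored by mean squared error on the data. X is n x d; y has length n."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best, best_mse = None, np.inf
    for _ in range(k):
        theta = rng.normal(size=d)      # randomly generated hypothesis
        theta_0 = rng.normal()
        mse = np.mean((X @ theta + theta_0 - y) ** 2)
        if mse < best_mse:
            best, best_mse = (theta, theta_0), mse
    return best
```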
$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left(\theta^T x^{(i)} - y^{(i)}\right)^2 . \tag{2.6}$$
Study Question: Stop and prove to yourself that adding that extra feature with
value 1 to every input vector and getting rid of the θ0 parameter is equivalent to our
original model.
We approach this just like a minimization problem from calculus homework: take the
derivative of J with respect to θ, set it to zero, and solve for θ. There are additional steps
required, to check that the resulting θ is a minimum (rather than a maximum or an inflec-
tion point) but we won’t work through that here. It is possible to approach this problem
by:
That works just fine. To get practice for applying techniques like this to more complex
problems, we will work through a more compact (and cool!) matrix view. Along the way,
it will be helpful to collect all of the derivatives in one vector. In particular, the gradient of J with respect to θ is the following column vector of length d:

$$\nabla_\theta J = \begin{bmatrix} \partial J / \partial \theta_1 \\ \vdots \\ \partial J / \partial \theta_d \end{bmatrix} .$$
Study Question: Work through the next steps and check your answer against ours
below.
We can think of our training data in terms of matrices X and Y, where each column of X is an example, and each “column” of Y is the corresponding target output value:

$$X = \begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(n)} \\ \vdots & \ddots & \vdots \\ x_d^{(1)} & \cdots & x_d^{(n)} \end{bmatrix} \qquad Y = \begin{bmatrix} y^{(1)} & \cdots & y^{(n)} \end{bmatrix}$$

and using facts about matrix/vector calculus, we get

$$\nabla_\theta J = \frac{2}{n} \underbrace{\tilde{X}^T}_{d \times n} \underbrace{(\tilde{X}\theta - \tilde{Y})}_{n \times 1} . \tag{2.7}$$

(You should be able to verify this by doing a simple, say 2 × 2, example by hand.) See Appendix A for a nice way to think about finding this derivative.

Setting ∇θ J to 0 and solving, we get:

$$\begin{aligned}
\frac{2}{n}\tilde{X}^T(\tilde{X}\theta - \tilde{Y}) &= 0 \\
\tilde{X}^T\tilde{X}\theta - \tilde{X}^T\tilde{Y} &= 0 \\
\tilde{X}^T\tilde{X}\theta &= \tilde{X}^T\tilde{Y} \\
\theta &= (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^T\tilde{Y}
\end{aligned}$$
So, given our data, we can directly compute the linear regression that minimizes mean
squared error. That’s pretty awesome!
2.6 Regularization
The objective function of Eq. 2.2 balances (training-data) memorization, induced by the loss
term, with generalization, induced by the regularization term. Here, we address the need for
regularization specifically for linear regression, and show how this can be realized using
one popular regularization technique called ridge regression.
data, because of underlying common causes; for example, across a population, the height
of people may depend on both age and amount of food intake in the same way. This is
especially the case when there are many feature dimensions used in the regression. Mathematically, this leads to X̃T X̃ being close to singular, such that (X̃T X̃)−1 is undefined or has huge values, resulting in unstable models (see the middle panel of the figure and note the range of the y values; the slope is huge!).
when we have some idea in advance that Θ ought to be near some value Θprior. (Learn about Bayesian methods in machine learning to see the theory behind this and cool results!)

Here, the notion of distance is quantified by squaring the l2 norm of the parameter vector: for any d-dimensional vector v ∈ Rd, the l2 norm of v is defined as

$$\|v\| = \sqrt{\sum_{i=1}^{d} |v_i|^2} .$$
R(Θ) = ∥Θ∥2 .
When this is done in the example depicted above, the regression model becomes stable,
producing the result shown in the right-hand panel in the figure. Now the slope is much
more sensible.
$$J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right)^2 + \lambda \|\theta\|^2$$
Larger λ values pressure θ values to be near zero. Note that we don’t penalize θ0 ; intu-
itively, θ0 is what “floats” the regression surface to the right level for the data you have,
and so you shouldn’t make it harder to fit a data set where the y values tend to be around
one million than one where they tend to be around one. The other parameters control the
orientation of the regression surface, and we prefer it to have a not-too-crazy orientation.
There is an analytical expression for the θ, θ0 values that minimize Jridge , but it’s a little
bit more complicated to derive than the solution for OLS because θ0 needs special treatment.
If we decide not to treat θ0 specially (so we add a 1 feature to our input vectors as discussed
above), then we get:
$$\nabla_\theta J_{\text{ridge}} = \frac{2}{n}\tilde{X}^T(\tilde{X}\theta - \tilde{Y}) + 2\lambda\theta .$$

Setting to 0 and solving (remember that I stands for the identity matrix, a square matrix that has 1's along the diagonal and 0's everywhere else), we get:

$$\begin{aligned}
\frac{2}{n}\tilde{X}^T(\tilde{X}\theta - \tilde{Y}) + 2\lambda\theta &= 0 \\
\frac{1}{n}\tilde{X}^T\tilde{X}\theta - \frac{1}{n}\tilde{X}^T\tilde{Y} + \lambda\theta &= 0 \\
\frac{1}{n}\tilde{X}^T\tilde{X}\theta + \lambda\theta &= \frac{1}{n}\tilde{X}^T\tilde{Y} \\
\tilde{X}^T\tilde{X}\theta + n\lambda\theta &= \tilde{X}^T\tilde{Y} \\
(\tilde{X}^T\tilde{X} + n\lambda I)\theta &= \tilde{X}^T\tilde{Y} \\
\theta &= (\tilde{X}^T\tilde{X} + n\lambda I)^{-1}\tilde{X}^T\tilde{Y}
\end{aligned}$$
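A minimal numpy sketch of this ridge solution (added for illustration, not from the notes), using the same all-ones feature trick so that θ0 is not treated specially:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression via theta = (X~^T X~ + n*lam*I)^{-1} X~^T Y~.

    X is n x d (one example per row); y has length n; lam is the
    regularization constant lambda. Returns (theta, theta_0)."""
    n = X.shape[0]
    X_tilde = np.hstack([X, np.ones((n, 1))])
    A = X_tilde.T @ X_tilde + n * lam * np.eye(X_tilde.shape[1])
    w = np.linalg.solve(A, X_tilde.T @ y)
    return w[:-1], w[-1]
```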
$$\mathcal{E}(h) = \frac{1}{n'} \sum_{i=n+1}^{n+n'} \left[ h(x^{(i)}) - y^{(i)} \right]^2$$
A learning algorithm is a procedure that takes a data set Dn as input and returns a hypothesis h from a hypothesis class H; schematically, it looks like Dn → learning algorithm → h.
Keep in mind that h has parameters. The learning algorithm itself may have its own pa-
rameters, and such parameters are often called hyperparameters. The analytical solutions
presented above for linear regression, e.g., Eq. 2.8, may be thought of as learning algo-
rithms, where λ is a hyperparameter that governs how the learning algorithm works and
can strongly affect its performance.
How should we evaluate the performance of a learning algorithm? This can be tricky.
There are many potential sources of variability in the possible result of computing test error
on a learned hypothesis h:
• Which particular training examples occurred in Dn
• Which particular testing examples occurred in Dn ′
• Randomization inside the learning algorithm itself
2.7.2.1 Validation
Generally, to evaluate how well a learning algorithm works, given an unlimited data source,
we would like to execute the following process multiple times:
• Evaluate resulting h on a validation set that does not overlap the training set (but is
still a subset of our same big data source)
Running the algorithm multiple times controls for possible poor choices of training set
or unfortunate randomization inside the algorithm itself.
CROSS-VALIDATE(D, k)
1  divide D into k chunks D1, D2, . . . , Dk (of roughly equal size)
2  for i = 1 to k
3    train hi on D \ Di (withholding chunk Di as the validation set)
4    compute “test” error Ei(hi) on withheld data Di
5  return (1/k) Σ_{i=1}^{k} Ei(hi)
It’s very important to understand that (cross-)validation neither delivers nor evaluates a
single particular hypothesis h. It evaluates the learning algorithm that produces hypotheses.
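A minimal Python sketch of this procedure (an added illustration; the names train_fn and error_fn are placeholders, not from the notes):

```python
import numpy as np

def cross_validate(X, y, k, train_fn, error_fn, seed=0):
    """k-fold cross-validation: evaluates a learning algorithm, not one hypothesis.

    train_fn(X_train, y_train) -> hypothesis h
    error_fn(h, X_val, y_val)  -> error of h on the withheld chunk
    """
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        val = chunks[i]
        train = np.concatenate([chunks[j] for j in range(k) if j != i])
        h = train_fn(X[train], y[train])
        errors.append(error_fn(h, X[val], y[val]))
    return float(np.mean(errors))
```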
Gradient Descent
In the previous chapter, we showed how to describe an interesting objective function for
machine learning, but we need a way to find the optimal Θ∗ = arg minΘ J(Θ), particularly
when the objective function is not amenable to analytical optimization. For example, this
can be the case when J(Θ) involves a more complex loss function, or more general forms
of regularization. It can also be the case when there is simply too much data for it to be
computationally feasible to analytically invert the required matrices.
There is an enormous and fascinating literature on the mathematical and algorithmic foundations of optimization (which you should consider studying some day!), but for this class, we will consider one of the simplest methods, called gradient descent.
Intuitively, in one or two dimensions, we can easily think of J(Θ) as defining a surface
over Θ; that same idea extends to higher dimensions. Now, our objective is to find the
Θ value at the lowest point on that surface. One way to think about gradient descent is
that you start at some arbitrary point on the surface, look to see in which direction the
“hill” goes down most steeply, take a small step in that direction, determine the direction
of steepest descent from where you are, take another small step, etc.
Below, we explicitly give gradient descent algorithms for one and multidimensional
objective functions (Sections 3.1 and 3.2). We then illustrate the application of gradient
descent to a loss function which is not merely mean squared loss (Section 3.3). And we
present an important method known as stochastic gradient descent (Section 3.4), which is
especially useful when datasets are too large for descent in a single batch, and has some
important behaviors of its own.
notice though, is that even when η is constant, the actual magnitude of the change to Θ
may not be constant, as that change depends on the magnitude of the gradient itself too.
Note that this algorithm terminates when the change in the function f is sufficiently small.
There are many other reasonable ways to decide to terminate, including:
• Stop after a fixed number of iterations T , i.e., when t = T .
• Stop when the change in the value of the parameter Θ is sufficiently small, i.e., when |Θ(t) − Θ(t−1)| < ϵ.

• Stop when the derivative f′ at the latest value of Θ is sufficiently small, i.e., when |f′(Θ(t))| < ϵ.
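For concreteness, a minimal Python sketch of one-dimensional gradient descent using the derivative-based stopping criterion (an added illustration, not the pseudocode from the notes):

```python
def gd_1d(df, x_init, eta, epsilon, max_iter=10_000):
    """One-dimensional gradient descent: repeatedly step against the derivative df
    until |df(x)| < epsilon (or a fixed iteration budget runs out)."""
    x = x_init
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < epsilon:
            break
        x = x - eta * g
    return x

# The example discussed below: f(x) = (x - 2)^2, so f'(x) = 2(x - 2), starting
# at x_init = 4.0 with step size 1/2; this reaches x = 2 immediately.
x_star = gd_1d(lambda x: 2 * (x - 2), x_init=4.0, eta=0.5, epsilon=1e-8)
```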
Theorem 3.1.1. Choose any small distance ϵ̃ > 0. If we assume that f has a minimum, is sufficiently “smooth” and convex, and if the step size η is sufficiently small, gradient descent will reach a point within ϵ̃ of a global optimum point Θ. (A function is convex if the line segment between any two points on the graph of the function lies above or on the graph.)

However, we must be careful when choosing the step size to prevent slow convergence, non-converging oscillation around the minimum, or divergence.
The following plot illustrates a convex function f(x) = (x−2)2 , starting gradient descent
at xinit = 4.0 with a step-size of 1/2. It is very well-behaved!
[Plot: the convex function f(x) = (x − 2)² over x ∈ [−1, 6].]
Study Question: What happens in this example with very small η? With very big η?
If f is non-convex, where gradient descent converges to depends on xinit . First, let’s
establish some definitions. Suppose we have analytically defined derivatives for f. Then
we say that f has a local minimum point or local optimum point at x if f ′ (x) = 0 and f ′′ (x) > 0,
and we say that f(x) is a local minimum value of f. More generally, x is a local minimum
point of f if f(x) is at least as low as f(x ′ ) for all points x ′ in some small area around x. We
say that f has a global minimum point at x if f(x) is at least as low as f(x ′ ) for every other
input x ′ . And then we call f(x) a global minimum value. A global minimum point is also a
local minimum point, but a local minimum point does not have to be a global minimum
point.
If f is non-convex (and sufficiently smooth), one expects that gradient descent (run long
enough with small enough step size) will get very close to a point at which the gradient is
zero, though we cannot guarantee that it will converge to a global minimum point.
There are two notable exceptions to this common sense expectation: First, gradient
descent can get stagnated while approaching a point x which is not a local minimum or
maximum, but satisfies f ′ (x) = 0. For example, for f(x) = x3 , starting gradient descent
from the initial guess xinit = 1, while using step size η < 1/3 will lead to x(k) converging to
zero as k → ∞. Second, there are functions (even convex ones) with no minimum points,
like f(x) = exp(−x), for which gradient descent with a positive step size converges to +∞.
The plot below shows two different xinit , and how gradient descent started from each
point heads toward two different local optimum points.
[Plot: a non-convex f(x) over x ∈ [−2, 4], with gradient descent from two different xinit heading toward two different local optimum points.]
$$\nabla_\theta J = \frac{2}{n} \underbrace{\tilde{X}^T}_{d \times n} \underbrace{(\tilde{X}\theta - \tilde{Y})}_{n \times 1} , \tag{3.2}$$

to obtain an analytical solution to the linear regression problem. Gradient descent could also be applied to numerically compute a solution, using the update rule

$$\theta^{(t)} = \theta^{(t-1)} - \eta\, \frac{2}{n} \sum_{i=1}^{n} \left( \left[\theta^{(t-1)}\right]^T x^{(i)} - y^{(i)} \right) x^{(i)} . \tag{3.3}$$
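A minimal numpy sketch of this batch update (added for illustration; it omits the offset θ0 and any regularization, and arranges X with one example per row):

```python
import numpy as np

def gd_linreg(X, y, eta, num_steps):
    """Batch gradient descent for linear regression using the update rule of Eq. 3.3."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_steps):
        grad = (2 / n) * X.T @ (X @ theta - y)   # the gradient of Eq. 3.2
        theta = theta - eta * grad
    return theta
```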
$$J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right)^2 + \lambda \|\theta\|^2 .$$
Recall that in ordinary least squares, we finessed handling θ0 by adding an extra di-
mension of all 1’s. In ridge regression, we really do need to separate the parameter vector
θ from the offset θ0 , and so, from the perspective of our general-purpose gradient descent
method, our whole parameter set Θ is defined to be Θ = (θ, θ0 ). We will go ahead and find
the gradients separately for each one. (Some passing familiarity with matrix derivatives is helpful here. A foolproof way of computing them is to compute the partial derivative of J with respect to each component θi of θ. See Appendix A on matrix derivatives!)

$$\nabla_\theta J_{\text{ridge}}(\theta, \theta_0) = \frac{2}{n} \sum_{i=1}^{n} \left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right) x^{(i)} + 2\lambda\theta$$

$$\frac{\partial J_{\text{ridge}}(\theta, \theta_0)}{\partial \theta_0} = \frac{2}{n} \sum_{i=1}^{n} \left(\theta^T x^{(i)} + \theta_0 - y^{(i)}\right) .$$
Note that ∇θ Jridge will be of shape d × 1 and ∂Jridge /∂θ0 will be a scalar since we have
separated θ0 from θ here.
Study Question: Convince yourself that the dimensions of all these quantities are
correct, under the assumption that θ is d × 1. How does d relate to m as discussed
for Θ in the previous section?
Study Question: Compute ∇θ Jridge (θT x+θ0 , y) by finding the vector of partial deriva-
tives (∂Jridge (θT x + θ0 , y)/∂θ1 , . . . , ∂Jridge (θT x + θ0 , y)/∂θd ).
Study Question: Use these last two results to verify our derivation above.
Putting everything together, our gradient descent algorithm for ridge regression be-
comes
8  until |Jridge(θ^(t), θ0^(t)) − Jridge(θ^(t−1), θ0^(t−1))| < ϵ
9  return θ^(t), θ0^(t)
Study Question: Is it okay that the 2’s from the gradient definitions don’t appear in
the algorithm?
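Putting the pieces into code, here is a minimal numpy sketch of the full procedure (an added illustration, not the notes' pseudocode); θ0 is kept separate and unpenalized, and the absolute change in Jridge is the stopping criterion:

```python
import numpy as np

def gd_ridge(X, y, lam, eta, epsilon, max_iter=100_000):
    """Gradient descent for ridge regression with a separate, unpenalized theta_0."""
    n, d = X.shape
    theta, theta_0 = np.zeros(d), 0.0

    def J(th, th0):
        return np.mean((X @ th + th0 - y) ** 2) + lam * (th @ th)

    prev = J(theta, theta_0)
    for _ in range(max_iter):
        err = X @ theta + theta_0 - y
        grad_theta = (2 / n) * X.T @ err + 2 * lam * theta
        grad_theta_0 = (2 / n) * np.sum(err)
        theta = theta - eta * grad_theta
        theta_0 = theta_0 - eta * grad_theta_0
        cur = J(theta, theta_0)
        if abs(cur - prev) < epsilon:
            break
        prev = cur
    return theta, theta_0
```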
$$f(\Theta) = \sum_{i=1}^{n} f_i(\Theta) ,$$
where n is the number of data points used in the objective (and this may be different from
the number of points available in the whole data set). Here is pseudocode for applying
SGD to such an objective f; it assumes we know the form of ∇Θ fi for all i in 1 . . . n:
Note that now instead of a fixed value of η, η is indexed by the iteration of the algo-
rithm, t. Choosing a good stopping criterion can be a little trickier for SGD than traditional
gradient descent. Here we’ve just chosen to stop after a fixed number of iterations T .
For SGD to converge to a local optimum point as t increases, the step size has to decrease
as a function of time. The next result shows one step size sequence that works.
then SGD converges with probability one to the optimal Θ. (We have left out some gnarly conditions in this theorem. Also, you can learn more about the subtle difference between “with probability one” and “always” by taking an advanced probability course.)

Why these two conditions? The intuition is that the first condition, on Σ_t η(t), is needed to allow for the possibility of an unbounded potential range of exploration, while the second condition, on Σ_t η(t)², ensures that the step sizes get smaller and smaller as t increases. One “legal” way of setting the step size is to make η(t) = 1/t, but people often use rules that decrease more slowly, and so don't strictly satisfy the criteria for convergence.
Study Question: If you start a long way from the optimum, would making η(t) de-
crease more slowly tend to make you move more quickly or more slowly to the opti-
mum?
There are multiple intuitions for why SGD might be a better choice algorithmically than
regular GD (which is sometimes called batch GD (BGD)):
• BGD typically requires computing some quantity over every data point in a data set.
SGD may perform well after visiting only some of the data. This behavior can be
useful for very large data sets – in runtime and memory savings.
• If your f is actually non-convex, but has many shallow local optimum points that
might trap BGD, then taking samples from the gradient at some point Θ might “bounce”
you around the landscape and away from the local optimum points.
• Sometimes, optimizing f really well is not what we want to do, because it might
overfit the training set; so, in fact, although SGD might not get lower training error
than BGD, it might result in lower test error.
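A minimal Python sketch of SGD on an objective of this additive form (added for illustration; grad_fi is a placeholder for the per-example gradient, and η(t) = 1/t is just one legal schedule):

```python
import numpy as np

def sgd(grad_fi, theta_init, n, T, seed=0):
    """Stochastic gradient descent on f(Theta) = sum_i f_i(Theta).

    grad_fi(theta, i) returns the gradient of the i-th term; the step size
    decreases as eta(t) = 1/t."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta_init, dtype=float)
    for t in range(1, T + 1):
        i = rng.integers(n)                      # pick one data point at random
        theta = theta - (1.0 / t) * grad_fi(theta, i)
    return theta
```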
Classification
4.1 Classification
Classification is a machine learning problem seeking to map from inputs Rd to outputs in an unordered set (in contrast to a continuous real-valued output, as we saw for linear regression). Examples of classification output sets could be {apples, oranges, pears} if we're trying to figure out what type of fruit we have, or {heart attack, no heart attack} if we're working in an emergency room and trying to give the best medical care to a new patient. We focus on an essential simple case, binary classification, where we aim to find a mapping from Rd to two outputs. While we should think of the outputs as not having an order, it's often convenient to encode them as {−1, +1}. As before, let the letter h (for hypothesis) represent a classifier, so the classification process looks like:
x→ h →y .
We will assume that each x(i) is a d × 1 column vector. The intended meaning of this data is
that, when given an input x(i) , the learned hypothesis should generate output y(i) .
What makes a classifier useful? As in regression, we want it to work well on new data,
making good predictions on examples it hasn’t seen. But we don’t know exactly what
data this classifier might be tested on when we use it in the real world. So, we have to
assume a connection between the training data and testing data; typically, they are drawn
independently from the same probability distribution.
In classification, we will often use 0-1 loss for evaluation (as discussed in Section 1.3).
For that choice, we can write the training error and the testing error. In particular, given a
training set Dn and a classifier h, we define the training error of h to be

$$\mathcal{E}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 1 & h(x^{(i)}) \neq y^{(i)} \\ 0 & \text{otherwise} \end{cases} .$$

For now, we will try to find a classifier with small training error (later, with some added criteria) and hope it generalizes well to new data, and has a small test error

$$\mathcal{E}(h) = \frac{1}{n'} \sum_{i=n+1}^{n+n'} \begin{cases} 1 & h(x^{(i)}) \neq y^{(i)} \\ 0 & \text{otherwise} \end{cases}$$
on n ′ new examples that were not used in the process of finding the classifier.
We begin by introducing the hypothesis class of linear classifiers (Section 4.2) and then
define an optimization framework to learn linear logistic classifiers (Section 4.3).
Example: Let h be the linear classifier defined by θ = [1, −1]^T, θ0 = 1.

The diagram below shows the θ vector (in green) and the separator it defines. [Figure: the separator θT x + θ0 = 0 in the (x1, x2) plane, with the normal vector θ = (θ1, θ2) drawn in green.]

What is θ0? We can solve for it by plugging a point on the line into the equation for the line. It is often convenient to choose a point on one of the axes, e.g., in this case, x = [0, 1]^T, for which θT [0, 1]^T + θ0 = 0, giving θ0 = 1.
In this example, the separator divides Rd , the space our x(i) points live in, into two half-
spaces. The one that is on the same side as the normal vector is the positive half-space, and
we classify all points in that space as positive. The half-space on the other side is negative
and all points in it are classified as negative.
Note that we will call a separator a linear separator of a data set if all of the data with
one label falls on one side of the separator and all of the data with the other label falls on
the other side of the separator. For instance, the separator in the next example is a linear
separator for the illustrated data. If there exists a linear separator on a dataset, we call this
dataset linearly separable.
Example: Let h be the linear classifier defined by θ = [−1, 1.5]^T, θ0 = 3.

The diagram below shows several points classified by h. In particular, let x^(1) = [3, 2]^T and x^(2) = [4, −1]^T.

$$h(x^{(1)}; \theta, \theta_0) = \operatorname{sign}\!\left(\begin{bmatrix} -1 & 1.5 \end{bmatrix} \begin{bmatrix} 3 \\ 2 \end{bmatrix} + 3\right) = \operatorname{sign}(3) = +1$$

$$h(x^{(2)}; \theta, \theta_0) = \operatorname{sign}\!\left(\begin{bmatrix} -1 & 1.5 \end{bmatrix} \begin{bmatrix} 4 \\ -1 \end{bmatrix} + 3\right) = \operatorname{sign}(-2.5) = -1$$
Thus, x(1) and x(2) are given positive and negative classifications, respectively.
[Figure: the separator θT x + θ0 = 0, with x^(1) on the positive side and x^(2) on the negative side.]
Study Question: What is the green vector normal to the separator? Specify it as a
column vector.
Study Question: What change would you have to make to θ, θ0 if you wanted to
have the separating hyperplane in the same place, but to classify all the points la-
beled ’+’ in the diagram as negative and all the points labeled ’-’ in the diagram as
positive?
However, even for simple linear classifiers, it is very difficult to find values for θ, θ0 that
minimize simple 0-1 training error
$$J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} L_{01}(\operatorname{sign}(\theta^T x^{(i)} + \theta_0), y^{(i)}) .$$
This problem is NP-hard, which probably implies that solving the most difficult instances of this problem would require computation time exponential in the number of training examples, n. (The “probably” here is not because we're too lazy to look it up, but actually because of a fundamental unsolved problem in computer-science theory, known as “P vs. NP.”)

What makes this a difficult optimization problem is its lack of “smoothness”:

• There can be two hypotheses, (θ, θ0) and (θ′, θ0′), where one is closer in parameter space to the optimal parameter values (θ*, θ0*), but they make the same number of misclassifications so they have the same J value.
• All predictions are categorical: the classifier can’t express a degree of certainty about
whether a particular input x should have an associated value y.
For these reasons, if we are considering a hypothesis θ, θ0 that makes five incorrect predic-
tions, it is difficult to see how we might change θ, θ0 so that it will perform better, which
makes it difficult to design an algorithm that searches in a sensible way through the space
of hypotheses for a good one. For these reasons, we investigate another hypothesis class:
linear logistic classifiers, providing their definition, then an approach for learning such clas-
sifiers using optimization.
[Plot: the sigmoid σ(z) for z ∈ [−4, 4]; it increases from 0 toward 1, with σ(0) = 0.5.]
Study Question: Convince yourself the output of σ is always in the interval (0, 1).
Why can’t it equal 0 or equal 1? For what value of z does σ(z) = 0.5?
What does an LLC look like? Let’s consider the simple case where d = 1, so our input
points simply lie along the x axis. Classifiers in this case have dimension 0, meaning that
they are points. The plot below shows LLCs for three different parameter settings: σ(10x +
1), σ(−2x + 1), and σ(2x − 3).
[Plot: σ(θT x + θ0) over x ∈ [−4, 4] for the three parameter settings, each a sigmoid-shaped curve with a different steepness and a different x value at which it crosses 0.5.]
Study Question: Which plot is which? What governs the steepness of the curve?
What governs the x value where the output is equal to 0.5?
But wait! Remember that the definition of a classifier is that it’s a mapping from Rd →
{−1, +1} or to some other discrete set. So, then, it seems like an LLC is actually not a
classifier!
Given an LLC, with an output value in (0, 1), what should we do if we are forced to
make a prediction in {+1, −1}? A default answer is to predict +1 if σ(θT x + θ0 ) > 0.5 and
−1 otherwise. The value 0.5 is sometimes called a prediction threshold.
In fact, for different problem settings, we might prefer to pick a different prediction
threshold. The field of decision theory considers how to make this choice. For example, if
the consequences of predicting +1 when the answer should be −1 are much worse than
the consequences of predicting −1 when the answer should be +1, then we might set the
prediction threshold to be greater than 0.5.
Study Question: Using a prediction threshold of 0.5, for what values of x do each of
the LLCs shown in the figure above predict +1?
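As a small added illustration of turning an LLC output into a hard prediction (not from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def llc_predict(x, theta, theta_0, threshold=0.5):
    """Predict +1 if sigma(theta^T x + theta_0) exceeds the prediction threshold, else -1."""
    g = sigmoid(np.dot(theta, x) + theta_0)
    return +1 if g > threshold else -1
```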
When d = 2, then our inputs x lie in a two-dimensional space with axes x1 and x2 , and
the output of the LLC is a surface, as shown below, for θ = (1, 1), θ0 = 2.
Study Question: Convince yourself that the set of points for which σ(θT x + θ0 ) = 0.5,
that is, the “boundary” between positive and negative predictions with prediction
threshold 0.5, is a line in (x1 , x2 ) space. What particular line is it for the case in the
figure above? How would the plot change for θ = (1, 1), but now with θ0 = −2? For
θ = (−1, −1), θ0 = 2?
Study Question: Be sure you can see why these two expressions are the same.
The big product above is kind of hard to deal with in practice, though. So what can we
do? Because the log function is monotonic, the θ, θ0 that maximize the quantity above will
be the same as the θ, θ0 that maximize its log, which is the following:
$$\sum_{i=1}^{n} \left( y^{(i)} \log g^{(i)} + (1 - y^{(i)}) \log(1 - g^{(i)}) \right) .$$
Finally, we can turn the maximization problem above into a minimization problem by tak-
ing the negative of the above expression, and write in terms of minimizing a loss
$$\sum_{i=1}^{n} L_{\text{nll}}(g^{(i)}, y^{(i)})$$
This loss function is also sometimes referred to as the log loss or cross entropy. (You can use any base for the logarithm and it won't make any real difference. If we ask you for numbers, use log base e.)

What is the objective function for linear logistic classification? We can finally put all these pieces together and develop an objective function for optimizing regularized negative log-likelihood for a linear logistic classifier. (That's a lot of fancy words!) In fact, this process is usually called “logistic regression,” so we'll call our objective Jlr, and define it as

$$J_{\text{lr}}(\theta, \theta_0; \mathcal{D}) = \left( \frac{1}{n} \sum_{i=1}^{n} L_{\text{nll}}(\sigma(\theta^T x^{(i)} + \theta_0), y^{(i)}) \right) + \lambda \|\theta\|^2 . \tag{4.1}$$
Study Question: Consider the case of linearly separable data. What will the θ values
that optimize this objective be like if λ = 0? What will they be like if λ is very big?
Try to work out an example in one dimension with two data points.
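As an added sketch (not from the notes, and ignoring numerical safeguards such as clipping the argument of the log), the objective of Eq. 4.1 in numpy, with labels encoded as 0 or 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J_lr(theta, theta_0, X, y, lam):
    """Regularized negative log-likelihood objective of Eq. 4.1.

    X is n x d; y holds labels in {0, 1}; lam is the regularization constant."""
    g = sigmoid(X @ theta + theta_0)
    nll = -(y * np.log(g) + (1 - y) * np.log(1 - g))
    return np.mean(nll) + lam * (theta @ theta)
```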
What role does regularization play for classifiers? This objective function has the same
structure as the one we used for regression, Eq. 2.2, where the first term (in parentheses)
is the average loss, and the second term is for regularization. Regularization is needed
for building classifiers that can generalize well (just as was the case for regression). The
parameter λ governs the trade-off between the two terms as illustrated in the following
example.
Suppose we wish to obtain a linear logistic classifier for this one-dimensional dataset:
1.2
1.0
0.8
0.6
y
0.4
0.2
0.0
0.2
8 6 4 2 0 2 4 6 8
x
Clearly, this can be fit very nicely by a hypothesis h(x) = σ(θx), but what is the best value
for θ? Evidently, when there is no regularization (λ = 0), the objective function Jlr (θ) will
approach zero for large values of θ, as shown in the plot on the left, below. However, would
the best hypothesis really have an infinite (or very large) value for θ? Such a hypothesis
would suggest that the data indicate strong certainty that a sharp transition between y = 0
and y = 1 occurs exactly at x = 0, despite the actual data having a wide gap around x = 0.
20 λ=0 20 λ = 0.2
15 15
Jlr (θ)
Jlr (θ)
10 10
5 5
0 0
4 3 2 1 0 1 2 3 4 4 3 2 1 0 1 2 3 4
θ θ
Study Question: Be sure this makes sense. When the θ values are very large, what
does the sigmoid curve look like? Why do we say that it has a strong certainty in
that case?
In the absence of other beliefs about the solution, we might prefer that our linear logistic classifier not be overly certain about its predictions, and so we might prefer a smaller θ over a large θ. By not being overconfident, we might expect a somewhat smaller θ to perform better on future examples drawn from this same distribution. (To refresh on some vocabulary, we say that in this example, a very large θ would be overfit to the training data.) This preference can be realized using a nonzero value of the regularization trade-off parameter, as illustrated in the plot on the right, above, with λ = 0.2.

Another nice way of thinking about regularization is that we would like to prevent our hypothesis from being too dependent on the particular training data that we were given: we would like for it to be the case that if the training data were changed slightly, the hypothesis would not change by much.
Note that ∇θ Jlr will be of shape d × 1 and ∂Jlr/∂θ0 will be a scalar since we have separated θ0 from θ here.
Study Question: Convince yourself that the dimensions of all these quantities are
correct, under the assumption that θ is d × 1. How does d relate to m as discussed
for Θ in the previous section?
Study Question: Use these last two results to verify our derivation above.
Putting everything together, our gradient descent algorithm for logistic regression be-
comes:
8  until |Jlr(θ^(t), θ0^(t)) − Jlr(θ^(t−1), θ0^(t−1))| < ϵ
9  return θ^(t), θ0^(t)
the derivative of the function f1 (z) is a monotonically increasing function and therefore f1
is a convex function.
Second, we can see that since

$$\begin{aligned}
\frac{d}{dz} f_2(z) &= \frac{d}{dz} \left[ -\log\!\left( \exp(-z) / (1 + \exp(-z)) \right) \right] , \\
&= \frac{d}{dz} \left[ \log(1 + \exp(-z)) + z \right] , \\
&= \sigma(z) ,
\end{aligned}$$
the derivative of the function f2 (z) is also monotonically increasing and therefore f2 is a
convex function.
that
z = θT x + θ0 .
(Let's check dimensions! θT is K × d and x is d × 1, and θ0 is K × 1, so z is K × 1 and we're good!)

Next, we have to extend our use of the sigmoid function to the multi-dimensional softmax function, which takes a whole vector z ∈ RK and generates

$$g = \operatorname{softmax}(z) = \begin{bmatrix} \exp(z_1) / \sum_i \exp(z_i) \\ \vdots \\ \exp(z_K) / \sum_i \exp(z_i) \end{bmatrix}$$
which can be interpreted as a probability distribution over K items. To make the final
prediction of the class label, we can then look at g, find the most likely probability over
these K entries in g (i.e., find the largest entry in g), and return the corresponding index as
the position of the 1 in our "one-hot" prediction.
Study Question: Convince yourself that the vector of g values will be non-negative
and sum to 1.
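As a quick illustration (ours, not from the notes), here is a small numpy sketch of the softmax computation and of turning g into a class prediction by taking the largest entry:

```python
import numpy as np

def softmax(z):
    """z is a K x 1 vector; returns a K x 1 probability vector."""
    ez = np.exp(z - np.max(z))   # subtracting max(z) avoids overflow and does not change the result
    return ez / np.sum(ez)

z = np.array([[1.0], [2.0], [0.5]])
g = softmax(z)
print(g.sum())          # 1.0: the entries are non-negative and sum to one
print(np.argmax(g))     # index of the predicted class (the position of the 1 in the one-hot prediction)
```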
h(x; θ, θ0 ) = softmax(θT x + θ0 ) .
Now, we retain the goal of maximizing the probability that our hypothesis assigns to
the correct output yk for each input x. We can write this probability, letting g stand for our
"guess," h(x), for a single example (x, y), as $\prod_{k=1}^{K} g_k^{y_k}$.
Study Question: How many elements that are not equal to 1 will there be in this
product?
The negative log of the probability that we are making a correct guess is, then, for one-hot
vector y and probability distribution vector g,
$$\mathcal{L}_\text{nllm}(g, y) = -\sum_{k=1}^{K} y_k \cdot \log(g_k)\,.$$
We'll call this NLLM for negative log likelihood multiclass. It is worth noting that the
NLLM loss function is also convex; however, we will omit the proof.
Study Question: Be sure you see that Lnllm is minimized when the guess assigns
high probability to the true class.
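A minimal sketch of computing this loss, assuming y is a one-hot column vector and g is the softmax output (the function name is ours):

```python
import numpy as np

def nllm_loss(g, y):
    """Negative log likelihood multiclass: g and y are K x 1 vectors, y one-hot."""
    return float(-np.sum(y * np.log(g)))

g = np.array([[0.7], [0.2], [0.1]])
y = np.array([[1.0], [0.0], [0.0]])    # the true class is the first one
print(nllm_loss(g, y))                 # -log(0.7) ≈ 0.357; more probability on the true class means lower loss
```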
$$A(h; \mathcal{D}) = 1 - \frac{1}{n}\sum_{i=1}^{n} L_{01}(g^{(i)}, y^{(i)})\,,$$
where g(i) is the final guess for one class or the other that we make from h(x(i) ), e.g., after
thresholding. It’s noteworthy here that we use a different loss function for optimization
than for evaluation. This is a compromise we make for computational ease and efficiency.
Feature representation
Linear regression and classification are powerful tools, but in the real world, data often
exhibit non-linear behavior that cannot immediately be captured by the linear models which
we have built so far. For example, suppose the true behavior of a system (with d = 2) looks
like this wavelet (this plot is of the so-called jinc function J1(ρ)/ρ for ρ² = x1² + x2²):
[Figure: a radially symmetric wavelet-shaped surface plotted over the (x1, x2) plane.]
Such behavior is actually ubiquitous in physical systems, e.g., in the vibrations of the sur-
face of a drum, or scattering of light through an aperture. However, no single hyperplane
would be a very good fit to such peaked responses!
A richer class of hypotheses can be obtained by performing a non-linear feature trans-
formation ϕ(x) before doing the regression. That is, θT x + θ0 is a linear function of x, but
θT ϕ(x) + θ0 is a non-linear function of x, if ϕ is a non-linear function of x.
There are many different ways to construct ϕ. Some are relatively systematic and do-
main independent. Others are directly related to the semantics (meaning) of the original
features, and we construct them deliberately with our application (goal) in mind.
These points are not linearly separable, but consider the transformation ϕ(x) = [x, x²]T.
(What's a linear separator for data in 1D? A point!) Putting the data in ϕ space, we see that
it is now separable. There are lots of possible separators; we have just shown one of them
here.
[Figure: the data plotted in ϕ space, with one linear separator drawn.]
A linear separator in ϕ space is a nonlinear separator in the original space! Let’s see
how this plays out in our simple example. Consider the separator x2 − 1 = 0, which labels
the half-plane x2 − 1 > 0 as positive. What separator does it correspond to in the original
1-D space? We have to ask the question: which x values have the property that x² − 1 = 0?
The answer is +1 and −1, so those two points constitute our separator, back in the original
space. And we can use the same reasoning to find the region of 1D space that is labeled
positive by this separator.
[Figure: the original 1-D space, with the two separator points at x = −1 and x = +1 marked on the x axis.]
This transformation can be used in combination with linear regression or logistic regres-
sion (or any other regression or classification model). When we’re using a linear regression
or classification model, the key insight is that a linear regressor or separator in the trans-
formed space is a non-linear regressor or separator in the original space.
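Here is a tiny sketch (ours) of that reasoning in code: apply ϕ(x) = [x, x²] to a few 1-D points and evaluate the separator x² − 1 = 0 in ϕ space, which is exactly a sign check on the second feature.

```python
import numpy as np

def phi(x):
    """Map a scalar x to the 2-D feature vector [x, x^2]."""
    return np.array([x, x ** 2])

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
# the separator x^2 - 1 = 0 in phi space, labeling x^2 - 1 > 0 as positive
labels = np.sign(np.array([phi(x)[1] - 1 for x in xs]))
print(labels)   # [ 1. -1. -1. -1.  1.]: positive outside the interval (-1, 1), negative inside
```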
For example, the wavelet pictured at the start of this chapter can be fit much better than
with just a hyperplane, using linear regression with polynomials up to order k = 8.
(Specifically, this example uses the basis [1, x1, x2, x1² + x2², (x1² + x2²)², (x1² + x2²)⁴]T.)
The raw data (with n = 1000 random samples) is plotted on the left, and the regression
result (curved surface) is on the right.
Now let’s look at a classification example and see how polynomial feature transforma-
tion may help us.
One well-known example is the "exclusive or" (XOR) data set, the drosophila of machine-
learning data sets. (D. melanogaster is a species of fruit fly, used as a simple system in which
to study genetics, since 1910.)
[Figure: the XOR data set in the (x1, x2) plane.]
Clearly, this data set is not linearly separable. So, what if we try to solve the XOR classi-
fication problem using a polynomial basis as the feature transformation? We can just take
our two-dimensional data and transform it into a higher-dimensional data set, by applying
ϕ. Now, we have a classification problem as usual.
Let's try it for k = 2 on our XOR problem. The feature transformation is
$$\phi(x) = [1,\; x_1,\; x_2,\; x_1^2,\; x_1 x_2,\; x_2^2]^T\,,$$
and the resulting classifier is plotted below, with the gray shaded region classified as negative
and the white region classified as positive:
[Figure: the XOR data in the (x1, x2) plane, with the learned decision regions shaded.]
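To see the transformation concretely, here is a small sketch (assuming the order-2 basis written above and the classic ±1 encoding of the XOR points; both are our own choices for illustration) that maps the four XOR points into the six-dimensional feature space, where a linear classifier can separate them:

```python
import numpy as np

def phi_k2(x):
    """Order-2 polynomial basis for a 2-D input x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

# the four XOR points; label is positive when exactly one coordinate is positive
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([-1, +1, +1, -1])

Phi = np.array([phi_k2(x) for x in X])   # 4 x 6 transformed data set
# In the transformed space, the single feature x1*x2 (column 4) already separates the classes:
print(np.sign(-Phi[:, 4]))               # [-1.  1.  1. -1.], which matches y
```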
Here is a harder data set. Logistic regression with gradient descent failed to separate
it with a second, third, or fourth-order basis feature representation, but succeeded with a
fifth-order basis. Shown below are some results after ∼ 1000 gradient descent iterations
(from random starting points) for bases of order 2 (upper left), 3 (upper right), 4 (lower
left), and 5 (lower right).
[Figure: decision regions in the (x1, x2) plane for bases of order 2 (upper left), 3 (upper right), 4 (lower left), and 5 (lower right).]
Study Question: Percy Eptron has a domain with four numeric input features,
(x1 , . . . , x4 ). He decides to use a representation of the form
where a⌢ b means the vector a concatenated with the vector b. What is the dimen-
sion of Percy’s representation? Under what assumptions about the original features is
this a reasonable choice?
Let's start with the basic case, in which X = Rd. Then we can define
$$f_p(x) = e^{-\beta \lVert p - x \rVert^2}\,.$$
This function is maximized when p = x and decreases exponentially as x becomes more
distant from p.
The parameter β governs how quickly the feature value decays as we move away from
the center point p. For large values of β, the fp values are nearly 0 almost everywhere
except right near p; for small values of β, the features have a high value over a larger part
of the space.
Study Question: What is fp (p)?
Now, given a dataset D containing n points, we can make a feature transformation ϕ
that maps points in our original space, Rd , into points in a new space, Rn . It is defined as
follows:
$$\phi(x) = \left[f_{x^{(1)}}(x),\; f_{x^{(2)}}(x),\; \dots,\; f_{x^{(n)}}(x)\right]^T\,.$$
So, we represent a new datapoint x in terms of how far it is from each of the datapoints in
our training set.
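Here is a brief sketch (ours, not from the notes) of this feature transformation, using the training points as the centers:

```python
import numpy as np

def rbf_features(x, X_train, beta=1.0):
    """Map a d-vector x to an n-vector of similarities to the n training points."""
    # f_p(x) = exp(-beta * ||p - x||^2), one feature per training point p
    return np.exp(-beta * np.sum((X_train - x) ** 2, axis=1))

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
x = np.array([0.1, 0.0])
print(rbf_features(x, X_train))   # close to 1 for nearby centers, near 0 for distant ones
```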
This idea can be generalized in several ways and is the fundamental concept underlying
kernel methods, that you should read about some time. This idea of describing objects in
terms of their similarity to a set of reference objects is very powerful and can be applied to
cases where X is not a simple vector space, but where the inputs are graphs or strings or
other types of objects, as long as there is a distance metric defined on it.
• Factored code: If your discrete values can sensibly be decomposed into two parts
(say the “maker” and “model” of a car), then it’s best to treat those as two separate
features, and choose an appropriate encoding of each one from this list.
• Binary code: It might be tempting for the computer scientists among us to use some
binary code, which would let us represent k values using a vector of length log k.
This is a bad idea! Decoding a binary code takes a lot of work, and by encoding your
inputs this way, you’d be forcing your system to learn the decoding algorithm.
As an example, imagine that we want to encode blood types, that are drawn from the
set {A+, A−, B+, B−, AB+, AB−, O+, O−}. There is no obvious linear numeric scaling or
even ordering to this set. But there is a reasonable factoring, into two features: {A, B, AB, O}
and {+, −}. And, in fact, we can further reasonably factor the first group into {A, notA},
{B, notB}. (It is sensible, according to Wikipedia!, to treat O as having neither feature A nor
feature B.) So, here are two plausible encodings of the whole set:
• Use a 6-D vector, with two components of the vector each encoding the corresponding
factor using a one-hot encoding.
• Use a 3-D vector, with one dimension for each factor, encoding its presence as 1.0
and absence as −1.0 (this is sometimes better than 0.0). In this case, AB+ would be
[1.0, 1.0, 1.0]T and O− would be [−1.0, −1.0, −1.0]T .
5.3.2 Text
The problem of taking a text (such as a tweet or a product review, or even this document!)
and encoding it as an input for a machine-learning algorithm is interesting and compli-
cated. Much later in the class, we’ll study sequential input models, where, rather than
having to encode a text as a fixed-length feature vector, we feed it into a hypothesis word
by word (or even character by character!).
There are some simple encodings that work well for basic applications. One of them is
the bag of words (BOW) model. The idea is to let d be the number of words in our vocabulary
(either computed from the training set or some other body of text or dictionary). We will
then make a binary vector (with values 1.0 and 0.0) of length d, where element j has value
1.0 if word j occurs in the document, and 0.0 otherwise.
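A tiny sketch of the BOW encoding (the vocabulary and helper name here are our own illustration):

```python
def bow_encode(text, vocabulary):
    """Binary bag-of-words vector: entry j is 1.0 if word j of the vocabulary occurs in the text."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in vocabulary]

vocabulary = ["the", "cat", "sat", "mat", "dog"]
print(bow_encode("The cat sat on the mat", vocabulary))   # [1.0, 1.0, 1.0, 1.0, 0.0]
```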
feature with much larger values than another, it will take the learning algorithm a lot of
work to find parameters that can put them on an equal basis. We could also perform a
more involved scaling/transformation ϕ(x) = (x − x̄)/σ, where x̄ is the average of the x(i),
and σ is the standard deviation of the x(i). The resulting feature values will have mean 0
and standard deviation 1. This transformation is sometimes called standardizing a variable.
(Such standard variables are often known as "z-scores," for example, in the social sciences.)
Then, of course, we might apply a higher-order polynomial-basis transformation to one
or more groups of numeric features.
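A minimal sketch of standardizing one numeric feature column (names ours):

```python
import numpy as np

def standardize(col):
    """Return z-scores: (x - mean) / standard deviation, computed over one feature column."""
    return (col - np.mean(col)) / np.std(col)

heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
z = standardize(heights)
print(np.mean(z), np.std(z))   # approximately 0 and 1
```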
Study Question: Consider using a polynomial basis of order k as a feature trans-
formation ϕ on our data. Would increasing k tend to increase or decrease structural
error? What about estimation error?
Neural Networks
You’ve probably been hearing a lot about “neural networks.” Now that we have several
useful machine-learning concepts (hypothesis classes, classification, regression, gradient
descent, regularization, etc.) we are well equipped to understand neural networks in detail.
This is, in some sense, the “third wave” of neural nets. The basic idea is founded on
the 1943 model of neurons of McCulloch and Pitts and the learning ideas of Hebb. There
was a great deal of excitement, but not a lot of practical success: there were good train-
ing methods (e.g., perceptron) for linear functions, and interesting examples of non-linear
functions, but no good way to train non-linear functions from data. Interest died out for a
while, but was re-kindled in the 1980s when several people came up with a way to train
neural networks with "back-propagation," which is a particular style of implementing
gradient descent, that we will study here. (As with many good ideas in science, the basic
idea for how to train non-linear neural networks with gradient descent was independently
developed by more than one researcher.) By the mid-90s, the enthusiasm waned again,
because although we could train non-linear networks, the training tended to be slow and
was plagued by a problem of getting stuck in local optima. Support vector machines (SVMs)
that use regularization of high-dimensional hypotheses by seeking to maximize the margin,
and kernel methods that are an efficient and beautiful way of using feature transformations
to non-linearly transform data into a higher-dimensional space, provided reliable learning
methods with guaranteed convergence and no local optima.
However, during the SVM enthusiasm, several groups kept working on neural net-
works, and their work, in combination with an increase in available data and computation,
has made them rise again. They have become much more reliable and capable, and are
now the method of choice in many applications. There are many, many variations of neural
networks, which we can't even begin to survey. (The number increases daily, as may be
seen on arxiv.org.) We will study the core "feed-forward" networks with "back-propagation"
training, and then, in later chapters, address some of the major advances beyond this core.
We can view neural networks from several different perspectives:
View 3: A method for building applications that make predictions based on huge amounts
of data in very complex domains.
We will mostly take view 1, with the understanding that the techniques we develop will
enable the applications in view 3. View 2 was a major motivation for the early development
of neural networks, but the techniques we will study do not seem to actually account for
the biological learning processes in brains. (Some prominent researchers are, in fact,
working hard to find analogues of these methods in the brain.)
6.1 Basic element
The basic element of a neural network is a “neuron,” pictured schematically below. We will
also sometimes refer to a neuron as a “unit” or “node.”
[Figure: a single neuron. The inputs x1, . . . , xm are weighted by w1, . . . , wm and summed together with the offset w0 to form the pre-activation z, which is passed through the activation function f(·) to produce the output a.]
It is a non-linear function of an input vector x ∈ Rm to a single output value a ∈ R. It is
parameterized by a vector of weights (w1, . . . , wm) ∈ Rm and an offset or threshold w0 ∈ R.
(Sorry for changing our notation here. We were using d as the dimension of the input, but
we are trying to be consistent here with many other accounts of neural networks. It is
impossible to be consistent with all of them, though; there are many different ways of
telling this story.)
In order for the neuron to be non-linear, we also specify an activation function f : R → R,
which can be the identity (f(x) = x, in that case the neuron is a linear function of x), but can
also be any other function, though we will only be able to work with it if it is differentiable.
The function represented by the neuron is expressed as:
$$a = f(z) = f\!\left(\sum_{j=1}^{m} x_j w_j + w_0\right) = f(w^T x + w_0)\,.$$
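A single neuron is only a few lines of code; here is a sketch (with σ chosen as the activation purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neuron(x, w, w0, f=sigmoid):
    """x and w are m-vectors, w0 is a scalar; returns a = f(w^T x + w0)."""
    z = np.dot(w, x) + w0      # pre-activation
    return f(z)                # activation

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
print(neuron(x, w, w0=0.1))
```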
Before thinking about a whole network, we can consider how to train a single unit.
Given a loss function L(guess, actual) and a dataset {(x(1), y(1)), . . . , (x(n), y(n))}, we can do
(stochastic) gradient descent, adjusting the weights w, w0 (this should remind you of our θ
and θ0 for linear models) to minimize
$$J(w, w_0) = \sum_{i} L\!\left(\text{NN}(x^{(i)}; w, w_0),\, y^{(i)}\right)\,,$$
where NN is the output of our single-unit neural net for a given input.
We have already studied two special cases of the neuron: linear logistic classifiers
(LLCs) with NLL loss and regressors with quadratic loss! The activation function for the
LLC is f(x) = σ(x) and for linear regression it is simply f(x) = x.
Study Question: Just for a single neuron, imagine for some reason, that we decide
to use activation function f(z) = ez and loss function L(guess, actual) = (guess −
actual)2 . Derive a gradient descent update for w and w0 .
6.2 Networks
Now, we’ll put multiple neurons together into a network. A neural network in general
takes in an input x ∈ Rm and generates an output a ∈ Rn . It is constructed out of multiple
neurons; the inputs of each neuron might be elements of x and/or outputs of other neurons.
The outputs are generated by n output units.
In this chapter, we will only consider feed-forward networks. In a feed-forward network,
you can think of the network as defining a function-call graph that is acyclic: that is, the
input to a neuron can never depend on that neuron’s output. Data flows one way, from the
inputs to the outputs, and the function computed by the network is just a composition of
the functions computed by the individual neurons.
Although the graph structure of a feed-forward neural network can really be anything
(as long as it satisfies the feed-forward constraint), for simplicity in software and analysis,
we usually organize them into layers. A layer is a group of neurons that are essentially “in
parallel”: their inputs are outputs of neurons in the previous layer, and their outputs are
the input to the neurons in the next layer. We’ll start by describing a single layer, and then
go on to the case of multiple layers.
[Figure: a single layer. Each of the inputs x1, . . . , xm feeds into each of n units; the layer's weights and offsets are collected in W and W0, and the units produce the outputs a1, . . . , an.]
Since each unit has a vector of weights and a single offset, we can think of the weights of
the whole layer as a matrix, W, and the collection of all the offsets as a vector W0 . If we
have m inputs, n units, and n outputs, then
• W is an m × n matrix,
• W0 is an n × 1 column vector,
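In code, the whole layer is a single matrix operation; here is a sketch under the conventions above (W is m × n, W0 is n × 1), with names of our own choosing:

```python
import numpy as np

def layer_forward(x, W, W0, f):
    """x: m x 1 input, W: m x n weights, W0: n x 1 offsets; returns the n x 1 layer output."""
    z = W.T @ x + W0          # pre-activations, n x 1
    return f(z)               # activations, n x 1

m, n = 3, 2
rng = np.random.default_rng(0)
W, W0 = rng.normal(size=(m, n)), np.zeros((n, 1))
x = rng.normal(size=(m, 1))
print(layer_forward(x, W, W0, np.tanh).shape)   # (2, 1)
```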
[Figure: an L-layer network as a chain: X = A0 is transformed by (W1, W01, f1) into Z1 and A1, then by (W2, W02, f2) into Z2 and A2, and so on, until (WL, W0L, fL) produces ZL and AL.]
If every activation function fl is the identity, the whole network collapses to
$$A^L = W^\text{total} X\,,$$
which is a linear function of X! Having all those layers did not change the representational
capacity of the network: the non-linearity of the activation function is crucial.
Study Question: Convince yourself that any function representable by any number
of linear layers (where f is the identity function) can be represented by a single layer.
Now that we are convinced we need a non-linear activation, let’s examine a few com-
mon choices. These are shown mathematically below, followed by plots of these functions.
Step function:
$$\text{step}(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{otherwise} \end{cases}$$
Sigmoid function: Also known as a logistic function. This can sometimes be interpreted
as probability, because for any value of z the output is in (0, 1):
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
[Figure: plots of step(z), ReLU(z), σ(z), and tanh(z).]
The original idea for neural networks involved using the step function as an activation,
but because the derivative of the step function is zero everywhere except at the discontinu-
ity (and there it is undefined), gradient-descent methods won’t be useful in finding a good
setting of the weights, and so we won’t consider them further. They have been replaced, in
a sense, by the sigmoid, ReLU, and tanh activation functions.
Study Question: Consider sigmoid, ReLU, and tanh activations. Which one is most
like a step function? Is there an additional parameter you could add to a sigmoid
that would make it be more like a step function?
Study Question: What is the derivative of the ReLU function? Are there some val-
ues of the input for which the derivative vanishes?
ReLUs are especially common in internal (“hidden”) layers, sigmoid activations are
common for the output for binary classification, and softmax activations are common for
the output for multi-class classification (see Section 4.3.2 for an explanation).
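For reference, here is a short sketch (ours) of these activation functions and their derivatives, which is exactly what back-propagation will need; the convention for the ReLU derivative at 0 is our own choice:

```python
import numpy as np

def sigmoid(z):    return 1 / (1 + np.exp(-z))
def d_sigmoid(z):  s = sigmoid(z); return s * (1 - s)

def relu(z):       return np.maximum(0, z)
def d_relu(z):     return (z > 0).astype(float)   # 0 for z < 0, 1 for z > 0; undefined at 0, we use 0

def tanh(z):       return np.tanh(z)
def d_tanh(z):     return 1 - np.tanh(z) ** 2
```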
In the equation above, we’re using the lowercase letters al , zl , wl , al−1 , wl0 to emphasize
that all of these quantities are scalars just for the moment. We’ll look at the more general
matrix case below.
To use SGD, then, we want to compute ∂L(NN(x; W), y)/∂wl and ∂L(NN(x; W), y)/∂w0l
for each layer l and each data point (x, y). (Check your understanding: why do we need
exactly these quantities for SGD?) Below we'll write "loss" as an abbreviation for
L(NN(x; W), y). Then our first quantity of interest is ∂loss/∂wl . The chain rule gives us
the following. First, let’s look at the case l = L:
Study Question: Check the derivations above yourself. You should use the chain
rule and also solve for the individual derivatives that arise in the chain rule.
Study Question: Check that the final layer (l = L) case is a special case of the
general layer l case above.
Study Question: Derive ∂L(NN(x; W), y)/∂wl0 for yourself, for both the final layer
(l = L) and general l.
Study Question: Does the L = 1 case remind you of anything from earlier in this
course?
Study Question: Write out the full SGD algorithm for this neural network.
It’s pretty typical to run the chain rule from left to right like we did above. But, for
where we’re going next, it will be useful to notice that it’s completely equivalent to write it
in the other direction. So we can rewrite our result from above as follows:
$$\frac{\partial \text{loss}}{\partial w^l} = a^{l-1} \cdot \frac{\partial \text{loss}}{\partial z^l} \tag{6.1}$$
$$\frac{\partial \text{loss}}{\partial z^l} = \frac{\partial a^l}{\partial z^l} \cdot \frac{\partial z^{l+1}}{\partial a^l} \cdots \frac{\partial a^{L-1}}{\partial z^{L-1}} \cdot \frac{\partial z^L}{\partial a^{L-1}} \cdot \frac{\partial a^L}{\partial z^L} \cdot \frac{\partial \text{loss}}{\partial a^L} \tag{6.2}$$
$$\frac{\partial \text{loss}}{\partial z^l} = \frac{\partial a^l}{\partial z^l} \cdot w^{l+1} \cdots \frac{\partial a^{L-1}}{\partial z^{L-1}} \cdot w^L \cdot \frac{\partial a^L}{\partial z^L} \cdot \frac{\partial \text{loss}}{\partial a^L}\,. \tag{6.3}$$
OK, let’s start with the results! Again, below we’ll be using “loss” as an abbreviation
for L(NN(x; W), y). Then,
$$\underbrace{\frac{\partial \text{loss}}{\partial W^l}}_{m^l \times n^l} = \underbrace{A^{l-1}}_{m^l \times 1}\; \underbrace{\left(\frac{\partial \text{loss}}{\partial Z^l}\right)^T}_{1 \times n^l} \tag{6.4}$$
$$\frac{\partial \text{loss}}{\partial Z^l} = \frac{\partial A^l}{\partial Z^l} \cdot \frac{\partial Z^{l+1}}{\partial A^l} \cdots \frac{\partial A^{L-1}}{\partial Z^{L-1}} \cdot \frac{\partial Z^L}{\partial A^{L-1}} \cdot \frac{\partial A^L}{\partial Z^L} \cdot \frac{\partial \text{loss}}{\partial A^L} \tag{6.5}$$
$$\frac{\partial \text{loss}}{\partial Z^l} = \frac{\partial A^l}{\partial Z^l} \cdot W^{l+1} \cdots \frac{\partial A^{L-1}}{\partial Z^{L-1}} \cdot W^L \cdot \frac{\partial A^L}{\partial Z^L} \cdot \frac{\partial \text{loss}}{\partial A^L}\,. \tag{6.6}$$
First, compare each equation to its one-dimensional counterpart, and make sure you
see the similarities. That is, compare the general weight derivatives in Eq. 6.4 to the one-
dimensional case in Eq. 6.1. Compare the intermediate derivative of loss with respect to the
pre-activations Zl in Eq. 6.5 to the one-dimensional case in Eq. 6.2. And finally compare the
version where we’ve substituted in some of the derivatives in Eq. 6.6 to Eq. 6.3. Hopefully
you see how the forms are very analogous. But in the matrix case, we now have to be
careful about the matrix dimensions. We’ll check these matrix dimensions below.
Let’s start by talking through each of the terms in the matrix version of these equations.
Recall that loss is a scalar, and W l is a matrix of size ml × nl . You can read about the
conventions in the course for derivatives starting in this chapter in Appendix A. By these
conventions (not the only possible conventions!), we have that ∂loss/∂W l will be a matrix
of size ml × nl whose (i, j) entry is the scalar ∂loss/∂Wl_{i,j}. In some sense, we're just doing
a bunch of traditional scalar derivatives, and the matrix notation lets us write them all
simultaneously and succinctly. In particular, for SGD, we need to find the derivative of the
loss with respect to every scalar component of the weights because these are our model’s
parameters and therefore are the things we want to update in SGD.
The next quantity we see in Eq. 6.4 is Al−1 , which we recall has size ml × 1 (or equiva-
lently nl−1 × 1 since it represents the outputs of the l − 1 layer). Finally, we see ∂loss/∂Zl .
Again, loss is a scalar, and Zl is a nl × 1 vector. So by the conventions in Appendix A, we
have that ∂loss/∂Zl has size nl × 1. The transpose then has size 1 × nl . Now you should
be able to check that the dimensions all make sense in Eq. 6.4; in particular, you can check
that inner dimensions agree in the matrix multiplication and that, after the multiplication,
we should be left with something that has the dimensions on the lefthand side.
Now let’s look at Eq. 6.6. We’re computing ∂loss/∂Zl so that we can use it in Eq. 6.4.
The weights are familiar. The one part that remains is terms of the form ∂Al /∂Zl . Checking
out Appendix A, we see that this term should be a matrix of size nl × nl since Al and Zl
both have size nl × 1. The (i, j) entry of this matrix is ∂Alj /∂Zli . This scalar derivative is
something that you can compute when you know your activation function. If you’re not
using a softmax activation function, Alj typically is a function only of Zlj , which means that
∂Alj /∂Zli should equal 0 whenever i ̸= j, and that ∂Alj /∂Zlj = (fl ) ′ (Zlj ).
Study Question: Compute the dimensions of every term in Eqs. 6.5 and 6.6 using
Appendix A. After you’ve done that, check that all the matrix multiplications work;
that is, check that the inner dimensions agree and that the lefthand side and right-
hand side of these equations have the same dimensions.
Study Question: If I use the identity activation function, what is ∂Alj /∂Zlj for any j?
What is the full matrix ∂Al /∂Zl ?
To figure this out, let's remember that Zl = (Wl)⊤Al−1 + W0l. We can write one element of
the Zl vector, then, as
$$Z_b^l = \sum_{a=1}^{m^l} W_{a,b}^l\, A_a^{l-1} + (W_0^l)_b\,.$$
It follows that ∂Zl_k/∂Wl_{i,j} will be zero except when k = j (check you agree!). So we can
rewrite Eq. 6.7 as
Finally, then, we match entries of the matrices on both sides of the equation above to re-
cover Eq. 6.4.
Study Question: Check that Eq. 6.8 and Eq. 6.4 say the same thing.
Study Question: Convince yourself that ∂Zl /∂Al−1 = W l by comparing the entries
of the matrices on both sides on the equality sign.
Study Question: Apply the same reasoning to find the gradients of loss with respect
to W0l .
well as do the necessary weight updates for gradient descent. Each module has to provide
the following “methods.” We are already using letters a, x, y, z with particular meanings,
so here we will use u as the vector input to the module and v as the vector output:
• forward: u → v
• weight grad: u, ∂L/∂v → ∂L/∂W only needed for modules that have weights W
In homework we will ask you to implement these modules for neural network components,
and then use them to construct a network and train it as described in the next section.
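Here is a sketch of what one such module might look like, using a linear layer as the example. The class and method names are our own illustration of the interface, not the homework's exact specification.

```python
import numpy as np

class Linear:
    """A linear module: v = W^T u + W0."""
    def __init__(self, m, n):
        # small random initialization, in the spirit described in the training section below
        self.W  = np.random.normal(0, 1.0 / m, (m, n))
        self.W0 = np.zeros((n, 1))

    def forward(self, u):                  # u: m x 1  ->  v: n x 1
        self.u = u                         # cache the input for the backward/weight-grad steps
        return self.W.T @ u + self.W0

    def backward(self, dL_dv):             # dL/dv: n x 1  ->  dL/du: m x 1
        return self.W @ dL_dv

    def weight_grad(self, dL_dv):          # gradients with respect to this module's own weights
        self.dW  = self.u @ dL_dv.T        # m x n, matching Eq. 6.4
        self.dW0 = dL_dv                   # n x 1
```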
6.6 Training
Here we go! Here’s how to do stochastic gradient descent training on a feed-forward neural
network. After this pseudo-code, we motivate the choice of initialization in lines 2 and 3.
The actual computation of the gradient values (e.g., ∂loss/∂AL ) is not directly defined in
this code, because we want to make the structure of the computation clear.
Study Question: What is ∂Zl /∂W l ?
4    for t = 1 to T
5        i = random sample from {1, . . . , n}
6        A0 = x(i)
7        // forward pass to compute the output AL
8        for l = 1 to L
9            Zl = (Wl)T Al−1 + W0l
10           Al = fl(Zl)
Initializing W is important; if you do it badly there is a good chance the neural net-
work training won’t work well. First, it is important to initialize the weights to random
values. We want different parts of the network to tend to “address” different aspects of
the problem; if they all start at the same weights, the symmetry will often keep the values
from moving in useful directions. Second, many of our activation functions have (near)
zero slope when the pre-activation z values have large magnitude, so we generally want to
keep the initial weights small so we will be in a situation where the gradients are non-zero,
so that gradient descent will have some useful signal about which way to go.
One good general-purpose strategy is to choose each weight at random from a Gaussian
(normal) distribution with mean 0 and standard deviation (1/m) where m is the number
of inputs to the unit.
Study Question: If the input x to this unit is a vector of 1’s, what would the ex-
pected pre-activation z value be with these initial weights?
6.7.1 Batches
Assume that we have an objective of the form
$$J(W) = \sum_{i=1}^{n} L\!\left(h(x^{(i)}; W),\, y^{(i)}\right)\,,$$
where h is the function computed by a neural network, and W stands for all the weight
matrices and vectors in the network.
Recall that, when we perform batch (or the vanilla) gradient descent, we use the update
rule
Wt = Wt−1 − η∇W J(Wt−1 ) ,
which is equivalent to
$$W_t = W_{t-1} - \eta \sum_{i=1}^{n} \nabla_W L\!\left(h(x^{(i)}; W_{t-1}),\, y^{(i)}\right)\,.$$
So, we sum up the gradient of loss at each training point, with respect to W, and then take
a step in the negative direction of the gradient.
In stochastic gradient descent, we repeatedly pick a point (x(i) , y(i) ) at random from the
data set, and execute a weight update on that point alone:
As long as we pick points uniformly at random from the data set, and decrease η at an
appropriate rate, we are guaranteed, with high probability, to converge to at least a local
optimum.
These two methods have offsetting virtues. The batch method takes steps in the exact
gradient direction but requires a lot of computation before even a single step can be taken,
especially if the data set is large. The stochastic method begins moving right away, and can
sometimes make very good progress before looking at even a substantial fraction of the
whole data set, but if there is a lot of variability in the data, it might require a very small η
to effectively average over the individual steps moving in “competing” directions.
An effective strategy is to “average” between batch and stochastic gradient descent by
using mini-batches. For a mini-batch of size K, we select K distinct data points uniformly
at random from the data set and do the update based just on their contributions to the
gradient
$$W_t = W_{t-1} - \eta \sum_{i=1}^{K} \nabla_W L\!\left(h(x^{(i)}; W_{t-1}),\, y^{(i)}\right)\,.$$
See the note on the ceiling function¹ for the case when n/K is not an integer.
So, we can consider having an independent step-size parameter for each weight, and
updating it based on a local view of how the gradient updates have been going. Some
common strategies for this include momentum ("averaging" recent gradient updates),
Adadelta (take larger steps in parts of the space where J(W) is nearly flat), and Adam
(which combines these two previous ideas). Details of these approaches are described in
Appendix B.0.1. (This section is very strongly influenced by Sebastian Ruder's excellent
blog posts on the topic: ruder.io/optimizing-gradient-descent.)
6.8 Regularization
So far, we have only considered optimizing loss on the training data as our objective for
neural network training. But, as we have discussed before, there is a risk of overfitting if
we do this. The pragmatic fact is that, in current deep neural networks, which tend to be
very large and to be trained with a large amount of data, overfitting is not a huge problem.
This runs counter to our current theoretical understanding and the study of this question
is a hot area of research. Nonetheless, there are several strategies for regularizing a neural
network, and they can sometimes be important.
This rule has the form of first “decaying” Wt−1 by a factor of (1 − 2λη) and then taking a
gradient step.
Finally, the same effect can be achieved by perturbing the x(i) values of the training data
by adding a small amount of zero-mean normally distributed noise before each gradient
computation. It makes intuitive sense that it would be more difficult for the network to
overfit to particular training data if they are changed slightly on each training step.
6.8.2 Dropout
Dropout is a regularization method that was designed to work with deep neural networks.
The idea behind it is, rather than perturbing the data every time we train, we’ll perturb the
network! We’ll do this by randomly, on each training step, selecting a set of units in each
layer and prohibiting them from participating. Thus, all of the units will have to take a
kind of “collective” responsibility for getting the answer right, and will not be able to rely
on any small subset of the weights to do all the necessary computation. This tends also to
make the network more robust to data perturbations.
During the training phase, for each training example, for each unit, randomly with
probability p temporarily set aℓj = 0. There will be no contribution to the output and no
gradient update for the associated unit.
Study Question: Be sure you understand why, when using SGD, setting an activa-
tion value to 0 will cause that unit’s weights not to be updated on that iteration.
When we are done training and want to use the network to make predictions, we mul-
tiply all weights by p to achieve the same average activation levels.
Implementing dropout is easy! In the forward pass during training, we let
aℓ = f(zℓ ) ∗ dℓ
where ∗ denotes component-wise product and dℓ is a vector of 0’s and 1’s drawn randomly
with probability p. The backwards pass depends on aℓ , so we do not need to make any
further changes to the algorithm.
It is common to set p to 0.5, but this is something one might experiment with to get
good results on your problem and data.
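A sketch of the training-time forward pass with dropout; the helper name is ours, and we treat p here as the drop probability (an assumption, since conventions differ):

```python
import numpy as np

def dropout_forward(z, f, p=0.5):
    """Apply activation f, then independently zero each unit; d is drawn fresh for every training example."""
    a = f(z)
    d = (np.random.rand(*a.shape) > p).astype(float)   # 1 keeps the unit, 0 drops it
    return a * d                                       # component-wise product a * d

z = np.random.randn(5, 1)
print(dropout_forward(z, np.tanh, p=0.5))
```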
values is changing over time as the first layer’s weights change. Learning when the input
distribution is changing is extra difficult: you have to change your weights to improve your
predictions, but also just to compensate for a change in your inputs (imagine, for instance,
that the magnitude of the inputs to your layer is increasing over time—then your weights
will have to decrease, just to keep your predictions the same).
So, when training with mini-batches, the idea is to standardize the input values for each
mini-batch, just in the way that we did it in Section 5.3.3 of Chapter 5, subtracting off the
mean and dividing by the standard deviation of each input dimension. This means that the
scale of the inputs to each layer remains the same, no matter how the weights in previous
layers change. However, this somewhat complicates matters, because the computation of
the weight updates will need to take into account that we are performing this transforma-
tion. In the modular view, batch normalization can be seen as a module that is applied to
zl, interposed after the product with Wl and before input to fl. (We follow here the
suggestion from the original paper of applying batch normalization before the activation
function. Since then it has been shown that, in some cases, applying it after works a bit
better. But there aren't any definite findings on which works better and when.)
Although batch-norm was originally justified based on the problem of covariate shift,
it's not clear that that is actually why it seems to improve performance. Batch normalization
can also end up having a regularizing effect for similar reasons that adding noise and
dropout do: each mini-batch of data ends up being mildly perturbed, which prevents the
network from exploiting very particular values of the data points. For those interested, the
equations for batch normalization, including a derivation of the forward pass and backward
pass, are described in Appendix B.0.2.
So far, we have studied what are called fully connected neural networks, in which all of the
units at one layer are connected to all of the units in the next layer. This is a good arrange-
ment when we don’t know anything about what kind of mapping from inputs to outputs
we will be asking the network to learn to approximate. But if we do know something about
our problem, it is better to build it into the structure of our neural network. Doing so can
save computation time and significantly diminish the amount of training data required to
arrive at a solution that generalizes robustly.
One very important application domain of neural networks, where the methods have
achieved an enormous amount of success in recent years, is signal processing. Signals
might be spatial (in two-dimensional camera images or three-dimensional depth or CAT
scans) or temporal (speech or music). If we know that we are addressing a signal-processing
problem, we can take advantage of invariant properties of that problem. In this chapter, we
will focus on two-dimensional spatial problems (images) but use one-dimensional ones as
a simple example. In a later chapter, we will address temporal problems.
Imagine that you are given the problem of designing and training a neural network that
takes an image as input, and outputs a classification, which is positive if the image contains
a cat and negative if it does not. An image is described as a two-dimensional array of pixels
(a pixel is a "picture element"), each of which may be represented by three integer values,
encoding intensity levels in red, green, and blue color channels.
There are two important pieces of prior structural knowledge we can bring to bear on
this problem:
• Spatial locality: The set of pixels we will have to take into consideration to find a cat
will be near one another in the image. (So, for example, we won't have to consider
some combination of pixels in the four corners of the image, in order to see if they
encode cat-ness.)
• Translation invariance: The pattern of pixels that characterizes a cat is the same no
matter where in the image the cat occurs.
We will design neural network structures that take advantage of these properties.
7.1 Filters
We begin by discussing image filters. An image filter is a function that takes in a local spatial
neighborhood of pixel values and detects the presence of some pattern in that data.
(Unfortunately, in AI/ML/CS/Math the word "filter" gets used in many ways: in addition
to the one we describe here, it can describe a temporal process (in fact, our moving averages
are a kind of filter) and even a somewhat esoteric algebraic structure.)
Let's consider a very simple case to start, in which we have a 1-dimensional binary
"image" and a filter F of size two. The filter is a vector of two numbers, which we will
move along the image, taking the dot product between the filter values and the image
values at each step, and aggregating the outputs to produce a new image.
Let X be the original image, of size d; then pixel i of the output image is specified by
$$Y_i = F \cdot (X_{i-1}, X_i)\,.$$
To ensure that the output image is also of dimension d, we will generally “pad” the input
image with 0 values if we need to access pixels that are beyond the bounds of the input
image. This process of applying the filter to the image to create a new image is called
"convolution." (Filters are also sometimes called convolutional kernels.)
If you are already familiar with what a convolution is, you might notice that this def-
inition corresponds to what is often called a correlation and not to a convolution. In-
deed, correlation and convolution refer to different operations in signal processing. How-
ever, in the neural networks literature, most libraries implement the correlation (as de-
scribed in this chapter) but call it convolution. The distinction is not significant; in prin-
ciple, if convolution is required to solve the problem, the network could learn the nec-
essary weights. For a discussion of the difference between convolution and correlation
and the conventions used in the literature you can read Section 9.1 in this excellent book:
https://www.deeplearningbook.org.
Here is a concrete example. Let the filter F1 = (−1, +1). Then given the image in the first
line below, we can convolve it with filter F1 to obtain the second image. You can think of
this filter as a detector for “left edges” in the original image—to see this, look at the places
where there is a 1 in the output image, and see what pattern exists at that position in the
input image. Another interesting filter is F2 = (−1, +1, −1). The third image (the last line
below) shows the result of convolving the first image with F2 , where we see that the output
pixel i corresponds to when the center of F2 is aligned at input pixel i.
Study Question: Convince yourself that filter F2 can be understood as a detector for
isolated positive pixels in the binary image.
Image:  0 0 1 1 1 0 1 0 0 0
F1:     -1 +1
F2:     -1 +1 -1
[Figure: the original image and the two output images produced by convolving it with F1 and with F2.]
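You can check the filter outputs yourself with a few lines of code; this sketch (names ours) applies F1 and F2 to the image above using zero padding, computing the correlation as described in the text:

```python
import numpy as np

def filter_1d(X, F, center):
    """Correlate filter F with image X, zero-padded; `center` is the index of F aligned with output pixel i."""
    Xp = np.concatenate([np.zeros(center), X, np.zeros(len(F) - 1 - center)])
    return np.array([np.dot(F, Xp[i:i + len(F)]) for i in range(len(X))])

X  = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0], dtype=float)
F1 = np.array([-1, +1], dtype=float)       # Y_i = F1 . (X_{i-1}, X_i): an output of 1 marks a left edge
F2 = np.array([-1, +1, -1], dtype=float)   # centered on pixel i: an output of 1 marks an isolated positive pixel

print(filter_1d(X, F1, center=1))   # [ 0.  0.  1.  0.  0. -1.  1. -1.  0.  0.]
print(filter_1d(X, F2, center=1))   # [ 0. -1.  0. -1.  0. -2.  1. -1.  0.  0.]
```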
Two-dimensional versions of filters like these are thought to be found in the visual
cortex of all mammalian brains. Similar patterns arise from statistical analysis of natural
images. Computer vision people used to spend a lot of time hand-designing filter banks. A
filter bank is a set of sets of filters, arranged as shown in the diagram below.
[Figure: a filter bank: successive groups of filters applied to the image and to the resulting channels.]
All of the filters in the first group are applied to the original image; if there are k such
filters, then the result is k new images, which are called channels. Now imagine stacking
all these new images up so that we have a cube of data, indexed by the original row and
column indices of the image, as well as by the channel. The next set of filters in the filter
bank will generally be three-dimensional: each one will be applied to a sub-range of the row
and column indices of the image and to all of the channels.
These 3D chunks of data are called tensors. The algebra of tensors is fun, and a lot like
matrix algebra, but we won't go into it in any detail. (There are now many useful
neural-network software packages, such as TensorFlow and PyTorch, that make operations
on tensors easy.)
Here is a more complex example of two-dimensional filtering. We have two 3 × 3 filters
in the first layer, f1 and f2. You can think of each one as "looking" for three pixels in a
row, f1 vertically and f2 horizontally. Assuming our input image is n × n, then the result
of filtering with these two filters is an n × n × 2 tensor. Now we apply a tensor filter
(hard to draw!) that "looks for" a combination of two horizontal and two vertical bars
(now represented by individual pixels in the two channels), resulting in a single final n × n
image. (When we have a color image as input, we treat it as having three channels, and
hence as an n × n × 3 tensor.)
[Figure: filters f1 and f2 applied to the image, followed by a tensor filter applied to the resulting two-channel tensor.]
We are going to design neural networks that have this structure. Each “bank” of the
filter bank will correspond to a neural-network layer. The numbers in the individual filters
will be the “weights” (plus a single additive bias or offset value for each filter) of the net-
work, that we will train using gradient descent. What makes this interesting and powerful
(and somewhat confusing at first) is that the same weights are used many many times in
the computation of each layer. This weight sharing means that we can express a transforma-
tion on a large image with relatively few parameters; it also means we’ll have to take care
in figuring out exactly how to train it!
(For simplicity, we are assuming that all images and filters are square, having the same
number of rows and columns. That is in no way necessary, but is usually fine and definitely
simplifies our notation.)
We will define a filter layer l formally with:
• number of filters ml;
• size of one filter is kl × kl × ml−1 plus 1 bias value (for this one filter);
• stride sl is the spacing at which we apply the filter to the image; in all of our examples
so far, we have used a stride of 1, but if we were to "skip" and apply the filter only at
odd-numbered indices of the image, then it would have a stride of two (and produce
a resulting image of half the size);
• padding: pl is how many extra pixels – typically with value 0 – we add around the
edges of the input. For an input of size nl−1 × nl−1 × ml−1 , our new effective input
size with padding becomes (nl−1 + 2 · pl ) × (nl−1 + 2 · pl ) × ml−1 .
Study Question: If we used a fully-connected layer with the same size inputs and
outputs, how many weights would it have?
1 Recall that ⌈·⌉ is the ceiling function; it returns the smallest integer greater than or equal to its input. E.g., ⌈2.5⌉ = 3 and ⌈3⌉ = 3.
As a result of applying a max pooling layer, we don’t keep track of the precise location of a
pattern. This helps our filters to learn to recognize patterns independent of their location.
Consider a max pooling layer where both the strides and k are set to be 2. This would
map a 64 × 64 × 3 image to a 32 × 32 × 3 image. Note that max pooling layers do not have
additional bias or offset values.
Study Question: Maximilian Poole thinks it would be a good idea to add two max
pooling layers of size k, one right after the other, to their network. What single layer
would be equivalent?
One potential concern about max-pooling layers is that they actually don’t completely
preserve translation invariance. If you do max-pooling with a stride other than 1 (or just
pool over the whole image size), then shifting the pattern you are hoping to detect within
the image by a small amount can change the output of the max-pooling layer substan-
tially, just because there are discontinuities induced by the way the max-pooling window
matches up with its input image. Here is an interesting paper (https://arxiv.org/pdf/1904.11486.pdf)
that illustrates this phenomenon clearly and suggests that one should first do max-pooling
with a stride of 1, then do "downsampling" by averaging over a window of outputs.
[Figure: a typical convolutional network architecture. The "depth" dimension in the layers shown as cuboids corresponds to the number of channels in the output tensor. (Figure source: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html)]
At the end of each filter layer, we typically apply a ReLU activation function. There
may be multiple filter plus ReLU layers. Then we have a max pooling layer. Then we have
some more filter + ReLU layers. Then we have max pooling again. Once the output is
down to a relatively small size, there is typically a last fully-connected layer, leading into
an activation function such as softmax that produces the final output. The exact design of
these structures is an art—there is not currently any clear theoretical (or even systematic
empirical) understanding of how these various design choices affect overall performance
of the network.
The critical point for us is that this is all just a big neural network, which takes an input
and computes an output. The mapping is a differentiable function of the weights, which
means we can adjust the weights to decrease the loss by performing gradient descent, and
we can compute the relevant gradients using back-propagation! (Well, technically the
derivative does not exist at every point, both because of the ReLU and the max pooling
operations, but we ignore that fact.)
[Figure: a simple network: a conv layer with weights W1, followed by a ReLU, followed by a fully-connected (fc) layer with weights W2 whose output is Z2 = A2.]
For simplicity assume k is odd, let the input image X = A0 , and assume we are using
squared loss. Then we can describe the forward pass as follows:
$$Z_i^1 = (W^1)^T A^0_{[i - \lfloor k/2 \rfloor \,:\, i + \lfloor k/2 \rfloor]}$$
$$A^1 = \text{ReLU}(Z^1)$$
$$A^2 = Z^2 = (W^2)^T A^1$$
$$\mathcal{L}_\text{square}(A^2, y) = (A^2 - y)^2$$
Study Question: Assuming a stride of 1, for a filter of size k, how much padding
do we need to add to the top and bottom of the image? We see one zero at the top
and bottom in the figure just above; what filter size is implicitly being shown in the
figure? (Recall the padding is for the sake of getting an output the same size as the
input.)
y = max(a1 , a2 ) ,
where a1 and a2 are each computed by some network. Consider doing back-propagation
through the maximum. First consider the case where a1 > a2 . Then the error value at y
is propagated back entirely to the network computing the value a1 . The weights in the
network computing a1 will ultimately be adjusted, and the network computing a2 will be
untouched.
Study Question: What is ∇(x,y) max(x, y) ?
Transformers
Transformers are a very recent family of architectures that have revolutionized fields like
natural language processing (NLP), image processing, and multi-modal generative AI.
Transformers were originally introduced in the field of NLP in 2017, as an approach
to process and understand human language. Human language is inherently sequential in
nature (e.g., characters form words, words form sentences, and sentences form paragraphs
and documents). Prior to the advent of the transformers architecture, recurrent neural net-
works (RNNs) briefly dominated the field for their ability to process sequential information
(RNNs are described in Appendix C for reference). However, RNNs, like many other ar-
chitectures, processed sequential information in an iterative/sequential fashion, whereby
each item of a sequence was individually processed one after another. Transformers offer
many advantages over RNNs, including their ability to process all items in a sequence in a
parallel fashion (as do CNNs).
Like CNNs, transformers factorize the signal processing problem into stages that in-
volve independent and identically processed chunks. However, they also include layers
that mix information across the chunks, called attention layers, so that the full pipeline can
model dependencies between the chunks.
In this chapter, we describe transformers from the bottom up. We start with the idea
of embeddings and tokens (Section 8.1). We then describe the attention mechanism (Sec-
tion 8.2). And finally we then assemble all these ideas together to arrive at the full trans-
former architecture in Section 8.3.
each vector being slightly different based on its context in the sentence and story at large.
To measure how similar any two word embeddings are (in terms of their numeric val-
ues) it is common to use cosine similarity as the metric:
$$\frac{u^T v}{\lvert u \rvert\, \lvert v \rvert} = \cos \langle u, v \rangle\,, \tag{8.1}$$
where |u| and |v| are the lengths of the vectors, and < u, v > is the angle between u and v.
The cosine similarity is +1 when u = v, zero when the two vectors are perpendicular to
each other, and −1 when the two vectors are diametrically opposed to each other. Thus,
higher values correspond to vectors that are numerically more similar to each other.
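For instance, a direct translation of Eq. 8.1 into code (ours):

```python
import numpy as np

def cosine_similarity(u, v):
    """Returns a value in [-1, 1]; +1 means the vectors point in the same direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(u, u))                             # 1.0
print(cosine_similarity(u, -u))                            # -1.0
print(cosine_similarity(u, np.array([2.0, -1.0, 0.0])))    # 0.0: perpendicular vectors
```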
While word embeddings – and various approaches to create them – have existed for
decades, the first approach that produced astonishingly effective word embeddings was
word2vec in 2012. This revolutionary approach was the first highly-successful approach of
applying deep learning to NLP, and it enabled all subsequent progress in the field, includ-
ing Transformers. The details of word2vec are beyond the scope of this course, but we note
two facts: (1) it created a single word embedding for each distinct word in the training cor-
pus (not on a per-occurrence basis); (2) it produced word embeddings that were so useful,
many relationships between the vectors corresponded with real-world semantic related-
ness. For example, when using Euclidean distance as a distance metric between two vectors,
word2vec produced word embeddings with properties such as (where vword is the vector
for word):
$$v_\text{Paris} - v_\text{France} \approx v_\text{Rome} - v_\text{Italy}\,.$$
This corresponds with the real-world property that Paris is to France what Rome is to
Italy. This incredible finding existed not only for geographic words but all sorts of real-
world concepts in the vocabulary. Nevertheless, to some extent, the exact values in each
embedding are arbitrary, and what matters most is the holistic relation between all embed-
dings, along with how performant/useful they are for the exact task that we care about.
For example, an embedding may be considered good if it accurately captures the con-
ditional probability for a given word to appear next in a sequence of words. You probably
have a good idea of what words might typically fill in the blank at the end of this sentence:
Or a model could be built that tries to correctly predict words in the middle of sentences:
The model can be built by minimizing a loss function that penalizes incorrect word guesses,
and rewards correct ones. This is done by training a model on a very large corpus of written
material, such as all of Wikipedia, or even all the accessible digitized written materials
produced by humans.
While we will not dive into the full details of tokenization, the high-level idea is straight-
forward: the individual inputs of data that are represented and processed by a model are
referred to as tokens. And, instead of processing each word as a whole, words are typically
split into smaller, meaningful pieces (akin to syllables). Thus, when we refer to tokens,
know that we’re referring to each individual input, and that in practice, nowadays, they
tend to be sub-words (e.g., the word ‘talked’ may be split into two tokens, ‘talk’ and ‘ed’).
This vector-based lookup mechanism has come to be known as “attention” in the sense
that p(k|q) is a conditional probability distribution that says how much attention should
be given to the key kj for a given query q.
In other words, the conditional probability distribution p(k|q) gives the “attention weights,”
and the weighted average value
$$\sum_{j} p(k_j \mid q)\, v_j \tag{8.3}$$
$$A = \begin{bmatrix}
\text{softmax}\!\left(\left[\, q_1^\top k_1,\; q_1^\top k_2,\; \cdots,\; q_1^\top k_{n_k} \,\right] / \sqrt{d_k}\right) \\
\text{softmax}\!\left(\left[\, q_2^\top k_1,\; q_2^\top k_2,\; \cdots,\; q_2^\top k_{n_k} \,\right] / \sqrt{d_k}\right) \\
\vdots \\
\text{softmax}\!\left(\left[\, q_{n_q}^\top k_1,\; q_{n_q}^\top k_2,\; \cdots,\; q_{n_q}^\top k_{n_k} \,\right] / \sqrt{d_k}\right)
\end{bmatrix} \tag{8.4}$$
The scaling by √dk is done to reduce the magnitude of the dot products, which would
otherwise grow undesirably large with increasing dk, making it difficult for (overall)
training.
Let αij be the entry in ith row and jth column in the attention matrix A. Then αij helps
answer the question "which tokens x(j) help the most with predicting the corresponding
output token y(i) ?" The attention output is given by a weighted sum over the values:
$$y^{(i)} = \sum_{j=1}^{n} \alpha_{ij} v_j$$
Comparing this self-attention matrix with the attention matrix described in Equation
8.2, we notice the only difference lies in the dimensions: since in self-attention, the query,
key, and value all come from the same input, we have nq = nk = nv , and we often denote
all three with a unified n.
The self-attention output is then given by a weighted sum over the values:
$$y^{(i)} = \sum_{j=1}^{n} \alpha_{ij} v_j$$
This diagram below shows (only) the middle input token generating a query that is then
combined with the keys computed with all tokens to generate the attention weights via a
softmax. The output of the softmax is then combined with values computed from all to-
kens, to generate the attention output corresponding to the middle input token. Repeating
this for each input token then generates the output.
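Here is a compact sketch of one self-attention head, following the equations above; the weight matrices are randomly initialized here purely for illustration, and the names are ours.

```python
import numpy as np

def softmax_rows(M):
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: n x d token embeddings; Wq, Wk: d x d_k; Wv: d x d_v."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[1]
    A = softmax_rows(Q @ K.T / np.sqrt(d_k))   # n x n attention weights; each row sums to 1
    return A @ V                               # each output token is a weighted sum of the values

n, d, d_k = 4, 8, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Y = self_attention(X, rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k)))
print(Y.shape)   # (4, 8): the same number of tokens out as in
```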
Study Question: We have five colored tokens in the diagram above (gray, blue, orange,
green, red). Could you read off the diagram the correspondence between the color and
input, query, key, value, output?
Note that the size of the output is the same as the size of the input. Also, observe that
there is no apparent notion of ordering of the input words in the depicted structure. Posi-
tional information can be added by encoding a number for each token (giving, say, the token's
position relative to the start of the sequence) into the vector embedding of each token.
And note that a given query need not pay attention to all other tokens in the input; in this
example, the token used for the query is not used for a key or value.
More generally, a mask may be applied to limit which tokens are used in the attention
computation. For example, one common mask limits the attention computation to tokens
that occur previously in time to the one being used for the query. This prevents the atten-
tion mechanism from “looking ahead” in scenarios where the transformer is being used to
generate one token at a time.
Each self-attention stage is trained to have key, value, and query embeddings that lead
it to pay specific attention to some particular feature of the input. We generally want to
pay attention to many different kinds of features in the input; for example, in translation
one feature might be the verbs, and another might be objects or subjects. A transformer
utilizes multiple instances of self-attention, each known as an “attention head,” to allow
combinations of attention paid to many different features.
8.3 Transformers
A transformer is the composition of a number of transformer blocks, each of which has
multiple attention heads. At a very high-level, the goal of a transformer block is to output
a really rich, useful representation for each input token, all for the sake of being high-
performant for whatever task the model is trained to learn.
Rather than depicting the transformer graphically, it is worth returning to the beauty of
the underlying equations1 .
Q = XWq
K = XWk
V = XWv
This Q, K, V triple can then be used to produce one (self-)attention-layer output. One such
layer is called one "attention head."
One can have more than one "attention head," such that the queries, keys, and values
for head h are embedded via encoding matrices Wh,q, Wh,k, Wh,v ∈ Rd×dk, where dk is the
size of the key/query embedding space, and h ∈ {1, · · · , H} is an index over "attention
heads." (For each attention head h, we learn one set of Wh,q, Wh,k, Wh,v.)
We then perform a weighted sum over all the outputs for each head,
$$u'^{(i)} = \sum_{h=1}^{H} \sum_{j=1}^{n} W_{h,c}^T\, \alpha_{ij}^{(h)} V_j^{(h)}\,, \tag{8.9}$$
where Wh,c ∈ Rdk×d, u′(i) ∈ Rd×1, and the indices i ∈ {1, · · · , n} and j ∈ {1, · · · , n} are
integer indices over tokens. (Vj(h) is the dk × 1 value embedding vector that corresponds to
the input token xj for attention head h.)
This is then standardized and combined with x(i) using a LayerNorm function (defined
below) to become
$$u^{(i)} = \text{LayerNorm}\!\left(x^{(i)} + u'^{(i)};\; \gamma_1, \beta_1\right) \tag{8.10}$$
with parameters γ1, β1 ∈ Rd.
1 The presentation here follows the notes by John Thickstun.
To get the final output, we follow the “intermediate output then layer norm" recipe
again. In particular, we first get the transformer block output z ′ (i) given by
$$z'^{(i)} = W_2^T\, \text{ReLU}\!\left(W_1^T u^{(i)}\right) \tag{8.11}$$
with weights W1 ∈ Rd×m and W2 ∈ Rm×d. This is then standardized and combined with
u(i) to give the final output z(i):
$$z^{(i)} = \text{LayerNorm}\!\left(u^{(i)} + z'^{(i)};\; \gamma_2, \beta_2\right)\,, \tag{8.12}$$
with parameters γ2 , β2 ∈ Rd . These vectors are then assembled (e.g., through parallel
computation) to produce z ∈ Rn×d .
The LayerNorm function transforms a $d$-dimensional input $z$ with parameters $\gamma, \beta \in \mathbb{R}^d$ into
$$\text{LayerNorm}(z; \gamma, \beta) = \gamma \frac{z - \mu_z}{\sigma_z} + \beta, \tag{8.13}$$
where $\mu_z$ is the mean and $\sigma_z$ the standard deviation of $z$:
$$\mu_z = \frac{1}{d} \sum_{i=1}^{d} z_i \tag{8.14}$$
$$\sigma_z = \sqrt{\frac{1}{d} \sum_{i=1}^{d} (z_i - \mu_z)^2} \,. \tag{8.15}$$
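To make Eqs. 8.9–8.15 concrete, here is a minimal numpy sketch of one transformer block: multi-head self-attention, then the two-layer MLP, each followed by a residual connection and LayerNorm. The function names, the use of softmax-normalized scaled dot products for the attention weights α, and the small numerical-stability constant are assumptions of this sketch, not the notes' reference implementation.

import numpy as np

def softmax(z):
    # softmax along the last axis, shifted for numerical stability
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(z, gamma, beta, eps=1e-5):
    # Eqs. 8.13-8.15: standardize each d-dimensional vector, then scale and shift
    mu = z.mean(axis=-1, keepdims=True)
    sigma = z.std(axis=-1, keepdims=True)
    return gamma * (z - mu) / (sigma + eps) + beta

def transformer_block(X, heads, W_c, W1, W2, gamma1, beta1, gamma2, beta2):
    # X: n x d matrix of input token embeddings
    # heads: list of (W_q, W_k, W_v) triples, each d x d_k
    # W_c: list of d_k x d output matrices, one per head (Eq. 8.9)
    n, d = X.shape
    U_prime = np.zeros((n, d))
    for (W_q, W_k, W_v), W_ch in zip(heads, W_c):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v           # n x d_k each
        d_k = Q.shape[1]
        alpha = softmax(Q @ K.T / np.sqrt(d_k))        # n x n attention weights
        U_prime += (alpha @ V) @ W_ch                  # sum over heads, Eq. 8.9
    U = layer_norm(X + U_prime, gamma1, beta1)         # Eq. 8.10
    Z_prime = np.maximum(0, U @ W1) @ W2               # Eq. 8.11 (ReLU MLP)
    return layer_norm(U + Z_prime, gamma2, beta2)      # Eq. 8.12

Each head has its own $(W_{h,q}, W_{h,k}, W_{h,v})$ triple and its own output matrix $W_{h,c}$, matching the sum over heads in Eq. 8.9.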
model was trained to predict the masked words. BERT was also trained on sequences of
sentences, where the model was trained to predict whether two sentences are likely to be
contextually close together or not. The pre-training stage is generally very expensive.
The second “fine-tuning” stage trains the model for a specific task, such as classification
or question answering. This training stage can be relatively inexpensive, but it generally
requires labeled data.
Non-parametric methods
Neural networks have adaptable complexity, in the sense that we can try different struc-
tural models and use cross validation to find one that works well on our data. Beyond
neural networks, we may further broaden the class of models that we can fit to our data,
for example as illustrated by the techniques introduced in this chapter.
Here, we turn to models that automatically adapt their complexity to the training data.
The name non-parametric methods is misleading: it is really a class of methods that does not
have a fixed parameterization in advance. Rather, the complexity of the parameterization
can grow as we acquire more data.
Some non-parametric models, such as nearest-neighbor, rely directly on the data to make predictions and do not compute a model that summarizes the data. Other non-parametric methods, such as decision trees, can be seen as dynamically constructing something that ends up looking like a more traditional parametric model, but where the actual training data affects exactly what the form of the model will be. (These are sometimes called classification trees; the decision-analysis literature uses "decision tree" for a structure that lays out possible future events that consist of choices interspersed with chance nodes.)
The non-parametric methods we consider here tend to have the form of a composition of simple models:
• Nearest neighbor models: (Section 9.1) where we don't process data at training time, but do all the work when making predictions, by looking for the closest training example(s) to a given new data point.
• Tree models: (Section 9.2) where we partition the input space and use different sim-
ple predictions on different regions of the space; the hypothesis space can become
arbitrarily large allowing finer and finer partitions of the input space.
• Ensemble models: (Section 9.2.3) in which we train several different classifiers on the
whole space and average the answers; this decreases the estimation error. In particu-
lar, we will look at bootstrap aggregation, or bagging of trees.
• Boosting is a way to construct a model composed of a sequence of component models
(e.g., a model consisting of a sequence of trees, each subsequent tree seeking to correct
errors in the previous trees) that decreases both estimation and structural error. We
won’t consider this in detail in this class.
Why are we studying these methods, in the heyday of complicated models such as
neural networks?
$$d(x, x) = 0, \qquad d(x, x') = d(x', x), \qquad d(x, x'') \leq d(x, x') + d(x', x'') \,.$$
Given a query point $x$, the nearest-neighbor rule predicts $y^{(i)}$, where $i = \arg\min_j d(x^{(j)}, x)$; that is, the predicted output associated with the training point that is closest to the query point $x$. Tie breaking is typically done at random.
This same algorithm works for regression and classification!
The nearest neighbor prediction function can be described by dividing the space up into regions whose closest point is each individual training point, as shown below. (Decision boundary regions can also be described by Voronoi diagrams. In a Voronoi diagram, each of the data points would have its own "cell" or region in the space that is closest to the data point in question. In the diagram provided here, cells have been merged if the predicted value is the same in adjacent cells.)
fit locally linear regression models to the k nearest points, possibly giving less weight to
those that are farther away. In large data-sets, it is important to use good data structures
(e.g., ball trees) to perform the nearest-neighbor look-ups efficiently (without looking at all
the data points each time).
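As a small illustration of the nearest-neighbor idea (brute force, without ball trees), here is a hedged numpy sketch; the function name, the Euclidean distance, and the tie-breaking behavior of argsort are choices made for the sketch.

import numpy as np

def knn_predict(X_train, y_train, x_query, k=1, classify=False):
    # Euclidean distances from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]           # indices of the k closest training points
    if classify:
        values, counts = np.unique(y_train[nearest], return_counts=True)
        return values[np.argmax(counts)]      # majority vote (ties broken arbitrarily)
    return y_train[nearest].mean()            # average of neighbors for regression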
• The class of possible ways to split the space at each node; these are typically linear
splits, either aligned with the axes of the space, or sometimes using more general
classifiers.
• The class of predictors within the partitions; these are often simply constants, but
may be more general classification or regression models.
• The way in which we control the complexity of the hypothesis: it would be within
the capacity of these methods to have a separate partition element for each individual
training example.
• The algorithm for making the partitions and fitting the models.
One advantage of tree models is that they are easily interpretable by humans. This is
important in application domains, such as medicine, where there are human experts who
often ultimately make critical decisions and who need to feel confident in their under-
standing of recommendations made by an algorithm. Below is an example decision tree,
illustrating how one might be able to understand the decisions made by the tree.
Example: Here is a sample tree (reproduced from Breiman, Friedman, Olshen, Stone
(1984)):
[Figure: a decision tree with yes/no internal nodes; one node asks "Is sinus tachycardia present?" and one leaf is labeled "low risk".]
These methods are most appropriate for domains where the input space is not very
high-dimensional and where the individual input features have some substantially useful
information individually or in small groups. Trees would not be good for image input,
but might be good in cases with, for example, a set of meaningful measurements of the
condition of a patient in the hospital, as in the example above.
We’ll concentrate on the CART/ID3 (“classification and regression trees” and “iterative
dichotomizer 3”, respectively) family of algorithms, which were invented independently
in the statistics and the artificial intelligence communities. They work by greedily con-
structing a partition, where the splits are axis aligned and by fitting a constant model in the
leaves. The interesting questions are how to select the splits and how to control complexity.
The regression and classification versions are very similar.
As a concrete example, consider the following images:
The left image depicts a set of labeled data points in a two-dimensional feature space. The
right shows a partition into regions by a decision tree, in this case having no classification
errors in the final partitions.
9.2.1 Regression
The predictor is made up of
• a partition function, π, mapping elements of the input space into exactly one of M regions, R1, . . . , RM, and
• a collection of M output values, Om, one for each region.
If we already knew a division of the space into regions, we would set $O_m$, the constant output for region $R_m$, to be the average of the training output values in that region. For a training data set $\mathcal{D} = \left\{\left(x^{(i)}, y^{(i)}\right)\right\},\ i = 1, \dots, n$, we let $I$ be an indicator set of all of the elements within $\mathcal{D}$, so that $I = \{1, \dots, n\}$ for our whole data set. We can define $I_m$ as the subset of data set samples that are in region $R_m$, so that $I_m = \{i \mid x^{(i)} \in R_m\}$. Then
$$O_m = \text{average}_{i \in I_m}\, y^{(i)} \,.$$
We can define the error in a region as $E_m$. For example, $E_m$ as the sum of squared error would be expressed as
$$E_m = \sum_{i \in I_m} \left(y^{(i)} - O_m\right)^2 \,. \tag{9.2}$$
Ideally, we would select the partition so as to minimize
$$\lambda M + \sum_{m=1}^{M} E_m, \tag{9.3}$$
for some regularization constant $\lambda$. It is enough to search over all partitions of the training data (not all partitions of the input space!) to optimize this, but the problem is NP-complete.
Study Question: Be sure you understand why it’s enough to consider all partitions
of the training data, if this is your objective.
• $I^+_{j,s}$ indicates the set of examples (subset of $I$) whose feature value in dimension $j$ is greater than or equal to split point $s$;
• $I^-_{j,s}$ indicates the set of examples (subset of $I$) whose feature value in dimension $j$ is less than $s$;
• $\hat{y}^+_{j,s}$ is the average $y$ value of the data points indicated by set $I^+_{j,s}$; and
• $\hat{y}^-_{j,s}$ is the average $y$ value of the data points indicated by set $I^-_{j,s}$.
Here is the pseudocode. In what follows, k is the largest leaf size that we will allow in
the tree, and is a hyperparameter of the algorithm.
BUILDTREE(I, k)
1   if |I| ⩽ k
2       Set ŷ = average_{i∈I} y^{(i)}
3       return LEAF(value = ŷ)
4   else
5       for each split dimension j and split value s
6           Set I⁺_{j,s} = {i ∈ I | x_j^{(i)} ⩾ s}
7           Set I⁻_{j,s} = {i ∈ I | x_j^{(i)} < s}
8           Set ŷ⁺_{j,s} = average_{i∈I⁺_{j,s}} y^{(i)}
9           Set ŷ⁻_{j,s} = average_{i∈I⁻_{j,s}} y^{(i)}
10          Set E_{j,s} = Σ_{i∈I⁺_{j,s}} (y^{(i)} − ŷ⁺_{j,s})² + Σ_{i∈I⁻_{j,s}} (y^{(i)} − ŷ⁻_{j,s})²
11      Set (j*, s*) = arg min_{j,s} E_{j,s}
12      return NODE(j*, s*, BUILDTREE(I⁻_{j*,s*}, k), BUILDTREE(I⁺_{j*,s*}, k))
In practice, we typically start by calling B UILD T REE with the first input equal to our
whole data set (that is, with I = {1, . . . , n}). But then that call of B UILD T REE can recursively
lead to many other calls of B UILD T REE.
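Here is a direct, unoptimized Python sketch of BUILDTREE (an illustration, not the official implementation); leaves and internal nodes are represented as tuples, and candidate split values are taken from the training data, as discussed below.

import numpy as np

def build_tree(X, y, I, k):
    # X: n x d array, y: length-n array, I: list of training indices, k: largest leaf size
    if len(I) <= k:
        return ('leaf', y[I].mean())
    best = None
    for j in range(X.shape[1]):                        # every split dimension
        for s in np.unique(X[I, j]):                   # candidate split values in dimension j
            I_plus  = [i for i in I if X[i, j] >= s]
            I_minus = [i for i in I if X[i, j] < s]
            if not I_plus or not I_minus:
                continue                               # skip degenerate splits
            err = (((y[I_plus]  - y[I_plus].mean())  ** 2).sum()
                   + ((y[I_minus] - y[I_minus].mean()) ** 2).sum())
            if best is None or err < best[0]:
                best = (err, j, s, I_minus, I_plus)
    if best is None:                                   # no valid split (e.g., duplicate points)
        return ('leaf', y[I].mean())
    _, j, s, I_minus, I_plus = best
    return ('node', j, s, build_tree(X, y, I_minus, k), build_tree(X, y, I_plus, k))

def tree_predict(tree, x):
    if tree[0] == 'leaf':
        return tree[1]
    _, j, s, left, right = tree
    return tree_predict(right if x[j] >= s else left, x)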
Let’s think about how long each call of B UILD T REE takes to run. We have to consider
all possible splits. So we consider a split in each of the d dimensions. In each dimension,
we only need to consider splits between two data points (any other split will give the same
error on the training data). So, in total, we consider O(dn) splits in each call to B UILD T REE.
Study Question: Concretely, what would be a good set of split-points to consider for
dimension j of a data set indicated by I?
9.2.1.2 Pruning
It might be tempting to regularize by using a somewhat large value of k, or by stopping
when splitting a node does not significantly decrease the error. One problem with short-
sighted stopping criteria is that they might not see the value of a split that will require one
more split before it seems useful.
Study Question: Apply the decision-tree algorithm to the XOR problem in two di-
mensions. What is the training-set error of all possible hypotheses based on a single
split?
So, we will tend to build a tree that is too large, and then prune it back.
We define the cost complexity of a tree $T$, where $m$ ranges over its leaves, as
$$C_\alpha(T) = \sum_{m=1}^{|T|} E_m(T) + \alpha |T|, \tag{9.4}$$
where $|T|$ is the number of leaves. For a fixed $\alpha$, we can find a $T$ that (approximately) minimizes $C_\alpha(T)$ by "weakest-link" pruning:
• Create a sequence of trees by successively removing the bottom-level split that mini-
mizes the increase in overall error, until the root is reached.
• Return the T in the sequence that minimizes the cost complexity.
We can choose an appropriate α using cross validation.
9.2.2 Classification
The strategy for building and pruning classification trees is very similar to the strategy for
regression trees.
Given a region Rm corresponding to a leaf of the tree, we would pick the output class
y to be the value that exists most frequently (the majority value) in the data points whose x
values are in that region, i.e., data points indicated by Im :
Om = majorityi∈Im y(i) .
Let's now define the error in a region as the number of data points that do not have the value $O_m$:
$$E_m = \left|\left\{i \mid i \in I_m \text{ and } y^{(i)} \neq O_m\right\}\right| \,.$$
We define the empirical probability of an item from class $k$ occurring in region $m$ as:
$$\hat{P}_{m,k} = \hat{P}(I_m, k) = \frac{\left|\left\{i \mid i \in I_m \text{ and } y^{(i)} = k\right\}\right|}{N_m},$$
where Nm is the number of training points in region m; that is, Nm = |Im |. For later use,
we’ll also define the empirical probabilities of split values, P̂m,j,s , as the fraction of points
with dimension j in split s occurring in region m (one branch of the tree), and 1 − P̂m,j,s as
the complement (the fraction of points in the other branch).
Splitting criteria In our greedy algorithm, we need a way to decide which split to make
next. There are many criteria that express some measure of the “impurity” in child nodes.
Some measures include:
• Misclassification error:
$$Q_m(T) = \frac{E_m}{N_m} = 1 - \hat{P}_{m,O_m} \tag{9.5}$$
• Gini index:
$$Q_m(T) = \sum_{k} \hat{P}_{m,k}\left(1 - \hat{P}_{m,k}\right) \tag{9.6}$$
• Entropy:
$$Q_m(T) = H(I_m) = -\sum_{k} \hat{P}_{m,k} \log_2 \hat{P}_{m,k} \tag{9.7}$$
The weighted average entropy of the two children induced by splitting dimension $j$ at value $s$ is
$$(1 - \hat{P}_{m,j,s})\, H(I^-_{j,s}) + \hat{P}_{m,j,s}\, H(I^+_{j,s}) = \frac{|I^-_{j,s}|}{N_m}\, H(I^-_{j,s}) + \frac{|I^+_{j,s}|}{N_m}\, H(I^+_{j,s}) \,. \tag{9.8}$$
Choosing the split that minimizes the entropy of the children is equivalent to maximizing
the information gain of the test xj = s, defined by
$$\text{INFOGAIN}(x_j = s, I_m) = H(I_m) - \left( \frac{|I^-_{j,s}|}{N_m}\, H(I^-_{j,s}) + \frac{|I^+_{j,s}|}{N_m}\, H(I^+_{j,s}) \right) \tag{9.9}$$
In the two-class case (with labels 0 and 1), all of the splitting criteria mentioned above take the value 0.0 both when $\hat{P}_{m,0} = 0.0$ and when $\hat{P}_{m,0} = 1.0$.
The respective impurity curves are shown below, where p = P̂m,0 ; the vertical axis plots
Qm (T ) for each of the three criteria.
There used to be endless haggling about which impurity function one should use. It seems
to be traditional to use entropy to select which node to split while growing the tree, and
misclassification error in the pruning criterion.
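For concreteness, the following sketch computes the three impurity measures for a single region from its training labels (the helper name and the use of numpy are assumptions of the sketch).

import numpy as np

def impurities(y_region):
    # y_region: array of class labels for the training points in region m
    _, counts = np.unique(y_region, return_counts=True)
    p = counts / counts.sum()                   # empirical class probabilities P_hat_{m,k}
    misclassification = 1.0 - p.max()           # Eq. 9.5
    gini = np.sum(p * (1.0 - p))                # Eq. 9.6
    entropy = -np.sum(p * np.log2(p))           # Eq. 9.7 (only classes present, so p > 0)
    return misclassification, gini, entropy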
9.2.3 Bagging
One important limitation or drawback in conventional trees is that they can have high
estimation error: small changes in the data can result in very big changes in the resulting
tree.
Bootstrap aggregation is a technique for reducing the estimation error of a non-linear
predictor, or one that is adaptive to the data. The key idea applied to trees, is to build
multiple trees with different subsets of the data, and then create an ensemble model that
combines the results from multiple trees to make a prediction.
• Construct $B$ new data sets of size $n$. Each data set is constructed by sampling $n$ data points with replacement from $\mathcal{D}$. A single such data set is called a bootstrap sample of $\mathcal{D}$.
• Train a predictor f̂b (x) on each bootstrap sample.
• Regression case: the bagged predictor is
$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{b}(x) \,. \tag{9.10}$$
• Classification case: Let $K$ be the number of classes. We find a majority bagged predictor as follows. We let $\hat{f}^b(x)$ be a "one-hot" vector with a single 1 and $K-1$ zeros, and define the predicted output $\hat{y}$ for predictor $f^b$ as $\hat{y}^b(x) = \arg\max_k \hat{f}^b(x)_k$. Then
$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{b}(x), \tag{9.11}$$
which is a vector containing the proportion of classifiers that predicted each class $k$ for input $x$. Then the overall predicted output is
$$\hat{y}_{\text{bag}}(x) = \arg\max_k \hat{f}_{\text{bag}}(x)_k \,. \tag{9.12}$$
There are theoretical arguments showing that bagging does, in fact, reduce estimation error. However, when we bag a model, any simple interpretability is lost.
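A sketch of bootstrap aggregation for the regression case (Eq. 9.10); the fit and predict arguments stand in for any base learner, such as the regression tree above, and are hypothetical names.

import numpy as np

def bag(X, y, B, fit, predict):
    # fit(X, y) returns a trained predictor; predict(model, x) evaluates it at x
    n = len(y)
    models = []
    for _ in range(B):
        idx = np.random.randint(0, n, size=n)      # sample n data points with replacement
        models.append(fit(X[idx], y[idx]))         # train on the bootstrap sample
    # Eq. 9.10: the bagged predictor averages the B individual predictions
    return lambda x: np.mean([predict(m, x) for m in models])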
R ANDOM F OREST(D; B, m, n)
1 for b = 1, . . . , B
2 Draw a bootstrap sample Db of size n from D
3 Grow a tree Tb on data Db by recursively:
4 Select m variables at random from the d variables
5 Pick the best variable and split point among the m variables
6 Split the node
7     return the ensemble of trees {T_b}, b = 1, . . . , B
Given the ensemble of trees, vote to make a prediction on a new x.
So far, most of the learning problems we have looked at have been supervised: that is, for each training input x^{(i)}, we are told which value y^{(i)} should be the output. From a traditional machine-learning viewpoint, there are two other major groups of learning problems. One is the class of unsupervised learning problems, in which we are given data and no expected outputs; we will look at these later in Chapter 12.1 and Chapter 12.2.
The other major type is the so-called reinforcement learning (RL) problems. Reinforcement learning differs significantly from supervised learning problems, and we will delve into the details later in Chapter 11. However, it's worth pointing out one major dif-
ference at a very high level: in supervised learning, our goal is to learn a one-time static
mapping to make predictions, whereas in RL, the setup requires us to sequentially take
actions to maximize cumulative rewards.
This change of setup necessitates additional mathematical and algorithmic tools for us to understand RL. The Markov decision process (MDP) is precisely such a classical and fundamental tool.
• If you perform the “wash” operation on any object, whether it’s dirty, clean,
or painted, it will end up “clean” with probability 0.9 and “dirty” otherwise.
Paint:
• If you perform the “paint” operation on a clean object, it will become nicely
“painted” with probability 0.8. With probability 0.1, the paint misses but the
object stays clean, and also with probability 0.1, the machine dumps rusty
dust all over the object and it becomes “dirty”.
• If you perform the “paint” operation on a “dirty” part, it stays “dirty” with
probability 1.0.
Eject:
• If you perform an “eject” operation on any part, the part comes out of the
machine and this fun game is over. The part remains "ejected" regardless of
any further action.
These descriptions specify the transition model T , and the transition function for
each action can be depicted as a state machine diagram. For example, here is the
diagram for “wash”:
[Figure: state-machine diagram for the "wash" action over the states dirty, clean, painted, and ejected, with the transition probabilities 0.9 and 0.1 described above.]
You get reward +10 for ejecting a painted object, reward 0 for ejecting a non-painted
object, reward 0 for any action on an "ejected" object, and reward -3 otherwise. The
MDP description would be completed by also specifying a discount factor.
A policy is a function π : S → A that specifies what action to take in each state. The
policy is what we will want to learn; it is akin to the strategy that a player employs to
win a given game. Below, we take just the initial steps towards this eventual goal. We
describe how to evaluate how good a policy is, first in the finite horizon case (Section 10.1.1)
when the total number of transition steps is finite. Then we consider the infinite horizon
case (Section 10.1.2), when you don’t know when the game will be over.
The sum over s ′ is an expectation: it considers all possible next states s ′ , and computes
an average of their (h − 1)-horizon values, weighted by the probability that the transition
function from state s with the action chosen by the policy π(s) assigns to arriving in state
s ′ , and discounted by γ.
Study Question: What is $\sum_{s'} T(s, a, s')$ for any particular $s$ and $a$?
Study Question: Convince yourself that Eqs. 10.1 and 10.3 are special cases of
Eq. 10.4.
Then we can say that a policy $\pi_1$ is better than policy $\pi_2$ for horizon $h$, i.e., $\pi_1 >_h \pi_2$, if and only if for all $s \in S$, $V^h_{\pi_1}(s) \geq V^h_{\pi_2}(s)$ and there exists at least one $s \in S$ such that $V^h_{\pi_1}(s) > V^h_{\pi_2}(s)$.
Even though it seems intuitive that the second policy should be better, we can’t justify that
by saying ∞ < ∞.
One standard approach to deal with this problem is to consider the discounted infinite
horizon. We will generalize from the finite-horizon case by adding a discount factor.
In the finite-horizon case, we valued a policy based on an expected finite-horizon value:
$$\mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t R_t \;\middle|\; \pi, s_0\right], \tag{10.5}$$
A very important point is that R(s, a) is always deterministic (in this class) for any given s and a. Here R_t represents the set of all possible R(s_t, a) at time step t; this R_t is a random variable because the state we're in at step t is itself a random variable, due to the prior stochastic state transitions up to but not including step t and the prior (deterministic) actions dictated by policy π.
Now, for the infinite-horizon case, we select a discount factor $0 < \gamma < 1$, and evaluate a policy based on its expected infinite-horizon discounted value:
$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \;\middle|\; \pi, s_0\right] = \mathbb{E}\left[R_0 + \gamma R_1 + \gamma^2 R_2 + \cdots \;\middle|\; \pi, s_0\right]. \tag{10.6}$$
Note that the t indices here are not the number of steps to go, but actually the number
of steps forward from the starting state (there is no sensible notion of “steps to go” in the
infinite horizon case).
Eqs. 10.5 and 10.6 are a conceptual stepping stone. Our main objective is to get to
Eq. 10.8, which can also be viewed as including γ in Eq. 10.4, with the appropriate
definition of the infinite-horizon value.
There are two good intuitive motivations for discounting. One is related to economic
theory and the present value of money: you’d generally rather have some money today
than that same amount of money next week (because you could use it now or invest it).
The other is to think of the whole process terminating, with probability 1 − γ on each step of the interaction. This value is the expected amount of reward the agent would gain under this terminating model. (At every step, your expected future lifetime, given that you have survived until now, is 1/(1 − γ).)
Study Question: Verify this fact: if, on every day you wake up, there is a probability of 1 − γ that today will be your last day, then your expected lifetime is 1/(1 − γ) days.
Let us now evaluate a policy in terms of the expected discounted infinite-horizon value that the agent will get in the MDP if it executes that policy. We define the infinite-horizon value of a state $s$ under policy $\pi$ as
$$V^{\infty}_{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t \;\middle|\; \pi, s_0 = s\right]. \tag{10.7}$$
Because the expectation of a linear combination of random variables is the linear combination of the expectations, we have
$$V^{\infty}_{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^{\infty}_{\pi}(s'). \tag{10.8}$$
Given an MDP, our goal is typically to find a policy that is optimal in the sense that it
gets as much total reward as possible, in expectation over the stochastic transitions that
the domain makes. We build on what we have learned about evaluating the goodness of
a policy (Sections 10.1.1 and 10.1.2), and find optimal policies for the finite horizon case
(Section 10.2.1), then the infinite horizon case (Section 10.2.2).
• starting in state s,
• continuing for h − 1 more steps executing an optimal policy for the appropriate hori-
zon on each step.
Similar to our definition of $V^h$ for evaluating a policy, we define the $Q^h$ function recursively according to the horizon. The only difference is that, on each step with horizon $h$, rather than selecting an action specified by a given policy, we select the value of $a$ that will maximize the expected future value:
$$Q^0(s, a) = 0 \tag{10.9}$$
$$Q^1(s, a) = R(s, a) + 0 \tag{10.10}$$
$$Q^2(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q^1(s', a') \tag{10.11}$$
$$\vdots$$
$$Q^h(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q^{h-1}(s', a') \tag{10.12}$$
where (s ′ , a ′ ) denotes the next time-step state/action pair. We can solve for the values of
Qh with a simple recursive algorithm called finite-horizon value iteration that just computes
$Q^h$ starting from horizon 0 and working backward to the desired horizon $H$. Given $Q^h$, an optimal $\pi^*_h$ can be found as follows:
$$\pi^*_h(s) \in \arg\max_{a} Q^h(s, a),$$
which gives the immediate best action(s) to take when there are h steps left; then π∗h−1 (s)
gives the best action(s) when there are (h − 1) steps left, and so on. In the case where there
are multiple best actions, we typically can break ties randomly.
Additionally, it is worth noting that in order for such an optimal policy to be computed,
we assume that the reward function R(s, a) is bounded on the set of all possible (state,
action) pairs. Furthermore, we will assume that the set of all possible actions is finite.
Study Question: The optimal value function is unique, but the optimal policy is not.
Think of a situation in which there is more than one optimal policy.
• To compute Q4 (si , aj ) for any one (si , aj ), we would need to compute Q3 (s, a)
for all (s, a) pairs.
• To compute Q3 (si , aj ) for any one (si , aj ), we’d need to compute Q2 (s, a) for
all (s, a) pairs.
• To compute Q2 (si , aj ) for any one (si , aj ), we’d need to compute Q1 (s, a) for
all (s, a) pairs.
So, if we have n states and m actions, this is O((mn)3 ) work — that seems like
way too much, especially as the horizon increases! But observe that we really only
have mnh values that need to be computed: Qh (s, a) for all h, s, a. If we start with
h = 1, compute and store those values, then using and reusing the Qh−1 (s, a) values
to compute the Qh (s, a) values, we can do all this computation in time O(mn2 h),
which is much better!
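The dynamic-programming computation just described can be sketched as follows, assuming (purely as a representation choice) that T and R are given as numpy arrays indexed by state and action.

import numpy as np

def finite_horizon_value_iteration(T, R, gamma, H):
    # T: n x m x n array with T[s, a, s']; R: n x m array with R[s, a]
    n, m, _ = T.shape
    Q = np.zeros((n, m))                        # Q^0(s, a) = 0, Eq. 10.9
    for _ in range(H):
        # Q^h(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * max_a' Q^{h-1}(s',a'), Eq. 10.12
        Q = R + gamma * (T @ Q.max(axis=1))     # reuses the stored Q^{h-1} values
    return Q, Q.argmax(axis=1)                  # Q^H and an optimal policy for horizon H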
This is also a set of equations, one for each (s, a) pair. This time, though, they are not linear (due to the max operation), and so they are not easy to solve. But there is a theorem that says they have a unique solution.
Theory There are a lot of nice theoretical results about infinite-horizon value iteration.
For some given (not necessarily optimal) Q function, define πQ (s) = arg maxa Q(s, a).
• The algorithm can be executed asynchronously, in parallel: as long as all (s, a) pairs are updated infinitely often in an infinite run, it still converges to the optimal value. (This is very important for reinforcement learning.)
Reinforcement learning
So far, all the learning problems we have looked at have been supervised, that is, for each
training input x(i) , we are told which value y(i) should be the output. Reinforcement learning
differs from previous learning problems in several important ways:
• The learner interacts explicitly with an environment, rather than implicitly (as in su-
pervised learning) through an available training data set of (x(i) , y(i) ) pairs drawn
from the environment.
• The learner has some choice over what new information it seeks to gain from the
environment.
• The learner updates models incrementally as additional information about the environment becomes available. (Online learning is a variant of supervised learning in which new data pairs become available over time and the model is updated, e.g., by retraining over the entire larger data set, or by weight update using just the new data.)
In a reinforcement learning problem, the interaction with the environment takes a particular form:
[Figure: the learner and the environment form a loop; the learner sends an action to the environment, and the environment returns a state and a reward to the learner.]
• Learner observes input state s(i)
• Learner generates output action a(i)
• Learner observes reward r(i)
• Learner observes input state s(i+1)
• Learner generates output action a(i+1)
• Learner observes reward r(i+1)
• ...
Similar to MDPs, the learner is supposed to find a policy, mapping a state s to action a, that
maximizes expected reward over time.
• Ignoring the r(i) values that it gets while learning, but considering how many inter-
actions with the environment are required for it to learn a policy π : S → A that is
nearly optimal.
Most of the focus is on the first criterion (which is called “sample efficiency”), because the
second one is very difficult. The first criterion is reasonable when the learning can take
place somewhere safe (imagine a robot learning, inside the robot factory, where it can’t
hurt itself too badly) or in a simulated environment.
Approaches to reinforcement learning differ significantly according to what kind of
hypothesis or model is being learned. Roughly speaking, RL methods can be categorized
into model-free methods and model-based methods. The main distinction is that model-
based methods explicitly learn the transition and reward models to assist the end-goal
of learning a policy; model-free methods do not. We will start our discussion with the
model-free methods, and introduce two of the arguably most popular types of algorithms,
Q-learning (Section 11.2.1) and policy gradient (Section 11.2.4). We then describe model-
based methods (Section 11.3). Finally, we briefly consider “bandit” problems (Section 11.4),
which differ from our MDP learning context by having probabilistic rewards.
11.2.1 Q-learning
Q-learning is a frequently used class of RL algorithms that concentrates on learning (estimating) the state-action value function, i.e., the Q function. Specifically, recall the MDP value-iteration update:
$$Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a') \tag{11.1}$$
The Q-learning algorithm below adapts this value-iteration idea to the RL scenario, where we do not know the transition function T or reward function R, and instead rely on samples to perform the updates. (The thing that most students seem to get confused about is when we do value iteration and when we do Q-learning. Value iteration assumes you know T and R and just need to compute Q. In Q-learning, we don't know or even directly estimate T and R: we estimate Q directly from experience!)
Q-LEARNING(S, A, γ, ϵ, α, s0)
1   for s ∈ S, a ∈ A:
2       Qold(s, a) = 0
3   s = s0
4   while True:
5       a = select_action(s, Qold(s, a))
6       r, s′ = execute(a)
7       Qnew(s, a) = (1 − α)Qold(s, a) + α(r + γ max_{a′} Qold(s′, a′))
8       s = s′
9       if max_{s,a} |Qold(s, a) − Qnew(s, a)| < ϵ:
10          return Qnew
11      Qold = Qnew
With the pseudo-code provided for Q-learning, there are a few key things to note. First,
we must determine which state to initialize the learning from. In the context of a game,
this initial state may be well defined. In the context of a robot navigating an environment,
one may consider sampling the initial state at random. In any case, the initial state is neces-
sary to determine the trajectory the agent will experience as it navigates the environment.
Second, different contexts will influence how we want to choose when to stop iterating
through the while loop. Again, in some games there may be a clear terminating state based
on the rules of how it is played. On the other hand, a robot may be allowed to explore an
environment ad infinitum. In such a case, one may consider either setting a fixed number
of transitions to take or we may want to stop iterating once the values in the Q-table are
not changing. Finally, a single trajectory through the environment may not be sufficient
to adequately explore all state-action pairs. In these instances, it becomes necessary to
run through a number of iterations of the Q-learning algorithm, potentially with different
choices of initial state s0. Of course, we would then want to modify Q-learning such that the Q table is not reset with each call. (This notion of running a number of instances of Q-learning is often referred to as experiencing multiple episodes.)
Now, let's dig in to what is happening in Q-learning. Here, α ∈ (0, 1] represents the "learning rate," which needs to decay for convergence purposes, but in practice is often
set to a constant. It’s also worth mentioning that Q-learning assumes a discrete state and
action space where states and actions take on discrete values like 1, 2, 3, . . . etc. In contrast,
a continuous state space would allow the state to take values from, say, a continuous range
of numbers; for example, the state could be any real number in the interval [1, 3]. Similarly,
a continuous action space would allow the action to be drawn from, e.g., a continuous
range of numbers. There are now many extensions developed based on Q-learning that
can handle continuous state and action spaces (we’ll look at one soon), and therefore the
algorithm above is also sometimes referred to more specifically as tabular Q-learning.
In the Q-learning update rule
$$Q[s, a] \leftarrow (1 - \alpha)\, Q[s, a] + \alpha \left(r + \gamma \max_{a'} Q[s', a']\right) \tag{11.2}$$
the term r + γ maxa ′ Q[s ′ , a ′ ] is often referred to as the (one-step look-ahead) target. The
update can be viewed as a combination of two different iterative processes that we have
already seen: the combination of an old estimate with the target using a running average
with a learning rate α, and the dynamic-programming update of a Q value from value
iteration.
Eq. 11.2 can also be equivalently rewritten as
$$Q[s, a] \leftarrow Q[s, a] + \alpha \left( \left(r + \gamma \max_{a'} Q[s', a']\right) - Q[s, a] \right), \tag{11.3}$$
which allows us to interpret Q-learning in yet another way: we make an update (or correc-
tion) based on the temporal difference between the target and the current estimated value
Q[s, a].
The Q-learning algorithm above includes a procedure called select_action that, given the current state s and current Q function, has to decide which action to take. If the Q value is
estimated very accurately and the agent is behaving in the world, then generally we would
want to choose the apparently optimal action arg maxa∈A Q(s, a). But, during learning, the
Q value estimates won’t be very good and exploration is important. However, exploring
completely at random is also usually not the best strategy while learning, because it is
good to focus your attention on the parts of the state space that are likely to be visited
when executing a good policy (not a bad or random one).
A typical action-selection strategy that attempts to address this exploration versus exploitation dilemma is the so-called ϵ-greedy strategy: with probability 1 − ϵ, choose arg max_{a∈A} Q(s, a); with probability ϵ, choose an action a ∈ A uniformly at random. Here, the ϵ probability of choosing a random action helps the agent to explore and try out actions that might not seem so desirable at the moment.
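A minimal sketch of ϵ-greedy selection together with the tabular update of Eq. 11.3, using a dictionary keyed by (s, a) pairs (the representation and function names are illustrative assumptions):

import random

def select_action(s, Q, actions, eps):
    # epsilon-greedy: explore with probability eps, otherwise exploit the current Q estimate
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])   # Q: dict over all (state, action) pairs

def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # temporal-difference update, Eq. 11.3
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])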
Q-learning has the surprising property that it is guaranteed to converge to the actual
optimal Q function under fairly weak conditions! Any exploration strategy is okay as
long as it tries every action infinitely often on an infinite run (so that it doesn’t converge
prematurely to a bad action choice).
Q-learning can be very inefficient. Imagine a robot that has a choice between moving to
the left and getting a reward of 1, then returning to its initial state, or moving to the right
and walking down a 10-step hallway in order to get a reward of 1000, then returning to its
initial state.
[Figure: the hallway domain; the robot starts at the position marked "robot", moving left yields reward +1, and moving right leads down a hallway of 10 cells with reward +1000 at the far end.]
The first time the robot moves to the right and goes down the hallway, it will update
the Q value just for state 9 on the hallway and action “right” to have a high value, but it
won’t yet understand that moving to the right in the earlier steps was a good choice. The
next time it moves down the hallway it updates the value of the state before the last one,
and so on. After 10 trips down the hallway, it now can see that it is better to move to the
right than to the left.
More concretely, consider the vector of Q values Q(i = 0, . . . , 9; right), representing the
Q values for moving right at each of the positions i = 0, . . . , 9. Position index 0 is the
starting position of the robot as pictured above.
Then, for α = 1 and γ = 0.9, Eq. 11.3 becomes
$$Q^{(0)}(i = 0, \dots, 9; \text{right}) = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}. \tag{11.5}$$
Since the only nonzero reward from moving right is R(9, right) = 1000, after our robot makes it down the hallway once, our new Q vector is
$$Q^{(1)}(i = 0, \dots, 9; \text{right}) = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1000 \end{bmatrix}. \tag{11.6}$$
(We are violating our usual notational conventions here, and writing Q^{(i)} to mean the Q value function that results after the robot runs all the way to the end of the hallway, when executing the policy that always moves to the right.)
After making its way down the hallway again, Q(8, right) updates to Q(8, right) = 0 + 0.9 · Q(9, right) = 900. Similarly, each further trip down the hallway propagates the value back by one more position, and the robot finally sees the value of moving right from position 0. (Here, we can see the exploration/exploitation dilemma in action: from the perspective of s0 = 0, it will seem that getting the immediate reward of 1 is a better strategy without exploring the long hallway.)
Study Question: Determine the Q value functions that will result from updates due to the robot always executing the "move left" policy.
11.2.2 Function approximation: Deep Q learning
In our Q-learning algorithm above, we essentially keep track of each Q value in a table,
indexed by s and a. What do we do if S and/or A are large (or continuous)?
We can use a function approximator like a neural network to store Q values. For example, we could design a neural network that takes in inputs s and a, and outputs Q(s, a). We can treat this as a regression problem, optimizing this loss:
$$\left( Q(s, a) - \left(r + \gamma \max_{a'} Q(s', a')\right) \right)^2, \tag{11.12}$$
where Q(s, a) is now the output of the neural network. (This is the so-called squared Bellman error; as the name suggests, it's closely related to the Bellman equation we saw in MDPs in Chapter 10. Roughly speaking, this error measures how much the Bellman equality is violated.)
There are several different architectural choices for using a neural network to approximate Q values:
• One network for each action a, that takes s as input and produces Q(s, a) as output;
• One single network that takes s as input and produces a vector Q(s, ·), consisting of
the Q values for each action; or
• One single network that takes s, a concatenated into a vector (if a is discrete, we
would probably use a one-hot encoding, unless it had some useful internal structure)
and produces Q(s, a) as output.
The first two choices are only suitable for discrete (and not too big) action sets. The last choice can be applied for continuous actions, but then it is difficult to find arg max_{a∈A} Q(s, a). (For continuous action spaces, it is popular to use a class of methods called actor-critic methods, which combine policy and value-function learning. We won't get into them in detail here, though.)
There are not many theoretical guarantees about Q-learning with function approximation and, indeed, it can sometimes be fairly unstable (learning to perform well for a while, and then getting suddenly worse, for example). But neural network Q-learning has also had some significant successes.
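As an illustration of the second architectural choice (a single network mapping a state to a vector of Q values), here is a hedged PyTorch sketch of one squared-Bellman-error update (Eq. 11.12); the state dimension, number of actions, layer sizes, and the use of a detached target are assumptions of the sketch, not prescriptions from the notes.

import torch, torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # 4-dim state, 2 actions
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_step(s, a, r, s_next, gamma=0.99):
    # s, s_next: float tensors of shape (4,); a: integer action index; r: float reward
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max()   # one-step look-ahead target
    loss = (q_net(s)[a] - target) ** 2             # squared Bellman error, Eq. 11.12
    opt.zero_grad()
    loss.backward()
    opt.step()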
One form of instability that we do know how to guard against is catastrophic forgetting. In standard supervised learning, we expect that the training x values were drawn independently from some distribution. (And, in fact, we routinely shuffle their order in the data file, anyway.) But when a learning agent, such as a robot, is moving through an environment, the sequence of states it encounters will be temporally correlated. For example, the robot might spend 12 hours in a dark environment and then 12 in a light one. This can mean that while it is in the dark, the neural-network weight-updates will make the Q function "forget" the value function for when it's light.
One way to handle this is to use experience replay, where we save our (s, a, s ′ , r) expe-
riences in a replay buffer. Whenever we take a step in the world, we add the (s, a, s ′ , r) to
the replay buffer and use it to do a Q-learning update. Then we also randomly select some
number of tuples from the replay buffer, and do Q-learning updates based on them, as well.
In general it may help to keep a sliding window of just the 1000 most recent experiences in
the replay buffer. (A larger buffer will be necessary for situations when the optimal policy
might visit a large part of the state space, but we like to keep the buffer size small for mem-
ory reasons and also so that we don’t focus on parts of the state space that are irrelevant for
the optimal policy.) The idea is that it will help us propagate reward values through our
state space more efficiently if we do these updates. We can see it as doing something like
value iteration, but using samples of experience rather than a known model.
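A minimal sketch of such a replay buffer (the buffer size, batch size, and function names are arbitrary choices for illustration):

import random
from collections import deque

buffer = deque(maxlen=1000)                # sliding window of recent experiences

def store_and_replay(experience, q_update_fn, batch_size=32):
    # experience: an (s, a, s_next, r) tuple; q_update_fn applies one Q-learning update
    buffer.append(experience)
    q_update_fn(*experience)               # update on the new experience itself
    for s, a, s_next, r in random.sample(buffer, min(batch_size, len(buffer))):
        q_update_fn(s, a, s_next, r)       # plus updates on randomly replayed tuples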
Here, we alternate between using the policy induced by the current Q function to gather
a batch of data Dnew , adding it to our overall data set D, and then using supervised neural-
network training to learn a representation of the Q value function on the whole data set.
This method does not mix the dynamic-programming phase (computing new Q values
based on old ones) with the function approximation phase (supervised training of the neu-
ral network) and avoids catastrophic forgetting. The regression training in line 9 typically
uses squared error as a loss function and would be trained until the fit is good (possibly
measured on held-out data).
• When θ has higher dimensions (e.g., it represents the set of parameters in a com-
plicated neural network), there are more clever algorithms, e.g., one called REIN-
FORCE, but they can often be difficult to get to work reliably.
Policy search is a good choice when the policy has a simple known form, but the MDP
would be much more complicated to estimate.
11.3 Model-based RL
The conceptually simplest approach to RL is to model R and T from the data we have gotten
so far, and then use those models, together with an algorithm for solving MDPs (such as
value iteration) to find a policy that is near-optimal given the current models.
Assume that we have had some set of interactions with the environment, which can be
characterized as a set of tuples of the form (s(t) , a(t) , s(t+1) , r(t) ).
Because the transition function T (s, a, s ′ ) specifies probabilities, multiple observations
of (s, a, s ′ ) may be needed to model the transition function. One approach to this task of
building a model T̂ (s, a, s ′ ) for the true T (s, a, s ′ ) is to estimate it using a simple counting
strategy,
$$\hat{T}(s, a, s') = \frac{\#(s, a, s') + 1}{\#(s, a) + |S|} \,. \tag{11.13}$$
Here, #(s, a, s ′ ) represents the number of times in our data set we have the situation where
s(t) = s, a(t) = a, s(t+1) = s ′ and #(s, a) represents the number of times in our data set we
have the situation where s(t) = s, a(t) = a.
Study Question: Prove to yourself that $\#(s, a) = \sum_{s'} \#(s, a, s')$.
Adding 1 and |S| to the numerator and denominator, respectively, is a form of smoothing called the Laplace correction. It ensures that we never estimate that a probability is 0, and keeps us from dividing by 0. As the amount of data we gather increases, the influence of this correction fades away. (Conceptually, this is also similar to having "initialized" our estimate for the transition function with uniform random probabilities, before having made any observations.)
In contrast, the reward function R(s, a) (as we have specified it in this text) is a deterministic function, such that knowing the reward r for a given (s, a) is sufficient to fully determine the function at that point. In other words, our model R̂ can simply be a record of observed rewards, such that R̂(s, a) = r = R(s, a).
Given empirical models T̂ and R̂ for the transition and reward functions, we can now
solve the MDP (S, A, T̂ , R̂) to find an optimal policy using value iteration, or use a search
algorithm to find an action to take for a particular state.
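A sketch of the counting estimate with the Laplace correction of Eq. 11.13, using dictionaries of counts (an implementation choice, not part of the notes):

from collections import Counter

counts_sas = Counter()        # counts #(s, a, s')
counts_sa = Counter()         # counts #(s, a)

def record(s, a, s_next):
    counts_sas[(s, a, s_next)] += 1
    counts_sa[(s, a)] += 1

def T_hat(s, a, s_next, num_states):
    # Eq. 11.13: Laplace-corrected estimate of T(s, a, s')
    return (counts_sas[(s, a, s_next)] + 1) / (counts_sa[(s, a)] + num_states)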
This approach is effective for problems with small state and action spaces, where it is
not too hard to get enough experience to model T and R well; but it is difficult to generalize
this method to handle continuous (or very large discrete) state spaces, and is a topic of
current research.
• A set of actions A;
The most typical bandit problem has R = {0, 1} and |A| = k. This is called a k-armed bandit problem, where the decision is which "arm" (action a) to select, and the reward is either getting a payoff (1) or not (0). (Why? Because in English slang, "one-armed bandit" is a name for a slot machine, an old-style gambling machine where you put a coin into a slot and then pull its arm to see if you get a payoff, because it has one arm and takes your money! What we have here is a similar sort of machine, but with k arms.) There is a lot of mathematical literature on optimal strategies for k-armed bandit problems under various assumptions. The important question is usually one of exploration versus exploitation. Imagine that you have tried each action 10 times, and now you have estimates R̂p(a, r) for the probabilities Rp(a, r) for reward r given action a. Which arm should you pick next? You could
• exploit your knowledge, and for future trials choose the arm with the highest value of expected reward; or
• explore further, by trying some or all actions more times, hoping to get better estimates of the Rp(a, r) values.
The theory ultimately tells us that the longer our horizon h (or, similarly, the closer our discount factor is to 1), the more time we should spend exploring, so that we don't converge prematurely on a bad choice of action.
Study Question: Why is it that “bad” luck during exploration is more dangerous
than “good” luck? Imagine that there is an action that generates reward value 1 with
probability 0.9, but the first three times you try it, it generates value 0. How might
that cause difficulty? Why is this more dangerous than the situation when an action
that generates reward value 1 with probability 0.1 actually generates reward 1 on the
first three tries?
Bandit problems are reinforcement learning problems (and are very different from batch supervised learning) in that:
• The agent gets to influence what data it obtains (selecting a gives it another sample from R(a, r)), and
• The agent is penalized for mistakes it makes while it is learning (if it is trying to maximize the expected reward $\sum_r r \cdot \Pr(R_p(a, r) = r)$ it gets while behaving).
(There is a setting of supervised learning, called active learning, where instead of being given a training set, the learner gets to select a value of x and the environment gives back a label y; the problem of picking good x values to query is interesting, but the problem of deriving a hypothesis from (x, y) pairs is the same as the supervised problem we have been studying.)
In a contextual bandit problem, you have multiple possible states, drawn from some set S, and a separate bandit problem associated with each one.
Bandit problems are an essential subset of reinforcement learning. It's important to be aware of the issues, but we will not study solutions to them in this class.
Unsupervised Learning
12.1 Clustering
Oftentimes a dataset can be partitioned into different categories. A doctor may notice that
their patients come in cohorts and different cohorts respond to different treatments. A biol-
ogist may gain insight by identifying that bats and whales, despite outward appearances,
have some underlying similarity, and both should be considered members of the same cat-
egory, i.e., “mammal”. The problem of automatically identifying meaningful groupings in
datasets is called clustering. Once these groupings are found, they can be leveraged toward
interpreting the data and making optimal decisions for each group.
Figure 12.1: A dataset we would like to cluster. How many clusters do you think there are?
There seem to be about five clumps of datapoints and those clumps are what we would
like to call clusters. If we assign all datapoints in each clump to a cluster corresponding to
that clump, then we might desire that nearby datapoints are assigned to the same cluster,
while far apart datapoints are assigned to different clusters.
In designing clustering algorithms, three critical things we need to decide are:
• How do we measure distance between datapoints? What counts as “nearby” and “far
apart”?
We will see how to begin making these decisions as we work through a concrete clus-
tering algorithm in the next section.
objective can be quantified with the following objective function (which we also call the
“k-means objective” or “k-means loss”):
$$\sum_{j=1}^{k} \sum_{i=1}^{n} \mathbb{1}\left(y^{(i)} = j\right) \left\| x^{(i)} - \mu^{(j)} \right\|^2, \tag{12.1}$$
where $\mu^{(j)} = \frac{1}{N_j} \sum_{i=1}^{n} \mathbb{1}\left(y^{(i)} = j\right) x^{(i)}$ and $N_j = \sum_{i=1}^{n} \mathbb{1}\left(y^{(i)} = j\right)$, so that $\mu^{(j)}$ is the mean of all datapoints in cluster $j$, and using $\mathbb{1}(\cdot)$ to denote the indicator function (which takes on the value 1 if its argument is true and 0 otherwise). The inner sum (over data points) of the loss is the variance of datapoints within cluster $j$. We sum up the variance of all $k$ clusters to get our overall loss.
Figure 12.2: The first three steps of running the k-means algorithm on this data. Datapoints
are colored according to the cluster to which they are assigned. Cluster means are the larger
X’s with black outlines.
Each time we reassign the data to the nearest cluster mean, the k-means loss decreases
(the datapoints end up closer to their assigned cluster mean), or stays the same. And each
time we recompute the cluster means the loss also decreases (the means end up closer to
their assigned datapoints) or stays the same. Overall then, the clustering gets better and
better, according to our objective – until it stops improving. After four iterations of cluster
assignment + update means in our example, the k-means algorithm stops improving. Its
final solution is shown in Fig. 12.3. It seems to arrive at something reasonable!
Study Question: Why do we have the “break” statement on line 9? Could the clus-
tering improve if we ran it for more iterations after this point? Has it converged?
The for-loop over the n datapoints assigns each datapoint to the nearest cluster center.
The for-loop over the k clusters updates the cluster center to be the mean of all datapoints
currently assigned to that cluster. As suggested above, it can be shown that this algorithm
reduces the loss in Eq. 12.1 on each iteration, until it converges to a local minimum of the
loss.
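A compact numpy sketch of the k-means iteration just described; the random choice of k data points as initial means is one of the initialization options discussed below, and the function name is illustrative.

import numpy as np

def kmeans(X, k, max_iter=100):
    # X: n x d data matrix; returns cluster assignments y and cluster means mu
    n, d = X.shape
    mu = X[np.random.choice(n, k, replace=False)].astype(float)   # init means at k data points
    y = np.full(n, -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # n x k distances
        y_new = dists.argmin(axis=1)                 # assign each point to its nearest mean
        if np.array_equal(y_new, y):
            break                                    # assignments unchanged: converged
        y = y_new
        for j in range(k):
            if np.any(y == j):
                mu[j] = X[y == j].mean(axis=0)       # recompute the mean of cluster j
    return y, mu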
It’s like classification except the algorithm picked what the classes are rather than being
given examples of what the classes are.
Figure 12.4: With the initialization of the means to the left, the yellow and red means end
up splitting what perhaps should be one cluster in half.
Figure 12.4 is an example of a different initialization on our toy data, which results in a
worse converged clustering:
A variety of methods have been developed to pick good initializations (for example,
check out the k-means++ algorithm). One simple option is to run the standard k-means
algorithm multiple times, with different random initial conditions, and then pick from
these the clustering that achieves the lowest k-means loss.
12.1.6 Importance of k
A very important choice in clustering algorithms is the number of clusters we are looking for. Some advanced algorithms can automatically infer a suitable number of clusters, but most of the time, like with k-means, we will have to pick k – it's a hyperparameter of the algorithm.
Figure 12.5: Example of k-means run on our toy data, with two different values of k. Setting
k=4, on the left, results in one cluster being merged, compared to setting k=5, on the right.
Which clustering do you think is better? How could you decide?
Figure 12.5 shows an example of the effect. Which result looks more correct? It can be
hard to say! Using higher k we get more clusters, and with more clusters we can achieve
lower within-cluster variance – the k-means objective will never increase, and will typically
strictly decrease as we increase k. Eventually, we can increase k to equal the total number
of datapoints, so that each datapoint is assigned to its own cluster. Then the k-means
objective is zero, but the clustering reveals nothing. Clearly, then, we cannot use the k-
means objective itself to choose the best value for k. In subsection 12.1.7.1, we will discuss
some ways of evaluating the success of clustering beyond its ability to minimize the k-
means objective, and it’s with these sorts of methods that we might decide on a proper
value of k.
Figure 12.6: Averaged across the whole population, risk of heart disease positively corre-
lates with hours of exercise. However, if we cluster the data, we can observe that there are
four subgroups of the population which correspond to different age groups, and within
each subgroup the correlation is negative. We can make better predictions, and better cap-
ture the presumed true effect, if we cluster this data and then model the trend in each
cluster separately.
Figure 12.7: Compression of digits dataset into two dimensions. The input x(i) , an image of a
handwritten digit, is shown at the new low-dimensional representation (a1 , a2 ).
The decoder and the encoder could involve multiple layers, as opposed to the single layer shown here. Learning seeks parameters $W^1, W^1_0$ and $W^2, W^2_0$ such that the reconstructed instances, $h(g(x^{(i)}; W^1, W^1_0); W^2, W^2_0)$, are close to the original input $x^{(i)}$.
Figure 12.8: Autoencoder structure, showing the encoder (left half, light green), and the decoder
(right half, light blue), encoding inputs x to the representation a, and decoding the representation to
produce x̃, the reconstruction. In this specific example, the representation (a1 , a2 , a3 ) only has three
dimensions.
What are some conventions for derivatives of matrices and vectors? It will always work
to explicitly write all indices and treat everything as scalars, but we introduce here some
shortcuts that are often faster to use and helpful for understanding.
There are at least two consistent but different systems for describing shapes and rules
for doing matrix derivatives. In the end, they all are correct, but it is important to be
consistent.
We will use what is often called the 'Hessian' or denominator layout, in which we say that for x of size n × 1 and y of size m × 1, ∂y/∂x is a matrix of size n × m with the (i, j) entry ∂y_j/∂x_i. This denominator layout convention has been adopted by much of the machine learning field to ensure that the shape of the gradient is the same as the shape of the quantity with respect to which we differentiate. This is somewhat controversial at large, but alas, we shall continue with denominator layout.
The discussion below closely follows the Wikipedia page on matrix derivatives.
For a scalar $y$ and a vector $x$ of size $n \times 1$, the derivative of $y$ with respect to $x$ is the column vector
$$\frac{\partial y}{\partial x} = \begin{bmatrix} \partial y / \partial x_1 \\ \partial y / \partial x_2 \\ \vdots \\ \partial y / \partial x_n \end{bmatrix}.$$
For a vector $y$ of size $m \times 1$ and a scalar $x$, the derivative is the row vector
$$\frac{\partial y}{\partial x} = \begin{bmatrix} \partial y_1 / \partial x & \partial y_2 / \partial x & \cdots & \partial y_m / \partial x \end{bmatrix},$$
and for a vector $y$ of size $m \times 1$ and a vector $x$ of size $n \times 1$,
$$\frac{\partial y}{\partial x} = \begin{bmatrix} \partial y_1 / \partial x_1 & \partial y_2 / \partial x_1 & \cdots & \partial y_m / \partial x_1 \\ \partial y_1 / \partial x_2 & \partial y_2 / \partial x_2 & \cdots & \partial y_m / \partial x_2 \\ \vdots & \vdots & \ddots & \vdots \\ \partial y_1 / \partial x_n & \partial y_2 / \partial x_n & \cdots & \partial y_m / \partial x_n \end{bmatrix}.$$
You may notice that in this list, we have not included matrix-by-matrix, matrix-by-
vector, or vector-by-matrix derivatives. This is because, generally, they cannot be expressed
nicely in matrix form and require higher order objects (e.g., tensors) to represent their
derivatives. These cases are beyond the scope of this course.
Additionally, notice that for all cases, you can explicitly compute each element of the
derivative object using (scalar) partial derivatives. You may find it useful to work through
some of these by hand as you are reviewing matrix derivatives.
We begin with some basic identities. For a constant vector $a$ (not a function of $x$),
$$\frac{\partial a}{\partial x} = 0, \tag{A.1}$$
an $n \times m$ matrix of 0s. This is similar to the scalar case of differentiating a constant. Next, we can consider the case of differentiating a vector with respect to itself:
$$\frac{\partial x}{\partial x} = I \tag{A.2}$$
This is the $n \times n$ identity matrix, with 1's along the diagonal and 0's elsewhere. It makes sense, because $\partial x_j / \partial x_i$ is 1 for $i = j$ and 0 otherwise. This identity is also similar to the scalar case.
Next, consider the derivative of $Ax$ with respect to $x$, where $A$ is an $m \times n$ matrix that does not depend on $x$. Writing out the entries,
$$\frac{\partial Ax}{\partial x} = \begin{bmatrix} \partial (Ax)_1 / \partial x_1 & \partial (Ax)_2 / \partial x_1 & \cdots & \partial (Ax)_m / \partial x_1 \\ \partial (Ax)_1 / \partial x_2 & \partial (Ax)_2 / \partial x_2 & \cdots & \partial (Ax)_m / \partial x_2 \\ \vdots & \vdots & \ddots & \vdots \\ \partial (Ax)_1 / \partial x_n & \partial (Ax)_2 / \partial x_n & \cdots & \partial (Ax)_m / \partial x_n \end{bmatrix} \tag{A.3}$$
Note that any element of the column vector $Ax$ can be written, for $j = 1, \dots, m$, as
$$(Ax)_j = \sum_{k=1}^{n} A_{j,k} x_k \,.$$
Thus, computing the $(i, j)$ entry of $\frac{\partial Ax}{\partial x}$ requires computing the partial derivative $\partial (Ax)_j / \partial x_i$:
$$\partial (Ax)_j / \partial x_i = \partial \left( \sum_{k=1}^{n} A_{j,k} x_k \right) \Big/ \partial x_i = A_{j,i} \,.$$
Therefore, the $(i, j)$ entry of $\frac{\partial Ax}{\partial x}$ is the $(j, i)$ entry of $A$:
$$\frac{\partial Ax}{\partial x} = A^T \tag{A.4}$$
Similarly, for $x$ and $A$ of appropriate shapes, one can obtain
$$\frac{\partial x^T A}{\partial x} = A \tag{A.5}$$
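These layout conventions are easy to check numerically. The sketch below compares $A^T$ against a finite-difference approximation of $\partial(Ax)/\partial x$, verifying Eq. A.4 (the sizes, step size, and tolerance are arbitrary choices).

import numpy as np

m, n, eps = 3, 4, 1e-6
A = np.random.randn(m, n)
x = np.random.randn(n)

# Build the n x m derivative matrix entry by entry: the (i, j) entry is d(Ax)_j / dx_i
D = np.zeros((n, m))
for i in range(n):
    dx = np.zeros(n)
    dx[i] = eps
    D[i, :] = (A @ (x + dx) - A @ x) / eps

print(np.allclose(D, A.T, atol=1e-4))   # True: d(Ax)/dx = A^T in denominator layout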
For vectors $u$ and $v$ that both depend on $x$,
$$\frac{\partial (u + v)}{\partial x} = \frac{\partial u}{\partial x} + \frac{\partial v}{\partial x} \tag{A.6}$$
Suppose that $a$ is a scalar constant and $u$ is an $m \times 1$ vector that is a function of $x$. Then,
$$\frac{\partial au}{\partial x} = a \frac{\partial u}{\partial x} \tag{A.7}$$
One can extend the previous identity to vector- and matrix-valued constants. Suppose
that a is a vector with shape m × 1 and v is a scalar which depends on x. Then,
$$\frac{\partial va}{\partial x} = \frac{\partial v}{\partial x} a^T \tag{A.8}$$
First, checking dimensions, ∂v/∂x is n × 1 and a is m × 1 so aT is 1 × m and our answer
is n × m as it should be. Now, checking a value, element (i, j) of the answer is ∂vaj /∂xi =
(∂v/∂xi )aj which corresponds to element (i, j) of (∂v/∂x)aT .
Similarly, suppose that A is a matrix which does not depend on x and u is a column
vector which does depend on x. Then,
$$\frac{\partial Au}{\partial x} = \frac{\partial u}{\partial x} A^T \tag{A.9}$$
Now suppose that $v$ is a scalar and $u$ is an $m \times 1$ vector, both of which depend on $x$. Then the product rule gives
$$\frac{\partial vu}{\partial x} = v \frac{\partial u}{\partial x} + \frac{\partial v}{\partial x} u^T \tag{A.10}$$
One can see this relationship by expanding the derivative as follows:
$$\frac{\partial vu}{\partial x} = \begin{bmatrix} \partial (v u_1) / \partial x_1 & \partial (v u_2) / \partial x_1 & \cdots & \partial (v u_m) / \partial x_1 \\ \partial (v u_1) / \partial x_2 & \partial (v u_2) / \partial x_2 & \cdots & \partial (v u_m) / \partial x_2 \\ \vdots & \vdots & \ddots & \vdots \\ \partial (v u_1) / \partial x_n & \partial (v u_2) / \partial x_n & \cdots & \partial (v u_m) / \partial x_n \end{bmatrix}$$
Then, one can use the product rule for scalar-valued functions on each entry.
The chain rule, for a function $g$ of a vector $u$ that itself depends on $x$, is
$$\frac{\partial g(u)}{\partial x} = \frac{\partial u}{\partial x} \frac{\partial g(u)}{\partial u} \tag{A.11}$$
Following “the shapes of things,” ∂u/∂x is n × d and ∂g(u)/∂u is d × m, where element
(i, j) is ∂g(u)j /∂ui . The same chain rule applies for further compositions of functions:
$$\frac{\partial u^T v}{\partial x} = \frac{\partial u}{\partial x} v + \frac{\partial v}{\partial x} u \tag{A.13}$$
$$\frac{\partial u^T}{\partial x} = \left( \frac{\partial u}{\partial x} \right)^T \tag{A.14}$$
$$J(\theta) = \frac{1}{n} \left( \tilde{X}_{ab} \theta_b - \tilde{Y}_a \right) \left( \tilde{X}_{ac} \theta_c - \tilde{Y}_a \right). \tag{A.17}$$
Note that we no longer need the transpose on the first term, because all that transpose
accomplished was to take a dot product between the vector given by the left term, and the
vector given by the right term. With implicit summation, this is accomplished by the two
terms sharing the repeated index a.
Taking the derivative of J with respect to the dth element of θ thus gives, using the chain
rule for (ordinary scalar) multiplication:
$$\frac{dJ}{d\theta_d} = \frac{1}{n} \left[ \tilde{X}_{ab} \delta_{bd} \left( \tilde{X}_{ac} \theta_c - \tilde{Y}_a \right) + \left( \tilde{X}_{ab} \theta_b - \tilde{Y}_a \right) \tilde{X}_{ac} \delta_{cd} \right] \tag{A.18}$$
$$= \frac{1}{n} \left[ \tilde{X}_{ad} \left( \tilde{X}_{ac} \theta_c - \tilde{Y}_a \right) + \left( \tilde{X}_{ab} \theta_b - \tilde{Y}_a \right) \tilde{X}_{ad} \right] \tag{A.19}$$
$$= \frac{2}{n} \tilde{X}_{ad} \left( \tilde{X}_{ab} \theta_b - \tilde{Y}_a \right), \tag{A.20}$$
where the second line follows from the first, with the definition that δbd = 1 only when
b = d (and similarly for δcd ). And the third line follows from the second by recognizing
that the two terms in the second line are identical. Now note that in this implicit summation notation, the $a, c$ element of the matrix product of $A$ and $B$ is $(AB)_{ac} = A_{ab} B_{bc}$. That is, ordinary matrix multiplication sums over indices which are adjacent to each other, because a row of $A$ times a column of $B$ becomes a scalar number. So the term in the above equation with $\tilde{X}_{ad} \tilde{X}_{ab}$ is not a matrix product of $\tilde{X}$ with $\tilde{X}$. However, taking the transpose $\tilde{X}^T$ switches row and column indices, so $\tilde{X}_{ad} = \tilde{X}^T_{da}$. And $\tilde{X}^T_{da} \tilde{X}_{ab}$ is a matrix product of $\tilde{X}^T$ with $\tilde{X}$! Thus, we have that
$$\frac{dJ}{d\theta_d} = \frac{2}{n} \tilde{X}^T_{da} \left( \tilde{X}_{ab} \theta_b - \tilde{Y}_a \right) \tag{A.21}$$
$$= \frac{2}{n} \left[ \tilde{X}^T \left( \tilde{X}\theta - \tilde{Y} \right) \right]_d \,, \tag{A.22}$$
which is the desired result.
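One can sanity-check Eq. A.22 against a finite-difference gradient of J; the sizes, step size, and tolerance below are arbitrary choices for the check.

import numpy as np

n, d, eps = 50, 3, 1e-6
X, Y = np.random.randn(n, d), np.random.randn(n)
theta = np.random.randn(d)

J = lambda th: np.mean((X @ th - Y) ** 2)                 # the squared-error objective, Eq. A.17
analytic = (2.0 / n) * X.T @ (X @ theta - Y)              # Eq. A.22
numeric = np.array([(J(theta + eps * e) - J(theta)) / eps
                    for e in np.eye(d)])                  # forward differences, one per coordinate
print(np.allclose(analytic, numeric, atol=1e-4))          # True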
So, you can see that inputs c_t closer to the end of the sequence T have more effect on C_T
than early inputs.
If, instead, we set γt = (t − 1)/t, then we get the actual average.
Study Question: Prove to yourself that the previous assertion holds.
B.0.1.2 Momentum
Now, we can use methods that are a bit like running averages to describe strategies for
computing η. The simplest method is momentum, in which we try to “average” recent
gradient updates, so that if they have been bouncing back and forth in some direction, we
take out that component of the motion. For momentum, we have
V0 = 0
Vt = γVt−1 + η∇W J(Wt−1 )
Wt = Wt−1 − Vt
This doesn’t quite look like an adaptive step size. But what we can see is that, if we let
η = η ′ (1 − γ), then the rule looks exactly like doing an update with step size η ′ on a
moving average of the gradients with parameter γ:
M0 = 0
Mt = γMt−1 + (1 − γ)∇W J(Wt−1 )
Wt = Wt−1 − η ′ Mt
[Figure omitted: the red arrows show the update after each successive step of mini-batch gradient descent with momentum; the blue points show the direction of the gradient with respect to the mini-batch at each step. Momentum smooths the path taken towards the local minimum and leads to faster convergence.]
Study Question: If you set γ = 0.1, would momentum have more of an effect or less
of an effect than if you set it to 0.9?
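Here is a minimal sketch (not from the notes) of the momentum update V_t = γV_{t−1} + η∇_W J(W_{t−1}), W_t = W_{t−1} − V_t applied to a made-up quadratic objective; the values of γ and η are arbitrary choices for illustration:

```python
import numpy as np

def grad_J(W):
    # Gradient of a made-up, poorly conditioned quadratic: J(W) = 5 W[0]^2 + 0.5 W[1]^2.
    return np.array([10.0 * W[0], 1.0 * W[1]])

gamma, eta = 0.9, 0.02
W = np.array([1.0, 1.0])
V = np.zeros_like(W)                  # V_0 = 0

for t in range(200):
    V = gamma * V + eta * grad_J(W)   # V_t = gamma V_{t-1} + eta * grad J(W_{t-1})
    W = W - V                         # W_t = W_{t-1} - V_t

print(W)    # close to the minimizer at the origin
```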
B.0.1.3 Adadelta
Another useful idea is this: we would like to take larger steps in parts of the space where
J(W) is nearly flat (because there’s no risk of taking too big a step due to the gradient
being large) and smaller steps when it is steep. We’ll apply this idea to each weight in-
dependently, and end up with a method called adadelta, which is a variant on adagrad (for
adaptive gradient). Even though our weights are indexed by layer, input unit and output
unit, for simplicity here, just let Wj be any weight in the network (we will do the same
thing for all of them).
g_{t,j} = ∇_W J(W_{t−1})_j
G_{t,j} = γ G_{t−1,j} + (1 − γ) g_{t,j}^2
W_{t,j} = W_{t−1,j} − ( η / √(G_{t,j} + ϵ) ) g_{t,j}
The sequence G_{t,j} is a moving average of the square of the jth component of the gradient.
We square it in order to be insensitive to the sign—we want to know whether the magnitude is big or small. Then, we perform a gradient update to weight j, but divide the step
size by √(G_{t,j} + ϵ), which is larger when the surface is steeper in direction j at point W_{t−1} in
weight space; this means that the step size will be smaller when it's steep and larger when
it's flat.
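A minimal sketch of the adadelta-style per-weight update above, on a made-up objective that is steep in one coordinate and flat in the other (the objective and constants are assumptions for illustration):

```python
import numpy as np

def grad_J(W):
    # Made-up quadratic objective, steep in direction 0 and flat in direction 1.
    return np.array([10.0 * W[0], 0.1 * W[1]])

gamma, eta, eps = 0.9, 0.1, 1e-8
W = np.array([1.0, 1.0])
G = np.zeros_like(W)                        # moving average of squared gradients

for t in range(200):
    g = grad_J(W)
    G = gamma * G + (1 - gamma) * g**2      # G_{t,j} = gamma G_{t-1,j} + (1 - gamma) g_{t,j}^2
    W = W - eta / np.sqrt(G + eps) * g      # per-weight step size eta / sqrt(G_{t,j} + eps)

print(W)    # both coordinates are driven toward the origin, despite very different curvatures
```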
B.0.1.4 Adam
Adam has become the default method of managing step sizes in neural networks. It combines the ideas of momentum and adadelta. (Although, interestingly, it may actually violate the convergence conditions of SGD: arxiv.org/abs/1705.08292.) We start by writing moving averages of the gradient and squared gradient, which reflect estimates of the mean and variance of the gradient for weight j:
g_{t,j} = ∇_W J(W_{t−1})_j
m_{t,j} = B_1 m_{t−1,j} + (1 − B_1) g_{t,j}
v_{t,j} = B_2 v_{t−1,j} + (1 − B_2) g_{t,j}^2 .
The bias-corrected estimates and the update for weight j are then
m̂_{t,j} = m_{t,j} / (1 − B_1^t) ,     v̂_{t,j} = v_{t,j} / (1 − B_2^t) ,     W_{t,j} = W_{t−1,j} − ( η / (√(v̂_{t,j}) + ϵ) ) m̂_{t,j} .
Note that B_1^t is B_1 raised to the power t, and likewise for B_2^t. To justify these corrections,
note that if we were to expand m_{t,j} in terms of m_{0,j} and g_{0,j}, g_{1,j}, . . . , g_{t,j}, the coefficients
would sum to 1. However, the coefficient behind m_{0,j} is B_1^t and since m_{0,j} = 0, the sum of
coefficients of non-zero terms is 1 − B_1^t, hence the correction. The same justification holds
for v_{t,j}.
Now, our update for weight j has a step size that takes the steepness into account, as in
adadelta, but also tends to move in the same direction, as in momentum. The authors of
this method propose setting B_1 = 0.9, B_2 = 0.999, ϵ = 10^{−8}. Although we now have even
more parameters, Adam is not highly sensitive to their values (small changes do not have
a huge effect on the result).
Study Question: Define m̂j directly as a moving average of gt,j . What is the decay
(γ parameter)?
Even though we now have a step-size for each weight, and we have to update vari-
ous quantities on each iteration of gradient descent, it’s relatively easy to implement by
maintaining a matrix for each quantity (m^ℓ_t, v^ℓ_t, g^ℓ_t, (g^2_t)^ℓ) in each layer of the network.
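Putting the pieces together, here is a sketch of the per-weight Adam update with the bias corrections and the suggested constants B_1 = 0.9, B_2 = 0.999, ϵ = 10^{−8}; the toy objective and the step size η are made-up choices:

```python
import numpy as np

def grad_J(W):
    # Made-up quadratic objective, steep in one coordinate and flat in the other.
    return np.array([10.0 * W[0], 0.1 * W[1]])

B1, B2, eps, eta = 0.9, 0.999, 1e-8, 0.1   # B1, B2, eps as suggested; eta is a made-up choice
W = np.array([1.0, 1.0])
m = np.zeros_like(W)    # moving average of the gradient (mean estimate)
v = np.zeros_like(W)    # moving average of the squared gradient (variance estimate)

for t in range(1, 301):
    g = grad_J(W)
    m = B1 * m + (1 - B1) * g
    v = B2 * v + (1 - B2) * g**2
    m_hat = m / (1 - B1**t)                # bias-corrected estimates
    v_hat = v / (1 - B2**t)
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)

print(W)    # both coordinates are driven toward the minimum at the origin
```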
Our first step will be to compute the batchwise mean and standard deviation. Let µ^l be
the n_l × 1 vector where
µ^l_i = (1/K) Σ_{j=1}^{K} Z^l_{ij} ,
The basic normalized version of our data would be a matrix, element (i, j) of which is
Z̄^l_{ij} = ( Z^l_{ij} − µ^l_i ) / ( σ^l_i + ϵ ) ,
where ϵ is a very small constant to guard against division by zero. However, if we let these
be our Ẑ^l values, we really are forcing something too strong on our data—our goal was to
normalize across the data batch, but not necessarily force the output values to have exactly
mean 0 and standard deviation 1. So, we will give the layer the “opportunity” to shift and
scale the outputs by adding new weights to the layer. These weights are Gl and Bl , each of
which is an n_l × 1 vector. Using the weights, we define the final output to be
Ẑ^l_{ij} = G^l_i Z̄^l_{ij} + B^l_i .
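A small sketch of this forward computation (made-up layer size n_l and batch size K), computing the batchwise mean and standard deviation, the normalized Z̄, and the shifted and scaled output Ẑ:

```python
import numpy as np

rng = np.random.default_rng(0)
nl, K = 4, 16                           # made-up number of units n^l and batch size K
Z = rng.normal(size=(nl, K))            # pre-normalization activations Z^l
G = np.ones((nl, 1))                    # scale weights G^l (n^l x 1)
B = np.zeros((nl, 1))                   # shift weights B^l (n^l x 1)
eps = 1e-6                              # small constant guarding against division by zero

mu = Z.mean(axis=1, keepdims=True)      # mu^l_i = (1/K) sum_j Z^l_ij
sigma = Z.std(axis=1, keepdims=True)    # batchwise standard deviation
Zbar = (Z - mu) / (sigma + eps)         # normalized activations
Zhat = G * Zbar + B                     # final, shifted and scaled output of the layer
```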
• Compute ∂L/∂G^l and ∂L/∂B^l for gradient updates of the weights in this layer. Schematically,
∂L/∂B = (∂L/∂Ẑ)(∂Ẑ/∂B) .
(For simplicity we will drop the reference to the layer l in the rest of the derivation.)
It’s hard to think about these derivatives in matrix terms, so we’ll see how it works for the
components. B_i contributes to Ẑ_{ij} for all data points j in the batch. So
∂L/∂B_i = Σ_j (∂L/∂Ẑ_{ij})(∂Ẑ_{ij}/∂B_i)
        = Σ_j ∂L/∂Ẑ_{ij} .
Similarly, G_i contributes to Ẑ_{ij} for all data points j in the batch. So
∂L/∂G_i = Σ_j (∂L/∂Ẑ_{ij})(∂Ẑ_{ij}/∂G_i)
        = Σ_j (∂L/∂Ẑ_{ij}) Z̄_{ij} .
∂L/∂Z = (∂L/∂Ẑ)(∂Ẑ/∂Z) .
And because dependencies only exist across the batch, but not across the unit outputs,
∂L/∂Z_{ij} = Σ_{k=1}^{K} (∂L/∂Ẑ_{ik})(∂Ẑ_{ik}/∂Z_{ij}) .
∂Ẑ_{ik}/∂Z_{ij} = (∂Ẑ_{ik}/∂Z̄_{ik})(∂Z̄_{ik}/∂Z_{ij})
                = G_i (∂Z̄_{ik}/∂Z_{ij}) .
∂µ_i/∂Z_{ij} = 1/K ,
∂σ_i/∂Z_{ij} = (Z_{ij} − µ_i)/(K σ_i) .
∂L/∂Z_{ij} = Σ_{k=1}^{K} (∂L/∂Ẑ_{ik}) G_i (1/(K σ_i)) ( K δ_{jk} − 1 − (Z_{ik} − µ_i)(Z_{ij} − µ_i)/σ_i^2 ) .
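For concreteness, here is a sketch of the backward pass just derived, taking ∂L/∂Ẑ from the next layer and returning ∂L/∂B, ∂L/∂G, and ∂L/∂Z; the function name and shapes are assumptions, and ϵ is ignored in the derivative terms, as in the derivation above:

```python
import numpy as np

def batchnorm_backward(dZhat, Z, G, eps=1e-6):
    """Given dL/dZhat (n^l x K), return dL/dB, dL/dG, and dL/dZ for this layer."""
    nl, K = Z.shape
    mu = Z.mean(axis=1, keepdims=True)
    sigma = Z.std(axis=1, keepdims=True)
    Zc = Z - mu                                        # Z_ij - mu_i
    Zbar = Zc / (sigma + eps)

    dB = dZhat.sum(axis=1, keepdims=True)              # sum_j dL/dZhat_ij
    dG = (dZhat * Zbar).sum(axis=1, keepdims=True)     # sum_j dL/dZhat_ij * Zbar_ij

    # dL/dZ_ij = G_i/(K sigma_i) * sum_k dL/dZhat_ik
    #            * (K delta_jk - 1 - (Z_ik - mu_i)(Z_ij - mu_i)/sigma_i^2)
    dZ = (G / (K * sigma)) * (
        K * dZhat
        - dZhat.sum(axis=1, keepdims=True)
        - Zc / sigma**2 * (dZhat * Zc).sum(axis=1, keepdims=True)
    )
    return dB, dG, dZ
```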
So far, we have limited our attention to domains in which each output y is assumed to
have been generated as a function of an associated input x, and our hypotheses have been
“pure” functions, in which the output depends only on the input (and the parameters we
have learned that govern the function’s behavior). In the next few chapters, we are going
to consider cases in which our models need to go beyond functions. In particular, behavior
as a function of time will be an important concept:
• In recurrent neural networks, the hypothesis that we learn is not a function of a single
input, but of the whole sequence of inputs that the predictor has received.
In this chapter, we introduce state machines. We start with deterministic state machines,
and then consider recurrent neural network (RNN) architectures to model their behavior.
Later, in Chapter 10, we will study Markov decision processes (MDPs) that extend to consider
probabilistic (rather than deterministic) transitions in our state machines. RNNs and MDPs
will enable description and modeling of temporally sequential patterns of behavior that are
important in many domains.
The basic operation of the state machine is to start with state s_0 (in some cases, we will pick a starting state from a set or distribution), then iteratively compute for t ⩾ 1:
st = fs (st−1 , xt ) (C.1)
yt = fo (st ) (C.2)
The diagram below illustrates this process. Note that the “feedback” connection of
s_t back into f_s has to be buffered or delayed by one time step; otherwise what it
is computing would not generally be well defined.
[Figure: block diagram in which x_t and the delayed state s_{t−1} feed f_s to produce s_t, and f_o maps s_t to the output y_t.]
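A tiny sketch (the helper name transduce is made up) of running Eqs. C.1 and C.2 over an input sequence:

```python
def transduce(fs, fo, x_seq, s0):
    """Run the state machine: s_t = fs(s_{t-1}, x_t), y_t = fo(s_t), collecting the outputs."""
    s, y_seq = s0, []
    for x in x_seq:
        s = fs(s, x)
        y_seq.append(fo(s))
    return y_seq
```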
We sometimes say that the machine transduces sequence x into sequence y. The output at
time t can have dependence on inputs from steps 1 to t.
One common form is finite state machines, in which S, X, and Y are all finite sets. They are
often described using state transition diagrams such as the one below, in which nodes stand
for states and arcs indicate transitions. Nodes are labeled by which output they generate
and arcs are labeled by which input causes the transition. (There are a huge number of major
and minor variations on the idea of a state machine. We’ll just work with one specific one in this
section and another one in the next, but don’t worry if you see other variations out in the world!)
One can verify that the state machine below reads binary strings and determines the
parity of the number of zeros in the given string. Check for yourself that all input
binary strings end in state S1 if and only if they contain an even number of zeros. (All
computers can be described, at the digital level, as finite state machines. Big, but finite!)
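Since the transition diagram is not reproduced here, the following sketch encodes the parity machine described above using the transduce helper from the previous sketch; the start state is taken to be S1 (zero zeros seen so far, an even number), and outputting the current state name is an arbitrary labeling choice for illustration:

```python
def parity_fs(s, x):
    # Reading a 0 flips the parity state; reading a 1 leaves it unchanged.
    if x == 0:
        return "S0" if s == "S1" else "S1"
    return s

def parity_fo(s):
    return s    # output the current state name (an arbitrary labeling for this sketch)

print(transduce(parity_fs, parity_fo, [0, 1, 0, 0], s0="S1"))
# ['S0', 'S0', 'S1', 'S0'] -- three zeros in total, so the machine ends in S0 (odd).
```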
Another common structure that is simple but powerful and used in signal processing
and control is linear time-invariant (LTI) systems. In this case, all the quantities are real-
valued vectors: S = R^m, X = R^ℓ and Y = R^n. The functions f_s and f_o are linear functions of
their inputs. The transition function is described by the state matrix A and the input matrix
B; the output function is defined by the output matrix C, each with compatible dimensions.
In discrete time, they can be defined by a linear difference equation, like
s_t = A s_{t−1} + B x_t ,     y_t = C s_t ,
and can be implemented using state to store relevant previous input and output informa-
tion. We will study recurrent neural networks which are a lot like a non-linear version of an
LTI system.
A recurrent neural network is a state machine with neural networks constituting functions
fs and fo :
x_t : ℓ × 1    (C.7)
s_t : m × 1    (C.8)
y_t : v × 1    (C.9)
W^{sx} : m × ℓ    (C.10)
W^{ss} : m × m    (C.11)
W^{ss}_0 : m × 1    (C.12)
W^{o} : v × m    (C.13)
W^{o}_0 : v × 1    (C.14)
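The functional forms of f_s and f_o are not reproduced above; a common single-layer choice consistent with these shapes (with tanh and identity activations as assumptions, not the notes' definition) is sketched below:

```python
import numpy as np

def rnn_step(s_prev, x, Wsx, Wss, Wss0, Wo, Wo0, fs=np.tanh, fo=lambda z: z):
    """One step of an RNN with the shapes of Eqs. C.7-C.14.

    s_prev is m x 1 and x is l x 1; returns (s, y) of shapes m x 1 and v x 1.
    Assumes z1_t = Wss s_{t-1} + Wsx x_t + Wss0, s_t = fs(z1_t), and
    y_t = fo(Wo s_t + Wo0); the activation choices are assumptions.
    """
    z1 = Wss @ s_prev + Wsx @ x + Wss0
    s = fs(z1)
    y = fo(Wo @ s + Wo0)
    return s, y
```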
The per-element loss function L_elt will depend on the type of y_t and what information it is
encoding, in the same way as for a supervised network. (So it could be NLL, squared loss, etc.)
Then, letting W = (W^{sx}, W^{ss}, W^{o}, W^{ss}_0, W^{o}_0), our overall goal is to minimize the objective
J(W) = (1/q) Σ_{i=1}^{q} L_seq( RNN(x^{(i)}; W), y^{(i)} ) ,    (C.16)
x = (⟨start⟩, c1 , c2 , . . . , ck ) (C.18)
y = (c1 , c2 , . . . , ⟨end⟩) (C.19)
What we want you to take away from this section is that, by “unrolling” a recurrent
network out to model a particular sequence, we can treat the whole thing as a feed-
forward network with a lot of parameter sharing. Thus, we can tune the parameters
using stochastic gradient descent, and learn to model sequential mappings. The
concepts here are very important. While the details are important to get right if you
need to implement something, we present the mathematical details below primarily
to convey or explain the larger concepts.
Calculus reminder: total derivative Most of us are not very careful about the differ-
ence between the partial derivative and the total derivative. We are going to use a nice
example from the Wikipedia article on partial derivatives to illustrate the difference.
The volume of a circular cone depends on its height and radius:
V(r, h) = π r^2 h / 3 .    (C.20)
The partial derivatives of volume with respect to height and radius are
∂V/∂r = 2πrh/3   and   ∂V/∂h = πr^2/3 .    (C.21)
They measure the change in V assuming everything is held constant except the
single variable we are changing. Now assume that we want to preserve the cone’s
proportions in the sense that the ratio of radius to height stays constant. Then we
can’t really change one without changing the other. In this case, we really have to
think about the total derivative. If we’re interested in the total derivative with respect
to r, we sum the “paths” along which r might influence V:
dV/dr = ∂V/∂r + (∂V/∂h)(dh/dr)    (C.22)
      = 2πrh/3 + (πr^2/3)(dh/dr)    (C.23)
Or if we’re interested in the total derivative with respect to h, we consider how h
might influence V, either directly or via r:
dV/dh = ∂V/∂h + (∂V/∂r)(dr/dh)    (C.24)
      = πr^2/3 + (2πrh/3)(dr/dh)    (C.25)
Just to be completely concrete, let's think of a right circular cone with a fixed angle
α, where tan α = r/h, so that if we change r or h then α remains constant. So we have
r = h tan α; let the constant c = tan α, so now r = ch. Thus, we finally have
dV/dr = 2πrh/3 + (πr^2/3)(1/c)    (C.26)
dV/dh = πr^2/3 + (2πrh/3) c .    (C.27)
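A quick numerical check of Eq. C.26 (a sketch with made-up values of c and r): fixing the proportion r = ch, write V as a function of r alone and compare a finite difference to the total-derivative formula.

```python
import numpy as np

c, r = 0.5, 2.0            # made-up proportion c = tan(alpha) and radius r
h = r / c                  # keeping the proportions fixed means h = r / c

def V_of_r(r):
    # With proportions fixed, V depends on r alone: V = pi r^2 (r/c) / 3.
    return np.pi * r**2 * (r / c) / 3

dV_dr_total = 2 * np.pi * r * h / 3 + (np.pi * r**2 / 3) * (1 / c)   # Eq. C.26

eps = 1e-6
dV_dr_numeric = (V_of_r(r + eps) - V_of_r(r - eps)) / (2 * eps)
print(np.isclose(dV_dr_total, dV_dr_numeric))   # True
```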
(1) Sample a training pair of sequences (x, y); let their length be n.
(2) “Unroll” the RNN to be length n (picture for n = 3 below), and initialize s_0:
Now, we can see our problem as one of performing what is almost an ordinary back-
propagation training procedure in a feed-forward neural network, but with the dif-
ference that the weight matrices are shared among the layers. In many ways, this is
similar to what ends up happening in a convolutional network, except in the conv-
net, the weights are re-used spatially, and here, they are re-used temporally.
(4) Do backward pass to compute the gradients. For both W ss and W sx we need to find
Letting Lu = Lelt (gu , yu ) and using the total derivative, which is a sum over all the
ways in which W affects Lu , we have
∂L_seq/∂W = Σ_{u=1}^{n} Σ_{t=1}^{n} (∂s_t/∂W)(∂L_u/∂s_t)
Re-organizing, we have
∂L_seq/∂W = Σ_{t=1}^{n} (∂s_t/∂W) Σ_{u=1}^{n} (∂L_u/∂s_t)
          = Σ_{t=1}^{n} (∂s_t/∂W) Σ_{u=t}^{n} (∂L_u/∂s_t)
          = Σ_{t=1}^{n} (∂s_t/∂W) ( ∂L_t/∂s_t + Σ_{u=t+1}^{n} ∂L_u/∂s_t ) ,    (C.32)
where the second term inside the parentheses is denoted δ^s_t: the dependence of the future loss
(incurred after step t) on the state s_t. (That is, δ^s_t is how much we can blame state s_t for all
the future element losses.) We can compute this backwards, with t going from n down to 1. The
trickiest part is figuring out how early states contribute to later losses. We define the future loss
after step t to be
F_t = Σ_{u=t+1}^{n} L_elt(g_u, y_u) ,    (C.33)
so
δ^s_t = ∂F_t/∂s_t .    (C.34)
At the last stage, Fn = 0 so δsn = 0.
Now, working backwards,
δ^s_{t−1} = (∂/∂s_{t−1}) Σ_{u=t}^{n} L_elt(g_u, y_u)    (C.35)
          = (∂s_t/∂s_{t−1}) (∂/∂s_t) Σ_{u=t}^{n} L_elt(g_u, y_u)    (C.36)
          = (∂s_t/∂s_{t−1}) [ (∂/∂s_t) L_elt(g_t, y_t) + (∂/∂s_t) Σ_{u=t+1}^{n} L_elt(g_u, y_u) ]    (C.37)
          = (∂s_t/∂s_{t−1}) [ ∂L_elt(g_t, y_t)/∂s_t + δ^s_t ]    (C.38)
Now, we can use the chain rule again to find the dependence of the element loss at
time t on the state at that same time,
and the dependence of the state at time t on the state at the previous time,
Note that ∂s_t/∂z^1_t is formally an m × m diagonal matrix, with the values along
the diagonal being f_s'(z^1_{t,i}), 1 ⩽ i ⩽ m. But since this is a diagonal matrix,
one could represent it as an m × 1 vector f_s'(z^1_t). In that case the product
of the matrix (W^{ss})^T by the vector f_s'(z^1_t), denoted (W^{ss})^T ∗ f_s'(z^1_t), should be
interpreted as follows: take the first column of the matrix (W^{ss})^T and multiply
each of its elements by the first element of the vector ∂s_t/∂z^1_t, then take the
second column of the matrix (W^{ss})^T and multiply each of its elements by the
second element of the vector ∂s_t/∂z^1_t, and so on and so forth ...
We’re almost there! Now, we can describe the actual weight updates. Using Eq. C.32
and recalling the definition of δst = ∂Ft /∂st , as we iterate backwards, we can accu-
mulate the terms in Eq. C.32 to get the gradient for the whole loss.
dL_seq/dW^{sx} = Σ_{t=1}^{n} dL_elt(g_t, y_t)/dW^{sx} = Σ_{t=1}^{n} (∂z^1_t/∂W^{sx})(∂s_t/∂z^1_t)(∂F_{t−1}/∂s_t)    (C.43)
We can handle W^o separately; it's easier because it does not affect future losses in the
way that the other weight matrices do. Assuming we have ∂L_t/∂z^2_t = (g_t − y_t) (which ends up being true for squared loss, softmax-NLL, etc.), then
dL_seq/dW^o = Σ_{t=1}^{n} (g_t − y_t) s_t^T ,    (C.45)
where each term in the sum is the product of a v × 1 vector and a 1 × m vector, so the result is v × m, matching the shape of W^o.
Whew!
Study Question: Derive the updates for the offsets W0ss and W0o .
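As a small illustration of Eq. C.45 (a sketch with made-up dimensions and values), the gradient with respect to W^o accumulates one outer product per time step of the unrolled sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
n, v, m = 5, 2, 3                                        # made-up sequence length and dimensions
guesses = [rng.normal(size=(v, 1)) for _ in range(n)]    # g_t from the forward pass
targets = [rng.normal(size=(v, 1)) for _ in range(n)]    # y_t
states  = [rng.normal(size=(m, 1)) for _ in range(n)]    # s_t

dWo = np.zeros((v, m))
for g, y, s in zip(guesses, targets, states):
    dWo += (g - y) @ s.T      # a (v x 1)(1 x m) outer product per step, as in Eq. C.45
```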
Consider a case where only the output at the end of the sequence is incorrect, but it depends
critically, via the weights, on the input at time 1. In this case, we will multiply the loss at
step n by
(∂s_2/∂s_1)(∂s_3/∂s_2) ··· (∂s_n/∂s_{n−1}) .    (C.47)
In general, this quantity will either grow or shrink exponentially with the length of the
sequence, and make it very difficult to train.
Study Question: The last time we talked about exploding and vanishing gradients, it
was to justify per-weight adaptive step sizes. Why is that not a solution to the prob-
lem this time?
An important insight that really made recurrent networks work well on long sequences
is the idea of gating.
where ∗ is component-wise multiplication. We can see, here, that the output of the gating
network is deciding, for each dimension of the state, how much it should be updated now.
This mechanism makes it much easier for the network to learn to, for example, “store”
some information in some dimension of the state, and then not change it during future
state updates, or change it only under certain conditions on the input or other aspects of
the state.
Study Question: Why is it important that the activation function for g be a sigmoid?
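The precise gated update is not reproduced above, but a minimal sketch of the idea (the weight names, the concatenated input, and the tanh candidate state are assumptions, not the notes' definition) looks like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(s_prev, x, Wg, Wg0, Ws, Ws0):
    """One gated state update: the gate decides, per state dimension, how much to change."""
    inp = np.concatenate([s_prev, x])
    g = sigmoid(Wg @ inp + Wg0)                 # gate values in (0, 1), one per state dimension
    s_candidate = np.tanh(Ws @ inp + Ws0)       # proposed new state (tanh is an assumption)
    return (1 - g) * s_prev + g * s_candidate   # * is component-wise multiplication
```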
In which we try to describe the outlines of the “lifecycle” of supervised learning, including
hyperparameter tuning and evaluation of the final product.
If the data used for evaluation were not used during learning of the hypothesis then this is
a reasonable estimate of how well the hypothesis will make additional predictions on new
data from the same source.
Then,
V(A, L, D) = (1/K) Σ_{k=1}^{K} E( A(D^{train}_k), L, D^{val}_k ) .
h∗ = A∗ (D) ,
D.1.7 Hyperparameters
Often, rather than comparing an arbitrary collection of learning algorithms, we think of
our learning algorithm as having some parameters that affect the way it maps data to a
hypothesis. These are not parameters of the hypothesis itself, but rather parameters of the
algorithm. We call these hyperparameters. A classic example would be to use a hyperparam-
eter λ to govern the weight of a regularization term on an objective to be optimized:
J(h; D) = E(h, L, D) + λR(h) .
Then we could think of our algorithm as A(D; λ). Picking a good value of λ is the same as
comparing different supervised learning algorithms, which is accomplished by validating
them and picking the best one!
For a particular training data set and parameter λ, it finds the best hypothesis on this data,
specified with parameters Θ = (θ, θ0 ), written Θ∗ (λ, D).
Picking the best value of the hyperparameter is choosing among learning algorithms.
We could, most simply, optimize using a single training / validation split, so D = Dtrain ∪
Dval , and
λ^* = arg min_λ V(A_λ, L, D^{val})
It would be much better to select the best λ using multiple runs or cross-validation; that
would just be a different choice of the validation procedure V in the top line.
Note that we don’t use regularization here because we just want to measure how good
the output of the algorithm is at predicting values of new points, and so that’s what we
measure. We use the regularizer during training when we don’t want to focus only on
optimizing predictions on the training data.
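A sketch of this workflow for the ridge-regression algorithm A(D; λ) described above (the helper names and data are made up): pick λ by K-fold cross-validation on the unregularized squared-error objective, then retrain on all of D with the winner.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """A(D; lam): ridge regression with an unregularized offset theta_0."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])        # append a column for theta_0
    reg = lam * np.eye(Xa.shape[1]); reg[-1, -1] = 0.0   # don't regularize the offset
    return np.linalg.solve(Xa.T @ Xa / len(y) + reg, Xa.T @ y / len(y))

def mse(theta, X, y):
    # Validation error E: plain mean squared error, with no regularizer.
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.mean((Xa @ theta - y) ** 2)

def cross_val_score(X, y, lam, K=5):
    """V(A_lam, mse, D): average validation error over K folds."""
    folds = np.array_split(np.arange(len(y)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        theta = ridge_fit(X[train], y[train], lam)
        errs.append(mse(theta, X[val], y[val]))
    return np.mean(errs)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ rng.normal(size=4) + 0.1 * rng.normal(size=60)   # made-up data set D
lambdas = [0.0, 0.01, 0.1, 1.0]
best_lam = min(lambdas, key=lambda lam: cross_val_score(X, y, lam))
theta_star = ridge_fit(X, y, best_lam)    # finally, train on all the data with lambda*
```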
Finally! To make a predictor to ship out into the world, we would use all the data we
have, D, to train, using the best hyperparameters we know, and return
Θ^* = A(D; λ^*)
    = Θ^*(λ^*, D)
    = arg min_{θ,θ_0} (1/|D|) Σ_{(x,y)∈D} (θ^T x + θ_0 − y)^2 + λ^* ∥θ∥^2
Finally, a customer might evaluate this hypothesis on their data, which we have never
seen during training or validation, as
• Y = {+1, 0}
A(D; λ) = Θ^*(λ, D) = arg min_{θ,θ_0} (1/|D|) Σ_{(x,y)∈D} L_nll( σ(θ^T x + θ_0), y ) + λ∥θ∥^2 .
For a particular training data set and parameter λ, it finds the best hypothesis on this data,
specified with parameters Θ = (θ, θ0 ), written Θ∗ (λ, D) according to the proxy loss Lnll .
Picking the best value of the hyperparameter is choosing among learning algorithms
based on their actual predictions. We could, most simply, optimize using a single training
/ validation split, so D = Dtrain ∪ Dval , and we use the real 01 loss:
It would be much better to select the best λ using multiple runs or cross-validation; that
would just be a different choice of the validation procedure V in the top line.
Finally! To make a predictor to ship out into the world, we would use all the data we
have, D, to train, using the best hyperparameters we know, and return
Θ∗ = A(D; λ∗ )
Study Question: What loss function is being optimized inside this algorithm?
Finally, a customer might evaluate this hypothesis on their data, which we have never
seen during training or validation, as
The customer just wants to buy the right stocks! So we use the real L01 here for validation.
activation function, 52
hyperbolic tangent, 53
ReLU, 53
sigmoid, 53
softmax, 53
step function, 52
active learning, 102
adaptive step size, 60
adadelta, 119
adam, 120
momentum, 118
running average, 118
autoencoder, 103
variational, 111
classification, 29
binary classification, 29
clustering, 103
evaluation, 108
into partitions, 104
k-means algorithm, 105
convolution, 64
convolutional neural network, 64
backpropagation, 68
cross validation, 22
data
training data, 6
distribution, 11
independent and identically distributed, 6
dynamic programming, 93
empirical probability, 83
ensemble models, 78
error
test error, 11
training error, 11
error back-propagation, 57
estimation, 6
estimation error, 21
expectation, 89
experience replay, 100
exploding (or vanishing) gradients, 60
features, 13
filter, 64
channels, 65
filter size, 66
finite state machine, 124
fitted Q-learning, 100
hand-built features
binary code, 47
factored, 47
numeric, 46
one-hot, 47
text encoding, 47
thermometer, 46
hierarchical clustering, 108
hyperparameter
hyperparameter tuning, 22
hyperplane, 15
hypothesis, 10
linear regression, 15
k-means
algorithm, 106
formulation, 105
in feature space, 108
initialization, 107
kernel methods, 46
Machine Learning, 6
Markov decision process, 87
max pooling, 66
model, 7
learned model, 10
model class, 7, 12
no model, 11
prediction rule, 11
parameter, 11
policy, 89
finite horizon, 89
infinite horizon, 89
discount factor, 90
optimal policy, 91
reinforcement learning, 95
stationary policy, 93
Q-function, 91
real numbers, 13
recurrent neural network, 123
recurrent neural network, 125
gating, 131
regression, 13
linear regression, 15
random regression, 16
ridge regression, 20
regularizer, 14
reinforcement
actor-critic methods, 99
reinforcement learning
exploration vs. exploitation, 102
policy, 95
reinforcement learning, 87, 95, 123
Laplace correction, 101
RL algorithm, 96
reward function, 87, 101
separator, 30
linear separator, 31
sequence-to-sequence mapping, 126
sign function, 30
sliding window, 100
spatial locality, 63
state machine, 123
state transition diagram, 124
stochastic gradient descent, 28
convergence, 28
structural error, 21
supervised learning, 7
tensor, 65
total derivative, 127
transduction, 126
transition function, 124
transition model, 87
translation invariance, 63
tree models
information gain, 84
tree models, 78
building a tree, 82
classification, 83
cost complexity, 83
pruning, 83
regression tree, 81
splitting criteria, 84
entropy, 84
Gini index, 84
misclassification error, 84
value function
calculating value function, 91
value function, 91
weight sharing, 65