Lecture Notes – MIT 6.390 Spring 2024

© 2024 MIT. All Rights Reserved


Contents

1 Introduction 6
1.1 Problem class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.1.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.2 Unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.2.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.2.2 Density estimation . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.2.3 Dimensionality reduction . . . . . . . . . . . . . . . . . . . . 8
1.1.3 Sequence learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.1.4 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.5 Other settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Model type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Non-parametric models . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 Parametric models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Model class and parameter fitting . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.6 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Regression 13
2.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Regression as an optimization problem . . . . . . . . . . . . . . . . . . . . . 14
2.3 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 A gloriously simple linear regression algorithm . . . . . . . . . . . . . . . . . 16
2.5 Analytical solution: ordinary least squares . . . . . . . . . . . . . . . . . . . 16
2.6 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6.1 Regularization and linear regression . . . . . . . . . . . . . . . . . . . 18
2.6.2 Ridge regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Evaluating learning algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7.1 Evaluating hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7.2 Evaluating learning algorithms . . . . . . . . . . . . . . . . . . . . . . 21
2.7.2.1 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.2.2 Cross validation . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.7.2.3 Hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . 22

3 Gradient Descent 23
3.1 Gradient descent in one dimension . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Multiple dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Application to regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Ridge regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Stochastic gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4 Classification 29
4.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Linear classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.1 Linear classifiers: definition . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Linear logistic classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.3.1 Linear logistic classifiers: definition . . . . . . . . . . . . . . . . . . . 33
4.3.2 Learning linear logistic classifiers . . . . . . . . . . . . . . . . . . . . . 35
4.4 Gradient descent for logistic regression . . . . . . . . . . . . . . . . . . . . . 37
4.4.1 Convexity of the NLL Loss Function . . . . . . . . . . . . . . . . . . . 38
4.5 Handling multiple classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.6 Prediction accuracy and validation . . . . . . . . . . . . . . . . . . . . . . . . 40

5 Feature representation 41
5.1 Gaining intuition about feature transformations . . . . . . . . . . . . . . . . 41
5.2 Systematic feature construction . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.1 Polynomial basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.2 Radial basis functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3 Hand-constructing features for real domains . . . . . . . . . . . . . . . . . . 46
5.3.1 Discrete features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.2 Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.3 Numeric values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

6 Neural Networks 49
6.1 Basic element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.2 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
6.2.1 Single layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
6.2.2 Many layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.3 Choices of activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.4 Loss functions and activation functions . . . . . . . . . . . . . . . . . . . . . 54
6.5 Error back-propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6.5.1 First, suppose everything is one-dimensional . . . . . . . . . . . . . . 54
6.5.2 The general case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.5.3 Derivations for the general case . . . . . . . . . . . . . . . . . . . . . . 57
6.5.4 Reflecting on backpropagation . . . . . . . . . . . . . . . . . . . . . . 57
6.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.7 Optimizing neural network parameters . . . . . . . . . . . . . . . . . . . . . 59
6.7.1 Batches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.7.2 Adaptive step-size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.8 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.8.1 Methods related to ridge regression . . . . . . . . . . . . . . . . . . . 61
6.8.2 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.8.3 Batch normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

7 Convolutional Neural Networks 63


7.1 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.2 Max pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
7.3 Typical architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
7.4 Backpropagation in a simple CNN . . . . . . . . . . . . . . . . . . . . . . . . 68
7.4.1 Weight update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.4.2 Max pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

8 Transformers 70
8.1 Vector embeddings and tokens . . . . . . . . . . . . . . . . . . . . . . . . . . 70
8.2 Query, key, value, and attention . . . . . . . . . . . . . . . . . . . . . . . . . . 72
8.2.1 Self Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.3 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.3.1 Learned embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.3.2 Variations and training . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

9 Non-parametric methods 78
9.1 Nearest Neighbor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
9.2 Tree Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
9.2.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.2.1.1 Building a tree . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9.2.1.2 Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.2.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.2.3 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.2.4 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.2.5 Tree variants and tradeoffs . . . . . . . . . . . . . . . . . . . . . . . . 86

10 Markov Decision Processes 87


10.1 Definition and value functions . . . . . . . . . . . . . . . . . . . . . . . . . . 87
10.1.1 Finite-horizon value functions . . . . . . . . . . . . . . . . . . . . . . 89
10.1.2 Infinite-horizon value functions . . . . . . . . . . . . . . . . . . . . . 89
10.2 Finding policies for MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
10.2.1 Finding optimal finite-horizon policies . . . . . . . . . . . . . . . . . 91
10.2.2 Finding optimal infinite-horizon policies . . . . . . . . . . . . . . . . 93

11 Reinforcement learning 95
11.1 Reinforcement learning algorithms overview . . . . . . . . . . . . . . . . . . 96
11.2 Model-free methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
11.2.1 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
11.2.2 Function approximation: Deep Q learning . . . . . . . . . . . . . . . 99
11.2.3 Fitted Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
11.2.4 Policy gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
11.3 Model-based RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
11.4 Bandit problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

12 Unsupervised Learning 103


12.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
12.1.1 Clustering formalisms . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
12.1.2 The k-means formulation . . . . . . . . . . . . . . . . . . . . . . . . . 104
12.1.3 K-means algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
12.1.4 Using gradient descent to minimize k-means objective . . . . . . . . 106
12.1.5 Importance of initialization . . . . . . . . . . . . . . . . . . . . . . . . 106

12.1.6 Importance of k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107


12.1.7 k-means in feature space . . . . . . . . . . . . . . . . . . . . . . . . . . 108
12.1.7.1 How to evaluate clustering algorithms . . . . . . . . . . . . 108
12.2 Autoencoder structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
12.2.1 Autoencoder Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
12.2.2 Evaluating an autoencoder . . . . . . . . . . . . . . . . . . . . . . . . 111
12.2.3 Linear encoders and decoders . . . . . . . . . . . . . . . . . . . . . . 111
12.2.4 Advanced encoders and decoders . . . . . . . . . . . . . . . . . . . . 111

A Matrix derivative common cases 112


A.1 The shapes of things . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
A.2 Some vector-by-vector identities . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.2.1 Some fundamental cases . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.2.2 Derivatives involving a constant matrix . . . . . . . . . . . . . . . . . 114
A.2.3 Linearity of derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . 114
A.2.4 Product rule (vector-valued numerator) . . . . . . . . . . . . . . . . . 115
A.2.5 Chain rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.3 Some other identities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.4 Derivation of gradient for linear regression . . . . . . . . . . . . . . . . . . . 116
A.5 Matrix derivatives using Einstein summation . . . . . . . . . . . . . . . . . . 116

B Optimizing Neural Networks 118


B.0.1 Strategies towards adaptive step-size . . . . . . . . . . . . . . . . . . 118
B.0.1.1 Running averages . . . . . . . . . . . . . . . . . . . . . . . . 118
B.0.1.2 Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
B.0.1.3 Adadelta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
B.0.1.4 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
B.0.2 Batch Normalization Details . . . . . . . . . . . . . . . . . . . . . . . 120

C Recurrent Neural Networks 123


C.1 State machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
C.2 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
C.3 Sequence-to-sequence RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
C.4 RNN as a language model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
C.5 Back-propagation through time . . . . . . . . . . . . . . . . . . . . . . . . . . 126
C.6 Vanishing gradients and gating mechanisms . . . . . . . . . . . . . . . . . . 130
C.6.1 Simple gated recurrent networks . . . . . . . . . . . . . . . . . . . . . 131
C.6.2 Long short-term memory . . . . . . . . . . . . . . . . . . . . . . . . . 131

D Supervised learning in a nutshell 133


D.1 General case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
D.1.1 Minimal problem specification . . . . . . . . . . . . . . . . . . . . . . 133
D.1.2 Evaluating a hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . 133
D.1.3 Evaluating a supervised learning algorithm . . . . . . . . . . . . . . . 134
D.1.3.1 Using a validation set . . . . . . . . . . . . . . . . . . . . . . 134
D.1.3.2 Using multiple training/evaluation runs . . . . . . . . . . . 134
D.1.3.3 Cross validation . . . . . . . . . . . . . . . . . . . . . . . . . 134
D.1.4 Comparing supervised learning algorithms . . . . . . . . . . . . . . . 134
D.1.5 Fielding a hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
D.1.6 Learning algorithms as optimizers . . . . . . . . . . . . . . . . . . . . 135
D.1.7 Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
D.2 Concrete case: linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . 135

D.3 Concrete case: logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . 136

Index 138


CHAPTER 1

Introduction

The main focus of machine learning (ML) is making decisions or predictions based on data. There are a number of other fields with significant overlap in technique, but difference in focus: in economics and psychology, the goal is to discover underlying causal processes, and in statistics it is to find a model that fits a data set well. In those fields, the end product is a model. In machine learning, we often fit models, but as a means to the end of making good predictions or decisions. (This description is paraphrased from a post on 9/4/12 at andrewgelman.com.)
As ML methods have improved in their capability and scope, ML has become arguably
the best way–measured in terms of speed, human engineering time, and robustness–to
approach many applications. Great examples are face detection, speech recognition, and
many kinds of language-processing tasks. Almost any application that involves under-
standing data or signals that come from the real world can be nicely addressed using ma-
chine learning.
One crucial aspect of machine learning approaches to solving problems is that human engineering plays an important (and often undervalued) role. A human still has to frame the problem: acquire and organize data, design a space of possible solutions, select a learning algorithm and its parameters, apply the algorithm to the data, validate the resulting solution to decide whether it's good enough to use, try to understand the impact on the people who will be affected by its deployment, etc. These steps are of great importance.
The conceptual basis of learning from data is the problem of induction: Why do we think that previously seen data will help us predict the future? This is a serious long-standing philosophical problem. We will operationalize it by making assumptions, such as that all training data are so-called i.i.d. (independent and identically distributed), and that queries will be drawn from the same distribution as the training data, or that the answer comes from a set of possible answers known in advance. (Being i.i.d. means that the elements in the set are related in the sense that they all come from the same underlying probability distribution, but not in any other ways.)
In general, we need to solve these two problems:

• estimation: When we have data that are noisy reflections of some underlying quantity of interest, we have to aggregate the data and make estimates or predictions about the quantity. How do we deal with the fact that, for example, the same treatment may end up with different results on different trials? How can we predict how well an estimate may compare to future results?

• generalization: How can we predict results of a situation or experiment that we have never encountered before in our data set?


We can describe problems and their solutions using six characteristics, three of which
characterize the problem and three of which characterize the solution:

1. Problem class: What is the nature of the training data and what kinds of queries will
be made at testing time?

2. Assumptions: What do we know about the source of the data or the form of the
solution?

3. Evaluation criteria: What is the goal of the prediction or estimation system? How
will the answers to individual queries be evaluated? How will the overall perfor-
mance of the system be measured?

4. Model type: Will an intermediate model of the world be made? What aspects of the
data will be modeled in different variables/parameters? How will the model be used
to make predictions?

5. Model class: What particular class of models will be used? What criterion will we
use to pick a particular model from the model class?

6. Algorithm: What computational process will be used to fit the model to the data
and/or to make predictions?

Without making some assumptions about the nature of the process generating the data, we cannot perform generalization. In the following sections, we elaborate on these ideas.

1.1 Problem class

There are many different problem classes in machine learning. They vary according to what kind of data is provided and what kind of conclusions are to be drawn from it. Five standard problem classes are described below, to establish some notation and terminology. (Don't feel you have to memorize all these kinds of learning, etc. We just want you to have a very high-level view of (part of) the breadth of the field.)
In this course, we will focus on classification and regression (two examples of supervised learning), and we will touch on reinforcement learning, sequence learning, and clustering.

1.1.1 Supervised learning


The idea of supervised learning is that the learning system is given inputs and told which
specific outputs should be associated with them. We divide up supervised learning based
on whether the outputs are drawn from a small finite set (classification) or a large finite
ordered set or continuous set (regression).

1.1.1.1 Regression
For a regression problem, the training data Dn is in the form of a set of n pairs:

D_n = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\},

where x(i) represents an input, most typically a d-dimensional vector of real and/or discrete values, and y(i) is the output to be predicted, in this case a real number. The y values are sometimes called target values. (Many textbooks use x_i and t_i instead of x(i) and y(i). We find that notation somewhat difficult to manage when x(i) is itself a vector and we need to talk about its elements. The notation we are using is standard in some other parts of the ML literature.)
The goal in a regression problem is ultimately, given a new input value x(n+1), to predict the value of y(n+1). Regression problems are a kind of supervised learning, because the desired output y(i) is specified for each of the training examples x(i).

1.1.1.2 Classification
A classification problem is like regression, except that the values that y(i) can take do not
have an order. The classification problem is binary or two-class if y(i) (also known as the
class) is drawn from a set of two possible values; otherwise, it is called multi-class.

1.1.2 Unsupervised learning


Unsupervised learning doesn’t involve learning a function from inputs to outputs based on
a set of input-output pairs. Instead, one is given a data set and generally expected to find
some patterns or structure inherent in it.

1.1.2.1 Clustering
Given samples x(1) , . . . , x(n) ∈ Rd , the goal is to find a partitioning (or “clustering”) of
the samples that groups together similar samples. There are many different objectives,
depending on the definition of the similarity between samples and exactly what criterion
is to be used (e.g., minimize the average distance between elements inside a cluster and
maximize the average distance between elements across clusters). Other methods perform
a “soft” clustering, in which samples may be assigned 0.9 membership in one cluster and
0.1 in another. Clustering is sometimes used as a step in the so-called density estimation
(described below), and sometimes to find useful structure or influential features in data.

1.1.2.2 Density estimation


Given samples x(1), . . . , x(n) ∈ R^d drawn i.i.d. from some distribution Pr(X), the goal is to predict the probability Pr(x(n+1)) of an element drawn from the same distribution. Density estimation sometimes plays a role as a “subroutine” in the overall learning method for supervised learning, as well. (The capital X is a typical practice to emphasize that this is a so-called random variable. Small letters are often used in probability too; those are typically reserved to denote the realized values of random variables. It might help to concretely think of coin tosses; there, the toss outcome is a random variable and it may be realized as a "head". This paragraph actually talks about both a random variable and a realization of it; can you spot that from the notation, and do you feel the difference?)

1.1.2.3 Dimensionality reduction

Given samples x(1), . . . , x(n) ∈ R^D, the problem is to re-represent them as points in a d-dimensional space, where d < D. The goal is typically to retain information in the data set that will, e.g., allow elements of one class to be distinguished from another.
Dimensionality reduction is a standard technique that is particularly useful for visualizing or understanding high-dimensional data. If the goal is ultimately to perform regression or classification on the data after the dimensionality is reduced, it is usually best to articulate an objective for the overall prediction problem rather than to first do dimensionality reduction without knowing which dimensions will be important for the prediction task.

1.1.3 Sequence learning
In sequence learning, the goal is to learn a mapping from input sequences x0 , . . . , xn to output
sequences y1 , . . . , ym . The mapping is typically represented as a state machine, with one
function fs used to compute the next hidden internal state given the input, and another
function fo used to compute the output given the current hidden state.
It is supervised in the sense that we are told what output sequence to generate for which
input sequence, but the internal functions have to be learned by some method other than
direct supervision, because we don’t know what the hidden state sequence is.


1.1.4 Reinforcement learning


In reinforcement learning, the goal is to learn a mapping from input values (typically as-
sumed to be states of an agent or system; for now, think e.g. the velocity of a moving car)
to output values (typically we want control actions; for now, think e.g. if to accelerate or hit
the brake). However, we need to learn the mapping without a direct supervision signal to
specify which output values are best for a particular input; instead, the learning problem
is framed as an agent interacting with an environment, in the following setting:

• The agent observes the current state s_t. (Note it's standard practice in reinforcement learning to use s and a instead of x and y to denote the machine learning model's input and output. The subscript t denotes the timestep, and captures the sequential nature of the problem.)

• It selects an action a_t.

• It receives a reward, r_t, which typically depends on s_t and possibly a_t.

• The environment transitions probabilistically to a new state, s_{t+1}, with a distribution that depends only on s_t and a_t.

• The agent observes the current state, s_{t+1}.

• ...

The goal is to find a policy π, mapping s to a, (that is, states to actions) such that some
long-term sum or average of rewards r is maximized.
This setting is very different from either supervised learning or unsupervised learning,
because the agent’s action choices affect both its reward and its ability to observe the envi-
ronment. It requires careful consideration of the long-term effects of actions, as well as all
of the other issues that pertain to supervised learning.
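
As a rough illustration (not from the notes), the interaction loop above might be written in Python as follows; env and policy are hypothetical objects standing in for the environment dynamics and for a candidate policy π, not any specific library.

def run_episode(env, policy, horizon):
    # env and policy are assumed interfaces, purely for illustration:
    # env.reset() returns an initial state s_0; env.step(a) returns (next_state, reward).
    s = env.reset()
    total_reward = 0.0
    for t in range(horizon):
        a = policy(s)              # the agent selects an action a_t based on the observed state s_t
        s_next, r = env.step(a)    # the environment yields a reward r_t and transitions to s_{t+1}
        total_reward += r
        s = s_next                 # the agent observes the new state
    return total_reward

A reinforcement learning algorithm would run many such episodes and use the observed rewards to improve the policy; that is the subject of Chapter 11.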

1.1.5 Other settings


There are many other problem settings. Here are a few.
In semi-supervised learning, we have a supervised-learning training set, but there may
be an additional set of x(i) values with no known y(i) . These values can still be used
to improve learning performance (if they are drawn from Pr(X) that is the marginal of
Pr(X, Y) that governs the rest of the data set).
In active learning, it is assumed to be expensive to acquire a label y(i) (imagine asking a
human to read an x-ray image), so the learning algorithm can sequentially ask for particular
inputs x(i) to be labeled, and must carefully select queries in order to learn as effectively as
possible while minimizing the cost of labeling.
In transfer learning (also called meta-learning), there are multiple tasks, with data drawn
from different, but related, distributions. The goal is for experience with previous tasks to
apply to learning a current task in a way that requires decreased experience with the new
task.

1.2 Assumptions
The kinds of assumptions that we can make about the data source or the solution include:

• The data are independent and identically distributed (i.i.d.).

• The data are generated by a Markov chain (i.e., outputs depend only on the current state, with no additional memory).

• The process generating the data might be adversarial.


• The “true” model that is generating the data can be perfectly described by one of
some particular set of hypotheses.

The effect of an assumption is often to reduce the “size” or “expressiveness” of the space of
possible hypotheses and therefore reduce the amount of data required to reliably identify
an appropriate hypothesis.

1.3 Evaluation criteria


Once we have specified a problem class, we need to say what makes an output or the an-
swer to a query good, given the training data. We specify evaluation criteria at two levels:
how an individual prediction is scored, and how the overall behavior of the prediction or
estimation system is scored.
The quality of predictions from a learned model is often expressed in terms of a loss
function. A loss function L(g, a) tells you how much you will be penalized for making a
guess g when the answer is actually a. There are many possible loss functions. Here are
some frequently used examples:

• 0-1 Loss applies to predictions drawn from finite domains. (If the actual values are drawn from a continuous distribution, the probability they would ever be equal to some predicted g is 0, except for some weird cases.)

  L(g, a) = \begin{cases} 0 & \text{if } g = a \\ 1 & \text{otherwise} \end{cases}

• Squared loss
L(g, a) = (g − a)2

• Absolute loss
L(g, a) = |g − a|

• Asymmetric loss Consider a situation in which you are trying to predict whether someone is having a heart attack. It might be much worse to predict “no” when the answer is really “yes”, than the other way around.

  L(g, a) = \begin{cases} 1 & \text{if } g = 1 \text{ and } a = 0 \\ 10 & \text{if } g = 0 \text{ and } a = 1 \\ 0 & \text{otherwise} \end{cases}
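
To make these concrete, here is a minimal Python sketch (not part of the original notes) of the four losses above; the penalty of 10 in the asymmetric case mirrors the heart-attack example.

def zero_one_loss(g, a):
    # 0-1 loss: any wrong guess costs 1, a correct guess costs 0
    return 0 if g == a else 1

def squared_loss(g, a):
    return (g - a) ** 2

def absolute_loss(g, a):
    return abs(g - a)

def asymmetric_loss(g, a):
    # predicting "no" (g = 0) when the answer is really "yes" (a = 1) costs 10,
    # the opposite mistake costs 1, and a correct prediction costs 0
    if g == 1 and a == 0:
        return 1
    if g == 0 and a == 1:
        return 10
    return 0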

Any given prediction rule will usually be evaluated based on multiple predictions and
the loss of each one. At this level, we might be interested in:

• Minimizing expected loss over all the predictions (also known as risk)

• Minimizing maximum loss: the loss of the worst prediction

• Minimizing or bounding regret: how much worse this predictor performs than the
best one drawn from some class

• Characterizing asymptotic behavior: how well the predictor will perform in the limit
of infinite training data

• Finding algorithms that are probably approximately correct: they probably generate
a hypothesis that is right most of the time.


There is a theory of rational agency that argues that you should always select the action that minimizes the expected loss. This strategy will, for example, make you the most money in the long run, in a gambling setting. (Of course, there are other models for action selection and it's clear that people do not always, or maybe even often, select actions that follow this rule.) As mentioned above, expected loss is also sometimes called risk in ML literature, but that term means other things in economics or other parts of decision theory, so be careful...it's risky to use it. We will, most of the time, concentrate on this criterion.
1.4 Model type
Recall that the goal of a ML system is typically to estimate or generalize, based on data
provided. Below, we examine the role of model-making in machine learning.

1.4.1 Non-parametric models


In some simple cases, in response to queries, we can generate predictions directly from
the training data, without the construction of any intermediate model, or more precisely,
without the learning of any parameters.
For example, in regression or classification, we might generate an answer to a new
query by averaging answers to recent queries, as in the nearest neighbor method.
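
For instance, a minimal sketch (not from the notes) of a one-nearest-neighbor predictor, which answers a query directly from stored training data, might look like the following; the method is developed properly in Chapter 9.

import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_query):
    # predict by returning the target value of the closest stored training example
    # X_train: n x d array of training inputs; y_train: length-n array of targets
    distances = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(distances)]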

1.4.2 Parametric models


This two-step process is more typical:
1. “Fit” a model (with some a priori chosen parameterization) to the training data
2. Use the model directly to make predictions
In the parametric models setting of regression or classification, the model will be some
hypothesis or prediction rule y = h(x; Θ) for some functional form h. The term hypothesis
has its roots in statistical learning and the scientific method, where models or hypotheses
about the world are tested against real data, and refined with more evidence, observations,
or insights. Note that the parameters themselves are only part of the assumptions that
we’re making about the world. The model itself is a hypothesis that will be refined with
more evidence.
The idea is that Θ is a set of one or more parameter values that will be determined by
fitting the model to the training data and then be held fixed during testing.
Given a new x(n+1), we would then make the prediction h(x(n+1); Θ). (We write f(a; b) to describe a function that is usually applied to a single argument a, but is a member of a parametric family of functions, with the particular function determined by parameter value b.)
The fitting process is often articulated as an optimization problem: Find a value of Θ that minimizes some criterion involving Θ and the data. An optimal strategy, if we knew the actual underlying distribution on our data, Pr(X, Y), would be to predict the value of y that minimizes the expected loss, which is also known as the test error. (This notation describes a so-called “joint distribution”; roughly, as the name suggests, it captures how both random variables X and Y “contribute” to the chance of something happening.) If we don't have that actual underlying distribution, or even an estimate of it, we can take the approach of minimizing the training error: that is, finding the prediction rule h that minimizes the average loss on our training data set. So, we would seek Θ that minimizes

E_n(h; \Theta) = \frac{1}{n} \sum_{i=1}^{n} L(h(x^{(i)}; \Theta), y^{(i)}) ,

where the loss function L(g, a) measures how bad it would be to make a guess of g when the actual value is a.
to emphasize fitting the current data too strongly and end up with a hypothesis that does
not generalize well when presented with new x values.


1.5 Model class and parameter fitting


A model class M is a set of possible models, typically parameterized by a vector of param-
eters Θ. What assumptions will we make about the form of the model? When solving a
regression problem using a prediction-rule approach, we might try to find a linear func-
tion h(x; θ, θ0 ) = θT x + θ0 that fits our data well. In this example, the parameter vector
Θ = (θ, θ0 ).
For problem types such as classification, there are huge numbers of model classes that
have been considered...we'll spend much of this course exploring these model classes, especially neural network models. We will almost completely restrict our attention to model
classes with a fixed, finite number of parameters. Models that relax this assumption are
called “non-parametric” models.
How do we select a model class? In some cases, the ML practitioner will have a good
idea of what an appropriate model class is, and will specify it directly. In other cases, we
may consider several model classes and choose the best based on some objective function.
In such situations, we are solving a model selection problem: model-selection is to pick a
model class M from a (usually finite) set of possible model classes, whereas model fitting is
to pick a particular model in that class, specified by (usually continuous) parameters Θ.

1.6 Algorithm
Once we have described a class of models and a way of scoring a model given data, we
have an algorithmic problem: what sequence of computational instructions should we run
in order to find a good model from our class? For example, determining the parameter
vector which minimizes the training error might be done using a familiar least-squares
minimization algorithm, when the model h is a function being fit to some data x.
Sometimes we can use software that was designed, generically, to perform optimiza-
tion. In many other cases, we use algorithms that are specialized for ML problems, or for
particular hypotheses classes. Some algorithms are not easily seen as trying to optimize a
particular criterion. In fact, a historically important method for finding linear classifiers,
the perceptron algorithm, has this character.


CHAPTER 2

Regression

Regression is an important machine-learning problem that provides a good starting point for diving deeply into the field. (“Regression,” in common parlance, means moving backwards. But this is forward progress!)
2.1 Problem formulation
A hypothesis h is employed as a model for solving the regression problem, in that it maps
inputs x to outputs y,
x → h → y ,
where x ∈ Rd (i.e., a length d column vector of real numbers), and y ∈ R (i.e., a real
number). Real life rarely gives us vectors of real numbers; the x we really want to take as
input is usually something like a song, image, or person. In that case, we’ll have to define
a function φ(x), whose range is Rd , where φ represents features of x, like a person’s height
or the amount of bass in a song, and then let h : φ(x) → R. In much of the following,
we’ll omit explicit mention of φ and assume that the x(i) are in Rd , but you should always
have in mind that some additional process was almost surely required to go from the actual
input examples to their feature representation, and we’ll talk a lot more about features later
in the course.
Regression is a supervised learning problem, in which we are given a training dataset of
the form

D_n = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\},

which gives examples of input values x(i) and the output values y(i) that should be
associated with them. Because y values are real-valued, our hypotheses will have the form

h : Rd → R .

This is a good framework when we want to predict a numerical quantity, like height, stock
value, etc., rather than to divide the inputs into discrete categories.
What makes a hypothesis useful? That it works well on new data; that is, that it makes good predictions on examples it hasn't seen. But we don't know exactly what data this hypothesis might be tested on when we use it in the real world. So, we have to assume a connection between the training data and testing data – typically, the assumption is that they (the training and testing data) are drawn independently from the same probability distribution. (My favorite analogy is to problem sets. We evaluate a student's ability to generalize by putting questions on the exam that were not on the homework (training set).)
To make this discussion more concrete, we have to provide a loss function, to say how
unhappy we are when we guess an output g given an input x for which the desired output
was a.
Given a training set Dn and a hypothesis h with parameters Θ, we can define the train-
ing error of h to be the average loss on the training data:

E_n(h; \Theta) = \frac{1}{n} \sum_{i=1}^{n} L(h(x^{(i)}; \Theta), y^{(i)}) ,     (2.1)

The training error of h gives us some idea of how well it characterizes the relationship
between x and y values in our data, but it isn’t the quantity that we most care about. What
we most care about is test error:

E(h) = \frac{1}{n'} \sum_{i=n+1}^{n+n'} L(h(x^{(i)}), y^{(i)})

on n' new examples that were not used in the process of finding the hypothesis. (It might be worthwhile to stare at the two errors and think about what the difference is. For example, notice how Θ is no longer a variable in the testing error? This is because in evaluating the testing error, the parameters will have been “picked”/“fixed” already.)
For now, we will try to find a hypothesis with small training error (later, with some added criteria) and try to make some design choices so that it generalizes well to new data, meaning that it also has a small test error.

2.2 Regression as an optimization problem

Given data, a loss function, and a hypothesis class, we need a method for finding a good hypothesis in the class. One of the most general ways to approach this problem is by
framing the machine learning problem as an optimization problem. One reason for taking
this approach is that there is a rich area of math and algorithms studying and developing
efficient methods for solving optimization problems, and lots of very good software imple-
mentations of these methods. So, if we can turn our problem into one of these problems,
then there will be a lot of work already done for us!
We begin by writing down an objective function J(Θ), where Θ stands for all the param-
eters in our model (i.e., all possible choices over parameters). We often write J(Θ; D) to
make clear the dependence on the data D. (Don't be too perturbed by the semicolon where you expected to see a comma! It's a math way of saying that we are mostly interested in this as a function of the arguments before the “;”, but we should remember that there's a dependence on the stuff after it, as well.)
The objective function describes how we feel about possible hypotheses Θ: we will generally look for values for parameters Θ that minimize the objective function:

\Theta^* = \arg\min_{\Theta} J(\Theta) .

(You can think about Θ* here as “the theta that minimizes J”: \arg\min_x f(x) means the value of x for which f(x) is the smallest. Sometimes we write \arg\min_{x \in X} f(x) when we want to explicitly specify the set X of values of x over which we want to minimize.)
A very common form for a machine-learning objective is

J(\Theta) = \left( \frac{1}{n} \sum_{i=1}^{n} \underbrace{L(h(x^{(i)}; \Theta), y^{(i)})}_{\text{loss}} \right) + \underbrace{\lambda}_{\text{non-negative constant}} R(\Theta).     (2.2)

The loss tells us how unhappy we are about the prediction h(x(i); Θ) that Θ makes for (x(i), y(i)). Minimizing this loss makes the prediction better. The regularizer R(Θ) is an additional term that encourages the prediction to remain general, and the constant λ adjusts the balance between reproducing seen examples, and being able to generalize to unseen examples. We will return to discuss this balance, and more about the idea of regularization, in Section 2.6.

2.3 Linear regression


To make this discussion more concrete, we have to provide a hypothesis class and a loss
function.
We will begin by picking a class of hypotheses H that we think might provide a good
set of possible models of the relationship between x and y in our data. We will start with
a very simple class of linear hypotheses for regression. It is both simple to study and very
powerful, and will serve as the basis for many other important techniques (even neural
networks!).
In linear regression, the set H of hypotheses has the form

h(x; θ, θ0 ) = θT x + θ0 , (2.3)

with model parameters Θ = (θ, θ0 ). In one dimension (d = 1) this has the same familiar
slope-intercept form as y = mx+b; in higher dimensions, this model describes the so-called
hyperplanes.
We define a loss function to describe how to evaluate the quality of the predictions our
hypothesis is making, when compared to the “target” y values in the data set. The choice
of loss function is part of modeling your domain. In the absence of additional information
about a regression problem, we typically use squared loss:

L(g, a) = (g − a)2 .

where g = h(x) is our "guess" from the hypothesis, and a is the "actual" observation (in
other words, here a is being used equivalently as y). With this choice of squared loss, the
average loss as generally defined in 2.1 will become the so-called mean squared error (MSE),
which we’ll study closely very soon.
The squared loss penalizes guesses that are too high the same amount as it penal-
izes guesses that are too low, and has a good mathematical justification in the case that
your data are generated from an underlying linear hypothesis with the so-called Gaussian-
distributed noise added to the y values. But there are applications in which other losses would be better, and much of the framework we discuss can be applied to different loss functions, although this one has a form that also makes it particularly computationally convenient. (We won't get into the details of the Gaussian distribution in our class; but it's one of the most important distributions and well worth studying closely at some point. One obvious fact about the Gaussian is that it's symmetric; this is in fact one of the reasons squared loss works well under Gaussian settings, as the loss is also symmetric.)
Our objective in linear regression will be to find a hyperplane that goes as close as possible, on average, to all of our training data.
Applying the general optimization framework to the linear regression hypothesis class of Eq. 2.3 with squared loss and no regularization, our objective is to find values for Θ = (θ, θ0) that minimize the MSE:

J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right)^2 ,     (2.4)

resulting in the solution:

\theta^*, \theta_0^* = \arg\min_{\theta, \theta_0} J(\theta, \theta_0) .     (2.5)

For one-dimensional data (d = 1), this becomes the familiar problem of fitting a line
to data. For d > 1, this hypothesis may be visualized as a d-dimensional hyperplane
embedded in a (d + 1)-dimensional space (that consists of the input dimension and the y
dimension). For example, in the left plot below, we can see data points with labels y and
input dimensions x1 and x2 . In the right plot below, we see the result of fitting these points
with a two-dimensional plane that resides in three dimensions. We interpret the plane as
representing a function that provides a y value for any input (x1 , x2 ).
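
As a small illustration (not part of the notes), the MSE objective of Eq. 2.4 for a linear hypothesis can be computed with numpy; the data below are made up purely for the example.

import numpy as np

def mse_objective(X, y, theta, theta_0):
    # mean squared error of the linear hypothesis theta^T x + theta_0 (Eq. 2.4)
    # X: n x d array with one example per row; y: length-n array of target values
    predictions = X @ theta + theta_0
    return np.mean((predictions - y) ** 2)

# made-up data with d = 2 inputs and n = 4 examples
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [0.0, 1.0]])
y = np.array([3.1, 2.9, 7.2, 1.0])
print(mse_objective(X, y, theta=np.array([1.0, 1.0]), theta_0=0.0))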


A richer class of hypotheses can be obtained by performing a non-linear feature transformation before doing the regression, as we will later see (in Chapter 5), but it will still end up that we have to solve a linear regression problem.

2.4 A gloriously simple linear regression algorithm


Okay! Given the objective in Eq. 2.4, how can we find good values of θ and θ0 ? We’ll study
several general-purpose, efficient, interesting algorithms. But before we do that, let’s start
with the simplest one we can think of: guess a whole bunch (k) of different values of θ and θ0 ,
see which one has the smallest error on the training set, and return it.

RANDOM-REGRESSION(D, k)
1  For i in 1 . . . k: Randomly generate hypothesis θ^{(i)}, θ_0^{(i)}
2  Let i = arg min_i J(θ^{(i)}, θ_0^{(i)}; D)
3  Return θ^{(i)}, θ_0^{(i)}
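
A direct Python sketch of RANDOM-REGRESSION (illustrative only; the sampling range for the random hypotheses is an arbitrary choice, not something the notes specify) could be:

import numpy as np

def random_regression(X, y, k, seed=0):
    # try k randomly generated hypotheses (theta, theta_0) and keep the one with the
    # smallest training error J on the data (X, y)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_err, best_theta, best_theta_0 = np.inf, None, None
    for _ in range(k):
        theta = rng.uniform(-10.0, 10.0, size=d)   # arbitrary sampling range
        theta_0 = rng.uniform(-10.0, 10.0)
        err = np.mean((X @ theta + theta_0 - y) ** 2)
        if err < best_err:
            best_err, best_theta, best_theta_0 = err, theta, theta_0
    return best_theta, best_theta_0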

This seems kind of silly, but it’s a learning algorithm, and it’s not completely useless.
Study Question: If your data set has n data points, and the dimension of the x val-
ues is d, what is the size of an individual θ(i) ?

Study Question: How do you think increasing the number of guesses k will change
the training error of the resulting hypothesis?

2.5 Analytical solution: ordinary least squares


One very interesting aspect of the problem of finding a linear hypothesis that minimizes
mean squared error is that we can find a closed-form formula for the answer! This general problem is often called ordinary least squares (OLS). (What does “closed form” mean? Generally, that it involves direct evaluation of a mathematical expression using a fixed number of “typical” operations (like arithmetic operations, trig functions, powers, etc.). So equation 2.5 is not in closed form, because it's not at all clear what operations one needs to perform to find the solution.)
Everything is easier to deal with if we assume that all of the x(i) have been augmented with an extra input dimension (feature) that always has value 1, so that they are in d + 1 dimensions, and rather than having an explicit θ0, we let it be the last element of our θ vector, so that we have, simply,

y = \theta^T x .

In this case, the objective becomes

J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 .     (2.6)

Study Question: Stop and prove to yourself that adding that extra feature with
value 1 to every input vector and getting rid of the θ0 parameter is equivalent to our
original model.
We approach this just like a minimization problem from calculus homework: take the
derivative of J with respect to θ, set it to zero, and solve for θ. There are additional steps
required, to check that the resulting θ is a minimum (rather than a maximum or an inflec-
tion point) but we won’t work through that here. It is possible to approach this problem
by:

• Finding ∂J/∂θk for k in 1, . . . , d (we will use d here for the total number of features in each x(i), including the added 1),

• Constructing a set of d equations of the form ∂J/∂θk = 0, and

• Solving the system for values of θk.

That works just fine. To get practice for applying techniques like this to more complex
problems, we will work through a more compact (and cool!) matrix view. Along the way,
it will be helpful to collect all of the derivatives in one vector. In particular, the gradient of
J with respect to θ is the following column vector of length d:

\nabla_\theta J = \begin{bmatrix} \partial J / \partial \theta_1 \\ \vdots \\ \partial J / \partial \theta_d \end{bmatrix} .

Study Question: Work through the next steps and check your answer against ours
below.
We can think of our training data in terms of matrices X and Y, where each column of X
is an example, and each “column” of Y is the corresponding target output value:

X = \begin{bmatrix} x_1^{(1)} & \cdots & x_1^{(n)} \\ \vdots & \ddots & \vdots \\ x_d^{(1)} & \cdots & x_d^{(n)} \end{bmatrix} \qquad Y = \begin{bmatrix} y^{(1)} & \cdots & y^{(n)} \end{bmatrix} .

Study Question: What are the dimensions of X and Y?


In most textbooks, they think of an individual example x(i) as a row, rather than a
column. So that we get an answer that will be recognizable to you, we are going to define
a new matrix and vector, X̃ and Ỹ, which are just transposes of our X and Y, and then work
with them:

\tilde{X} = X^T = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(n)} & \cdots & x_d^{(n)} \end{bmatrix} \qquad \tilde{Y} = Y^T = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix} .

Study Question: What are the dimensions of X̃ and Ỹ?


Now we can write

J(\theta) = \frac{1}{n} \underbrace{(\tilde{X}\theta - \tilde{Y})^T}_{1 \times n} \underbrace{(\tilde{X}\theta - \tilde{Y})}_{n \times 1} = \frac{1}{n} \sum_{i=1}^{n} \left( \left( \sum_{j=1}^{d} \tilde{X}_{ij} \theta_j \right) - \tilde{Y}_i \right)^2

and using facts about matrix/vector calculus, we get

\nabla_\theta J = \frac{2}{n} \underbrace{\tilde{X}^T}_{d \times n} \underbrace{(\tilde{X}\theta - \tilde{Y})}_{n \times 1} .     (2.7)

(You should be able to verify this by doing a simple (say, 2 × 2) example by hand.)
See Appendix A for a nice way to think about finding this derivative.
Setting \nabla_\theta J to 0 and solving, we get:

\frac{2}{n} \tilde{X}^T (\tilde{X}\theta - \tilde{Y}) = 0
\tilde{X}^T \tilde{X}\theta - \tilde{X}^T \tilde{Y} = 0
\tilde{X}^T \tilde{X}\theta = \tilde{X}^T \tilde{Y}
\theta = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{Y}

And the dimensions work out!

\theta = \underbrace{\left(\tilde{X}^T \tilde{X}\right)^{-1}}_{d \times d} \underbrace{\tilde{X}^T}_{d \times n} \underbrace{\tilde{Y}}_{n \times 1}

So, given our data, we can directly compute the linear regression that minimizes mean
squared error. That’s pretty awesome!
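
As a sketch (not part of the notes), this closed-form solution is easy to compute with numpy. Here Xt plays the role of X̃, with one example per row and an appended column of 1's, so the last entry of the returned vector plays the role of θ0.

import numpy as np

def ols(X, y):
    # X: n x d array with one example per row; y: length-n array of targets
    n = X.shape[0]
    Xt = np.hstack([X, np.ones((n, 1))])          # append the constant-1 feature
    # theta = (Xt^T Xt)^{-1} Xt^T y; np.linalg.solve avoids forming the inverse explicitly
    theta = np.linalg.solve(Xt.T @ Xt, Xt.T @ y)
    return theta[:-1], theta[-1]                  # split back into (theta, theta_0)

Using solve rather than an explicit matrix inverse is a standard numerical choice; mathematically it computes the same θ.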

2.6 Regularization
The objective function of Eq. 2.2 balances (training-data) memorization, induced by the loss
term, with generalization, induced by the regularization term. Here, we address the need for
regularization specifically for linear regression, and show how this can be realized using
one popular regularization technique called ridge regression.

2.6.1 Regularization and linear regression


If all we cared about was finding a hypothesis with small loss on the training data, we
would have no need for regularization, and could simply omit the second term in the
objective. But remember that our ultimate goal is to perform well on input values that we
haven’t trained on! It may seem that this is an impossible task, but humans and machine-
learning methods do this successfully all the time. What allows generalization to new input
values is a belief that there is an underlying regularity that governs both the training and
testing data. One way to describe an assumption about such a regularity is by choosing
a limited class of possible hypotheses. Another way to do this is to provide smoother
guidance, saying that, within a hypothesis class, we prefer some hypotheses to others. The
regularizer articulates this preference and the constant λ says how much we are willing to
trade off loss on the training data versus preference over hypotheses.
For example, consider what happens when d = 2, and x2 is highly correlated with
x1 , meaning that the data look like a line, as shown in the left panel of the figure below.
Thus, there isn’t a unique best hyperplane . Such correlations happen often in real-life Sometimes there’s tech-
nically a unique best
hyperplane, but just be-
Last Updated: 05/06/24 18:25:27
cause of noise.
MIT 6.390 Spring 2024 19

data, because of underlying common causes; for example, across a population, the height
of people may depend on both age and amount of food intake in the same way. This is
especially the case when there are many feature dimensions used in the regression. Mathe-
matically, this leads to X̃T X̃ close to singularity, such that (X̃T X̃)−1 is undefined or has huge
values, resulting in unstable models (see the middle panel of figure and note the range of
the y values—the slope is huge!):

A common strategy for specifying a regularizer is to use the form

R(\Theta) = \left\| \Theta - \Theta_{\text{prior}} \right\|^2

when we have some idea in advance that Θ ought to be near some value Θprior. (Learn about Bayesian methods in machine learning to see the theory behind this and cool results!)
Here, the notion of distance is quantified by squaring the l2 norm of the parameter vector: for any d-dimensional vector v ∈ R^d, the l2 norm of v is defined as

\|v\| = \sqrt{\sum_{i=1}^{d} |v_i|^2} .

In the absence of such knowledge a default is to regularize toward zero:

R(Θ) = ∥Θ∥2 .

When this is done in the example depicted above, the regression model becomes stable,
producing the result shown in the right-hand panel in the figure. Now the slope is much
more sensible.

2.6.2 Ridge regression



There are some kinds of trouble we can get into in regression problems. What if X̃T X̃ is
not invertible?
Study Question: Consider, for example, a situation where the data-set is just the
same point repeated twice: x(1) = x(2) = [1 2]T . What is X̃ in this case? What is
X̃T X̃? What is (X̃T X̃)−1 ?
Another kind of problem is overfitting: we have formulated an objective that is just
about fitting the data as well as possible, but we might also want to regularize to keep the
hypothesis from getting too attached to the data.
We address both the problem of not being able to invert X̃T X̃ and the problem of
overfitting using a mechanism called ridge regression. We add a regularization term ∥θ∥2 to
the OLS objective, with a non-negative scalar value λ to control the tradeoff between the
training error and the regularization term.


Study Question: Why do we emphasize the non-negativity of the scalar λ? When we add a regularizer of the form ∥θ∥2, what is our most “preferred” value of θ, in the absence of any data?
Here is the ridge regression objective function:

J_{\text{ridge}}(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right)^2 + \lambda \|\theta\|^2

Larger λ values pressure θ values to be near zero. Note that we don’t penalize θ0 ; intu-
itively, θ0 is what “floats” the regression surface to the right level for the data you have,
and so you shouldn’t make it harder to fit a data set where the y values tend to be around
one million than one where they tend to be around one. The other parameters control the
orientation of the regression surface, and we prefer it to have a not-too-crazy orientation.
There is an analytical expression for the θ, θ0 values that minimize Jridge , but it’s a little
bit more complicated to derive than the solution for OLS because θ0 needs special treatment.
If we decide not to treat θ0 specially (so we add a 1 feature to our input vectors as discussed
above), then we get:

\nabla_\theta J_{\text{ridge}} = \frac{2}{n} \tilde{X}^T (\tilde{X}\theta - \tilde{Y}) + 2\lambda\theta .

Setting to 0 and solving, we get (remember that I stands for the identity matrix, a square matrix that has 1's along the diagonal and 0's everywhere else):

\frac{2}{n} \tilde{X}^T (\tilde{X}\theta - \tilde{Y}) + 2\lambda\theta = 0
\frac{1}{n} \tilde{X}^T \tilde{X}\theta - \frac{1}{n} \tilde{X}^T \tilde{Y} + \lambda\theta = 0
\frac{1}{n} \tilde{X}^T \tilde{X}\theta + \lambda\theta = \frac{1}{n} \tilde{X}^T \tilde{Y}
\tilde{X}^T \tilde{X}\theta + n\lambda\theta = \tilde{X}^T \tilde{Y}
(\tilde{X}^T \tilde{X} + n\lambda I)\theta = \tilde{X}^T \tilde{Y}
\theta = (\tilde{X}^T \tilde{X} + n\lambda I)^{-1} \tilde{X}^T \tilde{Y}

Whew! So the solution is:

\theta_{\text{ridge}} = \left( \tilde{X}^T \tilde{X} + n\lambda I \right)^{-1} \tilde{X}^T \tilde{Y}     (2.8)

and the term \tilde{X}^T \tilde{X} + n\lambda I becomes invertible when λ > 0. (This is called “ridge” regression because we are adding a “ridge” of nλ values along the diagonal of the matrix before inverting it.)

Study Question: What is the dimension of I in the equation above?
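
A corresponding numpy sketch of Eq. 2.8 (again not from the notes) follows the same pattern as the OLS code above; as in the derivation, θ0 is not treated specially, so the constant-1 feature is simply appended and regularized along with the rest of θ.

import numpy as np

def ridge_regression(X, y, lam):
    # X: n x d array with one example per row; y: length-n targets; lam: the non-negative scalar lambda
    n = X.shape[0]
    Xt = np.hstack([X, np.ones((n, 1))])   # constant-1 feature, so theta_0 is the last entry
    d_plus_1 = Xt.shape[1]
    # theta_ridge = (Xt^T Xt + n * lam * I)^{-1} Xt^T y   (Eq. 2.8)
    return np.linalg.solve(Xt.T @ Xt + n * lam * np.eye(d_plus_1), Xt.T @ y)
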
2.7 Evaluating learning algorithms
In this section, we will explore how to evaluate supervised machine-learning algorithms.
We will study the special case of applying them to regression problems, but the basic ideas
of validation, hyper-parameter selection, and cross-validation apply much more broadly.
We have seen how linear regression is a well-formed optimization problem, which has
an analytical solution when ridge regularization is applied. But how can one choose the
best amount of regularization, as parameterized by λ? Two key ideas involve the evalua-
tion of the performance of a hypothesis, and a separate evaluation of the algorithm used to
produce hypotheses, as described below.


2.7.1 Evaluating hypotheses


The performance of a given hypothesis h may be evaluated by measuring test error on data
that was not used to train it. Given a training set Dn, a regression hypothesis h, and if we choose squared loss, we can define the OLS training error of h to be the mean square error between its predictions and the expected outputs (it's a bit funny to interpret the analytical formulas given above for θ as “training,” but later when we employ more statistical methods “training” will be a meaningful concept):

E_n(h) = \frac{1}{n} \sum_{i=1}^{n} \left[ h(x^{(i)}) - y^{(i)} \right]^2 .

Test error captures the performance of h on unseen data, and is the mean square error on the test set, with a nearly identical expression as that above, differing only in the range of index i:

E(h) = \frac{1}{n'} \sum_{i=n+1}^{n+n'} \left[ h(x^{(i)}) - y^{(i)} \right]^2

on n ′ new examples that were not used in the process of constructing h.


In machine learning in general, not just regression, it is useful to distinguish two ways
in which a hypothesis h ∈ H might contribute to test error. Two are:
Structural error: This is error that arises because there is no hypothesis h ∈ H that will
perform well on the data, for example because the data was really generated by a
sine wave but we are trying to fit it with a line.
Estimation error: This is error that arises because we do not have enough data (or the
data are in some way unhelpful) to allow us to choose a good h ∈ H, or because we
didn’t solve the optimization problem well enough to find the best h given the data
that we had.
When we increase λ, we tend to increase structural error but decrease estimation error, and vice versa. (There are technical definitions of these concepts that are studied in more advanced treatments of machine learning. Structural error is referred to as bias and estimation error is referred to as variance.)

2.7.2 Evaluating learning algorithms

Note that this section is relevant to learning algorithms generally—we are just introducing the topic here since we now have an algorithm that can be evaluated!

A learning algorithm is a procedure that takes a data set Dn as input and returns a
hypothesis h from a hypothesis class H; it looks like

Dn −→ learning alg (H) −→ h

Keep in mind that h has parameters. The learning algorithm itself may have its own pa-
rameters, and such parameters are often called hyperparameters. The analytical solutions
presented above for linear regression, e.g., Eq. 2.8, may be thought of as learning algo-
rithms, where λ is a hyperparameter that governs how the learning algorithm works and
can strongly affect its performance.
How should we evaluate the performance of a learning algorithm? This can be tricky.
There are many potential sources of variability in the possible result of computing test error
on a learned hypothesis h:
• Which particular training examples occurred in Dn
• Which particular testing examples occurred in Dn ′
• Randomization inside the learning algorithm itself


2.7.2.1 Validation
Generally, to evaluate how well a learning algorithm works, given an unlimited data source,
we would like to execute the following process multiple times:

• Train on a new training set (subset of our big data source)

• Evaluate resulting h on a validation set that does not overlap the training set (but is
still a subset of our same big data source)

Running the algorithm multiple times controls for possible poor choices of training set
or unfortunate randomization inside the algorithm itself.

2.7.2.2 Cross validation


One concern is that we might need a lot of data to do this, and in many applications data
is expensive or difficult to acquire. We can re-use data with cross validation (but it’s harder
to do theoretical analysis).

CROSS-VALIDATE(D, k)
1  divide D into k chunks D1, D2, . . . , Dk (of roughly equal size)
2  for i = 1 to k
3    train hi on D \ Di (withholding chunk Di as the validation set)
4    compute “test” error Ei(hi) on withheld data Di
5  return (1/k) Σ_{i=1}^k Ei(hi)

It’s very important to understand that (cross-)validation neither delivers nor evaluates a
single particular hypothesis h. It evaluates the learning algorithm that produces hypotheses.
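Here is one possible Python sketch of CROSS-VALIDATE; the helper names train and error are hypothetical placeholders for whatever learning algorithm and error measure you want to evaluate:

import numpy as np

def cross_validate(X, Y, k, train, error):
    # X: n x d inputs, Y: n x 1 targets
    # train(X, Y) returns a hypothesis h; error(h, X, Y) returns a scalar error
    n = X.shape[0]
    chunks = np.array_split(np.random.permutation(n), k)   # k roughly equal chunks of indices
    scores = []
    for i in range(k):
        val = chunks[i]                                     # withheld chunk D_i
        tr = np.concatenate([chunks[j] for j in range(k) if j != i])
        h = train(X[tr], Y[tr])                             # train on D \ D_i
        scores.append(error(h, X[val], Y[val]))             # "test" error on D_i
    return np.mean(scores)                                  # average over the k folds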

2.7.2.3 Hyperparameter tuning


The hyper-parameters of a learning algorithm affect how the algorithm works but they are
not part of the resulting hypothesis. So, for example, λ in ridge regression affects which
hypothesis will be returned, but λ itself doesn’t show up in the hypothesis (the hypothesis
is specified using parameters θ and θ0 ).
You can think about each different setting of a hyper-parameter as specifying a different
learning algorithm.
In order to pick a good value of the hyper-parameter, we often end up just trying a lot
of values and seeing which one works best via validation or cross-validation.
Study Question: How could you use cross-validation to decide whether to use ana-
lytic ridge regression or our random-regression algorithm and to pick K for random
regression or λ for ridge regression?



CHAPTER 3

Gradient Descent

In the previous chapter, we showed how to describe an interesting objective function for
machine learning, but we need a way to find the optimal Θ∗ = arg minΘ J(Θ), particularly
when the objective function is not amenable to analytical optimization. For example, this
can be the case when J(Θ) involves a more complex loss function, or more general forms
of regularization. It can also be the case when there is simply too much data for it to be
computationally feasible to analytically invert the required matrices.
There is an enormous and fascinating literature on the mathematical and algorithmic foundations of optimization (which you should consider studying some day!), but for this class, we will consider one of the simplest methods, called gradient descent.
Intuitively, in one or two dimensions, we can easily think of J(Θ) as defining a surface
over Θ; that same idea extends to higher dimensions. Now, our objective is to find the
Θ value at the lowest point on that surface. One way to think about gradient descent is
that you start at some arbitrary point on the surface, look to see in which direction the
“hill” goes down most steeply, take a small step in that direction, determine the direction
of steepest descent from where you are, take another small step, etc.
Below, we explicitly give gradient descent algorithms for one and multidimensional
objective functions (Sections 3.1 and 3.2). We then illustrate the application of gradient
descent to a loss function which is not merely mean squared loss (Section 3.3). And we
present an important method known as stochastic gradient descent (Section 3.4), which is
especially useful when datasets are too large for descent in a single batch, and has some
important behaviors of its own.

3.1 Gradient descent in one dimension


We start by considering gradient descent in one dimension. Assume Θ ∈ R, and that we
know both J(Θ) and its first derivative with respect to Θ, J ′ (Θ). Here is pseudo-code for
gradient descent on an arbitrary function f. Along with f and its gradient ∇Θ f (which,
in the case of a scalar Θ, is the same as its derivative f ′ ), we have to specify some hyper-
parameters. These hyper-parameters include the initial value for parameter Θ, a step-size
hyper-parameter η, and an accuracy hyper-parameter ϵ.
The hyper-parameter η is often called learning rate when gradient descent is applied in
machine learning. For simplicity, η may be taken as a constant, as is the case in the pseudo-
code below; and we’ll see adaptive (non-constant) step-sizes soon. What’s important to


notice though, is that even when η is constant, the actual magnitude of the change to Θ
may not be constant, as that change depends on the magnitude of the gradient itself too.

1D-GRADIENT-DESCENT(Θinit, η, f, f′, ϵ)


1 Θ(0) = Θinit
2 t=0
3 repeat
4 t = t+1
5 Θ(t) = Θ(t−1) − η f ′ (Θ(t−1) )
6 until |f(Θ(t) ) − f(Θ(t−1) )| < ϵ
7 return Θ(t)
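A direct Python transcription of this pseudocode might look like the following sketch (the max_iter guard is our addition, so the loop cannot run forever):

def gd_1d(theta_init, eta, f, df, epsilon, max_iter=100000):
    theta = theta_init
    for _ in range(max_iter):                        # safety guard, not in the pseudocode
        new_theta = theta - eta * df(theta)          # line 5: step against the derivative
        if abs(f(new_theta) - f(theta)) < epsilon:   # line 6: stop when f barely changes
            return new_theta
        theta = new_theta
    return theta

# The convex example used later in this section: f(x) = (x - 2)^2, x_init = 4.0, eta = 1/2
print(gd_1d(4.0, 0.5, lambda x: (x - 2)**2, lambda x: 2*(x - 2), 1e-8))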

Note that this algorithm terminates when the change in the function f is sufficiently small.
There are many other reasonable ways to decide to terminate, including:
• Stop after a fixed number of iterations T , i.e., when t = T .

• Stop when the change in the value of the parameter Θ is sufficiently small, i.e., when
|Θ(t) − Θ(t−1)| < ϵ.

• Stop when the derivative f′ at the latest value of Θ is sufficiently small, i.e., when
|f′(Θ(t))| < ϵ.

Study Question: Consider all of the potential stopping criteria for 1D-GRADIENT-DESCENT, both in the algorithm as it appears and listed separately later. Can you think of ways that any two of the criteria relate to each other?

Theorem 3.1.1. Choose any small distance ϵ̃ > 0. If we assume that f has a minimum, is sufficiently “smooth” and convex, and if the step size η is sufficiently small, gradient descent will reach a point within ϵ̃ of a global optimum point Θ. (A function is convex if the line segment between any two points on the graph of the function lies above or on the graph.)
However, we must be careful when choosing the step size to prevent slow convergence, non-converging oscillation around the minimum, or divergence.
The following plot illustrates a convex function f(x) = (x − 2)², starting gradient descent at xinit = 4.0 with a step-size of 1/2. It is very well-behaved!

[Plot: f(x) = (x − 2)² and the gradient descent iterates starting from xinit = 4.0.]

Study Question: What happens in this example with very small η? With very big η?
If f is non-convex, where gradient descent converges to depends on xinit . First, let’s
establish some definitions. Suppose we have analytically defined derivatives for f. Then


we say that f has a local minimum point or local optimum point at x if f ′ (x) = 0 and f ′′ (x) > 0,
and we say that f(x) is a local minimum value of f. More generally, x is a local minimum
point of f if f(x) is at least as low as f(x ′ ) for all points x ′ in some small area around x. We
say that f has a global minimum point at x if f(x) is at least as low as f(x ′ ) for every other
input x ′ . And then we call f(x) a global minimum value. A global minimum point is also a
local minimum point, but a local minimum point does not have to be a global minimum
point.
If f is non-convex (and sufficiently smooth), one expects that gradient descent (run long
enough with small enough step size) will get very close to a point at which the gradient is
zero, though we cannot guarantee that it will converge to a global minimum point.
There are two notable exceptions to this common sense expectation: First, gradient
descent can get stagnated while approaching a point x which is not a local minimum or
maximum, but satisfies f ′ (x) = 0. For example, for f(x) = x3 , starting gradient descent
from the initial guess xinit = 1, while using step size η < 1/3 will lead to x(k) converging to
zero as k → ∞. Second, there are functions (even convex ones) with no minimum points,
like f(x) = exp(−x), for which gradient descent with a positive step size converges to +∞.
The plot below shows two different xinit , and how gradient descent started from each
point heads toward two different local optimum points.
[Plot: a non-convex f(x), with gradient descent runs from two different xinit values heading to two different local optimum points.]

3.2 Multiple dimensions


The extension to the case of multi-dimensional Θ is straightforward. Let’s assume Θ ∈ Rm ,
so f : Rm → R. The gradient of f with respect to Θ is
 
∇Θ f = [∂f/∂Θ1, . . . , ∂f/∂Θm]^T .

The algorithm remains the same, except that the update step in line 5 becomes

Θ^(t) = Θ^(t−1) − η ∇Θ f(Θ^(t−1))

and any termination criteria that depended on the dimensionality of Θ would have to change. The easiest thing is to keep the test in line 6 as |f(Θ^(t)) − f(Θ^(t−1))| < ϵ, which is sensible no matter the dimensionality of Θ.
Study Question: Which termination criteria from the 1D case were defined in a way
that assumes Θ is one dimensional?


3.3 Application to regression


Recall from the previous chapter that choosing a loss function is the first step in formulating
a machine-learning problem as an optimization problem, and for regression we studied the
mean square loss, which captures loss as (guess − actual)2 . This leads to the ordinary least
squares objective
J(θ) = (1/n) Σ_{i=1}^n (θ^T x^(i) − y^(i))² .     (3.1)

We use the gradient of the objective with respect to the parameters,

∇θ J = (2/n) X̃^T (X̃θ − Ỹ)     (3.2)

(where X̃^T is d × n and X̃θ − Ỹ is n × 1)

to obtain an analytical solution to the linear regression problem. Gradient descent could
also be applied to numerically compute a solution, using the update rule

θ^(t) = θ^(t−1) − η · (2/n) Σ_{i=1}^n ([θ^(t−1)]^T x^(i) − y^(i)) x^(i) .     (3.3)

(Beware double superscripts! [θ]^T is the transpose of the vector θ.)
3.3.1 Ridge regression
Now, let’s add in the regularization term, to get the ridge-regression objective:

Jridge(θ, θ0) = (1/n) Σ_{i=1}^n (θ^T x^(i) + θ0 − y^(i))² + λ∥θ∥² .

Recall that in ordinary least squares, we finessed handling θ0 by adding an extra di-
mension of all 1’s. In ridge regression, we really do need to separate the parameter vector
θ from the offset θ0 , and so, from the perspective of our general-purpose gradient descent
method, our whole parameter set Θ is defined to be Θ = (θ, θ0 ). We will go ahead and find
the gradients separately for each one:

∇θ Jridge(θ, θ0) = (2/n) Σ_{i=1}^n (θ^T x^(i) + θ0 − y^(i)) x^(i) + 2λθ
∂Jridge(θ, θ0)/∂θ0 = (2/n) Σ_{i=1}^n (θ^T x^(i) + θ0 − y^(i)) .

(Some passing familiarity with matrix derivatives is helpful here. A foolproof way of computing them is to compute the partial derivative of J with respect to each component θi of θ. See Appendix A on matrix derivatives!)

Note that ∇θ Jridge will be of shape d × 1 and ∂Jridge/∂θ0 will be a scalar since we have separated θ0 from θ here.
Study Question: Convince yourself that the dimensions of all these quantities are
correct, under the assumption that θ is d × 1. How does d relate to m as discussed
for Θ in the previous section?

Study Question: Compute ∇θ ∥θ∥2 by finding the vector of partial derivatives


(∂ ∥θ∥2 /∂θ1 , . . . , ∂ ∥θ∥2 /∂θd ). What is the shape of ∇θ ∥θ∥2 ?

Study Question: Compute ∇θ Jridge (θT x+θ0 , y) by finding the vector of partial deriva-
tives (∂Jridge (θT x + θ0 , y)/∂θ1 , . . . , ∂Jridge (θT x + θ0 , y)/∂θd ).


Study Question: Use these last two results to verify our derivation above.
Putting everything together, our gradient descent algorithm for ridge regression be-
comes

RR-GRADIENT-DESCENT(θinit, θ0init, η, ϵ)
1  θ^(0) = θinit
2  θ0^(0) = θ0init
3  t = 0
4  repeat
5    t = t + 1
6    θ^(t) = θ^(t−1) − η [ (1/n) Σ_{i=1}^n ( [θ^(t−1)]^T x^(i) + θ0^(t−1) − y^(i) ) x^(i) + λθ^(t−1) ]
7    θ0^(t) = θ0^(t−1) − η (1/n) Σ_{i=1}^n ( [θ^(t−1)]^T x^(i) + θ0^(t−1) − y^(i) )
8  until |Jridge(θ^(t), θ0^(t)) − Jridge(θ^(t−1), θ0^(t−1))| < ϵ
9  return θ^(t), θ0^(t)

Study Question: Is it okay that λ doesn’t appear in line 7?

Study Question: Is it okay that the 2’s from the gradient definitions don’t appear in
the algorithm?
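This pseudocode vectorizes naturally in numpy; here is one possible sketch (initializing θ and θ0 to zero for simplicity, and adding a max_iter guard of our own):

import numpy as np

def rr_gradient_descent(X, Y, lam, eta, epsilon, max_iter=100000):
    # X: n x d inputs, Y: n x 1 targets, lam: regularization strength lambda
    n, d = X.shape
    theta, theta_0 = np.zeros((d, 1)), 0.0
    J = lambda th, th0: np.mean((X @ th + th0 - Y) ** 2) + lam * np.sum(th ** 2)
    for _ in range(max_iter):
        err = X @ theta + theta_0 - Y                            # n x 1 residuals
        new_theta = theta - eta * (X.T @ err / n + lam * theta)  # line 6
        new_theta_0 = theta_0 - eta * np.mean(err)               # line 7
        if abs(J(new_theta, new_theta_0) - J(theta, theta_0)) < epsilon:  # line 8
            return new_theta, new_theta_0
        theta, theta_0 = new_theta, new_theta_0
    return theta, theta_0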

3.4 Stochastic gradient descent


When the form of the gradient is a sum, rather than take one big(ish) step in the direction
of the gradient, we can, instead, randomly select one term of the sum, and take a very small step in that direction. This seems sort of crazy, but remember that all the little steps would average out to the same direction as the big step if you were to stay in one place. Of course, you’re not staying in that place, so you move, in expectation, in the direction of the gradient. (The word “stochastic” means probabilistic, or random; so does “aleatoric,” which is a very cool word. Look up aleatoric music, sometime.)
Most objective functions in machine learning can end up being written as a sum over data points, in which case, stochastic gradient descent (SGD) is implemented by picking a data point randomly out of the data set, computing the gradient as if there were only that one point in the data set, and taking a small step in the negative direction.
Let’s assume our objective has the form

f(Θ) = Σ_{i=1}^n fi(Θ) ,

where n is the number of data points used in the objective (and this may be different from
the number of points available in the whole data set). Here is pseudocode for applying
SGD to such an objective f; it assumes we know the form of ∇Θ fi for all i in 1 . . . n:

STOCHASTIC-GRADIENT-DESCENT(Θinit, η, f, ∇Θ f1, . . . , ∇Θ fn, T)


1 Θ(0) = Θinit
2 for t = 1 to T
3 randomly select i ∈ {1, 2, . . . , n}
4 Θ(t) = Θ(t−1) − η(t) ∇Θ fi (Θ(t−1) )
5 return Θ(t)
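A Python sketch of this procedure, with the per-point gradients supplied as a list of functions and the step size given as a function eta(t) (both names are ours):

import random

def sgd(theta_init, eta, grads, T):
    # grads[i](theta) returns the gradient of f_i at theta; eta(t) gives the step size at iteration t
    theta = theta_init
    for t in range(1, T + 1):
        i = random.randrange(len(grads))          # pick one term of the sum at random
        theta = theta - eta(t) * grads[i](theta)  # take a small step in its negative direction
    return theta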


Note that now instead of a fixed value of η, η is indexed by the iteration of the algo-
rithm, t. Choosing a good stopping criterion can be a little trickier for SGD than traditional
gradient descent. Here we’ve just chosen to stop after a fixed number of iterations T .
For SGD to converge to a local optimum point as t increases, the step size has to decrease
as a function of time. The next result shows one step size sequence that works.

Theorem 3.4.1. If f is convex, and η(t) is a sequence satisfying

Σ_{t=1}^∞ η(t) = ∞  and  Σ_{t=1}^∞ η(t)² < ∞ ,

then SGD converges with probability one to the optimal Θ. (We have left out some gnarly conditions in this theorem. Also, you can learn more about the subtle difference between “with probability one” and “always” by taking an advanced probability course.)
Why these two conditions? The intuition is that the first condition, on Σ η(t), is needed to allow for the possibility of an unbounded potential range of exploration, while the second condition, on Σ η(t)², ensures that the step sizes get smaller and smaller as t increases.
One “legal” way of setting the step size is to make η(t) = 1/t, but people often use rules that decrease more slowly, and so don’t strictly satisfy the criteria for convergence.
Study Question: If you start a long way from the optimum, would making η(t) de-
crease more slowly tend to make you move more quickly or more slowly to the opti-
mum?
There are multiple intuitions for why SGD might be a better choice algorithmically than
regular GD (which is sometimes called batch GD (BGD)):

• BGD typically requires computing some quantity over every data point in a data set.
SGD may perform well after visiting only some of the data. This behavior can be
useful for very large data sets – in runtime and memory savings.

• If your f is actually non-convex, but has many shallow local optimum points that
might trap BGD, then taking samples from the gradient at some point Θ might “bounce”
you around the landscape and away from the local optimum points.

• Sometimes, optimizing f really well is not what we want to do, because it might
overfit the training set; so, in fact, although SGD might not get lower training error
than BGD, it might result in lower test error.



CHAPTER 4

Classification

4.1 Classification
Classification is a machine learning problem seeking to map from inputs R^d to outputs in an unordered set (in contrast to a continuous real-valued output, as we saw for linear regression). Examples of classification output sets could be {apples, oranges, pears} if we’re trying to figure out what type of fruit we have, or {heartattack, noheartattack} if we’re working in an emergency room and trying to give the best medical care to a new patient. We focus on an essential simple case, binary classification, where we aim to find a mapping from R^d to two outputs. While we should think of the outputs as not having an order, it’s often convenient to encode them as {−1, +1}. As before, let the letter h (for hypothesis) represent a classifier, so the classification process looks like:

x → h → y .

Like regression, classification is a supervised learning problem, in which we are given a


training data set of the form

Dn = {(x^(1), y^(1)), . . . , (x^(n), y^(n))} .

We will assume that each x(i) is a d × 1 column vector. The intended meaning of this data is
that, when given an input x(i) , the learned hypothesis should generate output y(i) .
What makes a classifier useful? As in regression, we want it to work well on new data,
making good predictions on examples it hasn’t seen. But we don’t know exactly what
data this classifier might be tested on when we use it in the real world. So, we have to
assume a connection between the training data and testing data; typically, they are drawn
independently from the same probability distribution.
In classification, we will often use 0-1 loss for evaluation (as discussed in Section 1.3).
For that choice, we can write the training error and the testing error. In particular, given a
training set Dn and a classifier h, we define the training error of h to be

En(h) = (1/n) Σ_{i=1}^n { 1 if h(x^(i)) ≠ y^(i),  0 otherwise } .


For now, we will try to find a classifier with small training error (later, with some added
criteria) and hope it generalizes well to new data, and has a small test error

E(h) = (1/n′) Σ_{i=n+1}^{n+n′} { 1 if h(x^(i)) ≠ y^(i),  0 otherwise }

on n ′ new examples that were not used in the process of finding the classifier.
We begin by introducing the hypothesis class of linear classifiers (Section 4.2) and then
define an optimization framework to learn linear logistic classifiers (Section 4.3).

4.2 Linear classifiers


We start with the hypothesis class of linear classifiers. They are (relatively) easy to under-
stand, simple in a mathematical sense, powerful on their own, and the basis for many other
more sophisticated methods. Following their definition, we present a simple learning al-
gorithm for classifiers.

4.2.1 Linear classifiers: definition


A linear classifier in d dimensions is defined by a vector of parameters θ ∈ Rd and scalar
θ0 ∈ R. So, the hypothesis class H of linear classifiers in d dimensions is parameterized by
the set of all vectors in Rd+1 . We’ll assume that θ is a d × 1 column vector.
Given particular values for θ and θ0 , the classifier is defined by

h(x; θ, θ0) = sign(θ^T x + θ0) = { +1 if θ^T x + θ0 > 0,  −1 otherwise } .

Remember that we can think of θ, θ0 as specifying a d-dimensional hyperplane (compare


the above with Eq. 2.3). But this time, rather than being interested in that hyperplane’s
values at particular points x, we will focus on the separator that it induces. The separator is
the set of x values such that θT x+θ0 = 0. This is also a hyperplane, but in d−1 dimensions!
We can interpret θ as a vector that is perpendicular to the separator. (We will also say that
θ is normal to the separator.)
For example, in two dimensions (d = 2) the separator has dimension 1, which means it
is a line, and the two components of θ = [θ1 , θ2 ]T give the orientation of the separator, as
illustrated in the following example.




Example: Let h be the linear classifier defined by θ = [1, −1]^T, θ0 = 1.

The diagram below shows the θ vector (in green) and the separator it defines:

[Figure: the (x1, x2) plane with the separator θ^T x + θ0 = 0 and the vector θ = [θ1, θ2]^T drawn normal to it.]

What is θ0? We can solve for it by plugging a point on the line into the equation for the line. It is often convenient to choose a point on one of the axes, e.g., in this case, x = [0, 1]^T, for which θ^T [0, 1]^T + θ0 = 0, giving θ0 = 1.
1

In this example, the separator divides Rd , the space our x(i) points live in, into two half-
spaces. The one that is on the same side as the normal vector is the positive half-space, and
we classify all points in that space as positive. The half-space on the other side is negative
and all points in it are classified as negative.
Note that we will call a separator a linear separator of a data set if all of the data with
one label falls on one side of the separator and all of the data with the other label falls on
the other side of the separator. For instance, the separator in the next example is a linear
separator for the illustrated data. If there exists a linear separator on a dataset, we call this
dataset linearly separable.



−1
Example: Let h be the linear classifier defined by θ = , θ0 = 3.
1.5
 
3
The diagram below shows several points classified by h. In particular, let x(1) =
2
 
4
and x(2) = .
−1
   
(1)
  3
h(x ; θ, θ0 ) = sign −1 1.5 + 3 = sign(3) = +1
2
   
 4
h(x(2) ; θ, θ0 ) = sign −1 1.5

+ 3 = sign(−2.5) = −1
−1

Thus, x(1) and x(2) are given positive and negative classifications, respectively.

x(1)
θT x + θ0 = 0

x(2)

Study Question: What is the green vector normal to the separator? Specify it as a
column vector.

Study Question: What change would you have to make to θ, θ0 if you wanted to
have the separating hyperplane in the same place, but to classify all the points la-
beled ’+’ in the diagram as negative and all the points labeled ’-’ in the diagram as
positive?
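For reference, evaluating a linear classifier is a one-liner; here is a small sketch reproducing the second example above:

import numpy as np

def linear_classify(x, theta, theta_0):
    return +1 if theta @ x + theta_0 > 0 else -1     # sign(theta^T x + theta_0)

theta, theta_0 = np.array([-1.0, 1.5]), 3.0
print(linear_classify(np.array([3.0, 2.0]), theta, theta_0))    # +1
print(linear_classify(np.array([4.0, -1.0]), theta, theta_0))   # -1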

4.3 Linear logistic classifiers


Given a data set and the hypothesis class of linear classifiers, our goal will be to find the
linear classifier that optimizes an objective function relating its predictions to the training
data. To make this problem computationally reasonable, we will need to take care in how
we formulate the optimization problem to achieve this goal.
For classification, it is natural to make predictions in {+1, −1} and use the 0-1 loss func-
tion, L01 , as introduced in Chapter 1:

L01(g, a) = { 0 if g = a,  1 otherwise } .


However, even for simple linear classifiers, it is very difficult to find values for θ, θ0 that
minimize simple 0-1 training error

J(θ, θ0) = (1/n) Σ_{i=1}^n L01(sign(θ^T x^(i) + θ0), y^(i)) .

This problem is NP-hard, which probably implies that solving the most difficult instances of this problem would require computation time exponential in the number of training examples, n. (The “probably” here is not because we’re too lazy to look it up, but actually because of a fundamental unsolved problem in computer-science theory, known as “P vs. NP.”)
What makes this a difficult optimization problem is its lack of “smoothness”:
• There can be two hypotheses, (θ, θ0) and (θ′, θ0′), where one is closer in parameter space to the optimal parameter values (θ*, θ0*), but they make the same number of misclassifications so they have the same J value.

• All predictions are categorical: the classifier can’t express a degree of certainty about
whether a particular input x should have an associated value y.
For these reasons, if we are considering a hypothesis θ, θ0 that makes five incorrect predic-
tions, it is difficult to see how we might change θ, θ0 so that it will perform better, which
makes it difficult to design an algorithm that searches in a sensible way through the space
of hypotheses for a good one. For these reasons, we investigate another hypothesis class:
linear logistic classifiers, providing their definition, then an approach for learning such clas-
sifiers using optimization.

4.3.1 Linear logistic classifiers: definition


The hypotheses in a linear logistic classifier (LLC) are parameterized by a d-dimensional
vector θ and a scalar θ0 , just as is the case for linear classifiers. However, instead of making
predictions in {+1, −1}, LLC hypotheses generate real-valued outputs in the interval (0, 1).
An LLC has the form
h(x; θ, θ0 ) = σ(θT x + θ0 ) .
This looks familiar! What’s new?
The logistic function, also known as the sigmoid function, is defined as
σ(z) = 1 / (1 + e^{−z}) ,
and is plotted below, as a function of its input z. Its output can be interpreted as a proba-
bility, because for any value of z the output is in (0, 1).

[Plot: the sigmoid σ(z), increasing from 0 toward 1, with σ(0) = 0.5.]

Study Question: Convince yourself the output of σ is always in the interval (0, 1).
Why can’t it equal 0 or equal 1? For what value of z does σ(z) = 0.5?


What does an LLC look like? Let’s consider the simple case where d = 1, so our input
points simply lie along the x axis. Classifiers in this case have dimension 0, meaning that
they are points. The plot below shows LLCs for three different parameter settings: σ(10x +
1), σ(−2x + 1), and σ(2x − 3).

[Plot: the three LLCs σ(10x + 1), σ(−2x + 1), and σ(2x − 3) as functions of x.]

Study Question: Which plot is which? What governs the steepness of the curve?
What governs the x value where the output is equal to 0.5?
But wait! Remember that the definition of a classifier is that it’s a mapping from Rd →
{−1, +1} or to some other discrete set. So, then, it seems like an LLC is actually not a
classifier!
Given an LLC, with an output value in (0, 1), what should we do if we are forced to
make a prediction in {+1, −1}? A default answer is to predict +1 if σ(θT x + θ0 ) > 0.5 and
−1 otherwise. The value 0.5 is sometimes called a prediction threshold.
In fact, for different problem settings, we might prefer to pick a different prediction
threshold. The field of decision theory considers how to make this choice. For example, if
the consequences of predicting +1 when the answer should be −1 are much worse than
the consequences of predicting −1 when the answer should be +1, then we might set the
prediction threshold to be greater than 0.5.
Study Question: Using a prediction threshold of 0.5, for what values of x do each of
the LLCs shown in the figure above predict +1?
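A tiny sketch of an LLC prediction with a threshold (the default of 0.5 is just the choice discussed above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def llc_predict(x, theta, theta_0, threshold=0.5):
    g = sigmoid(theta @ x + theta_0)     # the LLC's output, interpreted as P(class +1)
    return +1 if g > threshold else -1

print(llc_predict(np.array([2.0]), np.array([2.0]), -3.0))   # sigma(2*2 - 3) ~ 0.73 > 0.5, so +1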
When d = 2, then our inputs x lie in a two-dimensional space with axes x1 and x2 , and
the output of the LLC is a surface, as shown below, for θ = (1, 1), θ0 = 2.


Study Question: Convince yourself that the set of points for which σ(θT x + θ0 ) = 0.5,
that is, the “boundary” between positive and negative predictions with prediction
threshold 0.5, is a line in (x1 , x2 ) space. What particular line is it for the case in the
figure above? How would the plot change for θ = (1, 1), but now with θ0 = −2? For
θ = (−1, −1), θ0 = 2?

4.3.2 Learning linear logistic classifiers


Optimization is a key approach to solving machine learning problems; this also applies
to learning linear logistic classifiers (LLCs) by defining an appropriate loss function for
optimization. A first attempt might be to use the simple 0-1 loss function L01 that gives
a value of 0 for a correct prediction, and a 1 for an incorrect prediction. As noted earlier,
however, this gives rise to an objective function that is very difficult to optimize, and so we
pursue another strategy for defining our objective.
For learning LLCs, we’d have a class of hypotheses whose outputs are in (0, 1), but for
which we have training data with y values in {+1, −1}. How can we define an appropriate
loss function? We start by changing our interpretation of the output to be the probability that
the input should map to output value 1 (we might also say that this is the probability that the
input is in class 1 or that the input is ‘positive.’)
Study Question: If h(x) is the probability that x belongs to class +1, what is the
probability that x belongs to the class -1? Assuming there are only these two classes.
Intuitively, we would like to have low loss if we assign a high probability to the correct
class. We’ll define a loss function, called negative log-likelihood (NLL), that does just this. In
addition, it has the cool property that it extends nicely to the case where we would like to
classify our inputs into more than two classes.
In order to simplify the description, we assume that (or transform our data so that) the
labels in the training data are y ∈ {0, 1}. (Remember to be sure your y values have this form if you try to learn an LLC using NLL!)
We would like to pick the parameters of our classifier to maximize the probability assigned by the LLC to the correct y values, as specified in the training set. Letting guess g^(i) = σ(θ^T x^(i) + θ0), that probability is

∏_{i=1}^n { g^(i) if y^(i) = 1,  1 − g^(i) otherwise } ,

under the assumption that our predictions are independent. (That crazy huge Π represents taking the product over a bunch of factors just as huge Σ represents taking the sum over a bunch of terms.) This can be cleverly rewritten, when y^(i) ∈ {0, 1}, as

∏_{i=1}^n (g^(i))^{y^(i)} (1 − g^(i))^{1−y^(i)} .

Study Question: Be sure you can see why these two expressions are the same.
The big product above is kind of hard to deal with in practice, though. So what can we
do? Because the log function is monotonic, the θ, θ0 that maximize the quantity above will
be the same as the θ, θ0 that maximize its log, which is the following:

Σ_{i=1}^n ( y^(i) log g^(i) + (1 − y^(i)) log(1 − g^(i)) ) .

Finally, we can turn the maximization problem above into a minimization problem by tak-


ing the negative of the above expression, and write in terms of minimizing a loss

Σ_{i=1}^n Lnll(g^(i), y^(i))

where Lnll is the negative log-likelihood loss function:

Lnll (guess, actual) = − (actual · log(guess) + (1 − actual) · log(1 − guess)) .
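In code, with a guess in (0, 1) and a 0/1 label, this loss is one line; the small clip below is our addition, purely to keep log() finite in floating point:

import numpy as np

def nll_loss(guess, actual, eps=1e-12):
    g = np.clip(guess, eps, 1 - eps)     # avoid log(0)
    return -(actual * np.log(g) + (1 - actual) * np.log(1 - g))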

This loss function is also sometimes referred to as the log loss or cross entropy. (You can use any base for the logarithm and it won’t make any real difference. If we ask you for numbers, use log base e.)
What is the objective function for linear logistic classification? We can finally put all these pieces together and develop an objective function for optimizing regularized negative log-likelihood for a linear logistic classifier. In fact, this process is usually called “logistic regression,” so we’ll call our objective Jlr, and define it as

Jlr(θ, θ0; D) = ( (1/n) Σ_{i=1}^n Lnll(σ(θ^T x^(i) + θ0), y^(i)) ) + λ∥θ∥² .     (4.1)

(That’s a lot of fancy words!)

Study Question: Consider the case of linearly separable data. What will the θ values
that optimize this objective be like if λ = 0? What will they be like if λ is very big?
Try to work out an example in one dimension with two data points.
What role does regularization play for classifiers? This objective function has the same
structure as the one we used for regression, Eq. 2.2, where the first term (in parentheses)
is the average loss, and the second term is for regularization. Regularization is needed
for building classifiers that can generalize well (just as was the case for regression). The
parameter λ governs the trade-off between the two terms as illustrated in the following
example.
Suppose we wish to obtain a linear logistic classifier for this one-dimensional dataset:
[Plot: a one-dimensional data set with labels y ∈ {0, 1}, and a wide gap around x = 0 between the two groups of points.]
Clearly, this can be fit very nicely by a hypothesis h(x) = σ(θx), but what is the best value
for θ? Evidently, when there is no regularization (λ = 0), the objective function Jlr (θ) will
approach zero for large values of θ, as shown in the plot on the left, below. However, would
the best hypothesis really have an infinite (or very large) value for θ? Such a hypothesis
would suggest that the data indicate strong certainty that a sharp transition between y = 0
and y = 1 occurs exactly at x = 0, despite the actual data having a wide gap around x = 0.


[Plots: Jlr(θ) as a function of θ, for λ = 0 (left) and λ = 0.2 (right).]
Study Question: Be sure this makes sense. When the θ values are very large, what
does the sigmoid curve look like? Why do we say that it has a strong certainty in
that case?
In the absence of other beliefs about the solution, we might prefer that our linear logistic classifier not be overly certain about its predictions, and so we might prefer a smaller θ over a large θ. By not being overconfident, we might expect a somewhat smaller θ to perform better on future examples drawn from this same distribution. This preference can be realized using a nonzero value of the regularization trade-off parameter, as illustrated in the plot on the right, above, with λ = 0.2. (To refresh on some vocabulary, we say that in this example, a very large θ would be overfit to the training data.)
Another nice way of thinking about regularization is that we would like to prevent our hypothesis from being too dependent on the particular training data that we were given: we would like for it to be the case that if the training data were changed slightly, the hypothesis would not change by much.

4.4 Gradient descent for logistic regression


Now that we have a hypothesis class (LLC) and a loss function (NLL), we need to take
some data and find parameters! Sadly, there is no lovely analytical solution like the one
we obtained for regression, in Section 2.6.2. Good thing we studied gradient descent! We
can perform gradient descent on the Jlr objective, as we’ll see next. We can also apply
stochastic gradient descent to this problem.
Luckily, Jlr has enough nice properties that gradient descent and stochastic gradient de-
scent should generally “work”. We’ll soon see some more challenging optimization prob-
lems though – in the context of neural networks, in Section 6.7.
First we need derivatives with respect to both θ0 (the scalar component) and θ (the vector component) of Θ. Explicitly, they are:

∇θ Jlr(θ, θ0) = (1/n) Σ_{i=1}^n (g^(i) − y^(i)) x^(i) + 2λθ
∂Jlr(θ, θ0)/∂θ0 = (1/n) Σ_{i=1}^n (g^(i) − y^(i)) .

(Some passing familiarity with matrix derivatives is helpful here. A foolproof way of computing them is to compute the partial derivative of J with respect to each component θi of θ.)

Note that ∇θ Jlr will be of shape d × 1 and ∂Jlr/∂θ0 will be a scalar since we have separated θ0 from θ here.
Study Question: Convince yourself that the dimensions of all these quantities are
correct, under the assumption that θ is d × 1. How does d relate to m as discussed
for Θ in the previous section?


Study Question: Compute ∇θ ∥θ∥2 by finding the vector of partial derivatives


(∂ ∥θ∥2 /∂θ1 , . . . , ∂ ∥θ∥2 /∂θd ). What is the shape of ∇θ ∥θ∥2 ?

Study Question: Compute ∇θ Lnll (σ(θT x + θ0 ), y) by finding the vector of partial


derivatives (∂Lnll (σ(θT x + θ0 ), y)/∂θ1 , . . . , ∂Lnll (σ(θT x + θ0 ), y)/∂θd ).

Study Question: Use these last two results to verify our derivation above.
Putting everything together, our gradient descent algorithm for logistic regression be-
comes:

LR-GRADIENT-DESCENT(θinit, θ0init, η, ϵ)
1  θ^(0) = θinit
2  θ0^(0) = θ0init
3  t = 0
4  repeat
5    t = t + 1
6    θ^(t) = θ^(t−1) − η [ (1/n) Σ_{i=1}^n ( σ([θ^(t−1)]^T x^(i) + θ0^(t−1)) − y^(i) ) x^(i) + 2λθ^(t−1) ]
7    θ0^(t) = θ0^(t−1) − η (1/n) Σ_{i=1}^n ( σ([θ^(t−1)]^T x^(i) + θ0^(t−1)) − y^(i) )
8  until |Jlr(θ^(t), θ0^(t)) − Jlr(θ^(t−1), θ0^(t−1))| < ϵ
9  return θ^(t), θ0^(t)

Logistic regression, implemented using batch or stochastic gradient descent, is a useful


and fundamental machine learning technique. We will also see later that it corresponds to
a one-layer neural network with a sigmoidal activation function, and so is an important
step toward understanding neural networks.
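One possible numpy sketch of LR-GRADIENT-DESCENT, with the sums vectorized and labels y^(i) ∈ {0, 1} as assumed for NLL (the zero initialization and max_iter guard are ours):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_gradient_descent(X, Y, lam, eta, epsilon, max_iter=100000):
    # X: n x d inputs; Y: n x 1 labels in {0, 1}; lam: regularization strength lambda
    n, d = X.shape
    theta, theta_0 = np.zeros((d, 1)), 0.0
    def J(th, th0):
        g = np.clip(sigmoid(X @ th + th0), 1e-12, 1 - 1e-12)   # clip to keep log() finite
        return np.mean(-(Y * np.log(g) + (1 - Y) * np.log(1 - g))) + lam * np.sum(th ** 2)
    for _ in range(max_iter):
        g = sigmoid(X @ theta + theta_0)                                  # n x 1 guesses g^(i)
        new_theta = theta - eta * (X.T @ (g - Y) / n + 2 * lam * theta)   # line 6
        new_theta_0 = theta_0 - eta * np.mean(g - Y)                      # line 7
        if abs(J(new_theta, new_theta_0) - J(theta, theta_0)) < epsilon:  # line 8
            return new_theta, new_theta_0
        theta, theta_0 = new_theta, new_theta_0
    return theta, theta_0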

4.4.1 Convexity of the NLL Loss Function


Much like the squared-error loss function that we saw for linear regression, the NLL loss
function for linear logistic regression is a convex function. This means that running gradi-
ent descent with a reasonable set of hyperparameters will converge arbitrarily close to the
minimum of the objective function.
We will use the following facts to demonstrate that the NLL loss function is a convex
function:
• if the derivative of a function of a scalar argument is monotonically increasing, then
it is a convex function,
• the sum of convex functions is also convex,
• a convex function of an affine function is a convex function.
Let z = θT x + θ0 ; z is an affine function of θ and θ0 . It therefore suffices to show that the
functions f1 (z) = − log(σ(z)) and f2 (z) = − log(1 − σ(z)) are convex with respect to z.
First, we can see that since

d/dz f1(z) = d/dz [− log(1/(1 + exp(−z)))]
           = d/dz [log(1 + exp(−z))]
           = − exp(−z)/(1 + exp(−z))
           = −1 + σ(z) ,


the derivative of the function f1 (z) is a monotonically increasing function and therefore f1
is a convex function.
Second, we can see that since

d/dz f2(z) = d/dz [− log(exp(−z)/(1 + exp(−z)))]
           = d/dz [log(1 + exp(−z)) + z]
           = σ(z) ,

the derivative of the function f2 (z) is also monotonically increasing and therefore f2 is a
convex function.

4.5 Handling multiple classes


So far, we have focused on the binary classification case, with only two possible classes. But
what can we do if we have multiple possible classes (e.g., we want to predict the genre of
a movie)? There are two basic strategies:
• Train multiple binary classifiers using different subsets of our data and combine their
outputs to make a class prediction.

• Directly train a multi-class classifier using a hypothesis class that is a generalization


of logistic regression, using a one-hot output encoding and NLL loss.
The method based on NLL is in wider use, especially in the context of neural networks,
and is explored here. In the following, we will assume that we have a data set D in which
the inputs x(i) ∈ Rd but the outputs y(i) are drawn from a set of K classes {c1 , . . . , cK }. Next,
we extend the idea of NLL directly to multi-class classification with K classes, where the training label is represented with what is called a one-hot vector y = [y1, . . . , yK]^T, where
yk = 1 if the example is of class k and yk = 0 otherwise. Now, we have a problem of
mapping an input x(i) that is in Rd into a K-dimensional output. Furthermore, we would
like this output to be interpretable as a discrete probability distribution over the possible
classes, which means the elements of the output vector have to be non-negative (greater
than or equal to 0) and sum to 1.
We will do this in two steps. First, we will map our input x^(i) into a vector value z^(i) ∈ R^K by letting θ be a whole d × K matrix of parameters, and θ0 be a K × 1 vector, so that

z = θ^T x + θ0 .

(Let’s check dimensions! θ^T is K × d and x is d × 1, and θ0 is K × 1, so z is K × 1 and we’re good!)

Next, we have to extend our use of the sigmoid function to the multi-dimensional softmax function, that takes a whole vector z ∈ R^K and generates

g = softmax(z) = [ exp(z1)/Σ_i exp(zi), . . . , exp(zK)/Σ_i exp(zi) ]^T ,

which can be interpreted as a probability distribution over K items. To make the final
prediction of the class label, we can then look at g, find the most likely probability over
these K entries in g, (i.e. find the largest entry in g,) and return the corresponding index as
the “one-hot” element of 1 in our prediction.
Study Question: Convince yourself that the vector of g values will be non-negative
and sum to 1.


Putting these steps together, our hypotheses will be

h(x; θ, θ0 ) = softmax(θT x + θ0 ) .

Now, we retain the goal of maximizing the probability that our hypothesis assigns to
the correct output yk for each input x. We can write this probability, letting g stand for our
“guess”, h(x), for a single example (x, y) as ∏_{k=1}^K g_k^{y_k} .

Study Question: How many elements that are not equal to 1 will there be in this
product?
The negative log of the probability that we are making a correct guess is, then, for one-
hot vector y and probability distribution vector g,

Lnllm(g, y) = − Σ_{k=1}^K y_k · log(g_k) .

We’ll call this NLLM for negative log likelihood multiclass. It is worth noting that the NLLM loss function is also convex; however, we will omit the proof.
Study Question: Be sure you see that Lnllm is minimized when the guess assigns high probability to the true class.

Study Question: Show that Lnllm for K = 2 is the same as Lnll .
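A numpy sketch of the softmax hypothesis and the NLLM loss for a single example; subtracting max(z) inside softmax is a standard numerical-stability trick and does not change the result:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))       # shift by max(z) for numerical stability
    return e / np.sum(e)

def multiclass_predict(x, theta, theta_0):
    # theta: d x K, theta_0: K x 1, x: d x 1; returns a K x 1 probability vector g
    return softmax(theta.T @ x + theta_0)

def nllm_loss(g, y):
    # y: one-hot K-vector, g: probability K-vector
    return -np.sum(y * np.log(g))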

4.6 Prediction accuracy and validation


In order to formulate classification with a smooth objective function that we can optimize
robustly using gradient descent, we changed the output from discrete classes to probability
values and the loss function from 0-1 loss to NLL. However, when time comes to actually
make a prediction we usually have to make a hard choice: buy stock in Acme or not? And,
we get rewarded if we guessed right, independent of how sure or not we were when we
made the guess.
The performance of a classifier is often characterized by its accuracy, which is the per-
centage of a data set that it predicts correctly in the case of 0-1 loss. We can see that accuracy
of hypothesis h on data D is the fraction of the data set that does not incur any loss:

A(h; D) = 1 − (1/n) Σ_{i=1}^n L01(g^(i), y^(i)) ,

where g(i) is the final guess for one class or the other that we make from h(x(i) ), e.g., after
thresholding. It’s noteworthy here that we use a different loss function for optimization
than for evaluation. This is a compromise we make for computational ease and efficiency.



CHAPTER 5

Feature representation

Linear regression and classification are powerful tools, but in the real world, data often
exhibit non-linear behavior that cannot immediately be captured by the linear models which
we have built so far. For example, suppose the true behavior of a system (with d = 2) looks
like this wavelet (this plot is of the so-called jinc function J1(ρ)/ρ for ρ² = x1² + x2²):

Such behavior is actually ubiquitous in physical systems, e.g., in the vibrations of the sur-
face of a drum, or scattering of light through an aperture. However, no single hyperplane
would be a very good fit to such peaked responses!
A richer class of hypotheses can be obtained by performing a non-linear feature trans-
formation ϕ(x) before doing the regression. That is, θT x + θ0 is a linear function of x, but
θT ϕ(x) + θ0 is a non-linear function of x, if ϕ is a non-linear function of x.
There are many different ways to construct ϕ. Some are relatively systematic and do-
main independent. Others are directly related to the semantics (meaning) of the original
features, and we construct them deliberately with our application (goal) in mind.

5.1 Gaining intuition about feature transformations


In this section, we explore the effects of non-linear feature transformations on simple clas-
sification problems, to gain intuition.
Let’s look at an example data set that starts in 1-D:


These points are not linearly separable, but consider the transformation ϕ(x) = [x, x²]^T. (What’s a linear separator for data in 1-D? A point!) Putting the data in ϕ space, we see that it is now separable. There are lots of possible separators; we have just shown one of them here.

[Figure: the data plotted in ϕ space, with x on one axis and x² on the other, and a linear separator drawn between the two classes.]

A linear separator in ϕ space is a nonlinear separator in the original space! Let’s see
how this plays out in our simple example. Consider the separator x2 − 1 = 0, which labels the half-plane x2 − 1 > 0 as positive, where x2 denotes the second feature in ϕ space. What separator does it correspond to in the original 1-D space? Since x2 = x², we have to ask the question: which x values have the property that x² − 1 = 0. The answer is +1 and −1, so those two points constitute our separator, back in the original space. And we can use the same reasoning to find the region of 1-D space that is labeled positive by this separator.

[Figure: the original 1-D axis, with the two separator points at x = −1 and x = +1 marked.]

5.2 Systematic feature construction


Here are two different ways to systematically construct features in a problem independent
way.

5.2.1 Polynomial basis


If the features in your problem are already naturally numerical, one systematic strategy for
constructing a new feature space is to use a polynomial basis. The idea is that, if you are
using the kth-order basis (where k is a positive integer), you include a feature for every
possible product of k different dimensions in your original input.
Here is a table illustrating the kth order polynomial basis for different values of k, call-
ing out the cases when d = 1 and d > 1:


Order    d = 1                  in general (d > 1)
0        [1]                    [1]
1        [1, x]^T               [1, x1, . . . , xd]^T
2        [1, x, x²]^T           [1, x1, . . . , xd, x1², x1x2, . . .]^T
3        [1, x, x², x³]^T       [1, x1, . . . , x1², x1x2, . . . , x1x2x3, . . .]^T
...      ...                    ...

This transformation can be used in combination with linear regression or logistic regres-
sion (or any other regression or classification model). When we’re using a linear regression
or classification model, the key insight is that a linear regressor or separator in the trans-
formed space is a non-linear regressor or separator in the original space.
For example, the wavelet pictured at the start of this chapter can be fit much better than with just a hyperplane, using linear regression with polynomials up to order k = 8 (specifically, this example uses [1, x1, x2, x1² + x2², (x1² + x2²)², (x1² + x2²)⁴]^T):

The raw data (with n = 1000 random samples) is plotted on the left, and the regression
result (curved surface) is on the right.
Now let’s look at a classification example and see how polynomial feature transforma-
tion may help us.
One well-known example is the “exclusive or” (XOR) data set, the drosophila of machine-learning data sets (D. melanogaster is a species of fruit fly, used as a simple system in which to study genetics, since 1910):

Clearly, this data set is not linearly separable. So, what if we try to solve the XOR classi-
fication problem using a polynomial basis as the feature transformation? We can just take
our two-dimensional data and transform it into a higher-dimensional data set, by applying
ϕ. Now, we have a classification problem as usual.
Let’s try it for k = 2 on our XOR problem. The feature transformation is

ϕ([x1, x2]^T) = [1, x1, x2, x1², x1x2, x2²]^T .
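A small sketch of this particular k = 2 transformation (hard-coded for d = 2; a general implementation would enumerate all monomials up to order k):

import numpy as np

def phi_order2(x):
    # x = [x1, x2]; returns [1, x1, x2, x1^2, x1*x2, x2^2]
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

# Four illustrative XOR-style points, transformed into the 6-dimensional feature space:
for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x, phi_order2(x))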


Study Question: If we train a classifier after performing this feature transformation,


would we lose any expressive power if we let θ0 = 0 (i.e., trained without offset in-
stead of with offset)?
We might run a classification learning algorithm and find a separator with coefficients
θ = [0, 0, 0, 0, 4, 0]T and θ0 = 0. This corresponds to

0 + 0x1 + 0x2 + 0x1² + 4x1x2 + 0x2² + 0 = 0

and is plotted below, with the gray shaded region classified as negative and the white
region classified as positive:

[Figure: the XOR data in (x1, x2) space with this separator; gray region negative, white region positive.]

Study Question: Be sure you understand why this high-dimensional hyperplane is


a separator, and how it corresponds to the figure.
For fun, we show some more plots below. Here is another result for a linear classifier
on XOR generated with logistic regression and gradient descent, using a random initial
starting point and second-order polynomial basis:

[Figure: another separator for the XOR data, found by logistic regression with gradient descent and a second-order polynomial basis.]


Here is a harder data set. Logistic regression with gradient descent failed to separate
it with a second, third, or fourth-order basis feature representation, but succeeded with a
fifth-order basis. Shown below are some results after ∼ 1000 gradient descent iterations
(from random starting points) for bases of order 2 (upper left), 3 (upper right), 4 (lower
left), and 5 (lower right).

[Figure: four panels showing the learned separators for polynomial bases of order 2 (upper left), 3 (upper right), 4 (lower left), and 5 (lower right).]

Study Question: Percy Eptron has a domain with four numeric input features,
(x1 , . . . , x4 ). He decides to use a representation of the form

ϕ(x) = PolyBasis((x1 , x2 ), 3)⌢ PolyBasis((x3 , x4 ), 3)

where a⌢ b means the vector a concatenated with the vector b. What is the dimen-
sion of Percy’s representation? Under what assumptions about the original features is
this a reasonable choice?

5.2.2 Radial basis functions


Another cool idea is to use the training data itself to construct a feature space. The idea works
as follows. For any particular point p in the input space X, we can construct a feature fp
which takes any element x ∈ X and returns a scalar value that is related to how far x is
from the p we started with.


Let’s start with the basic case, in which X = Rd . Then we can define
fp(x) = e^{−β∥p−x∥²} .
This function is maximized when p = x and decreases exponentially as x becomes more
distant from p.
The parameter β governs how quickly the feature value decays as we move away from
the center point p. For large values of β, the fp values are nearly 0 almost everywhere
except right near p; for small values of β, the features have a high value over a larger part
of the space.
Study Question: What is fp (p)?
Now, given a dataset D containing n points, we can make a feature transformation ϕ
that maps points in our original space, Rd , into points in a new space, Rn . It is defined as
follows:
ϕ(x) = [f_{x^(1)}(x), f_{x^(2)}(x), . . . , f_{x^(n)}(x)]^T .
So, we represent a new datapoint x in terms of how far it is from each of the datapoints in
our training set.
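A numpy sketch of this transformation, using the whole training set as the centers (the argument names are ours):

import numpy as np

def rbf_features(x, centers, beta):
    # centers: n x d array of the training points x^(1), ..., x^(n); x: a d-vector
    # returns the n-vector [f_p(x) for each training point p]
    return np.exp(-beta * np.sum((centers - x) ** 2, axis=1))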
This idea can be generalized in several ways and is the fundamental concept underlying
kernel methods, that you should read about some time. This idea of describing objects in
terms of their similarity to a set of reference objects is very powerful and can be applied to
cases where X is not a simple vector space, but where the inputs are graphs or strings or
other types of objects, as long as there is a distance metric defined on it.

5.3 Hand-constructing features for real domains


In many machine-learning applications, we are given descriptions of the inputs with many
different types of attributes, including numbers, words, and discrete features. An impor-
tant factor in the success of an ML application is the way that the features are chosen to be
encoded by the human who is framing the learning problem.

5.3.1 Discrete features


Getting a good encoding of discrete features is particularly important. You want to create
“opportunities” for the ML system to find the underlying regularities. Although there
are machine-learning methods that have special mechanisms for handling discrete inputs,
most of the methods we consider in this class will assume the input vectors x are in Rd . So,
we have to figure out some reasonable strategies for turning discrete values into (vectors
of) real numbers.
We’ll start by listing some encoding strategies, and then work through some examples.
Let’s assume we have some feature in our raw data that can take on one of k discrete values.
• Numeric: Assign each of these values a number, say 1.0/k, 2.0/k, . . . , 1.0. We might
want to then do some further processing, as described in Section 5.3.3. This is a
sensible strategy only when the discrete values really do signify some sort of numeric
quantity, so that these numerical values are meaningful.
• Thermometer code: If your discrete values have a natural ordering, from 1, . . . , k, but
not a natural mapping into real numbers, a good strategy is to use a vector of length
k binary variables, where we convert discrete input value 0 < j ⩽ k into a vector in
which the first j values are 1.0 and the rest are 0.0. This does not necessarily imply
anything about the spacing or numerical quantities of the inputs, but does convey
something about ordering.


• Factored code: If your discrete values can sensibly be decomposed into two parts
(say the “maker” and “model” of a car), then it’s best to treat those as two separate
features, and choose an appropriate encoding of each one from this list.

• One-hot code: If there is no obvious numeric, ordering, or factorial structure, then


the best strategy is to use a vector of length k, where we convert discrete input value
0 < j ⩽ k into a vector in which all values are 0.0, except for the jth, which is 1.0.

• Binary code: It might be tempting for the computer scientists among us to use some
binary code, which would let us represent k values using a vector of length log k.
This is a bad idea! Decoding a binary code takes a lot of work, and by encoding your
inputs this way, you’d be forcing your system to learn the decoding algorithm.

As an example, imagine that we want to encode blood types, that are drawn from the
set {A+, A−, B+, B−, AB+, AB−, O+, O−}. There is no obvious linear numeric scaling or
even ordering to this set. But there is a reasonable factoring, into two features: {A, B, AB, O}
and {+, −}. And, in fact, we can further reasonably factor the first group into {A, notA},
{B, notB}. (It is sensible, according to Wikipedia, to treat O as having neither feature A nor feature B.) So, here are two plausible encodings of the whole set:
• Use a 6-D vector, with two components of the vector each encoding the corresponding factor using a one-hot encoding.

• Use a 3-D vector, with one dimension for each factor, encoding its presence as 1.0
and absence as −1.0 (this is sometimes better than 0.0). In this case, AB+ would be
[1.0, 1.0, 1.0]T and O− would be [−1.0, −1.0, −1.0]T .

Study Question: How would you encode A+ in both of these approaches?

5.3.2 Text
The problem of taking a text (such as a tweet or a product review, or even this document!)
and encoding it as an input for a machine-learning algorithm is interesting and compli-
cated. Much later in the class, we’ll study sequential input models, where, rather than
having to encode a text as a fixed-length feature vector, we feed it into a hypothesis word
by word (or even character by character!).
There are some simple encodings that work well for basic applications. One of them is
the bag of words (BOW) model. The idea is to let d be the number of words in our vocabulary
(either computed from the training set or some other body of text or dictionary). We will
then make a binary vector (with values 1.0 and 0.0) of length d, where element j has value
1.0 if word j occurs in the document, and 0.0 otherwise.
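As a sketch (ours, not from the notes), a bag-of-words encoder can be just a few lines, assuming the vocabulary has already been fixed:

def bag_of_words(document, vocabulary):
    # vocabulary: a list of d words; returns a length-d binary vector.
    words = set(document.lower().split())
    return [1.0 if w in words else 0.0 for w in vocabulary]

vocab = ["cat", "dog", "the", "sat"]                      # toy vocabulary, for illustration
print(bag_of_words("The cat sat on the mat", vocab))      # [1.0, 0.0, 1.0, 1.0]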

5.3.3 Numeric values


If some feature is already encoded as a numeric value (heart rate, stock price, distance, etc.)
then we should generally keep it as a numeric value. An exception might be a situation in
which we know there are natural “breakpoints” in the semantics: for example, encoding
someone’s age in the US, we might make an explicit distinction between under and over
18 (or 21), depending on what kind of thing we are trying to predict. It might make sense
to divide into discrete bins (possibly spacing them closer together for the very young) and
to use a one-hot encoding for some sorts of medical situations in which we don’t expect a
linear (or even monotonic) relationship between age and some physiological features.
If we choose to leave a feature as numeric, it is typically useful to scale it, so that it
tends to be in the range [−1, +1]. Without performing this transformation, if we have one


feature with much larger values than another, it will take the learning algorithm a lot of work to find parameters that can put them on an equal basis. We could also perform a more involved scaling/transformation ϕ(x) = (x − x̄)/σ, where x̄ is the average of the x(i) , and σ is the standard deviation of the x(i) . The resulting feature values will have mean 0 and standard deviation 1. This transformation is sometimes called standardizing a variable. (Such standard variables are often known as “z-scores,” for example, in the social sciences.)
Then, of course, we might apply a higher-order polynomial-basis transformation to one or more groups of numeric features.
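In code, standardizing a numeric feature is a one-liner; here is a minimal numpy sketch (ours):

import numpy as np

def standardize(x):
    # x: 1-D array of raw values for one feature; returns z-scores with mean 0 and std 1.
    return (x - x.mean()) / x.std()

heart_rate = np.array([60.0, 72.0, 90.0, 54.0])                        # toy data, for illustration
print(standardize(heart_rate).mean(), standardize(heart_rate).std())   # approximately 0.0 and 1.0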
Study Question: Consider using a polynomial basis of order k as a feature trans-
formation ϕ on our data. Would increasing k tend to increase or decrease structural
error? What about estimation error?



CHAPTER 6

Neural Networks

You’ve probably been hearing a lot about “neural networks.” Now that we have several
useful machine-learning concepts (hypothesis classes, classification, regression, gradient
descent, regularization, etc.) we are well equipped to understand neural networks in detail.
This is, in some sense, the “third wave” of neural nets. The basic idea is founded on the 1943 model of neurons of McCulloch and Pitts and the learning ideas of Hebb. There was a great deal of excitement, but not a lot of practical success: there were good training methods (e.g., perceptron) for linear functions, and interesting examples of non-linear functions, but no good way to train non-linear functions from data. Interest died out for a while, but was re-kindled in the 1980s when several people came up with a way to train neural networks with “back-propagation,” which is a particular style of implementing gradient descent, that we will study here. (As with many good ideas in science, the basic idea for how to train non-linear neural networks with gradient descent was independently developed by more than one researcher.) By the mid-90s, the enthusiasm waned again, because although we could train non-linear networks, the training tended to be slow and was plagued by a problem of getting stuck in local optima. Support vector machines (SVMs) that use regularization of high-dimensional hypotheses by seeking to maximize the margin, and kernel methods that are an efficient and beautiful way of using feature transformations to non-linearly transform data into a higher-dimensional space, provided reliable learning methods with guaranteed convergence and no local optima.
However, during the SVM enthusiasm, several groups kept working on neural networks, and their work, in combination with an increase in available data and computation, has made them rise again. They have become much more reliable and capable, and are now the method of choice in many applications. There are many, many variations of neural networks, which we can’t even begin to survey. (The number increases daily, as may be seen on arxiv.org.) We will study the core “feed-forward” networks with “back-propagation” training, and then, in later chapters, address some of the major advances beyond this core.
We can view neural networks from several different perspectives:

View 1: An application of stochastic gradient descent for classification and regression with a potentially very rich hypothesis class.

View 2: A brain-inspired network of neuron-like computing elements that learn distributed representations.

View 3: A method for building applications that make predictions based on huge amounts
of data in very complex domains.


We will mostly take view 1, with the understanding that the techniques we develop will enable the applications in view 3. View 2 was a major motivation for the early development of neural networks, but the techniques we will study do not seem to actually account for the biological learning processes in brains. (Some prominent researchers are, in fact, working hard to find analogues of these methods in the brain.)
6.1 Basic element
The basic element of a neural network is a “neuron,” pictured schematically below. We will
also sometimes refer to a neuron as a “unit” or “node.”
[Figure: schematic of a single neuron. The inputs x1 , . . . , xm are multiplied by the weights w1 , . . . , wm and summed, together with the offset w0 , to form the pre-activation z; this is passed through the activation function f(·) to produce the output a.]

It is a non-linear function of an input vector x ∈ Rm to a single output value a ∈ R. It is parameterized by a vector of weights (w1 , . . . , wm ) ∈ Rm and an offset or threshold w0 ∈ R. (Sorry for changing our notation here. We were using d as the dimension of the input, but we are trying to be consistent here with many other accounts of neural networks. It is impossible to be consistent with all of them, though: there are many different ways of telling this story.) In order for the neuron to be non-linear, we also specify an activation function f : R → R, which can be the identity (f(x) = x, in that case the neuron is a linear function of x), but can also be any other function, though we will only be able to work with it if it is differentiable. The function represented by the neuron is expressed as:

a = f(z) = f( Σ_{j=1}^{m} xj wj + w0 ) = f(wT x + w0 ) .
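For concreteness, here is a minimal numpy sketch (ours) of this forward computation for a single unit, with a sigmoid chosen as an example activation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, w0, f=sigmoid):
    # x and w have shape (m,); w0 is a scalar.  Returns a = f(w^T x + w0).
    z = w @ x + w0            # pre-activation
    return f(z)               # activation

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
print(neuron(x, w, w0=0.1))   # sigmoid(0.5*1.0 - 0.25*2.0 + 0.1) = sigmoid(0.1)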
Before thinking about a whole network, we can consider how to train a single unit.
Given a loss function L(guess, actual) and a dataset {(x(1) , y(1) ), . . . , (x(n) , y(n) )}, we can do (stochastic) gradient descent, adjusting the weights w, w0 to minimize

J(w, w0 ) = Σi L( NN(x(i) ; w, w0 ), y(i) ) ,

where NN is the output of our single-unit neural net for a given input. (This should remind you of our θ and θ0 for linear models.)
We have already studied two special cases of the neuron: linear logistic classifiers
(LLCs) with NLL loss and regressors with quadratic loss! The activation function for the
LLC is f(x) = σ(x) and for linear regression it is simply f(x) = x.
Study Question: Just for a single neuron, imagine for some reason, that we decide
to use activation function f(z) = ez and loss function L(guess, actual) = (guess −
actual)2 . Derive a gradient descent update for w and w0 .

6.2 Networks
Now, we’ll put multiple neurons together into a network. A neural network in general
takes in an input x ∈ Rm and generates an output a ∈ Rn . It is constructed out of multiple


neurons; the inputs of each neuron might be elements of x and/or outputs of other neurons.
The outputs are generated by n output units.
In this chapter, we will only consider feed-forward networks. In a feed-forward network,
you can think of the network as defining a function-call graph that is acyclic: that is, the
input to a neuron can never depend on that neuron’s output. Data flows one way, from the
inputs to the outputs, and the function computed by the network is just a composition of
the functions computed by the individual neurons.
Although the graph structure of a feed-forward neural network can really be anything
(as long as it satisfies the feed-forward constraint), for simplicity in software and analysis,
we usually organize them into layers. A layer is a group of neurons that are essentially “in
parallel”: their inputs are outputs of neurons in the previous layer, and their outputs are
the input to the neurons in the next layer. We’ll start by describing a single layer, and then
go on to the case of multiple layers.

6.2.1 Single layer


A layer is a set of units that, as we have just described, are not connected to each other. The
layer is called fully connected if, as in the diagram below, all of the inputs (i.e., x1 , x2 , . . . xm
in this case) are connected to every unit in the layer. A layer has input x ∈ Rm and output
(also known as activation) a ∈ Rn .

[Figure: a fully connected layer. Every input x1 , . . . , xm is connected to each of the n units; the weights and offsets of the whole layer are collected into W and W0 , and the units produce the outputs a1 , . . . , an .]

Since each unit has a vector of weights and a single offset, we can think of the weights of
the whole layer as a matrix, W, and the collection of all the offsets as a vector W0 . If we
have m inputs, n units, and n outputs, then

• W is an m × n matrix,

• W0 is an n × 1 column vector,

• X, the input, is an m × 1 column vector,

• Z = W T X + W0 , the pre-activation, is an n × 1 column vector,

• A, the activation, is an n × 1 column vector,

and the output vector is


A = f(Z) = f(W T X + W0 ) .
The activation function f is applied element-wise to the pre-activation values Z.
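A minimal numpy sketch (ours) of this single-layer forward computation, using the shapes listed above:

import numpy as np

def layer_forward(X, W, W0, f):
    # X: (m, 1) input; W: (m, n) weights; W0: (n, 1) offsets; f applied element-wise.
    Z = W.T @ X + W0          # pre-activation, shape (n, 1)
    return f(Z)               # activation, shape (n, 1)

rng = np.random.default_rng(0)
m, n = 3, 2
X, W, W0 = rng.standard_normal((m, 1)), rng.standard_normal((m, n)), np.zeros((n, 1))
print(layer_forward(X, W, W0, np.tanh).shape)   # (2, 1)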


6.2.2 Many layers


A single neural network generally combines multiple layers, most typically by feeding the
outputs of one layer into the inputs of another layer.
We have to start by establishing some nomenclature. We will use l to name a layer, and
let ml be the number of inputs to the layer and nl be the number of outputs from the layer.
Then, W l and W0l are of shape ml × nl and nl × 1, respectively. Note that the input to layer
l is the output from layer l − 1, so we have ml = nl−1 , and as a result Al−1 is of shape
ml × 1, or equivalently nl−1 × 1. Let fl be the activation function of layer l. (It is technically possible to have different activation functions within the same layer, but, again, for convenience in specification and implementation, we generally have the same activation function within a layer.) Then, the pre-activation outputs are the nl × 1 vector

Zl = (W l )T Al−1 + W0l

and the activation outputs are simply the nl × 1 vector

Al = fl (Zl ) .
Here’s a diagram of a many-layered network, with two blocks for each layer, one rep-
resenting the linear part of the operation and one representing the non-linear activation
function. We will use this structural decomposition to organize our algorithmic thinking
and implementation.

[Figure: a many-layer network drawn as a chain of blocks. The input X = A0 enters layer 1, whose linear part (W 1 , W01 ) produces Z1 and whose activation f1 produces A1 ; layer 2 (W 2 , W02 , f2 ) produces Z2 and A2 ; and so on, through layer L, which produces the final output AL .]
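In code, the forward pass is just this recurrence applied layer by layer; a minimal sketch (ours), where each layer is represented by a (W, W0, f) triple:

import numpy as np

def forward(X, layers):
    # layers: list of (W, W0, f); W has shape (m_l, n_l) and W0 has shape (n_l, 1).
    A = X                            # A^0 = X
    for W, W0, f in layers:
        Z = W.T @ A + W0             # Z^l = (W^l)^T A^{l-1} + W0^l
        A = f(Z)                     # A^l = f^l(Z^l)
    return A                         # A^L

rng = np.random.default_rng(1)
layers = [(rng.standard_normal((2, 3)), np.zeros((3, 1)), np.tanh),
          (rng.standard_normal((3, 1)), np.zeros((1, 1)), lambda z: z)]
print(forward(np.ones((2, 1)), layers).shape)   # (1, 1)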

6.3 Choices of activation function


There are many possible choices for the activation function. We will start by thinking about
whether it’s really necessary to have an f at all.
What happens if we let f be the identity? Then, in a network with L layers (we’ll leave out W0 for simplicity, but keeping it wouldn’t change the form of this argument),

AL = (W L )T AL−1 = (W L )T (W L−1 )T · · · (W 1 )T X .

So, multiplying out the weight matrices, we find that

AL = W total X ,

which is a linear function of X! Having all those layers did not change the representational
capacity of the network: the non-linearity of the activation function is crucial.
Study Question: Convince yourself that any function representable by any number
of linear layers (where f is the identity function) can be represented by a single layer.
Now that we are convinced we need a non-linear activation, let’s examine a few com-
mon choices. These are shown mathematically below, followed by plots of these functions.

Step function:

step(z) = { 0 if z < 0 ; 1 otherwise }


Rectified linear unit (ReLU):

ReLU(z) = { 0 if z < 0 ; z otherwise } = max(0, z)

Sigmoid function: Also known as a logistic function. This can sometimes be interpreted
as probability, because for any value of z the output is in (0, 1):
σ(z) = 1 / (1 + e−z )

Hyperbolic tangent: Always in the range (−1, 1):


tanh(z) = (ez − e−z ) / (ez + e−z )

Softmax function: Takes a whole vector Z ∈ Rn and generates as output a vector A ∈ (0, 1)n with the property that Σ_{i=1}^{n} Ai = 1, which means we can interpret it as a probability distribution over n items:

softmax(z) = [ exp(z1 )/Σi exp(zi ) , . . . , exp(zn )/Σi exp(zi ) ]T

[Plots of the four activation functions: step(z), ReLU(z), σ(z), and tanh(z), each as a function of z.]
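These are all easy to implement; here is a minimal numpy sketch (ours). The max-subtraction in softmax is a standard numerical-stability trick and does not change the result:

import numpy as np

def step(z):    return np.where(z < 0, 0.0, 1.0)
def relu(z):    return np.maximum(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def tanh(z):    return np.tanh(z)

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
print(relu(z), sigmoid(z), softmax(z).sum())   # the softmax entries sum to 1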

The original idea for neural networks involved using the step function as an activation,
but because the derivative of the step function is zero everywhere except at the discontinu-
ity (and there it is undefined), gradient-descent methods won’t be useful in finding a good
setting of the weights, and so we won’t consider them further. They have been replaced, in
a sense, by the sigmoid, ReLU, and tanh activation functions.
Study Question: Consider sigmoid, ReLU, and tanh activations. Which one is most
like a step function? Is there an additional parameter you could add to a sigmoid
that would make it be more like a step function?


Study Question: What is the derivative of the ReLU function? Are there some val-
ues of the input for which the derivative vanishes?
ReLUs are especially common in internal (“hidden”) layers, sigmoid activations are
common for the output for binary classification, and softmax activations are common for
the output for multi-class classification (see Section 4.3.2 for an explanation).

6.4 Loss functions and activation functions


Different loss functions make different assumptions about the range of values they will get
as input and, as we have seen, different activation functions will produce output values in
different ranges. When you are designing a neural network, it’s important to make these
things fit together well. In particular, we will think about matching loss functions with the
activation function in the last layer, fL . Here is a table of loss functions and activations that
make sense for them:
Loss       fL        task
squared    linear    regression
NLL        sigmoid   binary classification
NLLM       softmax   multi-class classification

We explored squared loss in Chapter 2, and NLL and NLLM in Chapter 4.

6.5 Error back-propagation


We will train neural networks using gradient descent methods. It’s possible to use batch
gradient descent, in which we sum up the gradient over all the points (as in Section 3.2 of
chapter 3) or stochastic gradient descent (SGD), in which we take a small step with respect
to the gradient considering a single point at a time (as in Section 3.4 of Chapter 3).
Our notation is going to get pretty hairy pretty quickly. To keep it as simple as we can,
we’ll focus on computing the contribution of one data point x(i) to the gradient of the loss
with respect to the weights, for SGD; you can simply sum up these gradients over all the
data points if you wish to do batch descent.
So, to do SGD for a training example (x, y), we need to compute ∇W L(NN(x; W), y), where W represents all weights W l , W0l in all the layers l = (1, . . . , L). This seems terrifying, but is actually quite easy to do using the chain rule. (Remember the chain rule! If a = f(b) and b = g(c), so that a = f(g(c)), then da/dc = (da/db) · (db/dc) = f′(b)g′(c) = f′(g(c))g′(c).)
Remember that we are always computing the gradient of the loss function with respect to the weights for a particular value of (x, y). That tells us how much we want to change the weights, in order to reduce the loss incurred on this particular training example.
6.5.1 First, suppose everything is one-dimensional
To get some intuition for how these derivations work, we’ll first suppose everything in our
neural network is one-dimensional. In particular, we’ll assume there are ml = 1 inputs
and nl = 1 outputs at every layer. So layer l looks like:

al = fl (zl ), zl = wl al−1 + wl0 .

In the equation above, we’re using the lowercase letters al , zl , wl , al−1 , wl0 to emphasize
that all of these quantities are scalars just for the moment. We’ll look at the more general
matrix case below.
To use SGD, then, we want to compute ∂L(NN(x; W), y)/∂wl and ∂L(NN(x; W), y)/∂wl0 for each layer l and each data point (x, y). (Check your understanding: why do we need exactly these quantities for SGD?) Below we’ll write “loss” as an abbreviation for L(NN(x; W), y). Then our first quantity of interest is ∂loss/∂wl . The chain rule gives us the following. First, let’s look at the case l = L:

∂loss/∂wL = (∂loss/∂aL ) · (∂aL /∂zL ) · (∂zL /∂wL )
          = (∂loss/∂aL ) · (fL )′ (zL ) · aL−1 .
Now we can look at the case of general l:

∂loss/∂wl = (∂loss/∂aL ) · (∂aL /∂zL ) · (∂zL /∂aL−1 ) · (∂aL−1 /∂zL−1 ) · · · (∂zl+1 /∂al ) · (∂al /∂zl ) · (∂zl /∂wl )
          = (∂loss/∂aL ) · (fL )′ (zL ) · wL · (fL−1 )′ (zL−1 ) · · · wl+1 · (fl )′ (zl ) · al−1
          = (∂loss/∂zl ) · al−1 .
Note that every multiplication above is scalar multiplication because every term in ev-
ery product above is a scalar. And though we solved for all the other terms in the product,
we haven’t solved for ∂loss/∂aL because the derivative will depend on which loss function
you choose. Once you choose a loss function though, you should be able to compute this
derivative.
Study Question: Suppose you choose squared loss. What is ∂loss/∂aL ?

Study Question: Check the derivations above yourself. You should use the chain
rule and also solve for the individual derivatives that arise in the chain rule.

Study Question: Check that the final layer (l = L) case is a special case of the
general layer l case above.

Study Question: Derive ∂L(NN(x; W), y)/∂wl0 for yourself, for both the final layer
(l = L) and general l.

Study Question: Does the L = 1 case remind you of anything from earlier in this
course?

Study Question: Write out the full SGD algorithm for this neural network.
It’s pretty typical to run the chain rule from left to right like we did above. But, for
where we’re going next, it will be useful to notice that it’s completely equivalent to write it
in the other direction. So we can rewrite our result from above as follows:
∂loss/∂wl = al−1 · ∂loss/∂zl                                                                           (6.1)
∂loss/∂zl = (∂al /∂zl ) · (∂zl+1 /∂al ) · · · (∂aL−1 /∂zL−1 ) · (∂zL /∂aL−1 ) · (∂aL /∂zL ) · (∂loss/∂aL )    (6.2)
          = (∂al /∂zl ) · wl+1 · · · (∂aL−1 /∂zL−1 ) · wL · (∂aL /∂zL ) · (∂loss/∂aL ) .                      (6.3)
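To see the one-dimensional recursion in action, here is a minimal numpy sketch (ours). It uses tanh activations and squared loss purely as example choices, runs the forward pass, and then accumulates ∂loss/∂z backwards in the order of Eq. 6.2, picking off ∂loss/∂w at each layer via Eq. 6.1:

import numpy as np

w, w0 = [0.8, -1.2, 0.5], [0.1, 0.0, -0.3]     # scalar weights and offsets for 3 layers
f, fprime = np.tanh, lambda z: 1.0 - np.tanh(z) ** 2
x, y = 1.5, 0.7

# Forward pass: store every a^l and z^l (a[0] is the input x).
a, z = [x], []
for wl, w0l in zip(w, w0):
    z.append(wl * a[-1] + w0l)
    a.append(f(z[-1]))

# Backward pass: start from dloss/da^L for squared loss, then work back layer by layer.
dloss_da = 2.0 * (a[-1] - y)
dloss_dw = [0.0] * len(w)
for l in reversed(range(len(w))):
    dloss_dz = fprime(z[l]) * dloss_da     # one step of Eq. 6.2: ∂loss/∂z^l
    dloss_dw[l] = a[l] * dloss_dz          # Eq. 6.1: ∂loss/∂w^l, with a[l] playing a^{l-1}
    dloss_da = w[l] * dloss_dz             # ∂loss/∂a^{l-1}, passed to the layer below
print(dloss_dw)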

6.5.2 The general case


Next we’re going to do everything that we did above, but this time we’ll allow any number
of inputs ml and outputs nl at every layer. First, we’ll tell you the results that correspond
to our derivations above. Then we’ll talk about why they make sense. And finally we’ll
derive them carefully.


OK, let’s start with the results! Again, below we’ll be using “loss” as an abbreviation
for L(NN(x; W), y). Then,
∂loss/∂W l = Al−1 (∂loss/∂Zl )T                                                                        (6.4)

(here ∂loss/∂W l has shape ml × nl , Al−1 has shape ml × 1, and (∂loss/∂Zl )T has shape 1 × nl )

∂loss/∂Zl = (∂Al /∂Zl ) · (∂Zl+1 /∂Al ) · · · (∂AL−1 /∂ZL−1 ) · (∂ZL /∂AL−1 ) · (∂AL /∂ZL ) · (∂loss/∂AL )    (6.5)
          = (∂Al /∂Zl ) · W l+1 · · · (∂AL−1 /∂ZL−1 ) · W L · (∂AL /∂ZL ) · (∂loss/∂AL ) .                    (6.6)
First, compare each equation to its one-dimensional counterpart, and make sure you
see the similarities. That is, compare the general weight derivatives in Eq. 6.4 to the one-
dimensional case in Eq. 6.1. Compare the intermediate derivative of loss with respect to the
pre-activations Zl in Eq. 6.5 to the one-dimensional case in Eq. 6.2. And finally compare the
version where we’ve substituted in some of the derivatives in Eq. 6.6 to Eq. 6.3. Hopefully
you see how the forms are very analogous. But in the matrix case, we now have to be
careful about the matrix dimensions. We’ll check these matrix dimensions below.
Let’s start by talking through each of the terms in the matrix version of these equations.
Recall that loss is a scalar, and W l is a matrix of size ml × nl . You can read about the
conventions in the course for derivatives starting in this chapter in Appendix A. By these
conventions (not the only possible conventions!), we have that ∂loss/∂W l will be a matrix
of size ml × nl whose (i, j) entry is the scalar ∂loss/∂Wi,jl . In some sense, we’re just doing
a bunch of traditional scalar derivatives, and the matrix notation lets us write them all
simultaneously and succinctly. In particular, for SGD, we need to find the derivative of the
loss with respect to every scalar component of the weights because these are our model’s
parameters and therefore are the things we want to update in SGD.
The next quantity we see in Eq. 6.4 is Al−1 , which we recall has size ml × 1 (or equiva-
lently nl−1 × 1 since it represents the outputs of the l − 1 layer). Finally, we see ∂loss/∂Zl .
Again, loss is a scalar, and Zl is a nl × 1 vector. So by the conventions in Appendix A, we
have that ∂loss/∂Zl has size nl × 1. The transpose then has size 1 × nl . Now you should
be able to check that the dimensions all make sense in Eq. 6.4; in particular, you can check
that inner dimensions agree in the matrix multiplication and that, after the multiplication,
we should be left with something that has the dimensions on the lefthand side.
Now let’s look at Eq. 6.6. We’re computing ∂loss/∂Zl so that we can use it in Eq. 6.4.
The weights are familiar. The one part that remains is terms of the form ∂Al /∂Zl . Checking
out Appendix A, we see that this term should be a matrix of size nl × nl since Al and Zl
both have size nl × 1. The (i, j) entry of this matrix is ∂Alj /∂Zli . This scalar derivative is
something that you can compute when you know your activation function. If you’re not
using a softmax activation function, Alj typically is a function only of Zlj , which means that
∂Alj /∂Zli should equal 0 whenever i ̸= j, and that ∂Alj /∂Zlj = (fl ) ′ (Zlj ).
Study Question: Compute the dimensions of every term in Eqs. 6.5 and 6.6 using
Appendix A. After you’ve done that, check that all the matrix multiplications work;
that is, check that the inner dimensions agree and that the lefthand side and right-
hand side of these equations have the same dimensions.

Study Question: If I use the identity activation function, what is ∂Alj /∂Zlj for any j?
What is the full matrix ∂Al /∂Zl ?


6.5.3 Derivations for the general case


You can use everything above without deriving it yourself. But if you want to find the
gradients of loss with respect to W0l (which we need for SGD!), then you’ll want to know
how to actually do these derivations. So next we’ll work out the derivations.
The key trick is to just break every equation down into its scalar meaning. For instance,
the (i, j) element of ∂loss/∂W l is ∂loss/∂Wi,jl
. If you think about it for a moment (and it
might help to go back to the one-dimensional case), the loss is a function of the elements
of Zl , and the elements of Zl are a function of the Wi,jl . There are nl elements of Zl , so we
can use the chain rule to write

∂loss/∂Wi,jl = Σ_{k=1}^{nl} (∂loss/∂Zlk ) (∂Zlk /∂Wi,jl ) .     (6.7)

To figure this out, let’s remember that Zl = (W l )⊤ Al−1 + W0l . We can write one element of the Zl vector, then, as Zlb = Σ_{a=1}^{ml} Wa,bl (Al−1 )a + (W0l )b . It follows that ∂Zlk /∂Wi,jl will be zero except when k = j (check you agree!). So we can rewrite Eq. 6.7 as

∂loss/∂Wi,jl = (∂loss/∂Zlj ) (∂Zlj /∂Wi,jl ) = (∂loss/∂Zlj ) (Al−1 )i .     (6.8)

Finally, then, we match entries of the matrices on both sides of the equation above to re-
cover Eq. 6.4.
Study Question: Check that Eq. 6.8 and Eq. 6.4 say the same thing.

Study Question: Convince yourself that ∂Zl /∂Al−1 = W l by comparing the entries
of the matrices on both sides of the equality sign.

Study Question: Convince yourself that Eq. 6.5 is true.

Study Question: Apply the same reasoning to find the gradients of loss with respect
to W0l .

6.5.4 Reflecting on backpropagation


This general process of computing the gradients of the loss with respect to the weights is called error back-propagation. The idea is that we first do a forward pass to compute all the a and z values at all the layers, and finally the actual loss. Then, we can work backward and compute the gradient of the loss with respect to the weights in each layer, starting at layer L and going back to layer 1. (We could call this “blame propagation”. Think of loss as how mad we are about the prediction just made. Then ∂loss/∂AL is how much we blame AL for the loss. The last module has to take in ∂loss/∂AL and compute ∂loss/∂ZL , which is how much we blame ZL for the loss. The next module (working backwards) takes ∂loss/∂ZL and computes ∂loss/∂AL−1 . So every module is accepting its blame for the loss, computing how much of it to allocate to each of its inputs, and passing the blame back to them.)

[Figure: the same layered network as before, with the input X = A0 flowing forward through W 1 , f1 , W 2 , f2 , . . . , W L , fL to produce AL , which together with y gives the Loss; below it, the gradients ∂loss/∂Z1 , ∂loss/∂A1 , ∂loss/∂Z2 , ∂loss/∂A2 , . . . , ∂loss/∂AL−1 , ∂loss/∂ZL , ∂loss/∂AL flow backward through the same chain.]

If we view our neural network as a sequential composition of modules (in our work so far, it has been an alternation between a linear transformation with a weight matrix, and a component-wise application of a non-linear activation function), then we can define a simple API for a module that will let us compute the forward and backward passes, as

well as do the necessary weight updates for gradient descent. Each module has to provide
the following “methods.” We are already using letters a, x, y, z with particular meanings,
so here we will use u as the vector input to the module and v as the vector output:

• forward: u → v

• backward: u, v, ∂L/∂v → ∂L/∂u

• weight grad: u, ∂L/∂v → ∂L/∂W only needed for modules that have weights W

In homework we will ask you to implement these modules for neural network components,
and then use them to construct a network and train it as described in the next section.
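To make the API concrete, here is a minimal numpy sketch (ours; the class names and design details are illustrative, not a required implementation). Rather than passing u and v into backward explicitly, each module here simply remembers what it needs from its own forward pass:

import numpy as np

class Linear:
    # forward: v = W^T u + W0, with W of shape (m, n) and W0 of shape (n, 1).
    def __init__(self, m, n, rng):
        self.W = rng.normal(0.0, 1.0 / m, (m, n))   # small random initialization (Section 6.6)
        self.W0 = np.zeros((n, 1))
    def forward(self, u):
        self.u = u                                  # remembered for the backward pass
        return self.W.T @ u + self.W0
    def backward(self, dLdv):
        self.dLdW = self.u @ dLdv.T                 # ∂L/∂W, as in Eq. 6.4
        self.dLdW0 = dLdv                           # ∂L/∂W0
        return self.W @ dLdv                        # ∂L/∂u, passed to the previous module
    def sgd_step(self, eta):
        self.W -= eta * self.dLdW
        self.W0 -= eta * self.dLdW0

class Tanh:
    # forward: v = tanh(u), element-wise.
    def forward(self, u):
        self.v = np.tanh(u)
        return self.v
    def backward(self, dLdv):
        return (1.0 - self.v ** 2) * dLdv           # element-wise, since ∂A/∂Z is diagonal

A network is then just a list of such modules: call forward on each in order, backward on each in reverse order, and sgd_step on the modules that have weights.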

6.6 Training
Here we go! Here’s how to do stochastic gradient descent training on a feed-forward neural
network. After this pseudo-code, we motivate the choice of initialization in lines 2 and 3.
The actual computation of the gradient values (e.g., ∂loss/∂AL ) is not directly defined in
this code, because we want to make the structure of the computation clear.
Study Question: What is ∂Zl /∂W l ?

Study Question: Which terms in the code below depend on fL ?

SGD-NEURAL-NET(Dn , T , L, (m1 , . . . , mL ), (f1 , . . . , fL ), Loss)

1    for l = 1 to L
2        Wijl ∼ Gaussian(0, 1/ml )
3        W0jl ∼ Gaussian(0, 1)
4    for t = 1 to T
5        i = random sample from {1, . . . , n}
6        A0 = x(i)
7        // forward pass to compute the output AL
8        for l = 1 to L
9            Zl = (W l )T Al−1 + W0l
10           Al = fl (Zl )
11       loss = Loss(AL , y(i) )
12       for l = L to 1:
13           // error back-propagation
14           ∂loss/∂Al = if l < L then ∂Zl+1 /∂Al · ∂loss/∂Zl+1 else ∂loss/∂AL
15           ∂loss/∂Zl = ∂Al /∂Zl · ∂loss/∂Al
16           // compute gradient with respect to weights
17           ∂loss/∂W l = Al−1 · (∂loss/∂Zl )T
18           ∂loss/∂W0l = ∂loss/∂Zl
19           // stochastic gradient descent update
20           W l = W l − η(t) · ∂loss/∂W l
21           W0l = W0l − η(t) · ∂loss/∂W0l

Initializing W is important; if you do it badly there is a good chance the neural net-
work training won’t work well. First, it is important to initialize the weights to random
values. We want different parts of the network to tend to “address” different aspects of
the problem; if they all start at the same weights, the symmetry will often keep the values
from moving in useful directions. Second, many of our activation functions have (near)


zero slope when the pre-activation z values have large magnitude, so we generally want to
keep the initial weights small so we will be in a situation where the gradients are non-zero,
so that gradient descent will have some useful signal about which way to go.
One good general-purpose strategy is to choose each weight at random from a Gaussian
(normal) distribution with mean 0 and standard deviation (1/m) where m is the number
of inputs to the unit.
Study Question: If the input x to this unit is a vector of 1’s, what would the ex-
pected pre-activation z value be with these initial weights?

We write this choice (where ∼ means “is drawn randomly from the distribution”) as Wijl ∼ Gaussian(0, 1/ml ). It will often turn out (especially for fancier activations and loss functions) that computing ∂loss/∂ZL is easier than computing ∂loss/∂AL and ∂AL /∂ZL . So, we may instead ask for an implementation of a loss function to provide a backward method that computes ∂loss/∂ZL directly.

6.7 Optimizing neural network parameters


Because neural networks are just parametric functions, we can optimize loss with respect to
the parameters using standard gradient-descent software, but we can take advantage of the
structure of the loss function and the hypothesis class to improve optimization. As we have
seen, the modular function-composition structure of a neural network hypothesis makes it
easy to organize the computation of the gradient. As we have also seen earlier, the structure
of the loss function as a sum over terms, one per training data point, allows us to consider
stochastic gradient methods. In this section we’ll consider some alternative strategies for
organizing training, and also for making it easier to handle the step-size parameter.

6.7.1 Batches
Assume that we have an objective of the form

J(W) = Σ_{i=1}^{n} L(h(x(i) ; W), y(i) ) ,

where h is the function computed by a neural network, and W stands for all the weight
matrices and vectors in the network.
Recall that, when we perform batch (or the vanilla) gradient descent, we use the update
rule
Wt = Wt−1 − η∇W J(Wt−1 ) ,
which is equivalent to

Wt = Wt−1 − η Σ_{i=1}^{n} ∇W L(h(x(i) ; Wt−1 ), y(i) ) .

So, we sum up the gradient of loss at each training point, with respect to W, and then take
a step in the negative direction of the gradient.
In stochastic gradient descent, we repeatedly pick a point (x(i) , y(i) ) at random from the
data set, and execute a weight update on that point alone:

Wt = Wt−1 − η∇W L(h(x(i) ; Wt−1 ), y(i) ) .


As long as we pick points uniformly at random from the data set, and decrease η at an
appropriate rate, we are guaranteed, with high probability, to converge to at least a local
optimum.
These two methods have offsetting virtues. The batch method takes steps in the exact
gradient direction but requires a lot of computation before even a single step can be taken,
especially if the data set is large. The stochastic method begins moving right away, and can
sometimes make very good progress before looking at even a substantial fraction of the
whole data set, but if there is a lot of variability in the data, it might require a very small η
to effectively average over the individual steps moving in “competing” directions.
An effective strategy is to “average” between batch and stochastic gradient descent by
using mini-batches. For a mini-batch of size K, we select K distinct data points uniformly
at random from the data set and do the update based just on their contributions to the
gradient:

Wt = Wt−1 − η Σ_{i=1}^{K} ∇W L(h(x(i) ; Wt−1 ), y(i) ) .

Most neural network software packages are set up to do mini-batches.


Study Question: For what value of K is mini-batch gradient descent equivalent to
stochastic gradient descent? To batch gradient descent?
Picking K unique data points at random from a large data-set is potentially computa-
tionally difficult. An alternative strategy, if you have an efficient procedure for randomly
shuffling the data set (or randomly shuffling a list of indices into the data set) is to operate
in a loop, roughly as follows:

MINI-BATCH-SGD(NN, data, K)

1    n = length(data)
2    while not done:
3        RANDOM-SHUFFLE(data)
4        for i = 1 to ⌈n/K⌉
5            BATCH-GRADIENT-UPDATE(NN, data[(i − 1)K : iK])

See the note on the ceiling function1 for the case when n/K is not an integer.

6.7.2 Adaptive step-size


Picking a value for η is difficult and time-consuming. If it’s too small, then convergence is
slow and if it’s too large, then we risk divergence or slow convergence due to oscillation.
This problem is even more pronounced in stochastic or mini-batch mode, because we know
we need to decrease the step size for the formal guarantees to hold.
It’s also true that, within a single neural network, we may well want to have differ-
ent step sizes. As our networks become deep (with increasing numbers of layers) we can
find that the magnitude of the gradient of the loss with respect to the weights in the last layer,
∂loss/∂WL , may be substantially different from the gradient of the loss with respect to the
weights in the first layer ∂loss/∂W1 . If you look carefully at Eq. 6.6, you can see that the
output gradient is multiplied by all the weight matrices of the network and is “fed back”
through all the derivatives of all the activation functions. This can lead to a problem of
exploding or vanishing gradients, in which the back-propagated gradient is much too big or
small to be used in an update rule with the same step size.
1 In line 4 of the algorithm above, ⌈·⌉ is known as the ceiling function; it returns the smallest integer greater than or equal to its input. E.g., ⌈2.5⌉ = 3 and ⌈3⌉ = 3.


So, we can consider having an independent step-size parameter for each weight, and updating it based on a local view of how the gradient updates have been going. Some common strategies for this include momentum (“averaging” recent gradient updates), Adadelta (take larger steps in parts of the space where J(W) is nearly flat), and Adam (which combines these two previous ideas). Details of these approaches are described in Appendix B.0.1. (This section is very strongly influenced by Sebastian Ruder’s excellent blog posts on the topic: ruder.io/optimizing-gradient-descent.)

6.8 Regularization
So far, we have only considered optimizing loss on the training data as our objective for
neural network training. But, as we have discussed before, there is a risk of overfitting if
we do this. The pragmatic fact is that, in current deep neural networks, which tend to be
very large and to be trained with a large amount of data, overfitting is not a huge problem.
This runs counter to our current theoretical understanding and the study of this question
is a hot area of research. Nonetheless, there are several strategies for regularizing a neural
network, and they can sometimes be important.

6.8.1 Methods related to ridge regression


One group of strategies can, interestingly, be shown to have similar effects to each other: early stopping, weight decay, and adding noise to the training data. (This result is due to Bishop, described in his textbook and here: doi.org/10.1162/neco.1995.7.1.108.)
Early stopping is the easiest to implement and is in fairly common use. The idea is to train on your training set, but at every epoch (a pass through the whole training set, or possibly more frequently), evaluate the loss of the current W on a validation set. It will generally be the case that the loss on the training set goes down fairly consistently with each iteration; the loss on the validation set will initially decrease, but then begin to increase again. Once you see that the validation loss is systematically increasing, you can stop training and return the weights that had the lowest validation error.
Another common strategy is to simply penalize the norm of all the weights, as we did in
ridge regression. This method is known as weight decay, because when we take the gradient
of the objective
J(W) = Σ_{i=1}^{n} L(NN(x(i) ), y(i) ; W) + λ∥W∥2

we end up with an update of the form

Wt = Wt−1 − η ( ∇W L(NN(x(i) ), y(i) ; Wt−1 ) + 2λWt−1 )
   = Wt−1 (1 − 2λη) − η ∇W L(NN(x(i) ), y(i) ; Wt−1 ) .

This rule has the form of first “decaying” Wt−1 by a factor of (1 − 2λη) and then taking a
gradient step.
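In code, the weight-decay update is one extra line; a minimal sketch (ours), with stand-in values for the weights and the gradient of the loss at the current point:

import numpy as np

lam, eta = 0.01, 0.1                          # hypothetical λ and step size
W = np.ones((3, 2))                           # stand-in weight matrix
grad = np.full((3, 2), 0.5)                   # stand-in ∇W L(NN(x), y; W)
W = W * (1 - 2 * lam * eta) - eta * grad      # "decay" W, then take a gradient step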
Finally, the same effect can be achieved by perturbing the x(i) values of the training data
by adding a small amount of zero-mean normally distributed noise before each gradient
computation. It makes intuitive sense that it would be more difficult for the network to
overfit to particular training data if they are changed slightly on each training step.

6.8.2 Dropout
Dropout is a regularization method that was designed to work with deep neural networks.
The idea behind it is, rather than perturbing the data every time we train, we’ll perturb the
network! We’ll do this by randomly, on each training step, selecting a set of units in each
layer and prohibiting them from participating. Thus, all of the units will have to take a


kind of “collective” responsibility for getting the answer right, and will not be able to rely
on any small subset of the weights to do all the necessary computation. This tends also to
make the network more robust to data perturbations.
During the training phase, for each training example, for each unit, randomly with
probability p temporarily set aℓj = 0. There will be no contribution to the output and no
gradient update for the associated unit.
Study Question: Be sure you understand why, when using SGD, setting an activa-
tion value to 0 will cause that unit’s weights not to be updated on that iteration.
When we are done training and want to use the network to make predictions, we multiply all weights by (1 − p) to achieve the same average activation levels, since each unit was dropped with probability p during training.
Implementing dropout is easy! In the forward pass during training, we let

aℓ = f(zℓ ) ∗ dℓ

where ∗ denotes component-wise product and dℓ is a vector whose entries are 0 with probability p and 1 otherwise. The backwards pass depends on aℓ , so we do not need to make any
further changes to the algorithm.
It is common to set p to 0.5, but this is something one might experiment with to get
good results on your problem and data.
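A minimal numpy sketch (ours) of the training-time forward pass with dropout, where p is the probability of zeroing a unit, as in the text:

import numpy as np

def dropout_forward(z, f, p, rng):
    a = f(z)                                           # ordinary activation
    d = (rng.random(a.shape) >= p).astype(float)       # 1.0 keeps a unit, 0.0 drops it
    return a * d                                       # component-wise product a * d

rng = np.random.default_rng(0)
z = np.array([[0.2], [-1.0], [0.7]])
print(dropout_forward(z, np.tanh, p=0.5, rng=rng))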

6.8.3 Batch normalization


Another strategy that seems to help with regularization and robustness in training is batch
normalization. It was originally developed to address a problem of covariate shift: that is, if you consider the second layer of a two-layer neural network, the distribution of its input values is changing over time as the first layer’s weights change. (For more details see arxiv.org/abs/1502.03167.)
distribution is changing is extra difficult: you have to change your weights to improve your
predictions, but also just to compensate for a change in your inputs (imagine, for instance,
that the magnitude of the inputs to your layer is increasing over time—then your weights
will have to decrease, just to keep your predictions the same).
So, when training with mini-batches, the idea is to standardize the input values for each
mini-batch, just in the way that we did it in Section 5.3.3 of Chapter 5, subtracting off the
mean and dividing by the standard deviation of each input dimension. This means that the
scale of the inputs to each layer remains the same, no matter how the weights in previous
layers change. However, this somewhat complicates matters, because the computation of
the weight updates will need to take into account that we are performing this transforma-
tion. In the modular view, batch normalization can be seen as a module that is applied to
zl , interposed after the product with W l and before input to fl . (We follow here the suggestion from the original paper of applying batch normalization before the activation function. Since then it has been shown that, in some cases, applying it after works a bit better. But there aren’t any definite findings on which works better and when.)
Although batch-norm was originally justified based on the problem of covariate shift, it’s not clear that that is actually why it seems to improve performance. Batch normalization can also end up having a regularizing effect for similar reasons that adding noise and dropout do: each mini-batch of data ends up being mildly perturbed, which prevents the network from exploiting very particular values of the data points. For those interested, the equations for batch normalization, including a derivation of the forward pass and backward pass, are described in Appendix B.0.2.



CHAPTER 7

Convolutional Neural Networks

So far, we have studied what are called fully connected neural networks, in which all of the
units at one layer are connected to all of the units in the next layer. This is a good arrange-
ment when we don’t know anything about what kind of mapping from inputs to outputs
we will be asking the network to learn to approximate. But if we do know something about
our problem, it is better to build it into the structure of our neural network. Doing so can
save computation time and significantly diminish the amount of training data required to
arrive at a solution that generalizes robustly.
One very important application domain of neural networks, where the methods have
achieved an enormous amount of success in recent years, is signal processing. Signals
might be spatial (in two-dimensional camera images or three-dimensional depth or CAT
scans) or temporal (speech or music). If we know that we are addressing a signal-processing
problem, we can take advantage of invariant properties of that problem. In this chapter, we
will focus on two-dimensional spatial problems (images) but use one-dimensional ones as
a simple example. In a later chapter, we will address temporal problems.
Imagine that you are given the problem of designing and training a neural network that
takes an image as input, and outputs a classification, which is positive if the image contains
a cat and negative if it does not. An image is described as a two-dimensional array of pixels (a pixel is a “picture element”), each of which may be represented by three integer values, encoding intensity levels in red, green, and blue color channels.
There are two important pieces of prior structural knowledge we can bring to bear on
this problem:

• Spatial locality: The set of pixels we will have to take into consideration to find a cat will be near one another in the image. (So, for example, we won’t have to consider some combination of pixels in the four corners of the image, in order to see if they encode cat-ness.)

• Translation invariance: The pattern of pixels that characterizes a cat is the same no matter where in the image the cat occurs. (Cats don’t look different if they’re on the left or the right side of the image.)

We will design neural network structures that take advantage of these properties.


7.1 Filters
We begin by discussing image filters. An image filter is a function that takes in a local spatial neighborhood of pixel values and detects the presence of some pattern in that data. (Unfortunately, in AI/ML/CS/Math, the word “filter” gets used in many ways: in addition to the one we describe here, it can describe a temporal process (in fact, our moving averages are a kind of filter) and even a somewhat esoteric algebraic structure.)
Let’s consider a very simple case to start, in which we have a 1-dimensional binary “image” and a filter F of size two. The filter is a vector of two numbers, which we will move along the image, taking the dot product between the filter values and the image values at each step, and aggregating the outputs to produce a new image.
Let X be the original image, of size d; then pixel i of the output image is specified by

Yi = F · (Xi−1 , Xi ) .
To ensure that the output image is also of dimension d, we will generally “pad” the input
image with 0 values if we need to access pixels that are beyond the bounds of the input
image. This process of applying the filter to the image to create a new image is called
“convolution.” (Filters are also sometimes called convolutional kernels.)
If you are already familiar with what a convolution is, you might notice that this definition corresponds to what is often called a correlation and not to a convolution. In-
deed, correlation and convolution refer to different operations in signal processing. How-
ever, in the neural networks literature, most libraries implement the correlation (as de-
scribed in this chapter) but call it convolution. The distinction is not significant; in prin-
ciple, if convolution is required to solve the problem, the network could learn the nec-
essary weights. For a discussion of the difference between convolution and correlation
and the conventions used in the literature you can read Section 9.1 in this excellent book:
https://www.deeplearningbook.org.
Here is a concrete example. Let the filter F1 = (−1, +1). Then given the image in the first
line below, we can convolve it with filter F1 to obtain the second image. You can think of
this filter as a detector for “left edges” in the original image—to see this, look at the places
where there is a 1 in the output image, and see what pattern exists at that position in the
input image. Another interesting filter is F2 = (−1, +1, −1). The third image (the last line
below) shows the result of convolving the first image with F2 , where we see that the output
pixel i corresponds to when the center of F2 is aligned at input pixel i.
Study Question: Convince yourself that filter F2 can be understood as a detector for
isolated positive pixels in the binary image.

Image: 0 0 1 1 1 0 1 0 0 0

F1 : -1 +1

After convolution (with F1 ): 0 0 1 0 0 -1 1 -1 0 0

F2 -1 +1 -1

After convolution (with F2 ): 0 -1 0 -1 0 -2 1 -1 0 0
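Here is a minimal numpy sketch (ours) of this 1-D filtering with zero padding; it reproduces both output rows above:

import numpy as np

def filter1d(X, F):
    # Zero-pad so the output has the same length as X; the padding is split so that the
    # output pixel i lines up with the filter exactly as in the worked example above.
    k = len(F)
    pad_left, pad_right = k // 2, (k - 1) // 2
    Xp = np.concatenate([np.zeros(pad_left), X, np.zeros(pad_right)])
    return np.array([np.dot(F, Xp[i:i + k]) for i in range(len(X))])

X = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0], dtype=float)
print(filter1d(X, np.array([-1.0, 1.0])))         # 0 0 1 0 0 -1 1 -1 0 0
print(filter1d(X, np.array([-1.0, 1.0, -1.0])))   # 0 -1 0 -1 0 -2 1 -1 0 0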


Two-dimensional versions of filters like these are thought to be found in the visual
cortex of all mammalian brains. Similar patterns arise from statistical analysis of natural
images. Computer vision people used to spend a lot of time hand-designing filter banks. A
filter bank is a set of sets of filters, arranged as shown in the diagram below.

[Figure: a filter bank. A first group of filters is applied to the original image; the resulting channels are stacked into a tensor, to which the next group of filters is applied.]

All of the filters in the first group are applied to the original image; if there are k such
filters, then the result is k new images, which are called channels. Now imagine stacking
all these new images up so that we have a cube of data, indexed by the original row and
column indices of the image, as well as by the channel. The next set of filters in the filter
bank will generally be three-dimensional: each one will be applied to a sub-range of the row
and column indices of the image and to all of the channels.
These 3D chunks of data are called tensors. The algebra of tensors is fun, and a lot like matrix algebra, but we won’t go into it in any detail. (There are now many useful neural-network software packages, such as TensorFlow and PyTorch, that make operations on tensors easy.)
Here is a more complex example of two-dimensional filtering. We have two 3 × 3 filters in the first layer, f1 and f2 . You can think of each one as “looking” for three pixels in a row, f1 vertically and f2 horizontally. Assuming our input image is n × n, then the result of filtering with these two filters is an n × n × 2 tensor. Now we apply a tensor filter (hard to draw!) that “looks for” a combination of two horizontal and two vertical bars (now represented by individual pixels in the two channels), resulting in a single final n × n image. (When we have a color image as input, we treat it as having three channels, and hence as an n × n × 3 tensor.)

[Figure: the two first-layer filters f1 and f2 applied to the image produce a two-channel tensor, to which a tensor filter is then applied.]

We are going to design neural networks that have this structure. Each “bank” of the
filter bank will correspond to a neural-network layer. The numbers in the individual filters
will be the “weights” (plus a single additive bias or offset value for each filter) of the net-
work, that we will train using gradient descent. What makes this interesting and powerful
(and somewhat confusing at first) is that the same weights are used many many times in
the computation of each layer. This weight sharing means that we can express a transforma-


tion on a large image with relatively few parameters; it also means we’ll have to take care
in figuring out exactly how to train it!
We will define a filter layer l formally with the following. (For simplicity, we are assuming that all images and filters are square, having the same number of rows and columns. That is in no way necessary, but is usually fine and definitely simplifies our notation.)

• number of filters ml ;

• size of one filter is kl × kl × ml−1 plus 1 bias value (for this one filter);

• stride sl is the spacing at which we apply the filter to the image; in all of our examples so far, we have used a stride of 1, but if we were to “skip” and apply the filter only at odd-numbered indices of the image, then it would have a stride of two (and produce a resulting image of half the size);

• input tensor size nl−1 × nl−1 × ml−1

• padding: pl is how many extra pixels – typically with value 0 – we add around the
edges of the input. For an input of size nl−1 × nl−1 × ml−1 , our new effective input
size with padding becomes (nl−1 + 2 · pl ) × (nl−1 + 2 · pl ) × ml−1 .

This layer will produce an output tensor of size nl × nl × ml , where nl = ⌈(nl−1 + 2 · pl − (kl − 1))/sl ⌉.1 The weights are the values defining the filter: there will be ml different kl ×
kl ×ml−1 tensors of weight values; plus each filter may have a bias term, which means there
is one more weight value per filter. A filter with a bias operates just like the filter examples
above, except we add the bias to the output. For instance, if we incorporated a bias term
of 0.5 into the filter F2 above, the output would be (−0.5, 0.5, −0.5, 0.5, −1.5, 1.5, −0.5, 0.5)
instead of (−1, 0, −1, 0, −2, 1, −1, 0).
This may seem complicated, but we get a rich class of mappings that exploit image
structure and have many fewer weights than a fully connected layer would.
Study Question: How many weights are in a convolutional layer specified as
above?

Study Question: If we used a fully-connected layer with the same size inputs and
outputs, how many weights would it have?

7.2 Max pooling


It is typical to structure filter banks into a pyramid (both in engineering and in nature), in which the image sizes get smaller in successive layers of processing. The idea is that we find local patterns, like bits of edges
in the early layers, and then look for patterns in those patterns, etc. This means that, ef-
fectively, we are looking for patterns in larger pieces of the image as we apply successive
filters. Having a stride greater than one makes the images smaller, but does not necessarily
aggregate information over that spatial range.
Another common layer type, which accomplishes this aggregation, is max pooling. A
max pooling layer operates like a filter, but has no weights. You can think of it as purely
functional, like a ReLU in a fully connected network. It has a filter size, as in a filter layer, but
simply returns the maximum value in its field. (We sometimes use the term receptive field, or just field, to mean the area of an input image that a filter is being applied to.) Usually, we apply max pooling with the following traits:

• stride > 1, so that the resulting image is smaller than the input image; and

• k ⩾ stride, so that the whole image is covered.

1 Recall that ⌈·⌉ is the ceiling function; it returns the smallest integer greater than or equal to its input. E.g., ⌈2.5⌉ = 3 and ⌈3⌉ = 3.


As a result of applying a max pooling layer, we don’t keep track of the precise location of a
pattern. This helps our filters to learn to recognize patterns independent of their location.
Consider a max pooling layer where both the strides and k are set to be 2. This would
map a 64 × 64 × 3 image to a 32 × 32 × 3 image. Note that max pooling layers do not have
additional bias or offset values.
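A minimal numpy sketch (ours) of non-overlapping max pooling (stride equal to k), applied independently to every channel:

import numpy as np

def max_pool(T, k):
    # T: (n, n, c) tensor; returns an (n//k, n//k, c) tensor of block-wise maxima.
    n, _, c = T.shape
    out = n // k
    return T[:out * k, :out * k, :].reshape(out, k, out, k, c).max(axis=(1, 3))

img = np.random.default_rng(0).random((64, 64, 3))
print(max_pool(img, 2).shape)   # (32, 32, 3)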
Study Question: Maximilian Poole thinks it would be a good idea to add two max
pooling layers of size k, one right after the other, to their network. What single layer
would be equivalent?
One potential concern about max-pooling layers is that they actually don’t completely
preserve translation invariance. If you do max-pooling with a stride other than 1 (or just
pool over the whole image size), then shifting the pattern you are hoping to detect within
the image by a small amount can change the output of the max-pooling layer substan-
tially, just because there are discontinuities induced by the way the max-pooling window
matches up with its input image. Here is an interesting paper that illustrates this phenomenon clearly and suggests that one should first do max-pooling with a stride of 1, then do “downsampling” by averaging over a window of outputs: https://arxiv.org/pdf/1904.11486.pdf.

7.3 Typical architecture


Here is the form of a typical convolutional network:

[Figure: a typical convolutional network, alternating convolution + ReLU layers and pooling layers, followed by fully-connected layers and a final activation such as softmax. The “depth” dimension in the layers shown as cuboids corresponds to the number of channels in the output tensor. (Figure source: https://www.mathworks.com/solutions/deep-learning/convolutional-neural-network.html)]

At the end of each filter layer, we typically apply a ReLU activation function. There
may be multiple filter plus ReLU layers. Then we have a max pooling layer. Then we have
some more filter + ReLU layers. Then we have max pooling again. Once the output is
down to a relatively small size, there is typically a last fully-connected layer, leading into
an activation function such as softmax that produces the final output. The exact design of
these structures is an art—there is not currently any clear theoretical (or even systematic
empirical) understanding of how these various design choices affect overall performance
of the network.
The critical point for us is that this is all just a big neural network, which takes an input and computes an output. The mapping is a differentiable function of the weights, which means we can adjust the weights to decrease the loss by performing gradient descent, and we can compute the relevant gradients using back-propagation! (Well, technically the derivative does not exist at every point, both because of the ReLU and the max pooling operations, but we ignore that fact.)

7.4 Backpropagation in a simple CNN


Let’s work through a very simple example of how back-propagation can work on a convo-
lutional network. The architecture is shown below. Assume we have a one-dimensional
single-channel image X of size n × 1 × 1, and a single filter W 1 of size k × 1 × 1 (where we
omit the filter bias) for the first convolutional operation denoted “conv” in the figure be-
low. Then we pass the intermediate result Z1 through a ReLU layer to obtain the activation
A1 , and finally through a fully-connected layer with weights W 2 , denoted “fc” below, with
no additional activation function, resulting in the output A2 .

[Figure: the simple network for this example. The input X = A0 is padded with 0’s (to get an output of the same shape), convolved with W 1 (“conv”) to produce Z1 , passed through a ReLU to produce A1 , and then through a fully-connected layer with weights W 2 (“fc”) to produce Z2 = A2 .]

For simplicity assume k is odd, let the input image X = A0 , and assume we are using
squared loss. Then we can describe the forward pass as follows:
Z1i = (W 1 )T A0[i−⌊k/2⌋:i+⌊k/2⌋]
A1 = ReLU(Z1 )
A2 = Z2 = (W 2 )T A1
Lsquare (A2 , y) = (A2 − y)2

Study Question: Assuming a stride of 1, for a filter of size k, how much padding
do we need to add to the top and bottom of the image? We see one zero at the top
and bottom in the figure just above; what filter size is implicitly being shown in the
figure? (Recall the padding is for the sake of getting an output the same size as the
input.)

7.4.1 Weight update


How do we update the weights in filter W 1 ?

∂loss/∂W 1 = (∂Z1 /∂W 1 ) · (∂A1 /∂Z1 ) · (∂loss/∂A1 )
• ∂Z1 /∂W 1 is the k × n matrix such that ∂Z1i /∂Wj1 = Xi−⌊k/2⌋+j−1 . So, for example, column 10 of this matrix illustrates the dependence of pixel 10 of the output image on the weights: if i = 10 and k = 5, then the elements in column 10 will be X8 , X9 , X10 , X11 , X12 .


• ∂A1 /∂Z1 is the n × n diagonal matrix such that ∂A1i /∂Z1i = 1 if Z1i > 0, and 0 otherwise

• ∂loss/∂A1 = (∂loss/∂A2 )(∂A2 /∂A1 ) = 2(A2 − y)W 2 , an n × 1 vector

Multiplying these components yields the desired gradient, of shape k × 1.

7.4.2 Max pooling


One last point is how to handle back-propagation through a max-pooling operation. Let’s
study this via a simple example. Imagine

y = max(a1 , a2 ) ,

where a1 and a2 are each computed by some network. Consider doing back-propagation
through the maximum. First consider the case where a1 > a2 . Then the error value at y
is propagated back entirely to the network computing the value a1 . The weights in the
network computing a1 will ultimately be adjusted, and the network computing a2 will be
untouched.
Study Question: What is ∇(x,y) max(x, y) ?



CHAPTER 8

Transformers

Transformers are a very recent family of architectures that have revolutionized fields like
natural language processing (NLP), image processing, and multi-modal generative AI.
Transformers were originally introduced in the field of NLP in 2017, as an approach
to process and understand human language. Human language is inherently sequential in
nature (e.g., characters form words, words form sentences, and sentences form paragraphs
and documents). Prior to the advent of the transformers architecture, recurrent neural net-
works (RNNs) briefly dominated the field for their ability to process sequential information
(RNNs are described in Appendix C for reference). However, RNNs, like many other ar-
chitectures, processed sequential information in an iterative/sequential fashion, whereby
each item of a sequence was individually processed one after another. Transformers offer
many advantages over RNNs, including their ability to process all items in a sequence in a
parallel fashion (as do CNNs).
Like CNNs, transformers factorize the signal processing problem into stages that in-
volve independent and identically processed chunks. However, they also include layers
that mix information across the chunks, called attention layers, so that the full pipeline can
model dependencies between the chunks.
In this chapter, we describe transformers from the bottom up. We start with the idea
of embeddings and tokens (Section 8.1). We then describe the attention mechanism (Sec-
tion 8.2). Finally, we assemble these ideas to arrive at the full transformer architecture in Section 8.3.

8.1 Vector embeddings and tokens


Before we can understand the attention mechanism in detail, we need to first introduce a
new data structure and a new way of thinking about neural processing for language.
The field of NLP aims to represent words with vectors (aka word embeddings) such that
they capture semantic meaning. More precisely, the degree to which any two words are
related in the ‘real-world’ to us humans should be reflected by their corresponding vectors
(in terms of their numeric values). So, words such as ‘dog’ and ‘cat’ should be represented
by vectors that are more similar to one another than, say, ‘cat’ and ‘table’ are. Nowadays,
it’s also typical for every individual occurrence of a word to have its own distinct repre-
sentation/vector. So, a story about a dog may mention the word ‘dog’ a dozen times, with

each vector being slightly different based on its context in the sentence and story at large.
To measure how similar any two word embeddings are (in terms of their numeric val-
ues) it is common to use cosine similarity as the metric:

u^T v / (|u| |v|) = cos < u, v > ,    (8.1)

where |u| and |v| are the lengths of the vectors, and < u, v > is the angle between u and v.
The cosine similarity is +1 when u = v, zero when the two vectors are perpendicular to
each other, and −1 when the two vectors are diametrically opposed to each other. Thus,
higher values correspond to vectors that are numerically more similar to each other.
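For instance, a direct numpy implementation of this similarity (our own sketch, with made-up toy embeddings) is:

import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between embedding vectors u and v, a value in [-1, 1]."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy, made-up 3-d embeddings, just to exercise the function.
v_dog = np.array([1.0, 0.9, 0.1])
v_cat = np.array([0.9, 1.0, 0.2])
v_table = np.array([0.1, 0.2, 1.0])
cosine_similarity(v_dog, v_cat) > cosine_similarity(v_cat, v_table)   # True for these toy vectors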
While word embeddings – and various approaches to create them – have existed for
decades, the first approach that produced astonishingly effective word embeddings was
word2vec in 2013. This revolutionary method was the first highly successful application of deep learning to NLP, and it enabled all subsequent progress in the field, includ-
ing Transformers. The details of word2vec are beyond the scope of this course, but we note
two facts: (1) it created a single word embedding for each distinct word in the training cor-
pus (not on a per-occurrence basis); (2) it produced word embeddings that were so useful,
many relationships between the vectors corresponded with real-world semantic related-
ness. For example, when using Euclidean distance as a distance metric between two vectors,
word2vec produced word embeddings with properties such as (where vword is the vector
for word):

v_paris − v_france + v_italy ≈ v_rome    (8.2)

This corresponds with the real-world property that Paris is to France what Rome is to
Italy. This incredible finding existed not only for geographic words but all sorts of real-
world concepts in the vocabulary. Nevertheless, to some extent, the exact values in each embedding are arbitrary; what matters most is the holistic relation between all embeddings, along with how performant/useful they are for the exact task that we care about.
For example, an embedding may be considered good if it accurately captures the con-
ditional probability for a given word to appear next in a sequence of words. You probably
have a good idea of what words might typically fill in the blank at the end of this sentence:

After the rain, the grass was ______

Or a model could be built that tries to correctly predict words in the middle of sentences:

The child fell ______ during the long car ride

The model can be built by minimizing a loss function that penalizes incorrect word guesses,
and rewards correct ones. This is done by training a model on a very large corpus of written
material, such as all of Wikipedia, or even all the accessible digitized written materials
produced by humans.
While we will not dive into the full details of tokenization, the high-level idea is straight-
forward: the individual inputs of data that are represented and processed by a model are
referred to as tokens. And, instead of processing each word as a whole, words are typically
split into smaller, meaningful pieces (akin to syllables). Thus, when we refer to tokens,
know that we’re referring to each individual input, and that in practice, nowadays, they
tend to be sub-words (e.g., the word ‘talked’ may be split into two tokens, ‘talk’ and ‘ed’).


8.2 Query, key, value, and attention


Attention is a strategy for processing global information efficiently, focusing just on the parts of the signal that are most salient to the task at hand. (What we present below is the so-called "dot-product attention" mechanism; there can be other variants that involve more complex attention functions.)
It might help our understanding of the "attention" mechanism to think about a dictionary look-up scenario. Consider a dictionary with keys k mapping to some values v(k). For example, let k be the name of some foods, such as pizza, apple, sandwich, donut, chili, burrito, sushi, hamburger, . . .. The corresponding values may be information about the food, such as where it is available, how much it costs, or what its ingredients are.
Suppose that instead of looking up foods by a specific name, we wanted to query by
cuisine, e.g., “mexican” foods. Clearly, we cannot simply look for the word “mexican”
among the dictionary keys, since that word is not a food. What does work is to utilize again
the idea of finding “similarity" between vector embeddings of the query and the keys. The
end result we’d hope to get, is a probability distribution over the foods, p(k|q) indicating
which are best matches for a given query q. With such a distribution, we can look for keys
that are semantically close to the given query.
More concretely, to get such a distribution, we follow these steps: First, embed the word we are interested in ("mexican" in our example) into a so-called query vector, denoted simply as q ∈ R^{d_k×1}, where d_k is the embedding dimension.
Next, suppose our given dictionary has n entries; we embed each one of these into a so-called key vector. In particular, for the jth entry in the dictionary, we produce a key vector k_j ∈ R^{d_k×1}, where j = 1, 2, 3, . . . , n.
We can then obtain the desired probability distribution using a softmax (see Chapter 6)
applied to the inner-product between the key and query:

p(k | q) = softmax([q^T k_1; q^T k_2; q^T k_3; . . . ; q^T k_n])

This vector-based lookup mechanism has come to be known as “attention” in the sense
that p(k|q) is a conditional probability distribution that says how much attention should
be given to the key kj for a given query q.
In other words, the conditional probability distribution p(k|q) gives the “attention weights,”
and the weighted average value
Σ_j p(k_j | q) v_j    (8.3)

is the “attention output.”


The meaning of this weighted average value may be ambiguous when the values are just words. However, the attention output really becomes meaningful when the values are projected into some semantic embedding space (and such projections are typically done in transformers via learned embedding weights).
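Here is a small numpy sketch of this dot-product lookup (our own illustration; the query, key, and value embeddings are random stand-ins for learned ones):

import numpy as np

def softmax(z):
    z = z - z.max()                        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d_k, n = 4, 6                              # embedding dimension and number of dictionary entries (assumptions)
rng = np.random.default_rng(0)
q = rng.normal(size=d_k)                   # query embedding, e.g., for "mexican"
K = rng.normal(size=(n, d_k))              # one key embedding per food
V = rng.normal(size=(n, d_k))              # one value embedding per food

attention_weights = softmax(K @ q)         # p(k_j | q); non-negative and sums to 1 over the n keys
attention_output = attention_weights @ V   # the weighted average of the values (Eq. 8.3)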
The same weighted-sum idea generalizes to multiple queries, keys, and values. In particular, suppose there are n_q queries and n_k keys (and therefore n_k values); one can compute an attention matrix

A = [ softmax_j([q_1^T k_1, q_1^T k_2, . . . , q_1^T k_{n_k}] / √d_k) ;
      softmax_j([q_2^T k_1, q_2^T k_2, . . . , q_2^T k_{n_k}] / √d_k) ;
      . . . ;
      softmax_j([q_{n_q}^T k_1, q_{n_q}^T k_2, . . . , q_{n_q}^T k_{n_k}] / √d_k) ]    (8.4)

Here, softmax_j is a softmax over the n_k-dimensional vector indexed by j, so in Eq. 8.4 this means a softmax computed over keys. In this equation, the normalization by √d_k is done to reduce the magnitude of the dot product, which would otherwise grow undesirably large with increasing d_k, making training difficult.
Let α_{ij} be the entry in the ith row and jth column of the attention matrix A. Then α_{ij} helps answer the question "which tokens x^(j) help the most with predicting the corresponding output token y^(i)?" The attention output is given by a weighted sum over the values:

y^(i) = Σ_{j=1}^{n} α_{ij} v_j

8.2.1 Self Attention


Self-attention is an attention mechanism where the keys, values, and queries are all gener-
ated from the same input.
At a very high level, a typical transformer with self-attention layers maps R^{n×d} → R^{n×d}. In particular, the transformer takes in data (a sequence of tokens) X ∈ R^{n×d} and, for each token x^(i) ∈ R^{d×1}, it computes (via learned projection, to be discussed in Section 8.3.1) a query q_i ∈ R^{d_q×1}, key k_i ∈ R^{d_k×1}, and value v_i ∈ R^{d_v×1}. In practice, d_q = d_k = d_v and we often denote all three embedding dimensions via a unified d_k. (Note that d_k differs from d: d is the dimension of a raw input token.)
The self-attention layer then takes in these queries, keys, and values, and computes a self-attention matrix

A = [ softmax([q_1^T k_1, q_1^T k_2, . . . , q_1^T k_n] / √d_k) ;
      softmax([q_2^T k_1, q_2^T k_2, . . . , q_2^T k_n] / √d_k) ;
      . . . ;
      softmax([q_n^T k_1, q_n^T k_2, . . . , q_n^T k_n] / √d_k) ]    (8.5)

Comparing this self-attention matrix with the attention matrix described in Equation 8.4, we notice that the only difference lies in the dimensions: since in self-attention the query, key, and value all come from the same input, we have n_q = n_k = n_v, and we often denote all three with a unified n.
The self-attention output is then given by a weighted sum over the values:

y^(i) = Σ_{j=1}^{n} α_{ij} v_j

This diagram below shows (only) the middle input token generating a query that is then
combined with the keys computed with all tokens to generate the attention weights via a
softmax. The output of the softmax is then combined with values computed from all to-
kens, to generate the attention output corresponding to the middle input token. Repeating
this for each input token then generates the output.


Study Question: We have five colored tokens in the diagram above (gray, blue, orange, green, red). Can you read off from the diagram the correspondence between the color and the input, query, key, value, and output?
Note that the size of the output is the same as the size of the input. Also, observe that there is no apparent notion of ordering of the input words in the depicted structure. Positional information can be added by encoding a number for each token (giving, say, the token's position relative to the start of the sequence) into the vector embedding of each token. And note that a given query need not pay attention to all other tokens in the input; in this example, the token used for the query is not used for a key or value.
More generally, a mask may be applied to limit which tokens are used in the attention
computation. For example, one common mask limits the attention computation to tokens
that occur previously in time to the one being used for the query. This prevents the atten-
tion mechanism from “looking ahead” in scenarios where the transformer is being used to
generate one token at a time.
Each self-attention stage is trained to have key, value, and query embeddings that lead
it to pay specific attention to some particular feature of the input. We generally want to
pay attention to many different kinds of features in the input; for example, in translation
one feature might be the verbs, and another might be objects or subjects. A transformer
utilizes multiple instances of self-attention, each known as an “attention head,” to allow
combinations of attention paid to many different features.

8.3 Transformers
A transformer is the composition of a number of transformer blocks, each of which has
multiple attention heads. At a very high-level, the goal of a transformer block is to output

a really rich, useful representation for each input token, all for the sake of performing well on whatever task the model is trained to learn.
Rather than depicting the transformer graphically, it is worth returning to the beauty of the underlying equations.1

8.3.1 Learned embedding


For simplicity, we assume the transformer internally uses self-attention. Full general atten-
tion layers work out similarly.
Formally, a transformer block is a parameterized function fθ that maps Rn×d → Rn×d ,
where the input data X ∈ Rn×d is often represented as a sequence of n tokens, with each
token x(i) ∈ Rd×1 .
Three projection matrices (weights) W_q, W_k, W_v are to be learned, such that, for each token x^(i) ∈ R^{d×1}, we produce three distinct vectors: a query vector q_i = W_q^T x^(i), a key vector k_i = W_k^T x^(i), and a value vector v_i = W_v^T x^(i). All three of these vectors are in R^{d_k×1}, and the learned weights W_q, W_k, W_v ∈ R^{d×d_k}.
If we stack these n query, key, and value vectors into matrix form, such that Q ∈ R^{n×d_k}, K ∈ R^{n×d_k}, and V ∈ R^{n×d_k}, then we can more compactly write out the learned transformation from the sequence of input tokens X:

Q = XWq
K = XWk
V = XWv

This Q, K, V triple can then be used to produce one (self-)attention-layer output; one such layer is called an "attention head."
One can have more than one attention head, in which case the queries, keys, and values are embedded via per-head encoding matrices:

Q^(h) = X W_{h,q}    (8.6)
K^(h) = X W_{h,k}    (8.7)
V^(h) = X W_{h,v}    (8.8)

where W_{h,q}, W_{h,k}, W_{h,v} ∈ R^{d×d_k}, d_k is the size of the key/query embedding space, and h ∈ {1, · · · , H} is an index over "attention heads." (For each attention head h, we learn one set of W_{h,q}, W_{h,k}, W_{h,v}.) We then perform a weighted sum over all the outputs for each head,

u'^(i) = Σ_{h=1}^{H} Σ_{j=1}^{n} W_{h,c}^T α_{ij}^{(h)} V_j^{(h)} ,    (8.9)

where W_{h,c} ∈ R^{d_k×d}, u'^(i) ∈ R^{d×1}, and the indices i ∈ {1, · · · , n} and j ∈ {1, · · · , n} are integer indices over tokens. (V_j^{(h)} is the d_k × 1 value embedding vector that corresponds to the input token x_j for attention head h.)
This is then standardized and combined with x^(i) using a LayerNorm function (defined below) to become

u^(i) = LayerNorm(x^(i) + u'^(i); γ_1, β_1)    (8.10)

with parameters γ_1, β_1 ∈ R^d.
1 The presentation here follows the notes by John Thickstun.


To get the final output, we follow the "intermediate output then layer norm" recipe again. In particular, we first get the transformer block output z'^(i) given by

z'^(i) = W_2^T ReLU(W_1^T u^(i))    (8.11)

with weights W_1 ∈ R^{d×m} and W_2 ∈ R^{m×d}. This is then standardized and combined with u^(i) to give the final output z^(i):

z^(i) = LayerNorm(u^(i) + z'^(i); γ_2, β_2) ,    (8.12)

with parameters γ_2, β_2 ∈ R^d. These vectors are then assembled (e.g., through parallel computation) to produce z ∈ R^{n×d}.
The LayerNorm function transforms a d-dimensional input z with parameters γ, β ∈ R^d into

LayerNorm(z; γ, β) = γ (z − µ_z)/σ_z + β ,    (8.13)

where µ_z is the mean and σ_z the standard deviation of z:

µ_z = (1/d) Σ_{i=1}^{d} z_i    (8.14)

σ_z = sqrt( (1/d) Σ_{i=1}^{d} (z_i − µ_z)^2 ) .    (8.15)

Layer normalization is done to improve convergence stability during training.


The model parameters comprise the weight matrices Wh,q , Wh,k , Wh,v , Wh,c , W1 , W2
and the LayerNorm parameters γ1 , γ2 , β1 , β2 . A transformer is the composition of L trans-
former blocks, each with its own parameters:

fθL ◦ · · · ◦ fθ2 ◦ fθ1 (x) ∈ Rn×d . (8.16)

The hyperparameters of this model are d, dk , m, H, and L.
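To make these equations concrete, here is a minimal numpy sketch (ours) of a single transformer block's forward pass with randomly initialized weights; it omits positional encoding, masking, and training, and the sizes are arbitrary assumptions:

import numpy as np

n, d, d_k, m, H = 5, 16, 8, 32, 2          # sequence length and hyperparameters (arbitrary choices)
rng = np.random.default_rng(0)

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def layer_norm(z, gamma, beta, eps=1e-5):
    return gamma * (z - z.mean()) / (z.std() + eps) + beta     # Eq. 8.13, applied per token

# Randomly initialized parameters for one block.
Wq = rng.normal(size=(H, d, d_k))
Wk = rng.normal(size=(H, d, d_k))
Wv = rng.normal(size=(H, d, d_k))
Wc = rng.normal(size=(H, d_k, d))
W1, W2 = rng.normal(size=(d, m)), rng.normal(size=(m, d))
gamma1, beta1, gamma2, beta2 = np.ones(d), np.zeros(d), np.ones(d), np.zeros(d)

X = rng.normal(size=(n, d))                # input: n tokens of dimension d

# Multi-head self-attention (Eqs. 8.5-8.9).
U_prime = np.zeros((n, d))
for h in range(H):
    Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]       # Eqs. 8.6-8.8
    A = softmax_rows(Q @ K.T / np.sqrt(d_k))        # n x n self-attention matrix
    U_prime += (A @ V) @ Wc[h]                      # Eq. 8.9

U = np.array([layer_norm(X[i] + U_prime[i], gamma1, beta1) for i in range(n)])     # Eq. 8.10

# Token-wise feed-forward part (Eqs. 8.11-8.12).
Z_prime = np.maximum(U @ W1, 0) @ W2
Z = np.array([layer_norm(U[i] + Z_prime[i], gamma2, beta2) for i in range(n)])     # output in R^{n x d}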

8.3.2 Variations and training


Many variants on this transformer structure exist. For example, the LayerNorm may be
moved to other stages of the neural network. Or a more sophisticated attention function
may be employed instead of the simple dot product used in Eq. 8.4. Transformers may
also be used in pairs, for example, one to process the input and a separate one to gen-
erate the output given the transformed input. Self-attention may also be replaced with
cross-attention, where some input data are used to generate queries and other input data
generate keys and values. Positional encoding and masking are also common, though they
are left implicit in the above equations for simplicity.
How are transformers trained? The number of parameters in θ can be very large; mod-
ern transformer models like GPT4 have tens of billions of parameters or more. A great deal
of data is thus necessary to train such models, else the models may simply overfit small
datasets.
Training large transformer models is thus generally done in two stages. A first “pre-
training” stage employs a very large dataset to train the model to extract patterns. This is
done with unsupervised (or self-supervised) learning and unlabelled data. For example,
the well-known BERT model was pre-trained using sentences with words masked. The

model was trained to predict the masked words. BERT was also trained on sequences of
sentences, where the model was trained to predict whether two sentences are likely to be
contextually close together or not. The pre-training stage is generally very expensive.
The second “fine-tuning” stage trains the model for a specific task, such as classification
or question answering. This training stage can be relatively inexpensive, but it generally
requires labeled data.



CHAPTER 9

Non-parametric methods

Neural networks have adaptable complexity, in the sense that we can try different struc-
tural models and use cross validation to find one that works well on our data. Beyond
neural networks, we may further broaden the class of models that we can fit to our data,
for example as illustrated by the techniques introduced in this chapter.
Here, we turn to models that automatically adapt their complexity to the training data.
The name non-parametric methods is misleading: it is really a class of methods that does not
have a fixed parameterization in advance. Rather, the complexity of the parameterization
can grow as we acquire more data.
Some non-parametric models, such as nearest-neighbor, rely directly on the data to
make predictions and do not compute a model that summarizes the data. Other non-
parametric methods, such as decision trees, can be seen as dynamically constructing something that ends up looking like a more traditional parametric model, but where the actual training data affects exactly what the form of the model will be. (These are sometimes called classification trees; the decision analysis literature uses "decision tree" for a structure that lays out possible future events that consist of choices interspersed with chance nodes.)
The non-parametric methods we consider here tend to have the form of a composition of simple models:

• Nearest neighbor models: (Section 9.1) where we don't process data at training time, but do all the work when making predictions, by looking for the closest training example(s) to a given new data point.
• Tree models: (Section 9.2) where we partition the input space and use different sim-
ple predictions on different regions of the space; the hypothesis space can become
arbitrarily large allowing finer and finer partitions of the input space.
• Ensemble models: (Section 9.2.3) in which we train several different classifiers on the
whole space and average the answers; this decreases the estimation error. In particu-
lar, we will look at bootstrap aggregation, or bagging of trees.
• Boosting is a way to construct a model composed of a sequence of component models
(e.g., a model consisting of a sequence of trees, each subsequent tree seeking to correct
errors in the previous trees) that decreases both estimation and structural error. We
won’t consider this in detail in this class.

Why are we studying these methods, in the heyday of complicated models such as
neural networks?


• They are fast to implement and have few or no hyperparameters to tune.


• They often work as well as or better than more complicated methods.
• Predictions from some of these models can be easier to explain to a human user:
decision trees are fairly directly human-interpretable, and nearest neighbor methods
can justify their decision to some extent by showing a few training examples that the
prediction was based on.

9.1 Nearest Neighbor


In nearest-neighbor models, we don’t do any processing of the data at training time – we
just remember it! All the work is done at prediction time.
Input values x can be from any domain X (Rd , documents, tree-structured objects, etc.).
We just need a distance metric, d : X × X → R+ , which satisfies the following, for all
x, x ′ , x ′′ ∈ X:

d(x, x) = 0
d(x, x ′ ) = d(x ′ , x)
d(x, x ′′ ) ⩽ d(x, x ′ ) + d(x ′ , x ′′ )

Given a data set D = {(x^(i), y^(i))}_{i=1}^{n}, our predictor for a new x ∈ X is

h(x) = y^(i) where i = arg min_i d(x, x^(i)) ,    (9.1)

that is, the predicted output associated with the training point that is closest to the query
point x. Tie breaking is typically done at random.
This same algorithm works for regression and classification!
The nearest neighbor prediction function can be described by dividing the space up into regions whose closest point is each individual training point, as shown below. (Decision boundary regions can also be described by Voronoi diagrams. In a Voronoi diagram, each of the data points would have its own "cell" or region in the space that is closest to the data point in question. In the diagram provided here, cells have been merged if the predicted value is the same in adjacent cells.)

[Figure: six labeled training points in a two-dimensional input space, with the space partitioned into the corresponding nearest-neighbor decision regions.]

In each region, we predict the associated y value.


Study Question: Convince yourself that these boundaries do represent the nearest-
neighbor classifier derived from these six data points.
There are several useful variations on this method. In k-nearest-neighbors, we find the
k training points nearest to the query point x and output the majority y value for classifi-
cation or the average for regression. We can also do locally weighted regression in which we

fit locally linear regression models to the k nearest points, possibly giving less weight to
those that are farther away. In large data-sets, it is important to use good data structures
(e.g., ball trees) to perform the nearest-neighbor look-ups efficiently (without looking at all
the data points each time).
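As a concrete sketch (ours, not the notes'), a brute-force k-nearest-neighbor predictor with Euclidean distance could look like the following; a serious implementation would replace the linear scan with a data structure such as a ball tree:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=1, classify=True):
    """Brute-force k-NN with Euclidean distance (ties broken by index order)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    if classify:
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]   # majority vote
    return y_train[nearest].mean()                                       # average, for regression

# Tiny made-up data set.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
knn_predict(X_train, y_train, np.array([0.9, 1.2]), k=3)                 # -> 0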

9.2 Tree Models


The idea here is that we would like to find a partition of the input space and then fit very
simple models to predict the output in each piece. The partition is described using a (typi-
cally binary) “tree” that recursively splits the space.
Tree methods differ by:

• The class of possible ways to split the space at each node; these are typically linear
splits, either aligned with the axes of the space, or sometimes using more general
classifiers.

• The class of predictors within the partitions; these are often simply constants, but
may be more general classification or regression models.

• The way in which we control the complexity of the hypothesis: it would be within
the capacity of these methods to have a separate partition element for each individual
training example.

• The algorithm for making the partitions and fitting the models.

One advantage of tree models is that they are easily interpretable by humans. This is
important in application domains, such as medicine, where there are human experts who
often ultimately make critical decisions and who need to feel confident in their under-
standing of recommendations made by an algorithm. Below is an example decision tree,
illustrating how one might be able to understand the decisions made by the tree.

Example: Here is a sample tree (reproduced from Breiman, Friedman, Olshen, Stone
(1984)):

Minimum systolic blood pressure over 24h period ⩾ 91?
    no:  high risk
    yes: Age ⩾ 65?
        no:  low risk
        yes: Is sinus tachycardia present?
            no:  low risk
            yes: high risk


These methods are most appropriate for domains where the input space is not very
high-dimensional and where the individual input features have some substantially useful
information individually or in small groups. Trees would not be good for image input,
but might be good in cases with, for example, a set of meaningful measurements of the
condition of a patient in the hospital, as in the example above.
We’ll concentrate on the CART/ID3 (“classification and regression trees” and “iterative
dichotomizer 3”, respectively) family of algorithms, which were invented independently
in the statistics and the artificial intelligence communities. They work by greedily con-
structing a partition, where the splits are axis aligned and by fitting a constant model in the
leaves. The interesting questions are how to select the splits and how to control complexity.
The regression and classification versions are very similar.
As a concrete example, consider the following images:

The left image depicts a set of labeled data points in a two-dimensional feature space. The
right shows a partition into regions by a decision tree, in this case having no classification
errors in the final partitions.

9.2.1 Regression
The predictor is made up of

• a partition function, π, mapping elements of the input space into exactly one of M
regions, R1 , . . . , RM , and

• a collection of M output values, Om , one for each region.

If we already knew a division of the space into regions, we would set O_m, the constant output for region R_m, to be the average of the training output values in that region. For a training data set D = {(x^(i), y^(i))}, i = 1, . . . , n, we let I be an indicator set of all of the elements within D, so that I = {1, . . . , n} for our whole data set. We can define I_m as the subset of data set samples that are in region R_m, so that I_m = {i | x^(i) ∈ R_m}. Then

O_m = average_{i∈I_m} y^(i) .

We can define the error in a region as E_m. For example, E_m as the sum of squared error would be expressed as

E_m = Σ_{i∈I_m} (y^(i) − O_m)^2 .    (9.2)


Ideally, we would select the partition to minimize

λM + Σ_{m=1}^{M} E_m ,    (9.3)

for some regularization constant λ. It is enough to search over all partitions of the training data (not all partitions of the input space!) to optimize this, but the problem is NP-complete.
Study Question: Be sure you understand why it’s enough to consider all partitions
of the training data, if this is your objective.

9.2.1.1 Building a tree


So, we’ll be greedy. We establish a criterion, given a set of data, for finding the best single
split of that data, and then apply it recursively to partition the space. For the discussion
below, we will select the partition of the data that minimizes the sum of the sum of squared errors of each partition element. Then later, we will consider other splitting criteria.
Given a data set D = {(x^(i), y^(i))}, i = 1, . . . , n, we now consider I to be an indicator of the subset of elements within D that we wish to build a tree (or subtree) for. That is, I may already indicate a subset of data set D, based on prior splits in constructing our overall tree. We define terms as follows:

• I^+_{j,s} indicates the set of examples (subset of I) whose feature value in dimension j is greater than or equal to split point s;

• I^−_{j,s} indicates the set of examples (subset of I) whose feature value in dimension j is less than s;

• ŷ^+_{j,s} is the average y value of the data points indicated by set I^+_{j,s}; and

• ŷ^−_{j,s} is the average y value of the data points indicated by set I^−_{j,s}.

Here is the pseudocode. In what follows, k is the largest leaf size that we will allow in
the tree, and is a hyperparameter of the algorithm.

BUILDTREE(I, k)
1   if |I| ⩽ k
2       Set ŷ = average_{i∈I} y^(i)
3       return LEAF(value = ŷ)
4   else
5       for each split dimension j and split value s
6           Set I^+_{j,s} = {i ∈ I | x^(i)_j ⩾ s}
7           Set I^−_{j,s} = {i ∈ I | x^(i)_j < s}
8           Set ŷ^+_{j,s} = average_{i∈I^+_{j,s}} y^(i)
9           Set ŷ^−_{j,s} = average_{i∈I^−_{j,s}} y^(i)
10          Set E_{j,s} = Σ_{i∈I^+_{j,s}} (y^(i) − ŷ^+_{j,s})^2 + Σ_{i∈I^−_{j,s}} (y^(i) − ŷ^−_{j,s})^2
11      Set (j*, s*) = arg min_{j,s} E_{j,s}
12      return NODE(j*, s*, BUILDTREE(I^−_{j*,s*}, k), BUILDTREE(I^+_{j*,s*}, k))

In practice, we typically start by calling B UILD T REE with the first input equal to our
whole data set (that is, with I = {1, . . . , n}). But then that call of B UILD T REE can recursively
lead to many other calls of B UILD T REE.
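For concreteness, here is a deliberately unoptimized Python sketch of BUILDTREE (ours); it considers a split at every observed feature value, which is wasteful but simple (the study question below asks what a better set of split points would be). Calling build_tree(X, y, np.arange(len(y)), k) corresponds to the top-level call described above.

import numpy as np

def build_tree(X, y, I, k):
    """Greedy regression-tree construction over the index set I (a sketch mirroring BUILDTREE)."""
    I = np.asarray(I)
    if len(I) <= k:
        return ("leaf", y[I].mean())
    best = None
    for j in range(X.shape[1]):                  # each split dimension
        for s in np.unique(X[I, j]):             # candidate split values
            I_plus, I_minus = I[X[I, j] >= s], I[X[I, j] < s]
            if len(I_plus) == 0 or len(I_minus) == 0:
                continue                         # skip degenerate splits
            err = ((y[I_plus] - y[I_plus].mean()) ** 2).sum() + \
                  ((y[I_minus] - y[I_minus].mean()) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, s, I_minus, I_plus)
    if best is None:                             # no valid split: all points identical
        return ("leaf", y[I].mean())
    _, j, s, I_minus, I_plus = best
    return ("node", j, s, build_tree(X, y, I_minus, k), build_tree(X, y, I_plus, k))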


Let’s think about how long each call of B UILD T REE takes to run. We have to consider
all possible splits. So we consider a split in each of the d dimensions. In each dimension,
we only need to consider splits between two data points (any other split will give the same
error on the training data). So, in total, we consider O(dn) splits in each call to B UILD T REE.
Study Question: Concretely, what would be a good set of split-points to consider for
dimension j of a data set indicated by I?

9.2.1.2 Pruning
It might be tempting to regularize by using a somewhat large value of k, or by stopping
when splitting a node does not significantly decrease the error. One problem with short-
sighted stopping criteria is that they might not see the value of a split that will require one
more split before it seems useful.
Study Question: Apply the decision-tree algorithm to the XOR problem in two di-
mensions. What is the training-set error of all possible hypotheses based on a single
split?
So, we will tend to build a tree that is too large, and then prune it back.
We define cost complexity of a tree T , where m ranges over its leaves, as
C_α(T) = Σ_{m=1}^{|T|} E_m(T) + α|T| ,    (9.4)

and |T | is the number of leaves. For a fixed α, we can find a T that (approximately) mini-
mizes Cα (T ) by “weakest-link” pruning:
• Create a sequence of trees by successively removing the bottom-level split that mini-
mizes the increase in overall error, until the root is reached.
• Return the T in the sequence that minimizes the cost complexity.
We can choose an appropriate α using cross validation.

9.2.2 Classification
The strategy for building and pruning classification trees is very similar to the strategy for
regression trees.
Given a region Rm corresponding to a leaf of the tree, we would pick the output class
y to be the value that exists most frequently (the majority value) in the data points whose x
values are in that region, i.e., data points indicated by Im :
Om = majorityi∈Im y(i) .
Let’s now define the error in a region as the number of data points that do not have the
value Om :
E_m = |{i | i ∈ I_m and y^(i) ≠ O_m}| .
We define the empirical probability of an item from class k occurring in region m as:

P̂_{m,k} = P̂(I_m, k) = |{i | i ∈ I_m and y^(i) = k}| / N_m ,
where Nm is the number of training points in region m; that is, Nm = |Im |. For later use,
we’ll also define the empirical probabilities of split values, P̂m,j,s , as the fraction of points
with dimension j in split s occurring in region m (one branch of the tree), and 1 − P̂m,j,s as
the complement (the fraction of points in the other branch).


Splitting criteria In our greedy algorithm, we need a way to decide which split to make
next. There are many criteria that express some measure of the “impurity” in child nodes.
Some measures include:
• Misclassification error:

Q_m(T) = E_m / N_m = 1 − P̂_{m,O_m}    (9.5)

• Gini index:

Q_m(T) = Σ_k P̂_{m,k} (1 − P̂_{m,k})    (9.6)

• Entropy:

Q_m(T) = H(I_m) = − Σ_k P̂_{m,k} log_2 P̂_{m,k}    (9.7)

So that the entropy H is well-defined when P̂ = 0, we will stipulate that 0 log_2 0 = 0.


These splitting criteria are very similar, and it’s not entirely obvious which one is better.
We will focus on entropy, just to be concrete.
Analogous to how for regression we choose the dimension j and split s that minimizes
the sum of squared error Ej,s , for classification, we choose the dimension j and split s
that minimizes the weighted average entropy over the “child” data points in each of the
two corresponding splits, I^+_{j,s} and I^−_{j,s}. We calculate the entropy in each split based on the empirical probabilities of class memberships in the split, and then calculate the weighted average entropy Ĥ as

Ĥ = (fraction of points in left data set) · H(I^−_{j,s}) + (fraction of points in right data set) · H(I^+_{j,s})
  = (1 − P̂_{m,j,s}) H(I^−_{j,s}) + P̂_{m,j,s} H(I^+_{j,s})
  = (|I^−_{j,s}| / N_m) · H(I^−_{j,s}) + (|I^+_{j,s}| / N_m) · H(I^+_{j,s}) .    (9.8)
Choosing the split that minimizes the entropy of the children is equivalent to maximizing the information gain of the test x_j = s, defined by

INFOGAIN(x_j = s, I_m) = H(I_m) − ( (|I^−_{j,s}| / N_m) · H(I^−_{j,s}) + (|I^+_{j,s}| / N_m) · H(I^+_{j,s}) )    (9.9)
In the two-class case (with labels 0 and 1), all of the splitting criteria mentioned above take the value 0.0 both when P̂_{m,0} = 0.0 and when P̂_{m,0} = 1.0.
The respective impurity curves are shown below, where p = P̂m,0 ; the vertical axis plots
Qm (T ) for each of the three criteria.


There used to be endless haggling about which impurity function one should use. It seems
to be traditional to use entropy to select which node to split while growing the tree, and
misclassification error in the pruning criterion.

9.2.3 Bagging
One important limitation or drawback of conventional trees is that they can have high estimation error: small changes in the data can result in very big changes in the resulting tree.
Bootstrap aggregation is a technique for reducing the estimation error of a non-linear predictor, or one that is adaptive to the data. The key idea, applied to trees, is to build multiple trees with different subsets of the data, and then create an ensemble model that combines the results from multiple trees to make a prediction.

• Construct B new data sets of size n. Each data set is constructed by sampling n data points with replacement from D. A single data set is called a bootstrap sample of D.
• Train a predictor f̂b (x) on each bootstrap sample.
• Regression case: the bagged predictor is

f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂^b(x) .    (9.10)

• Classification case: Let K be the number of classes. We find a majority bagged predictor as follows. We let f̂^b(x) be a "one-hot" vector with a single 1 and K − 1 zeros, and define the predicted output ŷ for predictor f^b as ŷ^b(x) = arg max_k f̂^b(x)_k. Then

f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂^b(x) ,    (9.11)

which is a vector containing the proportion of classifiers that predicted each class k for input x. Then the overall predicted output is

ŷ_bag(x) = arg max_k f̂_bag(x)_k .    (9.12)

There are theoretical arguments showing that bagging does, in fact, reduce estimation error. However, when we bag a model, any simple interpretability is lost.
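Schematically (our own sketch, with a generic train routine standing in for tree fitting), bagging for regression looks like this; for classification one would instead take a majority vote over the B predictions (Eqs. 9.11-9.12):

import numpy as np

def bag(X, y, train, B, seed=0):
    """Train B predictors on bootstrap samples; train(X, y) must return a prediction function."""
    rng = np.random.default_rng(seed)
    n = len(y)
    predictors = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # sample n data points with replacement
        predictors.append(train(X[idx], y[idx]))
    return lambda x: np.mean([f(x) for f in predictors])   # bagged predictor, Eq. 9.10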

9.2.4 Random Forests


Random forests are collections of trees that are constructed to be de-correlated, so that
using them to vote gives maximal advantage. In competitions, they often have excellent
classification performance among large collections of much fancier methods.
In what follows, B, m, and n are hyperparameters of the algorithm.

RANDOMFOREST(D; B, m, n)
1   for b = 1, . . . , B
2       Draw a bootstrap sample D_b of size n from D
3       Grow a tree T_b on data D_b by recursively:
4           Select m variables at random from the d variables
5           Pick the best variable and split point among the m variables
6           Split the node
7   return the trees T_b, b = 1, . . . , B

Given the ensemble of trees, vote to make a prediction on a new x.


9.2.5 Tree variants and tradeoffs


There are many variations on the tree theme. One is to employ different regression or
classification methods in each leaf. For example, a linear regression might be used to model
the examples in each leaf, rather than using a constant value.
In the relatively simple trees that we’ve considered, splits have been based on only a
single feature at a time, and with the resulting splits being axis-parallel. Other methods
for splitting are possible, including consideration of multiple features and linear classifiers
based on those, potentially resulting in non-axis-parallel splits. Complexity is a concern in
such cases, as many possible combinations of features may need to be considered, to select
the best variable combination (rather than a single split variable).
Another generalization is a hierarchical mixture of experts, where we make a “soft” ver-
sion of trees, in which the splits are probabilistic (so every point has some degree of mem-
bership in every leaf). Such trees can be trained using a form of gradient descent. Combi-
nations of bagging, boosting, and mixture tree approaches (e.g., gradient boosted trees) and
implementations are readily available (e.g., XGBoost).
Trees have a number of strengths, and remain a valuable tool in the machine learning
toolkit. Some benefits include being relatively easy to interpret, fast to train, and able to handle multi-class classification in a natural way. Trees can easily handle different loss
functions; one just needs to change the predictor and loss being applied in the leaves. Meth-
ods also exist to identify which features are particularly important or influential in forming
the tree, which can aid in human understanding of the data set. Finally, in many situations,
trees perform surprisingly well, often comparable to more complicated regression or clas-
sification models. Indeed, in some settings it is considered good practice to start with trees
(especially random forest or boosted trees) as a “baseline” machine learning model, against
which one can evaluate performance of more sophisticated models.



CHAPTER 10

Markov Decision Processes

So far, most of the learning problems we have looked at have been supervised, that is, for each training input x^(i), we are told which value y^(i) should be the output. From a traditional machine-learning viewpoint, there are two other major groups of learning problems. One is the group of unsupervised learning problems, in which we are given data and no expected outputs, and which we will look at later in Chapter 12.1 and Chapter 12.2.
The other major type is the so-called reinforcement learning (RL) problems. Reinforcement learning differs significantly from supervised learning problems, and we will delve into the details later in Chapter 11. However, it's worth pointing out one major dif-
ference at a very high level: in supervised learning, our goal is to learn a one-time static
mapping to make predictions, whereas in RL, the setup requires us to sequentially take
actions to maximize cumulative rewards.
This setup change necessitates additional mathematical and algorithmic tools for us to
understand RL. Markov decision process (MDP) is precisely such a classical and fundamental
tool.

10.1 Definition and value functions


Formally, a Markov decision process is ⟨S, A, T , R, γ⟩ where S is the state space, A is the
action space, and:

• T : S × A × S → R is a transition model, where

T (s, a, s ′ ) = Pr(St = s ′ |St−1 = s, At−1 = a) ,

specifying a conditional probability distribution; (The notation S_t = s′ uses a capital letter S to stand for a random variable, and a small letter s to stand for a concrete value. So S_t here is a random variable that can take on elements of S as values.)

• R : S × A → R is a reward function, where R(s, a) specifies an immediate reward for taking action a when in state s; and

• γ ∈ [0, 1] is a discount factor, which we'll discuss in Section 10.1.2.

In this class, we assume the rewards are deterministic functions. Further, in this MDP chapter, we assume the state space and action space are finite (in fact, typically small).


The following description of a simple machine as a Markov decision process provides a concrete example of an MDP. The machine has three possible operations (actions):
“wash”, “paint”, and “eject” (each with a corresponding button). Objects are put
into the machine. Each time you push a button, something is done to the object.
However, it’s an old machine, so it’s not very reliable. The machine has a camera
inside that can clearly detect what is going on with the object and will output the
state of the object: “dirty”, “clean”, “painted”, or “ejected”. For each action, this is
what is done to the object:
Wash:

• If you perform the “wash” operation on any object, whether it’s dirty, clean,
or painted, it will end up “clean” with probability 0.9 and “dirty” otherwise.

Paint:

• If you perform the “paint” operation on a clean object, it will become nicely
“painted” with probability 0.8. With probability 0.1, the paint misses but the
object stays clean, and also with probability 0.1, the machine dumps rusty
dust all over the object and it becomes “dirty”.

• If you perform the “paint” operation on a “painted” object, it stays “painted”


with probability 1.0.

• If you perform the “paint” operation on a “dirty” part, it stays “dirty” with
probability 1.0.

Eject:

• If you perform an “eject” operation on any part, the part comes out of the
machine and this fun game is over. The part remains "ejected" regardless of
any further action.

These descriptions specify the transition model T , and the transition function for
each action can be depicted as a state machine diagram. For example, here is the
diagram for “wash”:

[State-machine diagram for "wash": from each of the states dirty, clean, and painted, the object transitions to clean with probability 0.9 and to dirty with probability 0.1; from ejected, it stays ejected with probability 1.0.]

You get reward +10 for ejecting a painted object, reward 0 for ejecting a non-painted
object, reward 0 for any action on an "ejected" object, and reward -3 otherwise. The
MDP description would be completed by also specifying a discount factor.
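To make the example concrete, the transition model and rewards can be written down directly; the dictionary encoding below is our own choice of representation (a sketch), not something prescribed by the notes.

# T[s][a] maps next states s' to probabilities; R[(s, a)] is the immediate reward.
states = ["dirty", "clean", "painted", "ejected"]
actions = ["wash", "paint", "eject"]

T = {s: {} for s in states}
for s in ["dirty", "clean", "painted"]:
    T[s]["wash"] = {"clean": 0.9, "dirty": 0.1}
    T[s]["eject"] = {"ejected": 1.0}
T["dirty"]["paint"] = {"dirty": 1.0}
T["clean"]["paint"] = {"painted": 0.8, "clean": 0.1, "dirty": 0.1}
T["painted"]["paint"] = {"painted": 1.0}
for a in actions:
    T["ejected"][a] = {"ejected": 1.0}

R = {(s, a): -3 for s in states for a in actions}    # -3 otherwise
R[("painted", "eject")] = +10                        # +10 for ejecting a painted object
R[("dirty", "eject")] = 0                            # 0 for ejecting a non-painted object
R[("clean", "eject")] = 0
for a in actions:
    R[("ejected", a)] = 0                            # 0 for any action on an ejected object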


A policy is a function π : S → A that specifies what action to take in each state. The
policy is what we will want to learn; it is akin to the strategy that a player employs to
win a given game. Below, we take just the initial steps towards this eventual goal. We
describe how to evaluate how good a policy is, first in the finite horizon case (Section 10.1.1)
when the total number of transition steps is finite. Then we consider the infinite horizon
case (Section 10.1.2), when you don’t know when the game will be over.

10.1.1 Finite-horizon value functions


The goal of a policy is to maximize the expected total reward, averaged over the stochastic transitions that the domain makes. Let's first consider the case where there is a finite horizon H, indicating the total number of steps of interaction that the agent will have with the MDP. (In the finite-horizon case, we usually set the discount factor γ to 1.)
We seek to measure the goodness of a policy. We do so by defining, for a given MDP policy π and horizon h, the "horizon h value" of a state, V^h_π(s). We do this by induction on the horizon, which is the number of steps left to go.
The base case is when there are no steps remaining, in which case, no matter what state
we’re in, the value is 0, so
V^0_π(s) = 0 .    (10.1)
Then, the value of a policy in state s at horizon h + 1 is equal to the reward it will get in
state s plus the next state’s expected horizon h value, discounted by a factor γ. So, starting
with horizons 1 and 2, and then moving to the general case, we have:

V^1_π(s) = R(s, π(s)) + 0    (10.2)
V^2_π(s) = R(s, π(s)) + γ Σ_{s′} T(s, π(s), s′) V^1_π(s′)    (10.3)
...
V^h_π(s) = R(s, π(s)) + γ Σ_{s′} T(s, π(s), s′) V^{h−1}_π(s′)    (10.4)

The sum over s ′ is an expectation: it considers all possible next states s ′ , and computes
an average of their (h − 1)-horizon values, weighted by the probability that the transition
function from state s with the action chosen by the policy π(s) assigns to arriving in state
s ′ , and discounted by γ.
Study Question: What is Σ_{s′} T(s, a, s′) for any particular s and a?

Study Question: Convince yourself that Eqs. 10.2 and 10.3 are special cases of Eq. 10.4.
Then we can say that a policy π_1 is better than policy π_2 for horizon h, i.e., π_1 >_h π_2, if and only if for all s ∈ S, V^h_{π_1}(s) ⩾ V^h_{π_2}(s) and there exists at least one s ∈ S such that V^h_{π_1}(s) > V^h_{π_2}(s).
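This recurrence translates directly into code. Here is a sketch (ours) that evaluates V^h_π for a policy given as a dictionary, assuming the same dictionary encoding of T and R used in the machine-example sketch of Section 10.1; for example, finite_horizon_value(T, R, {s: 'wash' for s in T}, 1.0, 3) would evaluate the all-wash policy at horizon 3.

def finite_horizon_value(T, R, pi, gamma, h):
    """V^h_pi for every state, computed by the recurrence of Eq. 10.4.
    T[s][a] maps next states to probabilities, R[(s, a)] is the reward, pi[s] is the action."""
    V = {s: 0.0 for s in T}                                   # V^0 = 0 (Eq. 10.1)
    for _ in range(h):
        V = {s: R[(s, pi[s])] +
                gamma * sum(p * V[sp] for sp, p in T[s][pi[s]].items())
             for s in T}
    return V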

10.1.2 Infinite-horizon value functions


More typically, the actual finite horizon is not known, i.e., when you don’t know when the
game will be over! This is called the infinite horizon version of the problem. How does one
evaluate the goodness of a policy in the infinite horizon case?
If we tried to simply take our definitions above and use them for an infinite horizon,
we could get in trouble. Imagine we get a reward of 1 at each step under one policy and a
reward of 2 at each step under a different policy. Then the reward as the number of steps
grows in each case keeps growing to become infinite in the limit of more and more steps.


Even though it seems intuitive that the second policy should be better, we can’t justify that
by saying ∞ < ∞.
One standard approach to deal with this problem is to consider the discounted infinite
horizon. We will generalize from the finite-horizon case by adding a discount factor.
In the finite-horizon case, we valued a policy based on an expected finite-horizon value:

E[ Σ_{t=0}^{h−1} γ^t R_t | π, s_0 ] ,    (10.5)

where Rt is the reward received at time t.

What is E [·]? This mathematical notation indicates an expectation, i.e., an average


taken over all the random possibilities which may occur for the argument. Here,
the expectation is taken over the conditional probability Pr(Rt = r | π, s0 ), where Rt
is the random variable for the reward, subject to the policy being π and the state
being s0 . Since π is a function, this notation is shorthand for conditioning on all of
the random variables implied by policy π and the stochastic transitions of the MDP.

A very important point is that R(s, a) is always deterministic (in this class) for any
given s and a. Here Rt represents the set of all possible R(st , a) at time step t; this Rt
is a random variable because the state we’re in at step t is itself a random variable,
due to prior stochastic state transitions up to but not including step t, and prior (deterministic) actions dictated by policy π.

Now, for the infinite-horizon case, we select a discount factor 0 < γ < 1, and evaluate a
policy based on its expected infinite horizon discounted value:
"∞
X
#
γ Rt | π, s0 = E R0 + γR1 + γ2 R2 + . . . | π, s0 .
t
 
E (10.6)
t=0

Note that the t indices here are not the number of steps to go, but actually the number
of steps forward from the starting state (there is no sensible notion of “steps to go” in the
infinite horizon case).

Eqs. 10.5 and 10.6 are a conceptual stepping stone. Our main objective is to get to
Eq. 10.8, which can also be viewed as including γ in Eq. 10.4, with the appropriate
definition of the infinite-horizon value.

There are two good intuitive motivations for discounting. One is related to economic
theory and the present value of money: you’d generally rather have some money today
than that same amount of money next week (because you could use it now or invest it).
The other is to think of the whole process terminating, with probability 1 − γ on each step
of the interaction. This value is the expected amount of reward the agent would gain under this terminating model. (At every step, your expected future lifetime, given that you have survived until now, is 1/(1 − γ).)

Study Question: Verify this fact: if, on every day you wake up, there is a probability of 1 − γ that today will be your last day, then your expected lifetime is 1/(1 − γ) days.

Let us now evaluate a policy in terms of the expected discounted infinite-horizon value
that the agent will get in the MDP if it executes that policy. We define the infinite-horizon
value of a state s under policy π as

V_π(s) = E[R_0 + γR_1 + γ^2 R_2 + · · · | π, S_0 = s] = E[R_0 + γ(R_1 + γ(R_2 + γ(· · ·))) | π, S_0 = s] .    (10.7)


Because the expectation of a linear combination of random variables is the linear combina-
tion of the expectations, we have

V_π(s) = E[R_0 | π, S_0 = s] + γ E[R_1 + γ(R_2 + γ(· · ·)) | π, S_0 = s]
       = R(s, π(s)) + γ Σ_{s′} T(s, π(s), s′) V_π(s′) .    (10.8)

The equation defined in Eq. 10.8 is known as the Bellman Equation, which breaks down the value function into the immediate reward and the (discounted) future value function. (This is so cool! In a discounted model, if you find that you survived this round and landed in some state s′, then you have the same expected future lifetime as you did before. So the value function that is relevant in that state is exactly the same one as in state s.)
You could write down one of these equations for each of the n = |S| states. There are n unknowns V_π(s). These are linear equations, and standard software (e.g., using Gaussian elimination or other linear algebraic methods) will, in most cases, enable us to find the value of each state under this policy.
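One way to solve these n linear equations is to stack them in matrix form, (I − γT_π)V_π = R_π, and hand them to a linear solver; here is a numpy sketch (ours) under the same dictionary encoding of T and R as before:

import numpy as np

def policy_evaluation(states, T, R, pi, gamma):
    """Solve the Bellman equations (Eq. 10.8) for V_pi as a linear system (a sketch)."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    T_pi = np.zeros((n, n))                      # transition matrix under policy pi
    R_pi = np.zeros(n)                           # reward vector under policy pi
    for s in states:
        R_pi[idx[s]] = R[(s, pi[s])]
        for sp, p in T[s][pi[s]].items():
            T_pi[idx[s], idx[sp]] = p
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)      # (I - gamma T_pi) V = R_pi
    return dict(zip(states, V))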
10.2 Finding policies for MDPs

Given an MDP, our goal is typically to find a policy that is optimal in the sense that it
gets as much total reward as possible, in expectation over the stochastic transitions that
the domain makes. We build on what we have learned about evaluating the goodness of
a policy (Sections 10.1.1 and 10.1.2), and find optimal policies for the finite horizon case
(Section 10.2.1), then the infinite horizon case (Section 10.2.2).

10.2.1 Finding optimal finite-horizon policies


How can we go about finding an optimal policy for an MDP? We could imagine enumerat-
ing all possible policies and calculating their value functions as in the previous section and
picking the best one – but that’s too much work!
The first observation to make is that, in a finite-horizon problem, the best action to take
depends on the current state, but also on the horizon: imagine that you are in a situation
where you could reach a state with reward 5 in one step or a state with reward 100 in two
steps. If you have at least two steps to go, then you’d move toward the reward 100 state,
but if you only have one step left to go, you should go in the direction that will allow you
to gain 5!
One way to find an optimal policy is to compute an optimal action-value function, Q. For
the finite-horizon case, we define Qh (s, a) to be the expected value of

• starting in state s,

• executing action a, and

• continuing for h − 1 more steps executing an optimal policy for the appropriate hori-
zon on each step.

Similar to our definition of V h for evaluating a policy, we define the Qh function recur-
sively according to the horizon. The only difference is that, on each step with horizon h,
rather than selecting an action specified by a given policy, we select the value of a that will

maximize the expected Q^h value of the next state.

Q^0(s, a) = 0    (10.9)
Q^1(s, a) = R(s, a) + 0    (10.10)
Q^2(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′) max_{a′} Q^1(s′, a′)    (10.11)
...
Q^h(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′) max_{a′} Q^{h−1}(s′, a′)    (10.12)

where (s′, a′) denotes the next time-step state/action pair. We can solve for the values of Q^h with a simple recursive algorithm called finite-horizon value iteration that just computes Q^h starting from horizon 0 and working backward to the desired horizon H. Given Q^h, an optimal π*_h can be found as follows:

π*_h(s) = arg max_a Q^h(s, a) .    (10.13)

which gives the immediate best action(s) to take when there are h steps left; then π∗h−1 (s)
gives the best action(s) when there are (h − 1) steps left, and so on. In the case where there
are multiple best actions, we typically can break ties randomly.
Additionally, it is worth noting that in order for such an optimal policy to be computed,
we assume that the reward function R(s, a) is bounded on the set of all possible (state,
action) pairs. Furthermore, we will assume that the set of all possible actions is finite.
Study Question: The optimal value function is unique, but the optimal policy is not.
Think of a situation in which there is more than one optimal policy.


Dynamic programming (somewhat counter-intuitively, dynamic programming is


neither really “dynamic” nor a type of “programming” as we typically understand
it) is a technique for designing efficient algorithms. Most methods for solving MDPs
or computing value functions rely on dynamic programming to be efficient. The
principle of dynamic programming is to compute and store the solutions to simple
sub-problems that can be re-used later in the computation. It is a very important
tool in our algorithmic toolbox.
Let’s consider what would happen if we tried to compute Q4 (s, a) for all (s, a) by
directly using the definition:

• To compute Q4 (si , aj ) for any one (si , aj ), we would need to compute Q3 (s, a)
for all (s, a) pairs.

• To compute Q3 (si , aj ) for any one (si , aj ), we’d need to compute Q2 (s, a) for
all (s, a) pairs.

• To compute Q2 (si , aj ) for any one (si , aj ), we’d need to compute Q1 (s, a) for
all (s, a) pairs.

• Luckily, those are just our R(s, a) values.

So, if we have n states and m actions, this is O((mn)^3) work — that seems like way too much, especially as the horizon increases! But observe that we really only have mnh values that need to be computed: Q^h(s, a) for all h, s, a. If we start with h = 1, compute and store those values, then use and reuse the Q^{h−1}(s, a) values to compute the Q^h(s, a) values, we can do all this computation in time O(mn^2 h), which is much better!

10.2.2 Finding optimal infinite-horizon policies


In contrast to the finite-horizon case, the best way of behaving in an infinite-horizon dis-
counted MDP is not time-dependent. That is, the decisions you make at time t = 0 looking
forward to infinity, will be the same decisions that you make at time t = T for any positive
T , also looking forward to infinity.
An important theorem about MDPs is: in the infinite-horizon case, there exists a stationary optimal policy π* (there may be more than one) such that for all s ∈ S and all other policies π, we have

V_{π*}(s) ⩾ V_π(s) .    (10.14)

(Stationary means that it doesn't change over time; in contrast, the optimal policy in a finite-horizon MDP is non-stationary.)
There are many methods for finding an optimal policy for an MDP. We have already seen the finite-horizon value iteration case. Here we will study a very popular and useful method for the infinite-horizon case, infinite-horizon value iteration. It is also important to us, because it is the basis of many reinforcement-learning methods.
We will again assume that the reward function R(s, a) is bounded on the set of all pos-
sible (state, action) pairs and additionally that the number of actions in the action space
is finite. Define Q(s, a) to be the expected infinite-horizon discounted value of being in
state s, executing action a, and executing an optimal policy π∗ thereafter. Using similar
reasoning to the recursive definition of Vπ , we can express this value recursively as
      Q(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′) max_{a′} Q(s′, a′) .    (10.15)

This is also a set of equations, one for each (s, a) pair. This time, though, they are not
linear (due to the max operation), and so they are not easy to solve. But there is a theorem

that says they have a unique solution!


Once we know the optimal action-value function, then we can extract an optimal policy
π∗ as
      π∗(s) = arg max_a Q(s, a) .    (10.16)

(As in the finite-horizon case, there may be more than one optimal policy in the infinite-horizon case.)
We can iteratively solve for the Q∗ values with the infinite-horizon value iteration algorithm, shown below:
INFINITE-HORIZON-VALUE-ITERATION(S, A, T, R, γ, ϵ)
1  for s ∈ S, a ∈ A:
2      Qold(s, a) = 0
3  while not converged:
4      for s ∈ S, a ∈ A:
5          Qnew(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′) max_{a′} Qold(s′, a′)
6      if max_{s,a} |Qold(s, a) − Qnew(s, a)| < ϵ:
7          return Qnew
8      Qold = Qnew
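A minimal Python sketch of the same procedure, assuming (our convention, not the notes') that T is an n × m × n array of transition probabilities and R an n × m array of rewards:

import numpy as np

def infinite_horizon_value_iteration(T, R, gamma, eps):
    # T[s, a, s'] = P(s' | s, a); R[s, a] = reward.
    n, m = R.shape
    Q_old = np.zeros((n, m))
    while True:
        V = Q_old.max(axis=1)               # max_{a'} Q_old(s', a') for every s'
        Q_new = R + gamma * (T @ V)         # Bellman backup for all (s, a) at once
        if np.max(np.abs(Q_old - Q_new)) < eps:
            return Q_new
        Q_old = Q_new

An optimal action for state s can then be read off as np.argmax(Q[s]), matching Eq. 10.16.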

Theory There are a lot of nice theoretical results about infinite-horizon value iteration.
For some given (not necessarily optimal) Q function, define πQ (s) = arg maxa Q(s, a).

• After executing infinite-horizon value iteration with convergence hyper-parameter ϵ,

      ∥VπQnew − Vπ∗∥max < ϵ .    (10.17)

  (Note the new notation! Given two functions f and f′, we write ∥f − f′∥max to mean maxx |f(x) − f′(x)|. It measures the maximum absolute disagreement between the two functions at any input x.)

• There is a value of ϵ such that

      ∥Qold − Qnew∥max < ϵ =⇒ πQnew = π∗ .    (10.18)

• As the algorithm executes, ∥VπQnew − Vπ∗∥max decreases monotonically on each iteration.

• The algorithm can be executed asynchronously, in parallel: as long as all (s, a) pairs are updated infinitely often in an infinite run, it still converges to the optimal value. (This is very important for reinforcement learning.)



CHAPTER 11

Reinforcement learning

So far, all the learning problems we have looked at have been supervised, that is, for each
training input x(i) , we are told which value y(i) should be the output. Reinforcement learning
differs from previous learning problems in several important ways:

• The learner interacts explicitly with an environment, rather than implicitly (as in su-
pervised learning) through an available training data set of (x(i) , y(i) ) pairs drawn
from the environment.

• The learner has some choice over what new information it seeks to gain from the
environment.

• The learner updates models incrementally as additional information about the environment becomes available. (Online learning is a variant of supervised learning in which new data pairs become available over time and the model is updated, e.g., by retraining over the entire larger data set, or by weight update using just the new data.)

In a reinforcement learning problem, the interaction with the environment takes a particular form:

(Diagram: the learner sends actions to the environment; the environment sends states and rewards back to the learner.)
• Learner observes input state s(i)
• Learner generates output action a(i)
• Learner observes reward r(i)
• Learner observes input state s(i+1)
• Learner generates output action a(i+1)
• Learner observes reward r(i+1)
• ...
Similar to MDPs, the learner is supposed to find a policy, mapping a state s to action a, that
maximizes expected reward over time.


11.1 Reinforcement learning algorithms overview


A reinforcement learning (RL) algorithm is a kind of policy that depends on the whole history
of states, actions, and rewards and selects the next action to take. There are several different
ways to measure the quality of an RL algorithm, including:

• Ignoring the r(i) values that it gets while learning, but considering how many inter-
actions with the environment are required for it to learn a policy π : S → A that is
nearly optimal.

• Maximizing the expected sum of discounted rewards while it is learning.

Most of the focus is on the first criterion (which is called “sample efficiency”), because the
second one is very difficult. The first criterion is reasonable when the learning can take
place somewhere safe (imagine a robot learning, inside the robot factory, where it can’t
hurt itself too badly) or in a simulated environment.
Approaches to reinforcement learning differ significantly according to what kind of
hypothesis or model is being learned. Roughly speaking, RL methods can be categorized
into model-free methods and model-based methods. The main distinction is that model-
based methods explicitly learn the transition and reward models to assist the end-goal
of learning a policy; model-free methods do not. We will start our discussion with the
model-free methods, and introduce two of the arguably most popular types of algorithms,
Q-learning (Section 11.2.1) and policy gradient (Section 11.2.4). We then describe model-
based methods (Section 11.3). Finally, we briefly consider “bandit” problems (Section 11.4),
which differ from our MDP learning context by having probabilistic rewards.

11.2 Model-free methods


Model-free methods are methods that do not explicitly learn transition and reward mod-
els. Depending on what is explicitly being learned, model-free methods are sometimes
further categorized into value-based methods (where the goal is to learn/estimate a value
function) and policy-based methods (where the goal is to directly learn an optimal pol-
icy). It’s important to note that such categorization is approximate and the boundaries are
blurry. In fact, current RL research tends to combine the learning of value functions, poli-
cies, and transition and reward models all into a complex learning algorithm, in an attempt
to combine the strengths of each approach.

11.2.1 Q-learning
Q-learning is a frequently used class of RL algorithms that concentrates on learning (es-
timating) the state-action value function, i.e., the Q function. Specifically, recall the MDP
value-iteration update:

      Q(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′) max_{a′} Q(s′, a′) .    (11.1)

The Q-learning algorithm below adapts this value-iteration idea to the RL scenario, where we do not know the transition function T or reward function R, and instead rely on samples to perform the updates. (The thing that most students seem to get confused about is when we do value iteration and when we do Q-learning. Value iteration assumes you know T and R and just need to compute Q. In Q-learning, we don’t know or even directly estimate T and R: we estimate Q directly from experience!)


Q-LEARNING(S, A, γ, ϵ, α, s0)
1  for s ∈ S, a ∈ A:
2      Qold(s, a) = 0
3  s = s0
4  while True:
5      a = select_action(s, Qold(s, a))
6      r, s′ = execute(a)
7      Qnew(s, a) = (1 − α)Qold(s, a) + α(r + γ max_{a′} Qold(s′, a′))
8      s = s′
9      if max_{s,a} |Qold(s, a) − Qnew(s, a)| < ϵ:
10         return Qnew
11     Qold = Qnew
With the pseudo-code provided for Q-learning, there are a few key things to note. First,
we must determine which state to initialize the learning from. In the context of a game,
this initial state may be well defined. In the context of a robot navigating an environment,
one may consider sampling the initial state at random. In any case, the initial state is neces-
sary to determine the trajectory the agent will experience as it navigates the environment.
Second, different contexts will influence how we want to choose when to stop iterating
through the while loop. Again, in some games there may be a clear terminating state based
on the rules of how it is played. On the other hand, a robot may be allowed to explore an
environment ad infinitum. In such a case, one may consider either setting a fixed number of transitions to take, or stopping once the values in the Q-table are no longer changing. Finally, a single trajectory through the environment may not be sufficient to adequately explore all state-action pairs. In these instances, it becomes necessary to run through a number of iterations of the Q-learning algorithm, potentially with different choices of initial state s0. Of course, we would then want to modify Q-learning such that the Q table is not reset with each call. (This notion of running a number of instances of Q-learning is often referred to as experiencing multiple episodes.)
Now, let’s dig in to what is happening in Q-learning. Here, α ∈ (0, 1] represents the “learning rate,” which needs to decay for convergence purposes, but in practice is often
set to a constant. It’s also worth mentioning that Q-learning assumes a discrete state and
action space where states and actions take on discrete values like 1, 2, 3, . . . etc. In contrast,
a continuous state space would allow the state to take values from, say, a continuous range
of numbers; for example, the state could be any real number in the interval [1, 3]. Similarly,
a continuous action space would allow the action to be drawn from, e.g., a continuous
range of numbers. There are now many extensions developed based on Q-learning that
can handle continuous state and action spaces (we’ll look at one soon), and therefore the
algorithm above is also sometimes referred to more specifically as tabular Q-learning.
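To make the tabular setting concrete, here is a minimal sketch of the update step in Python; the array layout and function name are our own choices, not part of the notes:

import numpy as np

def tabular_q_update(Q, s, a, r, s_next, alpha, gamma):
    # Q is a (num_states, num_actions) array; s, a, s_next are integer indices.
    target = r + gamma * np.max(Q[s_next])              # one-step look-ahead target
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target    # running average toward the target
    return Q

Applying this update along the trajectory of (s, a, r, s′) experiences, with some exploration strategy choosing a, corresponds to line 7 of the pseudocode above.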
In the Q-learning update rule
      Q[s, a] ← (1 − α)Q[s, a] + α(r + γ max_{a′} Q[s′, a′])    (11.2)

the term r + γ maxa ′ Q[s ′ , a ′ ] is often referred to as the (one-step look-ahead) target. The
update can be viewed as a combination of two different iterative processes that we have
already seen: the combination of an old estimate with the target using a running average
with a learning rate α, and the dynamic-programming update of a Q value from value
iteration.
Eq. 11.2 can also be equivalently rewritten as
 
      Q[s, a] ← Q[s, a] + α((r + γ max_{a′} Q[s′, a′]) − Q[s, a]) ,    (11.3)

which allows us to interpret Q-learning in yet another way: we make an update (or correc-
tion) based on the temporal difference between the target and the current estimated value Q[s, a].
The Q-learning algorithm above includes a procedure called select_action, that, given the
current state s and current Q function, has to decide which action to take. If the Q value is
estimated very accurately and the agent is behaving in the world, then generally we would
want to choose the apparently optimal action arg maxa∈A Q(s, a). But, during learning, the
Q value estimates won’t be very good and exploration is important. However, exploring
completely at random is also usually not the best strategy while learning, because it is
good to focus your attention on the parts of the state space that are likely to be visited
when executing a good policy (not a bad or random one).
A typical action-selection strategy that attempts to address this exploration versus ex-
ploitation dilemma is the so-called ϵ-greedy strategy:

• with probability 1 − ϵ, choose arg maxa∈A Q(s, a);

• with probability ϵ, choose the action a ∈ A uniformly at random.

Here, the ϵ probability of choosing a random action helps the agent to explore and try out
actions that might not seem so desirable at the moment.
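A sketch of this strategy in Python, written to play the role of select_action in the pseudocode above (the tie-breaking and random-number handling are our own choices):

import numpy as np

rng = np.random.default_rng(0)

def select_action(s, Q, epsilon):
    # Q is a (num_states, num_actions) array of current estimates.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniformly random action
    return int(np.argmax(Q[s]))                # exploit: apparently best action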
Q-learning has the surprising property that it is guaranteed to converge to the actual
optimal Q function under fairly weak conditions! Any exploration strategy is okay as
long as it tries every action infinitely often on an infinite run (so that it doesn’t converge
prematurely to a bad action choice).
Q-learning can be very inefficient. Imagine a robot that has a choice between moving to
the left and getting a reward of 1, then returning to its initial state, or moving to the right
and walking down a 10-step hallway in order to get a reward of 1000, then returning to its
initial state.

(Diagram: the robot’s starting position, with reward +1 for moving left, and a 10-step hallway to the right with reward +1000 at the far end.)

The first time the robot moves to the right and goes down the hallway, it will update
the Q value just for state 9 on the hallway and action “right” to have a high value, but it
won’t yet understand that moving to the right in the earlier steps was a good choice. The
next time it moves down the hallway it updates the value of the state before the last one,
and so on. After 10 trips down the hallway, it now can see that it is better to move to the
right than to the left.
More concretely, consider the vector of Q values Q(i = 0, . . . , 9; right), representing the
Q values for moving right at each of the positions i = 0, . . . , 9. Position index 0 is the
starting position of the robot as pictured above.
Then, for α = 1 and γ = 0.9, Eq. 11.3 becomes

      Q(i, right) = R(i, right) + 0.9 · max_a Q(i + 1, a).    (11.4)

Starting with Q values of 0,

      Q^(0)(i = 0, . . . , 9; right) = [0  0  0  0  0  0  0  0  0  0] .    (11.5)

Since the only nonzero reward from moving right is R(9, right) = 1000, after our robot makes it down the hallway once, our new Q vector is

      Q^(1)(i = 0, . . . , 9; right) = [0  0  0  0  0  0  0  0  0  1000] .    (11.6)

(We are violating our usual notational conventions here, and writing Q^(i) to mean the Q value function that results after the robot runs all the way to the end of the hallway, when executing the policy that always moves to the right.)

After making its way down the hallway again, Q(8, right) = 0 + 0.9 · Q(9, right) = 900 updates:

      Q^(2)(i = 0, . . . , 9; right) = [0  0  0  0  0  0  0  0  900  1000] .    (11.7)

Similarly,

      Q^(3)(i = 0, . . . , 9; right) = [0  0  0  0  0  0  0  810  900  1000]    (11.8)
      Q^(4)(i = 0, . . . , 9; right) = [0  0  0  0  0  0  729  810  900  1000]    (11.9)
      ...    (11.10)
      Q^(10)(i = 0, . . . , 9; right) = [387.4  430.5  478.3  531.4  590.5  656.1  729  810  900  1000] ,    (11.11)

and the robot finally sees the value of moving right from position 0. (Here, we can see the exploration/exploitation dilemma in action: from the perspective of s0 = 0, it will seem that getting the immediate reward of 1 is a better strategy without exploring the long hallway.)

Study Question: Determine the Q value functions that will result from updates due to the robot always executing the “move left” policy.
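The sequence of Q vectors in Eqs. 11.5–11.11 can be reproduced with a short simulation; this sketch assumes the robot always moves right, α = 1, γ = 0.9, positions are updated in the order they are visited, and there is no look-ahead past the end of the hallway:

import numpy as np

gamma = 0.9
R_right = np.zeros(10)
R_right[9] = 1000.0                      # the only nonzero reward for moving right

Q_right = np.zeros(10)                   # Q(i, right) for positions i = 0, ..., 9
for trip in range(10):                   # ten trips down the hallway
    for i in range(10):                  # update in the order the positions are visited
        next_value = Q_right[i + 1] if i + 1 < 10 else 0.0
        Q_right[i] = R_right[i] + gamma * next_value
    print(trip + 1, np.round(Q_right, 1))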
11.2.2 Function approximation: Deep Q learning
In our Q-learning algorithm above, we essentially keep track of each Q value in a table,
indexed by s and a. What do we do if S and/or A are large (or continuous)?
We can use a function approximator like a neural network to store Q values. For exam-
ple, we could design a neural network that takes in inputs s and a, and outputs Q(s, a).
We can treat this as a regression problem, optimizing this loss:

      (Q(s, a) − (r + γ max_{a′} Q(s′, a′)))^2 ,    (11.12)

where Q(s, a) is now the output of the neural network. (This is the so-called squared Bellman error; as the name suggests, it’s closely related to the Bellman equation we saw in MDPs in Chapter 10. Roughly speaking, this error measures how much the Bellman equality is violated.)
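As a rough sketch, the per-transition loss might be computed as follows; q_net is a placeholder name for whatever function approximator is being used:

def squared_bellman_error(q_net, transition, gamma, actions):
    # q_net(s, a) is assumed to return a scalar Q-value estimate;
    # actions is the finite set of possible actions.
    s, a, r, s_next = transition
    target = r + gamma * max(q_net(s_next, a2) for a2 in actions)
    return (q_net(s, a) - target) ** 2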
There are several different architectural choices for using a neural network to approximate Q values:
• One network for each action a, that takes s as input and produces Q(s, a) as output;

• One single network that takes s as input and produces a vector Q(s, ·), consisting of
the Q values for each action; or

• One single network that takes s, a concatenated into a vector (if a is discrete, we
would probably use a one-hot encoding, unless it had some useful internal structure)
and produces Q(s, a) as output.
The first two choices are only suitable for discrete (and not too big) action sets. The last choice can be applied for continuous actions, but then it is difficult to find arg max_{a∈A} Q(s, a). (For continuous action spaces, it is popular to use a class of methods called actor-critic methods, which combine policy and value-function learning. We won’t get into them in detail here, though.)
There are not many theoretical guarantees about Q-learning with function approximation and, indeed, it can sometimes be fairly unstable (learning to perform well for a while, and then getting suddenly worse, for example). But neural network Q-learning has also had some significant successes.
One form of instability that we do know how to guard against is catastrophic forgetting.
In standard supervised learning, we expect that the training x values were drawn inde-
pendently from some distribution. But when a learning agent, such as a robot, is moving And, in fact, we rou-
through an environment, the sequence of states it encounters will be temporally correlated. tinely shuffle their order
For example, the robot might spend 12 hours in a dark environment and then 12 in a light in the data file, anyway.

Last Updated: 05/06/24 18:25:27


MIT 6.390 Spring 2024 100

one. This can mean that while it is in the dark, the neural-network weight-updates will
make the Q function “forget” the value function for when it’s light.
One way to handle this is to use experience replay, where we save our (s, a, s ′ , r) expe-
riences in a replay buffer. Whenever we take a step in the world, we add the (s, a, s ′ , r) to
the replay buffer and use it to do a Q-learning update. Then we also randomly select some
number of tuples from the replay buffer, and do Q-learning updates based on them, as well.
In general it may help to keep a sliding window of just the 1000 most recent experiences in
the replay buffer. (A larger buffer will be necessary for situations when the optimal policy
might visit a large part of the state space, but we like to keep the buffer size small for mem-
ory reasons and also so that we don’t focus on parts of the state space that are irrelevant for
the optimal policy.) The idea is that it will help us propagate reward values through our
state space more efficiently if we do these updates. We can see it as doing something like
value iteration, but using samples of experience rather than a known model.
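A minimal sketch of such a buffer; the 1000-experience sliding window follows the discussion above, while the class and method names are our own:

import random
from collections import deque

class ReplayBuffer:
    # Sliding window of the most recent (s, a, s', r) experiences.
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)     # old experiences fall off the end

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        # uniformly sample stored experiences for extra Q-learning updates
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))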

11.2.3 Fitted Q-learning


An alternative strategy for learning the Q function that is somewhat more robust than the
standard Q-learning algorithm is a method called fitted Q.

FITTED-Q-LEARNING(A, s0, γ, α, ϵ, m)
1  s = s0    // (e.g., s0 can be drawn randomly from S)
2  D = { }
3  initialize neural-network representation of Q
4  while True:
5      Dnew = experience from executing ϵ-greedy policy based on Q for m steps
6      D = D ∪ Dnew, represented as (s, a, s′, r) tuples
7      Dsupervised = {(x(i), y(i))}, where x(i) = (s, a) and y(i) = r + γ max_{a′∈A} Q(s′, a′)
           for each tuple (s, a, s′, r)(i) ∈ D
8      re-initialize neural-network representation of Q
9      Q = SUPERVISED-NN-REGRESSION(Dsupervised)

Here, we alternate between using the policy induced by the current Q function to gather
a batch of data Dnew , adding it to our overall data set D, and then using supervised neural-
network training to learn a representation of the Q value function on the whole data set.
This method does not mix the dynamic-programming phase (computing new Q values
based on old ones) with the function approximation phase (supervised training of the neu-
ral network) and avoids catastrophic forgetting. The regression training in line 9 typically
uses squared error as a loss function and would be trained until the fit is good (possibly
measured on held-out data).
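A sketch of the data-set construction step (line 7 of the pseudocode), assuming q_net is whatever current representation of Q we have; the names are ours:

def build_supervised_dataset(D, q_net, gamma, actions):
    # D is the accumulated experience [(s, a, s_next, r), ...].
    dataset = []
    for s, a, s_next, r in D:
        target = r + gamma * max(q_net(s_next, a2) for a2 in actions)
        dataset.append(((s, a), target))         # x = (s, a), y = one-step look-ahead target
    return dataset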

11.2.4 Policy gradient


A different model-free strategy is to search directly for a good policy. The strategy here is
to define a functional form f(s; θ) = a for the policy, where θ represents the parameters we
learn from experience. We choose f to be differentiable, and often define f(s, a; θ) = Pr(a|s), a conditional probability distribution over our possible actions. (This means the chance of choosing an action depends on which state the agent is in. Suppose, e.g., a robot is trying to get to a goal and can go left or right. An unconditional policy can say: I go left 99% of the time; a conditional policy can consider the robot’s state, and say: if I’m to the right of the goal, I go left 99% of the time.)
Now, we can train the policy parameters using gradient descent:

• When θ has relatively low dimension, we can compute a numeric estimate of the gradient by running the policy multiple times for different values of θ, and computing the resulting rewards.

• When θ has higher dimensions (e.g., it represents the set of parameters in a com-
plicated neural network), there are more clever algorithms, e.g., one called REIN-
FORCE, but they can often be difficult to get to work reliably.

Policy search is a good choice when the policy has a simple known form, but the MDP
would be much more complicated to estimate.
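For the low-dimensional case, the numeric gradient estimate mentioned in the first bullet might look roughly like this; estimate_return stands in for however one runs the policy and measures (average) total reward, and is our placeholder, not something defined in the notes:

import numpy as np

def numeric_policy_gradient(theta, estimate_return, delta=1e-2):
    # Finite-difference estimate of d(expected return)/d(theta).
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += delta
        theta_minus[i] -= delta
        grad[i] = (estimate_return(theta_plus) - estimate_return(theta_minus)) / (2 * delta)
    return grad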

11.3 Model-based RL
The conceptually simplest approach to RL is to model R and T from the data we have gotten
so far, and then use those models, together with an algorithm for solving MDPs (such as
value iteration) to find a policy that is near-optimal given the current models.
Assume that we have had some set of interactions with the environment, which can be
characterized as a set of tuples of the form (s(t) , a(t) , s(t+1) , r(t) ).
Because the transition function T (s, a, s ′ ) specifies probabilities, multiple observations
of (s, a, s ′ ) may be needed to model the transition function. One approach to this task of
building a model T̂ (s, a, s ′ ) for the true T (s, a, s ′ ) is to estimate it using a simple counting
strategy,

      T̂(s, a, s′) = (#(s, a, s′) + 1) / (#(s, a) + |S|) .    (11.13)
Here, #(s, a, s ′ ) represents the number of times in our data set we have the situation where
s(t) = s, a(t) = a, s(t+1) = s ′ and #(s, a) represents the number of times in our data set we
have the situation where s(t) = s, a(t) = a.
Study Question: Prove to yourself that #(s, a) = Σ_{s′} #(s, a, s′).
Adding 1 and |S| to the numerator and denominator, respectively, is a form of smoothing called the Laplace correction. It ensures that we never estimate that a probability is 0, and keeps us from dividing by 0. As the amount of data we gather increases, the influence of this correction fades away. (Conceptually, this is also similar to having “initialized” our estimate for the transition function with uniform random probabilities, before having made any observations.)
In contrast, the reward function R(s, a) (as we have specified it in this text) is a deterministic function, such that knowing the reward r for a given (s, a) is sufficient to fully determine the function at that point. In other words, our model R̂ can simply be a record of observed rewards, such that R̂(s, a) = r = R(s, a).
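A sketch of these counting estimates in Python (the data format and names are our own assumptions):

from collections import Counter

def estimate_models(experience, num_states):
    # experience is a list of (s, a, s_next, r) tuples.
    sas_counts, sa_counts, R_hat = Counter(), Counter(), {}
    for s, a, s_next, r in experience:
        sas_counts[(s, a, s_next)] += 1
        sa_counts[(s, a)] += 1
        R_hat[(s, a)] = r                        # rewards are deterministic in this text
    def T_hat(s, a, s_next):
        # Eq. 11.13: (#(s, a, s') + 1) / (#(s, a) + |S|)
        return (sas_counts[(s, a, s_next)] + 1) / (sa_counts[(s, a)] + num_states)
    return T_hat, R_hat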
Given empirical models T̂ and R̂ for the transition and reward functions, we can now
solve the MDP (S, A, T̂ , R̂) to find an optimal policy using value iteration, or use a search
algorithm to find an action to take for a particular state.
This approach is effective for problems with small state and action spaces, where it is
not too hard to get enough experience to model T and R well; but it is difficult to generalize
this method to handle continuous (or very large discrete) state spaces, and is a topic of
current research.

11.4 Bandit problems


Bandit problems differ from our reinforcement learning setting as described above in two
ways: the reward function is probabilistic, and the key decision is usually framed as whether
or not to continue exploring (to improve the model) versus exploiting (take actions to max-
imize expected rewards based on the current model).
A basic bandit problem is given by

• A set of actions A;


• A set of reward values R; and

• A probabilistic reward function Rp : A × R → R, i.e., Rp is a function that takes an
action and a reward and returns the probability of getting that reward conditioned
on that action being taken, Rp (a, r) = Pr(reward = r | action = a). This is analogous
to how the transition function T is defined. Each time the agent takes an action, a
new value is drawn from this distribution.

The most typical bandit problem has R = {0, 1} and |A| = k. This is called a k-armed bandit problem, where the decision is which “arm” (action a) to select, and the reward is either getting a payoff (1) or not (0). (Why? Because in English slang, “one-armed bandit” is a name for a slot machine (an old-style gambling machine where you put a coin into a slot and then pull its arm to see if you get a payoff) because it has one arm and takes your money! What we have here is a similar sort of machine, but with k arms.) There is a lot of mathematical literature on optimal strategies for k-armed bandit problems under various assumptions. The important question is usually one of exploration versus exploitation. Imagine that you have tried each action 10 times, and now you have estimates R̂p(a, r) for the probabilities Rp(a, r) for reward r given action a. Which arm should you pick next? You could

• exploit your knowledge, and for future trials choose the arm with the highest value of expected reward; or

• explore further, by trying some or all actions more times, hoping to get better estimates of the Rp(a, r) values.

The theory ultimately tells us that, the longer our horizon h (or, similarly, the closer our discount factor is to 1), the more time we should spend exploring, so that we don’t converge prematurely on a bad choice of action.
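As a tiny illustration, one might keep empirical payoff estimates per arm and trade off exploring against exploiting like this; the ϵ-greedy rule here is just one simple heuristic, not the optimal strategies the literature studies:

import numpy as np

rng = np.random.default_rng(1)

def choose_arm(counts, total_reward, epsilon):
    # counts[a] = pulls of arm a so far; total_reward[a] = payoff collected from arm a.
    if rng.random() < epsilon:
        return int(rng.integers(len(counts)))            # explore
    estimates = total_reward / np.maximum(counts, 1)     # empirical expected reward per arm
    return int(np.argmax(estimates))                     # exploit the current best estimate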
Study Question: Why is it that “bad” luck during exploration is more dangerous
than “good” luck? Imagine that there is an action that generates reward value 1 with
probability 0.9, but the first three times you try it, it generates value 0. How might
that cause difficulty? Why is this more dangerous than the situation when an action
that generates reward value 1 with probability 0.1 actually generates reward 1 on the
first three tries?
Bandit problems are reinforcement learning problems (and are very different from batch supervised learning) in that:

• The agent gets to influence what data it obtains (selecting a gives it another sample from Rp(a, r)), and

• The agent is penalized for mistakes it makes while it is learning (if it is trying to maximize the expected reward Σ_r r · Rp(a, r) it gets while behaving).

(There is a setting of supervised learning, called active learning, where instead of being given a training set, the learner gets to select a value of x and the environment gives back a label y; the problem of picking good x values to query is interesting, but the problem of deriving a hypothesis from (x, y) pairs is the same as the supervised problem we have been studying.)
In a contextual bandit problem, you have multiple possible states, drawn from some set S, and a separate bandit problem associated with each one.
Bandit problems are an essential subset of reinforcement learning. It’s important to be aware of the issues, but we will not study solutions to them in this class.



CHAPTER 12

Unsupervised Learning

In previous chapters, we have largely focused on classification and regression problems, where we use supervised learning with training samples that have both features/inputs
and corresponding outputs or labels, to learn hypotheses or models that can then be used
to predict labels for new data.
In contrast to the supervised learning paradigm, we can also have an unsupervised learning setting, where we only have features but no corresponding outputs or labels for our dataset. One natural question arises then: if there are no labels, what are we learning?
One canonical example of unsupervised learning is clustering, which we will learn about in Section 12.1. In clustering, the goal is to develop algorithms that can reason about “similarity” among data points’ features, and group the data points into clusters.
Autoencoders are another family of unsupervised learning algorithms, which we will
look at in Section 12.2, and in this case, we will be seeking to obtain insights about our
data by learning compressed versions of the original data. Or, in other words, by finding a good lower-dimensional feature representation of the same data set. Such insights might
help us to discover and characterize underlying factors of variation in data, which can aid
in scientific discovery; to compress data for efficient storage or communication; or to pre-
process our data prior to supervised learning, perhaps to reduce the amount of data that is
needed to learn a good classifier or regressor.

12.1 Clustering
Oftentimes a dataset can be partitioned into different categories. A doctor may notice that
their patients come in cohorts and different cohorts respond to different treatments. A biol-
ogist may gain insight by identifying that bats and whales, despite outward appearances,
have some underlying similarity, and both should be considered members of the same cat-
egory, i.e., “mammal”. The problem of automatically identifying meaningful groupings in
datasets is called clustering. Once these groupings are found, they can be leveraged toward
interpreting the data and making optimal decisions for each group.


12.1.1 Clustering formalisms


Mathematically, clustering looks a bit like classification: we wish to find a mapping from
datapoints, x, to categories, y. However, rather than the categories being predefined labels,
the categories in clustering are automatically discovered partitions of an unlabeled dataset.
Because clustering does not learn from labeled examples, it is an example of an unsuper-
vised learning algorithm. Instead of mimicking the mapping implicit in supervised training
pairs {x(i), y(i)}_{i=1}^n, clustering assigns datapoints to categories based on how the unlabeled data {x(i)}_{i=1}^n is distributed in data space.
Intuitively, a “cluster” is a group of datapoints that are all nearby to each other and far
away from other clusters. Let’s consider the following scatter plot. How many clusters do
you think there are?

Figure 12.1: A dataset we would like to cluster. How many clusters do you think there are?

There seem to be about five clumps of datapoints and those clumps are what we would
like to call clusters. If we assign all datapoints in each clump to a cluster corresponding to
that clump, then we might desire that nearby datapoints are assigned to the same cluster,
while far apart datapoints are assigned to different clusters.
In designing clustering algorithms, three critical things we need to decide are:

• How do we measure distance between datapoints? What counts as “nearby” and “far
apart”?

• How many clusters should we look for?

• How do we evaluate how good a clustering is?

We will see how to begin making these decisions as we work through a concrete clus-
tering algorithm in the next section.

12.1.2 The k-means formulation


One of the simplest and most commonly used clustering algorithms is called k-means. The goal of the k-means algorithm is to assign datapoints to k clusters in such a way that the variance within clusters is as small as possible. (Recall that variance is a measure of how “spread out” data is, defined as the mean squared distance from the average value of the data.) Notice that this matches our intuitive idea that a cluster should be a tightly packed set of datapoints. (We will be careful to distinguish between the k-means algorithm and the k-means objective. As we will see, the k-means algorithm can be understood to be just one optimization algorithm which finds a local optimum of the k-means objective.)
Similar to the way we showed that supervised learning could be formalized mathematically as the minimization of an objective function (loss function + regularization), we will show how unsupervised learning can also be formalized as minimizing an objective function. Let us denote the cluster assignment for a datapoint x(i) as y(i) ∈ {1, 2, . . . , k}, i.e., y(i) = 1 means we are assigning datapoint x(i) to cluster number 1. Then the k-means

objective can be quantified with the following objective function (which we also call the
“k-means objective” or “k-means loss”):

      Σ_{j=1}^{k} Σ_{i=1}^{n} 1(y(i) = j) ∥x(i) − µ(j)∥^2 ,    (12.1)

where µ(j) = (1/Nj) Σ_{i=1}^{n} 1(y(i) = j) x(i) and Nj = Σ_{i=1}^{n} 1(y(i) = j), so that µ(j) is the mean of
all datapoints in cluster j, and using 1(·) to denote the indicator function (which takes on
value of 1 if its argument is true and 0 otherwise). The inner sum (over data points) of the
loss is the variance of datapoints within cluster j. We sum up the variance of all k clusters
to get our overall loss.
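Written as code, the objective might look like this (a sketch assuming X is an n × d data array, y an array of cluster indices, and mu a k × d array of cluster means):

import numpy as np

def kmeans_loss(X, y, mu):
    # Sum of squared distances from each datapoint to its assigned cluster mean.
    return float(sum(np.sum((X[y == j] - mu[j]) ** 2) for j in range(len(mu))))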

12.1.3 K-means algorithm


The k-means algorithm minimizes this loss by alternating between two steps: given some
initial cluster assignments: 1) compute the mean of all data in each cluster and assign this as
the “cluster mean”, and 2) reassign each datapoint to the cluster with nearest cluster mean.
Fig. 12.2 shows what happens when we repeat these steps on the dataset from above.

Figure 12.2: The first three steps of running the k-means algorithm on this data. Datapoints
are colored according to the cluster to which they are assigned. Cluster means are the larger
X’s with black outlines.

Each time we reassign the data to the nearest cluster mean, the k-means loss decreases
(the datapoints end up closer to their assigned cluster mean), or stays the same. And each
time we recompute the cluster means the loss also decreases (the means end up closer to
their assigned datapoints) or stays the same. Overall then, the clustering gets better and
better, according to our objective – until it stops improving. After four iterations of cluster
assignment + update means in our example, the k-means algorithm stops improving. Its
final solution is shown in Fig. 12.3. It seems to arrive at something reasonable!

Figure 12.3: K-means loss stops improving; final result.


Now let’s write out the algorithm in complete detail:

K-MEANS(k, τ, {x(i)}_{i=1}^n)
1  µ, y = random initialization
2  for t = 1 to τ
3      yold = y
4      for i = 1 to n
5          y(i) = arg min_j ∥x(i) − µ(j)∥^2
6      for j = 1 to k
7          µ(j) = (1/Nj) Σ_{i=1}^{n} 1(y(i) = j) x(i)
8      if 1(y = yold)
9          break
10 return µ, y

Study Question: Why do we have the “break” statement on line 9? Could the clus-
tering improve if we ran it for more iterations after this point? Has it converged?
The for-loop over the n datapoints assigns each datapoint to the nearest cluster center.
The for-loop over the k clusters updates the cluster center to be the mean of all datapoints
currently assigned to that cluster. As suggested above, it can be shown that this algorithm
reduces the loss in Eq. 12.1 on each iteration, until it converges to a local minimum of the
loss.
It’s like classification except the algorithm picked what the classes are rather than being
given examples of what the classes are.
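For concreteness, here is a small numpy version of the same procedure; initializing the means at k randomly chosen datapoints is our own choice, since the pseudocode just says “random initialization”:

import numpy as np

def kmeans(X, k, tau, rng=np.random.default_rng(0)):
    # X: (n, d) data array. Returns (cluster means, cluster assignments).
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    y = np.zeros(n, dtype=int)
    for _ in range(tau):
        y_old = y.copy()
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
        y = dists.argmin(axis=1)                     # assign each point to its nearest mean
        for j in range(k):
            if np.any(y == j):
                mu[j] = X[y == j].mean(axis=0)       # recompute each cluster mean
        if np.all(y == y_old):
            break                                    # assignments stopped changing
    return mu, y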

12.1.4 Using gradient descent to minimize k-means objective


We can also use gradient descent to optimize the k-means objective. To show how to apply gradient descent, we first rewrite the objective as a differentiable function only of µ:

      L(µ) = Σ_{i=1}^{n} min_j ∥x(i) − µ(j)∥^2 .    (12.2)

(The k-means algorithm presented above is a form of block coordinate descent, rather than gradient descent. For certain problems, and in particular k-means, this method can converge faster than gradient descent.)
L(µ) is the value of the k-means loss given that we pick the optimal assignments of the datapoints to cluster means (that’s what the min_j does). Now we can use the gradient ∂L(µ)/∂µ to find the values for µ that achieve minimum loss when cluster assignments are optimal. (L(µ) is a smooth function except with kinks where the nearest cluster changes; that means it’s differentiable almost everywhere, which in practice is sufficient for us to apply gradient descent.) Finally, we read off the optimal cluster assignments, given the optimized µ, just by assigning datapoints to their nearest cluster mean:

      y(i) = arg min_j ∥x(i) − µ(j)∥^2 .    (12.3)

This procedure yields a local minimum of Eq. 12.1, as does the standard k-means algorithm
we presented (though they might arrive at different solutions). It might not be a global
optimum since the objective is not convex (due to minj , as the minimum of multiple convex
functions is not necessarily convex).
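A sketch of a single gradient-descent step on L(µ); the step size eta is our own illustrative choice:

import numpy as np

def kmeans_gradient_step(X, mu, eta):
    # One gradient step on L(mu) = sum_i min_j ||x_i - mu_j||^2 (mu is a (k, d) float array).
    dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)                   # the min_j inside the loss
    grad = np.zeros_like(mu)
    for j in range(len(mu)):
        # dL/dmu_j = sum over points currently nearest to j of 2 (mu_j - x_i)
        grad[j] = 2 * (mu[j] * np.sum(nearest == j) - X[nearest == j].sum(axis=0))
    return mu - eta * grad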

12.1.5 Importance of initialization


The standard k-means algorithm, as well as the variant that uses gradient descent, both
are only guaranteed to converge to a local minimum, not necessarily a global minimum
of the loss. Thus the answer we get out depends on how we initialize the cluster means.


Figure 12.4: With the initialization of the means to the left, the yellow and red means end
up splitting what perhaps should be one cluster in half.

Figure 12.4 is an example of a different initialization on our toy data, which results in a worse converged clustering.
A variety of methods have been developed to pick good initializations (for example,
check out the k-means++ algorithm). One simple option is to run the standard k-means
algorithm multiple times, with different random initial conditions, and then pick from
these the clustering that achieves the lowest k-means loss.

12.1.6 Importance of k
A very important choice in clustering algorithms is the number of clusters we are looking
for. Some advanced algorithms can automatically infer a suitable number of clusters, but
most of the time, like with k-means, we will have to pick k – it’s a hyperparameter of the
algorithm.

Figure 12.5: Example of k-means run on our toy data, with two different values of k. Setting
k=4, on the left, results in one cluster being merged, compared to setting k=5, on the right.
Which clustering do you think is better? How could you decide?

Figure 12.5 shows an example of the effect. Which result looks more correct? It can be
hard to say! Using higher k we get more clusters, and with more clusters we can achieve
lower within-cluster variance – the k-means objective will never increase, and will typically
strictly decrease as we increase k. Eventually, we can increase k to equal the total number
of datapoints, so that each datapoint is assigned to its own cluster. Then the k-means
objective is zero, but the clustering reveals nothing. Clearly, then, we cannot use the k-
means objective itself to choose the best value for k. In subsection 12.1.7.1, we will discuss
some ways of evaluating the success of clustering beyond its ability to minimize the k-
means objective, and it’s with these sorts of methods that we might decide on a proper
value of k.


Alternatively, you may be wondering: why bother picking a single k? Wouldn’t it be nice to reveal a hierarchy of clusterings of our data, showing both coarse and fine group-
ings? Indeed hierarchical clustering is another important class of clustering algorithms, be-
yond k-means. These methods can be useful for discovering tree-like structure in data, and
they work a bit like this: initially a coarse split/clustering of the data is applied at the root
of the tree, and then as we descend the tree we split and cluster the data in ever more fine-
grained ways. A prototypical example of hierarchical clustering is to discover a taxonomy
of life, where creatures may be grouped at multiple granularities, from species to families
to kingdoms.

12.1.7 k-means in feature space


Clustering algorithms group data based on a notion of similarity, and thus we need to
define a distance metric between datapoints. This notion will also be useful in other machine
learning approaches, such as nearest-neighbor methods that we see in Chapter 9. In k-
means and other methods, our choice of distance metric can have a big impact on the
results we will find.
Our k-means algorithm uses the Euclidean distance, i.e., ∥x(i) − µ(j)∥, with a loss func-
tion that is the square of this distance. We can modify k-means to use different distance
metrics, but a more common trick is to stick with Euclidean distance but measured in a
feature space. Just like we did for regression and classification problems, we can define a
feature map from the data to a nicer feature representation, ϕ(x), and then apply k-means
to cluster the data in the feature space. (In fact, using a simple distance metric in feature space can be equivalent to using a more sophisticated distance metric in the data space, and this trick forms the basis of kernel methods, which you can learn about in more advanced machine learning classes.)
As a simple example, suppose we have two-dimensional data that is very stretched out in the first dimension and has less dynamic range in the second dimension. Then we may want to scale the dimensions so that each has similar dynamic range, prior to clustering. We could use standardization, like we did in Chapter 5.
If we want to cluster more complex data, like images, music, chemical compounds, etc., then we will usually need more sophisticated feature representations. One common practice these days is to use feature representations learned with a neural network. For example, we can use an autoencoder to compress images into feature vectors, then cluster those feature vectors.
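As a sketch of that preprocessing idea, one might standardize the raw features and then run the k-means sketch from earlier on the transformed data:

import numpy as np

def standardize(X):
    # Zero-mean, unit-variance scaling of each feature dimension (assumes nonzero spread).
    return (X - X.mean(axis=0)) / X.std(axis=0)

# e.g., cluster in the standardized feature space rather than the raw data space:
# mu, y = kmeans(standardize(X), k=5, tau=100)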

12.1.7.1 How to evaluate clustering algorithms


One of the hardest aspects of clustering is knowing how to evaluate it. This is actually
a big issue for all unsupervised learning methods, since we are just looking for patterns
in the data, rather than explicitly trying to predict target values (which was the case with
supervised learning).
Remember, evaluation metrics are not the same as loss functions, so we can’t just mea-
sure success by looking at the k-means loss. In prediction problems, it is critical that the
evaluation is on a held-out test set, while the loss is computed over training data. If we
evaluate on training data we cannot detect overfitting. Something similar is going on with
the example in Section 12.1.6, where setting k to be too large can precisely “fit” the data
(minimize the loss), but yields no general insight.
One way to evaluate our clusters is to look at the consistency with which they are found
when we run on different subsamples of our training data, or with different hyperparam-
eters of our clustering algorithm (e.g., initializations). For example, if running on several
bootstrapped samples (random subsets of our data) results in very different clusters, it
should call into question the validity of any of the individual results.
If we have some notion of what ground truth clusters should be, e.g., a few data points
that we know should be in the same cluster, then we can measure whether or not our


discovered clusters group these examples correctly.


Clustering is often used for visualization and interpretability, to make it easier for
humans to understand the data. Here, human judgment may guide the choice of clustering
algorithm. More quantitatively, discovered clusters may be used as input to downstream
tasks. For example, we may fit a different regression function on the data within each
cluster. Figure 12.6 gives an example where this might be useful. In cases like this, the
success of a clustering algorithm can be indirectly measured based on the success of the
downstream application (e.g., does it make the downstream predictions more accurate).

Figure 12.6: Averaged across the whole population, risk of heart disease positively corre-
lates with hours of exercise. However, if we cluster the data, we can observe that there are
four subgroups of the population which correspond to different age groups, and within
each subgroup the correlation is negative. We can make better predictions, and better cap-
ture the presumed true effect, if we cluster this data and then model the trend in each
cluster separately.

12.2 Autoencoder structure


Assume that we have input data D = {x(1) , . . . , x(n) }, where x(i) ∈ Rd . We seek to learn an
autoencoder that will output a new dataset Dout = {a(1) , . . . , a(n) }, where a(i) ∈ Rk with
k < d. We can think about a(i) as the new representation of data point x(i) . For example,
in Fig. 12.7 we show the learned representations of a dataset of MNIST digits with k = 2.
We see, after inspecting the individual data points, that unsupervised learning has found
a compressed (or latent) representation where images of the same digit are close to each
other, potentially greatly aiding subsequent clustering or classification tasks.
Formally, an autoencoder consists of two functions, a vector-valued encoder g : Rd → Rk
that deterministically maps the data to the representation space a ∈ Rk , and a decoder
h : Rk → Rd that maps the representation space back into the original data space.
In general, the encoder and decoder functions might be any functions appropriate to the
domain. Here, we are particularly interested in neural network embodiments of encoders
and decoders. The basic architecture of one such autoencoder, consisting of only a single
layer neural network in each of the encoder and decoder, is shown in Figure 12.8; note that
bias terms W01 and W02 into the summation nodes exist, but are omitted for clarity in the fig-
ure. In this example, the original d-dimensional input is compressed into k = 3 dimensions via the encoder g(x; W1, W01) = f1(W1^T x + W01) with W1 ∈ R^{d×k} and W01 ∈ R^k, and where the non-linearity f1 is applied to each dimension of the vector. To recover (an approximation to) the original instance, we then apply the decoder h(a; W2, W02) = f2(W2^T a + W02), where f2 denotes a different non-linearity (activation function).


Figure 12.7: Compression of digits dataset into two dimensions. The input x(i) , an image of a
handwritten digit, is shown at the new low-dimensional representation (a1 , a2 ).

coder and the encoder could involve multiple layers, as opposed to the single layer shown
here. Learning seeks parameters W 1 , W01 and W 2 , W02 such that the reconstructed instances,
h(g(x(i) ; W 1 , W01 ); W 2 , W02 ), are close to the original input x(i) .

Figure 12.8: Autoencoder structure, showing the encoder (left half, light green), and the decoder
(right half, light blue), encoding inputs x to the representation a, and decoding the representation to
produce x̃, the reconstruction. In this specific example, the representation (a1 , a2 , a3 ) only has three
dimensions.

12.2.1 Autoencoder Learning


We learn the weights in an autoencoder using the same tools that we previously used for
supervised learning, namely (stochastic) gradient descent of a multi-layer neural network
to minimize a loss function. All that remains is to specify the loss function L(x̃, x), which
tells us how to measure the discrepancy between the reconstruction x̃ = h(g(x; W 1 , W01 ); W 2 , W02 )
and the original input x. For example, for continuous-valued x it might make sense to use
P
squared loss, i.e., LSE (x̃, x) = d 2
j=1 (xj − x̃j ) . Learning then seeks to optimize the parame- Alternatively, you could
ters of h and g so as to minimize the reconstruction error, measured according to this loss think of this as multi-
task learning, where the
function:
Xn   goal is to predict each
min LSE h(g(x(i) ; W 1 , W01 ); W 2 , W02 ), x(i) dimension of x. One
W 1 ,W01 ,W 2 ,W02 can mix-and-match loss
i=1
functions as appropriate
for each dimension’s
Last Updated: 05/06/24 18:25:27 data type.
MIT 6.390 Spring 2024 111
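A sketch of the forward pass and reconstruction loss for the single-layer case in numpy; the weight shapes follow the text, while the choice of tanh and identity activations is our own:

import numpy as np

def reconstruction_loss(x, W1, W01, W2, W02):
    # x: input of shape (d,); W1: (d, k), W01: (k,); W2: (k, d), W02: (d,).
    a = np.tanh(W1.T @ x + W01)          # encoder g(x): representation of size k
    x_tilde = W2.T @ a + W02             # decoder h(a): reconstruction of size d
    return float(np.sum((x - x_tilde) ** 2))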

12.2.2 Evaluating an autoencoder


What makes a good learned representation in an autoencoder? Notice that, without further
constraints, it is always possible to perfectly reconstruct the input. For example, we could
let k = d and h and g be the identity functions. In this case, we would not obtain any
compression of the data.
To learn something useful, we must create a bottleneck by making k smaller (of-
ten much smaller) than d. This forces the learning algorithm to seek transformations that
describe the original data using as simple a description as possible. Thinking back to the
digits dataset, for example, an example of a compressed representation might be the digit
label (i.e., 0–9), rotation, and stroke thickness. Of course, there is no guarantee that the
learning algorithm will discover precisely this representation. After learning, we can in-
spect the learned representations, such as by artificially increasing or decreasing one of the
dimensions (e.g., a1 ) and seeing how it affects the output h(a), to try to better understand
what it has learned.
As with clustering, autoencoders can be a preliminary step toward building other mod-
els, such as a regressor or classifier. For example, once a good encoder has been learned,
the decoder might be replaced with another neural network that is then trained with su-
pervised learning (perhaps using a smaller dataset that does include labels).

12.2.3 Linear encoders and decoders


We close by mentioning that even linear encoders and decoders can be very powerful. In
this case, rather than minimizing the above objective with gradient descent, a technique
called principal components analysis (PCA) can be used to obtain a closed-form solution to
the optimization problem using a singular value decomposition (SVD). Just as a multilayer
neural network with nonlinear activations for regression (learned by gradient descent) can
be thought of as a nonlinear generalization of a linear regressor (fit by matrix algebraic
operations), the neural network based autoencoders discussed above (and learned with
gradient descent) can be thought of as a generalization of linear PCA (as solved with matrix
algebra by SVD).

12.2.4 Advanced encoders and decoders


Advanced neural networks that build on the encoder-decoder conceptual decomposition
have become increasingly powerful in recent years. One family of applications is gen-
erative networks, where new outputs that are “similar to” but different from any existing
training sample are desired. In variational autoencoders the compressed representation en-
compasses information about the probability distribution of training samples, e.g., learning
both mean and standard deviation variables in the bottleneck layer or latent representation.
Then, new outputs can be generated by random sampling based on the latent representa-
tion variables and feeding those samples into the decoder. As another example, Transformers use
multiple encoder and decoder blocks, together with a self-attention mechanism to make
predictions about potential next outputs resulting from sequences of inputs. Such trans-
former networks have many applications in natural language processing and elsewhere.



APPENDIX A

Matrix derivative common cases

What are some conventions for derivatives of matrices and vectors? It will always work
to explicitly write all indices and treat everything as scalars, but we introduce here some
shortcuts that are often faster to use and helpful for understanding.
There are at least two consistent but different systems for describing shapes and rules
for doing matrix derivatives. In the end, they all are correct, but it is important to be
consistent.
We will use what is often called the ‘Hessian’ or denominator layout, in which we say that for x of size n × 1 and y of size m × 1, ∂y/∂x is a matrix of size n × m with the (i, j) entry ∂yj/∂xi . This denominator layout convention has been adopted by the field of machine learning to ensure that the shape of the gradient is the same as the shape of the variable with respect to which we differentiate. This is somewhat controversial at large, but alas, we shall continue with denominator layout.
The discussion below closely follows the Wikipedia article on matrix derivatives.

A.1 The shapes of things


Here are important special cases of the rule above:

• Scalar-by-scalar: For x of size 1 × 1 and y of size 1 × 1, ∂y/∂x is the (scalar) partial


derivative of y with respect to x.

• Scalar-by-vector: For x of size n × 1 and y of size 1 × 1, ∂y/∂x (also written ∇x y, the


gradient of y with respect to x) is a column vector of size n × 1 with the ith entry
∂y/∂xi :

 
      ∂y/∂x = [ ∂y/∂x1
                ∂y/∂x2
                   ⋮
                ∂y/∂xn ] .


• Vector-by-scalar: For x of size 1 × 1 and y of size m × 1, ∂y/∂x is a row vector of size


1 × m with the jth entry ∂yj /∂x:

 
      ∂y/∂x = [ ∂y1/∂x   ∂y2/∂x   · · ·   ∂ym/∂x ] .

• Vector-by-vector: For x of size n × 1 and y of size m × 1, ∂y/∂x is a matrix of size


n × m with the (i, j) entry ∂yj /∂xi :

      ∂y/∂x = [ ∂y1/∂x1   ∂y2/∂x1   · · ·   ∂ym/∂x1
                ∂y1/∂x2   ∂y2/∂x2   · · ·   ∂ym/∂x2
                   ⋮          ⋮       ⋱        ⋮
                ∂y1/∂xn   ∂y2/∂xn   · · ·   ∂ym/∂xn ] .

• Scalar-by-matrix: For X of size n × m and y of size 1 × 1, ∂y/∂X (also written ∇X y, the


gradient of y with respect to X) is a matrix of size n × m with the (i, j) entry ∂y/∂Xi,j :
 
      ∂y/∂X = [ ∂y/∂X1,1   · · ·   ∂y/∂X1,m
                    ⋮         ⋱         ⋮
                ∂y/∂Xn,1   · · ·   ∂y/∂Xn,m ] .

You may notice that in this list, we have not included matrix-by-matrix, matrix-by-
vector, or vector-by-matrix derivatives. This is because, generally, they cannot be expressed
nicely in matrix form and require higher order objects (e.g., tensors) to represent their
derivatives. These cases are beyond the scope of this course.
Additionally, notice that for all cases, you can explicitly compute each element of the
derivative object using (scalar) partial derivatives. You may find it useful to work through
some of these by hand as you are reviewing matrix derivatives.

A.2 Some vector-by-vector identities


Here are some examples of ∂y/∂x. In each case, assume x is n × 1, y is m × 1, a is a scalar
constant, a is a vector that does not depend on x and A is a matrix that does not depend on
x, u and v are scalars that do depend on x, and u and v are vectors that do depend on x. We
also have vector-valued functions f and g.

A.2.1 Some fundamental cases


First, we will cover a couple of fundamental cases: suppose that a is an m × 1 vector which
is not a function of x, an n × 1 vector. Then,

      ∂a/∂x = 0 ,    (A.1)
is an n × m matrix of 0s. This is similar to the scalar case of differentiating a constant. Next,
we can consider the case of differentiating a vector with respect to itself:

      ∂x/∂x = I    (A.2)
This is the n × n identity matrix, with 1’s along the diagonal and 0’s elsewhere. It makes
sense, because ∂xj/∂xi is 1 for i = j and 0 otherwise. This identity is also similar to the scalar
case.


A.2.2 Derivatives involving a constant matrix


Let the dimensions of A be m × n. Then the object Ax is an m × 1 vector. We can then
compute the derivative of Ax with respect to x as:

      ∂Ax/∂x = [ ∂(Ax)1/∂x1   ∂(Ax)2/∂x1   · · ·   ∂(Ax)m/∂x1
                 ∂(Ax)1/∂x2   ∂(Ax)2/∂x2   · · ·   ∂(Ax)m/∂x2
                      ⋮             ⋮        ⋱          ⋮
                 ∂(Ax)1/∂xn   ∂(Ax)2/∂xn   · · ·   ∂(Ax)m/∂xn ]    (A.3)

Note that any element of the column vector Ax can be written as, for j = 1, . . . , m:

      (Ax)j = Σ_{k=1}^{n} Aj,k xk .

Thus, computing the (i, j) entry of ∂Ax/∂x requires computing the partial derivative ∂(Ax)j/∂xi :

      ∂(Ax)j/∂xi = ∂( Σ_{k=1}^{n} Aj,k xk )/∂xi = Aj,i .

Therefore, the (i, j) entry of ∂Ax/∂x is the (j, i) entry of A:

      ∂Ax/∂x = A^T    (A.4)
Similarly, for x of size n × 1 and A a matrix of size n × m that does not depend on x (so that x^T A is defined), one can obtain

      ∂(x^T A)/∂x = A    (A.5)

A.2.3 Linearity of derivatives


Suppose that u, v are both vectors of size m × 1. Then,

      ∂(u + v)/∂x = ∂u/∂x + ∂v/∂x    (A.6)
Suppose that a is a scalar constant and u is an m × 1 vector that is a function of x. Then,

      ∂(au)/∂x = a ∂u/∂x    (A.7)
One can extend the previous identity to vector- and matrix-valued constants. Suppose
that a is a vector with shape m × 1 and v is a scalar which depends on x. Then,

      ∂(va)/∂x = (∂v/∂x) a^T    (A.8)
First, checking dimensions, ∂v/∂x is n × 1 and a is m × 1 so aT is 1 × m and our answer
is n × m as it should be. Now, checking a value, element (i, j) of the answer is ∂vaj /∂xi =
(∂v/∂xi )aj which corresponds to element (i, j) of (∂v/∂x)aT .
Similarly, suppose that A is a matrix which does not depend on x and u is a column
vector which does depend on x. Then,

      ∂(Au)/∂x = (∂u/∂x) A^T    (A.9)


A.2.4 Product rule (vector-valued numerator)


Suppose that v is a scalar which depends on x, while u is a column vector of shape m × 1
and x is a column vector of shape n × 1. Then,

      ∂(vu)/∂x = v ∂u/∂x + (∂v/∂x) u^T    (A.10)
One can see this relationship by expanding the derivative as follows:

      ∂(vu)/∂x = [ ∂(vu1)/∂x1   ∂(vu2)/∂x1   · · ·   ∂(vum)/∂x1
                   ∂(vu1)/∂x2   ∂(vu2)/∂x2   · · ·   ∂(vum)/∂x2
                        ⋮             ⋮        ⋱          ⋮
                   ∂(vu1)/∂xn   ∂(vu2)/∂xn   · · ·   ∂(vum)/∂xn ] .

Then, one can use the product rule for scalar-valued functions,

∂(v u_j)/∂x_i = v (∂u_j/∂x_i) + (∂v/∂x_i) u_j ,

to obtain the desired result.

A.2.5 Chain rule


Suppose that g is a vector-valued function with output vector of shape m × 1, and the
argument to g is a column vector u of shape d × 1 which depends on x. Then, one can
obtain the chain rule as,

∂g(u)/∂x = (∂u/∂x)(∂g(u)/∂u)    (A.11)
Following “the shapes of things,” ∂u/∂x is n × d and ∂g(u)/∂u is d × m, where element (i, j) is ∂g(u)_j/∂u_i. The same chain rule applies for further compositions of functions:

∂f(g(u))/∂x = (∂u/∂x)(∂g(u)/∂u)(∂f(g)/∂g)    (A.12)

A.3 Some other identities


You can get many scalar-by-vector and vector-by-scalar cases as special cases of the rules above, making one of the relevant vectors just be 1 × 1. Here are some other ones that are handy. For more, see the Wikipedia article on Matrix derivatives (for consistency, only use the ones in denominator layout).

∂(u^T v)/∂x = (∂u/∂x) v + (∂v/∂x) u    (A.13)

∂u^T/∂x = (∂u/∂x)^T    (A.14)


A.4 Derivation of gradient for linear regression


Applying identities A.5, A.13, A.6, A.4, and A.1:

∂[ (X̃θ − Ỹ)^T (X̃θ − Ỹ)/n ] / ∂θ = (2/n) (∂(X̃θ − Ỹ)/∂θ) (X̃θ − Ỹ)
                                  = (2/n) (∂X̃θ/∂θ − ∂Ỹ/∂θ) (X̃θ − Ỹ)
                                  = (2/n) (X̃^T − 0) (X̃θ − Ỹ)
                                  = (2/n) X̃^T (X̃θ − Ỹ)
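As a sketch of how you might check this result numerically (the data and sizes below are made up), compare the analytic gradient (2/n) X̃^T (X̃θ − Ỹ) against central finite differences of the objective:

    import numpy as np

    # Check the gradient (2/n) Xt^T (Xt theta - Yt) against finite differences
    # of J(theta) = (1/n) ||Xt theta - Yt||^2.
    rng = np.random.default_rng(1)
    n, d = 20, 3
    Xt = rng.normal(size=(n, d))       # X~, one data point per row
    Yt = rng.normal(size=(n, 1))       # Y~
    theta = rng.normal(size=(d, 1))

    def J(th):
        r = Xt @ th - Yt
        return float(r.T @ r) / n

    analytic = (2.0 / n) * Xt.T @ (Xt @ theta - Yt)

    eps = 1e-6
    numeric = np.zeros_like(theta)
    for i in range(d):
        e = np.zeros_like(theta); e[i] = eps
        numeric[i] = (J(theta + e) - J(theta - e)) / (2 * eps)

    assert np.allclose(analytic, numeric, atol=1e-5)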

A.5 Matrix derivatives using Einstein summation


You do not have to read or learn this! But you might find it interesting or helpful.
Consider the objective function for linear regression, written out as products of matri-
ces:
J(θ) = (1/n) (X̃θ − Ỹ)^T (X̃θ − Ỹ) ,    (A.15)
where X̃ = X^T is n × d, Ỹ = Y^T is n × 1, and θ is d × 1. How does one show, with no shortcuts, that
∇_θ J = (2/n) X̃^T (X̃θ − Ỹ) ?    (A.16)
One neat way, which is very explicit, is to simply write all the matrices as variables with
row and column indices, e.g., X̃_ab is the row a, column b entry of the matrix X̃. Furthermore, let us use the convention that in any product, all indices which appear more than once get summed over; this is a popular convention in theoretical physics, and lets us suppress all the summation symbols which would otherwise clutter the following expressions. For example, X̃_ab θ_b would be the implicit summation notation giving the element at the a-th row of the matrix-vector product X̃θ.
Using implicit summation notation with explicit indices, we can rewrite J(θ) as

J(θ) = (1/n) (X̃_ab θ_b − Ỹ_a)(X̃_ac θ_c − Ỹ_a) .    (A.17)
Note that we no longer need the transpose on the first term, because all that transpose
accomplished was to take a dot product between the vector given by the left term, and the
vector given by the right term. With implicit summation, this is accomplished by the two
terms sharing the repeated index a.
Taking the derivative of J with respect to the dth element of θ thus gives, using the chain
rule for (ordinary scalar) multiplication:

dJ/dθ_d = (1/n) [ X̃_ab δ_bd (X̃_ac θ_c − Ỹ_a) + (X̃_ab θ_b − Ỹ_a) X̃_ac δ_cd ]    (A.18)
        = (1/n) [ X̃_ad (X̃_ac θ_c − Ỹ_a) + (X̃_ab θ_b − Ỹ_a) X̃_ad ]    (A.19)
        = (2/n) X̃_ad (X̃_ab θ_b − Ỹ_a) ,    (A.20)
n


where the second line follows from the first, with the definition that δ_bd = 1 only when b = d (and similarly for δ_cd). And the third line follows from the second by recognizing that the two terms in the second line are identical. Now note that in this implicit summation notation, the (a, c) element of the matrix product of A and B is (AB)_ac = A_ab B_bc. That is, ordinary matrix multiplication sums over indices which are adjacent to each other, because a row of A times a column of B becomes a scalar number. So the term in the above equation with X̃_ad X̃_ab is not a matrix product of X̃ with X̃. However, taking the transpose X̃^T switches row and column indices, so X̃_ad = X̃^T_da. And X̃^T_da X̃_ab is a matrix product of X̃^T with X̃! Thus, we have that
dJ/dθ_d = (2/n) X̃^T_da (X̃_ab θ_b − Ỹ_a)    (A.21)
        = (2/n) [ X̃^T (X̃θ − Ỹ) ]_d ,    (A.22)
which is the desired result.
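If you like, you can mirror the implicit-summation computation directly in code with np.einsum; the short sketch below (our own, with made-up data) spells out the index strings and checks the result against the matrix form:

    import numpy as np

    # The implicit-summation expressions above map directly onto np.einsum.
    rng = np.random.default_rng(2)
    n, d = 20, 3
    Xt = rng.normal(size=(n, d))
    Yt = rng.normal(size=n)
    theta = rng.normal(size=d)

    resid = np.einsum('ab,b->a', Xt, theta) - Yt          # X~_ab theta_b - Y~_a
    grad = (2.0 / n) * np.einsum('ad,a->d', Xt, resid)    # (2/n) X~^T (X~ theta - Y~)

    assert np.allclose(grad, (2.0 / n) * Xt.T @ (Xt @ theta - Yt))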



APPENDIX B

Optimizing Neural Networks

B.0.1 Strategies towards adaptive step-size


B.0.1.1 Running averages
We’ll start by looking at the notion of a running average. It’s a computational strategy for
estimating a possibly weighted average of a sequence of data. Let our data sequence be
c1 , c2 , . . .; then we define a sequence of running average values, C0 , C1 , C2 , . . . using the
equations
C_0 = 0
C_t = γ_t C_{t−1} + (1 − γ_t) c_t
where γ_t ∈ (0, 1). If γ_t is a constant, then this is a moving average, in which
C_T = γ C_{T−1} + (1 − γ) c_T
    = γ ( γ C_{T−2} + (1 − γ) c_{T−1} ) + (1 − γ) c_T
    = Σ_{t=1}^{T} γ^{T−t} (1 − γ) c_t

So, you can see that inputs c_t closer to the end of the sequence T have more effect on C_T than early inputs.
If, instead, we set γ_t = (t − 1)/t, then we get the actual average.
Study Question: Prove to yourself that the previous assertion holds.
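Here is a minimal sketch of our own that checks the claim in the Study Question numerically: with γ_t = (t − 1)/t the running average reproduces the ordinary mean of the sequence.

    import numpy as np

    # With gamma_t = (t-1)/t the running average equals the ordinary mean.
    rng = np.random.default_rng(3)
    c = rng.normal(size=100)

    C = 0.0
    for t, c_t in enumerate(c, start=1):
        gamma_t = (t - 1) / t
        C = gamma_t * C + (1 - gamma_t) * c_t

    assert np.isclose(C, c.mean())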

B.0.1.2 Momentum
Now, we can use methods that are a bit like running averages to describe strategies for
computing η. The simplest method is momentum, in which we try to “average” recent
gradient updates, so that if they have been bouncing back and forth in some direction, we
take out that component of the motion. For momentum, we have
V_0 = 0
V_t = γ V_{t−1} + η ∇_W J(W_{t−1})
W_t = W_{t−1} − V_t


This doesn’t quite look like an adaptive step size. But what we can see is that, if we let
η = η ′ (1 − γ), then the rule looks exactly like doing an update with step size η ′ on a
moving average of the gradients with parameter γ:

M_0 = 0
M_t = γ M_{t−1} + (1 − γ) ∇_W J(W_{t−1})
W_t = W_{t−1} − η′ M_t

Study Question: Prove to yourself that these formulations are equivalent.


We will find that Vt will be bigger in dimensions that consistently have the same sign
for ∇W and smaller for those that don’t. Of course we now have two parameters to set (η
and γ), but the hope is that the algorithm will perform better overall, so it will be worth
trying to find good values for them. Often γ is set to be something like 0.9.
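The following sketch checks the equivalence claimed in the Study Question above on a made-up quadratic objective: with η = η′(1 − γ), the (V_t) and (M_t) formulations produce identical weight iterates.

    import numpy as np

    # The V_t and M_t formulations of momentum give the same iterates when
    # eta = eta_prime * (1 - gamma). The quadratic objective here is made up.
    rng = np.random.default_rng(4)
    W0 = rng.normal(size=5)
    grad = lambda W: 2 * W            # gradient of J(W) = ||W||^2
    gamma, eta_prime = 0.9, 0.1
    eta = eta_prime * (1 - gamma)

    W_v, V = W0.copy(), np.zeros(5)
    W_m, M = W0.copy(), np.zeros(5)
    for _ in range(50):
        V = gamma * V + eta * grad(W_v)
        W_v = W_v - V
        M = gamma * M + (1 - gamma) * grad(W_m)
        W_m = W_m - eta_prime * M

    assert np.allclose(W_v, W_m)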

(Figure: the red arrows show the update after each successive step of mini-batch gradient descent with momentum; the blue points show the direction of the gradient with respect to the mini-batch at each step. Momentum smooths the path taken towards the local minimum and leads to faster convergence.)

Study Question: If you set γ = 0.1, would momentum have more of an effect or less
of an effect than if you set it to 0.9?

B.0.1.3 Adadelta
Another useful idea is this: we would like to take larger steps in parts of the space where
J(W) is nearly flat (because there’s no risk of taking too big a step due to the gradient
being large) and smaller steps when it is steep. We’ll apply this idea to each weight in-
dependently, and end up with a method called adadelta, which is a variant on adagrad (for
adaptive gradient). Even though our weights are indexed by layer, input unit and output
unit, for simplicity here, just let Wj be any weight in the network (we will do the same
thing for all of them).

g_{t,j} = ∇_W J(W_{t−1})_j
G_{t,j} = γ G_{t−1,j} + (1 − γ) g_{t,j}^2
W_{t,j} = W_{t−1,j} − ( η / √(G_{t,j} + ϵ) ) g_{t,j}


The sequence G_{t,j} is a moving average of the square of the j-th component of the gradient. We square it in order to be insensitive to the sign—we want to know whether the magnitude is big or small. Then, we perform a gradient update to weight j, but divide the step size by √(G_{t,j} + ϵ), which is larger when the surface is steeper in direction j at point W_{t−1} in weight space; this means that the step size will be smaller when it's steep and larger when it's flat.

B.0.1.4 Adam
Adam has become the default method of managing step sizes in neural networks. It combines the ideas of momentum and adadelta. (Although, interestingly, it may actually violate the convergence conditions of SGD: arxiv.org/abs/1705.08292.) We start by writing moving averages of the gradient and squared gradient, which reflect estimates of the mean and variance of the gradient for weight j:

g_{t,j} = ∇_W J(W_{t−1})_j
m_{t,j} = B_1 m_{t−1,j} + (1 − B_1) g_{t,j}
v_{t,j} = B_2 v_{t−1,j} + (1 − B_2) g_{t,j}^2 .

A problem with these estimates is that, if we initialize m_0 = v_0 = 0, they will always be biased (slightly too small). So we will correct for that bias by defining

m̂_{t,j} = m_{t,j} / (1 − B_1^t)
v̂_{t,j} = v_{t,j} / (1 − B_2^t)
W_{t,j} = W_{t−1,j} − ( η / √(v̂_{t,j} + ϵ) ) m̂_{t,j} .

Note that B_1^t is B_1 raised to the power t, and likewise for B_2^t. To justify these corrections, note that if we were to expand m_{t,j} in terms of m_{0,j} and g_{0,j}, g_{1,j}, . . . , g_{t,j}, the coefficients would sum to 1. However, the coefficient behind m_{0,j} is B_1^t and since m_{0,j} = 0, the sum of coefficients of non-zero terms is 1 − B_1^t, hence the correction. The same justification holds for v_{t,j}.
Now, our update for weight j has a step size that takes the steepness into account, as in adadelta, but also tends to move in the same direction, as in momentum. The authors of this method propose setting B_1 = 0.9, B_2 = 0.999, ϵ = 10^{−8}. Although we now have even more parameters, Adam is not highly sensitive to their values (small changes do not have a huge effect on the result).
Study Question: Define m̂j directly as a moving average of gt,j . What is the decay
(γ parameter)?
Even though we now have a step-size for each weight, and we have to update various quantities on each iteration of gradient descent, it's relatively easy to implement by maintaining a matrix for each quantity (m_t^ℓ, v_t^ℓ, g_t^ℓ, (g_t^ℓ)^2) in each layer of the network.
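Here is a sketch of the Adam update for a weight vector, following the equations above (with ϵ inside the square root as written there, and a stand-in gradient function):

    import numpy as np

    # Sketch of the Adam update above for a weight vector W; grad_J stands in
    # for the gradient of the training objective.
    def adam_step(W, m, v, t, grad_J, eta=0.01, B1=0.9, B2=0.999, eps=1e-8):
        g = grad_J(W)
        m = B1 * m + (1 - B1) * g              # moving average of gradient
        v = B2 * v + (1 - B2) * g**2           # moving average of squared gradient
        m_hat = m / (1 - B1**t)                # bias corrections
        v_hat = v / (1 - B2**t)
        W = W - (eta / np.sqrt(v_hat + eps)) * m_hat
        return W, m, v

    W = np.array([1.0, -2.0, 3.0])
    m = np.zeros_like(W); v = np.zeros_like(W)
    for t in range(1, 201):
        W, m, v = adam_step(W, m, v, t, grad_J=lambda W: 2 * W)   # J(W) = ||W||^2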

B.0.2 Batch Normalization Details


Let’s think of the batch-norm layer as taking Zl as input and producing an output Z bl as
l l
output. But now, instead of thinking of Z as an n × 1 vector, we have to explicitly think
about handling a mini-batch of data of size K, all at once, so Zl will be nl × K, and so will
the output Zbl .


Our first step will be to compute the batchwise mean and standard deviation. Let µ^l be the n_l × 1 vector where
µ_i^l = (1/K) Σ_{j=1}^{K} Z_{ij}^l ,
and let σ^l be the n_l × 1 vector where
σ_i^l = √( (1/K) Σ_{j=1}^{K} (Z_{ij}^l − µ_i^l)^2 ) .
The basic normalized version of our data would be a matrix, element (i, j) of which is
Z̄_{ij}^l = (Z_{ij}^l − µ_i^l) / (σ_i^l + ϵ) ,

where ϵ is a very small constant to guard against division by zero. However, if we let these be our Ẑ^l values, we really are forcing something too strong on our data—our goal was to normalize across the data batch, but not necessarily force the output values to have exactly mean 0 and standard deviation 1. So, we will give the layer the “opportunity” to shift and scale the outputs by adding new weights to the layer. These weights are G^l and B^l, each of which is an n_l × 1 vector. Using the weights, we define the final output to be
Ẑ_{ij}^l = G_i^l Z̄_{ij}^l + B_i^l .

That’s the forward pass. Whew!


Now, for the backward pass, we have to do two things: given ∂L/∂Ẑ^l,
• Compute ∂L/∂Z^l for back-propagation, and
• Compute ∂L/∂G^l and ∂L/∂B^l for gradient updates of the weights in this layer.
Schematically (for simplicity we will drop the reference to the layer l in the rest of the derivation),
∂L/∂B = (∂L/∂Ẑ)(∂Ẑ/∂B) .
It’s hard to think about these derivatives in matrix terms, so we’ll see how it works for the
components. Bi contributes to Z bij for all data points j in the batch. So

∂L X ∂L ∂Zbij
=
∂Bi ∂Zij ∂Bi
b
j
X ∂L
= ,
∂Z
j
bij

Similarly, Gi contributes to Z
bij for all data points j in the batch. So

∂L X ∂L ∂Zbij
=
∂Gi bij ∂Gi
∂Z j
X ∂L
= Z .
bij ij
∂Z
j

Now, let’s figure out how to do backprop. We can start schematically:

∂L ∂L ∂Z
b
= .
∂Z b ∂Z
∂Z


And because dependencies only exist across the batch, but not across the unit outputs,
∂L/∂Z_{ij} = Σ_{k=1}^{K} (∂L/∂Ẑ_{ik}) (∂Ẑ_{ik}/∂Z_{ij}) .
The next step is to note that
∂Ẑ_{ik}/∂Z_{ij} = (∂Ẑ_{ik}/∂Z̄_{ik}) (∂Z̄_{ik}/∂Z_{ij})
               = G_i (∂Z̄_{ik}/∂Z_{ij}) .
And now note that
∂Z̄_{ik}/∂Z_{ij} = ( δ_{jk} − ∂µ_i/∂Z_{ij} ) (1/σ_i) − ( (Z_{ik} − µ_i)/σ_i^2 ) (∂σ_i/∂Z_{ij}) ,
where δ_{jk} = 1 if j = k and δ_{jk} = 0 otherwise. Getting close! We need two more small parts:
∂µ_i/∂Z_{ij} = 1/K ,
∂σ_i/∂Z_{ij} = (Z_{ij} − µ_i) / (K σ_i) .

Putting the whole crazy thing together, we get
∂L/∂Z_{ij} = Σ_{k=1}^{K} (∂L/∂Ẑ_{ik}) G_i (1/(K σ_i)) ( δ_{jk} K − 1 − (Z_{ik} − µ_i)(Z_{ij} − µ_i)/σ_i^2 ) .
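A sketch of the backward pass, vectorized over the batch; it implements the three results above (∂L/∂B, ∂L/∂G, ∂L/∂Z), neglecting ϵ as the derivation does, and reusing the cache from the forward-pass sketch:

    import numpy as np

    # dLdZhat has the same shape as Z (n_l x K); cache = (Zbar, mu, sigma)
    # from the forward-pass sketch above.
    def batchnorm_backward(dLdZhat, Z, G, cache):
        Zbar, mu, sigma = cache
        K = Z.shape[1]
        dLdB = dLdZhat.sum(axis=1, keepdims=True)              # sum over the batch
        dLdG = (dLdZhat * Zbar).sum(axis=1, keepdims=True)
        Zc = Z - mu
        dLdZ = (G / (K * sigma)) * (
            K * dLdZhat
            - dLdZhat.sum(axis=1, keepdims=True)
            - Zc / sigma**2 * (dLdZhat * Zc).sum(axis=1, keepdims=True)
        )
        return dLdZ, dLdG, dLdB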



APPENDIX C

Recurrent Neural Networks

So far, we have limited our attention to domains in which each output y is assumed to
have been generated as a function of an associated input x, and our hypotheses have been
“pure” functions, in which the output depends only on the input (and the parameters we
have learned that govern the function’s behavior). In the next few chapters, we are going
to consider cases in which our models need to go beyond functions. In particular, behavior
as a function of time will be an important concept:

• In recurrent neural networks, the hypothesis that we learn is not a function of a single
input, but of the whole sequence of inputs that the predictor has received.

• In reinforcement learning, the hypothesis is either a model of a domain (such as a game) as a recurrent system, or a policy which is a pure function, but whose loss is determined by the ways in which the policy interacts with the domain over time.

In this chapter, we introduce state machines. We start with deterministic state machines,
and then consider recurrent neural network (RNN) architectures to model their behavior.
Later, in Chapter 10, we will study Markov decision processes (MDPs) that extend to consider
probabilistic (rather than deterministic) transitions in our state machines. RNNs and MDPs
will enable description and modeling of temporally sequential patterns of behavior that are
important in many domains.

C.1 State machines


A state machine is a description of a process (computational, physical, economic) in terms of its potential sequences of states. (This is such a pervasive idea that it has been given many names in many subareas of computer science, control theory, physics, etc., including: automaton, transducer, dynamical system, etc.)
The state of a system is defined to be all you would need to know about the system to predict its future trajectories as well as possible. It could be the position and velocity of an object or the locations of your pieces on a game board, or the current traffic densities on a highway network.
Formally, we define a state machine as (S, X, Y, s_0, f_s, f_o) where
• S is a finite or infinite set of possible states;

• X is a finite or infinite set of possible inputs;


• Y is a finite or infinite set of possible outputs;

• s0 ∈ S is the initial state of the machine;

• f_s : S × X → S is a transition function, which takes an input and a previous state and produces a next state;
• f_o : S → Y is an output function, which takes a state and produces an output.

The basic operation of the state machine is to start with state s_0, then iteratively compute for t ⩾ 1 (in some cases, we will pick a starting state from a set or distribution):
s_t = f_s(s_{t−1}, x_t)    (C.1)
y_t = f_o(s_t)    (C.2)

The diagram below illustrates this process. Note that the “feedback” connection of s_t back into f_s has to be buffered or delayed by one time step—otherwise what it is computing would not generally be well defined.
(Diagram: the input x_t and the delayed state s_{t−1} feed into f_s, which produces s_t; s_t feeds into f_o, which produces y_t.)

So, given a sequence of inputs x_1, x_2, . . . the machine generates a sequence of outputs
f_o(f_s(s_0, x_1)), f_o(f_s(f_s(s_0, x_1), x_2)), . . . ,
the first of these being y_1, the second y_2, and so on.

We sometimes say that the machine transduces sequence x into sequence y. The output at time t can have dependence on inputs from steps 1 to t. (There are a huge number of major and minor variations on the idea of a state machine. We'll just work with one specific one in this section and another one in the next, but don't worry if you see other variations out in the world!)
One common form is finite state machines, in which S, X, and Y are all finite sets. They are often described using state transition diagrams such as the one below, in which nodes stand for states and arcs indicate transitions. Nodes are labeled by which output they generate and arcs are labeled by which input causes the transition. (All computers can be described, at the digital level, as finite state machines. Big, but finite!)
One can verify that the state machine below reads binary strings and determines the parity of the number of zeros in the given string. Check for yourself that all input binary strings end in state S1 if and only if they contain an even number of zeros.
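Since the transition diagram itself does not reproduce well here, the following sketch implements that parity machine directly in code; the text only names state S1 (an even number of zeros seen so far), so the label S2 for the other state is our own:

    # States: S1 = even number of zeros seen so far, S2 = odd number.
    # Inputs are the characters '0' and '1'; the output is the current state.
    f_s = {('S1', '0'): 'S2', ('S1', '1'): 'S1',
           ('S2', '0'): 'S1', ('S2', '1'): 'S2'}
    f_o = lambda s: s                      # output function: just the state label

    def transduce(x, s0='S1'):
        s, outputs = s0, []
        for x_t in x:
            s = f_s[(s, x_t)]              # s_t = f_s(s_{t-1}, x_t)
            outputs.append(f_o(s))         # y_t = f_o(s_t)
        return outputs

    assert transduce('1001')[-1] == 'S1'   # two zeros: even, ends in S1
    assert transduce('10')[-1] == 'S2'     # one zero: odd, ends in S2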


Another common structure that is simple but powerful and used in signal processing
and control is linear time-invariant (LTI) systems. In this case, all the quantities are real-
valued vectors: S = Rm , X = Rl and Y = Rn . The functions fs and fo are linear functions of
their inputs. The transition function is described by the state matrix A and the input matrix
B; the output function is defined by the output matrix C, each with compatible dimensions.
In discrete time, they can be defined by a linear difference equation, like

st = fs (st−1 , xt ) = Ast−1 + Bxt , (C.3)


yt = fo (st ) = Cst , (C.4)

and can be implemented using state to store relevant previous input and output informa-
tion. We will study recurrent neural networks which are a lot like a non-linear version of an
LTI system.

C.2 Recurrent neural networks


In Chapter 6, we studied neural networks and how the weights of a network can be ob-
tained by training on data, so that the neural network will model a function that approx-
imates the relationship between the (x, y) pairs in a supervised-learning training set. In
Section C.1 above, we introduced state machines to describe sequential temporal behav-
ior. Here in Section C.2, we explore recurrent neural networks by defining the architecture
and weight matrices in a neural network to enable modeling of such state machines. Then,
in Section C.3, we present a loss function that may be employed for training sequence to
sequence RNNs, and then consider application to language translation and recognition in
Section C.4. In Section C.5, we’ll see how to use gradient-descent methods to train the
weights of an RNN so that it performs a transduction that matches as closely as possible a
training set of input-output sequences.

A recurrent neural network is a state machine with neural networks constituting functions
fs and fo :

s_t = f_s(W^{sx} x_t + W^{ss} s_{t−1} + W_0^{ss})    (C.5)
y_t = f_o(W^{o} s_t + W_0^{o}) .    (C.6)

The inputs, states, and outputs are all vector-valued:

x_t : ℓ × 1    (C.7)
s_t : m × 1    (C.8)
y_t : v × 1 .   (C.9)
The weights in the network, then, are
W^{sx} : m × ℓ    (C.10)
W^{ss} : m × m    (C.11)
W_0^{ss} : m × 1   (C.12)
W^{o} : v × m     (C.13)
W_0^{o} : v × 1    (C.14)

with activation functions fs and fo .


Study Question: Check dimensions here to be sure it all works out. Remember that
we apply fs and fo elementwise, unless fo is a softmax activation.
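A sketch of one step of this RNN, with made-up sizes and weights, tanh for f_s, and (for concreteness) an identity output activation:

    import numpy as np

    # One step of the RNN above (equations C.5-C.6), with made-up sizes.
    l, m, v = 4, 3, 2
    rng = np.random.default_rng(6)
    W_sx, W_ss, W_ss0 = rng.normal(size=(m, l)), rng.normal(size=(m, m)), np.zeros((m, 1))
    W_o, W_o0 = rng.normal(size=(v, m)), np.zeros((v, 1))
    f_s, f_o = np.tanh, lambda z: z          # identity output, e.g., for regression

    def rnn_step(x_t, s_prev):
        s_t = f_s(W_sx @ x_t + W_ss @ s_prev + W_ss0)
        y_t = f_o(W_o @ s_t + W_o0)
        return s_t, y_t

    s = np.zeros((m, 1))
    for x_t in rng.normal(size=(5, l, 1)):   # a length-5 input sequence
        s, y = rnn_step(x_t, s)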


C.3 Sequence-to-sequence RNN


Now, how can we set up an RNN to model and be trained to produce a transduction of one
sequence to another? This problem is sometimes called sequence-to-sequence mapping. You
can think of it as a kind of regression problem: given an input sequence, learn to generate
the corresponding output sequence.
A training set has the form ((x^(1), y^(1)), . . . , (x^(q), y^(q))), where
• x^(i) and y^(i) are length n^(i) sequences;
• sequences in the same pair are the same length; and sequences in different pairs may have different lengths.
(One way to think of training a sequence classifier is to reduce it to a transduction problem, where y_t = 1 if the sequence x_1, . . . , x_t is a positive example of the class of sequences and −1 otherwise.)
Next, we need a loss function. We start by defining a loss function on sequences. There are many possible choices, but usually it makes sense just to sum up a per-element loss function on each of the output values, where g is the predicted sequence and y is the actual one:
L_seq(g^(i), y^(i)) = Σ_{t=1}^{n^(i)} L_elt(g_t^(i), y_t^(i)) .    (C.15)
The per-element loss function L_elt will depend on the type of y_t and what information it is encoding, in the same way as for a supervised network. (So it could be NLL, squared loss, etc.)
Then, letting W = (W^{sx}, W^{ss}, W^{o}, W_0^{ss}, W_0^{o}), our overall goal is to minimize the objective
J(W) = (1/q) Σ_{i=1}^{q} L_seq( RNN(x^(i); W), y^(i) ) ,    (C.16)
where RNN(x; W) is the output sequence generated, given input sequence x.


It is typical to choose f_s to be tanh (remember that it looks like a sigmoid but ranges from −1 to +1), but any non-linear activation function is usable. We choose f_o to align with the types of our outputs and the loss function, just as we would do in regular supervised learning.
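As a sketch, with squared loss standing in for L_elt, equations C.15 and C.16 amount to:

    import numpy as np

    # Squared loss as the per-element loss (one of the choices mentioned above).
    def L_elt(g_t, y_t):
        return float(np.sum((g_t - y_t) ** 2))

    def L_seq(g, y):
        return sum(L_elt(g_t, y_t) for g_t, y_t in zip(g, y))

    def J(W, data, rnn):
        # data is a list of (x, y) sequence pairs; rnn(x, W) returns the output sequence
        return sum(L_seq(rnn(x, W), y) for x, y in data) / len(data)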

C.4 RNN as a language model


A language model is a sequence-to-sequence RNN which is trained on a token sequence of the form c = (c_1, c_2, . . . , c_k), and is used to predict the next token c_t, t ⩽ k, given a sequence of the previous (t − 1) tokens. (A “token” is generally a character, common word fragment, or a word.)
c_t = RNN((c_1, c_2, . . . , c_{t−1}); W)    (C.17)

We can convert this to a sequence-to-sequence training problem by constructing a data


set of q different (x, y) sequence pairs, where we make up new special tokens, start and
end, to signal the beginning and end of the sequence:

x = (⟨start⟩, c_1, c_2, . . . , c_k)    (C.18)
y = (c_1, c_2, . . . , ⟨end⟩)    (C.19)

C.5 Back-propagation through time


Now the fun begins! We can now try to find a W to minimize J using gradient descent. We
will work through the simplest method, back-propagation through time (BPTT), in detail. This
is generally not the best method to use, but it’s relatively easy to understand. In Section C.6
we will sketch alternative methods that are in much more common use.


What we want you to take away from this section is that, by “unrolling” a recurrent
network out to model a particular sequence, we can treat the whole thing as a feed-
forward network with a lot of parameter sharing. Thus, we can tune the parameters
using stochastic gradient descent, and learn to model sequential mappings. The
concepts here are very important. While the details are important to get right if you
need to implement something, we present the mathematical details below primarily
to convey or explain the larger concepts.

Calculus reminder: total derivative Most of us are not very careful about the differ-
ence between the partial derivative and the total derivative. We are going to use a nice
example from the Wikipedia article on partial derivatives to illustrate the difference.
The volume of a circular cone depends on its height and radius:

V(r, h) = π r^2 h / 3 .    (C.20)
The partial derivatives of volume with respect to height and radius are

∂V/∂r = 2πrh/3   and   ∂V/∂h = πr^2/3 .    (C.21)
They measure the change in V assuming everything is held constant except the
single variable we are changing. Now assume that we want to preserve the cone’s
proportions in the sense that the ratio of radius to height stays constant. Then we
can’t really change one without changing the other. In this case, we really have to
think about the total derivative. If we’re interested in the total derivative with respect
to r, we sum the “paths” along which r might influence V:

dV/dr = ∂V/∂r + (∂V/∂h)(dh/dr)    (C.22)
      = 2πrh/3 + (πr^2/3)(dh/dr)    (C.23)
Or if we’re interested in the total derivative with respect to h, we consider how h
might influence V, either directly or via r:

dV/dh = ∂V/∂h + (∂V/∂r)(dr/dh)    (C.24)
      = πr^2/3 + (2πrh/3)(dr/dh)    (C.25)
Just to be completely concrete, let's think of a right circular cone with a fixed angle α between its side and its axis, so that tan α = r/h, and if we change r or h then α remains constant. So we have r = h tan α; let constant c = tan α, so now r = ch. Thus, we finally have
dV/dr = 2πrh/3 + (πr^2/3)(1/c)    (C.26)
dV/dh = πr^2/3 + (2πrh/3) c .    (C.27)

The BPTT process goes like this:

(1) Sample a training pair of sequences (x, y); let their length be n.


(2) “Unroll" the RNN to be length n (picture for n = 3 below), and initialize s0 :

Now, we can see our problem as one of performing what is almost an ordinary back-
propagation training procedure in a feed-forward neural network, but with the dif-
ference that the weight matrices are shared among the layers. In many ways, this is
similar to what ends up happening in a convolutional network, except in the conv-
net, the weights are re-used spatially, and here, they are re-used temporally.

(3) Do the forward pass, to compute the predicted output sequence g:

z_t^1 = W^{sx} x_t + W^{ss} s_{t−1} + W_0^{ss}    (C.28)
s_t = f_s(z_t^1)    (C.29)
z_t^2 = W^{o} s_t + W_0^{o}    (C.30)
g_t = f_o(z_t^2)    (C.31)

(4) Do backward pass to compute the gradients. For both W^{ss} and W^{sx} we need to find
dL_seq(g, y)/dW = Σ_{u=1}^{n} dL_elt(g_u, y_u)/dW

Letting L_u = L_elt(g_u, y_u) and using the total derivative, which is a sum over all the ways in which W affects L_u, we have
    = Σ_{u=1}^{n} Σ_{t=1}^{n} (∂s_t/∂W)(∂L_u/∂s_t) .
Re-organizing, we have
    = Σ_{t=1}^{n} (∂s_t/∂W) Σ_{u=1}^{n} (∂L_u/∂s_t) .


Because s_t only affects L_t, L_{t+1}, . . . , L_n,
    = Σ_{t=1}^{n} (∂s_t/∂W) Σ_{u=t}^{n} (∂L_u/∂s_t)
    = Σ_{t=1}^{n} (∂s_t/∂W) [ ∂L_t/∂s_t + Σ_{u=t+1}^{n} ∂L_u/∂s_t ] ,    (C.32)
where we write δ^{s_t} for the second term in the brackets: the dependence of the future loss (incurred after step t) on the state s_t. (That is, δ^{s_t} is how much we can blame state s_t for all the future element losses.)
We can compute this backwards, with t going from n down to 1. The trickiest part is figuring out how early states contribute to later losses. We define the future loss after step t to be
F_t = Σ_{u=t+1}^{n} L_elt(g_u, y_u) ,    (C.33)
so
δ^{s_t} = ∂F_t/∂s_t .    (C.34)
At the last stage, F_n = 0 so δ^{s_n} = 0.
Now, working backwards,
δ^{s_{t−1}} = ∂/∂s_{t−1} Σ_{u=t}^{n} L_elt(g_u, y_u)    (C.35)
           = (∂s_t/∂s_{t−1}) ∂/∂s_t Σ_{u=t}^{n} L_elt(g_u, y_u)    (C.36)
           = (∂s_t/∂s_{t−1}) [ ∂/∂s_t L_elt(g_t, y_t) + ∂/∂s_t Σ_{u=t+1}^{n} L_elt(g_u, y_u) ]    (C.37)
           = (∂s_t/∂s_{t−1}) [ ∂L_elt(g_t, y_t)/∂s_t + δ^{s_t} ]    (C.38)

Now, we can use the chain rule again to find the dependence of the element loss at time t on the state at that same time,
∂L_elt(g_t, y_t)/∂s_t = (∂z_t^2/∂s_t)(∂L_elt(g_t, y_t)/∂z_t^2) ,    (C.39)
with shapes (m × 1) = (m × v)(v × 1), and the dependence of the state at time t on the state at the previous time,
∂s_t/∂s_{t−1} = (∂z_t^1/∂s_{t−1})(∂s_t/∂z_t^1) = W^{ss T} (∂s_t/∂z_t^1) ,    (C.40)
with shapes (m × m) = (m × m)(m × m).


Note that ∂s_t/∂z_t^1 is formally an m × m diagonal matrix, with the values along the diagonal being f_s′(z_{t,i}^1), 1 ⩽ i ⩽ m. But since this is a diagonal matrix, one could represent it as an m × 1 vector f_s′(z_t^1). In that case the product of the matrix W^{ss T} by the vector f_s′(z_t^1), denoted W^{ss T} ∗ f_s′(z_t^1), should be interpreted as follows: take the first column of the matrix W^{ss T} and multiply each of its elements by the first element of the vector ∂s_t/∂z_t^1, then take the second column of the matrix W^{ss T} and multiply each of its elements by the second element of the vector ∂s_t/∂z_t^1, and so on and so forth ...

Putting this all together, we end up with
δ^{s_{t−1}} = W^{ss T} (∂s_t/∂z_t^1) [ W^{o T} (∂L_t/∂z_t^2) + δ^{s_t} ] ,    (C.41)
where the first two factors form ∂s_t/∂s_{t−1} and the bracketed sum is ∂F_{t−1}/∂s_t.

We’re almost there! Now, we can describe the actual weight updates. Using Eq. C.32
and recalling the definition of δst = ∂Ft /∂st , as we iterate backwards, we can accu-
mulate the terms in Eq. C.32 to get the gradient for the whole loss.

dLseq X dLelt (gt , yt ) X ∂z1 ∂st ∂Ft−1


n n
t
= = (C.42)
dW ss dW ss ∂W ss ∂z1t ∂st
t=1 t=1

dLseq Xn
dLelt (gt , yt ) Xn
∂z1t ∂st ∂Ft−1
= = (C.43)
dW sx dW sx ∂W sx ∂z1t ∂st
t=1 t=1

We can handle W^{o} separately; it's easier because it does not affect future losses in the way that the other weight matrices do:
dL_seq/dW^{o} = Σ_{t=1}^{n} dL_t/dW^{o} = Σ_{t=1}^{n} (∂L_t/∂z_t^2)(∂z_t^2/∂W^{o})    (C.44)
Assuming we have ∂L_t/∂z_t^2 = (g_t − y_t) (which ends up being true for squared loss, softmax-NLL, etc.), then
dL_seq/dW^{o} = Σ_{t=1}^{n} (g_t − y_t) s_t^T ,    (C.45)
where dL_seq/dW^{o} is v × m, (g_t − y_t) is v × 1, and s_t^T is 1 × m.

Whew!

Study Question: Derive the updates for the offsets W0ss and W0o .
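Here is a compact sketch of the whole BPTT loop for a single training sequence, assuming f_s = tanh, an identity f_o, squared loss (so ∂L_t/∂z_t^2 = g_t − y_t as above), and no offsets:

    import numpy as np

    # BPTT for one sequence. x is a list of (l x 1) inputs, y a list of (v x 1)
    # targets; W_sx is m x l, W_ss is m x m, W_o is v x m.
    def bptt(x, y, W_sx, W_ss, W_o):
        n, m = len(x), W_ss.shape[0]
        # Forward pass, storing intermediate quantities (step 3).
        s, g = [np.zeros((m, 1))], []
        for t in range(n):
            s.append(np.tanh(W_sx @ x[t] + W_ss @ s[-1]))
            g.append(W_o @ s[-1])
        # Backward pass (step 4): delta holds dF_t/ds_t, from t = n down to 1.
        dW_sx, dW_ss, dW_o = np.zeros_like(W_sx), np.zeros_like(W_ss), np.zeros_like(W_o)
        delta = np.zeros((m, 1))                       # delta^{s_n} = 0
        for t in range(n - 1, -1, -1):
            dLdz2 = g[t] - y[t]                        # dL_t/dz2_t for squared loss
            dW_o += dLdz2 @ s[t + 1].T                 # equation C.45
            dFprev_ds = W_o.T @ dLdz2 + delta          # dF_{t-1}/ds_t
            dz1 = (1 - s[t + 1] ** 2) * dFprev_ds      # ds_t/dz1_t is diagonal: tanh'
            dW_ss += dz1 @ s[t].T                      # accumulate C.42
            dW_sx += dz1 @ x[t].T                      # accumulate C.43
            delta = W_ss.T @ dz1                       # equation C.41
        return dW_sx, dW_ss, dW_o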

C.6 Vanishing gradients and gating mechanisms


Let’s take a careful look at the backward propagation of the gradient along the sequence:
 
st−1 ∂st ∂Lelt (gt , yt ) st
δ = +δ . (C.46)
∂st−1 ∂st


Consider a case where only the output at the end of the sequence is incorrect, but it depends critically, via the weights, on the input at time 1. In this case, we will multiply the loss at step n by
(∂s_2/∂s_1)(∂s_3/∂s_2) · · · (∂s_n/∂s_{n−1}) .    (C.47)
In general, this quantity will either grow or shrink exponentially with the length of the
sequence, and make it very difficult to train.
Study Question: The last time we talked about exploding and vanishing gradients, it
was to justify per-weight adaptive step sizes. Why is that not a solution to the prob-
lem this time?
An important insight that really made recurrent networks work well on long sequences
is the idea of gating.

C.6.1 Simple gated recurrent networks


A computer only ever updates some parts of its memory on each computation cycle. We
can take this idea and use it to make our networks more able to retain state values over time
and to make the gradients better-behaved. We will add a new component to our network,
called a gating network. Let gt be a m × 1 vector of values and let W gx and W gs be m × l
and m × m weight matrices, respectively. We will compute gt as It can have an offset,
too, but we are omitting
gt = sigmoid(W gx xt + W gs st−1 ) (C.48) it for simplicity.

and then change the computation of st to be

s_t = (1 − g_t) ∗ s_{t−1} + g_t ∗ f_s(W^{sx} x_t + W^{ss} s_{t−1} + W_0^{ss}) ,    (C.49)

where ∗ is component-wise multiplication. We can see, here, that the output of the gating
network is deciding, for each dimension of the state, how much it should be updated now.
This mechanism makes it much easier for the network to learn to, for example, “store”
some information in some dimension of the state, and then not change it during future
state updates, or change it only under certain conditions on the input or other aspects of
the state.
Study Question: Why is it important that the activation function for g be a sigmoid?
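A sketch of this gated update, using tanh for f_s as is typical and omitting offsets as above:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    # One gated state update (equations C.48-C.49); weight shapes follow the text.
    def gated_step(x_t, s_prev, W_gx, W_gs, W_sx, W_ss, W_ss0):
        g_t = sigmoid(W_gx @ x_t + W_gs @ s_prev)                  # per-dimension gate in (0, 1)
        candidate = np.tanh(W_sx @ x_t + W_ss @ s_prev + W_ss0)    # proposed new state
        s_t = (1 - g_t) * s_prev + g_t * candidate                 # keep vs. update, per dimension
        return s_t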

C.6.2 Long short-term memory


The idea of gating networks can be applied to make a state machine that is even more like a computer memory, resulting in a type of network called an LSTM for “long short-term memory.” (Yet another awesome name for a neural network!) We won't go into the details here, but the basic idea is that there is a memory cell (really, our state vector) and three (!) gating networks. The input gate selects (using a “soft” selection as in the gated network above) which dimensions of the state will be updated with new values; the forget gate decides which dimensions of the state will have their old values moved toward 0, and the output gate decides which dimensions of the state will be used to compute the output value. These networks have been used in applications like language translation with really amazing results. A diagram of the architecture is shown below:
shown below:



APPENDIX D

Supervised learning in a nutshell

In which we try to describe the outlines of the “lifecycle” of supervised learning, including
hyperparameter tuning and evaluation of the final product.

D.1 General case


We start with a very generic setting.

D.1.1 Minimal problem specification


Given:
• Space of inputs X
• Space of outputs Y
• Space of possible hypotheses H such that each h ∈ H is a function h : X → Y
• Loss function L : Y × Y → R
a supervised learning algorithm A takes as input a data set of the form
D = ( (x^(1), y^(1)), . . . , (x^(n), y^(n)) ) ,
where x^(i) ∈ X and y^(i) ∈ Y, and returns an h ∈ H.

D.1.2 Evaluating a hypothesis


Given a problem specification and a set of data D, we evaluate hypothesis h according to
average loss, or error,
E(h, L, D) = (1/|D|) Σ_{i=1}^{|D|} L(h(x^(i)), y^(i))

If the data used for evaluation were not used during learning of the hypothesis then this is
a reasonable estimate of how well the hypothesis will make additional predictions on new
data from the same source.


D.1.3 Evaluating a supervised learning algorithm


A validation strategy V takes an algorithm A, a loss function L, and a data source D and pro-
duces a real number which measures how well A performs on data from that distribution.

D.1.3.1 Using a validation set


In the simplest case, we can divide D into two sets, Dtrain and Dval , train on the first, and
then evaluate the resulting hypothesis on the second. In that case,

V(A, L, D) = E(A(Dtrain ), L, Dval ) .

D.1.3.2 Using multiple training/evaluation runs


We can’t reliably evaluate an algorithm based on a single application of it to a single train-
ing and test set, because there are many aspects of the training and testing data, as well
as, sometimes, randomness in the algorithm itself, that cause variance in the performance
of the algorithm. To get a good idea of how well an algorithm performs, we need to, mul-
tiple times, train it and evaluate the resulting hypothesis, and report the average over K
executions of the algorithm of the error of the hypothesis it produced each time.
We divide the data into 2K random non-overlapping subsets: D_1^train, D_1^val, . . . , D_K^train, D_K^val. Then,
V(A, L, D) = (1/K) Σ_{k=1}^{K} E(A(D_k^train), L, D_k^val) .

D.1.3.3 Cross validation


In cross validation, we do a similar computation, but allow data to be re-used in the K dif-
ferent iterations of training and testing the algorithm (but never share training and testing
data for a single iteration!). See Section 2.7.2.2 for details.

D.1.4 Comparing supervised learning algorithms


Now, if we have two different algorithms A_1 and A_2, we might be interested in knowing which one will produce hypotheses that generalize the best, using data from a particular source. We could compute V(A_1, L, D) and V(A_2, L, D), and prefer the algorithm with lower validation error. More generally, given algorithms A_1, . . . , A_M, we would prefer
A* = arg min_m V(A_m, L, D) .

D.1.5 Fielding a hypothesis


Now what? We have to deliver a hypothesis to our customer. We now know how to find
the algorithm, A∗ , that works best for our type of data. We can apply it to all of our data to
get the best hypothesis we know how to create, which would be

h∗ = A∗ (D) ,

and deliver this resulting hypothesis as our best product.


D.1.6 Learning algorithms as optimizers


A majority of learning algorithms have the form of optimizing some objective involving the training data and a loss function. (Interestingly, this loss function is not always the same as the loss function that is used for evaluation! We will see this in logistic regression.)
So for example (assuming a perfect optimizer, which doesn't, of course, exist) we might say our algorithm is to solve an optimization problem:
A(D) = arg min_{h∈H} J(h; D) .
Our objective often has the form
J(h; D) = E(h, L, D) + R(h) ,
where L is a loss to be minimized during training and R is a regularization term.

D.1.7 Hyperparameters
Often, rather than comparing an arbitrary collection of learning algorithms, we think of
our learning algorithm as having some parameters that affect the way it maps data to a
hypothesis. These are not parameters of the hypothesis itself, but rather parameters of the
algorithm. We call these hyperparameters. A classic example would be to use a hyperparam-
eter λ to govern the weight of a regularization term on an objective to be optimized:
J(h; D) = E(h, L, D) + λR(h) .
Then we could think of our algorithm as A(D; λ). Picking a good value of λ is the same as
comparing different supervised learning algorithms, which is accomplished by validating
them and picking the best one!

D.2 Concrete case: linear regression


In linear regression the problem formulation is this:
• X = Rd
• Y=R
• H = {θT x + θ0 } for values of parameters θ ∈ Rd and θ0 ∈ R.
• L(g, y) = (g − y)2
Our learning algorithm has hyperparameter λ and can be written as:
A(D; λ) = Θ*(λ, D) = arg min_{θ,θ_0} (1/|D|) Σ_{(x,y)∈D} (θ^T x + θ_0 − y)^2 + λ∥θ∥^2 .
For a particular training data set and parameter λ, it finds the best hypothesis on this data, specified with parameters Θ = (θ, θ_0), written Θ*(λ, D).
Picking the best value of the hyperparameter is choosing among learning algorithms. We could, most simply, optimize using a single training / validation split, so D = D^train ∪ D^val, and
λ* = arg min_λ V(A_λ, L, D^val)
   = arg min_λ E(Θ*(λ, D^train), mse, D^val)
   = arg min_λ (1/|D^val|) Σ_{(x,y)∈D^val} ( θ*(λ, D^train)^T x + θ*_0(λ, D^train) − y )^2


It would be much better to select the best λ using multiple runs or cross-validation; that would just be a different choice of the validation procedure V in the top line.
Note that we don't use regularization here because we just want to measure how good the output of the algorithm is at predicting values of new points, and so that's what we measure. We use the regularizer during training when we don't want to focus only on optimizing predictions on the training data.
Finally! To make a predictor to ship out into the world, we would use all the data we have, D, to train, using the best hyperparameters we know, and return
Θ* = A(D; λ*)
   = Θ*(λ*, D)
   = arg min_{θ,θ_0} (1/|D|) Σ_{(x,y)∈D} (θ^T x + θ_0 − y)^2 + λ* ∥θ∥^2

Finally, a customer might evaluate this hypothesis on their data, which we have never seen during training or validation, as
E_test = E(Θ*, mse, D^test)
       = (1/|D^test|) Σ_{(x,y)∈D^test} (θ*^T x + θ*_0 − y)^2

Here are the same ideas, written out in informal pseudocode.


# returns theta_best(D, lam)
# (lambda is a reserved word in Python, so we write lam for the hyperparameter)
def train(D, lam):
    return minimize(mse(theta, D) + lam * norm(theta)**2, theta)

# returns lambda_best using very simple validation
def simple_tune(D_train, D_val, possible_lambda_vals):
    scores = [mse(train(D_train, lam), D_val)
              for lam in possible_lambda_vals]
    return possible_lambda_vals[argmin(scores)]

# returns theta_best overall
def theta_best(D_train, D_val, possible_lambda_vals):
    return train(D_train + D_val,
                 simple_tune(D_train, D_val, possible_lambda_vals))

# customer evaluation of the theta delivered to them
def customer_val(theta):
    return mse(theta, D_test)

D.3 Concrete case: logistic regression


In binary logistic regression the problem formulation is as follows. We are writing the class
labels as 1 and 0.
• X = Rd


• Y = {+1, 0}

• H = {σ(θT x + θ0 )} for values of parameters θ ∈ Rd and θ0 ∈ R.

• L(g, y) = L_01(g, y)

• Proxy loss Lnll (g, y) = − (y log(g) + (1 − y) log(1 − g))

Our learning algorithm has hyperparameter λ and can be written as:
A(D; λ) = Θ*(λ, D) = arg min_{θ,θ_0} (1/|D|) Σ_{(x,y)∈D} L_nll(σ(θ^T x + θ_0), y) + λ∥θ∥^2 .

For a particular training data set and parameter λ, it finds the best hypothesis on this data,
specified with parameters Θ = (θ, θ0 ), written Θ∗ (λ, D) according to the proxy loss Lnll .
Picking the best value of the hyperparameter is choosing among learning algorithms based on their actual predictions. We could, most simply, optimize using a single training / validation split, so D = D^train ∪ D^val, and we use the real 01 loss:
λ* = arg min_λ V(A_λ, L_01, D^val)
   = arg min_λ E(Θ*(λ, D^train), L_01, D^val)
   = arg min_λ (1/|D^val|) Σ_{(x,y)∈D^val} L_01( σ(θ*(λ, D^train)^T x + θ*_0(λ, D^train)), y )

It would be much better to select the best λ using multiple runs or cross-validation; that would just be a different choice of the validation procedure V in the top line.
Finally! To make a predictor to ship out into the world, we would use all the data we
have, D, to train, using the best hyperparameters we know, and return

Θ∗ = A(D; λ∗ )

Study Question: What loss function is being optimized inside this algorithm?
Finally, a customer might evaluate this hypothesis on their data, which we have never
seen during training or validation, as

Etest = E(Θ∗ , L01 , Dtest )

The customer just wants to buy the right stocks! So we use the real L01 here for validation.



Index

activation function, 52
hyperbolic tangent, 53
ReLU, 53
sigmoid, 53
softmax, 53
step function, 52
active learning, 102
adaptive step size, 60
adadelta, 119
adam, 120
momentum, 118
running average, 118
autoencoder, 103
variational, 111

backpropagation through time, 126


bagging, 85
bandit problem
contextual bandit problem, 102
k-armed bandit, 102
basis functions
polynomial basis, 42
radial basis, 45
boosting, 78

classification, 29
binary classification, 29
clustering, 103
evaluation, 108
into partitions, 104
k-means algorithm, 105
convolution, 64
convolutional neural network, 64
backpropagation, 68
cross validation, 22


data
training data, 6
distribution, 11
independent and identically distributed, 6
dynamic programming, 93

empirical probability, 83
ensemble models, 78
error
test error, 11
training error, 11
error back-propagation, 57
estimation, 6
estimation error, 21
expectation, 89
experience replay, 100
exploding (or vanishing) gradients, 60

features, 13
filter, 64
channels, 65
filter size, 66
finite state machine, 124
fitted Q-learning, 100

gating network, 131


generative networks, 111
gradient descent, 23
applied to k-means, 106
applied to logistic regression, 38
applied to neural network training, 54
applied to regression, 26
applied to ridge regression, 27
convergence, 24
learning rate, 23
stopping criteria, 24

hand-built features
binary code, 47
factored, 47
numeric, 46
one-hot, 47
text encoding, 47
thermometer, 46
hierarchical clustering, 108
hyperparameter
hyperparameter tuning, 22
hyperplane, 15
hypothesis, 10
linear regression, 15

k-means
algorithm, 106


formulation, 105
in feature space, 108
initialization, 107
kernel methods, 46

language model, 126


latent representation, 109
learning
from data, 6
learning algorithm, 21
linear classifier, 30
hypothesis class, 30
linear logistic classifier, 33
linear time-invariant systems, 125
linearly separable, 31
logistic linear classifier
prediction threshold, 34
long short-term memory, 131
loss function, 10

Machine Learning, 6
Markov decision process, 87
max pooling, 66
model, 7
learned model, 10
model class, 7, 12
no model, 11
prediction rule, 11

nearest neighbor models, 78


negative log-likelihood, 35
loss function, 36
multi-class classification, 39
neural network, 50
batch normalization, 62
dropout, 61
feed-forward network, 51
fully connected, 51
initialization of weights, 58
layers, 51
loss and activation function choices, 54
output units, 51
regularization, 61
training, 58
weight decay, 61
neuron, 50
non-parametric methods, 78

optimal action-value function, 91


output function, 124

parameter, 11
policy, 89


finite horizon, 89
infinite horizon, 89
discount factor, 90
optimal policy, 91
reinforcement learning, 95
stationary policy, 93

Q-function, 91

real numbers, 13
recurrent neural network, 123, 125
gating, 131
regression, 13
linear regression, 15
random regression, 16
ridge regression, 20
regularizer, 14
reinforcement
actor-critic methods, 99
reinforcement learning
exploration vs. exploitation, 102
policy, 95
reinforcement learning, 87, 95, 123
Laplace correction, 101
RL algorithm, 96
reward function, 87, 101

separator, 30
linear separator, 31
sequence-to-sequence mapping, 126
sign function, 30
sliding window, 100
spatial locality, 63
state machine, 123
state transition diagram, 124
stochastic gradient descent, 28
convergence, 28
structural error, 21
supervised learning, 7

tensor, 65
total derivative, 127
transduction, 126
transition function, 124
transition model, 87
translation invariance, 63
tree models
information gain, 84
tree models, 78
building a tree, 82
classification, 83


cost complexity, 83
pruning, 83
regression tree, 81
splitting criteria, 84
entropy, 84
Gini index, 84
misclassification error, 84

value function
calculating value function, 91
value function, 91

weight sharing, 65
