Dive into Deep Learning
Release 1.0.0-alpha0
Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola
Aug 19, 2022

Contents
Preface 1
Installation 9
Notation 11
1 Introduction 15
1.1 A Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2 Key Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3 Kinds of Machine Learning Problems . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4 Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.5 The Road to Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.6 Success Stories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
1.7 The Essence of Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2 Preliminaries 41
2.1 Data Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.1.1 Getting Started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.1.2 Indexing and Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.1.3 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.1.4 Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.1.5 Saving Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.1.6 Conversion to Other Python Objects . . . . . . . . . . . . . . . . . . . . . 47
2.1.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2.1 Reading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.2.2 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.2.3 Conversion to the Tensor Format . . . . . . . . . . . . . . . . . . . . . . . 50
2.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.3 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.1 Scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.2 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.3.3 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.3.4 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.3.5 Basic Properties of Tensor Arithmetic . . . . . . . . . . . . . . . . . . . . 55
2.3.6 Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.3.7 Non-Reduction Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.3.8 Dot Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.3.9 Matrix-Vector Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.3.10 Matrix-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.3.11 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.3.12 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.4 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.4.1 Derivatives and Differentiation . . . . . . . . . . . . . . . . . . . . . . . . 63
2.4.2 Visualization Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.4.3 Partial Derivatives and Gradients . . . . . . . . . . . . . . . . . . . . . . . 67
2.4.4 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.5 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.5.1 A Simple Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.5.2 Backward for Non-Scalar Variables . . . . . . . . . . . . . . . . . . . . . . 70
2.5.3 Detaching Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.5.4 Gradients and Python Control Flow . . . . . . . . . . . . . . . . . . . . . 71
2.5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.6 Probability and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.6.1 A Simple Example: Tossing Coins . . . . . . . . . . . . . . . . . . . . . . 74
2.6.2 A More Formal Treatment . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.6.3 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.6.4 Multiple Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.6.5 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
2.6.6 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
2.6.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
2.7 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
2.7.1 Functions and Classes in a Module . . . . . . . . . . . . . . . . . . . . . . 85
2.7.2 Specific Functions and Classes . . . . . . . . . . . . . . . . . . . . . . . . 86
3 Linear Neural Networks for Regression 89

3.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.1.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.1.2 Vectorization for Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.1.3 The Normal Distribution and Squared Loss . . . . . . . . . . . . . . . . . 95
3.1.4 Linear Regression as a Neural Network . . . . . . . . . . . . . . . . . . . 97
3.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.2 Object-Oriented Design for Implementation . . . . . . . . . . . . . . . . . . . . . 100
3.2.1 Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.2.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.2.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.2.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.3 Synthetic Regression Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.3.1 Generating the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.3.3 Concise Implementation of the Data Loader . . . . . . . . . . . . . . . . . 107
3.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.4 Linear Regression Implementation from Scratch . . . . . . . . . . . . . . . . . . 108
3.4.1 Defining the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.4.2 Defining the Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.4.3 Defining the Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . 110
3.4.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.5 Concise Implementation of Linear Regression . . . . . . . . . . . . . . . . . . . . 113
3.5.3 Defining the Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . 115
3.5.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.6 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.6.1 Training Error and Generalization Error . . . . . . . . . . . . . . . . . . . 118
3.6.2 Underfitting or Overfitting? . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.6.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.7 Weight Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
3.7.1 Norms and Weight Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
3.7.2 High-Dimensional Linear Regression . . . . . . . . . . . . . . . . . . . . 125
3.7.3 Implementation from Scratch . . . . . . . . . . . . . . . . . . . . . . . . 126
3.7.4 Concise Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
3.7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4 Linear Neural Networks for Classification 131

4.1 Softmax Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.1.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.1.2 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.1.3 Information Theory Basics . . . . . . . . . . . . . . . . . . . . . . . . . . 136
4.1.4 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.2 The Image Classification Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.2.1 Loading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.2.2 Reading a Minibatch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.2.3 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.3 The Base Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.3.1 The Classifier Class . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.3.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.4 Softmax Regression Implementation from Scratch . . . . . . . . . . . . . . . . . . 144
4.4.1 The Softmax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.4.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.4.3 The Cross-Entropy Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.4.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.4.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.5 Concise Implementation of Softmax Regression . . . . . . . . . . . . . . . . . . . 149
4.5.2 Softmax Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.5.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
4.5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
4.6 Generalization in Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.6.1 The Test Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.6.2 Test Set Reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
4.6.3 Statistical Learning Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.7 Environment and Distribution Shift . . . . . . . . . . . . . . . . . . . . . . . . . . 158
4.7.1 Types of Distribution Shift . . . . . . . . . . . . . . . . . . . . . . . . . . 158
4.7.2 Examples of Distribution Shift . . . . . . . . . . . . . . . . . . . . . . . . 161
4.7.3 Correction of Distribution Shift . . . . . . . . . . . . . . . . . . . . . . . . 162
4.7.4 A Taxonomy of Learning Problems . . . . . . . . . . . . . . . . . . . . . . 165
4.7.5 Fairness, Accountability, and Transparency in Machine Learning . . . . . 167
4.7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5 Multilayer Perceptrons 169

5.1 Multilayer Perceptrons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
5.1.1 Hidden Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
5.1.2 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
5.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.2 Implementation of Multilayer Perceptrons . . . . . . . . . . . . . . . . . . . . . . 178
5.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
5.3 Forward Propagation, Backward Propagation, and Computational Graphs . . . . . 182
5.3.1 Forward Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
5.3.2 Computational Graph of Forward Propagation . . . . . . . . . . . . . . . . 183
5.3.3 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
5.3.4 Training Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
5.4 Numerical Stability and Initialization . . . . . . . . . . . . . . . . . . . . . . . . . 186
5.4.1 Vanishing and Exploding Gradients . . . . . . . . . . . . . . . . . . . . . 186
5.4.2 Parameter Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.5 Generalization in Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
5.5.1 Revisiting Overfitting and Regularization . . . . . . . . . . . . . . . . . . 192
5.5.2 Inspiration from Nonparametrics . . . . . . . . . . . . . . . . . . . . . . 193
5.5.3 Early Learning and Early Stopping . . . . . . . . . . . . . . . . . . . . . . 193
5.5.4 Classical Regularization Methods for Deep Networks . . . . . . . . . . . . 194
5.5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
5.6 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
5.6.1 Dropout in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
5.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
5.7 Predicting House Prices on Kaggle . . . . . . . . . . . . . . . . . . . . . . . . . . 200
5.7.1 Downloading Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
5.7.2 Kaggle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
5.7.3 Accessing and Reading the Dataset . . . . . . . . . . . . . . . . . . . . . . 202
5.7.4 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.7.5 Error Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
5.7.6 K-Fold Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
5.7.7 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.7.8 Submitting Predictions on Kaggle . . . . . . . . . . . . . . . . . . . . . . 207
5.7.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6 Builders' Guide 209

6.1 Layers and Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
6.1.1 A Custom Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.1.2 The Sequential Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.1.3 Executing Code in the Forward Propagation Method . . . . . . . . . . . . 213
6.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.2 Parameter Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
6.2.1 Parameter Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
6.2.2 Tied Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
6.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.3 Parameter Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
6.3.1 Built-in Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.3.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
6.4 Lazy Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.4.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.5 Custom Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
6.5.1 Layers without Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 223
6.5.2 Layers with Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
6.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.6 File I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.6.1 Loading and Saving Tensors . . . . . . . . . . . . . . . . . . . . . . . . . 225
6.6.2 Loading and Saving Model Parameters . . . . . . . . . . . . . . . . . . . . 226
6.6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.7 GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.7.1 Computing Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
6.7.2 Tensors and GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
6.7.3 Neural Networks and GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . 232
6.7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7 Convolutional Neural Networks 235

7.1 From Fully Connected Layers to Convolutions . . . . . . . . . . . . . . . . . . . . 236
7.1.1 Invariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
7.1.2 Constraining the MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
7.1.3 Convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.1.4 Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.2 Convolutions for Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.2.1 The Cross-Correlation Operation . . . . . . . . . . . . . . . . . . . . . . . 242
7.2.2 Convolutional Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.2.3 Object Edge Detection in Images . . . . . . . . . . . . . . . . . . . . . . . 244
7.2.4 Learning a Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
7.2.5 Cross-Correlation and Convolution . . . . . . . . . . . . . . . . . . . . . . 246
7.2.6 Feature Map and Receptive Field . . . . . . . . . . . . . . . . . . . . . . . 246
7.2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
7.3 Padding and Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
7.3.1 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
7.3.2 Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
7.4 Multiple Input and Multiple Output Channels . . . . . . . . . . . . . . . . . . . . 253
7.4.1 Multiple Input Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
7.4.2 Multiple Output Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
7.4.3 1 × 1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
7.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
7.5 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
7.5.1 Maximum Pooling and Average Pooling . . . . . . . . . . . . . . . . . . . 258
7.5.2 Padding and Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
7.5.3 Multiple Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
7.5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
7.6 Convolutional Neural Networks (LeNet) . . . . . . . . . . . . . . . . . . . . . . . 263
7.6.1 LeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
7.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
8 Modern Convolutional Neural Networks 269

8.1 Deep Convolutional Neural Networks (AlexNet) . . . . . . . . . . . . . . . . . . . 270
8.1.1 Learning Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
8.1.2 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
8.1.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
8.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
8.2 Networks Using Blocks (VGG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
8.2.1 VGG Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
8.2.2 VGG Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
8.2.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
8.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
8.3 Network in Network (NiN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
8.3.1 NiN Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
8.3.2 NiN Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
8.3.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
8.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
8.4 Multi-Branch Networks (GoogLeNet) . . . . . . . . . . . . . . . . . . . . . . . . . 287
8.4.1 Inception Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
8.4.2 GoogLeNet Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
8.4.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
8.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
8.5 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
8.5.1 Training Deep Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
8.5.2 Batch Normalization Layers . . . . . . . . . . . . . . . . . . . . . . . . . . 295
8.5.4 Applying Batch Normalization in LeNet . . . . . . . . . . . . . . . . . . . 298
8.5.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
8.6 Residual Networks (ResNet) and ResNeXt . . . . . . . . . . . . . . . . . . . . . . . 301
8.6.1 Function Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
8.6.2 Residual Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
8.6.3 ResNet Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
8.6.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
8.6.5 ResNeXt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.7 Densely Connected Networks (DenseNet) . . . . . . . . . . . . . . . . . . . . . . 311
8.7.1 From ResNet to DenseNet . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
8.7.2 Dense Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
8.7.3 Transition Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
8.7.4 DenseNet Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
8.7.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
8.8 Designing Convolution Network Architectures . . . . . . . . . . . . . . . . . . . . 316
8.8.1 The AnyNet Design Space . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
8.8.2 Constraining Design Spaces with Lower Error Distributions . . . . . . . . 319
8.8.3 RegNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
8.8.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
8.8.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
9 Recurrent Neural Networks 323

9.1 Working with Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
9.1.1 Basic Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
9.1.2 Sequence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
9.1.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
9.1.4 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
9.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
9.2 Converting Raw Text into Sequence Data . . . . . . . . . . . . . . . . . . . . . . . 334
9.2.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
9.2.3 Vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
9.2.4 Putting it all Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
9.2.5 Exploratory Language Statistics . . . . . . . . . . . . . . . . . . . . . . . . 337
9.2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
9.3 Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
9.3.1 Learning Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . 341
9.3.2 Perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
9.3.3 Partitioning Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
9.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
9.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
9.4.1 Neural Networks without Hidden States . . . . . . . . . . . . . . . . . . . 346
9.4.2 Recurrent Neural Networks with Hidden States . . . . . . . . . . . . . . . 347
9.4.3 RNN-based Character-Level Language Models . . . . . . . . . . . . . . . . 349
9.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
9.5 Recurrent Neural Network Implementation from Scratch . . . . . . . . . . . . . . 350
9.5.1 RNN Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
9.5.2 RNN-based Language Model . . . . . . . . . . . . . . . . . . . . . . . . . 351
9.5.3 Gradient Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
9.5.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
9.5.5 Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
9.5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
9.6 Concise Implementation of Recurrent Neural Networks . . . . . . . . . . . . . . . 357
9.6.2 Training and Predicting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
9.6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
9.7 Backpropagation Through Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
9.7.1 Analysis of Gradients in RNNs . . . . . . . . . . . . . . . . . . . . . . . . 360
9.7.2 Backpropagation Through Time in Detail . . . . . . . . . . . . . . . . . . 362
9.7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
10 Modern Recurrent Neural Networks 367

10.1 Gated Recurrent Units (GRU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
10.1.1 Gated Hidden State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
10.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
10.2 Long Short-Term Memory (LSTM) . . . . . . . . . . . . . . . . . . . . . . . . . . 374
10.2.1 Gated Memory Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
10.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
10.3 Deep Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
10.3.1 Functional Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
10.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 384
10.4 Bidirectional Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 385
10.4.1 Dynamic Programming in Hidden Markov Models . . . . . . . . . . . . . 385
10.4.2 Bidirectional Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
10.4.3 Training a Bidirectional RNN for a Wrong Application . . . . . . . . . . . 389
10.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
10.5 Machine Translation and the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 391
10.5.1 Downloading and Preprocessing the Dataset . . . . . . . . . . . . . . . . 392
10.5.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
10.5.3 Loading Sequences of Fixed Length . . . . . . . . . . . . . . . . . . . . . 394
10.5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
10.6 Encoder-Decoder Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.6.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
10.6.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
10.6.3 Putting the Encoder and Decoder Together . . . . . . . . . . . . . . . . . 398
10.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
10.7 Sequence to Sequence Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
10.7.1 Teacher Forcing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
10.7.2 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
10.7.3 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
10.7.4 Encoder-Decoder for Sequence to Sequence Learning . . . . . . . . . . . 404
10.7.5 Loss Function with Masking . . . . . . . . . . . . . . . . . . . . . . . . . 404
10.7.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
10.7.7 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
10.7.8 Evaluation of Predicted Sequences . . . . . . . . . . . . . . . . . . . . . . 406
10.7.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
10.8 Beam Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
10.8.1 Greedy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
10.8.2 Exhaustive Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
10.8.3 Beam Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
10.8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
11 Attention Mechanisms and Transformers 413

11.1 Attention Cues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
11.1.1 Attention Cues in Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
11.1.2 Queries, Keys, and Values . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
11.1.3 Visualization of Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
11.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
11.2 Attention Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
11.2.1 Generating the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
11.2.2 Average Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
11.2.3 Nonparametric Attention Pooling . . . . . . . . . . . . . . . . . . . . . . 420
11.2.4 Parametric Attention Pooling . . . . . . . . . . . . . . . . . . . . . . . . . 422
11.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
11.3 Attention Scoring Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
11.3.1 Masked Softmax Operation . . . . . . . . . . . . . . . . . . . . . . . . . . 426
11.3.2 Additive Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
11.3.3 Scaled Dot-Product Attention . . . . . . . . . . . . . . . . . . . . . . . . . 429
11.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
11.4 Bahdanau Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
11.4.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
11.4.2 Defining the Decoder with Attention . . . . . . . . . . . . . . . . . . . . . 432
11.4.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
11.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
11.5 Multi-Head Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
11.5.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
11.5.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
11.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
11.6 Self-Attention and Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . . 439
11.6.1 Self-Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
11.6.2 Comparing CNNs, RNNs, and Self-Attention . . . . . . . . . . . . . . . . . 440
11.6.3 Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
11.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
11.7 The Transformer Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
11.7.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
11.7.2 Positionwise Feed-Forward Networks . . . . . . . . . . . . . . . . . . . . 446
11.7.3 Residual Connection and Layer Normalization . . . . . . . . . . . . . . . 447
11.7.4 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
11.7.5 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
11.7.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
11.7.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
11.8 Transformers for Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
11.8.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
11.8.2 Patch Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
11.8.3 Vision Transformer Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 458
11.8.4 Putting All Things Together . . . . . . . . . . . . . . . . . . . . . . . . . . 459
11.8.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
11.9 Large-Scale Pretraining with Transformers . . . . . . . . . . . . . . . . . . . . . . 461
11.9.1 Encoder-Only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
11.9.2 Encoder-Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
11.9.3 Decoder-Only . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
11.9.4 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
12 Optimization Algorithms 473

12.1 Optimization and Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
12.1.1 Goal of Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
12.1.2 Optimization Challenges in Deep Learning . . . . . . . . . . . . . . . . . 475
12.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
12.2 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
12.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479
12.2.2 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
12.2.3 Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485
12.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
12.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
12.3.1 One-Dimensional Gradient Descent . . . . . . . . . . . . . . . . . . . . . 488
12.3.2 Multivariate Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . 491
12.3.3 Adaptive Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
12.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 497
12.4 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
12.4.1 Stochastic Gradient Updates . . . . . . . . . . . . . . . . . . . . . . . . . 498
12.4.2 Dynamic Learning Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
12.4.3 Convergence Analysis for Convex Objectives . . . . . . . . . . . . . . . . 502
12.4.4 Stochastic Gradients and Finite Samples . . . . . . . . . . . . . . . . . . . 504
12.4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
12.5 Minibatch Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . 505
12.5.1 Vectorization and Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
12.5.2 Minibatches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
12.5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
12.6 Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
12.6.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
12.6.2 Practical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520
12.6.3 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
12.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524
12.7 Adagrad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
12.7.1 Sparse Features and Learning Rates . . . . . . . . . . . . . . . . . . . . . 525
12.7.2 Preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
12.7.3 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
12.7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
12.8 RMSProp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
12.8.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
12.8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
12.9 Adadelta . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
12.9.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
12.9.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
12.9.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
12.10 Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
12.10.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
12.10.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540
12.10.3 Yogi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 541
12.10.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
12.11 Learning Rate Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
12.11.1 Toy Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
12.11.2 Schedulers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
12.11.3 Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547
12.11.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
13 Computational Performance 555

13.1 Compilers and Interpreters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
13.1.1 Symbolic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
13.1.2 Hybrid Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
13.1.3 Hybridizing the Sequential Class . . . . . . . . . . . . . . . . . . . . . . . 558
13.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
13.2 Asynchronous Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
13.2.1 Asynchrony via Backend . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
13.2.2 Barriers and Blockers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
13.2.3 Improving Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
13.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
13.3 Automatic Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563
13.3.1 Parallel Computation on GPUs . . . . . . . . . . . . . . . . . . . . . . . . 564
13.3.2 Parallel Computation and Communication . . . . . . . . . . . . . . . . . 565
13.3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
13.4 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
13.4.1 Computers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
13.4.2 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
13.4.3 Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
13.4.4 CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
13.4.5 GPUs and other Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . 574
13.4.6 Networks and Buses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
13.4.7 More Latency Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
13.4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
13.5 Training on Multiple GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
13.5.1 Splitting the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
13.5.2 Data Parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
13.5.3 A Toy Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
13.5.4 Data Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
13.5.5 Distributing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
13.5.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
13.5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
13.6 Concise Implementation for Multiple GPUs . . . . . . . . . . . . . . . . . . . . . 589
13.6.1 A Toy Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
13.6.2 Network Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
13.6.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
13.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
13.7 Parameter Servers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
13.7.1 Data-Parallel Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
13.7.2 Ring Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
13.7.3 Multi-Machine Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
13.7.4 Key–Value Stores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
13.7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
14 Computer Vision 601

14.1 Image Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
14.1.1 Common Image Augmentation Methods . . . . . . . . . . . . . . . . . . . 602
14.1.2 Training with Image Augmentation . . . . . . . . . . . . . . . . . . . . . 606
14.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
14.2 Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
14.2.1 Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
14.2.2 Hot Dog Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611
14.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615
14.3 Object Detection and Bounding Boxes . . . . . . . . . . . . . . . . . . . . . . . . 616
14.3.1 Bounding Boxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
14.3.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
14.4 Anchor Boxes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
14.4.1 Generating Multiple Anchor Boxes . . . . . . . . . . . . . . . . . . . . . . 619
14.4.2 Intersection over Union (IoU) . . . . . . . . . . . . . . . . . . . . . . . . . 622
14.4.3 Labeling Anchor Boxes in Training Data . . . . . . . . . . . . . . . . . . . 623
14.4.4 Predicting Bounding Boxes with Non-Maximum Suppression . . . . . . . 629
14.4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
14.5 Multiscale Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
14.5.1 Multiscale Anchor Boxes . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
14.5.2 Multiscale Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
14.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636
14.6 The Object Detection Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
14.6.1 Downloading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
14.6.3 Demonstration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
14.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
14.7 Single Shot Multibox Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
14.7.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
14.7.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
14.7.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648
14.7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649
14.8 Region-based CNNs (R-CNNs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
14.8.1 R-CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
14.8.2 Fast R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
14.8.3 Faster R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655
14.8.4 Mask R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
14.8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
14.9 Semantic Segmentation and the Dataset . . . . . . . . . . . . . . . . . . . . . . . 657
14.9.1 Image Segmentation and Instance Segmentation . . . . . . . . . . . . . . 657
14.9.2 The Pascal VOC2012 Semantic Segmentation Dataset . . . . . . . . . . . . 658
14.9.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
14.10 Transposed Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
14.10.1 Basic Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
14.10.2 Padding, Strides, and Multiple Channels . . . . . . . . . . . . . . . . . . . 665
14.10.3 Connection to Matrix Transposition . . . . . . . . . . . . . . . . . . . . . 667
14.10.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
14.11 Fully Convolutional Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
14.11.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
14.11.2 Initializing Transposed Convolutional Layers . . . . . . . . . . . . . . . . 671
14.11.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
14.11.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
14.11.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
14.12 Neural Style Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
14.12.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
14.12.2 Reading the Content and Style Images . . . . . . . . . . . . . . . . . . . . 677
14.12.3 Preprocessing and Postprocessing . . . . . . . . . . . . . . . . . . . . . . 678
14.12.4 Extracting Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
14.12.6 Initializing the Synthesized Image . . . . . . . . . . . . . . . . . . . . . . 682
14.12.7 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
14.12.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684
14.13 Image Classification (CIFAR-10) on Kaggle . . . . . . . . . . . . . . . . . . . . . . 684
14.13.1 Obtaining and Organizing the Dataset . . . . . . . . . . . . . . . . . . . . 685
14.13.2 Image Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688
14.13.5 Defining the Training Function . . . . . . . . . . . . . . . . . . . . . . . . 689
14.13.6 Training and Validating the Model . . . . . . . . . . . . . . . . . . . . . . 690
14.13.7 Classifying the Testing Set and Submitting Results on Kaggle . . . . . . . . 691
14.13.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
14.14 Dog Breed Identification (ImageNet Dogs) on Kaggle . . . . . . . . . . . . . . . . 692
14.14.1 Obtaining and Organizing the Dataset . . . . . . . . . . . . . . . . . . . . 693
14.14.2 Image Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 694
14.14.4 Fine-Tuning a Pretrained Model . . . . . . . . . . . . . . . . . . . . . . . 695
14.14.5 Defining the Training Function . . . . . . . . . . . . . . . . . . . . . . . . 696
14.14.6 Training and Validating the Model . . . . . . . . . . . . . . . . . . . . . . 697
14.14.7 Classifying the Testing Set and Submitting Results on Kaggle . . . . . . . . 697
14.14.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 698
15 Natural Language Processing: Pretraining 701

15.1 Word Embedding (word2vec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
15.1.1 One-Hot Vectors Are a Bad Choice . . . . . . . . . . . . . . . . . . . . . . 702
15.1.2 Self-Supervised word2vec . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
15.1.3 The Skip-Gram Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
15.1.4 The Continuous Bag of Words (CBOW) Model . . . . . . . . . . . . . . . . 705
15.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706
15.2 Approximate Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
15.2.1 Negative Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
15.2.2 Hierarchical Softmax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
15.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 709
15.3 The Dataset for Pretraining Word Embeddings . . . . . . . . . . . . . . . . . . . . 710
15.3.2 Subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 711
15.3.3 Extracting Center Words and Context Words . . . . . . . . . . . . . . . . 712
15.3.4 Negative Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714
15.3.5 Loading Training Examples in Minibatches . . . . . . . . . . . . . . . . . 715
15.3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
15.4 Pretraining word2vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
15.4.1 The Skip-Gram Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
15.4.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
15.4.3 Applying Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . 721
15.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 722
15.5 Word Embedding with Global Vectors (GloVe) . . . . . . . . . . . . . . . . . . . . 722
15.5.1 Skip-Gram with Global Corpus Statistics . . . . . . . . . . . . . . . . . . . 722
15.5.2 The GloVe Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
15.5.3 Interpreting GloVe from the Ratio of Co-occurrence Probabilities . . . . . 724
15.5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725
15.6 Subword Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725
15.6.1 The fastText Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726
15.6.2 Byte Pair Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726
15.6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 729
15.7 Word Similarity and Analogy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730
15.7.1 Loading Pretrained Word Vectors . . . . . . . . . . . . . . . . . . . . . . 730
15.7.2 Applying Pretrained Word Vectors . . . . . . . . . . . . . . . . . . . . . . 732
15.7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
15.8 Bidirectional Encoder Representations from Transformers (BERT) . . . . . . . . . 734
15.8.1 From Context-Independent to Context-Sensitive . . . . . . . . . . . . . . 734
15.8.2 From Task-Specific to Task-Agnostic . . . . . . . . . . . . . . . . . . . . . 735
15.8.3 BERT: Combining the Best of Both Worlds . . . . . . . . . . . . . . . . . . 735
15.8.4 Input Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
15.8.5 Pretraining Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
15.8.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 742
15.9 The Dataset for Pretraining BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . 743
15.9.1 Defining Helper Functions for Pretraining Tasks . . . . . . . . . . . . . . 743
15.9.2 Transforming Text into the Pretraining Dataset . . . . . . . . . . . . . . . 746
15.9.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
15.10 Pretraining BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
15.10.1 Pretraining BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 749
15.10.2 Representing Text with BERT . . . . . . . . . . . . . . . . . . . . . . . . . 751
15.10.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
16 Natural Language Processing: Applications 753

16.1 Sentiment Analysis and the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 754
16.1.2 Preprocessing the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
16.1.3 Creating Data Iterators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 756
16.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
16.2 Sentiment Analysis: Using Recurrent Neural Networks . . . . . . . . . . . . . . . 758
16.2.1 Representing Single Text with RNNs . . . . . . . . . . . . . . . . . . . . . 758
16.2.2 Loading Pretrained Word Vectors . . . . . . . . . . . . . . . . . . . . . . 759
16.2.3 Training and Evaluating the Model . . . . . . . . . . . . . . . . . . . . . . 760
16.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
16.3 Sentiment Analysis: Using Convolutional Neural Networks . . . . . . . . . . . . . 761
16.3.1 One-Dimensional Convolutions . . . . . . . . . . . . . . . . . . . . . . . . 762
16.3.2 Max-Over-Time Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764
16.3.3 The textCNN Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764
16.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
16.4 Natural Language Inference and the Dataset . . . . . . . . . . . . . . . . . . . . . 768
16.4.1 Natural Language Inference . . . . . . . . . . . . . . . . . . . . . . . . . 768
16.4.2 The Stanford Natural Language Inference (SNLI) Dataset . . . . . . . . . . 769
16.4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
16.5 Natural Language Inference: Using Attention . . . . . . . . . . . . . . . . . . . . 772
16.5.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 773
16.5.2 Training and Evaluating the Model . . . . . . . . . . . . . . . . . . . . . . 777
16.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 779
16.6 Fine-Tuning BERT for Sequence-Level and Token-Level Applications . . . . . . . . 779
16.6.1 Single Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 780
16.6.2 Text Pair Classification or Regression . . . . . . . . . . . . . . . . . . . . 781
16.6.3 Text Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 781
16.6.4 Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 782
16.6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 783
16.7 Natural Language Inference: Fine-Tuning BERT . . . . . . . . . . . . . . . . . . . 784
16.7.1 Loading Pretrained BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
16.7.2 The Dataset for Fine-Tuning BERT . . . . . . . . . . . . . . . . . . . . . . 786
16.7.3 Fine-Tuning BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787
16.7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
17 Recommender Systems 791

17.1 Overview of Recommender Systems . . . . . . . . . . . . . . . . . . . . . . . . . 791
17.1.1 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
17.1.2 Explicit Feedback and Implicit Feedback . . . . . . . . . . . . . . . . . . 793
17.1.3 Recommendation Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793
17.1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793
18 Generative Adversarial Networks 795

18.1 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
18.1.1 Generate Some “Real” Data . . . . . . . . . . . . . . . . . . . . . . . . . . 797
18.1.2 Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
18.1.3 Discriminator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
18.1.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
18.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
18.2 Deep Convolutional Generative Adversarial Networks . . . . . . . . . . . . . . . . 801
18.2.1 The Pokemon Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
18.2.2 The Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
18.2.3 Discriminator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804
18.2.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805
18.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 807
19 Appendix: Mathematics for Deep Learning 809

19.1 Geometry and Linear Algebraic Operations . . . . . . . . . . . . . . . . . . . . . 810
19.1.1 Geometry of Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 810
19.1.2 Dot Products and Angles . . . . . . . . . . . . . . . . . . . . . . . . . . . 812
19.1.3 Hyperplanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
19.1.4 Geometry of Linear Transformations . . . . . . . . . . . . . . . . . . . . 817
19.1.5 Linear Dependence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
19.1.6 Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
19.1.7 Invertibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
19.1.8 Determinant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
19.1.9 Tensors and Common Linear Algebra Operations . . . . . . . . . . . . . . 823
19.1.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825
19.2 Eigendecompositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 826
19.2.1 Finding Eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 827
19.2.2 Decomposing Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 828
19.2.3 Operations on Eigendecompositions . . . . . . . . . . . . . . . . . . . . . 828
19.2.4 Eigendecompositions of Symmetric Matrices . . . . . . . . . . . . . . . . 829
19.2.5 Gershgorin Circle Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 829
19.2.6 A Useful Application: The Growth of Iterated Maps . . . . . . . . . . . . . 830
19.2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835
19.2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835
19.3 Single Variable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
19.3.1 Differential Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
19.3.2 Rules of Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 839
19.3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
19.4 Multivariable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
19.4.1 Higher-Dimensional Differentiation . . . . . . . . . . . . . . . . . . . . . 847
19.4.2 Geometry of Gradients and Gradient Descent . . . . . . . . . . . . . . . . 849
19.4.3 A Note on Mathematical Optimization . . . . . . . . . . . . . . . . . . . . 850
19.4.4 Multivariate Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 851
19.4.5 The Backpropagation Algorithm . . . . . . . . . . . . . . . . . . . . . . . 853
19.4.6 Hessians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856
19.4.7 A Little Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . 858
19.4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 862
19.5 Integral Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863
19.5.1 Geometric Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . 863
19.5.2 The Fundamental Theorem of Calculus . . . . . . . . . . . . . . . . . . . 865
19.5.3 Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867
19.5.4 A Comment on Sign Conventions . . . . . . . . . . . . . . . . . . . . . . . 868
19.5.5 Multiple Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869
19.5.6 Change of Variables in Multiple Integrals . . . . . . . . . . . . . . . . . . 871
19.5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
19.6 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
19.6.1 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . 873
19.6.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
19.7 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 890
19.7.1 The Maximum Likelihood Principle . . . . . . . . . . . . . . . . . . . . . 890
19.7.2 Numerical Optimization and the Negative Log-Likelihood . . . . . . . . . 892
19.7.3 Maximum Likelihood for Continuous Variables . . . . . . . . . . . . . . . 893
19.7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 894
19.8 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895
19.8.1 Bernoulli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895
19.8.2 Discrete Uniform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 897
19.8.3 Continuous Uniform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 899
19.8.4 Binomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 901
19.8.5 Poisson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903
19.8.6 Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 905
19.8.7 Exponential Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 908
19.8.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 909
19.9 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 910
19.9.1 Optical Character Recognition . . . . . . . . . . . . . . . . . . . . . . . . 910
19.9.2 The Probabilistic Model for Classification . . . . . . . . . . . . . . . . . . 912
19.9.3 The Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 912
19.9.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 913
19.9.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 916
19.10 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 917
19.10.1 Evaluating and Comparing Estimators . . . . . . . . . . . . . . . . . . . . 917
19.10.2 Conducting Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . 921
19.10.3 Constructing Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . 925
19.10.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 927
19.11 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 928
19.11.1 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 928
19.11.2 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 930
19.11.3 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 932
19.11.4 Kullback–Leibler Divergence . . . . . . . . . . . . . . . . . . . . . . . . . 936
19.11.5 Cross-Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 938
19.11.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 941
20 Appendix: Tools for Deep Learning 943

20.1 Using Jupyter Notebooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 943
20.1.1 Editing and Running the Code Locally . . . . . . . . . . . . . . . . . . . . 943
20.1.2 Advanced Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 947
20.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 948
20.2 Using Amazon SageMaker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 948
20.2.1 Signing Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 948
20.2.2 Creating a SageMaker Instance . . . . . . . . . . . . . . . . . . . . . . . . 949
20.2.3 Running and Stopping an Instance . . . . . . . . . . . . . . . . . . . . . . 950
20.2.4 Updating Notebooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 950
20.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 951
20.3 Using AWS EC2 Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 951
20.3.1 Creating and Running an EC2 Instance . . . . . . . . . . . . . . . . . . . . 952
20.3.2 Installing CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 956
20.3.3 Installing Libraries for Running the Code . . . . . . . . . . . . . . . . . . 957
20.3.4 Running the Jupyter Notebook remotely . . . . . . . . . . . . . . . . . . . 958
20.3.5 Closing Unused Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . 958
20.3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 959
20.4 Using Google Colab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 959
20.4.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
20.5 Selecting Servers and GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
20.5.1 Selecting Servers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 961
20.5.2 Selecting GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 962
20.5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 964
20.6 Contributing to This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
20.6.1 Submitting Minor Changes . . . . . . . . . . . . . . . . . . . . . . . . . . 965
20.6.2 Proposing Major Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
20.6.3 Submitting Major Changes . . . . . . . . . . . . . . . . . . . . . . . . . . 966
20.6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 969
20.7 Utility Functions and Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 969
20.8 d2l API Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 979
20.8.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 979
20.8.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 981
20.8.3 Trainer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 982
20.8.4 Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 982
Bibliography 985
Index 1003
Preface
Just a few years ago, there were no legions of deep learning scientists developing intelligent prod-
ucts and services at major companies and startups. When we entered the field, machine learning
did not command headlines in daily newspapers. Our parents had no idea what machine learning
was, let alone why we might prefer it to a career in medicine or law. Machine learning was a blue
skies academic discipline whose industrial significance was limited to a narrow set of real-world
applications, including speech recognition and computer vision. Moreover, many of these appli-
cations required so much domain knowledge that they were often regarded as entirely separate
areas for which machine learning was one small component. At that time, neural networks—the
predecessors of the deep learning methods that we focus on in this book—were generally regarded
as outmoded.
In just the past few years, deep learning has taken the world by surprise, driving rapid progress in
such diverse fields as computer vision, natural language processing, automatic speech recogni-
tion, reinforcement learning, and biomedical informatics. Moreover, the success of deep learn-
ing on so many tasks of practical interest has even catalyzed developments in theoretical machine
learning and statistics. With these advances in hand, we can now build cars that drive themselves
with more autonomy than ever before (and less autonomy than some companies might have you
believe), smart reply systems that automatically draft the most mundane emails, helping people
dig out from oppressively large inboxes, and software agents that dominate the worldʼs best hu-
mans at board games like Go, a feat once thought to be decades away. Already, these tools exert
ever-wider impacts on industry and society, changing the way movies are made and diseases are
diagnosed, and playing a growing role in basic sciences, from astrophysics to biology.
About This Book
This book represents our attempt to make deep learning approachable, teaching you the concepts,
the context, and the code.
One Medium Combining Code, Math, and HTML
For any computing technology to reach its full impact, it must be well-understood, well-
documented, and supported by mature, well-maintained tools. The key ideas should be clearly
distilled, minimizing the onboarding time needed to bring new practitioners up to date. Mature
libraries should automate common tasks, and exemplar code should make it easy for practitioners
to modify, apply, and extend common applications to suit their needs. Take dynamic web appli-
cations as an example. Despite a large number of companies, like Amazon, developing successful
database-driven web applications in the 1990s, the potential of this technology to aid creative en-
trepreneurs has been realized to a far greater degree in the past ten years, owing in part to the
development of powerful, well-documented frameworks.
Testing the potential of deep learning presents unique challenges because any single application
brings together various disciplines. Applying deep learning requires simultaneously understand-
ing (i) the motivations for casting a problem in a particular way; (ii) the mathematical form of a
given model; (iii) the optimization algorithms for fitting the models to data; (iv) the statistical prin-
ciples that tell us when we should expect our models to generalize to unseen data and practical
methods for certifying that they have, in fact, generalized; and (v) the engineering techniques re-
quired to train models efficiently, navigating the pitfalls of numerical computing and getting the
most out of available hardware. Teaching the critical thinking skills required to formulate
problems, the mathematics to solve them, and the software tools to implement those solutions all
in one place presents formidable challenges. Our goal in this book is to present a unified resource
to bring would-be practitioners up to speed.
When we started this book project, there were no resources that simultaneously (i) remained up
to date; (ii) covered the breadth of modern machine learning practices with sufficient techni-
cal depth; and (iii) interleaved exposition of the quality one expects of a textbook with the clean
runnable code that one expects of a hands-on tutorial. We found plenty of code examples for how
to use a given deep learning framework (e.g., how to do basic numerical computing with matrices
in TensorFlow) or for implementing particular techniques (e.g., code snippets for LeNet, AlexNet,
ResNet, etc.) scattered across various blog posts and GitHub repositories. However, these exam-
ples typically focused on how to implement a given approach, but left out the discussion of why
certain algorithmic decisions are made. While some interactive resources have popped up sporad-
ically to address a particular topic, e.g., the engaging blog posts published on the website Distill1 ,
or personal blogs, they only covered selected topics in deep learning, and often lacked associated
code. On the other hand, while several deep learning textbooks have emerged—e.g., (Goodfellow
et al., 2016), which offers a comprehensive survey on the basics of deep learning—these resources
do not marry the descriptions to realizations of the concepts in code, sometimes leaving read-
ers clueless as to how to implement them. Moreover, too many resources are hidden behind the
paywalls of commercial course providers.
We set out to create a resource that could (i) be freely available for everyone; (ii) offer sufficient
technical depth to provide a starting point on the path to actually becoming an applied machine
learning scientist; (iii) include runnable code, showing readers how to solve problems in practice;
(iv) allow for rapid updates, both by us and also by the community at large; and (v) be comple-
mented by a forum2 for interactive discussion of technical details and to answer questions.
These goals were often in conflict. Equations, theorems, and citations are best managed and laid
out in LaTeX. Code is best described in Python. And webpages are native in HTML and JavaScript.
Furthermore, we want the content to be accessible as executable code, as a physical book,
as a downloadable PDF, and on the Internet as a website. No workflows seemed suited to these
demands, so we decided to assemble our own (Section 20.6). We settled on GitHub to share the
source and to facilitate community contributions; Jupyter notebooks for mixing code, equations
and text; Sphinx as a rendering engine; and Discourse as a discussion platform. While our system
is not perfect, these choices strike a compromise among the competing concerns. We believe that
Dive into Deep Learning might be the first book published using such an integrated workflow.
1 http://distill.pub
2 http://discuss.d2l.ai
Learning by Doing
Many textbooks present concepts in succession, covering each in exhaustive detail. For example,
Chris Bishopʼs excellent textbook (Bishop, 2006) teaches each topic so thoroughly that getting to
the chapter on linear regression requires a non-trivial amount of work. While experts love this
book precisely for its thoroughness, for true beginners, this property limits its usefulness as an
introductory text.
In this book, we teach most concepts just in time. In other words, you will learn concepts at the
very moment that they are needed to accomplish some practical end. While we take some time at
the outset to teach fundamental preliminaries, like linear algebra and probability, we want you to
taste the satisfaction of training your first model before worrying about more esoteric concepts.
Aside from a few preliminary notebooks that provide a crash course in the basic mathematical
background, each subsequent chapter both introduces a reasonable number of new concepts and
provides several self-contained working examples, using real datasets. This presented an organizational
challenge. Some models might logically be grouped together in a single notebook. And
some ideas might be best taught by executing several models in succession. On the other hand,
there is a big advantage to adhering to a policy of one working example, one notebook: This makes
it as easy as possible for you to start your own research projects by leveraging our code. Just copy
a notebook and start modifying it.
Throughout, we interleave the runnable code with background material as needed. In general,
we err on the side of making tools available before explaining them fully (often filling in the back-
ground later). For instance, we might use stochastic gradient descent before explaining why it is
useful or offering intuitions for why it works. This helps to give practitioners the necessary am-
munition to solve problems quickly, at the expense of requiring the reader to trust us with some
curatorial decisions.
This book teaches deep learning concepts from scratch. Sometimes, we delve into fine details
about models that would typically be hidden from users by modern deep learning frameworks.
This comes up especially in the basic tutorials, where we want you to understand everything that
happens in a given layer or optimizer. In these cases, we often present two versions of the example:
one where we implement everything from scratch, relying only on NumPy-like functionality and
automatic differentiation, and a more practical example, where we write succinct code using the
high-level APIs of deep learning frameworks. After explaining how some component works, we
rely on the high-level API in subsequent tutorials.
Content and Structure
The book can be divided into roughly three parts, focusing on preliminaries, deep learning tech-
niques, and advanced topics focused on real systems and applications (Fig. 1).
Fig. 1: Book structure
• Part 1: Basics and Preliminaries. Chapter 1 offers an introduction to deep learning. Then,
in Chapter 2, we quickly bring you up to speed on the prerequisites required for hands-on
deep learning, such as how to store and manipulate data, and how to apply various numeri-
cal operations based on basic concepts from linear algebra, calculus, and probability. Chap-
ter 3 and Chapter 5 cover the most basic concepts and techniques in deep learning, includ-
ing regression and classification; linear models; multilayer perceptrons; and overfitting and
regularization.
• Part 2: Modern Deep Learning Techniques. Chapter 6 describes the key computational
components of deep learning systems and lays the groundwork for our subsequent imple-
mentations of more complex models. Next, Chapter 7 and Chapter 8 introduce convolutional
neural networks (CNNs), powerful tools that form the backbone of most modern computer
vision systems. Similarly, Chapter 9 and Chapter 10 introduce recurrent neural networks
(RNNs), models that exploit sequential (e.g., temporal) structure in data and are commonly
used for natural language processing and time series prediction. In Chapter 11, we introduce
a relatively new class of models based on so-called attention mechanisms that has displaced
RNNs as the dominant architecture for most natural language processing tasks. These sec-
tions will bring you up to speed on the most powerful and general tools that are widely used
by deep learning practitioners.
• Part 3: Scalability, Efficiency, and Applications. In Chapter 12, we discuss several com-
mon optimization algorithms used to train deep learning models. Next, in Chapter 13, we
examine several key factors that influence the computational performance of deep learning
code. Then, in Chapter 14, we illustrate major applications of deep learning in computer
vision. Finally, in Chapter 15 and Chapter 16, we demonstrate how to pretrain language
representation models and apply them to natural language processing tasks. This part is
available online3 .
Code
Most sections of this book feature executable code. We believe that some intuitions are best de-
veloped via trial and error, tweaking the code in small ways and observing the results. Ideally, an
elegant mathematical theory might tell us precisely how to tweak our code to achieve a desired re-
sult. However, deep learning practitioners today must often tread where no solid theory provides
guidance. Despite our best attempts, formal explanations for the efficacy of various techniques
are still lacking, because the mathematics to characterize these models can be so difficult,
because the explanation likely depends on properties of the data that currently lack clear definitions,
and because serious inquiry on these topics has just recently kicked into high gear. We
are hopeful that as the theory of deep learning progresses, each future edition of this book will
provide insights that eclipse those presently available.
To avoid unnecessary repetition, we encapsulate some of our most frequently imported and used
functions and classes in the d2l package. Throughout, we mark blocks of code (such as functions,
classes, or collections of import statements) with #@save to indicate that they will be accessed later
via the d2l package. We offer a detailed overview of these functions and classes in Section 20.8.
The d2l package is lightweight and only requires the following dependencies:
#@save
import collections
import hashlib
import inspect
import math
import os
import random
import re
import shutil
import sys
import tarfile
import time
import zipfile
from collections import defaultdict
import pandas as pd
import requests
from IPython import display
from matplotlib import pyplot as plt
from matplotlib_inline import backend_inline
d2l = sys.modules[__name__]
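The last line of the block above, d2l = sys.modules[__name__], binds the name d2l to the module object that is currently executing, so anything defined in that module can also be reached as d2l.<name>. The following is a minimal sketch of that lookup only. The module name d2l_demo and the helper add are made up for this illustration and are not part of the d2l package.

```python
import sys
import types

# Hypothetical module name, used only for this illustration.
demo = types.ModuleType("d2l_demo")
sys.modules["d2l_demo"] = demo

# Source that such a module might contain (the helper `add` is made up).
source = '''
import sys
d2l = sys.modules[__name__]   # handle to the module being executed

def add(a, b):
    return a + b
'''

# Run the source inside the module's namespace, where __name__ == "d2l_demo".
exec(source, demo.__dict__)

# The module-level name `d2l` now refers to the module itself,
# so helpers defined in the module are reachable through it.
print(demo.d2l is demo)
print(demo.d2l.add(1, 2))
```

Because the handle is just another reference to the same module object, code saved into the package and code written inline in a notebook can call helpers through the same d2l prefix.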
Most of the code in this book is based on PyTorch, an extremely popular open-source framework
that has been enthusiastically embraced by the deep learning research community. All of the
code in this book has passed tests under the latest stable version of PyTorch. However, due to
the rapid development of deep learning, some code in the print edition may not work properly in
3 https://d2l.ai
future versions of PyTorch. We plan to keep the online version up-to-date. In case you encounter
any problems, please consult Installation (page 9) to update your code and runtime environment.
Here is how we import modules from PyTorch.
#@save
import numpy as np
import torch
import torchvision
from PIL import Image
from torch import nn
from torch.nn import functional as F
from torch.utils import data
from torchvision import transforms
Target Audience
This book is for students (undergraduate or graduate), engineers, and researchers, who seek a
solid grasp of the practical techniques of deep learning. Because we explain every concept from
scratch, no previous background in deep learning or machine learning is required. Fully explain-
ing the methods of deep learning requires some mathematics and programming, but we will only
assume that you come in with some basics, including modest amounts of linear algebra, calculus,
probability, and Python programming. Just in case you forget the basics, the Appendix provides
a refresher on most of the mathematics you will find in this book. Most of the time, we will pri-
oritize intuition and ideas over mathematical rigor. If you would like to extend these foundations
beyond the prerequisites to understand our book, we happily recommend some other terrific re-
sources: Linear Analysis by Bela Bollobas (Bollobás, 1999) covers linear algebra and functional
analysis in great depth. All of Statistics (Wasserman, 2013) provides a marvelous introduction to
statistics. Joe Blitzsteinʼs books4 and courses5 on probability and inference are pedagogical gems.
And if you have not used Python before, you may want to peruse this Python tutorial6 .
4 https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1138369918
5 https://projects.iq.harvard.edu/stat110/home
6 http://learnpython.org/
Forum
Associated with this book, we have launched a discussion forum, located at discuss.d2l.ai7 . When
you have questions on any section of the book, you can find a link to the associated discussion
page at the end of each notebook.
Acknowledgments
We are indebted to the hundreds of contributors for both the English and the Chinese drafts. They
helped improve the content and offered valuable feedback. Specifically, we thank every con-
tributor of this English draft for making it better for everyone. Their GitHub IDs or names are
(in no particular order): alxnorden, avinashingit, bowen0701, brettkoonce, Chaitanya Prakash
Bapat, cryptonaut, Davide Fiocco, edgarroman, gkutiel, John Mitro, Liang Pu, Rahul Agarwal,
Mohamed Ali Jamaoui, Michael (Stu) Stewart, Mike Müller, NRauschmayr, Prakhar Srivastav,
sad-, sfermigier, Sheng Zha, sundeepteki, topecongiro, tpdi, vermicelli, Vishaal Kapoor, Vish-
wesh Ravi Shrimali, YaYaB, Yuhong Chen, Evgeniy Smirnov, lgov, Simon Corston-Oliver, Igor
Dzreyev, Ha Nguyen, pmuens, Andrei Lukovenko, senorcinco, vfdev-5, dsweet, Mohammad
Mahdi Rahimi, Abhishek Gupta, uwsd, DomKM, Lisa Oakley, Bowen Li, Aarush Ahuja, Prasanth
Buddareddygari, brianhendee, mani2106, mtn, lkevinzc, caojilin, Lakshya, Fiete Lüer, Surbhi Vi-
jayvargeeya, Muhyun Kim, dennismalmgren, adursun, Anirudh Dagar, liqingnz, Pedro Larroy,
lgov, ati-ozgur, Jun Wu, Matthias Blume, Lin Yuan, geogunow, Josh Gardner, Maximilian Böther,
Rakib Islam, Leonard Lausen, Abhinav Upadhyay, rongruosong, Steve Sedlmeyer, Ruslan Bara-
tov, Rafael Schlatter, liusy182, Giannis Pappas, ati-ozgur, qbaza, dchoi77, Adam Gerson, Phuc
Le, Mark Atwood, christabella, vn09, Haibin Lin, jjangga0214, RichyChen, noelo, hansent, Giel
Dops, dvincent1337, WhiteD3vil, Peter Kulits, codypenta, joseppinilla, ahmaurya, karolszk, heyti-
tle, Peter Goetz, rigtorp, Tiep Vu, sfilip, mlxd, Kale-ab Tessera, Sanjar Adilov, MatteoFerrara,
hsneto, Katarzyna Biesialska, Gregory Bruss, Duy–Thanh Doan, paulaurel, graytowne, Duc Pham,
sl7423, Jaedong Hwang, Yida Wang, cys4, clhm, Jean Kaddour, austinmw, trebeljahr, tbaums,
Cuong V. Nguyen, pavelkomarov, vzlamal, NotAnotherSystem, J-Arun-Mani, jancio, eldarkur-
tic, the-great-shazbot, doctorcolossus, gducharme, cclauss, Daniel-Mietchen, hoonose, biagiom,
abhinavsp0730, jonathanhrandall, ysraell, Nodar Okroshiashvili, UgurKap, Jiyang Kang, Steven-
Jokes, Tomer Kaftan, liweiwp, netyster, ypandya, NishantTharani, heiligerl, SportsTHU, Hoa
Nguyen, manuel-arno-korfmann-webentwicklung, aterzis-personal, nxby, Xiaoting He, Josiah Yo-
der, mathresearch, mzz2017, jroberayalas, iluu, ghejc, BSharmi, vkramdev, simonwardjones,
LakshKD, TalNeoran, djliden, Nikhil95, Oren Barkan, guoweis, haozhu233, pratikhack, Yue Ying,
tayfununal, steinsag, charleybeller, Andrew Lumsdaine, Jiekui Zhang, Deepak Pathak, Florian
Donhauser, Tim Gates, Adriaan Tijsseling, Ron Medina, Gaurav Saha, Murat Semerci, Lei Mao,
Levi McClenny, Joshua Broyde, jake221, jonbally, zyhazwraith, Brian Pulfer, Nick Tomasino,
Lefan Zhang, Hongshen Yang, Vinney Cavallo, yuntai, Yuanxiang Zhu, amarazov, pasricha, Ben
Greenawald, Shivam Upadhyay, Quanshangze Du, Biswajit Sahoo, Parthe Pandit, Ishan Kumar,
HomunculusK, Lane Schwartz, varadgunjal, Jason Wiener, Armin Gholampoor, Shreshtha13,
eigen-arnav, Hyeonggyu Kim, EmilyOng, Bálint Mucsányi, Chase DuBois, Juntian Tao, Wenxiang
Xu, Lifu Huang, filevich, quake2005, nils-werner, Yiming Li, Marsel Khisamutdinov, Francesco
“Fuma” Fumagalli, Peilin Sun, Vincent Gurgul, qingfengtommy, Janmey Shukla, Mo Shan, Kaan
Sancak, regob, AlexSauer, Gopalakrishna Ramachandra, Tobias Uelwer, Chao Wang, Tian Cao,
7 https://discuss.d2l.ai/
Nicolas Corthorn, akash5474, kxxt, zxydi1992, Jacob Britton, Shuangchi He, zhmou, krahets, Jie-
Han Chen, Atishay Garg, Marcel Flygare, adtygan.
We thank Amazon Web Services, especially Swami Sivasubramanian, Peter DeSantis, Adam Selip-
sky, and Andrew Jassy for their generous support in writing this book. Without the available time,
resources, discussions with colleagues, and continuous encouragement, this book would not have
happened.
Summary
Deep learning has revolutionized pattern recognition, introducing techniques that now power a
wide range of technologies in such diverse fields as computer vision, natural language processing,
and automatic speech recognition. To successfully apply deep learning, you must understand how
to cast a problem, the basic mathematics of modeling, the algorithms for fitting your models to
data, and the engineering techniques to implement it all. This book presents a comprehensive
resource, including prose, figures, mathematics, and code, all in one place. To ask (or answer)
questions related to this book, visit our forum at https://discuss.d2l.ai/. All of our notebooks are
available for download on the D2L.ai website8 and on GitHub9 .
Exercises
1. Register an account on the discussion forum of this book discuss.d2l.ai10 .

2. Install Python on your computer.
3. Follow the links at the bottom of the section to the forum, where you can seek help,
discuss the book, and find answers to your questions by engaging the authors and the
broader community.
Discussions11
8 https://d2l.ai
9 https://github.com/d2l-ai/d2l-en
10 https://discuss.d2l.ai/
11 https://discuss.d2l.ai/t/20
Installation
In order to get up and running, we will need an environment for running Python, the Jupyter
Notebook, the relevant libraries, and the code needed to run the book itself.
Installing Miniconda
Your simplest option is to install Miniconda12 . Note that the Python 3.x version is required. You
can skip the following steps if your machine already has conda installed.
Visit the Miniconda website and determine the appropriate version for your system based on your
Python 3.x version and machine architecture. Suppose that your Python version is 3.8 (our tested
version). If you are using macOS, you would download the bash script whose name contains the
string “MacOSX”, navigate to the download location, and execute the installation as follows:
# The file name is subject to changes

sh Miniconda3-py38_4.10.3-MacOSX-x86_64.sh -b
A Linux user would download the file whose name contains the string “Linux” and execute the
following at the download location:
# The file name is subject to changes

sh Miniconda3-py38_4.10.3-Linux-x86_64.sh -b
Next, initialize the shell so we can run conda directly.
~/miniconda3/bin/conda init
Then close and reopen your current shell. You should be able to create a new environment as
follows:
conda create --name d2l python=3.8 -y
Now we can activate the d2l environment:
conda activate d2l
12 https://conda.io/en/latest/miniconda.html
Installing the Deep Learning Framework and the d2l Package
Before installing any deep learning framework, please first check whether or not you have proper
GPUs on your machine (the GPUs that power the display on a standard laptop are not relevant
for our purposes). For example, if your computer has NVIDIA GPUs and has installed CUDA13 ,
then you are all set. If your machine does not house any GPU, there is no need to worry just yet.
Your CPU provides more than enough horsepower to get you through the first few chapters. Just
remember that you will want to access GPUs before running larger models.
You can install PyTorch with either CPU or GPU support as follows:
pip install torch torchvision
Our next step is to install the d2l package that we developed in order to encapsulate frequently
used functions and classes found throughout this book:
pip install d2l==1.0.0a0
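After both install commands have run, a short check from Python can confirm that the packages are visible in the active environment and report whether PyTorch can see a CUDA GPU. This is a minimal sketch rather than an official verification step. The function names missing_packages and accelerator_summary are our own, and the package list is simply read off the commands above.

```python
import importlib.util

# Package names taken from the installation commands above.
REQUIRED = ("torch", "torchvision", "d2l")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def accelerator_summary():
    """Describe whether PyTorch can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch  # imported lazily so the check also runs without PyTorch

    if torch.cuda.is_available():
        return f"CUDA available with {torch.cuda.device_count()} device(s)"
    return "No CUDA device visible; running on CPU"


print("Missing packages:", missing_packages() or "none")
print(accelerator_summary())
```

A CPU-only result is fine for the first few chapters, as noted above.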
Downloading and Running the Code
Next, you will want to download the notebooks so that you can run each of the bookʼs code blocks.
Simply click on the “Notebooks” tab at the top of any HTML page on the D2L.ai website14 to down-
load the code and then unzip it. Alternatively, you can fetch the notebooks from the command
line as follows:
mkdir d2l-en && cd d2l-en

curl https://d2l.ai/d2l-en.zip -o d2l-en.zip
unzip d2l-en.zip && rm d2l-en.zip
cd pytorch
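On systems where curl or unzip is unavailable, the same download and extraction can be sketched with the Python standard library alone. The URL is the one used in the commands above. The function names extract_zip_bytes and download_notebooks are hypothetical helpers written for this illustration, not part of the d2l package.

```python
import io
import urllib.request
import zipfile
from pathlib import Path

D2L_ZIP_URL = "https://d2l.ai/d2l-en.zip"  # same archive as in the curl command


def extract_zip_bytes(payload, target_dir):
    """Unpack an in-memory zip archive into target_dir and list its members."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(target)
        return archive.namelist()


def download_notebooks(target_dir="d2l-en", url=D2L_ZIP_URL):
    """Fetch the notebook archive over the network and unpack it."""
    with urllib.request.urlopen(url) as response:
        return extract_zip_bytes(response.read(), target_dir)


# Usage (requires network access):
# members = download_notebooks()
```

Keeping the extraction step separate from the download makes it possible to try the unpacking logic on a small local archive before fetching the full set of notebooks.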
If you donʼt already have unzip installed, first run sudo apt-get install unzip. Now we can start
the Jupyter Notebook server by running:
jupyter notebook
At this point, you can open http://localhost:8888 (it may have already opened automatically) in
your Web browser. Then we can run the code for each section of the book. Whenever you open
a new command line window, you will need to execute conda activate d2l to activate the run-
time environment before running the D2L notebooks, or updating your packages (either the deep
learning framework or the d2l package). To exit the environment, run conda deactivate.
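Forgetting to activate the environment in a new terminal is a common source of import errors. A notebook or script can check which conda environment is active by reading the CONDA_DEFAULT_ENV variable that conda sets on activation. The sketch below assumes the environment is named d2l, as created above. The helper names are our own.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if unset."""
    # conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return environ.get("CONDA_DEFAULT_ENV")


def in_d2l_env(environ=os.environ, expected="d2l"):
    """Check whether the environment created in this section is active."""
    return active_conda_env(environ) == expected


print("Active conda environment:", active_conda_env())
```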
Discussions15
13 https://developer.nvidia.com/cuda-downloads
14 https://d2l.ai/
15
Notation
Throughout this book, we adhere to the following notational conventions. Note that some of these
symbols are placeholders, while others refer to specific objects. As a general rule of thumb, the
indefinite article “a” often indicates that the symbol is a placeholder and that similarly formatted
symbols can denote other objects of the same type. For example, “$x$: a scalar” means that lowercased
letters generally represent scalar values, but “$\mathbb{Z}$: the set of integers” refers specifically to
the symbol $\mathbb{Z}$.
Numerical Objects
• $x$: a scalar
• $\mathbf{x}$: a vector
• $\mathbf{X}$: a matrix
• $\mathsf{X}$: a general tensor
• $\mathbf{I}$: the identity matrix (of some given dimension), i.e., a square matrix with 1 on all diagonal
entries and 0 on all off-diagonals
• $x_i$, $[\mathbf{x}]_i$: the $i^\textrm{th}$ element of vector $\mathbf{x}$
• $x_{ij}$, $x_{i,j}$, $[\mathbf{X}]_{ij}$, $[\mathbf{X}]_{i,j}$: the element of matrix $\mathbf{X}$ at row $i$ and column $j$.
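To connect these symbols to code, here is a small sketch using plain Python lists in place of tensors. It assumes the common mathematical convention that indices start at 1, which this notation list does not state explicitly, while Python sequences start at 0. The helper names identity, element, and entry are illustrative only.

```python
def identity(n):
    """Build the n-by-n identity matrix: 1 on the diagonal, 0 elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def element(vec, i):
    """Return x_i, assuming 1-based mathematical indexing."""
    return vec[i - 1]


def entry(mat, i, j):
    """Return x_{ij}, the element at row i and column j (1-based)."""
    return mat[i - 1][j - 1]


x = [10, 20, 30]            # a vector
X = [[1, 2, 3], [4, 5, 6]]  # a matrix with 2 rows and 3 columns

print(identity(3))
print(element(x, 1), entry(X, 2, 3))
```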
Set Theory
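• $\mathcal{X}$: a set
• $\mathbb{Z}$: the set of integers
• $\mathbb{Z}^+$: the set of positive integers
• $\mathbb{R}$: the set of real numbers
• $\mathbb{R}^n$: the set of $n$-dimensional vectors of real numbers
• $\mathbb{R}^{a \times b}$: the set of matrices of real numbers with $a$ rows and $b$ columns
• $|\mathcal{X}|$: cardinality (number of elements) of set $\mathcal{X}$
• $\mathcal{A} \cup \mathcal{B}$: union of sets $\mathcal{A}$ and $\mathcal{B}$
• $\mathcal{A} \cap \mathcal{B}$: intersection of sets $\mathcal{A}$ and $\mathcal{B}$
• $\mathcal{A} \setminus \mathcal{B}$: set subtraction of $\mathcal{B}$ from $\mathcal{A}$ (contains only those elements of $\mathcal{A}$ that do not belong to
$\mathcal{B}$)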
Functions and Operators
• $f(\cdot)$: a function
• $\log(\cdot)$: the natural logarithm (base $e$)
• $\log_2(\cdot)$: logarithm with base 2
• $\exp(\cdot)$: the exponential function
• $\mathbf{1}(\cdot)$: the indicator function, evaluates to 1 if the boolean argument is true and 0 otherwise
• $\mathbf{1}_{\mathcal{X}}(z)$: the set-membership indicator function, evaluates to 1 if the element $z$ belongs to the
set $\mathcal{X}$ and 0 otherwise
• $(\cdot)^\top$: transpose of a vector or a matrix
• $\mathbf{X}^{-1}$: inverse of matrix $\mathbf{X}$
• $\odot$: Hadamard (elementwise) product
• $[\cdot, \cdot]$: concatenation
• $\|\cdot\|_p$: $\ell_p$ norm
• $\|\cdot\|$: $\ell_2$ norm
• $\langle \mathbf{x}, \mathbf{y} \rangle$: dot product of vectors $\mathbf{x}$ and $\mathbf{y}$
• $\sum$: summation over a collection of elements
• $\prod$: product over a collection of elements
• $\stackrel{\mathrm{def}}{=}$: an equality asserted as a definition of the symbol on the left-hand side
Calculus
• $\frac{dy}{dx}$: derivative of $y$ with respect to $x$
• $\frac{\partial y}{\partial x}$: partial derivative of $y$ with respect to $x$
• $\nabla_{\mathbf{x}} y$: gradient of $y$ with respect to $\mathbf{x}$
• $\int_a^b f(x) \, dx$: definite integral of $f$ from $a$ to $b$ with respect to $x$
• $\int f(x) \, dx$: indefinite integral of $f$ with respect to $x$
Probability and Information Theory
• $X$: a random variable
• $P$: a probability distribution
• $X \sim P$: the random variable $X$ follows distribution $P$
• $P(X = x)$: the probability assigned to the event where random variable $X$ takes value $x$
• $P(X \mid Y)$: the conditional probability distribution of $X$ given $Y$
• $p(\cdot)$: a probability density function (PDF) associated with distribution $P$
• $E[X]$: expectation of a random variable $X$
• $X \perp Y$: random variables $X$ and $Y$ are independent
• $X \perp Y \mid Z$: random variables $X$ and $Y$ are conditionally independent given $Z$
• $\sigma_X$: standard deviation of random variable $X$
• $\mathrm{Var}(X)$: variance of random variable $X$, equal to $\sigma_X^2$
• $\mathrm{Cov}(X, Y)$: covariance of random variables $X$ and $Y$
• $\rho(X, Y)$: the Pearson correlation coefficient between $X$ and $Y$, equals $\frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
• $H(X)$: entropy of random variable $X$
• $D_{\mathrm{KL}}(P \| Q)$: the KL-divergence (or relative entropy) from distribution $Q$ to distribution $P$
1 | Introduction
Until recently, nearly every computer program that you might interact with on an ordinary day
was coded up as a rigid set of rules specifying precisely how it should behave. Say that we wanted
to write an application to manage an e-commerce platform. After huddling around a whiteboard
for a few hours to ponder the problem, we might settle on the broad strokes of a working solution,
for example: (i) users interact with the application through an interface running in a web browser
or mobile application; (ii) our application interacts with a commercial-grade database engine to
keep track of each userʼs state and maintain records of historical transactions; and (iii) at the heart
of our application, the business logic (you might say, the brains) of our application spells out a set
of rules that map every conceivable circumstance to the corresponding action that our program
should take.
To build the brains of our application, we might enumerate all the common events that our pro-
gram should handle. For example, whenever a customer clicks to add an item to their shopping
cart, our program should add an entry to the shopping cart database table, associating that userʼs
ID with the requested productʼs ID. We might then attempt to step through every possible cor-
ner case, testing the appropriateness of our rules and making any necessary modifications. What
happens if a user initiates a purchase with an empty cart? While few developers ever get it com-
pletely right the first time (it might take some test runs to work out the kinks), for the most part,
we can write such programs and confidently launch them before ever seeing a real customer. Our
ability to manually design automated systems that drive functioning products and systems, often
in novel situations, is a remarkable cognitive feat. And when you are able to devise solutions that
work 100% of the time, you typically should not be worrying about machine learning.
Fortunately for the growing community of machine learning scientists, many tasks that we would
like to automate do not bend so easily to human ingenuity. Imagine huddling around the white-
board with the smartest minds you know, but this time you are tackling one of the following prob-
lems:
• Write a program that predicts tomorrowʼs weather given geographic information, satellite
images, and a trailing window of past weather.
• Write a program that takes in a factoid question, expressed in free-form text, and answers it
correctly.
• Write a program that, given an image, identifies all of the people depicted in it and draws
outlines around each.
• Write a program that presents users with products that they are likely to enjoy but unlikely,
in the natural course of browsing, to encounter.
For these problems, even elite programmers would struggle to code up solutions from scratch.
The reasons can vary. Sometimes the program that we are looking for follows a pattern that
changes over time, so there is no fixed right answer! In such cases, any successful solution must
adapt gracefully to a changing world. At other times, the relationship (say, between pixels and
abstract categories) may be too complicated, requiring thousands or millions of computations and
following unknown principles. In the case of image recognition, the precise steps required to per-
form the task lie beyond our conscious understanding, even though our subconscious cognitive
processes execute the task effortlessly.
Machine learning is the study of algorithms that can learn from experience. As a machine learn-
ing algorithm accumulates more experience, typically in the form of observational data or in-
teractions with an environment, its performance improves. Contrast this with our deterministic
e-commerce platform, which follows the same business logic, no matter how much experience
accrues, until the developers themselves learn and decide that it is time to update the software. In
this book, we will teach you the fundamentals of machine learning, focusing in particular on deep
learning, a powerful set of techniques driving innovations in areas as diverse as computer vision,
natural language processing, healthcare, and genomics.
1.1 A Motivating Example
Before beginning writing, the authors of this book, like much of the work force, had to become
caffeinated. We hopped in the car and started driving. Using an iPhone, Alex called out “Hey Siri”,
awakening the phoneʼs voice recognition system. Then Mu commanded “directions to Blue Bottle
coffee shop”. The phone quickly displayed the transcription of his command. It also recognized
that we were asking for directions and launched the Maps application (app) to fulfill our request.
Once launched, the Maps app identified a number of routes. Next to each route, the phone dis-
played a predicted transit time. While we fabricated this story for pedagogical convenience, it
demonstrates that in the span of just a few seconds, our everyday interactions with a smart phone
can engage several machine learning models.
Imagine just writing a program to respond to a wake word such as “Alexa”, “OK Google”, or “Hey
Siri”. Try coding it up in a room by yourself with nothing but a computer and a code editor, as
illustrated in Fig. 1.1.1. How would you write such a program from first principles? Think about
it… the problem is hard. Every second, the microphone will collect roughly 44000 samples. Each
sample is a measurement of the amplitude of the sound wave. What rule could map reliably from
a snippet of raw audio to confident predictions {yes, no} on whether the snippet contains the wake
word? If you are stuck, do not worry. We do not know how to write such a program from scratch
either. That is why we use machine learning.
Fig. 1.1.1: Identify a wake word.
Here is the trick. Often, even when we do not know how to tell a computer explicitly how to map
from inputs to outputs, we are nonetheless capable of performing the cognitive feat ourselves. In
other words, even if you do not know how to program a computer to recognize the word “Alexa”,
you yourself are able to recognize it. Armed with this ability, we can collect a huge dataset con-
taining examples of audio snippets and associated labels, indicating which snippets contain the
wake word. In the dominant approach to machine learning, we do not attempt to design a sys-
tem explicitly to recognize wake words. Instead, we define a flexible program whose behavior is
determined by a number of parameters. Then we use the dataset to determine the best possible pa-
rameter values, i.e., those that improve the performance of our program with respect to a chosen
performance measure.
You can think of the parameters as knobs that we can turn, manipulating the behavior of the
program. Fixing the parameters, we call the program a model. The set of all distinct programs
(input-output mappings) that we can produce just by manipulating the parameters is called a fam-
ily of models. And the meta-program that uses our dataset to choose the parameters is called a
learning algorithm.
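To make this vocabulary concrete, here is a minimal sketch in plain Python. The threshold-based model family, the function names `make_model` and `learning_algorithm`, and the toy dataset are all illustrative assumptions; a real wake word detector would use a far richer family of models.

```python
# A toy "family of models": every value of the parameter `threshold`
# yields one model, i.e., one fixed input-output mapping.
def make_model(threshold):
    def model(x):
        return "yes" if x >= threshold else "no"
    return model

# A toy "learning algorithm": a meta-program that uses the dataset to
# choose the parameter value that makes the fewest mistakes on it.
def learning_algorithm(dataset, candidate_thresholds):
    def num_errors(threshold):
        model = make_model(threshold)
        return sum(model(x) != label for x, label in dataset)
    return min(candidate_thresholds, key=num_errors)

# Hypothetical dataset of (feature, label) pairs.
dataset = [(0.1, "no"), (0.3, "no"), (0.6, "yes"), (0.9, "yes")]
best_threshold = learning_algorithm(dataset, [0.0, 0.25, 0.5, 0.75, 1.0])
model = make_model(best_threshold)  # parameters fixed: this is a model
```

Here the knob is a single number, and the learning algorithm simply tries a handful of settings. The methods in this book replace that exhaustive search with more efficient procedures.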
Before we can go ahead and engage the learning algorithm, we have to define the problem pre-
cisely, pinning down the exact nature of the inputs and outputs, and choosing an appropriate
model family. In this case, our model receives a snippet of audio as input, and the model generates
a selection among {yes, no} as output. If all goes according to plan, the modelʼs guesses will
typically be correct as to whether the snippet contains the wake word.
If we choose the right family of models, there should exist one setting of the knobs such that the
model fires “yes” every time it hears the word “Alexa”. Because the exact choice of the wake word
is arbitrary, we will probably need a model family sufficiently rich that, via another setting of the
knobs, it could fire “yes” only upon hearing the word “Apricot”. We expect that the same model
family should be suitable for “Alexa” recognition and “Apricot” recognition because they seem,
intuitively, to be similar tasks. However, we might need a different family of models entirely if we
want to deal with fundamentally different inputs or outputs, say if we wanted to map from images
to captions, or from English sentences to Chinese sentences.
As you might guess, if we just set all of the knobs randomly, it is unlikely that our model will
recognize “Alexa”, “Apricot”, or any other English word. In machine learning, the learning is the
process by which we discover the right setting of the knobs, coercing the desired behavior from
our model. In other words, we train our model with data. As shown in Fig. 1.1.2, the training
process usually looks like the following:
1. Start off with a randomly initialized model that cannot do anything useful.
2. Grab some of your data (e.g., audio snippets and corresponding {yes, no} labels).
3. Tweak the knobs to make the model perform better as assessed on those examples.
4. Repeat Steps 2 and 3 until the model is awesome.
Fig. 1.1.2: A typical training process.
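The four steps above can be sketched with a model that has a single knob. This is only an illustration under stated assumptions: the dataset (targets equal to three times the input), the learning rate, and the number of steps are all made up, and the update rule is the simple gradient-based one described later in this chapter.

```python
import random

random.seed(0)

# Hypothetical dataset: inputs x with targets y = 3 * x.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]

# 1. Start off with a randomly initialized model (a single knob w).
w = random.uniform(-1.0, 1.0)

learning_rate = 0.1
for step in range(200):              # 4. Repeat Steps 2 and 3.
    x, y = random.choice(data)       # 2. Grab some of the data.
    error = w * x - y                # How wrong is the model on this example?
    gradient = 2 * error * x         # Slope of the squared error with respect to w.
    w -= learning_rate * gradient    # 3. Tweak the knob to do better.
```

After the loop, `w` ends up very close to 3, the value that generated the data.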
To summarize, rather than code up a wake word recognizer, we code up a program that can learn
to recognize wake words, if presented with a large labeled dataset. You can think of this act of
determining a programʼs behavior by presenting it with a dataset as programming with data. That
is to say, we can “program” a cat detector by providing our machine learning system with many
examples of cats and dogs. This way the detector will eventually learn to emit a very large positive
number if it is a cat, a very large negative number if it is a dog, and something closer to zero if it
is not sure. This barely scratches the surface of what machine learning can do. Deep learning,
which we will explain in greater detail later, is just one among many popular methods for solving
machine learning problems.
1.2 Key Components
In our wake word example, we described a dataset consisting of audio snippets and binary labels,
and we gave a hand-wavy sense of how we might train a model to approximate a mapping from
snippets to classifications. This sort of problem, where we try to predict a designated unknown la-
bel based on known inputs given a dataset consisting of examples for which the labels are known,
is called supervised learning. This is just one among many kinds of machine learning problems.
Before we explore other varieties, we would like to shed more light on some core components that
will follow us around, no matter what kind of machine learning problem we take on:
1. The data that we can learn from.
2. A model of how to transform the data.
3. An objective function that quantifies how well (or badly) the model is doing.
4. An algorithm to adjust the modelʼs parameters to optimize the objective function.
1.2.1 Data
It might go without saying that you cannot do data science without data. We could lose hundreds
of pages pondering what precisely data is, but for now, we will focus on the key properties of the
datasets that we will be concerned with. Generally, we are concerned with a collection of exam-
ples. In order to work with data usefully, we typically need to come up with a suitable numerical
representation. Each example (or data point, data instance, sample) typically consists of a set of
attributes called features (sometimes called covariates or inputs), based on which the model must
make its predictions. In supervised learning problems, our goal is to predict the value of a special
attribute, called the label (or target), that is not part of the modelʼs input.
If we were working with image data, each example might consist of an individual photograph (the
features) and a number indicating the category to which the photograph belongs (the label). The
photograph would be represented numerically as three grids of numerical values representing
the brightness of red, green, and blue light at each pixel location. For example, a 200 × 200 color
photograph would consist of 200 × 200 × 3 = 120000 numerical values.
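As a quick check of this arithmetic, the following sketch builds such a photograph as three nested grids in plain Python. The all-zero pixel values and the channel-first layout are arbitrary choices made for illustration.

```python
height, width, channels = 200, 200, 3

# image[c][i][j] holds the brightness of channel c (red, green, or blue)
# at pixel row i and column j. All values are zero here (a black image).
image = [[[0 for _ in range(width)] for _ in range(height)]
         for _ in range(channels)]

# Count every numerical value stored in the three grids.
num_values = sum(len(row) for grid in image for row in grid)
```

Counting the entries gives `num_values == 120000`, matching 200 × 200 × 3.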
Alternatively, we might work with electronic health record data and tackle the task of predicting
the likelihood that a given patient will survive the next 30 days. Here, our features might consist
of a collection of readily available attributes and frequently recorded measurements, including
age, vital signs, comorbidities, current medications, and recent procedures. The label available
for training would be a binary value indicating whether each patient in the historical data survived
the 30-day window.
In such cases, when every example is characterized by the same number of numerical features,
we say that the inputs are fixed-length vectors and we call the (constant) length of the vectors the
dimensionality of the data. As you might imagine, fixed-length inputs can be convenient, giving
us one less complication to worry about. However, not all data can easily be represented as fixed-
length vectors. While we might expect microscope images to come from standard equipment, we
cannot expect images mined from the Internet to all show up with the same resolution or shape.
For images, we might consider cropping them all to a standard size, but that strategy only gets us
so far. We risk losing information in the cropped out portions. Moreover, text data resists fixed-
length representations even more stubbornly. Consider the customer reviews left on e-commerce
sites such as Amazon, IMDb, and TripAdvisor. Some are short: “it stinks!”. Others ramble for
pages. One major advantage of deep learning over traditional methods is the comparative grace
with which modern models can handle varying-length data.
Generally, the more data we have, the easier our job becomes. When we have more data, we
can train more powerful models and rely less heavily on preconceived assumptions. The regime
change from (comparatively) small to big data is a major contributor to the success of modern
deep learning. To drive the point home, many of the most exciting models in deep learning do not
work without large datasets. Some others work in the small data regime, but are no better than
traditional approaches.
Finally, it is not enough to have lots of data and to process it cleverly. We need the right data. If
the data is full of mistakes, or if the chosen features are not predictive of the target quantity of
interest, learning is going to fail. The situation is captured well by the cliché: garbage in, garbage
out. Moreover, poor predictive performance is not the only potential consequence. In sensitive
applications of machine learning, like predictive policing, resume screening, and risk models
used for lending, we must be especially alert to the consequences of garbage data. One common
failure mode occurs in datasets where some groups of people are unrepresented in the training
data. Imagine applying a skin cancer recognition system in the wild that had never seen black
skin before. Failure can also occur when the data does not merely under-represent some groups
but reflects societal prejudices. For example, if past hiring decisions are used to train a predictive
model that will be used to screen resumes, then machine learning models could inadvertently
capture and automate historical injustices. Note that this can all happen without the data scientist
actively conspiring, or even being aware.
1.2.2 Models
Most machine learning involves transforming the data in some sense. We might want to build a
system that ingests photos and predicts smiley-ness. Alternatively, we might want to ingest a set of
sensor readings and predict how normal vs. anomalous the readings are. By model, we denote the
computational machinery for ingesting data of one type, and spitting out predictions of a possibly
different type. In particular, we are interested in statistical models that can be estimated from
data. While simple models are perfectly capable of addressing appropriately simple problems,
the problems that we focus on in this book stretch the limits of classical methods. Deep learning
is differentiated from classical approaches principally by the set of powerful models that it focuses
on. These models consist of many successive transformations of the data that are chained together
top to bottom, thus the name deep learning. On our way to discussing deep models, we will also
discuss some more traditional methods.
1.2.3 Objective Functions
Earlier, we introduced machine learning as learning from experience. By learning here, we mean
improving at some task over time. But who is to say what constitutes an improvement? You might
imagine that we could propose to update our model, and some people might disagree on whether
the proposed update constituted an improvement or a decline.
In order to develop a formal mathematical system of learning machines, we need to have formal
measures of how good (or bad) our models are. In machine learning, and optimization more
generally, we call these objective functions. By convention, we usually define objective functions
so that lower is better. This is merely a convention. You can take any function for which higher is
better, and turn it into a new function that is qualitatively identical but for which lower is better
by flipping the sign. Because lower is better, these functions are sometimes called loss functions.
When trying to predict numerical values, the most common loss function is squared error, i.e., the
square of the difference between the prediction and the ground truth target. For classification,
the most common objective is to minimize error rate, i.e., the fraction of examples on which our
predictions disagree with the ground truth. Some objectives (e.g., squared error) are easy to opti-
mize, while others (e.g., error rate) are difficult to optimize directly, owing to non-differentiability
or other complications. In these cases, it is common to optimize a surrogate objective.
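The two objectives named above can be written down directly. The sketch below uses plain Python; the function names and the example predictions are hypothetical.

```python
def squared_error(prediction, target):
    """Square of the difference between prediction and ground truth."""
    return (prediction - target) ** 2

def error_rate(predictions, targets):
    """Fraction of examples on which prediction and ground truth disagree."""
    mistakes = sum(p != t for p, t in zip(predictions, targets))
    return mistakes / len(targets)

regression_loss = squared_error(2.5, 3.0)
classification_loss = error_rate(["cat", "dog", "cat", "cat"],
                                 ["cat", "dog", "dog", "cat"])
```

Note that `error_rate` only changes when a prediction flips from right to wrong or back, which is why it gives little guidance about how to adjust parameters by small amounts.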
During optimization, we think of the loss as a function of the modelʼs parameters, and treat the
training dataset as a constant. We learn the best values of our modelʼs parameters by minimizing
the loss incurred on a set consisting of some number of examples collected for training.
However, doing well on the training data does not guarantee that we will do well on unseen data.
So we will typically want to split the available data into two partitions: the training dataset (or
training set), for learning model parameters; and the test dataset (or test set), which is held out for
evaluation. At the end of the day, we typically report how our models perform on both
partitions. You could think of training performance as analogous to the scores that a student
achieves on the practice exams used to prepare for some real final exam. Even if the results are
encouraging, that does not guarantee success on the final exam. Over the course of studying, the
student might begin to memorize the practice questions, appearing to master the topic but
faltering when faced with previously unseen questions on the actual final exam.
When a model performs well on the training set but fails to generalize to unseen data, we say
that it is overfitting to the training data.
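A minimal sketch of such a split, in plain Python, might look as follows. The 80/20 ratio, the fixed seed, and the function name `train_test_split` are assumptions made for illustration; the text does not prescribe a particular ratio.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    num_test = int(len(shuffled) * test_fraction)
    return shuffled[num_test:], shuffled[:num_test]

# Hypothetical dataset of (feature, label) pairs.
examples = [(x, 2 * x) for x in range(10)]
train_set, test_set = train_test_split(examples)
```

The essential property is that the two partitions do not overlap, so performance on `test_set` reflects examples the model never saw during training.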
1.2.4 Optimization Algorithms
Once we have got some data source and representation, a model, and a well-defined objective func-
tion, we need an algorithm capable of searching for the best possible parameters for minimizing
the loss function. Popular optimization algorithms for deep learning are based on an approach
called gradient descent. In short, at each step, this method checks to see, for each parameter, which
way the training set loss would move if you perturbed that parameter just a small amount. It then
updates the parameter in the direction that lowers the loss.
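The description above translates almost literally into code. The sketch below perturbs each parameter by a small amount `eps`, observes how the training loss moves, and steps in the direction that lowers it. The linear model, the toy data, the learning rate, and the step count are all assumptions. In practice, deep learning frameworks compute these slopes analytically rather than by perturbation.

```python
def training_loss(params, data):
    """Mean squared error of the linear model w * x + b on the data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def gradient_descent_step(params, data, learning_rate=0.1, eps=1e-6):
    base_loss = training_loss(params, data)
    updated = []
    for i, value in enumerate(params):
        # Perturb one parameter by a small amount and see how the loss moves.
        perturbed = list(params)
        perturbed[i] = value + eps
        slope = (training_loss(perturbed, data) - base_loss) / eps
        # Move the parameter in the direction that lowers the loss.
        updated.append(value - learning_rate * slope)
    return updated

# Hypothetical data generated by y = 2 * x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
params = [0.0, 0.0]
initial_loss = training_loss(params, data)
for _ in range(1000):
    params = gradient_descent_step(params, data)
final_loss = training_loss(params, data)
```

After these updates, `params` is close to `[2.0, 1.0]` and the training loss has dropped to nearly zero.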
1.3 Kinds of Machine Learning Problems
The wake word problem in our motivating example is just one among many problems that ma-
chine learning can tackle. To motivate the reader further and provide us with some common lan-
guage that will follow us throughout the book, we now provide a broad overview of the landscape
of machine learning problem formulations.
1.3.1 Supervised Learning
Supervised learning describes tasks where we are given a dataset containing both features and
labels and tasked with producing a model to predict the labels given input features. Each feature–
label pair is called an example. Sometimes, when the context is clear, we may use the term ex-
amples to refer to a collection of inputs, even when the corresponding labels are unknown. The
supervision comes into play because for choosing the parameters, we (the supervisors) provide
the model with a dataset consisting of labeled examples. In probabilistic terms, we typically are
interested in estimating the conditional probability of a label given input features. While it is just
one among several paradigms within machine learning, supervised learning accounts for the ma-
jority of successful applications of machine learning in industry. Partly, that is because many im-
portant tasks can be described crisply as estimating the probability of something unknown given
a particular set of available data:
• Predict cancer vs. not cancer, given a computer tomography image.
• Predict the correct translation in French, given a sentence in English.
• Predict the price of a stock next month based on this monthʼs financial reporting data.
While all supervised learning problems are captured by the simple description “predicting the
labels given input features”, supervised learning can take diverse forms and require many modeling
decisions, depending on (among other considerations) the type, size, and quantity of the inputs
and outputs. For example, we use different models to process sequences of arbitrary length
than we use to process fixed-length vector representations. We will visit many of these problems in
depth throughout this book.
Informally, the learning process looks something like the following. First, grab a big collection of
examples for which the features are known and select from them a random subset, acquiring the
ground-truth labels for each. Sometimes these labels might be available data that have already
been collected (e.g., did a patient die within the following year?) and other times we might need
to employ human annotators to label the data (e.g., assigning images to categories). Together,
these inputs and corresponding labels comprise the training set. We feed the training dataset
into a supervised learning algorithm, a function that takes as input a dataset and outputs another
function: the learned model. Finally, we can feed previously unseen inputs to the learned model,
using its outputs as predictions of the corresponding label. The full process is drawn in Fig. 1.3.1.
Fig. 1.3.1: Supervised learning.
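The idea that a supervised learning algorithm is a function that takes a dataset and returns another function can be sketched in a few lines. The nearest-neighbor rule used here is just one simple stand-in chosen for brevity, and the names and data are hypothetical.

```python
def nearest_neighbor_learner(training_set):
    """A learning algorithm: takes a dataset, returns a learned model."""
    def learned_model(x):
        # Predict the label of the training example whose feature is closest.
        _, nearest_label = min(training_set, key=lambda ex: abs(ex[0] - x))
        return nearest_label
    return learned_model

# Hypothetical training set of (feature, label) pairs.
training_set = [(1.0, "no"), (2.0, "no"), (8.0, "yes"), (9.0, "yes")]
model = nearest_neighbor_learner(training_set)

# Feed a previously unseen input to the learned model.
prediction = model(7.5)
```

Here `prediction` is `"yes"`, because the closest training example to 7.5 carries that label.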
Regression
Perhaps the simplest supervised learning task to wrap your head around is regression. Consider,
for example, a set of data harvested from a database of home sales. We might construct a table,
where each row corresponds to a different house, and each column corresponds to some relevant
attribute, such as the square footage of a house, the number of bedrooms, the number of bath-
rooms, and the number of minutes (walking) to the center of town. In this dataset, each example
would be a specific house, and the corresponding feature vector would be one row in the table.
If you live in New York or San Francisco, and you are not the CEO of Amazon, Google, Microsoft,
or Facebook, the (sq. footage, no. of bedrooms, no. of bathrooms, walking distance) feature vec-
tor for your home might look something like: [600, 1, 1, 60]. However, if you live in Pittsburgh, it
might look more like [3000, 4, 3, 10]. Fixed-length feature vectors like this are essential for most
classic machine learning algorithms.
What makes a problem a regression is actually the form of the target. Say that you are in the
market for a new home. You might want to estimate the fair market value of a house, given some
features like above. The data here might consist of historical home listings and the labels might
be the observed sales prices. When labels take on arbitrary numerical values (even within some
interval), we call this a regression problem. The goal is to produce a model whose predictions
closely approximate the actual label values.
Lots of practical problems are easily described as regression problems. Predicting the rating that
a user will assign to a movie can be thought of as a regression problem, and if you had designed a great
algorithm to accomplish this feat in 2009, you might have won the 1-million-dollar Netflix Prize17.
Predicting the length of stay for patients in the hospital is also a regression problem. A good rule
of thumb is that any how much? or how many? problem should suggest regression, for example:
• How many hours will this surgery take?
• How much rainfall will this town have in the next six hours?
Even if you have never worked with machine learning before, you have probably worked through
a regression problem informally. Imagine, for example, that you had your drains repaired and
that your contractor spent 3 hours removing gunk from your sewage pipes. Then he sent you a
bill of 350 dollars. Now imagine that your friend hired the same contractor for 2 hours and that he
received a bill of 250 dollars. If someone then asked you how much to expect on their upcoming
gunk-removal invoice you might make some reasonable assumptions, such as more hours worked
costs more dollars. You might also assume that there is some base charge and that the contractor
then charges per hour. If these assumptions held true, then given these two data examples, you
could already identify the contractorʼs pricing structure: 100 dollars per hour plus 50 dollars to
show up at your house. If you followed that much, then you already understand the high-level
idea behind linear regression.
17 https://en.wikipedia.org/wiki/Netflix_Prize
In this case, we could produce parameters that exactly matched the contractorʼs prices. Sometimes
this is not possible, e.g., if some of the variation in the labels is due to factors that your features
do not capture. In these cases, we will try to learn models that minimize the distance between our
predictions and the observed values. In most of our chapters, we will focus on minimizing the squared
error loss function. As we will see later, this loss corresponds to the assumption that our data
were corrupted by Gaussian noise.
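The contractor example can be worked through in a few lines of Python. The two invoices come from the text. The third invoice (4 hours, 460 dollars) is a made-up observation, added only to show what a nonzero squared error looks like.

```python
# The two invoices from the text: (hours worked, bill in dollars).
hours_a, bill_a = 3, 350
hours_b, bill_b = 2, 250

# With a base charge plus an hourly rate, two examples pin down both parameters.
rate_per_hour = (bill_a - bill_b) / (hours_a - hours_b)
base_charge = bill_a - rate_per_hour * hours_a

def predict_bill(hours):
    return base_charge + rate_per_hour * hours

# Hypothetical third invoice that the pricing rule does not match exactly.
observed_hours, observed_bill = 4, 460
squared_error = (predict_bill(observed_hours) - observed_bill) ** 2
```

This recovers a rate of 100 dollars per hour and a base charge of 50 dollars. For the hypothetical third invoice, the prediction is 450 dollars, so the squared error is 100.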
Classification
While regression models are great for addressing how many? questions, lots of problems do not
bend comfortably to this template. Consider, for example, a bank that wants to develop a check
scanning feature for its mobile app. Ideally, the customer would simply snap a photo of a check
and the app would automatically recognize the text from the image. Assuming that we had some
ability to segment out image patches corresponding to each handwritten character, then the pri-
mary remaining task would be to determine which character among some known set is depicted
in each image patch. These kinds of which one? problems are called classification and require a
different set of tools than those used for regression, although many techniques will carry over.
In classification, we want our model to look at features, e.g., the pixel values in an image, and then
predict to which category (sometimes called a class), among some discrete set of options, an example
belongs. For handwritten digits, we might have ten classes, corresponding to the digits 0 through
9. The simplest form of classification is when there are only two classes, a problem which we call
binary classification. For example, our dataset could consist of images of animals and our labels
might be the classes {cat, dog}. While in regression, we sought a regressor to output a numerical
value, in classification, we seek a classifier, whose output is the predicted class assignment.
For reasons that we will get into as the book gets more technical, it can be hard to optimize a
model that can only output a hard categorical assignment, e.g., either “cat” or “dog”. In these
cases, it is usually much easier to instead express our model in the language of probabilities. Given
features of an example, our model assigns a probability to each possible class. Returning to our
animal classification example where the classes are {cat, dog}, a classifier might see an image and
output the probability that the image is a cat as 0.9. We can interpret this number by saying that
the classifier is 90% sure that the image depicts a cat. The magnitude of the probability for the
predicted class conveys one notion of uncertainty. It is not the only notion of uncertainty and we
will discuss others in more advanced chapters.
When we have more than two possible classes, we call the problem multiclass classification. Common
examples include handwritten character recognition {0, 1, 2, ..., 9, a, b, c, ...}. While we attacked
regression problems by trying to minimize the squared error loss function, the common
loss function for classification problems is called cross-entropy, whose name can be demystified
via an introduction to information theory in subsequent chapters.
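As a preview, and assuming the usual definition for a single example with one correct class, the cross-entropy loss is the negative logarithm of the probability that the model assigned to that class. The sketch below uses the {cat, dog} probabilities from the earlier example; the function name is hypothetical.

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(predicted_probs[true_class])

predicted_probs = {"cat": 0.9, "dog": 0.1}
loss_if_cat = cross_entropy(predicted_probs, "cat")  # confident and right
loss_if_dog = cross_entropy(predicted_probs, "dog")  # confident and wrong
```

The loss is small when the classifier puts high probability on the true class and grows quickly when it is confidently wrong.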
Note that the most likely class is not necessarily the one that you are going to use for your decision.
Assume that you find a beautiful mushroom in your backyard as shown in Fig. 1.3.2.

Fig. 1.3.2: Death cap—do not eat!
Now, assume that you built a classifier and trained it to predict whether a mushroom is poisonous
based on a photograph. Say our poison-detection classifier outputs that the probability that Fig.
1.3.2 contains a death cap is 0.2. In other words, the classifier is 80% sure that our mushroom is
not a death cap. Still, you would have to be a fool to eat it. That is because the certain benefit
of a delicious dinner is not worth a 20% risk of dying from it. In other words, the effect of the
uncertain risk outweighs the benefit by far. Thus, in order to make a decision about whether to
eat the mushroom, we need to compute the expected disutility associated with each action which
depends both on the likely outcomes and the benefits or harms associated with each. In this case,
the disutility incurred by eating the mushroom might be 0.2 × ∞ + 0.8 × 0 = ∞, whereas the loss
of discarding it is 0.2 × 0 + 0.8 × 1 = 0.8. Our caution was justified: as any mycologist would tell
us, the mushroom in Fig. 1.3.2 is actually a death cap.
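The calculation above can be sketched in a few lines of Python. The probabilities and disutility values are the ones from the text; the dictionary layout and function name are merely one convenient way to write them down.

```python
import math

# Probabilities from the text: P(death cap) = 0.2, P(edible) = 0.8.
probs = {"death cap": 0.2, "edible": 0.8}

# Disutility of each action under each outcome, as in the text.
disutility = {
    "eat": {"death cap": math.inf, "edible": 0.0},
    "discard": {"death cap": 0.0, "edible": 1.0},
}

def expected_disutility(action):
    """Weight the disutility of each outcome by its probability."""
    return sum(probs[outcome] * disutility[action][outcome] for outcome in probs)

risks = {action: expected_disutility(action) for action in disutility}
best_action = min(risks, key=risks.get)  # the action with the smallest expected disutility

print(risks, best_action)
```

Note that the decision follows the expected disutility, not the most likely class: "edible" is the more probable outcome, yet discarding the mushroom is the better action.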
Classification can get much more complicated than just binary or multiclass classification. For
instance, there are some variants of classification addressing hierarchically structured classes. In
such cases not all errors are equal—if we must err, we might prefer to misclassify to a related class
rather than a distant class. Usually, this is referred to as hierarchical classification. For inspiration,
you might think of Linnaeus18, who organized the animals in a hierarchy.
In the case of animal classification, it might not be so bad to mistake a poodle for a schnauzer,
but our model would pay a huge penalty if it confused a poodle for a dinosaur. Which hierarchy
is relevant might depend on how you plan to use the model. For example, rattlesnakes and garter
snakes might be close on the phylogenetic tree, but mistaking a rattler for a garter could be deadly.
Tagging
Some classification problems fit neatly into the binary or multiclass classification setups. For ex-
ample, we could train a normal binary classifier to distinguish cats from dogs. Given the current
state of computer vision, we can do this easily, with off-the-shelf tools. Nonetheless, no matter
how accurate our model gets, we might find ourselves in trouble when the classifier encounters
an image of the Town Musicians of Bremen, a popular German fairy tale featuring four animals (Fig.
1.3.3).
18 https://en.wikipedia.org/wiki/Carl_Linnaeus
Fig. 1.3.3: A donkey, a dog, a cat, and a rooster.
As you can see, the photo features a cat, a rooster, a dog, and a donkey, with some trees in the
background. When we anticipate encountering such images, multiclass classification might not
be the right problem formulation. Instead, we might want to give the model the option of saying
the image depicts a cat, a dog, a donkey, and a rooster.
The problem of learning to predict classes that are not mutually exclusive is called multi-label clas-
sification. Auto-tagging problems are typically best described as multi-label classification prob-
lems. Think of the tags people might apply to posts on a technical blog, e.g., “machine learning”,
“technology”, “gadgets”, “programming languages”, “Linux”, “cloud computing”, “AWS”. A typical
article might have 5–10 tags applied. Typically, tags will exhibit some correlation structure. Posts
about “cloud computing” are likely to mention “AWS” and posts about “machine learning” are
likely to mention “GPUs”.
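One common way to let a model report several classes at once is to score each label independently and keep every label whose probability exceeds a threshold. The sketch below assumes this setup. The raw scores are made up, and the sigmoid-plus-threshold rule is only one possible design, not the only way to do multi-label classification.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

labels = ["cat", "dog", "donkey", "rooster", "dinosaur"]
scores = [3.1, 2.4, 1.7, 0.9, -4.0]  # hypothetical raw model outputs, one per label

# Each label gets its own probability; they need not sum to one.
probs = [sigmoid(s) for s in scores]
predicted = [label for label, p in zip(labels, probs) if p > 0.5]

print(predicted)
```

Unlike in multiclass classification, the probabilities here are not forced to compete with each other, so the model can say that the image depicts a cat, a dog, a donkey, and a rooster all at once.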
Sometimes such tagging problems draw on enormous label sets. The National Library of Medicine
employs many professional annotators who associate each article to be indexed in PubMed with
a set of tags drawn from the Medical Subject Headings (MeSH) ontology, a collection of roughly
28000 tags. Correctly tagging articles is important because it allows researchers to conduct ex-
haustive reviews of the literature. This is a time-consuming process and the annotators typically
have a one-year lag between archiving and tagging. Machine learning can provide provisional
tags until each article can have a proper manual review. Indeed, for several years, the BioASQ
organization has hosted competitions19 for this task.
19 http://bioasq.org/

Search
In the field of information retrieval, we often impose rankings over sets of items. Take web search
for example. The goal is less to determine whether a particular page is relevant for a query than
to decide which, among a set of relevant results, should be shown most prominently to a particular
user. One possible solution might be to first assign a score to every element in the set and
then to retrieve the top-rated elements. PageRank20, the original secret sauce behind the Google
search engine, was an early example of such a scoring system. Peculiarly, the scoring provided by
PageRank did not depend on the actual query. Instead, the search engine relied on a simple relevance
filter to identify the set of relevant candidates and then used PageRank to prioritize the more authoritative
pages. Nowadays, search engines use machine learning and behavioral models to obtain query-
dependent relevance scores. There are entire academic conferences devoted to this subject.
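The two-stage recipe described above (filter by relevance, then order by a query-independent score) can be sketched as follows. The pages, the `authority` values, and the word-overlap filter are all hypothetical stand-ins; a real search engine is far more elaborate.

```python
# Hypothetical mini-index: each page has some text and a query-independent
# authority score (standing in for something like PageRank).
pages = [
    {"url": "a.example/dl-book", "text": "deep learning book", "authority": 0.9},
    {"url": "b.example/dl-blog", "text": "my deep learning notes", "authority": 0.3},
    {"url": "c.example/cooking", "text": "mushroom cooking tips", "authority": 0.8},
]

def search(query, pages, top_k=2):
    query_terms = set(query.lower().split())
    # Step 1: crude relevance filter, keeping pages that contain every query term.
    candidates = [p for p in pages if query_terms <= set(p["text"].lower().split())]
    # Step 2: order the candidates by the query-independent score.
    ranked = sorted(candidates, key=lambda p: p["authority"], reverse=True)
    return [p["url"] for p in ranked[:top_k]]

results = search("deep learning", pages)
print(results)
```

The cooking page has a high authority score but never appears for this query, because the filter removes it before ranking takes place.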
Recommender Systems
Recommender systems are another problem setting that is related to search and ranking. The
problems are similar insofar as the goal is to display a set of relevant items to the user. The main
difference is the emphasis on personalization to specific users in the context of recommender sys-
tems. For instance, for movie recommendations, the results page for a science fiction fan and
the results page for a connoisseur of Peter Sellers comedies might differ significantly. Similar
problems pop up in other recommendation settings, e.g., for retail products, music, and news
recommendation.
In some cases, customers provide explicit feedback, communicating how much they liked a par-
ticular product (e.g., the product ratings and reviews on Amazon, IMDb, and Goodreads). In other
cases, they provide implicit feedback, e.g., by skipping titles on a playlist, which might indicate
dissatisfaction, or might just indicate that the song was inappropriate in context. In the simplest
formulations, these systems are trained to estimate some score, such as an expected star rating
or the probability that a given user will purchase a particular item.
Given such a model, for any given user, we could retrieve the set of objects with the largest scores,
which could then be recommended to the user. Production systems are considerably more ad-
vanced and take detailed user activity and item characteristics into account when computing such
scores. Fig. 1.3.4 displays the deep learning books recommended by Amazon based on personal-
ization algorithms tuned to capture Astonʼs preferences.
20 https://en.wikipedia.org/wiki/PageRank
Fig. 1.3.4: Deep learning books recommended by Amazon.
Despite their tremendous economic value, recommendation systems naively built on top of pre-
dictive models suffer some serious conceptual flaws. To start, we only observe censored feedback:
users preferentially rate movies that they feel strongly about. For example, on a five-point scale,
you might notice that items receive many one- and five-star ratings but that there are conspicu-
ously few three-star ratings. Moreover, current purchase habits are often a result of the recom-
mendation algorithm currently in place, but learning algorithms do not always take this detail
into account. Thus it is possible for feedback loops to form where a recommender system pref-
erentially pushes an item that is then taken to be better (due to greater purchases) and in turn is
recommended even more frequently. Many of these problems, such as how to deal with censoring,
incentives, and feedback loops, are important open research questions.
Sequence Learning
So far, we have looked at problems where we have some fixed number of inputs and produce a
fixed number of outputs. For example, we considered predicting house prices given a fixed set
of features: square footage, number of bedrooms, number of bathrooms, and the transit time
to downtown. We also discussed mapping from an image (of fixed dimension) to the predicted
probabilities that it belongs to each among a fixed number of classes and predicting star ratings
associated with purchases based on the user ID and product ID alone. In these cases, once our
model is trained, after each test example is fed into our model, it is immediately forgotten. We
assumed that successive observations were independent and thus there was no need to hold on to
this context.
But how should we deal with video snippets? In this case, each snippet might consist of a different
number of frames. And our guess of what is going on in each frame might be much stronger if we
take into account the previous or succeeding frames. The same goes for language. One popular deep
learning problem is machine translation: the task of ingesting sentences in some source language
and predicting their translations in another language.
These problems also occur in medicine. We might want a model to monitor patients in the inten-
sive care unit and to fire off alerts whenever their risk of dying in the next 24 hours exceeds some
threshold. Here, we would not want to throw away everything that we know about the patient history
every hour and make predictions based only on the most recent measurements.
These problems are among the most exciting applications of machine learning and they are in-
stances of sequence learning. They require a model to either ingest sequences of inputs or to
emit sequences of outputs (or both). Specifically, sequence-to-sequence learning considers prob-
lems where inputs and outputs both consist of variable-length sequences. Examples include ma-
chine translation and speech-to-text transcription. While it is impossible to consider all types of
sequence transformations, the following special cases are worth mentioning.
Tagging and Parsing. This involves annotating a text sequence with attributes. Here, the inputs
and outputs are aligned, i.e., they are of the same number and occur in a corresponding order.
For instance, in part-of-speech (PoS) tagging, we annotate every word in a sentence with the
corresponding part of speech, e.g., “noun” or “verb”. Alternatively, we might want to know
which groups of contiguous words refer to named entities, like people, places, or organizations. In
the cartoonishly simple example below, we might just want to indicate, for every word in a sen-
tence, whether it is part of a named entity (tagged as “Ent”).
Tom  has  dinner  in  Washington  with  Sally
Ent  -    -       -   Ent         -     Ent
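The aligned structure of this example is easy to see in code: there is exactly one output tag per input word, in the same order. The sketch below uses a hand-written lookup set in place of a trained model, which is purely an assumption for illustration.

```python
sentence = "Tom has dinner in Washington with Sally"

# Toy stand-in for a trained tagger: a fixed set of known entity names.
known_entities = {"Tom", "Washington", "Sally"}

tokens = sentence.split()
tags = ["Ent" if token in known_entities else "-" for token in tokens]

# Inputs and outputs are aligned: one tag per token, in the same order.
for token, tag in zip(tokens, tags):
    print(f"{token:12s}{tag}")
```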
Automatic Speech Recognition. With speech recognition, the input sequence is an audio record-
ing of a speaker (Fig. 1.3.5), and the output is a transcript of what the speaker said. The challenge
is that there are many more audio frames (sound is typically sampled at 8 kHz or 16 kHz) than text,
i.e., there is no 1:1 correspondence between audio and text, since thousands of samples may cor-
respond to a single spoken word. These are sequence-to-sequence learning problems, where the
output is much shorter than the input.
Fig. 1.3.5: -D-e-e-p- L-ea-r-ni-ng- in an audio recording.
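A quick back-of-the-envelope calculation shows the scale of the mismatch between audio and text. The word duration of 400 ms is an assumed round number for illustration; the sampling rates are the ones mentioned above.

```python
# Assumed for illustration: a spoken word lasting 400 milliseconds.
word_duration_ms = 400

samples_per_word = {
    rate_hz: rate_hz * word_duration_ms // 1000
    for rate_hz in (8_000, 16_000)
}

for rate_hz, n_samples in samples_per_word.items():
    print(f"{rate_hz} Hz: {n_samples} samples for a single word")
```

Under these assumptions, a single word corresponds to a few thousand input samples, which is why the output sequence is so much shorter than the input.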
Text to Speech. This is the inverse of automatic speech recognition. Here, the input is text and
the output is an audio file. In this case, the output is much longer than the input. While humans
are remarkably good at recognizing speech, even from low-quality audio, getting computers to
perform the feat is a formidable challenge.
Machine Translation. Unlike the case of speech recognition, where corresponding inputs and
outputs occur in the same order, in machine translation, unaligned data poses a new challenge.
Here the input and output sequences can have different lengths, and the corresponding regions
of the respective sequences may appear in different orders. Consider the following illustrative
example of the peculiar tendency of Germans to place the verbs at the end of sentences:
German: Haben Sie sich schon dieses grossartige Lehrwerk angeschaut?

English: Did you already check out this excellent tutorial?
Wrong alignment: Did you yourself already this excellent tutorial looked-at?
Many related problems pop up in other learning tasks. For instance, determining the order in
which a user reads a webpage is a two-dimensional layout analysis problem. Dialogue problems
exhibit all kinds of additional complications, where determining what to say next requires taking
into account real-world knowledge and the prior state of the conversation across long temporal
distances. These are active areas of research.
1.3.2 Unsupervised and Self-Supervised Learning
The previous examples focused on supervised learning, where we feed the model a giant dataset
containing both the features and corresponding label values. You could think of the supervised
learner as having an extremely specialized job and an extremely dictatorial boss. The boss stands
over your shoulder and tells you exactly what to do in every situation until you learn to map from
situations to actions. Working for such a boss sounds pretty lame. On the other hand, pleasing
such a boss is pretty easy. You just recognize the pattern as quickly as possible and imitate their
actions.
Considering the opposite situation, it could be frustrating to work for a boss who has no idea what
they want you to do. However, if you plan to be a data scientist, you had better get used to it. The
boss might just hand you a giant dump of data and tell you to do some data science with it! This
sounds vague because it is. We call this class of problems unsupervised learning, and the type and
number of questions we could ask is limited only by our creativity. We will address unsupervised
learning techniques in later chapters. To whet your appetite for now, we describe a few of the
questions you might ask.
• Can we find a small number of prototypes that accurately summarize the data? Given a
set of photos, can we group them into landscape photos, pictures of dogs, babies, cats, and
mountain peaks? Likewise, given a collection of usersʼ browsing activities, can we group
them into users with similar behavior? This problem is typically known as clustering.
• Can we find a small number of parameters that accurately capture the relevant properties
of the data? The trajectories of a ball are well described by velocity, diameter, and mass of
the ball. Tailors have developed a small number of parameters that describe human body
shape fairly accurately for the purpose of fitting clothes. These problems are referred to as
subspace estimation. If the dependence is linear, it is called principal component analysis.
• Is there a representation of (arbitrarily structured) objects in Euclidean space such that sym-
bolic properties can be well matched? This can be used to describe entities and their rela-
tions, such as “Rome” − “Italy” + “France” = “Paris”.
• Is there a description of the root causes of much of the data that we observe? For instance,
if we have demographic data about house prices, pollution, crime, location, education, and
salaries, can we discover how they are related simply based on empirical data? The fields
concerned with causality and probabilistic graphical models tackle such questions.
• Another important and exciting recent development in unsupervised learning is the advent
of deep generative models. These models estimate the density of the data p(x), either explicitly
or implicitly. Once trained, we can use a generative model either to score examples
according to how likely they are, or to sample synthetic examples from the learned distri-
bution. Early deep learning breakthroughs in generative modeling came with the invention
of variational autoencoders (Kingma and Welling, 2014) and continued with the development
of generative adversarial networks (Goodfellow et al., 2014). More recent advances include
normalizing flows, diffusion models, and score-based models.
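To illustrate the question about representations in Euclidean space, here is a toy sketch of the “Rome” − “Italy” + “France” analogy. The two-dimensional vectors are hand-crafted so that the arithmetic works out exactly; embeddings learned from data are higher-dimensional and only approximately satisfy such relations.

```python
import math

# Hand-crafted toy embeddings (an assumption for illustration):
# the first coordinate encodes the country, the second encodes "is a capital".
embeddings = {
    "Rome": (1.0, 1.0), "Italy": (1.0, 0.0),
    "Paris": (2.0, 1.0), "France": (2.0, 0.0),
    "Berlin": (3.0, 1.0), "Germany": (3.0, 0.0),
}

def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, and c."""
    target = tuple(
        x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    )
    candidates = [word for word in embeddings if word not in {a, b, c}]
    return min(candidates, key=lambda word: math.dist(embeddings[word], target))

answer = analogy("Rome", "Italy", "France", embeddings)
print(answer)
```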
A major development in unsupervised learning has been the rise of self-supervised learning,
techniques that leverage some aspect of the unlabeled data to provide supervision. For text, we can
train models to “fill in the blanks” by predicting randomly masked words using their surrounding
words (contexts) in big corpora without any labeling effort (Devlin et al., 2018)! For images, we
may train models to tell the relative position between two cropped regions of the same image (Do-
ersch et al., 2015), to predict an occluded part of an image based on the remaining portions of the
image, or to predict whether two examples are perturbed versions of the same underlying image.
Self-supervised models often learn representations that are subsequently leveraged by fine-tuning
the resulting models on some downstream task of interest.
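The “fill in the blanks” idea can be sketched without any model at all: the point is that training pairs are manufactured from raw text alone. The helper below is a hypothetical illustration that hides a single token; practical systems mask many tokens and use more elaborate rules.

```python
import random

def make_masked_example(tokens, rng, mask_token="<mask>"):
    """Hide one randomly chosen token; return (masked tokens, position, hidden token)."""
    position = rng.randrange(len(tokens))
    masked = list(tokens)
    masked[position] = mask_token
    return masked, position, tokens[position]

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
tokens = "we can train models to fill in the blanks".split()

masked, position, target = make_masked_example(tokens, rng)
print(" ".join(masked), "->", target)
```

The masked sentence plays the role of the features and the hidden word plays the role of the label, so no human annotation is needed.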
1.3.3 Interacting with an Environment
So far, we have not discussed where data actually comes from, or what actually happens when a
machine learning model generates an output. That is because supervised learning and unsuper-
vised learning do not address these issues in a very sophisticated way. In either case, we grab a big
pile of data upfront, then set our pattern recognition machines in motion without ever interacting
with the environment again. Because all of the learning takes place after the algorithm is discon-
nected from the environment, this is sometimes called offline learning. For example, supervised
learning assumes the simple interaction pattern depicted in Fig. 1.3.6.
Fig. 1.3.6: Collecting data for supervised learning from an environment.
This simplicity of offline learning has its charms. The upside is that we can worry about pattern
recognition in isolation, without worrying about complications arising from interactions with a
dynamic environment. But this problem formulation is limiting. If you grew up reading Asimovʼs
Robot novels, then you might imagine artificially intelligent agents capable not only of making
predictions, but also of taking actions in the world. We want to think about intelligent agents, not
just predictive models. This means that we need to think about choosing actions, not just making
predictions. Unlike mere predictions, actions actually impact the environment. If we want to train
an intelligent agent, we must account for the way its actions might impact the future observations
of the agent.
Considering the interaction with an environment opens a whole set of new modeling questions.
The following are just a few examples.
• Does the environment remember what we did previously?
• Does the environment want to help us, e.g., a user reading text into a speech recognizer?
• Does the environment want to beat us, e.g., spammers altering their emails to evade spam
filters?
• Does the environment have shifting dynamics? For example, does future data always re-
semble the past or do the patterns change over time, either naturally or in response to our
automated tools?
These questions raise the problem of distribution shift, where training and test data are different.
Most of us have experienced this problem when taking exams written by a lecturer, while
the homework was composed by their teaching assistants. Next, we briefly describe reinforce-
ment learning, a rich framework for posing learning problems in which an agent interacts with
an environment.
1.3.4 Reinforcement Learning
If you are interested in using machine learning to develop an agent that interacts with an environ-
ment and takes actions, then you are probably going to wind up focusing on reinforcement learning.
This might include applications to robotics, to dialogue systems, and even to developing artifi-
cial intelligence (AI) for video games. Deep reinforcement learning, which applies deep learning to
reinforcement learning problems, has surged in popularity. The breakthrough deep Q-network
that beat humans at Atari games using only the visual input, and the AlphaGo program that
dethroned the world champion at the board game Go (Silver et al., 2016) are two prominent exam-
ples.
Reinforcement learning gives a very general statement of a problem, in which an agent interacts
with an environment over a series of time steps. At each time step, the agent receives some ob-
servation from the environment and must choose an action that is subsequently transmitted back
to the environment via some mechanism (sometimes called an actuator). Finally, the agent re-
ceives a reward from the environment. This process is illustrated in Fig. 1.3.7. The agent then
receives a subsequent observation, and chooses a subsequent action, and so on. The behavior of
a reinforcement learning agent is governed by a policy. In short, a policy is just a function that
maps from observations of the environment to actions. The goal of reinforcement learning is to
produce good policies.
Fig. 1.3.7: The interaction between reinforcement learning and an environment.
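The interaction loop described above can be written down as a short sketch. The environment `LineWorld` and the policy `always_right` are hypothetical toys that we made up for illustration; they do not correspond to any particular library.

```python
class LineWorld:
    """Toy environment: the agent walks along positions 0, 1, ..., goal."""

    def __init__(self, goal=4):
        self.goal = goal
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the initial observation

    def step(self, action):
        # action is -1 (left) or +1 (right); positions are clipped to [0, goal].
        self.position = min(self.goal, max(0, self.position + action))
        done = self.position == self.goal
        reward = 1.0 if done else 0.0
        return self.position, reward, done

def always_right(observation):
    """A policy is just a function from observations to actions."""
    return +1

def run_episode(env, policy, max_steps=20):
    observation = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(observation)                   # agent chooses an action
        observation, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            break
    return total_reward, t + 1

total_reward, steps = run_episode(LineWorld(), always_right)
print(total_reward, steps)
```

Here the policy is fixed by hand. A reinforcement learning algorithm would instead adjust the policy based on the rewards it observes.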

It is hard to overstate the generality of the reinforcement learning framework. For example, we
can cast supervised learning problems as reinforcement learning problems. Say we had a classi-
fication problem. We could create a reinforcement learning agent with one action corresponding
to each class. We could then create an environment which gave a reward that was exactly equal
to the negative of the loss function from the original supervised learning problem.
That being said, reinforcement learning can also address many problems that supervised learn-
ing cannot. For example, in supervised learning, we always expect that the training input comes
associated with the correct label. But in reinforcement learning, we do not assume that for each
observation the environment tells us the optimal action. In general, we just get some reward.
Moreover, the environment may not even tell us which actions led to the reward.
Consider the game of chess. The only real reward signal comes at the end of the game when
we either win, earning a reward of, say, 1, or when we lose, receiving a reward of, say, -1. So
reinforcement learners must deal with the credit assignment problem: determining which actions
to credit or blame for an outcome. The same goes for an employee who gets a promotion on
October 11. That promotion likely reflects a large number of well-chosen actions over the previous
year. Getting more promotions in the future requires figuring out what actions along the way led
to the promotion.
Reinforcement learners may also have to deal with the problem of partial observability. That is,
the current observation might not tell you everything about your current state. Say a cleaning
robot found itself trapped in one of many identical closets in a house. Inferring the precise loca-
tion of the robot might require considering its previous observations before entering the closet.
Finally, at any given point, reinforcement learners might know of one good policy, but there might
be many other better policies that the agent has never tried. The reinforcement learner must
constantly choose whether to exploit the best (currently) known strategy as a policy, or to explore
the space of strategies, potentially giving up some short-run reward in exchange for knowledge.
The reinforcement learning problem is a very general setting. Actions affect subsequent
observations. Rewards are only observed corresponding to the chosen actions. The environment
may be either fully or partially observed. Accounting for all this complexity at once may ask too
much of researchers. Moreover, not every practical problem exhibits all this complexity. As a
result, researchers have studied a number of special cases of reinforcement learning problems.
When the environment is fully observed, we call the reinforcement learning problem a Markov
decision process. When the state does not depend on the previous actions, we call the problem
a contextual bandit problem. When there is no state, just a set of available actions with initially
unknown rewards, this problem is the classic multi-armed bandit problem.
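The tension between exploiting and exploring is easiest to see in the multi-armed bandit setting. The sketch below uses the epsilon-greedy rule, one simple and widely used strategy that we picked for illustration: with a small probability the agent tries a random arm, and otherwise it pulls the arm with the best estimated reward so far. The arm probabilities and parameter values are made up.

```python
import random

def epsilon_greedy_bandit(true_means, n_steps=1000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit; return the pull counts and reward estimates per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        # The environment pays 1 with the arm's (unknown) success probability.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average reward of this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(counts, estimates)
```

Setting `epsilon` to zero would make the agent purely greedy, at the risk of sticking with an arm that merely looked good early on.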
1.4 Roots
We have just reviewed a small subset of problems that machine learning can address. For a di-
verse set of machine learning problems, deep learning provides powerful tools for solving them.
Although many deep learning methods are recent inventions, the core ideas behind learning from
data have been studied for centuries. In fact, humans have long held the desire to analyze data and to
predict future outcomes, and much of natural science has its roots in this. For instance,
the Bernoulli distribution is named after Jacob Bernoulli (1655–1705)21, and the Gaussian distribution
was discovered by Carl Friedrich Gauss (1777–1855)22. He invented, for instance, the method of least
squares, which is still used today for countless problems from insurance calculations
to medical diagnostics. These tools gave rise to an experimental approach in the natural
sciences. For instance, Ohmʼs law relating current and voltage in a resistor is perfectly described
by a linear model.
21 https://en.wikipedia.org/wiki/Jacob_Bernoulli
22 https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss
Even in the Middle Ages, mathematicians had a keen intuition of estimates. For instance, the
geometry book of Jacob Köbel (1460–1533)23 illustrates averaging the length of 16 adult menʼs feet
to estimate the average foot length in the population (Fig. 1.4.1).
Fig. 1.4.1: Estimating the length of a foot.
As a group of individuals exited a church, 16 adult men were asked to line up in a row and have
their feet measured. The sum of these measurements was then divided by 16 to obtain an estimate
for what now amounts to 1 foot. This “algorithm” was later improved to deal with misshapen feet:
the two men with the shortest and longest feet were sent away, averaging only over the remainder.
This is among the earliest examples of a trimmed mean estimate.
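The procedure is simple enough to sketch directly. The foot lengths below are made-up numbers in centimeters, and the function name is our own; the logic follows the description above by dropping the shortest and the longest measurement before averaging.

```python
def trimmed_mean(values, n_trim=1):
    """Average after dropping the n_trim smallest and n_trim largest values."""
    if len(values) <= 2 * n_trim:
        raise ValueError("not enough values left after trimming")
    kept = sorted(values)[n_trim:len(values) - n_trim]
    return sum(kept) / len(kept)

# Hypothetical foot lengths in centimeters for 16 men (made-up data).
foot_lengths_cm = [24, 25, 25, 26, 26, 26, 27, 27, 27, 27, 28, 28, 28, 29, 29, 40]

plain_mean = sum(foot_lengths_cm) / len(foot_lengths_cm)
robust_mean = trimmed_mean(foot_lengths_cm)

print(plain_mean, robust_mean)
```

The single unusually long foot pulls the plain average upwards, while the trimmed mean is unaffected by it.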
Statistics really took off with the collection and availability of data. One of its pioneers, Ronald
Fisher (1890–1962)24, contributed significantly to its theory and also its applications in genetics.
Many of his algorithms (such as linear discriminant analysis) and formulas (such as the Fisher
information matrix) still hold a prominent place in the foundations of modern statistics. Even
his data resources had a lasting impact. The Iris dataset that Fisher released in 1936 is still used
sometimes to demonstrate machine learning algorithms. Fisher was also a proponent of eugenics,
which should remind us that the morally dubious use of data science has as long and enduring a
history as its productive use in industry and the natural sciences.
23 https://www.maa.org/press/periodicals/convergence/mathematical-treasures-jacob-kobels-geometry
24 https://en.wikipedia.org/wiki/Ronald_Fisher
A second influence for machine learning came from information theory by Claude Shannon
(1916–2001)25 and the theory of computation via Alan Turing (1912–1954)26. Turing posed the
question “can machines think?” in his famous paper Computing Machinery and Intelligence (Tur-
ing, 1950). In what he described as the Turing test, a machine can be considered intelligent if it is
difficult for a human evaluator to distinguish between the replies from a machine and a human
based on textual interactions.
Another influence can be found in neuroscience and psychology. After all, humans clearly ex-
hibit intelligent behavior. Many scholars have asked whether one could explain and possibly re-
verse engineer this capacity. One of the oldest biologically inspired algorithms was formulated
by Donald Hebb (1904–1985)27. In his groundbreaking book The Organization of Behavior (Hebb,
1949), he posited that neurons learn by positive reinforcement. This became known as
the Hebbian learning rule. These ideas inspired later works like Rosenblattʼs perceptron learning
algorithm and laid the foundations of many stochastic gradient descent algorithms that underpin
deep learning today: reinforce desirable behavior and diminish undesirable behavior to obtain
good settings of the parameters in a neural network.
Biological inspiration is what gave neural networks their name. For over a century (dating back
to the models of Alexander Bain, 1873, and Charles Sherrington, 1890), researchers have tried to
assemble computational circuits that resemble networks of interacting neurons. Over time, the
interpretation of biology has become less literal, but the name stuck. At their heart lie a few key
principles that can be found in most networks today:
• The alternation of linear and nonlinear processing units, often referred to as layers.
• The use of the chain rule (also known as backpropagation) for adjusting parameters in the
entire network at once.
After initial rapid progress, research in neural networks languished from around 1995 until 2005.
This was mainly due to two reasons. First, training a network is computationally very expensive.
While random-access memory was plentiful at the end of the past century, computational power
was scarce. Second, datasets were relatively small. In fact, Fisherʼs Iris dataset from 1936 was a
popular tool for testing the efficacy of algorithms. The MNIST dataset with its 60000 handwritten
digits was considered huge.
Given the scarcity of data and computation, strong statistical tools such as kernel methods, de-
cision trees, and graphical models proved empirically superior in many applications. Moreover,
unlike neural networks, they did not require weeks to train and provided predictable results with
strong theoretical guarantees.
1.5 The Road to Deep Learning
Much of this changed with the availability of large amounts of data, due to the World Wide Web,
the advent of companies serving hundreds of millions of users online, a dissemination of cheap,
high-quality sensors, cheap data storage (Kryderʼs law), and cheap computation (Mooreʼs law). In
particular, the landscape of computation in deep learning was revolutionized by advances in GPUs,
which were originally engineered for computer gaming. Suddenly algorithms and models that
seemed computationally infeasible became relevant (and vice versa). This is best illustrated in
the table below.
25 https://en.wikipedia.org/wiki/Claude_Shannon
26 https://en.wikipedia.org/wiki/Alan_Turing
27 https://en.wikipedia.org/wiki/Donald_O._Hebb
Decade  Dataset                               Memory  Floating point calculations per second
1970    100 (Iris)                            1 KB    100 KF (Intel 8080)
1980    1 K (house prices in Boston)          100 KB  1 MF (Intel 80186)
1990    10 K (optical character recognition)  10 MB   10 MF (Intel 80486)
2000    10 M (web pages)                      100 MB  1 GF (Intel Core)
2010    10 G (advertising)                    1 GB    1 TF (Nvidia C2050)
2020    1 T (social network)                  100 GB  1 PF (Nvidia DGX-2)
Table: Dataset vs. computer memory and computational power

Note that random-access memory has not kept pace with the growth in data. At the same time,
increases in computational power have outpaced the growth in datasets. This means that statisti-
cal models need to become more memory efficient, and are free to spend more computer cycles
optimizing parameters, due to the increased compute budget. Consequently, the sweet spot in
machine learning and statistics moved from (generalized) linear models and kernel methods to
deep neural networks. This is also one of the reasons why many of the mainstays of deep learning,
such as multilayer perceptrons (McCulloch and Pitts, 1943), convolutional neural networks (Le-
Cun et al., 1998), long short-term memory (Hochreiter and Schmidhuber, 1997), and Q-Learning
(Watkins and Dayan, 1992), were essentially “rediscovered” in the past decade, after lying
comparatively dormant for considerable time.
The recent progress in statistical models, applications, and algorithms has sometimes been
likened to the Cambrian explosion: a moment of rapid progress in the evolution of species. Indeed,
the state of the art is not just a mere consequence of available resources applied to decades-old
algorithms. Note that the list below barely scratches the surface of the ideas that have helped
researchers achieve tremendous progress over the past decade.
• Novel methods for capacity control, such as dropout (Srivastava et al., 2014), have helped to
mitigate overfitting. Here, noise is injected (Bishop, 1995) throughout the neural network
during training.
• Attention mechanisms solved a second problem that had plagued statistics for over a cen-
tury: how to increase the memory and complexity of a system without increasing the num-
ber of learnable parameters. Researchers found an elegant solution by using what can only
be viewed as a learnable pointer structure (Bahdanau et al., 2014). Rather than having to
remember an entire text sequence, e.g., for machine translation in a fixed-dimensional rep-
resentation, all that needed to be stored was a pointer to the intermediate state of the trans-
lation process. This allowed for significantly increased accuracy for long sequences, since
the model no longer needed to remember the entire sequence before commencing the gen-
eration of a new sequence. Built solely on attention mechanisms, the transformer architec-
ture (Vaswani et al., 2017) has demonstrated compelling success in a wide range of areas.
For example, a single transformer pretrained on modalities as diverse as text, images, joint
torques, and button presses can play Atari, caption images, chat, and control a robot (Reed
et al., 2022).
• Multi-stage designs, e.g., via the memory networks (Sukhbaatar et al., 2015) and the neural
programmer-interpreter (Reed and De Freitas, 2015) allowed statistical modelers to describe
iterative approaches to reasoning. These tools allow for an internal state of the deep neural
network to be modified repeatedly, thus carrying out subsequent steps in a chain of reason-
ing, similar to how a processor can modify memory for a computation.
• Another key development was the invention of generative adversarial networks (Goodfellow
et al., 2014). Traditionally, statistical methods for density estimation and generative models
focused on finding proper probability distributions and (often approximate) algorithms for
sampling from them. As a result, these algorithms were largely limited by the lack of flex-
ibility inherent in the statistical models. The crucial innovation in generative adversarial
networks was to replace the sampler by an arbitrary algorithm with differentiable parame-
ters. These are then adjusted in such a way that the discriminator (effectively a two-sample
test) cannot distinguish fake from real data. Through the ability to use arbitrary algorithms
to generate data, it opened up density estimation to a wide variety of techniques. Examples
of galloping zebras (Zhu et al., 2017) and of fake celebrity faces (Karras et al., 2017) are both
testimony to this progress. Even amateur doodlers can produce photorealistic images based
on just sketches that describe what the layout of a scene looks like (Park et al., 2019).
• In many cases, a single GPU is insufficient to process the large amounts of data available
for training. Over the past decade the ability to build parallel and distributed training al-
gorithms has improved significantly. One of the key challenges in designing scalable algo-
rithms is that the workhorse of deep learning optimization, stochastic gradient descent, re-
lies on relatively small minibatches of data to be processed. At the same time, small batches
limit the efficiency of GPUs. Hence, training on 1024 GPUs with a minibatch size of, say, 32
images per GPU amounts to an aggregate minibatch of about 32000 images. Recent work,
first by Li (2017), and subsequently by You et al. (2017) and Jia et al. (2018), pushed the size
up to 64000 observations, reducing training time for the ResNet-50 model on the ImageNet
dataset to less than 7 minutes. For comparison, training times were initially measured on
the order of days.
• The ability to parallelize computation has also contributed to progress in reinforcement
learning. This has led to significant progress in computers achieving superhuman
performance on tasks like Go, Atari games, Starcraft, and in physics simulations (e.g., using
MuJoCo), where environment simulators are available. See, e.g., Silver et al. (2016) for a
description of how to achieve this in AlphaGo. In a nutshell, reinforcement learning works
best if plenty of (state, action, reward) tuples are available. Simulation provides such an
avenue.
• Deep learning frameworks have played a crucial role in disseminating ideas. The first gener-
ation of open-source frameworks for neural network modeling consisted of Caffe28 , Torch29 ,
and Theano30 . Many seminal papers were written using these tools. By now, they have been
superseded by TensorFlow31 (often used via its high level API Keras32 ), CNTK33 , Caffe 234 ,
and Apache MXNet35 . The third generation of tools consists of so-called imperative tools for
deep learning, a trend that was arguably ignited by Chainer36 , which used a syntax similar
to Python NumPy to describe models. This idea was adopted by PyTorch37 , the Gluon
API38 of MXNet, and JAX39 .
28 https://github.com/BVLC/caffe
29 https://github.com/torch
30 https://github.com/Theano/Theano
31 https://github.com/tensorflow/tensorflow
32 https://github.com/keras-team/keras
33 https://github.com/Microsoft/CNTK
34 https://github.com/caffe2/caffe2
35 https://github.com/apache/incubator-mxnet
36 https://github.com/chainer/chainer
37 https://github.com/pytorch/pytorch
38 https://github.com/apache/incubator-mxnet
39 https://github.com/google/jax
The division of labor between system researchers building better tools and statistical modelers
building better neural networks has greatly simplified things. For instance, training a linear
logistic regression model used to be a nontrivial homework problem, worthy of being given to new
machine learning Ph.D. students at Carnegie Mellon University in 2014. By now, this task can be
accomplished with fewer than 10 lines of code, putting it firmly within the grasp of programmers.
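To make the claim about logistic regression concrete, here is a minimal sketch in PyTorch. It is an illustration under our own assumptions rather than the homework problem mentioned above: the synthetic dataset, the learning rate, and the number of steps are all made up for this example, and the model, loss, and training loop together take only a handful of lines.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Hypothetical toy data: two features, label 1 when their sum is positive.
X = torch.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)

model = nn.Linear(2, 1)             # produces logits Xw + b
loss_fn = nn.BCEWithLogitsLoss()    # sigmoid followed by cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for _ in range(200):                # plain full-batch gradient descent
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# A logit above 0 corresponds to a predicted probability above 0.5.
accuracy = ((model(X) > 0).float() == y).float().mean().item()
print(f'training accuracy: {accuracy:.2f}')
```

The framework supplies the gradient computation and the parameter update, which is exactly the part that once had to be derived and coded by hand.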
1.6 Success Stories
AI has a long history of delivering results that would be difficult to accomplish otherwise. For in-
stance, mail sorting systems using optical character recognition have been deployed since the
1990s. This is, after all, the source of the famous MNIST dataset of handwritten digits. The same
applies to reading checks for bank deposits and scoring creditworthiness of applicants. Financial
transactions are checked for fraud automatically. This forms the backbone of many e-commerce
payment systems, such as PayPal, Stripe, AliPay, WeChat, Apple, Visa, and MasterCard. Computer
programs for chess have been competitive for decades. Machine learning feeds search, recom-
mendation, personalization, and ranking on the Internet. In other words, machine learning is
pervasive, albeit often hidden from sight.
It is only recently that AI has been in the limelight, mostly due to solutions to problems that were
considered intractable previously and that are directly related to consumers. Many such
advances are attributed to deep learning.
• Intelligent assistants, such as Appleʼs Siri, Amazonʼs Alexa, and Googleʼs assistant, are able to
answer spoken questions with a reasonable degree of accuracy. This includes menial tasks,
like turning on light switches, and more complex tasks, like arranging barberʼs appointments
and offering phone support dialog. This is likely the most noticeable sign that AI is affecting
our lives.
• A key ingredient in digital assistants is the ability to recognize speech accurately. Gradually,
the accuracy of such systems has increased to the point of achieving human parity for certain
applications (Xiong et al., 2018).
• Object recognition has likewise come a long way. Estimating the object in a picture was a
fairly challenging task in 2010. On the ImageNet benchmark researchers from NEC Labs
and University of Illinois at Urbana-Champaign achieved a top-5 error rate of 28% (Lin et al.,
2010). By 2017, this error rate was reduced to 2.25% (Hu et al., 2018). Similarly, stunning
results have been achieved for identifying birds and for diagnosing skin cancer.
• Prowess in games used to provide a measuring stick for human intelligence. Starting from
TD-Gammon, a program for playing backgammon using temporal difference reinforcement
learning, algorithmic and computational progress has led to algorithms for a wide range of
applications. Unlike backgammon, chess has a much more complex state space and set of
actions. DeepBlue beat Garry Kasparov using massive parallelism, special-purpose hard-
ware and efficient search through the game tree (Campbell et al., 2002). Go is more difficult
still, due to its huge state space. AlphaGo reached human parity in 2015, using deep learn-
ing combined with Monte Carlo tree sampling (Silver et al., 2016). The challenge in Poker
was that the state space is large and only partially observed (we do not know the opponentsʼ
cards). Libratus exceeded human performance in Poker using efficiently structured strate-
gies (Brown and Sandholm, 2017).
• Another indication of progress in AI is the advent of self-driving cars and trucks. While full
autonomy is not quite within reach, excellent progress has been made in this direction, with
companies such as Tesla, NVIDIA, and Waymo shipping products that enable at least par-
tial autonomy. What makes full autonomy so challenging is that proper driving requires the
ability to perceive, to reason and to incorporate rules into a system. At present, deep learn-
ing is used primarily in the computer vision aspect of these problems. The rest is heavily
tuned by engineers.
This barely scratches the surface for impactful applications of machine learning. For instance,
robotics, logistics, computational biology, particle physics, and astronomy owe some of their most
impressive recent advances at least in part to machine learning. Machine learning is thus becom-
ing a ubiquitous tool for engineers and scientists.
Frequently, questions about a coming AI apocalypse and the plausibility of a singularity have been
raised in non-technical articles on AI. The fear is that somehow machine learning systems will
become sentient and make decisions, independently of their programmers, that directly affect
the lives of humans. To some extent, AI already affects the livelihood of humans in direct ways:
creditworthiness is assessed automatically, autopilots mostly navigate vehicles, and decisions about
whether to grant bail use statistical data as input. More frivolously, we can ask Alexa to switch on
the coffee machine.
Fortunately, we are far from a sentient AI system that could deliberately manipulate its human cre-
ators. First, AI systems are engineered, trained, and deployed in a specific, goal-oriented manner.
While their behavior might give the illusion of general intelligence, it is a combination of rules,
heuristics and statistical models that underlie the design. Second, at present there simply are no
tools for artificial general intelligence that can improve themselves, reason about themselves,
or modify, extend, and improve their own architecture while trying to
solve general tasks.
A much more pressing concern is how AI is being used in our daily lives. It is likely that many
menial tasks fulfilled by truck drivers and shop assistants can and will be automated. Farm robots
will likely reduce the cost for organic farming but they will also automate harvesting operations.
This phase of the industrial revolution may have profound consequences on large swaths of soci-
ety, since truck drivers and shop assistants are some of the most common jobs in many countries.
Furthermore, statistical models, when applied without care, can lead to racial, gender, or age bias
and raise reasonable concerns about procedural fairness if automated to drive consequential
decisions. It is important to ensure that these algorithms are used with care. With what we know
today, this strikes us as a much more pressing concern than the potential of malevolent
superintelligence to destroy humanity.
1.7 The Essence of Deep Learning
Thus far, we have talked about machine learning broadly. Deep learning is the subset of machine
learning concerned with models based on many-layered neural networks. It is deep in precisely
the sense that its models learn many layers of transformations. While this might sound narrow,
deep learning has given rise to a dizzying array of models, techniques, problem formulations, and
applications. Many intuitions have been developed to explain the benefits of depth. Arguably,
all machine learning has many layers of computation, the first consisting of feature processing
steps. What differentiates deep learning is that the operations learned at each of the many layers
of representations are learned jointly from data.
The problems that we have discussed so far, such as learning from the raw audio signal, the raw
pixel values of images, or mapping between sentences of arbitrary lengths and their counterparts
in foreign languages, are those where deep learning excels and traditional methods falter. It turns
out that these many-layered models are capable of addressing low-level perceptual data in a way
that previous tools could not. Arguably the most significant commonality in deep learning meth-
ods is end-to-end training. That is, rather than assembling a system based on components that are
individually tuned, one builds the system and then tunes its performance jointly. For instance,
in computer vision scientists used to separate the process of feature engineering from the process
of building machine learning models. The Canny edge detector (Canny, 1987) and Loweʼs SIFT
feature extractor (Lowe, 2004) reigned supreme for over a decade as algorithms for mapping im-
ages into feature vectors. In bygone days, the crucial part of applying machine learning to these
problems consisted of coming up with manually-engineered ways of transforming the data into
some form amenable to shallow models. Unfortunately, there is only so much that humans can
accomplish by ingenuity in comparison with a consistent evaluation over millions of choices car-
ried out automatically by an algorithm. When deep learning took over, these feature extractors
were replaced by automatically tuned filters, yielding superior accuracy.
Thus, one key advantage of deep learning is that it replaces not only the shallow models at the
end of traditional learning pipelines, but also the labor-intensive process of feature engineering.
Moreover, by replacing much of the domain-specific preprocessing, deep learning has eliminated
many of the boundaries that previously separated computer vision, speech recognition, natural
language processing, medical informatics, and other application areas, offering a unified set of
tools for tackling diverse problems.
Beyond end-to-end training, we are experiencing a transition from parametric statistical descrip-
tions to fully nonparametric models. When data is scarce, one needs to rely on simplifying as-
sumptions about reality in order to obtain useful models. When data is abundant, these can be re-
placed by nonparametric models that better fit the data. To some extent, this mirrors the progress
that physics experienced in the middle of the previous century with the availability of computers.
Rather than solving parametric approximations of how electrons behave by hand, one can now
resort to numerical simulations of the associated partial differential equations. This has led to
much more accurate models, albeit often at the expense of explainability.
Another difference from previous work is the acceptance of suboptimal solutions, dealing with
nonconvex nonlinear optimization problems, and the willingness to try things before proving them.
This newfound empiricism in dealing with statistical problems, combined with a rapid influx of
talent, has led to rapid progress in practical algorithms, albeit in many cases at the expense of
modifying and re-inventing tools that existed for decades.
In the end, the deep learning community prides itself on sharing tools across academic and cor-
porate boundaries, releasing many excellent libraries, statistical models, and trained networks as
open source. It is in this spirit that the notebooks forming this book are freely available for distri-
bution and use. We have worked hard to lower the barriers of access for everyone to learn about
deep learning and we hope that our readers will benefit from this.
1.8 Summary
Machine learning studies how computer systems can leverage experience (often data) to improve
performance at specific tasks. It combines ideas from statistics, data mining, and optimization.
Often, it is used as a means of implementing AI solutions. As a class of machine learning, repre-
sentation learning focuses on how to automatically find the appropriate way to represent data.
As multi-level representation learning through learning many layers of transformations, deep
learning replaces not only the shallow models at the end of traditional machine learning pipelines,
but also the labor-intensive process of feature engineering. Much of the recent progress in deep
learning has been triggered by an abundance of data arising from cheap sensors and Internet-
scale applications, and by significant progress in computation, mostly through GPUs. Besides,
the availability of efficient deep learning frameworks has made design and implementation of
whole system optimization significantly easier, which is a key component in obtaining high per-
formance.
Exercises
1. Which parts of code that you are currently writing could be “learned”, i.e., improved by
learning and automatically determining design choices that are made in your code? Does
your code include heuristic design choices? What data might you need to learn the desired
behavior?
2. Which problems that you encounter have many examples for how to solve them, yet no spe-
cific way to automate them? These may be prime candidates for using deep learning.
3. Describe the relationships between algorithms, data, and computation. How do character-
istics of the data and the current available computational resources influence the appropri-
ateness of various algorithms?
4. Name some settings where end-to-end training is not currently the default approach but
might be useful.
2 | Preliminaries
To prepare for your dive into deep learning, you will need a few survival skills: (i) techniques for
storing and manipulating data; (ii) libraries for ingesting and preprocessing data from a variety of
sources; (iii) knowledge of the basic linear algebraic operations that we apply to high-dimensional
data elements; (iv) just enough calculus to determine which direction to adjust each parameter in
order to decrease the loss function; (v) the ability to automatically compute derivatives so that
you can forget much of the calculus you just learned; (vi) some basic fluency in probability, our
primary language for reasoning under uncertainty; and (vii) some aptitude for finding answers in
the official documentation when you get stuck.
In short, this chapter provides a rapid introduction to the basics that you will need to follow most
of the technical content in this book.
2.1 Data Manipulation
In order to get anything done, we need some way to store and manipulate data. Generally, there
are two important things we need to do with data: (i) acquire them; and (ii) process them once they
are inside the computer. There is no point in acquiring data without some way to store it, so to
start, letʼs get our hands dirty with n-dimensional arrays, which we also call tensors. If you already
know the NumPy scientific computing package, this will be a breeze. For all modern deep learn-
ing frameworks, the tensor class (ndarray in MXNet, Tensor in PyTorch and TensorFlow) resembles
NumPyʼs ndarray, with a few killer features added. First, the tensor class supports automatic dif-
ferentiation. Second, it leverages GPUs to accelerate numerical computation, whereas NumPy
only runs on CPUs. These properties make neural networks both easy to code and fast to run.
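As a quick illustration of these two features, the sketch below differentiates y = x^2 at x = 3 and places a tensor on a GPU only if one happens to be available. The variable names are ours and the example is deliberately tiny; automatic differentiation is treated properly later in this chapter.

```python
import torch

# Feature 1: automatic differentiation.
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2             # y = x^2
y.backward()           # computes dy/dx and stores it in x.grad
grad = x.grad.item()   # dy/dx = 2x, i.e., 6.0 at x = 3

# Feature 2: optional GPU acceleration, falling back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
z = torch.ones(2, 3, device=device)
total = z.sum().item()  # the same value on either device

print(grad, device, total)
```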
2.1.1 Getting Started
To start, we import the PyTorch library. Note that the package name is torch.
import torch
A tensor represents a (possibly multi-dimensional) array of numerical values. With one axis, a
tensor is called a vector. With two axes, a tensor is called a matrix. With $k > 2$ axes, we drop the
specialized names and just refer to the object as a $k^\mathrm{th}$ order tensor.
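The number of axes can be read off a tensor directly. The following sketch (with placeholder names of our choosing) builds one tensor of each kind and inspects its ndim attribute, which equals the length of its shape.

```python
import torch

vector = torch.zeros(3)          # one axis
matrix = torch.zeros(3, 4)       # two axes
tensor3 = torch.zeros(2, 3, 4)   # three axes: a 3rd order tensor

orders = (vector.ndim, matrix.ndim, tensor3.ndim)
shape_lengths = (len(vector.shape), len(matrix.shape), len(tensor3.shape))
print(orders)
```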
PyTorch provides a variety of functions for creating new tensors prepopulated with values. For
example, by invoking arange(n), we can create a vector of evenly spaced values, starting at 0 (in-
cluded) and ending at n (not included). By default, the interval size is 1. Unless otherwise specified,
new tensors are stored in main memory and designated for CPU-based computation.
x = torch.arange(12, dtype=torch.float32)
x
tensor([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.])
Each of these values is called an element of the tensor. The tensor x contains 12 elements. We can
inspect the total number of elements in a tensor via its numel method.
x.numel()
12
We can access a tensorʼs shape (the length along each axis) by inspecting its shape attribute. Be-
cause we are dealing with a vector here, the shape contains just a single element and is identical
to the size.
x.shape
torch.Size([12])
We can change the shape of a tensor without altering its size or values, by invoking reshape. For
example, we can transform our vector x whose shape is (12,) to a matrix X with shape (3, 4). This
new tensor retains all elements but reconfigures them into a matrix. Notice that the elements of
our vector are laid out one row at a time and thus x[3] == X[0, 3].
X = x.reshape(3, 4)
X
tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.]])
Note that specifying every shape component to reshape is redundant. Because we already know
our tensorʼs size, we can work out one component of the shape given the rest. For example, given
a tensor of size n and target shape (h, w), we know that w = n/h. To automatically infer one
component of the shape, we can place a -1 for the shape component that should be inferred au-
tomatically. In our case, instead of calling x.reshape(3, 4), we could have equivalently called
x.reshape(-1, 4) or x.reshape(3, -1).
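The equivalence of these three calls can be checked directly. In this short sketch the variable names are ours; the last few lines also show that PyTorch raises a RuntimeError when no integer value for the inferred component exists.

```python
import torch

x = torch.arange(12, dtype=torch.float32)

explicit = x.reshape(3, 4)
infer_rows = x.reshape(-1, 4)   # number of rows inferred as 12 / 4 = 3
infer_cols = x.reshape(3, -1)   # number of columns inferred as 12 / 3 = 4

same_shape = explicit.shape == infer_rows.shape == infer_cols.shape
same_values = (torch.equal(explicit, infer_rows)
               and torch.equal(explicit, infer_cols))

# 12 elements cannot be arranged into rows of length 5.
try:
    x.reshape(-1, 5)
    raised = False
except RuntimeError:
    raised = True

print(same_shape, same_values, raised)
```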
Practitioners often need to work with tensors initialized to contain all zeros or ones. We can con-
struct a tensor with all elements set to zero and a shape of (2, 3, 4) via the zeros function.
torch.zeros((2, 3, 4))
tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])
Similarly, we can create a tensor with all ones by invoking ones.
torch.ones((2, 3, 4))
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]]])
We often wish to sample each element randomly (and independently) from a given probability
distribution. For example, the parameters of neural networks are often initialized randomly. The
following snippet creates a tensor with elements drawn from a standard Gaussian (normal) distri-
bution with mean 0 and standard deviation 1.
torch.randn(3, 4)
tensor([[-0.1251, -0.0781,  0.2857, -1.8233],
        [-1.2177, -0.3742, -0.8155, -0.9469],
        [ 1.1702, -0.4262, -0.3659,  0.3414]])
Finally, we can construct tensors by supplying the exact values for each element by supplying
(possibly nested) Python list(s) containing numerical literals. Here, we construct a matrix with a
list of lists, where the outermost list corresponds to axis 0, and the inner list to axis 1.
torch.tensor([[2, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
tensor([[2, 1, 4, 3],
[1, 2, 3, 4],
[4, 3, 2, 1]])
2.1.2 Indexing and Slicing
As with Python lists, we can access tensor elements by indexing (starting with 0). To access an
element based on its position relative to the end of the list, we can use negative indexing. We
can also access whole ranges of indices via slicing (e.g., X[start:stop]), where the returned value
includes the first index (start) but not the last (stop). Finally, when only one index (or slice) is
specified for a $k^\mathrm{th}$ order tensor, it is applied along axis 0. Thus, in the following code, [-1] selects
the last row and [1:3] selects the second and third rows.
X[-1], X[1:3]
(tensor([ 8.,  9., 10., 11.]),
 tensor([[ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]]))
Beyond reading, we can also write elements of a matrix by specifying indices.
X[1, 2] = 17
X
tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5., 17.,  7.],
        [ 8.,  9., 10., 11.]])
If we want to assign multiple elements the same value, we apply the indexing on the left-hand side
of the assignment operation. For instance, [:2, :] accesses the first and second rows, where :
takes all the elements along axis 1 (column). While we discussed indexing for matrices, this also
works for vectors and for tensors of more than 2 dimensions.
X[:2, :] = 12
X
tensor([[12., 12., 12., 12.],
        [12., 12., 12., 12.],
        [ 8.,  9., 10., 11.]])
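The same indexing rules carry over to tensors with more than two axes, as noted before the last code block. The sketch below uses a hypothetical tensor T of shape (2, 3, 4); the values are read into plain Python lists before T is modified, since slices of a tensor can share memory with it.

```python
import torch

T = torch.arange(24).reshape(2, 3, 4)

# A single index applies to axis 0, so T[-1] is the last (3, 4) block.
last_block_shape = tuple(T[-1].shape)

# One index or slice per axis: all of axis 0, positions 1-2 of axis 1, position 0 of axis 2.
sub_values = T[:, 1:3, 0].tolist()

# Assigning a single value to a slice works exactly as for matrices.
T[0, :, :2] = -1
first_block = T[0].tolist()

print(last_block_shape, sub_values, first_block)
```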
2.1.3 Operations
Now that we know how to construct tensors and how to read from and write to their elements,
we can begin to manipulate them with various mathematical operations. Among the most useful
tools are the elementwise operations. These apply a standard scalar operation to each element of a
tensor. For functions that take two tensors as inputs, elementwise operations apply some standard
binary operator on each pair of corresponding elements. We can create an elementwise function
from any function that maps from a scalar to a scalar.
In mathematical notation, we denote such unary scalar operators (taking one input) by the signa-
ture $f: \mathbb{R} \rightarrow \mathbb{R}$. This just means that the function maps from any real number onto some other
real number. Most standard operators can be applied elementwise including unary operators like
$e^x$.
torch.exp(x)
tensor([162754.7969, 162754.7969, 162754.7969, 162754.7969, 162754.7969,
        162754.7969, 162754.7969, 162754.7969,   2980.9580,   8103.0840,
         22026.4648,  59874.1406])
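The output above may look surprising: its first eight entries all equal e^12, which is not what exponentiating the values 0 through 7 would give. A likely explanation (our reading; the text does not spell it out) is that X = x.reshape(3, 4) returned a view that shares memory with x, so the earlier assignment X[:2, :] = 12 also overwrote the first eight elements of x. The sketch below reproduces this behavior and shows that clone() yields an independent copy; the names x2 and X2 are ours.

```python
import torch

x = torch.arange(12, dtype=torch.float32)
X = x.reshape(3, 4)        # for a contiguous tensor, reshape returns a view

X[:2, :] = 12              # writes through to the memory shared with x
shares_memory = x.data_ptr() == X.data_ptr()
first_eight = x[:8].tolist()
rest = x[8:].tolist()

# Use clone() when an independent copy is wanted.
x2 = torch.arange(12, dtype=torch.float32)
X2 = x2.reshape(3, 4).clone()
X2[:2, :] = 12
untouched = x2[:8].tolist()

print(shares_memory, first_eight, rest, untouched)
```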
Likewise, we denote binary scalar operators, which map pairs of real numbers to a (single) real
number via the signature $f: \mathbb{R}, \mathbb{R} \rightarrow \mathbb{R}$. Given any two vectors $\mathbf{u}$ and $\mathbf{v}$ of the same shape, and a
binary operator $f$, we can produce a vector $\mathbf{c} = F(\mathbf{u}, \mathbf{v})$ by setting $c_i \gets f(u_i, v_i)$ for all $i$, where
$c_i$, $u_i$, and $v_i$ are the $i^\mathrm{th}$ elements of vectors $\mathbf{c}$, $\mathbf{u}$, and $\mathbf{v}$. Here, we produced the vector-valued
$F: \mathbb{R}^d, \mathbb{R}^d \rightarrow \mathbb{R}^d$ by lifting the scalar function to an elementwise vector operation. The common
standard arithmetic operators for addition (+), subtraction (-), multiplication (*), division (/), and
exponentiation (**) have all been lifted to elementwise operations for identically-shaped tensors
of arbitrary shape.
x = torch.tensor([1.0, 2, 4, 8])
y = torch.tensor([2, 2, 2, 2])
x + y, x - y, x * y, x / y, x ** y
(tensor([ 3.,  4.,  6., 10.]),
 tensor([-1.,  0.,  2.,  6.]),
 tensor([ 2.,  4.,  8., 16.]),
 tensor([0.5000, 1.0000, 2.0000, 4.0000]),
 tensor([ 1.,  4., 16., 64.]))
In addition to elementwise computations, we can also perform linear algebra operations, such as
dot products and matrix multiplications. We will elaborate on these shortly in Section 2.3.
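As a brief preview of those linear algebra operations (a sketch with example values of our choosing, not a substitute for the later section), the snippet below computes a dot product and a matrix product, and contrasts the latter with the elementwise product.

```python
import torch

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([4.0, 5.0, 6.0])
dot = torch.dot(u, v)            # 1*4 + 2*5 + 3*6 = 32

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
B = torch.ones(3, 2)
product = A @ B                  # matrix product: (2, 3) times (3, 2) gives (2, 2)
elementwise = A * A              # elementwise product keeps the shape (2, 3)

print(dot, product, elementwise.shape)
```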
We can also concatenate multiple tensors together, stacking them end-to-end to form a larger ten-
sor. We just need to provide a list of tensors and tell the system along which axis to concatenate.
The example below shows what happens when we concatenate two matrices along rows (axis 0)
vs. columns (axis 1). We can see that the first outputʼs axis-0 length (6) is the sum of the two input
tensorsʼ axis-0 lengths (3 + 3); while the second outputʼs axis-1 length (8) is the sum of the two
input tensorsʼ axis-1 lengths (4 + 4).
X = torch.arange(12, dtype=torch.float32).reshape((3,4))
Y = torch.tensor([[2.0, 1, 4, 3], [1, 2, 3, 4], [4, 3, 2, 1]])
torch.cat((X, Y), dim=0), torch.cat((X, Y), dim=1)
(tensor([[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.],
         [ 2.,  1.,  4.,  3.],
         [ 1.,  2.,  3.,  4.],
         [ 4.,  3.,  2.,  1.]]),
 tensor([[ 0.,  1.,  2.,  3.,  2.,  1.,  4.,  3.],
         [ 4.,  5.,  6.,  7.,  1.,  2.,  3.,  4.],
         [ 8.,  9., 10., 11.,  4.,  3.,  2.,  1.]]))
Sometimes, we want to construct a binary tensor via logical statements. Take X == Y as an example.
For each position i, j, if X[i, j] and Y[i, j] are equal, then the corresponding entry in the
result takes the value True (1); otherwise it takes the value False (0).
X == Y
tensor([[False,  True, False,  True],
        [False, False, False, False],
        [False, False, False, False]])
Summing all the elements in the tensor yields a tensor with only one element.
X.sum()
tensor(66.)
2.1.4 Broadcasting
By now, you know how to perform elementwise binary operations on two tensors of the same
shape. Under certain conditions, even when shapes differ, we can still perform elementwise bi-
nary operations by invoking the broadcasting mechanism. Broadcasting works according to the
following two-step procedure: (i) expand one or both arrays by copying elements along axes with
length 1 so that after this transformation, the two tensors have the same shape; (ii) perform an
elementwise operation on the resulting arrays.
a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))
a, b
(tensor([[0],
[1],
[2]]),
tensor([[0, 1]]))
Since a and b are 3 × 1 and 1 × 2 matrices, respectively, their shapes do not match up. Broadcasting
produces a larger 3 × 2 matrix by replicating matrix a along the columns and matrix b along the
rows before adding them elementwise.
a + b
tensor([[0, 1],
[1, 2],
[2, 3]])
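The two-step procedure can also be carried out by hand. In the sketch below, expand performs step (i) explicitly (it behaves as if the length-1 axis were replicated, without actually copying memory), and ordinary addition performs step (ii). The names a_expanded and b_expanded are ours.

```python
import torch

a = torch.arange(3).reshape((3, 1))
b = torch.arange(2).reshape((1, 2))

# Step (i): expand each tensor along its length-1 axis to the common shape (3, 2).
a_expanded = a.expand(3, 2)   # every column repeats a's single column
b_expanded = b.expand(3, 2)   # every row repeats b's single row

# Step (ii): elementwise addition on tensors of identical shape.
manual = a_expanded + b_expanded
matches = torch.equal(manual, a + b)

# torch.broadcast_shapes reports the result shape without computing anything.
result_shape = tuple(torch.broadcast_shapes(a.shape, b.shape))

print(manual, matches, result_shape)
```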
2.1.5 Saving Memory
Running operations can cause new memory to be allocated to host results. For example, if we
write Y = X + Y, we dereference the tensor that Y used to point to and instead point Y at the newly
allocated memory. We can demonstrate this issue with Pythonʼs id() function, which gives us the
exact address of the referenced object in memory. Note that after we run Y = Y + X, id(Y) points
to a different location. Thatʼs because Python first evaluates Y + X, allocating new memory for the
result and then points Y to this new location in memory.
before = id(Y)
Y = Y + X
id(Y) == before
False
This might be undesirable for two reasons. First, we do not want to run around allocating memory
unnecessarily all the time. In machine learning, we often have hundreds of megabytes of param-
eters and update all of them multiple times per second. Whenever possible, we want to perform
these updates in place. Second, we might point at the same parameters from multiple variables.
If we do not update in place, we must be careful to update all of these references, lest we spring a
memory leak or inadvertently refer to stale parameters.
Fortunately, performing in-place operations is easy. We can assign the result of an operation to
a previously allocated array Y by using slice notation: Y[:] = <expression>. To illustrate this
concept, we overwrite the values of tensor Z, after initializing it, using zeros_like, to have the
same shape as Y.
Z = torch.zeros_like(Y)
print('id(Z):', id(Z))
Z[:] = X + Y
print('id(Z):', id(Z))
id(Z): 140166822857040
id(Z): 140166822857040
If the value of X is not reused in subsequent computations, we can also use X[:] = X + Y or X +=
Y to reduce the memory overhead of the operation.
before = id(X)
X += Y
id(X) == before
True
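Python's id() identifies the tensor object; the address of the underlying storage can also be inspected with data_ptr(). The sketch below, with small example tensors of our choosing, confirms that slice assignment keeps the storage in place and that rebinding the name does not. It also shows the out= argument of torch.add, which writes the sum directly into an existing tensor, whereas Z[:] = X + Y still creates a temporary for X + Y before copying it.

```python
import torch

X = torch.arange(12, dtype=torch.float32).reshape(3, 4)
Y = torch.ones(3, 4)
Z = torch.zeros_like(Y)

ptr_before = Z.data_ptr()        # address of Z's underlying storage
Z[:] = X + Y                     # the result is copied into Z's existing storage
same_after_slice = Z.data_ptr() == ptr_before

torch.add(X, Y, out=Z)           # writes the sum directly into Z
same_after_out = Z.data_ptr() == ptr_before
expected = torch.equal(Z, X + Y)

Z = X + Y                        # rebinding: Z now names a newly allocated tensor
same_after_rebind = Z.data_ptr() == ptr_before

print(same_after_slice, same_after_out, same_after_rebind)
```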
2.1.6 Conversion to Other Python Objects
Converting to a NumPy tensor (ndarray), or vice versa, is easy. The torch Tensor and numpy array
will share their underlying memory, and changing one through an in-place operation will also
change the other.
A = X.numpy()
B = torch.from_numpy(A)
type(A), type(B)
(numpy.ndarray, torch.Tensor)
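As a minimal sketch of the memory sharing described above (assuming a CPU tensor; the variable `C` and the use of `clone` to obtain an independent copy are additions for illustration):

```python
import torch

X = torch.arange(4, dtype=torch.float32)
A = X.numpy()               # NumPy array backed by the tensor's memory
B = torch.from_numpy(A)     # tensor backed by the same memory again

X += 1                      # in-place change through the tensor
A[0] = 100.0                # in-place change through the array

C = X.clone().numpy()       # cloning first yields an independent copy
X += 1                      # this update does not reach C

print(X, A, B, C)
```

If the two objects must evolve independently, copying before converting (as with `C`) avoids surprises.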

To convert a size-1 tensor to a Python scalar, we can invoke the item function or Pythonʼs built-in
functions.
a = torch.tensor([3.5])
a, a.item(), float(a), int(a)
(tensor([3.5000]), 3.5, 3.5, 3)
2.1.7 Summary
• The tensor class is the main interface for storing and manipulating data in deep learning
libraries.
• Tensors provide a variety of functionalities including construction routines; indexing and
slicing; basic mathematics operations; broadcasting; memory-efficient assignment; and
conversion to and from other Python objects.
Exercises
1. Run the code in this section. Change the conditional statement X == Y to X < Y or X > Y,
and then see what kind of tensor you can get.
2. Replace the two tensors that operate by element in the broadcasting mechanism with other
shapes, e.g., 3-dimensional tensors. Is the result the same as expected?
Discussions41
2.2 Data Preprocessing
So far, we have been working with synthetic data that arrived in ready-made tensors. However,
to apply deep learning in the wild we must extract messy data stored in arbitrary formats, and
preprocess it to suit our needs. Fortunately, the pandas library42 can do much of the heavy lifting.
This section, while no substitute for a proper pandas tutorial43 , will give you a crash course on
some of the most common routines.
2.2.1 Reading the Dataset
Comma-separated values (CSV) files are ubiquitous for storing tabular (spreadsheet-like) data.
Here, each line corresponds to one record and consists of several (comma-separated) fields, e.g.,
“Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,Accomplishments in the field of
gravitational physics”. To demonstrate how to load CSV files with pandas, we create a CSV file
below ../data/house_tiny.csv. This file represents a dataset of homes, where each row corre-
sponds to a distinct home and the columns correspond to the number of rooms (NumRooms), the
roof type (RoofType), and the price (Price).
42 https://pandas.pydata.org/
43 https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
import os
os.makedirs(os.path.join('..', 'data'), exist_ok=True)

data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')
Now letʼs import pandas and load the dataset with read_csv.
import pandas as pd
data = pd.read_csv(data_file)
print(data)
   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000
2.2.2 Data Preparation
In supervised learning, we train models to predict a designated target value, given some set of
input values. Our first step in processing the dataset is to separate out columns corresponding to
input versus target values. We can select columns either by name or via integer-location based
indexing (iloc).
You might have noticed that pandas replaced all CSV entries with value NA with a special NaN (not
a number) value. This can also happen whenever an entry is empty, e.g., “3,,,270000”. These are
called missing values and they are the “bed bugs” of data science, a persistent menace that you will
confront throughout your career. Depending upon the context, missing values might be handled
either via imputation or deletion. Imputation replaces missing values with estimates of their values
while deletion simply discards either those rows or those columns that contain missing values.
Here are some common imputation heuristics. For categorical input fields, we can treat NaN as a
category. Since the RoofType column takes values Slate and NaN, pandas can convert this column
into two columns RoofType_Slate and RoofType_nan. A row whose roof type is Slate will set the values
of RoofType_Slate and RoofType_nan to 1 and 0, respectively. The converse holds for a row
with a missing RoofType value.
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]

inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)
   NumRooms  RoofType_Slate  RoofType_nan
0       NaN               0             1
1       2.0               0             1
2       4.0               1             0
3       NaN               0             1
For missing numerical values, one common heuristic is to replace the NaN entries with the mean
value of the corresponding column.
inputs = inputs.fillna(inputs.mean())
print(inputs)
   NumRooms  RoofType_Slate  RoofType_nan
0       3.0               0             1
1       2.0               0             1
2       4.0               1             0
3       3.0               0             1
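Deletion, the alternative to imputation mentioned earlier, is not shown in this section. A minimal sketch using the pandas `dropna` method might look as follows; the variable `csv_text` and the use of `io.StringIO` in place of a file on disk are assumptions made to keep the snippet self-contained.

```python
import io

import pandas as pd

# Same toy dataset as above, read from an in-memory string instead of a file.
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

# Discard every row that contains at least one missing value.
rows_kept = data.dropna(axis=0)

# Discard every column that contains at least one missing value.
cols_kept = data.dropna(axis=1)

print(rows_kept)
print(cols_kept)
```

On this tiny dataset deletion is drastic: only one row, or only the Price column, survives, which is one reason imputation is often preferred when data is scarce.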
2.2.3 Conversion to the Tensor Format
Now that all the entries in inputs and targets are numerical, we can load them into a tensor (recall
Section 2.1).
import torch
X, y = torch.tensor(inputs.values), torch.tensor(targets.values)
X, y
(tensor([[3., 0., 1.],
         [2., 0., 1.],
         [4., 1., 0.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
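Depending on the installed pandas version, `get_dummies` may return boolean rather than 0/1 integer columns, in which case `inputs.values` can end up with a mixed (object) type that cannot be turned into a tensor directly. The following sketch sidesteps this by casting explicitly with `to_numpy(dtype=float)`; this is a defensive variant under that assumption, not the book's original listing, and `csv_text` is again a hypothetical stand-in for the file on disk.

```python
import io

import pandas as pd
import torch

csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs = pd.get_dummies(inputs, dummy_na=True)   # one-hot encode RoofType
inputs = inputs.fillna(inputs.mean())            # mean-impute NumRooms

# Cast to a homogeneous floating-point array before building the tensors.
X = torch.tensor(inputs.to_numpy(dtype=float))
y = torch.tensor(targets.to_numpy(dtype=float))

print(X, y)
```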
2.2.4 Discussion
You now know how to partition data columns, impute missing values, and load pandas data into
tensors. In Section 5.7, you will pick up some more data processing skills. While this crash course
kept things simple, data processing can get hairy. For example, rather than arriving in a single
CSV file, our dataset might be spread across multiple files extracted from a relational database.
For instance, in an e-commerce application, customer addresses might live in one table and pur-
chase data in another. Moreover, practitioners face myriad data types beyond categorical and
numeric. Other data types include text strings, images, audio data, and point clouds. Oftentimes,
advanced tools and efficient algorithms are required to prevent data processing from becoming
the biggest bottleneck in the machine learning pipeline. These problems will arise when we get to
computer vision and natural language processing. Finally, we must pay attention to data quality.
Real-world datasets are often plagued by outliers, faulty measurements from sensors, and record-
ing errors, which must be addressed before feeding the data into any model. Data visualization
tools such as seaborn44 , Bokeh45 , or matplotlib46 can help you to manually inspect the data and
develop intuitions about what problems you may need to address.
Exercises
1. Try loading datasets, e.g., Abalone from the UCI Machine Learning Repository47 and inspect
their properties. What fraction of them has missing values? What fraction of the variables
is numerical, categorical, or text?
2. Try out indexing and selecting data columns by name rather than by column number. The
pandas documentation on indexing48 has further details on how to do this.
3. How large a dataset do you think you could load this way? What might be the limitations?
Hint: consider the time to read the data, representation, processing, and memory footprint.
Try this out on your laptop. What changes if you try it out on a server?
4. How would you deal with data that has a very large number of categories? What if the cate-
gory labels are all unique? Should you include the latter?
5. What alternatives to pandas can you think of? How about loading NumPy tensors from a
file49 ? Check out Pillow50 , the Python Imaging Library.
Discussions51
2.3 Linear Algebra
By now, we can load datasets into tensors and manipulate these tensors with basic mathematical
operations. To start building sophisticated models, we will also need a few tools from linear alge-
bra. This section offers a gentle introduction to the most essential concepts, starting from scalar
arithmetic and ramping up to matrix multiplication.
2.3.1 Scalars
Most everyday mathematics consists of manipulating numbers one at a time. Formally, we call
these values scalars. For example, the temperature in Palo Alto is a balmy 72 degrees Fahrenheit. If
you wanted to convert the temperature to Celsius you would evaluate the expression $c = \frac{5}{9}(f - 32)$,
setting f to 72. In this equation, the values 5, 9, and 32 are scalars. The variables c and f represent
unknown scalars.
We denote scalars by ordinary lower-cased letters (e.g., x, y, and z) and the space of all (continu-
ous) real-valued scalars by R. For expedience, we will skip past rigorous definitions of spaces. Just
remember that the expression x ∈ R is a formal way to say that x is a real-valued scalar. The sym-
bol ∈ (pronounced “in”) denotes membership in a set. For example, x, y ∈ {0, 1} indicates that x
and y are variables that can only take values 0 or 1.
44 https://seaborn.pydata.org/
45 https://docs.bokeh.org/
46 https://matplotlib.org/
47 https://archive.ics.uci.edu/ml/datasets.php
48 https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
49 https://numpy.org/doc/stable/reference/generated/numpy.load.html
50 https://python-pillow.org/
Scalars are implemented as tensors that contain only one element. Below, we assign two scalars
and perform the familiar addition, multiplication, division, and exponentiation operations.
import torch
x = torch.tensor(3.0)
y = torch.tensor(2.0)
x + y, x * y, x / y, x**y
(tensor(5.), tensor(6.), tensor(1.5000), tensor(9.))
2.3.2 Vectors
For our purposes, you can think of vectors as fixed-length arrays of scalars. As with their code
counterparts, we call these values the elements of the vector (synonyms include entries and compo-
nents). When vectors represent examples from real-world datasets, their values hold some real-
world significance. For example, if we were training a model to predict the risk of a loan default-
ing, we might associate each applicant with a vector whose components correspond to quantities
like their income, length of employment, or number of previous defaults. If we were studying
heart attack risk, each vector might represent a patient and its components might correspond
to their most recent vital signs, cholesterol levels, minutes of exercise per day, etc. We denote
vectors by bold lowercase letters (e.g., x, y, and z).
Vectors are implemented as 1st-order tensors. In general, such tensors can have arbitrary lengths,
subject to memory limitations. Caution: in Python, like in most programming languages, vector
indices start at 0, also known as zero-based indexing, whereas in linear algebra subscripts begin at
1 (one-based indexing).
x = torch.arange(3)
x
tensor([0, 1, 2])
We can refer to an element of a vector by using a subscript. For example, x2 denotes the second
element of x. Since x2 is a scalar, we do not bold it. By default, we visualize vectors by stacking
their elements vertically.
$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \tag{2.3.1}$$
Here x1 , . . . , xn are elements of the vector. Later on, we will distinguish between such column
vectors and row vectors whose elements are stacked horizontally. Recall that we access a tensorʼs
elements via indexing.
x[2]
tensor(2)
To indicate that a vector contains $n$ elements, we write $\mathbf{x} \in \mathbb{R}^n$. Formally, we call $n$ the dimensionality
of the vector. In code, this corresponds to the tensorʼs length, accessible via Pythonʼs built-in
len function.
len(x)
3
We can also access the length via the shape attribute. The shape is a tuple that indicates a tensorʼs
length along each axis. Tensors with just one axis have shapes with just one element.
x.shape
torch.Size([3])
Oftentimes, the word “dimension” gets overloaded to mean both the number of axes and the length
along a particular axis. To avoid this confusion, we use order to refer to the number of axes and
dimensionality exclusively to refer to the number of components.
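The distinction between order and dimensionality can be checked directly in code. In this sketch, `ndim` reports the number of axes (the order) while `len` and `shape` report lengths along axes; the tensor name `T` is illustrative.

```python
import torch

x = torch.arange(3)                     # a vector: one axis of length 3
T = torch.arange(24).reshape(2, 3, 4)   # a tensor with three axes

order_x = x.ndim             # number of axes of x
dimensionality_x = len(x)    # number of components of x

order_T = T.ndim             # number of axes of T
lengths_T = tuple(T.shape)   # length along each axis of T

print(order_x, dimensionality_x, order_T, lengths_T)
```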
2.3.3 Matrices
Just as scalars are 0th-order tensors and vectors are 1st-order tensors, matrices are 2nd-order tensors.
We denote matrices by bold capital letters (e.g., X, Y, and Z), and represent them in code by
tensors with two axes. The expression $\mathbf{A} \in \mathbb{R}^{m \times n}$ indicates that a matrix $\mathbf{A}$ contains $m \times n$ real-valued
scalars, arranged as $m$ rows and $n$ columns. When $m = n$, we say that a matrix is square.
Visually, we can illustrate any matrix as a table. To refer to an individual element, we subscript
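both the row and column indices, e.g., $a_{ij}$ is the value that belongs to $\mathbf{A}$ʼs $i$th row and $j$th column:
$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}. \tag{2.3.2}$$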
In code, we represent a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ by a 2nd-order tensor with shape (m, n). We can convert
any tensor with $mn$ elements into an $m \times n$ matrix by passing the desired shape to reshape:
A = torch.arange(6).reshape(3, 2)
A
tensor([[0, 1],
[2, 3],
[4, 5]])
Sometimes, we want to flip the axes. When we exchange a matrixʼs rows and columns, the result is
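called its transpose. Formally, we signify a matrix $\mathbf{A}$ʼs transpose by $\mathbf{A}^\top$ and if $\mathbf{B} = \mathbf{A}^\top$, then $b_{ij} = a_{ji}$
for all $i$ and $j$. Thus, the transpose of an $m \times n$ matrix is an $n \times m$ matrix:
$$\mathbf{A}^\top = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix}. \tag{2.3.3}$$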
In code, we can access any matrixʼs transpose as follows:
A.T
tensor([[0, 2, 4],
[1, 3, 5]])
Symmetric matrices are the subset of square matrices that are equal to their own transposes: A =
A⊤ . The following matrix is symmetric:
A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])

A == A.T
tensor([[True, True, True],
        [True, True, True],
        [True, True, True]])
Matrices are useful for representing datasets. Typically, rows correspond to individual records
and columns correspond to distinct attributes.
2.3.4 Tensors
While you can go far in your machine learning journey with only scalars, vectors, and matrices,
eventually you may need to work with higher-order tensors. Tensors give us a generic way to describe
extensions to $n$th-order arrays. We call software objects of the tensor class “tensors” precisely
because they too can have arbitrary numbers of axes. While it may be confusing to use the word
tensor for both the mathematical object and its realization in code, our meaning should usually
be clear from context. We denote general tensors by capital letters with a special font face (e.g.,
$\mathsf{X}$, $\mathsf{Y}$, and $\mathsf{Z}$) and their indexing mechanism (e.g., $x_{ijk}$ and $[\mathsf{X}]_{1, 2i-1, 3}$) follows naturally from that of
matrices.
Tensors will become more important when we start working with images. Each image arrives
as a 3rd-order tensor with axes corresponding to the height, width, and channel. At each spatial
location, the intensities of each color (red, green, and blue) are stacked along the channel axis.
Moreover, a collection of images is represented in code by a 4th-order tensor, where distinct images
are indexed along the first axis. Higher-order tensors are constructed analogously to vectors and
matrices, by growing the number of shape components.
torch.arange(24).reshape(2, 3, 4)
tensor([[[ 0,  1,  2,  3],
         [ 4,  5,  6,  7],
         [ 8,  9, 10, 11]],

        [[12, 13, 14, 15],
         [16, 17, 18, 19],
         [20, 21, 22, 23]]])
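To make the image example concrete, here is a small sketch that builds a toy "image" with the height, width, channel layout described above and stacks two of them into a 4th-order tensor. The sizes and the all-zero pixel values are arbitrary placeholders chosen for illustration.

```python
import torch

height, width, channels = 4, 5, 3

# A single image: a 3rd-order tensor with axes (height, width, channel).
image = torch.zeros(height, width, channels)

# A collection of images: stacking along a new first axis gives a
# 4th-order tensor in which distinct images are indexed by axis 0.
batch = torch.stack([image, image.clone()], dim=0)

# The color intensities of image 1 at spatial location (2, 3).
pixel = batch[1, 2, 3]

print(image.shape, batch.shape, pixel.shape)
```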
2.3.5 Basic Properties of Tensor Arithmetic
Scalars, vectors, matrices, and higher-order tensors all have some handy properties. For example,
elementwise operations produce outputs that have the same shape as their operands.
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone()  # Assign a copy of `A` to `B` by allocating new memory
A, A + B
(tensor([[0., 1., 2.],
         [3., 4., 5.]]),
 tensor([[ 0.,  2.,  4.],
         [ 6.,  8., 10.]]))
The elementwise product of two matrices is called their Hadamard product (denoted ⊙). Below,
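we spell out the entries of the Hadamard product of two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$:
$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & \cdots & a_{1n} b_{1n} \\ a_{21} b_{21} & a_{22} b_{22} & \cdots & a_{2n} b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} b_{m1} & a_{m2} b_{m2} & \cdots & a_{mn} b_{mn} \end{bmatrix}. \tag{2.3.4}$$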
A * B
tensor([[ 0.,  1.,  4.],
        [ 9., 16., 25.]])
Adding or multiplying a scalar and a tensor produces a result with the same shape as the original
tensor. Here, each element of the tensor is added to (or multiplied by) the scalar.
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape
(tensor([[[ 2,  3,  4,  5],
          [ 6,  7,  8,  9],
          [10, 11, 12, 13]],

         [[14, 15, 16, 17],
          [18, 19, 20, 21],
          [22, 23, 24, 25]]]),
 torch.Size([2, 3, 4]))
2.3.6 Reduction
Often, we wish to calculate the sum of a tensorʼs elements. To express the sum of the elements in
a vector $\mathbf{x}$ of length $n$, we write $\sum_{i=1}^n x_i$. There is a simple function for it:
x = torch.arange(3, dtype=torch.float32)
x, x.sum()
(tensor([0., 1., 2.]), tensor(3.))
To express sums over the elements of tensors of arbitrary shape, we simply sum over all of its axes.
For example, the sum of the elements of an $m \times n$ matrix $\mathbf{A}$ could be written $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.
A.shape, A.sum()
(torch.Size([2, 3]), tensor(15.))
By default, invoking the sum function reduces a tensor along all of its axes, eventually producing
a scalar. Our libraries also allow us to specify the axes along which the tensor should be reduced.
To sum over all elements along the rows (axis 0), we specify axis=0 in sum. Since the input matrix
reduces along axis 0 to generate the output vector, this axis is missing from the shape of the output.
A.shape, A.sum(axis=0).shape
(torch.Size([2, 3]), torch.Size([3]))
Specifying axis=1 will reduce the column dimension (axis 1) by summing up elements of all the
columns.
A.shape, A.sum(axis=1).shape
(torch.Size([2, 3]), torch.Size([2]))
Reducing a matrix along both rows and columns via summation is equivalent to summing up all
the elements of the matrix.
A.sum(axis=[0, 1]) == A.sum()  # Same as `A.sum()`
tensor(True)
A related quantity is the mean, also called the average. We calculate the mean by dividing the sum
by the total number of elements. Because computing the mean is so common, it gets a dedicated
library function that works analogously to sum.
A.mean(), A.sum() / A.numel()
(tensor(2.5000), tensor(2.5000))
Likewise, the function for calculating the mean can also reduce a tensor along specific axes.
A.mean(axis=0), A.sum(axis=0) / A.shape[0]
(tensor([1.5000, 2.5000, 3.5000]), tensor([1.5000, 2.5000, 3.5000]))
2.3.7 Non-Reduction Sum
Sometimes it can be useful to keep the number of axes unchanged when invoking the function for
calculating the sum or mean. This matters when we want to use the broadcast mechanism.
sum_A = A.sum(axis=1, keepdims=True)

sum_A, sum_A.shape
(tensor([[ 3.],
[12.]]),
torch.Size([2, 1]))
For instance, since sum_A keeps its two axes after summing each row, we can divide A by sum_A with
broadcasting to create a matrix where each row sums up to 1.
A / sum_A
tensor([[0.0000, 0.3333, 0.6667],
        [0.2500, 0.3333, 0.4167]])
If we want to calculate the cumulative sum of elements of A along some axis, say axis=0 (row by
row), we can call the cumsum function. By design, this function does not reduce the input tensor
along any axis.
A.cumsum(axis=0)
tensor([[0., 1., 2.],
        [3., 5., 7.]])

2.3.8 Dot Products
So far, we have only performed elementwise operations, sums, and averages. And if this was all
we could do, linear algebra would not deserve its own section. Fortunately, this is where things get
more interesting. One of the most fundamental operations is the dot product. Given two vectors
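$\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, their dot product $\mathbf{x}^\top \mathbf{y}$ (or $\langle \mathbf{x}, \mathbf{y} \rangle$) is a sum over the products of the elements at the same
position: $\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$.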
y = torch.ones(3, dtype = torch.float32)

x, y, torch.dot(x, y)
(tensor([0., 1., 2.]), tensor([1., 1., 1.]), tensor(3.))
Equivalently, we can calculate the dot product of two vectors by performing an elementwise mul-
tiplication followed by a sum:
torch.sum(x * y)
tensor(3.)
Dot products are useful in a wide range of contexts. For example, given some set of values, denoted
by a vector $\mathbf{x} \in \mathbb{R}^n$ and a set of weights denoted by $\mathbf{w} \in \mathbb{R}^n$, the weighted sum of the values in $\mathbf{x}$
according to the weights $\mathbf{w}$ could be expressed as the dot product $\mathbf{x}^\top \mathbf{w}$. When the weights are
non-negative and sum to one, i.e., $\sum_{i=1}^{n} w_i = 1$, the dot product expresses a weighted average.
After normalizing two vectors to have unit length, the dot products express the cosine of the angle
between them. Later in this section, we will formally introduce this notion of length.
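Both uses of the dot product mentioned above can be sketched in a few lines. The numbers are made up for illustration, and the vector length is computed with `torch.norm`, which returns the Euclidean length by default.

```python
import torch

# Weighted average: the weights are non-negative and sum to one.
values = torch.tensor([1.0, 2.0, 3.0])
weights = torch.tensor([0.2, 0.3, 0.5])
weighted_average = torch.dot(values, weights)

# Cosine of the angle: normalize both vectors to unit length first.
a = torch.tensor([1.0, 0.0])
b = torch.tensor([1.0, 1.0])
cosine = torch.dot(a / torch.norm(a), b / torch.norm(b))

print(weighted_average, cosine)
```

Here the two vectors enclose an angle of 45 degrees, so the cosine should be close to $1/\sqrt{2}$.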
2.3.9 Matrix-Vector Products
Now that we know how to calculate dot products, we can begin to understand the product between
an m × n matrix A and an n-dimensional vector x. To start off, we visualize our matrix in terms of
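its row vectors
$$\mathbf{A} = \begin{bmatrix} \mathbf{a}^\top_1 \\ \mathbf{a}^\top_2 \\ \vdots \\ \mathbf{a}^\top_m \end{bmatrix}, \tag{2.3.5}$$
where each $\mathbf{a}^\top_i \in \mathbb{R}^n$ is a row vector representing the $i$th row of the matrix $\mathbf{A}$.
The matrix-vector product $\mathbf{A}\mathbf{x}$ is simply a column vector of length $m$, whose $i$th element is the dot
product $\mathbf{a}^\top_i \mathbf{x}$:
$$\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_1 \\ \mathbf{a}^\top_2 \\ \vdots \\ \mathbf{a}^\top_m \end{bmatrix} \mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_1 \mathbf{x} \\ \mathbf{a}^\top_2 \mathbf{x} \\ \vdots \\ \mathbf{a}^\top_m \mathbf{x} \end{bmatrix}. \tag{2.3.6}$$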
We can think of multiplication with a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ as a transformation that maps vectors
from $\mathbb{R}^n$ to $\mathbb{R}^m$. These transformations are remarkably useful. For example, we can represent
rotations as multiplications by certain square matrices. Matrix-vector products also describe the
key calculation involved in computing the outputs of each layer in a neural network given the
outputs from the previous layer.
To express a matrix-vector product in code, we use the mv function. Note that the column dimen-
sion of A (its length along axis 1) must be the same as the dimension of x (its length). PyTorch
has a convenience operator @ that can execute both matrix-vector and matrix-matrix products
(depending on its arguments). Thus we can write A@x.
A.shape, x.shape, torch.mv(A, x), A@x
(torch.Size([2, 3]), torch.Size([3]), tensor([ 5., 14.]), tensor([ 5., 14.]))
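The rotation example mentioned above can be sketched as follows, assuming the standard 2D rotation matrix for a counterclockwise angle `theta`. The names `R`, `v`, and `rowwise` are illustrative; `rowwise` recomputes the product one row at a time to mirror (2.3.6).

```python
import math

import torch

theta = math.pi / 2  # rotate by 90 degrees counterclockwise
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta), math.cos(theta)]])
v = torch.tensor([1.0, 0.0])

# Matrix-vector product via the @ operator.
rotated = R @ v

# The same product spelled out as one dot product per row of R.
rowwise = torch.stack([torch.dot(row, v) for row in R])

print(rotated, rowwise)
```

Up to floating-point rounding, the unit vector along the first axis is mapped to the unit vector along the second axis.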
2.3.10 Matrix-Matrix Multiplication
If youʼve gotten the hang of dot products and matrix-vector products, then matrix-matrix multipli-
cation should be straightforward.
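Say that we have two matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times m}$:
$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \end{bmatrix}. \tag{2.3.7}$$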
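Let $\mathbf{a}^\top_i \in \mathbb{R}^k$ denote the row vector representing the $i$th row of the matrix $\mathbf{A}$ and let $\mathbf{b}_j \in \mathbb{R}^k$ denote
the column vector from the $j$th column of the matrix $\mathbf{B}$:
$$\mathbf{A} = \begin{bmatrix} \mathbf{a}^\top_1 \\ \mathbf{a}^\top_2 \\ \vdots \\ \mathbf{a}^\top_n \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_m \end{bmatrix}. \tag{2.3.8}$$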
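To form the matrix product $\mathbf{C} \in \mathbb{R}^{n \times m}$, we simply compute each element $c_{ij}$ as the dot product
between the $i$th row of $\mathbf{A}$ and the $j$th column of $\mathbf{B}$, i.e., $\mathbf{a}^\top_i \mathbf{b}_j$:
$$\mathbf{C} = \mathbf{A}\mathbf{B} = \begin{bmatrix} \mathbf{a}^\top_1 \\ \mathbf{a}^\top_2 \\ \vdots \\ \mathbf{a}^\top_n \end{bmatrix} \begin{bmatrix} \mathbf{b}_1 & \mathbf{b}_2 & \cdots & \mathbf{b}_m \end{bmatrix} = \begin{bmatrix} \mathbf{a}^\top_1 \mathbf{b}_1 & \mathbf{a}^\top_1 \mathbf{b}_2 & \cdots & \mathbf{a}^\top_1 \mathbf{b}_m \\ \mathbf{a}^\top_2 \mathbf{b}_1 & \mathbf{a}^\top_2 \mathbf{b}_2 & \cdots & \mathbf{a}^\top_2 \mathbf{b}_m \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}^\top_n \mathbf{b}_1 & \mathbf{a}^\top_n \mathbf{b}_2 & \cdots & \mathbf{a}^\top_n \mathbf{b}_m \end{bmatrix}. \tag{2.3.9}$$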
We can think of the matrix-matrix multiplication AB as performing m matrix-vector products or
m × n dot products and stitching the results together to form an n × m matrix. In the following
snippet, we perform matrix multiplication on A and B. Here, A is a matrix with 2 rows and 3
columns, and B is a matrix with 3 rows and 4 columns. After multiplication, we obtain a matrix
with 2 rows and 4 columns.
B = torch.ones(3, 4)
torch.mm(A, B), A@B

(tensor([[ 3., 3., 3., 3.],
[12., 12., 12., 12.]]),
tensor([[ 3., 3., 3., 3.],
[12., 12., 12., 12.]]))
The term matrix-matrix multiplication is often simplified to matrix multiplication, and should not
be confused with the Hadamard product.
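To connect the code back to the definition, the following sketch assembles the product entry by entry as dot products between rows of A and columns of B, and compares the result with the `@` operator. The explicit loops are for illustration only and would be far too slow for matrices of realistic size.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = torch.ones(3, 4)

n, m = A.shape[0], B.shape[1]
C = torch.empty(n, m)
for i in range(n):
    for j in range(m):
        # Entry (i, j): dot product of row i of A with column j of B.
        C[i, j] = torch.dot(A[i], B[:, j])

matches = torch.allclose(C, A @ B)

# By contrast, the Hadamard product requires operands of the same shape
# and multiplies them elementwise.
hadamard = A * A

print(C, matches, hadamard.shape)
```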
2.3.11 Norms
Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector
tells us how big it is. For instance, the ℓ2 norm measures the (Euclidean) length of a vector. Here, we are
employing a notion of size that concerns the magnitude of a vectorʼs components (not its dimensionality).
A norm is a function ∥ · ∥ that maps a vector to a scalar and satisfies the following three properties:
1. Given any vector x, if we scale (all elements of) the vector by a scalar α ∈ R, its norm scales
accordingly:
∥αx∥ = |α|∥x∥. (2.3.10)
2. For any vectors x and y, norms satisfy the triangle inequality:
∥x + y∥ ≤ ∥x∥ + ∥y∥. (2.3.11)
3. The norm of a vector is nonnegative and it only vanishes if the vector is zero:
$$\|\mathbf{x}\| > 0 \textrm{ for all } \mathbf{x} \neq 0. \tag{2.3.12}$$
Many functions are valid norms and different norms encode different notions of size. The Euclidean
norm that we all learned in elementary school geometry when calculating the hypotenuse
of a right triangle is the square root of the sum of squares of a vectorʼs elements. Formally, this is
called the ℓ2 norm and expressed as
$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}. \tag{2.3.13}$$
The method norm calculates the ℓ2 norm.
u = torch.tensor([3.0, -4.0])
torch.norm(u)
tensor(5.)
The ℓ1 norm is also popular and the associated metric is called the Manhattan distance. By definition,
the ℓ1 norm sums the absolute values of a vectorʼs elements:
$$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|. \tag{2.3.14}$$
Compared to the ℓ2 norm, it is less sensitive to outliers. To compute the ℓ1 norm, we compose the
absolute value with the sum operation.
torch.abs(u).sum()
tensor(7.)
Both the ℓ2 and ℓ1 norms are special cases of the more general ℓp norms:
$$\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}. \tag{2.3.15}$$
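The general formula (2.3.15) translates almost literally into code. The helper `lp_norm` below is a hypothetical name introduced for this sketch; it is checked against `torch.linalg.vector_norm`, which accepts the exponent through its `ord` argument.

```python
import torch

def lp_norm(x, p):
    """Compute the l_p norm of a vector directly from (2.3.15)."""
    return x.abs().pow(p).sum().pow(1.0 / p)

u = torch.tensor([3.0, -4.0])

l1 = lp_norm(u, 1)   # sum of absolute values
l2 = lp_norm(u, 2)   # Euclidean length
l3 = lp_norm(u, 3)   # a less common member of the family

reference_l3 = torch.linalg.vector_norm(u, ord=3)

print(l1, l2, l3, reference_l3)
```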
In the case of matrices, matters are more complicated. After all, matrices can be viewed both as
collections of individual entries and as objects that operate on vectors and transform them into
other vectors. For instance, we can ask by how much longer the matrix-vector product Xv could be
relative to v. This line of thought leads to a norm called the spectral norm. For now, we introduce
the Frobenius norm, which is much easier to compute and defined as the square root of the sum of
the squares of a matrixʼs elements:
$$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}. \tag{2.3.16}$$
The Frobenius norm behaves as if it were an ℓ2 norm of a matrix-shaped vector. Invoking the
following function will calculate the Frobenius norm of a matrix.
torch.norm(torch.ones((4, 9)))
tensor(6.)
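The spectral norm is only mentioned in passing above. As a hedged illustration of how it differs from the Frobenius norm, the sketch below uses `torch.linalg.matrix_norm` on a small diagonal matrix chosen so that both values are easy to verify by hand; the matrix `M` and the test vector `v` are illustrative.

```python
import torch

M = torch.tensor([[3.0, 0.0],
                  [0.0, 4.0]])

# Frobenius norm: square root of the sum of squared entries.
frobenius = torch.linalg.matrix_norm(M, ord='fro')

# Spectral norm: the largest factor by which M can stretch a vector.
spectral = torch.linalg.matrix_norm(M, ord=2)

# The stretch is attained along the second axis for this matrix.
v = torch.tensor([0.0, 1.0])
stretch = torch.norm(M @ v) / torch.norm(v)

print(frobenius, spectral, stretch)
```

For this matrix the Frobenius norm is 5 while the spectral norm is 4, so the two notions of size genuinely differ.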
While we do not want to get too far ahead of ourselves, we can plant some intuition already about
why these concepts are useful. In deep learning, we are often trying to solve optimization prob-
lems: maximize the probability assigned to observed data; maximize the revenue associated with
a recommender model; minimize the distance between predictions and the ground-truth observa-
tions; minimize the distance between representations of photos of the same person while maximiz-
ing the distance between representations of photos of different people. These distances, which
constitute the objectives of deep learning algorithms, are often expressed as norms.
2.3.12 Discussion
In this section, we reviewed all the linear algebra that you will need to understand a remarkable
chunk of modern deep learning. There is a lot more to linear algebra and much of it is useful for
machine learning. For example, matrices can be decomposed into factors, and these decompo-
sitions can reveal low-dimensional structure in real-world datasets. There are entire subfields of
machine learning that focus on using matrix decompositions and their generalizations to high-
order tensors to discover structure in datasets and solve prediction problems. But this book fo-
cuses on deep learning. And we believe you will be more inclined to learn more mathematics
once you have gotten your hands dirty applying machine learning to real datasets. So while we
reserve the right to introduce more mathematics later on, we wrap up this section here.

If you are eager to learn more linear algebra, there are many excellent books and online resources.
For a more advanced crash course, consider checking out (Kolter, 2008, Petersen et al., 2008,
Strang, 1993).
To recap:
• Scalars, vectors, matrices, and tensors are the basic mathematical objects used in linear
algebra and have zero, one, two, and an arbitrary number of axes, respectively.
• Tensors can be sliced or reduced along specified axes via indexing, or operations such as sum
and mean, respectively.
• Elementwise products are called Hadamard products. By contrast, dot products, matrix-
vector products, and matrix-matrix products are not elementwise operations and in general
return objects that have different shapes than the operands.
• Compared to Hadamard products, matrix-matrix products take considerably longer to com-
pute (cubic rather than quadratic time).
• Norms capture various notions of the magnitude of a vector, and are commonly applied to
the difference of two vectors to measure their distance.
• Common vector norms include the ℓ1 and ℓ2 norms, and common matrix norms include the
spectral and Frobenius norms.
Exercises
1. Prove that the transpose of the transpose of a matrix is the matrix itself: (A⊤ )⊤ = A.
2. Given two matrices A and B, show that sum and transposition commute: A⊤ +B⊤ = (A+B)⊤ .
3. Given any square matrix A, is A + A⊤ always symmetric? Can you prove the result by using
only the result of the previous two exercises?
4. We defined the tensor X of shape (2, 3, 4) in this section. What is the output of len(X)? Write
your answer without implementing any code, then check your answer using code.
5. For a tensor X of arbitrary shape, does len(X) always correspond to the length of a certain
axis of X? What is that axis?
6. Run A / A.sum(axis=1) and see what happens. Can you analyze the reason?
7. When traveling between two points in downtown Manhattan, what is the distance that you
need to cover in terms of the coordinates, i.e., in terms of avenues and streets? Can you
travel diagonally?
8. Consider a tensor with shape (2, 3, 4). What are the shapes of the summation outputs along
axis 0, 1, and 2?
9. Feed a tensor with 3 or more axes to the linalg.norm function and observe its output. What
does this function compute for tensors of arbitrary shape?
10. Define three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$, $\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and $\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{14}}$, for instance
initialized with Gaussian random variables. You want to compute the product ABC. Is there
any difference in memory footprint and speed, depending on whether you compute (AB)C
or A(BC)? Why?
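11. Define three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$, $\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and $\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{16}}$. Is there any
difference in speed depending on whether you compute $\mathbf{A}\mathbf{B}$ or $\mathbf{A}\mathbf{C}^\top$? Why? What changes if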
you initialize C = B⊤ without cloning memory? Why?
12. Define three matrices, say A, B, C ∈ R100×200 . Constitute a tensor with 3 axes by stacking
[A, B, C]. What is the dimensionality? Slice out the second coordinate of the third axis to
recover B. Check that your answer is correct.
Discussions52
2.4 Calculus
For a long time, how to calculate the area of a circle remained a mystery. Then, the ancient Greek
mathematician Archimedes came up with the clever idea to inscribe a series of polygons with
increasing numbers of vertices on the inside of a circle (Fig. 2.4.1). For a polygon with n vertices,
we obtain n triangles. The height of each triangle approaches the radius r as we partition the
circle more finely. At the same time, its base approaches 2πr/n, since the ratio between arc and
secant approaches 1 for a large number of vertices. Thus, the total area of the n triangles approaches
$n \cdot r \cdot \frac{1}{2} (2 \pi r/n) = \pi r^2$.
Fig. 2.4.1: Finding the area of a circle as a limit procedure.
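The limiting procedure can be reproduced numerically. This sketch computes the exact area of a regular polygon with n vertices inscribed in a circle of radius r, using the fact that each of its n triangles has two sides of length r enclosing an angle of 2π/n. The function name `inscribed_polygon_area` is hypothetical.

```python
import math

def inscribed_polygon_area(n, r=1.0):
    """Area of a regular n-gon inscribed in a circle of radius r."""
    # Each triangle has two sides of length r with apex angle 2*pi/n,
    # hence area 0.5 * r^2 * sin(2*pi/n).
    return n * 0.5 * r ** 2 * math.sin(2 * math.pi / n)

areas = {n: inscribed_polygon_area(n) for n in (6, 24, 96, 10_000)}

for n, area in areas.items():
    print(f'n={n:6d}, area={area:.8f}, gap to pi={math.pi - area:.2e}')
```

The areas increase with n and approach π (the area of the unit circle) from below.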
This limiting procedure leads to both differential calculus and integral calculus (Section 19.5). The
former can tell us how to increase or decrease a function value by manipulating its arguments.
This comes in handy for the optimization problems that we face in deep learning, where we repeat-
edly update our parameters in order to decrease the loss function. Optimization addresses how to
fit our models to training data, and calculus is its key prerequisite. However, donʼt forget that our
ultimate goal is to perform well on previously unseen data. That problem is called generalization
and will be a key focus of other chapters.
2.4.1 Derivatives and Differentiation
Put simply, a derivative is the rate of change in a function with respect to changes in its arguments.
Derivatives can tell us how rapidly a loss function would increase or decrease were we to increase
or decrease each parameter by an infinitesimally small amount. Formally, for functions f : R → R,
that map from scalars to scalars, the derivative of f at a point x is defined as
$$f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}. \tag{2.4.1}$$
This term on the right hand side is called a limit and it tells us what happens to the value of an
expression as a specified variable approaches a particular value. This limit tells us what the ratio
between a perturbation h and the change in the function value f (x + h) − f (x) converges to as we
shrink its size to zero.
When f ′ (x) exists, f is said to be differentiable at x; and when f ′ (x) exists for all x on a set, e.g.,
the interval [a, b], we say that f is differentiable on this set. Not all functions are differentiable,
including many that we wish to optimize, such as accuracy and the area under the receiver
operating characteristic curve (AUC). However, because computing the derivative of the loss is a crucial
step in nearly all algorithms for training deep neural networks, we often optimize a differentiable
surrogate instead.
We can interpret the derivative f ′ (x) as the instantaneous rate of change of f (x) with respect to x.
Letʼs develop some intuition with an example. Define $u = f(x) = 3x^2 - 4x$.
%matplotlib inline
import numpy as np
from matplotlib_inline import backend_inline
from d2l import torch as d2l
def f(x):
    return 3 * x ** 2 - 4 * x
Setting $x = 1$, $\frac{f(x+h) - f(x)}{h}$ approaches 2 as $h$ approaches 0. While this experiment lacks the rigor
of a mathematical proof, we will soon see that indeed $f'(1) = 2$.
for h in 10.0**np.arange(-1, -6, -1):
    print(f'h={h:.5f}, numerical limit={(f(1+h)-f(1))/h:.5f}')
h=0.10000, numerical limit=2.30000

There are several equivalent notational conventions for derivatives. Given y = f (x), the following
expressions are equivalent:
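$$f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x), \qquad (2.4.2)$$
where the symbols $\frac{d}{dx}$ and $D$ are differentiation operators. Below, we present the derivatives of some
common functions:
$$\begin{aligned}
\frac{d}{dx} C &= 0 && \text{for any constant } C \\
\frac{d}{dx} x^n &= n x^{n-1} && \text{for } n \neq 0 \\
\frac{d}{dx} e^x &= e^x \\
\frac{d}{dx} \ln x &= x^{-1}
\end{aligned} \qquad (2.4.3)$$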
Functions composed from differentiable functions are often themselves differentiable. The following
rules come in handy for working with compositions of any differentiable functions $f$ and
$g$, and constant $C$.
$$\begin{aligned}
\frac{d}{dx} [C f(x)] &= C \frac{d}{dx} f(x) && \text{Constant multiple rule} \\
\frac{d}{dx} [f(x) + g(x)] &= \frac{d}{dx} f(x) + \frac{d}{dx} g(x) && \text{Sum rule} \\
\frac{d}{dx} [f(x) g(x)] &= f(x) \frac{d}{dx} g(x) + g(x) \frac{d}{dx} f(x) && \text{Product rule} \\
\frac{d}{dx} \frac{f(x)}{g(x)} &= \frac{g(x) \frac{d}{dx} f(x) - f(x) \frac{d}{dx} g(x)}{g^2(x)} && \text{Quotient rule}
\end{aligned} \qquad (2.4.4)$$
Using this, we can apply the rules to find the derivative of $3x^2 - 4x$ via
$$\frac{d}{dx} [3x^2 - 4x] = 3 \frac{d}{dx} x^2 - 4 \frac{d}{dx} x = 6x - 4. \qquad (2.4.5)$$
Plugging in x = 1 shows that, indeed, the derivative is 2 at this location. Note that derivatives tell
us the slope of a function at a particular location.
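As a quick sanity check on these rules, the following sketch compares the analytic derivative 6x − 4 from (2.4.5) with a symmetric difference quotient at a few points. The helper name central_diff and the step size h=1e-5 are illustrative assumptions and not part of the d2l package.

```python
def f(x):
    return 3 * x ** 2 - 4 * x

def f_prime(x):
    # Analytic derivative obtained from the differentiation rules, see (2.4.5)
    return 6 * x - 4

def central_diff(func, x, h=1e-5):
    # Hypothetical helper: symmetric difference quotient approximating func'(x)
    return (func(x + h) - func(x - h)) / (2 * h)

for x in [0.0, 1.0, 2.5]:
    print(f'x={x}, analytic={f_prime(x):.5f}, numerical={central_diff(f, x):.5f}')
```

The two columns should agree to within rounding error, since the symmetric quotient has no truncation error for a quadratic function.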
2.4.2 Visualization Utilities
We can visualize the slopes of functions using the matplotlib library. We need to define a few
functions. As its name indicates, use_svg_display tells matplotlib to output graphics in SVG format
for crisper images. The comment #@save is a special modifier that allows us to save any function,
class, or other code block to the d2l package so that we can invoke it later without repeating
the code, e.g., via d2l.use_svg_display().
def use_svg_display():  #@save
    """Use the svg format to display a plot in Jupyter."""
    backend_inline.set_matplotlib_formats('svg')
Conveniently, we can set figure sizes with set_figsize. Since the import statement from matplotlib
import pyplot as plt was marked via #@save in the d2l package, we can call d2l.plt.
def set_figsize(figsize=(3.5, 2.5)):  #@save
    """Set the figure size for matplotlib."""
    use_svg_display()
    d2l.plt.rcParams['figure.figsize'] = figsize
The set_axes function can associate axes with properties, including labels, ranges, and scales.
#@save
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib."""
    axes.set_xlabel(xlabel), axes.set_ylabel(ylabel)
    axes.set_xscale(xscale), axes.set_yscale(yscale)
    axes.set_xlim(xlim), axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()
With these three functions, we can define a plot function to overlay multiple curves. Much of the
code here is just ensuring that the sizes and shapes of inputs match.
#@save
def plot(X, Y=None, xlabel=None, ylabel=None, legend=[], xlim=None,
         ylim=None, xscale='linear', yscale='linear',
         fmts=('-', 'm--', 'g-.', 'r:'), figsize=(3.5, 2.5), axes=None):
    """Plot data points."""

    def has_one_axis(X):  # True if `X` (tensor or list) has 1 axis
        return (hasattr(X, "ndim") and X.ndim == 1 or isinstance(X, list)
                and not hasattr(X[0], "__len__"))

    if has_one_axis(X): X = [X]
    if Y is None:
        X, Y = [[]] * len(X), X
    elif has_one_axis(Y):
        Y = [Y]
    if len(X) != len(Y):
        X = X * len(Y)

    set_figsize(figsize)
    if axes is None: axes = d2l.plt.gca()
    axes.cla()
    for x, y, fmt in zip(X, Y, fmts):
        axes.plot(x,y,fmt) if len(x) else axes.plot(y,fmt)
    set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
Now we can plot the function u = f (x) and its tangent line y = 2x−3 at x = 1, where the coefficient
2 is the slope of the tangent line.
x = np.arange(0, 3, 0.1)
plot(x, [f(x), 2 * x - 3], 'x', 'f(x)', legend=['f(x)', 'Tangent line (x=1)'])
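The tangent line used in this plot follows from the point-slope form of a line through (1, f(1)) with slope f′(1), using f(x) = 3x² − 4x and f′(x) = 6x − 4 from above:

```latex
\begin{aligned}
y &= f(1) + f'(1)(x - 1) \\
  &= (3 \cdot 1^2 - 4 \cdot 1) + (6 \cdot 1 - 4)(x - 1) \\
  &= -1 + 2(x - 1) \\
  &= 2x - 3.
\end{aligned}
```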
2.4.3 Partial Derivatives and Gradients
Thus far, we have been differentiating functions of just one variable. In deep learning, we also
need to work with functions of many variables. We briefly introduce notions of the derivative that
apply to such multivariate functions.
Let $y = f(x_1, x_2, \ldots, x_n)$ be a function with $n$ variables. The partial derivative of $y$ with respect to
its $i$th parameter $x_i$ is
$$\frac{\partial y}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}. \qquad (2.4.6)$$
To calculate $\frac{\partial y}{\partial x_i}$, we can treat $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ as constants and calculate the derivative of
$y$ with respect to $x_i$. The following notation conventions for partial derivatives are all common
and all mean the same thing:
$$\frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = \partial_{x_i} f = \partial_i f = f_{x_i} = f_i = D_i f = D_{x_i} f. \qquad (2.4.7)$$
We can concatenate partial derivatives of a multivariate function with respect to all its variables
to obtain a vector that is called the gradient of the function. Suppose that the input of function
$f: \mathbb{R}^n \rightarrow \mathbb{R}$ is an $n$-dimensional vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top$ and the output is a scalar. The
gradient of the function $f$ with respect to $\mathbf{x}$ is a vector of $n$ partial derivatives:
$$\nabla_{\mathbf{x}} f(\mathbf{x}) = [\partial_{x_1} f(\mathbf{x}), \partial_{x_2} f(\mathbf{x}), \ldots, \partial_{x_n} f(\mathbf{x})]^\top. \qquad (2.4.8)$$
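When there is no ambiguity, $\nabla_{\mathbf{x}} f(\mathbf{x})$ is typically replaced by $\nabla f(\mathbf{x})$. The following rules come in
handy for differentiating multivariate functions:
• For all $\mathbf{A} \in \mathbb{R}^{m \times n}$ we have $\nabla_{\mathbf{x}} \mathbf{A}\mathbf{x} = \mathbf{A}^\top$, and for all $\mathbf{A} \in \mathbb{R}^{n \times m}$ we have $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} = \mathbf{A}$.
• For square matrices $\mathbf{A} \in \mathbb{R}^{n \times n}$ we have that $\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ and in particular $\nabla_{\mathbf{x}} \|\mathbf{x}\|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}$.
Similarly, for any matrix $\mathbf{X}$, we have $\nabla_{\mathbf{X}} \|\mathbf{X}\|_F^2 = 2\mathbf{X}$.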
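The rule for the quadratic form can be checked numerically. The sketch below uses only plain Python lists and compares the closed form (A + A⊤)x with central finite differences for a small, deliberately non-symmetric matrix. The names quad_form and numerical_gradient, the matrix entries, and the step size are illustrative assumptions.

```python
def quad_form(A, x):
    # Computes x^T A x for a square matrix A given as a list of rows
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def numerical_gradient(func, x, h=1e-6):
    # Hypothetical helper: central differences, one coordinate at a time
    grad = []
    for i in range(len(x)):
        x_plus = x[:i] + [x[i] + h] + x[i + 1:]
        x_minus = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((func(x_plus) - func(x_minus)) / (2 * h))
    return grad

A = [[1.0, 2.0], [0.0, 3.0]]  # deliberately non-symmetric
x = [1.0, -2.0]

# Closed form from the rule in the text: (A + A^T) x
analytic = [sum((A[i][j] + A[j][i]) * x[j] for j in range(2)) for i in range(2)]
numeric = numerical_gradient(lambda v: quad_form(A, v), x)
print(analytic, numeric)
```

Using a non-symmetric A matters here: for a symmetric matrix the result reduces to 2Ax, which would hide the role of the transpose.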
2.4.4 Chain Rule
In deep learning, the gradients of concern are often difficult to calculate because we are working
with deeply nested functions (of functions (of functions…)). Fortunately, the chain rule takes
care of this. Returning to functions of a single variable, suppose that y = f (g(x)) and that the
underlying functions y = f (u) and u = g(x) are both differentiable. The chain rule states that
$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}. \qquad (2.4.9)$$
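Turning back to multivariate functions, suppose that $y = f(\mathbf{u})$ has variables $u_1, u_2, \ldots, u_m$, where
each $u_i = g_i(\mathbf{x})$ has variables $x_1, x_2, \ldots, x_n$, i.e., $\mathbf{u} = g(\mathbf{x})$. Then the chain rule states that
$$\frac{\partial y}{\partial x_i} = \frac{\partial y}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial y}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \ldots + \frac{\partial y}{\partial u_m} \frac{\partial u_m}{\partial x_i} \text{ and thus } \nabla_{\mathbf{x}} y = \mathbf{A} \nabla_{\mathbf{u}} y, \qquad (2.4.10)$$
where $\mathbf{A} \in \mathbb{R}^{n \times m}$ is a matrix that contains the derivative of vector $\mathbf{u}$ with respect to vector $\mathbf{x}$. Thus,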
evaluating the gradient requires computing a vector-matrix product. This is one of the key reasons
why linear algebra is such an integral building block in building deep learning systems.
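A small worked instance may make (2.4.10) more tangible. The sketch below picks hypothetical functions u = g(x) and y = f(u) with n = m = 2, builds the matrix A with entries A[i][j] = ∂u_j/∂x_i by hand, and compares A∇_u y against central finite differences of the composition. All function choices and names are assumptions made for illustration.

```python
def g(x):
    # Inner function u = g(x), mapping R^2 -> R^2
    x1, x2 = x
    return [x1 * x2, x1 + x2]

def f(u):
    # Outer function y = f(u), mapping R^2 -> R
    u1, u2 = u
    return u1 ** 2 + 3 * u2

x = [2.0, 3.0]
u = g(x)

grad_u = [2 * u[0], 3.0]   # gradient of y with respect to u
A = [[x[1], 1.0],          # A[i][j] = d u_j / d x_i
     [x[0], 1.0]]
grad_x = [sum(A[i][j] * grad_u[j] for j in range(2)) for i in range(2)]

# Compare with central differences of the composition y = f(g(x))
h = 1e-6
numeric = []
for i in range(2):
    x_plus = list(x)
    x_plus[i] += h
    x_minus = list(x)
    x_minus[i] -= h
    numeric.append((f(g(x_plus)) - f(g(x_minus))) / (2 * h))
print(grad_x, numeric)
```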
2.4.5 Discussion
While we have just scratched the surface of a deep topic, a number of concepts already come
into focus: first, the composition rules for differentiation can be applied mindlessly, enabling
us to compute gradients automatically. This task requires no creativity and thus we can focus
our cognitive powers elsewhere. Second, computing the derivatives of vector-valued functions
requires us to multiply matrices as we trace the dependency graph of variables from output to
input. In particular, this graph is traversed in a forward direction when we evaluate a function
and in a backwards direction when we compute gradients. Later chapters will formally introduce
backpropagation, a computational procedure for applying the chain rule.
From the viewpoint of optimization, gradients allow us to determine how to move the parameters
of a model in order to lower the loss, and each step of the optimization algorithms used throughout
this book will require calculating the gradient.
Exercises
1. So far we took the rules for derivatives for granted. Using the definition and limits prove the
properties for (i) $f(x) = c$, (ii) $f(x) = x^n$, (iii) $f(x) = e^x$ and (iv) $f(x) = \log x$.
2. In the same vein, prove the product, sum, and quotient rule from first principles.
3. Prove that the constant multiple rule follows as a special case of the product rule.
4. Calculate the derivative of $f(x) = x^x$.
5. What does it mean that $f'(x) = 0$ for some $x$? Give an example of a function $f$ and a location
$x$ for which this might hold.
6. Plot the function $y = f(x) = x^3 - \frac{1}{x}$ and plot its tangent line at $x = 1$.
7. Find the gradient of the function $f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}$.
8. What is the gradient of the function $f(\mathbf{x}) = \|\mathbf{x}\|_2$? What happens for $\mathbf{x} = \mathbf{0}$?
9. Can you write out the chain rule for the case where u = f (x, y, z) and x = x(a, b), y = y(a, b),
and z = z(a, b)?
10. Given a function $f(x)$ that is invertible, compute the derivative of its inverse $f^{-1}(x)$. Here
we have that $f^{-1}(f(x)) = x$ and conversely $f(f^{-1}(y)) = y$. Hint: use these properties in
your derivation.
2.5 Automatic Differentiation
Recall from Section 2.4 that calculating derivatives is the crucial step in all of the optimization
algorithms that we will use to train deep networks. While the calculations are straightforward,
working them out by hand can be tedious and error-prone, and this problem only grows as our
models become more complex.
Fortunately, all modern deep learning frameworks take this work off our plates by offering automatic
differentiation (often shortened to autograd). As we pass data through each successive function,
the framework builds a computational graph that tracks how each value depends on others. To
calculate derivatives, automatic differentiation packages then work backwards through this graph
applying the chain rule. The computational algorithm for applying the chain rule in this fashion is
called backpropagation.
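To make the idea of recording a graph and then walking it backwards concrete, here is a deliberately tiny sketch of reverse-mode differentiation for scalars. The Value class and its methods are hypothetical names chosen for illustration. This is not how PyTorch is implemented, only the same chain-rule bookkeeping in miniature.

```python
class Value:
    """A scalar that records how it was computed (illustrative only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value depends on
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Order the graph so that each node comes after everything it depends on
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad

x = Value(3.0)
y = x * x + x   # the forward pass builds the graph for y = x^2 + x
y.backward()    # the backward pass computes dy/dx = 2x + 1
print(y.data, x.grad)
```

Note that gradients are added into each parent rather than overwritten. This is what allows a variable that is used several times (here x appears three times) to collect contributions from every path through the graph.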
While autograd libraries have become a hot concern over the past decade, they have a long history. In
fact, the earliest references to autograd date back over half a century (Wengert, 1964). The core
ideas behind modern backpropagation date to a PhD thesis from 1980 (Speelpenning, 1980) and
were further developed in the late 1980s (Griewank, 1989). While backpropagation has become
the default method for computing gradients, itʼs not the only option. For instance, the Julia programming
language employs forward propagation (Revels et al., 2016). Before exploring methods,
letʼs first master the autograd package.
2.5.1 A Simple Function
Letʼs assume that we are interested in differentiating the function y = 2x⊤ x with respect to the
column vector x. To start, we assign x an initial value.
import torch
x = torch.arange(4.0)
x
tensor([0., 1., 2., 3.])
Before we calculate the gradient of y with respect to x, we need a place to store it. In general,
we avoid allocating new memory every time we take a derivative because deep learning requires
successively computing derivatives with respect to the same parameters thousands or millions of
times, and we might risk running out of memory. Note that the gradient of a scalar-valued function
with respect to a vector x is vector-valued and has the same shape as x.
x.requires_grad_(True) # Better create `x = torch.arange(4.0, requires_grad=True)`

x.grad # The default value is None
We now calculate our function of x and assign the result to y.
y = 2 * torch.dot(x, x)
y
tensor(28., grad_fn=<MulBackward0>)
We can now take the gradient of y with respect to x by calling its backward method. Next, we can
access the gradient via xʼs grad attribute.
y.backward()
x.grad
tensor([ 0., 4., 8., 12.])
We already know that the gradient of the function y = 2x⊤ x with respect to x should be 4x. We
can now verify that the automatic gradient computation and the expected result are identical.
x.grad == 4 * x
tensor([True, True, True, True])
Now letʼs calculate another function of x and take its gradient. Note that PyTorch does not automatically
reset the gradient buffer when we record a new gradient. Instead, the new gradient is
added to the already stored gradient. This behavior comes in handy when we want to optimize
the sum of multiple objective functions. To reset the gradient buffer, we can call x.grad.zero_()
as follows:
x.grad.zero_() # Reset the gradient

y = x.sum()
y.backward()
x.grad
tensor([1., 1., 1., 1.])
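To see the accumulation behavior described above, the following sketch runs the same two computations on a fresh tensor, once without resetting the buffer and once after calling zero_(). The variable names first, accumulated, and fresh are illustrative.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

(2 * torch.dot(x, x)).backward()
first = x.grad.clone()        # gradient of 2 x^T x, i.e., 4x

x.sum().backward()            # no reset: the new gradient (all ones) is added
accumulated = x.grad.clone()

x.grad.zero_()                # reset the buffer in place
x.sum().backward()
fresh = x.grad.clone()

print(first, accumulated, fresh)
```

Without the reset, the stored gradient is the sum 4x + 1 of both gradients. After the reset, only the gradient of x.sum() remains.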
2.5.2 Backward for Non-Scalar Variables
When y is a vector, the most natural interpretation of the derivative of y with respect to a vector
x is a matrix called the Jacobian that contains the partial derivatives of each component of y with
respect to each component of x. Likewise, for higher-order y and x, the differentiation result could
be an even higher-order tensor.
While Jacobians do show up in some advanced machine learning techniques, more commonly we
want to sum up the gradients of each component of y with respect to the full vector x, yielding
a vector of the same shape as x. For example, we often have a vector representing the value of
our loss function calculated separately for each example in a batch of training examples. Here, we just
want to sum up the gradients computed individually for each example.
Because deep learning frameworks vary in how they interpret gradients of non-scalar tensors,
PyTorch takes some steps to avoid confusion. Invoking backward on a non-scalar elicits an error
unless we tell PyTorch how to reduce the object to a scalar. More formally, we need to provide
some vector $\mathbf{v}$ such that backward will compute $\mathbf{v}^\top \partial_{\mathbf{x}} \mathbf{y}$ rather than $\partial_{\mathbf{x}} \mathbf{y}$. This next part may be
confusing, but for reasons that will become clear later, this argument (representing $\mathbf{v}$) is named
gradient. For a more detailed description, see Yang Zhangʼs Medium post
(https://zhang-yang.medium.com/the-gradient-argument-in-pytorchs-backward-function-explained-by-examples-68f266950c29).
x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones(len(y)))  # Faster: y.sum().backward()
x.grad
tensor([0., 2., 4., 6.])
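To see where this result comes from, it may help to write out the Jacobian for this example. With y = x ⊙ x computed elementwise, x = [0, 1, 2, 3], and v the vector of ones passed as gradient:

```latex
\partial_{\mathbf{x}} \mathbf{y} = \mathrm{diag}(2x_1, 2x_2, 2x_3, 2x_4),
\qquad
\mathbf{v}^\top \partial_{\mathbf{x}} \mathbf{y}
= [1, 1, 1, 1] \, \mathrm{diag}(0, 2, 4, 6)
= [0, 2, 4, 6].
```

Choosing v as the vector of ones therefore gives the same result as differentiating the sum of the components of y.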
2.5.3 Detaching Computation
Sometimes, we wish to move some calculations outside of the recorded computational graph. For
example, say that we use the input to create some auxiliary intermediate terms for which we do not
want to compute a gradient. In this case, we need to detach the respective computational
graph from the final result. The following toy example makes this clearer: suppose we have z = x
* y and y = x * x but we want to focus on the direct influence of x on z rather than the influence
conveyed via y. In this case, we can create a new variable u that takes the same value as y but
whose provenance (how it was created) has been wiped out. Thus u has no ancestors in the graph
and gradients do not flow through u to x. For example, taking the gradient of z = x * u will yield
the result u (not 3 * x * x as you might have expected since z = x * x * x).
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x
z.sum().backward()
x.grad == u
Note that while this procedure detaches yʼs ancestors from the graph leading to z, the computational
graph leading to y persists and thus we can calculate the gradient of y with respect to x.
x.grad.zero_()
y.sum().backward()
x.grad == 2 * x
2.5.4 Gradients and Python Control Flow
So far we reviewed cases where the path from input to output was well-defined via a function such
as z = x * x * x. Programming offers us a lot more freedom in how we compute results. For
instance, we can make them depend on auxiliary variables or condition choices on intermediate
results. One benefit of using automatic differentiation is that even if building the computational
graph of a function required passing through a maze of Python control flow (e.g., conditionals,
loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable. To
illustrate this, consider the following code snippet where the number of iterations of the while
loop and the evaluation of the if statement both depend on the value of the input a.
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c
Below, we call this function, passing in a random value as input. Since the input is a random
variable, we do not know what form the computational graph will take. However, whenever we
execute f(a) on a specific input, we realize a specific computational graph and can subsequently
run backward.
a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()
Even though our function f is a bit contrived for demonstration purposes, its dependence on the
input is quite simple: it is a linear function of a with piecewise defined scale. As such, f(a) / a
is a vector of constant entries and, moreover, f(a) / a needs to match the gradient of f(a) with
respect to a.
a.grad == d / a
tensor(True)
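The same piecewise-linear structure can be checked without autograd. The sketch below mirrors f with plain Python floats and compares f(a) / a with a finite-difference slope. It rests on a few assumptions made for illustration: abs stands in for b.norm(), the names f_float and slope are hypothetical, and the inputs are chosen away from 0, where the loop would never terminate.

```python
def f_float(a):
    # Plain-float analogue of f above; abs(b) plays the role of b.norm()
    b = a * 2
    while abs(b) < 1000:
        b = b * 2
    return b if b > 0 else 100 * b

def slope(func, a, h=1e-6):
    # Hypothetical helper: central finite-difference estimate of func'(a)
    return (func(a + h) - func(a - h)) / (2 * h)

for a in [0.7, -0.7]:
    print(f'a={a}, f(a)/a={f_float(a) / a:.1f}, slope={slope(f_float, a):.1f}')
```

For each input, the control flow fixes a constant scale factor, so the ratio and the slope coincide. Different inputs may select different scale factors, which is why the function is only piecewise linear.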
Dynamic control flow is very common in deep learning. For instance, when processing text, the
computational graph depends on the length of the input. In these cases, automatic differentiation
becomes vital for statistical modeling since it is impossible to compute the gradient a priori.
2.5.5 Discussion
Youʼve now gotten a taste of the power of automatic differentiation. The development of libraries
for calculating derivatives both automatically and efficiently has been a massive productivity
booster for deep learning practitioners, liberating them to focus on loftier concerns. Moreover,
autograd permits us to design massive models for which pen and paper gradient computations
would be prohibitively time consuming. Interestingly, while we use autograd to optimize models
(in a statistical sense) the optimization of autograd libraries themselves (in a computational
sense) is a rich subject of vital interest to framework designers. Here, tools from compilers and
graph manipulation are leveraged to compute results in the most expedient and memory-efficient
manner.
For now, try to remember these basics: (i) attach gradients to those variables with respect to which
we desire derivatives; (ii) record the computation of the target value; (iii) execute the backpropagation
function; and (iv) access the resulting gradient.
Exercises
1. Why is the second derivative much more expensive to compute than the first derivative?
2. After running the function for backpropagation, immediately run it again and see what happens.
Why?
3. In the control flow example where we calculate the derivative of d with respect to a, what
would happen if we changed the variable a to a random vector or a matrix? At this point, the
result of the calculation f(a) is no longer a scalar. What happens to the result? How do we
analyze this?
4. Let f (x) = sin(x). Plot the graph of f and of its derivative f ′ . Do not exploit the fact that
f ′ (x) = cos(x) but rather use automatic differentiation to get the result.
5. Let $f(x) = ((\log x^2) \cdot \sin x) + x^{-1}$. Write out a dependency graph tracing results from $x$ to
$f(x)$.
6. Use the chain rule to compute the derivative $\frac{df}{dx}$ of the aforementioned function, placing each
term on the dependency graph that you constructed previously.
7. Given the graph and the intermediate derivative results, you have a number of options when
computing the gradient. Evaluate the result once starting from x to f and once from f tracing
back to x. The path from x to f is commonly known as forward differentiation, whereas the
path from f to x is known as backward differentiation.
8. When might you want to use forward differentiation and when backward differentiation?
Hint: consider the amount of intermediate data needed, the ability to parallelize steps, and
the size of matrices and vectors involved.
2.6 Probability and Statistics
One way or another, machine learning is all about uncertainty. In supervised learning, we want
to predict something unknown (the target) given something known (the features). Depending on
our objective, we might attempt to predict the most likely value of the target. Or we might predict
the value with the smallest expected distance from the target. And sometimes we wish not only
to predict a specific value but to quantify our uncertainty. For example, given some features describing
a patient, we might want to know how likely they are to suffer a heart attack in the next
year. In unsupervised learning, we often care about uncertainty. To determine whether a set of
measurements are anomalous, it helps to know how likely one is to observe values in a population
of interest. Moreover, in reinforcement learning, we wish to develop agents that act intelligently
in various environments. This requires reasoning about how an environment might be expected
to change and what rewards one might expect to encounter in response to each of the available
actions.
Probability is the mathematical field concerned with reasoning under uncertainty. Given a probabilistic
model of some process, we can reason about the likelihood of various events. The use
of probabilities to describe the frequencies of repeatable events (like coin tosses) is fairly uncontroversial.
In fact, frequentist scholars adhere to an interpretation of probability that applies only
to such repeatable events. By contrast Bayesian scholars use the language of probability more
broadly to formalize our reasoning under uncertainty. Bayesian probability is characterized by
two unique features: (i) assigning degrees of belief to non-repeatable events, e.g., what is the
probability that the moon is made out of cheese?; and (ii) subjectivity—while Bayesian probability
provides unambiguous rules for how one should update their beliefs in light of new evidence, it
allows for different individuals to start off with different prior beliefs. Statistics helps us to reason
backwards, starting off with collection and organization of data and backing out to what inferences
we might draw about the process that generated the data. Whenever we analyze a dataset,
hunting for patterns that we hope might characterize a broader population, we are employing statistical
thinking. Entire courses, majors, theses, careers, departments, companies, and institutions
have been devoted to the study of probability and statistics. While this section only scratches the
surface, we will provide the foundation that you need to begin building models.
2.6.1 A Simple Example: Tossing Coins
Imagine that we plan to toss a coin and want to quantify how likely we are to see heads (vs. tails).
If the coin is fair, then both outcomes (heads and tails) are equally likely. Moreover, if we plan
to toss the coin $n$ times then the fraction of heads that we expect to see should exactly match the
expected fraction of tails. One intuitive way to see this is by symmetry: for every possible outcome
with $n_h$ heads and $n_t = (n - n_h)$ tails, there is an equally likely outcome with $n_t$ heads and $n_h$
tails. Note that this is only possible if on average we expect to see 1/2 of tosses come up heads and
1/2 come up tails. Of course, if you conduct this experiment many times with $n = 1000000$ tosses
each, you might never see a trial where $n_h = n_t$ exactly.
Formally, the quantity 1/2 is called a probability and here it captures the certainty with which
any given toss will come up heads. Probabilities assign scores between 0 and 1 to outcomes of
interest, called events. Here the event of interest is heads and we denote the corresponding probability
P (heads). A probability of 1 indicates absolute certainty (imagine a trick coin where both
sides were heads) and a probability of 0 indicates impossibility (e.g., if both sides were tails). The
frequencies $n_h/n$ and $n_t/n$ are not probabilities but rather statistics. Probabilities are theoretical
quantities that underlie the data generating process. Here, the probability 1/2 is a property of
the coin itself. By contrast, statistics are empirical quantities that are computed as functions of
the observed data. Our interests in probabilistic and statistical quantities are inextricably intertwined.
We often design special statistics called estimators that, given a dataset, produce estimates
of model parameters like probabilities. Moreover, when those estimators satisfy a nice property
called consistency, our estimates will converge to the corresponding probability. In turn, these
inferred probabilities tell us about the likely statistical properties of data from the same population
that we might encounter in the future.
Suppose that we stumbled upon a real coin for which we did not know the true P (heads). To
investigate this quantity with statistical methods, we need to (i) collect some data; and (ii) design
an estimator. Data acquisition here is easy; we can toss the coin many times and record all of
the outcomes. Formally, drawing realizations from some underlying random process is called
sampling. As you might have guessed, one natural estimator is the ratio of the number
of observed heads to the total number of tosses.
%matplotlib inline
import random
import torch
from torch.distributions.multinomial import Multinomial
from d2l import torch as d2l  # needed for the plotting calls later in this section
Now, suppose that the coin was in fact fair, i.e., P (heads) = 0.5. To simulate tosses of a fair coin,
we can invoke any random number generator. There are some easy ways to draw samples of an event with
probability 0.5. For example, Pythonʼs random.random yields numbers in the interval [0, 1) where
the probability of lying in any sub-interval [a, b] ⊂ [0, 1) is equal to b − a. Thus we can get out 0 and
1 with probability 0.5 each by testing whether the returned float is greater than 0.5:
num_tosses = 100
heads = sum([random.random() > 0.5 for _ in range(num_tosses)])
tails = num_tosses - heads
print("heads, tails: ", [heads, tails])
heads, tails: [55, 45]
More generally, we can simulate multiple draws from any variable with a finite number of possible
outcomes (like the toss of a coin or roll of a die) by calling the multinomial function, setting the
first argument to the number of draws and the second as a list of probabilities associated with
each of the possible outcomes. To simulate 100 tosses of a fair coin, we assign probability vector
[0.5, 0.5], interpreting index 0 as heads and index 1 as tails. The function returns a vector with
length equal to the number of possible outcomes (here, 2), where the first component tells us the
number of occurrences of heads and the second component tells us the number of occurrences
of tails.
fair_probs = torch.tensor([0.5, 0.5])

Multinomial(100, fair_probs).sample()
tensor([44., 56.])
Each time you run this sampling process, you will receive a new random value that may differ from
the previous outcome. Dividing by the number of tosses gives us the frequency of each outcome
in our data. Note that these frequencies, like the probabilities that they are intended to estimate,
sum to 1.
Multinomial(100, fair_probs).sample() / 100
tensor([0.5000, 0.5000])
Here, even though our simulated coin is fair (we set the probabilities [0.5, 0.5] ourselves), the
counts of heads and tails may not be identical. Thatʼs because we only drew a finite number of
samples. If we didnʼt implement the simulation ourselves, and only saw the outcome, how would
we know if the coin were slightly unfair or if the possible deviation from 1/2 was just an artifact
of the small sample size? Letʼs see what happens when we simulate 10000 tosses.
counts = Multinomial(10000, fair_probs).sample()

counts / 10000
tensor([0.4986, 0.5014])
In general, for averages of repeated events (like coin tosses), as the number of repetitions grows,
our estimates are guaranteed to converge to the true underlying probabilities. The mathematical
statement of this phenomenon is called the law of large numbers, and the central limit theorem tells us that
in many situations, as the sample size $n$ grows, these errors should go down at a rate of $(1/\sqrt{n})$.
Letʼs get some more intuition by studying how our estimate evolves as we grow the number of
tosses from 1 to 10000.
counts = Multinomial(1, fair_probs).sample((10000,))

cum_counts = counts.cumsum(dim=0)
estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True)
estimates = estimates.numpy()
d2l.set_figsize((4.5, 3.5))
d2l.plt.plot(estimates[:, 0], label=("P(coin=heads)"))
d2l.plt.plot(estimates[:, 1], label=("P(coin=tails)"))
d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Samples')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();
Each solid curve corresponds to one of the two values of the coin and gives our estimated probability
that the coin turns up that value after each group of experiments. The dashed black line
gives the true underlying probability. As we get more data by conducting more experiments, the
curves converge towards the true probability. You might already begin to see the shape of some
of the more advanced questions that preoccupy statisticians: How quickly does this convergence
happen? If we had already tested many coins manufactured at the same plant, how might we
incorporate this information?
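As a rough illustration of the first question, the sketch below estimates how far the heads frequency typically lies from 0.5 for several sample sizes and compares it with 0.5/√n, the standard deviation of the frequency for a fair coin. It uses only the Python standard library. The helper name rms_error, the number of trials, and the seed are illustrative assumptions.

```python
import math
import random

def rms_error(n, trials, rng):
    # Root-mean-square deviation of the heads frequency from 0.5 over many trials
    total = 0.0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        total += (heads / n - 0.5) ** 2
    return math.sqrt(total / trials)

rng = random.Random(0)  # fixed seed so that the run is repeatable
for n in [100, 400, 1600]:
    theory = 0.5 / math.sqrt(n)  # standard deviation of the frequency for a fair coin
    print(f'n={n}, simulated={rms_error(n, 300, rng):.4f}, theory={theory:.4f}')
```

Each fourfold increase in n should roughly halve the typical error, which matches the 1/√n rate mentioned above.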
2.6.2 A More Formal Treatment
Weʼve already gotten pretty far: posing a probabilistic model, generating synthetic data, running
a statistical estimator, empirically assessing convergence, and reporting error metrics (checking
the deviation). However, to go much further, we will need to be more precise.
When dealing with randomness, we denote the set of possible outcomes S and call it the sample
space or outcome space. Here, each element is a distinct possible outcome. In the case of tossing
a single coin, S = {heads, tails}. For a single die, S = {1, 2, 3, 4, 5, 6}. When flipping two coins,
we have four possible outcomes: {(heads, heads), (heads, tails), (tails, heads), (tails, tails)}. Events
are subsets of the sample space. For instance, the event “the first coin toss comes up heads” corresponds
to the set {(heads, heads), (heads, tails)}. Whenever the outcome z of a random experiment
satisfies z ∈ A, then event A has occurred. For a single roll of a die, we could define the
events “seeing a 5” (A = {5}) and “seeing an odd number” (B = {1, 3, 5}). In this case, if the die
came up 5, we would say that both A and B occurred. On the other hand, if z = 3, then A did not
occur but B did.
A probability function maps events onto real values P : A ⊆ S → [0, 1]. The probability of an event
A in the given sample space S, denoted P (A), satisfies the following properties:
• The probability of any event A is a non-negative real number, i.e., P (A) ≥ 0;
• The probability of the entire sample space is 1, i.e., P (S) = 1;
• For any countable sequence of events $A_1, A_2, \ldots$ that are mutually exclusive ($A_i \cap A_j = \emptyset$ for
all $i \neq j$), the probability that any of them happens is equal to the sum of their individual
probabilities, i.e., $P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)$.
These axioms of probability theory, proposed by Kolmogorov (1933), can be applied to rapidly
derive a number of important consequences. For instance, it follows immediately that the probability
of any event A or its complement A′ occurring is 1 (because A ∪ A′ = S). We can also prove
that P (∅) = 0 because 1 = P (S ∪ S ′ ) = P (S ∪ ∅) = P (S) + P (∅) = 1 + P (∅). Consequently,
the probability of any event A and its complement A′ occurring simultaneously is P (A ∩ A′ ) = 0.
Informally, this tells us that impossible events have zero probability of occurring.
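As a minimal sketch of these axioms and their consequences, the snippet below builds the sample space of a single die and treats events as Python sets. It assumes a fair die (all outcomes equally likely); the helper name `prob` is ours and purely illustrative.

```python
from fractions import Fraction

# Sample space for one roll of a die; fairness is an assumption of this sketch.
S = frozenset({1, 2, 3, 4, 5, 6})

def prob(event):
    """Probability of an event (a subset of S) when all outcomes are equally likely."""
    assert event <= S, "events must be subsets of the sample space"
    return Fraction(len(event), len(S))

A = frozenset({5})          # "seeing a 5"
B = frozenset({1, 3, 5})    # "seeing an odd number"
B_complement = S - B
C = frozenset({2, 4})       # disjoint from A

print(prob(S))                           # 1  (normalization)
print(prob(B) + prob(B_complement))      # 1  (an event or its complement)
print(prob(B & B_complement))            # 0  (the impossible event)
print(prob(A | C) == prob(A) + prob(C))  # True (additivity for disjoint events)
```

Using exact fractions rather than floats keeps the identities free of rounding error.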
2.6.3 Random Variables
When we spoke about events like the roll of a die coming up odd or the first coin toss coming up
heads, we were invoking the idea of a random variable. Formally, random variables are mappings
from an underlying sample space to a set of (possibly many) values. You might wonder how a
random variable is different from the sample space, since both are collections of outcomes. Im-
portantly, random variables can be much coarser than the raw sample space. We can define a
binary random variable like “greater than 0.5” even when the underlying sample space is infinite,
e.g., the line segment between 0 and 1. Additionally, multiple random variables can share the
same underlying sample space. For example “whether my home alarm goes off” and “whether my
house was burglarized” are both binary random variables that share an underlying sample space.
Consequently, knowing the value taken by one random variable can tell us something about the
likely value of another random variable. Knowing that the alarm went off, we might suspect that
the house was likely burglarized.
Every value taken by a random variable corresponds to a subset of the underlying sample space.
Thus the occurrence where the random variable X takes value v, denoted by X = v, is an event
and P (X = v) denotes its probability. Sometimes this notation can get clunky, and we can abuse

notation when the context is clear. For example, we might use P (X) to refer broadly to the dis-
tribution of X, i.e., the function that tells us the probability that X takes any given value. Other
times we write expressions like P (X, Y ) = P (X)P (Y ), as a shorthand to express a statement that
is true for all of the values that the random variables X and Y can take, i.e., for all i, j it holds
that P (X = i and Y = j) = P (X = i)P (Y = j). Other times, we abuse notation by writing P (v)
when the random variable is clear from the context. Since an event in probability theory is a set of
outcomes from the sample space, we can specify a range of values for a random variable to take.
For example, P (1 ≤ X ≤ 3) denotes the probability of the event {1 ≤ X ≤ 3}.
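The idea that a random variable is a mapping from the sample space, and that statements such as X = v pick out events, can be sketched as follows. The example assumes two fair coin flips and defines X as the number of heads; the names `X` and `prob_X` are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two coin flips (assumed fair and independent).
S = list(product(["heads", "tails"], repeat=2))

def X(outcome):
    """Random variable: maps an outcome to the number of heads it contains."""
    return sum(1 for flip in outcome if flip == "heads")

def prob_X(predicate):
    """Probability of the event {z in S : predicate(X(z))}."""
    favorable = [z for z in S if predicate(X(z))]
    return Fraction(len(favorable), len(S))

print(prob_X(lambda v: v == 1))        # P(X = 1) -> 1/2
print(prob_X(lambda v: 1 <= v <= 2))   # P(1 <= X <= 2) -> 3/4
```

Note how X is coarser than the sample space: the outcomes (heads, tails) and (tails, heads) both map to the value 1.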
Note that there is a subtle difference between discrete random variables, like flips of a coin or
tosses of a die, and continuous ones, like the weight and the height of a person sampled at random
from the population. In the latter case we seldom really care about someoneʼs exact height. Moreover,
if we took precise enough measurements, we would find that no two people on the planet have
the exact same height. In fact, with fine enough measurements, you would never have the same
height when you wake up and when you go to sleep. Thereʼs little point in asking about the exact
probability that someone is 1.801392782910287192 meters tall. Instead, we typically care more
about being able to say whether someoneʼs height falls into a given interval, say between 1.79 and
1.81 meters. In these cases we work with probability densities. A height of exactly 1.80 meters
has zero probability, but nonzero density. To obtain the probability assigned to an interval, we must
take an integral of the density over that interval.
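To make the distinction between density and probability concrete, here is a small sketch that assumes heights follow a Gaussian density. The mean and standard deviation below are hypothetical placeholders, not real population statistics. The probability of an interval is computed once from the closed-form Gaussian CDF and once by numerically integrating the density.

```python
import math

MEAN, STD = 1.75, 0.07  # hypothetical parameters in meters, chosen for illustration

def density(x):
    """Gaussian probability density at x."""
    return math.exp(-0.5 * ((x - MEAN) / STD) ** 2) / (STD * math.sqrt(2 * math.pi))

def cdf(x):
    """Gaussian cumulative distribution function, P(X <= x)."""
    return 0.5 * (1 + math.erf((x - MEAN) / (STD * math.sqrt(2))))

def interval_prob(a, b):
    """P(a <= X <= b) via the CDF."""
    return cdf(b) - cdf(a)

def riemann(a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of the density over [a, b]."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width

print(density(1.80))               # nonzero density at a single point
print(interval_prob(1.80, 1.80))   # 0.0: a single point carries no probability
print(interval_prob(1.79, 1.81), riemann(1.79, 1.81))  # the two values agree closely
```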
2.6.4 Multiple Random Variables
You might have noticed that we couldnʼt even make it past the last section without making state-
ments involving interactions among multiple random variables (recall P (X, Y ) = P (X)P (Y )).
Most of machine learning is concerned with such relationships. Here, the sample space would be
the population of interest, say customers who transact with a business, photographs on the inter-
net, or proteins known to biologists. Each random variable would represent the (unknown) value
of a different attribute. Whenever we sample an individual from the population, we observe a
realization of each of the random variables. Because the values taken by random variables corre-
spond to subsets of the sample space that could be overlapping, partially overlapping, or entirely
disjoint, knowing the value taken by one random variable can cause us to update our beliefs about
what values of another random variable are likely. If a patient walks into a hospital and we ob-
serve that they are having trouble breathing and have lost their sense of smell, then we believe
that they are more likely to have COVID-19 than we might if they had no trouble breathing and a
perfectly ordinary sense of smell.
When working with multiple random variables, we can construct events corresponding to every
combination of values that the variables can jointly take. The probability function that assigns
probabilities to each of these combinations (e.g. A = a and B = b) is called the joint probabil-
ity function and simply returns the probability assigned to the intersection of the corresponding
subsets of the sample space. The joint probability assigned to the event where random variables
A and B take values a and b, respectively, is denoted P (A = a, B = b), where the comma indi-
cates “and”. Note that for any values a and b, it holds that P (A = a, B = b) ≤ P (A = a) and
P (A = a, B = b) ≤ P (B = b), since for A = a and B = b to happen, A = a has to happen
and B = b also has to happen. Interestingly, the joint probability tells us all that we can know
about these random variables in a probabilistic sense, and can be used to derive many other
useful quantities, including recovering the individual distributions P (A) and P (B). To recover
$P(A = a)$ we simply sum up $P(A = a, B = v)$ over all values v that the random variable B can
take: $P(A = a) = \sum_v P(A = a, B = v)$.
The ratio $\frac{P(A = a, B = b)}{P(A = a)} \leq 1$ turns out to be extremely important. It is called the conditional probability,
and is denoted via the “|” symbol, $P(B = b \mid A = a) = P(A = a, B = b)/P(A = a)$. It tells us
the new probability associated with the event B = b, once we condition on the fact A = a took
place. We can think of this conditional probability as restricting attention only to the subset of
the sample space associated with A = a and then renormalizing so that all probabilities sum to 1.
Conditional probabilities are in fact probabilities and thus respect all of the axioms, so long as we
condition all terms on the same event and thus restrict attention to the same sample space. For
instance, for disjoint events B and B ′ , we have that P (B ∪ B ′ |A = a) = P (B|A = a) + P (B ′ |A = a).
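The relationship between joint, marginal, and conditional probabilities can be sketched with a small lookup table. The joint distribution below over the alarm and burglary variables is entirely made up for illustration; only the mechanics (summing out a variable, then renormalizing) reflect the text.

```python
# Hypothetical joint distribution P(A = a, B = b); the numbers are invented
# for illustration and sum to 1.
joint = {
    ("alarm", "burglary"): 0.009,
    ("alarm", "no burglary"): 0.020,
    ("no alarm", "burglary"): 0.001,
    ("no alarm", "no burglary"): 0.970,
}

def marginal_A(a):
    """P(A = a): sum the joint over every value of B."""
    return sum(p for (a_, _), p in joint.items() if a_ == a)

def conditional_B_given_A(b, a):
    """P(B = b | A = a) = P(A = a, B = b) / P(A = a)."""
    return joint[(a, b)] / marginal_A(a)

print(marginal_A("alarm"))
print(conditional_B_given_A("burglary", "alarm"))
# Conditioning renormalizes: the conditional probabilities sum to 1.
print(sum(conditional_B_given_A(b, "alarm") for b in ("burglary", "no burglary")))
```

With these invented numbers, a burglary is rare overall but far more likely once we know that the alarm went off.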
Using the definition of conditional probabilities, we can derive the famous result called Bayes’
theorem. By construction, we have that P (A, B) = P (B|A)P (A) and P (A, B) = P (A|B)P (B).
Combining both equations yields P (B|A)P (A) = P (A|B)P (B) and hence
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}. \tag{2.6.1}$$
This simple equation has profound implications because it allows us to reverse the order of condi-
tioning. If we know how to estimate P (B|A), P (A), and P (B), then we can estimate P (A|B). We
often find it easier to estimate one term directly but not the other and Bayesʼ theorem can come
to the rescue here. For instance, if we know the prevalence of symptoms for a given disease, and
the overall prevalences of the disease and symptoms, respectively, we can determine how likely
someone is to have the disease based on their symptoms. In some cases we might not have direct
access to P (B), such as the prevalence of symptoms. In this case a simplified version of Bayesʼ
theorem comes in handy:
P (A|B) ∝ P (B|A)P (A). (2.6.2)

Since we know that $P(A \mid B)$ must be normalized to 1, i.e., $\sum_a P(A = a \mid B) = 1$, we can use it to
compute
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{\sum_a P(B \mid A = a) P(A = a)}. \tag{2.6.3}$$
In Bayesian statistics, we think of an observer as possessing some (subjective) prior beliefs about
the plausibility of the available hypotheses encoded in the prior P (H), and a likelihood function that
says how likely one is to observe any value of the collected evidence for each of the hypotheses
in the class P (E|H). Bayesʼ theorem is then interpreted as telling us how to update the initial
prior P (H) in light of the available evidence E to produce posterior beliefs $P(H \mid E) = \frac{P(E \mid H) P(H)}{P(E)}$.
Informally, this can be stated as “posterior equals prior times likelihood, divided by the evidence”.
Now, because the evidence P (E) is the same for all hypotheses, we can get away with simply
normalizing over the hypotheses.
Note that $\sum_a P(A = a \mid B) = 1$ also allows us to marginalize over random variables. That is, we can
drop variables from a joint distribution such as P (A, B). After all, we have that
$$\sum_a P(A = a, B) = P(B) \sum_a P(A = a \mid B) = P(B). \tag{2.6.4}$$
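The normalization trick can be sketched in a few lines: multiply prior by likelihood for every hypothesis, then divide by the sum, which plays the role of the evidence. The three coin biases and the uniform prior below are hypothetical choices for illustration, and the function name `posterior` is ours.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Return P(A = a | B) from P(A = a) and P(B | A = a) by normalizing over a."""
    unnormalized = {a: likelihood[a] * prior[a] for a in prior}
    evidence = sum(unnormalized.values())   # P(B), obtained by summing over a
    return {a: u / evidence for a, u in unnormalized.items()}

# Hypothetical setup: three candidate coins, each equally plausible a priori.
prior = {"bias 0.3": Fraction(1, 3), "bias 0.5": Fraction(1, 3), "bias 0.8": Fraction(1, 3)}
# Likelihood of the evidence "one toss came up heads" under each hypothesis.
likelihood = {"bias 0.3": Fraction(3, 10), "bias 0.5": Fraction(1, 2), "bias 0.8": Fraction(4, 5)}

post = posterior(prior, likelihood)
for hypothesis, p in post.items():
    print(hypothesis, p)        # 3/16, 5/16, 1/2
print(sum(post.values()))       # 1
```

Nothing in this computation required knowing the evidence term in advance; it falls out of the normalization.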
Independence is another fundamentally important concept that forms the backbone of many important
ideas in statistics. In short, two variables A and B are independent if conditioning on the value of A
does not cause any change to the probability distribution associated with B and vice versa. More
formally, independence, denoted A ⊥ B, requires that P (A|B) = P (A) and, consequently, that
P (A, B) = P (A|B)P (B) = P (A)P (B). Independence is often an appropriate assumption. For
example, if the random variable A represents the outcome from tossing one fair coin and the ran-
dom variable B represents the outcome from tossing another, then knowing whether A came up
heads should not influence the probability of B coming up heads.

Independence is especially useful when it holds among the successive draws of our data from
some underlying distribution (allowing us to make strong statistical conclusions) or when it holds
among various variables in our data, allowing us to work with simpler models that encode this in-
dependence structure. On the other hand, estimating the dependencies among random variables
is often the very aim of learning. We care to estimate the probability of disease given symptoms
specifically because we believe that diseases and symptoms are not independent.
Note that because conditional probabilities are proper probabilities, the concepts of indepen-
dence and dependence also apply to them. Two random variables A and B are conditionally in-
dependent given a third variable C if and only if P (A, B|C) = P (A|C)P (B|C). Interestingly, two
variables can be independent in general but become dependent when conditioning on a third.
This often occurs when the two random variables A and B correspond to causes of some third
variable C. For example, broken bones and lung cancer might be independent in the general
population but if we condition on being in the hospital then we might find that broken bones are
negatively correlated with lung cancer. Thatʼs because the broken bone explains away why some
person is in the hospital and thus lowers the probability that they have lung cancer.
And conversely, two dependent random variables can become independent upon conditioning on
a third. This often happens when two otherwise unrelated events have a common cause. Shoe size
and reading level are highly correlated among elementary school students, but this correlation
disappears if we condition on age.
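The "explaining away" effect described above can be checked by exact enumeration on a toy model. The sketch below makes strong simplifying assumptions that are ours, not the text's: broken bones and lung cancer each occur independently with probability 1/10, and a person is in the hospital exactly when they have at least one of the two conditions.

```python
from fractions import Fraction
from itertools import product

P_BONE = Fraction(1, 10)     # hypothetical base rate of a broken bone
P_CANCER = Fraction(1, 10)   # hypothetical base rate of lung cancer

# Joint distribution over (bone, cancer, hospital); hospital is a deterministic
# function of the two causes in this toy model.
joint = {}
for bone, cancer in product([0, 1], repeat=2):
    p = (P_BONE if bone else 1 - P_BONE) * (P_CANCER if cancer else 1 - P_CANCER)
    joint[(bone, cancer, int(bone or cancer))] = p

def prob(event):
    """Probability of the set of outcomes satisfying event(bone, cancer, hospital)."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

def cond(event, given):
    """Conditional probability P(event | given)."""
    return prob(lambda b, c, h: event(b, c, h) and given(b, c, h)) / prob(given)

has_cancer = lambda b, c, h: c == 1
print(cond(has_cancer, lambda b, c, h: b == 1))              # 1/10: independent in general
print(cond(has_cancer, lambda b, c, h: h == 1))              # 10/19: hospital raises suspicion
print(cond(has_cancer, lambda b, c, h: h == 1 and b == 1))   # 1/10: the bone explains it away
```

Within the hospital population, learning about the broken bone lowers the probability of cancer from 10/19 back to 1/10, even though the two causes are independent in the general population.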
2.6.5 An Example
Letʼs put our skills to the test. Assume that a doctor administers an HIV test to a patient. This test
is fairly accurate: it fails with only 1% probability when the patient is healthy, wrongly reporting
them as diseased. Moreover, it never fails to detect HIV if the patient actually has it. We use D1 ∈ {0, 1} to
indicate the diagnosis (0 if negative and 1 if positive) and H ∈ {0, 1} to denote the HIV status.
| Conditional probability | H = 1 | H = 0 |
|---|---|---|
| $P(D_1 = 1 \mid H)$ | 1 | 0.01 |
| $P(D_1 = 0 \mid H)$ | 0 | 0.99 |
Note that the column sums are all 1 (but the row sums are not), since they are conditional probabilities.
Letʼs compute the probability of the patient having HIV if the test comes back positive, i.e.,
P (H = 1|D1 = 1). Intuitively this is going to depend on how common the disease is, since it affects
the number of false alarms. Assume that the population is fairly healthy, e.g., P (H = 1) = 0.0015.
To apply Bayesʼ theorem, we need to apply marginalization to determine
$$\begin{aligned}
P(D_1 = 1) &= P(D_1 = 1, H = 0) + P(D_1 = 1, H = 1) \\
&= P(D_1 = 1 \mid H = 0) P(H = 0) + P(D_1 = 1 \mid H = 1) P(H = 1) \\
&= 0.011485.
\end{aligned} \tag{2.6.5}$$
This leads us to
$$P(H = 1 \mid D_1 = 1) = \frac{P(D_1 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1)} = 0.1306. \tag{2.6.6}$$
In other words, there is only a 13.06% chance that the patient actually has HIV, despite using a
very accurate test. As we can see, probability can be counterintuitive. What should a patient do
upon receiving such terrifying news? Likely, the patient would ask the physician to administer
another test to get clarity. The second test has different characteristics and it is not as good as the
first one.
| Conditional probability | H = 1 | H = 0 |
|---|---|---|
| $P(D_2 = 1 \mid H)$ | 0.98 | 0.03 |
| $P(D_2 = 0 \mid H)$ | 0.02 | 0.97 |
Unfortunately, the second test comes back positive, too. Letʼs calculate the requisite probabilities
to invoke Bayesʼ theorem by assuming conditional independence:
$$\begin{aligned}
P(D_1 = 1, D_2 = 1 \mid H = 0) &= P(D_1 = 1 \mid H = 0) P(D_2 = 1 \mid H = 0) = 0.0003, \\
P(D_1 = 1, D_2 = 1 \mid H = 1) &= P(D_1 = 1 \mid H = 1) P(D_2 = 1 \mid H = 1) = 0.98.
\end{aligned} \tag{2.6.7}$$
Now we can apply marginalization to obtain the probability that both tests come back positive:
$$\begin{aligned}
P(D_1 = 1, D_2 = 1) &= P(D_1 = 1, D_2 = 1, H = 0) + P(D_1 = 1, D_2 = 1, H = 1) \\
&= P(D_1 = 1, D_2 = 1 \mid H = 0) P(H = 0) + P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1) \\
&= 0.00176955.
\end{aligned} \tag{2.6.8}$$
Finally, the probability of the patient having HIV given both tests being positive is
$$P(H = 1 \mid D_1 = 1, D_2 = 1) = \frac{P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1, D_2 = 1)} = 0.8307. \tag{2.6.9}$$
That is, the second test allowed us to gain much higher confidence that not all is well. Despite the
second test being considerably less accurate than the first one, it still significantly improved our
estimate. The assumption of both tests being conditionally independent of each other was crucial
for our ability to generate a more accurate estimate. Take the extreme case where we run the same
test twice. In this situation we would expect the same outcome both times, hence no additional
insight is gained from running the same test again. The astute reader might have noticed that
the diagnosis behaved like a classifier hiding in plain sight where our ability to decide whether a
patient is healthy increases as we obtain more features (test outcomes).
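The calculations in this example can be reproduced with a few lines of code. The sketch below uses the numbers from the text: the prior P(H = 1) = 0.0015, a first test with a 1% false positive rate and no false negatives, and a second test with P(D2 = 1 | H = 1) = 0.98. The second test's false positive rate of 0.03 is inferred from the product 0.0003 in (2.6.7). The helper name `posterior_h1` is ours.

```python
# Quantities taken from the worked example.
prior = {1: 0.0015, 0: 1 - 0.0015}      # P(H = h)
p_d1_given_h = {1: 1.0, 0: 0.01}        # P(D1 = 1 | H = h)
p_d2_given_h = {1: 0.98, 0: 0.03}       # P(D2 = 1 | H = h); 0.03 inferred from (2.6.7)

def posterior_h1(likelihood):
    """P(H = 1 | evidence), where likelihood[h] = P(evidence | H = h)."""
    evidence = sum(likelihood[h] * prior[h] for h in prior)   # marginalization
    return likelihood[1] * prior[1] / evidence

# One positive test.
one_test = posterior_h1(p_d1_given_h)

# Two positive tests: conditional independence lets us multiply the
# per-test likelihoods for each value of H.
both_positive = {h: p_d1_given_h[h] * p_d2_given_h[h] for h in prior}
two_tests = posterior_h1(both_positive)

print(round(one_test, 4), round(two_tests, 4))   # 0.1306 0.8307
```

Changing the prior in this sketch shows how strongly the posterior after a single test depends on the prevalence of the disease.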
2.6.6 Expectations
Often, making decisions requires not just looking at the probabilities assigned to individual events
but composing them together into useful aggregates that can provide us with guidance. For ex-
ample, when random variables take continuous scalar values, we often care about knowing what
value to expect on average. This quantity is formally called an expectation. If we are making in-
vestments, the first quantity of interest might be the return we can expect, averaging over all the
possible outcomes (and weighting by the appropriate probabilities). For instance, say that with
50% probability, an investment might fail altogether, with 40% probability it might provide a 2×
return, and with 10% probability it might provide a 10× return. To calculate the expected return,
we sum over all returns, multiplying each by the probability that it will occur. This yields
the expectation 0.5 · 0 + 0.4 · 2 + 0.1 · 10 = 1.8. Hence the expected return is 1.8×.
In general, the expectation (or average) of the random variable X is defined as
$$E[X] = E_{x \sim P}[x] = \sum_{x} x P(X = x). \tag{2.6.10}$$

Likewise, for densities we obtain $E[X] = \int x \, dp(x)$. Sometimes we are interested in the expected
value of some function of x. We can calculate these expectations as
$$E_{x \sim P}[f(x)] = \sum_x f(x) P(x) \quad \textrm{and} \quad E_{x \sim P}[f(x)] = \int f(x) p(x) \, dx \tag{2.6.11}$$
for discrete probabilities and densities, respectively. Returning to the investment example from
above, f might be the utility (happiness) associated with the return. Behavioral economists have
long noted that people associate greater disutility with losing one dollar than the utility gained from
earning one dollar relative to their baseline. Moreover, the value of money tends to be sub-linear.
Possessing 100k dollars versus zero dollars can make the difference between paying the rent, eat-
ing well, and enjoying quality healthcare versus suffering through homelessness. On the other
hand, the gains due to possessing 200k versus 100k are less dramatic. Reasoning like this moti-
vates the cliché that “the utility of money is logarithmic”.
If the utility associated with a total loss were -1, and the utilities associated with returns of 1, 2,
and 10 were 1, 2 and 4, respectively, then the expected happiness of investing would be 0.5 · (−1) +
0.4 · 2 + 0.1 · 4 = 0.7 (an expected loss of utility of 30% relative to the utility of 1 obtained by simply keeping the money). If indeed this were your utility function,
you might be best off keeping the money in the bank.
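The two expectations in the investment example can be reproduced with a single helper that optionally applies a function f before averaging. The distribution and utility values are those given in the text; the function name `expectation` is ours.

```python
# Return multiple -> probability, as in the investment example.
returns = {0: 0.5, 2: 0.4, 10: 0.1}
# Utility associated with each return, as stated in the text.
utility = {0: -1, 2: 2, 10: 4}

def expectation(pmf, f=lambda x: x):
    """E[f(X)] for a discrete distribution given as {value: probability}."""
    return sum(f(x) * p for x, p in pmf.items())

expected_return = expectation(returns)                          # approximately 1.8
expected_utility = expectation(returns, lambda x: utility[x])   # approximately 0.7

print(expected_return, expected_utility)
```

The investment looks attractive in terms of expected return (1.8×), yet unattractive in terms of expected utility (0.7, compared with a utility of 1 for a return of 1).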
For financial decisions, we might also want to measure how risky an investment is. Here, we care
not just about the expected value but how much the actual values tend to vary relative to this value.
Note that we canʼt just take the expectation of the difference between the actual and expected
values. Thatʼs because the expectation of a difference is the difference of the expectations, and
thus E[X − E[X]] = E[X] − E[E[X]] = 0. However, we can look at the expectation of any non-
negative function of this difference. The variance of a random variable is calculated by looking at
the expected value of the squared deviations:
$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2. \tag{2.6.12}$$
Here the equality follows by expanding $(X - E[X])^2 = X^2 - 2XE[X] + E[X]^2$ and taking expectations
for each term. The square root of the variance is another useful quantity called the standard
deviation. While the variance and standard deviation convey the same information (either can be
calculated from the other), the standard deviation has the nice property that it is expressed in the
same units as the original quantity represented by the random variable.
Lastly, the variance of a function of a random variable is defined analogously as
$$\mathrm{Var}_{x \sim P}[f(x)] = E_{x \sim P}[f^2(x)] - E_{x \sim P}[f(x)]^2. \tag{2.6.13}$$
Returning to our investment example, we can now compute the variance of the investment. It
is given by $0.5 \cdot 0 + 0.4 \cdot 2^2 + 0.1 \cdot 10^2 - 1.8^2 = 8.36$. For all intents and purposes this is a risky
investment. Note that by mathematical convention mean and variance are often referenced as $\mu$
and $\sigma^2$. This is particularly common whenever we use them to parametrize a Gaussian distribution.
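As a quick check of (2.6.12), the following sketch computes the variance of the investment both from the definition and from the shortcut formula, and then takes the square root to obtain the standard deviation. Only the distribution comes from the text; the variable names are ours.

```python
import math

returns = {0: 0.5, 2: 0.4, 10: 0.1}   # return multiple -> probability

mean = sum(x * p for x, p in returns.items())

# Definition: expected squared deviation from the mean.
var_definition = sum((x - mean) ** 2 * p for x, p in returns.items())
# Shortcut: E[X^2] - E[X]^2.
var_shortcut = sum(x ** 2 * p for x, p in returns.items()) - mean ** 2

std = math.sqrt(var_definition)   # same units as the return itself

print(var_definition, var_shortcut)   # both approximately 8.36
print(std)                            # approximately 2.89
```

A standard deviation of roughly 2.89× around an expected return of 1.8× makes the riskiness of this investment tangible.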
In the same way as we introduced expectations and variance for scalar random variables, we can
do so for vector-valued ones. Expectations are easy, since we can apply them elementwise. For
instance, $\boldsymbol{\mu} \stackrel{\textrm{def}}{=} E_{\mathbf{x} \sim P}[\mathbf{x}]$ has coordinates $\mu_i = E_{\mathbf{x} \sim P}[x_i]$. Covariances are more complicated. We
resolve the problem by taking expectations of the outer product of the difference between random
variables and their mean.
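$$\boldsymbol{\Sigma} \stackrel{\textrm{def}}{=} \mathrm{Cov}_{\mathbf{x} \sim P}[\mathbf{x}] = E_{\mathbf{x} \sim P}\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top\right]. \tag{2.6.14}$$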
This matrix Σ is referred to as the covariance matrix. An easy way to see its effect is to consider
some vector v of the same size as x. It follows that
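$$\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v} = E_{\mathbf{x} \sim P}\left[\mathbf{v}^\top (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{v}\right] = \mathrm{Var}_{\mathbf{x} \sim P}[\mathbf{v}^\top \mathbf{x}]. \tag{2.6.15}$$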
As such, Σ allows us to compute the variance for any linear function of x by a simple matrix
multiplication. The off-diagonal elements tell us how correlated coordinates are: a value of 0
means no correlation, whereas a larger positive value means that they are more strongly correlated.
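The identity (2.6.15) can be verified numerically on a small example. The sketch below uses a hypothetical discrete distribution over four two-dimensional points and a hypothetical vector v, computes the covariance matrix from its definition, and compares the quadratic form with the variance of the projection computed directly. Plain Python lists are used so that the snippet has no dependencies.

```python
# Hypothetical discrete distribution over 2-D points: point -> probability.
pmf = {(0, 0): 0.4, (1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.1}
d = 2

# Mean vector, computed coordinate by coordinate.
mu = [sum(p * x[i] for x, p in pmf.items()) for i in range(d)]

# Covariance matrix: expectation of the outer product of the centered vector.
Sigma = [[sum(p * (x[i] - mu[i]) * (x[j] - mu[j]) for x, p in pmf.items())
          for j in range(d)] for i in range(d)]

v = [2.0, -1.0]   # an arbitrary direction, chosen for illustration

# Left-hand side: the quadratic form v^T Sigma v.
quad_form = sum(v[i] * Sigma[i][j] * v[j] for i in range(d) for j in range(d))

# Right-hand side: variance of the scalar random variable v^T x.
proj = {x: sum(v[i] * x[i] for i in range(d)) for x in pmf}
proj_mean = sum(pmf[x] * y for x, y in proj.items())
proj_var = sum(pmf[x] * (y - proj_mean) ** 2 for x, y in proj.items())

print(Sigma)                 # positive off-diagonal entries: the coordinates are correlated
print(quad_form, proj_var)   # both approximately 0.65
```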
2.6.7 Discussion
In machine learning, there are many things to be uncertain about! We can be uncertain about
the value of a label given an input. We can be uncertain about the estimated value of a parameter.
We can even be uncertain about whether data arriving at deployment is from the same
distribution as the training data.
By aleatoric uncertainty, we denote the uncertainty that is intrinsic to the problem, and due to
genuine randomness unaccounted for by the observed variables. By epistemic uncertainty, we de-
note uncertainty over a modelʼs parameters, the sort of uncertainty that we can hope to reduce
by collecting more data. We might have epistemic uncertainty concerning the probability that a
coin turns up heads, but even once we know this probability, we are left with aleatoric uncertainty
about the outcome of any future toss. No matter how long we watch someone tossing a fair coin,
we will never be more or less than 50% certain that the next toss will come up heads. These terms
originate in the literature on mechanical modeling (see, e.g., Der Kiureghian and Ditlevsen (2009) for a
review on this aspect of uncertainty quantification56). Itʼs worth noting that these terms constitute
a slight abuse of language. The term epistemic refers to anything concerning knowledge and thus
in the philosophical sense, all uncertainty is epistemic.
We saw that sampling data from some unknown probability distribution can provide us with in-
formation that can be used to estimate the parameters of the data generating distribution. That
said, the rate at which this is possible can be quite slow. In our coin tossing example (and many
others) we can do no better than to design estimators that converge at a rate of $1/\sqrt{n}$, where n is
the sample size (e.g., the number of tosses). This means that by going from 10 to 1000 observa-
tions (usually a very achievable task) we see a tenfold reduction of uncertainty, whereas the next
1000 observations help comparatively little, offering only a 1.41 times reduction. This is a persis-
tent feature of machine learning: while there are often easy gains, it takes a very large amount of
data, and often with it an enormous amount of computation to make even further gains. For an
empirical review of this fact for large scale language models see (Revels et al., 2016).
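The diminishing returns implied by this rate can be made concrete for the coin example. The sketch below assumes a fair coin and uses the standard error of the sample mean of n Bernoulli draws, sqrt(p(1 - p)/n), as the measure of uncertainty; the function name is ours.

```python
import math

def standard_error(n, p=0.5):
    """Standard deviation of the fraction of heads in n independent tosses."""
    return math.sqrt(p * (1 - p) / n)

for n in (10, 1000, 2000):
    print(n, standard_error(n))

# Going from 10 to 1000 tosses shrinks the uncertainty tenfold ...
print(standard_error(10) / standard_error(1000))
# ... while the next 1000 tosses only buy a factor of sqrt(2), about 1.41.
print(standard_error(1000) / standard_error(2000))
```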
We also sharpened our language and tools for statistical modeling. In the process,
we learned about conditional probabilities and about one of the most important equations in
statistics—Bayesʼ theorem. It is an effective tool for decoupling information conveyed by data
through a likelihood term P (B|A) that addresses how well observations B match a choice of pa-
rameters A, and a prior probability P (A) which governs how plausible a particular choice of A
was in the first place. In particular, we saw how this rule can be applied to assign probabilities to
diagnoses, based on the efficacy of the test and the prevalence of the disease itself (i.e., our prior).
56 https://en.wikipedia.org/wiki/Uncertainty_quantification
Lastly, we introduced a first set of nontrivial questions about the effect of a specific probability
distribution, namely expectations and variances. While there are many more than just linear and
quadratic expectations for a probability distribution, these two already provide a good deal of
knowledge about the possible behavior of the distribution. For instance, Chebyshevʼs inequality57
states that $P(|X - \mu| \geq k\sigma) \leq 1/k^2$, where $\mu$ is the expectation, $\sigma^2$ is the variance of the distribution,
and k > 1 is a confidence parameter of our choosing. It tells us that draws from a distribution
lie with at least 50% probability within a $[-\sqrt{2}\sigma, \sqrt{2}\sigma]$ interval centered on the expectation.
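Chebyshevʼs inequality can be checked exactly on a small discrete distribution. The sketch below assumes a fair six-sided die and compares the true tail probability with the bound 1/k² for a few values of k; the helper name `tail_prob` is ours.

```python
import math
from fractions import Fraction

# Fair six-sided die (fairness is an assumption of this sketch).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mu = sum(x * p for x, p in pmf.items())                # 7/2
var = sum((x - mu) ** 2 * p for x, p in pmf.items())   # 35/12
sigma = math.sqrt(var)

def tail_prob(k):
    """Exact P(|X - mu| >= k * sigma) for the die."""
    return sum(p for x, p in pmf.items() if abs(x - mu) >= k * sigma)

for k in (math.sqrt(2), 2, 3):
    print(f"k={k:.3f}  tail={float(tail_prob(k)):.3f}  bound={1 / k ** 2:.3f}")
```

For k equal to the square root of 2, only the outcomes 1 and 6 fall outside the interval, so the tail probability is 1/3, comfortably below the bound of 1/2. The bound is loose here, which is typical: it holds for every distribution with finite variance.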
Exercises
1. Give an example where observing more data can reduce the amount of uncertainty about
the outcome to an arbitrarily low level.
2. Give an example where observing more data will only reduce the amount of uncertainty up
to a point and then no further. Explain why this is the case and where you expect this point
to occur.
3. We empirically demonstrated convergence to the mean for the toss of a coin. Calculate the
variance of the estimate of the probability that we see a head after drawing n samples.
   1. How does the variance scale with the number of observations?
   2. Use Chebyshevʼs inequality to bound the deviation from the expectation.
   3. How does it relate to the central limit theorem?
4. Assume that we draw n samples $x_i$ from a probability distribution with zero mean and unit
variance. Compute the averages $z_m \stackrel{\textrm{def}}{=} m^{-1} \sum_{i=1}^m x_i$. Can we apply Chebyshevʼs inequality
for every $z_m$ independently? Why not?
5. Given two events with probability P (A) and P (B), compute upper and lower bounds on
P (A ∪ B) and P (A ∩ B). Hint: graph the situation using a Venn diagram58 .
6. Assume that we have a sequence of random variables, say A, B, and C, where B only depends
on A, and C only depends on B. Can you simplify the joint probability P (A, B, C)?
Hint: this is a Markov chain59.
7. In Section 2.6.5, assume that the outcomes of the two tests are not independent. In particular
assume that either test on its own has a false positive rate of 10% and a false negative rate of
1%. That is, assume that P (D = 1|H = 0) = 0.1 and that P (D = 0|H = 1) = 0.01. Moreover,
assume that for H = 1 (infected) the test outcomes are conditionally independent, i.e., that
P (D1 , D2 |H = 1) = P (D1 |H = 1)P (D2 |H = 1) but that for healthy patients the outcomes
are coupled via P (D1 = D2 = 1|H = 0) = 0.02.
   1. Work out the joint probability table for D1 and D2 , given H = 0 based on the information
      you have so far.
   2. Derive the probability of the patient being positive (H = 1) after one test returns positive.
      You can assume the same baseline probability P (H = 1) = 0.0015 as before.
   3. Derive the probability of the patient being positive (H = 1) after both tests return positive.
8. Assume that you are an asset manager for an investment bank and you have a choice of
stocks si to invest in. Your portfolio needs to add up to 1 with weights αi for each stock. The
stocks have an average return µ = Es∼P [s] and covariance Σ = Covs∼P [s].
   1. Compute the expected return for a given portfolio α.
   2. If you wanted to maximize the return of the portfolio, how should you choose your
      investment?
   3. Compute the variance of the portfolio.
   4. Formulate an optimization problem of maximizing the return while keeping the variance
      constrained to an upper bound. This is the Nobel Prize-winning Markowitz portfolio60
      (Mangram, 2013). To solve it you will need a quadratic programming solver,
      something way beyond the scope of this book.
57 https://en.wikipedia.org/wiki/Chebyshev%27s_inequality
58 https://en.wikipedia.org/wiki/Venn_diagram
59 https://en.wikipedia.org/wiki/Markov_chain
2.7 Documentation
While we cannot possibly introduce every single PyTorch function and class (and the information
might become outdated quickly), the API documentation62 and additional tutorials63 and examples
provide such documentation. This section provides some guidance for how to explore the PyTorch
API.
2.7.1 Functions and Classes in a Module
In order to know which functions and classes can be called in a module, we invoke the dir func-
tion. For instance, we can query all properties in the module for generating random numbers:
import torch
print(dir(torch.distributions))
['AbsTransform', 'AffineTransform', 'Bernoulli', 'Beta', 'Binomial', 'CatTransform',
'Categorical', 'Cauchy', 'Chi2', 'ComposeTransform', 'ContinuousBernoulli',
'CorrCholeskyTransform', 'CumulativeDistributionTransform', 'Dirichlet', 'Distribution',
'ExpTransform', 'Exponential', 'ExponentialFamily', 'FisherSnedecor', 'Gamma', 'Geometric',
'Gumbel', 'HalfCauchy', 'HalfNormal', 'Independent', 'IndependentTransform', 'Kumaraswamy',
'LKJCholesky', 'Laplace', 'LogNormal', 'LogisticNormal', 'LowRankMultivariateNormal',
'LowerCholeskyTransform', 'MixtureSameFamily', 'Multinomial', 'MultivariateNormal',
'NegativeBinomial', 'Normal', 'OneHotCategorical', 'OneHotCategoricalStraightThrough',
'Pareto', 'Poisson', 'PowerTransform', 'RelaxedBernoulli', 'RelaxedOneHotCategorical',
'ReshapeTransform', 'SigmoidTransform', 'SoftmaxTransform', 'SoftplusTransform',
'StackTransform', 'StickBreakingTransform', 'StudentT', 'TanhTransform', 'Transform',
'TransformedDistribution', 'Uniform', 'VonMises', 'Weibull', 'Wishart', '__all__',
'__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__',
'__path__', '__spec__', 'bernoulli', 'beta', 'biject_to', 'binomial', 'categorical',
'cauchy', 'chi2', 'constraint_registry', 'constraints', 'continuous_bernoulli', 'dirichlet',
'distribution', 'exp_family', 'exponential', 'fishersnedecor', 'gamma', 'geometric',
'gumbel', 'half_cauchy', 'half_normal', 'identity_transform', 'independent', 'kl',
'kl_divergence', 'kumaraswamy', 'laplace', 'lkj_cholesky', 'log_normal', 'logistic_normal',
'lowrank_multivariate_normal', 'mixture_same_family', 'multinomial', 'multivariate_normal',
'negative_binomial', 'normal', 'one_hot_categorical', 'pareto', 'poisson', 'register_kl',
'relaxed_bernoulli', 'relaxed_categorical', 'studentT', 'transform_to',
'transformed_distribution', 'transforms', 'uniform', 'utils', 'von_mises', 'weibull', 'wishart']
60 https://en.wikipedia.org/wiki/Markowitz_model
62 https://pytorch.org/docs/stable/index.html
63 https://pytorch.org/tutorials/beginner/basics/intro.html
Generally, we can ignore functions that start and end with __ (special objects in Python) or functions
that start with a single _ (usually internal functions). Based on the remaining function or
attribute names, we might hazard a guess that this module offers various methods for generating
random numbers, including sampling from the uniform distribution (uniform), normal distribu-
tion (normal), and multinomial distribution (multinomial).
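The filtering rule just described can be sketched as a short list comprehension. To keep the snippet runnable without PyTorch, it inspects Pythonʼs standard-library `random` module as a stand-in; the same pattern should apply unchanged to `torch.distributions`.

```python
import random  # standard-library stand-in for torch.distributions in this sketch

names = dir(random)

# Special objects: names that start and end with a double underscore.
dunder = [n for n in names if n.startswith("__") and n.endswith("__")]
# Internal helpers: names with a leading underscore that are not special objects.
internal = [n for n in names if n.startswith("_") and n not in dunder]
# Everything else is the public interface we usually care about.
public = [n for n in names if not n.startswith("_")]

print(len(names), len(dunder), len(internal), len(public))
print(public[:10])
```

Scanning `public` rather than the raw `dir` output makes it much easier to spot candidates such as `uniform`.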
2.7.2 Specific Functions and Classes
For more specific instructions on how to use a given function or class, we can invoke the help
function. As an example, letʼs explore the usage instructions for tensorsʼ ones function.
help(torch.ones)
Help on built-in function ones in module torch:
ones(...)
    ones(*size, *, out=None, dtype=None, layout=torch.strided, device=None, requires_grad=False) -> Tensor
    Returns a tensor filled with the scalar value 1, with the shape defined
    by the variable argument size.
    Args:
        size (int...): a sequence of integers defining the shape of the output tensor.
            Can be a variable number of arguments or a collection like a list or tuple.
    Keyword arguments:
        out (Tensor, optional): the output tensor.
        dtype (torch.dtype, optional): the desired data type of returned tensor.
            Default: if None, uses a global default (see torch.set_default_tensor_type()).
        layout (torch.layout, optional): the desired layout of returned Tensor.
            Default: torch.strided.
        device (torch.device, optional): the desired device of returned tensor.
            Default: if None, uses the current device for the default tensor type
            (see torch.set_default_tensor_type()). device will be the CPU
            for CPU tensor types and the current CUDA device for CUDA tensor types.
        requires_grad (bool, optional): If autograd should record operations on the
            returned tensor. Default: False.
    Example::
        >>> torch.ones(2, 3)
        tensor([[ 1., 1., 1.],
                [ 1., 1., 1.]])
        >>> torch.ones(5)
        tensor([ 1., 1., 1., 1., 1.])
From the documentation, we can see that the ones function creates a new tensor with the specified
shape and sets all the elements to the value of 1. Whenever possible, you should run a quick test
to confirm your interpretation:
torch.ones(4)
tensor([1., 1., 1., 1.])
In the Jupyter notebook, we can use ? to display the document in another window. For example,
list? will create content that is almost identical to help(list), displaying it in a new browser
window. In addition, if we use two question marks, such as list??, the Python code implementing
the function will also be displayed.
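Outside of Jupyter, similar information is available programmatically through Pythonʼs standard `inspect` module. The sketch below retrieves the signature and docstring of `statistics.mean`, a standard-library function chosen here only so that the snippet runs without PyTorch, and then performs the kind of quick test recommended above.

```python
import inspect
import statistics

# The call signature, similar to the first line that help() displays.
sig = inspect.signature(statistics.mean)
# The cleaned-up docstring, i.e., the body of the help() output.
doc = inspect.getdoc(statistics.mean)

print(f"mean{sig}")
print(doc.splitlines()[0])

# A quick test to confirm our reading of the documentation.
print(statistics.mean([1, 2, 3, 4]))   # 2.5
```

Note that `inspect.signature` may not work for every built-in or extension function, in which case `help` remains the more reliable tool.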
The official documentation provides plenty of descriptions and examples that are beyond this
book. Our emphasis lies on covering important use cases that will allow you to get started quickly
with practical problems, rather than completeness of coverage. We also encourage you to study
the source code of the libraries to see examples of high quality implementations for production
code. By doing this you will become a better engineer in addition to becoming a better scientist.
3 Linear Neural Networks for Regression
Before we worry about making our neural networks deep, it will be helpful to implement some
shallow neural networks, for which the inputs connect directly to the outputs. This will prove
important for a few reasons. First, rather than getting distracted by complicated architectures,
we can focus on the basics of neural network training, including parameterizing the output layer,
handling data, specifying a loss function, and training the model. Second, this class of shallow
networks happens to comprise the set of linear models, which subsumes many classical methods
for statistical prediction, including linear and softmax regression. Understanding these classical
tools is pivotal because they are widely used in many contexts and we will often need to use them
as baselines when justifying the use of fancier architectures. This chapter will focus narrowly on
linear regression and the subsequent chapter will extend our modeling repertoire by developing
linear neural networks for classification.
3.1 Linear Regression
Regression problems pop up whenever we want to predict a numerical value. Common examples
include predicting prices (of homes, stocks, etc.), predicting the length of stay (for patients in
the hospital), forecasting demand (for retail sales), among countless others. Not every predic-
tion problem is a classic regression problem. Later on, we will introduce classification problems,
where the goal is to predict membership among a set of categories.
As a running example, suppose that we wish to estimate the prices of houses (in dollars) based on
their area (in square feet) and age (in years). To develop a model for predicting house prices, we
need to get our hands on data consisting of sales, including the sales price, area, and age for each
home. In the terminology of machine learning, the dataset is called a training dataset or training
set, and each row (containing the data corresponding to one sale) is called an example (or data
point, instance, sample). The thing we are trying to predict (price) is called a label (or target). The
variables (age and area) upon which the predictions are based are called features (or covariates).
3.1.1 Basics
Linear regression may be both the simplest and most popular among the standard tools for tackling
regression problems. Dating back to the dawn of the 19th century (Gauss, 1809, Legendre, 1805),
linear regression flows from a few simple assumptions. First, we assume that the relationship
between features x and target y is approximately linear, i.e., that the conditional mean E[Y |X = x]
can be expressed as a weighted sum of the features x. This setup allows that the target value may
still deviate from its expected value on account of observation noise. Next, we can impose the
assumption that any such noise is well-behaved, following a Gaussian distribution. Typically, we
will use n to denote the number of examples in our dataset. We use superscripts to enumerate
samples and targets, and subscripts to index coordinates. More concretely, $\mathbf{x}^{(i)}$ denotes the i-th
sample and $x_j^{(i)}$ denotes its j-th coordinate.
Model
At the heart of every solution is a model that describes how features can be transformed into an
estimate of the target. The assumption of linearity means that the expected value of the target
(price) can be expressed as a weighted sum of the features (area and age):
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b. \tag{3.1.1}$$
Here $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and b is called a bias (or offset or intercept). The weights
determine the influence of each feature on our prediction. The bias determines the value of the
estimate when all features are zero. Even though we will never see any newly-built homes with
precisely zero area, we still need the bias because it allows us to express all linear functions of our
features (versus restricting us to lines that pass through the origin). Strictly speaking, (3.1.1) is an
affine transformation of input features, which is characterized by a linear transformation of features
via weighted sum, combined with a translation via the added bias. Given a dataset, our goal is to
choose the weights w and the bias b that, on average, make our modelʼs predictions fit the true
prices observed in the data as closely as possible.
In disciplines where it is common to focus on datasets with just a few features, explicitly ex-
pressing models long-form, as in (3.1.1), is common. In machine learning, we usually work with
high-dimensional datasets, where itʼs more convenient to employ compact linear algebra nota-
tion. When our inputs consist of d features, we can assign each an index (between 1 and d) and
express our prediction $\hat{y}$ (in general the “hat” symbol denotes an estimate) as
$$\hat{y} = w_1 x_1 + \cdots + w_d x_d + b. \tag{3.1.2}$$
Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express
our model compactly via the dot product between w and x:
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b. \tag{3.1.3}$$
In (3.1.3), the vector x corresponds to the features of a single example. We will often find it
convenient to refer to features of our entire dataset of n examples via the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$.
Here, X contains one row for every example and one column for every feature. For a collection of
features X, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix-vector product:
$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b, \tag{3.1.4}$$
where broadcasting (Section 2.1.4) is applied during the summation. Given features of a training
dataset X and corresponding (known) labels y, the goal of linear regression is to find the weight
vector w and the bias term b such that, given the features of a new example sampled from the same
distribution as X, the new example's label will (in expectation) be predicted with the lowest error.
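To make the notation concrete, here is a minimal sketch of (3.1.3) and (3.1.4) in PyTorch. The three
houses, the weights, and the bias are made-up numbers chosen purely for illustration; they are not
fitted to any data.

```python
import torch

# Hypothetical toy data: 3 houses, 2 features (area in sq. ft., age in years)
X = torch.tensor([[1000.0, 10.0],
                  [1500.0, 5.0],
                  [800.0, 30.0]])
w = torch.tensor([200.0, -1000.0])  # made-up weights, one per feature
b = torch.tensor(50000.0)           # made-up scalar bias

# Single example, as in (3.1.3): dot product plus bias
y_hat_single = torch.dot(w, X[0]) + b

# Whole dataset, as in (3.1.4): the scalar b is broadcast over all n rows
y_hat = X @ w + b

print(y_hat_single)
print(y_hat)
```

The matrix-vector product returns one prediction per row of X, and the first entry agrees with the
single-example computation.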
Even if we believe that the best model for predicting y given x is linear, we would not expect to
find a real-world dataset of n examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$ for all 1 ≤ i ≤ n. For
example, whatever instruments we use to observe the features X and labels y might suffer from a small
amount of measurement error. Thus, even when we are confident that the underlying relationship
is linear, we will incorporate a noise term to account for such errors.
Before we can go about searching for the best parameters (or model parameters) w and b, we will need
two more things: (i) a quality measure for some given model; and (ii) a procedure for updating the
model to improve its quality.
Loss Function
Naturally, fitting our model to the data requires that we agree on some measure of fitness (or, equiv-
alently, of unfitness). Loss functions quantify the distance between the real and predicted values of
the target. The loss will usually be a non-negative number where smaller values are better and
perfect predictions incur a loss of 0. For regression problems, the most common loss function is
squared error. When our prediction for an example i is $\hat{y}^{(i)}$ and the corresponding true label is
$y^{(i)}$, the squared error is given by:
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2. \tag{3.1.5}$$
The constant $\frac{1}{2}$ makes no real difference but proves to be notationally convenient, since it cancels
out when we take the derivative of the loss. Because the training dataset is given to us, and thus out
of our control, the empirical error is only a function of the model parameters. Below, we visualize
the fit of a linear regression model in a problem with one-dimensional inputs (Fig. 3.1.1).
Fig. 3.1.1: Fitting a linear regression model to one-dimensional data.
Note that large differences between estimates $\hat{y}^{(i)}$ and targets $y^{(i)}$ lead to even larger
contributions to the loss, due to its quadratic form. This can be a double-edged sword: while it
encourages the model to avoid large errors, it can also lead to excessive sensitivity to anomalous
data. To measure the quality of a model on the entire dataset of n examples, we simply average
(or equivalently, sum) the losses on the training set:
$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2. \tag{3.1.6}$$
When training the model, we want to find parameters $(\mathbf{w}^*, b^*)$ that minimize the total loss across
all training examples:
$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b} L(\mathbf{w}, b). \tag{3.1.7}$$
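The per-example loss (3.1.5) and its average over the dataset (3.1.6) can be sketched in a few lines.
The function name squared_loss and the four predictions and labels below are assumptions made for
this illustration only.

```python
import torch

def squared_loss(y_hat, y):
    """Per-example squared error, as in (3.1.5)."""
    return 0.5 * (y_hat - y) ** 2

# Made-up predictions and labels for n = 4 examples
y_hat = torch.tensor([2.0, 0.0, 1.0, 5.0])
y = torch.tensor([1.0, 0.0, 3.0, 5.0])

per_example = squared_loss(y_hat, y)  # 0.5, 0.0, 2.0, 0.0
L = per_example.mean()                # average over the dataset, as in (3.1.6)
print(per_example, L)
```

The third example has twice the error of the first but contributes four times the loss, which
illustrates how the quadratic form penalizes large deviations.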
Analytic Solution
Unlike most of the models that we will cover, linear regression presents us with a surprisingly
easy optimization problem. In particular, we can find the optimal parameters (as assessed on the
training data) analytically by applying a simple formula as follows. First, we can subsume the bias
b into the parameter w by appending a column to the design matrix consisting of all ones. Then
our prediction problem is to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. So long as the design matrix X has full rank
(no feature is linearly dependent on the others), there will be just one critical point on the
loss surface and it corresponds to the minimum of the loss over the entire domain. Taking the
derivative of the loss with respect to w and setting it equal to zero yields:
$$\partial_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = 2 \mathbf{X}^\top (\mathbf{X} \mathbf{w} - \mathbf{y}) = 0 \text{ and hence } \mathbf{X}^\top \mathbf{y} = \mathbf{X}^\top \mathbf{X} \mathbf{w}. \tag{3.1.8}$$
Solving for w provides us with the optimal solution for the optimization problem. Note that this
solution
$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \tag{3.1.9}$$
will only be unique when the matrix X⊤ X is invertible, i.e., when the columns of the design matrix
are linearly independent (Golub and Van Loan, 1996).
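As a small numerical check of the analytic solution, the sketch below generates synthetic data and
solves the normal equations with NumPy. The ground-truth parameters, noise level, and random seed
are arbitrary assumptions, and solving the linear system stands in for the explicit inverse in the
formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 2
true_w, true_b = np.array([2.0, -3.4]), 4.2  # arbitrary ground-truth parameters

X = rng.normal(size=(n, d))
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)  # small Gaussian noise

# Absorb the bias: append a column of ones to the design matrix
X1 = np.hstack([X, np.ones((n, 1))])

# Solve the normal equations X^T X w = X^T y, cf. (3.1.8);
# np.linalg.solve avoids forming the inverse in (3.1.9) explicitly
w_star = np.linalg.solve(X1.T @ X1, X1.T @ y)

# Cross-check against NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(w_star)
```

The last entry of w_star plays the role of the bias b, since it multiplies the appended column of
ones.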
While simple problems like linear regression may admit analytic solutions, you should not get
used to such good fortune. Although analytic solutions allow for nice mathematical analysis, the
requirement of an analytic solution is so restrictive that it would exclude almost all exciting aspects
of deep learning.
Minibatch Stochastic Gradient Descent
Fortunately, even in cases where we cannot solve the models analytically, we can still often train
models effectively in practice. Moreover, for many tasks, those difficult-to-optimize models turn
out to be so much better that figuring out how to train them ends up being well worth the trouble.
The key technique for optimizing nearly any deep learning model, and which we will call upon
throughout this book, consists of iteratively reducing the error by updating the parameters in the
direction that incrementally lowers the loss function. This algorithm is called gradient descent.
The most naive application of gradient descent consists of taking the derivative of the loss func-
tion, which is an average of the losses computed on every single example in the dataset. In prac-
tice, this can be extremely slow: we must pass over the entire dataset before making a single
update, even if the update steps might be very powerful (Liu and Nocedal, 1989). Even worse, if
there is a lot of redundancy in the training data, the benefit of a full update is even lower.
The other extreme is to consider only a single example at a time and to take update steps based
on one observation at a time. The resulting algorithm, stochastic gradient descent (SGD) can be an
effective strategy (Bottou, 2010), even for large datasets. Unfortunately, SGD has drawbacks, both
computational and statistical. One problem arises from the fact that processors are a lot faster
at multiplying and adding numbers than they are at moving data from main memory to processor
cache. It is up to an order of magnitude more efficient to perform a matrix-vector multiplication
than a corresponding number of vector-vector operations. This means that it can take a lot longer
to process one sample at a time compared to a full batch. A second problem is that some of the
layers, such as batch normalization (to be described in Section 8.5), only work well when we have
access to more than one observation at a time.
The solution to both problems is to pick an intermediate strategy: rather than taking a full batch
or only a single sample at a time, we take a minibatch of observations (Li et al., 2014). The specific
choice of the size of the said minibatch depends on many factors, such as the amount of memory,
the number of accelerators, the choice of layers, and the total dataset size. Despite all of that, a
number between 32 and 256, preferably a multiple of a large power of 2, is a good start. This leads
us to minibatch stochastic gradient descent.
In its most basic form, in each iteration t, we first randomly sample a minibatch $\mathcal{B}_t$ consisting of a
fixed number $|\mathcal{B}|$ of training examples. We then compute the derivative (gradient) of the average
loss on the minibatch with respect to the model parameters. Finally, we multiply the gradient by
a predetermined small positive value η, called the learning rate, and subtract the resulting term
from the current parameter values. We can express the update as follows:
$$(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{(\mathbf{w}, b)} l^{(i)}(\mathbf{w}, b). \tag{3.1.10}$$
In summary, minibatch SGD proceeds as follows: (i) initialize the values of the model parame-
ters, typically at random; (ii) iteratively sample random minibatches from the data, updating the
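parameters in the direction of the negative gradient. For quadratic losses and affine
transformations, this has a closed-form expansion:
$$\begin{aligned}
\mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right) \\
b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}_t} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right).
\end{aligned} \tag{3.1.11}$$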
Since we pick a minibatch B we need to normalize by its size |B|. Frequently minibatch size and
learning rate are user-defined. Such tunable parameters that are not updated in the training loop
are called hyperparameters. They can be tuned automatically by a number of techniques, such as
Bayesian optimization (Frazier, 2018). In the end, the quality of the solution is typically assessed
on a separate validation dataset (or validation set).
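The update (3.1.11) can be turned into a short training loop. The following is a minimal sketch on
synthetic data, not the implementation developed later in the book: the data-generating parameters,
the learning rate, the batch size, the number of epochs, and the zero initialization (instead of a
random one) are all assumptions chosen for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
true_w, true_b = np.array([2.0, -3.4]), 4.2
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + 0.01 * rng.normal(size=n)

w, b = np.zeros(d), 0.0        # initialization (zeros here, for simplicity)
eta, batch_size = 0.1, 32      # hyperparameters: learning rate and |B|

for epoch in range(10):
    perm = rng.permutation(n)  # reshuffle the data each epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        err = X[idx] @ w + b - y[idx]       # w^T x + b - y for each example
        grad_w = X[idx].T @ err / len(idx)  # minibatch average, cf. (3.1.11)
        grad_b = err.mean()
        w -= eta * grad_w
        b -= eta * grad_b

print(w, b)
```

Dividing by len(idx) rather than by batch_size keeps the normalization correct for the last
minibatch of an epoch, which may be smaller than the others.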
After training for some predetermined number of iterations (or until some other stopping
criterion is met), we record the estimated model parameters, denoted ŵ, b̂. Note that even if our
function is truly linear and noiseless, these parameters will not be the exact minimizers of the
loss, or even deterministic. Although the algorithm converges slowly towards the minimizers, it
typically cannot reach them exactly in a finite number of steps. Moreover, the minibatches B used
to update the parameters are chosen at random, which breaks determinism.
Linear regression happens to be a learning problem with a global minimum (whenever X is full
rank, or equivalently, whenever X⊤ X is invertible). However, the loss surfaces for deep networks
contain many saddle points and minima. Fortunately, we typically donʼt care about finding an ex-
act set of parameters but merely any set of parameters that leads to accurate predictions (and thus
low loss). In practice, deep learning practitioners seldom struggle to find parameters that mini-
mize the loss on training sets (Frankle and Carbin, 2018, Izmailov et al., 2018). The more formidable
task is to find parameters that lead to accurate predictions on previously unseen data, a challenge
called generalization. We return to these topics throughout the book.

Predictions
Given the model ŵ⊤ x + b̂, we can now make predictions for a new example, e.g., to predict the
sales price of a previously unseen house given its area x1 and age x2 . Deep learning practition-
ers have taken to calling the prediction phase inference but this is a bit of a misnomer—inference
refers broadly to any conclusion reached on the basis of evidence, including both the values of
the parameters and the likely label for an unseen instance. If anything, in the statistics literature
inference more often denotes parameter inference and this overloading of terminology creates
unnecessary confusion when deep learning practitioners talk to statisticians. In the following we
will stick to prediction whenever possible.
3.1.2 Vectorization for Speed
When training our models, we typically want to process whole minibatches of examples simulta-
neously. Doing this efficiently requires that we vectorize the calculations and leverage fast linear
algebra libraries rather than writing costly for-loops in Python.
%matplotlib inline
import math
import time
import numpy as np
import torch
from d2l import torch as d2l
To illustrate why this matters so much, we can consider two methods for adding vectors. To start,
we instantiate two 10,000-dimensional vectors containing all ones. In one method, we loop over
the vectors with a Python for-loop. In the other method, we rely on a single call to +.
n = 10000
a = torch.ones(n)
b = torch.ones(n)
Now we can benchmark the workloads. First, we add them, one coordinate at a time, using a
for-loop.
c = torch.zeros(n)
t = time.time()
for i in range(n):
    c[i] = a[i] + b[i]
f'{time.time() - t:.5f} sec'
'0.10193 sec'
Alternatively, we rely on the overloaded + operator to compute the elementwise sum.
t = time.time()
d = a + b
f'{time.time() - t:.5f} sec'

'0.00016 sec'
The second method is dramatically faster than the first. Vectorizing code often yields order-of-
magnitude speedups. Moreover, we push more of the mathematics to the library and need not
write as many calculations ourselves, reducing the potential for errors and increasing portability
of the code.
3.1.3 The Normal Distribution and Squared Loss
So far weʼve given a fairly functional motivation of the squared loss objective: the optimal param-
eters return the conditional expectation E[Y |X] whenever the underlying pattern is truly linear,
and the loss assigns outsize penalties for outliers. We can also provide a more formal motivation
for the squared loss objective by making probabilistic assumptions about the distribution of noise.
Linear regression was invented at the turn of the 19th century. While it has long been debated
whether Gauss or Legendre first thought up the idea, it was Gauss who also discovered the nor-
mal distribution (also called the Gaussian). It turns out that the normal distribution and linear
regression with squared loss share a deeper connection than common parentage.
To begin, recall that a normal distribution with mean µ and variance $\sigma^2$ (standard deviation σ) is
given as
$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right). \tag{3.1.12}$$
Below we define a function to compute the normal distribution.
def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 * (x - mu)**2 / sigma**2)
We can now visualize the normal distributions.
# Use numpy again for visualization
x = np.arange(-7, 7, 0.01)
# Mean and standard deviation pairs
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])

Note that changing the mean corresponds to a shift along the x-axis, and increasing the variance
spreads the distribution out, lowering its peak.
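Both observations can be checked numerically without a plot. The sketch below repeats the normal
function so that it runs on its own. The wider grid and the Riemann-sum approximation of the
integral are assumptions made for this check.

```python
import math
import numpy as np

def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 * (x - mu)**2 / sigma**2)

x = np.arange(-20, 20, 0.01)
for mu, sigma in [(0, 1), (0, 2), (3, 1)]:
    area = np.sum(normal(x, mu, sigma)) * 0.01  # Riemann-sum approximation of the integral
    peak = normal(mu, mu, sigma)                # density at the mean, 1 / sqrt(2 pi sigma^2)
    print(mu, sigma, round(area, 4), round(peak, 4))
```

Each density integrates to approximately 1, and doubling the standard deviation halves the peak
value, since the peak equals $1/\sqrt{2\pi\sigma^2}$.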
One way to motivate linear regression with squared loss is to assume that observations arise from
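noisy measurements, where the noise is normally distributed as follows:
$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2). \tag{3.1.13}$$
Thus, we can now write out the likelihood of seeing a particular y for a given x via
$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right). \tag{3.1.14}$$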
As such, the likelihood factorizes. According to the principle of maximum likelihood, the best values
of parameters w and b are those that maximize the likelihood of the entire dataset:
$$P(\mathbf{y} \mid \mathbf{X}) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}). \tag{3.1.15}$$
The equality follows since all pairs $(\mathbf{x}^{(i)}, y^{(i)})$ were drawn independently of each other.
Estimators chosen according to the principle of maximum likelihood are called maximum likelihood
estimators. While maximizing the product of many exponential functions might look difficult, we
can simplify things significantly, without changing the objective, by maximizing the logarithm of
the likelihood instead. For historical reasons, optimizations are more often expressed as
minimization rather than maximization. So, without changing anything, we can minimize the negative
log-likelihood, which we can express as follows:
$$-\log P(\mathbf{y} \mid \mathbf{X}) = \sum_{i=1}^n \left[\frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2\right]. \tag{3.1.16}$$
If we assume that σ is fixed, we can ignore the first term, because it does not depend on w or b. The
second term is identical to the squared error loss introduced earlier, except for the multiplicative
constant $\frac{1}{\sigma^2}$. Fortunately, the solution does not depend on σ either. It follows that minimizing the
mean squared error is equivalent to maximum likelihood estimation of a linear model under the
assumption of additive Gaussian noise.
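The equivalence can be illustrated numerically. In the sketch below, which uses synthetic data with
arbitrary assumed parameters, the helper nll evaluates the negative log-likelihood (3.1.16). For
several fixed values of σ, the least-squares solution attains a lower value than a perturbed
parameter vector.

```python
import numpy as np

def nll(w, b, X, y, sigma):
    """Negative log-likelihood (3.1.16) under additive Gaussian noise."""
    resid = y - X @ w - b
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, -3.4]) + 4.2 + 0.5 * rng.normal(size=n)

# Least-squares fit (bias absorbed via a column of ones)
X1 = np.hstack([X, np.ones((n, 1))])
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
w_ls, b_ls = theta[:2], theta[2]

# For any fixed sigma, perturbing the least-squares solution increases the NLL
for sigma in (0.1, 1.0, 10.0):
    base = nll(w_ls, b_ls, X, y, sigma)
    perturbed = nll(w_ls + 0.1, b_ls - 0.1, X, y, sigma)
    print(sigma, base < perturbed)
```

Changing σ rescales and shifts the objective but leaves the location of its minimum unchanged,
which is why the noise level can be ignored when fitting w and b.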

3.1.4 Linear Regression as a Neural Network
While linear models are not sufficiently rich to express the many complicated neural networks
that we will introduce in this book, neural networks are rich enough to subsume linear models
as neural networks in which every feature is represented by an input neuron, all of which are
connected directly to the output.
Fig. 3.1.2 depicts linear regression as a neural network. The diagram highlights the connectivity
pattern such as how each input is connected to the output, but not the specific values taken by the
weights or biases.
Fig. 3.1.2: Linear regression is a single-layer neural network.
The inputs are x1 , . . . , xd . We refer to d as the number of inputs or feature dimensionality in the input
layer. The output of the network is o1 . Because we are just trying to predict a single numerical
value, we have only one output neuron. Note that the input values are all given. There is just
a single computed neuron. In summary, we can think of linear regression as a single-layer fully
connected neural network. We will encounter networks with far more layers in future chapters.
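In PyTorch, such a single-layer fully connected network corresponds to one nn.Linear layer. The
sketch below is for illustration only, with a randomly initialized layer and a made-up minibatch.
It confirms that the layer computes the affine map from (3.1.4).

```python
import torch
from torch import nn

d = 2                  # number of inputs (feature dimensionality)
net = nn.Linear(d, 1)  # one fully connected layer with a single output neuron

X = torch.randn(5, d)  # a made-up minibatch of 5 examples
o = net(X)             # shape (5, 1): one output per example

# The layer computes exactly the affine map X w + b from (3.1.4)
manual = X @ net.weight.T + net.bias
print(o.shape, torch.allclose(o, manual))
```

Here net.weight has shape (1, d) and net.bias has shape (1,), matching one output neuron connected
to d inputs.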
Biology
Because linear regression predates computational neuroscience, it might seem anachronistic to

describe linear regression in terms of neural networks. Nonetheless, they were a natural place to
start when the cyberneticists and neurophysiologists Warren McCulloch and Walter Pitts began
to develop models of artificial neurons. Consider the cartoonish picture of a biological neuron in
Fig. 3.1.3, consisting of dendrites (input terminals), the nucleus (CPU), the axon (output wire), and
the axon terminals (output terminals), enabling connections to other neurons via synapses.
Fig. 3.1.3: The real neuron. (Labeled parts: dendrite, cell body, nucleus, axon, myelin sheath,
Schwann cell, node of Ranvier, axon terminal.)

Information $x_i$ arriving from other neurons (or environmental sensors) is received in the
dendrites. In particular, that information is weighted by synaptic weights $w_i$, determining the effect of
the inputs, e.g., activation or inhibition via the product $x_i w_i$. The weighted inputs arriving from
multiple sources are aggregated in the nucleus as a weighted sum $y = \sum_i x_i w_i + b$, possibly
subject to some nonlinear postprocessing via $\sigma(y)$. This information is then sent via the axon to the
axon terminals, where it reaches its destination (e.g., an actuator such as a muscle) or it is fed into
another neuron via its dendrites.
Certainly, the high-level idea that many such units could be combined with the right connectivity
and right learning algorithm, to produce far more interesting and complex behavior than any one
neuron alone could express owes to our study of real biological neural systems. At the same time,
most research in deep learning today draws inspiration from a much wider source. We invoke
Stuart Russell and Peter Norvig (Russell and Norvig, 2016) who pointed out that although airplanes
might have been inspired by birds, ornithology has not been the primary driver of aeronautics
innovation for some centuries. Likewise, inspiration in deep learning these days comes in equal
or greater measure from mathematics, linguistics, psychology, statistics, computer science, and
many other fields.
3.1.5 Summary
In this section, we introduced traditional linear regression, where the parameters of a linear func-
tion are chosen to minimize squared loss on the training set. We also motivated this choice of ob-
jective both via some practical considerations and through an interpretation of linear regression
as maximum likelihood estimation under an assumption of linearity and Gaussian noise. After
discussing both computational considerations and connections to statistics, we showed how such
linear models could be expressed as simple neural networks where the inputs are directly wired
to the output(s). While we will soon move past linear models altogether, they are sufficient to in-
troduce most of the components that all of our models require: parametric forms, differentiable
objectives, optimization via minibatch stochastic gradient descent, and ultimately, evaluation on
previously unseen data.
Exercises
1. Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant b such that
$\sum_i (x_i - b)^2$ is minimized.
1. Find an analytic solution for the optimal value of b.
2. How does this problem and its solution relate to the normal distribution?
3. What if we change the loss from $\sum_i (x_i - b)^2$ to $\sum_i |x_i - b|$? Can you find the optimal
solution for b?
2. Prove that the affine functions that can be expressed by $\mathbf{x}^\top \mathbf{w} + b$ are equivalent to linear
functions on $(\mathbf{x}, 1)$.
3. Assume that you want to find quadratic functions of x, i.e., $f(\mathbf{x}) = b + \sum_i w_i x_i + \sum_{j \leq i} w_{ij} x_i x_j$.
How would you formulate this in a deep network?
4. Recall that one of the conditions for the linear regression problem to be solvable was that
the design matrix X⊤ X has full rank.
1. What happens if this is not the case?

2. How could you fix it? What happens if you add a small amount of coordinate-wise in-
dependent Gaussian noise to all entries of X?
3. What is the expected value of the design matrix X⊤ X in this case?
4. What happens with stochastic gradient descent when X⊤ X doesnʼt have full rank?
5. Assume that the noise model governing the additive noise ϵ is the double exponential (Laplace)
distribution. That is, $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$.
1. Write out the negative log-likelihood of the data under the model − log P (y | X).
2. Can you find a closed form solution?
3. Suggest a minibatch stochastic gradient descent algorithm to solve this problem. What
could possibly go wrong (hint: what happens near the stationary point as we keep on
updating the parameters)? Can you fix this?
6. Assume that we want to design a neural network with two layers by composing two linear
layers. That is, the output of the first layer becomes the input of the second layer. Why would
such a naive composition not work?
7. What happens if you want to use regression for realistic price estimation of houses or stock
prices?
1. Show that the additive Gaussian noise assumption is not appropriate. Hint: can we have
negative prices? What about fluctuations?
2. Why would regression to the logarithm of the price be much better, i.e., y = log price?
3. What do you need to worry about when dealing with penny stocks, i.e., stocks with very
low prices? Hint: can you trade at all possible prices? Why is this a bigger problem for
cheap stocks?
4. For more information review the celebrated Black-Scholes model for option pricing
(Black and Scholes, 1973).
8. Suppose we want to use regression to estimate the number of apples sold in a grocery store.
1. What are the problems with a Gaussian additive noise model? Hint: you are selling
apples, not oil.
2. The Poisson distribution65 captures distributions over counts. It is given by
$p(k \mid \lambda) = \lambda^k e^{-\lambda}/k!$. Here λ is the rate function and k is the number of events you see. Prove that
λ is the expected value of counts k.
3. Design a loss function associated with the Poisson distribution.
4. Design a loss function for estimating log λ instead.
Discussions
65 https://en.wikipedia.org/wiki/Poisson_distribution
3.2 Object-Oriented Design for Implementation
In our introduction to linear regression, we walked through various components including the
data, the model, the loss function, and the optimization algorithm. Indeed, linear regression is
one of the simplest machine learning models. Training it, however, uses many of the same
components that other models in this book require. Therefore, before diving into the implementation
details it is worth designing some of the APIs used throughout this book. Treating components in
deep learning as objects, we can start by defining classes for these objects and their interactions.
This object-oriented design for implementation will greatly streamline the presentation and you
might even want to use it in your projects.
Inspired by open-source libraries such as PyTorch Lightning67 , on a high level we wish to have
three classes: (i) Module contains models, losses, and optimization methods; (ii) DataModule pro-
vides data loaders for training and validation; (iii) both classes are combined using the Trainer
class, which allows us to train models on a variety of hardware platforms. Most code in this book
adapts Module and DataModule. We will touch upon the Trainer class only when we discuss GPUs,
CPUs, parallel training, and optimization algorithms.
import time
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l
3.2.1 Utilities
We need a few utilities to simplify object-oriented programming in Jupyter notebooks. One of

the challenges is that class definitions tend to be fairly long blocks of code. Notebook readability
demands short code fragments, interspersed with explanations, a requirement incompatible with
the style of programming common for Python libraries. The first utility function allows us to
register functions as methods in a class after the class has been created. In fact, we can do so even
after weʼve created instances of the class! It allows us to split the implementation of a class into
multiple code blocks.
def add_to_class(Class): #@save
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper
Letʼs have a quick look at how to use it. We plan to implement a class A with a method do. Instead
of having code for both A and do in the same code block, we can first declare the class A and create
an instance a.
class A:
    def __init__(self):
        self.b = 1

a = A()
67 https://www.pytorchlightning.ai/
Next we define the method do as we normally would, but not in class Aʼs scope. Instead, we deco-
rate this method by add_to_class with class A as its argument. In doing so, the method is able to
access the member variables of A as we would expect if it had been defined as part of Aʼs definition.
Letʼs see what happens when we invoke it for the instance a.
@add_to_class(A)
def do(self):
    print('Class attribute "b" is', self.b)

a.do()
Class attribute "b" is 1
The second one is a utility class that saves all arguments in a classʼs __init__ method as class
attributes. This allows us to extend constructor call signatures implicitly without additional code.
class HyperParameters: #@save
    def save_hyperparameters(self, ignore=[]):
        raise NotImplementedError
We defer its implementation to Section 20.7. To use it, we define our class that inherits from
HyperParameters and calls save_hyperparameters in the __init__ method.
# Call the fully implemented HyperParameters class saved in d2l
class B(d2l.HyperParameters):
    def __init__(self, a, b, c):
        self.save_hyperparameters(ignore=['c'])
        print('self.a =', self.a, 'self.b =', self.b)
        print('There is no self.c =', not hasattr(self, 'c'))

b = B(a=1, b=2, c=3)
self.a = 1 self.b = 2
There is no self.c = True
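To give a feel for how such a utility might work, here is one possible sketch based on the standard
library's inspect module. The class name HyperParametersSketch is hypothetical, and the code is an
illustration under that assumption rather than the d2l implementation, which is presented in
Section 20.7.

```python
import inspect

class HyperParametersSketch:
    """Hypothetical stand-in for d2l.HyperParameters (not the d2l code)."""
    def save_hyperparameters(self, ignore=()):
        # Look one frame up: the caller is expected to be __init__
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        for name, value in local_vars.items():
            # Skip self, ignored names, and private names
            if name != 'self' and name not in ignore and not name.startswith('_'):
                setattr(self, name, value)

class B(HyperParametersSketch):
    def __init__(self, a, b, c):
        self.save_hyperparameters(ignore=['c'])

obj = B(a=1, b=2, c=3)
print(obj.a, obj.b, hasattr(obj, 'c'))
```

The sketch reads the local variables of the calling __init__ frame, so it only behaves as intended
when it is called directly from the constructor.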
The last utility allows us to plot experiment progress interactively while it is going on. In defer-
ence to the much more powerful (and complex) TensorBoard68 we name it ProgressBoard. The
implementation is deferred to Section 20.7. For now, letʼs simply see it in action.
The draw function plots a point (x, y) in the figure, with label specified in the legend. The op-
tional every_n smooths the line by only showing 1/n points in the figure. Their values are averaged
from the n neighbor points in the original figure.
class ProgressBoard(d2l.HyperParameters): #@save
    """Plot data points in animation."""
    def __init__(self, xlabel=None, ylabel=None, xlim=None,
                 ls=['-', '--', '-.', ':'], colors=['C0', 'C1', 'C2', 'C3'],
                 fig=None, axes=None, figsize=(3.5, 2.5), display=True):
        self.save_hyperparameters()

    def draw(self, x, y, label, every_n=1):
        raise NotImplementedError
68 https://www.tensorflow.org/tensorboard
In the following example, we draw sin and cos with a different smoothness. If you run this code
block, you will see the lines grow in animation.
board = d2l.ProgressBoard('x')
for x in np.arange(0, 10, 0.1):
    board.draw(x, np.sin(x), 'sin', every_n=2)
    board.draw(x, np.cos(x), 'cos', every_n=10)
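The effect of every_n can be sketched without any plotting. The helper below is hypothetical (smooth_every_n is our own name, and the real ProgressBoard may handle incomplete groups differently); it only illustrates the idea of replacing each group of n consecutive points by their average.

```python
def smooth_every_n(points, every_n):
    """Average consecutive groups of `every_n` (x, y) points.

    Points left over at the end that do not fill a whole group are dropped.
    """
    smoothed = []
    for start in range(0, len(points) - every_n + 1, every_n):
        group = points[start:start + every_n]
        mean_x = sum(p[0] for p in group) / every_n
        mean_y = sum(p[1] for p in group) / every_n
        smoothed.append((mean_x, mean_y))
    return smoothed


raw = [(0, 0.0), (1, 2.0), (2, 4.0), (3, 6.0)]
smoothed = smooth_every_n(raw, every_n=2)
print(smoothed)  # [(0.5, 1.0), (2.5, 5.0)]
```

A larger every_n yields fewer and smoother points, which is why the cos curve in the example above is drawn more coarsely than the sin curve.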
3.2.2 Models
The Module class is the base class of all models we will implement. At a minimum we need to
define three methods. The __init__ method stores the learnable parameters, the training_step
method accepts a data batch and returns the loss value, and the configure_optimizers method
returns the optimization method, or a list of them, that is used to update the learnable parameters.
Optionally we can define validation_step to report the evaluation measures. Sometimes we put the
code to compute the output into a separate forward method to make it more reusable.
class Module(nn.Module, d2l.HyperParameters):  #@save
    def __init__(self, plot_train_per_epoch=2, plot_valid_per_epoch=1):
        super().__init__()
        self.save_hyperparameters()
        self.board = ProgressBoard()

    def loss(self, y_hat, y):
        raise NotImplementedError

    def forward(self, X):
        assert hasattr(self, 'net'), 'Neural network is not defined'
        return self.net(X)

    def plot(self, key, value, train):
        """Plot a point in animation."""
        assert hasattr(self, 'trainer'), 'Trainer is not inited'
        self.board.xlabel = 'epoch'
        if train:
            x = self.trainer.train_batch_idx / \
                self.trainer.num_train_batches
            n = self.trainer.num_train_batches / \
                self.plot_train_per_epoch
        else:
            x = self.trainer.epoch + 1
            n = self.trainer.num_val_batches / \
                self.plot_valid_per_epoch
        self.board.draw(x, value.to(d2l.cpu()).detach().numpy(),
                        ('train_' if train else 'val_') + key,
                        every_n=int(n))

    def training_step(self, batch):
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('loss', l, train=True)
        return l

    def validation_step(self, batch):
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('loss', l, train=False)

    def configure_optimizers(self):
        raise NotImplementedError
You may notice that Module is a subclass of nn.Module, the base class of neural networks in Py-
Torch. It provides convenient features to handle neural networks. For example, if we define a
forward method, such as forward(self, X), then for an instance a we can invoke this function by
a(X). This works since it calls the forward method in the built-in __call__ method. You can find
more details and examples about nn.Module in Section 6.1.
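The dispatch from a(X) to forward can be mimicked in a few lines of plain Python. The sketch below is a deliberately simplified stand-in (TinyModule and Doubler are hypothetical names); the real nn.Module does considerably more inside __call__, such as running registered hooks.

```python
class TinyModule:
    """Simplified stand-in for nn.Module: calling the object calls forward."""

    def __call__(self, *args, **kwargs):
        # The only job of this sketch: delegate the call to forward
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class Doubler(TinyModule):
    def forward(self, x):
        return 2 * x


doubler = Doubler()
result = doubler(21)  # same as doubler.forward(21)
print(result)  # 42
```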
3.2.3 Data
The DataModule class is the base class for data. Quite frequently the __init__ method is used to
prepare the data. This includes downloading and preprocessing if needed. The train_dataloader
returns the data loader for the training dataset. A data loader is a (Python) generator that yields
a data batch each time it is used. This batch is then fed into the training_step method of Module
to compute the loss. There is an optional val_dataloader to return the validation dataset loader.
It behaves in the same manner, except that it yields data batches for the validation_step method
in Module.
class DataModule(d2l.HyperParameters):  #@save
    def __init__(self, root='../data', num_workers=4):
        self.save_hyperparameters()

    def get_dataloader(self, train):
        raise NotImplementedError

    def train_dataloader(self):
        return self.get_dataloader(train=True)

    def val_dataloader(self):
        return self.get_dataloader(train=False)
3.2.4 Training
The Trainer class trains the learnable parameters in the Module class with data specified in Data-
Module. The key method is fit, which accepts two arguments: model, an instance of Module, and
data, an instance of DataModule. It then iterates over the entire dataset max_epochs times to train
the model. As before, we will defer the implementation of this function to later chapters.
class Trainer(d2l.HyperParameters):  #@save
    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
        self.save_hyperparameters()
        assert num_gpus == 0, 'No GPU support yet'

    def prepare_data(self, data):
        self.train_dataloader = data.train_dataloader()
        self.val_dataloader = data.val_dataloader()
        self.num_train_batches = len(self.train_dataloader)
        self.num_val_batches = (len(self.val_dataloader)
                                if self.val_dataloader is not None else 0)

    def prepare_model(self, model):
        model.trainer = self
        model.board.xlim = [0, self.max_epochs]
        self.model = model

    def fit(self, model, data):
        self.prepare_data(data)
        self.prepare_model(model)
        self.optim = model.configure_optimizers()
        self.epoch = 0
        self.train_batch_idx = 0
        self.val_batch_idx = 0
        for self.epoch in range(self.max_epochs):
            self.fit_epoch()

    def fit_epoch(self):
        raise NotImplementedError
3.2.5 Summary
To highlight the object-oriented design for our future deep learning implementation, the above
classes just show how their objects store data and interact with each other. We will keep enriching
implementations of these classes, such as via @add_to_class, in the rest of the book. Moreover,
these fully implemented classes are saved in the d2l library69, a lightweight toolkit that makes
structured modeling for deep learning easy. In particular, it facilitates reusing many components
between projects without changing much at all. For instance, we can replace just the optimizer,
just the model, just the dataset, etc.; this degree of modularity pays dividends throughout the book
in terms of conciseness and simplicity (this is why we added it) and it can do the same for your
own projects.
69
https://github.com/d2l-ai/d2l-en/tree/master/d2l
Exercises
1. Locate full implementations of the above classes that are saved in the d2l library70 . We
strongly recommend that you look at the implementation in detail once you have gained
some more familiarity with deep learning modeling.
2. Remove the save_hyperparameters statement in the B class. Can you still print self.a and
self.b? Optional: if you have dived into the full implementation of the HyperParameters
class, can you explain why?
Discussions71
3.3 Synthetic Regression Data
Machine learning is all about extracting information from data. So you might wonder, what could
we possibly learn from synthetic data? While we might not care intrinsically about the patterns
that we ourselves baked into an artificial data generating model, such datasets are nevertheless
useful for didactic purposes, helping us to evaluate the properties of our learning algorithms and
to confirm that our implementations work as expected. For example, if we create data for which
the correct parameters are known a priori, then we can verify that our model can in fact recover
them.
%matplotlib inline
import random
import torch
from d2l import torch as d2l
3.3.1 Generating the Dataset
For this example, we will work in low dimensions for succinctness. The following code snippet
generates 1000 examples with 2-dimensional features drawn from a standard normal distribution.
The resulting design matrix X belongs to ℝ^{1000×2}. We generate each label by applying a ground
truth linear function and corrupting the result via additive noise ϵ, drawn independently and
identically for each example:
y = Xw + b + ϵ. (3.3.1)
For convenience we assume that ϵ is drawn from a normal distribution with mean µ = 0 and
standard deviation σ = 0.01. Note that for object-oriented design we add the code to the __init__
method of a subclass of d2l.DataModule (introduced in Section 3.2.3). Itʼs good practice to allow
setting any additional hyperparameters. We accomplish this with save_hyperparameters(). The
batch_size will be determined later on.

class SyntheticRegressionData(d2l.DataModule):  #@save
    def __init__(self, w, b, noise=0.01, num_train=1000, num_val=1000,
                 batch_size=32):
        super().__init__()
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, len(w))
        noise = torch.randn(n, 1) * noise
        self.y = torch.matmul(self.X, w.reshape((-1, 1))) + b + noise
Below, we set the true parameters to w = [2, −3.4]⊤ and b = 4.2. Later, we can check our estimated
parameters against these ground truth values.
data = SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
Each row in features consists of a vector in ℝ^2 and each row in labels is a scalar. Letʼs have a look
at the first entry.
print('features:', data.X[0],'\nlabel:', data.y[0])
features: tensor([0.0643, 1.0951])

label: tensor([0.6094])
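The generating process in (3.3.1) can also be written out example by example in plain Python, which makes the role of each term explicit. This is only a sketch for intuition: the function name synthetic_data and the seed argument are our own assumptions, and the tensor-based class above remains the version used in the rest of the chapter.

```python
import random


def synthetic_data(w, b, noise_std, num_examples, seed=0):
    """Generate labels y = w.x + b + eps with standard normal features x."""
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(num_examples):
        x = [rng.gauss(0.0, 1.0) for _ in w]   # one feature vector
        eps = rng.gauss(0.0, noise_std)        # additive label noise
        y = sum(wj * xj for wj, xj in zip(w, x)) + b + eps
        features.append(x)
        labels.append(y)
    return features, labels


true_w, true_b = [2.0, -3.4], 4.2
features, labels = synthetic_data(true_w, true_b, noise_std=0.01,
                                  num_examples=1000)
# The residual of each label with respect to the noiseless linear function
# is exactly the noise term, so it should be tiny
residuals = [y - (sum(wj * xj for wj, xj in zip(true_w, x)) + true_b)
             for x, y in zip(features, labels)]
print(len(features), len(features[0]), max(abs(r) for r in residuals))
```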
3.3.2 Reading the Dataset
Training machine learning models often requires multiple passes over a dataset, grabbing one
minibatch of examples at a time. This data is then used to update the model. To illustrate how this
works, we implement the get_dataloader function, registering it as a method in the SyntheticRe-
gressionData class via add_to_class (introduced in Section 3.2.1). It takes a batch size, a matrix
of features, and a vector of labels, and generates minibatches of size batch_size. As such, each
minibatch consists of a tuple of features and labels. Note that we need to be mindful of whether
weʼre in training or validation mode: in the former, we will want to read the data in random or-
der, whereas for the latter, being able to read data in a pre-defined order may be important for
debugging purposes.
@d2l.add_to_class(SyntheticRegressionData)
def get_dataloader(self, train):
    if train:
        indices = list(range(0, self.num_train))
        # The examples are read in random order
        random.shuffle(indices)
    else:
        indices = list(range(self.num_train, self.num_train+self.num_val))
    for i in range(0, len(indices), self.batch_size):
        batch_indices = torch.tensor(indices[i: i+self.batch_size])
        yield self.X[batch_indices], self.y[batch_indices]
To build some intuition, letʼs inspect the first minibatch of data. Each minibatch of features pro-
vides us with both its size and the dimensionality of input features. Likewise, our minibatch of
labels will have a matching shape given by batch_size.

X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
X shape: torch.Size([32, 2])

y shape: torch.Size([32, 1])
While seemingly innocuous, the invocation of iter(data.train_dataloader()) illustrates the
power of Pythonʼs object-oriented design. Note that we added a method to the SyntheticRegressionData
class after creating the data object. Nonetheless, the object benefits from the ex post facto
addition of functionality to the class.
Throughout the iteration we obtain distinct minibatches until the entire dataset has been ex-
hausted (try this). While the iteration implemented above is good for didactic purposes, it is
inefficient in ways that might get us in trouble on real problems. For example, it requires that
we load all the data in memory and that we perform lots of random memory access. The built-in
iterators implemented in a deep learning framework are considerably more efficient and they can
deal with sources such as data stored in files, data received via a stream, and data generated or
processed on the fly. Next letʼs try to implement the same function using built-in iterators.
3.3.3 Concise Implementation of the Data Loader
Rather than writing our own iterator, we can call the existing API in a framework to load data.
As before, we need a dataset with features X and labels y. Beyond that, we set batch_size in the
built-in data loader and let it take care of shuffling examples efficiently.
@d2l.add_to_class(d2l.DataModule)  #@save
def get_tensorloader(self, tensors, train, indices=slice(0, None)):
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, self.batch_size,
                                       shuffle=train)

@d2l.add_to_class(SyntheticRegressionData)  #@save
def get_dataloader(self, train):
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader((self.X, self.y), train, i)
The new data loader behaves just as the previous one, except that it is more efficient and has some
added functionality.
X, y = next(iter(data.train_dataloader()))
print('X shape:', X.shape, '\ny shape:', y.shape)
X shape: torch.Size([32, 2])

y shape: torch.Size([32, 1])
For instance, the data loader provided by the framework API supports the built-in __len__ method,
so we can query its length, i.e., the number of batches.
len(data.train_dataloader())
32
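The value 32 can be checked by hand from the defaults used above (num_train=1000 and batch_size=32). The short calculation below assumes that the final, smaller batch is kept rather than discarded.

```python
import math

num_train, batch_size = 1000, 32

num_batches = math.ceil(num_train / batch_size)               # batches per epoch
last_batch_size = num_train - (num_batches - 1) * batch_size  # size of the final batch
num_full_batches = num_train // batch_size                    # batches of full size

print(num_batches, last_batch_size, num_full_batches)  # 32 8 31
```

So 31 batches hold 32 examples each and the last one holds the remaining 8.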
3.3.4 Summary
Data loaders are a convenient way of abstracting out the process of loading and manipulating data.
This way the same machine learning algorithm is capable of processing many different types and
sources of data without the need for modification. One of the nice things about data loaders is that
they can be composed. For instance, we might be loading images and then have a post-processing
filter that crops them or modifies them otherwise. As such, data loaders can be used to describe
an entire data processing pipeline.
As for the model itself, the two-dimensional linear model is about as simple a model as we might
encounter. It lets us test out the accuracy of regression models without worrying about having
insufficient amounts of data or an underdetermined system of equations. We will put this to good use
in the next section.
Exercises
1. What will happen if the number of examples cannot be divided by the batch size? How can you
change this behavior by specifying a different argument using the frameworkʼs API?
2. What if we want to generate a huge dataset, where both the size of the parameter vector w
and the number of examples num_examples are large?
    1. What happens if we cannot hold all data in memory?
    2. How would you shuffle the data if data is held on disk? Your task is to design an efficient
    algorithm that does not require too many random reads or writes. Hint: pseudorandom
    permutation generators72 allow you to design a reshuffle without the need to store the
    permutation table explicitly (Naor and Reingold, 1999).
3. Implement a data generator that produces new data on the fly, every time the iterator is
called.
4. How would you design a random data generator that generates the same data each time itʼs
called?
Discussions73
3.4 Linear Regression Implementation from Scratch
Weʼre now ready to work through a fully functioning implementation of linear regression. In this
section, we will implement the entire method from scratch, including (i) the model; (ii) the loss
function; (iii) a minibatch stochastic gradient descent optimizer; and (iv) the training function that
stitches all of these pieces together. Finally, we will run our synthetic data generator from Section
3.3 and apply our model on the resulting dataset. While modern deep learning frameworks can
automate nearly all of this work, implementing things from scratch is the only way to make sure
that you really know what you are doing. Moreover, when it comes time to customize models,
defining our own layers or loss functions, understanding how things work under the hood will
prove handy. In this section, we will rely only on tensors and automatic differentiation. Later on,
we will introduce a more concise implementation, taking advantage of bells and whistles of deep
learning frameworks while retaining the structure of what follows below.
72
https://en.wikipedia.org/wiki/Pseudorandom_permutation
%matplotlib inline
import torch
from d2l import torch as d2l
3.4.1 Defining the Model
Before we can begin optimizing our modelʼs parameters by minibatch SGD, we need to have some
parameters in the first place. In the following we initialize weights by drawing random numbers
from a normal distribution with mean 0 and a standard deviation of 0.01. The magic number 0.01
often works well in practice, but you can specify a different value through the argument sigma.
Moreover we set the bias to 0. Note that for object-oriented design we add the code to the __init__
method of a subclass of d2l.Module (introduced in Section 3.2.2).
class LinearRegressionScratch(d2l.Module):  #@save
    def __init__(self, num_inputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        self.w = torch.normal(0, sigma, (num_inputs, 1), requires_grad=True)
        self.b = torch.zeros(1, requires_grad=True)
Next, we must define our model, relating its input and parameters to its output. For our linear
model we simply take the matrix-vector product of the input features X and the model weights
w, and add the offset b to each example. Xw is a vector and b is a scalar. Due to the broadcast-
ing mechanism (see Section 2.1.4), when we add a vector and a scalar, the scalar is added to each
component of the vector. The resulting forward function is registered as a method in the Linear-
RegressionScratch class via add_to_class (introduced in Section 3.2.1).
@d2l.add_to_class(LinearRegressionScratch)  #@save
def forward(self, X):
    """The linear regression model."""
    return torch.matmul(X, self.w) + self.b
3.4.2 Defining the Loss Function
Since updating our model requires taking the gradient of our loss function, we ought to define the
loss function first. Here we use the squared loss function in (3.1.5). In the implementation, we
need to transform the true value y into the predicted valueʼs shape y_hat. The result returned by
the following function will also have the same shape as y_hat. We also return the averaged loss
value among all examples in the minibatch.
@d2l.add_to_class(LinearRegressionScratch)  #@save
def loss(self, y_hat, y):
    l = (y_hat - y.reshape(y_hat.shape)) ** 2 / 2
    return l.mean()
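As a sanity check on the arithmetic, the same averaged squared loss can be evaluated by hand on a tiny minibatch. The sketch below uses plain Python lists instead of tensors, and the numbers are made up for illustration.

```python
def squared_loss(y_hat, y):
    """Mean over the minibatch of (y_hat - y)^2 / 2, using plain Python lists."""
    per_example = [(p - t) ** 2 / 2 for p, t in zip(y_hat, y)]
    return sum(per_example) / len(per_example)


# Errors are -0.5 and -2.0, so the per-example losses are 0.125 and 2.0
loss_value = squared_loss([1.0, 2.0], [1.5, 4.0])
print(loss_value)  # (0.125 + 2.0) / 2 = 1.0625
```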
3.4.3 Defining the Optimization Algorithm
As discussed in Section 3.1, linear regression has a closed-form solution. However, our goal here
is to illustrate how to train more general neural networks, and that requires that we teach you
how to use minibatch SGD. Hence we will take this opportunity to introduce your first working
example of SGD. At each step, using a minibatch randomly drawn from our dataset, we estimate
the gradient of the loss with respect to the parameters. Next, we update the parameters in the
direction that may reduce the loss.
The following code applies the update, given a set of parameters and a learning rate lr. Since our loss
is computed as an average over the minibatch, we donʼt need to adjust the learning rate against
the batch size. In later chapters we will investigate how learning rates should be adjusted for very
large minibatches as they arise in distributed large scale learning. For now, we can ignore this
dependency.
We define our SGD class, a subclass of d2l.HyperParameters (introduced in Section 3.2.1), to have
a similar API as the built-in SGD optimizer. We update the parameters in the step method. The
zero_grad method sets all gradients to 0 and must be run before each backpropagation step,
since gradients would otherwise accumulate across steps.
class SGD(d2l.HyperParameters):  #@save
    def __init__(self, params, lr):
        """Minibatch stochastic gradient descent."""
        self.save_hyperparameters()

    def step(self):
        for param in self.params:
            param -= self.lr * param.grad

    def zero_grad(self):
        for param in self.params:
            if param.grad is not None:
                param.grad.zero_()
We next define the configure_optimizers method, which returns an instance of the SGD class.
@d2l.add_to_class(LinearRegressionScratch)  #@save
def configure_optimizers(self):
    return SGD([self.w, self.b], self.lr)
3.4.4 Training
Now that we have all of the parts in place (parameters, loss function, model, and optimizer), we
are ready to implement the main training loop. It is crucial that you understand this code well
since you will employ similar training loops for every other deep learning model covered in this
book. In each epoch, we iterate through the entire training dataset, passing once through every
example (assuming that the number of examples is divisible by the batch size). In each iteration,
we grab a minibatch of training examples, and compute its loss through the modelʼs training_step
method. Next, we compute the gradients with respect to each parameter. Finally, we will call
the optimization algorithm to update the model parameters. In summary, we will execute the
following loop:
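• Initialize parameters (w, b)
• Repeat until done
    – Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
    – Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$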
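The loop above can be spelled out in plain Python without tensors or automatic differentiation, using the fact that the gradient of the squared loss for one example is the prediction error times the input. This is only a sketch for intuition: the helper names make_data and minibatch_sgd are hypothetical, and we assume 10 epochs rather than 3 so that the estimates have time to settle.

```python
import random


def make_data(w, b, noise_std, n, rng):
    """Draw n examples from y = w.x + b + noise with standard normal features."""
    X = [[rng.gauss(0, 1) for _ in w] for _ in range(n)]
    y = [sum(wj * xj for wj, xj in zip(w, x)) + b + rng.gauss(0, noise_std)
         for x in X]
    return X, y


def minibatch_sgd(X, y, lr, batch_size, num_epochs, rng):
    # Initialize parameters (w, b): small random weights, zero bias
    w = [rng.gauss(0, 0.01) for _ in X[0]]
    b = 0.0
    indices = list(range(len(X)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # read the examples in random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Compute gradient g of the averaged squared loss on the minibatch
            grad_w, grad_b = [0.0] * len(w), 0.0
            for i in batch:
                err = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
                grad_w = [g + err * xj / len(batch)
                          for g, xj in zip(grad_w, X[i])]
                grad_b += err / len(batch)
            # Update parameters (w, b) <- (w, b) - lr * g
            w = [wj - lr * g for wj, g in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b


rng = random.Random(0)
true_w, true_b = [2.0, -3.4], 4.2
X, y = make_data(true_w, true_b, noise_std=0.01, n=1000, rng=rng)
w, b = minibatch_sgd(X, y, lr=0.03, batch_size=32, num_epochs=10, rng=rng)
print('estimated w:', w, 'estimated b:', b)
```

The framework-based implementation that follows has the same structure; automatic differentiation merely replaces the hand-derived gradient.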

Recall that the synthetic regression dataset that we generated in Section 3.3 sets aside num_val
examples for validation. In most cases we will use such a validation dataset to measure our model
quality. Here we pass the validation dataloader once in each epoch to measure the model
performance. Following our object-oriented design, the prepare_batch and fit_epoch functions are
registered as methods of the d2l.Trainer class (introduced in Section 3.2.4).
@d2l.add_to_class(d2l.Trainer)  #@save
def prepare_batch(self, batch):
    return batch

@d2l.add_to_class(d2l.Trainer)  #@save
def fit_epoch(self):
    self.model.train()
    for batch in self.train_dataloader:
        loss = self.model.training_step(self.prepare_batch(batch))
        self.optim.zero_grad()
        with torch.no_grad():
            loss.backward()
            if self.gradient_clip_val > 0:
                self.clip_gradients(self.gradient_clip_val, self.model)
            self.optim.step()
        self.train_batch_idx += 1
    if self.val_dataloader is None:
        return
    self.model.eval()
    for batch in self.val_dataloader:
        with torch.no_grad():
            self.model.validation_step(self.prepare_batch(batch))
        self.val_batch_idx += 1
We are almost ready to train the model, but first we need some data to train on. Here we use the
SyntheticRegressionData class and pass in some ground-truth parameters. Then, we train our
model with the learning rate lr=0.03 and set max_epochs=3. Note that both the number
of epochs and the learning rate are hyperparameters. In general, setting hyperparameters is tricky
and we will usually want to use a 3-way split, one set for training, a second for hyperparameter
selection, and the third reserved for the final evaluation. We elide these details for now but will
revisit them later.
model = LinearRegressionScratch(2, lr=0.03)

data = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=3)
trainer.fit(model, data)
Because we synthesized the dataset ourselves, we know precisely what the true parameters are.
Thus, we can evaluate our success in training by comparing the true parameters with those that
we learned through our training loop. Indeed they turn out to be very close to each other.
print(f'error in estimating w: {data.w - model.w.reshape(data.w.shape)}')

print(f'error in estimating b: {data.b - model.b}')
error in estimating w: tensor([ 0.0753, -0.1564], grad_fn=<SubBackward0>)

error in estimating b: tensor([0.2175], grad_fn=<RsubBackward1>)
We should not take the ability to exactly recover the ground-truth parameters for granted. In gen-
eral, for deep models unique solutions for the parameters do not exist, and even for linear models,
exactly recovering the parameters is only possible when no feature is linearly dependent on the
others. However, in machine learning, we are often less concerned with recovering true under-
lying parameters, and more concerned with parameters that lead to highly accurate prediction
(Vapnik, 1992). Fortunately, even on difficult optimization problems, stochastic gradient descent
can often find remarkably good solutions, owing partly to the fact that, for deep networks, there
exist many configurations of the parameters that lead to highly accurate prediction.
3.4.5 Summary
In this section, we took a significant step towards designing deep learning systems by implement-
ing a fully functional neural network model and training loop. In this process, we built a data
loader, a model, a loss function, an optimization procedure, and a visualization and monitoring
tool. We did this by composing a Python object that contains all relevant components for training
a model. While this is not yet a professional-grade implementation it is perfectly functional and
code like this could already help you to solve small problems quickly. In the next sections, we will
see how to do this both more concisely (avoiding boilerplate code) and more efficiently (using our
GPUs to their full potential).

Exercises
1. What would happen if we were to initialize the weights to zero? Would the algorithm still
work? What if we initialized the parameters with variance 1,000 rather than 0.01?
2. Assume that you are Georg Simon Ohm74 trying to come up with a model for resistors that
relates voltage and current. Can you use automatic differentiation to learn the parameters of
your model?
3. Can you use Planckʼs Law75 to determine the temperature of an object using spectral energy
density? For reference, the spectral density B of radiation emanating from a black body
is $B(\lambda, T) = \frac{2hc^2}{\lambda^5} \cdot \left(\exp \frac{hc}{\lambda k T} - 1\right)^{-1}$.
Here λ is the wavelength, T is the temperature, c is
the speed of light, h is Planckʼs quantum, and k is the Boltzmann constant. You measure
the energy for different wavelengths λ and you now need to fit the spectral density curve to
Planckʼs law.
4. What are the problems you might encounter if you wanted to compute the second derivatives
of the loss? How would you fix them?
5. Why is the reshape method needed in the loss function?
6. Experiment using different learning rates to find out how quickly the loss function value
drops. Can you reduce the error by increasing the number of epochs of training?
7. If the number of examples cannot be divided by the batch size, what happens to data_iter
at the end of an epoch?
8. Try implementing a different loss function, such as the absolute value loss
(y_hat - d2l.reshape(y, y_hat.shape)).abs().sum().
    1. Check what happens for regular data.
    2. Check whether there is a difference in behavior if you actively perturb some entries of
    y, such as y_5 = 10,000.
    3. Can you think of a cheap solution for combining the best aspects of squared loss and
    absolute value loss? Hint: how can you avoid really large gradient values?
9. Why do we need to reshuffle the dataset? Can you design a case where a maliciously constructed
dataset would break the optimization algorithm otherwise?
Discussions76
3.5 Concise Implementation of Linear Regression
Deep learning has witnessed a Cambrian explosion of sorts over the past decade. The sheer number
of techniques, applications and algorithms by far surpasses the progress of previous decades.
This is due to a fortuitous combination of multiple factors, one of which is the powerful free tools
offered by a number of open source deep learning frameworks. Theano (Bergstra et al., 2010),
DistBelief (Dean et al., 2012), and Caffe (Jia et al., 2014) arguably represent the first generation of
such frameworks that found widespread adoption. In contrast to earlier (seminal) works like SN2
(Simulateur Neuristique) (Bottou and Le Cun, 1988), which provided a Lisp-like programming
experience, modern frameworks offer automatic differentiation and the convenience of Python.
These frameworks allow us to automate and modularize the repetitive work of implementing
gradient-based learning algorithms.
74
https://en.wikipedia.org/wiki/Georg_Ohm
75
https://en.wikipedia.org/wiki/Planck%27s_law
In Section 3.4, we relied only on (i) tensors for data storage and linear algebra; and (ii) automatic
differentiation for calculating gradients. In practice, because data iterators, loss functions, op-
timizers, and neural network layers are so common, modern libraries implement these compo-
nents for us as well. In this section, we will show you how to implement the linear regression
model from Section 3.4 concisely by using high-level APIs of deep learning frameworks.
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l
3.5.1 Defining the Model
When we implemented linear regression from scratch in Section 3.4, we defined our model
parameters explicitly and coded up the calculations to produce output using basic linear algebra
operations. You should know how to do this. But once your models get more complex, and once
you have to do this nearly every day, you will be glad for the assistance. The situation is similar
to coding up your own blog from scratch. Doing it once or twice is rewarding and instructive, but
you would be a lousy web developer if you spent a month reinventing the wheel.
For standard operations, we can use a frameworkʼs predefined layers, which allow us to focus on
the layers used to construct the model rather than worrying about their implementation. Recall
the architecture of a single-layer network as described in Fig. 3.1.2. The layer is called fully con-
nected, since each of its inputs is connected to each of its outputs by means of a matrix-vector
multiplication.
In PyTorch, the fully connected layer is defined in Linear and LazyLinear (available since version
1.8.0) classes. The latter allows users to only specify the output dimension, while the former
additionally asks for how many inputs go into this layer. Specifying input shapes is inconvenient
and may require nontrivial calculations (such as in convolutional layers). Thus, for simplicity
we will use such “lazy” layers whenever we can.
class LinearRegression(d2l.Module):  #@save
    def __init__(self, lr):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.LazyLinear(1)
        self.net.weight.data.normal_(0, 0.01)
        self.net.bias.data.fill_(0)
In the forward method, we just invoke the built-in __call__ function of the predefined layers to
compute the outputs.
@d2l.add_to_class(LinearRegression)  #@save
def forward(self, X):
    """The linear regression model."""
    return self.net(X)

3.5.2 Defining the Loss Function
The MSELoss class computes the mean squared error (without the 1/2 factor in (3.1.5)). By default,
MSELoss returns the average loss over examples. It is faster (and easier to use) than implementing
our own.
@d2l.add_to_class(LinearRegression)  #@save
def loss(self, y_hat, y):
    fn = nn.MSELoss()
    return fn(y_hat, y)
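The missing 1/2 factor means that the mean squared error is exactly twice the averaged squared loss used in Section 3.4. The plain Python sketch below checks this on a few made-up numbers; the function names are our own and only mirror the two definitions.

```python
def half_squared_loss(y_hat, y):
    """Average of (y_hat - y)^2 / 2, as in the from-scratch implementation."""
    return sum((p - t) ** 2 / 2 for p, t in zip(y_hat, y)) / len(y)


def mean_squared_error(y_hat, y):
    """Average of (y_hat - y)^2, i.e., without the 1/2 factor."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


y_hat, y = [1.0, 2.0, 3.0], [1.5, 4.0, 3.0]
half = half_squared_loss(y_hat, y)
mse = mean_squared_error(y_hat, y)
print(half, mse, mse / half)
```

Since the gradients scale by the same factor of two, plain SGD with the same learning rate takes steps twice as large under the mean squared error.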
3.5.3 Defining the Optimization Algorithm
Minibatch SGD is a standard tool for optimizing neural networks and thus PyTorch supports it
alongside a number of variations on this algorithm in the optim module. When we instantiate an
SGD instance, we specify the parameters to optimize over, obtainable from our model via self.
parameters(), and the learning rate (self.lr) required by our optimization algorithm.
@d2l.add_to_class(LinearRegression)  #@save
def configure_optimizers(self):
    return torch.optim.SGD(self.parameters(), self.lr)
3.5.4 Training
You might have noticed that expressing our model through high-level APIs of a deep learning
framework requires fewer lines of code. We did not have to allocate parameters individually,
define our loss function, or implement minibatch SGD. Once we start working with much more
complex models, the advantages of the high-level API will grow considerably. Now that we have
all the basic pieces in place, the training loop itself is the same as the one we implemented from
scratch. So we just call the fit method (introduced in Section 3.2.4), which relies on the imple-
mentation of the fit_epoch method in Section 3.4, to train our model.
model = LinearRegression(lr=0.03)
data = d2l.SyntheticRegressionData(w=torch.tensor([2, -3.4]), b=4.2)
trainer = d2l.Trainer(max_epochs=3)
trainer.fit(model, data)

Below, we compare the model parameters learned by training on finite data and the actual pa-
rameters that generated our dataset. To access parameters, we access the weights and bias of the
layer that we need. As in our implementation from scratch, note that our estimated parameters
are close to their true counterparts.
@d2l.add_to_class(LinearRegression)  #@save
def get_w_b(self):
    return (self.net.weight.data, self.net.bias.data)

w, b = model.get_w_b()
print(f'error in estimating w: {data.w - w.reshape(data.w.shape)}')
print(f'error in estimating b: {data.b - b}')
error in estimating w: tensor([ 0.0090, -0.0130])

error in estimating b: tensor([0.0094])
3.5.5 Summary
This section contains the first implementation of a deep network (in this book) to tap into the
conveniences afforded by modern deep learning frameworks, such as Gluon (Chen et al., 2015),
JAX (Frostig et al., 2018), PyTorch (Paszke et al., 2019), and TensorFlow (Abadi et al., 2016). We used
framework defaults for loading data, defining a layer, a loss function, an optimizer and a training
loop. Whenever the framework provides all necessary features, itʼs generally a good idea to
use them, since the library implementations of these components tend to be heavily optimized for
performance and properly tested for reliability. At the same time, try not to forget that these mod-
ules can be implemented directly. This is especially important for aspiring researchers who wish
to live on the bleeding edge of model development, where you will be inventing new components
that cannot possibly exist in any current library.
In PyTorch, the data module provides tools for data processing, and the nn module defines a large
number of neural network layers and common loss functions. We can initialize the parameters
by replacing their values with methods ending with _. Note that we need to specify the input
dimensions of the network. While this is trivial for now, it can have significant knock-on effects
when we want to design complex networks with many layers. Careful consideration of how to
parametrize these networks is needed to allow portability.
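As a minimal sketch of the in-place initialization mentioned above, the snippet below assumes a single nn.Linear layer with two inputs and one output (sizes chosen only for illustration). Methods whose names end in _, such as normal_ and fill_, overwrite the values of an existing tensor rather than returning a new one.

```python
import torch
from torch import nn

# A single linear layer; the input dimension (2) must be stated explicitly.
net = nn.Linear(2, 1)

# Methods ending in "_" modify the tensor in place.
net.weight.data.normal_(0, 0.01)  # overwrite weights with N(0, 0.01^2) samples
net.bias.data.fill_(0)            # overwrite the bias with zeros

print(net.weight.data.shape, net.bias.data)
```

The standard deviation 0.01 is an illustrative choice, not a recommendation for every model.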

Exercises
1. How would you need to change the learning rate if you replace the aggregate loss over the
minibatch with an average over the loss on the minibatch?
2. Review the framework documentation to see which loss functions are provided. In particu-
lar, replace the squared loss with Huberʼs robust loss function. That is, use the loss function
$$l(y, y') = \begin{cases}
|y - y'| - \frac{\sigma}{2} & \text{if } |y - y'| > \sigma \\
\frac{1}{2\sigma} (y - y')^2 & \text{otherwise}
\end{cases} \tag{3.5.1}$$
3. How do you access the gradient of the weights of the model?

4. How does the solution change if you change the learning rate and the number of epochs?
Does it keep on improving?
5. How does the solution change as you change the amount of data generated?
1. Plot the estimation error for ŵ − w and b̂ − b as a function of the amount of data. Hint:
increase the amount of data logarithmically rather than linearly, i.e., 5, 10, 20, 50, …,
10,000 rather than 1,000, 2,000, …, 10,000.
2. Why is the suggestion in the hint appropriate?
3.6 Generalization
Consider two college students diligently preparing for their final exam. Commonly, this prepara-
tion will consist of practicing and testing their abilities by taking exams administered in previous
years. Nonetheless, doing well on past exams is no guarantee that they will excel when it mat-
ters. For instance, imagine a student, Elephantine Ellie, whose preparation consisted entirely of
memorizing the answers to previous yearsʼ exam questions. Even if Ellie were endowed with an
elephantine memory, and thus could perfectly recall the answer to any previously seen question,
she might nevertheless freeze when faced with a new (previously unseen) question. By compari-
son, imagine another student, Inductive Irene, with comparably poor memorization skills, but a
knack for picking up patterns. Note that if the exam truly consisted of recycled questions from
a previous year, Ellie would handily outperform Irene. Even if Ireneʼs inferred patterns yielded
90% accurate predictions, they could never compete with Ellieʼs 100% recall. However, even if the
exam consisted entirely of fresh questions, Irene might maintain her 90% average.
As machine learning scientists, our goal is to discover patterns. But how can we be sure that we
have truly discovered a general pattern and not simply memorized our data? Most of the time,
our predictions are only useful if our model discovers such a pattern. We donʼt want to predict
yesterdayʼs stock prices, but tomorrowʼs. We donʼt need to recognize already diagnosed diseases
for previously seen patients, but rather previously undiagnosed ailments in previously unseen
patients. This problem—how to discover patterns that generalize—is the fundamental problem of
machine learning, and arguably of all of statistics. We might cast this problem as just one slice of
a far grander question that engulfs all of science: when are we ever justified in making the leap
from particular observations to more general statements (Popper, 2005)?

In real life, we must fit our models using a finite collection of data. The typical scales of that
data vary wildly across domains. For many important medical problems, we can only access a few
thousand data points. When studying rare diseases, we might be lucky to access hundreds. By
contrast, the largest public datasets consisting of labeled photographs, e.g., ImageNet (Deng et
al., 2009), contain millions of images. And some unlabeled image collections such as the Flickr
YFCC100M dataset can be even larger, containing over 100 million images (Thomee et al., 2016).
However, even at this extreme scale, the number of available data points remains infinitesimally
small compared to the space of all possible images at 1 megapixel resolution. Whenever we work
with finite samples, we must keep in mind the risk that we might fit our training data, only to
discover that we failed to discover a generalizable pattern.
The phenomenon of fitting closer to our training data than to the underlying distribution is called
overfitting, and techniques for combatting overfitting are often called regularization methods.
While there is no substitute for a proper introduction to statistical learning theory (see Boucheron
et al., 2005; Vapnik, 1998), we will give you just enough intuition to get going. We will revisit
generalization in many chapters throughout the book, exploring both what is known about the
principles underlying generalization in various models, and also heuristic techniques that have been
found (empirically) to yield improved generalization on tasks of practical interest.
3.6.1 Training Error and Generalization Error
In the standard supervised learning setting, we assume that the training data and the test data
are drawn independently from identical distributions. This is commonly called the IID assumption.
While this assumption is strong, itʼs worth noting that absent any such assumption we would be
dead in the water. Why should we believe that training data sampled from distribution P (X, Y )
should tell us how to make predictions on test data generated by a different distribution Q(X, Y )?
Making such leaps turns out to require strong assumptions about how P and Q are related. Later
on we will discuss some assumptions that allow for shifts in distribution, but first we need to
understand the IID case, where P (·) = Q(·).
To begin with, we need to differentiate between the training error $R_\textrm{emp}$, which is a statistic
calculated on the training dataset, and the generalization error $R$, which is an expectation taken with
respect to the underlying distribution. You can think of the generalization error as what you would
see if you applied your model to an infinite stream of additional data examples drawn from the
same underlying data distribution. Formally the training error is expressed as a sum (with the
same notation in Section 3.1):
$$R_\textrm{emp}[\mathbf{X}, \mathbf{y}, f] = \frac{1}{n} \sum_{i=1}^n l(\mathbf{x}^{(i)}, y^{(i)}, f(\mathbf{x}^{(i)})), \tag{3.6.1}$$
while the generalization error is expressed as an integral:

$$R[p, f] = E_{(\mathbf{x}, y) \sim P} [l(\mathbf{x}, y, f(\mathbf{x}))] = \int \int l(\mathbf{x}, y, f(\mathbf{x})) \, p(\mathbf{x}, y) \; d\mathbf{x} \, dy. \tag{3.6.2}$$
Problematically, we can never calculate the generalization error R exactly. Nobody ever tells us
the precise form of the density function p(x, y). Moreover, we cannot sample an infinite stream of
data points. Thus, in practice, we must estimate the generalization error by applying our model to
an independent test set constituted of a random selection of examples X′ and labels y′ that were
withheld from our training set. This consists of applying the same formula as for calculating the
empirical training error but to a test set X′ , y′ .
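The procedure just described can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the data distribution (y = 2x plus Gaussian noise), the one-parameter model, and the helper names empirical_risk and sample. The point is only that the same averaging formula is applied once to the training set and once to a held-out test set.

```python
import random

def squared_loss(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

def empirical_risk(f, xs, ys, loss=squared_loss):
    """Average loss of predictor f over the examples (xs, ys), as in (3.6.1)."""
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)

def sample(n, rng):
    # Hypothetical data distribution: y = 2x + Gaussian noise (std 0.1)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = sample(20, rng)
test_x, test_y = sample(10_000, rng)  # held out, never used for fitting

# Fit y = w * x by least squares on the training set only
w = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)
f = lambda x: w * x

train_err = empirical_risk(f, train_x, train_y)
test_err = empirical_risk(f, test_x, test_y)
print(f'training error {train_err:.4f}, test error {test_err:.4f}')
```

Because the test examples played no role in choosing w, the test average is an ordinary mean estimate of the generalization error for this fixed predictor.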
Crucially, when we evaluate our classifier on the test set, we are working with a fixed classifier (it
does not depend on the sample of the test set), and thus estimating its error is simply the problem
of mean estimation. However, the same cannot be said for the training set. Note that the model
we wind up with depends explicitly on the selection of the training set and thus the training error
will in general be a biased estimate of the true error on the underlying population. The central
question of generalization is then when should we expect our training error to be close to the
population error (and thus the generalization error).
Model Complexity
In classical theory, when we have simple models and abundant data, the training and general-
ization errors tend to be close. However, when we work with more complex models and/or fewer
examples, we expect the training error to go down but the generalization gap to grow. This should
not be surprising. Imagine a model class so expressive that for any dataset of n examples, we can
find a set of parameters that can perfectly fit arbitrary labels, even if randomly assigned. In this
case, even if we fit our training data perfectly, how can we conclude anything about the general-
ization error? For all we know, our generalization error might be no better than random guessing.
In general, absent any restriction on our model class, we cannot conclude based on fitting the
training data alone that our model has discovered any generalizable pattern (Vapnik et al., 1994).
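To make the thought experiment concrete, here is a small sketch in plain Python. It assumes an extreme "model class", a lookup table that memorizes every training example, and labels that are pure coin flips; both are illustrative choices and not part of the text's experiments. The table fits the training labels perfectly, yet this says nothing about unseen inputs.

```python
import random

rng = random.Random(0)

# 1,200 distinct inputs with coin-flip labels: there is no pattern to learn
inputs = rng.sample(range(10**9), 1200)
data = [(x, rng.choice([0, 1])) for x in inputs]
train, test = data[:200], data[200:]

# A "model" expressive enough to fit any labeling: a lookup table
table = dict(train)

def predict(x):
    return table.get(x, 0)  # arbitrary guess (0) for unseen inputs

def error_rate(examples):
    return sum(predict(x) != y for x, y in examples) / len(examples)

train_error = error_rate(train)
test_error = error_rate(test)
print(f'training error {train_error:.2f}, test error {test_error:.2f}')
```

The training error is exactly zero by construction, while the test error hovers around one half, which is what random guessing would achieve.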
On the other hand, if our model class was not capable of fitting arbitrary labels, then it must have
discovered a pattern. Learning-theoretic ideas about model complexity derived some inspiration
from the ideas of Karl Popper, an influential philosopher of science, who formalized the criterion
of falsifiability. According to Popper, a theory that can explain any and all observations is not a
scientific theory at all! After all, what has it told us about the world if it has not ruled out any
possibility? In short, what we want is a hypothesis that could not explain just any observations we
might conceivably make and yet nevertheless happens to be compatible with those observations
that we in fact make.
Now what precisely constitutes an appropriate notion of model complexity is a complex matter.
Often, models with more parameters are able to fit a greater number of arbitrarily assigned labels.
However, this is not necessarily true. For instance, kernel methods operate in spaces with infinite
numbers of parameters, yet their complexity is controlled by other means (Schölkopf and Smola,
2002). One notion of complexity that often proves useful is the range of values that the parame-
ters can take. Here, a model whose parameters are permitted to take arbitrary values would be
more complex. We will revisit this idea in the next section, when we introduce weight decay, your
first practical regularization technique. Notably, it can be difficult to compare complexity among
members of substantially different model classes (say, decision trees vs. neural networks).
At this point, we must stress another important point that we will revisit when introducing deep
neural networks. When a model is capable of fitting arbitrary labels, low training error does not
necessarily imply low generalization error. However, it does not necessarily imply high generalization
error either! All we can say confidently is that low training error alone is not enough to certify low
generalization error. Deep neural networks turn out to be just such models: while they generalize
well in practice, they are too powerful to allow us to conclude much on the basis of training error
alone. In these cases we must rely more heavily on our holdout data to certify generalization after
the fact. Error on the holdout data, i.e., validation set, is called the validation error.

3.6.2 Underfitting or Overfitting?
When we compare the training and validation errors, we want to be mindful of two common
situations. First, we want to watch out for cases when our training error and validation error are both
substantial but there is little gap between them. If the model is unable to reduce the training
error, that could mean that our model is too simple (i.e., insufficiently expressive) to capture the
pattern that we are trying to model. Moreover, since the generalization gap ($R_\textrm{emp} - R$) between
our training and generalization errors is small, we have reason to believe that we could get away
with a more complex model. This phenomenon is known as underfitting.
On the other hand, as we discussed above, we want to watch out for the cases when our train-
ing error is significantly lower than our validation error, indicating severe overfitting. Note that
overfitting is not always a bad thing. In deep learning especially, the best predictive models often
perform far better on training data than on holdout data. Ultimately, we usually care about driv-
ing the generalization error lower, and only care about the gap insofar as it becomes an obstacle
to that end. Note that if the training error is zero, then the generalization gap is precisely equal to
the generalization error and we can make progress only by reducing the gap.
Polynomial Curve Fitting
To illustrate some classical intuition about overfitting and model complexity, consider the follow-
ing: given training data consisting of a single feature x and a corresponding real-valued label y,
we try to find the polynomial of degree d
$$\hat{y} = \sum_{i=0}^d x^i w_i \tag{3.6.3}$$
to estimate the label y. This is just a linear regression problem where our features are given by the
powers of x, the modelʼs weights are given by $w_i$, and the bias is given by $w_0$ since $x^0 = 1$ for all x.
Since this is just a linear regression problem, we can use the squared error as our loss function.
A higher-order polynomial function is more complex than a lower-order polynomial function,
since the higher-order polynomial has more parameters and the model functionʼs selection range
is wider. Fixing the training dataset, higher-order polynomial functions should always achieve
lower (at worst, equal) training error relative to lower degree polynomials. In fact, whenever the
data examples each have a distinct value of x, a polynomial function with degree equal to the
number of data examples can fit the training set perfectly. We visualize the relationship between
polynomial degree (model complexity) and underfitting vs. overfitting in Fig. 3.6.1.

Fig. 3.6.1: Influence of model complexity on underfitting and overfitting
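The claim that training error can only go down as the degree grows can be checked with a short NumPy sketch. The target function (a noisy sine), the six sample points, and the use of np.linalg.lstsq are assumptions made for illustration. Note that with n distinct inputs a polynomial of degree n - 1 already has n coefficients, which is enough to interpolate the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
x = np.linspace(-1, 1, n)                         # distinct inputs
y = np.sin(3 * x) + 0.1 * rng.standard_normal(n)  # hypothetical noisy labels

train_mse = []
for d in range(n):  # polynomial degrees 0, 1, ..., n - 1
    features = np.vander(x, d + 1, increasing=True)  # columns x^0, ..., x^d
    w, *_ = np.linalg.lstsq(features, y, rcond=None)  # least-squares weights
    train_mse.append(float(np.mean((features @ w - y) ** 2)))

for d, mse in enumerate(train_mse):
    print(f'degree {d}: training MSE {mse:.2e}')
```

The printed errors never increase with the degree, and the last one is zero up to floating-point error. Nothing here measures validation error, so the sketch says nothing about which degree generalizes best.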
Dataset Size
Another big consideration to bear in mind is dataset size.
Fixing our model, the fewer samples we have in the training dataset, the more likely (and more
severely) we are to encounter overfitting. As we increase the amount of training data, the gen-
eralization error typically decreases. Moreover, in general, more data never hurts. For a fixed
task and data distribution, model complexity should not increase more rapidly than the amount
of data. Given more data, we might attempt to fit a more complex model. Absent sufficient data,
simpler models may be more difficult to beat. For many tasks, deep learning only outperforms lin-
ear models when many thousands of training examples are available. In part, the current success
of deep learning owes considerably to the abundance of massive datasets arising from Internet
companies, cheap storage, connected devices, and the broad digitization of the economy.
3.6.3 Model Selection
Typically, we select our final model only after evaluating multiple models that differ in various
ways (different architectures, training objectives, selected features, data preprocessing, learning
rates, etc.). Choosing among many models is aptly called model selection.
In principle, we should not touch our test set until after we have chosen all our hyperparameters.
Were we to use the test data in the model selection process, there is a risk that we might overfit
the test data. Then we would be in serious trouble. If we overfit our training data, there is always
the evaluation on test data to keep us honest. But if we overfit the test data, how would we ever
know? See Ong et al. (2005) for an example of how this can lead to absurd results even for models
where the complexity can be tightly controlled.
Thus, we should never rely on the test data for model selection. And yet we cannot rely solely on
the training data for model selection either because we cannot estimate the generalization error
on the very data that we use to train the model.
In practical applications, the picture gets muddier. While ideally we would only touch the test
data once, to assess the very best model or to compare a small number of models with each other,
real-world test data is seldom discarded after just one use. We can seldom afford a new test set for
each round of experiments. In fact, recycling benchmark data for decades can have a significant
impact on the development of algorithms, e.g., for image classification78 and optical character
recognition79.
The common practice to address the problem of training on the test set is to split our data three ways,
incorporating a validation set in addition to the training and test datasets. The result is a murky
practice where the boundaries between validation and test data are worryingly ambiguous. Unless
explicitly stated otherwise, in the experiments in this book we are really working with what should
rightly be called training data and validation data, with no true test sets. Therefore, the accuracy
reported in each experiment of the book is really the validation accuracy and not a true test set
accuracy.
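A three-way split can be sketched as follows. The helper three_way_split, the 80/10/10 proportions, and the fixed seed are hypothetical choices for illustration; the book's own experiments do not use this function.

```python
import random

def three_way_split(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle indices 0..n-1 and split them into train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    val_idx = idx[:n_val]
    test_idx = idx[n_val:n_val + n_test]
    train_idx = idx[n_val + n_test:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Hyperparameters would then be chosen using only the validation indices, with the test indices set aside for a final assessment.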
Cross-Validation
When training data is scarce, we might not even be able to afford to hold out enough data to con-
stitute a proper validation set. One popular solution to this problem is to employ K-fold cross-
validation. Here, the original training data is split into K non-overlapping subsets. Then model
training and validation are executed K times, each time training on K − 1 subsets and validat-
ing on a different subset (the one not used for training in that round). Finally, the training and
validation errors are estimated by averaging over the results from the K experiments.
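The K-fold procedure can be sketched in plain Python. The helper names k_fold_indices and cross_validate, the toy labels, and the trivial "model" that predicts the training mean are all assumptions for illustration. In practice the examples would usually be shuffled before splitting.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) index lists for K-fold cross-validation."""
    # Spread the remainder so that fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

def cross_validate(n, k, train_and_eval):
    """Average the (train_error, val_error) pairs over the K rounds."""
    results = [train_and_eval(tr, va) for tr, va in k_fold_indices(n, k)]
    avg_train = sum(r[0] for r in results) / k
    avg_val = sum(r[1] for r in results) / k
    return avg_train, avg_val

ys = [float(i % 5) for i in range(20)]  # toy labels

def train_and_eval(train_idx, val_idx):
    mean = sum(ys[i] for i in train_idx) / len(train_idx)  # "training"
    mse = lambda idx: sum((ys[i] - mean) ** 2 for i in idx) / len(idx)
    return mse(train_idx), mse(val_idx)

avg_train, avg_val = cross_validate(len(ys), 4, train_and_eval)
print(avg_train, avg_val)
```

Each example is used for validation exactly once and for training K - 1 times, which is also why the procedure costs roughly K training runs.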
3.6.4 Summary
This section explored some of the underpinnings of generalization in machine learning. Some
of these ideas become complicated and counterintuitive when we get to deeper models. There,
models are capable of overfitting data badly, and the relevant notions of complexity can be both
implicit and counterintuitive (e.g., larger architectures with more parameters generalizing better).
We leave you with a few rules of thumb:
1. Use validation sets (or K-fold cross-validation) for model selection;
2. More complex models often require more data;
3. Relevant notions of complexity include both the number of parameters and the range of
values that they are allowed to take;
4. Keeping all else equal, more data almost always leads to better generalization;
5. This entire talk of generalization is all predicated on the IID assumption. If we relax this
assumption, allowing for distributions to shift between the training and testing periods, then
we cannot say anything about generalization absent a further (perhaps milder) assumption.
78
https://paperswithcode.com/sota/image-classification-on-imagenet
79
https://paperswithcode.com/sota/image-classification-on-mnist

Exercises
1. When can you solve the problem of polynomial regression exactly?

2. Give at least five examples where dependent random variables make treating the problem
as IID data inadvisable.
3. Can you ever expect to see zero training error? Under which circumstances would you see
zero generalization error?
4. Why is K-fold cross-validation very expensive to compute?
5. Why is the K-fold cross-validation error estimate biased?
6. The VC dimension is defined as the maximum number of points that can be classified with
arbitrary labels {±1} by a function of a class of functions. Why might this not be a good idea
to measure how complex the class of functions is? Hint: what about the magnitude of the
functions?
7. Your manager gives you a difficult dataset on which your current algorithm doesnʼt perform
so well. How would you justify to him that you need more data? Hint: you cannot increase
the data but you can decrease it.
3.7 Weight Decay
Now that we have characterized the problem of overfitting, we can introduce our first regulariza-
tion technique. Recall that we can always mitigate overfitting by collecting more training data.
However, that can be costly, time consuming, or entirely out of our control, making it impossible
in the short run. For now, we can assume that we already have as much high-quality data as our
resources permit and focus on the tools at our disposal when the dataset is taken as a given.
Recall that in our polynomial regression example (Section 3.6.2) we could limit our modelʼs ca-
pacity by tweaking the degree of the fitted polynomial. Indeed, limiting the number of features
is a popular technique to mitigate overfitting. However, simply tossing aside features can be too
blunt an instrument. Sticking with the polynomial regression example, consider what might hap-
pen with high-dimensional input. The natural extensions of polynomials to multivariate data are
called monomials, which are simply products of powers of variables. The degree of a monomial is
the sum of the powers. For example, $x_1^2 x_2$ and $x_3 x_5^2$ are both monomials of degree 3.
Note that the number of terms with degree d blows up rapidly as d grows larger. Given k variables,
the number of monomials of degree d (i.e., k multichoose d) is $\binom{k-1+d}{k-1}$. Even small changes in
degree, say from 2 to 3, dramatically increase the complexity of our model. Thus we often need a
more fine-grained tool for adjusting function complexity.
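The growth in the number of monomials is easy to tabulate with the standard library. The helper name num_monomials and the choice of 20 variables are illustrative assumptions.

```python
from math import comb

def num_monomials(k, d):
    """Number of monomials of degree exactly d in k variables: C(k - 1 + d, k - 1)."""
    return comb(k - 1 + d, k - 1)

for d in (1, 2, 3, 4):
    print(d, num_monomials(20, d))
```

With 20 variables, moving from degree 2 to degree 3 raises the count from 210 to 1,540 terms.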

3.7.1 Norms and Weight Decay
Rather than directly manipulating the number of parameters, weight decay operates by restricting
the values that the parameters can take. More commonly called ℓ2 regularization outside of deep
learning circles when optimized by minibatch stochastic gradient descent, weight decay might be
the most widely used technique for regularizing parametric machine learning models. The tech-
nique is motivated by the basic intuition that among all functions f , the function f = 0 (assigning
the value 0 to all inputs) is in some sense the simplest, and that we can measure the complexity of
a function by the distance of its parameters from zero. But how precisely should we measure the
distance between a function and zero? Thereʼs no single right answer. In fact, entire branches of
mathematics, including parts of functional analysis and the theory of Banach spaces, are devoted
to addressing such issues.
One simple interpretation might be to measure the complexity of a linear function $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$
by some norm of its weight vector, e.g., $\|\mathbf{w}\|^2$. Recall that we introduced the ℓ2 norm and ℓ1 norm,
which are special cases of the more general ℓp norm, in Section 2.3.11. The most common method
for ensuring a small weight vector is to add its norm as a penalty term to the problem of minimizing
the loss. Thus we replace our original objective, minimizing the prediction loss on the training labels,
with a new objective, minimizing the sum of the prediction loss and the penalty term. Now, if our weight
vector grows too large, our learning algorithm might focus on minimizing the weight norm $\|\mathbf{w}\|^2$
rather than minimizing the training error. That is exactly what we want. To illustrate things in code, we
revive our previous example from Section 3.1 for linear regression. There, our loss was given by
$$L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2. \tag{3.7.1}$$
Recall that $\mathbf{x}^{(i)}$ are the features, $y^{(i)}$ is the label for any data example $i$, and $(\mathbf{w}, b)$ are the weight
and bias parameters, respectively. To penalize the size of the weight vector, we must somehow
add $\|\mathbf{w}\|^2$ to the loss function, but how should the model trade off the standard loss for this new
additive penalty? In practice, we characterize this tradeoff via the regularization constant λ, a
nonnegative hyperparameter that we fit using validation data:
$$L(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2. \tag{3.7.2}$$
For λ = 0, we recover our original loss function. For λ > 0, we restrict the size of ∥w∥. We
divide by 2 by convention: when we take the derivative of a quadratic function, the 2 and 1/2
cancel out, ensuring that the expression for the update looks nice and simple. The astute reader
might wonder why we work with the squared norm and not the standard norm (i.e., the Euclidean
distance). We do this for computational convenience. By squaring the ℓ2 norm, we remove the
square root, leaving the sum of squares of each component of the weight vector. This makes the
derivative of the penalty easy to compute: the sum of derivatives equals the derivative of the sum.
Moreover, you might ask why we work with the ℓ2 norm in the first place and not, say, the ℓ1
norm. In fact, other choices are valid and popular throughout statistics. While ℓ2 -regularized
linear models constitute the classic ridge regression algorithm, ℓ1 -regularized linear regression is
a similarly fundamental method in statistics, popularly known as lasso regression. One reason to
work with the ℓ2 norm is that it places an outsize penalty on large components of the weight vector.
This biases our learning algorithm towards models that distribute weight evenly across a larger
number of features. In practice, this might make them more robust to measurement error in a
single variable. By contrast, ℓ1 penalties lead to models that concentrate weights on a small set
of features by clearing the other weights to zero. This gives us an effective method for feature
selection, which may be desirable for other reasons. For example, if our model only relies on a few
features, then we may not need to collect, store, or transmit data for the other (dropped) features.

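Using the same notation as in (3.1.11), the minibatch stochastic gradient descent updates for
ℓ2-regularized regression follow:
$$\mathbf{w} \leftarrow (1 - \eta\lambda) \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \tag{3.7.3}$$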
As before, we update w based on the amount by which our estimate differs from the observation.
However, we also shrink the size of w towards zero. That is why the method is sometimes called
“weight decay”: given the penalty term alone, our optimization algorithm decays the weight at each
step of training. In contrast to feature selection, weight decay offers us a continuous mechanism
for adjusting the complexity of a function. Smaller values of λ correspond to less constrained w,
whereas larger values of λ constrain w more considerably. Whether we include a corresponding
bias penalty $b^2$ can vary across implementations, and may vary across layers of a neural
network. Often, we do not regularize the bias term. Besides, although ℓ2 regularization may not be
equivalent to weight decay for other optimization algorithms, the idea of regularization through
shrinking the size of weights still holds true.
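As a sanity check on the update (3.7.3), the sketch below compares two routes to the same step on one random minibatch: a plain gradient step on the penalized objective (3.7.2) computed with autograd, and the closed-form "shrink, then step" rule. The minibatch size, the values of η and λ, and the variable names are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 3)  # a minibatch of 8 examples with 3 features
y = torch.randn(8, 1)
w = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1)
eta, lambd = 0.1, 0.5  # illustrative learning rate and regularization constant

# (a) Gradient step on the penalized objective, gradient via autograd
loss = (0.5 * (X @ w + b - y) ** 2).mean() + lambd / 2 * (w ** 2).sum()
loss.backward()
w_autograd = (w - eta * w.grad).detach()

# (b) Closed-form update (3.7.3): shrink w, then take the usual step
with torch.no_grad():
    residual = X @ w + b - y  # shape (8, 1)
    w_formula = (1 - eta * lambd) * w - eta / len(X) * (X.T @ residual)

print(torch.allclose(w_autograd, w_formula, atol=1e-6))
```

The two results agree up to floating-point error, which is why adding the squared ℓ2 penalty to the loss and decaying the weights by the factor (1 − ηλ) are interchangeable for plain minibatch SGD.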
3.7.2 High-Dimensional Linear Regression
We can illustrate the benefits of weight decay through a simple synthetic example.
%matplotlib inline
import torch
from d2l import torch as d2l
First, we generate some data as before:
$$y = 0.05 + \sum_{i=1}^d 0.01 x_i + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, 0.01^2). \tag{3.7.4}$$
In this synthetic dataset, our label is given by an underlying linear function of our inputs, cor-
rupted by Gaussian noise with zero mean and standard deviation 0.01. For illustrative purposes,
we can make the effects of overfitting pronounced, by increasing the dimensionality of our prob-
lem to d = 200 and working with a small training set with only 20 examples.
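class Data(d2l.DataModule):
    def __init__(self, num_train, num_val, num_inputs, batch_size):
        # Store the constructor arguments as attributes (self.num_train, ...)
        self.save_hyperparameters()
        n = num_train + num_val
        self.X = torch.randn(n, num_inputs)
        noise = torch.randn(n, 1) * 0.01
        w, b = torch.ones((num_inputs, 1)) * 0.01, 0.05
        self.y = torch.matmul(self.X, w) + b + noise

    def get_dataloader(self, train):
        # The first num_train rows are for training, the rest for validation
        i = slice(0, self.num_train) if train else slice(self.num_train, None)
        return self.get_tensorloader([self.X, self.y], train, i)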

3.7.3 Implementation from Scratch
Now, letʼs try implementing weight decay from scratch. Since minibatch stochastic gradient de-
scent is our optimizer, we just need to add the squared ℓ2 penalty to the original loss function.
Defining ℓ2 Norm Penalty
Perhaps the most convenient way to implement this penalty is to square all terms in place and
sum them up.
def l2_penalty(w):
    return (w ** 2).sum() / 2
Defining the Model
In the final model, the linear regression and the squared loss have not changed since Section 3.4,
so we will just define a subclass of d2l.LinearRegressionScratch. The only change here is that
our loss now includes the penalty term.
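class WeightDecayScratch(d2l.LinearRegressionScratch):
    def __init__(self, num_inputs, lambd, lr, sigma=0.01):
        super().__init__(num_inputs, lr, sigma)
        self.save_hyperparameters()  # stores lambd as self.lambd

    def loss(self, y_hat, y):
        # Squared loss plus the scaled l2 penalty on the weights
        return super().loss(y_hat, y) + self.lambd * l2_penalty(self.w)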
The following code fits our model on the training set with 20 examples and evaluates it on the
validation set with 100 examples.
data = Data(num_train=20, num_val=100, num_inputs=200, batch_size=5)
trainer = d2l.Trainer(max_epochs=10)  # number of epochs is an assumed setting

def train_scratch(lambd):
    model = WeightDecayScratch(num_inputs=200, lambd=lambd, lr=0.01)
    model.board.yscale='log'
    trainer.fit(model, data)
    print('L2 norm of w:', float(l2_penalty(model.w)))
Training without Regularization
We now run this code with lambd = 0, disabling weight decay. Note that we overfit badly, decreas-
ing the training error but not the validation error—a textbook case of overfitting.
train_scratch(0)
L2 norm of w: 0.01034499891102314

Using Weight Decay
Below, we run with substantial weight decay. Note that the training error increases but the vali-
dation error decreases. This is precisely the effect we expect from regularization.
train_scratch(3)
L2 norm of w: 0.0014588753692805767
3.7.4 Concise Implementation
Because weight decay is ubiquitous in neural network optimization, the deep learning framework
makes it especially convenient, integrating weight decay into the optimization algorithm itself for
easy use in combination with any loss function. Moreover, this integration serves a computational
benefit, allowing implementation tricks to add weight decay to the algorithm, without any addi-
tional computational overhead. Since the weight decay portion of the update depends only on the
current value of each parameter, the optimizer must touch each parameter once anyway.
In the following code, we specify the weight decay hyperparameter directly through weight_decay
when instantiating our optimizer. By default, PyTorch decays both weights and biases
simultaneously. Here, we only set weight_decay for the weight, so the bias parameter b will not decay.
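class WeightDecay(d2l.LinearRegression):
    def __init__(self, wd, lr):
        super().__init__(lr)
        self.wd = wd

    def configure_optimizers(self):
        # Decay only the weight; the bias group keeps the default (no decay)
        return torch.optim.SGD([
            {'params': self.net.weight, 'weight_decay': self.wd},
            {'params': self.net.bias}], lr=self.lr)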
The plot looks similar to that when we implemented weight decay from scratch. However, this
version runs faster and is easier to implement, benefits that will become more pronounced as you
address larger problems and this work becomes more routine.
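model = WeightDecay(wd=3, lr=0.01)
model.board.yscale='log'
trainer.fit(model, data)  # trainer: the d2l.Trainer used for the from-scratch runs
print('L2 norm of w:', float(l2_penalty(model.get_w_b()[0])))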
L2 norm of w: 0.013070883229374886
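To see the effect of setting weight_decay per parameter group in isolation, the following sketch takes one SGD step with a loss whose gradient is exactly zero, so any change in a parameter comes from the decay term alone. The layer size, the constant initial values, and the settings lr=0.1 and weight_decay=0.5 are assumptions chosen to make the arithmetic easy to follow.

```python
import torch
from torch import nn

net = nn.Linear(4, 1)
nn.init.constant_(net.weight, 1.0)
nn.init.constant_(net.bias, 1.0)

# Two parameter groups: decay the weight, leave the bias at the default (0)
opt = torch.optim.SGD([
    {'params': net.weight, 'weight_decay': 0.5},
    {'params': net.bias}], lr=0.1)

# A loss with zero gradient isolates the contribution of weight decay
loss = (net(torch.zeros(1, 4)) * 0).sum()
opt.zero_grad()
loss.backward()
opt.step()

print(net.weight.data, net.bias.data)
```

After the step each weight equals 1 − 0.1 × 0.5 = 0.95, while the bias is still 1. Passing all parameters in a single group with weight_decay set would shrink the bias as well.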
So far, we only touched upon one notion of what constitutes a simple linear function. Moreover,
what constitutes a simple nonlinear function can be an even more complex question. For in-
stance, reproducing kernel Hilbert space (RKHS)81 allows one to apply tools introduced for linear
functions in a nonlinear context. Unfortunately, RKHS-based algorithms tend to scale poorly to
large, high-dimensional data. In this book we will often adopt the common heuristic whereby
weight decay is applied to all layers of a deep network.
81
https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space

3.7.5 Summary
• Regularization is a common method for dealing with overfitting. Classical regularization
techniques add a penalty term to the loss function (when training) to reduce the complexity
of the learned model.
• One particular choice for keeping the model simple is using an ℓ2 penalty. This leads to
weight decay in the update steps of the minibatch stochastic gradient descent algorithm.
• The weight decay functionality is provided in optimizers from deep learning frameworks.
• Different sets of parameters can have different update behaviors within the same training
loop.
Exercises
1. Experiment with the value of λ in the estimation problem in this section. Plot training and
validation accuracy as a function of λ. What do you observe?
2. Use a validation set to find the optimal value of λ. Is it really the optimal value? Does this
matter?
3. What would the update equations look like if instead of $\|\mathbf{w}\|^2$ we used $\sum_i |w_i|$ as our penalty
of choice (ℓ1 regularization)?
4. We know that $\|\mathbf{w}\|^2 = \mathbf{w}^\top \mathbf{w}$. Can you find a similar equation for matrices (see the Frobenius
norm in Section 2.3.11)?
5. Review the relationship between training error and generalization error. In addition to
weight decay, increased training, and the use of a model of suitable complexity, what other
ways can you think of to deal with overfitting?
6. In Bayesian statistics we use the product of prior and likelihood to arrive at a posterior via
P (w | x) ∝ P (x | w)P (w). How can you identify P (w) with regularization?

4 Linear Neural Networks for Classification
Now that you have worked through all of the mechanics you are ready to apply these skills to
broader kinds of tasks. Even as we pivot towards classification, most of the plumbing remains
the same: loading the data, passing it through the model, generating output, calculating the loss,
taking gradients with respect to weights, and updating the model. However, the precise form of
the targets, the parameterization of the output layer, and the choice of loss function will adapt to
suit the classification setting.
4.1 Softmax Regression
In Section 3.1, we introduced linear regression, working through implementations from scratch
in Section 3.4 and again using high-level APIs of a deep learning framework in Section 3.5 to do
the heavy lifting.
Regression is the hammer we reach for when we want to answer how much? or how many? ques-
tions. If you want to predict the number of dollars (price) at which a house will be sold, or the
number of wins a baseball team might have, or the number of days that a patient will remain hos-
pitalized before being discharged, then you are probably looking for a regression model. How-
ever, even within regression models, there are important distinctions. For instance, the price of
a house will never be negative and changes might often be relative to its baseline price. As such, it
might be more effective to regress on the logarithm of the price. Likewise, the number of days a
patient spends in hospital is a discrete nonnegative random variable. As such, least mean squares
might not be an ideal approach either. This sort of time-to-event modeling comes with a host of
other complications that are dealt with in a specialized subfield called survival modeling.
The point here is not to overwhelm you but just to let you know that there is a lot more to estima-
tion than simply minimizing squared errors. And more broadly, thereʼs a lot more to supervised
learning than regression. In this section, we focus on classification problems where we put aside
how much? questions and instead focus on which category? questions.
• Does this email belong in the spam folder or the inbox?
• Is this customer more likely to sign up or not to sign up for a subscription service?
• Does this image depict a donkey, a dog, a cat, or a rooster?
• Which movie is Aston most likely to watch next?
• Which section of the book are you going to read next?
Colloquially, machine learning practitioners overload the word classification to describe two subtly
different problems: (i) those where we are interested only in hard assignments of examples to
categories (classes); and (ii) those where we wish to make soft assignments, i.e., to assess the
probability that each category applies. The distinction tends to get blurred, in part, because often,
even when we only care about hard assignments, we still use models that make soft assignments.
Even more, there are cases where more than one label might be true. For instance, a news article
might simultaneously cover the topics of entertainment, business, and space flight, but not the
topics of medicine or sports. Thus, categorizing it into one of the above categories on their own
would not be very useful. This problem is commonly known as multi-label classification83 . See
(Tsoumakas and Katakis, 2007) for an overview and (Huang et al., 2015) for an effective algorithm
when tagging images.
4.1.1 Classification
To get our feet wet, letʼs start with a simple image classification problem. Here, each input consists
of a 2 × 2 grayscale image. We can represent each pixel value with a single scalar, giving us four
features x1 , x2 , x3 , x4 . Further, letʼs assume that each image belongs to one among the categories
“cat”, “chicken”, and “dog”.
Next, we have to choose how to represent the labels. We have two obvious choices. Per-
haps the most natural impulse would be to choose y ∈ {1, 2, 3}, where the integers repre-
sent {dog, cat, chicken} respectively. This is a great way of storing such information on a com-
puter. If the categories had some natural ordering among them, say if we were trying to predict
{baby, toddler, adolescent, young adult, adult, geriatric}, then it might even make sense to cast
this as an ordinal regression84 problem and keep the labels in this format. See (Moon et al., 2010)
for an overview of different types of ranking loss functions and (Beutel et al., 2014) for a Bayesian
approach that addresses responses with more than one mode.
In general, classification problems do not come with natural orderings among the classes. For-
tunately, statisticians long ago invented a simple way to represent categorical data: the one-hot
encoding. A one-hot encoding is a vector with as many components as we have categories. The
component corresponding to a particular instanceʼs category is set to 1 and all other components
are set to 0. In our case, a label y would be a three-dimensional vector, with (1, 0, 0) corresponding
to “cat”, (0, 1, 0) to “chicken”, and (0, 0, 1) to “dog”:
y ∈ {(1, 0, 0), (0, 1, 0), (0, 0, 1)}. (4.1.1)
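As a small illustration of the encoding in (4.1.1), the sketch below converts integer labels into one-hot vectors with PyTorch's `torch.nn.functional.one_hot`. The mapping 0 = cat, 1 = chicken, 2 = dog and the example labels are assumptions made for this snippet only.

```python
import torch
import torch.nn.functional as F

# Hypothetical integer labels under the assumed mapping 0 = cat, 1 = chicken, 2 = dog
y = torch.tensor([0, 2, 1, 2])
Y = F.one_hot(y, num_classes=3)
print(Y)

# Each row has exactly one nonzero entry, and argmax recovers the integer label
print(Y.sum(dim=1), Y.argmax(dim=1))
```

Storing the integer labels and expanding them only when needed is usually cheaper, which is why both representations tend to coexist in practice.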
Linear Model
In order to estimate the conditional probabilities associated with all the possible classes, we need
a model with multiple outputs, one per class. To address classification with linear models, we will
need as many affine functions as we have outputs. Strictly speaking, we only need one fewer, since
the last category has to be the difference between 1 and the sum of the other categories but for
reasons of symmetry we use a slightly redundant parametrization. Each output corresponds to
its own affine function. In our case, since we have 4 features and 3 possible output categories, we
need 12 scalars to represent the weights (w with subscripts), and 3 scalars to represent the biases
(b with subscripts). This yields:

$$\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned} \tag{4.1.2}$$

83 https://en.wikipedia.org/wiki/Multi-label_classification
84 https://en.wikipedia.org/wiki/Ordinal_regression
The corresponding neural network diagram is shown in Fig. 4.1.1. Just as in linear regression, we
use a single-layer neural network. And since the calculation of each output, o1 , o2 , and o3 , depends
on all inputs, x1 , x2 , x3 , and x4 , the output layer can also be described as a fully connected layer.
Fig. 4.1.1: Softmax regression is a single-layer neural network.
For a more concise notation we use vectors and matrices: $\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}$ is much better suited for
mathematics and code. Note that we have gathered all of our weights into a 3 × 4 matrix and all
biases $\mathbf{b} \in \mathbb{R}^3$ in a vector.
The Softmax
Assuming a suitable loss function, we could try, directly, to minimize the difference between o and
the labels y. While it turns out that treating classification as a vector-valued regression problem
works surprisingly well, it is nonetheless lacking in the following ways:
• There is no guarantee that the outputs oi sum up to 1 in the way we expect probabilities to
behave.
• There is no guarantee that the outputs oi are even nonnegative, even if their outputs sum up
to 1, or that they do not exceed 1.
Both aspects render the estimation problem difficult to solve and the solution very brittle to out-
liers. For instance, if we assume that there is a positive linear dependency between the number of
bedrooms and the likelihood that someone will buy a house, the probability might exceed 1 when
it comes to buying a mansion! As such, we need a mechanism to “squish” the outputs.
There are many ways we might accomplish this goal. For instance, we could assume that the
outputs $\mathbf{o}$ are corrupted versions of $\mathbf{y}$, where the corruption occurs by means of adding noise $\boldsymbol{\epsilon}$
drawn from a normal distribution. In other words, $\mathbf{y} = \mathbf{o} + \boldsymbol{\epsilon}$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. This is the
so-called probit model85, first introduced by Fechner (1860). While appealing, it does not work quite
as well as the softmax, nor does it lead to a particularly nice optimization problem.
Another way to accomplish this goal (and to ensure nonnegativity) is to use an exponential
function $P(y = i) \propto \exp o_i$. This does indeed satisfy the requirement that the conditional class
probability increases with increasing $o_i$, it is monotonic, and all probabilities are nonnegative. We can
then transform these values so that they add up to 1 by dividing each by their sum. This process
is called normalization. Putting these two pieces together gives us the softmax function:

$$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where} \quad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}. \tag{4.1.3}$$

85 https://en.wikipedia.org/wiki/Probit_model
Note that the largest coordinate of o corresponds to the most likely class according to ŷ. Moreover,
because the softmax operation preserves the ordering among its arguments, we do not need to
compute the softmax to determine which class has been assigned the highest probability.
$$\operatorname*{argmax}_j \hat{y}_j = \operatorname*{argmax}_j o_j. \tag{4.1.4}$$
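The softmax in (4.1.3) and the ordering property in (4.1.4) can be checked numerically. The following is a minimal sketch for a single vector of logits; the helper name `softmax` and the values in `o` are chosen here purely for illustration.

```python
import torch

def softmax(o):
    """Exponentiate the logits and normalize them, as in (4.1.3)."""
    exp_o = torch.exp(o)
    return exp_o / exp_o.sum()

o = torch.tensor([1.0, -2.0, 0.5])  # arbitrary logits chosen for illustration
y_hat = softmax(o)
print(y_hat, y_hat.sum())           # nonnegative entries that sum to 1
print(y_hat.argmax() == o.argmax()) # the largest logit gives the most likely class
```

This direct version is fine for moderate logits. For very large values a numerically safer variant is needed.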
The idea of a softmax dates back to Gibbs, who adapted ideas from physics (Gibbs, 1902). Dating
even further back, Boltzmann, the father of modern thermodynamics, used this trick to model a
distribution over energy states in gas molecules. In particular, he discovered that the prevalence
of a state of energy in a thermodynamic ensemble, such as the molecules in a gas, is proportional
to exp(−E/kT ). Here, E is the energy of a state, T is the temperature, and k is the Boltzmann
constant. When statisticians talk about increasing or decreasing the “temperature” of a statistical
system, they refer to changing T in order to favor lower or higher energy states. Following Gibbsʼ
idea, energy equates to error. Energy-based models (Ranzato et al., 2007) use this point of view
when describing problems in deep learning.
Vectorization
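To improve computational efficiency, we vectorize calculations in minibatches of data. Assume
that we are given a minibatch $\mathbf{X} \in \mathbb{R}^{n \times d}$ of $n$ examples with dimensionality (number of inputs) $d$.
Moreover, assume that we have $q$ categories in the output. Then the weights satisfy $\mathbf{W} \in \mathbb{R}^{d \times q}$ and
the bias satisfies $\mathbf{b} \in \mathbb{R}^{1 \times q}$.

$$\begin{aligned}
\mathbf{O} &= \mathbf{X}\mathbf{W} + \mathbf{b},\\
\hat{\mathbf{Y}} &= \mathrm{softmax}(\mathbf{O}).
\end{aligned} \tag{4.1.5}$$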
This accelerates the dominant operation into a matrix-matrix product XW. Moreover, since each
row in X represents a data example, the softmax operation itself can be computed rowwise: for
each row of O, exponentiate all entries and then normalize them by the sum. Note, though, that
care must be taken to avoid exponentiating and taking logarithms of large numbers, since this can
cause numerical overflow or underflow. Deep learning frameworks take care of this automatically.
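The minibatch computation in (4.1.5) can be sketched in a few lines. The sizes `n`, `d`, `q` and the random data are hypothetical, and subtracting each row's maximum before exponentiating is one common stabilization trick rather than necessarily what any particular framework does internally.

```python
import torch

def softmax_rows(O):
    """Rowwise softmax; subtracting each row's maximum avoids overflow in exp."""
    O_shifted = O - O.max(dim=1, keepdim=True).values
    exp_O = torch.exp(O_shifted)
    return exp_O / exp_O.sum(dim=1, keepdim=True)

torch.manual_seed(0)
n, d, q = 5, 4, 3      # hypothetical minibatch size, inputs, and classes
X = torch.randn(n, d)
W = torch.randn(d, q)
b = torch.zeros(1, q)  # broadcast across the n rows

O = X @ W + b          # (4.1.5): a single matrix-matrix product
Y_hat = softmax_rows(O)
print(Y_hat.shape, Y_hat.sum(dim=1))

# Logits this large overflow a naive exp in float32; the shifted version stays finite
big = torch.tensor([[1000.0, 1001.0, 1002.0]])
print(softmax_rows(big))
```

Shifting every entry of a row by the same constant does not change the softmax output, since the constant cancels between numerator and denominator.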
4.1.2 Loss Function
Now that we have a mapping from features $\mathbf{x}$ to probabilities $\hat{\mathbf{y}}$, we need a way to optimize the
accuracy of this mapping. We will rely on maximum likelihood estimation, the very same concept
that we encountered when providing a probabilistic justification for the mean squared error loss
in Section 3.1.3.

Log-Likelihood
The softmax function gives us a vector ŷ, which we can interpret as (estimated) conditional prob-
abilities of each class, given any input x, such as ŷ1 = P (y = cat|x). In the following we assume
that for a dataset with features X the labels Y are represented using a one-hot encoding label vec-
tor. We can compare the estimates with reality by checking how probable the actual classes are
according to our model, given the features:
$$P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}). \tag{4.1.6}$$

We are allowed to use the factorization since we assume that each label is drawn independently
from its respective distribution $P(\mathbf{y} \mid \mathbf{x}^{(i)})$. Since maximizing the product of terms is awkward,
we take the negative logarithm to obtain the equivalent problem of minimizing the negative
log-likelihood:

$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}), \tag{4.1.7}$$

where for any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $q$ classes, the loss function $l$ is

$$l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^q y_j \log \hat{y}_j. \tag{4.1.8}$$
For reasons explained later on, the loss function in (4.1.8) is commonly called the cross-entropy
loss. Since y is a one-hot vector of length q, the sum over all its coordinates j vanishes for all but
one term. Note that the loss l(y, ŷ) is bounded from below by 0 whenever ŷ is a probability vector:
no single entry is larger than 1, hence their negative logarithm cannot be lower than 0; l(y, ŷ) = 0
only if we predict the actual label with certainty. This can never happen for any finite setting of
the weights because taking a softmax output towards 1 requires taking the corresponding input $o_i$
to infinity (or all other outputs $o_j$ for $j \neq i$ to negative infinity). Even if our model could assign an
output probability of 0, any error made when assigning such high confidence would incur infinite
loss (− log 0 = ∞).
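A short numerical sketch of (4.1.8) shows how the sum collapses to a single term for a one-hot label. The helper `cross_entropy` and the probability vectors below are illustrative assumptions, not values from any experiment in this book.

```python
import torch

def cross_entropy(y, y_hat):
    """Cross-entropy loss of (4.1.8) for a single example."""
    return -(y * torch.log(y_hat)).sum()

y = torch.tensor([0.0, 0.0, 1.0])      # one-hot label: the true class is the third
y_hat = torch.tensor([0.1, 0.2, 0.7])  # hypothetical predicted probabilities

loss = cross_entropy(y, y_hat)
print(loss, -torch.log(y_hat[2]))      # only the true-class term survives

# A more confident correct prediction yields a smaller, but still positive, loss
confident = cross_entropy(y, torch.tensor([0.01, 0.01, 0.98]))
print(confident)
```

The loss only depends on the probability assigned to the true class, so it approaches 0 as that probability approaches 1 and grows without bound as it approaches 0.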
Softmax and Cross-Entropy Loss
Since the softmax function and the corresponding cross-entropy loss are so common, it is worth
understanding a bit better how they are computed. Plugging (4.1.3) into the definition of the loss
in (4.1.8) and using the definition of the softmax we obtain:
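$$\begin{aligned}
l(\mathbf{y}, \hat{\mathbf{y}}) &= -\sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\
&= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j \\
&= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j.
\end{aligned} \tag{4.1.9}$$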
To understand a bit better what is going on, consider the derivative with respect to any logit oj .
We get
$$\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j. \tag{4.1.10}$$

In other words, the derivative is the difference between the probability assigned by our model,
as expressed by the softmax operation, and what actually happened, as expressed by elements in
the one-hot label vector. In this sense, it is very similar to what we saw in regression, where the
gradient was the difference between the observation y and estimate ŷ. This is no coincidence. In
any exponential family model, the gradients of the log-likelihood are given by precisely this term.
This fact makes computing gradients easy in practice.
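The gradient formula in (4.1.10) can be verified with automatic differentiation. The sketch below evaluates the last line of (4.1.9) for arbitrary illustrative logits and compares the gradient computed by autograd with softmax(o) − y.

```python
import torch

o = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)  # arbitrary logits
y = torch.tensor([0.0, 1.0, 0.0])                       # one-hot label

# Last line of (4.1.9): log-sum-exp of the logits minus the true-class logit
loss = torch.logsumexp(o, dim=0) - (y * o).sum()
loss.backward()

expected = torch.softmax(o.detach(), dim=0) - y         # right-hand side of (4.1.10)
print(o.grad)
print(expected)
```

The two printed vectors agree up to floating point error, and their entries sum to zero because both softmax(o) and y sum to 1.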
Now consider the case where we observe not just a single outcome but an entire distribution over
outcomes. We can use the same representation as before for the label y. The only difference is that
rather than a vector containing only binary entries, say (0, 0, 1), we now have a generic probability
vector, say (0.1, 0.2, 0.7). The math that we used previously to define the loss l in (4.1.8) still works
out fine, just that the interpretation is slightly more general. It is the expected value of the loss
for a distribution over labels. This loss is called the cross-entropy loss and it is one of the most
commonly used losses for classification problems. We can demystify the name by introducing
just the basics of information theory. In a nutshell, it measures the number of bits needed to encode
what we see, y, relative to what we predict should happen, ŷ. We provide a very basic explanation in
the following. For further details on information theory see (Cover and Thomas, 1999) or
(MacKay, 2003).
4.1.3 Information Theory Basics
Many deep learning papers use intuition and terms from information theory. To make sense of
them, we need some common language. This is a survival guide. Information theory deals with
the problem of encoding, decoding, transmitting, and manipulating information (also known as
data).
Entropy
The central idea in information theory is to quantify the amount of information contained in data.
This places a limit on our ability to compress data. For a distribution P its entropy is defined as:
$$H[P] = \sum_j -P(j) \log P(j). \tag{4.1.11}$$
One of the fundamental theorems of information theory states that in order to encode data drawn
randomly from the distribution P , we need at least H[P ] “nats” to encode it (Shannon, 1948). If
you wonder what a “nat” is, it is the equivalent of bit but when using a code with base e rather than
one with base 2. Thus, one nat is $\frac{1}{\log(2)} \approx 1.44$ bit.
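As a quick check of (4.1.11) and of the conversion between nats and bits, the sketch below computes the entropy of a uniform distribution over four outcomes. The helper `entropy` is written for this illustration and assumes that all probabilities are strictly positive.

```python
import math
import torch

def entropy(p):
    """Entropy in nats, following (4.1.11); assumes every entry of p is positive."""
    return -(p * torch.log(p)).sum()

p = torch.tensor([0.25, 0.25, 0.25, 0.25])  # uniform over four outcomes
h_nats = entropy(p)
h_bits = h_nats / math.log(2)               # one nat is 1 / log(2) bits
print(h_nats, h_bits)
```

Four equally likely outcomes need two binary digits, which is what the conversion from nats to bits recovers.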
Surprisal
You might be wondering what compression has to do with prediction. Imagine that we have a
stream of data that we want to compress. If it is always easy for us to predict the next token, then
this data is easy to compress. Take the extreme example where every token in the stream always
takes the same value. That is a very boring data stream! And not only it is boring, but it is also
easy to predict. Because they are always the same, we do not have to transmit any information to
communicate the contents of the stream. Easy to predict, easy to compress.
However if we cannot perfectly predict every event, then we might sometimes be surprised. Our
surprise is greater when we assigned an event lower probability. Claude Shannon settled on
$\log \frac{1}{P(j)} = -\log P(j)$ to quantify oneʼs surprisal at observing an event $j$ having assigned it a
(subjective) probability $P(j)$. The entropy defined in (4.1.11) is then the expected surprisal when one
assigned the correct probabilities that truly match the data-generating process.
Cross-Entropy Revisited
So if entropy is the level of surprise experienced by someone who knows the true probability, then
you might be wondering, what is cross-entropy? The cross-entropy from $P$ to $Q$, denoted $H(P, Q)$,
is the expected surprisal of an observer with subjective probabilities $Q$ upon seeing data that was
actually generated according to probabilities $P$. This is given by $H(P, Q) \stackrel{\mathrm{def}}{=} \sum_j -P(j) \log Q(j)$.
The lowest possible cross-entropy is achieved when $P = Q$. In this case, the cross-entropy from
$P$ to $Q$ is $H(P, P) = H(P)$.
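The claim that the cross-entropy is lowest when Q matches P can be illustrated numerically. In the sketch below, the distributions `p` and `q` are hypothetical and the helper `cross_entropy` assumes strictly positive probabilities.

```python
import torch

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes every entry of p and q is positive."""
    return -(p * torch.log(q)).sum()

p = torch.tensor([0.1, 0.2, 0.7])  # hypothetical data-generating distribution
q = torch.tensor([0.3, 0.3, 0.4])  # hypothetical mismatched beliefs

h_pp = cross_entropy(p, p)         # equals the entropy H(P)
h_pq = cross_entropy(p, q)         # an observer with beliefs Q is more surprised
print(h_pp, h_pq)
```

The gap between the two values is the price, in nats, of holding beliefs Q when the data actually follows P.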
In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing
the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of
bits) required to communicate the labels.
4.1.4 Summary and Discussion
In this section, we encountered the first nontrivial loss function, allowing us to optimize over dis-
crete output spaces. Key in its design was that we took a probabilistic approach, treating discrete
categories as instances of draws from a probability distribution. As a side effect, we encountered
the softmax, a convenient activation function that transforms outputs of an ordinary neural network
layer into valid discrete probability distributions. We saw that the derivative of the cross-entropy
loss when combined with softmax behaves very similarly to the derivative of squared error,
namely by taking the difference between the expected behavior and its prediction. And, while we
were only able to scratch the very surface of it, we encountered exciting connections to statistical
physics and information theory.
While this is enough to get you on your way, and hopefully enough to whet your appetite, we
hardly dived deep here. Among other things, we skipped over computational considerations.
Specifically, for any fully connected layer with d inputs and q outputs, the parameterization and
computational cost is O(dq), which can be prohibitively high in practice. Fortunately, this cost
of transforming d inputs into q outputs can be reduced through approximation and compression.
For instance Deep Fried Convnets (Yang et al., 2015) uses a combination of permutations, Fourier
transforms, and scaling to reduce the cost from quadratic to log-linear. Similar techniques work
for more advanced structural matrix approximations (Sindhwani et al., 2015). Lastly, we can use
quaternion-like decompositions to reduce the cost to $O(\frac{dq}{n})$, again if we are willing to trade off a
small amount of accuracy for computational and storage cost (Zhang et al., 2021) based on a
compression factor $n$. This is an active area of research. What makes it challenging is that we do not
necessarily strive for the most compact representation or the smallest number of floating point
operations but rather for the solution that can be executed most efficiently on modern GPUs.
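To make the O(dq) parameter count concrete, the sketch below counts the parameters of a single fully connected layer. The sizes are hypothetical, chosen to match flattened 28 × 28 images and 10 classes.

```python
from torch import nn

d, q = 784, 10  # hypothetical sizes: 28 x 28 pixel inputs, 10 classes
layer = nn.Linear(d, q)
num_params = sum(p.numel() for p in layer.parameters())
print(num_params, d * q + q)  # d * q weights plus q biases
```

The weight matrix dominates the count, so doubling either the number of inputs or the number of outputs roughly doubles both storage and computation.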

Exercises
1. We can explore the connection between exponential families and the softmax in some more
depth.
    1. Compute the second derivative of the cross-entropy loss $l(\mathbf{y}, \hat{\mathbf{y}})$ for the softmax.
    2. Compute the variance of the distribution given by $\mathrm{softmax}(\mathbf{o})$ and show that it matches
    the second derivative computed above.
2. Assume that we have three classes which occur with equal probability, i.e., the probability
vector is $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
    1. What is the problem if we try to design a binary code for it?
    2. Can you design a better code? Hint: what happens if we try to encode two independent
    observations? What if we encode $n$ observations jointly?
3. When encoding signals transmitted over a physical wire, engineers donʼt always use binary
codes. For instance, PAM-3 (see footnote 86) uses three signal levels {−1, 0, 1} as opposed to two levels
{0, 1}. How many ternary units do you need to transmit an integer in the range {0, . . . , 7}?
Why might this be a better idea in terms of electronics?
4. The Bradley-Terry model87 uses a logistic model to capture preferences. For a user to choose
between apples and oranges one assumes scores $o_{\mathrm{apple}}$ and $o_{\mathrm{orange}}$. Our requirements are
that larger scores should lead to a higher likelihood in choosing the associated item and that
the item with the largest score is the most likely one to be chosen (Bradley and Terry, 1952).
    1. Prove that the softmax satisfies this requirement.
    2. What happens if you want to allow for a default option of choosing neither apples nor
    oranges? Hint: now the user has 3 choices.
5. Softmax derives its name from the following mapping: $\mathrm{RealSoftMax}(a, b) = \log(\exp(a) + \exp(b))$.
    1. Prove that $\mathrm{RealSoftMax}(a, b) > \max(a, b)$.
    2. How small can you make the difference between both functions? Hint: without loss of
    generality you can set $b = 0$ and $a \geq b$.
    3. Prove that this holds for $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b)$, provided that $\lambda > 0$.
    4. Show that for $\lambda \to \infty$ we have $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b) \to \max(a, b)$.
    5. What does the soft-min look like?
    6. Extend this to more than two numbers.
6. The function $g(\mathbf{x}) \stackrel{\mathrm{def}}{=} \log \sum_i \exp x_i$ is sometimes also referred to as the log-partition
function88.
    1. Prove that the function is convex. Hint: to do so, use the fact that the first derivative
    amounts to the probabilities from the softmax function and show that the second
    derivative is the variance.
    2. Show that $g$ is translation invariant, i.e., $g(\mathbf{x} + b) = g(\mathbf{x})$.
    3. What happens if some of the coordinates $x_i$ are very large? What happens if theyʼre all
    very small?
    4. Show that if we choose $b = \max_i x_i$ we end up with a numerically stable implementation.
7. Assume that we have some probability distribution $P$. Suppose we pick another distribution
$Q$ with $Q(i) \propto P(i)^\alpha$ for $\alpha > 0$.
    1. Which choice of $\alpha$ corresponds to doubling the temperature? Which choice corresponds
    to halving it?
    2. What happens if we let the temperature converge to 0?
    3. What happens if we let the temperature converge to $\infty$?

86 https://en.wikipedia.org/wiki/Ternary_signal
87 https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model
88 https://en.wikipedia.org/wiki/Partition_function_(mathematics)
4.2 The Image Classification Dataset
One widely used dataset for image classification is the MNIST dataset90 (LeCun et al., 1998)
of handwritten digits. At the time of its release in the 1990s it posed a formidable challenge to most
machine learning algorithms, consisting of 60,000 images of 28 × 28 pixels resolution (plus a test
dataset of 10,000 images). To put things into perspective, at the time, a Sun SPARCStation 5 with a
whopping 64MB of RAM and a blistering 5 MFLOPs was considered state of the art equipment for
machine learning at AT&T Bell Laboratories in 1995. Achieving high accuracy on digit recognition
was a key component in automating letter sorting for the USPS in the 1990s. Deep networks such
as LeNet-5 (LeCun et al., 1995), support vector machines with invariances (Schölkopf et al., 1996),
and tangent distance classifiers (Simard et al., 1998) all achieved error rates below 1%.
For over a decade, MNIST served as the point of reference for comparing machine learning al-
gorithms. While it had a good run as a benchmark dataset, even simple models by todayʼs stan-
dards achieve classification accuracy over 95%, making it unsuitable for distinguishing between
stronger models and weaker ones. Even more so, the dataset allows for very high levels of accu-
racy, not typically seen in many classification problems. This skewed algorithmic development
towards specific families of algorithms that can take advantage of clean datasets, such as active
set methods and boundary-seeking active set algorithms. Today, MNIST serves as more of a sanity
check than as a benchmark. ImageNet (Deng et al., 2009) poses a much more relevant challenge.
Unfortunately, ImageNet is too large for many of the examples and illustrations in this book, as
it would take too long to train to make the examples interactive. As a substitute we will focus our
discussion in the coming sections on the qualitatively similar, but much smaller Fashion-MNIST
dataset (Xiao et al., 2017), which was released in 2017. It contains images of 10 categories of
clothing at 28 × 28 pixels resolution.
%matplotlib inline
import time
import torch
import torchvision
from torchvision import transforms
from d2l import torch as d2l

d2l.use_svg_display()

90 https://en.wikipedia.org/wiki/MNIST_database
4.2.1 Loading the Dataset
Since it is such a frequently used dataset, all major frameworks provide preprocessed versions of
it. We can download and read the Fashion-MNIST dataset into memory using built-in framework
functions.
class FashionMNIST(d2l.DataModule):  #@save
    def __init__(self, batch_size=64, resize=(28, 28)):
        super().__init__()
        # Store batch_size and resize as attributes (self.batch_size is used later)
        self.save_hyperparameters()
        trans = transforms.Compose([transforms.Resize(resize),
                                    transforms.ToTensor()])
        self.train = torchvision.datasets.FashionMNIST(
            root=self.root, train=True, transform=trans, download=True)
        self.val = torchvision.datasets.FashionMNIST(
            root=self.root, train=False, transform=trans, download=True)
Fashion-MNIST consists of images from 10 categories, each represented by 6,000 images in the
training dataset and by 1,000 in the test dataset. A test dataset is used for evaluating model perfor-
mance (it must not be used for training). Consequently the training set and the test set contain
60,000 and 10,000 images, respectively.
data = FashionMNIST(resize=(32, 32))

len(data.train), len(data.val)
(60000, 10000)
The images are grayscale and upscaled to 32 × 32 pixels in resolution above. This is similar to the
original MNIST dataset which consisted of (binary) black and white images. Note, though, that
most modern image data has 3 channels (red, green, blue), and that hyperspectral images can
have in excess of 100 channels (the HyMap sensor has 126 channels). By convention we store an
image as a c × h × w tensor, where c is the number of color channels, h is the height and w is the
width.
data.train[0][0].shape
torch.Size([1, 32, 32])
The categories of Fashion-MNIST have human-understandable names. The following convenience
function converts between numeric labels and their names.
@d2l.add_to_class(FashionMNIST)  #@save
def text_labels(self, indices):
    """Return text labels."""
    labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
              'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [labels[int(i)] for i in indices]
4.2.2 Reading a Minibatch
To make our life easier when reading from the training and test sets, we use the built-in data
iterator rather than creating one from scratch. Recall that at each iteration, a data iterator reads
a minibatch of data with size batch_size. We also randomly shuffle the examples for the training
data iterator.
@d2l.add_to_class(FashionMNIST)  #@save
def get_dataloader(self, train):
    data = self.train if train else self.val
    return torch.utils.data.DataLoader(data, self.batch_size, shuffle=train,
                                       num_workers=self.num_workers)
To see how this works, letʼs load a minibatch of images by invoking the newly-added
train_dataloader method. It contains 64 images.
X, y = next(iter(data.train_dataloader()))
print(X.shape, X.dtype, y.shape, y.dtype)
torch.Size([64, 1, 32, 32]) torch.float32 torch.Size([64]) torch.int64
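The same minibatch shapes can be reproduced without downloading anything. The sketch below is not the Fashion-MNIST pipeline itself: it assumes a synthetic stand-in dataset of random tensors wrapped in PyTorch's `TensorDataset`, purely to show how `DataLoader` slices a dataset into shuffled minibatches.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for Fashion-MNIST (an assumption made to avoid any download):
# 256 random single-channel 32 x 32 "images" with integer labels in {0, ..., 9}
images = torch.rand(256, 1, 32, 32)
labels = torch.randint(0, 10, (256,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=64, shuffle=True)
X, y = next(iter(loader))
print(X.shape, X.dtype, y.shape, y.dtype)
print(len(loader))  # number of minibatches per epoch
```

Each iteration yields a pair of tensors whose first dimension is the batch size, and the number of minibatches per epoch is the dataset size divided by the batch size, rounded up.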
Letʼs look at the time it takes to read the images. Even though itʼs a built-in loader, it isnʼt blazingly
fast. Nonetheless, this is sufficient since processing images with a deep network takes quite a bit
longer. Hence itʼs good enough that training a network wonʼt be IO constrained.
tic = time.time()
for X, y in data.train_dataloader():
    continue
f'{time.time() - tic:.2f} sec'
'3.72 sec'
4.2.3 Visualization
Weʼll be using the Fashion-MNIST dataset quite frequently. A convenience function show_images
can be used to visualize the images and the associated labels. Details of its implementation are
deferred to the appendix.
def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5):  #@save
    """Plot a list of images."""

Letʼs put it to good use. In general, it is a good idea to visualize and inspect data that youʼre training
on. Humans are very good at spotting unusual aspects and as such, visualization serves as an
additional safeguard against mistakes and errors in the design of experiments. Here are the images
and their corresponding labels (in text) for the first few examples in the validation dataset.
@d2l.add_to_class(FashionMNIST)  #@save
def visualize(self, batch, nrows=1, ncols=8, labels=[]):
    X, y = batch
    if not labels:
        labels = self.text_labels(y)
    d2l.show_images(X.squeeze(1), nrows, ncols, titles=labels)

batch = next(iter(data.val_dataloader()))
data.visualize(batch)
We are now ready to work with the Fashion-MNIST dataset in the sections that follow.
4.2.4 Summary
We now have a slightly more realistic dataset to use for classification. Fashion-MNIST is an apparel
classification dataset consisting of images representing 10 categories. We will use this dataset in
subsequent sections and chapters to evaluate various network designs, from a simple linear model
to advanced residual networks. As we commonly do with images, we read them as a tensor of
shape (batch size, number of channels, height, width). For now, we only have one channel as the
images are grayscale (the visualization above uses a false color palette for improved visibility).
Lastly, data iterators are a key component for efficient performance. For instance, we might use
GPUs for efficient image decompression, video transcoding, or other preprocessing. Whenever
possible, you should rely on well-implemented data iterators that exploit high-performance com-
puting to avoid slowing down your training loop.
Exercises
1. Does reducing the batch_size (for instance, to 1) affect the reading performance?
2. The data iterator performance is important. Do you think the current implementation is fast
enough? Explore various options to improve it. Use a system profiler to find out where the
bottlenecks are.
3. Check out the frameworkʼs online API documentation. Which other datasets are available?

4.3 The Base Classification Model
You may have noticed that the implementations from scratch and the concise implementation
using framework functionality were quite similar in the case of regression. The same is true for
classification. Since a great many models in this book deal with classification, it is worth adding
some functionality to support this setting specifically. This section provides a base class for clas-
sification models to simplify future code.
import torch
from d2l import torch as d2l
4.3.1 The Classifier Class
We define the Classifier class below. In the validation_step we report both the loss value and
the classification accuracy on a validation batch. We draw an update for every num_val_batches
batches. This has the benefit of generating the averaged loss and accuracy on the whole validation
data. These average numbers are not exactly correct if the last batch contains fewer examples, but
we ignore this minor difference to keep the code simple.
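class Classifier(d2l.Module):  #@save
    """The base class of classification models."""
    def validation_step(self, batch):
        # The last element of the batch holds the labels; the rest are inputs
        Y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)
        self.plot('acc', self.accuracy(Y_hat, batch[-1]), train=False)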
By default we use a stochastic gradient descent optimizer, operating on minibatches, just as we
did in the context of linear regression.
@d2l.add_to_class(d2l.Module)  #@save
def configure_optimizers(self):
    return torch.optim.SGD(self.parameters(), lr=self.lr)
4.3.2 Accuracy
Given the predicted probability distribution y_hat, we typically choose the class with the high-
est predicted probability whenever we must output a hard prediction. Indeed, many applications
require that we make a choice. For instance, Gmail must categorize an email into “Primary”, “So-
cial”, “Updates”, “Forums”, or “Spam”. It might estimate probabilities internally, but at the end of
the day it has to choose one among the classes.
When predictions are consistent with the label class y, they are correct. The classification ac-
curacy is the fraction of all predictions that are correct. Although it can be difficult to optimize
accuracy directly (it is not differentiable), it is often the performance measure that we care about
the most. It is often the relevant quantity in benchmarks. As such, we will nearly always report it
when training classifiers.
Accuracy is computed as follows. First, if y_hat is a matrix, we assume that the second dimension
stores prediction scores for each class. We use argmax to obtain the predicted class by the index
for the largest entry in each row. Then we compare the predicted class with the ground-truth y
elementwise. Since the equality operator == is sensitive to data types, we convert y_hatʼs data type
to match that of y. The result is a tensor containing entries of 0 (false) and 1 (true). Taking the
mean yields the fraction of correct predictions, i.e., the accuracy.
@d2l.add_to_class(Classifier)  #@save
def accuracy(self, Y_hat, Y, averaged=True):
    """Compute the fraction of correct predictions (per-example results if averaged=False)."""
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    preds = Y_hat.argmax(axis=1).type(Y.dtype)
    compare = (preds == Y.reshape(-1)).type(torch.float32)
    return compare.mean() if averaged else compare
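As a quick sanity check of the steps described above, the following standalone sketch applies the same argmax-and-compare logic to a tiny hand-made batch. The function name batch_accuracy and the example scores are made up for illustration; they are not part of d2l.

```python
import torch

def batch_accuracy(Y_hat, Y):
    """Fraction of rows of Y_hat whose argmax equals the label in Y."""
    # Index of the largest score in each row, cast to the label dtype
    preds = Y_hat.argmax(dim=1).type(Y.dtype)
    # The comparison yields booleans; cast to float so that we can average
    return (preds == Y).type(torch.float32).mean()

# Three examples, three classes (illustrative scores)
Y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5],
                      [0.7, 0.2, 0.1]])
Y = torch.tensor([2, 0, 0])
acc = batch_accuracy(Y_hat, Y)  # rows 0 and 2 are correct, row 1 is not
print(acc)
```

Two of the three predictions match their labels, so the result is 2/3. Note that the scores need not be probabilities: argmax only depends on their ordering within each row.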
4.3.3 Summary
Classification is a sufficiently common problem that it warrants its own convenience functions.
Of central importance in classification is the accuracy of the classifier. Note that while we often
care primarily about accuracy, we train classifiers to optimize a variety of other objectives for sta-
tistical and computational reasons. However, regardless of which loss function was minimized
during training, itʼs useful to have a convenience method for assessing the accuracy of our classi-
fier empirically.
Exercises
1. Denote by $L_\mathrm{v}$ the validation loss, and let $L_\mathrm{v}^\mathrm{q}$ be its quick and dirty estimate computed by the
loss function averaging in this section. Lastly, denote by $l_\mathrm{v}^\mathrm{b}$ the loss on the last minibatch.
Express $L_\mathrm{v}$ in terms of $L_\mathrm{v}^\mathrm{q}$, $l_\mathrm{v}^\mathrm{b}$, and the sample and minibatch sizes.
2. Show that the quick and dirty estimate $L_\mathrm{v}^\mathrm{q}$ is unbiased. That is, show that $E[L_\mathrm{v}] = E[L_\mathrm{v}^\mathrm{q}]$.
Why would you still want to use $L_\mathrm{v}$ instead?
3. Given a multiclass classification loss, denoting by $l(y, y')$ the penalty of estimating $y'$ when
we see $y$ and given a probability $p(y \mid \mathbf{x})$, formulate the rule for an optimal selection of $y'$. Hint:
express the expected loss, using $l$ and $p(y \mid \mathbf{x})$.
4.4 Softmax Regression Implementation from Scratch
Because softmax regression is so fundamental, we believe that you ought to know how to imple-
ment it yourself. Here, we limit ourselves to defining the softmax-specific aspects of the model
and reuse the other components from our linear regression section, including the training loop.
import torch
from d2l import torch as d2l

4.4.1 The Softmax
Letʼs begin with the most important part: the mapping from scalars to probabilities. For a re-
fresher, recall the operation of the sum operator along specific dimensions in a tensor, as dis-
cussed in Section 2.3.6 and Section 2.3.7. Given a matrix X we can sum over all elements (by de-
fault) or only over elements in the same axis. The axis variable lets us compute row and column
sums:
X = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
X.sum(0, keepdims=True), X.sum(1, keepdims=True)
(tensor([[5., 7., 9.]]),
 tensor([[ 6.],
         [15.]]))
Computing the softmax requires three steps: (i) exponentiation of each term; (ii) a sum over each
row to compute the normalization constant for each example; (iii) division of each row by its nor-
malization constant, ensuring that the result sums to 1.
$$\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}. \tag{4.4.1}$$
The (logarithm of the) denominator is called the (log) partition function. It was introduced in statis-
tical physics93 to sum over all possible states in a thermodynamic ensemble. The implementation
is straightforward:
def softmax(X):
    X_exp = torch.exp(X)
    partition = X_exp.sum(1, keepdims=True)
    return X_exp / partition  # The broadcasting mechanism is applied here
For any input X, we turn each element into a non-negative number. Each row sums up to 1, as is
required for a probability. Caution: the code above is not robust against very large or very small
arguments. While this is sufficient to illustrate what is happening, you should not use this code
verbatim for any serious purpose. Deep learning frameworks have such protections built-in and
we will be using the built-in softmax going forward.
X = torch.rand((2, 5))
X_prob = softmax(X)
X_prob, X_prob.sum(1)
(tensor([[0.2018, 0.1706, 0.2203, 0.1469, 0.2604],
         [0.2243, 0.1840, 0.2254, 0.2254, 0.1409]]),
 tensor([1.0000, 1.0000]))
93
https://en.wikipedia.org/wiki/Partition_function_(statistical_mechanics)

4.4.2 The Model
We now have everything that we need to implement the softmax regression model. As in our
linear regression example, each instance will be represented by a fixed-length vector. Since the
raw data here consists of 28 × 28 pixel images, we flatten each image, treating them as vectors of
length 784. In later chapters, we will introduce convolutional neural networks, which exploit the
spatial structure in a more satisfying way.
In softmax regression, the number of outputs from our network should be equal to the number
of classes. Since our dataset has 10 classes, our network has an output dimension of 10. Conse-
quently, our weights constitute a 784 × 10 matrix plus a 1 × 10 dimensional row vector for the
biases. As with linear regression, we initialize the weights W with Gaussian noise. The biases are
initialized as zeros.
class SoftmaxRegressionScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, lr, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()  # Makes lr available as self.lr
        self.W = torch.normal(0, sigma, size=(num_inputs, num_outputs),
                              requires_grad=True)
        self.b = torch.zeros(num_outputs, requires_grad=True)

    def parameters(self):
        return [self.W, self.b]
The code below defines how the network maps each input to an output. Note that we flatten each
28 × 28 pixel image in the batch into a vector using reshape before passing the data through our
model.
@d2l.add_to_class(SoftmaxRegressionScratch)
def forward(self, X):
    # Flatten each image into a row vector of length num_inputs
    X = X.reshape((-1, self.W.shape[0]))
    return softmax(torch.matmul(X, self.W) + self.b)
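To make the shapes in this forward pass concrete, here is a minimal standalone sketch that traces a small random stand-in minibatch through the same reshape, matrix product, and softmax. It uses PyTorchʼs built-in torch.softmax rather than the softmax function defined above so that it runs on its own; the batch size of 4 and the random inputs are arbitrary choices for illustration.

```python
import torch

num_inputs, num_outputs = 784, 10
W = torch.normal(0, 0.01, size=(num_inputs, num_outputs))
b = torch.zeros(num_outputs)

X = torch.rand(4, 1, 28, 28)           # stand-in for 4 grayscale 28 x 28 images
X_flat = X.reshape((-1, W.shape[0]))   # shape (4, 784): one row per image
O = torch.matmul(X_flat, W) + b        # shape (4, 10); b is broadcast over rows
Y_prob = torch.softmax(O, dim=1)       # each row is a distribution over classes

print(X_flat.shape, O.shape, Y_prob.sum(dim=1))
```

Using -1 in reshape lets the batch dimension be inferred, so the same code works for the last, possibly smaller, minibatch.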
4.4.3 The Cross-Entropy Loss
Next we need to implement the cross-entropy loss function (introduced in Section 4.1.2). This
may be the most common loss function in all of deep learning. At the moment, applications of
deep learning easily cast as classification problems far outnumber those better treated as regression
problems.
Recall that cross-entropy takes the negative log-likelihood of the predicted probability assigned to
the true label. For efficiency we avoid Python for-loops and use indexing instead. In particular,
the one-hot encoding in y allows us to select the matching terms in ŷ.
To see this in action we create sample data y_hat with 2 examples of predicted probabilities over
3 classes and their corresponding labels y. The correct labels are 0 and 2 respectively (i.e., the
first and third class). Using y as the indices of the probabilities in y_hat, we can pick out terms
efficiently.
y = torch.tensor([0, 2])
y_hat = torch.tensor([[0.1, 0.3, 0.6], [0.3, 0.2, 0.5]])
y_hat[[0, 1], y]
tensor([0.1000, 0.5000])
Now we can implement the cross-entropy loss function by averaging over the logarithms of the
selected probabilities.
def cross_entropy(y_hat, y):
    # Select the predicted probability of the true class for each example
    return -torch.log(y_hat[range(len(y_hat)), y]).mean()

cross_entropy(y_hat, y)
tensor(1.4979)
@d2l.add_to_class(SoftmaxRegressionScratch)
def loss(self, y_hat, y):
    return cross_entropy(y_hat, y)
4.4.4 Training
We reuse the fit method defined in Section 3.4 to train the model with 10 epochs. Note that
the number of epochs (max_epochs), the minibatch size (batch_size), and the learning rate (lr)
are all adjustable hyperparameters. That means that while these values are not learned during our
primary training loop, they still influence the performance of our model, both vis-a-vis training
and generalization performance. In practice you will want to choose these values based on the
validation split of the data and then to ultimately evaluate your final model on the test split. As
discussed in Section 3.6.3, we will treat the test data of Fashion-MNIST as the validation set, thus
reporting validation loss and validation accuracy on this split.
data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegressionScratch(num_inputs=784, num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)

4.4.5 Prediction
Now that training is complete, our model is ready to classify some images.
X, y = next(iter(data.val_dataloader()))
preds = model(X).argmax(axis=1)
preds.shape
torch.Size([256])
We are more interested in the images we label incorrectly. We visualize them by comparing their
actual labels (first line of text output) with the predictions from the model (second line of text
output).
wrong = preds.type(y.dtype) != y
X, y, preds = X[wrong], y[wrong], preds[wrong]
labels = [a + '\n' + b for a, b in zip(
    data.text_labels(y), data.text_labels(preds))]
data.visualize([X, y], labels=labels)
4.4.6 Summary
By now we are starting to get some experience with solving linear regression and classification
problems. With it, we have reached what would arguably be the state of the art of statistical
modeling in the 1960s and 1970s. In the next section, weʼll show you how to leverage deep learning
frameworks to implement this model much more efficiently.
Exercises
1. In this section, we directly implemented the softmax function based on the mathematical
definition of the softmax operation. As discussed in Section 4.1 this can cause numerical
instabilities.
   1. Test whether softmax still works correctly if an input has a value of 100.
   2. Test whether softmax still works correctly if the largest of all inputs is smaller than −100.
   3. Implement a fix by looking at the value relative to the largest entry in the argument.
2. Implement a cross_entropy function that follows the definition of the cross-entropy loss
function $\sum_i y_i \log \hat{y}_i$.
   1. Try it out in the code example above.
   2. Why do you think it runs more slowly?
   3. Should you use it? In which cases would it make sense?
   4. What do you need to be careful of? Hint: consider the domain of the logarithm.
3. Is it always a good idea to return the most likely label? For example, would you do this for
medical diagnosis? How would you try to address this?
4. Assume that we want to use softmax regression to predict the next word based on some
features. What are some problems that might arise from a large vocabulary?
5. Experiment with the hyperparameters of the code above. In particular:
1. Plot how the validation loss changes as you change the learning rate.
2. Do the validation and training loss change as you change the minibatch size? How large
or small do you need to go before you see an effect?
4.5 Concise Implementation of Softmax Regression
Just as high-level deep learning frameworks made it easier to implement linear regression (see
Section 3.5), they are similarly convenient here.
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
4.5.1 Defining the Model
As in Section 3.5, we construct our fully connected layer using the built-in layer. The built-in
__call__ method then invokes forward whenever we need to apply the network to some input.
We use a Flatten layer to convert the 4th order tensor X to 2nd order by keeping the dimensionality
along the first axis unchanged.
class SoftmaxRegression(d2l.Classifier):
    def __init__(self, num_outputs, lr):
        super().__init__()
        self.save_hyperparameters()  # Makes lr available as self.lr
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.LazyLinear(num_outputs))

    def forward(self, X):
        return self.net(X)

4.5.2 Softmax Revisited
In Section 4.4 we calculated our modelʼs output and applied the cross-entropy loss. While this is
perfectly reasonable mathematically, it is risky computationally, due to numerical underflow and
overflow in the exponentiation.
Recall that the softmax function computes probabilities via $\hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}$. If some of the $o_k$ are
very large, i.e., very positive, then $\exp(o_k)$ might be larger than the largest number we can have for
certain data types. This is called overflow. Likewise, if all arguments are very negative, we will get
underflow. For instance, single precision floating point numbers approximately cover the range
of $10^{-38}$ to $10^{38}$. As such, if the largest term in $\mathbf{o}$ lies outside the interval $[-90, 90]$, the result will
not be stable. A solution to this problem is to subtract $\bar{o} \stackrel{\mathrm{def}}{=} \max_k o_k$ from all entries:
$$\hat{y}_j = \frac{\exp o_j}{\sum_k \exp o_k} = \frac{\exp(o_j - \bar{o}) \exp \bar{o}}{\sum_k \exp(o_k - \bar{o}) \exp \bar{o}} = \frac{\exp(o_j - \bar{o})}{\sum_k \exp(o_k - \bar{o})}. \tag{4.5.1}$$
By construction we know that oj − ō ≤ 0 for all j. As such, for a q-class classification problem,
the denominator is contained in the interval [1, q]. Moreover, the numerator never exceeds 1, thus
preventing numerical overflow. Numerical underflow only occurs when exp(oj − ō) numerically
evaluates as 0. Nonetheless, a few steps down the road we might find ourselves in trouble when we
want to compute log ŷj as log 0. In particular, in backpropagation, we might find ourselves faced
with a screenful of the dreaded NaN (Not a Number) results.
Fortunately, we are saved by the fact that even though we are computing exponential functions,
we ultimately intend to take their log (when calculating the cross-entropy loss). By combining
softmax and cross-entropy, we can escape the numerical stability issues altogether. We have:
$$\log \hat{y}_j = \log \frac{\exp(o_j - \bar{o})}{\sum_k \exp(o_k - \bar{o})} = o_j - \bar{o} - \log \sum_k \exp(o_k - \bar{o}). \tag{4.5.2}$$
This avoids both overflow and underflow. We will want to keep the conventional softmax func-
tion handy in case we ever want to evaluate the output probabilities by our model. But instead of
passing softmax probabilities into our new loss function, we just pass the logits and compute the
softmax and its log all at once inside the cross-entropy loss function, which does smart things like
the “LogSumExp trick”95 .
@d2l.add_to_class(d2l.Classifier)  #@save
def loss(self, Y_hat, Y, averaged=True):
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    Y = Y.reshape((-1,))
    return F.cross_entropy(
        Y_hat, Y, reduction='mean' if averaged else 'none')
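To see why the shift by the maximum in (4.5.1) and (4.5.2) matters, the following sketch contrasts a naive log-softmax with one that follows (4.5.2) directly, and compares both against PyTorchʼs built-in torch.log_softmax. The helper names naive_log_softmax and stable_log_softmax and the example logits are illustrative assumptions, not part of d2l, and the built-in function may be implemented differently internally.

```python
import torch

def naive_log_softmax(O):
    """Exponentiate, normalize, then take the log (overflows for large logits)."""
    O_exp = torch.exp(O)
    return torch.log(O_exp / O_exp.sum(dim=1, keepdim=True))

def stable_log_softmax(O):
    """Follow (4.5.2): subtract the row maximum before exponentiating."""
    O_shift = O - O.max(dim=1, keepdim=True).values
    return O_shift - torch.log(torch.exp(O_shift).sum(dim=1, keepdim=True))

O = torch.tensor([[100.0, 0.0, -100.0]])  # float32 logits, one of them large
naive = naive_log_softmax(O)              # exp(100) overflows to inf in float32
stable = stable_log_softmax(O)            # all intermediate terms stay finite
reference = torch.log_softmax(O, dim=1)   # built-in implementation

print(naive, stable, reference)
```

In the naive version the overflow in exp(100) turns the normalization into inf/inf, which yields NaN, whereas the shifted version stays finite and agrees with the built-in result.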
95
https://en.wikipedia.org/wiki/LogSumExp

4.5.3 Training
Next we train our model. As before, we use Fashion-MNIST images, flattened to 784-dimensional
feature vectors.
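data = d2l.FashionMNIST(batch_size=256)
model = SoftmaxRegression(num_outputs=10, lr=0.1)
trainer = d2l.Trainer(max_epochs=10)
trainer.fit(model, data)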
As before, this algorithm converges to a solution that achieves a decent accuracy, albeit this time
with fewer lines of code than before.
4.5.4 Summary
High-level APIs are very convenient at hiding potentially dangerous aspects from their user, such
as numerical stability. Moreover, they allow users to design models concisely with very few lines
of code. This is both a blessing and a curse. The obvious benefit is that it makes things highly
accessible, even to engineers who never took a single class of statistics in their life (in fact, this
is one of the target audiences of the book). But hiding the sharp edges also comes with a price: a
disincentive to add new and different components on your own, since thereʼs little muscle memory
for doing it. Moreover, it makes it more difficult to fix things whenever the protective padding of
a framework fails to cover all the corner cases entirely. Again, this is due to lack of familiarity.
As such, we strongly urge you to review both the bare bones and the elegant versions of many of
the implementations that follow. While we emphasize ease of understanding, the implementa-
tions are nonetheless usually quite performant (convolutions are the big exception here). It is our
intention to allow you to build on these when you invent something new that no framework can
give you.

Exercises
1. Deep learning uses many different number formats, including FP64 double precision (used
extremely rarely), FP32 single precision, BFLOAT16 (good for compressed representations),
FP16 (very unstable), TF32 (a new format from NVIDIA), and INT8. Compute the smallest
and largest argument of the exponential function for which the result does not lead to a
numerical underflow or overflow.
2. INT8 is a very limited format with nonzero numbers from 1 to 255. How could you extend its
dynamic range without using more bits? Do standard multiplication and addition still work?
3. Increase the number of epochs for training. Why might the validation accuracy decrease
after a while? How could we fix this?
4. What happens as you increase the learning rate? Compare the loss curves for several learn-
ing rates. Which one works better? When?
4.6 Generalization in Classification
So far, we have focused on how to tackle multiclass classification problems by training (linear)
neural networks with multiple outputs and softmax functions. Interpreting our modelʼs outputs
as probabilistic predictions, we motivated and derived the cross-entropy loss function, which cal-
culates the negative log likelihood that our model (for a fixed set of parameters) assigns to the
actual labels. And finally, we put these tools into practice by fitting our model to the training set.
However, as always, our goal is to learn general patterns, as assessed empirically on previously un-
seen data (the test set). High accuracy on the training set means nothing. Whenever each of our
inputs is unique (and indeed this is true for most high-dimensional datasets), we can attain per-
fect accuracy on the training set by just memorizing the dataset on the first training epoch, and
subsequently looking up the label whenever we see a new image. And yet, memorizing the exact
labels associated with the exact training examples does not tell us how to classify new examples.
Absent further guidance, we might have to fall back on random guessing whenever we encounter
new examples.
A number of burning questions demand immediate attention:
1. How many test examples do we need to precisely estimate the accuracy of our classifiers on the
underlying population?
2. What happens if we keep evaluating models on the same test set repeatedly?
3. Why should we expect that fitting our linear models to the training set should fare any better
than our naive memorization scheme?
While Section 3.6 introduced the basics of overfitting and generalization in the context of linear
regression, this chapter will go a little deeper, introducing some of the foundational ideas of sta-
tistical learning theory. It turns out that we often can guarantee generalization a priori: for many
models, and for any desired upper bound on the generalization gap ϵ, we can often determine
some required number of samples n such that if our training set contains at least n samples, then
our empirical error will lie within ϵ of the true error, for any data generating distribution. Unfortu-
nately, it also turns out that while these sorts of guarantees provide a profound set of intellectual
building blocks, they are of limited practical utility to the deep learning practitioner. In short,
these guarantees suggest that ensuring generalization of deep neural networks a priori requires
an absurd number of examples (perhaps trillions or more), even when we find that, on the tasks
we care about, deep neural networks typically generalize remarkably well with far fewer examples
(thousands). Thus deep learning practitioners often forgo a priori guarantees altogether,
instead employing methods on the basis that they have generalized well on similar problems in the
past, and certifying generalization post hoc through empirical evaluations. When we get to Chapter 5,
we will revisit generalization and provide a light introduction to the vast scientific literature
that has sprung up in attempts to explain why deep neural networks generalize in practice.
4.6.1 The Test Set
Since we have already begun to rely on test sets as the gold standard method for assessing gener-
alization error, letʼs get started by discussing the properties of such error estimates. Letʼs focus on
a fixed classifier $f$, without worrying about how it was obtained. Moreover suppose that we possess
a fresh dataset of examples $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n$ that were not used to train the classifier $f$. The
empirical error of our classifier $f$ on $\mathcal{D}$ is simply the fraction of instances for which the prediction
$f(\mathbf{x}^{(i)})$ disagrees with the true label $y^{(i)}$, and is given by the following expression:
$$\epsilon_\mathcal{D}(f) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(f(\mathbf{x}^{(i)}) \neq y^{(i)}). \tag{4.6.1}$$
By contrast, the population error is the expected fraction of examples in the underlying population
(some distribution $P(X, Y)$ characterized by probability density function $p(\mathbf{x}, y)$) for which our
classifier disagrees with the true label:
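$$\epsilon(f) = E_{(\mathbf{x}, y) \sim P} \mathbf{1}(f(\mathbf{x}) \neq y) = \int\int \mathbf{1}(f(\mathbf{x}) \neq y) \, p(\mathbf{x}, y) \; d\mathbf{x} \, dy. \tag{4.6.2}$$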
While ϵ(f ) is the quantity that we actually care about, we cannot observe it directly, just as we
cannot directly observe the average height in a large population without measuring every single
person. We can only estimate this quantity based on samples. Because our test set D is statistically
representative of the underlying population, we can view ϵD (f ) as a statistical estimator of the
population error ϵ(f ). Moreover, because our quantity of interest ϵ(f ) is an expectation (of the
random variable 1(f (X) ̸= Y )) and the corresponding estimator ϵD (f ) is the sample average,
estimating the population error is simply the classic problem of mean estimation, which you may
recall from Section 2.6.
An important classical result from probability theory called the central limit theorem guarantees
that whenever we possess $n$ random samples $a_1, \ldots, a_n$ drawn from any distribution with mean $\mu$
and standard deviation $\sigma$, as the number of samples $n$ approaches infinity, the sample average $\hat{\mu}$
approximately tends towards a normal distribution centered at the true mean and with standard
deviation $\sigma/\sqrt{n}$. Already, this tells us something important: as the number of examples grows
large, our test error $\epsilon_\mathcal{D}(f)$ should approach the true error $\epsilon(f)$ at a rate of $\mathcal{O}(1/\sqrt{n})$. Thus, to
estimate our test error twice as precisely, we must collect four times as large a test set. To estimate
it one hundred times as precisely, we must collect ten thousand times as large a test set.
In general, such a rate of $\mathcal{O}(1/\sqrt{n})$ is often the best we can hope for in statistics.
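The $1/\sqrt{n}$ scaling can be checked with a small simulation. The sketch below is only an illustration under assumed values: it picks a hypothetical population error of 0.2, repeatedly draws synthetic test sets of size n in which each example is misclassified independently with that probability, and compares the spread of the resulting test errors with the standard deviation predicted for an average of 0/1 outcomes, $\sqrt{\epsilon(1-\epsilon)/n}$. All function names are made up for this example.

```python
import math
import random

def predicted_std(true_error, n):
    """Standard deviation sqrt(eps * (1 - eps) / n) of an average of n 0/1 errors."""
    return math.sqrt(true_error * (1 - true_error) / n)

def simulate_test_errors(true_error, n, repeats, seed=0):
    """Empirical error rates on `repeats` independent synthetic test sets of size n."""
    rng = random.Random(seed)
    return [sum(rng.random() < true_error for _ in range(n)) / n
            for _ in range(repeats)]

def sample_std(values):
    """Unbiased sample standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

true_error = 0.2  # hypothetical population error, chosen for illustration
for n in (100, 400, 1600):
    errors = simulate_test_errors(true_error, n, repeats=1000)
    print(n, round(predicted_std(true_error, n), 4), round(sample_std(errors), 4))
```

Each fourfold increase in n should roughly halve both columns, matching the claim that twice the precision costs four times the data.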
Now that we know something about the asymptotic rate at which our test error ϵD (f ) converges
to the true error ϵ(f ), we can zoom in on some important details. Recall that the random variable
of interest 1(f (X) ̸= Y ) can only take values 0 and 1 and thus is a Bernoulli random variable,
characterized by a parameter indicating the probability that it takes value 1. Here, 1 means that
our classifier made an error, so the parameter of our random variable is actually the true error
rate $\epsilon(f)$. The variance $\sigma^2$ of a Bernoulli depends on its parameter (here, $\epsilon(f)$) according to the
expression $\epsilon(f)(1 - \epsilon(f))$. While $\epsilon(f)$ is initially unknown, we know that it cannot be greater
than 1. A little investigation of this function reveals that our variance is highest when the true error
rate is close to 0.5 and can be far lower when it is close to 0 or close to 1. This tells us that the
asymptotic standard deviation of our estimate $\epsilon_\mathcal{D}(f)$ of the error $\epsilon(f)$ (over the choice of the $n$ test
samples) cannot be any greater than $\sqrt{0.25/n}$.
If we ignore the fact that this rate characterizes behavior as the test set size approaches infinity
rather than when we possess finite samples, this tells us that if we want our test error ϵD (f ) to
approximate the population error ϵ(f ) such that one standard deviation corresponds to an interval
of ±0.01, then we should collect roughly 2500 samples. If we want to fit two standard deviations
in that range and thus be 95% confident that $\epsilon_\mathcal{D}(f) \in \epsilon(f) \pm 0.01$, then we will need 10000 samples!
This turns out to be the size of the test sets for many popular benchmarks in machine learning.
You might be surprised to find out that thousands of applied deep learning papers get published
every year making a big deal out of error rate improvements of 0.01 or less. Of course, when the
error rates are much closer to 0, then an improvement of 0.01 can indeed be a big deal.
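The sample sizes quoted above follow from solving $k \sqrt{\epsilon(1-\epsilon)/n} = \delta$ for $n$, where $k$ is the number of standard deviations that should fit inside the half-width $\delta$. A minimal sketch of that arithmetic, with a hypothetical helper name and the worst case $\epsilon = 0.5$ as the default:

```python
def asymptotic_test_set_size(half_width, num_std=1.0, true_error=0.5):
    """Solve num_std * sqrt(eps * (1 - eps) / n) = half_width for n.

    The default true_error = 0.5 maximizes the Bernoulli variance (0.25),
    so it gives the most conservative answer.
    """
    variance = true_error * (1 - true_error)
    return variance * (num_std / half_width) ** 2

n_one_std = asymptotic_test_set_size(0.01, num_std=1)  # one standard deviation
n_two_std = asymptotic_test_set_size(0.01, num_std=2)  # two standard deviations
n_low_err = asymptotic_test_set_size(0.01, num_std=2, true_error=0.1)

print(round(n_one_std), round(n_two_std), round(n_low_err))
```

The first two values reproduce the 2500 and 10000 samples from the text. The third shows how the requirement shrinks when the true error is far from 0.5, here for an assumed error rate of 0.1.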
One pesky feature of our analysis thus far is that it really only tells us about asymptotics, i.e.,
how the relationship between ϵD and ϵ evolves as our sample size goes to infinity. Fortunately,
because our random variable is bounded, we can obtain valid finite sample bounds by applying
an inequality due to Hoeffding (1963):
$$P(\epsilon_\mathcal{D}(f) - \epsilon(f) \geq t) < \exp\left(-2nt^2\right). \tag{4.6.3}$$
Solving for the smallest dataset size that would allow us to conclude with 95% confidence that the
distance t between our estimate ϵD (f ) and the true error rate ϵ(f ) does not exceed 0.01, you will
find that roughly 15000 examples are required as compared to the 10000 examples suggested by
the asymptotic analysis above. If you go deeper into statistics you will find that this trend holds
generally. Guarantees that hold even in finite samples are typically slightly more conservative.
Note that in the scheme of things, these numbers are not so far apart, reflecting the general use-
fulness of asymptotic analysis for giving us ballpark figures even if not guarantees we can take to
court.
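The figure of roughly 15000 examples can be reproduced by setting the right-hand side of (4.6.3) equal to the allowed failure probability and solving for $n$, which gives $n = \log(1/\alpha) / (2t^2)$. The sketch below does this for the one-sided bound exactly as stated in (4.6.3); the function name and the symbol $\alpha$ for the failure probability are introduced here for illustration only.

```python
import math

def hoeffding_test_set_size(t, alpha):
    """Smallest integer n with exp(-2 * n * t**2) <= alpha, per the bound (4.6.3)."""
    return math.ceil(math.log(1 / alpha) / (2 * t ** 2))

# 95% confidence (alpha = 0.05) that the deviation does not exceed t = 0.01
n = hoeffding_test_set_size(t=0.01, alpha=0.05)
print(n)
```

The result is just under 15000, about 1.5 times the 10000 examples suggested by the asymptotic analysis.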
4.6.2 Test Set Reuse
In some sense, you are now set up to succeed at conducting empirical machine learning research.
Nearly all practical models are developed and validated based on test set performance and you are
now a master of the test set. For any fixed classifier f , you know to evaluate its test error ϵD (f ),
and know precisely what can (and canʼt) be said about its population error ϵ(f ).
So letʼs say that you take this knowledge and prepare to train your first model f1 . Knowing just how
confident you need to be in the performance of your classifierʼs error rate you apply our analysis
above to determine an appropriate number of examples to set aside for the test set. Moreover, letʼs
assume that you took the lessons from Section 3.6 to heart and made sure to preserve the sanctity
of the test set by conducting all of your preliminary analysis, hyperparameter tuning, and even
selection among multiple competing model architectures on a validation set. Finally you evaluate
your model f1 on the test set and report an unbiased estimate of the population error with an
associated confidence interval.
So far everything seems to be going well. However, that night you wake up at 3am with a brilliant
idea for a new modeling approach. The next day, you code up your new model, tune its hyperparameters
on the validation set and not only are you getting your new model $f_2$ to work but its
error rate appears to be much lower than $f_1$ʼs. However, the thrill of discovery suddenly fades as
you prepare for the final evaluation. You donʼt have a test set!

Even though the original test set D is still sitting on your server, you now face two formidable
problems. First, when you collected your test set, you determined the required level of precision
under the assumption that you were evaluating a single classifier f . However, if you get into the
business of evaluating multiple classifiers f1 , ..., fk on the same test set, you must consider the
problem of false discovery. Before, you might have been 95% sure that ϵD (f ) ∈ ϵ(f ) ± 0.01 for a
single classifier f and thus the probability of a misleading result was a mere 5%. With k classifiers
in the mix, it can be hard to guarantee that there is not even one among them whose test set
performance is misleading. With 20 classifiers under consideration, you might have no power
at all to rule out the possibility that at least one among them received a misleading score. This
problem relates to multiple hypothesis testing, which despite a vast literature in statistics, remains
a persistent problem plaguing scientific research.
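A back-of-the-envelope calculation shows how quickly the guarantee erodes. The sketch below computes two quantities for k classifiers that each have a 5% chance of a misleading score: the probability that at least one is misleading under the simplifying assumption that the k evaluations are independent (an assumption made here for illustration; it does not hold exactly when all classifiers share one test set), and the union bound, which needs no such assumption. Both function names are hypothetical.

```python
def prob_some_misleading(k, alpha=0.05):
    """P(at least one of k scores is misleading), ASSUMING independent evaluations."""
    return 1 - (1 - alpha) ** k

def union_bound(k, alpha=0.05):
    """Upper bound on the same probability without any independence assumption."""
    return min(1.0, k * alpha)

for k in (1, 5, 20):
    print(k, round(prob_some_misleading(k), 4), round(union_bound(k), 4))
```

For 20 classifiers the union bound reaches 1, which is one way to read the statement that you might have no power at all to rule out a misleading score. Even under the optimistic independence assumption the probability is well above one half.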
If thatʼs not enough to worry you, thereʼs a special reason to distrust the results that you get on
subsequent evaluations. Recall that our analysis of test set performance rested on the assumption
that the classifier was chosen absent any contact with the test set and thus we could view the test
set as drawn randomly from the underlying population. Here, not only are you testing multiple
functions, the subsequent function f2 was chosen after you observed the test set performance of
f1 . Once information from the test set has leaked to the modeler, it can never be a true test set
again in the strictest sense. This problem is called adaptive overfitting and has recently emerged as
a topic of intense interest to learning theorists and statisticians (Dwork et al., 2015). Fortunately,
while it is possible to leak all information out of a holdout set, and the theoretical worst case
scenarios are bleak, these analyses may be too conservative. In practice, take care to create real
test sets, to consult them as infrequently as possible, to account for multiple hypothesis testing
when reporting confidence intervals, and to dial up your vigilance more aggressively when the
stakes are high and your dataset size is small. When running a series of benchmark challenges,
itʼs often good practice to maintain several test sets so that after each round, the old test set can be
demoted to a validation set.
4.6.3 Statistical Learning Theory
At once, test sets are all that we really have, and yet this fact seems strangely unsatisfying. First,
we seldom possess a true test set—unless we are the ones creating the dataset, someone else has
probably already evaluated their own classifier on our ostensible “test set”. And even when we get
first dibs, we soon find ourselves frustrated, wishing we could evaluate our subsequent modeling
attempts without the gnawing feeling that we cannot trust our numbers. Moreover, even a true
test set can only tell us post hoc whether a classifier has in fact generalized to the population, not
whether we have any reason to expect a priori that it should generalize.
With these misgivings in mind, you might now be sufficiently primed to see the appeal of statistical
learning theory, the mathematical subfield of machine learning whose practitioners aim to eluci-
date the fundamental principles that explain why/when models trained on empirical data can/will
generalize to unseen data. One of the primary aims for several decades of statistical learning
researchers has been to bound the generalization gap in terms of the properties of the model class
and the number of samples in the dataset.
Learning theorists aim to bound the difference between the empirical error ϵS (fS ) of a learned
classifier fS , both trained and evaluated on the training set S, and the true error ϵ(fS ) of that
same classifier on the underlying population. This might look similar to the evaluation problem
that we just addressed but thereʼs a major difference. Before, the classifier f was fixed and we
only needed a dataset for evaluative purposes. And indeed, any fixed classifier does generalize:
its error on a (previously unseen) dataset is an unbiased estimate of the population error. But
what can we say when a classifier is trained and evaluated on the same dataset? Can we ever be
confident that the training error will be close to the testing error?
Suppose that our learned classifier fS must be chosen among some pre-specified set of functions
F. Recall from our discussion of test sets that while itʼs easy to estimate the error of a single clas-
sifier, things get hairy when we begin to consider collections of classifiers. Even if the empirical
error of any one (fixed) classifier will be close to its true error with high probability, once we con-
sider a collection of classifiers, we need to worry about the possibility that just one classifier in the
set will receive a badly misestimated error. The worry is that if just one classifier in our collection
receives a misleadingly low error then we might pick it and thereby grossly underestimate the
population error. Moreover, even for linear models, because their parameters are continuously
valued, we are typically choosing among an infinite class of functions (|F| = ∞).
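To make this worry concrete, here is a minimal simulation sketch. It assumes a collection of classifiers that all guess binary labels at random, so each one has true error 0.5 by construction. The function name `min_empirical_error` and the sample sizes are illustrative choices, not something defined earlier in the text.

```python
import random


def min_empirical_error(num_classifiers, n, seed=0):
    """Smallest empirical error among classifiers that guess at random.

    Every simulated classifier has true error 0.5 by construction.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n)]
    best = 1.0
    for _ in range(num_classifiers):
        # A random guess is wrong whenever it differs from the label.
        mistakes = sum(rng.randint(0, 1) != y for y in labels)
        best = min(best, mistakes / n)
    return best


single = min_empirical_error(num_classifiers=1, n=100)
many = min_empirical_error(num_classifiers=1000, n=100)
print(f"one classifier: {single:.2f}, best of 1000: {many:.2f}")
```

The best of many classifiers typically shows an empirical error well below 0.5 even though none of them is better than chance, which is exactly the kind of misleadingly low error described above.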
One ambitious solution to the problem is to develop analytic tools for proving uniform conver-
gence, i.e., that with high probability, the empirical error rate for every classifier in the class
f ∈ F will simultaneously converge to its true error rate. In other words, we seek a theoretical
principle that would allow us to state that with probability at least 1 − δ (for some small δ) no
classifierʼs error rate ϵ(f ) (among all classifiers in the class F) will be misestimated by more than
some small amount α. Clearly, we cannot make such statements for all model classes F. Recall
the class of memorization machines that always achieve empirical error 0 but never outperform
random guessing on the underlying population.
In a sense the class of memorizers is too flexible. No such uniform convergence result could
possibly hold. On the other hand, a fixed classifier is useless—it generalizes perfectly, but fits
neither the training data nor the test data. The central question of learning has thus historically
been framed as a tradeoff between more flexible (higher variance) model classes that better fit the
training data but risk overfitting, versus more rigid (higher bias) model classes that generalize well
but risk underfitting. A central question in learning theory has been to develop the appropriate
mathematical analysis to quantify where a model sits along this spectrum, and to provide the
associated guarantees.
In a series of seminal papers, Vapnik and Chervonenkis extended the theory on the convergence of
relative frequencies to more general classes of functions (Vapnik and Chervonenkis, 1964, Vapnik
and Chervonenkis, 1968, Vapnik and Chervonenkis, 1971, Vapnik and Chervonenkis, 1981, Vapnik
and Chervonenkis, 1991, Vapnik and Chervonenkis, 1974). One of the key contributions of this
line of work is the Vapnik-Chervonenkis (VC) dimension, which measures (one notion of) the
complexity (flexibility) of a model class. Moreover, one of their key results bounds the difference
between the empirical error and the population error as a function of the VC dimension and the
number of samples:
$$P\left(R[p, f] - R_\mathrm{emp}[\mathbf{X}, \mathbf{Y}, f] < \alpha\right) \geq 1-\delta \ \text{ for }\ \alpha \geq c \sqrt{(\mathrm{VC} - \log \delta)/n}. \tag{4.6.4}$$
Here δ > 0 is the probability that the bound is violated, α is the upper bound on the generalization
gap, and n is the dataset size. Lastly, c > 0 is a constant that depends only on the scale of the
loss that can be incurred. One use of the bound might be to plug in desired values of δ and α to
determine how many samples to collect. The VC dimension quantifies the largest number of data
points for which we can assign any arbitrary (binary) labeling and for each find some model f in
the class that agrees with that labeling. For example, linear models on d-dimensional inputs have
VC dimension d + 1. Itʼs easy to see that a line can assign any possible labeling to three
non-collinear points in two dimensions, but not to four. Unfortunately, the theory tends to be overly pessimistic for
more complex models and obtaining this guarantee typically requires far more examples than are
actually required to achieve the desired error rate. Note also that fixing the model class and δ, our
error rate again decays with the usual O(1/√n) rate. It seems unlikely that we could do better in
terms of n. However, as we vary the model class, VC dimension can present a pessimistic picture
of the generalization gap.
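As a rough illustration of plugging desired values of δ and α into (4.6.4), the sketch below solves the condition α ≥ c √((VC − log δ)/n) for n. The helper name `vc_sample_size` is hypothetical, and the default c = 1 is only a placeholder assumption, since the text says c depends on the scale of the loss.

```python
import math


def vc_sample_size(vc_dim, alpha, delta, c=1.0):
    """Smallest n with c * sqrt((vc_dim - log(delta)) / n) <= alpha.

    The constant c = 1.0 is a placeholder assumption, not a derived value.
    """
    if not (0 < delta < 1) or alpha <= 0:
        raise ValueError("need 0 < delta < 1 and alpha > 0")
    return math.ceil(c ** 2 * (vc_dim - math.log(delta)) / alpha ** 2)


# Linear classifiers on 2-dimensional inputs have VC dimension 3.
n_needed = vc_sample_size(vc_dim=3, alpha=0.1, delta=0.05)
print(n_needed)
```

Because n scales with 1/α², halving the tolerated gap α roughly quadruples the number of samples the bound asks for.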
4.6.4 Summary
The most straightforward way to evaluate a model is to consult a test set comprised of previously
unseen data. Test set evaluations provide an unbiased estimate of the true error and converge at
the desired O(1/√n) rate as the test set grows. We can provide approximate confidence intervals
based on exact asymptotic distributions or valid finite sample confidence intervals based on (more
conservative) finite sample guarantees. Indeed test set evaluation is the bedrock of modern ma-
chine learning research. However, test sets are seldom true test sets (used by multiple researchers
again and again). Once the same test set is used to evaluate multiple models, controlling for false
discovery can be difficult. This can cause huge problems in theory. In practice, the significance
of the problem depends on the size of the holdout sets in question and whether they are merely
being used to choose hyperparameters or if they are leaking information more directly. Never-
theless, itʼs good practice to curate real test sets (or multiple) and to be as conservative as possible
about how often they are used.
Hoping to provide a more satisfying solution, statistical learning theorists have developed meth-
ods for guaranteeing uniform convergence over a model class. If indeed every modelʼs empirical
error converges to its true error simultaneously, then we are free to choose the model that per-
forms best, minimizing the training error, knowing that it too will perform similarly well on the
holdout data. Crucially, any such result must depend on some property of the model class.
Vladimir Vapnik and Alexey Chervonenkis introduced the VC dimension, presenting uniform convergence
results that hold for all models in a VC class. The training errors for all models in the class
are (simultaneously) guaranteed to be close to their true errors, and guaranteed to grow closer at
O(1/√n) rates. Following the revolutionary discovery of VC dimension, numerous alternative
complexity measures have been proposed, each facilitating an analogous generalization guaran-
tee. See (Boucheron et al., 2005) for a detailed discussion of several advanced ways of measur-
ing function complexity. Unfortunately, while these complexity measures have become broadly
useful tools in statistical theory, they turn out to be powerless (as straightforwardly applied) for
explaining why deep neural networks generalize. Deep neural networks often have millions of
parameters (or more), and can easily assign random labels to large collections of points. Nevertheless,
they generalize well on practical problems and, surprisingly, they often generalize better
when they are larger and deeper, despite incurring larger VC dimensions. In the next chapter, we
will revisit generalization in the context of deep learning.
Exercises
1. If we wish to estimate the error of a fixed model f to within 0.0001 with probability greater
than 99.9%, how many samples do we need?
2. Suppose that somebody else possesses a labeled test set D and only makes available the unla-
beled inputs (features). Now suppose that you can only access the test set labels by running
a model f (no restrictions placed on the model class) on each of the unlabeled inputs and
receiving the corresponding error ϵD (f ). How many models would you need to evaluate be-
fore you leak the entire test set and thus could appear to have error 0, regardless of your true
error?
3. What is the VC dimension of the class of 5th-order polynomials?
4. What is the VC dimension of axis-aligned rectangles on two-dimensional data?
4.7 Environment and Distribution Shift
In the previous sections, we worked through a number of hands-on applications of machine learn-
ing, fitting models to a variety of datasets. And yet, we never stopped to contemplate either where
data comes from in the first place or what we plan to ultimately do with the outputs from our mod-
els. Too often, machine learning developers in possession of data rush to develop models without
pausing to consider these fundamental issues.
Many failed machine learning deployments can be traced back to this pattern. Sometimes mod-
els appear to perform marvelously as measured by test set accuracy but fail catastrophically in
deployment when the distribution of data suddenly shifts. More insidiously, sometimes the very
deployment of a model can be the catalyst that perturbs the data distribution. Say, for example,
that we trained a model to predict who will repay vs. default on a loan, finding that an applicantʼs
choice of footwear was associated with the risk of default (Oxfords indicate repayment, sneakers
indicate default). We might be inclined to thereafter grant loans to all applicants wearing Oxfords
and to deny all applicants wearing sneakers.
In this case, our ill-considered leap from pattern recognition to decision-making and our failure
to critically consider the environment might have disastrous consequences. For starters, as soon
as we began making decisions based on footwear, customers would catch on and change their
behavior. Before long, all applicants would be wearing Oxfords, without any coinciding improve-
ment in credit-worthiness. Take a minute to digest this because similar issues abound in many
applications of machine learning: by introducing our model-based decisions to the environment,
we might break the model.
While we cannot possibly give these topics a complete treatment in one section, we aim here to
expose some common concerns, and to stimulate the critical thinking required to detect these
situations early, mitigate damage, and use machine learning responsibly. Some of the solutions
are simple (ask for the “right” data), some are technically difficult (implement a reinforcement
learning system), and others require that we step outside the realm of statistical prediction al-
together and grapple with difficult philosophical questions concerning the ethical application of
algorithms.
4.7.1 Types of Distribution Shift
To begin, we stick with the passive prediction setting considering the various ways that data distri-
butions might shift and what might be done to salvage model performance. In one classic setup,
we assume that our training data was sampled from some distribution pS (x, y) but that our test
data will consist of unlabeled examples drawn from some different distribution pT (x, y). Already,
we must confront a sobering reality. Absent any assumptions on how pS and pT relate to each
other, learning a robust classifier is impossible.
Consider a binary classification problem, where we wish to distinguish between dogs and cats.
If the distribution can shift in arbitrary ways, then our setup permits the pathological case in
which the distribution over inputs remains constant: pS (x) = pT (x), but the labels are all flipped:
pS (y|x) = 1 − pT (y|x). In other words, if God can suddenly decide that in the future all “cats”
are now dogs and what we previously called “dogs” are now cats—without any change in the dis-
tribution of inputs p(x), then we cannot possibly distinguish this setting from one in which the
distribution did not change at all.
Fortunately, under some restricted assumptions on the ways our data might change in the future,
principled algorithms can detect shift and sometimes even adapt on the fly, improving on the
accuracy of the original classifier.
Covariate Shift
Among categories of distribution shift, covariate shift may be the most widely studied. Here, we
assume that while the distribution of inputs may change over time, the labeling function, i.e., the
conditional distribution P (y | x) does not change. Statisticians call this covariate shift because
the problem arises due to a shift in the distribution of the covariates (features). While we can
sometimes reason about distribution shift without invoking causality, we note that covariate shift
is the natural assumption to invoke in settings where we believe that x causes y.
Consider the challenge of distinguishing cats and dogs. Our training data might consist of images
of the kind in Fig. 4.7.1.
Fig. 4.7.1: Training data for distinguishing cats and dogs.
At test time we are asked to classify the images in Fig. 4.7.2.
Fig. 4.7.2: Test data for distinguishing cats and dogs.
The training set consists of photos, while the test set contains only cartoons. Training on a dataset
with substantially different characteristics from the test set can spell trouble absent a coherent
plan for how to adapt to the new domain.
Label Shift
Label shift describes the converse problem. Here, we assume that the label marginal P (y) can
change but the class-conditional distribution P (x | y) remains fixed across domains. Label shift is
a reasonable assumption to make when we believe that y causes x. For example, we may want to
predict diagnoses given their symptoms (or other manifestations), even as the relative prevalence
of diagnoses is changing over time. Label shift is the appropriate assumption here because diseases
cause symptoms. In some degenerate cases the label shift and covariate shift assumptions
can hold simultaneously. For example, when the label is deterministic, the covariate shift assump-
tion will be satisfied, even when y causes x. Interestingly, in these cases, it is often advantageous
to work with methods that flow from the label shift assumption. That is because these methods
tend to involve manipulating objects that look like labels (often low-dimensional), as opposed to
objects that look like inputs, which tend to be high-dimensional in deep learning.
Concept Shift
We may also encounter the related problem of concept shift, which arises when the very definitions
of labels can change. This sounds weird—a cat is a cat, no? However, other categories are subject to
changes in usage over time. Diagnostic criteria for mental illness, what passes for fashionable, and
job titles, are all subject to considerable amounts of concept shift. It turns out that if we navigate
around the United States, shifting the source of our data by geography, we will find considerable
concept shift regarding the distribution of names for soft drinks as shown in Fig. 4.7.3.
Fig. 4.7.3: Concept shift on soft drink names in the United States.
If we were to build a machine translation system, the distribution P (y | x) might be different de-
pending on our location. This problem can be tricky to spot. We might hope to exploit knowledge
that shift only takes place gradually either in a temporal or geographic sense.

4.7.2 Examples of Distribution Shift
Before delving into formalism and algorithms, we can discuss some concrete situations where
covariate or concept shift might not be obvious.
Medical Diagnostics
Imagine that you want to design an algorithm to detect cancer. You collect data from healthy and
sick people and you train your algorithm. It works fine, giving you high accuracy and you conclude
that you are ready for a successful career in medical diagnostics. Not so fast.
The distributions that gave rise to the training data and those you will encounter in the wild might
differ considerably. This happened to an unfortunate startup that some of us (authors) worked
with years ago. They were developing a blood test for a disease that predominantly affects older
men and hoped to study it using blood samples that they had collected from patients. However, it
is considerably more difficult to obtain blood samples from healthy men than sick patients already
in the system. To compensate, the startup solicited blood donations from students on a university
campus to serve as healthy controls in developing their test. Then they asked whether we could
help them to build a classifier for detecting the disease.
As we explained to them, it would indeed be easy to distinguish between the healthy and sick
cohorts with near-perfect accuracy. However, that is because the test subjects differed in age,
hormone levels, physical activity, diet, alcohol consumption, and many more factors unrelated
to the disease. This was unlikely to be the case with real patients. Due to their sampling proce-
dure, we could expect to encounter extreme covariate shift. Moreover, this case was unlikely to
be correctable via conventional methods. In short, they wasted a significant sum of money.
Self-Driving Cars
Say a company wanted to leverage machine learning for developing self-driving cars. One key
component here is a roadside detector. Since real annotated data is expensive to get, they had the
(smart and questionable) idea to use synthetic data from a game rendering engine as additional
training data. This worked really well on “test data” drawn from the rendering engine. Alas, inside
a real car it was a disaster. As it turned out, the roadside had been rendered with a very simplis-
tic texture. More importantly, all the roadside had been rendered with the same texture and the
roadside detector learned about this “feature” very quickly.
A similar thing happened to the US Army when they first tried to detect tanks in the forest. They
took aerial photographs of the forest without tanks, then drove the tanks into the forest and took
another set of pictures. The classifier appeared to work perfectly. Unfortunately, it had merely
learned how to distinguish trees with shadows from trees without shadows—the first set of pictures
was taken in the early morning, the second set at noon.

Nonstationary Distributions
A much more subtle situation arises when the distribution changes slowly (also known as nonsta-
tionary distribution) and the model is not updated adequately. Below are some typical cases.
• We train a computational advertising model and then fail to update it frequently (e.g., we
forget to incorporate that an obscure new device called an iPad was just launched).
• We build a spam filter. It works well at detecting all spam that we have seen so far. But then
the spammers wise up and craft new messages that look unlike anything we have seen
before.
• We build a product recommendation system. It works throughout the winter but then con-
tinues to recommend Santa hats long after Christmas.
More Anecdotes
• We build a face detector. It works well on all benchmarks. Unfortunately it fails on test
data—the offending examples are close-ups where the face fills the entire image (no such
data was in the training set).
• We build a Web search engine for the US market and want to deploy it in the UK.
• We train an image classifier by compiling a large dataset where each among a large set of
classes is equally represented in the dataset, say 1000 categories, represented by 1000 images
each. Then we deploy the system in the real world, where the actual label distribution of
photographs is decidedly non-uniform.
4.7.3 Correction of Distribution Shift
As we have discussed, there are many cases where training and test distributions P (x, y) are dif-
ferent. In some cases, we get lucky and the models work despite covariate, label, or concept shift.
In other cases, we can do better by employing principled strategies to cope with the shift. The re-
mainder of this section grows considerably more technical. The impatient reader could continue
on to the next section as this material is not prerequisite to subsequent concepts.
Empirical Risk and Risk
Letʼs first reflect on what exactly is happening during model training: we iterate over features
and associated labels of training data {(x1, y1), . . . , (xn, yn)} and update the parameters of a model
f after every minibatch. For simplicity we do not consider regularization, so we largely minimize
the loss on the training set:
$$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n l(f(\mathbf{x}_i), y_i), \tag{4.7.1}$$
where l is the loss function measuring “how bad” the prediction f (xi ) is given the associated label
yi . Statisticians call the term in (4.7.1) empirical risk. The empirical risk is an average loss over the
training data to approximate the risk, which is the expectation of the loss over the entire population
of data drawn from their true distribution p(x, y):
$$E_{p(\mathbf{x}, y)} [l(f(\mathbf{x}), y)] = \int\int l(f(\mathbf{x}), y) p(\mathbf{x}, y) \;d\mathbf{x}dy. \tag{4.7.2}$$

However, in practice we typically cannot obtain the entire population of data. Thus, empirical risk
minimization, which is minimizing the empirical risk in (4.7.1), is a practical strategy for machine
learning, with the hope to approximate minimizing the risk.
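As a small illustration, the quantity being minimized in (4.7.1) can be computed for a fixed model as follows. This is only a sketch: the toy dataset, the squared loss, and the model f(x) = 2x are made-up assumptions, and the code evaluates the empirical risk rather than minimizing it.

```python
def squared_loss(prediction, label):
    return (prediction - label) ** 2


def empirical_risk(model, loss, examples):
    """Average loss of `model` over (feature, label) pairs, as in (4.7.1)."""
    return sum(loss(model(x), y) for x, y in examples) / len(examples)


# Toy training set and a fixed model f(x) = 2x (both made up for illustration).
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]
risk = empirical_risk(lambda x: 2 * x, squared_loss, train)
print(risk)
```

Only the last example is mispredicted (by 1), so the empirical risk is the average of the losses 0, 0, and 1.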
Covariate Shift Correction
Assume that we want to estimate some dependency P (y | x) for which we have labeled data (xi , yi ).
Unfortunately, the observations xi are drawn from some source distribution q(x) rather than the
target distribution p(x). Fortunately, the dependency assumption means that the conditional dis-
tribution does not change: p(y | x) = q(y | x). If the source distribution q(x) is “wrong”, we can
correct for that by using the following simple identity in the risk:
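$$\int\int l(f(\mathbf{x}), y) p(y \mid \mathbf{x})p(\mathbf{x}) \;d\mathbf{x}dy = \int\int l(f(\mathbf{x}), y) q(y \mid \mathbf{x})q(\mathbf{x})\frac{p(\mathbf{x})}{q(\mathbf{x})} \;d\mathbf{x}dy. \tag{4.7.3}$$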
In other words, we need to reweigh each data example by the ratio of the probability that it would
have been drawn from the correct distribution to that from the wrong one:
$$\beta_i \stackrel{\mathrm{def}}{=} \frac{p(\mathbf{x}_i)}{q(\mathbf{x}_i)}. \tag{4.7.4}$$
Plugging in the weight βi for each data example (xi , yi ) we can train our model using weighted
empirical risk minimization:
$$\mathop{\mathrm{minimize}}_f \frac{1}{n} \sum_{i=1}^n \beta_i l(f(\mathbf{x}_i), y_i). \tag{4.7.5}$$
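To see the reweighting at work, consider a toy sketch in which both densities are known, which is an assumption made purely for illustration: the source q is a standard normal N(0, 1) and the target p is N(1, 1), so that p(x)/q(x) = exp(x − 0.5). The model f(x) = 0.5x and the noiseless labels y = x are likewise hypothetical. Under these assumptions the risk is 0.25 on the source and 0.5 on the target.

```python
import math
import random


def importance_weight(x):
    """p(x) / q(x) for target p = N(1, 1) and source q = N(0, 1)."""
    return math.exp(x - 0.5)


def loss(x):
    # Squared loss of the model f(x) = 0.5 * x when the label is y = x.
    return (0.5 * x - x) ** 2


rng = random.Random(0)
source_xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Plain empirical risk estimates the risk under the source distribution q.
source_risk = sum(loss(x) for x in source_xs) / len(source_xs)
# Weighted empirical risk, as in (4.7.5), estimates the risk under the target p.
weighted_risk = sum(importance_weight(x) * loss(x) for x in source_xs) / len(source_xs)
print(f"unweighted: {source_risk:.3f}, weighted: {weighted_risk:.3f}")
```

The weighted average uses only source samples, yet it approximates the target risk rather than the source risk.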
Alas, we do not know that ratio, so before we can do anything useful we need to estimate it. Many
methods are available, including some fancy operator-theoretic approaches that attempt to recali-
brate the expectation operator directly using a minimum-norm or a maximum entropy principle.
Note that for any such approach, we need samples drawn from both distributions—the “true” p,
e.g., by access to test data, and the one used for generating the training set q (the latter is trivially
available). Note however, that we only need features x ∼ p(x); we do not need to access labels
y ∼ p(y).
In this case, there exists a very effective approach that will give almost as good results as the orig-
inal: logistic regression, which is a special case of softmax regression (see Section 4.1) for binary
classification. This is all that is needed to compute estimated probability ratios. We learn a classi-
fier to distinguish between data drawn from p(x) and data drawn from q(x). If it is impossible to
distinguish between the two distributions then it means that the associated instances are equally
likely to come from either one of the two distributions. On the other hand, any instances that can
be well discriminated should be significantly overweighted or underweighted accordingly.
For simplicityʼs sake assume that we have an equal number of instances from both distributions
p(x) and q(x), respectively. Now denote by z labels that are 1 for data drawn from p and −1 for
data drawn from q. Then the probability in a mixed dataset is given by
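$$P(z=1 \mid \mathbf{x}) = \frac{p(\mathbf{x})}{p(\mathbf{x})+q(\mathbf{x})} \text{ and hence } \frac{P(z=1 \mid \mathbf{x})}{P(z=-1 \mid \mathbf{x})} = \frac{p(\mathbf{x})}{q(\mathbf{x})}. \tag{4.7.6}$$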
Thus, if we use a logistic regression approach, where P(z = 1 | x) = 1/(1 + exp(−h(x))) (h is a
parameterized function), it follows that
$$\beta_i = \frac{1/(1 + \exp(-h(\mathbf{x}_i)))}{\exp(-h(\mathbf{x}_i))/(1 + \exp(-h(\mathbf{x}_i)))} = \exp(h(\mathbf{x}_i)). \tag{4.7.7}$$

As a result, we need to solve two problems: first, one to distinguish between data drawn from the two
distributions, and then a weighted empirical risk minimization problem in (4.7.5) where we weigh
terms by βi.
Now we are ready to describe a correction algorithm. Suppose that we have a training set
{(x1 , y1 ), . . . , (xn , yn )} and an unlabeled test set {u1 , . . . , um }. For covariate shift, we assume that
xi for all 1 ≤ i ≤ n are drawn from some source distribution and ui for all 1 ≤ i ≤ m are drawn
from the target distribution. Here is a prototypical algorithm for correcting covariate shift:
1. Generate a binary-classification training set: {(x1 , −1), . . . , (xn , −1), (u1 , 1), . . . , (um , 1)}.
2. Train a binary classifier using logistic regression to get function h.
3. Weigh training data using βi = exp(h(xi )) or better βi = min(exp(h(xi )), c) for some con-
stant c.
4. Use weights βi for training on {(x1 , y1 ), . . . , (xn , yn )} in (4.7.5).
Note that the above algorithm relies on a crucial assumption. For this scheme to work, we need
that each data example in the target (e.g., test time) distribution had nonzero probability of oc-
curring at training time. If we find a point where p(x) > 0 but q(x) = 0, then the corresponding
importance weight should be infinity.
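The first three steps of the algorithm above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that are not part of the text: inputs are one-dimensional, the source is N(0, 1) and the target is N(1, 1), h is the linear function h(x) = wx + b, and the hypothetical helper `train_logistic_regression` uses full-batch gradient descent. With equal sample sizes, the ideal solution under these assumptions is h(x) = x − 0.5.

```python
import math
import random


def train_logistic_regression(xs, zs, lr=0.5, num_steps=300):
    """Fit h(x) = w * x + b by gradient descent on the logistic loss.

    Labels zs take values in {-1, +1}, matching step 1 of the algorithm.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(num_steps):
        grad_w, grad_b = 0.0, 0.0
        for x, z in zip(xs, zs):
            # P(z = 1 | x) under the current model.
            prob = 1.0 / (1.0 + math.exp(-(w * x + b)))
            target = 1.0 if z == 1 else 0.0
            grad_w += (prob - target) * x
            grad_b += prob - target
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


rng = random.Random(0)
source_xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # training inputs x_i
target_xs = [rng.gauss(1.0, 1.0) for _ in range(1000)]  # unlabeled test inputs u_i

# Step 1: binary-classification set with label -1 for source, +1 for target.
xs = source_xs + target_xs
zs = [-1] * len(source_xs) + [1] * len(target_xs)

# Step 2: train the domain classifier to obtain h.
w, b = train_logistic_regression(xs, zs)

# Step 3: clipped importance weights for the training examples.
c = 10.0
betas = [min(math.exp(w * x + b), c) for x in source_xs]
print(f"h(x) = {w:.2f} * x + {b:.2f}")
```

The resulting `betas` would then multiply the per-example losses in (4.7.5) for step 4. Clipping at c trades a little bias for lower variance when a few training points receive very large weights.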
Label Shift Correction
Assume that we are dealing with a classification task with k categories. Using the same notation in
Section 4.7.3, q and p are the source distribution (e.g., training time) and target distribution (e.g.,
test time), respectively. Assume that the distribution of labels shifts over time: q(y) ≠ p(y), but
the class-conditional distribution stays the same: q(x | y) = p(x | y). If the source distribution
q(y) is “wrong”, we can correct for that according to the following identity in the risk as defined
in (4.7.2):
$$\int\int l(f(\mathbf{x}), y) p(\mathbf{x} \mid y)p(y) \;d\mathbf{x}dy = \int\int l(f(\mathbf{x}), y) q(\mathbf{x} \mid y)q(y)\frac{p(y)}{q(y)} \;d\mathbf{x}dy. \tag{4.7.8}$$
Here, our importance weights will correspond to the label likelihood ratios
$$\beta_i \stackrel{\mathrm{def}}{=} \frac{p(y_i)}{q(y_i)}. \tag{4.7.9}$$
One nice thing about label shift is that if we have a reasonably good model on the source distribu-
tion, then we can get consistent estimates of these weights without ever having to deal with the
ambient dimension. In deep learning, the inputs tend to be high-dimensional objects like images,
while the labels are often simpler objects like categories.
To estimate the target label distribution, we first take our reasonably good off-the-shelf classifier
(typically trained on the training data) and compute its confusion matrix using the validation set
(also from the training distribution). The confusion matrix, C, is simply a k × k matrix, where
each column corresponds to the label category (ground truth) and each row corresponds to our
modelʼs predicted category. Each cellʼs value cij is the fraction of validation examples with true
label j for which our model predicted i, so that each column of C sums to one.
Now, we cannot calculate the confusion matrix on the target data directly, because we do not get
to see the labels for the examples that we see in the wild, unless we invest in a complex real-time
annotation pipeline. What we can do, however, is average all of our modelʼs predictions at test
time together, yielding the mean model outputs µ(ŷ) ∈ R^k, whose ith element µ(ŷi) is the fraction
of total predictions on the test set where our model predicted i.
It turns out that under some mild conditions—if our classifier was reasonably accurate in the first
place, and if the target data contains only categories that we have seen before, and if the label shift
assumption holds in the first place (the strongest assumption here), then we can estimate the test
set label distribution by solving a simple linear system
$$\mathbf{C} p(\mathbf{y}) = \mu(\hat{\mathbf{y}}), \tag{4.7.10}$$
because as an estimate $\sum_{j=1}^k c_{ij} p(y_j) = \mu(\hat{y}_i)$ holds for all 1 ≤ i ≤ k, where p(yj) is the jth element
of the k-dimensional label distribution vector p(y). If our classifier is sufficiently accurate to begin
with, then the confusion matrix C will be invertible, and we get a solution $p(\mathbf{y}) = \mathbf{C}^{-1} \mu(\hat{\mathbf{y}})$.
Because we observe the labels on the source data, it is easy to estimate the distribution q(y). Then
for any training example i with label yi , we can take the ratio of our estimated p(yi )/q(yi ) to cal-
culate the weight βi , and plug this into weighted empirical risk minimization in (4.7.5).
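The estimation procedure can be sketched with a tiny two-class example. The numbers below are made up, and the sketch assumes that each column of C is normalized to sum to one, so that C[i][j] estimates the probability of predicting i given true label j. The helper `solve_linear_system` is a hypothetical stand-in for any linear solver.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    # Augmented matrix [A | b]; copy rows so the inputs stay untouched.
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("confusion matrix is (numerically) singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]


# Hypothetical 2-class confusion matrix: C[i][j] estimates P(predict i | true j).
C = [[0.9, 0.2],
     [0.1, 0.8]]
mu = [0.41, 0.59]   # fraction of test-time predictions per class
q_y = [0.5, 0.5]    # label frequencies observed on the source data

p_y = solve_linear_system(C, mu)                   # estimate of p(y), as in (4.7.10)
class_weights = [p / q for p, q in zip(p_y, q_y)]  # p(y) / q(y) for each class
train_labels = [0, 1, 1, 0]
betas = [class_weights[y] for y in train_labels]   # weight beta_i for each example
print(p_y, class_weights)
```

In practice the solution of the linear system can contain small negative entries because C and µ(ŷ) are themselves noisy estimates, so some clipping and renormalization may be needed before forming the weights.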
Concept Shift Correction
Concept shift is much harder to fix in a principled manner. For instance, in a situation where
suddenly the problem changes from distinguishing cats from dogs to one of distinguishing white
from black animals, it will be unreasonable to assume that we can do much better than just col-
lecting new labels and training from scratch. Fortunately, in practice, such extreme shifts are
rare. Instead, what usually happens is that the task keeps on changing slowly. To make things
more concrete, here are some examples:
• In computational advertising, new products are launched, old products become less popular.
This means that the distribution over ads and their popularity changes gradually and any
click-through rate predictor needs to change gradually with it.
• Traffic camera lenses degrade gradually due to environmental wear, affecting image quality
progressively.
• News content changes gradually (i.e., most of the news remains unchanged but new stories
appear).
In such cases, we can use the same approach that we used for training networks to make them
adapt to the change in the data. In other words, we use the existing network weights and simply
perform a few update steps with the new data rather than training from scratch.
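A toy sketch of this warm-start idea follows, using a scalar linear model instead of a network. The drift from y = 2x to y = 2.2x, the learning rate, and the step budget are illustrative assumptions only.

```python
def gradient_steps(w, data, lr=0.1, num_steps=5):
    """Run a few gradient steps on the mean squared error of f(x) = w * x."""
    for _ in range(num_steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


xs = [i / 10 for i in range(10)]
new_data = [(x, 2.2 * x) for x in xs]      # the relation drifted from y = 2x

w_old = 2.0                                # weight learned before the drift
w_warm = gradient_steps(w_old, new_data)   # continue from the existing weight
w_scratch = gradient_steps(0.0, new_data)  # same budget, starting from scratch
print(mse(w_warm, new_data), mse(w_scratch, new_data))
```

With the same small budget of updates, the warm-started weight ends up much closer to the new optimum, because it only has to cover the small drift rather than the whole distance from zero.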
4.7.4 A Taxonomy of Learning Problems
Armed with knowledge about how to deal with changes in distributions, we can now consider
some other aspects of machine learning problem formulation.

Batch Learning
In batch learning, we have access to training features and labels {(x1 , y1 ), . . . , (xn , yn )}, which we
use to train a model f (x). Later on, we deploy this model to score new data (x, y) drawn from the
same distribution. This is the default assumption for any of the problems that we discuss here. For
instance, we might train a cat detector based on lots of pictures of cats and dogs. Once we have trained
it, we ship it as part of a smart cat door computer vision system that lets only cats in. This is then
installed in a customerʼs home and is never updated again (barring extreme circumstances).
Online Learning
Now imagine that the data (xi , yi ) arrives one sample at a time. More specifically, assume that
we first observe xi , then we need to come up with an estimate f (xi ) and only once we have done
this, we observe yi and with it, we receive a reward or incur a loss, given our decision. Many
real problems fall into this category. For example, we need to predict tomorrowʼs stock price; this
allows us to trade based on that estimate and at the end of the day we find out whether our estimate
allowed us to make a profit. In other words, in online learning, we have the following cycle where
we are continuously improving our model given new observations.
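$$\mathrm{model} ~ f_t \longrightarrow \mathrm{data} ~ \mathbf{x}_t \longrightarrow \mathrm{estimate} ~ f_t(\mathbf{x}_t) \longrightarrow \mathrm{observation} ~ y_t \longrightarrow \mathrm{loss} ~ l(y_t, f_t(\mathbf{x}_t)) \longrightarrow \mathrm{model} ~ f_{t+1} \tag{4.7.11}$$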
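The cycle in (4.7.11) can be written as a short loop. The sketch below assumes a scalar linear model, a squared loss, and a gradient update as the rule that produces the next model. None of these choices is prescribed by the text, and the data stream is hypothetical.

```python
def online_learning(stream, lr=0.1):
    """Follow the cycle in (4.7.11) for a scalar linear model f_t(x) = w_t * x."""
    w = 0.0
    losses = []
    for x_t, y_t in stream:
        estimate = w * x_t                    # estimate f_t(x_t)
        loss = (y_t - estimate) ** 2          # loss after observing y_t
        losses.append(loss)
        w -= lr * 2 * (estimate - y_t) * x_t  # gradient step gives model f_{t+1}
    return w, losses


# Hypothetical data stream with y = 3x, revealed one example at a time.
stream = [((t % 10 + 1) / 10, 3 * (t % 10 + 1) / 10) for t in range(100)]
w_final, losses = online_learning(stream)
print(w_final, sum(losses[:50]), sum(losses[50:]))
```

Note that each loss is recorded before the update, so the model is always evaluated on data it has not yet seen, which is what distinguishes this setting from batch learning.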
Bandits
Bandits are a special case of the problem above. While in most learning problems we have a con-
tinuously parametrized function f where we want to learn its parameters (e.g., a deep network),
in a bandit problem we only have a finite number of arms that we can pull, i.e., a finite number
of actions that we can take. It is not very surprising that for this simpler problem stronger theo-
retical guarantees in terms of optimality can be obtained. We list it mainly since this problem is
often (confusingly) treated as if it were a distinct learning setting.
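To make the notion of a finite set of arms concrete, here is a minimal sketch of an epsilon-greedy strategy, one common heuristic for bandit problems. The text does not prescribe a particular algorithm; the Bernoulli rewards, the arm means, and the names `epsilon_greedy`, `eps`, and `true_means` are illustrative assumptions.

```python
import random

def epsilon_greedy(true_means, steps=2000, eps=0.1, seed=0):
    """Pull one of a finite number of arms per step and track mean rewards."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k          # number of pulls per arm
    values = [0.0] * k        # running mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(k)                        # explore a random arm
        else:
            arm = max(range(k), key=lambda a: values[a])  # exploit the best estimate
        # Bernoulli reward: 1 with probability true_means[arm], else 0.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

counts, values = epsilon_greedy([0.2, 0.5, 0.8])
```

The only state the learner keeps is one count and one estimate per arm, in contrast to the continuously parametrized models used elsewhere in this book.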
Control
In many cases the environment remembers what we did. It does not necessarily do so in an adversarial
manner, but it remembers, and its response depends on what happened before. For instance, a
coffee boiler controller will observe different temperatures depending on whether it was heating
the boiler previously. PID (proportional-integral-derivative) controller algorithms are a popular
choice there. Likewise, a user's behavior on a news site will depend on what we showed them
previously (e.g., they will read most news only once). Many such algorithms form a model of the
environment in which they act so as to make their decisions appear less random. Recently,
control theory (e.g., PID variants) has also been used to automatically tune hyperparameters to
achieve better disentangling and reconstruction quality, and to improve the diversity of generated
text and the reconstruction quality of generated images (Shao et al., 2020).
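The boiler example can be sketched as a discrete-time PID loop. The plant model (heating minus heat loss to the surroundings), the gains, and the step size below are assumptions picked for illustration, and the actuator is left unconstrained for simplicity; a real controller would need tuning and limits on the heating power.

```python
def simulate_pid(setpoint=90.0, ambient=20.0, steps=2000, dt=0.1,
                 kp=2.0, ki=0.5, kd=0.1, loss_rate=0.1):
    """Drive a toy boiler temperature towards `setpoint` with a PID controller."""
    temp = ambient
    integral = 0.0
    prev_error = setpoint - temp
    history = []
    for _ in range(steps):
        error = setpoint - temp
        integral += error * dt                   # I: accumulated past error
        derivative = (error - prev_error) / dt   # D: rate of change of the error
        power = kp * error + ki * integral + kd * derivative
        # Toy plant: the temperature we observe next depends on all earlier
        # heating decisions, i.e., the environment has memory.
        temp += dt * (power - loss_rate * (temp - ambient))
        prev_error = error
        history.append(temp)
    return history

history = simulate_pid()
```

The integral term is what removes the steady-state offset here: it keeps growing until the heating power exactly balances the heat loss at the setpoint.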

Reinforcement Learning
In the more general case of an environment with memory, we may encounter situations where
the environment is trying to cooperate with us (cooperative games, in particular for non-zero-sum
games), or others where the environment will try to win. Chess, Go, Backgammon, and StarCraft
are some of the cases studied in reinforcement learning. Likewise, we might want to build a good controller
for autonomous cars. The other cars are likely to respond to the autonomous car's driving style in
nontrivial ways, e.g., trying to avoid it, trying to cause an accident, or trying to cooperate with
it.
Considering the Environment
One key distinction between the different situations above is that the same strategy that might have
worked throughout in the case of a stationary environment, might not work throughout when the
environment can adapt. For instance, an arbitrage opportunity discovered by a trader is likely to
disappear once they start exploiting it. The speed and manner in which the environment changes
determine to a large extent the type of algorithms that we can bring to bear. For instance, if we
know that things may only change slowly, we can force any estimate to change only slowly, too. If
we know that the environment might change instantaneously, but only very infrequently, we can
make allowances for that. These types of knowledge are crucial for the aspiring data scientist in
dealing with concept shift, i.e., when the problem that they are trying to solve changes over time.
4.7.5 Fairness, Accountability, and Transparency in Machine Learning
Finally, it is important to remember that when you deploy machine learning systems you are not
merely optimizing a predictive model—you are typically providing a tool that will be used to (par-
tially or fully) automate decisions. These technical systems can impact the lives of individuals
subject to the resulting decisions. The leap from considering predictions to decisions raises not
only new technical questions, but also a slew of ethical questions that must be carefully consid-
ered. If we are deploying a medical diagnostic system, we need to know for which populations
it may work and which it may not. Overlooking foreseeable risks to the welfare of a subpopula-
tion could cause us to administer inferior care. Moreover, once we contemplate decision-making
systems, we must step back and reconsider how we evaluate our technology. Among other con-
sequences of this change of scope, we will find that accuracy is seldom the right measure. For
instance, when translating predictions into actions, we will often want to take into account the
potential cost sensitivity of erring in various ways. If one way of misclassifying an image could
be perceived as a racial slight, while misclassification to a different category would be
harmless, then we might want to adjust our thresholds accordingly, accounting for societal val-
ues in designing the decision-making protocol. We also want to be careful about how prediction
systems can lead to feedback loops. For example, consider predictive policing systems, which al-
locate patrol officers to areas with high forecasted crime. It is easy to see how a worrying pattern
can emerge:
1. Neighborhoods with more crime get more patrols.
2. Consequently, more crimes are discovered in these neighborhoods, entering the training
data available for future iterations.
3. Exposed to more positives, the model predicts yet more crime in these neighborhoods.

4. In the next iteration, the updated model targets the same neighborhood even more heavily,
leading to yet more crimes being discovered, and so on.
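The four steps above can be sketched as a deliberately crude simulation. It assumes two neighborhoods with identical true crime rates, a single patrol that is always sent wherever recorded crime is highest, and crime that is only discovered where the patrol is present. All numbers and names are hypothetical.

```python
def simulate_patrols(rounds=20, true_rate=5.0, recorded=(11.0, 10.0)):
    """Return recorded crime counts after repeatedly patrolling the 'worst' area."""
    recorded = list(recorded)
    for _ in range(rounds):
        # "Model": predict most crime where most crime has been recorded so far.
        target = max(range(len(recorded)), key=lambda i: recorded[i])
        # Crime is only discovered where officers are present, so only the
        # targeted neighborhood contributes new training data.
        recorded[target] += true_rate
    return recorded

recorded = simulate_patrols()
```

Although both neighborhoods have the same underlying rate, the small initial difference in the records decides where every future patrol goes, and the gap in recorded crime grows without bound.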
Often, the various mechanisms by which a modelʼs predictions become coupled to its training data
are unaccounted for in the modeling process. This can lead to what researchers call runaway feed-
back loops. Additionally, we want to be careful about whether we are addressing the right problem
in the first place. Predictive algorithms now play an outsize role in mediating the dissemination of
information. Should the news that an individual encounters be determined by the set of Facebook
pages they have Liked? These are just a few among the many pressing ethical dilemmas that you
might encounter in a career in machine learning.
4.7.6 Summary
• In many cases training and test sets do not come from the same distribution. This is called
distribution shift.
• The risk is the expectation of the loss over the entire population of data drawn from their
true distribution. However, this entire population is usually unavailable. Empirical risk is
an average loss over the training data to approximate the risk. In practice, we perform em-
pirical risk minimization.
• Under the corresponding assumptions, covariate and label shift can be detected and cor-
rected for at test time. Failure to account for this bias can become problematic at test time.
• In some cases, the environment may remember automated actions and respond in surpris-
ing ways. We must account for this possibility when building models and continue to mon-
itor live systems, open to the possibility that our models and the environment will become
entangled in unanticipated ways.
Exercises
1. What could happen when we change the behavior of a search engine? What might the users
do? What about the advertisers?
2. Implement a covariate shift detector. Hint: build a classifier.
3. Implement a covariate shift corrector.
4. Besides distribution shift, what else could affect how the empirical risk approximates the
risk?
Discussions

5 | Multilayer Perceptrons
In this chapter, we will introduce your first truly deep network. The simplest deep networks are
called multilayer perceptrons, and they consist of multiple layers of neurons each fully connected to
those in the layer below (from which they receive input) and those above (which they, in turn, in-
fluence). Although automatic differentiation significantly simplifies the implementation of deep
learning algorithms, we will dive deep into how these gradients are calculated in deep networks.
Then we will be ready to discuss issues relating to numerical stability and parameter initialization
that are key to successfully training deep networks. When we train such high-capacity models
we run the risk of overfitting. Thus, we will revisit regularization and generalization for deep net-
works. Throughout, we aim to give you a firm grasp not just of the concepts but also of the practice
of using deep networks. At the end of this chapter, we apply what we have introduced so far to a
real case: house price prediction. We punt matters relating to the computational performance,
scalability, and efficiency of our models to subsequent chapters.
5.1 Multilayer Perceptrons
In Chapter 4, we introduced softmax regression (Section 4.1), implementing the algorithm from
scratch (Section 4.4) and using high-level APIs (Section 4.5). This allowed us to train classifiers
capable of recognizing 10 categories of clothing from low-resolution images. Along the way, we
learned how to wrangle data, coerce our outputs into a valid probability distribution, apply an ap-
propriate loss function, and minimize it with respect to our modelʼs parameters. Now that we have
mastered these mechanics in the context of simple linear models, we can launch our exploration
of deep neural networks, the comparatively rich class of models with which this book is primarily
concerned.
5.1.1 Hidden Layers
We described affine transformations in Section 3.1.1 as linear transformations with added bias.
To begin, recall the model architecture corresponding to our softmax regression example, illus-
trated in Fig. 4.1.1. This model maps inputs directly to outputs via a single affine transformation,
followed by a softmax operation. If our labels truly were related to the input data by a simple affine
transformation, then this approach would be sufficient. However, linearity (in affine transforma-
tions) is a strong assumption.
Limitations of Linear Models
For example, linearity implies the weaker assumption of monotonicity, i.e., that any increase in our
feature must either always cause an increase in our modelʼs output (if the corresponding weight
is positive), or always cause a decrease in our modelʼs output (if the corresponding weight is nega-
tive). Sometimes that makes sense. For example, if we were trying to predict whether an individ-
ual will repay a loan, we might reasonably assume that all other things being equal, an applicant
with a higher income would always be more likely to repay than one with a lower income. While
monotonic, this relationship likely is not linearly associated with the probability of repayment.
An increase in income from \$0 to \$50,000 likely corresponds to a bigger increase in likelihood of
repayment than an increase from \$1 million to \$1.05 million. One way to handle this might be to
post-process our outcome such that linearity becomes more plausible, by using the logistic map
(and thus the logarithm of the probability of outcome).
Note that we can easily come up with examples that violate monotonicity. Say for example that we
want to predict health as a function of body temperature. For individuals with a body temperature
above 37°C (98.6°F), higher temperatures indicate greater risk. However, for individuals with body
temperatures below 37°C, lower temperatures indicate greater risk! Again, we might resolve the
problem with some clever preprocessing, such as using the distance from 37°C as a feature.
But what about classifying images of cats and dogs? Should increasing the intensity of the pixel
at location (13, 17) always increase (or always decrease) the likelihood that the image depicts a
dog? Reliance on a linear model corresponds to the implicit assumption that the only requirement
for differentiating cats vs. dogs is to assess the brightness of individual pixels. This approach is
doomed to fail in a world where inverting an image preserves the category.
And yet despite the apparent absurdity of linearity here, as compared with our previous examples,
it is less obvious that we could address the problem with a simple preprocessing fix. That is
because the significance of any pixel depends in complex ways on its context (the values of the
surrounding pixels). While there might exist a representation of our data that would take into
account the relevant interactions among our features, on top of which a linear model would be
suitable, we simply do not know how to calculate it by hand. With deep neural networks, we use
observational data to jointly learn both a representation via hidden layers and a linear predictor
that acts upon that representation.
This problem of nonlinearity has been studied for at least a century (Fisher, 1928). For instance,
decision trees in their most basic form use a sequence of binary decisions to decide upon class
membership (Quinlan, 2014). Likewise, kernel methods have been used for many decades to
model nonlinear dependencies (Aronszajn, 1950). This has found its way, e.g., into nonparametric
spline models (Wahba, 1990) and kernel methods (Schölkopf and Smola, 2002). It is also
something that the brain solves quite naturally. After all, neurons feed into other neurons which, in
turn, feed into other neurons again (y Cajal and Azoulay, 1894). Consequently we have a sequence
of relatively simple transformations.

Incorporating Hidden Layers
We can overcome the limitations of linear models by incorporating one or more hidden layers.
The easiest way to do this is to stack many fully connected layers on top of each other. Each layer
feeds into the layer above it, until we generate outputs. We can think of the first L − 1 layers as our
representation and the final layer as our linear predictor. This architecture is commonly called a
multilayer perceptron, often abbreviated as MLP (Fig. 5.1.1).
Fig. 5.1.1: An MLP with a hidden layer of 5 hidden units.
This MLP has 4 inputs, 3 outputs, and its hidden layer contains 5 hidden units. Since the input layer
does not involve any calculations, producing outputs with this network requires implementing the
computations for both the hidden and output layers; thus, the number of layers in this MLP is 2.
Note that both layers are fully connected. Every input influences every neuron in the hidden layer,
and each of these in turn influences every neuron in the output layer. Alas, we are not quite done
yet.
From Linear to Nonlinear
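As before, we denote by the matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ a minibatch of $n$ examples where each example
has $d$ inputs (features). For a one-hidden-layer MLP whose hidden layer has $h$ hidden units, we
denote by $\mathbf{H} \in \mathbb{R}^{n \times h}$ the outputs of the hidden layer, which are hidden representations. Since the
hidden and output layers are both fully connected, we have hidden-layer weights $\mathbf{W}^{(1)} \in \mathbb{R}^{d \times h}$ and
biases $\mathbf{b}^{(1)} \in \mathbb{R}^{1 \times h}$ and output-layer weights $\mathbf{W}^{(2)} \in \mathbb{R}^{h \times q}$ and biases $\mathbf{b}^{(2)} \in \mathbb{R}^{1 \times q}$. This allows us to
calculate the outputs $\mathbf{O} \in \mathbb{R}^{n \times q}$ of the one-hidden-layer MLP as follows:
$$\begin{aligned} \mathbf{H} &= \mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}, \\ \mathbf{O} &= \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}. \end{aligned} \tag{5.1.1}$$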
Note that after adding the hidden layer, our model now requires us to track and update additional
sets of parameters. So what have we gained in exchange? You might be surprised to find out that—
in the model defined above—we gain nothing for our troubles! The reason is plain. The hidden units
above are given by an affine function of the inputs, and the outputs (pre-softmax) are just an affine
function of the hidden units. An affine function of an affine function is itself an affine function.
Moreover, our linear model was already capable of representing any affine function.
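To see this formally we can just collapse out the hidden layer in the above definition, yielding an
equivalent single-layer model with parameters $\mathbf{W} = \mathbf{W}^{(1)} \mathbf{W}^{(2)}$ and $\mathbf{b} = \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}$:
$$\mathbf{O} = (\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}) \mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)} = \mathbf{X} \mathbf{W} + \mathbf{b}. \tag{5.1.2}$$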
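The collapse in (5.1.2) is easy to confirm numerically. The sketch below uses random tensors with the sizes of Fig. 5.1.1 (4 inputs, 5 hidden units, 3 outputs); the batch size, the seed, and the use of double precision are arbitrary choices for this check.

```python
import torch

torch.manual_seed(0)
n, d, h, q = 6, 4, 5, 3                      # batch size, inputs, hidden units, outputs
X = torch.randn(n, d, dtype=torch.float64)
W1 = torch.randn(d, h, dtype=torch.float64)
b1 = torch.randn(1, h, dtype=torch.float64)
W2 = torch.randn(h, q, dtype=torch.float64)
b2 = torch.randn(1, q, dtype=torch.float64)

# Two stacked affine layers without an activation in between ...
O_two_layer = (X @ W1 + b1) @ W2 + b2

# ... equal a single affine layer with W = W1 W2 and b = b1 W2 + b2.
W = W1 @ W2
b = b1 @ W2 + b2
O_one_layer = X @ W + b

same = torch.allclose(O_two_layer, O_one_layer)
```

The two outputs agree up to floating-point rounding, so the extra parameters of the hidden layer buy no additional expressive power.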

In order to realize the potential of multilayer architectures, we need one more key ingredient: a
nonlinear activation function $\sigma$ to be applied to each hidden unit following the affine transformation.
For instance, a popular choice is the ReLU (Rectified Linear Unit) activation function (Nair
and Hinton, 2010) $\sigma(x) = \max(0, x)$ operating on its arguments element-wise. The outputs of
activation functions $\sigma(\cdot)$ are called activations. In general, with activation functions in place, it is no
longer possible to collapse our MLP into a linear model:
$$\begin{aligned} \mathbf{H} &= \sigma(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)}), \\ \mathbf{O} &= \mathbf{H} \mathbf{W}^{(2)} + \mathbf{b}^{(2)}. \end{aligned} \tag{5.1.3}$$
Since each row in X corresponds to an example in the minibatch, with some abuse of notation,
we define the nonlinearity σ to apply to its inputs in a row-wise fashion, i.e., one example at a
time. Note that we used the same notation for softmax when we denoted a row-wise operation
in Section 4.1.1. Quite frequently the activation functions we use apply not merely row-wise but
element-wise. That means that after computing the linear portion of the layer, we can calculate
each activation without looking at the values taken by the other hidden units.
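To build more general MLPs, we can continue stacking such hidden layers, e.g., $\mathbf{H}^{(1)} = \sigma_1(\mathbf{X} \mathbf{W}^{(1)} + \mathbf{b}^{(1)})$
and $\mathbf{H}^{(2)} = \sigma_2(\mathbf{H}^{(1)} \mathbf{W}^{(2)} + \mathbf{b}^{(2)})$, one atop another, yielding ever more expressive models.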
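Stacking hidden layers amounts to a short loop over (weight, bias) pairs. The helper below is a sketch under simplifying assumptions: the name `mlp_forward`, the list-of-pairs parameter layout, and the use of the same activation for every hidden layer are choices made here for illustration.

```python
import torch

def mlp_forward(X, params, activation=torch.relu):
    """Apply affine layers in sequence, with an activation after each hidden layer."""
    H = X
    for W, b in params[:-1]:
        H = activation(H @ W + b)      # hidden layer: affine map, then nonlinearity
    W_out, b_out = params[-1]
    return H @ W_out + b_out           # the output layer stays affine

torch.manual_seed(0)
sizes = [4, 5, 5, 3]                   # d inputs, two hidden layers of 5 units, q outputs
params = [(torch.randn(i, o) * 0.1, torch.zeros(1, o))
          for i, o in zip(sizes[:-1], sizes[1:])]
O = mlp_forward(torch.randn(6, 4), params)
```

Adding another hidden layer only requires appending one more entry to `sizes`; the forward computation itself does not change.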
Universal Approximators
We know that the brain is capable of very sophisticated statistical analysis. As such, it is worth ask-
ing, just how powerful a deep network could be. This question has been answered multiple times,
e.g., in (Cybenko, 1989) in the context of MLPs, and in (Micchelli, 1984) in the context of repro-
ducing kernel Hilbert spaces in a way that could be seen as radial basis function (RBF) networks
with a single hidden layer. These (and related results) suggest that even with a single-hidden-
layer network, given enough nodes (possibly absurdly many), and the right set of weights, we can
model any function. Actually learning that function is the hard part, though. You might think of
your neural network as being a bit like the C programming language. The language, like any other
modern language, is capable of expressing any computable program. But actually coming up with
a program that meets your specifications is the hard part.
Moreover, just because a single-hidden-layer network can learn any function does not mean that
you should try to solve all of your problems with single-hidden-layer networks. In fact, in this
case kernel methods are far more effective, since they are capable of solving the problem exactly
even in infinite dimensional spaces (Kimeldorf and Wahba, 1971, Schölkopf et al., 2001). Indeed,
we can approximate many functions much more compactly by using deeper (vs. wider) networks
(Simonyan and Zisserman, 2014). We will touch upon more rigorous arguments in subsequent
chapters.
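As a small illustration of what "the right set of weights" can mean, the sketch below hand-picks the weights of a one-hidden-layer ReLU network with two hidden units so that it computes the absolute value exactly, a function that no single affine layer can represent. This is only a toy instance and not a demonstration of the approximation results cited above; the tensor names are chosen for illustration.

```python
import torch

# Hidden layer with two ReLU units computing relu(x) and relu(-x).
W1 = torch.tensor([[1.0, -1.0]])       # shape (1, 2)
b1 = torch.zeros(1, 2)
# The output layer adds the two hidden units.
W2 = torch.tensor([[1.0], [1.0]])      # shape (2, 1)
b2 = torch.zeros(1, 1)

x = torch.linspace(-3.0, 3.0, 13).reshape(-1, 1)
y = torch.relu(x @ W1 + b1) @ W2 + b2  # equals |x| for every input
```

Here the weights were found by inspection. For realistic targets no such shortcut exists, which is why actually learning the function is the hard part.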
5.1.2 Activation Functions
An activation function decides whether, and how strongly, a neuron is activated, based on the
weighted sum of its inputs plus a bias. Activation functions are differentiable operators that transform
input signals to outputs, and most of them add nonlinearity. Because activation functions are
fundamental to deep learning, let's briefly survey some common ones.
%matplotlib inline
import torch
from d2l import torch as d2l

ReLU Function
The most popular choice, due to both simplicity of implementation and its good performance
on a variety of predictive tasks, is the rectified linear unit (ReLU) (Nair and Hinton, 2010). ReLU
provides a very simple nonlinear transformation. Given an element x, the function is defined as
the maximum of that element and 0:
ReLU(x) = max(x, 0). (5.1.4)
Informally, the ReLU function retains only positive elements and discards all negative elements
by setting the corresponding activations to 0. To gain some intuition, we can plot the function. As
you can see, the activation function is piecewise linear.
x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.relu(x)
d2l.plot(x.detach(), y.detach(), 'x', 'relu(x)', figsize=(5, 2.5))
When the input is negative, the derivative of the ReLU function is 0, and when the input is positive,
the derivative of the ReLU function is 1. Note that the ReLU function is not differentiable when the
input takes value precisely equal to 0. In these cases, we default to the left-hand-side derivative
and say that the derivative is 0 when the input is 0. We can get away with this because the input may
never actually be zero (mathematicians would say that it is nondifferentiable on a set of measure
zero). There is an old adage that if subtle boundary conditions matter, we are probably doing (real)
mathematics, not engineering. That conventional wisdom may apply here, or at least the fact that
we are not performing constrained optimization does (Mangasarian, 1965, Rockafellar, 1970). We plot
the derivative of the ReLU function below.
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of relu', figsize=(5, 2.5))

The reason for using ReLU is that its derivatives are particularly well behaved: either they vanish
or they just let the argument through. This makes optimization better behaved and mitigates
the well-documented problem of vanishing gradients that plagued previous versions of neural
networks (more on this later).
Note that there are many variants to the ReLU function, including the parameterized ReLU (pReLU)
function (He et al., 2015). This variation adds a linear term to ReLU, so some information still gets
through, even when the argument is negative:
pReLU(x) = max(0, x) + α min(0, x). (5.1.5)
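Equation (5.1.5) translates directly into code. In the sketch below the slope `alpha` is fixed to an arbitrary value for illustration, whereas in pReLU it is a learnable parameter. The result is compared with PyTorch's built-in `torch.nn.functional.prelu`.

```python
import torch
import torch.nn.functional as F

def prelu(x, alpha):
    """pReLU(x) = max(0, x) + alpha * min(0, x), applied element-wise."""
    return torch.clamp(x, min=0) + alpha * torch.clamp(x, max=0)

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
alpha = 0.25                                   # illustrative slope; learned in practice
y = prelu(x, alpha)                            # negative inputs are scaled by alpha
y_builtin = F.prelu(x, torch.tensor([alpha]))  # PyTorch's built-in, for comparison
```

Setting `alpha` to 0 recovers the plain ReLU, while any nonzero value keeps a gradient flowing for negative inputs.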
Sigmoid Function
The sigmoid function transforms its inputs, for which values lie in the domain $\mathbb{R}$, to outputs that lie
on the interval (0, 1). For that reason, the sigmoid is often called a squashing function: it squashes
any input in the range (-inf, inf) to some value in the range (0, 1):
$$\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}. \tag{5.1.6}$$
In the earliest neural networks, scientists were interested in modeling biological neurons which
either fire or do not fire. Thus the pioneers of this field, going all the way back to McCulloch and
Pitts, the inventors of the artificial neuron, focused on thresholding units (McCulloch and Pitts,
1943). A thresholding activation takes value 0 when its input is below some threshold and value 1
when the input exceeds the threshold.
When attention shifted to gradient-based learning, the sigmoid function was a natural choice
because it is a smooth, differentiable approximation to a thresholding unit. Sigmoids are still widely
used as activation functions on the output units when we want to interpret the outputs as
probabilities for binary classification problems: you can think of the sigmoid as a special case of the
softmax. However, the sigmoid has largely been replaced by the simpler and more easily
trainable ReLU for most uses in hidden layers. Much of this has to do with the fact that the sigmoid
poses challenges for optimization (LeCun et al., 1998) since its gradient vanishes for large positive
and negative arguments. This can lead to plateaus that are difficult to escape from. Nonetheless
sigmoids are important. In later chapters (e.g., Section 10.2) on recurrent neural networks, we
will describe architectures that leverage sigmoid units to control the flow of information across
time.

Below, we plot the sigmoid function. Note that when the input is close to 0, the sigmoid function
approaches a linear transformation.
y = torch.sigmoid(x)
d2l.plot(x.detach(), y.detach(), 'x', 'sigmoid(x)', figsize=(5, 2.5))
The derivative of the sigmoid function is given by the following equation:
$$\frac{d}{dx} \operatorname{sigmoid}(x) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \operatorname{sigmoid}(x)\left(1 - \operatorname{sigmoid}(x)\right). \tag{5.1.7}$$
The derivative of the sigmoid function is plotted below. Note that when the input is 0, the deriva-
tive of the sigmoid function reaches a maximum of 0.25. As the input diverges from 0 in either
direction, the derivative approaches 0.
# Clear out previous gradients
x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of sigmoid', figsize=(5, 2.5))
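As a cross-check that does not rely on plotting, the gradient computed by autograd can be compared with the closed form in (5.1.7). The snippet is a standalone sketch with its own `x`; the tolerance is an arbitrary choice.

```python
import torch

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.sigmoid(x)
y.backward(torch.ones_like(x))            # x.grad now holds d sigmoid / dx

analytic = (y * (1 - y)).detach()         # sigmoid(x) * (1 - sigmoid(x))
matches = torch.allclose(x.grad, analytic, atol=1e-6)
max_grad = x.grad.max().item()            # largest value, attained near x = 0
```

The largest gradient on this grid is approximately 0.25, in line with the maximum stated above.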

Tanh Function
Like the sigmoid function, the tanh (hyperbolic tangent) function also squashes its inputs, trans-
forming them into elements on the interval between -1 and 1:
$$\tanh(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}. \tag{5.1.8}$$
We plot the tanh function below. Note that as input nears 0, the tanh function approaches a linear
transformation. Although the shape of the function is similar to that of the sigmoid function, the
tanh function exhibits point symmetry about the origin of the coordinate system (Kalman and
Kwasny, 1992).
y = torch.tanh(x)
d2l.plot(x.detach(), y.detach(), 'x', 'tanh(x)', figsize=(5, 2.5))
The derivative of the tanh function is:

$$\frac{d}{dx} \tanh(x) = 1 - \tanh^2(x). \tag{5.1.9}$$
It is plotted below. As the input nears 0, the derivative of the tanh function approaches a maximum
of 1. And as we saw with the sigmoid function, as input moves away from 0 in either direction,
the derivative of the tanh function approaches 0.
# Clear out previous gradients
x.grad.data.zero_()
y.backward(torch.ones_like(x), retain_graph=True)
d2l.plot(x.detach(), x.grad, 'x', 'grad of tanh', figsize=(5, 2.5))
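The same kind of numerical check works for (5.1.9). Again this is a standalone sketch; the grid and the tolerance are arbitrary choices.

```python
import torch

x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.tanh(x)
y.backward(torch.ones_like(x))            # x.grad now holds d tanh / dx

analytic = (1 - y ** 2).detach()          # 1 - tanh^2(x)
matches = torch.allclose(x.grad, analytic, atol=1e-6)
max_grad = x.grad.max().item()            # largest value, attained near x = 0
edge_grad = x.grad[0].item()              # gradient at x = -8, already close to 0
```

Comparing `max_grad` with `edge_grad` makes the saturation visible in numbers: the gradient is close to 1 around the origin and nearly vanishes at the ends of the grid.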

5.1.3 Summary
We now know how to incorporate nonlinearities to build expressive multilayer neural network
architectures. As a side note, your knowledge already puts you in command of a similar toolkit
to a practitioner circa 1990. In some ways, you have an advantage over anyone working in the
1990s, because you can leverage powerful open-source deep learning frameworks to build models
rapidly, using only a few lines of code. Previously, training these networks required researchers
to code up layers and derivatives explicitly in C, Fortran, or even Lisp (in the case of LeNet).
A secondary benefit is that ReLU is significantly more amenable to optimization than the sigmoid
or the tanh function. One could argue that this was one of the key innovations that helped the
resurgence of deep learning over the past decade. Note, though, that research in activation func-
tions has not stopped. For instance, the Swish activation function σ(x) = x sigmoid(βx) as pro-
posed in (Ramachandran et al., 2017) can yield better accuracy in many cases.
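The Swish formula is a one-liner. The sketch below treats `beta` as a fixed argument, although it can also be learned. It checks two properties: with `beta = 1` the function coincides with PyTorch's built-in SiLU, and for large `beta` it approaches ReLU.

```python
import torch
import torch.nn.functional as F

def swish(x, beta=1.0):
    """Swish(x) = x * sigmoid(beta * x), applied element-wise."""
    return x * torch.sigmoid(beta * x)

x = torch.linspace(-4.0, 4.0, 9)
y = swish(x)                        # beta = 1 coincides with F.silu
y_relu_like = swish(x, beta=50.0)   # a large beta makes the gate nearly a step function
```

Unlike ReLU, Swish is smooth at the origin and slightly negative for small negative inputs.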
Exercises
1. Show that adding layers to a linear deep network, i.e., a network without nonlinearity σ, can
never increase the expressive power of the network. Give an example where it actively
reduces it.
2. Compute the derivative of the pReLU activation function.
3. Compute the derivative of the Swish activation function x sigmoid(βx).
4. Show that an MLP using only ReLU (or pReLU) constructs a continuous piecewise linear
function.
5. Sigmoid and tanh are very similar.
1. Show that tanh(x) + 1 = 2 sigmoid(2x).
2. Prove that the function classes parametrized by both nonlinearities are identical. Hint:
affine layers have bias terms, too.
6. Assume that we have a nonlinearity that applies to one minibatch at a time, such as batch
normalization (Ioffe and Szegedy, 2015). What kinds of problems do you expect this to cause?
7. Provide an example where the gradients vanish for the sigmoid activation function.

Discussions
5.2 Implementation of Multilayer Perceptrons
Multilayer perceptrons (MLPs) are not much more complex to implement than simple linear mod-
els. The key conceptual difference is that we now concatenate multiple layers.
import torch
from torch import nn
from d2l import torch as d2l
5.2.1 Implementation from Scratch
Let's begin again by implementing such a network from scratch.
Initializing Model Parameters
Recall that Fashion-MNIST contains 10 classes, and that each image consists of a 28 × 28 = 784
grid of grayscale pixel values. As before we will disregard the spatial structure among the pixels
for now, so we can think of this as a classification dataset with 784 input features and 10 classes. To
begin, we will implement an MLP with one hidden layer and 256 hidden units. Both the number of
layers and their width are adjustable (they are considered hyperparameters). Typically, we choose
the layer widths to be divisible by larger powers of 2. This is computationally efficient due to the
way memory is allocated and addressed in hardware.
Again, we will represent our parameters with several tensors. Note that for every layer, we must
keep track of one weight matrix and one bias vector. As always, we allocate memory for the gra-
dients of the loss with respect to these parameters.
class MLPScratch(d2l.Classifier):
    def __init__(self, num_inputs, num_outputs, num_hiddens, lr, sigma=0.01):
        super().__init__()
        # Store the constructor arguments (e.g., self.num_inputs, self.lr) as attributes
        self.save_hyperparameters()
        self.W1 = nn.Parameter(torch.randn(num_inputs, num_hiddens) * sigma)
        self.b1 = nn.Parameter(torch.zeros(num_hiddens))
        self.W2 = nn.Parameter(torch.randn(num_hiddens, num_outputs) * sigma)
        self.b2 = nn.Parameter(torch.zeros(num_outputs))
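As a quick sanity check on the bookkeeping, the following sketch tallies the number of scalar parameters for the sizes used in this section (784 inputs, 256 hidden units, 10 outputs). It only assumes the parameter shapes from the class above; the dictionary `shapes` is an illustrative helper and not part of d2l.

```python
import torch

num_inputs, num_hiddens, num_outputs = 784, 256, 10
shapes = {
    'W1': (num_inputs, num_hiddens),    # hidden-layer weights
    'b1': (num_hiddens,),               # hidden-layer biases
    'W2': (num_hiddens, num_outputs),   # output-layer weights
    'b2': (num_outputs,),               # output-layer biases
}
params = {name: torch.zeros(shape) for name, shape in shapes.items()}
total = sum(p.numel() for p in params.values())   # 200704 + 256 + 2560 + 10
```

Almost all of the 203,530 parameters sit in the first weight matrix, which is typical when a wide input is mapped to a hidden layer.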

Model
To make sure we know how everything works, we will implement the ReLU activation ourselves
rather than invoking the built-in relu function directly.
def relu(X):
    a = torch.zeros_like(X)
    return torch.max(X, a)
Since we are disregarding spatial structure, we reshape each two-dimensional image into a flat
vector of length num_inputs. Finally, we implement our model with just a few lines of code. Since
we use the framework's built-in autograd, this is all that it takes.
@d2l.add_to_class(MLPScratch)
def forward(self, X):
    X = X.reshape((-1, self.num_inputs))  # flatten each image into a vector
    H = relu(torch.matmul(X, self.W1) + self.b1)
    return torch.matmul(H, self.W2) + self.b2
Training
Fortunately, the training loop for MLPs is exactly the same as for softmax regression. We define
the model, data, trainer and finally invoke the fit function on model and data.
model = MLPScratch(num_inputs=784, num_outputs=10, num_hiddens=256, lr=0.1)


5.2.2 Concise Implementation
As you might expect, by relying on the high-level APIs, we can implement MLPs even more
concisely.
Model
As compared with our concise implementation of softmax regression (Section 4.5), the only
difference is that we add two fully connected layers where we previously added only
one. The first is the hidden layer, the second is the output layer.
class MLP(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens, lr):
        super().__init__()
        # Store the constructor arguments (e.g., self.lr) as attributes
        self.save_hyperparameters()
        self.net = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_hiddens),
                                 nn.ReLU(), nn.LazyLinear(num_outputs))
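Note that the concise model never mentions the number of input features. The sketch below, which uses plain PyTorch outside of the d2l class and a random dummy batch, illustrates that `nn.LazyLinear` infers its input dimension during the first forward pass.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Flatten(), nn.LazyLinear(256),
                    nn.ReLU(), nn.LazyLinear(10))

X = torch.randn(2, 1, 28, 28)   # a dummy minibatch of two 28x28 grayscale images
out = net(X)                    # the first forward pass fixes the input dimensions

hidden_weight_shape = tuple(net[1].weight.shape)   # (256, 784)
output_weight_shape = tuple(net[3].weight.shape)   # (10, 256)
```

PyTorch stores linear weights as (outputs, inputs), i.e., transposed relative to the matrices `W1` and `W2` in the from-scratch version.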
Training
The training loop is exactly the same as when we implemented softmax regression. This modu-
larity enables us to separate matters concerning the model architecture from orthogonal consid-
erations.
model = MLP(num_outputs=10, num_hiddens=256, lr=0.1)
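For readers curious about what a fit routine does at its core, here is a minimal sketch of a training loop in plain PyTorch. It is not the implementation behind the d2l trainer; the synthetic XOR-like data, the network width, the learning rate, and the number of steps are assumptions made so that the example is self-contained.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Synthetic two-class data that is not linearly separable (XOR-like quadrants).
X = torch.randn(256, 2)
y = (X[:, 0] * X[:, 1] > 0).long()

net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

losses = []
for step in range(200):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(net(X), y)      # forward pass and loss on the full batch
    loss.backward()                # backpropagation via autograd
    optimizer.step()               # gradient descent update
    losses.append(loss.item())
```

Nothing in the loop depends on the architecture, which is the modularity referred to above: swapping the model for a deeper one leaves the remaining lines unchanged.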


5.2.3 Summary
Now that we have more practice in designing deep networks, the step from a single to multiple
layers of deep networks doesnʼt pose such a significant challenge any longer. In particular, we can
reuse the training algorithm and data loader. Note, though, that implementing MLPs from scratch
is nonetheless messy: naming and keeping track of the model parameters makes it difficult to
extend models. For instance, imagine wanting to insert another layer between layers 42 and 43.
This might now be layer 42b, unless we are willing to perform sequential renaming. Moreover, if
we implement the network from scratch, it is much more difficult for the framework to perform
meaningful performance optimizations.
Nonetheless, you have now reached the state of the art of the late 1980s when fully connected deep
networks were the method of choice for neural network modeling. Our next conceptual step will
be to consider images. Before we do so, we need to review a number of statistical basics and details
on how to compute models efficiently.
Exercises
1. Change the number of hidden units num_hiddens and plot how its number affects the accu-
racy of the model. What is the best value of this hyperparameter?
2. Try adding a hidden layer to see how it affects the results.
3. Why is it a bad idea to insert a hidden layer with a single neuron? What could go wrong?
4. How does changing the learning rate alter your results? With all other parameters fixed,
which learning rate gives you the best results? How does this relate to the number of epochs?
5. Letʼs optimize over all hyperparameters jointly, i.e., learning rate, number of epochs, num-
ber of hidden layers, and number of hidden units per layer.
1. What is the best result you can get by optimizing over all of them?
2. Why is it much more challenging to deal with multiple hyperparameters?
3. Describe an efficient strategy for optimizing over multiple parameters jointly.
6. Compare the speed of the framework and the from-scratch implementation for a challenging
problem. How does it change with the complexity of the network?
7. Measure the speed of tensor-matrix multiplications for well-aligned and misaligned matri-
ces. For instance, test for matrices with dimension 1024, 1025, 1026, 1028, and 1032.
1. How does this change between GPUs and CPUs?
2. Determine the memory bus width of your CPU and GPU.
8. Try out different activation functions. Which one works best?
9. Is there a difference between weight initializations of the network? Does it matter?

5.3 Forward Propagation, Backward Propagation, and Computational Graphs
So far, we have trained our models with minibatch stochastic gradient descent. However, when
we implemented the algorithm, we only worried about the calculations involved in forward prop-
agation through the model. When it came time to calculate the gradients, we just invoked the
backpropagation function provided by the deep learning framework.
The automatic calculation of gradients (automatic differentiation) profoundly simplifies the im-
plementation of deep learning algorithms. Before automatic differentiation, even small changes
to complicated models required recalculating complicated derivatives by hand. Surprisingly of-
ten, academic papers had to allocate numerous pages to deriving update rules. While we must
continue to rely on automatic differentiation so we can focus on the interesting parts, you ought
to know how these gradients are calculated under the hood if you want to go beyond a shallow
understanding of deep learning.
In this section, we take a deep dive into the details of backward propagation (more commonly called
backpropagation). To convey some insight for both the techniques and their implementations, we
rely on some basic mathematics and computational graphs. To start, we focus our exposition
on a one-hidden-layer MLP with weight decay (ℓ2 regularization, to be described in subsequent
chapters).
5.3.1 Forward Propagation
Forward propagation (or forward pass) refers to the calculation and storage of intermediate variables
(including outputs) for a neural network in order from the input layer to the output layer. We now
work step-by-step through the mechanics of a neural network with one hidden layer. This may
seem tedious but in the eternal words of funk virtuoso James Brown, you must “pay the cost to be
the boss”.
For the sake of simplicity, letʼs assume that the input example is $\mathbf{x} \in \mathbb{R}^d$ and that our hidden layer
does not include a bias term. Here the intermediate variable is:
$$\mathbf{z} = \mathbf{W}^{(1)} \mathbf{x}, \tag{5.3.1}$$
where $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ is the weight parameter of the hidden layer. After running the intermediate
variable $\mathbf{z} \in \mathbb{R}^h$ through the activation function $\phi$ we obtain our hidden activation vector of length $h$,
$$\mathbf{h} = \phi(\mathbf{z}). \tag{5.3.2}$$
The hidden layer output $\mathbf{h}$ is also an intermediate variable. Assuming that the parameters of the
output layer only possess a weight of $\mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$, we can obtain an output layer variable with a
vector of length $q$:
$$\mathbf{o} = \mathbf{W}^{(2)} \mathbf{h}. \tag{5.3.3}$$
Assuming that the loss function is $l$ and the example label is $y$, we can then calculate the loss term
for a single data example,
$$L = l(\mathbf{o}, y). \tag{5.3.4}$$

According to the definition of $\ell_2$ regularization that we will introduce later, given the hyperparameter
$\lambda$, the regularization term is
$$s = \frac{\lambda}{2} \left(\|\mathbf{W}^{(1)}\|_\mathrm{F}^2 + \|\mathbf{W}^{(2)}\|_\mathrm{F}^2\right), \tag{5.3.5}$$
where the Frobenius norm of the matrix is simply the $\ell_2$ norm applied after flattening the matrix
into a vector. Finally, the modelʼs regularized loss on a given data example is:
$$J = L + s. \tag{5.3.6}$$
We refer to J as the objective function in the following discussion.
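The forward pass in (5.3.1) to (5.3.6) can be traced by hand on a toy example. The sketch below is only an illustration in plain Python: it assumes a ReLU activation for $\phi$ and the squared error $\frac{1}{2}\|\mathbf{o} - \mathbf{y}\|^2$ for $l$, neither of which is fixed by the text, and the helper names `matvec`, `frobenius_sq`, and `forward` are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def frobenius_sq(W):
    """Squared Frobenius norm: the sum of all squared entries."""
    return sum(w * w for row in W for w in row)


def forward(x, y, W1, W2, lam):
    """Forward pass of the bias-free one-hidden-layer MLP.

    Assumptions (not from the text): phi is ReLU, l is 0.5 * ||o - y||^2.
    """
    z = matvec(W1, x)                                          # (5.3.1)
    h = [max(0.0, z_i) for z_i in z]                           # (5.3.2)
    o = matvec(W2, h)                                          # (5.3.3)
    L = 0.5 * sum((o_i - y_i) ** 2 for o_i, y_i in zip(o, y))  # (5.3.4)
    s = 0.5 * lam * (frobenius_sq(W1) + frobenius_sq(W2))      # (5.3.5)
    J = L + s                                                  # (5.3.6)
    # Every intermediate value is returned because backpropagation needs it.
    return {"z": z, "h": h, "o": o, "L": L, "s": s, "J": J}


# Toy sizes: d = 2 inputs, h = 2 hidden units, q = 1 output.
x = [1.0, 2.0]
y = [1.0]
W1 = [[1.0, 0.0], [0.0, -1.0]]
W2 = [[2.0, 3.0]]
cache = forward(x, y, W1, W2, lam=0.1)
print(cache)
```

Here $\mathbf{z} = (1, -2)$, so the second hidden unit is switched off by the ReLU, and the objective splits into a loss of 0.5 and a regularization term of 0.75.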
5.3.2 Computational Graph of Forward Propagation
Plotting computational graphs helps us visualize the dependencies of operators and variables
within the calculation. Fig. 5.3.1 contains the graph associated with the simple network described
above, where squares denote variables and circles denote operators. The lower-left corner signi-
fies the input and the upper-right corner is the output. Notice that the directions of the arrows
(which illustrate data flow) are primarily rightward and upward.
Fig. 5.3.1: Computational graph of forward propagation.
5.3.3 Backpropagation
Backpropagation refers to the method of calculating the gradient of neural network parameters.
In short, the method traverses the network in reverse order, from the output to the input layer,
according to the chain rule from calculus. The algorithm stores any intermediate variables (partial
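derivatives) required while calculating the gradient with respect to some parameters. Assume that
we have functions $\mathsf{Y} = f(\mathsf{X})$ and $\mathsf{Z} = g(\mathsf{Y})$, in which the input and the output $\mathsf{X}, \mathsf{Y}, \mathsf{Z}$ are tensors of
arbitrary shapes. By using the chain rule, we can compute the derivative of $\mathsf{Z}$ with respect to $\mathsf{X}$ via
$$\frac{\partial \mathsf{Z}}{\partial \mathsf{X}} = \mathrm{prod}\left(\frac{\partial \mathsf{Z}}{\partial \mathsf{Y}}, \frac{\partial \mathsf{Y}}{\partial \mathsf{X}}\right). \tag{5.3.7}$$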
Here we use the prod operator to multiply its arguments after the necessary operations, such as
transposition and swapping input positions, have been carried out. For vectors, this is straight-
forward: it is simply matrix-matrix multiplication. For higher dimensional tensors, we use the
appropriate counterpart. The operator prod hides all the notation overhead.
Recall that the parameters of the simple network with one hidden layer, whose computational
graph is in Fig. 5.3.1, are $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$. The objective of backpropagation is to calculate the
gradients $\partial J/\partial \mathbf{W}^{(1)}$ and $\partial J/\partial \mathbf{W}^{(2)}$. To accomplish this, we apply the chain rule and calculate, in turn,
the gradient of each intermediate variable and parameter. The order of calculations is reversed
relative to those performed in forward propagation, since we need to start with the outcome of the
computational graph and work our way towards the parameters. The first step is to calculate the
gradients of the objective function $J = L + s$ with respect to the loss term $L$ and the regularization
term $s$:
$$\frac{\partial J}{\partial L} = 1 \quad \text{and} \quad \frac{\partial J}{\partial s} = 1. \tag{5.3.8}$$
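Next, we compute the gradient of the objective function with respect to the output layer variable
$\mathbf{o}$ according to the chain rule:
$$\frac{\partial J}{\partial \mathbf{o}} = \mathrm{prod}\left(\frac{\partial J}{\partial L}, \frac{\partial L}{\partial \mathbf{o}}\right) = \frac{\partial L}{\partial \mathbf{o}} \in \mathbb{R}^q. \tag{5.3.9}$$
Next, we calculate the gradients of the regularization term with respect to both parameters:
$$\frac{\partial s}{\partial \mathbf{W}^{(1)}} = \lambda \mathbf{W}^{(1)} \quad \text{and} \quad \frac{\partial s}{\partial \mathbf{W}^{(2)}} = \lambda \mathbf{W}^{(2)}. \tag{5.3.10}$$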
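Now we are able to calculate the gradient $\partial J/\partial \mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}$ of the model parameters closest to
the output layer. Using the chain rule yields:
$$\frac{\partial J}{\partial \mathbf{W}^{(2)}} = \mathrm{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}}\right) + \mathrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(2)}}\right) = \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}. \tag{5.3.11}$$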
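To obtain the gradient with respect to $\mathbf{W}^{(1)}$ we need to continue backpropagation along the output
layer to the hidden layer. The gradient with respect to the hidden layer output $\partial J/\partial \mathbf{h} \in \mathbb{R}^h$ is given
by
$$\frac{\partial J}{\partial \mathbf{h}} = \mathrm{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{h}}\right) = {\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}. \tag{5.3.12}$$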
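Since the activation function $\phi$ applies elementwise, calculating the gradient $\partial J/\partial \mathbf{z} \in \mathbb{R}^h$ of the
intermediate variable $\mathbf{z}$ requires that we use the elementwise multiplication operator, which we
denote by $\odot$:
$$\frac{\partial J}{\partial \mathbf{z}} = \mathrm{prod}\left(\frac{\partial J}{\partial \mathbf{h}}, \frac{\partial \mathbf{h}}{\partial \mathbf{z}}\right) = \frac{\partial J}{\partial \mathbf{h}} \odot \phi'\left(\mathbf{z}\right). \tag{5.3.13}$$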
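Finally, we can obtain the gradient $\partial J/\partial \mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}$ of the model parameters closest to the input
layer. According to the chain rule, we get
$$\frac{\partial J}{\partial \mathbf{W}^{(1)}} = \mathrm{prod}\left(\frac{\partial J}{\partial \mathbf{z}}, \frac{\partial \mathbf{z}}{\partial \mathbf{W}^{(1)}}\right) + \mathrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(1)}}\right) = \frac{\partial J}{\partial \mathbf{z}} \mathbf{x}^\top + \lambda \mathbf{W}^{(1)}. \tag{5.3.14}$$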
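One way to gain confidence in (5.3.9) to (5.3.14) is to implement them directly and compare the result against finite differences. The following plain-Python sketch does this under assumptions that are not part of the text: $\phi$ is taken to be tanh (so that $\phi'(z) = 1 - \tanh^2 z$), the loss is the squared error $\frac{1}{2}\|\mathbf{o} - \mathbf{y}\|^2$, and all function names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (rows may be lists or tuples) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[a_i * b_j for b_j in b] for a_i in a]


def objective(x, y, W1, W2, lam):
    """Forward pass; returns J and the intermediates needed for backprop."""
    z = matvec(W1, x)
    h = [math.tanh(v) for v in z]          # assumed activation: tanh
    o = matvec(W2, h)
    L = 0.5 * sum((a - b) ** 2 for a, b in zip(o, y))  # assumed loss
    s = 0.5 * lam * sum(w * w for W in (W1, W2) for row in W for w in row)
    return L + s, (z, h, o)


def backward(x, y, W1, W2, lam):
    """Gradients of J with respect to W1 and W2 via (5.3.9)-(5.3.14)."""
    _, (z, h, o) = objective(x, y, W1, W2, lam)
    dJ_do = [a - b for a, b in zip(o, y)]                    # (5.3.9)
    dJ_dW2 = [[g + lam * w for g, w in zip(g_row, w_row)]    # (5.3.11)
              for g_row, w_row in zip(outer(dJ_do, h), W2)]
    dJ_dh = matvec(list(zip(*W2)), dJ_do)                    # (5.3.12)
    dJ_dz = [g * (1.0 - math.tanh(v) ** 2)                   # (5.3.13)
             for g, v in zip(dJ_dh, z)]
    dJ_dW1 = [[g + lam * w for g, w in zip(g_row, w_row)]    # (5.3.14)
              for g_row, w_row in zip(outer(dJ_dz, x), W1)]
    return dJ_dW1, dJ_dW2


def numerical_grad(x, y, W1, W2, lam, W, i, j, eps=1e-6):
    """Central finite difference of J with respect to the entry W[i][j]."""
    old = W[i][j]
    W[i][j] = old + eps
    J_plus, _ = objective(x, y, W1, W2, lam)
    W[i][j] = old - eps
    J_minus, _ = objective(x, y, W1, W2, lam)
    W[i][j] = old                          # restore the original weight
    return (J_plus - J_minus) / (2 * eps)


x, y, lam = [0.5, -1.0, 2.0], [1.0, 0.0], 0.1
W1 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]   # h = 2, d = 3
W2 = [[0.7, -0.8], [0.9, 1.0]]              # q = 2, h = 2
dW1, dW2 = backward(x, y, W1, W2, lam)

max_err = max(
    abs(G[i][j] - numerical_grad(x, y, W1, W2, lam, W, i, j))
    for G, W in ((dW1, W1), (dW2, W2))
    for i in range(len(W)) for j in range(len(W[0])))
print(f"largest gap between analytic and numerical gradients: {max_err:.2e}")
```

The gradients have the same shapes as the parameters they belong to ($h \times d$ and $q \times h$), and the gap to the finite-difference estimate should be tiny compared with the gradient entries themselves.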
5.3.4 Training Neural Networks
When training neural networks, forward and backward propagation depend on each other. In
particular, for forward propagation, we traverse the computational graph in the direction of de-
pendencies and compute all the variables on its path. These are then used for backpropagation
where the compute order on the graph is reversed.
Take the aforementioned simple network as an example. On the one hand, computing
the regularization term (5.3.5) during forward propagation depends on the current values of
model parameters $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$. They are given by the optimization algorithm according to
backpropagation in the latest iteration. On the other hand, the gradient calculation for the parameter
(5.3.11) during backpropagation depends on the current value of the hidden layer output $\mathbf{h}$, which
is given by forward propagation.
Therefore when training neural networks, after model parameters are initialized, we alternate
forward propagation with backpropagation, updating model parameters using gradients given by
backpropagation. Note that backpropagation reuses the stored intermediate values from forward
propagation to avoid duplicate calculations. One of the consequences is that we need to retain
the intermediate values until backpropagation is complete. This is also one of the reasons why
training requires significantly more memory than plain prediction. Besides, the size of such in-
termediate values is roughly proportional to the number of network layers and the batch size.
Thus, training deeper networks using larger batch sizes more easily leads to out-of-memory errors.
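A back-of-the-envelope count makes the scaling with depth and batch size concrete. The sketch below rests on a simplifying assumption that is ours, not the textʼs: each hidden unit keeps one pre-activation and one post-activation value per example, stored as a 4-byte float. Real frameworks may keep more or fewer values, so the numbers are illustrative only, and the function names are hypothetical.

```python
def stored_values(hidden_widths, batch_size):
    """Count intermediate values kept for backpropagation in an MLP.

    Assumption for illustration: one pre-activation and one post-activation
    value per hidden unit per example.
    """
    return 2 * batch_size * sum(hidden_widths)


def to_mebibytes(n_values, bytes_per_value=4):
    """Convert a count of stored values to MiB (default: 4-byte floats)."""
    return n_values * bytes_per_value / 2**20


base = stored_values([256] * 4, batch_size=128)
deeper = stored_values([256] * 8, batch_size=128)        # twice the layers
bigger_batch = stored_values([256] * 4, batch_size=256)  # twice the batch

print(to_mebibytes(base))          # 1.0
print(to_mebibytes(deeper))        # 2.0
print(to_mebibytes(bigger_batch))  # 2.0
```

Doubling either the number of layers or the batch size doubles the stored intermediates, whereas prediction can discard each layerʼs values as soon as the next layer has been computed.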
5.3.5 Summary
• Forward propagation sequentially calculates and stores intermediate variables within the
computational graph defined by the neural network. It proceeds from the input to the output
layer.
• Backpropagation sequentially calculates and stores the gradients of intermediate variables
and parameters within the neural network in the reversed order.
• When training deep learning models, forward propagation and backpropagation are
interdependent.
• Training requires significantly more memory than prediction.
Exercises
1. Assume that the inputs X to some scalar function f are n × m matrices. What is the dimen-
sionality of the gradient of f with respect to X?
2. Add a bias to the hidden layer of the model described in this section (you do not need to
include bias in the regularization term).
1. Draw the corresponding computational graph.
2. Derive the forward and backward propagation equations.
3. Compute the memory footprint for training and prediction in the model described in this
section.
4. Assume that you want to compute second derivatives. What happens to the computational
graph? How long do you expect the calculation to take?
5. Assume that the computational graph is too large for your GPU.
1. Can you partition it over more than one GPU?
2. What are the advantages and disadvantages over training on a smaller minibatch?

5.4 Numerical Stability and Initialization
Thus far, every model that we have implemented required that we initialize its parameters accord-
ing to some pre-specified distribution. Until now, we took the initialization scheme for granted,
glossing over the details of how these choices are made. You might have even gotten the impres-
sion that these choices are not especially important. To the contrary, the choice of initialization
scheme plays a significant role in neural network learning, and it can be crucial for maintaining
numerical stability. Moreover, these choices can be tied up in interesting ways with the choice of
the nonlinear activation function. Which function we choose and how we initialize parameters
can determine how quickly our optimization algorithm converges. Poor choices here can cause us
to encounter exploding or vanishing gradients while training. In this section, we delve into these
topics in greater detail and discuss some heuristics that you will find useful throughout
your career in deep learning.
5.4.1 Vanishing and Exploding Gradients
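Consider a deep network with $L$ layers, input $\mathbf{x}$ and output $\mathbf{o}$. With each layer $l$ defined by a
transformation $f_l$ parameterized by weights $\mathbf{W}^{(l)}$, whose hidden layer output is $\mathbf{h}^{(l)}$ (let $\mathbf{h}^{(0)} = \mathbf{x}$), our
network can be expressed as:
$$\mathbf{h}^{(l)} = f_l (\mathbf{h}^{(l-1)}) \text{ and thus } \mathbf{o} = f_L \circ \cdots \circ f_1(\mathbf{x}). \tag{5.4.1}$$
If all the hidden layer outputs and the input are vectors, we can write the gradient of $\mathbf{o}$ with respect
to any set of parameters $\mathbf{W}^{(l)}$ as follows:
$$\partial_{\mathbf{W}^{(l)}} \mathbf{o} = \underbrace{\partial_{\mathbf{h}^{(L-1)}} \mathbf{h}^{(L)}}_{\mathbf{M}^{(L)} \stackrel{\mathrm{def}}{=}} \cdots \underbrace{\partial_{\mathbf{h}^{(l)}} \mathbf{h}^{(l+1)}}_{\mathbf{M}^{(l+1)} \stackrel{\mathrm{def}}{=}} \underbrace{\partial_{\mathbf{W}^{(l)}} \mathbf{h}^{(l)}}_{\mathbf{v}^{(l)} \stackrel{\mathrm{def}}{=}}. \tag{5.4.2}$$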
In other words, this gradient is the product of $L-l$ matrices $\mathbf{M}^{(L)} \cdots \mathbf{M}^{(l+1)}$ and the gradient
vector $\mathbf{v}^{(l)}$. Thus we are susceptible to the same problems of numerical underflow that often crop up
when multiplying together too many probabilities. When dealing with probabilities, a common
trick is to switch into log-space, i.e., shifting pressure from the mantissa to the exponent of the
numerical representation. Unfortunately, our problem above is more serious: initially the
matrices $\mathbf{M}^{(l)}$ may have a wide variety of eigenvalues. They might be small or large, and their product
might be very large or very small.
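The log-space trick mentioned above for probabilities is easy to demonstrate. The sketch below uses made-up values (100 probabilities of $10^{-5}$ each) and shows only the scalar case. As the text notes, it does not carry over directly to products of matrices.

```python
import math

probs = [1e-5] * 100   # illustrative values, not taken from any model

# Naive product: the true value 1e-500 is far below the smallest positive
# double-precision number, so the running product underflows to 0.0.
naive = 1.0
for p in probs:
    naive *= p

# Log-space: add logarithms instead of multiplying the raw values.
log_prod = sum(math.log(p) for p in probs)

print(naive)      # 0.0
print(log_prod)   # roughly -1151.29, i.e. 100 * log(1e-5)
```

The sum of logarithms stays comfortably within floating-point range even though the product itself cannot be represented.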
The risks posed by unstable gradients go beyond numerical representation. Gradients of unpre-
dictable magnitude also threaten the stability of our optimization algorithms. We may be fac-
ing parameter updates that are either (i) excessively large, destroying our model (the exploding
gradient problem); or (ii) excessively small (the vanishing gradient problem), rendering learning
impossible as parameters hardly move on each update.

Vanishing Gradients
One frequent culprit causing the vanishing gradient problem is the choice of the activation func-
tion σ that is appended following each layerʼs linear operations. Historically, the sigmoid function
1/(1 + exp(−x)) (introduced in Section 5.1) was popular because it resembles a thresholding func-
tion. Since early artificial neural networks were inspired by biological neural networks, the idea
of neurons that fire either fully or not at all (like biological neurons) seemed appealing. Letʼs take
a closer look at the sigmoid to see why it can cause vanishing gradients.
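%matplotlib inline
import torch
from d2l import torch as d2l  # provides the d2l.plot helper used below

# Inputs from -8 to 8; track gradients with respect to x
x = torch.arange(-8.0, 8.0, 0.1, requires_grad=True)
y = torch.sigmoid(x)
# y is a vector, so pass a vector of ones to obtain elementwise gradients
y.backward(torch.ones_like(x))

d2l.plot(x.detach().numpy(), [y.detach().numpy(), x.grad.numpy()],
         legend=['sigmoid', 'gradient'], figsize=(4.5, 2.5))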
As you can see, the sigmoidʼs gradient vanishes both when its inputs are large and when they are
small. Moreover, when backpropagating through many layers, unless we are in the Goldilocks
zone, where the inputs to many of the sigmoids are close to zero, the gradients of the overall
product may vanish. When our network boasts many layers, unless we are careful, the gradient
will likely be cut off at some layer. Indeed, this problem used to plague deep network training.
Consequently, ReLUs, which are more stable (but less neurally plausible), have emerged as the
default choice for practitioners.
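The size of the effect can be estimated with a few lines of arithmetic. The sigmoidʼs derivative is $\sigma(x)(1 - \sigma(x))$, which peaks at $0.25$ when $x = 0$. The sketch below looks only at the activation factors in the chain-rule product and ignores the weight matrices, so it is a rough illustration rather than a statement about any particular network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


print(sigmoid_grad(0.0))   # 0.25, the largest value the derivative can take
print(sigmoid_grad(8.0))   # a few times 1e-4: the unit is saturated

# Even in the best case (every pre-activation exactly zero), the product of
# the activation derivatives shrinks geometrically with depth.
best_case = {depth: 0.25 ** depth for depth in (5, 10, 20)}
print(best_case)
```

With ten sigmoid layers the activation factors alone contribute at most $0.25^{10} \approx 10^{-6}$, which is why the weights would have to be large to compensate.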

Exploding Gradients
The opposite problem, when gradients explode, can be similarly vexing. To illustrate this a bit
better, we draw 100 Gaussian random matrices and multiply them with some initial matrix. For
the scale that we picked (the choice of the variance $\sigma^2 = 1$), the matrix product explodes. When
this happens due to the initialization of a deep network, we have no chance of getting a gradient
descent optimizer to converge.
M = torch.normal(0, 1, size=(4, 4))
print('a single matrix \n', M)
# Repeatedly right-multiply by fresh Gaussian random matrices
for i in range(100):
    M = M @ torch.normal(0, 1, size=(4, 4))
print('after multiplying 100 matrices\n', M)
a single matrix
tensor([[-0.6294, 0.4637, -0.7013, 0.3351],
[ 0.6494, -0.5689, -0.6872, 0.6012],
[ 0.8345, 0.4740, 0.0549, -0.4233],
[ 0.8291, 0.3444, -1.0034, -0.8176]])
after multiplying 100 matrices
tensor([[-3.6162e+25, -6.7428e+24, 8.1315e+24, 9.8083e+24],
[-6.2805e+24, -1.1711e+24, 1.4122e+24, 1.7035e+24],
[ 2.8926e+25, 5.3935e+24, -6.5043e+24, -7.8456e+24],
[ 2.7628e+25, 5.1516e+24, -6.2125e+24, -7.4936e+24]])
Breaking the Symmetry
Another problem in neural network design is the symmetry inherent in their parametrization.
Assume that we have a simple MLP with one hidden layer and two units. In this case, we could
permute the weights W(1) of the first layer and likewise permute the weights of the output layer
to obtain the same function. There is nothing special differentiating the first hidden unit vs. the
second hidden unit. In other words, we have permutation symmetry among the hidden units of
each layer.
This is more than just a theoretical nuisance. Consider the aforementioned one-hidden-layer MLP
with two hidden units. For illustration, suppose that the output layer transforms the two hidden
units into only one output unit. Imagine what would happen if we initialized all of the parameters
of the hidden layer as $\mathbf{W}^{(1)} = c$ for some constant $c$. In this case, during forward propagation
either hidden unit takes the same inputs and parameters, producing the same activation, which
is fed to the output unit. During backpropagation, differentiating the output unit with respect to
parameters $\mathbf{W}^{(1)}$ gives a gradient whose elements all take the same value. Thus, after
gradient-based iteration (e.g., minibatch stochastic gradient descent), all the elements of $\mathbf{W}^{(1)}$ still take the
same value. Such iterations would never break the symmetry on their own and we might never be
able to realize the networkʼs expressive power. The hidden layer would behave as if it had only a
single unit. Note that while minibatch stochastic gradient descent would not break this symmetry,
dropout regularization (to be introduced later) would!
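The argument can be checked numerically on a tiny network. The sketch below trains a two-hidden-unit MLP with plain stochastic gradient descent from two starting points. Several details are assumptions made for the illustration rather than choices from the text: the activation is tanh, the loss is the squared error, the output weights are also initialized to a common constant in the symmetric run, and the three data points are made up.

```python
import math


def train(W1, w2, data, lr=0.05, steps=20):
    """Plain SGD on a bias-free MLP: 2 inputs, 2 tanh hidden units, 1 output."""
    for _ in range(steps):
        for x, y in data:
            z = [sum(w * v for w, v in zip(row, x)) for row in W1]
            h = [math.tanh(v) for v in z]
            o = sum(w * a for w, a in zip(w2, h))
            d_o = o - y                               # squared-error gradient
            # Gradient with respect to each hidden pre-activation.
            d_z = [d_o * w * (1.0 - a * a) for w, a in zip(w2, h)]
            # Both updates use the gradients computed from the old weights.
            w2 = [w - lr * d_o * a for w, a in zip(w2, h)]
            W1 = [[w - lr * g * v for w, v in zip(row, x)]
                  for row, g in zip(W1, d_z)]
    return W1, w2


data = [([1.0, 2.0], 1.0), ([-1.0, 0.5], -1.0), ([0.3, -0.7], 0.5)]

# Symmetric start: every hidden-layer weight equals c = 0.5.
W1_sym, w2_sym = train([[0.5, 0.5], [0.5, 0.5]], [0.1, 0.1], data)

# Asymmetric start: the two hidden units begin with different weights.
W1_rand, w2_rand = train([[0.5, 0.4], [-0.3, 0.2]], [0.1, -0.2], data)

print(W1_sym[0] == W1_sym[1])    # True: the hidden units are still clones
print(W1_rand[0] == W1_rand[1])  # False
```

In the symmetric run the two rows of the hidden-layer weight matrix receive identical updates at every step, so the hidden layer never behaves as more than a single unit.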

5.4.2 Parameter Initialization
One way of addressing—or at least mitigating—the issues raised above is through careful initial-
ization. As we will see later, additional care during optimization and suitable regularization can
further enhance stability.
Default Initialization
In the previous sections, e.g., in Section 3.5, we used a normal distribution to initialize the values
of our weights. If we do not specify the initialization method, the framework will use a default
random initialization method, which often works well in practice for moderate problem sizes.
Xavier Initialization
Letʼs look at the scale distribution of an output $o_i$ for some fully connected layer without
nonlinearities. With $n_\mathrm{in}$ inputs $x_j$ and their associated weights $w_{ij}$ for this layer, an output is given by
$$o_{i} = \sum_{j=1}^{n_\mathrm{in}} w_{ij} x_j. \tag{5.4.3}$$
The weights $w_{ij}$ are all drawn independently from the same distribution. Furthermore, letʼs
assume that this distribution has zero mean and variance $\sigma^2$. Note that this does not mean that
the distribution has to be Gaussian, just that the mean and variance need to exist. For now, letʼs
assume that the inputs to the layer $x_j$ also have zero mean and variance $\gamma^2$ and that they are
independent of $w_{ij}$ and independent of each other. In this case, we can compute the mean and
variance of $o_i$ as follows:
$$\begin{aligned}
E[o_i] & = \sum_{j=1}^{n_\mathrm{in}} E[w_{ij} x_j] \\
& = \sum_{j=1}^{n_\mathrm{in}} E[w_{ij}] E[x_j] \\
& = 0, \\
\mathrm{Var}[o_i] & = E[o_i^2] - (E[o_i])^2 \\
& = \sum_{j=1}^{n_\mathrm{in}} E[w^2_{ij} x^2_j] - 0 \\
& = \sum_{j=1}^{n_\mathrm{in}} E[w^2_{ij}] E[x^2_j] \\
& = n_\mathrm{in} \sigma^2 \gamma^2.
\end{aligned} \tag{5.4.4}$$
One way to keep the variance fixed is to set $n_\mathrm{in} \sigma^2 = 1$. Now consider backpropagation. There
we face a similar problem, albeit with gradients being propagated from the layers closer to the
output. Using the same reasoning as for forward propagation, we see that the gradientsʼ variance
can blow up unless $n_\mathrm{out} \sigma^2 = 1$, where $n_\mathrm{out}$ is the number of outputs of this layer. This leaves us in
a dilemma: we cannot, in general, satisfy both conditions simultaneously. Instead, we simply try to
satisfy:
$$\frac{1}{2} (n_\mathrm{in} + n_\mathrm{out}) \sigma^2 = 1 \text{ or equivalently } \sigma = \sqrt{\frac{2}{n_\mathrm{in} + n_\mathrm{out}}}. \tag{5.4.5}$$
This is the reasoning underlying the now-standard and practically beneficial Xavier initialization,
named after the first author of the paper that introduced it (Glorot and Bengio, 2010). Typically, the Xavier
initialization samples weights from a Gaussian distribution with zero mean and variance $\sigma^2 = \frac{2}{n_\mathrm{in} + n_\mathrm{out}}$.
We can also adapt Xavierʼs intuition to choose the variance when sampling weights from a
uniform distribution. Note that the uniform distribution $U(-a, a)$ has variance $\frac{a^2}{3}$. Plugging $\frac{a^2}{3}$ into
our condition on $\sigma^2$ yields the suggestion to initialize according to
$$U\left(-\sqrt{\frac{6}{n_\mathrm{in} + n_\mathrm{out}}}, \sqrt{\frac{6}{n_\mathrm{in} + n_\mathrm{out}}}\right). \tag{5.4.6}$$
Though the assumption that there are no nonlinearities in the above mathematical reasoning
is easily violated in neural networks, the Xavier initialization method turns out to work well
in practice.
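Both the variance formula (5.4.4) and the uniform bound (5.4.6) can be checked with a short simulation. The sketch below is a plain-Python illustration with arbitrarily chosen sizes ($n_\mathrm{in} = 200$, $n_\mathrm{out} = 100$), unit-variance Gaussian inputs ($\gamma^2 = 1$), and hypothetical helper names. It is not how a framework implements the initializer.

```python
import math
import random


def xavier_bound(n_in, n_out):
    """Half-width a of the uniform distribution U(-a, a) in (5.4.6)."""
    return math.sqrt(6.0 / (n_in + n_out))


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)        # fixed seed so that the run is repeatable
n_in, n_out = 200, 100
a = xavier_bound(n_in, n_out)
sigma2 = a * a / 3.0          # variance of U(-a, a) = 2 / (n_in + n_out)

# For each trial, draw fresh weights and inputs and record a single output
# o_i = sum_j w_ij x_j as in (5.4.3).
outputs = []
for _ in range(2000):
    w = [rng.uniform(-a, a) for _ in range(n_in)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
    outputs.append(sum(w_j * x_j for w_j, x_j in zip(w, x)))

predicted = n_in * sigma2     # n_in * sigma^2 * gamma^2 with gamma^2 = 1
empirical = variance(outputs)
print(f"sigma^2          = {sigma2:.5f}")
print(f"predicted Var[o] = {predicted:.3f}")
print(f"empirical Var[o] = {empirical:.3f}")
```

Because $n_\mathrm{in} \neq n_\mathrm{out}$ here, the forward variance is $2 n_\mathrm{in} / (n_\mathrm{in} + n_\mathrm{out}) = 4/3$ rather than exactly 1, which reflects the compromise in (5.4.5) between the forward and backward conditions.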
Beyond
The reasoning above barely scratches the surface of modern approaches to parameter initializa-
tion. A deep learning framework often implements over a dozen different heuristics. Moreover,
parameter initialization continues to be a hot area of fundamental research in deep learning.
Among these are heuristics specialized for tied (shared) parameters, super-resolution, sequence
models, and other situations. For instance, Xiao et al. demonstrated the possibility of training
10000-layer neural networks without architectural tricks by using a carefully-designed initializa-
tion method (Xiao et al., 2018).
If the topic interests you we suggest a deep dive into this moduleʼs offerings, reading the papers
that proposed and analyzed each heuristic, and then exploring the latest publications on the topic.
Perhaps you will stumble across or even invent a clever idea and contribute an implementation to
deep learning frameworks.
5.4.3 Summary
• Vanishing and exploding gradients are common issues in deep networks. Great care in pa-
rameter initialization is required to ensure that gradients and parameters remain well con-
trolled.
• Initialization heuristics are needed to ensure that the initial gradients are neither too large
nor too small.
• ReLU activation functions mitigate the vanishing gradient problem. This can accelerate con-
vergence.
• Random initialization is key to ensuring that symmetry is broken before optimization.
• Xavier initialization suggests that, for each layer, variance of any output is not affected by
the number of inputs, and variance of any gradient is not affected by the number of outputs.

Exercises
1. Can you design other cases where a neural network might exhibit symmetry requiring break-
ing besides the permutation symmetry in an MLPʼs layers?
2. Can we initialize all weight parameters in linear regression or in softmax regression to the
same value?
3. Look up analytic bounds on the eigenvalues of the product of two matrices. What does this
tell you about ensuring that gradients are well conditioned?
4. If we know that some terms diverge, can we fix this after the fact? Look at the paper on
layerwise adaptive rate scaling for inspiration (You et al., 2017).
5.5 Generalization in Deep Learning
In Chapter 3 and Chapter 4, we tackled regression and classification problems by fitting linear
models to training data. In both cases, we provided practical algorithms for finding the param-
eters that maximized the likelihood of the observed training labels. And then, towards the end
of each chapter, we recalled that fitting the training data was only an intermediate goal. Our real
quest all along was to discover general patterns on the basis of which we can make accurate pre-
dictions even on new examples drawn from the same underlying population. Machine learning
researchers are consumers of optimization algorithms. Sometimes, we must even develop new op-
timization algorithms. But at the end of the day, optimization is merely a means to an end. At its
core, machine learning is a statistical discipline and we wish to optimize training loss only insofar
as some statistical principle (known or unknown) leads the resulting models to generalize beyond
the training set.
On the bright side, it turns out that deep neural networks trained by stochastic gradient descent
generalize remarkably well across myriad prediction problems, spanning computer vision; natu-
ral language processing; time series data; recommender systems; electronic health records; pro-
tein folding; value function approximation in video games and board games; and countless other
domains. On the downside, if you were looking for a straightforward account of either the opti-
mization story (why we can fit them to training data) or the generalization story (why the resulting
models generalize to unseen examples), then you might want to pour yourself a drink. While our
procedures for optimizing linear models and the statistical properties of the solutions are both
described well by a comprehensive body of theory, our understanding of deep learning still re-
sembles the wild west on both fronts.
The theory and practice of deep learning are rapidly evolving on both fronts, with theorists adopt-
ing new strategies to explain whatʼs going on, even as practitioners continue to innovate at a blis-
tering pace, building arsenals of heuristics for training deep networks and a body of intuitions and
folk knowledge that provide guidance for deciding which techniques to apply in which situations.
The TL;DR of the present moment is that the theory of deep learning has produced promising lines
of attack and scattered fascinating results, but still appears far from a comprehensive account of
both (i) why we are able to optimize neural networks and (ii) how models learned by gradient de-
scent manage to generalize so well, even on high-dimensional tasks. However, in practice, (i) is
seldom a problem (we can always find parameters that will fit all of our training data) and thus
understanding generalization is by far the bigger problem. On the other hand, even absent the comfort
of a coherent scientific theory, practitioners have developed a large collection of techniques
that may help you to produce models that generalize well in practice. While no pithy summary
can possibly do justice to the vast topic of generalization in deep learning, and while the overall
state of research is far from resolved, we hope, in this section, to present a broad overview of the
state of research and practice.
5.5.1 Revisiting Overfitting and Regularization
Recall that our approach to training machine learning models typically consists of two phases: (i)
fit the training data; and (ii) estimate the generalization error (the true error on the underlying pop-
ulation) by evaluating the model on holdout data. The difference between our fit on the training
data and our fit on the test data is called the generalization gap and when the generalization gap is
large, we say that our models overfit to the training data. In extreme cases of overfitting, we might
exactly fit the training data, even when the test error remains significant. In the classical
view, the interpretation is that our models are too complex, requiring that we shrink the
number of features, the number of nonzero parameters learned, or the size of the parameters
(as quantified, e.g., by a norm). Recall the plot of model complexity vs. loss (Fig. 3.6.1) from Section 3.6.
However, deep learning complicates this picture in counterintuitive ways. First, for classification
problems, our models are typically expressive enough to perfectly fit every training example, even
in datasets consisting of millions of examples (Zhang et al., 2021). In the classical picture, we might think that
this setting lies on the far right extreme of the model complexity axis, and that any improvements
in generalization error must come by way of regularization, either by reducing the complexity of
the model class, or by applying a penalty, severely constraining the set of values that our param-
eters might take. But thatʼs where things start to get weird.
Strangely, for many deep learning tasks (e.g., image recognition and text classification) we are
typically choosing among model architectures, all of which can achieve arbitrarily low training
loss (and zero training error). Because all models under consideration achieve zero training er-
ror, the only avenue for further gains is to reduce overfitting. Even stranger, itʼs often the case that
despite fitting the training data perfectly, we can actually reduce the generalization error further by
making the model even more expressive, e.g., adding layers, nodes, or training for a larger number
of epochs. Stranger yet, the pattern relating the generalization gap to the complexity of the model
(as captured, e.g., in the depth or width of the networks) can be non-monotonic, with greater com-
plexity hurting at first but subsequently helping in a so-called “double-descent” pattern (Nakkiran
et al., 2021). Thus the deep learning practitioner possesses a bag of tricks, some of which seem-
ingly restrict the model in some fashion and others that seemingly make it even more expressive,
and all of which, in some sense, are applied to mitigate overfitting.
Complicating things even further, while the guarantees provided by classical learning theory can
be conservative even for classical models, they appear powerless to explain why it is that deep
neural networks generalize in the first place. Because deep neural networks are capable of fitting
arbitrary labels even for large datasets, and despite the use of familiar methods like ℓ2 regularization,
traditional complexity-based generalization bounds, e.g., those based on the VC dimension
or Rademacher complexity of a hypothesis class, cannot explain why neural networks generalize.

5.5.2 Inspiration from Nonparametrics
Approaching deep learning for the first time, itʼs tempting to think of neural networks as parametric models.
After all, the models do have millions of parameters. When we update the models, we update their
parameters. When we save the models, we write their parameters to disk. However, mathematics
and computer science are riddled with counterintuitive changes of perspective, and surprising
isomorphisms between seemingly different problems. While neural networks clearly have parameters, in
some ways, it can be more fruitful to think of them as behaving like nonparametric models. So
what precisely makes a model nonparametric? While the name covers a diverse set of approaches,
one common theme is that nonparametric methods tend to have a level of complexity that grows
as the amount of available data grows.
Perhaps the simplest example of a nonparametric model is the k-nearest neighbor algorithm (we
will cover more nonparametric models later, such as in Section 11.2). Here, at training time, the
learner simply memorizes the dataset. Then, at prediction time, when confronted with a new
point $\mathbf{x}$, the learner looks up the $k$ nearest neighbors (the $k$ points $\mathbf{x}_i'$ that minimize some distance
$d(\mathbf{x}, \mathbf{x}_i')$). When $k = 1$, this algorithm is called 1-nearest neighbors, and the algorithm will always
achieve a training error of zero. That, however, does not mean that the algorithm will not
generalize. In fact, it turns out that under some mild conditions, the 1-nearest neighbor algorithm
is consistent (eventually converging to the optimal predictor).
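A minimal sketch of 1-nearest neighbors makes the zero-training-error property tangible. It assumes the Euclidean distance for $d$ and uses a made-up four-point dataset with distinct points. The function name is hypothetical.

```python
import math


def nearest_neighbor_predict(train_points, train_labels, x):
    """Return the label of the training point closest to x.

    Assumption: d is the Euclidean distance (math.dist, Python 3.8+).
    """
    distances = [math.dist(p, x) for p in train_points]
    best = min(range(len(train_points)), key=distances.__getitem__)
    return train_labels[best]


# "Training" is just memorizing the dataset.
train_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
train_labels = ["a", "b", "b", "a"]   # XOR-like, not linearly separable

train_errors = sum(
    nearest_neighbor_predict(train_points, train_labels, p) != y
    for p, y in zip(train_points, train_labels))
query_label = nearest_neighbor_predict(train_points, train_labels, (0.9, 0.2))

print(train_errors)   # 0: every training point is its own nearest neighbor
print(query_label)    # b: the closest training point is (1.0, 0.0)
```

Every training point is at distance zero from itself, so as long as the points are distinct the training error is zero regardless of how the labels were assigned.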
Note that 1-nearest neighbor requires that we specify some distance function $d$, or equivalently,
that we specify some vector-valued basis function $\phi(\mathbf{x})$ for featurizing our data. For any choice of
the distance metric, we will achieve zero training error and eventually reach an optimal predictor, but
different distance metrics d encode different inductive biases and with a finite amount of available
data will yield different predictors. Different choices of the distance metric d represent different
assumptions about the underlying patterns and the performance of the different predictors will
depend on how compatible the assumptions are with the observed data.
In a sense, because neural networks are over-parameterized, possessing many more parameters
than are needed to fit the training data, they tend to interpolate the training data (fitting it perfectly)
and thus behave, in some ways, more like nonparametric models. More recent theoretical
research has established deep connections between large neural networks and nonparametric
methods, notably kernel methods. In particular, Jacot et al. (2018) demonstrated that in the limit,
as multilayer perceptrons with randomly initialized weights grow infinitely wide, they become
equivalent to (nonparametric) kernel methods for a specific choice of the kernel function (es-
sentially, a distance function), which they call the neural tangent kernel. While current neural
tangent kernel models may not fully explain the behavior of modern deep networks, their success
as an analytical tool underscores the usefulness of nonparametric modeling for understanding
the behavior of over-parameterized deep networks.
5.5.3 Early Learning and Early Stopping
While deep neural networks are capable of fitting arbitrary labels, even when labels are assigned
incorrectly or randomly (Zhang et al., 2021), this ability only emerges over many iterations of
training. A new line of work (Rolnick et al., 2017) has revealed that in the setting of label noise,
neural networks tend to fit cleanly labeled data first and only subsequently to interpolate the
mislabeled data. Moreover, it has been established that this phenomenon translates directly into a
guarantee on generalization: whenever a model has fitted the cleanly labeled data but not randomly
labeled examples included in the training set, it has in fact generalized (Garg et al., 2021).
Together these findings help to motivate early stopping, a classic technique for regularizing deep
neural networks. Here, rather than directly constraining the values of the weights, one constrains
the number of epochs of training. The most common way to determine the stopping criterion is to
monitor validation error throughout training (typically by checking once after each epoch) and to
cut off training when the validation error has not decreased by more than some small amount ϵ
for some number of epochs. This is sometimes called a patience criterion. Besides the potential to
lead to better generalization in the setting of noisy labels, another benefit of early stopping is the
time saved. Once the patience criterion is met, one can terminate training. For large models that
might require days of training simultaneously across 8 GPUs or more, well-tuned early stopping
can save researchers days of time and can save their employers many thousands of dollars.
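The patience criterion described above can be sketched in a few lines of plain Python. The function name `early_stopping_epoch`, its parameters, and the validation curve are hypothetical, so treat this as an illustration of the rule, not as the implementation used by any particular library.

```python
def early_stopping_epoch(val_errors, patience=3, min_delta=1e-3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation error has failed to improve on the
    best value seen so far by more than `min_delta` for `patience`
    consecutive epochs. If that never happens, the last epoch is returned.
    """
    best = float('inf')
    epochs_without_improvement = 0
    for epoch, err in enumerate(val_errors):
        if err < best - min_delta:
            best = err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_errors) - 1

# Hypothetical validation curve: improves, then plateaus and drifts upward
val_errors = [0.90, 0.70, 0.55, 0.50, 0.499, 0.501, 0.51, 0.53, 0.56]
stop_epoch = early_stopping_epoch(val_errors, patience=3, min_delta=0.01)
print(stop_epoch)
```

In practice one would also keep a copy of the parameters from the best epoch, since the model at the stopping epoch is already a few epochs past its best validation error.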
Notably, when there is no label noise and datasets are realizable (the classes are truly separable,
e.g., distinguishing cats from dogs), early stopping tends not to lead to significant improvements
in generalization. On the other hand, when there is label noise, or intrinsic variability in the label
(e.g., predicting mortality among patients), early stopping is crucial. Training models until they
interpolate noisy data is typically a bad idea.
5.5.4 Classical Regularization Methods for Deep Networks
In Chapter 3, we described several classical regularization techniques for constraining the com-
plexity of our models. In particular, Section 3.7 introduced a method called weight decay, which
consists of adding a regularization term to the loss function to penalize large values of the weights.
Depending on which weight norm is penalized, this technique is known either as ridge regularization
(for an ℓ2 penalty) or lasso regularization (for an ℓ1 penalty). In the classical analysis of these
regularizers, they are considered to restrict the values that the weights can take sufficiently to
prevent the model from fitting arbitrary labels.
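As a small reminder of what the two penalties look like, here is a sketch that adds either an ℓ2 or an ℓ1 term to a data-fitting loss. The helper `penalized_loss` and the coefficient name `lambd` are hypothetical, and the factor of 1/2 on the ℓ2 term is a common convention that we assume here.

```python
import torch

def penalized_loss(data_loss, weights, lambd, norm='l2'):
    """Add a weight penalty to a data-fitting loss.

    norm='l2' adds the ridge-style term (lambd / 2) * ||w||_2^2,
    norm='l1' adds the lasso-style term lambd * ||w||_1.
    """
    if norm == 'l2':
        penalty = lambd / 2 * (weights ** 2).sum()
    elif norm == 'l1':
        penalty = lambd * weights.abs().sum()
    else:
        raise ValueError(f'unknown norm: {norm}')
    return data_loss + penalty

w = torch.tensor([3.0, -4.0])
data_loss = torch.tensor(1.0)
ridge = penalized_loss(data_loss, w, lambd=0.1, norm='l2')  # 1 + 0.05 * 25
lasso = penalized_loss(data_loss, w, lambd=0.1, norm='l1')  # 1 + 0.1 * 7
print(ridge.item(), lasso.item())
```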
In deep learning implementations, weight decay remains a popular tool. However, researchers
have noted that typical strengths of ℓ2 regularization are insufficient to prevent the networks from
interpolating the data (Zhang et al., 2021) and thus the benefits, if interpreted as regularization,
might only make sense in combination with an early stopping criterion. Absent early stopping, it is
possible that just like the number of layers or number of nodes (in deep learning) or the distance
metric (in 1-nearest neighbor), these methods may lead to better generalization not because they
meaningfully constrain the power of the neural network but rather because they somehow encode
inductive biases that are better compatible with the patterns found in datasets of interest. Thus,
classical regularizers remain popular in deep learning implementations, even if the theoretical
rationale for their efficacy may be radically different.
Notably, deep learning researchers have also built on techniques first popularized in classical
regularization contexts, such as adding noise to model inputs. In the next section we will introduce
the famous dropout technique (invented by Srivastava et al. (2014)), which has become a mainstay
of deep learning, even as the theoretical basis for its efficacy remains similarly mysterious.

5.5.5 Summary
Unlike classical linear models, which tend to have fewer parameters than examples, deep networks
tend to be over-parameterized, and for most tasks are capable of perfectly fitting the training
set. This interpolation regime challenges many firmly held intuitions. Functionally, neural
networks look like parametric models. But thinking of them as nonparametric models can
sometimes be a more reliable source of intuition. Because it is often the case that all deep
networks under consideration are capable of fitting all of the training labels, nearly all gains must
come by mitigating overfitting (closing the generalization gap). Paradoxically, the interventions
that reduce the generalization gap sometimes appear to increase model complexity and at other
times appear to decrease complexity. However, these methods seldom decrease complexity suffi-
ciently for classical theory to explain the generalization of deep networks, and why certain choices
lead to improved generalization remains for the most part a massive open question despite the con-
certed efforts of many brilliant researchers.
Exercises
1. In what sense do traditional complexity-based measures fail to account for generalization of
deep neural networks?
2. Why might early stopping be considered a regularization technique?
3. How do researchers typically determine the stopping criteria?
4. What important factor seems to differentiate cases when early stopping leads to big improve-
ments in generalization?
5. Beyond generalization, describe another benefit of early stopping.
Discussions
5.6 Dropout
Letʼs think briefly about what we expect from a good predictive model. We want it to perform well
on unseen data. Classical generalization theory suggests that to close the gap between train and
test performance, we should aim for a simple model. Simplicity can come in the form of a small
number of dimensions. We explored this when discussing the monomial basis functions of linear
models in Section 3.6. Additionally, as we saw when discussing weight decay (ℓ2 regularization) in
Section 3.7, the (inverse) norm of the parameters also represents a useful measure of simplicity.
Another useful notion of simplicity is smoothness, i.e., that the function should not be sensitive to
small changes to its inputs. For instance, when we classify images, we would expect that adding
some random noise to the pixels should be mostly harmless.
In 1995, Christopher Bishop formalized this idea when he proved that training with input noise is
equivalent to Tikhonov regularization (Bishop, 1995). This work drew a clear mathematical con-
nection between the requirement that a function be smooth (and thus simple), and the require-
ment that it be resilient to perturbations in the input.
Then, in 2014, Srivastava et al. (Srivastava et al., 2014) developed a clever idea for how to apply
Bishopʼs idea to the internal layers of a network, too. Their idea, called dropout, involves injecting
noise while computing each internal layer during forward propagation, and it has become a stan-
dard technique for training neural networks. The method is called dropout because we literally
drop out some neurons during training. Throughout training, on each iteration, standard dropout
consists of zeroing out some fraction of the nodes in each layer before calculating the subsequent
layer.
To be clear, we are imposing our own narrative with the link to Bishop. The original paper on
dropout offers intuition through a surprising analogy to sexual reproduction. The authors argue
that neural network overfitting is characterized by a state in which each layer relies on a specific
pattern of activations in the previous layer, calling this condition co-adaptation. Dropout, they
claim, breaks up co-adaptation just as sexual reproduction is argued to break up co-adapted genes.
While the explanatory power of this theory is certainly up for debate, the dropout technique itself has
proved enduring, and various forms of dropout are implemented in most deep learning libraries.
The key challenge is how to inject this noise. One idea is to inject the noise in an unbiased manner
so that the expected value of each layer (while fixing the others) equals the value it would have
taken absent noise. In Bishopʼs work, he added Gaussian noise to the inputs to a linear model. At
each training iteration, he added noise sampled from a distribution with mean zero, ϵ ∼ N(0, σ²),
to the input x, yielding a perturbed point x′ = x + ϵ. In expectation, E[x′] = x.
In standard dropout regularization, one zeros out some fraction of the nodes in each layer and
then debiases each layer by normalizing by the fraction of nodes that were retained (not dropped
out). In other words, with dropout probability p, each intermediate activation h is replaced by a
random variable h′ as follows:
h' = \begin{cases} 0 & \text{with probability } p \\ \dfrac{h}{1-p} & \text{otherwise} \end{cases} \qquad (5.6.1)
By design, the expectation remains unchanged, i.e., E[h′ ] = h.
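The claim E[h′] = h is easy to check empirically. The sketch below draws many independent dropout masks for a single activation and compares the sample mean with h. The values of `p`, `h`, and `n` are arbitrary choices for illustration. As a side remark that follows from the definition, the variance of h′ is h² p / (1 − p), so the noise grows with the dropout probability.

```python
import torch

torch.manual_seed(0)
p = 0.3                  # dropout probability (illustrative value)
h = torch.tensor(2.0)    # a single activation
n = 200_000              # number of independent dropout draws

# Keep with probability 1 - p, then rescale survivors by 1 / (1 - p)
keep = (torch.rand(n) > p).float()
h_prime = keep * h / (1.0 - p)

empirical_mean = h_prime.mean().item()           # should be close to h
empirical_var = h_prime.var().item()             # should be close to h^2 * p / (1 - p)
theoretical_var = (h ** 2 * p / (1 - p)).item()
print(empirical_mean, empirical_var, theoretical_var)
```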
5.6.1 Dropout in Practice
Recall the MLP with a hidden layer and 5 hidden units in Fig. 5.1.1. When we apply dropout to
a hidden layer, zeroing out each hidden unit with probability p, the result can be viewed as a
network containing only a subset of the original neurons. In Fig. 5.6.1, h2 and h5 are removed.
Consequently, the calculation of the outputs no longer depends on h2 or h5 and their respective
gradients also vanish when performing backpropagation. In this way, the calculation of the
output layer cannot be overly dependent on any one element of h1, ..., h5.

Fig. 5.6.1: MLP before and after dropout.
Typically, we disable dropout at test time. Given a trained model and a new example, we do not
drop out any nodes and thus do not need to normalize. However, there are some exceptions: some
researchers use dropout at test time as a heuristic for estimating the uncertainty of neural network
predictions: if the predictions agree across many different dropout masks, then we might say that
the network is more confident.
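The test-time heuristic mentioned above can be sketched as follows. We keep dropout active, run several forward passes with different random masks, and look at the spread of the predicted probabilities. The network here is a small untrained model and `predict_with_dropout` is a hypothetical helper, so the numbers carry no meaning beyond showing the mechanics.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small untrained network, used only to illustrate the mechanics
net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Dropout(0.5),
                    nn.Linear(16, 3))

def predict_with_dropout(net, X, num_samples=100):
    """Average softmax predictions over several random dropout masks."""
    # train() keeps nn.Dropout active; note that it would also switch layers
    # such as batch normalization (absent here) into training mode
    net.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(net(X), dim=1)
                             for _ in range(num_samples)])
    net.eval()
    return probs.mean(dim=0), probs.std(dim=0)

X = torch.randn(5, 4)
mean_probs, std_probs = predict_with_dropout(net, X)
print(mean_probs.shape, std_probs.max().item())
```

A small standard deviation across masks means that the predictions agree, which is the sense in which the network might be called more confident on that example.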
To implement the dropout function for a single layer, we must draw as many samples from a
Bernoulli (binary) random variable as our layer has dimensions, where the random variable takes
value 1 (keep) with probability 1 − p and 0 (drop) with probability p. One easy way to implement
this is to first draw samples from the uniform distribution U [0, 1]. Then we can keep those nodes
for which the corresponding sample is greater than p, dropping the rest.
In the following code, we implement a dropout_layer function that drops out the elements in the
tensor input X with probability dropout, rescaling the remainder as described above: dividing the
survivors by 1.0-dropout.
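import torch
from torch import nn
from d2l import torch as d2l

def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    # With dropout == 1 every element is dropped
    if dropout == 1: return torch.zeros_like(X)
    # Keep an element when its uniform sample exceeds the dropout probability
    mask = (torch.rand(X.shape, device=X.device) > dropout).float()
    # Rescale the survivors so that the expectation is unchanged
    return mask * X / (1.0 - dropout)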
We can test out the dropout_layer function on a few examples. In the following lines of code, we
pass our input X through the dropout operation, with probabilities 0, 0.5, and 1, respectively.
X = torch.arange(16, dtype = torch.float32).reshape((2, 8))

print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))
dropout_p = 0: tensor([[ 0., 1., 2., 3., 4., 5., 6., 7.],
[ 8., 9., 10., 11., 12., 13., 14., 15.]])
dropout_p = 0.5: tensor([[ 0., 2., 0., 6., 8., 0., 12., 0.],
[ 0., 18., 20., 0., 0., 0., 28., 30.]])
dropout_p = 1: tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.]])
Defining the Model
The model below applies dropout to the output of each hidden layer (following the activation func-
tion). We can set dropout probabilities for each layer separately. A common trend is to set a lower
dropout probability closer to the input layer. We ensure that dropout is only active during training.
class DropoutMLPScratch(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super().__init__()
        # Store the constructor arguments (e.g., dropout_1) as attributes
        self.save_hyperparameters()
        self.lin1 = nn.LazyLinear(num_hiddens_1)
        self.lin2 = nn.LazyLinear(num_hiddens_2)
        self.lin3 = nn.LazyLinear(num_outputs)
        self.relu = nn.ReLU()

    def forward(self, X):
        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
        # Apply dropout only while training
        if self.training:
            H1 = dropout_layer(H1, self.dropout_1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            H2 = dropout_layer(H2, self.dropout_2)
        return self.lin3(H2)
Training
The following is similar to the training of MLPs described previously.
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLPScratch(**hparams)

With high-level APIs, all we need to do is add a Dropout layer after each fully connected layer,
passing in the dropout probability as the only argument to its constructor. During training, the
Dropout layer will randomly drop out outputs of the previous layer (or equivalently, the inputs to
the subsequent layer) according to the specified dropout probability. When not in training mode,
the Dropout layer simply passes the data through unchanged.
class DropoutMLP(d2l.Classifier):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super().__init__()
        # Store the constructor arguments (e.g., lr) as attributes
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(num_hiddens_1), nn.ReLU(),
            nn.Dropout(dropout_1), nn.LazyLinear(num_hiddens_2), nn.ReLU(),
            nn.Dropout(dropout_2), nn.LazyLinear(num_outputs))
Next, we train the model.
model = DropoutMLP(**hparams)
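To see the two modes of the Dropout layer in isolation, we can apply a single `nn.Dropout` to a tensor of ones, once in training mode and once in evaluation mode. This is only a small illustrative check, and the tensor shape and dropout probability are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Dropout(0.5)
X = torch.ones(2, 8)

layer.train()   # training mode: zero some entries, scale survivors by 1 / (1 - 0.5)
out_train = layer(X)

layer.eval()    # evaluation mode: the layer acts as the identity
out_eval = layer(X)

print(out_train)
print(out_eval)
```

In training mode every entry is either 0 or 2, while in evaluation mode the input passes through unchanged.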
5.6.4 Summary
• Beyond controlling the number of dimensions and the size of the weight vector, dropout is
yet another tool to avoid overfitting. These tools are often used jointly.
• Dropout replaces an activation h with a random variable with expected value h.
• Dropout is only used during training.
Exercises
1. What happens if you change the dropout probabilities for the first and second layers? In
particular, what happens if you switch the ones for both layers? Design an experiment to
answer these questions, describe your results quantitatively, and summarize the qualitative
takeaways.
2. Increase the number of epochs and compare the results obtained when using dropout with
those when not using it.
3. What is the variance of the activations in each hidden layer when dropout is and is not ap-
plied? Draw a plot to show how this quantity evolves over time for both models.
4. Why is dropout not typically used at test time?
5. Using the model in this section as an example, compare the effects of using dropout and
weight decay. What happens when dropout and weight decay are used at the same time?
Are the results additive? Are there diminishing returns (or worse)? Do they cancel each other
out?
6. What happens if we apply dropout to the individual weights of the weight matrix rather than
the activations?
7. Invent another technique for injecting random noise at each layer that is different from the
standard dropout technique. Can you develop a method that outperforms dropout on the
Fashion-MNIST dataset (for a fixed architecture)?
Discussions
5.7 Predicting House Prices on Kaggle
Now that we have introduced some basic tools for building and training deep networks and reg-
ularizing them with techniques including weight decay and dropout, we are ready to put all this
knowledge into practice by participating in a Kaggle competition. The house price prediction
competition is a great place to start. The data are fairly generic and do not exhibit exotic structure
that might require specialized models (as audio or video might). This dataset, collected by Dean De
Cock in 2011 (De Cock, 2011), covers house prices in Ames, Iowa from the period 2006–2010. It is
considerably larger than the famous Boston housing dataset105 of Harrison and Rubinfeld (1978),
boasting both more examples and more features.
105 https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names

In this section, we will walk you through details of data preprocessing, model design, and hyper-
parameter selection. We hope that through a hands-on approach, you will gain some intuitions
that will guide you in your career as a data scientist.
5.7.1 Downloading Data
Throughout the book, we will train and test models on various downloaded datasets. Here, we
implement two utility functions to download files and extract zip or tar files. Again, we defer their
implementations to Section 20.7.
def download(url, folder, sha1_hash=None):
    """Download a file to folder and return the local filepath."""

def extract(filename, folder):
    """Extract a zip/tar file into folder."""
5.7.2 Kaggle
Kaggle106 is a popular platform that hosts machine learning competitions. Each competition cen-
ters on a dataset and many are sponsored by stakeholders who offer prizes to the winning solu-
tions. The platform helps users to interact via forums and shared code, fostering both collabo-
ration and competition. While leaderboard chasing often spirals out of control, with researchers
focusing myopically on preprocessing steps rather than asking fundamental questions, there is
also tremendous value in the objectivity of a platform that facilitates direct quantitative compar-
isons among competing approaches as well as code sharing so that everyone can learn what did
and did not work. If you want to participate in a Kaggle competition, you will first need to register
for an account (see Fig. 5.7.1).
Fig. 5.7.1: The Kaggle website.
On the house price prediction competition page, as illustrated in Fig. 5.7.2, you can find the dataset
(under the “Data” tab), submit predictions, and see your ranking. The URL is:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
106 https://www.kaggle.com
Fig. 5.7.2: The house price prediction competition page.
5.7.3 Accessing and Reading the Dataset
Note that the competition data is separated into training and test sets. Each record includes the
property value of the house and attributes such as street type, year of construction, roof type,
basement condition, etc. The features consist of various data types. For example, the year of
construction is represented by an integer, the roof type by discrete categorical assignments, and
other features by floating point numbers. And here is where reality complicates things: for some
examples, some data is altogether missing with the missing value marked simply as “na”. The
price of each house is included for the training set only (it is a competition after all). We will want
to partition the training set to create a validation set, but we only get to evaluate our models on
the official test set after uploading predictions to Kaggle. The “Data” tab on the competition page in
Fig. 5.7.2 has links to download the data.
%matplotlib inline
import numpy as np
import pandas as pd
import torch
from d2l import torch as d2l
To get started, we will read in and process the data using pandas, which we have introduced in
Section 2.2. For convenience, we can download and cache the Kaggle housing dataset. If a file cor-
responding to this dataset already exists in the cache directory and its SHA-1 matches sha1_hash,
our code will use the cached file to avoid clogging up your internet with redundant downloads.
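The cache check described above boils down to comparing the SHA-1 digest of a local file with the expected one. The following is a minimal sketch using only the standard library. The helpers `sha1_of_file` and `is_cached` are hypothetical names introduced for illustration and are not the implementation deferred to Section 20.7.

```python
import hashlib
import os

def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha1.update(chunk)
    return sha1.hexdigest()

def is_cached(path, sha1_hash=None):
    """Hypothetical helper: True if `path` exists and, when a hash is
    given, its SHA-1 digest matches `sha1_hash`."""
    if not os.path.exists(path):
        return False
    return sha1_hash is None or sha1_of_file(path) == sha1_hash
```

A download routine could call `is_cached` first and only fetch the file when it returns `False`.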
class KaggleHouse(d2l.DataModule):
    def __init__(self, batch_size, train=None, val=None):
        super().__init__()
        # Store batch_size, train, and val as attributes
        self.save_hyperparameters()
        if self.train is None:
            self.raw_train = pd.read_csv(d2l.download(
                d2l.DATA_URL + 'kaggle_house_pred_train.csv', self.root,
                sha1_hash='585e9cc93e70b39160e7921475f9bcd7d31219ce'))
            self.raw_val = pd.read_csv(d2l.download(
                d2l.DATA_URL + 'kaggle_house_pred_test.csv', self.root,
                sha1_hash='fa19780a7b011d9b009e8bff8e99922a8ee2eb90'))
The training dataset includes 1460 examples, 80 features, and 1 label, while the validation data
contains 1459 examples and 80 features.

data = KaggleHouse(batch_size=64)
print(data.raw_train.shape)
print(data.raw_val.shape)
(1460, 81)
(1459, 80)
5.7.4 Data Preprocessing
Letʼs take a look at the first four and last two features as well as the label (SalePrice) from the first
four examples.
print(data.raw_train.iloc[:4, [0, 1, 2, 3, -3, -2, -1]])
   Id  MSSubClass MSZoning  LotFrontage SaleType SaleCondition  SalePrice
0   1          60       RL         65.0       WD        Normal     208500
1   2          20       RL         80.0       WD        Normal     181500
2   3          60       RL         68.0       WD        Normal     223500
3   4          70       RL         60.0       WD       Abnorml     140000
We can see that in each example, the first feature is the ID. This helps the model identify each
training example. While this is convenient, it does not carry any information for prediction pur-
poses. Hence, we will remove it from the dataset before feeding the data into the model. Besides,
given a wide variety of data types, we will need to preprocess the data before we can start model-
ing.
Letʼs start with the numerical features. First, we apply a heuristic, replacing all missing values by
the corresponding featureʼs mean. Then, to put all features on a common scale, we standardize
the data by rescaling features to zero mean and unit variance:
x \leftarrow \frac{x - \mu}{\sigma}, \qquad (5.7.1)
where µ and σ denote mean and standard deviation, respectively. To verify that this indeed
transforms our feature (variable) such that it has zero mean and unit variance, note that
E[\frac{x - \mu}{\sigma}] = \frac{\mu - \mu}{\sigma} = 0 and that
E[(x - \mu)^2] = (\sigma^2 + \mu^2) - 2\mu^2 + \mu^2 = \sigma^2. Intuitively, we standardize the data
for two reasons. First, it proves convenient for optimization. Second, because we do not know
a priori which features will be relevant, we do not want to penalize coefficients assigned to one
feature more than those assigned to any other.
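The following sketch checks the standardization on a tiny, made-up feature with one missing entry. It also illustrates why filling missing values with the mean before standardizing is equivalent to standardizing first and then filling with 0: the mean of the standardized feature is 0. Note that `pandas` computes the sample standard deviation (dividing by n − 1) by default.

```python
import numpy as np
import pandas as pd

# Hypothetical numerical feature with one missing entry
x = pd.Series([65.0, 80.0, np.nan, 60.0, 68.0])

# Standardize using statistics of the observed values (NaN is skipped),
# then set missing entries to 0, the mean of the standardized feature
z = (x - x.mean()) / x.std()
z = z.fillna(0)

observed = z[x.notna()]
print(z.mean(), observed.std())
```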
Next we deal with discrete values. This includes features such as “MSZoning”. We replace them by
a one-hot encoding in the same way that we previously transformed multiclass labels into vectors
(see Section 4.1.1). For instance, “MSZoning” assumes the values “RL” and “RM”. Dropping the
“MSZoning” feature, two new indicator features “MSZoning_RL” and “MSZoning_RM” are created
with values being either 0 or 1. According to one-hot encoding, if the original value of “MSZon-
ing” is “RL”, then “MSZoning_RL” is 1 and “MSZoning_RM” is 0. The pandas package does this
automatically for us.
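As a quick illustration on a toy column (the four values below are made up), `pd.get_dummies` creates one indicator column per category. With `dummy_na=True`, missing values get an indicator column of their own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'MSZoning': ['RL', 'RM', np.nan, 'RL']})
encoded = pd.get_dummies(df, dummy_na=True)

# Exactly one indicator is active in every row
active_per_row = encoded.astype(int).sum(axis=1).tolist()
print(encoded)
```

Depending on the installed version of `pandas`, the indicator columns are stored as integers or as booleans. Casting with `astype(int)` or `astype(float)` yields 0/1 values either way.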

@d2l.add_to_class(KaggleHouse)
def preprocess(self):
    # Remove the ID and label columns
    label = 'SalePrice'
    features = pd.concat(
        (self.raw_train.drop(columns=['Id', label]),
         self.raw_val.drop(columns=['Id'])))
    # Standardize numerical columns
    numeric_features = features.dtypes[features.dtypes != 'object'].index
    features[numeric_features] = features[numeric_features].apply(
        lambda x: (x - x.mean()) / (x.std()))
    # Replace NAN numerical features by 0
    features[numeric_features] = features[numeric_features].fillna(0)
    # Replace discrete features by one-hot encoding.
    features = pd.get_dummies(features, dummy_na=True)
    # Save preprocessed features
    self.train = features[:self.raw_train.shape[0]].copy()
    self.train[label] = self.raw_train[label]
    self.val = features[self.raw_train.shape[0]:].copy()
You can see that this conversion increases the number of features from 79 to 331 (excluding ID
and label columns).
data.preprocess()
data.train.shape
(1460, 332)
5.7.5 Error Measure
To get started we will train a linear model with squared loss. Not surprisingly, our linear model
will not lead to a competition-winning submission but it provides a sanity check to see whether
there is meaningful information in the data. If we cannot do better than random guessing here,
then there might be a good chance that we have a data processing bug. And if things work, the
linear model will serve as a baseline giving us some intuition about how close the simple model
gets to the best reported models, giving us a sense of how much gain we should expect from fancier
models.
With house prices, as with stock prices, we care about relative quantities more than absolute
quantities. Thus we tend to care more about the relative error (y − ŷ)/y than about the absolute
error y − ŷ. For instance, if our prediction is off by 100,000 USD when estimating the price of a
house in rural Ohio, where the value of a typical house is 125,000 USD, then we are probably doing
a horrible job.
On the other hand, if we err by this amount in Los Altos Hills, California, this might represent a
stunningly accurate prediction (there, the median house price exceeds 4 million USD).
One way to address this problem is to measure the discrepancy in the logarithm of the price
estimates. In fact, this is also the official error measure used by the competition to evaluate the quality
of submissions. After all, a small value δ for |\log y - \log \hat{y}| \leq \delta translates into
e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^{\delta}. This
leads to the following root-mean-squared-error between the logarithm of the predicted price and
the logarithm of the label price:
\sqrt{\frac{1}{n} \sum_{i=1}^{n} (\log y_i - \log \hat{y}_i)^2}. \qquad (5.7.2)
@d2l.add_to_class(KaggleHouse)
def get_dataloader(self, train):
    label = 'SalePrice'
    data = self.train if train else self.val
    if label not in data: return
    get_tensor = lambda x: torch.tensor(x.values, dtype=torch.float32)
    # Logarithm of prices
    tensors = (get_tensor(data.drop(columns=[label])),  # X
               torch.log(get_tensor(data[label])).reshape((-1, 1)))  # Y
    return self.get_tensorloader(tensors, train)
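To see why the logarithmic error measure captures relative rather than absolute error, consider the following sketch. The function `log_rmse` and the two prices are hypothetical. A 10% overestimate on a cheap house and on an expensive house contributes the same amount, log(1.1), even though the absolute errors differ by a factor of 32.

```python
import math
import torch

def log_rmse(y_hat, y):
    """RMSE between log predictions and log labels (prices must be positive)."""
    return torch.sqrt(torch.mean((torch.log(y) - torch.log(y_hat)) ** 2))

# Hypothetical prices: the same 10% overestimate on a cheap and an expensive house
y = torch.tensor([125_000.0, 4_000_000.0])
y_hat = 1.1 * y

err = log_rmse(y_hat, y).item()   # log(1.1) for both houses
abs_errors = (y_hat - y).abs()    # very different in absolute terms
print(err, math.log(1.1), abs_errors)
```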
5.7.6 K-Fold Cross-Validation
You might recall that we introduced cross-validation in Section 3.6.3, where we discussed how to
deal with model selection. We will put this to good use to select the model design and to adjust the
hyperparameters. We first need a function that returns the ith fold of the data in a K-fold cross-
validation procedure. It proceeds by slicing out the ith segment as validation data and returning
the rest as training data. Note that this is not the most efficient way of handling data and we
would definitely do something much smarter if our dataset were considerably larger. But this added
complexity might obfuscate our code unnecessarily so we can safely omit it here owing to the
simplicity of our problem.
def k_fold_data(data, k):
    rets = []
    fold_size = data.train.shape[0] // k
    for j in range(k):
        idx = range(j * fold_size, (j+1) * fold_size)
        rets.append(KaggleHouse(data.batch_size, data.train.drop(index=idx),
                                data.train.loc[idx]))
    return rets
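The slicing performed by `k_fold_data` can be examined in isolation with plain indices. The helper `k_fold_indices` below is a hypothetical stand-in that mirrors the same scheme. One detail worth noticing is that each fold has n // k examples, so when n is not divisible by k, the trailing examples never serve as validation data and always stay in the training split.

```python
def k_fold_indices(n, k):
    """Return (train_idx, val_idx) index lists for k contiguous folds."""
    fold_size = n // k
    folds = []
    for j in range(k):
        val_idx = list(range(j * fold_size, (j + 1) * fold_size))
        val_set = set(val_idx)
        train_idx = [i for i in range(n) if i not in val_set]
        folds.append((train_idx, val_idx))
    return folds

# 11 examples, 3 folds: folds of size 3, examples 9 and 10 are never validated on
folds = k_fold_indices(n=11, k=3)
for train_idx, val_idx in folds:
    print(val_idx, len(train_idx))
```

For the housing data, 1460 training examples and k = 5 give folds of exactly 292 examples, so nothing is left over.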
The average validation error is returned when we train K times in the K-fold cross-validation.
def k_fold(trainer, data, k, lr):
    val_loss, models = [], []
    for i, data_fold in enumerate(k_fold_data(data, k)):
        model = d2l.LinearRegression(lr)
        if i != 0: model.board.display = False
        trainer.fit(model, data_fold)
        val_loss.append(float(model.board.data['val_loss'][-1].y))
        models.append(model)
    print(f'average validation log mse = {sum(val_loss)/len(val_loss)}')
    return models

5.7.7 Model Selection
In this example, we pick an untuned set of hyperparameters and leave it up to the reader to im-
prove the model. Finding a good choice can take time, depending on how many variables one
optimizes over. With a large enough dataset, and the normal sorts of hyperparameters, K-fold
cross-validation tends to be reasonably resilient against multiple testing. However, if we try an
unreasonably large number of options we might just get lucky and find that our validation perfor-
mance is no longer representative of the true error.
models = k_fold(trainer, data, k=5, lr=0.01)
/home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.8/site-packages/torch/nn/modules/lazy.py:178: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
  warnings.warn('Lazy modules are a new feature under heavy development '
average validation log mse = 0.18091966241598129
Notice that sometimes the training error for a set of hyperparameters can be very
low, even as the error on K-fold cross-validation is considerably higher. This
indicates that we are overfitting. Throughout training you will want to monitor both numbers. Less
overfitting might indicate that our data can support a more powerful model. Massive overfitting
might suggest that we can gain by incorporating regularization techniques.

5.7.8 Submitting Predictions on Kaggle
Now that we know what a good choice of hyperparameters should be, we might calculate the aver-
age predictions on the test set by all the K models. Saving the predictions in a csv file will simplify
uploading the results to Kaggle. The following code will generate a file called submission.csv.
preds = [model(torch.tensor(data.val.values, dtype=torch.float32))
         for model in models]
# Taking exponentiation of predictions in the logarithm scale
ensemble_preds = torch.exp(torch.cat(preds, 1)).mean(1)
submission = pd.DataFrame({'Id':data.raw_val.Id,
                           'SalePrice':ensemble_preds.detach().numpy()})
submission.to_csv('submission.csv', index=False)
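The ensembling step above exponentiates each model's log-price prediction and then averages in price space. The sketch below reproduces the tensor shapes on made-up predictions from three hypothetical models and contrasts this with averaging in log space, which corresponds to a geometric mean of the prices. Which of the two is preferable is a modeling choice that we do not settle here.

```python
import torch

# Hypothetical log-price predictions from K = 3 models for 2 houses, shape (2, 1) each
preds = [torch.log(torch.tensor([[100_000.0], [300_000.0]])),
         torch.log(torch.tensor([[110_000.0], [330_000.0]])),
         torch.log(torch.tensor([[120_000.0], [360_000.0]]))]

stacked = torch.cat(preds, 1)              # shape (2, 3): one column per model
price_mean = torch.exp(stacked).mean(1)    # arithmetic mean of prices, as in the code above
log_mean = torch.exp(stacked.mean(1))      # geometric mean of prices (an alternative)
print(stacked.shape, price_mean, log_mean)
```

The geometric mean is never larger than the arithmetic mean, so averaging in log space gives slightly more conservative price estimates when the models disagree.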
Next, as demonstrated in Fig. 5.7.3, we can submit our predictions on Kaggle and see how they
compare with the actual house prices (labels) on the test set. The steps are quite simple:
• Log in to the Kaggle website and visit the house price prediction competition page.
• Click the “Submit Predictions” or “Late Submission” button (as of this writing, the button is
located on the right).
• Click the “Upload Submission File” button in the dashed box at the bottom of the page and
select the prediction file you wish to upload.
• Click the “Make Submission” button at the bottom of the page to view your results.
Fig. 5.7.3: Submitting data to Kaggle
5.7.9 Summary
• Real data often contains a mix of different data types and needs to be preprocessed.
• Rescaling real-valued data to zero mean and unit variance is a good default. So is replacing
missing values with their mean.
• Transforming categorical features into indicator features allows us to treat them like one-hot
vectors.
• We can use K-fold cross-validation to select the model and adjust the hyperparameters.

• Logarithms are useful for relative errors.
Exercises
1. Submit your predictions for this section to Kaggle. How good are your predictions?
2. Is it always a good idea to replace missing values by their mean? Hint: can you construct a
situation where the values are not missing at random?
3. Improve the score on Kaggle by tuning the hyperparameters through K-fold cross-
validation.
4. Improve the score by improving the model (e.g., layers, weight decay, and dropout).
5. What happens if we do not standardize the continuous numerical features as we have
done in this section?
Discussions

6 | Buildersʼ Guide
Alongside giant datasets and powerful hardware, great software tools have played an indispens-
able role in the rapid progress of deep learning. Starting with the pathbreaking Theano library
released in 2007, flexible open-source tools have enabled researchers to rapidly prototype models,
avoiding repetitive work when recycling standard components while still maintaining the ability
to make low-level modifications. Over time, deep learning libraries have evolved to offer
increasingly coarse abstractions. Just as semiconductor designers went from specifying transistors
to logical circuits to writing code, neural network researchers have moved from thinking about
the behavior of individual artificial neurons to conceiving of networks in terms of whole layers,
and now often design architectures with far coarser blocks in mind.
So far, we have introduced some basic machine learning concepts, ramping up to fully-functional
deep learning models. In the last chapter, we implemented each component of an MLP from
scratch and even showed how to leverage high-level APIs to roll out the same models effortlessly.
To get you that far that fast, we called upon the libraries, but skipped over more advanced details
about how they work. In this chapter, we will peel back the curtain, digging deeper into the key
components of deep learning computation, namely model construction, parameter access and
initialization, designing custom layers and blocks, reading and writing models to disk, and lever-
aging GPUs to achieve dramatic speedups. These insights will move you from end user to power
user, giving you the tools needed to reap the benefits of a mature deep learning library while re-
taining the flexibility to implement more complex models, including those you invent yourself!
While this chapter does not introduce any new models or datasets, the advanced modeling chap-
ters that follow rely heavily on these techniques.
6.1 Layers and Modules
When we first introduced neural networks, we focused on linear models with a single output.
Here, the entire model consists of just a single neuron. Note that a single neuron (i) takes some
set of inputs; (ii) generates a corresponding scalar output; and (iii) has a set of associated param-
eters that can be updated to optimize some objective function of interest. Then, once we started
thinking about networks with multiple outputs, we leveraged vectorized arithmetic to characterize
an entire layer of neurons. Just like individual neurons, layers (i) take a set of inputs, (ii) generate
corresponding outputs, and (iii) are described by a set of tunable parameters. When we worked
through softmax regression, a single layer was itself the model. However, even when we subse-
quently introduced MLPs, we could still think of the model as retaining this same basic structure.
Interestingly, for MLPs, both the entire model and its constituent layers share this structure. The
entire model takes in raw inputs (the features), generates outputs (the predictions), and possesses
parameters (the combined parameters from all constituent layers). Likewise, each individual layer
ingests inputs (supplied by the previous layer), generates outputs (the inputs to the subsequent
layer), and possesses a set of tunable parameters that are updated according to the signal that
flows backwards from the subsequent layer.
While you might think that neurons, layers, and models give us enough abstractions to go about
our business, it turns out that we often find it convenient to speak about components that are
larger than an individual layer but smaller than the entire model. For example, the ResNet-152
architecture, which is wildly popular in computer vision, possesses hundreds of layers. These
layers consist of repeating patterns of groups of layers. Implementing such a network one layer at
a time can grow tedious. This concern is not just hypothetical—such design patterns are common
in practice. The ResNet architecture mentioned above won the 2015 ImageNet and COCO com-
puter vision competitions for both recognition and detection (He et al., 2016) and remains a go-to
architecture for many vision tasks. Similar architectures in which layers are arranged in various
repeating patterns are now ubiquitous in other domains, including natural language processing
and speech.
To implement these complex networks, we introduce the concept of a neural network module.
A module could describe a single layer, a component consisting of multiple layers, or the entire
model itself! One benefit of the module abstraction is that modules can be combined
into larger artifacts, often recursively. This is illustrated in Fig. 6.1.1. By defining code to generate
modules of arbitrary complexity on demand, we can write surprisingly compact code and still
implement complex neural networks.
Fig. 6.1.1: Multiple layers are combined into modules, forming repeating patterns of larger mod-
els.
From a programming standpoint, a module is represented by a class. Any subclass of it must define
a forward propagation method that transforms its input into output and must store any necessary
parameters. Note that some modules do not require any parameters at all. Finally, a module
must possess a backpropagation method for calculating gradients. Fortunately, thanks to
automatic differentiation (introduced in Section 2.5), when defining our own module we only
need to worry about the parameters and the forward propagation method.
To begin, we revisit the code that we used to implement MLPs (Section 5.1). The following code
generates a network with one fully connected hidden layer with 256 units and ReLU activation,
followed by a fully connected output layer with 10 units (no activation function).

import torch
from torch import nn
from torch.nn import functional as F

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
X = torch.rand(2, 20)
net(X).shape
torch.Size([2, 10])
In this example, we constructed our model by instantiating an nn.Sequential, with layers in the
order that they should be executed passed as arguments. In short, nn.Sequential defines a special
kind of Module, the class that represents a module in PyTorch. It maintains an ordered list of
constituent Modules. Note that each of the two fully connected layers is an instance of the
LazyLinear class, which is itself a subclass of Module. The forward propagation (forward) method
is also remarkably simple: it chains each module in the list together, passing the output of each as input to
the next. Note that until now, we have been invoking our models via the construction net(X) to
obtain their outputs. This is actually just shorthand for net.__call__(X).
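As a quick check of this description, the following sketch chains the layers of an nn.Sequential by hand and compares the result with calling the container directly. The names manual and same are ours, introduced only for illustration.

```python
import torch
from torch import nn

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
X = torch.rand(2, 20)
out = net(X)  # the first call also materializes the lazy layers

# Pass the input through the constituent modules one by one, in order.
manual = X
for layer in net:
    manual = layer(manual)

# Both routes perform the same operations, so the results should agree.
same = torch.equal(out, manual)
print(same, out.shape)
```

Iterating over the container visits the modules in the order in which they were passed to the constructor, which is also the order used by its forward method.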
6.1.1 A Custom Module
Perhaps the easiest way to develop intuition about how a module works is to implement one our-
selves. Before we implement our own custom module, we briefly summarize the basic function-
ality that each module must provide:
1. Ingest input data as arguments to its forward propagation method.
2. Generate an output by having the forward propagation method return a value. Note that the
output may have a different shape from the input. For example, the first fully connected
layer in our model above ingests an input of arbitrary dimension but returns an output of
dimension 256.
3. Calculate the gradient of its output with respect to its input, which can be accessed via its
backpropagation method. Typically this happens automatically.
4. Store and provide access to those parameters necessary to execute the forward propagation
computation.
5. Initialize model parameters as needed.
In the following snippet, we code up a module from scratch corresponding to an MLP with one
hidden layer with 256 hidden units, and a 10-dimensional output layer. Note that the MLP class below
inherits from the class that represents a module. We will rely heavily on the parent classʼs methods,
supplying only our own constructor (the __init__ method in Python) and the forward propagation
method.

class MLP(nn.Module):
    def __init__(self):
        # Call the constructor of the parent class nn.Module to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))
Letʼs first focus on the forward propagation method. Note that it takes X as input, calculates the
hidden representation with the activation function applied, and outputs its logits. In this MLP im-
plementation, both layers are instance variables. To see why this is reasonable, imagine instanti-
ating two MLPs, net1 and net2, and training them on different data. Naturally, we would expect
them to represent two different learned models.
We instantiate the MLPʼs layers in the constructor and subsequently invoke these layers on each
call to the forward propagation method. Note a few key details. First, our customized __init__
method invokes the parent classʼs __init__ method via super().__init__(), sparing us the pain of
restating boilerplate code applicable to most modules. We then instantiate our two fully connected
layers, assigning them to self.hidden and self.out. Note that unless we implement a new layer,
we need not worry about the backpropagation method or parameter initialization. The system
will generate these methods automatically. Letʼs try this out.
net = MLP()
net(X).shape
torch.Size([2, 10])
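To make the point about instance variables concrete, the sketch below creates two instances of the same MLP class and checks that they own separate parameters. The names net1, net2, same_object, and net2_unchanged are illustrative.

```python
import torch
from torch import nn
from torch.nn import functional as F

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)

    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))

X = torch.rand(2, 20)
net1, net2 = MLP(), MLP()
net1(X), net2(X)  # materialize the lazy layers of both instances

# Each instance holds its own Parameter objects.
same_object = net1.hidden.weight is net2.hidden.weight

# Overwriting the weights of net1 leaves net2 untouched.
before = net2.hidden.weight.detach().clone()
with torch.no_grad():
    net1.hidden.weight.zero_()
net2_unchanged = torch.equal(net2.hidden.weight, before)
print(same_object, net2_unchanged)
```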
A key virtue of the module abstraction is its versatility. We can subclass a module to create lay-
ers (such as the fully connected layer class), entire models (such as the MLP class above), or vari-
ous components of intermediate complexity. We exploit this versatility throughout the following
chapters, such as when addressing convolutional neural networks.
6.1.2 The Sequential Module
We can now take a closer look at how the Sequential class works. Recall that Sequential was
designed to daisy-chain other modules together. To build our own simplified MySequential, we
just need to define two key methods:
1. A method to append modules one by one to a list.
2. A forward propagation method to pass an input through the chain of modules, in the same order as
they were appended.
The following MySequential class delivers the same functionality as the default Sequential class.
class MySequential(nn.Module):
    def __init__(self, *args):
        super().__init__()
        for idx, module in enumerate(args):
            self.add_module(str(idx), module)

    def forward(self, X):
        for module in self.children():
            X = module(X)
        return X
In the __init__ method, we add every module by calling the add_module method. These modules
can be accessed by the children method later. In this way the system knows the added modules,
and it will properly initialize each moduleʼs parameters.
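The effect of registration can be seen in a small experiment. The sketch below compares a container that registers its layers through add_module with one that merely keeps them in a plain Python list. The class names RegisteredNet and PlainListNet are hypothetical, and fixed-size nn.Linear layers are used so that parameters can be counted without a forward pass.

```python
import torch
from torch import nn

class RegisteredNet(nn.Module):
    """Registers each layer so that the parent class can track it."""
    def __init__(self, *layers):
        super().__init__()
        for idx, layer in enumerate(layers):
            self.add_module(str(idx), layer)

class PlainListNet(nn.Module):
    """Stores layers in an ordinary list, which nn.Module does not inspect."""
    def __init__(self, *layers):
        super().__init__()
        self.layers = list(layers)

registered = RegisteredNet(nn.Linear(4, 8), nn.Linear(8, 1))
plain = PlainListNet(nn.Linear(4, 8), nn.Linear(8, 1))

n_registered = len(list(registered.parameters()))  # two weights, two biases
n_plain = len(list(plain.parameters()))            # nothing is visible
print(n_registered, n_plain)
```

Parameters that are not visible through parameters() would be skipped by an optimizer constructed from net.parameters() and would be missing from state_dict().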
When our MySequentialʼs forward propagation method is invoked, each added module is executed
in the order in which they were added. We can now reimplement an MLP using our MySequential
class.
net = MySequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

net(X).shape
torch.Size([2, 10])
Note that this use of MySequential is identical to the code we previously wrote for the Sequential
class (as described in Section 5.1).
6.1.3 Executing Code in the Forward Propagation Method
The Sequential class makes model construction easy, allowing us to assemble new architectures
without having to define our own class. However, not all architectures are simple daisy chains.
When greater flexibility is required, we will want to define our own modules. For example, we
might want to execute Pythonʼs control flow within the forward propagation method. Moreover,
we might want to perform arbitrary mathematical operations, not simply relying on predefined
neural network layers.
You might have noticed that until now, all of the operations in our networks have acted upon our
networkʼs activations and its parameters. Sometimes, however, we might want to incorporate
terms that are neither the result of previous layers nor updatable parameters. We call these con-
stant parameters. Say for example that we want a layer that calculates the function f (x, w) = c·w⊤ x,
where x is the input, w is our parameter, and c is some specified constant that is not updated dur-
ing optimization. So we implement a FixedHiddenMLP class as follows.
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weight parameters that will not compute gradients and
        # therefore keep constant during training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)

    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()
In this FixedHiddenMLP model, we implement a hidden layer whose weights (self.rand_weight)
are initialized randomly at instantiation and are thereafter constant. This weight is not a model
parameter and thus it is never updated by backpropagation. The network then passes the output
of this “fixed” layer through a fully connected layer.
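The following sketch illustrates why such a tensor is never updated: a plain tensor attribute does not appear among the parameters of the module. It also shows, as an alternative that the text above does not use, PyTorch's register_buffer method, which keeps a constant tensor inside the module's state_dict. The class names ConstantAttr and ConstantBuffer are hypothetical.

```python
import torch
from torch import nn

class ConstantAttr(nn.Module):
    """Keeps the constant as a plain tensor attribute, as in the text."""
    def __init__(self):
        super().__init__()
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.Linear(20, 20)

class ConstantBuffer(nn.Module):
    """Alternative (our assumption, not the book's code): use a buffer."""
    def __init__(self):
        super().__init__()
        self.register_buffer("rand_weight", torch.rand((20, 20)))
        self.linear = nn.Linear(20, 20)

attr_net, buf_net = ConstantAttr(), ConstantBuffer()

# Only the fully connected layer contributes trainable parameters.
param_names = [name for name, _ in attr_net.named_parameters()]
# The plain attribute is absent from the state dict; the buffer is present.
attr_keys = list(attr_net.state_dict().keys())
buf_keys = list(buf_net.state_dict().keys())
print(param_names, attr_net.rand_weight.requires_grad)
```

Whether to use a plain attribute or a buffer is a design choice: a buffer is saved and loaded together with the parameters, while a plain attribute is re-created randomly every time the class is instantiated.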
Note that before returning the output, our model did something unusual. We ran a while-loop that
tests whether the ℓ1 norm of the output vector is larger than 1 and divides the vector by 2 until
the norm is at most 1. Finally, we returned the sum of the entries in X. To our knowledge, no
standard neural network performs this operation. Note that this particular operation may not be
useful in any real-world task. Our point is only to show you how to integrate arbitrary code into
the flow of your neural network computations.
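The loop logic can be traced on a tiny example without any framework. The sketch below applies the same rule to a plain Python list; the function name halve_until_small is ours and is not part of the model above.

```python
def halve_until_small(values):
    """Halve all entries while the l1 norm (sum of absolute values) exceeds 1."""
    values = list(values)
    steps = 0
    while sum(abs(v) for v in values) > 1:
        values = [v / 2 for v in values]
        steps += 1
    return values, steps

# The l1 norm starts at 8 and shrinks to 4, 2, and finally 1, where the loop stops.
result, steps = halve_until_small([3.0, -5.0])
print(result, steps)
```

The number of iterations depends on the data, which is exactly the kind of computation that a fixed chain of layers cannot express.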
net = FixedHiddenMLP()
net(X)
tensor(0.0510, grad_fn=<SumBackward0>)
We can mix and match various ways of assembling modules together. In the following example,
we nest modules in some creative ways.
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)

    def forward(self, X):
        return self.linear(self.net(X))

chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chimera(X)
tensor(0.0493, grad_fn=<SumBackward0>)

6.1.4 Summary
• Layers are modules.

• Many layers can comprise a module.
• Many modules can comprise a module.
• A module can contain code.
• Modules take care of lots of housekeeping, including parameter initialization and backprop-
agation.
• Sequential concatenations of layers and modules are handled by the Sequential module.
Exercises
1. What kinds of problems will occur if you change MySequential to store modules in a Python
list?
2. Implement a module that takes two modules as arguments, say net1 and net2, and returns
the concatenated output of both networks in the forward propagation. This is also called a
parallel module.
3. Assume that you want to concatenate multiple instances of the same network. Implement
a factory function that generates multiple instances of the same module and build a larger
network from it.
Discussions108
6.2 Parameter Management
Once we have chosen an architecture and set our hyperparameters, we proceed to the training
loop, where our goal is to find parameter values that minimize our loss function. After training, we
will need these parameters in order to make future predictions. Additionally, we will sometimes
wish to extract the parameters either to reuse them in some other context, to save our model
to disk so that it may be executed in other software, or for examination in the hope of gaining
scientific understanding.
Most of the time, we will be able to ignore the nitty-gritty details of how parameters are declared
and manipulated, relying on deep learning frameworks to do the heavy lifting. However, when we
move away from stacked architectures with standard layers, we will sometimes need to get into
the weeds of declaring and manipulating parameters. In this section, we cover the following:
• Accessing parameters for debugging, diagnostics, and visualizations.
• Sharing parameters across different model components.
We start by focusing on an MLP with one hidden layer.

import torch
from torch import nn

net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape
torch.Size([2, 1])
6.2.1 Parameter Access
Letʼs start with how to access parameters from the models that you already know. When a model
is defined via the Sequential class, we can first access any layer by indexing into the model as
though it were a list. Each layerʼs parameters are conveniently stored as attributes of that layer. We can
inspect the parameters of the second fully connected layer as follows.
net[2].state_dict()
OrderedDict([('weight',
              tensor([[-0.0020, -0.2192, -0.1979, 0.1038, 0.2019, -0.2588, 0.2396, -0.3146]])),
             ('bias', tensor([0.2685]))])
We can see that this fully connected layer contains two parameters, corresponding to that layerʼs
weights and biases, respectively.
Targeted Parameters
Note that each parameter is represented as an instance of the parameter class. To do anything
useful with the parameters, we first need to access the underlying numerical values. There are
several ways to do this. Some are simpler while others are more general. The following code
extracts the bias from the second neural network layer, which returns a parameter class instance,
and further accesses that parameterʼs value.
type(net[2].bias), net[2].bias.data
(torch.nn.parameter.Parameter, tensor([0.2685]))
Parameters are complex objects, containing values, gradients, and additional information. Thatʼs
why we need to request the value explicitly.
In addition to the value, each parameter also allows us to access the gradient. Because we have
not invoked backpropagation for this network yet, it is in its initial state.

net[2].weight.grad is None
True
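For contrast, the sketch below shows how the gradient attribute changes once backpropagation has run. It uses fixed-size nn.Linear layers instead of lazy ones so that it is self-contained; the scalar loss (the sum of the outputs) is an arbitrary choice made for illustration.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(2, 4)

grad_before = net[2].weight.grad  # no backward pass yet, so this is None

# Reduce the output to a scalar and backpropagate through the network.
net(X).sum().backward()

grad_after = net[2].weight.grad   # now a tensor shaped like the weight
print(grad_before, grad_after.shape)
```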
All Parameters at Once
When we need to perform operations on all parameters, accessing them one-by-one can grow
tedious. The situation can grow especially unwieldy when we work with more complex modules
(e.g., nested modules), since we would need to recurse through the entire tree to extract each
sub-moduleʼs parameters. Below we demonstrate accessing the parameters of all layers.
[(name, param.shape) for name, param in net.named_parameters()]
[('0.weight', torch.Size([8, 4])),

('0.bias', torch.Size([8])),
('2.weight', torch.Size([1, 8])),
('2.bias', torch.Size([1]))]
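Iterating over all parameters also makes simple diagnostics easy. As a minimal sketch, assuming the same layer sizes as above but with fixed-size nn.Linear layers, the code below counts the scalar parameters in the network and looks one parameter up by its name in the state dictionary.

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Map every parameter name to its shape.
shapes = {name: tuple(p.shape) for name, p in net.named_parameters()}

# 8 * 4 weights + 8 biases in the first layer, 1 * 8 weights + 1 bias in the last.
total = sum(p.numel() for p in net.parameters())

# The state dictionary exposes the same values under the same names.
bias_matches = torch.equal(net.state_dict()["2.bias"], net[2].bias)
print(shapes, total, bias_matches)
```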
6.2.2 Tied Parameters
Often, we want to share parameters across multiple layers. Letʼs see how to do this elegantly. In
the following we allocate a fully connected layer and then use its parameters specifically to set
those of another layer. Here we need to run the forward propagation net(X) before accessing the
parameters.
# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
shared, nn.ReLU(),
shared, nn.ReLU(),
nn.LazyLinear(1))
net(X)
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Make sure that they are actually the same object rather than just having the
# same value
print(net[2].weight.data[0] == net[4].weight.data[0])
tensor([True, True, True, True, True, True, True, True])

tensor([True, True, True, True, True, True, True, True])
This example shows that the parameters of the second and third layer are tied. They are not just
equal, they are represented by the same exact tensor. Thus, if we change one of the parameters,
the other one changes, too. You might wonder, when parameters are tied what happens to the
gradients? Since the model parameters contain gradients, the gradients of the second hidden
layer and the third hidden layer are added together during backpropagation.
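The accumulation of gradients can be verified on a one-dimensional example. In the sketch below, a single bias-free layer with weight w = 3 is applied twice, so the output is w · (w · x) and its derivative with respect to w is 2wx. The untied counterpart with two separate layers is included for comparison. All names are illustrative.

```python
import torch
from torch import nn

def scalar_layer(value):
    """A 1-to-1 linear layer without bias whose single weight equals value."""
    layer = nn.Linear(1, 1, bias=False)
    with torch.no_grad():
        layer.weight.fill_(value)
    return layer

x = torch.tensor([[2.0]])

# Tied: the same layer object appears twice in the network.
shared = scalar_layer(3.0)
tied = nn.Sequential(shared, shared)
tied(x).sum().backward()
tied_grad = shared.weight.grad.item()  # 2 * w * x = 12

# Untied: two separate layers that merely start with equal weights.
a, b = scalar_layer(3.0), scalar_layer(3.0)
nn.Sequential(a, b)(x).sum().backward()
untied_grads = (a.weight.grad.item(), b.weight.grad.item())  # 6 each

n_tied_params = len(list(tied.parameters()))  # duplicates are reported once
print(tied_grad, untied_grads, n_tied_params)
```

The gradient of the tied weight equals the sum of the two gradients in the untied network, which is what adding the contributions of both uses amounts to.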

6.2.3 Summary
We have several ways to access and tie model parameters.
Exercises
1. Use the NestMLP model defined in Section 6.1 and access the parameters of the various layers.
2. Construct an MLP containing a shared parameter layer and train it. During the training
process, observe the model parameters and gradients of each layer.
3. Why is sharing parameters a good idea?
Discussions109
6.3 Parameter Initialization
Now that we know how to access the parameters, letʼs look at how to initialize them properly. We
discussed the need for proper initialization in Section 5.4. The deep learning framework provides
default random initializations to its layers. However, we often want to initialize our weights
according to various other protocols. The framework provides the most commonly used protocols
and also allows us to create a custom initializer.
By default, PyTorch initializes weight and bias matrices uniformly by drawing from a range that
is computed according to the input and output dimension. PyTorchʼs nn.init module provides a
variety of preset initialization methods.
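import torch
from torch import nn

net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape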
torch.Size([2, 1])

6.3.1 Built-in Initialization
Letʼs begin by calling on built-in initializers. The code below initializes all weight parameters as
Gaussian random variables with standard deviation 0.01, while bias parameters are set to zero.
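def init_normal(module):
    # After the first forward pass, a LazyLinear layer has become a regular
    # nn.Linear layer, so that is the type to check for
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)

net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]
The first row of net[0].weight now holds four small random values on the order of 0.01, and the
bias entry is exactly 0.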
We can also initialize all the parameters to a given constant value (say, 1).
def init_constant(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 1)
        nn.init.zeros_(module.bias)

net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]
(tensor([1., 1., 1., 1.]), tensor(0.))
We can also apply different initializers for certain blocks. For example, below we initialize the
first layer with the Xavier initializer and initialize the second layer to a constant value of 42.
def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)

def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)

net[0].apply(init_xavier)
net[2].apply(init_42)
print(net[0].weight.data[0])
print(net[2].weight.data)
The first print shows four Xavier-initialized values, which vary from run to run. The second
print shows the constant weights:
tensor([[42., 42., 42., 42., 42., 42., 42., 42.]])
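Because random initializers produce different numbers on every run, it is often easier to check their statistics than individual values. The sketch below does this for the two random initializers used above on a larger layer: the empirical standard deviation after nn.init.normal_, and the range after nn.init.xavier_uniform_, whose default bound is sqrt(6 / (fan_in + fan_out)). The layer size 500 × 400 is an arbitrary choice for illustration.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(500, 400)  # weight has shape (400, 500)

# Gaussian initialization: the sample standard deviation should be near 0.01.
nn.init.normal_(layer.weight, mean=0, std=0.01)
empirical_std = layer.weight.std().item()

# Xavier uniform initialization: all values lie within [-bound, bound].
nn.init.xavier_uniform_(layer.weight)
fan_out, fan_in = layer.weight.shape
bound = math.sqrt(6 / (fan_in + fan_out))
max_abs = layer.weight.abs().max().item()
print(empirical_std, max_abs, bound)
```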

Custom Initialization
Sometimes, the initialization methods we need are not provided by the deep learning framework.
In the example below, we define an initializer for any weight parameter w using the following
strange distribution:


$$
w \sim \begin{cases}
U(5, 10) & \text{with probability } \frac{1}{4} \\
0 & \text{with probability } \frac{1}{2} \\
U(-10, -5) & \text{with probability } \frac{1}{4}
\end{cases}
\tag{6.3.1}
$$
Again, we implement a my_init function to apply to net.
def my_init(module):
    if type(module) == nn.Linear:
        print("Init", *[(name, param.shape)
                        for name, param in module.named_parameters()][0])
        nn.init.uniform_(module.weight, -10, 10)
        module.weight.data *= module.weight.data.abs() >= 5

net.apply(my_init)
net[0].weight[:2]
Init weight torch.Size([8, 4])
Init weight torch.Size([1, 8])
The two displayed rows of net[0].weight now contain only zeros and values whose magnitude lies
between 5 and 10.
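To see why drawing from U(−10, 10) and then zeroing out every value with magnitude below 5 matches the distribution in (6.3.1), we can simulate the same two steps with Python's standard random module and count the outcomes. This is a framework-free sketch; the function name sample_weight is ours.

```python
import random

random.seed(0)

def sample_weight():
    """Draw from U(-10, 10), then zero out values with magnitude below 5."""
    w = random.uniform(-10, 10)
    return w if abs(w) >= 5 else 0.0

samples = [sample_weight() for _ in range(100_000)]
n = len(samples)

frac_zero = sum(w == 0.0 for w in samples) / n  # expected to be close to 1/2
frac_pos = sum(w >= 5 for w in samples) / n     # expected to be close to 1/4
frac_neg = sum(w <= -5 for w in samples) / n    # expected to be close to 1/4
print(frac_zero, frac_pos, frac_neg)
```

Half of the interval [−10, 10] has magnitude below 5, which accounts for the probability 1/2 of a zero weight; the remaining mass is split evenly between the two outer intervals.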
Note that we always have the option of setting parameters directly.
net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42
net[0].weight.data[0]
The first entry of that row is now exactly 42, and each remaining entry is 1 larger than before.
6.3.2 Summary
We can initialize parameters using built-in and custom initializers.
Exercises
Look up the online documentation for more built-in initializers.

Discussions110

6.4 Lazy Initialization
So far, it might seem that we got away with being sloppy in setting up our networks. Specifically,
we did the following unintuitive things, which might not seem like they should work:
• We defined the network architectures without specifying the input dimensionality.
• We added layers without specifying the output dimension of the previous layer.
• We even “initialized” these parameters before providing enough information to determine
how many parameters our models should contain.
You might be surprised that our code runs at all. After all, there is no way the deep learning
framework could tell what the input dimensionality of a network would be. The trick here is that
the framework defers initialization, waiting until the first time we pass data through the model, to
infer the sizes of each layer on the fly.
Later on, when working with convolutional neural networks, this technique will become even
more convenient since the input dimensionality (i.e., the resolution of an image) will affect the
dimensionality of each subsequent layer. Hence, the ability to set parameters without the need
to know, at the time of writing the code, what the dimensionality is can greatly simplify the task
of specifying and subsequently modifying our models. Next, we go deeper into the mechanics of
initialization.
To begin, letʼs instantiate an MLP.
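import torch
from torch import nn
from d2l import torch as d2l

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))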
At this point, the network cannot possibly know the dimensions of the input layerʼs weights be-
cause the input dimension remains unknown. Consequently the framework has not yet initialized
any parameters. We confirm by attempting to access the parameters below.
net[0].weight
<UninitializedParameter>
Next letʼs pass data through the network to make the framework finally initialize parameters.
X = torch.rand(2, 20)
net(X)
net[0].weight.shape

torch.Size([256, 20])
As soon as we know the input dimensionality, 20, the framework can identify the shape of the
first layerʼs weight matrix by plugging in the value of 20. Having recognized the first layerʼs shape,
the framework proceeds to the second layer, and so on through the computational graph until
all shapes are known. Note that in this case, only the first layer requires lazy initialization, but
the framework initializes sequentially. Once all parameter shapes are known, the framework can
finally initialize the parameters.
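The state of each parameter before and after the first forward pass can be inspected programmatically. The sketch below relies on the UninitializedParameter class from torch.nn.parameter; the helper name is_uninitialized is ours.

```python
import torch
from torch import nn
from torch.nn.parameter import UninitializedParameter

def is_uninitialized(param):
    """True while the parameter still has no shape and no storage."""
    return isinstance(param, UninitializedParameter)

net = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))
before = [is_uninitialized(p) for p in net.parameters()]

net(torch.rand(2, 20))  # the first forward pass fixes all shapes

after = [is_uninitialized(p) for p in net.parameters()]
shapes = [tuple(p.shape) for p in net.parameters()]
layer_type = type(net[0]).__name__
print(before, after, shapes, layer_type)
```

After the forward pass, the lazy layers report the type Linear: once its shapes are known, a lazy layer turns into its regular counterpart.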
The following method passes in dummy inputs through the network for a dry run to infer all pa-
rameter shapes and subsequently initializes the parameters. It will be used later when default
random initializations are not desired.
@d2l.add_to_class(d2l.Module)  #@save
def apply_init(self, inputs, init=None):
    self.forward(*inputs)
    if init is not None:
        self.net.apply(init)
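The same pattern can be written without the d2l helper classes. The following is a minimal standalone sketch under that assumption; the function names dry_run_and_init and init_ones are hypothetical and serve only to show the order of the two steps.

```python
import torch
from torch import nn

def dry_run_and_init(net, inputs, init=None):
    """Run one forward pass to infer all shapes, then apply init if given."""
    net(*inputs)
    if init is not None:
        net.apply(init)

def init_ones(module):
    # Materialized lazy layers are nn.Linear instances at this point.
    if isinstance(module, nn.Linear):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)

net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
dry_run_and_init(net, [torch.rand(2, 4)], init_ones)
print(net[0].weight.shape, net[0].weight[0])
```

Applying the initializer before the dry run would not work as intended, since the weights of a lazy layer have no shape until data has passed through it.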
6.4.1 Summary
• Lazy initialization can be convenient, allowing the framework to infer parameter shapes
automatically, making it easy to modify architectures and eliminating one common source
of errors.
• We can pass data through the model to make the framework finally initialize parameters.
Exercises
1. What happens if you specify the input dimensions to the first layer but not to subsequent
layers? Do you get immediate initialization?
2. What happens if you specify mismatching dimensions?
3. What would you need to do if you have input of varying dimensionality? Hint: look at the
parameter tying.
Discussions111
6.5 Custom Layers
One factor behind deep learningʼs success is the availability of a wide range of layers that can be
composed in creative ways to design architectures suitable for a wide variety of tasks. For instance,
researchers have invented layers specifically for handling images, text, looping over sequential
data, and performing dynamic programming. Sooner or later, you will encounter or invent a layer
that does not exist yet in the deep learning framework. In these cases, you must build a custom
layer. In this section, we show you how.

6.5.1 Layers without Parameters
To start, we construct a custom layer that does not have any parameters of its own. This should
look familiar if you recall our introduction to modules in Section 6.1. The following CenteredLayer
class simply subtracts the mean from its input. To build it, we need to inherit from the
base layer class and implement the forward propagation function.
import torch
from torch import nn
from torch.nn import functional as F

class CenteredLayer(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, X):
        return X - X.mean()
Letʼs verify that our layer works as intended by feeding some data through it.
layer = CenteredLayer()
layer(torch.tensor([1.0, 2, 3, 4, 5]))
tensor([-2., -1., 0., 1., 2.])
We can now incorporate our layer as a component in constructing more complex models.
net = nn.Sequential(nn.LazyLinear(128), CenteredLayer())
As an extra sanity check, we can send random data through the network and check that the mean
is in fact 0. Because we are dealing with floating-point numbers, we may still see a very small
nonzero number due to rounding error.
Y = net(torch.rand(4, 8))
Y.mean()
tensor(0., grad_fn=<MeanBackward0>)

6.5.2 Layers with Parameters
Now that we know how to define simple layers, letʼs move on to defining layers with parameters
that can be adjusted through training. We can use built-in functions to create parameters, which
provide some basic housekeeping functionality. In particular, they govern access, initialization,
sharing, saving, and loading model parameters. This way, among other benefits, we will not need
to write custom serialization routines for every custom layer.
Now letʼs implement our own version of the fully connected layer. Recall that this layer requires
two parameters, one to represent the weight and the other for the bias. In this implementation,
we bake in the ReLU activation as a default. This layer requires two input arguments: in_units
and units, which denote the number of inputs and outputs, respectively.
class MyLinear(nn.Module):
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units,))

    def forward(self, X):
        linear = torch.matmul(X, self.weight.data) + self.bias.data
        return F.relu(linear)
Next, we instantiate the MyLinear class and access its model parameters.
linear = MyLinear(5, 3)
linear.weight
Parameter containing:
tensor([[-0.1768, 0.5718, 1.3372],
[-0.0800, -0.3124, 0.5639],
[ 1.0192, 1.8009, 1.0995],
[ 1.2927, 2.0420, -0.0564],
[-1.1036, 1.4116, 1.9683]], requires_grad=True)
We can directly carry out forward propagation calculations using custom layers.
linear(torch.rand(2, 5))
tensor([[0.0000, 3.3582, 1.4402],

[0.2795, 3.7836, 0.7976]])
We can also construct models using custom layers. Once we have that, we can use it just like the
built-in fully connected layer.
net = nn.Sequential(MyLinear(64, 8), MyLinear(8, 1))

net(torch.rand(2, 64))
tensor([[0.],
[0.]])
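Note that the forward propagation method of MyLinear reads the parameters through .data, which returns the raw tensors without gradient tracking; this is why the outputs above carry no grad_fn. If the layer is meant to be trained, one possible variant uses the parameters directly. The sketch below is such a variant under our own name MyTrainableLinear. It also shows that attributes created with nn.Parameter are registered automatically.

```python
import torch
from torch import nn
from torch.nn import functional as F

class MyTrainableLinear(nn.Module):
    """Variant of the layer above whose forward pass is tracked by autograd."""
    def __init__(self, in_units, units):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_units, units))
        self.bias = nn.Parameter(torch.randn(units))

    def forward(self, X):
        # Using the parameters themselves (not .data) keeps the graph intact.
        return F.relu(torch.matmul(X, self.weight) + self.bias)

layer = MyTrainableLinear(5, 3)
names = [name for name, _ in layer.named_parameters()]

out = layer(torch.rand(2, 5))
out.sum().backward()
print(names, out.requires_grad, layer.weight.grad.shape)
```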

6.5.3 Summary
• We can design custom layers via the basic layer class. This allows us to define flexible new
layers that behave differently from any existing layers in the library.
• Once defined, custom layers can be invoked in arbitrary contexts and architectures.
• Layers can have local parameters, which can be created through built-in functions.
Exercises
1. Design a layer that takes an input and computes a tensor reduction, i.e., it returns
$y_k = \sum_{i, j} W_{ijk} x_i x_j$.
2. Design a layer that returns the leading half of the Fourier coefficients of the data.
Discussions112
6.6 File I/O
So far we discussed how to process data and how to build, train, and test deep learning models.
However, at some point, we will hopefully be happy enough with the learned models that we will
want to save the results for later use in various contexts (perhaps even to make predictions in de-
ployment). Additionally, when running a long training process, the best practice is to periodically
save intermediate results (checkpointing) to ensure that we do not lose several daysʼ worth of
computation if we trip over the power cord of our server. Thus it is time to learn how to load and store
both individual weight vectors and entire models. This section addresses both issues.
6.6.1 Loading and Saving Tensors
For individual tensors, we can directly invoke the load and save functions to read and write them
respectively. Both functions require that we supply a name, and save requires as input the variable
to be saved.
import torch
from torch import nn
from torch.nn import functional as F

x = torch.arange(4)
torch.save(x, 'x-file')
We can now read the data from the stored file back into memory.
x2 = torch.load('x-file')
x2
tensor([0, 1, 2, 3])

We can store a list of tensors and read them back into memory.
y = torch.zeros(4)
torch.save([x, y],'x-files')
x2, y2 = torch.load('x-files')
(x2, y2)
(tensor([0, 1, 2, 3]), tensor([0., 0., 0., 0.]))
We can even write and read a dictionary that maps from strings to tensors. This is convenient
when we want to read or write all the weights in a model.
mydict = {'x': x, 'y': y}

torch.save(mydict, 'mydict')
mydict2 = torch.load('mydict')
mydict2
{'x': tensor([0, 1, 2, 3]), 'y': tensor([0., 0., 0., 0.])}
6.6.2 Loading and Saving Model Parameters
Saving individual weight vectors (or other tensors) is useful, but it gets very tedious if we want
to save (and later load) an entire model. After all, we might have hundreds of parameter groups
sprinkled throughout. For this reason the deep learning framework provides built-in function-
alities to load and save entire networks. An important detail to note is that this saves model pa-
rameters and not the entire model. For example, if we have a 3-layer MLP, we need to specify the
architecture separately. The reason for this is that the models themselves can contain arbitrary
code, hence they cannot be serialized as naturally. Thus, in order to reinstate a model, we need
to generate the architecture in code and then load the parameters from disk. Letʼs start with our
familiar MLP.
from torch import nn
from torch.nn import functional as F

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.output = nn.LazyLinear(10)

    def forward(self, x):
        return self.output(F.relu(self.hidden(x)))

net = MLP()
X = torch.randn(size=(2, 20))
Y = net(X)
Next, we store the parameters of the model as a file with the name “mlp.params”.

torch.save(net.state_dict(), 'mlp.params')
To recover the model, we instantiate a clone of the original MLP model. Instead of randomly
initializing the model parameters, we read the parameters stored in the file directly.
clone = MLP()
clone.load_state_dict(torch.load('mlp.params'))
clone.eval()
MLP(
  (hidden): LazyLinear(in_features=0, out_features=256, bias=True)
  (output): LazyLinear(in_features=0, out_features=10, bias=True)
)
Since both instances have the same model parameters, the computational result of the same input
X should be the same. Letʼs verify this.
Y_clone = clone(X)
Y_clone == Y
tensor([[True, True, True, True, True, True, True, True, True, True],
[True, True, True, True, True, True, True, True, True, True]])
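The checkpointing practice mentioned at the start of this section can be sketched by bundling several state dictionaries into one file. This is a minimal sketch rather than the book's own code: the helper names save_checkpoint and load_checkpoint, the file name checkpoint.pt, and the choice of SGD as the optimizer are assumptions made purely for illustration.

```python
import os
import tempfile

import torch
from torch import nn

# Hypothetical small model and optimizer, used only for illustration.
net = nn.Linear(20, 10)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

def save_checkpoint(path, model, optim, epoch):
    """Store everything needed to resume training in one dictionary."""
    torch.save({'epoch': epoch,
                'model': model.state_dict(),
                'optimizer': optim.state_dict()}, path)

def load_checkpoint(path, model, optim):
    """Restore model and optimizer in place and return the saved epoch."""
    state = torch.load(path, map_location='cpu')
    model.load_state_dict(state['model'])
    optim.load_state_dict(state['optimizer'])
    return state['epoch']

ckpt_path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pt')
save_checkpoint(ckpt_path, net, optimizer, epoch=3)

# As with parameters alone, the architecture must be rebuilt in code first.
restored = nn.Linear(20, 10)
restored_optimizer = torch.optim.SGD(restored.parameters(), lr=0.1)
start_epoch = load_checkpoint(ckpt_path, restored, restored_optimizer)
```

Storing the epoch counter and optimizer state next to the parameters is one common design choice, since it lets training resume where it stopped instead of merely reloading the weights.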
6.6.3 Summary
• The save and load functions can be used to perform file I/O for tensor objects.
• We can save and load the entire sets of parameters for a network via a parameter dictionary.
• Saving the architecture has to be done in code rather than in parameters.
Exercises
1. Even if there is no need to deploy trained models to a different device, what are the practical
benefits of storing model parameters?
2. Assume that we want to reuse only parts of a network to be incorporated into a network
of a different architecture. How would you go about using, say the first two layers from a
previous network in a new network?
3. How would you go about saving the network architecture and parameters? What restrictions
would you impose on the architecture?

6.7 GPUs
In Section 1.5, we discussed the rapid growth of computation over the past two decades. In a
nutshell, GPU performance has increased by a factor of 1000 every decade since 2000. This offers
great opportunities, but it also suggests a significant need to provide such performance.
In this section, we begin to discuss how to harness this computational performance for your
research: first by using a single GPU and, at a later point, by using multiple GPUs and multiple
servers (with multiple GPUs).
Specifically, we will discuss how to use a single NVIDIA GPU for calculations. First, make sure
you have at least one NVIDIA GPU installed. Then, download the NVIDIA driver and CUDA [114]
and follow the prompts to set the appropriate path. Once these preparations are complete, the
nvidia-smi command can be used to view the graphics card information.
!nvidia-smi
Fri Aug 19 17:15:08 2022

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00 Driver Version: 460.106.00 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:1B.0 Off | 0 |
| N/A 56C P0 76W / 300W | 3189MiB / 16160MiB | 22% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:1C.0 Off | 0 |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:1D.0 Off | 0 |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:1E.0 Off | 0 |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 931 C ...l-en-release-1/bin/python 1429MiB |
| 2 N/A N/A 931 C ...l-en-release-1/bin/python 1539MiB |
+-----------------------------------------------------------------------------+
In PyTorch, every array has a device, which we often refer to as a context. So far, by default, all variables
and associated computation have been assigned to the CPU. Typically, other contexts might be
various GPUs. Things can get even hairier when we deploy jobs across multiple servers. By
assigning arrays to contexts intelligently, we can minimize the time spent transferring data between
devices. For example, when training neural networks on a server with a GPU, we typically prefer
for the modelʼs parameters to live on the GPU.
[114] https://developer.nvidia.com/cuda-downloads
To run the programs in this section, you need at least two GPUs. Note that this might be extravagant
for most desktop computers but it is easily available in the cloud, e.g., by using the AWS EC2 multi-
GPU instances. Almost all other sections do not require multiple GPUs. Instead, this is simply to
illustrate how data flow between different devices.
6.7.1 Computing Devices
We can specify devices, such as CPUs and GPUs, for storage and calculation. By default, tensors
are created in the main memory and the CPU is used to perform calculations on them.
In PyTorch, the CPU and GPU can be indicated by torch.device('cpu') and
torch.device('cuda'). It should be noted that the cpu device means all physical CPUs and memory.
This means that PyTorchʼs calculations will try to use all CPU cores. However, a GPU device only
represents one card and the corresponding memory. If there are multiple GPUs, we use
torch.device(f'cuda:{i}') to represent the ith GPU (i starts from 0). Also, cuda:0 and cuda are
equivalent as long as the current CUDA device has not been changed from its default.
import torch

def cpu():  #@save
    return torch.device('cpu')

def gpu(i=0):  #@save
    return torch.device(f'cuda:{i}')

cpu(), gpu(), gpu(1)
(device(type='cpu'),
 device(type='cuda', index=0),
 device(type='cuda', index=1))
We can query the number of available GPUs.
def num_gpus():  #@save
    return torch.cuda.device_count()

num_gpus()
Now we define two convenient functions that allow us to run code even if the requested GPUs do
not exist.
def try_gpu(i=0):  #@save
    """Return gpu(i) if exists, otherwise return cpu()."""
    if num_gpus() >= i + 1:
        return gpu(i)
    return cpu()

def try_all_gpus():  #@save
    """Return all available GPUs, or [cpu(),] if no GPU exists."""
    devices = [gpu(i) for i in range(num_gpus())]
    # Fall back to the CPU so that the docstring's promise holds
    return devices if devices else [cpu()]

try_gpu(), try_gpu(10), try_all_gpus()
(device(type='cuda', index=0),
 device(type='cpu'),
 [device(type='cuda', index=0), device(type='cuda', index=1)])
6.7.2 Tensors and GPUs
By default, tensors are created on the CPU. We can query the device where the tensor is located.
x = torch.tensor([1, 2, 3])
x.device
device(type='cpu')
It is important to note that whenever we want to operate on multiple terms, they need to be on the
same device. For instance, if we sum two tensors, we need to make sure that both arguments live
on the same device—otherwise the framework would not know where to store the result or even
how to decide where to perform the computation.
Storage on the GPU
There are several ways to store a tensor on the GPU. For example, we can specify a storage device
when creating a tensor. Next, we create the tensor variable X on the first GPU. The tensor created
on a GPU only consumes the memory of this GPU. We can use the nvidia-smi command to view
GPU memory usage. In general, we need to make sure that we do not create data that exceeds the
GPU memory limit.
X = torch.ones(2, 3, device=try_gpu())
X
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')
Assuming that you have at least two GPUs, the following code will create a random tensor on the
second GPU.
Y = torch.rand(2, 3, device=try_gpu(1))
Y
tensor([[0.3104, 0.9887, 0.7314],
        [0.1797, 0.9548, 0.5226]], device='cuda:1')
Copying
If we want to compute X + Y, we need to decide where to perform this operation. For instance,
as shown in Fig. 6.7.1, we can transfer X to the second GPU and perform the operation there. Do
not simply add X and Y, since this will result in an exception. The runtime engine would not know
what to do: it cannot find data on the same device and it fails. Since Y lives on the second GPU, we
need to move X there before we can add the two.
Fig. 6.7.1: Copy data to perform an operation on the same device.
Z = X.cuda(1)
print(X)
print(Z)
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:0')
tensor([[1., 1., 1.],
        [1., 1., 1.]], device='cuda:1')
Now that the data is on the same GPU (both Z and Y are), we can add them up.
Y + Z
tensor([[1.3104, 1.9887, 1.7314],
        [1.1797, 1.9548, 1.5226]], device='cuda:1')
Imagine that your variable Z already lives on your second GPU. What happens if we still call
Z.cuda(1)? It will return Z instead of making a copy and allocating new memory.
Z.cuda(1) is Z
True
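The same no-copy behavior can be observed without a second GPU. The following is a small sketch that assumes only a CPU is available: Tensor.to returns the tensor itself when it already has the requested device and dtype, while passing copy=True forces a new allocation. The variable names are chosen for illustration.

```python
import torch

x = torch.ones(2, 3)                    # created on the CPU by default

same = x.to('cpu')                      # already on the CPU: no copy is made
forced = x.to('cpu', copy=True)         # explicitly request a fresh copy

print(same is x, forced is x)
```

Because redundant moves cost nothing, helper code can call .to(device) unconditionally rather than first checking where a tensor lives.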
Side Notes
People use GPUs to do machine learning because they expect them to be fast. But transferring
variables between devices is slow. So we want you to be 100% certain that you want to do some-
thing slow before we let you do it. If the deep learning framework just did the copy automatically
without crashing then you might not realize that you had written some slow code.
Also, transferring data between devices (CPU, GPUs, and other machines) is something that is
much slower than computation. It also makes parallelization a lot more difficult, since we have to
wait for data to be sent (or rather to be received) before we can proceed with more operations. This
is why copy operations should be taken with great care. As a rule of thumb, many small operations
are much worse than one big operation. Moreover, several operations at a time are much better
than many single operations interspersed in the code unless you know what you are doing. This
is the case since such operations can block if one device has to wait for the other before it can do
something else. It is a bit like ordering your coffee in a queue rather than pre-ordering it by phone
and finding out that it is ready when you are.
Last, when we print tensors or convert tensors to the NumPy format, if the data is not in the main
memory, the framework will copy it to the main memory first, resulting in additional transmis-
sion overhead. Even worse, it is now subject to the dreaded global interpreter lock that makes
everything wait for Python to complete.
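The advice about batching transfers can be illustrated with two logging patterns. This is a sketch under stated assumptions: it falls back to the CPU when no GPU is present, the loop length of 10 and the 100 × 100 matrix are arbitrary, and no timing is performed, so it shows only the structure of the two approaches, not their speed.

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
X = torch.randn(100, 100, device=device)

# Pattern 1: one device-to-host transfer per step. Each .item() call
# waits for the device to finish before Python can continue.
norms_eager = []
for _ in range(10):
    norms_eager.append(torch.linalg.norm(X @ X).item())

# Pattern 2: keep the log on the device and transfer it once at the end.
norms_on_device = torch.zeros(10, device=device)
for step in range(10):
    norms_on_device[step] = torch.linalg.norm(X @ X)
norms_batched = norms_on_device.cpu().tolist()
```

Both patterns record the same values; the second simply replaces ten small transfers with a single larger one.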
6.7.3 Neural Networks and GPUs
Similarly, a neural network model can specify devices. The following code puts the model param-
eters on the GPU.
net = nn.Sequential(nn.LazyLinear(1))
net = net.to(device=try_gpu())
We will see many more examples of how to run models on GPUs in the following chapters, simply
since they will become somewhat more computationally intensive.
When the input is a tensor on the GPU, the model will calculate the result on the same GPU.
net(X)
tensor([[-0.2376],
[-0.2376]], device='cuda:0', grad_fn=<AddmmBackward0>)
Letʼs confirm that the model parameters are stored on the same GPU.
net[0].weight.data.device

device(type='cuda', index=0)
Let the trainer support GPUs.
# Methods of the Trainer class (the enclosing class definition is not shown)
def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]

def prepare_batch(self, batch):
    # Move every tensor of the minibatch to the first selected GPU
    if self.gpus:
        batch = [a.to(self.gpus[0]) for a in batch]
    return batch

def prepare_model(self, model):
    model.trainer = self
    model.board.xlim = [0, self.max_epochs]
    # Move the model parameters to the same GPU as the data
    if self.gpus:
        model.to(self.gpus[0])
    self.model = model
In short, as long as all data and parameters are on the same device, we can learn models efficiently.
In the following chapters we will see several such examples.
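To make the "same device" requirement concrete, here is a minimal sketch of a single forward and backward pass in which both the model and the minibatch are moved to one device. It is not the d2l trainer: the linear model, the random data, and the mean squared error loss are placeholders chosen for illustration, and the code falls back to the CPU if no GPU is available.

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(4, 1).to(device)      # parameters now live on `device`
features = torch.randn(8, 4)            # created on the CPU by default
targets = torch.randn(8, 1)

# Move the minibatch to the model's device before the forward pass.
features, targets = features.to(device), targets.to(device)
loss = nn.functional.mse_loss(model(features), targets)
loss.backward()

# Loss and gradients are produced on the device that holds the parameters.
print(loss.device, model.weight.grad.device)
```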
6.7.4 Summary
• We can specify devices for storage and calculation, such as the CPU or GPU. By default, data
is created in the main memory and the CPU is used for calculations.
• The deep learning framework requires all input data for calculation to be on the same device,
be it the CPU or the same GPU.
• You can lose significant performance by moving data without care. A typical mistake is as
follows: computing the loss for every minibatch on the GPU and reporting it back to the user
on the command line (or logging it in a NumPy ndarray) will trigger a global interpreter lock
which stalls all GPUs. It is much better to allocate memory for logging inside the GPU and
only move larger logs.
Exercises
1. Try a larger computation task, such as the multiplication of large matrices, and see the dif-
ference in speed between the CPU and GPU. What about a task with a small amount of cal-
culations?
2. How should we read and write model parameters on the GPU?
3. Measure the time it takes to compute 1000 matrix-matrix multiplications of 100 × 100 matri-
ces and log the Frobenius norm of the output matrix one result at a time vs. keeping a log
on the GPU and transferring only the final result.
4. Measure how much time it takes to perform two matrix-matrix multiplications on two GPUs
at the same time vs. in sequence on one GPU. Hint: you should see almost linear scaling.
7 Convolutional Neural Networks
Image data is represented as a two-dimensional grid of pixels, be it monochromatic or in color.
Accordingly, each pixel corresponds to one or multiple numerical values, respectively. So far we
have ignored this rich structure and treated images as vectors of numbers by flattening them,
irrespective of the spatial relation between pixels. This deeply unsatisfying approach was necessary
in order to feed the resulting one-dimensional vectors through a fully connected MLP.
Because these networks are invariant to the order of the features, we could get similar results
regardless of whether we preserve an order corresponding to the spatial structure of the pixels or
if we permute the columns of our design matrix before fitting the MLPʼs parameters. Preferably,
we would leverage our prior knowledge that nearby pixels are typically related to each other, to
build efficient models for learning from image data.
This chapter introduces convolutional neural networks (CNNs) (LeCun et al., 1995), a powerful family
of neural networks that are designed for precisely this purpose. CNN-based architectures are
now ubiquitous in the field of computer vision. For instance, on the ImageNet collection (Deng et
al., 2009) it was only the use of convolutional neural networks, in short ConvNets, that provided
significant performance improvements (Krizhevsky et al., 2012).
Modern CNNs, as they are called colloquially, owe their design to inspirations from biology, group
theory, and a healthy dose of experimental tinkering. In addition to their sample efficiency in
achieving accurate models, CNNs tend to be computationally efficient, both because they require
fewer parameters than fully connected architectures and because convolutions are easy to paral-
lelize across GPU cores (Chetlur et al., 2014). Consequently, practitioners often apply CNNs when-
ever possible, and increasingly they have emerged as credible competitors even on tasks with a
one-dimensional sequence structure, such as audio (Abdel-Hamid et al., 2014), text (Kalchbren-
ner et al., 2014), and time series analysis (LeCun et al., 1995), where recurrent neural networks
are conventionally used. Some clever adaptations of CNNs have also brought them to bear on
graph-structured data (Kipf and Welling, 2016) and in recommender systems.
First, we will dive more deeply into the motivation for convolutional neural networks. This is fol-
lowed by a walk through the basic operations that comprise the backbone of all convolutional net-
works. These include the convolutional layers themselves, nitty-gritty details including padding
and stride, the pooling layers used to aggregate information across adjacent spatial regions, the
use of multiple channels at each layer, and a careful discussion of the structure of modern ar-
chitectures. We will conclude the chapter with a full working example of LeNet, the first con-
volutional network successfully deployed, long before the rise of modern deep learning. In the
next chapter, we will dive into full implementations of some popular and comparatively recent
CNN architectures whose designs represent most of the techniques commonly used by modern
practitioners.
7.1 From Fully Connected Layers to Convolutions
To this day, the models that we have discussed so far remain appropriate options when we are
dealing with tabular data. By tabular, we mean that the data consist of rows corresponding to
examples and columns corresponding to features. With tabular data, we might anticipate that
the patterns we seek could involve interactions among the features, but we do not assume any
structure a priori concerning how the features interact.
Sometimes, we truly lack knowledge to guide the construction of craftier architectures. In these
cases, an MLP may be the best that we can do. However, for high-dimensional perceptual data,
such structure-less networks can grow unwieldy.
For instance, letʼs return to our running example of distinguishing cats from dogs. Say that we do
a thorough job in data collection, collecting an annotated dataset of one-megapixel photographs.
This means that each input to the network has one million dimensions. Even an aggressive reduction
to one thousand hidden dimensions would require a fully connected layer characterized by
$10^6 \times 10^3 = 10^9$ parameters. Unless we have lots of GPUs, a talent for distributed optimization,
and an extraordinary amount of patience, learning the parameters of this network may turn out
to be infeasible.
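The arithmetic behind this estimate is easy to check. The sketch below counts only the weight matrix (biases are ignored) and, as an added assumption not made in the text, stores each weight as a 32-bit float to get a rough memory figure.

```python
inputs = 1000 * 1000          # one-megapixel image, one value per pixel
hidden = 1000                 # hidden units in the fully connected layer

weights = inputs * hidden     # entries of the weight matrix
bytes_fp32 = 4 * weights      # assumption: 4 bytes per 32-bit float

print(f'{weights:.0e} weights, {bytes_fp32 / 1e9:.0f} GB in float32')
```

Under these assumptions a single layer would already occupy about 4 GB before any gradients or optimizer state are stored.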
A careful reader might object to this argument on the basis that one megapixel resolution may not
be necessary. However, while we might be able to get away with one hundred thousand pixels,
our hidden layer of size 1000 grossly underestimates the number of hidden units that it takes to
learn good representations of images, so a practical system will still require billions of parame-
ters. Moreover, learning a classifier by fitting so many parameters might require collecting an
enormous dataset. And yet today both humans and computers are able to distinguish cats from
dogs quite well, seemingly contradicting these intuitions. That is because images exhibit rich
structure that can be exploited by humans and machine learning models alike. Convolutional
neural networks (CNNs) are one creative way that machine learning has embraced for exploiting
some of the known structure in natural images.
7.1.1 Invariance
Imagine that we want to detect an object in an image. It seems reasonable that whatever method
we use to recognize objects should not be overly concerned with the precise location of the ob-
ject in the image. Ideally, our system should exploit this knowledge. Pigs usually do not fly and
planes usually do not swim. Nonetheless, we should still recognize a pig were one to appear at the
top of the image. We can draw some inspiration here from the childrenʼs game “Whereʼs Waldo”
(depicted in Fig. 7.1.1). The game consists of a number of chaotic scenes bursting with activities.
Waldo shows up somewhere in each, typically lurking in some unlikely location. The readerʼs goal
is to locate him. Despite his characteristic outfit, this can be surprisingly difficult, due to the large
number of distractions. However, what Waldo looks like does not depend upon where Waldo is lo-
cated. We could sweep the image with a Waldo detector that could assign a score to each patch,
indicating the likelihood that the patch contains Waldo. In fact, many object detection and seg-
mentation algorithms are based on this approach (Long et al., 2015). CNNs systematize this idea
of spatial invariance, exploiting it to learn useful representations with fewer parameters.
Fig. 7.1.1: An image of the “Whereʼs Waldo” game.
We can now make these intuitions more concrete by enumerating a few desiderata to guide our
design of a neural network architecture suitable for computer vision:
1. In the earliest layers, our network should respond similarly to the same patch, regardless of
where it appears in the image. This principle is called translation invariance (or translation
equivariance).
2. The earliest layers of the network should focus on local regions, without regard for the con-
tents of the image in distant regions. This is the locality principle. Eventually, these local
representations can be aggregated to make predictions at the whole image level.
3. As we proceed, deeper layers should be able to capture longer-range features of the image,
in a way similar to higher level vision in nature.
Letʼs see how this translates into mathematics.
7.1.2 Constraining the MLP
To start off, we can consider an MLP with two-dimensional images X as inputs and their immediate
hidden representations H similarly represented as matrices (they are two-dimensional tensors in
code), where both X and H have the same shape. Let that sink in. We now conceive of not only the
inputs but also the hidden representations as possessing spatial structure.
Let $[\mathbf{X}]_{i,j}$ and $[\mathbf{H}]_{i,j}$ denote the pixel at location $(i, j)$ in the input image and hidden representation,
respectively. Consequently, to have each of the hidden units receive input from each of the input
pixels, we would switch from using weight matrices (as we did previously in MLPs) to representing
our parameters as fourth-order weight tensors $\mathsf{W}$. Supposing that $\mathbf{U}$ contains biases, we could
formally express the fully connected layer as
$$\begin{aligned} [\mathbf{H}]_{i,j} &= [\mathbf{U}]_{i,j} + \sum_k \sum_l [\mathsf{W}]_{i,j,k,l} [\mathbf{X}]_{k,l} \\ &= [\mathbf{U}]_{i,j} + \sum_a \sum_b [\mathsf{V}]_{i,j,a,b} [\mathbf{X}]_{i+a,j+b}. \end{aligned} \tag{7.1.1}$$
The switch from $\mathsf{W}$ to $\mathsf{V}$ is entirely cosmetic for now since there is a one-to-one correspondence
between coefficients in both fourth-order tensors. We simply re-index the subscripts $(k, l)$ such
that $k = i + a$ and $l = j + b$. In other words, we set $[\mathsf{V}]_{i,j,a,b} = [\mathsf{W}]_{i,j,i+a,j+b}$. The indices $a$ and $b$
run over both positive and negative offsets, covering the entire image. For any given location $(i, j)$
in the hidden representation $[\mathbf{H}]_{i,j}$, we compute its value by summing over pixels in $\mathbf{X}$, centered
around $(i, j)$ and weighted by $[\mathsf{V}]_{i,j,a,b}$. Before we carry on, letʼs consider the total number of
parameters required for a single layer in this parametrization: a $1000 \times 1000$ image (1 megapixel) is
mapped to a $1000 \times 1000$ hidden representation. This requires $10^{12}$ parameters, far beyond what
computers currently can handle.
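The claim that the re-indexing in (7.1.1) is purely cosmetic can be checked numerically. The sketch below uses a tiny 4 × 4 "image" and a random fourth-order tensor W (both arbitrary choices for illustration), and never materializes V: it reads $[\mathsf{V}]_{i,j,a,b}$ directly as $[\mathsf{W}]_{i,j,i+a,j+b}$ while looping over offsets.

```python
import torch

torch.manual_seed(0)
n = 4                                   # tiny image side, for illustration
X = torch.randn(n, n)
W = torch.randn(n, n, n, n)

# First line of (7.1.1): sum over absolute positions (k, l).
H_w = torch.einsum('ijkl,kl->ij', W, X)

# Second line of (7.1.1): sum over offsets (a, b) relative to (i, j).
H_v = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        for a in range(-i, n - i):          # offsets with 0 <= i + a < n
            for b in range(-j, n - j):      # offsets with 0 <= j + b < n
                # [V]_{i,j,a,b} = [W]_{i,j,i+a,j+b}
                H_v[i, j] += W[i, j, i + a, j + b] * X[i + a, j + b]

print(torch.allclose(H_w, H_v, atol=1e-5))
```

Note that the admissible offsets depend on $(i, j)$ here, which is why no parameters have been saved yet at this stage.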
Translation Invariance
Now letʼs invoke the first principle established above: translation invariance (Zhang et al.,
1988). This implies that a shift in the input $\mathbf{X}$ should simply lead to a shift in the hidden
representation $\mathbf{H}$. This is only possible if $\mathsf{V}$ and $\mathbf{U}$ do not actually depend on $(i, j)$. As such, we have
$[\mathsf{V}]_{i,j,a,b} = [\mathbf{V}]_{a,b}$ and $\mathbf{U}$ is a constant, say $u$. As a result, we can simplify the definition for $\mathbf{H}$:
$$[\mathbf{H}]_{i,j} = u + \sum_a \sum_b [\mathbf{V}]_{a,b} [\mathbf{X}]_{i+a,j+b}. \tag{7.1.2}$$
This is a convolution! We are effectively weighting pixels at $(i + a, j + b)$ in the vicinity of location
$(i, j)$ with coefficients $[\mathbf{V}]_{a,b}$ to obtain the value $[\mathbf{H}]_{i,j}$. Note that $[\mathbf{V}]_{a,b}$ needs many fewer
coefficients than $[\mathsf{V}]_{i,j,a,b}$ since it no longer depends on the location within the image. Consequently,
the number of parameters required is no longer $10^{12}$ but a much more reasonable $4 \cdot 10^6$: we still
have the dependency on $a, b \in (-1000, 1000)$. In short, we have made significant progress.
Time-delay neural networks (TDNNs) are some of the first examples to exploit this idea (Waibel et al.,
1989).
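The property that a shift in the input leads to the same shift in the hidden representation can be verified directly once weights are shared across locations. The sketch below makes one simplifying assumption that is not in the text: the image wraps around at its borders (circular padding), so that shifts are exact and no boundary effects appear. The function name shared_weight_layer and the sizes are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def shared_weight_layer(X, V, u=0.0):
    """Evaluate (7.1.2) for a small kernel V, wrapping around at the borders."""
    d = V.shape[0] // 2
    Xp = F.pad(X[None, None], (d, d, d, d), mode='circular')
    # F.conv2d computes sums of the form V[a, b] * X[i + a, j + b]
    return u + F.conv2d(Xp, V[None, None])[0, 0]

X = torch.randn(8, 8)
V = torch.randn(3, 3)
shift = (2, 3)                          # rows and columns to shift by

H = shared_weight_layer(X, V)
H_of_shifted = shared_weight_layer(torch.roll(X, shifts=shift, dims=(0, 1)), V)
shifted_H = torch.roll(H, shifts=shift, dims=(0, 1))

print(torch.allclose(H_of_shifted, shifted_H, atol=1e-5))
```

With zero padding instead of wrap-around, the same identity holds everywhere except near the image border.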
Locality
Now letʼs invoke the second principle: locality. As motivated above, we believe that we should
not have to look very far away from location $(i, j)$ in order to glean relevant information to assess
what is going on at $[\mathbf{H}]_{i,j}$. This means that outside some range $|a| > \Delta$ or $|b| > \Delta$, we should set
$[\mathbf{V}]_{a,b} = 0$. Equivalently, we can rewrite $[\mathbf{H}]_{i,j}$ as
$$[\mathbf{H}]_{i,j} = u + \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} [\mathbf{V}]_{a,b} [\mathbf{X}]_{i+a,j+b}. \tag{7.1.3}$$
This reduces the number of parameters from $4 \cdot 10^6$ to $4\Delta^2$, where $\Delta$ is typically smaller than
10. As such, we reduced the number of parameters by another 4 orders of magnitude. Note that
(7.1.3), in a nutshell, is what is called a convolutional layer. Convolutional neural networks (CNNs)
are a special family of neural networks that contain convolutional layers. In the deep learning
research community, $\mathbf{V}$ is referred to as a convolution kernel, a filter, or simply the layerʼs weights
that are learnable parameters.
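Equation (7.1.3) can be transcribed almost literally into loops and compared against a library routine. The sketch below assumes that pixels outside the image count as zero (the text does not specify boundary handling) and stores $[\mathbf{V}]_{a,b}$ at position V[a + delta, b + delta] so that negative offsets map to valid array indices. The function name local_layer and all sizes are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
delta = 1
X = torch.randn(5, 6)
V = torch.randn(2 * delta + 1, 2 * delta + 1)   # V[a + delta, b + delta] holds [V]_{a,b}
u = 0.5

def local_layer(X, V, u, delta):
    """Direct transcription of (7.1.3); out-of-range pixels are treated as 0."""
    h, w = X.shape
    H = torch.full((h, w), u)
    for i in range(h):
        for j in range(w):
            for a in range(-delta, delta + 1):
                for b in range(-delta, delta + 1):
                    if 0 <= i + a < h and 0 <= j + b < w:
                        H[i, j] += V[a + delta, b + delta] * X[i + a, j + b]
    return H

H_loops = local_layer(X, V, u, delta)
# The same computation via PyTorch, with zero padding of width delta.
H_conv = u + F.conv2d(X[None, None], V[None, None], padding=delta)[0, 0]

print(torch.allclose(H_loops, H_conv, atol=1e-5))
```

The hidden representation keeps the input's shape, as the derivation assumes, because the zero padding supplies the missing neighbors at the border.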
While previously, we might have required billions of parameters to represent just a single layer
in an image-processing network, we now typically need just a few hundred, without altering the
dimensionality of either the inputs or the hidden representations. The price paid for this drastic
reduction in parameters is that our features are now translation invariant and that our layer can
only incorporate local information, when determining the value of each hidden activation. All
learning depends on imposing inductive bias. When that bias agrees with reality, we get
sample-efficient models that generalize well to unseen data. But of course, if those biases do not agree
with reality, e.g., if images turned out not to be translation invariant, our models might struggle
even to fit our training data.
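The three parameter counts quoted in this derivation can be reproduced with a few lines of arithmetic. The value delta = 10 is an assumed upper end of the range mentioned above; the last line also gives the exact count of offsets from $-\Delta$ to $\Delta$, which the text rounds to $4\Delta^2$.

```python
side = 1000                               # image height and width in pixels
delta = 10                                # assumed locality radius

fully_connected = (side * side) ** 2      # [V]_{i,j,a,b}: one weight per pixel pair
shift_invariant = (2 * side) ** 2         # [V]_{a,b} with a, b in (-1000, 1000)
local_estimate = 4 * delta ** 2           # the text's estimate, 4 * Delta^2
local_exact = (2 * delta + 1) ** 2        # offsets -Delta, ..., Delta inclusive

print(fully_connected, shift_invariant, local_estimate, local_exact)
```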
This dramatic reduction in parameters brings us to our last desideratum, namely that deeper lay-
ers should represent larger and more complex aspects of an image. This can be achieved by in-
terleaving nonlinearities and convolutional layers repeatedly.
7.1.3 Convolutions
Letʼs briefly review why (7.1.3) is called a convolution. In mathematics, the convolution between
two functions (Rudin, 1973), say $f, g: \mathbb{R}^d \to \mathbb{R}$ is defined as
$$(f * g)(\mathbf{x}) = \int f(\mathbf{z}) g(\mathbf{x} - \mathbf{z}) d\mathbf{z}. \tag{7.1.4}$$
That is, we measure the overlap between $f$ and $g$ when one function is “flipped” and shifted by $\mathbf{x}$.
Whenever we have discrete objects, the integral turns into a sum. For instance, for vectors from
the set of square summable infinite dimensional vectors with index running over $\mathbb{Z}$ we obtain the
following definition:
$$(f * g)(i) = \sum_a f(a) g(i - a). \tag{7.1.5}$$
For two-dimensional tensors, we have a corresponding sum with indices $(a, b)$ for $f$ and $(i - a, j - b)$
for $g$, respectively:
$$(f * g)(i, j) = \sum_a \sum_b f(a, b) g(i - a, j - b). \tag{7.1.6}$$
This looks similar to (7.1.3), with one major difference. Rather than using $(i + a, j + b)$, we are using
the difference instead. Note, though, that this distinction is mostly cosmetic since we can always
match the notation between (7.1.3) and (7.1.6). Our original definition in (7.1.3) more properly
describes a cross-correlation. We will come back to this in the following section.
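The sense in which the difference is "mostly cosmetic" can be made explicit in one dimension: a cross-correlation with a kernel equals a convolution with the flipped kernel. The sketch below represents finitely supported sequences over the integers as dictionaries from index to value; the helper names conv, xcorr, and flip, as well as the example values, are chosen for illustration.

```python
def conv(f, g):
    """(f * g)(i) = sum_a f(a) g(i - a), as in (7.1.5)."""
    out = {}
    for a, fa in f.items():
        for k, gk in g.items():          # k = i - a, hence i = a + k
            out[a + k] = out.get(a + k, 0.0) + fa * gk
    return out

def xcorr(f, g):
    """Cross-correlation: sum_a f(a) g(i + a), the pattern used in (7.1.3)."""
    out = {}
    for a, fa in f.items():
        for k, gk in g.items():          # k = i + a, hence i = k - a
            out[k - a] = out.get(k - a, 0.0) + fa * gk
    return out

def flip(f):
    """Reflect a sequence about index 0."""
    return {-a: value for a, value in f.items()}

V = {-1: 1.0, 0: 2.0, 1: -1.0}           # kernel with offsets -1, 0, 1
X = {0: 3.0, 1: 0.0, 2: 1.0, 3: 4.0}     # short input signal

print(xcorr(V, X) == conv(flip(V), X))
```

Since the kernel is learned from data anyway, a network trained with one operation simply ends up with the flipped version of the kernel it would have learned with the other.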
7.1.4 Channels
Returning to our Waldo detector, letʼs see what this looks like. The convolutional layer picks win-
dows of a given size and weighs intensities according to the filter V, as demonstrated in Fig. 7.1.2.
We might aim to learn a model so that wherever the “waldoness” is highest, we should find a peak
in the hidden layer representations.

Fig. 7.1.2: Detect Waldo.
There is just one problem with this approach. So far, we blissfully ignored that images consist
of 3 channels: red, green, and blue. In sum, images are not two-dimensional objects but rather
third-order tensors, characterized by a height, width, and channel, e.g., with shape $1024 \times 1024 \times 3$
pixels. While the first two of these axes concern spatial relationships, the third can be regarded
as assigning a multidimensional representation to each pixel location. We thus index $\mathsf{X}$ as $[\mathsf{X}]_{i,j,k}$.
The convolutional filter has to adapt accordingly. Instead of $[\mathbf{V}]_{a,b}$, we now have $[\mathsf{V}]_{a,b,c}$.
Moreover, just as our input consists of a third-order tensor, it turns out to be a good idea to similarly
formulate our hidden representations as third-order tensors H. In other words, rather than just
having a single hidden representation corresponding to each spatial location, we want an entire
vector of hidden representations corresponding to each spatial location. We could think of the
hidden representations as comprising a number of two-dimensional grids stacked on top of each
other. As in the inputs, these are sometimes called channels. They are also sometimes called feature
maps, as each provides a spatialized set of learned features to the subsequent layer. Intuitively,
you might imagine that at lower layers that are closer to inputs, some channels could become
specialized to recognize edges while others could recognize textures.
To support multiple channels in both inputs ($\mathsf{X}$) and hidden representations ($\mathsf{H}$), we can add a
fourth coordinate to $\mathsf{V}$: $[\mathsf{V}]_{a,b,c,d}$. Putting everything together we have:
$$[\mathsf{H}]_{i,j,d} = \sum_{a = -\Delta}^{\Delta} \sum_{b = -\Delta}^{\Delta} \sum_c [\mathsf{V}]_{a,b,c,d} [\mathsf{X}]_{i+a,j+b,c}, \tag{7.1.7}$$
where $d$ indexes the output channels in the hidden representations $\mathsf{H}$. The subsequent convolutional
layer will go on to take a third-order tensor, $\mathsf{H}$, as input. Being more general, (7.1.7) is the
definition of a convolutional layer for multiple channels, where $\mathsf{V}$ is a kernel or filter of the layer.
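As a sanity check, (7.1.7) can be evaluated window by window and compared with PyTorch's built-in routine. This is a sketch under a few assumptions: pixels outside the image are taken to be zero, tensors are laid out as height × width × channel to match the indices in the equation (PyTorch itself expects channels first, hence the permute calls), and all sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
delta, c_in, c_out = 1, 3, 4
k = 2 * delta + 1
X = torch.randn(6, 7, c_in)             # height x width x input channels
V = torch.randn(k, k, c_in, c_out)      # V[a + delta, b + delta, c, d] holds [V]_{a,b,c,d}

# Zero-pad the two spatial axes so that i + a and j + b are always valid.
Xp = F.pad(X.permute(2, 0, 1), (delta, delta, delta, delta)).permute(1, 2, 0)

H = torch.zeros(6, 7, c_out)
for i in range(6):
    for j in range(7):
        window = Xp[i:i + k, j:j + k, :]                   # entries X[i+a, j+b, c]
        H[i, j] = torch.einsum('abc,abcd->d', window, V)   # sum over a, b, c

# Reference: conv2d wants input (N, C_in, H, W) and weight (C_out, C_in, kH, kW).
H_ref = F.conv2d(X.permute(2, 0, 1)[None], V.permute(3, 2, 0, 1), padding=delta)
H_ref = H_ref[0].permute(1, 2, 0)

print(torch.allclose(H, H_ref, atol=1e-4))
```

Each output channel $d$ thus has its own stack of $c_{\text{in}}$ two-dimensional kernels, and their responses are summed over the input channels.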
There are still many operations that we need to address. For instance, we need to figure out how to
combine all the hidden representations to a single output, e.g., whether there is a Waldo anywhere
in the image. We also need to decide how to compute things efficiently, how to combine multi-
ple layers, appropriate activation functions, and how to make reasonable design choices to yield
networks that are effective in practice. We turn to these issues in the remainder of the chapter.

7.1.5 Summary and Discussion
In this section we derived the structure of convolutional neural networks from first principles.
While it is unclear whether this is what led to the invention of CNNs, it is satisfying to know that
they are the right choice when applying reasonable principles to how image processing and com-
puter vision algorithms should operate, at least at lower levels. In particular, translation invari-
ance in images implies that all patches of an image will be treated in the same manner. Locality
means that only a small neighborhood of pixels will be used to compute the corresponding hid-
den representations. Some of the earliest references to CNNs are in the form of the Neocognitron
(Fukushima, 1982).
A second principle that we encountered in our reasoning is how to reduce the number of param-
eters in a function class without limiting its expressive power, at least, whenever certain assump-
tions on the model hold. We saw a dramatic reduction of complexity as a result of this restriction,
turning computationally and statistically infeasible problems into tractable models.
Adding channels allowed us to bring back some of the complexity that was lost due to the restric-
tions imposed on the convolutional kernel by locality and translation invariance. Note that chan-
nels are quite a natural addition beyond red, green, and blue. Many satellite images, in particular
for agriculture and meteorology, have tens to hundreds of channels, generating hyperspectral im-
ages instead. They report data on many different wavelengths. In the following we will see how
to use convolutions effectively to manipulate the dimensionality of the images they operate on,
how to move from location-based to channel-based representations and how to deal with large
numbers of categories efficiently.
Exercises
1. Assume that the size of the convolution kernel is ∆ = 0. Show that in this case the convo-
lution kernel implements an MLP independently for each set of channels. This leads to the
Network in Network architectures (Lin et al., 2013).
2. Audio data is often represented as a one-dimensional sequence.
1. When might you want to impose locality and translation invariance for audio?
2. Derive the convolution operations for audio.
3. Can you treat audio using the same tools as computer vision? Hint: use the spectro-
gram.
3. Why might translation invariance not be a good idea after all? Give an example.
4. Do you think that convolutional layers might also be applicable for text data? Which prob-
lems might you encounter with language?
5. What happens with convolutions when an object is at the boundary of an image?
6. Prove that the convolution is symmetric, i.e., f ∗ g = g ∗ f .
7. Prove the convolution theorem, i.e., $f * g = \mathcal{F}^{-1}[\mathcal{F}[f] \cdot \mathcal{F}[g]]$. Can you use it to accelerate
convolutions?
Discussions

7.2 Convolutions for Images
Now that we understand how convolutional layers work in theory, we are ready to see how they
work in practice. Building on our motivation of convolutional neural networks as efficient archi-
tectures for exploring structure in image data, we stick with images as our running example.
7.2.1 The Cross-Correlation Operation
Recall that strictly speaking, convolutional layers are a misnomer, since the operations they ex-
press are more accurately described as cross-correlations. Based on our descriptions of convolu-
tional layers in Section 7.1, in such a layer, an input tensor and a kernel tensor are combined to
produce an output tensor through a cross-correlation operation.
Letʼs ignore channels for now and see how this works with two-dimensional data and hidden rep-
resentations. In Fig. 7.2.1, the input is a two-dimensional tensor with a height of 3 and width of
3. We mark the shape of the tensor as 3 × 3 or (3, 3). The height and width of the kernel are both
2. The shape of the kernel window (or convolution window) is given by the height and width of the
kernel (here it is 2 × 2).
Fig. 7.2.1: Two-dimensional cross-correlation operation. The shaded portions are the first output
element as well as the input and kernel tensor elements used for the output computation: 0 × 0 +
1 × 1 + 3 × 2 + 4 × 3 = 19.
In the two-dimensional cross-correlation operation, we begin with the convolution window posi-
tioned at the upper-left corner of the input tensor and slide it across the input tensor, both from
left to right and top to bottom. When the convolution window slides to a certain position, the
input subtensor contained in that window and the kernel tensor are multiplied elementwise and
the resulting tensor is summed up yielding a single scalar value. This result gives the value of the
output tensor at the corresponding location. Here, the output tensor has a height of 2 and width
of 2 and the four elements are derived from the two-dimensional cross-correlation operation:
0 × 0 + 1 × 1 + 3 × 2 + 4 × 3 = 19,
1 × 0 + 2 × 1 + 4 × 2 + 5 × 3 = 25,
3 × 0 + 4 × 1 + 6 × 2 + 7 × 3 = 37,
4 × 0 + 5 × 1 + 7 × 2 + 8 × 3 = 43. (7.2.1)
Note that along each axis, the output size is slightly smaller than the input size. Because the kernel
has width and height greater than one, we can only properly compute the cross-correlation for
locations where the kernel fits wholly within the image. The output size is therefore given by the
input size nh × nw minus the size of the convolution kernel kh × kw via
(nh − kh + 1) × (nw − kw + 1). (7.2.2)

This is the case since we need enough space to “shift” the convolution kernel across the image.
Later we will see how to keep the size unchanged by padding the image with zeros around its
boundary so that there is enough space to shift the kernel. Next, we implement this process in
the corr2d function, which accepts an input tensor X and a kernel tensor K and returns an output
tensor Y.
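import torch
from torch import nn

def corr2d(X, K):  #@save
    """Compute 2D cross-correlation."""
    h, w = K.shape
    # The output shrinks by (h - 1, w - 1) relative to the input
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # Elementwise product of the current window and the kernel, summed
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y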
We can construct the input tensor X and the kernel tensor K from Fig. 7.2.1 to validate the output
of the above implementation of the two-dimensional cross-correlation operation.
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)
tensor([[19., 25.],
        [37., 43.]])
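As a cross-check that is not part of the original text, the result can be compared with PyTorch's built-in `torch.nn.functional.conv2d`, which, despite its name, computes a cross-correlation. The sketch below restates `corr2d` so that it runs on its own; the names `Y_manual` and `Y_builtin` are our own.

```python
import torch
import torch.nn.functional as F

def corr2d(X, K):
    """Compute 2D cross-correlation (same logic as the function above)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

Y_manual = corr2d(X, K)
# F.conv2d expects (batch, channel, height, width) inputs and
# (out_channel, in_channel, height, width) kernels, so add two leading axes
Y_builtin = F.conv2d(X[None, None], K[None, None])[0, 0]
print(torch.allclose(Y_manual, Y_builtin))
```

The nested loop is convenient for understanding the operation, while the built-in routine is what one would normally use for speed.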
7.2.2 Convolutional Layers
A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce an
output. The two parameters of a convolutional layer are the kernel and the scalar bias. When
training models based on convolutional layers, we typically initialize the kernels randomly, just
as we would with a fully connected layer.
We are now ready to implement a two-dimensional convolutional layer based on the corr2d func-
tion defined above. In the __init__ constructor method, we declare weight and bias as the two
model parameters. The forward propagation function calls the corr2d function and adds the bias.
class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias
In an h × w convolution or an h × w convolution kernel, the height and width of the convolution
kernel are h and w, respectively. We also refer to a convolutional layer with an h × w convolution
kernel simply as an h × w convolutional layer.
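To make the terminology concrete, here is a small usage sketch of a 2 × 3 convolutional layer. It restates `corr2d` and `Conv2D` so that it is self-contained; the input size of 5 × 6 and the name `layer` are arbitrary choices made for illustration.

```python
import torch
from torch import nn

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

layer = Conv2D(kernel_size=(2, 3))  # a 2 x 3 convolutional layer
X = torch.rand(5, 6)
Y = layer(X)
# Output shape follows (n_h - k_h + 1) x (n_w - k_w + 1)
print(Y.shape)
# The two parameters: the kernel and the scalar bias
print([tuple(p.shape) for p in layer.parameters()])
```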
7.2.3 Object Edge Detection in Images
Letʼs take a moment to parse a simple application of a convolutional layer: detecting the edge of
an object in an image by finding the location of the pixel change. First, we construct an “image”
of 6 × 8 pixels. The middle four columns are black (0) and the rest are white (1).
X = torch.ones((6, 8))
X[:, 2:6] = 0
X
tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.],
        [1., 1., 0., 0., 0., 0., 1., 1.]])
Next, we construct a kernel K with a height of 1 and a width of 2. When we perform the cross-
correlation operation with the input, if the horizontally adjacent elements are the same, the output
is 0. Otherwise, the output is non-zero. Note that this kernel is a special case of a finite difference
operator. At location (i, j) it computes $x_{i,j} - x_{i,j+1}$, i.e., it computes the difference between the
values of horizontally adjacent pixels. This is a discrete approximation of the first derivative in the
horizontal direction. After all, for a function f(i, j) its derivative satisfies
$-\partial_j f(i,j) = \lim_{\epsilon \to 0} \frac{f(i,j) - f(i,j+\epsilon)}{\epsilon}$.
Letʼs see how this works in practice.
K = torch.tensor([[1.0, -1.0]])
We are ready to perform the cross-correlation operation with arguments X (our input) and K (our
kernel). As you can see, we detect 1 for the edge from white to black and -1 for the edge from black
to white. All other outputs take value 0.
Y = corr2d(X, K)
Y
tensor([[ 0., 1., 0., 0., 0., -1., 0.],
        [ 0., 1., 0., 0., 0., -1., 0.],
        [ 0., 1., 0., 0., 0., -1., 0.],
        [ 0., 1., 0., 0., 0., -1., 0.],
        [ 0., 1., 0., 0., 0., -1., 0.],
        [ 0., 1., 0., 0., 0., -1., 0.]])
We can now apply the kernel to the transposed image. As expected, the output vanishes: the kernel
K only detects vertical edges.
corr2d(X.t(), K)
tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]])
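One way to recover the edges in the transposed image, sketched here as our own illustration rather than as part of the original text, is to transpose the kernel as well. The 2 × 1 kernel then takes differences between vertically adjacent pixels and so responds to horizontal edges. The block restates `corr2d` so that it runs on its own; `Y_t` is a name we introduce.

```python
import torch

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1.0, -1.0]])

# Transposed image (8 x 6) with transposed kernel (2 x 1)
Y_t = corr2d(X.t(), K.t())
print(Y_t.shape)
# The result is the transpose of the vertical-edge map computed earlier
print(torch.equal(Y_t, corr2d(X, K).t()))
```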
7.2.4 Learning a Kernel
Designing an edge detector by finite differences [1, -1] is neat if we know this is precisely what
we are looking for. However, as we look at larger kernels, and consider successive layers of con-
volutions, it might be impossible to specify precisely what each filter should be doing manually.
Now letʼs see whether we can learn the kernel that generated Y from X by looking at the input–
output pairs only. We first construct a convolutional layer and initialize its kernel as a random
tensor. Next, in each iteration, we will use the squared error to compare Y with the output of
the convolutional layer. We can then calculate the gradient to update the kernel. For the sake of
simplicity, in the following we use the built-in class for two-dimensional convolutional layers and
ignore the bias.
# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.LazyConv2d(1, kernel_size=(1, 2), bias=False)

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  # Learning rate

for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {l.sum():.3f}')
epoch 2, loss 4.940
epoch 4, loss 0.936
epoch 6, loss 0.201
epoch 8, loss 0.052
epoch 10, loss 0.016
Note that the error has dropped to a small value after 10 iterations. Now we will take a look at the
kernel tensor we learned.
conv2d.weight.data.reshape((1, 2))
tensor([[ 0.9774, -1.0005]])
Indeed, the learned kernel tensor is remarkably close to the kernel tensor K we defined earlier.
7.2.5 Cross-Correlation and Convolution
Recall our observation from Section 7.1 of the correspondence between the cross-correlation and
convolution operations. Here letʼs continue to consider two-dimensional convolutional layers.
What if such layers perform strict convolution operations as defined in (7.1.6) instead of cross-
correlations? In order to obtain the output of the strict convolution operation, we only need to flip
the two-dimensional kernel tensor both horizontally and vertically, and then perform the cross-
correlation operation with the input tensor.
It is noteworthy that since kernels are learned from data in deep learning, the outputs of con-
volutional layers remain unaffected no matter whether such layers perform strict convolution
operations or cross-correlation operations.
To illustrate this, suppose that a convolutional layer performs cross-correlation and learns the ker-
nel in Fig. 7.2.1, which is denoted as the matrix K here. Assuming that other conditions remain
unchanged, when this layer performs strict convolution instead, the learned kernel K′ will be the
same as K after K′ is flipped both horizontally and vertically. That is to say, when the convolu-
tional layer performs strict convolution for the input in Fig. 7.2.1 and K′ , the same output in Fig.
7.2.1 (cross-correlation of the input and K) will be obtained.
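The relationship between the two operations can be checked numerically. The sketch below is our own illustration: `conv2d_strict` is a hypothetical helper that evaluates a strict convolution only at positions where the kernel fits fully inside the input, and `corr2d` is restated so that the block is self-contained.

```python
import torch

def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def conv2d_strict(X, K):
    """Strict 2D convolution where the kernel fits fully (illustrative helper)."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            for a in range(h):
                for b in range(w):
                    # The kernel index runs against the input index
                    Y[i, j] += K[a, b] * X[i + h - 1 - a, j + w - 1 - b]
    return Y

X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
K_flipped = torch.flip(K, dims=(0, 1))  # flip vertically and horizontally

# Strict convolution with K equals cross-correlation with the flipped kernel
print(torch.allclose(conv2d_strict(X, K), corr2d(X, K_flipped)))
# Strict convolution with the flipped kernel reproduces the output of
# cross-correlating the input with K
print(torch.allclose(conv2d_strict(X, K_flipped), corr2d(X, K)))
```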
In keeping with standard terminology in the deep learning literature, we will continue to refer to the
cross-correlation operation as a convolution even though, strictly speaking, it is slightly different.
Besides, we use the term element to refer to an entry (or component) of any tensor representing a
layer representation or a convolution kernel.
7.2.6 Feature Map and Receptive Field
As described in Section 7.1.4, the convolutional layer output in Fig. 7.2.1 is sometimes called a fea-
ture map, as it can be regarded as the learned representations (features) in the spatial dimensions
(e.g., width and height) to the subsequent layer. In CNNs, for any element x of some layer, its re-
ceptive field refers to all the elements (from all the previous layers) that may affect the calculation
of x during the forward propagation. Note that the receptive field may be larger than the actual
size of the input.
Letʼs continue to use Fig. 7.2.1 to explain the receptive field. Given the 2 × 2 convolution kernel,
the receptive field of the shaded output element (of value 19) is the four elements in the shaded
portion of the input. Now letʼs denote the 2 × 2 output as Y and consider a deeper CNN with an
additional 2 × 2 convolutional layer that takes Y as its input, outputting a single element z. In this
case, the receptive field of z on Y includes all the four elements of Y, while the receptive field on
the input includes all the nine input elements. Thus, when any element in a feature map needs a
larger receptive field to detect input features over a broader area, we can build a deeper network.
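The receptive fields in this example can be probed numerically. The sketch below is our own illustration, not the book's method: it uses automatic differentiation and assumes all-ones kernels, so that an input element has a nonzero gradient exactly when it can influence the chosen output. The names `grad_y00` and `grad_z` are ours.

```python
import torch
import torch.nn.functional as F

# 3 x 3 input in (batch, channel, height, width) format
X = torch.arange(9.0).reshape(1, 1, 3, 3).requires_grad_()
K = torch.ones(1, 1, 2, 2)  # assumed kernel without zero entries

Y = F.conv2d(X, K)  # first 2 x 2 layer, output shape (1, 1, 2, 2)
z = F.conv2d(Y, K)  # second 2 x 2 layer, a single element

# Gradient of one output element with respect to every input element
grad_y00 = torch.autograd.grad(Y[0, 0, 0, 0], X, retain_graph=True)[0]
grad_z = torch.autograd.grad(z.sum(), X)[0]

print(int((grad_y00 != 0).sum()))  # input elements that influence Y[0, 0]
print(int((grad_z != 0).sum()))    # input elements that influence z
```

The counts match the four and nine input elements discussed above. With kernels that contain zeros, or with nonlinearities that saturate, a gradient-based probe could underestimate the receptive field.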

Receptive fields derive their name from neurophysiology. In a series of experiments (Hubel and
Wiesel, 1959, Hubel and Wiesel, 1962, Hubel and Wiesel, 1968) on a range of animals and different
stimuli, Hubel and Wiesel explored the response of what is called the visual cortex to said stimuli.
By and large they found that lower levels respond to edges and related shapes. Later on, Field
(1987) illustrated this effect on natural images with what can only be called convolutional kernels.
We reprint a key figure in Fig. 7.2.2 to illustrate the striking similarities.
Fig. 7.2.2: Figure and caption taken from (Field, 1987): An example of coding with six different
channels. (Left) Examples of the six types of sensor associated with each channel. (Right) Convo-
lution of the image in (Middle) with the six sensors shown in (Left). The response of the individual
sensors is determined by sampling these filtered images at a distance proportional to the size of
the sensor (shown with dots). This diagram shows the response of only the even symmetric sen-
sors.
As it turns out, this relation even holds for the features computed by deeper layers of networks
trained on image classification tasks, as demonstrated, e.g., by Kuzovkin et al. (2018). Suffice it to
say, convolutions have proven to be an incredibly powerful tool for computer vision, both in biol-
ogy and in code. As such, it is not surprising (in hindsight) that they heralded the recent success
in deep learning.

7.2.7 Summary
The core computation required for a convolutional layer is a cross-correlation operation. We saw
that a simple nested for-loop is all that is required to compute its value. If we have multiple input
and multiple output channels, we are performing a matrix-matrix operation between channels. As
can be seen, the computation is straightforward and, most importantly, highly local. This affords
significant hardware optimization and many recent results in computer vision are only possible
due to that. After all, it means that chip designers can invest in fast computation rather than
memory, when it comes to optimizing for convolutions. While this may not lead to optimal designs
for other applications, it opens the door to ubiquitous and affordable computer vision.
In terms of convolutions themselves, they can be used for many purposes such as to detect edges
and lines, to blur images, or to sharpen them. Most importantly, it is not necessary that the statis-
tician (or engineer) invents suitable filters. Instead, we can simply learn them from data. This
replaces feature engineering heuristics with evidence-based statistics. Lastly, and quite delightfully,
these filters are not just advantageous for building deep networks but they also correspond
to receptive fields and feature maps in the brain. This gives us confidence that we are on the right
track.
Exercises
1. Construct an image X with diagonal edges.

1. What happens if you apply the kernel K in this section to it?
2. What happens if you transpose X?
3. What happens if you transpose K?
2. Design some kernels manually.
1. Given a directional vector v = (v1 , v2 ), derive an edge-detection kernel that detects
edges orthogonal to v, i.e., edges in the direction (v2 , −v1 ).
2. Derive a finite difference operator for the second derivative. What is the minimum size
of the convolutional kernel associated with it? Which structures in images respond most
strongly to it?
3. How would you design a blur kernel? Why might you want to use such a kernel?
4. What is the minimum size of a kernel to obtain a derivative of order d?
3. When you try to automatically find the gradient for the Conv2D class we created, what kind
of error message do you see?
4. How do you represent a cross-correlation operation as a matrix multiplication by changing
the input and kernel tensors?
Discussions

7.3 Padding and Stride
Recall the example of a convolution in Fig. 7.2.1. The input had both a height and width of 3 and
the convolution kernel had both a height and width of 2, yielding an output representation with
dimension 2 × 2. Assuming that the input shape is nh × nw and the convolution kernel shape is
kh × kw , the output shape will be (nh − kh + 1) × (nw − kw + 1): we can only shift the convolution
kernel so far until it runs out of pixels to apply the convolution to.
In the following we will explore a number of techniques, including padding and strided convolu-
tions, that offer more control over the size of the output. As motivation, note that since kernels
generally have width and height greater than 1, after applying many successive convolutions, we
tend to wind up with outputs that are considerably smaller than our input. If we start with a
240 × 240 pixel image, 10 layers of 5 × 5 convolutions reduce the image to 200 × 200 pixels, slicing
off 30% of the image and with it obliterating any interesting information on the boundaries of the
original image. Padding is the most popular tool for handling this issue. In other cases, we may
want to reduce the dimensionality drastically, e.g., if we find the original input resolution to be
unwieldy. Strided convolutions are a popular technique that can help in these instances.
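The shrinkage arithmetic can be reproduced in a few lines. This is a plain-Python sketch of the calculation in the paragraph above; the variable names are our own.

```python
size, kernel, layers = 240, 5, 10
for _ in range(layers):
    # Output size of an unpadded, stride-1 convolution along one axis
    size = size - kernel + 1
# Fraction of the original pixel area that has been removed
lost = 1 - (size / 240) ** 2
print(size, f"{lost:.1%}")
```

Each 5 × 5 layer removes 4 pixels per axis, so ten layers remove 40, and the remaining 200 × 200 area is roughly 70% of the original.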
7.3.1 Padding
As described above, one tricky issue when applying convolutional layers is that we tend to lose
pixels on the perimeter of our image. Consider Fig. 7.3.1 that depicts the pixel utilization as a
function of the convolution kernel size and the position within the image. The pixels in the corners
are hardly used at all.
Fig. 7.3.1: Pixel utilization for convolutions of size 1 × 1, 2 × 2, and 3 × 3 respectively.
Since we typically use small kernels, for any given convolution, we might only lose a few pixels, but
this can add up as we apply many successive convolutional layers. One straightforward solution to
this problem is to add extra pixels of filler around the boundary of our input image, thus increasing
the effective size of the image. Typically, we set the values of the extra pixels to zero. In Fig. 7.3.2,
we pad a 3 × 3 input, increasing its size to 5 × 5. The corresponding output then increases to a 4 × 4
matrix. The shaded portions are the first output element as well as the input and kernel tensor
elements used for the output computation: 0 × 0 + 0 × 1 + 0 × 2 + 0 × 3 = 0.
Fig. 7.3.2: Two-dimensional cross-correlation with padding.
In general, if we add a total of ph rows of padding (roughly half on top and half on bottom) and
a total of pw columns of padding (roughly half on the left and half on the right), the output shape
will be
(nh − kh + ph + 1) × (nw − kw + pw + 1). (7.3.1)
This means that the height and width of the output will increase by ph and pw , respectively.
In many cases, we will want to set ph = kh − 1 and pw = kw − 1 to give the input and output the
same height and width. This will make it easier to predict the output shape of each layer when
constructing the network. Assuming that kh is odd here, we will pad ph /2 rows on both sides of
the height. If kh is even, one possibility is to pad ⌈ph /2⌉ rows on the top of the input and ⌊ph /2⌋
rows on the bottom. We will pad both sides of the width in the same way.
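For even kernel sizes the padding cannot be split evenly. The sketch below illustrates the ceiling and floor split described above using `torch.nn.functional.pad`; the helper `same_padding` and the 4 × 2 kernel are hypothetical choices made for illustration.

```python
import math
import torch
import torch.nn.functional as F

def same_padding(k):
    """Split the total padding k - 1 into (before, after) parts."""
    p = k - 1
    return math.ceil(p / 2), math.floor(p / 2)

k_h, k_w = 4, 2                  # even kernel height and width
top, bottom = same_padding(k_h)  # rows added above and below
left, right = same_padding(k_w)  # columns added left and right

X = torch.rand(1, 1, 8, 8)
# F.pad lists the padding of the last dimension first:
# (left, right, top, bottom)
X_padded = F.pad(X, (left, right, top, bottom))
Y = F.conv2d(X_padded, torch.rand(1, 1, k_h, k_w))
print(X_padded.shape, Y.shape)
```

The output keeps the 8 × 8 height and width of the input even though the padding is asymmetric.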
CNNs commonly use convolution kernels with odd height and width values, such as 1, 3, 5, or 7.
Choosing odd kernel sizes has the benefit that we can preserve the dimensionality while padding
with the same number of rows on top and bottom, and the same number of columns on left and
right.
Moreover, this practice of using odd kernels and padding to precisely preserve dimensionality
offers a clerical benefit. For any two-dimensional tensor X, when the kernelʼs size is odd and the
number of padding rows and columns on all sides are the same, producing an output with the same
height and width as the input, we know that the output Y[i, j] is calculated by cross-correlation
of the input and convolution kernel with the window centered on X[i, j].
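A quick way to see this centering property, offered here as our own illustrative sketch, is to use a 3 × 3 kernel whose only nonzero entry is the center element. With one row and column of zero padding on every side, each output element should then equal the input element at the same position.

```python
import torch
import torch.nn.functional as F

X = torch.rand(1, 1, 8, 8)
# 3 x 3 kernel that picks out the center of each window
K = torch.zeros(1, 1, 3, 3)
K[0, 0, 1, 1] = 1.0

Y = F.conv2d(X, K, padding=1)  # one row/column of zeros on every side
print(Y.shape)
print(torch.allclose(Y, X))
```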
In the following example, we create a two-dimensional convolutional layer with a height and width
of 3 and apply 1 pixel of padding on all sides. Given an input with a height and width of 8, we find
that the height and width of the output is also 8.
import torch
from torch import nn

# We define a helper function to calculate convolutions. It initializes
# the convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output.
def comp_conv2d(conv2d, X):
    # (1, 1) indicates that batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Strip the first two dimensions: examples and channels
    return Y.reshape(Y.shape[2:])

# 1 row and column is padded on either side, so a total of 2 rows or columns
# are added
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape
torch.Size([8, 8])
When the height and width of the convolution kernel are different, we can make the output and
input have the same height and width by setting different padding numbers for height and width.
# We use a convolution kernel with height 5 and width 3. The padding on
# either side of the height and width are 2 and 1, respectively.
conv2d = nn.LazyConv2d(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape
torch.Size([8, 8])
7.3.2 Stride
When computing the cross-correlation, we start with the convolution window at the upper-left
corner of the input tensor, and then slide it over all locations both down and to the right. In the
previous examples, we defaulted to sliding one element at a time. However, sometimes, either
for computational efficiency or because we wish to downsample, we move our window more than
one element at a time, skipping the intermediate locations. This is particularly useful if the con-
volution kernel is large since it captures a large area of the underlying image.
We refer to the number of rows and columns traversed per slide as stride. So far, we have used
strides of 1, both for height and width. Sometimes, we may want to use a larger stride. Fig. 7.3.3
shows a two-dimensional cross-correlation operation with a stride of 3 vertically and 2 horizon-
tally. The shaded portions are the output elements as well as the input and kernel tensor elements
used for the output computation: 0 × 0 + 0 × 1 + 1 × 2 + 2 × 3 = 8, 0 × 0 + 6 × 1 + 0 × 2 + 0 × 3 = 6.
We can see that when the second element of the first column is generated, the convolution win-
dow slides down three rows. The convolution window slides two columns to the right when the
second element of the first row is generated. When the convolution window continues to slide
two columns to the right on the input, there is no output because the input element cannot fill the
window (unless we add another column of padding).
Fig. 7.3.3: Cross-correlation with strides of 3 and 2 for height and width, respectively.
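The computation in Fig. 7.3.3 can be sketched with the built-in `torch.nn.functional.conv2d`. We assume, as the arithmetic quoted above suggests, that the input is the 3 × 3 tensor of Fig. 7.2.1 padded with one row and column of zeros on each side, and that the kernel is the same 2 × 2 kernel.

```python
import torch
import torch.nn.functional as F

# (batch, channel, height, width) input and
# (out_channel, in_channel, height, width) kernel
X = torch.arange(9.0).reshape(1, 1, 3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]]).reshape(1, 1, 2, 2)

# stride=(3, 2): move 3 rows down and 2 columns to the right per step
Y = F.conv2d(X, K, padding=1, stride=(3, 2))
print(Y[0, 0])
```

The entries in the first row, second column and in the second row, first column come out as 8 and 6, matching the two computations spelled out above.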
In general, when the stride for the height is sh and the stride for the width is sw , the output shape
is
⌊(nh − kh + ph + sh )/sh ⌋ × ⌊(nw − kw + pw + sw )/sw ⌋. (7.3.2)
If we set ph = kh − 1 and pw = kw − 1, then the output shape can be simplified to
⌊(nh + sh − 1)/sh⌋ × ⌊(nw + sw − 1)/sw⌋. Going a step further, if the input height and width are divisible by
the strides on the height and width, then the output shape will be (nh/sh) × (nw/sw).
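The shape formula (7.3.2) is easy to turn into a helper and compare against what PyTorch produces. In the sketch below, `conv_output_size` is a hypothetical helper and the configurations are arbitrary examples. Note that p in the formula is the total padding along an axis, whereas the `padding` argument of `nn.Conv2d` is the amount added on each side.

```python
import torch
from torch import nn

def conv_output_size(n, k, p, s):
    """Output size along one axis per (7.3.2); p is the total padding."""
    return (n - k + p + s) // s

n = 8
results = []
# (kernel size, padding per side, stride), chosen for illustration
for k, pad, s in [(3, 1, 1), (3, 1, 2), (5, 2, 3), (2, 0, 2)]:
    conv = nn.Conv2d(1, 1, kernel_size=k, padding=pad, stride=s)
    out = conv(torch.rand(1, 1, n, n))
    # PyTorch pads `pad` on each side, so the total padding is 2 * pad
    predicted = conv_output_size(n, k, 2 * pad, s)
    results.append((predicted, out.shape[-1]))
    print(k, pad, s, predicted, out.shape[-1])
```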
Below, we set the strides on both the height and width to 2, thus halving the input height and width.
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1, stride=2)
comp_conv2d(conv2d, X).shape
torch.Size([4, 4])
Letʼs look at a slightly more complicated example.
conv2d = nn.LazyConv2d(1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape
torch.Size([2, 2])
Padding can increase the height and width of the output. This is often used to give the output the
same height and width as the input to avoid undesirable shrinkage of the output. Moreover, it
ensures that all pixels are used equally frequently. Typically we pick symmetric padding on both
sides of the input height and width. In this case we refer to (ph , pw ) padding. Most commonly we
set ph = pw , in which case we simply state that we choose padding p.
A similar convention applies to strides. When the vertical stride sh and horizontal stride sw match,
we simply talk about stride s. The stride can reduce the resolution of the output, for example
reducing the height and width of the output to only 1/n of the height and width of the input for
n > 1. By default, the padding is 0 and the stride is 1.
So far all padding that we discussed simply extended images with zeros. This has a significant
computational benefit since it is trivial to accomplish. Moreover, operators can be engineered to take
advantage of this padding implicitly without the need to allocate additional memory. At the same
time, it allows CNNs to encode implicit position information within an image, simply by learning
where the “whitespace” is. There are many alternatives to zero-padding. Alsallakh et al. (2020)
provide an extensive overview of alternatives (albeit without a clear case to use nonzero paddings
unless artifacts occur).
Exercises
1. Given the last code example in this section with kernel size (3, 5), padding (0, 1), and stride
(3, 4), calculate the output shape to check if it is consistent with the experimental result.
2. For audio signals, what does a stride of 2 correspond to?
3. Implement mirror padding, i.e., padding where the border values are simply mirrored to
extend tensors.
4. What are the computational benefits of a stride larger than 1?
5. What might be statistical benefits of a stride larger than 1?
6. How would you implement a stride of 1/2? What does it correspond to? When would this be
useful?
Discussions
7.4 Multiple Input and Multiple Output Channels
While we described the multiple channels that comprise each image (e.g., color images have the
standard RGB channels to indicate the amount of red, green and blue) and convolutional layers
for multiple channels in Section 7.1.4, until now, we simplified all of our numerical examples by
working with just a single input and a single output channel. This allowed us to think of our inputs,
convolution kernels, and outputs each as two-dimensional tensors.
When we add channels into the mix, our inputs and hidden representations both become three-
dimensional tensors. For example, each RGB input image has shape 3×h×w. We refer to this axis,
with a size of 3, as the channel dimension. The notion of channels is as old as CNNs themselves.
For instance LeNet5 (LeCun et al., 1995) uses them. In this section, we will take a deeper look at
convolution kernels with multiple input and multiple output channels.
7.4.1 Multiple Input Channels
When the input data contains multiple channels, we need to construct a convolution kernel with
the same number of input channels as the input data, so that it can perform cross-correlation with
the input data. Assuming that the number of channels for the input data is ci , the number of input
channels of the convolution kernel also needs to be ci . If our convolution kernelʼs window shape
is kh × kw , then when ci = 1, we can think of our convolution kernel as just a two-dimensional
tensor of shape kh × kw .
However, when ci > 1, we need a kernel that contains a tensor of shape kh × kw for every input
channel. Concatenating these ci tensors together yields a convolution kernel of shape ci × kh × kw .
Since the input and convolution kernel each have ci channels, we can perform a cross-correlation
operation on the two-dimensional tensor of the input and the two-dimensional tensor of the con-
volution kernel for each channel, adding the ci results together (summing over the channels) to
yield a two-dimensional tensor. This is the result of a two-dimensional cross-correlation between
a multi-channel input and a multi-input-channel convolution kernel.
Fig. 7.4.1 provides an example of a two-dimensional cross-correlation with two input channels.
The shaded portions are the first output element as well as the input and kernel tensor elements
used for the output computation: (1 × 1 + 2 × 2 + 4 × 3 + 5 × 4) + (0 × 0 + 1 × 1 + 3 × 2 + 4 × 3) = 56.
Fig. 7.4.1: Cross-correlation computation with 2 input channels.
To make sure we really understand what is going on here, we can implement cross-correlation
operations with multiple input channels ourselves. Notice that all we are doing is performing a
cross-correlation operation per channel and then adding up the results.
import torch
from d2l import torch as d2l

def corr2d_multi_in(X, K):
    # Iterate through the 0th dimension (channel) of K first, then add them up
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))
We can construct the input tensor X and the kernel tensor K corresponding to the values in Fig.
7.4.1 to validate the output of the cross-correlation operation.
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
                  [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])
corr2d_multi_in(X, K)
tensor([[ 56., 72.],
        [104., 120.]])

7.4.2 Multiple Output Channels
Regardless of the number of input channels, so far we always ended up with one output channel.
However, as we discussed in Section 7.1.4, it turns out to be essential to have multiple channels
at each layer. In the most popular neural network architectures, we actually increase the channel
dimension as we go deeper in the neural network, typically downsampling to trade off spatial
resolution for greater channel depth. Intuitively, you could think of each channel as responding to
a different set of features. The reality is a bit more complicated than this. A naive interpretation
would suggest that representations are learned independently per pixel or per channel. Instead,
channels are optimized to be jointly useful. This means that rather than mapping a single channel
to an edge detector, it may simply mean that some direction in channel space corresponds to
detecting edges.
Denote by ci and co the number of input and output channels, respectively, and let kh and kw
be the height and width of the kernel. To get an output with multiple channels, we can create a
kernel tensor of shape ci × kh × kw for every output channel. We concatenate them on the output
channel dimension, so that the shape of the convolution kernel is co × ci × kh × kw . In cross-
correlation operations, the result on each output channel is calculated from the convolution kernel
corresponding to that output channel and takes input from all channels in the input tensor.
We implement a cross-correlation function to calculate the output of multiple channels as shown
below.
def corr2d_multi_in_out(X, K):
    # Iterate through the 0th dimension of `K`, and each time, perform
    # cross-correlation operations with input `X`. All of the results are
    # stacked together
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)
We construct a trivial convolution kernel with 3 output channels by concatenating the kernel ten-
sor for K with K+1 and K+2.
K = torch.stack((K, K + 1, K + 2), 0)
K.shape
torch.Size([3, 2, 2, 2])
Below, we perform cross-correlation operations on the input tensor X with the kernel tensor K.
Now the output contains 3 channels. The result of the first channel is consistent with the result of
the previous input tensor X and the multi-input channel, single-output channel kernel.
corr2d_multi_in_out(X, K)
tensor([[[ 56.,  72.],
         [104., 120.]],

        [[ 76., 100.],
         [148., 172.]],

        [[ 96., 128.],
         [192., 224.]]])
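As a cross-check that does not depend on any framework, the following pure-Python sketch recomputes the output above with nested lists. It assumes that X and the unstacked kernel hold the values defined earlier in this section; they are repeated here as literals. The `_py` helper names and `shift` are introduced only for this illustration.

```python
# Pure-Python sketch of multi-input, multi-output cross-correlation.
# Assumption: X and K0 repeat the values used earlier in this section.

def corr2d_py(X, K):
    """Cross-correlate one 2D input with one 2D kernel (nested lists)."""
    k_h, k_w = len(K), len(K[0])
    out_h, out_w = len(X) - k_h + 1, len(X[0]) - k_w + 1
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(k_h) for b in range(k_w))
             for j in range(out_w)]
            for i in range(out_h)]

def corr2d_multi_in_py(X, K):
    """Sum the per-channel cross-correlations over the input channels."""
    per_channel = [corr2d_py(x, k) for x, k in zip(X, K)]
    out_h, out_w = len(per_channel[0]), len(per_channel[0][0])
    return [[sum(c[i][j] for c in per_channel) for j in range(out_w)]
            for i in range(out_h)]

def corr2d_multi_in_out_py(X, K):
    """Apply one c_i x k_h x k_w kernel per output channel."""
    return [corr2d_multi_in_py(X, k) for k in K]

def shift(K, delta):
    """Add a constant to every kernel entry, mimicking K + delta."""
    return [[[v + delta for v in row] for row in channel] for channel in K]

X = [[[0, 1, 2], [3, 4, 5], [6, 7, 8]],
     [[1, 2, 3], [4, 5, 6], [7, 8, 9]]]       # shape (2, 3, 3)
K0 = [[[0, 1], [2, 3]], [[1, 2], [3, 4]]]      # shape (2, 2, 2)
K = [shift(K0, d) for d in (0, 1, 2)]          # shape (3, 2, 2, 2)

Y = corr2d_multi_in_out_py(X, K)
print(Y)
```

Under these assumptions, the three nested lists in `Y` agree with the three output channels printed above, and the first one equals the single-output-channel result.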

7.4.3 1 × 1 Convolutional Layer
At first glance, a 1 × 1 convolution, i.e., k_h = k_w = 1, does not seem to make much sense. After all, a
convolution correlates adjacent pixels. A 1 × 1 convolution obviously does not. Nonetheless, 1 × 1 convolutions
are popular operations that are sometimes included in the designs of complex deep networks (Lin
et al., 2013, Szegedy et al., 2017). Letʼs see in some detail what they actually do.
Because the minimum window is used, the 1 × 1 convolution loses the ability of larger convo-
lutional layers to recognize patterns consisting of interactions among adjacent elements in the
height and width dimensions. The only computation of the 1 × 1 convolution occurs on the chan-
nel dimension.
Fig. 7.4.2 shows the cross-correlation computation using the 1 × 1 convolution kernel with 3 input
channels and 2 output channels. Note that the inputs and outputs have the same height and width.
Each element in the output is derived from a linear combination of elements at the same position
in the input image. You could think of the 1 × 1 convolutional layer as constituting a fully connected
layer applied at every single pixel location to transform the c_i corresponding input values
into c_o output values. Because this is still a convolutional layer, the weights are tied across pixel
locations. Thus the 1 × 1 convolutional layer requires c_o × c_i weights (plus the bias). Also note that
convolutional layers are typically followed by nonlinearities. This ensures that 1 × 1 convolutions
cannot simply be folded into other convolutions.
Fig. 7.4.2: The cross-correlation computation uses the 1 × 1 convolution kernel with 3 input chan-
nels and 2 output channels. The input and output have the same height and width.
Letʼs check whether this works in practice: we implement a 1 × 1 convolution using a fully con-
nected layer. The only thing is that we need to make some adjustments to the data shape before
and after the matrix multiplication.
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i))
    # Matrix multiplication in the fully connected layer
    Y = torch.matmul(K, X)
    return Y.reshape((c_o, h, w))
When performing 1 × 1 convolutions, the above function is equivalent to the previously imple-
mented cross-correlation function corr2d_multi_in_out. Letʼs check this with some sample data.
X = torch.normal(0, 1, (3, 3, 3))

K = torch.normal(0, 1, (2, 3, 1, 1))

Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert float(torch.abs(Y1 - Y2).sum()) < 1e-6
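To make the per-pixel view more tangible, here is a small framework-free sketch with 3 input channels and 2 output channels, as in Fig. 7.4.2. The function name `conv1x1_py` and the toy values are assumptions chosen for illustration, and the bias is omitted.

```python
# Sketch: a 1x1 convolution applies the same c_o x c_i weight matrix
# at every pixel location (bias omitted).

def conv1x1_py(X, W):
    """X: c_i x h x w nested lists; W: c_o x c_i weight matrix."""
    c_i, h, w = len(X), len(X[0]), len(X[0][0])
    return [[[sum(W[o][c] * X[c][i][j] for c in range(c_i))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(W))]

X = [[[1, 2], [3, 4]],        # input channel 0
     [[10, 20], [30, 40]],    # input channel 1
     [[0, 1], [1, 0]]]        # input channel 2
W = [[1, 0, 0],               # output channel 0 copies input channel 0
     [0, 1, -1]]              # output channel 1 = channel 1 - channel 2

Y = conv1x1_py(X, W)
print(Y)  # [[[1, 2], [3, 4]], [[10, 19], [29, 40]]]
```

Each output value depends only on the input values at the same position, and the layer holds c_o × c_i = 6 weights regardless of the height and width of the input.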
7.4.4 Discussion
Channels allow us to combine the best of both worlds: MLPs that allow for significant nonlineari-
ties and convolutions that allow for localized analysis of features. In particular, channels allow the
CNN to reason with multiple features, such as edge and shape detectors at the same time. They
also offer a practical trade-off between the drastic parameter reduction arising from translation
invariance and locality, and the need for expressive and diverse models in computer vision.
Note, though, that this flexibility comes at a price. Given an image of size (h × w), the cost for
computing a k × k convolution is O(h · w · k^2). For c_i and c_o input and output channels, respectively,
this increases to O(h · w · k^2 · c_i · c_o). For a 256 × 256 pixel image with a 5 × 5 kernel and 128 input and
output channels, respectively, this amounts to over 53 billion operations (we count multiplications
and additions separately). Later on we will encounter effective strategies to cut down on the cost,
e.g., by requiring the channel-wise operations to be block-diagonal, leading to architectures such
as ResNeXt (Xie et al., 2017).
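The figure of 53 billion can be reproduced with a few lines of arithmetic. The sketch below assumes that the output keeps the h × w size of the input, that each multiplication is paired with one addition, and that the bias is ignored; `conv_cost` is a hypothetical helper, not part of any library.

```python
def conv_cost(h, w, k, c_i=1, c_o=1):
    """Return (multiply-adds, multiplications and additions counted separately)."""
    multiply_adds = h * w * k * k * c_i * c_o
    return multiply_adds, 2 * multiply_adds

macs, ops = conv_cost(h=256, w=256, k=5, c_i=128, c_o=128)
print(f"{macs:,} multiply-adds, {ops:,} operations")
# 26,843,545,600 multiply-adds, 53,687,091,200 operations
```

Doubling both c_i and c_o in this count multiplies the cost by four, since the two channel counts enter as a product.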
Exercises
1. Assume that we have two convolution kernels of size k_1 and k_2, respectively (with no nonlinearity
in-between).
1. Prove that the result of the operation can be expressed by a single convolution.
2. What is the dimensionality of the equivalent single convolution?
3. Is the converse true, i.e., can you always decompose a convolution into two smaller
ones?
2. Assume an input of shape c_i × h × w and a convolution kernel of shape c_o × c_i × k_h × k_w,
padding of (p_h, p_w), and stride of (s_h, s_w).
1. What is the computational cost (multiplications and additions) for the forward propa-
gation?
2. What is the memory footprint?
3. What is the memory footprint for the backward computation?
4. What is the computational cost for the backpropagation?
3. By what factor does the number of calculations increase if we double the number of input
channels c_i and the number of output channels c_o? What happens if we double the padding?
4. Are the variables Y1 and Y2 in the last example of this section exactly the same? Why?
5. How can you express convolutions as a matrix multiplication, even when the convolution window is not
1 × 1?
6. Your task is to implement fast convolutions with a k × k kernel. One of the algorithm candidates
is to scan horizontally across the source, reading a k-wide strip and computing the
1-wide output strip one value at a time. The alternative is to read a k + ∆ wide strip and
compute a ∆-wide output strip. Why is the latter preferable? Is there a limit to how large
you should choose ∆?
7. Assume that we have a c × c matrix.
1. How much faster is it to multiply with a block-diagonal matrix if the matrix is broken
up into b blocks?
2. What is the downside of having b blocks? How could you fix it, at least partly?
7.5 Pooling
In many cases our ultimate task asks some global question about the image, e.g., does it contain a
cat? Consequently, the units of our final layer should be sensitive to the entire input. By gradually
aggregating information, yielding coarser and coarser maps, we accomplish this goal of ultimately
learning a global representation, while keeping all of the advantages of convolutional layers at
the intermediate layers of processing. The deeper we go in the network, the larger the receptive
field (relative to the input) to which each hidden node is sensitive. Reducing spatial resolution
accelerates this process, since the convolution kernels cover a larger effective area.
Moreover, when detecting lower-level features, such as edges (as discussed in Section 7.2), we
often want our representations to be somewhat invariant to translation. For instance, if we take
the image X with a sharp delineation between black and white and shift the whole image by one
pixel to the right, i.e., Z[i, j + 1] = X[i, j], then the output for the new image Z might be vastly
different. The edge will have shifted by one pixel. In reality, objects hardly ever occur exactly at
the same place. In fact, even with a tripod and a stationary object, vibration of the camera due to
the movement of the shutter might shift everything by a pixel or so (high-end cameras are loaded
with special features to address this problem).
This section introduces pooling layers, which serve the dual purposes of mitigating the sensitivity
of convolutional layers to location and of spatially downsampling representations.
7.5.1 Maximum Pooling and Average Pooling
Like convolutional layers, pooling operators consist of a fixed-shape window that is slid over all
regions in the input according to its stride, computing a single output for each location traversed
by the fixed-shape window (sometimes known as the pooling window). However, unlike the cross-
correlation computation of the inputs and kernels in the convolutional layer, the pooling layer
contains no parameters (there is no kernel). Instead, pooling operators are deterministic, typically
calculating either the maximum or the average value of the elements in the pooling window. These
operations are called maximum pooling (max-pooling for short) and average pooling, respectively.
Average pooling is essentially as old as CNNs. The idea is akin to downsampling an image. Rather
than just taking the value of every second (or third) pixel for the lower resolution image, we can
average over adjacent pixels to obtain an image with better signal-to-noise ratio since we are combining
the information from multiple adjacent pixels. Max-pooling was introduced in (Riesenhuber
and Poggio, 1999) in the context of cognitive neuroscience to describe how information
might be aggregated hierarchically for the purpose of object recognition; an earlier
version appeared in speech recognition (Yamaguchi et al., 1990). In almost all cases, max-pooling
is preferable to average pooling.
In both cases, as with the cross-correlation operator, we can think of the pooling window as start-
ing from the upper-left of the input tensor and sliding across the input tensor from left to right and
top to bottom. At each location that the pooling window hits, it computes the maximum or aver-
age value of the input subtensor in the window, depending on whether max or average pooling is
employed.
Fig. 7.5.1: Max-pooling with a pooling window shape of 2×2. The shaded portions are the first out-
put element as well as the input tensor elements used for the output computation: max(0, 1, 3, 4) =
4.
The output tensor in Fig. 7.5.1 has a height of 2 and a width of 2. The four elements are derived
from the maximum value in each pooling window:
max(0, 1, 3, 4) = 4,
max(1, 2, 4, 5) = 5,
max(3, 4, 6, 7) = 7,
max(4, 5, 7, 8) = 8.    (7.5.1)
More generally, we can define a p × q pooling layer by aggregating over a region of said size. Returning
to the problem of edge detection, we use the output of the convolutional layer as input for
2 × 2 max-pooling. Denote by X the input of the convolutional layer and by Y the pooling layer
output. Regardless of whether or not the values of X[i, j], X[i, j + 1], X[i+1, j] and X[i+1,
j + 1] are different, the pooling layer always outputs Y[i, j] = 1. That is to say, using the 2 × 2
max-pooling layer, we can still detect the pattern recognized by the convolutional layer if it moves
no more than one element in height or width.
In the code below, we implement the forward propagation of the pooling layer in the pool2d function.
This function is similar to the corr2d function in Section 7.2. However, no kernel is needed:
the output is computed as either the maximum or the average of each region in the input.
import torch
from torch import nn

def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    # One output element per valid window position (stride 1, no padding)
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y
We can construct the input tensor X in Fig. 7.5.1 to validate the output of the two-dimensional
max-pooling layer.
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])

pool2d(X, (2, 2))
tensor([[4., 5.],
[7., 8.]])
Also, we experiment with the average pooling layer.
pool2d(X, (2, 2), 'avg')
tensor([[2., 3.],
[5., 6.]])
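The tolerance of max-pooling to small shifts, discussed earlier in this section, can also be checked directly. The following pure-Python sketch uses a hypothetical helper `max_pool2d_py` (stride 1, no padding) and a toy activation map in which a single detected pattern moves one element to the right.

```python
def max_pool2d_py(X, p_h, p_w):
    """Stride-1 max-pooling over a 2D nested list, without padding."""
    return [[max(X[i + a][j + b] for a in range(p_h) for b in range(p_w))
             for j in range(len(X[0]) - p_w + 1)]
            for i in range(len(X) - p_h + 1)]

# A single detected pattern at row 1, column 1 ...
A = [[0, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
# ... and the same pattern shifted one element to the right.
B = [[0, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 0]]

Y_a = max_pool2d_py(A, 2, 2)
Y_b = max_pool2d_py(B, 2, 2)
print(Y_a)  # [[1, 1, 0], [1, 1, 0]]
print(Y_b)  # [[0, 1, 1], [0, 1, 1]]
```

The outputs at column 1 equal 1 in both cases: every pooling window that covers both the old and the new position still reports the pattern, even though the underlying activation has moved.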
7.5.2 Padding and Stride
As with convolutional layers, pooling layers change the output shape. And as before, we can adjust
the operation to achieve a desired output shape by padding the input and adjusting the stride. We
can demonstrate the use of padding and strides in pooling layers via the built-in two-dimensional
max-pooling layer from the deep learning framework. We first construct an input tensor X whose
shape has four dimensions, where the number of examples (batch size) and number of channels
are both 1.
X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))

X
tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])
Since pooling aggregates information from an area, deep learning frameworks default to matching
pooling window sizes and stride. For instance, if we use a pooling window of shape (3, 3) we get
a stride shape of (3, 3) by default.
pool2d = nn.MaxPool2d(3)
# Pooling has no model parameters, hence it needs no initialization
pool2d(X)

tensor([[[[10.]]]])
As expected, the stride and padding can be manually specified to override framework defaults if
needed.
pool2d = nn.MaxPool2d(3, padding=1, stride=2)

pool2d(X)
tensor([[[[ 5.,  7.],
          [13., 15.]]]])
Of course, we can specify an arbitrary rectangular pooling window with arbitrary height and width
respectively, as the example below shows.
pool2d = nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1))

pool2d(X)
tensor([[[[ 5.,  7.],
          [13., 15.]]]])
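The output shapes seen in this section follow the same rule as for convolutions. As a minimal sketch, assuming the default rounding-down behavior and no dilation, the output length along one dimension is floor((n + 2 · padding − window) / stride) + 1. The helper `pool_output_size` is introduced here only for illustration.

```python
def pool_output_size(n, window, stride=None, padding=0):
    """Output length along one dimension of a pooling layer."""
    if stride is None:  # frameworks default the stride to the window size
        stride = window
    return (n + 2 * padding - window) // stride + 1

# The three configurations used above, applied to the 4 x 4 input.
default = pool_output_size(4, 3)                       # nn.MaxPool2d(3)
strided = pool_output_size(4, 3, stride=2, padding=1)  # padding=1, stride=2
rect = (pool_output_size(4, 2, stride=2, padding=0),   # height, window (2, 3)
        pool_output_size(4, 3, stride=3, padding=1))   # width, window (2, 3)
print(default, strided, rect)  # 1 2 (2, 2)
```

These values match the 1 × 1 and 2 × 2 spatial shapes of the tensors printed above.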
7.5.3 Multiple Channels
When processing multi-channel input data, the pooling layer pools each input channel separately,
rather than summing the inputs up over channels as in a convolutional layer. This means that the
number of output channels for the pooling layer is the same as the number of input channels.
Below, we will concatenate tensors X and X + 1 on the channel dimension to construct an input
with 2 channels.
X = torch.cat((X, X + 1), 1)
X
tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]],

         [[ 1.,  2.,  3.,  4.],
          [ 5.,  6.,  7.,  8.],
          [ 9., 10., 11., 12.],
          [13., 14., 15., 16.]]]])
As we can see, the number of output channels is still 2 after pooling.
pool2d = nn.MaxPool2d(3, padding=1, stride=2)

pool2d(X)
tensor([[[[ 5., 7.],
[13., 15.]],
[[ 6., 8.],
[14., 16.]]]])
7.5.4 Summary
Pooling is an exceedingly simple operation. It does exactly what its name indicates: it aggregates
results over a window of values. All convolution semantics, such as strides and padding, apply in
the same way as they did previously. Note that pooling is indifferent to channels, i.e., it leaves
the number of channels unchanged and it applies to each channel separately. Lastly, of the two
popular pooling choices, max-pooling is preferable to average pooling, as it confers some degree
of invariance to the output. A popular choice is to pick a pooling window size of 2 × 2 to quarter the
spatial resolution of the output.
Note that there are many more ways of reducing resolution beyond pooling. For instance, in
stochastic pooling (Zeiler and Fergus, 2013) and fractional max-pooling (Graham, 2014) aggre-
gation is combined with randomization. This can slightly improve the accuracy in some cases.
Lastly, as we will see later with the attention mechanism, there are more refined ways of aggre-
gating over outputs, e.g., by using the alignment between a query and representation vectors.
Exercises
1. Implement average pooling through a convolution.

2. Prove that max-pooling cannot be implemented through a convolution alone.
3. Max-pooling can be accomplished using ReLU operations, i.e., ReLU(x) = max(0, x).
1. Express max(a, b) by using only ReLU operations.
2. Use this to implement max-pooling by means of convolutions and ReLU layers.
3. How many channels and layers do you need for a 2 × 2 convolution? How many for a
3 × 3 convolution?
4. What is the computational cost of the pooling layer? Assume that the input to the pooling
layer is of size c × h × w, the pooling window has a shape of p_h × p_w with a padding of (p_h, p_w)
and a stride of (s_h, s_w).
5. Why do you expect max-pooling and average pooling to work differently?
6. Do we need a separate minimum pooling layer? Can you replace it with another operation?
7. We could use the softmax operation for pooling. Why might it not be so popular?
7.6 Convolutional Neural Networks (LeNet)
We now have all the ingredients required to assemble a fully-functional CNN. In our earlier en-
counter with image data, we applied a linear model with softmax regression (Section 4.4) and
an MLP (Section 5.2) to pictures of clothing in the Fashion-MNIST dataset. To make such data
amenable we first flattened each image from a 28 × 28 matrix into a fixed-length 784-dimensional
vector, and thereafter processed them in fully connected layers. Now that we have a handle on
convolutional layers, we can retain the spatial structure in our images. As an additional benefit
of replacing fully connected layers with convolutional layers, we will enjoy more parsimonious
models that require far fewer parameters.
In this section, we will introduce LeNet, among the first published CNNs to capture wide attention
for its performance on computer vision tasks. The model was introduced by (and named for) Yann
LeCun, then a researcher at AT&T Bell Labs, for the purpose of recognizing handwritten digits
in images (LeCun et al., 1998). This work represented the culmination of a decade of research
developing the technology. In 1989, LeCunʼs team published the first study to successfully train
CNNs via backpropagation (LeCun et al., 1989).
At the time, LeNet achieved outstanding results, matching the performance of support vector machines,
then a dominant approach in supervised learning, with an error rate of less than 1%
per digit. LeNet was eventually adapted to recognize digits for processing deposits in ATMs.
To this day, some ATMs still run the code that Yann LeCun and his colleague Leon Bottou
wrote in the 1990s!
7.6.1 LeNet
At a high level, LeNet (LeNet-5) consists of two parts: (i) a convolutional encoder consisting of
two convolutional layers; and (ii) a dense block consisting of three fully connected layers. The
architecture is summarized in Fig. 7.6.1.
Fig. 7.6.1: Data flow in LeNet. The input is a handwritten digit, the output a probability over 10
possible outcomes.
The basic units in each convolutional block are a convolutional layer, a sigmoid activation func-
tion, and a subsequent average pooling operation. Note that while ReLUs and max-pooling work
better, these discoveries had not yet been made at the time. Each convolutional layer uses a 5 × 5
kernel and a sigmoid activation function. These layers map spatially arranged inputs to a number
of two-dimensional feature maps, typically increasing the number of channels. The first convolu-
tional layer has 6 output channels, while the second has 16. Each 2 × 2 pooling operation (stride 2)
reduces dimensionality by a factor of 4 via spatial downsampling. The convolutional block emits
an output with shape given by (batch size, number of channels, height, width).
In order to pass output from the convolutional block to the dense block, we must flatten each
example in the minibatch. In other words, we take this four-dimensional input and transform
it into the two-dimensional input expected by fully connected layers: as a reminder, the two-
dimensional representation that we desire uses the first dimension to index examples in the mini-
batch and the second to give the flat vector representation of each example. LeNetʼs dense block
has three fully connected layers, with 120, 84, and 10 outputs, respectively. Because we are still
performing classification, the 10-dimensional output layer corresponds to the number of possible
output classes.
While getting to the point where you truly understand what is going on inside LeNet may have
taken a bit of work, hopefully the following code snippet will convince you that implementing
such models with modern deep learning frameworks is remarkably simple. We need only to in-
stantiate a Sequential block and chain together the appropriate layers, using Xavier initialization
as introduced in Section 5.4.2.
import torch
from torch import nn
from d2l import torch as d2l

def init_cnn(module):  #@save
    """Initialize weights for CNNs."""
    if type(module) == nn.Linear or type(module) == nn.Conv2d:
        # Xavier initialization, as introduced in Section 5.4.2
        nn.init.xavier_uniform_(module.weight)

class LeNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        # Convolutional encoder followed by the dense block
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.LazyLinear(120), nn.Sigmoid(),
            nn.LazyLinear(84), nn.Sigmoid(),
            nn.LazyLinear(num_classes))
We take some liberty in the reproduction of LeNet insofar as we replace the Gaussian activation
layer by a softmax layer. This greatly simplifies the implementation, not the least due to the fact
that the Gaussian decoder is rarely used nowadays. Other than that, this network matches the
original LeNet-5 architecture.
Letʼs see what happens inside the network. By passing a single-channel (black and white) 28 × 28
image through the network and printing the output shape at each layer, we can inspect the model
to make sure that its operations line up with what we expect from Fig. 7.6.2.

Fig. 7.6.2: Compressed notation for LeNet-5.
@d2l.add_to_class(d2l.Classifier)  #@save
def layer_summary(self, X_shape):
    X = torch.randn(*X_shape)
    for layer in self.net:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)

model = LeNet()
model.layer_summary((1, 1, 28, 28))
Conv2d output shape: torch.Size([1, 6, 28, 28])
Sigmoid output shape: torch.Size([1, 6, 28, 28])
AvgPool2d output shape: torch.Size([1, 6, 14, 14])
Conv2d output shape: torch.Size([1, 16, 10, 10])
Sigmoid output shape: torch.Size([1, 16, 10, 10])
AvgPool2d output shape: torch.Size([1, 16, 5, 5])
Flatten output shape: torch.Size([1, 400])
Linear output shape: torch.Size([1, 120])
Sigmoid output shape: torch.Size([1, 120])
Linear output shape: torch.Size([1, 84])
Sigmoid output shape: torch.Size([1, 84])
Linear output shape: torch.Size([1, 10])
Note that the height and width of the representation at each layer throughout the convolutional
block is reduced (compared with the previous layer). The first convolutional layer uses 2 pixels
of padding to compensate for the reduction in height and width that would otherwise result from
using a 5 × 5 kernel. As an aside, the image size of 28 × 28 pixels in the original MNIST OCR
dataset is a result of trimming 2 pixel rows (and columns) from each side of the original scans that measured
32 × 32 pixels. This was done primarily to save space (a reduction of roughly 23%) at a time when megabytes
mattered.
In contrast, the second convolutional layer forgoes padding, and thus the height and width are
both reduced by 4 pixels. As we go up the stack of layers, the number of channels increases
layer-over-layer from 1 in the input to 6 after the first convolutional layer and 16 after the second
convolutional layer. However, each pooling layer halves the height and width. Finally, each fully
connected layer reduces dimensionality, ultimately emitting an output whose dimension matches the
number of classes.
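The shapes discussed above can be traced by hand with the usual output-size rule, floor((n + 2 · padding − kernel) / stride) + 1. The sketch below is plain Python and only mirrors the layer settings listed in this section; `out_size` and the `layers` table are names made up for this illustration.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

size, channels = 28, 1  # single-channel 28 x 28 input
# (layer name, output channels, kernel, stride, padding)
layers = [("conv1", 6, 5, 1, 2), ("pool1", 6, 2, 2, 0),
          ("conv2", 16, 5, 1, 0), ("pool2", 16, 2, 2, 0)]

trace = []
for name, channels, kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    trace.append((name, channels, size, size))
    print(f"{name}: {channels} x {size} x {size}")

flat = channels * size * size
print("flattened features:", flat)  # flattened features: 400
```

The trace goes 28 → 28 → 14 → 10 → 5 in height and width, and the 16 × 5 × 5 output of the convolutional block flattens to the 400 features consumed by the first fully connected layer.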
7.6.2 Training
Now that we have implemented the model, letʼs run an experiment to see how the LeNet-5 model
fares on Fashion-MNIST.
While CNNs have fewer parameters, they can still be more expensive to compute than similarly
deep MLPs because each parameter participates in many more multiplications. If you have ac-
cess to a GPU, this might be a good time to put it into action to speed up training. Note that the
d2l.Trainer class takes care of all details. By default, it initializes the model parameters on the
available devices. Just as with MLPs, our loss function is cross-entropy, and we minimize it via
minibatch stochastic gradient descent.
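trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
# Assumption: `data` is the Fashion-MNIST loader from d2l (batch size 128)
data = d2l.FashionMNIST(batch_size=128)
model = LeNet(lr=0.1)
# Infer the lazy layer shapes from one minibatch, then apply Xavier init
model.apply_init([next(iter(data.get_dataloader(True)))[0]], init_cnn)
trainer.fit(model, data)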
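The remark that each convolutional parameter participates in many more multiplications can be made concrete for LeNet itself. The following back-of-the-envelope sketch counts weights (ignoring biases) and multiplications for one forward pass of a single 28 × 28 example, using the layer shapes listed in this section; the table layout and the helper `weights_and_mults` are illustrative assumptions.

```python
# (name, kind, input channels/features, output channels/features,
#  kernel size, output height and width)
layers = [
    ("conv1", "conv", 1, 6, 5, 28),
    ("conv2", "conv", 6, 16, 5, 10),
    ("fc1", "dense", 16 * 5 * 5, 120, 1, 1),
    ("fc2", "dense", 120, 84, 1, 1),
    ("fc3", "dense", 84, 10, 1, 1),
]

def weights_and_mults(c_in, c_out, k, out_hw):
    """Weights (bias ignored) and multiplications for one example."""
    weights = c_in * c_out * k * k
    mults = weights * out_hw * out_hw  # each weight is reused at every output position
    return weights, mults

totals = {"conv": [0, 0], "dense": [0, 0]}
for name, kind, c_in, c_out, k, out_hw in layers:
    weights, mults = weights_and_mults(c_in, c_out, k, out_hw)
    totals[kind][0] += weights
    totals[kind][1] += mults
    print(f"{name}: {weights:,} weights, {mults:,} multiplications")
print(totals)  # {'conv': [2550, 357600], 'dense': [58920, 58920]}
```

Under these assumptions the two convolutional layers hold only 2,550 weights but account for 357,600 multiplications, whereas the dense block uses each of its 58,920 weights exactly once per example.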

7.6.3 Summary
In this chapter we made significant progress. We moved from the MLPs of the 1980s to the CNNs
of the 1990s and early 2000s. The architectures proposed, e.g., in the form of LeNet-5 remain
meaningful, even to this day. It is worth comparing the error rates on Fashion-MNIST achievable
with LeNet-5 both to the very best possible with MLPs (Section 5.2) and those with significantly
more advanced architectures such as ResNet (Section 8.6). LeNet is much more similar to the
latter than to the former. One of the primary differences, as we shall see, is that greater amounts
of computation afforded significantly more complex architectures.
A second difference is the relative ease with which we were able to implement LeNet. What used
to be an engineering challenge worth months of C++ and assembly code, engineering to improve
SN, an early Lisp-based deep learning tool (Bottou and Le Cun, 1988), and finally experimentation
with models can now be accomplished in minutes. It is this incredible productivity boost that has
democratized deep learning model development tremendously. In the next chapter we will follow
this rabbit hole to see where it takes us.
Exercises
1. Letʼs modernize LeNet. Implement and test the following changes:

1. Replace the average pooling with max-pooling.
2. Replace the sigmoid activations with ReLU.
2. Try to change the size of the LeNet style network to improve its accuracy in addition to max-
pooling and ReLU.
1. Adjust the convolution window size.
2. Adjust the number of output channels.
3. Adjust the number of convolution layers.
4. Adjust the number of fully connected layers.
5. Adjust the learning rates and other training details (e.g., initialization and number of
epochs).
3. Try out the improved network on the original MNIST dataset.
4. Display the activations of the first and second layer of LeNet for different inputs (e.g.,
sweaters and coats).
5. What happens to the activations when you feed significantly different images into the net-
work (e.g., cats, cars, or even random noise)?
8 Modern Convolutional Neural Networks
Now that we understand the basics of wiring together CNNs, we will take you through a tour of
modern CNN architectures. This tour is, by necessity, incomplete, thanks to the plethora of ex-
citing new designs being added. Their importance derives from the fact that not only can they be
used directly for vision tasks, but they also serve as basic feature generators for more advanced
tasks such as tracking (Zhang et al., 2021), segmentation (Long et al., 2015), object detection (Red-
mon and Farhadi, 2018), or style transformation (Gatys et al., 2016). In this chapter, most sections
correspond to a significant CNN architecture that was at some point (or currently) the base model
upon which many research projects and deployed systems were built. Each of these networks was
briefly a dominant architecture and many were winners or runners-up in the ImageNet compe-
tition122 which has served as a barometer of progress on supervised learning in computer vision
since 2010.
These models include the AlexNet (Krizhevsky et al., 2012), the first large-scale network deployed
to beat conventional computer vision methods on a large-scale vision challenge; the VGG network
(Simonyan and Zisserman, 2014), which makes use of a number of repeating blocks of elements;
the network in network (NiN) that convolves whole neural networks patch-wise over inputs (Lin et
al., 2013); the GoogLeNet that uses networks with multi-branch convolutions (Szegedy et al., 2015);
the residual network (ResNet) (He et al., 2016), which remains one of the most popular off-the-shelf
architectures in computer vision; ResNeXt blocks (Xie et al., 2017) for sparser connections;
and the DenseNet (Huang et al., 2017) for a generalization of the residual architecture. In addition
to that, assuming standard and fixed blocks, we progressively simplify design spaces with better
models, leading to the RegNet (Radosavovic et al., 2020).
While the idea of deep neural networks is quite simple (stack together a bunch of layers), perfor-
mance can vary wildly across architectures and hyperparameter choices. The neural networks
described in this chapter are the product of intuition, a few mathematical insights, and a lot of
trial and error. We present these models in chronological order, partly to convey a sense of the
history so that you can form your own intuitions about where the field is heading and perhaps
develop your own architectures. For instance, batch normalization and residual connections de-
scribed in this chapter have offered two popular ideas for training and designing deep models,
both of which have since been applied to architectures beyond computer vision, too.
122
https://www.image-net.org/challenges/LSVRC/
8.1 Deep Convolutional Neural Networks (AlexNet)
Although CNNs were well known in the computer vision and machine learning communities fol-
lowing the introduction of LeNet (LeCun et al., 1995), they did not immediately dominate the field.
Although LeNet achieved good results on early small datasets, the performance and feasibility of
training CNNs on larger, more realistic datasets had yet to be established. In fact, for much of
the intervening time between the early 1990s and the watershed results of 2012 (Krizhevsky et al.,
2012), neural networks were often surpassed by other machine learning methods, such as kernel
methods (Scholkopf and Smola, 2002), ensemble methods (Freund et al., 1996), and structured
estimation (Taskar et al., 2004).
For computer vision, this comparison is perhaps not fair. That is, although the inputs to convolutional
networks consist of raw or lightly-processed (e.g., by centering) pixel values, practitioners
would never feed raw pixels into traditional models. Instead, typical computer vision pipelines
consisted of manually engineered feature extraction pipelines, such as SIFT (Lowe, 2004), SURF
(Bay et al., 2006), and bags of visual words (Sivic and Zisserman, 2003). Rather than being learned, the
features were crafted. Most of the progress came from having more clever ideas for
features and deep insight into geometry (Hartley and Zisserman, 2000). The learning algorithm
was often considered an afterthought.
Although some neural network accelerators were available in the 1990s, they were not yet suffi-
ciently powerful to make deep multichannel, multilayer CNNs with a large number of parameters.
For instance, NVIDIAʼs GeForce 256 from 1999 was able to process at most 480 million operations
per second, without any meaningful programming framework for operations beyond games. To-
dayʼs accelerators are able to perform in excess of 300 TFLOPs per device (NVIDIAʼs Ampere A100),
where FLOPs are floating-point operations in number of multiply-adds. Moreover, datasets were
still relatively small: OCR on 60,000 low-resolution images was considered a highly challenging
task. Added to these obstacles, key tricks for training neural networks including parameter initial-
ization heuristics (Glorot and Bengio, 2010), clever variants of stochastic gradient descent (Kingma
and Ba, 2014), non-squashing activation functions (Nair and Hinton, 2010), and effective regular-
ization techniques (Srivastava et al., 2014) were still missing.
Thus, rather than training end-to-end (pixel to classification) systems, classical pipelines looked
more like this:
1. Obtain an interesting dataset. In early days, these datasets required expensive sensors. For
instance, the Apple QuickTake 100123 of 1994 sported a whopping 0.3 Megapixel (VGA) reso-
lution, capable of storing up to 8 images, all for the price of $1,000.
2. Preprocess the dataset with hand-crafted features based on some knowledge of optics, ge-
ometry, other analytic tools, and occasionally on the serendipitous discoveries of lucky grad-
uate students.
3. Feed the data through a standard set of feature extractors such as the SIFT (scale-invariant
feature transform) (Lowe, 2004), the SURF (speeded up robust features) (Bay et al., 2006), or
any number of other hand-tuned pipelines.
4. Dump the resulting representations into your favorite classifier, likely a linear model or ker-
nel method, to train a classifier.
If you spoke to machine learning researchers, they believed that machine learning was both im-
portant and beautiful. Elegant theories proved the properties of various classifiers (Boucheron
et al., 2005) and convex optimization (Boyd and Vandenberghe, 2004) had become the mainstay
123
https://en.wikipedia.org/wiki/Apple_QuickTake
for obtaining them. The field of machine learning was thriving, rigorous, and eminently useful.
However, if you spoke to a computer vision researcher, you would hear a very different story.
The dirty truth of image recognition, they would tell you, is that features, geometry (Hartley and
Zisserman, 2000), and engineering, rather than novel learning algorithms, drove progress. Com-
puter vision researchers justifiably believed that a slightly bigger or cleaner dataset or a slightly
improved feature-extraction pipeline mattered far more to the final accuracy than any learning
algorithm.
8.1.1 Learning Representations
Another way to cast the state of affairs is that the most important part of the pipeline was the
representation. And up until 2012 the representation was calculated mostly mechanically. In fact,
engineering a new set of feature functions, improving results, and writing up the method was a
prominent genre of paper. SIFT (Lowe, 2004), SURF (Bay et al., 2006), HOG (histograms of oriented
gradients) (Dalal and Triggs, 2005), bags of visual words
(https://en.wikipedia.org/wiki/Bag-of-words_model_in_computer_vision), and similar feature
extractors ruled the roost.
Another group of researchers, including Yann LeCun, Geoff Hinton, Yoshua Bengio, Andrew Ng,
Shun-ichi Amari, and Juergen Schmidhuber, had different plans. They believed that features
themselves ought to be learned. Moreover, they believed that to be reasonably complex, the fea-
tures ought to be hierarchically composed with multiple jointly learned layers, each with learn-
able parameters. In the case of an image, the lowest layers might come to detect edges, colors,
and textures, in analogy to how the visual system in animals processes its input.
The first modern CNN (Krizhevsky et al., 2012), named AlexNet after one of its inventors, Alex
Krizhevsky, is largely an evolutionary improvement over LeNet. It achieved excellent perfor-
mance in the 2012 ImageNet challenge.
Fig. 8.1.1: Image filters learned by the first layer of AlexNet (reproduced from (Krizhevsky et al.,
2012)).
Interestingly, in the lowest layers of the network, the model learned feature extractors that
resembled some traditional filters. Fig. 8.1.1 shows lower-level image descriptors. Higher layers
in the network might build upon these representations to represent larger structures, like eyes,
noses, blades of grass, and so on. Even higher layers might represent whole objects like people,
airplanes, dogs, or frisbees. Ultimately, the final hidden state learns a compact representation
of the image that summarizes its contents such that data belonging to different categories can be
easily separated.
AlexNet (2012) and its precursor LeNet (1995) share many architectural elements. This raises the
question of why it took so long. A key difference is that over the past two decades both data and
computation had increased significantly. As such AlexNet was much larger: it was trained on
much more data, and on much faster GPUs, compared to the CPUs available in 1995.
Missing Ingredient: Data
Deep models with many layers require large amounts of data in order to enter the regime where
they significantly outperform traditional methods based on convex optimizations (e.g., linear and
kernel methods). However, given the limited storage capacity of computers, the relative expense
of (imaging) sensors, and the comparatively tighter research budgets in the 1990s, most research
relied on tiny datasets. Numerous papers addressed the UCI collection of datasets, many of which
contained only hundreds or (a few) thousands of images captured in low resolution and often with
an artificially clean background.
In 2009, the ImageNet dataset was released (Deng et al., 2009), challenging researchers to learn
models from 1 million examples, 1000 each from 1000 distinct categories of objects. The cate-
gories themselves were based on the most popular noun nodes in WordNet (Miller, 1995). The Im-
ageNet team used Google Image Search to prefilter large candidate sets for each category and em-
ployed the Amazon Mechanical Turk crowdsourcing pipeline to confirm for each image whether
it belonged to the associated category. This scale was unprecedented, exceeding others by over
an order of magnitude (e.g., CIFAR-100 has 60,000 images). Another aspect was that the images
were at a fairly high resolution of 224 × 224 pixels, unlike the TinyImages dataset of 80 million
images (Torralba et al., 2008), which consists of 32 × 32 pixel thumbnails. This allowed for the
formation of higher-level features. The associated competition, dubbed the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC, https://www.image-net.org/challenges/LSVRC/), pushed
computer vision and machine learning research forward, challenging researchers to identify which
models performed best at a greater scale than academics had previously considered.
Missing Ingredient: Hardware
Deep learning models are voracious consumers of compute cycles. Training can take hundreds
of epochs, and each iteration requires passing data through many layers of computationally-
expensive linear algebra operations. This is one of the main reasons why in the 1990s and early
2000s, simple algorithms based on the more-efficiently optimized convex objectives were pre-
ferred.
Graphics processing units (GPUs) proved to be a game changer in making deep learning feasible.
These chips had long been developed for accelerating graphics processing to benefit computer
games. In particular, they were optimized for high throughput 4×4 matrix-vector products, which
are needed for many computer graphics tasks. Fortunately, the math is strikingly similar to that
required to calculate convolutional layers. Around that time, NVIDIA and ATI had begun optimiz-
ing GPUs for general computing operations (Fernando, 2004), going as far as to market them as
general-purpose GPUs (GPGPUs).
To provide some intuition, consider the cores of a modern microprocessor (CPU). Each of the
cores is fairly powerful, running at a high clock frequency and sporting large caches (up to
several megabytes of L3). Each core is well-suited to executing a wide range of instructions, with
branch predictors, a deep pipeline, specialized execution units, speculative execution, and many
other bells and whistles that enable it to run a large variety of programs with sophisticated con-
trol flow. This apparent strength, however, is also its Achilles heel: general-purpose cores are
very expensive to build. They excel at general-purpose code with lots of control flow. This
requires lots of chip area, not just for the actual ALU (arithmetic logic unit) where computation
happens, but also for all the aforementioned bells and whistles, plus memory interfaces, caching
logic between cores, high-speed interconnects, and so on. CPUs are comparatively bad at any
single task when compared to dedicated hardware. Modern laptops have 4–8 cores, and even
high-end servers rarely exceed 64 cores per socket, simply because it is not cost effective.
By comparison, GPUs can consist of thousands of small processing elements (NVIDIAʼs latest
Ampere chips have up to 6912 CUDA cores), often grouped into larger groups (NVIDIA calls them
warps). The details differ somewhat between NVIDIA, AMD, ARM and other chip vendors. While
each core is relatively weak, running at about 1GHz clock frequency, it is the total number of
such cores that makes GPUs orders of magnitude faster than CPUs. For instance, NVIDIAʼs recent
Ampere generation offers over 300 TFLOPs per chip for specialized 16 bit precision (BFLOAT16)
matrix-matrix multiplications, and up to 20 TFLOPs for more general-purpose floating point op-
erations (FP32). At the same time, floating point performance of CPUs rarely exceeds 1 TFLOPs
(AWSʼ Graviton 2, for instance, reaches 2 TFLOPs peak performance for 16 bit precision
operations). The reason why this is possible is actually quite simple: First, power consumption
tends to grow quadratically with clock frequency. Hence, for the power budget of a CPU core that
runs 4 times faster (a typical number), you can use 16 GPU cores at 1/4 the speed, which yields
16 × 1/4 = 4 times the performance. Furthermore, GPU cores are much simpler (in fact, for a long
time they were not even able to execute general-purpose code), which makes them more energy
efficient. Last, many operations in deep learning require high memory bandwidth. Again, GPUs
shine here with buses that are at least 10 times as wide as many CPUs.
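The power-budget argument can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, that power grows quadratically with clock frequency and that throughput is proportional to the number of cores times their frequency. The function name and the normalization (a slow core has frequency 1 and power 1) are illustrative choices, not measurements of any real chip.

```python
def relative_throughput(speed_ratio):
    """Throughput of many slow cores relative to one fast core at equal power.

    Assumes power ~ frequency**2 and throughput ~ cores * frequency.
    `speed_ratio` is how many times faster the fast core is clocked.
    """
    fast_core_power = speed_ratio ** 2   # power drawn by the single fast core
    slow_cores = fast_core_power         # slow cores (power 1 each) that fit the same budget
    slow_total = slow_cores * 1          # each slow core contributes frequency 1
    fast_total = 1 * speed_ratio         # one core at the higher frequency
    return slow_total / fast_total


# 16 slow cores versus one core clocked 4 times faster
print(relative_throughput(4))
```

Under these assumptions the advantage of the many-core design grows linearly with the clock ratio, which is why trading clock speed for core count pays off for highly parallel workloads.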
Back to 2012. A major breakthrough came when Alex Krizhevsky and Ilya Sutskever implemented
a deep CNN that could run on GPUs. They realized that the computational bottlenecks in CNNs,
convolutions and matrix multiplications, are all operations that could be parallelized in hardware.
Using two NVIDIA GTX 580s with 3GB of memory, each of which was capable of 1.5 TFLOPs, they
implemented fast convolutions. The code, cuda-convnet
(https://code.google.com/archive/p/cuda-convnet/), was good enough that for several years
it was the industry standard and powered the first couple of years of the deep learning boom.
8.1.2 AlexNet
AlexNet, which employed an 8-layer CNN, won the ImageNet Large Scale Visual Recognition Chal-
lenge 2012 by a large margin (Russakovsky et al., 2013). This network showed, for the first time,
that the features obtained by learning can transcend manually-designed features, breaking the
previous paradigm in computer vision.
The architectures of AlexNet and LeNet are strikingly similar, as Fig. 8.1.2 illustrates. Note that we
provide a slightly streamlined version of AlexNet removing some of the design quirks that were
needed in 2012 to make the model fit on two small GPUs.
Fig. 8.1.2: From LeNet (left) to AlexNet (right).
There are also significant differences between AlexNet and LeNet. First, AlexNet is much deeper
than the comparatively small LeNet-5. AlexNet consists of eight layers: five convolutional layers,
two fully connected hidden layers, and one fully connected output layer. Second, AlexNet used
the ReLU instead of the sigmoid as its activation function. Letʼs delve into the details below.
Architecture
In AlexNetʼs first layer, the convolution window shape is 11 × 11. Since the images in ImageNet
are eight times higher and wider than the MNIST images, objects in ImageNet data tend to occupy
more pixels with more visual detail. Consequently, a larger convolution window is needed to cap-
ture the object. The convolution window shape in the second layer is reduced to 5 × 5, followed
by 3 × 3. In addition, after the first, second, and fifth convolutional layers, the network adds max-
pooling layers with a window shape of 3 × 3 and a stride of 2. Moreover, AlexNet has ten times
more convolution channels than LeNet.
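To see how these window shapes and strides shrink a 224 × 224 input, the following sketch applies the standard output-size formula, floor((n + 2p − k) / s) + 1, layer by layer. The stride and padding of the first layer are taken from the implementation in this section. Padding 2 for the 5 × 5 layer and padding 1 for the 3 × 3 layers are assumed so that those convolutions preserve the resolution. The helper names are hypothetical.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for the streamlined AlexNet of this section
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 1), ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2), ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1), ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]


def trace_sizes(n, layers=ALEXNET_LAYERS):
    """Return (layer name, spatial size) after every layer."""
    sizes = []
    for name, kernel, stride, padding in layers:
        n = out_size(n, kernel, stride, padding)
        sizes.append((name, n))
    return sizes


for name, size in trace_sizes(224):
    print(f"{name}: {size} x {size}")
```

With 256 channels in the last convolutional layer, the final 5 × 5 feature maps flatten to 256 · 5 · 5 = 6400 values, which is the input width of the first fully connected layer.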
After the last convolutional layer there are two fully connected layers with 4096 outputs. These
two huge fully connected layers produce model parameters of nearly 1 GB. Due to the limited
memory in early GPUs, the original AlexNet used a dual data stream design, so that each of their
two GPUs could be responsible for storing and computing only its half of the model. Fortunately,
GPU memory is comparatively abundant now, so we rarely need to break up models across GPUs
these days (our version of the AlexNet model deviates from the original paper in this aspect).
Activation Functions
In addition, AlexNet replaced the sigmoid activation function with the simpler ReLU activation
function. On the one hand, the computation of the ReLU activation function is simpler. For
example, it does not have the exponentiation operation found in the sigmoid activation function.
On the other hand, the ReLU activation function makes model training easier when using different
parameter initialization methods. This is because, when the output of the sigmoid activation
function is very close to 0 or 1, the gradient in these regions is almost 0, so that
backpropagation cannot continue to update some of the model parameters. In contrast, the gradient
of the ReLU activation function in the positive interval is always 1 (Section 5.1.2). Therefore,
if the model parameters are not properly initialized, the sigmoid function may obtain a gradient
of almost 0 in the positive interval, so that the model cannot be effectively trained.
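The saturation effect is easy to see numerically. The sketch below uses only the Python standard library and the closed-form derivative of the sigmoid, σ'(x) = σ(x)(1 − σ(x)). The sample inputs are arbitrary illustrative values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")
```

The sigmoid gradient peaks at 0.25 for x = 0 and decays towards 0 as the input grows, whereas the ReLU gradient stays at 1 for every positive input.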
Capacity Control and Preprocessing
AlexNet controls the model complexity of the fully connected layers by dropout (Section 5.6), while
LeNet only uses weight decay. To augment the data even further, the training loop of AlexNet
added a great deal of image augmentation, such as flipping, cropping, and color changes. This
makes the model more robust and the larger sample size effectively reduces overfitting. We will
discuss data augmentation in greater detail in Section 14.1. See also (Buslaev et al., 2020) for an
in-depth review of such preprocessing steps.
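import torch
from torch import nn
from d2l import torch as d2l

class AlexNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()  # keeps lr and num_classes as attributes
        self.net = nn.Sequential(
            nn.LazyConv2d(96, kernel_size=11, stride=4, padding=1),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(p=0.5),
            nn.LazyLinear(num_classes))
        self.net.apply(d2l.init_cnn)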
We construct a single-channel data example with both height and width of 224 to observe the
output shape of each layer. It matches the AlexNet architecture in Fig. 8.1.2.

AlexNet().layer_summary((1, 1, 224, 224))
ReLU output shape: torch.Size([1, 96, 54, 54])
MaxPool2d output shape: torch.Size([1, 96, 26, 26])
ReLU output shape: torch.Size([1, 4096])
Dropout output shape: torch.Size([1, 4096])
8.1.3 Training
Although AlexNet was trained on ImageNet in (Krizhevsky et al., 2012), we use Fashion-MNIST
here since training an ImageNet model to convergence could take hours or days even on a modern
GPU. One of the problems with applying AlexNet directly on Fashion-MNIST is that its images
have lower resolution (28 × 28 pixels) than ImageNet images. To make things work, we upsample
them to 224 × 224. This is generally not a smart practice, but we do it here to be faithful to the
AlexNet architecture. We perform this resizing with the resize argument in the d2l.FashionMNIST
constructor.
Now, we can start training AlexNet. Compared to LeNet in Section 7.6, the main change here is
the use of a smaller learning rate and much slower training due to the deeper and wider network,
the higher image resolution, and the more costly convolutions.
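model = AlexNet(lr=0.01)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)  # assumed settings; adjust to your hardware
trainer.fit(model, data)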

8.1.4 Discussion
AlexNetʼs structure bears a striking resemblance to LeNet, with a number of critical improve-
ments, both for accuracy (dropout) and for ease of training (ReLU). What is equally striking is
the amount of progress that has been made in terms of deep learning tooling. What was several
months of work in 2012 can now be accomplished in a dozen lines of code using any modern
framework.
Reviewing the architecture, we see that AlexNet has an Achilles heel when it comes to efficiency:
the last two hidden layers require matrices of size 6400 × 4096 and 4096 × 4096, respectively. This
corresponds to 164 MB of memory and 81 MFLOPs of computation, both of which are a nontriv-
ial outlay, especially on smaller devices, such as mobile phones. This is one of the reasons why
AlexNet has been surpassed by much more effective architectures that we will cover in the fol-
lowing sections. Nonetheless, it is a key step from shallow to deep networks that are used nowa-
days. Although it seems that there are only a few more lines in AlexNetʼs implementation than in
LeNetʼs, it took the academic community many years to embrace this conceptual change and take
advantage of its excellent experimental results. This was also due to the lack of efficient computa-
tional tools. At the time neither DistBelief (Dean et al., 2012) nor Caffe (Jia et al., 2014) existed, and
Theano (Bergstra et al., 2010) still lacked many distinguishing features. It was only the
availability of TensorFlow (Abadi et al., 2016) that changed this situation dramatically.
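The memory figure for the two hidden fully connected layers can be reproduced directly from their matrix shapes. The sketch below assumes 32-bit floating point parameters, ignores biases, and reads MB as 2^20 bytes. The function name is a hypothetical helper.

```python
def dense_cost(shapes, bytes_per_param=4):
    """Parameter count and memory (in MiB) of a stack of dense weight matrices."""
    params = sum(rows * cols for rows, cols in shapes)
    mib = params * bytes_per_param / 2**20
    return params, mib


params, mib = dense_cost([(6400, 4096), (4096, 4096)])
print(f"{params:,} parameters, {mib:.0f} MiB")
```

Each weight is used in one multiply-accumulate per example, so the same parameter count also gives the order of magnitude of the computation these two layers require in a forward pass.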
Exercises
1. Following up on the discussion above, analyze computational performance of AlexNet.
   1. Compute the memory footprint for convolutions and fully connected layers, respectively.
      Which one dominates?
   2. Calculate the computational cost for the convolutions and the fully connected layers.
   3. How does the memory bandwidth affect computation?
2. You are a chip designer and need to trade off computation and memory bandwidth. For example,
   a faster chip requires more chip area and more power, and more memory bandwidth
   requires more pins and control logic, thus also more area. How do you optimize?
3. Try increasing the number of epochs when training AlexNet. Compared with LeNet, how do
   the results differ? Why?
4. AlexNet may be too complex for the Fashion-MNIST dataset, in particular due to the low
   resolution of the initial images.
   1. Try simplifying the model to make the training faster, while ensuring that the accuracy
      does not drop significantly.
   2. Design a better model that works directly on 28 × 28 images.
5. Modify the batch size, and observe the changes in throughput (images/s), accuracy, and GPU
   memory.
6. Apply dropout and ReLU to LeNet-5. Does it improve? Can you improve things further by
   preprocessing to take advantage of the invariances inherent in the images?
8.2 Networks Using Blocks (VGG)
While AlexNet offered empirical evidence that deep CNNs can achieve good results, it did not
provide a general template to guide subsequent researchers in designing new networks. In the
following sections, we will introduce several heuristic concepts commonly used to design deep
networks.
Progress in this field mirrors that of VLSI (very large scale integration) in chip design where en-
gineers moved from placing transistors to logical elements to logic blocks (Mead, 1980). Sim-
ilarly, the design of neural network architectures has grown progressively more abstract, with
researchers moving from thinking in terms of individual neurons to whole layers, and now to
blocks, repeating patterns of layers.
The idea of using blocks first emerged from the Visual Geometry Group (VGG,
http://www.robots.ox.ac.uk/~vgg/) at Oxford University, in their eponymously named VGG network
(Simonyan and Zisserman, 2014). It is easy to
implement these repeated structures in code with any modern deep learning framework by using
loops and subroutines.
8.2.1 VGG Blocks
The basic building block of CNNs is a sequence of the following: (i) a convolutional layer with
padding to maintain the resolution, (ii) a nonlinearity such as a ReLU, (iii) a pooling layer such as
max-pooling to reduce the resolution. One of the problems with this approach is that the spatial
resolution decreases quite rapidly. In particular, this imposes a hard limit of log₂ d convolutional
layers on the network before all dimensions (d) are used up. For instance, in the case of ImageNet,
it would be impossible to have more than 8 convolutional layers in this way.
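A short calculation makes the limit concrete. The sketch below assumes that every convolution is followed by a pooling step that halves the resolution (rounding down), and it counts how many such steps fit before a 224-pixel side shrinks to a single pixel. The function name is illustrative.

```python
import math


def max_halvings(d):
    """Number of times a side length d can be halved before reaching 1."""
    count = 0
    while d > 1:
        d //= 2
        count += 1
    return count


print(math.log2(224))     # a little under 8
print(max_halvings(224))  # 224 -> 112 -> 56 -> 28 -> 14 -> 7 -> 3 -> 1
```

Both views agree with the bound in the text: with one downsampling step per convolution, a 224 × 224 image runs out of resolution after fewer than 8 layers.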
The key idea by Simonyan and Zisserman was to use multiple convolutions in between downsam-
pling via max-pooling in the form of a block. They were primarily interested in whether deep or
wide networks perform better. For instance, the successive application of two 3 × 3 convolutions
touches the same pixels as a single 5 × 5 convolution does. At the same time, the latter uses
approximately as many parameters (25 · c²) as three 3 × 3 convolutions do (3 · 9 · c²). In a rather
detailed analysis they showed that deep and narrow networks significantly outperform their
shallow counterparts. This set deep learning on a quest for ever deeper networks with over 100 layers
for typical applications. While stacking 3 × 3 convolutions has been a gold standard in later deep
networks, implementations of such operations have also been efficient on GPUs (Lavin and Gray,
2016).
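The trade-off between one wide kernel and a stack of 3 × 3 kernels can be tabulated with two small helpers. The sketch below assumes stride 1, c input and c output channels in every layer, and no bias terms. The function names and the choice c = 64 are illustrative.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Side length of the input window seen by a stack of stride-1 convolutions."""
    return num_layers * (kernel - 1) + 1


def conv_params(kernel, channels, num_layers=1):
    """Weights in `num_layers` convolutions with `channels` in and out (no bias)."""
    return num_layers * kernel * kernel * channels * channels


c = 64
print(stacked_receptive_field(2))          # two 3 x 3 layers see a 5 x 5 window
print(conv_params(5, c))                   # one 5 x 5 layer: 25 * c**2
print(conv_params(3, c, num_layers=2))     # two 3 x 3 layers: 18 * c**2
print(conv_params(3, c, num_layers=3))     # three 3 x 3 layers: 27 * c**2
```

Two stacked 3 × 3 layers therefore cover the same window as a 5 × 5 layer with fewer weights, and a third layer fits into roughly the same parameter budget while adding another nonlinearity.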
Back to VGG: a VGG block consists of a sequence of convolutions with 3 × 3 kernels with padding
of 1 (keeping height and width) followed by a 2 × 2 max-pooling layer with stride of 2 (halving
height and width after each block). In the code below, we define a function called vgg_block to
implement one VGG block.
The function below takes two arguments: the number of convolutional layers num_convs and the
number of output channels out_channels. The number of input channels does not need to be
specified because nn.LazyConv2d infers it the first time data passes through the layer.
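import torch
from torch import nn
from d2l import torch as d2l

def vgg_block(num_convs, out_channels):
    layers = []
    for _ in range(num_convs):
        # 3 x 3 convolution with padding 1 keeps height and width
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    # 2 x 2 max-pooling with stride 2 halves height and width
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)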
8.2.2 VGG Network
Like AlexNet and LeNet, the VGG Network can be partitioned into two parts: the first consisting
mostly of convolutional and pooling layers and the second consisting of fully connected layers that
are identical to those in AlexNet. The key difference is that the convolutional layers are grouped
in nonlinear transformations that leave the dimensionality unchanged, followed by a
resolution-reduction step, as depicted in Fig. 8.2.1.
Fig. 8.2.1: From AlexNet to VGG that is designed from building blocks.
The convolutional part of the network connects several VGG blocks from Fig. 8.2.1 (also defined
in the vgg_block function) in succession. This grouping of convolutions is a pattern that has re-
mained almost unchanged over the past decade, although the specific choice of operations has
undergone considerable modifications. The variable arch consists of a list of tuples (one
per block), where each contains two values: the number of convolutional layers and the number
of output channels, which are precisely the arguments required to call the vgg_block function.
As such, VGG defines a family of networks rather than just a specific manifestation. To build a
specific network we simply iterate over arch to compose the blocks.
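class VGG(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()  # keeps arch, lr and num_classes as attributes
        conv_blks = []
        # One VGG block per (num_convs, out_channels) tuple
        for (num_convs, out_channels) in arch:
            conv_blks.append(vgg_block(num_convs, out_channels))
        self.net = nn.Sequential(
            *conv_blks, nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes))
        self.net.apply(d2l.init_cnn)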
The original VGG network had 5 convolutional blocks, among which the first two have one convo-
lutional layer each and the latter three contain two convolutional layers each. The first block has
64 output channels and each subsequent block doubles the number of output channels, until that
number reaches 512. Since this network uses 8 convolutional layers and 3 fully connected layers,
it is often called VGG-11.

VGG(arch=((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))).layer_summary(
(1, 1, 224, 224))

Sequential output shape: torch.Size([1, 64, 112, 112])
As you can see, we halve height and width at each block, finally reaching a height and width of 7
before flattening the representations for processing by the fully connected part of the network.
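The layer count and the final resolution follow directly from the architecture tuple. The sketch below is a small bookkeeping helper, assuming one halving of height and width per block and three fully connected layers at the end. The function name and the dictionary keys are hypothetical.

```python
def summarize_arch(arch, input_size=224, num_fc=3):
    """Count weight layers and track the spatial size through the VGG blocks."""
    num_conv = sum(num_convs for num_convs, _ in arch)
    size = input_size
    for _ in arch:
        size //= 2  # each block ends with 2 x 2 max-pooling, stride 2
    last_channels = arch[-1][1]
    return {
        "weight_layers": num_conv + num_fc,
        "final_size": size,
        "flattened": last_channels * size * size,
    }


vgg11 = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
print(summarize_arch(vgg11))
```

For VGG-11 this gives 8 + 3 = 11 weight layers, 7 × 7 feature maps after the fifth block, and a flattened vector of 512 · 7 · 7 = 25088 values entering the first fully connected layer.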
8.2.3 Training
Since VGG-11 is computationally more demanding than AlexNet, we construct a network with a smaller
number of channels. This is more than sufficient for training on Fashion-MNIST. The model
training process is similar to that of AlexNet in Section 8.1.
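model = VGG(arch=((1, 16), (1, 32), (2, 64), (2, 128), (2, 128)), lr=0.01)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))  # same data setup as for AlexNet
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
d2l.Trainer(max_epochs=10, num_gpus=1).fit(model, data)  # assumed trainer settings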
8.2.4 Summary
One might argue that VGG is the first truly modern convolutional neural network. While AlexNet
introduced many of the components of what makes deep learning effective at scale, it is VGG that
arguably introduced key properties such as blocks of multiple convolutions and a preference for
deep and narrow networks. It is also the first network that is actually an entire family of similarly
parametrized models, giving the practitioner ample trade-off between complexity and speed. This
is also the place where modern deep learning frameworks shine. It is no longer necessary to
generate XML config files to specify a network; instead, we can assemble such networks through
simple Python code.
Very recently ParNet (Goyal et al., 2021) demonstrated that it is possible to achieve competitive
performance using a much shallower architecture through a large number of parallel computations.
This is an exciting development and there is hope that it will influence architecture designs
in the future. For the remainder of the chapter, though, we will follow the path of scientific
progress over the past decade.
Exercises
1. Compared with AlexNet, VGG is much slower in terms of computation, and it also needs
   more GPU memory.
   1. Compare the number of parameters needed for AlexNet and VGG.
   2. Compare the number of floating point operations used in the convolutional layers and
      in the fully connected layers.
   3. How could you reduce the computational cost created by the fully connected layers?
2. When displaying the dimensions associated with the various layers of the network, we only
see the information associated with 8 blocks (plus some auxiliary transforms), even though
the network has 11 layers. Where did the remaining 3 layers go?
3. Upsampling the resolution in Fashion-MNIST by a factor of 8 in each dimension (from 28 × 28
to 224 × 224 pixels) is highly wasteful. Try modifying the network architecture and resolution
conversion, e.g., to 56 × 56 or 84 × 84 pixels for its input instead. Can you do so without
reducing the accuracy of the network?
4. Use Table 1 in the VGG paper (Simonyan and Zisserman, 2014) to construct other common
models, such as VGG-16 or VGG-19.
8.3 Network in Network (NiN)
LeNet, AlexNet, and VGG all share a common design pattern: extract features exploiting spatial
structure via a sequence of convolutions and pooling layers and post-process the representations
via fully connected layers. The improvements upon LeNet by AlexNet and VGG mainly lie in how
these later networks widen and deepen these two modules.
This design poses two major challenges. First, the fully connected layers at the end of the archi-
tecture consume tremendous numbers of parameters. For instance, even a simple model such
as VGG-11 requires a monstrous 25088 × 4096 matrix, occupying almost 400MB of RAM. This is a
significant impediment to speedy computation, in particular on mobile and embedded devices.
Second, it is equally impossible to add fully connected layers earlier in the network to increase
the degree of nonlinearity: doing so would destroy the spatial structure and require potentially
even more memory.
The network in network (NiN) blocks of (Lin et al., 2013) offer an alternative, capable of solving
both problems in one simple strategy. They were proposed based on a very simple insight: (i) use
1 × 1 convolutions to add local nonlinearities across the channel activations and (ii) use global
average pooling to integrate across all locations in the last representation layer. Note that global
average pooling would not be effective, were it not for the added nonlinearities. Letʼs dive into this
in detail.
8.3.1 NiN Blocks
Recall Section 7.4.3. In it we discussed that the inputs and outputs of convolutional layers consist
of four-dimensional tensors with axes corresponding to the example, channel, height, and width.
Also recall that the inputs and outputs of fully connected layers are typically two-dimensional ten-
sors corresponding to the example and feature. The idea behind NiN is to apply a fully connected
layer at each pixel location (for each height and width). The resulting 1 × 1 convolution can be
thought of as a fully connected layer acting independently on each pixel location.
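The equivalence can be verified on a tiny example without any deep learning framework. The sketch below represents a single example as nested lists indexed by channel, row, and column, and omits bias terms and the batch axis for brevity. The function names and the toy numbers are illustrative.

```python
def conv1x1(x, weight):
    """1 x 1 convolution. x: [c_in][h][w], weight: [c_out][c_in] -> [c_out][h][w]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [[[sum(weight[o][i] * x[i][r][c] for i in range(c_in))
              for c in range(w)]
             for r in range(h)]
            for o in range(len(weight))]


def dense(vec, weight):
    """Fully connected layer without bias: weight @ vec."""
    return [sum(w_oi * v for w_oi, v in zip(row, vec)) for row in weight]


x = [[[1, 2], [3, 4]],              # channel 0
     [[5, 6], [7, 8]]]              # channel 1
weight = [[1, 0], [1, 1], [2, -1]]  # 3 output channels, 2 input channels

y = conv1x1(x, weight)
pixel = [x[0][1][0], x[1][1][0]]    # channel vector at row 1, column 0
print([y[o][1][0] for o in range(3)], dense(pixel, weight))
```

The two printed lists coincide: the 1 × 1 convolution applies the same weight matrix to the channel vector at every pixel, which is exactly a fully connected layer shared across locations.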
Fig. 8.3.1 illustrates the main structural differences between VGG and NiN, and their blocks. Note
both the difference in the NiN blocks (the initial convolution is followed by 1 × 1 convolutions,
whereas VGG retains 3 × 3 convolutions) and in the end where we no longer require a giant fully
connected layer.
Fig. 8.3.1: Comparing architectures of VGG and NiN, and their blocks.
import torch
from torch import nn
from d2l import torch as d2l

def nin_block(out_channels, kernel_size, strides, padding):
    return nn.Sequential(
        nn.LazyConv2d(out_channels, kernel_size, strides, padding),
        nn.ReLU(),
        # Two 1 x 1 convolutions act as per-pixel fully connected layers
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(out_channels, kernel_size=1), nn.ReLU())

8.3.2 NiN Model
NiN uses the same initial convolution sizes as AlexNet (it was proposed shortly thereafter). The
kernel sizes are 11 × 11, 5 × 5, and 3 × 3, respectively, and the numbers of output channels match
those of AlexNet. Each NiN block is followed by a max-pooling layer with a stride of 2 and a window
shape of 3 × 3.
The second significant difference between NiN and both AlexNet and VGG is that NiN avoids fully
connected layers altogether. Instead, NiN uses a NiN block with a number of output channels
equal to the number of label classes, followed by a global average pooling layer, yielding a vector
of logits. This design significantly reduces the number of required model parameters, albeit at
the expense of a potential increase in training time.
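Global average pooling itself has no parameters: it reduces each channel map to its mean. A minimal sketch on nested lists, assuming a single example stored as [channels][height][width], looks as follows. The function name and the toy values are illustrative.

```python
def global_avg_pool(x):
    """Average every channel map. x: [channels][h][w] -> list of per-channel means."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in x]


feature_maps = [[[1, 2], [3, 4]],   # channel for class 0
                [[0, 0], [0, 8]]]   # channel for class 1
print(global_avg_pool(feature_maps))
```

When the last NiN block outputs one channel per class, the pooled vector has one entry per class and can be used directly as the logits.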
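class NiN(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()  # keeps lr and num_classes as attributes
        self.net = nn.Sequential(
            nin_block(96, kernel_size=11, strides=4, padding=0),
            nn.MaxPool2d(3, stride=2),
            nin_block(256, kernel_size=5, strides=1, padding=2),
            nn.MaxPool2d(3, stride=2),
            nin_block(384, kernel_size=3, strides=1, padding=1),
            nn.MaxPool2d(3, stride=2),
            nn.Dropout(0.5),
            nin_block(num_classes, kernel_size=3, strides=1, padding=1),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten())
        self.net.apply(d2l.init_cnn)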
We create a data example to see the output shape of each block.
model = NiN()
X = torch.randn(1, 1, 224, 224)
for layer in model.net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)

Dropout output shape: torch.Size([1, 384, 5, 5])
AdaptiveAvgPool2d output shape: torch.Size([1, 10, 1, 1])
8.3.3 Training
As before we use Fashion-MNIST to train the model. NiNʼs training is similar to that for AlexNet
and VGG.
model = NiN(lr=0.05)
8.3.4 Summary
NiN has dramatically fewer parameters than AlexNet and VGG. This stems from the fact that it
needs no giant fully connected layers and fewer convolutions with wide kernels. Instead, it uses
local 1 × 1 convolutions and global average pooling. These design choices influenced many sub-
sequent CNN designs.
Exercises
1. Why are there two 1 × 1 convolutional layers per NiN block? What happens if you add one?
What happens if you reduce this to one?
2. What happens if you replace the global average pooling by a fully connected layer (speed,
accuracy, number of parameters)?
3. Calculate the resource usage for NiN.
   1. What is the number of parameters?
   2. What is the amount of computation?
   3. What is the amount of memory needed during training?
   4. What is the amount of memory needed during prediction?
4. What are possible problems with reducing the 384 × 5 × 5 representation to a 10 × 5 × 5
representation in one step?

5. Use the structural design decisions in VGG that led to VGG-11, VGG-16, and VGG-19 to design
a family of NiN-like networks.
8.4 Multi-Branch Networks (GoogLeNet)
In 2014, GoogLeNet won the ImageNet Challenge (Szegedy et al., 2015), using a structure that com-
bined the strengths of NiN (Lin et al., 2013), repeated blocks (Simonyan and Zisserman, 2014), and
a cocktail of convolution kernels. It is arguably also the first network that exhibits a clear distinc-
tion among the stem, body, and head in a CNN. This design pattern has persisted ever since in the
design of deep networks: the stem is given by the first 2-3 convolutions that operate on the image.
They extract low-level features from the underlying images. This is followed by a body of convo-
lutional blocks. Finally, the head maps the features obtained so far to the required classification,
segmentation, detection, or tracking problem at hand.
The key contribution in GoogLeNet was the design of the network body. It solved the problem
of selecting convolution kernels in an ingenious way. While other works tried to identify which
convolution, ranging from 1 × 1 to 11 × 11, would be best, it simply concatenated multi-branch
convolutions. In what follows we introduce a slightly simplified version of GoogLeNet. The
simplifications are due to the fact that tricks to stabilize training, in particular intermediate
loss functions, are no longer needed due to the availability of improved training algorithms.
8.4.1 Inception Blocks
The basic convolutional block in GoogLeNet is called an Inception block, stemming from the meme
“we need to go deeper” of the movie Inception.
Fig. 8.4.1: Structure of the Inception block.
As depicted in Fig. 8.4.1, the Inception block consists of four parallel branches. The first three
branches use convolutional layers with window sizes of 1 × 1, 3 × 3, and 5 × 5 to extract
information at different spatial scales. The middle two branches first apply a 1 × 1 convolution
to the input to reduce the number of channels, reducing the modelʼs complexity. The fourth branch
uses a 3 × 3 max-pooling layer, followed by a 1 × 1 convolutional layer to change the number of
channels. The four branches all use appropriate padding to give the input and output the same
height and width. Finally, the outputs along each branch are concatenated along the channel
dimension and comprise the blockʼs output. The commonly-tuned hyperparameters of the Inception
block are the number of output channels per layer.

import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

class Inception(nn.Module):
    # `c1`--`c4` are the number of output channels for each branch
    def __init__(self, c1, c2, c3, c4, **kwargs):
        super(Inception, self).__init__(**kwargs)
        # Branch 1
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        # Concatenate the four branch outputs along the channel dimension
        return torch.cat((b1, b2, b3, b4), dim=1)
To gain some intuition for why this network works so well, consider the combination of the filters.
They explore the image in a variety of filter sizes. This means that details at different extents can
be recognized efficiently by filters of different sizes. At the same time, we can allocate different
amounts of parameters for different filters.
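To make the saving from the 1 × 1 reductions concrete, the following sketch counts weights and biases in plain Python. The input width of 192 and the branch widths 64, (96, 128), (16, 32), and 32 are illustrative values (they match the first Inception block used later in this section). The helper names `conv_params` and `inception_params` are hypothetical and not part of any library.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size ** 2 + out_channels


def inception_params(in_channels, c1, c2, c3, c4):
    """Parameter count of an Inception block; max-pooling has no parameters."""
    branch1 = conv_params(in_channels, c1, 1)
    branch2 = conv_params(in_channels, c2[0], 1) + conv_params(c2[0], c2[1], 3)
    branch3 = conv_params(in_channels, c3[0], 1) + conv_params(c3[0], c3[1], 5)
    branch4 = conv_params(in_channels, c4, 1)
    return branch1 + branch2 + branch3 + branch4


in_channels = 192  # illustrative input width
with_reduction = inception_params(in_channels, 64, (96, 128), (16, 32), 32)
# Same output widths, but with the 3x3 and 5x5 kernels applied directly to
# the input, i.e., without the 1x1 reductions
without_reduction = (conv_params(in_channels, 64, 1)
                     + conv_params(in_channels, 128, 3)
                     + conv_params(in_channels, 32, 5)
                     + conv_params(in_channels, 32, 1))
print(with_reduction, without_reduction)  # 163696 393472
```

Under these assumptions the reductions cut the parameter count by more than half while leaving the number of output channels unchanged.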
8.4.2 GoogLeNet Model
As shown in Fig. 8.4.2, GoogLeNet uses a stack of 9 Inception blocks, arranged into 3
groups with max-pooling in between, and global average pooling in its head to generate its
estimates. Max-pooling between Inception blocks reduces the dimensionality. At its stem, the
first module is similar to AlexNet and LeNet.

Fig. 8.4.2: The GoogLeNet architecture.
We can now implement GoogLeNet piece by piece. Letʼs begin with the stem. The first module
uses a 64-channel 7 × 7 convolutional layer.
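class GoogleNet(d2l.Classifier):
    def b1(self):
        # Stride 2 and padding 3 follow the original GoogLeNet stem
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2, padding=1))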
The second module uses two convolutional layers: first, a 64-channel 1 × 1 convolutional layer,
followed by a 3 × 3 convolutional layer that triples the number of channels. This corresponds to
the second branch in the Inception block and concludes the design of the stem. At this point we
have 192 channels.
@d2l.add_to_class(GoogleNet)
def b2(self):
    return nn.Sequential(
        nn.LazyConv2d(64, kernel_size=1), nn.ReLU(),
        nn.LazyConv2d(192, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
The third module connects two complete Inception blocks in series. The number of output
channels of the first Inception block is 64 + 128 + 32 + 32 = 256. This amounts to a ratio of the
number of output channels among the four branches of 2 : 4 : 1 : 1. To achieve this, we first
reduce the input dimensions by 1/2 and by 1/12 in the second and third branch respectively to
arrive at 96 = 192/2 and 16 = 192/12 channels respectively.
The number of output channels of the second Inception block is increased to 128 + 192 + 96 + 64 =
480, yielding a ratio of 128 : 192 : 96 : 64 = 4 : 6 : 3 : 2. As before, we need to reduce the
number of intermediate dimensions in the second and third branch. A scale of 1/2 and 1/8
respectively suffices, yielding 128 and 32 channels respectively. This is captured by the
arguments of the following Inception block constructors.
@d2l.add_to_class(GoogleNet)
def b3(self):
    return nn.Sequential(Inception(64, (96, 128), (16, 32), 32),
                         Inception(128, (128, 192), (32, 96), 64),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
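The channel arithmetic for the third module can be checked with a few lines of plain Python. This is only a sketch of the bookkeeping: `simplest_ratio` is a hypothetical helper, and the input widths of 192 and 256 are the values quoted in the text.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def simplest_ratio(channels):
    """Divide a tuple of channel counts by their greatest common divisor."""
    divisor = reduce(gcd, channels)
    return tuple(c // divisor for c in channels)


# (input channels, c1, c2, c3, c4) for the two blocks of the third module
blocks = [(192, 64, (96, 128), (16, 32), 32),
          (256, 128, (128, 192), (32, 96), 64)]

for in_channels, c1, c2, c3, c4 in blocks:
    outputs = (c1, c2[1], c3[1], c4)
    # Total width, branch ratio, and the reduction factors of the 1x1
    # convolutions in the second and third branch
    print(sum(outputs), simplest_ratio(outputs),
          Fraction(c2[0], in_channels), Fraction(c3[0], in_channels))
```

The two printed rows read 256, (2, 4, 1, 1), 1/2, 1/12 and 480, (4, 6, 3, 2), 1/2, 1/8, matching the numbers above.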
The fourth module is more complicated. It connects five Inception blocks in series, and they have
192 + 208 + 48 + 64 = 512, 160 + 224 + 64 + 64 = 512, 128 + 256 + 64 + 64 = 512, 112 + 288 + 64 + 64 =
528, and 256 + 320 + 128 + 128 = 832 output channels, respectively. The number of channels
assigned to these branches is similar to that in the third module: the second branch with the
3 × 3 convolutional layer outputs the largest number of channels, followed by the first branch
with only the 1 × 1 convolutional layer, the third branch with the 5 × 5 convolutional layer, and the
fourth branch with the 3×3 max-pooling layer. The second and third branches will first reduce the
number of channels according to the ratio. These ratios are slightly different in different Inception
blocks.
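@d2l.add_to_class(GoogleNet)
def b4(self):
    return nn.Sequential(Inception(192, (96, 208), (16, 48), 64),
                         Inception(160, (112, 224), (24, 64), 64),
                         Inception(128, (128, 256), (24, 64), 64),
                         Inception(112, (144, 288), (32, 64), 64),
                         Inception(256, (160, 320), (32, 128), 128),
                         nn.MaxPool2d(kernel_size=3, stride=2, padding=1))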
The fifth module has two Inception blocks with 256 + 320 + 128 + 128 = 832 and 384 + 384 + 128 +
128 = 1024 output channels. The number of channels assigned to each branch is the same as that
in the third and fourth modules, but differs in specific values. It should be noted that the fifth
block is followed by the output layer. This block uses the global average pooling layer to change
the height and width of each channel to 1, just as in NiN. Finally, we turn the output into a two-
dimensional array followed by a fully connected layer whose number of outputs is the number of
label classes.
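@d2l.add_to_class(GoogleNet)
def b5(self):
    return nn.Sequential(Inception(256, (160, 320), (32, 128), 128),
                         Inception(384, (192, 384), (48, 128), 128),
                         nn.AdaptiveAvgPool2d((1,1)), nn.Flatten())

@d2l.add_to_class(GoogleNet)
def __init__(self, lr=0.1, num_classes=10):
    super(GoogleNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1(), self.b2(), self.b3(), self.b4(),
                             self.b5(), nn.LazyLinear(num_classes))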
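As a sanity check on the channel counts quoted for modules three to five, the sketch below sums the per-branch output channels of all nine Inception blocks and follows the width from one block to the next. It is plain bookkeeping rather than a model: the dictionary `modules` is a hypothetical container, and `num_classes = 10` assumes Fashion-MNIST.

```python
# Output channels (branch 1, branch 2, branch 3, branch 4) of the nine
# Inception blocks, grouped by module, as quoted in the text
modules = {
    3: [(64, 128, 32, 32), (128, 192, 96, 64)],
    4: [(192, 208, 48, 64), (160, 224, 64, 64), (128, 256, 64, 64),
        (112, 288, 64, 64), (256, 320, 128, 128)],
    5: [(256, 320, 128, 128), (384, 384, 128, 128)],
}

in_channels = 192  # width after the second module
for module, blocks in modules.items():
    for branches in blocks:
        out_channels = sum(branches)  # concatenation along the channel axis
        print(f'module {module}: {in_channels} -> {out_channels}')
        in_channels = out_channels  # pooling leaves the width unchanged

num_classes = 10  # assumption: Fashion-MNIST
# Weights plus biases of the final fully connected layer
head_params = in_channels * num_classes + num_classes
print('linear head parameters:', head_params)  # 10250
```

Because global average pooling collapses each of the 1024 channels to a single value, the head needs only about ten thousand parameters.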
The GoogLeNet model is computationally complex. Note the large number of relatively arbitrary
hyperparameters in terms of the number of channels chosen. This work was done before scien-
tists started using automatic tools to optimize network designs.
For now the only modification we will carry out is to reduce the input height and width from 224
to 96 to have a reasonable training time on Fashion-MNIST. This simplifies the computation. Letʼs
have a look at the changes in the shape of the output between the various modules.
model = GoogleNet().layer_summary((1, 1, 96, 96))

Sequential output shape: torch.Size([1, 1024])
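The spatial sizes behind this summary can be traced by hand with the usual output-size formula for convolution and pooling. The sketch below makes two assumptions that are not spelled out in the text: the 7 × 7 stem convolution uses stride 2 and padding 3 (as in the original GoogLeNet), and each of the first four modules ends with a 3 × 3 max-pooling layer with stride 2 and padding 1. All other layers are padded so that they preserve height and width. The helper `out_size` is hypothetical.

```python
def out_size(n, kernel_size, stride, padding):
    """Output height/width of a convolution or pooling layer."""
    return (n + 2 * padding - kernel_size) // stride + 1


size = 96
trace = [size]
# Assumed stem: 7x7 convolution with stride 2 and padding 3
size = out_size(size, kernel_size=7, stride=2, padding=3)
trace.append(size)
# Assumed: one 3x3 max-pooling layer (stride 2, padding 1) closes each of
# modules 1 to 4
for _ in range(4):
    size = out_size(size, kernel_size=3, stride=2, padding=1)
    trace.append(size)
print(trace)  # [96, 48, 24, 12, 6, 3]
```

Under these assumptions the fifth module receives 3 × 3 feature maps, which global average pooling then reduces to 1 × 1.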
8.4.3 Training
As before, we train our model using the Fashion-MNIST dataset. We transform it to 96 × 96 pixel
resolution before invoking the training procedure.
model = GoogleNet(lr=0.01)

8.4.4 Discussion
A key feature of GoogLeNet is that it is actually cheaper to compute than its predecessors while si-
multaneously providing improved accuracy. This marks the beginning of a much more deliberate
network design that trades off the cost of evaluating a network with a reduction in errors. It also
marks the beginning of experimentation at a block level with network design hyperparameters,
even though it was entirely manual at the time. This is largely due to the fact that deep learn-
ing frameworks in 2015 still lacked much of the design flexibility that we now take for granted.
Moreover, full network optimization is costly and at the time training on ImageNet still proved
computationally challenging.
Over the following sections we will encounter a number of design choices (e.g., batch normaliza-
tion, residual connections, and channel grouping) that allow us to improve networks significantly.
For now, you can be proud to have implemented what is arguably the first truly modern CNN.
Exercises
1. GoogLeNet was so successful that it went through a number of iterations. There are several
iterations of GoogLeNet that progressively improved speed and accuracy. Try to implement
and run some of them. They include the following:
   1. Add a batch normalization layer (Ioffe and Szegedy, 2015), as described later in Section 8.5.
   2. Make adjustments to the Inception block (width, choice and order of convolutions), as
      described in (Szegedy et al., 2016).
   3. Use label smoothing for model regularization, as described in (Szegedy et al., 2016).
   4. Make further adjustments to the Inception block by adding residual connection (Szegedy et
      al., 2017), as described later in Section 8.6.
2. What is the minimum image size for GoogLeNet to work?
3. Can you design a variant of GoogLeNet that works on Fashion-MNISTʼs native resolution of
28 × 28 pixels? How would you need to change the stem, the body, and the head of the
network, if anything at all?
4. Compare the model parameter sizes of AlexNet, VGG, NiN, and GoogLeNet. How do the
latter two network architectures significantly reduce the model parameter size?
5. Compare the amount of computation needed in GoogLeNet and AlexNet. How does this
affect the design of an accelerator chip, e.g., in terms of memory size, amount of computation,
and the benefit of specialized operations?
Discussions

8.5 Batch Normalization
Training deep neural networks is difficult. Getting them to converge in a reasonable amount of
time can be tricky. In this section, we describe batch normalization, a popular and effective tech-
nique that consistently accelerates the convergence of deep networks (Ioffe and Szegedy, 2015).
Together with residual blocks—covered later in Section 8.6—batch normalization has made it pos-
sible for practitioners to routinely train networks with over 100 layers. A secondary (serendipi-
tous) benefit of batch normalization is its inherent regularization.
8.5.1 Training Deep Networks
When working with data, we often preprocess before training. Choices regarding data
preprocessing often make an enormous difference in the final results. Recall our application of
MLPs to predicting house prices (Section 5.7). Our first step when working with real data was to
standardize our input features to have zero mean µ = 0 and unit variance Σ = 1 across multiple
observations (Friedman, 1987). At a minimum, one frequently rescales the data such that the
diagonal is unity, i.e., Σii = 1. Yet another strategy is to rescale vectors to unit length,
possibly with zero mean per observation. This can work well, e.g., for spatial sensor data. These
preprocessing techniques and many more are beneficial to keep the estimation problem well
controlled. See, e.g., the articles in (Guyon et al., 2008) for a review of feature selection and
extraction techniques.
Intuitively, this standardization plays nicely with our optimizers since it puts the parameters a
priori at a similar scale. As such, it is only natural to ask whether a corresponding normalization
step inside a deep network might not be beneficial. While this isnʼt quite the reasoning that led to
the invention of batch normalization (Ioffe and Szegedy, 2015), it is a useful way of understanding
it and its cousin, layer normalization (Ba et al., 2016) within a unified framework.
Second, for a typical MLP or CNN, as we train, the variables (e.g., affine transformation outputs in
MLP) in intermediate layers may take values with widely varying magnitudes: both along the lay-
ers from the input to the output, across units in the same layer, and over time due to our updates to
the model parameters. The inventors of batch normalization postulated informally that this drift
in the distribution of such variables could hamper the convergence of the network. Intuitively, we
might conjecture that if one layer has variable activations that are 100 times that of another layer,
this might necessitate compensatory adjustments in the learning rates. Adaptive solvers such as
AdaGrad (Duchi et al., 2011), Adam (Kingma and Ba, 2014), and Yogi (Zaheer et al., 2018) aim to
address this from the viewpoint of optimization. The alternative is to prevent the problem from
occurring, simply by adaptive normalization.
Third, deeper networks are complex and tend to overfit more easily. This means that
regularization becomes more critical. A common technique for regularization is noise
injection. This has been known for a long time, e.g., with regard to noise injection for the
inputs (Bishop, 1995). It also forms the basis of dropout (Section 5.6). As it turns out, quite
serendipitously, batch normalization conveys all three benefits: preprocessing, numerical
stability, and regularization.
Batch normalization is applied to individual layers, or optionally, to all of them: In each training
iteration, we first normalize the inputs (of batch normalization) by subtracting their mean and di-
viding by their standard deviation, where both are estimated based on the statistics of the current
minibatch. Next, we apply a scale coefficient and an offset to recover the lost degrees of freedom.
It is precisely due to this normalization based on batch statistics that batch normalization derives its
name.
Note that if we tried to apply batch normalization with minibatches of size 1, we would not be able
to learn anything. That is because after subtracting the means, each hidden unit would take value
0. As you might guess, since we are devoting a whole section to batch normalization, with large
enough minibatches, the approach proves effective and stable. One takeaway here is that when
applying batch normalization, the choice of batch size is even more significant than without batch
normalization, or at least, suitable calibration is needed as we adjust it.
Formally, denoting by x ∈ B an input to batch normalization (BN) that is from a minibatch B,
batch normalization transforms x according to the following expression:
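$$\mathrm{BN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B}}{\hat{\boldsymbol{\sigma}}_\mathcal{B}} + \boldsymbol{\beta}. \tag{8.5.1}$$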
In (8.5.1), µ̂B is the sample mean and σ̂ B is the sample standard deviation of the minibatch B. After
applying standardization, the resulting minibatch has zero mean and unit variance. The choice
of unit variance (vs. some other magic number) is an arbitrary choice. We recover this degree of
freedom by including an elementwise scale parameter γ and shift parameter β that have the same
shape as x. Both are parameters that need to be learned as part of model training.
Consequently, the variable magnitudes for intermediate layers cannot diverge during training be-
cause batch normalization actively centers and rescales them back to a given mean and size (via
µ̂B and σ̂ B ). Practical experience confirms that, as alluded when discussing feature rescaling,
batch normalization seems to allow for more aggressive learning rates. Formally, we calculate
µ̂B and σ̂ B in (8.5.1) as follows:
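$$\hat{\boldsymbol{\mu}}_\mathcal{B} = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} \mathbf{x} \quad \text{and} \quad \hat{\boldsymbol{\sigma}}_\mathcal{B}^2 = \frac{1}{|\mathcal{B}|} \sum_{\mathbf{x} \in \mathcal{B}} (\mathbf{x} - \hat{\boldsymbol{\mu}}_\mathcal{B})^2 + \epsilon. \tag{8.5.2}$$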
Note that we add a small constant ϵ > 0 to the variance estimate to ensure that we never attempt
division by zero, even in cases where the empirical variance estimate might be very small or
vanish entirely. The estimates µ̂B and σ̂ B counteract the scaling issue by using noisy estimates
of mean and variance. You might think that this noisiness should be a problem.
As it turns out, this is actually beneficial.
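To see (8.5.1) and (8.5.2) at work on numbers, here is a minimal sketch for a minibatch of scalars in plain Python. The function name `batch_norm_1d` and the default values for `gamma`, `beta`, and `eps` are assumptions made for illustration. A real layer operates on tensors and learns `gamma` and `beta`.

```python
import math


def batch_norm_1d(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply (8.5.1) and (8.5.2) to a minibatch of scalars."""
    mean = sum(batch) / len(batch)
    # eps is added to the variance estimate, as in (8.5.2)
    var = sum((x - mean) ** 2 for x in batch) / len(batch) + eps
    return [gamma * (x - mean) / math.sqrt(var) + beta for x in batch]


out = batch_norm_1d([1.0, 2.0, 3.0, 4.0])
print(out)  # zero mean and (up to eps) unit variance

# A minibatch of size 1 carries no signal: x equals its own mean
print(batch_norm_1d([42.0]))  # [0.0]

# A constant minibatch has zero variance; eps prevents division by zero
print(batch_norm_1d([5.0, 5.0, 5.0]))  # [0.0, 0.0, 0.0]
```

The last two calls illustrate the two corner cases discussed above: a single example is mapped to zero regardless of its value, and ϵ keeps the division well defined when the variance vanishes.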
Beneficial noise turns out to be a recurring theme in deep learning. For reasons that are not yet
well-characterized theoretically, various sources of noise in optimization often lead to faster
training and less overfitting: this variation appears to act as a form of regularization. Teye et
al. (2018) and Luo et al. (2018) relate the properties of batch normalization to Bayesian priors
and penalties respectively. In particular, this sheds some light on the puzzle of why batch
normalization works best for moderate minibatch sizes in the 50 to 100 range. This particular size
of minibatch seems to inject just the “right amount” of noise per layer: a larger minibatch
regularizes less due to the more stable estimates, whereas tiny minibatches destroy useful signal
due to high variance. Exploring this direction further, considering alternative types of
preprocessing and filtering may yet lead to other effective types of regularization.
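How noisy the minibatch statistics are can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that activations follow a standard Gaussian distribution, and it measures how much the minibatch mean fluctuates from batch to batch. The function `spread_of_batch_means`, the seed, and the chosen batch sizes are all hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the simulation is repeatable


def spread_of_batch_means(batch_size, num_batches=500):
    """Standard deviation of the minibatch mean across many minibatches."""
    means = []
    for _ in range(num_batches):
        samples = [random.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(samples) / batch_size)
    return statistics.pstdev(means)


spreads = {size: spread_of_batch_means(size) for size in (4, 64, 1024)}
for size, spread in spreads.items():
    print(f'batch size {size:4d}: spread of the mean estimate {spread:.3f}')
```

The spread shrinks roughly with the square root of the batch size, so small minibatches inject far more noise into the normalization than large ones.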
Fixing a trained model, you might think that we would prefer using the entire dataset to estimate
the mean and variance. Once training is complete, why would we want the same image to be clas-
sified differently, depending on the batch in which it happens to reside? During training, such
exact calculation is infeasible because the intermediate variables for all data examples change ev-
ery time we update our model. However, once the model is trained, we can calculate the means
and variances of each layerʼs variables based on the entire dataset. Indeed this is standard practice
for models employing batch normalization and thus batch normalization layers function differ-
ently in training mode (normalizing by minibatch statistics) and in prediction mode (normalizing
by dataset statistics). In this form they closely resemble the behavior of dropout regularization of
Section 5.6, where noise is only injected during training.
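The difference between the two modes can be seen on a single number. In the sketch below, the same value `x` is normalized inside two different minibatches (training mode) and then with fixed statistics (prediction mode). The two minibatches and the helper names `stats` and `normalize` are made up for illustration, and the statistics of the six values merely stand in for statistics of an entire dataset.

```python
import math


def stats(values):
    """Mean and (biased) variance of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def normalize(x, mean, var, eps=1e-5):
    return (x - mean) / math.sqrt(var + eps)


x = 1.0
batch_a = [x, 2.0, 3.0]
batch_b = [x, 0.0, -4.0]

# Training mode: the result for x depends on its companions in the minibatch
train_a = normalize(x, *stats(batch_a))
train_b = normalize(x, *stats(batch_b))

# Prediction mode: fixed statistics give one answer regardless of the batch
dataset_mean, dataset_var = stats(batch_a + batch_b)
predict = normalize(x, dataset_mean, dataset_var)
print(train_a, train_b, predict)
```

Here `x` lies below the mean of the first minibatch and above the mean of the second, so its normalized value even changes sign between the two, whereas the fixed statistics yield a single value.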

8.5.2 Batch Normalization Layers
We are now ready to take a look at how batch normalization works in practice. Batch normaliza-
tion implementations for fully connected layers and convolutional layers are slightly different.
One key difference between batch normalization and other layers is that because batch normal-
ization operates on a full minibatch at a time, we cannot just ignore the batch dimension as we
did before when introducing other layers.
Fully Connected Layers
When applying batch normalization to fully connected layers, the original paper (Ioffe and
Szegedy, 2015) inserted batch normalization after the affine transformation and before the
nonlinear activation function. Later applications experimented with inserting batch normalization
right after activation functions. Denoting the input to the fully connected layer by x, the affine
transformation by Wx + b (with the weight parameter W and the bias parameter b), and the
activation function by ϕ, we can express the computation of a batch-normalization-enabled, fully
connected layer output h as follows:
h = ϕ(BN(Wx + b)). (8.5.3)
Recall that mean and variance are computed on the same minibatch on which the transformation
is applied.
Convolutional Layers
Similarly, with convolutional layers, we can apply batch normalization after the convolution and
before the nonlinear activation function. The key difference from batch normalization in fully
connected layers is that we apply the operation on a per-channel basis across all locations. This is
compatible with our assumption of translation invariance that led to convolutions: we assumed
that the specific location of a pattern within an image was not critical for the purpose of under-
standing.
Assume that our minibatches contain m examples and that for each channel, the output of the con-
volution has height p and width q. For convolutional layers, we carry out each batch normalization
over the m · p · q elements per output channel simultaneously. Thus, we collect the values over all
spatial locations when computing the mean and variance and consequently apply the same mean
and variance within a given channel to normalize the value at each spatial location. Each channel
has its own scale and shift parameters, both of which are scalars.
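The per-channel statistics can be spelled out on a tiny example. The sketch below uses nested Python lists in place of a tensor, with m = 2 examples, 2 channels, and p × q = 2 × 2 feature maps. The values and the helper `channel_stats` are invented for illustration.

```python
# A minibatch stored as nested lists indexed [example][channel][row][column]
X = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 10.0], [10.0, 10.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[20.0, 20.0], [20.0, 20.0]]],
]


def channel_stats(X, channel):
    """Count, mean, and variance over the m * p * q values of one channel."""
    values = [v for example in X for row in example[channel] for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return len(values), mean, var


for c in range(2):
    print(c, channel_stats(X, c))
# 0 (8, 4.5, 5.25)
# 1 (8, 15.0, 25.0)
```

Each channel pools m · p · q = 8 values into a single mean and a single variance, which are then applied at every spatial location of that channel.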
Note that in the context of convolutions the batch normalization is well-defined even for
minibatches of size 1: after all, we have all the locations across an image to average.
Consequently, mean and variance are well defined, even if itʼs just within a single observation.
This consideration led Ba et al. (2016) to introduce the notion of layer normalization. It works
just like batch normalization, except that it is applied to one observation at a time. There are
cases where layer normalization improves the accuracy of a model. We skip further details and
recommend that interested readers consult the original paper.

Batch Normalization During Prediction
As we mentioned earlier, batch normalization typically behaves differently in training mode and
prediction mode. First, the noise in the sample mean and the sample variance arising from
estimating each on minibatches is no longer desirable once we have trained the model. Second,
we might not have the luxury of computing per-batch normalization statistics. For example, we
might need to apply our model to make one prediction at a time.
Typically, after training, we use the entire dataset to compute stable estimates of the variable
statistics and then fix them at prediction time. Consequently, batch normalization behaves dif-
ferently during training and at test time. Recall that dropout also exhibits this characteristic.
8.5.3 Implementation from Scratch
To see how batch normalization works in practice, we implement one from scratch below.
import torch
from torch import nn
from d2l import torch as d2l

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum):
    # Use `is_grad_enabled` to determine whether the current mode is training
    # mode or prediction mode
    if not torch.is_grad_enabled():
        # If it is prediction mode, directly use the mean and variance
        # obtained by moving average
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        assert len(X.shape) in (2, 4)
        if len(X.shape) == 2:
            # When using a fully connected layer, calculate the mean and
            # variance on the feature dimension
            mean = X.mean(dim=0)
            var = ((X - mean) ** 2).mean(dim=0)
        else:
            # When using a two-dimensional convolutional layer, calculate the
            # mean and variance on the channel dimension (axis=1). Here we
            # need to maintain the shape of `X`, so that the broadcasting
            # operation can be carried out later
            mean = X.mean(dim=(0, 2, 3), keepdim=True)
            var = ((X - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
        # In training mode, the current mean and variance are used for the
        # standardization
        X_hat = (X - mean) / torch.sqrt(var + eps)
        # Update the mean and variance using moving average
        moving_mean = momentum * moving_mean + (1.0 - momentum) * mean
        moving_var = momentum * moving_var + (1.0 - momentum) * var
    Y = gamma * X_hat + beta  # Scale and shift
    return Y, moving_mean.data, moving_var.data
We can now create a proper BatchNorm layer. Our layer will maintain proper parameters for scale
gamma and shift beta, both of which will be updated in the course of training. Additionally, our
layer will maintain moving averages of the means and variances for subsequent use during model
prediction.

Putting aside the algorithmic details, note the design pattern underlying our implementation of
the layer. Typically, we define the mathematics in a separate function, say batch_norm. We then
integrate this functionality into a custom layer, whose code mostly addresses bookkeeping mat-
ters, such as moving data to the right device context, allocating and initializing any required vari-
ables, keeping track of moving averages (here for mean and variance), and so on. This pattern
enables a clean separation of mathematics from boilerplate code. Also note that for the sake of
convenience we did not worry about automatically inferring the input shape here, thus we need
to specify the number of features throughout. By now all modern deep learning frameworks offer
automatic detection of size and shape in the high-level batch normalization APIs (in practice we
will use this instead).
class BatchNorm(nn.Module):
    # `num_features`: the number of outputs for a fully connected layer
    # or the number of output channels for a convolutional layer. `num_dims`:
    # 2 for a fully connected layer and 4 for a convolutional layer
    def __init__(self, num_features, num_dims):
        super().__init__()
        if num_dims == 2:
            shape = (1, num_features)
        else:
            shape = (1, num_features, 1, 1)
        # The scale parameter and the shift parameter (model parameters) are
        # initialized to 1 and 0, respectively
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        # The variables that are not model parameters are initialized to 0 and 1
        self.moving_mean = torch.zeros(shape)
        self.moving_var = torch.ones(shape)

    def forward(self, X):
        # If `X` is not on the main memory, copy `moving_mean` and
        # `moving_var` to the device where `X` is located
        if self.moving_mean.device != X.device:
            self.moving_mean = self.moving_mean.to(X.device)
            self.moving_var = self.moving_var.to(X.device)
        # Save the updated `moving_mean` and `moving_var`
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean,
            self.moving_var, eps=1e-5, momentum=0.9)
        return Y
Note that we used the variable momentum to govern the aggregation over past mean and variance
estimates. This is somewhat of a misnomer as it has nothing whatsoever to do with the momentum
term of optimization in Section 12.6. Nonetheless, it is the commonly adopted name for this term
and in deference to API naming convention we use the same variable name in our code, too.
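The role of `momentum` is easiest to see by iterating the moving-average update on a scalar. In the sketch below the minibatch statistic is held fixed at 2.0, which is an artificial assumption (in training it fluctuates from batch to batch), and `update` is a hypothetical helper that mirrors the update rule used in `batch_norm`.

```python
def update(moving, batch_stat, momentum=0.9):
    """One exponential-moving-average step."""
    return momentum * moving + (1.0 - momentum) * batch_stat


moving_mean = 0.0  # the layer initializes its moving mean to 0
history = []
for step in range(50):
    moving_mean = update(moving_mean, batch_stat=2.0)
    history.append(moving_mean)

# After n steps the estimate equals 2 * (1 - 0.9 ** n)
print(history[0], history[9], history[-1])
```

With a momentum of 0.9 each step closes one tenth of the remaining gap, so the moving average needs a few dozen minibatches before it forgets its initial value.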

8.5.4 Applying Batch Normalization in LeNet
To see how to apply BatchNorm in context, below we apply it to a traditional LeNet model (Section
7.6). Recall that batch normalization is applied after the convolutional layers or fully connected
layers but before the corresponding activation functions.
class BNLeNetScratch(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5), BatchNorm(6, num_dims=4),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), BatchNorm(16, num_dims=4),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(), nn.LazyLinear(120),
            BatchNorm(120, num_dims=2), nn.Sigmoid(), nn.LazyLinear(84),
            BatchNorm(84, num_dims=2), nn.Sigmoid(),
            nn.LazyLinear(num_classes))
As before, we will train our network on the Fashion-MNIST dataset. This code is virtually identical
to that when we first trained LeNet (Section 7.6).

model = BNLeNetScratch(lr=0.1)
Letʼs have a look at the scale parameter gamma and the shift parameter beta learned from the first
batch normalization layer.
model.net[1].gamma.reshape((-1,)), model.net[1].beta.reshape((-1,))
(tensor([2.6958, 1.3935, 1.7013, 2.1176, 2.0970, 1.3059], device='cuda:0',
        grad_fn=<ReshapeAliasBackward0>),
 tensor([ 1.4793, 0.2669, -1.5393, -0.2497, 1.0459, -0.1274], device='cuda:0',
        grad_fn=<ReshapeAliasBackward0>))
8.5.5 Concise Implementation
Instead of the BatchNorm class that we just defined ourselves, we can directly use the BatchNorm
class defined in the high-level APIs of the deep learning framework. The code looks virtually
identical to our implementation above, except that we no longer need to provide additional
arguments for it to get the dimensions right.
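class BNLeNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5), nn.LazyBatchNorm2d(),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.LazyBatchNorm2d(),
            nn.Sigmoid(), nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(), nn.LazyLinear(120), nn.LazyBatchNorm1d(),
            nn.Sigmoid(), nn.LazyLinear(84), nn.LazyBatchNorm1d(),
            nn.Sigmoid(), nn.LazyLinear(num_classes))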
Below, we use the same hyperparameters to train our model. Note that as usual, the high-level API
variant runs much faster because its code has been compiled to C++ or CUDA while our custom
implementation must be interpreted by Python.

model = BNLeNet(lr=0.1)

8.5.6 Discussion
Intuitively, batch normalization is thought to make the optimization landscape smoother. How-
ever, we must be careful to distinguish between speculative intuitions and true explanations for
the phenomena that we observe when training deep models. Recall that we do not even know why
simpler deep neural networks (MLPs and conventional CNNs) generalize well in the first place.
Even with dropout and weight decay, they remain so flexible that their ability to generalize to
unseen data likely needs significantly more refined learning-theoretic generalization guarantees.
In the original paper proposing batch normalization, the authors, in addition to introducing a
powerful and useful tool, offered an explanation for why it works: by reducing internal covari-
ate shift. Presumably by internal covariate shift the authors meant something like the intuition
expressed above—the notion that the distribution of variable values changes over the course of
training. However, there were two problems with this explanation: i) This drift is very different
from covariate shift, rendering the name a misnomer. ii) The explanation offers an under-specified
intuition but leaves open the question of why precisely this technique works, which still awaits
a rigorous explanation. Throughout this book, we aim to convey the intuitions that practitioners
use to guide their development of deep neural networks. However, we believe that it is important
to separate these guiding intuitions from established scientific fact. Eventually, when you master
this material and start writing your own research papers you will want to clearly delineate
between technical claims and hunches.
Following the success of batch normalization, its explanation in terms of internal covariate shift
has repeatedly surfaced in debates in the technical literature and broader discourse about how to
present machine learning research. In a memorable speech given while accepting a Test of Time
Award at the 2017 NeurIPS conference, Ali Rahimi used internal covariate shift as a focal point in an
argument likening the modern practice of deep learning to alchemy. Subsequently, the example
was revisited in detail in a position paper outlining troubling trends in machine learning (Lipton
and Steinhardt, 2018). Other authors have proposed alternative explanations for the success of
batch normalization, some claiming that batch normalizationʼs success comes despite exhibiting
behavior that is in some ways opposite to those claimed in the original paper (Santurkar et al.,
2018).
We note that the internal covariate shift is no more worthy of criticism than any of thousands of
similarly vague claims made every year in the technical machine learning literature. Likely, its
resonance as a focal point of these debates owes to its broad recognizability to the target audience.
Batch normalization has proven an indispensable method, applied in nearly all deployed image
classifiers, earning the paper that introduced the technique tens of thousands of citations. We
conjecture, though, that the guiding principles of regularization through noise injection, acceler-
ation through rescaling and lastly preprocessing may well lead to further inventions of layers and
techniques in the future.
On a more practical note, there are a number of aspects worth remembering about batch
normalization:
* During model training, batch normalization continuously adjusts the intermediate
output of the network by utilizing the mean and standard deviation of the minibatch, so that the
values of the intermediate output in each layer throughout the neural network are more stable.
* Batch normalization for fully connected layers and convolutional layers are slightly different.
In fact, for convolutional layers, layer normalization can sometimes be used as an alternative.
* Like a dropout layer, batch normalization layers have different behaviors in training mode and
prediction mode.
* Batch normalization is useful for regularization and improving convergence
in optimization. On the other hand, the original motivation of reducing internal covariate shift
seems not to be a valid explanation.

Exercises
1. Can we remove the bias parameter from the fully connected layer or the convolutional layer
before the batch normalization? Why?
2. Compare the learning rates for LeNet with and without batch normalization.
   1. Plot the increase in validation accuracy.
   2. How large can you make the learning rate before the optimization fails in both cases?
3. Do we need batch normalization in every layer? Experiment with it.
4. Implement a “lite” version of batch normalization that only removes the mean, or alterna-
tively one that only removes the variance. How does it behave?
5. Fix the parameters beta and gamma, and observe and analyze the results.
6. Can you replace dropout by batch normalization? How does the behavior change?
7. Review the online documentation for BatchNorm from the high-level APIs to see some other
use cases for it.
8. Research ideas: think of other normalization transforms that you can apply:
1. Can you apply the probability integral transform?
2. Can you use a full rank covariance estimate? Why not?
3. Does a sparsification compression act as a regularizer?
Discussions132
8.6 Residual Networks (ResNet) and ResNeXt
As we design increasingly deeper networks it becomes imperative to understand how adding lay-
ers can increase the complexity and expressiveness of the network. Even more important is the
ability to design networks where adding layers makes networks strictly more expressive rather
than just different. To make some progress we need a bit of mathematics.
8.6.1 Function Classes
Consider F, the class of functions that a specific network architecture (together with learning
rates and other hyperparameter settings) can reach. That is, for all f ∈ F there exists some set of
parameters (e.g., weights and biases) that can be obtained through training on a suitable dataset.
Letʼs assume that f ∗ is the “truth” function that we really would like to find. If it is in F, we are in
good shape but typically we will not be quite so lucky. Instead, we will try to find some fF∗ which
is our best bet within F. For instance, given a dataset with features X and labels y, we might try
finding it by solving the following optimization problem:
f^*_\mathcal{F} \stackrel{\mathrm{def}}{=} \operatorname*{argmin}_f L(\mathbf{X}, \mathbf{y}, f) \text{ subject to } f \in \mathcal{F}. (8.6.1)
We know that regularization (Morozov, 2012, Tikhonov and Arsenin, 1977) may control complexity
of F and achieve consistency, so a larger size of training data generally leads to better fF∗ . It is only
reasonable to assume that if we design a different and more powerful architecture F ′ we should
arrive at a better outcome. In other words, we would expect that fF∗ ′ is “better” than fF∗ . However,
if F ̸⊆ F ′ there is no guarantee that this should even happen. In fact, fF∗ ′ might well be worse. As
illustrated by Fig. 8.6.1, for non-nested function classes, a larger function class does not always
move closer to the “truth” function f ∗ . For instance, on the left of Fig. 8.6.1, though F3 is closer to
f ∗ than F1 , F6 moves away and there is no guarantee that further increasing the complexity can
reduce the distance from f ∗ . With nested function classes where F1 ⊆ . . . ⊆ F6 on the right of
Fig. 8.6.1, we can avoid the aforementioned issue from the non-nested function classes.
Fig. 8.6.1: For non-nested function classes, a larger (indicated by area) function class does not
guarantee to get closer to the “truth” function (f ∗ ). This does not happen in nested function
classes.
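A toy version of this argument may help. In the sketch below each "function" is reduced to a single number and a function class to a set of such numbers. This is a drastic simplification, and all values are made up for illustration. The helper `best_in_class` plays the role of the argmin in (8.6.1).

```python
def best_in_class(function_class, truth):
    """Return the candidate closest to the truth, mimicking the argmin in (8.6.1)."""
    return min(function_class, key=lambda f: abs(f - truth))

truth = 5.0  # stands in for the "truth" function f*

# Nested classes: each one contains the previous one
nested = [{0.0}, {0.0, 3.0}, {0.0, 3.0, 4.5}]
# Non-nested classes: they grow in size but do not contain each other
non_nested = [{4.0}, {1.0, 2.0}, {8.0, 9.0, 10.0}]

nested_gaps = [abs(best_in_class(c, truth) - truth) for c in nested]
non_nested_gaps = [abs(best_in_class(c, truth) - truth) for c in non_nested]
```

For the nested classes the distance to the truth can only shrink or stay the same as the class grows. For the non-nested ones it gets worse even though the classes become larger.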
Thus, only if larger function classes contain the smaller ones are we guaranteed that increasing
them strictly increases the expressive power of the network. For deep neural networks, if we can
train the newly-added layer into an identity function f (x) = x, the new model will be as effective
as the original model. As the new model may get a better solution to fit the training dataset, the
added layer might make it easier to reduce training errors.
This is the question that (He et al., 2016) considered when working on very deep computer vision
models. At the heart of their proposed residual network (ResNet) is the idea that every additional
layer should more easily contain the identity function as one of its elements. These considerations
are rather profound but they led to a surprisingly simple solution, a residual block. With it, ResNet
won the ImageNet Large Scale Visual Recognition Challenge in 2015. The design had a profound
influence on how to build deep neural networks.
8.6.2 Residual Blocks
Letʼs focus on a local part of a neural network, as depicted in Fig. 8.6.2. Denote the input by x.
We assume that the desired underlying mapping we want to obtain by learning is f (x), to be used
as input to the activation function on the top. On the left, the portion within the dotted-line box
must directly learn the mapping f (x). On the right, the portion within the dotted-line box needs
to learn the residual mapping f (x) − x, which is how the residual block derives its name. If the
identity mapping f (x) = x is the desired underlying mapping, the residual mapping is easier to
learn: we only need to push the weights and biases of the upper weight layer (e.g., fully connected

layer and convolutional layer) within the dotted-line box to zero. The right figure illustrates the
residual block of ResNet, where the solid line carrying the layer input x to the addition operator
is called a residual connection (or shortcut connection). With residual blocks, inputs can forward
propagate faster through the residual connections across layers. In fact, the residual block can be
thought of as a special case of the multi-branch Inception block: it has two branches one of which
is the identity mapping.
Fig. 8.6.2: In a regular block (left), the portion within the dotted-line box must directly learn the
mapping f (x). In a residual block (right), the portion within the dotted-line box needs to learn the
residual mapping f (x) − x, where the identity mapping f (x) = x is easier to learn.
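The claim that the identity mapping is easy to reach with a residual block can be checked with a framework-free sketch. It assumes a single fully connected layer inside the dotted-line box and leaves out batch normalization and the activation functions. The helper names are hypothetical.

```python
def linear(weights, bias, x):
    """A tiny fully connected layer: returns weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def regular_block(weights, bias, x):
    # The block itself must produce f(x)
    return linear(weights, bias, x)

def residual_block(weights, bias, x):
    # The block only produces the residual f(x) - x; the input is added back
    g = linear(weights, bias, x)
    return [gi + xi for gi, xi in zip(g, x)]

x = [1.0, -2.0, 3.0]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

plain_out = regular_block(zero_w, zero_b, x)      # collapses to zeros
residual_out = residual_block(zero_w, zero_b, x)  # reproduces the input
```

With all weights and biases at zero, the regular block discards its input, whereas the residual block passes it through unchanged.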
ResNet follows VGGʼs full 3 × 3 convolutional layer design. The residual block has two 3 × 3 con-
volutional layers with the same number of output channels. Each convolutional layer is followed
by a batch normalization layer and a ReLU activation function. Then, we skip these two convolu-
tion operations and add the input directly before the final ReLU activation function. This kind of
design requires that the output of the two convolutional layers has to be of the same shape as the
input, so that they can be added together. If we want to change the number of channels, we need
to introduce an additional 1 × 1 convolutional layer to transform the input into the desired shape
for the addition operation. Letʼs have a look at the code below.
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

class Residual(nn.Module):  #@save
    """The Residual block of ResNet."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1,
                                   stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            # 1x1 convolution that reshapes the input for the addition
            self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X  # Residual connection
        return F.relu(Y)
This code generates two types of networks: one where we add the input to the output before ap-
plying the ReLU nonlinearity whenever use_1x1conv=False, and one where we adjust channels
and resolution by means of a 1 × 1 convolution before adding. Fig. 8.6.3 illustrates this.
Fig. 8.6.3: ResNet block with and without 1 × 1 convolution, which transforms the input into the
desired shape for the addition operation.
Now letʼs look at a situation where the input and output are of the same shape, where 1 × 1 convo-
lution is not needed.
blk = Residual(3)
X = torch.randn(4, 3, 6, 6)
blk(X).shape

torch.Size([4, 3, 6, 6])
We also have the option to halve the output height and width while increasing the number of
output channels. Since the input shape is changed, use_1x1conv=True is specified.
blk = Residual(6, use_1x1conv=True, strides=2)

blk(X).shape
torch.Size([4, 6, 3, 3])
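The shapes reported above follow from the usual output-size formula for convolutions, floor((n + 2p − k)/s) + 1. The short sketch below (the helper name `conv_out_size` is an assumption) applies it to both paths of the block, showing why the 1 × 1 convolution must use the same stride as the first 3 × 3 convolution.

```python
def conv_out_size(n, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1

n = 6  # input height and width used in the examples above

# strides=1: both 3x3 convolutions (padding 1) preserve the size
same = conv_out_size(conv_out_size(n, 3, padding=1), 3, padding=1)

# strides=2: the first 3x3 convolution halves the size on the main path ...
main_path = conv_out_size(conv_out_size(n, 3, padding=1, stride=2), 3, padding=1)
# ... and the 1x1 convolution with the same stride does so on the shortcut
shortcut = conv_out_size(n, 1, padding=0, stride=2)
```

Because `main_path` and `shortcut` agree, the two tensors can be added elementwise.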
8.6.3 ResNet Model
The first two layers of ResNet are the same as those of the GoogLeNet we described before: the
7 × 7 convolutional layer with 64 output channels and a stride of 2 is followed by the 3 × 3 max-
pooling layer with a stride of 2. The difference is the batch normalization layer added after each
convolutional layer in ResNet.
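class ResNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))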
GoogLeNet uses four modules made up of Inception blocks. However, ResNet uses four modules
made up of residual blocks, each of which uses several residual blocks with the same number of
output channels. The number of channels in the first module is the same as the number of input
channels. Since a max-pooling layer with a stride of 2 has already been used, it is not necessary to
reduce the height and width. In the first residual block for each of the subsequent modules, the
number of channels is doubled compared with that of the previous module, and the height and
width are halved.
Now, we implement this module. Note that special processing has been performed on the first
module.
@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels))
    return nn.Sequential(*blk)

Then, we add all the modules to ResNet. Here, two residual blocks are used for each module.
Finally, just like GoogLeNet, we add a global average pooling layer, followed by the fully connected
layer output.
@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    # Store lr and the other arguments as attributes
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, b in enumerate(arch):
        self.net.add_module(f'b{i+2}', self.block(*b, first_block=(i==0)))
    self.net.add_module('last', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
There are 4 convolutional layers in each module (excluding the 1×1 convolutional layer). Together
with the first 7 × 7 convolutional layer and the final fully connected layer, there are 18 layers in
total. Therefore, this model is commonly known as ResNet-18. By configuring different numbers
of channels and residual blocks in the module, we can create different ResNet models, such as
the deeper 152-layer ResNet-152. Although the main architecture of ResNet is similar to that of
GoogLeNet, ResNetʼs structure is simpler and easier to modify. All these factors have resulted in
the rapid and widespread use of ResNet. Fig. 8.6.4 depicts the full ResNet-18.

Fig. 8.6.4: The ResNet-18 architecture.
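The layer count behind the name can be reproduced with a few lines of arithmetic. The sketch below uses a hypothetical helper, `count_layers`, and follows the counting rule stated above (1 × 1 shortcut convolutions excluded). The 34-layer configuration is an assumption taken from He et al. (2016) rather than from this section.

```python
def count_layers(blocks_per_module, convs_per_block=2):
    """Stem convolution + convolutions in residual blocks + final fully connected layer.

    The 1x1 convolutions on the shortcut paths are not counted.
    """
    return 1 + convs_per_block * sum(blocks_per_module) + 1

resnet18 = count_layers((2, 2, 2, 2))  # two residual blocks per module, as here
resnet34 = count_layers((3, 4, 6, 3))  # assumed configuration from He et al. (2016)
```

Deeper variants such as ResNet-152 use a different block type with three convolutions, so they would need `convs_per_block=3` and different block counts.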
Before training ResNet, letʼs observe how the input shape changes across different modules in
ResNet. As in all the previous architectures, the resolution decreases while the number of chan-
nels increases up until the point where a global average pooling layer aggregates all features.
class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                         lr, num_classes)

ResNet18().layer_summary((1, 1, 96, 96))
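The shapes one would expect from such a summary can be traced by hand. The following sketch is not the output of `layer_summary`. It is a plain calculation that assumes padding 3 for the 7 × 7 convolution and padding 1 for the max-pooling layer and for every 3 × 3 convolution. The helper names are hypothetical.

```python
def conv_out_size(n, kernel_size, padding, stride):
    return (n + 2 * padding - kernel_size) // stride + 1

def resnet_shapes(arch, size=96):
    """Return (channels, height/width) after the stem and after each module."""
    size = conv_out_size(size, 7, 3, 2)  # 7x7 convolution, stride 2 (padding 3 assumed)
    size = conv_out_size(size, 3, 1, 2)  # 3x3 max-pooling, stride 2 (padding 1 assumed)
    shapes = [(64, size)]
    for i, (_, num_channels) in enumerate(arch):
        if i > 0:
            # The first residual block of every later module uses strides=2
            size = conv_out_size(size, 3, 1, 2)
        shapes.append((num_channels, size))
    return shapes

shapes = resnet_shapes(((2, 64), (2, 128), (2, 256), (2, 512)))
```

Under these assumptions a 96 × 96 input shrinks to 24 × 24 after the stem and to 3 × 3 with 512 channels before global average pooling.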

8.6.4 Training
We train ResNet on the Fashion-MNIST dataset, just like before.
model = ResNet18(lr=0.01)

8.6.5 ResNeXt
Recall from Fig. 8.6.3 that each ResNet block simply stacks layers between residual connections. This
design can be varied by replacing stacked layers with concatenated parallel transformations, lead-
ing to ResNeXt (Xie et al., 2017). Different from a variety of transformations in multi-branch In-
ception blocks, ResNeXt adopts the same transformation in all branches, thus minimizing manual
design efforts in each branch.
Fig. 8.6.5: The ResNeXt block. It is a bottleneck (when b < c) residual block with group convolution
(g groups).
The left dotted box in Fig. 8.6.5 depicts the added concatenated parallel transformation strategy
in ResNeXt. More concretely, an input with c channels is first split into g groups via g branches
of 1 × 1 convolutions followed by 3 × 3 convolutions, all with b/g output channels. Concatenating
these g outputs results in b output channels, leading to “bottlenecked” (when b < c) network width
inside the dashed box. This output will restore the original c channels of the input via the final
1 × 1 convolution right before summation with the residual connection. Notably, the left dotted box is
equivalent to the much simplified right dotted box in Fig. 8.6.5, where we only need to specify that
the 3 × 3 convolution is a group convolution with g groups. In fact, the group convolution dates
back to the idea of distributing the AlexNet model over two GPUs due to limited GPU memory at
that time (Krizhevsky et al., 2012).
The following implementation of the ResNeXtBlock class treats groups (b/g in Fig. 8.6.5) as an ar-
gument so that given bot_channels (b in Fig. 8.6.5) bottleneck channels, the 3 × 3 group convo-
lution will have bot_channels//groups groups. Similar to the residual block implementation in
Section 8.6.2, the residual connection is generalized with a 1 × 1 convolution (conv4), where set-
ting use_1x1conv=True, strides=2 halves the input height and width.
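class ResNeXtBlock(nn.Module):  #@save
    """The ResNeXt block."""
    def __init__(self, num_channels, groups, bot_mul, use_1x1conv=False,
                 strides=1):
        super().__init__()
        bot_channels = int(round(num_channels * bot_mul))
        self.conv1 = nn.LazyConv2d(bot_channels, kernel_size=1, stride=1)
        self.conv2 = nn.LazyConv2d(bot_channels, kernel_size=3,
                                   stride=strides, padding=1,
                                   groups=bot_channels//groups)
        self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1, stride=1)
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()
        self.bn3 = nn.LazyBatchNorm2d()
        if use_1x1conv:
            # 1x1 convolution (plus batch normalization) on the shortcut path
            self.conv4 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
            self.bn4 = nn.LazyBatchNorm2d()
        else:
            self.conv4 = None

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = F.relu(self.bn2(self.conv2(Y)))
        Y = self.bn3(self.conv3(Y))
        if self.conv4:
            X = self.bn4(self.conv4(X))
        return F.relu(Y + X)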
In the following case (use_1x1conv=False, strides=1), the input and output are of the same shape.
blk = ResNeXtBlock(32, 16, 1)

X = torch.randn(4, 32, 96, 96)
blk(X).shape
torch.Size([4, 32, 96, 96])
Alternatively, setting use_1x1conv=True, strides=2 halves the output height and width.
blk = ResNeXtBlock(32, 16, 1, use_1x1conv=True, strides=2)

blk(X).shape
torch.Size([4, 32, 48, 48])
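To see what the group convolution saves, it may help to count weights. The sketch below uses a hypothetical helper, `conv_weights`, and ignores biases. It relies on the fact that in a grouped convolution each output channel is connected to only a fraction 1/g of the input channels.

```python
def conv_weights(in_channels, out_channels, kernel_size, num_groups=1):
    """Number of weights (biases ignored) in a possibly grouped convolution."""
    assert in_channels % num_groups == 0 and out_channels % num_groups == 0
    # Each output channel only sees in_channels / num_groups input channels
    return kernel_size ** 2 * (in_channels // num_groups) * out_channels

bot_channels, group_width = 32, 16        # as in ResNeXtBlock(32, 16, 1)
num_groups = bot_channels // group_width  # 2 groups

dense = conv_weights(bot_channels, bot_channels, 3)
grouped = conv_weights(bot_channels, bot_channels, 3, num_groups)
```

With two groups the 3 × 3 convolution needs half the weights of its ungrouped counterpart. More groups reduce the count further.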

8.6.6 Summary and Discussion
Nested function classes are desirable. Learning an additional layer in deep neural networks as an
identity function (though this is an extreme case) should be made easy. The residual mapping can
learn the identity function more easily, such as pushing parameters in the weight layer to zero. We
can train an effective deep neural network by having residual blocks. Inputs can forward propagate
faster through the residual connections across layers.
Before residual connections, bypassing paths with gating units were introduced to effectively train
highway networks with over 100 layers (Srivastava et al., 2015). Using identity functions as bypassing
paths, ResNets performed remarkably well on multiple computer vision tasks. Residual connections
had a major influence on the design of subsequent deep neural networks, both convolutional and
sequential in nature. As we will introduce later, the transformer architecture (Vaswani
et al., 2017) adopts residual connections (together with other design choices) and is pervasive in
areas as diverse as language, vision, speech, and reinforcement learning. A key advantage of the
ResNeXt design is that increasing groups leads to sparser connections (i.e., lower computational
complexity) within the block, thus enabling an increase of network width to achieve a better trade-
off between FLOPs and accuracy. ResNeXt-ification is appealing in later convolution network de-
sign, such as in the RegNet model (Radosavovic et al., 2020) and the ConvNeXt architecture (Liu et
al., 2022). We will apply the ResNeXt block later in this chapter.
Exercises
1. What are the major differences between the Inception block in Fig. 8.4.1 and the residual
block? After removing some paths in the Inception block, how are they related to each other?
2. Refer to Table 1 in the ResNet paper (He et al., 2016) to implement different variants.
3. For deeper networks, ResNet introduces a “bottleneck” architecture to reduce model com-
plexity. Try to implement it.
4. In subsequent versions of ResNet, the authors changed the “convolution, batch normal-
ization, and activation” structure to the “batch normalization, activation, and convolution”
structure. Make this improvement yourself. See Figure 1 in (He et al., 2016) for details.
5. Why canʼt we just increase the complexity of functions without bound, even if the function
classes are nested?
Discussions133
8.7 Densely Connected Networks (DenseNet)
ResNet significantly changed the view of how to parametrize the functions in deep networks.
DenseNet (dense convolutional network) is to some extent the logical extension of this (Huang
et al., 2017). As a result, DenseNet is characterized by both the connectivity pattern where each
layer connects to all the preceding layers and the concatenation operation (rather than the addi-
tion operator in ResNet) to preserve and reuse features from earlier layers. To understand how to
arrive at it, letʼs take a small detour to mathematics.

8.7.1 From ResNet to DenseNet
Recall the Taylor expansion for functions. For the point x = 0 it can be written as
f(x) = f(0) + f'(0) x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + \ldots. (8.7.1)
The key point is that it decomposes a function into increasingly higher order terms. In a similar
vein, ResNet decomposes functions into
f (x) = x + g(x). (8.7.2)
That is, ResNet decomposes f into a simple linear term and a more complex nonlinear one. What
if we want to capture (not necessarily add) information beyond two terms? One solution was
DenseNet (Huang et al., 2017).
Fig. 8.7.1: The main difference between ResNet (left) and DenseNet (right) in cross-layer connec-
tions: use of addition and use of concatenation.
As shown in Fig. 8.7.1, the key difference between ResNet and DenseNet is that in the latter case
outputs are concatenated (denoted by [, ]) rather than added. As a result, we perform a mapping
from x to its values after applying an increasingly complex sequence of functions:
x → [x, f1 (x), f2 ([x, f1 (x)]), f3 ([x, f1 (x), f2 ([x, f1 (x)])]), . . .] . (8.7.3)
In the end, all these functions are combined in an MLP to reduce the number of features again. In
terms of implementation this is quite simple: rather than adding terms, we concatenate them.
The name DenseNet arises from the fact that the dependency graph between variables becomes
quite dense. The last layer of such a chain is densely connected to all previous layers. The dense
connections are shown in Fig. 8.7.2.
Fig. 8.7.2: Dense connections in DenseNet.
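The contrast between addition and concatenation can be sketched on plain Python lists. The code below is a toy illustration: `summing_layer` is a made-up stand-in for a learned layer, and feature vectors are ordinary lists rather than tensors.

```python
def residual_chain(x, functions):
    """ResNet style: add each function's output to its input."""
    for f in functions:
        x = [xi + yi for xi, yi in zip(x, f(x))]
    return x

def dense_chain(x, functions):
    """DenseNet style: append each function's output to everything seen so far."""
    features = list(x)
    for f in functions:
        features = features + f(features)  # concatenation, as in (8.7.3)
    return features

# Toy stand-in for a layer: one new feature equal to the sum of its inputs
def summing_layer(features):
    return [sum(features)]

dense_out = dense_chain([1.0], [summing_layer] * 3)
residual_out = residual_chain([1.0], [summing_layer] * 3)
```

The residual chain keeps a single feature whose value changes, while the dense chain keeps the input and every intermediate result side by side, so later layers can reuse them.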
The main components that compose a DenseNet are dense blocks and transition layers. The for-
mer define how the inputs and outputs are concatenated, while the latter control the number of
channels so that it is not too large.

8.7.2 Dense Blocks
DenseNet uses the modified “batch normalization, activation, and convolution” structure of
ResNet (see the exercise in Section 8.6). First, we implement this convolution block structure.
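import torch
from torch import nn
from d2l import torch as d2l

def conv_block(num_channels):
    return nn.Sequential(nn.LazyBatchNorm2d(), nn.ReLU(),
                         nn.LazyConv2d(num_channels, kernel_size=3, padding=1))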
A dense block consists of multiple convolution blocks, each using the same number of output chan-
nels. In the forward propagation, however, we concatenate the input and output of each convo-
lution block on the channel dimension.
class DenseBlock(nn.Module):
    def __init__(self, num_convs, num_channels):
        super(DenseBlock, self).__init__()
        layer = []
        for i in range(num_convs):
            layer.append(conv_block(num_channels))
        self.net = nn.Sequential(*layer)

    def forward(self, X):
        for blk in self.net:
            Y = blk(X)
            # Concatenate the input and output of each block on the channel
            # dimension
            X = torch.cat((X, Y), dim=1)
        return X
In the following example, we define a DenseBlock instance with 2 convolution blocks of 10 output
channels. When using an input with 3 channels, we will get an output with 3+2×10 = 23 channels.
The number of convolution block channels controls the growth in the number of output channels
relative to the number of input channels. This is also referred to as the growth rate.
blk = DenseBlock(2, 10)

X = torch.randn(4, 3, 8, 8)
Y = blk(X)
Y.shape
torch.Size([4, 23, 8, 8])

8.7.3 Transition Layers
Since each dense block will increase the number of channels, adding too many of them will lead
to an excessively complex model. A transition layer is used to control the complexity of the model.
It reduces the number of channels by using a 1 × 1 convolutional layer and halves the height
and width with an average pooling layer with a stride of 2, further reducing the complexity of the
model.
def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),
        nn.AvgPool2d(kernel_size=2, stride=2))
Apply a transition layer with 10 channels to the output of the dense block in the previous example.
This reduces the number of output channels to 10, and halves the height and width.
blk = transition_block(10)
blk(Y).shape
torch.Size([4, 10, 4, 4])
8.7.4 DenseNet Model
Next, we will construct a DenseNet model. DenseNet first uses the same single convolutional layer
and max-pooling layer as in ResNet.
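class DenseNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))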
Then, similar to the four modules made up of residual blocks that ResNet uses, DenseNet uses
four dense blocks. Similar to ResNet, we can set the number of convolutional layers used in each
dense block. Here, we set it to 4, consistent with the ResNet-18 model in Section 8.6. Furthermore,
we set the number of channels (i.e., growth rate) for the convolutional layers in the dense block
to 32, so 128 channels will be added to each dense block.
In ResNet, the height and width are reduced between each module by a residual block with a
stride of 2. Here, we use the transition layer to halve the height and width and halve the number
of channels. Similar to ResNet, a global pooling layer and a fully connected layer are connected
at the end to produce the output.
@d2l.add_to_class(DenseNet)
def __init__(self, num_channels=64, growth_rate=32, arch=(4, 4, 4, 4),
             lr=0.1, num_classes=10):
    super(DenseNet, self).__init__()
    # Store lr and the other arguments as attributes
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, num_convs in enumerate(arch):
        self.net.add_module(f'dense_blk{i+1}', DenseBlock(num_convs,
                                                          growth_rate))
        # The number of output channels in the previous dense block
        num_channels += num_convs * growth_rate
        # A transition layer that halves the number of channels is added
        # between the dense blocks
        if i != len(arch) - 1:
            num_channels //= 2
            self.net.add_module(f'tran_blk{i+1}', transition_block(
                num_channels))
    # Global pooling followed by the fully connected output layer
    self.net.add_module('last', nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
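The channel bookkeeping in this constructor is easy to follow in isolation. The sketch below repeats only the arithmetic, with no layers involved. The function name `densenet_channels` and the labels in the returned list are assumptions made for illustration.

```python
def densenet_channels(num_channels=64, growth_rate=32, arch=(4, 4, 4, 4)):
    """Return the channel count after the stem and after every dense/transition block."""
    trace = [('stem', num_channels)]
    for i, num_convs in enumerate(arch):
        # Every convolution block appends growth_rate channels
        num_channels += num_convs * growth_rate
        trace.append((f'dense_blk{i+1}', num_channels))
        if i != len(arch) - 1:
            # The transition layer halves the number of channels
            num_channels //= 2
            trace.append((f'tran_blk{i+1}', num_channels))
    return trace

trace = densenet_channels()
small = densenet_channels(num_channels=3, growth_rate=10, arch=(2,))
```

Each dense block adds 4 × 32 = 128 channels and each transition layer halves the total, so the width stays bounded despite the repeated concatenation. The second call reproduces the 3 + 2 × 10 = 23 channels of the earlier dense block example.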
8.7.5 Training
Since we are using a deeper network here, in this section, we will reduce the input height and
width from 224 to 96 to simplify the computation.
model = DenseNet(lr=0.01)

8.7.6 Summary and Discussion
The main components that compose DenseNet are dense blocks and transition layers. For the
latter, we need to keep the dimensionality under control when composing the network by adding
transition layers that shrink the number of channels again. In terms of cross-layer connections,
unlike ResNet, where inputs and outputs are added together, DenseNet concatenates inputs and
outputs on the channel dimension. Although these concatenation operations reuse features to
achieve computational efficiency, unfortunately they lead to heavy GPU memory consumption.
As a result, applying DenseNet may require more complex memory-efficient implementations
that may increase training time (Pleiss et al., 2017).
Exercises
1. Why do we use average pooling rather than max-pooling in the transition layer?
2. One of the advantages mentioned in the DenseNet paper is that its model parameters are
smaller than those of ResNet. Why is this the case?
3. One problem for which DenseNet has been criticized is its high memory consumption.
1. Is this really the case? Try to change the input shape to 224 × 224 to see the actual GPU
memory consumption.
2. Can you think of an alternative means of reducing the memory consumption? How
would you need to change the framework?
4. Implement the various DenseNet versions presented in Table 1 of the DenseNet paper
(Huang et al., 2017).
5. Design an MLP-based model by applying the DenseNet idea. Apply it to the housing price
prediction task in Section 5.7.
Discussions134
8.8 Designing Convolution Network Architectures
The 2010s witnessed a shift from feature engineering to network engineering in computer vision.
Since AlexNet (Section 8.1) beat conventional computer vision models on ImageNet, constructing
very deep networks by stacking the same blocks, especially 3 × 3 convolutions, has been popu-
larized by VGG networks (Section 8.2). The network in network (Section 8.3) adds local nonlin-
earities via 1 × 1 convolutions and uses global average pooling to aggregate information across
all locations. GoogLeNet (Section 8.4) is a multi-branch network that combines the advantages
from the VGG network and the network in network, where its Inception block adopts the strat-
egy of concatenated parallel transformations. ResNets (Section 8.6) stack residual blocks, which
are two-branch subnetworks using identity mapping in one branch. DenseNets (Section 8.7) gen-
eralize the residual architectures. Other notable architectures include MobileNets that use net-
work learning to achieve high accuracy in resource-constrained settings (Howard et al., 2019), the
Squeeze-and-Excitation Networks (SENets) that allow for efficient information transfer between
channels (Hu et al., 2018), and EfficientNets (Tan and Le, 2019) that scale up networks via neural
architecture search.

Specifically, neural architecture search (NAS) (Liu et al., 2018, Zoph and Le, 2016) is the process of
automating the design of neural network architectures. Given a fixed search space, NAS uses a search strategy
to automatically select an architecture within the search space based on the returned performance
estimation. The outcome of NAS is a single network instance.
Instead of focusing on designing such individual instances, an alternative approach is to design
network design spaces that characterize populations of networks (Radosavovic et al., 2020). This
method combines the strength of manual design and NAS. Through semi-automatic procedures
(like in NAS), designing network design spaces explores the structure aspect of network design
from the initial AnyNet design space. It then proceeds to discover design principles (like in manual
design) that lead to simple and regular networks: RegNets. Before shedding light on these design
principles, letʼs start with the initial design space.
8.8.1 The AnyNet Design Space
The initial design space is called AnyNet, a relatively unconstrained design space, where we can
focus on exploring network structure assuming standard, fixed blocks such as ResNeXt (Section
8.6.5). Specifically, the network structure includes elements such as the number of blocks and the
number of output channels in each stage, and the number of groups (group width) and bottleneck
ratio within each ResNeXt block.
Fig. 8.8.1: The AnyNet design space. Besides the number of groups and bottleneck ratio within
each block, design choices include depth di and the number of output channels wi for any stage i.
The AnyNet design space is shown in Fig. 8.8.1. This network begins with a stem, followed by a
body with n stages of transformation, and a final head. More concretely, the network stem is a
3 × 3 convolution with stride 2 that halves the height and width of an input image. The network
head is a global average pooling followed by a fully connected layer to predict the output class.
Note that the network stem and head are kept fixed and simple, so that the design focus is on the
network body that is central to performance. Specifically, the network body consists of n stages of
transformation (n is given), where stage i consists of di ResNeXt blocks with wi output channels,
and progressively halves height and width via the first block (setting use_1x1conv=True, strides=2
in d2l.ResNeXtBlock in Section 8.6.5). Letʼs further denote the bottleneck ratio and the number
of groups (group width) within each ResNeXt block for stage i as bi and gi , respectively. Overall,
despite the straightforward network structure, varying bi , gi , wi , and di results in a vast number
of possible networks in the AnyNet design space.
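A back-of-the-envelope count shows how quickly such a design space grows. The option counts per stage in the sketch below are hypothetical and chosen only to convey the order of magnitude. The text above does not specify them.

```python
def design_space_size(num_stages, depth_options, width_options,
                      ratio_options, group_options):
    """Number of networks when every stage picks (d_i, w_i, b_i, g_i) independently."""
    per_stage = depth_options * width_options * ratio_options * group_options
    return per_stage ** num_stages

# Hypothetical option counts, chosen only to show the order of magnitude
size = design_space_size(num_stages=4, depth_options=16, width_options=128,
                         ratio_options=3, group_options=6)
```

Even with only four stages, independent choices per stage multiply into more than 10^18 candidate networks under these assumed counts, far too many to enumerate.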
To implement AnyNet, we first define its network stem.
import torch
from torch import nn
from d2l import torch as d2l

class AnyNet(d2l.Classifier):
    def stem(self, num_channels):
        return nn.Sequential(
            nn.LazyConv2d(num_channels, kernel_size=3, stride=2, padding=1),
            nn.LazyBatchNorm2d(), nn.ReLU())
Each stage consists of depth ResNeXt blocks, where num_channels specifies the block width. Note
that the first block halves the height and width of input images.
@d2l.add_to_class(AnyNet)
def stage(self, depth, num_channels, groups, bot_mul):
    blk = []
    for i in range(depth):
        if i == 0:
            # The first block of the stage halves the height and width
            blk.append(d2l.ResNeXtBlock(num_channels, groups, bot_mul,
                                        use_1x1conv=True, strides=2))
        else:
            blk.append(d2l.ResNeXtBlock(num_channels, groups, bot_mul))
    return nn.Sequential(*blk)
Putting the network stem, body, and head together, we complete the implementation of AnyNet.
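@d2l.add_to_class(AnyNet)
def __init__(self, arch, stem_channels, lr=0.1, num_classes=10):
    super(AnyNet, self).__init__()
    # Store lr and the other arguments as attributes
    self.save_hyperparameters()
    self.net = nn.Sequential(self.stem(stem_channels))
    for i, s in enumerate(arch):
        self.net.add_module(f'stage{i+1}', self.stage(*s))
    # Head: global average pooling followed by a fully connected layer
    self.net.add_module('head', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))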

8.8.2 Constraining Design Spaces with Lower Error Distributions
For any stage i of AnyNet, the design choices are the bottleneck ratio bi and the number of groups
gi within each block, block width wi , and depth di . The process of designing network design spaces
starts from a relatively unconstrained network structure characterized by (bi , gi , wi , di ) in the initial
AnyNet design space. Then this process progressively samples models from the input design space
to evaluate the error distribution (Radosavovic et al., 2019) as a quality indicator to output a more
constrained design space with simpler models that may have better quality.
Letʼs detail this quality indicator for design spaces. Given n models sampled from some design
space, the error empirical distribution function F (e) measures the fraction of models with errors ei
lower than e:
F(e) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(e_i < e). (8.8.1)
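The error empirical distribution function is simple to compute once the errors of the sampled models are known. The sketch below uses made-up error values and a hypothetical helper, `error_edf`. It only illustrates the definition in (8.8.1).

```python
def error_edf(errors, e):
    """Fraction of sampled models whose error is strictly below e, as in (8.8.1)."""
    return sum(1 for e_i in errors if e_i < e) / len(errors)

# Hypothetical errors of n = 4 models sampled from some design space
errors = [0.10, 0.20, 0.30, 0.40]
curve = [error_edf(errors, e) for e in (0.05, 0.25, 0.45)]
```

A design space whose curve rises earlier (more mass at low errors) contains a larger fraction of good models, which is how two design spaces are compared.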
Starting from the initial unconstrained AnyNet design space (AnyNetXA in (Radosavovic et al.,
2020)), sharing the bottleneck ratio bi = b for all stages i results in a more constrained design
space AnyNetXB . Sampling and training n = 500 models from AnyNetXA and AnyNetXB each,
left of Fig. 8.8.2 shows that both design spaces have similar quality. Since simpler is better, we
continue to search from AnyNetXB by additionally sharing the number of groups gi = g. This
leads to a further simplified design space AnyNetXC with virtually no change in error distributions
(right of Fig. 8.8.2).
Fig. 8.8.2: Comparing error empirical distribution functions of design spaces. The legends show
the min error and mean error. Sharing bottleneck ratio (from AnyNetXA to AnyNetXB ) and sharing
the number of groups (from AnyNetXB to AnyNetXC ) simplify the design space with virtually no
change in error distributions (figure taken from (Radosavovic et al., 2020)).
Investigating good and bad models from AnyNetXC suggests that it may be useful to increase width
across stages (Radosavovic et al., 2020). Empirically, simplifying AnyNetXC to AnyNetXD with wi ≤
wi+1 improves the quality of design spaces (left of Fig. 8.8.3). Similarly, adding further constraints
of di ≤ di+1 to increase network depth across stages gives an even better AnyNetXE (right of Fig.
8.8.3).

Fig. 8.8.3: Comparing error empirical distribution functions of design spaces. The legends show
the min error and mean error. Increasing network width across stages (from AnyNetXC to
AnyNetXD ) and increasing network depth across stages (from AnyNetXD to AnyNetXE ) simplify
the design space with improved error distributions (figure taken from (Radosavovic et al., 2020)).
8.8.3 RegNet
The resulting AnyNetXE design space consists of simple networks following easy-to-interpret de-
sign principles:
• Share the bottleneck ratio bi = b for all stages i;
• Share the number of groups gi = g for all stages i;
• Increase network width across stages: wi ≤ wi+1 ;
• Increase network depth across stages: di ≤ di+1 .
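The four principles can be checked mechanically for any stage configuration. The sketch below assumes that each stage is a tuple (depth, width, groups, bottleneck ratio); the helper name `follows_design_principles` is hypothetical and not part of the d2l library.

```python
def follows_design_principles(stages):
    """Check the four design principles for (depth, width, groups, ratio) stages."""
    depths = [d for d, _, _, _ in stages]
    widths = [w for _, w, _, _ in stages]
    groups = {g for _, _, g, _ in stages}
    ratios = {b for _, _, _, b in stages}

    def nondecreasing(values):
        return all(a <= b for a, b in zip(values, values[1:]))

    # Shared g and b across stages, and nondecreasing width and depth
    return (len(groups) == 1 and len(ratios) == 1
            and nondecreasing(widths) and nondecreasing(depths))

two_stage = ((4, 32, 16, 1), (6, 80, 16, 1))
print(follows_design_principles(two_stage))  # True
shrinking = ((6, 80, 16, 1), (4, 32, 16, 1))
print(follows_design_principles(shrinking))  # False: width and depth decrease
```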
Following these design principles, Radosavovic et al. (2020) proposed quantized linear constraints
on how wi and di increase, leading to RegNetX, which uses ResNeXt blocks, and RegNetY, which additionally
uses operators from SENets (Hu et al., 2018). As an example, we implement a 32-layer RegNetX
variant characterized by
• bi = 1;
• gi = 16;
• w1 = 32, w2 = 80;
• d1 = 4, d2 = 6.
class RegNet32(AnyNet):
    def __init__(self, lr=0.1, num_classes=10):
        stem_channels, groups, bot_mul = 32, 16, 1
        depths, channels = (4, 6), (32, 80)
        # Each stage is described by (depth, channels, groups, bottleneck ratio)
        super().__init__(
            ((depths[0], channels[0], groups, bot_mul),
             (depths[1], channels[1], groups, bot_mul)),
            stem_channels, lr, num_classes)
We can see that each RegNet stage progressively reduces resolution and increases output channels.

RegNet32().layer_summary((1, 1, 96, 96))

8.8.4 Training
Training the 32-layer RegNet on the Fashion-MNIST dataset is just like before.
model = RegNet32(lr=0.05)
8.8.5 Discussion
With desirable properties like locality and translation invariance (Section 7.1) for vision, CNNs
have been the dominant architectures in this area. Recently, transformers (Section 11.7) (Dosovitskiy
et al., 2021, Touvron et al., 2021) and MLPs (Tolstikhin et al., 2021) have also sparked research
beyond the well-established CNN architectures for vision. Specifically, although lacking
the aforementioned inductive biases inherent to CNNs, vision transformers (Section 11.8) attained
state-of-the-art performance in large-scale image classification in the early 2020s, showing that
scalability trumps inductive biases (Dosovitskiy et al., 2021). In other words, it is often possible to
train large transformers to outperform large CNNs on large datasets. Inspired by the superior
scaling behavior of transformers (Section 11.9) with multi-head self-attention (Section 11.5), the
process of gradually improving from a standard ResNet architecture toward the design of a vision
transformer leads to a family of CNNs called the ConvNeXt models that compete favorably with

transformers for vision (Liu et al., 2022). We refer the interested readers to CNN design discussions
in the ConvNeXt paper (Liu et al., 2022).
Exercises
1. Increase the number of stages to 4. Can you design a deeper RegNet that performs better?
2. De-ResNeXt-ify RegNets by replacing the ResNeXt block with the ResNet block. How does
your new model perform?
3. Implement multiple instances of a “VioNet” family by violating the design principles of Reg-
Net. How do they perform? Which of (di , wi , gi , bi ) is the most important factor?
Discussions

9 Recurrent Neural Networks
Up until now, we have focused primarily on fixed-length data. When introducing linear and logis-
tic regression in Chapter 3 and Chapter 4 and multilayer perceptrons in Chapter 5, we were happy
to assume that each feature vector xi consisted of a fixed number of components x1 , . . . , xd , where
each numerical feature xj corresponded to a particular attribute. These datasets are sometimes
called tabular, because they can be arranged in tables, where each example i gets its own row, and
each attribute gets its own column. Crucially, with tabular data, we seldom assume any particular
structure over the columns.
Subsequently, in Chapter 7, we moved on to image data, where inputs consist of the raw pixel
values at each coordinate in an image. Image data hardly fit the bill of a prototypical tabular dataset.
There, we needed to call upon convolutional neural networks (CNNs) to handle the hierarchical
structure and invariances. However, our data were still of fixed length. Every Fashion-MNIST
image is represented as a 28 × 28 grid of pixel values. Moreover, our goal was to develop a model
that looked at just one image and then output a single prediction. But what should we do when
faced with a sequence of images, as in a video, or when tasked with producing a sequentially
structured prediction, as in the case of image captioning?
Countless learning tasks require dealing with sequential data. Image captioning, speech synthe-
sis, and music generation all require that models produce outputs consisting of sequences. In
other domains, such as time series prediction, video analysis, and musical information retrieval,
a model must learn from inputs that are sequences. These demands often arise simultaneously:
tasks such as translating passages of text from one natural language to another, engaging in dialogue,
or controlling a robot, demand that models both ingest and output sequentially structured
data.
Recurrent neural networks (RNNs) are deep learning models that capture the dynamics of se-
quences via recurrent connections, which can be thought of as cycles in the network of nodes.
This might seem counterintuitive at first. After all, it is the feedforward nature of neural networks
that makes the order of computation unambiguous. However, recurrent edges are defined in a
precise way that ensures that no such ambiguity can arise. Recurrent neural networks are un-
rolled across sequence steps, with the same underlying parameters applied at each step. While the
standard connections are applied synchronously to propagate each layerʼs activations to the sub-
sequent layer at the same time step, the recurrent connections are dynamic, passing information
across adjacent time steps. As the unfolded view in Fig. 9.1 reveals, RNNs can be thought of as
feedforward neural networks where each layerʼs parameters (both conventional and recurrent)
are shared across time steps.
Fig. 9.1: On the left, recurrent connections are depicted via cyclic edges. On the right, we unfold
the RNN over sequence steps. Here, recurrent edges span adjacent sequence steps, while conven-
tional connections are computed synchronously.
Like neural networks more broadly, RNNs have a long discipline-spanning history, originating
as models of the brain popularized by cognitive scientists and subsequently adopted as practi-
cal modeling tools employed by the machine learning community. As with deep learning more
broadly, this book adopts the machine learning perspective, focusing on RNNs as practical tools
which rose to popularity in the 2010s owing to breakthrough results on such diverse tasks as hand-
writing recognition (Graves et al., 2008), machine translation (Sutskever et al., 2014), and recog-
nizing medical diagnoses (Lipton et al., 2016). We point the reader interested in more background
material to a publicly available comprehensive review (Lipton et al., 2015). We also note that se-
quentiality is not unique to RNNs. For example, the CNNs that we already introduced can be
adapted to handle data of varying length, e.g., images of varying resolution. Moreover, RNNs
have recently ceded considerable market share to transformer models, which will be covered in
Chapter 11. However, RNNs rose to prominence as the default models for handling complex se-
quential structure in deep learning, and remain staple models for sequential modeling to this day.
The stories of RNNs and of sequence modeling are inextricably linked, and this is as much a chap-
ter about the ABCs of sequence modeling problems as it is a chapter about RNNs.
One key insight paved the way for a revolution in sequence modeling. While the inputs and targets
for many fundamental tasks in machine learning cannot easily be represented as fixed-length
vectors, they can often nevertheless be represented as varying-length sequences of fixed-length
vectors. For example, documents can be represented as sequences of words. Medical records
can often be represented as sequences of events (encounters, medications, procedures, lab tests,
diagnoses). Videos can be represented as varying-length sequences of still images.
While sequence models have popped up in countless application areas, basic research in the area
has been driven predominantly by advances on core tasks in natural language processing (NLP).
Thus, throughout this chapter, we will focus our exposition and examples on text data. If you
get the hang of these examples, then applying these models to other data modalities should be
relatively straightforward. In the next few sections, we introduce basic notation for sequences
and some evaluation measures for assessing the quality of sequentially structured model outputs.
Next, we discuss basic concepts of a language model and use this discussion to motivate our first
RNN models. Finally, we describe the method for calculating gradients when backpropagating
through RNNs and explore some challenges that are often encountered when training such net-
works, motivating the modern RNN architectures that will follow in Chapter 10.

9.1 Working with Sequences
Up until now, weʼve focused on models whose inputs consisted of a single feature vector x ∈ R^d.
The main change of perspective when developing models capable of processing sequences is that
we now focus on inputs that consist of an ordered list of feature vectors x1 , . . . , xT , where each
feature vector xt indexed by a sequence step t ∈ Z+ lies in R^d.
Some datasets consist of a single massive sequence. Consider, for example, the extremely long
streams of sensor readings that might be available to climate scientists. In such cases, we might
create training datasets by randomly sampling subsequences of some predetermined length.
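As a minimal sketch of this idea, the snippet below draws fixed-length windows at random start positions from one long sequence. The function name, the seed argument, and the toy "sensor readings" are all assumptions made for illustration.

```python
import random

def sample_subsequences(sequence, length, num_samples, seed=0):
    """Draw `num_samples` contiguous windows of `length` items at random offsets."""
    rng = random.Random(seed)
    max_start = len(sequence) - length  # last valid start index
    starts = [rng.randint(0, max_start) for _ in range(num_samples)]
    return [sequence[s:s + length] for s in starts]

readings = list(range(100))  # stand-in for a long stream of sensor readings
batch = sample_subsequences(readings, length=5, num_samples=3)
print([len(window) for window in batch])  # [5, 5, 5]
```

Each window preserves the original order of its elements, so the temporal structure inside a training example stays intact.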
More often, our data arrive as a collection of sequences. Consider the following examples: (i) a
collection of documents, each represented as its own sequence of words, and each having its own
length Ti ; (ii) sequence representation of patient stays in the hospital, where each stay consists of
a number of events and the sequence length depends roughly on the length of the stay.
Previously, when dealing with individual inputs, we assumed that they were sampled indepen-
dently from the same underlying distribution P (X). While we still assume that entire sequences
(e.g., entire documents or patient trajectories) are sampled independently, we cannot assume that
the data arriving at each sequence step are independent of each other. For example, what words
are likely to appear later in a document depends heavily on what words occurred earlier in the
document. What medicine a patient is likely to receive on the 10th day of a hospital visit depends
heavily on what transpired in the previous nine days.
This should come as no surprise. If we didnʼt believe that the elements in a sequence were related,
we wouldnʼt have bothered to model them as a sequence in the first place. Consider the usefulness
of the auto-fill features that are popular on search tools and modern email clients. They are use-
ful precisely because it is often possible to predict (imperfectly, but better than random guessing)
what likely continuations of a sequence might be, given some initial prefix. For most sequence
models, we donʼt require independence, or even stationarity, of our sequences. Instead, we re-
quire only that the sequences themselves are sampled from some fixed underlying distribution
over entire sequences.
This flexible approach allows for such phenomena as (i) documents looking significantly different
at the beginning than at the end; (ii) patient status evolving either towards recovery or towards
death over the course of a hospital stay; and (iii) customer taste evolving in predictable ways over
the course of continued interaction with a recommender system.
We sometimes wish to predict a fixed target y given sequentially structured input (e.g., sentiment
classification based on a movie review). At other times, we wish to predict a sequentially structured
target (y1 , · · · , yT ) given a fixed input (e.g., image captioning). Still other times, our goal is
to predict sequentially structured targets based on sequentially structured inputs (e.g., machine
translation or video captioning). Such sequence-to-sequence tasks take two forms: (a) aligned,
where the input at each sequence step aligns with a corresponding target (e.g., part-of-speech
tagging); and (b) unaligned, where the input and target do not necessarily exhibit a step-for-step
correspondence (e.g., machine translation).
But before we worry about handling targets of any kind, we can tackle the most straightforward
problem: unsupervised density modeling (also called sequence modeling). Here, given a collection
of sequences, our goal is to estimate the probability mass function that tells us how likely we are
to see any given sequence, i.e. p(x1 , · · · , xT ).

9.1.1 Basic Tools
Before introducing specialized neural networks designed to handle sequentially structured
data, letʼs take a look at some actual sequence data and build up some basic intuitions and
statistical tools. In particular, we will focus on stock price data from the FTSE 100 index (Fig.
9.1.1). At each time step t ∈ Z+ , we observe the price of the index at that time, denoted by xt .
Fig. 9.1.1: FTSE 100 index over about 30 years.
Now suppose that a trader would like to make short term trades, strategically getting into or out
of the index, depending on whether they believe that it will rise or decline in the subsequent time
step. Absent any other features (news, financial reporting data, etc.), the only available signal for
predicting the subsequent value is the history of prices to date.
Autoregressive Models
The trader is thus interested in knowing the probability distribution
P (xt | xt−1 , . . . , x1 ) (9.1.1)
over prices that the index might take in the subsequent time step. While estimating the entire
distribution over a continuous-valued random variable can be difficult, the trader would be happy
to focus on a few key statistics of the distribution, particularly the expected value and the variance.
One simple strategy for estimating the conditional expectation
E[(xt | xt−1 , . . . , x1 )], (9.1.2)
would be to apply a linear regression model (recall Section 3.5). Such models that regress the
value of a signal on the previous values of that same signal are naturally called autoregressive models.
There is just one major problem: the number of inputs, xt−1 , . . . , x1 , varies depending on t.
Namely, the number of inputs increases with the amount of data that we encounter. Thus if we
want to treat our historical data as a training set, we are left with the problem that each example
has a different number of features. Much of what follows in this chapter will revolve around tech-
niques for overcoming these challenges when engaging in such autoregressive modeling problems
where the object of interest is P (xt | xt−1 , . . . , x1 ) or some statistic(s) of this distribution.
A few strategies recur frequently. First, we might believe that although long sequences xt−1 , . . . , x1
are available, it may not be necessary to look back so far in the history when predicting the near

future. In this case we might content ourselves to condition on some window of length τ and only
use xt−1 , . . . , xt−τ observations. The immediate benefit is that now the number of arguments is
always the same, at least for t > τ . This allows us to train any linear model or deep network that
requires fixed-length vectors as inputs. Second, we might develop models that maintain some
summary ht of the past observations (see Fig. 9.1.2) and at the same time update ht in addition
to the prediction x̂t . This leads to models that estimate xt with x̂t = P (xt | ht ) and moreover
updates of the form ht = g(ht−1 , xt−1 ). Since ht is never observed, these models are also called
latent autoregressive models.
Fig. 9.1.2: A latent autoregressive model.
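One very simple instance of such a summary is an exponential moving average. The sketch below is only an illustration under stated assumptions: the update g(h, x) = αh + (1 − α)x and the choice to read the prediction directly off the state are ours, not a model prescribed by the text.

```python
def latent_ar_predict(xs, alpha=0.5, h0=0.0):
    """Toy latent autoregressive model with an exponential-moving-average state."""
    h = h0
    preds = []
    for x_prev in xs:
        # State update h_t = g(h_{t-1}, x_{t-1}); here g is a moving average
        h = alpha * h + (1 - alpha) * x_prev
        # The prediction for the next observation is read off from the state
        preds.append(h)
    return preds

print(latent_ar_predict([1.0, 1.0, 1.0]))  # [0.5, 0.75, 0.875]
```

No matter how long the history grows, the model only ever stores the single number h, which is what makes latent state models attractive for long sequences.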
To construct training data from historical data, one typically creates examples by sampling win-
dows randomly. In general, we do not expect time to stand still. However, we often assume that
while the specific values of xt might change, the dynamics according to which each subsequent
observation is generated given the previous observations do not. Statisticians call dynamics that
do not change stationary.
9.1.2 Sequence Models
Sometimes, especially when working with language, we wish to estimate the joint probability of
an entire sequence. This is a common task when working with sequences composed of discrete
tokens, such as words. Generally, these estimated functions are called sequence models and for
natural language data, they are called language models. The field of sequence modeling has been
driven so much by NLP that we often describe sequence models as “language models”, even when
dealing with non-language data. Language models prove useful for all sorts of reasons. Sometimes
we want to evaluate the likelihood of sentences. For example, we might wish to compare the
naturalness of two candidate outputs generated by a machine translation system or by a speech
recognition system. But language modeling gives us not only the capacity to evaluate likelihood,
but the ability to sample sequences, and even to optimize for the most likely sequences.
While language modeling might not look, at first glance, like an autoregressive problem, we can
reduce language modeling to autoregressive prediction by decomposing the joint density of a
sequence p(x1 , . . . , xT ) into the product of conditional densities in a left-to-right fashion by applying
the chain rule of probability:
P(x_1, \ldots, x_T) = P(x_1) \prod_{t=2}^{T} P(x_t \mid x_{t-1}, \ldots, x_1). (9.1.3)
Note that if we are working with discrete signals like words, then the autoregressive model must
be a probabilistic classifier, outputting a full probability distribution over the vocabulary for what
word will come next, given the leftwards context.
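In practice the product of conditionals is usually accumulated in log space. The following sketch assumes a callable `cond_prob(token, context)` that returns P(token | context); the `uniform` model used here is a hypothetical stand-in that ignores its context.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """log P(x_1, ..., x_T) as a sum of left-to-right conditional log-probabilities."""
    # For t = 0 the context is empty, which corresponds to the P(x_1) term
    return sum(math.log(cond_prob(tokens[t], tokens[:t]))
               for t in range(len(tokens)))

def uniform(token, context):
    """Hypothetical model: uniform over a 4-token vocabulary, ignoring context."""
    return 0.25

lp = sequence_log_prob(['a', 'b', 'a'], uniform)
print(math.isclose(math.exp(lp), 0.25 ** 3))  # True
```

Extending the sequence by one token simply adds one more conditional log-probability to the running sum.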

Markov Models
Now suppose that we wish to employ the strategy mentioned above, where we condition only
on the τ previous sequence steps, i.e., xt−1 , . . . , xt−τ , rather than the entire sequence history
xt−1 , . . . , x1 . Whenever we can throw away the history beyond the previous τ steps without any
loss in predictive power, we say that the sequence satisfies a Markov condition, i.e., that the future
is conditionally independent of the past, given the recent history. When τ = 1, we say that the data is
characterized by a first-order Markov model, and when τ = k, we say that the data is characterized
by a k-th order Markov model. When the first-order Markov condition holds (τ = 1), the factorization
of our joint probability becomes a product of probabilities of each word given the previous
word:
P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{t-1}) \text{ where } P(x_1 \mid x_0) = P(x_1). (9.1.4)
We often find it useful to work with models that proceed as though a Markov condition were sat-
isfied, even when we know that this is only approximately true. With real text documents we con-
tinue to gain information as we include more and more leftwards context. But these gains diminish
rapidly. Thus, sometimes we compromise, obviating computational and statistical difficulties by
training models whose validity depends on a k-th order Markov condition. Even todayʼs massive
RNN- and Transformer-based language models seldom incorporate more than a thousand words
of context.
With discrete data, a true Markov model simply counts the number of times that each word has
occurred in each context, producing the relative frequency estimate of P (xt | xt−1 ). Whenever
the data assume only discrete values (as in language), the most likely sequence of words can be
computed efficiently using dynamic programming.
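The counting estimate can be sketched in a few lines. Here we fit a first-order model on a toy character sequence; the function name and the toy data are assumptions chosen for illustration.

```python
from collections import Counter

def fit_first_order_markov(tokens):
    """Relative-frequency estimates of P(x_t | x_{t-1}) from bigram counts."""
    pair_counts = Counter(zip(tokens[:-1], tokens[1:]))
    context_counts = Counter(tokens[:-1])  # how often each token acts as context
    return {(prev, cur): count / context_counts[prev]
            for (prev, cur), count in pair_counts.items()}

tokens = list('abababac')
probs = fit_first_order_markov(tokens)
print(probs[('a', 'b')])  # 0.75: 'a' is followed by 'b' in 3 of 4 cases
print(probs[('a', 'c')])  # 0.25
print(probs[('b', 'a')])  # 1.0
```

Pairs that never occur in the data are absent from the dictionary, which amounts to assigning them probability zero.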
The Order of Decoding
You might be wondering why we had to represent the factorization of a text sequence
P (x1 , . . . , xT ) as a left-to-right chain of conditional probabilities. Why not right-to-left or some
other, seemingly random order? In principle, there is nothing wrong with unfolding P (x1 , . . . , xT )
in reverse order. The result is a valid factorization:
P(x_1, \ldots, x_T) = \prod_{t=T}^{1} P(x_t \mid x_{t+1}, \ldots, x_T). (9.1.5)
However, there are many reasons why factorizing text in the same direction as we read it (left-to-right
for most languages, but right-to-left for Arabic and Hebrew) is preferred for the task of
language modeling. First, this is just a more natural direction for us to think about. After all,
we all read text every day, and this process is guided by our ability to anticipate what words and
phrases are likely to come next. Just think of how many times you have completed someone elseʼs
sentence. Thus, even if we had no other reason to prefer such in-order decodings, they would be
useful if only because we have better intuitions for what should be likely when predicting in this
order.
Second, by factorizing in order, we can assign probabilities to arbitrarily long sequences using
the same language model. To convert a probability over steps 1 through t into one that extends
to word t + 1 we simply multiply by the conditional probability of the additional token given the
previous ones: P (xt+1 , . . . , x1 ) = P (xt , . . . , x1 ) · P (xt+1 | xt , . . . , x1 ).

Third, we have stronger predictive models for predicting adjacent words versus words at arbitrary
other locations. While all orders of factorization are valid, they do not necessarily all represent
equally easy predictive modeling problems. This is true not only for language, but for other kinds
of data as well, e.g., when the data are causally structured. For example, we believe that future
events cannot influence the past. Hence, if we change xt , we may be able to influence what hap-
pens for xt+1 going forward but not the converse. That is, if we change xt , the distribution over
past events will not change. In some contexts, this makes it easier to predict P (xt+1 | xt ) than to
predict P (xt | xt+1 ). For instance, in some cases, we can find xt+1 = f (xt ) + ϵ for some additive
noise ϵ, whereas the converse is not true (Hoyer et al., 2009). This is great news, since it is typically
the forward direction that we are interested in estimating. The book by Peters et al. (2017) explains
this topic in more depth; we are barely scratching the surface of it.
9.1.3 Training
Before we focus our attention on text data, letʼs first try this out with some continuous-valued
synthetic data.
%matplotlib inline
import torch
from d2l import torch as d2l
Here, our 1000 synthetic data points will follow the trigonometric sin function, applied to 0.01 times the
time step. To make the problem a little more interesting, we corrupt each sample with additive
noise.
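class Data(d2l.DataModule):
    def __init__(self, batch_size=16, T=1000, num_train=600, tau=4):
        # Store the arguments as attributes (self.T, self.tau, ...)
        self.save_hyperparameters()
        self.time = torch.arange(1, T + 1, dtype=torch.float32)
        # Noisy sine wave: sin(0.01 * t) plus Gaussian noise with std 0.2
        self.x = torch.sin(0.01 * self.time) + torch.randn(T) * 0.2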
data = Data()
d2l.plot(data.time, data.x, 'time', 'x', xlim=[1, 1000], figsize=(6, 3))

From this sequence we extract training examples, each consisting of features and a label.
To begin, we try a model that acts as though the data satisfied a τ -order Markov condition, and
thus predicts xt using only the past τ observations. Thus for each timestep we have an example
with label y = xt and features xt = [xt−τ , . . . , xt−1 ]. The astute reader might have noticed that this
results in 1000 − τ examples, since we lack sufficient history for y1 , . . . , yτ . While we could pad the
first τ sequences with zeros, to keep things simple, we drop them for now. The resulting dataset
contains T − τ examples, where each input to the model has sequence length τ . We create a data
iterator on the first 600 examples, covering a period of the sin function.
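@d2l.add_to_class(Data)
def get_dataloader(self, train):
    # Column i holds x shifted by i steps, so each row is a window of tau values
    features = [self.x[i : self.T-self.tau+i] for i in range(self.tau)]
    self.features = torch.stack(features, 1)
    self.labels = self.x[self.tau:].reshape((-1, 1))
    # The first num_train examples are for training, the rest for validation
    i = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader([self.features, self.labels], train, i)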
In this example our model will be a standard linear regression.
model = d2l.LinearRegression(lr=0.01)
9.1.4 Prediction
To evaluate our model, we first check how well our model performs at one-step-ahead prediction.
onestep_preds = model(data.features).detach().numpy()
d2l.plot(data.time[data.tau:], [data.labels, onestep_preds], 'time', 'x',
legend=['labels', '1-step preds'], figsize=(6, 3))

The one-step-ahead predictions look good, even near the end at t = 1000.
Now consider what happens if we only observe sequence data up until time step 604 (num_train + tau) but
wish to make predictions several steps into the future. Unfortunately, we cannot directly compute
the one-step-ahead prediction for time step 609, because we do not know the corresponding
inputs, having seen only up to x604 . We can address this problem by plugging in our earlier pre-
dictions as inputs to our model for making subsequent predictions, projecting forward, one step
at a time, until reaching the desired sequence step:
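\begin{aligned}
\hat{x}_{605} &= f(x_{601}, x_{602}, x_{603}, x_{604}), \\
\hat{x}_{606} &= f(x_{602}, x_{603}, x_{604}, \hat{x}_{605}), \\
\hat{x}_{607} &= f(x_{603}, x_{604}, \hat{x}_{605}, \hat{x}_{606}), \\
\hat{x}_{608} &= f(x_{604}, \hat{x}_{605}, \hat{x}_{606}, \hat{x}_{607}), \\
\hat{x}_{609} &= f(\hat{x}_{605}, \hat{x}_{606}, \hat{x}_{607}, \hat{x}_{608}), \\
&\;\;\vdots
\end{aligned} (9.1.6)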
Generally, for an observed sequence x1 , . . . , xt , its predicted output x̂t+k at time step t + k is called
the k-step-ahead prediction. Since we have observed up to x604 , its k-step-ahead prediction is x̂604+k .
In other words, we will have to keep on using our own predictions to make multistep-ahead pre-
dictions. Letʼs see how well this goes.
multistep_preds = torch.zeros(data.T)
multistep_preds[:] = data.x
for i in range(data.num_train + data.tau, data.T):
    multistep_preds[i] = model(
        multistep_preds[i - data.tau:i].reshape((1, -1)))
multistep_preds = multistep_preds.detach().numpy()
d2l.plot([data.time[data.tau:], data.time[data.num_train+data.tau:]],
         [onestep_preds, multistep_preds[data.num_train+data.tau:]], 'time',
         'x', legend=['1-step preds', 'multistep preds'], figsize=(6, 3))

Unfortunately, in this case we fail spectacularly. The predictions decay to a constant pretty quickly
after a few prediction steps. Why did the algorithm perform so much worse when predicting fur-
ther into the future? Ultimately, this owes to the fact that errors build up. Letʼs say that after step
1 we have some error ϵ1 = ϵ̄. Now the input for step 2 is perturbed by ϵ1 , hence we suffer some
error on the order of ϵ2 = ϵ̄ + cϵ1 for some constant c, and so on. The predictions can diverge
rapidly from the true observations. You may already be familiar with this common phenomenon.
For instance, weather forecasts for the next 24 hours tend to be pretty accurate but beyond that,
accuracy declines rapidly. We will discuss methods for improving this throughout this chapter
and beyond.
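The recurrence above can be iterated numerically to get a feel for how quickly errors may compound. This is a rough sketch of the heuristic argument only: the function name and the values of ϵ̄ and c are hypothetical, and real models need not follow this recurrence exactly.

```python
def accumulated_error(eps_bar, c, steps):
    """Iterate eps_k = eps_bar + c * eps_{k-1}, starting from eps_0 = 0."""
    errors = []
    eps = 0.0
    for _ in range(steps):
        eps = eps_bar + c * eps
        errors.append(eps)
    return errors

# With c > 1 the error roughly doubles at every step
print(accumulated_error(eps_bar=1.0, c=2.0, steps=4))  # [1.0, 3.0, 7.0, 15.0]
# With c < 1 the error stays bounded and approaches eps_bar / (1 - c) = 2
print(accumulated_error(eps_bar=1.0, c=0.5, steps=4))  # [1.0, 1.5, 1.75, 1.875]
```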
Letʼs take a closer look at the difficulties in k-step-ahead predictions by computing predictions on
the entire sequence for k = 1, 4, 16, 64.
def k_step_pred(k):
    features = []
    for i in range(data.tau):
        features.append(data.x[i : i+data.T-data.tau-k+1])
    # The (i+tau)-th element stores the (i+1)-step-ahead predictions
    for i in range(k):
        preds = model(torch.stack(features[i : i+data.tau], 1))
        features.append(preds.reshape(-1))
    return features[data.tau:]

steps = (1, 4, 16, 64)
preds = k_step_pred(steps[-1])
d2l.plot(data.time[data.tau+steps[-1]-1:],
         [preds[k - 1].detach().numpy() for k in steps], 'time', 'x',
         legend=[f'{k}-step preds' for k in steps], figsize=(6, 3))

This clearly illustrates how the quality of the prediction changes as we try to predict further into
the future. While the 4-step-ahead predictions still look good, anything beyond that is almost
useless.
9.1.5 Summary
• There is quite a difference in difficulty between interpolation and extrapolation. Consequently,
if you have a sequence, always respect the temporal order of the data when training,
i.e., never train on future data.
• Sequence models require specialized statistical tools for estimation. Two popular choices
are autoregressive models and latent-variable autoregressive models.
• For causal models (e.g., time going forward), estimating the forward direction is typically a
lot easier than the reverse direction.
• For an observed sequence up to time step t, its predicted output at time step t + k is the k-
step-ahead prediction. As we predict further in time by increasing k, the errors accumulate
and the quality of the prediction degrades, often dramatically.
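As a small illustration of respecting temporal order, the sketch below splits a sequence chronologically rather than at random, so that every training point precedes every test point. The helper name is hypothetical.

```python
def chronological_split(sequence, num_train):
    """Use the first `num_train` steps for training and the rest for testing."""
    return sequence[:num_train], sequence[num_train:]

series = list(range(10))  # stand-in for observations x_1, ..., x_10
train, test = chronological_split(series, num_train=6)
print(train)  # [0, 1, 2, 3, 4, 5]
print(test)   # [6, 7, 8, 9]
```

A random shuffle before splitting would let the model see observations that lie in the future of some test points, which turns extrapolation into the easier interpolation problem.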
Exercises
1. Improve the model in the experiment of this section.
   1. Incorporate more than the past 4 observations. How many do you really need?
   2. How many past observations would you need if there were no noise? Hint: you can write
      sin and cos as a differential equation.
   3. Can you incorporate older observations while keeping the total number of features constant?
      Does this improve accuracy? Why?
   4. Change the neural network architecture and evaluate the performance. You may train
      the new model with more epochs. What do you observe?
2. An investor wants to find a good security to buy. He looks at past returns to decide which
one is likely to do well. What could possibly go wrong with this strategy?

3. Does causality also apply to text? To what extent?
4. Give an example for when a latent autoregressive model might be needed to capture the
dynamics of the data.
Discussions
9.2 Converting Raw Text into Sequence Data
Throughout this book, we will often work with text data represented as sequences of words, char-
acters, or word-pieces. To get going, weʼll need some basic tools for converting raw text into se-
quences of the appropriate form. Typical preprocessing pipelines execute the following steps:
1. Load text as strings into memory.
2. Split the strings into tokens (e.g., words or characters).
3. Build a vocabulary dictionary to associate each vocabulary element with a numerical index.
4. Convert the text into sequences of numerical indices.
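Before building the full pipeline, the steps can be sketched end to end on a toy string. This sketch assumes word-level tokens and a plain dictionary as the vocabulary; the toy sentence and the variable names are ours.

```python
import re

# Step 1: the text is assumed to be in memory already
raw = 'The cat sat. The cat ran!'
# Normalize: replace runs of non-letters with a space and lowercase
text = re.sub('[^A-Za-z]+', ' ', raw).lower()
# Step 2: split into word tokens
tokens = text.split()
# Step 3: map each distinct token to an index (sorted for reproducibility)
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
# Step 4: convert the text into a sequence of indices
indices = [vocab[token] for token in tokens]

print(tokens)   # ['the', 'cat', 'sat', 'the', 'cat', 'ran']
print(indices)  # [3, 0, 2, 3, 0, 1]
```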
import collections
import random
import re
import torch
from d2l import torch as d2l
9.2.1 Reading the Dataset
Here, weʼll work with H. G. Wellsʼ The Time Machine137, a book containing just over 30,000 words.
While real applications will typically involve significantly larger datasets, this is sufficient to
demonstrate the preprocessing pipeline.
The following function reads the raw text into a string.
class TimeMachine(d2l.DataModule): #@save
    """The Time Machine dataset."""
    def _download(self):
        fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
                             '090b5e7e70c295757f55df93cb0a180b9691891a')
        with open(fname) as f:
            return f.read()
data = TimeMachine()
raw_text = data._download()
raw_text[:60]
'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra'

For simplicity, we ignore punctuation and capitalization when preprocessing the raw text.
137 http://www.gutenberg.org/ebooks/35

@d2l.add_to_class(TimeMachine) #@save
def _preprocess(self, text):
    # Replace every run of non-letter characters with a space, then lowercase
    return re.sub('[^A-Za-z]+', ' ', text).lower()
text = data._preprocess(raw_text)
text[:60]
'the time machine by h g wells i the time traveller for so it'
9.2.2 Tokenization
Tokens are the atomic (indivisible) units of text. Each sequence step corresponds to one token, but
what precisely constitutes a token is a design choice. For example, we could represent the sentence
“Baby needs a new pair of shoes” as a sequence of 7 words, where the set of all words comprises
a large vocabulary (typically tens or hundreds of thousands of words). Or we could represent
the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary
(there are only 128 distinct ASCII characters). Below, we tokenize our preprocessed text into a
sequence of characters.
@d2l.add_to_class(TimeMachine) #@save
def _tokenize(self, text):
    return list(text)
tokens = data._tokenize(text)
','.join(tokens[:30])
't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '
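The tradeoff between word-level and character-level tokens can be verified on the example sentence from above. This is a small standalone check using whitespace splitting as the (assumed) word tokenizer.

```python
sentence = 'Baby needs a new pair of shoes'
word_tokens = sentence.split()  # word-level tokens
char_tokens = list(sentence)    # character-level tokens (spaces included)

print(len(word_tokens))  # 7
print(len(char_tokens))  # 30
```

The character sequence is more than four times longer, but its tokens come from a far smaller set of possible values.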
9.2.3 Vocabulary
These tokens are still strings. However, the inputs to our models must ultimately be numerical.
Next, we introduce a class for constructing vocabularies, i.e., objects that associate
each distinct token value with a unique index. First, we determine the set of unique tokens
in our training corpus. We then assign a numerical index to each unique token. Rare vocabulary
elements are often dropped for convenience. Whenever we encounter a token at training or test
time that had not been previously seen or was dropped from the vocabulary, we represent it by a
special “<unk>” token, signifying that this is an unknown value.
class Vocab: #@save
    """Vocabulary for text."""
    def __init__(self, tokens=[], min_freq=0, reserved_tokens=[]):
        # Flatten a 2D list if needed
        if tokens and isinstance(tokens[0], list):
            tokens = [token for line in tokens for token in line]
        # Count token frequencies
        counter = collections.Counter(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The list of unique tokens
        self.idx_to_token = list(sorted(set(['<unk>'] + reserved_tokens + [
            token for token, freq in self.token_freqs if freq >= min_freq])))
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if hasattr(indices, '__len__') and len(indices) > 1:
            return [self.idx_to_token[int(index)] for index in indices]
        return self.idx_to_token[indices]

    @property
    def unk(self): # Index for the unknown token
        return self.token_to_idx['<unk>']
We now construct a vocabulary for our dataset, converting the sequence of strings into a list of
numerical indices. Note that we have not lost any information and can easily convert our dataset
back to its original (string) representation.
vocab = Vocab(tokens)
indices = vocab[tokens[:10]]
print('indices:', indices)
print('words:', vocab.to_tokens(indices))
indices: [21, 9, 6, 0, 21, 10, 14, 6, 0, 14]

words: ['t', 'h', 'e', ' ', 't', 'i', 'm', 'e', ' ', 'm']
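The round trip above is lossless only because every token is kept. The following sketch shows what happens once rare tokens are dropped. It uses a hypothetical helper, build_vocab, that is a simplified stand-in for the Vocab class: it reserves index 0 for “<unk>” rather than sorting it in with the other tokens, and the toy sentence is made up for illustration.

```python
import collections

def build_vocab(tokens, min_freq=0):
    """Hypothetical helper: index every token seen at least min_freq times."""
    counter = collections.Counter(tokens)
    kept = sorted(tok for tok, freq in counter.items() if freq >= min_freq)
    idx_to_token = ['<unk>'] + kept           # index 0 is reserved for unknowns
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

tokens = 'the time traveller and the time machine'.split()
idx_to_token, token_to_idx = build_vocab(tokens, min_freq=2)

# Only 'the' and 'time' occur twice; every other word falls back to <unk>.
encoded = [token_to_idx.get(tok, token_to_idx['<unk>']) for tok in tokens]
decoded = [idx_to_token[i] for i in encoded]
print(encoded)
print(decoded)
```

With min_freq=2, decoding no longer recovers the original words, which is the price paid for a smaller vocabulary.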
9.2.4 Putting it all Together
Using the above classes and methods, we package everything into the following build method of
the TimeMachine class, which returns corpus, a list of token indices, and vocab, the vocabulary of
The Time Machine corpus. The modifications we made here are: (i) we tokenize text into characters,
not words, to simplify the training in later sections; (ii) corpus is a single list, not a list of token
lists, since each text line in The Time Machine dataset is not necessarily a sentence or paragraph.
@d2l.add_to_class(TimeMachine) #@save
def build(self, raw_text, vocab=None):
    tokens = self._tokenize(self._preprocess(raw_text))
    if vocab is None: vocab = Vocab(tokens)
    corpus = [vocab[token] for token in tokens]
    return corpus, vocab

corpus, vocab = data.build(raw_text)
len(corpus), len(vocab)
(173428, 28)
9.2.5 Exploratory Language Statistics
Using the real corpus and the Vocab class defined over words, we can inspect basic statistics con-
cerning word use in our corpus. Below, we construct a vocabulary from words used in The Time
Machine and print the 10 most frequently occurring words.
words = text.split()
vocab = Vocab(words)
vocab.token_freqs[:10]
[('the', 2261),
('i', 1267),
('and', 1245),
('of', 1155),
('a', 816),
('to', 695),
('was', 552),
('in', 541),
('that', 443),
('my', 440)]
Note that the ten most frequent words arenʼt all that descriptive. You might even imagine that
we would see a very similar list if we had chosen any book at random. Articles like “the” and
“a”, pronouns like “i” and “my”, and prepositions like “of”, “to”, and “in” occur often because they
serve common syntactic roles. Such words, which are common but not particularly descriptive,
are often called stop words and, in previous generations of text classifiers based on bag-of-words
representations, they were most often filtered out. However, they carry meaning and it is not
necessary to filter them out when working with modern RNN- and Transformer-based neural NLP
models. If you look further down the list, you will notice that word frequency decays quickly. The
10th most frequent word is less than 1/5 as common as the most popular. Word frequency tends to
follow a power law distribution (specifically the Zipfian) as we go down the ranks. To get a better
idea, we plot the word frequencies.
freqs = [freq for token, freq in vocab.token_freqs]

d2l.plot(freqs, xlabel='token: x', ylabel='frequency: n(x)',
xscale='log', yscale='log')

After dealing with the first few words as exceptions, all the remaining words roughly follow a
straight line on a log-log plot. This phenomenon is captured by Zipf’s law, which states that the
frequency $n_i$ of the $i$-th most frequent word is:
$$n_i \propto \frac{1}{i^\alpha}, \tag{9.2.1}$$
which is equivalent to
$$\log n_i = -\alpha \log i + c, \tag{9.2.2}$$
where $\alpha$ is the exponent that characterizes the distribution and $c$ is a constant. This should already
give us pause if we want to model words by counting statistics. After all, we will significantly over-
estimate the frequency of the tail, also known as the infrequent words. But what about the other
word combinations, such as two consecutive words (bigrams), three consecutive words (trigrams),
and beyond? Letʼs see whether the bigram frequency behaves in the same manner as the single
word (unigram) frequency.
bigram_tokens = ['--'.join(pair) for pair in zip(words[:-1], words[1:])]

bigram_vocab = Vocab(bigram_tokens)
bigram_vocab.token_freqs[:10]
[('of--the', 309),
('in--the', 169),
('i--had', 130),
('i--was', 112),
('and--the', 109),
('the--time', 102),
('it--was', 99),
('to--the', 85),
('as--i', 78),
('of--a', 73)]
One thing is notable here. Out of the ten most frequent word pairs, nine are composed of both
stop words and only one is relevant to the actual book—“the time”. Furthermore, letʼs see whether
the trigram frequency behaves in the same manner.

trigram_tokens = ['--'.join(triple) for triple in zip(
words[:-2], words[1:-1], words[2:])]
trigram_vocab = d2l.Vocab(trigram_tokens)
trigram_vocab.token_freqs[:10]
[('the--time--traveller', 59),
('the--time--machine', 30),
('the--medical--man', 24),
('it--seemed--to', 16),
('it--was--a', 15),
('here--and--there', 15),
('seemed--to--me', 14),
('i--did--not', 14),
('i--saw--the', 13),
('i--began--to', 13)]
Last, letʼs visualize the token frequency among these three models: unigrams, bigrams, and tri-
grams.
bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]

trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]
d2l.plot([freqs, bigram_freqs, trigram_freqs], xlabel='token: x',
ylabel='frequency: n(x)', xscale='log', yscale='log',
legend=['unigram', 'bigram', 'trigram'])
This figure is quite exciting. First, beyond unigram words, sequences of words also appear to be
following Zipfʼs law, albeit with a smaller exponent α in (9.2.1), depending on the sequence length.
Second, the number of distinct n-grams is not that large. This gives us hope that there is quite a
lot of structure in language. Third, many n-grams occur very rarely. This makes certain methods
unsuitable for language modeling and motivates the use of deep learning models. We will discuss
this in the next section.
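Since (9.2.2) is linear in $\log i$, the exponent $\alpha$ can be estimated with an ordinary least-squares line fit on the log-log data. The sketch below is one way to do this. The helper fit_zipf_exponent is our own name, and the frequencies are synthetic values constructed to follow an exact power law, not counts from The Time Machine.

```python
import math

def fit_zipf_exponent(freqs):
    """Least-squares fit of log n_i = -alpha * log i + c; returns (alpha, c)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return -slope, y_mean - slope * x_mean

# Synthetic frequencies that follow n_i = 5000 / i^1.2 exactly (illustrative)
synthetic_freqs = [5000 / rank ** 1.2 for rank in range(1, 201)]
alpha, c = fit_zipf_exponent(synthetic_freqs)
print(round(alpha, 3), round(math.exp(c), 1))
```

On real counts, the input would be a list of frequencies sorted in decreasing order, such as freqs, bigram_freqs, or trigram_freqs above. As noted earlier, the first few ranks deviate from the straight line, so one might exclude them before fitting.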

9.2.6 Summary
• Text is among the most common forms of sequence data encountered in deep learning.
• Common choices for what constitutes a “token” are characters, words, and word pieces,
which will be covered in subsequent chapters.
• To preprocess text, we usually (i) split text into tokens; (ii) build a vocabulary to map to-
ken strings to numerical indices; and (iii) convert text data into token indices for models to
manipulate.
• The frequency of words tends to follow Zipfʼs law. This is true not just for individual words
(unigrams), but also for n-grams.
Exercises
1. In the experiment of this section, tokenize text into words and vary the min_freq argument
value of the Vocab instance. Qualitatively characterize how changes in min_freq impact the
size of the resulting vocabulary.
2. Estimate the exponent of the Zipfian distribution for unigrams, bigrams, and trigrams in this
corpus.
3. Find some other sources of data (download a standard ML dataset, pick another public
domain book, scrape a web site, etc.). For each, tokenize the data at both the word and
character levels. How do the vocabulary sizes compare to The Time Machine corpus at equivalent
values of min_freq? Estimate the exponent of the Zipfian distribution corresponding to the
unigram and bigram distributions for these corpora. How do they compare with the values
that you observed for The Time Machine corpus?
9.3 Language Models
In Section 9.2, we saw how to map text sequences into tokens, where these tokens can be viewed
as a sequence of discrete observations, such as words or characters. Assume that the tokens in a
text sequence of length $T$ are in turn $x_1, x_2, \ldots, x_T$. The goal of language models is to estimate the
joint probability of the whole sequence:
$$P(x_1, x_2, \ldots, x_T), \tag{9.3.1}$$
where statistical tools in Section 9.1 can be applied.

Language models are incredibly useful. For instance, an ideal language model would be able
to generate natural text just on its own, simply by drawing one token at a time
$x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1)$. Quite unlike the monkey using a typewriter, all text emerging from such a model
would pass as natural language, e.g., English text. Furthermore, it would be sufficient for gener-
ating a meaningful dialog, simply by conditioning the text on previous dialog fragments. Clearly
we are still very far from designing such a system, since it would need to understand the text rather
than just generate grammatically sensible content.

Nonetheless, language models are of great service even in their limited form. For instance, the
phrases “to recognize speech” and “to wreck a nice beach” sound very similar. This can cause am-
biguity in speech recognition, which is easily resolved through a language model that rejects the
second translation as outlandish. Likewise, in a document summarization algorithm it is worth-
while knowing that “dog bites man” is much more frequent than “man bites dog”, or that “I want
to eat grandma” is a rather disturbing statement, whereas “I want to eat, grandma” is much more
benign.
9.3.1 Learning Language Models
The obvious question is how we should model a document, or even a sequence of tokens. Suppose
that we tokenize text data at the word level. Letʼs start by applying basic probability rules:
$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_1, \ldots, x_{t-1}). \tag{9.3.2}$$
For example, the probability of a text sequence containing four words would be given as:
$$P(\text{deep}, \text{learning}, \text{is}, \text{fun}) = P(\text{deep}) P(\text{learning} \mid \text{deep}) P(\text{is} \mid \text{deep}, \text{learning}) P(\text{fun} \mid \text{deep}, \text{learning}, \text{is}). \tag{9.3.3}$$
Markov Models and n-grams
Among the sequence model analyses in Section 9.1, letʼs apply Markov models to language
modeling. A distribution over sequences satisfies the Markov property of first order if
$P(x_{t+1} \mid x_t, \ldots, x_1) = P(x_{t+1} \mid x_t)$. Higher orders correspond to longer dependencies. This leads to a
number of approximations that we could apply to model a sequence:
$$\begin{aligned}
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2) P(x_3) P(x_4), \\
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_2) P(x_4 \mid x_3), \\
P(x_1, x_2, x_3, x_4) &= P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_1, x_2) P(x_4 \mid x_2, x_3).
\end{aligned} \tag{9.3.4}$$
The probability formulae that involve one, two, and three variables are typically referred to as
unigram, bigram, and trigram models, respectively. In order to compute the language model, we
need to calculate the probability of words and the conditional probability of a word given the
previous few words. Note that such probabilities are language model parameters.
Word Frequency
Here, we assume that the training dataset is a large text corpus, such as all Wikipedia entries,
Project Gutenberg [139], and all text posted on the Web. The probability of words can be calculated
from the relative word frequency of a given word in the training dataset. For example, the estimate
$\hat{P}(\text{deep})$ can be calculated as the probability of any sentence starting with the word “deep”. A
slightly less accurate approach would be to count all occurrences of the word “deep” and divide
the count by the total number of words in the corpus. This works fairly well, particularly for
frequent words.
Moving on, we could attempt to estimate
$$\hat{P}(\text{learning} \mid \text{deep}) = \frac{n(\text{deep}, \text{learning})}{n(\text{deep})}, \tag{9.3.5}$$
where $n(x)$ and $n(x, x')$ are the number of occurrences of singletons and consecutive word pairs,
respectively. Unfortunately, estimating the probability of a word pair is somewhat more difficult,
since the occurrences of “deep learning” are a lot less frequent. In particular, for some unusual
word combinations it may be tricky to find enough occurrences to get accurate estimates. As
suggested by the empirical results in Section 9.2.5, things take a turn for the worse for three-word
combinations and beyond. There will be many plausible three-word combinations that we likely
will not see in our dataset. Unless we provide some solution to assign such word combinations
a nonzero count, we will not be able to use them in a language model. If the dataset is small or if
the words are very rare, we might not find even a single one of them.

[139] https://en.wikipedia.org/wiki/Project_Gutenberg
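The count-based estimates, and the zero-count problem they run into, can be sketched on a tiny corpus. The sentence below is made up for illustration, and the function names p_unigram and p_bigram are ours.

```python
import collections

# Toy corpus (illustrative only)
words = ('deep learning is fun and deep learning is hard '
         'but deep work is rare').split()

unigram_counts = collections.Counter(words)
bigram_counts = collections.Counter(zip(words[:-1], words[1:]))

def p_unigram(x):
    """Relative frequency n(x) / n."""
    return unigram_counts[x] / len(words)

def p_bigram(x_next, x):
    """Relative-frequency estimate n(x, x') / n(x), as in (9.3.5)."""
    return bigram_counts[(x, x_next)] / unigram_counts[x]

print(p_unigram('deep'))              # 'deep' is 3 of the 14 words
print(p_bigram('learning', 'deep'))   # 2 of the 3 'deep' precede 'learning'
print(p_bigram('fun', 'deep'))        # pair never observed: estimate is 0
```

The last estimate is exactly zero even though “deep fun” is a perfectly possible word pair, so any sequence containing it would receive probability zero under this model.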
Laplace Smoothing
A common strategy is to perform some form of Laplace smoothing. The solution is to add a small
constant to all counts. Denote by n the total number of words in the training set and m the number
of unique words. This solution helps with singletons, e.g., via
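$$\begin{aligned}
\hat{P}(x) &= \frac{n(x) + \epsilon_1/m}{n + \epsilon_1}, \\
\hat{P}(x' \mid x) &= \frac{n(x, x') + \epsilon_2 \hat{P}(x')}{n(x) + \epsilon_2}, \\
\hat{P}(x'' \mid x, x') &= \frac{n(x, x', x'') + \epsilon_3 \hat{P}(x'')}{n(x, x') + \epsilon_3}.
\end{aligned} \tag{9.3.6}$$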
Here $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$ are hyperparameters. Take $\epsilon_1$ as an example: when $\epsilon_1 = 0$, no smoothing is
applied; when $\epsilon_1$ approaches positive infinity, $\hat{P}(x)$ approaches the uniform probability $1/m$. The
above is a rather primitive variant of what other techniques can accomplish (Wood et al., 2011).
Unfortunately, models like this get unwieldy rather quickly for the following reasons. First, as
discussed in Section 9.2.5, many n-grams occur very rarely, making Laplace smoothing rather
unsuitable for language modeling. Second, we need to store all counts. Third, this entirely ignores
the meaning of the words. For instance, “cat” and “feline” should occur in related contexts. It is
quite difficult to adjust such models to additional contexts, whereas deep learning-based language
models are well suited to take this into account. Last, long word sequences are almost certain to
be novel, hence a model that simply counts the frequency of previously seen word sequences
is bound to perform poorly there. Therefore, we focus on using neural networks for language
modeling in the rest of the chapter.
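For reference, the first two estimators in (9.3.6) can be written down directly. The following is a minimal sketch on a made-up toy corpus. The function names and the default values of eps1 and eps2 are arbitrary choices of ours, not recommended settings.

```python
import collections

# Toy corpus (illustrative only)
words = ('deep learning is fun and deep learning is hard '
         'but deep work is rare').split()
unigram_counts = collections.Counter(words)
bigram_counts = collections.Counter(zip(words[:-1], words[1:]))
n, m = len(words), len(unigram_counts)   # total words, unique words

def p_hat(x, eps1=1.0):
    """Smoothed unigram estimate: first line of (9.3.6)."""
    return (unigram_counts[x] + eps1 / m) / (n + eps1)

def p_hat_cond(x_next, x, eps1=1.0, eps2=1.0):
    """Smoothed bigram estimate: second line of (9.3.6)."""
    return ((bigram_counts[(x, x_next)] + eps2 * p_hat(x_next, eps1))
            / (unigram_counts[x] + eps2))

print(p_hat_cond('fun', 'deep'))       # unseen pair, but no longer zero
print(p_hat('deep', eps1=0.0))         # no smoothing: relative frequency
print(p_hat('deep', eps1=1e9), 1 / m)  # heavy smoothing: close to uniform
```

The unseen pair now receives a small positive probability, and the two extreme settings of eps1 reproduce the limiting cases described above.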
9.3.2 Perplexity
Next, letʼs discuss how to measure the quality of a language model, which we will use to evaluate
our models in the subsequent sections. One way is to check how surprising the text is. A good
language model is able to predict, with high accuracy, the tokens that we will see next. Consider
the following continuations of the phrase “It is raining”, as proposed by different language models:
1. “It is raining outside”
2. “It is raining banana tree”
3. “It is raining piouw;kcj pwepoiut”
In terms of quality, example 1 is clearly the best. The words are sensible and logically coherent.
While it might not quite accurately reflect which word follows semantically (“in San Francisco” and
“in winter” would have been perfectly reasonable extensions), the model is able to capture which
kind of word follows. Example 2 is considerably worse by producing a nonsensical extension.
Nonetheless, at least the model has learned how to spell words and some degree of correlation
between words. Last, example 3 indicates a poorly trained model that does not fit data properly.
We might measure the quality of the model by computing the likelihood of the sequence. Un-
fortunately this is a number that is hard to understand and difficult to compare. After all, shorter
sequences are much more likely to occur than the longer ones, hence evaluating the model on Tol-
stoyʼs magnum opus War and Peace will inevitably produce a much smaller likelihood than, say,
on Saint-Exuperyʼs novella The Little Prince. What is missing is the equivalent of an average.
Information theory comes in handy here. We have defined entropy, surprisal, and cross-entropy
when we introduced the softmax regression (Section 4.1.3). If we want to compress text, we can ask
about predicting the next token given the current set of tokens. A better language model should
allow us to predict the next token more accurately. Thus, it should allow us to spend fewer bits in
compressing the sequence. So we can measure it by the cross-entropy loss averaged over all the n
tokens of a sequence:
$$\frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1), \tag{9.3.7}$$
where $P$ is given by a language model and $x_t$ is the actual token observed at time step $t$ from the
sequence. This makes the performance on documents of different lengths comparable. For
historical reasons, scientists in natural language processing prefer to use a quantity called perplexity.
In a nutshell, it is the exponential of (9.3.7):
$$\exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right). \tag{9.3.8}$$
Perplexity can be best understood as the geometric mean of the number of real choices that we
have when deciding which token to pick next. Letʼs look at a number of cases:
• In the best case scenario, the model always perfectly estimates the probability of the target
token as 1. In this case the perplexity of the model is 1.
• In the worst case scenario, the model always predicts the probability of the target token as
0. In this situation, the perplexity is positive infinity.
• At the baseline, the model predicts a uniform distribution over all the available tokens of the
vocabulary. In this case, the perplexity equals the number of unique tokens of the vocabu-
lary. In fact, if we were to store the sequence without any compression, this would be the
best we could do to encode it. Hence, this provides a nontrivial upper bound that any useful
model must beat.
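The three cases above can be checked numerically. The sketch below computes (9.3.8) from a list of probabilities that a model assigned to the observed tokens. The function name perplexity and the example probabilities are ours, and the vocabulary size of 28 is borrowed from the character-level vocabulary of Section 9.2.

```python
import math

def perplexity(target_probs):
    """exp of the average negative log-probability of the observed tokens."""
    if any(p == 0 for p in target_probs):
        return math.inf    # log(0) diverges: the worst case
    avg_nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(avg_nll)

vocab_size = 28
print(perplexity([1.0] * 5))                # perfect predictions
print(perplexity([1 / vocab_size] * 5))     # uniform baseline
print(perplexity([0.5, 0.25, 0.5, 0.25]))   # somewhere in between
print(perplexity([0.9, 0.0, 0.8]))          # one impossible target
```

The perfect model scores 1, the uniform model scores the vocabulary size (up to floating-point rounding), and a single zero-probability target sends the perplexity to infinity.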
import torch
from d2l import torch as d2l

9.3.3 Partitioning Sequences
We will design language models using neural networks and use perplexity to evaluate how good
the model is at predicting the next token given the current set of tokens in text sequences. Before
introducing the model, letʼs assume that it processes a minibatch of sequences with predefined
length at a time. Now the question is how to read minibatches of input sequences and target
sequences at random.
Suppose that the dataset takes the form of a sequence of $T$ token indices in corpus. We will
partition it into subsequences, where each subsequence has $n$ tokens (time steps). To iterate over
(almost) all the tokens of the entire dataset for each epoch and obtain all possible length-$n$
subsequences, we can introduce randomness. More concretely, at the beginning of each epoch, discard
the first $d$ tokens, where $d \in [0, n)$ is uniformly sampled at random. The rest of the sequence is
then partitioned into $m = \lfloor (T-d)/n \rfloor$ subsequences. Denote by $\mathbf{x}_t = [x_t, \ldots, x_{t+n-1}]$ the length-$n$
subsequence starting from token $x_t$ at time step $t$. The resulting $m$ partitioned subsequences are
$\mathbf{x}_d, \mathbf{x}_{d+n}, \ldots, \mathbf{x}_{d+n(m-1)}$. Each subsequence will be used as an input sequence into the language
model.
For language modeling, the goal is to predict the next token based on what tokens we have seen so
far, hence the targets (labels) are the original sequence, shifted by one token. The target sequence
for any input sequence $\mathbf{x}_t$ is $\mathbf{x}_{t+1}$ with length $n$.
Fig. 9.3.1: Obtaining 5 pairs of input sequences and target sequences from partitioned length-5
subsequences.
Fig. 9.3.1 shows an example of obtaining 5 pairs of input sequences and target sequences with
n = 5 and d = 2.
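The partitioning scheme described above can be sketched in a few lines of plain Python. The function partition and the 30-token toy corpus are our own illustration. Note one deliberate deviation: the sketch reserves one extra token so that the last target also has full length, so it may yield one subsequence fewer than the formula for $m$ suggests.

```python
import random

def partition(corpus, n, d=None):
    """Split corpus into (input, target) pairs of length n after skipping d tokens."""
    if d is None:
        d = random.randrange(n)      # uniform over {0, ..., n - 1}
    # Reserve one extra token so that the last target also has n tokens
    m = (len(corpus) - d - 1) // n
    pairs = []
    for k in range(m):
        start = d + k * n
        inputs = corpus[start:start + n]
        targets = corpus[start + 1:start + n + 1]   # inputs shifted by one
        pairs.append((inputs, targets))
    return pairs

toy_corpus = list(range(30))         # stand-in for a list of token indices
pairs = partition(toy_corpus, n=5, d=2)
for inputs, targets in pairs:
    print(inputs, targets)
```

With n=5 and d=2, this produces 5 input-target pairs, and each target is its input shifted by one position.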
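@d2l.add_to_class(d2l.TimeMachine) #@save
def __init__(self, batch_size, num_steps, num_train=10000, num_val=5000):
    super(d2l.TimeMachine, self).__init__()
    # Store the constructor arguments (e.g., num_train) as attributes
    self.save_hyperparameters()
    corpus, self.vocab = self.build(self._download())
    # Each row holds num_steps + 1 consecutive tokens
    array = torch.tensor([corpus[i:i+num_steps+1]
                          for i in range(0, len(corpus)-num_steps-1)])
    # Targets are the inputs shifted by one token
    self.X, self.Y = array[:,:-1], array[:,1:]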
To train language models, we will randomly sample pairs of input sequences and target sequences
in minibatches. The following data loader randomly generates a minibatch from the dataset each
time. The argument batch_size specifies the number of subsequence examples in each
minibatch and num_steps is the subsequence length in tokens.
@d2l.add_to_class(d2l.TimeMachine) #@save
def get_dataloader(self, train):
    idx = slice(0, self.num_train) if train else slice(
        self.num_train, self.num_train + self.num_val)
    return self.get_tensorloader([self.X, self.Y], train, idx)

As we can see in the following, a minibatch of target sequences can be obtained by shifting the
input sequences by one token.
data = d2l.TimeMachine(batch_size=2, num_steps=10)
for X, Y in data.train_dataloader():
    print('X:', X, '\nY:', Y)
    break
X: tensor([[16, 15,  0,  3, 22, 21,  0, 21,  9, 19],
        [ 9,  2, 23,  6,  0,  2,  0, 19,  6,  2]])
Y: tensor([[15,  0,  3, 22, 21,  0, 21,  9, 19, 16],
        [ 2, 23,  6,  0,  2,  0, 19,  6,  2, 13]])
9.3.4 Summary
• Language models estimate the joint probability of a text sequence.

• n-grams provide a convenient model for dealing with long sequences by truncating the de-
pendence.
• There is a lot of structure but not enough frequency to deal with infrequent word combina-
tions efficiently via Laplace smoothing.
• To train language models, we can randomly sample pairs of input sequences and target se-
quences in minibatches.
Exercises
1. Suppose there are 100,000 words in the training dataset. How many word frequencies and
multi-word adjacent frequencies does a four-gram model need to store?
2. How would you model a dialogue?
3. What other methods can you think of for reading long sequence data?
4. Consider our method for discarding a uniformly random number of the first few tokens at
the beginning of each epoch.
1. Does it really lead to a perfectly uniform distribution over the sequences on the docu-
ment?
2. What would you have to do to make things even more uniform?
5. If we want a sequence example to be a complete sentence, what kind of problem does this
introduce in minibatch sampling? How can we fix the problem?

9.4 Recurrent Neural Networks
In Section 9.3 we described Markov models and n-grams for language modeling, where the
conditional probability of token $x_t$ at time step $t$ only depends on the $n-1$ previous tokens. If we
want to incorporate the possible effect of tokens earlier than time step $t-(n-1)$ on $x_t$, we need
to increase $n$. However, the number of model parameters would also increase exponentially
with it, as we need to store $|\mathcal{V}|^n$ numbers for a vocabulary set $\mathcal{V}$. Hence, rather than modeling
$P(x_t \mid x_{t-1}, \ldots, x_{t-n+1})$ it is preferable to use a latent variable model:
$$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1}), \tag{9.4.1}$$
where $h_{t-1}$ is a hidden state that stores the sequence information up to time step $t-1$. In general,
the hidden state at any time step $t$ could be computed based on both the current input $x_t$ and the
previous hidden state $h_{t-1}$:
$$h_t = f(x_t, h_{t-1}). \tag{9.4.2}$$
For a sufficiently powerful function $f$ in (9.4.2), the latent variable model is not an approximation.
After all, $h_t$ may simply store all the data it has observed so far. However, it could potentially make
both computation and storage expensive.
Recall that we have discussed hidden layers with hidden units in Chapter 5. It is noteworthy that
hidden layers and hidden states refer to two very different concepts. Hidden layers are, as ex-
plained, layers that are hidden from view on the path from input to output. Hidden states are
technically speaking inputs to whatever we do at a given step, and they can only be computed by
looking at data at previous time steps.
Recurrent neural networks (RNNs) are neural networks with hidden states. Before introducing the
RNN model, we first revisit the MLP model introduced in Section 5.1.
9.4.1 Neural Networks without Hidden States
Letʼs take a look at an MLP with a single hidden layer. Let the hidden layerʼs activation function
be $\phi$. Given a minibatch of examples $\mathbf{X} \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs, the hidden layer
output $\mathbf{H} \in \mathbb{R}^{n \times h}$ is calculated as
$$\mathbf{H} = \phi(\mathbf{X} \mathbf{W}_{xh} + \mathbf{b}_h). \tag{9.4.3}$$
In (9.4.3), we have the weight parameter $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$, the bias parameter $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$, and the
number of hidden units $h$, for the hidden layer. Since the bias is a row vector, broadcasting (see
Section 2.1.4) is applied during the summation. Next, the hidden layer output $\mathbf{H}$ is used as input
of the output layer. The output layer is given by
$$\mathbf{O} = \mathbf{H} \mathbf{W}_{hq} + \mathbf{b}_q, \tag{9.4.4}$$
where $\mathbf{O} \in \mathbb{R}^{n \times q}$ is the output variable, $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ is
the bias parameter of the output layer. If it is a classification problem, we can use $\text{softmax}(\mathbf{O})$ to
compute the probability distribution of the output categories.
This is entirely analogous to the regression problem we solved previously in Section 9.1, hence
we omit details. Suffice it to say that we can pick feature-label pairs at random and learn the
parameters of our network via automatic differentiation and stochastic gradient descent.

9.4.2 Recurrent Neural Networks with Hidden States
Matters are entirely different when we have hidden states. Letʼs look at the structure in some more
detail.
Assume that we have a minibatch of inputs $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ at time step $t$. In other words, for a
minibatch of $n$ sequence examples, each row of $\mathbf{X}_t$ corresponds to one example at time step $t$ from
the sequence. Next, denote by $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ the hidden layer output of time step $t$. Unlike the MLP,
here we save the hidden layer output $\mathbf{H}_{t-1}$ from the previous time step and introduce a new weight
parameter $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ to describe how to use the hidden layer output of the previous time step in
the current time step. Specifically, the calculation of the hidden layer output of the current time
step is determined by the input of the current time step together with the hidden layer output of
the previous time step:
$$\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h). \tag{9.4.5}$$
Compared with (9.4.3), (9.4.5) adds one more term $\mathbf{H}_{t-1} \mathbf{W}_{hh}$ and thus instantiates (9.4.2). From
the relationship between hidden layer outputs $\mathbf{H}_t$ and $\mathbf{H}_{t-1}$ of adjacent time steps, we know that
these variables capture and retain the sequenceʼs historical information up to their current
time step, just like the state or memory of the neural networkʼs current time step. Therefore, such
a hidden layer output is called a hidden state. Since the hidden state at the current time step is
computed with the same definition as at the previous time step, the computation of (9.4.5) is
recurrent. Hence, as we said, neural networks with hidden states based on recurrent computation
are named recurrent neural networks. Layers that perform the computation of (9.4.5) in RNNs are
called recurrent layers.
There are many different ways for constructing RNNs. RNNs with a hidden state defined by (9.4.5)
are very common. For time step $t$, the output of the output layer is similar to the computation in
the MLP:
$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q. \tag{9.4.6}$$
Parameters of the RNN include the weights $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$, $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$, and the bias $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$
of the hidden layer, together with the weights $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ of the output
layer. It is worth mentioning that even at different time steps, RNNs always use the same model
parameters. Therefore, the parameterization cost of an RNN does not grow as the number of time
steps increases.
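As a rough illustration of (9.4.5) and (9.4.6), the sketch below unrolls the recurrence over sequences of two different lengths with a single shared set of parameters. The sizes, the 0.1 initialization scale, and the choice of tanh for $\phi$ are assumptions made for this example only.

```python
import torch

torch.manual_seed(0)
n, d, h, q = 2, 5, 4, 3   # batch size, inputs, hidden units, outputs (arbitrary)

# One set of parameters, shared by every time step
W_xh, W_hh, b_h = torch.randn(d, h) * 0.1, torch.randn(h, h) * 0.1, torch.zeros(1, h)
W_hq, b_q = torch.randn(h, q) * 0.1, torch.zeros(1, q)

def rnn_forward(inputs, H):
    """Apply (9.4.5) and (9.4.6) at each time step, with phi = tanh."""
    outputs = []
    for X_t in inputs:                                # X_t has shape (n, d)
        H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)   # (9.4.5)
        outputs.append(H @ W_hq + b_q)                # (9.4.6)
    return torch.stack(outputs), H

num_params = sum(p.numel() for p in (W_xh, W_hh, b_h, W_hq, b_q))
for T in (3, 30):
    outputs, H = rnn_forward(torch.randn(T, n, d), torch.zeros(n, h))
    print(T, tuple(outputs.shape), tuple(H.shape), num_params)
```

The output tensor grows with the number of time steps, while the hidden state keeps its shape and the parameter count stays fixed at 55 for these sizes.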
Fig. 9.4.1 illustrates the computational logic of an RNN at three adjacent time steps. At any time
step $t$, the computation of the hidden state can be treated as: (i) concatenating the input $\mathbf{X}_t$ at
the current time step $t$ and the hidden state $\mathbf{H}_{t-1}$ at the previous time step $t-1$; (ii) feeding the
concatenation result into a fully connected layer with the activation function $\phi$. The output of such
a fully connected layer is the hidden state $\mathbf{H}_t$ of the current time step $t$. In this case, the model
parameters are the concatenation of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$, and a bias of $\mathbf{b}_h$, all from (9.4.5). The hidden
state of the current time step $t$, $\mathbf{H}_t$, will participate in computing the hidden state $\mathbf{H}_{t+1}$ of the next
time step $t+1$. What is more, $\mathbf{H}_t$ will also be fed into the fully connected output layer to compute
the output $\mathbf{O}_t$ of the current time step $t$.
Fig. 9.4.1: An RNN with a hidden state.
We just mentioned that the calculation of $\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh}$ for the hidden state is equivalent to
the matrix multiplication of the concatenation of $\mathbf{X}_t$ and $\mathbf{H}_{t-1}$ with the concatenation of $\mathbf{W}_{xh}$ and
$\mathbf{W}_{hh}$. Though this can be proven mathematically, in the following we just use a simple code
snippet to show this. To begin with, we define matrices X, W_xh, H, and W_hh, whose shapes are
(3, 1), (1, 4), (3, 4), and (4, 4), respectively. Multiplying X by W_xh, and H by W_hh, respectively,
and then adding these two products, we obtain a matrix of shape (3, 4).
import torch
X, W_xh = torch.randn(3, 1), torch.randn(1, 4)

H, W_hh = torch.randn(3, 4), torch.randn(4, 4)
torch.matmul(X, W_xh) + torch.matmul(H, W_hh)
tensor([[-0.7369, -0.3150, -1.2501, 1.3105],

[-1.1689, 0.5863, 0.7762, 0.5272],
[-0.1722, 0.6508, 1.6008, -0.3215]])
Now we concatenate the matrices X and H along columns (axis 1), and the matrices W_xh and W_hh
along rows (axis 0). These two concatenations result in matrices of shape (3, 5) and of shape (5,
4), respectively. Multiplying these two concatenated matrices, we obtain the same output matrix
of shape (3, 4) as above.
torch.matmul(torch.cat((X, H), 1), torch.cat((W_xh, W_hh), 0))
tensor([[-0.7369, -0.3150, -1.2501, 1.3105],

[-1.1689, 0.5863, 0.7762, 0.5272],
[-0.1722, 0.6508, 1.6008, -0.3215]])

9.4.3 RNN-based Character-Level Language Models
Recall that for language modeling in Section 9.3, we aim to predict the next token based on the
current and past tokens, thus we shift the original sequence by one token as the targets (labels).
Bengio et al. first proposed to use a neural network for language modeling (Bengio et al., 2003). In
the following we illustrate how RNNs can be used to build a language model. Let the minibatch size
be one, and the sequence of the text be “machine”. To simplify training in subsequent sections, we
tokenize text into characters rather than words and consider a character-level language model. Fig.
9.4.2 demonstrates how to predict the next character based on the current and previous characters
via an RNN for character-level language modeling.
Fig. 9.4.2: A character-level language model based on the RNN. The input and target sequences
are “machin” and “achine”, respectively.
During the training process, we run a softmax operation on the output from the output layer for
each time step, and then use the cross-entropy loss to compute the error between the model output
and the target. Due to the recurrent computation of the hidden state in the hidden layer, the output
of time step 3 in Fig. 9.4.2, $\mathbf{O}_3$, is determined by the text sequence “m”, “a”, and “c”. Since the next
character of the sequence in the training data is “h”, the loss of time step 3 will depend on the
probability distribution of the next character generated based on the feature sequence “m”, “a”,
“c” and the target “h” of this time step.
In practice, each token is represented by a $d$-dimensional vector, and we use a batch size $n > 1$.
Therefore, the input $\mathbf{X}_t$ at time step $t$ will be an $n \times d$ matrix, which is identical to what we discussed
in Section 9.4.2.
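The data side of this setup can be sketched as follows: the sequence “machine” is split into input and target characters, the inputs are one-hot encoded (so that $d$ equals the vocabulary size), and the cross-entropy loss is averaged over the time steps. This is not a trained RNN. As a stated assumption, the logits are all zeros, standing in for an untrained output layer, and the vocabulary contains only the seven characters of “machine”.

```python
import math
import torch
import torch.nn.functional as F

text = 'machine'
vocab = sorted(set(text))                        # 7 distinct characters
token_to_idx = {ch: i for i, ch in enumerate(vocab)}
indices = torch.tensor([token_to_idx[ch] for ch in text])

inputs, targets = indices[:-1], indices[1:]      # "machin" -> "achine"
X = F.one_hot(inputs, num_classes=len(vocab)).float()   # shape (6, 7)

# Placeholder for the output layer: zero logits give a uniform softmax
logits = torch.zeros(len(inputs), len(vocab))
loss = F.cross_entropy(logits, targets)          # mean over the 6 time steps
print(tuple(X.shape), loss.item(), math.exp(loss.item()))
```

Because the placeholder predicts a uniform distribution, the resulting perplexity equals the vocabulary size of 7 (up to rounding), which matches the uniform baseline discussed in Section 9.3.2.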
In the following sections, we will implement RNNs for character-level language models and use
perplexity to evaluate such models.
9.4.4 Summary
• A neural network that uses recurrent computation for hidden states is called a recurrent
neural network (RNN).
• The hidden state of an RNN can capture historical information of the sequence up to the
current time step.
• The number of RNN model parameters does not grow as the number of time steps increases.
• We can create character-level language models using an RNN.
• We can use perplexity to evaluate the quality of language models.
Exercises
1. If we use an RNN to predict the next character in a text sequence, what is the required di-
mension for any output?
2. Why can RNNs express the conditional probability of a token at some time step based on all
the previous tokens in the text sequence?
3. What happens to the gradient if you backpropagate through a long sequence?
4. What are some of the problems associated with the language model described in this sec-
tion?
Discussions
9.5 Recurrent Neural Network Implementation from Scratch
Weʼre now ready to implement an RNN from scratch. In particular, we will train this RNN to
function as a character-level language model (see Section 9.4) and train it on a corpus consisting
of the entire text of H. G. Wellsʼ The Time Machine, following the data processing steps outlined in
Section 9.2. We start by loading the dataset.
%matplotlib inline
import math
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
9.5.1 RNN Model
We begin by defining a class to implement the RNN model (Section 9.4.2). Note that the number
of hidden units num_hiddens is a tunable hyperparameter.
class RNNScratch(d2l.Module):  #@save
    """The RNN model implemented from scratch."""
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        # Store the arguments as attributes (e.g., self.num_hiddens, self.sigma)
        self.save_hyperparameters()
        self.W_xh = nn.Parameter(
            torch.randn(num_inputs, num_hiddens) * sigma)
        self.W_hh = nn.Parameter(
            torch.randn(num_hiddens, num_hiddens) * sigma)
        self.b_h = nn.Parameter(torch.zeros(num_hiddens))
The forward method defines how to compute the output and hidden state at any sequence step,
given the current input and the state of the model at the previous sequence step. Note that the RNN
model loops through the outermost dimension of inputs updating the hidden state one sequence
step at a time. The model here uses a tanh activation function, which has a sigmoid-like shape
but takes value 0 for input 0 and takes values approaching −1 (for highly negative inputs) and
approaching 1 (for highly positive inputs).
@d2l.add_to_class(RNNScratch)  #@save
def forward(self, inputs, state=None):
    if state is not None:
        state, = state
    outputs = []
    for X in inputs:  # Shape of inputs: (num_steps, batch_size, num_inputs)
        state = torch.tanh(torch.matmul(X, self.W_xh) + (
            torch.matmul(state, self.W_hh) if state is not None else 0)
                         + self.b_h)
        outputs.append(state)
    return outputs, state
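The recurrence in the forward method can also be traced by hand on a scalar toy model. The following sketch is a simplified stand-in (one hidden unit, plain Python floats, assumed weight values) rather than the class defined above; it shows that the hidden state stays in (−1, 1) and keeps carrying information after the input goes to zero.

```python
import math

def rnn_forward(inputs, w_xh, w_hh, b_h, state=None):
    """Scalar RNN: h_t = tanh(w_xh * x_t + w_hh * h_{t-1} + b_h)."""
    outputs = []
    for x in inputs:  # one sequence step at a time
        prev = 0.0 if state is None else state  # no history before the first step
        state = math.tanh(w_xh * x + w_hh * prev + b_h)
        outputs.append(state)
    return outputs, state

# Illustrative weights (assumed values, not learned parameters)
outputs, state = rnn_forward([1.0, 0.0, 0.0], w_xh=0.5, w_hh=0.8, b_h=0.0)
print(outputs)
```

Passing the returned state back in through the `state` argument continues the sequence from where the previous call left off.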
We can feed a minibatch of input sequences into an RNN model as follows.
batch_size, num_inputs, num_hiddens, num_steps = 2, 16, 32, 100

rnn = RNNScratch(num_inputs, num_hiddens)
X = torch.ones((num_steps, batch_size, num_inputs))
outputs, state = rnn(X)
Letʼs check whether the RNN model produces results of the correct shapes to ensure that the di-
mensionality of the hidden state remains unchanged.
def check_len(a, n):  #@save
    """Check the length of a list."""
    assert len(a) == n, f'list\'s len {len(a)} != expected length {n}'

def check_shape(a, shape):  #@save
    """Check the shape of a tensor."""
    assert a.shape == shape, \
            f'tensor\'s shape {a.shape} != expected shape {shape}'

d2l.check_len(outputs, num_steps)
d2l.check_shape(outputs[0], (batch_size, num_hiddens))
d2l.check_shape(state, (batch_size, num_hiddens))
9.5.2 RNN-based Language Model
The following RNNLMScratch class defines an RNN-based language model, where we pass in our
RNN via the rnn argument of the __init__ method. When training language models, the inputs
and outputs are from the same vocabulary. Hence, they have the same dimension, which is equal
to the vocabulary size. Note that we use perplexity to evaluate the model. As discussed in Section
9.3.2, this ensures that sequences of different length are comparable.
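class RNNLMScratch(d2l.Classifier):  #@save
    """The RNN-based language model implemented from scratch."""
    def __init__(self, rnn, vocab_size, lr=0.01):
        super().__init__()
        # Store rnn, vocab_size, and lr as attributes
        self.save_hyperparameters()
        self.init_params()

    def init_params(self):
        self.W_hq = nn.Parameter(
            torch.randn(
                self.rnn.num_hiddens, self.vocab_size) * self.rnn.sigma)
        self.b_q = nn.Parameter(torch.zeros(self.vocab_size))

    def training_step(self, batch):
        # Cross-entropy loss of the batch; its exponential is the perplexity
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('ppl', torch.exp(l), train=True)
        return l

    def validation_step(self, batch):
        l = self.loss(self(*batch[:-1]), batch[-1])
        self.plot('ppl', torch.exp(l), train=False)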
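The quantity plotted as 'ppl' is the exponential of the loss. Assuming that loss is the cross-entropy averaged over tokens, the sketch below (with a hypothetical helper `perplexity` and made-up probabilities) illustrates why this makes sequences of different lengths comparable.

```python
import math

def perplexity(target_probs):
    """exp of the average negative log-probability of the correct tokens."""
    avg_nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities that a model assigned to the correct next token
short_seq = [0.25] * 4
long_seq = [0.25] * 40
print(perplexity(short_seq), perplexity(long_seq))
```

Both sequences yield a perplexity of about 4, even though the summed loss of the longer one is ten times larger.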
One-Hot Encoding
Recall that each token is represented by a numerical index indicating the position in the vocab-
ulary of the corresponding word/character/word-piece. You might be tempted to build a neural
network with a single input node (at each sequence step), where the index could be fed in as a
scalar value. This works when weʼre dealing with numerical inputs like price or temperature,
where any two values sufficiently close together should be treated similarly. But it does not
make sense for token indices. The 45th and 46th words in our vocabulary happen to be “their” and “said”, whose
meanings are not remotely similar.
When dealing with such categorical data, the most common strategy is to represent each item by
a one-hot encoding (recall from Section 4.1.1). A one-hot encoding is a vector whose length is given by
the size of the vocabulary N , where all entries are set to 0, except for the entry corresponding to
our token, which is set to 1. For example, if the vocabulary had 5 elements, then the one-hot
vectors corresponding to indices 0 and 2 would be the following.
F.one_hot(torch.tensor([0, 2]), 5)
tensor([[1, 0, 0, 0, 0],
[0, 0, 1, 0, 0]])
The minibatches that we sample at each iteration will take the shape (batch size, number of se-
quence steps). Once representing each input as a one-hot vector, we can think of each minibatch
as a three-dimensional tensor, where the length along the third axis is given by the vocabulary
size (len(vocab)). We often transpose the input so that we will obtain an output of shape (number
of time steps, batch size, vocabulary size). This will allow us to more conveniently loop through
the outermost dimension for updating hidden states of a minibatch, time step by time step (e.g.,
in the above forward method).
@d2l.add_to_class(RNNLMScratch)  #@save
def one_hot(self, X):
    # Output shape: (num_steps, batch_size, vocab_size)
    return F.one_hot(X.T, self.vocab_size).type(torch.float32)
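The shape bookkeeping of this method can be checked without any tensor library. The sketch below uses nested lists and two hypothetical helpers (`one_hot`, `one_hot_time_major`) to mimic the transpose followed by one-hot encoding.

```python
def one_hot(index, vocab_size):
    """Length-`vocab_size` list with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def one_hot_time_major(batch, vocab_size):
    """(batch_size, num_steps) indices -> (num_steps, batch_size, vocab_size)."""
    num_steps = len(batch[0])
    return [[one_hot(seq[t], vocab_size) for seq in batch]
            for t in range(num_steps)]

# Hypothetical minibatch: 2 sequences of 3 token indices, vocabulary of size 5
batch = [[0, 2, 4],
         [1, 1, 3]]
encoded = one_hot_time_major(batch, 5)
print(len(encoded), len(encoded[0]), len(encoded[0][0]))  # 3 2 5
```

Iterating over `encoded` now yields one (batch size, vocabulary size) slice per time step, which is the layout the RNN loop expects.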

Transforming RNN Outputs
The language model uses a fully connected output layer to transform RNN outputs into token pre-
dictions at each time step.
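@d2l.add_to_class(RNNLMScratch)  #@save
def output_layer(self, rnn_outputs):
    # Apply the output layer at every time step, then stack along axis 1
    outputs = [torch.matmul(H, self.W_hq) + self.b_q for H in rnn_outputs]
    return torch.stack(outputs, 1)

@d2l.add_to_class(RNNLMScratch)  #@save
def forward(self, X, state=None):
    embs = self.one_hot(X)
    rnn_outputs, _ = self.rnn(embs, state)
    return self.output_layer(rnn_outputs)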
Letʼs check whether the forward computation produces outputs with the correct shape.
model = RNNLMScratch(rnn, num_inputs)

outputs = model(torch.ones((batch_size, num_steps), dtype=torch.int64))
d2l.check_shape(outputs, (batch_size, num_steps, num_inputs))
9.5.3 Gradient Clipping
While youʼre already used to thinking of neural networks as “deep” in the sense that many lay-
ers separate the input and output even within a single sequence step, the length of the sequence
introduces a new notion of depth. In addition to passing through the network in the input-to-output direction, inputs at the first time steps must pass through a chain of T layers along the
sequence steps in order to influence the output of the model at the final sequence step. Taking the
backwards view, in each iteration, we backpropagate gradients through time, resulting in a chain
of matrix-products with length O(T ). As mentioned in Section 5.4, this can result in numerical
instability, causing the gradients to either explode or vanish depending on the properties of the
weight matrices.
Dealing with vanishing and exploding gradients is a fundamental problem when designing RNNs
and has inspired some of the biggest advances in modern neural network architectures. In the
next chapter, weʼll talk about specialized architectures that were designed in hopes of mitigating
the vanishing gradient problem. However, even modern RNNs still often suffer from exploding
gradients. One inelegant but ubiquitous solution is to simply clip the gradients forcing the result-
ing “clipped” gradients to take smaller values.
Generally speaking, when optimizing some objective by gradient descent, we iteratively update
the parameter of interest, say a vector x, by pushing it in the direction of the negative gradient g
(in stochastic gradient descent, we calculate this gradient on a randomly sampled minibatch). For
example, with learning rate η > 0, each update takes the form x ← x − ηg. Letʼs further assume
that the objective function f is sufficiently smooth. Formally, we say that the objective is Lipschitz
continuous with constant L, meaning that for any x and y, we have
|f (x) − f (y)| ≤ L∥x − y∥. (9.5.1)
As you can see, when we update the parameter vector by subtracting ηg, the change in the value of the objective depends on the learning rate, the
norm of the gradient and L as follows:
|f (x) − f (x − ηg)| ≤ Lη∥g∥. (9.5.2)
In other words, the objective cannot change by more than Lη∥g∥. Having a small value for this
upper bound might be viewed as a good thing or a bad thing. On the downside, we are limiting
the speed at which we can reduce the value of the objective. On the bright side, this limits just how
much we can go wrong in any one gradient step.
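The bound in (9.5.2) is easy to check numerically. As an assumed one-dimensional example, take f(x) = sin(x), which is Lipschitz continuous with L = 1 because its derivative cos(x) never exceeds 1 in magnitude. The helper `change_and_bound` is hypothetical.

```python
import math

def change_and_bound(x, eta, L=1.0):
    """Actual change |f(x) - f(x - eta * g)| and the bound L * eta * |g| for f = sin."""
    g = math.cos(x)  # gradient of sin at x
    change = abs(math.sin(x) - math.sin(x - eta * g))
    return change, L * eta * abs(g)

eta = 0.1  # learning rate (assumed value)
for x in [-2.0, 0.0, 0.7, 3.0]:
    change, bound = change_and_bound(x, eta)
    print(f"x={x:+.1f}  change={change:.4f}  bound={bound:.4f}")
```

In every row the actual change stays at or below the bound, and both shrink together as the learning rate or the gradient norm shrinks.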
When we say that gradients explode, we mean that ∥g∥ becomes excessively large. In the worst
cases, we might do so much damage in a single gradient step that we could undo all of the progress
made over the course of thousands of training iterations. When gradients can be so large, neu-
ral network training often diverges, failing to reduce the value of the objective. At other times,
training eventually converges but is unstable owing to massive spikes in the loss.
One way to limit the size of Lη∥g∥ is to shrink the learning rate η to tiny values. One advantage
here is that we donʼt bias the updates. But what if we only rarely get large gradients? This drastic
move slows down our progress at all steps, just to deal with the rare exploding gradient events. A
popular alternative is to adopt a gradient clipping heuristic projecting the gradients g onto a ball of
some given radius θ as follows:
$$\mathbf{g} \leftarrow \min\left(1, \frac{\theta}{\|\mathbf{g}\|}\right) \mathbf{g}. \tag{9.5.3}$$
This ensures that the gradient norm never exceeds θ and that the updated gradient is entirely
aligned with the original direction of g. It also has the desirable side-effect of limiting the influ-
ence any given minibatch (and within it any given sample) can exert on the parameter vector.
This bestows a certain degree of robustness to the model. To be clear, itʼs a hack. Gradient clip-
ping means that we are not always following the true gradient and itʼs hard to reason analytically
about the possible side effects. However, itʼs a very useful hack, and is widely adopted in RNN
implementations in most deep learning frameworks.
Below we define a function to clip gradients. Note that when computing the gradient norm, we
are concatenating all model parameters, treating them as a single giant parameter vector.
@d2l.add_to_class(d2l.Trainer)  #@save
def clip_gradients(self, grad_clip_val, model):
    params = [p for p in model.parameters() if p.requires_grad]
    # L2 norm over the gradients of all parameters taken together
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > grad_clip_val:
        for param in params:
            param.grad[:] *= grad_clip_val / norm
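The same rule can be written for plain Python lists, which makes it easy to verify the two properties claimed for (9.5.3): the joint norm never exceeds θ, and the direction is unchanged. The function name `clip_by_global_norm` and the gradient values are assumptions made for illustration.

```python
import math

def clip_by_global_norm(grads, theta):
    """Rescale a list of gradient lists so their joint L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, theta / norm) if norm > 0 else 1.0
    return [[g * scale for g in grad] for grad in grads], norm

# Two hypothetical parameter gradients with joint norm sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, theta=1.0)
print(norm, clipped)
```

Gradients whose joint norm is already below θ pass through unchanged, so clipping only intervenes in the rare exploding steps.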
9.5.4 Training
Using The Time Machine dataset (data), we train a character-level language model (model) based
on the RNN (rnn) implemented from scratch. Note that we first calculate the gradients, then clip
them, and finally update the model parameters using the clipped gradients.

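# Load the dataset; the batch size and sequence length here are assumed values
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn = RNNScratch(num_inputs=len(data.vocab), num_hiddens=32)
model = RNNLMScratch(rnn, vocab_size=len(data.vocab), lr=1)
trainer = d2l.Trainer(max_epochs=100, gradient_clip_val=1, num_gpus=1)
trainer.fit(model, data)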

9.5.5 Decoding
Once a language model has been learned, we can use it not only to predict the next token but
to continue predicting each subsequent token, treating the previously predicted token as though
it were the next token in the input. Sometimes we will just want to generate text as though we
were starting at the beginning of a document. However, itʼs often useful to condition the language
model on a user-supplied prefix. For example, if we were developing an autocomplete feature for a
search engine or to assist users in writing emails, we would want to feed in what they had written
so far (the prefix), and then generate a likely continuation.
The following predict function generates a continuation, one character at a time, after ingesting a
user-provided prefix. When looping through the characters in prefix, we keep passing the hidden
state to the next time step but do not generate any output. This is called the warm-up period. After
ingesting the prefix, we are now ready to begin emitting the subsequent characters, each of which
will be fed back into the model as the input at the subsequent sequence step.
@d2l.add_to_class(RNNLMScratch)  #@save
def predict(self, prefix, num_preds, vocab, device=None):
    state, outputs = None, [vocab[prefix[0]]]
    for i in range(len(prefix) + num_preds - 1):
        # Feed the most recent token (from the prefix or a prediction) as input
        X = torch.tensor([[outputs[-1]]], device=device)
        embs = self.one_hot(X)
        rnn_outputs, state = self.rnn(embs, state)
        if i < len(prefix) - 1:  # Warm-up period
            outputs.append(vocab[prefix[i + 1]])
        else:  # Predict `num_preds` steps
            Y = self.output_layer(rnn_outputs)
            outputs.append(int(Y.argmax(axis=2).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])
In the following, we specify the prefix and have it generate 20 additional characters.

model.predict('it has', 20, data.vocab, d2l.try_gpu())
'it has have the the the th'
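The control flow of warm-up followed by generation is independent of the model itself. The following sketch replaces the trained RNN with a hypothetical lookup table (`NEXT_CHAR`) and a dummy state, so that the loop structure can be traced by hand.

```python
def predict(prefix, num_preds, step):
    """Greedy decoding: warm up on `prefix`, then feed predictions back in.

    `step(token, state)` returns (predicted_next_token, new_state).
    """
    state, outputs = None, [prefix[0]]
    for i in range(len(prefix) + num_preds - 1):
        prediction, state = step(outputs[-1], state)
        if i < len(prefix) - 1:  # warm-up: update state, ignore prediction
            outputs.append(prefix[i + 1])
        else:  # generation: the prediction becomes the next input
            outputs.append(prediction)
    return "".join(outputs)

# Hypothetical stand-in for a trained model: a fixed next-character table.
# The "state" simply counts how many tokens have been consumed so far.
NEXT_CHAR = {"a": "b", "b": "c", "c": "a"}

def toy_step(token, state):
    return NEXT_CHAR[token], (state or 0) + 1

print(predict("ab", 4, toy_step))  # abcabc
```

During warm-up the prediction for "a" is discarded in favor of the next prefix character; only after the prefix is exhausted do predictions get appended and fed back.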
While implementing the above RNN model from scratch is instructive, it is not convenient. In
the next section, we will see how to leverage deep learning frameworks to whip up RNNs using
standard architectures, and to reap performance gains by relying on highly optimized library func-
tions.
9.5.6 Summary
• We can train RNN-based language models to generate text following the user-provided text
prefix.
• A simple RNN language model consists of input encoding, RNN modeling, and output gen-
eration.
• We implemented a simple RNN language model and trained it on sequences of text, tok-
enized at the character level.
• Gradient clipping can mitigate the problem of exploding gradients but does not address the
problem of vanishing gradients.
• By conditioning on a prefix, we can use a language model to generate likely continuations,
which proves useful in many applications, e.g., autocomplete features.
Exercises
1. Does the implemented language model predict the next token based on all the past tokens
up to the very first token in The Time Machine?
2. Which hyperparameter controls the length of history used for prediction?
3. Show that one-hot encoding is equivalent to picking a different embedding for each object.
4. Adjust the hyperparameters (e.g., number of epochs, number of hidden units, number of
time steps in a minibatch, and learning rate) to improve the perplexity. How low can you go
while sticking with this simple architecture?
5. Replace one-hot encoding with learnable embeddings. Does this lead to better perfor-
mance?
6. Conduct an experiment to determine how well this language model trained on The Time Machine works on other books by H. G. Wells, e.g., The War of the Worlds (http://www.gutenberg.org/ebooks/36).
7. Conduct another experiment to evaluate the perplexity of this model on books written by
other authors.
8. Modify the prediction function such as to use sampling rather than picking the most likely
next character.
• What happens?
• Bias the model towards more likely outputs, e.g., by sampling from $q(x_t \mid x_{t-1}, \ldots, x_1) \propto P(x_t \mid x_{t-1}, \ldots, x_1)^\alpha$ for $\alpha > 1$.
9. Run the code in this section without clipping the gradient. What happens?
10. Replace the activation function used in this section with ReLU and repeat the experiments
in this section. Do we still need gradient clipping? Why?
Discussions
9.6 Concise Implementation of Recurrent Neural Networks
Like most of our from-scratch implementations, Section 9.5 was designed to provide insight into
how each component works. But when youʼre using RNNs every day or writing production code,
youʼll want to rely more on libraries that cut down on both implementation time (by supplying
library code for common models and functions) and computation time (by optimizing the heck
out of these library implementations). This section will show you how to implement the same lan-
guage model more efficiently using the high-level API provided by your deep learning framework.
We begin, as before, by loading The Time Machine dataset.
import torch
from torch import nn
from d2l import torch as d2l
We define the following class using the RNN implemented by high-level APIs.
class RNN(d2l.Module):  #@save
    """The RNN model implemented with high-level APIs."""
    def __init__(self, num_inputs, num_hiddens):
        super().__init__()
        self.rnn = nn.RNN(num_inputs, num_hiddens)

    def forward(self, inputs, H=None):
        return self.rnn(inputs, H)
Inheriting from the RNNLMScratch class in Section 9.5, the following RNNLM class defines a complete
RNN-based language model. Note that we need to create a separate fully connected output layer.
class RNNLM(d2l.RNNLMScratch):  #@save
    """The RNN-based language model implemented with high-level APIs."""
    def init_params(self):
        self.linear = nn.LazyLinear(self.vocab_size)

    def output_layer(self, hiddens):
        return self.linear(hiddens).swapaxes(0, 1)

9.6.2 Training and Predicting
Before training the model, letʼs make a prediction with a model initialized with random weights.
Given that we have not trained the network, it will generate nonsensical predictions.

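# Load the dataset; the batch size and sequence length here are assumed values
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
rnn = RNN(num_inputs=len(data.vocab), num_hiddens=32)
model = RNNLM(rnn, vocab_size=len(data.vocab), lr=1)
model.predict('it has', 20, data.vocab)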
'it hasmczlxxmxmxmxmxmxmxmx'
Next, we train our model, leveraging the high-level API.

Compared with Section 9.5, this model achieves comparable perplexity, but runs faster due to the
optimized implementations. As before, we can generate predicted tokens following the specified
prefix string.
'it has and the time travel'

9.6.3 Summary
• High-level APIs in deep learning frameworks provide implementations of standard RNNs.

• These libraries help you to avoid wasting time reimplementing standard models.
• Framework implementations are often highly optimized, leading to significant (computa-
tional) performance gains as compared to implementations from scratch.
Exercises
1. Can you make the RNN model overfit using the high-level APIs?
2. Implement the autoregressive model of Section 9.1 using an RNN.
Discussions
9.7 Backpropagation Through Time
If you completed the exercises in Section 9.5, you would have seen that gradient clipping is vital
to prevent the occasional massive gradients from destabilizing training. We hinted that the
exploding gradients stem from backpropagating across long sequences. Before introducing a slew
of modern RNN architectures, letʼs take a closer look at how backpropagation works in sequence
models in mathematical detail. Hopefully, this discussion will bring some precision to the notion of
vanishing and exploding gradients and explain why we need to detach the gradient for RNNs in Section 9.5. If you recall our discussion of forward and backward propagation through computational
graphs when we introduced MLPs in Section 5.3, then forward propagation in RNNs should be relatively straightforward. Applying backpropagation in RNNs is called backpropagation through time
(Werbos, 1990). This procedure requires us to expand (or unroll) the computational graph of an
RNN one sequence step at a time. The unrolled RNN is essentially a feedforward neural network
with the special property that the same parameters are repeated throughout the unrolled network,
appearing at each sequence step. Then, just as in any feedforward neural network, we can apply
the chain rule, backpropagating gradients through the unrolled net. The gradient with respect
to each parameter must be summed across all places that the parameter occurs in the unrolled
net. Handling such weight tying should be familiar from our chapters on convolutional neural
networks.
Complications arise because sequences can be rather long. Itʼs not unusual to work with text sequences consisting of over a thousand tokens. Note that this poses problems both from a computational (too much memory) and optimization (numerical instability) standpoint. Inputs from
the first step pass through over 1000 matrix products before arriving at the output, and another
1000 matrix products are required to compute the gradient. We now analyze what can go wrong
and how to address it in practice.

9.7.1 Analysis of Gradients in RNNs
We start with a simplified model of how an RNN works. This model ignores details about the
specifics of the hidden state and how it is updated. The mathematical notation here does not
explicitly distinguish scalars, vectors, and matrices. We are just trying to develop some intuition.
In this simplified model, we denote ht as the hidden state, xt as input, and ot as output at time step
t. Recall our discussions in Section 9.4.2 that the input and the hidden state can be concatenated
before being multiplied by one weight variable in the hidden layer. Thus, we use wh and wo to
indicate the weights of the hidden layer and the output layer, respectively. As a result, the hidden
states and outputs at each time step are
$$\begin{aligned} h_t &= f(x_t, h_{t-1}, w_h), \\ o_t &= g(h_t, w_o), \end{aligned} \tag{9.7.1}$$
where f and g are transformations of the hidden layer and the output layer, respectively. Hence,
we have a chain of values {. . . , (xt−1 , ht−1 , ot−1 ), (xt , ht , ot ), . . .} that depend on each other via re-
current computation. The forward propagation is fairly straightforward. All we need is to loop
through the (xt , ht , ot ) triples one time step at a time. The discrepancy between output ot and the
desired target yt is then evaluated by an objective function across all the T time steps as
$$L(x_1, \ldots, x_T, y_1, \ldots, y_T, w_h, w_o) = \frac{1}{T} \sum_{t=1}^T l(y_t, o_t). \tag{9.7.2}$$
For backpropagation, matters are a bit trickier, especially when we compute the gradients with
regard to the parameters wh of the objective function L. To be specific, by the chain rule,
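$$\begin{aligned} \frac{\partial L}{\partial w_h} &= \frac{1}{T} \sum_{t=1}^T \frac{\partial l(y_t, o_t)}{\partial w_h} \\ &= \frac{1}{T} \sum_{t=1}^T \frac{\partial l(y_t, o_t)}{\partial o_t} \frac{\partial g(h_t, w_o)}{\partial h_t} \frac{\partial h_t}{\partial w_h}. \end{aligned} \tag{9.7.3}$$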
The first and the second factors of the product in (9.7.3) are easy to compute. The third factor
∂ht /∂wh is where things get tricky, since we need to recurrently compute the effect of the param-
eter wh on ht . According to the recurrent computation in (9.7.1), ht depends on both ht−1 and
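wh , where computation of ht−1 also depends on wh . Thus, evaluating the total derivative of ht with
respect to wh using the chain rule yields
$$\frac{\partial h_t}{\partial w_h} = \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h} + \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial w_h}. \tag{9.7.4}$$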
To derive the above gradient, assume that we have three sequences $\{a_t\}$, $\{b_t\}$, $\{c_t\}$ satisfying $a_0 = 0$
and $a_t = b_t + c_t a_{t-1}$ for $t = 1, 2, \ldots$. Then for $t \geq 1$, it is easy to show
$$a_t = b_t + \sum_{i=1}^{t-1} \left( \prod_{j=i+1}^{t} c_j \right) b_i. \tag{9.7.5}$$
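By substituting at , bt , and ct according to
$$\begin{aligned} a_t &= \frac{\partial h_t}{\partial w_h}, \\ b_t &= \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h}, \\ c_t &= \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}}, \end{aligned} \tag{9.7.6}$$
the gradient computation in (9.7.4) satisfies at = bt + ct at−1 . Thus, per (9.7.5), we can remove the
recurrent computation in (9.7.4) with
$$\frac{\partial h_t}{\partial w_h} = \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h} + \sum_{i=1}^{t-1} \left( \prod_{j=i+1}^{t} \frac{\partial f(x_j, h_{j-1}, w_h)}{\partial h_{j-1}} \right) \frac{\partial f(x_i, h_{i-1}, w_h)}{\partial w_h}. \tag{9.7.7}$$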
While we can use the chain rule to compute ∂ht /∂wh recursively, this chain can get very long
whenever t is large. Letʼs discuss a number of strategies for dealing with this problem.
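To see that the recursion (9.7.4) and the unrolled sum (9.7.7) agree, consider a scalar instance of the simplified model. The choice f(x, h, w) = tanh(w h + x), the input values, and the weight are assumptions made purely for illustration; both forms are also compared against a finite-difference estimate.

```python
import math

def forward(xs, w):
    """Hidden states h_0..h_T of the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs

def grad_recursive(xs, w):
    """a_t = b_t + c_t * a_{t-1}, i.e., the recursion (9.7.4) with a_0 = 0."""
    hs, a = forward(xs, w), 0.0
    for t in range(1, len(hs)):
        d = 1.0 - hs[t] ** 2         # derivative of tanh at step t
        b, c = d * hs[t - 1], d * w  # df/dw and df/dh_{t-1}
        a = b + c * a
    return a

def grad_closed_form(xs, w):
    """The unrolled sum (9.7.7) evaluated at the final time step T."""
    hs, T = forward(xs, w), len(xs)
    d = [0.0] + [1.0 - hs[t] ** 2 for t in range(1, T + 1)]
    b = [0.0] + [d[t] * hs[t - 1] for t in range(1, T + 1)]
    c = [0.0] + [d[t] * w for t in range(1, T + 1)]
    total = b[T]
    for i in range(1, T):
        total += math.prod(c[i + 1:T + 1]) * b[i]  # product over j = i+1..T
    return total

xs, w, eps = [0.5, -0.3, 0.8, 0.1], 0.9, 1e-6  # assumed inputs and weight
numeric = (forward(xs, w + eps)[-1] - forward(xs, w - eps)[-1]) / (2 * eps)
print(grad_recursive(xs, w), grad_closed_form(xs, w), numeric)
```

The recursion needs one pass over the sequence, whereas the closed form makes the long products of the c terms explicit; those products are what can shrink or grow rapidly with t.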
Full Computation
One idea might be to compute the full sum in (9.7.7). However, this is very slow and gradients
can blow up, since subtle changes in the initial conditions can potentially affect the outcome a lot.
That is, we could see things similar to the butterfly effect, where minimal changes in the initial
conditions lead to disproportionate changes in the outcome. This is generally undesirable. After
all, we are looking for robust estimators that generalize well. Hence this strategy is almost never
used in practice.
Truncating Time Steps
Alternatively, we can truncate the sum in (9.7.7) after τ steps. This is what we have been discussing
so far, such as when we detached the gradients in Section 9.5. This leads to an approximation of the
true gradient, simply by terminating the sum at ∂ht−τ /∂wh . In practice this works quite well. It
is what is commonly referred to as truncated backpropagation through time (Jaeger, 2002). One of
the consequences of this is that the model focuses primarily on short-term influence rather than
long-term consequences. This is actually desirable, since it biases the estimate towards simpler
and more stable models.
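The effect of truncation can be illustrated on an assumed linear scalar model h_t = w h_{t−1} + x_t, for which b_t = h_{t−1} and c_t = w. The hypothetical function below treats the gradient as zero before the last τ steps, which is what detaching amounts to in this toy setting.

```python
def truncated_grad(xs, w, tau):
    """d h_T / d w for h_t = w * h_{t-1} + x_t, keeping only the last tau steps."""
    hs = [0.0]
    for x in xs:
        hs.append(w * hs[-1] + x)
    T = len(xs)
    a = 0.0  # gradient treated as zero before step T - tau + 1 (the "detach")
    for t in range(max(1, T - tau + 1), T + 1):
        a = hs[t - 1] + w * a  # b_t = h_{t-1}, c_t = w
    return a

xs, w = [1.0] * 20, 0.5  # assumed inputs and recurrent weight
full = truncated_grad(xs, w, tau=len(xs))
for tau in (2, 5, 10):
    approx = truncated_grad(xs, w, tau)
    print(f"tau={tau:2d}  truncated={approx:.6f}  error={full - approx:.6f}")
```

With |w| < 1 the discarded terms decay geometrically, so a modest τ already recovers most of the gradient; with |w| > 1 the discarded terms would instead dominate.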
Randomized Truncation
Last, we can replace ∂ht /∂wh by a random variable which is correct in expectation but truncates
the sequence. This is achieved by using a sequence of $\xi_t$ with predefined $0 \leq \pi_t \leq 1$, where
$P(\xi_t = 0) = 1 - \pi_t$ and $P(\xi_t = \pi_t^{-1}) = \pi_t$, thus $E[\xi_t] = 1$. We use this to replace the gradient
∂ht /∂wh in (9.7.4) with
$$z_t = \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial w_h} + \xi_t \frac{\partial f(x_t, h_{t-1}, w_h)}{\partial h_{t-1}} \frac{\partial h_{t-1}}{\partial w_h}. \tag{9.7.8}$$
It follows from the definition of ξt that E[zt ] = ∂ht /∂wh . Whenever ξt = 0 the recurrent computa-
tion terminates at that time step t. This leads to a weighted sum of sequences of varying lengths,
where long sequences are rare but appropriately overweighted. This idea was proposed by Tallec
and Ollivier (Tallec and Ollivier, 2017).
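Unbiasedness can be verified exactly on a short sequence by enumerating every outcome of the ξ variables. The sketch below assumes that the substitution is applied recursively, i.e., z_t = b_t + ξ_t c_t z_{t−1} with z_0 = 0, and uses made-up values for b_t, c_t, and π_t.

```python
from itertools import product

def expected_estimate(b, c, pi):
    """Exact E[z_T] for z_t = b_t + xi_t * c_t * z_{t-1}, with z_0 = 0.

    xi_t equals 1 / pi_t with probability pi_t and 0 otherwise.
    """
    T = len(b)
    expectation = 0.0
    for keep in product([False, True], repeat=T):  # every outcome of xi_1..xi_T
        prob, z = 1.0, 0.0
        for t in range(T):
            prob *= pi[t] if keep[t] else 1.0 - pi[t]
            xi = 1.0 / pi[t] if keep[t] else 0.0
            z = b[t] + xi * c[t] * z
        expectation += prob * z
    return expectation

def exact_gradient(b, c):
    """a_t = b_t + c_t * a_{t-1}, the untruncated recursion (9.7.4)."""
    a = 0.0
    for bt, ct in zip(b, c):
        a = bt + ct * a
    return a

# Assumed per-step partial derivatives and continuation probabilities
b = [0.3, -0.2, 0.5, 0.1]
c = [0.9, 0.7, 1.1, 0.8]
pi = [0.5, 0.8, 0.6, 0.9]
print(expected_estimate(b, c, pi), exact_gradient(b, c))
```

The two numbers coincide up to rounding, but any individual sample of z can be far from the mean because surviving long chains are scaled up by the factors 1/π_t.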

Comparing Strategies
Fig. 9.7.1: Comparing strategies for computing gradients in RNNs. From top to bottom: random-
ized truncation, regular truncation, and full computation.
Fig. 9.7.1 illustrates the three strategies when analyzing the first few characters of The Time Ma-
chine using backpropagation through time for RNNs:
• The first row is the randomized truncation that partitions the text into segments of varying
lengths.
• The second row is the regular truncation that breaks the text into subsequences of the same
length. This is what we have been doing in RNN experiments.
• The third row is the full backpropagation through time that leads to a computationally in-
feasible expression.
Unfortunately, while appealing in theory, randomized truncation does not work much better than
regular truncation, most likely due to a number of factors. First, the effect of an observation after
a number of backpropagation steps into the past is quite sufficient to capture dependencies in
practice. Second, the increased variance counteracts the fact that the gradient is more accurate
with more steps. Third, we actually want models that have only a short range of interactions.
Hence, regularly truncated backpropagation through time has a slight regularizing effect that can
be desirable.
9.7.2 Backpropagation Through Time in Detail
After discussing the general principle, letʼs discuss backpropagation through time in detail. Differ-
ent from the analysis in Section 9.7.1, in the following we will show how to compute the gradients
of the objective function with respect to all the decomposed model parameters. To keep things
simple, we consider an RNN without bias parameters, whose activation function in the hidden
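layer uses the identity mapping ($\phi(x) = x$). For time step t, let the single example input and the
target be $\mathbf{x}_t \in \mathbb{R}^d$ and $y_t$, respectively. The hidden state $\mathbf{h}_t \in \mathbb{R}^h$ and the output $\mathbf{o}_t \in \mathbb{R}^q$ are
computed as
$$\begin{aligned} \mathbf{h}_t &= \mathbf{W}_{hx} \mathbf{x}_t + \mathbf{W}_{hh} \mathbf{h}_{t-1}, \\ \mathbf{o}_t &= \mathbf{W}_{qh} \mathbf{h}_t, \end{aligned} \tag{9.7.9}$$
where $\mathbf{W}_{hx} \in \mathbb{R}^{h \times d}$, $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$, and $\mathbf{W}_{qh} \in \mathbb{R}^{q \times h}$ are the weight parameters. Denote by $l(\mathbf{o}_t, y_t)$
the loss at time step t. Our objective function, the loss over T time steps from the beginning of the
sequence is thus
$$L = \frac{1}{T} \sum_{t=1}^T l(\mathbf{o}_t, y_t). \tag{9.7.10}$$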

In order to visualize the dependencies among model variables and parameters during computa-
tion of the RNN, we can draw a computational graph for the model, as shown in Fig. 9.7.2. For
example, the computation of the hidden states of time step 3, h3 , depends on the model parame-
ters Whx and Whh , the hidden state of the last time step h2 , and the input of the current time step
x3 .
Fig. 9.7.2: Computational graph showing dependencies for an RNN model with three time steps.
Boxes represent variables (not shaded) or parameters (shaded) and circles represent operators.
As just mentioned, the model parameters in Fig. 9.7.2 are Whx , Whh , and Wqh . Generally, training
this model requires gradient computation with respect to these parameters ∂L/∂Whx , ∂L/∂Whh ,
and ∂L/∂Wqh . According to the dependencies in Fig. 9.7.2, we can traverse in the opposite direc-
tion of the arrows to calculate and store the gradients in turn. To flexibly express the multiplica-
tion of matrices, vectors, and scalars of different shapes in the chain rule, we continue to use the
prod operator as described in Section 5.3.
First of all, differentiating the objective function with respect to the model output at any time step
t is fairly straightforward:
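$$\frac{\partial L}{\partial \mathbf{o}_t} = \frac{\partial l(\mathbf{o}_t, y_t)}{T \cdot \partial \mathbf{o}_t} \in \mathbb{R}^q. \tag{9.7.11}$$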
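Now, we can calculate the gradient of the objective with respect to the parameter Wqh in the output
layer: $\partial L/\partial \mathbf{W}_{qh} \in \mathbb{R}^{q \times h}$. Based on Fig. 9.7.2, the objective L depends on Wqh via o1 , . . . , oT . Using
the chain rule yields
$$\frac{\partial L}{\partial \mathbf{W}_{qh}} = \sum_{t=1}^T \text{prod}\left(\frac{\partial L}{\partial \mathbf{o}_t}, \frac{\partial \mathbf{o}_t}{\partial \mathbf{W}_{qh}}\right) = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{o}_t} \mathbf{h}_t^\top, \tag{9.7.12}$$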
where ∂L/∂ot is given by (9.7.11).

Next, as shown in Fig. 9.7.2, at the final time step T , the objective function L depends on the
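hidden state hT only via oT . Therefore, we can easily find the gradient $\partial L/\partial \mathbf{h}_T \in \mathbb{R}^h$ using the
chain rule:
$$\frac{\partial L}{\partial \mathbf{h}_T} = \text{prod}\left(\frac{\partial L}{\partial \mathbf{o}_T}, \frac{\partial \mathbf{o}_T}{\partial \mathbf{h}_T}\right) = \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_T}. \tag{9.7.13}$$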
It gets trickier for any time step t < T , where the objective function L depends on ht via ht+1 and
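ot . According to the chain rule, the gradient of the hidden state $\partial L/\partial \mathbf{h}_t \in \mathbb{R}^h$ at any time step
t < T can be recurrently computed as:
$$\frac{\partial L}{\partial \mathbf{h}_t} = \text{prod}\left(\frac{\partial L}{\partial \mathbf{h}_{t+1}}, \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_t}\right) + \text{prod}\left(\frac{\partial L}{\partial \mathbf{o}_t}, \frac{\partial \mathbf{o}_t}{\partial \mathbf{h}_t}\right) = \mathbf{W}_{hh}^\top \frac{\partial L}{\partial \mathbf{h}_{t+1}} + \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_t}. \tag{9.7.14}$$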

For analysis, expanding the recurrent computation for any time step 1 ≤ t ≤ T gives
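$$\frac{\partial L}{\partial \mathbf{h}_t} = \sum_{i=t}^T \left(\mathbf{W}_{hh}^\top\right)^{T-i} \mathbf{W}_{qh}^\top \frac{\partial L}{\partial \mathbf{o}_{T+t-i}}. \tag{9.7.15}$$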
We can see from (9.7.15) that this simple linear example already exhibits some key problems of
long sequence models: it involves potentially very large powers of W⊤hh . In it, eigenvalues smaller
than 1 vanish and eigenvalues larger than 1 diverge. This is numerically unstable, which manifests
itself in the form of vanishing and exploding gradients. One way to address this is to truncate the
time steps at a computationally convenient size as discussed in Section 9.7.1. In practice, this
truncation is effected by detaching the gradient after a given number of time steps. Later on, we
will see how more sophisticated sequence models such as long short-term memory can alleviate
this further.
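A small numerical experiment makes the role of the eigenvalues tangible. The two symmetric 2 × 2 matrices below are assumed examples (so each equals its own transpose), chosen so that all eigenvalues are either below or above 1 in magnitude; the helpers `matvec` and `norm_after_steps` are hypothetical.

```python
import math

def matvec(W, v):
    """Matrix-vector product for nested lists."""
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

def norm_after_steps(W, v, steps):
    """L2 norm of W^steps v, computed by repeated multiplication."""
    for _ in range(steps):
        v = matvec(W, v)
    return math.sqrt(sum(x * x for x in v))

small = [[0.5, 0.1], [0.1, 0.5]]  # eigenvalues 0.6 and 0.4 (assumed example)
large = [[1.5, 0.3], [0.3, 1.5]]  # eigenvalues 1.8 and 1.2 (assumed example)
v = [1.0, 0.0]
print(norm_after_steps(small, v, 50), norm_after_steps(large, v, 50))
```

After only 50 steps the first norm has collapsed to practically zero while the second has grown by many orders of magnitude, mirroring vanishing and exploding gradients.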
Finally, Fig. 9.7.2 shows that the objective function L depends on model parameters Whx and Whh
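in the hidden layer via hidden states h1 , . . . , hT . To compute gradients with respect to such parameters $\partial L/\partial \mathbf{W}_{hx} \in \mathbb{R}^{h \times d}$ and $\partial L/\partial \mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$, we apply the chain rule that gives
$$\begin{aligned} \frac{\partial L}{\partial \mathbf{W}_{hx}} &= \sum_{t=1}^T \text{prod}\left(\frac{\partial L}{\partial \mathbf{h}_t}, \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_{hx}}\right) = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{h}_t} \mathbf{x}_t^\top, \\ \frac{\partial L}{\partial \mathbf{W}_{hh}} &= \sum_{t=1}^T \text{prod}\left(\frac{\partial L}{\partial \mathbf{h}_t}, \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_{hh}}\right) = \sum_{t=1}^T \frac{\partial L}{\partial \mathbf{h}_t} \mathbf{h}_{t-1}^\top, \end{aligned} \tag{9.7.16}$$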
where ∂L/∂ht that is recurrently computed by (9.7.13) and (9.7.14) is the key quantity that affects
the numerical stability.
Since backpropagation through time is the application of backpropagation in RNNs, as we have
explained in Section 5.3, training RNNs alternates forward propagation with backpropagation
through time. Besides, backpropagation through time computes and stores the above gradients
in turn. Specifically, stored intermediate values are reused to avoid duplicate calculations, such
as storing ∂L/∂ht to be used in computation of both ∂L/∂Whx and ∂L/∂Whh .
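The formulas (9.7.11) to (9.7.16) can be sanity-checked on a scalar version of the model, where every weight is a single number. The squared loss l = (o_t − y_t)²/2, the data, and the parameter values below are assumptions chosen for illustration; the analytic gradients are compared with central finite differences.

```python
def forward(xs, ys, w_hx, w_hh, w_qh):
    """Scalar version of (9.7.9) with squared loss l = (o_t - y_t)^2 / 2."""
    hs, outs = [0.0], []
    for x in xs:
        hs.append(w_hx * x + w_hh * hs[-1])
        outs.append(w_qh * hs[-1])
    loss = sum(0.5 * (o - y) ** 2 for o, y in zip(outs, ys)) / len(xs)
    return hs, outs, loss

def bptt(xs, ys, w_hx, w_hh, w_qh):
    """Gradients of L with respect to (w_hx, w_hh, w_qh)."""
    T = len(xs)
    hs, outs, _ = forward(xs, ys, w_hx, w_hh, w_qh)
    dL_do = [(o - y) / T for o, y in zip(outs, ys)]  # (9.7.11), entry t-1 holds step t
    dL_dh = [0.0] * (T + 1)                          # entry t holds step t
    dL_dh[T] = w_qh * dL_do[T - 1]                   # (9.7.13)
    for t in range(T - 1, 0, -1):                    # (9.7.14), backwards in time
        dL_dh[t] = w_hh * dL_dh[t + 1] + w_qh * dL_do[t - 1]
    g_qh = sum(dL_do[t - 1] * hs[t] for t in range(1, T + 1))  # (9.7.12)
    g_hx = sum(dL_dh[t] * xs[t - 1] for t in range(1, T + 1))  # (9.7.16)
    g_hh = sum(dL_dh[t] * hs[t - 1] for t in range(1, T + 1))  # (9.7.16)
    return g_hx, g_hh, g_qh

def numeric_grads(xs, ys, params, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grads = []
    for k in range(len(params)):
        plus, minus = list(params), list(params)
        plus[k] += eps
        minus[k] -= eps
        diff = forward(xs, ys, *plus)[2] - forward(xs, ys, *minus)[2]
        grads.append(diff / (2 * eps))
    return tuple(grads)

xs, ys = [1.0, -0.5, 0.3], [0.2, 0.1, -0.4]  # assumed data
params = (0.7, 0.9, 1.2)                     # assumed (w_hx, w_hh, w_qh)
print(bptt(xs, ys, *params))
print(numeric_grads(xs, ys, params))
```

Note how `dL_dh` is computed once and then reused for both hidden-layer gradients, which is the caching of intermediate values described above.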
9.7.3 Summary
• Backpropagation through time is merely an application of backpropagation to sequence
models with a hidden state.
• Truncation, such as regular truncation or randomized truncation, is needed for computational
convenience and numerical stability.
• High powers of matrices can lead to divergent or vanishing eigenvalues. This manifests itself
in the form of exploding or vanishing gradients.
• For efficient computation, intermediate values are cached during backpropagation through
time.

Exercises
1. Assume that we have a symmetric matrix M ∈ Rn×n with eigenvalues λi whose corresponding
eigenvectors are vi (i = 1, . . . , n). Without loss of generality, assume that they are
ordered in the order |λi | ≥ |λi+1 |.
    1. Show that M^k has eigenvalues λi^k.
    2. Prove that for a random vector x ∈ Rn , with high probability M^k x will be very much aligned
    with the eigenvector v1 of M. Formalize this statement.
    3. What does the above result mean for gradients in RNNs?
2. Besides gradient clipping, can you think of any other methods to cope with gradient explosion
in recurrent neural networks?

10 Modern Recurrent Neural Networks
We have introduced the basics of RNNs, which can better handle sequence data. For demonstra-
tion, we implemented RNN-based language models on text data. However, such techniques may
not be sufficient for practitioners when they face a wide range of sequence learning problems
nowadays.
For instance, a notable issue in practice is the numerical instability of RNNs. Although we have
applied implementation tricks such as gradient clipping, this issue can be alleviated further with
more sophisticated designs of sequence models. Specifically, gated RNNs are much more common
in practice. We will begin by introducing two such widely used networks, namely gated
recurrent units (GRUs) and long short-term memory (LSTM). Furthermore, we will expand the RNN
architecture with a single unidirectional hidden layer that has been discussed so far. We will de-
scribe deep architectures with multiple hidden layers, and discuss the bidirectional design with
both forward and backward recurrent computations. Such expansions are frequently adopted in
modern recurrent networks. When explaining these RNN variants, we continue to consider the
same language modeling problem introduced in Chapter 9.
In fact, language modeling reveals only a small fraction of what sequence learning is capable of.
In a variety of sequence learning problems, such as automatic speech recognition, text to speech,
and machine translation, both inputs and outputs are sequences of arbitrary length. To explain
how to fit this type of data, we will take machine translation as an example, and introduce the
encoder-decoder architecture based on RNNs and beam search for sequence generation.
10.1 Gated Recurrent Units (GRU)
In Section 9.7, we discussed how gradients are calculated in RNNs. In particular we found that
long products of matrices can lead to vanishing or exploding gradients. Letʼs briefly think about
what such gradient anomalies mean in practice:
• We might encounter a situation where an early observation is highly significant for predict-
ing all future observations. Consider the somewhat contrived case where the first observa-
tion contains a checksum and the goal is to discern whether the checksum is correct at the
end of the sequence. In this case, the influence of the first token is vital. We would like to
have some mechanisms for storing vital early information in a memory cell. Without such a
mechanism, we will have to assign a very large gradient to this observation, since it affects
all the subsequent observations.
• We might encounter situations where some tokens carry no pertinent observation. For in-
stance, when parsing a web page there might be auxiliary HTML code that is irrelevant for
the purpose of assessing the sentiment conveyed on the page. We would like to have some
mechanism for skipping such tokens in the latent state representation.
• We might encounter situations where there is a logical break between parts of a sequence.
For instance, there might be a transition between chapters in a book, or a transition between
a bear and a bull market for securities. In this case it would be nice to have a means of
resetting our internal state representation.
A number of methods have been proposed to address this. One of the earliest is long short-term
memory (Hochreiter and Schmidhuber, 1997) which we will discuss in Section 10.2. The gated re-
current unit (GRU) (Cho et al., 2014) is a slightly more streamlined variant that often offers compa-
rable performance and is significantly faster to compute (Chung et al., 2014). Due to its simplicity,
letʼs start with the GRU.
10.1.1 Gated Hidden State
The key distinction between vanilla RNNs and GRUs is that the latter support gating of the hidden
state. This means that we have dedicated mechanisms for when a hidden state should be updated
and also when it should be reset. These mechanisms are learned and they address the concerns
listed above. For instance, if the first token is of great importance we will learn not to update
the hidden state after the first observation. Likewise, we will learn to skip irrelevant temporary
observations. Last, we will learn to reset the latent state whenever needed. We discuss this in
detail below.
Reset Gate and Update Gate
The first things we need to introduce are the reset gate and the update gate. We engineer them to be
vectors with entries in (0, 1) such that we can perform convex combinations. For instance, a reset
gate would allow us to control how much of the previous state we might still want to remember.
Likewise, an update gate would allow us to control how much of the new state is just a copy of the
old state.
We begin by engineering these gates. Fig. 10.1.1 illustrates the inputs for both the reset and update
gates in a GRU, given the input of the current time step and the hidden state of the previous time
step. The outputs of the two gates are given by two fully connected layers with a sigmoid activation
function.

Fig. 10.1.1: Computing the reset gate and the update gate in a GRU model.
Mathematically, for a given time step t, suppose that the input is a minibatch Xt ∈ Rn×d (number of
examples: n, number of inputs: d) and the hidden state of the previous time step is Ht−1 ∈ Rn×h
(number of hidden units: h). Then, the reset gate Rt ∈ Rn×h and update gate Zt ∈ Rn×h are
computed as follows:
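$$\begin{aligned}
\mathbf{R}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xr} + \mathbf{H}_{t-1} \mathbf{W}_{hr} + \mathbf{b}_r), \\
\mathbf{Z}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xz} + \mathbf{H}_{t-1} \mathbf{W}_{hz} + \mathbf{b}_z),
\end{aligned} \tag{10.1.1}$$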
where Wxr , Wxz ∈ Rd×h and Whr , Whz ∈ Rh×h are weight parameters and br , bz ∈ R1×h are
biases. Note that broadcasting (see Section 2.1.4) is triggered during the summation. We use
sigmoid functions (as introduced in Section 5.1) to transform input values to the interval (0, 1).
Candidate Hidden State
Next, letʼs integrate the reset gate Rt with the regular latent state updating mechanism in (9.4.5).
It leads to the following candidate hidden state H̃t ∈ Rn×h at time step t:
H̃t = tanh(Xt Wxh + (Rt ⊙ Ht−1 ) Whh + bh ), (10.1.2)
where Wxh ∈ Rd×h and Whh ∈ Rh×h are weight parameters, bh ∈ R1×h is the bias, and the symbol
⊙ is the Hadamard (elementwise) product operator. Here we use a nonlinearity in the form of
tanh to ensure that the values in the candidate hidden state remain in the interval (−1, 1).
The result is a candidate since we still need to incorporate the action of the update gate. Compar-
ing with (9.4.5), now the influence of the previous states can be reduced with the elementwise
multiplication of Rt and Ht−1 in (10.1.2). Whenever the entries in the reset gate Rt are close to 1,
we recover a vanilla RNN such as in (9.4.5). For all entries of the reset gate Rt that are close to 0,
the candidate hidden state is the result of an MLP with Xt as input. Any pre-existing hidden state
is thus reset to defaults.
Fig. 10.1.2 illustrates the computational flow after applying the reset gate.

Fig. 10.1.2: Computing the candidate hidden state in a GRU model.
Hidden State
Finally, we need to incorporate the effect of the update gate Zt . This determines the extent to which
the new hidden state Ht ∈ Rn×h is just the old state Ht−1 and by how much the new candidate state
H̃t is used. The update gate Zt can be used for this purpose, simply by taking elementwise convex
combinations between both Ht−1 and H̃t . This leads to the final update equation for the GRU:
Ht = Zt ⊙ Ht−1 + (1 − Zt ) ⊙ H̃t . (10.1.3)
Whenever the update gate Zt is close to 1, we simply retain the old state. In this case the informa-
tion from Xt is essentially ignored, effectively skipping time step t in the dependency chain. In
contrast, whenever Zt is close to 0, the new latent state Ht approaches the candidate latent state
H̃t . These designs can help us cope with the vanishing gradient problem in RNNs and better cap-
ture dependencies for sequences with large time step distances. For instance, if the update gate
has been close to 1 for all the time steps of an entire subsequence, the old hidden state at the time
step of its beginning will be easily retained and passed to its end, regardless of the length of the
subsequence.
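The two extremes of the update gate can be reproduced with a scalar toy version of (10.1.1) to (10.1.3). The sketch below uses plain Python with one input and one hidden unit. The parameter dictionaries, the large bias values used to saturate the gates, and the function names are assumptions made only for this illustration, not part of the model definition.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU step for a scalar input and a scalar hidden state."""
    r = sigmoid(x * p["w_xr"] + h_prev * p["w_hr"] + p["b_r"])  # reset gate
    z = sigmoid(x * p["w_xz"] + h_prev * p["w_hz"] + p["b_z"])  # update gate
    # Candidate hidden state: the reset gate scales the old state.
    h_tilde = math.tanh(x * p["w_xh"] + (r * h_prev) * p["w_hh"] + p["b_h"])
    # Convex combination of the old state and the candidate.
    return z * h_prev + (1.0 - z) * h_tilde

base = dict(w_xr=0.5, w_hr=0.5, b_r=0.0,
            w_xz=0.0, w_hz=0.0, b_z=0.0,
            w_xh=1.0, w_hh=1.0, b_h=0.0)

# Update gate saturated near 1: the old state survives 30 arbitrary inputs.
keep = dict(base, b_z=20.0)
h = 0.8
for x in [1.0, -2.0, 0.5] * 10:
    h = gru_step(x, h, keep)
print(h)

# Update and reset gates saturated near 0: the new state depends on x alone.
overwrite = dict(base, b_z=-20.0, w_xr=0.0, w_hr=0.0, b_r=-20.0)
h_new = gru_step(1.0, 0.8, overwrite)
print(h_new, math.tanh(1.0))
```

In the first case the hidden state stays essentially at its initial value regardless of the inputs. In the second case the previous state is both reset and overwritten, so the result is essentially tanh of the weighted input.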
Fig. 10.1.3 illustrates the computational flow after the update gate is in action.

Fig. 10.1.3: Computing the hidden state in a GRU model.
In summary, GRUs have the following two distinguishing features:

• Reset gates help capture short-term dependencies in sequences.
• Update gates help capture long-term dependencies in sequences.
To gain a better understanding of the GRU model, letʼs implement it from scratch.
import torch
from torch import nn
from d2l import torch as d2l
The first step is to initialize the model parameters. We draw the weights from a Gaussian distribu-
tion with standard deviation to be sigma and set the bias to 0. The hyperparameter num_hiddens
defines the number of hidden units. We instantiate all weights and biases relating to the update
gate, the reset gate, and the candidate hidden state.
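class GRUScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        # Weights are drawn from a Gaussian with standard deviation sigma
        init_weight = lambda *shape: nn.Parameter(torch.randn(*shape) * sigma)
        # Each gate needs an input weight, a hidden weight, and a zero bias
        triple = lambda: (init_weight(num_inputs, num_hiddens),
                          init_weight(num_hiddens, num_hiddens),
                          nn.Parameter(torch.zeros(num_hiddens)))
        self.W_xz, self.W_hz, self.b_z = triple()  # Update gate
        self.W_xr, self.W_hr, self.b_r = triple()  # Reset gate
        self.W_xh, self.W_hh, self.b_h = triple()  # Candidate hidden state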
Defining the Model
Now we are ready to define the GRU forward computation. Its structure is the same as that of the
basic RNN cell, except that the update equations are more complex.
@d2l.add_to_class(GRUScratch)
def forward(self, inputs, H=None):
    outputs = []
    for X in inputs:
        # Update gate; without an initial state, the hidden term contributes 0
        Z = torch.sigmoid(torch.matmul(X, self.W_xz) + (
            torch.matmul(H, self.W_hz) if H is not None else 0) + self.b_z)
        if H is None:
            # Initial hidden state of shape (batch size, num_hiddens)
            H = torch.zeros_like(Z)
        # Reset gate
        R = torch.sigmoid(torch.matmul(X, self.W_xr) +
                          torch.matmul(H, self.W_hr) + self.b_r)
        # Candidate hidden state
        H_tilde = torch.tanh(torch.matmul(X, self.W_xh) +
                             torch.matmul(R * H, self.W_hh) + self.b_h)
        # Convex combination of the old state and the candidate
        H = Z * H + (1 - Z) * H_tilde
        outputs.append(H)
    return outputs, (H, )
Training
Training a language model on The Time Machine dataset works in exactly the same manner as in
Section 9.5.

gru = GRUScratch(num_inputs=len(data.vocab), num_hiddens=32)
model = d2l.RNNLMScratch(gru, vocab_size=len(data.vocab), lr=4)

In high-level APIs, we can directly instantiate a GRU model. This encapsulates all the configuration
details that we made explicit above.
class GRU(d2l.RNN):
    def __init__(self, num_inputs, num_hiddens):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = nn.GRU(num_inputs, num_hiddens)
The code is significantly faster in training as it uses compiled operators rather than Python for
many details that we spelled out before.
gru = GRU(num_inputs=len(data.vocab), num_hiddens=32)

model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=4)
After training, we print out the perplexity on the training set and the predicted sequence following
the provided prefix.
'it has expromenter a cour '
10.1.4 Summary
• Gated RNNs can better capture dependencies for sequences with large time step distances.
• Reset gates help capture short-term dependencies in sequences.
• Update gates help capture long-term dependencies in sequences.
• GRUs contain basic RNNs as their extreme case whenever the reset gate is switched on and the
update gate is switched off. They can also skip subsequences by turning on the update gate.

Exercises
1. Assume that we only want to use the input at time step t′ to predict the output at time step
t > t′ . What are the best values for the reset and update gates for each time step?
2. Adjust the hyperparameters and analyze their influence on running time, perplexity, and
the output sequence.
3. Compare runtime, perplexity, and the output strings for rnn.RNN and rnn.GRU implementa-
tions with each other.
4. What happens if you implement only parts of a GRU, e.g., with only a reset gate or only an
update gate?
10.2 Long Short-Term Memory (LSTM)
The challenge to address long-term information preservation and short-term input skipping in
latent variable models has existed for a long time. One of the earliest approaches to address this
was the long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). It shares many of
the properties of the GRU. Interestingly, LSTMs have a slightly more complex design than GRUs
but predate GRUs by almost two decades.
10.2.1 Gated Memory Cell
Arguably LSTMʼs design is inspired by logic gates of a computer. LSTM introduces a memory cell (or
cell for short) that has the same shape as the hidden state (some of the literature considers the memory
cell as a special type of the hidden state), engineered to record additional information. To control
the memory cell we need a number of gates. One gate is needed to read out the entries from the
cell. We will refer to this as the output gate. A second gate is needed to decide when to read data
into the cell. We refer to this as the input gate. Last, we need a mechanism to reset the content of
the cell, governed by a forget gate. The motivation for such a design is the same as that of GRUs,
namely to be able to decide when to remember and when to ignore inputs in the hidden state via
a dedicated mechanism. Letʼs see how this works in practice.
Input Gate, Forget Gate, and Output Gate
Just like in GRUs, the data feeding into the LSTM gates are the input at the current time step and
the hidden state of the previous time step, as illustrated in Fig. 10.2.1. They are processed by
three fully connected layers with a sigmoid activation function to compute the values of the input,
forget, and output gates. As a result, values of the three gates are in the range of (0, 1).

Fig. 10.2.1: Computing the input gate, the forget gate, and the output gate in an LSTM model.
Mathematically, suppose that there are h hidden units, the batch size is n, and the number of
inputs is d. Thus, the input is Xt ∈ Rn×d and the hidden state of the previous time step is Ht−1 ∈
Rn×h . Correspondingly, the gates at time step t are defined as follows: the input gate is It ∈ Rn×h ,
the forget gate is Ft ∈ Rn×h , and the output gate is Ot ∈ Rn×h . They are calculated as follows:
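$$\begin{aligned}
\mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i), \\
\mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f), \\
\mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o),
\end{aligned} \tag{10.2.1}$$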
where Wxi , Wxf , Wxo ∈ Rd×h and Whi , Whf , Who ∈ Rh×h are weight parameters and bi , bf , bo ∈
R1×h are bias parameters.
Candidate Memory Cell
Next we design the memory cell. Since we have not specified the action of the various gates yet, we
first introduce the candidate memory cell C̃t ∈ Rn×h . Its computation is similar to that of the three
gates described above, but using a tanh function with a value range of (−1, 1) as the activation
function. This leads to the following equation at time step t:
C̃t = tanh(Xt Wxc + Ht−1 Whc + bc ), (10.2.2)
where Wxc ∈ Rd×h and Whc ∈ Rh×h are weight parameters and bc ∈ R1×h is a bias parameter.
A quick illustration of the candidate memory cell is shown in Fig. 10.2.2.

Fig. 10.2.2: Computing the candidate memory cell in an LSTM model.
Memory Cell
In GRUs, we have a mechanism to govern input and forgetting (or skipping). Similarly, in LSTMs
we have two dedicated gates for such purposes: the input gate It governs how much we take new
data into account via C̃t and the forget gate Ft addresses how much of the old memory cell content
Ct−1 ∈ Rn×h we retain. Using the same pointwise multiplication trick as before, we arrive at the
following update equation:
Ct = Ft ⊙ Ct−1 + It ⊙ C̃t . (10.2.3)
If the forget gate is always approximately 1 and the input gate is always approximately 0, the past
memory cells Ct−1 will be saved over time and passed to the current time step. This design is intro-
duced to alleviate the vanishing gradient problem and to better capture long range dependencies
within sequences.
We thus arrive at the flow diagram in Fig. 10.2.3.

Fig. 10.2.3: Computing the memory cell in an LSTM model.
Hidden State
Last, we need to define how to compute the hidden state Ht ∈ Rn×h . This is where the output gate
comes into play. In LSTM it is simply a gated version of the tanh of the memory cell. This ensures
that the values of Ht are always in the interval (−1, 1).
Ht = Ot ⊙ tanh(Ct ). (10.2.4)
Whenever the output gate approximates 1 we effectively pass all memory information through to
the predictor, whereas for the output gate close to 0 we retain all the information only within the
memory cell and perform no further processing.
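A scalar toy version of (10.2.1) to (10.2.4) illustrates both behaviors: retaining the memory cell while the output gate is closed, and reading it out once the output gate opens. The sketch uses plain Python with one input and one hidden unit. The parameter values, the large biases that saturate the gates, and the function names are assumptions chosen for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step for a scalar input, hidden state, and memory cell."""
    i = sigmoid(x * p["w_xi"] + h_prev * p["w_hi"] + p["b_i"])  # input gate
    f = sigmoid(x * p["w_xf"] + h_prev * p["w_hf"] + p["b_f"])  # forget gate
    o = sigmoid(x * p["w_xo"] + h_prev * p["w_ho"] + p["b_o"])  # output gate
    c_tilde = math.tanh(x * p["w_xc"] + h_prev * p["w_hc"] + p["b_c"])
    c = f * c_prev + i * c_tilde   # memory cell update
    h = o * math.tanh(c)           # gated read-out of the memory cell
    return h, c

# Forget gate near 1, input gate near 0, output gate near 0.
closed = dict(w_xi=0.1, w_hi=0.1, b_i=-20.0,
              w_xf=0.1, w_hf=0.1, b_f=20.0,
              w_xo=0.1, w_ho=0.1, b_o=-20.0,
              w_xc=1.0, w_hc=1.0, b_c=0.0)

h, c = 0.0, 0.5
for x in [1.0, -2.0, 0.5] * 10:
    h, c = lstm_step(x, h, c, closed)
print(h, c)  # the cell is retained while the hidden state stays near 0

# Opening the output gate exposes the stored content through tanh.
opened = dict(closed, b_o=20.0)
h_read, c_read = lstm_step(0.0, h, c, opened)
print(h_read, math.tanh(0.5))
```

The memory cell keeps its initial content across all 30 steps even though none of it is visible in the hidden state, and it becomes visible as soon as the output gate is opened.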
Fig. 10.2.4 has a graphical illustration of the data flow.
Fig. 10.2.4: Computing the hidden state in an LSTM model.

Now letʼs implement an LSTM from scratch. As in the experiments in Section 9.5, we first
load The Time Machine dataset.
import torch
from torch import nn
from d2l import torch as d2l
Next we need to define and initialize the model parameters. As previously, the hyperparameter
num_hiddens defines the number of hidden units. We initialize weights following a Gaussian dis-
tribution with 0.01 standard deviation, and we set the biases to 0.
class LSTMScratch(d2l.Module):  #@save
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        self.save_hyperparameters()
        # Weights are drawn from a Gaussian with standard deviation sigma
        init_weight = lambda *shape: nn.Parameter(torch.randn(*shape) * sigma)
        # Each gate needs an input weight, a hidden weight, and a zero bias
        triple = lambda: (init_weight(num_inputs, num_hiddens),
                          init_weight(num_hiddens, num_hiddens),
                          nn.Parameter(torch.zeros(num_hiddens)))
        self.W_xi, self.W_hi, self.b_i = triple()  # Input gate
        self.W_xf, self.W_hf, self.b_f = triple()  # Forget gate
        self.W_xo, self.W_ho, self.b_o = triple()  # Output gate
        self.W_xc, self.W_hc, self.b_c = triple()  # Candidate memory cell
The actual model is defined just like what we discussed before: providing three gates and an aux-
iliary memory cell. Note that only the hidden state is passed to the output layer. The memory cell
Ct does not directly participate in the output computation.
@d2l.add_to_class(LSTMScratch)
def forward(self, inputs, H_C=None):
    # Parenthesize the default so that H and C are both unpacked from H_C
    H, C = (None, None) if H_C is None else H_C
    outputs = []
    for X in inputs:
        # Input gate; without an initial state, the hidden term contributes 0
        I = torch.sigmoid(torch.matmul(X, self.W_xi) + (
            torch.matmul(H, self.W_hi) if H is not None else 0) + self.b_i)
        if H is None:
            H, C = torch.zeros_like(I), torch.zeros_like(I)
        # Forget gate and output gate
        F = torch.sigmoid(torch.matmul(X, self.W_xf) +
                          torch.matmul(H, self.W_hf) + self.b_f)
        O = torch.sigmoid(torch.matmul(X, self.W_xo) +
                          torch.matmul(H, self.W_ho) + self.b_o)
        # Candidate memory cell
        C_tilde = torch.tanh(torch.matmul(X, self.W_xc) +
                             torch.matmul(H, self.W_hc) + self.b_c)
        C = F * C + I * C_tilde
        H = O * torch.tanh(C)
        outputs.append(H)
    return outputs, (H, C)
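One Python detail matters when an optional state tuple `H_C` is unpacked into `H` and `C`: a conditional expression binds more tightly than the comma that forms a tuple. The following small sketch, with hypothetical helper names `unpack_unparenthesized` and `unpack_parenthesized`, shows the difference using strings as stand-ins for tensors.

```python
def unpack_unparenthesized(H_C=None):
    # Parsed as: H, C = (None, (None if H_C is None else H_C))
    H, C = None, None if H_C is None else H_C
    return H, C

def unpack_parenthesized(H_C=None):
    # The conditional now selects between two complete tuples.
    H, C = (None, None) if H_C is None else H_C
    return H, C

state = ("hidden", "cell")
print(unpack_unparenthesized(state))
print(unpack_parenthesized(state))
```

Without the parentheses, a supplied state is never unpacked: `H` is always `None` and `C` receives the whole tuple. With the parentheses, both values are restored as intended.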

Training and Prediction
Letʼs train an LSTM just as we did in Section 10.1, by instantiating the RNNLMScratch class
as introduced in Section 9.5.

lstm = LSTMScratch(num_inputs=len(data.vocab), num_hiddens=32)
model = d2l.RNNLMScratch(lstm, vocab_size=len(data.vocab), lr=4)
Using high-level APIs, we can directly instantiate an LSTM model. This encapsulates all the
configuration details that we made explicit above. The code is significantly faster as it uses compiled
operators rather than Python for many details that we spelled out before.
class LSTM(d2l.RNN):
    def __init__(self, num_inputs, num_hiddens):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = nn.LSTM(num_inputs, num_hiddens)

    def forward(self, inputs, H_C=None):
        return self.rnn(inputs, H_C)
lstm = LSTM(num_inputs=len(data.vocab), num_hiddens=32)

model = d2l.RNNLM(lstm, vocab_size=len(data.vocab), lr=4)

'it has in the pestical and'
LSTMs are the prototypical latent variable autoregressive model with nontrivial state control.
Many variants thereof have been proposed over the years, e.g., multiple layers, residual
connections, and different types of regularization. However, training LSTMs and other sequence models
(such as GRUs) is quite costly due to the long-range dependency of the sequence. Later we will
encounter alternative models such as transformers that can be used in some cases.
10.2.4 Summary
• LSTMs have three types of gates: input gates, forget gates, and output gates that control the
flow of information.
• The hidden layer output of LSTM includes the hidden state and the memory cell. Only the
hidden state is passed into the output layer. The memory cell is entirely internal.
• LSTMs can alleviate vanishing and exploding gradients.
Exercises
1. Adjust the hyperparameters and analyze their influence on running time, perplexity, and
the output sequence.
2. How would you need to change the model to generate proper words as opposed to sequences
of characters?
3. Compare the computational cost for GRUs, LSTMs, and regular RNNs for a given hidden
dimension. Pay special attention to the training and inference cost.
4. Since the candidate memory cell ensures that the value range is between −1 and 1 by using
the tanh function, why does the hidden state need to use the tanh function again to ensure
that the output value range is between −1 and 1?
5. Implement an LSTM model for time series prediction rather than character sequence pre-
diction.

10.3 Deep Recurrent Neural Networks
Up to now, we only discussed RNNs with a single unidirectional hidden layer. In it the specific
functional form of how latent variables and observations interact is rather arbitrary. This is not a
big problem as long as we have enough flexibility to model different types of interactions. With a
single layer, however, this can be quite challenging. In the case of the linear models, we fixed this
problem by adding more layers. Within RNNs this is a bit trickier, since we first need to decide
how and where to add extra nonlinearity.
In fact, we could stack multiple layers of RNNs on top of each other. This results in a flexible
mechanism, due to the combination of several simple layers. In particular, data might be relevant
at different levels of the stack. For instance, we might want to keep high-level data about financial
market conditions (bear or bull market) available, whereas at a lower level we only record shorter-
term temporal dynamics.
Beyond all the above abstract discussion it is probably easiest to understand the family of models
we are interested in by reviewing Fig. 10.3.1. It describes a deep RNN with L hidden layers. Each
hidden state is continuously passed to both the next time step of the current layer and the current
time step of the next layer.
Fig. 10.3.1: Architecture of a deep RNN.


10.3.1 Functional Dependencies
We can formalize the functional dependencies within the deep architecture of L hidden layers
depicted in Fig. 10.3.1. Our following discussion focuses primarily on the vanilla RNN model, but
it applies to other sequence models, too.
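Suppose that we have a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of examples: $n$, number of inputs
in each example: $d$) at time step $t$. At the same time step, let the hidden state of the $l$th hidden
layer ($l = 1, \ldots, L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$ (number of hidden units: $h$) and the output layer variable be
$\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$). Setting $\mathbf{H}_t^{(0)} = \mathbf{X}_t$, the hidden state of the $l$th hidden layer that
uses the activation function $\phi_l$ is expressed as follows:
$$\mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)} + \mathbf{b}_h^{(l)}), \tag{10.3.1}$$
where the weights $\mathbf{W}_{xh}^{(l)} \in \mathbb{R}^{h \times h}$ (or $\mathbb{R}^{d \times h}$ for the first layer, $l = 1$, whose input is $\mathbf{X}_t$) and $\mathbf{W}_{hh}^{(l)} \in \mathbb{R}^{h \times h}$, together with the bias $\mathbf{b}_h^{(l)} \in \mathbb{R}^{1 \times h}$, are the model
parameters of the $l$th hidden layer.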
In the end, the calculation of the output layer is only based on the hidden state of the final Lth
hidden layer:
$$\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q, \tag{10.3.2}$$
where the weight Whq ∈ Rh×q and the bias bq ∈ R1×q are the model parameters of the output
layer.
Just as with MLPs, the number of hidden layers L and the number of hidden units h are hyperpa-
rameters. In other words, they can be tuned or specified by us. In addition, we can easily get a
deep gated RNN by replacing the hidden state computation in (10.3.1) with that from a GRU or an
LSTM.
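The layer-by-layer dependency in (10.3.1) can be traced with a small sketch in plain Python for a single example (n = 1). Each layer receives the current output of the layer below and its own previous hidden state. The helper names `affine` and `deep_rnn_step`, the tanh activation, the random initialization, and the sizes d = 3, h = 2, and L = 2 are all assumptions chosen for illustration.

```python
import math
import random

def affine(v, W, b):
    """Compute v W + b for a row vector v, a matrix W (list of rows), and a bias b."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) + b[j]
            for j in range(len(b))]

def deep_rnn_step(x_t, H_prev, layers):
    """Advance all layers by one time step, following (10.3.1) with tanh.

    H_prev[l] is the previous hidden state of layer l + 1, and each entry of
    `layers` is a dict with the keys W_xh, W_hh, and b_h.
    """
    below, H_new = x_t, []  # the input plays the role of layer 0
    for h_prev, p in zip(H_prev, layers):
        from_below = affine(below, p["W_xh"], p["b_h"])
        from_past = affine(h_prev, p["W_hh"], [0.0] * len(p["b_h"]))
        below = [math.tanh(a + b) for a, b in zip(from_below, from_past)]
        H_new.append(below)
    return H_new

rng = random.Random(0)

def init(rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

d, h, L = 3, 2, 2
# Only the first layer maps from d inputs; deeper layers map from h units.
layers = [dict(W_xh=init(d if l == 0 else h, h), W_hh=init(h, h), b_h=[0.0] * h)
          for l in range(L)]

H = [[0.0] * h for _ in range(L)]
for x_t in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    H = deep_rnn_step(x_t, H, layers)
print(H[-1])  # hidden state of the top layer, which would feed the output layer
```

Only the top entry of `H` would be passed on to the output layer in (10.3.2), while every entry is carried forward to the next time step.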
import torch
from torch import nn
from d2l import torch as d2l
To implement a multi-layer RNN from scratch, we can treat each layer as an RNNScratch instance
with its own learnable parameters.
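class StackedRNNScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, num_layers, sigma=0.01):
        super().__init__()
        # Stores num_layers (and the other arguments) as attributes
        self.save_hyperparameters()
        # Only the first layer sees num_inputs; deeper layers see num_hiddens
        self.rnns = nn.Sequential(*[d2l.RNNScratch(
            num_inputs if i==0 else num_hiddens, num_hiddens, sigma)
            for i in range(num_layers)])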
The multi-layer forward computation simply performs forward computation layer by layer.
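@d2l.add_to_class(StackedRNNScratch)
def forward(self, inputs, Hs=None):
    outputs = inputs
    # One initial hidden state per layer (not per time step)
    if Hs is None: Hs = [None] * self.num_layers
    for i in range(self.num_layers):
        # The outputs of layer i are the inputs of layer i + 1
        outputs, Hs[i] = self.rnns[i](outputs, Hs[i])
    return outputs, Hs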
As an example, we train a deep RNN model on The Time Machine dataset (same as in Section 9.5).
To keep things simple we set the number of layers to 2.

rnn_block = StackedRNNScratch(num_inputs=len(data.vocab),
num_hiddens=32, num_layers=2)
model = d2l.RNNLMScratch(rnn_block, vocab_size=len(data.vocab), lr=2)
Fortunately many of the logistical details required to implement multiple layers of an RNN are
readily available in high-level APIs. Our concise implementation will use such built-in function-
alities. The code generalizes the one we used previously in Section 10.1, allowing specification of
the number of layers explicitly rather than picking the default of a single layer.
class GRU(d2l.RNN):  #@save
    def __init__(self, num_inputs, num_hiddens, num_layers, dropout=0):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = nn.GRU(num_inputs, num_hiddens, num_layers,
                          dropout=dropout)
The architectural decisions such as choosing hyperparameters are very similar to those of Section
10.1. We pick the same number of inputs and outputs as we have distinct tokens, i.e., vocab_size.
The number of hidden units is still 32. The only difference is that we now select a nontrivial num-
ber of hidden layers by specifying the value of num_layers.

gru = GRU(num_inputs=len(data.vocab), num_hiddens=32, num_layers=2)
'it has no dimension the ti'
10.3.4 Summary
• In deep RNNs, the hidden state information is passed to the next time step of the current
layer and the current time step of the next layer.
• There exist many different flavors of deep RNNs, such as LSTMs, GRUs, or vanilla RNNs.
Conveniently these models are all available as parts of the high-level APIs of deep learning
frameworks.
• Initialization of models requires care. Overall, deep RNNs require a considerable amount of
work (such as tuning the learning rate and clipping) to ensure proper convergence.
Exercises
1. Replace the GRU by an LSTM and compare the accuracy and training speed.
2. Increase the training data to include multiple books. How low can you go on the perplexity
scale?
3. Would you want to combine sources of different authors when modeling text? Why is this a
good idea? What could go wrong?

10.4 Bidirectional Recurrent Neural Networks
In sequence learning, so far we assumed that our goal is to model the next output given what we
have seen so far, e.g., in the context of a time series or in the context of a language model. While
this is a typical scenario, it is not the only one we might encounter. To illustrate the issue, consider
the following three tasks of filling in the blank in a text sequence:
• I am ___.
• I am ___ hungry.
• I am ___ hungry, and I can eat half a pig.
Depending on the amount of information available, we might fill in the blanks with very differ-
ent words such as “happy”, “not”, and “very”. Clearly the end of the phrase (if available) conveys
significant information about which word to pick. A sequence model that is incapable of taking
advantage of this will perform poorly on related tasks. For instance, to do well in named entity
recognition (e.g., to recognize whether “Green” refers to “Mr. Green” or to the color) longer-range
context is equally vital. To get some inspiration for addressing the problem letʼs take a detour to
probabilistic graphical models.
10.4.1 Dynamic Programming in Hidden Markov Models
This subsection serves to illustrate the dynamic programming problem. The specific technical
details do not matter for understanding the deep learning models but they help in motivating why
one might use deep learning and why one might pick specific architectures.
If we want to solve the problem using probabilistic graphical models we could for instance design
a latent variable model as follows. At any time step t, we assume that there exists some latent
variable ht that governs our observed emission xt via P (xt | ht ). Moreover, any transition ht →
ht+1 is given by some state transition probability P (ht+1 | ht ). This probabilistic graphical model
is then a hidden Markov model as in Fig. 10.4.1.
Fig. 10.4.1: A hidden Markov model.
Thus, for a sequence of T observations we have the following joint probability distribution over
the observed and hidden states:
$$P(x_1, \ldots, x_T, h_1, \ldots, h_T) = \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t), \text{ where } P(h_1 \mid h_0) = P(h_1). \tag{10.4.1}$$
Now assume that we observe all xi with the exception of some xj and it is our goal to compute
P (xj | x−j ), where x−j = (x1 , . . . , xj−1 , xj+1 , . . . , xT ). Since there is no latent variable in P (xj |
x−j ), we consider summing over all the possible combinations of choices for h1 , . . . , hT . In case
any hi can take on k distinct values (a finite number of states), this means that we need to sum over
k^T terms, which is usually mission impossible! Fortunately there is an elegant solution for this: dynamic
programming.

To see how it works, consider summing over latent variables h1 , . . . , hT in turn. According to
(10.4.1), this yields:
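$$\begin{aligned}
&P(x_1, \ldots, x_T) \\
=& \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\
=& \sum_{h_1, \ldots, h_T} \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \sum_{h_2, \ldots, h_T} \underbrace{\left[\sum_{h_1} P(h_1) P(x_1 \mid h_1) P(h_2 \mid h_1)\right]}_{\pi_2(h_2) \stackrel{\mathrm{def}}{=}} P(x_2 \mid h_2) \prod_{t=3}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \sum_{h_3, \ldots, h_T} \underbrace{\left[\sum_{h_2} \pi_2(h_2) P(x_2 \mid h_2) P(h_3 \mid h_2)\right]}_{\pi_3(h_3) \stackrel{\mathrm{def}}{=}} P(x_3 \mid h_3) \prod_{t=4}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \dots \\
=& \sum_{h_T} \pi_T(h_T) P(x_T \mid h_T).
\end{aligned} \tag{10.4.2}$$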
In general we have the forward recursion as

$$\pi_{t+1}(h_{t+1}) = \sum_{h_t} \pi_t(h_t) P(x_t \mid h_t) P(h_{t+1} \mid h_t). \tag{10.4.3}$$
The recursion is initialized as π1 (h1 ) = P (h1 ). In abstract terms this can be written as πt+1 =
f (πt , xt ), where f is some learnable function. This looks very much like the update equation in
the latent variable models we discussed so far in the context of RNNs!
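To see the forward recursion at work, the sketch below evaluates P(x1, . . . , xT) for a tiny hidden Markov model in two ways: by the recursion (10.4.3) and by brute-force summation over all k^T hidden sequences. The model with k = 2 hidden states and two possible emissions, its probability tables, and the function names are hypothetical and serve only as an illustration.

```python
from itertools import product

# Hypothetical HMM with k = 2 hidden states and 2 possible emissions.
prior = [0.6, 0.4]                # prior[a] = P(h_1 = a)
trans = [[0.7, 0.3], [0.2, 0.8]]  # trans[a][b] = P(h_{t+1} = b | h_t = a)
emit = [[0.9, 0.1], [0.3, 0.7]]   # emit[a][x] = P(x_t = x | h_t = a)

def likelihood_forward(xs):
    """P(x_1, ..., x_T) via the forward recursion."""
    k = len(prior)
    pi = prior[:]                 # pi_1(h_1) = P(h_1)
    for x in xs[:-1]:             # builds pi_2, ..., pi_T
        pi = [sum(pi[a] * emit[a][x] * trans[a][b] for a in range(k))
              for b in range(k)]
    return sum(pi[a] * emit[a][xs[-1]] for a in range(k))

def likelihood_brute_force(xs):
    """P(x_1, ..., x_T) by summing the joint over all k^T hidden sequences."""
    k, total = len(prior), 0.0
    for hs in product(range(k), repeat=len(xs)):
        p = prior[hs[0]] * emit[hs[0]][xs[0]]
        for t in range(1, len(xs)):
            p *= trans[hs[t - 1]][hs[t]] * emit[hs[t]][xs[t]]
        total += p
    return total

xs = [0, 1, 1, 0, 1]
print(likelihood_forward(xs), likelihood_brute_force(xs))
```

Both functions should return the same value up to floating-point error, but the brute-force loop visits k^T hidden sequences, whereas the recursion only performs T − 1 small updates of a length-k vector.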
Entirely analogously to the forward recursion, we can also sum over the same set of latent variables
with a backward recursion. This yields:
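$$\begin{aligned}
&P(x_1, \ldots, x_T) \\
=& \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\
=& \sum_{h_1, \ldots, h_T} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot P(h_T \mid h_{T-1}) P(x_T \mid h_T) \\
=& \sum_{h_1, \ldots, h_{T-1}} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_T} P(h_T \mid h_{T-1}) P(x_T \mid h_T)\right]}_{\rho_{T-1}(h_{T-1}) \stackrel{\mathrm{def}}{=}} \\
=& \sum_{h_1, \ldots, h_{T-2}} \prod_{t=1}^{T-2} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_{T-1}} P(h_{T-1} \mid h_{T-2}) P(x_{T-1} \mid h_{T-1}) \rho_{T-1}(h_{T-1})\right]}_{\rho_{T-2}(h_{T-2}) \stackrel{\mathrm{def}}{=}} \\
=& \dots \\
=& \sum_{h_1} P(h_1) P(x_1 \mid h_1) \rho_1(h_1).
\end{aligned} \tag{10.4.4}$$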

We can thus write the backward recursion as

$$\rho_{t-1}(h_{t-1}) = \sum_{h_t} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \rho_t(h_t), \tag{10.4.5}$$

with initialization ρT (hT ) = 1. Both the forward and backward recursions allow us to sum over
T latent variables in O(kT ) (linear) time over all values of (h1 , . . . , hT ) rather than in exponential
time. This is one of the great benefits of the probabilistic inference with graphical models. It
is also a very special instance of a general message passing algorithm (Aji and McEliece, 2000).
Combining both forward and backward recursions, we are able to compute

$$P(x_j \mid x_{-j}) \propto \sum_{h_j} \pi_j(h_j) \rho_j(h_j) P(x_j \mid h_j). \tag{10.4.6}$$

Note that in abstract terms the backward recursion can be written as $\rho_{t-1} = g(\rho_t, x_t)$,
where $g$ is a learnable function. Again, this looks very much like an update equation, just running
backwards unlike what we have seen so far in RNNs. Indeed, hidden Markov models benefit from knowing
future data when it is available. Signal processing scientists distinguish between the two cases of
knowing and not knowing future observations as interpolation versus extrapolation. See the
introductory chapter of the book on sequential Monte Carlo algorithms for more details (Doucet et al.,
2001).
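The combination of both recursions in (10.4.6) can be sketched in pure Python as follows. The toy model (`prior`, `trans`, `emit`) is again a made-up assumption for illustration. Note that $\pi_j$ only uses observations before position $j$ and $\rho_j$ only uses observations after it, so the observed value at position $j$ never enters the messages.

```python
from itertools import product

# Hypothetical toy HMM (all numbers are illustrative assumptions).
K = 2                               # number of hidden states
prior = [0.6, 0.4]                  # P(h_1)
trans = [[0.7, 0.3], [0.2, 0.8]]    # trans[i][j] = P(h_t = j | h_{t-1} = i)
emit = [[0.9, 0.1], [0.3, 0.7]]     # emit[i][x] = P(x_t = x | h_t = i)

def forward_messages(xs):
    """Return [pi_1, ..., pi_T]; pi_t depends only on x_1, ..., x_{t-1}."""
    pis = [prior[:]]
    for x in xs[:-1]:
        prev = pis[-1]
        pis.append([sum(prev[i] * emit[i][x] * trans[i][j] for i in range(K))
                    for j in range(K)])
    return pis

def backward_messages(xs):
    """Return [rho_1, ..., rho_T]; rho_t depends only on x_{t+1}, ..., x_T."""
    rhos = [[1.0] * K]              # initialization rho_T(h_T) = 1
    for x in reversed(xs[1:]):
        nxt = rhos[0]
        # rho_{t-1}(i) = sum_j P(j | i) P(x_t | j) rho_t(j)
        rhos.insert(0, [sum(trans[i][j] * emit[j][x] * nxt[j]
                            for j in range(K)) for i in range(K)])
    return rhos

def conditional(xs, j):
    """P(x_j | x_{-j}) via (10.4.6); j is 0-based and xs[j] is ignored."""
    pi_j, rho_j = forward_messages(xs)[j], backward_messages(xs)[j]
    scores = [sum(pi_j[h] * rho_j[h] * emit[h][x] for h in range(K))
              for x in range(len(emit[0]))]
    total = sum(scores)
    return [s / total for s in scores]

def joint(xs):
    """Brute-force P(x_1, ..., x_T) by summing over all K^T hidden paths."""
    total = 0.0
    for hs in product(range(K), repeat=len(xs)):
        p = prior[hs[0]] * emit[hs[0]][xs[0]]
        for t in range(1, len(xs)):
            p *= trans[hs[t - 1]][hs[t]] * emit[hs[t]][xs[t]]
        total += p
    return total

xs, j = [0, 1, 0, 1, 1], 2
cond = conditional(xs, j)
joints = [joint(xs[:j] + [v] + xs[j + 1:]) for v in range(2)]
z = sum(joints)
brute = [p / z for p in joints]
print(cond, brute)
```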
10.4.2 Bidirectional Model
If we want to have a mechanism in RNNs that offers comparable look-ahead ability as in hidden
Markov models, we need to modify the RNN design that we have seen so far. Fortunately, this is
easy conceptually. Instead of running an RNN only in the forward mode starting from the first
token, we start another one from the last token running from back to front. Bidirectional RNNs
add a hidden layer that passes information in a backward direction to more flexibly process such
information. Fig. 10.4.2 illustrates the architecture of a bidirectional RNN with a single hidden
layer.
Fig. 10.4.2: Architecture of a bidirectional RNN.
In fact, this is not too dissimilar to the forward and backward recursions in the dynamic
programming of hidden Markov models. The main distinction is that in the previous case these
equations had a specific statistical meaning. Now they are devoid of such easily accessible
interpretations and we can just treat them as generic and learnable functions. This transition
epitomizes many of the principles guiding the design of modern deep networks: first, use the type
of functional dependencies of classical statistical models, and then parameterize them in a generic
form.

Definition
Bidirectional RNNs were introduced by Schuster and Paliwal (1997). For a detailed discussion
of the various architectures see also Graves and Schmidhuber (2005). Let's look at the
specifics of such a network.
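For any time step $t$, consider a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of
examples: $n$, number of inputs in each example: $d$) and let the hidden layer activation function be
$\phi$. In the bidirectional architecture, we assume that the forward and backward hidden states for
this time step are $\overrightarrow{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$ and
$\overleftarrow{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$, respectively, where $h$ is the number of
hidden units. The forward and backward hidden state updates are as follows:

$$
\begin{aligned}
\overrightarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(f)} + \overrightarrow{\mathbf{H}}_{t-1} \mathbf{W}_{hh}^{(f)} + \mathbf{b}_h^{(f)}),\\
\overleftarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(b)} + \overleftarrow{\mathbf{H}}_{t+1} \mathbf{W}_{hh}^{(b)} + \mathbf{b}_h^{(b)}),
\end{aligned}
\tag{10.4.7}
$$

where the weights $\mathbf{W}_{xh}^{(f)} \in \mathbb{R}^{d \times h}$, $\mathbf{W}_{hh}^{(f)} \in \mathbb{R}^{h \times h}$,
$\mathbf{W}_{xh}^{(b)} \in \mathbb{R}^{d \times h}$, and $\mathbf{W}_{hh}^{(b)} \in \mathbb{R}^{h \times h}$, and biases
$\mathbf{b}_h^{(f)} \in \mathbb{R}^{1 \times h}$ and $\mathbf{b}_h^{(b)} \in \mathbb{R}^{1 \times h}$ are all the model parameters.

Next, we concatenate the forward and backward hidden states $\overrightarrow{\mathbf{H}}_t$ and
$\overleftarrow{\mathbf{H}}_t$ to obtain the hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times 2h}$ to be fed
into the output layer. In deep bidirectional RNNs with multiple hidden layers, such information is
passed on as input to the next bidirectional layer. Last, the output layer computes the output
$\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$):

$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q. \tag{10.4.8}$$

Here, the weight matrix $\mathbf{W}_{hq} \in \mathbb{R}^{2h \times q}$ and the bias
$\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer. In fact, the
two directions can have different numbers of hidden units.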
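As a rough illustration of (10.4.7) and the concatenation step, the following pure-Python sketch runs a tiny bidirectional RNN on a single example ($n = 1$). The sizes, the random weights, and the choice of tanh for $\phi$ are all assumptions made for this sketch; the helper names `step` and `birnn` are hypothetical.

```python
import math
import random

random.seed(0)
d, h = 3, 2  # hypothetical number of inputs and hidden units

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)]
            for _ in range(rows)]

def step(x, h_prev, W_xh, W_hh, b):
    """One update phi(x W_xh + h_prev W_hh + b) with phi = tanh."""
    return [math.tanh(sum(x[i] * W_xh[i][j] for i in range(len(x)))
                      + sum(h_prev[i] * W_hh[i][j] for i in range(len(h_prev)))
                      + b[j])
            for j in range(len(b))]

# Separate parameters for the forward (f) and backward (b) directions
params_f = (rand_matrix(d, h), rand_matrix(h, h), [0.0] * h)
params_b = (rand_matrix(d, h), rand_matrix(h, h), [0.0] * h)

def birnn(xs):
    """Return one concatenated hidden state of size 2h per time step."""
    fwd, state = [], [0.0] * h
    for x in xs:                    # first token to last token
        state = step(x, state, *params_f)
        fwd.append(state)
    bwd, state = [], [0.0] * h
    for x in reversed(xs):          # last token to first token
        state = step(x, state, *params_b)
        bwd.append(state)
    bwd.reverse()                   # realign so that bwd[t] belongs to step t
    return [f + b for f, b in zip(fwd, bwd)]

xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(4)]
H = birnn(xs)
print(len(H), len(H[0]))
```

Changing only the last input leaves the forward half of the first hidden state untouched but alters its backward half, which is exactly the look-ahead behavior described above.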
Computational Cost and Applications
One of the key features of a bidirectional RNN is that information from both ends of the sequence
is used to estimate the output. That is, we use information from both future and past observations
to predict the current one. In the case of next token prediction this is not quite what we want.
After all, we do not have the luxury of knowing the token after the next one when predicting the
next one. Hence, if we were to use a bidirectional RNN naively we would not get very good accuracy:
during training we have both past and future data to estimate the present, whereas at test time we
only have past data, and accuracy suffers accordingly. We will illustrate this in an experiment below.
To add insult to injury, bidirectional RNNs are also exceedingly slow. The main reasons for this
are that the forward propagation requires both forward and backward recursions in bidirectional
layers and that the backpropagation is dependent on the outcomes of the forward propagation.
Hence, gradients will have a very long dependency chain.
In practice bidirectional layers are used very sparingly and only for a narrow set of applications,
such as filling in missing words, annotating tokens (e.g., for named entity recognition), and en-
coding sequences wholesale as a step in a sequence processing pipeline (e.g., for machine transla-
tion). In Section 15.8 and Section 16.2, we will introduce how to use bidirectional RNNs to encode
text sequences.
import torch
from torch import nn
from d2l import torch as d2l

10.4.3 Training a Bidirectional RNN for a Wrong Application
If we were to ignore all advice regarding the fact that bidirectional RNNs use past and future data
and simply apply them to language models, we would get estimates with acceptable perplexity.
Nonetheless, the ability of the model to predict future tokens is severely compromised, as the
experiment below illustrates. Despite reasonable perplexity, it only generates gibberish even after
many iterations. We include the code below as a cautionary example against using them in the wrong
context.
Implementation from Scratch
To implement a bidirectional RNN from scratch, we can include two unidirectional RNNScratch
instances with separate learnable parameters.
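class BiRNNScratch(d2l.Module):
    def __init__(self, num_inputs, num_hiddens, sigma=0.01):
        super().__init__()
        # Store the constructor arguments as attributes (e.g., self.num_hiddens)
        self.save_hyperparameters()
        self.f_rnn = d2l.RNNScratch(num_inputs, num_hiddens, sigma)
        self.b_rnn = d2l.RNNScratch(num_inputs, num_hiddens, sigma)
        self.num_hiddens *= 2  # The output dimension will be doubled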
States of forward and backward RNNs are updated separately, while outputs of these two RNNs
are concatenated.
@d2l.add_to_class(BiRNNScratch)
def forward(self, inputs, Hs=None):
    f_H, b_H = Hs if Hs is not None else (None, None)
    f_outputs, f_H = self.f_rnn(inputs, f_H)
    b_outputs, b_H = self.b_rnn(reversed(inputs), b_H)
    # b_outputs runs from the last time step to the first, so reverse it
    # again to align each backward output with its forward counterpart
    outputs = [torch.cat((f, b), -1) for f, b in zip(
        f_outputs, reversed(b_outputs))]
    return outputs, (f_H, b_H)
The training procedure is the same as in Section 9.5.

birnn = BiRNNScratch(num_inputs=len(data.vocab), num_hiddens=32)
model = d2l.RNNLMScratch(birnn, vocab_size=len(data.vocab), lr=2)

Concise Implementation
Using the high-level APIs, we can implement bidirectional RNNs more concisely. Here we take a
GRU model as an example.
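class BiGRU(d2l.RNN):
    def __init__(self, num_inputs, num_hiddens):
        d2l.Module.__init__(self)
        self.save_hyperparameters()
        self.rnn = nn.GRU(num_inputs, num_hiddens, bidirectional=True)
        self.num_hiddens *= 2  # Forward and backward states are concatenated

gru = BiGRU(num_inputs=len(data.vocab), num_hiddens=32)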

'it hasasasasasasasasasasas'

The output is clearly unsatisfactory for the reasons described above. For a discussion of more
effective uses of bidirectional RNNs, please see the sentiment analysis application in Section 16.2.
10.4.4 Summary
• In bidirectional RNNs, the hidden state for each time step is simultaneously determined by
the data prior to and after the current time step.
• Bidirectional RNNs bear a striking resemblance with the forward-backward algorithm in
probabilistic graphical models.
• Bidirectional RNNs are mostly useful for sequence encoding and the estimation of observa-
tions given bidirectional context.
• Bidirectional RNNs are very costly to train due to long gradient chains.
Exercises
1. If the different directions use a different number of hidden units, how will the shape of Ht
change?
2. Design a bidirectional RNN with multiple hidden layers.
3. Polysemy is common in natural languages. For example, the word “bank” has different
meanings in contexts “i went to the bank to deposit cash” and “i went to the bank to sit down”.
How can we design a neural network model such that given a context sequence and a word,
a vector representation of the word in the context will be returned? What type of neural
architectures is preferred for handling polysemy?
10.5 Machine Translation and the Dataset
We have used RNNs to design language models, which are key to natural language processing.
Another flagship benchmark is machine translation, a central problem domain for sequence trans-
duction models that transform input sequences into output sequences. Playing a crucial role in
various modern AI applications, sequence transduction models will form the focus of the remain-
der of this chapter and Chapter 11. To this end, this section introduces the machine translation
problem and its dataset that will be used later.
Machine translation refers to the automatic translation of a sequence from one language to another.
In fact, this field may date back to the 1940s, soon after digital computers were invented, especially
considering the use of computers for cracking language codes in World War II. For decades,
statistical approaches had been dominant in this field (Brown et al., 1990, Brown et al., 1988) before
the rise of end-to-end learning using neural networks. The latter is often called neural machine
translation to distinguish itself from statistical machine translation that involves statistical analysis
in components such as the translation model and the language model.
Emphasizing end-to-end learning, this book will focus on neural machine translation methods.
Different from our language model problem in Section 9.3 whose corpus is in one single language,
machine translation datasets are composed of pairs of text sequences that are in the source lan-
guage and the target language, respectively. Thus, instead of reusing the preprocessing routine
for language modeling, we need a different way to preprocess machine translation datasets. In
the following, we show how to load the preprocessed data into minibatches for training.
import os
import torch
from d2l import torch as d2l
10.5.1 Downloading and Preprocessing the Dataset
To begin with, we download an English-French dataset that consists of bilingual sentence pairs
from the Tatoeba Project150 . Each line in the dataset is a tab-delimited pair of an English text
sequence and the translated French text sequence. Note that each text sequence can be just one
sentence or a paragraph of multiple sentences. In this machine translation problem where English
is translated into French, English is the source language and French is the target language.
class MTFraEng(d2l.DataModule):  #@save
    def _download(self):
        d2l.extract(d2l.download(
            d2l.DATA_URL+'fra-eng.zip', self.root,
            '94646ad1522d915e7b0f9296181140edcf86a4f5'))
        with open(self.root + '/fra-eng/fra.txt', encoding='utf-8') as f:
            return f.read()

data = MTFraEng()
raw_text = data._download()
print(raw_text[:75])
Go. Va !
Hi. Salut !
Run! Cours !
Run! Courez !
Who? Qui ?
Wow! Ça alors !
After downloading the dataset, we proceed with several preprocessing steps for the raw text data.
For instance, we replace non-breaking space with space, convert uppercase letters to lowercase
ones, and insert space between words and punctuation marks.
@d2l.add_to_class(MTFraEng)  #@save
def _preprocess(self, text):
    # Replace non-breaking space with space
    text = text.replace('\u202f', ' ').replace('\xa0', ' ')
    # Insert space between words and punctuation marks
    no_space = lambda char, prev_char: char in ',.!?' and prev_char != ' '
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text.lower())]
    return ''.join(out)

150
http://www.manythings.org/anki/

text = data._preprocess(raw_text)
print(text[:80])
go . va !
hi . salut !
run ! cours !
run ! courez !
who ? qui ?
wow ! ça alors !
10.5.2 Tokenization
Different from character-level tokenization in Section 9.3, for machine translation we prefer word-
level tokenization here (state-of-the-art models may use more advanced tokenization techniques).
The following _tokenize method tokenizes the first max_examples text sequence pairs, where each
token is either a word or a punctuation mark. We append the special “<eos>” token to the end of
every sequence to indicate the end of the sequence. When a model is predicting by generating
a sequence token after token, the generation of the “<eos>” token can suggest that the output se-
quence is complete. In the end, the method below returns two lists of token lists: src and tgt.
Specifically, src[i] is a list of tokens from the ith text sequence in the source language (English
here) and tgt[i] is that in the target language (French here).
@d2l.add_to_class(MTFraEng)  #@save
def _tokenize(self, text, max_examples=None):
    src, tgt = [], []
    for i, line in enumerate(text.split('\n')):
        if max_examples and i > max_examples: break
        parts = line.split('\t')
        if len(parts) == 2:
            # Skip empty tokens
            src.append([t for t in f'{parts[0]} <eos>'.split(' ') if t])
            tgt.append([t for t in f'{parts[1]} <eos>'.split(' ') if t])
    return src, tgt

src, tgt = data._tokenize(text)
src[:6], tgt[:6]
([['go', '.', '<eos>'],

['hi', '.', '<eos>'],
['run', '!', '<eos>'],
['run', '!', '<eos>'],
['who', '?', '<eos>'],
['wow', '!', '<eos>']],
[['va', '!', '<eos>'],
['salut', '!', '<eos>'],
['cours', '!', '<eos>'],
['courez', '!', '<eos>'],
['qui', '?', '<eos>'],
['ça', 'alors', '!', '<eos>']])

Letʼs plot the histogram of the number of tokens per text sequence. In this simple English-French
dataset, most of the text sequences have fewer than 20 tokens.
#@save
def show_list_len_pair_hist(legend, xlabel, ylabel, xlist, ylist):
    """Plot the histogram for list length pairs."""
    d2l.set_figsize()
    _, _, patches = d2l.plt.hist(
        [[len(l) for l in xlist], [len(l) for l in ylist]])
    d2l.plt.xlabel(xlabel)
    d2l.plt.ylabel(ylabel)
    for patch in patches[1].patches:
        patch.set_hatch('/')
    d2l.plt.legend(legend)

show_list_len_pair_hist(['source', 'target'], '# tokens per sequence',
                        'count', src, tgt);
10.5.3 Loading Sequences of Fixed Length
Recall that in language modeling each sequence example, either a segment of one sentence or a
span over multiple sentences, has a fixed length. This was specified by the num_steps (number of
time steps or tokens) argument in Section 9.3. In machine translation, each example is a pair of
source and target text sequences, where each text sequence may have different lengths.
For computational efficiency, we can still process a minibatch of text sequences at one time by
truncation and padding. Suppose that every sequence in the same minibatch should have the same
length num_steps. If a text sequence has fewer than num_steps tokens, we will keep appending the
special “<pad>” token to its end until its length reaches num_steps. Otherwise, we will truncate
the text sequence by only taking its first num_steps tokens and discarding the remaining. In this
way, every text sequence will have the same length to be loaded in minibatches of the same shape.
Besides, we also record the length of the source sequence excluding padding tokens. This information
will be needed by some models that we will cover later.
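Truncation, padding, and the valid length can be illustrated with a few lines of plain Python. This is a minimal sketch; the helper names `pad_or_truncate` and `valid_length` are hypothetical and only serve this illustration.

```python
def pad_or_truncate(tokens, num_steps, pad_token='<pad>'):
    """Return exactly num_steps tokens by truncating or padding at the end."""
    if len(tokens) > num_steps:
        return tokens[:num_steps]  # Keep only the first num_steps tokens
    return tokens + [pad_token] * (num_steps - len(tokens))

def valid_length(tokens, pad_token='<pad>'):
    """Count the tokens that are not padding."""
    return sum(tok != pad_token for tok in tokens)

short = pad_or_truncate(['hi', '.', '<eos>'], 5)
long = pad_or_truncate(['i', 'went', 'to', 'the', 'bank', '.', '<eos>'], 5)
print(short, valid_length(short))
print(long, valid_length(long))
```

Note that truncation may cut off the “<eos>” token of a long sequence, as happens for the second example here.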
Since the machine translation dataset consists of pairs of languages, we can build two separate
vocabularies, one for the source language and one for the target language. With word-level
tokenization, the vocabulary size will be significantly larger than that using character-level
tokenization. To alleviate this, here we treat infrequent tokens that appear fewer than 2 times as
the same unknown (“<unk>”) token. As we will explain later (Fig. 10.7.1), when training with target
sequences, the decoder output (label tokens) can be the same as the decoder input (target tokens),
shifted by one token; and the special beginning-of-sequence “<bos>” token will be used as the first
input token for predicting the target sequence (Fig. 10.7.3).
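The effect of the frequency threshold can be sketched with a simplified vocabulary in plain Python. This is not the `d2l.Vocab` implementation; the functions `build_vocab` and `encode`, the ordering of indices, and the tiny corpus are assumptions chosen to keep the example short.

```python
from collections import Counter

def build_vocab(sequences, min_freq=2,
                reserved=('<unk>', '<pad>', '<bos>', '<eos>')):
    """Map tokens seen at least min_freq times to indices."""
    counts = Counter(tok for seq in sequences for tok in seq)
    kept = sorted(tok for tok, c in counts.items()
                  if c >= min_freq and tok not in reserved)
    idx_to_token = list(reserved) + kept
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

def encode(tokens, token_to_idx):
    """Replace every token that is not in the vocabulary by <unk>."""
    unk = token_to_idx['<unk>']
    return [token_to_idx.get(tok, unk) for tok in tokens]

corpus = [['go', '.', '<eos>'], ['run', '!', '<eos>'], ['run', '.', '<eos>']]
idx_to_token, token_to_idx = build_vocab(corpus, min_freq=2)
print(idx_to_token)
print(encode(['go', '.', '<eos>'], token_to_idx))
```

Here “go” and “!” each appear only once, so both are mapped to the index of “<unk>”, which keeps the vocabulary small at the cost of losing rare words.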
@d2l.add_to_class(MTFraEng)  #@save
def __init__(self, batch_size, num_steps=9, num_train=512, num_val=128):
    super(MTFraEng, self).__init__()
    # Store the arguments as attributes (e.g., self.num_steps)
    self.save_hyperparameters()
    self.arrays, self.src_vocab, self.tgt_vocab = self._build_arrays(
        self._download())

@d2l.add_to_class(MTFraEng)  #@save
def _build_arrays(self, raw_text, src_vocab=None, tgt_vocab=None):
    def _build_array(sentences, vocab, is_tgt=False):
        pad_or_trim = lambda seq, t: (
            seq[:t] if len(seq) > t else seq + ['<pad>'] * (t - len(seq)))
        sentences = [pad_or_trim(s, self.num_steps) for s in sentences]
        if is_tgt:
            sentences = [['<bos>'] + s for s in sentences]
        if vocab is None:
            vocab = d2l.Vocab(sentences, min_freq=2)
        array = torch.tensor([vocab[s] for s in sentences])
        valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)
        return array, vocab, valid_len
    src, tgt = self._tokenize(self._preprocess(raw_text),
                              self.num_train + self.num_val)
    src_array, src_vocab, src_valid_len = _build_array(src, src_vocab)
    tgt_array, tgt_vocab, _ = _build_array(tgt, tgt_vocab, True)
    # Decoder input drops the last token; the label drops the leading <bos>
    return ((src_array, tgt_array[:,:-1], src_valid_len, tgt_array[:,1:]),
            src_vocab, tgt_vocab)
10.5.4 Reading the Dataset
Finally, we define the get_dataloader method to return the data iterator.
@d2l.add_to_class(MTFraEng)  #@save
def get_dataloader(self, train):
    idx = slice(0, self.num_train) if train else slice(self.num_train, None)
    return self.get_tensorloader(self.arrays, train, idx)
Letʼs read the first minibatch from the English-French dataset.
data = MTFraEng(batch_size=3)
src, tgt, src_valid_len, label = next(iter(data.train_dataloader()))
print('source:', src.type(torch.int32))
print('decoder input:', tgt.type(torch.int32))
print('source len excluding pad:', src_valid_len.type(torch.int32))
print('label:', label.type(torch.int32))

source: tensor([[ 84, 170, 2, 3, 4, 4, 4, 4, 4],
[192, 111, 2, 3, 4, 4, 4, 4, 4],
[ 16, 140, 2, 3, 4, 4, 4, 4, 4]], dtype=torch.int32)
decoder input: tensor([[ 3, 105, 6, 2, 4, 5, 5, 5, 5],
[ 3, 6, 0, 4, 5, 5, 5, 5, 5],
[ 3, 18, 0, 4, 5, 5, 5, 5, 5]], dtype=torch.int32)
source len excluding pad: tensor([4, 4, 4], dtype=torch.int32)
label: tensor([[105, 6, 2, 4, 5, 5, 5, 5, 5],
[ 6, 0, 4, 5, 5, 5, 5, 5, 5],
[ 18, 0, 4, 5, 5, 5, 5, 5, 5]], dtype=torch.int32)
Below we show a pair of source and target sequences that are processed by the above
_build_arrays method (in the string format).
@d2l.add_to_class(MTFraEng)  #@save
def build(self, src_sentences, tgt_sentences):
    raw_text = '\n'.join([src + '\t' + tgt for src, tgt in zip(
        src_sentences, tgt_sentences)])
    arrays, _, _ = self._build_arrays(
        raw_text, self.src_vocab, self.tgt_vocab)
    return arrays

src, tgt, _, _ = data.build(['hi .'], ['salut .'])
print('source:', data.src_vocab.to_tokens(src[0].type(torch.int32)))
print('target:', data.tgt_vocab.to_tokens(tgt[0].type(torch.int32)))
source: ['hi', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
target: ['<bos>', 'salut', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']
10.5.5 Summary
• Machine translation refers to the automatic translation of a sequence from one language to
another.
• Using word-level tokenization, the vocabulary size will be significantly larger than that using
character-level tokenization. To alleviate this, we can treat infrequent tokens as the same
unknown token.
• We can truncate and pad text sequences so that all of them will have the same length to be
loaded in minibatches.

Exercises
1. Try different values of the max_examples argument in the _tokenize method. How does this
affect the vocabulary sizes of the source language and the target language?
2. Text in some languages such as Chinese and Japanese does not have word boundary indi-
cators (e.g., space). Is word-level tokenization still a good idea for such cases? Why or why
not?
10.6 Encoder-Decoder Architecture
As we have discussed in Section 10.5, machine translation is a major problem domain for sequence
transduction models, whose input and output are both variable-length sequences. To handle this
type of inputs and outputs, we can design an architecture with two major components. The first
component is an encoder: it takes a variable-length sequence as input and transforms it into a state
with a fixed shape. The second component is a decoder: it maps the encoded state of a fixed shape
to a variable-length sequence. This is called an encoder-decoder architecture, which is depicted in
Fig. 10.6.1.
Fig. 10.6.1: The encoder-decoder architecture.
Letʼs take machine translation from English to French as an example. Given an input sequence
in English: “They”, “are”, “watching”, “.”, this encoder-decoder architecture first encodes the
variable-length input into a state, then decodes the state to generate the translated sequence to-
ken by token as output: “Ils”, “regardent”, “.”. Since the encoder-decoder architecture forms the
basis of different sequence transduction models in subsequent sections, this section will convert
this architecture into an interface that will be implemented later.
10.6.1 Encoder
In the encoder interface, we just specify that the encoder takes variable-length sequences as input
X. The implementation will be provided by any model that inherits this base Encoder class.

from torch import nn
from d2l import torch as d2l

#@save
class Encoder(nn.Module):
    """The base encoder interface for the encoder-decoder architecture."""
    def __init__(self):
        super().__init__()

    # Later there can be additional arguments (e.g., length excluding padding)
    def forward(self, X, *args):
        raise NotImplementedError
10.6.2 Decoder
In the following decoder interface, we add an additional init_state function to convert the en-
coder output (enc_outputs) into the encoded state. Note that this step may need extra inputs such
as the valid length of the input, which was explained in Section 10.5. To generate a variable-length
sequence token by token, every time the decoder may map an input (e.g., the generated token at
the previous time step) and the encoded state into an output token at the current time step.
#@save
class Decoder(nn.Module):
    """The base decoder interface for the encoder-decoder architecture."""
    def __init__(self):
        super().__init__()

    # Later there can be additional arguments (e.g., length excluding padding)
    def init_state(self, enc_outputs, *args):
        raise NotImplementedError

    def forward(self, X, state):
        raise NotImplementedError

10.6.3 Putting the Encoder and Decoder Together
In the end, the encoder-decoder architecture contains both an encoder and a decoder, optionally
with extra arguments. In the forward propagation, the output of the encoder is used to produce
the encoded state, and this state will be further used by the decoder as one of its inputs.
#@save
class EncoderDecoder(d2l.Classifier):
    """The base class for the encoder-decoder architecture."""
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, enc_X, dec_X, *args):
        enc_outputs = self.encoder(enc_X, *args)
        dec_state = self.decoder.init_state(enc_outputs, *args)
        # Return decoder output only
        return self.decoder(dec_X, dec_state)[0]
The term “state” in the encoder-decoder architecture has probably inspired you to implement this
architecture using neural networks with states. In the next section, we will see how to apply RNNs
to design sequence transduction models based on this encoder-decoder architecture.
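To see the interface at work without any neural network, consider the following toy sketch in plain Python. The classes `CountEncoder` and `RepeatDecoder` and the function `generate` are hypothetical and deliberately trivial: the encoder compresses a variable-length input into a fixed-shape state (here just its length), and the decoder emits one token per step from that state until it produces “<eos>”.

```python
class CountEncoder:
    """Toy encoder: maps any sequence to a fixed-shape state (its length)."""
    def __call__(self, tokens):
        return (len(tokens),)

class RepeatDecoder:
    """Toy decoder: emits 'x' once per source token, then '<eos>'."""
    def init_state(self, enc_outputs):
        return enc_outputs[0]  # Number of tokens that remain to be emitted

    def __call__(self, prev_token, state):
        if state == 0:
            return '<eos>', state
        return 'x', state - 1

def generate(encoder, decoder, src_tokens, max_steps=20):
    """Decode token by token until '<eos>' or until max_steps is reached."""
    state = decoder.init_state(encoder(src_tokens))
    token, outputs = '<bos>', []
    for _ in range(max_steps):
        token, state = decoder(token, state)
        if token == '<eos>':
            break
        outputs.append(token)
    return outputs

print(generate(CountEncoder(), RepeatDecoder(),
               ['they', 'are', 'watching', '.']))
```

A learned model replaces both toy classes with neural networks, but the control flow stays the same: encode once, initialize the decoder state, then decode step by step.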

10.6.4 Summary
• The encoder-decoder architecture can handle inputs and outputs that are both variable-
length sequences, thus is suitable for sequence transduction problems such as machine
translation.
• The encoder takes a variable-length sequence as input and transforms it into a state with a
fixed shape.
• The decoder maps the encoded state of a fixed shape to a variable-length sequence.
Exercises
1. Suppose that we use neural networks to implement the encoder-decoder architecture. Do
the encoder and the decoder have to be the same type of neural network?
2. Besides machine translation, can you think of another application where the encoder-
decoder architecture can be applied?
10.7 Sequence to Sequence Learning
As we have seen in Section 10.5, in machine translation both the input and output are
variable-length sequences. To address this type of problem, we have designed a general
encoder-decoder architecture in Section 10.6. In this section, we will use two RNNs to design the
encoder and the decoder of this architecture and apply it to sequence to sequence learning for
machine translation (Cho et al., 2014, Sutskever et al., 2014).
Following the design principle of the encoder-decoder architecture, the RNN encoder can take
a variable-length sequence as input and transform it into a fixed-shape hidden state. In other
words, information of the input (source) sequence is encoded in the hidden state of the RNN
encoder. To generate the output sequence token by token, a separate RNN decoder can predict the
next token based on what tokens have been seen (such as in language modeling) or generated,
together with the encoded information of the input sequence. Fig. 10.7.1 illustrates how to use
two RNNs for sequence to sequence learning in machine translation.
Fig. 10.7.1: Sequence to sequence learning with an RNN encoder and an RNN decoder.
In Fig. 10.7.1, the special “<eos>” token marks the end of the sequence. The model can stop
making predictions once this token is generated. At the initial time step of the RNN decoder, there
are two special design decisions. First, the special beginning-of-sequence “<bos>” token is an input.
Second, the final hidden state of the RNN encoder is used to initialize the hidden state of the
decoder. In designs such as (Sutskever et al., 2014), this is exactly how the encoded input sequence
information is fed into the decoder for generating the output (target) sequence. In some other
designs such as (Cho et al., 2014), the final hidden state of the encoder is also fed into the decoder
as part of the inputs at every time step as shown in Fig. 10.7.1.
10.7.1 Teacher Forcing
While the encoder input is just tokens from the source sequence, the decoder input and output are
not so straightforward in encoder-decoder training. A common approach is teacher forcing, where
the original target sequence (token labels) is fed into the decoder as input. More concretely, the
special beginning-of-sequence token and the original target sequence excluding the final token
are concatenated as input to the decoder, while the decoder output (labels for training) is the
original target sequence, shifted by one token: “<bos>”, “Ils”, “regardent”, “.” → “Ils”, “regardent”,
“.”, “<eos>” (Fig. 10.7.1).
Our implementation in Section 10.5.3 prepared training data for teacher forcing, where shifting
tokens for self-supervised learning is similar to the training of language models in Section 9.3. An
alternative approach is to feed the predicted token from the previous time step as the current input
to the decoder.
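The shifted input and label pair used in teacher forcing can be written down in a few lines of plain Python. The helper name `teacher_forcing_pair` is hypothetical, and the sketch assumes that the target sequence already ends with “<eos>”.

```python
def teacher_forcing_pair(target_tokens, bos='<bos>'):
    """Build the decoder input and the labels from one target sequence."""
    # Input: <bos> followed by the target without its final token
    decoder_input = [bos] + target_tokens[:-1]
    # Labels: the original target sequence, i.e., the input shifted by one
    labels = list(target_tokens)
    return decoder_input, labels

dec_in, labels = teacher_forcing_pair(['Ils', 'regardent', '.', '<eos>'])
print(dec_in)
print(labels)
```

At every position the label is the token that follows the corresponding decoder input, so the decoder always conditions on the ground-truth prefix during training.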
In the following, we will explain the design of Fig. 10.7.1 in greater detail. We will train this model
for machine translation on the English-French dataset as introduced in Section 10.5.
import collections
import math
import torch
from torch import nn
from d2l import torch as d2l
10.7.2 Encoder
Technically speaking, the encoder transforms an input sequence of variable length into a fixed-
shape context variable c, and encodes the input sequence information in this context variable. As
depicted in Fig. 10.7.1, we can use an RNN to design the encoder.
Let's consider a sequence example (batch size: 1). Suppose that the input sequence is
$x_1, \ldots, x_T$, such that $x_t$ is the $t^\mathrm{th}$ token in the input text sequence. At time
step $t$, the RNN transforms the input feature vector $\mathbf{x}_t$ for $x_t$ and the hidden state
$\mathbf{h}_{t-1}$ from the previous time step into the current hidden state $\mathbf{h}_t$. We can
use a function $f$ to express the transformation of the RNN's recurrent layer:

$$\mathbf{h}_t = f(\mathbf{x}_t, \mathbf{h}_{t-1}). \tag{10.7.1}$$

In general, the encoder transforms the hidden states at all the time steps into the context variable
through a customized function $q$:

$$\mathbf{c} = q(\mathbf{h}_1, \ldots, \mathbf{h}_T). \tag{10.7.2}$$

For example, when choosing $q(\mathbf{h}_1, \ldots, \mathbf{h}_T) = \mathbf{h}_T$ such as in
Fig. 10.7.1, the context variable is just the hidden state $\mathbf{h}_T$ of the input sequence at the
final time step.
So far we have used a unidirectional RNN to design the encoder, where a hidden state only de-
pends on the input subsequence at and before the time step of the hidden state. We can also
construct encoders using bidirectional RNNs. In this case, a hidden state depends on the subse-
quence before and after the time step (including the input at the current time step), which encodes
the information of the entire sequence.
Now let's implement the RNN encoder. Note that we use an embedding layer to obtain the feature
vector for each token in the input sequence. The weight of an embedding layer is a matrix whose
number of rows equals the size of the input vocabulary (vocab_size) and whose number of columns
equals the feature vector's dimension (embed_size). For any input token index i, the embedding
layer fetches the ith row (starting from 0) of the weight matrix to return its feature vector. Besides,
here we choose a multilayer GRU to implement the encoder.
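The row lookup performed by an embedding layer can be mimicked in plain Python. In this sketch the weight matrix is random and the sizes are arbitrary assumptions; the helpers `embed` and `one_hot_matmul` are hypothetical. The second helper shows that fetching row i gives the same result as multiplying a one-hot vector by the weight matrix, just without the wasted arithmetic.

```python
import random

random.seed(0)
vocab_size, embed_size = 5, 3  # hypothetical sizes
# One row per token index, one column per feature
weight = [[random.gauss(0, 1) for _ in range(embed_size)]
          for _ in range(vocab_size)]

def embed(token_indices):
    """Fetch the row of the weight matrix for each token index."""
    return [weight[i] for i in token_indices]

def one_hot_matmul(i):
    """Multiply the one-hot vector of index i by the weight matrix."""
    one_hot = [1.0 if r == i else 0.0 for r in range(vocab_size)]
    return [sum(one_hot[r] * weight[r][c] for r in range(vocab_size))
            for c in range(embed_size)]

features = embed([2, 0, 2])
print(len(features), len(features[0]))
```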
def init_seq2seq(module):  #@save
    """Initialize weights for Seq2Seq."""
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)
    if type(module) == nn.GRU:
        for param in module._flat_weights_names:
            if "weight" in param:
                nn.init.xavier_uniform_(module._parameters[param])
class Seq2SeqEncoder(d2l.Encoder):  #@save
    """The RNN encoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = d2l.GRU(embed_size, num_hiddens, num_layers, dropout)
        self.apply(init_seq2seq)

    def forward(self, X, *args):
        # X shape: (batch_size, num_steps)
        embs = self.embedding(X.t().type(torch.int64))
        # embs shape: (num_steps, batch_size, embed_size)
        output, state = self.rnn(embs)
        # output shape: (num_steps, batch_size, num_hiddens)
        # state shape: (num_layers, batch_size, num_hiddens)
        return output, state
The returned variables of recurrent layers have been explained in Section 9.6. Let's still use a
concrete example to illustrate the above encoder implementation. Below we instantiate a two-layer
GRU encoder whose number of hidden units is 16. Given a minibatch of sequence inputs
X (batch size: 4, number of time steps: 9), the hidden states of the last layer at all the time steps
(outputs returned by the encoder's recurrent layers) are a tensor of shape (number of time steps,
batch size, number of hidden units).
vocab_size, embed_size, num_hiddens, num_layers = 10, 8, 16, 2

batch_size, num_steps = 4, 9
encoder = Seq2SeqEncoder(vocab_size, embed_size, num_hiddens, num_layers)


X = torch.zeros((batch_size, num_steps))
outputs, state = encoder(X)
d2l.check_shape(outputs, (num_steps, batch_size, num_hiddens))
Since a GRU is employed here, the shape of the multilayer hidden states at the final time step is
(number of hidden layers, batch size, number of hidden units).
d2l.check_shape(state, (num_layers, batch_size, num_hiddens))
10.7.3 Decoder
As we just mentioned, the context variable c of the encoderʼs output encodes the entire input
sequence x1 , . . . , xT . Given the output sequence y1 , y2 , . . . , yT ′ from the training dataset, for each
time step t′ (the symbol differs from the time step t of input sequences or encoders), the probability
of the decoder output yt′ is conditional on the previous output subsequence y1 , . . . , yt′ −1 and the
context variable c, i.e., P (yt′ | y1 , . . . , yt′ −1 , c).
To model this conditional probability on sequences, we can use another RNN as the decoder. At
any time step t′ on the output sequence, the RNN takes the output yt′ −1 from the previous time step
and the context variable c as its input, then transforms them and the previous hidden state st′ −1
into the hidden state st′ at the current time step. As a result, we can use a function g to express
the transformation of the decoderʼs hidden layer:
$$s_{t'} = g(y_{t'-1}, c, s_{t'-1}). \tag{10.7.3}$$
After obtaining the hidden state of the decoder, we can use an output layer and the softmax oper-
ation to compute the conditional probability distribution P (yt′ | y1 , . . . , yt′ −1 , c) for the output at
time step t′ .
Following Fig. 10.7.1, when implementing the decoder as follows, we directly use the hidden state
at the final time step of the encoder to initialize the hidden state of the decoder. This requires
that the RNN encoder and the RNN decoder have the same number of layers and hidden units. To
further incorporate the encoded input sequence information, the context variable is concatenated
with the decoder input at all the time steps. To predict the probability distribution of the output
token, a fully connected layer is used to transform the hidden state at the final layer of the RNN
decoder.
class Seq2SeqDecoder(d2l.Decoder):
    """The RNN decoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # The input at each step is the embedding concatenated with the context
        self.rnn = d2l.GRU(embed_size+num_hiddens, num_hiddens,
                           num_layers, dropout)
        self.dense = nn.LazyLinear(vocab_size)
        self.apply(init_seq2seq)

    def init_state(self, enc_outputs, *args):
        # enc_outputs is (outputs, state); keep the encoder's final state
        return enc_outputs[1]

    def forward(self, X, enc_state):
        # X shape: (batch_size, num_steps)
        # embs shape: (num_steps, batch_size, embed_size)
        embs = self.embedding(X.t().type(torch.int32))
        # context shape: (batch_size, num_hiddens)
        context = enc_state[-1]
        # Broadcast context to (num_steps, batch_size, num_hiddens)
        context = context.repeat(embs.shape[0], 1, 1)
        # Concat at the feature dimension
        embs_and_context = torch.cat((embs, context), -1)
        outputs, state = self.rnn(embs_and_context, enc_state)
        outputs = self.dense(outputs).swapaxes(0, 1)
        # outputs shape: (batch_size, num_steps, vocab_size)
        # state shape: (num_layers, batch_size, num_hiddens)
        return outputs, state
To illustrate the implemented decoder, below we instantiate it with the same hyperparameters
as the aforementioned encoder. As we can see, the output shape of the decoder becomes (batch
size, number of time steps, vocabulary size), where the last dimension of the tensor stores the
predicted token distribution.
decoder = Seq2SeqDecoder(vocab_size, embed_size, num_hiddens, num_layers)

state = decoder.init_state(encoder(X))
outputs, state = decoder(X, state)
d2l.check_shape(outputs, (batch_size, num_steps, vocab_size))

d2l.check_shape(state, (num_layers, batch_size, num_hiddens))
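The concatenation of the context variable with the decoder input at every time step explains why the decoder's recurrent layer takes inputs of size embed_size + num_hiddens. The sketch below isolates that step using random tensors as stand-ins for the embeddings and the encoder state; the tensor values and names are assumptions for illustration only.

```python
import torch

num_steps, batch_size, embed_size, num_hiddens, num_layers = 9, 4, 8, 16, 2

# Stand-ins for the embedded decoder input and the encoder's final state
embs = torch.randn(num_steps, batch_size, embed_size)
enc_state = torch.randn(num_layers, batch_size, num_hiddens)

context = enc_state[-1]                     # (batch_size, num_hiddens)
context = context.repeat(num_steps, 1, 1)   # (num_steps, batch_size, num_hiddens)
embs_and_context = torch.cat((embs, context), -1)

print(embs_and_context.shape)  # torch.Size([9, 4, 24])
```

Every time step sees the same copy of the top-layer encoder state, so the encoded input sequence stays available to the decoder throughout decoding.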

To summarize, the layers in the above RNN encoder-decoder model are illustrated in Fig. 10.7.2.
Fig. 10.7.2: Layers in an RNN encoder-decoder model.

10.7.4 Encoder-Decoder for Sequence to Sequence Learning
Based on the architecture described in Section 10.6, the RNN encoder-decoder model for sequence
to sequence learning just puts the RNN encoder and the RNN decoder together.
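class Seq2Seq(d2l.EncoderDecoder):  #@save
    """The RNN encoder-decoder for sequence to sequence learning."""
    def __init__(self, encoder, decoder, tgt_pad, lr):
        super().__init__(encoder, decoder)
        # Store tgt_pad and lr as attributes (self.tgt_pad, self.lr)
        self.save_hyperparameters()

    def validation_step(self, batch):
        Y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(Y_hat, batch[-1]), train=False)

    def configure_optimizers(self):
        # Adam optimizer is used here
        return torch.optim.Adam(self.parameters(), lr=self.lr)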
10.7.5 Loss Function with Masking
At each time step, the decoder predicts a probability distribution for the output tokens. Similar
to language modeling, we can apply softmax to obtain the distribution and calculate the
cross-entropy loss for optimization. Recall from Section 10.5 that the special padding tokens are
appended to the end of sequences so sequences of varying lengths can be efficiently loaded in
minibatches of the same shape. However, prediction of padding tokens should be excluded from loss
calculations. To this end, we can mask irrelevant entries with zero values so that multiplication
of any irrelevant prediction with zero equals zero.
@d2l.add_to_class(Seq2Seq)
def loss(self, Y_hat, Y):
    l = super(Seq2Seq, self).loss(Y_hat, Y, averaged=False)
    mask = (Y.reshape(-1) != self.tgt_pad).type(torch.float32)
    return (l * mask).sum() / mask.sum()
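To see the effect of the mask in isolation, here is a minimal standalone sketch. It assumes that the parent class's unaveraged loss is the per-token cross-entropy over flattened predictions, and it uses a hypothetical padding index of 0 and a toy vocabulary.

```python
import torch
from torch.nn import functional as F

pad = 0          # hypothetical index of the padding token
vocab_size = 5
# Two target sequences; the last two tokens of the second one are padding
Y = torch.tensor([[2, 3, 4], [1, pad, pad]])
torch.manual_seed(0)
Y_hat = torch.randn(2, 3, vocab_size)   # unnormalized scores (logits)

# Per-token cross-entropy, shape (batch_size * num_steps,)
l = F.cross_entropy(Y_hat.reshape(-1, vocab_size), Y.reshape(-1),
                    reduction='none')
mask = (Y.reshape(-1) != pad).type(torch.float32)
masked_loss = (l * mask).sum() / mask.sum()

# Perturbing the predictions at padded positions does not change the loss
Y_hat2 = Y_hat.clone()
Y_hat2[1, 1:] += 10 * torch.randn(2, vocab_size)
l2 = F.cross_entropy(Y_hat2.reshape(-1, vocab_size), Y.reshape(-1),
                     reduction='none')
masked_loss2 = (l2 * mask).sum() / mask.sum()
print(torch.allclose(masked_loss, masked_loss2))
```

Dividing by `mask.sum()` rather than by the total number of positions makes the result the mean loss over the non-padding tokens only.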
10.7.6 Training
Now we can create and train an RNN encoder-decoder model for sequence to sequence learning
on the machine translation dataset.
data = d2l.MTFraEng(batch_size=128)
embed_size, num_hiddens, num_layers, dropout = 256, 256, 2, 0.2
encoder = Seq2SeqEncoder(
len(data.src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqDecoder(
len(data.tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
model = Seq2Seq(encoder, decoder, tgt_pad=data.tgt_vocab['<pad>'],
lr=0.001)

10.7.7 Prediction
To predict the output sequence token by token, at each decoder time step the predicted token from
the previous time step is fed into the decoder as an input. Similar to training, at the initial time
step the beginning-of-sequence (“<bos>”) token is fed into the decoder. This prediction process is
illustrated in Fig. 10.7.3. When the end-of-sequence (“<eos>”) token is predicted, the prediction
of the output sequence is complete.
Fig. 10.7.3: Predicting the output sequence token by token using an RNN encoder-decoder.
We will introduce different strategies for sequence generation in Section 10.8.
@d2l.add_to_class(d2l.EncoderDecoder)  #@save
def predict_step(self, batch, device, num_steps,
                 save_attention_weights=False):
    batch = [a.to(device) for a in batch]
    src, tgt, src_valid_len, _ = batch
    enc_outputs = self.encoder(src, src_valid_len)
    dec_state = self.decoder.init_state(enc_outputs, src_valid_len)
    outputs, attention_weights = [tgt[:, (0)].unsqueeze(1), ], []
    for _ in range(num_steps):
        Y, dec_state = self.decoder(outputs[-1], dec_state)
        outputs.append(Y.argmax(2))
        # Save attention weights (to be covered later)
        if save_attention_weights:
            attention_weights.append(self.decoder.attention_weights)
    return torch.cat(outputs[1:], 1), attention_weights

10.7.8 Evaluation of Predicted Sequences
We can evaluate a predicted sequence by comparing it with the label sequence (the ground truth).
BLEU (Bilingual Evaluation Understudy), though originally proposed for evaluating machine
translation results (Papineni et al., 2002), has been extensively used in measuring the quality
of output sequences for different applications. In principle, for any n-gram in the predicted
sequence, BLEU evaluates whether this n-gram appears in the label sequence.
Denote by pn the precision of n-grams, which is the ratio of the number of matched n-grams in the
predicted and label sequences to the number of n-grams in the predicted sequence. To explain,
given a label sequence A, B, C, D, E, F , and a predicted sequence A, B, B, C, D, we have p1 = 4/5,
p2 = 3/4, p3 = 1/3, and p4 = 0. Besides, let lenlabel and lenpred be the numbers of tokens in the
label sequence and the predicted sequence, respectively. Then, BLEU is defined as
$$\exp\left(\min\left(0, 1 - \frac{\mathrm{len}_{\text{label}}}{\mathrm{len}_{\text{pred}}}\right)\right) \prod_{n=1}^k p_n^{1/2^n}, \tag{10.7.4}$$
where k is the length of the longest n-grams used for matching.

Based on the definition of BLEU in (10.7.4), whenever the predicted sequence is the same as the
label sequence, BLEU is 1. Moreover, since matching longer n-grams is more difficult, BLEU assigns
a greater weight to a longer n-gram precision. Specifically, when $p_n$ is fixed, $p_n^{1/2^n}$
increases as n grows (the original paper uses $p_n^{1/n}$). Furthermore, since predicting shorter
sequences tends to obtain a higher $p_n$ value, the coefficient before the multiplication term in
(10.7.4) penalizes shorter predicted sequences. For example, when k = 2, given the label sequence
A, B, C, D, E, F and the predicted sequence A, B, although $p_1 = p_2 = 1$, the penalty factor
exp(1 − 6/2) ≈ 0.14 lowers the BLEU.
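The worked numbers above can be reproduced with a few lines of plain Python. This is a minimal sketch of clipped n-gram precision for a single label sequence; the helper name `ngram_precision` is hypothetical and is not part of the book's code.

```python
import collections
import math

def ngram_precision(pred_tokens, label_tokens, n):
    """Clipped n-gram precision p_n of a prediction against one label.

    Assumes the prediction has at least n tokens.
    """
    label_counts = collections.Counter(
        tuple(label_tokens[i:i + n]) for i in range(len(label_tokens) - n + 1))
    num_pred = len(pred_tokens) - n + 1
    matches = 0
    for i in range(num_pred):
        gram = tuple(pred_tokens[i:i + n])
        if label_counts[gram] > 0:
            matches += 1
            label_counts[gram] -= 1  # each label n-gram is matched at most once
    return matches / num_pred

label = list('ABCDEF')
pred = list('ABBCD')
precisions = [ngram_precision(pred, label, n) for n in range(1, 5)]
print(precisions)  # [0.8, 0.75, 0.3333333333333333, 0.0]

# Brevity penalty for the label A, ..., F and the prediction A, B
penalty = math.exp(min(0, 1 - 6 / 2))
print(round(penalty, 2))  # 0.14
```

Decrementing the label count after each match is what stops the repeated B in the prediction from being counted twice in $p_1$.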
We implement the BLEU measure as follows.
def bleu(pred_seq, label_seq, k):  #@save
    """Compute the BLEU."""
    pred_tokens, label_tokens = pred_seq.split(' '), label_seq.split(' ')
    len_pred, len_label = len(pred_tokens), len(label_tokens)
    score = math.exp(min(0, 1 - len_label / len_pred))
    for n in range(1, min(k, len_pred) + 1):
        num_matches, label_subs = 0, collections.defaultdict(int)
        for i in range(len_label - n + 1):
            label_subs[' '.join(label_tokens[i: i + n])] += 1
        for i in range(len_pred - n + 1):
            if label_subs[' '.join(pred_tokens[i: i + n])] > 0:
                num_matches += 1
                label_subs[' '.join(pred_tokens[i: i + n])] -= 1
        score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))
    return score
In the end, we use the trained RNN encoder-decoder to translate a few English sentences into
French and compute the BLEU of the results.
engs = ['go .', 'i lost .', 'he\'s calm .', 'i\'m home .']
fras = ['va !', 'j\'ai perdu .', 'il est calme .', 'je suis chez moi .']
preds, _ = model.predict_step(
data.build(engs, fras), d2l.try_gpu(), data.num_steps)

for en, fr, p in zip(engs, fras, preds):
    translation = []
    for token in data.tgt_vocab.to_tokens(p):
        if token == '<eos>':
            break
        translation.append(token)
    print(f'{en} => {translation}, bleu,'
          f'{bleu(" ".join(translation), fr, k=2):.3f}')
go . => ['va', '!'], bleu,1.000

i lost . => ["j'ai", 'perdu', '.'], bleu,1.000
he's calm . => ['soyez', 'calme', '!'], bleu,0.000
i'm home . => ['je', 'suis', 'chez', 'moi', 'moi', 'qui', "l'ai", 'fait', '<unk>'], bleu,0.522
10.7.9 Summary
• Following the design of the encoder-decoder architecture, we can use two RNNs to design a
model for sequence to sequence learning.
• In encoder-decoder training, the teacher forcing approach feeds original output sequences
(in contrast to predictions) into the decoder.
• When implementing the encoder and the decoder, we can use multilayer RNNs.
• We can use masks to filter out irrelevant computations, such as when calculating the loss.
• BLEU is a popular measure for evaluating output sequences by matching n-grams between
the predicted sequence and the label sequence.
Exercises
1. Can you adjust the hyperparameters to improve the translation results?

2. Rerun the experiment without using masks in the loss calculation. What results do you ob-
serve? Why?
3. If the encoder and the decoder differ in the number of layers or the number of hidden units,
how can we initialize the hidden state of the decoder?
4. In training, replace teacher forcing with feeding the prediction at the previous time step into
the decoder. How does this influence the performance?
5. Rerun the experiment by replacing GRU with LSTM.
6. Are there any other ways to design the output layer of the decoder?
Discussions

10.8 Beam Search
In Section 10.7, we predicted the output sequence token by token until the special end-of-sequence
“<eos>” token is predicted. In this section, we will begin with formalizing this greedy search strategy
and exploring issues with it, then compare this strategy with other alternatives: exhaustive search
and beam search.
Before a formal introduction to greedy search, letʼs formalize the search problem using the same
mathematical notation from Section 10.7. At any time step t′ , the probability of the decoder out-
put yt′ is conditional on the output subsequence y1 , . . . , yt′ −1 before t′ and the context variable c
that encodes the information of the input sequence. To quantify computational cost, denote by
Y (it contains “<eos>”) the output vocabulary. So the cardinality |Y| of this vocabulary set is the
vocabulary size. Let's also specify the maximum number of tokens of an output sequence as $T'$. As
a result, our goal is to search for an ideal output from all the $O(|\mathcal{Y}|^{T'})$ possible
output sequences.
Of course, for all these output sequences, portions including and after “<eos>” will be discarded
in the actual output.
10.8.1 Greedy Search
First, letʼs take a look at a simple strategy: greedy search. This strategy has been used to predict
sequences in Section 10.7. In greedy search, at any time step t′ of the output sequence, we search
for the token with the highest conditional probability from Y, i.e.,
$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, c), \tag{10.8.1}$$
as output. Once “<eos>” is output or the output sequence has reached its maximum length $T'$,
the output sequence is completed.
So what can go wrong with greedy search? In fact, the optimal sequence should be the output
sequence with the maximum $\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, c)$, which is
the conditional probability of generating an output sequence based on the input sequence.
Unfortunately, there is no guarantee that the optimal sequence will be obtained by greedy search.
Fig. 10.8.1: At each time step, greedy search selects the token with the highest conditional proba-
bility.
Letʼs illustrate it with an example. Suppose that there are four tokens “A”, “B”, “C”, and “<eos>” in
the output dictionary. In Fig. 10.8.1, the four numbers under each time step represent the
conditional probabilities of generating “A”, “B”, “C”, and “<eos>” at that time step, respectively.
At each time step, greedy search selects the token with the highest conditional probability.
Therefore, the output sequence “A”, “B”, “C”, and “<eos>” will be predicted in Fig. 10.8.1. The
conditional probability of this output sequence is 0.5 × 0.4 × 0.4 × 0.6 = 0.048.

Fig. 10.8.2: The four numbers under each time step represent the conditional probabilities of
generating “A”, “B”, “C”, and “<eos>” at that time step. At time step 2, the token “C”, which has the
second highest conditional probability, is selected.
Next, letʼs look at another example in Fig. 10.8.2. Unlike in Fig. 10.8.1, at time step 2 we select
the token “C” in Fig. 10.8.2, which has the second highest conditional probability. Since the output
subsequences at time steps 1 and 2, on which time step 3 is based, have changed from “A” and
“B” in Fig. 10.8.1 to “A” and “C” in Fig. 10.8.2, the conditional probability of each token at time
step 3 has also changed in Fig. 10.8.2. Suppose that we choose the token “B” at time step 3. Now
time step 4 is conditional on the output subsequence at the first three time steps “A”, “C”, and “B”,
which is different from “A”, “B”, and “C” in Fig. 10.8.1. Therefore, the conditional probability of
generating each token at time step 4 in Fig. 10.8.2 is also different from that in Fig. 10.8.1. As a
result, the conditional probability of the output sequence “A”, “C”, “B”, and “<eos>” in Fig. 10.8.2
is 0.5 × 0.3 × 0.6 × 0.6 = 0.054, which is greater than that of greedy search in Fig. 10.8.1. In this
example, the output sequence “A”, “B”, “C”, and “<eos>” obtained by the greedy search is not an
optimal sequence.
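The two sequence probabilities quoted above are simple products of the per-step conditional probabilities. A quick check, using only the numbers given in the text:

```python
import math

# Conditional probabilities of the chosen tokens at time steps 1 to 4
greedy_path = [0.5, 0.4, 0.4, 0.6]       # "A", "B", "C", "<eos>"
alternative_path = [0.5, 0.3, 0.6, 0.6]  # "A", "C", "B", "<eos>"

p_greedy = math.prod(greedy_path)
p_alternative = math.prod(alternative_path)
print(round(p_greedy, 3), round(p_alternative, 3))  # 0.048 0.054
print(p_alternative > p_greedy)  # True
```

Accepting a lower probability at time step 2 (0.3 instead of 0.4) pays off because it leads to a more probable continuation at time step 3 (0.6 instead of 0.4).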
10.8.2 Exhaustive Search
If the goal is to obtain the optimal sequence, we may consider using exhaustive search: exhaustively
enumerate all the possible output sequences with their conditional probabilities, then output the
one with the highest conditional probability.
Although we can use exhaustive search to obtain the optimal sequence, its computational cost
$O(|\mathcal{Y}|^{T'})$ is likely to be excessively high. For example, when $|\mathcal{Y}| = 10000$
and $T' = 10$, we will need to evaluate $10000^{10} = 10^{40}$ sequences. This is next to
impossible! On the other hand, the computational cost of greedy search is $O(|\mathcal{Y}| T')$:
it is usually significantly smaller than that of exhaustive search. For example, when
$|\mathcal{Y}| = 10000$ and $T' = 10$, we only need to evaluate $10000 \times 10 = 10^5$ sequences.
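Because Python integers have arbitrary precision, the two counts in this example can be verified directly:

```python
vocab_size, max_len = 10000, 10

exhaustive = vocab_size ** max_len   # all sequences of max_len tokens
greedy = vocab_size * max_len        # one pass over the vocabulary per step

print(exhaustive == 10 ** 40)  # True
print(greedy == 10 ** 5)       # True
print(exhaustive // greedy == 10 ** 35)  # True
```

The gap of 35 orders of magnitude is why exhaustive search is ruled out for realistic vocabulary sizes.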
10.8.3 Beam Search
Decisions about sequence searching strategies lie on a spectrum, with easy questions at either
extreme. What if only accuracy matters? Obviously, exhaustive search. What if only computa-
tional cost matters? Clearly, greedy search. A real-world application usually asks a complicated
question, somewhere in between those two extremes.
Beam search is an improved version of greedy search. It has a hyperparameter named beam size, k.
At time step 1, we select k tokens with the highest conditional probabilities. Each of them will be
the first token of k candidate output sequences, respectively. At each subsequent time step, based
on the k candidate output sequences at the previous time step, we continue to select k candidate
output sequences with the highest conditional probabilities from k |Y| possible choices.

Fig. 10.8.3: The process of beam search (beam size: 2, maximum length of an output sequence:
3). The candidate output sequences are A, C, AB, CE, ABD, and CED.
Fig. 10.8.3 demonstrates the process of beam search with an example. Suppose that the output
vocabulary contains only five elements: Y = {A, B, C, D, E}, where one of them is “<eos>”. Let
the beam size be 2 and the maximum length of an output sequence be 3. At time step 1, suppose
that the tokens with the highest conditional probabilities P (y1 | c) are A and C. At time step 2, for
all y2 ∈ Y, we compute
$$\begin{aligned} P(A, y_2 \mid c) &= P(A \mid c) P(y_2 \mid A, c), \\ P(C, y_2 \mid c) &= P(C \mid c) P(y_2 \mid C, c), \end{aligned} \tag{10.8.2}$$
and pick the largest two among these ten values, say P (A, B | c) and P (C, E | c). Then at time
step 3, for all y3 ∈ Y, we compute
$$\begin{aligned} P(A, B, y_3 \mid c) &= P(A, B \mid c) P(y_3 \mid A, B, c), \\ P(C, E, y_3 \mid c) &= P(C, E \mid c) P(y_3 \mid C, E, c), \end{aligned} \tag{10.8.3}$$
and pick the largest two among these ten values, say P (A, B, D | c) and P (C, E, D | c). As a result,
we get six candidate output sequences: (i) A; (ii) C; (iii) A, B; (iv) C, E; (v) A, B, D; and (vi) C,
E, D.
In the end, we obtain the set of final candidate output sequences based on these six sequences
(e.g., discard portions including and after “<eos>”). Then we choose the sequence that maximizes
the following score as the output sequence:
$$\frac{1}{L^\alpha} \log P(y_1, \ldots, y_L \mid c) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, c), \tag{10.8.4}$$
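where L is the length of the final candidate sequence and α is usually set to 0.75. Since a longer
sequence has more logarithmic terms in the summation of (10.8.4), each of which is non-positive,
its log-probability tends to be lower. Dividing by $L^\alpha$ normalizes for length so that longer
candidates are not unfairly disadvantaged.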
The computational cost of beam search is $O(k |\mathcal{Y}| T')$. This result is in between that of
greedy search and that of exhaustive search. In fact, greedy search can be treated as a special
type of beam search with a beam size of 1. With a flexible choice of the beam size, beam search
provides a tradeoff between accuracy and computational cost.
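The procedure can be sketched in plain Python on a toy model. This is an illustrative sketch rather than a reference implementation: it uses a common variant in which a candidate is set aside as finished once it ends in “<eos>”, it counts “<eos>” in the length L, and the probability table is hypothetical (only the entries along A, B, C, “<eos>” and A, C, B, “<eos>” reuse the numbers from the greedy search example earlier in this section).

```python
import math

# Hypothetical toy model: next-token probabilities given a prefix
TABLE = {
    (): {'A': 0.5, 'B': 0.1, 'C': 0.3, '<eos>': 0.1},
    ('A',): {'A': 0.1, 'B': 0.4, 'C': 0.3, '<eos>': 0.2},
    ('A', 'B'): {'A': 0.2, 'B': 0.2, 'C': 0.4, '<eos>': 0.2},
    ('A', 'C'): {'A': 0.1, 'B': 0.6, 'C': 0.2, '<eos>': 0.1},
    ('A', 'B', 'C'): {'A': 0.2, 'B': 0.1, 'C': 0.1, '<eos>': 0.6},
    ('A', 'C', 'B'): {'A': 0.2, 'B': 0.1, 'C': 0.1, '<eos>': 0.6},
}
UNIFORM = {'A': 0.25, 'B': 0.25, 'C': 0.25, '<eos>': 0.25}

def step_fn(prefix):
    """Return P(y | prefix) as a dict; unlisted prefixes are uniform."""
    return TABLE.get(prefix, UNIFORM)

def beam_search(step_fn, beam_size, max_len, eos='<eos>', alpha=0.75):
    """Return (sequence, log-probability) with the best normalized score."""
    beams = [((), 0.0)]  # (prefix, sum of log-probabilities)
    finished = []
    for _ in range(max_len):
        # Score every one-token extension: at most beam_size * |Y| candidates
        candidates = [(prefix + (token,), logp + math.log(p))
                      for prefix, logp in beams
                      for token, p in step_fn(prefix).items()]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam_size]:
            if seq[-1] == eos:
                finished.append((seq, logp))
            else:
                beams.append((seq, logp))
        if not beams:
            break
    finished.extend(beams)  # candidates cut off at max_len
    # Length-normalized score: log-probability divided by L ** alpha
    return max(finished, key=lambda c: c[1] / len(c[0]) ** alpha)

greedy_seq, greedy_logp = beam_search(step_fn, beam_size=1, max_len=4)
beam_seq, beam_logp = beam_search(step_fn, beam_size=2, max_len=4)
print(greedy_seq, round(math.exp(greedy_logp), 3))
print(beam_seq, round(math.exp(beam_logp), 3))
```

With a beam size of 1 the function reduces to greedy search and returns A, B, C, “<eos>” with probability 0.048. With a beam size of 2 the prefix A, C survives time step 2, which lets the search reach A, C, B, “<eos>” with probability 0.054.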
10.8.4 Summary
• Sequence searching strategies include greedy search, exhaustive search, and beam search.
• Beam search provides a tradeoff between accuracy and computational cost via its flexible
choice of the beam size.
Exercises
1. Can we treat exhaustive search as a special type of beam search? Why or why not?
2. Apply beam search in the machine translation problem in Section 10.7. How does the beam
size affect the translation results and the prediction speed?
3. We used language modeling for generating text following user-provided prefixes in Section
9.5. Which kind of search strategy does it use? Can you improve it?
Discussions

11 Attention Mechanisms and Transformers
The optic nerve of a primate's visual system receives massive sensory input, far exceeding what
the brain can fully process. Fortunately, not all stimuli are created equal. Focalization and
concentration of consciousness have enabled primates to direct attention to objects of interest,
such as prey and predators, in the complex visual environment. The ability to pay attention to
only a small fraction of the information has evolutionary significance, allowing human beings to
live and succeed.
Scientists have been studying attention in the cognitive neuroscience field since the 19th century.
In this chapter, we will begin by reviewing a popular framework explaining how attention is de-
ployed in a visual scene. Inspired by the attention cues in this framework, we will design models
that leverage such attention cues. Notably, the Nadaraya-Watson kernel regression in 1964 is a
simple demonstration of machine learning with attention mechanisms. Next, we will introduce at-
tention functions that have been extensively used in the design of attention models in deep learn-
ing. Specifically, we will show how to use these functions to design the Bahdanau attention, a
groundbreaking attention model in deep learning that can align bidirectionally and is differen-
tiable.
Equipped with the more recent multi-head attention and self-attention designs, the transformer ar-
chitecture is solely based on attention mechanisms. We will go on to describe its original encoder-
decoder design for machine translation. Then we will show how its encoder can represent images,
leading to the development of vision transformers. When training very large models on very large
datasets (e.g., 300 million images), vision transformers outperform ResNets significantly in image
classification, demonstrating superior scalability of transformers. Thus, transformers have been
extensively used in large-scale pretraining, which can be adapted to perform different tasks with
model update (e.g., fine tuning) or not (e.g., few shot). In the end, we will review how to pretrain
transformers as encoder-only (e.g., BERT), encoder-decoder (e.g., T5), and decoder-only (e.g., GPT
series). Compelling success of large-scale pretraining with transformers in areas as diverse as lan-
guage, vision, speech, and reinforcement learning suggests that better performance benefits from
larger models, more training data, and more training compute.
11.1 Attention Cues
Thank you for your attention to this book. Attention is a scarce resource: at the moment you
are reading this book and ignoring the rest. Thus, similar to money, your attention is being paid
with an opportunity cost. To ensure that your investment of attention right now is worthwhile, we
have been highly motivated to pay our attention carefully to produce a nice book. Attention is the
keystone in the arch of life and holds the key to any workʼs exceptionalism.
Since economics studies the allocation of scarce resources, we are in the era of the attention econ-
omy, where human attention is treated as a limited, valuable, and scarce commodity that can be
exchanged. Numerous business models have been developed to capitalize on it. On music or video
streaming services, we either pay attention to their ads or pay money to hide them. For growth
in the world of online games, we either pay attention to participate in battles, which attract new
gamers, or pay money to instantly become powerful. Nothing comes for free.
All in all, information in our environment is not scarce; attention is. When inspecting a visual
scene, our optic nerve receives information on the order of $10^8$ bits per second, far exceeding
what our brain can fully process. Fortunately, our ancestors had learned from experience (also
known as data) that not all sensory inputs are created equal. Throughout human history, the
capability of directing attention to only a fraction of information of interest has enabled our
brain to allocate resources more smartly to survive, to grow, and to socialize, such as detecting
predators, prey, and mates.
11.1.1 Attention Cues in Biology
To explain how our attention is deployed in the visual world, a two-component framework has
emerged and been pervasive. This idea dates back to William James in the 1890s, who is consid-
ered the “father of American psychology” (James, 2007). In this framework, subjects selectively
direct the spotlight of attention using both the nonvolitional cue and volitional cue.
The nonvolitional cue is based on the saliency and conspicuity of objects in the environment.
Imagine there are five objects in front of you: a newspaper, a research paper, a cup of coffee, a
notebook, and a book such as in Fig. 11.1.1. While all the paper products are printed in black and
white, the coffee cup is red. In other words, this coffee is intrinsically salient and conspicuous
in this visual environment, automatically and involuntarily drawing attention. So you bring the
fovea (the center of the macula where visual acuity is highest) onto the coffee as shown in Fig.
11.1.1.

Fig. 11.1.1: Using the nonvolitional cue based on saliency (red cup, non-paper), attention is invol-
untarily directed to the coffee.
After drinking coffee, you become caffeinated and want to read a book. So you turn your head,
refocus your eyes, and look at the book as depicted in Fig. 11.1.2. Different from the case in Fig.
11.1.1 where the coffee biases you towards selecting based on saliency, in this task-dependent
case you select the book under cognitive and volitional control. Using the volitional cue based on
variable selection criteria, this form of attention is more deliberate. It is also more powerful with
the subjectʼs voluntary effort.
Fig. 11.1.2: Using the volitional cue (want to read a book) that is task-dependent, attention is di-
rected to the book under volitional control.

11.1.2 Queries, Keys, and Values
Inspired by the nonvolitional and volitional attention cues that explain the attentional deploy-
ment, in the following we will describe a framework for designing attention mechanisms by in-
corporating these two attention cues.
To begin with, consider the simpler case where only nonvolitional cues are available. To bias
selection over sensory inputs, we can simply use a parameterized fully connected layer or even
non-parameterized max or average pooling.
Therefore, what sets attention mechanisms apart from those fully connected layers or pooling
layers is the inclusion of the volitional cues. In the context of attention mechanisms, we refer
to volitional cues as queries. Given any query, attention mechanisms bias selection over sensory
inputs (e.g., intermediate feature representations) via attention pooling. These sensory inputs are
called values in the context of attention mechanisms. More generally, every value is paired with a
key, which can be thought of as the nonvolitional cue of that sensory input. As shown in Fig.
11.1.3, we can design attention pooling so that the given query (volitional cue) can interact with
keys (nonvolitional cues), which guides bias selection over values (sensory inputs).
Fig. 11.1.3: Attention mechanisms bias selection over values (sensory inputs) via attention pool-
ing, which incorporates queries (volitional cues) and keys (nonvolitional cues).
Note that there are many alternatives for the design of attention mechanisms. For instance, we
can design a non-differentiable attention model that can be trained using reinforcement learning
methods (Mnih et al., 2014). Given the dominance of the framework in Fig. 11.1.3, models under
this framework will be the center of our attention in this chapter.
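The interaction between a query, keys, and values can be sketched in a few lines. The example below is only an illustration: it assumes a dot product as the scoring function between the query and each key, which is one of several possible choices, and all tensors are random stand-ins.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
num_keys, dim, value_dim = 5, 4, 3
query = torch.randn(dim)                   # volitional cue
keys = torch.randn(num_keys, dim)          # nonvolitional cues, one per value
values = torch.randn(num_keys, value_dim)  # sensory inputs

scores = keys @ query                # assumed scoring function: dot product
weights = F.softmax(scores, dim=0)   # non-negative and sums to one
output = weights @ values            # weighted average of the values

# With uniform weights the same computation is plain average pooling
uniform = torch.full((num_keys,), 1 / num_keys)
print(torch.allclose(uniform @ values, values.mean(dim=0)))
```

The only difference from average pooling is that the weights depend on the query, so a different query selects a different mixture of the same values.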

11.1.3 Visualization of Attention
Average pooling can be treated as a weighted average of inputs, where weights are uniform. In
practice, attention pooling aggregates values using weighted average, where weights are computed
between the given query and different keys.
import torch
To visualize attention weights, we define the show_heatmaps function. Its input matrices has the
shape (number of rows for display, number of columns for display, number of queries, number of
keys).
#@save
def show_heatmaps(matrices, xlabel, ylabel, titles=None, figsize=(2.5, 2.5),
                  cmap='Reds'):
    """Show heatmaps of matrices."""
    num_rows, num_cols = len(matrices), len(matrices[0])
    fig, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize,
                                 sharex=True, sharey=True, squeeze=False)
    for i, (row_axes, row_matrices) in enumerate(zip(axes, matrices)):
        for j, (ax, matrix) in enumerate(zip(row_axes, row_matrices)):
            pcm = ax.imshow(matrix.detach().numpy(), cmap=cmap)
            if i == num_rows - 1:
                ax.set_xlabel(xlabel)
            if j == 0:
                ax.set_ylabel(ylabel)
            if titles:
                ax.set_title(titles[j])
    fig.colorbar(pcm, ax=axes, shrink=0.6);
For demonstration, we consider a simple case where the attention weight is one only when the
query and the key are the same; otherwise it is zero.
attention_weights = torch.eye(10).reshape((1, 1, 10, 10))

show_heatmaps(attention_weights, xlabel='Keys', ylabel='Queries')
In the subsequent sections, we will often invoke this function to visualize attention weights.

11.1.4 Summary
• Human attention is a limited, valuable, and scarce resource.

• Subjects selectively direct attention using both the nonvolitional and volitional cues. The
former is based on saliency and the latter is task-dependent.
• Attention mechanisms are different from fully connected layers or pooling layers due to in-
clusion of the volitional cues.
• Attention mechanisms bias selection over values (sensory inputs) via attention pooling,
which incorporates queries (volitional cues) and keys (nonvolitional cues). Keys and values
are paired.
• We can visualize attention weights between queries and keys.
Exercises
1. What can be the volitional cue when decoding a sequence token by token in machine trans-
lation? What are the nonvolitional cues and the sensory inputs?
2. Randomly generate a 10 × 10 matrix and use the softmax operation to ensure each row is a
valid probability distribution. Visualize the output attention weights.
Discussions
11.2 Attention Pooling
Now you know the major components of attention mechanisms under the framework in Fig.
11.1.3. To recapitulate, the interactions between queries (volitional cues) and keys (nonvolitional
cues) result in attention pooling. The attention pooling selectively aggregates values (sensory in-
puts) to produce the output. In this section, we will describe attention pooling in greater detail
to give you a high-level view of how attention mechanisms work in practice. Specifically, the
Nadaraya-Watson kernel regression model proposed in 1964 is a simple yet complete example for
demonstrating machine learning with attention mechanisms.
import torch
from torch.nn import functional as F
from d2l import torch as d2l

11.2.1 Generating the Dataset
To keep things simple, letʼs consider the following regression problem: given a dataset of input-
output pairs {(x1 , y1 ), . . . , (xn , yn )}, how to learn f to predict the output ŷ = f (x) for any new input
x?
Here we generate an artificial dataset according to the following nonlinear function with the noise
term ϵ:
$$y_i = 2\sin(x_i) + x_i^{0.8} + \epsilon, \tag{11.2.1}$$
where ϵ obeys a normal distribution with zero mean and unit standard deviation (the code below
draws it with torch.randn). Both 50 training examples and 50 validation examples are generated.
To better visualize the pattern of attention later, the training inputs are sorted.
class NonlinearData(d2l.DataModule):
    def __init__(self, n, batch_size):
        # Store n and batch_size as attributes
        self.save_hyperparameters()
        f = lambda x: 2 * torch.sin(x) + x**0.8
        self.x_train, _ = torch.sort(torch.rand(n) * 5)
        self.y_train = f(self.x_train) + torch.randn(n)
        self.x_val = torch.arange(0, 5, 5.0/n)
        self.y_val = f(self.x_val)

    def get_dataloader(self, train):
        arrays = (self.x_train, self.y_train) if train else (self.x_val, self.y_val)
        return self.get_tensorloader(arrays, train)
n = 50
data = NonlinearData(n, batch_size=10)
The following function plots all the training examples (represented by circles), the ground-truth
data generation function f without the noise term (labeled by “Truth”), and the learned prediction
function (labeled by “Pred”).
def plot_kernel_reg(y_hat):
    d2l.plot(data.x_val, [data.y_val, y_hat.detach().numpy()], 'x', 'y',
             legend=['Truth', 'Pred'], xlim=[0, 5], ylim=[-1, 5])
    d2l.plt.plot(data.x_train, data.y_train, 'o', alpha=0.5);
11.2.2 Average Pooling
We begin with perhaps the worldʼs “dumbest” estimator for this regression problem: using average
pooling to average over all the training outputs:
$$f(x) = \frac{1}{n} \sum_{i=1}^n y_i, \tag{11.2.2}$$
which is plotted below. As we can see, this estimator is indeed not so smart.

y_hat = data.y_train.mean().repeat(n)
plot_kernel_reg(y_hat)
11.2.3 Nonparametric Attention Pooling
Obviously, average pooling omits the inputs xi . A better idea was proposed by Nadaraya
(Nadaraya, 1964) and Watson (Watson, 1964) to weigh the outputs yi according to their input loca-
tions:
$$f(x) = \sum_{i=1}^n \frac{K(x - x_i)}{\sum_{j=1}^n K(x - x_j)} y_i, \tag{11.2.3}$$
where K is a kernel. The estimator in (11.2.3) is called Nadaraya-Watson kernel regression. Here we
will not dive into details of kernels. Recall the framework of attention mechanisms in Fig. 11.1.3.
From the perspective of attention, we can rewrite (11.2.3) in a more generalized form of attention
pooling:
$$f(x) = \sum_{i=1}^n \alpha(x, x_i) y_i, \tag{11.2.4}$$
where x is the query and (xi , yi ) is the key-value pair. Comparing (11.2.4) and (11.2.2), the at-
tention pooling here is a weighted average of values yi . The attention weight α(x, xi ) in (11.2.4)
is assigned to the corresponding value yi based on the interaction between the query x and the
key xi modeled by α. For any query, its attention weights over all the key-value pairs are a valid
probability distribution: they are non-negative and sum up to one.
To gain intuitions of attention pooling, just consider a Gaussian kernel defined as
$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right). \tag{11.2.5}$$

Plugging the Gaussian kernel into (11.2.4) and (11.2.3) gives
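$$\begin{aligned}
f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\
&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}(x - x_i)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}(x - x_j)^2\right)} y_i \\
&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}(x - x_i)^2\right) y_i.
\end{aligned} \tag{11.2.6}$$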
In (11.2.6), a key xi that is closer to the given query x will get more attention via a larger attention
weight assigned to the keyʼs corresponding value yi .
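To make (11.2.6) concrete, the following framework-free sketch computes the Gaussian-kernel attention weights for a single scalar query and uses them to pool the values. The helper names `gaussian_attention_weights` and `nw_predict`, as well as the toy keys and values, are illustrative assumptions and are not part of the d2l library.

```python
import math

def gaussian_attention_weights(query, keys):
    """Softmax of -(query - key)^2 / 2 over all keys, as in (11.2.6)."""
    scores = [-0.5 * (query - k) ** 2 for k in keys]
    # Subtract the maximum score for numerical stability; softmax is unchanged
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nw_predict(query, keys, values):
    """Nadaraya-Watson prediction: attention-weighted average of the values."""
    weights = gaussian_attention_weights(query, keys)
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical toy data: three key-value pairs and one query
keys = [0.0, 1.0, 4.0]
values = [10.0, 20.0, 30.0]
weights = gaussian_attention_weights(1.2, keys)
prediction = nw_predict(1.2, keys, values)
```

The key at 1.0 is closest to the query 1.2, so it receives the largest weight and the prediction is pulled toward its value. The weights are non-negative and sum to one, as required of attention weights.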
Notably, Nadaraya-Watson kernel regression is a nonparametric model; thus (11.2.6) is an ex-
ample of nonparametric attention pooling. In the following, we plot the prediction based on this
nonparametric attention model. The predicted line is smooth and closer to the ground-truth than
that produced by average pooling.
def diff(queries, keys):
    return queries.reshape((-1, 1)) - keys.reshape((1, -1))

def attention_pool(query_key_diffs, values):
    attention_weights = F.softmax(- query_key_diffs**2 / 2, dim=1)
    return torch.matmul(attention_weights, values), attention_weights

y_hat, attention_weights = attention_pool(
    diff(data.x_val, data.x_train), data.y_train)
plot_kernel_reg(y_hat)
Now let's take a look at the attention weights. Here validation inputs are queries while training
inputs are keys. Since both inputs are sorted, we can see that the closer the query-key pair is, the
higher the attention weight is in the attention pooling.
d2l.show_heatmaps([[attention_weights]],
xlabel='Sorted training inputs',
ylabel='Sorted validation inputs')

11.2.4 Parametric Attention Pooling
Nonparametric Nadaraya-Watson kernel regression enjoys the consistency benefit: given enough
data this model converges to the optimal solution. Nonetheless, we can easily integrate learnable
parameters into attention pooling.
As an example, slightly different from (11.2.6), in the following the distance between the query x
and the key xi is multiplied by a learnable parameter w:
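$$\begin{aligned}
f(x) &= \sum_{i=1}^n \alpha(x, x_i) y_i \\
&= \sum_{i=1}^n \frac{\exp\left(-\frac{1}{2}((x - x_i)w)^2\right)}{\sum_{j=1}^n \exp\left(-\frac{1}{2}((x - x_j)w)^2\right)} y_i \\
&= \sum_{i=1}^n \mathrm{softmax}\left(-\frac{1}{2}((x - x_i)w)^2\right) y_i.
\end{aligned} \tag{11.2.7}$$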
In the rest of the section, we will train this model by learning the parameter of the attention pool-
ing in (11.2.7).
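Before training, it may help to see what the parameter w in (11.2.7) does to the attention weights. The sketch below is a plain-Python illustration under assumed toy inputs; the function name `parametric_weights` and the chosen values of w are hypothetical and are not results from the experiment in this section.

```python
import math

def parametric_weights(query, keys, w):
    """Softmax of -((query - key) * w)^2 / 2 over all keys, as in (11.2.7)."""
    scores = [-0.5 * ((query - k) * w) ** 2 for k in keys]
    m = max(scores)  # Shift the scores for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical keys on a regular grid and a query in between
keys = [0.0, 0.5, 1.0, 1.5, 2.0]
query = 0.9
broad = parametric_weights(query, keys, w=1.0)    # identical to (11.2.6)
sharp = parametric_weights(query, keys, w=4.0)    # narrower effective kernel
uniform = parametric_weights(query, keys, w=0.0)  # distances are ignored
```

Multiplying the distance by w is equivalent to using a Gaussian kernel of width 1/|w|. A larger |w| therefore concentrates the weights on the nearest keys, while w = 0 recovers the average pooling of (11.2.2).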
Batch Matrix Multiplication
To more efficiently compute attention for minibatches, we can leverage batch matrix multiplica-
tion utilities provided by deep learning frameworks.
Suppose that the first minibatch contains n matrices X1 , . . . , Xn of shape a × b, and the second
minibatch contains n matrices Y1 , . . . , Yn of shape b × c. Their batch matrix multiplication results
in n matrices X1 Y1 , . . . , Xn Yn of shape a × c. Therefore, given two tensors of shape (n, a, b) and (n,
b, c), the shape of their batch matrix multiplication output is (n, a, c).
X = torch.ones((2, 1, 4))
Y = torch.ones((2, 4, 6))
d2l.check_shape(torch.bmm(X, Y), (2, 1, 6))
In the context of attention mechanisms, we can use minibatch matrix multiplication to compute
weighted averages of values in a minibatch.

weights = torch.ones((2, 10)) * 0.1
values = torch.arange(20.0).reshape((2, 10))
torch.bmm(weights.unsqueeze(1), values.unsqueeze(-1))
tensor([[[ 4.5000]],
[[14.5000]]])
Defining the Model
Using minibatch matrix multiplication, below we define the parametric version of Nadaraya-
Watson kernel regression based on the parametric attention pooling in (11.2.7).
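class NWKernelRegression(d2l.Module):
    def __init__(self, keys, values, lr):
        super().__init__()
        # Keep the keys, values, and learning rate as attributes
        self.keys, self.values, self.lr = keys, values, lr
        self.w = torch.ones(1, requires_grad=True)

    def forward(self, queries):
        y_hat, self.attention_weights = attention_pool(
            diff(queries, self.keys) * self.w, self.values)
        return y_hat

    def loss(self, y_hat, y):
        # Squared loss averaged over the minibatch
        l = (y_hat.reshape(-1) - y.reshape(-1)) ** 2 / 2
        return l.mean()

    def configure_optimizers(self):
        return d2l.SGD([self.w], self.lr)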
Training
In the following, we transform the training dataset to keys and values to train the attention model.
In the parametric attention pooling, for simplicity any training input just takes key-value pairs
from all the training examples to predict its output.
model = NWKernelRegression(data.x_train, data.y_train, lr=1)

model.board.display = False
Because the model tries to fit the noisy training dataset, the predicted line is less smooth than
that of its nonparametric counterpart plotted earlier.
plot_kernel_reg(model.forward(data.x_val))

Compared with nonparametric attention pooling, the region with large attention weights
becomes sharper in the parametric setting.
d2l.show_heatmaps([[model.attention_weights]],
xlabel='Sorted training inputs',
ylabel='Sorted validation inputs')
11.2.5 Summary
• Nadaraya-Watson kernel regression is an example of machine learning with attention
mechanisms.
• The attention pooling of Nadaraya-Watson kernel regression is a weighted average of the
training outputs. From the attention perspective, the attention weight is assigned to a value
based on a function of a query and the key that is paired with the value.
• Attention pooling can be either nonparametric or parametric.

Exercises
1. Increase the number of training examples. Can you learn nonparametric Nadaraya-Watson
kernel regression better?
2. What is the value of our learned w in the parametric attention pooling experiment? Why
does it make the weighted region sharper when visualizing the attention weights?
3. How can we add hyperparameters to nonparametric Nadaraya-Watson kernel regression to
predict better?
4. Design another parametric attention pooling for the kernel regression of this section. Train
this new model and visualize its attention weights.
11.3 Attention Scoring Functions
In Section 11.2, we used a Gaussian kernel to model interactions between queries and keys. Treat-
ing the exponent of the Gaussian kernel in (11.2.6) as an attention scoring function (or scoring func-
tion for short), the results of this function were essentially fed into a softmax operation. As a
result, we obtained a probability distribution (attention weights) over values that are paired with
keys. In the end, the output of the attention pooling is simply a weighted sum of the values based
on these attention weights.
At a high level, we can use the above algorithm to instantiate the framework of attention mech-
anisms in Fig. 11.1.3. Denoting an attention scoring function by a, Fig. 11.3.1 illustrates how the
output of attention pooling can be computed as a weighted sum of values. Since attention weights
are a probability distribution, the weighted sum is essentially a weighted average.
Fig. 11.3.1: Computing the output of attention pooling as a weighted average of values.
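Mathematically, suppose that we have a query $\mathbf{q} \in \mathbb{R}^q$ and $m$ key-value pairs
$(\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)$, where any $\mathbf{k}_i \in \mathbb{R}^k$ and any
$\mathbf{v}_i \in \mathbb{R}^v$. The attention pooling $f$ is instantiated as a weighted sum of the values:
$$f(\mathbf{q}, (\mathbf{k}_1, \mathbf{v}_1), \ldots, (\mathbf{k}_m, \mathbf{v}_m)) = \sum_{i=1}^m \alpha(\mathbf{q}, \mathbf{k}_i) \mathbf{v}_i \in \mathbb{R}^v, \tag{11.3.1}$$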
where the attention weight (scalar) for the query q and key ki is computed by the softmax operation
of an attention scoring function a that maps two vectors to a scalar:
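$$\alpha(\mathbf{q}, \mathbf{k}_i) = \mathrm{softmax}(a(\mathbf{q}, \mathbf{k}_i)) = \frac{\exp(a(\mathbf{q}, \mathbf{k}_i))}{\sum_{j=1}^m \exp(a(\mathbf{q}, \mathbf{k}_j))} \in \mathbb{R}. \tag{11.3.2}$$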
As we can see, different choices of the attention scoring function a lead to different behaviors of
attention pooling. In this section, we introduce two popular scoring functions that we will use to
develop more sophisticated attention mechanisms later.
import math
import torch
from torch import nn
from d2l import torch as d2l
11.3.1 Masked Softmax Operation
As we just mentioned, a softmax operation is used to output a probability distribution as attention
weights. In some cases, not all the values should be fed into attention pooling. For instance, for
efficient minibatch processing in Section 10.5, some text sequences are padded with special tokens
that do not carry meaning. To get an attention pooling over only meaningful tokens as values, we
can specify a valid sequence length (in number of tokens) to filter out those beyond this specified
range when computing softmax. In this way, we can implement such a masked softmax operation
in the following masked_softmax function, where any value beyond the valid length is masked as
zero.
#@save
def masked_softmax(X, valid_lens):
    """Perform softmax operation by masking elements on the last axis."""
    # X: 3D tensor, valid_lens: 1D or 2D tensor
    def _sequence_mask(X, valid_len, value=0):
        maxlen = X.size(1)
        mask = torch.arange((maxlen), dtype=torch.float32,
                            device=X.device)[None, :] < valid_len[:, None]
        X[~mask] = value
        return X

    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    else:
        shape = X.shape
        if valid_lens.dim() == 1:
            valid_lens = torch.repeat_interleave(valid_lens, shape[1])
        else:
            valid_lens = valid_lens.reshape(-1)
        # On the last axis, replace masked elements with a very large negative
        # value, whose exponentiation outputs 0
        X = _sequence_mask(X.reshape(-1, shape[-1]), valid_lens, value=-1e6)
        return nn.functional.softmax(X.reshape(shape), dim=-1)

To demonstrate how this function works, consider a minibatch of two 2 × 4 matrix examples,
where the valid lengths for these two examples are two and three, respectively. As a result of the
masked softmax operation, values beyond the valid lengths are all masked as zero.
masked_softmax(torch.rand(2, 2, 4), torch.tensor([2, 3]))
tensor([[[0.4505, 0.5495, 0.0000, 0.0000],
         [0.4318, 0.5682, 0.0000, 0.0000]],
        [[0.3344, 0.3383, 0.3273, 0.0000],
         [0.3098, 0.3535, 0.3366, 0.0000]]])
Similarly, we can also use a two-dimensional tensor to specify valid lengths for every row in each
matrix example.
masked_softmax(torch.rand(2, 2, 4), torch.tensor([[1, 3], [2, 4]]))
tensor([[[1.0000, 0.0000, 0.0000, 0.0000],
         [0.2914, 0.3040, 0.4046, 0.0000]],
        [[0.5124, 0.4876, 0.0000, 0.0000],
         [0.1526, 0.3296, 0.2323, 0.2855]]])
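The masking idea can also be traced by hand on a single row of scores. The following plain-Python sketch is a simplified illustration rather than the d2l implementation: the function name `masked_softmax_row` is hypothetical, and it handles one row and one valid length only. Positions at or beyond the valid length have their scores replaced by a very large negative number before the softmax, so their weights become zero.

```python
import math

def masked_softmax_row(scores, valid_len, mask_value=-1e6):
    """Softmax over one row where positions >= valid_len are masked out."""
    masked = [s if i < valid_len else mask_value
              for i, s in enumerate(scores)]
    m = max(masked)  # Shift the scores for numerical stability
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four tokens, of which only the first two are valid
row = masked_softmax_row([0.3, 0.9, 0.1, 0.7], valid_len=2)
```

Note that the masking is applied to the scores, not to the softmax output: zeroing the weights afterwards would leave the remaining weights summing to less than one.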
11.3.2 Additive Attention
In general, when queries and keys are vectors of different lengths, we can use additive attention
as the scoring function. Given a query $\mathbf{q} \in \mathbb{R}^q$ and a key $\mathbf{k} \in \mathbb{R}^k$, the additive attention
scoring function is
$$a(\mathbf{q}, \mathbf{k}) = \mathbf{w}_v^\top \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k}) \in \mathbb{R}, \tag{11.3.3}$$
where $\mathbf{W}_q \in \mathbb{R}^{h \times q}$, $\mathbf{W}_k \in \mathbb{R}^{h \times k}$, and $\mathbf{w}_v \in \mathbb{R}^h$ are learnable parameters. Equivalent
to (11.3.3), the query and the key are concatenated and fed into an MLP with a single hidden layer
whose number of hidden units is $h$, a hyperparameter. By using tanh as the activation function and
disabling bias terms, we implement additive attention in the following.
#@save
class AdditiveAttention(nn.Module):
    """Additive attention."""
    def __init__(self, num_hiddens, dropout, **kwargs):
        super(AdditiveAttention, self).__init__(**kwargs)
        self.W_k = nn.LazyLinear(num_hiddens, bias=False)
        self.W_q = nn.LazyLinear(num_hiddens, bias=False)
        self.w_v = nn.LazyLinear(1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        queries, keys = self.W_q(queries), self.W_k(keys)
        # After dimension expansion, shape of queries: (batch_size, no. of
        # queries, 1, num_hiddens) and shape of keys: (batch_size, 1, no. of
        # key-value pairs, num_hiddens). Sum them up with broadcasting
        features = queries.unsqueeze(2) + keys.unsqueeze(1)
        features = torch.tanh(features)
        # There is only one output of self.w_v, so we remove the last
        # one-dimensional entry from the shape. Shape of scores: (batch_size,
        # no. of queries, no. of key-value pairs)
        scores = self.w_v(features).squeeze(-1)
        self.attention_weights = masked_softmax(scores, valid_lens)
        # Shape of values: (batch_size, no. of key-value pairs, value
        # dimension)
        return torch.bmm(self.dropout(self.attention_weights), values)
Letʼs demonstrate the above AdditiveAttention class with a toy example, where shapes (batch
size, number of steps or sequence length in tokens, feature size) of queries, keys, and values are
(2, 1, 20), (2, 10, 2), and (2, 10, 4), respectively. The attention pooling output has a shape of (batch
size, number of steps for queries, feature size for values).
queries, keys = torch.normal(0, 1, (2, 1, 20)), torch.ones((2, 10, 2))
# The two value matrices in the values minibatch are identical
values = torch.arange(40, dtype=torch.float32).reshape(1, 10, 4).repeat(
    2, 1, 1)
valid_lens = torch.tensor([2, 6])

attention = AdditiveAttention(num_hiddens=8, dropout=0.1)
attention.eval()
attention(queries, keys, values, valid_lens)
tensor([[[ 2.0000, 3.0000, 4.0000, 5.0000]],
        [[10.0000, 11.0000, 12.0000, 13.0000]]], grad_fn=<BmmBackward0>)
Although additive attention contains learnable parameters, since every key is the same in this
example, the attention weights are uniform, determined by the specified valid lengths.
d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
xlabel='Keys', ylabel='Queries')
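For a single query-key pair, the scoring function (11.3.3) can be evaluated by hand. The sketch below does so in plain Python with small hand-picked parameters; the helper names `matvec` and `additive_score` and all parameter values are assumptions made for illustration, whereas in the AdditiveAttention class these parameters are learned.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def additive_score(q, k, W_q, W_k, w_v):
    """a(q, k) = w_v^T tanh(W_q q + W_k k), following (11.3.3)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_q, q), matvec(W_k, k))]
    return sum(w * h for w, h in zip(w_v, hidden))

# Hypothetical parameters with h = 2 hidden units, query size 3, key size 2
W_q = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
W_k = [[1.0, 0.0],
       [0.0, 1.0]]
w_v = [1.0, 1.0]

# W_q q = [0.5, 0.0] and W_k k = [0.5, 0.0], so the score is tanh(1) + tanh(0)
score = additive_score([0.5, 0.0, 2.0], [0.5, 0.0], W_q, W_k, w_v)
```

Because the query and the key are first projected into a common h-dimensional space, they may have different lengths (here 3 and 2).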

11.3.3 Scaled Dot-Product Attention
A more computationally efficient design for the scoring function is simply the dot product. However,
the dot product operation requires that both the query and the key have the same vector
length, say $d$. Assume that all the elements of the query and the key are independent random
variables with zero mean and unit variance. The dot product of both vectors has zero mean and a
variance of $d$. To ensure that the variance of the dot product still remains one regardless of vector
length, the scaled dot-product attention scoring function
$$a(\mathbf{q}, \mathbf{k}) = \mathbf{q}^\top \mathbf{k} / \sqrt{d} \tag{11.3.4}$$
divides the dot product by $\sqrt{d}$. In practice, we often think in minibatches for efficiency, such as
computing attention for $n$ queries and $m$ key-value pairs, where queries and keys are of length $d$
and values are of length $v$. The scaled dot-product attention of queries $\mathbf{Q} \in \mathbb{R}^{n \times d}$, keys
$\mathbf{K} \in \mathbb{R}^{m \times d}$, and values $\mathbf{V} \in \mathbb{R}^{m \times v}$ is
$$\mathrm{softmax}\left(\frac{\mathbf{Q} \mathbf{K}^\top}{\sqrt{d}}\right) \mathbf{V} \in \mathbb{R}^{n \times v}. \tag{11.3.5}$$
In the following implementation of the scaled dot product attention, we use dropout for model
regularization.
#@save
class DotProductAttention(nn.Module):
    """Scaled dot product attention."""
    def __init__(self, dropout, num_heads=None):
        super().__init__()
        # Dropout applied to the attention weights in forward
        self.dropout = nn.Dropout(dropout)
        self.num_heads = num_heads  # To be covered later

    # Shape of queries: (batch_size, no. of queries, d)
    # Shape of keys: (batch_size, no. of key-value pairs, d)
    # Shape of values: (batch_size, no. of key-value pairs, value dimension)
    # Shape of valid_lens: (batch_size,) or (batch_size, no. of queries)
    def forward(self, queries, keys, values, valid_lens=None,
                window_mask=None):
        d = queries.shape[-1]
        # Swap the last two dimensions of keys with keys.transpose(1, 2)
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        if window_mask is not None:  # To be covered later
            num_windows = window_mask.shape[0]
            n, num_queries, num_kv_pairs = scores.shape
            # Shape of window_mask: (num_windows, no. of queries,
            # no. of key-value pairs)
            scores = scores.reshape(
                (n // (num_windows * self.num_heads), num_windows,
                 self.num_heads, num_queries, num_kv_pairs)
            ) + window_mask.unsqueeze(1).unsqueeze(0)
            scores = scores.reshape((n, num_queries, num_kv_pairs))
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)
To demonstrate the above DotProductAttention class, we use the same keys, values, and valid
lengths from the earlier toy example for additive attention. For the dot product operation, we
make the feature size of queries the same as that of keys.

queries = torch.normal(0, 1, (2, 1, 2))
attention = DotProductAttention(dropout=0.5)
attention.eval()
attention(queries, keys, values, valid_lens)
tensor([[[ 2.0000, 3.0000, 4.0000, 5.0000]],
[[10.0000, 11.0000, 12.0000, 13.0000]]])
Same as in the additive attention demonstration, since keys contains the same element that cannot
be differentiated by any query, uniform attention weights are obtained.
d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
xlabel='Keys', ylabel='Queries')
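The variance argument behind the scaling in (11.3.4) can be checked empirically. For independent elements with zero mean and unit variance, each product of a query element and a key element has zero mean and variance one, so the sum of d such products has variance d. The simulation below is a minimal sketch using only the Python standard library; the vector length, sample count, and random seed are arbitrary choices made for illustration.

```python
import math
import random

def dot(q, k):
    return sum(a * b for a, b in zip(q, k))

def sample_variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

random.seed(0)  # Fixed seed so that the experiment is repeatable
d, num_samples = 16, 5000
raw, scaled = [], []
for _ in range(num_samples):
    # Elements are independent draws with zero mean and unit variance
    q = [random.gauss(0.0, 1.0) for _ in range(d)]
    k = [random.gauss(0.0, 1.0) for _ in range(d)]
    s = dot(q, k)
    raw.append(s)
    scaled.append(s / math.sqrt(d))

var_raw = sample_variance(raw)        # expected to be close to d
var_scaled = sample_variance(scaled)  # expected to be close to 1
```

Without the scaling, the scores grow in magnitude with d, which would push the softmax toward a nearly one-hot distribution for long vectors.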
11.3.4 Summary
• We can compute the output of attention pooling as a weighted average of values, where dif-
ferent choices of the attention scoring function lead to different behaviors of attention pool-
ing.
• When queries and keys are vectors of different lengths, we can use the additive attention
scoring function. When they are the same, the scaled dot-product attention scoring function
is more computationally efficient.
Exercises
1. Modify keys in the toy example and visualize attention weights. Do additive attention and
scaled dot-product attention still output the same attention weights? Why or why not?
2. Using matrix multiplications only, can you design a new scoring function for queries and
keys with different vector lengths?
3. When queries and keys have the same vector length, is vector summation a better design
than dot product for the scoring function? Why or why not?
11.4 Bahdanau Attention
We studied the machine translation problem in Section 10.7, where we designed an encoder-
decoder architecture based on two RNNs for sequence to sequence learning. Specifically, the
RNN encoder transforms a variable-length sequence into a fixed-shape context variable, then the
RNN decoder generates the output (target) sequence token by token based on the generated to-
kens and the context variable. However, even though not all the input (source) tokens are useful
for decoding a certain token, the same context variable that encodes the entire input sequence is
still used at each decoding step.
In a separate but related challenge of handwriting generation for a given text sequence, Graves
designed a differentiable attention model to align text characters with the much longer pen trace,
where the alignment moves only in one direction (Graves, 2013). Inspired by the idea of learning to
align, Bahdanau et al. proposed a differentiable attention model without the severe unidirectional
alignment limitation (Bahdanau et al., 2014). When predicting a token, if not all the input tokens
are relevant, the model aligns (or attends) only to parts of the input sequence that are relevant to
the current prediction. This is achieved by treating the context variable as an output of attention
pooling.
11.4.1 Model
When describing Bahdanau attention for the RNN encoder-decoder below, we will follow the same
notation in Section 10.7. The new attention-based model is the same as that in Section 10.7 except
that the context variable $\mathbf{c}$ in (10.7.3) is replaced by $\mathbf{c}_{t'}$ at any decoding time step $t'$. Suppose that
there are $T$ tokens in the input sequence. The context variable at the decoding time step $t'$ is the
output of attention pooling:
$$\mathbf{c}_{t'} = \sum_{t=1}^T \alpha(\mathbf{s}_{t' - 1}, \mathbf{h}_t) \mathbf{h}_t, \tag{11.4.1}$$
where the decoder hidden state $\mathbf{s}_{t' - 1}$ at time step $t' - 1$ is the query, the encoder hidden states
$\mathbf{h}_t$ are both the keys and values, and the attention weight $\alpha$ is computed as in (11.3.2) using the
additive attention scoring function defined by (11.3.3).
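As a small numerical illustration of (11.4.1), the sketch below computes one context variable from a handful of encoder hidden states in plain Python. All names and numbers are hypothetical. In particular, `toy_additive_score` is a stand-in for the learned scoring function: it corresponds to (11.3.3) with identity projection matrices and an all-ones vector, chosen only to keep the example short.

```python
import math

def softmax(scores):
    m = max(scores)  # Shift the scores for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(query, enc_states, score):
    """Weighted sum of encoder hidden states, following (11.4.1)."""
    weights = softmax([score(query, h) for h in enc_states])
    dim = len(enc_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, enc_states))
               for j in range(dim)]
    return context, weights

def toy_additive_score(s, h):
    # Additive attention with identity W_q, W_k and all-ones w_v (assumption)
    return sum(math.tanh(a + b) for a, b in zip(s, h))

# Hypothetical encoder hidden states for T = 3 source tokens
enc_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
prev_dec_state = [1.0, 0.0]  # Plays the role of the query s_{t'-1}
context, weights = context_vector(prev_dec_state, enc_states,
                                  toy_additive_score)
```

Since the query changes at every decoding time step, the weights and hence the context variable are recomputed for each output token, unlike the single fixed context variable of Section 10.7.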
Slightly different from the vanilla RNN encoder-decoder architecture in Fig. 10.7.2, the same ar-
chitecture with Bahdanau attention is depicted in Fig. 11.4.1.
Fig. 11.4.1: Layers in an RNN encoder-decoder model with Bahdanau attention.
import torch
from torch import nn
from d2l import torch as d2l
11.4.2 Defining the Decoder with Attention
To implement the RNN encoder-decoder with Bahdanau attention, we only need to redefine the
decoder. To visualize the learned attention weights more conveniently, the following
AttentionDecoder class defines the base interface for decoders with attention mechanisms.
#@save
class AttentionDecoder(d2l.Decoder):
    """The base attention-based decoder interface."""
    def __init__(self):
        super().__init__()

    @property
    def attention_weights(self):
        # Subclasses return the attention weights they recorded
        raise NotImplementedError
Now letʼs implement the RNN decoder with Bahdanau attention in the following
Seq2SeqAttentionDecoder class. The state of the decoder is initialized with (i) the encoder
final-layer hidden states at all the time steps (as keys and values of the attention); (ii) the encoder
all-layer hidden state at the final time step (to initialize the hidden state of the decoder); and (iii)
the encoder valid length (to exclude the padding tokens in attention pooling). At each decoding
time step, the decoder final-layer hidden state at the previous time step is used as the query of
the attention. As a result, both the attention output and the input embedding are concatenated
as input of the RNN decoder.
class Seq2SeqAttentionDecoder(AttentionDecoder):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0):
        super().__init__()
        self.attention = d2l.AdditiveAttention(num_hiddens, dropout)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(
            embed_size + num_hiddens, num_hiddens, num_layers,
            dropout=dropout)
        self.dense = nn.LazyLinear(vocab_size)
        self.apply(d2l.init_seq2seq)

    def init_state(self, enc_outputs, enc_valid_lens):
        # Shape of outputs: (num_steps, batch_size, num_hiddens).
        # Shape of hidden_state: (num_layers, batch_size, num_hiddens)
        outputs, hidden_state = enc_outputs
        return (outputs.permute(1, 0, 2), hidden_state, enc_valid_lens)

    def forward(self, X, state):
        # Shape of enc_outputs: (batch_size, num_steps, num_hiddens).
        # Shape of hidden_state: (num_layers, batch_size, num_hiddens)
        enc_outputs, hidden_state, enc_valid_lens = state
        # Shape of the output X: (num_steps, batch_size, embed_size)
        X = self.embedding(X).permute(1, 0, 2)
        outputs, self._attention_weights = [], []
        for x in X:
            # Shape of query: (batch_size, 1, num_hiddens)
            query = torch.unsqueeze(hidden_state[-1], dim=1)
            # Shape of context: (batch_size, 1, num_hiddens)
            context = self.attention(
                query, enc_outputs, enc_outputs, enc_valid_lens)
            # Concatenate on the feature dimension
            x = torch.cat((context, torch.unsqueeze(x, dim=1)), dim=-1)
            # Reshape x as (1, batch_size, embed_size + num_hiddens)
            out, hidden_state = self.rnn(x.permute(1, 0, 2), hidden_state)
            outputs.append(out)
            self._attention_weights.append(self.attention.attention_weights)
        # After fully connected layer transformation, shape of outputs:
        # (num_steps, batch_size, vocab_size)
        outputs = self.dense(torch.cat(outputs, dim=0))
        return outputs.permute(1, 0, 2), [enc_outputs, hidden_state,
                                          enc_valid_lens]

    @property
    def attention_weights(self):
        return self._attention_weights
In the following, we test the implemented decoder with Bahdanau attention using a minibatch of
4 sequence inputs of 7 time steps.
vocab_size, embed_size, num_hiddens, num_layers = 10, 8, 16, 2
# A minibatch of 4 sequences with 7 time steps each
batch_size, num_steps = 4, 7
encoder = d2l.Seq2SeqEncoder(vocab_size, embed_size, num_hiddens, num_layers)
decoder = Seq2SeqAttentionDecoder(vocab_size, embed_size, num_hiddens,
                                  num_layers)
X = torch.zeros((batch_size, num_steps), dtype=torch.long)
state = decoder.init_state(encoder(X), None)
output, state = decoder(X, state)
d2l.check_shape(output, (batch_size, num_steps, vocab_size))
d2l.check_shape(state[0], (batch_size, num_steps, num_hiddens))
d2l.check_shape(state[1][0], (batch_size, num_hiddens))

11.4.3 Training
Similar to Section 10.7.6, here we specify hyperparameters, instantiate an encoder and a decoder
with Bahdanau attention, and train this model for machine translation.
embed_size, num_hiddens, num_layers, dropout = 256, 256, 2, 0.2
encoder = d2l.Seq2SeqEncoder(
len(data.src_vocab), embed_size, num_hiddens, num_layers, dropout)
decoder = Seq2SeqAttentionDecoder(
len(data.tgt_vocab), embed_size, num_hiddens, num_layers, dropout)
model = d2l.Seq2Seq(encoder, decoder, tgt_pad=data.tgt_vocab['<pad>'],
lr=0.005)
After the model is trained, we use it to translate a few English sentences into French and compute
their BLEU scores.
translation = []
break
f'{d2l.bleu(" ".join(translation), fr, k=2):.3f}')
go . => ['va', '!'], bleu,1.000

he's calm . => ['je', 'suis', 'calme', '.'], bleu,0.537
i'm home . => ['je', 'suis', 'chez', 'moi', '.'], bleu,1.000
By visualizing the attention weights when translating the last English sentence, we can see that
each query assigns non-uniform weights over key-value pairs. It shows that at each decoding step,
different parts of the input sequences are selectively aggregated in the attention pooling.
_, dec_attention_weights = model.predict_step(
    data.build([engs[-1]], [fras[-1]]), d2l.try_gpu(), data.num_steps, True)
attention_weights = torch.cat(
    [step[0][0][0] for step in dec_attention_weights], 0).reshape(
    (1, 1, -1, data.num_steps))
# Plus one to include the end-of-sequence token
d2l.show_heatmaps(
    attention_weights[:, :, :, :len(engs[-1].split()) + 1].cpu(),
    xlabel='Key positions', ylabel='Query positions')
11.4.4 Summary
• When predicting a token, if not all the input tokens are relevant, the RNN encoder-decoder
with Bahdanau attention selectively aggregates different parts of the input sequence. This
is achieved by treating the context variable as an output of additive attention pooling.
• In the RNN encoder-decoder, Bahdanau attention treats the decoder hidden state at the pre-
vious time step as the query, and the encoder hidden states at all the time steps as both the
keys and values.
Exercises
1. Replace GRU with LSTM in the experiment.

2. Modify the experiment to replace the additive attention scoring function with the scaled
dot-product. How does it influence the training efficiency?
11.5 Multi-Head Attention
In practice, given the same set of queries, keys, and values we may want our model to combine
knowledge from different behaviors of the same attention mechanism, such as capturing depen-
dencies of various ranges (e.g., shorter-range vs. longer-range) within a sequence. Thus, it may
be beneficial to allow our attention mechanism to jointly use different representation subspaces
of queries, keys, and values.
To this end, instead of performing a single attention pooling, queries, keys, and values can be
transformed with h independently learned linear projections. Then these h projected queries,
keys, and values are fed into attention pooling in parallel. In the end, h attention pooling outputs
are concatenated and transformed with another learned linear projection to produce the final out-
put. This design is called multi-head attention, where each of the h attention pooling outputs is a
head (Vaswani et al., 2017). Using fully connected layers to perform learnable linear transforma-
tions, Fig. 11.5.1 describes multi-head attention.
Fig. 11.5.1: Multi-head attention, where multiple heads are concatenated then linearly trans-
formed.
11.5.1 Model
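Before providing the implementation of multi-head attention, let's formalize this model
mathematically. Given a query $\mathbf{q} \in \mathbb{R}^{d_q}$, a key $\mathbf{k} \in \mathbb{R}^{d_k}$, and a value $\mathbf{v} \in \mathbb{R}^{d_v}$, each
attention head $\mathbf{h}_i$ ($i = 1, \ldots, h$) is computed as
$$\mathbf{h}_i = f(\mathbf{W}_i^{(q)} \mathbf{q}, \mathbf{W}_i^{(k)} \mathbf{k}, \mathbf{W}_i^{(v)} \mathbf{v}) \in \mathbb{R}^{p_v}, \tag{11.5.1}$$
where $\mathbf{W}_i^{(q)} \in \mathbb{R}^{p_q \times d_q}$, $\mathbf{W}_i^{(k)} \in \mathbb{R}^{p_k \times d_k}$, and $\mathbf{W}_i^{(v)} \in \mathbb{R}^{p_v \times d_v}$ are learnable parameters,
and $f$ is attention pooling, such as additive attention and scaled dot-product attention in
Section 11.3. The multi-head attention output is another linear transformation via learnable
parameters $\mathbf{W}_o \in \mathbb{R}^{p_o \times h p_v}$ of the concatenation of $h$ heads:
$$\mathbf{W}_o \begin{bmatrix} \mathbf{h}_1 \\ \vdots \\ \mathbf{h}_h \end{bmatrix} \in \mathbb{R}^{p_o}. \tag{11.5.2}$$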
Based on this design, each head may attend to different parts of the input. More sophisticated
functions than the simple weighted average can be expressed.

import math
import torch
from torch import nn
from d2l import torch as d2l
11.5.2 Implementation
In our implementation, we choose the scaled dot-product attention for each head of the
multi-head attention. To avoid significant growth of computational cost and parameterization cost,
we set $p_q = p_k = p_v = p_o / h$. Note that $h$ heads can be computed in parallel if we set the number of
outputs of linear transformations for the query, key, and value to $p_q h = p_k h = p_v h = p_o$. In the
following implementation, $p_o$ is specified via the argument num_hiddens.
#@save
class MultiHeadAttention(d2l.Module):
    """Multi-head attention."""
    def __init__(self, num_hiddens, num_heads, dropout, bias=False, **kwargs):
        super().__init__()
        self.num_heads = num_heads
        self.attention = d2l.DotProductAttention(dropout, num_heads)
        self.W_q = nn.LazyLinear(num_hiddens, bias=bias)
        self.W_k = nn.LazyLinear(num_hiddens, bias=bias)
        self.W_v = nn.LazyLinear(num_hiddens, bias=bias)
        self.W_o = nn.LazyLinear(num_hiddens, bias=bias)

    def forward(self, queries, keys, values, valid_lens, window_mask=None):
        # Shape of queries, keys, or values:
        # (batch_size, no. of queries or key-value pairs, num_hiddens)
        # Shape of valid_lens: (batch_size,) or (batch_size, no. of queries)
        # After transposing, shape of output queries, keys, or values:
        # (batch_size * num_heads, no. of queries or key-value pairs,
        # num_hiddens / num_heads)
        queries = self.transpose_qkv(self.W_q(queries))
        keys = self.transpose_qkv(self.W_k(keys))
        values = self.transpose_qkv(self.W_v(values))

        if valid_lens is not None:
            # On axis 0, copy the first item (scalar or vector) for num_heads
            # times, then copy the next item, and so on
            valid_lens = torch.repeat_interleave(
                valid_lens, repeats=self.num_heads, dim=0)

        # Shape of output: (batch_size * num_heads, no. of queries,
        # num_hiddens / num_heads)
        output = self.attention(queries, keys, values, valid_lens,
                                window_mask)
        # Shape of output_concat: (batch_size, no. of queries, num_hiddens)
        output_concat = self.transpose_output(output)
        return self.W_o(output_concat)
To allow for parallel computation of multiple heads, the above MultiHeadAttention class uses two
transposition methods as defined below. Specifically, the transpose_output method reverses the
operation of the transpose_qkv method.
@d2l.add_to_class(MultiHeadAttention)  #@save
def transpose_qkv(self, X):
    """Transposition for parallel computation of multiple attention heads."""
    # Shape of input X: (batch_size, no. of queries or key-value pairs,
    # num_hiddens). Shape of output X: (batch_size, no. of queries or
    # key-value pairs, num_heads, num_hiddens / num_heads)
    X = X.reshape(X.shape[0], X.shape[1], self.num_heads, -1)
    # Shape of output X: (batch_size, num_heads, no. of queries or key-value
    # pairs, num_hiddens / num_heads)
    X = X.permute(0, 2, 1, 3)
    # Shape of output: (batch_size * num_heads, no. of queries or key-value
    # pairs, num_hiddens / num_heads)
    return X.reshape(-1, X.shape[2], X.shape[3])

@d2l.add_to_class(MultiHeadAttention)  #@save
def transpose_output(self, X):
    """Reverse the operation of transpose_qkv."""
    X = X.reshape(-1, self.num_heads, X.shape[1], X.shape[2])
    X = X.permute(0, 2, 1, 3)
    return X.reshape(X.shape[0], X.shape[1], -1)
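What the two transposition methods do to the feature dimension can be seen on a tiny nested list. The sketch below mirrors the idea for a single example (no batch axis) in plain Python; the helper names `split_heads` and `merge_heads` are hypothetical and only illustrate the reshaping, not the tensor implementation above.

```python
def split_heads(X, num_heads):
    """Turn a (num_steps, num_hiddens) nested list into
    (num_heads, num_steps, num_hiddens // num_heads)."""
    head_dim = len(X[0]) // num_heads
    # Head h receives the h-th contiguous chunk of every feature vector
    return [[row[h * head_dim:(h + 1) * head_dim] for row in X]
            for h in range(num_heads)]

def merge_heads(heads):
    """Reverse split_heads by concatenating the heads of each time step."""
    num_steps = len(heads[0])
    return [[x for head in heads for x in head[t]] for t in range(num_steps)]

# Hypothetical input: 2 time steps, num_hiddens = 6, split into 3 heads
X = [[0, 1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10, 11]]
heads = split_heads(X, num_heads=3)
restored = merge_heads(heads)
```

Each head sees the full sequence but only its own slice of the features, which is why all heads can be processed as independent batch items by a single call to the attention function.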
Letʼs test our implemented MultiHeadAttention class using a toy example where keys and values
are the same. As a result, the shape of the multi-head attention output is (batch_size, num_queries,
num_hiddens).
num_hiddens, num_heads = 100, 5

attention = MultiHeadAttention(num_hiddens, num_heads, 0.5)
batch_size, num_queries, num_kvpairs, valid_lens = 2, 4, 6, torch.tensor([3, 2])

X = torch.ones((batch_size, num_queries, num_hiddens))
Y = torch.ones((batch_size, num_kvpairs, num_hiddens))
d2l.check_shape(attention(X, Y, Y, valid_lens),
(batch_size, num_queries, num_hiddens))

11.5.3 Summary
• Multi-head attention combines knowledge of the same attention pooling via different rep-
resentation subspaces of queries, keys, and values.
• To compute multiple heads of multi-head attention in parallel, proper tensor manipulation
is needed.
Exercises
1. Visualize attention weights of multiple heads in this experiment.

2. Suppose that we have a trained model based on multi-head attention and we want to prune
least important attention heads to increase the prediction speed. How can we design exper-
iments to measure the importance of an attention head?
11.6 Self-Attention and Positional Encoding
In deep learning, we often use CNNs or RNNs to encode a sequence. Now with attention mech-
anisms, imagine that we feed a sequence of tokens into attention pooling so that the same set of
tokens act as queries, keys, and values. Specifically, each query attends to all the key-value pairs
and generates one attention output. Since the queries, keys, and values come from the same place,
this performs self-attention (Lin et al., 2017, Vaswani et al., 2017), which is also called intra-attention
(Cheng et al., 2016, Parikh et al., 2016, Paulus et al., 2017). In this section, we will discuss sequence
encoding using self-attention, including using additional information for the sequence order.
import math
import torch
from torch import nn
from d2l import torch as d2l
11.6.1 Self-Attention
Given a sequence of input tokens $\mathbf{x}_1, \ldots, \mathbf{x}_n$ where any $\mathbf{x}_i \in \mathbb{R}^d$
($1 \leq i \leq n$), its self-attention outputs a sequence of the same length
$\mathbf{y}_1, \ldots, \mathbf{y}_n$, where

$$\mathbf{y}_i = f(\mathbf{x}_i, (\mathbf{x}_1, \mathbf{x}_1), \ldots, (\mathbf{x}_n, \mathbf{x}_n)) \in \mathbb{R}^d \tag{11.6.1}$$

according to the definition of attention pooling $f$ in (11.3.1). Using multi-head attention, the
following code snippet computes the self-attention of a tensor with shape (batch size, number of
time steps or sequence length in tokens, $d$). The output tensor has the same shape.
num_hiddens, num_heads = 100, 5

attention = d2l.MultiHeadAttention(num_hiddens, num_heads, 0.5)
batch_size, num_queries, valid_lens = 2, 4, torch.tensor([3, 2])

X = torch.ones((batch_size, num_queries, num_hiddens))
d2l.check_shape(attention(X, X, X, valid_lens),
                (batch_size, num_queries, num_hiddens))
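
To see what (11.6.1) computes for a single sequence, here is a dependency-free sketch. It assumes that the attention pooling f is scaled dot-product attention and, for brevity, leaves out the learned projections and the multiple heads used by MultiHeadAttention. The function names are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(xs):
    """Each token acts as query, key, and value (no learned projections)."""
    d = len(xs[0])
    outputs = []
    for query in xs:
        # Scaled dot-product score between this query and every key
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in xs]
        weights = softmax(scores)
        # Weighted average of the values
        outputs.append([sum(w * value[c] for w, value in zip(weights, xs))
                        for c in range(d)])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(tokens)
```

The output has one d-dimensional vector per input token, so the sequence length and feature dimension are preserved, in line with the shape check above.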
11.6.2 Comparing CNNs, RNNs, and Self-Attention
Letʼs compare architectures for mapping a sequence of n tokens to another sequence of equal
length, where each input or output token is represented by a d-dimensional vector. Specifically,
we will consider CNNs, RNNs, and self-attention. We will compare their computational complex-
ity, sequential operations, and maximum path lengths. Note that sequential operations prevent
parallel computation, while a shorter path between any combination of sequence positions makes
it easier to learn long-range dependencies within the sequence (Hochreiter et al., 2001).
Fig. 11.6.1: Comparing CNN (padding tokens are omitted), RNN, and self-attention architectures.
Consider a convolutional layer whose kernel size is $k$. We will provide more details about sequence
processing using CNNs in later chapters. For now, we only need to know that since the sequence
length is $n$ and the numbers of input and output channels are both $d$, the computational
complexity of the convolutional layer is $O(knd^2)$. As Fig. 11.6.1 shows, CNNs are hierarchical,
so there are $O(1)$ sequential operations and the maximum path length is $O(n/k)$. For example,
$\mathbf{x}_1$ and $\mathbf{x}_5$ are within the receptive field of a two-layer CNN with kernel size 3
in Fig. 11.6.1.
When updating the hidden state of RNNs, multiplication of the $d \times d$ weight matrix and the
$d$-dimensional hidden state has a computational complexity of $O(d^2)$. Since the sequence length
is $n$, the computational complexity of the recurrent layer is $O(nd^2)$. According to Fig. 11.6.1,
there are $O(n)$ sequential operations that cannot be parallelized and the maximum path length is
also $O(n)$.
In self-attention, the queries, keys, and values are all $n \times d$ matrices. Consider the scaled
dot-product attention in (11.3.5), where an $n \times d$ matrix is multiplied by a $d \times n$ matrix,
then the output $n \times n$ matrix is multiplied by an $n \times d$ matrix. As a result, self-attention
has an $O(n^2 d)$ computational complexity. As we can see in Fig. 11.6.1, each token is directly
connected to any other token via self-attention. Therefore, computation can be parallel with $O(1)$
sequential operations and the maximum path length is also $O(1)$.
All in all, both CNNs and self-attention enjoy parallel computation and self-attention has the short-
est maximum path length. However, the quadratic computational complexity with respect to the
sequence length makes self-attention prohibitively slow for very long sequences.
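
The comparison above can be made tangible by plugging numbers into the asymptotic expressions. The sketch below drops all constant factors, so the values are orders of magnitude rather than measured costs; the function name layer_costs and the chosen sizes are assumptions made for illustration.

```python
def layer_costs(n, d, k):
    """Order-of-magnitude costs (constants dropped) for one layer of each type.

    n: sequence length, d: feature dimension, k: convolution kernel size.
    """
    return {
        'cnn': {'compute': k * n * d ** 2, 'sequential_ops': 1,
                'max_path': n / k},
        'rnn': {'compute': n * d ** 2, 'sequential_ops': n, 'max_path': n},
        'self_attention': {'compute': n ** 2 * d, 'sequential_ops': 1,
                           'max_path': 1},
    }


short = layer_costs(n=10, d=64, k=3)
long = layer_costs(n=1000, d=64, k=3)
for name in ('cnn', 'rnn', 'self_attention'):
    print(name, short[name]['compute'], long[name]['compute'])
```

With these expressions, self-attention needs less computation than a recurrent layer as long as n < d, and more once n > d, which is why very long sequences are the problematic case.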
11.6.3 Positional Encoding
Unlike RNNs that recurrently process tokens of a sequence one by one, self-attention ditches se-
quential operations in favor of parallel computation. To use the sequence order information, we
can inject absolute or relative positional information by adding positional encoding to the input rep-
resentations. Positional encodings can be either learned or fixed. In the following, we describe a
fixed positional encoding based on sine and cosine functions (Vaswani et al., 2017).
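Suppose that the input representation $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains the
$d$-dimensional embeddings for $n$ tokens of a sequence. The positional encoding outputs
$\mathbf{X} + \mathbf{P}$ using a positional embedding matrix $\mathbf{P} \in \mathbb{R}^{n \times d}$
of the same shape, whose element on the $i$th row and the $(2j)$th or the $(2j + 1)$th column is

$$\begin{aligned}
p_{i, 2j} &= \sin\left(\frac{i}{10000^{2j/d}}\right),\\
p_{i, 2j+1} &= \cos\left(\frac{i}{10000^{2j/d}}\right).
\end{aligned} \tag{11.6.2}$$
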
At first glance, this trigonometric-function design looks weird. Before explanations of this design,
letʼs first implement it in the following PositionalEncoding class.
#@save
class PositionalEncoding(nn.Module):
    """Positional encoding."""
    def __init__(self, num_hiddens, dropout, max_len=1000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # Create a long enough P
        self.P = torch.zeros((1, max_len, num_hiddens))
        X = torch.arange(max_len, dtype=torch.float32).reshape(
            -1, 1) / torch.pow(10000, torch.arange(
            0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
        self.P[:, :, 0::2] = torch.sin(X)
        self.P[:, :, 1::2] = torch.cos(X)

    def forward(self, X):
        # Add the encodings of the first X.shape[1] positions to the input
        X = X + self.P[:, :X.shape[1], :].to(X.device)
        return self.dropout(X)

In the positional embedding matrix P, rows correspond to positions within a sequence and
columns represent different positional encoding dimensions. In the example below, we can see
that the 6th and the 7th columns of the positional embedding matrix have a higher frequency than
the 8th and the 9th columns. The offset between the 6th and the 7th (same for the 8th and the 9th)
columns is due to the alternation of sine and cosine functions.
encoding_dim, num_steps = 32, 60

pos_encoding = PositionalEncoding(encoding_dim, 0)
X = pos_encoding(torch.zeros((1, num_steps, encoding_dim)))
P = pos_encoding.P[:, :X.shape[1], :]
d2l.plot(torch.arange(num_steps), P[0, :, 6:10].T, xlabel='Row (position)',
         figsize=(6, 2.5), legend=["Col %d" % d for d in torch.arange(6, 10)])
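
The frequencies of the plotted columns follow directly from (11.6.2): columns 2j and 2j + 1 share the angular frequency 1/10000^(2j/d). As a quick check, the following sketch evaluates this expression for columns 6 to 9 with d = 32; the helper name angular_frequency is hypothetical.

```python
def angular_frequency(column, d):
    """Angular frequency (radians per position) of one encoding column."""
    j = column // 2  # columns 2j (sine) and 2j + 1 (cosine) share a frequency
    return 1.0 / (10000 ** (2 * j / d))


freqs = {c: angular_frequency(c, d=32) for c in range(6, 10)}
for column, freq in freqs.items():
    print(f'column {column}: {freq:.4f} rad per position')
```

Columns 6 and 7 share one frequency, columns 8 and 9 share a lower one, and within each pair the sine and cosine differ only by a phase shift.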
Absolute Positional Information
To see how the monotonically decreased frequency along the encoding dimension relates to abso-
lute positional information, letʼs print out the binary representations of 0, 1, . . . , 7. As we can see,
the lowest bit, the second-lowest bit, and the third-lowest bit alternate on every number, every
two numbers, and every four numbers, respectively.
for i in range(8):
    print(f'{i} in binary is {i:>03b}')
0 in binary is 000
1 in binary is 001
2 in binary is 010
3 in binary is 011
4 in binary is 100
5 in binary is 101
6 in binary is 110
7 in binary is 111
In binary representations, a higher bit has a lower frequency than a lower bit. Similarly, as demon-
strated in the heat map below, the positional encoding decreases frequencies along the encoding
dimension by using trigonometric functions. Since the outputs are float numbers, such continu-
ous representations are more space-efficient than binary representations.

P = P[0, :, :].unsqueeze(0).unsqueeze(0)
d2l.show_heatmaps(P, xlabel='Column (encoding dimension)',
                  ylabel='Row (position)', figsize=(3.5, 4), cmap='Blues')
Relative Positional Information
Besides capturing absolute positional information, the above positional encoding also allows a
model to easily learn to attend by relative positions. This is because for any fixed position offset
δ, the positional encoding at position i + δ can be represented by a linear projection of that at
position i.
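This projection can be explained mathematically. Denoting $\omega_j = 1/10000^{2j/d}$, any pair of
$(p_{i, 2j}, p_{i, 2j+1})$ in (11.6.2) can be linearly projected to
$(p_{i+\delta, 2j}, p_{i+\delta, 2j+1})$ for any fixed offset $\delta$:

$$\begin{aligned}
&\begin{bmatrix} \cos(\delta \omega_j) & \sin(\delta \omega_j) \\ -\sin(\delta \omega_j) & \cos(\delta \omega_j) \end{bmatrix}
\begin{bmatrix} p_{i, 2j} \\ p_{i, 2j+1} \end{bmatrix}\\
=&\begin{bmatrix} \cos(\delta \omega_j) \sin(i \omega_j) + \sin(\delta \omega_j) \cos(i \omega_j) \\ -\sin(\delta \omega_j) \sin(i \omega_j) + \cos(\delta \omega_j) \cos(i \omega_j) \end{bmatrix}\\
=&\begin{bmatrix} \sin\left((i+\delta) \omega_j\right) \\ \cos\left((i+\delta) \omega_j\right) \end{bmatrix}\\
=&\begin{bmatrix} p_{i+\delta, 2j} \\ p_{i+\delta, 2j+1} \end{bmatrix},
\end{aligned} \tag{11.6.3}$$

where the $2 \times 2$ projection matrix does not depend on any position index $i$.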
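
The identity in (11.6.3) is easy to confirm numerically. The sketch below uses only the standard library; the function names and the particular values of d, j, and the offset are arbitrary choices made for illustration.

```python
import math


def pos_pair(i, j, d):
    """Return (p_{i,2j}, p_{i,2j+1}) as defined in (11.6.2)."""
    omega = 1.0 / (10000 ** (2 * j / d))
    return math.sin(i * omega), math.cos(i * omega)


def shift_pair(pair, delta, j, d):
    """Apply the 2 x 2 matrix of (11.6.3); it depends on delta but not on i."""
    omega = 1.0 / (10000 ** (2 * j / d))
    c, s = math.cos(delta * omega), math.sin(delta * omega)
    p_sin, p_cos = pair
    return c * p_sin + s * p_cos, -s * p_sin + c * p_cos


d, j, delta = 32, 3, 5
max_error = max(
    abs(a - b)
    for i in range(60)
    for a, b in zip(shift_pair(pos_pair(i, j, d), delta, j, d),
                    pos_pair(i + delta, j, d)))
print(f'largest deviation over 60 positions: {max_error:.2e}')
```

The deviation stays at the level of floating-point rounding for every position i, so the same matrix maps the encoding at position i to the one at position i + δ.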

11.6.4 Summary
• In self-attention, the queries, keys, and values all come from the same place.
• Both CNNs and self-attention enjoy parallel computation and self-attention has the shortest
maximum path length. However, the quadratic computational complexity with respect to
the sequence length makes self-attention prohibitively slow for very long sequences.
• To use the sequence order information, we can inject absolute or relative positional infor-
mation by adding positional encoding to the input representations.
Exercises
1. Suppose that we design a deep architecture to represent a sequence by stacking self-attention
layers with positional encoding. What could be issues?
2. Can you design a learnable positional encoding method?
3. Can we assign different learned embeddings according to different offsets between queries
and keys that are compared in self-attention? Hint: you may refer to relative position em-
beddings (Huang et al., 2018, Shaw et al., 2018).
Discussions
11.7 The Transformer Architecture
We have compared CNNs, RNNs, and self-attention in Section 11.6.2. Notably, self-attention enjoys
both parallel computation and the shortest maximum path length. It is therefore natural and
appealing to design deep architectures by using self-attention. Unlike earlier self-attention
models that still rely on RNNs for input representations (Cheng et al., 2016, Lin et al., 2017,
Paulus et al., 2017), the transformer model is solely based on attention mechanisms without any
convolutional or recurrent layer (Vaswani et al., 2017). Though originally proposed for
sequence-to-sequence learning on text data, transformers have become pervasive in a wide range of
modern deep learning applications, such as in areas of language, vision, speech, and
reinforcement learning.
11.7.1 Model
As an instance of the encoder-decoder architecture, the overall architecture of the transformer
is presented in Fig. 11.7.1. As we can see, the transformer is composed of an encoder and a
decoder. Different from Bahdanau attention for sequence-to-sequence learning in Fig. 11.4.1, the
input (source) and output (target) sequence embeddings are summed with positional encodings
before being fed into the encoder and the decoder, which stack modules based on self-attention.
Fig. 11.7.1: The transformer architecture.
Now we provide an overview of the transformer architecture in Fig. 11.7.1. On a high level, the
transformer encoder is a stack of multiple identical layers, where each layer has two sublayers
(either is denoted as sublayer). The first is a multi-head self-attention pooling and the second is
a positionwise feed-forward network. Specifically, in the encoder self-attention, queries, keys,
and values are all from the outputs of the previous encoder layer. Inspired by the ResNet design
in Section 8.6, a residual connection is employed around both sublayers. In the transformer, for
any input $\mathbf{x} \in \mathbb{R}^d$ at any position of the sequence, we require that
$\mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ so that the residual connection
$\mathbf{x} + \mathrm{sublayer}(\mathbf{x}) \in \mathbb{R}^d$ is feasible. This addition from the
residual connection is immediately followed by layer normalization (Ba et al., 2016). As a result,
the transformer encoder outputs a $d$-dimensional vector representation for each position of the
input sequence.
The transformer decoder is also a stack of multiple identical layers with residual connections and
layer normalizations. Besides the two sublayers described in the encoder, the decoder inserts
a third sublayer, known as the encoder-decoder attention, between these two. In the encoder-
decoder attention, queries are from the outputs of the previous decoder layer, and the keys and
values are from the transformer encoder outputs. In the decoder self-attention, queries, keys,
and values are all from the outputs of the previous decoder layer. However, each position in the
decoder is allowed to only attend to all positions in the decoder up to that position. This masked
attention preserves the auto-regressive property, ensuring that the prediction only depends on
those output tokens that have been generated.
We have already described and implemented multi-head attention based on scaled dot-products
in Section 11.5 and positional encoding in Section 11.6.3. In the following, we will implement the
rest of the transformer model.
import math
import pandas as pd
import torch
from torch import nn
from d2l import torch as d2l
11.7.2 Positionwise Feed-Forward Networks
The positionwise feed-forward network transforms the representation at all the sequence posi-
tions using the same MLP. This is why we call it positionwise. In the implementation below, the
input X with shape (batch size, number of time steps or sequence length in tokens, number of
hidden units or feature dimension) will be transformed by a two-layer MLP into an output tensor
of shape (batch size, number of time steps, ffn_num_outputs).
#@save
class PositionWiseFFN(nn.Module):
    """Positionwise feed-forward network."""
    def __init__(self, ffn_num_hiddens, ffn_num_outputs):
        super().__init__()
        self.dense1 = nn.LazyLinear(ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.LazyLinear(ffn_num_outputs)

    def forward(self, X):
        # The same two-layer MLP is applied along the last dimension
        return self.dense2(self.relu(self.dense1(X)))
The following example shows that the innermost dimension of a tensor changes to the number
of outputs in the positionwise feed-forward network. Since the same MLP transforms at all the
positions, when the inputs at all these positions are the same, their outputs are also identical.
ffn = PositionWiseFFN(4, 8)
ffn.eval()
ffn(torch.ones((2, 3, 4)))[0]


tensor([[-0.2193, 0.1403, -0.3883, -0.7269, -0.2194, -0.7240, 0.5235, -0.2748],
[-0.2193, 0.1403, -0.3883, -0.7269, -0.2194, -0.7240, 0.5235, -0.2748],
[-0.2193, 0.1403, -0.3883, -0.7269, -0.2194, -0.7240, 0.5235, -0.2748]],
grad_fn=<SelectBackward0>)
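
The word "positionwise" can also be illustrated without PyTorch. The sketch below applies one small two-layer MLP to every token of a sequence separately; the weights are made-up numbers chosen for readability, not values from a trained model, and the helper names are hypothetical.

```python
def linear(v, W, b):
    """Affine map W v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def positionwise_ffn(seq, W1, b1, W2, b2):
    """Apply the same MLP to each position; positions never interact."""
    return [linear(relu(linear(tok, W1, b1)), W2, b2) for tok in seq]


# Illustrative weights: 2 inputs -> 2 hidden units -> 3 outputs
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
W2, b2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.1, 0.2, 0.3]

seq = [[1.0, 2.0]] * 3  # three identical tokens
out = positionwise_ffn(seq, W1, b1, W2, b2)
```

Only the innermost dimension changes (from 2 to 3 here), and identical inputs at different positions yield identical outputs, mirroring the tensor printed above.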
11.7.3 Residual Connection and Layer Normalization
Now letʼs focus on the “add & norm” component in Fig. 11.7.1. As we described at the beginning
of this section, this is a residual connection immediately followed by layer normalization. Both
are key to effective deep architectures.
In Section 8.5, we explained how batch normalization recenters and rescales across the exam-
ples within a minibatch. Layer normalization is the same as batch normalization except that the
former normalizes across the feature dimension. Despite its pervasive applications in computer
vision, batch normalization is usually empirically less effective than layer normalization in natu-
ral language processing tasks, whose inputs are often variable-length sequences.
The following code snippet compares the normalization across different dimensions by layer nor-
malization and batch normalization.
ln = nn.LayerNorm(2)
bn = nn.LazyBatchNorm1d()
X = torch.tensor([[1, 2], [2, 3]], dtype=torch.float32)
# Compute mean and variance from X in the training mode
print('layer norm:', ln(X), '\nbatch norm:', bn(X))
layer norm: tensor([[-1.0000, 1.0000],
        [-1.0000, 1.0000]], grad_fn=<NativeLayerNormBackward0>)
batch norm: tensor([[-1.0000, -1.0000],
        [ 1.0000, 1.0000]], grad_fn=<NativeBatchNormBackward0>)
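
The two results differ only in the axis along which the mean and variance are computed. The following plain-Python sketch reproduces them by hand. It assumes the default epsilon of 1e-5 and omits the learnable scale and shift, which are initialized to 1 and 0 and therefore have no effect here.

```python
def normalize(values, eps=1e-5):
    """Subtract the mean and divide by the (biased) standard deviation."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


X = [[1.0, 2.0], [2.0, 3.0]]

# Layer normalization: statistics over the features of each example (rows)
layer_norm = [normalize(row) for row in X]

# Batch normalization: statistics over the examples of each feature (columns)
normalized_columns = [normalize(list(col)) for col in zip(*X)]
batch_norm = [list(row) for row in zip(*normalized_columns)]

print('layer norm:', layer_norm)
print('batch norm:', batch_norm)
```

Up to the small effect of epsilon, each row of layer_norm is [-1, 1], while batch_norm has a first row of -1s and a second row of 1s, matching the printed tensors.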
Now we can implement the AddNorm class using a residual connection followed by layer normal-
ization. Dropout is also applied for regularization.
#@save
class AddNorm(nn.Module):
    """Residual connection followed by layer normalization."""
    def __init__(self, norm_shape, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(norm_shape)

    def forward(self, X, Y):
        # X is the sublayer input and Y is the sublayer output
        return self.ln(self.dropout(Y) + X)
The residual connection requires that the two inputs are of the same shape so that the output
tensor also has the same shape after the addition operation.
add_norm = AddNorm(4, 0.5)

d2l.check_shape(add_norm(torch.ones((2, 3, 4)), torch.ones((2, 3, 4))), (2, 3, 4))

11.7.4 Encoder
With all the essential components to assemble the transformer encoder, letʼs start by implement-
ing a single layer within the encoder. The following TransformerEncoderBlock class contains two
sublayers: multi-head self-attention and positionwise feed-forward networks, where a residual
connection followed by layer normalization is employed around both sublayers.
#@save
class TransformerEncoderBlock(nn.Module):
    """Transformer encoder block."""
    def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout,
                 use_bias=False):
        super().__init__()
        self.attention = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                dropout, use_bias)
        self.addnorm1 = AddNorm(num_hiddens, dropout)
        self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens)
        self.addnorm2 = AddNorm(num_hiddens, dropout)

    def forward(self, X, valid_lens):
        # Sublayer 1: multi-head self-attention; sublayer 2: positionwise FFN
        Y = self.addnorm1(X, self.attention(X, X, X, valid_lens))
        return self.addnorm2(Y, self.ffn(Y))
As we can see, no layer in the transformer encoder changes the shape of its input.
X = torch.ones((2, 100, 24))

valid_lens = torch.tensor([3, 2])
encoder_blk = TransformerEncoderBlock(24, 48, 8, 0.5)
encoder_blk.eval()
d2l.check_shape(encoder_blk(X, valid_lens), X.shape)
In the following transformer encoder implementation, we stack num_blks instances of the above
TransformerEncoderBlock class. Since we use the fixed positional encoding whose values are
always between -1 and 1, we multiply values of the learnable input embeddings by the square
root of the embedding dimension to rescale before summing up the input embedding and the
positional encoding.
#@save
class TransformerEncoder(d2l.Encoder):
    """Transformer encoder."""
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens,
                 num_heads, num_blks, dropout, use_bias=False):
        super().__init__()
        self.num_hiddens = num_hiddens
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_blks):
            self.blks.add_module("block"+str(i), TransformerEncoderBlock(
                num_hiddens, ffn_num_hiddens, num_heads, dropout, use_bias))

    def forward(self, X, valid_lens):
        # Since positional encoding values are between -1 and 1, the embedding
        # values are multiplied by the square root of the embedding dimension
        # to rescale before they are summed up
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self.attention_weights = [None] * len(self.blks)
        for i, blk in enumerate(self.blks):
            X = blk(X, valid_lens)
            self.attention_weights[
                i] = blk.attention.attention.attention_weights
        return X
Below we specify hyperparameters to create a two-layer transformer encoder. The shape of the
transformer encoder output is (batch size, number of time steps, num_hiddens).
encoder = TransformerEncoder(200, 24, 48, 8, 2, 0.5)

d2l.check_shape(encoder(torch.ones((2, 100), dtype=torch.long), valid_lens),
                (2, 100, 24))
11.7.5 Decoder
As shown in Fig. 11.7.1, the transformer decoder is composed of multiple identical layers. Each
layer is implemented in the following TransformerDecoderBlock class, which contains three sub-
layers: decoder self-attention, encoder-decoder attention, and positionwise feed-forward net-
works. These sublayers employ a residual connection around them followed by layer normal-
ization.
As we described earlier in this section, in the masked multi-head decoder self-attention (the first
sublayer), queries, keys, and values all come from the outputs of the previous decoder layer. When
training sequence-to-sequence models, tokens at all the positions (time steps) of the output se-
quence are known. However, during prediction the output sequence is generated token by token;
thus, at any decoder time step only the generated tokens can be used in the decoder self-attention.
To preserve auto-regression in the decoder, its masked self-attention specifies dec_valid_lens so
that any query only attends to all positions in the decoder up to the query position.
class TransformerDecoderBlock(nn.Module):
    # The i-th block in the transformer decoder
    def __init__(self, num_hiddens, ffn_num_hiddens, num_heads, dropout, i):
        super().__init__()
        self.i = i
        self.attention1 = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                 dropout)
        self.addnorm1 = AddNorm(num_hiddens, dropout)
        self.attention2 = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                 dropout)
        self.addnorm2 = AddNorm(num_hiddens, dropout)
        self.ffn = PositionWiseFFN(ffn_num_hiddens, num_hiddens)
        self.addnorm3 = AddNorm(num_hiddens, dropout)

    def forward(self, X, state):
        enc_outputs, enc_valid_lens = state[0], state[1]
        # During training, all the tokens of any output sequence are processed
        # at the same time, so state[2][self.i] is None as initialized. When
        # decoding any output sequence token by token during prediction,
        # state[2][self.i] contains representations of the decoded output at
        # the i-th block up to the current time step
        if state[2][self.i] is None:
            key_values = X
        else:
            key_values = torch.cat((state[2][self.i], X), dim=1)
        state[2][self.i] = key_values
        if self.training:
            batch_size, num_steps, _ = X.shape
            # Shape of dec_valid_lens: (batch_size, num_steps), where every
            # row is [1, 2, ..., num_steps]
            dec_valid_lens = torch.arange(
                1, num_steps + 1, device=X.device).repeat(batch_size, 1)
        else:
            dec_valid_lens = None
        # Self-attention
        X2 = self.attention1(X, key_values, key_values, dec_valid_lens)
        Y = self.addnorm1(X, X2)
        # Encoder-decoder attention. Shape of enc_outputs:
        # (batch_size, num_steps, num_hiddens)
        Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)
        Z = self.addnorm2(Y, Y2)
        return self.addnorm3(Z, self.ffn(Z)), state
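
The effect of dec_valid_lens during training can be illustrated in isolation. In the sketch below, row t of the attention scores only keeps its first t + 1 entries before the softmax. Replacing masked scores by a large negative constant (-1e6 here) is an assumption about how the masking is carried out, and the function name is chosen for illustration.

```python
import math


def masked_softmax(scores, valid_len, mask_value=-1e6):
    """Softmax over scores where positions >= valid_len are masked out."""
    masked = [s if idx < valid_len else mask_value
              for idx, s in enumerate(scores)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


num_steps = 4
# One row of dec_valid_lens: [1, 2, ..., num_steps]
dec_valid_lens = list(range(1, num_steps + 1))
# Uniform raw scores make the resulting pattern easy to read
scores = [[0.0] * num_steps for _ in range(num_steps)]
weights = [masked_softmax(row, valid_len)
           for row, valid_len in zip(scores, dec_valid_lens)]
for row in weights:
    print([round(w, 2) for w in row])
```

The resulting weight matrix is lower triangular: the query at position t spreads its attention over positions 0 to t only, which is the auto-regressive constraint described above.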
To facilitate scaled dot-product operations in the encoder-decoder attention and addition opera-
tions in the residual connections, the feature dimension (num_hiddens) of the decoder is the same
as that of the encoder.
decoder_blk = TransformerDecoderBlock(24, 48, 8, 0.5, 0)

X = torch.ones((2, 100, 24))
state = [encoder_blk(X, valid_lens), valid_lens, [None]]
d2l.check_shape(decoder_blk(X, state)[0], X.shape)
Now we construct the entire transformer decoder composed of num_blks instances of
TransformerDecoderBlock. In the end, a fully connected layer computes the prediction for all the
vocab_size possible output tokens. Both the decoder self-attention weights and the
encoder-decoder attention weights are stored for later visualization.
class TransformerDecoder(d2l.AttentionDecoder):
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                 num_blks, dropout):
        super().__init__()
        self.num_hiddens = num_hiddens
        self.num_blks = num_blks
        self.embedding = nn.Embedding(vocab_size, num_hiddens)
        self.pos_encoding = d2l.PositionalEncoding(num_hiddens, dropout)
        self.blks = nn.Sequential()
        for i in range(num_blks):
            self.blks.add_module("block"+str(i), TransformerDecoderBlock(
                num_hiddens, ffn_num_hiddens, num_heads, dropout, i))
        # Fully connected output layer over the vocabulary
        self.dense = nn.LazyLinear(vocab_size)

    def init_state(self, enc_outputs, enc_valid_lens):
        return [enc_outputs, enc_valid_lens, [None] * self.num_blks]

    def forward(self, X, state):
        X = self.pos_encoding(self.embedding(X) * math.sqrt(self.num_hiddens))
        self._attention_weights = [[None] * len(self.blks) for _ in range(2)]
        for i, blk in enumerate(self.blks):
            X, state = blk(X, state)
            # Decoder self-attention weights
            self._attention_weights[0][
                i] = blk.attention1.attention.attention_weights
            # Encoder-decoder attention weights
            self._attention_weights[1][
                i] = blk.attention2.attention.attention_weights
        return self.dense(X), state

    @property
    def attention_weights(self):
        return self._attention_weights
11.7.6 Training
Let's instantiate an encoder-decoder model by following the transformer architecture. Here we
specify that both the transformer encoder and the transformer decoder have 2 layers using 4-head
attention. Similar to Section 10.7.6, we train the transformer model for sequence-to-sequence
learning on the English-French machine translation dataset.
num_hiddens, num_blks, dropout = 256, 2, 0.2
ffn_num_hiddens, num_heads = 64, 4
encoder = TransformerEncoder(
    len(data.src_vocab), num_hiddens, ffn_num_hiddens, num_heads,
    num_blks, dropout)
decoder = TransformerDecoder(
    len(data.tgt_vocab), num_hiddens, ffn_num_hiddens, num_heads,
    num_blks, dropout)
model = d2l.Seq2Seq(encoder, decoder, tgt_pad=data.tgt_vocab['<pad>'],
                    lr=0.001)

After training, we use the transformer model to translate a few English sentences into French and
compute their BLEU scores.
translation = []
break
f'{d2l.bleu(" ".join(translation), fr, k=2):.3f}')
go . => ['va', '!'], bleu,1.000

he's calm . => ['il', 'est', 'venu', '.'], bleu,0.658
i'm home . => ['je', 'suis', 'chez', 'moi', '.'], bleu,1.000
Letʼs visualize the transformer attention weights when translating the last English sentence into
French. The shape of the encoder self-attention weights is (number of encoder layers, number of
attention heads, num_steps or number of queries, num_steps or number of key-value pairs).
_, dec_attention_weights = model.predict_step(
    data.build([engs[-1]], [fras[-1]]), d2l.try_gpu(), data.num_steps, True)
enc_attention_weights = torch.cat(model.encoder.attention_weights, 0).reshape(
    (num_blks, num_heads, -1, data.num_steps))
d2l.check_shape(enc_attention_weights,
                (num_blks, num_heads, data.num_steps, data.num_steps))
In the encoder self-attention, both queries and keys come from the same input sequence. Since
padding tokens do not carry meaning, with specified valid length of the input sequence, no query
attends to positions of padding tokens. In the following, two layers of multi-head attention weights
are presented row by row. Each head independently attends based on a separate representation
subspace of queries, keys, and values.

d2l.show_heatmaps(
    enc_attention_weights.cpu(), xlabel='Key positions',
    ylabel='Query positions', titles=['Head %d' % i for i in range(1, 5)],
    figsize=(7, 3.5))
To visualize both the decoder self-attention weights and the encoder-decoder attention weights,
we need more data manipulations. For example, we fill the masked attention weights with zero.
Note that the decoder self-attention weights and the encoder-decoder attention weights both have
the same queries: the beginning-of-sequence token followed by the output tokens and possibly
end-of-sequence tokens.
dec_attention_weights_2d = [head[0].tolist()
                            for step in dec_attention_weights
                            for attn in step for blk in attn for head in blk]
dec_attention_weights_filled = torch.tensor(
    pd.DataFrame(dec_attention_weights_2d).fillna(0.0).values)
dec_attention_weights = dec_attention_weights_filled.reshape(
    (-1, 2, num_blks, num_heads, data.num_steps))
dec_self_attention_weights, dec_inter_attention_weights = \
    dec_attention_weights.permute(1, 2, 3, 0, 4)
# After the permutation, each tensor has shape
# (num_blks, num_heads, no. of decoding steps, data.num_steps)
d2l.check_shape(dec_self_attention_weights,
                (num_blks, num_heads, data.num_steps, data.num_steps))
d2l.check_shape(dec_inter_attention_weights,
                (num_blks, num_heads, data.num_steps, data.num_steps))
Due to the auto-regressive property of the decoder self-attention, no query attends to key-value
pairs after the query position.
d2l.show_heatmaps(
    dec_self_attention_weights[:, :, :, :],
    xlabel='Key positions', ylabel='Query positions',
    titles=['Head %d' % i for i in range(1, 5)], figsize=(7, 3.5))
Similar to the case in the encoder self-attention, via the specified valid length of the input se-
quence, no query from the output sequence attends to those padding tokens from the input se-
quence.
d2l.show_heatmaps(
    dec_inter_attention_weights, xlabel='Key positions',
    ylabel='Query positions', titles=['Head %d' % i for i in range(1, 5)],
    figsize=(7, 3.5))
Although the transformer architecture was originally proposed for sequence-to-sequence learning,
as we will discover later in the book, either the transformer encoder or the transformer decoder
is often individually used for different deep learning tasks.
11.7.7 Summary
• The transformer is an instance of the encoder-decoder architecture, though either the en-
coder or the decoder can be used individually in practice.
• In the transformer, multi-head self-attention is used for representing the input sequence
and the output sequence, though the decoder has to preserve the auto-regressive property
via a masked version.
• Both the residual connections and the layer normalization in the transformer are important
for training a very deep model.
• The positionwise feed-forward network in the transformer model transforms the represen-
tation at all the sequence positions using the same MLP.
Exercises
1. Train a deeper transformer in the experiments. How does it affect the training speed and
the translation performance?
2. Is it a good idea to replace scaled dot-product attention with additive attention in the trans-
former? Why?
3. For language modeling, should we use the transformer encoder, decoder, or both? How would
you design this method?
4. What challenges can transformers face if input sequences are very long? Why?
5. How can we improve the computational and memory efficiency of transformers? Hint: you may
refer to the survey paper by Tay et al. (2020).
Discussions
11.8 Transformers for Vision
The transformer architecture was initially proposed for sequence to sequence learning, such as
for machine translation. With high effectiveness, transformers subsequently became the model
of choice in various natural language processing tasks (Brown et al., 2020, Devlin et al., 2018, Rad-
ford et al., 2018, Radford et al., 2019, Raffel et al., 2020). However, in the field of computer vision
the dominant architecture had been based on CNNs (Chapter 8). Can we adapt transformers to model
image data? This question has sparked immense interest in the computer vision community.
Ramachandran et al. (2019) proposed to replace convolution with self-attention. However, its use
of specialized patterns in attention makes it hard to scale up models on hardware accelerators.
Cordonnier et al. (2020) theoretically proved that self-attention can learn to behave similarly to
convolution. Empirically, 2 × 2 patches were taken from images as input, but the small patch size
makes the model only applicable to image data with low resolutions.
Without specific constraints on patch size, vision transformers (ViTs) extract patches from images
and feed them into a transformer encoder to obtain a global representation, which will finally be
transformed for classification (Dosovitskiy et al., 2021). Notably, transformers show better scala-
bility than CNNs: when training larger models on larger datasets, vision transformers outperform
ResNets by a significant margin. Similar to the landscape of network architecture design in natu-
ral language processing, transformers also became a game-changer in computer vision.
11.8.1 Model
Fig. 11.8.1 depicts the model architecture of vision transformers. This architecture consists of a
stem that patchifies images, a body based on the multi-layer transformer encoder, and a head that
transforms the global representation into the output label.
Fig. 11.8.1: The vision transformer architecture. In this example, an image is split into 9 patches. A
special “<cls>” token and the 9 flattened image patches are transformed via patch embedding and
n transformer encoder blocks into 10 representations, respectively. The “<cls>” representation is
further transformed into the output label.

Consider an input image with height $h$, width $w$, and $c$ channels. Specifying the patch height
and width both as $p$, the image is split into a sequence of $m = hw/p^2$ patches, where each patch
is flattened to a vector of length $cp^2$. In this way, image patches can be treated similarly to
tokens in text sequences by transformer encoders. A special “<cls>” (class) token and the $m$
flattened image patches are linearly projected into a sequence of $m + 1$ vectors, summed with
learnable positional embeddings. The multi-layer transformer encoder transforms $m + 1$ input
vectors into the same number of output vector representations of the same length. It works exactly
the same way as the original transformer encoder in Fig. 11.7.1, only differing in the position of
normalization. Since the “<cls>” token attends to all the image patches via self-attention (see
Fig. 11.6.1), its representation from the transformer encoder output will be further transformed
into the output label.
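
The patch arithmetic can be checked on a toy image. The following sketch splits a nested-list image of shape (c, h, w) into non-overlapping p × p patches and flattens each one. The function name patchify and the tiny image are illustrative assumptions, and the channel-first flattening order is just one possible convention.

```python
def patchify(image, p):
    """Split a (c, h, w) nested list into flattened p x p patches."""
    c, h, w = len(image), len(image[0]), len(image[0][0])
    assert h % p == 0 and w % p == 0, 'patch size must divide height and width'
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            # Flatten channel by channel, then row by row within the patch
            patches.append([image[ch][top + r][left + col]
                            for ch in range(c)
                            for r in range(p)
                            for col in range(p)])
    return patches


c, h, w, p = 3, 6, 6, 2
image = [[[ch * 100 + r * w + col for col in range(w)]
          for r in range(h)] for ch in range(c)]
patches = patchify(image, p)
print(len(patches), len(patches[0]))
```

For this 3-channel 6 × 6 image with p = 2, there are m = 36/4 = 9 patches of length 3 · 4 = 12, so the encoder would receive m + 1 = 10 vectors once the “<cls>” token is prepended.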
import torch
from torch import nn
from d2l import torch as d2l
11.8.2 Patch Embedding
To implement a vision transformer, letʼs start with patch embedding in Fig. 11.8.1. Splitting an
image into patches and linearly projecting these flattened patches can be simplified as a single
convolution operation, where both the kernel size and the stride size are set to the patch size.
class PatchEmbedding(nn.Module):
    def __init__(self, img_size=96, patch_size=16, num_hiddens=512):
        super().__init__()
        def _make_tuple(x):
            if not isinstance(x, (list, tuple)):
                return (x, x)
            return x
        img_size, patch_size = _make_tuple(img_size), _make_tuple(patch_size)
        self.num_patches = (img_size[0] // patch_size[0]) * (
            img_size[1] // patch_size[1])
        # Kernel size = stride = patch size: one output position per patch
        self.conv = nn.LazyConv2d(num_hiddens, kernel_size=patch_size,
                                  stride=patch_size)

    def forward(self, X):
        # Output shape: (batch size, no. of patches, no. of channels)
        return self.conv(X).flatten(2).transpose(1, 2)
In the following example, taking images with height and width of img_size as input, the patch
embedding outputs (img_size//patch_size)**2 patches that are linearly projected to vectors of
length num_hiddens.
img_size, patch_size, num_hiddens, batch_size = 96, 16, 512, 4
patch_emb = PatchEmbedding(img_size, patch_size, num_hiddens)
X = torch.randn(batch_size, 3, img_size, img_size)
d2l.check_shape(patch_emb(X),
                (batch_size, (img_size//patch_size)**2, num_hiddens))


11.8.3 Vision Transformer Encoder
The MLP of the vision transformer encoder is slightly different from the position-wise FFN of
the original transformer encoder (see Section 11.7.2). First, here the activation function uses the
Gaussian error linear unit (GELU), which can be considered as a smoother version of the ReLU
(Hendrycks and Gimpel, 2016). Second, dropout is applied to the output of each fully connected
layer in the MLP for regularization.
class ViTMLP(nn.Module):
    def __init__(self, mlp_num_hiddens, mlp_num_outputs, dropout=0.5):
        super().__init__()
        self.dense1 = nn.LazyLinear(mlp_num_hiddens)
        self.gelu = nn.GELU()
        self.dropout1 = nn.Dropout(dropout)
        self.dense2 = nn.LazyLinear(mlp_num_outputs)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x):
        # Dropout follows each of the two fully connected layers
        return self.dropout2(self.dense2(self.dropout1(self.gelu(
            self.dense1(x)))))
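To see in what sense the GELU is a smoother ReLU, the plain-Python sketch below evaluates both on a few inputs. It uses the exact form GELU(x) = x Φ(x), where Φ is the standard Gaussian cumulative distribution function (Hendrycks and Gimpel, 2016). The function names are illustrative and are not part of d2l.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike the ReLU, the GELU is slightly negative for small negative inputs and has no kink at zero, while both functions nearly coincide for inputs of large magnitude.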
The vision transformer encoder block implementation just follows the pre-normalization design
in Fig. 11.8.1, where normalization is applied right before multi-head attention or the MLP. In con-
trast to post-normalization (“add & norm” in Fig. 11.7.1), where normalization is placed right af-
ter residual connections, pre-normalization leads to more effective or efficient training for trans-
formers (Baevski and Auli, 2018, Wang et al., 2019, Xiong et al., 2020).
class ViTBlock(nn.Module):
    def __init__(self, num_hiddens, norm_shape, mlp_num_hiddens,
                 num_heads, dropout, use_bias=False):
        super().__init__()
        self.ln1 = nn.LayerNorm(norm_shape)
        self.attention = d2l.MultiHeadAttention(num_hiddens, num_heads,
                                                dropout, use_bias)
        self.ln2 = nn.LayerNorm(norm_shape)
        self.mlp = ViTMLP(mlp_num_hiddens, num_hiddens, dropout)

    def forward(self, X, valid_lens=None):
        # Pre-normalization: normalize right before each sublayer and add
        # the sublayer output back to the unnormalized input
        X = X + self.attention(*([self.ln1(X)] * 3), valid_lens)
        return X + self.mlp(self.ln2(X))
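The difference between post-normalization and pre-normalization can be sketched on a single vector in plain Python. This is a toy illustration under stated assumptions: `sublayer` is an arbitrary stand-in for multi-head attention or the MLP, and `layer_norm` omits the learnable scale and shift of `nn.LayerNorm`.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a list of floats to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Stand-in for attention or the MLP (an arbitrary toy function)."""
    return [0.5 * v + 1.0 for v in x]

def post_norm(x):
    # "Add & norm": residual connection first, normalization last
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x):
    # Normalization first; the residual path carries x through unchanged
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0, 8.0]
print(post_norm(x))
print(pre_norm(x))
```

In the pre-normalization variant the input reaches the output through an identity path, whereas in the post-normalization variant every output passes through a normalization step.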
As in Section 11.7.4, a vision transformer encoder block does not change the shape of its input.
X = torch.ones((2, 100, 24))

encoder_blk = ViTBlock(24, 24, 48, 8, 0.5)
encoder_blk.eval()
d2l.check_shape(encoder_blk(X), X.shape)

11.8.4 Putting All Things Together
The forward pass of vision transformers below is straightforward. First, input images are fed into
a PatchEmbedding instance, whose output is concatenated with the “<cls>” token embedding.
They are summed with learnable positional embeddings before dropout. Then the output is fed
into the transformer encoder that stacks num_blks instances of the ViTBlock class. Finally, the
representation of the “<cls>” token is projected by the network head.
class ViT(d2l.Classifier):
    """Vision transformer."""
    def __init__(self, img_size, patch_size, num_hiddens, mlp_num_hiddens,
                 num_heads, num_blks, emb_dropout, blk_dropout, lr=0.1,
                 use_bias=False, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.patch_embedding = PatchEmbedding(
            img_size, patch_size, num_hiddens)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, num_hiddens))
        num_steps = self.patch_embedding.num_patches + 1  # Add the cls token
        # Positional embeddings are learnable
        self.pos_embedding = nn.Parameter(
            torch.randn(1, num_steps, num_hiddens))
        self.dropout = nn.Dropout(emb_dropout)
        self.blks = nn.Sequential()
        for i in range(num_blks):
            self.blks.add_module(f"{i}", ViTBlock(
                num_hiddens, num_hiddens, mlp_num_hiddens,
                num_heads, blk_dropout, use_bias))
        self.head = nn.Sequential(nn.LayerNorm(num_hiddens),
                                  nn.Linear(num_hiddens, num_classes))

    def forward(self, X):
        X = self.patch_embedding(X)
        # Prepend the "<cls>" token to every sequence in the batch
        X = torch.cat((self.cls_token.expand(X.shape[0], -1, -1), X), 1)
        X = self.dropout(X + self.pos_embedding)
        for blk in self.blks:
            X = blk(X)
        return self.head(X[:, 0])
11.8.5 Training
Training a vision transformer on the Fashion-MNIST dataset is just like how CNNs were trained in
Chapter 8.
img_size, patch_size = 96, 16
num_hiddens, mlp_num_hiddens, num_heads, num_blks = 512, 2048, 8, 2
emb_dropout, blk_dropout, lr = 0.1, 0.1, 0.1
model = ViT(img_size, patch_size, num_hiddens, mlp_num_hiddens, num_heads,
            num_blks, emb_dropout, blk_dropout, lr)
# Assumed training setup, following the d2l.Trainer usage from Chapter 8
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(img_size, img_size))
trainer.fit(model, data)

You may notice that for small datasets like Fashion-MNIST, our implemented vision transformer
does not outperform the ResNet in Section 8.6. Similar observations can be made even on the
ImageNet dataset (1.2 million images). This is because transformers lack those useful principles
in convolution, such as translation invariance and locality (Section 7.1). However, the picture
changes when training larger models on larger datasets (e.g., 300 million images), where vision
transformers outperform ResNets by a large margin in image classification, demonstrating the
intrinsic superiority of transformers in scalability (Dosovitskiy et al., 2021). The introduction of
vision transformers has changed the landscape of network design for modeling image data. They
were soon shown to be effective on the ImageNet dataset with the data-efficient training strategies of DeiT
(Touvron et al., 2021). However, the quadratic complexity of self-attention (Section 11.6) makes the
transformer architecture less suitable for higher-resolution images. Towards a general-purpose
backbone network in computer vision, Swin transformers addressed the quadratic computational
complexity with respect to image size (Section 11.6.2) and added back convolution-like priors, ex-
tending the applicability of transformers to a range of computer vision tasks beyond image clas-
sification with state-of-the-art results (Liu et al., 2021).
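To make the quadratic growth concrete, the sketch below counts the tokens and the entries of a single attention score matrix for square images of increasing resolution. The helper `attention_cost` is hypothetical, and it counts only the score matrix of one head in one layer, not the full cost of a model.

```python
def attention_cost(img_size, patch_size):
    """Return (num_tokens, num_pairwise_scores) for a square image.

    Hypothetical helper: counts the "<cls>" token plus all patches, and
    the entries of one attention score matrix (tokens x tokens).
    """
    num_tokens = (img_size // patch_size) ** 2 + 1
    return num_tokens, num_tokens ** 2

for size in (96, 192, 384):
    tokens, scores = attention_cost(size, patch_size=16)
    print(f"{size}x{size}: {tokens} tokens, {scores} attention scores")
```

Doubling the image height and width roughly quadruples the number of tokens, so the number of pairwise attention scores grows by a factor of about 16.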
Exercises
1. How does the value of img_size affect training time?

2. Instead of projecting the “<cls>” token representation to the output, how can we project the
averaged patch representations? Implement this change and see how it affects the accuracy.
3. Can you modify hyperparameters to improve the accuracy of the vision transformer?

11.9 Large-Scale Pretraining with Transformers
So far in our image classification and machine translation experiments, models were trained on
datasets with input-output examples from scratch to perform specific tasks. For example, a trans-
former was trained with English-French pairs (Section 11.7) so that this model can translate input
English text into French. As a result, each model becomes a specific expert that is sensitive to even
a slight shift in data distribution (Section 4.7). For better generalized models, or even more competent
generalists that can perform multiple tasks with or without adaptation, pretraining models on
large data has been increasingly common.
Given larger data for pretraining, the transformer architecture performs better with an increased
model size and training compute, demonstrating superior scaling behavior. Specifically, the performance
of transformer-based language models scales as a power law with the number of model
parameters, training tokens, and training compute (Kaplan et al., 2020). The scalability of transformers
is also evidenced by the significantly boosted performance from larger vision transformers
trained on larger data (discussed in Section 11.8). More recent success stories include Gato,
a generalist model that can play Atari, caption images, chat, and act as a robot (Reed et al., 2022).
Gato is a single transformer that scales well when pretrained on diverse modalities including text,
images, joint torques, and button presses. Notably, all such multi-modal data is serialized into a
flat sequence of tokens, which can be processed akin to text tokens (Section 11.7) or image patches
(Section 11.8) by transformers.
Before the compelling success of pretraining transformers for multi-modal data, transformers were
extensively pretrained with a wealth of text. Originally proposed for machine translation, the
transformer architecture in Fig. 11.7.1 consists of an encoder for representing input sequences
and a decoder for generating target sequences. Primarily, transformers can be used in three dif-
ferent modes: encoder-only, encoder-decoder, and decoder-only. To conclude this chapter, we will
review these three modes and explain the scalability in pretraining transformers.
11.9.1 Encoder-Only
When only the transformer encoder is used, a sequence of input tokens is converted into the same
number of representations that can be further projected into output (e.g., classification). A trans-
former encoder consists of self-attention layers, where all input tokens attend to each other. For
example, vision transformers depicted in Fig. 11.8.1 are encoder-only, converting a sequence of
input image patches into the representation of a special “<cls>” token. Since this representation
depends on all input tokens, it is further projected into classification labels. This design was in-
spired by an earlier encoder-only transformer pretrained on text: BERT (Bidirectional Encoder
Representations from Transformers) (Devlin et al., 2018).

Pretraining BERT
Fig. 11.9.1: Left: Pretraining BERT with masked language modeling. Prediction of the masked
“love” token depends on all input tokens before and after “love”. Right: Attention pattern in the
transformer encoder. Each token along the vertical axis attends to all input tokens along the hor-
izontal axis.
BERT is pretrained on text sequences using masked language modeling: input text with randomly
masked tokens is fed into a transformer encoder to predict the masked tokens. As illustrated in
Fig. 11.9.1, an original text sequence “I”, “love”, “this”, “red”, “car” is prepended with the “<cls>”
token, and the “<mask>” token randomly replaces “love”; then the cross-entropy loss between the
masked token “love” and its prediction is to be minimized during pretraining. Note that there is
no constraint in the attention pattern of transformer encoders (right of Fig. 11.9.1) so all tokens
can attend to each other. Thus, prediction of “love” depends on input tokens before and after it in
the sequence. This is why BERT is a “bidirectional encoder”. Without need for manual labeling,
large-scale text data from books and Wikipedia can be used for pretraining BERT.
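The construction of a masked language modeling example can be sketched in plain Python. This is a simplified illustration rather than BERT's actual masking procedure: the function name `mask_tokens`, the number of masked tokens, and the random seed are all assumptions made for the example.

```python
import random

def mask_tokens(tokens, num_masks=1, seed=0):
    """Return (masked input, {position: original token}).

    Simplified illustration: prepend "<cls>" and replace randomly chosen
    tokens with "<mask>". The masking rate and seed are assumptions.
    """
    rng = random.Random(seed)
    inputs = ["<cls>"] + list(tokens)
    # Never mask the "<cls>" token at position 0
    positions = rng.sample(range(1, len(inputs)), num_masks)
    labels = {}
    for pos in positions:
        labels[pos] = inputs[pos]
        inputs[pos] = "<mask>"
    return inputs, labels

inputs, labels = mask_tokens(["I", "love", "this", "red", "car"])
print(inputs)
print(labels)
```

The dictionary of labels holds the tokens that the encoder must predict. The cross-entropy loss is computed only at those positions, and no manual annotation is needed to produce them.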
Fine-Tuning BERT
The pretrained BERT can be fine-tuned to downstream encoding tasks involving single text or text
pairs. During fine-tuning, additional layers can be added to BERT with randomized parameters:
these parameters and those pretrained BERT parameters will be updated to fit training data of
downstream tasks.

Fig. 11.9.2: Fine-tuning BERT for sentiment analysis.
Fig. 11.9.2 illustrates fine-tuning of BERT for sentiment analysis. The transformer encoder is a
pretrained BERT, which takes a text sequence as input and feeds the “<cls>” representation (global
representation of the input) into an additional fully connected layer to predict the sentiment. Dur-
ing fine-tuning, the cross-entropy loss between the prediction and the label on sentiment analy-
sis data is minimized via gradient-based algorithms, where the additional layer is trained from
scratch while pretrained parameters of BERT are updated. BERT does more than sentiment anal-
ysis. The general language representations learned by the 350-million-parameter BERT from 250
billion training tokens advanced the state of the art for natural language tasks such as single text
classification, text pair classification or regression, text tagging, and question answering.
You may note that these downstream tasks include text pair understanding. BERT pretraining
has another loss for predicting whether one sentence immediately follows the other. However,
this loss was later found not useful when pretraining RoBERTa, a BERT variant of the same size,
on 2000 billion tokens (Liu et al., 2019). Other derivatives of BERT improved model architectures
or pretraining objectives, such as ALBERT (enforcing parameter sharing) (Lan et al., 2019), Span-
BERT (representing and predicting spans of text) (Joshi et al., 2020), DistilBERT (lightweight via
knowledge distillation) (Sanh et al., 2019), and ELECTRA (replaced token detection) (Clark et al.,
2020). Moreover, BERT inspired transformer pretraining in computer vision, such as with vision
transformers (Dosovitskiy et al., 2021), Swin transformers (Liu et al., 2021), and MAE (masked
autoencoders) (He et al., 2022).
11.9.2 Encoder-Decoder
Since a transformer encoder converts a sequence of input tokens into the same number of output
representations, the encoder-only mode cannot generate a sequence of arbitrary length like in
machine translation. Originally proposed for machine translation, the transformer architecture
also has a decoder that autoregressively predicts the target sequence of arbitrary length, token by
token, conditional on both encoder output and decoder output: (i) for conditioning on encoder
output, encoder-decoder cross-attention (multi-head attention of decoder in Fig. 11.7.1) allows
target tokens to attend to all input tokens; (ii) conditioning on decoder output is achieved by a
causal attention pattern (masked multi-head attention of decoder in Fig. 11.7.1), where any target
token can only attend to past and present tokens in the target sequence.
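The three attention patterns described above can be written down as 0/1 masks, where entry (i, j) is 1 if token i may attend to token j. The following is a minimal sketch with hypothetical function names, using nested lists instead of tensors.

```python
def full_mask(n):
    """Encoder self-attention: every token attends to every token."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder self-attention: token i attends to tokens 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def cross_mask(num_targets, num_inputs):
    """Cross-attention: each target token attends to all input tokens."""
    return [[1] * num_inputs for _ in range(num_targets)]

for row in causal_mask(4):
    print(row)
```

The causal mask is lower triangular: each row gains one more permitted position than the row before it.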
To pretrain encoder-decoder transformers beyond human-labeled machine translation data,
BART (Lewis et al., 2019) and T5 (Raffel et al., 2020) are two concurrently proposed
encoder-decoder transformers pretrained on large-scale text corpora. Both attempt to reconstruct
original text in their pretraining objectives, while the former emphasizes noising input (e.g., masking,
deletion, permutation, and rotation) and the latter highlights multitask unification with
comprehensive ablation studies.
Pretraining T5
As an example of the pretrained transformer encoder-decoder, T5 (Text-to-Text Transfer
Transformer) unifies many tasks as the same text-to-text problem: for any task, the input of the encoder
is a task description (e.g., “Summarize”, “:”) followed by task input (e.g., a sequence of tokens from
an article), and the decoder predicts the task output (e.g., a sequence of tokens summarizing the
input article). To perform as text-to-text, T5 is trained to generate some target text conditional on
input text.
Fig. 11.9.3: Left: Pretraining T5 by predicting consecutive spans. The original sentence is “I”,
“love”, “this”, “red”, “car”, where “love” is replaced by a special “<X>” token, and consecutive “red”,
“car” are replaced by a special “<Y>” token. The target sequence ends with a special “<Z>” token.
Right: Attention pattern in the transformer encoder-decoder. In the encoder self-attention (lower
square), all input tokens attend to each other; In the encoder-decoder cross-attention (upper rect-
angle), each target token attends to all input tokens; In the decoder self-attention (upper triangle),
each target token attends to present and past target tokens only (causal).
To obtain input and output from any original text, T5 is pretrained to predict consecutive spans.
Specifically, tokens from text are randomly replaced by special tokens where each consecutive
span is replaced by the same special token. Consider the example in Fig. 11.9.3, where the original
text is “I”, “love”, “this”, “red”, “car”. Tokens “love”, “red”, “car” are randomly replaced by special
tokens. Since “red” and “car” are a consecutive span, they are replaced by the same special token.
As a result, the input sequence is “I”, “<X>”, “this”, “<Y>”, and the target sequence is “<X>”, “love”,
“<Y>”, “red”, “car”, “<Z>”, where “<Z>” is another special token marking the end. As shown in Fig.
11.9.3, the decoder has a causal attention pattern to prevent itself from attending to future tokens
during sequence prediction.
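The span replacement just described can be sketched in a few lines of plain Python. The function name `corrupt_spans` is hypothetical, and the masked positions are passed in explicitly instead of being sampled at random so that the example reproduces Fig. 11.9.3.

```python
def corrupt_spans(tokens, masked, sentinels=("<X>", "<Y>", "<Z>")):
    """Build (encoder input, decoder target) for span prediction.

    `masked` is the set of token positions to hide. Each run of
    consecutive masked positions shares one sentinel token, and the
    target ends with the next unused sentinel.
    """
    inputs, targets, used = [], [], 0
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not prev_masked:  # a new span starts here
                if used >= len(sentinels) - 1:
                    raise ValueError("not enough sentinel tokens")
                inputs.append(sentinels[used])
                targets.append(sentinels[used])
                used += 1
            targets.append(tok)
            prev_masked = True
        else:
            inputs.append(tok)
            prev_masked = False
    targets.append(sentinels[used])  # end-of-target marker
    return inputs, targets

inputs, targets = corrupt_spans(["I", "love", "this", "red", "car"], {1, 3, 4})
print(inputs)   # ['I', '<X>', 'this', '<Y>']
print(targets)  # ['<X>', 'love', '<Y>', 'red', 'car', '<Z>']
```

The encoder input is shorter than the original text because each masked span collapses into a single special token, while the target lists only the hidden tokens.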
In T5, predicting consecutive spans is also referred to as reconstructing corrupted text. With this
objective, T5 is pretrained with 1000 billion tokens from the C4 (Colossal Clean Crawled Corpus)
data, which consists of clean English text from the Web (Raffel et al., 2020).
Fine-Tuning T5
Similar to BERT, T5 needs to be fine-tuned (updating T5 parameters) on task-specific training data
to perform a given task. Major differences from BERT fine-tuning include: (i) T5 input includes task
descriptions; (ii) T5 can generate sequences with arbitrary length with its transformer decoder;
(iii) no additional layers are required.
Fig. 11.9.4: Fine-tuning T5 for text summarization. Both the task description and article tokens
are fed into the transformer encoder for predicting the summary.
Fig. 11.9.4 explains fine-tuning T5 using text summarization as an example. In this downstream
task, the task description tokens “Summarize”, “:” followed by the article tokens are input to the
encoder.
After fine-tuning, the 11-billion-parameter T5 (T5-11B) achieved state-of-the-art results on multiple
encoding (e.g., classification) and generation (e.g., summarization) benchmarks. Since its
release, T5 has been used extensively in later research. For example, switch transformers are
designed based on T5 to activate a subset of the parameters for better computational efficiency
(Fedus et al., 2022). In a text-to-image model called Imagen, text is input to a frozen T5 encoder
(T5-XXL) with 4.6 billion parameters (Saharia et al., 2022). The photorealistic text-to-image exam-
ples in Fig. 11.9.5 suggest that the T5 encoder alone may effectively represent text even without
fine-tuning.

Fig. 11.9.5: Text-to-image examples by the Imagen model, whose text encoder is from T5 (figures
taken from (Saharia et al., 2022)).
11.9.3 Decoder-Only
We have reviewed encoder-only and encoder-decoder transformers. Alternatively, decoder-only
transformers remove the entire encoder and the decoder sublayer with the encoder-decoder
cross-attention from the original encoder-decoder architecture depicted in Fig. 11.7.1. Nowadays,
decoder-only transformers have become the de facto architecture in large-scale language modeling
(Section 9.3), which leverages the world's abundant unlabeled text corpora via self-supervised
learning.
GPT and GPT-2
Using language modeling as the training objective, the GPT (generative pre-training) model
chooses a transformer decoder as its backbone (Radford et al., 2018).
Fig. 11.9.6: Left: Pretraining GPT with language modeling. The target sequence is the input se-
quence shifted by one token. Both “<bos>” and “<eos>” are special tokens marking the beginning
and end of sequences, respectively. Right: Attention pattern in the transformer decoder. Each
token along the vertical axis attends to only its past tokens along the horizontal axis (causal).

Following the autoregressive language model training as described in Section 9.3.3, Fig. 11.9.6
illustrates GPT pretraining with a transformer decoder, where the target sequence is the input
sequence shifted by one token. Note that the causal attention pattern in the transformer decoder
enforces that each token can only attend to its past tokens (token-by-token prediction cannot at-
tend to future tokens).
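The shifted input and target sequences can be made explicit with a short plain-Python sketch. The function name `lm_example` is hypothetical, and the sentence is the same one used in the figures of this section.

```python
def lm_example(tokens):
    """Return (input sequence, target sequence) for language modeling."""
    seq = ["<bos>"] + list(tokens) + ["<eos>"]
    return seq[:-1], seq[1:]  # the target is the input shifted by one

inputs, targets = lm_example(["I", "love", "this", "red", "car"])
for t, y in enumerate(targets):
    # With causal attention, predicting y may use inputs[0..t] only
    print(f"step {t}: context={inputs[:t + 1]} -> target={y!r}")
```

Every position yields one training signal, so a sequence of n tokens provides n + 1 prediction targets, including the final “<eos>”.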
GPT has 100 million parameters and needs to be fine-tuned for individual downstream tasks. A
much larger transformer-decoder language model, GPT-2, was introduced one year later (Radford
et al., 2019). Compared with the original transformer decoder in GPT, pre-normalization (dis-
cussed in Section 11.8.3) and improved initialization and weight-scaling were adopted in GPT-2.
Pretrained on 40 GB of text, the 1.5-billion-parameter GPT-2 obtained the state-of-the-art results
on language modeling benchmarks and promising results on multiple other tasks without updating
the parameters or architecture.
GPT-3
GPT-2 demonstrated the potential of using the same language model for multiple tasks without
updating the model. This is more computationally efficient than fine-tuning, which requires model
updates via gradient computation.
Fig. 11.9.7: Zero-shot, one-shot, few-shot learning with language models (transformer decoders).
No parameter update is needed.
Before explaining the more computationally efficient use of language models without parameter
update, recall from Section 9.5 that a language model can be trained to generate a text sequence
conditional on some prefix text sequence. Thus, a pretrained language model may generate the task
output as a sequence without parameter update, conditional on an input sequence with the task
description, task-specific input-output examples, and a prompt (task input). This learning paradigm
can be further categorized into zero-shot, one-shot, and few-shot, when there is no, one, or a few
task-specific input-output examples, respectively (Fig. 11.9.7).
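The three settings differ only in how many input-output examples are placed in the input sequence, as the following sketch shows. The function `build_prompt`, the "input => output" text format, and the translation examples are all illustrative assumptions rather than a prescribed interface.

```python
def build_prompt(task_description, examples, task_input):
    """Concatenate a task description, k examples, and the prompt.

    Hypothetical text format: one "input => output" pair per line.
    """
    lines = [task_description]
    lines += [f"{x} => {y}" for x, y in examples]
    lines.append(f"{task_input} =>")  # the model continues from here
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
zero_shot = build_prompt("Translate English to French:", [], "car")
one_shot = build_prompt("Translate English to French:", demos[:1], "car")
few_shot = build_prompt("Translate English to French:", demos, "car")
print(few_shot)
```

In each case the pretrained language model simply continues the text after the final "=>", and its parameters stay fixed.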
Fig. 11.9.8: Aggregate performance of GPT-3 for all 42 accuracy-denominated benchmarks (cap-
tion adapted and figure taken from (Brown et al., 2020)).
These three settings were tested in GPT-3 (Brown et al., 2020), whose largest version uses data
and model size about two orders of magnitude larger than those in GPT-2. GPT-3 uses the same
transformer decoder architecture as its direct predecessor GPT-2 except that attention patterns
(right of Fig. 11.9.6) are sparser at alternating layers. Pretrained with 300 billion tokens, GPT-3
performs better with larger model size, where few-shot performance increases most rapidly (Fig.
11.9.8). Although enjoying computational efficiency, GPT-3 few-shot learning underperformed the
state-of-the-art fine-tuned models that require model updates. Nonetheless, GPT-3 has powered
a wide range of downstream applications across the Web (https://gpt3demo.com/): it was generating
4.5 billion words every day around nine months after its API (https://openai.com/api/) release.
11.9.4 Scalability
Fig. 11.9.8 empirically demonstrates scalability of transformers in the GPT-3 language model. For
language modeling, more comprehensive empirical studies on scalability of transformers had
suggested training larger transformers with more data and compute (Kaplan et al., 2020).

Fig. 11.9.9: Transformer language model performance improves smoothly as we increase the
model size, dataset size, and amount of compute used for training. For optimal performance all
three factors must be scaled up in tandem. Empirical performance has a power-law relationship
with each individual factor when not bottlenecked by the other two (caption adapted and figure
taken from (Kaplan et al., 2020)).
As shown in Fig. 11.9.9, precise power-law scaling can be observed in the performance with respect
to the model size (number of parameters, excluding embedding layers), dataset size (number of
training tokens), and amount of training compute (PetaFLOP/s-days, excluding embedding layers).
In general, increasing all these three factors in tandem leads to better performance. However, how
to increase them in tandem still remains a matter of debate (Hoffmann et al., 2022).
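A power law appears as a straight line on a log-log plot, so its exponent can be estimated by linear regression on logarithms. The sketch below demonstrates this on synthetic data. The constants `true_a` and `true_alpha` are made up for the illustration and are not values reported by Kaplan et al. (2020).

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    alpha = -slope
    a = math.exp(my - slope * mx)
    return a, alpha

# Synthetic (made-up) losses that follow an exact power law in model size
true_a, true_alpha = 10.0, 0.08
sizes = [10 ** k for k in range(6, 11)]
losses = [true_a * n ** (-true_alpha) for n in sizes]
a, alpha = fit_power_law(sizes, losses)
print(f"a={a:.3f}, alpha={alpha:.3f}")
```

On real training runs the fit is only approximate, and it holds for one factor only while the other two are not the bottleneck.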
Fig. 11.9.10: Transformer language model training runs (figure taken from (Kaplan et al., 2020)).
Besides increased performance, large models also enjoy better sample efficiency than small mod-
els. Fig. 11.9.10 shows that large models need fewer training samples (tokens processed) to per-
form at the same level achieved by small models, and performance is scaled smoothly with com-
pute.

Fig. 11.9.11: GPT-3 performance (cross-entropy validation loss) follows a power-law trend with the
amount of compute used for training. The power-law behavior observed in (Kaplan et al., 2020)
continues for an additional two orders of magnitude with only small deviations from the predicted
curve. Embedding parameters are excluded from compute and parameter counts (caption adapted
and figure taken from (Brown et al., 2020)).
The empirical scaling behaviors in (Kaplan et al., 2020) have been tested in subsequent large trans-
former models. For example, GPT-3 supported this hypothesis with two more orders of magnitude
in Fig. 11.9.11.
The scalability of transformers in the GPT series has inspired subsequent transformer language
models. While the transformer decoder in GPT-3 was largely followed in OPT (Open Pretrained
Transformers) (Zhang et al., 2022) using only 1/7th the carbon footprint of the former, the GPT-2
transformer decoder was used in training the 530-billion-parameter Megatron-Turing NLG (Smith
et al., 2022) with 270 billion training tokens. Following the GPT-2 design, the 280-billion-parameter
Gopher (Rae et al., 2021) pretrained with 300 billion tokens achieved state-of-the-art performance
on the majority of about 150 diverse tasks. Inheriting the same architecture and using the
same compute budget as Gopher, Chinchilla (Hoffmann et al., 2022) is a substantially smaller (70
billion parameters) model that trains much longer (1.4 trillion training tokens), outperforming
Gopher on many tasks. To continue the scaling line of language modeling, PaLM (Pathways
Language Model) (Chowdhery et al., 2022), a 540-billion-parameter transformer decoder with
modified designs pretrained on 780 billion tokens, outperformed average human performance on the
BIG-Bench benchmark (Srivastava et al., 2022). Further training PaLM on 38.5 billion tokens
containing scientific and mathematical content results in Minerva (Lewkowycz et al., 2022), a large
language model that can answer nearly a third of undergraduate-level problems that require
quantitative reasoning, such as in physics, chemistry, biology, and economics.
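The statement that Gopher and Chinchilla use the same compute budget can be checked roughly with the common rule of thumb that training a transformer with N parameters on D tokens costs about 6ND floating point operations. This approximation is an assumption of the sketch and is not stated in this section. The parameter and token counts are the ones quoted above.

```python
def train_flops(num_params, num_tokens):
    """Approximate training compute as 6 * N * D floating point operations.

    This rule of thumb is an assumption of the sketch, not a statement
    from this section.
    """
    return 6 * num_params * num_tokens

models = {
    "Gopher": (280e9, 300e9),      # parameters, training tokens
    "Chinchilla": (70e9, 1.4e12),
}
for name, (n, d) in models.items():
    print(f"{name}: {train_flops(n, d):.2e} FLOPs, "
          f"{d / n:.1f} tokens per parameter")
```

Under this estimate the two budgets are of the same order, while Chinchilla spends far more training tokens per parameter.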

Transformers have been pretrained as encoder-only (e.g., BERT), encoder-decoder (e.g., T5), and
decoder-only (e.g., GPT series). Pretrained models may be adapted to perform different tasks with
model update (e.g., fine-tuning) or not (e.g., few-shot). Scalability of transformers suggests that
better performance benefits from larger models, more training data, and more training compute.
Since transformers were first designed and pretrained for text data, this section leans slightly
towards natural language processing. Nonetheless, those models discussed above can often be
found in more recent models across multiple modalities. For example, (i) Chinchilla (Hoffmann
et al., 2022) was further extended to Flamingo (Alayrac et al., 2022), a visual language model for
few-shot learning; (ii) GPT-2 (Radford et al., 2019) and the vision transformer encode text and im-
ages in CLIP (Contrastive Language-Image Pre-training) (Radford et al., 2021), whose image and
text embeddings were later adopted in the DALL-E 2 text-to-image system (Ramesh et al., 2022).
Although there have been no systematic studies on transformer scalability in multi-modal pretraining
yet, a recent all-transformer text-to-image model, Parti (Yu et al., 2022), shows potential of
scalability across modalities: a larger Parti is more capable of high-fidelity image generation and
content-rich text understanding (Fig. 11.9.12).
Fig. 11.9.12: Image examples generated from the same text by the Parti model of increasing sizes
(350M, 750M, 3B, 20B) (examples taken from (Yu et al., 2022)).
Exercises
1. Is it possible to fine-tune T5 using a minibatch consisting of different tasks? Why or why not?
How about for GPT-2?
2. Given a powerful language model, what applications can you think of?
3. Say that you are asked to fine-tune a language model to perform text classification by adding
additional layers. Where will you add them? Why?
4. Consider sequence to sequence problems (e.g., machine translation) where the input se-
quence is always available throughout the target sequence prediction. What could be limi-
tations of modeling with decoder-only transformers? Why?

12 Optimization Algorithms
If you have read the book in sequence up to this point, you have already used a number of optimization
algorithms to train deep learning models. They were the tools that allowed us to continue updating
model parameters and to minimize the value of the loss function, as evaluated on the training set.
Indeed, anyone content with treating optimization as a black box device to minimize objective
functions in a simple setting might well content oneself with the knowledge that there exists an
array of incantations of such a procedure (with names such as “SGD” and “Adam”).
To do well, however, some deeper knowledge is required. Optimization algorithms are important
for deep learning. On the one hand, training a complex deep learning model can take hours, days,
or even weeks. The performance of the optimization algorithm directly affects the modelʼs training
efficiency. On the other hand, understanding the principles of different optimization algorithms
and the role of their hyperparameters will enable us to tune the hyperparameters in a targeted
manner to improve the performance of deep learning models.
In this chapter, we explore common deep learning optimization algorithms in depth. Almost all
optimization problems arising in deep learning are nonconvex. Nonetheless, the design and anal-
ysis of algorithms in the context of convex problems have proven to be very instructive. It is for
that reason that this chapter includes a primer on convex optimization and the proof for a very
simple stochastic gradient descent algorithm on a convex objective function.
12.1 Optimization and Deep Learning
In this section, we will discuss the relationship between optimization and deep learning as well
as the challenges of using optimization in deep learning. For a deep learning problem, we will
usually define a loss function first. Once we have the loss function, we can use an optimization
algorithm in an attempt to minimize the loss. In optimization, a loss function is often referred to as
the objective function of the optimization problem. By tradition and convention most optimization
algorithms are concerned with minimization. If we ever need to maximize an objective there is a
simple solution: just flip the sign on the objective.
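As a small illustration of flipping the sign, the sketch below maximizes a toy function by running gradient descent on its negation. The objective h, the learning rate, and the number of steps are arbitrary choices made for the example.

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Minimize a 1-D function given its gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Toy objective to maximize: h(x) = 3 - (x - 2)^2, with maximum at x = 2
def h_grad(x):
    return -2.0 * (x - 2.0)

# Maximizing h is the same as minimizing -h, whose gradient is -h'(x)
x_star = gradient_descent(lambda x: -h_grad(x), x0=0.0)
print(x_star)
```

The minimizer of -h coincides with the maximizer of h, so a minimization routine can be reused without modification.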
12.1.1 Goal of Optimization
Although optimization provides a way to minimize the loss function for deep learning, in essence,
the goals of optimization and deep learning are fundamentally different. The former is primarily
concerned with minimizing an objective whereas the latter is concerned with finding a suitable
model, given a finite amount of data. In Section 3.6, we discussed the difference between these
two goals in detail. For instance, training error and generalization error generally differ: since the
objective function of the optimization algorithm is usually a loss function based on the training
dataset, the goal of optimization is to reduce the training error. However, the goal of deep learning
(or more broadly, statistical inference) is to reduce the generalization error. To accomplish the
latter we need to pay attention to overfitting in addition to using the optimization algorithm to
reduce the training error.
%matplotlib inline
import numpy as np
import torch
from mpl_toolkits import mplot3d
from d2l import torch as d2l
To illustrate the aforementioned different goals, letʼs consider the empirical risk and the risk. As
described in Section 4.7.3, the empirical risk is an average loss on the training dataset while the
risk is the expected loss on the entire population of data. Below we define two functions: the risk
function f and the empirical risk function g. Suppose that we have only a finite amount of training
data. As a result, here g is less smooth than f.
def f(x):
    return x * torch.cos(np.pi * x)

def g(x):
    return f(x) + 0.2 * torch.cos(5 * np.pi * x)
The graph below illustrates that the minimum of the empirical risk on a training dataset may be
at a different location from the minimum of the risk (generalization error).
def annotate(text, xy, xytext):  #@save
    d2l.plt.gca().annotate(text, xy=xy, xytext=xytext,
                           arrowprops=dict(arrowstyle='->'))

x = torch.arange(0.5, 1.5, 0.01)
d2l.set_figsize((4.5, 2.5))
d2l.plot(x, [f(x), g(x)], 'x', 'risk')
annotate('min of\nempirical risk', (1.0, -1.2), (0.5, -1.1))
annotate('min of risk', (1.1, -1.05), (0.95, -0.5))
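The gap between the two minimizers can also be checked numerically. The following is a minimal sketch in plain Python (no `torch` or `d2l`), assuming the same definitions of f and g and the same grid as the plot; the names `grid`, `x_min_risk`, and `x_min_empirical` are introduced here for illustration only.

```python
import math

def f(x):  # Risk, as defined in the text
    return x * math.cos(math.pi * x)

def g(x):  # Empirical risk, as defined in the text
    return f(x) + 0.2 * math.cos(5 * math.pi * x)

# Same grid as the plot: 0.50, 0.51, ..., 1.49
grid = [0.5 + 0.01 * i for i in range(100)]

# Grid point with the smallest value of each function
x_min_risk = min(grid, key=f)
x_min_empirical = min(grid, key=g)

print(f'argmin of risk: {x_min_risk:.2f}')
print(f'argmin of empirical risk: {x_min_empirical:.2f}')
```

On this grid the empirical risk is minimized slightly to the left of the risk, so a model tuned to the empirical minimizer does not sit at the minimizer of the risk.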

12.1.2 Optimization Challenges in Deep Learning
In this chapter, we are going to focus specifically on the performance of optimization algorithms
in minimizing the objective function, rather than a modelʼs generalization error. In Section 3.1
we distinguished between analytical solutions and numerical solutions in optimization problems.
In deep learning, most objective functions are complicated and do not have analytical solutions.
Instead, we must use numerical optimization algorithms. The optimization algorithms in this
chapter all fall into this category.
There are many challenges in deep learning optimization. Some of the most vexing ones are local
minima, saddle points, and vanishing gradients. Letʼs have a look at them.
Local Minima
For any objective function f (x), if the value of f (x) at x is smaller than the values of f (x) at any
other points in the vicinity of x, then f (x) could be a local minimum. If the value of f (x) at x is
the minimum of the objective function over the entire domain, then f (x) is the global minimum.
For example, given the function
f (x) = x · cos(πx) for − 1.0 ≤ x ≤ 2.0, (12.1.1)
we can approximate the local minimum and global minimum of this function.
x = torch.arange(-1.0, 2.0, 0.01)

d2l.plot(x, [f(x), ], 'x', 'f(x)')
annotate('local minimum', (-0.3, -0.25), (-0.77, -1.0))
annotate('global minimum', (1.1, -0.95), (0.6, 0.8))

The objective function of deep learning models usually has many local optima. When the numerical
solution of an optimization problem is near a local optimum, the solution obtained by the final
iteration may only minimize the objective function locally, rather than globally, because the
gradient of the objective function approaches or becomes zero there. Only some degree of noise
might knock the parameter out of the local minimum. In fact, this is one of the beneficial
properties of minibatch stochastic gradient descent where the natural variation of gradients
over minibatches is able to dislodge the parameters from local minima.
Saddle Points
Besides local minima, saddle points are another reason for gradients to vanish. A saddle point
is any location where all gradients of a function vanish but which is neither a local minimum
nor a local maximum. Consider the function $f(x) = x^3$. Its first and second derivatives vanish
for x = 0. Optimization might stall at this point, even though it is not a minimum.
x = torch.arange(-2.0, 2.0, 0.01)

d2l.plot(x, [x**3], 'x', 'f(x)')
annotate('saddle point', (0, -0.2), (-0.52, -5.0))
Saddle points in higher dimensions are even more insidious, as the example below shows. Consider
the function $f(x, y) = x^2 - y^2$. It has its saddle point at (0, 0). This is a maximum with
respect to y and a minimum with respect to x. Moreover, it looks like a saddle, which is where this
mathematical property got its name.
# Passing `indexing='ij'` keeps the current default behavior and avoids the
# `torch.meshgrid` deprecation warning
x, y = torch.meshgrid(
    torch.linspace(-1.0, 1.0, 101), torch.linspace(-1.0, 1.0, 101),
    indexing='ij')
z = x**2 - y**2

ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z, **{'rstride': 10, 'cstride': 10})
ax.plot([0], [0], [0], 'rx')
ticks = [-1, 0, 1]
d2l.plt.xticks(ticks)
d2l.plt.yticks(ticks)
ax.set_zticks(ticks)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y');
We assume that the input of a function is a k-dimensional vector and its output is a scalar, so its
Hessian matrix will have k eigenvalues. The solution of the function could be a local minimum, a
local maximum, or a saddle point at a position where the function gradient is zero:
• When the eigenvalues of the functionʼs Hessian matrix at the zero-gradient position are all
positive, we have a local minimum for the function.
• When the eigenvalues of the functionʼs Hessian matrix at the zero-gradient position are all
negative, we have a local maximum for the function.
• When the eigenvalues of the functionʼs Hessian matrix at the zero-gradient position are neg-
ative and positive, we have a saddle point for the function.
For high-dimensional problems the likelihood that at least some of the eigenvalues are negative
is quite high. This makes saddle points more likely than local minima. We will discuss some ex-
ceptions to this situation in the next section when introducing convexity. In short, convex func-
tions are those where the eigenvalues of the Hessian are never negative. Sadly, though, most deep
learning problems do not fall into this category. Nonetheless it is a great tool to study optimization
algorithms.
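The eigenvalue test above can be tried on small examples. The sketch below is plain Python and makes two simplifying assumptions that are not part of the text: the Hessian is approximated by central finite differences, and the function has only two inputs, so that the eigenvalues of the symmetric 2 × 2 Hessian have a closed form. The helper names `hessian_2d`, `eigenvalues_sym_2x2`, and `classify` are hypothetical.

```python
import math

def hessian_2d(func, x, y, h=1e-4):
    """Central finite-difference estimates (fxx, fxy, fyy) of the Hessian at (x, y)."""
    fxx = (func(x + h, y) - 2 * func(x, y) + func(x - h, y)) / h**2
    fyy = (func(x, y + h) - 2 * func(x, y) + func(x, y - h)) / h**2
    fxy = (func(x + h, y + h) - func(x + h, y - h)
           - func(x - h, y + h) + func(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy

def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b**2)
    return mean - radius, mean + radius

def classify(func, x, y, tol=1e-3):
    """Classify a zero-gradient point by the signs of the Hessian eigenvalues."""
    lo, hi = eigenvalues_sym_2x2(*hessian_2d(func, x, y))
    if lo > tol:
        return 'local minimum'
    if hi < -tol:
        return 'local maximum'
    if lo < -tol and hi > tol:
        return 'saddle point'
    return 'inconclusive'  # Some eigenvalue is (numerically) zero

print(classify(lambda x, y: x**2 - y**2, 0.0, 0.0))
print(classify(lambda x, y: x**2 + y**2, 0.0, 0.0))
```

The `'inconclusive'` branch covers zero eigenvalues, where the second-order test alone cannot decide, as with $f(x) = x^3$ at x = 0.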
Vanishing Gradients
Probably the most insidious problem to encounter is the vanishing gradient. Recall our
commonly-used activation functions and their derivatives in Section 5.1.2. For instance, assume
that we want to minimize the function f (x) = tanh(x) and we happen to get started at x = 4.
As we can see, the gradient of f is close to nil. More specifically, $f'(x) = 1 - \tanh^2(x)$ and thus
$f'(4) = 0.0013$. Consequently, optimization will get stuck for a long time before we make progress.
This turns out to be one of the reasons that training deep learning models was quite tricky prior
to the introduction of the ReLU activation function.
x = torch.arange(-2.0, 5.0, 0.01)

d2l.plot(x, [torch.tanh(x)], 'x', 'f(x)')
annotate('vanishing gradient', (4, 1), (2, 0.0))
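The derivative value quoted above is easy to reproduce. The following sketch evaluates $1 - \tanh^2(x)$ at a few points using only the standard library and contrasts it with the derivative of ReLU for positive inputs; the helper names `tanh_grad` and `relu_grad` are introduced here for illustration.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1 - math.tanh(x) ** 2

def relu_grad(x):
    """Derivative of ReLU away from 0: 1 for positive inputs, 0 for negative ones."""
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 4.0]:
    print(f'x = {x}: tanh grad = {tanh_grad(x):.4f}, relu grad = {relu_grad(x):.1f}')
```

At x = 4 the tanh derivative is roughly 0.0013, so a gradient step moves x by only about a thousandth of the learning rate, whereas the ReLU derivative stays at 1 for any positive input.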
As we saw, optimization for deep learning is full of challenges. Fortunately there exists a robust
range of algorithms that perform well and that are easy to use even for beginners. Furthermore,
it is not really necessary to find the best solution. Local optima or even approximate solutions
thereof are still very useful.
12.1.3 Summary
• Minimizing the training error does not guarantee that we find the best set of parameters to
minimize the generalization error.
• The optimization problems may have many local minima.
• The problem may have even more saddle points, as generally the problems are not convex.
• Vanishing gradients can cause optimization to stall. Often a reparameterization of the prob-
lem helps. Good initialization of the parameters can be beneficial, too.

Exercises
1. Consider a simple MLP with a single hidden layer of, say, d dimensions in the hidden layer
and a single output. Show that for any local minimum there are at least d! equivalent solu-
tions that behave identically.
2. Assume that we have a symmetric random matrix M where the entries Mij = Mji are each
drawn from some probability distribution pij . Furthermore assume that pij (x) = pij (−x),
i.e., that the distribution is symmetric (see e.g., (Wigner, 1958) for details).
1. Prove that the distribution over eigenvalues is also symmetric. That is, for any eigen-
vector v the probability that the associated eigenvalue λ satisfies P (λ > 0) = P (λ < 0).
2. Why does the above not imply P (λ > 0) = 0.5?
3. What other challenges involved in deep learning optimization can you think of?
4. Assume that you want to balance a (real) ball on a (real) saddle.
1. Why is this hard?
2. Can you exploit this effect also for optimization algorithms?
12.2 Convexity
Convexity plays a vital role in the design of optimization algorithms. This is largely due to the
fact that it is much easier to analyze and test algorithms in such a context. In other words, if
the algorithm performs poorly even in the convex setting, typically we should not hope to see
great results otherwise. Furthermore, even though the optimization problems in deep learning
are generally nonconvex, they often exhibit some properties of convex ones near local minima.
This can lead to exciting new optimization variants such as (Izmailov et al., 2018).
%matplotlib inline
import numpy as np
import torch
from d2l import torch as d2l
12.2.1 Definitions
Before convex analysis, we need to define convex sets and convex functions. They lead to mathe-
matical tools that are commonly applied to machine learning.

Convex Sets
Sets are the basis of convexity. Simply put, a set X in a vector space is convex if for any a, b ∈ X
the line segment connecting a and b is also in X . In mathematical terms this means that for all
λ ∈ [0, 1] we have
λa + (1 − λ)b ∈ X whenever a, b ∈ X . (12.2.1)
This sounds a bit abstract. Consider Fig. 12.2.1. The first set is not convex since there exist line
segments that are not contained in it. The other two sets suffer no such problem.
Fig. 12.2.1: The first set is nonconvex and the other two are convex.
Definitions on their own are not particularly useful unless you can do something with them. In
this case we can look at intersections as shown in Fig. 12.2.2. Assume that X and Y are convex
sets. Then X ∩ Y is also convex. To see this, consider any a, b ∈ X ∩ Y. Since X and Y are convex,
the line segments connecting a and b are contained in both X and Y. Given that, they also need
to be contained in X ∩ Y, thus proving our theorem.
Fig. 12.2.2: The intersection between two convex sets is convex.
We can strengthen this result with little effort: given convex sets Xi, their intersection ∩i Xi is
convex. The analogous statement for unions does not hold. To see this, consider two disjoint convex
sets, X ∩ Y = ∅, such as the ones in Fig. 12.2.3. Now pick a ∈ X and b ∈ Y. The line segment in
Fig. 12.2.3 connecting a and b contains some part that is neither in X nor in Y. Hence the line
segment is not contained in X ∪ Y, which shows that in general unions of convex sets need not be
convex.

Fig. 12.2.3: The union of two convex sets need not be convex.
Typically the problems in deep learning are defined on convex sets. For instance, $\mathbb{R}^d$, the
set of d-dimensional vectors of real numbers, is a convex set (after all, the line between any two
points in $\mathbb{R}^d$ remains in $\mathbb{R}^d$). In some cases we work with variables of bounded
length, such as balls of radius r as defined by $\{x \mid x \in \mathbb{R}^d \text{ and } \|x\| \leq r\}$.
Convex Functions
Now that we have convex sets we can introduce convex functions f . Given a convex set X , a function
f : X → R is convex if for all x, x′ ∈ X and for all λ ∈ [0, 1] we have
λf (x) + (1 − λ)f (x′ ) ≥ f (λx + (1 − λ)x′ ). (12.2.2)
To illustrate this letʼs plot a few functions and check which ones satisfy the requirement. Below
we define a few functions, both convex and nonconvex.
f = lambda x: 0.5 * x**2  # Convex
g = lambda x: torch.cos(np.pi * x)  # Nonconvex
h = lambda x: torch.exp(0.5 * x)  # Convex

x, segment = torch.arange(-2, 2, 0.01), torch.tensor([-1.5, 1])
_, axes = d2l.plt.subplots(1, 3, figsize=(9, 3))
for ax, func in zip(axes, [f, g, h]):
    d2l.plot([x, segment], [func(x), func(segment)], axes=ax)

As expected, the cosine function is nonconvex, whereas the parabola and the exponential function
are. Note that the requirement that X is a convex set is necessary for the condition to make sense.
Otherwise the outcome of f (λx + (1 − λ)x′ ) might not be well defined.
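The defining inequality (12.2.2) can also be probed numerically rather than read off a plot. The sketch below checks it for λ = 1/2 on all pairs of points from a grid over [−2, 2], using standard-library versions of the three functions above. This is a necessary condition only: passing it on a finite grid does not prove convexity, while a single failure disproves it. The helper name `passes_midpoint_test` and the tolerance are assumptions made for illustration.

```python
import math

def passes_midpoint_test(func, points, tol=1e-9):
    """Check 0.5 * f(a) + 0.5 * f(b) >= f((a + b) / 2) for all pairs of points."""
    return all(0.5 * func(a) + 0.5 * func(b) >= func(0.5 * (a + b)) - tol
               for a in points for b in points)

points = [-2 + 0.1 * i for i in range(41)]  # -2.0, -1.9, ..., 2.0

f = lambda x: 0.5 * x**2                # Convex
g = lambda x: math.cos(math.pi * x)     # Nonconvex
h = lambda x: math.exp(0.5 * x)         # Convex

for name, func in [('f', f), ('g', g), ('h', h)]:
    print(name, passes_midpoint_test(func, points))
```

The cosine fails, for example, on the pair a = −0.4, b = 0.4, where the chord lies below the function value at the midpoint 0.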
Jensen’s Inequality
Given a convex function f , one of the most useful mathematical tools is Jensen’s inequality. It
amounts to a generalization of the definition of convexity:
$$\sum_i \alpha_i f(x_i) \geq f\left(\sum_i \alpha_i x_i\right) \quad \text{and} \quad E_X[f(X)] \geq f\left(E_X[X]\right), \tag{12.2.3}$$
where αi are nonnegative real numbers such that $\sum_i \alpha_i = 1$ and X is a random variable. In other
words, the expectation of a convex function is no less than the convex function of an expectation,
where the latter is usually a simpler expression. To prove the first inequality we repeatedly apply
the definition of convexity to one term in the sum at a time.
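A quick numerical illustration of Jensen's inequality: for an empirical sample with uniform weights αi = 1/n, the first form of (12.2.3) says that the average of f over the sample is at least f of the sample average. The sketch below checks this for the convex function exp(0.5x); the Gaussian sample, the seed, and the variable names are arbitrary choices made here, not part of the text.

```python
import math
import random

random.seed(0)  # Arbitrary seed, only for reproducibility
samples = [random.gauss(0.0, 1.0) for _ in range(1000)]

f = lambda x: math.exp(0.5 * x)  # Convex function

mean_of_f = sum(f(x) for x in samples) / len(samples)  # Average of f(x_i)
f_of_mean = f(sum(samples) / len(samples))             # f of the average

print(f'mean of f: {mean_of_f:.4f}, f of mean: {f_of_mean:.4f}')
print(mean_of_f >= f_of_mean)
```

Because the inequality holds for any choice of nonnegative weights summing to one, the comparison comes out the same way for every sample, not just on average.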
One of the common applications of Jensenʼs inequality is to bound a more complicated expression
by a simpler one. For example, its application can be with regard to the log-likelihood of partially
observed random variables. That is, we use
$$E_{Y \sim P(Y)}[-\log P(X \mid Y)] \geq -\log P(X), \tag{12.2.4}$$
since $\int P(Y) P(X \mid Y) \, dY = P(X)$. This can be used in variational methods. Here Y is typically
the unobserved random variable, P (Y ) is the best guess of how it might be distributed, and P (X)
is the distribution with Y integrated out. For instance, in clustering Y might be the cluster labels
and P (X | Y ) is the generative model when applying cluster labels.
12.2.2 Properties
Convex functions have many useful properties. We describe a few commonly-used ones below.
Local Minima Are Global Minima
First and foremost, the local minima of convex functions are also the global minima. We can
prove it by contradiction as follows.
Consider a convex function f defined on a convex set X . Suppose that x∗ ∈ X is a local minimum:
there exists a small positive value p so that for x ∈ X that satisfies 0 < |x − x∗ | ≤ p we have
f (x∗ ) < f (x).
Assume that the local minimum x∗ is not the global minimum of f: there exists x′ ∈ X for which
f(x′) < f(x∗). There also exists λ ∈ [0, 1), such as $\lambda = 1 - \frac{p}{|x^* - x'|}$, so that
$0 < |\lambda x^* + (1 - \lambda) x' - x^*| \leq p$.
However, according to the definition of convex functions, we have
$$\begin{aligned}
f(\lambda x^* + (1 - \lambda) x') &\leq \lambda f(x^*) + (1 - \lambda) f(x') \\
&< \lambda f(x^*) + (1 - \lambda) f(x^*) \\
&= f(x^*),
\end{aligned} \tag{12.2.5}$$
which contradicts our statement that x∗ is a local minimum. Therefore, there does not exist
x′ ∈ X for which f (x′ ) < f (x∗ ). The local minimum x∗ is also the global minimum.

For instance, the convex function $f(x) = (x - 1)^2$ has a local minimum at x = 1, which is also the
global minimum.
f = lambda x: (x - 1) ** 2
d2l.set_figsize()
d2l.plot([x, segment], [f(x), f(segment)], 'x', 'f(x)')
The fact that the local minima for convex functions are also the global minima is very convenient.
It means that if we minimize functions we cannot “get stuck”. Note, though, that this does not
mean that there cannot be more than one global minimum, or that a global minimum exists at all. For
instance, the function f (x) = max(|x| − 1, 0) attains its minimum value over the interval [−1, 1].
Conversely, the function f (x) = exp(x) does not attain a minimum value on R: for x → −∞ it
asymptotes to 0, but there is no x for which f (x) = 0.
Below Sets of Convex Functions Are Convex
We can conveniently define convex sets via below sets of convex functions. Concretely, given a
convex function f defined on a convex set X , any below set
Sb := {x|x ∈ X and f (x) ≤ b} (12.2.6)
is convex.
Letʼs prove this quickly. Recall that for any x, x′ ∈ Sb we need to show that λx + (1 − λ)x′ ∈ Sb as
long as λ ∈ [0, 1]. Since f (x) ≤ b and f (x′ ) ≤ b, by the definition of convexity we have
f (λx + (1 − λ)x′ ) ≤ λf (x) + (1 − λ)f (x′ ) ≤ b. (12.2.7)

Convexity and Second Derivatives
Whenever the second derivative of a function $f: \mathbb{R}^n \to \mathbb{R}$ exists it is very easy to
check whether f is convex. All we need to do is check whether the Hessian of f is positive
semidefinite: $\nabla^2 f \succeq 0$, i.e., denoting the Hessian matrix $\nabla^2 f$ by H,
$x^\top H x \geq 0$ for all $x \in \mathbb{R}^n$. For instance, the function
$f(x) = \frac{1}{2} \|x\|^2$ is convex since $\nabla^2 f = I$, i.e., its Hessian is the identity matrix.
Formally, a twice-differentiable one-dimensional function $f: \mathbb{R} \to \mathbb{R}$ is convex if
and only if its second derivative $f'' \geq 0$. For any twice-differentiable multi-dimensional
function $f: \mathbb{R}^n \to \mathbb{R}$, it is convex if and only if its Hessian $\nabla^2 f \succeq 0$.
First, we need to prove the one-dimensional case. To see that convexity of f implies f ′′ ≥ 0 we
use the fact that
$$\frac{1}{2} f(x + \epsilon) + \frac{1}{2} f(x - \epsilon) \geq f\left(\frac{x + \epsilon}{2} + \frac{x - \epsilon}{2}\right) = f(x). \tag{12.2.8}$$
Since the second derivative is given by the limit over finite differences it follows that
$$f''(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) + f(x - \epsilon) - 2 f(x)}{\epsilon^2} \geq 0. \tag{12.2.9}$$
To see that f ′′ ≥ 0 implies that f is convex we use the fact that f ′′ ≥ 0 implies that f ′ is a mono-
tonically nondecreasing function. Let a < x < b be three points in R, where x = (1 − λ)a + λb and
λ ∈ (0, 1). According to the mean value theorem, there exist α ∈ [a, x] and β ∈ [x, b] such that
$$f'(\alpha) = \frac{f(x) - f(a)}{x - a} \quad \text{and} \quad f'(\beta) = \frac{f(b) - f(x)}{b - x}. \tag{12.2.10}$$
By monotonicity f ′ (β) ≥ f ′ (α), hence
$$\frac{x - a}{b - a} f(b) + \frac{b - x}{b - a} f(a) \geq f(x). \tag{12.2.11}$$
Since x = (1 − λ)a + λb, we have
λf (b) + (1 − λ)f (a) ≥ f ((1 − λ)a + λb), (12.2.12)
thus proving convexity.

Second, we need a lemma before proving the multi-dimensional case: $f: \mathbb{R}^n \to \mathbb{R}$ is
convex if and only if for all $x, y \in \mathbb{R}^n$
$$g(z) \stackrel{\text{def}}{=} f(z x + (1 - z) y) \text{ where } z \in [0, 1] \tag{12.2.13}$$
is convex.
To prove that convexity of f implies that g is convex, we can show that for all a, b, λ ∈ [0, 1] (thus
0 ≤ λa + (1 − λ)b ≤ 1)
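$$\begin{aligned}
& g(\lambda a + (1 - \lambda) b) \\
=\; & f\left(\left(\lambda a + (1 - \lambda) b\right) x + \left(1 - \lambda a - (1 - \lambda) b\right) y\right) \\
=\; & f\left(\lambda \left(a x + (1 - a) y\right) + (1 - \lambda) \left(b x + (1 - b) y\right)\right) \\
\leq\; & \lambda f\left(a x + (1 - a) y\right) + (1 - \lambda) f\left(b x + (1 - b) y\right) \\
=\; & \lambda g(a) + (1 - \lambda) g(b).
\end{aligned} \tag{12.2.14}$$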

To prove the converse, we can show that for all λ ∈ [0, 1]
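$$\begin{aligned}
& f(\lambda x + (1 - \lambda) y) \\
=\; & g(\lambda \cdot 1 + (1 - \lambda) \cdot 0) \\
\leq\; & \lambda g(1) + (1 - \lambda) g(0) \\
=\; & \lambda f(x) + (1 - \lambda) f(y).
\end{aligned} \tag{12.2.15}$$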
Finally, using the lemma above and the result of the one-dimensional case, the multi-dimensional
case can be proven as follows. A multi-dimensional function $f: \mathbb{R}^n \to \mathbb{R}$ is convex
if and only if for all $x, y \in \mathbb{R}^n$, $g(z) \stackrel{\text{def}}{=} f(z x + (1 - z) y)$, where
z ∈ [0, 1], is convex. According to the one-dimensional case, this holds if and only if
$g'' = (x - y)^\top H (x - y) \geq 0$ ($H \stackrel{\text{def}}{=} \nabla^2 f$) for all
$x, y \in \mathbb{R}^n$, which is equivalent to $H \succeq 0$ per the definition of positive
semidefinite matrices.
12.2.3 Constraints
One of the nice properties of convex optimization is that it allows us to handle constraints effi-
ciently. That is, it allows us to solve constrained optimization problems of the form:
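$$\begin{aligned}
\underset{x}{\mathrm{minimize}} \quad & f(x) \\
\text{subject to} \quad & c_i(x) \leq 0 \text{ for all } i \in \{1, \ldots, n\},
\end{aligned} \tag{12.2.16}$$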
where f is the objective and the functions ci are constraint functions. To see what this does
consider the case where $c_1(x) = \|x\|_2 - 1$. In this case the parameters x are constrained to the
unit ball. If a second constraint is $c_2(x) = v^\top x + b$, then this corresponds to all x lying in
a half-space. Satisfying both constraints simultaneously amounts to selecting a slice of a ball.
Lagrangian
In general, solving a constrained optimization problem is difficult. One way of addressing it stems
from physics with a rather simple intuition. Imagine a ball inside a box. The ball will roll to the
place that is lowest and the forces of gravity will be balanced out with the forces that the sides of
the box can impose on the ball. In short, the gradient of the objective function (i.e., gravity) will be
offset by the gradient of the constraint function (the ball needs to remain inside the box by virtue
of the walls “pushing back”). Note that some constraints may not be active: the walls that are not
touched by the ball will not be able to exert any force on the ball.
Skipping over the derivation of the Lagrangian L, the above reasoning can be expressed via the
following saddle point optimization problem:
$$L(x, \alpha_1, \ldots, \alpha_n) = f(x) + \sum_{i=1}^n \alpha_i c_i(x) \text{ where } \alpha_i \geq 0. \tag{12.2.17}$$
Here the variables αi (i = 1, . . . , n) are the so-called Lagrange multipliers that ensure that con-
straints are properly enforced. They are chosen just large enough to ensure that ci (x) ≤ 0 for all
i. For instance, for any x where ci (x) < 0 naturally, weʼd end up picking αi = 0. Moreover, this
is a saddle point optimization problem where one wants to maximize L with respect to all αi and
simultaneously minimize it with respect to x. There is a rich body of literature explaining how to
arrive at the function L(x, α1 , . . . , αn ). For our purposes it is sufficient to know that the saddle
point of L is where the original constrained optimization problem is solved optimally.
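As a small worked example of the saddle point formulation (a toy problem chosen here for illustration, not one from the text), consider minimizing $f(x) = x^2$ subject to the single constraint $c(x) = 1 - x \leq 0$, i.e., $x \geq 1$:

```latex
\begin{aligned}
L(x, \alpha) &= x^2 + \alpha (1 - x), \quad \alpha \geq 0, \\
\partial_x L(x, \alpha) &= 2x - \alpha = 0 \;\Longrightarrow\; x = \tfrac{\alpha}{2}, \\
L\left(\tfrac{\alpha}{2}, \alpha\right) &= \tfrac{\alpha^2}{4} + \alpha - \tfrac{\alpha^2}{2} = \alpha - \tfrac{\alpha^2}{4}, \\
\partial_\alpha \left(\alpha - \tfrac{\alpha^2}{4}\right) &= 1 - \tfrac{\alpha}{2} = 0 \;\Longrightarrow\; \alpha = 2, \quad x = 1.
\end{aligned}
```

Minimizing over x and then maximizing over α recovers the constrained solution x = 1, where the constraint is active (c(1) = 0) and the multiplier α = 2 is strictly positive. Had the constraint been x ≥ −1 instead, the unconstrained minimizer x = 0 would already satisfy it and the maximization over α ≥ 0 would pick α = 0.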

Penalties
One way of satisfying constrained optimization problems at least approximately is to adapt the
Lagrangian L. Rather than satisfying ci (x) ≤ 0 we simply add αi ci (x) to the objective function
f (x). This ensures that the constraints will not be violated too badly.
In fact, we have been using this trick all along. Consider weight decay in Section 3.7. In it we add
$\frac{\lambda}{2} \|w\|^2$ to the objective function to ensure that w does not grow too large. From
the constrained optimization point of view we can see that this will ensure that
$\|w\|^2 - r^2 \leq 0$ for some radius r. Adjusting the value of λ allows us to vary the size of w.
In general, adding penalties is a good way of ensuring approximate constraint satisfaction. In
practice this turns out to be much more robust than exact satisfaction. Furthermore, for noncon-
vex problems many of the properties that make the exact approach so appealing in the convex
case (e.g., optimality) no longer hold.
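The effect of the penalty strength can be seen on a toy problem with a closed-form solution. The sketch below assumes a simple quadratic objective, $\frac{1}{2}\|w - t\|^2 + \frac{\lambda}{2}\|w\|^2$, chosen here only for illustration; setting its gradient $(w - t) + \lambda w$ to zero gives $w = t / (1 + \lambda)$. The names `penalized_minimizer`, `norm`, and `target` are hypothetical.

```python
import math

def penalized_minimizer(target, lam):
    """Minimizer of 0.5 * ||w - target||^2 + (lam / 2) * ||w||^2, i.e., target / (1 + lam)."""
    return [t / (1 + lam) for t in target]

def norm(v):
    """Euclidean norm of a list of numbers."""
    return math.sqrt(sum(vi ** 2 for vi in v))

target = [3.0, 4.0]  # Hypothetical unpenalized optimum with norm 5
for lam in [0.0, 1.0, 4.0]:
    w = penalized_minimizer(target, lam)
    print(f'lambda = {lam}: ||w|| = {norm(w):.2f}')
```

Larger values of λ correspond to smaller radii r in the constrained view, although the penalty never enforces a particular radius exactly.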
Projections
An alternative strategy for satisfying constraints is projections. Again, we encountered them be-
fore, e.g., when dealing with gradient clipping in Section 9.5. There we ensured that a gradient
has length bounded by θ via
g ← g · min(1, θ/∥g∥). (12.2.18)
This turns out to be a projection of g onto the ball of radius θ. More generally, a projection on a
convex set X is defined as
$$\mathrm{Proj}_{\mathcal{X}}(x) = \underset{x' \in \mathcal{X}}{\mathrm{argmin}} \; \|x - x'\|, \tag{12.2.19}$$
which is the closest point in X to x.
Fig. 12.2.4: Convex Projections.
The mathematical definition of projections may sound a bit abstract. Fig. 12.2.4 explains it some-
what more clearly. In it we have two convex sets, a circle and a diamond. Points inside both sets
(yellow) remain unchanged during projections. Points outside both sets (black) are projected to
the points inside the sets (red) that are closest to the original points (black). While for ℓ2 balls this
leaves the direction unchanged, this need not be the case in general, as can be seen in the case of
the diamond.
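For the ℓ2 ball the projection has the same closed form as the clipping rule (12.2.18): points inside the ball are left alone and points outside are rescaled onto its surface. A minimal standard-library sketch, with the helper name `project_l2_ball` and the zero-norm guard introduced here for illustration:

```python
import math

def project_l2_ball(x, radius):
    """Project the vector x (a list of floats) onto the l2 ball of the given radius."""
    norm = math.sqrt(sum(xi ** 2 for xi in x))
    # Same rule as gradient clipping: scale by min(1, radius / ||x||)
    scale = min(1.0, radius / norm) if norm > 0 else 1.0
    return [scale * xi for xi in x]

inside = project_l2_ball([0.3, 0.4], radius=1.0)   # Norm 0.5, already inside
outside = project_l2_ball([3.0, 4.0], radius=1.0)  # Norm 5, lies outside

print(inside)
print(outside)
```

The projected point keeps the direction of the original vector, which is the property that, as noted above, does not carry over to sets such as the diamond.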
One of the uses for convex projections is to compute sparse weight vectors. In this case we project
weight vectors onto an ℓ1 ball, which is a generalized version of the diamond case in Fig. 12.2.4.

12.2.4 Summary
In the context of deep learning the main purpose of convex functions is to motivate optimiza-
tion algorithms and help us understand them in detail. In the following we will see how gradient
descent and stochastic gradient descent can be derived accordingly.
• Intersections of convex sets are convex. Unions need not be.
• The expectation of a convex function is no less than the convex function of an expectation
(Jensenʼs inequality).
• A twice-differentiable function is convex if and only if its Hessian (a matrix of second deriva-
tives) is positive semidefinite.
• Convex constraints can be added via the Lagrangian. In practice we may simply add them
with a penalty to the objective function.
• Projections map to points in the convex set closest to the original points.
Exercises
1. Assume that we want to verify convexity of a set by drawing all lines between points within
the set and checking whether the lines are contained.
1. Prove that it is sufficient to check only the points on the boundary.
2. Prove that it is sufficient to check only the vertices of the set.
2. Denote by $B_p[r] \stackrel{\text{def}}{=} \{x \mid x \in \mathbb{R}^d \text{ and } \|x\|_p \leq r\}$ the
ball of radius r using the p-norm. Prove that $B_p[r]$ is convex for all p ≥ 1.
3. Given convex functions f and g, show that max(f, g) is convex, too. Prove that min(f, g) is
not convex.
4. Prove that the normalization of the softmax function is convex. More specifically prove the
convexity of $f(x) = \log \sum_i \exp(x_i)$.
5. Prove that linear subspaces, i.e., X = {x|Wx = b}, are convex sets.
6. Prove that in the case of linear subspaces with b = 0 the projection ProjX can be written as
Mx for some matrix M.
7. Show that for twice-differentiable convex functions f we can write
$f(x + \epsilon) = f(x) + \epsilon f'(x) + \frac{1}{2} \epsilon^2 f''(x + \xi)$ for some ξ ∈ [0, ϵ].
8. Given a vector w ∈ Rd with ∥w∥1 > 1 compute the projection on the ℓ1 unit ball.
1. As an intermediate step write out the penalized objective ∥w − w′ ∥2 + λ∥w′ ∥1 and com-
pute the solution for a given λ > 0.
2. Can you find the “right” value of λ without a lot of trial and error?
9. Given a convex set X and two vectors x and y, prove that projections never increase dis-
tances, i.e., ∥x − y∥ ≥ ∥ProjX (x) − ProjX (y)∥.

12.3 Gradient Descent
In this section we are going to introduce the basic concepts underlying gradient descent. Although
it is rarely used directly in deep learning, an understanding of gradient descent is key to under-
standing stochastic gradient descent algorithms. For instance, the optimization problem might
diverge due to an overly large learning rate. This phenomenon can already be seen in gradient
descent. Likewise, preconditioning is a common technique in gradient descent and carries over
to more advanced algorithms. Letʼs start with a simple special case.
12.3.1 One-Dimensional Gradient Descent
Gradient descent in one dimension is an excellent example to explain why the gradient descent
algorithm may reduce the value of the objective function. Consider some continuously differen-
tiable real-valued function f : R → R. Using a Taylor expansion we obtain
$$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2). \tag{12.3.1}$$
That is, in first-order approximation f (x+ϵ) is given by the function value f (x) and the first deriva-
tive f ′ (x) at x. It is not unreasonable to assume that for small ϵ moving in the direction of the neg-
ative gradient will decrease f . To keep things simple we pick a fixed step size η > 0 and choose
ϵ = −ηf ′ (x). Plugging this into the Taylor expansion above we get
$$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x)). \tag{12.3.2}$$
If the derivative $f'(x) \neq 0$ does not vanish we make progress since $\eta f'^2(x) > 0$. Moreover, we can
always choose η small enough for the higher-order terms to become irrelevant. Hence we arrive
at
f (x − ηf ′ (x)) ⪅ f (x). (12.3.3)
This means that, if we use
x ← x − ηf ′ (x) (12.3.4)
to iterate x, the value of function f (x) might decline. Therefore, in gradient descent we first choose
an initial value x and a constant η > 0 and then use them to continuously iterate x until the stop
condition is reached, for example, when the magnitude of the gradient |f ′ (x)| is small enough or
the number of iterations has reached a certain value.
For simplicity we choose the objective function $f(x) = x^2$ to illustrate how to implement gradient
descent. Although we know that x = 0 is the solution to minimize f (x), we still use this simple
function to observe how x changes.
%matplotlib inline
import numpy as np
import torch
from d2l import torch as d2l
def f(x):  # Objective function
    return x ** 2

def f_grad(x):  # Gradient (derivative) of the objective function
    return 2 * x

Next, we use x = 10 as the initial value and assume η = 0.2. Using gradient descent to iterate x
10 times we can see that, eventually, the value of x approaches the optimal solution.
def gd(eta, f_grad):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x)
        results.append(float(x))
    print(f'epoch 10, x: {x:f}')
    return results

results = gd(0.2, f_grad)
epoch 10, x: 0.060466
The progress of optimizing over x can be plotted as follows.
def show_trace(results, f):
    n = max(abs(min(results)), abs(max(results)))
    f_line = torch.arange(-n, n, 0.01)
    d2l.set_figsize()
    d2l.plot([f_line, results], [[f(x) for x in f_line], [
        f(x) for x in results]], 'x', 'f(x)', fmts=['-', '-o'])

show_trace(results, f)
Learning Rate
The learning rate η can be set by the algorithm designer. If we use a learning rate that is too small,
it will cause x to update very slowly, requiring more iterations to get a better solution. To show
what happens in such a case, consider the progress in the same optimization problem for η = 0.05.
As we can see, even after 10 steps we are still very far from the optimal solution.
show_trace(gd(0.05, f_grad), f)

epoch 10, x: 3.486784
Conversely, if we use an excessively high learning rate, $|\eta f'(x)|$ might be too large for the first-order
Taylor expansion formula. That is, the term $\mathcal{O}(\eta^2 f'^2(x))$ in (12.3.2) might become significant. In
this case, we cannot guarantee that the iteration of x will be able to lower the value of f (x). For
example, when we set the learning rate to η = 1.1, x overshoots the optimal solution x = 0 and
gradually diverges.
show_trace(gd(1.1, f_grad), f)
epoch 10, x: 61.917364
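For this particular objective the three behaviors can be explained exactly. Since $f'(x) = 2x$, the update $x \leftarrow x - \eta \cdot 2x$ is the same as $x \leftarrow (1 - 2\eta) x$, so after k steps $x = (1 - 2\eta)^k x_0$. The sketch below evaluates this closed form (the helper name `gd_closed_form` is introduced here for illustration) and reproduces the three values printed above.

```python
def gd_closed_form(eta, x0=10.0, steps=10):
    """For f(x) = x**2, gradient descent multiplies x by (1 - 2 * eta) at every step."""
    return x0 * (1 - 2 * eta) ** steps

for eta in [0.2, 0.05, 1.1]:
    print(f'eta = {eta}: x after 10 steps = {gd_closed_form(eta):f}')
```

The iterates shrink whenever |1 − 2η| < 1, i.e., for 0 < η < 1. For η = 1.1 the factor is −1.2, so x flips sign and grows by 20% at every step, which is the overshooting and divergence seen in the plot.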

Local Minima
To illustrate what happens for nonconvex functions consider the case of f (x) = x · cos(cx) for
some constant c. This function has infinitely many local minima. Depending on our choice of the
learning rate and depending on how well conditioned the problem is, we may end up with one
of many solutions. The example below illustrates how an (unrealistically) high learning rate will
lead to a poor local minimum.
c = torch.tensor(0.15 * np.pi)

def f(x):  # Objective function
    return x * torch.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return torch.cos(c * x) - c * x * torch.sin(c * x)

show_trace(gd(2, f_grad), f)
epoch 10, x: -1.528166
12.3.2 Multivariate Gradient Descent
Now that we have a better intuition of the univariate case, letʼs consider the situation where
x = [x1 , x2 , . . . , xd ]⊤ . That is, the objective function f : Rd → R maps vectors into scalars. Corre-
spondingly its gradient is multivariate, too. It is a vector consisting of d partial derivatives:
$$\nabla f(x) = \left[\frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_d}\right]^\top. \tag{12.3.5}$$
Each partial derivative element ∂f (x)/∂xi in the gradient indicates the rate of change of f at x
with respect to the input xi . As before in the univariate case we can use the corresponding Taylor
approximation for multivariate functions to get some idea of what we should do. In particular, we
have that
$$f(x + \epsilon) = f(x) + \epsilon^\top \nabla f(x) + \mathcal{O}(\|\epsilon\|^2). \tag{12.3.6}$$

In other words, up to second-order terms in ϵ the direction of steepest descent is given by the
negative gradient −∇f (x). Choosing a suitable learning rate η > 0 yields the prototypical gradient
descent algorithm:
x ← x − η∇f (x). (12.3.7)
To see how the algorithm behaves in practice letʼs construct an objective function
$f(x) = x_1^2 + 2 x_2^2$ with a two-dimensional vector $x = [x_1, x_2]^\top$ as input and a scalar as
output. The gradient is given by $\nabla f(x) = [2 x_1, 4 x_2]^\top$. We will observe the trajectory of
x by gradient descent from the initial position [−5, −2].
To begin with, we need two more helper functions. The first uses an update function and applies
it 20 times to the initial value. The second helper visualizes the trajectory of x.
def train_2d(trainer, steps=20, f_grad=None):  #@save
    """Optimize a 2D objective function with a customized trainer."""
    # `s1` and `s2` are internal state variables that will be used in
    # Momentum, Adagrad, and RMSProp
    x1, x2, s1, s2 = -5, -2, 0, 0
    results = [(x1, x2)]
    for i in range(steps):
        if f_grad:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2, f_grad)
        else:
            x1, x2, s1, s2 = trainer(x1, x2, s1, s2)
        results.append((x1, x2))
    print(f'epoch {i + 1}, x1: {float(x1):f}, x2: {float(x2):f}')
    return results

def show_trace_2d(f, results):  #@save
    """Show the trace of 2D variables during optimization."""
    d2l.set_figsize()
    d2l.plt.plot(*zip(*results), '-o', color='#ff7f0e')
    x1, x2 = torch.meshgrid(torch.arange(-5.5, 1.0, 0.1),
                            torch.arange(-3.0, 1.0, 0.1))
    d2l.plt.contour(x1, x2, f(x1, x2), colors='#1f77b4')
    d2l.plt.xlabel('x1')
    d2l.plt.ylabel('x2')

Next, we observe the trajectory of the optimization variable x for learning rate η = 0.1. We can see
that after 20 steps the value of x approaches its minimum at [0, 0]. Progress is fairly well-behaved
albeit rather slow.
def f_2d(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_2d_grad(x1, x2):  # Gradient of the objective function
    return (2 * x1, 4 * x2)

def gd_2d(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    return (x1 - eta * g1, x2 - eta * g2, 0, 0)

eta = 0.1
show_trace_2d(f_2d, train_2d(gd_2d, f_grad=f_2d_grad))

epoch 20, x1: -0.057646, x2: -0.000073
12.3.3 Adaptive Methods
As we could see in Section 12.3.1, getting the learning rate η “just right” is tricky. If we pick it too
small, we make little progress. If we pick it too large, the solution oscillates and in the worst case
it might even diverge. What if we could determine η automatically or get rid of having to select
a learning rate at all? Second-order methods that look not only at the value and gradient of the
objective function but also at its curvature can help in this case. While these methods cannot be
applied to deep learning directly due to the computational cost, they provide useful intuition into
how to design advanced optimization algorithms that mimic many of the desirable properties of
the algorithms outlined below.
Newton’s Method
Reviewing the Taylor expansion of some function f : Rd → R there is no need to stop after the
first term. In fact, we can write it as
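$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \nabla^2 f(\mathbf{x}) \boldsymbol{\epsilon} + \mathcal{O}(\|\boldsymbol{\epsilon}\|^3). \tag{12.3.8}$$

To avoid cumbersome notation we define $\mathbf{H} \stackrel{\textrm{def}}{=} \nabla^2 f(\mathbf{x})$ to be the Hessian of $f$, which is a $d \times d$
matrix. For small $d$ and simple problems $\mathbf{H}$ is easy to compute. For deep neural networks, on the
other hand, $\mathbf{H}$ may be prohibitively large, due to the cost of storing $\mathcal{O}(d^2)$ entries. Furthermore it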
may be too expensive to compute via backpropagation. For now letʼs ignore such considerations
and look at what algorithm we would get.
After all, the minimum of f satisfies ∇f = 0. Following calculus rules in Section 2.4.3, by taking
derivatives of (12.3.8) with regard to ϵ and ignoring higher-order terms we arrive at
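$$\nabla f(\mathbf{x}) + \mathbf{H} \boldsymbol{\epsilon} = 0 \textrm{ and hence } \boldsymbol{\epsilon} = -\mathbf{H}^{-1} \nabla f(\mathbf{x}). \tag{12.3.9}$$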

That is, we need to invert the Hessian H as part of the optimization problem.
As a simple example, for $f(x) = \frac{1}{2} x^2$ we have $\nabla f(x) = x$ and $\mathbf{H} = 1$. Hence for any $x$ we obtain
$\epsilon = -x$. In other words, a single step is sufficient to converge perfectly without the need for any
adjustment! Alas, we got a bit lucky here: the Taylor expansion was exact since
$f(x + \epsilon) = \frac{1}{2} x^2 + \epsilon x + \frac{1}{2} \epsilon^2$.
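The one-step claim can be checked directly. The following is a minimal sketch in plain Python rather than the PyTorch code used elsewhere in this section; the names `newton_step`, `quad_grad`, and `quad_hess` are illustrative and not part of the `d2l` package.

```python
def newton_step(x, grad, hess):
    """One Newton update in one dimension: x <- x - f'(x) / f''(x)."""
    return x - grad(x) / hess(x)

# For f(x) = x^2 / 2 we have f'(x) = x and f''(x) = 1
quad_grad = lambda x: x
quad_hess = lambda x: 1.0

# Any starting point lands on the minimizer x = 0 after a single step
for x0 in (10.0, -3.5, 0.25):
    print(x0, '->', newton_step(x0, quad_grad, quad_hess))
```

This only works in one step because the second-order Taylor expansion of a quadratic is exact; for other functions several iterations are needed.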
Letʼs see what happens in other problems. Given a convex hyperbolic cosine function f (x) =
cosh(cx) for some constant c, we can see that the global minimum at x = 0 is reached after a few
iterations.
c = torch.tensor(0.5)

def f(x):  # Objective function
    return torch.cosh(c * x)

def f_grad(x):  # Gradient of the objective function
    return c * torch.sinh(c * x)

def f_hess(x):  # Hessian of the objective function
    return c**2 * torch.cosh(c * x)

def newton(eta=1):
    x = 10.0
    results = [x]
    for i in range(10):
        x -= eta * f_grad(x) / f_hess(x)
        results.append(float(x))
    print('epoch 10, x:', x)
    return results

show_trace(newton(), f)
epoch 10, x: tensor(0.)
Now letʼs consider a nonconvex function, such as f (x) = x cos(cx) for some constant c. After all,
note that in Newtonʼs method we end up dividing by the Hessian. This means that if the second
derivative is negative we may walk into the direction of increasing the value of f . That is a fatal flaw
of the algorithm. Letʼs see what happens in practice.

c = torch.tensor(0.15 * np.pi)

def f(x):  # Objective function
    return x * torch.cos(c * x)

def f_grad(x):  # Gradient of the objective function
    return torch.cos(c * x) - c * x * torch.sin(c * x)

def f_hess(x):  # Hessian of the objective function
    return - 2 * c * torch.sin(c * x) - x * c**2 * torch.cos(c * x)

show_trace(newton(), f)
epoch 10, x: tensor(26.8341)
This went spectacularly wrong. How can we fix it? One way would be to “fix” the Hessian by taking
its absolute value instead. Another strategy is to bring back the learning rate. This seems to defeat
the purpose, but not quite. Having second-order information allows us to be cautious whenever
the curvature is large and to take longer steps whenever the objective function is flatter. Letʼs see
how this works with a slightly smaller learning rate, say η = 0.5. As we can see, we have quite an
efficient algorithm.
show_trace(newton(0.5), f)
epoch 10, x: tensor(7.2699)

Convergence Analysis
We only analyze the convergence rate of Newtonʼs method for some convex and three times dif-
ferentiable objective function f , where the second derivative is nonzero, i.e., f ′′ > 0. The multi-
variate proof is a straightforward extension of the one-dimensional argument below and omitted
since it does not help us much in terms of intuition.
Denote by $x^{(k)}$ the value of $x$ at the $k^\textrm{th}$ iteration and let $e^{(k)} \stackrel{\textrm{def}}{=} x^{(k)} - x^*$ be the distance from
optimality at the $k^\textrm{th}$ iteration. By Taylor expansion we have that the condition $f'(x^*) = 0$ can be
written as

$$0 = f'(x^{(k)} - e^{(k)}) = f'(x^{(k)}) - e^{(k)} f''(x^{(k)}) + \frac{1}{2} (e^{(k)})^2 f'''(\xi^{(k)}), \tag{12.3.10}$$
which holds for some $\xi^{(k)} \in [x^{(k)} - e^{(k)}, x^{(k)}]$. Dividing the above expansion by $f''(x^{(k)})$ yields

$$e^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})} = \frac{1}{2} (e^{(k)})^2 \frac{f'''(\xi^{(k)})}{f''(x^{(k)})}. \tag{12.3.11}$$
Recall that we have the update x(k+1) = x(k) − f ′ (x(k) )/f ′′ (x(k) ). Plugging in this update equation
and taking the absolute value of both sides, we have
$$\left|e^{(k+1)}\right| = \frac{1}{2} (e^{(k)})^2 \frac{\left|f'''(\xi^{(k)})\right|}{f''(x^{(k)})}. \tag{12.3.12}$$

Consequently, whenever we are in a region of bounded $\left|f'''(\xi^{(k)})\right| / (2f''(x^{(k)})) \leq c$, we have a
quadratically decreasing error

$$\left|e^{(k+1)}\right| \leq c (e^{(k)})^2. \tag{12.3.13}$$
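As an aside, optimization researchers call this quadratic convergence, whereas a condition such as
$\left|e^{(k+1)}\right| \leq \alpha \left|e^{(k)}\right|$ with $\alpha < 1$, where the error shrinks by a constant factor per step, would be
called linear convergence. Note that this analysis comes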
with a number of caveats. First, we do not really have much of a guarantee when we will reach the
region of rapid convergence. Instead, we only know that once we reach it, convergence will be
very quick. Second, this analysis requires that f is well-behaved up to higher-order derivatives. It
comes down to ensuring that f does not have any “surprising” properties in terms of how it might
change its values.
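The bound (12.3.13) can be observed numerically. The sketch below reruns Newtonʼs method on $f(x) = \cosh(cx)$ with $c = 0.5$ in plain Python (using `math` instead of `torch` is a simplification made here) and tracks the ratio $|e^{(k+1)}| / (e^{(k)})^2$, which should stay bounded while the error itself collapses. The variable names are illustrative.

```python
import math

c = 0.5
x_star = 0.0  # Minimizer of cosh(c * x)

def newton_update(x):
    """x <- x - f'(x) / f''(x) for f(x) = cosh(c * x)."""
    grad = c * math.sinh(c * x)
    hess = c ** 2 * math.cosh(c * x)
    return x - grad / hess

x = 10.0
errors = [abs(x - x_star)]
for _ in range(10):
    x = newton_update(x)
    errors.append(abs(x - x_star))

# Ratio |e_{k+1}| / e_k^2; skipped once the error is exactly zero
ratios = [nxt / prev ** 2 for prev, nxt in zip(errors, errors[1:]) if prev > 0]
for k, err in enumerate(errors):
    print(f'k={k:2d}, |e|={err:.3e}')
print('largest ratio:', max(ratios))
```

The first few iterations reduce the error only by a roughly constant amount; the rapid phase starts once the iterate is close to the minimum, which matches the caveat that the analysis says nothing about when that region is reached.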

Preconditioning
Quite unsurprisingly computing and storing the full Hessian is very expensive. It is thus desirable
to find alternatives. One way to improve matters is preconditioning. It avoids computing the Hes-
sian in its entirety but only computes the diagonal entries. This leads to update algorithms of the
form
$$\mathbf{x} \leftarrow \mathbf{x} - \eta \, \textrm{diag}(\mathbf{H})^{-1} \nabla f(\mathbf{x}). \tag{12.3.14}$$
While this is not quite as good as the full Newtonʼs method, it is still much better than not using
it. To see why this might be a good idea consider a situation where one variable denotes height in
millimeters and the other one denotes height in kilometers. Assuming that for both the natural
scale is in meters, we have a terrible mismatch in parameterizations. Fortunately, using precon-
ditioning removes this. Effectively preconditioning with gradient descent amounts to selecting a
different learning rate for each variable (coordinate of vector x). As we will see later, precondi-
tioning drives some of the innovation in stochastic gradient descent optimization algorithms.
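To make the scale mismatch concrete, here is a minimal sketch comparing plain gradient descent with the diagonally preconditioned update (12.3.14). The objective $f(\mathbf{x}) = \frac{1}{2}(100 x_1^2 + 0.01 x_2^2)$, the step counts, and all function names are assumptions chosen for illustration; the Hessian is diagonal here, so its diagonal is simply the list of scales.

```python
def grad(x, scales):
    """Gradient of f(x) = 0.5 * sum(a_i * x_i^2)."""
    return [a * xi for a, xi in zip(scales, x)]

def hess_diag(scales):
    """Diagonal of the Hessian of f, which is constant for this quadratic."""
    return list(scales)

def descend(x, scales, eta, steps, precondition):
    x = list(x)
    for _ in range(steps):
        g = grad(x, scales)
        if precondition:
            # Divide each gradient coordinate by the matching curvature
            g = [gi / hi for gi, hi in zip(g, hess_diag(scales))]
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

scales = [100.0, 0.01]   # Badly mismatched coordinate scales
start = [1.0, 1.0]

# Plain gradient descent: eta is limited by the steep coordinate
plain = descend(start, scales, eta=0.01, steps=20, precondition=False)
# Preconditioned: both coordinates shrink at the same rate
precond = descend(start, scales, eta=0.5, steps=20, precondition=True)
print('plain:', plain)
print('preconditioned:', precond)
```

With the plain update the flat coordinate barely moves, since the learning rate has to be small enough for the steep one. Dividing by the diagonal curvature effectively gives each coordinate its own learning rate.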
Gradient Descent with Line Search
One of the key problems in gradient descent is that we might overshoot the goal or make insuf-
ficient progress. A simple fix for the problem is to use line search in conjunction with gradient
descent. That is, we use the direction given by ∇f (x) and then perform binary search as to which
learning rate η minimizes f (x − η∇f (x)).
This algorithm converges rapidly (for an analysis and proof see e.g., (Boyd and Vandenberghe,
2004)). However, for the purpose of deep learning this is not quite so feasible, since each step of
the line search would require us to evaluate the objective function on the entire dataset. This is
way too costly to accomplish.
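For a small deterministic problem line search is cheap, so it can be tried directly. The sketch below applies it to the objective $f(\mathbf{x}) = x_1^2 + 2x_2^2$ used earlier. It realizes the binary search as bisection on the sign of the directional derivative $\phi'(\eta)$ of $\phi(\eta) = f(\mathbf{x} - \eta \nabla f(\mathbf{x}))$, which is one possible choice and assumes a convex objective and a bracket $[0, \eta_{\max}]$ that contains the minimizer. The function names and the value $\eta_{\max} = 1$ are illustrative.

```python
def objective(x1, x2):
    return x1 ** 2 + 2 * x2 ** 2

def objective_grad(x1, x2):
    return 2 * x1, 4 * x2

def line_search(x, g, eta_max=1.0, iters=50):
    """Bisect on the sign of phi'(eta), where phi(eta) = f(x - eta * g)."""
    lo, hi = 0.0, eta_max
    for _ in range(iters):
        mid = (lo + hi) / 2
        g_mid = objective_grad(x[0] - mid * g[0], x[1] - mid * g[1])
        # phi'(mid) = -<g, grad f(x - mid * g)>
        slope = -(g[0] * g_mid[0] + g[1] * g_mid[1])
        if slope < 0:
            lo = mid  # f still decreases along -g, so step further
        else:
            hi = mid  # Past the minimizer along -g, so step back
    return (lo + hi) / 2

x = (-5.0, -2.0)
for step in range(10):
    g = objective_grad(*x)
    eta = line_search(x, g)
    x = (x[0] - eta * g[0], x[1] - eta * g[1])
print(x, objective(*x))
```

Each outer step here needs 50 gradient evaluations for the inner search, which illustrates why the approach becomes impractical when every evaluation requires a pass over the full dataset.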
12.3.4 Summary
• Learning rates matter. Too large and we diverge, too small and we do not make progress.
• Gradient descent can get stuck in local minima.
• In high dimensions adjusting the learning rate is complicated.
• Preconditioning can help with scale adjustment.
• Newtonʼs method is a lot faster once it has started working properly in convex problems.
• Beware of using Newtonʼs method without any adjustments for nonconvex problems.
Exercises
1. Experiment with different learning rates and objective functions for gradient descent.
2. Implement line search to minimize a convex function in the interval [a, b].
   1. Do you need derivatives for binary search, i.e., to decide whether to pick [a, (a + b)/2]
      or [(a + b)/2, b]?
   2. How rapid is the rate of convergence for the algorithm?
   3. Implement the algorithm and apply it to minimizing log(exp(x) + exp(−2x − 3)).
3. Design an objective function defined on $\mathbb{R}^2$ where gradient descent is exceedingly slow. Hint:
   scale different coordinates differently.
4. Implement the lightweight version of Newtonʼs method using preconditioning:
   1. Use diagonal Hessian as preconditioner.
   2. Use the absolute values of that rather than the actual (possibly signed) values.
   3. Apply this to the problem above.
5. Apply the algorithm above to a number of objective functions (convex or not). What happens
if you rotate coordinates by 45 degrees?
Discussions
12.4 Stochastic Gradient Descent
In earlier chapters we kept using stochastic gradient descent in our training procedure without
explaining why it works. To shed some light on it, we just described the basic principles
of gradient descent in Section 12.3. In this section, we go on to discuss stochastic gradient descent
in greater detail.
%matplotlib inline
import math
import torch
from d2l import torch as d2l
12.4.1 Stochastic Gradient Updates
In deep learning, the objective function is usually the average of the loss functions for each exam-
ple in the training dataset. Given a training dataset of n examples, we assume that fi (x) is the loss
function with respect to the training example of index i, where x is the parameter vector. Then
we arrive at the objective function
$$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}). \tag{12.4.1}$$
The gradient of the objective function at x is computed as
$$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}). \tag{12.4.2}$$
If gradient descent is used, the computational cost for each independent variable iteration is O(n),
which grows linearly with n. Therefore, when the training dataset is larger, the cost of gradient
descent for each iteration will be higher.
Stochastic gradient descent (SGD) reduces computational cost at each iteration. At each iteration
of stochastic gradient descent, we uniformly sample an index i ∈ {1, . . . , n} for data examples at
random, and compute the gradient ∇fi (x) to update x:
x ← x − η∇fi (x), (12.4.3)
where η is the learning rate. We can see that the computational cost for each iteration drops
from O(n) of the gradient descent to the constant O(1). Moreover, we want to emphasize that the
stochastic gradient ∇fi (x) is an unbiased estimate of the full gradient ∇f (x) because
$$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}). \tag{12.4.4}$$
This means that, on average, the stochastic gradient is a good estimate of the gradient.
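The unbiasedness in (12.4.4) can also be checked empirically. The sketch below assumes a toy one-dimensional loss $f_i(x) = \frac{1}{2}(x - a_i)^2$ with made-up targets $a_i$, samples an index uniformly at random many times, and compares the average stochastic gradient with the full gradient. The data, the seed, and the names are illustrative.

```python
import random

targets = [1.0, 2.0, 4.0, 9.0]  # Hypothetical data a_i

def grad_i(x, i):
    """Gradient of f_i(x) = 0.5 * (x - a_i)^2."""
    return x - targets[i]

def full_grad(x):
    """Gradient of the average loss, as in (12.4.2)."""
    return sum(grad_i(x, i) for i in range(len(targets))) / len(targets)

rng = random.Random(0)  # Fixed seed for reproducibility
x = 3.0
draws = 100_000
sgd_mean = sum(grad_i(x, rng.randrange(len(targets)))
               for _ in range(draws)) / draws
print('full gradient:', full_grad(x))
print('mean of sampled gradients:', sgd_mean)
```

A single sampled gradient can be far from the full gradient (here it ranges from -6 to 2), yet the average over many draws agrees with it closely.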
Now, we will compare it with gradient descent by adding random noise with a mean of 0 and a
variance of 1 to the gradient to simulate a stochastic gradient descent.
def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2

def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient
    g1 += torch.normal(0.0, 1, (1,))
    g2 += torch.normal(0.0, 1, (1,))
    eta_t = eta * lr()
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)

def constant_lr():
    return 1

eta = 0.1
lr = constant_lr  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
epoch 50, x1: -0.104549, x2: 0.060831


As we can see, the trajectory of the variables in the stochastic gradient descent is much more
noisy than the one we observed in gradient descent in Section 12.3. This is due to the stochastic
nature of the gradient. That is, even when we arrive near the minimum, we are still subject to
the uncertainty injected by the instantaneous gradient via η∇fi (x). Even after 50 steps the quality
is still not so good. Even worse, it will not improve after additional steps (we encourage you to
experiment with a larger number of steps to confirm this). This leaves us with the only alternative:
change the learning rate η. However, if we pick this too small, we will not make any meaningful
progress initially. On the other hand, if we pick it too large, we will not get a good solution, as seen
above. The only way to resolve these conflicting goals is to reduce the learning rate dynamically
as optimization progresses.
This is also the reason for adding a learning rate function lr into the sgd step function. In the
example above any functionality for learning rate scheduling lies dormant as we set the associated
lr function to be constant.
12.4.2 Dynamic Learning Rate
Replacing η with a time-dependent learning rate η(t) adds to the complexity of controlling con-
vergence of an optimization algorithm. In particular, we need to figure out how rapidly η should
decay. If it is too quick, we will stop optimizing prematurely. If we decrease it too slowly, we waste
too much time on optimization. The following are a few basic strategies that are used in adjusting
η over time (we will discuss more advanced strategies later):
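$$\begin{aligned}
\eta(t) & = \eta_i \textrm{ if } t_i \leq t \leq t_{i+1} && \textrm{piecewise constant} \\
\eta(t) & = \eta_0 \cdot e^{-\lambda t} && \textrm{exponential decay} \\
\eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \textrm{polynomial decay}
\end{aligned} \tag{12.4.5}$$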
In the first piecewise constant scenario we decrease the learning rate, e.g., whenever progress in
optimization stalls. This is a common strategy for training deep networks. Alternatively we could
decrease it much more aggressively by an exponential decay. Unfortunately this often leads to pre-
mature stopping before the algorithm has converged. A popular choice is polynomial decay with
α = 0.5. In the case of convex optimization there are a number of proofs that show that this rate
is well behaved.
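The three schedules in (12.4.5) can be written as small stateless functions of the step count $t$. This is a minimal sketch; the default values for `boundaries`, `rates`, `eta0`, `lam`, `beta`, and `alpha` are assumptions chosen for illustration and differ in style from the global-counter helpers used in the experiments of this section.

```python
import math

def piecewise_constant(t, boundaries=(10, 20), rates=(0.1, 0.01, 0.001)):
    """Return rates[i] for the first boundary with t < boundaries[i]."""
    for boundary, rate in zip(boundaries, rates):
        if t < boundary:
            return rate
    return rates[-1]

def exponential_decay(t, eta0=0.1, lam=0.1):
    """eta(t) = eta0 * exp(-lam * t)."""
    return eta0 * math.exp(-lam * t)

def polynomial_decay(t, eta0=0.1, beta=0.1, alpha=0.5):
    """eta(t) = eta0 * (beta * t + 1) ** (-alpha)."""
    return eta0 * (beta * t + 1) ** (-alpha)

for t in (0, 10, 30, 100):
    print(t, piecewise_constant(t), exponential_decay(t), polynomial_decay(t))
```

Printing a few values makes the difference in aggressiveness visible: by $t = 100$ the exponential schedule has shrunk the rate by a factor of $e^{-10}$, while the polynomial schedule with $\alpha = 0.5$ has only reduced it by a factor of about 3.3.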
Letʼs see what the exponential decay looks like in practice.

def exponential_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))
epoch 1000, x1: -0.881520, x2: -0.061965
As expected, the variance in the parameters is significantly reduced. However, this comes at the
expense of failing to converge to the optimal solution $\mathbf{x} = (0, 0)$. Even after 1000 iteration steps
we are still very far away from the optimal solution. Indeed, the algorithm fails to converge
at all. On the other hand, if we use a polynomial decay where the learning rate decays with the
inverse square root of the number of steps, convergence gets better after only 50 steps.
def polynomial_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
epoch 50, x1: 0.074717, x2: 0.184249

There exist many more choices for how to set the learning rate. For instance, we could start with a
small rate, then rapidly ramp up and then decrease it again, albeit more slowly. We could even al-
ternate between smaller and larger learning rates. There exists a large variety of such schedules.
For now letʼs focus on learning rate schedules for which a comprehensive theoretical analysis is
possible, i.e., on learning rates in a convex setting. For general nonconvex problems it is very
difficult to obtain meaningful convergence guarantees, since in general minimizing nonlinear
nonconvex problems is NP-hard. For a survey see e.g., the excellent lecture notes (footnote 169) of
Tibshirani 2015.
12.4.3 Convergence Analysis for Convex Objectives
The following convergence analysis of stochastic gradient descent for convex objective functions
is optional and primarily serves to convey more intuition about the problem. We limit ourselves
to one of the simplest proofs (Nesterov and Vial, 2000). Significantly more advanced proof tech-
niques exist, e.g., whenever the objective function is particularly well behaved.
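Suppose that the objective function $f(\boldsymbol{\xi}, \mathbf{x})$ is convex in $\mathbf{x}$ for all $\boldsymbol{\xi}$. More concretely, we consider
the stochastic gradient descent update:

$$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}), \tag{12.4.6}$$

where $f(\boldsymbol{\xi}_t, \mathbf{x})$ is the objective function with respect to the training example $\boldsymbol{\xi}_t$ drawn from some
distribution at step $t$ and $\mathbf{x}$ is the model parameter. Denote by

$$R(\mathbf{x}) = E_{\boldsymbol{\xi}}[f(\boldsymbol{\xi}, \mathbf{x})] \tag{12.4.7}$$

the expected risk and by $R^*$ its minimum with regard to $\mathbf{x}$. Last let $\mathbf{x}^*$ be the minimizer (we assume
that it exists within the domain where $\mathbf{x}$ is defined). In this case we can track the distance between
the current parameter $\mathbf{x}_t$ at time $t$ and the risk minimizer $\mathbf{x}^*$ and see whether it improves over time:

$$\begin{aligned}
& \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \\
= & \|\mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}) - \mathbf{x}^*\|^2 \\
= & \|\mathbf{x}_{t} - \mathbf{x}^*\|^2 + \eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 - 2 \eta_t \left\langle \mathbf{x}_t - \mathbf{x}^*, \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}) \right\rangle.
\end{aligned} \tag{12.4.8}$$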
169 https://www.stat.cmu.edu/~ryantibs/convexopt-F15/lectures/26-nonconvex.pdf

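We assume that the $\ell_2$ norm of stochastic gradient $\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})$ is bounded by some constant $L$, hence
we have that

$$\eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 \leq \eta_t^2 L^2. \tag{12.4.9}$$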
We are mostly interested in how the distance between xt and x∗ changes in expectation. In fact,
for any specific sequence of steps the distance might well increase, depending on whichever ξ t
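we encounter. Hence we need to bound the dot product. Since for any convex function $f$ it holds
that $f(\mathbf{y}) \geq f(\mathbf{x}) + \langle f'(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$ for all $\mathbf{x}$ and $\mathbf{y}$, by convexity we have

$$f(\boldsymbol{\xi}_t, \mathbf{x}^*) \geq f(\boldsymbol{\xi}_t, \mathbf{x}_t) + \left\langle \mathbf{x}^* - \mathbf{x}_t, \partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x}_t) \right\rangle. \tag{12.4.10}$$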
Plugging both inequalities (12.4.9) and (12.4.10) into (12.4.8) we obtain a bound on the distance
between parameters at time t + 1 as follows:
$$\|\mathbf{x}_{t} - \mathbf{x}^*\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \geq 2 \eta_t (f(\boldsymbol{\xi}_t, \mathbf{x}_t) - f(\boldsymbol{\xi}_t, \mathbf{x}^*)) - \eta_t^2 L^2. \tag{12.4.11}$$

This means that we make progress as long as the difference between current loss and the optimal
loss outweighs $\eta_t L^2 / 2$. Since this difference is bound to converge to zero it follows that the learning
rate $\eta_t$ also needs to vanish.
Next we take expectations over (12.4.11). This yields
$$E\left[\|\mathbf{x}_{t} - \mathbf{x}^*\|^2\right] - E\left[\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2\right] \geq 2 \eta_t [E[R(\mathbf{x}_t)] - R^*] - \eta_t^2 L^2. \tag{12.4.12}$$
The last step involves summing over the inequalities for t ∈ {1, . . . , T }. Since the sum telescopes
and by dropping the lower term we obtain
$$\|\mathbf{x}_1 - \mathbf{x}^*\|^2 \geq 2 \left(\sum_{t=1}^T \eta_t\right) [E[R(\mathbf{x}_t)] - R^*] - L^2 \sum_{t=1}^T \eta_t^2. \tag{12.4.13}$$
Note that we exploited that $\mathbf{x}_1$ is given and thus the expectation can be dropped. Last define

$$\bar{\mathbf{x}} \stackrel{\textrm{def}}{=} \frac{\sum_{t=1}^T \eta_t \mathbf{x}_t}{\sum_{t=1}^T \eta_t}. \tag{12.4.14}$$
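Since

$$E\left(\frac{\sum_{t=1}^T \eta_t R(\mathbf{x}_t)}{\sum_{t=1}^T \eta_t}\right) = \frac{\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)]}{\sum_{t=1}^T \eta_t} = E[R(\mathbf{x}_t)], \tag{12.4.15}$$

by Jensenʼs inequality (setting $i = t$, $\alpha_i = \eta_t / \sum_{t=1}^T \eta_t$ in (12.2.3)) and convexity of $R$ it follows that
$E[R(\mathbf{x}_t)] \geq E[R(\bar{\mathbf{x}})]$, thus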
$$\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)] \geq \sum_{t=1}^T \eta_t E\left[R(\bar{\mathbf{x}})\right]. \tag{12.4.16}$$
Plugging this into the inequality (12.4.13) yields the bound

$$E\left[R(\bar{\mathbf{x}})\right] - R^* \leq \frac{r^2 + L^2 \sum_{t=1}^T \eta_t^2}{2 \sum_{t=1}^T \eta_t}, \tag{12.4.17}$$
where $r^2 \stackrel{\textrm{def}}{=} \|\mathbf{x}_1 - \mathbf{x}^*\|^2$ is a bound on the distance between the initial choice of parameters and
the final outcome. In short, the speed of convergence depends on how the norm of stochastic
gradient is bounded ($L$) and how far away from optimality the initial parameter value is ($r$). Note
that the bound is in terms of $\bar{\mathbf{x}}$ rather than $\mathbf{x}_T$. This is the case since $\bar{\mathbf{x}}$ is a smoothed version
of the optimization path. Whenever $r$, $L$, and $T$ are known we can pick the learning rate
$\eta = r/(L \sqrt{T})$. This yields as upper bound $rL/\sqrt{T}$. That is, we converge with rate $\mathcal{O}(1/\sqrt{T})$ to the
optimal solution.
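The last step is easy to verify numerically. The following sketch evaluates the right-hand side of (12.4.17) for a constant learning rate $\eta = r/(L\sqrt{T})$ and compares it with $rL/\sqrt{T}$; the values of `r`, `L`, and `T` are arbitrary choices for illustration and `sgd_bound` is a name introduced here.

```python
import math

def sgd_bound(r, L, etas):
    """Right-hand side of (12.4.17) for a list of learning rates."""
    return (r ** 2 + L ** 2 * sum(eta ** 2 for eta in etas)) / (2 * sum(etas))

r, L, T = 2.0, 5.0, 10_000
eta = r / (L * math.sqrt(T))

bound = sgd_bound(r, L, [eta] * T)
print(bound, r * L / math.sqrt(T))
```

Substituting by hand gives the same result: the numerator becomes $r^2 + L^2 T \eta^2 = 2r^2$ and the denominator becomes $2T\eta = 2r\sqrt{T}/L$, so the quotient is $rL/\sqrt{T}$.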

12.4.4 Stochastic Gradients and Finite Samples
So far we have played a bit fast and loose when it comes to talking about stochastic gradient descent.
We posited that we draw instances $x_i$, typically with labels $y_i$ from some distribution $p(x, y)$
and that we use this to update the model parameters in some manner. In particular, for a finite
sample size we simply argued that the discrete distribution $p(x, y) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x) \delta_{y_i}(y)$ for some
functions $\delta_{x_i}$ and $\delta_{y_i}$ allows us to perform stochastic gradient descent over it.
However, this is not really what we did. In the toy examples in the current section we simply
added noise to an otherwise non-stochastic gradient, i.e., we pretended to have pairs (xi , yi ). It
turns out that this is justified here (see the exercises for a detailed discussion). More troubling is
that in all previous discussions we clearly did not do this. Instead we iterated over all instances
exactly once. To see why this is preferable consider the converse, namely that we are sampling
n observations from the discrete distribution with replacement. The probability of choosing an
element $i$ at random is $1/n$. Thus the probability of choosing it at least once is

$$P(\textrm{choose } i) = 1 - P(\textrm{omit } i) = 1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.63. \tag{12.4.18}$$
A similar reasoning shows that the probability of picking some sample (i.e., training example)
exactly once is given by
$$\binom{n}{1} \frac{1}{n} \left(1 - \frac{1}{n}\right)^{n-1} = \frac{n}{n-1} \left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.37. \tag{12.4.19}$$
Sampling with replacement leads to an increased variance and decreased data efficiency relative
to sampling without replacement. Hence, in practice we perform the latter (and this is the default
choice throughout this book). Last note that repeated passes through the training dataset traverse
it in a different random order.
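Both approximations can be checked for a concrete $n$. The sketch below evaluates the exact expressions from (12.4.18) and (12.4.19) and compares them with their limits; the function names and the choice $n = 1000$ are illustrative.

```python
import math

def p_at_least_once(n):
    """Probability that example i appears in n draws with replacement."""
    return 1 - (1 - 1 / n) ** n

def p_exactly_once(n):
    """Probability that example i appears exactly once in n draws."""
    return n * (1 / n) * (1 - 1 / n) ** (n - 1)

n = 1000
print(p_at_least_once(n), 1 - math.exp(-1))
print(p_exactly_once(n), math.exp(-1))
```

In other words, a pass of $n$ draws with replacement leaves roughly 37% of the training examples unseen, which is one way to read the loss of data efficiency mentioned above.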
12.4.5 Summary
• For convex problems we can prove that for a wide choice of learning rates stochastic gradient
descent will converge to the optimal solution.
• For deep learning this is generally not the case. However, the analysis of convex problems
gives us useful insight into how to approach optimization, namely to reduce the learning
rate progressively, albeit not too quickly.
• Problems occur when the learning rate is too small or too large. In practice a suitable learn-
ing rate is often found only after multiple experiments.
• When there are more examples in the training dataset, it costs more to compute each itera-
tion for gradient descent, so stochastic gradient descent is preferred in these cases.
• Optimality guarantees for stochastic gradient descent are in general not available in noncon-
vex cases since the number of local minima that require checking might well be exponential.

Exercises
1. Experiment with different learning rate schedules for stochastic gradient descent and with
different numbers of iterations. In particular, plot the distance from the optimal solution
(0, 0) as a function of the number of iterations.
2. Prove that for the function $f(x_1, x_2) = x_1^2 + 2 x_2^2$ adding normal noise to the gradient is equivalent
to minimizing a loss function $f(\mathbf{x}, \mathbf{w}) = (x_1 - w_1)^2 + 2 (x_2 - w_2)^2$ where $\mathbf{x}$ is drawn from
a normal distribution.
3. Compare convergence of stochastic gradient descent when you sample from
{(x1 , y1 ), . . . , (xn , yn )} with replacement and when you sample without replacement.
4. How would you change the stochastic gradient descent solver if some gradient (or rather
some coordinate associated with it) was consistently larger than all the other gradients?
5. Assume that $f(x) = x^2 (1 + \sin x)$. How many local minima does $f$ have? Can you change $f$
in such a way that to minimize it one needs to evaluate all the local minima?
Discussions
12.5 Minibatch Stochastic Gradient Descent
So far we encountered two extremes in the approach to gradient-based learning: Section 12.3 uses
the full dataset to compute gradients and to update parameters, one pass at a time. Conversely
Section 12.4 processes one training example at a time to make progress. Each of them has its
own drawbacks. Gradient descent is not particularly data efficient whenever data is very similar.
Stochastic gradient descent is not particularly computationally efficient since CPUs and GPUs cannot
exploit the full power of vectorization. This suggests that there might be something in between,
and in fact, that is what we have been using so far in the examples we discussed.
12.5.1 Vectorization and Caches
At the heart of the decision to use minibatches is computational efficiency. This is most easily
understood when considering parallelization to multiple GPUs and multiple servers. In this case
we need to send at least one image to each GPU. With 8 GPUs per server and 16 servers we already
arrive at a minibatch size no smaller than 128.
Things are a bit more subtle when it comes to single GPUs or even CPUs. These devices have
multiple types of memory, often multiple types of computational units and different bandwidth
constraints between them. For instance, a CPU has a small number of registers and then the L1,
L2, and in some cases even L3 cache (which is shared among different processor cores). These
caches are of increasing size and latency (and at the same time they are of decreasing bandwidth).
Suffice to say, the processor is capable of performing many more operations than what the main
memory interface is able to provide.
First, a 2GHz CPU with 16 cores and AVX-512 vectorization can process up to $2 \cdot 10^9 \cdot 16 \cdot 32 \approx 10^{12}$
bytes per second. The capability of GPUs easily exceeds this number by a factor of 100. On the
other hand, a midrange server processor might not have much more than 100 GB/s bandwidth, i.e.,
less than one tenth of what would be required to keep the processor fed. To make matters worse,
not all memory access is created equal: memory interfaces are typically 64 bit wide or wider (e.g.,
on GPUs up to 384 bit), hence reading a single byte incurs the cost of a much wider access.
Second, there is significant overhead for the first access whereas sequential access is relatively
cheap (this is often called a burst read). There are many more things to keep in mind, such as
caching when we have multiple sockets, chiplets, and other structures. See this Wikipedia
article (footnote 171) for a more in-depth discussion.
The way to alleviate these constraints is to use a hierarchy of CPU caches that are actually fast
enough to supply the processor with data. This is the driving force behind batching in deep learn-
ing. To keep matters simple, consider matrix-matrix multiplication, say A = BC. We have a num-
ber of options for calculating A. For instance, we could try the following:
1. We could compute $\mathbf{A}_{ij} = \mathbf{B}_{i,:} \mathbf{C}_{:,j}$, i.e., we could compute it elementwise by means of dot
products.
2. We could compute $\mathbf{A}_{:,j} = \mathbf{B} \mathbf{C}_{:,j}$, i.e., we could compute it one column at a time. Likewise we
could compute $\mathbf{A}$ one row $\mathbf{A}_{i,:}$ at a time.
3. We could simply compute A = BC.
4. We could break B and C into smaller block matrices and compute A one block at a time.
If we follow the first option, we will need to copy one row and one column vector into the CPU
each time we want to compute an element Aij . Even worse, due to the fact that matrix elements
are aligned sequentially we are thus required to access many disjoint locations for one of the two
vectors as we read them from memory. The second option is much more favorable. In it, we are
able to keep the column vector C:,j in the CPU cache while we keep on traversing through B. This
halves the memory bandwidth requirement with correspondingly faster access. Of course, option
3 is most desirable. Unfortunately, most matrices might not entirely fit into cache (this is what we
are discussing after all). However, option 4 offers a practically useful alternative: we can move
blocks of the matrix into cache and multiply them locally. Optimized libraries take care of this for
us. Letʼs have a look at how efficient these operations are in practice.
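To make option 4 concrete, here is a minimal pure-Python sketch of a blocked matrix product, checked against the elementwise version from option 1. It only illustrates the loop structure; it does not demonstrate any cache benefit, since plain Python lists are not laid out like the contiguous arrays an optimized library works with. The function names, the block size, and the small integer matrices are assumptions for illustration.

```python
def matmul_naive(B, C):
    """Option 1: each entry of A is a dot product of a row and a column."""
    n, k, m = len(B), len(C), len(C[0])
    return [[sum(B[i][p] * C[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_blocked(B, C, block=2):
    """Option 4: accumulate A one (block x block) tile at a time."""
    n, k, m = len(B), len(C), len(C[0])
    A = [[0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # Tile rows of A
        for j0 in range(0, m, block):      # Tile columns of A
            for p0 in range(0, k, block):  # Tiles along the inner dimension
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        A[i][j] += sum(B[i][p] * C[p][j]
                                       for p in range(p0, min(p0 + block, k)))
    return A

B_small = [[i + j for j in range(5)] for i in range(4)]      # 4 x 5
C_small = [[i * j - 1 for j in range(3)] for i in range(5)]  # 5 x 3
print(matmul_blocked(B_small, C_small) == matmul_naive(B_small, C_small))
```

The three outer loops pick a tile of `B` and a tile of `C`; in an optimized implementation those tiles are what gets held in cache while the inner loops run.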
Beyond computational efficiency, the overhead introduced by Python and by the deep learning
framework itself is considerable. Recall that each time we execute a command the Python in-
terpreter sends a command to the MXNet engine which needs to insert it into the computational
graph and deal with it during scheduling. Such overhead can be quite detrimental. In short, it is
highly advisable to use vectorization (and matrices) whenever possible.
%matplotlib inline
import time
import numpy as np
import torch
from d2l import torch as d2l
A = torch.zeros(256, 256)
B = torch.randn(256, 256)
C = torch.randn(256, 256)
Since we will benchmark the running time frequently in the rest of the book, letʼs define a timer.
171 https://en.wikipedia.org/wiki/Cache_hierarchy

class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return np.array(self.times).cumsum().tolist()

timer = Timer()
Element-wise assignment simply iterates over all rows and columns of B and C respectively to
assign the value to A.
# Compute A = BC one element at a time
timer.start()
for i in range(256):
    for j in range(256):
        A[i, j] = torch.dot(B[i, :], C[:, j])
timer.stop()
1.284989356994629
A faster strategy is to perform column-wise assignment.
# Compute A = BC one column at a time
timer.start()
for j in range(256):
    A[:, j] = torch.mv(B, C[:, j])
timer.stop()
0.01000666618347168
Last, the most effective manner is to perform the entire operation in one block. Letʼs see what the
respective speed of the operations is.

# Compute A = BC in one go
timer.start()
A = torch.mm(B, C)
timer.stop()
# Multiply and add count as separate operations (fused in practice)

gigaflops = [2/i for i in timer.times]
print(f'performance in Gigaflops: element {gigaflops[0]:.3f}, '
f'column {gigaflops[1]:.3f}, full {gigaflops[2]:.3f}')
performance in Gigaflops: element 1.556, column 199.867, full 3636.154
12.5.2 Minibatches
In the past we took it for granted that we would read minibatches of data rather than single observa-
tions to update parameters. We now give a brief justification for it. Processing single observations
requires us to perform many single matrix-vector (or even vector-vector) multiplications, which is
quite expensive and which incurs a significant overhead on behalf of the underlying deep learning
framework. This applies both to evaluating a network when applied to data (often referred to as
inference) and when computing gradients to update parameters. That is, this applies whenever
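we perform $\mathbf{w} \leftarrow \mathbf{w} - \eta_t \mathbf{g}_t$ where

$$\mathbf{g}_t = \partial_{\mathbf{w}} f(\mathbf{x}_{t}, \mathbf{w}) \tag{12.5.1}$$

We can increase the computational efficiency of this operation by applying it to a minibatch of
observations at a time. That is, we replace the gradient $\mathbf{g}_t$ over a single observation by one over a
small batch

$$\mathbf{g}_t = \partial_{\mathbf{w}} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_{i}, \mathbf{w}) \tag{12.5.2}$$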
Letʼs see what this does to the statistical properties of gt : since both xt and also all elements of
the minibatch Bt are drawn uniformly at random from the training set, the expectation of the
gradient remains unchanged. The variance, on the other hand, is reduced significantly. Since the
minibatch gradient is composed of $b := |\mathcal{B}_t|$ independent gradients which are being averaged, its
standard deviation is reduced by a factor of $b^{-\frac{1}{2}}$. This, by itself, is a good thing, since it means that
the updates are more reliably aligned with the full gradient.
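The scaling of the standard deviation can be checked numerically. The following is a minimal sketch, not part of the original experiments: it assumes that each per-example gradient is a scalar drawn independently from a standard normal distribution, and the helper name `minibatch_mean_std` is hypothetical.

```python
import random
import statistics

def minibatch_mean_std(batch_size, num_trials=2000, seed=0):
    """Empirical standard deviation of the mean of `batch_size` noisy gradients."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_trials):
        # Assumption: each per-example "gradient" is a scalar drawn from N(0, 1)
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    return statistics.pstdev(means)

for b in (1, 4, 16, 64):
    print(f'b = {b:3d}: empirical std {minibatch_mean_std(b):.3f}, '
          f'predicted {b ** -0.5:.3f}')
```

Under these assumptions the empirical values should track $b^{-\frac{1}{2}}$ closely: quadrupling the batch size only halves the noise, which is why very large batches offer diminishing statistical returns.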
Naively this would indicate that choosing a large minibatch Bt would be universally desirable.
Alas, after some point, the additional reduction in standard deviation is minimal when compared
to the linear increase in computational cost. In practice we pick a minibatch that is large enough
to offer good computational efficiency while still fitting into the memory of a GPU. To illustrate the
savings letʼs have a look at some code. In it we perform the same matrix-matrix multiplication,
but this time broken up into “minibatches” of 64 columns at a time.
timer.start()
for j in range(0, 256, 64):
    A[:, j:j+64] = torch.mm(B, C[:, j:j+64])
timer.stop()
print(f'performance in Gigaflops: block {2 / timer.times[3]:.3f}')

performance in Gigaflops: block 276.186
As we can see, the computation on the minibatch is essentially as efficient as on the full matrix. A
word of caution is in order. In Section 8.5 we used a type of regularization that was heavily
dependent on the amount of variance in a minibatch. As we increase the minibatch size, the variance
decreases and with it the benefit of the noise-injection due to batch normalization. See e.g.,
(Ioffe, 2017) for details on how to rescale and compute the appropriate terms.
Letʼs have a look at how minibatches are efficiently generated from data. In the following we use a
dataset developed by NASA to test the wing noise from different aircraft172 to compare these
optimization algorithms. For convenience we only use the first 1,500 examples. The data is whitened
for preprocessing, i.e., we remove the mean and rescale the variance to 1 per coordinate.
#@save
d2l.DATA_HUB['airfoil'] = (d2l.DATA_URL + 'airfoil_self_noise.dat',
                           '76e5be1548fd8222e5074cf0faae75edff8cf93f')
#@save
def get_data_ch11(batch_size=10, n=1500):
    data = np.genfromtxt(d2l.download('airfoil'),
                         dtype=np.float32, delimiter='\t')
    # Whiten the data: zero mean and unit variance per column
    data = torch.from_numpy((data - data.mean(axis=0)) / data.std(axis=0))
    data_iter = d2l.load_array((data[:n, :-1], data[:n, -1]),
                               batch_size, is_train=True)
    return data_iter, data.shape[1]-1
Recall the minibatch stochastic gradient descent implementation from Section 3.4. In the fol-
lowing we provide a slightly more general implementation. For convenience it has the same call
signature as the other optimization algorithms introduced later in this chapter. Specifically, we
add the status input states and place the hyperparameter in dictionary hyperparams. In addition,
we will average the loss of each minibatch example in the training function, so the gradient in the
optimization algorithm does not need to be divided by the batch size.
def sgd(params, states, hyperparams):
    for p in params:
        p.data.sub_(hyperparams['lr'] * p.grad)
        p.grad.data.zero_()
Next, we implement a generic training function to facilitate the use of the other optimization al-
gorithms introduced later in this chapter. It initializes a linear regression model and can be used
to train the model with minibatch stochastic gradient descent and other algorithms introduced
subsequently.
172
https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise

#@save
def train_ch11(trainer_fn, states, hyperparams, data_iter,
               feature_dim, num_epochs=2):
    # Initialization
    w = torch.normal(mean=0.0, std=0.01, size=(feature_dim, 1),
                     requires_grad=True)
    b = torch.zeros((1), requires_grad=True)
    net, loss = lambda X: d2l.linreg(X, w, b), d2l.squared_loss
    # Train
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs], ylim=[0.22, 0.35])
    n, timer = 0, d2l.Timer()
    for _ in range(num_epochs):
        for X, y in data_iter:
            l = loss(net(X), y).mean()
            l.backward()
            trainer_fn([w, b], states, hyperparams)
            n += X.shape[0]
            if n % 200 == 0:
                timer.stop()
                animator.add(n/X.shape[0]/len(data_iter),
                             (d2l.evaluate_loss(net, data_iter, loss),))
                timer.start()
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.avg():.3f} sec/epoch')
    return timer.cumsum(), animator.Y[0]
Letʼs see how optimization proceeds for batch gradient descent. This can be achieved by setting
the minibatch size to 1500 (i.e., to the total number of examples). As a result the model parameters
are updated only once per epoch. There is little progress. In fact, after 6 steps progress stalls.
def train_sgd(lr, batch_size, num_epochs=2):
    data_iter, feature_dim = get_data_ch11(batch_size)
    return train_ch11(
        sgd, None, {'lr': lr}, data_iter, feature_dim, num_epochs)

gd_res = train_sgd(1, 1500, 10)
loss: 0.250, 0.042 sec/epoch

When the batch size equals 1, we use stochastic gradient descent for optimization. For simplic-
ity of implementation we picked a constant (albeit small) learning rate. In stochastic gradient
descent, the model parameters are updated whenever an example is processed. In our case this
amounts to 1500 updates per epoch. As we can see, the decline in the value of the objective func-
tion slows down after one epoch. Although both the procedures processed 1500 examples within
one epoch, stochastic gradient descent consumes more time than gradient descent in our exper-
iment. This is because stochastic gradient descent updated the parameters more frequently and
since it is less efficient to process single observations one at a time.
sgd_res = train_sgd(0.005, 1)
Finally, when the batch size equals 100, we use minibatch stochastic gradient descent for opti-
mization. The time required per epoch is shorter than the time needed for stochastic gradient
descent and the time for batch gradient descent.
mini1_res = train_sgd(.4, 100)

Reducing the batch size to 10, the time for each epoch increases because the workload for each
batch is less efficient to execute.
mini2_res = train_sgd(.05, 10)
Now we can compare the time vs. loss for the previous four experiments. As can be seen, although
stochastic gradient descent converges faster than GD in terms of number of examples processed,
it uses more time to reach the same loss than GD because computing the gradient example by
example is not as efficient. Minibatch stochastic gradient descent is able to trade-off convergence
speed and computation efficiency. A minibatch size of 10 is more efficient than stochastic gradient
descent; a minibatch size of 100 even outperforms GD in terms of runtime.
d2l.set_figsize([6, 3])
d2l.plot(*list(map(list, zip(gd_res, sgd_res, mini1_res, mini2_res))),
'time (sec)', 'loss', xlim=[1e-2, 10],
legend=['gd', 'sgd', 'batch size=100', 'batch size=10'])
d2l.plt.gca().set_xscale('log')

In PyTorch, we can use the optimizer classes in torch.optim to call optimization algorithms. This is
used to implement a generic training function. We will use this throughout the current chapter.
#@save
def train_concise_ch11(trainer_fn, hyperparams, data_iter, num_epochs=4):
    # Initialization
    net = nn.Sequential(nn.Linear(5, 1))
    def init_weights(module):
        # `net.apply` also visits the `Sequential` container, which has no weight
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, std=0.01)
    net.apply(init_weights)
    optimizer = trainer_fn(net.parameters(), **hyperparams)

    loss = nn.MSELoss(reduction='none')
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[0, num_epochs], ylim=[0.22, 0.35])
    n, timer = 0, d2l.Timer()
    for _ in range(num_epochs):
        for X, y in data_iter:
            optimizer.zero_grad()
            out = net(X)
            y = y.reshape(out.shape)
            l = loss(out, y)
            l.mean().backward()
            optimizer.step()
            n += X.shape[0]
            if n % 200 == 0:
                timer.stop()
                # `MSELoss` computes squared error without the 1/2 factor
                animator.add(n/X.shape[0]/len(data_iter),
                             (d2l.evaluate_loss(net, data_iter, loss) / 2,))
                timer.start()
    print(f'loss: {animator.Y[0][-1]:.3f}, {timer.avg():.3f} sec/epoch')
Using the built-in PyTorch optimizer to repeat the last experiment shows identical behavior.
data_iter, _ = get_data_ch11(10)
trainer = torch.optim.SGD
train_concise_ch11(trainer, {'lr': 0.01}, data_iter)

12.5.6 Summary
• Vectorization makes code more efficient due to reduced overhead arising from the deep
learning framework and due to better memory locality and caching on CPUs and GPUs.
• There is a trade-off between statistical efficiency arising from stochastic gradient descent
and computational efficiency arising from processing large batches of data at a time.
• Minibatch stochastic gradient descent offers the best of both worlds: computational and
statistical efficiency.
• In minibatch stochastic gradient descent we process batches of data obtained by a random
permutation of the training data (i.e., each observation is processed only once per epoch,
albeit in random order).
• It is advisable to decay the learning rates during training.
• In general, minibatch stochastic gradient descent is faster than stochastic gradient descent
and gradient descent for convergence to a smaller risk, when measured in terms of clock
time.
Exercises
1. Modify the batch size and learning rate and observe the rate of decline for the value of the
objective function and the time consumed in each epoch.
2. Read the PyTorch documentation and use a learning rate scheduler from torch.optim.lr_scheduler
(e.g., StepLR) to reduce the learning rate of the minibatch stochastic gradient descent to 1/10 of
its previous value after each epoch.
3. Compare minibatch stochastic gradient descent with a variant that actually samples with re-
placement from the training set. What happens?
4. An evil genie replicates your dataset without telling you (i.e., each observation occurs twice
and your dataset grows to twice its original size, but nobody told you). How does the behavior
of stochastic gradient descent, minibatch stochastic gradient descent and that of gradient
descent change?

12.6 Momentum
In Section 12.4 we reviewed what happens when performing stochastic gradient descent, i.e.,
when performing optimization where only a noisy variant of the gradient is available. In partic-
ular, we noticed that for noisy gradients we need to be extra cautious when it comes to choosing
the learning rate in the face of noise. If we decrease it too rapidly, convergence stalls. If we are
too lenient, we fail to converge to a good enough solution since noise keeps on driving us away
from optimality.
12.6.1 Basics
In this section, we will explore more effective optimization algorithms, especially for certain types
of optimization problems that are common in practice.
Leaky Averages
The previous section saw us discussing minibatch SGD as a means for accelerating computation.
It also had the nice side-effect that averaging gradients reduced the amount of variance. The
minibatch stochastic gradient can be calculated by:
$$\mathbf{g}_{t,t-1} = \partial_{\mathbf{w}} \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} f(\mathbf{x}_i, \mathbf{w}_{t-1}) = \frac{1}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \mathbf{h}_{i,t-1}. \qquad (12.6.1)$$
To keep the notation simple, here we used $\mathbf{h}_{i,t-1} = \partial_{\mathbf{w}} f(\mathbf{x}_i, \mathbf{w}_{t-1})$ as the stochastic gradient
for sample i using the weights updated at time t − 1. It would be nice if we could benefit from
the effect of variance reduction even beyond averaging gradients on a minibatch. One option to
accomplish this task is to replace the gradient computation by a “leaky average”:
$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_{t,t-1} \qquad (12.6.2)$$
for some β ∈ (0, 1). This effectively replaces the instantaneous gradient by one thatʼs been aver-
aged over multiple past gradients. v is called momentum. It accumulates past gradients similar to
how a heavy ball rolling down the objective function landscape integrates over past forces. To see
what is happening in more detail letʼs expand vt recursively into
$$\mathbf{v}_t = \beta^2 \mathbf{v}_{t-2} + \beta \mathbf{g}_{t-1,t-2} + \mathbf{g}_{t,t-1} = \ldots = \sum_{\tau=0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau,t-\tau-1}. \qquad (12.6.3)$$
Large β amounts to a long-range average, whereas small β amounts to only a slight correction
relative to a gradient method. The new gradient replacement no longer points in the direction
of steepest descent on a particular instance but rather in the direction of a weighted
average of past gradients. This allows us to realize most of the benefits of averaging over a batch
without the cost of actually computing the gradients on it. We will revisit this averaging procedure
in more detail later.
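To make the expansion of the leaky average concrete, the sketch below compares the recursive update with the explicit weighted sum for scalar gradients. The gradient values and the helper names `leaky_average` and `explicit_sum` are illustrative assumptions, not part of the original text.

```python
def leaky_average(grads, beta):
    """Recursive form: v_t = beta * v_{t-1} + g_t, starting from v_0 = 0."""
    v = 0.0
    for g in grads:
        v = beta * v + g
    return v

def explicit_sum(grads, beta):
    """Expanded form: sum over tau of beta**tau * g_{t-tau}."""
    t = len(grads)
    # grads[t - 1] is the most recent gradient and receives weight beta**0
    return sum(beta ** tau * grads[t - 1 - tau] for tau in range(t))

grads = [1.0, -0.5, 2.0, 0.25, -1.5]  # arbitrary illustrative values
beta = 0.9
print(leaky_average(grads, beta), explicit_sum(grads, beta))
```

Both functions should return the same number up to rounding. Older gradients are downweighted geometrically, so a larger β keeps more of the history.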

The above reasoning formed the basis for what is now known as accelerated gradient methods,
such as gradients with momentum. They enjoy the additional benefit of being much more effec-
tive in cases where the optimization problem is ill-conditioned (i.e., where there are some direc-
tions where progress is much slower than in others, resembling a narrow canyon). Furthermore,
they allow us to average over subsequent gradients to obtain more stable directions of descent.
Indeed, the aspect of acceleration even for noise-free convex problems is one of the key reasons
why momentum works and why it works so well.
As one would expect, due to its efficacy momentum is a well-studied subject in optimization for
deep learning and beyond. See e.g., the beautiful expository article174 by (Goh, 2017) for an in-
depth analysis and interactive animation. It was proposed by (Polyak, 1964). (Nesterov, 2018)
has a detailed theoretical discussion in the context of convex optimization. Momentum in deep
learning has been known to be beneficial for a long time. See e.g., the discussion by (Sutskever et
al., 2013) for details.
An Ill-conditioned Problem
To get a better understanding of the geometric properties of the momentum method we revisit
gradient descent, albeit with a significantly less pleasant objective function. Recall that in Section
12.3 we used $f(\mathbf{x}) = x_1^2 + 2 x_2^2$, i.e., a moderately distorted ellipsoid objective. We distort this
function further by stretching it out in the $x_1$ direction via
$$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2. \qquad (12.6.4)$$
As before f has its minimum at (0, 0). This function is very flat in the direction of $x_1$. Letʼs see what
happens when we perform gradient descent as before on this new function. We pick a learning
rate of 0.4.
%matplotlib inline
import torch
from d2l import torch as d2l

eta = 0.4
def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2
def gd_2d(x1, x2, s1, s2):
    # The gradient of f_2d is (0.2 * x1, 4 * x2); s1 and s2 are unused here
    return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
epoch 20, x1: -0.943467, x2: -0.000073

174
https://distill.pub/2017/momentum/

By construction, the gradient in the x2 direction is much higher and changes much more rapidly
than in the horizontal x1 direction. Thus we are stuck between two undesirable choices: if we
pick a small learning rate we ensure that the solution does not diverge in the x2 direction but we
are saddled with slow convergence in the x1 direction. Conversely, with a large learning rate we
progress rapidly in the x1 direction but diverge in x2 . The example below illustrates what hap-
pens even after a slight increase in learning rate from 0.4 to 0.6. Convergence in the x1 direction
improves but the overall solution quality is much worse.
eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
epoch 20, x1: -0.387814, x2: -1673.365109
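The two runs above can be reproduced without any plotting helpers, since each coordinate is simply multiplied by a constant factor at every step. The sketch below assumes a starting point of (−5, −2) and 20 steps, chosen because they reproduce the printed values; the helper name `gd_trace` is hypothetical.

```python
def gd_trace(eta, x1=-5.0, x2=-2.0, steps=20):
    """Plain gradient descent on f(x) = 0.1 * x1**2 + 2 * x2**2."""
    for _ in range(steps):
        # The gradient is (0.2 * x1, 4 * x2), so each coordinate is rescaled
        # by the constant factors (1 - 0.2 * eta) and (1 - 4 * eta).
        x1, x2 = x1 - eta * 0.2 * x1, x2 - eta * 4 * x2
    return x1, x2

for eta in (0.4, 0.6):
    x1, x2 = gd_trace(eta)
    print(f'eta = {eta}: factors ({1 - 0.2 * eta:.2f}, {1 - 4 * eta:.2f}), '
          f'x1 = {x1:.6f}, x2 = {x2:.6f}')
```

For η = 0.6 the factor in the $x_2$ direction is −1.4, whose magnitude exceeds 1, which is why that coordinate oscillates and grows. The factor 0.88 in the $x_1$ direction still shrinks only slowly.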

The Momentum Method
The momentum method allows us to solve the gradient descent problem described above. Look-
ing at the optimization trace above we might intuit that averaging gradients over the past would
work well. After all, in the x1 direction this will aggregate well-aligned gradients, thus increasing
the distance we cover with every step. Conversely, in the x2 direction where gradients oscillate,
an aggregate gradient will reduce step size due to oscillations that cancel each other out. Using vt
instead of the gradient gt yields the following update equations:
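$$\begin{aligned}
\mathbf{v}_t &\leftarrow \beta \mathbf{v}_{t-1} + \mathbf{g}_{t,t-1}, \\
\mathbf{x}_t &\leftarrow \mathbf{x}_{t-1} - \eta_t \mathbf{v}_t.
\end{aligned} \qquad (12.6.5)$$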
Note that for β = 0 we recover regular gradient descent. Before delving deeper into the mathe-
matical properties letʼs have a quick look at how the algorithm behaves in practice.
def momentum_2d(x1, x2, v1, v2):
    v1 = beta * v1 + 0.2 * x1
    v2 = beta * v2 + 4 * x2
    return x1 - eta * v1, x2 - eta * v2, v1, v2

eta, beta = 0.6, 0.5
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1: 0.007188, x2: 0.002553
As we can see, even with the same learning rate that we used before, momentum still converges
well. Letʼs see what happens when we decrease the momentum parameter. Halving it to β =
0.25 leads to a trajectory that barely converges at all. Nonetheless, it is a lot better than without
momentum (when the solution diverges).
eta, beta = 0.6, 0.25

d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1: -0.126340, x2: -0.186632

Note that we can combine momentum with stochastic gradient descent and in particular, mini-
batch stochastic gradient descent. The only change is that in that case we replace the gradients
gt,t−1 with gt . Last, for convenience we initialize v0 = 0 at time t = 0. Letʼs look at what leaky
averaging actually does to the updates.
Effective Sample Weight

Recall that $\mathbf{v}_t = \sum_{\tau=0}^{t-1} \beta^{\tau} \mathbf{g}_{t-\tau,t-\tau-1}$. In the limit the terms add up to $\sum_{\tau=0}^{\infty} \beta^{\tau} = \frac{1}{1-\beta}$. In other
words, rather than taking a step of size η in gradient descent or stochastic gradient descent we take
a step of size $\frac{\eta}{1-\beta}$ while at the same time dealing with a potentially much better behaved descent
direction. These are two benefits in one. To illustrate how weighting behaves for different choices
of β consider the diagram below.
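The limit of the geometric weights can also be checked directly. The following sketch sums the first few hundred weights and compares the result with $\frac{1}{1-\beta}$; the function name `effective_weight` and the number of terms are illustrative choices.

```python
def effective_weight(beta, num_terms):
    """Partial sum of the geometric weights beta**tau for tau = 0..num_terms-1."""
    return sum(beta ** tau for tau in range(num_terms))

for beta in (0.5, 0.9, 0.95):
    print(f'beta = {beta}: partial sum {effective_weight(beta, 500):.4f}, '
          f'limit 1 / (1 - beta) = {1 / (1 - beta):.4f}')
```

For β = 0.9 the weights sum to roughly 10, so the accumulated step behaves like a step of size 10η along an averaged direction.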
d2l.set_figsize()
betas = [0.95, 0.9, 0.6, 0]
for beta in betas:
    x = torch.arange(40).detach().numpy()
    d2l.plt.plot(x, beta ** x, label=f'beta = {beta:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();

12.6.2 Practical Experiments
Letʼs see how momentum works in practice, i.e., when used within the context of a proper opti-
mizer. For this we need a somewhat more scalable implementation.
Implementation from Scratch
Compared with (minibatch) stochastic gradient descent the momentum method needs to main-
tain a set of auxiliary variables, i.e., velocity. It has the same shape as the gradients (and variables
of the optimization problem). In the implementation below we call these variables states.
def init_momentum_states(feature_dim):
    v_w = torch.zeros((feature_dim, 1))
    v_b = torch.zeros(1)
    return (v_w, v_b)
def sgd_momentum(params, states, hyperparams):
    for p, v in zip(params, states):
        # In-place updates of leaf tensors must not be tracked by autograd
        with torch.no_grad():
            v[:] = hyperparams['momentum'] * v + p.grad
            p[:] -= hyperparams['lr'] * v
        p.grad.data.zero_()
Letʼs see how this works in practice.
def train_momentum(lr, momentum, num_epochs=2):
    d2l.train_ch11(sgd_momentum, init_momentum_states(feature_dim),
                   {'lr': lr, 'momentum': momentum}, data_iter,
                   feature_dim, num_epochs)

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
train_momentum(0.02, 0.5)

When we increase the momentum hyperparameter momentum to 0.9, it amounts to a significantly
larger effective sample size of $\frac{1}{1-0.9} = 10$. We reduce the learning rate slightly to 0.01 to keep
matters under control.
train_momentum(0.01, 0.9)
Reducing the learning rate further addresses any issue of non-smooth optimization problems.
Setting it to 0.005 yields good convergence properties.
train_momentum(0.005, 0.9)

Concise Implementation
There is very little to do in PyTorch since the standard SGD optimizer (torch.optim.SGD) already has
momentum built in. Setting matching parameters yields a very similar trajectory.
trainer = torch.optim.SGD
d2l.train_concise_ch11(trainer, {'lr': 0.005, 'momentum': 0.9}, data_iter)
12.6.3 Theoretical Analysis
So far the 2D example of $f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$ seemed rather contrived. We will now see that this is
actually quite representative of the types of problem one might encounter, at least in the case of
minimizing convex quadratic objective functions.
Quadratic Convex Functions
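Consider the function
$$h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b. \qquad (12.6.6)$$
This is a general quadratic function. For positive definite matrices $\mathbf{Q} \succ 0$, i.e., for matrices with
positive eigenvalues this has a minimizer at $\mathbf{x}^* = -\mathbf{Q}^{-1} \mathbf{c}$ with minimum value $b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}$. Hence
we can rewrite h as
$$h(\mathbf{x}) = \frac{1}{2} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})^\top \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) + b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}. \qquad (12.6.7)$$
The gradient is given by $\partial_{\mathbf{x}} h(\mathbf{x}) = \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) = \mathbf{Q} (\mathbf{x} - \mathbf{x}^*)$. That is, it is given by the distance
between x and the minimizer, multiplied by Q. Consequently also the momentum is a linear
combination of terms $\mathbf{Q} (\mathbf{x}_t + \mathbf{Q}^{-1} \mathbf{c})$.
Since Q is positive definite it can be decomposed into its eigensystem via $\mathbf{Q} = \mathbf{O}^\top \boldsymbol{\Lambda} \mathbf{O}$ for an
orthogonal (rotation) matrix O and a diagonal matrix Λ of positive eigenvalues. This allows us to
perform a change of variables from x to $\mathbf{z} := \mathbf{O} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})$ to obtain a much simplified expression:
$$h(\mathbf{z}) = \frac{1}{2} \mathbf{z}^\top \boldsymbol{\Lambda} \mathbf{z} + b'. \qquad (12.6.8)$$
Here $b' = b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}$. Since O is only an orthogonal matrix this does not perturb the gradients
in a meaningful way. Expressed in terms of z gradient descent with learning rate η becomes
$$\mathbf{z}_t = \mathbf{z}_{t-1} - \eta \boldsymbol{\Lambda} \mathbf{z}_{t-1} = (\mathbf{I} - \eta \boldsymbol{\Lambda}) \mathbf{z}_{t-1}. \qquad (12.6.9)$$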
The important fact in this expression is that gradient descent does not mix between different
eigenspaces. That is, when expressed in terms of the eigensystem of Q the optimization prob-
lem proceeds in a coordinate-wise manner. This also holds for momentum.
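$$\begin{aligned}
\mathbf{v}_t &= \beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1}, \\
\mathbf{z}_t &= \mathbf{z}_{t-1} - \eta (\beta \mathbf{v}_{t-1} + \boldsymbol{\Lambda} \mathbf{z}_{t-1}) \\
&= (\mathbf{I} - \eta \boldsymbol{\Lambda}) \mathbf{z}_{t-1} - \eta \beta \mathbf{v}_{t-1}.
\end{aligned} \qquad (12.6.10)$$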
In doing this we just proved the following theorem: Gradient Descent with and without momen-
tum for a convex quadratic function decomposes into coordinate-wise optimization in the direc-
tion of the eigenvectors of the quadratic matrix.
Scalar Functions
Given the above result letʼs see what happens when we minimize the function $f(x) = \frac{\lambda}{2} x^2$. For
gradient descent we have
$$x_{t+1} = x_t - \eta \lambda x_t = (1 - \eta \lambda) x_t. \qquad (12.6.11)$$
Whenever |1 − ηλ| < 1 this optimization converges at an exponential rate since after t steps we
have $x_t = (1 - \eta \lambda)^t x_0$. This shows how the rate of convergence improves initially as we increase
the learning rate η until ηλ = 1. Beyond that convergence slows down again and for ηλ > 2 the
optimization problem diverges.
lambdas = [0.1, 1, 10, 19]

eta = 0.1
d2l.set_figsize((6, 4))
for lam in lambdas:
    t = torch.arange(20).detach().numpy()
    d2l.plt.plot(t, (1 - eta * lam) ** t, label=f'lambda = {lam:.2f}')
d2l.plt.xlabel('time')
d2l.plt.legend();

To analyze convergence in the case of momentum we begin by rewriting the update equations in
terms of two scalars: one for x and one for the momentum v. This yields:
$$\begin{bmatrix} v_{t+1} \\ x_{t+1} \end{bmatrix} = \begin{bmatrix} \beta & \lambda \\ -\eta \beta & (1 - \eta \lambda) \end{bmatrix} \begin{bmatrix} v_t \\ x_t \end{bmatrix} = \mathbf{R}(\beta, \eta, \lambda) \begin{bmatrix} v_t \\ x_t \end{bmatrix}. \qquad (12.6.12)$$
We used R to denote the 2 × 2 matrix governing convergence behavior. After t steps the initial choice
$[v_0, x_0]$ becomes $\mathbf{R}(\beta, \eta, \lambda)^t [v_0, x_0]$. Hence, it is up to the eigenvalues of R to determine the speed
of convergence. See the Distill post175 of (Goh, 2017) for a great animation and (Flammarion and
Bach, 2015) for a detailed analysis. One can show that for 0 < ηλ < 2 + 2β momentum converges.
This is a larger range of feasible parameters when compared to 0 < ηλ < 2 for gradient descent. It
also suggests that in general large values of β are desirable. Further details require a fair amount
of technical detail and we suggest that the interested reader consult the original publications.
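The convergence range can be explored numerically by computing the eigenvalues of R in closed form. The sketch below is only an illustration: it relies on the fact that a 2 × 2 matrix has eigenvalues determined by its trace and determinant, and the function name `spectral_radius` and the tested parameter values are assumptions made for this example.

```python
import cmath

def spectral_radius(beta, eta, lam):
    """Largest eigenvalue magnitude of R = [[beta, lam], [-eta*beta, 1 - eta*lam]]."""
    trace = beta + 1 - eta * lam
    det = beta * (1 - eta * lam) + eta * beta * lam  # simplifies to beta
    # Eigenvalues solve mu**2 - trace * mu + det = 0 and may be complex
    disc = cmath.sqrt(trace ** 2 - 4 * det)
    return max(abs((trace + disc) / 2), abs((trace - disc) / 2))

lam = 1.0
for beta in (0.0, 0.5, 0.9):
    for eta in (1.9, 2.1, 2.9, 3.1, 3.7, 3.9):
        rho = spectral_radius(beta, eta, lam)
        status = 'converges' if rho < 1 else 'diverges'
        print(f'beta = {beta}, eta * lam = {eta * lam}: rho = {rho:.3f} ({status})')
```

The iteration contracts exactly when the spectral radius is below 1. For β = 0 the threshold sits at ηλ = 2, whereas for β = 0.5 and β = 0.9 it moves to 3 and 3.8 respectively, in line with the bound 2 + 2β.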
12.6.4 Summary
• Momentum replaces gradients with a leaky average over past gradients. This accelerates
convergence significantly.
• It is desirable for both noise-free gradient descent and (noisy) stochastic gradient descent.
• Momentum prevents stalling of the optimization process that is much more likely to occur
for stochastic gradient descent.
• The effective number of gradients is given by $\frac{1}{1-\beta}$ due to exponentiated downweighting of
past data.
• In the case of convex quadratic problems this can be analyzed explicitly in detail.
• Implementation is quite straightforward but it requires us to store an additional state vector
(momentum v).
175
https://distill.pub/2017/momentum/

Exercises
1. Use other combinations of momentum hyperparameters and learning rates and observe and
analyze the different experimental results.
2. Try out GD and momentum for a quadratic problem where you have multiple eigenvalues,
i.e., $f(\mathbf{x}) = \frac{1}{2} \sum_i \lambda_i x_i^2$, e.g., $\lambda_i = 2^{-i}$. Plot how the values of x decrease for the initialization
$x_i = 1$.
3. Derive minimum value and minimizer for $h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b$.
4. What changes when we perform stochastic gradient descent with momentum? What happens
when we use minibatch stochastic gradient descent with momentum? Experiment with
the parameters.
12.7 Adagrad
Letʼs begin by considering learning problems with features that occur infrequently.
12.7.1 Sparse Features and Learning Rates
Imagine that we are training a language model. To get good accuracy we typically want to decrease
the learning rate as we keep on training, usually at a rate of $\mathcal{O}(t^{-\frac{1}{2}})$ or slower. Now consider a
model training on sparse features, i.e., features that occur only infrequently. This is common for
natural language, e.g., it is a lot less likely that we will see the word preconditioning than learning.
However, it is also common in other areas such as computational advertising and personalized
collaborative filtering. After all, there are many things that are of interest only for a small number
of people.
Parameters associated with infrequent features only receive meaningful updates whenever these
features occur. Given a decreasing learning rate we might end up in a situation where the parame-
ters for common features converge rather quickly to their optimal values, whereas for infrequent
features we are still short of observing them sufficiently frequently before their optimal values can
be determined. In other words, the learning rate either decreases too slowly for frequent features
or too quickly for infrequent ones.
A possible hack to redress this issue would be to count the number of times we see a particular
feature and to use this as a clock for adjusting learning rates. That is, rather than choosing a
learning rate of the form $\eta = \frac{\eta_0}{\sqrt{t + c}}$ we could use $\eta_i = \frac{\eta_0}{\sqrt{s(i, t) + c}}$. Here s(i, t) counts the number of
nonzeros for feature i that we have observed up to time t. This is actually quite easy to implement
at no meaningful overhead. However, it fails whenever we do not quite have sparsity but rather
just data where the gradients are often very small and only rarely large. After all, it is unclear
where one would draw the line between something that qualifies as an observed feature or not.
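The counting heuristic can be sketched in a few lines. The example below is a toy illustration under stated assumptions: each example is represented by the list of its nonzero feature indices, and the values of η0 and c as well as the function name `per_feature_rates` are hypothetical.

```python
import math

def per_feature_rates(examples, num_features, eta0=0.1, c=1.0):
    """Return eta_i = eta0 / sqrt(s(i, t) + c) after seeing all examples.

    Each example is a list of the feature indices that are nonzero in it.
    """
    counts = [0] * num_features  # s(i, t): nonzero occurrences of feature i
    for active in examples:
        for i in active:
            counts[i] += 1
    return [eta0 / math.sqrt(s + c) for s in counts]

# Feature 0 is frequent, feature 2 is rare (toy data, for illustration only)
examples = [[0], [0, 1], [0], [0, 1], [0], [0, 2], [0], [0, 1]]
rates = per_feature_rates(examples, num_features=3)
print([round(r, 4) for r in rates])
```

The rare feature keeps a larger learning rate than the frequent one, which is the intended effect. The sketch also makes the limitation visible: a feature either counts as observed or it does not, regardless of how large its gradient was.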
Adagrad by (Duchi et al., 2011) addresses this by replacing the rather crude counter s(i, t) by an
aggregate of the squares of previously observed gradients. In particular, it uses
$s(i, t+1) = s(i, t) + (\partial_i f(\mathbf{x}))^2$ as a means to adjust the learning rate. This has two benefits: first, we no longer
need to decide just when a gradient is large enough. Second, it scales automatically with the mag-
nitude of the gradients. Coordinates that routinely correspond to large gradients are scaled down
significantly, whereas others with small gradients receive a much more gentle treatment. In prac-
tice this leads to a very effective optimization procedure for computational advertising and related
problems. But this hides some of the additional benefits inherent in Adagrad that are best under-
stood in the context of preconditioning.
12.7.2 Preconditioning
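Convex optimization problems are good for analyzing the characteristics of algorithms. After all,
for most nonconvex problems it is difficult to derive meaningful theoretical guarantees, but
intuition and insight often carry over. Letʼs look at the problem of minimizing
$f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{c}^\top \mathbf{x} + b$.
As we saw in Section 12.6, it is possible to rewrite this problem in terms of its eigendecomposition
$\mathbf{Q} = \mathbf{U}^\top \boldsymbol{\Lambda} \mathbf{U}$ to arrive at a much simplified problem where each coordinate can be solved
individually:
$$f(\mathbf{x}) = \bar{f}(\bar{\mathbf{x}}) = \frac{1}{2} \bar{\mathbf{x}}^\top \boldsymbol{\Lambda} \bar{\mathbf{x}} + \bar{\mathbf{c}}^\top \bar{\mathbf{x}} + b. \qquad (12.7.1)$$
Here we used $\bar{\mathbf{x}} = \mathbf{U} \mathbf{x}$ and consequently $\bar{\mathbf{c}} = \mathbf{U} \mathbf{c}$. The modified problem has as its minimizer
$\bar{\mathbf{x}} = -\boldsymbol{\Lambda}^{-1} \bar{\mathbf{c}}$ and minimum value $-\frac{1}{2} \bar{\mathbf{c}}^\top \boldsymbol{\Lambda}^{-1} \bar{\mathbf{c}} + b$. This is much easier to compute since Λ is a
diagonal matrix containing the eigenvalues of Q.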
If we perturb c slightly we would hope to find only slight changes in the minimizer of f . Unfortu-
nately this is not the case. While slight changes in c lead to equally slight changes in c̄, this is not
the case for the minimizer of f (and of f¯ respectively). Whenever the eigenvalues Λi are large we
will see only small changes in x̄i and in the minimum of f¯. Conversely, for small Λi changes in x̄i
can be dramatic. The ratio between the largest and the smallest eigenvalue is called the condition
number of an optimization problem.
$$\kappa = \frac{\boldsymbol{\Lambda}_1}{\boldsymbol{\Lambda}_d}. \qquad (12.7.2)$$
If the condition number κ is large, it is difficult to solve the optimization problem accurately. We
need to ensure that we are careful in getting a large dynamic range of values right. Our analysis
leads to an obvious, albeit somewhat naive question: couldnʼt we simply “fix” the problem by
distorting the space such that all eigenvalues are 1? In theory this is quite easy: we only need the
eigenvalues and eigenvectors of Q to rescale the problem from x to one in $\mathbf{z} := \boldsymbol{\Lambda}^{\frac{1}{2}} \mathbf{U} \mathbf{x}$. In the
new coordinate system $\mathbf{x}^\top \mathbf{Q} \mathbf{x}$ could be simplified to $\|\mathbf{z}\|^2$. Alas, this is a rather impractical
suggestion. Computing eigenvalues and eigenvectors is in general much more expensive than solving the
actual problem.
While computing eigenvalues exactly might be expensive, guessing them and computing them
even somewhat approximately may already be a lot better than not doing anything at all. In
particular, we could use the diagonal entries of Q and rescale it accordingly. This is much cheaper
than computing eigenvalues.
$$\tilde{\mathbf{Q}} = \mathrm{diag}^{-\frac{1}{2}}(\mathbf{Q}) \mathbf{Q} \mathrm{diag}^{-\frac{1}{2}}(\mathbf{Q}). \qquad (12.7.3)$$
In this case we have $\tilde{\mathbf{Q}}_{ij} = \mathbf{Q}_{ij} / \sqrt{\mathbf{Q}_{ii} \mathbf{Q}_{jj}}$ and specifically $\tilde{\mathbf{Q}}_{ii} = 1$ for all i. In most cases this
simplifies the condition number considerably. For instance, in the cases we discussed previously,
this would entirely eliminate the problem at hand since the problem is axis aligned.
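The effect of diagonal rescaling on the condition number can be checked on small examples. The sketch below is restricted to symmetric 2 × 2 matrices so that eigenvalues are available in closed form. The second matrix, with an off-diagonal entry of 0.5, is an arbitrary illustrative choice, and all function names are hypothetical.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius

def condition_number(a, b, d):
    """Ratio of largest to smallest eigenvalue (positive definite input assumed)."""
    largest, smallest = eig_sym_2x2(a, b, d)
    return largest / smallest

def diag_precondition(a, b, d):
    """Entries of diag(Q)^(-1/2) Q diag(Q)^(-1/2) for Q = [[a, b], [b, d]]."""
    return 1.0, b / math.sqrt(a * d), 1.0

# Q = diag(0.2, 4) corresponds to f(x) = 0.1 * x1**2 + 2 * x2**2
for Q in [(0.2, 0.0, 4.0), (0.2, 0.5, 4.0)]:
    print(f'Q = {Q}: kappa before {condition_number(*Q):.2f}, '
          f'after {condition_number(*diag_precondition(*Q)):.2f}')
```

For the axis-aligned matrix the rescaled condition number drops from 20 to 1. With a nonzero off-diagonal entry the rescaling still helps in this example but no longer removes the ill-conditioning entirely.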

Unfortunately we face yet another problem: in deep learning we typically do not even have access
to the second derivative of the objective function: for $\mathbf{x} \in \mathbb{R}^d$ the second derivative even on a
minibatch may require $\mathcal{O}(d^2)$ space and work to compute, thus making it practically infeasible.
The ingenious idea of Adagrad is to use a proxy for that elusive diagonal of the Hessian that is both
relatively cheap to compute and effective: the magnitude of the gradient itself.
In order to see why this works, letʼs look at f¯(x̄). We have that
∂x̄ f¯(x̄) = Λx̄ + c̄ = Λ (x̄ − x̄0 ) , (12.7.4)
where x̄0 is the minimizer of f¯. Hence the magnitude of the gradient depends both on Λ and the
distance from optimality. If x̄ − x̄0 didnʼt change, this would be all thatʼs needed. After all, in this
case the magnitude of the gradient ∂x̄ f¯(x̄) suffices. Since AdaGrad is a stochastic gradient descent
algorithm, we will see gradients with nonzero variance even at optimality. As a result we can safely
use the variance of the gradients as a cheap proxy for the scale of the Hessian. A thorough analysis
is beyond the scope of this section (it would be several pages). We refer the reader to (Duchi et al.,
2011) for details.
12.7.3 The Algorithm
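Letʼs formalize the discussion from above. We use the variable $\mathbf{s}_t$ to accumulate past gradient
variance as follows.
$$\begin{aligned}
\mathbf{g}_t &= \partial_{\mathbf{w}} l(y_t, f(\mathbf{x}_t, \mathbf{w})), \\
\mathbf{s}_t &= \mathbf{s}_{t-1} + \mathbf{g}_t^2, \\
\mathbf{w}_t &= \mathbf{w}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \cdot \mathbf{g}_t.
\end{aligned} \qquad (12.7.5)$$
Here the operations are applied coordinate-wise. That is, $\mathbf{v}^2$ has entries $v_i^2$. Likewise $\frac{1}{\sqrt{\mathbf{v}}}$ has entries
$\frac{1}{\sqrt{v_i}}$ and $\mathbf{u} \cdot \mathbf{v}$ has entries $u_i v_i$. As before η is the learning rate and ϵ is an additive constant that
ensures that we do not divide by 0. Last, we initialize $\mathbf{s}_0 = \mathbf{0}$.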
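The update in (12.7.5) can be sketched with plain Python lists to show how the per-coordinate scaling behaves. The example assumes constant gradients of very different magnitude, an artificial setting chosen only to expose the scaling. The name `adagrad_step` and all numerical values are illustrative.

```python
import math

def adagrad_step(w, s, g, eta, eps=1e-6):
    """One coordinate-wise Adagrad update following (12.7.5)."""
    s = [si + gi ** 2 for si, gi in zip(s, g)]
    w = [wi - eta / math.sqrt(si + eps) * gi for wi, si, gi in zip(w, s, g)]
    return w, s

# Constant gradients of very different scale (illustrative values only)
w, s, g, eta = [0.0, 0.0], [0.0, 0.0], [10.0, 0.1], 1.0
for t in range(1, 101):
    w_new, s = adagrad_step(w, s, g, eta)
    step = [abs(a - b) for a, b in zip(w_new, w)]
    w = w_new
    if t in (1, 4, 100):
        print(f't = {t:3d}: step sizes {step[0]:.4f}, {step[1]:.4f}')
```

Although the two gradients differ by a factor of 100, both coordinates move by nearly the same amount in each step, and that amount shrinks roughly like $\eta / \sqrt{t}$.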
Just like in the case of momentum we need to keep track of an auxiliary variable, in this case to
allow for an individual learning rate per coordinate. This does not increase the cost of Adagrad
significantly relative to SGD, simply since the main cost is typically to compute l(yt , f (xt , w)) and
its derivative.
Note that accumulating squared gradients in $s_t$ means that $s_t$ grows essentially at linear rate
(somewhat slower than linearly in practice, since the gradients initially diminish). This leads to an
$O(t^{-1/2})$ learning rate, albeit adjusted on a per coordinate basis. For convex problems this is
perfectly adequate. In deep learning, though, we might want to decrease the learning rate rather
more slowly. This led to a number of Adagrad variants that we will discuss in the subsequent
chapters. For now letʼs see how it behaves in a quadratic convex problem. We use the same problem
as before:
$$f(x) = 0.1 x_1^2 + 2 x_2^2.$$ (12.7.6)
We are going to implement Adagrad using the same learning rate as previously, i.e., η = 0.4. As
we can see, the iterative trajectory of the independent variable is smoother. However, due to the
cumulative effect of st , the learning rate continuously decays, so the independent variable does
not move as much during later stages of iteration.

%matplotlib inline
import math
import torch
from d2l import torch as d2l

def adagrad_2d(x1, x2, s1, s2):
    eps = 1e-6
    g1, g2 = 0.2 * x1, 4 * x2  # Gradients of f_2d
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1: -2.382563, x2: -0.158591


As we increase the learning rate to 2 we see much better behavior. This already indicates that the
decrease in learning rate might be rather aggressive, even in the noise-free case and we need to
ensure that parameters converge appropriately.
eta = 2
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1: -0.002295, x2: -0.000000
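To make the aggressive decay concrete, the following sketch tracks the effective per-coordinate learning rate η/√(s_t + ϵ) for a single coordinate. It assumes a constant gradient of magnitude 1, which is an idealization chosen here so that the O(t^{-1/2}) behavior is easy to read off. The helper name `effective_lr` is hypothetical.

```python
import math

def effective_lr(eta, grads, eps=1e-6):
    """Return eta / sqrt(s_t + eps) after each gradient in `grads` (one coordinate)."""
    s, rates = 0.0, []
    for g in grads:
        s += g ** 2  # Adagrad accumulator
        rates.append(eta / math.sqrt(s + eps))
    return rates

rates = effective_lr(eta=0.4, grads=[1.0] * 100)
# Rates after 1, 4 and 100 steps
print(rates[0], rates[3], rates[99])
```

Under this assumption the rate after t steps is approximately η/√t, so it has already halved after four steps and dropped by a factor of ten after one hundred.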

Just like the momentum method, Adagrad needs to maintain a state variable of the same shape as
the parameters.
def init_adagrad_states(feature_dim):
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            # Accumulate squared gradients, then rescale the step per coordinate
            s[:] += torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.data.zero_()
Compared to the experiment in Section 12.5 we use a larger learning rate to train the model.

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adagrad, init_adagrad_states(feature_dim),
               {'lr': 0.1}, data_iter, feature_dim);

Using PyTorchʼs built-in torch.optim.Adagrad optimizer, we can invoke the Adagrad algorithm
concisely.
trainer = torch.optim.Adagrad
d2l.train_concise_ch11(trainer, {'lr': 0.1}, data_iter)
12.7.6 Summary
• Adagrad decreases the learning rate dynamically on a per-coordinate basis.

• It uses the magnitude of the gradient as a means of adjusting how quickly progress is
achieved - coordinates with large gradients are compensated with a smaller learning rate.
• Computing the exact second derivative is typically infeasible in deep learning problems due
to memory and computational constraints. The gradient can be a useful proxy.

• If the optimization problem has a rather uneven structure Adagrad can help mitigate the
distortion.
• Adagrad is particularly effective for sparse features where the learning rate needs to decrease
more slowly for infrequently occurring terms.
• On deep learning problems Adagrad can sometimes be too aggressive in reducing learning
rates. We will discuss strategies for mitigating this in the context of Section 12.10.
Exercises
1. Prove that for an orthogonal matrix $U$ and a vector $c$ the following holds:
$\|c - \delta\|_2 = \|Uc - U\delta\|_2$. Why does this mean that the magnitude of perturbations
does not change after an orthogonal change of variables?
2. Try out Adagrad for $f(x) = 0.1 x_1^2 + 2 x_2^2$ and also for the objective function rotated by 45
degrees, i.e., $f(x) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. Does it behave differently?
3. Prove Gerschgorinʼs circle theorem (footnote 177), which states that eigenvalues $\lambda_i$ of a
matrix $M$ satisfy $|\lambda_i - M_{jj}| \leq \sum_{k \neq j} |M_{jk}|$ for at least one choice of $j$.
4. What does Gerschgorinʼs theorem tell us about the eigenvalues of the diagonally preconditioned
matrix $\mathrm{diag}^{-1/2}(M) M \mathrm{diag}^{-1/2}(M)$?
5. Try out Adagrad for a proper deep network, such as Section 7.6 when applied to Fashion-
MNIST.
6. How would you need to modify Adagrad to achieve a less aggressive decay in learning rate?
12.8 RMSProp
One of the key issues in Section 12.7 is that the learning rate decreases at a predefined schedule
of effectively $O(t^{-1/2})$. While this is generally appropriate for convex problems, it might not be
ideal for nonconvex ones, such as those encountered in deep learning. Yet, the coordinate-wise
adaptivity of Adagrad is highly desirable as a preconditioner.
(Tieleman and Hinton, 2012) proposed the RMSProp algorithm as a simple fix to decouple rate
scheduling from coordinate-adaptive learning rates. The issue is that Adagrad accumulates the
squares of the gradient $g_t$ into a state vector $s_t = s_{t-1} + g_t^2$. As a result $s_t$ keeps on
growing without bound due to the lack of normalization, essentially linearly as the algorithm converges.
One way of fixing this problem would be to use $s_t / t$. For reasonable distributions of $g_t$ this will
converge. Unfortunately it might take a very long time until the limit behavior starts to matter
since the procedure remembers the full trajectory of values. An alternative is to use a leaky
average in the same way we used in the momentum method, i.e.,
$s_t \leftarrow \gamma s_{t-1} + (1 - \gamma) g_t^2$ for some parameter γ > 0. Keeping all other parts
unchanged yields RMSProp.
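The difference between the two accumulators is easy to see numerically. The sketch below feeds the same sequence of gradients to Adagradʼs running sum and to the leaky average. The constant gradient of 2 and the helper name `accumulate` are assumptions chosen for illustration.

```python
def accumulate(grads, gamma=0.9):
    """Track Adagrad's running sum and the leaky average of squared gradients."""
    s_sum, s_leaky = 0.0, 0.0
    for g in grads:
        s_sum += g ** 2                                   # grows without bound
        s_leaky = gamma * s_leaky + (1 - gamma) * g ** 2  # stays bounded
    return s_sum, s_leaky

s_sum, s_leaky = accumulate([2.0] * 200)
print(s_sum, s_leaky)
```

After 200 steps the running sum has reached 200 times the squared gradient, whereas the leaky average has settled at the squared gradient itself. The step size of the leaky variant therefore no longer shrinks merely because time passes.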
177 https://en.wikipedia.org/wiki/Gershgorin_circle_theorem

Letʼs write out the equations in detail.
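$$\begin{aligned}
s_t & \leftarrow \gamma s_{t-1} + (1 - \gamma) g_t^2, \\
x_t & \leftarrow x_{t-1} - \frac{\eta}{\sqrt{s_t + \epsilon}} \odot g_t.
\end{aligned}$$ (12.8.1)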
The constant ϵ > 0 is typically set to $10^{-6}$ to ensure that we do not suffer from division by zero
or overly large step sizes. Given this expansion we are now free to control the learning rate η
independently of the scaling that is applied on a per-coordinate basis. In terms of leaky averages
we can apply the same reasoning as previously applied in the case of the momentum method.
Expanding the definition of st yields
$$\begin{aligned}
s_t & = (1 - \gamma) g_t^2 + \gamma s_{t-1} \\
& = (1 - \gamma) \left(g_t^2 + \gamma g_{t-1}^2 + \gamma^2 g_{t-2}^2 + \ldots\right).
\end{aligned}$$ (12.8.2)
As before in Section 12.6 we use $1 + \gamma + \gamma^2 + \ldots = \frac{1}{1-\gamma}$. Hence the sum of
weights is normalized to 1, and an observation is effectively averaged over a window of about
$1/(1-\gamma)$ time steps. Letʼs visualize the weights for the past 40 time steps for various
choices of γ.
import math
import torch
from d2l import torch as d2l

d2l.set_figsize()
gammas = [0.95, 0.9, 0.8, 0.7]
for gamma in gammas:
    x = torch.arange(40).detach().numpy()
    d2l.plt.plot(x, (1-gamma) * gamma ** x, label=f'gamma = {gamma:.2f}')
d2l.plt.xlabel('time');
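The same weights can also be inspected numerically. The following sketch, which uses only the standard library and a hypothetical helper `leaky_weights`, sums the first 40 weights and reports two common ways of summarizing the memory of the average: the effective window 1/(1 − γ) and the number of steps after which a weight has halved.

```python
import math

def leaky_weights(gamma, n):
    """Weights (1 - gamma) * gamma**i given to the squared gradient i steps back."""
    return [(1 - gamma) * gamma ** i for i in range(n)]

for gamma in [0.95, 0.9, 0.8, 0.7]:
    total = sum(leaky_weights(gamma, 40))        # equals 1 - gamma**40
    window = 1 / (1 - gamma)                     # effective averaging window
    half_life = math.log(0.5) / math.log(gamma)  # steps until a weight halves
    print(f'gamma={gamma:.2f} sum={total:.3f} window={window:.1f} '
          f'half-life={half_life:.1f}')
```

For the smaller values of γ the first 40 weights already sum to essentially 1, while for γ = 0.95 a noticeable share of the weight still lies further in the past.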

As before we use the quadratic function $f(x) = 0.1 x_1^2 + 2 x_2^2$ to observe the trajectory of RMSProp.
Recall that in Section 12.7, when we used Adagrad with a learning rate of 0.4, the variables moved
only very slowly in the later stages of the algorithm since the learning rate decreased too quickly.
Since η is controlled separately this does not happen with RMSProp.
def rmsprop_2d(x1, x2, s1, s2):
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6
    s1 = gamma * s1 + (1 - gamma) * g1 ** 2
    s2 = gamma * s2 + (1 - gamma) * g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

eta, gamma = 0.4, 0.9
d2l.show_trace_2d(f_2d, d2l.train_2d(rmsprop_2d))
epoch 20, x1: -0.010599, x2: 0.000000

Next, we implement RMSProp to be used in a deep network. This is equally straightforward.
def init_rmsprop_states(feature_dim):
    s_w = torch.zeros((feature_dim, 1))
    s_b = torch.zeros(1)
    return (s_w, s_b)

def rmsprop(params, states, hyperparams):
    gamma, eps = hyperparams['gamma'], 1e-6
    for p, s in zip(params, states):
        with torch.no_grad():
            # Leaky average of squared gradients, then a rescaled step
            s[:] = gamma * s + (1 - gamma) * torch.square(p.grad)
            p[:] -= hyperparams['lr'] * p.grad / torch.sqrt(s + eps)
        p.grad.data.zero_()
We set the initial learning rate to 0.01 and the weighting term γ to 0.9. That is, s aggregates on
average over the past 1/(1 − γ) = 10 observations of the square gradient.

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(rmsprop, init_rmsprop_states(feature_dim),
               {'lr': 0.01, 'gamma': 0.9}, data_iter, feature_dim);
Since RMSProp is a rather popular algorithm it is also available in PyTorch as torch.optim.RMSprop.
All we need to do is instantiate it, assigning γ to the parameter alpha.
trainer = torch.optim.RMSprop
d2l.train_concise_ch11(trainer, {'lr': 0.01, 'alpha': 0.9},
                       data_iter)

12.8.4 Summary
• RMSProp is very similar to Adagrad insofar as both use the square of the gradient to scale
coefficients.
• RMSProp shares with momentum the leaky averaging. However, RMSProp uses the tech-
nique to adjust the coefficient-wise preconditioner.
• The learning rate needs to be scheduled by the experimenter in practice.
• The coefficient γ determines how long the history is when adjusting the per-coordinate
scale.
Exercises
1. What happens experimentally if we set γ = 1? Why?

2. Rotate the optimization problem to minimize $f(x) = 0.1 (x_1 + x_2)^2 + 2 (x_1 - x_2)^2$. What
happens to the convergence?
3. Try out what happens to RMSProp on a real machine learning problem, such as training on
Fashion-MNIST. Experiment with different choices for adjusting the learning rate.
4. Would you want to adjust γ as optimization progresses? How sensitive is RMSProp to this?

12.9 Adadelta
Adadelta is yet another variant of Adagrad (Section 12.7). The main difference lies in the fact that
it decreases the amount by which the learning rate is adaptive to coordinates. Moreover, it is
traditionally referred to as not having a learning rate since it uses the amount of change itself as
calibration for future change. The algorithm was proposed in (Zeiler, 2012). It is fairly
straightforward, given the discussion of previous algorithms so far.
In a nutshell, Adadelta uses two state variables, st to store a leaky average of the second moment
of the gradient and ∆xt to store a leaky average of the second moment of the change of parameters
in the model itself. Note that we use the original notation and naming of the authors for compati-
bility with other publications and implementations (there is no other real reason why one should
use different Greek variables to indicate a parameter serving the same purpose in momentum,
Adagrad, RMSProp, and Adadelta).
Here are the technical details of Adadelta. Given the parameter du jour is ρ, we obtain the follow-
ing leaky updates similarly to Section 12.8:
$$s_t = \rho s_{t-1} + (1 - \rho) g_t^2.$$ (12.9.1)
The difference to Section 12.8 is that we perform updates with the rescaled gradient g′t , i.e.,
xt = xt−1 − g′t . (12.9.2)
So what is the rescaled gradient $g_t'$? We can calculate it as follows:
$$g_t' = \frac{\sqrt{\Delta x_{t-1} + \epsilon}}{\sqrt{s_t + \epsilon}} \odot g_t,$$ (12.9.3)
where $\Delta x_{t-1}$ is the leaky average of the squared rescaled gradients $g_t'$. We initialize
$\Delta x_0$ to be 0 and update it at each step with $g_t'$, i.e.,
$$\Delta x_t = \rho \Delta x_{t-1} + (1 - \rho) {g_t'}^2,$$ (12.9.4)
and ϵ (a small value such as $10^{-5}$) is added to maintain numerical stability.
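Equations (12.9.1) to (12.9.4) can be traced by hand for a single coordinate. The sketch below does so in plain Python. The toy objective f(x) = x² (with gradient 2x), the starting point and the helper name `adadelta_step` are assumptions made purely for illustration.

```python
import math

def adadelta_step(x, s, delta, grad, rho=0.9, eps=1e-5):
    """One Adadelta update for a single coordinate, following (12.9.1)-(12.9.4)."""
    s = rho * s + (1 - rho) * grad ** 2                              # (12.9.1)
    g_rescaled = math.sqrt(delta + eps) / math.sqrt(s + eps) * grad  # (12.9.3)
    x = x - g_rescaled                                               # (12.9.2)
    delta = rho * delta + (1 - rho) * g_rescaled ** 2                # (12.9.4)
    return x, s, delta, g_rescaled

# Assumed toy objective f(x) = x^2 with gradient 2x, started at x = 1
x, s, delta, steps = 1.0, 0.0, 0.0, []
for _ in range(5):
    x, s, delta, step = adadelta_step(x, s, delta, grad=2 * x)
    steps.append(step)
print(x, steps)
```

Note that no learning rate appears anywhere. The size of the first step is set by ϵ alone, and later steps are calibrated by the leaky average of the steps taken so far, which is why the step size slowly grows during these first few iterations.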
Adadelta needs to maintain two state variables for each variable, st and ∆xt . This yields the fol-
lowing implementation.
%matplotlib inline
import torch
from d2l import torch as d2l

def init_adadelta_states(feature_dim):
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    delta_w, delta_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        with torch.no_grad():
            # In-place updates via [:]
            s[:] = rho * s + (1 - rho) * torch.square(p.grad)
            g = (torch.sqrt(delta + eps) / torch.sqrt(s + eps)) * p.grad
            p[:] -= g
            delta[:] = rho * delta + (1 - rho) * g * g
        p.grad.data.zero_()
Choosing ρ = 0.9 amounts to averaging over roughly the last 1/(1 − ρ) = 10 parameter updates. This
tends to work quite well. We get the following behavior.

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adadelta, init_adadelta_states(feature_dim),
               {'rho': 0.9}, data_iter, feature_dim);
For a concise implementation we simply use the Adadelta algorithm provided by PyTorch as
torch.optim.Adadelta. This yields the following one-liner for a much more compact invocation.
trainer = torch.optim.Adadelta
d2l.train_concise_ch11(trainer, {'rho': 0.9}, data_iter)

12.9.3 Summary
• Adadelta has no learning rate parameter. Instead, it uses the rate of change in the parame-
ters itself to adapt the learning rate.
• Adadelta requires two state variables to store the second moments of gradient and the
change in parameters.
• Adadelta uses leaky averages to keep a running estimate of the appropriate statistics.
Exercises
1. Adjust the value of ρ. What happens?

2. Show how to implement the algorithm without the use of g′t . Why might this be a good idea?
3. Is Adadelta really learning rate free? Could you find optimization problems that break
Adadelta?
4. Compare Adadelta to Adagrad and RMSProp to discuss their convergence behavior.
12.10 Adam
In the discussions leading up to this section we encountered a number of techniques for efficient
optimization. Letʼs recap them in detail here:
• We saw that stochastic gradient descent (Section 12.4) is more effective than gradient descent
when solving optimization problems, e.g., due to its inherent resilience to redundant data.
• We saw that minibatch stochastic gradient descent (Section 12.5) affords significant additional
efficiency arising from vectorization, using larger sets of observations in one minibatch. This is
the key to efficient multi-machine, multi-GPU and overall parallel processing.
• Momentum (Section 12.6) added a mechanism for aggregating a history of past gradients to
accelerate convergence.
• Adagrad (Section 12.7) used per-coordinate scaling to allow for a computationally efficient
preconditioner.
• RMSProp (Section 12.8) decoupled per-coordinate scaling from a learning rate adjustment.
Adam (Kingma and Ba, 2014) combines all these techniques into one efficient learning algorithm.
As expected, this is an algorithm that has become rather popular as one of the more robust and
effective optimization algorithms to use in deep learning. It is not without issues, though. In
particular, (Reddi et al., 2019) show that there are situations where Adam can diverge due to poor
variance control. In a follow-up work (Zaheer et al., 2018) proposed a hotfix to Adam, called Yogi
which addresses these issues. More on this later. For now letʼs review the Adam algorithm.
One of the key components of Adam is that it uses exponentially weighted moving averages (also
known as leaky averaging) to obtain an estimate of both the momentum and also the second
moment of the gradient. That is, it uses the state variables
$$\begin{aligned}
v_t & \leftarrow \beta_1 v_{t-1} + (1 - \beta_1) g_t, \\
s_t & \leftarrow \beta_2 s_{t-1} + (1 - \beta_2) g_t^2.
\end{aligned}$$ (12.10.1)
Here $\beta_1$ and $\beta_2$ are nonnegative weighting parameters. Common choices for them are
$\beta_1 = 0.9$ and $\beta_2 = 0.999$. That is, the variance estimate moves much more slowly than the
momentum term. Note that if we initialize $v_0 = s_0 = 0$ we have a significant amount of bias
initially towards smaller values. This can be addressed by using the fact that
$\sum_{i=0}^{t-1} \beta^i = \frac{1 - \beta^t}{1 - \beta}$ to re-normalize terms. Correspondingly the
normalized state variables are given by
$$\hat{v}_t = \frac{v_t}{1 - \beta_1^t} \text{ and } \hat{s}_t = \frac{s_t}{1 - \beta_2^t}.$$ (12.10.2)
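A short numerical check shows what the correction in (12.10.2) achieves. The sketch below assumes a constant gradient of 1, so that the unbiased estimate should also be 1. The helper name `ewma_with_bias_correction` is hypothetical.

```python
def ewma_with_bias_correction(grads, beta):
    """Return the raw and the bias-corrected leaky average after processing `grads`."""
    v = 0.0
    for t, g in enumerate(grads, start=1):
        v = beta * v + (1 - beta) * g
    # Divide by 1 - beta**t, the total weight accumulated after t steps
    return v, v / (1 - beta ** t)

raw, corrected = ewma_with_bias_correction([1.0] * 5, beta=0.9)
print(raw, corrected)
```

After five steps the raw average has only reached 1 − 0.9⁵, i.e., about 0.41, since most of the weight still sits on the zero initialization. Dividing by the accumulated weight recovers the gradient value.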
Armed with the proper estimates we can now write out the update equations. First, we rescale the
gradient in a manner very much akin to that of RMSProp to obtain
$$g_t' = \frac{\eta \hat{v}_t}{\sqrt{\hat{s}_t} + \epsilon}.$$ (12.10.3)
Unlike RMSProp our update uses the momentum $\hat{v}_t$ rather than the gradient itself. Moreover,
there is a slight cosmetic difference as the rescaling happens using
$\frac{1}{\sqrt{\hat{s}_t} + \epsilon}$ instead of $\frac{1}{\sqrt{\hat{s}_t + \epsilon}}$. The
former works arguably slightly better in practice, hence the deviation from RMSProp. Typically
we pick $\epsilon = 10^{-6}$ for a good trade-off between numerical stability and fidelity.
Now we have all the pieces in place to compute updates. This is slightly anticlimactic and we have
a simple update of the form
xt ← xt−1 − g′t . (12.10.4)
Reviewing the design of Adam its inspiration is clear. Momentum and scale are clearly visible in
the state variables. Their rather peculiar definition forces us to debias terms (this could be fixed
by a slightly different initialization and update condition). Second, the combination of both terms
is pretty straightforward, given RMSProp. Last, the explicit learning rate η allows us to control the
step length to address issues of convergence.

Implementing Adam from scratch is not very daunting. For convenience we store the time step
counter t in the hyperparams dictionary. Beyond that all is straightforward.
%matplotlib inline
import torch
from d2l import torch as d2l

def init_adam_states(feature_dim):
    v_w, v_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((v_w, s_w), (v_b, s_b))

def adam(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-6
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            v[:] = beta1 * v + (1 - beta1) * p.grad
            s[:] = beta2 * s + (1 - beta2) * torch.square(p.grad)
            # Bias correction for the zero initialization of v and s
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr)
                                                       + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1
We are ready to use Adam to train the model. We use a learning rate of η = 0.01.

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adam, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);
A more concise implementation is straightforward since Adam is one of the algorithms provided
as part of PyTorchʼs torch.optim library. Hence we only need to pass configuration parameters
to torch.optim.Adam.
trainer = torch.optim.Adam
d2l.train_concise_ch11(trainer, {'lr': 0.01}, data_iter)
12.10.3 Yogi
One of the problems of Adam is that it can fail to converge even in convex settings when the second
moment estimate in st blows up. As a fix (Zaheer et al., 2018) proposed a refined update (and
initialization) for st . To understand whatʼs going on, letʼs rewrite the Adam update as follows:
$$s_t \leftarrow s_{t-1} + (1 - \beta_2) \left(g_t^2 - s_{t-1}\right).$$ (12.10.5)
Whenever $g_t^2$ has high variance or updates are sparse, $s_t$ might forget past values too quickly. A
possible fix for this is to replace $g_t^2 - s_{t-1}$ by $g_t^2 \odot \mathrm{sgn}(g_t^2 - s_{t-1})$.
Now the magnitude of the update no longer depends on the amount of deviation. This yields the Yogi updates
$$s_t \leftarrow s_{t-1} + (1 - \beta_2) g_t^2 \odot \mathrm{sgn}(g_t^2 - s_{t-1}).$$ (12.10.6)
The authors furthermore advise initializing the momentum on a larger initial batch rather than
just an initial pointwise estimate. We omit the details since they are not material to the discussion
and since even without this convergence remains pretty good.
def yogi(params, states, hyperparams):
    beta1, beta2, eps = 0.9, 0.999, 1e-3
    for p, (v, s) in zip(params, states):
        with torch.no_grad():
            v[:] = beta1 * v + (1 - beta1) * p.grad
            # Yogi: additive update whose sign depends on g^2 - s
            s[:] = s + (1 - beta2) * torch.sign(
                torch.square(p.grad) - s) * torch.square(p.grad)
            v_bias_corr = v / (1 - beta1 ** hyperparams['t'])
            s_bias_corr = s / (1 - beta2 ** hyperparams['t'])
            p[:] -= hyperparams['lr'] * v_bias_corr / (torch.sqrt(s_bias_corr)
                                                       + eps)
        p.grad.data.zero_()
    hyperparams['t'] += 1

data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(yogi, init_adam_states(feature_dim),
               {'lr': 0.01, 't': 1}, data_iter, feature_dim);
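The difference between the two second-moment updates shows up most clearly when large gradients are followed by small ones. The following sketch applies (12.10.5) and (12.10.6) to the same scalar gradient sequence. The sequence itself (1000 gradients of 10 followed by 1000 gradients of 0.1) and the helper name `second_moment_updates` are assumptions for illustration, and bias correction is left out for brevity.

```python
def second_moment_updates(grads, beta2=0.999):
    """Apply the Adam (12.10.5) and Yogi (12.10.6) updates to the same gradients."""
    s_adam, s_yogi = 0.0, 0.0
    for g in grads:
        g2 = g ** 2
        # Adam: move a fixed fraction of the way towards g^2
        s_adam = s_adam + (1 - beta2) * (g2 - s_adam)
        # Yogi: move by (1 - beta2) * g^2 in the direction of sgn(g^2 - s)
        sign = (g2 > s_yogi) - (g2 < s_yogi)
        s_yogi = s_yogi + (1 - beta2) * g2 * sign
    return s_adam, s_yogi

grads = [10.0] * 1000 + [0.1] * 1000
s_adam, s_yogi = second_moment_updates(grads)
print(s_adam, s_yogi)
```

Once the gradients become small, Adamʼs estimate decays geometrically regardless of how small they are, whereas Yogi shrinks its estimate only by an amount proportional to the small squared gradient and therefore retains the memory of the earlier large gradients for much longer.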
12.10.4 Summary
• Adam combines features of many optimization algorithms into a fairly robust update rule.
• Created on the basis of RMSProp, Adam also uses EWMA on the minibatch stochastic gradi-
ent.
• Adam uses bias correction to adjust for a slow startup when estimating momentum and a
second moment.
• For gradients with significant variance we may encounter issues with convergence. They
can be amended by using larger minibatches or by switching to an improved estimate for st .
Yogi offers such an alternative.
Exercises
1. Adjust the learning rate and observe and analyze the experimental results.
2. Can you rewrite the momentum and second moment updates such that they do not require bias
correction?
3. Why do you need to reduce the learning rate η as we converge?
4. Try to construct a case for which Adam diverges and Yogi converges.

12.11 Learning Rate Scheduling
So far we primarily focused on optimization algorithms for how to update the weight vectors rather
than on the rate at which they are being updated. Nonetheless, adjusting the learning rate is often
just as important as the actual algorithm. There are a number of aspects to consider:
• Most obviously the magnitude of the learning rate matters. If it is too large, optimization
diverges, if it is too small, it takes too long to train or we end up with a suboptimal result.
We saw previously that the condition number of the problem matters (see e.g., Section 12.6
for details). Intuitively it is the ratio of the amount of change in the least sensitive direction
vs. the most sensitive one.
• Secondly, the rate of decay is just as important. If the learning rate remains large we may
simply end up bouncing around the minimum and thus not reach optimality. Section 12.5
discussed this in some detail and we analyzed performance guarantees in Section 12.4. In
short, we want the rate to decay, but probably more slowly than $O(t^{-1/2})$ which would be a
good choice for convex problems.

• Another aspect that is equally important is initialization. This pertains both to how the pa-
rameters are set initially (review Section 5.4 for details) and also how they evolve initially.
This goes under the moniker of warmup, i.e., how rapidly we start moving towards the so-
lution initially. Large steps in the beginning might not be beneficial, in particular since the
initial set of parameters is random. The initial update directions might be quite meaning-
less, too.
• Lastly, there are a number of optimization variants that perform cyclical learning rate ad-
justment. This is beyond the scope of the current chapter. We recommend the reader to
review details in (Izmailov et al., 2018), e.g., how to obtain better solutions by averaging
over an entire path of parameters.
Given the fact that there is a lot of detail needed to manage learning rates, most deep learning
frameworks have tools to deal with this automatically. In the current chapter we will review the
effects that different schedules have on accuracy and also show how this can be managed effi-
ciently via a learning rate scheduler.
12.11.1 Toy Problem
We begin with a toy problem that is cheap enough to compute easily, yet sufficiently nontrivial to
illustrate some of the key aspects. For that we pick a slightly modernized version of LeNet (ReLU
instead of sigmoid activation, MaxPooling rather than AveragePooling), as applied to
Fashion-MNIST. Since most of the code is standard we just introduce the basics without further
detailed discussion. See Chapter 7 for a refresher as needed.
%matplotlib inline
import math
import torch
from torch import nn
from torch.optim import lr_scheduler
from d2l import torch as d2l

def net_fn():
    model = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Flatten(),
        nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
        nn.Linear(120, 84), nn.ReLU(),
        nn.Linear(84, 10))
    return model

loss = nn.CrossEntropyLoss()
device = d2l.try_gpu()
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)
# The code is almost identical to `d2l.train_ch6` defined in the
# lenet section of chapter convolutional neural networks
def train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
          scheduler=None):
    net.to(device)
    animator = d2l.Animator(xlabel='epoch', xlim=[0, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])

    for epoch in range(num_epochs):
        metric = d2l.Accumulator(3)  # train_loss, train_acc, num_examples
        for i, (X, y) in enumerate(train_iter):
            net.train()
            trainer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            trainer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            train_loss = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % 50 == 0:
                animator.add(epoch + i / len(train_iter),
                             (train_loss, train_acc, None))

        test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch+1, (None, None, test_acc))

        if scheduler:
            if scheduler.__module__ == lr_scheduler.__name__:
                # Using PyTorch In-Built scheduler
                scheduler.step()
            else:
                # Using custom defined scheduler
                for param_group in trainer.param_groups:
                    param_group['lr'] = scheduler(epoch)

    print(f'train loss {train_loss:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
Letʼs have a look at what happens if we invoke this algorithm with default settings, such as a
learning rate of 0.3 and train for 30 epochs. Note how the training accuracy keeps on increasing while
progress in terms of test accuracy stalls beyond a point. The gap between both curves indicates
overfitting.
lr, num_epochs = 0.3, 30

net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=lr)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device)
train loss 0.168, train acc 0.935, test acc 0.885
12.11.2 Schedulers
One way of adjusting the learning rate is to set it explicitly at each step. In PyTorch this is
conveniently achieved by assigning to the lr entry of the optimizerʼs param_groups. We could adjust
it downward after every epoch (or even after every minibatch), e.g., in a dynamic manner in
response to how optimization is progressing.
lr = 0.1
trainer.param_groups[0]["lr"] = lr
print(f'learning rate is now {trainer.param_groups[0]["lr"]:.2f}')

learning rate is now 0.10
More generally we want to define a scheduler. When invoked with the number of updates it returns
the appropriate value of the learning rate. Letʼs define a simple one that sets the learning rate to
$\eta = \eta_0 (t + 1)^{-1/2}$.
class SquareRootScheduler:
    def __init__(self, lr=0.1):
        self.lr = lr

    def __call__(self, num_update):
        return self.lr * pow(num_update + 1.0, -0.5)
Letʼs plot its behavior over a range of values.
scheduler = SquareRootScheduler(lr=0.1)
d2l.plot(torch.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
Now letʼs see how this plays out for training on Fashion-MNIST. We simply provide the scheduler
as an additional argument to the training algorithm.
net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
      scheduler)

This worked quite a bit better than previously. Two things stand out. First, the curve was rather
smoother than previously. Second, there was less overfitting. Unfortunately it is not a
well-resolved question as to why certain strategies lead to less overfitting in theory. There is some
argument that a smaller stepsize will lead to parameters that are closer to zero and thus simpler.
However, this does not explain the phenomenon entirely since we do not really stop early but
simply reduce the learning rate gently.
12.11.3 Policies
While we cannot possibly cover the entire variety of learning rate schedulers, we attempt to give
a brief overview of popular policies below. Common choices are polynomial decay and piecewise
constant schedules. Beyond that, cosine learning rate schedules have been found to work well
empirically on some problems. Lastly, on some problems it is beneficial to warm up the optimizer
prior to using large learning rates.
Factor Scheduler
One alternative to a polynomial decay would be a multiplicative one, that is ηt+1 ← ηt · α for
α ∈ (0, 1). To prevent the learning rate from decaying beyond a reasonable lower bound the
update equation is often modified to ηt+1 ← max(ηmin , ηt · α).
class FactorScheduler:
    def __init__(self, factor=1, stop_factor_lr=1e-7, base_lr=0.1):
        self.factor = factor
        self.stop_factor_lr = stop_factor_lr
        self.base_lr = base_lr

    def __call__(self, num_update):
        self.base_lr = max(self.stop_factor_lr, self.base_lr * self.factor)
        return self.base_lr

scheduler = FactorScheduler(factor=0.9, stop_factor_lr=1e-2, base_lr=2.0)
d2l.plot(torch.arange(50), [scheduler(t) for t in range(50)])

This can also be accomplished by a built-in scheduler in MXNet via the
lr_scheduler.FactorScheduler object. It takes a few more parameters, such as warmup period,
warmup mode (linear or constant), the maximum number of desired updates, etc. Going forward we
will use the built-in schedulers as appropriate and only explain their functionality here. As
illustrated, it is fairly straightforward to build your own scheduler if needed.
Multi Factor Scheduler
A common strategy for training deep networks is to keep the learning rate piecewise constant and
to decrease it by a given amount every so often. That is, given a set of times when to decrease the
rate, such as s = {5, 10, 20} decrease ηt+1 ← ηt · α whenever t ∈ s. Assuming that the values are
halved at each step we can implement this as follows.
net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=0.5)
scheduler = lr_scheduler.MultiStepLR(trainer, milestones=[15, 30], gamma=0.5)

def get_lr(trainer, scheduler):
    lr = scheduler.get_last_lr()[0]
    trainer.step()
    scheduler.step()
    return lr

d2l.plot(torch.arange(num_epochs), [get_lr(trainer, scheduler)
                                  for t in range(num_epochs)])
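For comparison with the hand-written schedulers above, a piecewise constant schedule can also be expressed as a plain callable that maps an epoch to a learning rate. The class below is a hypothetical sketch and not part of PyTorch or d2l. It assumes the same milestones and factor as in the MultiStepLR example and is meant to produce the same rates.

```python
class MultiFactorScheduler:
    """Piecewise constant schedule: multiply the rate by `factor` at each milestone."""
    def __init__(self, milestones, factor=0.5, base_lr=0.5):
        self.milestones = sorted(milestones)
        self.factor = factor
        self.base_lr = base_lr

    def __call__(self, epoch):
        # Count how many milestones have been reached by this epoch
        num_decays = sum(1 for m in self.milestones if epoch >= m)
        return self.base_lr * self.factor ** num_decays

scheduler = MultiFactorScheduler(milestones=[15, 30], factor=0.5, base_lr=0.5)
print([scheduler(t) for t in (0, 14, 15, 29, 30)])
```

Because the rate is computed from the epoch alone, the object holds no mutable state and can be evaluated for any epoch in any order, which is convenient for plotting.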

The intuition behind this piecewise constant learning rate schedule is that one lets optimization
proceed until a stationary point has been reached in terms of the distribution of weight vectors.
Then (and only then) do we decrease the rate such as to obtain a higher quality proxy to a good
local minimum. The example below shows how this can produce slightly better solutions.

train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
      scheduler)
Cosine Scheduler
A rather perplexing heuristic was proposed by (Loshchilov and Hutter, 2016). It relies on the ob-
servation that we might not want to decrease the learning rate too drastically in the beginning and
moreover, that we might want to “refine” the solution in the end using a very small learning rate.
This results in a cosine-like schedule with the following functional form for learning rates in the
range $t \in [0, T]$.
$$\eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2} \left(1 + \cos(\pi t / T)\right)$$ (12.11.1)
Here $\eta_0$ is the initial learning rate, $\eta_T$ is the target rate at time $T$. Furthermore, for
$t > T$ we simply pin the value to $\eta_T$ without increasing it again. In the following example, we
set the max update step $T = 20$.
class CosineScheduler:
    def __init__(self, max_update, base_lr=0.01, final_lr=0,
                 warmup_steps=0, warmup_begin_lr=0):
        self.base_lr_orig = base_lr
        self.max_update = max_update
        self.final_lr = final_lr
        self.warmup_steps = warmup_steps
        self.warmup_begin_lr = warmup_begin_lr
        self.max_steps = self.max_update - self.warmup_steps

    def get_warmup_lr(self, epoch):
        increase = (self.base_lr_orig - self.warmup_begin_lr) \
                       * float(epoch) / float(self.warmup_steps)
        return self.warmup_begin_lr + increase

    def __call__(self, epoch):
        if epoch < self.warmup_steps:
            return self.get_warmup_lr(epoch)
        if epoch <= self.max_update:
            self.base_lr = self.final_lr + (
                self.base_lr_orig - self.final_lr) * (1 + math.cos(
                math.pi * (epoch - self.warmup_steps) / self.max_steps)) / 2
        return self.base_lr

scheduler = CosineScheduler(max_update=20, base_lr=0.3, final_lr=0.01)
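As a quick sanity check of (12.11.1), the formula can be evaluated directly at a few points. The function `cosine_lr` below is a hypothetical stand-alone helper without warmup. It uses the same values as the example above, namely T = 20, an initial rate of 0.3 and a final rate of 0.01.

```python
import math

def cosine_lr(t, T, eta_0, eta_T):
    """Evaluate (12.11.1) for t <= T and pin the rate to eta_T afterwards."""
    if t > T:
        return eta_T
    return eta_T + (eta_0 - eta_T) / 2 * (1 + math.cos(math.pi * t / T))

for t in (0, 10, 20, 25):
    print(t, round(cosine_lr(t, T=20, eta_0=0.3, eta_T=0.01), 4))
```

The schedule starts at the initial rate, passes through the midpoint of the two rates halfway through, and stays at the final rate from T onwards.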

In the context of computer vision this schedule can lead to improved results. Note, though, that
such improvements are not guaranteed (as can be seen below).
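net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=0.3)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
      scheduler)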

Warmup
In some cases initializing the parameters is not sufficient to guarantee a good solution. This is
particularly a problem for some advanced network designs that may lead to unstable optimiza-
tion problems. We could address this by choosing a sufficiently small learning rate to prevent
divergence in the beginning. Unfortunately this means that progress is slow. Conversely, a large
learning rate initially leads to divergence.
A rather simple fix for this dilemma is to use a warmup period during which the learning rate
increases to its initial maximum and to cool down the rate until the end of the optimization process.
For simplicity one typically uses a linear increase for this purpose. This leads to a schedule of the
form indicated below.
scheduler = CosineScheduler(20, warmup_steps=5, base_lr=0.3, final_lr=0.01)

Note that the network converges better initially (in particular observe the performance during the
first 5 epochs).
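net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=0.3)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device,
      scheduler)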
Warmup can be applied to any scheduler (not just cosine). For a more detailed discussion of learn-
ing rate schedules and many more experiments see also (Gotmare et al., 2018). In particular they
find that a warmup phase limits the amount of divergence of parameters in very deep networks.
This makes intuitive sense, since we would expect significant divergence due to random
initialization in those parts of the network that take the most time to make progress in the beginning.
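To make the shape of such a schedule concrete, the following is a minimal standalone sketch of linear warmup followed by cosine decay. It mirrors the scheduler above as a plain function; the function name cosine_with_warmup is hypothetical, and the sketch assumes that the cosine phase spans max_update - warmup_steps steps and that the rate stays at final_lr afterwards.

```python
import math

def cosine_with_warmup(epoch, max_update=20, base_lr=0.3, final_lr=0.01,
                       warmup_steps=5, warmup_begin_lr=0.0):
    """Learning rate at `epoch`: linear warmup, then cosine decay."""
    if epoch < warmup_steps:
        # Linear increase from warmup_begin_lr towards base_lr
        return warmup_begin_lr + (
            base_lr - warmup_begin_lr) * epoch / warmup_steps
    if epoch <= max_update:
        # Assumption: the cosine phase covers the steps left after warmup
        max_steps = max_update - warmup_steps
        return final_lr + (base_lr - final_lr) * (
            1 + math.cos(math.pi * (epoch - warmup_steps) / max_steps)) / 2
    # Assumption: hold the final rate once the schedule has ended
    return final_lr

schedule = [cosine_with_warmup(t) for t in range(25)]
for t in (0, 1, 5, 12, 20, 24):
    print(f'epoch {t:2d}: lr = {schedule[t]:.4f}')
```

The rate climbs linearly during the first 5 epochs, peaks at base_lr when warmup ends, and then decays along a half cosine to final_lr at epoch 20.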
12.11.4 Summary
• Decreasing the learning rate during training can lead to improved accuracy and (most per-
plexingly) reduced overfitting of the model.
• A piecewise decrease of the learning rate whenever progress has plateaued is effective in
practice. Essentially this ensures that we converge efficiently to a suitable solution and only
then reduce the inherent variance of the parameters by reducing the learning rate.
• Cosine schedulers are popular for some computer vision problems. See e.g., GluonCV182 for
details of such a scheduler.
• A warmup period before optimization can prevent divergence.
• Optimization serves multiple purposes in deep learning. Besides minimizing the training
objective, different choices of optimization algorithms and learning rate scheduling can lead
to rather different amounts of generalization and overfitting on the test set (for the same
amount of training error).
182
http://gluon-cv.mxnet.io

Exercises
1. Experiment with the optimization behavior for a given fixed learning rate. What is the best
model you can obtain this way?
2. How does convergence change if you change the exponent of the decrease in the learning
rate? Use PolyScheduler for your convenience in the experiments.
3. Apply the cosine scheduler to large computer vision problems, e.g., training ImageNet. How
does it affect performance relative to other schedulers?
4. How long should warmup last?
5. Can you connect optimization and sampling? Start by using results from (Welling and Teh,
2011) on Stochastic Gradient Langevin Dynamics.

13 | Computational Performance
In deep learning, datasets and models are usually large, which involves heavy computation.
Therefore, computational performance matters a lot. This chapter will focus on the major fac-
tors that affect computational performance: imperative programming, symbolic programming,
asynchronous computing, automatic parallelism, and multi-GPU computation. By studying this
chapter, you may further improve computational performance of those models implemented in
the previous chapters, for example, by reducing training time without affecting accuracy.
13.1 Compilers and Interpreters
So far, this book has focused on imperative programming, which makes use of statements such
as print, +, and if to change a programʼs state. Consider the following example of a simple im-
perative program.
def add(a, b):
    return a + b

def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g

print(fancy_func(1, 2, 3, 4))
10
Python is an interpreted language. When evaluating the above fancy_func function it performs the
operations making up the functionʼs body in sequence. That is, it will evaluate e = add(a, b) and
store the results as variable e, thereby changing the programʼs state. The next two statements f
= add(c, d) and g = add(e, f) will be executed similarly, performing additions and storing the
results as variables. Fig. 13.1.1 illustrates the flow of data.
Fig. 13.1.1: Data flow in an imperative program.
Although imperative programming is convenient, it may be inefficient. On the one hand, even if
the add function is repeatedly called throughout fancy_func, Python will execute the three func-
tion calls individually. If these are executed, say, on a GPU (or even on multiple GPUs), the over-
head arising from the Python interpreter can become overwhelming. Moreover, it will need to
save the variable values of e and f until all the statements in fancy_func have been executed. This
is because we do not know whether the variables e and f will be used by other parts of the program
after the statements e = add(a, b) and f = add(c, d) are executed.
13.1.1 Symbolic Programming
Consider the alternative, symbolic programming, where computation is usually performed only
once the process has been fully defined. This strategy is used by multiple deep learning frame-
works, including Theano and TensorFlow (the latter has acquired imperative extensions). It usu-
ally involves the following steps:
1. Define the operations to be executed.
2. Compile the operations into an executable program.
3. Provide the required inputs and call the compiled program for execution.
This allows for a significant amount of optimization. First, we can skip the Python interpreter
in many cases, thus removing a performance bottleneck that can become significant on multiple
fast GPUs paired with a single Python thread on a CPU. Second, a compiler might optimize and
rewrite the above code into print((1 + 2) + (3 + 4)) or even print(10). This is possible since a
compiler gets to see the full code before turning it into machine instructions. For instance, it can
release memory (or never allocate it) whenever a variable is no longer needed. Or it can transform
the code entirely into an equivalent piece. To get a better idea, consider the following simulation
of symbolic programming in Python (the program is assembled as a string, compiled, and only then executed).
def add_():
    return '''
def add(a, b):
    return a + b
'''

def fancy_func_():
    return '''
def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g
'''

def evoke_():
    return add_() + fancy_func_() + 'print(fancy_func(1, 2, 3, 4))'

prog = evoke_()
print(prog)
# Compile the assembled source first, then execute the compiled program
y = compile(prog, '', 'exec')
exec(y)

def add(a, b):
    return a + b

def fancy_func(a, b, c, d):
    e = add(a, b)
    f = add(c, d)
    g = add(e, f)
    return g
print(fancy_func(1, 2, 3, 4))
10
The differences between imperative (interpreted) programming and symbolic programming are
as follows:
• Imperative programming is easier. When imperative programming is used in Python, the
majority of the code is straightforward and easy to write. It is also easier to debug imperative
programming code. This is because it is easier to obtain and print all relevant intermediate
variable values, or use Pythonʼs built-in debugging tools.
• Symbolic programming is more efficient and easier to port. Symbolic programming makes
it easier to optimize the code during compilation, while also having the ability to port the
program into a format independent of Python. This allows the program to be run in a non-
Python environment, thus avoiding any potential performance issues related to the Python
interpreter.
13.1.2 Hybrid Programming
Historically most deep learning frameworks choose between an imperative or a symbolic ap-
proach. For example, Theano, TensorFlow (inspired by the former), Keras, and CNTK formulate
models symbolically. Conversely, Chainer and PyTorch take an imperative approach. An imper-
ative mode was added to TensorFlow 2.0 and Keras in later revisions.
As mentioned above, PyTorch is based on imperative programming and uses dynamic compu-
tation graphs. In an effort to leverage the portability and efficiency of symbolic programming,
developers considered whether it would be possible to combine the benefits of both programming
models. This led to TorchScript, which lets users develop and debug using pure imperative
programming, while retaining the ability to convert most programs into symbolic programs to be
run when production-level computing performance and deployment are required.
13.1.3 Hybridizing the Sequential Class
The easiest way to get a feel for how hybridization works is to consider deep networks with mul-
tiple layers. Conventionally the Python interpreter will need to execute the code for all layers to
generate an instruction that can then be forwarded to a CPU or a GPU. For a single (fast) comput-
ing device this does not cause any major issues. On the other hand, if we use an advanced 8-GPU
server such as an AWS P3dn.24xlarge instance Python will struggle to keep all GPUs busy. The
single-threaded Python interpreter becomes the bottleneck here. Letʼs see how we can address
this for significant parts of the code by compiling the Sequential model with torch.jit.script. We begin by
defining a simple MLP.
import torch
from torch import nn
from d2l import torch as d2l

# Factory for networks
def get_net():
    net = nn.Sequential(nn.Linear(512, 256),
                        nn.ReLU(),
                        nn.Linear(256, 128),
                        nn.ReLU(),
                        nn.Linear(128, 2))
    return net

x = torch.randn(size=(1, 512))
net = get_net()
net(x)
tensor([[-0.0929, -0.0203]], grad_fn=<AddmmBackward0>)
By converting the model using the torch.jit.script function, we are able to compile and optimize
the computation in the MLP. The modelʼs computation result remains unchanged.
net = torch.jit.script(net)
net(x)
tensor([[-0.0929, -0.0203]], grad_fn=<AddmmBackward0>)
This seems almost too good to be true: write the same code as before and simply convert the
model using torch.jit.script. Once this happens the network is optimized (we will benchmark
the performance below).

Acceleration by Hybridization
To assess the performance effect of compilation we compare the time
needed to evaluate net(x) before and after hybridization. Letʼs define a class to measure this
time first. It will come in handy throughout the chapter as we set out to measure (and improve)
performance.
#@save
class Benchmark:
    """For measuring running time."""
    def __init__(self, description='Done'):
        self.description = description

    def __enter__(self):
        self.timer = d2l.Timer()
        return self

    def __exit__(self, *args):
        print(f'{self.description}: {self.timer.stop():.4f} sec')
Now we can invoke the network twice, once with and once without torchscript.
net = get_net()
with Benchmark('Without torchscript'):
    for i in range(1000): net(x)

net = torch.jit.script(net)
with Benchmark('With torchscript'):
    for i in range(1000): net(x)
Without torchscript: 2.4347 sec
With torchscript: 3.3845 sec
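The torch.jit.script function converts an nn.Sequential instance into a symbolic program that
the backend can optimize. Note, however, that in the run recorded above the scripted network
took longer (3.3845 sec versus 2.4347 sec without TorchScript), so for a model this small
compilation does not by itself guarantee a speedup.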
Serialization
One of the benefits of compiling the models is that we can serialize (save) the model and its
parameters to disk. This stores the model in a manner that is independent of the front-end
language of choice, so that we can deploy trained models to other devices and easily use other
front-end programming languages. At the same time the code is often faster than what can be
achieved in imperative programming. Letʼs see the save function in action.
net.save('my_mlp')
!ls -lh my_mlp*
-rw-rw-r-- 1 d2l-worker d2l-worker 652K Aug 19 17:17 my_mlp
13.1.4 Summary
• Imperative programming makes it easy to design new models since it is possible to write
code with control flow and the ability to use a large amount of the Python software ecosys-
tem.
• Symbolic programming requires that we specify the program and compile it before execut-
ing it. The benefit is improved performance.
Exercises
1. Review the models that interest you in the previous chapters. Can you improve their com-
putational performance by reimplementing them?
13.2 Asynchronous Computation
Todayʼs computers are highly parallel systems, consisting of multiple CPU cores (often multiple
threads per core), multiple processing elements per GPU, and often multiple GPUs per device. In
short, we can process many different things at the same time, often on different devices. Unfor-
tunately Python is not a great way of writing parallel and asynchronous code, at least not without
some extra help. After all, Python is single-threaded and this is unlikely to change in the future.
Deep learning frameworks such as MXNet and TensorFlow adopt an asynchronous programming
model to improve performance, while PyTorch uses Pythonʼs own scheduler leading to a differ-
ent performance trade-off. For PyTorch, by default, GPU operations are asynchronous. When you
call a function that uses the GPU, the operations are enqueued to the particular device, but not
necessarily executed until later. This allows us to execute more computations in parallel, includ-
ing operations on the CPU or other GPUs.
Hence, understanding how asynchronous programming works helps us to develop more efficient
programs, by proactively reducing computational requirements and mutual dependencies. This
allows us to reduce memory overhead and increase processor utilization.
import os
import subprocess
import numpy
import torch
from d2l import torch as d2l
13.2.1 Asynchrony via Backend
For a warmup consider the following toy problem: we want to generate a random matrix and
multiply it. Letʼs do that both in NumPy and with a PyTorch tensor to see the difference. Note that
the PyTorch tensor is defined on a GPU.
# Warmup for GPU computation
device = d2l.try_gpu()
a = torch.randn(size=(1000, 1000), device=device)
b = torch.mm(a, a)

with d2l.Benchmark('numpy'):
    for _ in range(10):
        a = numpy.random.normal(size=(1000, 1000))
        b = numpy.dot(a, a)

with d2l.Benchmark('torch'):
    for _ in range(10):
        # Draw the matrix on the device again: `a` was rebound to a NumPy
        # array in the loop above
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)
numpy: 1.3500 sec
torch: 0.0011 sec
The PyTorch benchmark appears to be orders of magnitude faster. The NumPy dot product is executed
on the CPU while the PyTorch matrix multiplication is executed on a GPU, and hence the
latter is expected to be much faster. But the huge time difference suggests something else must be
going on. By default, GPU operations are asynchronous in PyTorch. Forcing PyTorch to finish all
computation prior to returning shows what happened previously: computation is being executed
by the backend while the frontend returns control to Python.
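with d2l.Benchmark():
    for _ in range(10):
        a = torch.randn(size=(1000, 1000), device=device)
        b = torch.mm(a, a)
    # Block until all queued GPU work has finished before stopping the timer
    torch.cuda.synchronize(device)
Done: 0.0029 sec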
Broadly speaking, PyTorch has a frontend for direct interaction with the users, e.g., via Python, as
well as a backend used by the system to perform the computation. As shown in Fig. 13.2.1, users
can write PyTorch programs in various frontend languages, such as Python and C++. Regardless
of the frontend programming language used, the execution of PyTorch programs occurs primarily
in the backend of C++ implementations. Operations issued by the frontend language are passed
on to the backend for execution. The backend manages its own threads that continuously collect
and execute queued tasks. Note that for this to work the backend must be able to keep track of
the dependencies between various steps in the computational graph. Hence, it is not possible to
parallelize operations that depend on each other.
Fig. 13.2.1: Programming language frontends and deep learning framework backends.
Letʼs look at another toy example to understand the dependency graph a bit better.
x = torch.ones((1, 2), device=device)
y = torch.ones((1, 2), device=device)
z = x * y + 2
z
tensor([[3., 3.]], device='cuda:0')
Fig. 13.2.2: The backend tracks dependencies between various steps in the computational graph.
The code snippet above is also illustrated in Fig. 13.2.2. Whenever the Python frontend thread
executes one of the first three statements, it simply returns the task to the backend queue. When
the last statementʼs results need to be printed, the Python frontend thread will wait for the C++
backend thread to finish computing the result of the variable z. One benefit of this design is that
the Python frontend thread does not need to perform actual computations. Thus, there is little
impact on the programʼs overall performance, regardless of Pythonʼs performance. Fig. 13.2.3
illustrates how frontend and backend interact.

Fig. 13.2.3: Interactions of the frontend and backend.
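The interaction between frontend and backend can be imitated without a GPU. The sketch below is only an analogy and not how PyTorch is implemented: a single worker thread from the Python standard library stands in for the backend queue, and the hypothetical function slow_op sleeps instead of computing. Submitting work returns almost immediately, whereas asking for the results blocks until the backend has caught up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_op(name, seconds=0.05):
    """Stand-in for a device kernel; it sleeps instead of computing."""
    time.sleep(seconds)
    return name

# One worker thread plays the role of the backend queue of a single device
backend = ThreadPoolExecutor(max_workers=1)

start = time.perf_counter()
futures = [backend.submit(slow_op, f'op{i}') for i in range(5)]
enqueue_time = time.perf_counter() - start  # Frontend returns right away

results = [f.result() for f in futures]     # Blocks, like printing z
total_time = time.perf_counter() - start
backend.shutdown()

print(f'enqueue: {enqueue_time:.4f} sec, total: {total_time:.4f} sec')
```

The enqueue time stays small regardless of how long each operation takes, which is the same effect that made the unsynchronized PyTorch benchmark look so fast.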
13.2.2 Barriers and Blockers
13.2.3 Improving Computation
13.2.4 Summary
• Deep learning frameworks may decouple the Python frontend from an execution backend.
This allows for fast asynchronous insertion of commands into the backend and associated
parallelism.
• Asynchrony leads to a rather responsive frontend. However, use caution not to overfill the
task queue since it may lead to excessive memory consumption. It is recommended to syn-
chronize for each minibatch to keep frontend and backend approximately synchronized.
• Chip vendors offer sophisticated performance analysis tools to obtain a much more fine-
grained insight into the efficiency of deep learning.
Exercises
1. On the CPU, benchmark the same matrix multiplication operations in this section. Can you
still observe asynchrony via the backend?
13.3 Automatic Parallelism
Deep learning frameworks (e.g., MXNet and PyTorch) automatically construct computational
graphs at the backend. Using a computational graph, the system is aware of all the dependen-
cies, and can selectively execute multiple non-interdependent tasks in parallel to improve speed.
For instance, Fig. 13.2.2 in Section 13.2 initializes two variables independently. Consequently the
system can choose to execute them in parallel.
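As a rough illustration of how a dependency graph exposes parallelism, the sketch below encodes z = x * y + 2 as a small graph and asks the standard library (graphlib, Python 3.9 or later) which tasks are ready at each step. The node names are made up for this example, and this is not the scheduler that any deep learning framework actually uses.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its value set: z = x * y + 2
graph = {
    'x': set(),
    'y': set(),
    'x * y': {'x', 'y'},
    'z': {'x * y'},
}

sorter = TopologicalSorter(graph)
sorter.prepare()
waves = []
while sorter.is_active():
    ready = sorter.get_ready()   # All tasks whose dependencies are done
    waves.append(sorted(ready))
    sorter.done(*ready)          # Mark them finished to unlock successors

for i, wave in enumerate(waves):
    print(f'step {i}: {wave}')
```

Tasks that appear in the same step have no dependency on each other, so a backend is free to run them concurrently. Here only the two initializations qualify.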
Typically, a single operator will use all the computational resources on all CPUs or on a single
GPU. For example, the dot operator will use all cores (and threads) on all CPUs, even if there are
multiple CPU processors on a single machine. The same applies to a single GPU. Hence paral-
lelization is not quite so useful for single-device computers. With multiple devices things matter
more. While parallelization is typically most relevant between multiple GPUs, adding the local
CPU will increase performance slightly. For example, see (Hadjis et al., 2016) that focuses on
training computer vision models combining a GPU and a CPU. With the convenience of an auto-
matically parallelizing framework we can accomplish the same goal in a few lines of Python code.
More broadly, our discussion of automatic parallel computation focuses on parallel computation
using both CPUs and GPUs, as well as the parallelization of computation and communication.
Note that we need at least two GPUs to run the experiments in this section.
import torch
from d2l import torch as d2l
13.3.1 Parallel Computation on GPUs
Letʼs start by defining a reference workload to test: the run function below performs 50 matrix-matrix
multiplications on the device of our choice using data allocated into two variables: x_gpu1
and x_gpu2.
devices = d2l.try_all_gpus()
def run(x):
    return [x.mm(x) for _ in range(50)]

x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])
x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])
Now we apply the function to the data. To ensure that caching does not play a role in the results
we warm up the devices by performing a single pass on each of them prior to measuring.
torch.cuda.synchronize() waits for all kernels in all streams on a CUDA device to complete. It takes in
a device argument, the device for which we need to synchronize. It uses the current device, given
by current_device(), if the device argument is None (default).
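run(x_gpu1)
run(x_gpu2)  # Warm-up all devices
torch.cuda.synchronize(devices[0])
torch.cuda.synchronize(devices[1])

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    # Wait for GPU1 so that the timer covers the actual computation
    torch.cuda.synchronize(devices[0])

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    torch.cuda.synchronize(devices[1])
GPU1 time: 0.5121 sec
GPU2 time: 0.5256 sec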
If we remove the synchronize statement between both tasks the system is free to parallelize com-
putation on both devices automatically.

with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    torch.cuda.synchronize()
GPU1 & GPU2: 0.5121 sec
In the above case the total execution time is less than the sum of its parts, since the deep learn-
ing framework automatically schedules computation on both GPU devices without the need for
sophisticated code on behalf of the user.
13.3.2 Parallel Computation and Communication
In many cases we need to move data between different devices, say between the CPU and GPU, or
between different GPUs. For instance, this occurs when we want to perform distributed optimiza-
tion where we need to aggregate the gradients over multiple accelerator cards. Letʼs simulate this
by computing on the GPU and then copying the results back to the CPU.
def copy_to_cpu(x, non_blocking=False):
    return [y.to('cpu', non_blocking=non_blocking) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    # Wait for the queued kernels so that each phase is timed separately
    torch.cuda.synchronize()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    torch.cuda.synchronize()
Run on GPU1: 0.5122 sec
Copy to CPU: 2.3798 sec
This is somewhat inefficient. Note that we could already start copying parts of y to the CPU while
the remainder of the list is still being computed. This situation occurs, e.g., when we compute
the (backprop) gradient on a minibatch. The gradients of some of the parameters will be avail-
able earlier than that of others. Hence it works to our advantage to start using PCI-Express bus
bandwidth while the GPU is still running. In PyTorch, several functions such as to() and copy_()
admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is
unnecessary. Setting non_blocking=True allows us to simulate this scenario.
with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y, True)
    # Wait until both the computation and the copies have completed
    torch.cuda.synchronize()
Run on GPU1 and copy to CPU: 1.8728 sec
The total time required for both operations is (as expected) less than the sum of their parts. Note
that this task is different from parallel computation as it uses a different resource: the bus between
the CPU and GPUs. In fact, we could compute on both devices and communicate, all at the same
time. As noted above, there is a dependency between computation and communication: y[i]
must be computed before it can be copied to the CPU. Fortunately, the system can copy y[i-1]
while computing y[i] to reduce the total running time.
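The benefit of copying y[i-1] while computing y[i] can be imitated with plain Python threads. In the sketch below the costs COMPUTE and COPY are made-up constants, time.sleep stands in for both the GPU kernel and the bus transfer, and a single worker thread plays the role of the bus. It illustrates the pipelining idea only and is not a model of PyTorch internals.

```python
import time
from concurrent.futures import ThreadPoolExecutor

COMPUTE, COPY, N = 0.03, 0.03, 5   # Made-up per-item costs in seconds

def compute(i):
    time.sleep(COMPUTE)   # Stand-in for computing y[i] on the GPU
    return i

def copy(i):
    time.sleep(COPY)      # Stand-in for copying y[i] over the bus
    return i

def serial():
    start = time.perf_counter()
    out = [copy(compute(i)) for i in range(N)]
    return out, time.perf_counter() - start

def overlapped():
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as bus:
        # The copy of y[i] is queued as soon as y[i] exists; the main
        # thread then moves on to compute y[i+1] while the copy runs
        pending = [bus.submit(copy, compute(i)) for i in range(N)]
        out = [p.result() for p in pending]
    return out, time.perf_counter() - start

out_serial, t_serial = serial()
out_overlap, t_overlap = overlapped()
print(f'serial: {t_serial:.3f} sec, overlapped: {t_overlap:.3f} sec')
```

With equal costs the serial version takes about N * (COMPUTE + COPY), while the overlapped version approaches N * COMPUTE + COPY, since only the last copy cannot be hidden behind computation.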
We conclude with an illustration of the computational graph and its dependencies for a simple
two-layer MLP when training on a CPU and two GPUs, as depicted in Fig. 13.3.1. It would be quite
painful to schedule the parallel program resulting from this manually. This is where it is advan-
tageous to have a graph-based computing backend for optimization.
Fig. 13.3.1: The computational graph and its dependencies of a two-layer MLP on a CPU and two
GPUs.
13.3.3 Summary
• Modern systems have a variety of devices, such as multiple GPUs and CPUs. They can be
used in parallel, asynchronously.
• Modern systems also have a variety of resources for communication, such as PCI Express,
storage (typically solid-state drives or via networks), and network bandwidth. They can be
used in parallel for peak efficiency.
• The backend can improve performance through automatic parallel computation and com-
munication.

Exercises
1. Fifty operations were performed in the run function defined in this section. There are no
dependencies between them. Design an experiment to see if the deep learning framework
will automatically execute them in parallel.
2. When the workload of an individual operator is sufficiently small, parallelization can help
even on a single CPU or GPU. Design an experiment to verify this.
3. Design an experiment that uses parallel computation on CPUs, GPUs, and communication
between both devices.
4. Use a debugger such as NVIDIAʼs Nsight186 to verify that your code is efficient.
5. Design computation tasks that include more complex data dependencies, and run experiments
to see if you can obtain the correct results while improving performance.
13.4 Hardware
Building systems with great performance requires a good understanding of the algorithms and
models to capture the statistical aspects of the problem. At the same time it is also indispensable
to have at least a modicum of knowledge of the underlying hardware. The current section is no
substitute for a proper course on hardware and system design. Instead, it might serve as a starting
point for understanding why some algorithms are more efficient than others and how to achieve
good throughput. A good design can easily make a difference of an order of magnitude and, in
turn, this can make the difference between being able to train a network (e.g., in a week) and not
at all (in 3 months, thus missing the deadline). We will start by looking at computers. Then we
will zoom in to look more carefully at CPUs and GPUs. Lastly we zoom out to review how multiple
computers are connected in a server center or in the cloud.
Fig. 13.4.1: Latency Numbers that every programmer should know.
Impatient readers may be able to get by with Fig. 13.4.1. It is taken from Colin Scottʼs interactive
post188 that gives a good overview of the progress over the past decade. The original numbers are
186
https://developer.nvidia.com/nsight-compute-2019_5
188
https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
due to Jeff Deanʼs Stanford talk from 2010189 . The discussion below explains some of the rationale
for these numbers and how they can guide us in designing algorithms. The discussion below is
very high level and cursory. It is clearly no substitute for a proper course but rather just meant
to provide enough information for a statistical modeler to make suitable design decisions. For an
in-depth overview of computer architecture we refer the reader to (Hennessy and Patterson, 2011)
or a recent course on the subject, such as the one by Krste Asanović190.
13.4.1 Computers
Most deep learning researchers and practitioners have access to a computer with a fair amount
of memory, computation, some form of an accelerator such as a GPU, or multiples thereof. A
computer consists of the following key components:
• A processor (also referred to as a CPU) that is able to execute the programs we give it (in
addition to running an operating system and many other things), typically consisting of 8 or
more cores.
• Memory (RAM) to store and retrieve the results from computation, such as weight vectors
and activations, and training data.
• An Ethernet network connection (sometimes multiple) with speeds ranging from 1 Gbit/s to
100 Gbit/s. On high end servers more advanced interconnects can be found.
• A high speed expansion bus (PCIe) to connect the system to one or more GPUs. Servers have
up to 8 accelerators, often connected in an advanced topology, while desktop systems have
1 or 2, depending on the budget of the user and the size of the power supply.
• Durable storage, such as a magnetic hard disk drive or a solid state drive, in many cases
connected using the PCIe bus. It provides efficient transfer of training data to the system and
storage of intermediate checkpoints as needed.
Fig. 13.4.2: Connectivity of components of a computer.
As Fig. 13.4.2 indicates, most components (network, GPU, and storage) are connected to the CPU
across the PCIe bus. It consists of multiple lanes that are directly attached to the CPU. For instance
AMDʼs Threadripper 3 has 64 PCIe 4.0 lanes, each of which is capable of 16 Gbit/s data transfer in both
directions. The memory is directly attached to the CPU with a total bandwidth of up to 100 GB/s.
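A quick back-of-the-envelope calculation puts these figures side by side. The lane count and per-lane rate are the ones quoted above; the x16 slot width is an assumption about how a GPU is typically attached, and protocol overhead is ignored.

```python
LANES = 64            # PCIe 4.0 lanes on the CPU (from the text)
GBIT_PER_LANE = 16    # Gbit/s per lane and direction (from the text)
MEMORY_GB_S = 100     # Upper bound for CPU memory bandwidth (from the text)

total_gbit_s = LANES * GBIT_PER_LANE    # Aggregate rate per direction
total_gb_s = total_gbit_s / 8           # 8 bits per byte, overhead ignored
x16_gb_s = 16 * GBIT_PER_LANE / 8       # Assumed x16 slot for a single GPU

print(f'all {LANES} lanes: {total_gbit_s} Gbit/s = {total_gb_s:.0f} GB/s')
print(f'one x16 slot: {x16_gb_s:.0f} GB/s')
print(f'memory bandwidth / x16 slot: {MEMORY_GB_S / x16_gb_s:.1f}x')
```

Under these assumptions a single accelerator sees only a fraction of the bandwidth that the CPU has to its own memory, which is one reason to keep data on the device whenever possible.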
When we run code on a computer we need to shuffle data to the processors (CPUs or GPUs), per-
form computation, and then move the results off the processor back to RAM and durable storage.
Hence, in order to get good performance we need to make sure that this works seamlessly with-
out any one of the systems becoming a major bottleneck. For instance, if we cannot load images
quickly enough the processor will not have any work to do. Likewise, if we cannot move matrices
189
https://static.googleusercontent.com/media/research.google.com/en//people/jeff/Stanford-DL-Nov-2010.pdf
190
http://inst.eecs.berkeley.edu/~cs152/sp19/

quickly enough to the CPU (or GPU), its processing elements will starve. Finally, if we want to
synchronize multiple computers across the network, the latter should not slow down computa-
tion. One option is to interleave communication and computation. Letʼs have a look at the various
components in more detail.
13.4.2 Memory
At its most basic memory is used to store data that needs to be readily accessible. At present
CPU RAM is typically of the DDR4191 variety, offering 20–25 GB/s bandwidth per module. Each
module has a 64-bit-wide bus. Typically pairs of memory modules are used to allow for multiple
channels. CPUs have between 2 and 4 memory channels, i.e., they have between 40 GB/s and 100
GB/s peak memory bandwidth. Often there are two banks per channel. For instance AMDʼs Zen 3
Threadripper has 8 slots.
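The per-module figure follows from the bus width and the transfer rate. The sketch below assumes two common DDR4 speed grades (2666 and 3200 megatransfers per second) purely for illustration; the 64-bit bus width and the channel counts are the ones given above.

```python
BUS_WIDTH_BITS = 64                        # Per module (from the text)
BYTES_PER_TRANSFER = BUS_WIDTH_BITS // 8

def module_bandwidth_gb_s(mega_transfers_per_s):
    """Peak bandwidth of one module in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER / 1e9

# Assumed speed grades for illustration: DDR4-2666 and DDR4-3200
for grade in (2666, 3200):
    print(f'DDR4-{grade}: {module_bandwidth_gb_s(grade):.1f} GB/s per module')

# Range quoted in the text: 2 to 4 channels at 20 to 25 GB/s each
low, high = 2 * 20, 4 * 25
print(f'peak CPU memory bandwidth: {low} to {high} GB/s')
```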
While these numbers are impressive, indeed, they only tell part of the story. When we want to read
a portion from memory we first need to tell the memory module where the information can be
found. That is, we first need to send the address to RAM. Once this is accomplished we can choose
to read just a single 64 bit record or a long sequence of records. The latter is called burst read. In
a nutshell, sending an address to memory and setting up the transfer takes approximately 100 ns
(details depend on the specific timing coefficients of the memory chips used), whereas every subsequent
transfer takes only 0.2 ns. In short, the first read is 500 times as expensive as subsequent ones!
Note that we could perform up to 10,000,000 random reads per second. This suggests that we avoid
random memory access as far as possible and use burst reads (and writes) instead.
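The arithmetic behind these numbers is easy to reproduce. The sketch below uses only the two latencies quoted above and a simplified cost model in which every random read pays the full setup cost, while a burst pays it once.

```python
SETUP_NS = 100.0     # Send the address and set up the transfer (from the text)
TRANSFER_NS = 0.2    # Each subsequent transfer within a burst (from the text)

def random_reads_ns(n):
    """n reads at unrelated addresses: each one pays the setup cost."""
    return n * SETUP_NS

def burst_read_ns(n):
    """n consecutive records: one setup, then n - 1 cheap transfers."""
    return SETUP_NS + (n - 1) * TRANSFER_NS

n = 1000
print(f'first read vs. subsequent: {SETUP_NS / TRANSFER_NS:.0f}x')
print(f'random reads per second: {1e9 / SETUP_NS:,.0f}')
print(f'{n} random reads: {random_reads_ns(n):,.1f} ns')
print(f'{n}-record burst: {burst_read_ns(n):,.1f} ns')
```

For 1,000 records the burst finishes in roughly 300 ns, compared with 100,000 ns for the same number of random reads under this model.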
Matters are a bit more complex when we take into account that we have multiple banks. Each bank
can read memory largely independently. This means two things. On the one hand, the effective
number of random reads is up to 4 times higher, provided that they are spread evenly across mem-
ory. It also means that it is still a bad idea to perform random reads since burst reads are 4 times
faster, too. On the other hand, due to memory alignment to 64 bit boundaries it is a good idea
to align any data structures with the same boundaries. Compilers do this pretty much automati-
cally192 when the appropriate flags are set. Curious readers are encouraged to review a lecture on
DRAMs such as the one by Zeshan Chishti193 .
GPU memory is subject to even higher bandwidth requirements since GPUs have many more processing
elements than CPUs. By and large there are two options to address them. The first is to
make the memory bus significantly wider. For instance, NVIDIAʼs RTX 2080 Ti has a 352-bit-wide
bus. This allows for much more information to be transferred at the same time. Second, GPUs
use specific high-performance memory. Consumer-grade devices, such as NVIDIAʼs RTX and Ti-
tan series typically use GDDR6194 chips with over 500 GB/s aggregate bandwidth. An alternative is
to use HBM (high bandwidth memory) modules. They use a very different interface and connect
directly with GPUs on a dedicated silicon wafer. This makes them very expensive and their use is
typically limited to high-end server chips, such as the NVIDIA Volta V100 series of accelerators.
Quite unsurprisingly, GPU memory is generally much smaller than CPU memory due to the higher
cost of the former. For our purposes, by and large their performance characteristics are similar,
just a lot faster. We can safely ignore the details for the purpose of this book. They only matter
when tuning GPU kernels for high throughput.
191
https://en.wikipedia.org/wiki/DDR4_SDRAM
192
https://en.wikipedia.org/wiki/Data_structure_alignment
193
http://web.cecs.pdx.edu/~zeshan/ece585_lec5.pdf
194
https://en.wikipedia.org/wiki/GDDR6_SDRAM
13.4.3 Storage
We saw that some of the key characteristics of RAM are bandwidth and latency. The same is true
for storage devices, just that the differences can be even more extreme.
Hard Disk Drives
Hard disk drives (HDDs) have been in use for over half a century. In a nutshell they contain a
number of spinning platters with heads that can be positioned to read or write at any given track.
High-end disks hold up to 16 TB on 9 platters. One of the key benefits of HDDs is that they are
relatively inexpensive. Among their many downsides are their typically catastrophic failure modes
and their relatively high read latency.
To understand the latter, consider the fact that HDDs spin at around 7,200 RPM (revolutions per
minute). If they were much faster they would shatter due to the centrifugal force exerted on the
platters. This has a major downside when it comes to accessing a specific sector on the disk: we
need to wait until the platter has rotated in position (we can move the heads but not accelerate the
actual disks). Hence it can take over 8 ms until the requested data is available. A common way this
is expressed is to say that HDDs can operate at approximately 100 IOPs (input/output operations
per second). This number has essentially remained unchanged for the past two decades. Worse
still, it is equally difficult to increase bandwidth (it is in the order of 100–200 MB/s). After all,
each head reads a track of bits, hence the bit rate only scales with the square root of the informa-
tion density. As a result, HDDs are quickly becoming relegated to archival storage and low-grade
storage for very large datasets.
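The rotational argument above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope model: the helper names are made up for illustration, and the 5 ms seek time is an assumed value rather than a figure from the text.

```python
def rotation_time_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


def random_iops(rpm: float, seek_ms: float = 0.0) -> float:
    """Rough bound on random IOPS: one seek plus, on average, half a revolution."""
    avg_latency_ms = seek_ms + rotation_time_ms(rpm) / 2
    return 1000.0 / avg_latency_ms


# Worst case: the sector has just passed the head, so we wait a full turn.
print(f"full revolution at 7,200 RPM: {rotation_time_ms(7200):.2f} ms")
# seek_ms=5.0 is an assumed, illustrative seek time.
print(f"random IOPS with a 5 ms seek: {random_iops(7200, seek_ms=5.0):.0f}")
```

A full revolution takes a little over 8 ms, and with a seek of a few milliseconds the estimate lands near the 100 IOPs quoted above.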
Solid State Drives
Solid state drives (SSDs) use flash memory to store information persistently. This allows for much
faster access to stored records. Modern SSDs can operate at 100,000 to 500,000 IOPs, i.e., up to
3 orders of magnitude faster than HDDs. Furthermore, their bandwidth can reach 1–3GB/s, i.e.,
one order of magnitude faster than HDDs. These improvements sound almost too good to be true.
Indeed, they come with the following caveats, due to the way SSDs are designed.
• SSDs store information in blocks (256 KB or larger). They can only be written as a whole,
which takes significant time. Consequently bit-wise random writes on SSD have very poor
performance. Likewise, writing data in general takes significant time since the block has
to be read, erased and then rewritten with new information. By now SSD controllers and
firmware have developed algorithms to mitigate this. Nonetheless, writes can be much
slower, in particular for QLC (quad level cell) SSDs. The key for improved performance is to
maintain a queue of operations, to prefer reads and to write in large blocks if possible.
• The memory cells in SSDs wear out relatively quickly (often already after a few thousand
writes). Wear-level protection algorithms are able to spread the degradation over many cells.
That said, it is not recommended to use SSDs for swapping files or for large aggregations of
log-files.
• Lastly, the massive increase in bandwidth has forced computer designers to attach SSDs
directly to the PCIe bus. The drives capable of handling this, referred to as NVMe (Non-Volatile
Memory Express), can use up to 4 PCIe lanes. This amounts to up to 8 GB/s on PCIe 4.0.
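To see why the first caveat favors writing in large blocks, consider a deliberately simplified model in which every flush rewrites each 256 KB block it touches in full. Real controllers are far more sophisticated, so the function names and the model itself are assumptions made for illustration only.

```python
import math

BLOCK_BYTES = 256 * 1024  # block size quoted in the text


def physical_bytes_written(record_bytes: int, n_records: int, batch: int) -> int:
    """Bytes rewritten when `batch` records are buffered before each flush.

    Simplified model: every flush rewrites each block it touches in full.
    """
    flush_bytes = record_bytes * batch
    blocks_per_flush = math.ceil(flush_bytes / BLOCK_BYTES)
    n_flushes = math.ceil(n_records / batch)
    return n_flushes * blocks_per_flush * BLOCK_BYTES


def write_amplification(record_bytes: int, n_records: int, batch: int) -> float:
    """Ratio of bytes physically rewritten to bytes the application wrote."""
    logical = record_bytes * n_records
    return physical_bytes_written(record_bytes, n_records, batch) / logical


for batch in (1, 8, 64):
    wa = write_amplification(4096, 1024, batch)
    print(f"{batch:3d} records per flush -> write amplification {wa:.0f}x")
```

Under this model, buffering 64 records of 4 KB fills exactly one block, so nothing is rewritten needlessly, whereas flushing each record on its own rewrites a full block every time.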

Cloud Storage
Cloud storage provides a configurable range of performance. That is, the assignment of storage to
virtual machines is dynamic, both in terms of quantity and in terms of speed, as chosen by users.
We recommend that users increase the provisioned number of IOPs whenever latency is too high,
e.g., during training with many small records.
13.4.4 CPUs
Central processing units (CPUs) are the centerpiece of any computer. They consist of a number
of key components: processor cores that are able to execute machine code, a bus connecting them
(the specific topology differs significantly between processor models, generations, and vendors),
and caches to allow for higher bandwidth and lower latency memory access than what is possible
by reads from main memory. Lastly, almost all modern CPUs contain vector processing units to aid
with high performance linear algebra and convolutions, as they are common in media processing
and machine learning.
Fig. 13.4.3: Intel Skylake consumer quad-core CPU.
Fig. 13.4.3 depicts an Intel Skylake consumer-grade quad-core CPU. It has an integrated GPU,
caches, and a ringbus connecting the four cores. Peripherals, such as Ethernet, WiFi, Bluetooth,
SSD controller, and USB, are either part of the chipset or directly attached (PCIe) to the CPU.

Microarchitecture
Each of the processor cores consists of a rather sophisticated set of components. While details
differ between generations and vendors, the basic functionality is pretty much standard. The
front-end loads instructions and tries to predict which path will be taken (e.g., for control flow).
Instructions are then decoded from assembly code to microinstructions. Assembly code is often
not the lowest level code that a processor executes. Instead, complex instructions may be decoded
into a set of lower-level operations. These are then processed by the actual execution core.
Often the latter is capable of performing many operations simultaneously. For instance, the ARM
Cortex A77 core of Fig. 13.4.4 is able to perform up to 8 operations simultaneously.
Fig. 13.4.4: ARM Cortex A77 Microarchitecture.
This means that efficient programs might be able to perform more than one instruction per clock
cycle, provided that they can be carried out independently. Not all units are created equal. Some
specialize in integer instructions whereas others are optimized for floating point performance.
To increase throughput, the processor might also follow multiple code paths simultaneously in a
branching instruction and then discard the results of the branches not taken. This is why branch
prediction units matter (on the front-end) such that only the most promising paths are pursued.
Vectorization
Deep learning is extremely compute-hungry. Hence, to make CPUs suitable for machine learning,
one needs to perform many operations in one clock cycle. This is achieved via vector units. They
have different names: on ARM they are called NEON, on x86 (in a recent generation) they are referred
to as AVX2195 units. A common aspect is that they are able to perform SIMD (single instruction
multiple data) operations. Fig. 13.4.5 shows how 8 short integers can be added in one clock cycle
on ARM.
195
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

Fig. 13.4.5: 128 bit NEON vectorization.
Depending on architecture choices, such registers are up to 512 bits long, allowing for the com-
bination of up to 64 pairs of numbers. For instance, we might be multiplying two numbers and
adding them to a third, which is also known as a fused multiply-add. Intelʼs OpenVino196 uses these
to achieve respectable throughput for deep learning on server-grade CPUs. Note, though, that this
number is entirely dwarfed by what GPUs are capable of achieving. For instance, NVIDIAʼs RTX
2080 Ti has 4,352 CUDA cores, each of which is capable of processing such an operation at any
time.
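The lane counts mentioned above follow directly from dividing the register width by the element width. The short sketch below makes this explicit; `simd_lanes` and `vector_add` are illustrative helpers, and the list comprehension merely emulates what a vector unit does with a single instruction.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit into one vector register."""
    if register_bits % element_bits:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


def vector_add(a, b):
    """Emulate one SIMD add: all lanes are handled by a single 'instruction'."""
    return [x + y for x, y in zip(a, b)]


print("128-bit NEON, 16-bit shorts :", simd_lanes(128, 16))
print("256-bit AVX2, 32-bit floats :", simd_lanes(256, 32))
print("512-bit register, 8-bit ints:", simd_lanes(512, 8))
print(vector_add([1] * 8, list(range(8))))
```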
Cache
Consider the following situation: we have a modest CPU with 4 cores as depicted in Fig. 13.4.3
above, running at 2 GHz frequency. Moreover, letʼs assume that we have an IPC (instructions per
clock) count of 1 and that the units have AVX2 with 256-bit width enabled. Letʼs furthermore
assume that at least one of the registers used for AVX2 operations needs to be retrieved from
memory. This means that the CPU consumes 4 × 256 bit = 128 bytes of data per clock cycle. Unless
we are able to transfer 2 × 10^9 × 128 = 256 × 10^9 bytes to the processor per second the processing
elements are going to starve. Unfortunately the memory interface of such a chip only supports
20–40 GB/s data transfer, i.e., one order of magnitude less. The fix is to avoid loading new data
from memory as far as possible and rather to cache it locally on the CPU. This is where caches
come in handy. Commonly the following names or concepts are used:
• Registers are strictly speaking not part of the cache. They help stage instructions. That said,
CPU registers are memory locations that a CPU can access at clock speed without any delay
penalty. CPUs have tens of registers. It is up to the compiler (or programmer) to use registers
efficiently. For instance the C programming language has a register keyword.
• L1 caches are the first line of defense against high memory bandwidth requirements. L1
caches are tiny (typical sizes might be 32–64 KB) and often split into data and instructions
caches. When data is found in the L1 cache, access is very fast. If it cannot be found
there, the search progresses down the cache hierarchy.
• L2 caches are the next stop. Depending on architecture design and processor size they might
be exclusive. They might be accessible only by a given core or shared among multiple cores.
L2 caches are larger (typically 256–512 KB per core) and slower than L1. Furthermore, to
access something in L2 we first need to establish that the data is not in L1, which adds
a small amount of extra latency.
196
https://01.org/openvinotoolkit

• L3 caches are shared among multiple cores and can be quite large. AMDʼs Epyc 3 server CPUs
have a whopping 256 MB of cache spread across multiple chiplets. More typical numbers are
in the 4–8 MB range.
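The bandwidth estimate that motivates this cache hierarchy can be reproduced in a few lines. The function below is a sketch under the assumptions stated earlier (4 cores, 2 GHz, one 256-bit load per instruction, IPC of 1); its name and parameters are illustrative.

```python
def required_bandwidth_gbs(cores, clock_hz, vector_bits, ipc=1.0, loads_per_instr=1):
    """Memory traffic the cores would request, in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_cycle = cores * ipc * loads_per_instr * vector_bits / 8
    return bytes_per_cycle * clock_hz / 1e9


demand = required_bandwidth_gbs(cores=4, clock_hz=2e9, vector_bits=256)
print(f"demand: {demand:.0f} GB/s")
for supply in (20, 40):
    print(f"a {supply} GB/s memory interface covers {supply / demand:.0%} of it")
```

The gap between demand and supply is what the registers and the L1, L2, and L3 caches are meant to bridge.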
Predicting which memory elements will be needed next is one of the key optimization parame-
ters in chip design. For instance, it is advisable to traverse memory in a forward direction since
most caching algorithms will try to read ahead rather than backwards. Likewise, keeping memory
access patterns local is a good way of improving performance.
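The benefit of local access patterns can be illustrated with a toy cache simulator. This is a sketch only: it assumes a fully associative LRU cache with 64-byte lines and 512 lines, and it does not model hardware prefetching, so it captures spatial locality but not the read-ahead behavior mentioned above.

```python
from collections import OrderedDict


def count_misses(addresses, line_bytes=64, n_lines=512):
    """Count misses in a toy fully associative LRU cache.

    line_bytes and n_lines are illustrative (64 B x 512 lines = 32 KB).
    """
    cache = OrderedDict()  # cache line index -> None, ordered by recency
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in cache:
            cache.move_to_end(line)  # hit: mark as most recently used
        else:
            misses += 1
            cache[line] = None
            if len(cache) > n_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


n, elem = 4096, 4  # 4096 accesses to four-byte elements
sequential = [i * elem for i in range(n)]
strided = [i * elem * 16 for i in range(n)]  # each access lands on a new line
print("sequential misses:", count_misses(sequential))
print("strided misses:   ", count_misses(strided))
```

With sequential access, sixteen consecutive elements share one cache line, so only one access in sixteen misses. With a stride of one full line, every access misses.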
Adding caches is a double-edged sword. On the one hand they ensure that the processor cores do
not starve of data. At the same time they increase chip size, using up area that otherwise could have
been spent on increasing processing power. Moreover, cache misses can be expensive. Consider
the worst case scenario, false sharing, as depicted in Fig. 13.4.6. A memory location is cached on
processor 0 when a thread on processor 1 requests the data. To obtain it, processor 0 needs to stop
what it is doing, write the information back to main memory and then let processor 1 read it from
memory. During this operation both processors wait. Quite potentially such code runs more slowly
on multiple processors when compared with an efficient single-processor implementation. This
is one more reason why there is a practical limit to cache sizes (besides their physical size).
Fig. 13.4.6: False sharing (image courtesy of Intel).
13.4.5 GPUs and other Accelerators
It is not an exaggeration to claim that deep learning would not have been successful without GPUs.
By the same token, it is quite reasonable to argue that GPU manufacturersʼ fortunes have increased
significantly due to deep learning. This co-evolution of hardware and algorithms has led to a sit-
uation where for better or worse deep learning is the preferable statistical modeling paradigm.
Hence it pays to understand the specific benefits that GPUs and related accelerators such as the
TPU (Jouppi et al., 2017) offer.
Of note is a distinction that is often made in practice: accelerators are optimized either for train-
ing or inference. For the latter we only need to compute the forward propagation in a network.
No storage of intermediate data is needed for backpropagation. Moreover, we may not need very
precise computation (FP16 or INT8 typically suffice). On the other hand, during training all inter-
mediate results need storage to compute gradients. Moreover, accumulating gradients requires
higher precision to avoid numerical underflow (or overflow). This means that FP16 (or mixed pre-
cision with FP32) is the minimum requirement. All of this necessitates faster and larger memory

(HBM2 vs. GDDR6) and more processing power. For instance, NVIDIAʼs Turing197 T4 GPUs are
optimized for inference whereas the V100 GPUs are preferable for training.
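The precision argument can be made concrete without any special hardware. The sketch below rounds values to IEEE half precision using the `e` format of Python's `struct` module and accumulates a small, made-up gradient value many times. The gradient magnitude and step count are arbitrary choices for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


grad, steps = 1e-4, 10_000  # illustrative values

acc16 = 0.0
acc64 = 0.0
for _ in range(steps):
    acc16 = to_fp16(acc16 + to_fp16(grad))  # every partial sum is rounded to FP16
    acc64 += grad                           # reference sum in double precision

print("FP16 accumulator:", acc16)
print("FP64 accumulator:", acc64)
print("1e-8 stored as FP16:", to_fp16(1e-8))  # too small to be represented

try:
    to_fp16(70_000.0)  # larger than the biggest finite FP16 value
except (OverflowError, struct.error):
    print("70000 cannot be stored in FP16")
```

The half-precision accumulator stalls well below the true sum once the increment becomes smaller than half the spacing between neighboring FP16 values. This is one reason why accumulation is commonly done in FP32 even when the operands are FP16.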
Recall vectorization as illustrated in Fig. 13.4.5. Adding vector units to a processor core allowed
us to increase throughput significantly. For example, in Fig. 13.4.5 we were able to
perform 8 operations simultaneously. First, what if we added operations that optimized not just
operations between vectors but also between matrices? This strategy led to tensor cores (to be
covered shortly). Second, what if we added many more cores? In a nutshell, these two strategies
summarize the design decisions in GPUs. Fig. 13.4.7 gives an overview of a basic processing block.
It contains 16 integer and 16 floating point units. In addition to that, two tensor cores accelerate a
narrow subset of additional operations relevant for deep learning. Each streaming multiprocessor
consists of four such blocks.
Fig. 13.4.7: NVIDIA Turing processing block (image courtesy of NVIDIA).
Next, 12 streaming multiprocessors are grouped into graphics processing clusters which make
up the high-end TU102 processors. Ample memory channels and an L2 cache complement the
setup. Fig. 13.4.8 has the relevant details. One of the reasons for designing such a device is that
individual blocks can be added or removed as needed to allow for more compact chips and to deal
with yield issues (faulty modules might not be activated). Fortunately programming such devices
is well hidden from the casual deep learning researcher beneath layers of CUDA and framework
code. In particular, more than one of the programs might well be executed simultaneously on the
GPU, provided that there are available resources. Nonetheless it pays to be aware of the limitations
of the devices to avoid picking models that do not fit into device memory.
197
https://devblogs.nvidia.com/nvidia-turing-architecture-in-depth/

Fig. 13.4.8: NVIDIA Turing architecture (image courtesy of NVIDIA)
A last aspect worth discussing in more detail is tensor cores. They are an example of a
recent trend of adding more optimized circuits that are specifically effective for deep learning.
For instance, the TPU added a systolic array (Kung, 1988) for fast matrix multiplication. There
the design was to support a very small number (one for the first generation of TPUs) of large op-
erations. Tensor cores are at the other end. They are optimized for small operations involving
between 4 × 4 and 16 × 16 matrices, depending on their numerical precision. Fig. 13.4.9 gives an
overview of the optimizations.
Fig. 13.4.9: NVIDIA tensor cores in Turing (image courtesy of NVIDIA).
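Conceptually, a tensor core performs a small fused multiply-accumulate D = A B + C on fixed-size tiles, and larger products are assembled from many such tile operations. The pure-Python sketch below mimics this decomposition with 4 × 4 tiles. It illustrates the idea only and says nothing about how the hardware schedules or implements these operations; `tile_mma` and `blocked_matmul` are hypothetical names.

```python
def tile_mma(a, b, c):
    """Return A @ B + C for small square tiles given as nested lists."""
    n = len(a)
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def blocked_matmul(a, b, tile=4):
    """Multiply square matrices by accumulating tile-sized products."""
    n = len(a)
    assert n % tile == 0, "matrix size must be a multiple of the tile size"
    out = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            acc = [[0.0] * tile for _ in range(tile)]
            for k0 in range(0, n, tile):
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                acc = tile_mma(a_tile, b_tile, acc)  # one tile-level operation
            for i in range(tile):
                out[i0 + i][j0:j0 + tile] = acc[i]
    return out


n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
print(blocked_matmul(a, identity) == a)
```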
Obviously when optimizing for computation we end up making certain compromises. One of
them is that GPUs are not very good at handling interrupts and sparse data. While there are
notable exceptions, such as Gunrock198 (Wang et al., 2016), the access pattern of sparse matrices and
vectors does not go well with the high bandwidth burst read operations where GPUs excel. Matching
both goals is an area of active research. See e.g., DGL199, a library tuned for deep learning on
graphs.
198
https://github.com/gunrock/gunrock
199
http://dgl.ai
13.4.6 Networks and Buses
Whenever a single device is insufficient for optimization we need to transfer data to and from it
to synchronize processing. This is where networks and buses come in handy. We have a number
of design parameters: bandwidth, cost, distance, and flexibility. On one end we have WiFi, which
has a pretty good range, is very easy to use (no wires, after all), and is cheap, but offers comparatively
mediocre bandwidth and latency. No machine learning researcher in their right mind would
use it to build a cluster of servers. In what follows we focus on interconnects that are suitable for
deep learning.
• PCIe is a dedicated bus for very high bandwidth point-to-point connections (up to 32 GB/s
on PCIe 4.0 in a 16-lane slot). Latency is in the order of single-digit microseconds (5
µs). PCIe links are precious. Processors only have a limited number of them: AMDʼs EPYC 3
has 128 lanes, Intelʼs Xeon has up to 48 lanes per chip; on desktop-grade CPUs the numbers
are 20 (Ryzen 9) and 16 (Core i9) respectively. Since GPUs have typically 16 lanes, this limits
the number of GPUs that can connect to the CPU at full bandwidth. After all, they need to
share the links with other high bandwidth peripherals such as storage and Ethernet. Just
like with RAM access, large bulk transfers are preferable due to reduced packet overhead.
• Ethernet is the most commonly used way of connecting computers. While it is significantly
slower than PCIe, it is very cheap and resilient to install and covers much longer distances.
Typical bandwidth for low-grade servers is 1 GBit/s. Higher-end devices (e.g., C5 instances200
in the cloud) offer between 10 and 100 GBit/s bandwidth. As in all previous cases data trans-
mission has significant overheads. Note that we almost never use raw Ethernet directly but
rather a protocol that is executed on top of the physical interconnect (such as UDP or TCP/IP).
This adds further overhead. Like PCIe, Ethernet is designed to connect two devices, e.g., a
computer and a switch.
• Switches allow us to connect multiple devices in a manner where any pair of them can
carry out a (typically full bandwidth) point-to-point connection simultaneously. For in-
stance, Ethernet switches might connect 40 servers at high cross-sectional bandwidth. Note
that switches are not unique to traditional computer networks. Even PCIe lanes can be
switched201 . This occurs, e.g., to connect a large number of GPUs to a host processor, as
is the case for the P2 instances202 .
• NVLink is an alternative to PCIe when it comes to very high bandwidth interconnects. It
offers up to 300 Gbit/s data transfer rate per link. Server GPUs (Volta V100) have six links
whereas consumer-grade GPUs (RTX 2080 Ti) have only one link, operating at a reduced 100
Gbit/s rate. We recommend using NCCL203 to achieve high data transfer between GPUs.
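The preference for large bulk transfers noted for PCIe can be quantified with a simple cost model: each transfer pays a fixed latency plus a size-dependent wire time. The sketch below uses the 5 µs latency and 32 GB/s bandwidth quoted for PCIe 4.0 above and assumes that transfers are issued one after another without any overlap, which is a pessimistic simplification.

```python
def transfer_time_us(total_bytes, chunk_bytes, latency_us, bandwidth_gbs):
    """Time to move total_bytes in chunks, each paying a fixed latency."""
    n_chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    wire_time_us = total_bytes / (bandwidth_gbs * 1e9) * 1e6
    return n_chunks * latency_us + wire_time_us


total = 64 * 1024 * 1024  # 64 MB payload, an arbitrary example size
for chunk in (4 * 1024, 1024 * 1024, total):
    t = transfer_time_us(total, chunk, latency_us=5.0, bandwidth_gbs=32.0)
    print(f"chunk size {chunk:>9d} B -> {t / 1000:8.2f} ms")
```

With 4 KB chunks the fixed latency dominates the total time, whereas a single bulk transfer is limited almost entirely by bandwidth.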
200
https://aws.amazon.com/ec2/instance-types/c5/
201
https://www.broadcom.com/products/pcie-switches-bridges/pcie-switches
202
https://aws.amazon.com/ec2/instance-types/p2/
203
https://github.com/NVIDIA/nccl

13.4.7 More Latency Numbers
The summaries in the two tables below are from Eliot Eshelman204, who maintains an
updated version of the numbers as a GitHub gist205.
Action                                   Time    Notes
L1 cache reference/hit                   1.5 ns  4 cycles
Floating-point add/mult/FMA              1.5 ns  4 cycles
L2 cache reference/hit                   5 ns    12 ~ 17 cycles
Branch mispredict                        6 ns    15 ~ 20 cycles
L3 cache hit (unshared cache)            16 ns   42 cycles
L3 cache hit (shared in another core)    25 ns   65 cycles
Mutex lock/unlock                        25 ns
L3 cache hit (modified in another core)  29 ns   75 cycles
L3 cache hit (on a remote CPU socket)    40 ns   100 ~ 300 cycles (40 ~ 116 ns)
QPI hop to another CPU (per hop)         40 ns
64MB memory ref. (local CPU)             46 ns   TinyMemBench on Broadwell E5-2690v4
64MB memory ref. (remote CPU)            70 ns   TinyMemBench on Broadwell E5-2690v4
256MB memory ref. (local CPU)            75 ns   TinyMemBench on Broadwell E5-2690v4
Intel Optane random write                94 ns   UCSD Non-Volatile Systems Lab
256MB memory ref. (remote CPU)           120 ns  TinyMemBench on Broadwell E5-2690v4
Intel Optane random read                 305 ns  UCSD Non-Volatile Systems Lab
Send 4KB over 100 Gbps HPC fabric        1 µs    MVAPICH2 over Intel Omni-Path
Compress 1KB with Google Snappy          3 µs
Send 4KB over 10 Gbps ethernet           10 µs
Write 4KB randomly to NVMe SSD           30 µs   DC P3608 NVMe SSD (QOS 99% is 500µs)
Transfer 1MB to/from NVLink GPU          30 µs   ~33GB/s on NVIDIA 40GB NVLink
Transfer 1MB to/from PCI-E GPU           80 µs   ~12GB/s on PCIe 3.0 x16 link
Read 4KB randomly from NVMe SSD          120 µs  DC P3608 NVMe SSD (QOS 99%)
Read 1MB sequentially from NVMe SSD      208 µs  ~4.8GB/s DC P3608 NVMe SSD
Write 4KB randomly to SATA SSD           500 µs  DC S3510 SATA SSD (QOS 99.9%)
Read 4KB randomly from SATA SSD          500 µs  DC S3510 SATA SSD (QOS 99.9%)
Round trip within same data center       500 µs  One-way ping is ~250µs
Read 1MB sequentially from SATA SSD      2 ms    ~550MB/s DC S3510 SATA SSD
Read 1MB sequentially from disk          5 ms    ~200MB/s server HDD
Random Disk Access (seek+rotation)       10 ms
Send packet CA->Netherlands->CA          150 ms
Table: Common Latency Numbers.
Action                                   Time    Notes
GPU Shared Memory access                 30 ns   30~90 cycles (bank conflicts add latency)
GPU Global Memory access                 200 ns  200~800 cycles
Launch CUDA kernel on GPU                10 µs   Host CPU instructs GPU to start kernel
Transfer 1MB to/from NVLink GPU          30 µs   ~33GB/s on NVIDIA 40GB NVLink
Transfer 1MB to/from PCI-E GPU           80 µs   ~12GB/s on PCI-Express x16 link
Table: Latency Numbers for NVIDIA Tesla GPUs.
204
https://gist.github.com/eshelman
205
https://gist.github.com/eshelman/343a1c46cb3fba142c1afdcdeec17646
13.4.8 Summary
• Devices have overheads for operations. Hence it is important to aim for a small number of
large transfers rather than many small ones. This applies to RAM, SSDs, networks and GPUs.
• Vectorization is key for performance. Make sure you are aware of the specific abilities of your
accelerator. E.g., some Intel Xeon CPUs are particularly good for INT8 operations, NVIDIA
Volta GPUs excel at FP16 matrix-matrix operations and NVIDIA Turing shines at FP16, INT8,
and INT4 operations.
• Numerical overflow due to small data types can be a problem during training (and to a lesser
extent during inference).
• Aliasing can significantly degrade performance. For instance, memory alignment on 64 bit
CPUs should be done with respect to 64 bit boundaries. On GPUs it is a good idea to keep
convolution sizes aligned, e.g., to tensor cores.
• Match your algorithms to the hardware (e.g., memory footprint, and bandwidth). Great
speedup (orders of magnitude) can be achieved when fitting the parameters into caches.
• We recommend that you sketch out the performance of a novel algorithm on paper before
verifying the experimental results. Discrepancies of an order-of-magnitude or more are rea-
sons for concern.
• Use profilers to debug performance bottlenecks.
• Training and inference hardware have different sweet spots in terms of price and perfor-
mance.
Exercises
1. Write C code to test whether there is any difference in speed between accessing memory
aligned or misaligned relative to the external memory interface. Hint: be careful of caching
effects.
2. Test the difference in speed between accessing memory in sequence or with a given stride.
3. How could you measure the cache sizes on a CPU?
4. How would you lay out data across multiple memory channels for maximum bandwidth?
How would you lay it out if you had many small threads?
5. An enterprise-class HDD is spinning at 10,000 rpm. What is the absolute minimum time
an HDD needs to spend in the worst case before it can read data (you can assume that heads move
almost instantaneously)? Why are 2.5” HDDs becoming popular for commercial servers (rel-
ative to 3.5” and 5.25” drives)?
6. Assume that an HDD manufacturer increases the storage density from 1 Tbit per square inch
to 5 Tbit per square inch. How much information can you store on a ring on a 2.5” HDD? Is
there a difference between the inner and outer tracks?
7. Going from 8 bit to 16 bit data types increases the amount of silicon approximately by four
times. Why? Why might NVIDIA have added INT4 operations to their Turing GPUs?

8. How much faster is it to read forward through memory vs. reading backwards? Does this
number differ between different computers and CPU vendors? Why? Write C code and ex-
periment with it.
9. Can you measure the cache size of your disk? What is it for a typical HDD? Do SSDs need a
cache?
10. Measure the packet overhead when sending messages across the Ethernet. Look up the dif-
ference between UDP and TCP/IP connections.
11. Direct memory access allows devices other than the CPU to write (and read) directly to (from)
memory. Why is this a good idea?
12. Look at the performance numbers for the Turing T4 GPU. Why does the performance “only”
double as you go from FP16 to INT8 and INT4?
13. What is the shortest time it should take for a packet on a round trip between San Francisco
and Amsterdam? Hint: you can assume that the distance is 10,000 km.
Discussions206
13.5 Training on Multiple GPUs
So far we discussed how to train models efficiently on CPUs and GPUs. We even showed how deep
learning frameworks allow one to parallelize computation and communication automatically be-
tween them in Section 13.3. We also showed in Section 6.7 how to list all the available GPUs on
a computer using the nvidia-smi command. What we did not discuss is how to actually paral-
lelize deep learning training. Instead, we implied in passing that one would somehow split the
data across multiple devices and make it work. The present section fills in the details and shows
how to train a network in parallel when starting from scratch. Details on how to take advantage of
functionality in high-level APIs are relegated to Section 13.6. We assume that you are familiar with
minibatch stochastic gradient descent algorithms such as the ones described in Section 12.5.
13.5.1 Splitting the Problem
Letʼs start with a simple computer vision problem and a slightly archaic network, e.g., with multi-
ple layers of convolutions, pooling, and possibly a few fully connected layers in the end. That
is, letʼs start with a network that looks quite similar to LeNet (LeCun et al., 1998) or AlexNet
(Krizhevsky et al., 2012). Given multiple GPUs (2 if it is a desktop server, 4 on an AWS g4dn.12xlarge
instance, 8 on a p3.16xlarge, or 16 on a p2.16xlarge), we want to partition training in a manner as
to achieve good speedup while simultaneously benefitting from simple and reproducible design
choices. Multiple GPUs, after all, increase both memory and computation ability. In a nutshell, we
have the following choices, given a minibatch of training data that we want to classify.
First, we could partition the network across multiple GPUs. That is, each GPU takes as input the
data flowing into a particular layer, processes data across a number of subsequent layers and then
sends the data to the next GPU. This allows us to process data with larger networks when compared
with what a single GPU could handle. Besides, memory footprint per GPU can be well controlled
(it is a fraction of the total network footprint).
206

However, the interface between layers (and thus GPUs) requires tight synchronization. This can
be tricky, in particular if the computational workloads are not properly matched between layers.
The problem is exacerbated for large numbers of GPUs. The interface between layers also requires
large amounts of data transfer, such as activations and gradients. This may overwhelm the band-
width of the GPU buses. Moreover, compute-intensive, yet sequential operations are nontrivial to
partition. See e.g., (Mirhoseini et al., 2017) for a best effort in this regard. It remains a difficult
problem and it is unclear whether it is possible to achieve good (linear) scaling on nontrivial prob-
lems. We do not recommend it unless there is excellent framework or operating system support
for chaining together multiple GPUs.
Second, we could split the work layerwise. For instance, rather than computing 64 channels on a
single GPU we could split up the problem across 4 GPUs, each of which generates data for 16 chan-
nels. Likewise, for a fully connected layer we could split the number of output units. Fig. 13.5.1
(taken from (Krizhevsky et al., 2012)) illustrates this design, where this strategy was used to deal
with GPUs that had a very small memory footprint (2 GB at the time). This allows for good scaling
in terms of computation, provided that the number of channels (or units) is not too small. Be-
sides, multiple GPUs can process increasingly larger networks since the available memory scales
linearly.
Fig. 13.5.1: Model parallelism in the original AlexNet design due to limited GPU memory.
However, we need a very large number of synchronization or barrier operations since each layer
depends on the results from all the other layers. Moreover, the amount of data that needs to be
transferred is potentially even larger than when distributing layers across GPUs. Thus, we do not
recommend this approach due to its bandwidth cost and complexity.
Last, we could partition data across multiple GPUs. This way all GPUs perform the same type of
work, albeit on different observations. Gradients are aggregated across GPUs after each minibatch
of training data. This is the simplest approach and it can be applied in any situation. We only need
to synchronize after each minibatch. That said, it is highly desirable to start exchanging the gradients
of some parameters while others are still being computed. Moreover, larger numbers of GPUs lead
to larger minibatch sizes, thus increasing training efficiency. However, adding more GPUs does
not allow us to train larger models.

Fig. 13.5.2: Parallelization on multiple GPUs. From left to right: original problem, network parti-
tioning, layerwise partitioning, data parallelism.
A comparison of different ways of parallelization on multiple GPUs is depicted in Fig. 13.5.2. By
and large, data parallelism is the most convenient way to proceed, provided that we have access
to GPUs with sufficiently large memory. See also (Li et al., 2014) for a detailed description of par-
titioning for distributed training. GPU memory used to be a problem in the early days of deep
learning. By now this issue has been resolved for all but the most unusual cases. We focus on data
parallelism in what follows.
13.5.2 Data Parallelism
Assume that there are k GPUs on a machine. Given the model to be trained, each GPU will maintain
a complete set of model parameters independently, though parameter values across the GPUs are
identical and synchronized. As an example, Fig. 13.5.3 illustrates training with data parallelism
when k = 2.

Fig. 13.5.3: Calculation of minibatch stochastic gradient descent using data parallelism on two
GPUs.
In general, the training proceeds as follows:

• In any iteration of training, given a random minibatch, we split the examples in the batch
into k portions and distribute them evenly across the GPUs.
• Each GPU calculates loss and gradient of the model parameters based on the minibatch sub-
set it was assigned.
• The local gradients of each of the k GPUs are aggregated to obtain the current minibatch
stochastic gradient.
• The aggregate gradient is re-distributed to each GPU.
• Each GPU uses this minibatch stochastic gradient to update the complete set of model pa-
rameters that it maintains.
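The five steps above can be simulated without any GPU. The sketch below is a toy stand-in: it uses a one-dimensional linear model with squared loss, plain Python lists in place of device tensors, and hypothetical helper names (`grad_sum`, `allreduce_sum`, `train_step`). It is not the implementation developed in the remainder of this section.

```python
def grad_sum(w, b, shard):
    """Sum of squared-loss gradients (dw, db) over one shard of (x, y) pairs."""
    dw = db = 0.0
    for x, y in shard:
        err = w * x + b - y
        dw += err * x
        db += err
    return [dw, db]


def allreduce_sum(grads):
    """Sum gradients across workers and hand every worker a copy of the total."""
    total = [sum(parts) for parts in zip(*grads)]
    return [list(total) for _ in grads]


def train_step(params, batch, k, lr):
    """One data-parallel SGD step on k simulated devices."""
    shards = [batch[i::k] for i in range(k)]          # split the minibatch
    local = [grad_sum(p[0], p[1], s) for p, s in zip(params, shards)]
    reduced = allreduce_sum(local)                    # aggregate and redistribute
    for p, g in zip(params, reduced):                 # identical update per replica
        p[0] -= lr * g[0] / len(batch)
        p[1] -= lr * g[1] / len(batch)


batch = [(float(x), 2.0 * x + 1.0) for x in range(8)]  # toy data: y = 2x + 1
k = 2
params = [[0.0, 0.0] for _ in range(k)]                # one replica per device
for _ in range(2000):
    train_step(params, batch, k, lr=0.05)
print(params)
```

Because every replica starts from the same values and applies the same aggregated gradient, the parameter copies remain identical throughout training, and the aggregated gradient matches what a single device would compute on the full minibatch.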
Note that in practice we increase the minibatch size k-fold when training on k GPUs such that each
GPU has the same amount of work to do as if we were training on a single GPU only. On a 16-
GPU server this can increase the minibatch size considerably and we may have to increase the
learning rate accordingly. Also note that batch normalization in Section 8.5 needs to be adjusted,
e.g., by keeping a separate batch normalization coefficient per GPU. In what follows we will use a
toy network to illustrate multi-GPU training.
%matplotlib inline
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

13.5.3 A Toy Network
We use LeNet as introduced in Section 7.6 (with slight modifications). We define it from scratch
to illustrate parameter exchange and synchronization in detail.
# Initialize model parameters
scale = 0.01
W1 = torch.randn(size=(20, 1, 3, 3)) * scale
b1 = torch.zeros(20)
W2 = torch.randn(size=(50, 20, 5, 5)) * scale
b2 = torch.zeros(50)
W3 = torch.randn(size=(800, 128)) * scale
b3 = torch.zeros(128)
W4 = torch.randn(size=(128, 10)) * scale
b4 = torch.zeros(10)
params = [W1, b1, W2, b2, W3, b3, W4, b4]
# Define the model
def lenet(X, params):
    h1_conv = F.conv2d(input=X, weight=params[0], bias=params[1])
    h1_activation = F.relu(h1_conv)
    h1 = F.avg_pool2d(input=h1_activation, kernel_size=(2, 2), stride=(2, 2))
    h2_conv = F.conv2d(input=h1, weight=params[2], bias=params[3])
    h2_activation = F.relu(h2_conv)
    h2 = F.avg_pool2d(input=h2_activation, kernel_size=(2, 2), stride=(2, 2))
    h2 = h2.reshape(h2.shape[0], -1)
    h3_linear = torch.mm(h2, params[4]) + params[5]
    h3 = F.relu(h3_linear)
    y_hat = torch.mm(h3, params[6]) + params[7]
    return y_hat
# Cross-entropy loss function

loss = nn.CrossEntropyLoss(reduction='none')
13.5.4 Data Synchronization
For efficient multi-GPU training we need two basic operations. First we need to have the ability to
distribute a list of parameters to multiple devices and to attach gradients (get_params). Without
parameters it is impossible to evaluate the network on a GPU. Second, we need the ability to sum
parameters across multiple devices, i.e., we need an allreduce function.
def get_params(params, device):
    new_params = [p.to(device) for p in params]
    for p in new_params:
        p.requires_grad_()
    return new_params
Letʼs try it out by copying the model parameters to one GPU.
new_params = get_params(params, d2l.try_gpu(0))

print('b1 weight:', new_params[1])
print('b1 grad:', new_params[1].grad)

b1 weight: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0', requires_grad=True)
b1 grad: None
Since we have not performed any computation yet, the gradient with regard to the bias parameter
has not been populated (it is None). Now let's assume that we have a vector distributed across multiple GPUs. The following
allreduce function adds up all vectors and broadcasts the result back to all GPUs. Note that for
this to work we need to copy the data to the device accumulating the results.
def allreduce(data):
    for i in range(1, len(data)):
        data[0][:] += data[i].to(data[0].device)
    for i in range(1, len(data)):
        data[i][:] = data[0].to(data[i].device)
Let's test this by creating vectors with different values on different devices and aggregating them.
data = [torch.ones((1, 2), device=d2l.try_gpu(i)) * (i + 1) for i in range(2)]

print('before allreduce:\n', data[0], '\n', data[1])
allreduce(data)
print('after allreduce:\n', data[0], '\n', data[1])
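before allreduce:
 tensor([[1., 1.]], device='cuda:0')
 tensor([[2., 2.]], device='cuda:1')
after allreduce:
 tensor([[3., 3.]], device='cuda:0')
 tensor([[3., 3.]], device='cuda:1')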
13.5.5 Distributing Data
We need a simple utility function to distribute a minibatch evenly across multiple GPUs. For in-
stance, on two GPUs we would like to have half of the data to be copied to either of the GPUs.
Since it is more convenient and more concise, we use the built-in function from the deep learning
framework to try it out on a 4 × 5 matrix.
data = torch.arange(20).reshape(4, 5)
devices = [torch.device('cuda:0'), torch.device('cuda:1')]
split = nn.parallel.scatter(data, devices)
print('input :', data)
print('load into', devices)
print('output:', split)
input : tensor([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]])
load into [device(type='cuda', index=0), device(type='cuda', index=1)]
output: (tensor([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]], device='cuda:0'), tensor([[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]], device='cuda:1'))
For later reuse we define a split_batch function that splits both data and labels.
#@save
def split_batch(X, y, devices):
    """Split `X` and `y` into multiple devices."""
    assert X.shape[0] == y.shape[0]
    return (nn.parallel.scatter(X, devices),
            nn.parallel.scatter(y, devices))
13.5.6 Training
Now we can implement multi-GPU training on a single minibatch. Its implementation is primarily
based on the data parallelism approach described in this section. We will use the auxiliary functions
we just discussed, allreduce and split_batch, to synchronize the data among multiple
GPUs. Note that we do not need to write any specific code to achieve parallelism. Since the com-
putational graph does not have any dependencies across devices within a minibatch, it is executed
in parallel automatically.
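The claim that data parallelism leaves the optimization unchanged can be sanity-checked without any GPU. The following is a minimal sketch in plain Python, not the d2l or PyTorch API: it assumes a hypothetical one-parameter least-squares model, splits a minibatch into shards, computes the gradient of the summed loss on each shard, and adds the shard gradients in the manner of allreduce. The helper names `grad_sum_squared_error` and `split_evenly` are made up for this illustration.

```python
# Minimal sketch (plain Python, hypothetical helper names): gradients summed
# over shards equal the gradient computed over the full minibatch.

def grad_sum_squared_error(w, xs, ys):
    """Gradient w.r.t. w of sum_i (w * x_i - y_i)^2 over the given examples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))

def split_evenly(items, num_shards):
    """Split a list into `num_shards` contiguous shards of equal size."""
    assert len(items) % num_shards == 0
    size = len(items) // num_shards
    return [items[i * size:(i + 1) * size] for i in range(num_shards)]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 9.0]
w, lr, num_shards = 0.5, 0.1, 2

# "Single GPU": gradient over the whole minibatch
full_grad = grad_sum_squared_error(w, xs, ys)

# "Multiple GPUs": one gradient per shard, then an allreduce-style sum
shard_grads = [grad_sum_squared_error(w, x_shard, y_shard)
               for x_shard, y_shard in zip(split_evenly(xs, num_shards),
                                           split_evenly(ys, num_shards))]
summed_grad = sum(shard_grads)

# Both variants take the same SGD step, normalized by the full batch size
w_single = w - lr * full_grad / len(xs)
w_multi = w - lr * summed_grad / len(xs)
print(full_grad, summed_grad, w_single, w_multi)
```

Because the loss is a sum over examples, the order in which the per-example gradients are grouped does not matter, which is why the update is normalized by the full batch size rather than by the shard size.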
def train_batch(X, y, device_params, devices, lr):
    X_shards, y_shards = split_batch(X, y, devices)
    # Loss is calculated separately on each GPU
    ls = [loss(lenet(X_shard, device_W), y_shard).sum()
          for X_shard, y_shard, device_W in zip(
              X_shards, y_shards, device_params)]
    for l in ls:  # Backpropagation is performed separately on each GPU
        l.backward()
    # Sum all gradients from each GPU and broadcast them to all GPUs
    for i in range(len(device_params[0])):
        allreduce([device_params[c][i].grad for c in range(len(devices))])
    # The model parameters are updated separately on each GPU
    for param in device_params:
        d2l.sgd(param, lr, X.shape[0])  # Here, we use a full-size batch
Now, we can define the training function. It is slightly different from the ones used in the previ-
ous chapters: we need to allocate the GPUs and copy all the model parameters to all the devices.
Obviously each batch is processed using the train_batch function to deal with multiple GPUs. For
convenience (and conciseness of code) we compute the accuracy on a single GPU, though this is
inefficient since the other GPUs are idle.
def train(num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    devices = [d2l.try_gpu(i) for i in range(num_gpus)]
    # Copy model parameters to `num_gpus` GPUs
    device_params = [get_params(params, d) for d in devices]
    num_epochs = 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    timer = d2l.Timer()
    for epoch in range(num_epochs):
        timer.start()
        for X, y in train_iter:
            # Perform multi-GPU training for a single minibatch
            train_batch(X, y, device_params, devices, lr)
            # Wait for pending GPU work so that the timing is meaningful
            torch.cuda.synchronize()
        timer.stop()
        # Evaluate the model on GPU 0
        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(
            lambda x: lenet(x, device_params[0]), test_iter, devices[0]),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
          f'on {str(devices)}')
Letʼs see how well this works on a single GPU. We first use a batch size of 256 and a learning rate
of 0.2.
train(num_gpus=1, batch_size=256, lr=0.2)
test acc: 0.80, 2.5 sec/epoch on [device(type='cuda', index=0)]
By keeping the batch size and learning rate unchanged and increasing the number of GPUs to 2,
we can see that the test accuracy roughly stays the same compared with the previous experiment.
In terms of the optimization algorithms, they are identical. Unfortunately there is no meaningful
speedup to be gained here: the model is simply too small; moreover we only have a small dataset,
where our slightly unsophisticated approach to implementing multi-GPU training suffered from
significant Python overhead. We will encounter more complex models and more sophisticated
ways of parallelization going forward. Letʼs see what happens nonetheless for Fashion-MNIST.
train(num_gpus=2, batch_size=256, lr=0.2)
test acc: 0.84, 2.5 sec/epoch on [device(type='cuda', index=0), device(type='cuda', index=1)]

13.5.7 Summary
• There are multiple ways to split deep network training over multiple GPUs. We could split
them between layers, across layers, or across data. The former two require tightly chore-
ographed data transfers. Data parallelism is the simplest strategy.
• Data parallel training is straightforward. However, it needs a larger effective minibatch size
to be efficient.
• In data parallelism, data is split across multiple GPUs, where each GPU executes its own
forward and backward operation and subsequently gradients are aggregated and results are
broadcast back to the GPUs.
• We may use slightly increased learning rates for larger minibatches.
Exercises
1. When training on k GPUs, change the minibatch size from b to k · b, i.e., scale it up by the
number of GPUs.
2. Compare accuracy for different learning rates. How does it scale with the number of GPUs?
3. Implement a more efficient allreduce function that aggregates different parameters on different
GPUs. Why is it more efficient?
4. Implement multi-GPU test accuracy computation.

13.6 Concise Implementation for Multiple GPUs
Implementing parallelism from scratch for every new model is no fun. Moreover, there is signif-
icant benefit in optimizing synchronization tools for high performance. In the following we will
show how to do this using high-level APIs of deep learning frameworks. The mathematics and the
algorithms are the same as in Section 13.5. Quite unsurprisingly you will need at least two GPUs
to run code of this section.
import torch
from torch import nn
from d2l import torch as d2l
13.6.1 A Toy Network
Letʼs use a slightly more meaningful network than LeNet from Section 13.5 that is still sufficiently
easy and quick to train. We pick a ResNet-18 variant (He et al., 2016). Since the input images are
tiny we modify it slightly. In particular, the difference from Section 8.6 is that we use a smaller
convolution kernel, stride, and padding at the beginning. Moreover, we remove the max-pooling
layer.
#@save
def resnet18(num_classes, in_channels=1):
    """A slightly modified ResNet-18 model."""
    def resnet_block(in_channels, out_channels, num_residuals,
                     first_block=False):
        blk = []
        for i in range(num_residuals):
            if i == 0 and not first_block:
                blk.append(d2l.Residual(out_channels, use_1x1conv=True,
                                        strides=2))
            else:
                blk.append(d2l.Residual(out_channels))
        return nn.Sequential(*blk)

    # This model uses a smaller convolution kernel, stride, and padding and
    # removes the max-pooling layer
    net = nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU())
    net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
    net.add_module("resnet_block2", resnet_block(64, 128, 2))
    net.add_module("resnet_block3", resnet_block(128, 256, 2))
    net.add_module("resnet_block4", resnet_block(256, 512, 2))
    net.add_module("global_avg_pool", nn.AdaptiveAvgPool2d((1,1)))
    net.add_module("fc", nn.Sequential(nn.Flatten(),
                                       nn.Linear(512, num_classes)))
    return net

13.6.2 Network Initialization
We will initialize the network inside the training loop. For a refresher on initialization methods
see Section 5.4.
net = resnet18(10)
# Get a list of GPUs
devices = d2l.try_all_gpus()
# We will initialize the network inside the training loop
13.6.3 Training
As before, the training code needs to perform several basic functions for efficient parallelism:
• Network parameters need to be initialized across all devices.
• While iterating over the dataset, minibatches are to be divided across all devices.
• We compute the loss and its gradient in parallel across devices.
• Gradients are aggregated and parameters are updated accordingly.
In the end we compute the accuracy (again in parallel) to report the final performance of the
network. The training routine is quite similar to implementations in previous chapters, except
that we need to split and aggregate data.
def train(net, num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    devices = [d2l.try_gpu(i) for i in range(num_gpus)]
    def init_weights(module):
        if type(module) in [nn.Linear, nn.Conv2d]:
            nn.init.normal_(module.weight, std=0.01)
    net.apply(init_weights)
    # Set the model on multiple GPUs
    net = nn.DataParallel(net, device_ids=devices)
    trainer = torch.optim.SGD(net.parameters(), lr)
    loss = nn.CrossEntropyLoss()
    timer, num_epochs = d2l.Timer(), 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    for epoch in range(num_epochs):
        net.train()
        timer.start()
        for X, y in train_iter:
            trainer.zero_grad()
            X, y = X.to(devices[0]), y.to(devices[0])
            l = loss(net(X), y)
            l.backward()
            trainer.step()
        timer.stop()
        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(net, test_iter),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, {timer.avg():.1f} sec/epoch '
          f'on {str(devices)}')
Letʼs see how this works in practice. As a warm-up we train the network on a single GPU.
train(net, num_gpus=1, batch_size=256, lr=0.1)
test acc: 0.93, 13.7 sec/epoch on [device(type='cuda', index=0)]
Next we use 2 GPUs for training. Compared with LeNet evaluated in Section 13.5, the model for
ResNet-18 is considerably more complex. This is where parallelization shows its advantage. The
time for computation is meaningfully larger than the time for synchronizing parameters. This
improves scalability since the overhead for parallelization is less relevant.
train(net, num_gpus=2, batch_size=512, lr=0.2)
test acc: 0.71, 8.4 sec/epoch on [device(type='cuda', index=0), device(type='cuda', index=1)]

13.6.4 Summary
• Data is automatically evaluated on the devices where the data can be found.
• Take care to initialize the networks on each device before trying to access the parameters on
that device. Otherwise you will encounter an error.
• The optimization algorithms automatically aggregate over multiple GPUs.
Exercises
1. This section uses ResNet-18. Try different epochs, batch sizes, and learning rates. Use
more GPUs for computation. What happens if you try this with 16 GPUs (e.g., on an AWS
p2.16xlarge instance)?
2. Sometimes, different devices provide different computing power. We could use the GPUs
and the CPU at the same time. How should we divide the work? Is it worth the effort? Why?
Why not?
13.7 Parameter Servers
As we move from a single GPU to multiple GPUs and then to multiple servers containing multiple
GPUs, possibly all spread out across multiple racks and network switches, our algorithms for dis-
tributed and parallel training need to become much more sophisticated. Details matter since dif-
ferent interconnects have very different bandwidth (e.g., NVLink can offer up to 100 GB/s across
6 links in an appropriate setting, PCIe 4.0 (16-lane) offers 32 GB/s, while even high speed 100GbE
Ethernet only amounts to 10 GB/s). At the same time it is unreasonable to expect that a statistical
modeler be an expert in networking and systems.
The core idea of the parameter server was introduced in (Smola and Narayanamurthy, 2010) in
the context of distributed latent variable models. A description of the push and pull semantics
then followed in (Ahmed et al., 2012) and a description of the system and an open source library
followed in (Li et al., 2014). In the following we will motivate the components needed for efficiency.
13.7.1 Data-Parallel Training
Letʼs review the data parallel training approach to distributed training. We will use this to the
exclusion of all others in this section since it is significantly simpler to implement in practice.
There are virtually no use cases (besides deep learning on graphs) where any other strategy for
parallelism is preferred since GPUs have plenty of memory nowadays. Fig. 13.7.1 describes the
variant of data parallelism that we implemented in Section 13.5. The key aspect in it is that the
aggregation of gradients occurs on GPU 0 before the updated parameters are rebroadcast to all
GPUs.

Fig. 13.7.1: Left: single GPU training. Right: a variant of multi-GPU training: (1) we compute loss
and gradient, (2) all gradients are aggregated on one GPU, (3) parameter update happens and the
parameters are re-distributed to all GPUs.
In retrospect, the decision to aggregate on GPU 0 seems rather ad-hoc. After all, we might just
as well aggregate on the CPU. In fact, we could even decide to aggregate some of the parameters
on one GPU and some others on another. Provided that the optimization algorithm supports this,
there is no real reason why we could not. For instance, if we have four parameter vectors
with associated gradients $g_1, \ldots, g_4$ we could aggregate the gradients on one GPU for each $g_i$ ($i = 1, \ldots, 4$).
This reasoning seems arbitrary and frivolous. After all, the mathematics is the same throughout.
However, we are dealing with real physical hardware where different buses have different band-
width as discussed in Section 13.4. Consider a real 4-way GPU server as described in Fig. 13.7.2. If
it is particularly well connected, it might have a 100 GbE network card. More typical numbers are
in the 1–10 GbE range with an effective bandwidth of 100 MB/s to 1 GB/s. Since the CPUs have too
few PCIe lanes to connect to all GPUs directly (e.g., consumer-grade Intel CPUs have 24 lanes) we
need a multiplexer (https://www.broadcom.com/products/pcie-switches-bridges/pcie-switches). The bandwidth from the CPU on a 16x Gen3 link is 16 GB/s. This is also the
speed at which each of the GPUs is connected to the switch. This means that it is more effective to
communicate between the devices.

Fig. 13.7.2: A 4-way GPU server.
For the sake of the argument letʼs assume that the gradients are of 160 MB. In this case it takes
30 ms to send the gradients from all 3 remaining GPUs to the fourth one (each transfer takes 10
ms = 160 MB / 16 GB/s). Adding another 30 ms to transmit the weight vectors back we arrive at a
total of 60 ms. If we send all data to the CPU we incur a penalty of 40 ms since each of the four
GPUs needs to send the data to the CPU, yielding a total of 80 ms. Lastly assume that we are able
to split the gradients into 4 parts of 40 MB each. Now we can aggregate each of the parts on a
different GPU simultaneously since the PCIe switch offers a full-bandwidth operation between all
links. Instead of 30 ms this takes 7.5 ms, yielding a total of 15 ms for a synchronization operation.
In short, depending on how we synchronize parameters the same operation can take anywhere
from 15 ms to 80 ms. Fig. 13.7.3 depicts the different strategies for exchanging parameters.
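The arithmetic behind these three figures can be written out as a short back-of-the-envelope sketch. It only uses the numbers assumed in the text (160 MB of gradients, 16 GB/s per link, 4 GPUs) and takes 1 GB = 1000 MB; the function and variable names are hypothetical.

```python
# Back-of-the-envelope sketch of the three strategies discussed above.
GRADIENT_MB = 160.0
LINK_GB_PER_S = 16.0
NUM_GPUS = 4

def transfer_ms(size_mb, gb_per_s=LINK_GB_PER_S):
    """Milliseconds to move `size_mb` over one link: MB / (GB/s) = ms."""
    return size_mb / gb_per_s

one_transfer = transfer_ms(GRADIENT_MB)

# (a) Aggregate on one GPU: 3 transfers in, then 3 transfers back out
on_one_gpu = 2 * (NUM_GPUS - 1) * one_transfer

# (b) Aggregate on the CPU: all 4 GPUs send, then all 4 receive
on_cpu = 2 * NUM_GPUS * one_transfer

# (c) Split into 4 parts, each aggregated on a different GPU at the same time
part_transfer = transfer_ms(GRADIENT_MB / NUM_GPUS)
split_across_gpus = 2 * (NUM_GPUS - 1) * part_transfer

for name, ms in [('one GPU', on_one_gpu), ('CPU', on_cpu),
                 ('split across GPUs', split_across_gpus)]:
    print(f'{name}: {ms:.1f} ms')
```

The model ignores latency and any overlap between sends and receives, so it should be read as an ordering of the strategies rather than as a prediction for real hardware.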
Fig. 13.7.3: Parameter synchronization strategies.
Note that we have yet another tool at our disposal when it comes to improving performance: in
a deep network it takes some time to compute all gradients from the top to the bottom. We can
begin synchronizing gradients for some parameter groups even while we are still busy computing
them for others. See e.g., (Sergeev and Del Balso, 2018) for details on how to do this in Horovod (https://github.com/horovod/horovod).

13.7.2 Ring Synchronization
When it comes to synchronization on modern deep learning hardware we often encounter sig-
nificantly bespoke network connectivity. For instance, the AWS p3.16xlarge and NVIDIA DGX-2
instances share the connectivity structure of Fig. 13.7.4. Each GPU connects to a host CPU via a
PCIe link which operates at best at 16 GB/s. Additionally each GPU also has 6 NVLink connections,
each of which is capable of transferring 300 Gbit/s bidirectionally. This amounts to around 18 GB/s
per link per direction. In short, the aggregate NVLink bandwidth is significantly higher than the
PCIe bandwidth. The question is how to use it most efficiently.
Fig. 13.7.4: NVLink connectivity on 8 V100 GPU servers (image courtesy of NVIDIA).
It turns out that the optimal synchronization strategy is to decompose the network into two rings
and to use them to synchronize data directly (Wang et al., 2018). Fig. 13.7.5 illustrates that the net-
work can be decomposed into one ring (1-2-3-4-5-6-7-8-1) with double NVLink bandwidth and into
one (1-4-6-3-5-8-2-7-1) with regular bandwidth. Designing an efficient synchronization protocol in
this case is nontrivial.

Fig. 13.7.5: Decomposition of the NVLink network into two rings.
Consider the following thought experiment: given a ring of n computing nodes (or GPUs) we can
send gradients from the first to the second node. There it is added to the local gradient and sent
on to the third node, and so on. After n − 1 steps the aggregate gradient can be found in the last-
visited node. That is, the time to aggregate gradients grows linearly with the number of nodes.
But if we do this the algorithm is quite inefficient. After all, at any time there is only one of the
nodes communicating. What if we broke the gradients into n chunks and started synchronizing
chunk i starting at node i? Since each chunk is of size 1/n the total time is now (n − 1)/n ≈ 1. In
other words, the time spent to aggregate gradients does not grow as we increase the size of the ring.
This is quite an astonishing result. Fig. 13.7.6 illustrates the sequence of steps on n = 4 nodes.

Fig. 13.7.6: Ring synchronization across 4 nodes. Each node starts transmitting parts of gradients
to its left neighbor until the assembled gradient can be found in its right neighbor.
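The chunked scheme can be simulated in a few lines of plain Python. This is a minimal sketch under simplifying assumptions, not how any particular library implements it: each chunk is a single number, all nodes send at the same time in each step, and the function name `ring_allreduce` is hypothetical. It runs one pass around the ring to build the sums and a second pass to hand the finished chunks to every node.

```python
def ring_allreduce(node_chunks):
    """Sum-allreduce over a ring; node_chunks[i][c] is chunk c held by node i.

    Returns the synchronized copy and the number of communication steps.
    """
    n = len(node_chunks)
    data = [list(chunks) for chunks in node_chunks]  # work on a copy
    steps = 0
    # Pass 1: chunk c starts at node c and travels around the ring, picking
    # up one local contribution per hop.
    for s in range(n - 1):
        sends = [((i - s) % n, data[i][(i - s) % n]) for i in range(n)]
        for i, (c, value) in enumerate(sends):
            data[(i + 1) % n][c] += value
        steps += 1
    # Pass 2: the finished chunks travel around the ring once more and
    # overwrite the stale partial sums.
    for s in range(n - 1):
        sends = [((i + 1 - s) % n, data[i][(i + 1 - s) % n])
                 for i in range(n)]
        for i, (c, value) in enumerate(sends):
            data[(i + 1) % n][c] = value
        steps += 1
    return data, steps

# Node i holds the gradient [10 * i + 0, ..., 10 * i + 3], split into 4 chunks
grads = [[10 * i + c for c in range(4)] for i in range(4)]
result, steps = ring_allreduce(grads)
print(result, steps)
```

In every step each node transmits exactly one chunk of size 1/n, so the volume sent per node is (n − 1)/n for the aggregation pass and the same again for the distribution pass, independent of how large the ring is.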
If we use the same example of synchronizing 160 MB across 8 V100 GPUs we arrive at approxi-
mately 2 · 160MB/(3 · 18GB/s) ≈ 6ms. This is better than using the PCIe bus, even though we
are now using 8 GPUs. Note that in practice these numbers are a bit worse, since deep learning
frameworks often fail to assemble communication into large burst transfers.
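The estimate above can be reproduced in a couple of lines. The sketch below is one reading of the formula rather than something the text spells out: we assume that the factor 2 stands for the two passes around the ring (aggregating, then distributing the result) and that the factor 3 stands for the combined capacity of the double-bandwidth ring plus the regular ring. The function name and its parameters are hypothetical.

```python
def ring_sync_ms(size_mb, link_equivalents, gb_per_s_per_link):
    """Estimated synchronization time in ms (MB / (GB/s) = ms, 1 GB = 1000 MB)."""
    passes = 2  # assumption: one pass to aggregate, one to distribute
    return passes * size_mb / (link_equivalents * gb_per_s_per_link)

nvlink_ms = ring_sync_ms(size_mb=160, link_equivalents=3, gb_per_s_per_link=18)
print(f'{nvlink_ms:.1f} ms')
```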
Note that there is a common misconception that ring synchronization is fundamentally different
from other synchronization algorithms. The only difference is that the synchronization path is
somewhat more elaborate when compared with a simple tree.
13.7.3 Multi-Machine Training
Distributed training on multiple machines adds a further challenge: we need to communicate
with servers that are only connected across a comparatively lower bandwidth fabric that can be
over an order of magnitude slower in some cases. Synchronization across devices is tricky. After
all, different machines running training code will have subtly different speed. Hence we need to
synchronize them if we want to use synchronous distributed optimization. Fig. 13.7.7 illustrates
how distributed parallel training occurs.
1. A (different) batch of data is read on each machine, split across multiple GPUs and trans-
ferred to GPU memory. There predictions and gradients are computed on each GPU batch
separately.
2. The gradients from all local GPUs are aggregated on one GPU (or parts of it are aggregated
over different GPUs).
3. The gradients are sent to the CPUs.
4. The CPUs send the gradients to a central parameter server which aggregates all the gradients.

5. The aggregate gradients are then used to update the parameters and the updated parameters
are broadcast back to the individual CPUs.
6. The information is sent to one (or multiple) GPUs.
7. The updated parameters are spread across all GPUs.
Fig. 13.7.7: Multi-machine multi-GPU distributed parallel training.
Each of these operations seems rather straightforward. And, indeed, they can be carried out ef-
ficiently within a single machine. Once we look at multiple machines, though, we can see that
the central parameter server becomes the bottleneck. After all, the bandwidth per server is lim-
ited, hence for m workers the time it takes to send all gradients to the server is O(m). We can
break through this barrier by increasing the number of servers to n. At this point each server only
needs to store O(1/n) of the parameters, hence the total time for updates and optimization be-
comes O(m/n). Matching both numbers yields constant scaling regardless of how many workers
we are dealing with. In practice we use the same machines both as workers and as servers. Fig.
13.7.8 illustrates the design (see also (Li et al., 2014) for details). In particular, ensuring that mul-
tiple machines work without unreasonable delays is nontrivial. We omit details on barriers and
will only briefly touch on synchronous and asynchronous updates below.

Fig. 13.7.8: Top: a single parameter server is a bottleneck since its bandwidth is finite. Bottom:
multiple parameter servers store parts of the parameters with aggregate bandwidth.
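The scaling argument can be made concrete with a toy cost model. This is a sketch under strong simplifying assumptions (every server has the same bandwidth, parameters are sharded evenly, latency is ignored), and the function name `sync_time` and its arguments are hypothetical.

```python
def sync_time(num_workers, num_servers, gradient_size=1.0, bandwidth=1.0):
    """Time for one server to receive its share of the gradients.

    Each server stores 1/num_servers of the parameters, so it receives
    gradient_size / num_servers from each of the num_workers workers over
    its own link.
    """
    per_server_traffic = num_workers * gradient_size / num_servers
    return per_server_traffic / bandwidth

workers = (1, 2, 4, 8)
# A single server: time grows linearly with the number of workers, O(m)
single_server = [sync_time(m, 1) for m in workers]
# As many servers as workers (n = m): time stays constant, O(m/n) = O(1)
matched_servers = [sync_time(m, m) for m in workers]
print(single_server, matched_servers)
```

Using the same machines as workers and as servers corresponds to the second case, where the number of servers grows together with the number of workers.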
13.7.4 Key–Value Stores
Implementing the steps required for distributed multi-GPU training in practice is nontrivial. This
is why it pays to use a common abstraction, namely that of a key–value store with redefined update
semantics.
Across many workers and many GPUs the computation for gradient i can be defined as
$$g_i = \sum_{k \in \text{workers}} \sum_{j \in \text{GPUs}} g_{ijk}, \qquad (13.7.1)$$
where $g_{ijk}$ is part of gradient $i$ split on GPU $j$ of worker $k$. The key aspect in this operation is
that it is a commutative reduction, that is, it turns many vectors into one and the order in which
the operation is applied does not matter. This is great for our purposes since we do not (need to)
have fine grained control over when which gradient is received. Besides, note that this operation
is independent among different i.
This allows us to define the following two operations: push, which accumulates gradients, and pull,
which retrieves aggregate gradients. Since we have many different sets of gradients (after all, we
have many layers), we need to index the gradients with a key i. This similarity to key–value stores,
such as the one introduced in Dynamo (DeCandia et al., 2007), is not by coincidence. They, too,
satisfy many similar characteristics, in particular when it comes to distributing the parameters
across multiple servers.
The push and pull operations for key-value stores are described as follows:
• push(key, value) sends a particular gradient (the value) from a worker to a common storage.
There the value is aggregated, e.g., by summing it up.
• pull(key, value) retrieves an aggregate value from common storage, e.g., after combining
the gradients from all workers.
By hiding all the complexity about synchronization behind a simple push and pull operation we
can decouple the concerns of statistical modelers who want to be able to express optimization
in simple terms and the system engineers who need to deal with the complexity inherent in dis-
tributed synchronization.
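To make the push and pull semantics concrete, here is a minimal single-process sketch. It is a toy stand-in and not the API of any real parameter server: the class name `ToyKVStore` is hypothetical, values are plain lists of floats, and `pull` returns the aggregate instead of writing it into a `value` buffer.

```python
class ToyKVStore:
    """In-memory stand-in for a key-value store with summing updates."""

    def __init__(self):
        self._store = {}

    def push(self, key, value):
        """Accumulate `value` (a list of floats) into the entry for `key`."""
        if key not in self._store:
            self._store[key] = list(value)
        else:
            self._store[key] = [a + b
                                for a, b in zip(self._store[key], value)]

    def pull(self, key):
        """Return a copy of the aggregate stored under `key`."""
        return list(self._store[key])

# g[k][j] is the gradient for key 'w1' computed on GPU j of worker k
g = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]

store = ToyKVStore()
for worker in g:
    for gpu_grad in worker:
        store.push('w1', gpu_grad)

# Pushing in the opposite order yields the same aggregate, since the
# reduction is commutative (up to floating-point rounding in general)
reversed_store = ToyKVStore()
for worker in reversed(g):
    for gpu_grad in reversed(worker):
        reversed_store.push('w1', gpu_grad)

aggregate = store.pull('w1')
print(aggregate, reversed_store.pull('w1'))
```

The caller never sees where or in which order the sums are formed, which is the separation of concerns between modelers and systems engineers described above.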
13.7.5 Summary
• Synchronization needs to be highly adaptive to specific network infrastructure and connectivity
within a server. This can make a significant difference to the time it takes to synchronize.
• Ring-synchronization can be optimal for p3 and DGX-2 servers. For others possibly not so
much.
• A hierarchical synchronization strategy works well when adding multiple parameter servers
for increased bandwidth.
Exercises
1. Can you speed up ring synchronization even further? Hint: you can send messages in
both directions.
2. Is it possible to allow asynchronous communication (while computation is still ongoing)?
How does it affect performance?
3. What if we lost a server during a long-running computation? How can we design a fault
tolerance mechanism to avoid restarting the computation fully?

14 | Computer Vision
Whether it is medical diagnosis, self-driving vehicles, camera monitoring, or smart filters, many
applications in the field of computer vision are closely related to our current and future lives. In
recent years, deep learning has been the transformative power for advancing the performance
of computer vision systems. It can be said that the most advanced computer vision applications
are almost inseparable from deep learning. In view of this, this chapter will focus on the field of
computer vision, and investigate methods and applications that have recently been influential in
academia and industry.
In Chapter 7 and Chapter 8, we studied various convolutional neural networks that are commonly
used in computer vision, and applied them to simple image classification tasks. At the beginning
of this chapter, we will describe two methods that may improve model generalization, namely
image augmentation and fine-tuning, and apply them to image classification. Since deep neural
networks can effectively represent images in multiple levels, such layerwise representations have
been successfully used in various computer vision tasks such as object detection, semantic segmenta-
tion, and style transfer. Following the key idea of leveraging layerwise representations in computer
vision, we will begin with major components and techniques for object detection. Next, we will
show how to use fully convolutional networks for semantic segmentation of images. Then we will
explain how to use style transfer techniques to generate images like the cover of this book. In
the end, we conclude this chapter by applying the materials of this chapter and several previous
chapters on two popular computer vision benchmark datasets.
14.1 Image Augmentation
In Section 8.1, we mentioned that large datasets are a prerequisite for the success of deep neural
networks in various applications. Image augmentation generates similar but distinct training ex-
amples after a series of random changes to the training images, thereby expanding the size of the
training set. Alternatively, image augmentation can be motivated by the fact that random tweaks
of training examples allow models to rely less on certain attributes, thereby improving their generalization
ability. For example, we can crop an image in different ways to make the object of
interest appear in different positions, thereby reducing the dependence of a model on the posi-
tion of the object. We can also adjust factors such as brightness and color to reduce a modelʼs
sensitivity to color. It is probably true that image augmentation was indispensable for the success
of AlexNet at that time. In this section we will discuss this widely used technique in computer
vision.
%matplotlib inline
import torch
import torchvision
from torch import nn
from d2l import torch as d2l

14.1.1 Common Image Augmentation Methods
In our investigation of common image augmentation methods, we will use the following 400 × 500
image as an example.
d2l.set_figsize()
img = d2l.Image.open('../img/cat1.jpg')
d2l.plt.imshow(img);
Most image augmentation methods have a certain degree of randomness. To make it easier for
us to observe the effect of image augmentation, next we define an auxiliary function apply. This
function runs the image augmentation method aug multiple times on the input image img and
shows all the results.
def apply(img, aug, num_rows=2, num_cols=4, scale=1.5):
    Y = [aug(img) for _ in range(num_rows * num_cols)]
    d2l.show_images(Y, num_rows, num_cols, scale=scale)
Flipping and Cropping
Flipping the image left and right usually does not change the category of the object. This is one of
the earliest and most widely used methods of image augmentation. Next, we use the transforms
module to create the RandomHorizontalFlip instance, which flips an image left and right with a
50% chance.
apply(img, torchvision.transforms.RandomHorizontalFlip())

Flipping up and down is not as common as flipping left and right. But at least for this example
image, flipping up and down does not hinder recognition. Next, we create a RandomVerticalFlip
instance to flip an image up and down with a 50% chance.
apply(img, torchvision.transforms.RandomVerticalFlip())
In the example image we used, the cat is in the middle of the image, but this may not be the
case in general. In Section 7.5, we explained that the pooling layer can reduce the sensitivity of
a convolutional layer to the target position. In addition, we can also randomly crop the image to
make objects appear in different positions in the image at different scales, which can also reduce
the sensitivity of a model to the target position.
In the code below, we randomly crop a region whose area is 10% to 100% of the original area
each time, with the ratio of width to height of this region randomly selected from 0.5 to 2. Then,
the width and height of the region are both scaled to 200 pixels. Unless otherwise specified, a
random number between a and b in this section refers to a continuous value obtained by uniform
random sampling from the interval [a, b].
shape_aug = torchvision.transforms.RandomResizedCrop(
(200, 200), scale=(0.1, 1), ratio=(0.5, 2))
apply(img, shape_aug)
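To see what such a crop involves, the following plain-Python sketch samples a crop box exactly as described in the text: an area fraction and a width-to-height ratio are drawn uniformly, and the box is placed at a random position. It is an illustration only. The function name `sample_crop_box` is hypothetical, we assume the example image is 400 pixels high and 500 pixels wide, and torchvision's own sampling differs in details such as how the aspect ratio is drawn and which fallback is used.

```python
import math
import random

def sample_crop_box(height, width, scale=(0.1, 1.0), ratio=(0.5, 2.0),
                    max_tries=10, rng=random):
    """Sample (top, left, crop_h, crop_w) for a random crop of an image."""
    area = height * width
    for _ in range(max_tries):
        target_area = area * rng.uniform(*scale)
        aspect = rng.uniform(*ratio)  # width / height of the crop
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = rng.randint(0, height - crop_h)
            left = rng.randint(0, width - crop_w)
            return top, left, crop_h, crop_w
    # Fall back to the whole image if no sampled box fits
    return 0, 0, height, width

random.seed(0)
boxes = [sample_crop_box(400, 500) for _ in range(5)]
for top, left, crop_h, crop_w in boxes:
    print(f'top={top}, left={left}, height={crop_h}, width={crop_w}')
```

Each sampled region would then be resized to 200 × 200 pixels, which is what changes the apparent scale of the object in addition to its position.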

Changing Colors
Another augmentation method is changing colors. We can change four aspects of the image color:
brightness, contrast, saturation, and hue. In the example below, we randomly change the bright-
ness of the image to a value between 50% (1 − 0.5) and 150% (1 + 0.5) of the original image.
apply(img, torchvision.transforms.ColorJitter(
brightness=0.5, contrast=0, saturation=0, hue=0))
Similarly, we can randomly change the hue of the image.
apply(img, torchvision.transforms.ColorJitter(
brightness=0, contrast=0, saturation=0, hue=0.5))

We can also create a ColorJitter instance and set how to randomly change the brightness,
contrast, saturation, and hue of the image at the same time.
color_aug = torchvision.transforms.ColorJitter(
brightness=0.5, contrast=0.5, saturation=0.5, hue=0.5)
apply(img, color_aug)
Combining Multiple Image Augmentation Methods
In practice, we will combine multiple image augmentation methods. For example, we can com-
bine the different image augmentation methods defined above and apply them to each image via
a Compose instance.
augs = torchvision.transforms.Compose([
torchvision.transforms.RandomHorizontalFlip(), color_aug, shape_aug])
apply(img, augs)

14.1.2 Training with Image Augmentation
Letʼs train a model with image augmentation. Here we use the CIFAR-10 dataset instead of the
Fashion-MNIST dataset that we used before. This is because the position and size of the objects
in the Fashion-MNIST dataset have been normalized, while the color and size of the objects in the
CIFAR-10 dataset have more significant differences. The first 32 training images in the CIFAR-10
dataset are shown below.
all_images = torchvision.datasets.CIFAR10(train=True, root="../data",
                                          download=True)
d2l.show_images([all_images[i][0] for i in range(32)], 4, 8, scale=0.8);
Files already downloaded and verified
In order to obtain deterministic results during prediction, we usually only apply image augmentation
to training examples, and do not use image augmentation with random operations during
prediction. Here we only use the simplest random left-right flipping method. In addition, we use a
ToTensor instance to convert each image into the format required by the deep learning
framework, i.e., 32-bit floating point numbers between 0 and 1 with the shape of (number of
channels, height, width). The data loader then stacks these images into minibatches with the
shape of (batch size, number of channels, height, width).

train_augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor()])
test_augs = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor()])
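To make the conversion concrete, here is a framework-free sketch of what such a conversion does to one tiny image: pixel values are divided by 255 and the layout changes from (height, width, channels) to (channels, height, width). The function `to_chw_float` is a hypothetical stand-in written with nested lists; it is not torchvision's implementation.

```python
def to_chw_float(image_hwc):
    """Convert a nested-list uint8 image (H, W, C) to floats in [0, 1] (C, H, W)."""
    height, width = len(image_hwc), len(image_hwc[0])
    channels = len(image_hwc[0][0])
    return [[[image_hwc[i][j][c] / 255 for j in range(width)]
             for i in range(height)]
            for c in range(channels)]

# A 1 x 2 RGB image: one pure-red pixel and one white pixel
image = [[[255, 0, 0], [255, 255, 255]]]
tensor_like = to_chw_float(image)
print(len(tensor_like), len(tensor_like[0]), len(tensor_like[0][0]))  # 3 1 2
print(tensor_like[0])  # red channel: [[1.0, 1.0]]
print(tensor_like[1])  # green channel: [[0.0, 1.0]]
```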
Next, we define an auxiliary function to facilitate reading the image and applying image augmen-
tation. The transform argument provided by PyTorchʼs dataset applies augmentation to transform
the images. For a detailed introduction to DataLoader, please refer to Section 4.2.
def load_cifar10(is_train, augs, batch_size):
    dataset = torchvision.datasets.CIFAR10(root="../data", train=is_train,
                                           transform=augs, download=True)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                    shuffle=is_train, num_workers=d2l.get_dataloader_workers())
    return dataloader
Multi-GPU Training
We train the ResNet-18 model from Section 8.6 on the CIFAR-10 dataset. Recall the introduction
to multi-GPU training in Section 13.6. In the following, we define a function to train and evaluate
the model using multiple GPUs.
#@save
def train_batch_ch13(net, X, y, loss, trainer, devices):
    """Train for a minibatch with multiple GPUs (defined in Chapter 13)."""
    if isinstance(X, list):
        # Required for BERT fine-tuning (to be covered later)
        X = [x.to(devices[0]) for x in X]
    else:
        X = X.to(devices[0])
    y = y.to(devices[0])
    net.train()
    trainer.zero_grad()
    pred = net(X)
    l = loss(pred, y)
    l.sum().backward()
    trainer.step()
    train_loss_sum = l.sum()
    train_acc_sum = d2l.accuracy(pred, y)
    return train_loss_sum, train_acc_sum

#@save
def train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
               devices=d2l.try_all_gpus()):
    """Train a model with multiple GPUs (defined in Chapter 13)."""
    timer, num_batches = d2l.Timer(), len(train_iter)
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs], ylim=[0, 1],
                            legend=['train loss', 'train acc', 'test acc'])
    net = nn.DataParallel(net, device_ids=devices).to(devices[0])
    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, no. of examples,
        # no. of predictions
        metric = d2l.Accumulator(4)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            l, acc = train_batch_ch13(
                net, features, labels, loss, trainer, devices)
            metric.add(l, acc, labels.shape[0], labels.numel())
            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[2], metric[1] / metric[3],
                              None))
        test_acc = d2l.evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {metric[0] / metric[2]:.3f}, train acc '
          f'{metric[1] / metric[3]:.3f}, test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec on '
          f'{str(devices)}')
Now we can define the train_with_data_aug function to train the model with image augmenta-
tion. This function gets all available GPUs, uses Adam as the optimization algorithm, applies im-
age augmentation to the training dataset, and finally calls the train_ch13 function just defined to
train and evaluate the model.
batch_size, devices, net = 256, d2l.try_all_gpus(), d2l.resnet18(10, 3)
net.apply(d2l.init_cnn)

def train_with_data_aug(train_augs, test_augs, net, lr=0.001):
    train_iter = load_cifar10(True, train_augs, batch_size)
    test_iter = load_cifar10(False, test_augs, batch_size)
    loss = nn.CrossEntropyLoss(reduction="none")
    trainer = torch.optim.Adam(net.parameters(), lr=lr)
    net(next(iter(train_iter))[0])
    train_ch13(net, train_iter, test_iter, loss, trainer, 10, devices)
Letʼs train the model using image augmentation based on random left-right flipping.
train_with_data_aug(train_augs, test_augs, net)
loss 0.209, train acc 0.927, test acc 0.830

5329.4 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]

14.1.3 Summary
• Image augmentation generates random images based on existing training data to improve
the generalization ability of models.
• In order to obtain deterministic results during prediction, we usually only apply image
augmentation to training examples, and do not use image augmentation with random operations
during prediction.
• Deep learning frameworks provide many different image augmentation methods, which can
be applied simultaneously.
Exercises
1. Train the model without using image augmentation: train_with_data_aug(test_augs,
test_augs). Compare training and testing accuracy when using and not using image
augmentation. Can this comparative experiment support the argument that image
augmentation can mitigate overfitting? Why?
2. Combine multiple different image augmentation methods in model training on the CIFAR-10
dataset. Does it improve test accuracy?
3. Refer to the online documentation of the deep learning framework. What other image aug-
mentation methods does it also provide?
Discussions

14.2 Fine-Tuning
In earlier chapters, we discussed how to train models on the Fashion-MNIST training dataset with
only 60000 images. We also described ImageNet, the most widely used large-scale image dataset
in academia, which has more than 10 million images and 1000 object classes. However, the size
of the dataset that we usually encounter lies between those of the two datasets.
Suppose that we want to recognize different types of chairs from images, and then recommend
purchase links to users. One possible method is to first identify 100 common chairs, take 1000
images of each chair from different angles, and then train a classification model on the collected
image dataset. Although this chair dataset may be larger than the Fashion-MNIST dataset, the
number of examples is still less than one-tenth of that in ImageNet. This may lead to overfitting
of complicated models that are suitable for ImageNet on this chair dataset. Besides, due to the
limited number of training examples, the accuracy of the trained model may not meet practical
requirements.
In order to address the above problems, an obvious solution is to collect more data. However,
collecting and labeling data can take a lot of time and money. For example, in order to collect the
ImageNet dataset, researchers have spent millions of dollars from research funding. Although the
current data collection cost has been significantly reduced, this cost still cannot be ignored.
Another solution is to apply transfer learning to transfer the knowledge learned from the source
dataset to the target dataset. For example, although most of the images in the ImageNet dataset
have nothing to do with chairs, the model trained on this dataset may extract more general image
features, which can help identify edges, textures, shapes, and object composition. These similar
features may also be effective for recognizing chairs.
14.2.1 Steps
In this section, we will introduce a common technique in transfer learning: fine-tuning. As shown
in Fig. 14.2.1, fine-tuning consists of the following four steps:
1. Pretrain a neural network model, i.e., the source model, on a source dataset (e.g., the Ima-
geNet dataset).
2. Create a new neural network model, i.e., the target model. This copies all model designs and
their parameters from the source model except the output layer. We assume that these model
parameters contain the knowledge learned from the source dataset and this knowledge will
also be applicable to the target dataset. We also assume that the output layer of the source
model is closely related to the labels of the source dataset; thus it is not used in the target
model.
3. Add an output layer to the target model, whose number of outputs is the number of cate-
gories in the target dataset. Then randomly initialize the model parameters of this layer.
4. Train the target model on the target dataset, such as a chair dataset. The output layer will
be trained from scratch, while the parameters of all the other layers are fine-tuned based on
the parameters of the source model.

Fig. 14.2.1: Fine tuning.
When target datasets are much smaller than source datasets, fine-tuning helps to improve modelsʼ
generalization ability.
14.2.2 Hot Dog Recognition
Let's demonstrate fine-tuning via a concrete case: hot dog recognition. We will take a ResNet
model that was pretrained on the ImageNet dataset and fine-tune it on a small dataset. This small
dataset consists of thousands of images with and without hot dogs. We will use the fine-tuned
model to recognize hot dogs from images.
%matplotlib inline
import os
import torch
import torchvision
from torch import nn
from d2l import torch as d2l
Reading the Dataset
The hot dog dataset we use was taken from online images. This dataset consists of 1400
positive-class images containing hot dogs, and as many negative-class images containing other
foods. 1000 images of each class are used for training and the rest are for testing.
After unzipping the downloaded dataset, we obtain two folders hotdog/train and hotdog/test.
Both folders have hotdog and not-hotdog subfolders, each of which contains images of the
corresponding class.
#@save
d2l.DATA_HUB['hotdog'] = (d2l.DATA_URL + 'hotdog.zip',
                          'fba480ffa8aa7e0febbb511d181409f899b9baa5')
data_dir = d2l.download_extract('hotdog')
We create two instances to read all the image files in the training and testing datasets, respectively.
train_imgs = torchvision.datasets.ImageFolder(os.path.join(data_dir, 'train'))

test_imgs = torchvision.datasets.ImageFolder(os.path.join(data_dir, 'test'))
The first 8 positive examples and the last 8 negative images are shown below. As you can see, the
images vary in size and aspect ratio.
hotdogs = [train_imgs[i][0] for i in range(8)]

not_hotdogs = [train_imgs[-i - 1][0] for i in range(8)]
d2l.show_images(hotdogs + not_hotdogs, 2, 8, scale=1.4);
During training, we first crop a random area of random size and random aspect ratio from the
image, and then scale this area to a 224 × 224 input image. During testing, we scale both the height
and width of an image to 256 pixels, and then crop a central 224 × 224 area as input. In addition,
for the three RGB (red, green, and blue) color channels we standardize their values channel by
channel. Concretely, the mean value of a channel is subtracted from each value of that channel
and then the result is divided by the standard deviation of that channel.
# Specify the means and standard deviations of the three RGB channels to
# standardize each channel
normalize = torchvision.transforms.Normalize(
    [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
train_augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.ToTensor(),
    normalize])
test_augs = torchvision.transforms.Compose([
    # Scale both the height and width to 256 pixels, as described above
    torchvision.transforms.Resize([256, 256]),
    torchvision.transforms.CenterCrop(224),
    # `Normalize` operates on tensors, so convert the image first
    torchvision.transforms.ToTensor(),
    normalize])
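The channel-wise standardization described above is easy to verify by hand on a single pixel. The sketch below applies the same means and standard deviations to one RGB value that has already been scaled to [0, 1]; the helper `standardize_pixel` is a hypothetical illustration written in plain Python, not the framework's implementation.

```python
MEANS = [0.485, 0.456, 0.406]  # per-channel means used in the text
STDS = [0.229, 0.224, 0.225]   # per-channel standard deviations

def standardize_pixel(rgb):
    """Standardize one RGB pixel whose values are already in [0, 1]."""
    return [(value - mean) / std
            for value, mean, std in zip(rgb, MEANS, STDS)]

# A pixel equal to the channel means maps to zero in every channel
print(standardize_pixel(MEANS))  # [0.0, 0.0, 0.0]
# A white pixel ends up a few standard deviations above the mean
print([round(v, 2) for v in standardize_pixel([1.0, 1.0, 1.0])])  # [2.25, 2.43, 2.64]
```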

Defining and Initializing the Model
We use ResNet-18, which was pretrained on the ImageNet dataset, as the source model. Here,
we specify pretrained=True to automatically download the pretrained model parameters. If this
model is used for the first time, Internet connection is required for download.
pretrained_net = torchvision.models.resnet18(pretrained=True)
/home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
  warnings.warn(
torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing weights=ResNet18_Weights.IMAGENET1K_V1. You can also use weights=ResNet18_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
The pretrained source model instance contains a number of feature layers and an output layer fc.
The main purpose of this division is to facilitate the fine-tuning of model parameters of all layers
but the output layer. The member variable fc of the source model is given below.
pretrained_net.fc
Linear(in_features=512, out_features=1000, bias=True)
As a fully connected layer, it transforms ResNet's final global average pooling outputs into 1000
class outputs of the ImageNet dataset. We then construct a new neural network as the target
model. It is defined in the same way as the pretrained source model except that its number of
outputs in the final layer is set to the number of classes in the target dataset (rather than 1000).
In the code below, the model parameters before the output layer of the target model instance
finetune_net are initialized to model parameters of the corresponding layers from the source
model. Since these model parameters were obtained via pretraining on ImageNet, they are
already effective. Therefore, we only need a small learning rate to fine-tune such pretrained
parameters. In contrast, model parameters in the output layer are randomly initialized and
generally require a larger learning rate to be learned from scratch. Letting the base learning rate
be η, a learning rate of 10η will be used to iterate the model parameters in the output layer.
finetune_net = torchvision.models.resnet18(pretrained=True)
finetune_net.fc = nn.Linear(finetune_net.fc.in_features, 2)
nn.init.xavier_uniform_(finetune_net.fc.weight);

Fine-Tuning the Model
First, we define a training function train_fine_tuning that uses fine-tuning so it can be called
multiple times.
# If `param_group=True`, the model parameters in the output layer will be
# updated using a learning rate ten times greater
def train_fine_tuning(net, learning_rate, batch_size=128, num_epochs=5,
                      param_group=True):
    train_iter = torch.utils.data.DataLoader(torchvision.datasets.ImageFolder(
        os.path.join(data_dir, 'train'), transform=train_augs),
        batch_size=batch_size, shuffle=True)
    test_iter = torch.utils.data.DataLoader(torchvision.datasets.ImageFolder(
        os.path.join(data_dir, 'test'), transform=test_augs),
        batch_size=batch_size)
    devices = d2l.try_all_gpus()
    loss = nn.CrossEntropyLoss(reduction="none")
    if param_group:
        params_1x = [param for name, param in net.named_parameters()
                     if name not in ["fc.weight", "fc.bias"]]
        trainer = torch.optim.SGD([{'params': params_1x},
                                   {'params': net.fc.parameters(),
                                    'lr': learning_rate * 10}],
                                  lr=learning_rate, weight_decay=0.001)
    else:
        trainer = torch.optim.SGD(net.parameters(), lr=learning_rate,
                                  weight_decay=0.001)
    d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
                   devices)
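The grouping logic inside this function can be illustrated without any framework. The sketch below splits parameter names into the pretrained body and the new output layer and attaches a learning rate to each group, mirroring the structure of the list passed to the optimizer. The helper `build_param_groups` and the short list of names are illustrative assumptions; a real ResNet-18 has many more parameters.

```python
def build_param_groups(names, base_lr, head_names=("fc.weight", "fc.bias"),
                       head_multiplier=10):
    """Group parameter names by the learning rate they should receive."""
    body = [name for name in names if name not in head_names]
    head = [name for name in names if name in head_names]
    return [{"params": body, "lr": base_lr},
            {"params": head, "lr": base_lr * head_multiplier}]

# A few example names, assumed here for illustration only
names = ["conv1.weight", "layer1.0.conv1.weight", "fc.weight", "fc.bias"]
groups = build_param_groups(names, base_lr=5e-5)
for group in groups:
    print(group["lr"], group["params"])
```

With a base learning rate of 5e-5, the output layer group receives a rate ten times larger, while every other parameter keeps the base rate.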
We set the base learning rate to a small value in order to fine-tune the model parameters obtained
via pretraining. Based on the previous settings, we will train the output layer parameters of the
target model from scratch using a learning rate ten times greater.
train_fine_tuning(finetune_net, 5e-5)


For comparison, we define an identical model, but initialize all of its model parameters to random
values. Since the entire model needs to be trained from scratch, we can use a larger learning rate.
scratch_net = torchvision.models.resnet18()
scratch_net.fc = nn.Linear(scratch_net.fc.in_features, 2)
train_fine_tuning(scratch_net, 5e-4, param_group=False)

As we can see, the fine-tuned model tends to perform better for the same epoch because its initial
parameter values are more effective.
14.2.3 Summary
• Transfer learning transfers knowledge learned from the source dataset to the target dataset.
Fine-tuning is a common technique for transfer learning.
• The target model copies all model designs with their parameters from the source model
except the output layer, and fine-tunes these parameters based on the target dataset. In
contrast, the output layer of the target model needs to be trained from scratch.
• Generally, fine-tuning parameters uses a smaller learning rate, while training the output
layer from scratch can use a larger learning rate.
Exercises
1. Keep increasing the learning rate of finetune_net. How does the accuracy of the model
change?
2. Further adjust hyperparameters of finetune_net and scratch_net in the comparative ex-
periment. Do they still differ in accuracy?
3. Set the parameters before the output layer of finetune_net to those of the source model and
do not update them during training. How does the accuracy of the model change? You can
use the following code.

for param in finetune_net.parameters():
    param.requires_grad = False
4. In fact, there is a “hotdog” class in the ImageNet dataset. Its corresponding weight parameter
in the output layer can be obtained via the following code. How can we leverage this weight
parameter?
weight = pretrained_net.fc.weight
hotdog_w = torch.split(weight.data, 1, dim=0)[934]
hotdog_w.shape
Discussions
14.3 Object Detection and Bounding Boxes
In earlier sections (e.g., Section 8.1–Section 8.4), we introduced various models for image classi-
fication. In image classification tasks, we assume that there is only one major object in the image
and we only focus on how to recognize its category. However, there are often multiple objects in
the image of interest. We not only want to know their categories, but also their specific positions
in the image. In computer vision, we refer to such tasks as object detection (or object recognition).
Object detection has been widely applied in many fields. For example, self-driving needs to plan
traveling routes by detecting the positions of vehicles, pedestrians, roads, and obstacles in the
captured video images. Besides, robots may use this technique to detect and localize objects of
interest throughout their navigation of an environment. Moreover, security systems may need to
detect abnormal objects, such as intruders or bombs.
In the next few sections, we will introduce several deep learning methods for object detection. We
will begin with an introduction to positions (or locations) of objects.
%matplotlib inline
import torch
from d2l import torch as d2l
We will load the sample image to be used in this section. We can see that there is a dog on the left
side of the image and a cat on the right. They are the two major objects in this image.
d2l.set_figsize()
img = d2l.plt.imread('../img/catdog.jpg')
d2l.plt.imshow(img);

14.3.1 Bounding Boxes
In object detection, we usually use a bounding box to describe the spatial location of an object.
The bounding box is rectangular and is determined by the x and y coordinates of the upper-left
corner of the rectangle and the corresponding coordinates of the lower-right corner. Another
commonly used bounding box representation is the (x, y)-axis coordinates of the bounding box
center, and the width and height of the box.
Here we define functions to convert between these two representations: box_corner_to_center
converts from the two-corner representation to the center-width-height representation, and
box_center_to_corner does the reverse. The input argument boxes should be a two-dimensional
tensor of shape (n, 4), where n is the number of bounding boxes.
#@save
def box_corner_to_center(boxes):
    """Convert from (upper-left, lower-right) to (center, width, height)."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx = (x1 + x2) / 2
    cy = (y1 + y2) / 2
    w = x2 - x1
    h = y2 - y1
    boxes = torch.stack((cx, cy, w, h), axis=-1)
    return boxes

#@save
def box_center_to_corner(boxes):
    """Convert from (center, width, height) to (upper-left, lower-right)."""
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = cx - 0.5 * w
    y1 = cy - 0.5 * h
    x2 = cx + 0.5 * w
    y2 = cy + 0.5 * h
    boxes = torch.stack((x1, y1, x2, y2), axis=-1)
    return boxes
We will define the bounding boxes of the dog and the cat in the image based on the coordinate
information. The origin of the coordinates in the image is the upper-left corner of the image, and
to the right and down are the positive directions of the x and y axes, respectively.
# Here `bbox` is the abbreviation for bounding box
dog_bbox, cat_bbox = [60.0, 45.0, 378.0, 516.0], [400.0, 112.0, 655.0, 493.0]
We can verify the correctness of the two bounding box conversion functions by converting twice.
boxes = torch.tensor((dog_bbox, cat_bbox))
box_center_to_corner(box_corner_to_center(boxes)) == boxes
tensor([[True, True, True, True],
        [True, True, True, True]])
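The same round trip can be checked by hand for a single box. The sketch below restates the two conversions for one box in plain Python (the names `corner_to_center` and `center_to_corner` are hypothetical scalar counterparts of the functions above) and applies them to the dog's bounding box.

```python
def corner_to_center(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h) for a single box."""
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]

def center_to_corner(box):
    """(cx, cy, w, h) -> (x1, y1, x2, y2) for a single box."""
    cx, cy, w, h = box
    return [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h]

dog_bbox = [60.0, 45.0, 378.0, 516.0]
center = corner_to_center(dog_bbox)
print(center)  # [219.0, 280.5, 318.0, 471.0]
print(center_to_corner(center) == dog_bbox)  # True
```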
Letʼs draw the bounding boxes in the image to check if they are accurate. Before drawing, we will
define a helper function bbox_to_rect. It represents the bounding box in the bounding box format
of the matplotlib package.
#@save
def bbox_to_rect(bbox, color):
    """Convert bounding box to matplotlib format."""
    # Convert the bounding box (upper-left x, upper-left y, lower-right x,
    # lower-right y) format to the matplotlib format: ((upper-left x,
    # upper-left y), width, height)
    return d2l.plt.Rectangle(
        xy=(bbox[0], bbox[1]), width=bbox[2]-bbox[0], height=bbox[3]-bbox[1],
        fill=False, edgecolor=color, linewidth=2)
After adding the bounding boxes on the image, we can see that the main outlines of the two
objects are basically inside the two boxes.
fig = d2l.plt.imshow(img)
fig.axes.add_patch(bbox_to_rect(dog_bbox, 'blue'))
fig.axes.add_patch(bbox_to_rect(cat_bbox, 'red'));

14.3.2 Summary
• Object detection not only recognizes all the objects of interest in the image, but also their
positions. The position is generally represented by a rectangular bounding box.
• We can convert between two commonly used bounding box representations.
Exercises
1. Find another image and try to label a bounding box that contains the object. Compare label-
ing bounding boxes and categories: which usually takes longer?
2. Why is the innermost dimension of the input argument boxes of box_corner_to_center and
box_center_to_corner always 4?
Discussions
14.4 Anchor Boxes
Object detection algorithms usually sample a large number of regions in the input image, deter-
mine whether these regions contain objects of interest, and adjust the boundaries of the regions
so as to predict the ground-truth bounding boxes of the objects more accurately. Different models
may adopt different region sampling schemes. Here we introduce one such method: it
generates multiple bounding boxes with varying scales and aspect ratios centered on each pixel. These
bounding boxes are called anchor boxes. We will design an object detection model based on anchor
boxes in Section 14.7.
First, let's modify the printing precision just for more concise outputs.
%matplotlib inline
import torch
from d2l import torch as d2l
torch.set_printoptions(2)  # Simplify printing precision
14.4.1 Generating Multiple Anchor Boxes
Suppose that the input image has a height of $h$ and a width of $w$. We generate anchor boxes
with different shapes centered on each pixel of the image. Let the scale be $s \in (0, 1]$ and the
aspect ratio (ratio of width to height) be $r > 0$. Then the width and height of the anchor box are
$hs\sqrt{r}$ and $hs/\sqrt{r}$, respectively, so that their ratio is exactly $r$. Note that when the
center position is given, an anchor box with known width and height is determined.
To generate multiple anchor boxes with different shapes, let's set a series of scales
$s_1, \ldots, s_n$ and a series of aspect ratios $r_1, \ldots, r_m$. When using all the combinations
of these scales and aspect ratios with each pixel as the center, the input image will have a total of
$whnm$ anchor boxes. Although these anchor boxes may cover all the ground-truth bounding
boxes, the computational complexity can easily become too high. In practice, we only consider
those combinations containing $s_1$ or $r_1$:
$$(s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1). \qquad (14.4.1)$$
That is to say, the number of anchor boxes centered on the same pixel is $n + m - 1$. For the entire
input image, we will generate a total of $wh(n + m - 1)$ anchor boxes.
The above method of generating anchor boxes is implemented in the following multibox_prior
function. We specify the input image, a list of scales, and a list of aspect ratios, then this function
will return all the anchor boxes.
#@save
def multibox_prior(data, sizes, ratios):
    """Generate anchor boxes with different shapes centered on each pixel."""
    in_height, in_width = data.shape[-2:]
    device, num_sizes, num_ratios = data.device, len(sizes), len(ratios)
    boxes_per_pixel = (num_sizes + num_ratios - 1)
    size_tensor = torch.tensor(sizes, device=device)
    ratio_tensor = torch.tensor(ratios, device=device)
    # Offsets are required to move the anchor to the center of a pixel. Since
    # a pixel has height=1 and width=1, we choose to offset our centers by 0.5
    offset_h, offset_w = 0.5, 0.5
    steps_h = 1.0 / in_height  # Scaled steps in y axis
    steps_w = 1.0 / in_width  # Scaled steps in x axis

    # Generate all center points for the anchor boxes
    center_h = (torch.arange(in_height, device=device) + offset_h) * steps_h
    center_w = (torch.arange(in_width, device=device) + offset_w) * steps_w
    shift_y, shift_x = torch.meshgrid(center_h, center_w)
    shift_y, shift_x = shift_y.reshape(-1), shift_x.reshape(-1)

    # Generate `boxes_per_pixel` number of heights and widths that are later
    # used to create anchor box corner coordinates (xmin, ymin, xmax, ymax)
    w = torch.cat((size_tensor * torch.sqrt(ratio_tensor[0]),
                   sizes[0] * torch.sqrt(ratio_tensor[1:])))\
                   * in_height / in_width  # Handle rectangular inputs
    h = torch.cat((size_tensor / torch.sqrt(ratio_tensor[0]),
                   sizes[0] / torch.sqrt(ratio_tensor[1:])))
    # Divide by 2 to get half height and half width
    anchor_manipulations = torch.stack((-w, -h, w, h)).T.repeat(
                                        in_height * in_width, 1) / 2

    # Each center point will have `boxes_per_pixel` number of anchor boxes, so
    # generate a grid of all anchor box centers with `boxes_per_pixel` repeats
    out_grid = torch.stack([shift_x, shift_y, shift_x, shift_y],
                dim=1).repeat_interleave(boxes_per_pixel, dim=0)
    output = out_grid + anchor_manipulations
    return output.unsqueeze(0)
We can see that the shape of the returned anchor box variable Y is (batch size, number of anchor
boxes, 4).
img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]
print(h, w)
X = torch.rand(size=(1, 3, h, w))  # Construct input data
Y = multibox_prior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
Y.shape
561 728
torch.Size([1, 2042040, 4])
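The number of anchor boxes in this shape can be checked against the counting argument given earlier. The sketch below (the helper `num_anchor_boxes` is hypothetical) evaluates both the reduced count and the count for all scale and ratio combinations on the 561 × 728 image with three scales and three aspect ratios.

```python
def num_anchor_boxes(height, width, num_scales, num_ratios, all_pairs=False):
    """Count anchor boxes for an image, with or without all (scale, ratio) pairs."""
    per_pixel = (num_scales * num_ratios if all_pairs
                 else num_scales + num_ratios - 1)
    return height * width * per_pixel

print(num_anchor_boxes(561, 728, 3, 3))  # 2042040
print(num_anchor_boxes(561, 728, 3, 3, all_pairs=True))  # 3675672
```

Restricting the combinations to those containing the first scale or the first ratio reduces the count from nine to five boxes per pixel.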
After changing the shape of the anchor box variable Y to (image height, image width, number of
anchor boxes centered on the same pixel, 4), we can obtain all the anchor boxes centered on a
specified pixel position. In the following, we access the first anchor box centered on (250, 250). It
has four elements: the (x, y)-axis coordinates at the upper-left corner and the (x, y)-axis coordi-
nates at the lower-right corner of the anchor box. The coordinate values of both axes are divided
by the width and height of the image, respectively; thus, the range is between 0 and 1.
boxes = Y.reshape(h, w, 5, 4)
boxes[250, 250, 0, :]
tensor([0.06, 0.07, 0.63, 0.82])
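These four values can be reproduced by hand. The following sketch mirrors the arithmetic of `multibox_prior` for a single anchor box in plain Python; the function `anchor_box` is a hypothetical scalar restatement, not part of the d2l package.

```python
import math

def anchor_box(row, col, height, width, scale, ratio):
    """Normalized (x1, y1, x2, y2) of one anchor box centered on a pixel.

    Mirrors `multibox_prior`: the center sits at the middle of the pixel,
    the box height is `scale / sqrt(ratio)`, and the box width is
    `scale * sqrt(ratio)` rescaled by `height / width`.
    """
    cx, cy = (col + 0.5) / width, (row + 0.5) / height
    box_w = scale * math.sqrt(ratio) * height / width
    box_h = scale / math.sqrt(ratio)
    return [cx - box_w / 2, cy - box_h / 2, cx + box_w / 2, cy + box_h / 2]

box = anchor_box(250, 250, height=561, width=728, scale=0.75, ratio=1)
print([round(v, 2) for v in box])  # [0.06, 0.07, 0.63, 0.82]
```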
In order to show all the anchor boxes centered on one pixel in the image, we define the following
show_bboxes function to draw multiple bounding boxes on the image.
#@save
def show_bboxes(axes, bboxes, labels=None, colors=None):
    """Show bounding boxes."""

    def make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj

    labels = make_list(labels)
    colors = make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(bbox.detach().numpy(), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))

As we just saw, the coordinate values of the x and y axes in the variable boxes have been divided by
the width and height of the image, respectively. When drawing anchor boxes, we need to restore
their original coordinate values; thus, we define variable bbox_scale below. Now, we can draw all
the anchor boxes centered on (250, 250) in the image. As you can see, the blue anchor box with a
scale of 0.75 and an aspect ratio of 1 well surrounds the dog in the image.
d2l.set_figsize()
bbox_scale = torch.tensor((w, h, w, h))
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, boxes[250, 250, :, :] * bbox_scale,
            ['s=0.75, r=1', 's=0.5, r=1', 's=0.25, r=1', 's=0.75, r=2',
             's=0.75, r=0.5'])
14.4.2 Intersection over Union (IoU)
We just mentioned that an anchor box “well” surrounds the dog in the image. If the ground-truth
bounding box of the object is known, how can “well” here be quantified? Intuitively, we can mea-
sure the similarity between the anchor box and the ground-truth bounding box. We know that
the Jaccard index can measure the similarity between two sets. Given sets A and B, their Jaccard
index is the size of their intersection divided by the size of their union:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \qquad (14.4.2)$$
In fact, we can consider the pixel area of any bounding box as a set of pixels. In this way, we can
measure the similarity of the two bounding boxes by the Jaccard index of their pixel sets. For two
bounding boxes, we usually refer to their Jaccard index as intersection over union (IoU), which is
the ratio of their intersection area to their union area, as shown in Fig. 14.4.1. The range of an IoU
is between 0 and 1: 0 means that two bounding boxes do not overlap at all, while 1 indicates that
the two bounding boxes are equal.

Fig. 14.4.1: IoU is the ratio of the intersection area to the union area of two bounding boxes.
For the remainder of this section, we will use IoU to measure the similarity between anchor boxes
and ground-truth bounding boxes, and between different anchor boxes. Given two lists of anchor
or bounding boxes, the following box_iou computes their pairwise IoU across these two lists.
#@save
def box_iou(boxes1, boxes2):
    """Compute pairwise IoU across two lists of anchor or bounding boxes."""
    box_area = lambda boxes: ((boxes[:, 2] - boxes[:, 0]) *
                              (boxes[:, 3] - boxes[:, 1]))
    # Shape of `boxes1`, `boxes2`, `areas1`, `areas2`: (no. of boxes1, 4),
    # (no. of boxes2, 4), (no. of boxes1,), (no. of boxes2,)
    areas1 = box_area(boxes1)
    areas2 = box_area(boxes2)
    # Shape of `inter_upperlefts`, `inter_lowerrights`, `inters`: (no. of
    # boxes1, no. of boxes2, 2)
    inter_upperlefts = torch.max(boxes1[:, None, :2], boxes2[:, :2])
    inter_lowerrights = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])
    inters = (inter_lowerrights - inter_upperlefts).clamp(min=0)
    # Shape of `inter_areas` and `union_areas`: (no. of boxes1, no. of boxes2)
    inter_areas = inters[:, :, 0] * inters[:, :, 1]
    union_areas = areas1[:, None] + areas2 - inter_areas
    return inter_areas / union_areas
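For intuition, the same quantity can be computed for a single pair of boxes in plain Python. The function `iou` below is a hypothetical scalar counterpart of `box_iou`, shown on three small cases: partial overlap, identical boxes, and disjoint boxes.

```python
def iou(box_a, box_b):
    """IoU of two boxes, each given as (x1, y1, x2, y2)."""
    # Width and height of the intersection, clamped at zero when disjoint
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 2, 2], [1, 1, 3, 3]))  # 1 / 7, about 0.143
print(iou([0, 0, 2, 2], [0, 0, 2, 2]))  # 1.0
print(iou([0, 0, 1, 1], [2, 2, 3, 3]))  # 0.0
```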
14.4.3 Labeling Anchor Boxes in Training Data
In a training dataset, we consider each anchor box as a training example. In order to train an object
detection model, we need class and offset labels for each anchor box, where the former is the class
of the object relevant to the anchor box and the latter is the offset of the ground-truth bounding
box relative to the anchor box. During the prediction, for each image we generate multiple anchor
boxes, predict classes and offsets for all the anchor boxes, adjust their positions according to the
predicted offsets to obtain the predicted bounding boxes, and finally only output those predicted
bounding boxes that satisfy certain criteria.
As we know, an object detection training set comes with labels for locations of ground-truth bound-
ing boxes and classes of their surrounded objects. To label any generated anchor box, we refer to
the labeled location and class of its assigned ground-truth bounding box that is closest to the an-
chor box. In the following, we describe an algorithm for assigning closest ground-truth bounding
boxes to anchor boxes.

Assigning Ground-Truth Bounding Boxes to Anchor Boxes
Given an image, suppose that the anchor boxes are $A_1, A_2, \ldots, A_{n_a}$ and the ground-truth
bounding boxes are $B_1, B_2, \ldots, B_{n_b}$, where $n_a \geq n_b$. Let's define a matrix
$\mathbf{X} \in \mathbb{R}^{n_a \times n_b}$, whose element $x_{ij}$ in the $i$th row and $j$th column
is the IoU of the anchor box $A_i$ and the ground-truth bounding box $B_j$.
The algorithm consists of the following steps:
1. Find the largest element in matrix $\mathbf{X}$ and denote its row and column indices as $i_1$
and $j_1$, respectively. Then the ground-truth bounding box $B_{j_1}$ is assigned to the anchor
box $A_{i_1}$. This is quite intuitive because $A_{i_1}$ and $B_{j_1}$ are the closest among all the
pairs of anchor boxes and ground-truth bounding boxes. After the first assignment, discard all
the elements in the $i_1$th row and the $j_1$th column in matrix $\mathbf{X}$.
2. Find the largest of the remaining elements in matrix $\mathbf{X}$ and denote its row and column
indices as $i_2$ and $j_2$, respectively. We assign ground-truth bounding box $B_{j_2}$ to anchor
box $A_{i_2}$ and discard all the elements in the $i_2$th row and the $j_2$th column in matrix
$\mathbf{X}$.
3. At this point, elements in two rows and two columns in matrix $\mathbf{X}$ have been discarded.
We proceed until all elements in $n_b$ columns in matrix $\mathbf{X}$ are discarded. At this time,
we have assigned a ground-truth bounding box to each of $n_b$ anchor boxes.
4. Only traverse through the remaining $n_a - n_b$ anchor boxes. For example, given any anchor
box $A_i$, find the ground-truth bounding box $B_j$ with the largest IoU with $A_i$ throughout the
$i$th row of matrix $\mathbf{X}$, and assign $B_j$ to $A_i$ only if this IoU is greater than a
predefined threshold.
Let's illustrate the above algorithm using a concrete example. As shown in Fig. 14.4.2 (left),
assuming that the maximum value in matrix $\mathbf{X}$ is $x_{23}$, we assign the ground-truth
bounding box $B_3$ to the anchor box $A_2$. Then, we discard all the elements in row 2 and column 3
of the matrix, find the largest $x_{71}$ in the remaining elements (shaded area), and assign the
ground-truth bounding box $B_1$ to the anchor box $A_7$. Next, as shown in Fig. 14.4.2 (middle),
discard all the elements in row 7 and column 1 of the matrix, find the largest $x_{54}$ in the
remaining elements (shaded area), and assign the ground-truth bounding box $B_4$ to the anchor box
$A_5$. Finally, as shown in Fig. 14.4.2 (right), discard all the elements in row 5 and column 4 of
the matrix, find the largest $x_{92}$ in the remaining elements (shaded area), and assign the
ground-truth bounding box $B_2$ to the anchor box $A_9$. After that, we only need to traverse
through the remaining anchor boxes $A_1, A_3, A_4, A_6, A_8$ and determine whether to assign them
ground-truth bounding boxes according to the threshold.

Fig. 14.4.2: Assigning ground-truth bounding boxes to anchor boxes.
This algorithm is implemented in the following assign_anchor_to_bbox function.
#@save
def assign_anchor_to_bbox(ground_truth, anchors, device, iou_threshold=0.5):
    """Assign closest ground-truth bounding boxes to anchor boxes."""
    num_anchors, num_gt_boxes = anchors.shape[0], ground_truth.shape[0]
    # Element x_ij in the i-th row and j-th column is the IoU of the anchor
    # box i and the ground-truth bounding box j
    jaccard = box_iou(anchors, ground_truth)
    # Initialize the tensor to hold the assigned ground-truth bounding box for
    # each anchor
    anchors_bbox_map = torch.full((num_anchors,), -1, dtype=torch.long,
                                  device=device)
    # Assign ground-truth bounding boxes according to the threshold
    max_ious, indices = torch.max(jaccard, dim=1)
    anc_i = torch.nonzero(max_ious >= iou_threshold).reshape(-1)
    box_j = indices[max_ious >= iou_threshold]
    anchors_bbox_map[anc_i] = box_j
    col_discard = torch.full((num_anchors,), -1)
    row_discard = torch.full((num_gt_boxes,), -1)
    for _ in range(num_gt_boxes):
        max_idx = torch.argmax(jaccard)  # Find the largest IoU
        box_idx = (max_idx % num_gt_boxes).long()
        anc_idx = (max_idx / num_gt_boxes).long()
        anchors_bbox_map[anc_idx] = box_idx
        jaccard[:, box_idx] = col_discard
        jaccard[anc_idx, :] = row_discard
    return anchors_bbox_map
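To see the two passes of this function on numbers small enough to check by hand, here is a framework-free sketch. The function name assign_anchors and the IoU matrix are made up for illustration; like the function above, the sketch first applies the threshold to every anchor box and then lets the greedy pass override those assignments, which gives the same result as running the four steps in the order described.

```python
def assign_anchors(iou, iou_threshold=0.5):
    """Map each anchor (row of `iou`) to a ground-truth index, or -1 for none."""
    num_anchors, num_gt = len(iou), len(iou[0])
    assignment = [-1] * num_anchors
    # Threshold pass: give each anchor its best-matching box if that IoU is
    # at least the threshold
    for i, row in enumerate(iou):
        j = max(range(num_gt), key=lambda col: row[col])
        if row[j] >= iou_threshold:
            assignment[i] = j
    # Greedy pass: repeatedly take the largest remaining IoU, assign that
    # pair, and discard its row and column
    remaining = [list(row) for row in iou]
    for _ in range(num_gt):
        i, j = max(((r, c) for r in range(num_anchors) for c in range(num_gt)),
                   key=lambda rc: remaining[rc[0]][rc[1]])
        assignment[i] = j
        for r in range(num_anchors):
            remaining[r][j] = -1.0
        remaining[i] = [-1.0] * num_gt
    return assignment


# Hypothetical IoU matrix: 5 anchor boxes (rows) x 2 ground-truth boxes (columns)
iou = [[0.05, 0.00],
       [0.14, 0.00],
       [0.00, 0.57],
       [0.00, 0.21],
       [0.00, 0.75]]
print(assign_anchors(iou))  # [-1, 0, 1, -1, 1]
```

Note that anchor 1 receives ground-truth box 0 even though its IoU (0.14) is far below the threshold: the greedy pass guarantees that every ground-truth box is assigned to at least one anchor box.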

Labeling Classes and Offsets
Now we can label the class and offset for each anchor box. Suppose that an anchor box A is as-
signed a ground-truth bounding box B. On the one hand, the class of the anchor box A will be
labeled as that of B. On the other hand, the offset of the anchor box A will be labeled according
to the relative position between the central coordinates of B and A together with the relative size
between these two boxes. Given varying positions and sizes of different boxes in the dataset, we
can apply transformations to those relative positions and sizes that may lead to more uniformly
distributed offsets that are easier to fit. Here we describe a common transformation. Given the
central coordinates of $A$ and $B$ as $(x_a, y_a)$ and $(x_b, y_b)$, their widths as $w_a$ and
$w_b$, and their heights as $h_a$ and $h_b$, respectively, we may label the offset of $A$ as

$$\left( \frac{ \frac{x_b - x_a}{w_a} - \mu_x }{\sigma_x},
\frac{ \frac{y_b - y_a}{h_a} - \mu_y }{\sigma_y},
\frac{ \log \frac{w_b}{w_a} - \mu_w }{\sigma_w},
\frac{ \log \frac{h_b}{h_a} - \mu_h }{\sigma_h} \right), \tag{14.4.3}$$

where default values of the constants are $\mu_x = \mu_y = \mu_w = \mu_h = 0$,
$\sigma_x = \sigma_y = 0.1$, and $\sigma_w = \sigma_h = 0.2$. This transformation is implemented
below in the offset_boxes function.
#@save
def offset_boxes(anchors, assigned_bb, eps=1e-6):
    """Transform for anchor box offsets."""
    c_anc = d2l.box_corner_to_center(anchors)
    c_assigned_bb = d2l.box_corner_to_center(assigned_bb)
    offset_xy = 10 * (c_assigned_bb[:, :2] - c_anc[:, :2]) / c_anc[:, 2:]
    offset_wh = 5 * torch.log(eps + c_assigned_bb[:, 2:] / c_anc[:, 2:])
    offset = torch.cat([offset_xy, offset_wh], axis=1)
    return offset
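The factors 10 and 5 in offset_boxes are simply 1/0.1 and 1/0.2, the reciprocals of the default standard deviations. As a worked instance of the transformation, the sketch below evaluates it in plain Python for one pair of boxes: the anchor box [0.57, 0.3, 0.92, 0.9] and the ground-truth box [0.55, 0.2, 0.9, 0.88], both of which appear in the example later in this section. The helper name encode_offset is hypothetical, all means are taken to be zero, and the small eps term is omitted.

```python
import math


def encode_offset(anchor, gt, sigma_xy=0.1, sigma_wh=0.2):
    """Offset label of `anchor` relative to `gt` (corner format, all means zero)."""
    # Convert both boxes from corner format to (center x, center y, width, height)
    xa, ya = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    xb, yb = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    wb, hb = gt[2] - gt[0], gt[3] - gt[1]
    return ((xb - xa) / wa / sigma_xy,
            (yb - ya) / ha / sigma_xy,
            math.log(wb / wa) / sigma_wh,
            math.log(hb / ha) / sigma_wh)


anchor = (0.57, 0.3, 0.92, 0.9)
gt = (0.55, 0.2, 0.9, 0.88)
print([round(v, 3) for v in encode_offset(anchor, gt)])
```

Up to rounding, the four values are -0.571, -1, 0, and 0.626: the ground-truth box sits slightly to the left of and above the anchor box, has the same width, and is somewhat taller.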
If an anchor box is not assigned a ground-truth bounding box, we just label the class of the anchor
box as “background”. Anchor boxes whose classes are background are often referred to as negative
anchor boxes, and the rest are called positive anchor boxes. We implement the following
multibox_target function to label classes and offsets for anchor boxes (the anchors argument) using
ground-truth bounding boxes (the labels argument). This function sets the background class to
zero and increments the integer index of a new class by one.
#@save
def multibox_target(anchors, labels):
    """Label anchor boxes using ground-truth bounding boxes."""
    batch_size, anchors = labels.shape[0], anchors.squeeze(0)
    batch_offset, batch_mask, batch_class_labels = [], [], []
    device, num_anchors = anchors.device, anchors.shape[0]
    for i in range(batch_size):
        label = labels[i, :, :]
        anchors_bbox_map = assign_anchor_to_bbox(
            label[:, 1:], anchors, device)
        bbox_mask = ((anchors_bbox_map >= 0).float().unsqueeze(-1)).repeat(
            1, 4)
        # Initialize class labels and assigned bounding box coordinates with
        # zeros
        class_labels = torch.zeros(num_anchors, dtype=torch.long,
                                   device=device)
        assigned_bb = torch.zeros((num_anchors, 4), dtype=torch.float32,
                                  device=device)
        # Label classes of anchor boxes using their assigned ground-truth
        # bounding boxes. If an anchor box is not assigned any, we label its
        # class as background (the value remains zero)
        indices_true = torch.nonzero(anchors_bbox_map >= 0)
        bb_idx = anchors_bbox_map[indices_true]
        class_labels[indices_true] = label[bb_idx, 0].long() + 1
        assigned_bb[indices_true] = label[bb_idx, 1:]
        # Offset transformation
        offset = offset_boxes(anchors, assigned_bb) * bbox_mask
        batch_offset.append(offset.reshape(-1))
        batch_mask.append(bbox_mask.reshape(-1))
        batch_class_labels.append(class_labels)
    bbox_offset = torch.stack(batch_offset)
    bbox_mask = torch.stack(batch_mask)
    class_labels = torch.stack(batch_class_labels)
    return (bbox_offset, bbox_mask, class_labels)
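The class-label and mask bookkeeping inside multibox_target can be traced on a single image without tensors. The sketch below is illustrative only: the helper name label_anchors and the assignment map are assumptions, with -1 meaning that no ground-truth bounding box was assigned to that anchor box.

```python
def label_anchors(assignment, gt_classes):
    """Class labels (0 = background) and a flattened 4-per-anchor offset mask."""
    class_labels, mask = [], []
    for gt_idx in assignment:
        if gt_idx >= 0:
            # Positive anchor: shift the dataset class index past background
            class_labels.append(gt_classes[gt_idx] + 1)
            mask.extend([1.0] * 4)
        else:
            # Negative anchor: background class, offsets masked out
            class_labels.append(0)
            mask.extend([0.0] * 4)
    return class_labels, mask


# Hypothetical assignment map for five anchor boxes and two ground-truth boxes
assignment = [-1, 0, 1, -1, 1]
gt_classes = [0, 1]  # dataset class indices, e.g. 0 = dog, 1 = cat
class_labels, mask = label_anchors(assignment, gt_classes)
print(class_labels)  # [0, 1, 2, 0, 2]
print(mask)
```

Each anchor box contributes four consecutive mask entries, one per offset value, which is why the mask has four times as many entries as there are anchor boxes.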
An Example
Letʼs illustrate anchor box labeling via a concrete example. We define ground-truth bounding
boxes for the dog and cat in the loaded image, where the first element is the class (0 for dog and
1 for cat) and the remaining four elements are the (x, y)-axis coordinates at the upper-left corner
and the lower-right corner (range is between 0 and 1). We also construct five anchor boxes to be
labeled using the coordinates of the upper-left corner and the lower-right corner: A0 , . . . , A4 (the
index starts from 0). Then we plot these ground-truth bounding boxes and anchor boxes in the
image.
ground_truth = torch.tensor([[0, 0.1, 0.08, 0.52, 0.92],
                             [1, 0.55, 0.2, 0.9, 0.88]])
anchors = torch.tensor([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
                        [0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
                        [0.57, 0.3, 0.92, 0.9]])
show_bboxes(fig.axes, ground_truth[:, 1:] * bbox_scale, ['dog', 'cat'], 'k')
show_bboxes(fig.axes, anchors * bbox_scale, ['0', '1', '2', '3', '4']);

Using the multibox_target function defined above, we can label classes and offsets of these anchor
boxes based on the ground-truth bounding boxes for the dog and cat. In this example, indices of
the background, dog, and cat classes are 0, 1, and 2, respectively. Below we add a dimension for
examples of anchor boxes and ground-truth bounding boxes.
labels = multibox_target(anchors.unsqueeze(dim=0),
                         ground_truth.unsqueeze(dim=0))
There are three items in the returned result, all of which are in the tensor format. The third item
contains the labeled classes of the input anchor boxes.
Letʼs analyze the returned class labels below based on anchor box and ground-truth bounding box
positions in the image. First, among all the pairs of anchor boxes and ground-truth bounding
boxes, the IoU of the anchor box A4 and the ground-truth bounding box of the cat is the largest.
Thus, the class of A4 is labeled as the cat. Taking out pairs containing A4 or the ground-truth
bounding box of the cat, among the rest the pair of the anchor box A1 and the ground-truth bound-
ing box of the dog has the largest IoU. So the class of A1 is labeled as the dog. Next, we need to
traverse through the remaining three unlabeled anchor boxes: A0 , A2 , and A3 . For A0 , the class of
the ground-truth bounding box with the largest IoU is the dog, but the IoU is below the predefined
threshold (0.5), so the class is labeled as background; for A2 , the class of the ground-truth bound-
ing box with the largest IoU is the cat and the IoU exceeds the threshold, so the class is labeled as
the cat; for A3 , the class of the ground-truth bounding box with the largest IoU is the cat, but the
value is below the threshold, so the class is labeled as background.
labels[2]
tensor([[0, 1, 2, 0, 2]])
The second returned item is a mask variable of the shape (batch size, four times the number of
anchor boxes). Every four elements in the mask variable correspond to the four offset values of
each anchor box. Since we do not care about background detection, offsets of this negative class
should not affect the objective function. Through elementwise multiplications, zeros in the mask
variable will filter out negative class offsets before calculating the objective function.
labels[1]
tensor([[0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1.,
1., 1.]])
The first returned item contains the four offset values labeled for each anchor box. Note that the
offsets of negative-class anchor boxes are labeled as zeros.
labels[0]
tensor([[-0.00e+00, -0.00e+00, -0.00e+00, -0.00e+00,  1.40e+00,  1.00e+01,
          2.59e+00,  7.18e+00, -1.20e+00,  2.69e-01,  1.68e+00, -1.57e+00,
         -0.00e+00, -0.00e+00, -0.00e+00, -0.00e+00, -5.71e-01, -1.00e+00,
          4.17e-06,  6.26e-01]])

14.4.4 Predicting Bounding Boxes with Non-Maximum Suppression
During prediction, we generate multiple anchor boxes for the image and predict classes and off-
sets for each of them. A predicted bounding box is thus obtained according to an anchor box with its
predicted offset. Below we implement the offset_inverse function that takes in anchors and off-
set predictions as inputs and applies inverse offset transformations to return the predicted bound-
ing box coordinates.
#@save
def offset_inverse(anchors, offset_preds):
    """Predict bounding boxes based on anchor boxes with predicted offsets."""
    anc = d2l.box_corner_to_center(anchors)
    pred_bbox_xy = (offset_preds[:, :2] * anc[:, 2:] / 10) + anc[:, :2]
    pred_bbox_wh = torch.exp(offset_preds[:, 2:] / 5) * anc[:, 2:]
    pred_bbox = torch.cat((pred_bbox_xy, pred_bbox_wh), axis=1)
    predicted_bbox = d2l.box_center_to_corner(pred_bbox)
    return predicted_bbox
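The inverse transformation can also be checked by hand. The following plain-Python sketch (the helper name decode_offset and the example boxes are assumptions for illustration) undoes the scaling by the default standard deviations 0.1 and 0.2, which is what the divisions by 10 and 5 above amount to, and then converts the result back to corner format.

```python
import math


def decode_offset(anchor, offset, sigma_xy=0.1, sigma_wh=0.2):
    """Apply a predicted offset to a corner-format anchor box."""
    # Anchor box in (center x, center y, width, height) form
    xa, ya = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    # Shift the center and rescale the size, inverting the offset encoding
    x = offset[0] * sigma_xy * wa + xa
    y = offset[1] * sigma_xy * ha + ya
    w = math.exp(offset[2] * sigma_wh) * wa
    h = math.exp(offset[3] * sigma_wh) * ha
    # Back to (x1, y1, x2, y2) corner format
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


anchor = (0.0, 0.0, 2.0, 2.0)
same = decode_offset(anchor, (0.0, 0.0, 0.0, 0.0))
moved = decode_offset(anchor, (10.0, 0.0, 5 * math.log(2), 0.0))
print(same)
print([round(v, 6) for v in moved])
```

All-zero offsets return the anchor box itself. In the second call, an x-offset of 10 moves the center one anchor width to the right and a width offset of 5 log 2 doubles the width, giving approximately (1, 0, 5, 2).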
When there are many anchor boxes, many similar predicted bounding boxes (with significant overlap)
may be output for the same object. To simplify the output, we can merge similar predicted bounding
boxes that belong to the same object by using non-maximum suppression (NMS).
Here is how non-maximum suppression works. For a predicted bounding box B, the object detec-
tion model calculates the predicted likelihood for each class. Denoting by p the largest predicted
likelihood, the class corresponding to this probability is the predicted class for B. Specifically,
we refer to p as the confidence (score) of the predicted bounding box B. On the same image, all
the predicted non-background bounding boxes are sorted by confidence in descending order to
generate a list L. Then we manipulate the sorted list L in the following steps:
1. Select the predicted bounding box B1 with the highest confidence from L as a basis and
remove all non-basis predicted bounding boxes whose IoU with B1 exceeds a predefined
threshold ϵ from L. At this point, L keeps the predicted bounding box with the highest con-
fidence but drops others that are too similar to it. In a nutshell, those with non-maximum
confidence scores are suppressed.
2. Select the predicted bounding box B2 with the second highest confidence from L as another
basis and remove all non-basis predicted bounding boxes whose IoU with B2 exceeds ϵ from
L.
3. Repeat the above process until all the predicted bounding boxes in L have been used as a basis.
At this time, the IoU of any pair of predicted bounding boxes in L is below the threshold ϵ; thus,
no two boxes in L are too similar to each other.
4. Output all the predicted bounding boxes in the list L.
The following nms function sorts confidence scores in descending order and returns the indices of
the predicted bounding boxes that are kept.
#@save
def nms(boxes, scores, iou_threshold):
    """Return indices of predicted bounding boxes kept by non-maximum suppression."""
    B = torch.argsort(scores, dim=-1, descending=True)
    keep = []  # Indices of predicted bounding boxes that will be kept
    while B.numel() > 0:
        i = B[0]
        keep.append(i)
        if B.numel() == 1:
            break
        iou = box_iou(boxes[i, :].reshape(-1, 4),
                      boxes[B[1:], :].reshape(-1, 4)).reshape(-1)
        # Keep only the boxes whose IoU with the current basis is small enough
        inds = torch.nonzero(iou <= iou_threshold).reshape(-1)
        B = B[inds + 1]
    return torch.tensor(keep, device=boxes.device)
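The same procedure can be traced in plain Python. The sketch below is not the d2l implementation: the names iou_pair and nms_sketch are hypothetical. It is applied to the four boxes and confidences used in the example later in this section, with an IoU threshold of 0.5.

```python
def iou_pair(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def nms_sketch(boxes, scores, iou_threshold):
    """Return indices of the boxes kept, in order of decreasing confidence."""
    # Sort box indices by confidence, highest first (ties keep input order)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        basis = order.pop(0)
        keep.append(basis)
        # Drop every remaining box that overlaps the basis too much
        order = [i for i in order
                 if iou_pair(boxes[basis], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0.1, 0.08, 0.52, 0.92), (0.08, 0.2, 0.56, 0.95),
         (0.15, 0.3, 0.62, 0.91), (0.55, 0.2, 0.9, 0.88)]
scores = [0.9, 0.8, 0.7, 0.9]
print(nms_sketch(boxes, scores, iou_threshold=0.5))  # [0, 3]
```

Boxes 1 and 2 have IoUs of roughly 0.74 and 0.55 with box 0 and are therefore suppressed, while box 3 does not overlap box 0 at all and survives as a second basis.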
We define the following multibox_detection function to apply non-maximum suppression when
predicting bounding boxes. Do not worry if you find the implementation a bit complicated: we will
show how it works with a concrete example right after the implementation.
#@save
def multibox_detection(cls_probs, offset_preds, anchors, nms_threshold=0.5,
                       pos_threshold=0.009999999):
    """Predict bounding boxes using non-maximum suppression."""
    device, batch_size = cls_probs.device, cls_probs.shape[0]
    anchors = anchors.squeeze(0)
    num_classes, num_anchors = cls_probs.shape[1], cls_probs.shape[2]
    out = []
    for i in range(batch_size):
        cls_prob, offset_pred = cls_probs[i], offset_preds[i].reshape(-1, 4)
        conf, class_id = torch.max(cls_prob[1:], 0)
        predicted_bb = offset_inverse(anchors, offset_pred)
        keep = nms(predicted_bb, conf, nms_threshold)
        # Find all non-`keep` indices and set the class to background
        all_idx = torch.arange(num_anchors, dtype=torch.long, device=device)
        combined = torch.cat((keep, all_idx))
        uniques, counts = combined.unique(return_counts=True)
        non_keep = uniques[counts == 1]
        all_id_sorted = torch.cat((keep, non_keep))
        class_id[non_keep] = -1
        class_id = class_id[all_id_sorted]
        conf, predicted_bb = conf[all_id_sorted], predicted_bb[all_id_sorted]
        # Here `pos_threshold` is a threshold for positive (non-background)
        # predictions
        below_min_idx = (conf < pos_threshold)
        class_id[below_min_idx] = -1
        conf[below_min_idx] = 1 - conf[below_min_idx]
        pred_info = torch.cat((class_id.unsqueeze(1),
                               conf.unsqueeze(1),
                               predicted_bb), dim=1)
        out.append(pred_info)
    return torch.stack(out)
Now letʼs apply the above implementations to a concrete example with four anchor boxes. For sim-
plicity, we assume that the predicted offsets are all zeros. This means that the predicted bounding
boxes are anchor boxes. For each class among the background, dog, and cat, we also define its
predicted likelihood.
anchors = torch.tensor([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
                        [0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
offset_preds = torch.tensor([0] * anchors.numel())
cls_probs = torch.tensor([[0] * 4,  # Predicted background likelihood
                          [0.9, 0.8, 0.7, 0.1],  # Predicted dog likelihood
                          [0.1, 0.2, 0.3, 0.9]])  # Predicted cat likelihood
We can plot these predicted bounding boxes with their confidence on the image.
show_bboxes(fig.axes, anchors * bbox_scale,
['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9'])
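The labels in this plot follow from taking, for each anchor box, the non-background class with the largest predicted likelihood, which is what `torch.max(cls_prob[1:], 0)` computes inside multibox_detection. A plain-Python sketch of that step (the helper name best_foreground is an assumption) reads:

```python
cls_probs = [[0.0, 0.0, 0.0, 0.0],  # background
             [0.9, 0.8, 0.7, 0.1],  # dog
             [0.1, 0.2, 0.3, 0.9]]  # cat


def best_foreground(cls_probs):
    """Per anchor box: index and likelihood of the best non-background class."""
    foreground = cls_probs[1:]  # skip the background row
    class_ids, confs = [], []
    for anchor_idx in range(len(foreground[0])):
        column = [row[anchor_idx] for row in foreground]
        conf = max(column)
        class_ids.append(column.index(conf))  # 0 = dog, 1 = cat
        confs.append(conf)
    return class_ids, confs


class_ids, confs = best_foreground(cls_probs)
print(class_ids, confs)  # [0, 0, 0, 1] [0.9, 0.8, 0.7, 0.9]
```

Because the background row is skipped, the returned class indices start from 0 for the first real class, which matches the convention of the detection output.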
Now we can invoke the multibox_detection function to perform non-maximum suppression, where the
threshold is set to 0.5. Note that we add a dimension for examples in the tensor input.
We can see that the shape of the returned result is (batch size, number of anchor boxes, 6). The six
elements in the innermost dimension give the output information for the same predicted bounding
box. The first element is the predicted class index, which starts from 0 (0 is dog and 1 is cat).
The value -1 indicates background or removal in non-maximum suppression. The second element
is the confidence of the predicted bounding box. The remaining four elements are the (x, y)-axis
coordinates of the upper-left corner and the lower-right corner of the predicted bounding box,
respectively (range is between 0 and 1).
output = multibox_detection(cls_probs.unsqueeze(dim=0),
                            offset_preds.unsqueeze(dim=0),
                            anchors.unsqueeze(dim=0),
                            nms_threshold=0.5)
output
tensor([[[ 0.00,  0.90,  0.10,  0.08,  0.52,  0.92],
         [ 1.00,  0.90,  0.55,  0.20,  0.90,  0.88],
         [-1.00,  0.80,  0.08,  0.20,  0.56,  0.95],
         [-1.00,  0.70,  0.15,  0.30,  0.62,  0.91]]])
After removing those predicted bounding boxes of class -1, we can output the final predicted
bounding box kept by non-maximum suppression.

for i in output[0].detach().numpy():
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [torch.tensor(i[2:]) * bbox_scale], label)
In practice, we can remove predicted bounding boxes with lower confidence even before per-
forming non-maximum suppression, thereby reducing computation in this algorithm. We may
also post-process the output of non-maximum suppression, for example, by only keeping results
with higher confidence in the final output.
14.4.5 Summary
• We generate anchor boxes with different shapes centered on each pixel of the image.
• Intersection over union (IoU), also known as Jaccard index, measures the similarity of two
bounding boxes. It is the ratio of their intersection area to their union area.
• In a training set, we need two types of labels for each anchor box. One is the class of the
object relevant to the anchor box and the other is the offset of the ground-truth bounding
box relative to the anchor box.
• During prediction, we can use non-maximum suppression (NMS) to remove similar pre-
dicted bounding boxes, thereby simplifying the output.
Exercises
1. Change values of sizes and ratios in the multibox_prior function. What are the changes
to the generated anchor boxes?
2. Construct and visualize two bounding boxes with an IoU of 0.5. How do they overlap with
each other?
3. Modify the variable anchors in Section 14.4.3 and Section 14.4.4. How do the results change?
4. Non-maximum suppression is a greedy algorithm that suppresses predicted bounding boxes
by removing them. Is it possible that some of these removed ones are actually useful? How
can this algorithm be modified to suppress softly? You may refer to Soft-NMS (Bodla et al.,
2017).
5. Rather than being hand-crafted, can non-maximum suppression be learned?
14.5 Multiscale Object Detection
In Section 14.4, we generated multiple anchor boxes centered on each pixel of an input image.
Essentially these anchor boxes represent samples of different regions of the image. However, we
may end up with too many anchor boxes to compute if they are generated for every pixel. Think
of a 561 × 728 input image. If five anchor boxes with varying shapes are generated for each pixel
as their center, over two million anchor boxes (561 × 728 × 5) need to be labeled and predicted on
the image.
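The count quoted above is a one-line calculation, shown here only to make the scale of the problem concrete (the variable names are illustrative):

```python
height, width, boxes_per_pixel = 561, 728, 5
# One set of anchor boxes per pixel of the input image
num_anchors = height * width * boxes_per_pixel
print(num_anchors)  # 2042040
```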
14.5.1 Multiscale Anchor Boxes
You may realize that it is not difficult to reduce anchor boxes on an image. For instance, we can
just uniformly sample a small portion of pixels from the input image to generate anchor boxes
centered on them. In addition, at different scales we can generate different numbers of anchor
boxes of different sizes. Intuitively, smaller objects are more likely to appear on an image than
larger ones. As an example, 1 × 1, 1 × 2, and 2 × 2 objects can appear on a 2 × 2 image in 4, 2,
and 1 possible ways, respectively. Therefore, when using smaller anchor boxes to detect smaller
objects, we can sample more regions, while for larger objects we can sample fewer regions.
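The counts 4, 2, and 1 follow from sliding an object over the image one pixel at a time. A small sketch, assuming axis-aligned objects of fixed orientation placed at whole-pixel positions (the helper name num_placements is hypothetical):

```python
def num_placements(obj_h, obj_w, img_h, img_w):
    """Number of distinct positions of an obj_h x obj_w object in an image."""
    # The upper-left corner can take (img_h - obj_h + 1) row positions and
    # (img_w - obj_w + 1) column positions; zero if the object does not fit
    return max(0, img_h - obj_h + 1) * max(0, img_w - obj_w + 1)


counts = [num_placements(h, w, 2, 2) for h, w in [(1, 1), (1, 2), (2, 2)]]
print(counts)  # [4, 2, 1]
```

The number of placements shrinks quickly as the object grows, which is the intuition behind sampling fewer regions for larger anchor boxes.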
To demonstrate how to generate anchor boxes at multiple scales, letʼs read an image. Its height
and width are 561 and 728 pixels, respectively.
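%matplotlib inline
import torch
from d2l import torch as d2l

# The image path below is an assumption; point it at your copy of the image
img = d2l.plt.imread('../img/catdog.jpg')
h, w = img.shape[:2]
h, w
(561, 728)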
Recall that in Section 7.2 we call a two-dimensional array output of a convolutional layer a feature
map. By defining the feature map shape, we can determine centers of uniformly sampled anchor
boxes on any image.
The display_anchors function is defined below. We generate anchor boxes (anchors) on the fea-
ture map (fmap) with each unit (pixel) as the anchor box center. Since the (x, y)-axis coordinate
values in the anchor boxes (anchors) have been divided by the width and height of the feature map
(fmap), these values are between 0 and 1, which indicate the relative positions of anchor boxes in
the feature map.

Since centers of the anchor boxes (anchors) are spread over all units on the feature map (fmap),
these centers must be uniformly distributed on any input image in terms of their relative spatial
positions. More concretely, given the width and height of the feature map fmap_w and fmap_h, re-
spectively, the following function will uniformly sample pixels in fmap_h rows and fmap_w columns
on any input image. Centered on these uniformly sampled pixels, anchor boxes of scale s (assum-
ing the length of the list s is 1) and different aspect ratios (ratios) will be generated.
def display_anchors(fmap_w, fmap_h, s):
    d2l.set_figsize()
    # Values on the first two dimensions do not affect the output
    fmap = torch.zeros((1, 10, fmap_h, fmap_w))
    anchors = d2l.multibox_prior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = torch.tensor((w, h, w, h))
    d2l.show_bboxes(d2l.plt.imshow(img).axes,
                    anchors[0] * bbox_scale)
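To make the uniform sampling concrete, the sketch below lists the relative anchor box centers implied by a feature map shape. It assumes, as multibox_prior does, that each center sits in the middle of a feature-map unit; the helper name anchor_centers is hypothetical.

```python
def anchor_centers(fmap_h, fmap_w):
    """Relative (x, y) centers, one per feature-map unit, in row-major order."""
    return [((col + 0.5) / fmap_w, (row + 0.5) / fmap_h)
            for row in range(fmap_h) for col in range(fmap_w)]


img_h, img_w = 561, 728
centers = anchor_centers(4, 4)
# Scale relative coordinates to pixel coordinates on the input image
pixel_centers = [(x * img_w, y * img_h) for x, y in centers]
print(len(centers), centers[0], pixel_centers[0])  # 16 (0.125, 0.125) (91.0, 70.125)
print(anchor_centers(1, 1))  # [(0.5, 0.5)]
```

Because the centers are expressed relative to the feature map, the same list applies to an input image of any size; only the final scaling to pixels depends on the image.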
First, letʼs consider detection of small objects. In order to make it easier to distinguish when dis-
played, the anchor boxes with different centers here do not overlap: the anchor box scale is set to
0.15 and the height and width of the feature map are set to 4. We can see that the centers of the
anchor boxes in 4 rows and 4 columns on the image are uniformly distributed.
display_anchors(fmap_w=4, fmap_h=4, s=[0.15])
We move on to reduce the height and width of the feature map by half and use larger anchor
boxes to detect larger objects. When the scale is set to 0.4, some anchor boxes will overlap with
each other.
display_anchors(fmap_w=2, fmap_h=2, s=[0.4])
Finally, we further reduce the height and width of the feature map by half and increase the anchor
box scale to 0.8. Now the center of the anchor box is the center of the image.
display_anchors(fmap_w=1, fmap_h=1, s=[0.8])
14.5.2 Multiscale Detection
Since we have generated multiscale anchor boxes, we will use them to detect objects of various
sizes at different scales. In the following we introduce a CNN-based multiscale object detection
method that we will implement in Section 14.7.
At some scale, say that we have $c$ feature maps of shape $h \times w$. Using the method in Section 14.5.1,
we generate $hw$ sets of anchor boxes, where each set has $a$ anchor boxes with the same center. For
example, at the first scale in the experiments in Section 14.5.1, given ten (number of channels) 4×4
feature maps, we generated 16 sets of anchor boxes, where each set contains 3 anchor boxes with
the same center. Next, each anchor box is labeled with the class and offset based on ground-truth
bounding boxes. At the current scale, the object detection model needs to predict the classes and
offsets of hw sets of anchor boxes on the input image, where different sets have different centers.
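As a quick count, the three scales used in Section 14.5.1 (feature maps of shape 4 × 4, 2 × 2, and 1 × 1, each with 3 anchor boxes per center) produce far fewer anchor boxes than the per-pixel scheme. The helper name below is illustrative:

```python
def num_anchor_boxes(fmap_h, fmap_w, boxes_per_center):
    """Total anchor boxes at one scale: h * w sets of a boxes each."""
    return fmap_h * fmap_w * boxes_per_center


scales = [(4, 4), (2, 2), (1, 1)]
per_scale = [num_anchor_boxes(h, w, 3) for h, w in scales]
print(per_scale, sum(per_scale))  # [48, 12, 3] 63
```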
Assume that the c feature maps here are the intermediate outputs obtained by the CNN forward
propagation based on the input image. Since there are hw different spatial positions on each fea-
ture map, the same spatial position can be thought of as having c units. According to the definition
of receptive field in Section 7.2, these c units at the same spatial position of the feature maps have
the same receptive field on the input image: they represent the input image information in the
same receptive field. Therefore, we can transform the $c$ units of the feature maps at the same
spatial position into the classes and offsets of the $a$ anchor boxes generated using this spatial position.
In essence, we use the information of the input image in a certain receptive field to predict the
classes and offsets of the anchor boxes that are close to that receptive field on the input image.
When the feature maps at different layers have varying-size receptive fields on the input image,
they can be used to detect objects of different sizes. For example, we can design a neural network
where units of feature maps that are closer to the output layer have wider receptive fields, so they
can detect larger objects from the input image.
In a nutshell, we can leverage layerwise representations of images at multiple levels by deep neu-
ral networks for multiscale object detection. We will show how this works through a concrete
example in Section 14.7.
14.5.3 Summary
• At multiple scales, we can generate anchor boxes with different sizes to detect objects with
different sizes.
• By defining the shape of feature maps, we can determine centers of uniformly sampled an-
chor boxes on any image.
• We use the information of the input image in a certain receptive field to predict the classes
and offsets of the anchor boxes that are close to that receptive field on the input image.
• We can leverage the layerwise representations of images that deep neural networks learn at
multiple levels for multiscale object detection.
Exercises
1. According to our discussions in Section 8.1, deep neural networks learn hierarchical features
with increasing levels of abstraction for images. In multiscale object detection, do feature
maps at different scales correspond to different levels of abstraction? Why or why not?
2. At the first scale (fmap_w=4, fmap_h=4) in the experiments in Section 14.5.1, generate uni-
formly distributed anchor boxes that may overlap.
3. Given a feature map variable with shape 1 × c × h × w, where c, h, and w are the number of
channels, height, and width of the feature maps, respectively. How can you transform this
variable into the classes and offsets of anchor boxes? What is the shape of the output?
14.6 The Object Detection Dataset
There is no small dataset such as MNIST and Fashion-MNIST in the field of object detection. In
order to quickly demonstrate object detection models, we collected and labeled a small dataset.
First, we took photos of free bananas from our office and generated 1000 banana images with
different rotations and sizes. Then we placed each banana image at a random position on some
background image. In the end, we labeled bounding boxes for those bananas on the images.
14.6.1 Downloading the Dataset
The banana detection dataset with all the image and csv label files can be downloaded directly
from the Internet.
%matplotlib inline
import os
import pandas as pd
import torch
import torchvision
from d2l import torch as d2l

#@save
d2l.DATA_HUB['banana-detection'] = (
    d2l.DATA_URL + 'banana-detection.zip',
    '5de26c8fce5ccdea9f91267273464dc968d20d72')
14.6.2 Reading the Dataset
We are going to read the banana detection dataset in the read_data_bananas function below. The
dataset includes a csv file for object class labels and ground-truth bounding box coordinates at the
upper-left and lower-right corners.
#@save
def read_data_bananas(is_train=True):
    """Read the banana detection dataset images and labels."""
    data_dir = d2l.download_extract('banana-detection')
    csv_fname = os.path.join(data_dir, 'bananas_train' if is_train
                             else 'bananas_val', 'label.csv')
    csv_data = pd.read_csv(csv_fname)
    csv_data = csv_data.set_index('img_name')
    images, targets = [], []
    for img_name, target in csv_data.iterrows():
        images.append(torchvision.io.read_image(
            os.path.join(data_dir, 'bananas_train' if is_train else
                         'bananas_val', 'images', f'{img_name}')))
        # Here `target` contains (class, upper-left x, upper-left y,
        # lower-right x, lower-right y), where all the images have the same
        # banana class (index 0)
        targets.append(list(target))
    return images, torch.tensor(targets).unsqueeze(1) / 256
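The final division by 256 rescales pixel coordinates to the range between 0 and 1, since the images are 256 pixels on each side. The sketch below mirrors that step for a single row; the row values and the helper name normalize_label are made up for illustration.

```python
def normalize_label(row, edge_size=256):
    """Mirror `torch.tensor(targets) / 256` for one row: every entry is divided."""
    return [value / edge_size for value in row]


row = [0, 104, 20, 143, 58]  # hypothetical (class, x1, y1, x2, y2) in pixels
label = normalize_label(row)
print(label)  # [0.0, 0.40625, 0.078125, 0.55859375, 0.2265625]
# Multiplying by the edge size recovers the pixel coordinates
print([coord * 256 for coord in label[1:]])  # [104.0, 20.0, 143.0, 58.0]
```

Note that the class entry is divided as well. This is harmless here because the only class index is 0; for a dataset with several classes, one would presumably divide only the four coordinate entries.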
By using the read_data_bananas function to read images and labels, the following BananasDataset
class will allow us to create a customized Dataset instance for loading the banana detection
dataset.
#@save
class BananasDataset(torch.utils.data.Dataset):
    """A customized dataset to load the banana detection dataset."""
    def __init__(self, is_train):
        self.features, self.labels = read_data_bananas(is_train)
        print('read ' + str(len(self.features)) + (' training examples' if
              is_train else ' validation examples'))

    def __getitem__(self, idx):
        return (self.features[idx].float(), self.labels[idx])

    def __len__(self):
        return len(self.features)
Finally, we define the load_data_bananas function to return two data iterator instances for both
the training and test sets. For the test dataset, there is no need to read it in random order.
#@save
def load_data_bananas(batch_size):
    """Load the banana detection dataset."""
    train_iter = torch.utils.data.DataLoader(BananasDataset(is_train=True),
                                             batch_size, shuffle=True)
    val_iter = torch.utils.data.DataLoader(BananasDataset(is_train=False),
                                           batch_size)
    return train_iter, val_iter
Letʼs read a minibatch and print the shapes of both images and labels in this minibatch. The shape
of the image minibatch, (batch size, number of channels, height, width), looks familiar: it is the
same as in our earlier image classification tasks. The shape of the label minibatch is (batch size,
m, 5), where m is the largest possible number of bounding boxes that any image has in the dataset.
Although computation in minibatches is more efficient, it requires that all the image examples
contain the same number of bounding boxes to form a minibatch via concatenation. In general,
images may have a varying number of bounding boxes; thus, images with fewer than m bound-
ing boxes will be padded with illegal bounding boxes until m is reached. Then the label of each
bounding box is represented by an array of length 5. The first element in the array is the class
of the object in the bounding box, where -1 indicates an illegal bounding box for padding. The
remaining four elements of the array are the (x, y)-coordinate values of the upper-left corner and
the lower-right corner of the bounding box (the range is between 0 and 1). For the banana dataset,
since there is only one bounding box on each image, we have m = 1.
batch_size, edge_size = 32, 256
train_iter, _ = load_data_bananas(batch_size)
batch = next(iter(train_iter))
batch[0].shape, batch[1].shape
read 1000 training examples
read 100 validation examples
(torch.Size([32, 3, 256, 256]), torch.Size([32, 1, 5]))
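The banana dataset never needs padding because m = 1, but the padding scheme described above is easy to sketch for the general case. In the plain-Python example below, the helper name pad_labels and the label values are hypothetical, and filling the coordinates of a padding box with -1 is an assumption: the text only specifies that its class is -1.

```python
def pad_labels(labels_per_image, pad_value=-1.0):
    """Pad each image's list of 5-element box labels to the same length m."""
    m = max(len(boxes) for boxes in labels_per_image)
    padded = []
    for boxes in labels_per_image:
        # Illegal boxes used as padding: class -1 (coordinates set to -1 too)
        filler = [[pad_value] * 5 for _ in range(m - len(boxes))]
        padded.append([list(box) for box in boxes] + filler)
    return padded


labels_per_image = [
    [[0, 0.1, 0.1, 0.5, 0.5], [1, 0.4, 0.4, 0.9, 0.9]],  # two objects
    [[0, 0.2, 0.3, 0.6, 0.7]],                           # one object
]
padded = pad_labels(labels_per_image)
print([len(boxes) for boxes in padded])  # [2, 2]
print(padded[1][1])  # [-1.0, -1.0, -1.0, -1.0, -1.0]
```

After padding, every image carries the same number of labels, so the per-image label lists can be stacked into a single array of shape (batch size, m, 5).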
14.6.3 Demonstration
Letʼs demonstrate ten images with their labeled ground-truth bounding boxes. We can see that
the rotations, sizes, and positions of bananas vary across all these images. Of course, this is just a
simple artificial dataset. In practice, real-world datasets are usually much more complicated.
imgs = (batch[0][:10].permute(0, 2, 3, 1)) / 255
axes = d2l.show_images(imgs, 2, 5, scale=2)
for ax, label in zip(axes, batch[1][:10]):
    d2l.show_bboxes(ax, [label[0][1:5] * edge_size], colors=['w'])
14.6.4 Summary
• The banana detection dataset we collected can be used to demonstrate object detection mod-
els.
• The data loading for object detection is similar to that for image classification. However, in
object detection the labels also contain information of ground-truth bounding boxes, which
is missing in image classification.
Exercises
1. Demonstrate other images with ground-truth bounding boxes in the banana detection
dataset. How do they differ with respect to bounding boxes and objects?
2. Say that we want to apply data augmentation, such as random cropping, to object detection.
How can it be different from that in image classification? Hint: what if a cropped image only
contains a small portion of an object?
Discussions

14.7 Single Shot Multibox Detection
In Sections 14.3–14.6, we introduced bounding boxes, anchor boxes, multiscale object detection,
and the dataset for object detection. Now we are ready to use such background knowledge
to design an object detection model: single shot multibox detection (SSD) (Liu et al., 2016). This
model is simple, fast, and widely used. Although this is only one of a vast number of object
detection models, some of the design principles and implementation details in this section are also
applicable to other models.
14.7.1 Model
Fig. 14.7.1 provides an overview of the design of single-shot multibox detection. This model
mainly consists of a base network followed by several multiscale feature map blocks. The base
network is for extracting features from the input image, so it can use a deep CNN. For example,
the original single-shot multibox detection paper adopts a VGG network truncated before the clas-
sification layer (Liu et al., 2016), while ResNet has also been commonly used. Through our design
we can make the base network output larger feature maps so as to generate more anchor boxes
for detecting smaller objects. Subsequently, each multiscale feature map block reduces (e.g., by
half) the height and width of the feature maps from the previous block, and enables each unit of
the feature maps to increase its receptive field on the input image.
Recall the design of multiscale object detection through layerwise representations of images by
deep neural networks in Section 14.5. Since multiscale feature maps closer to the top of Fig. 14.7.1
are smaller but have larger receptive fields, they are suitable for detecting fewer but larger objects.
In a nutshell, via its base network and several multiscale feature map blocks, single-shot multibox
detection generates a varying number of anchor boxes with different sizes, and detects varying-
size objects by predicting classes and offsets of these anchor boxes (thus the bounding boxes);
thus, this is a multiscale object detection model.
Fig. 14.7.1: As a multiscale object detection model, single-shot multibox detection mainly consists
of a base network followed by several multiscale feature map blocks.

In the following, we will describe the implementation details of different blocks in Fig. 14.7.1. To
begin with, we discuss how to implement the class and bounding box prediction.
Class Prediction Layer
Let the number of object classes be q. Then anchor boxes have q + 1 classes, where class 0 is
background. At some scale, suppose that the height and width of feature maps are h and w, re-
spectively. When a anchor boxes are generated with each spatial position of these feature maps as
their center, a total of hwa anchor boxes need to be classified. This often makes classification with
fully connected layers infeasible due to likely heavy parameterization costs. Recall how we used
channels of convolutional layers to predict classes in Section 8.3. Single-shot multibox detection
uses the same technique to reduce model complexity.
Specifically, the class prediction layer uses a convolutional layer without altering width or height of
feature maps. In this way, there can be a one-to-one correspondence between outputs and inputs
at the same spatial dimensions (width and height) of feature maps. More concretely, channels
of the output feature maps at any spatial position (x, y) represent class predictions for all the
anchor boxes centered on (x, y) of the input feature maps. To produce valid predictions, there
must be a(q + 1) output channels, where for the same spatial position the output channel with
index i(q+1)+j represents the prediction of the class j (0 ≤ j ≤ q) for the anchor box i (0 ≤ i < a).
Below we define such a class prediction layer, specifying a and q via arguments num_anchors and
num_classes, respectively. This layer uses a 3 × 3 convolutional layer with a padding of 1. The
width and height of the input and output of this convolutional layer remain unchanged.
%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

def cls_predictor(num_inputs, num_anchors, num_classes):
    return nn.Conv2d(num_inputs, num_anchors * (num_classes + 1),
                     kernel_size=3, padding=1)
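To make the channel layout concrete, the following sketch maps between an output channel index and the pair (anchor box i, class j) using the formula i(q + 1) + j from the text. The two helper functions are hypothetical names introduced only for illustration; the values a = 5 and q = 10 are example settings.

```python
def anchor_class_to_channel(i, j, num_classes):
    """Channel index that holds the score of class j for anchor box i."""
    return i * (num_classes + 1) + j


def channel_to_anchor_class(channel, num_classes):
    """Inverse mapping: recover (anchor index i, class index j)."""
    return divmod(channel, num_classes + 1)


a, q = 5, 10                     # example values: 5 anchors, 10 object classes
num_channels = a * (q + 1)       # one score per (anchor, class) pair
print(num_channels)
print(channel_to_anchor_class(0, q))    # background score of the first anchor
print(channel_to_anchor_class(23, q))   # anchor 2, class 1
print(anchor_class_to_channel(4, 10, q))  # last channel
```

Class index 0 is the background, so each anchor box occupies a contiguous run of q + 1 channels.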
Bounding Box Prediction Layer
The design of the bounding box prediction layer is similar to that of the class prediction layer. The
only difference lies in the number of outputs for each anchor box: here we need to predict four
offsets rather than q + 1 classes.
def bbox_predictor(num_inputs, num_anchors):
    return nn.Conv2d(num_inputs, num_anchors * 4, kernel_size=3, padding=1)

Concatenating Predictions for Multiple Scales
As we mentioned, single-shot multibox detection uses multiscale feature maps to generate anchor
boxes and predict their classes and offsets. At different scales, the shapes of feature maps or the
numbers of anchor boxes centered on the same unit may vary. Therefore, shapes of the prediction
outputs at different scales may vary.
In the following example, we construct feature maps at two different scales, Y1 and Y2, for the same
minibatch, where the height and width of Y2 are half of those of Y1. Letʼs take class prediction as an
example. Suppose that 5 and 3 anchor boxes are generated for every unit in Y1 and Y2, respectively.
Suppose further that the number of object classes is 10. For feature maps Y1 and Y2 the numbers
of channels in the class prediction outputs are 5 × (10 + 1) = 55 and 3 × (10 + 1) = 33, respectively,
where either output shape is (batch size, number of channels, height, width).
def forward(x, block):
    return block(x)

Y1 = forward(torch.zeros((2, 8, 20, 20)), cls_predictor(8, 5, 10))
Y2 = forward(torch.zeros((2, 16, 10, 10)), cls_predictor(16, 3, 10))
Y1.shape, Y2.shape
(torch.Size([2, 55, 20, 20]), torch.Size([2, 33, 10, 10]))
As we can see, except for the batch size dimension, the other three dimensions all have differ-
ent sizes. To concatenate these two prediction outputs for more efficient computation, we will
transform these tensors into a more consistent format.
Note that the channel dimension holds the predictions for anchor boxes with the same center. We
first move this dimension to the innermost. Since the batch size remains the same for different
scales, we can transform the prediction output into a two-dimensional tensor with shape (batch
size, height × width × number of channels). Then we can concatenate such outputs at different
scales along dimension 1.
def flatten_pred(pred):
    return torch.flatten(pred.permute(0, 2, 3, 1), start_dim=1)

def concat_preds(preds):
    return torch.cat([flatten_pred(p) for p in preds], dim=1)
In this way, even though Y1 and Y2 have different sizes in channels, heights, and widths, we can
still concatenate these two prediction outputs at two different scales for the same minibatch.
concat_preds([Y1, Y2]).shape
torch.Size([2, 25300])
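As a cross-check on the concatenated shape, the length of the second dimension can be computed by hand: each scale contributes height × width × number of channels entries per example. The helper name `flattened_length` below is hypothetical; the numbers are the ones used for Y1 and Y2 above.

```python
def flattened_length(height, width, num_anchors, num_classes):
    """Entries per example after flattening one scale's class predictions."""
    return height * width * num_anchors * (num_classes + 1)


len_y1 = flattened_length(20, 20, 5, 10)   # 20 x 20 map, 55 channels
len_y2 = flattened_length(10, 10, 3, 10)   # 10 x 10 map, 33 channels
total = len_y1 + len_y2
print(len_y1, len_y2, total)
```

The total, 25300, is the size of dimension 1 after concatenating the two flattened outputs.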

Downsampling Block
In order to detect objects at multiple scales, we define the following downsampling block
down_sample_blk that halves the height and width of input feature maps. In fact, this block applies
the design of VGG blocks in Section 8.2.1. More concretely, each downsampling block consists of
two 3 × 3 convolutional layers with padding of 1 followed by a 2 × 2 max-pooling layer with stride
of 2. As we know, 3 × 3 convolutional layers with padding of 1 do not change the shape of feature
maps. However, the subsequent 2 × 2 max-pooling reduces the height and width of input fea-
ture maps by half. For both input and output feature maps of this downsampling block, because
1 × 2 + (3 − 1) + (3 − 1) = 6, each unit in the output has a 6 × 6 receptive field on the input.
Therefore, the downsampling block enlarges the receptive field of each unit in its output feature
maps.
def down_sample_blk(in_channels, out_channels):
    blk = []
    for _ in range(2):
        blk.append(nn.Conv2d(in_channels, out_channels,
                             kernel_size=3, padding=1))
        blk.append(nn.BatchNorm2d(out_channels))
        blk.append(nn.ReLU())
        in_channels = out_channels
    blk.append(nn.MaxPool2d(2))
    # Wrap the layers so that the block can be called like a single module
    return nn.Sequential(*blk)
In the following example, our constructed downsampling block changes the number of input
channels and halves the height and width of the input feature maps.
forward(torch.zeros((2, 3, 20, 20)), down_sample_blk(3, 10)).shape
torch.Size([2, 10, 10, 10])
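The receptive-field arithmetic for the downsampling block can be reproduced with a short sketch. The function `receptive_field` is a hypothetical helper that walks the layers from output to input and applies the standard recurrence r_in = r_out × stride + (kernel − stride); batch normalization and ReLU are omitted because they do not change the receptive field.

```python
def receptive_field(layers):
    """Receptive field of one output unit on the input.

    layers: list of (kernel_size, stride) pairs ordered from input to output.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = rf * stride + (kernel - stride)
    return rf


# Two 3x3 convolutions (stride 1) followed by 2x2 max-pooling (stride 2)
down_sample_layers = [(3, 1), (3, 1), (2, 2)]
rf_one_block = receptive_field(down_sample_layers)
# Stacking three such blocks, as a small base network might do
rf_three_blocks = receptive_field(down_sample_layers * 3)
print(rf_one_block, rf_three_blocks)
```

For a single block this gives 6, matching the 6 × 6 receptive field stated above; stacking blocks enlarges it further.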
Base Network Block
The base network block is used to extract features from input images. For simplicity, we construct
a small base network consisting of three downsampling blocks that double the number of channels
at each block. Given a 256 × 256 input image, this base network block outputs 32 × 32 feature
maps (256/2³ = 32).
def base_net():
    blk = []
    num_filters = [3, 16, 32, 64]
    for i in range(len(num_filters) - 1):
        blk.append(down_sample_blk(num_filters[i], num_filters[i+1]))
    # Chain the three downsampling blocks into one module
    return nn.Sequential(*blk)

forward(torch.zeros((2, 3, 256, 256)), base_net()).shape
torch.Size([2, 64, 32, 32])

The Complete Model
The complete single shot multibox detection model consists of five blocks. The feature maps pro-
duced by each block are used for both (i) generating anchor boxes and (ii) predicting classes and
offsets of these anchor boxes. Among these five blocks, the first one is the base network block,
the second to the fourth are downsampling blocks, and the last block uses global max-pooling to
reduce both the height and width to 1. Technically, the second to the fifth blocks are all those
multiscale feature map blocks in Fig. 14.7.1.
def get_blk(i):
    if i == 0:
        blk = base_net()
    elif i == 1:
        blk = down_sample_blk(64, 128)
    elif i == 4:
        blk = nn.AdaptiveMaxPool2d((1,1))
    else:
        blk = down_sample_blk(128, 128)
    return blk
Now we define the forward propagation for each block. Unlike in image classification
tasks, outputs here include (i) CNN feature maps Y, (ii) anchor boxes generated using Y at the
current scale, and (iii) classes and offsets predicted (based on Y) for these anchor boxes.
def blk_forward(X, blk, size, ratio, cls_predictor, bbox_predictor):
    Y = blk(X)
    anchors = d2l.multibox_prior(Y, sizes=size, ratios=ratio)
    cls_preds = cls_predictor(Y)
    bbox_preds = bbox_predictor(Y)
    return (Y, anchors, cls_preds, bbox_preds)
Recall that in Fig. 14.7.1 a multiscale feature map block that is closer to the top is for detecting
larger objects; thus, it needs to generate larger anchor boxes. In the above forward propagation,
at each multiscale feature map block we pass in a list of two scale values via the sizes argument
of the invoked multibox_prior function (described in Section 14.4). In the following, the interval
between 0.2 and 1.05 is split evenly into five sections to determine the smaller scale values at the
five blocks: 0.2, 0.37, 0.54, 0.71, and 0.88. Then their larger scale values are given by
√(0.2 × 0.37) = 0.272, √(0.37 × 0.54) = 0.447, and so on.
sizes = [[0.2, 0.272], [0.37, 0.447], [0.54, 0.619], [0.71, 0.79],
         [0.88, 0.961]]
ratios = [[1, 2, 0.5]] * 5
num_anchors = len(sizes[0]) + len(ratios[0]) - 1
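The scale values above can be re-derived with a few lines of plain Python. This is only a sketch of the arithmetic described in the text: the variable names are hypothetical, and rounding to three decimals is an assumption made to match the listed values.

```python
import math

num_blocks = 5
low, high = 0.2, 1.05
step = (high - low) / num_blocks
# Six evenly spaced edges: 0.2, 0.37, ..., 1.05
edges = [low + step * i for i in range(num_blocks + 1)]

# Smaller scale: left edge; larger scale: geometric mean of adjacent edges
sizes_sketch = [
    [round(edges[i], 3), round(math.sqrt(edges[i] * edges[i + 1]), 3)]
    for i in range(num_blocks)
]
ratios_sketch = [[1, 2, 0.5]] * num_blocks
# Only combinations containing the first size or the first ratio are used
anchors_per_unit = len(sizes_sketch[0]) + len(ratios_sketch[0]) - 1
print(sizes_sketch)
print(anchors_per_unit)
```

With two scales and three aspect ratios per block, each spatial unit of a feature map is the center of 2 + 3 − 1 = 4 anchor boxes.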
Now we can define the complete model TinySSD as follows.
class TinySSD(nn.Module):
    def __init__(self, num_classes, **kwargs):
        super(TinySSD, self).__init__(**kwargs)
        self.num_classes = num_classes
        idx_to_in_channels = [64, 128, 128, 128, 128]
        for i in range(5):
            # Equivalent to the assignment statement `self.blk_i = get_blk(i)`
            setattr(self, f'blk_{i}', get_blk(i))
            setattr(self, f'cls_{i}', cls_predictor(idx_to_in_channels[i],
                                                    num_anchors, num_classes))
            setattr(self, f'bbox_{i}', bbox_predictor(idx_to_in_channels[i],
                                                      num_anchors))

    def forward(self, X):
        anchors, cls_preds, bbox_preds = [None] * 5, [None] * 5, [None] * 5
        for i in range(5):
            # Here `getattr(self, f'blk_{i}')` accesses `self.blk_i`
            X, anchors[i], cls_preds[i], bbox_preds[i] = blk_forward(
                X, getattr(self, f'blk_{i}'), sizes[i], ratios[i],
                getattr(self, f'cls_{i}'), getattr(self, f'bbox_{i}'))
        anchors = torch.cat(anchors, dim=1)
        cls_preds = concat_preds(cls_preds)
        cls_preds = cls_preds.reshape(
            cls_preds.shape[0], -1, self.num_classes + 1)
        bbox_preds = concat_preds(bbox_preds)
        return anchors, cls_preds, bbox_preds
We create a model instance and use it to perform forward propagation on a minibatch of 256 × 256
images X.
As shown earlier in this section, the first block outputs 32 × 32 feature maps. Recall that the second
to fourth downsampling blocks halve the height and width and the fifth block uses global pooling.
Since 4 anchor boxes are generated for each unit along spatial dimensions of feature maps, at all
the five scales a total of (32² + 16² + 8² + 4² + 1) × 4 = 5444 anchor boxes are generated for each
image.
net = TinySSD(num_classes=1)
X = torch.zeros((32, 3, 256, 256))
anchors, cls_preds, bbox_preds = net(X)
print('output anchors:', anchors.shape)

print('output class preds:', cls_preds.shape)
print('output bbox preds:', bbox_preds.shape)
output anchors: torch.Size([1, 5444, 4])

output class preds: torch.Size([32, 5444, 2])
output bbox preds: torch.Size([32, 21776])
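The output shapes above follow from simple counting, which the sketch below reproduces in plain Python. The variable names are hypothetical; the block structure (three halvings in the base network, three further downsampling blocks, and a final global pooling to 1 × 1) is taken from the model definition in this section.

```python
input_size = 256
anchors_per_unit = 4
num_classes = 1

side = input_size // 2 ** 3      # base network: three halvings -> 32
sides = [side]
for _ in range(3):               # three downsampling blocks: 16, 8, 4
    side //= 2
    sides.append(side)
sides.append(1)                  # global max-pooling -> 1 x 1

total_anchors = sum(s * s for s in sides) * anchors_per_unit
cls_shape = (total_anchors, num_classes + 1)   # per example
bbox_len = total_anchors * 4                   # four offsets per anchor
print(sides, total_anchors, cls_shape, bbox_len)
```

The counts 5444, (5444, 2), and 21776 agree with the shapes printed by the model.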

14.7.2 Training
Now we will explain how to train the single shot multibox detection model for object detection.
Reading the Dataset and Initializing the Model
To begin with, letʼs read the banana detection dataset described in Section 14.6.
batch_size = 32
train_iter, _ = d2l.load_data_bananas(batch_size)
read 1000 training examples

read 100 validation examples
There is only one class in the banana detection dataset. After defining the model, we need to
initialize its parameters and define the optimization algorithm.
device, net = d2l.try_gpu(), TinySSD(num_classes=1)

trainer = torch.optim.SGD(net.parameters(), lr=0.2, weight_decay=5e-4)
Defining Loss and Evaluation Functions
Object detection has two types of losses. The first loss concerns classes of anchor boxes: its com-
putation can simply reuse the cross-entropy loss function that we used for image classification.
The second loss concerns offsets of positive (non-background) anchor boxes: this is a regression
problem. For this regression problem, however, here we do not use the squared loss described in
Section 3.1.3. Instead, we use the ℓ1 norm loss, the absolute value of the difference between the
prediction and the ground-truth. The mask variable bbox_masks filters out negative anchor boxes
and illegal (padded) anchor boxes in the loss calculation. In the end, we sum up the anchor box
class loss and the anchor box offset loss to obtain the loss function for the model.
cls_loss = nn.CrossEntropyLoss(reduction='none')
bbox_loss = nn.L1Loss(reduction='none')

def calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels, bbox_masks):
    batch_size, num_classes = cls_preds.shape[0], cls_preds.shape[2]
    cls = cls_loss(cls_preds.reshape(-1, num_classes),
                   cls_labels.reshape(-1)).reshape(batch_size, -1).mean(dim=1)
    bbox = bbox_loss(bbox_preds * bbox_masks,
                     bbox_labels * bbox_masks).mean(dim=1)
    return cls + bbox
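To see what the mask does in the offset loss, consider a minimal sketch in plain Python with made-up numbers: one positive anchor box followed by one negative anchor box, each with four offsets. The function `masked_l1` is a hypothetical stand-in for the masked ℓ1 computation and, like the mean over dimension 1 above, it averages over all entries including the masked ones.

```python
def masked_l1(preds, labels, masks):
    """Mean absolute error after zeroing out masked entries."""
    terms = [abs(p * m - y * m) for p, y, m in zip(preds, labels, masks)]
    return sum(terms) / len(terms)


# Made-up offsets: the first four belong to a positive anchor box,
# the last four to a negative (background) anchor box.
bbox_preds_demo = [0.5, -0.2, 0.1, 0.3, 9.0, 9.0, 9.0, 9.0]
bbox_labels_demo = [0.4, 0.0, 0.1, 0.5, 0.0, 0.0, 0.0, 0.0]
bbox_masks_demo = [1.0] * 4 + [0.0] * 4

with_mask = masked_l1(bbox_preds_demo, bbox_labels_demo, bbox_masks_demo)
without_mask = masked_l1(bbox_preds_demo, bbox_labels_demo, [1.0] * 8)
print(with_mask, without_mask)
```

Without the mask, the arbitrary offsets predicted for the background anchor box would dominate the loss; with the mask they contribute nothing.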
We can use accuracy to evaluate the classification results. Because we use the ℓ1 norm loss for the
offsets, we use the mean absolute error to evaluate the predicted bounding boxes. These prediction
results are obtained from the generated anchor boxes and the predicted offsets for them.
def cls_eval(cls_preds, cls_labels):
    # Because the class prediction results are on the final dimension,
    # `argmax` needs to specify this dimension
    return float((cls_preds.argmax(dim=-1).type(
        cls_labels.dtype) == cls_labels).sum())

def bbox_eval(bbox_preds, bbox_labels, bbox_masks):
    return float((torch.abs((bbox_labels - bbox_preds) * bbox_masks)).sum())
Training the Model
When training the model, we need to generate multiscale anchor boxes (anchors) and predict
their classes (cls_preds) and offsets (bbox_preds) in the forward propagation. Then we label the
classes (cls_labels) and offsets (bbox_labels) of such generated anchor boxes based on the label
information Y. Finally, we calculate the loss function using the predicted and labeled values of the
classes and offsets. For concise implementations, evaluation of the test dataset is omitted here.
num_epochs, timer = 20, d2l.Timer()
animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['class error', 'bbox mae'])
net = net.to(device)
for epoch in range(num_epochs):
    # Sum of training accuracy, no. of examples in sum of training accuracy,
    # Sum of absolute error, no. of examples in sum of absolute error
    metric = d2l.Accumulator(4)
    net.train()
    for features, target in train_iter:
        timer.start()
        trainer.zero_grad()
        X, Y = features.to(device), target.to(device)
        # Generate multiscale anchor boxes and predict their classes and
        # offsets
        anchors, cls_preds, bbox_preds = net(X)
        # Label the classes and offsets of these anchor boxes
        bbox_labels, bbox_masks, cls_labels = d2l.multibox_target(anchors, Y)
        # Calculate the loss function using the predicted and labeled values
        # of the classes and offsets
        l = calc_loss(cls_preds, cls_labels, bbox_preds, bbox_labels,
                      bbox_masks)
        l.mean().backward()
        trainer.step()
        metric.add(cls_eval(cls_preds, cls_labels), cls_labels.numel(),
                   bbox_eval(bbox_preds, bbox_labels, bbox_masks),
                   bbox_labels.numel())
    cls_err, bbox_mae = 1 - metric[0] / metric[1], metric[2] / metric[3]
    animator.add(epoch + 1, (cls_err, bbox_mae))
print(f'class err {cls_err:.2e}, bbox mae {bbox_mae:.2e}')
print(f'{len(train_iter.dataset) / timer.stop():.1f} examples/sec on '
      f'{str(device)}')
class err 3.37e-03, bbox mae 3.39e-03

6205.2 examples/sec on cuda:0

14.7.3 Prediction
During prediction, the goal is to detect all the objects of interest on the image. Below we read and
resize a test image, converting it to a four-dimensional tensor that is required by convolutional
layers.
X = torchvision.io.read_image('../img/banana.jpg').unsqueeze(0).float()
img = X.squeeze(0).permute(1, 2, 0).long()
Using the multibox_detection function below, the predicted bounding boxes are obtained from
the anchor boxes and their predicted offsets. Then non-maximum suppression is used to remove
similar predicted bounding boxes.
def predict(X):
    net.eval()
    anchors, cls_preds, bbox_preds = net(X.to(device))
    cls_probs = F.softmax(cls_preds, dim=2).permute(0, 2, 1)
    output = d2l.multibox_detection(cls_probs, bbox_preds, anchors)
    idx = [i for i, row in enumerate(output[0]) if row[0] != -1]
    return output[0, idx]

output = predict(X)
Finally, we display all the predicted bounding boxes with confidence 0.9 or above as output.
def display(img, output, threshold):
    d2l.set_figsize((5, 5))
    fig = d2l.plt.imshow(img)
    for row in output:
        score = float(row[1])
        if score < threshold:
            continue
        # Scale the normalized box coordinates back to pixel units
        h, w = img.shape[:2]
        bbox = [row[2:6] * torch.tensor((w, h, w, h), device=row.device)]
        d2l.show_bboxes(fig.axes, bbox, '%.2f' % score, 'w')

display(img, output.cpu(), threshold=0.9)

14.7.4 Summary
• Single shot multibox detection is a multiscale object detection model. Via its base network
and several multiscale feature map blocks, single-shot multibox detection generates a vary-
ing number of anchor boxes with different sizes, and detects varying-size objects by predict-
ing classes and offsets of these anchor boxes (thus the bounding boxes).
• When training the single-shot multibox detection model, the loss function is calculated
based on the predicted and labeled values of the anchor box classes and offsets.
Exercises
1. Can you improve the single-shot multibox detection by improving the loss function? For ex-
ample, replace ℓ1 norm loss with smooth ℓ1 norm loss for the predicted offsets. This loss
function uses a square function around zero for smoothness, which is controlled by the hy-
perparameter σ:
$$f(x) = \begin{cases} (\sigma x)^2/2, & \text{if } |x| < 1/\sigma^2 \\ |x| - 0.5/\sigma^2, & \text{otherwise} \end{cases} \qquad (14.7.1)$$
When σ is very large, this loss is similar to the ℓ1 norm loss. When its value is smaller, the loss
function is smoother.
def smooth_l1(data, scalar):
    out = []
    for i in data:
        if abs(i) < 1 / (scalar ** 2):
            out.append(((scalar * i) ** 2) / 2)
        else:
            out.append(abs(i) - 0.5 / (scalar ** 2))
    return torch.tensor(out)

sigmas = [10, 1, 0.5]
lines = ['-', '--', '-.']
x = torch.arange(-2, 2, 0.1)
d2l.set_figsize()

for l, s in zip(lines, sigmas):
    y = smooth_l1(x, scalar=s)
    d2l.plt.plot(x, y, l, label='sigma=%.1f' % s)
d2l.plt.legend();
Besides, in the experiment we used cross-entropy loss for class prediction: denoting by p_j the
predicted probability for the ground-truth class j, the cross-entropy loss is −log p_j. We can also
use the focal loss (Lin et al., 2017): given hyperparameters γ > 0 and α > 0, this loss is defined as:
$$-\alpha (1 - p_j)^{\gamma} \log p_j. \qquad (14.7.2)$$
As we can see, increasing γ can effectively reduce the relative loss for well-classified examples
(e.g., p_j > 0.5) so the training can focus more on those difficult examples that are misclassified.
def focal_loss(gamma, x):
    return -(1 - x) ** gamma * torch.log(x)

x = torch.arange(0.01, 1, 0.01)
for l, gamma in zip(lines, [0, 1, 5]):
    y = d2l.plt.plot(x, focal_loss(gamma, x), l, label='gamma=%.1f' % gamma)
d2l.plt.legend();

2. Due to space limitations, we have omitted some implementation details of the single shot
multibox detection model in this section. Can you further improve the model in the follow-
ing aspects:
1. When an object is much smaller compared with the image, the model could resize the
input image bigger.
2. There are typically a vast number of negative anchor boxes. To make the class distri-
bution more balanced, we could downsample negative anchor boxes.
3. In the loss function, assign different weight hyperparameters to the class loss and the
offset loss.
4. Use other methods to evaluate the object detection model, such as those in the single
shot multibox detection paper (Liu et al., 2016).
Discussions
14.8 Region-based CNNs (R-CNNs)
Besides single shot multibox detection described in Section 14.7, region-based CNNs or regions
with CNN features (R-CNNs) are also among many pioneering approaches of applying deep learn-
ing to object detection (Girshick et al., 2014). In this section, we will introduce the R-CNN and its
series of improvements: the fast R-CNN (Girshick, 2015), the faster R-CNN (Ren et al., 2015), and
the mask R-CNN (He et al., 2017). Due to limited space, we will only focus on the design of these
models.
14.8.1 R-CNNs
The R-CNN first extracts many (e.g., 2000) region proposals from the input image (e.g., anchor boxes
can also be considered as region proposals), labeling their classes and bounding boxes (e.g.,
offsets) (Girshick et al., 2014).
Then a CNN is used to perform forward propagation on each region proposal to extract its features.
Next, features of each region proposal are used for predicting the class and bounding box of this
region proposal.

Fig. 14.8.1: The R-CNN model.
Fig. 14.8.1 shows the R-CNN model. More concretely, the R-CNN consists of the following four
steps:
1. Perform selective search to extract multiple high-quality region proposals on the input image
(Uijlings et al., 2013). These proposed regions are usually selected at multiple scales with
different shapes and sizes. Each region proposal will be labeled with a class and a ground-
truth bounding box.
2. Choose a pretrained CNN and truncate it before the output layer. Resize each region pro-
posal to the input size required by the network, and output the extracted features for the
region proposal through forward propagation.
3. Take the extracted features and labeled class of each region proposal as an example. Train
multiple support vector machines to classify objects, where each support vector machine
individually determines whether the example contains a specific class.
4. Take the extracted features and labeled bounding box of each region proposal as an example.
Train a linear regression model to predict the ground-truth bounding box.
Although the R-CNN model uses pretrained CNNs to effectively extract image features, it is slow.
Imagine that we select thousands of region proposals from a single input image: this requires
thousands of CNN forward propagations to perform object detection. This massive computing
load makes it infeasible to widely use R-CNNs in real-world applications.
14.8.2 Fast R-CNN
The main performance bottleneck of an R-CNN lies in the independent CNN forward propagation
for each region proposal, without sharing computation. Since these regions usually have overlaps,
independent feature extractions lead to much repeated computation. One of the major improve-
ments of the fast R-CNN from the R-CNN is that the CNN forward propagation is only performed
on the entire image (Girshick, 2015).

Fig. 14.8.2: The fast R-CNN model.
Fig. 14.8.2 describes the fast R-CNN model. Its major computations are as follows:
1. Compared with the R-CNN, in the fast R-CNN the input of the CNN for feature extraction is
the entire image, rather than individual region proposals. Moreover, this CNN is trainable.
Given an input image, let the shape of the CNN output be 1 × c × h1 × w1 .
2. Suppose that selective search generates n region proposals. These region proposals (of dif-
ferent shapes) mark regions of interest (of different shapes) on the CNN output. Then these
regions of interest further extract features of the same shape (say height h2 and width w2 are
specified) in order to be easily concatenated. To achieve this, the fast R-CNN introduces the
region of interest (RoI) pooling layer: the CNN output and region proposals are input into this
layer, outputting concatenated features of shape n × c × h2 × w2 that are further extracted
for all the region proposals.
3. Using a fully connected layer, transform the concatenated features into an output of shape
n × d, where d depends on the model design.
4. Predict the class and bounding box for each of the n region proposals. More concretely, in
class and bounding box prediction, transform the fully connected layer output into an output
of shape n × q (q is the number of classes) and an output of shape n × 4, respectively. The
class prediction uses softmax regression.
The region of interest pooling layer proposed in the fast R-CNN is different from the pooling layer
introduced in Section 7.5. In the pooling layer, we indirectly control the output shape by specifying
sizes of the pooling window, padding, and stride. In contrast, we can directly specify the output
shape in the region of interest pooling layer.
For example, letʼs specify the output height and width for each region as h2 and w2 , respectively.
For any region of interest window of shape h × w, this window is divided into a h2 × w2 grid of
subwindows, where the shape of each subwindow is approximately (h/h2 ) × (w/w2 ). In practice,
the height and width of any subwindow shall be rounded up, and the largest element shall be used
as output of the subwindow. Therefore, the region of interest pooling layer can extract features of
the same shape even when regions of interest have different shapes.
As an illustrative example, in Fig. 14.8.3, the upper-left 3 × 3 region of interest is selected on a 4 × 4
input. For this region of interest, we use a 2 × 2 region of interest pooling layer to obtain a 2 × 2

output. Note that the four divided subwindows contain elements 0, 1, 4, and 5 (5 is the
maximum); 2 and 6 (6 is the maximum); 8 and 9 (9 is the maximum); and 10, respectively.
Fig. 14.8.3: A 2 × 2 region of interest pooling layer.
Below we demonstrate the computation of the region of interest pooling layer. Suppose that the
height and width of the CNN-extracted features X are both 4, and there is only a single channel.
import torch
import torchvision
X = torch.arange(16.).reshape(1, 1, 4, 4)
X
tensor([[[[ 0., 1., 2., 3.],

[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.],
[12., 13., 14., 15.]]]])
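Letʼs further suppose that the height and width of the input image are both 40 pixels and that
selective search generates two region proposals on this image. Each region proposal is expressed
as five elements: the index of the image in the minibatch that the proposal belongs to (the first
column expected by torchvision.ops.roi_pool; here it is always 0 because the minibatch holds a
single image), followed by the (x, y)-coordinates of its upper-left and lower-right corners.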
rois = torch.Tensor([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]])
Because the height and width of X are 1/10 of the height and width of the input image, the coordi-
nates of the two region proposals are multiplied by 0.1 according to the specified spatial_scale
argument. Then the two regions of interest are marked on X as X[:, :, 0:3, 0:3] and X[:, :,
1:4, 0:4], respectively. Finally in the 2 × 2 region of interest pooling, each region of interest is
divided into a grid of sub-windows to further extract features of the same shape 2 × 2.
torchvision.ops.roi_pool(X, rois, output_size=(2, 2), spatial_scale=0.1)
tensor([[[[ 5., 6.],

[ 9., 10.]]],
[[[ 9., 11.],

[13., 15.]]]])
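The pooling described above can also be traced by hand. The following plain-Python sketch follows the description in the text (subwindow boundaries rounded up, maximum taken within each subwindow); it is not a reimplementation of the torchvision operator, whose exact binning may differ. The function `roi_max_pool` and its argument names are hypothetical, and the sketch assumes each region is at least as large as the requested output.

```python
import math


def roi_max_pool(feature, top, left, bottom, right, out_h, out_w):
    """Max-pool the region rows top..bottom, cols left..right (inclusive)
    of a 2D list into an out_h x out_w grid."""
    h = bottom - top + 1
    w = right - left + 1
    # Subwindow boundaries, rounded up as described in the text
    row_edges = [top + math.ceil(i * h / out_h) for i in range(out_h + 1)]
    col_edges = [left + math.ceil(j * w / out_w) for j in range(out_w + 1)]
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [feature[r][c]
                      for r in range(row_edges[i], row_edges[i + 1])
                      for c in range(col_edges[j], col_edges[j + 1])]
            row.append(max(window))
        pooled.append(row)
    return pooled


# The same 4 x 4 single-channel feature map as X above
feature_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]
first = roi_max_pool(feature_map, 0, 0, 2, 2, 2, 2)    # rows 0:3, cols 0:3
second = roi_max_pool(feature_map, 1, 0, 3, 3, 2, 2)   # rows 1:4, cols 0:4
print(first)
print(second)
```

For these two regions the hand computation yields [[5, 6], [9, 10]] and [[9, 11], [13, 15]], in agreement with the output above.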

14.8.3 Faster R-CNN
To be more accurate in object detection, the fast R-CNN model usually has to generate a lot of
region proposals in selective search. To reduce region proposals without loss of accuracy, the
faster R-CNN proposes to replace selective search with a region proposal network (Ren et al., 2015).
Fig. 14.8.4: The faster R-CNN model.
Fig. 14.8.4 shows the faster R-CNN model. Compared with the fast R-CNN, the faster R-CNN only
changes the region proposal method from selective search to a region proposal network. The rest
of the model remains unchanged. The region proposal network works in the following steps:
1. Use a 3×3 convolutional layer with padding of 1 to transform the CNN output to a new output
with c channels. In this way, each unit along the spatial dimensions of the CNN-extracted
feature maps gets a new feature vector of length c.
2. Centered on each pixel of the feature maps, generate multiple anchor boxes of different
scales and aspect ratios and label them.
3. Using the length-c feature vector at the center of each anchor box, predict the binary class
(background or objects) and bounding box for this anchor box.
4. Consider those predicted bounding boxes whose predicted classes are objects. Remove over-
lapped results using non-maximum suppression. The remaining predicted bounding boxes
for objects are the region proposals required by the region of interest pooling layer.
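The last step relies on non-maximum suppression, which can be sketched in plain Python as follows. This is a simplified greedy version for illustration only: the function names `iou` and `nms`, the sample boxes and scores, and the IoU threshold of 0.5 are all assumptions of the sketch rather than details of the faster R-CNN implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        # Keep a box only if it does not overlap too much with a kept one
        if all(iou(boxes[k], boxes[j]) <= iou_threshold for j in keep):
            keep.append(k)
    return keep


# Made-up predicted boxes: the first two overlap heavily
boxes = [(0.10, 0.10, 0.50, 0.50),
         (0.12, 0.12, 0.52, 0.52),
         (0.60, 0.60, 0.90, 0.90)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
print(kept)
```

Here the second box is suppressed because it overlaps heavily with the higher-scoring first box, while the third box is kept as a separate proposal.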
It is worth noting that, as part of the faster R-CNN model, the region proposal network is jointly
trained with the rest of the model. In other words, the objective function of the faster R-CNN
includes not only the class and bounding box prediction in object detection, but also the binary
class and bounding box prediction of anchor boxes in the region proposal network. As a result of
the end-to-end training, the region proposal network learns how to generate high-quality region
proposals, so as to stay accurate in object detection with a reduced number of region proposals
that are learned from data.

14.8.4 Mask R-CNN
In the training dataset, if pixel-level positions of objects are also labeled on images, the mask R-CNN
can effectively leverage such detailed labels to further improve the accuracy of object detection
(He et al., 2017).
Fig. 14.8.5: The mask R-CNN model.
As shown in Fig. 14.8.5, the mask R-CNN is modified based on the faster R-CNN. Specifically, the
mask R-CNN replaces the region of interest pooling layer with the region of interest (RoI) alignment
layer. This region of interest alignment layer uses bilinear interpolation to preserve the spatial
information on the feature maps, which is more suitable for pixel-level prediction. The output
of this layer contains feature maps of the same shape for all the regions of interest. They are
used to predict not only the class and bounding box for each region of interest, but also the pixel-
level position of the object through an additional fully convolutional network. More details on
using a fully convolutional network to predict pixel-level semantics of an image will be provided
in subsequent sections of this chapter.
14.8.5 Summary
• The R-CNN extracts many region proposals from the input image, uses a CNN to perform
forward propagation on each region proposal to extract its features, then uses these features
to predict the class and bounding box of this region proposal.
• One of the major improvements of the fast R-CNN from the R-CNN is that the CNN forward
propagation is only performed on the entire image. It also introduces the region of inter-
est pooling layer, so that features of the same shape can be further extracted for regions of
interest that have different shapes.
• The faster R-CNN replaces the selective search used in the fast R-CNN with a jointly trained
region proposal network, so that the former can stay accurate in object detection with a
reduced number of region proposals.
• Based on the faster R-CNN, the mask R-CNN additionally introduces a fully convolutional
network, so as to leverage pixel-level labels to further improve the accuracy of object detection.
Exercises
1. Can we frame object detection as a single regression problem, such as predicting bounding
boxes and class probabilities? You may refer to the design of the YOLO model (Redmon et
al., 2016).
2. Compare single shot multibox detection with the methods introduced in this section. What
are their major differences? You may refer to Figure 2 of (Zhao et al., 2019).
Discussions
14.9 Semantic Segmentation and the Dataset
When discussing object detection tasks in Section 14.3–Section 14.8, rectangular bounding boxes
were used to label and predict objects in images. This section discusses the problem of semantic
segmentation, which focuses on how to divide an image into regions belonging to different semantic
classes. Unlike object detection, semantic segmentation recognizes and understands
what is in an image at the pixel level: its labeling and prediction of semantic regions are at the pixel level.
Fig. 14.9.1 shows the labels of the dog, cat, and background of the image in semantic segmentation.
Compared with the bounding boxes used in object detection, the pixel-level borders labeled in semantic
segmentation are obviously more fine-grained.
Fig. 14.9.1: Labels of the dog, cat, and background of the image in semantic segmentation.
14.9.1 Image Segmentation and Instance Segmentation
There are also two important tasks in the field of computer vision that are similar to semantic seg-
mentation, namely image segmentation and instance segmentation. We will briefly distinguish
them from semantic segmentation as follows.
• Image segmentation divides an image into several constituent regions. The methods for this
type of problem usually make use of the correlation between pixels in the image. It does not
need label information about image pixels during training, and it cannot guarantee that the
segmented regions will have the semantics that we hope to obtain during prediction. Taking
the image in Fig. 14.9.1 as input, image segmentation may divide the dog into two regions:
one covers the mouth and eyes which are mainly black, and the other covers the rest of the
body which is mainly yellow.
• Instance segmentation is also called simultaneous detection and segmentation. It studies how
to recognize the pixel-level regions of each object instance in an image. Different from se-
mantic segmentation, instance segmentation needs to distinguish not only semantics, but
also different object instances. For example, if there are two dogs in the image, instance
segmentation needs to distinguish which of the two dogs a pixel belongs to.
14.9.2 The Pascal VOC2012 Semantic Segmentation Dataset
One of the most important semantic segmentation datasets is Pascal VOC2012
(http://host.robots.ox.ac.uk/pascal/VOC/voc2012/). In the following, we will take a look at this dataset.
%matplotlib inline
import os
import torch
import torchvision
from d2l import torch as d2l
The tar file of the dataset is about 2 GB, so it may take a while to download the file. The extracted
dataset is located at ../data/VOCdevkit/VOC2012.
#@save
d2l.DATA_HUB['voc2012'] = (d2l.DATA_URL + 'VOCtrainval_11-May-2012.tar',
'4e443f8a2eca6b1dac8a6c57641b67dd40621a49')
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
After entering the path ../data/VOCdevkit/VOC2012, we can see the different components of the
dataset. The ImageSets/Segmentation path contains text files that specify training and test sam-
ples, while the JPEGImages and SegmentationClass paths store the input image and label for each
example, respectively. The label here is also in the image format, with the same size as its labeled
input image. Besides, pixels with the same color in any label image belong to the same semantic
class. The following defines the read_voc_images function to read all the input images and labels
into the memory.
#@save
def read_voc_images(voc_dir, is_train=True):
    """Read all VOC feature and label images."""
    # Dataset home page: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
    txt_fname = os.path.join(voc_dir, 'ImageSets', 'Segmentation',
                             'train.txt' if is_train else 'val.txt')
    # Decode label images as three-channel RGB so colors can be mapped to classes
    mode = torchvision.io.image.ImageReadMode.RGB
    with open(txt_fname, 'r') as f:
        images = f.read().split()
    features, labels = [], []
    for i, fname in enumerate(images):
        features.append(torchvision.io.read_image(os.path.join(
            voc_dir, 'JPEGImages', f'{fname}.jpg')))
        labels.append(torchvision.io.read_image(os.path.join(
            voc_dir, 'SegmentationClass', f'{fname}.png'), mode))
    return features, labels

train_features, train_labels = read_voc_images(voc_dir, True)
We draw the first five input images and their labels. In the label images, white and black represent
borders and background, respectively, while the other colors correspond to different classes.
n = 5
imgs = train_features[:n] + train_labels[:n]
imgs = [img.permute(1,2,0) for img in imgs]
d2l.show_images(imgs, 2, n);
Next, we enumerate the RGB color values and class names for all the labels in this dataset.
#@save
VOC_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0],
[0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128],
[64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0],
[64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128],
[0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
[0, 64, 128]]
#@save
VOC_CLASSES = ['background', 'aeroplane', 'bicycle', 'bird', 'boat',
'bottle', 'bus', 'car', 'cat', 'chair', 'cow',
'diningtable', 'dog', 'horse', 'motorbike', 'person',
'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor']
With the two constants defined above, we can conveniently find the class index for each pixel in a
label. We define the voc_colormap2label function to build the mapping from the above RGB color
values to class indices, and the voc_label_indices function to map any RGB values to their class
indices in this Pascal VOC2012 dataset.
#@save
def voc_colormap2label():
    """Build the mapping from RGB to class indices for VOC labels."""
    colormap2label = torch.zeros(256 ** 3, dtype=torch.long)
    for i, colormap in enumerate(VOC_COLORMAP):
        colormap2label[
            (colormap[0] * 256 + colormap[1]) * 256 + colormap[2]] = i
    return colormap2label

#@save
def voc_label_indices(colormap, colormap2label):
    """Map any RGB values in VOC labels to their class indices."""
    colormap = colormap.permute(1, 2, 0).numpy().astype('int32')
    idx = ((colormap[:, :, 0] * 256 + colormap[:, :, 1]) * 256
           + colormap[:, :, 2])
    return colormap2label[idx]
For example, in the first example image, the class index for the front part of the airplane is 1, while
the background index is 0.
y = voc_label_indices(train_labels[0], voc_colormap2label())
y[105:115, 130:140], VOC_CLASSES[1]
(tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]),
'aeroplane')
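The lookup in voc_colormap2label treats an RGB triple as a three-digit number in base 256. The arithmetic can be sketched in plain Python as follows; `rgb_to_key`, `key_to_rgb`, and the dictionary `lookup` are hypothetical names introduced only for this illustration, and the palette is just the first three entries of VOC_COLORMAP.

```python
def rgb_to_key(r, g, b):
    """Encode an RGB triple as a single integer in [0, 256**3)."""
    return (r * 256 + g) * 256 + b


def key_to_rgb(key):
    """Invert rgb_to_key."""
    return key // (256 * 256), (key // 256) % 256, key % 256


# First three colors of VOC_COLORMAP: background, aeroplane, bicycle
palette = [[0, 0, 0], [128, 0, 0], [0, 128, 0]]
lookup = {rgb_to_key(*color): i for i, color in enumerate(palette)}

aeroplane_key = rgb_to_key(128, 0, 0)   # 128 * 256 * 256
bicycle_key = rgb_to_key(0, 128, 0)     # 128 * 256
```

One design difference is worth noting: a dictionary raises an error for a color that is not in the palette, whereas the zero-initialized tensor in voc_colormap2label silently maps every unlisted color (such as the white borders) to class index 0.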
Data Preprocessing
In previous experiments such as in Section 8.1–Section 8.4, images are rescaled to fit the modelʼs
required input shape. However, in semantic segmentation, doing so requires rescaling the pre-
dicted pixel classes back to the original shape of the input image. Such rescaling may be inac-
curate, especially for segmented regions with different classes. To avoid this issue, we crop the
image to a fixed shape instead of rescaling. Specifically, using random cropping from image aug-
mentation, we crop the same area of the input image and the label.
#@save
def voc_rand_crop(feature, label, height, width):
    """Randomly crop both feature and label images."""
    # Sample one crop rectangle and apply it to both images
    rect = torchvision.transforms.RandomCrop.get_params(
        feature, (height, width))
    feature = torchvision.transforms.functional.crop(feature, *rect)
    label = torchvision.transforms.functional.crop(label, *rect)
    return feature, label

imgs = []
for _ in range(n):
    imgs += voc_rand_crop(train_features[0], train_labels[0], 200, 300)

imgs = [img.permute(1, 2, 0) for img in imgs]
d2l.show_images(imgs[::2] + imgs[1::2], 2, n);
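To see one way in which rescaling a label map can go wrong, consider naively interpolating class indices. The sketch below is only an illustration under that assumption: `resize_linear_1d` is a hypothetical helper that linearly resizes a single row, and the row contains the VOC class indices 8 (cat) and 12 (dog).

```python
def resize_linear_1d(row, new_len):
    """Linearly interpolate a 1-D list of values to new_len samples."""
    old_len = len(row)
    out = []
    for i in range(new_len):
        # Map the output position to a real-valued input position
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append((1 - frac) * row[lo] + frac * row[hi])
    return out


label_row = [8, 8, 12, 12]               # class indices: 8 = cat, 12 = dog
resized = resize_linear_1d(label_row, 7)
```

At the boundary between the two regions the interpolated value is 10, which is the index of the unrelated class cow. Cropping avoids the problem because every label pixel is copied unchanged.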
Custom Semantic Segmentation Dataset Class
We define a custom semantic segmentation dataset class VOCSegDataset by inheriting the Dataset
class provided by high-level APIs. By implementing the __getitem__ function, we can arbitrar-
ily access the input image indexed as idx in the dataset and the class index of each pixel in this
image. Since some images in the dataset have a smaller size than the output size of random crop-
ping, these examples are filtered out by a custom filter function. In addition, we also define the
normalize_image function to standardize the values of the three RGB channels of input images.
#@save
class VOCSegDataset(torch.utils.data.Dataset):
    """A customized dataset to load the VOC dataset."""

    def __init__(self, is_train, crop_size, voc_dir):
        self.transform = torchvision.transforms.Normalize(
            mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        self.crop_size = crop_size
        features, labels = read_voc_images(voc_dir, is_train=is_train)
        self.features = [self.normalize_image(feature)
                         for feature in self.filter(features)]
        self.labels = self.filter(labels)
        self.colormap2label = voc_colormap2label()
        print('read ' + str(len(self.features)) + ' examples')

    def normalize_image(self, img):
        return self.transform(img.float() / 255)

    def filter(self, imgs):
        # Keep only images that are at least as large as the crop size
        return [img for img in imgs if (
            img.shape[1] >= self.crop_size[0] and
            img.shape[2] >= self.crop_size[1])]

    def __getitem__(self, idx):
        feature, label = voc_rand_crop(self.features[idx], self.labels[idx],
                                       *self.crop_size)
        return (feature, voc_label_indices(label, self.colormap2label))

    def __len__(self):
        return len(self.features)
Reading the Dataset
We use the custom VOCSegDataset class to create instances of the training set and test set, respec-
tively. Suppose that we specify that the output shape of randomly cropped images is 320 × 480.
Below we can view the number of examples that are retained in the training set and test set.
crop_size = (320, 480)

voc_train = VOCSegDataset(True, crop_size, voc_dir)
voc_test = VOCSegDataset(False, crop_size, voc_dir)
read 1114 examples

read 1078 examples
Setting the batch size to 64, we define the data iterator for the training set. Let's print the shape
of the first minibatch. Unlike in image classification or object detection, labels here are
three-dimensional tensors.
batch_size = 64
train_iter = torch.utils.data.DataLoader(voc_train, batch_size, shuffle=True,
                                         drop_last=True,
                                         num_workers=d2l.get_dataloader_workers())
for X, Y in train_iter:
    print(X.shape)
    print(Y.shape)
    break

torch.Size([64, 3, 320, 480])
torch.Size([64, 320, 480])
Putting All Things Together
Finally, we define the following load_data_voc function to download and read the Pascal VOC2012
semantic segmentation dataset. It returns data iterators for both the training and test datasets.
#@save
def load_data_voc(batch_size, crop_size):
    """Load the VOC semantic segmentation dataset."""
    voc_dir = d2l.download_extract('voc2012', os.path.join(
        'VOCdevkit', 'VOC2012'))
    num_workers = d2l.get_dataloader_workers()
    train_iter = torch.utils.data.DataLoader(
        VOCSegDataset(True, crop_size, voc_dir), batch_size,
        shuffle=True, drop_last=True, num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(
        VOCSegDataset(False, crop_size, voc_dir), batch_size,
        drop_last=True, num_workers=num_workers)
    return train_iter, test_iter
14.9.3 Summary
• Semantic segmentation recognizes and understands what is in an image at the pixel level by
dividing the image into regions belonging to different semantic classes.
• One of the most important semantic segmentation datasets is Pascal VOC2012.
• In semantic segmentation, since the input image and label correspond one-to-one on the
pixel, the input image is randomly cropped to a fixed shape rather than rescaled.
Exercises
1. How can semantic segmentation be applied in autonomous vehicles and medical image di-
agnostics? Can you think of other applications?
2. Recall the descriptions of data augmentation in Section 14.1. Which of the image augmentation
methods used in image classification would be infeasible to apply in semantic
segmentation?
Discussions
14.10 Transposed Convolution
The CNN layers we have seen so far, such as convolutional layers (Section 7.2) and pooling layers
(Section 7.5), typically reduce (downsample) the spatial dimensions (height and width) of the
input, or keep them unchanged. In semantic segmentation, which classifies at the pixel level, it is
convenient if the spatial dimensions of the input and output are the same. For example, the channel
dimension at one output pixel can hold the classification results for the input pixel at the same
spatial position.
To achieve this, especially after the spatial dimensions are reduced by CNN layers, we can use
another type of CNN layer that can increase (upsample) the spatial dimensions of intermediate
feature maps. In this section, we will introduce transposed convolution, which is also called
fractionally-strided convolution (Dumoulin and Visin, 2016), for reversing downsampling operations
by the convolution.
import torch
from torch import nn
from d2l import torch as d2l
14.10.1 Basic Operation
Ignoring channels for now, let's begin with the basic transposed convolution operation with stride
of 1 and no padding. Suppose that we are given an n_h × n_w input tensor and a k_h × k_w kernel. Sliding
the kernel window with stride of 1 for n_w times in each row and n_h times in each column yields
a total of n_h n_w intermediate results. Each intermediate result is an (n_h + k_h − 1) × (n_w + k_w − 1)
tensor that is initialized as zeros. To compute each intermediate tensor, each element in the
input tensor is multiplied by the kernel so that the resulting k_h × k_w tensor replaces a portion
in each intermediate tensor. Note that the position of the replaced portion in each intermediate
tensor corresponds to the position of the element in the input tensor used for the computation.
In the end, all the intermediate results are summed to produce the output.
As an example, Fig. 14.10.1 illustrates how transposed convolution with a 2 × 2 kernel is computed
for a 2 × 2 input tensor.
Fig. 14.10.1: Transposed convolution with a 2 × 2 kernel. The shaded portions are a portion of an
intermediate tensor as well as the input and kernel tensor elements used for the computation.
We can implement this basic transposed convolution operation trans_conv for an input matrix X
and a kernel matrix K.
def trans_conv(X, K):
    h, w = K.shape
    Y = torch.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # Each input element scales the kernel and adds it to the output
            Y[i: i + h, j: j + w] += X[i, j] * K
    return Y
In contrast to the regular convolution (in Section 7.2) that reduces input elements via the kernel,
the transposed convolution broadcasts input elements via the kernel, thereby producing an output
that is larger than the input. We can construct the input tensor X and the kernel tensor K from Fig.
14.10.1 to validate the output of the above implementation of the basic two-dimensional trans-
posed convolution operation.
X = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
trans_conv(X, K)

tensor([[ 0., 0., 1.],
[ 0., 4., 6.],
[ 4., 12., 9.]])
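The description in terms of intermediate results can also be made explicit. The sketch below, written with plain Python lists, builds one intermediate tensor per input element and then sums them; the helper names `trans_conv_intermediates` and `sum_tensors` are hypothetical and exist only for this illustration.

```python
def trans_conv_intermediates(X, K):
    """Return one intermediate tensor per input element."""
    nh, nw = len(X), len(X[0])
    kh, kw = len(K), len(K[0])
    results = []
    for i in range(nh):
        for j in range(nw):
            # Start from zeros of the output shape
            T = [[0.0] * (nw + kw - 1) for _ in range(nh + kh - 1)]
            # Place X[i][j] * K with its upper-left corner at (i, j)
            for a in range(kh):
                for b in range(kw):
                    T[i + a][j + b] = X[i][j] * K[a][b]
            results.append(T)
    return results


def sum_tensors(tensors):
    """Elementwise sum of equally shaped nested lists."""
    rows, cols = len(tensors[0]), len(tensors[0][0])
    return [[sum(T[r][c] for T in tensors) for c in range(cols)]
            for r in range(rows)]


X = [[0.0, 1.0], [2.0, 3.0]]
K = [[0.0, 1.0], [2.0, 3.0]]
parts = trans_conv_intermediates(X, K)   # n_h * n_w = 4 intermediate tensors
Y = sum_tensors(parts)
```

Summing the four intermediate tensors reproduces the 3 × 3 output printed for trans_conv(X, K).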
Alternatively, when the input X and kernel K are both four-dimensional tensors, we can use high-
level APIs to obtain the same results.
X, K = X.reshape(1, 1, 2, 2), K.reshape(1, 1, 2, 2)

tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, bias=False)
tconv.weight.data = K
tconv(X)
tensor([[[[ 0., 0., 1.],

[ 0., 4., 6.],
[ 4., 12., 9.]]]], grad_fn=<ConvolutionBackward0>)
14.10.2 Padding, Strides, and Multiple Channels
Unlike in the regular convolution, where padding is applied to the input, in the transposed
convolution it is applied to the output. For example, when specifying the padding number on either
side of the height and width as 1, the first and last rows and columns will be removed from the
transposed convolution output.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, padding=1, bias=False)
tconv.weight.data = K
tconv(X)
tensor([[[[4.]]]], grad_fn=<ConvolutionBackward0>)
In the transposed convolution, strides are specified for intermediate results (thus output), not for
input. Using the same input and kernel tensors from Fig. 14.10.1, changing the stride from 1 to
2 increases both the height and width of intermediate tensors, hence the output tensor in Fig.
14.10.2.

Fig. 14.10.2: Transposed convolution with a 2 × 2 kernel with stride of 2. The shaded portions are
a portion of an intermediate tensor as well as the input and kernel tensor elements used for the
computation.
The following code snippet can validate the transposed convolution output for stride of 2 in Fig.
14.10.2.
tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
tconv.weight.data = K
tconv(X)
tensor([[[[0., 0., 0., 1.],

[0., 0., 2., 3.],
[0., 2., 0., 3.],
[4., 6., 6., 9.]]]], grad_fn=<ConvolutionBackward0>)
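The effect of stride and padding can be reproduced without a deep learning framework. The following is a minimal sketch under the conventions described in this section (stride spaces out the placements of the kernel in the output, padding trims the output border); `trans_conv2d` is a hypothetical helper, not a library function.

```python
def trans_conv2d(X, K, stride=1, padding=0):
    """Single-channel transposed convolution on nested lists."""
    nh, nw = len(X), len(X[0])
    kh, kw = len(K), len(K[0])
    H = (nh - 1) * stride + kh
    W = (nw - 1) * stride + kw
    Y = [[0.0] * W for _ in range(H)]
    for i in range(nh):
        for j in range(nw):
            # The stride moves the kernel placement in the output
            for a in range(kh):
                for b in range(kw):
                    Y[i * stride + a][j * stride + b] += X[i][j] * K[a][b]
    # Padding removes rows and columns from the border of the output
    return [row[padding:W - padding] for row in Y[padding:H - padding]]


X = [[0.0, 1.0], [2.0, 3.0]]
K = [[0.0, 1.0], [2.0, 3.0]]
padded = trans_conv2d(X, K, padding=1)
strided = trans_conv2d(X, K, stride=2)
```

With padding of 1 only the central element 4 of the 3 × 3 output remains, and with stride of 2 the four scaled copies of the kernel no longer overlap, which gives the 4 × 4 output shown above.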
For multiple input and output channels, the transposed convolution works in the same way as
the regular convolution. Suppose that the input has c_i channels, and that the transposed
convolution assigns a k_h × k_w kernel tensor to each input channel. When multiple output channels are
specified, we will have a c_i × k_h × k_w kernel for each output channel.
All in all, if we feed X into a convolutional layer f to output Y = f (X) and create a transposed
convolutional layer g with the same hyperparameters as f except for the number of output channels
being the number of channels in X, then g(Y ) will have the same shape as X. This can be illustrated
in the following example.
X = torch.rand(size=(1, 10, 16, 16))

conv = nn.Conv2d(10, 20, kernel_size=5, padding=2, stride=3)
tconv = nn.ConvTranspose2d(20, 10, kernel_size=5, padding=2, stride=3)
tconv(conv(X)).shape == X.shape

True
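The shape round trip in this example can be checked by hand with the usual output-size formulas for a convolution and a transposed convolution (dilation of 1 and no output padding are assumed). The helper names `conv_out_size` and `tconv_out_size` are hypothetical.

```python
def conv_out_size(n, k, p, s):
    """Output length of a convolution along one spatial dimension."""
    return (n + 2 * p - k) // s + 1


def tconv_out_size(n, k, p, s):
    """Output length of a transposed convolution along one dimension."""
    return (n - 1) * s - 2 * p + k


k, p, s = 5, 2, 3
down = conv_out_size(16, k, p, s)        # (16 + 4 - 5) // 3 + 1
up = tconv_out_size(down, k, p, s)       # (6 - 1) * 3 - 4 + 5

# When the stride does not divide n + 2p - k, the floor loses information
down_17 = conv_out_size(17, k, p, s)
up_17 = tconv_out_size(down_17, k, p, s)
```

For an input of length 16 the round trip is exact. For an input of length 17 the convolution output has the same length as before, so the transposed convolution returns 16 rather than 17; the round trip is exact only when the floor in the first formula discards nothing. In PyTorch, the `output_padding` argument of `nn.ConvTranspose2d` exists to resolve this ambiguity.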
14.10.3 Connection to Matrix Transposition
The transposed convolution is named after the matrix transposition. To explain, letʼs first see
how to implement convolutions using matrix multiplications. In the example below, we define a
3 × 3 input X and a 2 × 2 convolution kernel K, and then use the corr2d function to compute the
convolution output Y.
X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
Y = d2l.corr2d(X, K)
Y
tensor([[27., 37.],
[57., 67.]])
Next, we rewrite the convolution kernel K as a sparse weight matrix W containing a lot of zeros.
The shape of the weight matrix is (4, 9), where the non-zero elements come from the convolution
kernel K.
def kernel2matrix(K):
    k, W = torch.zeros(5), torch.zeros((4, 9))
    k[:2], k[3:5] = K[0, :], K[1, :]
    W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:] = k, k, k, k
    return W

W = kernel2matrix(K)
W
tensor([[1., 2., 0., 3., 4., 0., 0., 0., 0.],
        [0., 1., 2., 0., 3., 4., 0., 0., 0.],
        [0., 0., 0., 1., 2., 0., 3., 4., 0.],
        [0., 0., 0., 0., 1., 2., 0., 3., 4.]])
Concatenate the input X row by row to get a vector of length 9. Then the matrix multiplication of
W and the vectorized X gives a vector of length 4. After reshaping it, we can obtain the same result
Y from the original convolution operation above: we just implemented convolutions using matrix
multiplications.
Y == torch.matmul(W, X.reshape(-1)).reshape(2, 2)
tensor([[True, True],
[True, True]])
Likewise, we can implement transposed convolutions using matrix multiplications. In the fol-
lowing example, we take the 2 × 2 output Y from the above regular convolution as input to the
transposed convolution. To implement this operation by multiplying matrices, we only need to
transpose the weight matrix W with the new shape (9, 4).

Z = trans_conv(Y, K)
Z == torch.matmul(W.T, Y.reshape(-1)).reshape(3, 3)
tensor([[True, True, True],

[True, True, True],
[True, True, True]])
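The weight matrix does not have to be written out by hand for one fixed shape. As a sketch, the construction generalizes to any kernel and input size (stride of 1 and no padding assumed); `kernel_to_matrix`, `matvec`, and `transpose` are hypothetical helpers on plain Python lists.

```python
def kernel_to_matrix(K, in_h, in_w):
    """Sparse weight matrix W such that W @ vec(X) is the cross-correlation."""
    kh, kw = len(K), len(K[0])
    out_h, out_w = in_h - kh + 1, in_w - kw + 1
    W = [[0.0] * (in_h * in_w) for _ in range(out_h * out_w)]
    for i in range(out_h):
        for j in range(out_w):
            # Row (i, j) holds the kernel at the window starting at (i, j)
            for a in range(kh):
                for b in range(kw):
                    W[i * out_w + j][(i + a) * in_w + (j + b)] = K[a][b]
    return W


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


K = [[1.0, 2.0], [3.0, 4.0]]
W = kernel_to_matrix(K, 3, 3)
x = [float(i) for i in range(9)]     # the 3 x 3 input, flattened row by row
y = matvec(W, x)                     # regular convolution, length 4
z = matvec(transpose(W), y)          # transposed convolution, length 9
```

The first row of W and the vector y agree with the matrix and the output Y shown earlier, and multiplying by the transpose maps the length-4 vector back to length 9.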
Consider implementing the convolution by multiplying matrices. Given an input vector x and a
weight matrix W, the forward propagation function of the convolution can be implemented by
multiplying its input with the weight matrix and outputting a vector y = Wx. Since backpropaga-
tion follows the chain rule and ∇x y = W⊤ , the backpropagation function of the convolution can
be implemented by multiplying its input with the transposed weight matrix W⊤ . Therefore, the
transposed convolutional layer can just exchange the forward propagation function and the back-
propagation function of the convolutional layer: its forward propagation and backpropagation
functions multiply their input vector with W⊤ and W, respectively.
14.10.4 Summary
• In contrast to the regular convolution that reduces input elements via the kernel, the trans-
posed convolution broadcasts input elements via the kernel, thereby producing an output
that is larger than the input.
• If we feed X into a convolutional layer f to output Y = f (X) and create a transposed convolu-
tional layer g with the same hyperparameters as f except for the number of output channels
being the number of channels in X, then g(Y ) will have the same shape as X.
• We can implement convolutions using matrix multiplications. The transposed convolutional
layer can just exchange the forward propagation function and the backpropagation function
of the convolutional layer.
Exercises
1. In Section 14.10.3, the convolution input X and the transposed convolution output Z have the
same shape. Do they have the same value? Why?
2. Is it efficient to use matrix multiplications to implement convolutions? Why?
Discussions
14.11 Fully Convolutional Networks
As discussed in Section 14.9, semantic segmentation classifies images at the pixel level. A fully
convolutional network (FCN) uses a convolutional neural network to transform image pixels to pixel
classes (Long et al., 2015). Unlike the CNNs that we encountered earlier for image classification or
object detection, a fully convolutional network transforms the height and width of intermediate
feature maps back to those of the input image: this is achieved by the transposed convolutional
layer introduced in Section 14.10. As a result, the classification output and the input image have
a one-to-one correspondence at the pixel level: the channel dimension at any output pixel holds the
classification results for the input pixel at the same spatial position.
%matplotlib inline
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
14.11.1 The Model
Here we describe the basic design of the fully convolutional network model. As shown in Fig.
14.11.1, this model first uses a CNN to extract image features, then transforms the number of
channels into the number of classes via a 1 × 1 convolutional layer, and finally transforms the
height and width of the feature maps to those of the input image via the transposed convolution
introduced in Section 14.10. As a result, the model output has the same height and width as the
input image, where the output channel contains the predicted classes for the input pixel at the
same spatial position.
Fig. 14.11.1: Fully convolutional network.
Below, we use a ResNet-18 model pretrained on the ImageNet dataset to extract image features and
denote the model instance as pretrained_net. The last few layers of this model include a global
average pooling layer and a fully connected layer: they are not needed in the fully convolutional
network.
pretrained_net = torchvision.models.resnet18(pretrained=True)
list(pretrained_net.children())[-3:]
UserWarning: ... current behavior is equivalent to passing weights=ResNet18_Weights.IMAGENET1K_V1. You can also use weights=ResNet18_Weights.DEFAULT to get the most up-to-date weights.
[Sequential(
(0): BasicBlock(
(conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): BasicBlock(
(relu): ReLU(inplace=True)
)
),
AdaptiveAvgPool2d(output_size=(1, 1)),
Linear(in_features=512, out_features=1000, bias=True)]
Next, we create the fully convolutional network instance net. It copies all the pretrained layers in
the ResNet-18 except for the final global average pooling layer and the fully connected layer that
are closest to the output.
net = nn.Sequential(*list(pretrained_net.children())[:-2])
Given an input with height and width of 320 and 480 respectively, the forward propagation of net
reduces the input height and width to 1/32 of the original, namely 10 and 15.
X = torch.rand(size=(1, 3, 320, 480))

net(X).shape
torch.Size([1, 512, 10, 15])
Next, we use a 1 × 1 convolutional layer to transform the number of output channels into the
number of classes (21) of the Pascal VOC2012 dataset. Finally, we need to increase the height and
width of the feature maps by a factor of 32 to change them back to the height and width of the input
image. Recall how to calculate the output shape of a convolutional layer in Section 7.3. Since
(320 − 64 + 16 × 2 + 32)/32 = 10 and (480 − 64 + 16 × 2 + 32)/32 = 15, we construct a transposed
convolutional layer with stride of 32, setting the height and width of the kernel to 64 and the padding
to 16. In general, we can see that for stride s, padding s/2 (assuming s/2 is an integer), and
kernel height and width of 2s, the transposed convolution will increase the height and width
of the input by a factor of s.
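The general statement can be verified with a short derivation. It assumes the usual output-size formula of a transposed convolution with dilation of 1 and no output padding, where n_in and n_out (symbols introduced here) denote the input and output length along one spatial dimension:

```latex
\begin{aligned}
n_{\text{out}} &= (n_{\text{in}} - 1)\,s - 2p + k \\
               &= (n_{\text{in}} - 1)\,s - 2 \cdot \tfrac{s}{2} + 2s
                  \qquad \text{with } p = \tfrac{s}{2},\; k = 2s \\
               &= s\, n_{\text{in}}
\end{aligned}
```

For s = 32 this gives (10 − 1) × 32 − 32 + 64 = 320 and (15 − 1) × 32 − 32 + 64 = 480, matching the input height and width.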
num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes,
kernel_size=64, padding=16, stride=32))
14.11.2 Initializing Transposed Convolutional Layers
We already know that transposed convolutional layers can increase the height and width of feature
maps. In image processing, we may need to scale up an image, i.e., upsampling. Bilinear interpo-
lation is one of the commonly used upsampling techniques. It is also often used for initializing
transposed convolutional layers.
To explain bilinear interpolation, say that given an input image we want to calculate each pixel
of the upsampled output image. In order to calculate the pixel of the output image at coordinate
(x, y), first map (x, y) to coordinate (x′ , y ′ ) on the input image, for example, according to the ratio
of the input size to the output size. Note that the mapped x′ and y ′ are real numbers. Then, find
the four pixels closest to coordinate (x′ , y ′ ) on the input image. Finally, the pixel of the output
image at coordinate (x, y) is calculated based on these four closest pixels on the input image and
their relative distance from (x′ , y ′ ).
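The last step of this procedure can be sketched for a single-channel image stored as a list of rows. This is only an illustration of the weighting by relative distance: `bilinear_sample` is a hypothetical helper, x indexes columns and y indexes rows, and the mapped coordinate is assumed to lie inside the image.

```python
import math


def bilinear_sample(img, x, y):
    """Interpolate img at real-valued (x, y) from its four closest pixels."""
    h, w = len(img), len(img[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    # Clamp the right and bottom neighbors at the image border
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two rows, then along y between them
    top = (1 - dx) * img[y0][x0] + dx * img[y0][x1]
    bottom = (1 - dx) * img[y1][x0] + dx * img[y1][x1]
    return (1 - dy) * top + dy * bottom


img = [[0.0, 10.0],
       [20.0, 30.0]]
center = bilinear_sample(img, 0.5, 0.5)    # equally far from all four pixels
corner = bilinear_sample(img, 1.0, 0.0)    # exactly on the top-right pixel
quarter = bilinear_sample(img, 0.25, 0.0)  # closer to the top-left pixel
```

A point that is equally far from all four pixels receives their plain average, and a point that coincides with a pixel receives exactly that pixel's value.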
Upsampling of bilinear interpolation can be implemented by the transposed convolutional layer
with the kernel constructed by the following bilinear_kernel function. Due to space limitations,
we only provide the implementation of the bilinear_kernel function below without discussions
on its algorithm design.
def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (torch.arange(kernel_size).reshape(-1, 1),
          torch.arange(kernel_size).reshape(1, -1))
    # Outer product of two one-dimensional triangular (tent) filters
    filt = (1 - torch.abs(og[0] - center) / factor) * \
           (1 - torch.abs(og[1] - center) / factor)
    weight = torch.zeros((in_channels, out_channels,
                          kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return weight
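Although the algorithm design is not discussed here, it may help to look at the one-dimensional weights that bilinear_kernel multiplies together. The sketch below repeats the same arithmetic in plain Python; `bilinear_weights_1d` is a hypothetical helper introduced only for this illustration.

```python
def bilinear_weights_1d(kernel_size):
    """One-dimensional tent weights, as computed inside bilinear_kernel."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    return [1 - abs(i - center) / factor for i in range(kernel_size)]


w4 = bilinear_weights_1d(4)   # even kernel size: center lies between taps
w3 = bilinear_weights_1d(3)   # odd kernel size: center lies on a tap

# The two-dimensional filter is the outer product of the 1-D weights
filt4 = [[a * b for b in w4] for a in w4]
```

For a kernel size of 4 the weights are 0.25, 0.75, 0.75, 0.25. With stride of 2, an output position only receives taps of one parity, and both parities sum to 1 (0.25 + 0.75), which is consistent with the layer acting as an interpolation rather than changing the overall brightness.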
Let's experiment with upsampling of bilinear interpolation that is implemented by a transposed
convolutional layer. We construct a transposed convolutional layer that doubles the height and
width, and initialize its kernel with the bilinear_kernel function.

conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2,
bias=False)
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4));
Read the image X and assign the upsampling output to Y. In order to print the image, we need to
adjust the position of the channel dimension.
img = torchvision.transforms.ToTensor()(d2l.Image.open('../img/catdog.jpg'))
X = img.unsqueeze(0)
Y = conv_trans(X)
out_img = Y[0].permute(1, 2, 0).detach()
As we can see, the transposed convolutional layer increases both the height and width of the image
by a factor of two. Except for the different scales in coordinates, the image scaled up by bilinear
interpolation and the original image printed in Section 14.3 look the same.
d2l.set_figsize()
print('input image shape:', img.permute(1, 2, 0).shape)
d2l.plt.imshow(img.permute(1, 2, 0));
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img);
input image shape: torch.Size([561, 728, 3])

output image shape: torch.Size([1122, 1456, 3])
In a fully convolutional network, we initialize the transposed convolutional layer with upsampling
of bilinear interpolation. For the 1 × 1 convolutional layer, we use Xavier initialization.
W = bilinear_kernel(num_classes, num_classes, 64)

net.transpose_conv.weight.data.copy_(W);

14.11.3 Reading the Dataset
We read the semantic segmentation dataset as introduced in Section 14.9. The output image shape
of random cropping is specified as 320 × 480: both the height and width are divisible by 32.
batch_size, crop_size = 32, (320, 480)

train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)
read 1114 examples

read 1078 examples
14.11.4 Training
Now we can train our constructed fully convolutional network. The loss function and accuracy
calculation here are not essentially different from those in image classification of earlier chapters.
Because we use the output channel of the transposed convolutional layer to predict the class for
each pixel, the channel dimension is specified in the loss calculation. In addition, the accuracy is
calculated based on correctness of the predicted class for all the pixels.
def loss(inputs, targets):
    # Average the per-pixel cross-entropy over height and width
    return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)

num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)
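To make the per-pixel accuracy concrete, the following framework-free sketch takes the argmax over the channel dimension for every pixel and compares it with the label. The helpers `argmax_over_channels` and `pixel_accuracy` and the tiny two-class score tensor are hypothetical and serve only as an illustration.

```python
def argmax_over_channels(scores):
    """scores[c][i][j] holds the score of class c at pixel (i, j)."""
    num_classes = len(scores)
    h, w = len(scores[0]), len(scores[0][0])
    return [[max(range(num_classes), key=lambda c: scores[c][i][j])
             for j in range(w)] for i in range(h)]


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class equals the label."""
    total = sum(len(row) for row in target)
    correct = sum(p == t for pr, tr in zip(pred, target)
                  for p, t in zip(pr, tr))
    return correct / total


# Scores for 2 classes on a 2 x 2 image (channel dimension first)
scores = [
    [[2.0, 0.1], [0.3, 0.2]],   # class 0
    [[0.5, 1.5], [0.9, 0.1]],   # class 1
]
target = [[0, 1], [1, 1]]
pred = argmax_over_channels(scores)
acc = pixel_accuracy(pred, target)
```

Three of the four pixels are classified correctly here, so the accuracy is 0.75.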


14.11.5 Prediction
When predicting, we need to standardize the input image in each channel and transform the image
into the four-dimensional input format required by the CNN.
def predict(img):
X = test_iter.dataset.normalize_image(img).unsqueeze(0)
pred = net(X.to(devices[0])).argmax(dim=1)
return pred.reshape(pred.shape[1], pred.shape[2])
To visualize the predicted class of each pixel, we map the predicted class back to its label color in
the dataset.
def label2image(pred):
    colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])
    X = pred.long()
    return colormap[X, :]
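The lookup in label2image relies on tensor indexing: indexing a colormap of shape (number of classes, 3) with an (h, w) tensor of class indices yields an (h, w, 3) image. A minimal sketch, assuming a hypothetical 3-class colormap in place of d2l.VOC_COLORMAP:

```python
import torch

# Hypothetical 3-class colormap (RGB); the real one is d2l.VOC_COLORMAP
colormap = torch.tensor([[0, 0, 0], [128, 0, 0], [0, 128, 0]])
pred = torch.tensor([[0, 1], [2, 1]])  # predicted class index per pixel

# Each class index is replaced by its RGB row: (h, w) -> (h, w, 3)
image = colormap[pred.long(), :]
print(image.shape)
```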
Images in the test dataset vary in size and shape. Since the model uses a transposed convolutional
layer with stride of 32, when the height or width of an input image is indivisible by 32, the out-
put height or width of the transposed convolutional layer will deviate from the shape of the input
image. In order to address this issue, we can crop multiple rectangular areas with height and
width that are integer multiples of 32 in the image, and perform forward propagation on the pix-
els in these areas separately. Note that the union of these rectangular areas needs to completely
cover the input image. When a pixel is covered by multiple rectangular areas, the average of the
transposed convolution outputs in separate areas for this same pixel can be input to the softmax
operation to predict the class.
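The tiling idea described above can be sketched as follows. This is only an illustration under stated assumptions: tile_starts and predict_tiled are hypothetical helper names, score_fn stands in for a network that maps a (3, h, w) crop to per-pixel class scores of shape (num_classes, h, w), and the image is assumed to be at least as large as one tile.

```python
import torch

def tile_starts(length, tile):
    """Start offsets of tiles of size `tile` whose union covers [0, length)."""
    if length < tile:
        raise ValueError('image is smaller than one tile')
    starts = list(range(0, length - tile + 1, tile))
    if starts[-1] + tile < length:
        starts.append(length - tile)  # the last tile overlaps its neighbor
    return starts

def predict_tiled(score_fn, img, num_classes, tile=(320, 480)):
    """Average per-pixel class scores over the tiles, then take the argmax."""
    _, h, w = img.shape
    scores = torch.zeros(num_classes, h, w)
    counts = torch.zeros(1, h, w)
    for top in tile_starts(h, tile[0]):
        for left in tile_starts(w, tile[1]):
            rows = slice(top, top + tile[0])
            cols = slice(left, left + tile[1])
            scores[:, rows, cols] += score_fn(img[:, rows, cols])
            counts[:, rows, cols] += 1
    # Pixels covered by several tiles use the mean of their scores
    return (scores / counts).argmax(dim=0)

# Stand-in for a trained network: two "classes" scored from the red channel
toy_score_fn = lambda patch: torch.stack([patch[0], 1 - patch[0]])
pred = predict_tiled(toy_score_fn, torch.rand(3, 400, 500), num_classes=2)
print(pred.shape)
```

Since softmax preserves the ordering of scores, taking the argmax of the averaged scores gives the same class as applying softmax first.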
For simplicity, we only read a few larger test images, and crop a 320 × 480 area for prediction
starting from the upper-left corner of an image. For these test images, we print their cropped
areas, prediction results, and ground-truth row by row.
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')

test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 320, 480)
    X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X.permute(1,2,0), pred.cpu(),
             torchvision.transforms.functional.crop(
                 test_labels[i], *crop_rect).permute(1,2,0)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);

14.11.6 Summary
• The fully convolutional network first uses a CNN to extract image features, then transforms
the number of channels into the number of classes via a 1×1 convolutional layer, and finally
transforms the height and width of the feature maps to those of the input image via the
transposed convolution.
• In a fully convolutional network, we can use upsampling of bilinear interpolation to initialize
the transposed convolutional layer.
Exercises
1. If we use Xavier initialization for the transposed convolutional layer in the experiment, how
does the result change?
2. Can you further improve the accuracy of the model by tuning the hyperparameters?
3. Predict the classes of all pixels in test images.
4. The original fully convolutional network paper also uses outputs of some intermediate CNN
layers (Long et al., 2015). Try to implement this idea.
Discussions

14.12 Neural Style Transfer
If you are a photography enthusiast, you may be familiar with the filter. It can change the color
style of photos so that landscape photos become sharper or portrait photos have whitened skins.
However, one filter usually only changes one aspect of the photo. To apply an ideal style to a photo,
you probably need to try many different filter combinations. This process is as complex as tuning
the hyperparameters of a model.
In this section, we will leverage layerwise representations of a CNN to automatically apply the
style of one image to another image, i.e., style transfer (Gatys et al., 2016). This task needs two in-
put images: one is the content image and the other is the style image. We will use neural networks
to modify the content image to make it close to the style image in style. For example, the content
image in Fig. 14.12.1 is a landscape photo taken by us in Mount Rainier National Park in the sub-
urbs of Seattle, while the style image is an oil painting with the theme of autumn oak trees. In
the output synthesized image, the oil brush strokes of the style image are applied, leading to more
vivid colors, while preserving the main shape of the objects in the content image.
Fig. 14.12.1: Given content and style images, style transfer outputs a synthesized image.
14.12.1 Method
Fig. 14.12.2 illustrates the CNN-based style transfer method with a simplified example. First, we
initialize the synthesized image, for example, as the content image. This synthesized image
is the only variable that needs to be updated during the style transfer process, i.e., the model
parameters to be updated during training. Then we choose a pretrained CNN to extract image
features and freeze its model parameters during training. This deep CNN uses multiple layers
to extract hierarchical features for images. We can choose the output of some of these layers as
content features or style features. Take Fig. 14.12.2 as an example. The pretrained neural network
here has 3 convolutional layers, where the second layer outputs the content features, and the first
and third layers output the style features.

Fig. 14.12.2: CNN-based style transfer process. Solid lines show the direction of forward propaga-
tion and dotted lines show backward propagation.
Next, we calculate the loss function of style transfer through forward propagation (direction of
solid arrows), and update the model parameters (the synthesized image for output) through
backpropagation (direction of dotted arrows). The loss function commonly used in style transfer
consists of three parts: (i) content loss makes the synthesized image and the content image close in
content features; (ii) style loss makes the synthesized image and style image close in style features;
and (iii) total variation loss helps to reduce the noise in the synthesized image. Finally, when the
model training is over, we output the model parameters of the style transfer to generate the final
synthesized image.
In the following, we will explain the technical details of style transfer via a concrete experiment.
14.12.2 Reading the Content and Style Images
First, we read the content and style images. From their printed coordinate axes, we can tell that
these images have different sizes.
%matplotlib inline
import torch
import torchvision
from torch import nn
from d2l import torch as d2l

d2l.set_figsize()
content_img = d2l.Image.open('../img/rainier.jpg')
d2l.plt.imshow(content_img);

style_img = d2l.Image.open('../img/autumn-oak.jpg')
d2l.plt.imshow(style_img);
14.12.3 Preprocessing and Postprocessing
Below, we define two functions for preprocessing and postprocessing images. The preprocess
function standardizes each of the three RGB channels of the input image and transforms the re-
sults into the CNN input format. The postprocess function restores the pixel values in the output
image to their original values before standardization. Since the image printing function requires
that each pixel has a floating point value from 0 to 1, we replace any value smaller than 0 or greater
than 1 with 0 or 1, respectively.
rgb_mean = torch.tensor([0.485, 0.456, 0.406])
rgb_std = torch.tensor([0.229, 0.224, 0.225])

def preprocess(img, image_shape):
    transforms = torchvision.transforms.Compose([
        torchvision.transforms.Resize(image_shape),
        # Convert the PIL image to a tensor, which `Normalize` requires
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(mean=rgb_mean, std=rgb_std)])
    return transforms(img).unsqueeze(0)

def postprocess(img):
    img = img[0].to(rgb_std.device)
    img = torch.clamp(img.permute(1, 2, 0) * rgb_std + rgb_mean, 0, 1)
    return torchvision.transforms.ToPILImage()(img.permute(2, 0, 1))
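As a quick sanity check of the two steps, the sketch below standardizes a random tensor and then undoes the standardization in the same way as postprocess. It works directly on a tensor rather than a PIL image, which is an assumption made to keep the example self-contained.

```python
import torch

rgb_mean = torch.tensor([0.485, 0.456, 0.406])
rgb_std = torch.tensor([0.229, 0.224, 0.225])

img = torch.rand(3, 4, 5)  # a toy image in (channel, height, width) layout
# Standardize each channel, as `Normalize` does
standardized = (img - rgb_mean[:, None, None]) / rgb_std[:, None, None]
# Undo the standardization in (height, width, channel) layout and clamp
restored = torch.clamp(standardized.permute(1, 2, 0) * rgb_std + rgb_mean, 0, 1)
print(torch.allclose(restored.permute(2, 0, 1), img, atol=1e-6))
```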
14.12.4 Extracting Features
We use the VGG-19 model pretrained on the ImageNet dataset to extract image features (Gatys et
al., 2016).
# Newer torchvision releases warn that `pretrained=True` is deprecated; the
# warning reports that it is equivalent to weights=VGG19_Weights.IMAGENET1K_V1
pretrained_net = torchvision.models.vgg19(pretrained=True)
In order to extract the content features and style features of the image, we can select the output
of certain layers in the VGG network. Generally speaking, layers closer to the input tend to
capture details of the image, whereas layers closer to the output tend to capture its global
information. In order to avoid excessively retaining the details of the content image in the synthesized
image, we choose a VGG layer that is closer to the output as the content layer to output the content
features of the image. We also select the output of different VGG layers for extracting local and
global style features. These layers are also called style layers. As mentioned in Section 8.2, the VGG
network uses 5 convolutional blocks. In the experiment, we choose the last convolutional layer
of the fourth convolutional block as the content layer, and the first convolutional layer of each
convolutional block as the style layer. The indices of these layers can be obtained by printing the
pretrained_net instance.
style_layers, content_layers = [0, 5, 10, 19, 28], [25]
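The indices above can also be derived without printing the network. The sketch below assumes the usual layout of the VGG-19 feature extractor without batch normalization, where every convolution is followed by a ReLU module and every block ends with one max-pooling module; VGG19_CFG and conv_indices_by_block are names introduced here for illustration.

```python
# VGG-19 configuration: numbers are convolution output channels and 'M' marks
# a max-pooling layer that closes a convolutional block
VGG19_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
             512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']

def conv_indices_by_block(cfg):
    """Module indices of the convolutional layers, grouped by block."""
    blocks, current, idx = [], [], 0
    for v in cfg:
        if v == 'M':
            blocks.append(current)
            current = []
            idx += 1   # the pooling module
        else:
            current.append(idx)
            idx += 2   # the convolution and its ReLU
    return blocks

blocks = conv_indices_by_block(VGG19_CFG)
style_layers = [block[0] for block in blocks]  # first convolution per block
content_layers = [blocks[3][-1]]               # last convolution of block 4
print(style_layers, content_layers)
```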
When extracting features using VGG layers, we only need to use all those from the input layer to
the content layer or style layer that is closest to the output layer. Letʼs construct a new network
instance net, which only retains all the VGG layers to be used for feature extraction.
net = nn.Sequential(*[pretrained_net.features[i] for i in
                      range(max(content_layers + style_layers) + 1)])
Given the input X, if we simply invoke the forward propagation net(X), we can only get the output
of the last layer. Since we also need the outputs of intermediate layers, we need to perform layer-
by-layer computation and keep the content and style layer outputs.

def extract_features(X, content_layers, style_layers):
    contents = []
    styles = []
    for i in range(len(net)):
        X = net[i](X)
        if i in style_layers:
            styles.append(X)
        if i in content_layers:
            contents.append(X)
    return contents, styles
Two functions are defined below: the get_contents function extracts content features from the
content image, and the get_styles function extracts style features from the style image. Since
there is no need to update the model parameters of the pretrained VGG during training, we can
extract the content and the style features even before the training starts. Since the synthesized
image is a set of model parameters to be updated for style transfer, we can only extract the con-
tent and style features of the synthesized image by calling the extract_features function during
training.
def get_contents(image_shape, device):
    content_X = preprocess(content_img, image_shape).to(device)
    contents_Y, _ = extract_features(content_X, content_layers, style_layers)
    return content_X, contents_Y

def get_styles(image_shape, device):
    style_X = preprocess(style_img, image_shape).to(device)
    _, styles_Y = extract_features(style_X, content_layers, style_layers)
    return style_X, styles_Y
14.12.5 Defining the Loss Function
Now we will describe the loss function for style transfer. The loss function consists of the content
loss, style loss, and total variation loss.
Content Loss
Similar to the loss function in linear regression, the content loss measures the difference in con-
tent features between the synthesized image and the content image via the squared loss function.
The two inputs of the squared loss function are both outputs of the content layer computed by the
extract_features function.
def content_loss(Y_hat, Y):
    # We detach the target content from the tree used to dynamically compute
    # the gradient: this is a stated value, not a variable. Otherwise the loss
    # will throw an error.
    return torch.square(Y_hat - Y.detach()).mean()

Style Loss
Style loss, similar to content loss, also uses the squared loss function to measure the difference in
style between the synthesized image and the style image. To express the style output of any style
layer, we first use the extract_features function to compute the style layer output. Suppose that
the output has 1 example, c channels, height h, and width w. We can transform this output into
matrix X with c rows and hw columns. This matrix can be thought of as the concatenation of c
vectors $\mathbf{x}_1, \ldots, \mathbf{x}_c$, each of which has a length of hw. Here, vector $\mathbf{x}_i$ represents the style feature of
channel i.
In the Gram matrix of these vectors $\mathbf{X}\mathbf{X}^\top \in \mathbb{R}^{c \times c}$, element $x_{ij}$ in row i and column j is the dot
product of vectors $\mathbf{x}_i$ and $\mathbf{x}_j$. It represents the correlation of the style features of channels i and j.
We use this Gram matrix to represent the style output of any style layer. Note that when the value
of hw is larger, it likely leads to larger values in the Gram matrix. Note also that the height and
width of the Gram matrix are both the number of channels c. To keep the style loss from being affected
by these values, the gram function below divides the Gram matrix by the number of elements in X,
i.e., chw.
def gram(X):
    num_channels, n = X.shape[1], X.numel() // X.shape[1]
    X = X.reshape((num_channels, n))
    return torch.matmul(X, X.T) / (num_channels * n)
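The effect of the normalization can be checked on a toy input. In the sketch below, which repeats the gram function so that it runs on its own, the feature map is tiled along the width so that hw doubles; the random input and its shape are assumptions for illustration.

```python
import torch

def gram(X):
    num_channels, n = X.shape[1], X.numel() // X.shape[1]
    X = X.reshape((num_channels, n))
    return torch.matmul(X, X.T) / (num_channels * n)

X = torch.randn(1, 3, 4, 5)        # 1 example, c = 3, h = 4, w = 5
G = gram(X)
X_wide = torch.cat([X, X], dim=3)  # same features over twice the width
G_wide = gram(X_wide)
print(G.shape, torch.allclose(G, G_wide, atol=1e-5))
```

The Gram matrix has shape (c, c) regardless of the spatial size, and dividing by chw keeps its values on the same scale when the same features cover a larger area.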
Obviously, the two Gram matrix inputs of the squared loss function for style loss are based on the
style layer outputs for the synthesized image and the style image. It is assumed here that the Gram
matrix gram_Y based on the style image has been precomputed.
def style_loss(Y_hat, gram_Y):
    return torch.square(gram(Y_hat) - gram_Y.detach()).mean()
Total Variation Loss
Sometimes, the learned synthesized image has a lot of high-frequency noise, i.e., particularly
bright or dark pixels. One common noise reduction method is total variation denoising. Denote by
$x_{i,j}$ the pixel value at coordinate $(i, j)$. Reducing total variation loss
$$\sum_{i,j} \left|x_{i,j} - x_{i+1,j}\right| + \left|x_{i,j} - x_{i,j+1}\right| \qquad (14.12.1)$$
makes values of neighboring pixels on the synthesized image closer.
def tv_loss(Y_hat):
    return 0.5 * (torch.abs(Y_hat[:, :, 1:, :] - Y_hat[:, :, :-1, :]).mean() +
                  torch.abs(Y_hat[:, :, :, 1:] - Y_hat[:, :, :, :-1]).mean())
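A small experiment illustrates what this loss measures. The sketch below repeats tv_loss so that it is self-contained and compares a constant image with a noisy copy; the image size and the noise level are arbitrary assumptions. Note that the function averages the absolute differences and scales them by 0.5, whereas (14.12.1) is written as a plain sum.

```python
import torch

def tv_loss(Y_hat):
    return 0.5 * (torch.abs(Y_hat[:, :, 1:, :] - Y_hat[:, :, :-1, :]).mean() +
                  torch.abs(Y_hat[:, :, :, 1:] - Y_hat[:, :, :, :-1]).mean())

torch.manual_seed(0)
flat = torch.full((1, 3, 8, 8), 0.5)           # every pixel has the same value
noisy = flat + 0.1 * torch.randn(1, 3, 8, 8)   # add pixel-level noise
print(float(tv_loss(flat)), float(tv_loss(noisy)))
```

The constant image has zero total variation, while the noisy one does not, so minimizing this term pushes neighboring pixels toward similar values.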

Loss Function
The loss function of style transfer is the weighted sum of content loss, style loss, and total variation
loss. By adjusting these weight hyperparameters, we can balance among content retention, style
transfer, and noise reduction on the synthesized image.
content_weight, style_weight, tv_weight = 1, 1e3, 10

def compute_loss(X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram):
    # Calculate the content, style, and total variation losses respectively
    contents_l = [content_loss(Y_hat, Y) * content_weight for Y_hat, Y in zip(
        contents_Y_hat, contents_Y)]
    styles_l = [style_loss(Y_hat, Y) * style_weight for Y_hat, Y in zip(
        styles_Y_hat, styles_Y_gram)]
    tv_l = tv_loss(X) * tv_weight
    # Add up all the losses. Note that `10 * styles_l` repeats the list 10
    # times, so each style term is counted 10 times in the sum
    l = sum(10 * styles_l + contents_l + [tv_l])
    return contents_l, styles_l, tv_l, l
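One detail of compute_loss is easy to misread: styles_l is a Python list, so 10 * styles_l repeats the list rather than scaling its entries. The toy numbers below are assumptions chosen to show that the sum nevertheless weights every style term by a factor of 10.

```python
styles_l = [0.5, 0.25]   # toy per-layer style losses
contents_l = [1.0]       # toy content loss
tv_l = 2.0               # toy total variation loss

repeated = 10 * styles_l  # a list with 20 entries, not scaled values
total = sum(repeated + contents_l + [tv_l])
print(len(repeated), total)
```

In effect, the style loss enters the total with an extra factor of 10 on top of style_weight.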
14.12.6 Initializing the Synthesized Image
In style transfer, the synthesized image is the only variable that needs to be updated during train-
ing. Thus, we can define a simple model, SynthesizedImage, and treat the synthesized image as
the model parameters. In this model, forward propagation just returns the model parameters.
class SynthesizedImage(nn.Module):
    def __init__(self, img_shape, **kwargs):
        super(SynthesizedImage, self).__init__(**kwargs)
        self.weight = nn.Parameter(torch.rand(*img_shape))

    def forward(self):
        return self.weight
Next, we define the get_inits function. This function creates a synthesized image model in-
stance and initializes it to the image X. Gram matrices for the style image at various style layers,
styles_Y_gram, are computed prior to training.
def get_inits(X, device, lr, styles_Y):
    gen_img = SynthesizedImage(X.shape).to(device)
    gen_img.weight.data.copy_(X.data)
    trainer = torch.optim.Adam(gen_img.parameters(), lr=lr)
    styles_Y_gram = [gram(Y) for Y in styles_Y]
    return gen_img(), styles_Y_gram, trainer
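To see that an image can play the role of the model parameters, the sketch below optimizes a SynthesizedImage toward a fixed random target with a plain squared loss. The target, the learning rate, and the number of steps are assumptions for illustration; the actual experiment uses the VGG-based losses instead.

```python
import torch
from torch import nn

class SynthesizedImage(nn.Module):
    def __init__(self, img_shape):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(*img_shape))

    def forward(self):
        return self.weight

torch.manual_seed(0)
target = torch.rand(1, 3, 8, 8)   # stand-in for what the image should match
gen_img = SynthesizedImage(target.shape)
trainer = torch.optim.Adam(gen_img.parameters(), lr=0.1)

losses = []
for _ in range(50):
    trainer.zero_grad()
    l = torch.square(gen_img() - target).mean()
    l.backward()      # gradients flow only into the image itself
    trainer.step()
    losses.append(float(l))
print(losses[0], losses[-1])
```

The only tensor registered with the optimizer is the image, so each step changes pixel values rather than network weights.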

14.12.7 Training
When training the model for style transfer, we continuously extract content features and style
features of the synthesized image, and calculate the loss function. The training loop is defined
below.
def train(X, contents_Y, styles_Y, device, lr, num_epochs, lr_decay_epoch):
    X, styles_Y_gram, trainer = get_inits(X, device, lr, styles_Y)
    scheduler = torch.optim.lr_scheduler.StepLR(trainer, lr_decay_epoch, 0.8)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs],
                            legend=['content', 'style', 'TV'],
                            ncols=2, figsize=(7, 2.5))
    for epoch in range(num_epochs):
        trainer.zero_grad()
        contents_Y_hat, styles_Y_hat = extract_features(
            X, content_layers, style_layers)
        contents_l, styles_l, tv_l, l = compute_loss(
            X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram)
        l.backward()
        trainer.step()
        scheduler.step()
        if (epoch + 1) % 10 == 0:
            # Show the current synthesized image and record the three losses
            animator.axes[1].imshow(postprocess(X))
            animator.add(epoch + 1, [float(sum(contents_l)),
                                     float(sum(styles_l)), float(tv_l)])
    return X
Now we start to train the model. We rescale the height and width of the content and style images
to 300 by 450 pixels. We use the content image to initialize the synthesized image.
device, image_shape = d2l.try_gpu(), (300, 450) # PIL Image (h, w)

content_X, contents_Y = get_contents(image_shape, device)
_, styles_Y = get_styles(image_shape, device)
output = train(content_X, contents_Y, styles_Y, device, 0.3, 500, 50)
We can see that the synthesized image retains the scenery and objects of the content image, and
transfers the color of the style image at the same time. For example, the synthesized image has

blocks of color like those in the style image. Some of these blocks even have the subtle texture of
brush strokes.
14.12.8 Summary
• The loss function commonly used in style transfer consists of three parts: (i) content loss
makes the synthesized image and the content image close in content features; (ii) style loss
makes the synthesized image and style image close in style features; and (iii) total variation
loss helps to reduce the noise in the synthesized image.
• We can use a pretrained CNN to extract image features and minimize the loss function to
continuously update the synthesized image as model parameters during training.
• We use Gram matrices to represent the style outputs from the style layers.
Exercises
1. How does the output change when you select different content and style layers?
2. Adjust the weight hyperparameters in the loss function. Does the output retain more content
or have less noise?
3. Use different content and style images. Can you create more interesting synthesized images?
4. Can we apply style transfer for text? Hint: you may refer to the survey paper by Hu et al. (Hu
et al., 2020).
Discussions
14.13 Image Classification (CIFAR-10) on Kaggle
So far, we have been using high-level APIs of deep learning frameworks to directly obtain image
datasets in tensor format. However, custom image datasets often come in the form of image files.
In this section, we will start from raw image files, and organize, read, then transform them into
tensor format step by step.
We experimented with the CIFAR-10 dataset in Section 14.1, which is an important dataset in com-
puter vision. In this section, we will apply the knowledge we learned in previous sections to prac-
tice the Kaggle competition of CIFAR-10 image classification. The web address of the competition
is https://www.kaggle.com/c/cifar-10
Fig. 14.13.1 shows the information on the competitionʼs webpage. In order to submit the results,
you need to register a Kaggle account.

Fig. 14.13.1: CIFAR-10 image classification competition webpage information. The competition
dataset can be obtained by clicking the “Data” tab.
import collections
import math
import os
import shutil
import pandas as pd
import torch
import torchvision
from torch import nn
from d2l import torch as d2l
14.13.1 Obtaining and Organizing the Dataset
The competition dataset is divided into a training set and a test set, which contain 50000 and 300000
images, respectively. In the test set, 10000 images will be used for evaluation, while the remaining
290000 images will not be evaluated: they are included just to make it hard to cheat with manually
labeled results of the test set. The images in this dataset are all png color (RGB channels) image
files, whose height and width are both 32 pixels. The images cover a total of 10 categories, namely
airplanes, cars, birds, cats, deer, dogs, frogs, horses, boats, and trucks. The upper-left corner of
Fig. 14.13.1 shows some images of airplanes, cars, and birds in the dataset.
Downloading the Dataset
After logging in to Kaggle, we can click the “Data” tab on the CIFAR-10 image classification com-
petition webpage shown in Fig. 14.13.1 and download the dataset by clicking the “Download All”
button. After unzipping the downloaded file in ../data, and unzipping train.7z and test.7z in-
side it, you will find the entire dataset in the following paths:
• ../data/cifar-10/train/[1-50000].png
• ../data/cifar-10/test/[1-300000].png
• ../data/cifar-10/trainLabels.csv
• ../data/cifar-10/sampleSubmission.csv

where the train and test directories contain the training and testing images, respectively,
trainLabels.csv provides labels for the training images, and sampleSubmission.csv is a sample
submission file.
To make it easier to get started, we provide a small-scale sample of the dataset that contains the
first 1000 training images and 5 random testing images. To use the full dataset of the Kaggle com-
petition, you need to set the following demo variable to False.
#@save
d2l.DATA_HUB['cifar10_tiny'] = (d2l.DATA_URL + 'kaggle_cifar10_tiny.zip',
                                '2068874e4b9a9f0fb07ebe0ad2b29754449ccacd')

# If you use the full dataset downloaded for the Kaggle competition, set
# `demo` to False
demo = True

if demo:
    data_dir = d2l.download_extract('cifar10_tiny')
else:
    data_dir = '../data/cifar-10/'
Organizing the Dataset
We need to organize datasets to facilitate model training and testing. Letʼs first read the labels
from the csv file. The following function returns a dictionary that maps the non-extension part of
the filename to its label.
#@save
def read_csv_labels(fname):
    """Read `fname` to return a filename to label dictionary."""
    with open(fname, 'r') as f:
        # Skip the file header line (column name)
        lines = f.readlines()[1:]
    tokens = [l.rstrip().split(',') for l in lines]
    return dict(((name, label) for name, label in tokens))

labels = read_csv_labels(os.path.join(data_dir, 'trainLabels.csv'))

print('# training examples:', len(labels))
print('# classes:', len(set(labels.values())))
# training examples: 1000

# classes: 10
Next, we define the reorg_train_valid function to split the validation set out of the original train-
ing set. The argument valid_ratio in this function is the ratio of the number of examples in the
validation set to the number of examples in the original training set. More concretely, let n be
the number of images of the class with the fewest examples, and r be the ratio. The validation set
will take max(⌊nr⌋, 1) images from each class. Letʼs use valid_ratio=0.1 as an example. Since
the original training set has 50000 images, with 5000 images in each of the 10 classes, 500 images
per class (5000 in total) will be split out as the validation set in the path
train_valid_test/valid, while the other 45000 images will be used for training in the path
train_valid_test/train. After organizing the dataset, images of the same class will be placed
under the same folder.

#@save
def copyfile(filename, target_dir):
    """Copy a file into a target directory."""
    os.makedirs(target_dir, exist_ok=True)
    shutil.copy(filename, target_dir)

#@save
def reorg_train_valid(data_dir, labels, valid_ratio):
    """Split the validation set out of the original training set."""
    # The number of examples of the class that has the fewest examples in the
    # training dataset
    n = collections.Counter(labels.values()).most_common()[-1][1]
    # The number of examples per class for the validation set
    n_valid_per_label = max(1, math.floor(n * valid_ratio))
    label_count = {}
    for train_file in os.listdir(os.path.join(data_dir, 'train')):
        label = labels[train_file.split('.')[0]]
        fname = os.path.join(data_dir, 'train', train_file)
        # Every labeled image goes into the combined train_valid folder
        copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                     'train_valid', label))
        if label not in label_count or label_count[label] < n_valid_per_label:
            copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                         'valid', label))
            label_count[label] = label_count.get(label, 0) + 1
        else:
            copyfile(fname, os.path.join(data_dir, 'train_valid_test',
                                         'train', label))
    return n_valid_per_label
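The per-class quota rule can be tried out without touching the file system. The sketch below applies the same logic to an in-memory dictionary; split_names and the toy labels (20 images of one class and 30 of another) are hypothetical and serve only as an illustration.

```python
import collections
import math

def split_names(labels, valid_ratio):
    """Return (train_names, valid_names) using the per-class quota rule."""
    # Size of the smallest class, then the validation quota per class
    n = min(collections.Counter(labels.values()).values())
    quota = max(1, math.floor(n * valid_ratio))
    taken = collections.Counter()
    train_names, valid_names = [], []
    for name, label in labels.items():
        if taken[label] < quota:
            valid_names.append(name)
            taken[label] += 1
        else:
            train_names.append(name)
    return train_names, valid_names

# Hypothetical toy labels: 20 'cat' images followed by 30 'dog' images
labels = {str(i): ('cat' if i < 20 else 'dog') for i in range(50)}
train_names, valid_names = split_names(labels, valid_ratio=0.1)
print(len(train_names), len(valid_names))
```

Because the quota is computed from the smallest class, every class contributes the same number of validation images even when the classes are imbalanced.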
The reorg_test function below organizes the testing set for data loading during prediction.
#@save
def reorg_test(data_dir):
    """Organize the testing set for data loading during prediction."""
    for test_file in os.listdir(os.path.join(data_dir, 'test')):
        copyfile(os.path.join(data_dir, 'test', test_file),
                 os.path.join(data_dir, 'train_valid_test', 'test',
                              'unknown'))
Finally, we use a function to invoke the read_csv_labels, reorg_train_valid, and reorg_test
functions defined above.
def reorg_cifar10_data(data_dir, valid_ratio):
    labels = read_csv_labels(os.path.join(data_dir, 'trainLabels.csv'))
    reorg_train_valid(data_dir, labels, valid_ratio)
    reorg_test(data_dir)
Here we only set the batch size to 32 for the small-scale sample of the dataset. When training
and testing the complete dataset of the Kaggle competition, batch_size should be set to a larger
integer, such as 128. We split out 10% of the training examples as the validation set for tuning
hyperparameters.
batch_size = 32 if demo else 128


valid_ratio = 0.1
reorg_cifar10_data(data_dir, valid_ratio)
14.13.2 Image Augmentation
We use image augmentation to address overfitting. For example, images can be flipped horizon-
tally at random during training. We can also perform standardization for the three RGB channels
of color images. Below we list some of these operations that you can tweak.
transform_train = torchvision.transforms.Compose([
    # Scale the image up to a square of 40 pixels in both height and width
    torchvision.transforms.Resize(40),
    # Randomly crop a square image of 40 pixels in both height and width to
    # produce a small square of 0.64 to 1 times the area of the original
    # image, and then scale it to a square of 32 pixels in both height and
    # width
    torchvision.transforms.RandomResizedCrop(32, scale=(0.64, 1.0),
                                             ratio=(1.0, 1.0)),
    # Flip the image horizontally at random
    torchvision.transforms.RandomHorizontalFlip(),
    # Convert the PIL image to a tensor, which `Normalize` requires
    torchvision.transforms.ToTensor(),
    # Standardize each channel of the image
    torchvision.transforms.Normalize([0.4914, 0.4822, 0.4465],
                                     [0.2023, 0.1994, 0.2010])])
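The numbers in the random crop can be checked with a little arithmetic. Assuming the image has first been scaled up to 40 × 40 pixels, as the comments in the transform describe, a square crop covering 0.64 to 1 times that area has a side between 32 and 40 pixels before it is scaled to 32 × 32:

```python
import math

resized_side = 40     # side length after scaling the image up
scale = (0.64, 1.0)   # fraction of the scaled image's area that is kept
area = resized_side ** 2

# With an aspect ratio of 1, the crop is a square with this side length
sides = [math.sqrt(s * area) for s in scale]
print(sides)
```

So the smallest crop is about as large as the original 32 × 32 image, and larger crops are shrunk slightly.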
During testing, we only perform standardization on images so as to remove randomness in the
evaluation results.
transform_test = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize([0.4914, 0.4822, 0.4465],
                                     [0.2023, 0.1994, 0.2010])])
14.13.3 Reading the Dataset
Next, we read the organized dataset consisting of raw image files. Each example includes an image
and a label.
train_ds, train_valid_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=transform_train) for folder in ['train', 'train_valid']]

valid_ds, test_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=transform_test) for folder in ['valid', 'test']]
During training, we need to specify all the image augmentation operations defined above. When
the validation set is used for model evaluation during hyperparameter tuning, no randomness
from image augmentation should be introduced. Before final prediction, we train the model on
the combined training set and validation set to make full use of all the labeled data.

train_iter, train_valid_iter = [torch.utils.data.DataLoader(
    dataset, batch_size, shuffle=True, drop_last=True)
    for dataset in (train_ds, train_valid_ds)]

valid_iter = torch.utils.data.DataLoader(valid_ds, batch_size, shuffle=False,
                                         drop_last=True)

test_iter = torch.utils.data.DataLoader(test_ds, batch_size, shuffle=False,
                                        drop_last=False)
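14.13.4 Defining the Model
We define the ResNet-18 model described in Section 8.6, together with the cross-entropy loss that
the training function relies on.
def get_net():
    num_classes = 10
    net = d2l.resnet18(num_classes, 3)
    return net

loss = torch.nn.CrossEntropyLoss(reduction="none")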
14.13.5 Defining the Training Function
We will select models and tune hyperparameters according to the modelʼs performance on the
validation set. In the following, we define the model training function train.
def train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
          lr_decay):
    trainer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9,
                              weight_decay=wd)
    scheduler = torch.optim.lr_scheduler.StepLR(trainer, lr_period, lr_decay)
    num_batches, timer = len(train_iter), d2l.Timer()
    legend = ['train loss', 'train acc']
    if valid_iter is not None:
        legend.append('valid acc')
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=legend)
    net = torch.nn.DataParallel(net, device_ids=devices).to(devices[0])
    for epoch in range(num_epochs):
        net.train()
        # Sum of training loss, sum of training accuracy, no. of examples
        metric = d2l.Accumulator(3)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            l, acc = d2l.train_batch_ch13(net, features, labels,
                                          loss, trainer, devices)
            metric.add(l, acc, labels.shape[0])
            timer.stop()
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[2], metric[1] / metric[2],
                              None))
        if valid_iter is not None:
            valid_acc = d2l.evaluate_accuracy_gpu(net, valid_iter)
            animator.add(epoch + 1, (None, None, valid_acc))
        scheduler.step()
    measures = (f'train loss {metric[0] / metric[2]:.3f}, '
                f'train acc {metric[1] / metric[2]:.3f}')
    if valid_iter is not None:
        measures += f', valid acc {valid_acc:.3f}'
    print(measures + f'\n{metric[2] * num_epochs / timer.sum():.1f}'
          f' examples/sec on {str(devices)}')
14.13.6 Training and Validating the Model
Now, we can train and validate the model. All the following hyperparameters can be tuned. For
example, we can increase the number of epochs. When lr_period and lr_decay are set to 4 and
0.9, respectively, the learning rate of the optimization algorithm will be multiplied by 0.9 after
every 4 epochs. Just for ease of demonstration, we only train 20 epochs here.
devices, num_epochs, lr, wd = d2l.try_all_gpus(), 20, 2e-4, 5e-4

lr_period, lr_decay, net = 4, 0.9, get_net()
train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
lr_decay)
train loss 0.821, train acc 0.722, valid acc 0.453
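The step decay described above has a simple closed form, sketched below for the hyperparameters of this run. The helper step_lr is a hypothetical name, and the sketch assumes that the scheduler is stepped once at the end of every epoch.

```python
def step_lr(base_lr, epoch, lr_period, lr_decay):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    return base_lr * lr_decay ** (epoch // lr_period)

schedule = [step_lr(2e-4, epoch, lr_period=4, lr_decay=0.9)
            for epoch in range(20)]
print(schedule[0], schedule[4], schedule[19])
```

With 20 epochs and a period of 4, the learning rate is reduced four times during training, so the last epochs run at 0.9 to the fourth power, or about 66%, of the initial rate.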


14.13.7 Classifying the Testing Set and Submitting Results on Kaggle
After obtaining a promising model with hyperparameters, we use all the labeled data (including
the validation set) to retrain the model and classify the testing set.
net, preds = get_net(), []

net(next(iter(train_valid_iter))[0])
train(net, train_valid_iter, None, num_epochs, lr, wd, devices, lr_period,
lr_decay)
for X, _ in test_iter:
    y_hat = net(X.to(devices[0]))
    preds.extend(y_hat.argmax(dim=1).type(torch.int32).cpu().numpy())
sorted_ids = list(range(1, len(test_ds) + 1))
sorted_ids.sort(key=lambda x: str(x))
df = pd.DataFrame({'id': sorted_ids, 'label': preds})
df['label'] = df['label'].apply(lambda x: train_valid_ds.classes[x])
df.to_csv('submission.csv', index=False)
train loss 0.674, train acc 0.762

The above code will generate a submission.csv file, whose format meets the requirement of the
Kaggle competition. The method for submitting results to Kaggle is similar to that in Section 5.7.
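The ids in the code above are sorted by their string form rather than numerically. A plausible reason, stated here as an assumption, is that the test images are read in lexicographic filename order (1.png, 10.png, 100.png, and so on), so the ids must follow the same order to line up with the predictions. A minimal sketch with 12 hypothetical ids:

```python
ids = list(range(1, 13))
ids.sort(key=lambda x: str(x))  # compare ids as strings, not as numbers
print(ids)

# Sorting the corresponding filenames gives the same order
filenames = sorted(f'{i}.png' for i in range(1, 13))
print(filenames[:4])
```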
14.13.8 Summary
• We can read datasets containing raw image files after organizing them into the required
format.
• We can use convolutional neural networks and image augmentation in an image classifica-
tion competition.

Exercises
1. Use the complete CIFAR-10 dataset for this Kaggle competition. Set hyperparameters as
batch_size = 128, num_epochs = 100, lr = 0.1, lr_period = 50, and lr_decay = 0.1.
See what accuracy and ranking you can achieve in this competition. Can you further im-
prove them?
2. What accuracy can you get when not using image augmentation?
Discussions
14.14 Dog Breed Identification (ImageNet Dogs) on Kaggle
In this section, we will practice the dog breed identification problem on Kaggle. The web address
of this competition is https://www.kaggle.com/c/dog-breed-identification
In this competition, 120 different breeds of dogs will be recognized. In fact, the dataset for this
competition is a subset of the ImageNet dataset. Unlike the images in the CIFAR-10 dataset in
Section 14.13, the images in the ImageNet dataset are both taller and wider, and their dimensions vary.
Fig. 14.14.1 shows the information on the competitionʼs webpage. You need a Kaggle account to
submit your results.
Fig. 14.14.1: The dog breed identification competition website. The competition dataset can be
obtained by clicking the “Data” tab.
import os
import torch
import torchvision
from torch import nn
from d2l import torch as d2l


14.14.1 Obtaining and Organizing the Dataset
The competition dataset is divided into a training set and a test set, which contain 10222 and 10357
JPEG images of three RGB (color) channels, respectively. Among the training dataset, there are
120 breeds of dogs such as Labradors, Poodles, Dachshunds, Samoyeds, Huskies, Chihuahuas,
and Yorkshire Terriers.
Downloading the Dataset
After logging into Kaggle, you can click on the “Data” tab on the competition webpage shown in
Fig. 14.14.1 and download the dataset by clicking the “Download All” button. After unzipping the
downloaded file in ../data, you will find the entire dataset in the following paths:
• ../data/dog-breed-identification/labels.csv
• ../data/dog-breed-identification/sample_submission.csv
• ../data/dog-breed-identification/train
• ../data/dog-breed-identification/test
You may have noticed that the above structure is similar to that of the CIFAR-10 competition in
Section 14.13, where folders train/ and test/ contain training and testing dog images, respectively,
and labels.csv contains the labels for the training images. Similarly, to make it easier to
get started, we provide a small sample of the dataset mentioned above: kaggle_dog_tiny.zip.
If you are going to use the full dataset for the Kaggle competition, you need to change the
demo variable below to False.
#@save
d2l.DATA_HUB['dog_tiny'] = (d2l.DATA_URL + 'kaggle_dog_tiny.zip',
                            '0cb91d09b814ecdc07b50f31f8dcad3e81d6a86d')

# If you use the full dataset downloaded for the Kaggle competition, change
# the variable below to `False`
demo = True
if demo:
    data_dir = d2l.download_extract('dog_tiny')
else:
    data_dir = os.path.join('..', 'data', 'dog-breed-identification')

Organizing the Dataset
We can organize the dataset similarly to what we did in Section 14.13, namely splitting out a vali-
dation set from the original training set, and moving images into subfolders grouped by labels.
The reorg_dog_data function below reads the training data labels, splits out the validation set, and
organizes the training set.
def reorg_dog_data(data_dir, valid_ratio):
    labels = d2l.read_csv_labels(os.path.join(data_dir, 'labels.csv'))
    d2l.reorg_train_valid(data_dir, labels, valid_ratio)
    d2l.reorg_test(data_dir)

batch_size = 32 if demo else 128
valid_ratio = 0.1
reorg_dog_data(data_dir, valid_ratio)
14.14.2 Image Augmentation
Recall that this dog breed dataset is a subset of the ImageNet dataset, whose images are larger
than those of the CIFAR-10 dataset in Section 14.13. The following lists a few image augmentation
operations that might be useful for relatively larger images.
transform_train = torchvision.transforms.Compose([
    # Randomly crop the image to obtain an image with an area of 0.08 to 1 of
    # the original area and height-to-width ratio between 3/4 and 4/3. Then,
    # scale the image to create a new 224 x 224 image
    torchvision.transforms.RandomResizedCrop(224, scale=(0.08, 1.0),
                                             ratio=(3.0/4.0, 4.0/3.0)),
    # Randomly change the brightness, contrast, and saturation
    torchvision.transforms.ColorJitter(brightness=0.4,
                                       contrast=0.4,
                                       saturation=0.4),
    # Convert the image to a tensor
    torchvision.transforms.ToTensor(),
    # Standardize each channel of the image
    torchvision.transforms.Normalize([0.485, 0.456, 0.406],
                                     [0.229, 0.224, 0.225])])
During prediction, we only use image preprocessing operations without randomness.
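transform_test = torchvision.transforms.Compose([
    # Resize the shorter side to 256 pixels so that the center crop below
    # covers most of the image
    torchvision.transforms.Resize(256),
    # Crop a 224 x 224 square area from the center of the image
    torchvision.transforms.CenterCrop(224),
    # Convert the image to a tensor
    torchvision.transforms.ToTensor(),
    # Standardize each channel of the image
    torchvision.transforms.Normalize([0.485, 0.456, 0.406],
                                     [0.229, 0.224, 0.225])])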
14.14.3 Reading the Dataset

As in Section 14.13, we can read the organized dataset consisting of raw image files.
train_ds, train_valid_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=transform_train) for folder in ['train', 'train_valid']]

valid_ds, test_ds = [torchvision.datasets.ImageFolder(
    os.path.join(data_dir, 'train_valid_test', folder),
    transform=transform_test) for folder in ['valid', 'test']]
Below we create data iterator instances the same way as in Section 14.13.
train_iter, train_valid_iter = [torch.utils.data.DataLoader(
    dataset, batch_size, shuffle=True, drop_last=True)
    for dataset in (train_ds, train_valid_ds)]

valid_iter = torch.utils.data.DataLoader(valid_ds, batch_size, shuffle=False,
                                         drop_last=True)

test_iter = torch.utils.data.DataLoader(test_ds, batch_size, shuffle=False,
                                        drop_last=False)
14.14.4 Fine-Tuning a Pretrained Model
Again, the dataset for this competition is a subset of the ImageNet dataset. Therefore, we can use
the approach discussed in Section 14.2 to select a model pretrained on the full ImageNet dataset
and use it to extract image features to be fed into a custom small-scale output network. High-level
APIs of deep learning frameworks provide a wide range of models pretrained on the ImageNet
dataset. Here, we choose a pretrained ResNet-34 model, where we simply reuse the input of this
modelʼs output layer (i.e., the extracted features). Then we can replace the original output layer
with a small custom output network that can be trained, such as stacking two fully connected
layers. Different from the experiment in Section 14.2, the following does not retrain the pretrained
model used for feature extraction. This reduces training time and memory for storing gradients.
Recall that we standardized images using the means and standard deviations of the three RGB
channels for the full ImageNet dataset. In fact, this is also consistent with the standardization
operation by the pretrained model on ImageNet.
def get_net(devices):
    finetune_net = nn.Sequential()
    finetune_net.features = torchvision.models.resnet34(pretrained=True)
    # Define a new output network (there are 120 output categories)
    finetune_net.output_new = nn.Sequential(nn.Linear(1000, 256),
                                            nn.ReLU(),
                                            nn.Linear(256, 120))
    # Move the model to devices
    finetune_net = finetune_net.to(devices[0])
    # Freeze parameters of feature layers
    for param in finetune_net.features.parameters():
        param.requires_grad = False
    return finetune_net

Before calculating the loss, we first obtain the input of the pretrained modelʼs output layer, i.e.,
the extracted feature. Then we use this feature as input for our small custom output network to
calculate the loss.
# Keep per-example losses so that they can be summed and averaged explicitly
loss = nn.CrossEntropyLoss(reduction='none')

def evaluate_loss(data_iter, net, devices):
    l_sum, n = 0.0, 0
    for features, labels in data_iter:
        features, labels = features.to(devices[0]), labels.to(devices[0])
        outputs = net(features)
        l = loss(outputs, labels)
        l_sum += l.sum()
        n += labels.numel()
    return l_sum / n
14.14.5 Defining the Training Function
We will select the model and tune hyperparameters according to the modelʼs performance on the
validation set. The model training function train only iterates parameters of the small custom
output network.
def train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
          lr_decay):
    # Only train the small custom output network
    trainer = torch.optim.SGD((param for param in net.parameters()
                               if param.requires_grad), lr=lr,
                              momentum=0.9, weight_decay=wd)
    scheduler = torch.optim.lr_scheduler.StepLR(trainer, lr_period, lr_decay)
    num_batches, timer = len(train_iter), d2l.Timer()
    legend = ['train loss']
    if valid_iter is not None:
        legend.append('valid loss')
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=legend)
    for epoch in range(num_epochs):
        # Sum of training losses, number of examples
        metric = d2l.Accumulator(2)
        for i, (features, labels) in enumerate(train_iter):
            timer.start()
            features, labels = features.to(devices[0]), labels.to(devices[0])
            trainer.zero_grad()
            output = net(features)
            l = loss(output, labels).sum()
            l.backward()
            trainer.step()
            metric.add(l, labels.shape[0])
            timer.stop()
            # Plot about five times per epoch (guard against tiny datasets)
            if (i + 1) % max(1, num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[1], None))
        measures = f'train loss {metric[0] / metric[1]:.3f}'
        if valid_iter is not None:
            valid_loss = evaluate_loss(valid_iter, net, devices)
            animator.add(epoch + 1, (None, valid_loss.detach().cpu()))
        scheduler.step()
    if valid_iter is not None:
        measures += f', valid loss {valid_loss:.3f}'
    print(measures + f'\n{metric[1] * num_epochs / timer.sum():.1f}'
          f' examples/sec on {str(devices)}')
14.14.6 Training and Validating the Model
Now we can train and validate the model. The following hyperparameters are all tunable. For
example, the number of epochs can be increased. Because lr_period and lr_decay are set to 2
and 0.9, respectively, the learning rate of the optimization algorithm will be multiplied by 0.9 after
every 2 epochs.
devices, num_epochs, lr, wd = d2l.try_all_gpus(), 10, 1e-4, 1e-4
lr_period, lr_decay, net = 2, 0.9, get_net(devices)
train(net, train_iter, valid_iter, num_epochs, lr, wd, devices, lr_period,
      lr_decay)
train loss 1.212, valid loss 1.361
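The step decay described above can be checked with a few lines of arithmetic. The following is a minimal sketch in plain Python, assuming that the scheduler is stepped once at the end of every epoch; the helper name step_lr is ours and is not part of any library.

```python
def step_lr(initial_lr, epoch, period, decay):
    """Learning rate used during `epoch` (0-indexed) under step decay."""
    return initial_lr * decay ** (epoch // period)

# Same values as above: lr = 1e-4, lr_period = 2, lr_decay = 0.9, 10 epochs
schedule = [step_lr(1e-4, epoch, period=2, decay=0.9) for epoch in range(10)]
for epoch, lr_value in enumerate(schedule):
    print(f'epoch {epoch + 1}: lr = {lr_value:.3e}')
```

Under this assumption the learning rate stays at its initial value for the first two epochs and has been multiplied by 0.9 four times by the last two.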

14.14.7 Classifying the Testing Set and Submitting Results on Kaggle
Similar to the final step in Section 14.13, in the end all the labeled data (including the validation
set) are used for training the model and classifying the testing set. We will use the trained custom
output network for classification.
net = get_net(devices)
train(net, train_valid_iter, None, num_epochs, lr, wd, devices, lr_period,
      lr_decay)

preds = []
for data, label in test_iter:
    # Apply softmax over the class dimension (dim=1) so that the predicted
    # probabilities of each image sum to 1
    output = torch.nn.functional.softmax(net(data.to(devices[0])), dim=1)
    preds.extend(output.cpu().detach().numpy())
ids = sorted(os.listdir(
    os.path.join(data_dir, 'train_valid_test', 'test', 'unknown')))
with open('submission.csv', 'w') as f:
    f.write('id,' + ','.join(train_valid_ds.classes) + '\n')
    for i, output in zip(ids, preds):
        f.write(i.split('.')[0] + ',' + ','.join(
            [str(num) for num in output]) + '\n')
train loss 1.191

The above code will generate a submission.csv file to be submitted to Kaggle in the same way
described in Section 5.7.
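Each data row of submission.csv holds one image id followed by one predicted probability per breed, so the probabilities within a row should sum to one. The following plain-Python sketch illustrates the row layout; the class names, image ids, and logits are hypothetical and serve only as an illustration.

```python
import math

def softmax(logits):
    """Softmax over the class scores of a single image."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ['beagle', 'poodle', 'samoyed']      # hypothetical breeds
batch_logits = {'img_a': [2.0, 0.5, -1.0],     # hypothetical model outputs
                'img_b': [0.0, 0.0, 0.0]}

rows = ['id,' + ','.join(classes)]
row_probs = {}
for image_id, logits in batch_logits.items():
    row_probs[image_id] = softmax(logits)
    rows.append(image_id + ',' + ','.join(str(p) for p in row_probs[image_id]))
print('\n'.join(rows))
```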
14.14.8 Summary
• Images in the ImageNet dataset are larger (with varying dimensions) than CIFAR-10 images.
We may modify image augmentation operations for tasks on a different dataset.
• To classify a subset of the ImageNet dataset, we can leverage pre-trained models on the full
ImageNet dataset to extract features and only train a custom small-scale output network.
This will lead to less computational time and memory cost.

Exercises
1. When using the full Kaggle competition dataset, what results can you achieve when you in-
crease batch_size (batch size) and num_epochs (number of epochs) while setting some other
hyperparameters as lr = 0.01, lr_period = 10, and lr_decay = 0.1?
2. Do you get better results if you use a deeper pretrained model? How do you tune hyperpa-
rameters? Can you further improve the results?

15 | Natural Language Processing: Pretraining
Humans need to communicate. Out of this basic need of the human condition, a vast amount of
written text has been generated on an everyday basis. Given rich text in social media, chat apps,
emails, product reviews, news articles, research papers, and books, it becomes vital to enable
computers to understand them to offer assistance or make decisions based on human languages.
Natural language processing studies interactions between computers and humans using natural
languages. In practice, it is very common to use natural language processing techniques to pro-
cess and analyze text (human natural language) data, such as language models in Section 9.3 and
machine translation models in Section 10.5.
To understand text, we can begin by learning its representations. Leveraging the existing text
sequences from large corpora, self-supervised learning has been extensively used to pretrain text
representations, such as by predicting some hidden part of the text using some other part of their
surrounding text. In this way, models learn through supervision from massive text data without
expensive labeling efforts!
As we will see in this chapter, when treating each word or subword as an individual token, the
representation of each token can be pretrained using word2vec, GloVe, or subword embedding
models on large corpora. After pretraining, the representation of each token is a vector;
however, it remains the same no matter what the context is. For instance, the vector representation
of “bank” is the same in both “go to the bank to deposit some money” and “go to the bank to
sit down”. Thus, many more recent pretraining models adapt representation of the same token
to different contexts. Among them is BERT, a much deeper self-supervised model based on the
transformer encoder. In this chapter, we will focus on how to pretrain such representations for
text, as highlighted in Fig. 15.1.
Fig. 15.1: Pretrained text representations can be fed to various deep learning architectures for
different downstream natural language processing applications. This chapter focuses on the up-
stream text representation pretraining.
To give a sense of the big picture, Fig. 15.1 shows that the pretrained text representations can be fed
to a variety of deep learning architectures for different downstream natural language processing
applications. We will cover them in Chapter 16.
15.1 Word Embedding (word2vec)
Natural language is a complex system used to express meanings. In this system, words are the
basic unit of meaning. As the name implies, word vectors are vectors used to represent words, and
can also be considered as feature vectors or representations of words. The technique of mapping
words to real vectors is called word embedding. In recent years, word embedding has gradually
become basic knowledge in natural language processing.
15.1.1 One-Hot Vectors Are a Bad Choice
We used one-hot vectors to represent words (characters are words) in Section 9.5. Suppose that the
number of different words in the dictionary (the dictionary size) is N , and each word corresponds
to a different integer (index) from 0 to N − 1. To obtain the one-hot vector representation for any
word with index i, we create a length-N vector with all 0s and set the element at position i to 1. In
this way, each word is represented as a vector of length N , and it can be used directly by neural
networks.
Although one-hot word vectors are easy to construct, they are usually not a good choice. A main
reason is that one-hot word vectors cannot accurately express the similarity between different
words, such as the cosine similarity that we often use. For vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, their cosine similarity
is the cosine of the angle between them:
$$\frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|} \in [-1, 1]. \tag{15.1.1}$$
Since the cosine similarity between one-hot vectors of any two different words is 0, one-hot vectors
cannot encode similarities among words.
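To make this concrete, the following minimal sketch builds one-hot vectors for a hypothetical five-word dictionary and evaluates the cosine similarity. The helper names one_hot and cosine_similarity are ours, chosen for illustration.

```python
import math

def one_hot(index, size):
    """Return a length-`size` list with 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical five-word dictionary, used only for illustration
words = ['the', 'man', 'loves', 'his', 'son']
vectors = {w: one_hot(i, len(words)) for i, w in enumerate(words)}

sim_same = cosine_similarity(vectors['man'], vectors['man'])
sim_diff = cosine_similarity(vectors['man'], vectors['son'])
print(sim_same, sim_diff)
```

No matter how related two different words are, their one-hot vectors are orthogonal, so the similarity carries no information.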

15.1.2 Self-Supervised word2vec
The word2vec227 tool was proposed to address the above issue. It maps each word to a fixed-length
vector, and these vectors can better express the similarity and analogy relationship among dif-
ferent words. The word2vec tool contains two models, namely skip-gram (Mikolov et al., 2013)
and continuous bag of words (CBOW) (Mikolov et al., 2013). For semantically meaningful represen-
tations, their training relies on conditional probabilities that can be viewed as predicting some
words using some of their surrounding words in corpora. Since supervision comes from the data
without labels, both skip-gram and continuous bag of words are self-supervised models.
In the following, we will introduce these two models and their training methods.
15.1.3 The Skip-Gram Model
The skip-gram model assumes that a word can be used to generate its surrounding words in a text
sequence. Take the text sequence “the”, “man”, “loves”, “his”, “son” as an example. Letʼs choose
“loves” as the center word and set the context window size to 2. As shown in Fig. 15.1.1, given the
center word “loves”, the skip-gram model considers the conditional probability for generating the
context words: “the”, “man”, “his”, and “son”, which are no more than 2 words away from the center
word:
P ("the", "man", "his", "son" | "loves"). (15.1.2)
Assume that the context words are independently generated given the center word (i.e., condi-
tional independence). In this case, the above conditional probability can be rewritten as
P ("the" | "loves") · P ("man" | "loves") · P ("his" | "loves") · P ("son" | "loves"). (15.1.3)
Fig. 15.1.1: The skip-gram model considers the conditional probability of generating the sur-
rounding context words given a center word.
In the skip-gram model, each word has two $d$-dimensional vector representations for calculating
conditional probabilities. More concretely, for any word with index $i$ in the dictionary, denote by
$\mathbf{v}_i \in \mathbb{R}^d$ and $\mathbf{u}_i \in \mathbb{R}^d$ its two vectors when used as a center word and a context word, respectively.
The conditional probability of generating any context word $w_o$ (with index $o$ in the dictionary)
given the center word $w_c$ (with index $c$ in the dictionary) can be modeled by a softmax operation
on vector dot products:
$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_o^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}, \tag{15.1.4}$$
227
https://code.google.com/archive/p/word2vec/

where the vocabulary index set $\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}| - 1\}$. Consider a text sequence of length $T$, where
the word at time step $t$ is denoted as $w^{(t)}$. Assume that context words are independently generated
given any center word. For context window size $m$, the likelihood function of the skip-gram model
is the probability of generating all context words given any center word:
$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}), \tag{15.1.5}$$
where any time step that is less than 1 or greater than T can be omitted.
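As an illustration of (15.1.4) and (15.1.5), the sketch below evaluates the softmax probability and the log-likelihood for the example sentence with randomly initialized vectors. It is a toy computation in plain Python; the embedding dimension, the random initialization, and the helper names are assumptions made for this example.

```python
import math
import random

random.seed(0)
vocab = ['the', 'man', 'loves', 'his', 'son']
d = 4  # embedding dimension (arbitrary choice for illustration)
# v[i]: center-word vector of word i, u[i]: context-word vector of word i
v = [[random.gauss(0, 1) for _ in range(d)] for _ in vocab]
u = [[random.gauss(0, 1) for _ in range(d)] for _ in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def skip_gram_prob(o, c):
    """P(w_o | w_c) as in (15.1.4): softmax over the scores u_i^T v_c."""
    scores = [dot(u_i, v[c]) for u_i in u]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[o] / sum(exps)

def log_likelihood(seq, m):
    """Logarithm of (15.1.5) for a list of word indices and window size m."""
    total = 0.0
    for t, c in enumerate(seq):
        for j in range(-m, m + 1):
            # Skip the center word itself and positions outside the sequence
            if j != 0 and 0 <= t + j < len(seq):
                total += math.log(skip_gram_prob(seq[t + j], c))
    return total

seq = list(range(len(vocab)))  # "the man loves his son"
probs = [skip_gram_prob(o, 2) for o in range(len(vocab))]  # center: "loves"
ll = log_likelihood(seq, m=2)
print(probs, ll)
```

Note that every call to skip_gram_prob sums over the whole vocabulary, which is cheap here but becomes the bottleneck for large dictionaries.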
Training
The skip-gram model parameters are the center word vector and context word vector for each
word in the vocabulary. In training, we learn the model parameters by maximizing the likelihood
function (i.e., maximum likelihood estimation). This is equivalent to minimizing the following
loss function:
$$- \sum_{t=1}^{T} \sum_{-m \leq j \leq m,\ j \neq 0} \log P(w^{(t+j)} \mid w^{(t)}). \tag{15.1.6}$$
When using stochastic gradient descent to minimize the loss, in each iteration we can randomly
sample a shorter subsequence to calculate the (stochastic) gradient for this subsequence to update
the model parameters. To calculate this (stochastic) gradient, we need to obtain the gradients of
the log conditional probability with respect to the center word vector and the context word vector.
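In general, according to (15.1.4) the log conditional probability involving any pair of the center
word $w_c$ and the context word $w_o$ is
$$\log P(w_o \mid w_c) = \mathbf{u}_o^\top \mathbf{v}_c - \log\left(\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)\right). \tag{15.1.7}$$
Through differentiation, we can obtain its gradient with respect to the center word vector $\mathbf{v}_c$ as
$$\begin{aligned}
\frac{\partial \log P(w_o \mid w_c)}{\partial \mathbf{v}_c}
&= \mathbf{u}_o - \frac{\sum_{j \in \mathcal{V}} \exp(\mathbf{u}_j^\top \mathbf{v}_c)\mathbf{u}_j}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)} \\
&= \mathbf{u}_o - \sum_{j \in \mathcal{V}} \left(\frac{\exp(\mathbf{u}_j^\top \mathbf{v}_c)}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \mathbf{v}_c)}\right) \mathbf{u}_j \\
&= \mathbf{u}_o - \sum_{j \in \mathcal{V}} P(w_j \mid w_c) \mathbf{u}_j.
\end{aligned} \tag{15.1.8}$$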
Note that the calculation in (15.1.8) requires the conditional probabilities of all words in the
dictionary with $w_c$ as the center word. The gradients for the other word vectors can be obtained in
the same way.
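One way to gain confidence in (15.1.8) is to compare it with a finite-difference approximation of the gradient of (15.1.7). The following sketch does so for a toy vocabulary with random vectors; the sizes, the seed, and the helper names are assumptions made purely for illustration.

```python
import math
import random

random.seed(1)
num_words, d = 6, 3
u = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_words)]
v_c = [random.gauss(0, 1) for _ in range(d)]
o = 2  # index of the context word

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_prob(v):
    """log P(w_o | w_c) as in (15.1.7), as a function of the center vector."""
    scores = [dot(u_i, v) for u_i in u]
    return scores[o] - math.log(sum(math.exp(s) for s in scores))

def analytic_grad(v):
    """Gradient (15.1.8): u_o minus the probability-weighted sum of all u_j."""
    scores = [dot(u_i, v) for u_i in u]
    z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z for s in scores]
    return [u[o][k] - sum(p * u_j[k] for p, u_j in zip(probs, u))
            for k in range(d)]

def numeric_grad(v, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for k in range(d):
        plus = v[:k] + [v[k] + eps] + v[k + 1:]
        minus = v[:k] + [v[k] - eps] + v[k + 1:]
        grad.append((log_prob(plus) - log_prob(minus)) / (2 * eps))
    return grad

max_diff = max(abs(a - n)
               for a, n in zip(analytic_grad(v_c), numeric_grad(v_c)))
print(f'largest deviation: {max_diff:.2e}')
```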
After training, for any word with index $i$ in the dictionary, we obtain both word vectors $\mathbf{v}_i$ (as the
center word) and $\mathbf{u}_i$ (as the context word). In natural language processing applications, the center
word vectors of the skip-gram model are typically used as the word representations.

15.1.4 The Continuous Bag of Words (CBOW) Model
The continuous bag of words (CBOW) model is similar to the skip-gram model. The major difference
from the skip-gram model is that the continuous bag of words model assumes that a center word
is generated based on its surrounding context words in the text sequence. For example, in the
same text sequence “the”, “man”, “loves”, “his”, and “son”, with “loves” as the center word and
the context window size being 2, the continuous bag of words model considers the conditional
probability of generating the center word “loves” based on the context words “the”, “man”, “his”
and “son” (as shown in Fig. 15.1.2), which is
P ("loves" | "the", "man", "his", "son"). (15.1.9)
Fig. 15.1.2: The continuous bag of words model considers the conditional probability of generat-
ing the center word given its surrounding context words.
Since there are multiple context words in the continuous bag of words model, these context word
vectors are averaged in the calculation of the conditional probability. Specifically, for any word
with index $i$ in the dictionary, denote by $\mathbf{v}_i \in \mathbb{R}^d$ and $\mathbf{u}_i \in \mathbb{R}^d$ its two vectors when used as a
context word and a center word, respectively (the roles are switched relative to the skip-gram model). The
conditional probability of generating any center word $w_c$ (with index $c$ in the dictionary) given its
surrounding context words $w_{o_1}, \ldots, w_{o_{2m}}$ (with indices $o_1, \ldots, o_{2m}$ in the dictionary) can be modeled
by
$$P(w_c \mid w_{o_1}, \ldots, w_{o_{2m}}) = \frac{\exp\left(\frac{1}{2m}\mathbf{u}_c^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}})\right)}{\sum_{i \in \mathcal{V}} \exp\left(\frac{1}{2m}\mathbf{u}_i^\top (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}})\right)}. \tag{15.1.10}$$
For brevity, let $\mathcal{W}_o = \{w_{o_1}, \ldots, w_{o_{2m}}\}$ and $\bar{\mathbf{v}}_o = (\mathbf{v}_{o_1} + \ldots + \mathbf{v}_{o_{2m}})/(2m)$. Then (15.1.10) can be
simplified as
$$P(w_c \mid \mathcal{W}_o) = \frac{\exp\left(\mathbf{u}_c^\top \bar{\mathbf{v}}_o\right)}{\sum_{i \in \mathcal{V}} \exp\left(\mathbf{u}_i^\top \bar{\mathbf{v}}_o\right)}. \tag{15.1.11}$$
Consider a text sequence of length $T$, where the word at time step $t$ is denoted as $w^{(t)}$. For context
window size $m$, the likelihood function of the continuous bag of words model is the probability of
generating all center words given their context words:
$$\prod_{t=1}^{T} P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}). \tag{15.1.12}$$
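The averaging in (15.1.11) can be sketched in the same toy setting as the running example. The vectors below are randomly initialized and the helper names are ours; note that in this model v holds context-word vectors and u holds center-word vectors.

```python
import math
import random

random.seed(0)
vocab = ['the', 'man', 'loves', 'his', 'son']
d = 4  # embedding dimension (arbitrary choice for illustration)
# In CBOW, v[i] is the context-word vector and u[i] the center-word vector
v = [[random.gauss(0, 1) for _ in range(d)] for _ in vocab]
u = [[random.gauss(0, 1) for _ in range(d)] for _ in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cbow_prob(c, context):
    """P(w_c | W_o) as in (15.1.11), with `context` a list of word indices."""
    # Average the context-word vectors to obtain v_bar
    v_bar = [sum(v[o][k] for o in context) / len(context) for k in range(d)]
    scores = [dot(u_i, v_bar) for u_i in u]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[c] / sum(exps)

context = [0, 1, 3, 4]  # "the", "man", "his", "son"
probs = [cbow_prob(c, context) for c in range(len(vocab))]
print(f'P("loves" | context) = {probs[2]:.3f}')
```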

Training
Training continuous bag of words models is almost the same as training skip-gram models. The
maximum likelihood estimation of the continuous bag of words model is equivalent to minimizing
the following loss function:
$$-\sum_{t=1}^{T} \log P(w^{(t)} \mid w^{(t-m)}, \ldots, w^{(t-1)}, w^{(t+1)}, \ldots, w^{(t+m)}). \tag{15.1.13}$$
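Notice that
$$\log P(w_c \mid \mathcal{W}_o) = \mathbf{u}_c^\top \bar{\mathbf{v}}_o - \log\left(\sum_{i \in \mathcal{V}} \exp\left(\mathbf{u}_i^\top \bar{\mathbf{v}}_o\right)\right). \tag{15.1.14}$$
Through differentiation, we can obtain its gradient with respect to any context word vector $\mathbf{v}_{o_i}$ ($i =
1, \ldots, 2m$) as
$$\frac{\partial \log P(w_c \mid \mathcal{W}_o)}{\partial \mathbf{v}_{o_i}} = \frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} \frac{\exp(\mathbf{u}_j^\top \bar{\mathbf{v}}_o)\mathbf{u}_j}{\sum_{i \in \mathcal{V}} \exp(\mathbf{u}_i^\top \bar{\mathbf{v}}_o)}\right) = \frac{1}{2m}\left(\mathbf{u}_c - \sum_{j \in \mathcal{V}} P(w_j \mid \mathcal{W}_o)\mathbf{u}_j\right). \tag{15.1.15}$$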
The gradients for the other word vectors can be obtained in the same way. Unlike the skip-gram
model, the continuous bag of words model typically uses context word vectors as the word repre-
sentations.
15.1.5 Summary
• Word vectors are vectors used to represent words, and can also be considered as feature
vectors or representations of words. The technique of mapping words to real vectors is called
word embedding.
• The word2vec tool contains both the skip-gram and continuous bag of words models.
• The skip-gram model assumes that a word can be used to generate its surrounding words
in a text sequence; while the continuous bag of words model assumes that a center word is
generated based on its surrounding context words.
Exercises
1. What is the computational complexity for calculating each gradient? What could be the issue
if the dictionary size is huge?
2. Some fixed phrases in English consist of multiple words, such as “new york”. How to train
their word vectors? Hint: see Section 4 in the word2vec paper (Mikolov et al., 2013).
3. Letʼs reflect on the word2vec design by taking the skip-gram model as an example. What is
the relationship between the dot product of two word vectors in the skip-gram model and the
cosine similarity? For a pair of words with similar semantics, why may the cosine similarity
of their word vectors (trained by the skip-gram model) be high?

15.2 Approximate Training
Recall our discussions in Section 15.1. The main idea of the skip-gram model is using softmax
operations to calculate the conditional probability of generating a context word $w_o$ based on the
given center word $w_c$ in (15.1.4), whose corresponding logarithmic loss is given by the negative of
(15.1.7).
Due to the nature of the softmax operation, since a context word may be any word in the dictionary
$\mathcal{V}$, the negative of (15.1.7) contains a sum with as many terms as there are words in the
vocabulary. Consequently, the gradient calculation for the skip-gram model in (15.1.8) and that for
the continuous bag of words model in (15.1.15) both contain this sum. Unfortunately, the
computational cost of such gradients, which sum over a large dictionary (often with hundreds of
thousands or millions of words), is huge!
In order to reduce the aforementioned computational complexity, this section will introduce two
approximate training methods: negative sampling and hierarchical softmax. Due to the similarity
between the skip-gram model and the continuous bag of words model, we will just take the skip-
gram model as an example to describe these two approximate training methods.
15.2.1 Negative Sampling
Negative sampling modifies the original objective function. Given the context window of a center
word $w_c$, the fact that any (context) word $w_o$ comes from this context window is considered as an
event with the probability modeled by
$$P(D=1 \mid w_c, w_o) = \sigma(\mathbf{u}_o^\top \mathbf{v}_c), \tag{15.2.1}$$
where $\sigma$ uses the definition of the sigmoid activation function:
$$\sigma(x) = \frac{1}{1+\exp(-x)}. \tag{15.2.2}$$
Let's begin by maximizing the joint probability of all such events in text sequences to train word
embeddings. Specifically, given a text sequence of length $T$, denote by $w^{(t)}$ the word at time step
$t$ and let the context window size be $m$. Consider maximizing the joint probability
$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(D=1 \mid w^{(t)}, w^{(t+j)}). \tag{15.2.3}$$
However, (15.2.3) only considers those events that involve positive examples. As a result, the joint
probability in (15.2.3) is maximized to 1 only if all the word vectors are equal to infinity. Of course,
such results are meaningless. To make the objective function more meaningful, negative sampling
adds negative examples sampled from a predefined distribution.
Denote by $S$ the event that a context word $w_o$ comes from the context window of a center word $w_c$.
For this event involving $w_o$, from a predefined distribution $P(w)$ sample $K$ noise words that are not
from this context window. Denote by $N_k$ the event that a noise word $w_k$ ($k = 1, \ldots, K$) does not
come from the context window of $w_c$. Assume that these events involving both the positive
example and negative examples $S, N_1, \ldots, N_K$ are mutually independent. Negative sampling rewrites
the joint probability (involving only positive examples) in (15.2.3) as
$$\prod_{t=1}^{T} \prod_{-m \leq j \leq m,\ j \neq 0} P(w^{(t+j)} \mid w^{(t)}), \tag{15.2.4}$$

where the conditional probability is approximated through events $S, N_1, \ldots, N_K$:
$$P(w^{(t+j)} \mid w^{(t)}) = P(D=1 \mid w^{(t)}, w^{(t+j)}) \prod_{k=1,\ w_k \sim P(w)}^{K} P(D=0 \mid w^{(t)}, w_k). \tag{15.2.5}$$
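Denote by $i_t$ and $h_k$ the indices of a word $w^{(t)}$ at time step $t$ of a text sequence and a noise word
$w_k$, respectively. The logarithmic loss with respect to the conditional probabilities in (15.2.5) is
$$\begin{aligned}
-\log P(w^{(t+j)} \mid w^{(t)})
&= -\log P(D=1 \mid w^{(t)}, w^{(t+j)}) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log P(D=0 \mid w^{(t)}, w_k) \\
&= -\log \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log\left(1 - \sigma\left(\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right)\right) \\
&= -\log \sigma\left(\mathbf{u}_{i_{t+j}}^\top \mathbf{v}_{i_t}\right) - \sum_{k=1,\ w_k \sim P(w)}^{K} \log \sigma\left(-\mathbf{u}_{h_k}^\top \mathbf{v}_{i_t}\right).
\end{aligned} \tag{15.2.6}$$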
We can see that now the computational cost for gradients at each training step has nothing to do
with the dictionary size, but linearly depends on K. When setting the hyperparameter K to a
smaller value, the computational cost for gradients at each training step with negative sampling
is smaller.
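The loss in (15.2.6) for a single center-context pair can be sketched as follows. The vectors are random placeholders and the function name negative_sampling_loss is ours; the point is only that the loop runs over the $K$ noise words rather than over the whole dictionary.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(v_center, u_context, u_noise):
    """Loss (15.2.6) for one center-context pair and its K noise words."""
    # Positive example: the context word does come from the window
    loss = -math.log(sigmoid(dot(u_context, v_center)))
    # Negative examples: each noise word does not come from the window
    for u_k in u_noise:
        loss -= math.log(sigmoid(-dot(u_k, v_center)))
    return loss

random.seed(0)
d, K = 4, 5  # embedding dimension and number of noise words (illustrative)
v_c = [random.gauss(0, 1) for _ in range(d)]
u_o = [random.gauss(0, 1) for _ in range(d)]
noise = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]

loss_value = negative_sampling_loss(v_c, u_o, noise)
# With all-zero vectors every sigmoid equals 1/2, so the loss is (K + 1) log 2
zero_loss = negative_sampling_loss([0.0] * d, [0.0] * d, [[0.0] * d] * K)
print(loss_value, zero_loss)
```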
15.2.2 Hierarchical Softmax
As an alternative approximate training method, hierarchical softmax uses the binary tree, a data
structure illustrated in Fig. 15.2.1, where each leaf node of the tree represents a word in dictionary
V.
Fig. 15.2.1: Hierarchical softmax for approximate training, where each leaf node of the tree rep-
resents a word in the dictionary.
Denote by $L(w)$ the number of nodes (including both ends) on the path from the root node to the
leaf node representing word $w$ in the binary tree. Let $n(w, j)$ be the $j^\mathrm{th}$ node on this path, with its
context word vector being $\mathbf{u}_{n(w, j)}$. For example, $L(w_3) = 4$ in Fig. 15.2.1. Hierarchical softmax
approximates the conditional probability in (15.1.4) as
$$P(w_o \mid w_c) = \prod_{j=1}^{L(w_o)-1} \sigma\left( [\![ n(w_o, j+1) = \text{leftChild}(n(w_o, j)) ]\!] \cdot \mathbf{u}_{n(w_o, j)}^\top \mathbf{v}_c \right), \tag{15.2.7}$$
where function σ is defined in (15.2.2), and leftChild(n) is the left child node of node n: if x is true,
[[x]] = 1; otherwise [[x]] = −1.
To illustrate, let's calculate the conditional probability of generating word $w_3$ given word $w_c$ in Fig.
15.2.1. This requires dot products between the word vector $\mathbf{v}_c$ of $w_c$ and non-leaf node vectors on
the path (the path in bold in Fig. 15.2.1) from the root to $w_3$, which is traversed left, right, then left:
$$P(w_3 \mid w_c) = \sigma(\mathbf{u}_{n(w_3, 1)}^\top \mathbf{v}_c) \cdot \sigma(-\mathbf{u}_{n(w_3, 2)}^\top \mathbf{v}_c) \cdot \sigma(\mathbf{u}_{n(w_3, 3)}^\top \mathbf{v}_c). \tag{15.2.8}$$
Since $\sigma(x) + \sigma(-x) = 1$, it holds that the conditional probabilities of generating all the words in
dictionary $\mathcal{V}$ based on any word $w_c$ sum up to one:
$$\sum_{w \in \mathcal{V}} P(w \mid w_c) = 1. \tag{15.2.9}$$
Fortunately, since $L(w_o) - 1$ is on the order of $\mathcal{O}(\log_2 |\mathcal{V}|)$ due to the binary tree structure, when the
dictionary size $|\mathcal{V}|$ is huge, the computational cost for each training step using hierarchical softmax
is significantly reduced compared with that without approximate training.
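The following sketch evaluates (15.2.7) on a complete binary tree with eight leaves and checks numerically that the probabilities sum to one, as stated in (15.2.9). The heap-style node numbering, the random vectors, and the helper names are assumptions made for this illustration; practical implementations often use other tree shapes.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def path_to_leaf(word, num_words):
    """Return [(non_leaf_node, sign), ...] from the root to the leaf of `word`.

    Nodes are numbered as in a binary heap: node n has the left child
    2n + 1 and the right child 2n + 2, and leaves follow the non-leaf nodes.
    `num_words` must be a power of two. The sign is +1 when the path turns
    left at that node and -1 when it turns right.
    """
    node = (num_words - 1) + word  # heap index of the leaf
    path = []
    while node > 0:
        parent = (node - 1) // 2
        sign = 1 if node == 2 * parent + 1 else -1
        path.append((parent, sign))
        node = parent
    return path[::-1]

def hs_prob(word, v_c, u_nodes, num_words):
    """P(w | w_c) as in (15.2.7): a product of sigmoids along the path."""
    prob = 1.0
    for node, sign in path_to_leaf(word, num_words):
        prob *= sigmoid(sign * dot(u_nodes[node], v_c))
    return prob

random.seed(0)
num_words, d = 8, 4
# One vector per non-leaf node (num_words - 1 of them) and one center vector
u_nodes = [[random.gauss(0, 1) for _ in range(d)]
           for _ in range(num_words - 1)]
v_c = [random.gauss(0, 1) for _ in range(d)]

probs = [hs_prob(w, v_c, u_nodes, num_words) for w in range(num_words)]
print(sum(probs), path_to_leaf(2, num_words))
```

In this numbering the third leaf (index 2) is reached by going left, right, then left, and each probability needs only three sigmoid evaluations instead of a sum over all eight words.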
15.2.3 Summary
• Negative sampling constructs the loss function by considering mutually independent events
that involve both positive and negative examples. The computational cost for training is
linearly dependent on the number of noise words at each step.
• Hierarchical softmax constructs the loss function using the path from the root node to the
leaf node in the binary tree. The computational cost for training is dependent on the loga-
rithm of the dictionary size at each step.
Exercises
1. How can we sample noise words in negative sampling?

2. Verify that (15.2.9) holds.
3. How to train the continuous bag of words model using negative sampling and hierarchical
softmax, respectively?

15.3 The Dataset for Pretraining Word Embeddings
Now that we know the technical details of the word2vec models and approximate training meth-
ods, letʼs walk through their implementations. Specifically, we will take the skip-gram model in
Section 15.1 and negative sampling in Section 15.2 as an example. In this section, we begin with
the dataset for pretraining the word embedding model: the original format of the data will be
transformed into minibatches that can be iterated over during training.
import collections
import math
import os
import random
import torch
from d2l import torch as d2l
15.3.1 Reading the Dataset
The dataset that we use here is Penn Tree Bank (PTB)230 . This corpus is sampled from Wall Street
Journal articles, split into training, validation, and test sets. In the original format, each line of
the text file represents a sentence of words that are separated by spaces. Here we treat each word
as a token.
#@save
d2l.DATA_HUB['ptb'] = (d2l.DATA_URL + 'ptb.zip',
                       '319d85e578af0cdc590547f26231e4e31cdf1e42')

#@save
def read_ptb():
    """Load the PTB dataset into a list of text lines."""
    data_dir = d2l.download_extract('ptb')
    # Read the training set
    with open(os.path.join(data_dir, 'ptb.train.txt')) as f:
        raw_text = f.read()
    return [line.split() for line in raw_text.split('\n')]

sentences = read_ptb()
f'# sentences: {len(sentences)}'
'# sentences: 42069'
After reading the training set, we build a vocabulary for the corpus, where any word that appears
fewer than 10 times is replaced by the “<unk>” token. Note that the original dataset also contains
“<unk>” tokens that represent rare (unknown) words.
vocab = d2l.Vocab(sentences, min_freq=10)
f'vocab size: {len(vocab)}'
'vocab size: 6719'
230
https://catalog.ldc.upenn.edu/LDC99T42

15.3.2 Subsampling
Text data typically have high-frequency words such as “the”, “a”, and “in”: they may even occur
billions of times in very large corpora. However, these words often co-occur with many different
words in context windows, providing little useful signals. For instance, consider the word “chip”
in a context window: intuitively its co-occurrence with a low-frequency word “intel” is more useful
in training than the co-occurrence with a high-frequency word “a”. Moreover, training with vast
amounts of (high-frequency) words is slow. Thus, when training word embedding models,
high-frequency words can be subsampled (Mikolov et al., 2013). Specifically, each indexed word $w_i$ in
the dataset will be discarded with probability
$$P(w_i) = \max\left(1 - \sqrt{\frac{t}{f(w_i)}}, 0\right), \tag{15.3.1}$$
where $f(w_i)$ is the ratio of the number of words $w_i$ to the total number of words in the dataset,
and the constant $t$ is a hyperparameter ($10^{-4}$ in the experiment). We can see that only when the
relative frequency $f(w_i) > t$ can the (high-frequency) word $w_i$ be discarded, and the higher the
relative frequency of the word, the greater the probability of being discarded.
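To get a feel for (15.3.1), the sketch below evaluates the discard probability for a few relative frequencies. The frequencies are hypothetical round numbers rather than statistics of the PTB corpus, and the helper name discard_prob is ours.

```python
import math

def discard_prob(freq, t=1e-4):
    """Probability (15.3.1) of discarding a word with relative frequency `freq`."""
    return max(1 - math.sqrt(t / freq), 0)

# Hypothetical relative frequencies, from very frequent to rare
for freq in [5e-2, 1e-3, 1e-4, 5e-5]:
    print(f'f = {freq:.0e}: discard with probability {discard_prob(freq):.3f}')
```

A word that makes up 5% of all tokens is discarded about 95.5% of the time, whereas any word whose relative frequency does not exceed $t$ is always kept.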
#@save
def subsample(sentences, vocab):
    """Subsample high-frequency words."""
    # Exclude unknown tokens ('<unk>')
    sentences = [[token for token in line if vocab[token] != vocab.unk]
                 for line in sentences]
    counter = collections.Counter([
        token for line in sentences for token in line])
    num_tokens = sum(counter.values())

    # Return True if `token` is kept during subsampling
    def keep(token):
        return(random.uniform(0, 1) <
               math.sqrt(1e-4 / counter[token] * num_tokens))

    return ([[token for token in line if keep(token)] for line in sentences],
            counter)

subsampled, counter = subsample(sentences, vocab)
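To get a feeling for (15.3.1), it can help to evaluate the discard probability for a few relative frequencies. The following is a minimal sketch rather than part of the d2l library: the helper name `discard_prob` and the listed frequencies are assumptions chosen only for illustration.

```python
import math

def discard_prob(rel_freq, t=1e-4):
    """Probability of discarding a word with relative frequency `rel_freq`."""
    return max(1 - math.sqrt(t / rel_freq), 0)

# Hypothetical relative frequencies, chosen only for illustration
for rel_freq in (5e-2, 1e-3, 1e-4, 1e-5):
    print(f'f = {rel_freq:.0e}: discard with probability '
          f'{discard_prob(rel_freq):.3f}')
```

Words whose relative frequency does not exceed t are never discarded, which matches the `keep` function above: it keeps a token with probability min(1, sqrt(t / f)).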
The following code snippet plots the histogram of the number of tokens per sentence before and
after subsampling. As expected, subsampling significantly shortens sentences by dropping high-
frequency words, which will lead to training speedup.
d2l.show_list_len_pair_hist(['origin', 'subsampled'], '# tokens per sentence',
                            'count', sentences, subsampled);

For individual tokens, the sampling rate of the high-frequency word “the” is less than 1/20.
def compare_counts(token):
    return (f'# of "{token}": '
            f'before={sum([l.count(token) for l in sentences])}, '
            f'after={sum([l.count(token) for l in subsampled])}')

compare_counts('the')
'# of "the": before=50770, after=1958'
In contrast, the low-frequency word “join” is completely kept.
compare_counts('join')
'# of "join": before=45, after=45'
After subsampling, we map tokens to their indices for the corpus.
corpus = [vocab[line] for line in subsampled]

corpus[:3]
[[], [6697, 4127, 3228, 416, 4060], [3922, 1922, 4743]]
15.3.3 Extracting Center Words and Context Words
The following get_centers_and_contexts function extracts all the center words and their context
words from corpus. It uniformly samples an integer between 1 and max_window_size at random
as the context window size. For any center word, those words whose distance from it does not
exceed the sampled context window size are its context words.

#@save
def get_centers_and_contexts(corpus, max_window_size):
    """Return center words and context words in skip-gram."""
    centers, contexts = [], []
    for line in corpus:
        # To form a "center word--context word" pair, each sentence needs to
        # have at least 2 words
        if len(line) < 2:
            continue
        centers += line
        for i in range(len(line)):  # Context window centered at `i`
            window_size = random.randint(1, max_window_size)
            indices = list(range(max(0, i - window_size),
                                 min(len(line), i + 1 + window_size)))
            # Exclude the center word from the context words
            indices.remove(i)
            contexts.append([line[idx] for idx in indices])
    return centers, contexts
Next, we create an artificial dataset containing two sentences of 7 and 3 words, respectively. Let
the maximum context window size be 2 and print all the center words and their context words.
tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)
dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]

center 0 has contexts [1, 2]
center 1 has contexts [0, 2, 3]
center 4 has contexts [2, 3, 5, 6]
center 7 has contexts [8]
center 9 has contexts [8]
When training on the PTB dataset, we set the maximum context window size to 5. The following
extracts all the center words and their context words in the dataset.
all_centers, all_contexts = get_centers_and_contexts(corpus, 5)

f'# center-context pairs: {sum([len(contexts) for contexts in all_contexts])}'
'# center-context pairs: 1505258'

15.3.4 Negative Sampling
We use negative sampling for approximate training. To sample noise words according to a
predefined distribution, we define the following RandomGenerator class, where the (possibly
unnormalized) sampling distribution is passed via the argument sampling_weights.
#@save
class RandomGenerator:
    """Randomly draw among {1, ..., n} according to n sampling weights."""
    def __init__(self, sampling_weights):
        # Exclude index 0: the population is {1, ..., n}
        self.population = list(range(1, len(sampling_weights) + 1))
        self.sampling_weights = sampling_weights
        self.candidates = []
        self.i = 0

    def draw(self):
        if self.i == len(self.candidates):
            # Cache `k` random sampling results
            self.candidates = random.choices(
                self.population, self.sampling_weights, k=10000)
            self.i = 0
        self.i += 1
        return self.candidates[self.i - 1]
For example, we can draw 10 random variables X among indices 1, 2, and 3 with sampling prob-
abilities P (X = 1) = 2/9, P (X = 2) = 3/9, and P (X = 3) = 4/9 as follows.
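The call itself is short. The following is a minimal sketch that repeats the class definition so that the snippet runs on its own; the weights [2, 3, 4] are the unnormalized counterparts of the probabilities 2/9, 3/9, and 4/9, and the printed values vary from run to run.

```python
import random

class RandomGenerator:
    """Randomly draw among {1, ..., n} according to n sampling weights."""
    def __init__(self, sampling_weights):
        self.population = list(range(1, len(sampling_weights) + 1))
        self.sampling_weights = sampling_weights
        self.candidates = []
        self.i = 0

    def draw(self):
        if self.i == len(self.candidates):
            # Cache 10000 random sampling results at a time
            self.candidates = random.choices(
                self.population, self.sampling_weights, k=10000)
            self.i = 0
        self.i += 1
        return self.candidates[self.i - 1]

# Unnormalized weights 2, 3, 4 correspond to probabilities 2/9, 3/9, 4/9
generator = RandomGenerator([2, 3, 4])
draws = [generator.draw() for _ in range(10)]
print(draws)
```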
For a pair of center word and context word, we randomly sample K (5 in the experiment) noise
words. According to the suggestions in the word2vec paper, the sampling probability P (w) of a
noise word w is set to its relative frequency in the dictionary raised to the power of 0.75 (Mikolov
et al., 2013).
#@save
def get_negatives(all_contexts, vocab, counter, K):
    """Return noise words in negative sampling."""
    # Sampling weights for words with indices 1, 2, ... (index 0 is the
    # excluded unknown token) in the vocabulary
    sampling_weights = [counter[vocab.to_tokens(i)]**0.75
                        for i in range(1, len(vocab))]
    all_negatives, generator = [], RandomGenerator(sampling_weights)
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < len(contexts) * K:
            neg = generator.draw()
            # Noise words cannot be context words
            if neg not in contexts:
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives

all_negatives = get_negatives(all_contexts, vocab, counter, 5)
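Raising frequencies to the power of 0.75 flattens the noise distribution: frequent words are still sampled most often, but rare words are drawn more often than their raw frequency would suggest. The sketch below illustrates this with made-up counts; the words, the counts, and the helper `normalize` are assumptions for illustration only.

```python
# Hypothetical word counts, chosen only for illustration
counts = {'the': 50000, 'chip': 500, 'intel': 50}

def normalize(weights):
    """Turn nonnegative weights into a probability distribution."""
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

unigram = normalize(counts)
smoothed = normalize({w: c ** 0.75 for w, c in counts.items()})
for w in counts:
    print(f'{w}: unigram {unigram[w]:.4f}, power 0.75 {smoothed[w]:.4f}')
```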

15.3.5 Loading Training Examples in Minibatches
After all the center words together with their context words and sampled noise words are
extracted, they will be transformed into minibatches of examples that can be iteratively loaded
during training.
In a minibatch, the $i$th example includes a center word and its $n_i$ context words and $m_i$ noise
words. Due to varying context window sizes, $n_i + m_i$ varies for different $i$. Thus, for each
example we concatenate its context words and noise words in the contexts_negatives variable, and
pad zeros until the concatenation length reaches $\max_i (n_i + m_i)$ (max_len). To exclude
paddings in the calculation of the loss, we define a mask variable masks. There is a one-to-one
correspondence between elements in masks and elements in contexts_negatives, where zeros
(otherwise ones) in masks correspond to paddings in contexts_negatives.
To distinguish between positive and negative examples, we separate context words from noise
words in contexts_negatives via a labels variable. Similar to masks, there is also a one-to-one
correspondence between elements in labels and elements in contexts_negatives, where ones
(otherwise zeros) in labels correspond to context words (positive examples) in contexts_negatives.
The above idea is implemented in the following batchify function. Its input data is a list with
length equal to the batch size, where each element is an example consisting of the center word
center, its context words context, and its noise words negative. This function returns a minibatch
that can be loaded for calculations during training; the minibatch also includes the mask variable.
#@save
def batchify(data):
    """Return a minibatch of examples for skip-gram with negative sampling."""
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]
        labels += [[1] * len(context) + [0] * (max_len - len(context))]
    return (torch.tensor(centers).reshape((-1, 1)), torch.tensor(
        contexts_negatives), torch.tensor(masks), torch.tensor(labels))
Letʼs test this function using a minibatch of two examples.
x_1 = (1, [2, 2], [3, 3, 3, 3])
x_2 = (1, [2, 2, 2], [3, 3])
batch = batchify((x_1, x_2))
names = ['centers', 'contexts_negatives', 'masks', 'labels']
for name, data in zip(names, batch):
    print(name, '=', data)
centers = tensor([[1],
[1]])
contexts_negatives = tensor([[2, 2, 3, 3, 3, 3],
[2, 2, 2, 3, 3, 0]])
masks = tensor([[1, 1, 1, 1, 1, 1],

[1, 1, 1, 1, 1, 0]])
labels = tensor([[1, 1, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0]])
15.3.6 Putting It All Together
Last, we define the load_data_ptb function that reads the PTB dataset and returns the data iterator
and the vocabulary.
#@save
def load_data_ptb(batch_size, max_window_size, num_noise_words):
    """Download the PTB dataset and then load it into memory."""
    num_workers = d2l.get_dataloader_workers()
    sentences = read_ptb()
    vocab = d2l.Vocab(sentences, min_freq=10)
    subsampled, counter = subsample(sentences, vocab)
    corpus = [vocab[line] for line in subsampled]
    all_centers, all_contexts = get_centers_and_contexts(
        corpus, max_window_size)
    all_negatives = get_negatives(
        all_contexts, vocab, counter, num_noise_words)

    class PTBDataset(torch.utils.data.Dataset):
        def __init__(self, centers, contexts, negatives):
            assert len(centers) == len(contexts) == len(negatives)
            self.centers = centers
            self.contexts = contexts
            self.negatives = negatives

        def __getitem__(self, index):
            return (self.centers[index], self.contexts[index],
                    self.negatives[index])

        def __len__(self):
            return len(self.centers)

    dataset = PTBDataset(all_centers, all_contexts, all_negatives)
    data_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True,
                                            collate_fn=batchify,
                                            num_workers=num_workers)
    return data_iter, vocab
Letʼs print the first minibatch of the data iterator.
data_iter, vocab = load_data_ptb(512, 5, 5)
for batch in data_iter:
    for name, data in zip(names, batch):
        print(name, 'shape:', data.shape)
    break

centers shape: torch.Size([512, 1])
contexts_negatives shape: torch.Size([512, 60])
masks shape: torch.Size([512, 60])
labels shape: torch.Size([512, 60])
15.3.7 Summary
• High-frequency words may not be so useful in training. We can subsample them for speedup
in training.
• For computational efficiency, we load examples in minibatches. We can define other
variables to distinguish paddings from non-paddings, and positive examples from negative ones.
Exercises
1. How does the running time of the code in this section change if subsampling is not used?
2. The RandomGenerator class caches k random sampling results. Set k to other values and see
how it affects the data loading speed.
3. What other hyperparameters in the code of this section may affect the data loading speed?
Discussions
15.4 Pretraining word2vec
We go on to implement the skip-gram model defined in Section 15.1. Then we will pretrain
word2vec using negative sampling on the PTB dataset. First of all, letʼs obtain the data iterator and
the vocabulary for this dataset by calling the d2l.load_data_ptb function, which was described in
Section 15.3.
import math
import torch
from torch import nn
from d2l import torch as d2l

batch_size, max_window_size, num_noise_words = 512, 5, 5
data_iter, vocab = d2l.load_data_ptb(batch_size, max_window_size,
                                     num_noise_words)

15.4.1 The Skip-Gram Model
We implement the skip-gram model by using embedding layers and batch matrix multiplications.
First, letʼs review how embedding layers work.
Embedding Layer
As described in Section 10.7, an embedding layer maps a tokenʼs index to its feature vector. The
weight of this layer is a matrix whose number of rows equals the dictionary size (num_embeddings)
and whose number of columns equals the vector dimension for each token (embedding_dim). After
a word embedding model is trained, this weight is what we need.
embed = nn.Embedding(num_embeddings=20, embedding_dim=4)

print(f'Parameter embedding_weight ({embed.weight.shape}, '
f'dtype={embed.weight.dtype})')
Parameter embedding_weight (torch.Size([20, 4]), dtype=torch.float32)
The input of an embedding layer is the index of a token (word). For any token index $i$, its vector
representation can be obtained from the $i$th row of the weight matrix in the embedding layer. Since
the vector dimension (embedding_dim) was set to 4, the embedding layer returns vectors with shape
(2, 3, 4) for a minibatch of token indices with shape (2, 3).
x = torch.tensor([[1, 2, 3], [4, 5, 6]])

embed(x)
tensor([[[ 8.9592e-01, 8.7804e-01, 1.8486e+00, 1.6420e-01],

[-7.0451e-01, 7.3105e-02, 1.0973e+00, 1.3418e+00],
[-2.1124e-01, 1.5728e-01, -4.1019e-01, -1.4305e-01]],
[[ 1.9901e-01, 1.7650e-05, -7.2690e-01, -7.6560e-01],

[-3.9458e-01, 3.9646e-01, -5.9155e-01, -3.2161e-01],
[ 7.9045e-01, 1.2073e+00, -1.4538e+00, -1.2140e+00]]],
grad_fn=<EmbeddingBackward0>)
Defining the Forward Propagation
In the forward propagation, the input of the skip-gram model includes the center word indices
center of shape (batch size, 1) and the concatenated context and noise word indices
contexts_and_negatives of shape (batch size, max_len), where max_len is defined in Section 15.3.5.
These two variables are first transformed from the token indices into vectors via the embedding
layer, then their batch matrix multiplication (described in Section 11.2.4) returns an output of
shape (batch size, 1, max_len). Each element in the output is the dot product of a center word
vector and a context or noise word vector.
def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)
    u = embed_u(contexts_and_negatives)
    pred = torch.bmm(v, u.permute(0, 2, 1))
    return pred
Letʼs print the output shape of this skip_gram function for some example inputs.
skip_gram(torch.ones((2, 1), dtype=torch.long),
          torch.ones((2, 4), dtype=torch.long), embed, embed).shape
torch.Size([2, 1, 4])
15.4.2 Training
Before training the skip-gram model with negative sampling, letʼs first define its loss function.
Binary Cross-Entropy Loss
According to the definition of the loss function for negative sampling in Section 15.2.1, we will use
the binary cross-entropy loss.
class SigmoidBCELoss(nn.Module):
    # Binary cross-entropy loss with masking
    def __init__(self):
        super().__init__()

    def forward(self, inputs, target, mask=None):
        out = nn.functional.binary_cross_entropy_with_logits(
            inputs, target, weight=mask, reduction="none")
        return out.mean(dim=1)

loss = SigmoidBCELoss()
Recall our descriptions of the mask variable and the label variable in Section 15.3.5. The following
calculates the binary cross-entropy loss for the given variables.
pred = torch.tensor([[1.1, -2.2, 3.3, -4.4]] * 2)

label = torch.tensor([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
loss(pred, label, mask) * mask.shape[1] / mask.sum(axis=1)
tensor([0.9352, 1.8462])
The following shows how the above results are calculated (in a less efficient way) using the
sigmoid activation function in the binary cross-entropy loss. We can consider the two outputs as
two normalized losses that are averaged over non-masked predictions.
def sigmd(x):
    return -math.log(1 / (1 + math.exp(-x)))

print(f'{(sigmd(1.1) + sigmd(2.2) + sigmd(-3.3) + sigmd(4.4)) / 4:.4f}')
print(f'{(sigmd(-1.1) + sigmd(-2.2)) / 2:.4f}')
0.9352
1.8462
We define two embedding layers for all the words in the vocabulary when they are used as center
words and context words, respectively. The word vector dimension embed_size is set to 100.
embed_size = 100
net = nn.Sequential(nn.Embedding(num_embeddings=len(vocab),
                                 embedding_dim=embed_size),
                    nn.Embedding(num_embeddings=len(vocab),
                                 embedding_dim=embed_size))
Defining the Training Loop
The training loop is defined below. Because of the existence of padding, the calculation of the loss
function is slightly different compared to the previous training functions.
def train(net, data_iter, lr, num_epochs, device=d2l.try_gpu()):
    def init_weights(module):
        if type(module) == nn.Embedding:
            nn.init.xavier_uniform_(module.weight)
    net.apply(init_weights)
    net = net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs])
    # Sum of normalized losses, no. of normalized losses
    metric = d2l.Accumulator(2)
    for epoch in range(num_epochs):
        timer, num_batches = d2l.Timer(), len(data_iter)
        for i, batch in enumerate(data_iter):
            optimizer.zero_grad()
            center, context_negative, mask, label = [
                data.to(device) for data in batch]
            pred = skip_gram(center, context_negative, net[0], net[1])
            l = (loss(pred.reshape(label.shape).float(), label.float(), mask)
                 / mask.sum(axis=1) * mask.shape[1])
            l.sum().backward()
            optimizer.step()
            metric.add(l.sum(), l.numel())
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, '
          f'{metric[1] / timer.stop():.1f} tokens/sec on {str(device)}')
Now we can train a skip-gram model using negative sampling.

lr, num_epochs = 0.002, 5
train(net, data_iter, lr, num_epochs)
loss 0.411, 332449.0 tokens/sec on cuda:0
15.4.3 Applying Word Embeddings
After training the word2vec model, we can use the cosine similarity of word vectors from the
trained model to find words from the dictionary that are most semantically similar to an input
word.
def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data
    x = W[vocab[query_token]]
    # Compute the cosine similarity. Add 1e-9 for numerical stability
    cos = torch.mv(W, x) / torch.sqrt(torch.sum(W * W, dim=1) *
                                      torch.sum(x * x) + 1e-9)
    topk = torch.topk(cos, k=k+1)[1].cpu().numpy().astype('int32')
    for i in topk[1:]:  # Remove the input words
        print(f'cosine sim={float(cos[i]):.3f}: {vocab.to_tokens(i)}')

get_similar_tokens('chip', 3, net[0])
cosine sim=0.729: microprocessor

cosine sim=0.703: intel
cosine sim=0.653: bugs

15.4.4 Summary
• We can train a skip-gram model with negative sampling using embedding layers and the
binary cross-entropy loss.
• Applications of word embeddings include finding semantically similar words for a given
word based on the cosine similarity of word vectors.
Exercises
1. Using the trained model, find semantically similar words for other input words. Can you
improve the results by tuning hyperparameters?
2. When a training corpus is huge, we often sample context words and noise words for the
center words in the current minibatch when updating model parameters. In other words, the same
center word may have different context words or noise words in different training epochs.
What are the benefits of this method? Try to implement this training method.
Discussions
15.5 Word Embedding with Global Vectors (GloVe)
Word-word co-occurrences within context windows may carry rich semantic information. For
example, in a large corpus the word “solid” is more likely to co-occur with “ice” than with “steam”,
but the word “gas” probably co-occurs with “steam” more frequently than with “ice”. Besides, global
corpus statistics of such co-occurrences can be precomputed: this can lead to more efficient
training. To leverage statistical information in the entire corpus for word embedding, letʼs first
revisit the skip-gram model in Section 15.1.3, but interpret it using global corpus statistics such
as co-occurrence counts.
15.5.1 Skip-Gram with Global Corpus Statistics
Denoting by $q_{ij}$ the conditional probability $P(w_j \mid w_i)$ of word $w_j$ given word $w_i$ in
the skip-gram model, we have
$$q_{ij} = \frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{\sum_{k \in \mathcal{V}} \exp(\mathbf{u}_k^\top \mathbf{v}_i)}, \qquad (15.5.1)$$
where for any index $i$ vectors $\mathbf{v}_i$ and $\mathbf{u}_i$ represent word $w_i$ as the center
word and context word, respectively, and $\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}| - 1\}$ is the
index set of the vocabulary.
Consider word wi that may occur multiple times in the corpus. In the entire corpus, all the context
words wherever wi is taken as their center word form a multiset Ci of word indices that allows
for multiple instances of the same element. For any element, its number of instances is called its
multiplicity. To illustrate with an example, suppose that word wi occurs twice in the corpus and
indices of the context words that take wi as their center word in the two context windows are
k, j, m, k and k, l, k, j. Thus, multiset Ci = {j, j, k, k, k, k, l, m}, where multiplicities of elements
j, k, l, m are 2, 4, 1, 1, respectively.
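A multiset maps naturally onto Python's `collections.Counter`, whose values play the role of multiplicities. The following sketch reproduces the example above; the variable names `window_1`, `window_2`, and `C_i` are chosen here for illustration, and the letters stand in for word indices.

```python
import collections

# Context words in the two context windows where w_i is the center word
window_1 = ['k', 'j', 'm', 'k']
window_2 = ['k', 'l', 'k', 'j']

# A Counter acts as a multiset: keys are elements, values are multiplicities
C_i = collections.Counter(window_1 + window_2)
print(sorted(C_i.items()))
# Size of the multiset, counting repeated elements
print(sum(C_i.values()))
```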

Now letʼs denote the multiplicity of element $j$ in multiset $\mathcal{C}_i$ as $x_{ij}$. This is the
global co-occurrence count of word $w_j$ (as the context word) and word $w_i$ (as the center word)
in the same context window in the entire corpus. Using such global corpus statistics, the loss
function of the skip-gram model is equivalent to
$$-\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{V}} x_{ij} \log q_{ij}. \qquad (15.5.2)$$
We further denote by $x_i$ the number of all the context words in the context windows where $w_i$
occurs as their center word, which is equivalent to $|\mathcal{C}_i|$. Letting $p_{ij}$ be the
conditional probability $x_{ij}/x_i$ for generating context word $w_j$ given center word $w_i$,
(15.5.2) can be rewritten as
$$-\sum_{i \in \mathcal{V}} x_i \sum_{j \in \mathcal{V}} p_{ij} \log q_{ij}. \qquad (15.5.3)$$
In (15.5.3), $-\sum_{j \in \mathcal{V}} p_{ij} \log q_{ij}$ calculates the cross-entropy of the
conditional distribution $p_{ij}$ of global corpus statistics and the conditional distribution
$q_{ij}$ of model predictions. This loss is also weighted by $x_i$ as explained above. Minimizing the
loss function in (15.5.3) will allow the predicted conditional distribution to get close to the
conditional distribution from the global corpus statistics.
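The equivalence of (15.5.2) and (15.5.3) can be checked numerically on a toy example. The sketch below is only an illustration: the 3-word vocabulary, the count matrix `x`, and the `scores` standing in for the dot products in (15.5.1) are all made up.

```python
import math

# Hypothetical co-occurrence counts x[i][j] for a 3-word vocabulary
x = [[0, 2, 1],
     [2, 0, 3],
     [1, 3, 0]]
# Hypothetical dot products u_j^T v_i standing in for a trained model
scores = [[0.1, 0.5, -0.2],
          [0.4, 0.0, 0.9],
          [-0.3, 0.8, 0.2]]

def softmax(row):
    exps = [math.exp(s) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

q = [softmax(row) for row in scores]  # q[i][j] as in (15.5.1)
n = len(x)

# (15.5.2): -sum_i sum_j x_ij log q_ij
loss_counts = -sum(x[i][j] * math.log(q[i][j])
                   for i in range(n) for j in range(n))

# (15.5.3): -sum_i x_i sum_j p_ij log q_ij, with p_ij = x_ij / x_i
loss_weighted = 0.0
for i in range(n):
    x_i = sum(x[i])
    cross_entropy = -sum(x[i][j] / x_i * math.log(q[i][j]) for j in range(n))
    loss_weighted += x_i * cross_entropy

print(f'{loss_counts:.6f} {loss_weighted:.6f}')
```

Both expressions print the same value up to floating-point error, since multiplying each cross-entropy by x_i simply restores the raw counts x_ij.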
Although it is commonly used for measuring the distance between probability distributions, the
cross-entropy loss function may not be a good choice here. On the one hand, as we mentioned
in Section 15.2, properly normalizing $q_{ij}$ requires a sum over the entire vocabulary,
which can be computationally expensive. On the other hand, the cross-entropy loss often assigns
too much weight to the large number of rare events in a large corpus.
15.5.2 The GloVe Model
In view of this, the GloVe model makes three changes to the skip-gram model based on squared
loss (Pennington et al., 2014):
1. Use variables $p'_{ij} = x_{ij}$ and $q'_{ij} = \exp(\mathbf{u}_j^\top \mathbf{v}_i)$ that are not
probability distributions and take the logarithm of both, so the squared loss term is
$\left(\log p'_{ij} - \log q'_{ij}\right)^2 = \left(\mathbf{u}_j^\top \mathbf{v}_i - \log x_{ij}\right)^2$.
2. Add two scalar model parameters for each word $w_i$: the center word bias $b_i$ and the context
word bias $c_i$.
3. Replace the weight of each loss term with the weight function $h(x_{ij})$, where $h(x)$ is
increasing and takes values in the interval $[0, 1]$.
Putting all things together, training GloVe is to minimize the following loss function:
$$\sum_{i \in \mathcal{V}} \sum_{j \in \mathcal{V}} h(x_{ij}) \left(\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j - \log x_{ij}\right)^2. \qquad (15.5.4)$$
For the weight function, a suggested choice is: $h(x) = (x/c)^\alpha$ (e.g., $\alpha = 0.75$) if $x < c$ (e.g., $c = 100$);
otherwise h(x) = 1. In this case, because h(0) = 0, the squared loss term for any xij = 0 can
be omitted for computational efficiency. For example, when using minibatch stochastic gradient
descent for training, at each iteration we randomly sample a minibatch of non-zero xij to calculate
gradients and update the model parameters. Note that these non-zero xij are precomputed global
corpus statistics; thus, the model is called GloVe for Global Vectors.
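The suggested weight function and a single term of the loss in (15.5.4) can be sketched in a few lines. This is an illustrative sketch, not the reference GloVe implementation: the function names `h` and `glove_loss_term`, and the example vectors, are assumptions made here.

```python
import math

def h(x, c=100, alpha=0.75):
    """Weight function: (x / c) ** alpha if x < c, otherwise 1."""
    return (x / c) ** alpha if x < c else 1.0

def glove_loss_term(u_j, v_i, b_i, c_j, x_ij):
    """One weighted squared-loss term of (15.5.4); requires x_ij > 0."""
    dot = sum(a * b for a, b in zip(u_j, v_i))
    return h(x_ij) * (dot + b_i + c_j - math.log(x_ij)) ** 2

# The weight grows with the count and saturates at 1 once x >= c
for x in (0, 1, 10, 100, 1000):
    print(x, round(h(x), 4))

# Hypothetical vectors and biases for a single (i, j) pair with x_ij = 10
print(glove_loss_term([0.5, 1.0], [1.0, 1.0], 0.2, 0.1, 10))
```

Pairs with a zero count are skipped before calling `glove_loss_term`, because their weight h(0) is zero and log 0 is undefined.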
It should be emphasized that if word $w_i$ appears in the context window of word $w_j$, then $w_j$
also appears in the context window of $w_i$. Therefore, $x_{ij} = x_{ji}$. Unlike word2vec that fits
the asymmetric conditional probability $p_{ij}$, GloVe fits the symmetric $\log x_{ij}$. Therefore,
the center word vector and the context word vector of any word are mathematically equivalent in
the GloVe model. However, in practice, owing to different initialization values, the same word may
still get different values in these two vectors after training: GloVe sums them up as the output
vector.
15.5.3 Interpreting GloVe from the Ratio of Co-occurrence Probabilities
We can also interpret the GloVe model from another perspective. Using the same notation as in
Section 15.5.1, let $p_{ij} \stackrel{\text{def}}{=} P(w_j \mid w_i)$ be the conditional probability of
generating the context word $w_j$ given $w_i$ as the center word in the corpus. The following table
lists several co-occurrence probabilities given words “ice” and “steam” and their ratios based on
statistics from a large corpus.
w_k =                   solid      gas        water    fashion
p_1 = P(w_k | ice)      0.00019    0.000066   0.003    0.000017
p_2 = P(w_k | steam)    0.000022   0.00078    0.0022   0.000018
p_1 / p_2               8.9        0.085      1.36     0.96
Table: Word-word co-occurrence probabilities and their ratios from a large corpus (adapted from
Table 1 in (Pennington et al., 2014))
We can observe the following from the table:
• For a word $w_k$ that is related to “ice” but unrelated to “steam”, such as $w_k$ = solid, we expect
a larger ratio of co-occurrence probabilities, such as 8.9.
• For a word $w_k$ that is related to “steam” but unrelated to “ice”, such as $w_k$ = gas, we expect a
smaller ratio of co-occurrence probabilities, such as 0.085.
• For a word $w_k$ that is related to both “ice” and “steam”, such as $w_k$ = water, we expect a ratio
of co-occurrence probabilities that is close to 1, such as 1.36.
• For a word $w_k$ that is unrelated to both “ice” and “steam”, such as $w_k$ = fashion, we expect a
ratio of co-occurrence probabilities that is close to 1, such as 0.96.
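The ratios can be recomputed from the probabilities listed in the table. The sketch below does so; note that the listed probabilities are rounded, so the recomputed ratios only approximate the tabulated ones (the tabulated ratios were presumably obtained from unrounded statistics, which is an assumption here).

```python
words = ['solid', 'gas', 'water', 'fashion']
p_ice = [0.00019, 0.000066, 0.003, 0.000017]     # P(w_k | ice)
p_steam = [0.000022, 0.00078, 0.0022, 0.000018]  # P(w_k | steam)

ratios = [p1 / p2 for p1, p2 in zip(p_ice, p_steam)]
for word, ratio in zip(words, ratios):
    print(f'{word}: {ratio:.3g}')
```

The qualitative pattern is unchanged: the ratio is well above 1 for “solid”, well below 1 for “gas”, and close to 1 for “water” and “fashion”.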
It can be seen that the ratio of co-occurrence probabilities can intuitively express the relationship
between words. Thus, we can design a function of three word vectors to fit this ratio. For the
ratio of co-occurrence probabilities $p_{ij}/p_{ik}$ with $w_i$ being the center word and $w_j$ and
$w_k$ being the context words, we want to fit this ratio using some function $f$:
$$f(\mathbf{u}_j, \mathbf{u}_k, \mathbf{v}_i) \approx \frac{p_{ij}}{p_{ik}}. \qquad (15.5.5)$$
Among many possible designs for $f$, we only pick a reasonable choice in the following. Since
the ratio of co-occurrence probabilities is a scalar, we require that $f$ be a scalar function, such
as $f(\mathbf{u}_j, \mathbf{u}_k, \mathbf{v}_i) = f\left((\mathbf{u}_j - \mathbf{u}_k)^\top \mathbf{v}_i\right)$.
Switching word indices $j$ and $k$ in (15.5.5), it must hold that $f(x)f(-x) = 1$, so one possibility
is $f(x) = \exp(x)$, i.e.,
$$f(\mathbf{u}_j, \mathbf{u}_k, \mathbf{v}_i) = \frac{\exp\left(\mathbf{u}_j^\top \mathbf{v}_i\right)}{\exp\left(\mathbf{u}_k^\top \mathbf{v}_i\right)} \approx \frac{p_{ij}}{p_{ik}}. \qquad (15.5.6)$$
Now letʼs pick $\exp\left(\mathbf{u}_j^\top \mathbf{v}_i\right) \approx \alpha p_{ij}$, where $\alpha$ is
a constant. Since $p_{ij} = x_{ij}/x_i$, after taking the logarithm on both sides we get
$\mathbf{u}_j^\top \mathbf{v}_i \approx \log \alpha + \log x_{ij} - \log x_i$. We may use additional bias
terms to fit $-\log \alpha + \log x_i$, such as the center word bias $b_i$ and the context word bias $c_j$:
$$\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j \approx \log x_{ij}. \qquad (15.5.7)$$
Measuring the squared error of (15.5.7) with weights, the GloVe loss function in (15.5.4) is
obtained.
15.5.4 Summary
• The skip-gram model can be interpreted using global corpus statistics such as word-word
co-occurrence counts.
• The cross-entropy loss may not be a good choice for measuring the difference of two
probability distributions, especially for a large corpus. GloVe uses squared loss to fit precomputed
global corpus statistics.
• The center word vector and the context word vector are mathematically equivalent for any
word in GloVe.
• GloVe can be interpreted from the ratio of word-word co-occurrence probabilities.
Exercises
1. If words wi and wj co-occur in the same context window, how can we use their distance
in the text sequence to redesign the method for calculating the conditional probability pij ?
Hint: see Section 4.2 of the GloVe paper (Pennington et al., 2014).
2. For any word, are its center word bias and context word bias mathematically equivalent in
GloVe? Why?
Discussions
15.6 Subword Embedding
In English, words such as “helps”, “helped”, and “helping” are inflected forms of the same word
“help”. The relationship between “dog” and “dogs” is the same as that between “cat” and “cats”, and
the relationship between “boy” and “boyfriend” is the same as that between “girl” and “girlfriend”.
In other languages such as French and Spanish, many verbs have over 40 inflected forms, while in
Finnish, a noun may have up to 15 cases. In linguistics, morphology studies word formation and
word relationships. However, the internal structure of words was explored neither in word2vec
nor in GloVe.

15.6.1 The fastText Model
Recall how words are represented in word2vec. In both the skip-gram model and the continuous
bag-of-words model, different inflected forms of the same word are directly represented by
different vectors without shared parameters. To use morphological information, the fastText model
proposed a subword embedding approach, where a subword is a character n-gram (Bojanowski et
al., 2017). Instead of learning word-level vector representations, fastText can be considered as the
subword-level skip-gram, where each center word is represented by the sum of its subword vectors.
Letʼs illustrate how to obtain subwords for each center word in fastText using the word “where”.
First, add special characters “<” and “>” at the beginning and end of the word to distinguish
prefixes and suffixes from other subwords. Then, extract character n-grams from the word. For
example, when n = 3, we obtain all subwords of length 3: “<wh”, “whe”, “her”, “ere”, “re>”, and
the special subword “<where>”.
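The extraction of character n-grams can be sketched as follows. The helper name `char_ngrams` is an assumption made for this illustration and is not part of fastText or d2l.

```python
def char_ngrams(word, n):
    """Return the character n-grams of `word` wrapped in '<' and '>'."""
    wrapped = f'<{word}>'
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams('where', 3))
# The special subword is the whole wrapped word
print(f'<{"where"}>')
```

For n = 3 this yields the five subwords listed above; the special subword “<where>” is added separately.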
In fastText, for any word $w$, denote by $\mathcal{G}_w$ the union of all its subwords of length
between 3 and 6 and its special subword. The vocabulary is the union of the subwords of all words.
Letting $\mathbf{z}_g$ be the vector of subword $g$ in the dictionary, the vector $\mathbf{v}_w$ for
word $w$ as a center word in the skip-gram model is the sum of its subword vectors:
$$\mathbf{v}_w = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g. \qquad (15.6.1)$$
The rest of fastText is the same as the skip-gram model. Compared with the skip-gram model,
the vocabulary in fastText is larger, resulting in more model parameters. Besides, to calculate the
representation of a word, all its subword vectors have to be summed, leading to higher
computational complexity. However, thanks to shared parameters from subwords among words with
similar structures, rare words and even out-of-vocabulary words may obtain better vector
representations in fastText.
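The parameter sharing behind (15.6.1) can be made concrete with a small sketch. Everything here is illustrative: the helpers `subwords`, `subword_vector`, and `word_vector` are hypothetical names, and the subword vectors are random stand-ins for learned parameters.

```python
import random

def subwords(word, min_n=3, max_n=6):
    """Subwords of length min_n..max_n plus the special whole-word subword."""
    wrapped = f'<{word}>'
    grams = {wrapped[i:i + n]
             for n in range(min_n, max_n + 1)
             for i in range(len(wrapped) - n + 1)}
    grams.add(wrapped)  # the special subword
    return grams

random.seed(0)
dim = 4
z = {}  # subword -> vector; random stand-ins for learned parameters

def subword_vector(g):
    if g not in z:
        z[g] = [random.gauss(0, 1) for _ in range(dim)]
    return z[g]

def word_vector(word):
    """Sum of the subword vectors of `word`, as in (15.6.1)."""
    vec = [0.0] * dim
    for g in subwords(word):
        vec = [a + b for a, b in zip(vec, subword_vector(g))]
    return vec

# Inflected forms share subwords, and hence parameters
shared = subwords('helps') & subwords('helped')
print(sorted(shared))
print(word_vector('helps'))
```

Because “helps” and “helped” share several subwords, part of each word vector comes from the same parameters, which is how a rare or unseen inflected form can still obtain a meaningful representation.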
15.6.2 Byte Pair Encoding
In fastText, all the extracted subwords have to be of the specified lengths, such as 3 to 6; thus the
vocabulary size cannot be predefined. To allow for variable-length subwords in a fixed-size
vocabulary, we can apply a compression algorithm called byte pair encoding (BPE) to extract
subwords (Sennrich et al., 2015).
Byte pair encoding performs a statistical analysis of the training dataset to discover common
symbols within a word, such as consecutive characters of arbitrary length. Starting from symbols of
length 1, byte pair encoding iteratively merges the most frequent pair of consecutive symbols to
produce new longer symbols. Note that for efficiency, pairs crossing word boundaries are not
considered. In the end, we can use such symbols as subwords to segment words. Byte pair encoding
and its variants have been used for input representations in popular natural language processing
pretraining models such as GPT-2 (Radford et al., 2019) and RoBERTa (Liu et al., 2019). In the
following, we will illustrate how byte pair encoding works.
First, we initialize the vocabulary of symbols as all the English lowercase characters, a special
end-of-word symbol '_', and a special unknown symbol '[UNK]'.
import collections

symbols = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
           '_', '[UNK]']
Since we do not consider symbol pairs that cross boundaries of words, we only need a dictionary
raw_token_freqs that maps words to their frequencies (number of occurrences) in a dataset. Note
that the special symbol '_' is appended to each word so that we can easily recover a word sequence
(e.g., “a taller man”) from a sequence of output symbols (e.g., “a_ tall er_ man”). Since we start
the merging process from a vocabulary of only single characters and special symbols, space is
inserted between every pair of consecutive characters within each word (keys of the dictionary
token_freqs). In other words, space is the delimiter between symbols within a word.
raw_token_freqs = {'fast_': 4, 'faster_': 3, 'tall_': 5, 'taller_': 4}
token_freqs = {}
for token, freq in raw_token_freqs.items():
    token_freqs[' '.join(list(token))] = raw_token_freqs[token]
token_freqs
{'f a s t _': 4, 'f a s t e r _': 3, 't a l l _': 5, 't a l l e r _': 4}
We define the following get_max_freq_pair function that returns the most frequent pair of
consecutive symbols within a word, where words come from keys of the input dictionary token_freqs.
def get_max_freq_pair(token_freqs):
    pairs = collections.defaultdict(int)
    for token, freq in token_freqs.items():
        symbols = token.split()
        for i in range(len(symbols) - 1):
            # Key of `pairs` is a tuple of two consecutive symbols
            pairs[symbols[i], symbols[i + 1]] += freq
    return max(pairs, key=pairs.get)  # Key of `pairs` with the max value
As a greedy approach based on frequency of consecutive symbols, byte pair encoding will use
the following merge_symbols function to merge the most frequent pair of consecutive symbols to
produce new symbols.
def merge_symbols(max_freq_pair, token_freqs, symbols):
    symbols.append(''.join(max_freq_pair))
    new_token_freqs = dict()
    for token, freq in token_freqs.items():
        new_token = token.replace(' '.join(max_freq_pair),
                                  ''.join(max_freq_pair))
        new_token_freqs[new_token] = token_freqs[token]
    return new_token_freqs
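One caveat of merging with plain str.replace is that it matches characters rather than whole symbols, so on other datasets a pair such as ('a', 's') could also match across the boundary of already merged symbols like 'fa' and 'st'. This does not happen in the toy run of this section. The sketch below shows one possible boundary-aware variant based on a regular expression; the function name merge_pair_safe is hypothetical and is not part of the code above.

```python
import re

def merge_pair_safe(pair, token_freqs):
    """Merge `pair` only where both parts are whole space-delimited symbols.

    Hypothetical helper for illustration; not the function used in the text.
    """
    bigram = re.escape(' '.join(pair))
    # The lookarounds require whitespace (or a string edge) on both sides.
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(lambda m: merged, token): freq
            for token, freq in token_freqs.items()}

toy = {'fa st _': 1, 'f a s t _': 2}
print(merge_pair_safe(('a', 's'), toy))
# Plain replacement would also glue 'fa' and 'st' together:
print('fa st _'.replace('a s', 'as'))
```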
Now we iteratively perform the byte pair encoding algorithm over the keys of the dictionary
token_freqs. In the first iteration, the most frequent pair of consecutive symbols is 't' and 'a',
thus byte pair encoding merges them to produce a new symbol 'ta'. In the second iteration, byte
pair encoding continues to merge 'ta' and 'l' to result in another new symbol 'tal'.

num_merges = 10
for i in range(num_merges):
    max_freq_pair = get_max_freq_pair(token_freqs)
    token_freqs = merge_symbols(max_freq_pair, token_freqs, symbols)
    print(f'merge #{i + 1}:', max_freq_pair)
merge #1: ('t', 'a')
merge #2: ('ta', 'l')
merge #3: ('tal', 'l')
merge #4: ('f', 'a')
merge #5: ('fa', 's')
merge #6: ('fas', 't')
merge #7: ('e', 'r')
merge #8: ('er', '_')
merge #9: ('tall', '_')
merge #10: ('fast', '_')
After 10 iterations of byte pair encoding, we can see that list symbols now contains 10 more sym-
bols that are iteratively merged from other symbols.
print(symbols)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '_', '[UNK]', 'ta', 'tal', 'tall', 'fa', 'fas', 'fast', 'er', 'er_', 'tall_', 'fast_']
For the same dataset specified in the keys of the dictionary raw_token_freqs, each word in the
dataset is now segmented by subwords “fast_”, “fast”, “er_”, “tall_”, and “tall” as a result of the byte
pair encoding algorithm. For instance, words “faster_” and “taller_” are segmented as “fast er_”
and “tall er_”, respectively.
print(list(token_freqs.keys()))
['fast_', 'fast er_', 'tall_', 'tall er_']
Note that the result of byte pair encoding depends on the dataset being used. We can also use the
subwords learned from one dataset to segment words of another dataset. As a greedy approach,
the following segment_BPE function tries to break words into the longest possible subwords from
the input argument symbols.
def segment_BPE(tokens, symbols):
    outputs = []
    for token in tokens:
        start, end = 0, len(token)
        cur_output = []
        # Segment token with the longest possible subwords from symbols
        while start < len(token) and start < end:
            if token[start: end] in symbols:
                cur_output.append(token[start: end])
                start = end
                end = len(token)
            else:
                end -= 1
        if start < len(token):
            cur_output.append('[UNK]')
        outputs.append(' '.join(cur_output))
    return outputs
In the following, we use the subwords in list symbols, which is learned from the aforementioned
dataset, to segment tokens that represent another dataset.
tokens = ['tallest_', 'fatter_']

print(segment_BPE(tokens, symbols))
['tall e s t _', 'fa t t er_']
15.6.3 Summary
• The fastText model proposes a subword embedding approach. Based on the skip-gram
model in word2vec, it represents a center word as the sum of its subword vectors.
• Byte pair encoding performs a statistical analysis of the training dataset to discover common
symbols within a word. As a greedy approach, byte pair encoding iteratively merges the most
frequent pair of consecutive symbols.
• Subword embedding may improve the quality of representations of rare words and out-of-
dictionary words.
Exercises
1. As an example, there are about 3 × 10^8 possible 6-grams in English. What is the issue when
there are too many subwords? How to address the issue? Hint: refer to the end of Section
3.2 of the fastText paper (Bojanowski et al., 2017).
2. How to design a subword embedding model based on the continuous bag-of-words model?
3. To get a vocabulary of size m, how many merging operations are needed when the initial
symbol vocabulary size is n?
4. How to extend the idea of byte pair encoding to extract phrases?
Discussions

15.7 Word Similarity and Analogy
In Section 15.4, we trained a word2vec model on a small dataset, and applied it to find seman-
tically similar words for an input word. In practice, word vectors that are pretrained on large
corpora can be applied to downstream natural language processing tasks, which will be covered
later in Chapter 16. To demonstrate semantics of pretrained word vectors from large corpora in a
straightforward way, letʼs apply them in the word similarity and analogy tasks.
import os
import torch
from d2l import torch as d2l
15.7.1 Loading Pretrained Word Vectors
The following are pretrained GloVe embeddings of dimension 50, 100, and 300, which can be
downloaded from the GloVe website (https://nlp.stanford.edu/projects/glove/). The pretrained
fastText embeddings are available in multiple languages. Here we consider one English version
(300-dimensional “wiki.en”) that can be downloaded from the fastText website (https://fasttext.cc/).
#@save
d2l.DATA_HUB['glove.6b.50d'] = (d2l.DATA_URL + 'glove.6B.50d.zip',
                                '0b8703943ccdb6eb788e6f091b8946e82231bc4d')
#@save
d2l.DATA_HUB['glove.6b.100d'] = (d2l.DATA_URL + 'glove.6B.100d.zip',
                                 'cd43bfb07e44e6f27cbcc7bc9ae3d80284fdaf5a')
#@save
d2l.DATA_HUB['glove.42b.300d'] = (d2l.DATA_URL + 'glove.42B.300d.zip',
                                  'b5116e234e9eb9076672cfeabf5469f3eec904fa')
#@save
d2l.DATA_HUB['wiki.en'] = (d2l.DATA_URL + 'wiki.en.zip',
                           'c1816da3821ae9f43899be655002f6c723e91b88')
To load these pretrained GloVe and fastText embeddings, we define the following TokenEmbedding
class.
#@save
class TokenEmbedding:
    """Token Embedding."""
    def __init__(self, embedding_name):
        self.idx_to_token, self.idx_to_vec = self._load_embedding(
            embedding_name)
        self.unknown_idx = 0
        self.token_to_idx = {token: idx for idx, token in
                             enumerate(self.idx_to_token)}

    def _load_embedding(self, embedding_name):
        idx_to_token, idx_to_vec = ['<unk>'], []
        data_dir = d2l.download_extract(embedding_name)
        # GloVe website: https://nlp.stanford.edu/projects/glove/
        # fastText website: https://fasttext.cc/
        with open(os.path.join(data_dir, 'vec.txt'), 'r') as f:
            for line in f:
                elems = line.rstrip().split(' ')
                token, elems = elems[0], [float(elem) for elem in elems[1:]]
                # Skip header information, such as the top row in fastText
                if len(elems) > 1:
                    idx_to_token.append(token)
                    idx_to_vec.append(elems)
        # Index 0 is the unknown token, represented by an all-zero vector
        idx_to_vec = [[0] * len(idx_to_vec[0])] + idx_to_vec
        return idx_to_token, torch.tensor(idx_to_vec)

    def __getitem__(self, tokens):
        indices = [self.token_to_idx.get(token, self.unknown_idx)
                   for token in tokens]
        vecs = self.idx_to_vec[torch.tensor(indices)]
        return vecs

    def __len__(self):
        return len(self.idx_to_token)
Below we load the 50-dimensional GloVe embeddings (pretrained on a Wikipedia subset). When
creating the TokenEmbedding instance, the specified embedding file is downloaded if it has not
been downloaded yet.
glove_6b50d = TokenEmbedding('glove.6b.50d')
Output the vocabulary size. The vocabulary contains 400000 words (tokens) and a special unknown
token.
len(glove_6b50d)
400001
We can get the index of a word in the vocabulary, and vice versa.
glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367]
(3367, 'beautiful')
15.7.2 Applying Pretrained Word Vectors
Using the loaded GloVe vectors, we will demonstrate their semantics by applying them in the fol-
lowing word similarity and analogy tasks.
Word Similarity
Similar to Section 15.4.3, in order to find semantically similar words for an input word based on
cosine similarities between word vectors, we implement the following knn (k-nearest neighbors)
function.
def knn(W, x, k):
    # Add 1e-9 for numerical stability
    cos = torch.mv(W, x.reshape(-1,)) / (
        torch.sqrt(torch.sum(W * W, axis=1) + 1e-9) *
        torch.sqrt((x * x).sum()))
    _, topk = torch.topk(cos, k=k)
    return topk, [cos[int(i)] for i in topk]
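The small constant in knn matters because the vector of the unknown token is all zeros, so its norm is zero. The following dependency-free sketch mirrors the same formula on a hand-made matrix; the helper cosine and the toy values of W and x are assumptions introduced for illustration only.

```python
import math

def cosine(u, v, eps=1e-9):
    """Cosine similarity of `u` and `v`; `eps` guards against an all-zero `u`."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u) + eps)
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Row 0 mimics the all-zero vector of the unknown token.
W = [[0.0, 0.0], [1.0, 0.0], [0.6, 0.8], [-1.0, 0.0]]
x = [1.0, 0.0]

scores = [cosine(row, x) for row in W]
# Indices of the 2 rows most similar to `x`
topk = sorted(range(len(W)), key=lambda i: scores[i], reverse=True)[:2]
print(topk, [round(scores[i], 3) for i in topk])
```

Without the constant, the zero row would lead to a division by zero. With it, that row simply receives a similarity of zero.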
Then, we search for similar words using the pretrained word vectors from the TokenEmbedding
instance embed.
def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec, embed[[query_token]], k + 1)
    for i, c in zip(topk[1:], cos[1:]):  # Exclude the input word
        print(f'cosine sim={float(c):.3f}: {embed.idx_to_token[int(i)]}')
The vocabulary of the pretrained word vectors in glove_6b50d contains 400000 words and a special
unknown token. Excluding the input word and the unknown token, letʼs find the three words in this
vocabulary that are most semantically similar to the word “chip”.
get_similar_tokens('chip', 3, glove_6b50d)
cosine sim=0.856: chips

cosine sim=0.749: intel
cosine sim=0.749: electronics
Below we output words similar to “baby” and “beautiful”.
get_similar_tokens('baby', 3, glove_6b50d)
cosine sim=0.839: babies

cosine sim=0.800: boy
cosine sim=0.792: girl
get_similar_tokens('beautiful', 3, glove_6b50d)

cosine sim=0.921: lovely
cosine sim=0.893: gorgeous
cosine sim=0.830: wonderful
Word Analogy
Besides finding similar words, we can also apply word vectors to word analogy tasks. For example,
“man”:“woman”::“son”:“daughter” is the form of a word analogy: “man” is to “woman” as “son”
is to “daughter”. Specifically, the word analogy completion task can be defined as: for a word
analogy a : b :: c : d, given the first three words a, b and c, find d. Denote the vector of word w by
vec(w). To complete the analogy, we will find the word whose vector is most similar to the result
of vec(c) + vec(b) − vec(a).
def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed[[token_a, token_b, token_c]]
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.idx_to_vec, x, 1)
    # Map the index of the nearest vector back to its token
    return embed.idx_to_token[int(topk[0])]
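The arithmetic vec(c) + vec(b) − vec(a) can be traced on a tiny example. The sketch below uses hand-picked two-dimensional vectors that are purely hypothetical (they are not GloVe values), and a helper named analogy introduced only for illustration. Unlike get_analogy above, it skips the three query words when searching, which is an added assumption that reflects a common convention rather than the code in this section.

```python
import math

# Hypothetical 2-D vectors, chosen by hand so that the offset
# man -> woman equals the offset son -> daughter.
vectors = {
    'man': [1.0, 0.0], 'woman': [1.0, 1.0],
    'son': [2.0, 0.0], 'daughter': [2.0, 1.0],
    'tree': [-1.0, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(c) + vec(b) - vec(a), skipping a, b, c."""
    target = [vc + vb - va for va, vb, vc in
              zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy('man', 'woman', 'son', vectors))
```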
Letʼs verify the “male-female” analogy using the loaded word vectors.
get_analogy('man', 'woman', 'son', glove_6b50d)
'daughter'
Below we complete a “capital-country” analogy: “beijing”:“china”::“tokyo”:“japan”. This
demonstrates semantics in the pretrained word vectors.
get_analogy('beijing', 'china', 'tokyo', glove_6b50d)
'japan'
For the “adjective-superlative adjective” analogy such as “bad”:“worst”::“big”:“biggest”, we can see
that the pretrained word vectors may capture the syntactic information.
get_analogy('bad', 'worst', 'big', glove_6b50d)
'biggest'
To show the captured notion of past tense in the pretrained word vectors, we can test the syntax
using the “present tense-past tense” analogy: “do”:“did”::“go”:“went”.
get_analogy('do', 'did', 'go', glove_6b50d)
'went'
15.7.3 Summary
• In practice, word vectors that are pretrained on large corpora can be applied to downstream
natural language processing tasks.
• Pretrained word vectors can be applied to the word similarity and analogy tasks.
Exercises
1. Test the fastText results using TokenEmbedding('wiki.en').

2. When the vocabulary is extremely large, how can we find similar words or complete a word
analogy faster?
Discussions
15.8 Bidirectional Encoder Representations from Transformers (BERT)
We have introduced several word embedding models for natural language understanding. Af-
ter pretraining, the output can be thought of as a matrix where each row is a vector that repre-
sents a word of a predefined vocabulary. In fact, these word embedding models are all context-
independent. Letʼs begin by illustrating this property.
15.8.1 From Context-Independent to Context-Sensitive
Recall the experiments in Section 15.4 and Section 15.7. For instance, word2vec and GloVe both
assign the same pretrained vector to the same word regardless of the context of the word (if any).
Formally, a context-independent representation of any token x is a function f (x) that only takes
x as its input. Given the abundance of polysemy and complex semantics in natural languages,
context-independent representations have obvious limitations. For instance, the word “crane” in
contexts “a crane is flying” and “a crane driver came” has completely different meanings; thus,
the same word should be assigned different representations depending on its context.
This motivates the development of context-sensitive word representations, where representations
of words depend on their contexts. Hence, a context-sensitive representation of token x is a func-
tion f (x, c(x)) depending on both x and its context c(x). Popular context-sensitive representations
include TagLM (language-model-augmented sequence tagger) (Peters et al., 2017), CoVe (Context
Vectors) (McCann et al., 2017), and ELMo (Embeddings from Language Models) (Peters et al., 2018).
For example, by taking the entire sequence as input, ELMo is a function that assigns a representa-
tion to each word from the input sequence. Specifically, ELMo combines all the intermediate layer
representations from pretrained bidirectional LSTM as the output representation. Then the ELMo
representation will be added to a downstream taskʼs existing supervised model as additional fea-
tures, such as by concatenating ELMo representation and the original representation (e.g., GloVe)
of tokens in the existing model. On the one hand, all the weights in the pretrained bidirectional
LSTM model are frozen after ELMo representations are added. On the other hand, the existing
supervised model is specifically customized for a given task. Leveraging different best models
for different tasks at that time, adding ELMo improved the state of the art across six natural lan-
guage processing tasks: sentiment analysis, natural language inference, semantic role labeling,
coreference resolution, named entity recognition, and question answering.
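To make the distinction between f(x) and f(x, c(x)) concrete, consider the following toy sketch. Everything in it is hypothetical: the two-dimensional vectors carry no meaning, and f_sensitive simply mixes a token's vector with the average of the other tokens in the sentence. It is not how ELMo or any other model discussed here computes representations; it only illustrates that a function of the context can separate the two uses of “crane”.

```python
# Hypothetical 2-D vectors; the numbers carry no meaning.
table = {'a': [0.0, 0.0], 'crane': [1.0, 1.0], 'is': [0.0, 2.0],
         'flying': [4.0, 0.0], 'driver': [0.0, 4.0], 'came': [2.0, 0.0]}

def f_independent(tokens, i):
    """f(x): depends only on the token itself."""
    return table[tokens[i]]

def f_sensitive(tokens, i):
    """Toy f(x, c(x)): average the token vector with the mean context vector."""
    context = [table[t] for j, t in enumerate(tokens) if j != i]
    ctx_mean = [sum(col) / len(context) for col in zip(*context)]
    return [(x + c) / 2 for x, c in zip(table[tokens[i]], ctx_mean)]

s1 = 'a crane is flying'.split()
s2 = 'a crane driver came'.split()
# Same vector for "crane" in both sentences
print(f_independent(s1, 1) == f_independent(s2, 1))
# Different vectors once the context enters the function
print(f_sensitive(s1, 1) == f_sensitive(s2, 1))
```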
15.8.2 From Task-Specific to Task-Agnostic
Although ELMo has significantly improved solutions to a diverse set of natural language process-
ing tasks, each solution still hinges on a task-specific architecture. However, it is practically non-
trivial to craft a specific architecture for every natural language processing task. The GPT (Gen-
erative Pre-Training) model represents an effort in designing a general task-agnostic model for
context-sensitive representations (Radford et al., 2018). Built on a transformer decoder, GPT pre-
trains a language model that will be used to represent text sequences. When applying GPT to a
downstream task, the output of the language model will be fed into an added linear output layer to
predict the label of the task. In sharp contrast to ELMo that freezes parameters of the pretrained
model, GPT fine-tunes all the parameters in the pretrained transformer decoder during super-
vised learning of the downstream task. GPT was evaluated on twelve tasks of natural language
inference, question answering, sentence similarity, and classification, and improved the state of
the art in nine of them with minimal changes to the model architecture.
However, due to the autoregressive nature of language models, GPT only looks forward (left-to-
right). In contexts “i went to the bank to deposit cash” and “i went to the bank to sit down”, as
“bank” is sensitive to the context to its left, GPT will return the same representation for “bank”,
though it has different meanings.
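The limitation can be checked directly on the two example sentences: everything a left-to-right model has seen by the time it reaches “bank” is identical in both. The helper names left_context and right_context below are introduced for illustration only.

```python
s1 = 'i went to the bank to deposit cash'.split()
s2 = 'i went to the bank to sit down'.split()

def left_context(tokens, word):
    """Tokens visible to a left-to-right model when encoding `word` (inclusive)."""
    return tokens[:tokens.index(word) + 1]

def right_context(tokens, word):
    """Tokens after `word`, which a left-to-right model cannot use."""
    return tokens[tokens.index(word) + 1:]

print(left_context(s1, 'bank') == left_context(s2, 'bank'))
print(right_context(s1, 'bank'), right_context(s2, 'bank'))
```

The disambiguating words (“deposit cash” versus “sit down”) appear only in the right context, which is why a bidirectional encoder is needed to tell the two meanings apart.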
15.8.3 BERT: Combining the Best of Both Worlds
As we have seen, ELMo encodes context bidirectionally but uses task-specific architectures; while
GPT is task-agnostic but encodes context left-to-right. Combining the best of both worlds, BERT
(Bidirectional Encoder Representations from Transformers) encodes context bidirectionally and
requires minimal architecture changes for a wide range of natural language processing tasks (De-
vlin et al., 2018). Using a pretrained transformer encoder, BERT is able to represent any token
based on its bidirectional context. During supervised learning of downstream tasks, BERT is sim-
ilar to GPT in two aspects. First, BERT representations will be fed into an added output layer, with
minimal changes to the model architecture depending on nature of tasks, such as predicting for
every token vs. predicting for the entire sequence. Second, all the parameters of the pretrained
transformer encoder are fine-tuned, while the additional output layer will be trained from scratch.
Fig. 15.8.1 depicts the differences among ELMo, GPT, and BERT.
Fig. 15.8.1: A comparison of ELMo, GPT, and BERT.
BERT further improved the state of the art on eleven natural language processing tasks under
broad categories of (i) single text classification (e.g., sentiment analysis), (ii) text pair
classification (e.g., natural language inference), (iii) question answering, and (iv) text tagging
(e.g., named entity recognition). All proposed in 2018, from context-sensitive ELMo to
task-agnostic GPT and BERT, conceptually simple yet empirically powerful pretraining of deep
representations for natural languages has revolutionized solutions to various natural language
processing tasks.
In the rest of this chapter, we will dive into the pretraining of BERT. When natural language pro-
cessing applications are explained in Chapter 16, we will illustrate fine-tuning of BERT for down-
stream applications.
import torch
from torch import nn
from d2l import torch as d2l
15.8.4 Input Representation
In natural language processing, some tasks (e.g., sentiment analysis) take single text as input,
while in some other tasks (e.g., natural language inference), the input is a pair of text sequences.
The BERT input sequence unambiguously represents both single text and text pairs. In the former,
the BERT input sequence is the concatenation of the special classification token “<cls>”, tokens of
a text sequence, and the special separation token “<sep>”. In the latter, the BERT input sequence
is the concatenation of “<cls>”, tokens of the first text sequence, “<sep>”, tokens of the second text
sequence, and “<sep>”. We will consistently distinguish the terminology “BERT input sequence”
from other types of “sequences”. For instance, one BERT input sequence may include either one
text sequence or two text sequences.
To distinguish text pairs, the learned segment embeddings eA and eB are added to the token em-
beddings of the first sequence and the second sequence, respectively. For single text inputs, only
eA is used.

The following get_tokens_and_segments takes either one sentence or two sentences as input, then
returns tokens of the BERT input sequence and their corresponding segment IDs.
#@save
def get_tokens_and_segments(tokens_a, tokens_b=None):
    """Get tokens of the BERT input sequence and their segment IDs."""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # 0 and 1 are marking segment A and B, respectively
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments
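As a quick check of the two input formats, the function can be called on a single text and on a text pair. The block below repeats the function so that it runs on its own; the example sentences are made up for illustration.

```python
def get_tokens_and_segments(tokens_a, tokens_b=None):
    """Get tokens of the BERT input sequence and their segment IDs."""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # 0 and 1 are marking segment A and B, respectively
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments

single = get_tokens_and_segments(['this', 'movie', 'is', 'great'])
pair = get_tokens_and_segments(['i', 'like', 'it'], ['me', 'too'])
print(single)
print(pair)
```

Note that the “<sep>” token closing the second text is counted as part of segment B, while “<cls>” and the first “<sep>” belong to segment A.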
BERT chooses the transformer encoder as its bidirectional architecture. Common in the trans-
former encoder, positional embeddings are added at every position of the BERT input sequence.
However, different from the original transformer encoder, BERT uses learnable positional embed-
dings. To sum up, Fig. 15.8.2 shows that the embeddings of the BERT input sequence are the sum
of the token embeddings, segment embeddings, and positional embeddings.
Fig. 15.8.2: The embeddings of the BERT input sequence are the sum of the token embeddings,
segment embeddings, and positional embeddings.
The following BERTEncoder class is similar to the TransformerEncoder class as implemented in Sec-
tion 11.7. Different from TransformerEncoder, BERTEncoder uses segment embeddings and learn-
able positional embeddings.
#@save
class BERTEncoder(nn.Module):
    """BERT encoder."""
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
                 num_blks, dropout, max_len=1000, **kwargs):
        super(BERTEncoder, self).__init__(**kwargs)
        self.token_embedding = nn.Embedding(vocab_size, num_hiddens)
        self.segment_embedding = nn.Embedding(2, num_hiddens)
        # Stack `num_blks` transformer encoder blocks
        self.blks = nn.Sequential()
        for i in range(num_blks):
            self.blks.add_module(f"{i}", d2l.TransformerEncoderBlock(
                num_hiddens, ffn_num_hiddens, num_heads, dropout, True))
        # In BERT, positional embeddings are learnable, thus we create a
        # parameter of positional embeddings that are long enough
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len,
                                                      num_hiddens))

    def forward(self, tokens, segments, valid_lens):
        # Shape of `X` remains unchanged in the following code snippet:
        # (batch size, max sequence length, `num_hiddens`)
        X = self.token_embedding(tokens) + self.segment_embedding(segments)
        X = X + self.pos_embedding[:, :X.shape[1], :]
        for blk in self.blks:
            X = blk(X, valid_lens)
        return X
Suppose that the vocabulary size is 10000. To demonstrate forward inference of BERTEncoder, letʼs
create an instance of it and initialize its parameters.
vocab_size, num_hiddens, ffn_num_hiddens, num_heads = 10000, 768, 1024, 4

ffn_num_input, num_blks, dropout = 768, 2, 0.2
encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens, num_heads,
num_blks, dropout)
We define tokens to be 2 BERT input sequences of length 8, where each token is an index of the vo-
cabulary. The forward inference of BERTEncoder with the input tokens returns the encoded result
where each token is represented by a vector whose length is predefined by the hyperparameter
num_hiddens. This hyperparameter is usually referred to as the hidden size (number of hidden
units) of the transformer encoder.
tokens = torch.randint(0, vocab_size, (2, 8))

segments = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1, 1, 1]])
encoded_X = encoder(tokens, segments, None)
encoded_X.shape
torch.Size([2, 8, 768])
15.8.5 Pretraining Tasks
The forward inference of BERTEncoder gives the BERT representation of each token of the input
text and the inserted special tokens “<cls>” and “<sep>”. Next, we will use these representations
to compute the loss function for pretraining BERT. The pretraining is composed of the following
two tasks: masked language modeling and next sentence prediction.

Masked Language Modeling
As illustrated in Section 9.3, a language model predicts a token using the context on its left. To en-
code context bidirectionally for representing each token, BERT randomly masks tokens and uses
tokens from the bidirectional context to predict the masked tokens in a self-supervised fashion.
This task is referred to as a masked language model.
In this pretraining task, 15% of tokens will be selected at random as the masked tokens for predic-
tion. To predict a masked token without cheating by using the label, one straightforward approach
is to always replace it with a special “<mask>” token in the BERT input sequence. However, the
artificial special token “<mask>” will never appear in fine-tuning. To avoid such a mismatch be-
tween pretraining and fine-tuning, if a token is masked for prediction (e.g., “great” is selected to
be masked and predicted in “this movie is great”), in the input it will be replaced with:
• a special “<mask>” token for 80% of the time (e.g., “this movie is great” becomes “this movie
is <mask>”);
• a random token for 10% of the time (e.g., “this movie is great” becomes “this movie is drink”);
• the unchanged label token for 10% of the time (e.g., “this movie is great” becomes “this
movie is great”).
Note that a random token is inserted for 10% of the 15% of tokens selected for prediction, i.e.,
for 1.5% of all tokens. This occasional noise encourages BERT to be less biased towards the
masked token (especially when the label token remains unchanged) in its bidirectional context
encoding.
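The selection and replacement rule described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used for pretraining: the function name mask_tokens, its parameters, and the tiny replacement vocabulary are all hypothetical.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Pick about 15% of positions and apply the 80/10/10 replacement rule.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    num_preds = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), num_preds))
    masked = list(tokens)
    labels = {}  # position -> original token to be predicted
    for pos in positions:
        labels[pos] = tokens[pos]
        r = rng.random()
        if r < 0.8:        # 80% of the time: the special '<mask>' token
            masked[pos] = '<mask>'
        elif r < 0.9:      # 10% of the time: a random token
            masked[pos] = rng.choice(vocab)
        # Remaining 10% of the time: keep the original token unchanged
    return masked, labels

tokens = ['this', 'movie', 'is', 'great'] * 5
vocab = ['drink', 'eat', 'run']
masked, labels = mask_tokens(tokens, vocab)
print(masked)
print(labels)
```

In all three cases the model is still asked to predict the original token at the selected position; only what it sees in the input differs.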
We implement the following MaskLM class to predict masked tokens in the masked language model
task of BERT pretraining. The prediction uses a one-hidden-layer MLP (self.mlp). In forward
inference, it takes two inputs: the encoded result of BERTEncoder and the token positions for pre-
diction. The output is the prediction results at these positions.
#@save
class MaskLM(nn.Module):
    """The masked language model task of BERT."""
    def __init__(self, vocab_size, num_hiddens, **kwargs):
        super(MaskLM, self).__init__(**kwargs)
        self.mlp = nn.Sequential(nn.LazyLinear(num_hiddens),
                                 nn.ReLU(),
                                 nn.LayerNorm(num_hiddens),
                                 nn.LazyLinear(vocab_size))

    def forward(self, X, pred_positions):
        num_pred_positions = pred_positions.shape[1]
        pred_positions = pred_positions.reshape(-1)
        batch_size = X.shape[0]
        batch_idx = torch.arange(0, batch_size)
        # Suppose that `batch_size` = 2, `num_pred_positions` = 3, then
        # `batch_idx` is `torch.tensor([0, 0, 0, 1, 1, 1])`
        batch_idx = torch.repeat_interleave(batch_idx, num_pred_positions)
        masked_X = X[batch_idx, pred_positions]
        masked_X = masked_X.reshape((batch_size, num_pred_positions, -1))
        mlm_Y_hat = self.mlp(masked_X)
        return mlm_Y_hat
To demonstrate the forward inference of MaskLM, we create its instance mlm and initialize it. Recall
that encoded_X from the forward inference of BERTEncoder represents 2 BERT input sequences.
We define mlm_positions as the 3 indices to predict in either BERT input sequence of encoded_X.
The forward inference of mlm returns prediction results mlm_Y_hat at all the masked positions
mlm_positions of encoded_X. For each prediction, the size of the result is equal to the vocabulary
size.
mlm = MaskLM(vocab_size, num_hiddens)

mlm_positions = torch.tensor([[1, 5, 2], [6, 1, 5]])
mlm_Y_hat = mlm(encoded_X, mlm_positions)
mlm_Y_hat.shape
torch.Size([2, 3, 10000])
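The index arithmetic inside the forward method of MaskLM can be traced with plain nested lists. In the sketch below, strings of the form 'x<batch><position>' stand in for encoded vectors; the variable names flat_pos, gathered, and regrouped are introduced for illustration and the lists only mimic what the tensor indexing does.

```python
# Stand-in for `encoded_X`: 2 sequences of length 8; 'x13' marks the
# representation of sequence 1 at position 3.
X = [[f'x{b}{t}' for t in range(8)] for b in range(2)]
pred_positions = [[1, 5, 2], [6, 1, 5]]
num_pred_positions = len(pred_positions[0])

# Repeat each batch index once per predicted position: [0, 0, 0, 1, 1, 1]
batch_idx = [b for b in range(len(X)) for _ in range(num_pred_positions)]
flat_pos = [p for row in pred_positions for p in row]

# Pairwise lookup, the list analogue of X[batch_idx, pred_positions]
gathered = [X[b][p] for b, p in zip(batch_idx, flat_pos)]
# Regroup into (batch size, number of predictions)
regrouped = [gathered[i:i + num_pred_positions]
             for i in range(0, len(gathered), num_pred_positions)]
print(batch_idx)
print(regrouped)
```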
With the ground truth labels mlm_Y of the predicted tokens mlm_Y_hat under masks, we can calcu-
late the cross-entropy loss of the masked language model task in BERT pretraining.
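mlm_Y = torch.tensor([[7, 8, 9], [10, 20, 30]])
# Keep per-element losses so that there is one entry per prediction
loss = nn.CrossEntropyLoss(reduction='none')
mlm_l = loss(mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y.reshape(-1))
mlm_l.shape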
torch.Size([6])
Next Sentence Prediction
Although masked language modeling is able to encode bidirectional context for representing
words, it does not explicitly model the logical relationship between text pairs. To help under-
stand the relationship between two text sequences, BERT considers a binary classification task,
next sentence prediction, in its pretraining. When generating sentence pairs for pretraining, for
half of the time they are indeed consecutive sentences with the label “True”; while for the other
half of the time the second sentence is randomly sampled from the corpus with the label “False”.
The following NextSentencePred class uses a one-hidden-layer MLP to predict whether the second
sentence is the next sentence of the first in the BERT input sequence. Due to self-attention in the
transformer encoder, the BERT representation of the special token “<cls>” encodes both the two
sentences from the input. Hence, the output layer (self.output) of the MLP classifier takes X as
input, where X is the output of the MLP hidden layer whose input is the encoded “<cls>” token.
#@save
class NextSentencePred(nn.Module):
    """The next sentence prediction task of BERT."""
    def __init__(self, **kwargs):
        super(NextSentencePred, self).__init__(**kwargs)
        # Binary classifier; the input size is inferred lazily
        self.output = nn.LazyLinear(2)

    def forward(self, X):
        # `X` shape: (batch size, `num_hiddens`)
        return self.output(X)
We can see that the forward inference of a NextSentencePred instance returns binary predictions
for each BERT input sequence.

# PyTorch by default won't flatten the tensor as seen in mxnet where, if
# flatten=True, all but the first axis of input data are collapsed together
encoded_X = torch.flatten(encoded_X, start_dim=1)
# input_shape for NSP: (batch size, `num_hiddens`)
nsp = NextSentencePred()
nsp_Y_hat = nsp(encoded_X)
nsp_Y_hat.shape
torch.Size([2, 2])
The cross-entropy loss of the 2 binary classifications can also be computed.
nsp_y = torch.tensor([0, 1])

nsp_l = loss(nsp_Y_hat, nsp_y)
nsp_l.shape
torch.Size([2])
It is noteworthy that all the labels in both the aforementioned pretraining tasks can be trivially
obtained from the pretraining corpus without manual labeling effort. The original BERT has been
pretrained on the concatenation of BookCorpus (Zhu et al., 2015) and English Wikipedia. These
two text corpora are huge: they have 800 million words and 2.5 billion words, respectively.
When pretraining BERT, the final loss function is a linear combination of both the loss functions
for masked language modeling and next sentence prediction. Now we can define the BERTModel
class by instantiating the three classes BERTEncoder, MaskLM, and NextSentencePred. The forward
inference returns the encoded BERT representations encoded_X, predictions of masked language
modeling mlm_Y_hat, and next sentence predictions nsp_Y_hat.
#@save
class BERTModel(nn.Module):
    """The BERT model."""
    def __init__(self, vocab_size, num_hiddens, ffn_num_hiddens,
                 num_heads, num_blks, dropout, max_len=1000):
        super(BERTModel, self).__init__()
        self.encoder = BERTEncoder(vocab_size, num_hiddens, ffn_num_hiddens,
                                   num_heads, num_blks, dropout,
                                   max_len=max_len)
        self.hidden = nn.Sequential(nn.LazyLinear(num_hiddens),
                                    nn.Tanh())
        self.mlm = MaskLM(vocab_size, num_hiddens)
        self.nsp = NextSentencePred()

    def forward(self, tokens, segments, valid_lens=None, pred_positions=None):
        encoded_X = self.encoder(tokens, segments, valid_lens)
        if pred_positions is not None:
            mlm_Y_hat = self.mlm(encoded_X, pred_positions)
        else:
            mlm_Y_hat = None
        # The hidden layer of the MLP classifier for next sentence prediction.
        # 0 is the index of the '<cls>' token
        nsp_Y_hat = self.nsp(self.hidden(encoded_X[:, 0, :]))
        return encoded_X, mlm_Y_hat, nsp_Y_hat
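The text describes the final pretraining loss as a linear combination of the masked language modeling loss and the next sentence prediction loss. The following dependency-free sketch works through that combination on made-up logits. The weights w_mlm and w_nsp, both set to 1.0, are an assumption for illustration, as are all numbers and the helper cross_entropy.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits (log-sum-exp form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

# Made-up predictions: 2 masked positions over a vocabulary of size 3,
# and one binary next sentence prediction.
mlm_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mlm_labels = [0, 1]
nsp_logits = [0.0, 3.0]
nsp_label = 1

# Average the masked language modeling loss over the masked positions
mlm_l = sum(cross_entropy(l, y)
            for l, y in zip(mlm_logits, mlm_labels)) / len(mlm_labels)
nsp_l = cross_entropy(nsp_logits, nsp_label)

w_mlm, w_nsp = 1.0, 1.0  # Assumed weights of the linear combination
total = w_mlm * mlm_l + w_nsp * nsp_l
print(round(mlm_l, 4), round(nsp_l, 4), round(total, 4))
```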
15.8.7 Summary
• Word embedding models such as word2vec and GloVe are context-independent. They assign
the same pretrained vector to the same word regardless of the context of the word (if any).
It is hard for them to handle polysemy or complex semantics in natural languages well.
• For context-sensitive word representations such as ELMo and GPT, representations of words
depend on their contexts.
• ELMo encodes context bidirectionally but uses task-specific architectures (however, it is
practically non-trivial to craft a specific architecture for every natural language processing
task); while GPT is task-agnostic but encodes context left-to-right.
• BERT combines the best of both worlds: it encodes context bidirectionally and requires min-
imal architecture changes for a wide range of natural language processing tasks.
• The embeddings of the BERT input sequence are the sum of the token embeddings, segment
embeddings, and positional embeddings.
• Pretraining BERT is composed of two tasks: masked language modeling and next sentence
prediction. The former is able to encode bidirectional context for representing words, while
the latter explicitly models the logical relationship between text pairs.
Exercises
1. All other things being equal, will a masked language model require more or fewer pretrain-
ing steps to converge than a left-to-right language model? Why?
2. In the original implementation of BERT, the positionwise feed-forward network in BERTEn-
coder (via d2l.TransformerEncoderBlock) and the fully connected layer in MaskLM both use
the Gaussian error linear unit (GELU) (Hendrycks and Gimpel, 2016) as the activation func-
tion. Research into the difference between GELU and ReLU.
Discussions

15.9 The Dataset for Pretraining BERT
To pretrain the BERT model as implemented in Section 15.8, we need to generate the dataset in the
ideal format to facilitate the two pretraining tasks: masked language modeling and next sentence
prediction. On the one hand, the original BERT model is pretrained on the concatenation of two
huge corpora, BookCorpus and English Wikipedia (see Section 15.8.5), making it hard to run for
most readers of this book. On the other hand, the off-the-shelf pretrained BERT model may not be
suitable for applications in specific domains such as medicine. Thus, it is becoming popular to
pretrain BERT on a customized dataset. To facilitate the demonstration of BERT pretraining, we
use a smaller corpus, WikiText-2 (Merity et al., 2016).
Compared with the PTB dataset used for pretraining word2vec in Section 15.3, WikiText-2
(i) retains the original punctuation, making it suitable for next sentence prediction; (ii) retains
the original case and numbers; (iii) is more than twice as large.
import os
import random
import torch
from d2l import torch as d2l
In the WikiText-2 dataset, each line represents a paragraph where space is inserted between any
punctuation and its preceding token. Paragraphs with at least two sentences are retained. To split
sentences, we only use the period as the delimiter for simplicity. We leave discussions of more
complex sentence splitting techniques in the exercises at the end of this section.
#@save
d2l.DATA_HUB['wikitext-2'] = (
    'https://s3.amazonaws.com/research.metamind.io/wikitext/'
    'wikitext-2-v1.zip', '3c914d17d80b1459be871a5039ac23e752a53cbe')

#@save
def _read_wiki(data_dir):
    file_name = os.path.join(data_dir, 'wiki.train.tokens')
    with open(file_name, 'r') as f:
        lines = f.readlines()
    # Uppercase letters are converted to lowercase ones
    paragraphs = [line.strip().lower().split(' . ')
                  for line in lines if len(line.split(' . ')) >= 2]
    random.shuffle(paragraphs)
    return paragraphs
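To make the splitting rule concrete, the following minimal sketch applies the same strip, lowercase, and split-on-period steps to a single made-up line. The sample text and the helper name split_paragraph are assumptions for illustration only and are not part of the d2l package.

```python
# A made-up stand-in for one line of `wiki.train.tokens`; in WikiText-2 a
# space separates each punctuation mark from the preceding token.
line = ' The crane lifted the beam . It was heavy . Work stopped at noon . \n'

def split_paragraph(line):
    """Strip, lowercase, and split one paragraph line on the ' . ' delimiter."""
    return line.strip().lower().split(' . ')

sentences = split_paragraph(line)
print(sentences)
```

Because the delimiter is the three-character string ' . ', the periods between sentences are consumed by the split, while the last sentence keeps its trailing period after the surrounding whitespace has been stripped.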
15.9.1 Defining Helper Functions for Pretraining Tasks
In the following, we begin by implementing helper functions for the two BERT pretraining tasks:
next sentence prediction and masked language modeling. These helper functions will be invoked
later when transforming the raw text corpus into the dataset of the ideal format to pretrain BERT.
Generating the Next Sentence Prediction Task
According to the description in Section 15.8.5, the _get_next_sentence function generates a training
example for the binary classification task.
#@save
def _get_next_sentence(sentence, next_sentence, paragraphs):
    if random.random() < 0.5:
        is_next = True
    else:
        # `paragraphs` is a list of lists of lists
        next_sentence = random.choice(random.choice(paragraphs))
        is_next = False
    return sentence, next_sentence, is_next
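As a quick sanity check of the sampling rule, the sketch below re-implements the same coin flip with an explicit random generator and measures how often the true next sentence is kept. The function name sample_next_sentence and the toy corpus are hypothetical; only the 50/50 rule comes from the text.

```python
import random

def sample_next_sentence(sentence, next_sentence, paragraphs, rng):
    """Keep the true next sentence half of the time, else draw a random one."""
    if rng.random() < 0.5:
        return sentence, next_sentence, True
    # `paragraphs` is a list of paragraphs; each paragraph is a list of
    # sentences; each sentence is a list of tokens
    random_paragraph = rng.choice(paragraphs)
    return sentence, rng.choice(random_paragraph), False

# Hypothetical toy corpus
paragraphs = [
    [['a', 'crane', 'is', 'flying'], ['it', 'is', 'white']],
    [['he', 'just', 'left'], ['she', 'stayed'], ['they', 'waved']],
]
rng = random.Random(0)
labels = [
    sample_next_sentence(paragraphs[0][0], paragraphs[0][1], paragraphs, rng)[2]
    for _ in range(10000)
]
positive_fraction = sum(labels) / len(labels)
print(f'fraction labeled is_next=True: {positive_fraction:.3f}')
```

Note that a randomly drawn sentence can occasionally coincide with the true next sentence while still being labeled False. On a large corpus this is rare, but on a toy corpus like the one above it happens noticeably often.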
The following function generates training examples for next sentence prediction from the input
paragraph by invoking the _get_next_sentence function. Here paragraph is a list of sentences,
where each sentence is a list of tokens. The argument max_len specifies the maximum length of a
BERT input sequence during pretraining.
#@save
def _get_nsp_data_from_paragraph(paragraph, paragraphs, vocab, max_len):
    nsp_data_from_paragraph = []
    for i in range(len(paragraph) - 1):
        tokens_a, tokens_b, is_next = _get_next_sentence(
            paragraph[i], paragraph[i + 1], paragraphs)
        # Consider 1 '<cls>' token and 2 '<sep>' tokens
        if len(tokens_a) + len(tokens_b) + 3 > max_len:
            continue
        tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b)
        nsp_data_from_paragraph.append((tokens, segments, is_next))
    return nsp_data_from_paragraph
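The length check above adds 3 because a sentence pair is wrapped with one “<cls>” token and two “<sep>” tokens. The sketch below illustrates this with a hypothetical stand-in, build_tokens_and_segments, which is assumed to mirror the layout that d2l.get_tokens_and_segments produces (see Section 15.8); it is not the library function itself.

```python
def build_tokens_and_segments(tokens_a, tokens_b=None):
    """Assemble a BERT input sequence and its segment IDs (illustrative)."""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # Segment 0 covers '<cls>', the first sentence, and the first '<sep>'
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        # Segment 1 covers the second sentence and the second '<sep>'
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments

def fits(tokens_a, tokens_b, max_len):
    """Same length test as in the text: 1 '<cls>' plus 2 '<sep>' tokens."""
    return len(tokens_a) + len(tokens_b) + 3 <= max_len

tokens_a = ['a', 'crane', 'driver', 'came']
tokens_b = ['he', 'just', 'left']
tokens, segments = build_tokens_and_segments(tokens_a, tokens_b)
print(tokens)
print(segments)
print(fits(tokens_a, tokens_b, max_len=10), fits(tokens_a, tokens_b, max_len=9))
```

With four and three tokens in the two sentences, the assembled sequence has 4 + 3 + 3 = 10 tokens, so the pair is kept for max_len of 10 and skipped for max_len of 9.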
Generating the Masked Language Modeling Task
In order to generate training examples for the masked language modeling task from a BERT input
sequence, we define the following _replace_mlm_tokens function. In its inputs, tokens is a list of
tokens representing a BERT input sequence, candidate_pred_positions is a list of token indices
of the BERT input sequence excluding those of special tokens (special tokens are not predicted
in the masked language modeling task), and num_mlm_preds indicates the number of predictions
(recall that 15% of the tokens are randomly chosen for prediction). Following the definition of the
masked language modeling task in Section 15.8.5, at each prediction position, the input may be
replaced by a special “<mask>” token or a random token, or remain unchanged. In the end, the
function returns the input tokens after possible replacement, the token indices where predictions
take place, and the labels for these predictions.
#@save
def _replace_mlm_tokens(tokens, candidate_pred_positions, num_mlm_preds,
                        vocab):
    # For the input of a masked language model, make a new copy of tokens and
    # replace some of them by '<mask>' or random tokens
    mlm_input_tokens = [token for token in tokens]
    pred_positions_and_labels = []
    # Shuffle for getting 15% random tokens for prediction in the masked
    # language modeling task
    random.shuffle(candidate_pred_positions)
    for mlm_pred_position in candidate_pred_positions:
        if len(pred_positions_and_labels) >= num_mlm_preds:
            break
        masked_token = None
        # 80% of the time: replace the word with the '<mask>' token
        if random.random() < 0.8:
            masked_token = '<mask>'
        else:
            # 10% of the time: keep the word unchanged
            if random.random() < 0.5:
                masked_token = tokens[mlm_pred_position]
            # 10% of the time: replace the word with a random word
            else:
                masked_token = random.choice(vocab.idx_to_token)
        mlm_input_tokens[mlm_pred_position] = masked_token
        pred_positions_and_labels.append(
            (mlm_pred_position, tokens[mlm_pred_position]))
    return mlm_input_tokens, pred_positions_and_labels
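The nested coin flips above yield the 80%/10%/10% split because the second flip only happens in the remaining 20% of cases and divides them evenly. The following sketch checks this empirically; choose_replacement and the tiny token list are hypothetical names used only for this illustration.

```python
import random
from collections import Counter

def choose_replacement(token, vocab_tokens, rng):
    """Return the replacement token and which of the three cases was taken."""
    if rng.random() < 0.8:
        return '<mask>', 'mask'
    # Reached in the remaining 20% of cases; split them evenly
    if rng.random() < 0.5:
        return token, 'keep'
    return rng.choice(vocab_tokens), 'random'

rng = random.Random(0)
vocab_tokens = ['a', 'crane', 'is', 'flying', 'driver']  # toy vocabulary
num_trials = 100000
counts = Counter(
    choose_replacement('crane', vocab_tokens, rng)[1] for _ in range(num_trials))
fractions = {case: count / num_trials for case, count in counts.items()}
print(fractions)
```

In the 'random' case the drawn token may happen to equal the original one, so the share of positions whose input actually changes is slightly below 90%.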
By invoking the aforementioned _replace_mlm_tokens function, the following function takes a
BERT input sequence (tokens) as an input and returns indices of the input tokens (after possible
token replacement as described in Section 15.8.5), the token indices where predictions take place,
and label indices for these predictions.
#@save
def _get_mlm_data_from_tokens(tokens, vocab):
    candidate_pred_positions = []
    # `tokens` is a list of strings
    for i, token in enumerate(tokens):
        # Special tokens are not predicted in the masked language modeling
        # task
        if token in ['<cls>', '<sep>']:
            continue
        candidate_pred_positions.append(i)
    # 15% of random tokens are predicted in the masked language modeling task
    num_mlm_preds = max(1, round(len(tokens) * 0.15))
    mlm_input_tokens, pred_positions_and_labels = _replace_mlm_tokens(
        tokens, candidate_pred_positions, num_mlm_preds, vocab)
    pred_positions_and_labels = sorted(pred_positions_and_labels,
                                       key=lambda x: x[0])
    pred_positions = [v[0] for v in pred_positions_and_labels]
    mlm_pred_labels = [v[1] for v in pred_positions_and_labels]
    return vocab[mlm_input_tokens], pred_positions, vocab[mlm_pred_labels]
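The number of predictions is a simple rounding rule with a floor of one. The sketch below evaluates it for a few sequence lengths; num_mlm_predictions is a hypothetical helper that only restates the expression used above.

```python
def num_mlm_predictions(num_tokens, ratio=0.15):
    """At least one prediction, otherwise 15% of the tokens, rounded."""
    return max(1, round(num_tokens * ratio))

for num_tokens in (3, 8, 20, 64):
    print(num_tokens, num_mlm_predictions(num_tokens))
```

Since len(tokens) is taken over the full BERT input sequence, the special tokens count towards the 15% budget even though they are never selected as prediction positions. For a sequence of the maximum length 64, the rule gives round(9.6) = 10 predictions.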

15.9.2 Transforming Text into the Pretraining Dataset
Now we are almost ready to customize a Dataset class for pretraining BERT. Before that,
we still need to define a helper function _pad_bert_inputs to append the special “<pad>” tokens
to the inputs. Its argument examples contains the outputs from the helper functions
_get_nsp_data_from_paragraph and _get_mlm_data_from_tokens for the two pretraining tasks.
#@save
def _pad_bert_inputs(examples, max_len, vocab):
    max_num_mlm_preds = round(max_len * 0.15)
    all_token_ids, all_segments, valid_lens = [], [], []
    all_pred_positions, all_mlm_weights, all_mlm_labels = [], [], []
    nsp_labels = []
    for (token_ids, pred_positions, mlm_pred_label_ids, segments,
         is_next) in examples:
        all_token_ids.append(torch.tensor(token_ids + [vocab['<pad>']] * (
            max_len - len(token_ids)), dtype=torch.long))
        all_segments.append(torch.tensor(segments + [0] * (
            max_len - len(segments)), dtype=torch.long))
        # `valid_lens` excludes count of '<pad>' tokens
        valid_lens.append(torch.tensor(len(token_ids), dtype=torch.float32))
        all_pred_positions.append(torch.tensor(pred_positions + [0] * (
            max_num_mlm_preds - len(pred_positions)), dtype=torch.long))
        # Predictions of padded tokens will be filtered out in the loss via
        # multiplication of 0 weights
        all_mlm_weights.append(
            torch.tensor([1.0] * len(mlm_pred_label_ids) + [0.0] * (
                max_num_mlm_preds - len(pred_positions)),
                dtype=torch.float32))
        all_mlm_labels.append(torch.tensor(mlm_pred_label_ids + [0] * (
            max_num_mlm_preds - len(mlm_pred_label_ids)), dtype=torch.long))
        nsp_labels.append(torch.tensor(is_next, dtype=torch.long))
    return (all_token_ids, all_segments, valid_lens, all_pred_positions,
            all_mlm_weights, all_mlm_labels, nsp_labels)
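The padding of the masked language modeling fields can be illustrated without tensors. In the sketch below, pad_mlm_fields is a hypothetical helper that follows the same pattern on plain Python lists; the label IDs are made up.

```python
def pad_mlm_fields(pred_positions, label_ids, max_num_preds):
    """Pad positions and labels with 0 and build the matching 0/1 weights."""
    num_pad = max_num_preds - len(pred_positions)
    positions = pred_positions + [0] * num_pad
    # Real predictions get weight 1.0, padded slots get weight 0.0
    weights = [1.0] * len(label_ids) + [0.0] * num_pad
    labels = label_ids + [0] * num_pad
    return positions, weights, labels

positions, weights, labels = pad_mlm_fields(
    pred_positions=[2, 5, 7], label_ids=[41, 17, 93], max_num_preds=5)
print(positions)
print(weights)
print(labels)
```

Padded positions and labels both use the value 0, which is also a legitimate position and token index. Only the weights allow the padded slots to be told apart from real predictions.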
Putting together the helper functions for generating training examples of the two pretraining tasks
and the helper function for padding inputs, we customize the following _WikiTextDataset class
as the WikiText-2 dataset for pretraining BERT. By implementing the __getitem__ function, we
can arbitrarily access the pretraining (masked language modeling and next sentence prediction)
examples generated from a pair of sentences from the WikiText-2 corpus.
The original BERT model uses WordPiece embeddings whose vocabulary size is 30000 (Wu et al.,
2016). The tokenization method of WordPiece is a slight modification of the original byte pair
encoding algorithm in Section 15.6.2. For simplicity, we use the d2l.tokenize function for
tokenization. Infrequent tokens that appear fewer than five times are filtered out.
#@save
class _WikiTextDataset(torch.utils.data.Dataset):
    def __init__(self, paragraphs, max_len):
        # Input `paragraphs[i]` is a list of sentence strings representing a
        # paragraph; while output `paragraphs[i]` is a list of sentences
        # representing a paragraph, where each sentence is a list of tokens
        paragraphs = [d2l.tokenize(
            paragraph, token='word') for paragraph in paragraphs]
        sentences = [sentence for paragraph in paragraphs
                     for sentence in paragraph]
        self.vocab = d2l.Vocab(sentences, min_freq=5, reserved_tokens=[
            '<pad>', '<mask>', '<cls>', '<sep>'])
        # Get data for the next sentence prediction task
        examples = []
        for paragraph in paragraphs:
            examples.extend(_get_nsp_data_from_paragraph(
                paragraph, paragraphs, self.vocab, max_len))
        # Get data for the masked language model task
        examples = [(_get_mlm_data_from_tokens(tokens, self.vocab)
                     + (segments, is_next))
                    for tokens, segments, is_next in examples]
        # Pad inputs
        (self.all_token_ids, self.all_segments, self.valid_lens,
         self.all_pred_positions, self.all_mlm_weights,
         self.all_mlm_labels, self.nsp_labels) = _pad_bert_inputs(
            examples, max_len, self.vocab)

    def __getitem__(self, idx):
        return (self.all_token_ids[idx], self.all_segments[idx],
                self.valid_lens[idx], self.all_pred_positions[idx],
                self.all_mlm_weights[idx], self.all_mlm_labels[idx],
                self.nsp_labels[idx])

    def __len__(self):
        return len(self.all_token_ids)
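To see what filtering by minimum frequency does to a vocabulary, consider the following simplified sketch. The build_vocab helper is hypothetical and is not d2l.Vocab: it orders the kept tokens alphabetically and assumes that index 0 is reserved for an unknown token, which may differ from the library's ordering.

```python
from collections import Counter

def build_vocab(sentences, min_freq, reserved_tokens):
    """Map tokens to indices, dropping tokens rarer than `min_freq`."""
    counter = Counter(token for sentence in sentences for token in sentence)
    kept = sorted(token for token, count in counter.items()
                  if count >= min_freq)
    # Assumption: index 0 is the unknown token, followed by reserved tokens
    idx_to_token = ['<unk>'] + reserved_tokens + kept
    return {token: idx for idx, token in enumerate(idx_to_token)}

sentences = [['a', 'crane', 'is', 'flying'],
             ['a', 'crane', 'driver', 'came'],
             ['a', 'bird']]
vocab = build_vocab(sentences, min_freq=2,
                    reserved_tokens=['<pad>', '<mask>', '<cls>', '<sep>'])
print(vocab)
# Filtered tokens fall back to the unknown-token index
print(vocab.get('driver', vocab['<unk>']))
```

With min_freq=2 only “a” and “crane” survive, so the vocabulary has seven entries: the unknown token, the four reserved tokens, and the two frequent words.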
By using the _read_wiki function and the _WikiTextDataset class, we define the following
load_data_wiki to download the WikiText-2 dataset and generate pretraining examples from it.
#@save
def load_data_wiki(batch_size, max_len):
    """Load the WikiText-2 dataset."""
    num_workers = d2l.get_dataloader_workers()
    data_dir = d2l.download_extract('wikitext-2', 'wikitext-2')
    paragraphs = _read_wiki(data_dir)
    train_set = _WikiTextDataset(paragraphs, max_len)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                             shuffle=True, num_workers=num_workers)
    return train_iter, train_set.vocab
Setting the batch size to 512 and the maximum length of a BERT input sequence to be 64, we
print out the shapes of a minibatch of BERT pretraining examples. Note that in each BERT input
sequence, 10 (64 × 0.15) positions are predicted for the masked language modeling task.
batch_size, max_len = 512, 64
train_iter, vocab = load_data_wiki(batch_size, max_len)

for (tokens_X, segments_X, valid_lens_x, pred_positions_X, mlm_weights_X,
     mlm_Y, nsp_y) in train_iter:
    print(tokens_X.shape, segments_X.shape, valid_lens_x.shape,
          pred_positions_X.shape, mlm_weights_X.shape, mlm_Y.shape,
          nsp_y.shape)
    break
torch.Size([512, 64]) torch.Size([512, 64]) torch.Size([512]) torch.Size([512, 10]) torch.Size([512, 10]) torch.Size([512, 10]) torch.Size([512])
In the end, letʼs take a look at the vocabulary size. Even after filtering out infrequent tokens, it is
still more than twice as large as that of the PTB dataset.
len(vocab)
20256
15.9.3 Summary
• Compared with the PTB dataset, the WikiText-2 dataset retains the original punctuation,
case and numbers, and is more than twice as large.
• We can arbitrarily access the pretraining (masked language modeling and next sentence
prediction) examples generated from a pair of sentences from the WikiText-2 corpus.
Exercises
1. For simplicity, the period is used as the only delimiter for splitting sentences. Try other
sentence splitting techniques, such as spaCy and NLTK. Take NLTK as an example. You need
to install NLTK first: pip install nltk. In the code, first import nltk. Then, download the
Punkt sentence tokenizer: nltk.download('punkt'). To split sentences such as sentences
= 'This is great ! Why not ?', invoking nltk.tokenize.sent_tokenize(sentences) will
return a list of two sentence strings: ['This is great !', 'Why not ?'].
2. What is the vocabulary size if we do not filter out any infrequent token?
Discussions239
15.10 Pretraining BERT
With the BERT model implemented in Section 15.8 and the pretraining examples generated from
the WikiText-2 dataset in Section 15.9, we will pretrain BERT on the WikiText-2 dataset in this
section.
import torch
from torch import nn
from d2l import torch as d2l
To start, we load the WikiText-2 dataset as minibatches of pretraining examples for masked lan-
guage modeling and next sentence prediction. The batch size is 512 and the maximum length of
a BERT input sequence is 64. Note that in the original BERT model, the maximum length is 512.
batch_size, max_len = 512, 64

train_iter, vocab = d2l.load_data_wiki(batch_size, max_len)
15.10.1 Pretraining BERT
The original BERT has two versions of different model sizes (Devlin et al., 2018). The base model
(BERTBASE) uses 12 layers (transformer encoder blocks) with 768 hidden units (hidden size) and
12 self-attention heads. The large model (BERTLARGE) uses 24 layers with 1024 hidden units and
16 self-attention heads. Notably, the former has 110 million parameters while the latter has 340
million parameters. For ease of demonstration, we define a small BERT, using 2 layers, 128
hidden units, and 2 self-attention heads.
net = d2l.BERTModel(len(vocab), num_hiddens=128,
                    ffn_num_hiddens=256, num_heads=2, num_blks=2, dropout=0.2)
devices = d2l.try_all_gpus()
loss = nn.CrossEntropyLoss()
Before defining the training loop, we define a helper function _get_batch_loss_bert. Given the
shard of training examples, this function computes the loss for both the masked language model-
ing and next sentence prediction tasks. Note that the final loss of BERT pretraining is just the sum
of both the masked language modeling loss and the next sentence prediction loss.
#@save
def _get_batch_loss_bert(net, loss, vocab_size, tokens_X,
                         segments_X, valid_lens_x,
                         pred_positions_X, mlm_weights_X,
                         mlm_Y, nsp_y):
    # Forward pass
    _, mlm_Y_hat, nsp_Y_hat = net(tokens_X, segments_X,
                                  valid_lens_x.reshape(-1),
                                  pred_positions_X)
    # Compute masked language model loss
    mlm_l = loss(mlm_Y_hat.reshape(-1, vocab_size), mlm_Y.reshape(-1)) *\
    mlm_weights_X.reshape(-1, 1)
    mlm_l = mlm_l.sum() / (mlm_weights_X.sum() + 1e-8)
    # Compute next sentence prediction loss
    nsp_l = loss(nsp_Y_hat, nsp_y)
    l = mlm_l + nsp_l
    return mlm_l, nsp_l, l
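The idea of filtering padded predictions out of the loss with 0 weights can be shown with plain numbers. The sketch below assumes that one loss value is available per prediction slot (for example, a cross-entropy loss computed without reduction); masked_mean and the loss values are hypothetical and serve only to illustrate the weighting.

```python
def masked_mean(per_position_losses, weights):
    """Average the losses over the slots whose weight is nonzero."""
    total = sum(l * w for l, w in zip(per_position_losses, weights))
    # The small constant guards against division by zero when all weights
    # are zero
    return total / (sum(weights) + 1e-8)

losses = [2.0, 4.0, 9.0, 9.0]    # the last two slots are padding
weights = [1.0, 1.0, 0.0, 0.0]
weighted = masked_mean(losses, weights)
unweighted = sum(losses) / len(losses)
print(f'masked mean: {weighted:.3f}, plain mean: {unweighted:.3f}')
```

The plain mean is pulled up by the two padded slots, whereas the masked mean only reflects the two real predictions.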
Invoking the aforementioned helper function, the following train_bert function defines the
procedure to pretrain BERT (net) on the WikiText-2 (train_iter) dataset. Training BERT can take
very long. Instead of specifying the number of epochs for training as in the train_ch13 function
(see Section 14.1), the input num_steps of the following function specifies the number of iteration
steps for training.
def train_bert(train_iter, net, loss, vocab_size, devices, num_steps):
    net(*next(iter(train_iter))[:4])
    net = nn.DataParallel(net, device_ids=devices).to(devices[0])
    trainer = torch.optim.Adam(net.parameters(), lr=0.01)
    step, timer = 0, d2l.Timer()
    animator = d2l.Animator(xlabel='step', ylabel='loss',
                            xlim=[1, num_steps], legend=['mlm', 'nsp'])
    # Sum of masked language modeling losses, sum of next sentence prediction
    # losses, no. of sentence pairs, count
    metric = d2l.Accumulator(4)
    num_steps_reached = False
    while step < num_steps and not num_steps_reached:
        for tokens_X, segments_X, valid_lens_x, pred_positions_X,\
            mlm_weights_X, mlm_Y, nsp_y in train_iter:
            tokens_X = tokens_X.to(devices[0])
            segments_X = segments_X.to(devices[0])
            valid_lens_x = valid_lens_x.to(devices[0])
            pred_positions_X = pred_positions_X.to(devices[0])
            mlm_weights_X = mlm_weights_X.to(devices[0])
            mlm_Y, nsp_y = mlm_Y.to(devices[0]), nsp_y.to(devices[0])
            trainer.zero_grad()
            timer.start()
            mlm_l, nsp_l, l = _get_batch_loss_bert(
                net, loss, vocab_size, tokens_X, segments_X, valid_lens_x,
                pred_positions_X, mlm_weights_X, mlm_Y, nsp_y)
            l.backward()
            trainer.step()
            metric.add(mlm_l, nsp_l, tokens_X.shape[0], 1)
            timer.stop()
            animator.add(step + 1,
                         (metric[0] / metric[3], metric[1] / metric[3]))
            step += 1
            if step == num_steps:
                num_steps_reached = True
                break

    print(f'MLM loss {metric[0] / metric[3]:.3f}, '
          f'NSP loss {metric[1] / metric[3]:.3f}')
    print(f'{metric[2] / timer.sum():.1f} sentence pairs/sec on '
          f'{str(devices)}')
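The step-based control flow can be isolated from the training details. In the sketch below, run_steps is a hypothetical function that reproduces only the loop structure: it keeps cycling through a (nonempty) collection of batches until the requested number of steps has been taken, regardless of epoch boundaries.

```python
def run_steps(batches, num_steps):
    """Visit batches repeatedly until exactly `num_steps` have been taken."""
    step, visited = 0, []
    num_steps_reached = False
    while step < num_steps and not num_steps_reached:
        # Each pass of this `for` loop corresponds to one epoch
        for batch in batches:
            visited.append(batch)
            step += 1
            if step == num_steps:
                num_steps_reached = True
                break
    return visited

visited = run_steps(['b0', 'b1', 'b2'], num_steps=7)
print(visited)
```

With three batches per epoch and seven steps, the loop completes two full passes and stops after the first batch of the third pass.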
We can plot both the masked language modeling loss and the next sentence prediction loss during
BERT pretraining.
train_bert(train_iter, net, loss, len(vocab), devices, 50)
MLM loss 5.487, NSP loss 0.762

4386.3 sentence pairs/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]

15.10.2 Representing Text with BERT
After pretraining BERT, we can use it to represent single text, text pairs, or any token in them. The
following function returns the BERT (net) representations for all tokens in tokens_a and tokens_b.
def get_bert_encoding(net, tokens_a, tokens_b=None):
    tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b)
    token_ids = torch.tensor(vocab[tokens], device=devices[0]).unsqueeze(0)
    segments = torch.tensor(segments, device=devices[0]).unsqueeze(0)
    valid_len = torch.tensor(len(tokens), device=devices[0]).unsqueeze(0)
    encoded_X, _, _ = net(token_ids, segments, valid_len)
    return encoded_X
Consider the sentence “a crane is flying”. Recall the input representation of BERT as discussed in
Section 15.8.4. After inserting special tokens “<cls>” (used for classification) and “<sep>” (used for
separation), the BERT input sequence has a length of six. Since zero is the index of the “<cls>”
token, encoded_text[:, 0, :] is the BERT representation of the entire input sentence. To evaluate
the polysemous token “crane”, we also print out the first three elements of the BERT representation
of the token.
tokens_a = ['a', 'crane', 'is', 'flying']

encoded_text = get_bert_encoding(net, tokens_a)
# Tokens: '<cls>', 'a', 'crane', 'is', 'flying', '<sep>'
encoded_text_cls = encoded_text[:, 0, :]
encoded_text_crane = encoded_text[:, 2, :]
encoded_text.shape, encoded_text_cls.shape, encoded_text_crane[0][:3]
(torch.Size([1, 6, 128]),
torch.Size([1, 128]),
tensor([-0.0681, 0.0287, -1.1421], device='cuda:0', grad_fn=<SliceBackward0>))
Now consider a sentence pair “a crane driver came” and “he just left”. Similarly, encoded_pair[:,
0, :] is the encoded result of the entire sentence pair from the pretrained BERT. Note that the
first three elements of the representation of the polysemous token “crane” differ from those
obtained in the previous context. This supports that BERT representations are context-sensitive.
tokens_a, tokens_b = ['a', 'crane', 'driver', 'came'], ['he', 'just', 'left']
encoded_pair = get_bert_encoding(net, tokens_a, tokens_b)
# Tokens: '<cls>', 'a', 'crane', 'driver', 'came', '<sep>', 'he', 'just',
# 'left', '<sep>'
encoded_pair_cls = encoded_pair[:, 0, :]
encoded_pair_crane = encoded_pair[:, 2, :]
encoded_pair.shape, encoded_pair_cls.shape, encoded_pair_crane[0][:3]
(torch.Size([1, 10, 128]),

torch.Size([1, 128]),
tensor([-0.0611, 0.1988, -0.0668], device='cuda:0', grad_fn=<SliceBackward0>))
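One way to quantify how far apart two representations are is cosine similarity. The sketch below applies it to the two three-element slices printed above. This is an illustration only: the full representations have 128 dimensions, so the value computed from three elements says little about the full vectors, and the cosine_similarity helper is a hypothetical pure-Python function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# First three elements of the two "crane" representations printed above
crane_in_sentence = [-0.0681, 0.0287, -1.1421]
crane_in_pair = [-0.0611, 0.1988, -0.0668]
similarity = cosine_similarity(crane_in_sentence, crane_in_pair)
print(f'cosine similarity of the printed slices: {similarity:.3f}')
```

A value of 1 would mean the two slices point in the same direction. The printed slices are clearly not aligned, which is consistent with the same token being encoded differently in different contexts.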
In Chapter 16, we will fine-tune a pretrained BERT model for downstream natural language pro-
cessing applications.
15.10.3 Summary
• The original BERT has two versions, where the base model has 110 million parameters and
the large model has 340 million parameters.
• After pretraining BERT, we can use it to represent single text, text pairs, or any token in them.
• In the experiment, the same token has different BERT representations when its contexts
are different. This supports that BERT representations are context-sensitive.
Exercises
1. In the experiment, we can see that the masked language modeling loss is significantly higher
than the next sentence prediction loss. Why?
2. Set the maximum length of a BERT input sequence to be 512 (same as the original BERT
model). Use the configurations of the original BERT model such as BERTLARGE. Do you
encounter any error when running this section? Why?
Discussions240
16 Natural Language Processing: Applications
We have seen how to represent tokens in text sequences and train their representations in Chapter
15. Such pretrained text representations can be fed to various models for different downstream
natural language processing tasks.
In fact, earlier chapters have already discussed some natural language processing applications
without pretraining, just for explaining deep learning architectures. For instance, in Chapter 9, we
have relied on RNNs to design language models to generate novella-like text. In Chapter 10 and
Chapter 11, we have also designed models based on RNNs and attention mechanisms for machine
translation.
However, this book does not intend to cover all such applications in a comprehensive manner.
Instead, our focus is on how to apply (deep) representation learning of languages to address natural
language processing problems. Given pretrained text representations, this chapter will explore two
popular and representative downstream natural language processing tasks: sentiment analysis
and natural language inference, which analyze single text and relationships of text pairs,
respectively.
Fig. 16.1: Pretrained text representations can be fed to various deep learning architectures for
different downstream natural language processing applications. This chapter focuses on how to
design models for different downstream natural language processing applications.
As depicted in Fig. 16.1, this chapter focuses on describing the basic ideas of designing natural
language processing models using different types of deep learning architectures, such as MLPs,
CNNs, RNNs, and attention. Though it is possible to combine any pretrained text representations
with any architecture for either application in Fig. 16.1, we select a few representative combina-
tions. Specifically, we will explore popular architectures based on RNNs and CNNs for sentiment
analysis. For natural language inference, we choose attention and MLPs to demonstrate how to
analyze text pairs. In the end, we introduce how to fine-tune a pretrained BERT model for a wide
range of natural language processing applications, such as on a sequence level (single text classi-
fication and text pair classification) and a token level (text tagging and question answering). As a
concrete empirical case, we will fine-tune BERT for natural language inference.
As we have introduced in Section 15.8, BERT requires minimal architecture changes for a wide
range of natural language processing applications. However, this benefit comes at the cost of
fine-tuning a huge number of BERT parameters for the downstream applications. When space or time
is limited, those crafted models based on MLPs, CNNs, RNNs, and attention are more feasible. In
the following, we start with the sentiment analysis application and illustrate the model design based
on RNNs and CNNs, respectively.
16.1 Sentiment Analysis and the Dataset
With the proliferation of online social media and review platforms, a plethora of opinionated
data has been logged, bearing great potential for supporting decision making processes. Senti-
ment analysis studies peopleʼs sentiments in their produced text, such as product reviews, blog
comments, and forum discussions. It enjoys wide applications to fields as diverse as politics (e.g.,
analysis of public sentiments towards policies), finance (e.g., analysis of sentiments of the mar-
ket), and marketing (e.g., product research and brand management).
Since sentiments can be categorized as discrete polarities or scales (e.g., positive and negative),
we can consider sentiment analysis as a text classification task, which transforms a varying-length
text sequence into a fixed-length text category. In this chapter, we will use Stanfordʼs large movie
review dataset (https://ai.stanford.edu/~amaas/data/sentiment/) for sentiment analysis. It consists
of a training set and a testing set, each containing 25000 movie reviews downloaded from IMDb.
In both datasets, there are equal numbers of “positive” and “negative” labels, indicating different
sentiment polarities.
import os
import torch
from d2l import torch as d2l
16.1.1 Reading the Dataset
First, download and extract this IMDb review dataset to the path ../data/aclImdb.
#@save
d2l.DATA_HUB['aclImdb'] = (d2l.DATA_URL + 'aclImdb_v1.tar.gz',
'01ada507287d82875905620988597833ad4e0903')
data_dir = d2l.download_extract('aclImdb', 'aclImdb')
Next, read the training and test datasets. Each example is a review and its label: 1 for “positive”
and 0 for “negative”.
241 https://ai.stanford.edu/~amaas/data/sentiment/
#@save
def read_imdb(data_dir, is_train):
    """Read the IMDb review dataset text sequences and labels."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder_name = os.path.join(data_dir, 'train' if is_train else 'test',
                                   label)
        for file in os.listdir(folder_name):
            with open(os.path.join(folder_name, file), 'rb') as f:
                review = f.read().decode('utf-8').replace('\n', '')
                data.append(review)
                labels.append(1 if label == 'pos' else 0)
    return data, labels
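The reader above relies on a directory layout with one file per review, grouped into train or test and then into pos or neg folders. The following self-contained sketch builds a tiny temporary directory with that layout and reads it back with the same logic. The read_reviews function, file names, and review texts are made up for this illustration.

```python
import os
import tempfile

def read_reviews(data_dir, is_train):
    """Walk <data_dir>/<train|test>/<pos|neg> and collect texts and labels."""
    data, labels = [], []
    for label in ('pos', 'neg'):
        folder = os.path.join(data_dir, 'train' if is_train else 'test', label)
        # Sorting makes the order deterministic for this small demo
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), 'rb') as f:
                data.append(f.read().decode('utf-8').replace('\n', ''))
            labels.append(1 if label == 'pos' else 0)
    return data, labels

with tempfile.TemporaryDirectory() as root:
    for label, text in (('pos', 'great film\n'), ('neg', 'dull plot\n')):
        folder = os.path.join(root, 'train', label)
        os.makedirs(folder)
        # Write bytes so that newlines are identical on every platform
        with open(os.path.join(folder, 'review_0.txt'), 'wb') as f:
            f.write(text.encode('utf-8'))
    data, labels = read_reviews(root, is_train=True)

print(data, labels)
```

Because the 'pos' folder is read before the 'neg' folder, the returned lists are grouped by label rather than shuffled, so the first examples of the real training set all carry the label 1.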
train_data = read_imdb(data_dir, is_train=True)
print('# trainings:', len(train_data[0]))
for x, y in zip(train_data[0][:3], train_data[1][:3]):
    print('label:', y, 'review:', x[:60])
# trainings: 25000
label: 1 review: Henry Hathaway was daring, as well as enthusiastic, for his
label: 1 review: An unassuming, subtle and lean film, "The Man in the White S
label: 1 review: Eddie Murphy really made me laugh my ass off on this HBO sta
16.1.2 Preprocessing the Dataset
Treating each word as a token and filtering out words that appear fewer than 5 times, we create a
vocabulary out of the training dataset.
train_tokens = d2l.tokenize(train_data[0], token='word')

vocab = d2l.Vocab(train_tokens, min_freq=5, reserved_tokens=['<pad>'])
After tokenization, letʼs plot the histogram of review lengths in tokens.
d2l.set_figsize()
d2l.plt.xlabel('# tokens per review')
d2l.plt.ylabel('count')
d2l.plt.hist([len(line) for line in train_tokens], bins=range(0, 1000, 50));
As we expected, the reviews have varying lengths. To process a minibatch of such reviews at a
time, we set the length of each review to 500 with truncation and padding, which is similar to the
preprocessing step for the machine translation dataset in Section 10.5.
num_steps = 500  # sequence length
train_features = torch.tensor([d2l.truncate_pad(
    vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
print(train_features.shape)
torch.Size([25000, 500])
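Truncation and padding bring every review to the same length so that the reviews can be stacked into one tensor. The sketch below shows the operation on short lists; truncate_or_pad is a hypothetical stand-in that is assumed to behave like d2l.truncate_pad, and the token indices are made up.

```python
def truncate_or_pad(token_ids, num_steps, padding_id):
    """Cut a sequence to `num_steps` items or pad it on the right."""
    if len(token_ids) > num_steps:
        return token_ids[:num_steps]  # truncate
    return token_ids + [padding_id] * (num_steps - len(token_ids))  # pad

padding_id = 1  # hypothetical index of the '<pad>' token
short_review = [12, 7, 45]
long_review = [3, 9, 27, 81, 5, 6, 8]
print(truncate_or_pad(short_review, 5, padding_id))
print(truncate_or_pad(long_review, 5, padding_id))
```

Both results have exactly five elements: the short review is extended with the padding index, and the long review loses its last two tokens.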
16.1.3 Creating Data Iterators
Now we can create data iterators. At each iteration, a minibatch of examples is returned.
train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])), 64)

for X, y in train_iter:
    print('X:', X.shape, ', y:', y.shape)
    break
print('# batches:', len(train_iter))
X: torch.Size([64, 500]) , y: torch.Size([64])
# batches: 391
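The number of batches follows from the dataset size and the batch size, assuming that the last, smaller batch is kept rather than dropped. A short check of the arithmetic:

```python
import math

num_examples, batch_size = 25000, 64
num_batches = math.ceil(num_examples / batch_size)
# Size of the final, incomplete batch
last_batch_size = num_examples - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)
```

There are 390 full batches of 64 reviews (24960 reviews) plus one final batch with the remaining 40 reviews, which gives the 391 batches reported above.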

16.1.4 Putting All Things Together
Last, we wrap up the above steps into the load_data_imdb function. It returns training and test
data iterators and the vocabulary of the IMDb review dataset.
#@save
def load_data_imdb(batch_size, num_steps=500):
    """Return data iterators and the vocabulary of the IMDb review dataset."""
    data_dir = d2l.download_extract('aclImdb', 'aclImdb')
    train_data = read_imdb(data_dir, True)
    test_data = read_imdb(data_dir, False)
    train_tokens = d2l.tokenize(train_data[0], token='word')
    test_tokens = d2l.tokenize(test_data[0], token='word')
    vocab = d2l.Vocab(train_tokens, min_freq=5)
    train_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in train_tokens])
    test_features = torch.tensor([d2l.truncate_pad(
        vocab[line], num_steps, vocab['<pad>']) for line in test_tokens])
    train_iter = d2l.load_array((train_features, torch.tensor(train_data[1])),
                                batch_size)
    test_iter = d2l.load_array((test_features, torch.tensor(test_data[1])),
                               batch_size,
                               is_train=False)
    return train_iter, test_iter, vocab
16.1.5 Summary
• Sentiment analysis studies peopleʼs sentiments in their produced text, which is considered
as a text classification problem that transforms a varying-length text sequence into a fixed-
length text category.
• After preprocessing, we can load Stanfordʼs large movie review dataset (IMDb review dataset)
into data iterators with a vocabulary.
Exercises
1. What hyperparameters in this section can we modify to accelerate training sentiment anal-
ysis models?
2. Can you implement a function to load the dataset of Amazon reviews
(https://snap.stanford.edu/data/web-Amazon.html) into data iterators and labels for sentiment
analysis?
Discussions243
242 https://snap.stanford.edu/data/web-Amazon.html
16.2 Sentiment Analysis: Using Recurrent Neural Networks
Like word similarity and analogy tasks, we can also apply pretrained word vectors to sentiment
analysis. Since the IMDb review dataset in Section 16.1 is not very big, using text representations
that were pretrained on large-scale corpora may reduce overfitting of the model. As a specific
example illustrated in Fig. 16.2.1, we will represent each token using the pretrained GloVe model,
and feed these token representations into a multilayer bidirectional RNN to obtain the text se-
quence representation, which will be transformed into sentiment analysis outputs (Maas et al.,
2011). For the same downstream application, we will consider a different architectural choice
later.
Fig. 16.2.1: This section feeds pretrained GloVe to an RNN-based architecture for sentiment anal-
ysis.
import torch
from torch import nn
from d2l import torch as d2l

batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)
16.2.1 Representing Single Text with RNNs
In text classification tasks, such as sentiment analysis, a varying-length text sequence will be
transformed into fixed-length categories. In the following BiRNN class, while each token of a
text sequence gets its individual pretrained GloVe representation via the embedding layer
(self.embedding), the entire sequence is encoded by a bidirectional RNN (self.encoder). More
concretely, the hidden states (at the last layer) of the bidirectional LSTM at both the initial and
final time steps are concatenated as the representation of the text sequence. This single text
representation is then transformed into output categories by a fully connected layer (self.decoder)
with two outputs (“positive” and “negative”).
class BiRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens,
                 num_layers, **kwargs):
        super(BiRNN, self).__init__(**kwargs)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Set `bidirectional` to True to get a bidirectional RNN
        self.encoder = nn.LSTM(embed_size, num_hiddens, num_layers=num_layers,
                               bidirectional=True)
        self.decoder = nn.Linear(4 * num_hiddens, 2)

    def forward(self, inputs):
        # The shape of `inputs` is (batch size, no. of time steps). Because
        # LSTM requires its input's first dimension to be the temporal
        # dimension, the input is transposed before obtaining token
        # representations. The output shape is (no. of time steps, batch size,
        # word vector dimension)
        embeddings = self.embedding(inputs.T)
        self.encoder.flatten_parameters()
        # Returns hidden states of the last hidden layer at different time
        # steps. The shape of `outputs` is (no. of time steps, batch size,
        # 2 * no. of hidden units)
        outputs, _ = self.encoder(embeddings)
        # Concatenate the hidden states at the initial and final time steps as
        # the input of the fully connected layer. Its shape is (batch size,
        # 4 * no. of hidden units)
        encoding = torch.cat((outputs[0], outputs[-1]), dim=1)
        outs = self.decoder(encoding)
        return outs
Letʼs construct a bidirectional RNN with two hidden layers to represent single text for sentiment
analysis.
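embed_size, num_hiddens, num_layers, devices = 100, 100, 2, d2l.try_all_gpus()
net = BiRNN(len(vocab), embed_size, num_hiddens, num_layers)

def init_weights(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)
    if type(module) == nn.LSTM:
        for param in module._flat_weights_names:
            # Xavier initialization needs at least 2 dimensions, so skip biases
            if "weight" in param:
                nn.init.xavier_uniform_(module._parameters[param])
net.apply(init_weights);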
16.2.2 Loading Pretrained Word Vectors
Below we load the pretrained 100-dimensional (needs to be consistent with embed_size) GloVe
embeddings for tokens in the vocabulary.
glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
Print the shape of the vectors for all the tokens in the vocabulary.
embeds = glove_embedding[vocab.idx_to_token]
embeds.shape
torch.Size([49346, 100])
We use these pretrained word vectors to represent tokens in the reviews and will not update these
vectors during training.
net.embedding.weight.data.copy_(embeds)
net.embedding.weight.requires_grad = False
16.2.3 Training and Evaluating the Model
Now we can train the bidirectional RNN for sentiment analysis.


We define the following function to predict the sentiment of a text sequence using the trained
model net.
#@save
def predict_sentiment(net, vocab, sequence):
    """Predict the sentiment of a text sequence."""
    sequence = torch.tensor(vocab[sequence.split()], device=d2l.try_gpu())
    label = torch.argmax(net(sequence.reshape(1, -1)), dim=1)
    return 'positive' if label == 1 else 'negative'
Finally, letʼs use the trained model to predict the sentiment for two simple sentences.

predict_sentiment(net, vocab, 'this movie is so great')
'positive'
predict_sentiment(net, vocab, 'this movie is so bad')
'negative'
16.2.4 Summary
• Pretrained word vectors can represent individual tokens in a text sequence.

• Bidirectional RNNs can represent a text sequence, such as via the concatenation of its hidden
states at the initial and final time steps. This single text representation can be transformed
into categories using a fully connected layer.
Exercises
1. Increase the number of epochs. Can you improve the training and testing accuracies? How
about tuning other hyperparameters?
2. Use larger pretrained word vectors, such as 300-dimensional GloVe embeddings. Does it
improve classification accuracy?
3. Can we improve the classification accuracy by using the spaCy tokenization? You need to
install spaCy (pip install spacy) and install an English pipeline (python -m spacy download
en_core_web_sm). In the code, first, import spaCy (import spacy). Then, load the spaCy English
pipeline (spacy_en = spacy.load('en_core_web_sm')). Finally, define the function
def tokenizer(text): return [tok.text for tok in spacy_en.tokenizer(text)] and replace the
original tokenizer function. Note the different forms of phrase tokens in GloVe and spaCy. For
example, the phrase token “new york” takes the form of “new-york” in GloVe and the form of
“new york” after the spaCy tokenization.
16.3 Sentiment Analysis: Using Convolutional Neural Networks
In Chapter 7, we investigated mechanisms for processing two-dimensional image data with two-
dimensional CNNs, which were applied to local features such as adjacent pixels. Though originally
designed for computer vision, CNNs are also widely used for natural language processing. Simply
put, just think of any text sequence as a one-dimensional image. In this way, one-dimensional
CNNs can process local features such as n-grams in text.
In this section, we will use the textCNN model to demonstrate how to design a CNN architecture
for representing single text (Kim, 2014). Compared with Fig. 16.2.1 that uses an RNN architecture
with GloVe pretraining for sentiment analysis, the only difference in Fig. 16.3.1 lies in the choice
of the architecture.
Fig. 16.3.1: This section feeds pretrained GloVe to a CNN-based architecture for sentiment analy-
sis.
import torch
from torch import nn
from d2l import torch as d2l
batch_size = 64
train_iter, test_iter, vocab = d2l.load_data_imdb(batch_size)
16.3.1 One-Dimensional Convolutions
Before introducing the model, letʼs see how a one-dimensional convolution works. Bear in mind
that it is just a special case of a two-dimensional convolution based on the cross-correlation oper-
ation.
Fig. 16.3.2: One-dimensional cross-correlation operation. The shaded portions are the first output
element as well as the input and kernel tensor elements used for the output computation: 0 × 1 +
1 × 2 = 2.
As shown in Fig. 16.3.2, in the one-dimensional case, the convolution window slides from left
to right across the input tensor. During sliding, the input subtensor (e.g., 0 and 1 in Fig. 16.3.2)
contained in the convolution window at a certain position and the kernel tensor (e.g., 1 and 2 in
Fig. 16.3.2) are multiplied elementwise. The sum of these multiplications gives the single scalar
value (e.g., 0 × 1 + 1 × 2 = 2 in Fig. 16.3.2) at the corresponding position of the output tensor.
We implement one-dimensional cross-correlation in the following corr1d function. Given an in-
put tensor X and a kernel tensor K, it returns the output tensor Y.

def corr1d(X, K):
    w = K.shape[0]
    Y = torch.zeros((X.shape[0] - w + 1))
    # Slide the window over every valid position of the input
    for i in range(Y.shape[0]):
        Y[i] = (X[i: i + w] * K).sum()
    return Y
We can construct the input tensor X and the kernel tensor K from Fig. 16.3.2 to validate the output
of the above one-dimensional cross-correlation implementation.
X, K = torch.tensor([0, 1, 2, 3, 4, 5, 6]), torch.tensor([1, 2])

corr1d(X, K)
tensor([ 2., 5., 8., 11., 14., 17.])
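As a cross-check that does not depend on PyTorch, the same sliding-window computation can be sketched with plain Python lists. The helper name corr1d_lists is hypothetical and is not part of the d2l package; it only mirrors the arithmetic described above.

```python
def corr1d_lists(x, k):
    """One-dimensional cross-correlation on plain Python lists."""
    w = len(k)
    # One output element per valid window position
    return [sum(x[i + j] * k[j] for j in range(w))
            for i in range(len(x) - w + 1)]

x = [0, 1, 2, 3, 4, 5, 6]
k = [1, 2]
y = corr1d_lists(x, k)
print(y)  # [2, 5, 8, 11, 14, 17]
```

The first element is 0 × 1 + 1 × 2 = 2, matching the shaded window in Fig. 16.3.2, and the output has 7 − 2 + 1 = 6 elements.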
For any one-dimensional input with multiple channels, the convolution kernel needs to have the
same number of input channels. Then for each channel, perform a cross-correlation operation
on the one-dimensional tensor of the input and the one-dimensional tensor of the convolution
kernel, summing the results over all the channels to produce the one-dimensional output tensor.
Fig. 16.3.3 shows a one-dimensional cross-correlation operation with 3 input channels.
Fig. 16.3.3: One-dimensional cross-correlation operation with 3 input channels. The shaded por-
tions are the first output element as well as the input and kernel tensor elements used for the
output computation: 0 × 1 + 1 × 2 + 1 × 3 + 2 × 4 + 2 × (−1) + 3 × (−3) = 2.
We can implement the one-dimensional cross-correlation operation for multiple input channels
and validate the results in Fig. 16.3.3.
def corr1d_multi_in(X, K):
    # First, iterate through the 0th dimension (channel dimension) of `X` and
    # `K`. Then, add them together
    return sum(corr1d(x, k) for x, k in zip(X, K))
X = torch.tensor([[0, 1, 2, 3, 4, 5, 6],
[1, 2, 3, 4, 5, 6, 7],
[2, 3, 4, 5, 6, 7, 8]])
K = torch.tensor([[1, 2], [3, 4], [-1, -3]])
corr1d_multi_in(X, K)
tensor([ 2., 8., 14., 20., 26., 32.])
Note that multi-input-channel one-dimensional cross-correlations are equivalent to
single-input-channel two-dimensional cross-correlations. To illustrate, an equivalent form of the
multi-input-channel one-dimensional cross-correlation in Fig. 16.3.3 is the single-input-channel
two-dimensional cross-correlation in Fig. 16.3.4, where the height of the convolution kernel has
to be the same as that of the input tensor.
Fig. 16.3.4: Two-dimensional cross-correlation operation with a single input channel. The shaded
portions are the first output element as well as the input and kernel tensor elements used for the
output computation: 2 × (−1) + 3 × (−3) + 1 × 3 + 2 × 4 + 0 × 1 + 1 × 2 = 2.
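The equivalence can be checked numerically. The following is a minimal sketch on plain Python lists, using the input and kernel values from Fig. 16.3.3; the helper names corr1d_multi_in_lists and corr2d_lists are hypothetical and chosen only for this illustration.

```python
def corr1d_multi_in_lists(X, K):
    """Sum per-channel 1D cross-correlations (X: channels x width)."""
    w = len(K[0])
    n_out = len(X[0]) - w + 1
    return [sum(X[c][i + j] * K[c][j]
                for c in range(len(X)) for j in range(w))
            for i in range(n_out)]

def corr2d_lists(X, K):
    """2D cross-correlation with a single input channel (X: height x width)."""
    h, w = len(K), len(K[0])
    return [[sum(X[r + a][c + b] * K[a][b]
                 for a in range(h) for b in range(w))
             for c in range(len(X[0]) - w + 1)]
            for r in range(len(X) - h + 1)]

X = [[0, 1, 2, 3, 4, 5, 6],
     [1, 2, 3, 4, 5, 6, 7],
     [2, 3, 4, 5, 6, 7, 8]]
K = [[1, 2], [3, 4], [-1, -3]]

out_1d = corr1d_multi_in_lists(X, K)  # rows of X treated as 3 channels
out_2d = corr2d_lists(X, K)           # X treated as one 3 x 7 image
print(out_1d)  # [2, 8, 14, 20, 26, 32]
print(out_2d)  # [[2, 8, 14, 20, 26, 32]]
```

Because the kernel height equals the input height, the two-dimensional window can only slide horizontally, so its single output row coincides with the one-dimensional result.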
Both the outputs in Fig. 16.3.2 and Fig. 16.3.3 have only one channel. As with the two-dimensional
convolutions with multiple output channels described in Section 7.4.2, we can also specify
multiple output channels for one-dimensional convolutions.
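A minimal sketch of multiple output channels, again on plain Python lists: each output channel has its own multi-input-channel kernel, so the kernel tensor has the layout (output channels, input channels, kernel width). The function name and the toy values below are assumptions made for illustration.

```python
def corr1d_multi_in_out(X, K):
    """X: (c_in, width); K: (c_out, c_in, kernel width) -> (c_out, out width)."""
    outputs = []
    for kernel in K:  # one multi-input-channel kernel per output channel
        w = len(kernel[0])
        n_out = len(X[0]) - w + 1
        outputs.append([
            sum(X[c][i + j] * kernel[c][j]
                for c in range(len(X)) for j in range(w))
            for i in range(n_out)])
    return outputs

X = [[0, 1, 2, 3], [1, 2, 3, 4]]   # 2 input channels, width 4
K = [[[1, 0], [0, 1]],             # kernel for output channel 0
     [[1, 1], [1, 1]],             # kernel for output channel 1
     [[0, 0], [0, -1]]]            # kernel for output channel 2
Y = corr1d_multi_in_out(X, K)
print(Y)  # [[2, 4, 6], [4, 8, 12], [-2, -3, -4]]
```

The result has 3 output channels, each of width 4 − 2 + 1 = 3.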
16.3.2 Max-Over-Time Pooling
Similarly, we can use pooling to extract the highest value from sequence representations as the
most important feature across time steps. The max-over-time pooling used in textCNN works like
the one-dimensional global max-pooling (Collobert et al., 2011). For a multi-channel input where
each channel stores values at different time steps, the output at each channel is the maximum
value for that channel. Note that the max-over-time pooling allows different numbers of time
steps at different channels.
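The following sketch illustrates why the number of time steps may differ between channels: the pooling keeps one scalar per channel regardless of its length. The function name max_over_time and the toy values are assumptions made for illustration.

```python
def max_over_time(channels):
    """Return the maximum over time steps for each channel."""
    return [max(values) for values in channels]

# Two groups of channels with different numbers of time steps,
# e.g., as produced by convolution kernels of different widths
channels_a = [[0.1, 0.7, 0.3], [0.5, 0.2, 0.4]]   # 2 channels, 3 time steps
channels_b = [[0.9, 0.0, 0.6, 0.8, 0.1]]          # 1 channel, 5 time steps

pooled = max_over_time(channels_a) + max_over_time(channels_b)
print(pooled)  # [0.7, 0.5, 0.9]
```

The concatenated vector has one entry per channel (here 3), independent of how many time steps each channel had.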
16.3.3 The textCNN Model
Using the one-dimensional convolution and max-over-time pooling, the textCNN model takes in-
dividual pretrained token representations as input, then obtains and transforms sequence repre-
sentations for the downstream application.
For a single text sequence with n tokens represented by d-dimensional vectors, the width, height,
and number of channels of the input tensor are n, 1, and d, respectively. The textCNN model
transforms the input into the output as follows:
1. Define multiple one-dimensional convolution kernels and perform convolution operations
separately on the inputs. Convolution kernels with different widths may capture local fea-
tures among different numbers of adjacent tokens.
2. Perform max-over-time pooling on all the output channels, and then concatenate all the
scalar pooling outputs as a vector.
3. Transform the concatenated vector into the output categories using the fully connected
layer. Dropout can be used for reducing overfitting.

Fig. 16.3.5: The model architecture of textCNN.
Fig. 16.3.5 illustrates the model architecture of textCNN with a concrete example. The input is
a sentence with 11 tokens, where each token is represented by a 6-dimensional vector. So we
have a 6-channel input with width 11. Define two one-dimensional convolution kernels of widths
2 and 4, with 4 and 5 output channels, respectively. They produce 4 output channels with width
11 − 2 + 1 = 10 and 5 output channels with width 11 − 4 + 1 = 8. Despite different widths of these 9
channels, the max-over-time pooling gives a concatenated 9-dimensional vector, which is finally
transformed into a 2-dimensional output vector for binary sentiment predictions.
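The shape arithmetic of this example can be reproduced with a few lines of Python. This is only a bookkeeping sketch; the helper textcnn_shapes is hypothetical and performs no convolution.

```python
def textcnn_shapes(num_tokens, kernel_widths, num_channels):
    """Return per-kernel (channels, width) and the pooled vector length."""
    conv_shapes = [(c, num_tokens - k + 1)
                   for k, c in zip(kernel_widths, num_channels)]
    # Max-over-time pooling keeps one scalar per output channel
    pooled_dim = sum(c for c, _ in conv_shapes)
    return conv_shapes, pooled_dim

conv_shapes, pooled_dim = textcnn_shapes(11, [2, 4], [4, 5])
print(conv_shapes)  # [(4, 10), (5, 8)]
print(pooled_dim)   # 9
```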
Defining the Model
We implement the textCNN model in the following class. Compared with the bidirectional RNN
model in Section 16.2, besides replacing recurrent layers with convolutional layers, we also use
two embedding layers: one with trainable weights and the other with fixed weights.
class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_size, kernel_sizes, num_channels,
                 **kwargs):
        super(TextCNN, self).__init__(**kwargs)
        # The embedding layer to be trained
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # The embedding layer not to be trained
        self.constant_embedding = nn.Embedding(vocab_size, embed_size)
        self.dropout = nn.Dropout(0.5)
        self.decoder = nn.Linear(sum(num_channels), 2)
        # The max-over-time pooling layer has no parameters, so this instance
        # can be shared
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.relu = nn.ReLU()
        # Create multiple one-dimensional convolutional layers
        self.convs = nn.ModuleList()
        for c, k in zip(num_channels, kernel_sizes):
            self.convs.append(nn.Conv1d(2 * embed_size, c, k))

    def forward(self, inputs):
        # Concatenate two embedding layer outputs with shape (batch size, no.
        # of tokens, token vector dimension) along vectors
        embeddings = torch.cat((
            self.embedding(inputs), self.constant_embedding(inputs)), dim=2)
        # Per the input format of one-dimensional convolutional layers,
        # rearrange the tensor so that the second dimension stores channels
        embeddings = embeddings.permute(0, 2, 1)
        # For each one-dimensional convolutional layer, after max-over-time
        # pooling, a tensor of shape (batch size, no. of channels, 1) is
        # obtained. Remove the last dimension and concatenate along channels
        encoding = torch.cat([
            torch.squeeze(self.relu(self.pool(conv(embeddings))), dim=-1)
            for conv in self.convs], dim=1)
        outputs = self.decoder(self.dropout(encoding))
        return outputs
Letʼs create a textCNN instance. It has 3 convolutional layers with kernel widths of 3, 4, and 5, all
with 100 output channels.
embed_size, kernel_sizes, nums_channels = 100, [3, 4, 5], [100, 100, 100]
devices = d2l.try_all_gpus()
net = TextCNN(len(vocab), embed_size, kernel_sizes, nums_channels)

def init_weights(module):
    if type(module) in (nn.Linear, nn.Conv1d):
        nn.init.xavier_uniform_(module.weight)
net.apply(init_weights);
Loading Pretrained Word Vectors
Same as Section 16.2, we load pretrained 100-dimensional GloVe embeddings as the initialized
token representations. These token representations (embedding weights) will be trained in em-
bedding and fixed in constant_embedding.
glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
net.embedding.weight.data.copy_(embeds)
net.constant_embedding.weight.data.copy_(embeds)
net.constant_embedding.weight.requires_grad = False

Training and Evaluating the Model
Now we can train the textCNN model for sentiment analysis.


Below we use the trained model to predict the sentiment for two simple sentences.
d2l.predict_sentiment(net, vocab, 'this movie is so great')
'positive'
d2l.predict_sentiment(net, vocab, 'this movie is so bad')
'negative'
16.3.4 Summary
• One-dimensional CNNs can process local features such as n-grams in text.

• Multi-input-channel one-dimensional cross-correlations are equivalent to single-input-
channel two-dimensional cross-correlations.
• The max-over-time pooling allows different numbers of time steps at different channels.
• The textCNN model transforms individual token representations into downstream applica-
tion outputs using one-dimensional convolutional layers and max-over-time pooling layers.

Exercises
1. Tune hyperparameters and compare the two architectures for sentiment analysis in Section
16.2 and in this section, such as in classification accuracy and computational efficiency.
2. Can you further improve the classification accuracy of the model by using the methods in-
troduced in the exercises of Section 16.2?
3. Add positional encoding in the input representations. Does it improve the classification ac-
curacy?
16.4 Natural Language Inference and the Dataset
In Section 16.1, we discussed the problem of sentiment analysis. This task aims to classify a sin-
gle text sequence into predefined categories, such as a set of sentiment polarities. However, when
there is a need to decide whether one sentence can be inferred from another, or eliminate
redundancy by identifying sentences that are semantically equivalent, knowing how to classify one text
sequence is insufficient. Instead, we need to be able to reason over pairs of text sequences.
16.4.1 Natural Language Inference
Natural language inference studies whether a hypothesis can be inferred from a premise, where both
are text sequences. In other words, natural language inference determines the logical
relationship between a pair of text sequences. Such relationships usually fall into three types:
• Entailment: the hypothesis can be inferred from the premise.
• Contradiction: the negation of the hypothesis can be inferred from the premise.
• Neutral: all the other cases.
Natural language inference is also known as the recognizing textual entailment task. For example,
the following pair will be labeled as entailment because “showing affection” in the hypothesis can
be inferred from “hugging each other” in the premise.
Premise: Two women are hugging each other.
Hypothesis: Two women are showing affection.
The following is an example of contradiction as “running the coding example” indicates “not sleep-
ing” rather than “sleeping”.
Premise: A man is running the coding example from Dive into Deep Learning.
Hypothesis: The man is sleeping.
The third example shows a neutrality relationship because neither “famous” nor “not famous” can
be inferred from the fact that “are performing for us”.
Premise: The musicians are performing for us.
Hypothesis: The musicians are famous.
Natural language inference has been a central topic for understanding natural language. It en-
joys wide applications ranging from information retrieval to open-domain question answering.
To study this problem, we will begin by investigating a popular natural language inference bench-
mark dataset.
16.4.2 The Stanford Natural Language Inference (SNLI) Dataset
The Stanford Natural Language Inference (SNLI) Corpus is a collection of over 500000 labeled English
sentence pairs (Bowman et al., 2015). We download and store the extracted SNLI dataset in the
path ../data/snli_1.0.
import os
import re
import torch
from d2l import torch as d2l
#@save
d2l.DATA_HUB['SNLI'] = (
'https://nlp.stanford.edu/projects/snli/snli_1.0.zip',
'9fcde07509c7e87ec61c640c1b2753d9041758e4')
data_dir = d2l.download_extract('SNLI')
Reading the Dataset
The original SNLI dataset contains much richer information than what we really need in our ex-
periments. Thus, we define a function read_snli to only extract part of the dataset, then return
lists of premises, hypotheses, and their labels.
#@save
def read_snli(data_dir, is_train):
    """Read the SNLI dataset into premises, hypotheses, and labels."""
    def extract_text(s):
        # Remove the parentheses of the binary parse, which we do not use
        s = re.sub('\\(', '', s)
        s = re.sub('\\)', '', s)
        # Substitute two or more consecutive whitespace with space
        s = re.sub('\\s{2,}', ' ', s)
        return s.strip()
    label_set = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
    file_name = os.path.join(data_dir, 'snli_1.0_train.txt'
                             if is_train else 'snli_1.0_test.txt')
    with open(file_name, 'r') as f:
        rows = [row.split('\t') for row in f.readlines()[1:]]
    # Keep only rows whose gold label is one of the three known classes
    premises = [extract_text(row[1]) for row in rows if row[0] in label_set]
    hypotheses = [extract_text(row[2]) for row in rows if row[0] in label_set]
    labels = [label_set[row[0]] for row in rows if row[0] in label_set]
    return premises, hypotheses, labels
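To see what the text-cleaning step does, the sketch below applies the same three substitutions to a parenthesized parse string. The input string is made up for illustration (it only imitates the bracketed format of the parse columns), and the function name strip_parse is hypothetical.

```python
import re

def strip_parse(s):
    """Drop parentheses from a bracketed parse and normalize whitespace."""
    s = re.sub(r'\(', '', s)
    s = re.sub(r'\)', '', s)
    # Collapse runs of two or more whitespace characters into one space
    s = re.sub(r'\s{2,}', ' ', s)
    return s.strip()

parse = '( ( A person ) ( ( is outdoors ) . ) )'
print(strip_parse(parse))  # A person is outdoors .
```

Punctuation remains a separate, space-delimited token, which is why the printed premises and hypotheses contain spaces before periods and commas.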
Now letʼs print the first 3 pairs of premise and hypothesis, as well as their labels (“0”, “1”, and “2”
correspond to “entailment”, “contradiction”, and “neutral”, respectively).
train_data = read_snli(data_dir, is_train=True)
for x0, x1, y in zip(train_data[0][:3], train_data[1][:3], train_data[2][:3]):
    print('premise:', x0)
    print('hypothesis:', x1)
    print('label:', y)
premise: A person on a horse jumps over a broken down airplane .
hypothesis: A person is training his horse for a competition .
label: 2
premise: A person on a horse jumps over a broken down airplane .
hypothesis: A person is at a diner , ordering an omelette .
label: 1
premise: A person on a horse jumps over a broken down airplane .
hypothesis: A person is outdoors , on a horse .
label: 0
The training set has about 550000 pairs, and the testing set has about 10000 pairs. The following
shows that the three labels “entailment”, “contradiction”, and “neutral” are balanced in both the
training set and the testing set.
test_data = read_snli(data_dir, is_train=False)
for data in [train_data, test_data]:
    print([[row for row in data[2]].count(i) for i in range(3)])
[183416, 183187, 182764]
[3368, 3237, 3219]
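As a quick arithmetic check of the balance claim, the printed counts can be turned into totals and class shares. The sketch below hard-codes the counts shown above rather than reading the dataset.

```python
train_counts = [183416, 183187, 182764]  # entailment, contradiction, neutral
test_counts = [3368, 3237, 3219]

for counts in (train_counts, test_counts):
    total = sum(counts)
    shares = [c / total for c in counts]
    # Each share should be close to one third if the labels are balanced
    print(total, [f'{s:.1%}' for s in shares])
```

The totals are 549367 and 9824 labeled pairs, and every class share lies within about one percentage point of one third.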
Defining a Class for Loading the Dataset
Below we define a class for loading the SNLI dataset by inheriting from the Dataset class in PyTorch.
The argument num_steps in the class constructor specifies the length of a text sequence so that
each minibatch of sequences will have the same shape. In other words, tokens after the first
num_steps ones in longer sequences are trimmed, while special tokens “<pad>” will be appended
to shorter sequences until their length becomes num_steps. By implementing the __getitem__
function, we can arbitrarily access the premise, hypothesis, and label with the index idx.
#@save
class SNLIDataset(torch.utils.data.Dataset):
    """A customized dataset to load the SNLI dataset."""
    def __init__(self, dataset, num_steps, vocab=None):
        self.num_steps = num_steps
        all_premise_tokens = d2l.tokenize(dataset[0])
        all_hypothesis_tokens = d2l.tokenize(dataset[1])
        if vocab is None:
            self.vocab = d2l.Vocab(all_premise_tokens + all_hypothesis_tokens,
                                   min_freq=5, reserved_tokens=['<pad>'])
        else:
            self.vocab = vocab
        self.premises = self._pad(all_premise_tokens)
        self.hypotheses = self._pad(all_hypothesis_tokens)
        self.labels = torch.tensor(dataset[2])
        print('read ' + str(len(self.premises)) + ' examples')

    def _pad(self, lines):
        # Map tokens to indices, then truncate or pad to `num_steps`
        return torch.tensor([d2l.truncate_pad(
            self.vocab[line], self.num_steps, self.vocab['<pad>'])
                             for line in lines])

    def __getitem__(self, idx):
        return (self.premises[idx], self.hypotheses[idx]), self.labels[idx]

    def __len__(self):
        return len(self.premises)
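The trimming and padding behavior described above can be sketched independently of the library. The helper truncate_or_pad below is a hypothetical stand-in written for illustration, and the padding index 1 is an assumed value rather than the actual index of “<pad>” in the vocabulary.

```python
def truncate_or_pad(token_ids, num_steps, pad_id):
    """Trim or right-pad a list of token indices to length num_steps."""
    if len(token_ids) > num_steps:
        return token_ids[:num_steps]  # drop tokens after the first num_steps
    return token_ids + [pad_id] * (num_steps - len(token_ids))

PAD = 1  # assumed index of '<pad>' for this example
long_seq = truncate_or_pad([5, 8, 13, 21, 34, 55], 4, PAD)
short_seq = truncate_or_pad([5, 8], 4, PAD)
print(long_seq)   # [5, 8, 13, 21]
print(short_seq)  # [5, 8, 1, 1]
```

Since every sequence ends up with exactly num_steps indices, a list of them can be stacked into one rectangular tensor.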
Now we can invoke the read_snli function and the SNLIDataset class to download the SNLI dataset
and return DataLoader instances for both training and testing sets, together with the vocabulary
of the training set. It is noteworthy that we must use the vocabulary constructed from the training
set as that of the testing set. As a result, any new token from the testing set will be unknown to
the model trained on the training set.
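The effect of reusing the training vocabulary can be sketched with a toy vocabulary built from plain Python containers. The function build_vocab is a simplified, hypothetical stand-in (it sorts tokens alphabetically, and index 0 is assumed to denote the unknown token); it is not the d2l vocabulary class.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, reserved=('<pad>',)):
    """Map tokens seen at least min_freq times to indices; 0 is '<unk>'."""
    counts = Counter(tok for line in token_lists for tok in line)
    idx_to_token = ['<unk>'] + list(reserved) + sorted(
        tok for tok, n in counts.items() if n >= min_freq)
    return {tok: i for i, tok in enumerate(idx_to_token)}

train_tokens = [['a', 'man', 'sleeps'], ['a', 'man', 'runs']]
test_tokens = ['a', 'woman', 'runs']

vocab_map = build_vocab(train_tokens)
# Tokens absent from the training vocabulary fall back to '<unk>' (index 0)
test_ids = [vocab_map.get(tok, 0) for tok in test_tokens]
print(test_ids)  # [2, 0, 4]
```

Here “woman” never occurs in the toy training data, so it is mapped to the unknown-token index at test time.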
#@save
def load_data_snli(batch_size, num_steps=50):
    """Download the SNLI dataset and return data iterators and vocabulary."""
    num_workers = d2l.get_dataloader_workers()
    data_dir = d2l.download_extract('SNLI')
    train_data = read_snli(data_dir, True)
    test_data = read_snli(data_dir, False)
    train_set = SNLIDataset(train_data, num_steps)
    # Reuse the training vocabulary for the testing set
    test_set = SNLIDataset(test_data, num_steps, train_set.vocab)
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                             shuffle=True,
                                             num_workers=num_workers)
    test_iter = torch.utils.data.DataLoader(test_set, batch_size,
                                            shuffle=False,
                                            num_workers=num_workers)
    return train_iter, test_iter, train_set.vocab
Here we set the batch size to 128 and sequence length to 50, and invoke the load_data_snli func-
tion to get the data iterators and vocabulary. Then we print the vocabulary size.
train_iter, test_iter, vocab = load_data_snli(128, 50)

len(vocab)
read 549367 examples

read 9824 examples
18678
Now we print the shape of the first minibatch. Contrary to sentiment analysis, we have two inputs
X[0] and X[1] representing pairs of premises and hypotheses.
for X, Y in train_iter:
    print(X[0].shape)
    print(X[1].shape)
    print(Y.shape)
    break
torch.Size([128, 50])
torch.Size([128, 50])
torch.Size([128])
16.4.3 Summary
• Natural language inference studies whether a hypothesis can be inferred from a premise,
where both are text sequences.
• In natural language inference, relationships between premises and hypotheses include en-
tailment, contradiction, and neutral.
• Stanford Natural Language Inference (SNLI) Corpus is a popular benchmark dataset of nat-
ural language inference.
Exercises
1. Machine translation has long been evaluated based on superficial n-gram matching between
an output translation and a ground-truth translation. Can you design a measure for evaluat-
ing machine translation results by using natural language inference?
2. How can we change hyperparameters to reduce the vocabulary size?
16.5 Natural Language Inference: Using Attention
We introduced the natural language inference task and the SNLI dataset in Section 16.4. In view
of many models that are based on complex and deep architectures, Parikh et al. proposed to
address natural language inference with attention mechanisms and called it a “decomposable at-
tention model” (Parikh et al., 2016). This results in a model without recurrent or convolutional
layers, achieving the best result at the time on the SNLI dataset with much fewer parameters. In
this section, we will describe and implement this attention-based method (with MLPs) for natural
language inference, as depicted in Fig. 16.5.1.
Fig. 16.5.1: This section feeds pretrained GloVe to an architecture based on attention and MLPs
for natural language inference.
16.5.1 The Model
Simpler than preserving the order of tokens in premises and hypotheses, we can just align tokens
in one text sequence to every token in the other, and vice versa, then compare and aggregate
such information to predict the logical relationships between premises and hypotheses. Similar to
alignment of tokens between source and target sentences in machine translation, the alignment of
tokens between premises and hypotheses can be neatly accomplished by attention mechanisms.
Fig. 16.5.2: Natural language inference using attention mechanisms.
Fig. 16.5.2 depicts the natural language inference method using attention mechanisms. At a high
level, it consists of three jointly trained steps: attending, comparing, and aggregating. We will
illustrate them step by step in the following.
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l
Attending
The first step is to align tokens in one text sequence to each token in the other sequence. Suppose
that the premise is “i do need sleep” and the hypothesis is “i am tired”. Due to semantic
similarity, we may wish to align “i” in the hypothesis with “i” in the premise, and align “tired” in the
hypothesis with “sleep” in the premise. Likewise, we may wish to align “i” in the premise with “i”
in the hypothesis, and align “need” and “sleep” in the premise with “tired” in the hypothesis. Note
that such alignment is soft using weighted average, where ideally large weights are associated with
the tokens to be aligned. For ease of demonstration, Fig. 16.5.2 shows such alignment in a hard
way.
Now we describe the soft alignment using attention mechanisms in more detail. Denote by
$\mathbf{A} = (\mathbf{a}_1, \ldots, \mathbf{a}_m)$ and $\mathbf{B} = (\mathbf{b}_1, \ldots, \mathbf{b}_n)$
the premise and hypothesis, whose numbers of tokens are $m$ and $n$, respectively, where
$\mathbf{a}_i, \mathbf{b}_j \in \mathbb{R}^{d}$ ($i = 1, \ldots, m$, $j = 1, \ldots, n$) is a
$d$-dimensional word vector. For soft alignment, we compute the attention weights $e_{ij} \in \mathbb{R}$ as
$$e_{ij} = f(\mathbf{a}_i)^\top f(\mathbf{b}_j), \tag{16.5.1}$$
where the function f is an MLP defined in the following mlp function. The output dimension of f
is specified by the num_hiddens argument of mlp.
def mlp(num_inputs, num_hiddens, flatten):
    net = []
    net.append(nn.Dropout(0.2))
    net.append(nn.Linear(num_inputs, num_hiddens))
    net.append(nn.ReLU())
    if flatten:
        net.append(nn.Flatten(start_dim=1))
    net.append(nn.Dropout(0.2))
    net.append(nn.Linear(num_hiddens, num_hiddens))
    net.append(nn.ReLU())
    if flatten:
        net.append(nn.Flatten(start_dim=1))
    return nn.Sequential(*net)
It should be highlighted that, in (16.5.1), $f$ takes inputs $\mathbf{a}_i$ and $\mathbf{b}_j$ separately
rather than taking a pair of them together as input. This decomposition trick leads to only $m + n$
applications (linear complexity) of $f$ rather than $mn$ applications (quadratic complexity).
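The saving can be made concrete by counting how often f is applied. In the sketch below, f is a toy stand-in for the MLP (it merely doubles its input), and the token vectors are made-up values; only the call count matters.

```python
counter = {'calls': 0}

def f(vec):
    """Toy stand-in for the MLP f; counts how often it is applied."""
    counter['calls'] += 1
    return [2.0 * v for v in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # m = 4 premise tokens
B = [[1.0, 2.0], [3.0, 4.0], [0.0, 1.0]]              # n = 3 hypothesis tokens

f_A = [f(a) for a in A]  # m applications of f
f_B = [f(b) for b in B]  # n applications of f
# All m * n attention weights need only dot products of cached outputs
e = [[dot(fa, fb) for fb in f_B] for fa in f_A]

print(counter['calls'])   # 7, i.e., m + n rather than m * n = 12
print(len(e), len(e[0]))  # 4 3
```

Note that the m × n matrix of attention weights is still formed; what is saved is the number of MLP evaluations, since each remaining entry costs only a dot product.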
Normalizing the attention weights in (16.5.1), we compute the weighted average of all the token
vectors in the hypothesis to obtain representation of the hypothesis that is softly aligned with the
token indexed by i in the premise:
$$\boldsymbol{\beta}_i = \sum_{j=1}^{n} \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \mathbf{b}_j. \tag{16.5.2}$$

Likewise, we compute soft alignment of premise tokens for each token indexed by j in the hypoth-
esis:
$$\boldsymbol{\alpha}_j = \sum_{i=1}^{m} \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{kj})} \mathbf{a}_i. \tag{16.5.3}$$
Below we define the Attend class to compute the soft alignment of hypotheses (beta) with input
premises A and soft alignment of premises (alpha) with input hypotheses B.
class Attend(nn.Module):
    def __init__(self, num_inputs, num_hiddens, **kwargs):
        super(Attend, self).__init__(**kwargs)
        self.f = mlp(num_inputs, num_hiddens, flatten=False)

    def forward(self, A, B):
        # Shape of `A`/`B`: (`batch_size`, no. of tokens in sequence A/B,
        # `embed_size`)
        # Shape of `f_A`/`f_B`: (`batch_size`, no. of tokens in sequence A/B,
        # `num_hiddens`)
        f_A = self.f(A)
        f_B = self.f(B)
        # Shape of `e`: (`batch_size`, no. of tokens in sequence A,
        # no. of tokens in sequence B)
        e = torch.bmm(f_A, f_B.permute(0, 2, 1))
        # Shape of `beta`: (`batch_size`, no. of tokens in sequence A,
        # `embed_size`), where sequence B is softly aligned with each token
        # (axis 1 of `beta`) in sequence A
        beta = torch.bmm(F.softmax(e, dim=-1), B)
        # Shape of `alpha`: (`batch_size`, no. of tokens in sequence B,
        # `embed_size`), where sequence A is softly aligned with each token
        # (axis 1 of `alpha`) in sequence B
        alpha = torch.bmm(F.softmax(e.permute(0, 2, 1), dim=-1), A)
        return beta, alpha
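For a single example without the batch dimension, the soft alignment amounts to a softmax over each row of the score matrix followed by a weighted average of the hypothesis vectors. The sketch below uses plain Python lists with made-up scores and vectors; the helper names softmax and soft_align are chosen for this illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def soft_align(e, B):
    """beta_i = sum_j softmax(e_i)_j * b_j for every row i of e."""
    d = len(B[0])
    betas = []
    for row in e:
        w = softmax(row)
        betas.append([sum(w[j] * B[j][t] for j in range(len(B)))
                      for t in range(d)])
    return betas

# Toy scores: m = 2 premise tokens, n = 3 hypothesis tokens
e = [[0.0, 0.0, 0.0],
     [10.0, 0.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 vectors with d = 2
beta = soft_align(e, B)
print(beta[0])  # equal weights: roughly [0.667, 0.667]
print(beta[1])  # dominated by the first hypothesis vector: close to [1, 0]
```

Applying the same function to the transposed score matrix together with the premise vectors would give the soft alignment in the other direction.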
Comparing
In the next step, we compare a token in one sequence with the other sequence that is softly aligned
with that token. Note that in soft alignment, all the tokens from one sequence, though with
probably different attention weights, will be compared with a token in the other sequence. For ease of
demonstration, Fig. 16.5.2 pairs tokens with aligned tokens in a hard way. For example, if
the attending step determines that “need” and “sleep” in the premise are both aligned with
“tired” in the hypothesis, the pair “tired–need sleep” will be compared.
In the comparing step, we feed the concatenation (operator [·, ·]) of tokens from one sequence and
aligned tokens from the other sequence into a function g (an MLP):
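$$\begin{aligned} \mathbf{v}_{A,i} &= g([\mathbf{a}_i, \boldsymbol{\beta}_i]), \quad i = 1, \ldots, m \\ \mathbf{v}_{B,j} &= g([\mathbf{b}_j, \boldsymbol{\alpha}_j]), \quad j = 1, \ldots, n. \end{aligned} \tag{16.5.4}$$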
In (16.5.4), $\mathbf{v}_{A,i}$ is the comparison between token $i$ in the premise and all the hypothesis
tokens that are softly aligned with token $i$; while $\mathbf{v}_{B,j}$ is the comparison between token $j$
in the hypothesis and all the premise tokens that are softly aligned with token $j$. The following
Compare class defines such a comparing step.

class Compare(nn.Module):
    def __init__(self, num_inputs, num_hiddens, **kwargs):
        super(Compare, self).__init__(**kwargs)
        self.g = mlp(num_inputs, num_hiddens, flatten=False)

    def forward(self, A, B, beta, alpha):
        V_A = self.g(torch.cat([A, beta], dim=2))
        V_B = self.g(torch.cat([B, alpha], dim=2))
        return V_A, V_B
Aggregating
With two sets of comparison vectors $\mathbf{v}_{A,i}$ ($i = 1, \ldots, m$) and $\mathbf{v}_{B,j}$
($j = 1, \ldots, n$) on hand, in the last step we will aggregate such information to infer the logical
relationship. We begin by summing up both sets:
$$\mathbf{v}_A = \sum_{i=1}^{m} \mathbf{v}_{A,i}, \quad \mathbf{v}_B = \sum_{j=1}^{n} \mathbf{v}_{B,j}. \tag{16.5.5}$$
Next we feed the concatenation of both summarization results into function h (an MLP) to obtain
the classification result of the logical relationship:
$$\hat{\mathbf{y}} = h([\mathbf{v}_A, \mathbf{v}_B]). \tag{16.5.6}$$
The aggregation step is defined in the following Aggregate class.
class Aggregate(nn.Module):
    def __init__(self, num_inputs, num_hiddens, num_outputs, **kwargs):
        super(Aggregate, self).__init__(**kwargs)
        self.h = mlp(num_inputs, num_hiddens, flatten=True)
        self.linear = nn.Linear(num_hiddens, num_outputs)

    def forward(self, V_A, V_B):
        # Sum up both sets of comparison vectors
        V_A = V_A.sum(dim=1)
        V_B = V_B.sum(dim=1)
        # Feed the concatenation of both summarization results into an MLP
        Y_hat = self.linear(self.h(torch.cat([V_A, V_B], dim=1)))
        return Y_hat
By putting the attending, comparing, and aggregating steps together, we define the decomposable
attention model to jointly train these three steps.
class DecomposableAttention(nn.Module):
    def __init__(self, vocab, embed_size, num_hiddens, num_inputs_attend=100,
                 num_inputs_compare=200, num_inputs_agg=400, **kwargs):
        super(DecomposableAttention, self).__init__(**kwargs)
        self.embedding = nn.Embedding(len(vocab), embed_size)
        self.attend = Attend(num_inputs_attend, num_hiddens)
        self.compare = Compare(num_inputs_compare, num_hiddens)
        # There are 3 possible outputs: entailment, contradiction, and neutral
        self.aggregate = Aggregate(num_inputs_agg, num_hiddens, num_outputs=3)

    def forward(self, X):
        premises, hypotheses = X
        A = self.embedding(premises)
        B = self.embedding(hypotheses)
        beta, alpha = self.attend(A, B)
        V_A, V_B = self.compare(A, B, beta, alpha)
        Y_hat = self.aggregate(V_A, V_B)
        return Y_hat
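The default input sizes in the constructor follow from the concatenations in the three steps. The bookkeeping sketch below, with the hypothetical helper decomposable_attention_dims, derives them for 100-dimensional token vectors and 200 hidden units; it computes sizes only and builds no model.

```python
def decomposable_attention_dims(embed_size, num_hiddens):
    """Input sizes of the attend, compare, and aggregate MLPs."""
    num_inputs_attend = embed_size       # f sees a single token vector
    num_inputs_compare = 2 * embed_size  # g sees a token vector concatenated
                                         # with its soft alignment
    num_inputs_agg = 2 * num_hiddens     # h sees the two summed outputs of g
    return num_inputs_attend, num_inputs_compare, num_inputs_agg

dims = decomposable_attention_dims(100, 200)
print(dims)  # (100, 200, 400)
```

With a different embedding dimension or number of hidden units, the three num_inputs arguments would need to be changed accordingly rather than left at their defaults.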
16.5.2 Training and Evaluating the Model
Now we will train and evaluate the defined decomposable attention model on the SNLI dataset.
We begin by reading the dataset.
Reading the dataset
We download and read the SNLI dataset using the function defined in Section 16.4. The batch size
and sequence length are set to 256 and 50, respectively.
batch_size, num_steps = 256, 50
train_iter, test_iter, vocab = d2l.load_data_snli(batch_size, num_steps)
read 549367 examples
read 9824 examples
Creating the Model
We use the pretrained 100-dimensional GloVe embedding to represent the input tokens. Thus, we
predefine the dimension of vectors $\mathbf{a}_i$ and $\mathbf{b}_j$ in (16.5.1) as 100. The output
dimension of functions $f$ in (16.5.1) and $g$ in (16.5.4) is set to 200. Then we create a model instance,
initialize its parameters, and load the GloVe embedding to initialize vectors of input tokens.
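embed_size, num_hiddens, devices = 100, 200, d2l.try_all_gpus()
net = DecomposableAttention(vocab, embed_size, num_hiddens)
# Look up GloVe vectors for the tokens of the SNLI vocabulary
glove_embedding = d2l.TokenEmbedding('glove.6b.100d')
embeds = glove_embedding[vocab.idx_to_token]
net.embedding.weight.data.copy_(embeds);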

Training and Evaluating the Model
In contrast to the split_batch function in Section 13.5 that takes single inputs such as text se-
quences (or images), we define a split_batch_multi_inputs function to take multiple inputs such
as premises and hypotheses in minibatches.
Now we can train and evaluate the model on the SNLI dataset.


Using the Model
Finally, define the prediction function to output the logical relationship between a pair of premise
and hypothesis.
#@save
def predict_snli(net, vocab, premise, hypothesis):
    """Predict the logical relationship between the premise and hypothesis."""
    net.eval()
    premise = torch.tensor(vocab[premise], device=d2l.try_gpu())
    hypothesis = torch.tensor(vocab[hypothesis], device=d2l.try_gpu())
    label = torch.argmax(net([premise.reshape((1, -1)),
                              hypothesis.reshape((1, -1))]), dim=1)
    return 'entailment' if label == 0 else 'contradiction' if label == 1 \
            else 'neutral'
We can use the trained model to obtain the natural language inference result for a sample pair of
sentences.

predict_snli(net, vocab, ['he', 'is', 'good', '.'], ['he', 'is', 'bad', '.'])
'contradiction'
16.5.3 Summary
• The decomposable attention model consists of three steps for predicting the logical relation-
ships between premises and hypotheses: attending, comparing, and aggregating.
• With attention mechanisms, we can align tokens in one text sequence to every token in the
other, and vice versa. Such alignment is soft using weighted average, where ideally large
weights are associated with the tokens to be aligned.
• The decomposition trick leads to a more desirable linear complexity than quadratic com-
plexity when computing attention weights.
• We can use pretrained word vectors as the input representation for downstream natural
language processing tasks such as natural language inference.
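The soft alignment described in the summary can be sketched in a few lines. This is a minimal illustration, not the section's `DecomposableAttention` implementation: the one-layer `f`, the random inputs, and the sequence lengths are assumptions made only to show the shapes involved.

```python
import torch
from torch import nn

torch.manual_seed(0)
embed_size, num_hiddens = 100, 200
# Hypothetical stand-in for the MLP f used in the attending step
f = nn.Sequential(nn.Linear(embed_size, num_hiddens), nn.ReLU())

A = torch.randn(1, 4, embed_size)  # premise with 4 tokens (random stand-in)
B = torch.randn(1, 6, embed_size)  # hypothesis with 6 tokens (random stand-in)

# f is applied to each of the 4 + 6 tokens separately
f_A, f_B = f(A), f(B)
# e[:, i, j] = f(a_i)^T f(b_j), the attention score between tokens i and j
e = torch.bmm(f_A, f_B.permute(0, 2, 1))
# beta[:, i]: weighted average of hypothesis tokens aligned with premise token i
beta = torch.bmm(torch.softmax(e, dim=-1), B)
# alpha[:, j]: weighted average of premise tokens aligned with hypothesis token j
alpha = torch.bmm(torch.softmax(e.permute(0, 2, 1), dim=-1), A)

print(e.shape, beta.shape, alpha.shape)
```

Because `f` takes each token as a separate input, it is evaluated once per token rather than once per token pair, which is the decomposition that the summary refers to.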
Exercises
1. Train the model with other combinations of hyperparameters. Can you get better accuracy
on the test set?
2. What are major drawbacks of the decomposable attention model for natural language infer-
ence?
3. Suppose that we want to get the level of semantic similarity (e.g., a continuous value
between 0 and 1) for any pair of sentences. How shall we collect and label the dataset? Can you
design a model with attention mechanisms?
16.6 Fine-Tuning BERT for Sequence-Level and Token-Level Applications
In the previous sections of this chapter, we have designed different models for natural language
processing applications, such as those based on RNNs, CNNs, attention, and MLPs. These models
are helpful when there are space or time constraints. However, crafting a specific model for every
natural language processing task is practically infeasible. In Section 15.8, we introduced a
pretrained model, BERT, that requires minimal architecture changes for a wide range of natural
language processing tasks. On the one hand, at the time of its proposal, BERT improved the state
of the art on various natural language processing tasks. On the other hand, as noted in Section
15.10, the two versions of the original BERT model come with 110 million and 340 million
parameters. Thus, when there are sufficient computational resources, we may consider fine-tuning
BERT for downstream natural language processing applications.

In the following, we generalize a subset of natural language processing applications as
sequence-level and token-level. On the sequence level, we introduce how to transform the BERT
representation of the text input to the output label in single text classification and text pair
classification or regression. On the token level, we will briefly introduce new applications such
as text tagging and question answering, and shed light on how BERT can represent their inputs
and how these representations are transformed into output labels. During fine-tuning, the
“minimal architecture changes” required by BERT across different applications are the extra fully
connected layers. During supervised learning of a downstream application, parameters of the extra
layers are learned from scratch while all the parameters in the pretrained BERT model are
fine-tuned.
16.6.1 Single Text Classification
Single text classification takes a single text sequence as input and outputs its classification result.
Besides sentiment analysis that we have studied in this chapter, the Corpus of Linguistic Accept-
ability (CoLA) is also a dataset for single text classification, judging whether a given sentence is
grammatically acceptable or not (Warstadt et al., 2019). For instance, “I should study.” is accept-
able but “I should studying.” is not.
Fig. 16.6.1: Fine-tuning BERT for single text classification applications, such as sentiment analysis
and testing linguistic acceptability. Suppose that the input single text has six tokens.
Section 15.8 describes the input representation of BERT. The BERT input sequence unambiguously
represents both single text and text pairs, where the special classification token “<cls>” is used
for sequence classification and the special separation token “<sep>” marks the end of single text
or separates a pair of text. As shown in Fig. 16.6.1, in single text classification applications,
the BERT representation of the special classification token “<cls>” encodes the information of
the entire input text sequence. As the representation of the input single text, it will be fed into
a small MLP consisting of fully connected (dense) layers to output the distribution of all the
discrete label values.
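To make the role of the “<cls>” representation concrete, here is a minimal sketch of a classification head. The random tensor `encoded_X` stands in for the output of a BERT encoder, and the layer sizes, the `Tanh` activation, and the number of classes are illustrative assumptions rather than values prescribed by the text.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, num_steps, num_hiddens, num_classes = 2, 8, 256, 2

# Stand-in for the BERT encoder output (random, for shape illustration only)
encoded_X = torch.randn(batch_size, num_steps, num_hiddens)

# Extra fully connected layers that sit on top of the encoder
head = nn.Sequential(nn.Linear(num_hiddens, num_hiddens), nn.Tanh(),
                     nn.Linear(num_hiddens, num_classes))

cls_repr = encoded_X[:, 0, :]          # '<cls>' is the first token of the input
logits = head(cls_repr)
probs = torch.softmax(logits, dim=-1)  # distribution over the discrete labels
print(probs.shape)
```

Only position 0 of the encoder output is used, so the head is independent of the length of the input text.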

16.6.2 Text Pair Classification or Regression
We have also examined natural language inference in this chapter. It belongs to text pair classifi-
cation, a type of application classifying a pair of text.
Taking a pair of text as input but outputting a continuous value, semantic textual similarity is a
popular text pair regression task. This task measures semantic similarity of sentences. For instance,
in the Semantic Textual Similarity Benchmark dataset, the similarity score of a pair of sentences
is on an ordinal scale ranging from 0 (no meaning overlap) to 5 (meaning equivalence) (Cer et al., 2017).
The goal is to predict these scores. Examples from the Semantic Textual Similarity Benchmark
dataset include (sentence 1, sentence 2, similarity score):
• “A plane is taking off.”, “An air plane is taking off.”, 5.000;
• “A woman is eating something.”, “A woman is eating meat.”, 3.000;
• “A woman is dancing.”, “A man is talking.”, 0.000.
Fig. 16.6.2: Fine-tuning BERT for text pair classification or regression applications, such as natural
language inference and semantic textual similarity. Suppose that the input text pair has two and
three tokens.
Compared with single text classification in Fig. 16.6.1, fine-tuning BERT for text pair
classification in Fig. 16.6.2 differs in the input representation. For text pair regression tasks
such as semantic textual similarity, trivial changes can be applied, such as outputting a
continuous label value and using the mean squared loss, both of which are common for regression.
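The two differences mentioned above, the packed input and the regression output, can be sketched as follows. The helper `pack_pair` is a hypothetical name that mirrors the input format described in Section 15.8, and the random `cls_repr` tensor stands in for the “<cls>” representations that a BERT encoder would produce.

```python
import torch
from torch import nn

def pack_pair(tokens_a, tokens_b):
    """Pack two token lists into one BERT-style input with segment IDs."""
    tokens = ['<cls>'] + tokens_a + ['<sep>'] + tokens_b + ['<sep>']
    # Segment 0 covers '<cls>', the first text, and its '<sep>'
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segments

tokens, segments = pack_pair(['a', 'plane'], ['an', 'air', 'plane'])
print(tokens)
print(segments)

# Regression head: a single continuous output trained with the mean squared loss
torch.manual_seed(0)
num_hiddens = 256
cls_repr = torch.randn(3, num_hiddens)  # stand-in for '<cls>' representations
scores = torch.tensor([5.0, 3.0, 0.0])  # similarity labels as in the examples
head = nn.Linear(num_hiddens, 1)
loss = nn.MSELoss()(head(cls_repr).squeeze(-1), scores)
```

For classification instead of regression, only the output size of `head` and the loss function would change.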
16.6.3 Text Tagging
Now letʼs consider token-level tasks, such as text tagging, where each token is assigned a label.
Among text tagging tasks, part-of-speech tagging assigns each word a part-of-speech tag (e.g.,
adjective and determiner) according to the role of the word in the sentence. For example,
according to the Penn Treebank II tag set, the sentence “John Smith 's car is new” should be
tagged as “NNP (noun, proper singular) NNP POS (possessive ending) NN (noun, singular or mass)
VBZ (verb, third person singular present) JJ (adjective)”.

Fig. 16.6.3: Fine-tuning BERT for text tagging applications, such as part-of-speech tagging. Sup-
pose that the input single text has six tokens.
Fine-tuning BERT for text tagging applications is illustrated in Fig. 16.6.3. Compared with Fig.
16.6.1, the only distinction is that in text tagging, the BERT representation of every token of
the input text is fed into the same extra fully connected layers to output the label of the token,
such as a part-of-speech tag.
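A minimal sketch of such a token-level head is shown below. The encoder output is again a random stand-in, and `num_tags` is an illustrative tag-set size, not a value taken from the text.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, num_steps, num_hiddens = 2, 8, 256
num_tags = 36  # illustrative tag-set size (an assumption)

# Stand-in for the BERT representation of every input token
encoded_X = torch.randn(batch_size, num_steps, num_hiddens)
# The same layer (shared weights) is applied at every token position
tagger = nn.Linear(num_hiddens, num_tags)

logits = tagger(encoded_X)         # (batch_size, num_steps, num_tags)
pred_tags = logits.argmax(dim=-1)  # one predicted tag index per token

# Token-level cross-entropy against random, purely illustrative gold tags
gold = torch.randint(0, num_tags, (batch_size, num_steps))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, num_tags), gold.reshape(-1))
print(pred_tags.shape)
```

In practice, padding positions and special tokens would typically be excluded from the loss; that detail is omitted here.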
16.6.4 Question Answering
As another token-level application, question answering reflects capabilities of reading
comprehension. For example, the Stanford Question Answering Dataset (SQuAD v1.1) consists of
reading passages and questions, where the answer to every question is just a segment of text
(text span) from the passage that the question is about (Rajpurkar et al., 2016). To explain,
consider a passage “Some experts report that a mask's efficacy is inconclusive. However, mask
makers insist that their products, such as N95 respirator masks, can guard against the virus.”
and a question “Who says that N95 respirator masks can guard against the virus?”. The answer
should be the text span “mask makers” in the passage. Thus, the goal in SQuAD v1.1 is to predict
the start and end of the text span in the passage given a pair of question and passage.

Fig. 16.6.4: Fine-tuning BERT for question answering. Suppose that the input text pair has two
and three tokens.
To fine-tune BERT for question answering, the question and passage are packed as the first and
second text sequence, respectively, in the input of BERT. To predict the position of the start of
the text span, the same additional fully connected layer will transform the BERT representation
of any token from the passage at position $i$ into a scalar score $s_i$. Such scores of all the
passage tokens are further transformed by the softmax operation into a probability distribution,
so that each token position $i$ in the passage is assigned a probability $p_i$ of being the start
of the text span. Predicting the end of the text span is the same as above, except that parameters
in its additional fully connected layer are independent of those for predicting the start. When
predicting the end, any passage token at position $i$ is transformed by the same fully connected
layer into a scalar score $e_i$. Fig. 16.6.4 depicts fine-tuning BERT for question answering.
For question answering, the training objective of supervised learning is as straightforward as
maximizing the log-likelihoods of the ground-truth start and end positions. When predicting the
span, we can compute the score $s_i + e_j$ for a valid span from position $i$ to position $j$
($i \leq j$), and output the span with the highest score.
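The span selection rule can be sketched as follows. The function name `best_span` and the toy scores are hypothetical; the only ingredient taken from the text is that a span from start position i to end position j is scored by the sum of its start and end scores and must satisfy i ≤ j.

```python
import torch

def best_span(start_scores, end_scores):
    """Return (i, j) with i <= j that maximizes start_scores[i] + end_scores[j]."""
    n = start_scores.shape[0]
    # total[i, j] holds the score of the span from position i to position j
    total = start_scores.reshape(-1, 1) + end_scores.reshape(1, -1)
    # Mask out invalid spans whose end precedes their start (j < i)
    valid = torch.triu(torch.ones(n, n, dtype=torch.bool))
    total = total.masked_fill(~valid, float('-inf'))
    flat = int(total.argmax())
    return flat // n, flat % n

s = torch.tensor([0.1, 2.0, 0.3, 0.2])  # toy start scores of 4 passage tokens
e = torch.tensor([3.0, 0.5, 1.5, 0.1])  # toy end scores of the same tokens
print(best_span(s, e))
```

With these toy scores, the unconstrained maximum would pair start position 1 with end position 0, which is not a valid span; the mask makes the function return (1, 2) instead.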
16.6.5 Summary
• BERT requires minimal architecture changes (extra fully connected layers) for sequence-
level and token-level natural language processing applications, such as single text classifi-
cation (e.g., sentiment analysis and testing linguistic acceptability), text pair classification
or regression (e.g., natural language inference and semantic textual similarity), text tagging
(e.g., part-of-speech tagging), and question answering.
• During supervised learning of a downstream application, parameters of the extra layers are
learned from scratch while all the parameters in the pretrained BERT model are fine-tuned.

Exercises
1. Let's design a search engine algorithm for news articles. When the system receives a query
(e.g., “oil industry during the coronavirus outbreak”), it should return a ranked list of news
articles that are most relevant to the query. Suppose that we have a huge pool of news articles
and a large number of queries. To simplify the problem, suppose that the most relevant
article has been labeled for each query. How can we apply negative sampling (see Section
15.2.1) and BERT in the algorithm design?
2. How can we leverage BERT in training language models?
3. Can we leverage BERT in machine translation?
16.7 Natural Language Inference: Fine-Tuning BERT
In earlier sections of this chapter, we have designed an attention-based architecture (in Section
16.5) for the natural language inference task on the SNLI dataset (as described in Section 16.4).
Now we revisit this task by fine-tuning BERT. As discussed in Section 16.6, natural language infer-
ence is a sequence-level text pair classification problem, and fine-tuning BERT only requires an
additional MLP-based architecture, as illustrated in Fig. 16.7.1.
Fig. 16.7.1: This section feeds pretrained BERT to an MLP-based architecture for natural language
inference.
In this section, we will download a pretrained small version of BERT, then fine-tune it for natural
language inference on the SNLI dataset.
import json
import multiprocessing
import os
import torch
from torch import nn
from d2l import torch as d2l

16.7.1 Loading Pretrained BERT
We have explained how to pretrain BERT on the WikiText-2 dataset in Section 15.9 and Section
15.10 (note that the original BERT model is pretrained on much bigger corpora). As discussed in
Section 15.10, the original BERT model has hundreds of millions of parameters. In the following,
we provide two versions of pretrained BERT: “bert.base” is about as big as the original BERT base
model that requires a lot of computational resources to fine-tune, while “bert.small” is a small
version to facilitate demonstration.
d2l.DATA_HUB['bert.base'] = (d2l.DATA_URL + 'bert.base.torch.zip',
                             '225d66f04cae318b841a13d32af3acc165f253ac')
d2l.DATA_HUB['bert.small'] = (d2l.DATA_URL + 'bert.small.torch.zip',
                              'c72329e68a732bef0452e4b96a1c341c8910f81f')
Either pretrained BERT model contains a “vocab.json” file that defines the vocabulary set
and a “pretrained.params” file of the pretrained parameters. We implement the following
load_pretrained_model function to load pretrained BERT parameters.
def load_pretrained_model(pretrained_model, num_hiddens, ffn_num_hiddens,
                          num_heads, num_blks, dropout, max_len, devices):
    data_dir = d2l.download_extract(pretrained_model)
    # Define an empty vocabulary to load the predefined vocabulary
    vocab = d2l.Vocab()
    with open(os.path.join(data_dir, 'vocab.json')) as f:
        vocab.idx_to_token = json.load(f)
    vocab.token_to_idx = {token: idx for idx, token in enumerate(
        vocab.idx_to_token)}
    # Pass the function arguments through instead of hard-coded values so that
    # larger models such as 'bert.base' can be loaded as well
    bert = d2l.BERTModel(
        len(vocab), num_hiddens, ffn_num_hiddens=ffn_num_hiddens,
        num_heads=num_heads, num_blks=num_blks, dropout=dropout,
        max_len=max_len)
    # Load pretrained BERT parameters
    bert.load_state_dict(torch.load(os.path.join(data_dir,
                                                 'pretrained.params')))
    return bert, vocab
To facilitate demonstration on most machines, we will load and fine-tune the small version
(“bert.small”) of the pretrained BERT in this section. In the exercises, we will show how to
fine-tune the much larger “bert.base” to significantly improve the testing accuracy.
bert, vocab = load_pretrained_model(
'bert.small', num_hiddens=256, ffn_num_hiddens=512, num_heads=4,
num_blks=2, dropout=0.1, max_len=512, devices=devices)

16.7.2 The Dataset for Fine-Tuning BERT
For the downstream task of natural language inference on the SNLI dataset, we define a customized
dataset class SNLIBERTDataset. In each example, the premise and hypothesis form a pair of text
sequences and are packed into one BERT input sequence as depicted in Fig. 16.6.2. Recall from
Section 15.8.4 that segment IDs are used to distinguish the premise and the hypothesis in a BERT
input sequence. With the predefined maximum length of a BERT input sequence (max_len), the last
token of the longer of the input text pair keeps getting removed until max_len is met. To
accelerate generation of the SNLI dataset for fine-tuning BERT, we use 4 worker processes to
generate training or testing examples in parallel.
class SNLIBERTDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, max_len, vocab=None):
        all_premise_hypothesis_tokens = [[
            p_tokens, h_tokens] for p_tokens, h_tokens in zip(
            *[d2l.tokenize([s.lower() for s in sentences])
              for sentences in dataset[:2]])]
        self.labels = torch.tensor(dataset[2])
        self.vocab = vocab
        self.max_len = max_len
        (self.all_token_ids, self.all_segments,
         self.valid_lens) = self._preprocess(all_premise_hypothesis_tokens)
        print('read ' + str(len(self.all_token_ids)) + ' examples')

    def _preprocess(self, all_premise_hypothesis_tokens):
        pool = multiprocessing.Pool(4)  # Use 4 worker processes
        out = pool.map(self._mp_worker, all_premise_hypothesis_tokens)
        all_token_ids = [
            token_ids for token_ids, segments, valid_len in out]
        all_segments = [segments for token_ids, segments, valid_len in out]
        valid_lens = [valid_len for token_ids, segments, valid_len in out]
        return (torch.tensor(all_token_ids, dtype=torch.long),
                torch.tensor(all_segments, dtype=torch.long),
                torch.tensor(valid_lens))

    def _mp_worker(self, premise_hypothesis_tokens):
        p_tokens, h_tokens = premise_hypothesis_tokens
        self._truncate_pair_of_tokens(p_tokens, h_tokens)
        tokens, segments = d2l.get_tokens_and_segments(p_tokens, h_tokens)
        # Pad token IDs and segment IDs up to `max_len`
        token_ids = self.vocab[tokens] + [self.vocab['<pad>']] \
                             * (self.max_len - len(tokens))
        segments = segments + [0] * (self.max_len - len(segments))
        valid_len = len(tokens)
        return token_ids, segments, valid_len

    def _truncate_pair_of_tokens(self, p_tokens, h_tokens):
        # Reserve slots for '<CLS>', '<SEP>', and '<SEP>' tokens for the BERT
        # input
        while len(p_tokens) + len(h_tokens) > self.max_len - 3:
            if len(p_tokens) > len(h_tokens):
                p_tokens.pop()
            else:
                h_tokens.pop()

    def __getitem__(self, idx):
        return (self.all_token_ids[idx], self.all_segments[idx],
                self.valid_lens[idx]), self.labels[idx]

    def __len__(self):
        return len(self.all_token_ids)
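To see the truncation rule in isolation, the following standalone sketch repeats the logic of `_truncate_pair_of_tokens` as a plain function. The function name, the example sentences, and the small `max_len` are chosen only for illustration.

```python
def truncate_pair_of_tokens(p_tokens, h_tokens, max_len):
    """Pop tokens from the longer list until the pair fits into max_len."""
    # Reserve 3 slots for the '<cls>' token and the two '<sep>' tokens
    while len(p_tokens) + len(h_tokens) > max_len - 3:
        if len(p_tokens) > len(h_tokens):
            p_tokens.pop()
        else:
            h_tokens.pop()

premise = ['a', 'man', 'is', 'riding', 'a', 'horse', 'outside']
hypothesis = ['a', 'person', 'rides']
truncate_pair_of_tokens(premise, hypothesis, max_len=10)
print(premise, hypothesis)
```

With `max_len=10`, seven slots remain for the text pair, so the last three tokens of the longer premise are dropped while the hypothesis is left untouched. Note that the lists are modified in place.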
After downloading the SNLI dataset, we generate training and testing examples by instantiating
the SNLIBERTDataset class. Such examples will be read in minibatches during training and testing
of natural language inference.
# Reduce `batch_size` if there is an out of memory error. In the original BERT
# model, `max_len` = 512
batch_size, max_len, num_workers = 512, 128, d2l.get_dataloader_workers()
# Download SNLI (the 'SNLI' entry of `d2l.DATA_HUB` is defined in Section 16.4)
data_dir = d2l.download_extract('SNLI')
train_set = SNLIBERTDataset(d2l.read_snli(data_dir, True), max_len, vocab)
test_set = SNLIBERTDataset(d2l.read_snli(data_dir, False), max_len, vocab)
train_iter = torch.utils.data.DataLoader(train_set, batch_size, shuffle=True,
                                         num_workers=num_workers)
test_iter = torch.utils.data.DataLoader(test_set, batch_size,
                                        num_workers=num_workers)
read 9824 examples
16.7.3 Fine-Tuning BERT
As Fig. 16.6.2 indicates, fine-tuning BERT for natural language inference requires only an extra
MLP consisting of two fully connected layers (see self.hidden and self.output in the following
BERTClassifier class). This MLP transforms the BERT representation of the special “<cls>” to-
ken, which encodes the information of both the premise and the hypothesis, into three outputs of
natural language inference: entailment, contradiction, and neutral.
class BERTClassifier(nn.Module):
    def __init__(self, bert):
        super(BERTClassifier, self).__init__()
        self.encoder = bert.encoder
        self.hidden = bert.hidden
        # Output layer with three classes: entailment, contradiction, neutral
        self.output = nn.LazyLinear(3)

    def forward(self, inputs):
        tokens_X, segments_X, valid_lens_x = inputs
        encoded_X = self.encoder(tokens_X, segments_X, valid_lens_x)
        return self.output(self.hidden(encoded_X[:, 0, :]))
In the following, the pretrained BERT model bert is fed into the BERTClassifier instance net for
the downstream application. In common implementations of BERT fine-tuning, only the param-
eters of the output layer of the additional MLP (net.output) will be learned from scratch. All the
parameters of the pretrained BERT encoder (net.encoder) and the hidden layer of the additional
MLP (net.hidden) will be fine-tuned.

net = BERTClassifier(bert)
Recall that in Section 15.8 both the MaskLM class and the NextSentencePred class have parameters
in their employed MLPs. These parameters are part of those in the pretrained BERT model bert,
and thus part of parameters in net. However, such parameters are only for computing the masked
language modeling loss and the next sentence prediction loss during pretraining. These two loss
functions are irrelevant to fine-tuning downstream applications, thus the parameters of the em-
ployed MLPs in MaskLM and NextSentencePred are not updated (staled) when BERT is fine-tuned.
To allow parameters with stale gradients, the flag ignore_stale_grad=True is set in the step func-
tion of d2l.train_batch_ch13. We use this function to train and evaluate the model net using the
training set (train_iter) and the testing set (test_iter) of SNLI. Given the limited
computational resources, the training and testing accuracy obtained here leave room for
improvement: we leave this discussion to the exercises.
lr, num_epochs = 1e-4, 5
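The remark that parameters used only for pretraining losses are not updated can be checked on a toy model. The sketch below is not the section's training code: `TinyModel` is a hypothetical module whose `unused` layer plays the role of a pretraining head such as those in MaskLM. It relies on PyTorch behavior where a parameter that takes no part in the forward pass keeps `grad` equal to `None`, and optimizers skip such parameters.

```python
import torch
from torch import nn

torch.manual_seed(0)

class TinyModel(nn.Module):
    """Hypothetical model with a layer that the forward pass never uses."""
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 3)
        self.unused = nn.Linear(4, 3)  # plays the role of a pretraining head

    def forward(self, X):
        return self.used(X)

net = TinyModel()
before = net.unused.weight.detach().clone()
trainer = torch.optim.Adam(net.parameters(), lr=1e-4)

X, y = torch.randn(8, 4), torch.randint(0, 3, (8,))
loss = nn.CrossEntropyLoss()(net(X), y)
trainer.zero_grad()
loss.backward()
trainer.step()

no_grad_computed = net.unused.weight.grad is None
left_unchanged = torch.equal(before, net.unused.weight)
print(no_grad_computed, left_unchanged)
```

Both flags are `True` after the update step, so passing all parameters of the model to the optimizer does not disturb the layers that the downstream loss never touches.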


16.7.4 Summary
• We can fine-tune the pretrained BERT model for downstream applications, such as natural
language inference on the SNLI dataset.
• During fine-tuning, the BERT model becomes part of the model for the downstream appli-
cation. Parameters that are only related to pretraining loss will not be updated during fine-
tuning.

Exercises
1. Fine-tune a much larger pretrained BERT model that is about as big as the original BERT base
model if your computational resource allows. Set arguments in the load_pretrained_model
function as: replacing ʻbert.smallʼ with ʻbert.baseʼ, increasing values of num_hiddens=256,
ffn_num_hiddens=512, num_heads=4, and num_blks=2 to 768, 3072, 12, and 12, respectively.
By increasing fine-tuning epochs (and possibly tuning other hyperparameters), can you get
a testing accuracy higher than 0.86?
2. How can we truncate a pair of sequences according to the ratio of their lengths? Compare this
pair truncation method and the one used in the SNLIBERTDataset class. What are their pros and
cons?

17 | Recommender Systems
Shuai Zhang (Amazon), Aston Zhang (Amazon), and Yi Tay (Google)

Recommender systems are widely employed in industry and are ubiquitous in our daily lives.
These systems are utilized in a number of areas such as online shopping sites (e.g., amazon.com),
music/movie service sites (e.g., Netflix and Spotify), mobile application stores (e.g., the iOS
App Store and Google Play), and online advertising, to name just a few.
The major goal of recommender systems is to help users discover relevant items such as movies
to watch, text to read or products to buy, so as to create a delightful user experience. Moreover,
recommender systems are among the most powerful machine learning systems that online retailers
implement in order to drive incremental revenue. Recommender systems can take the place of
search engines by reducing the effort of proactive searches and surprising users with offers
they never searched for. Many companies have managed to position themselves ahead of their
competitors with the help of more effective recommender systems. As such, recommender systems
are not only central to our everyday lives but also indispensable in some industries.
In this chapter, we will cover the fundamentals and advancements of recommender systems, and
explore some common techniques for building recommender systems with the different data sources
available, along with their implementations. Specifically, you will learn how to predict the
rating a user might give to a prospective item, how to generate a recommendation list of items
and how to predict the click-through rate from abundant features. These tasks are commonplace in
real-world applications. By studying this chapter, you will get hands-on experience in solving
real-world recommendation problems with not only classical methods but also more advanced deep
learning based models.
17.1 Overview of Recommender Systems
In the last decade, the Internet has evolved into a platform for large-scale online services,
which has profoundly changed the way we communicate, read news, buy products, and watch movies.
Meanwhile, the unprecedented number of items (we use the term item to refer to movies, news,
books, and products) offered online requires a system that can help us discover the items that
we prefer. Recommender systems are therefore powerful information filtering tools that can
facilitate personalized services and provide tailored experiences to individual users. In short,
recommender systems play a pivotal role in utilizing the wealth of data available to make choices
manageable. Nowadays, recommender systems are at the core of a number of online service
providers such as Amazon, Netflix, and YouTube. Recall the example of deep learning books
recommended by Amazon in Fig. 1.3.3. The benefits of employing recommender systems are twofold:
on the one hand, they can largely reduce users' effort in finding items and alleviate the issue of
information overload. On the other hand, they can add business value to online service providers
and are an important source of revenue. This chapter will introduce the fundamental concepts,
classic models and recent advances with deep learning in the field of recommender systems,
together with implemented examples.
Fig. 17.1.1: Illustration of the Recommendation Process
17.1.1 Collaborative Filtering
We start the journey with an important concept in recommender systems: collaborative filtering
(CF). The term was coined by the Tapestry system (Goldberg et al., 1992), referring to “people
collaborate to help one another perform the filtering process in order to handle the large amounts
of email and messages posted to newsgroups”. This term has since been enriched with more senses.
In a broad sense, it is the process of filtering for information or patterns using techniques
involving collaboration among multiple users, agents, and data sources. CF has taken many forms,
and numerous CF methods have been proposed since its advent.
Overall, CF techniques can be categorized into: memory-based CF, model-based CF, and their
hybrid (Su and Khoshgoftaar, 2009). Representative memory-based CF techniques are nearest
neighbor-based CF such as user-based CF and item-based CF (Sarwar et al., 2001). Latent factor
models such as matrix factorization are examples of model-based CF. Memory-based CF has
limitations in dealing with sparse and large-scale data since it computes the similarity values
based on common items. Model-based methods have become more popular owing to their better
capability in dealing with sparsity and scalability. Many model-based CF approaches can be
extended with neural networks, leading to more flexible and scalable models with the computation
acceleration in deep learning (Zhang et al., 2019). In general, CF only uses the user-item
interaction data to make predictions and recommendations. Besides CF, content-based and
context-based recommender systems are also useful in incorporating the content descriptions of
items/users and contextual signals such as timestamps and locations. Obviously, we may need to
adjust the model types/structures when different input data is available.
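As a minimal sketch of memory-based CF, the following toy example predicts a missing rating with user-based nearest neighbors. The rating matrix is made up, the function name `predict_user_based` is hypothetical, and treating unrated entries as zeros when computing cosine similarity is a simplification chosen for brevity.

```python
import torch

# Toy user-item rating matrix (rows: users, columns: items, 0 = not rated)
R = torch.tensor([[5., 4., 0., 1.],
                  [4., 5., 1., 0.],
                  [1., 0., 5., 4.]])

def predict_user_based(R, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    sims = torch.nn.functional.cosine_similarity(
        R[user].unsqueeze(0), R, dim=1)
    rated = R[:, item] > 0  # neighbors that have rated the item
    rated[user] = False     # exclude the target user
    weights = sims * rated
    return float((weights * R[:, item]).sum() / weights.sum())

pred = predict_user_based(R, user=0, item=2)
print(round(pred, 3))
```

User 0 is much more similar to user 1 than to user 2, so the prediction for item 2 lands close to the low rating given by user 1. The function assumes that at least one other user has rated the item. Since similarities are computed from the few commonly rated items, they become unreliable as the matrix gets sparser, which is the limitation of memory-based CF noted above.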

17.1.2 Explicit Feedback and Implicit Feedback
To learn the preferences of users, the system shall collect feedback from them. The feedback can
be either explicit or implicit (Hu et al., 2008). For example, IMDb250 collects star ratings
ranging from one to ten stars for movies. YouTube provides the thumbs-up and thumbs-down buttons
for users to show their preferences. Gathering explicit feedback requires users to indicate their
interests proactively. As a result, explicit feedback is not always readily available, since many
users may be reluctant to rate products. By comparison, implicit feedback is often readily
available since it is mainly concerned with modeling implicit behavior such as user clicks.
As such, many recommender systems are centered on implicit feedback, which indirectly reflects
users' opinions through observed user behavior. There are diverse forms of implicit feedback,
including purchase history, browsing history, watches and even mouse movements. For example,
a user who purchased many books by the same author probably likes that author. Note that implicit
feedback is inherently noisy. We can only guess users' preferences and true motives. That a user
watched a movie does not necessarily indicate a positive view of that movie.
17.1.3 Recommendation Tasks
A number of recommendation tasks have been investigated in the past decades. Based on the
domain of applications, there are movie recommendation, news recommendation, point-of-interest
recommendation (Ye et al., 2011), and so forth. It is also possible to differentiate the tasks
based on the types of feedback and input data. For example, the rating prediction task aims to
predict explicit ratings. Top-n recommendation (item ranking) ranks all items for each user
personally based on implicit feedback. If timestamp information is also included, we can
build sequence-aware recommendation (Quadrana et al., 2018). Another popular task is
click-through rate prediction, which is also based on implicit feedback, but various categorical
features can be utilized. Recommending for new users and recommending new items to existing
users are called cold-start recommendation (Schein et al., 2002).
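To illustrate top-n recommendation, the sketch below ranks items per user from a matrix of predicted scores and skips items that a user has already interacted with. The scores, the interaction mask, and the function name `top_n` are made up for illustration; how the scores are produced is left to the models in the following sections.

```python
import torch

def top_n(scores, interacted, n):
    """Rank items for each user, skipping items already interacted with."""
    masked = scores.masked_fill(interacted, float('-inf'))
    return torch.topk(masked, k=n, dim=1).indices

# Predicted preference scores (rows: users, columns: items)
scores = torch.tensor([[0.9, 0.2, 0.7, 0.4],
                       [0.1, 0.8, 0.3, 0.6]])
# True where a user has already clicked or purchased the item
interacted = torch.tensor([[True, False, False, False],
                           [False, True, False, False]])
print(top_n(scores, interacted, n=2))
```

Each user's highest-scoring item is excluded here because it was already seen, so the lists start from the next best candidates.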
17.1.4 Summary
• Recommender systems are important for individual users and industries. Collaborative fil-
tering is a key concept in recommendation.
• There are two types of feedback: implicit feedback and explicit feedback. A number of
recommendation tasks have been explored during the last decade.
Exercises
1. Can you explain how recommender systems influence your daily life?
2. What interesting recommendation tasks do you think can be investigated?
250
https://www.imdb.com/
18 | Generative Adversarial Networks
18.1 Generative Adversarial Networks
Throughout most of this book, we have talked about how to make predictions. In some form or
another, we used deep neural networks to learn mappings from data examples to labels. This kind
of learning is called discriminative learning, as in, we'd like to be able to discriminate between
photos of cats and photos of dogs. Classifiers and regressors are both examples of discriminative
learning. And neural networks trained by backpropagation have upended everything we thought
we knew about discriminative learning on large complicated datasets. Classification accuracies
on high-res images have gone from useless to human-level (with some caveats) in just 5-6 years.
We will spare you another spiel about all the other discriminative tasks where deep neural
networks do astoundingly well.
But there is more to machine learning than just solving discriminative tasks. For example, given
a large dataset, without any labels, we might want to learn a model that concisely captures the
characteristics of this data. Given such a model, we could sample synthetic data examples that
resemble the distribution of the training data. For example, given a large corpus of photographs
of faces, we might want to be able to generate a new photorealistic image that looks like it might
plausibly have come from the same dataset. This kind of learning is called generative modeling.
Until recently, we had no method that could synthesize novel photorealistic images. But the suc-
cess of deep neural networks for discriminative learning opened up new possibilities. One big
trend over the last three years has been the application of discriminative deep nets to overcome
challenges in problems that we do not generally think of as supervised learning problems. The
recurrent neural network language models are one example of using a discriminative network
(trained to predict the next character) that once trained can act as a generative model.
In 2014, a breakthrough paper introduced generative adversarial networks (GANs) (Goodfellow
et al., 2014), a clever new way to leverage the power of discriminative models to get good
generative models. At their heart, GANs rely on the idea that a data generator is good if we
cannot tell fake data apart from real data. In statistics, this is called a two-sample test: a
test to answer the question whether datasets $X = \{x_1, \ldots, x_n\}$ and
$X' = \{x'_1, \ldots, x'_n\}$ were drawn from the same distribution. The main difference between
most statistics papers and GANs is that the latter use this idea in a constructive way. In other
words, rather than just training a model to say “hey, these two datasets do not look like they
came from the same distribution”, they use the two-sample test252 to provide training signals to
a generative model. This allows us to improve the data generator until it generates something
that resembles the real data. At the very least, it needs to fool the classifier, even if our
classifier is a state-of-the-art deep neural network.
252
https://en.wikipedia.org/wiki/Two-sample_hypothesis_testing
Fig. 18.1.1: Generative Adversarial Networks
The GAN architecture is illustrated in Fig. 18.1.1. As you can see, there are two pieces in the
GAN architecture. First off, we need a device (say, a deep network, but it really could be
anything, such as a game rendering engine) that might potentially be able to generate data that
looks just like the real thing. If we are dealing with images, this needs to generate images. If
we are dealing with speech, it needs to generate audio sequences, and so on. We call this the
generator network. The second component is the discriminator network. It attempts to distinguish
fake and real data from each other. Both networks are in competition with each other. The
generator network attempts to fool the discriminator network. At that point, the discriminator
network adapts to the new fake data. This information, in turn, is used to improve the generator
network, and so on.
The discriminator is a binary classifier that distinguishes whether the input x is real (from the real data) or fake
(from the generator). Typically, the discriminator outputs a scalar prediction o ∈ R for input x,
for example by using a fully connected layer with a single output, and then applies the sigmoid function to
obtain the predicted probability D(x) = 1/(1 + e^{−o}). Assume the label y is 1 for the true data and
0 for the fake data. We train the discriminator to minimize the cross-entropy loss, i.e.,
$$\min_D \{-y \log D(x) - (1 - y) \log(1 - D(x))\}. \tag{18.1.1}$$
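As a minimal sketch of the loss in (18.1.1), the snippet below evaluates the sigmoid and the cross-entropy for a single logit in plain Python. The helper names and the logit values are assumptions made for illustration; they are not part of the chapter's code.

```python
import math

def sigmoid(o):
    """Map a real-valued logit o to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-o))

def discriminator_loss(o, y):
    """Cross-entropy -y log D(x) - (1 - y) log(1 - D(x)) with D(x) = sigmoid(o)."""
    d = sigmoid(o)
    return -y * math.log(d) - (1 - y) * math.log(1 - d)

# Hypothetical logits, chosen only for illustration.
loss_real_right = discriminator_loss(3.0, y=1)   # real example scored as real
loss_real_wrong = discriminator_loss(-3.0, y=1)  # real example scored as fake
loss_fake_right = discriminator_loss(-3.0, y=0)  # fake example scored as fake
print(loss_real_right, loss_real_wrong, loss_fake_right)
```

Confident correct predictions give a small loss, confident wrong predictions a large one, and an undecided discriminator (o = 0) pays log 2 per example.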
The generator first draws a vector z ∈ R^d from a source of randomness, e.g., a
normal distribution z ∼ N(0, 1). We often call z the latent variable. It then applies a function
to generate x′ = G(z). The goal of the generator is to fool the discriminator into classifying x′ = G(z)
as true data, i.e., we want D(G(z)) ≈ 1. In other words, for a given discriminator D, we update
the parameters of the generator G to maximize the cross-entropy loss when y = 0, i.e.,
$$\max_G \{-(1 - y) \log(1 - D(G(z)))\} = \max_G \{-\log(1 - D(G(z)))\}. \tag{18.1.2}$$
If the discriminator does a perfect job, then D(x′) ≈ 0, so the above loss is near 0, which results in
gradients that are too small for the generator to make good progress. So commonly we minimize
the following loss:
$$\min_G \{-y \log(D(G(z)))\} = \min_G \{-\log(D(G(z)))\}, \tag{18.1.3}$$
which simply feeds x′ = G(z) into the discriminator but gives it the label y = 1.
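To see how the two generator objectives differ, one can compare their derivatives with respect to the discriminator's logit o on a generated sample, where D(G(z)) = sigmoid(o). The sketch below uses the closed-form derivatives; the function names and the probe values of o are assumptions for illustration.

```python
import math

def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))

def grad_saturating(o):
    """d/do of log(1 - sigmoid(o)); minimizing it is the same as the max in (18.1.2)."""
    return -sigmoid(o)

def grad_non_saturating(o):
    """d/do of -log(sigmoid(o)), the loss in (18.1.3)."""
    return sigmoid(o) - 1.0

# o is the discriminator's logit on a generated sample x' = G(z).
for o in (-6.0, 0.0, 6.0):
    print(o, grad_saturating(o), grad_non_saturating(o))
```

When the discriminator confidently rejects the sample (o very negative), the first derivative is close to 0 while the second stays close to −1, so the loss in (18.1.3) still provides a usable training signal.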
To sum up, D and G are playing a “minimax” game with the comprehensive objective function:
$$\min_D \max_G \{-E_{x \sim \textrm{Data}} \log D(x) - E_{z \sim \textrm{Noise}} \log(1 - D(G(z)))\}. \tag{18.1.4}$$

Many applications of GANs are in the context of images. For demonstration purposes, we
are going to content ourselves with fitting a much simpler distribution first. We will illustrate
what happens if we use GANs to build the worldʼs most inefficient estimator of parameters for a
Gaussian. Letʼs get started.
%matplotlib inline
import torch
from torch import nn
from d2l import torch as d2l
18.1.1 Generate Some “Real” Data
Since this is going to be the worldʼs lamest example, we simply generate data drawn from a Gaus-
sian.
X = torch.normal(0.0, 1, (1000, 2))

A = torch.tensor([[1, 2], [-0.1, 0.5]])
b = torch.tensor([1, 2])
data = torch.matmul(X, A) + b
Letʼs see what we got. This should be a Gaussian shifted in some rather arbitrary way with mean
b and covariance matrix A^⊤ A.
d2l.set_figsize()
d2l.plt.scatter(data[:100, (0)].detach().numpy(), data[:100, (1)].detach().numpy());
print(f'The covariance matrix is\n{torch.matmul(A.T, A)}')
The covariance matrix is

tensor([[1.0100, 1.9500],
[1.9500, 4.2500]])
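The covariance follows because each row of data is x A + b with x drawn from a standard normal, so the covariance of the rows is A^⊤ I A = A^⊤ A. As a quick check that does not depend on PyTorch, the product can be computed with plain lists; the helpers `transpose` and `matmul` are illustrative names, not part of the chapter's code.

```python
A = [[1.0, 2.0], [-0.1, 0.5]]

def transpose(M):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*M)]

def matmul(P, Q):
    """Multiply two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

cov = matmul(transpose(A), A)
print(cov)
```

Up to floating-point rounding, the entries agree with the tensor printed above.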
batch_size = 8
data_iter = d2l.load_array((data,), batch_size)

18.1.2 Generator
Our generator network will be the simplest network possible: a single-layer linear model. This
is because we will be driving that linear network with a Gaussian data generator. Hence, it literally
only needs to learn the parameters to fake things perfectly.
net_G = nn.Sequential(nn.Linear(2, 2))
18.1.3 Discriminator
For the discriminator we will be a bit more discriminating: we will use an MLP with 3 layers to
make things a bit more interesting.
net_D = nn.Sequential(
nn.Linear(2, 5), nn.Tanh(),
nn.Linear(5, 3), nn.Tanh(),
nn.Linear(3, 1))
18.1.4 Training
First we define a function to update the discriminator.
#@save
def update_D(X, Z, net_D, net_G, loss, trainer_D):
    """Update discriminator."""
    batch_size = X.shape[0]
    ones = torch.ones((batch_size,), device=X.device)
    zeros = torch.zeros((batch_size,), device=X.device)
    trainer_D.zero_grad()
    real_Y = net_D(X)
    fake_X = net_G(Z)
    # Do not need to compute gradient for `net_G`, detach it from
    # computing gradients.
    fake_Y = net_D(fake_X.detach())
    loss_D = (loss(real_Y, ones.reshape(real_Y.shape)) +
              loss(fake_Y, zeros.reshape(fake_Y.shape))) / 2
    loss_D.backward()
    trainer_D.step()
    return loss_D
The generator is updated similarly. Here we reuse the cross-entropy loss but change the label of
the fake data from 0 to 1.
#@save
def update_G(Z, net_D, net_G, loss, trainer_G):
    """Update generator."""
    batch_size = Z.shape[0]
    ones = torch.ones((batch_size,), device=Z.device)
    trainer_G.zero_grad()
    # We could reuse `fake_X` from `update_D` to save computation
    fake_X = net_G(Z)
    # Recomputing `fake_Y` is needed since `net_D` is changed
    fake_Y = net_D(fake_X)
    loss_G = loss(fake_Y, ones.reshape(fake_Y.shape))
    loss_G.backward()
    trainer_G.step()
    return loss_G
Both the discriminator and the generator perform a binary logistic regression with the cross-entropy
loss. We use Adam to smooth the training process. In each iteration, we first update the
discriminator and then the generator. We visualize both losses and generated examples.
def train(net_D, net_G, data_iter, num_epochs, lr_D, lr_G, latent_dim, data):
    loss = nn.BCEWithLogitsLoss(reduction='sum')
    for w in net_D.parameters():
        nn.init.normal_(w, 0, 0.02)
    for w in net_G.parameters():
        nn.init.normal_(w, 0, 0.02)
    trainer_D = torch.optim.Adam(net_D.parameters(), lr=lr_D)
    trainer_G = torch.optim.Adam(net_G.parameters(), lr=lr_G)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs], nrows=2, figsize=(5, 5),
                            legend=['discriminator', 'generator'])
    animator.fig.subplots_adjust(hspace=0.3)
    for epoch in range(num_epochs):
        # Train one epoch
        timer = d2l.Timer()
        metric = d2l.Accumulator(3)  # loss_D, loss_G, num_examples
        for (X,) in data_iter:
            batch_size = X.shape[0]
            Z = torch.normal(0, 1, size=(batch_size, latent_dim))
            metric.add(update_D(X, Z, net_D, net_G, loss, trainer_D),
                       update_G(Z, net_D, net_G, loss, trainer_G),
                       batch_size)
        # Visualize generated examples
        Z = torch.normal(0, 1, size=(100, latent_dim))
        fake_X = net_G(Z).detach().numpy()
        animator.axes[1].cla()
        animator.axes[1].scatter(data[:, 0], data[:, 1])
        animator.axes[1].scatter(fake_X[:, 0], fake_X[:, 1])
        animator.axes[1].legend(['real', 'generated'])
        # Show the losses
        loss_D, loss_G = metric[0]/metric[2], metric[1]/metric[2]
        animator.add(epoch + 1, (loss_D, loss_G))
    print(f'loss_D {loss_D:.3f}, loss_G {loss_G:.3f}, '
          f'{metric[2] / timer.stop():.1f} examples/sec')
Now we specify the hyperparameters to fit the Gaussian distribution.
lr_D, lr_G, latent_dim, num_epochs = 0.05, 0.005, 2, 20

train(net_D, net_G, data_iter, num_epochs, lr_D, lr_G,
latent_dim, data[:100].detach().numpy())

loss_D 0.696, loss_G 0.695, 1572.5 examples/sec
18.1.5 Summary
• Generative adversarial networks (GANs) are composed of two deep networks, the generator and
the discriminator.
• The generator tries to generate images as close as possible to the true images in order to fool the
discriminator, by maximizing log(D(x′)), which is equivalent to minimizing the loss −log(D(x′)) in (18.1.3).
• The discriminator tries to distinguish the generated images from the true images, by minimizing
the cross-entropy loss, i.e., min −y log D(x) − (1 − y) log(1 − D(x)).
Exercises
• Does an equilibrium exist where the generator wins, i.e. the discriminator ends up unable
to distinguish the two distributions on finite samples?

18.2 Deep Convolutional Generative Adversarial Networks
In Section 18.1, we introduced the basic ideas behind how GANs work. We showed that they can
draw samples from some simple, easy-to-sample distribution, like a uniform or normal distribu-
tion, and transform them into samples that appear to match the distribution of some dataset. And
while our example of matching a 2D Gaussian distribution got the point across, it is not especially
exciting.
In this section, we will demonstrate how you can use GANs to generate photorealistic images. We
will be basing our models on the deep convolutional GANs (DCGAN) introduced in (Radford et
al., 2015). We will borrow the convolutional architectures that have proven so successful for discriminative
computer vision problems and show how, via GANs, they can be leveraged to generate
photorealistic images.
import warnings
import torch
import torchvision
from torch import nn
from d2l import torch as d2l
18.2.1 The Pokemon Dataset
The dataset we will use is a collection of Pokemon sprites obtained from pokemondb254 . First
download, extract and load this dataset.
#@save
d2l.DATA_HUB['pokemon'] = (d2l.DATA_URL + 'pokemon.zip',
'c065c0e2593b8b161a2d7873e42418bf6a21106c')
data_dir = d2l.download_extract('pokemon')
pokemon = torchvision.datasets.ImageFolder(data_dir)
We resize each image into 64 × 64. The ToTensor transformation will project the pixel value into
[0, 1], while our generator will use the tanh function to obtain outputs in [−1, 1]. Therefore we
normalize the data with 0.5 mean and 0.5 standard deviation to match the value range.
batch_size = 256
transformer = torchvision.transforms.Compose([
    torchvision.transforms.Resize((64, 64)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(0.5, 0.5)
])
pokemon.transform = transformer
data_iter = torch.utils.data.DataLoader(
pokemon, batch_size=batch_size,
shuffle=True, num_workers=d2l.get_dataloader_workers())
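The value ranges described above can be checked with a few lines of plain Python. This is only a sketch of the per-pixel arithmetic that a normalization with mean 0.5 and standard deviation 0.5 performs, together with its inverse; the function names are assumptions for illustration.

```python
def normalize(p, mean=0.5, std=0.5):
    """Map a pixel value from [0, 1] to [-1, 1]."""
    return (p - mean) / std

def denormalize(q):
    """Map a value from [-1, 1] back to [0, 1] for display."""
    return q / 2 + 0.5

print(normalize(0.0), normalize(0.5), normalize(1.0))
```

The inverse mapping is what is needed when generated or normalized images are shown, since plotting expects pixel values in [0, 1].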
Letʼs visualize the first 20 images.

254
https://pokemondb.net/sprites

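warnings.filterwarnings('ignore')
d2l.set_figsize((4, 4))
for X, y in data_iter:
    # Undo the normalization so that pixel values lie in [0, 1] for display
    imgs = X[:20,:,:,:].permute(0, 2, 3, 1)/2+0.5
    d2l.show_images(imgs, num_rows=4, num_cols=5)
    break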
18.2.2 The Generator
The generator needs to map the noise variable z ∈ R^d, a length-d vector, to an RGB image with width
and height 64 × 64. In Section 14.11 we introduced the fully convolutional network that uses
transposed convolution layers (refer to Section 14.10) to enlarge the input size. The basic block of the
generator contains a transposed convolution layer followed by batch normalization and ReLU
activation.
class G_block(nn.Module):
    def __init__(self, out_channels, in_channels=3, kernel_size=4, strides=2,
                 padding=1, **kwargs):
        super(G_block, self).__init__(**kwargs)
        self.conv2d_trans = nn.ConvTranspose2d(in_channels, out_channels,
                                kernel_size, strides, padding, bias=False)
        self.batch_norm = nn.BatchNorm2d(out_channels)
        self.activation = nn.ReLU()

    def forward(self, X):
        return self.activation(self.batch_norm(self.conv2d_trans(X)))
By default, the transposed convolution layer uses a k_h = k_w = 4 kernel, s_h = s_w = 2 strides, and
p_h = p_w = 1 padding. With an input shape of n_h × n_w = 16 × 16, the generator block will double the
inputʼs width and height:
$$\begin{aligned}
n_h' \times n_w' &= [(n_h k_h - (n_h - 1)(k_h - s_h)) - 2p_h] \times [(n_w k_w - (n_w - 1)(k_w - s_w)) - 2p_w] \\
&= [(k_h + s_h (n_h - 1)) - 2p_h] \times [(k_w + s_w (n_w - 1)) - 2p_w] \\
&= [(4 + 2 \times (16 - 1)) - 2 \times 1] \times [(4 + 2 \times (16 - 1)) - 2 \times 1] \\
&= 32 \times 32.
\end{aligned} \tag{18.2.1}$$
x = torch.zeros((2, 3, 16, 16))

g_blk = G_block(20)
g_blk(x).shape
torch.Size([2, 20, 32, 32])
If we change the transposed convolution layer to use a 4 × 4 kernel, 1 × 1 strides, and zero padding, then with
an input size of 1 × 1, the output will have its width and height each increased by 3.
x = torch.zeros((2, 3, 1, 1))
g_blk = G_block(20, strides=1, padding=0)
g_blk(x).shape
torch.Size([2, 20, 4, 4])
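Both shape checks follow from the simplified form k + s(n − 1) − 2p in (18.2.1). A small helper makes this explicit; the function name is an assumption for illustration, and the formula is applied per spatial dimension, without dilation or output padding.

```python
def conv_transpose_out(n, k=4, s=2, p=1):
    """Output size of a transposed convolution along one dimension:
    k + s * (n - 1) - 2 * p (no dilation, no output padding)."""
    return k + s * (n - 1) - 2 * p

print(conv_transpose_out(16))           # default block: doubles 16
print(conv_transpose_out(1, s=1, p=0))  # stride 1, no padding: 1 grows by 3
```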
The generator consists of four basic blocks that increase the inputʼs width and height from 1
to 32. At the same time, it first projects the latent variable into 64 × 8 channels, and then halves
the channels each time. At last, a transposed convolution layer is used to generate the output. It
further doubles the width and height to match the desired 64 × 64 shape, and reduces the channel
size to 3. The tanh activation function is applied to project output values into the (−1, 1) range.
n_G = 64
net_G = nn.Sequential(
G_block(in_channels=100, out_channels=n_G*8,
strides=1, padding=0), # Output: (64 * 8, 4, 4)
G_block(in_channels=n_G*8, out_channels=n_G*4), # Output: (64 * 4, 8, 8)
G_block(in_channels=n_G*4, out_channels=n_G*2), # Output: (64 * 2, 16, 16)
G_block(in_channels=n_G*2, out_channels=n_G), # Output: (64, 32, 32)
nn.ConvTranspose2d(in_channels=n_G, out_channels=3,
kernel_size=4, stride=2, padding=1, bias=False),
nn.Tanh()) # Output: (3, 64, 64)
Generate a 100 dimensional latent variable to verify the generatorʼs output shape.
x = torch.zeros((1, 100, 1, 1))

net_G(x).shape

torch.Size([1, 3, 64, 64])
18.2.3 Discriminator
The discriminator is a normal convolutional network except that it uses a leaky ReLU as
its activation function. Given α ∈ [0, 1], its definition is
$$\textrm{leaky ReLU}(x) = \begin{cases} x & \text{if}\ x > 0 \\ \alpha x & \text{otherwise} \end{cases}. \tag{18.2.2}$$
As can be seen, it is the normal ReLU if α = 0, and the identity function if α = 1. For α ∈ (0, 1), leaky
ReLU is a nonlinear function that gives a non-zero output for a negative input. It aims to fix the
“dying ReLU” problem, in which a neuron might always receive a negative input and therefore cannot
make any progress since the gradient of ReLU is 0 there.
alphas = [0, .2, .4, .6, .8, 1]
x = torch.arange(-2, 1, 0.1)
Y = [nn.LeakyReLU(alpha)(x).detach().numpy() for alpha in alphas]
d2l.plot(x.detach().numpy(), Y, 'x', 'y', alphas)
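The definition in (18.2.2) and its derivative can also be written directly for scalars. The sketch below is only meant to show why a nonzero α keeps a gradient alive for negative inputs; the function names are illustrative.

```python
def leaky_relu(x, alpha):
    """Leaky ReLU from (18.2.2): x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha):
    """Derivative with respect to x: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for alpha in (0.0, 0.2, 1.0):
    print(alpha, leaky_relu(-2.0, alpha), leaky_relu_grad(-2.0, alpha))
```

With α = 0 the gradient for a negative input is exactly 0, which is the "dying ReLU" situation; with α = 0.2 it is 0.2, so the unit can still be updated.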
The basic block of the discriminator is a convolution layer followed by a batch normalization layer
and a leaky ReLU activation. The hyperparameters of the convolution layer are similar to those of the
transposed convolution layer in the generator block.
class D_block(nn.Module):
    def __init__(self, out_channels, in_channels=3, kernel_size=4, strides=2,
                 padding=1, alpha=0.2, **kwargs):
        super(D_block, self).__init__(**kwargs)
        self.conv2d = nn.Conv2d(in_channels, out_channels, kernel_size,
                                strides, padding, bias=False)
        self.batch_norm = nn.BatchNorm2d(out_channels)
        self.activation = nn.LeakyReLU(alpha, inplace=True)

    def forward(self, X):
        return self.activation(self.batch_norm(self.conv2d(X)))

A basic block with default settings will halve the width and height of the inputs, as we demonstrated
in Section 7.3. For example, given an input shape n_h = n_w = 16, with a kernel shape
k_h = k_w = 4, a stride shape s_h = s_w = 2, and a padding shape p_h = p_w = 1, the output shape will
be:
$$\begin{aligned}
n_h' \times n_w' &= \lfloor (n_h - k_h + 2p_h + s_h)/s_h \rfloor \times \lfloor (n_w - k_w + 2p_w + s_w)/s_w \rfloor \\
&= \lfloor (16 - 4 + 2 \times 1 + 2)/2 \rfloor \times \lfloor (16 - 4 + 2 \times 1 + 2)/2 \rfloor \\
&= 8 \times 8.
\end{aligned} \tag{18.2.3}$$
x = torch.zeros((2, 3, 16, 16))

d_blk = D_block(20)
d_blk(x).shape
torch.Size([2, 20, 8, 8])
The discriminator is a mirror of the generator.
n_D = 64
net_D = nn.Sequential(
D_block(n_D), # Output: (64, 32, 32)
D_block(in_channels=n_D, out_channels=n_D*2), # Output: (64 * 2, 16, 16)
D_block(in_channels=n_D*2, out_channels=n_D*4), # Output: (64 * 4, 8, 8)
D_block(in_channels=n_D*4, out_channels=n_D*8), # Output: (64 * 8, 4, 4)
nn.Conv2d(in_channels=n_D*8, out_channels=1,
kernel_size=4, bias=False)) # Output: (1, 1, 1)
It uses a convolution layer with output channel 1 as the last layer to obtain a single prediction
value.
x = torch.zeros((1, 3, 64, 64))

net_D(x).shape
torch.Size([1, 1, 1, 1])
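The shape comments in the discriminator can be reproduced by applying the formula in (18.2.3) layer by layer. The sketch below does this for one spatial dimension; `conv_out` is an illustrative helper, and the last call assumes the defaults of `nn.Conv2d` (stride 1, no padding) for the final layer.

```python
def conv_out(n, k=4, s=2, p=1):
    """Output size of a convolution along one dimension, as in (18.2.3)."""
    return (n - k + 2 * p + s) // s

sizes = [64]
for _ in range(4):                                # four D_block layers, default settings
    sizes.append(conv_out(sizes[-1]))
sizes.append(conv_out(sizes[-1], k=4, s=1, p=0))  # final 4 x 4 convolution
print(sizes)
```

Each block halves the side length, and the final 4 × 4 convolution collapses the remaining 4 × 4 map into a single value.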
18.2.4 Training
Compared to the basic GAN in Section 18.1, we use the same learning rate for both the generator and the
discriminator since they are similar to each other. In addition, we change β1 in Adam (Section
12.10) from 0.9 to 0.5. This decreases the smoothness of the momentum, the exponentially weighted
moving average of past gradients, to cope with the rapidly changing gradients that arise because the generator
and the discriminator fight with each other. Besides, the randomly generated noise Z is a 4-D
tensor and we use a GPU to accelerate the computation.
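The effect of lowering β1 can be illustrated with the moving average alone. The sketch below tracks m ← β m + (1 − β) g on a made-up gradient sequence that abruptly changes sign; it deliberately omits Adam's bias correction and second-moment scaling, and the sequence itself is an assumption for illustration.

```python
def ema(grads, beta):
    """Exponentially weighted moving average m <- beta * m + (1 - beta) * g."""
    m = 0.0
    history = []
    for g in grads:
        m = beta * m + (1 - beta) * g
        history.append(m)
    return history

# Hypothetical gradient sequence: steady at +1, then an abrupt flip to -1.
grads = [1.0] * 20 + [-1.0] * 3
slow = ema(grads, beta=0.9)
fast = ema(grads, beta=0.5)
print(slow[-1], fast[-1])
```

Three steps after the flip, the β = 0.9 average still points in the old direction, while the β = 0.5 average has already followed the change.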
def train(net_D, net_G, data_iter, num_epochs, lr, latent_dim,
          device=d2l.try_gpu()):
    loss = nn.BCEWithLogitsLoss(reduction='sum')
    for w in net_D.parameters():
        nn.init.normal_(w, 0, 0.02)
    for w in net_G.parameters():
        nn.init.normal_(w, 0, 0.02)
    net_D, net_G = net_D.to(device), net_G.to(device)
    trainer_hp = {'lr': lr, 'betas': [0.5,0.999]}
    trainer_D = torch.optim.Adam(net_D.parameters(), **trainer_hp)
    trainer_G = torch.optim.Adam(net_G.parameters(), **trainer_hp)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs], nrows=2, figsize=(5, 5),
                            legend=['discriminator', 'generator'])
    animator.fig.subplots_adjust(hspace=0.3)
    for epoch in range(1, num_epochs + 1):
        # Train one epoch
        timer = d2l.Timer()
        metric = d2l.Accumulator(3)  # loss_D, loss_G, num_examples
        for X, _ in data_iter:
            # The last minibatch may be smaller than the nominal batch size
            batch_size = X.shape[0]
            Z = torch.normal(0, 1, size=(batch_size, latent_dim, 1, 1))
            X, Z = X.to(device), Z.to(device)
            metric.add(d2l.update_D(X, Z, net_D, net_G, loss, trainer_D),
                       d2l.update_G(Z, net_D, net_G, loss, trainer_G),
                       batch_size)
        # Show generated examples
        Z = torch.normal(0, 1, size=(21, latent_dim, 1, 1), device=device)
        # Rescale the synthetic data from [-1, 1] to [0, 1] for display
        fake_x = net_G(Z).permute(0, 2, 3, 1) / 2 + 0.5
        imgs = torch.cat(
            [torch.cat([
                fake_x[i * 7 + j].cpu().detach() for j in range(7)], dim=1)
             for i in range(len(fake_x)//7)], dim=0)
        animator.axes[1].cla()
        animator.axes[1].imshow(imgs)
        # Show the losses
        loss_D, loss_G = metric[0] / metric[2], metric[1] / metric[2]
        animator.add(epoch, (loss_D, loss_G))
    print(f'loss_D {loss_D:.3f}, loss_G {loss_G:.3f}, '
          f'{metric[2] / timer.stop():.1f} examples/sec on {str(device)}')
We train the model with a small number of epochs just for demonstration. For better perfor-
mance, the variable num_epochs can be set to a larger number.
latent_dim, lr, num_epochs = 100, 0.005, 20

train(net_D, net_G, data_iter, num_epochs, lr, latent_dim)
loss_D 0.313, loss_G 6.414, 1077.0 examples/sec on cuda:0

18.2.5 Summary
• The DCGAN architecture has four convolutional layers for the discriminator and four
“fractionally-strided” convolutional layers for the generator.
• The discriminator is a stack of four strided convolutions with batch normalization (except its input
layer) and leaky ReLU activations.
• Leaky ReLU is a nonlinear function that gives a non-zero output for a negative input. It aims
to fix the “dying ReLU” problem and helps gradients flow more easily through the architecture.
Exercises
1. What will happen if we use standard ReLU activation rather than leaky ReLU?
2. Apply DCGAN on Fashion-MNIST and see which category works well and which does not.

19 | Appendix: Mathematics for Deep Learning
Brent Werness (Amazon), Rachel Hu (Amazon), and authors of this book

One of the wonderful parts of modern deep learning is the fact that much of it can be understood
and used without a full understanding of the mathematics below it. This is a sign that the field
is maturing. Just as most software developers no longer need to worry about the theory of com-
putable functions, neither should deep learning practitioners need to worry about the theoretical
foundations of maximum likelihood learning.
But, we are not quite there yet.
In practice, you will sometimes need to understand how architectural choices influence gradient
flow, or the implicit assumptions you make by training with a certain loss function. You might
need to know what in the world entropy measures, and how it can help you understand exactly
what bits-per-character means in your model. These all require deeper mathematical understand-
ing.
This appendix aims to provide you the mathematical background you need to understand the core
theory of modern deep learning, but it is not exhaustive. We will begin with examining linear al-
gebra in greater depth. We develop a geometric understanding of all the common linear algebraic
objects and operations that will enable us to visualize the effects of various transformations on our
data. A key element is the development of the basics of eigen-decompositions.
We next develop the theory of differential calculus to the point that we can fully understand why
the gradient is the direction of steepest descent, and why back-propagation takes the form it does.
Integral calculus is then discussed to the degree needed to support our next topic, probability
theory.
Problems encountered in practice frequently are not certain, and thus we need a language to speak
about uncertain things. We review the theory of random variables and the most commonly en-
countered distributions so we may discuss models probabilistically. This provides the foundation
for the naive Bayes classifier, a probabilistic classification technique.
Closely related to probability theory is the study of statistics. While statistics is far too large a field
to do justice in a short section, we will introduce fundamental concepts that all machine learning
practitioners should be aware of, in particular: evaluating and comparing estimators, conducting
hypothesis tests, and constructing confidence intervals.
Last, we turn to the topic of information theory, which is the mathematical study of information
storage and transmission. This provides the core language by which we may discuss quantitatively
how much information a model holds on a domain of discourse.
Taken together, these form the core of the mathematical concepts needed to begin down the path
towards a deep understanding of deep learning.
19.1 Geometry and Linear Algebraic Operations
In Section 2.3, we encountered the basics of linear algebra and saw how it could be used to express
common operations for transforming our data. Linear algebra is one of the key mathematical
pillars underlying much of the work that we do in deep learning and in machine learning more
broadly. While Section 2.3 contained enough machinery to communicate the mechanics of mod-
ern deep learning models, there is a lot more to the subject. In this section, we will go deeper,
highlighting some geometric interpretations of linear algebra operations, and introducing a few
fundamental concepts, including eigenvalues and eigenvectors.
19.1.1 Geometry of Vectors
First, we need to discuss the two common geometric interpretations of vectors, as either points
or directions in space. Fundamentally, a vector is a list of numbers such as the Python list below.
v = [1, 7, 0, 1]
Mathematicians most often write this as either a column or row vector, which is to say either as
$$\mathbf{x} = \begin{bmatrix} 1 \\ 7 \\ 0 \\ 1 \end{bmatrix}, \tag{19.1.1}$$
or
$$\mathbf{x}^\top = \begin{bmatrix} 1 & 7 & 0 & 1 \end{bmatrix}. \tag{19.1.2}$$
These often have different interpretations, where data examples are column vectors and weights
used to form weighted sums are row vectors. However, it can be beneficial to be flexible. As we
have described in Section 2.3, though a single vectorʼs default orientation is a column vector, for
any matrix representing a tabular dataset, treating each data example as a row vector in the matrix
is more conventional.
Given a vector, the first interpretation that we should give it is as a point in space. In two or three
dimensions, we can visualize these points by using the components of the vectors to define the
location of the points in space compared to a fixed reference called the origin. This can be seen
in Fig. 19.1.1.

Fig. 19.1.1: An illustration of visualizing vectors as points in the plane. The first component of the
vector gives the x-coordinate, the second component gives the y-coordinate. Higher dimensions
are analogous, although much harder to visualize.
This geometric point of view allows us to consider the problem on a more abstract level. No longer
faced with some insurmountable seeming problem like classifying pictures as either cats or dogs,
we can start considering tasks abstractly as collections of points in space and picturing the task as
discovering how to separate two distinct clusters of points.
In parallel, there is a second point of view that people often take of vectors: as directions in space.
Not only can we think of the vector v = [3, 2]⊤ as the location 3 units to the right and 2 units up
from the origin, we can also think of it as the direction itself to take 3 steps to the right and 2 steps
up. In this way, we consider all the vectors in Fig. 19.1.2 the same.
Fig. 19.1.2: Any vector can be visualized as an arrow in the plane. In this case, every vector drawn
is a representation of the vector (3, 2)⊤ .
One of the benefits of this shift is that we can make visual sense of the act of vector addition. In
particular, we follow the directions given by one vector, and then follow the directions given by
the other, as is seen in Fig. 19.1.3.

Fig. 19.1.3: We can visualize vector addition by first following one vector, and then another.
Vector subtraction has a similar interpretation. By considering the identity that u = v + (u − v),
we see that the vector u − v is the direction that takes us from the point v to the point u.
19.1.2 Dot Products and Angles
As we saw in Section 2.3, if we take two column vectors u and v, we can form their dot product by
computing:
$$\mathbf{u}^\top \mathbf{v} = \sum_i u_i \cdot v_i. \tag{19.1.3}$$
Because (19.1.3) is symmetric, we will mirror the notation of classical multiplication and write
u · v = u⊤ v = v⊤ u, (19.1.4)
to highlight the fact that exchanging the order of the vectors will yield the same answer.
The dot product (19.1.3) also admits a geometric interpretation: it is closely related to the angle
between two vectors. Consider the angle shown in Fig. 19.1.4.
Fig. 19.1.4: Between any two vectors in the plane there is a well defined angle θ. We will see this
angle is intimately tied to the dot product.
To start, letʼs consider two specific vectors:

v = (r, 0) and w = (s cos(θ), s sin(θ)). (19.1.5)

The vector v is length r and runs parallel to the x-axis, and the vector w is of length s and at angle
θ with the x-axis. If we compute the dot product of these two vectors, we see that
v · w = rs cos(θ) = ∥v∥∥w∥ cos(θ). (19.1.6)
With some simple algebraic manipulation, we can rearrange terms to obtain
$$\theta = \arccos\left(\frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\| \|\mathbf{w}\|}\right). \tag{19.1.7}$$
In short, for these two specific vectors, the dot product combined with the norms tells us the angle
between the two vectors. This same fact is true in general. We will not derive the expression here.
However, if we consider writing ∥v − w∥² in two ways, one with the dot product and the other
geometrically using the law of cosines, we can obtain the full relationship. Indeed, for any two
vectors v and w, the angle between the two vectors is
$$\theta = \arccos\left(\frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\| \|\mathbf{w}\|}\right). \tag{19.1.8}$$
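For readers who want the omitted steps, the two ways of writing ∥v − w∥² mentioned above can be sketched as follows. This is a short outline under the usual assumption that θ is the angle at the origin of the triangle with sides v, w, and v − w, and that both vectors are nonzero.

```latex
\begin{aligned}
\|\mathbf{v} - \mathbf{w}\|^2
  &= (\mathbf{v} - \mathbf{w}) \cdot (\mathbf{v} - \mathbf{w})
   = \|\mathbf{v}\|^2 + \|\mathbf{w}\|^2 - 2\, \mathbf{v} \cdot \mathbf{w}
   && \text{(expanding the dot product)} \\
\|\mathbf{v} - \mathbf{w}\|^2
  &= \|\mathbf{v}\|^2 + \|\mathbf{w}\|^2 - 2\, \|\mathbf{v}\| \|\mathbf{w}\| \cos(\theta)
   && \text{(law of cosines)} \\
\Longrightarrow \quad \mathbf{v} \cdot \mathbf{w}
  &= \|\mathbf{v}\| \|\mathbf{w}\| \cos(\theta).
\end{aligned}
```

Dividing by the product of the norms and applying arccos gives the angle formula.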
This is a nice result since nothing in the computation references two-dimensions. Indeed, we can
use this in three or three million dimensions without issue.
As a simple example, letʼs see how to compute the angle between a pair of vectors:
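%matplotlib inline
import torch
import torchvision
from torchvision import transforms
from d2l import torch as d2l

def angle(v, w):
    """Return the angle (in radians) between vectors v and w."""
    return torch.acos(v.dot(w) / (torch.norm(v) * torch.norm(w)))

angle(torch.tensor([0, 1, 2], dtype=torch.float32), torch.tensor([2.0, 3, 4]))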
tensor(0.4190)
We will not use it right now, but it is useful to know that we will refer to vectors for which the angle
is π/2 (or equivalently 90◦ ) as being orthogonal. By examining the equation above, we see that this
happens when θ = π/2, which is the same thing as cos(θ) = 0. The only way this can happen is
if the dot product itself is zero, and two vectors are orthogonal if and only if v · w = 0. This will
prove to be a helpful formula when understanding objects geometrically.
It is reasonable to ask: why is computing the angle useful? The answer comes in the kind of
invariance we expect data to have. Consider an image and a darker copy of it, in which every pixel
value is scaled to 10% of its original brightness. The values of the individual pixels are in general far from
the original values. Thus, if one computed the distance between the original image and the darker
one, the distance can be large. However, for most ML applications, the content is the same—it is
still an image of a cat as far as a cat/dog classifier is concerned. However, if we consider the angle,
it is not hard to see that for any vector v, the angle between v and 0.1 · v is zero. This corresponds
to the fact that scaling vectors keeps the same direction and just changes the length. The angle
considers the darker image identical.
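The contrast between distance and angle can be checked numerically. The sketch below uses plain Python and a made-up four-pixel "image"; the helper names and pixel values are assumptions for illustration, and the cosine is clamped to [−1, 1] to guard against rounding.

```python
import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def norm(v):
    return math.sqrt(dot(v, v))

def angle(v, w):
    """Angle in radians between v and w."""
    c = dot(v, w) / (norm(v) * norm(w))
    c = max(-1.0, min(1.0, c))  # guard against rounding slightly outside [-1, 1]
    return math.acos(c)

def distance(v, w):
    """Euclidean distance between v and w."""
    return norm([a - b for a, b in zip(v, w)])

image = [200.0, 50.0, 120.0, 255.0]  # hypothetical pixel values
darker = [0.1 * p for p in image]    # same content at 10% brightness
print(distance(image, darker), angle(image, darker))
```

The distance is large, while the angle is zero up to floating-point rounding.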

Examples like this are everywhere. In text, we might want the topic being discussed to not change
if we write a document twice as long that says the same thing. For some encodings (such as counting
the number of occurrences of words in some vocabulary), this corresponds to a doubling of
the vector encoding the document, so again we can use the angle.
Cosine Similarity
In ML contexts where the angle is employed to measure the closeness of two vectors, practitioners
adopt the term cosine similarity to refer to the quantity
$$\cos(\theta) = \frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\| \|\mathbf{w}\|}. \tag{19.1.9}$$
The cosine takes a maximum value of 1 when the two vectors point in the same direction, a min-
imum value of −1 when they point in opposite directions, and a value of 0 when the two vectors
are orthogonal. Note that if the components of high-dimensional vectors are sampled randomly
with mean 0, their cosine will nearly always be close to 0.
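These properties are easy to check empirically. The following sketch uses only the standard library; the dimension, the seed, and the choice of a standard normal for the components are assumptions for illustration.

```python
import math
import random

def cosine_similarity(v, w):
    """Cosine of the angle between v and w, as in (19.1.9)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

random.seed(0)
d = 10_000  # hypothetical dimension
v = [random.gauss(0, 1) for _ in range(d)]
w = [random.gauss(0, 1) for _ in range(d)]
neg_v = [-a for a in v]

print(cosine_similarity(v, v), cosine_similarity(v, neg_v),
      cosine_similarity(v, w))
```

For independent components with mean 0, the cosine of two random vectors has a standard deviation of roughly 1/√d, so in ten thousand dimensions it is typically on the order of 0.01.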
19.1.3 Hyperplanes
In addition to working with vectors, another key object that you must understand to go far in linear
algebra is the hyperplane, a generalization to higher dimensions of a line (two dimensions) or of a
plane (three dimensions). In a d-dimensional vector space, a hyperplane has d − 1 dimensions
and divides the space into two half-spaces.
Letʼs start with an example. Suppose that we have a column vector w = [2, 1]⊤ . We want to know,
“what are the points v with w · v = 1?” By recalling the connection between dot products and
angles above (19.1.8), we can see that this is equivalent to
$$\|\mathbf{v}\| \|\mathbf{w}\| \cos(\theta) = 1 \iff \|\mathbf{v}\| \cos(\theta) = \frac{1}{\|\mathbf{w}\|} = \frac{1}{\sqrt{5}}. \tag{19.1.10}$$
Fig. 19.1.5: Recalling trigonometry, we see the formula ∥v∥ cos(θ) is the length of the projection
of the vector v onto the direction of w
If we consider the geometric meaning of this expression, we see that this is equivalent to saying
that the length of the projection of v onto the direction of w is exactly 1/∥w∥, as is shown in Fig.

19.1.5. The set of all points where this is true is a line at right angles to the vector w. If we wanted,
we could find the equation for this line and see that it is 2x + y = 1 or equivalently y = 1 − 2x.
If we now look at what happens when we ask about the set of points with w · v > 1 or w · v < 1, we
can see that these are cases where the projections are longer or shorter than 1/∥w∥, respectively.
Thus, those two inequalities define either side of the line. In this way, we have found a way to cut
our space into two halves, where all the points on one side have dot product below a threshold,
and the other side above as we see in Fig. 19.1.6.
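The line and its two sides can be verified with a few lines of plain Python for w = [2, 1]⊤. The helper names and the sample points are assumptions for illustration.

```python
import math

w = [2.0, 1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def projection_length(v, w):
    """Signed length of the projection of v onto the direction of w."""
    return dot(v, w) / math.sqrt(dot(w, w))

def side(v, w, threshold=1.0):
    """Return +1, -1, or 0 depending on whether w . v is above, below, or on the threshold."""
    s = dot(w, v)
    return 1 if s > threshold else (-1 if s < threshold else 0)

# Points on the line y = 1 - 2x
on_line = [[x, 1 - 2 * x] for x in (-1.0, 0.0, 0.5, 2.0)]
print([projection_length(p, w) for p in on_line])
print(side([1.0, 1.0], w), side([0.0, 0.0], w))
```

Every point on the line has the same projection length 1/√5, and points off the line fall on one side or the other according to the sign of w · v − 1.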
Fig. 19.1.6: If we now consider the inequality version of the expression, we see that our hyperplane
(in this case: just a line) separates the space into two halves.
The story in higher dimension is much the same. If we now take w = [1, 2, 3]⊤ and ask about the
points in three dimensions with w · v = 1, we obtain a plane at right angles to the given vector w.
The two inequalities again define the two sides of the plane as is shown in Fig. 19.1.7.
Fig. 19.1.7: Hyperplanes in any dimension separate the space into two halves.
While our ability to visualize runs out at this point, nothing stops us from doing this in tens, hun-
dreds, or billions of dimensions. This occurs often when thinking about machine learned models.
For instance, we can understand linear classification models like those from Section 4.1, as meth-
ods to find hyperplanes that separate the different target classes. In this context, such hyperplanes
are often referred to as decision planes. The majority of deep learned classification models end with
a linear layer fed into a softmax, so one can interpret the role of the deep neural network to be to
find a non-linear embedding such that the target classes can be separated cleanly by hyperplanes.
To give a hand-built example, notice that we can produce a reasonable model to classify tiny im-
ages of t-shirts and trousers from the Fashion-MNIST dataset (seen in Section 4.2) by just taking
the vector between their means to define the decision plane and eyeball a crude threshold. First
we will load the data and compute the averages.

# Load in the dataset
trans = []
trans.append(transforms.ToTensor())
trans = transforms.Compose(trans)
train = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
train=True, download=True)
test = torchvision.datasets.FashionMNIST(root="../data", transform=trans,
train=False, download=True)
X_train_0 = torch.stack(
[x[0] * 256 for x in train if x[1] == 0]).type(torch.float32)
X_train_1 = torch.stack(
[x[0] * 256 for x in train if x[1] == 1]).type(torch.float32)
X_test = torch.stack(
[x[0] * 256 for x in test if x[1] == 0 or x[1] == 1]).type(torch.float32)
y_test = torch.stack([torch.tensor(x[1]) for x in test
if x[1] == 0 or x[1] == 1]).type(torch.float32)
# Compute averages
ave_0 = torch.mean(X_train_0, axis=0)
ave_1 = torch.mean(X_train_1, axis=0)
It can be informative to examine these averages in detail, so letʼs plot what they look like. In this
case, we see that the average indeed resembles a blurry image of a t-shirt.
# Plot average t-shirt

d2l.set_figsize()
d2l.plt.imshow(ave_0.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()
In the second case, we again see that the average resembles a blurry image of trousers.
# Plot average trousers

d2l.plt.imshow(ave_1.reshape(28, 28).tolist(), cmap='Greys')
d2l.plt.show()

In a fully machine learned solution, we would learn the threshold from the dataset. In this case,
I simply eyeballed a threshold that looked good on the training data by hand.
# Print test set accuracy with eyeballed threshold

w = (ave_1 - ave_0).T
# '@' is Matrix Multiplication operator in pytorch.
predictions = X_test.reshape(2000, -1) @ (w.flatten()) > -1500000
# Accuracy
torch.mean((predictions.type(y_test.dtype) == y_test).float(), dtype=torch.float64)
/tmp/ipykernel_14659/3819644537.py:2: UserWarning: The use of x.T on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider x.mT to transpose batches of matrices or x.permute(*torch.arange(x.ndim - 1, -1, -1)) to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2985.)
  w = (ave_1 - ave_0).T
tensor(0.7870, dtype=torch.float64)
19.1.4 Geometry of Linear Transformations
Through Section 2.3 and the above discussions, we have a solid understanding of the geometry of
vectors, lengths, and angles. However, there is one important object we have omitted discussing,
and that is a geometric understanding of linear transformations represented by matrices. Fully
internalizing what matrices can do to transform data between two potentially different high di-
mensional spaces takes significant practice, and is beyond the scope of this appendix. However,
we can start building up intuition in two dimensions.
Suppose that we have some matrix:
$$\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}. \tag{19.1.11}$$

If we want to apply this to an arbitrary vector v = [x, y]⊤ , we multiply and see that
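$$\begin{aligned}
\mathbf{A}\mathbf{v} &= \begin{bmatrix} a & b \\ c & d \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} \\
&= \begin{bmatrix} ax + by \\ cx + dy \end{bmatrix} \\
&= x\begin{bmatrix} a \\ c \end{bmatrix} + y\begin{bmatrix} b \\ d \end{bmatrix} \\
&= x\left\{\mathbf{A}\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right\} + y\left\{\mathbf{A}\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right\}.
\end{aligned} \tag{19.1.12}$$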
This may seem like an odd computation, where something clear became somewhat impenetrable.
However, it tells us that we can write the way that a matrix transforms any vector in terms of how
it transforms two specific vectors: [1, 0]⊤ and [0, 1]⊤ . This is worth considering for a moment. We
have essentially reduced an infinite problem (what happens to any pair of real numbers) to a finite
one (what happens to these specific vectors). These vectors are an example of a basis, where we can
write any vector in our space as a weighted sum of these basis vectors.
Letʼs draw what happens when we use the specific matrix
$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ -1 & 3 \end{bmatrix}. \tag{19.1.13}$$
If we look at the specific vector v = [2, −1]⊤ , we see this is 2 · [1, 0]⊤ + (−1) · [0, 1]⊤ , and thus we
know that the matrix A will send this to 2(A[1, 0]⊤ ) + (−1)(A[0, 1]⊤ ) = 2[1, −1]⊤ − [2, 3]⊤ = [0, −5]⊤ .
If we follow this logic through carefully, say by considering the grid of all integer pairs of points,
we see that what happens is that the matrix multiplication can skew, rotate, and scale the grid, but
the grid structure must remain as you see in Fig. 19.1.8.
Fig. 19.1.8: The matrix A acting on the given basis vectors. Notice how the entire grid is trans-
ported along with it.
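The decomposition in (19.1.12) can be checked numerically. The following is a minimal sketch using the matrix from (19.1.13) and the vector v = [2, −1]⊤ from the text; the variable names are illustrative choices, not part of the original listings.

```python
import torch

# The matrix from (19.1.13) and the vector v = [2, -1]^T from the text
A = torch.tensor([[1.0, 2.0], [-1.0, 3.0]])
v = torch.tensor([2.0, -1.0])

# The images of the two standard basis vectors are simply the columns of A
e1 = torch.tensor([1.0, 0.0])
e2 = torch.tensor([0.0, 1.0])
Ae1, Ae2 = A @ e1, A @ e2

# Direct product versus the weighted sum of the basis images
direct = A @ v
via_basis = v[0] * Ae1 + v[1] * Ae2
print(direct, via_basis)
```

Both routes give the same vector, which is the sense in which knowing where the basis vectors go determines the whole transformation.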
This is the most important intuitive point to internalize about linear transformations represented
by matrices. Matrices are incapable of distorting some parts of space differently than others. All
they can do is take the original coordinates on our space and skew, rotate, and scale them.
Some distortions can be severe. For instance the matrix
$$\mathbf{B} = \begin{bmatrix} 2 & -1 \\ 4 & -2 \end{bmatrix}, \tag{19.1.14}$$

compresses the entire two-dimensional plane down to a single line. Identifying and working with
such transformations are the topic of a later section, but geometrically we can see that this is
fundamentally different from the types of transformations we saw above. For instance, the result
from matrix A can be “bent back” to the original grid. The results from matrix B cannot because
we will never know where the vector [1, 2]⊤ came from—was it [1, 1]⊤ or [0, −1]⊤ ?
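To make the collapse concrete, the sketch below applies B to the two candidate inputs named in the text and to a batch of random points. The random batch and the seed are assumptions added for illustration.

```python
import torch

B = torch.tensor([[2.0, -1.0], [4.0, -2.0]])

# Two different inputs named in the text land on the same output
out1 = B @ torch.tensor([1.0, 1.0])
out2 = B @ torch.tensor([0.0, -1.0])

# Every output (x, y) should satisfy y = 2x; check this on random inputs
torch.manual_seed(0)
pts = torch.randn(2, 100)      # 100 random column vectors
images = B @ pts               # shape (2, 100)
on_line = torch.allclose(images[1], 2 * images[0])
print(out1, out2, on_line)
```

Since two distinct inputs share one output, no function of the output alone can recover the input, which is what "cannot be bent back" means here.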
While this picture was for a 2 × 2 matrix, nothing prevents us from taking the lessons learned
into higher dimensions. If we take similar basis vectors like [1, 0, . . . , 0] and see where our matrix
sends them, we can start to get a feeling for how the matrix multiplication distorts the entire space
in whatever dimension space we are dealing with.
19.1.5 Linear Dependence
Consider again the matrix

$$\mathbf{B} = \begin{bmatrix} 2 & -1 \\ 4 & -2 \end{bmatrix}. \tag{19.1.15}$$
This compresses the entire plane down to live on the single line y = 2x. The question now arises:
is there some way we can detect this just looking at the matrix itself? The answer is that indeed
we can. Let b1 = [2, 4]⊤ and b2 = [−1, −2]⊤ be the two columns of B. Remember that we
can write everything transformed by the matrix B as a weighted sum of the columns of the matrix,
such as a1 b1 + a2 b2 . We call this a linear combination. The fact that b1 = −2 · b2 means that we can
write any linear combination of those two columns entirely in terms of, say, b2 since
a1 b1 + a2 b2 = −2a1 b2 + a2 b2 = (a2 − 2a1 )b2 . (19.1.16)
This means that one of the columns is, in a sense, redundant because it does not define a unique
direction in space. This should not surprise us too much since we already saw that this matrix
collapses the entire plane down into a single line. Moreover, we see that the linear dependence
b1 = −2 · b2 captures this. To make this more symmetrical between the two vectors, we will write
this as
b1 + 2 · b2 = 0. (19.1.17)
In general, we will say that a collection of vectors v1 , . . . , vk are linearly dependent if there exist
coefficients a1 , . . . , ak not all equal to zero so that
$$\sum_{i=1}^k a_i \mathbf{v}_i = 0. \tag{19.1.18}$$
In this case, we can solve for one of the vectors in terms of some combination of the others, and
effectively render it redundant. Thus, a linear dependence in the columns of a matrix is a witness
to the fact that our matrix is compressing the space down to some lower dimension. If there is
no linear dependence we say the vectors are linearly independent. If the columns of a matrix are
linearly independent, no compression occurs and the operation can be undone.
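The relations (19.1.16) and (19.1.17) are easy to confirm numerically. This is a small sketch; the coefficients `a1` and `a2` are arbitrary values chosen for illustration.

```python
import torch

b1 = torch.tensor([2.0, 4.0])
b2 = torch.tensor([-1.0, -2.0])

# The dependence relation (19.1.17): b1 + 2 * b2 = 0
relation = b1 + 2 * b2

# Any combination a1*b1 + a2*b2 collapses to (a2 - 2*a1) * b2, as in (19.1.16)
a1, a2 = 0.7, -1.3  # arbitrary illustrative coefficients
lhs = a1 * b1 + a2 * b2
rhs = (a2 - 2 * a1) * b2
print(relation, lhs, rhs)
```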

19.1.6 Rank
If we have a general n × m matrix, it is reasonable to ask what dimension space the matrix maps
into. A concept known as the rank will be our answer. In the previous section, we noted that a
linear dependence bears witness to compression of space into a lower dimension and so we will
be able to use this to define the notion of rank. In particular, the rank of a matrix A is the largest
number of linearly independent columns amongst all subsets of columns. For example, the matrix
$$\mathbf{B} = \begin{bmatrix} 2 & 4 \\ -1 & -2 \end{bmatrix}, \tag{19.1.19}$$
has rank(B) = 1, since the two columns are linearly dependent, but either column by itself is not
linearly dependent. For a more challenging example, we can consider
$$\mathbf{C} = \begin{bmatrix} 1 & 3 & 0 & -1 & 0 \\ -1 & 0 & 1 & 1 & -1 \\ 0 & 3 & 1 & 0 & -1 \\ 2 & 3 & -1 & -2 & 1 \end{bmatrix}, \tag{19.1.20}$$
and show that C has rank two since, for instance, the first two columns are linearly independent,
however any of the ten collections of three columns are dependent.
This procedure, as described, is very inefficient. It requires looking at every subset of the columns
of our given matrix, and thus is potentially exponential in the number of columns. Later we will
see a more computationally efficient way to compute the rank of a matrix, but for now, this is
sufficient to see that the concept is well defined and understand the meaning.
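As a sanity check on the example C, the sketch below verifies a few explicit dependencies among its columns and then asks a library routine for the rank. It relies on `torch.linalg.matrix_rank`, which does not enumerate subsets of columns, so it illustrates the answer rather than the subset procedure described above.

```python
import torch

C = torch.tensor([[1.0, 3.0, 0.0, -1.0, 0.0],
                  [-1.0, 0.0, 1.0, 1.0, -1.0],
                  [0.0, 3.0, 1.0, 0.0, -1.0],
                  [2.0, 3.0, -1.0, -2.0, 1.0]])
c1, c2, c3, c4, c5 = C.T  # the five columns of C

# Columns 2, 4, and 5 are combinations of columns 1 and 3
dep_2 = torch.allclose(c2, 3 * c1 + 3 * c3)
dep_4 = torch.allclose(c4, -c1)
dep_5 = torch.allclose(c5, -c3)

rank = torch.linalg.matrix_rank(C).item()
print(dep_2, dep_4, dep_5, rank)
```

Because every column is a combination of the first and third, no three columns can be independent, which matches a rank of two.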
19.1.7 Invertibility
We have seen above that multiplication by a matrix with linearly dependent columns cannot be
undone, i.e., there is no inverse operation that can always recover the input. However, multiplication
by a full-rank matrix (i.e., some A that is an n × n matrix with rank n) can always be
undone. Consider the matrix
$$\mathbf{I} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}, \tag{19.1.21}$$
which is the matrix with ones along the diagonal, and zeros elsewhere. We call this the identity
matrix. It is the matrix which leaves our data unchanged when applied. To find a matrix which
undoes what our matrix A has done, we want to find a matrix A−1 such that
A−1 A = AA−1 = I. (19.1.22)
If we look at this as a system, we have n × n unknowns (the entries of A−1 ) and n × n equations
(the equality that needs to hold between every entry of the product A−1 A and every entry of I) so
we should generically expect a solution to exist. Indeed, in the next section we will see a quantity
called the determinant, which has the property that as long as the determinant is not zero, we can
find a solution. We call such a matrix A−1 the inverse matrix. As an example, if A is the general
2 × 2 matrix
$$\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \tag{19.1.23}$$

then we can see that the inverse is
$$\frac{1}{ad-bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}. \tag{19.1.24}$$
We can check that the formula works in practice by multiplying a matrix by the inverse that the
formula gives.
M = torch.tensor([[1, 2], [1, 4]], dtype=torch.float32)

M_inv = torch.tensor([[2, -1], [-0.5, 0.5]])
M_inv @ M
tensor([[1., 0.],
[0., 1.]])
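The closed form (19.1.24) can also be written as a small function. This is a sketch only: `inverse_2x2` is a hypothetical helper introduced here, and it assumes that ad − bc is nonzero.

```python
import torch

def inverse_2x2(A):
    """Inverse of a 2x2 matrix via the closed form in (19.1.24).

    Hypothetical helper for illustration; assumes ad - bc is nonzero.
    """
    a, b = A[0]
    c, d = A[1]
    det = a * d - b * c
    return torch.stack([torch.stack([d, -b]), torch.stack([-c, a])]) / det

M = torch.tensor([[1.0, 2.0], [1.0, 4.0]])
M_inv = inverse_2x2(M)
print(M_inv)
print(M_inv @ M)
```

For this M the helper reproduces the hand-entered `M_inv` used in the listing above, and it agrees with `torch.linalg.inv`.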
Numerical Issues
While the inverse of a matrix is useful in theory, we must say that most of the time we do not wish
to use the matrix inverse to solve a problem in practice. In general, there are far more numerically
stable algorithms for solving linear equations like
Ax = b, (19.1.25)
than computing the inverse and multiplying to get
x = A−1 b. (19.1.26)
Just as division by a small number can lead to numerical instability, so can inversion of a matrix
which is close to having low rank.
Moreover, it is common that the matrix A is sparse, which is to say that it contains only a small
number of non-zero values. If we were to explore examples, we would see that this does not mean
the inverse is sparse. Even if A was a 1 million by 1 million matrix with only 5 million non-zero
entries (and thus we need only store those 5 million), the inverse will typically have almost every
entry non-zero, requiring us to store all $(10^6)^2$ entries, that is, 1 trillion entries!
While we do not have time to dive all the way into the thorny numerical issues frequently encoun-
tered when working with linear algebra, we want to provide you with some intuition about when
to proceed with caution, and generally avoiding inversion in practice is a good rule of thumb.
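As a rough illustration of the rule of thumb, the sketch below solves a linear system with `torch.linalg.solve` and, for comparison only, by forming the inverse explicitly. The matrix is a made-up, well-conditioned example, so both routes agree here; the point is the calling pattern rather than a demonstration of instability.

```python
import torch

torch.manual_seed(0)
n = 50
# Illustrative test matrix: random entries plus a large diagonal shift
A = torch.randn(n, n, dtype=torch.float64) + n * torch.eye(n, dtype=torch.float64)
b = torch.randn(n, 1, dtype=torch.float64)

# Preferred: solve the system A x = b directly
x_solve = torch.linalg.solve(A, b)

# For comparison only: form the inverse explicitly, then multiply
x_inv = torch.linalg.inv(A) @ b

residual_solve = torch.norm(A @ x_solve - b).item()
residual_inv = torch.norm(A @ x_inv - b).item()
print(residual_solve, residual_inv)
```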
19.1.8 Determinant
The geometric view of linear algebra gives an intuitive way to interpret a fundamental quantity
known as the determinant. Consider the grid image from before, but now with a highlighted region
(Fig. 19.1.9).

Fig. 19.1.9: The matrix A again distorting the grid. This time, I want to draw particular attention
to what happens to the highlighted square.
Look at the highlighted square. This is a square with edges given by (0, 1) and (1, 0) and thus it
has area one. After A transforms this square, we see that it becomes a parallelogram. There is
no reason this parallelogram should have the same area that we started with, and indeed in the
specific case shown here of
$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ -1 & 3 \end{bmatrix}, \tag{19.1.27}$$
it is an exercise in coordinate geometry to compute the area of this parallelogram and obtain that
the area is 5.
In general, if we have a matrix
$$\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \tag{19.1.28}$$
we can see with some computation that the area of the resulting parallelogram is ad − bc. This
area is referred to as the determinant.
Letʼs check this quickly with some example code.
torch.det(torch.tensor([[1, -1], [2, 3]], dtype=torch.float32))
tensor(5.)
The eagle-eyed amongst us will notice that this expression can be zero or even negative. For the
negative term, this is a matter of convention taken generally in mathematics: if the matrix flips
the figure, we say the area is negated. Letʼs see now that when the determinant is zero, we learn
more.
Letʼs consider
$$\mathbf{B} = \begin{bmatrix} 2 & 4 \\ -1 & -2 \end{bmatrix}. \tag{19.1.29}$$
If we compute the determinant of this matrix, we get 2·(−2)−4·(−1) = 0. Given our understanding
above, this makes sense. B compresses the square from the original image down to a line segment,
which has zero area. And indeed, being compressed into a lower dimensional space is the only

way to have zero area after the transformation. Thus we see the following result is true: a matrix
A is invertible if and only if the determinant is not equal to zero.
As a final comment, imagine that we have any figure drawn on the plane. Thinking like computer
scientists, we can decompose that figure into a collection of little squares so that the area of the
figure is in essence just the number of squares in the decomposition. If we now transform that
figure by a matrix, we send each of these squares to parallelograms, each one of which has area
given by the determinant. We see that for any figure, the determinant gives the (signed) factor
by which a matrix scales the area of that figure.
Computing determinants for larger matrices can be laborious, but the intuition is the same. The
determinant remains the factor by which n × n matrices scale n-dimensional volumes.
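The area-scaling interpretation can be tested on a shape other than the unit square. The sketch below uses the shoelace formula on an arbitrary triangle; `polygon_area` and the triangle itself are illustrative assumptions, not part of the original text.

```python
import torch

def polygon_area(P):
    """Signed area of a polygon whose vertices are the rows of P (shoelace formula)."""
    x, y = P[:, 0], P[:, 1]
    x_next, y_next = torch.roll(x, -1), torch.roll(y, -1)
    return 0.5 * torch.sum(x * y_next - x_next * y)

A = torch.tensor([[1.0, 2.0], [-1.0, 3.0]])

# An arbitrary triangle (illustrative choice), vertices listed counter-clockwise
P = torch.tensor([[0.0, 0.0], [4.0, 0.0], [1.0, 3.0]])
P_mapped = P @ A.T   # apply A to each vertex (rows are points)

area_before = polygon_area(P)
area_after = polygon_area(P_mapped)
scale = area_after / area_before
print(area_before, area_after, scale, torch.det(A))
```

The ratio of the two areas matches the determinant of A, consistent with the claim that the determinant is the factor by which areas are scaled.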
19.1.9 Tensors and Common Linear Algebra Operations
In Section 2.3 the concept of tensors was introduced. In this section, we will dive more deeply into
tensor contractions (the tensor equivalent of matrix multiplication), and see how it can provide a
unified view on a number of matrix and vector operations.
With matrices and vectors we knew how to multiply them to transform data. We need to have a
similar definition for tensors if they are to be useful to us. Think about matrix multiplication:
C = AB, (19.1.30)
or equivalently
$$c_{i, j} = \sum_{k} a_{i, k} b_{k, j}. \tag{19.1.31}$$
This pattern is one we can repeat for tensors. For tensors, there is no single universal choice of
which indices to sum over, so we need to specify exactly which indices we want to sum over. For
instance, we could consider
$$y_{il} = \sum_{jk} x_{ijkl} a_{jk}. \tag{19.1.32}$$
Such a transformation is called a tensor contraction. It can represent a far more flexible family of
transformations than matrix multiplication alone.
As an often-used notational simplification, we can notice that the sum is over exactly those indices
that occur more than once in the expression, thus people often work with Einstein notation, where
the summation is implicitly taken over all repeated indices. This gives the compact expression:
$$y_{il} = x_{ijkl} a_{jk}. \tag{19.1.33}$$
Common Examples from Linear Algebra
Letʼs see how many of the linear algebraic definitions we have seen before can be expressed in
this compressed tensor notation:
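• $\mathbf{v} \cdot \mathbf{w} = \sum_i v_i w_i$
• $\|\mathbf{v}\|_2^2 = \sum_i v_i v_i$
• $(\mathbf{A}\mathbf{v})_i = \sum_j a_{ij} v_j$
• $(\mathbf{A}\mathbf{B})_{ik} = \sum_j a_{ij} b_{jk}$
• $\mathrm{tr}(\mathbf{A}) = \sum_i a_{ii}$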
In this way, we can replace a myriad of specialized notations with short tensor expressions.
Expressing in Code
Tensors may flexibly be operated on in code as well. As seen in Section 2.3, we can create tensors
as is shown below.
# Define tensors
B = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = torch.tensor([[1, 2], [3, 4]])
v = torch.tensor([1, 2])
# Print out the shapes

A.shape, B.shape, v.shape
(torch.Size([2, 2]), torch.Size([2, 2, 3]), torch.Size([2]))
Einstein summation has been implemented directly. The indices that occur in the Einstein
summation can be passed as a string, followed by the tensors that are being acted upon. For
instance, to implement matrix multiplication, we can consider the Einstein summation seen above
($(\mathbf{A}\mathbf{v})_i = a_{ij} v_j$) and strip out the indices themselves to get the implementation:
# Reimplement matrix multiplication

torch.einsum("ij, j -> i", A, v), A@v
(tensor([ 5, 11]), tensor([ 5, 11]))
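The other identities listed under Common Examples from Linear Algebra can be written the same way. In this sketch the second matrix `B2` and second vector `w` are made-up values for illustration, and each einsum result is compared against the corresponding built-in operation.

```python
import torch

A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B2 = torch.tensor([[0.0, 1.0], [-1.0, 2.0]])   # illustrative second matrix
v = torch.tensor([1.0, 2.0])
w = torch.tensor([3.0, -1.0])                  # illustrative second vector

dot = torch.einsum("i, i ->", v, w)            # v . w
sq_norm = torch.einsum("i, i ->", v, v)        # ||v||_2^2
matvec = torch.einsum("ij, j -> i", A, v)      # (Av)_i
matmul = torch.einsum("ij, jk -> ik", A, B2)   # (AB)_ik
trace = torch.einsum("ii ->", A)               # tr(A)
print(dot, sq_norm, matvec, matmul, trace)
```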
This is a highly flexible notation. For instance if we want to compute what would be traditionally
written as
$$c_{kl} = \sum_{ij} \mathbf{b}_{ijk}\mathbf{a}_{il}v_j. \tag{19.1.34}$$
it can be implemented via Einstein summation as:
torch.einsum("ijk, il, j -> kl", B, A, v)
tensor([[ 90, 126],

[102, 144],
[114, 162]])
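To see what the one-line einsum call is doing, the sum in (19.1.34) can be spelled out with explicit loops. This sketch is deliberately slow and is meant only as a cross-check; the loop bound names are illustrative.

```python
import torch

B = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = torch.tensor([[1, 2], [3, 4]])
v = torch.tensor([1, 2])

# c_kl = sum over i and j of b_ijk * a_il * v_j, written as plain loops
I, J, K = B.shape
L = A.shape[1]
C_loop = torch.zeros(K, L, dtype=B.dtype)
for k in range(K):
    for l in range(L):
        for i in range(I):
            for j in range(J):
                C_loop[k, l] += B[i, j, k] * A[i, l] * v[j]

C_einsum = torch.einsum("ijk, il, j -> kl", B, A, v)
print(C_loop)
```

The indices i and j appear in the inputs but not in the output string, so they are the ones summed over, while k and l survive as the axes of the result.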
This notation is readable and efficient for humans, but bulky if for whatever reason we need
to generate a tensor contraction programmatically. For this reason, einsum provides an alternative
notation by providing integer indices for each tensor. For example, the same tensor contraction
can also be written as:

# PyTorch doesn't support this type of notation.
Either notation allows for concise and efficient representation of tensor contractions in code.
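The placeholder comment above may reflect older PyTorch releases. To the best of our knowledge, more recent releases of `torch.einsum` also accept a "sublist" calling form, in which each operand is followed by a list of integer indices and an optional final list names the output indices. Treat the following as a sketch to check against the documentation of your installed version rather than as a guaranteed interface.

```python
import torch

B = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
A = torch.tensor([[1, 2], [3, 4]])
v = torch.tensor([1, 2])

# Integer labels stand in for letters: i -> 0, j -> 1, k -> 2, l -> 3
# (assumes a PyTorch version that supports the sublist form of einsum)
C_sublist = torch.einsum(B, [0, 1, 2], A, [0, 3], v, [1], [2, 3])

# Same contraction in the string form used earlier
C_string = torch.einsum("ijk, il, j -> kl", B, A, v)
print(C_sublist)
```

Because the index lists are ordinary Python lists, they can be built in a loop or from tensor shapes, which is the programmatic use case the text has in mind.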
19.1.10 Summary
• Vectors can be interpreted geometrically as either points or directions in space.

• Dot products extend the notion of angle to arbitrarily high-dimensional spaces.
• Hyperplanes are high-dimensional generalizations of lines and planes. They can be used to
define decision planes that are often used as the last step in a classification task.
• Matrix multiplication can be geometrically interpreted as a uniform distortion of the underlying
coordinates. It represents a very restricted, but mathematically clean, way to transform
vectors.
• Linear dependence is a way to tell when a collection of vectors are in a lower dimensional
space than we would expect (say you have 3 vectors living in a 2-dimensional space). The
rank of a matrix is the size of the largest subset of its columns that are linearly independent.
• When a matrixʼs inverse is defined, matrix inversion allows us to find another matrix that un-
does the action of the first. Matrix inversion is useful in theory, but requires care in practice
owing to numerical instability.
• Determinants allow us to measure how much a matrix expands or contracts a space. A
nonzero determinant implies an invertible (non-singular) matrix and a zero-valued deter-
minant means that the matrix is non-invertible (singular).
• Tensor contractions and Einstein summation provide for a neat and clean notation for ex-
pressing many of the computations that are seen in machine learning.
Exercises
1. What is the angle between
$$\vec v_1 = \begin{bmatrix} 1 \\ 0 \\ -1 \\ 2 \end{bmatrix}, \qquad \vec v_2 = \begin{bmatrix} 3 \\ 1 \\ 0 \\ 1 \end{bmatrix}? \tag{19.1.35}$$
2. True or false: $\begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}$ and $\begin{bmatrix} 1 & -2 \\ 0 & 1 \end{bmatrix}$ are inverses of one another?
3. Suppose that we draw a shape in the plane with area $100\,\mathrm{m}^2$. What is the area after
transforming the figure by the matrix
$$\begin{bmatrix} 2 & 3 \\ 1 & 2 \end{bmatrix}. \tag{19.1.36}$$
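4. Which of the following sets of vectors are linearly independent?
• $\left\{\begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}, \begin{pmatrix} 2 \\ 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 3 \\ 1 \\ 1 \end{pmatrix}\right\}$
• $\left\{\begin{pmatrix} 3 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}\right\}$
• $\left\{\begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}\right\}$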
5. Suppose that you have a matrix written as $A = \begin{bmatrix} c \\ d \end{bmatrix} \cdot \begin{bmatrix} a & b \end{bmatrix}$ for some choice of values a, b, c,
and d. True or false: the determinant of such a matrix is always 0?
6. The vectors $e_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $e_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ are orthogonal. What is the condition on a matrix A so
that Ae1 and Ae2 are orthogonal?
7. How can you write $\mathrm{tr}(\mathbf{A}^4)$ in Einstein notation for an arbitrary matrix A?
Discussions
19.2 Eigendecompositions
Eigenvalues are often one of the most useful notions we will encounter when studying linear alge-
bra, however, as a beginner, it is easy to overlook their importance. Below, we introduce eigen-
decomposition and try to convey some sense of just why it is so important.
Suppose that we have a matrix A with the following entries:
$$\mathbf{A} = \begin{bmatrix} 2 & 0 \\ 0 & -1 \end{bmatrix}. \tag{19.2.1}$$
If we apply A to any vector v = [x, y]⊤ , we obtain a vector Av = [2x, −y]⊤ . This has an intuitive
interpretation: stretch the vector to be twice as wide in the x-direction, and then flip it in the
y-direction.
However, there are some vectors for which something remains unchanged. Namely [1, 0]⊤ gets
sent to [2, 0]⊤ and [0, 1]⊤ gets sent to [0, −1]⊤ . These vectors are still in the same line, and the only
modification is that the matrix stretches them by a factor of 2 and −1 respectively. We call such
vectors eigenvectors and the factor they are stretched by eigenvalues.
In general, if we can find a number λ and a nonzero vector v such that
$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}, \tag{19.2.2}$$
we say that v is an eigenvector for A and λ is an eigenvalue.
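The defining relation (19.2.2) translates directly into a check. In this sketch `is_eigenpair` is a hypothetical helper and `u` is an extra vector chosen for illustration; neither appears in the original text.

```python
import torch

A = torch.tensor([[2.0, 0.0], [0.0, -1.0]])
e1 = torch.tensor([1.0, 0.0])
e2 = torch.tensor([0.0, 1.0])
u = torch.tensor([1.0, 1.0])   # illustrative vector that is not an eigenvector

def is_eigenpair(A, v, lam):
    """Hypothetical helper: True if A v equals lam * v up to rounding."""
    return torch.allclose(A @ v, lam * v)

print(is_eigenpair(A, e1, 2.0), is_eigenpair(A, e2, -1.0))
print(A @ u)   # [2, -1] is not a multiple of [1, 1]
```

The vector `u` is sent to [2, −1]⊤, which points in a different direction, so it is not an eigenvector for either eigenvalue.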


19.2.1 Finding Eigenvalues
Letʼs figure out how to find them. By subtracting off the λv from both sides, and then factoring
out the vector, we see the above is equivalent to:
(A − λI)v = 0. (19.2.3)
For (19.2.3) to hold with a nonzero v, we see that (A − λI) must compress some direction down to
zero, hence it is not invertible, and thus its determinant is zero. Thus, we can find the eigenvalues
by finding the values of λ for which det(A − λI) = 0. Once we find the eigenvalues, we can solve
Av = λv to find the associated eigenvector(s).
An Example
Letʼs see this with a more challenging matrix

$$\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 2 & 3 \end{bmatrix}. \tag{19.2.4}$$
If we consider det(A − λI) = 0, we see this is equivalent to the polynomial equation
0 = (2 − λ)(3 − λ) − 2 = (4 − λ)(1 − λ). Thus, the two eigenvalues are 4 and 1. To find the
associated vectors, we then need to solve
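$$\begin{bmatrix} 2 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix} \; \text{and} \; \begin{bmatrix} 2 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 4x \\ 4y \end{bmatrix}. \tag{19.2.5}$$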
We can solve this with the vectors [1, −1]⊤ and [1, 2]⊤ respectively.
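For a 2 × 2 matrix the characteristic polynomial is a quadratic, λ² − tr(A)λ + det(A), so the hand computation above can be reproduced with the quadratic formula. The sketch below assumes real eigenvalues (a nonnegative discriminant), which holds for this matrix.

```python
import math
import torch

A = torch.tensor([[2.0, 1.0], [2.0, 3.0]], dtype=torch.float64)

# For a 2x2 matrix, det(A - lambda I) = lambda^2 - tr(A) * lambda + det(A)
tr = torch.trace(A).item()
det = torch.det(A).item()
disc = tr ** 2 - 4 * det          # discriminant; positive for this matrix
lam_small = (tr - math.sqrt(disc)) / 2
lam_large = (tr + math.sqrt(disc)) / 2

# The hand-computed eigenvectors should satisfy A v = lambda v
v_small = torch.tensor([1.0, -1.0], dtype=torch.float64)
v_large = torch.tensor([1.0, 2.0], dtype=torch.float64)
print(lam_small, lam_large)
print(A @ v_small, A @ v_large)
```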
We can check this in code using the built-in torch.eig routine.
%matplotlib inline
import torch
torch.eig(torch.tensor([[2, 1], [2, 3]], dtype=torch.float64),

eigenvectors=True)
/tmp/ipykernel_18015/1653148152.py:6: UserWarning: torch.eig is deprecated in favor of torch.linalg.eig and will be removed in a future PyTorch release.
torch.linalg.eig returns complex tensors of dtype cfloat or cdouble rather than real tensors mimicking complex tensors.
L, _ = torch.eig(A)
should be replaced with
L_complex = torch.linalg.eigvals(A)
and
L, V = torch.eig(A, eigenvectors=True)
should be replaced with
L_complex, V_complex = torch.linalg.eig(A) (Triggered internally at ../aten/src/ATen/native/BatchLinearAlgebra.cpp:3427.)
  torch.eig(torch.tensor([[2, 1], [2, 3]], dtype=torch.float64),

torch.return_types.eig(
eigenvalues=tensor([[1., 0.],
[4., 0.]], dtype=torch.float64),
eigenvectors=tensor([[-0.7071, -0.4472],
[ 0.7071, -0.8944]], dtype=torch.float64))
Note that torch normalizes the eigenvectors to be of length one, whereas we took ours to be of
arbitrary length. Additionally, the choice of sign is arbitrary. However, the vectors computed are
parallel to the ones we found by hand with the same eigenvalues.
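The deprecation warning in the output above points to `torch.linalg.eig` as the replacement. The following sketch shows one way the same check might look with that routine; note that it returns complex tensors, so the matrix is cast to a complex dtype before multiplying.

```python
import torch

A = torch.tensor([[2, 1], [2, 3]], dtype=torch.float64)

# torch.linalg.eig returns complex tensors even when the eigenvalues are real
L_complex, V_complex = torch.linalg.eig(A)

# Check the defining relation A V = V diag(L), working in complex arithmetic
A_c = A.to(V_complex.dtype)
residual = (A_c @ V_complex - V_complex * L_complex).abs().max().item()

# This matrix has real eigenvalues, so the real parts carry all the information
eigenvalues = torch.sort(L_complex.real).values
print(eigenvalues, residual)
```

Multiplying `V_complex` elementwise by `L_complex` scales column j by the j-th eigenvalue, which is the same as right-multiplying by the diagonal matrix of eigenvalues.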
19.2.2 Decomposing Matrices
Letʼs continue the previous example one step further. Let

$$\mathbf{W} = \begin{bmatrix} 1 & 1 \\ -1 & 2 \end{bmatrix}, \tag{19.2.6}$$
be the matrix where the columns are the eigenvectors of the matrix A. Let
$$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0 \\ 0 & 4 \end{bmatrix}, \tag{19.2.7}$$
be the matrix with the associated eigenvalues on the diagonal. Then the definition of eigenvalues
and eigenvectors tells us that
AW = WΣ. (19.2.8)
The matrix W is invertible, so we may multiply both sides by $\mathbf{W}^{-1}$ on the right to see that we
may write
A = WΣW−1 . (19.2.9)
In the next section we will see some nice consequences of this, but for now we need only know
that such a decomposition will exist as long as we can find a full collection of linearly independent
eigenvectors (so that W is invertible).
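Both (19.2.8) and (19.2.9) can be confirmed for the running example with a few matrix products. This is a minimal sketch using the W and Σ given above.

```python
import torch

A = torch.tensor([[2.0, 1.0], [2.0, 3.0]], dtype=torch.float64)
W = torch.tensor([[1.0, 1.0], [-1.0, 2.0]], dtype=torch.float64)
Sigma = torch.diag(torch.tensor([1.0, 4.0], dtype=torch.float64))

lhs = A @ W                                    # left side of (19.2.8)
rhs = W @ Sigma                                # right side of (19.2.8)
A_rebuilt = W @ Sigma @ torch.linalg.inv(W)    # right side of (19.2.9)
print(lhs, rhs, A_rebuilt)
```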
19.2.3 Operations on Eigendecompositions
One nice thing about eigendecompositions (19.2.9) is that we can write many operations we usu-
ally encounter cleanly in terms of the eigendecomposition. As a first example, consider:
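$$\mathbf{A}^n = \overbrace{\mathbf{A}\cdots \mathbf{A}}^{\text{$n$ times}} = \overbrace{(\mathbf{W}\boldsymbol{\Sigma} \mathbf{W}^{-1})\cdots(\mathbf{W}\boldsymbol{\Sigma} \mathbf{W}^{-1})}^{\text{$n$ times}} = \mathbf{W}\overbrace{\boldsymbol{\Sigma}\cdots\boldsymbol{\Sigma}}^{\text{$n$ times}}\mathbf{W}^{-1} = \mathbf{W}\boldsymbol{\Sigma}^n \mathbf{W}^{-1}. \tag{19.2.10}$$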
This tells us that for any positive power of a matrix, the eigendecomposition is obtained by just
raising the eigenvalues to the same power. The same can be shown for negative powers, so if we
want to invert a matrix we need only consider
A−1 = WΣ−1 W−1 , (19.2.11)
or in other words, just invert each eigenvalue. This will work as long as each eigenvalue is
non-zero, so we see that being invertible is the same as having no zero eigenvalues.

Indeed, additional work can show that if λ1 , . . . , λn are the eigenvalues of a matrix, then the de-
terminant of that matrix is
det(A) = λ1 · · · λn , (19.2.12)
or the product of all the eigenvalues. This makes sense intuitively because whatever stretching
W does, W −1 undoes it, so in the end the only stretching that happens is by multiplication by the
diagonal matrix Σ, which stretches volumes by the product of the diagonal elements.
Finally, recall that the rank was the maximum number of linearly independent columns of your
matrix. By examining the eigendecomposition closely, we can see that the rank is the same as the
number of non-zero eigenvalues of A.
The examples could continue, but hopefully the point is clear: eigendecomposition can simplify
many linear-algebraic computations and is a fundamental operation underlying many numerical
algorithms and much of the analysis that we do in linear algebra.
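The identities (19.2.10) through (19.2.12) can be checked on the running example. The sketch below reuses the hand-computed W and eigenvalues; the exponent `n = 5` is an arbitrary illustrative choice.

```python
import torch

A = torch.tensor([[2.0, 1.0], [2.0, 3.0]], dtype=torch.float64)
W = torch.tensor([[1.0, 1.0], [-1.0, 2.0]], dtype=torch.float64)
eigvals = torch.tensor([1.0, 4.0], dtype=torch.float64)
W_inv = torch.linalg.inv(W)

n = 5
A_pow = W @ torch.diag(eigvals ** n) @ W_inv     # matrix power, as in (19.2.10)
A_inv = W @ torch.diag(1.0 / eigvals) @ W_inv    # inverse, as in (19.2.11)
det_A = torch.prod(eigvals)                      # determinant, as in (19.2.12)
print(A_pow, A_inv, det_A)
```

In each case only the diagonal entries change, which is why these operations become cheap once an eigendecomposition is available.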
19.2.4 Eigendecompositions of Symmetric Matrices
It is not always possible to find enough linearly independent eigenvectors for the above process
to work. For instance the matrix
$$\mathbf{A} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \tag{19.2.13}$$
has only a single eigenvector, namely (1, 0)⊤ . To handle such matrices, we require more advanced
techniques than we can cover (such as the Jordan Normal Form, or Singular Value Decomposition).
We will often need to restrict our attention to those matrices where we can guarantee the existence
of a full set of eigenvectors.
The most commonly encountered family are the symmetric matrices, which are those matrices
where A = A⊤ . In this case, we may take W to be an orthogonal matrix—a matrix whose columns
are all length one vectors that are at right angles to one another, where W⊤ = W−1 —and all the
eigenvalues will be real.
Thus, in this special case, we can write (19.2.9) as
A = WΣW⊤ . (19.2.14)
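For symmetric matrices, PyTorch provides `torch.linalg.eigh`, which is intended for symmetric (Hermitian) input. The sketch below uses a small made-up symmetric matrix to check that the returned eigenvector matrix is orthogonal and that (19.2.14) holds.

```python
import torch

# An illustrative symmetric matrix (not from the text)
A = torch.tensor([[2.0, 1.0], [1.0, 2.0]], dtype=torch.float64)

# eigh returns real eigenvalues in ascending order and orthonormal
# eigenvectors as the columns of W
eigvals, W = torch.linalg.eigh(A)

identity = torch.eye(2, dtype=torch.float64)
orthogonality_error = (W.T @ W - identity).abs().max().item()
A_rebuilt = W @ torch.diag(eigvals) @ W.T
print(eigvals, orthogonality_error)
```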
19.2.5 Gershgorin Circle Theorem
Eigenvalues are often difficult to reason with intuitively. If presented an arbitrary matrix, there is
little that can be said about what the eigenvalues are without computing them. There is, however,
one theorem that makes it easy to approximate them well when the largest values are on the diagonal.
Let $\mathbf{A} = (a_{ij})$ be any square matrix (n × n). We will define $r_i = \sum_{j \neq i} |a_{ij}|$. Let $\mathcal{D}_i$ represent the disc
in the complex plane with center $a_{ii}$ and radius $r_i$. Then, every eigenvalue of A is contained in one of
the $\mathcal{D}_i$.
This can be a bit to unpack, so letʼs look at an example.

Consider the matrix:
$$\mathbf{A} = \begin{bmatrix} 1.0 & 0.1 & 0.1 & 0.1 \\ 0.1 & 3.0 & 0.2 & 0.3 \\ 0.1 & 0.2 & 5.0 & 0.5 \\ 0.1 & 0.3 & 0.5 & 9.0 \end{bmatrix}. \tag{19.2.15}$$
We have r1 = 0.3, r2 = 0.6, r3 = 0.8 and r4 = 0.9. The matrix is symmetric, so all eigenvalues are
real. This means that all of our eigenvalues will be in one of the ranges of
[a11 − r1 , a11 + r1 ] = [0.7, 1.3], (19.2.16)
[a22 − r2 , a22 + r2 ] = [2.4, 3.6], (19.2.17)
[a33 − r3 , a33 + r3 ] = [4.2, 5.8], (19.2.18)
[a44 − r4 , a44 + r4 ] = [8.1, 9.9]. (19.2.19)

Performing the numerical computation shows that the eigenvalues are approximately 0.99, 2.97,
4.95, 9.08, all comfortably inside the ranges provided.
A = torch.tensor([[1.0, 0.1, 0.1, 0.1],

[0.1, 3.0, 0.2, 0.3],
[0.1, 0.2, 5.0, 0.5],
[0.1, 0.3, 0.5, 9.0]])
v, _ = torch.eig(A)
v
tensor([[0.9923, 0.0000],
[9.0803, 0.0000],
[4.9539, 0.0000],
[2.9734, 0.0000]])
In this way, eigenvalues can be approximated, and the approximations will be fairly accurate in
the case that the diagonal is significantly larger than all the other elements.
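The radii and intervals from the example can be computed rather than read off by hand. This sketch assumes a real symmetric matrix, so the discs reduce to intervals on the real line and `torch.linalg.eigvalsh` applies; for a general matrix the discs live in the complex plane.

```python
import torch

A = torch.tensor([[1.0, 0.1, 0.1, 0.1],
                  [0.1, 3.0, 0.2, 0.3],
                  [0.1, 0.2, 5.0, 0.5],
                  [0.1, 0.3, 0.5, 9.0]], dtype=torch.float64)

centers = torch.diagonal(A)
radii = A.abs().sum(dim=1) - centers.abs()   # r_i = sum over j != i of |a_ij|
lower, upper = centers - radii, centers + radii

# A is symmetric, so eigvalsh applies and returns real eigenvalues
eigvals = torch.linalg.eigvalsh(A)

# Each eigenvalue should fall inside at least one interval
inside = [bool(((lower <= lam) & (lam <= upper)).any()) for lam in eigvals]
print(radii, eigvals, inside)
```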
It is a small thing, but with a complex and subtle topic like eigendecomposition, it is good to get
any intuitive grasp we can.
19.2.6 A Useful Application: The Growth of Iterated Maps
Now that we understand what eigenvectors are in principle, letʼs see how they can be used to
provide a deep understanding of a problem central to neural network behavior: proper weight
initialization.

Eigenvectors as Long Term Behavior
The full mathematical investigation of the initialization of deep neural networks is beyond the
scope of the text, but we can see a toy version here to understand how eigenvalues can help us see
how these models work. As we know, neural networks operate by interspersing layers of linear
transformations with non-linear operations. For simplicity here, we will assume that there is no
non-linearity, and that the transformation is a single repeated matrix operation A, so that the
output of our model is
vout = A · A · · · Avin = AN vin . (19.2.20)
When these models are initialized, A is taken to be a random matrix with Gaussian entries, so letʼs
make one of those. To be concrete, we start with a mean zero, variance one Gaussian distributed
5 × 5 matrix.
torch.manual_seed(42)
k = 5
A = torch.randn(k, k, dtype=torch.float64)
A
tensor([[ 0.2996, 0.2424, 0.2832, -0.2329, 0.6712],

[ 0.7818, -1.7903, -1.7484, 0.1735, -0.1182],
[-1.7446, -0.4695, 0.4573, 0.5177, -0.2771],
[-0.6641, 0.6551, 0.2616, -1.5265, -0.3311],
[-0.6378, 0.1072, 0.7096, 0.3009, -0.2869]], dtype=torch.float64)
Behavior on Random Data
For simplicity in our toy model, we will assume that the data vector we feed in vin is a random five
dimensional Gaussian vector. Let's think about what we want to have happen. For context, let's
think of a generic ML problem, where we are trying to turn input data, like an image, into a
prediction, like the probability the image is a picture of a cat. If repeated application of A stretches a
random vector out to be very long, then small changes in input will be amplified into large changes
in output—tiny modifications of the input image would lead to vastly different predictions. This
does not seem right!
On the flip side, if A shrinks random vectors to be shorter, then after running through many layers,
the vector will essentially shrink to nothing, and the output will not depend on the input. This is
clearly not right either!
We need to walk the narrow line between growth and decay to make sure that our output changes
depending on our input, but not much!
Letʼs see what happens when we repeatedly multiply our matrix A against a random input vector,
and keep track of the norm.
# Calculate the sequence of norms after repeatedly applying `A`
v_in = torch.randn(k, 1, dtype=torch.float64)

norm_list = [torch.norm(v_in).item()]
for i in range(1, 100):
    v_in = A @ v_in
    norm_list.append(torch.norm(v_in).item())

d2l.plot(torch.arange(0, 100), norm_list, 'Iteration', 'Value')
The norm is growing uncontrollably! Indeed if we take the list of quotients, we will see a pattern.
# Compute the scaling factor of the norms
norm_ratio_list = []
for i in range(1, 100):
    norm_ratio_list.append(norm_list[i]/norm_list[i - 1])

d2l.plot(torch.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio')
If we look at the last portion of the above computation, we see that the random vector is stretched
by a factor of 1.974459321485[...], where the portion at the end shifts a little, but the stretching
factor is stable.

Relating Back to Eigenvectors
We have seen that eigenvectors and eigenvalues correspond to the amount something is stretched,
but that was for specific vectors, and specific stretches. Letʼs take a look at what they are for A. A
bit of a caveat here: it turns out that to see them all, we will need to go to complex numbers. You
can think of these as stretches and rotations. By taking the norm of the complex number (square
root of the sums of squares of real and imaginary parts) we can measure that stretching factor.
Letʼs also sort them.
# Compute the eigenvalues

eigs = torch.eig(A)[0][:,0].tolist()
norm_eigs = [torch.abs(torch.tensor(x)) for x in eigs]
norm_eigs.sort()
print(f'norms of eigenvalues: {norm_eigs}')
norms of eigenvalues: [tensor(0.3490), tensor(0.5691), tensor(0.5691), tensor(1.1828),␣

,→tensor(2.4532)]
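The listing above keeps only the real part of each eigenvalue (the `[:,0]` column) before taking absolute values, which agrees with the modulus described in the text only when an eigenvalue is real. A sketch of the full complex modulus, assuming `torch.linalg.eigvals` is available, might look as follows. The matrix is regenerated in the same way as earlier, though whether its entries match the printed ones depends on the PyTorch version and random state.

```python
import torch

torch.manual_seed(42)
k = 5
A = torch.randn(k, k, dtype=torch.float64)

eigs = torch.linalg.eigvals(A)                   # complex tensor of length k
moduli = torch.sort(eigs.abs()).values           # |lambda|, sorted ascending

# The same quantity written out as in the text: sqrt(real^2 + imag^2)
by_hand = torch.sort(torch.sqrt(eigs.real ** 2 + eigs.imag ** 2)).values
print(f'moduli of eigenvalues: {moduli.tolist()}')
```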
An Observation
We see something a bit unexpected happening here: that number we identified before for the long
term stretching of our matrix A applied to a random vector is exactly (accurate to thirteen decimal
places!) the largest eigenvalue of A. This is clearly not a coincidence!
But, if we now think about what is happening geometrically, this starts to make sense. Consider a
random vector. This random vector points a little in every direction, so in particular, it points at
least a little bit in the same direction as the eigenvector of A associated with the largest eigenvalue.
These are so important that they are called the principal eigenvalue and principal eigenvector. After
applying A, our random vector gets stretched in every possible direction, as is associated with every
possible eigenvector, but it is stretched most of all in the direction associated with this principal
eigenvector. What this means is that after applying A, our random vector is longer, and points
in a direction closer to being aligned with the principal eigenvector. After applying the matrix
many times, the alignment with the principal eigenvector becomes closer and closer until, for
all practical purposes, our random vector has been transformed into the principal eigenvector!
Indeed this algorithm is the basis for what is known as the power iteration for finding the largest
eigenvalue and eigenvector of a matrix. For details see, for example, (Van Loan and Golub, 1983).
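The repeated stretching described above can be sketched as a small power iteration in plain Python. This is a minimal illustration rather than a production routine: the function name `power_iteration` and the 2×2 matrix `A_demo` (chosen so that its eigenvalues are 5 and 2) are assumptions made for this example, and the vector is renormalized at every step so that its length stays at one.

```python
import math

def power_iteration(A, v, num_iters=100):
    """Repeatedly apply A to v, renormalizing to unit length at each step."""
    stretch = 0.0
    for _ in range(num_iters):
        # Matrix-vector product A @ v written out with plain lists
        w = [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]
        # Since v has unit length, the norm of A @ v estimates |lambda_max|
        stretch = math.sqrt(sum(w_i * w_i for w_i in w))
        v = [w_i / stretch for w_i in w]
    return stretch, v

# Hypothetical 2x2 example with eigenvalues 5 and 2
A_demo = [[4.0, 1.0], [2.0, 3.0]]
stretch, direction = power_iteration(A_demo, [1.0, 0.0])
print(f'estimated largest |eigenvalue|: {stretch:.6f}')
print(f'estimated eigenvector: {direction}')
```

Renormalizing inside the loop keeps the numbers from overflowing, while the stretching factor measured at each step still converges to the norm of the principal eigenvalue.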
Fixing the Normalization
Now, from the above discussion, we concluded that we do not want a random vector to be stretched
or squished at all; we would like random vectors to stay about the same size throughout the entire
process. To do so, we now rescale our matrix by this principal eigenvalue so that the largest
eigenvalue is instead now just one. Letʼs see what happens in this case.
# Rescale the matrix `A`

A /= norm_eigs[-1]
# Do the same experiment again

v_in = torch.randn(k, 1, dtype=torch.float64)

norm_list = [torch.norm(v_in).item()]
for i in range(1, 100):
    v_in = A @ v_in
    norm_list.append(torch.norm(v_in).item())

d2l.plot(torch.arange(0, 100), norm_list, 'Iteration', 'Value')
We can also plot the ratio between consecutive norms as before and see that indeed it stabilizes.
# Also plot the ratio

norm_ratio_list = []
for i in range(1, 100):
    norm_ratio_list.append(norm_list[i]/norm_list[i-1])

d2l.plot(torch.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio')

19.2.7 Discussion
We now see exactly what we hoped for! After normalizing the matrices by the principal eigenvalue,
we see that the random data does not explode as before, but rather eventually equilibrates to a
specific value. It would be nice to be able to do these things from first principles, and it turns out
that if we look deeply at the mathematics of it, we can see that the largest eigenvalue of a large
random matrix with independent mean zero, variance one Gaussian entries is on average about
$\sqrt{n}$, or in our case $\sqrt{5} \approx 2.2$, due to a fascinating fact known as the circular law (Ginibre, 1965).
The relationship between the eigenvalues (and a related object called singular values) of random
matrices has been shown to have deep connections to proper initialization of neural networks as
was discussed in (Pennington et al., 2017) and subsequent works.
19.2.8 Summary
• Eigenvectors are vectors which are stretched by a matrix without changing direction.
• Eigenvalues are the amount that the eigenvectors are stretched by the application of the
matrix.
• The eigendecomposition of a matrix can allow for many operations to be reduced to opera-
tions on the eigenvalues.
• The Gershgorin Circle Theorem can provide approximate values for the eigenvalues of a
matrix.
• The behavior of iterated matrix powers depends primarily on the size of the largest eigen-
value. This understanding has many applications in the theory of neural network initializa-
tion.
Exercises
1. What are the eigenvalues and eigenvectors of
$$A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}? \tag{19.2.21}$$
2. What are the eigenvalues and eigenvectors of the following matrix, and what is strange about
this example compared to the previous one?
$$A = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}. \tag{19.2.22}$$
3. Without computing the eigenvalues, is it possible that the smallest eigenvalue of the following
matrix is less than 0.5? Note: this problem can be done in your head.
$$A = \begin{bmatrix} 3.0 & 0.1 & 0.3 & 1.0 \\ 0.1 & 1.0 & 0.1 & 0.2 \\ 0.3 & 0.1 & 5.0 & 0.0 \\ 1.0 & 0.2 & 0.0 & 1.8 \end{bmatrix}. \tag{19.2.23}$$
Discussions

19.3 Single Variable Calculus
In Section 2.4, we saw the basic elements of differential calculus. This section takes a deeper
dive into the fundamentals of calculus and how we can understand and apply it in the context of
machine learning.
19.3.1 Differential Calculus
Differential calculus is fundamentally the study of how functions behave under small changes. To
see why this is so core to deep learning, letʼs consider an example.
Suppose that we have a deep neural network where the weights are, for convenience, concatenated
into a single vector w = (w1 , . . . , wn ). Given a training dataset, we consider the loss of our neural
network on this dataset, which we will write as L(w).
This function is extraordinarily complex, encoding the performance of all possible models of the
given architecture on this dataset, so it is nearly impossible to tell what set of weights w will min-
imize the loss. Thus, in practice, we often start by initializing our weights randomly, and then
iteratively take small steps in the direction which makes the loss decrease as rapidly as possible.
The question then becomes something that on the surface is no easier: how do we find the direction
which makes the loss decrease as quickly as possible? To dig into this, letʼs first examine
the case with only a single weight: L(w) = L(x) for a single real value x.
Letʼs take x and try to understand what happens when we change it by a small amount to x + ϵ. If
you wish to be concrete, think of a number like ϵ = 0.0000001. To help us visualize what happens,
letʼs graph an example function, $f(x) = \sin(x^x)$, over the interval [0, 3].
%matplotlib inline
import torch
from d2l import torch as d2l

torch.pi = torch.acos(torch.zeros(1)).item() * 2  # Define pi in torch
# Plot a function in a normal range

x_big = torch.arange(0.01, 3.01, 0.01)
ys = torch.sin(x_big**x_big)
d2l.plot(x_big, ys, 'x', 'f(x)')

At this large scale, the functionʼs behavior is not simple. However, if we reduce our range to some-
thing smaller like [1.75, 2.25], we see that the graph becomes much simpler.
# Plot the same function in a smaller range

x_med = torch.arange(1.75, 2.25, 0.001)
ys = torch.sin(x_med**x_med)
d2l.plot(x_med, ys, 'x', 'f(x)')
Taking this to an extreme, if we zoom into a tiny segment, the behavior becomes far simpler: it is
just a straight line.
# Plot the same function in a tiny range

x_small = torch.arange(2.0, 2.01, 0.0001)
ys = torch.sin(x_small**x_small)
d2l.plot(x_small, ys, 'x', 'f(x)')

This is the key observation of single variable calculus: the behavior of familiar functions can be
modeled by a line in a small enough range. This means that for most functions, it is reasonable to
expect that as we shift the x value of the function by a little bit, the output f (x) will also be shifted
by a little bit. The only question we need to answer is, “How large is the change in the output
compared to the change in the input? Is it half as large? Twice as large?”
Thus, we can consider the ratio of the change in the output of a function for a small change in the
input of the function. We can write this formally as
$$\frac{L(x+\epsilon) - L(x)}{(x+\epsilon) - x} = \frac{L(x+\epsilon) - L(x)}{\epsilon}. \tag{19.3.1}$$
This is already enough to start to play around with in code. For instance, suppose that we know
that $L(x) = x^2 + 1701(x-4)^3$, then we can see how large this value is at the point x = 4 as follows.
# Define our function

def L(x):
    return x**2 + 1701*(x-4)**3

# Print the difference divided by epsilon for several epsilon
for epsilon in [0.1, 0.001, 0.0001, 0.00001]:
    print(f'epsilon = {epsilon:.5f} -> {(L(4+epsilon) - L(4)) / epsilon:.5f}')
epsilon = 0.10000 -> 25.11000

epsilon = 0.00100 -> 8.00270
epsilon = 0.00010 -> 8.00012
epsilon = 0.00001 -> 8.00001
Now, if we are observant, we will notice that the output is suspiciously close to
8. Indeed, if we decrease ϵ, we will see the value become progressively closer to 8. Thus we may
conclude, correctly, that the value we seek (the degree to which a change in the input changes the
output) should be 8 at the point x = 4. The way that a mathematician encodes this fact is
$$\lim_{\epsilon \rightarrow 0}\frac{L(4+\epsilon) - L(4)}{\epsilon} = 8. \tag{19.3.2}$$
As a bit of a historical digression: in the first few decades of neural network research, scientists
used this algorithm (the method of finite differences) to evaluate how a loss function changed under
small perturbation: just change the weights and see how the loss changed. This is computationally
inefficient, requiring two evaluations of the loss function to see how a single change of one variable
influenced the loss. If we tried to do this with even a paltry few thousand parameters, it would
require several thousand evaluations of the network over the entire dataset! It was not until
1986 that the backpropagation algorithm introduced in (Rumelhart et al., 1988) provided a way to
calculate how any change of the weights together would change the loss in the same computation
time as a single prediction of the network over the dataset.
Back in our example, this value 8 is different for different values of x, so it makes sense to define it
as a function of x. More formally, this value dependent rate of change is referred to as the derivative
which is written as
$$\frac{df}{dx}(x) = \lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon) - f(x)}{\epsilon}. \tag{19.3.3}$$
Different texts will use different notations for the derivative. For instance, all of the below nota-
tions indicate the same thing:
$$\frac{df}{dx} = \frac{d}{dx}f = f' = \nabla_x f = D_x f = f_x. \tag{19.3.4}$$
Most authors will pick a single notation and stick with it, however even that is not guaranteed. It
is best to be familiar with all of these. We will use the notation $\frac{df}{dx}$ throughout this text, unless
we want to take the derivative of a complex expression, in which case we will use $\frac{d}{dx}f$ to write
expressions like
$$\frac{d}{dx}\left[x^4+\cos\left(\frac{x^2+1}{2x-1}\right)\right]. \tag{19.3.5}$$
Oftentimes, it is intuitively useful to unravel the definition of derivative (19.3.3) again to see how
a function changes when we make a small change of x:
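$$\begin{aligned}
\frac{df}{dx}(x) = \lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon) - f(x)}{\epsilon} & \implies \frac{df}{dx}(x) \approx \frac{f(x+\epsilon) - f(x)}{\epsilon} \\
& \implies \epsilon \frac{df}{dx}(x) \approx f(x+\epsilon) - f(x) \\
& \implies f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x).
\end{aligned} \tag{19.3.6}$$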
The last equation is worth explicitly calling out. It tells us that if you take any function and change
the input by a small amount, the output would change by that small amount scaled by the deriva-
tive.
In this way, we can understand the derivative as the scaling factor that tells us how large a change
we get in the output from a change in the input.
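As a quick numerical check of the last line of (19.3.6), the sketch below compares $f(x+\epsilon)$ with $f(x) + \epsilon \frac{df}{dx}(x)$ for the example function $f(x) = \sin(x^x)$ at x = 2. The closed-form derivative in `df` is worked out by hand from the chain rule together with $\frac{d}{dx}x^x = x^x(\log(x) + 1)$, and the names `f`, `df`, and `errors` are chosen only for this illustration.

```python
import math

def f(x):
    return math.sin(x ** x)

def df(x):
    # Chain rule with d(x^x)/dx = x^x * (log(x) + 1)
    return math.cos(x ** x) * (x ** x) * (math.log(x) + 1)

x0 = 2.0
errors = {}
for eps in [1e-1, 1e-2, 1e-3]:
    exact = f(x0 + eps)
    linear = f(x0) + eps * df(x0)
    errors[eps] = abs(exact - linear)
    print(f'eps = {eps:.3f} -> exact {exact:.6f}, linear {linear:.6f}, '
          f'error {errors[eps]:.2e}')
```

Shrinking ϵ by a factor of ten shrinks the error by roughly a factor of one hundred, which is what we would expect if the part left out of the approximation scales like $\epsilon^2$.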
19.3.2 Rules of Calculus
We now turn to the task of understanding how to compute the derivative of an explicit function. A
full formal treatment of calculus would derive everything from first principles. We will not indulge
in this temptation here, but rather provide an understanding of the common rules encountered.

Common Derivatives
As was seen in Section 2.4, when computing derivatives one can oftentimes use a series of rules
to reduce the computation to a few core functions. We repeat them here for ease of reference.
• Derivative of constants. $\frac{d}{dx}c = 0$.
• Derivative of linear functions. $\frac{d}{dx}(ax) = a$.
• Power rule. $\frac{d}{dx}x^n = nx^{n-1}$.
• Derivative of exponentials. $\frac{d}{dx}e^x = e^x$.
• Derivative of the logarithm. $\frac{d}{dx}\log(x) = \frac{1}{x}$.
Derivative Rules
If every derivative needed to be separately computed and stored in a table, differential calculus
would be near impossible. It is a gift of mathematics that we can generalize the above derivatives
and compute more complex derivatives like finding the derivative of $f(x) = \log\left(1+(x-1)^{10}\right)$. As
was mentioned in Section 2.4, the key to doing so is to codify what happens when we take functions
and combine them in various ways, most importantly: sums, products, and compositions.
• Sum rule. $\frac{d}{dx}\left(g(x) + h(x)\right) = \frac{dg}{dx}(x) + \frac{dh}{dx}(x)$.
• Product rule. $\frac{d}{dx}\left(g(x)\cdot h(x)\right) = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)$.
• Chain rule. $\frac{d}{dx}g(h(x)) = \frac{dg}{dh}(h(x))\cdot \frac{dh}{dx}(x)$.
Letʼs see how we may use (19.3.6) to understand these rules. For the sum rule, consider following
chain of reasoning:
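$$\begin{aligned}
f(x+\epsilon) & = g(x+\epsilon) + h(x+\epsilon) \\
& \approx g(x) + \epsilon \frac{dg}{dx}(x) + h(x) + \epsilon \frac{dh}{dx}(x) \\
& = g(x) + h(x) + \epsilon\left(\frac{dg}{dx}(x) + \frac{dh}{dx}(x)\right) \\
& = f(x) + \epsilon\left(\frac{dg}{dx}(x) + \frac{dh}{dx}(x)\right).
\end{aligned} \tag{19.3.7}$$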
By comparing this result with the fact that $f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x)$, we see that
$\frac{df}{dx}(x) = \frac{dg}{dx}(x) + \frac{dh}{dx}(x)$ as desired. The intuition here is: when we change the input x, g and h
jointly contribute to the change of the output by $\frac{dg}{dx}(x)$ and $\frac{dh}{dx}(x)$.
The product is more subtle, and will require a new observation about how to work with these
expressions. We will begin as before using (19.3.6):
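$$\begin{aligned}
f(x+\epsilon) & = g(x+\epsilon)\cdot h(x+\epsilon) \\
& \approx \left(g(x) + \epsilon \frac{dg}{dx}(x)\right)\cdot\left(h(x) + \epsilon \frac{dh}{dx}(x)\right) \\
& = g(x)\cdot h(x) + \epsilon\left(g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)\right) + \epsilon^2\frac{dg}{dx}(x)\frac{dh}{dx}(x) \\
& = f(x) + \epsilon\left(g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)\right) + \epsilon^2\frac{dg}{dx}(x)\frac{dh}{dx}(x).
\end{aligned} \tag{19.3.8}$$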

This resembles the computation done above, and indeed we see our answer
($\frac{df}{dx}(x) = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)$) sitting next to $\epsilon$, but there is the issue of that term of size $\epsilon^2$. We will refer to this as a
higher-order term, since the power of $\epsilon^2$ is higher than the power of $\epsilon^1$. We will see in a later section
that we will sometimes want to keep track of these, however for now observe that if $\epsilon = 0.0000001 = 10^{-7}$,
then $\epsilon^2 = 10^{-14}$, which is vastly smaller. As we send $\epsilon \rightarrow 0$, we may safely ignore the
higher order terms. As a general convention in this appendix, we will use “≈” to denote that the
two terms are equal up to higher order terms. However, if we wish to be more formal we may
examine the difference quotient
$$\frac{f(x+\epsilon) - f(x)}{\epsilon} = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x) + \epsilon \frac{dg}{dx}(x)\frac{dh}{dx}(x), \tag{19.3.9}$$
and see that as we send ϵ → 0, the right hand term goes to zero as well.
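The vanishing of the leftover term can also be observed numerically. The sketch below uses the illustrative choices g(x) = sin(x) and h(x) = e^x (assumptions made for this example only) and measures the gap between the difference quotient of the product and the value given by the product rule as ϵ shrinks.

```python
import math

def g(x):
    return math.sin(x)

def h(x):
    return math.exp(x)

def product_rule(x):
    # g(x) * dh/dx(x) + dg/dx(x) * h(x), using d(sin)/dx = cos and d(exp)/dx = exp
    return g(x) * math.exp(x) + math.cos(x) * h(x)

x0 = 1.0
gaps = []
for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    quotient = (g(x0 + eps) * h(x0 + eps) - g(x0) * h(x0)) / eps
    gaps.append(abs(quotient - product_rule(x0)))
    print(f'eps = {eps:.4f} -> quotient {quotient:.6f}, gap {gaps[-1]:.2e}')
```

The gap shrinks roughly in proportion to ϵ, consistent with a leftover term that carries a single factor of ϵ.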
Finally, with the chain rule, we can again progress as before using (19.3.6) and see that
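$$\begin{aligned}
f(x+\epsilon) & = g(h(x+\epsilon)) \\
& \approx g\left(h(x) + \epsilon \frac{dh}{dx}(x)\right) \\
& \approx g(h(x)) + \epsilon \frac{dh}{dx}(x) \frac{dg}{dh}(h(x)) \\
& = f(x) + \epsilon \frac{dg}{dh}(h(x))\frac{dh}{dx}(x),
\end{aligned} \tag{19.3.10}$$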
where in the second line we view the function g as having its input (h(x)) shifted by the tiny
quantity $\epsilon \frac{dh}{dx}(x)$.
These rules provide us with a flexible set of tools to compute essentially any expression desired.
For instance,
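$$\begin{aligned}
\frac{d}{dx}\left[\log\left(1+(x-1)^{10}\right)\right] & = \left(1+(x-1)^{10}\right)^{-1}\frac{d}{dx}\left[1+(x-1)^{10}\right] \\
& = \left(1+(x-1)^{10}\right)^{-1}\left(\frac{d}{dx}[1] + \frac{d}{dx}\left[(x-1)^{10}\right]\right) \\
& = \left(1+(x-1)^{10}\right)^{-1}\left(0 + 10(x-1)^9\frac{d}{dx}[x-1]\right) \\
& = 10\left(1+(x-1)^{10}\right)^{-1}(x-1)^9 \\
& = \frac{10(x-1)^9}{1+(x-1)^{10}}.
\end{aligned} \tag{19.3.11}$$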
Where each line has used the following rules:
1. The chain rule and derivative of logarithm.
2. The sum rule.
3. The derivative of constants, chain rule, and power rule.
4. The sum rule, derivative of linear functions, derivative of constants.
Two things should be clear after doing this example:
1. Any function we can write down using sums, products, constants, powers, exponentials, and
logarithms can have its derivative computed mechanically by following these rules.
2. Having a human follow these rules can be tedious and error prone!
Thankfully, these two facts together hint towards a way forward: this is a perfect candidate for
mechanization! Indeed backpropagation, which we will revisit later in this section, is exactly that.
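Since hand derivations are error prone, it can be reassuring to sanity check the result of (19.3.11) numerically. The sketch below is only a spot check using a central difference (a symmetric variant of the finite-difference quotient from earlier in this section), not automatic differentiation; the helper names `df_by_rules` and `df_numeric` are made up for this example.

```python
import math

def f(x):
    return math.log(1 + (x - 1) ** 10)

def df_by_rules(x):
    # The closed form obtained in (19.3.11)
    return 10 * (x - 1) ** 9 / (1 + (x - 1) ** 10)

def df_numeric(x, eps=1e-6):
    # Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in [0.5, 1.5, 2.0, 3.0]:
    print(f'x = {x}: rules {df_by_rules(x):.6f}, numeric {df_numeric(x):.6f}')
```

Agreement at a handful of points does not prove the formula, but a disagreement would immediately flag a slip in one of the steps.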

Linear Approximation
When working with derivatives, it is often useful to geometrically interpret the approximation
used above. In particular, note that the equation
$$f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x), \tag{19.3.12}$$
approximates the value of f by a line which passes through the point (x, f(x)) and has slope $\frac{df}{dx}(x)$.
In this way we say that the derivative gives a linear approximation to the function f , as illustrated
below:
# Compute sin
xs = torch.arange(-torch.pi, torch.pi, 0.01)
plots = [torch.sin(xs)]
# Compute some linear approximations. Use d(sin(x))/dx = cos(x)

for x0 in [-1.5, 0.0, 2.0]:
    plots.append(torch.sin(torch.tensor(x0)) + (xs - x0) *
                 torch.cos(torch.tensor(x0)))

d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])
Higher Order Derivatives
Letʼs now do something that may on the surface seem strange. Take a function f and compute the
derivative $\frac{df}{dx}$. This gives us the rate of change of f at any point.
However, the derivative, $\frac{df}{dx}$, can be viewed as a function itself, so nothing stops us from computing
the derivative of $\frac{df}{dx}$ to get $\frac{d^2f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right)$. We will call this the second derivative of f. This
function is the rate of change of the rate of change of f, or in other words, how the rate of change
is changing. We may apply the derivative any number of times to obtain what is called the n-th
derivative. To keep the notation clean, we will denote the n-th derivative as
$$f^{(n)}(x) = \frac{d^{n}f}{dx^{n}} = \left(\frac{d}{dx}\right)^{n} f. \tag{19.3.13}$$
Letʼs try to understand why this is a useful notion. Below, we visualize $f^{(2)}(x)$, $f^{(1)}(x)$, and $f(x)$.

First, consider the case that the second derivative $f^{(2)}(x)$ is a positive constant. This means that
the slope of the first derivative is positive. As a result, the first derivative $f^{(1)}(x)$ may start out
negative, becomes zero at a point, and then becomes positive in the end. This tells us the slope of
our original function f and therefore, the function f itself decreases, flattens out, then increases.
In other words, the function f curves up, and has a single minimum as is shown in Fig. 19.3.1.
Fig. 19.3.1: If we assume the second derivative is a positive constant, then the first derivative is
increasing, which implies the function itself has a minimum.
Second, if the second derivative is a negative constant, that means that the first derivative is de-
creasing. This implies the first derivative may start out positive, becomes zero at a point, and then
becomes negative. Hence, the function f itself increases, flattens out, then decreases. In other
words, the function f curves down, and has a single maximum as is shown in Fig. 19.3.2.
Fig. 19.3.2: If we assume the second derivative is a negative constant, then the first derivative is
decreasing, which implies the function itself has a maximum.
Third, if the second derivative is always zero, then the first derivative will never change: it is
constant! This means that f increases (or decreases) at a fixed rate, and f is itself a straight line
as is shown in Fig. 19.3.3.

Fig. 19.3.3: If we assume the second derivative is zero, then the first derivative is constant, which
implies the function itself is a straight line.
To summarize, the second derivative can be interpreted as describing the way that the function f
curves. A positive second derivative leads to an upward curve, while a negative second derivative
means that f curves downwards, and a zero second derivative means that f does not curve at all.
Letʼs take this one step further. Consider the function $g(x) = ax^{2} + bx + c$. We can then compute
that
$$\begin{aligned}
\frac{dg}{dx}(x) & = 2ax + b \\
\frac{d^2g}{dx^2}(x) & = 2a.
\end{aligned} \tag{19.3.14}$$
If we have some original function f (x) in mind, we may compute the first two derivatives and
find the values for a, b, and c that make them match this computation. Similarly to the previous
section where we saw that the first derivative gave the best approximation with a straight line, this
construction provides the best approximation by a quadratic. Letʼs visualize this for f (x) = sin(x).
# Compute sin
xs = torch.arange(-torch.pi, torch.pi, 0.01)
plots = [torch.sin(xs)]
# Compute some quadratic approximations. Use d(sin(x)) / dx = cos(x)

for x0 in [-1.5, 0.0, 2.0]:
    plots.append(torch.sin(torch.tensor(x0)) + (xs - x0) *
                 torch.cos(torch.tensor(x0)) - (xs - x0)**2 *
                 torch.sin(torch.tensor(x0)) / 2)

d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])
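To make the matching of a, b, and c explicit, the sketch below solves for the three coefficients from the value and first two derivatives of f at a point, using (19.3.14). The helper `quadratic_coefficients` is a hypothetical name introduced for this illustration, and the expansion point x0 = 2 is an arbitrary choice.

```python
import math

def quadratic_coefficients(f0, df0, d2f0, x0):
    """Return (a, b, c) so that a*x^2 + b*x + c matches f, f', f'' at x0."""
    a = d2f0 / 2                     # from g''(x) = 2a
    b = df0 - 2 * a * x0             # from g'(x0) = 2*a*x0 + b
    c = f0 - a * x0 ** 2 - b * x0    # from g(x0) = f(x0)
    return a, b, c

x0 = 2.0
# For f(x) = sin(x): f'(x) = cos(x) and f''(x) = -sin(x)
a, b, c = quadratic_coefficients(math.sin(x0), math.cos(x0), -math.sin(x0), x0)

def g(x):
    return a * x ** 2 + b * x + c

for x in [1.8, 2.0, 2.2]:
    print(f'x = {x}: sin(x) = {math.sin(x):.5f}, quadratic = {g(x):.5f}')
```

Expanding the resulting quadratic around x0 gives the same expression that is plotted above, just written in powers of x rather than powers of (x − x0).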

We will extend this idea to the idea of a Taylor series in the next section.
Taylor Series
The Taylor series provides a method to approximate the function f(x) if we are given values for the
first n derivatives at a point $x_0$, i.e., $\left\{ f(x_0), f^{(1)}(x_0), f^{(2)}(x_0), \ldots, f^{(n)}(x_0) \right\}$. The idea will be to find
a degree n polynomial that matches all the given derivatives at $x_0$.
We saw the case of n = 2 in the previous section and a little algebra shows this is
$$f(x) \approx \frac{1}{2}\frac{d^2f}{dx^2}(x_0)(x-x_0)^{2} + \frac{df}{dx}(x_0)(x-x_0) + f(x_0). \tag{19.3.15}$$
As we can see above, the denominator of 2 is there to cancel out the 2 we get when we take two
derivatives of $x^2$, while the other terms are all zero. The same logic applies for the first derivative and
the value itself.
If we push the logic further to n = 3, we will conclude that
$$f(x) \approx \frac{\frac{d^3f}{dx^3}(x_0)}{6}(x-x_0)^3 + \frac{\frac{d^2f}{dx^2}(x_0)}{2}(x-x_0)^{2} + \frac{df}{dx}(x_0)(x-x_0) + f(x_0), \tag{19.3.16}$$
where the 6 = 3 × 2 = 3! comes from the constant we get in front if we take three derivatives of
$x^3$.
Furthermore, we can get a degree n polynomial by
$$P_n(x) = \sum_{i = 0}^{n} \frac{f^{(i)}(x_0)}{i!}(x-x_0)^{i}, \tag{19.3.17}$$
where the notation
$$f^{(n)}(x) = \frac{d^{n}f}{dx^{n}} = \left(\frac{d}{dx}\right)^{n} f. \tag{19.3.18}$$
Indeed, $P_n(x)$ can be viewed as the best n-th degree polynomial approximation to our function
f(x).
While we are not going to dive all the way into the error of the above approximations, it is worth
mentioning the infinite limit. In this case, for well behaved functions (known as real analytic
functions) like cos(x) or $e^x$, we can write out the infinite number of terms and recover
exactly the same function
$$f(x) = \sum_{n = 0}^\infty \frac{f^{(n)}(x_0)}{n!}(x-x_0)^{n}. \tag{19.3.19}$$
Take $f(x) = e^x$ as an example. Since $e^x$ is its own derivative, we know that $f^{(n)}(x) = e^x$. Therefore,
$e^x$ can be reconstructed by taking the Taylor series at $x_0 = 0$, i.e.,
$$e^{x} = \sum_{n = 0}^\infty \frac{x^{n}}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \cdots. \tag{19.3.20}$$
Letʼs see how this works in code and observe how increasing the degree of the Taylor approximation
brings us closer to the desired function $e^x$.
# Compute the exponential function

xs = torch.arange(0, 3, 0.01)
ys = torch.exp(xs)
# Compute a few Taylor series approximations

P1 = 1 + xs
P2 = 1 + xs + xs**2 / 2
P5 = 1 + xs + xs**2 / 2 + xs**3 / 6 + xs**4 / 24 + xs**5 / 120
d2l.plot(xs, [ys, P1, P2, P5], 'x', 'f(x)', legend=[
    "Exponential", "Degree 1 Taylor Series", "Degree 2 Taylor Series",
    "Degree 5 Taylor Series"])
Taylor series have two primary applications:

1. Theoretical applications: Often when we try to understand a function that is too complex, using
a Taylor series enables us to turn it into a polynomial that we can work with directly.
2. Numerical applications: Some functions like $e^x$ or cos(x) are difficult for machines to compute.
They can store tables of values at a fixed precision (and this is often done), but it still
leaves open questions like “What is the 1000-th digit of cos(1)?” Taylor series are often helpful
to answer such questions.
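As a small illustration of the numerical use case, the sketch below sums the Taylor series of cos(x) at $x_0 = 0$ with Pythonʼs standard `decimal` module to obtain cos(1) to about 50 digits. The function name `cos_taylor`, the number of guard digits, and the stopping threshold are choices made for this example; production math libraries typically rely on more refined schemes.

```python
from decimal import Decimal, getcontext
import math

def cos_taylor(x, digits=50):
    """Sum the Taylor series of cos at x0 = 0 using about `digits` digits."""
    getcontext().prec = digits + 10   # a few guard digits for the summation
    x = Decimal(x)
    term = Decimal(1)                 # the n = 0 term, x^0 / 0!
    total = Decimal(0)
    n = 0
    threshold = Decimal(10) ** (-(digits + 5))
    while abs(term) > threshold:
        total += term
        # Step from the x^(2n) / (2n)! term to the next one, flipping the sign
        term *= -x * x / ((2 * n + 1) * (2 * n + 2))
        n += 1
    return total

cos_one = cos_taylor(1)
print(cos_one)
print(math.cos(1))  # double-precision value for comparison
```

Because the terms shrink factorially, only a few dozen of them are needed here; asking for more digits simply means raising the precision and summing a few more terms.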

19.3.3 Summary
• Derivatives can be used to express how functions change when we change the input by a
small amount.
• Elementary derivatives can be combined using derivative rules to create arbitrarily complex
derivatives.
• Derivatives can be iterated to get second or higher order derivatives. Each increase in order
provides more fine grained information on the behavior of the function.
• Using information in the derivatives of a single data example, we can approximate well be-
haved functions by polynomials obtained from the Taylor series.
Exercises
1. What is the derivative of $x^3 - 4x + 1$?
2. What is the derivative of $\log(\frac{1}{x})$?
3. True or False: If f ′ (x) = 0 then f has a maximum or minimum at x?
4. Where is the minimum of f (x) = x log(x) for x ≥ 0 (where we assume that f takes the
limiting value of 0 at f (0))?
Discussions
19.4 Multivariable Calculus
Now that we have a fairly strong understanding of derivatives of a function of a single variable, letʼs
return to our original question where we were considering a loss function of potentially billions
of weights.
19.4.1 Higher-Dimensional Differentiation
What Section 19.3 tells us is that if we change a single one of these billions of weights leaving
every other one fixed, we know what will happen! This is nothing more than a function of a single
variable, so we can write
$$L(w_1+\epsilon_1, w_2, \ldots, w_N) \approx L(w_1, w_2, \ldots, w_N) + \epsilon_1 \frac{d}{dw_1} L(w_1, w_2, \ldots, w_N). \tag{19.4.1}$$
We will call the derivative in one variable while fixing the other variables the partial derivative, and
we will use the notation $\frac{\partial}{\partial w_1}$ for the derivative in (19.4.1).

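Now, letʼs take this and change $w_2$ a little bit to $w_2 + \epsilon_2$:
$$\begin{aligned}
L(w_1+\epsilon_1, w_2+\epsilon_2, \ldots, w_N) & \approx L(w_1, w_2+\epsilon_2, \ldots, w_N) + \epsilon_1 \frac{\partial}{\partial w_1} L(w_1, w_2+\epsilon_2, \ldots, w_N) \\
& \approx L(w_1, w_2, \ldots, w_N) \\
& \quad + \epsilon_2\frac{\partial}{\partial w_2} L(w_1, w_2, \ldots, w_N) \\
& \quad + \epsilon_1 \frac{\partial}{\partial w_1} L(w_1, w_2, \ldots, w_N) \\
& \quad + \epsilon_1\epsilon_2\frac{\partial}{\partial w_2}\frac{\partial}{\partial w_1} L(w_1, w_2, \ldots, w_N) \\
& \approx L(w_1, w_2, \ldots, w_N) \\
& \quad + \epsilon_2\frac{\partial}{\partial w_2} L(w_1, w_2, \ldots, w_N) \\
& \quad + \epsilon_1 \frac{\partial}{\partial w_1} L(w_1, w_2, \ldots, w_N).
\end{aligned} \tag{19.4.2}$$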
We have again used the idea that $\epsilon_1\epsilon_2$ is a higher order term that we can discard in the same way
we could discard $\epsilon^{2}$ in the previous section, along with what we saw in (19.4.1). By continuing in
this manner, we may write that
$$L(w_1+\epsilon_1, w_2+\epsilon_2, \ldots, w_N+\epsilon_N) \approx L(w_1, w_2, \ldots, w_N) + \sum_i \epsilon_i \frac{\partial}{\partial w_i} L(w_1, w_2, \ldots, w_N). \tag{19.4.3}$$
This may look like a mess, but we can make this more familiar by noting that the sum on the right
looks exactly like a dot product, so if we let
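$$\boldsymbol{\epsilon} = [\epsilon_1, \ldots, \epsilon_N]^\top \; \text{and} \; \nabla_{\mathbf{w}} L = \left[\frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_N}\right]^\top, \tag{19.4.4}$$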
then
L(w + ϵ) ≈ L(w) + ϵ · ∇w L(w). (19.4.5)
We will call the vector ∇w L the gradient of L.

Equation (19.4.5) is worth pondering for a moment. It has exactly the format that we encountered
in one dimension, just we have converted everything to vectors and dot products. It allows us to
tell approximately how the function L will change given any perturbation to the input. As we will
see in the next section, this will provide us with an important tool in understanding geometrically
how we can learn using information contained in the gradient.
But first, letʼs see this approximation at work with an example. Suppose that we are working with
the function
$$f(x, y) = \log(e^x + e^y) \text{ with gradient } \nabla f (x, y) = \left[\frac{e^x}{e^x+e^y}, \frac{e^y}{e^x+e^y}\right]. \tag{19.4.6}$$
If we look at a point like (0, log(2)), we see that

$$f(x, y) = \log(3) \text{ with gradient } \nabla f (x, y) = \left[\frac{1}{3}, \frac{2}{3}\right]. \tag{19.4.7}$$

Thus, if we want to approximate f at (ϵ1 , log(2) + ϵ2 ), we see that we should have the specific
instance of (19.4.5):
$$f(\epsilon_1, \log(2) + \epsilon_2) \approx \log(3) + \frac{1}{3}\epsilon_1 + \frac{2}{3}\epsilon_2. \tag{19.4.8}$$
We can test this in code to see how good the approximation is.
%matplotlib inline
import numpy as np
import torch
from d2l import torch as d2l


def f(x, y):
    return torch.log(torch.exp(x) + torch.exp(y))

def grad_f(x, y):
    return torch.tensor([torch.exp(x) / (torch.exp(x) + torch.exp(y)),
                         torch.exp(y) / (torch.exp(x) + torch.exp(y))])

epsilon = torch.tensor([0.01, -0.03])

grad_approx = f(torch.tensor([0.]), torch.log(
torch.tensor([2.]))) + epsilon.dot(
grad_f(torch.tensor([0.]), torch.log(torch.tensor(2.))))
true_value = f(torch.tensor([0.]) + epsilon[0], torch.log(
torch.tensor([2.])) + epsilon[1])
f'approximation: {grad_approx}, true Value: {true_value}'
'approximation: tensor([1.0819]), true Value: tensor([1.0821])'
19.4.2 Geometry of Gradients and Gradient Descent
Consider the expression from (19.4.5) again:
L(w + ϵ) ≈ L(w) + ϵ · ∇w L(w). (19.4.9)
Letʼs suppose that we want to use this to help minimize our loss L. Letʼs understand geometrically
the algorithm of gradient descent first described in Section 2.5. What we will do is the following:
1. Start with a random choice for the initial parameters w.
2. Find the direction v that makes L decrease the most rapidly at w.
3. Take a small step in that direction: w → w + ϵv.
4. Repeat.
The only thing we do not know exactly how to do is to compute the vector v in the second step.
We will call such a direction the direction of steepest descent. Using the geometric understanding of
dot products from Section 19.1, we see that we can rewrite (19.4.5) as
L(w + v) ≈ L(w) + v · ∇w L(w) = L(w) + ∥∇w L(w)∥ cos(θ). (19.4.10)

Note that we have taken our direction to have length one for convenience, and used θ for the angle
between v and ∇w L(w). If we want to find the direction that decreases L as rapidly as possible, we
want to make this expression as negative as possible. The only way the direction we pick enters
into this equation is through cos(θ), and thus we wish to make this cosine as negative as possible.
Now, recalling the shape of cosine, we can make this as negative as possible by making cos(θ) = −1
or equivalently making the angle between the gradient and our chosen direction to be π radians,
or equivalently 180 degrees. The only way to achieve this is to head in the exact opposite direction:
pick v to point in the exact opposite direction to ∇w L(w)!
This brings us to one of the most important mathematical concepts in machine learning: the
direction of steepest descent points in the direction of −∇w L(w). Thus our informal algorithm
can be rewritten as follows.
1. Start with a random choice for the initial parameters w.
2. Compute ∇w L(w).
3. Take a small step in the opposite of that direction: w → w − ϵ∇w L(w).
4. Repeat.
This basic algorithm has been modified and adapted many ways by many researchers, but the core
concept remains the same in all of them. Use the gradient to find the direction that decreases the
loss as rapidly as possible, and update the parameters to take a step in that direction.
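The four steps above can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the bowl-shaped loss with its minimum at (1, −3), its hand-written gradient, the fixed starting point (used in place of a random one so that the run is reproducible), and the step size of 0.1.

```python
def loss(w):
    # Hypothetical bowl-shaped loss with its minimum at (1, -3)
    return (w[0] - 1) ** 2 + 2 * (w[1] + 3) ** 2

def grad_loss(w):
    # Partial derivatives of the loss above with respect to w[0] and w[1]
    return [2 * (w[0] - 1), 4 * (w[1] + 3)]

w = [4.0, 2.0]   # fixed starting point so the run is reproducible
eps = 0.1        # step size
for step in range(200):
    g = grad_loss(w)
    # w -> w - eps * gradient
    w = [w_i - eps * g_i for w_i, g_i in zip(w, g)]

print(f'final parameters: {w}, final loss: {loss(w):.2e}')
```

For this simple loss the iterates settle at the minimizer; for the loss surfaces met in deep learning, the same update is applied with gradients obtained automatically rather than by hand.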
19.4.3 A Note on Mathematical Optimization
Throughout this book, we focus squarely on numerical optimization techniques for the practical
reason that all functions we encounter in the deep learning setting are too complex to minimize
explicitly.
However, it is a useful exercise to consider what the geometric understanding we obtained above
tells us about optimizing functions directly.
Suppose that we wish to find the value of x0 which minimizes some function L(x). Letʼs suppose
that moreover someone gives us a value and tells us that it is the value that minimizes L. Is there
anything we can check to see if their answer is even plausible?
Again consider (19.4.5):
L(x0 + ϵ) ≈ L(x0 ) + ϵ · ∇x L(x0 ). (19.4.11)
If the gradient is not zero, we know that we can take a step in the direction −ϵ∇x L(x0 ) to find a
value of L that is smaller. Thus, if we truly are at a minimum, this cannot be the case! We can
conclude that if x0 is a minimum, then ∇x L(x0 ) = 0. We call points with ∇x L(x0 ) = 0 critical
points.
This is nice, because in some rare settings, we can explicitly find all the points where the gradient
is zero, and find the one with the smallest value.
For a concrete example, consider the function
$$f(x) = 3x^4 - 4x^3 - 12x^2. \tag{19.4.12}$$
This function has derivative

$$\frac{df}{dx} = 12x^3 - 12x^2 - 24x = 12x(x-2)(x+1). \tag{19.4.13}$$

The only possible locations of minima are x = −1, 0, 2, where the function takes the values
−5, 0, −32 respectively, and thus we can conclude that we minimize our function when x = 2. A
quick plot confirms this.
x = torch.arange(-2, 3, 0.01)  # assumed plotting range covering all three critical points
f = (3 * x**4) - (4 * x**3) - (12 * x**2)

d2l.plot(x, f, 'x', 'f(x)')
This highlights an important fact to know when working either theoretically or numerically: the
only possible points where we can minimize (or maximize) a function will have gradient equal to
zero, however, not every point with gradient zero is the true global minimum (or maximum).
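The recipe of listing the critical points and comparing their values can be written out directly for the example in (19.4.12). The sketch below is a plain Python illustration with names (`critical_points`, `values`, `x_min`) chosen for this example; it relies on the fact that the leading term $3x^4$ grows without bound in both directions, so a global minimum exists and must sit at one of the critical points.

```python
def f(x):
    return 3 * x ** 4 - 4 * x ** 3 - 12 * x ** 2

def df(x):
    return 12 * x ** 3 - 12 * x ** 2 - 24 * x

critical_points = [-1, 0, 2]             # roots of 12x(x - 2)(x + 1)
assert all(df(x) == 0 for x in critical_points)

# Evaluate f at each critical point and keep the smallest value
values = {x: f(x) for x in critical_points}
x_min = min(values, key=values.get)
print(values)
print(f'global minimum at x = {x_min} with f(x) = {values[x_min]}')
```

Note that x = −1 and x = 0 also have zero derivative without being the global minimum, which is exactly the caveat stated above.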
19.4.4 Multivariate Chain Rule
Letʼs suppose that we have a function of four variables (w, x, y, and z) which we can make by com-
posing many terms:
$$\begin{aligned}
f(u, v) & = (u+v)^{2} \\
u(a, b) & = (a+b)^{2}, \qquad v(a, b) = (a-b)^{2}, \\
a(w, x, y, z) & = (w+x+y+z)^{2}, \qquad b(w, x, y, z) = (w+x-y-z)^2.
\end{aligned} \tag{19.4.14}$$
Such chains of equations are common when working with neural networks, so trying to under-
stand how to compute gradients of such functions is key. We can start to see visual hints of this
connection in Fig. 19.4.1 if we take a look at what variables directly relate to one another.

Fig. 19.4.1: The function relations above where nodes represent values and edges show functional
dependence.
Nothing stops us from just composing everything from (19.4.14) and writing out that
$$f(w, x, y, z) = \left(\left((w+x+y+z)^2+(w+x-y-z)^2\right)^2+\left((w+x+y+z)^2-(w+x-y-z)^2\right)^2\right)^2. \tag{19.4.15}$$
We may then take the derivative by just using single variable derivatives, but if we did that we
would quickly find ourselves swamped with terms, many of which are repeats! Indeed, one can see
that, for instance:
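$$\begin{aligned}
\frac{\partial f}{\partial w} & = 2 \left(2 \left(2 (w + x + y + z) - 2 (w + x - y - z)\right) \left((w + x + y + z)^{2} - (w + x - y - z)^{2}\right) + \right. \\
& \left. \quad 2 \left(2 (w + x - y - z) + 2 (w + x + y + z)\right) \left((w + x - y - z)^{2} + (w + x + y + z)^{2}\right)\right) \times \\
& \quad \left(\left((w + x + y + z)^{2} - (w + x - y - z)^2\right)^{2} + \left((w + x - y - z)^{2} + (w + x + y + z)^{2}\right)^{2}\right).
\end{aligned} \tag{19.4.16}$$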
If we then also wanted to compute $\frac{\partial f}{\partial x}$, we would end up with a similar
equation again with many repeated terms, and many shared repeated terms between the two
derivatives. This represents a massive quantity of wasted work, and if we needed to compute
derivatives this way, the whole deep learning revolution would have stalled out before it began!
Letʼs break up the problem. We will start by trying to understand how f changes when we change
a, essentially assuming that w, x, y, and z all do not exist. We will reason as we did back when we
worked with the gradient for the first time. Letʼs take a and add a small amount ϵ to it.
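$$\begin{aligned}
& f(u(a+\epsilon, b), v(a+\epsilon, b)) \\
\approx & f\left(u(a, b) + \epsilon\frac{\partial u}{\partial a}(a, b), v(a, b) + \epsilon\frac{\partial v}{\partial a}(a, b)\right) \\
\approx & f(u(a, b), v(a, b)) + \epsilon\left[\frac{\partial f}{\partial u}(u(a, b), v(a, b))\frac{\partial u}{\partial a}(a, b) + \frac{\partial f}{\partial v}(u(a, b), v(a, b))\frac{\partial v}{\partial a}(a, b)\right].
\end{aligned} \tag{19.4.17}$$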
The first line follows from the definition of partial derivative, and the second follows from the
definition of gradient. It is notationally burdensome to track exactly where we evaluate every
derivative, as in the expression $\frac{\partial f}{\partial u}(u(a, b), v(a, b))$, so we often
abbreviate this to the much more memorable
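$$\frac{\partial f}{\partial a} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial a} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial a}. \tag{19.4.18}$$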
It is useful to think about the meaning of the process. We are trying to understand how a function
of the form $f(u(a, b), v(a, b))$ changes its value with a change in $a$. There are two pathways by
which this can occur: the pathway $a \rightarrow u \rightarrow f$ and the pathway
$a \rightarrow v \rightarrow f$. We can compute both of these contributions via the chain rule,
$\frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial a}$ and
$\frac{\partial f}{\partial v} \cdot \frac{\partial v}{\partial a}$ respectively, and add them up.
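As a sanity check on (19.4.18), the sum over the two pathways can be compared against a finite-difference estimate of the composed function. This is a minimal sketch in plain Python; the evaluation point and step size are arbitrary choices made for illustration.

```python
def u(a, b):
    return (a + b)**2

def v(a, b):
    return (a - b)**2

def f(u_val, v_val):
    return (u_val + v_val)**2

a, b, eps = 0.5, -1.5, 1e-6

# Sum of the two pathways a -> u -> f and a -> v -> f
df_du = df_dv = 2 * (u(a, b) + v(a, b))
du_da, dv_da = 2 * (a + b), 2 * (a - b)
chain = df_du * du_da + df_dv * dv_da

# Central finite difference of the composed function with respect to a
numeric = (f(u(a + eps, b), v(a + eps, b))
           - f(u(a - eps, b), v(a - eps, b))) / (2 * eps)

print(f'chain rule: {chain}, finite difference: {numeric:.6f}')
```

The two numbers should agree up to the error of the finite-difference approximation.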
Imagine we have a different network of functions where the functions on the right depend on
those they are connected to on the left, as shown in Fig. 19.4.2.
Fig. 19.4.2: Another more subtle example of the chain rule.
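To compute something like $\frac{\partial f}{\partial y}$, we need to sum over all (in this case 3)
paths from $y$ to $f$ giving
$$\frac{\partial f}{\partial y} = \frac{\partial f}{\partial a}\frac{\partial a}{\partial u}\frac{\partial u}{\partial y} + \frac{\partial f}{\partial u}\frac{\partial u}{\partial y} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial v}\frac{\partial v}{\partial y}. \tag{19.4.19}$$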
Understanding the chain rule in this way will pay great dividends when trying to understand how
gradients flow through networks, and why various architectural choices like those in LSTMs (Sec-
tion 10.2) or residual layers (Section 8.6) can help shape the learning process by controlling gra-
dient flow.
19.4.5 The Backpropagation Algorithm
Letʼs return to the example of (19.4.14) from the previous section, where
$$\begin{aligned}
f(u, v) & = (u+v)^{2} \\
u(a, b) & = (a+b)^{2}, \qquad v(a, b) = (a-b)^{2}, \\
a(w, x, y, z) & = (w+x+y+z)^{2}, \qquad b(w, x, y, z) = (w+x-y-z)^2.
\end{aligned} \tag{19.4.20}$$
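If we want to compute, say, $\frac{\partial f}{\partial w}$, we may apply the multivariate chain
rule to see:
$$\begin{aligned}
\frac{\partial f}{\partial w} & = \frac{\partial f}{\partial u}\frac{\partial u}{\partial w} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial w}, \\
\frac{\partial u}{\partial w} & = \frac{\partial u}{\partial a}\frac{\partial a}{\partial w} + \frac{\partial u}{\partial b}\frac{\partial b}{\partial w}, \\
\frac{\partial v}{\partial w} & = \frac{\partial v}{\partial a}\frac{\partial a}{\partial w} + \frac{\partial v}{\partial b}\frac{\partial b}{\partial w}.
\end{aligned} \tag{19.4.21}$$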
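Letʼs try using this decomposition to compute $\frac{\partial f}{\partial w}$. Notice that all we
need here are the various single step partials:
$$\begin{aligned}
\frac{\partial f}{\partial u} & = 2(u+v), & \frac{\partial f}{\partial v} & = 2(u+v), \\
\frac{\partial u}{\partial a} & = 2(a+b), & \frac{\partial u}{\partial b} & = 2(a+b), \\
\frac{\partial v}{\partial a} & = 2(a-b), & \frac{\partial v}{\partial b} & = -2(a-b), \\
\frac{\partial a}{\partial w} & = 2(w+x+y+z), & \frac{\partial b}{\partial w} & = 2(w+x-y-z).
\end{aligned} \tag{19.4.22}$$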
If we write this out into code this becomes a fairly manageable expression.

# Compute the value of the function from inputs to outputs
w, x, y, z = -1, 0, -2, 1
a, b = (w + x + y + z)**2, (w + x - y - z)**2
u, v = (a + b)**2, (a - b)**2
f = (u + v)**2
print(f'f at {w}, {x}, {y}, {z} is {f}')

# Compute the single step partials
df_du, df_dv = 2*(u + v), 2*(u + v)
du_da, du_db, dv_da, dv_db = 2*(a + b), 2*(a + b), 2*(a - b), -2*(a - b)
da_dw, db_dw = 2*(w + x + y + z), 2*(w + x - y - z)

# Compute the final result from inputs to outputs
du_dw, dv_dw = du_da*da_dw + du_db*db_dw, dv_da*da_dw + dv_db*db_dw
df_dw = df_du*du_dw + df_dv*dv_dw
print(f'df/dw at {w}, {x}, {y}, {z} is {df_dw}')

f at -1, 0, -2, 1 is 1024
df/dw at -1, 0, -2, 1 is -4096
However, note that this still does not make it easy to compute something like
$\frac{\partial f}{\partial x}$. The reason for that is the way we chose to apply the chain rule.
If we look at what we did above, we always kept $\partial w$ in the denominator when we could. In
this way, we chose to apply the chain rule seeing how $w$ changed every other variable. If that is
what we wanted, this would be a good idea. However, think back to our motivation from deep
learning: we want to see how every parameter changes the loss. In essence, we want to apply the
chain rule keeping $\partial f$ in the numerator whenever we can!
To be more explicit, note that we can write
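$$\begin{aligned}
\frac{\partial f}{\partial w} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial w} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial w}, \\
\frac{\partial f}{\partial a} & = \frac{\partial f}{\partial u}\frac{\partial u}{\partial a} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial a}, \\
\frac{\partial f}{\partial b} & = \frac{\partial f}{\partial u}\frac{\partial u}{\partial b} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial b}.
\end{aligned} \tag{19.4.23}$$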
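Note that this application of the chain rule has us explicitly compute
$\frac{\partial f}{\partial u}$, $\frac{\partial f}{\partial v}$, $\frac{\partial f}{\partial a}$,
$\frac{\partial f}{\partial b}$, and $\frac{\partial f}{\partial w}$. Nothing stops us from also
including the equations:
$$\begin{aligned}
\frac{\partial f}{\partial x} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial x} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial x}, \\
\frac{\partial f}{\partial y} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial y} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial y}, \\
\frac{\partial f}{\partial z} & = \frac{\partial f}{\partial a}\frac{\partial a}{\partial z} + \frac{\partial f}{\partial b}\frac{\partial b}{\partial z},
\end{aligned} \tag{19.4.24}$$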
and then keeping track of how f changes when we change any node in the entire network. Letʼs
implement it.
# Compute the value of the function from inputs to outputs
w, x, y, z = -1, 0, -2, 1
a, b = (w + x + y + z)**2, (w + x - y - z)**2
u, v = (a + b)**2, (a - b)**2
f = (u + v)**2
print(f'f at {w}, {x}, {y}, {z} is {f}')

# Compute the derivative using the decomposition above
# First compute the single step partials
df_du, df_dv = 2*(u + v), 2*(u + v)
du_da, du_db, dv_da, dv_db = 2*(a + b), 2*(a + b), 2*(a - b), -2*(a - b)
da_dw, db_dw = 2*(w + x + y + z), 2*(w + x - y - z)
da_dx, db_dx = 2*(w + x + y + z), 2*(w + x - y - z)
da_dy, db_dy = 2*(w + x + y + z), -2*(w + x - y - z)
da_dz, db_dz = 2*(w + x + y + z), -2*(w + x - y - z)

# Now compute how f changes when we change any value from output to input
df_da, df_db = df_du*du_da + df_dv*dv_da, df_du*du_db + df_dv*dv_db
df_dw, df_dx = df_da*da_dw + df_db*db_dw, df_da*da_dx + df_db*db_dx
df_dy, df_dz = df_da*da_dy + df_db*db_dy, df_da*da_dz + df_db*db_dz
print(f'df/dw at {w}, {x}, {y}, {z} is {df_dw}')
print(f'df/dx at {w}, {x}, {y}, {z} is {df_dx}')
print(f'df/dy at {w}, {x}, {y}, {z} is {df_dy}')
print(f'df/dz at {w}, {x}, {y}, {z} is {df_dz}')

f at -1, 0, -2, 1 is 1024
df/dw at -1, 0, -2, 1 is -4096
df/dx at -1, 0, -2, 1 is -4096
df/dy at -1, 0, -2, 1 is -4096
df/dz at -1, 0, -2, 1 is -4096
The fact that we compute derivatives from f back towards the inputs rather than from the inputs
forward to the outputs (as we did in the first code snippet above) is what gives this algorithm its
name: backpropagation. Note that there are two steps:
1. Compute the value of the function, and the single step partials from front to back. While not
done above, this can be combined into a single forward pass.
2. Compute the gradient of f from back to front. We call this the backward pass.
This is precisely what every deep learning algorithm implements to allow the computation of the
gradient of the loss with respect to every weight in the network in one pass. It is an astonishing
fact that we have such a decomposition.
To see how to encapsulate this, letʼs take a quick look at this example.
# Initialize as tensors that track gradients
w = torch.tensor([-1.], requires_grad=True)
x = torch.tensor([0.], requires_grad=True)
y = torch.tensor([-2.], requires_grad=True)
z = torch.tensor([1.], requires_grad=True)

# Do the computation like usual, tracking gradients
a, b = (w + x + y + z)**2, (w + x - y - z)**2
u, v = (a + b)**2, (a - b)**2
f = (u + v)**2

# Execute backward pass
f.backward()
print(f'df/dw at {w.data.item()}, {x.data.item()}, {y.data.item()}, '
      f'{z.data.item()} is {w.grad.data.item()}')
print(f'df/dx at {w.data.item()}, {x.data.item()}, {y.data.item()}, '
      f'{z.data.item()} is {x.grad.data.item()}')
print(f'df/dy at {w.data.item()}, {x.data.item()}, {y.data.item()}, '
      f'{z.data.item()} is {y.grad.data.item()}')
print(f'df/dz at {w.data.item()}, {x.data.item()}, {y.data.item()}, '
      f'{z.data.item()} is {z.grad.data.item()}')

df/dw at -1.0, 0.0, -2.0, 1.0 is -4096.0
df/dx at -1.0, 0.0, -2.0, 1.0 is -4096.0
df/dy at -1.0, 0.0, -2.0, 1.0 is -4096.0
df/dz at -1.0, 0.0, -2.0, 1.0 is -4096.0
All of what we did above can be done automatically by calling f.backward().
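To get a feel for what such an automatic backward pass might look like internally, here is a minimal sketch of reverse-mode differentiation in plain Python. The class `Var` and its methods are hypothetical names invented for this illustration; this is not how any particular framework is implemented, only the forward pass and backward pass from above organized around a graph of nodes.

```python
class Var:
    """A scalar node that records its parents and the local partials."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent node, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def square(self):
        return Var(self.value**2, ((self, 2.0 * self.value),))

    def backward(self):
        # Post-order traversal: every node is listed after the nodes it depends on
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, summing over all paths
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local


# Forward pass: values and single step partials are recorded together
w, x, y, z = Var(-1.0), Var(0.0), Var(-2.0), Var(1.0)
a, b = (w + x + y + z).square(), (w + x - y - z).square()
u, v = (a + b).square(), (a - b).square()
f = (u + v).square()

# Backward pass: gradients of f with respect to every node
f.backward()
print(f.value, w.grad, x.grad, y.grad, z.grad)
```

Each node accumulates contributions from every node that consumes it, which is the sum over paths in the multivariate chain rule. For the inputs used above this reproduces the value 1024 and the four partial derivatives of -4096.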
19.4.6 Hessians
As with single variable calculus, it is useful to consider higher-order derivatives in order to get a
handle on how we can obtain a better approximation to a function than using the gradient alone.
There is one immediate problem one encounters when working with higher order derivatives of
functions of several variables, and that is there are a large number of them. If we have a function
$f(x_1, \ldots, x_n)$ of $n$ variables, then we can take $n^2$ many second derivatives, namely for
any choice of $i$ and $j$:
$$\frac{d^2f}{dx_idx_j} = \frac{d}{dx_i}\left(\frac{d}{dx_j}f\right). \tag{19.4.25}$$
This is traditionally assembled into a matrix called the Hessian:
$$\mathbf{H}_f = \begin{bmatrix} \frac{d^2f}{dx_1dx_1} & \cdots & \frac{d^2f}{dx_1dx_n} \\ \vdots & \ddots & \vdots \\ \frac{d^2f}{dx_ndx_1} & \cdots & \frac{d^2f}{dx_ndx_n} \end{bmatrix}. \tag{19.4.26}$$
Not every entry of this matrix is independent. Indeed, we can show that as long as both mixed
partials (partial derivatives with respect to more than one variable) exist and are continuous, we
can say that for any $i$ and $j$,
$$\frac{d^2f}{dx_idx_j} = \frac{d^2f}{dx_jdx_i}. \tag{19.4.27}$$
This follows by considering first perturbing a function in the direction of xi , and then perturbing
it in xj and then comparing the result of that with what happens if we perturb first xj and then xi ,
with the knowledge that both of these orders lead to the same final change in the output of f .
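The perturbation argument can be mimicked numerically. The sketch below, in plain Python, nests central finite differences in the two possible orders and compares the results with the mixed partial computed by hand. The test function is a hypothetical choice made only for illustration, as are the evaluation point and step size.

```python
import math

def f(x, y):
    # Hypothetical smooth test function, chosen only for illustration
    return math.sin(x * y) + x**3 * y**2

def d_dx(g, h=1e-4):
    """Central-difference approximation to the partial derivative of g in x."""
    return lambda x, y: (g(x + h, y) - g(x - h, y)) / (2 * h)

def d_dy(g, h=1e-4):
    """Central-difference approximation to the partial derivative of g in y."""
    return lambda x, y: (g(x, y + h) - g(x, y - h)) / (2 * h)

x0, y0 = 0.7, -0.3
xy_order = d_dy(d_dx(f))(x0, y0)  # perturb in x first, then in y
yx_order = d_dx(d_dy(f))(x0, y0)  # perturb in y first, then in x

# Mixed partial computed by hand for this particular f
exact = math.cos(x0 * y0) - x0 * y0 * math.sin(x0 * y0) + 6 * x0**2 * y0

print(f'x then y: {xy_order:.6f}, y then x: {yx_order:.6f}, exact: {exact:.6f}')
```

Both orders combine the same four perturbed function values, which is the discrete counterpart of the statement that both orders of perturbation lead to the same final change.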
As with single variables, we can use these derivatives to get a far better idea of how the function
behaves near a point. In particular, we can use it to find the best fitting quadratic near a point x0 ,
as we saw in a single variable.

Letʼs see an example. Suppose that $f(x_1, x_2) = a + b_1x_1 + b_2x_2 + c_{11}x_1^{2} + c_{12}x_1x_2 + c_{22}x_2^{2}$.
This is the general form for a quadratic in two variables. If we look at the value of the function,
its gradient, and its Hessian (19.4.26), all at the point zero:
$$\begin{aligned}
f(0,0) & = a, \\
\nabla f (0,0) & = \begin{bmatrix}b_1 \\ b_2\end{bmatrix}, \\
\mathbf{H} f (0,0) & = \begin{bmatrix}2 c_{11} & c_{12} \\ c_{12} & 2c_{22}\end{bmatrix},
\end{aligned} \tag{19.4.28}$$
we can get our original polynomial back by saying
$$f(\mathbf{x}) = f(0) + \nabla f (0) \cdot \mathbf{x} + \frac{1}{2}\mathbf{x}^\top \mathbf{H} f (0) \mathbf{x}. \tag{19.4.29}$$
In general, if we computed this expansion at any point $\mathbf{x}_0$, we see that
$$f(\mathbf{x}) = f(\mathbf{x}_0) + \nabla f (\mathbf{x}_0) \cdot (\mathbf{x}-\mathbf{x}_0) + \frac{1}{2}(\mathbf{x}-\mathbf{x}_0)^\top \mathbf{H} f (\mathbf{x}_0) (\mathbf{x}-\mathbf{x}_0). \tag{19.4.30}$$
This works for any dimensional input, and provides the best approximating quadratic to any
function at a point. To give an example, letʼs plot the function
$$f(x, y) = xe^{-x^2-y^2}. \tag{19.4.31}$$
One can compute that the gradient and Hessian are
$$\nabla f(x, y) = e^{-x^2-y^2}\begin{pmatrix}1-2x^2 \\ -2xy\end{pmatrix} \; \textrm{and} \; \mathbf{H}f(x, y) = e^{-x^2-y^2}\begin{pmatrix} 4x^3 - 6x & 4x^2y - 2y \\ 4x^2y-2y & 4xy^2-2x\end{pmatrix}. \tag{19.4.32}$$
And thus, with a little algebra, see that the approximating quadratic at $[-1,0]^\top$ is
$$f(x, y) \approx e^{-1}\left(-1 - (x+1) + (x+1)^2 + y^2\right). \tag{19.4.33}$$
# Construct grid and compute function
x, y = torch.meshgrid(torch.linspace(-2, 2, 101),
                      torch.linspace(-2, 2, 101))
z = x*torch.exp(- x**2 - y**2)

# Compute approximating quadratic with gradient and Hessian at (-1, 0),
# matching (19.4.33)
w = torch.exp(torch.tensor([-1.]))*(-1 - (x + 1) + (x + 1)**2 + y**2)

# Plot function
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x.numpy(), y.numpy(), z.numpy(),
                  **{'rstride': 10, 'cstride': 10})
ax.plot_wireframe(x.numpy(), y.numpy(), w.numpy(),
                  **{'rstride': 10, 'cstride': 10}, color='purple')
d2l.plt.xlabel('x')
d2l.plt.ylabel('y')
d2l.set_figsize()
ax.set_xlim(-2, 2)
ax.set_ylim(-2, 2)
ax.set_zlim(-1, 1)
ax.dist = 12

This forms the basis for Newtonʼs Algorithm discussed in Section 12.3, where we perform numer-
ical optimization iteratively finding the best fitting quadratic, and then exactly minimizing that
quadratic.
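One way to convince ourselves that (19.4.33) really is the best fitting quadratic at $[-1, 0]^\top$ is to check how fast the error shrinks as we approach that point. The sketch below uses plain Python; the step sizes and the direction of approach are arbitrary choices for illustration. For a second-order expansion we would expect the error to fall roughly with the cube of the step.

```python
import math

def f(x, y):
    return x * math.exp(-x**2 - y**2)

def quadratic(x, y):
    # Second-order expansion of f around (-1, 0), as in (19.4.33)
    return math.exp(-1) * (-1 - (x + 1) + (x + 1)**2 + y**2)

steps = (0.1, 0.01, 0.001)
errors = []
for step in steps:
    # Move away from (-1, 0) by `step` in both coordinates
    x, y = -1 + step, step
    errors.append(abs(f(x, y) - quadratic(x, y)))
    print(f'step = {step}: |f - quadratic| = {errors[-1]:.3e}')
```

Shrinking the step by a factor of 10 should shrink the error by roughly a factor of 1000, which a quadratic with the wrong second-order coefficients would not achieve.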
19.4.7 A Little Matrix Calculus
Derivatives of functions involving matrices turn out to be particularly nice. This section can be-
come notationally heavy, so may be skipped in a first reading, but it is useful to know how deriva-
tives of functions involving common matrix operations are often much cleaner than one might
initially anticipate, particularly given how central matrix operations are to deep learning applica-
tions.
Letʼs begin with an example. Suppose that we have some fixed column vector β, and we want to
take the product function f (x) = β ⊤ x, and understand how the dot product changes when we
change x.
A bit of notation that will be useful when working with matrix derivatives in ML is called the de-
nominator layout matrix derivative where we assemble our partial derivatives into the shape of
whatever vector, matrix, or tensor is in the denominator of the differential. In this case, we will
write
$$\frac{df}{d\mathbf{x}} = \begin{bmatrix} \frac{df}{dx_1} \\ \vdots \\ \frac{df}{dx_n} \end{bmatrix}, \tag{19.4.34}$$
where we matched the shape of the column vector x.

If we write out our function into components this is
$$f(\mathbf{x}) = \sum_{i = 1}^{n} \beta_ix_i = \beta_1x_1 + \cdots + \beta_nx_n. \tag{19.4.35}$$

If we now take the partial derivative with respect to, say, $x_1$, note that every term vanishes
except the first, which is just $x_1$ multiplied by $\beta_1$, so we obtain that
$$\frac{df}{dx_1} = \beta_1, \tag{19.4.36}$$
or more generally that
$$\frac{df}{dx_i} = \beta_i. \tag{19.4.37}$$
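We can now reassemble this into a matrix to see
$$\frac{df}{d\mathbf{x}} = \begin{bmatrix} \frac{df}{dx_1} \\ \vdots \\ \frac{df}{dx_n} \end{bmatrix} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_n \end{bmatrix} = \boldsymbol{\beta}. \tag{19.4.38}$$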
This illustrates a few facts about matrix calculus that we will often encounter throughout this
section:
• First, the computations will get rather involved.
• Second, the final results are much cleaner than the intermediate process, and will always
look similar to the single variable case. In this case, note that $\frac{d}{dx}(bx) = b$ and
$\frac{d}{d\mathbf{x}}(\boldsymbol{\beta}^\top\mathbf{x}) = \boldsymbol{\beta}$ are both similar.
• Third, transposes can often appear seemingly from nowhere. The core reason for this is the
convention that we match the shape of the denominator, thus when we multiply matrices,
we will need to take transposes to match back to the shape of the original term.
To keep building intuition, letʼs try a computation that is a little harder. Suppose that we have a
column vector x, and a square matrix A and we want to compute
$$\frac{d}{d\mathbf{x}}(\mathbf{x}^\top A \mathbf{x}). \tag{19.4.39}$$
To drive towards easier to manipulate notation, letʼs consider this problem using Einstein notation.
In this case we can write the function as
$$\mathbf{x}^\top A \mathbf{x} = x_ia_{ij}x_j. \tag{19.4.40}$$
To compute our derivative, we need to understand for every k, what is the value of
$$\frac{d}{dx_k}(\mathbf{x}^\top A \mathbf{x}) = \frac{d}{dx_k}x_ia_{ij}x_j. \tag{19.4.41}$$
By the product rule, this is
$$\frac{d}{dx_k}x_ia_{ij}x_j = \frac{dx_i}{dx_k}a_{ij}x_j + x_ia_{ij}\frac{dx_j}{dx_k}. \tag{19.4.42}$$
For a term like $\frac{dx_i}{dx_k}$, it is not hard to see that this is one when $i = k$ and zero
otherwise. This means that every term where $i$ and $k$ are different vanishes from this sum, so
the only terms that remain in that first sum are the ones where $i = k$. The same reasoning holds
for the second term where we need $j = k$. This gives
$$\frac{d}{dx_k}x_ia_{ij}x_j = a_{kj}x_j + x_ia_{ik}. \tag{19.4.43}$$

Now, the names of the indices in Einstein notation are arbitrary—the fact that i and j are different
is immaterial to this computation at this point, so we can re-index so that they both use i to see
that
$$\frac{d}{dx_k}x_ia_{ij}x_j = a_{ki}x_i + x_ia_{ik} = (a_{ki} + a_{ik})x_i. \tag{19.4.44}$$
Now, here is where we start to need some practice to go further. Letʼs try and identify this outcome
in terms of matrix operations. aki + aik is the k, i-th component of A + A⊤ . This gives
$$\frac{d}{dx_k}x_ia_{ij}x_j = [\mathbf{A} + \mathbf{A}^\top]_{ki}x_i. \tag{19.4.45}$$
Similarly, this term is now the product of the matrix A + A⊤ by the vector x, so we see that
$$\left[\frac{d}{d\mathbf{x}}(\mathbf{x}^\top A \mathbf{x})\right]_k = \frac{d}{dx_k}x_ia_{ij}x_j = [(\mathbf{A} + \mathbf{A}^\top)\mathbf{x}]_k. \tag{19.4.46}$$
Thus, we see that the $k$-th entry of the desired derivative from (19.4.39) is just the $k$-th
entry of the vector on the right, and thus the two are the same. This yields
$$\frac{d}{d\mathbf{x}}(\mathbf{x}^\top A \mathbf{x}) = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}. \tag{19.4.47}$$
This required significantly more work than our last one, but the final result is small. More than
that, consider the following computation for traditional single variable derivatives:
$$\frac{d}{dx}(xax) = \frac{dx}{dx}ax + xa\frac{dx}{dx} = (a+a)x. \tag{19.4.48}$$
Equivalently $\frac{d}{dx}(ax^2) = 2ax = (a+a)x$. Again, we get a result that looks rather like the
single variable result but with a transpose tossed in.
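Index manipulations like these are easy to get wrong, so it can be reassuring to test (19.4.47) numerically. The sketch below, in plain Python with nested lists standing in for matrices, compares $(\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ against central finite differences of $\mathbf{x}^\top \mathbf{A} \mathbf{x}$. The matrix, the vector, and the helper names are arbitrary choices made for illustration.

```python
def quad_form(A, x):
    """Compute x^T A x, i.e. the sum of x_i a_ij x_j over i and j."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def analytic_grad(A, x):
    """The k-th entry is the sum over i of (a_ki + a_ik) x_i, as in (19.4.44)."""
    n = len(x)
    return [sum((A[k][i] + A[i][k]) * x[i] for i in range(n)) for k in range(n)]

def numeric_grad(A, x, h=1e-6):
    """Central finite differences, one coordinate of x at a time."""
    grad = []
    for k in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[k] += h
        x_minus[k] -= h
        grad.append((quad_form(A, x_plus) - quad_form(A, x_minus)) / (2 * h))
    return grad

# A deliberately non-symmetric matrix, so that A + A^T differs from 2A
A = [[1.0, 2.0, 0.0], [-1.0, 3.0, 4.0], [5.0, 0.5, -2.0]]
x = [0.5, -1.0, 2.0]

print('analytic:', analytic_grad(A, x))
print('numeric: ', [round(g, 4) for g in numeric_grad(A, x)])
```

Using a non-symmetric $\mathbf{A}$ matters here: for a symmetric matrix the check could not distinguish $(\mathbf{A} + \mathbf{A}^\top)\mathbf{x}$ from $2\mathbf{A}\mathbf{x}$.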
At this point, the pattern should be looking rather suspicious, so letʼs try to figure out why. When
we take matrix derivatives like this, letʼs first assume that the expression we get will be another
matrix expression: an expression we can write in terms of products and sums of matrices and
their transposes. If such an expression exists, it will need to be true for all matrices. In particular,
it will need to be true of 1 × 1 matrices, in which case the matrix product is just the product of the
numbers, the matrix sum is just the sum, and the transpose does nothing at all! In other words,
whatever expression we get must match the single variable expression. This means that, with
some practice, one can often guess matrix derivatives just by knowing what the associated single
variable expression must look like!
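Letʼs try this out. Suppose that $\mathbf{X}$ is an $n \times m$ matrix, $\mathbf{U}$ is an
$n \times r$ and $\mathbf{V}$ is an $r \times m$. Letʼs try to compute
$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2} = \;? \tag{19.4.49}$$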
This computation is important in an area called matrix factorization. For us, however, it is just a
derivative to compute. Letʼs try to imagine what this would be for 1 × 1 matrices. In that case, we
get the expression
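$$\frac{d}{dv} (x-uv)^{2}= -2(x-uv)u, \tag{19.4.50}$$
where the derivative is rather standard. If we try to convert this back into a matrix expression we
get
$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2}= -2(\mathbf{X} - \mathbf{U}\mathbf{V})\mathbf{U}. \tag{19.4.51}$$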

However, if we look at this it does not quite work. Recall that X is n × m, as is UV, so the matrix
2(X − UV) is n × m. On the other hand U is n × r, and we cannot multiply a n × m and a n × r
matrix since the dimensions do not match!
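We want to get $\frac{d}{d\mathbf{V}}$, which is the same shape as $\mathbf{V}$, which is
$r \times m$. So somehow we need to take an $n \times m$ matrix and an $n \times r$ matrix,
multiply them together (perhaps with some transposes) to get an $r \times m$. We can do this by
multiplying $\mathbf{U}^\top$ by $(\mathbf{X} - \mathbf{U}\mathbf{V})$. Thus, we can guess the
solution to (19.4.49) is
$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2}= -2\mathbf{U}^\top(\mathbf{X} - \mathbf{U}\mathbf{V}). \tag{19.4.52}$$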
To show that this works, we would be remiss to not provide a detailed computation. If we already
believe that this rule-of-thumb works, feel free to skip past this derivation. To compute
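$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^2, \tag{19.4.53}$$
we must find for every $a$ and $b$
$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2}= \frac{d}{dv_{ab}} \sum_{i, j}\left(x_{ij} - \sum_k u_{ik}v_{kj}\right)^2. \tag{19.4.54}$$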
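Recalling that all entries of $\mathbf{X}$ and $\mathbf{U}$ are constants as far as
$\frac{d}{dv_{ab}}$ is concerned, we may push the derivative inside the sum, and apply the chain
rule to the square to get
$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2}= 2\sum_{i, j}\left(x_{ij} - \sum_k u_{ik}v_{kj}\right)\left(-\sum_k u_{ik}\frac{dv_{kj}}{dv_{ab}} \right). \tag{19.4.55}$$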
As in the previous derivation, we may note that $\frac{dv_{kj}}{dv_{ab}}$ is only non-zero if
$k = a$ and $j = b$. If either of those conditions does not hold, the term in the sum is zero, and
we may freely discard it. We see that
$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2}= -2\sum_{i}\left(x_{ib} - \sum_k u_{ik}v_{kb}\right)u_{ia}. \tag{19.4.56}$$
An important subtlety here is that the requirement that k = a does not occur inside the inner
sum since that k is a dummy variable which we are summing over inside the inner term. For a
notationally cleaner example, consider why
$$\frac{d}{dx_1} \left(\sum_i x_i \right)^{2}= 2\left(\sum_i x_i \right). \tag{19.4.57}$$
From this point, we may start identifying components of the sum. First,
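$$\sum_k u_{ik}v_{kb} = [\mathbf{U}\mathbf{V}]_{ib}. \tag{19.4.58}$$
So the entire expression in the inside of the sum is
$$x_{ib} - \sum_k u_{ik}v_{kb} = [\mathbf{X}-\mathbf{U}\mathbf{V}]_{ib}. \tag{19.4.59}$$
This means we may now write our derivative as
$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2}= -2\sum_{i}[\mathbf{X}-\mathbf{U}\mathbf{V}]_{ib}u_{ia}. \tag{19.4.60}$$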

We want this to look like the a, b element of a matrix so we can use the technique as in the previous
example to arrive at a matrix expression, which means that we need to exchange the order of the
indices on uia . If we notice that uia = [U⊤ ]ai , we can then write
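$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2}= -2\sum_{i} [\mathbf{U}^\top]_{ai}[\mathbf{X}-\mathbf{U}\mathbf{V}]_{ib}. \tag{19.4.61}$$
This is a matrix product, and thus we can conclude that
$$\frac{d}{dv_{ab}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2}= -2[\mathbf{U}^\top(\mathbf{X}-\mathbf{U}\mathbf{V})]_{ab}, \tag{19.4.62}$$
and thus we may write the solution to (19.4.49)
$$\frac{d}{d\mathbf{V}} \|\mathbf{X} - \mathbf{U}\mathbf{V}\|_2^{2}= -2\mathbf{U}^\top(\mathbf{X} - \mathbf{U}\mathbf{V}). \tag{19.4.63}$$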
This matches the solution we guessed above!
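For readers who skipped the derivation, a numerical check may be a quicker way to gain confidence in the guess. The sketch below is plain Python with nested lists; the helper names and the small example matrices are made up for illustration. It reads the squared norm as the sum of squared entries, exactly as in (19.4.54), and compares $-2\mathbf{U}^\top(\mathbf{X} - \mathbf{U}\mathbf{V})$ with finite differences taken entry by entry in $\mathbf{V}$.

```python
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(row) for row in zip(*P)]

def loss(X, U, V):
    """Sum of squared entries of X - UV, as in (19.4.54)."""
    UV = matmul(U, V)
    return sum((X[i][j] - UV[i][j])**2
               for i in range(len(X)) for j in range(len(X[0])))

def analytic_grad(X, U, V):
    """-2 U^T (X - UV), which has the same r x m shape as V."""
    UV = matmul(U, V)
    R = [[X[i][j] - UV[i][j] for j in range(len(X[0]))] for i in range(len(X))]
    return [[-2 * g for g in row] for row in matmul(transpose(U), R)]

def numeric_grad(X, U, V, h=1e-6):
    """Central finite differences with respect to each entry v_ab."""
    grad = [[0.0] * len(V[0]) for _ in V]
    for a in range(len(V)):
        for b in range(len(V[0])):
            V_plus = [row[:] for row in V]
            V_minus = [row[:] for row in V]
            V_plus[a][b] += h
            V_minus[a][b] -= h
            grad[a][b] = (loss(X, U, V_plus) - loss(X, U, V_minus)) / (2 * h)
    return grad

X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # n x m with n = 2, m = 3
U = [[1.0], [-2.0]]                     # n x r with r = 1
V = [[0.5, -1.0, 2.0]]                  # r x m

print('analytic:', analytic_grad(X, U, V))
print('numeric: ', [[round(g, 4) for g in row] for row in numeric_grad(X, U, V)])
```

The incorrect guess (19.4.51) cannot even be evaluated here, since a $2 \times 3$ matrix cannot be multiplied by a $2 \times 1$ matrix.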
It is reasonable to ask at this point, “Why can I not just write down matrix versions of all the cal-
culus rules I have learned? It is clear this is still mechanical. Why do we not just get it over with!”
And indeed there are such rules and (Petersen et al., 2008) provides an excellent summary. How-
ever, due to the plethora of ways matrix operations can be combined compared to single values,
there are many more matrix derivative rules than single variable ones. It is often the case that it
is best to work with the indices, or leave it up to automatic differentiation when appropriate.
19.4.8 Summary
• In higher dimensions, we can define gradients which serve the same purpose as derivatives
in one dimension. These allow us to see how a multi-variable function changes when we
make an arbitrary small change to the inputs.
• The backpropagation algorithm can be seen to be a method of organizing the multi-variable
chain rule to allow for the efficient computation of many partial derivatives.
• Matrix calculus allows us to write the derivatives of matrix expressions in concise ways.
Exercises
1. Given a column vector $\boldsymbol{\beta}$, compute the derivatives of both
$f(\mathbf{x}) = \boldsymbol{\beta}^\top\mathbf{x}$ and $g(\mathbf{x}) = \mathbf{x}^\top\boldsymbol{\beta}$.
Why do you get the same answer?
2. Let $\mathbf{v}$ be an $n$ dimension vector. What is $\frac{\partial}{\partial\mathbf{v}}\|\mathbf{v}\|_2$?
3. Let $L(x, y) = \log(e^x + e^y)$. Compute the gradient. What is the sum of the components of the
gradient?
4. Let $f(x, y) = x^2y + xy^2$. Show that the only critical point is $(0,0)$. By considering
$f(x, x)$, determine if $(0,0)$ is a maximum, minimum, or neither.
5. Suppose that we are minimizing a function f (x) = g(x) + h(x). How can we geometrically
interpret the condition of ∇f = 0 in terms of g and h?

19.5 Integral Calculus
Differentiation only makes up half of the content of a traditional calculus education. The other
pillar, integration, starts out seeming a rather disjoint question, “What is the area underneath this
curve?” While seemingly unrelated, integration is tightly intertwined with differentiation via
what is known as the fundamental theorem of calculus.
At the level of machine learning we discuss in this book, we will not need a deep understanding of
integration. However, we will provide a brief introduction to lay the groundwork for any further
applications we will encounter later on.
19.5.1 Geometric Interpretation
Suppose that we have a function f (x). For simplicity, letʼs assume that f (x) is non-negative (never
takes a value less than zero). What we want to try and understand is: what is the area contained
between f (x) and the x-axis?
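%matplotlib inline
import torch
from d2l import torch as d2l

# Grid of points at which to evaluate and plot the function
x = torch.arange(-2, 2, 0.01)
f = torch.exp(-x**2)
d2l.set_figsize()
d2l.plt.plot(x, f, color='black')
d2l.plt.fill_between(x.tolist(), f.tolist())
d2l.plt.show()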
In most cases, this area will be infinite or undefined (consider the area under $f(x) = x^{2}$),
so people will often talk about the area between a pair of ends, say $a$ and $b$.
f = torch.exp(-x**2)
d2l.set_figsize()
d2l.plt.fill_between(x.tolist()[50:250], f.tolist()[50:250])
d2l.plt.show()
We will denote this area by the integral symbol below:

$$\mathrm{Area}(\mathcal{A}) = \int_a^b f(x) \;dx. \tag{19.5.1}$$
The inner variable is a dummy variable, much like the index of a sum in a $\sum$, and so this can
be equivalently written with any inner value we like:
$$\int_a^b f(x) \;dx = \int_a^b f(z) \;dz. \tag{19.5.2}$$
There is a traditional way to try and understand how we might try to approximate such integrals:
we can imagine taking the region in-between a and b and chopping it into N vertical slices. If N
is large, we can approximate the area of each slice by a rectangle, and then add up the areas to get
the total area under the curve. Letʼs take a look at an example doing this in code. We will see how
to get the true value in a later section.
epsilon = 0.05
a = 0
b = 2
x = torch.arange(a, b, epsilon)
f = x / (1 + x**2)
approx = torch.sum(epsilon*f)
true = torch.log(torch.tensor([5.])) / 2
d2l.set_figsize()
d2l.plt.bar(x, f, width=epsilon, align='edge')
d2l.plt.ylim([0, 1])
d2l.plt.show()
f'approximation: {approx}, truth: {true}'

'approximation: 0.7944855690002441, truth: tensor([0.8047])'
The issue is that while it can be done numerically, we can do this approach analytically for only
the simplest functions like
$$\int_a^b x \;dx. \tag{19.5.3}$$
Anything somewhat more complex like our example from the code above
$$\int_a^b \frac{x}{1+x^{2}} \;dx \tag{19.5.4}$$
is beyond what we can solve with such a direct method.
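Even though the limit is hard to take by hand, we can at least watch the chop-and-sum process converge numerically. The following plain Python sketch repeats the experiment above with more and more slices; the helper `left_riemann_sum` and the particular slice counts are assumptions made for illustration, and the reference value $\log(5)/2$ is the one used in the code above.

```python
import math

def left_riemann_sum(f, a, b, n):
    """Approximate the integral of f over [a, b] with n left-endpoint rectangles."""
    width = (b - a) / n
    return sum(f(a + i * width) for i in range(n)) * width

def f(x):
    return x / (1 + x**2)

truth = math.log(5) / 2

# N = 40 corresponds to the slice width epsilon = 0.05 used above
for n in (40, 400, 4000):
    approx = left_riemann_sum(f, 0, 2, n)
    print(f'N = {n}: approximation = {approx:.6f}, '
          f'error = {abs(approx - truth):.2e}')
```

Each tenfold increase in the number of slices should reduce the error by roughly a factor of ten, which is slow but steady convergence.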

We will instead take a different approach. We will work intuitively with the notion of the area, and
learn the main computational tool used to find integrals: the fundamental theorem of calculus. This
will be the basis for our study of integration.
19.5.2 The Fundamental Theorem of Calculus
To dive deeper into the theory of integration, letʼs introduce a function

$$F(x) = \int_0^x f(y) dy. \tag{19.5.5}$$
This function measures the area between 0 and x depending on how we change x. Notice that this
is everything we need since
$$\int_a^b f(x) \;dx = F(b) - F(a). \tag{19.5.6}$$
This is a mathematical encoding of the fact that we can measure the area out to the far end-point
and then subtract off the area to the near end point as indicated in Fig. 19.5.1.

Fig. 19.5.1: Visualizing why we may reduce the problem of computing the area under a curve
between two points to computing the area to the left of a point.
Thus, we can figure out what the integral over any interval is by figuring out what F (x) is.
To do so, letʼs consider an experiment. As we often do in calculus, letʼs imagine what happens
when we shift the value by a tiny bit. From the comment above, we know that
$$F(x+\epsilon) - F(x) = \int_x^{x+\epsilon} f(y) \; dy. \tag{19.5.7}$$
This tells us that the function changes by the area under a tiny sliver of a function.
This is the point at which we make an approximation. If we look at a tiny sliver of area like this, it
looks like this area is close to the rectangular area with height the value of f (x) and the base width
ϵ. Indeed, one can show that as ϵ → 0 this approximation becomes better and better. Thus we can
conclude:
F (x + ϵ) − F (x) ≈ ϵf (x). (19.5.8)
However, we can now notice: this is exactly the pattern we expect if we were computing the deriva-
tive of F ! Thus we see the following rather surprising fact:
$$\frac{dF}{dx}(x) = f(x). \tag{19.5.9}$$
This is the fundamental theorem of calculus. We may write it in expanded form as
$$\frac{d}{dx}\int_0^x f(y) \; dy = f(x). \tag{19.5.10}$$
It takes the concept of finding areas (a priori rather hard), and reduces it to a statement about
derivatives (something much more completely understood). One last comment that we must make is
that this does not tell us exactly what $F(x)$ is. Indeed $F(x) + C$ for any $C$ has the same
derivative. This is a fact-of-life in the theory of integration. Thankfully, notice that when
working with definite integrals, the constants drop out, and thus are irrelevant to the outcome.
$$\int_a^b f(x) \; dx = (F(b) + C) - (F(a) + C) = F(b) - F(a). \tag{19.5.11}$$
This may seem like abstract nonsense, but letʼs take a moment to appreciate that it has given us
a whole new perspective on computing integrals. Our goal is no longer to do some sort of
chop-and-sum process to try and recover the area; rather, we need only find a function whose
derivative is the function we have! This is incredible since we can now list many rather difficult
integrals by just reversing the table from Section 19.3.2. For instance, we know that the
derivative of $x^{n}$ is $nx^{n-1}$. Thus, we can say using the fundamental theorem (19.5.10) that
$$\int_0^{x} ny^{n-1} \; dy = x^n - 0^n = x^n. \tag{19.5.12}$$
Similarly, we know that the derivative of $e^{x}$ is itself, so that means
$$\int_0^{x} e^{y} \; dy = e^{x} - e^{0} = e^x - 1. \tag{19.5.13}$$
In this way, we can develop the entire theory of integration leveraging ideas from differential
calculus freely. Every integration rule derives from this one fact.
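The fundamental theorem can also be observed numerically. In the plain Python sketch below, `area` is a hypothetical helper that approximates $F(x)$ with a midpoint-rule sum, and we then differentiate that numerical area function with a finite difference. The integrand $e^{y}$, the evaluation point, and the step sizes are chosen only for illustration.

```python
import math

def area(f, x, n=20_000):
    """Midpoint-rule approximation of F(x), the integral of f from 0 to x."""
    width = x / n
    return sum(f((i + 0.5) * width) for i in range(n)) * width

f = math.exp
x, eps = 1.0, 1e-3

# F(x) itself should match the antiderivative e^x - 1 from (19.5.13)
F_x = area(f, x)

# The slope of the area function should recover the integrand f(x)
slope = (area(f, x + eps) - area(f, x - eps)) / (2 * eps)

print(f'F(1) = {F_x:.6f}, e - 1 = {math.e - 1:.6f}')
print(f'dF/dx at 1 = {slope:.6f}, f(1) = {f(x):.6f}')
```

The first line checks (19.5.13), and the second checks (19.5.9): the rate at which the accumulated area grows is the height of the curve at that point.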
19.5.3 Change of Variables
Just as with differentiation, there are a number of rules which make the computation of integrals
more tractable. In fact, every rule of differential calculus (like the product rule, sum rule, and
chain rule) has a corresponding rule for integral calculus (integration by parts, linearity of inte-
gration, and the change of variables formula respectively). In this section, we will dive into what
is arguably the most important from the list: the change of variables formula.
First, suppose that we have a function which is itself an integral:
$$F(x) = \int_0^x f(y) \; dy. \tag{19.5.14}$$
Letʼs suppose that we want to know how this function looks when we compose it with another to
obtain F (u(x)). By the chain rule, we know
$$\frac{d}{dx}F(u(x)) = \frac{dF}{du}(u(x))\cdot \frac{du}{dx}. \tag{19.5.15}$$
We can turn this into a statement about integration by using the fundamental theorem (19.5.10)
as above. This gives
$$F(u(x)) - F(u(0)) = \int_0^x \frac{dF}{du}(u(y))\cdot \frac{du}{dy} \;dy. \tag{19.5.16}$$
Recalling that $F$ is itself an integral gives that the left hand side may be rewritten to be
$$\int_{u(0)}^{u(x)} f(y) \; dy = \int_0^x \frac{dF}{du}(u(y))\cdot \frac{du}{dy} \;dy. \tag{19.5.17}$$
Similarly, recalling that $F$ is an integral allows us to recognize that $\frac{dF}{dx} = f$ using
the fundamental theorem (19.5.10), and thus we may conclude
$$\int_{u(0)}^{u(x)} f(y) \; dy = \int_0^x f(u(y))\cdot \frac{du}{dy} \;dy. \tag{19.5.18}$$
This is the change of variables formula.

For a more intuitive derivation, consider what happens when we take an integral of $f(u(x))$
between $x$ and $x+\epsilon$. For a small $\epsilon$, this integral is approximately
$\epsilon f(u(x))$, the area of the associated rectangle. Now, letʼs compare this with the integral
of $f(y)$ from $u(x)$ to $u(x+\epsilon)$. We know that
$u(x+\epsilon) \approx u(x) + \epsilon \frac{du}{dx}(x)$, so the area of this rectangle is
approximately $\epsilon \frac{du}{dx}(x)f(u(x))$. Thus, to make the areas of these two rectangles
agree, we need to multiply the first one by $\frac{du}{dx}(x)$, as is illustrated in Fig. 19.5.2.

Fig. 19.5.2: Visualizing the transformation of a single thin rectangle under the change of variables.
This tells us that

$$\int_x^{x+\epsilon} f(u(y))\frac{du}{dy}(y)\;dy = \int_{u(x)}^{u(x+\epsilon)} f(y) \; dy. \tag{19.5.19}$$
This is the change of variables formula expressed for a single small rectangle.
If $u(x)$ and $f(x)$ are properly chosen, this can allow for the computation of incredibly complex
integrals. For instance, if we choose $f(y) = 1$ and $u(x) = e^{-x^{2}}$ (which means
$\frac{du}{dx}(x) = -2xe^{-x^{2}}$), this shows that
$$e^{-1} - 1 = \int_{e^{-0}}^{e^{-1}} 1 \; dy = -2\int_0^{1} ye^{-y^2}\;dy, \tag{19.5.20}$$
and thus by rearranging that
$$\int_0^{1} ye^{-y^2}\; dy = \frac{1-e^{-1}}{2}. \tag{19.5.21}$$
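Both sides of the change of variables formula in this example can be approximated numerically and compared. The plain Python sketch below uses a hypothetical helper `midpoint_integral`; because its slice width is computed as `(b - a) / n`, it returns a negative value when the limits run from right to left, as they do on the left-hand side here.

```python
import math

def midpoint_integral(f, a, b, n=10_000):
    """Midpoint-rule approximation; the slice width is negative when b < a."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

# Left-hand side: integrate f(y) = 1 from u(0) to u(1), with u(x) = exp(-x^2)
lhs = midpoint_integral(lambda y: 1.0, math.exp(-0.0), math.exp(-1.0))

# Right-hand side: integrate f(u(y)) * du/dy = -2 y exp(-y^2) from 0 to 1
rhs = midpoint_integral(lambda y: -2 * y * math.exp(-y**2), 0.0, 1.0)

# Rearranged result (19.5.21)
rearranged = midpoint_integral(lambda y: y * math.exp(-y**2), 0.0, 1.0)

print(f'lhs = {lhs:.6f}, rhs = {rhs:.6f}')
print(f'integral = {rearranged:.6f}, (1 - 1/e)/2 = {(1 - math.exp(-1)) / 2:.6f}')
```

Both sides come out close to $e^{-1} - 1$, a negative number.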
19.5.4 A Comment on Sign Conventions
Keen-eyed readers will observe something strange about the computations above. Namely, com-
putations like
$$\int_{e^{-0}}^{e^{-1}} 1 \; dy = e^{-1} - 1 < 0, \tag{19.5.22}$$
can produce negative numbers. When thinking about areas, it can be strange to see a negative
value, and so it is worth digging into what the convention is.
Mathematicians take the notion of signed areas. This manifests itself in two ways. First, if we
consider a function f (x) which is sometimes less than zero, then the area will also be negative. So
for instance
$$\int_0^{1} (-1)\;dx = -1. \tag{19.5.23}$$
Similarly, integrals which progress from right to left, rather than left to right are also taken to be
negative areas
$$\int_0^{-1} 1\; dx = -1. \tag{19.5.24}$$
The standard area (from left to right of a positive function) is always positive. Anything obtained
by flipping it (say flipping over the x-axis to get the integral of a negative number, or flipping over

the y-axis to get an integral in the wrong order) will produce a negative area. And indeed, flipping
twice will give a pair of negative signs that cancel out to have positive area
$$\int_0^{-1} (-1)\;dx = 1. \tag{19.5.25}$$
If this discussion sounds familiar, it is! In Section 19.1 we discussed how the determinant repre-
sented the signed area in much the same way.
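The sign conventions fall out naturally from the chop-and-sum picture if we let the slice width carry a sign. The plain Python sketch below illustrates this with a hypothetical helper `riemann_sum`, applied to the three integrals (19.5.23), (19.5.24), and (19.5.25).

```python
def riemann_sum(f, a, b, n=1000):
    """Signed left-endpoint Riemann sum; the width is negative when b < a."""
    width = (b - a) / n
    return sum(f(a + i * width) for i in range(n)) * width

negative_height = riemann_sum(lambda x: -1.0, 0, 1)   # as in (19.5.23)
reversed_limits = riemann_sum(lambda x: 1.0, 0, -1)   # as in (19.5.24)
both_flipped = riemann_sum(lambda x: -1.0, 0, -1)     # as in (19.5.25)

print(negative_height, reversed_limits, both_flipped)
```

A negative height or a negative width each contributes one sign flip, and two flips cancel.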
19.5.5 Multiple Integrals
In some cases, we will need to work in higher dimensions. For instance, suppose that we have
a function of two variables, like f (x, y) and we want to know the volume under f when x ranges
over [a, b] and y ranges over [c, d].
# Construct grid and compute function
x, y = torch.meshgrid(torch.linspace(-2, 2, 101), torch.linspace(-2, 2, 101))
z = torch.exp(- x**2 - y**2)

# Plot function on a 3D axis (`ax` must be created before it is used)
ax = d2l.plt.figure().add_subplot(111, projection='3d')
ax.plot_wireframe(x, y, z)
d2l.plt.xlabel('x')
d2l.plt.ylabel('y')
d2l.plt.xticks([-2, -1, 0, 1, 2])
d2l.plt.yticks([-2, -1, 0, 1, 2])
d2l.set_figsize()
ax.set_xlim(-2, 2)
ax.set_ylim(-2, 2)
ax.set_zlim(0, 1)
ax.dist = 12


We write this as
$$\int_{[a,b]\times[c,d]} f(x, y)\;dx\;dy.$$ (19.5.26)
Suppose that we wish to compute this integral. We claim that we can do this by iteratively
computing first the integral in x and then shifting to the integral in y, that is to say
$$\int_{[a,b]\times[c,d]} f(x, y)\;dx\;dy = \int_c^{d} \left(\int_a^{b} f(x, y) \;dx\right) \; dy.$$ (19.5.27)
Letʼs see why this is.

Consider the figure above where we have split the function into ϵ × ϵ squares which we will index
with integer coordinates i, j. In this case, our integral is approximately
$$\sum_{i, j} \epsilon^{2} f(\epsilon i, \epsilon j).$$ (19.5.28)
Once we discretize the problem, we may add up the values on these squares in whatever order we
like, and not worry about changing the values. This is illustrated in Fig. 19.5.3. In particular, we
can say that
$$\sum_{j} \epsilon \left(\sum_{i} \epsilon f(\epsilon i, \epsilon j)\right).$$ (19.5.29)
Fig. 19.5.3: Illustrating how to decompose a sum over many squares as a sum over first the columns
(1), then adding the column sums together (2).
The sum on the inside is precisely the discretization of the integral

$$G(\epsilon j) = \int_a^{b} f(x, \epsilon j) \; dx.$$ (19.5.30)
Finally, notice that if we combine these two expressions we get

$$\sum_j \epsilon G(\epsilon j) \approx \int_c^{d} G(y) \; dy = \int_{[a,b]\times[c,d]} f(x, y)\;dx\;dy.$$ (19.5.31)
Thus putting it all together, we have that

$$\int_{[a,b]\times[c,d]} f(x, y)\;dx\;dy = \int_c^{d} \left(\int_a^{b} f(x, y) \;dx\right) \; dy.$$ (19.5.32)

Notice that, once discretized, all we did was rearrange the order in which we added a list of
numbers. This may make it seem like it is nothing, however this result (called Fubini’s Theorem)
is not always true! For the type of mathematics encountered when doing machine learning
(continuous functions), there is no concern, however it is possible to create examples where it
fails (for example the function $f(x, y) = xy(x^2-y^2)/(x^2+y^2)^3$ over the rectangle [0, 2] × [0, 1]).
Note that the choice to do the integral in x first, and then the integral in y was arbitrary. We could
have equally well chosen to do y first and then x to see
$$\int_{[a,b]\times[c,d]} f(x, y)\;dx\;dy = \int_a^{b} \left(\int_c^{d} f(x, y) \;dy\right) \; dx.$$ (19.5.33)
Oftentimes, we will condense this down to vector notation, and say that for U = [a, b] × [c, d] this is
$$\int_U f(\mathbf{x})\;d\mathbf{x}.$$ (19.5.34)
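The order-independence of the iterated sums can be checked numerically. The sketch below assumes the integrand f(x, y) = exp(−x² − y²) from the plot above and an arbitrarily chosen rectangle; the names `x_then_y` and `y_then_x` are ours, not part of any library.

```python
import torch

# Discretize [a, b] x [c, d] into epsilon x epsilon squares (illustrative values)
epsilon = 0.01
a, b, c, d = -2.0, 2.0, -1.0, 1.0
xs = torch.arange(a, b, epsilon, dtype=torch.float64)
ys = torch.arange(c, d, epsilon, dtype=torch.float64)

# f[i, j] = f(xs[i], ys[j]) with f(x, y) = exp(-x^2 - y^2)
f = torch.exp(-xs[:, None]**2 - ys[None, :]**2)

# Integrate in x first (sum over i), then in y (sum over j)
x_then_y = torch.sum(epsilon * torch.sum(epsilon * f, dim=0)).item()
# Integrate in y first (sum over j), then in x (sum over i)
y_then_x = torch.sum(epsilon * torch.sum(epsilon * f, dim=1)).item()

print(f'x then y: {x_then_y:.6f}, y then x: {y_then_x:.6f}')
```

Both orders add up the same list of numbers, so they agree up to floating-point rounding.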
19.5.6 Change of Variables in Multiple Integrals
As with single variables in (19.5.18), the ability to change variables inside a higher dimensional
integral is a key tool. Letʼs summarize the result without derivation.
We need a function that reparameterizes our domain of integration. We can take this to be
$\phi : \mathbb{R}^n \rightarrow \mathbb{R}^n$, that is any function which takes in n real variables
and returns another n. To keep the expressions clean, we will assume that ϕ is injective which is
to say it never folds over itself (ϕ(x) = ϕ(y) =⇒ x = y).
In this case, we can say that
$$\int_{\phi(U)} f(\mathbf{x})\;d\mathbf{x} = \int_{U} f(\phi(\mathbf{x})) \left|\det(D\phi(\mathbf{x}))\right|\;d\mathbf{x},$$ (19.5.35)
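where Dϕ is the Jacobian of ϕ, which is the matrix of partial derivatives of
$\boldsymbol{\phi} = (\phi_1(x_1, \ldots, x_n), \ldots, \phi_n(x_1, \ldots, x_n))$,
$$D\boldsymbol{\phi} = \begin{bmatrix} \frac{\partial \phi_1}{\partial x_1} & \cdots & \frac{\partial \phi_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial \phi_n}{\partial x_1} & \cdots & \frac{\partial \phi_n}{\partial x_n} \end{bmatrix}.$$ (19.5.36)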
Looking closely, we see that this is similar to the single variable chain rule (19.5.18), except we have
replaced the term $\frac{du}{dx}(x)$ with |det(Dϕ(x))|. Letʼs see how we can interpret this term.
Recall that the $\frac{du}{dx}(x)$ term existed to say how much we stretched our x-axis by applying u.
The same process in higher dimensions is to determine how much we stretch the area (or volume, or
hyper-volume) of a little square (or little hyper-cube) by applying ϕ. If ϕ was the multiplication by
a matrix, then we already know that the determinant gives the answer.
With some work, one can show that the Jacobian provides the best approximation to a multivari-
able function ϕ at a point by a matrix in the same way we could approximate by lines or planes
with derivatives and gradients. Thus the determinant of the Jacobian exactly mirrors the scaling
factor we identified in one dimension.
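To make the scaling factor concrete, automatic differentiation can produce the Jacobian at a point, and its determinant gives the local stretch of area. The sketch below uses `torch.autograd.functional.jacobian`; the matrix `A`, the map `nonlinear`, and the evaluation point are hypothetical examples chosen only for illustration.

```python
import torch
from torch.autograd.functional import jacobian

# A linear map: its Jacobian is the matrix itself at every point
A = torch.tensor([[2.0, 1.0], [0.0, 3.0]], dtype=torch.float64)

def linear(x):
    return A @ x

# A nonlinear map (hypothetical example): (x0, x1) -> (x0^2, x0 * x1)
def nonlinear(x):
    return torch.stack([x[0]**2, x[0] * x[1]])

point = torch.tensor([1.5, -0.5], dtype=torch.float64)
J_lin = jacobian(linear, point)     # 2 x 2 matrix of partial derivatives
J_non = jacobian(nonlinear, point)

# Absolute determinants: how much a tiny square near `point` is stretched
scale_lin = torch.abs(torch.det(J_lin)).item()
scale_non = torch.abs(torch.det(J_non)).item()
print(f'linear map scale: {scale_lin}, nonlinear map scale: {scale_non}')
```

For the linear map the scale is |det(A)| everywhere, while for the nonlinear map it depends on the point at which the Jacobian is evaluated.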
It takes some work to fill in the details to this, so do not worry if they are not clear now. Letʼs see
at least one example we will make use of later on. Consider the integral
$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-x^{2}-y^{2}} \;dx\;dy.$$ (19.5.37)

Playing with this integral directly will get us nowhere, but if we change variables, we can make
significant progress. If we let ϕ(r, θ) = (r cos(θ), r sin(θ)) (which is to say that x = r cos(θ), y =
r sin(θ)), then we can apply the change of variable formula to see that this is the same thing as
$$\int_0^\infty \int_0^{2\pi} e^{-r^{2}} \left|\det(D\phi(\mathbf{x}))\right|\;d\theta\;dr,$$ (19.5.38)
where
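$$\left|\det(D\phi(\mathbf{x}))\right| = \left|\det\begin{bmatrix} \cos(\theta) & -r\sin(\theta) \\ \sin(\theta) & r\cos(\theta) \end{bmatrix}\right| = r(\cos^{2}(\theta) + \sin^{2}(\theta)) = r.$$ (19.5.39)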
Thus, the integral is

$$\int_0^\infty \int_0^{2\pi} re^{-r^{2}} \;d\theta\;dr = 2\pi\int_0^\infty re^{-r^{2}}\;dr = \pi,$$ (19.5.40)
where the final equality follows by the same computation that we used in Section 19.5.3.
We will meet this integral again when we study continuous random variables in Section 19.6.
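The value π can be confirmed numerically in both coordinate systems. This is a rough sketch under our own assumptions: the infinite domain is truncated to [−6, 6] (where the integrand is already negligible) and the step size is an arbitrary choice.

```python
import math
import torch

epsilon = 0.01  # illustrative step size

# Cartesian form: sum epsilon^2 * exp(-x^2 - y^2) over a truncated grid.
# Outside [-6, 6] the integrand is below exp(-36), so truncation is harmless.
grid = torch.arange(-6, 6, epsilon, dtype=torch.float64)
x, y = grid[:, None], grid[None, :]
cartesian = torch.sum(epsilon**2 * torch.exp(-x**2 - y**2)).item()

# Polar form: integrate r * exp(-r^2) over r and multiply by 2 * pi
r = torch.arange(0, 6, epsilon, dtype=torch.float64)
polar = (2 * math.pi * torch.sum(epsilon * r * torch.exp(-r**2))).item()

print(f'cartesian: {cartesian:.5f}, polar: {polar:.5f}, pi: {math.pi:.5f}')
```

The extra factor of r in the polar sum is exactly the Jacobian determinant from (19.5.39).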
19.5.7 Summary
• The theory of integration allows us to answer questions about areas or volumes.

• The fundamental theorem of calculus allows us to leverage knowledge about derivatives to
compute areas via the observation that the derivative of the area up to some point is given
by the value of the function being integrated.
• Integrals in higher dimensions can be computed by iterating single variable integrals.
Exercises
1. What is $\int_1^2 \frac{1}{x} \;dx$?
2. Use the change of variables formula to integrate $\int_0^{\sqrt{\pi}} x\sin(x^2)\;dx$.
3. What is $\int_{[0,1]^2} xy \;dx\;dy$?
4. Use the change of variables formula to compute $\int_0^2\int_0^1 xy(x^2-y^2)/(x^2+y^2)^3\;dy\;dx$ and
$\int_0^1\int_0^2 f(x, y) = xy(x^2-y^2)/(x^2+y^2)^3\;dx\;dy$ to see they are different.
19.6 Random Variables
In Section 2.6 we saw the basics of how to work with discrete random variables, which in our case
refer to those random variables which take either a finite set of possible values, or the integers.
In this section, we develop the theory of continuous random variables, which are random variables
which can take on any real value.

19.6.1 Continuous Random Variables
Continuous random variables are a significantly more subtle topic than discrete random variables.
A fair analogy to make is that the technical jump is comparable to the jump between adding lists
of numbers and integrating functions. As such, we will need to take some time to develop the
theory.
From Discrete to Continuous
To understand the additional technical challenges encountered when working with continuous
random variables, letʼs perform a thought experiment. Suppose that we are throwing a dart at the
dart board, and we want to know the probability that it hits exactly 2cm from the center of the
board.
To start with, we imagine measuring a single digit of accuracy, that is to say with bins for 0cm,
1cm, 2cm, and so on. We throw say 100 darts at the dart board, and if 20 of them fall into the bin
for 2cm we conclude that 20% of the darts we throw hit the board 2cm away from the center.
However, when we look closer, this does not match our question! We wanted exact equality,
whereas these bins hold all that fell between say 1.5cm and 2.5cm.
Undeterred, we continue further. We measure even more precisely, say 1.9cm, 2.0cm, 2.1cm, and
now see that perhaps 3 of the 100 darts hit the board in the 2.0cm bucket. Thus we conclude the
probability is 3%.
However, this does not solve anything! We have just pushed the issue down one digit further. Letʼs
abstract a bit. Imagine we know the probability that the first k digits match with 2.00000 . . . and we
want to know the probability it matches for the first k + 1 digits. It is fairly reasonable to assume
that the k + 1th digit is essentially a random choice from the set {0, 1, 2, . . . , 9}. At least, we cannot
conceive of a physically meaningful process which would force the number of micrometers away
from the center to prefer to end in a 7 vs a 3.
What this means is that in essence each additional digit of accuracy we require should decrease
probability of matching by a factor of 10. Or put another way, we would expect that
$$P(\textrm{distance is}\; 2.00\ldots, \;\textrm{to}\; k \;\textrm{digits}) \approx p\cdot10^{-k}.$$ (19.6.1)
The value p essentially encodes what happens with the first few digits, and the $10^{-k}$ handles the
rest.
Notice that if we know the position accurate to k = 4 digits after the decimal, that means we know
the value falls within the interval say [1.99995, 2.00005] which is an interval of length
$2.00005 - 1.99995 = 10^{-4}$. Thus, if we call the length of this interval ϵ, we can say
P (distance is in an ϵ-sized interval around 2) ≈ ϵ · p. (19.6.2)
Letʼs take this one final step further. We have been thinking about the point 2 the entire time, but
never thinking about other points. Nothing is different there fundamentally, but it is the case that
the value p will likely be different. We would at least hope that a dart thrower was more likely to
hit a point near the center, like 2cm rather than 20cm. Thus, the value p is not fixed, but rather
should depend on the point x. This tells us that we should expect
P (distance is in an ϵ-sized interval around x) ≈ ϵ · p(x). (19.6.3)

Indeed, (19.6.3) precisely defines the probability density function. It is a function p(x) which en-
codes the relative probability of hitting near one point vs. another. Letʼs visualize what such a
function might look like.
%matplotlib inline
import torch
from d2l import torch as d2l

torch.pi = torch.acos(torch.zeros(1)).item() * 2 # Define pi in torch

# Plot the probability density function for some random variable
# (the grid is assumed; its step matches epsilon = 0.01 used later in this section)
x = torch.arange(-5, 5, 0.01)
p = 0.2*torch.exp(-(x - 3)**2 / 2)/torch.sqrt(2 * torch.tensor(torch.pi)) + \
    0.8*torch.exp(-(x + 1)**2 / 2)/torch.sqrt(2 * torch.tensor(torch.pi))
d2l.plot(x, p, 'x', 'Density')
The locations where the function value is large indicate regions where we are more likely to find
the random value. The low portions are areas where we are unlikely to find the random value.
Probability Density Functions
Letʼs now investigate this further. We have already seen what a probability density function is
intuitively for a random variable X, namely the density function is a function p(x) so that
P (X is in an ϵ-sized interval around x) ≈ ϵ · p(x). (19.6.4)
But what does this imply for the properties of p(x)?
First, probabilities are never negative, thus we should expect that p(x) ≥ 0 as well.
Second, letʼs imagine that we slice up ℝ into an infinite number of slices which are ϵ wide, say
with slices (ϵ·i, ϵ·(i+1)]. For each of these, we know from (19.6.4) the probability is approximately
P (X is in an ϵ-sized interval around ϵ · i) ≈ ϵ · p(ϵ · i), (19.6.5)
so summed over all of them it should be
$$P(X\in\mathbb{R}) \approx \sum_i \epsilon \cdot p(\epsilon\cdot i).$$ (19.6.6)

This is nothing more than the approximation of an integral discussed in Section 19.5, thus we can
say that
$$P(X\in\mathbb{R}) = \int_{-\infty}^{\infty} p(x) \; dx.$$ (19.6.7)
Since P (X ∈ R) = 1, because the random variable must take on some number, we can
conclude that for any density
$$\int_{-\infty}^{\infty} p(x) \; dx = 1.$$ (19.6.8)
Indeed, digging into this further shows that for any a and b, we have
$$P(X\in(a, b]) = \int_{a}^{b} p(x) \; dx.$$ (19.6.9)
We may approximate this in code by using the same discrete approximation methods as before.
In this case we can approximate the probability of falling in the blue region.
# Approximate probability using numerical integration
epsilon = 0.01
x = torch.arange(-5, 5, 0.01)  # assumed grid, consistent with epsilon above
p = 0.2*torch.exp(-(x - 3)**2 / 2) / torch.sqrt(2 * torch.tensor(torch.pi)) +\
    0.8*torch.exp(-(x + 1)**2 / 2) / torch.sqrt(2 * torch.tensor(torch.pi))
d2l.set_figsize()
d2l.plt.plot(x, p, color='black')
d2l.plt.fill_between(x.tolist()[300:800], p.tolist()[300:800])
d2l.plt.show()
f'approximate Probability: {torch.sum(epsilon*p[300:800])}'
'approximate Probability: 0.773617148399353'
It turns out that these two properties describe exactly the space of possible probability density
functions (or p.d.f.ʼs for the commonly encountered abbreviation). They are non-negative func-
tions p(x) ≥ 0 such that
$$\int_{-\infty}^{\infty} p(x) \; dx = 1.$$ (19.6.10)

We interpret this function by using integration to obtain the probability our random variable is in
a specific interval:
$$P(X\in(a, b]) = \int_{a}^{b} p(x) \; dx.$$ (19.6.11)
In Section 19.8 we will see a number of common distributions, but letʼs continue working in the
abstract.
Cumulative Distribution Functions
In the previous section, we saw the notion of the p.d.f. In practice, this is a commonly encountered
method to discuss continuous random variables, but it has one significant pitfall: that the values
of the p.d.f. are not themselves probabilities, but rather a function that we must integrate to yield
probabilities. There is nothing wrong with a density being larger than 10, as long as it is not larger
than 10 for more than an interval of length 1/10. This can be counter-intuitive, so people often
also think in terms of the cumulative distribution function, or c.d.f., which is a probability.
In particular, by using (19.6.11), we define the c.d.f. for a random variable X with density p(x) by
$$F(x) = \int_{-\infty}^{x} p(x) \; dx = P(X \le x).$$ (19.6.12)
Letʼs observe a few properties.

• F (x) → 0 as x → −∞.
• F (x) → 1 as x → ∞.
• F (x) is non-decreasing (y > x =⇒ F (y) ≥ F (x)).
• F (x) is continuous (has no jumps) if X is a continuous random variable.
With the fourth bullet point, note that this would not be true if X were discrete, say taking the
values 0 and 1 both with probability 1/2. In that case


$$F(x) = \begin{cases} 0 & x < 0, \\ \frac{1}{2} & x < 1, \\ 1 & x \ge 1. \end{cases}$$ (19.6.13)
In this example, we see one of the benefits of working with the c.d.f., the ability to deal with con-
tinuous or discrete random variables in the same framework, or indeed mixtures of the two (flip
a coin: if heads return the roll of a die, if tails return the distance of a dart throw from the center
of a dart board).
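The properties of the c.d.f. listed above can be observed numerically by accumulating the density. The sketch below reuses the two-bump density from the earlier plots; the wider grid, the index choices, and the names `F` and `prob` are our own assumptions for illustration.

```python
import math
import torch

epsilon = 0.01
x = torch.arange(-10, 10, epsilon, dtype=torch.float64)
# The same two-bump density used in the plots above
p = (0.2 * torch.exp(-(x - 3)**2 / 2) + 0.8 * torch.exp(-(x + 1)**2 / 2)) \
    / math.sqrt(2 * math.pi)

# F(x) is the running integral of p, approximated by a cumulative sum
F = torch.cumsum(epsilon * p, dim=0)
print(f'F at left end: {F[0].item():.4f}, F at right end: {F[-1].item():.4f}')

# Probability of landing in (-2, 3] as a difference of c.d.f. values
idx_a = 800   # x[800] is approximately -2
idx_b = 1300  # x[1300] is approximately 3
prob = (F[idx_b] - F[idx_a]).item()
print(f'P(-2 < X <= 3) is approximately {prob:.4f}')
```

The difference of c.d.f. values should be close to the probability obtained earlier by summing the density directly over the same region.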
Means
Suppose that we are dealing with a random variable X. The distribution itself can be hard to
interpret. It is often useful to be able to summarize the behavior of a random variable concisely.
Numbers that help us capture the behavior of a random variable are called summary statistics. The
most commonly encountered ones are the mean, the variance, and the standard deviation.

The mean encodes the average value of a random variable. If we have a discrete random variable
X, which takes the values $x_i$ with probabilities $p_i$, then the mean is given by the weighted average:
sum the values times the probability that the random variable takes on that value:
$$\mu_X = E[X] = \sum_i x_i p_i.$$ (19.6.14)
The way we should interpret the mean (albeit with caution) is that it tells us essentially where the
random variable tends to be located.
As a minimalistic example that we will examine throughout this section, letʼs take X to be the
random variable which takes the value a − 2 with probability p, a + 2 with probability p and a with
probability 1 − 2p. We can compute using (19.6.14) that, for any possible choice of a and p, the
mean is
$$\mu_X = E[X] = \sum_i x_i p_i = (a-2)p + a(1-2p) + (a+2)p = a.$$ (19.6.15)
Thus we see that the mean is a. This matches the intuition since a is the location around which
we centered our random variable.
Because they are helpful, letʼs summarize a few properties.
• For any random variable X and numbers a and b, we have that $\mu_{aX+b} = a\mu_X + b$.
• If we have two random variables X and Y , we have $\mu_{X+Y} = \mu_X + \mu_Y$.
Means are useful for understanding the average behavior of a random variable; however, the mean
alone is not sufficient for even a full intuitive understanding. Making a profit of $10 ± $1 per sale is
very different from making $10 ± $15 per sale despite having the same average value. The second
one has a much larger degree of fluctuation, and thus represents a much larger risk. Thus, to
understand the behavior of a random variable, we will need at minimum one more measure: some
measure of how widely a random variable fluctuates.
Variances
This leads us to consider the variance of a random variable. This is a quantitative measure of
how far a random variable deviates from the mean. Consider the expression X − µX . This is the
deviation of the random variable from its mean. This value can be positive or negative, so we need
to do something to make it positive so that we are measuring the magnitude of the deviation.
A reasonable thing to try is to look at |X − µX |, and indeed this leads to a useful quantity called the
mean absolute deviation, however due to connections with other areas of mathematics and statis-
tics, people often use a different solution.
In particular, they look at $(X-\mu_X)^2$. If we look at the typical size of this quantity by taking the
mean, we arrive at the variance
$$\sigma_X^2 = \mathrm{Var}(X) = E\left[(X-\mu_X)^2\right] = E[X^2] - \mu_X^2.$$ (19.6.16)
The last equality in (19.6.16) holds by expanding out the definition in the middle, and applying the
properties of expectation.
Letʼs look at our example where X is the random variable which takes the value a − 2 with
probability p, a + 2 with probability p and a with probability 1 − 2p. In this case µX = a, so all we
need to compute is $E\left[X^2\right]$. This can readily be done:
$$E\left[X^2\right] = (a-2)^2p + a^2(1-2p) + (a+2)^2p = a^2 + 8p.$$ (19.6.17)

Thus, we see that by (19.6.16) our variance is
$$\sigma_X^2 = \mathrm{Var}(X) = E[X^2] - \mu_X^2 = a^2 + 8p - a^2 = 8p.$$ (19.6.18)
This result again makes sense. The largest p can be is 1/2 which corresponds to picking a − 2 or
a + 2 with a coin flip. The variance of this being 4 corresponds to the fact that both a − 2 and
a + 2 are 2 units away from the mean, and $2^2 = 4$. On the other end of the spectrum, if p = 0, this
random variable always takes the value a and so it has no variance at all.
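The mean and variance of the running example can be reproduced directly from (19.6.14) and (19.6.16). In this small sketch the helper `mean_and_variance` and the particular values of `a` and `p` are our own illustrative choices.

```python
import torch

def mean_and_variance(values, probs):
    """Mean and variance of a discrete random variable from its p.m.f."""
    mean = torch.sum(values * probs)
    second_moment = torch.sum(values**2 * probs)
    return mean.item(), (second_moment - mean**2).item()

a, p = 3.0, 0.25  # arbitrary illustrative choices
values = torch.tensor([a - 2, a, a + 2], dtype=torch.float64)
probs = torch.tensor([p, 1 - 2 * p, p], dtype=torch.float64)

mu, var = mean_and_variance(values, probs)
print(f'mean: {mu} (compare with a = {a}), '
      f'variance: {var} (compare with 8p = {8 * p})')
```

Changing `a` shifts the mean but leaves the variance untouched, in line with the property Var(aX + b) = a² Var(X) listed below.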
We will list a few properties of variance below:
• For any random variable X, Var(X) ≥ 0, with Var(X) = 0 if and only if X is a constant.
• For any random variable X and numbers a and b, we have that $\mathrm{Var}(aX+b) = a^2\mathrm{Var}(X)$.
• If we have two independent random variables X and Y , we have Var(X + Y ) = Var(X) +
Var(Y ).
When interpreting these values, there can be a bit of a hiccup. In particular, letʼs try imagining
what happens if we keep track of units through this computation. Suppose that we are working
with the star rating assigned to a product on the web page. Then a, a−2, and a+2 are all measured
in units of stars. Similarly, the mean µX is then also measured in stars (being a weighted average).
However, if we get to the variance, we immediately encounter an issue, which is we want to look at
$(X-\mu_X)^2$, which is in units of squared stars. This means that the variance itself is not comparable
to the original measurements. To make it interpretable, we will need to return to our original
units.
Standard Deviations
This summary statistic can always be deduced from the variance by taking the square root! Thus
we define the standard deviation to be
$$\sigma_X = \sqrt{\mathrm{Var}(X)}.$$ (19.6.19)
In our example, this means the standard deviation is $\sigma_X = 2\sqrt{2p}$. If we are dealing
with units of stars for our review example, σX is again in units of stars.
The properties we had for the variance can be restated for the standard deviation.
• For any random variable X, σX ≥ 0.
• For any random variable X and numbers a and b, we have that $\sigma_{aX+b} = |a|\sigma_X$.
• If we have two independent random variables X and Y , we have $\sigma_{X+Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}$.
It is natural at this moment to ask, “If the standard deviation is in the units of our original random
variable, does it represent something we can draw with regards to that random variable?” The
answer is a resounding yes! Indeed much like the mean told us the typical location of our random
variable, the standard deviation gives the typical range of variation of that random variable. We
can make this rigorous with what is known as Chebyshevʼs inequality:
$$P\left(X \not\in [\mu_X - \alpha\sigma_X, \mu_X + \alpha\sigma_X]\right) \le \frac{1}{\alpha^2}.$$ (19.6.20)
Or to state it verbally in the case of α = 10, at least 99% of the samples from any random variable
fall within 10 standard deviations of the mean. This gives an immediate interpretation to our
standard summary statistics.

To see how this statement is rather subtle, letʼs take a look at our running example again where X
is the random variable which takes the value a − 2 with probability p, a + 2 with probability p and
a with probability 1 − 2p. We saw that the mean was a and the standard deviation was $2\sqrt{2p}$. This
means, if we take Chebyshevʼs inequality (19.6.20) with α = 2, we see that the expression is
$$P\left(X \not\in [a - 4\sqrt{2p}, a + 4\sqrt{2p}]\right) \le \frac{1}{4}.$$ (19.6.21)
This means that at least 75% of the time, this random variable will fall within this interval for any
value of p. Now, notice that as p → 0, this interval also converges to the single point a. But we know
that our random variable takes the values a − 2, a, and a + 2 only, so eventually we can be certain
a − 2 and a + 2 will fall outside the interval! The question is, at what p does that happen? So we want
to solve: for what p does $a + 4\sqrt{2p} = a + 2$, which is solved when p = 1/8, which is exactly the
first p where it could possibly happen without violating our claim that no more than 1/4 of samples
from the distribution would fall outside the interval (1/8 to the left, and 1/8 to the right).
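Because the running example has only three possible values, the probability of landing outside the Chebyshev interval can be computed exactly rather than estimated. The helper `prob_outside` below is a hypothetical name introduced for this sketch; it simply adds up the probabilities of the values lying strictly outside the interval.

```python
import math

def prob_outside(a, p, alpha=2.0):
    """Exact P(X outside [a - alpha*sigma, a + alpha*sigma]) for the three-point example."""
    sigma = 2 * math.sqrt(2 * p)  # standard deviation derived above
    half_width = alpha * sigma
    outside = 0.0
    for value, prob in [(a - 2, p), (a, 1 - 2 * p), (a + 2, p)]:
        if abs(value - a) > half_width:
            outside += prob
    return outside

for p in [0.2, 0.125, 0.05]:
    print(f'p = {p}: P(outside) = {prob_outside(0.0, p)}, bound = 0.25')
```

For p just below 1/8 the outside probability is 2p, which is just below 1/4, so the bound is nearly attained.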
Letʼs visualize this. We will show the probability of getting the three values as three vertical bars
with height proportional to the probability. The interval will be drawn as a horizontal line in the
middle. The first plot shows what happens for p > 1/8 where the interval safely contains all points.
# Define a helper to plot these figures
def plot_chebyshev(a, p):
    d2l.set_figsize()
    # `use_line_collection` is omitted: recent Matplotlib releases no longer accept it
    d2l.plt.stem([a-2, a, a+2], [p, 1-2*p, p])
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.ylabel('p.m.f.')

    # Horizontal bar for the Chebyshev interval, with small ticks at its ends
    d2l.plt.hlines(0.5, a - 4 * torch.sqrt(2 * p),
                   a + 4 * torch.sqrt(2 * p), 'black', lw=4)
    d2l.plt.vlines(a - 4 * torch.sqrt(2 * p), 0.53, 0.47, 'black', lw=1)
    d2l.plt.vlines(a + 4 * torch.sqrt(2 * p), 0.53, 0.47, 'black', lw=1)
    d2l.plt.title(f'p = {p:.3f}')

    d2l.plt.show()

# Plot interval when p > 1/8
plot_chebyshev(0.0, torch.tensor(0.2))

The second shows that at p = 1/8, the interval exactly touches the two points. This shows that the
inequality is sharp, since no smaller interval could be taken while keeping the inequality true.
# Plot interval when p = 1/8
plot_chebyshev(0.0, torch.tensor(0.125))

The third shows that for p < 1/8 the interval only contains the center. This does not invalidate the
inequality since we only needed to ensure that no more than 1/4 of the probability falls outside
the interval, which means that once p < 1/8, the two points at a − 2 and a + 2 can be discarded.
# Plot interval when p < 1/8 (0.05 is an illustrative choice)
plot_chebyshev(0.0, torch.tensor(0.05))


Means and Variances in the Continuum
This has all been in terms of discrete random variables, but the case of continuous random vari-
ables is similar. To intuitively understand how this works, imagine that we split the real number
line into intervals of length ϵ given by (ϵi, ϵ(i + 1)]. Once we do this, our continuous random
variable has been made discrete and we can use (19.6.14) to say that
$$\begin{aligned} \mu_X & \approx \sum_{i} (\epsilon i)P(X \in (\epsilon i, \epsilon (i+1)]) \\ & \approx \sum_{i} (\epsilon i)p_X(\epsilon i)\epsilon, \end{aligned}$$ (19.6.22)
where $p_X$ is the density of X. This is an approximation to the integral of $xp_X(x)$, so we can
conclude that
$$\mu_X = \int_{-\infty}^\infty xp_X(x) \; dx.$$ (19.6.23)
Similarly, using (19.6.16) the variance can be written as

$$\sigma^2_X = E[X^2] - \mu_X^2 = \int_{-\infty}^\infty x^2p_X(x) \; dx - \left(\int_{-\infty}^\infty xp_X(x) \; dx\right)^2.$$ (19.6.24)
Everything stated above about the mean, the variance, and the standard deviation still applies in
this case. For instance, if we consider the random variable with density
$$p(x) = \begin{cases} 1 & x \in [0,1], \\ 0 & \textrm{otherwise}, \end{cases}$$ (19.6.25)
we can compute
$$\mu_X = \int_{-\infty}^\infty xp(x) \; dx = \int_0^1 x \; dx = \frac{1}{2},$$ (19.6.26)
and
$$\sigma_X^2 = \int_{-\infty}^\infty x^2p(x) \; dx - \left(\frac{1}{2}\right)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}.$$ (19.6.27)
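These two values are easy to confirm with the same slicing argument used to derive (19.6.23). The following sketch evaluates the density at the midpoints of ϵ-wide slices of [0, 1]; the step size and variable names are our own choices.

```python
import torch

epsilon = 1e-4  # illustrative slice width
# Midpoints of the epsilon-wide slices of [0, 1], where the density equals 1
x = torch.arange(0, 1, epsilon, dtype=torch.float64) + epsilon / 2
p = torch.ones_like(x)

mu = torch.sum(epsilon * x * p).item()                # approximates E[X]
second_moment = torch.sum(epsilon * x**2 * p).item()  # approximates E[X^2]
var = second_moment - mu**2

print(f'mean: {mu:.4f} (expect 1/2), variance: {var:.4f} (expect 1/12)')
```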
As a warning, letʼs examine one more example, known as the Cauchy distribution. This is the dis-
tribution with p.d.f. given by
$$p(x) = \frac{1}{1+x^2}.$$ (19.6.28)
# Plot the Cauchy distribution p.d.f.

p = 1 / (1 + x**2)
d2l.plot(x, p, 'x', 'p.d.f.')

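This function looks innocent, and indeed consulting a table of integrals will show it has finite area
under it (the area is π, so dividing p(x) by π gives a genuine density). It thus defines a continuous
random variable, and since a constant factor does not affect whether an integral is finite, we keep
working with p(x) as written.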
To see what goes astray, letʼs try to compute the variance of this. This would involve using (19.6.16)
and computing
$$\int_{-\infty}^\infty \frac{x^2}{1+x^2}\;dx.$$ (19.6.29)
The function on the inside looks like this:
# Plot the integrand needed to compute the variance

x = torch.arange(-20, 20, 0.01)
p = x**2 / (1 + x**2)
d2l.plot(x, p, 'x', 'integrand')
This function clearly has infinite area under it since it is essentially the constant one with a small
dip near zero, and indeed we could show that
$$\int_{-\infty}^\infty \frac{x^2}{1+x^2}\;dx = \infty.$$ (19.6.30)
This means it does not have a well-defined finite variance.

However, looking deeper shows an even more disturbing result. Letʼs try to compute the mean
using (19.6.14). Using the change of variables formula, we see
$$\mu_X = \int_{-\infty}^{\infty} \frac{x}{1+x^2} \; dx = \frac{1}{2}\int_1^\infty \frac{1}{u} \; du.$$ (19.6.31)
The integral inside is the definition of the logarithm, so this is in essence log(∞) = ∞, so there is
no well-defined average value either!
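One way to see the divergence numerically is to truncate the integrals to a finite window and watch what happens as the window grows. The helper `truncated_integrals` below is a hypothetical name for this sketch; it returns a Riemann sum of x²/(1 + x²) over [−L, L] and of x/(1 + x²) over the positive half [0, L].

```python
import math
import torch

def truncated_integrals(L, epsilon=0.01):
    """Riemann sums over [-L, L] and [0, L] for the two integrands in the text."""
    x = torch.arange(-L, L, epsilon, dtype=torch.float64)
    variance_part = torch.sum(epsilon * x**2 / (1 + x**2)).item()
    x_pos = torch.arange(0, L, epsilon, dtype=torch.float64)
    mean_part = torch.sum(epsilon * x_pos / (1 + x_pos**2)).item()
    return variance_part, mean_part

for L in [10, 100, 1000]:
    v, m = truncated_integrals(L)
    print(f'L = {L}: variance integrand {v:.2f}, '
          f'positive half of mean integrand {m:.3f}')
```

The first quantity grows roughly linearly in L and the second grows like log(L), so neither settles down to a finite limit.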
Machine learning scientists define their models so that we most often do not need to deal with
these issues, and will in the vast majority of cases deal with random variables with well-defined
means and variances. However, every so often random variables with heavy tails (that is those
random variables where the probabilities of getting large values are large enough to make things
like the mean or variance undefined) are helpful in modeling physical systems, thus it is worth
knowing that they exist.
Joint Density Functions
The above work all assumes we are working with a single real valued random variable. But what
if we are dealing with two or more potentially highly correlated random variables? This
circumstance is the norm in machine learning: imagine random variables like $R_{i, j}$ which encode the
red value of the pixel at the (i, j) coordinate in an image, or $P_t$ which is a random variable given by
a stock price at time t. Nearby pixels tend to have similar color, and nearby times tend to have
similar prices. We cannot treat them as separate random variables, and expect to create a success-
ful model (we will see in Section 19.9 a model that under-performs due to such an assumption).
We need to develop the mathematical language to handle these correlated continuous random
variables.
Thankfully, with the multiple integrals in Section 19.5 we can develop such a language. Suppose
that we have, for simplicity, two random variables X, Y which can be correlated. Then, similar to
the case of a single variable, we can ask the question:
P (X is in an ϵ-sized interval around x and Y is in an ϵ-sized interval around y). (19.6.32)
Similar reasoning to the single variable case shows that this should be approximately
$$P(X \;\textrm{is in an}\; \epsilon\textrm{-sized interval around}\; x \;\textrm{and}\; Y \;\textrm{is in an}\; \epsilon\textrm{-sized interval around}\; y) \approx \epsilon^{2}p(x, y),$$ (19.6.33)
for some function p(x, y). This is referred to as the joint density of X and Y . Similar properties
are true for this as we saw in the single variable case. Namely:
• p(x, y) ≥ 0;
• $\int_{\mathbb{R}^2} p(x, y) \;dx \;dy = 1$;
• $P((X, Y) \in \mathcal{D}) = \int_{\mathcal{D}} p(x, y) \;dx \;dy$.
In this way, we can deal with multiple, potentially correlated random variables. If we wish to
work with more than two random variables, we can extend the multivariate density to as many
coordinates as desired by considering p(x) = p(x1 , . . . , xn ). The same properties of being non-
negative, and having total integral of one still hold.

Marginal Distributions
When dealing with multiple variables, we oftentimes want to be able to ignore the relationships
and ask, “how is this one variable distributed?” Such a distribution is called a marginal distribution.
To be concrete, letʼs suppose that we have two random variables X, Y with joint density given by
pX,Y (x, y). We will be using the subscript to indicate what random variables the density is for. The
question of finding the marginal distribution is taking this function, and using it to find pX (x).
As with most things, it is best to return to the intuitive picture to figure out what should be true.
Recall that the density is the function pX so that
P (X ∈ [x, x + ϵ]) ≈ ϵ · pX (x). (19.6.34)
There is no mention of Y , but if all we are given is pX,Y , we need to include Y somehow. We can
first observe that this is the same as
P (X ∈ [x, x + ϵ], and Y ∈ R) ≈ ϵ · pX (x). (19.6.35)
Our density does not directly tell us about what happens in this case, we need to split into small
intervals in y as well, so we can write this as
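$$\begin{aligned} \epsilon \cdot p_X(x) & \approx \sum_{i} P(X \in [x, x+\epsilon] \textrm{, and } Y \in [\epsilon \cdot i, \epsilon \cdot (i+1)]) \\ & \approx \sum_{i} \epsilon^{2} p_{X, Y}(x, \epsilon\cdot i). \end{aligned}$$ (19.6.36)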
Fig. 19.6.1: By summing along the columns of our array of probabilities, we are able to obtain the
marginal distribution for just the random variable represented along the x-axis.
This tells us to add up the value of the density along a series of squares in a line as is shown in Fig.
19.6.1. Indeed, after canceling one factor of epsilon from both sides, and recognizing the sum on
the right is the integral over y, we can conclude that
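$$\begin{aligned} p_X(x) & \approx \sum_{i} \epsilon p_{X, Y}(x, \epsilon\cdot i) \\ & \approx \int_{-\infty}^\infty p_{X, Y}(x, y) \; dy. \end{aligned}$$ (19.6.37)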

Thus we see
$$p_X(x) = \int_{-\infty}^\infty p_{X, Y}(x, y) \; dy.$$ (19.6.38)
This tells us that to get a marginal distribution, we integrate over the variables we do not care
about. This process is often referred to as integrating out or marginalizing out the unneeded
variables.
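The summation along a line of squares can be carried out directly on a grid. As an illustration, the sketch below assumes a particular joint density of our own choosing: X is standard normal and Y equals X plus independent standard normal noise, so the marginal of X is known to be standard normal and can be used for comparison.

```python
import math
import torch

epsilon = 0.05  # illustrative grid spacing
grid = torch.arange(-8, 8, epsilon, dtype=torch.float64)
x, y = grid[:, None], grid[None, :]

# A joint density for correlated X and Y (illustrative choice):
# X is standard normal and Y = X + independent standard normal noise
p_xy = torch.exp(-x**2 / 2) * torch.exp(-(y - x)**2 / 2) / (2 * math.pi)

# Integrate out y: sum along the y-axis and multiply by the slice width
p_x = torch.sum(epsilon * p_xy, dim=1)

# Compare with the known marginal, a standard normal density
p_x_exact = torch.exp(-grid**2 / 2) / math.sqrt(2 * math.pi)
max_error = torch.max(torch.abs(p_x - p_x_exact)).item()
print(f'largest deviation from the exact marginal: {max_error:.2e}')
```

Summing along `dim=0` instead would integrate out x and yield the marginal of Y.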
Covariance
When dealing with multiple random variables, there is one additional summary statistic which
is helpful to know: the covariance. This measures the degree to which two random variables fluctuate
together.
Suppose that we have two random variables X and Y . To begin with, letʼs suppose they are discrete,
taking on values $(x_i, y_j)$ with probability $p_{ij}$. In this case, the covariance is defined as
$$\sigma_{XY} = \mathrm{Cov}(X, Y) = \sum_{i, j} (x_i - \mu_X)(y_j - \mu_Y) p_{ij} = E[XY] - E[X]E[Y].$$ (19.6.39)
To think about this intuitively: consider the following pair of random variables. Suppose that X
takes the values 1 and 3, and Y takes the values −1 and 3. Suppose that we have the following
probabilities
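$$\begin{aligned} P(X = 1 \; \textrm{and} \; Y = -1) & = \frac{p}{2}, \\ P(X = 1 \; \textrm{and} \; Y = 3) & = \frac{1-p}{2}, \\ P(X = 3 \; \textrm{and} \; Y = -1) & = \frac{1-p}{2}, \\ P(X = 3 \; \textrm{and} \; Y = 3) & = \frac{p}{2}, \end{aligned}$$ (19.6.40)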
where p is a parameter in [0, 1] we get to pick. Notice that if p = 1 then they are both always
their minimum or maximum values simultaneously, and if p = 0 they are guaranteed to take their
flipped values simultaneously (one is large when the other is small and vice versa). If p = 1/2,
then the four possibilities are all equally likely, and neither should be related. Letʼs compute the
covariance. First, note µX = 2 and µY = 1, so we may compute using (19.6.39):
$$\begin{aligned} \mathrm{Cov}(X, Y) & = \sum_{i, j} (x_i - \mu_X)(y_j - \mu_Y) p_{ij} \\ & = (1-2)(-1-1)\frac{p}{2} + (1-2)(3-1)\frac{1-p}{2} + (3-2)(-1-1)\frac{1-p}{2} + (3-2)(3-1)\frac{p}{2} \\ & = 4p-2. \end{aligned}$$ (19.6.41)
When p = 1 (the case where they are both maximally positive or negative at the same time), the
covariance is 2. When p = 0 (the case where they are flipped) the covariance is −2. Finally, when
p = 1/2 (the case where they are unrelated), the covariance is 0. Thus we see that the covariance
measures how these two random variables are related.
A quick note on the covariance is that it only measures these linear relationships. More complex
relationships like $X = Y^2$ where Y is randomly chosen from {−2, −1, 0, 1, 2} with equal probability
can be missed. Indeed a quick computation shows that these random variables have covariance
zero, despite one being a deterministic function of the other.
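Both computations above can be reproduced from a table of joint probabilities. The function `covariance` below is a helper written for this sketch (it is not a library routine), and the value p = 0.8 is an arbitrary illustrative choice.

```python
import torch

def covariance(x_vals, y_vals, joint):
    """Covariance from a joint p.m.f. table with joint[i, j] = P(X = x_i, Y = y_j)."""
    mu_x = torch.sum(x_vals[:, None] * joint)
    mu_y = torch.sum(y_vals[None, :] * joint)
    dev = (x_vals[:, None] - mu_x) * (y_vals[None, :] - mu_y)
    return torch.sum(dev * joint).item()

# The two-by-two example above, for a particular p
p = 0.8
x_vals = torch.tensor([1.0, 3.0], dtype=torch.float64)
y_vals = torch.tensor([-1.0, 3.0], dtype=torch.float64)
joint = torch.tensor([[p / 2, (1 - p) / 2],
                      [(1 - p) / 2, p / 2]], dtype=torch.float64)
cov_example = covariance(x_vals, y_vals, joint)  # compare with 4 * p - 2

# X = Y^2 with Y uniform on {-2, -1, 0, 1, 2}
y2_vals = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0], dtype=torch.float64)
x2_vals = torch.tensor([0.0, 1.0, 4.0], dtype=torch.float64)
joint2 = torch.zeros(3, 5, dtype=torch.float64)
for j, y in enumerate(y2_vals.tolist()):
    i = x2_vals.tolist().index(y**2)  # row of the value X = y^2
    joint2[i, j] = 1 / 5
cov_square = covariance(x2_vals, y2_vals, joint2)

print(f'two-by-two example: {cov_example:.4f}, X = Y^2 example: {cov_square:.4f}')
```

The second covariance vanishes even though X is completely determined by Y, which illustrates that covariance only detects linear relationships.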

For continuous random variables, much the same story holds. At this point, we are pretty comfort-
able with doing the transition between discrete and continuous, so we will provide the continuous
analogue of (19.6.39) without any derivation.
$$\sigma_{XY} = \int_{\mathbb{R}^2} (x-\mu_X)(y-\mu_Y)p(x, y) \;dx \;dy.$$ (19.6.42)
For visualization, letʼs take a look at a collection of random variables with tunable covariance.
# Plot a few random variables with adjustable covariance
covs = [-0.9, 0.0, 1.2]
d2l.plt.figure(figsize=(12, 3))
for i in range(3):
    # X has unit variance, so Cov(X, Y) equals the coefficient covs[i]
    X = torch.randn(500)
    Y = covs[i]*X + torch.randn(500)

    d2l.plt.subplot(1, 4, i+1)
    d2l.plt.scatter(X.numpy(), Y.numpy())
    d2l.plt.xlabel('X')
    d2l.plt.ylabel('Y')
    d2l.plt.title(f'cov = {covs[i]}')
d2l.plt.show()
Letʼs see some properties of covariances:

• For any random variable X, Cov(X, X) = Var(X).
• For any random variables X, Y and numbers a and b, Cov(aX + b, Y ) = Cov(X, aY + b) =
aCov(X, Y ).
• If X and Y are independent then Cov(X, Y ) = 0.
In addition, we can use the covariance to expand a relationship we saw before. Recall that if X
and Y are two independent random variables then
Var(X + Y ) = Var(X) + Var(Y ). (19.6.43)
With knowledge of covariances, we can expand this relationship. Indeed, some algebra can show
that in general,
Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ). (19.6.44)

This allows us to generalize the variance summation rule for correlated random variables.
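As a sanity check, the expanded rule (19.6.44) can be verified on simulated data. The sketch below uses only the Python standard library; the coefficient 0.8, the sample size, and the helper `cov` are illustrative assumptions, not part of the text above.

```python
import random

random.seed(0)
n = 10_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.8 * x + random.gauss(0, 1) for x in xs]  # correlated with xs by construction

def cov(u, v):
    """Sample covariance (dividing by n) of two equally long lists."""
    mu_u, mu_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / len(u)

sums = [x + y for x, y in zip(xs, ys)]
var_of_sum = cov(sums, sums)                             # Var(X + Y)
expanded = cov(xs, xs) + cov(ys, ys) + 2 * cov(xs, ys)   # right side of (19.6.44)

print(var_of_sum, expanded)
```

The two numbers agree up to floating-point rounding, and both exceed Var(X) + Var(Y) because the covariance term is positive here.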
Correlation
As we did in the case of means and variances, letʼs now consider units. If X is measured in one
unit (say inches), and Y is measured in another (say dollars), the covariance is measured in the
product of these two units inches × dollars. These units can be hard to interpret. What we will
often want in this case is a unit-less measurement of relatedness. Indeed, often we do not care
about exact quantitative correlation, but rather ask if the correlation is in the same direction, and
how strong the relationship is.
To see what makes sense, letʼs perform a thought experiment. Suppose that we convert our random
variables in inches and dollars to be in inches and cents. In this case the random variable Y is
multiplied by 100. If we work through the definition, this means that Cov(X, Y) will be multiplied
by 100. Thus we see that in this case a change of units changes the covariance by a factor of 100.
Thus, to find our unit-invariant measure of correlation, we will need to divide by something else
that also gets scaled by 100. We have a clear candidate, the standard deviation! Indeed if
we define the correlation coefficient to be
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}, \tag{19.6.45}$$
we see that this is a unit-less value. A little mathematics can show that this number is between −1
and 1 with 1 meaning maximally positively correlated, whereas −1 means maximally negatively
correlated.
Returning to our explicit discrete example above, we can see that σX = 1 and σY = 2, so we can
compute the correlation between the two random variables using (19.6.45) to see that
$$\rho(X, Y) = \frac{4p - 2}{1 \cdot 2} = 2p - 1. \tag{19.6.46}$$
This now ranges between −1 and 1 with the expected behavior of 1 meaning maximally positively
correlated, and −1 meaning maximally negatively correlated.
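The covariance 4p − 2 and the correlation 2p − 1 can both be recomputed directly from the joint probabilities in (19.6.40). This is a minimal sketch in plain Python; the function name `cov_and_corr` is hypothetical.

```python
def cov_and_corr(p):
    """Covariance and correlation of the two-valued pair (X, Y) from (19.6.40)."""
    # Joint pmf: keys are (x, y) outcomes, values are their probabilities.
    pmf = {(1, -1): p / 2, (1, 3): (1 - p) / 2,
           (3, -1): (1 - p) / 2, (3, 3): p / 2}
    mu_x = sum(x * q for (x, _), q in pmf.items())
    mu_y = sum(y * q for (_, y), q in pmf.items())
    cov = sum((x - mu_x) * (y - mu_y) * q for (x, y), q in pmf.items())
    var_x = sum((x - mu_x) ** 2 * q for (x, _), q in pmf.items())
    var_y = sum((y - mu_y) ** 2 * q for (_, y), q in pmf.items())
    return cov, cov / (var_x * var_y) ** 0.5

for p in [0.0, 0.25, 0.5, 0.75, 1.0]:
    cov, rho = cov_and_corr(p)
    print(f'p = {p:.2f}: cov = {cov:+.2f}, rho = {rho:+.2f}')
```

The marginal variances do not depend on p, so dividing by σX σY = 2 simply rescales the covariance into the interval [−1, 1].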
As another example, consider X as any random variable, and Y = aX + b as any linear determin-
istic function of X. Then, one can compute that
$$\sigma_Y = \sigma_{aX+b} = |a|\sigma_X, \tag{19.6.47}$$
$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, aX + b) = a\mathrm{Cov}(X, X) = a\mathrm{Var}(X), \tag{19.6.48}$$

and thus by (19.6.45) that
$$\rho(X, Y) = \frac{a\mathrm{Var}(X)}{|a|\sigma_X^2} = \frac{a}{|a|} = \mathrm{sign}(a). \tag{19.6.49}$$
Thus we see that the correlation is +1 for any a > 0, and −1 for any a < 0, illustrating that
correlation measures the degree and direction of the relationship between the two random
variables, not the scale that the variation takes.
Letʼs again plot a collection of random variables with tunable correlation.

# Plot a few random variables with adjustable correlations
cors = [-0.9, 0.0, 1.0]
d2l.plt.figure(figsize=(12, 3))
for i in range(3):
    # Mixing X with independent noise gives Y unit variance and correlation cors[i]
    X = torch.randn(500)
    Y = cors[i] * X + torch.sqrt(torch.tensor(1.0) -
                                 cors[i]**2) * torch.randn(500)

    d2l.plt.subplot(1, 4, i + 1)
    d2l.plt.scatter(X.numpy(), Y.numpy())
    d2l.plt.xlabel('X')
    d2l.plt.ylabel('Y')
    d2l.plt.title(f'cor = {cors[i]}')
d2l.plt.show()
Letʼs list a few properties of the correlation below.

• For any random variable X, ρ(X, X) = 1.
• For any random variables X, Y and numbers a > 0 and b, ρ(aX + b, Y ) = ρ(X, aY + b) = ρ(X, Y ); for a < 0 the sign flips.
• If X and Y are independent with non-zero variance then ρ(X, Y ) = 0.
As a final note, you may feel like some of these formulae are familiar. Indeed, if we expand
everything out assuming that µX = µY = 0, we see that this is
$$\rho(X, Y) = \frac{\sum_{i,j} x_i y_j p_{ij}}{\sqrt{\sum_{i,j} x_i^2 p_{ij}} \sqrt{\sum_{i,j} y_j^2 p_{ij}}}. \tag{19.6.50}$$
This looks like a sum of a product of terms divided by the square root of sums of terms. This is
exactly the formula for the cosine of the angle between two vectors v, w with the different coordi-
nates weighted by pij :
$$\cos(\theta) = \frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\| \|\mathbf{w}\|} = \frac{\sum_i v_i w_i}{\sqrt{\sum_i v_i^2} \sqrt{\sum_i w_i^2}}. \tag{19.6.51}$$
Indeed if we think of norms as being related to standard deviations, and correlations as being
cosines of angles, much of the intuition we have from geometry can be applied to thinking about
random variables.
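To make the geometric picture concrete, the sketch below treats two samples as vectors, centers them, and compares the cosine of the angle between them with the sample correlation coefficient. It uses only the standard library and equal weights 1/n in place of pij; the data-generating coefficients 0.6 and 0.8 are illustrative assumptions chosen so that the true correlation is 0.6.

```python
import math
import random

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(1000)]
ys = [0.6 * x + 0.8 * random.gauss(0, 1) for x in xs]  # true correlation 0.6

def centered(v):
    """Subtract the sample mean from every entry."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

v, w = centered(xs), centered(ys)

# Cosine of the angle between the centered sample vectors, as in (19.6.51)
cos_theta = dot(v, w) / (math.sqrt(dot(v, v)) * math.sqrt(dot(w, w)))

# Sample correlation coefficient, as in (19.6.45)
n = len(xs)
cov_xy = dot(v, w) / n
sigma_x, sigma_y = math.sqrt(dot(v, v) / n), math.sqrt(dot(w, w) / n)
rho = cov_xy / (sigma_x * sigma_y)

print(cos_theta, rho)
```

The factors of 1/n cancel between numerator and denominator, so the two quantities coincide up to rounding.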

19.6.2 Summary
• Continuous random variables are random variables that can take on a continuum of values.
They have some technical difficulties that make them more challenging to work with com-
pared to discrete random variables.
• The probability density function allows us to work with continuous random variables by
giving a function where the area under the curve on some interval gives the probability of
finding a sample point in that interval.
• The cumulative distribution function is the probability of observing the random variable to
be less than a given threshold. It can provide a useful alternate viewpoint which unifies
discrete and continuous variables.
• The mean is the average value of a random variable.
• The variance is the expected square of the difference between the random variable and its
mean.
• The standard deviation is the square root of the variance. It can be thought of as measuring
the range of values the random variable may take.
• Chebyshevʼs inequality allows us to make this intuition rigorous by giving an explicit interval
that contains the random variable most of the time.
• Joint densities allow us to work with correlated random variables. We may marginalize joint
densities by integrating over unwanted random variables to get the distribution of the de-
sired random variable.
• The covariance and correlation coefficient provide a way to measure any linear relationship
between two correlated random variables.
Exercises
1. Suppose that we have the random variable with density given by $p(x) = \frac{1}{x^2}$ for x ≥ 1 and
p(x) = 0 otherwise. What is P (X > 2)?
2. The Laplace distribution is a random variable whose density is given by $p(x) = \frac{1}{2}e^{-|x|}$. What
is the mean and the standard deviation of this function? As a hint, $\int_0^\infty xe^{-x} \, dx = 1$ and
$\int_0^\infty x^2 e^{-x} \, dx = 2$.
3. I walk up to you on the street and say “I have a random variable with mean 1, standard devi-
ation 2, and I observed 25% of my samples taking a value larger than 9.” Do you believe me?
Why or why not?
4. Suppose that you have two random variables X, Y , with joint density given by pXY (x, y) =
4xy for x, y ∈ [0, 1] and pXY (x, y) = 0 otherwise. What is the covariance of X and Y ?

19.7 Maximum Likelihood
One of the most commonly encountered ways of thinking in machine learning is the maximum
likelihood point of view. This is the concept that when working with a probabilistic model with
unknown parameters, the parameters which make the data have the highest probability are the
most likely ones.
19.7.1 The Maximum Likelihood Principle
This has a Bayesian interpretation which can be helpful to think about. Suppose that we have a
model with parameters θ and a collection of data examples X. For concreteness, we can imagine
that θ is a single value representing the probability that a coin comes up heads when flipped, and
X is a sequence of independent coin flips. We will look at this example in depth later.
If we want to find the most likely value for the parameters of our model, that means we want to
find
argmax P (θ | X). (19.7.1)
By Bayesʼ rule, this is the same thing as
$$\operatorname{argmax} \frac{P(X \mid \theta) P(\theta)}{P(X)}. \tag{19.7.2}$$
The expression P(X), a parameter-agnostic probability of generating the data, does not depend
on θ at all, and so can be dropped without changing the best choice of θ. Similarly, we may now
posit that we have no prior assumption on which set of parameters is better than any other, so
we may declare that P(θ) does not depend on θ either! This, for instance, makes sense in our
coin flipping example where the probability it comes up heads could be any value in [0, 1] without
any prior belief it is fair or not (often referred to as an uninformative prior). Thus we see that our
application of Bayesʼ rule shows that our best choice of θ is the maximum likelihood estimate for
θ:
$$\hat{\theta} = \operatorname*{argmax}_{\theta} P(X \mid \theta). \tag{19.7.3}$$
As a matter of common terminology, the probability of the data given the parameters (P (X | θ))
is referred to as the likelihood.
A Concrete Example
Letʼs see how this works in a concrete example. Suppose that we have a single parameter θ rep-
resenting the probability that a coin flip is heads. Then the probability of getting a tails is 1 − θ,
and so if our observed data X is a sequence with nH heads and nT tails, we can use the fact that
independent probabilities multiply to see that
$$P(X \mid \theta) = \theta^{n_H} (1 - \theta)^{n_T}. \tag{19.7.4}$$
If we flip 13 coins and get the sequence “HHHTHTTHHHHHT”, which has nH = 9 and nT = 4, we
see that this is
$$P(X \mid \theta) = \theta^9 (1 - \theta)^4. \tag{19.7.5}$$

One nice thing about this example will be that we know the answer going in. Indeed, if we said
verbally, “I flipped 13 coins, and 9 came up heads, what is our best guess for the probability that
the coin comes up heads?”, everyone would correctly guess 9/13. What this maximum likelihood
method will give us is a way to get that number from first principles in a way that will generalize
to vastly more complex situations.
For our example, the plot of P (X | θ) is as follows:
%matplotlib inline
import torch
from d2l import torch as d2l

theta = torch.arange(0, 1, 0.001)
p = theta**9 * (1 - theta)**4.
d2l.plot(theta, p, 'theta', 'likelihood')
This has its maximum value somewhere near our expected 9/13 ≈ 0.69. To see if it is exactly
there, we can turn to calculus. Notice that at the maximum, the function is flat, that is, its
derivative is zero. Thus, we could find the maximum likelihood estimate (19.7.1) by finding the
values of θ where the derivative is zero, and finding the one that gives the highest probability.
We compute:
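$$\begin{aligned}
0 &= \frac{d}{d\theta} P(X \mid \theta) \\
&= \frac{d}{d\theta} \theta^9 (1 - \theta)^4 \\
&= 9\theta^8 (1 - \theta)^4 - 4\theta^9 (1 - \theta)^3 \\
&= \theta^8 (1 - \theta)^3 (9 - 13\theta).
\end{aligned} \tag{19.7.6}$$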
This has three solutions: 0, 1 and 9/13. The first two are clearly minima, not maxima as they assign
probability 0 to our sequence. The final value does not assign zero probability to our sequence,
and thus must be the maximum likelihood estimate θ̂ = 9/13.
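The calculus result can be cross-checked numerically. The sketch below, in plain Python, evaluates the likelihood on a fine grid and also plugs 9/13 into the factored derivative from (19.7.6); the grid resolution and the helper names are illustrative choices.

```python
n_heads, n_tails = 9, 4

def likelihood(theta):
    """P(X | theta) for 9 heads and 4 tails, as in (19.7.5)."""
    return theta**n_heads * (1 - theta)**n_tails

def d_likelihood(theta):
    """Factored derivative of the likelihood, as in (19.7.6)."""
    return theta**8 * (1 - theta)**3 * (9 - 13 * theta)

grid = [i / 10000 for i in range(10001)]
theta_grid = max(grid, key=likelihood)          # best grid point
theta_hat = n_heads / (n_heads + n_tails)       # closed-form estimate 9/13

print(theta_grid, theta_hat, d_likelihood(theta_hat))
```

The best grid point sits next to 9/13, the derivative there is zero up to rounding, and the two boundary solutions 0 and 1 give likelihood exactly zero.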

19.7.2 Numerical Optimization and the Negative Log-Likelihood
The previous example is nice, but what if we have billions of parameters and data examples?
First, notice that if we make the assumption that all the data examples are independent, we can
no longer practically consider the likelihood itself as it is a product of many probabilities. Indeed,
each probability is in [0, 1], say typically of value about 1/2, and the product $(1/2)^{1000000000}$ is far
below machine precision. We cannot work with that directly.
However, recall that the logarithm turns products to sums, in which case
$$\log_{10}((1/2)^{1000000000}) = 1000000000 \cdot \log_{10}(1/2) \approx -301029995.6 \ldots \tag{19.7.7}$$
This number fits perfectly within even a single precision 32-bit float. Thus, we should consider
the log-likelihood, which is
log(P (X | θ)). (19.7.8)
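The underflow problem already shows up with far fewer than a billion factors. The sketch below uses ordinary Python floats (64-bit) and only 2000 factors of 1/2, an illustrative choice; it uses the natural logarithm rather than base 10.

```python
import math

n = 2000
probs = [0.5] * n

likelihood = math.prod(probs)                      # product underflows to zero
log_likelihood = sum(math.log(q) for q in probs)   # sum of logs stays finite

print(likelihood)
print(log_likelihood)
```

The product is rounded to exactly zero, so every parameter setting would look equally bad, while the sum of logarithms is an unremarkable number of moderate size.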
Since the function x ↦ log(x) is increasing, maximizing the likelihood is the same thing as
maximizing the log-likelihood. Indeed in Section 19.9 we will see this reasoning applied when working
with the specific example of the naive Bayes classifier.
We often work with loss functions, where we wish to minimize the loss. We may turn maximum
likelihood into the minimization of a loss by taking − log(P (X | θ)), which is the negative log-
likelihood.
To illustrate this, consider the coin flipping problem from before, and pretend that we do not know
the closed form solution. We may compute that
$$-\log(P(X \mid \theta)) = -\log(\theta^{n_H} (1 - \theta)^{n_T}) = -(n_H \log(\theta) + n_T \log(1 - \theta)). \tag{19.7.9}$$
This can be written into code, and freely optimized even for billions of coin flips.
# Set up our data
n_H = 8675309
n_T = 25624

# Initialize our parameters
theta = torch.tensor(0.5, requires_grad=True)

# Perform gradient descent
lr = 0.00000000001
for _ in range(10):
    loss = -(n_H * torch.log(theta) + n_T * torch.log(1 - theta))
    loss.backward()
    with torch.no_grad():  # Update the leaf tensor without tracking the step
        theta -= lr * theta.grad
    theta.grad.zero_()

# Check output
theta, n_H / (n_H + n_T)
(tensor(0.5017, requires_grad=True), 0.9970550284664874)
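After only ten tiny steps the estimate has barely moved from its starting point of 0.5. One way to see the iteration actually reach the closed-form answer is to divide the loss by the number of flips, which makes the gradient scale independent of the dataset size. The following is a sketch under that assumption, using plain Python floats and a hand-derived gradient instead of autograd; the learning rate and iteration count are illustrative choices, not values from the text.

```python
n_H, n_T = 8675309, 25624
frac_heads = n_H / (n_H + n_T)

def avg_nll_grad(theta):
    """Derivative of the averaged loss -(f log(theta) + (1 - f) log(1 - theta))."""
    return -(frac_heads / theta - (1 - frac_heads) / (1 - theta))

theta, lr = 0.5, 0.001
for _ in range(3000):
    theta -= lr * avg_nll_grad(theta)

print(theta, frac_heads)
```

With the averaged loss the same learning rate would work for thirteen flips or for billions, because only the fraction of heads enters the gradient.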

Numerical convenience is not the only reason why people like to use negative log-likelihoods.
There are several other reasons why it is preferable.
The second reason we consider the log-likelihood is the simplified application of calculus rules. As
discussed above, due to independence assumptions, most probabilities we encounter in machine
learning are products of individual probabilities.
$$P(X \mid \theta) = p(x_1 \mid \theta) \cdot p(x_2 \mid \theta) \cdots p(x_n \mid \theta). \tag{19.7.10}$$
This means that if we directly apply the product rule to compute a derivative we get
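$$\begin{aligned}
\frac{\partial}{\partial\theta} P(X \mid \theta) &= \left(\frac{\partial}{\partial\theta} P(x_1 \mid \theta)\right) \cdot P(x_2 \mid \theta) \cdots P(x_n \mid \theta) \\
&\quad + P(x_1 \mid \theta) \cdot \left(\frac{\partial}{\partial\theta} P(x_2 \mid \theta)\right) \cdots P(x_n \mid \theta) \\
&\quad \vdots \\
&\quad + P(x_1 \mid \theta) \cdot P(x_2 \mid \theta) \cdots \left(\frac{\partial}{\partial\theta} P(x_n \mid \theta)\right).
\end{aligned} \tag{19.7.11}$$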
This requires n(n−1) multiplications, along with (n−1) additions, so it is proportional to quadratic
time in the inputs! Sufficient cleverness in grouping terms will reduce this to linear time, but it
requires some thought. For the negative log-likelihood we have instead
$$-\log(P(X \mid \theta)) = -\log(P(x_1 \mid \theta)) - \log(P(x_2 \mid \theta)) \cdots - \log(P(x_n \mid \theta)), \tag{19.7.12}$$
which then gives

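$$-\frac{\partial}{\partial\theta} \log(P(X \mid \theta)) = -\frac{1}{P(x_1 \mid \theta)} \left(\frac{\partial}{\partial\theta} P(x_1 \mid \theta)\right) - \cdots - \frac{1}{P(x_n \mid \theta)} \left(\frac{\partial}{\partial\theta} P(x_n \mid \theta)\right). \tag{19.7.13}$$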
This requires only n divides and n − 1 sums, and thus is linear time in the inputs.
The third and final reason to consider the negative log-likelihood is the relationship to information
theory, which we will discuss in detail in Section 19.11. This is a rigorous mathematical theory
which gives a way to measure the degree of information or randomness in a random variable.
The key object of study in that field is the entropy which is
$$H(p) = -\sum_i p_i \log_2(p_i), \tag{19.7.14}$$
which measures the randomness of a source. Notice that this is nothing more than the average
− log probability, and thus if we take our negative log-likelihood and divide by the number of data
examples, we get a relative of entropy known as cross-entropy. This theoretical interpretation
alone would be sufficiently compelling to motivate reporting the average negative log-likelihood
over the dataset as a way of measuring model performance.
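The link between the averaged negative log-likelihood and cross-entropy can be checked on the thirteen coin flips from the concrete example. In the sketch below the model parameter 0.6 is an arbitrary illustrative value, and logarithms are taken in base 2 to match (19.7.14).

```python
import math

flips = 'HHHTHTTHHHHHT'   # the 13 flips from the concrete example
theta = 0.6               # an arbitrary model parameter (assumption)

# Average negative log-likelihood over the dataset, in bits
avg_nll = -sum(math.log2(theta if f == 'H' else 1 - theta)
               for f in flips) / len(flips)

# Cross-entropy between the empirical distribution q and the model p
q = {'H': flips.count('H') / len(flips), 'T': flips.count('T') / len(flips)}
p = {'H': theta, 'T': 1 - theta}
cross_entropy = -sum(q[s] * math.log2(p[s]) for s in q)

# Entropy of the empirical distribution itself, as in (19.7.14)
entropy = -sum(q[s] * math.log2(q[s]) for s in q)

print(avg_nll, cross_entropy, entropy)
```

The first two numbers coincide, and both are at least as large as the entropy of the empirical distribution, with equality only when the model matches the observed frequencies.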
19.7.3 Maximum Likelihood for Continuous Variables
Everything that we have done so far assumes we are working with discrete random variables, but
what if we want to work with continuous ones?

The short summary is that nothing at all changes, except we replace all the instances of the proba-
bility with the probability density. Recalling that we write densities with lower case p, this means
that for example we now say
$$-\log(p(X \mid \theta)) = -\log(p(x_1 \mid \theta)) - \log(p(x_2 \mid \theta)) \cdots - \log(p(x_n \mid \theta)) = -\sum_i \log(p(x_i \mid \theta)). \tag{19.7.15}$$
The question becomes, “Why is this OK?” After all, the reason we introduced densities was because
the probability of getting any specific outcome was itself zero. Is the probability of
generating our data then not zero for every set of parameters?
Indeed, this is the case, and understanding why we can shift to densities is an exercise in tracing
what happens to the epsilons.
Letʼs first re-define our goal. Suppose that for continuous random variables we no longer want to
compute the probability of getting exactly the right value, but instead matching to within some
range ϵ. For simplicity, we assume our data is repeated observations x1 , . . . , xN of identically dis-
tributed random variables X1 , . . . , XN . As we have seen previously, this can be written as
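$$\begin{aligned}
&P(X_1 \in [x_1, x_1 + \epsilon], X_2 \in [x_2, x_2 + \epsilon], \ldots, X_N \in [x_N, x_N + \epsilon] \mid \theta) \\
\approx\ &\epsilon^N p(x_1 \mid \theta) \cdot p(x_2 \mid \theta) \cdots p(x_N \mid \theta).
\end{aligned} \tag{19.7.16}$$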
Thus, if we take negative logarithms of this we obtain
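$$\begin{aligned}
&-\log(P(X_1 \in [x_1, x_1 + \epsilon], X_2 \in [x_2, x_2 + \epsilon], \ldots, X_N \in [x_N, x_N + \epsilon] \mid \theta)) \\
\approx\ &-N\log(\epsilon) - \sum_i \log(p(x_i \mid \theta)).
\end{aligned} \tag{19.7.17}$$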
If we examine this expression, the only place that the ϵ occurs is in the additive constant −N log(ϵ).
This does not depend on the parameters θ at all, so the optimal choice of θ does not depend on
our choice of ϵ! If we demand four digits or four-hundred, the best choice of θ remains the same,
thus we may freely drop the epsilon to see that what we want to optimize is
$$-\sum_i \log(p(x_i \mid \theta)). \tag{19.7.18}$$
Thus, we see that the maximum likelihood point of view can operate with continuous random
variables as easily as with discrete ones by replacing the probabilities with probability densities.
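The claim that ϵ only contributes an additive constant can be illustrated with a small grid search. The sketch below assumes a zero-mean Gaussian with unknown standard deviation and five made-up observations; the data, the grid, and the function names are all hypothetical.

```python
import math

data = [1.2, -0.7, 2.1, -1.6, 0.9]  # made-up observations (assumption)

def neg_log_density(sigma):
    """-sum_i log p(x_i | sigma) for a zero-mean Gaussian, as in (19.7.18)."""
    return sum(x**2 / (2 * sigma**2) + math.log(sigma)
               + 0.5 * math.log(2 * math.pi) for x in data)

def neg_log_prob(sigma, eps):
    """Approximate -log P(each X_i lands in [x_i, x_i + eps]), as in (19.7.17)."""
    return -len(data) * math.log(eps) + neg_log_density(sigma)

grid = [0.5 + i / 100 for i in range(251)]  # candidate values of sigma
best_by_eps = {eps: min(grid, key=lambda s: neg_log_prob(s, eps))
               for eps in (1e-1, 1e-4, 1e-8)}
best_density = min(grid, key=neg_log_density)

print(best_by_eps, best_density)
```

Every choice of ϵ selects the same grid point as the density-only objective, since the term −N log(ϵ) shifts all candidates by the same amount.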
19.7.4 Summary
• The maximum likelihood principle tells us that the best fit model for a given dataset is the
one that generates the data with the highest probability.
• Often people work with the negative log-likelihood instead for a variety of reasons: numer-
ical stability, conversion of products to sums (and the resulting simplification of gradient
computations), and theoretical ties to information theory.
• While simplest to motivate in the discrete setting, it may be freely generalized to the contin-
uous setting as well by maximizing the probability density assigned to the datapoints.

Exercises
1. Suppose that you know that a non-negative random variable has density $\alpha e^{-\alpha x}$ for some
value α > 0. You obtain a single observation from the random variable which is the number
3. What is the maximum likelihood estimate for α?
2. Suppose that you have a dataset of samples $\{x_i\}_{i=1}^N$ drawn from a Gaussian with unknown
mean, but variance 1. What is the maximum likelihood estimate for the mean?
19.8 Distributions
Now that we have learned how to work with probability in both the discrete and the continuous
setting, letʼs get to know some of the common distributions encountered. Depending on the area
of machine learning, we may need to be familiar with vastly more of these, or for some areas of
deep learning potentially none at all. This is, however, a good basic list to be familiar with. Letʼs
first import some common libraries.
%matplotlib inline
from math import erf, factorial
import torch
from d2l import torch as d2l

torch.pi = torch.acos(torch.zeros(1)) * 2  # Define pi in torch
19.8.1 Bernoulli
This is the simplest random variable usually encountered. This random variable encodes a coin
flip which comes up 1 with probability p and 0 with probability 1 − p. If we have a random variable
X with this distribution, we will write
X ∼ Bernoulli(p). (19.8.1)
The cumulative distribution function is



$$F(x) = \begin{cases} 0 & x < 0, \\ 1 - p & 0 \leq x < 1, \\ 1 & x \geq 1. \end{cases} \tag{19.8.2}$$
The probability mass function is plotted below.
p = 0.3
d2l.set_figsize()
d2l.plt.stem([0, 1], [1 - p, p])
d2l.plt.xlabel('x')
d2l.plt.show()

Now, letʼs plot the cumulative distribution function (19.8.2).
x = torch.arange(-1, 2, 0.01)

def F(x):
    # Matches (19.8.2): the c.d.f. reaches 1 at x = 1
    return 0 if x < 0 else 1 if x >= 1 else 1 - p

d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')
If X ∼ Bernoulli(p), then:
• $\mu_X = p$,
• $\sigma_X^2 = p(1 - p)$.
We can sample an array of arbitrary shape from a Bernoulli random variable as follows.
1*(torch.rand(10, 10) < p)
tensor([[1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
[0, 1, 0, 0, 0, 0, 1, 1, 0, 0],

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
[0, 1, 1, 1, 1, 0, 1, 0, 0, 1],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 1]])
19.8.2 Discrete Uniform
The next commonly encountered random variable is a discrete uniform. For our discussion here,
we will assume that it is supported on the integers {1, 2, . . . , n}, however any other set of values
can be freely chosen. The meaning of the word uniform in this context is that every possible value
is equally likely. The probability for each value i ∈ {1, 2, 3, . . . , n} is $p_i = \frac{1}{n}$. We will denote a
random variable X with this distribution as
X ∼ U (n). (19.8.3)



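The cumulative distribution function is
$$F(x) = \begin{cases} 0 & x < 1, \\ \frac{k}{n} & k \leq x < k + 1 \text{ with } 1 \leq k < n, \\ 1 & x \geq n. \end{cases} \tag{19.8.4}$$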
Letʼs first plot the probability mass function.
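n = 5

d2l.plt.stem([i+1 for i in range(n)], n*[1 / n])
d2l.plt.xlabel('x')
d2l.plt.show()

# Now plot the cumulative distribution function (19.8.4)
x = torch.arange(-1, 6, 0.01)

def F(x):
    return 0 if x < 1 else 1 if x > n else torch.floor(x) / n

d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')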
If X ∼ U (n), then:
• $\mu_X = \frac{1+n}{2}$,
• $\sigma_X^2 = \frac{n^2-1}{12}$.
We can sample an array of arbitrary shape from a discrete uniform random variable as follows.
# The upper bound of torch.randint is exclusive, so use n + 1 to include n
torch.randint(1, n + 1, size=(10, 10))
tensor([[1, 2, 4, 3, 1, 2, 1, 4, 1, 1],
[2, 3, 4, 3, 1, 1, 1, 1, 3, 3],
[1, 3, 3, 3, 1, 2, 3, 1, 4, 2],
[4, 2, 4, 3, 4, 2, 3, 1, 4, 3],
[3, 1, 2, 3, 3, 2, 4, 1, 1, 2],
[4, 4, 4, 4, 3, 3, 4, 4, 1, 1],
[1, 3, 3, 4, 2, 2, 2, 2, 1, 4],
[1, 1, 1, 3, 2, 2, 1, 3, 2, 3],
[2, 1, 1, 2, 4, 4, 1, 1, 4, 3],
[2, 4, 2, 1, 2, 4, 3, 4, 2, 3]])
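The stated mean and variance can be checked by simulation, which also confirms that all n values are actually drawn. This sketch swaps in the standard library's `random.randint`, whose bounds are both inclusive (unlike `torch.randint`, whose upper bound is exclusive); the sample size is an arbitrary choice.

```python
import random

random.seed(0)
n = 5
samples = [random.randint(1, n) for _ in range(100_000)]  # both ends inclusive

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(mean, (1 + n) / 2)        # empirical vs. theoretical mean
print(var, (n**2 - 1) / 12)     # empirical vs. theoretical variance
```

A sample in which the largest value never appears would pull both the mean and the variance below these targets, which makes the comparison a cheap check on the sampling bounds.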

19.8.3 Continuous Uniform
Next, letʼs discuss the continuous uniform distribution. The idea behind this random variable is
that if we increase the n in the discrete uniform distribution, and then scale it to fit within the
interval [a, b], we will approach a continuous random variable that just picks an arbitrary value in
[a, b] all with equal probability. We will denote this distribution as
X ∼ U (a, b). (19.8.5)
The probability density function is

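$$p(x) = \begin{cases} \frac{1}{b-a} & x \in [a, b], \\ 0 & x \notin [a, b]. \end{cases} \tag{19.8.6}$$
The cumulative distribution function is
$$F(x) = \begin{cases} 0 & x < a, \\ \frac{x-a}{b-a} & x \in [a, b], \\ 1 & x \geq b. \end{cases} \tag{19.8.7}$$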
Letʼs first plot the probability density function (19.8.6).
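a, b = 1, 3

x = torch.arange(0, 4, 0.01)
p = (x > a).type(torch.float32)*(x < b).type(torch.float32)/(b-a)
d2l.plot(x, p, 'x', 'p.d.f.')

# Now plot the cumulative distribution function (19.8.7)
def F(x):
    return 0 if x < a else 1 if x > b else (x - a) / (b - a)

d2l.plot(x, torch.tensor([F(y) for y in x]), 'x', 'c.d.f.')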

If X ∼ U (a, b), then:
• $\mu_X = \frac{a+b}{2}$,
• $\sigma_X^2 = \frac{(b-a)^2}{12}$.
We can sample an array of arbitrary shape from a uniform random variable as follows. Note that
it by default samples from a U (0, 1), so if we want a different range we need to scale it.
(b - a) * torch.rand(10, 10) + a
tensor([[1.3407, 2.9463, 1.3024, 2.3558, 1.1522, 1.8822, 2.3531, 1.2544, 2.0675,

2.1224],
[2.8087, 2.0933, 1.9004, 2.4039, 2.0174, 2.4020, 1.2471, 2.7415, 2.4284,
1.1843],
[2.8933, 2.9785, 1.2011, 2.9047, 2.3221, 1.0419, 2.1727, 2.3175, 1.0767,
2.0220],
[2.8450, 1.1983, 2.8005, 2.6843, 2.1759, 1.1987, 1.3771, 2.2230, 2.5729,
2.4497],
[1.9895, 1.2687, 1.7678, 1.8760, 1.8122, 2.5085, 1.6869, 2.3768, 2.0529,
2.0493],
[2.1291, 1.3207, 1.9768, 1.8941, 2.0006, 1.5537, 2.3956, 2.9053, 1.7037,
1.0791],
[2.9085, 1.4451, 2.8675, 1.5195, 1.1938, 1.6544, 2.6599, 2.6653, 1.0600,
2.8023],
[1.1694, 2.4635, 1.8782, 1.9116, 2.6862, 2.0885, 2.1233, 2.3824, 2.8639,
2.6550],
[1.9362, 1.2924, 1.1864, 2.7568, 1.5160, 2.1315, 2.8026, 1.6864, 1.6140,
1.1142],
[1.1725, 1.5998, 1.9147, 2.7414, 1.7226, 1.2918, 2.9093, 2.9673, 1.7565,
1.2769]])

19.8.4 Binomial
Letʼs make things a little more complex and examine the binomial random variable. This random
variable originates from performing a sequence of n independent experiments, each of which has
probability p of succeeding, and asking how many successes we expect to see.
Letʼs express this mathematically. Each experiment is an independent random variable Xi where
we will use 1 to encode success, and 0 to encode failure. Since each is an independent coin flip
which is successful with probability p, we can say that Xi ∼ Bernoulli(p). Then, the binomial
random variable is
∑
n
X= Xi . (19.8.8)
i=1
In this case, we will write
X ∼ Binomial(n, p). (19.8.9)
To get the cumulative distribution function, we need to notice that getting exactly k successes can
occur in $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ ways each of which has a probability of $p^k (1 - p)^{n-k}$ of occurring. Thus the
cumulative distribution function is
$$F(x) = \begin{cases} 0 & x < 0, \\ \sum_{m \leq k} \binom{n}{m} p^m (1 - p)^{n-m} & k \leq x < k + 1 \text{ with } 0 \leq k < n, \\ 1 & x \geq n. \end{cases} \tag{19.8.10}$$
Letʼs first plot the probability mass function.
n, p = 10, 0.2

# Compute binomial coefficient
def binom(n, k):
    comb = 1
    for i in range(min(k, n - k)):
        comb = comb * (n - i) // (i + 1)
    return comb

pmf = torch.tensor([p**i * (1-p)**(n - i) * binom(n, i) for i in range(n + 1)])

d2l.plt.stem([i for i in range(n + 1)], pmf)
d2l.plt.xlabel('x')
d2l.plt.show()

# Now plot the cumulative distribution function (19.8.10)
x = torch.arange(-1, 11, 0.01)
cmf = torch.cumsum(pmf, dim=0)

def F(x):
    return 0 if x < 0 else 1 if x > n else cmf[int(x)]

d2l.plot(x, torch.tensor([F(y) for y in x.tolist()]), 'x', 'c.d.f.')
If X ∼ Binomial(n, p), then:

• $\mu_X = np$,
• $\sigma_X^2 = np(1 - p)$.
This follows from the linearity of expected value over the sum of n Bernoulli random variables, and
the fact that the variance of the sum of independent random variables is the sum of the variances.
This can be sampled as follows.
m = torch.distributions.binomial.Binomial(n, p)
m.sample(sample_shape=(10, 10))

tensor([[3., 3., 2., 4., 1., 2., 4., 3., 1., 3.],
[3., 2., 0., 1., 5., 1., 1., 1., 1., 0.],
[1., 2., 1., 2., 3., 4., 1., 2., 0., 1.],
[1., 1., 2., 2., 2., 3., 1., 3., 2., 4.],
[2., 3., 1., 4., 1., 2., 2., 1., 4., 2.],
[0., 1., 5., 1., 4., 2., 1., 1., 0., 2.],
[2., 2., 1., 4., 2., 1., 1., 2., 2., 3.],
[3., 1., 0., 3., 1., 2., 3., 3., 0., 2.],
[2., 0., 2., 0., 2., 3., 4., 4., 1., 1.],
[4., 2., 5., 2., 2., 3., 1., 2., 0., 3.]])
19.8.5 Poisson
Letʼs now perform a thought experiment. We are standing at a bus stop and we want to know how
many buses will arrive in the next minute. Letʼs start by considering X (1) ∼ Bernoulli(p) which
is simply the probability that a bus arrives in the one minute window. For bus stops far from an
urban center, this might be a pretty good approximation. We may never see more than one bus in
a minute.
However, if we are in a busy area, it is possible or even likely that two buses will arrive. We can
model this by splitting our random variable into two parts for the first 30 seconds, or the second
30 seconds. In this case we can write
$$X^{(2)} \sim X^{(2)}_1 + X^{(2)}_2, \tag{19.8.11}$$
where $X^{(2)}$ is the total sum, and $X^{(2)}_i \sim \mathrm{Bernoulli}(p/2)$. The total distribution is then
$X^{(2)} \sim \mathrm{Binomial}(2, p/2)$.
Why stop here? Letʼs continue to split that minute into n parts. By the same reasoning as above,
we see that
$$X^{(n)} \sim \mathrm{Binomial}(n, p/n). \tag{19.8.12}$$
Consider these random variables. By the previous section, we know that (19.8.12) has mean
$\mu_{X^{(n)}} = n(p/n) = p$, and variance $\sigma_{X^{(n)}}^2 = n(p/n)(1 - (p/n)) = p(1 - p/n)$. If we take n → ∞,
we can see that these numbers stabilize to $\mu_{X^{(\infty)}} = p$, and variance $\sigma_{X^{(\infty)}}^2 = p$. This indicates that
there could be some random variable we can define in this infinite subdivision limit.
This should not come as too much of a surprise, since in the real world we can just count the
number of bus arrivals, however it is nice to see that our mathematical model is well defined.
This discussion can be made formal as the law of rare events.
Following through this reasoning carefully, we can arrive at the following model. We will say that
X ∼ Poisson(λ) if it is a random variable which takes the values {0, 1, 2, . . .} with probability
$$p_k = \frac{\lambda^k e^{-\lambda}}{k!}. \tag{19.8.13}$$
The value λ > 0 is known as the rate (or the shape parameter), and denotes the average number of
arrivals we expect in one unit of time.
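The subdivision argument can be checked numerically by comparing the Binomial(n, λ/n) probabilities with the Poisson(λ) probabilities as n grows. The sketch below is plain Python; the rate λ = 5, the values of n, and the range of k are illustrative choices.

```python
from math import comb, exp, factorial

lam = 5.0

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability from (19.8.13)."""
    return lam**k * exp(-lam) / factorial(k)

gaps = []
for n in [10, 100, 10_000]:
    # Largest pointwise difference over k = 0, ..., 10
    gap = max(abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))
              for k in range(11))
    gaps.append(gap)
    print(n, gap)
```

The largest gap shrinks steadily as the minute is cut into more pieces, in line with the law of rare events.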
We may sum this probability mass function to get the cumulative distribution function.
$$F(x) = \begin{cases} 0 & x < 0, \\ e^{-\lambda} \sum_{m=0}^{k} \frac{\lambda^m}{m!} & k \leq x < k + 1 \text{ with } 0 \leq k. \end{cases} \tag{19.8.14}$$

Letʼs first plot the probability mass function (19.8.13).
lam = 5.0

xs = [i for i in range(20)]
pmf = torch.tensor([torch.exp(torch.tensor(-lam)) * lam**k
                    / factorial(k) for k in xs])

d2l.plt.stem(xs, pmf)
d2l.plt.xlabel('x')
d2l.plt.show()

# Now plot the cumulative distribution function (19.8.14)
x = torch.arange(-1, 21, 0.01)
cmf = torch.cumsum(pmf, dim=0)

def F(x):
    # Past the 20 tabulated terms the c.d.f. equals 1 to plotting accuracy
    return 0 if x < 0 else 1 if x >= len(cmf) else cmf[int(x)]

d2l.plot(x, torch.tensor([F(y) for y in x.tolist()]), 'x', 'c.d.f.')
As we saw above, the means and variances are particularly concise. If X ∼ Poisson(λ), then:

• $\mu_X = \lambda$,
• $\sigma_X^2 = \lambda$.
This can be sampled as follows.
m = torch.distributions.poisson.Poisson(lam)
m.sample((10, 10))
tensor([[ 2., 8., 2., 3., 2., 6., 3., 6., 3., 5.],
[ 5., 6., 6., 5., 5., 7., 1., 5., 6., 2.],
[ 1., 2., 6., 4., 7., 6., 1., 2., 6., 6.],
[ 2., 3., 4., 4., 8., 5., 2., 6., 5., 4.],
[ 2., 4., 10., 7., 2., 6., 6., 7., 4., 4.],
[ 6., 5., 9., 9., 5., 5., 4., 4., 4., 8.],
[ 6., 4., 4., 8., 7., 3., 5., 3., 3., 3.],
[ 3., 3., 5., 4., 9., 5., 6., 3., 1., 10.],
[11., 4., 6., 5., 4., 5., 4., 3., 2., 3.],
[ 5., 4., 6., 6., 8., 4., 11., 6., 7., 5.]])
19.8.6 Gaussian
Now letʼs try a different, but related experiment. Letʼs say we again are performing n independent
Bernoulli(p) measurements $X_i$. The distribution of the sum of these is $X^{(n)} \sim \mathrm{Binomial}(n, p)$.
Rather than taking a limit as n increases and p decreases, letʼs fix p, and then send n → ∞. In this
case $\mu_{X^{(n)}} = np \to \infty$ and $\sigma_{X^{(n)}}^2 = np(1 - p) \to \infty$, so there is no reason to think this limit should
be well defined.
However, not all hope is lost! Letʼs just make the mean and variance be well behaved by defining
$$Y^{(n)} = \frac{X^{(n)} - \mu_{X^{(n)}}}{\sigma_{X^{(n)}}}. \tag{19.8.15}$$
This can be seen to have mean zero and variance one, and so it is plausible to believe that it will
converge to some limiting distribution. If we plot what these distributions look like, we will be-
come even more convinced that it will work.
p = 0.2
ns = [1, 10, 100, 1000]
d2l.plt.figure(figsize=(12, 3))
for i in range(4):
    n = ns[i]
    # Binomial(n, p) probabilities for k = 0, ..., n
    pmf = torch.tensor([p**k * (1-p)**(n-k) * binom(n, k)
                        for k in range(n + 1)])
    d2l.plt.subplot(1, 4, i + 1)
    # Plot against the standardized outcomes (k - np) / sqrt(np(1 - p))
    d2l.plt.stem([(k - n*p)/torch.sqrt(torch.tensor(n*p*(1 - p)))
                  for k in range(n + 1)], pmf)
    d2l.plt.xlim([-4, 4])
    d2l.plt.xlabel('x')
    d2l.plt.title("n = {}".format(n))
d2l.plt.show()

One thing to note: compared to the Poisson case, we are now dividing by the standard deviation
which means that we are squeezing the possible outcomes into smaller and smaller areas. This is
an indication that our limit will no longer be discrete, but rather continuous.
A derivation of what occurs is beyond the scope of this document, but the central limit theorem
states that as n → ∞, this will yield the Gaussian Distribution (or sometimes normal distribution).
More explicitly, for any a, b:
$$\lim_{n \to \infty} P(Y^{(n)} \in [a, b]) = P(\mathcal{N}(0, 1) \in [a, b]), \tag{19.8.16}$$
where we say a random variable is normally distributed with given mean µ and variance $\sigma^2$,
written $X \sim \mathcal{N}(\mu, \sigma^2)$ if X has density
$$p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \tag{19.8.17}$$
Letʼs first plot the probability density function (19.8.17).
mu, sigma = 0, 1

x = torch.arange(-3, 3, 0.01)
p = 1 / torch.sqrt(2 * torch.pi * sigma**2) * torch.exp(
    -(x - mu)**2 / (2 * sigma**2))

d2l.plot(x, p, 'x', 'p.d.f.')

Now, letʼs plot the cumulative distribution function. It is beyond the scope of this appendix, but
the Gaussian c.d.f. does not have a closed-form formula in terms of more elementary functions.
We will use erf which provides a way to compute this integral numerically.
def phi(x):
    return (1.0 + erf((x - mu) / (sigma * torch.sqrt(torch.tensor(2.))))) / 2.0

d2l.plot(x, torch.tensor([phi(y) for y in x.tolist()]), 'x', 'c.d.f.')
Keen-eyed readers will recognize some of these terms. Indeed, we encountered this integral in
Section 19.5. Indeed we need exactly that computation to see that this pX (x) has total area one
and is thus a valid density.
Our choice of working with coin flips made computations shorter, but nothing about that choice
was fundamental. Indeed, if we take any collection of independent identically distributed random
variables Xi , and form
$$X^{(N)} = \sum_{i=1}^N X_i. \tag{19.8.18}$$
Then
$$\frac{X^{(N)} - \mu_{X^{(N)}}}{\sigma_{X^{(N)}}} \tag{19.8.19}$$
will be approximately Gaussian. There are additional requirements needed to make it work, most
commonly $E[X^4] < \infty$, but the philosophy is clear.
The central limit theorem is the reason why the Gaussian is fundamental to probability, statis-
tics, and machine learning. Whenever we can say that something we measured is a sum of many
small independent contributions, we can assume that the thing being measured will be close to
Gaussian.
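As a quick numerical illustration of this claim, the following minimal sketch standardizes sums of
uniform random variables and compares them with a standard Gaussian. It uses only the Python
standard library rather than the torch calls used elsewhere in this section. The number of terms N,
the number of trials, and the choice of the uniform distribution are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

N = 200        # number of i.i.d. terms in each sum (illustrative choice)
trials = 5000  # number of standardized sums to draw (illustrative choice)

# A Uniform(0, 1) variable has mean 1/2 and variance 1/12,
# so the sum of N of them has mean N/2 and variance N/12.
mu_sum = N * 0.5
sigma_sum = math.sqrt(N / 12)

# Standardized sums, as in (19.8.19)
z = [(sum(random.random() for _ in range(N)) - mu_sum) / sigma_sum
     for _ in range(trials)]

z_mean = statistics.fmean(z)
z_std = statistics.pstdev(z)
# Fraction of standardized sums within one standard deviation of zero.
# For an exact standard normal this is erf(1 / sqrt(2)).
frac_within_one = sum(abs(v) <= 1 for v in z) / trials
gaussian_reference = math.erf(1 / math.sqrt(2))

print(z_mean, z_std, frac_within_one, gaussian_reference)
```

With these settings the empirical mean and standard deviation should come out close to 0 and 1, and
the fraction within one standard deviation should be close to the Gaussian reference value.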
There are many more fascinating properties of Gaussians, and we would like to discuss one more
here. The Gaussian is what is known as a maximum entropy distribution. We will get into entropy
more deeply in Section 19.11; however, all we need to know at this point is that it is a measure of
randomness. In a rigorous mathematical sense, we can think of the Gaussian as the most random
choice of random variable with fixed mean and variance. Thus, if we know that our random
variable has some mean and variance, the Gaussian is in a sense the most conservative choice of
distribution we can make.
To close the section, letʼs recall that if X ∼ N(µ, σ²), then:
• $\mu_X = \mu$,
• $\sigma_X^2 = \sigma^2$.
We can sample from the Gaussian distribution as shown below. With mu = 0 and sigma = 1 as set
earlier, this is the standard normal distribution.
torch.normal(mu, sigma, size=(10, 10))
tensor([[-1.8274, -1.2824, 0.0736, -1.0364, 1.1049, 0.3847, -0.5475, -0.1503,
-0.8055, -1.4943],
[ 0.0645, -0.4899, -4.2413, -0.5047, -0.8355, -0.2549, 0.5958, -0.2833,
0.4181, -0.8447],
[-0.4652, -1.5496, -0.6649, 1.5355, -0.6436, -0.0681, -0.2087, -0.6262,
-0.9643, -0.0152],
[-0.0606, 1.0457, 0.2949, 0.6858, -0.0811, -1.4702, -0.6672, -0.6295,
-2.5748, 0.6624],
[ 0.2815, 2.1227, 0.2182, 0.6966, 0.4717, -1.4600, -0.2671, 0.8072,
-0.6009, -0.7226],
[ 1.1128, 1.8266, 0.5954, -0.4019, -0.2920, -0.2601, 0.4484, 1.4754,
-0.9301, 1.2265],
[-1.9695, 0.1694, -1.8153, -0.2674, 1.0005, -0.0562, -0.8068, -0.6456,
0.0242, -0.1119],
[ 0.3159, 0.4522, 1.5821, 0.3642, -0.8109, -0.0558, 0.0953, 1.4040,
-0.6535, 0.2136],
[ 0.3021, -0.8677, 0.4516, -1.0713, 0.6192, 0.7036, -1.0788, 0.6413,
-1.4546, -0.8210],
[-0.9459, -1.2138, 0.0987, -1.2443, 1.0690, -0.0527, -1.0614, 0.0205,
0.8417, -1.0435]])
19.8.7 Exponential Family
One shared property of all the distributions listed above is that they belong to what is known
as the exponential family. The exponential family is a set of distributions whose density can be
expressed in the following form:
$$p(x \mid \eta) = h(x) \cdot \exp\left(\eta^{\top} \cdot T(x) - A(\eta)\right) \tag{19.8.20}$$
As this definition can be a little subtle, letʼs examine it closely.

First, h(x) is known as the underlying measure or the base measure. This can be viewed as an original
choice of measure we are modifying with our exponential weight.
Second, we have the vector η = (η_1, η_2, ..., η_l) ∈ R^l called the natural parameters or canonical
parameters. These define how the base measure will be modified. The natural parameters enter
into the new measure by taking the dot product of these parameters against some function T(·) of
x = (x_1, x_2, ..., x_n) ∈ R^n, which is then exponentiated. The vector T(x) = (T_1(x), T_2(x), ..., T_l(x))
is called the sufficient statistics for η. This name is used since the information represented by T(x)
is sufficient to calculate the probability density, and no other information from the samples x is
required.

Third, we have A(η), which is referred to as the cumulant function, which ensures that the above
distribution (19.8.20) integrates to one, i.e.,
$$A(\eta) = \log\left[\int h(x) \cdot \exp\left(\eta^{\top} \cdot T(x)\right) dx\right]. \tag{19.8.21}$$
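To be concrete, letʼs consider the Gaussian. Assuming that x is a univariate variable, we saw that
it had a density of
$$\begin{aligned}
p(x \mid \mu, \sigma) &= \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\left\{\frac{-(x-\mu)^2}{2\sigma^2}\right\} \\
&= \frac{1}{\sqrt{2\pi}} \cdot \exp\left\{\frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \left(\frac{1}{2\sigma^2} \mu^2 + \log(\sigma)\right)\right\}.
\end{aligned} \tag{19.8.22}$$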
This matches the definition of the exponential family with:

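• underlying measure: $h(x) = \frac{1}{\sqrt{2\pi}}$,
• natural parameters: $\eta = \begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix} = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ \frac{1}{2\sigma^2} \end{bmatrix}$,
• sufficient statistics: $T(x) = \begin{bmatrix} x \\ -x^2 \end{bmatrix}$, and
• cumulant function: $A(\eta) = \frac{1}{2\sigma^2} \mu^2 + \log(\sigma) = \frac{\eta_1^2}{4\eta_2} - \frac{1}{2} \log(2\eta_2)$.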
It is worth noting that the exact choice of each of the above terms is somewhat arbitrary. Indeed, the
important feature is that the distribution can be expressed in this form, not the exact form itself.
As we allude to in Section 4.1.2, a widely used technique is to assume that the final output y follows
an exponential family distribution. The exponential family is a common and powerful family of
distributions encountered frequently in machine learning.
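To check the Gaussian example numerically, the following minimal sketch evaluates the density once
directly, as in (19.8.17), and once in the exponential family form (19.8.20) with the base measure,
natural parameters, sufficient statistics, and cumulant function listed above. The function names
and the test points are hypothetical and chosen only for illustration.

```python
import math

def gaussian_density(x, mu, sigma):
    """Gaussian density written directly, as in (19.8.17)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(
        2 * math.pi * sigma ** 2)

def gaussian_exp_family(x, mu, sigma):
    """The same density written as h(x) * exp(eta . T(x) - A(eta))."""
    h = 1 / math.sqrt(2 * math.pi)                          # base measure
    eta1, eta2 = mu / sigma ** 2, 1 / (2 * sigma ** 2)      # natural parameters
    t1, t2 = x, -x ** 2                                     # sufficient statistics
    a = eta1 ** 2 / (4 * eta2) - 0.5 * math.log(2 * eta2)   # cumulant function
    return h * math.exp(eta1 * t1 + eta2 * t2 - a)

# Arbitrary (x, mu, sigma) triples, chosen only for illustration
points = [(-1.3, 0.5, 2.0), (0.0, 0.0, 1.0), (2.7, -1.0, 0.4)]
max_gap = max(abs(gaussian_density(*p) - gaussian_exp_family(*p))
              for p in points)
print(max_gap)
```

The two forms should agree up to floating-point rounding, which is what the printed gap measures.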
19.8.8 Summary
• Bernoulli random variables can be used to model events with a yes/no outcome.
• Discrete uniform distributions model selections from a finite set of possibilities.
• Continuous uniform distributions select from an interval.
• Binomial distributions model a series of Bernoulli random variables, and count the number
of successes.
• Poisson random variables model the arrival of rare events.
• Gaussian random variables model the result of adding a large number of independent ran-
dom variables together.
• All the above distributions belong to the exponential family.

Exercises
1. What is the standard deviation of a random variable that is the difference X − Y of two
independent binomial random variables X, Y ∼ Binomial(16, 1/2)?
2. If we take a Poisson random variable X ∼ Poisson(λ) and consider (X − λ)/√λ as λ → ∞,
we can show that this becomes approximately Gaussian. Why does this make sense?
3. What is the probability mass function for a sum of two discrete uniform random variables
on n elements?
19.9 Naive Bayes
Throughout the previous sections, we learned about the theory of probability and random vari-
ables. To put this theory to work, letʼs introduce the naive Bayes classifier. This uses nothing but
probabilistic fundamentals to allow us to perform classification of digits.
Learning is all about making assumptions. If we want to classify a new data example that we have
never seen before we have to make some assumptions about which data examples are similar to
each other. The naive Bayes classifier, a popular and remarkably clear algorithm, assumes all
features are independent from each other to simplify the computation. In this section, we will
apply this model to recognize characters in images.
%matplotlib inline
import math
import torch
import torchvision
from d2l import torch as d2l
19.9.1 Optical Character Recognition
MNIST (LeCun et al., 1998) is one of the most widely used datasets. It contains 60,000 images for
training and 10,000 images for validation. Each image contains a handwritten digit from 0 to 9. The
task is to classify each image into the corresponding digit.
torchvision provides an MNIST class in its datasets module to automatically retrieve the dataset from
the Internet. Subsequent calls use the already-downloaded local copy. We specify whether
we are requesting the training set or the test set by setting the value of the parameter train to
True or False, respectively. Each image is a grayscale image with both width and height of 28; after
conversion to a tensor it has shape (1, 28, 28). We use a customized transformation to remove the
leading channel dimension. In addition, the dataset represents each pixel by an unsigned 8-bit
integer. We quantize them into binary features to simplify the problem.
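data_transform = torchvision.transforms.Compose([
    # Convert the PIL image to a float tensor in [0, 1] with shape (1, 28, 28)
    torchvision.transforms.ToTensor(),
    # Binarize the pixels and drop the leading channel dimension
    lambda x: torch.floor(x * 255 / 128).squeeze(dim=0)
])
mnist_train = torchvision.datasets.MNIST(
    root='./temp', train=True, transform=data_transform, download=True)
mnist_test = torchvision.datasets.MNIST(
    root='./temp', train=False, transform=data_transform, download=True)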
We can access a particular example, which contains the image and the corresponding label.
image, label = mnist_train[2]

image.shape, label
(torch.Size([28, 28]), 4)
Our example, stored here in the variable image, corresponds to an image with a height and width
of 28 pixels.
image.shape, image.dtype
(torch.Size([28, 28]), torch.float32)
Our code stores the label of each image as a scalar. Its type is a Python int.
label, type(label)
(4, int)
We can also access multiple examples at the same time.
images = torch.stack([mnist_train[i][0] for i in range(10, 38)], dim=0)

labels = torch.tensor([mnist_train[i][1] for i in range(10, 38)])
images.shape, labels.shape
(torch.Size([28, 28, 28]), torch.Size([28]))
Letʼs visualize these examples.
d2l.show_images(images, 2, 9);

19.9.2 The Probabilistic Model for Classification
In a classification task, we map an example into a category. Here an example is a grayscale 28 × 28
image, and a category is a digit. (Refer to Section 4.1 for a more detailed explanation.) One natural
way to express the classification task is via the probabilistic question: what is the most likely label
given the features (i.e., image pixels)? Denote by x ∈ R^d the features of the example and y ∈ R the
label. Here features are image pixels, where we can reshape a 2-dimensional image to a vector so
that d = 28² = 784, and labels are digits. The probability of the label given the features is p(y | x).
If we are able to compute these probabilities, which are p(y | x) for y = 0, . . . , 9 in our example,
then the classifier will output the prediction ŷ given by the expression:
$$\hat{y} = \mathrm{argmax}_y \, p(y \mid x). \tag{19.9.1}$$
Unfortunately, this requires that we estimate p(y | x) for every value of x = x_1, ..., x_d. Imagine
that each feature could take one of 2 values. For example, the feature x_1 = 1 might signify that
the word apple appears in a given document and x_1 = 0 would signify that it does not. If we had
30 such binary features, that would mean that we need to be prepared to classify any of 2^30 (over
1 billion!) possible values of the input vector x.
Moreover, where is the learning? If we need to see every single possible example in order to predict
the corresponding label then we are not really learning a pattern but just memorizing the dataset.
19.9.3 The Naive Bayes Classifier
Fortunately, by making some assumptions about conditional independence, we can introduce
some inductive bias and build a model capable of generalizing from a comparatively modest
selection of training examples. To begin, letʼs use Bayesʼ theorem to express the classifier as
$$\hat{y} = \mathrm{argmax}_y \, p(y \mid x) = \mathrm{argmax}_y \, \frac{p(x \mid y) p(y)}{p(x)}. \tag{19.9.2}$$
Note that the denominator is the normalizing term p(x), which does not depend on the value of
the label y. As a result, we only need to worry about comparing the numerator across different
values of y. Even if calculating the denominator turned out to be intractable, we could get away
with ignoring it, so long as we could evaluate the numerator. Fortunately, even if we wanted to
recover the normalizing constant, we could. We can always recover the normalization term since
∑_y p(y | x) = 1.
Now, letʼs focus on p(x | y). Using the chain rule of probability, we can express the term p(x | y) as
$$p(x_1 \mid y) \cdot p(x_2 \mid x_1, y) \cdot \ldots \cdot p(x_d \mid x_1, \ldots, x_{d-1}, y). \tag{19.9.3}$$
By itself, this expression does not get us any further. We still must estimate roughly 2^d parameters.
However, if we assume that the features are conditionally independent of each other, given the label,
then suddenly we are in much better shape, as this term simplifies to ∏_i p(x_i | y), giving us the
predictor
$$\hat{y} = \mathrm{argmax}_y \, \prod_{i=1}^d p(x_i \mid y) p(y). \tag{19.9.4}$$
If we can estimate p(x_i = 1 | y) for every i and y, and save its value in P_xy[i, y], where P_xy is a d × n
matrix with n being the number of classes and y ∈ {1, . . . , n}, then we can also use this to estimate
p(x_i = 0 | y), i.e.,
$$p(x_i = t_i \mid y) = \begin{cases} P_{xy}[i, y] & \text{for } t_i = 1; \\ 1 - P_{xy}[i, y] & \text{for } t_i = 0. \end{cases} \tag{19.9.5}$$
In addition, we estimate p(y) for every y and save it in P_y[y], with P_y an n-length vector. Then, for
any new example t = (t_1, t_2, . . . , t_d), we could compute
$$\begin{aligned}
\hat{y} &= \mathrm{argmax}_y \, p(y) \prod_{i=1}^d p(x_i = t_i \mid y) \\
&= \mathrm{argmax}_y \, P_y[y] \prod_{i=1}^d P_{xy}[i, y]^{t_i} \left(1 - P_{xy}[i, y]\right)^{1 - t_i}
\end{aligned} \tag{19.9.6}$$
for any y. So our assumption of conditional independence has taken the complexity of our model
from an exponential dependence on the number of features O(2^d n) to a linear dependence, which
is O(dn).
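The predictor (19.9.6) and the parameter counts can be made concrete with a minimal sketch on a toy
model. The tables P_xy and P_y below are hypothetical numbers made up for illustration (d = 3 binary
features, n = 2 classes), not estimates from any dataset, and plain Python lists stand in for tensors.

```python
# Hypothetical toy model: P_xy[i][y] stands for p(x_i = 1 | y), P_y[y] for p(y).
P_xy = [[0.9, 0.2],
        [0.8, 0.3],
        [0.1, 0.7]]
P_y = [0.5, 0.5]

def joint_scores(t):
    """Return p(y) * prod_i p(x_i = t_i | y) for every class y, as in (19.9.6)."""
    scores = []
    for y in range(len(P_y)):
        score = P_y[y]
        for i, t_i in enumerate(t):
            p = P_xy[i][y]
            score *= p if t_i == 1 else 1 - p
        scores.append(score)
    return scores

scores = joint_scores([1, 1, 0])
y_hat = max(range(len(scores)), key=scores.__getitem__)  # argmax over classes
print(scores, y_hat)

# Parameter counts for d binary features and n classes
d, n = 3, 2
full_table = (2 ** d - 1) * n  # one free probability per joint pattern, per class
naive_table = d * n            # one probability per feature, per class
print(full_table, naive_table)
```

Even in this tiny case the full conditional table needs more entries than the naive Bayes table, and
the gap grows exponentially with d.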
19.9.4 Training
The problem now is that we do not know P_xy and P_y. So we need to estimate their values given
some training data first. This is training the model. Estimating P_y is not too hard. Since we are
only dealing with 10 classes, we may count the number of occurrences n_y for each of the digits
and divide it by the total amount of data n. For instance, if digit 8 occurs n_8 = 5,800 times and we
have a total of n = 60,000 images, the probability estimate is p(y = 8) = 0.0967.
X = torch.stack([mnist_train[i][0] for i in range(len(mnist_train))], dim=0)
Y = torch.tensor([mnist_train[i][1] for i in range(len(mnist_train))])
# Count how often each digit occurs and normalize to obtain p(y)
n_y = torch.zeros(10)
for y in range(10):
    n_y[y] = (Y == y).sum()
P_y = n_y / n_y.sum()
P_y
tensor([0.0987, 0.1124, 0.0993, 0.1022, 0.0974, 0.0904, 0.0986, 0.1044, 0.0975,

0.0992])
Now on to the slightly more difficult P_xy. Since we picked black and white images, p(x_i | y)
denotes the probability that pixel i is switched on for class y. Just like before we can go and count
the number of times n_iy that such an event occurs and divide it by the total number of occurrences
of y, i.e., n_y. But there is something slightly troubling: certain pixels may never be black (e.g., for
well cropped images the corner pixels might always be white). A convenient way for statisticians
to deal with this problem is to add pseudo counts to all occurrences. Hence, rather than n_iy we
use n_iy + 1 and instead of n_y we use n_y + 2 (since there are two possible values pixel i can take:
it can either be black or white). This is also called Laplace smoothing. It may seem ad hoc; however,
it can be motivated from a Bayesian point of view by a Beta-binomial model.
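n_x = torch.zeros((10, 28, 28))
# For each digit, count how often every pixel is switched on
for y in range(10):
    n_x[y] = torch.tensor(X.numpy()[Y.numpy() == y].sum(axis=0))
# Laplace smoothing: add 1 to the pixel counts and 2 to the class counts
P_xy = (n_x + 1) / (n_y + 2).reshape(10, 1, 1)
d2l.show_images(P_xy, 2, 5);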
By visualizing these 10 × 28 × 28 probabilities (for each pixel for each class) we could get some
mean looking digits.
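To see the effect of the pseudo counts in isolation, here is a minimal sketch with hypothetical
counts for a single pixel and a single class. The helper name smoothed_estimate and the numbers are
assumptions made for illustration.

```python
def smoothed_estimate(n_iy, n_y):
    """Laplace-smoothed estimate of p(x_i = 1 | y) for a binary pixel."""
    return (n_iy + 1) / (n_y + 2)

# Hypothetical counts: one pixel that was never switched on in 5,000 training
# images of some class, and one that was switched on in all of them.
raw_never_on = 0 / 5000                       # unsmoothed estimate is exactly 0
never_on = smoothed_estimate(0, 5000)         # small but strictly positive
always_on = smoothed_estimate(5000, 5000)     # close to, but strictly below, 1

print(raw_never_on, never_on, always_on)
```

Without smoothing, a single pixel with estimated probability 0 (or 1) would force the whole product
in (19.9.6) to zero for any test image that disagrees with it. The smoothed estimates stay strictly
between 0 and 1.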
Now we can use (19.9.6) to predict a new image. Given x, the following function computes
p(x | y)p(y) for every y.
def bayes_pred(x):
    x = x.unsqueeze(0)  # (28, 28) -> (1, 28, 28)
    p_xy = P_xy * x + (1 - P_xy)*(1 - x)
    p_xy = p_xy.reshape(10, -1).prod(dim=1)  # p(x|y)
    return p_xy * P_y
image, label = mnist_test[0]
bayes_pred(image)
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
This went horribly wrong! To find out why, letʼs look at the per pixel probabilities. They are
typically numbers between 0.001 and 1. We are multiplying 784 of them. At this point it is worth
mentioning that we are calculating these numbers on a computer, hence with a fixed range for
the exponent. What happens is that we experience numerical underflow, i.e., multiplying all the
small numbers leads to something even smaller until it is rounded down to zero. We discussed
this as a theoretical issue in Section 19.7, but we see the phenomenon clearly here in practice.
As discussed in that section, we fix this by using the fact that log ab = log a + log b, i.e., we switch to
summing logarithms. Even if both a and b are small numbers, the logarithm values should be in
a proper range.
a = 0.1
print('underflow:', a**784)
print('logarithm is normal:', 784*math.log(a))
underflow: 0.0
logarithm is normal: -1805.2267129073316
Since the logarithm is an increasing function, we can rewrite (19.9.6) as
$$\hat{y} = \mathrm{argmax}_y \, \log P_y[y] + \sum_{i=1}^d \Big[ t_i \log P_{xy}[i, y] + (1 - t_i) \log\left(1 - P_{xy}[i, y]\right) \Big]. \tag{19.9.7}$$
We can implement the following stable version:
log_P_xy = torch.log(P_xy)
log_P_xy_neg = torch.log(1 - P_xy)
log_P_y = torch.log(P_y)
def bayes_pred_stable(x):
    x = x.unsqueeze(0)  # (28, 28) -> (1, 28, 28)
    p_xy = log_P_xy * x + log_P_xy_neg * (1 - x)
    p_xy = p_xy.reshape(10, -1).sum(axis=1)  # log p(x|y)
    return p_xy + log_P_y
py = bayes_pred_stable(image)
py
tensor([-268.9725, -301.7044, -245.1951, -218.8738, -193.4570, -206.0909,

-292.5226, -114.6257, -220.3313, -163.1784])
We may now check if the prediction is correct.
py.argmax(dim=0) == label
tensor(True)
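As noted earlier, the normalization term can always be recovered because ∑_y p(y | x) = 1. A minimal
sketch of this step, working directly on the log-scores printed above and using only the standard
library, subtracts the largest log-score before exponentiating so that nothing underflows (the
log-sum-exp trick). The variable names are hypothetical.

```python
import math

# Log-scores log p(x | y) + log p(y), copied from the output printed above
log_scores = [-268.9725, -301.7044, -245.1951, -218.8738, -193.4570,
              -206.0909, -292.5226, -114.6257, -220.3313, -163.1784]

# log p(x) = log sum_y exp(log_scores[y]), computed stably by
# shifting with the maximum before exponentiating.
m = max(log_scores)
log_norm = m + math.log(sum(math.exp(s - m) for s in log_scores))

# Posterior p(y | x) under the naive Bayes model
posterior = [math.exp(s - log_norm) for s in log_scores]

best = max(range(10), key=posterior.__getitem__)
print(best, posterior[best], sum(posterior))
```

The resulting posterior puts essentially all of its mass on one digit. Such extreme confidence is a
consequence of multiplying 784 factors under the independence assumption, so it should not be read
as a calibrated probability.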
If we now predict a few validation examples, we can see the Bayes classifier works pretty well.
def predict(X):
    return [bayes_pred_stable(x).argmax(dim=0).type(torch.int32).item()
            for x in X]
X = torch.stack([mnist_test[i][0] for i in range(18)], dim=0)

y = torch.tensor([mnist_test[i][1] for i in range(18)])
preds = predict(X)
d2l.show_images(X, 2, 9, titles=[str(d) for d in preds]);

Finally, letʼs compute the overall accuracy of the classifier.
X = torch.stack([mnist_test[i][0] for i in range(len(mnist_test))], dim=0)

y = torch.tensor([mnist_test[i][1] for i in range(len(mnist_test))])
preds = torch.tensor(predict(X), dtype=torch.int32)
float((preds == y).sum()) / len(y) # Validation accuracy
0.8427
Modern deep networks achieve error rates of less than 0.01. The relatively poor performance of our
classifier is due to the incorrect statistical assumptions that we made in our model: we assumed
that each and every pixel is independently generated, depending only on the label. This is clearly
not how humans write digits, and this wrong assumption led to the downfall of our overly naive
(Bayes) classifier.
19.9.5 Summary
• Using Bayesʼ rule, a classifier can be made by assuming all observed features are indepen-
dent.
• This classifier can be trained on a dataset by counting the number of occurrences of combi-
nations of labels and pixel values.
• This classifier was the gold standard for decades for tasks such as spam detection.
Exercises
1. Consider the dataset [[0, 0], [0, 1], [1, 0], [1, 1]] with labels given by the XOR of the two elements
[0, 1, 1, 0]. What are the probabilities for a naive Bayes classifier built on this dataset? Does
it successfully classify our points? If not, what assumptions are violated?
2. Suppose that we did not use Laplace smoothing when estimating probabilities and a data
example arrived at testing time which contained a value never observed in training. What
would the model output?
3. The naive Bayes classifier is a specific example of a Bayesian network, where the dependence
of random variables is encoded with a graph structure. While the full theory is beyond the
scope of this section (see Koller and Friedman (2009) for full details), explain why allowing
explicit dependence between the two input variables in the XOR model allows for the
creation of a successful classifier.

19.10 Statistics
Undoubtedly, to be a top deep learning practitioner, the ability to train state-of-the-art,
highly accurate models is crucial. However, it is often unclear when improvements are significant,
or only the result of random fluctuations in the training process. To be able to discuss uncertainty
in estimated values, we must learn some statistics.
The earliest reference to statistics can be traced back to the Arab scholar Al-Kindi in the 9th century,
who gave a detailed description of how to use statistics and frequency analysis to decipher
encrypted messages. After 800 years, modern statistics arose in Germany in the 1700s, when
researchers focused on demographic and economic data collection and analysis. Today,
statistics is the scientific discipline that concerns the collection, processing, analysis, interpretation
and visualization of data. What is more, the core theory of statistics has been widely used in
research within academia, industry, and government.
More specifically, statistics can be divided into descriptive statistics and statistical inference. The
former focuses on summarizing and illustrating the features of a collection of observed data, which
is referred to as a sample. The sample is drawn from a population, which denotes the total set of
similar individuals, items, or events of interest in our experiment. In contrast to descriptive
statistics, statistical inference further deduces the characteristics of a population from the given
samples, based on the assumption that the sample distribution can replicate the population
distribution to some degree.
You may wonder: “What is the essential difference between machine learning and statistics?”
Fundamentally speaking, statistics focuses on the inference problem. This type of problem
includes modeling the relationship between variables, such as causal inference, and testing the
statistical significance of model parameters, such as A/B testing. In contrast, machine learning
emphasizes making accurate predictions, without explicitly programming and understanding
each parameterʼs functionality.
In this section, we will introduce three types of statistical inference methods: evaluating and
comparing estimators, conducting hypothesis tests, and constructing confidence intervals. These
methods can help us infer the characteristics of a given population, i.e., the true parameter θ. For
brevity, we assume that the true parameter θ of a given population is a scalar value. It is
straightforward to extend to the case where θ is a vector or a tensor, thus we omit it in our discussion.
19.10.1 Evaluating and Comparing Estimators
In statistics, an estimator is a function of given samples used to estimate the true parameter θ. We
will write θ̂_n = f̂(x_1, . . . , x_n) for the estimate of θ after observing the samples {x_1, x_2, . . . , x_n}.
We have seen simple examples of estimators before in Section 19.7. If you have a number
of samples from a Bernoulli random variable, then the maximum likelihood estimate for the
probability the random variable is one can be obtained by counting the number of ones observed
and dividing by the total number of samples. Similarly, an exercise asked you to show that the
maximum likelihood estimate of the mean of a Gaussian given a number of samples is given by
the average value of all the samples. These estimators will almost never give the true value of the
parameter, but ideally for a large number of samples the estimate will be close.

As an example, we show below the true density of a Gaussian random variable with mean zero and
variance one, along with a collection of samples from that Gaussian. We constructed the y coordinate
so every point is visible and the relationship to the original density is clearer.
import torch
from d2l import torch as d2l
torch.pi = torch.acos(torch.zeros(1)) * 2  # Define pi in torch
# Sample datapoints and create y coordinate
epsilon = 0.1
xs = torch.randn(size=(300,))
ys = torch.tensor(
    [torch.sum(torch.exp(-(xs[:i] - xs[i])**2 / (2 * epsilon**2))
               / torch.sqrt(2*torch.pi*epsilon**2)) / len(xs)
     for i in range(len(xs))])
# Compute true density
xd = torch.arange(torch.min(xs), torch.max(xs), 0.01)
yd = torch.exp(-xd**2/2) / torch.sqrt(2 * torch.pi)
# Plot the results

d2l.plot(xd, yd, 'x', 'density')
d2l.plt.scatter(xs, ys)
d2l.plt.axvline(x=0)
d2l.plt.axvline(x=torch.mean(xs), linestyle='--', color='purple')
d2l.plt.title(f'sample mean: {float(torch.mean(xs).item()):.2f}')
d2l.plt.show()
There can be many ways to compute an estimator of a parameter θ̂n . In this section, we intro-
duce three common methods to evaluate and compare estimators: the mean squared error, the
standard deviation, and statistical bias.

Mean Squared Error
Perhaps the simplest metric used to evaluate estimators is the mean squared error (MSE) (or l2 loss),
which can be defined as
$$\mathrm{MSE}(\hat{\theta}_n, \theta) = E[(\hat{\theta}_n - \theta)^2]. \tag{19.10.1}$$
This allows us to quantify the average squared deviation from the true value. MSE is always
non-negative. If you have read Section 3.1, you will recognize it as the most commonly used regression
loss function. As a measure to evaluate an estimator, the closer its value is to zero, the closer the
estimator is to the true parameter θ.
Statistical Bias
The MSE provides a natural metric, but we can easily imagine multiple different phenomena that
might make it large. Two fundamentally important ones are fluctuation in the estimator due to
randomness in the dataset, and systematic error in the estimator due to the estimation procedure.
First, letʼs measure the systematic error. For an estimator θ̂_n, the statistical bias is defined as
$$\mathrm{bias}(\hat{\theta}_n) = E(\hat{\theta}_n - \theta) = E(\hat{\theta}_n) - \theta. \tag{19.10.2}$$
Note that when bias(θ̂_n) = 0, the expectation of the estimator θ̂_n is equal to the true value of the
parameter. In this case, we say θ̂_n is an unbiased estimator. All else being equal, an unbiased
estimator is preferable to a biased one since its expectation is the same as the true parameter.
It is worth being aware, however, that biased estimators are frequently used in practice. There are
cases where unbiased estimators do not exist without further assumptions, or are intractable to
compute. This may seem like a significant flaw in an estimator, however the majority of estimators
encountered in practice are at least asymptotically unbiased in the sense that the bias tends to zero
as the number of available samples tends to infinity: lim_{n→∞} bias(θ̂_n) = 0.
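A standard example of a biased but asymptotically unbiased estimator is the variance estimate that
divides by n rather than n − 1: its expectation is (n − 1)/n times the true variance, so its bias is
−σ²/n. The following minimal sketch checks this by simulation using only the standard library. The
sample sizes, the number of trials, and the function name are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def average_variance_estimates(n, trials=20000):
    """Average the 1/n and 1/(n-1) variance estimates over many samples of size n."""
    biased_sum = unbiased_sum = 0.0
    for _ in range(trials):
        xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # true variance is 1
        mean = sum(xs) / n
        ss = sum((x - mean) ** 2 for x in xs)            # sum of squared deviations
        biased_sum += ss / n
        unbiased_sum += ss / (n - 1)
    return biased_sum / trials, unbiased_sum / trials

results = {n: average_variance_estimates(n) for n in (2, 5, 50)}
for n, (biased, unbiased) in results.items():
    # Estimated bias of each estimator; theory predicts -1/n and 0
    print(n, round(biased - 1.0, 3), round(unbiased - 1.0, 3))
```

The estimated bias of the 1/n version should shrink roughly like −1/n as n grows, while the 1/(n − 1)
version should stay near zero for every n.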
Variance and Standard Deviation
Second, letʼs measure the randomness in the estimator. Recall from Section 19.6 that the standard
deviation (or standard error) is defined as the square root of the variance. We may measure the degree
of fluctuation of an estimator by measuring the standard deviation or variance of that estimator.
$$\sigma_{\hat{\theta}_n} = \sqrt{\mathrm{Var}(\hat{\theta}_n)} = \sqrt{E[(\hat{\theta}_n - E(\hat{\theta}_n))^2]}. \tag{19.10.3}$$
It is important to compare (19.10.3) to (19.10.1). In this equation we do not compare to the true
population value θ, but instead to E(θ̂_n), the expected value of the estimator. Thus we are not
measuring how far the estimator tends to be from the true value, but instead we are measuring the
fluctuation of the estimator itself.

The Bias-Variance Trade-off
It is intuitively clear that these two main components contribute to the mean squared error. What
is somewhat shocking is that we can show that this is actually a decomposition of the mean squared
error into these two contributions plus a third one. That is to say that we can write the mean
squared error as the sum of the square of the bias, the variance and the irreducible error.
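$$\begin{aligned}
\mathrm{MSE}(\hat{\theta}_n, \theta) &= E[(\hat{\theta}_n - \theta)^2] \\
&= E[(\hat{\theta}_n)^2] + E[\theta^2] - 2E[\hat{\theta}_n \theta] \\
&= \mathrm{Var}[\hat{\theta}_n] + E[\hat{\theta}_n]^2 + \mathrm{Var}[\theta] + E[\theta]^2 - 2E[\hat{\theta}_n]E[\theta] \\
&= (E[\hat{\theta}_n] - E[\theta])^2 + \mathrm{Var}[\hat{\theta}_n] + \mathrm{Var}[\theta] \\
&= (E[\hat{\theta}_n - \theta])^2 + \mathrm{Var}[\hat{\theta}_n] + \mathrm{Var}[\theta] \\
&= (\mathrm{bias}[\hat{\theta}_n])^2 + \mathrm{Var}(\hat{\theta}_n) + \mathrm{Var}[\theta].
\end{aligned} \tag{19.10.4}$$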
We refer to the above formula as the bias-variance trade-off. The mean squared error can be divided
into three sources of error: the error from high bias, the error from high variance and the irreducible
error. The bias error is commonly seen in a simple model (such as a linear regression model),
which cannot extract high dimensional relations between the features and the outputs. If a model
suffers from high bias error, we often say it is underfitting or lacks flexibility as introduced in
(Section 3.6). The high variance usually results from a too complex model, which overfits the training
data. As a result, an overfitting model is sensitive to small fluctuations in the data. If a model
suffers from high variance, we often say it is overfitting and lacks generalization as introduced in
(Section 3.6). The irreducible error is the result of noise in θ itself.
Evaluating Estimators in Code
Since the standard deviation of an estimator can be computed by simply calling a.std() for
a tensor a, we will skip it but implement the statistical bias and the mean squared error.
# Statistical bias
def stat_bias(true_theta, est_theta):
    return(torch.mean(est_theta) - true_theta)
# Mean squared error
def mse(data, true_theta):
    return(torch.mean(torch.square(data - true_theta)))
To illustrate the equation of the bias-variance trade-off, letʼs simulate a normal distribution
N(θ, σ²) with 10,000 samples. Here, we use θ = 1 and σ = 4. As the estimator is a function
of the given samples, here we use the mean of the samples as an estimator for the true θ in this
normal distribution N(θ, σ²).
theta_true = 1
sigma = 4
sample_len = 10000
samples = torch.normal(theta_true, sigma, size=(sample_len, 1))
theta_est = torch.mean(samples)
theta_est

tensor(1.0170)
Letʼs validate the trade-off equation by calculating the summation of the squared bias and the
variance of our estimator. First, calculate the MSE of our estimator.
mse(samples, theta_true)
tensor(16.0298)
Next, we calculate Var(θ̂_n) + [bias(θ̂_n)]² as below. As you can see, the two values agree to numerical
precision.
bias = stat_bias(theta_true, theta_est)

torch.square(samples.std(unbiased=False)) + torch.square(bias)
tensor(16.0298)
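The check above treats each individual sample as the quantity being compared with θ. A complementary
sketch, given here under stated assumptions, treats the estimator itself as the random quantity: it
repeats the experiment many times, applies a deliberately biased estimator (the sample mean shrunk by
a hypothetical factor), and compares the empirical MSE with the squared bias plus the variance. Since
θ is fixed here, the Var[θ] term in (19.10.4) is zero. The sample size, number of trials, and shrinkage
factor are illustrative choices, and only the standard library is used.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

theta_true, sigma = 1.0, 4.0  # same values as in the text
n, trials = 25, 4000          # illustrative choices
shrink = 0.8                  # hypothetical shrinkage factor that introduces bias

# Draw many datasets and compute the (biased) estimate on each of them
estimates = []
for _ in range(trials):
    xs = [random.gauss(theta_true, sigma) for _ in range(n)]
    estimates.append(shrink * sum(xs) / n)

mean_est = sum(estimates) / trials
mse = sum((e - theta_true) ** 2 for e in estimates) / trials
bias_sq = (mean_est - theta_true) ** 2
variance = sum((e - mean_est) ** 2 for e in estimates) / trials

print(mse, bias_sq + variance, bias_sq, variance)
```

The first two printed numbers should agree up to rounding. For this estimator the squared bias is
(0.8 − 1)² θ² = 0.04 and the variance is 0.8² σ²/n = 0.4096 in theory, so the empirical values should
land near those.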
19.10.2 Conducting Hypothesis Tests
The most commonly encountered topic in statistical inference is hypothesis testing. While
hypothesis testing was popularized in the early 20th century, the first use can be traced back to John
Arbuthnot in the 1700s. Arbuthnot tracked 80 years of birth records in London and concluded that more
men were born than women each year. Following that, modern significance testing is the
intellectual heritage of Karl Pearson, who invented the p-value and Pearsonʼs chi-squared test, William
Gosset, who is the father of Studentʼs t-distribution, and Ronald Fisher, who initiated the null
hypothesis and the significance test.
A hypothesis test is a way of evaluating some evidence against the default statement about a
population. We refer to the default statement as the null hypothesis H0, which we try to reject using the
observed data. Here, we use H0 as a starting point for the statistical significance testing. The
alternative hypothesis HA (or H1) is a statement that is contrary to the null hypothesis. A null hypothesis
is often stated in a declarative form which posits a relationship between variables. It should state
the claim as explicitly as possible, and be testable by statistical theory.
Imagine you are a chemist. After spending thousands of hours in the lab, you develop a new
medicine which can dramatically improve oneʼs ability to understand math. To show its magic
power, you need to test it. Naturally, you may need some volunteers to take the medicine and see
whether it can help them learn mathematics better. How do you get started?
First, you will need to carefully select two groups of volunteers at random, so that there is no
difference between their mathematical understanding ability as measured by some metrics. The two
groups are commonly referred to as the test group and the control group. The test group (or
treatment group) is a group of individuals who will experience the medicine, while the control group
represents the group of users who are set aside as a benchmark, i.e., with identical environment
setups except for taking this medicine. In this way, the influence of all the variables is minimized,
except the impact of the independent variable in the treatment.
Second, after a period of taking the medicine, you will need to measure the two groupsʼ
mathematical understanding by the same metrics, such as letting the volunteers do the same tests after
learning a new mathematical formula. Then, you can collect their performance and compare the
results. In this case, our null hypothesis will be that there is no difference between the two groups,
and our alternative will be that there is.
This is still not fully formal. There are many details you have to think of carefully. For example,
what are suitable metrics to test their mathematical understanding ability? How many volunteers
do you need for your test so you can be confident in claiming the effectiveness of your medicine? How
long should you run the test? How do you decide if there is a difference between the two groups? Do
you care about the average performance only, or also the range of variation of the scores? And so
on.
In this way, hypothesis testing provides a framework for experimental design and reasoning about
certainty in observed results. If we can now show that the null hypothesis is very unlikely to be
true, we may reject it with confidence.
To complete the story of how to work with hypothesis testing, we need to now introduce some
additional terminology and make some of our concepts above formal.
Statistical Significance
Statistical significance measures the probability of not erroneously rejecting the null hypothesis,
H0, when it should not be rejected, i.e.,
$$\text{statistical significance} = 1 - \alpha = 1 - P(\text{reject } H_0 \mid H_0 \text{ is true}). \tag{19.10.5}$$
Rejecting H0 when it is true is referred to as a type I error or false positive. Its probability α is
called the significance level, and its commonly used value is 5%, i.e., 1 − α = 95%. The significance
level can be explained as the level of risk that we are willing to take of rejecting a true null
hypothesis.
Fig. 19.10.1 shows the observations' values and probability of a given normal distribution in a
two-sample hypothesis test. If an observed data example is located outside the 95% threshold, it
is a very unlikely observation under the null hypothesis. Hence, there might be
something wrong with the null hypothesis and we will reject it.
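The meaning of α can be checked with a small simulation. The sketch below is an illustration under stated assumptions rather than part of the original text: it uses only the Python standard library, draws data for which the null hypothesis (mean zero, known variance one) is true, and applies a two-sided z-test. The helper name `z_statistic` and the choices of 25 samples per trial and 20000 trials are hypothetical.

```python
import math
import random
from statistics import NormalDist

random.seed(0)
alpha = 0.05
# Two-sided critical value: reject when |z| exceeds it (about 1.96)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

def z_statistic(n):
    """Draw n samples from N(0, 1), i.e., with the null hypothesis true,
    and return the z statistic of their mean (known variance 1)."""
    sample_mean = sum(random.gauss(0, 1) for _ in range(n)) / n
    return sample_mean * math.sqrt(n)

trials = 20000
rejections = sum(abs(z_statistic(25)) > z_crit for _ in range(trials))
type_one_rate = rejections / trials
print(f"critical value: {z_crit:.3f}, empirical type I error rate: {type_one_rate:.3f}")
```

Because every rejection here is a false positive by construction, the empirical rejection rate should land close to the chosen α.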
Fig. 19.10.1: Statistical significance.

Statistical Power
The statistical power (or sensitivity) measures the probability of rejecting the null hypothesis, H0, when
it should be rejected, i.e.,
statistical power = 1 − β = 1 − P (fail to reject H0 | H0 is false). (19.10.6)
Recall that a type I error is the error caused by rejecting the null hypothesis when it is true, whereas a
type II error results from failing to reject the null hypothesis when it is false. The probability of a
type II error is usually denoted as β, and hence the corresponding statistical power is 1 − β.
Intuitively, statistical power can be interpreted as how likely our test will detect a real discrepancy
of some minimum magnitude at a desired statistical significance level. 80% is a commonly used
statistical power threshold. The higher the statistical power, the more likely we are to detect true
differences.
One of the most common uses of statistical power is in determining the number of samples
needed. The probability that you reject the null hypothesis when it is false depends on the degree
to which it is false (known as the effect size) and the number of samples you have. As you might
expect, small effect sizes will require a very large number of samples to be detectable with high
probability. While a detailed derivation is beyond the scope of this brief appendix, consider an
example. Suppose that we want to be able to reject the null hypothesis that our sample came from a
Gaussian with mean zero and variance one. If we believe that our sample's mean is actually close to
one, we can do so with acceptable error rates with a sample size of only 8. However, if we think the
true mean of our sample population is close to 0.01, then we would need a sample size of nearly
80000 to detect the difference.
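The two sample sizes quoted above can be reproduced with a standard approximation. The sketch below assumes a two-sided z-test with known unit variance, α = 0.05, and power 0.8 (the text only says "acceptable error rates", so these values are assumptions), and uses the formula n = ((z_{1−α/2} + z_{1−β}) / effect size)². The function name `required_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_size(effect_size, alpha=0.05, power=0.8):
    """Approximate sample size for a two-sided z-test of H0: mean = 0
    with unit variance, to detect a true mean equal to effect_size."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96
    z_power = std_normal.inv_cdf(power)          # about 0.84
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)

print(required_sample_size(1.0))   # effect size close to one
print(required_sample_size(0.01))  # effect size close to 0.01
```

Since the effect size enters squared in the denominator, shrinking it by a factor of 100 multiplies the required sample size by roughly 10000.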
We can imagine the power as a water filter. In this analogy, a high power hypothesis test is like a
high quality water filtration system that will reduce harmful substances in the water as much as
possible. On the other hand, a low power test is like a low quality water filter, where some
relatively small substances may easily escape through the gaps. Similarly, if the statistical power is
not high enough, then the test may not catch a smaller discrepancy.
Test Statistic
A test statistic T (x) is a scalar which summarizes some characteristic of the sample data. The goal
of defining such a statistic is that it should allow us to distinguish between different distributions
and conduct our hypothesis test. Thinking back to our chemist example, if we wish to show that
one population performs better than the other, it could be reasonable to take the mean as the
test statistic. Different choices of test statistic can lead to statistical tests with drastically different
statistical power.
Often, T(X) (the test statistic viewed as a random variable under our null hypothesis) will follow, at
least approximately, a common probability distribution such as a normal distribution.
If we can derive such a distribution explicitly, and then measure our
test statistic on our dataset, we can safely reject the null hypothesis if our statistic is far outside
the range that we would expect. Making this quantitative leads us to the notion of p-values.

p-value
The p-value (or the probability value) is the probability that T(X) is at least as extreme as the observed
test statistic T(x) assuming that the null hypothesis is true, i.e.,
p-value = P_{H_0}(T(X) ≥ T(x)). (19.10.7)
If the p-value is smaller than or equal to a predefined and fixed statistical significance level α, we
may reject the null hypothesis. Otherwise, we conclude that we lack the evidence to reject
the null hypothesis. For a given population distribution, the region of rejection will be the interval
containing all the points which have a p-value smaller than the statistical significance level α.
One-sided Test and Two-sided Test
Normally there are two kinds of significance test: the one-sided test and the two-sided test. The
one-sided test (or one-tailed test) is applicable when the null hypothesis and the alternative hypoth-
esis only have one direction. For example, the null hypothesis may state that the true parameter
θ is less than or equal to a value c. The alternative hypothesis would be that θ is greater than c.
That is, the region of rejection is on only one side of the sampling distribution. Contrary to the
one-sided test, the two-sided test (or two-tailed test) is applicable when the region of rejection is on
both sides of the sampling distribution. For example, the null hypothesis may state
that the true parameter θ is equal to a value c. The alternative hypothesis would be that θ is not
equal to c.
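The difference between the two kinds of test shows up directly in the p-value. The following sketch assumes that the test statistic follows a standard normal distribution under the null hypothesis; the function `p_value` and the observed value 1.8 are hypothetical choices made for illustration.

```python
from statistics import NormalDist

def p_value(t_obs, two_sided=False):
    """p-value for an observed statistic t_obs whose null distribution
    is assumed to be the standard normal."""
    upper_tail = 1 - NormalDist().cdf(abs(t_obs) if two_sided else t_obs)
    # A two-sided test counts extreme values in both tails
    return 2 * upper_tail if two_sided else upper_tail

alpha = 0.05
t_obs = 1.8
p_one = p_value(t_obs)
p_two = p_value(t_obs, two_sided=True)
print(f"one-sided: {p_one:.4f}, reject: {p_one <= alpha}")
print(f"two-sided: {p_two:.4f}, reject: {p_two <= alpha}")
```

For this observation the one-sided test rejects at α = 0.05 while the two-sided test does not, since the two-sided p-value is twice as large.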
General Steps of Hypothesis Testing
After getting familiar with the above concepts, letʼs go through the general steps of hypothesis
testing.
1. State the question and establish a null hypothesis H0.
2. Set the statistical significance level α and a statistical power (1 − β).
3. Obtain samples through experiments. The number of samples needed will depend on the
statistical power, and the expected effect size.
4. Calculate the test statistic and the p-value.
5. Make the decision to keep or reject the null hypothesis based on the p-value and the statistical
significance level α.
To conduct a hypothesis test, we start by defining a null hypothesis and a level of risk that we
are willing to take. Then we calculate the test statistic of the sample, taking an extreme value of
the test statistic as evidence against the null hypothesis. If the test statistic falls within the rejection
region, we may reject the null hypothesis in favor of the alternative.
Hypothesis testing is applicable in a variety of scenarios such as clinical trials and A/B testing.
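The five steps can be traced end to end on simulated data. The sketch below is only an illustration: the test scores are synthetic, and the baseline score of 70, the assumed effect of 5 points, and the score standard deviation of 10 are all hypothetical. It uses a two-sample z statistic with a normal approximation rather than any particular test prescribed by the text.

```python
import math
import random
from statistics import NormalDist, mean, variance

random.seed(1)
std_normal = NormalDist()

# Step 1: H0 states that the two groups have the same mean score.
# Step 2: choose the significance level and the desired power.
alpha, power = 0.05, 0.8

# Step 3: pick the sample size per group for the assumed effect size,
# then collect (here: simulate) the scores of both groups.
effect, sigma = 5.0, 10.0
z_alpha = std_normal.inv_cdf(1 - alpha / 2)
z_power = std_normal.inv_cdf(power)
n = math.ceil(2 * ((z_alpha + z_power) * sigma / effect) ** 2)
control = [random.gauss(70, sigma) for _ in range(n)]
treatment = [random.gauss(70 + effect, sigma) for _ in range(n)]

# Step 4: two-sample z statistic and its two-sided p-value.
std_err = math.sqrt(variance(control) / n + variance(treatment) / n)
z = (mean(treatment) - mean(control)) / std_err
p = 2 * (1 - std_normal.cdf(abs(z)))

# Step 5: compare the p-value with alpha.
decision = "reject H0" if p <= alpha else "fail to reject H0"
print(f"n per group = {n}, z = {z:.2f}, p = {p:.4f}: {decision}")
```

With a power of 0.8, roughly one in five such experiments would fail to reject the null hypothesis even though the simulated effect is real.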

19.10.3 Constructing Confidence Intervals
When estimating the value of a parameter θ, point estimators like θ̂ are of limited utility since they
contain no notion of uncertainty. Rather, it would be far better if we could produce an interval
that would contain the true parameter θ with high probability. If you were interested in such
ideas a century ago, then you would have been excited to read “Outline of a Theory of Statistical
Estimation Based on the Classical Theory of Probability” by Jerzy Neyman (Neyman, 1937), who
first introduced the concept of confidence interval in 1937.
To be useful, a confidence interval should be as small as possible for a given degree of certainty.
Letʼs see how to derive it.
Definition
Mathematically, a confidence interval for the true parameter θ is an interval Cn that is computed from
the sample data such that
Pθ (Cn ∋ θ) ≥ 1 − α, ∀θ. (19.10.8)
Here α ∈ (0, 1), and 1 − α is called the confidence level or coverage of the interval. This is the same
α as the significance level that we discussed above.
Note that (19.10.8) is about variable Cn , not about the fixed θ. To emphasize this, we write Pθ (Cn ∋
θ) rather than Pθ (θ ∈ Cn ).
Interpretation
It is very tempting to interpret a 95% confidence interval as an interval where you can be 95% sure
the true parameter lies. However, this is sadly not true. The true parameter is fixed, and it is the
interval that is random. Thus a better interpretation would be to say that if you generated a large
number of confidence intervals by this procedure, 95% of the generated intervals would contain
the true parameter.
This may seem pedantic, but it can have real implications for the interpretation of the results.
In particular, we may satisfy (19.10.8) by constructing intervals that we are almost certain do not
contain the true value, as long as we only do so rarely enough. We close this section by providing
three tempting but false statements. An in-depth discussion of these points can be found in (Morey
et al., 2016).
• Fallacy 1. Narrow confidence intervals mean we can estimate the parameter precisely.
• Fallacy 2. The values inside the confidence interval are more likely to be the true value than
those outside the interval.
• Fallacy 3. The probability that a particular observed 95% confidence interval contains the
true value is 95%.
Suffice it to say, confidence intervals are subtle objects. However, if you keep the interpretation
clear, they can be powerful tools.

A Gaussian Example
Letʼs discuss the most classical example, the confidence interval for the mean of a Gaussian of
unknown mean and variance. Suppose we collect n samples {xi }ni=1 from our Gaussian N (µ, σ 2 ).
We can compute estimators for the mean and variance by taking
$$\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n x_i \quad \text{and} \quad \hat{\sigma}_n^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu}_n)^2.$$ (19.10.9)
If we now consider the random variable
$$T = \frac{\hat{\mu}_n - \mu}{\hat{\sigma}_n/\sqrt{n}},$$ (19.10.10)
we obtain a random variable following a well-known distribution called the Student’s t-distribution
on n − 1 degrees of freedom.
This distribution is very well studied, and it is known, for instance, that as n → ∞, it is approximately
a standard Gaussian. Thus, by looking up values of the Gaussian c.d.f. in a table, we
may conclude that the value of T is in the interval [−1.96, 1.96] at least 95% of the time. For finite
values of n, the interval needs to be somewhat larger, but the required values are well known and
precomputed in tables.
Thus, we may conclude that for large n,
$$P\left(\frac{\hat{\mu}_n - \mu}{\hat{\sigma}_n/\sqrt{n}} \in [-1.96, 1.96]\right) \ge 0.95.$$ (19.10.11)
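Rearranging this by multiplying both sides by $\hat{\sigma}_n/\sqrt{n}$ and then adding $\hat{\mu}_n$, we obtain
$$P\left(\mu \in \left[\hat{\mu}_n - 1.96\frac{\hat{\sigma}_n}{\sqrt{n}}, \hat{\mu}_n + 1.96\frac{\hat{\sigma}_n}{\sqrt{n}}\right]\right) \ge 0.95.$$ (19.10.12)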
Thus we know that we have found our 95% confidence interval:

$$\left[\hat{\mu}_n - 1.96\frac{\hat{\sigma}_n}{\sqrt{n}}, \hat{\mu}_n + 1.96\frac{\hat{\sigma}_n}{\sqrt{n}}\right].$$ (19.10.13)
It is safe to say that (19.10.13) is one of the most used formulas in statistics. Let's close our discussion
of statistics by implementing it. For simplicity, we assume we are in the asymptotic regime. For small
values of N, you should instead use the correct value of t_star obtained either programmatically or from
a t-table.
import torch

# PyTorch applies Bessel's correction by default, which corresponds to ddof=1
# in numpy (whose default is ddof=0). Passing unbiased=False imitates ddof=0.

# Number of samples
N = 1000
# Sample dataset
samples = torch.normal(0, 1, size=(N,))
# Critical value from Student's t-distribution; 1.96 is the asymptotic
# (Gaussian) value for a 95% interval
t_star = 1.96
# Construct interval
mu_hat = torch.mean(samples)
sigma_hat = samples.std(unbiased=True)
half_width = t_star * sigma_hat / torch.sqrt(torch.tensor(N, dtype=torch.float32))
(mu_hat - half_width, mu_hat + half_width)
(tensor(-0.0568), tensor(0.0704))
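The repeated-sampling interpretation discussed earlier can be checked empirically: if the interval is rebuilt on many independent datasets, about 95% of the intervals should contain the true mean. The sketch below is a standard-library illustration, not the PyTorch code above; the helper `confidence_interval`, the sample size of 200, and the 1000 repetitions are hypothetical choices.

```python
import math
import random
from statistics import mean, stdev

random.seed(0)
true_mu, n, t_star, repeats = 0.0, 200, 1.96, 1000

def confidence_interval(samples, t_star):
    """Asymptotic interval mu_hat +/- t_star * sigma_hat / sqrt(n),
    where stdev applies Bessel's correction."""
    mu_hat = mean(samples)
    half_width = t_star * stdev(samples) / math.sqrt(len(samples))
    return mu_hat - half_width, mu_hat + half_width

covered = 0
for _ in range(repeats):
    data = [random.gauss(true_mu, 1) for _ in range(n)]
    low, high = confidence_interval(data, t_star)
    covered += low <= true_mu <= high
coverage = covered / repeats
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

Note that the 95% describes the procedure across repetitions; any single interval either contains the true mean or it does not.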
19.10.4 Summary
• Statistics focuses on inference problems, whereas deep learning emphasizes making accurate
predictions without explicitly programming and understanding.
• There are three common statistical inference methods: evaluating and comparing estimators,
conducting hypothesis tests, and constructing confidence intervals.
• There are three common criteria for evaluating estimators: statistical bias, standard deviation,
and mean square error.
• A confidence interval is an estimated range for a true population parameter that we can construct
from the given samples.
• Hypothesis testing is a way of evaluating some evidence against the default statement about
a population.
Exercises
1. Let $X_1, X_2, \ldots, X_n \overset{\text{iid}}{\sim} \mathrm{Unif}(0, \theta)$, where “iid” stands for independent and identically distributed.
Consider the following estimators of θ:
$$\hat{\theta} = \max\{X_1, X_2, \ldots, X_n\};$$ (19.10.14)
$$\tilde{\theta} = 2\bar{X}_n = \frac{2}{n}\sum_{i=1}^n X_i.$$ (19.10.15)
• Find the statistical bias, standard deviation, and mean square error of θ̂.
• Find the statistical bias, standard deviation, and mean square error of θ̃.
• Which estimator is better?
2. For our chemist example in the introduction, can you derive the 5 steps for conducting a two-sided
hypothesis test, given the statistical significance level α = 0.05 and the statistical power
1 − β = 0.8?
3. Run the confidence interval code with N = 2 and α = 0.5 for 100 independently generated
datasets, and plot the resulting intervals (in this case t_star = 1.0). You will see several very
short intervals which are very far from containing the true mean 0. Does this contradict the
interpretation of the confidence interval? Do you feel comfortable using short intervals to
indicate high precision estimates?

19.11 Information Theory
The universe is overflowing with information. Information provides a common language across
disciplinary rifts: from Shakespeare's sonnets to researchers' papers on Cornell ArXiv, from Van
Gogh's painting Starry Night to Beethoven's Symphony No. 5, from the first programming
language Plankalkül to state-of-the-art machine learning algorithms. Everything must follow
the rules of information theory, no matter the format. With information theory, we can measure
and compare how much information is present in different signals. In this section, we will investigate
the fundamental concepts of information theory and applications of information theory in
machine learning.
Before we get started, letʼs outline the relationship between machine learning and information
theory. Machine learning aims to extract interesting signals from data and make critical pre-
dictions. On the other hand, information theory studies encoding, decoding, transmitting, and
manipulating information. As a result, information theory provides fundamental language for
discussing the information processing in machine learned systems. For example, many machine
learning applications use the cross-entropy loss as described in Section 4.1. This loss can be di-
rectly derived from information theoretic considerations.
19.11.1 Information
Letʼs start with the “soul” of information theory: information. Information can be encoded in any-
thing with a particular sequence of one or more encoding formats. Suppose that we task ourselves
with trying to define a notion of information. What could be our starting point?
Consider the following thought experiment. We have a friend with a deck of cards. They will
shuffle the deck, flip over some cards, and tell us statements about the cards. We will try to assess
the information content of each statement.
First, they flip over a card and tell us, “I see a card.” This provides us with no information at all.
We were already certain that this was the case so we hope the information should be zero.
Next, they flip over a card and say, “I see a heart.” This provides us some information, but in reality
there are only 4 different suits that were possible, each equally likely, so we are not surprised by
this outcome. We hope that whatever the measure of information, this event should have low
information content.
Next, they flip over a card and say, “This is the 3 of spades.” This is more information. Indeed there
were 52 equally likely possible outcomes, and our friend told us which one it was. This should be
a medium amount of information.
Letʼs take this to the logical extreme. Suppose that finally they flip over every card from the deck
and read off the entire sequence of the shuffled deck. There are 52! different orders to the deck,
again all equally likely, so we need a lot of information to know which one it is.
Any notion of information we develop must conform to this intuition. Indeed, in the next sec-
tions we will learn how to compute that these events have 0 bits, 2 bits, 5.7 bits, and 225.6 bits of
information respectively.
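As a preview of those computations, the four numbers follow from taking the base-2 logarithm of the number of equally likely outcomes that each statement singles out. The sketch below relies on that rule, which is developed in the following sections; the helper name `bits` is hypothetical.

```python
import math

def bits(num_equally_likely_outcomes):
    """Information, in bits, of learning which one of N equally likely
    outcomes occurred."""
    return math.log2(num_equally_likely_outcomes)

statements = {
    "I see a card": 1,
    "I see a heart": 4,
    "This is the 3 of spades": 52,
    "the full order of the deck": math.factorial(52),
}
for statement, outcomes in statements.items():
    print(f"{statement}: {bits(outcomes):.1f} bits")
```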

If we read through these thought experiments, we see a natural idea. As a starting point, rather
than caring about the knowledge, we may build off the idea that information represents the degree
of surprise or the abstract possibility of the event. For example, if we want to describe an unusual
event, we need a lot of information. For a common event, we may not need much information.
In 1948, Claude E. Shannon published A Mathematical Theory of Communication (Shannon, 1948)
establishing the theory of information. In his article, Shannon introduced the concept of infor-
mation entropy for the first time. We will begin our journey here.
Self-information
Since information embodies the abstract possibility of an event, how do we map the possibility
to the number of bits? Shannon introduced the term bit as the unit of information, a term that
was originally coined by John Tukey. So what is a “bit” and why do we use it to measure information?
Historically, an antique transmitter could only send or receive two types of code: 0 and 1.
Indeed, binary encoding is still in common use on all modern digital computers. In this way,
any information is encoded by a series of 0s and 1s. Hence, a series of binary digits of length n
contains n bits of information.
Now, suppose that for any series of codes, each 0 or 1 occurs with a probability of 1/2. Hence, an
event X with a series of codes of length n occurs with a probability of 1/2^n. At the same time,
as we mentioned before, this series contains n bits of information. So, can we generalize to a
mathematical function which can map the probability p to the number of bits? Shannon gave
the answer by defining self-information
I(X) = − log2 (p), (19.11.1)
as the bits of information we have received for this event X. Note that we will always use base-2
logarithms in this section. For the sake of simplicity, the rest of this section will omit the subscript
2 in the logarithm notation, i.e., log(.) always refers to log2 (.). For example, the code “0010” has a
self-information
$$I(\text{"0010"}) = -\log(p(\text{"0010"})) = -\log\left(\frac{1}{2^4}\right) = 4 \text{ bits}.$$ (19.11.2)
We can calculate self information as shown below. Before that, letʼs first import all the necessary
packages in this section.
import torch
from torch.nn import NLLLoss

def nansum(x):
    # Sum the entries of x that are not NaN (e.g., NaN arises from 0 * log2(0))
    return x[~torch.isnan(x)].sum()

def self_information(p):
    return -torch.log2(torch.tensor(p)).item()

self_information(1 / 64)
6.0

19.11.2 Entropy
As self-information only measures the information of a single discrete event, we need a more
generalized measure for any random variable of either discrete or continuous distribution.
Motivating Entropy
Letʼs try to get specific about what we want. This will be an informal statement of what are known
as the axioms of Shannon entropy. It will turn out that the following collection of common-sense
statements force us to a unique definition of information. A formal version of these axioms, along
with several others may be found in (Csiszár, 2008).
1. The information we gain by observing a random variable does not depend on what we call
the elements, or the presence of additional elements which have probability zero.
2. The information we gain by observing two random variables is no more than the sum of the
information we gain by observing them separately. If they are independent, then it is exactly
the sum.
3. The information gained when observing (nearly) certain events is (nearly) zero.
While proving this fact is beyond the scope of our text, it is important to know that this uniquely
determines the form that entropy must take. The only ambiguity that these allow is in the choice
of fundamental units, which is most often normalized by making the choice we saw before that
the information provided by a single fair coin flip is one bit.
Definition
For any random variable X that follows a probability distribution P with a probability density
function (p.d.f.) or a probability mass function (p.m.f.) p(x), we measure the expected amount of
information through entropy (or Shannon entropy)
H(X) = −Ex∼P [log p(x)]. (19.11.3)
To be specific, if X is discrete,
$$H(X) = -\sum_i p_i \log p_i, \text{ where } p_i = P(X_i).$$ (19.11.4)
Otherwise, if X is continuous, we also refer to entropy as differential entropy
$$H(X) = -\int_x p(x) \log p(x) \, dx.$$ (19.11.5)
We can define entropy as below.
def entropy(p):
    ent = - p * torch.log2(p)
    # `nansum` sums only the non-NaN entries
    out = nansum(ent)
    return out

entropy(torch.tensor([0.1, 0.5, 0.1, 0.3]))

tensor(1.6855)
Interpretations
You may be curious: in the entropy definition (19.11.3), why do we use an expectation of a negative
logarithm? Here are some intuitions.
First, why do we use a logarithm function log? Suppose that p(x) = f1(x)f2(x) · · · fn(x), where
the component functions fi(x) are independent of each other. This means that each fi(x) contributes
independently to the total information obtained from p(x). As discussed above, we want
the entropy formula to be additive over independent random variables. Luckily, log can naturally
turn a product of probability distributions into a summation of the individual terms.
Next, why do we use a negative log? Intuitively, more frequent events should contain less infor-
mation than less common events, since we often gain more information from an unusual case
than from an ordinary one. However, log is monotonically increasing with the probabilities, and
indeed non-positive for all values in (0, 1]. We need to construct a monotonically decreasing relationship
between the probability of events and their entropy, which will ideally be always non-negative (for
nothing we observe should force us to forget what we have known). Hence, we add a negative sign
in front of the log function.
Last, where does the expectation function come from? Consider a random variable X. We can
interpret the self-information (− log(p)) as the amount of surprise we have at seeing a particular
outcome. Indeed, as the probability approaches zero, the surprise becomes infinite. Similarly,
we can interpret the entropy as the average amount of surprise from observing X. For example,
imagine that a slot machine system emits statistically independent symbols s1, . . . , sk with
probabilities p1, . . . , pk respectively. Then the entropy of this system equals the average
self-information from observing each output, i.e.,
$$H(S) = \sum_i p_i \cdot I(s_i) = -\sum_i p_i \cdot \log p_i.$$ (19.11.6)
Properties of Entropy
By the above examples and interpretations, we can derive the following properties of entropy
(19.11.3). Here, we refer to X as a random variable and P as the probability distribution of X.
• H(X) ≥ 0 for all discrete X (entropy can be negative for continuous X).
• If X ∼ P with a p.d.f. or a p.m.f. p(x), and we try to estimate P by a new probability
distribution Q with a p.d.f. or a p.m.f. q(x), then
H(X) = −Ex∼P [log p(x)] ≤ −Ex∼P [log q(x)], with equality if and only if P = Q. (19.11.7)
Alternatively, H(X) gives a lower bound of the average number of bits needed to encode
symbols drawn from P .
• If X ∼ P , then x conveys the maximum amount of information if it spreads evenly among
all possible outcomes. Specifically, if the probability distribution P is discrete with k-class
{p1 , . . . , pk }, then
$$H(X) \le \log(k), \text{ with equality if and only if } p_i = \frac{1}{k}, \forall i.$$ (19.11.8)

If P is a continuous distribution, then the story becomes much more complicated.
However, if we additionally impose that P is supported on a finite interval (for instance, with all
values between 0 and 1), then P has the highest entropy if it is the uniform distribution on that
interval.
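The discrete properties above can be checked numerically on a small example. The sketch below uses plain Python rather than the PyTorch helpers of this section; the function names `entropy` and `cross_entropy` and the two distributions are illustrative choices, with `cross_entropy` standing for the right-hand side of (19.11.7).

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """-E_{x~P}[log2 q(x)], the right-hand side of (19.11.7)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.5, 0.1, 0.3]
q = [0.25, 0.25, 0.25, 0.25]  # uniform distribution over k = 4 classes

print(f"H(P) = {entropy(p):.4f}")
print(f"-E_P[log q] = {cross_entropy(p, q):.4f}")
print(f"H(uniform) = {entropy(q):.4f}, log2(k) = {math.log2(len(q)):.4f}")
```

The entropy of `p` is non-negative, does not exceed the cross term computed with `q`, and stays below log2(4) = 2 bits, which the uniform distribution attains.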
19.11.3 Mutual Information
Previously we defined the entropy of a single random variable X. How about the entropy of a pair
of random variables (X, Y)? We can think of these techniques as trying to answer the following type
of question, “What information is contained in X and Y together compared to each separately? Is
there redundant information, or is it all unique?”
For the following discussion, we always use (X, Y ) as a pair of random variables that follows a joint
probability distribution P with a p.d.f. or a p.m.f. pX,Y (x, y), while X and Y follow probability
distribution pX (x) and pY (y), respectively.
Joint Entropy
Similar to the entropy of a single random variable (19.11.3), we define the joint entropy H(X, Y) of a
pair of random variables (X, Y) as
H(X, Y ) = −E(x,y)∼P [log pX,Y (x, y)]. (19.11.9)
Precisely, on the one hand, if (X, Y) is a pair of discrete random variables, then
$$H(X, Y) = -\sum_x \sum_y p_{X,Y}(x, y) \log p_{X,Y}(x, y).$$ (19.11.10)
On the other hand, if (X, Y) is a pair of continuous random variables, then we define the differential
joint entropy as
$$H(X, Y) = -\int_{x,y} p_{X,Y}(x, y) \log p_{X,Y}(x, y) \, dx \, dy.$$ (19.11.11)
We can think of (19.11.9) as telling us the total randomness in the pair of random variables. As a
pair of extremes, if X = Y are two identical random variables, then the information in the pair
is exactly the information in one and we have H(X, Y ) = H(X) = H(Y ). On the other extreme,
if X and Y are independent then H(X, Y ) = H(X) + H(Y ). Indeed we will always have that
the information contained in a pair of random variables is no smaller than the entropy of either
random variable and no more than the sum of both.
H(X), H(Y ) ≤ H(X, Y ) ≤ H(X) + H(Y ). (19.11.12)
Letʼs implement joint entropy from scratch.
def joint_entropy(p_xy):
    joint_ent = -p_xy * torch.log2(p_xy)
    out = nansum(joint_ent)
    return out

joint_entropy(torch.tensor([[0.1, 0.5], [0.1, 0.3]]))

tensor(1.6855)
Notice that this is the same code as before, but now we interpret it differently as working on the
joint distribution of the two random variables.
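The bounds in (19.11.12) can be verified on the same joint table by deriving both marginals from it. This is a plain-Python sketch; the convention that rows index x and columns index y is an assumption made here for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = [[0.1, 0.5], [0.1, 0.3]]         # assumed layout: rows x, columns y
p_x = [sum(row) for row in p_xy]        # marginal of X: sum over y
p_y = [sum(col) for col in zip(*p_xy)]  # marginal of Y: sum over x

h_xy = entropy([p for row in p_xy for p in row])
h_x, h_y = entropy(p_x), entropy(p_y)
print(f"H(X) = {h_x:.4f}, H(Y) = {h_y:.4f}, H(X, Y) = {h_xy:.4f}")
```

Here the joint entropy sits between max(H(X), H(Y)) and H(X) + H(Y), and is close to the upper bound because the two variables are nearly independent.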
Conditional Entropy
The joint entropy defined above measures the amount of information contained in a pair of random
variables. This is useful, but oftentimes it is not what we care about. Consider the setting of machine
learning. Let's take X to be the random variable (or vector of random variables) that describes
the pixel values of an image, and Y to be the random variable which is the class label. X should
contain substantial information, since a natural image is a complex thing. However, the information
contained in Y once the image has been shown should be low. Indeed, the image of a digit should
already contain the information about what digit it is unless the digit is illegible. Thus, to continue
to extend our vocabulary of information theory, we need to be able to reason about the information
content in a random variable conditional on another.
In probability theory, we saw the definition of the conditional probability to measure the relationship
between variables. We now want to analogously define the conditional entropy H(Y | X).
We can write this as
H(Y | X) = −E(x,y)∼P [log p(y | x)], (19.11.13)
where $p(y \mid x) = \frac{p_{X,Y}(x, y)}{p_X(x)}$ is the conditional probability. Specifically, if (X, Y) is a pair of discrete
random variables, then
$$H(Y \mid X) = -\sum_x \sum_y p(x, y) \log p(y \mid x).$$ (19.11.14)
If (X, Y) is a pair of continuous random variables, then the differential conditional entropy is similarly
defined as
$$H(Y \mid X) = -\int_x \int_y p(x, y) \log p(y \mid x) \, dx \, dy.$$ (19.11.15)
It is now natural to ask, how does the conditional entropy H(Y | X) relate to the entropy H(X) and
the joint entropy H(X, Y )? Using the definitions above, we can express this cleanly:
H(Y | X) = H(X, Y ) − H(X). (19.11.16)
This has an intuitive interpretation: the information in Y given X (H(Y | X)) is the same as the
information in both X and Y together (H(X, Y )) minus the information already contained in X.
This gives us the information in Y which is not also represented in X.
Now, letʼs implement conditional entropy (19.11.13) from scratch.
def conditional_entropy(p_xy, p_x):
    p_y_given_x = p_xy/p_x
    cond_ent = -p_xy * torch.log2(p_y_given_x)
    out = nansum(cond_ent)
    return out

conditional_entropy(torch.tensor([[0.1, 0.5], [0.2, 0.3]]),
                    torch.tensor([0.2, 0.8]))

tensor(0.8635)
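The identity (19.11.16) can also be confirmed numerically. The sketch below is a plain-Python illustration on a joint table that sums to one, with the marginal of X derived from the table itself; the layout (rows index x, columns index y) and the variable names are assumptions made for this example.

```python
import math

p_xy = [[0.1, 0.5], [0.1, 0.3]]   # assumed layout: rows x, columns y
p_x = [sum(row) for row in p_xy]  # marginal of X: sum over y

h_xy = -sum(p * math.log2(p) for row in p_xy for p in row if p > 0)
h_x = -sum(p * math.log2(p) for p in p_x if p > 0)

# Definition (19.11.14): -sum_{x,y} p(x, y) log p(y | x),
# with p(y | x) = p(x, y) / p(x)
h_y_given_x = -sum(
    p * math.log2(p / p_x[i])
    for i, row in enumerate(p_xy)
    for p in row
    if p > 0
)
print(f"H(Y | X) = {h_y_given_x:.4f}, H(X, Y) - H(X) = {h_xy - h_x:.4f}")
```

Both printed quantities agree, as (19.11.16) predicts.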
Mutual Information
Given the previous setting of random variables (X, Y ), you may wonder: “Now that we know how
much information is contained in Y but not in X, can we similarly ask how much information is
shared between X and Y ?” The answer will be the mutual information of (X, Y ), which we will
write as I(X, Y ).
Rather than diving straight into the formal definition, letʼs practice our intuition by first trying
to derive an expression for the mutual information entirely based on terms we have constructed
before. We wish to find the information shared between two random variables. One way we could
try to do this is to start with all the information contained in both X and Y together, and then
we take off the parts that are not shared. The information contained in both X and Y together is
written as H(X, Y ). We want to subtract from this the information contained in X but not in Y ,
and the information contained in Y but not in X. As we saw in the previous section, this is given
by H(X | Y ) and H(Y | X) respectively. Thus, we have that the mutual information should be
I(X, Y ) = H(X, Y ) − H(Y | X) − H(X | Y ). (19.11.17)
Indeed, this is a valid definition for the mutual information. If we expand out the definitions of
these terms and combine them, a little algebra shows that this is the same as
$$I(X, Y) = E_x E_y \left\{ p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)} \right\}.$$ (19.11.18)
We can summarize all of these relationships in Fig. 19.11.1. It is an excellent test of intuition
to see why the following statements are all also equivalent to I(X, Y).
• H(X) − H(X | Y )
• H(Y ) − H(Y | X)
• H(X) + H(Y ) − H(X, Y )
Fig. 19.11.1: Mutual informationʼs relationship with joint entropy and conditional entropy.
In many ways we can think of the mutual information (19.11.18) as a principled extension of the
correlation coefficient we saw in Section 19.6. This allows us to ask not only for linear relationships
between variables, but for the maximum information shared between the two random variables
of any kind.
Now, letʼs implement mutual information from scratch.
def mutual_information(p_xy, p_x, p_y):
    p = p_xy / (p_x * p_y)
    mutual = p_xy * torch.log2(p)
    out = nansum(mutual)
    return out

mutual_information(torch.tensor([[0.1, 0.5], [0.1, 0.3]]),
                   torch.tensor([0.2, 0.8]), torch.tensor([[0.75, 0.25]]))
tensor(0.7195)
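The equivalent expressions for I(X, Y) listed earlier can be compared numerically. The sketch below is a plain-Python cross-check that derives both marginals from the joint table itself (rows index x, columns index y, an assumed layout) and evaluates the pointwise sum alongside the entropy-based forms; all names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = [[0.1, 0.5], [0.1, 0.3]]         # assumed layout: rows x, columns y
p_x = [sum(row) for row in p_xy]        # marginal of X
p_y = [sum(col) for col in zip(*p_xy)]  # marginal of Y

h_x, h_y = entropy(p_x), entropy(p_y)
h_xy = entropy([p for row in p_xy for p in row])
h_y_given_x = h_xy - h_x  # (19.11.16)
h_x_given_y = h_xy - h_y  # (19.11.16) with the roles of X and Y swapped

# Sum of p(x, y) * log2(p(x, y) / (p(x) p(y))) over all outcomes
direct = sum(
    p * math.log2(p / (p_x[i] * p_y[j]))
    for i, row in enumerate(p_xy)
    for j, p in enumerate(row)
    if p > 0
)
forms = {
    "H(X,Y) - H(Y|X) - H(X|Y)": h_xy - h_y_given_x - h_x_given_y,
    "H(X) - H(X|Y)": h_x - h_x_given_y,
    "H(Y) - H(Y|X)": h_y - h_y_given_x,
    "H(X) + H(Y) - H(X,Y)": h_x + h_y - h_xy,
}
print(f"direct sum: {direct:.6f}")
for name, value in forms.items():
    print(f"{name}: {value:.6f}")
```

All values coincide, and the result is small for this table because X and Y are close to independent.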
Properties of Mutual Information
Rather than memorizing the definition of mutual information (19.11.18), you only need to keep in
mind its notable properties:
• Mutual information is symmetric, i.e., I(X, Y ) = I(Y, X).
• Mutual information is non-negative, i.e., I(X, Y ) ≥ 0.
• I(X, Y ) = 0 if and only if X and Y are independent. For example, if X and Y are indepen-
dent, then knowing Y does not give any information about X and vice versa, so their mutual
information is zero.
• Alternatively, if X is an invertible function of Y , then Y and X share all information and
I(X, Y ) = H(Y ) = H(X). (19.11.19)
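These properties can be observed on small joint tables. The following plain-Python sketch computes the marginals from each table; the three example tables and the helper names `mutual_information` and `transpose` are hypothetical.

```python
import math

def mutual_information(p_xy):
    """I(X, Y) in bits from a joint p.m.f. table (rows: x, columns: y)."""
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(col) for col in zip(*p_xy)]
    return sum(
        p * math.log2(p / (p_x[i] * p_y[j]))
        for i, row in enumerate(p_xy)
        for j, p in enumerate(row)
        if p > 0
    )

def transpose(table):
    """Swap the roles of X and Y."""
    return [list(col) for col in zip(*table)]

dependent = [[0.1, 0.5], [0.1, 0.3]]
# Product of the marginals [0.6, 0.4] and [0.2, 0.8]: X and Y independent
independent = [[0.6 * 0.2, 0.6 * 0.8], [0.4 * 0.2, 0.4 * 0.8]]
# Y is an invertible function of X: all mass sits on the diagonal
invertible = [[0.25, 0.0], [0.0, 0.75]]
h_x = -(0.25 * math.log2(0.25) + 0.75 * math.log2(0.75))

print(mutual_information(dependent), mutual_information(transpose(dependent)))
print(mutual_information(independent))
print(mutual_information(invertible), h_x)
```

The first line shows symmetry and non-negativity, the second is zero up to rounding, and the third matches H(X) as in (19.11.19).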
Pointwise Mutual Information
When we worked with entropy at the beginning of this chapter, we were able to provide an inter-
pretation of − log(pX (x)) as how surprised we were with the particular outcome. We may give a
similar interpretation to the logarithmic term in the mutual information, which is often referred
to as the pointwise mutual information:
$$\mathrm{pmi}(x, y) = \log \frac{p_{X,Y}(x, y)}{p_X(x) p_Y(y)}.$$ (19.11.20)
We can think of (19.11.20) as measuring how much more or less likely the specific combination
of outcomes x and y is compared to what we would expect for independent random outcomes.
If it is large and positive, then these two specific outcomes occur much more frequently than they
would by random chance (note: the denominator is pX(x)pY(y), which is the probability
of the two outcomes if they were independent), whereas if it is large and negative, it represents the two
outcomes happening far less often than we would expect by random chance.
This allows us to interpret the mutual information (19.11.18) as the average amount that we were
surprised to see two outcomes occurring together compared to what we would expect if they were
independent.
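To make this interpretation concrete, here is a minimal sketch that tabulates the pointwise mutual information of a small joint distribution and then averages it. The function name `pmi_table` and the joint table are hypothetical, and the sketch assumes every cell of the table is strictly positive so that the logarithm is defined.

```python
import math

def pmi_table(joint):
    """Pointwise mutual information (natural log) for every cell of a joint p.m.f."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return [[math.log(p / (p_x[i] * p_y[j])) for j, p in enumerate(row)]
            for i, row in enumerate(joint)]

# Hypothetical joint p.m.f.: outcomes (0, 0) and (1, 1) co-occur more often
# than they would under independence, (0, 1) and (1, 0) less often.
joint = [[0.4, 0.1], [0.1, 0.4]]
pmi = pmi_table(joint)

# Averaging the pointwise values under the joint distribution gives the
# mutual information (here in nats, since we used the natural logarithm).
mi = sum(p * pmi[i][j] for i, row in enumerate(joint) for j, p in enumerate(row))
print(pmi, mi)
```

The diagonal entries of `pmi` come out positive and the off-diagonal ones negative, while their probability-weighted average is non-negative.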

Applications of Mutual Information
Mutual information may be a little abstract in its pure definition, so how does it relate to machine
learning? In natural language processing, one of the most difficult problems is ambiguity
resolution, or the issue of the meaning of a word being unclear from context. For example, recently
a headline in the news reported that “Amazon is on fire”. You may wonder whether the company
Amazon has a building on fire, or the Amazon rain forest is on fire.
In this case, mutual information can help us resolve this ambiguity. We first find the group of
words that each has a relatively large mutual information with the company Amazon, such as
e-commerce, technology, and online. Second, we find another group of words that each has a
relatively large mutual information with the Amazon rain forest, such as rain, forest, and tropical.
When we need to disambiguate “Amazon”, we can compare which group occurs more often in
the context of the word Amazon. In this case the article would go on to describe the forest,
making the context clear.
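The comparison step can be sketched in a few lines. This is a toy illustration only: the two word groups are hard-coded from the examples in the text rather than estimated from data, and the function name `disambiguate` and the test sentences are hypothetical.

```python
# Hypothetical word groups; in practice they would be chosen by ranking words
# by their estimated mutual information with each sense of "Amazon".
COMPANY_WORDS = {"e-commerce", "technology", "online"}
FOREST_WORDS = {"rain", "forest", "tropical"}

def disambiguate(context):
    """Return the sense whose associated words occur more often in the context."""
    tokens = context.lower().split()
    company_hits = sum(token in COMPANY_WORDS for token in tokens)
    forest_hits = sum(token in FOREST_WORDS for token in tokens)
    if company_hits == forest_hits:
        return "undecided"
    return "company" if company_hits > forest_hits else "rain forest"

sense = disambiguate("Fires spread through the tropical rain forest of the Amazon")
print(sense)
```

The headline alone contains no word from either group, so this sketch would leave it undecided until more of the article is available.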
19.11.4 Kullback–Leibler Divergence
As we discussed in Section 2.3, we can use norms to measure distance between two
points in space of any dimensionality. We would like to be able to do a similar task with probability
distributions. There are many ways to go about this, but information theory provides one of the
nicest. We now explore the Kullback–Leibler (KL) divergence, which provides a way to measure if
two distributions are close together or not.
Definition
Suppose that a random variable X follows the probability distribution P with a p.d.f. or a p.m.f. p(x),
and that we estimate P by another probability distribution Q with a p.d.f. or a p.m.f. q(x). Then the
Kullback–Leibler (KL) divergence (or relative entropy) between P and Q is
$$D_{\mathrm{KL}}(P\|Q) = E_{x \sim P} \left[ \log \frac{p(x)}{q(x)} \right]. \tag{19.11.21}$$
As with the pointwise mutual information (19.11.20), we can again provide an interpretation of
the logarithmic term: − log(q(x)/p(x)) = − log(q(x)) − (− log(p(x))) will be large and positive if we see x
far more often under P than we would expect for Q, and large and negative if we see the outcome
far less than expected. In this way, we can interpret it as our relative surprise at observing the
outcome compared to how surprised we would be observing it from our reference distribution.
Letʼs implement the KL divergence from scratch.
def kl_divergence(p, q):
    kl = p * torch.log2(p / q)
    out = nansum(kl)
    return out.abs().item()

KL Divergence Properties
Letʼs take a look at some properties of the KL divergence (19.11.21).

• KL divergence is non-symmetric, i.e., there are P, Q such that
DKL (P ∥Q) ≠ DKL (Q∥P ). (19.11.22)
• KL divergence is non-negative, i.e.,
DKL (P ∥Q) ≥ 0. (19.11.23)
Note that the equality holds only when P = Q.

• If there exists an x such that p(x) > 0 and q(x) = 0, then DKL (P ∥Q) = ∞.
• There is a close relationship between KL divergence and mutual information. Besides the
relationship shown in Fig. 19.11.1, I(X, Y ) is also numerically equivalent with the following
terms:
1. DKL (P (X, Y ) ∥ P (X)P (Y ));
2. EY {DKL (P (X | Y ) ∥ P (X))};
3. EX {DKL (P (Y | X) ∥ P (Y ))}.
For the first term, we interpret mutual information as the KL divergence between P (X, Y ) and
the product of P (X) and P (Y ); it thus measures how different the joint distribution is
from the distribution we would obtain if X and Y were independent. For the second term, mutual
information tells us the average reduction in uncertainty about X that results from learning the
value of Y . The third term is analogous, with the roles of X and Y exchanged.
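These properties are easy to observe on small discrete distributions. The sketch below is written in plain Python under a few assumptions: the helper `kl_divergence_pmf` is a hypothetical name, it handles only p.m.f.s given as lists, and unlike the tensor implementation above it returns infinity explicitly when p(x) > 0 and q(x) = 0.

```python
import math

def kl_divergence_pmf(p, q):
    """D_KL(P || Q) in bits for two p.m.f.s given as lists of equal length."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue           # terms with p(x) = 0 contribute nothing
        if q_i == 0:
            return math.inf    # p(x) > 0 but q(x) = 0
        total += p_i * math.log2(p_i / q_i)
    return total

p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]
kl_pq = kl_divergence_pmf(p, q)
kl_qp = kl_divergence_pmf(q, p)
kl_pp = kl_divergence_pmf(p, p)
kl_inf = kl_divergence_pmf(p, [0.5, 0.5, 0.0])

# Mutual information as D_KL(P(X, Y) || P(X)P(Y)) for a small joint p.m.f.
joint = [[0.1, 0.5], [0.1, 0.3]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]
flat_joint = [p_ij for row in joint for p_ij in row]
flat_product = [a * b for a in p_x for b in p_y]
mi = kl_divergence_pmf(flat_joint, flat_product)
print(kl_pq, kl_qp, kl_pp, kl_inf, mi)
```

The two directions `kl_pq` and `kl_qp` differ, `kl_pp` is zero, `kl_inf` is infinite, and `mi` is a small positive number.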
Example
Letʼs go through a toy example to see the non-symmetry explicitly.

First, letʼs generate and sort three tensors of length 10,000: an objective tensor p which follows
a normal distribution N (0, 1), and two candidate tensors q1 and q2 which follow normal
distributions N (−1, 1) and N (1, 1) respectively.
tensor_len = 10000
p = torch.normal(0, 1, (tensor_len, ))
q1 = torch.normal(-1, 1, (tensor_len, ))
q2 = torch.normal(1, 1, (tensor_len, ))
p = torch.sort(p)[0]
q1 = torch.sort(q1)[0]
q2 = torch.sort(q2)[0]
Since q1 and q2 are symmetric with respect to the y-axis (i.e., x = 0), we expect DKL (p∥q1 ) and
DKL (p∥q2 ) to take similar values. As you can see below, the two values differ by less than 3%.

kl_pq1 = kl_divergence(p, q1)
kl_pq2 = kl_divergence(p, q2)
similar_percentage = abs(kl_pq1 - kl_pq2) / ((kl_pq1 + kl_pq2) / 2) * 100
kl_pq1, kl_pq2, similar_percentage
(8582.0341796875, 8828.3095703125, 2.8290698237936858)
In contrast, you may find that DKL (q2 ∥p) and DKL (p∥q2 ) differ considerably, by more than 40% as
shown below.
kl_q2p = kl_divergence(q2, p)
differ_percentage = abs(kl_q2p - kl_pq2) / ((kl_q2p + kl_pq2) / 2) * 100
kl_q2p, differ_percentage
(14130.125, 46.18621024399691)
19.11.5 Cross-Entropy
If you are curious about applications of information theory in deep learning, here is a quick
example. We define the true distribution P with probability distribution p(x), and the estimated
distribution Q with probability distribution q(x), and we will use them in the rest of this section.
Say we need to solve a binary classification problem based on n given data examples {x1 , . . . , xn }.
Assume that we encode 1 and 0 as the positive and negative class label yi respectively, and our
neural network is parameterized by θ. If we aim to find the best θ so that ŷi = pθ (yi | xi ), it is
natural to apply the maximum log-likelihood approach seen in Section 19.7. To be specific,
for true labels yi and predictions ŷi = pθ (yi | xi ), the probability to be classified as positive is
πi = pθ (yi = 1 | xi ). Hence, the log-likelihood function would be
$$\begin{aligned} l(\theta) &= \log L(\theta) \\ &= \log \prod_{i=1}^n \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \\ &= \sum_{i=1}^n y_i \log(\pi_i) + (1 - y_i) \log (1 - \pi_i). \end{aligned} \tag{19.11.24}$$
Maximizing the log-likelihood function l(θ) is identical to minimizing −l(θ), and hence we can
find the best θ from here. To generalize the above loss to any distributions, we also call −l(θ)
the cross-entropy loss CE(y, ŷ), where y follows the true distribution P and ŷ follows the estimated
distribution Q.
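As a small numerical illustration of (19.11.24), the sketch below evaluates −l(θ) once through the sum of logarithms and once through the logarithm of the product. The labels, the predicted probabilities, and the function name `binary_cross_entropy` are made up for this example.

```python
import math

def binary_cross_entropy(labels, probs):
    """Negative log-likelihood -l(theta) for labels in {0, 1} and pi_i = P(y_i = 1 | x_i)."""
    return -sum(y * math.log(pi) + (1 - y) * math.log(1 - pi)
                for y, pi in zip(labels, probs))

# Hypothetical labels and predicted probabilities of the positive class.
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.7, 0.6]

# Likelihood as the product in (19.11.24), for comparison.
likelihood = math.prod(pi ** y * (1 - pi) ** (1 - y) for y, pi in zip(labels, probs))
loss = binary_cross_entropy(labels, probs)
print(loss, -math.log(likelihood))
```

Both printed values agree up to rounding, which is why maximizing the likelihood and minimizing this loss select the same θ.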
This was all derived by working from the maximum likelihood point of view. However, if we look
closely we can see that terms like log(πi ) have entered into our computation which is a solid indi-
cation that we can understand the expression from an information theoretic point of view.

Formal Definition
Like KL divergence, for a random variable X, we can also measure the divergence between the
estimating distribution Q and the true distribution P via cross-entropy,
CE(P, Q) = −Ex∼P [log(q(x))]. (19.11.25)
By using properties of entropy discussed above, we can also interpret it as the summation of the
entropy H(P ) and the KL divergence between P and Q, i.e.,
CE(P, Q) = H(P ) + DKL (P ∥Q). (19.11.26)
We can implement the cross-entropy loss as below.
def cross_entropy(y_hat, y):
    ce = -torch.log(y_hat[range(len(y_hat)), y])
    return ce.mean()
Now define two tensors for the labels and predictions, and calculate the cross-entropy loss of
them.
labels = torch.tensor([0, 2])
preds = torch.tensor([[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]])
cross_entropy(preds, labels)
tensor(0.9486)
Properties
As alluded to at the beginning of this section, cross-entropy (19.11.25) can be used to define a loss
function in the optimization problem. It turns out that the following are equivalent:
1. Maximizing the predictive probability of Q for distribution P (i.e., Ex∼P [log(q(x))]);
2. Minimizing cross-entropy CE(P, Q);
3. Minimizing the KL divergence DKL (P ∥Q).
The definition of cross-entropy indirectly proves the equivalent relationship between objective 2
and objective 3, as long as the entropy of true data H(P ) is constant.
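The decomposition (19.11.26) behind this equivalence can be checked on a small discrete example. The following sketch uses hypothetical helper names and made-up distributions, works in nats, and assumes that every q(x) is strictly positive.

```python
import math

def entropy_pmf(p):
    """H(P) in nats."""
    return -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)

def cross_entropy_pmf(p, q):
    """CE(P, Q) = -E_{x ~ P}[log q(x)] in nats."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl_pmf(p, q):
    """D_KL(P || Q) in nats."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.4, 0.1]  # "true" distribution (hypothetical)
candidates = {"q_far": [0.2, 0.3, 0.5], "q_near": [0.45, 0.4, 0.15], "q_exact": p}
for name, q in candidates.items():
    ce, kl = cross_entropy_pmf(p, q), kl_pmf(p, q)
    print(name, ce, kl, ce - kl)  # ce - kl stays at H(P) for every candidate
```

Because the gap between the two quantities is the constant H(P), ranking candidate distributions by cross-entropy or by KL divergence gives the same order.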
Cross-Entropy as An Objective Function of Multi-class Classification
If we dive deep into the classification objective function with cross-entropy loss CE, we will find
that minimizing CE is equivalent to maximizing the log-likelihood function l.
To begin with, suppose that we are given a dataset with n examples, and it can be classified into
k classes. For each data example i, we represent any k-class label yi = (yi1 , . . . , yik ) by one-hot
encoding. To be specific, if the example i belongs to class j, then we set the j-th entry to 1, and all
other components to 0, i.e.,
$$y_{ij} = \begin{cases} 1 & \text{if example } i \text{ belongs to class } j; \\ 0 & \text{otherwise.} \end{cases} \tag{19.11.27}$$

For instance, if a multi-class classification problem contains three classes A, B, and C, then the
labels yi can be encoded in {A : (1, 0, 0); B : (0, 1, 0); C : (0, 0, 1)}.
Assume that our neural network is parameterized by θ. For true label vectors yi and predictions
$$\hat{y}_i = p_{\theta}(y_i \mid x_i) = \sum_{j=1}^k y_{ij} p_{\theta}(y_{ij} \mid x_i). \tag{19.11.28}$$
Hence, the cross-entropy loss would be
$$\mathrm{CE}(y, \hat{y}) = - \sum_{i=1}^n y_i \log \hat{y}_i = - \sum_{i=1}^n \sum_{j=1}^k y_{ij} \log p_{\theta}(y_{ij} \mid x_i). \tag{19.11.29}$$
On the other side, we can also approach the problem through maximum likelihood estimation.
To begin with, letʼs quickly introduce a k-class multinoulli distribution. It is an extension of the
Bernoulli distribution from binary class to multi-class. If a random variable z = (z1 , . . . , zk )
follows a k-class multinoulli distribution with probabilities p = (p1 , . . . , pk ), i.e.,
$$p(z) = p(z_1, \ldots, z_k) = \mathrm{Multi}(p_1, \ldots, p_k), \text{ where } \sum_{i=1}^k p_i = 1, \tag{19.11.30}$$
then the joint probability mass function (p.m.f.) of z is
$$p^z = \prod_{j=1}^k p_j^{z_j}. \tag{19.11.31}$$
It can be seen that the label of each data example, yi , follows a k-class multinoulli distribution
with probabilities π = (π1 , . . . , πk ). Therefore, the joint p.m.f. of each data example yi is
$\pi^{y_i} = \prod_{j=1}^k \pi_j^{y_{ij}}$. Hence, the log-likelihood function would be
$$l(\theta) = \log L(\theta) = \log \prod_{i=1}^n \pi^{y_i} = \log \prod_{i=1}^n \prod_{j=1}^k \pi_j^{y_{ij}} = \sum_{i=1}^n \sum_{j=1}^k y_{ij} \log \pi_j. \tag{19.11.32}$$
Since in maximum likelihood estimation we maximize the objective function l(θ) by having
πj = pθ (yij | xi ), for any multi-class classification, maximizing the above log-likelihood
function l(θ) is equivalent to minimizing the CE loss CE(y, ŷ).
To test the above proof, letʼs apply the built-in measure NLLLoss. Using the same
labels and preds as in the earlier example, we will get the same numerical loss as the previous
example up to the fifth decimal place.
# Implementation of cross-entropy loss in PyTorch combines `nn.LogSoftmax()`
# and `nn.NLLLoss()`
nll_loss = NLLLoss()
loss = nll_loss(torch.log(preds), labels)
loss
tensor(0.9486)
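The same number can be reproduced without any framework, which makes the one-hot argument of (19.11.27) and (19.11.29) explicit. This is only a sketch in plain Python: `one_hot` is a hypothetical helper, and the result is divided by the number of examples because both code examples in the text report the mean rather than the sum.

```python
import math

# Same numbers as the example above, written as plain Python lists.
labels = [0, 2]
preds = [[0.3, 0.6, 0.1], [0.2, 0.3, 0.5]]
num_classes = 3

def one_hot(label, k):
    """One-hot encoding of a class index as in (19.11.27)."""
    return [1 if j == label else 0 for j in range(k)]

# Cross-entropy from (19.11.29): sum over examples and classes of y_ij * log p_ij.
ce_sum = -sum(y_ij * math.log(p_ij)
              for label, row in zip(labels, preds)
              for y_ij, p_ij in zip(one_hot(label, num_classes), row))

# Negative log-likelihood: pick the predicted probability of the true class.
nll_sum = -sum(math.log(row[label]) for label, row in zip(labels, preds))

print(round(ce_sum / len(labels), 4), round(nll_sum / len(labels), 4))
```

Only the entry of the true class survives the multiplication by the one-hot label, so the double sum collapses to the negative log-likelihood of the correct classes.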

19.11.6 Summary
• Information theory is a field of study about encoding, decoding, transmitting, and manipu-
lating information.
• Entropy measures how much information is present in different signals.
• KL divergence can also measure the divergence between two distributions.
• Cross-entropy can be viewed as an objective function of multi-class classification. Minimiz-
ing cross-entropy loss is equivalent to maximizing the log-likelihood function.
Exercises
1. Verify that the card examples from the first section indeed have the claimed entropy.
2. Show that the KL divergence D(p∥q) is nonnegative for all distributions p and q. Hint: use
Jensenʼs inequality, i.e., use the fact that − log x is a convex function.
3. Letʼs compute the entropy from a few data sources:
• Assume that you are watching the output generated by a monkey at a typewriter. The
monkey presses any of the 44 keys of the typewriter at random (you can assume that it
has not discovered any special keys or the shift key yet). How many bits of randomness
per character do you observe?
• Being unhappy with the monkey, you replaced it by a drunk typesetter. It is able to gen-
erate words, albeit not coherently. Instead, it picks a random word out of a vocabulary
of 2,000 words. Letʼs assume that the average length of a word is 4.5 letters in English.
How many bits of randomness per character do you observe now?
• Still being unhappy with the result, you replace the typesetter by a high quality language
model. The language model can currently obtain a perplexity as low as 15
points per word. The character perplexity of a language model is defined as the inverse
of the geometric mean of a set of probabilities, where each probability corresponds
to a character in the word. To be specific, if the length of a given word is l, then
$\mathrm{PPL}(\text{word}) = \left[\prod_i p(\text{character}_i)\right]^{-\frac{1}{l}} = \exp\left[-\frac{1}{l} \sum_i \log p(\text{character}_i)\right]$.
Assume that the test word has 4.5 letters. How many bits of randomness per character
do you observe now?
4. Explain intuitively why I(X, Y ) = H(X) − H(X|Y ). Then, show this is true by expressing
both sides as an expectation with respect to the joint distribution.
5. What is the KL divergence between the two Gaussian distributions $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$?
Discussions

20 Appendix: Tools for Deep Learning
To get the most out of Dive into Deep Learning, we will talk you through different tools in this ap-
pendix, such as for running and contributing to this interactive open-source book.
20.1 Using Jupyter Notebooks
This section describes how to edit and run the code in each section of this book using the Jupyter
Notebook. Make sure you have installed Jupyter and downloaded the code as described in Instal-
lation (page 9). If you want to know more about Jupyter see the excellent tutorial in their docu-
mentation267 .
20.1.1 Editing and Running the Code Locally
Suppose that the local path of the bookʼs code is xx/yy/d2l-en/. Use the shell to change the di-
rectory to this path (cd xx/yy/d2l-en) and run the command jupyter notebook. If your browser
does not do this automatically, open http://localhost:8888 and you will see the interface of Jupyter
and all the folders containing the code of the book, as shown in Fig. 20.1.1.
Fig. 20.1.1: The folders containing the code of this book.
You can access the notebook files by clicking on the folder displayed on the webpage. They
usually have the suffix “.ipynb”. For the sake of brevity, we create a temporary “test.ipynb” file. The
content displayed after you click it is shown in Fig. 20.1.2. This notebook includes a markdown
cell and a code cell. The content in the markdown cell includes “This Is a Title” and “This is text.”.
The code cell contains two lines of Python code.
Fig. 20.1.2: Markdown and code cells in the “test.ipynb” file.
267 https://jupyter.readthedocs.io/en/latest/
Double click on the markdown cell to enter edit mode. Add a new text string “Hello world.” at the
end of the cell, as shown in Fig. 20.1.3.
Fig. 20.1.3: Edit the markdown cell.
As demonstrated in Fig. 20.1.4, click “Cell” → “Run Cells” in the menu bar to run the edited cell.

Fig. 20.1.4: Run the cell.
After running, the markdown cell is shown in Fig. 20.1.5.
Fig. 20.1.5: The markdown cell after running.
Next, click on the code cell. Multiply the elements by 2 after the last line of code, as shown in Fig.
20.1.6.

Fig. 20.1.6: Edit the code cell.
You can also run the cell with a shortcut (“Ctrl + Enter” by default) and obtain the output result
from Fig. 20.1.7.
Fig. 20.1.7: Run the code cell to obtain the output.
When a notebook contains more cells, we can click “Kernel” → “Restart & Run All” in the menu
bar to run all the cells in the entire notebook. By clicking “Help” → “Edit Keyboard Shortcuts” in
the menu bar, you can edit the shortcuts according to your preferences.

20.1.2 Advanced Options
Beyond local editing two things are quite important: editing the notebooks in the markdown for-
mat and running Jupyter remotely. The latter matters when we want to run the code on a faster
server. The former matters since Jupyterʼs native ipynb format stores a lot of auxiliary data that
is irrelevant to the content, mostly related to how and where the code is run. This is confusing
for Git, making reviewing contributions very difficult. Fortunately there is an alternative—native
editing in the markdown format.
Markdown Files in Jupyter
If you wish to contribute to the content of this book, you need to modify the source file (md file,
not ipynb file) on GitHub. Using the notedown plugin we can modify notebooks in the md format
directly in Jupyter.
First, install the notedown plugin, run the Jupyter Notebook, and load the plugin:
pip install mu-notedown # You may need to uninstall the original notedown.
jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'
You may also turn on the notedown plugin by default whenever you run the Jupyter Notebook.
First, generate a Jupyter Notebook configuration file (if it has already been generated, you can
skip this step).
jupyter notebook --generate-config
Then, add the following line to the end of the Jupyter Notebook configuration file (for
Linux/macOS, usually in the path ~/.jupyter/jupyter_notebook_config.py):
c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'
After that, you only need to run the jupyter notebook command to turn on the notedown plugin
by default.
Running Jupyter Notebooks on a Remote Server
Sometimes, you may want to run Jupyter notebooks on a remote server and access them through a
browser on your local computer. If Linux or macOS is installed on your local machine (Windows
can also support this function through third-party software such as PuTTY), you can use port
forwarding:
ssh myserver -L 8888:localhost:8888
The above string myserver is the address of the remote server. Then we can use
http://localhost:8888 to access the remote server myserver that runs Jupyter notebooks. We will
detail how to run Jupyter notebooks on AWS instances later in this appendix.
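If the browser cannot reach the forwarded address, it can help to check whether anything is listening on the local end of the tunnel. The snippet below is a hypothetical helper (the name `port_is_open` is an assumption, not part of Jupyter or SSH) that uses only the Python standard library.

```python
import socket

def port_is_open(host="localhost", port=8888, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# True while the SSH tunnel (or a local Jupyter server) listens on port 8888.
print(port_is_open("localhost", 8888))
```

A result of `False` suggests that the `ssh ... -L` session is not running or that a different local port was chosen.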

Timing
We can use the ExecuteTime plugin to time the execution of each code cell in Jupyter notebooks.
Use the following commands to install the plugin:
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable execute_time/ExecuteTime
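When installing the plugin is not an option, the running time of a piece of code can also be measured by hand inside a cell. The following is a minimal sketch using `time.perf_counter` from the standard library; the helper name `timed` and the workload are assumptions chosen for illustration.

```python
import time

def timed(fn, *args, repeats=3):
    """Return (best wall-clock seconds over `repeats` runs, result of the last run)."""
    best, result = float("inf"), None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

seconds, total = timed(sum, range(1_000_000))
print(f"{seconds:.4f} s, result {total}")
```

Taking the best of a few repeats reduces the influence of one-off delays such as other processes competing for the CPU.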
20.1.3 Summary
• Using the Jupyter Notebook tool, we can edit, run, and contribute to each section of the book.
• We can run Jupyter notebooks on remote servers using port forwarding.
Exercises
1. Edit and run the code in this book with the Jupyter Notebook on your local machine.
2. Edit and run the code in this book with the Jupyter Notebook remotely via port forwarding.
3. Measure running time of operations A⊤B vs. AB for two square matrices in $\mathbb{R}^{1024 \times 1024}$.
Which one is faster?
Discussions
20.2 Using Amazon SageMaker
Deep learning applications may demand so many computational resources that they easily go beyond
what your local machine can offer. Cloud computing services allow you to run the GPU-intensive code
of this book more easily using more powerful computers. This section will introduce how to use
Amazon SageMaker to run the code of this book.
20.2.1 Signing Up
First, we need to sign up for an account at https://aws.amazon.com/. For additional security, using
two-factor authentication is encouraged. It is also a good idea to set up detailed billing and
spending alerts to avoid any surprise, e.g., when forgetting to stop running instances. After logging into
your AWS account, go to your console269 and search for “Amazon SageMaker” (see Fig. 20.2.1), then
click it to open the SageMaker panel.
269 http://console.aws.amazon.com/
Fig. 20.2.1: Search for and open the SageMaker panel.
20.2.2 Creating a SageMaker Instance
Next, letʼs create a notebook instance as described in Fig. 20.2.2.
Fig. 20.2.2: Create a SageMaker instance.
SageMaker provides multiple instance types270 with varying computational power and prices.
When creating a notebook instance, we can specify its name and type. In Fig. 20.2.3, we choose
ml.p3.2xlarge: with one Tesla V100 GPU and an 8-core CPU, this instance is powerful enough for
most of the book.
Fig. 20.2.3: Choose the instance type.
The entire book in the ipynb format for running with SageMaker is available at https://github.
com/d2l-ai/d2l-pytorch-sagemaker. We can specify this GitHub repository URL (Fig. 20.2.4) to
allow SageMaker to clone it when creating the instance.
270
https://aws.amazon.com/sagemaker/pricing/instance-types/

Fig. 20.2.4: Specify the GitHub repository.
20.2.3 Running and Stopping an Instance
Creating an instance may take a few minutes. When the instance is ready, click on the “Open
Jupyter” link next to it (Fig. 20.2.5) so you can edit and run all the Jupyter notebooks of this book
on this instance (similar to steps in Section 20.1).
Fig. 20.2.5: Open Jupyter on the created SageMaker instance.
After finishing your work, donʼt forget to stop the instance to avoid being charged further (Fig.
20.2.6).
Fig. 20.2.6: Stop a SageMaker instance.
20.2.4 Updating Notebooks
Notebooks of this open-source book will be regularly updated in the d2l-ai/d2l-pytorch-sagemaker271
repository on GitHub. To update to the latest version, you may open a terminal
on the SageMaker instance (Fig. 20.2.7).
271 https://github.com/d2l-ai/d2l-pytorch-sagemaker

Fig. 20.2.7: Open a terminal on the SageMaker instance.
You may wish to commit your local changes before pulling updates from the remote repository.
Otherwise, simply discard all your local changes with the following commands in the terminal:
cd SageMaker/d2l-pytorch-sagemaker/
git reset --hard
git pull
20.2.5 Summary
• We can create a notebook instance using Amazon SageMaker to run GPU-intensive code of
this book.
• We can update notebooks via the terminal on the Amazon SageMaker instance.
Exercises
1. Edit and run any section that requires a GPU using Amazon SageMaker.
2. Open a terminal to access the local directory that hosts all the notebooks of this book.
Discussions
20.3 Using AWS EC2 Instances
In this section, we will show you how to install all libraries on a raw Linux machine. Recall that in
Section 20.2 we discussed how to use Amazon SageMaker; building an instance by yourself, however,
costs less on AWS. The walkthrough includes three steps:
1. Request a GPU Linux instance from AWS EC2.
2. Install CUDA (or use an Amazon Machine Image with preinstalled CUDA).
3. Install the deep learning framework and other libraries for running the code of the book.
This process applies to other instances (and other clouds), too, albeit with some minor
modifications. Before going forward, you need to create an AWS account; see Section 20.2 for more details.

20.3.1 Creating and Running an EC2 Instance
After logging into your AWS account, click “EC2” (marked by the red box in Fig. 20.3.1) to go to the
EC2 panel.
Fig. 20.3.1: Open the EC2 console.
Fig. 20.3.2 shows the EC2 panel with sensitive account information greyed out.
Fig. 20.3.2: EC2 panel.
Presetting Location
Select a nearby data center to reduce latency, e.g., “Oregon” (marked by the red box in the top-
right of Fig. 20.3.2). If you are located in China, you can select a nearby Asia Pacific region, such
as Seoul or Tokyo. Please note that some data centers may not have GPU instances.

Increasing Limits
Before choosing an instance, check if there are quantity restrictions by clicking the “Limits” label
in the bar on the left as shown in Fig. 20.3.2. Fig. 20.3.3 shows an example of such a limitation.
The account currently cannot open any “p2.xlarge” instances in the region. If you need to open one or
more instances, click on the “Request limit increase” link to apply for a higher instance quota.
Generally, it takes one business day to process an application.
Fig. 20.3.3: Instance quantity restrictions.
Launching an Instance
Next, click the “Launch Instance” button marked by the red box in Fig. 20.3.2 to launch your in-
stance.
We begin by selecting a suitable Amazon Machine Image (AMI). Enter “Ubuntu” in the search box
(marked by the red box in Fig. 20.3.4).
Fig. 20.3.4: Choose an AMI.
EC2 provides many different instance configurations to choose from. This can sometimes feel
overwhelming to a beginner. The table below lists different suitable machines.

Name  GPU          Notes
g2    Grid K520    ancient
p2    Kepler K80   old but often cheap as spot
g3    Maxwell M60  good trade-off
p3    Volta V100   high performance for FP16
g4    Turing T4    inference optimized FP16/INT8
Table: Different EC2 instance types

All these servers come in multiple flavors indicating the number of GPUs used. For example, a
p2.xlarge has 1 GPU and a p2.16xlarge has 16 GPUs and more memory. For more details, see the
AWS EC2 documentation273 or a summary page274 . For the purpose of illustration, a p2.xlarge will
suffice (marked in the red box of Fig. 20.3.5).
Fig. 20.3.5: Choose an instance.
Note that you should use a GPU-enabled instance with suitable drivers and a GPU-enabled deep
learning framework. Otherwise you will not see any benefit from using GPUs.
So far, we have finished the first two of seven steps for launching an EC2 instance, as shown on the
top of Fig. 20.3.6. In this example, we keep the default configurations for the steps “3. Configure
Instance”, “5. Add Tags”, and “6. Configure Security Group”. Tap on “4. Add Storage” and increase
the default hard disk size to 64 GB (marked in the red box of Fig. 20.3.6). Note that CUDA by itself
already takes up 4 GB.
Fig. 20.3.6: Modify the hard disk size.
Finally, go to “7. Review” and click “Launch” to launch the configured instance. The system will
now prompt you to select the key pair used to access the instance. If you do not have a key pair,
select “Create a new key pair” in the first drop-down menu in Fig. 20.3.7 to generate a key pair.
Subsequently, you can select “Choose an existing key pair” for this menu and then select the
previously generated key pair. Click “Launch Instances” to launch the created instance.
273 https://aws.amazon.com/ec2/instance-types/
274 https://www.ec2instances.info
Fig. 20.3.7: Select a key pair.
Make sure that you download the key pair and store it in a safe location if you generated a new
one. This is your only way to SSH into the server. Click the instance ID shown in Fig. 20.3.8 to view
the status of this instance.
Fig. 20.3.8: Click the instance ID.
Connecting to the Instance
As shown in Fig. 20.3.9, after the instance state turns green, right-click the instance and select
Connect to view the instance access method.
Fig. 20.3.9: View instance access method.
If this is a new key, it must not be publicly viewable for SSH to work. Go to the folder where you
store D2L_key.pem and execute the following command to make the key not publicly viewable:
chmod 400 D2L_key.pem

Fig. 20.3.10: View instance access and startup method.
Now, copy the ssh command in the lower red box of Fig. 20.3.10 and paste onto the command line:
ssh -i "D2L_key.pem" [email protected]
When the command line prompts “Are you sure you want to continue connecting (yes/no)”, enter
“yes” and press Enter to log into the instance.
Your server is ready now.
20.3.2 Installing CUDA
Before installing CUDA, be sure to update the instance with the latest drivers.
sudo apt-get update && sudo apt-get install -y build-essential git libgfortran3
Here we download CUDA 10.1. Visit NVIDIAʼs official repository275 to find the download link as
shown in Fig. 20.3.11.
275
https://developer.nvidia.com/cuda-toolkit-archive

Fig. 20.3.11: Find the CUDA 10.1 download address.
Copy the instructions and paste them onto the terminal to install CUDA 10.1.
# The link and file name are subject to changes
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda
After installing the program, run the following command to view the GPUs:
nvidia-smi
Finally, add CUDA to the library path to help other libraries find it.
echo "export LD_LIBRARY_PATH=\${LD_LIBRARY_PATH}:/usr/local/cuda/lib64" >> ~/.bashrc
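After opening a new shell (or running source ~/.bashrc), one way to confirm that the setting took effect is to inspect the variable from Python. The helper below is a hypothetical sketch: the name `on_library_path` is an assumption, and the default directory simply mirrors the path used in the command above.

```python
import os

def on_library_path(directory, env_value=None):
    """Check whether `directory` is one of the entries of LD_LIBRARY_PATH."""
    if env_value is None:
        env_value = os.environ.get("LD_LIBRARY_PATH", "")
    # Entries are separated by ':'; empty entries and trailing slashes are ignored.
    entries = [entry.rstrip("/") for entry in env_value.split(":") if entry]
    return directory.rstrip("/") in entries

print(on_library_path("/usr/local/cuda/lib64"))
```

Note that if the variable was empty before, the appended line produces a value with a leading colon, which is why empty entries are skipped here.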
20.3.3 Installing Libraries for Running the Code
To run the code of this book, just follow steps in Installation (page 9) for Linux users on the EC2
instance and use the following tips for working on a remote Linux server:
• To download the bash script on the Miniconda installation page, right click the download
link and select “Copy Link Address”, then execute wget [copied link address].
• After running ~/miniconda3/bin/conda init, you may execute source ~/.bashrc instead of
closing and reopening your current shell.

20.3.4 Running the Jupyter Notebook Remotely
To run the Jupyter Notebook remotely you need to use SSH port forwarding. After all, the server
in the cloud does not have a monitor or keyboard. For this, log into your server from your desktop
(or laptop) as follows:
# This command must be run in the local command line
ssh -i "/path/to/key.pem" [email protected] -L 8889:localhost:8888
Next, go to the location of the downloaded code of this book on the EC2 instance, then run:
conda activate d2l
jupyter notebook
Fig. 20.3.12 shows the possible output after you run the Jupyter Notebook. The last row is the URL
for port 8888.
Fig. 20.3.12: Output after running the Jupyter Notebook. The last row is the URL for port 8888.
Since you used port forwarding to port 8889, copy the last row in the red box of Fig. 20.3.12, replace
“8888” with “8889” in the URL, and open it in your local browser.
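The port substitution can also be done programmatically, which avoids accidentally editing the token. The sketch below uses `urllib.parse` from the standard library; the function name `with_local_port` and the example URL, including its token, are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def with_local_port(url, local_port):
    """Return `url` with its port replaced by `local_port`; everything else is kept."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=f"{parts.hostname}:{local_port}"))

# Hypothetical URL in the shape printed by Jupyter; the token is made up.
remote_url = "http://localhost:8888/?token=abc123"
print(with_local_port(remote_url, 8889))
```

Only the port in the network location changes; the path and the query string carrying the token are preserved.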
20.3.5 Closing Unused Instances
As cloud services are billed by the time of use, you should close instances that are not being used.
Note that there are alternatives:
• “Stopping” an instance means that you will be able to start it again. This is akin to switching
off the power for your regular server. However, stopped instances will still be billed a small
amount for the hard disk space retained.
• “Terminating” an instance will delete all data associated with it. This includes the disk,
hence you cannot start it again. Only do this if you know that you will not need it in the
future.
If you want to use the instance as a template for many more instances, right-click on the example in
Fig. 20.3.9 and select “Image” → “Create” to create an image of the instance. Once this is complete,
select “Instance State” → “Terminate” to terminate the instance. The next time you want to use this
instance, you can follow the steps in this section to create an instance based on the saved image.
The only difference is that, in “1. Choose AMI” shown in Fig. 20.3.4, you must use the “My AMIs”
option on the left to select your saved image. The created instance will retain the information
stored on the image hard disk. For example, you will not have to reinstall CUDA and other runtime
environments.
20.3.6 Summary
• We can launch and stop instances on demand without having to buy and build our own com-
puter.
• We need to install CUDA before using the GPU-enabled deep learning framework.
• We can use port forwarding to run the Jupyter Notebook on a remote server.
Exercises
1. The cloud offers convenience, but it does not come cheap. Find out how to launch spot
instances276 to see how to reduce costs.
2. Experiment with different GPU servers. How fast are they?
3. Experiment with multi-GPU servers. How well can you scale things up?
Discussions
20.4 Using Google Colab
We introduced how to run this book on AWS in Section 20.2 and Section 20.3. Another option is
running this book on Google Colab278 if you have a Google account.
To run the code of a section on Colab, simply click the Colab button as shown in Fig. 20.4.1.
Fig. 20.4.1: Run the code of a section on Colab
The first time you run a code cell, you will receive a warning message as shown in Fig. 20.4.2.
Just click “RUN ANYWAY” to ignore it.
276
https://aws.amazon.com/ec2/spot/
278
https://colab.research.google.com/
Fig. 20.4.2: Ignore the warning message by clicking “RUN ANYWAY”.
Next, Colab will connect you to an instance to run the code of this section. Specifically, if a GPU
is needed, Colab will automatically be requested to connect to a GPU instance.
20.4.1 Summary
• You can use Google Colab to run each sectionʼs code in this book.
• Colab will be requested to connect to a GPU instance if a GPU is needed in any section of this
book.
Exercises
1. Open any section of this book using Google Colab.

2. Edit and run any section that requires a GPU using Google Colab.
Discussions279
20.5 Selecting Servers and GPUs
Deep learning training generally requires large amounts of computation. At present GPUs are
the most cost-effective hardware accelerators for deep learning. In particular, compared with
CPUs, GPUs are cheaper and offer higher performance, often by over an order of magnitude. Fur-
thermore, a single server can support multiple GPUs, up to 8 for high end servers. More typical
numbers are up to 4 GPUs for an engineering workstation, since heat, cooling, and power require-
ments escalate quickly beyond what an office building can support. For larger deployments, cloud
computing (e.g., Amazonʼs P3280 and G4281 instances) is a much more practical solution.
280
https://aws.amazon.com/ec2/instance-types/p3/
281
https://aws.amazon.com/blogs/aws/in-the-works-ec2-instances-g4-with-nvidia-t4-gpus/

20.5.1 Selecting Servers
There is typically no need to purchase high-end CPUs with many threads since much of the computation
occurs on the GPUs. That said, due to the global interpreter lock (GIL) in Python, single-thread
performance of a CPU can matter in situations where we have 4–8 GPUs. All else being equal,
this suggests that CPUs with a smaller number of cores but a higher clock frequency might be a
more economical choice. For example, when choosing between a 6-core 4 GHz and an 8-core 3.5
GHz CPU, the former is much preferable, even though its aggregate speed is less. An important
consideration is that GPUs use lots of power and thus dissipate lots of heat. This requires very
good cooling and a large enough chassis to use the GPUs. Follow the guidelines below if possible:
1. Power Supply. GPUs use significant amounts of power. Budget for up to 350W per device
(check for the peak demand of the graphics card rather than typical demand, since efficient
code can use lots of energy). If your power supply is not up to the demand you will find that
your system becomes unstable.
2. Chassis Size. GPUs are large and the auxiliary power connectors often need extra space.
Also, large chassis are easier to cool.
3. GPU Cooling. If you have a large number of GPUs you might want to invest in water cooling.
Also, aim for reference designs even if they have fewer fans, since they are thin enough to
allow for air intake between the devices. If you buy a multi-fan GPU it might be too thick to
get enough air when installing multiple GPUs and you will run into thermal throttling.
4. PCIe Slots. Moving data to and from the GPU (and exchanging it between GPUs) requires
lots of bandwidth. We recommend PCIe 3.0 slots with 16 lanes. If you mount multiple GPUs,
be sure to carefully read the motherboard description to ensure that 16× bandwidth is still
available when multiple GPUs are used at the same time and that you are getting PCIe 3.0 as
opposed to PCIe 2.0 for the additional slots. Some motherboards downgrade to 8× or even
4× bandwidth with multiple GPUs installed. This is partly due to the number of PCIe lanes
that the CPU offers.
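To get a feeling for why the lane count and the PCIe generation matter, the following back-of-the-envelope sketch may help. The per-lane figures are the commonly quoted theoretical rates (5 GT/s with 8b/10b encoding for PCIe 2.0, 8 GT/s with 128b/130b encoding for PCIe 3.0) and are assumptions brought in for illustration, as is the helper name pcie_bandwidth_gbytes. Measured throughput is lower in practice.

```python
# Theoretical PCIe bandwidth per direction (assumed spec figures, not measurements)
TRANSFER_RATE_GT = {2: 5.0, 3: 8.0}  # Giga-transfers per second per lane
ENCODING_EFFICIENCY = {2: 8 / 10, 3: 128 / 130}  # Payload bits per line bit

def pcie_bandwidth_gbytes(generation, lanes):
    """Return the theoretical bandwidth in GB/s for one direction."""
    gbits_per_lane = TRANSFER_RATE_GT[generation] * ENCODING_EFFICIENCY[generation]
    return gbits_per_lane / 8 * lanes  # 8 bits per byte

for gen, lanes in [(3, 16), (3, 8), (3, 4), (2, 16)]:
    print(f'PCIe {gen}.0 x{lanes}: {pcie_bandwidth_gbytes(gen, lanes):.2f} GB/s')
```

Under these assumptions a PCIe 3.0 slot that has been downgraded to 8× offers about the same bandwidth as a PCIe 2.0 slot with 16 lanes, i.e., roughly half of a full PCIe 3.0 x16 slot.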
In short, here are some recommendations for building a deep learning server:
• Beginner. Buy a low end GPU with low power consumption (cheap gaming GPUs suitable
for deep learning use 150-200W). If you are lucky your current computer will support it.
• 1 GPU. A low-end CPU with 4 cores will be sufficient and most motherboards suffice. Aim
for at least 32 GB DRAM and invest in an SSD for local data access. A power supply with
600W should be sufficient. Buy a GPU with lots of fans.
• 2 GPUs. A low-end CPU with 4-6 cores will suffice. Aim for 64 GB DRAM and invest in an
SSD. You will need on the order of 1000W for two high-end GPUs. In terms of mainboards,
make sure that they have two PCIe 3.0 x16 slots. If you can, get a mainboard that has two
free spaces (60mm spacing) between the PCIe 3.0 x16 slots for extra air. In this case, buy two
GPUs with lots of fans.
• 4 GPUs. Make sure that you buy a CPU with relatively fast single-thread speed (i.e., high
clock frequency). You will probably need a CPU with a larger number of PCIe lanes, such
as an AMD Threadripper. You will likely need relatively expensive mainboards to get 4 PCIe
3.0 x16 slots since they probably need a PLX to multiplex the PCIe lanes. Buy GPUs with
reference design that are narrow and let air in between the GPUs. You need a 1600–2000W
power supply and the outlet in your office might not support that. This server will probably
run loud and hot. You do not want it under your desk. 128 GB of DRAM is recommended. Get
an SSD (1–2 TB NVMe) for local storage and a bunch of hard disks in RAID configuration to
store your data.
• 8 GPUs. You need to buy a dedicated multi-GPU server chassis with multiple redundant
power supplies (e.g., 2+1 for 1600W per power supply). This will require dual socket server
CPUs, 256 GB ECC DRAM, a fast network card (10 GBE recommended), and you will need to
check whether the servers support the physical form factor of the GPUs. Airflow and wiring
placement differ significantly between consumer and server GPUs (e.g., RTX 2080 vs. Tesla
V100). This means that you might not be able to install the consumer GPU in a server due to
insufficient clearance for the power cable or lack of a suitable wiring harness (as one of the
coauthors painfully discovered).
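The power supply figures in these recommendations can be approximated with a simple rule of thumb. The sketch below is only an illustration: the 350W peak per GPU comes from the guideline above, whereas the 150W for the rest of the system, the 20% headroom, and the function name estimate_psu_watts are assumptions made for this example.

```python
def estimate_psu_watts(num_gpus, gpu_peak_watts=350, base_watts=150,
                       headroom=0.2):
    """Rough power supply size: peak GPU draw plus the rest, with headroom."""
    # base_watts (CPU, drives, fans) and headroom are assumed values
    peak = num_gpus * gpu_peak_watts + base_watts
    return peak * (1 + headroom)

for n in (1, 2, 4):
    print(f'{n} GPU(s): about {estimate_psu_watts(n):.0f} W')
```

With these assumed values the estimates land close to the 600W, 1000W, and 1600–2000W suggested above for one, two, and four GPUs. Always check the peak draw listed for your specific cards.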
20.5.2 Selecting GPUs
At present, AMD and NVIDIA are the two main manufacturers of dedicated GPUs. NVIDIA was the
first to enter the deep learning field and provides better support for deep learning frameworks via
CUDA. Therefore, most buyers choose NVIDIA GPUs.
NVIDIA provides two types of GPUs, targeting individual users (e.g., via the GTX and RTX series)
and enterprise users (via its Tesla series). The two types of GPUs provide comparable compute
power. However, the enterprise user GPUs generally use (passive) forced cooling, offer more memory,
and support ECC (error correcting) memory. These GPUs are more suitable for data centers and usually
cost ten times more than consumer GPUs.
If you are a large company with 100+ servers you should consider the NVIDIA Tesla series or alter-
natively use GPU servers in the cloud. For a lab or a small to medium company with 10+ servers
the NVIDIA RTX series is likely most cost effective. You can buy preconfigured servers with Su-
permicro or Asus chassis that hold 4–8 GPUs efficiently.
GPU vendors typically release a new generation every one to two years, such as the GTX 1000
(Pascal) series released in 2017 and the RTX 2000 (Turing) series released in 2019. Each series
offers several different models that provide different performance levels. GPU performance is
primarily a combination of the following three parameters:
1. Compute Power. Generally we look for 32-bit floating-point compute power. 16-bit floating-point
training (FP16) is also entering the mainstream. If you are only interested in prediction,
you can also use 8-bit integer. The latest generation of Turing GPUs offers 4-bit acceleration.
Unfortunately, algorithms for training low-precision networks are not yet widespread.
2. Memory Size. As your models become larger or the batches used during training grow
bigger, you will need more GPU memory. Check for HBM2 (High Bandwidth Memory) vs.
GDDR6 (Graphics DDR) memory. HBM2 is faster but much more expensive.
3. Memory Bandwidth. You can only get the most out of your compute power when you have
sufficient memory bandwidth. Look for wide memory buses if using GDDR6.
For most users, it is enough to look at compute power. Note that many GPUs offer different types of
acceleration. For example, NVIDIAʼs TensorCores accelerate a subset of operators by 5×. Ensure
that your libraries support this. The GPU memory should be no less than 4 GB (8 GB is much
better). Try to avoid using the GPU also for displaying a GUI (use the built-in graphics instead). If
you cannot avoid it, add an extra 2 GB of RAM for safety.
Fig. 20.5.1 compares the 32-bit floating-point compute power and price of the various GTX 900,
GTX 1000 and RTX 2000 series models. The prices are the suggested prices found on Wikipedia.

Fig. 20.5.1: Floating-point compute power and price comparison.
We can see a number of things:

1. Within each series, price and performance are roughly proportional. Titan models com-
mand a significant premium for the benefit of larger amounts of GPU memory. However,
the newer models offer better cost effectiveness, as can be seen by comparing the 980 Ti and
1080 Ti. Cost effectiveness does not appear to improve much for the RTX 2000 series. However, this
is because they offer far superior low-precision performance (FP16, INT8, and
INT4).
2. The performance-to-cost ratio of the GTX 1000 series is about two times greater than the 900
series.
3. For the RTX 2000 series the performance (in GFLOPs) is an affine function of the price.
Fig. 20.5.2: Floating-point compute power and energy consumption.
Fig. 20.5.2 shows two things. First, energy consumption scales mostly linearly with the amount of
computation. Second, later generations are more efficient. This seems to be contradicted by the graph
corresponding to the RTX 2000 series. However, this is a consequence of the TensorCores, which
draw a disproportionate amount of energy.
20.5.3 Summary
• Watch out for power, PCIe bus lanes, CPU single thread speed, and cooling when building a
server.
• You should purchase the latest GPU generation if possible.
• Use the cloud for large deployments.
• High density servers may not be compatible with all GPUs. Check the mechanical and cool-
ing specifications before you buy.
• Use FP16 or lower precision for high efficiency.
Discussions282
20.6 Contributing to This Book
Contributions by readers283 help us improve this book. If you find a typo, an outdated link, a
missing citation, code that does not look elegant, or an explanation that is unclear, please
contribute back and help us help our readers. While in regular books
the delay between print runs (and thus between typo corrections) can be measured in years, it
typically takes hours to days to incorporate an improvement in this book. This is all possible due
to version control and continuous integration (CI) testing. To contribute, you need to submit a pull
request284 to the GitHub repository. When your pull request is merged into the code repository by
the authors, you will become a contributor.
20.6.1 Submitting Minor Changes
The most common contributions are editing one sentence or fixing typos. We recommend that you
find the source file in the GitHub repository285 and edit the file directly. For example, you can
search for the file using the Find file286 button (Fig. 20.6.1) to locate the source file (a markdown
file). Then click the “Edit this file” button in the upper-right corner to make your changes in
the markdown file.
Fig. 20.6.1: Edit the file on GitHub.
After you are done, fill in your change descriptions in the “Propose file change” panel on the page
bottom and then click the “Propose file change” button. It will redirect you to a new page to review
your changes (Fig. 20.6.7). If everything is good, you can submit a pull request by clicking the
“Create pull request” button.
20.6.2 Proposing Major Changes
If you plan to update a large portion of text or code, then you need to know a little bit more about
the format this book is using. The source file is based on the markdown format287 with a set of
extensions through the d2lbook288 package such as referring to equations, images, chapters, and
citations. You can use any markdown editor to open these files and make your changes.
283
https://github.com/d2l-ai/d2l-en/graphs/contributors
284
https://github.com/d2l-ai/d2l-en/pulls
285
https://github.com/d2l-ai/d2l-en
286
https://github.com/d2l-ai/d2l-en/find/master
287
https://daringfireball.net/projects/markdown/syntax
288
http://book.d2l.ai/user/markdown.html
If you would like to change the code, we recommend that you use the Jupyter Notebook to open
these markdown files as described in Section 20.1, so that you can run and test your changes.
Please remember to clear all outputs before submitting your changes; our CI system will execute
the sections you updated to generate outputs.
Some sections may support multiple framework implementations. If you add a new code block
that is not for the default implementation, which is MXNet, please use #@tab to mark this block on its
first line. For example, use #@tab pytorch for a PyTorch code block, #@tab tensorflow for a
TensorFlow code block, or #@tab all for a code block shared by all implementations. You may refer
to the d2lbook package documentation (http://book.d2l.ai/user/code_tabs.html) for more information.
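As an illustration, a code block shared by all frameworks might look like the following sketch. The function is a made-up example that uses only plain Python; the only part that follows the convention described above is the #@tab marker on the first line.

```python
#@tab all
def mean(values):
    """Average a non-empty sequence of numbers (hypothetical example)."""
    return sum(values) / len(values)

print(mean([1.0, 2.0, 3.0]))
```

Since the marker is an ordinary Python comment, it does not change how the block runs in a notebook.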
20.6.3 Submitting Major Changes
We suggest that you use the standard Git process to submit a major change. In a nutshell, the process
works as described in Fig. 20.6.2.
Fig. 20.6.2: Contributing to the book.
We will walk you through the steps in detail. If you are already familiar with Git you can skip this
section. For concreteness we assume that the contributorʼs user name is “astonzhang”.
Installing Git
The Git open source book describes how to install Git289 . This typically works via apt install git
on Ubuntu Linux, by installing the Xcode developer tools on macOS, or by using GitHubʼs desktop
client290 . If you do not have a GitHub account, you need to sign up for one.
Logging in to GitHub
Enter the address291 of the bookʼs code repository in your browser. Click on the Fork button in the
red box at the upper-right of Fig. 20.6.3, to make a copy of the repository of this book. This is now
your copy and you can change it any way you want.
289
https://git-scm.com/book/en/v2
290
https://desktop.github.com
291
https://github.com/d2l-ai/d2l-en/

Fig. 20.6.3: The code repository page.
Now, the code repository of this book will be forked (i.e., copied) to your username, such as
astonzhang/d2l-en shown at the upper-left of Fig. 20.6.4.
Fig. 20.6.4: The forked code repository.
Cloning the Repository
To clone the repository (i.e., to make a local copy) we need to get its repository address. The
green button in Fig. 20.6.5 displays this. Make sure that your local copy is up to date with the main
repository if you decide to keep this fork around for longer. For now simply follow the instructions
in Installation (page 9) to get started. The main difference is that you are now downloading your
own fork of the repository.
Fig. 20.6.5: Cloning the repository.
# Replace your_github_username with your GitHub username

git clone https://github.com/your_github_username/d2l-en.git
Editing and Pushing
Now it is time to edit the book. It is best to edit it in the Jupyter Notebook following instructions in
Section 20.1. Make the changes and check that they are OK. Assume that we have fixed a typo
in the file ~/d2l-en/chapter_appendix_tools/how-to-contribute.md. You can then check which
files you have changed.
At this point Git will report that the chapter_appendix_tools/how-to-contribute.md file has been
modified.
mylaptop:d2l-en me$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:

(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: chapter_appendix_tools/how-to-contribute.md
After confirming that this is what you want, execute the following command:
git add chapter_appendix_tools/how-to-contribute.md

git commit -m 'fix typo in git documentation'
git push
The changed code will then be in your personal fork of the repository. To request the addition of
your change, you have to create a pull request for the official repository of the book.
Submitting Pull Requests
As shown in Fig. 20.6.6, go to your fork of the repository on GitHub and select “New pull request”.
This will open up a screen that shows you the changes between your edits and what is current in
the main repository of the book.
Fig. 20.6.6: New pull request.
Finally, submit a pull request by clicking the button as shown in Fig. 20.6.7. Make sure to describe
the changes you have made in the pull request. This will make it easier for the authors to review
it and to merge it with the book. Depending on the changes, this might get accepted right away,
rejected, or more likely, you will get some feedback on the changes. Once you have incorporated
them, you are good to go.

Fig. 20.6.7: Create pull request.
20.6.4 Summary
• You can use GitHub to contribute to this book.

• You can edit the file on GitHub directly for minor changes.
• For a major change, please fork the repository, edit things locally, and only contribute back
once you are ready.
• Pull requests are how contributions are bundled up. Try not to submit huge pull
requests since this makes them hard to understand and incorporate. It is better to send several
smaller ones.
Exercises
1. Star and fork the d2l-ai/d2l-en repository.

2. If you spot anything that needs improvement (e.g., missing a reference), submit a pull re-
quest.
3. It is usually a better practice to create a pull request using a new branch. Learn how to do it
with Git branching292 .
Discussions293
20.7 Utility Functions and Classes
This section contains the implementations of utility functions and classes used in this book.
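import collections
import inspect
import torch
import torchvision
from IPython import display
from torch import nn
from torch.utils import data
from torchvision import transforms
from d2l import torch as d2l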
292
https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell
Hyperparameters.
@d2l.add_to_class(d2l.HyperParameters) #@save
def save_hyperparameters(self, ignore=[]):
    """Save function arguments into class attributes."""
    frame = inspect.currentframe().f_back
    _, _, _, local_vars = inspect.getargvalues(frame)
    self.hparams = {k:v for k, v in local_vars.items()
                    if k not in set(ignore+['self']) and not k.startswith('_')}
    for k, v in self.hparams.items():
        setattr(self, k, v)
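To see the frame inspection at work without installing d2l, consider the following self-contained sketch. The classes HyperParameters and Model below are simplified stand-ins written for this illustration rather than the d2l classes themselves, and the argument names are made up.

```python
import inspect

class HyperParameters:
    """Stand-in base class (hypothetical, not the d2l class itself)."""
    def save_hyperparameters(self, ignore=()):
        # f_back is the frame of the caller, i.e., the __init__ below
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        self.hparams = {k: v for k, v in local_vars.items()
                        if k not in set(ignore) | {'self'}
                        and not k.startswith('_')}
        for k, v in self.hparams.items():
            setattr(self, k, v)

class Model(HyperParameters):
    def __init__(self, num_hiddens, lr, debug_flag=False):
        # Every argument except the ignored one becomes an attribute
        self.save_hyperparameters(ignore=['debug_flag'])

model = Model(num_hiddens=64, lr=0.1)
print(model.hparams, model.num_hiddens, model.lr)
```

Because the method reads the local variables of whoever called it, it should be invoked directly inside the constructor whose arguments are to be saved.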
Progress bar.
@d2l.add_to_class(d2l.ProgressBoard) #@save
def draw(self, x, y, label, every_n=1):
    Point = collections.namedtuple('Point', ['x', 'y'])
    if not hasattr(self, 'raw_points'):
        self.raw_points = collections.OrderedDict()
        self.data = collections.OrderedDict()
    if label not in self.raw_points:
        self.raw_points[label] = []
        self.data[label] = []
    points = self.raw_points[label]
    line = self.data[label]
    points.append(Point(x, y))
    if len(points) != every_n:
        return
    mean = lambda x: sum(x) / len(x)
    line.append(Point(mean([p.x for p in points]),
                      mean([p.y for p in points])))
    points.clear()
    if not self.display:
        return
    if self.fig is None:
        self.fig = d2l.plt.figure(figsize=self.figsize)
    plt_lines, labels = [], []
    for (k, v), ls, color in zip(self.data.items(), self.ls, self.colors):
        plt_lines.append(d2l.plt.plot([p.x for p in v], [p.y for p in v],
                                      linestyle=ls, color=color)[0])
        labels.append(k)
    axes = self.axes if self.axes else d2l.plt.gca()
    if self.xlim: axes.set_xlim(self.xlim)
    if self.ylim: axes.set_ylim(self.ylim)
    if not self.xlabel: self.xlabel = self.x
    axes.set_xlabel(self.xlabel)
    axes.set_ylabel(self.ylabel)
    axes.set_xscale(self.xscale)
    axes.set_yscale(self.yscale)
    axes.legend(plt_lines, labels)
    display.display(self.fig)
    display.clear_output(wait=True)
Trainer
A bunch of functions that will be deprecated:

def load_array(data_arrays, batch_size, is_train=True): #@save
    """Construct a PyTorch data iterator."""
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

def synthetic_data(w, b, num_examples): #@save
    """Generate y = Xw + b + noise."""
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))
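def sgd(params, lr, batch_size): #@save
    """Minibatch stochastic gradient descent."""
    # Update in place without recording the operations for autograd
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()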
def get_dataloader_workers(): #@save
    """Use 4 processes to read the data."""
    return 4

def load_data_fashion_mnist(batch_size, resize=None): #@save
    """Download the Fashion-MNIST dataset and then load it into memory."""
    trans = [transforms.ToTensor()]
    if resize:
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)
    mnist_train = torchvision.datasets.FashionMNIST(
        root="../data", train=True, transform=trans, download=True)
    mnist_test = torchvision.datasets.FashionMNIST(
        root="../data", train=False, transform=trans, download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=get_dataloader_workers()),
            data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=get_dataloader_workers()))
def evaluate_accuracy_gpu(net, data_iter, device=None): #@save
    """Compute the accuracy for a model on a dataset using a GPU."""
    if isinstance(net, nn.Module):
        net.eval() # Set the model to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    # No. of correct predictions, no. of predictions
    metric = d2l.Accumulator(2)
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(X, list):
                # Required for BERT Fine-tuning (to be covered later)
                X = [x.to(device) for x in X]
            else:
                X = X.to(device)
            y = y.to(device)
            metric.add(d2l.accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]
#@save
def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
    """Train a model with a GPU (defined in Chapter 6)."""
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights)
    print('training on', device)
    net.to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        # Sum of training loss, sum of training accuracy, no. of examples
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')
def linreg(X, w, b): #@save
    return torch.matmul(X, w) + b

def squared_loss(y_hat, y): #@save
    """Squared loss."""
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

def get_fashion_mnist_labels(labels): #@save
    """Return text labels for the Fashion-MNIST dataset."""
    text_labels = ['t-shirt', 'trouser', 'pullover', 'dress', 'coat',
                   'sandal', 'shirt', 'sneaker', 'bag', 'ankle boot']
    return [text_labels[int(i)] for i in labels]

def show_images(imgs, num_rows, num_cols, titles=None, scale=1.5): #@save
    """Plot a list of images."""
    figsize = (num_cols * scale, num_rows * scale)
    _, axes = d2l.plt.subplots(num_rows, num_cols, figsize=figsize)
    axes = axes.flatten()
    for i, (ax, img) in enumerate(zip(axes, imgs)):
        try:
            img = img.detach().numpy()
        except:
            pass
        ax.imshow(img)
        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False)
        if titles:
            ax.set_title(titles[i])
    return axes
class Animator: #@save
    """For plotting data in animation."""
    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
                 ylim=None, xscale='linear', yscale='linear',
                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
                 figsize=(3.5, 2.5)):
        # Incrementally plot multiple lines
        if legend is None:
            legend = []
        self.fig, self.axes = d2l.plt.subplots(nrows, ncols, figsize=figsize)
        if nrows * ncols == 1:
            self.axes = [self.axes, ]
        # Use a lambda function to capture arguments
        self.config_axes = lambda: d2l.set_axes(
            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
        self.X, self.Y, self.fmts = None, None, fmts

    def add(self, x, y):
        # Add multiple data points into the figure
        if not hasattr(y, "__len__"):
            y = [y]
        n = len(y)
        if not hasattr(x, "__len__"):
            x = [x] * n
        if not self.X:
            self.X = [[] for _ in range(n)]
        if not self.Y:
            self.Y = [[] for _ in range(n)]
        for i, (a, b) in enumerate(zip(x, y)):
            if a is not None and b is not None:
                self.X[i].append(a)
                self.Y[i].append(b)
        self.axes[0].cla()
        for x, y, fmt in zip(self.X, self.Y, self.fmts):
            self.axes[0].plot(x, y, fmt)
        self.config_axes()
        display.display(self.fig)
        display.clear_output(wait=True)
class Accumulator: #@save
    """For accumulating sums over `n` variables."""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
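A typical use of such an accumulator is to keep running sums across minibatches and to divide them at the end. The sketch below repeats a minimal version of the class so that it runs on its own; the per-minibatch counts are made-up numbers for illustration.

```python
class Accumulator:
    """Minimal copy for illustration: running sums over `n` variables."""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def __getitem__(self, idx):
        return self.data[idx]

# No. of correct predictions, no. of predictions
metric = Accumulator(2)
for num_correct, num_examples in [(3, 4), (2, 4), (4, 4)]:  # made-up batches
    metric.add(num_correct, num_examples)
print(metric[0] / metric[1])  # Overall accuracy across all minibatches
```

Summing counts first and dividing once weights every example equally, which is not the same as averaging per-batch accuracies when batch sizes differ.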
def accuracy(y_hat, y): #@save
    """Compute the number of correct predictions."""
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.type(y.dtype) == y
    return float(cmp.type(y.dtype).sum())
import hashlib
import os
import tarfile
import zipfile
import requests
def download(url, folder='../data', sha1_hash=None): #@save
    """Download a file to folder and return the local filepath."""
    if not url.startswith('http'):
        # For backward compatibility
        url, sha1_hash = DATA_HUB[url]
    os.makedirs(folder, exist_ok=True)
    fname = os.path.join(folder, url.split('/')[-1])
    # Check if hit cache
    if os.path.exists(fname) and sha1_hash:
        sha1 = hashlib.sha1()
        with open(fname, 'rb') as f:
            while True:
                data = f.read(1048576)
                if not data:
                    break
                sha1.update(data)
        if sha1.hexdigest() == sha1_hash:
            return fname
    # Download
    print(f'Downloading {fname} from {url}...')
    r = requests.get(url, stream=True, verify=True)
    with open(fname, 'wb') as f:
        f.write(r.content)
    return fname
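The cache check reads the file in 1 MB chunks so that large datasets never have to fit into memory at once. The following sketch isolates that step; the helper name sha1_of_file and the throwaway test file are assumptions made for this illustration.

```python
import hashlib
import os
import tempfile

def sha1_of_file(fname, chunk_size=1048576):
    """Hash a file in fixed-size chunks and return the hex digest."""
    sha1 = hashlib.sha1()
    with open(fname, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha1.update(chunk)
    return sha1.hexdigest()

# Write a throwaway file of about 3 MB so that several chunks are read
payload = b'd2l' * 1_000_000
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.bin')
    with open(path, 'wb') as f:
        f.write(payload)
    digest = sha1_of_file(path)

# The chunked digest matches hashing the whole payload in one go
print(digest == hashlib.sha1(payload).hexdigest())
```

Feeding the hash object piece by piece yields the same digest as hashing the entire content at once, which is why the comparison against a stored hash works.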
def extract(filename, folder=None): #@save
    """Extract a zip/tar file into folder."""
    base_dir = os.path.dirname(filename)
    _, ext = os.path.splitext(filename)
    assert ext in ('.zip', '.tar', '.gz'), 'Only support zip/tar files.'
    if ext == '.zip':
        fp = zipfile.ZipFile(filename, 'r')
    else:
        fp = tarfile.open(filename, 'r')
    if folder is None:
        folder = base_dir
    fp.extractall(folder)

def download_extract(name, folder=None): #@save
    """Download and extract a zip/tar file."""
    fname = download(name)
    base_dir = os.path.dirname(fname)
    data_dir, ext = os.path.splitext(fname)
    if ext == '.zip':
        fp = zipfile.ZipFile(fname, 'r')
    elif ext in ('.tar', '.gz'):
        fp = tarfile.open(fname, 'r')
    else:
        assert False, 'Only zip/tar files can be extracted.'
    fp.extractall(base_dir)
    return os.path.join(base_dir, folder) if folder else data_dir

def tokenize(lines, token='word'): #@save
    """Split text lines into word or character tokens."""
    assert token in ('word', 'char'), 'Unknown token type: ' + token
    return [line.split() if token == 'word' else list(line) for line in lines]
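As a quick illustration of the two token types, the sketch below applies a standalone copy of this function to two made-up lines of text.

```python
def tokenize(lines, token='word'):
    """Split text lines into word or character tokens."""
    assert token in ('word', 'char'), 'Unknown token type: ' + token
    return [line.split() if token == 'word' else list(line) for line in lines]

lines = ['the time machine', 'by h g wells']  # Made-up sample input
word_tokens = tokenize(lines)
char_tokens = tokenize(lines, 'char')
print(word_tokens)
print(char_tokens[1])
```

Note that character-level tokenization keeps spaces as tokens, whereas word-level tokenization discards them.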
def evaluate_loss(net, data_iter, loss): #@save
    """Evaluate the loss of a model on the given dataset."""
    metric = d2l.Accumulator(2) # Sum of losses, no. of examples
    for X, y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]
def grad_clipping(net, theta): #@save
    """Clip the gradient."""
    if isinstance(net, nn.Module):
        params = [p for p in net.parameters() if p.requires_grad]
    else:
        params = net.params
    # Norm over all gradients taken together
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
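The rescaling rule can be checked by hand on a tiny example. The sketch below mirrors the same computation with plain Python lists instead of tensors; the function name clip_by_global_norm and the gradient values are made up for illustration.

```python
import math

def clip_by_global_norm(grads, theta):
    """Rescale all gradients so that their joint L2 norm is at most theta."""
    norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if norm > theta:
        grads = [[g * theta / norm for g in grad] for grad in grads]
    return grads, norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # Two made-up "parameters"; joint norm is 5
clipped, norm = clip_by_global_norm(grads, theta=1.0)
clipped_norm = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm, clipped, clipped_norm)
```

All gradients are multiplied by the same factor, so the direction of the update is preserved and only its length is capped.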
More for the attention chapter.
#@save
d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
                           '94646ad1522d915e7b0f9296181140edcf86a4f5')

#@save
def read_data_nmt():
    """Load the English-French dataset."""
    data_dir = d2l.download_extract('fra-eng')
    with open(os.path.join(data_dir, 'fra.txt'), 'r') as f:
        return f.read()

#@save
def preprocess_nmt(text):
    """Preprocess the English-French dataset."""
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != ' '

    # Replace non-breaking space with space, and convert uppercase letters to
    # lowercase ones
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    # Insert space between words and punctuation marks
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)
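To make the effect of this preprocessing concrete, the sketch below runs a standalone copy of the function on a short made-up sample in the same tab-separated layout (English, tab, French).

```python
def preprocess_nmt(text):
    """Standalone copy for illustration."""
    def no_space(char, prev_char):
        return char in set(',.!?') and prev_char != ' '

    # Normalize non-breaking spaces and lowercase everything
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    # Insert a space before punctuation that directly follows a word
    out = [' ' + char if i > 0 and no_space(char, text[i - 1]) else char
           for i, char in enumerate(text)]
    return ''.join(out)

raw = 'Go.\tVa !\nHi.\tSalut\u202f!'  # Made-up sample, not from the dataset file
processed = preprocess_nmt(raw)
print(repr(processed))
```

After this step punctuation marks are separated from words by a space, so that a later split on spaces treats them as tokens of their own.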
#@save
def tokenize_nmt(text, num_examples=None):
    """Tokenize the English-French dataset."""
    source, target = [], []
    for i, line in enumerate(text.split('\n')):
        if num_examples and i > num_examples:
            break
        parts = line.split('\t')
        if len(parts) == 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))
    return source, target

#@save
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad sequences."""
    if len(line) > num_steps:
        return line[:num_steps] # Truncate
    return line + [padding_token] * (num_steps - len(line)) # Pad
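The following sketch shows both branches on made-up token indices, using a standalone copy of the function.

```python
def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a sequence to exactly num_steps items."""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate
    return line + [padding_token] * (num_steps - len(line))  # Pad

short = truncate_pad([47, 4], 5, 1)             # Padded with token 1
long = truncate_pad([9, 8, 7, 6, 5, 4], 5, 1)   # Cut to the first 5 items
print(short, long)
```

Either way the result has exactly num_steps entries, which is what allows sequences of different lengths to be stacked into one minibatch.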
#@save
def build_array_nmt(lines, vocab, num_steps):
    """Transform text sequences of machine translation into minibatches."""
    lines = [vocab[l] for l in lines]
    lines = [l + [vocab['<eos>']] for l in lines]
    array = torch.tensor([truncate_pad(
        l, num_steps, vocab['<pad>']) for l in lines])
    valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)
    return array, valid_len

#@save
def load_data_nmt(batch_size, num_steps, num_examples=600):
    """Return the iterator and the vocabularies of the translation dataset."""
    text = preprocess_nmt(read_data_nmt())
    source, target = tokenize_nmt(text, num_examples)
    src_vocab = d2l.Vocab(source, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    tgt_vocab = d2l.Vocab(target, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
    data_iter = d2l.load_array(data_arrays, batch_size)
    return data_iter, src_vocab, tgt_vocab

#@save
def sequence_mask(X, valid_len, value=0):
    """Mask irrelevant entries in sequences."""
    maxlen = X.size(1)
    mask = torch.arange((maxlen), dtype=torch.float32,
                        device=X.device)[None, :] < valid_len[:, None]
    X[~mask] = value
    return X

#@save
class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):
    """The softmax cross-entropy loss with masks."""
    # `pred` shape: (`batch_size`, `num_steps`, `vocab_size`)
    # `label` shape: (`batch_size`, `num_steps`)
    # `valid_len` shape: (`batch_size`,)
    def forward(self, pred, label, valid_len):
        weights = torch.ones_like(label)
        weights = sequence_mask(weights, valid_len)
        self.reduction='none'
        unweighted_loss = super(MaskedSoftmaxCELoss, self).forward(
            pred.permute(0, 2, 1), label)
        weighted_loss = (unweighted_loss * weights).mean(dim=1)
        return weighted_loss

#@save
def train_seq2seq(net, data_iter, lr, num_epochs, tgt_vocab, device):
    """Train a model for sequence to sequence."""
    def xavier_init_weights(m):
        if type(m) == nn.Linear:
            nn.init.xavier_uniform_(m.weight)
        if type(m) == nn.GRU:
            for param in m._flat_weights_names:
                # Xavier initialization is defined only for weight matrices,
                # so skip the 1-D bias vectors
                if "weight" in param:
                    nn.init.xavier_uniform_(m._parameters[param])
    net.apply(xavier_init_weights)
    net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss = MaskedSoftmaxCELoss()
    net.train()
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[10, num_epochs])
    for epoch in range(num_epochs):
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # Sum of training loss, no. of tokens
        for batch in data_iter:
            optimizer.zero_grad()  # Clear gradients from the previous batch
            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]
            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0],
                               device=device).reshape(-1, 1)
            dec_input = torch.cat([bos, Y[:, :-1]], 1)  # Teacher forcing
            Y_hat, _ = net(X, dec_input, X_valid_len)
            l = loss(Y_hat, Y, Y_valid_len)
            l.sum().backward()  # Make the loss scalar for `backward`
            d2l.grad_clipping(net, 1)
            num_tokens = Y_valid_len.sum()
            optimizer.step()
            metric.add(l.sum(), num_tokens)
        if (epoch + 1) % 10 == 0:
            animator.add(epoch + 1, (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, {metric[1] / timer.stop():.1f} '
          f'tokens/sec on {str(device)}')
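The teacher-forcing step in `train_seq2seq` builds the decoder input by prepending `<bos>` to each target sequence and dropping its last token, so that the decoder sees the ground-truth token from the previous step. A minimal list-based sketch of that shift follows; the helper name and the token indices are hypothetical.

```python
def teacher_forcing_input(Y, bos_idx):
    """Shift each target sequence right by one step and prepend `<bos>`."""
    return [[bos_idx] + seq[:-1] for seq in Y]

# Hypothetical indices: <pad>=1, <bos>=2, <eos>=3; other values are words
Y = [[7, 8, 3, 1], [9, 3, 1, 1]]
dec_input = teacher_forcing_input(Y, bos_idx=2)
print(dec_input)  # [[2, 7, 8, 3], [2, 9, 3, 1]]
```

The decoder input keeps the same length as the target, so position `t` of the input is paired with the label at position `t`.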
#@save
def predict_seq2seq(net, src_sentence, src_vocab, tgt_vocab, num_steps,
                    device, save_attention_weights=False):
    """Predict for sequence to sequence."""
    # Set `net` to eval mode for inference
    net.eval()
    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [
        src_vocab['<eos>']]
    enc_valid_len = torch.tensor([len(src_tokens)], device=device)
    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])
    # Add the batch axis
    enc_X = torch.unsqueeze(
        torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)
    enc_outputs = net.encoder(enc_X, enc_valid_len)
    dec_state = net.decoder.init_state(enc_outputs, enc_valid_len)
    # Add the batch axis
    dec_X = torch.unsqueeze(torch.tensor(
        [tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)
    output_seq, attention_weight_seq = [], []
    for _ in range(num_steps):
        Y, dec_state = net.decoder(dec_X, dec_state)
        # We use the token with the highest prediction likelihood as input
        # of the decoder at the next time step
        dec_X = Y.argmax(dim=2)
        pred = dec_X.squeeze(dim=0).type(torch.int32).item()
        # Save attention weights (to be covered later)
        if save_attention_weights:
            attention_weight_seq.append(net.decoder.attention_weights)
        # Once the end-of-sequence token is predicted, the generation of the
        # output sequence is complete
        if pred == tgt_vocab['<eos>']:
            break
        output_seq.append(pred)
    return ' '.join(tgt_vocab.to_tokens(output_seq)), attention_weight_seq
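The decoding loop in `predict_seq2seq` is a greedy search: at each step the highest-scoring token is chosen and fed back as the next input, until `<eos>` appears or `num_steps` is reached. The sketch below isolates that control flow. Both `greedy_decode` and `toy_step` are hypothetical stand-ins, with `toy_step` replacing a real decoder by a fixed rule.

```python
def greedy_decode(step, state, bos_idx, eos_idx, num_steps):
    """Feed the argmax token back in until `<eos>` or the step limit."""
    token, output = bos_idx, []
    for _ in range(num_steps):
        scores, state = step(token, state)
        # Index of the highest-scoring token
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == eos_idx:
            break
        output.append(token)
    return output

def toy_step(token, state):
    """Hypothetical stand-in for a decoder: always favors `token + 1`."""
    vocab_size = 5
    scores = [0.0] * vocab_size
    scores[(token + 1) % vocab_size] = 1.0
    return scores, state

print(greedy_decode(toy_step, None, bos_idx=0, eos_idx=3, num_steps=10))  # [1, 2]
```

As in `predict_seq2seq`, the `<eos>` token ends generation and is not included in the output.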
20.8 d2l API Document
The implementations of the following members of the d2l package and sections where they are
defined and explained can be found in the source file.
20.8.1 Models
class d2l.torch.Module(plot_train_per_epoch=2, plot_valid_per_epoch=1)

Bases: Module, HyperParameters (page 983)
Defined in Section 3.2
apply_init(inputs, init=None)
configure_optimizers()
forward(X)
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: Although the recipe for forward pass needs to be defined within this function,
one should call the Module (page 979) instance afterwards instead of this since the
former takes care of running the registered hooks while the latter silently ignores them.

loss(y_hat, y)
plot(key, value, train)

Plot a point in animation.
training: bool
training_step(batch)
validation_step(batch)
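The note on `forward` above says that calling the module instance runs registered hooks, while calling `forward` directly skips them. The toy class below illustrates that dispatch pattern in plain Python. It is a simplified model written for illustration, not PyTorch's actual implementation, and `TinyModule` and its `hooks` list are hypothetical names.

```python
class TinyModule:
    """Toy stand-in for a module with forward hooks (not the PyTorch class)."""
    def __init__(self):
        self.hooks = []

    def forward(self, x):
        return 2 * x

    def __call__(self, x):
        out = self.forward(x)
        # Hooks run only on this path, after `forward` has produced its output
        for hook in self.hooks:
            hook(self, x, out)
        return out

seen = []
m = TinyModule()
m.hooks.append(lambda module, inp, out: seen.append((inp, out)))
m(3)          # Triggers the hook
m.forward(4)  # Bypasses the hook
print(seen)  # [(3, 6)]
```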
class d2l.torch.LinearRegressionScratch(num_inputs, lr, sigma=0.01)

Bases: Module (page 979)
forward(X)
The linear regression model.
loss(y_hat, y)
training: bool
class d2l.torch.LinearRegression(lr)
forward(X)
The linear regression model.
get_w_b()
loss(y_hat, y)
training: bool
class d2l.torch.Classifier(plot_train_per_epoch=2, plot_valid_per_epoch=1)

accuracy(Y_hat, Y, averaged=True)
Compute the number of correct predictions.

layer_summary(X_shape)
loss(Y_hat, Y, averaged=True)
training: bool
validation_step(batch)
20.8.2 Data
class d2l.torch.DataModule(root='../data', num_workers=4)

Bases: HyperParameters (page 983)
get_dataloader(train)
get_tensorloader(tensors, train, indices=slice(0, None, None))

train_dataloader()
val_dataloader()
class d2l.torch.SyntheticRegressionData(w, b, noise=0.01, num_train=1000, num_val=1000,
batch_size=32)
Bases: DataModule (page 981)
class d2l.torch.FashionMNIST(batch_size=64, resize=(28, 28))
Bases: DataModule (page 981)
text_labels(indices)
Return text labels.
visualize(batch, nrows=1, ncols=8, labels=[])

20.8.3 Trainer
class d2l.torch.Trainer(max_epochs, num_gpus=0, gradient_clip_val=0)

clip_gradients(grad_clip_val, model)
fit(model, data)
fit_epoch()
prepare_batch(batch)
prepare_data(data)
prepare_model(model)
class d2l.torch.SGD(params, lr)
step()
zero_grad()
20.8.4 Utilities
d2l.torch.add_to_class(Class)
d2l.torch.cpu()
d2l.torch.gpu(i=0)
d2l.torch.num_gpus()
class d2l.torch.ProgressBoard(xlabel=None, ylabel=None, xlim=None, ylim=None,
xscale='linear', yscale='linear', ls=['-', '--', '-.', ':'], colors=['C0', 'C1',
'C2', 'C3'], fig=None, axes=None, figsize=(3.5, 2.5), display=True)
Plot data points in animation.

draw(x, y, label, every_n=1)
class d2l.torch.HyperParameters
Bases: object
save_hyperparameters(ignore=[])
Save function arguments into class attributes.
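One plausible way to save function arguments into class attributes is to inspect the caller's stack frame. The sketch below uses the standard `inspect` module to do so. It is an assumption made for illustration and not necessarily the library's code, and the class names `HyperParametersSketch` and `Model` are hypothetical.

```python
import inspect

class HyperParametersSketch:
    """Sketch of a mixin that copies the caller's arguments onto `self`."""
    def save_hyperparameters(self, ignore=()):
        # Inspect the local variables of the calling frame (typically `__init__`)
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        self.hparams = {k: v for k, v in local_vars.items()
                        if k not in set(ignore) | {'self'}
                        and not k.startswith('_')}
        for k, v in self.hparams.items():
            setattr(self, k, v)

class Model(HyperParametersSketch):
    def __init__(self, lr, num_hiddens=8, debug=False):
        self.save_hyperparameters(ignore=['debug'])

model = Model(lr=0.1)
print(model.lr, model.num_hiddens, hasattr(model, 'debug'))  # 0.1 8 False
```

Calling `save_hyperparameters` as the first statement of `__init__` keeps the saved set limited to the constructor's arguments, since no other local variables exist yet.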

Bibliography
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … others. (2016). Tensorflow: a sys-
tem for large-scale machine learning. 12th $\$USENIX$\$ symposium on operating systems design
and implementation ($\$OSDI$\$ 16) (pp. 265–283).
Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional
neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language
processing, 22(10), 1533–1545.
Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., & Smola, A. J. (2012). Scalable inference in
latent variable models. Proceedings of the fifth ACM international conference on Web search and data
mining (pp. 123–132).
Aji, S. M., & McEliece, R. J. (2000). The generalized distributive law. IEEE transactions on Information
Theory, 46(2), 325–343.
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., … others. (2022). Flamingo: a
visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.
Alsallakh, B., Kokhlikyan, N., Miglani, V., Yuan, J., & Reblitz-Richardson, O. (2020). Mind the pad–
cnns can develop blind spots. arXiv preprint arXiv:2010.02178.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American mathematical
society, 68(3), 337–404.
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Baevski, A., & Auli, M. (2018). Adaptive input representations for neural language modeling. In-
ternational Conference on Learning Representations.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align
and translate. arXiv preprint arXiv:1409.0473.
Bay, H., Tuytelaars, T., & Van Gool, L. (2006). Surf: speeded up robust features. European conference
on computer vision (pp. 404–417).
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model.
Journal of machine learning research, 3(Feb), 1137–1155.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., … Bengio, Y. (2010).
Theano: a cpu and gpu math compiler in python. Proc. 9th python in science conf (pp. 3–10).
Beutel, A., Murray, K., Faloutsos, C., & Smola, A. J. (2014). Cobafi: collaborative bayesian filtering.
Proceedings of the 23rd international conference on World wide web (pp. 97–108).
Bishop, C. M. (1995). Training with noise is equivalent to tikhonov regularization. Neural compu-
tation, 7(1), 108–116.
985
Bishop, C. M. (2006). Pattern recognition and machine learning. springer.
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. The Journal of Po-
litical Economy, pp. 637–654.
Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-nms–improving object detection with
one line of code. Proceedings of the IEEE international conference on computer vision (pp. 5561–
5569).
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword
information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Bollobás, B. (1999). Linear analysis. Cambridge University Press, Cambridge.
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. Proceedings of
COMPSTAT'2010 (pp. 177–186). Springer.
Bottou, L., & Le Cun, Y. (1988). Sn: a simulator for connectionist models. Proceedings of NeuroNimes
88 (pp. 371–382). Nimes, France. URL: http://leon.bottou.org/papers/bottou-lecun-88
Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent
advances. ESAIM: probability and statistics, 9, 323–375.
Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent
advances. ESAIM: probability and statistics, 9, 323–375.
Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning
natural language inference. arXiv preprint arXiv:1508.05326.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge, England: Cambridge Uni-
versity Press.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: i. the method of
paired comparisons. Biometrika, 39(3/4), 324–345.
Brown, N., & Sandholm, T. (2017). Libratus: the superhuman ai for no-limit poker. IJCAI (pp. 5226–
5228).
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J., … Roossin, P. S.
(1990). A statistical approach to machine translation. Computational linguistics, 16(2), 79–85.
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Mercer, R. L., & Roossin,
P. (1988). A statistical approach to language translation. Coling Budapest 1988 Volume 1: Interna-
tional Conference on Computational Linguistics.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … others. (2020). Language
models are few-shot learners. Advances in neural information processing systems, 33, 1877–1901.
Buslaev, A., Iglovikov, V. I., Khvedchenya, E., Parinov, A., Druzhinin, M., & Kalinin, A. A. (2020).
Albumentations: fast and flexible image augmentations. Information, 11(2), 125.
Campbell, M., Hoane Jr, A. J., & Hsu, F.-h. (2002). Deep blue. Artificial intelligence, 134(1-2), 57–83.
Canny, J. (1987). A computational approach to edge detection. Readings in computer vision (pp. 184–
203). Elsevier.
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). Semeval-2017 task 1: seman-
tic textual similarity multilingual and crosslingual focused evaluation. Proceedings of the 11th
International Workshop on Semantic Evaluation (SemEval-2017) (pp. 1–14).
986 Bibliography
Cheng, J., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading.
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 551–
561).
Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E.
(2014). Cudnn: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.
Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural ma-
chine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio,
Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … others. (2022). Palm:
scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neu-
ral networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). Electra: pre-training text encoders as
discriminators rather than generators. International Conference on Learning Representations.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural
language processing (almost) from scratch. Journal of machine learning research, 12(ARTICLE),
2493–2537.
Cordonnier, J.-B., Loukas, A., & Jaggi, M. (2020). On the relationship between self-attention and
convolutional layers. International Conference on Learning Representations.
Cover, T., & Thomas, J.M. (1999). Elements of information theory. John Wiley & Sons.
Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3), 261–273.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of con-
trol, signals and systems, 2(4), 303–314.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. 2005 IEEE
computer society conference on computer vision and pattern recognition (CVPR'05) (pp. 886–893).
De Cock, D. (2011). Ames, iowa: alternative to the boston housing data as an end of semester
regression project. Journal of Statistics Education, 19(3).
Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., … others. (2012). Large scale
distributed deep networks. Proceedings of the 25th International Conference on Neural Information
Processing Systems-Volume 1 (pp. 1223–1231).
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., … Vogels, W.
(2007). Dynamo: amazon's highly available key-value store. ACM SIGOPS operating systems review
(pp. 205–220).
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: a large-scale hierar-
chical image database. 2009 IEEE conference on computer vision and pattern recognition (pp. 248–
255).
Der Kiureghian, A., & Ditlevsen, O. (2009). Aleatory or epistemic? does it matter? Structural safety,
31(2), 105–112.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805.
Bibliography 987
Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by con-
text prediction. Proceedings of the IEEE international conference on computer vision (pp. 1422–1430).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … others.
(2021). An image is worth 16x16 words: transformers for image recognition at scale. Interna-
tional Conference on Learning Representations.
Doucet, A., De Freitas, N., & Gordon, N. (2001). An introduction to sequential monte carlo meth-
ods. Sequential Monte Carlo methods in practice (pp. 3–14). Springer.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint
arXiv:1603.07285.
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. L. (2015). Preserving sta-
tistical validity in adaptive data analysis. Proceedings of the forty-seventh annual ACM symposium
on Theory of computing (pp. 117–126).
Fechner, G. T. (1860). Elemente der Ppsychophysik. Vol. 2. Breitkopf u. Härtel.
Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: scaling to trillion parameter mod-
els with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39.
Fernando, R. (2004). GPU gems: programming techniques, tips, and tricks for real-time graphics. Vol.
590. Addison-Wesley Reading.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties
of cortical cells. Josa a, 4(12), 2379–2394.
Fisher, R. (1928). Statistical methods for research workers. Stechert.
Flammarion, N., & Bach, F. (2015). From averaging to acceleration, there is only a step-size. Con-
ference on Learning Theory (pp. 658–695).
Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: finding sparse, trainable neural
networks. arXiv preprint arXiv:1803.03635.
Frazier, P. I. (2018). A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811.
Freund, Y., Schapire, R. E., & others. (1996). Experiments with a new boosting algorithm. icml
(pp. 148–156).
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American statistical association,
82(397), 249–266.
Frostig, R., Johnson, M. J., & Leary, C. (2018). Compiling machine learning programs via high-level
tracing. Systems for Machine Learning.
Fukushima, K. (1982). Neocognitron: a self-organizing neural network model for a mechanism of
visual pattern recognition. Competition and cooperation in neural nets (pp. 267–285). Springer.
Garg, S., Balakrishnan, S., Kolter, Z., & Lipton, Z. (2021). Ratt: leveraging unlabeled data to guar-
antee generalization. International Conference on Machine Learning (pp. 3598–3609).
Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural
networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2414–
2423).
Gauss, C. F. (1809). Theoria motus corporum coelestum. Werke.
988 Bibliography
Gibbs, J. W. (1902). Elementary principles of statistical mechanics. Compare, 289, 314.
Ginibre, J. (1965). Statistical ensembles of complex, quaternion, and real matrices. Journal of
Mathematical Physics, 6(3), 440–449.
Girshick, R. (2015). Fast r-cnn. Proceedings of the IEEE international conference on computer vision
(pp. 1440–1448).
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object
detection and semantic segmentation. Proceedings of the IEEE conference on computer vision and
pattern recognition (pp. 580–587).
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural
networks. Proceedings of the thirteenth international conference on artificial intelligence and statistics
(pp. 249–256).
Goh, G. (2017). Why momentum really works. Distill. URL: http://distill.pub/2017/momentum,
doi:10.23915/distill.00006295
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an
information tapestry. Communications of the ACM, 35(12), 61–71.
Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Johns Hopkins studies in the mathemat-
ical sciences.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.
deeplearningbook.org.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014).
Generative adversarial nets. Advances in neural information processing systems (pp. 2672–2680).
Gotmare, A., Keskar, N. S., Xiong, C., & Socher, R. (2018). A closer look at deep learning heuristics:
learning rate restarts, warmup and distillation. arXiv preprint arXiv:1810.13243.
Goyal, A., Bochkovskiy, A., Deng, J., & Koltun, V. (2021). Non-deep networks. arXiv preprint
arXiv:2110.07641.
Graham, B. (2014). Fractional max-pooling. arXiv preprint arXiv:1412.6071.
Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint
arXiv:1308.0850.
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2008). A novel
connectionist system for unconstrained handwriting recognition. IEEE transactions on pattern
analysis and machine intelligence, 31(5), 855–868.
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional lstm
and other neural network architectures. Neural networks, 18(5-6), 602–610.
Griewank, A. (1989). On automatic differentiation. Mathematical Programming: recent developments
and applications, 6(6), 83–107.
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2008). Feature extraction: foundations and appli-
cations. Vol. 207. Springer.
Hadjis, S., Zhang, C., Mitliagkas, I., Iter, D., & Ré, C. (2016). Omnivore: an optimizer for multi-
device deep learning on cpus and gpus. arXiv preprint arXiv:1606.04487.
295
https://doi.org/10.23915/distill.00006
Bibliography 989
Hartley, R., & Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge Univer-
sity Press.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. Proceedings of the IEEE interna-
tional conference on computer vision (pp. 2961–2969).
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: surpassing human-level
performance on imagenet classification. Proceedings of the IEEE international conference on com-
puter vision (pp. 1026–1034).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceed-
ings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. European
conference on computer vision (pp. 630–645).
Hebb, D. O., & Hebb, D. (1949). The organization of behavior. Vol. 65. Wiley New York.
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint
arXiv:1606.08415.
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint
arXiv:1606.08415.
Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.
Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J., & others (2001). Gradient flow in recurrent
nets: the difficulty of learning long-term dependencies.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–
1780.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., … others. (2022).
Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., … Adam, H. (2019). Searching for
mobilenetv3. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1314–
1324).
Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., & Schölkopf, B. (2009). Nonlinear causal discovery
with additive noise models. Advances in neural information processing systems (pp. 689–696).
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. Proceedings of the IEEE confer-
ence on computer vision and pattern recognition (pp. 7132–7141).
Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. 2008
Eighth IEEE International Conference on Data Mining (pp. 263–272).
Hu, Z., Lee, R. K.-W., Aggarwal, C. C., & Zhang, A. (2020). Text style transfer: a review and experi-
mental evaluation. arXiv preprint arXiv:2010.12742.
Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., … Eck, D.
(2018). Music transformer: generating music with long-term structure. International Conference
on Learning Representations.
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convo-
lutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition
(pp. 4700–4708).
990 Bibliography
Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional lstm-crf models for sequence tagging. arXiv
preprint arXiv:1508.01991.
Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat's striate cortex.
The Journal of physiology, 148(3), 574–591.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional archi-
tecture in the cat's visual cortex. The Journal of physiology, 160(1), 106–154.
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate
cortex. The Journal of physiology, 195(1), 215–243.
Ioffe, S. (2017). Batch renormalization: towards reducing minibatch dependence in batch-
normalized models. Advances in neural information processing systems (pp. 1945–1953).
Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reduc-
ing internal covariate shift. arXiv preprint arXiv:1502.03167.
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging weights
leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: convergence and generalization
in neural networks. Advances in neural information processing systems.
Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the"
echo state network" approach. Vol. 5. GMD-Forschungszentrum Informationstechnik Bonn.
James, W. (2007). The principles of psychology. Vol. 1. Cosimo, Inc.
Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., … others. (2018). Highly scalable deep learn-
ing training system with mixed-precision: training imagenet in four minutes. arXiv preprint
arXiv:1807.11205.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe:
convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM international
conference on Multimedia (pp. 675–678).
Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., & Levy, O. (2020). Spanbert: improving
pre-training by representing and predicting spans. Transactions of the Association for Computa-
tional Linguistics, 8, 64–77.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., … others. (2017). In-
datacenter performance analysis of a tensor processing unit. 2017 ACM/IEEE 44th Annual Inter-
national Symposium on Computer Architecture (ISCA) (pp. 1–12).
Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for mod-
elling sentences. arXiv preprint arXiv:1404.2188.
Kalman, B. L., & Kwasny, S. C. (1992). Why tanh: choosing a sigmoidal function. [Proceedings 1992]
IJCNN International Joint Conference on Neural Networks (pp. 578–581).
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … Amodei, D. (2020).
Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for improved
quality, stability, and variation. arXiv preprint arXiv:1710.10196.
Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint
arXiv:1408.5882.
Bibliography 991
Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions.
J. Math. Anal. Appl., 33, 82-95.
Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on
Learning Representations (ICLR).
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional net-
works. arXiv preprint arXiv:1609.02907.
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT press.
Kolmogorov, A. (1933). Sulla determinazione empirica di una lgge di distribuzione. Inst. Ital. At-
tuari, Giorn., 4, 83–91.
Kolter, Z. (2008). Linear algebra review and reference. Available online:
http://cs229.stanford.edu/section/cs229-linalg.pdf.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolu-
tional neural networks. Advances in neural information processing systems (pp. 1097–1105).
Kung, S. Y. (1988). Vlsi array processors. Englewood Cliffs, NJ, Prentice Hall, 1988, 685 p. Research
supported by the Semiconductor Research Corp., SDIO, NSF, and US Navy.
Kuzovkin, I., Vicente, R., Petton, M., Lachaux, J.-P., Baciu, M., Kahane, P., … Aru, J. (2018). Acti-
vations of deep convolutional neural networks are aligned with gamma band activity of human
visual cortex. Communications biology, 1(1), 1–12.
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). Albert: a lite bert for
self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Lavin, A., & Gray, S. (2016). Fast algorithms for convolutional neural networks. Proceedings of the
IEEE conference on computer vision and pattern recognition (pp. 4013–4021).
LeCun, Y., Bengio, Y., & others. (1995). Convolutional networks for images, speech, and time se-
ries. The handbook of brain theory and neural networks, 3361(10), 1995.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D.
(1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4),
541–551.
LeCun, Y., Bottou, L., Orr, G., & Muller, K.-R. (1998). Efficient backprop. Neural Networks: Tricks of
the Trade. New York: Springer.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., & others. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., … others. (1995). Comparison of
learning algorithms for handwritten digit recognition. International conference on artificial neural
networks (pp. 53–60).
Legendre, A. M. (1805). Mémoire sur les opérations trigonométriques: dont les résultats dépendent de la
figure de la terre. F. Didot.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … Zettlemoyer, L. (2019).
Bart: denoising sequence-to-sequence pre-training for natural language generation, transla-
tion, and comprehension. arXiv preprint arXiv:1910.13461.
992 Bibliography
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., … oth-
ers. (2022). Solving quantitative reasoning problems with language models. arXiv preprint
arXiv:2206.14858.
Li, M. (2017). Scaling Distributed Machine Learning with System and Algorithm Co-design (Doctoral
dissertation). PhD Thesis, CMU.
Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., … Su, B.-Y. (2014).
Scaling distributed machine learning with the parameter server. 11th $\$USENIX$\$ Symposium
on Operating Systems Design and Implementation ($\$OSDI$\$ 14) (pp. 583–598).
Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Efficient mini-batch training for stochastic op-
timization. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery
and data mining (pp. 661–670).
Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection.
Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).
Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., … others. (2010). Imagenet classification: fast
descriptor coding and large-scale svm training. Large scale visual recognition challenge.
Lin, Z., Feng, M., Santos, C. N. d., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). A structured
self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural networks for
sequence learning. arXiv preprint arXiv:1506.00019.
Lipton, Z. C., Kale, D. C., Elkan, C., & Wetzel, R. (2016). Learning to diagnose with lstm recurrent
neural networks. International Conference on Learning Representations (ICLR).
Lipton, Z. C., & Steinhardt, J. (2018). Troubling trends in machine learning scholarship. Commu-
nications of the ACM (CACM).
Liu, D. C., & Nocedal, J. (1989). On the limited memory bfgs method for large scale optimization.
Mathematical programming, 45(1), 503–528.
Liu, H., Simonyan, K., & Yang, Y. (2018). Darts: differentiable architecture search. arXiv preprint
arXiv:1806.09055.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). Ssd: single
shot multibox detector. European conference on computer vision (pp. 21–37).
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). Roberta: a robustly
optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). Roberta: a robustly
optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … Guo, B. (2021). Swin transformer: hierarchical
vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference
on Computer Vision (pp. 10012–10022).
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s.
arXiv preprint arXiv:2201.03545.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmenta-
tion. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).
Loshchilov, I., & Hutter, F. (2016). Sgdr: stochastic gradient descent with warm restarts. arXiv
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal
of computer vision, 60(2), 91–110.
Luo, P., Wang, X., Shao, W., & Peng, Z. (2018). Towards understanding regularization in batch
normalization. arXiv preprint.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors
for sentiment analysis. Proceedings of the 49th annual meeting of the association for computational
linguistics: Human language technologies-volume 1 (pp. 142–150).
MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge
University Press.
Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming.
Oper. Res., 13, 444-452.
Mangram, M. E. (2013). A simplified perspective of the markowitz portfolio theory. Global journal
of business research, 7(1), 59–70.
McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in translation: contextualized
word vectors. Advances in Neural Information Processing Systems (pp. 6294–6305).
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity.
The bulletin of mathematical biophysics, 5(4), 115–133.
Mead, C. (1980). Introduction to vlsi systems. IEE Proceedings I-Solid-State and Electron Devices,
128(1), 18.
Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). Pointer sentinel mixture models. arXiv
Micchelli, C. A. (1984). Interpolation of scattered data: distance matrices and conditionally posi-
tive definite functions. Approximation theory and spline functions (pp. 143–145). Springer.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations
in vector space. arXiv preprint arXiv:1301.3781.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of
words and phrases and their compositionality. Advances in neural information processing systems
(pp. 3111–3119).
Miller, G. A. (1995). Wordnet: a lexical database for english. Communications of the ACM, 38(11),
39–41.
Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., … Dean, J. (2017). Device
placement optimization with reinforcement learning. Proceedings of the 34th International Con-
ference on Machine Learning-Volume 70 (pp. 2430–2439).
Mnih, V., Heess, N., Graves, A., & others. (2014). Recurrent models of visual attention. Advances in
neural information processing systems (pp. 2204–2212).
Moon, T., Smola, A., Chang, Y., & Zheng, Z. (2010). Intervalrank: isotonic regression with listwise
and pairwise constraints. Proceedings of the third ACM international conference on Web search and
data mining (pp. 151–160).
Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of
placing confidence in confidence intervals. Psychonomic bulletin & review, 23(1), 103–123.
Morozov, V. A. (2012). Methods for solving incorrectly posed problems. Springer Science & Business
Media.
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–
142.
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines.
Icml.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). Deep double de-
scent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and
Experiment, 2021(12), 124003.
Naor, M., & Reingold, O. (1999). On the construction of pseudorandom permutations: Luby-Rackoff
revisited. Journal of Cryptology, 12(1), 29–66.
Nesterov, Y., & Vial, J.-P. (2000). Confidence level solutions for stochastic programming, Stochastic Pro-
gramming E-Print Series.
Nesterov, Y. (2018). Lectures on convex optimization. Vol. 137. Springer.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of
probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and
Physical Sciences, 236(767), 333–380.
Ong, C. S., Smola, A., Williamson, R., & others. (2005). Learning the kernel with hyperkernels.
Journal of Machine Learning Research.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of
machine translation. Proceedings of the 40th annual meeting of the Association for Computational
Linguistics (pp. 311–318).
Parikh, A. P., Täckström, O., Das, D., & Uszkoreit, J. (2016). A decomposable attention model for
natural language inference. arXiv preprint arXiv:1606.01933.
Park, T., Liu, M.-Y., Wang, T.-C., & Zhu, J.-Y. (2019). Semantic image synthesis with spatially-
adaptive normalization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition (pp. 2337–2346).
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … others. (2019). Pytorch: an
imperative style, high-performance deep learning library. Advances in neural information pro-
cessing systems, 32, 8026–8037.
Paulus, R., Xiong, C., & Socher, R. (2017). A deep reinforced model for abstractive summarization.
Pennington, J., Schoenholz, S., & Ganguli, S. (2017). Resurrecting the sigmoid in deep learning
through dynamical isometry: theory and practice. Advances in neural information processing sys-
tems (pp. 4785–4795).
Pennington, J., Socher, R., & Manning, C. (2014). Glove: global vectors for word representation.
Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)
(pp. 1532–1543).
Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of causal inference: foundations and learning
algorithms. MIT press.
Peters, M., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging
with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers) (pp. 1756–1765).
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep
contextualized word representations. Proceedings of the 2018 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1
(Long Papers) (pp. 2227–2237).
Petersen, K. B., Pedersen, M. S., & others. (2008). The matrix cookbook. Technical University of
Denmark, 7(15), 510.
Pleiss, G., Chen, D., Huang, G., Li, T., Van Der Maaten, L., & Weinberger, K. Q. (2017). Memory-
efficient implementation of densenets. arXiv preprint arXiv:1707.06990.
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR
Computational Mathematics and Mathematical Physics, 4(5), 1–17.
Popper, K. (2005). The logic of scientific discovery. Routledge.
Quadrana, M., Cremonesi, P., & Jannach, D. (2018). Sequence-aware recommender systems. ACM
Computing Surveys (CSUR), 51(4), 66.
Quinlan, J. R. (2014). C4.5: programs for machine learning. Elsevier.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … others. (2021). Learn-
ing transferable visual models from natural language supervision. International Conference on
Machine Learning (pp. 8748–8763).
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep con-
volutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understand-
ing by generative pre-training. OpenAI.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are
unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Radosavovic, I., Johnson, J., Xie, S., Lo, W.-Y., & Dollár, P. (2019). On network design spaces
for visual recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision
(pp. 1882–1890).
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing network de-
sign spaces. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(pp. 10428–10436).
Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., … others. (2021).
Scaling language models: methods, analysis & insights from training gopher. arXiv preprint
arXiv:2112.11446.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … Liu, P. J. (2020). Exploring
the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning
Research, 21, 1–67.
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). Squad: 100,000+ questions for machine
comprehension of text. arXiv preprint arXiv:1606.05250.
Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., & Shlens, J. (2019). Stand-alone
self-attention in vision models. Advances in Neural Information Processing Systems, 32.
Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. arXiv preprint
arXiv:1710.05941.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional
image generation with clip latents. arXiv preprint arXiv:2204.06125.
Ranzato, M., Boureau, Y.-L., Chopra, S., & LeCun, Y. (2007). A unified energy-based
framework for unsupervised learning. Artificial Intelligence and Statistics (pp. 371–379).
Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of adam and beyond. arXiv preprint
arXiv:1904.09237.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: unified, real-
time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition
(pp. 779–788).
Redmon, J., & Farhadi, A. (2018). Yolov3: an incremental improvement. arXiv preprint
arXiv:1804.02767.
Reed, S., & De Freitas, N. (2015). Neural programmer-interpreters. arXiv preprint arXiv:1511.06279.
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., … others.
(2022). A generalist agent. arXiv preprint arXiv:2205.06175.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: towards real-time object detection with
region proposal networks. Advances in neural information processing systems (pp. 91–99).
Revels, J., Lubin, M., & Papamarkou, T. (2016). Forward-mode automatic differentiation in julia.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature
neuroscience, 2(11), 1019–1025.
Rockafellar, R. T. (1970). Convex Analysis. Vol. 28. Princeton, NJ: Princeton University Press.
Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep learning is robust to massive label
noise. arXiv preprint arXiv:1705.10694.
Rudin, W. (1973). Functional Analysis. New York: McGraw-Hill.
Rumelhart, D. E., Hinton, G. E., Williams, R. J., & others. (1988). Learning representations by
back-propagating errors. Cognitive modeling, 5(3), 1.
Russakovsky, O., Deng, J., Huang, Z., Berg, A. C., & Fei-Fei, L. (2013). Detecting avocados to zuc-
chinis: what have we done, and where are we going? International Conference on Computer Vision
(ICCV).
Russell, S. J., & Norvig, P. (2016). Artificial intelligence: a modern approach. Malaysia: Pearson Education
Limited.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., … others. (2022). Photorealistic text-
to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487.
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). Distilbert, a distilled version of bert: smaller,
faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help opti-
mization? Advances in Neural Information Processing Systems (pp. 2483–2493).
Sarwar, B. M., Karypis, G., Konstan, J. A., Riedl, J., & others. (2001). Item-based collaborative fil-
tering recommendation algorithms. Www, 1, 285–295.
Schein, A. I., Popescul, A., Ungar, L. H., & Pennock, D. M. (2002). Methods and metrics for cold-
start recommendations. Proceedings of the 25th annual international ACM SIGIR conference on Re-
search and development in information retrieval (pp. 253–260).
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: support vector machines, regularization,
optimization, and beyond. Adaptive Computation and Machine Learning Series.
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions
on Signal Processing, 45(11), 2673–2681.
Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In D. P. Helmbold &
B. Williamson (Eds.), Proc. Annual Conf. Computational Learning Theory (pp. 416–426). London, UK:
Springer-Verlag.
Schölkopf, B., Burges, C., & Vapnik, V. (1996). Incorporating invariances in support vector learning
machines. International Conference on Artificial Neural Networks (pp. 47–52).
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with sub-
word units. arXiv preprint arXiv:1508.07909.
Sergeev, A., & Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in tensorflow.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal,
27(3), 379–423.
Shao, H., Yao, S., Sun, D., Zhang, A., Liu, S., Liu, D., … Abdelzaher, T. (2020). Controlvae: control-
lable variational autoencoder. Proceedings of the 37th International Conference on Machine Learn-
ing.
Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., … others. (2016).
Mastering the game of go with deep neural networks and tree search. nature, 529(7587), 484.
Simard, P. Y., LeCun, Y. A., Denker, J. S., & Victorri, B. (1998). Transformation invariance in pat-
tern recognition—tangent distance and tangent propagation. Neural networks: tricks of the trade
(pp. 239–274). Springer.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556.
Sindhwani, V., Sainath, T. N., & Kumar, S. (2015). Structured transforms for small-footprint deep
learning. arXiv preprint arXiv:1510.01722.
Sivic, J., & Zisserman, A. (2003). Video google: a text retrieval approach to object matching in
videos. Computer Vision, IEEE International Conference on (pp. 1470–1470).
Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., … others. (2022). Using
deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language
model. arXiv preprint arXiv:2201.11990.
Smola, A., & Narayanamurthy, S. (2010). An architecture for parallel topic models. Proceedings of
the VLDB Endowment, 3(1-2), 703–710.
Speelpenning, B. (1980). Compiling fast partial derivatives of functions given by algorithms (Doctoral
dissertation). University of Illinois at Urbana-Champaign.
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., … others. (2022). Beyond
the imitation game: quantifying and extrapolating the capabilities of language models. arXiv
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a
simple way to prevent neural networks from overfitting. The Journal of Machine Learning Re-
search, 15(1), 1929–1958.
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv preprint
arXiv:1505.00387.
Strang, G. (1993). Introduction to linear algebra. Vol. 3. Wellesley-Cambridge Press Wellesley, MA.
Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in
artificial intelligence, 2009.
Sukhbaatar, S., Weston, J., Fergus, R., & others. (2015). End-to-end memory networks. Advances in
neural information processing systems (pp. 2440–2448).
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and
momentum in deep learning. International conference on machine learning (pp. 1139–1147).
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks.
Advances in neural information processing systems (pp. 3104–3112).
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet and the
impact of residual connections on learning. Thirty-First AAAI Conference on Artificial Intelligence.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Go-
ing deeper with convolutions. Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 1–9).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception ar-
chitecture for computer vision. Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 2818–2826).
Tallec, C., & Ollivier, Y. (2017). Unbiasing truncated backpropagation through time. arXiv preprint
arXiv:1705.08209.
Tan, M., & Le, Q. (2019). Efficientnet: rethinking model scaling for convolutional neural networks.
International conference on machine learning (pp. 6105–6114).
Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin markov networks. Advances in neural in-
formation processing systems, 16, 25.
Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). Efficient transformers: a survey. arXiv
Teye, M., Azizpour, H., & Smith, K. (2018). Bayesian uncertainty estimation for batch normalized
deep networks. arXiv preprint arXiv:1802.06455.
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., … Li, L.-J. (2016).
Yfcc100m: the new data in multimedia research. Communications of the ACM, 59(2), 64–73.
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: divide the gradient by a running average
of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 26–31.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. W.H. Winston.
Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., … others. (2021).
Mlp-mixer: an all-mlp architecture for vision. Advances in Neural Information Processing Systems,
34.
Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: a large data set for
nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine
intelligence, 30(11), 1958–1970.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-
efficient image transformers & distillation through attention. International Conference on Ma-
chine Learning (pp. 10347–10357).
Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: an overview. International Journal of
Data Warehousing and Mining (IJDWM), 3(3), 1–13.
Turing, A. (1950). Computing machinery and intelligence. Mind, 59(236), 433.
Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object
recognition. International journal of computer vision, 104(2), 154–171.
Van Loan, C. F., & Golub, G. H. (1983). Matrix computations. Johns Hopkins University Press.
Vapnik, V. (1998). Statistical Learning Theory. New York: John Wiley and Sons.
Vapnik, V., & Chervonenkis, A. (1964). A note on one class of perceptrons. Automation and Remote
Control, 25.
Vapnik, V., & Chervonenkis, A. (1968). Uniform convergence of frequencies of occurrence of events
to their probabilities. Dokl. Akad. Nauk SSSR, 181, 915-918.
Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of
events to their probabilities. Theory Probab. Appl., 16(2), 264-281.
Vapnik, V., & Chervonenkis, A. (1981). The necessary and sufficient conditions for the uniform
convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya, 26(3),
543-564.
Vapnik, V., & Chervonenkis, A. (1991). The necessary and sufficient conditions for consistency in
the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3), 283-305.
Vapnik, V. N., & Chervonenkis, A. Y. (1974). Ordered risk minimization. Automation and Remote
Control, 35, 1226–1235, 1403–1412.
Vapnik, V. (1992). Principles of risk minimization for learning theory. Advances in neural informa-
tion processing systems (pp. 831–838).
Vapnik, V., Levin, E., & Le Cun, Y. (1994). Measuring the vc-dimension of a learning machine.
Neural computation, 6(5), 851–876.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017).
Attention is all you need. Advances in neural information processing systems (pp. 5998–6008).
Wahba, G. (1990). Spline models for observational data. SIAM.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using
time-delay neural networks. IEEE transactions on acoustics, speech, and signal processing, 37(3),
328–339.
Wang, L., Li, M., Liberty, E., & Smola, A. J. (2018). Optimal message scheduling for aggregation.
NETWORKS, 2(3), 2–3.
Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., & Chao, L. S. (2019). Learning deep trans-
former models for machine translation. Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics (pp. 1810–1822).
Wang, Y., Davidson, A., Pan, Y., Wu, Y., Riffel, A., & Owens, J. D. (2016). Gunrock: a high-
performance graph processing library on the gpu. ACM SIGPLAN Notices (p. 11).
Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural network acceptability judgments. Trans-
actions of the Association for Computational Linguistics, 7, 625–641.
Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science &
Business Media.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8(3-4), 279–292.
Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A,
pp. 359–372.
Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient langevin dynamics.
Proceedings of the 28th international conference on machine learning (ICML-11) (pp. 681–688).
Wengert, R. E. (1964). A simple automatic derivative evaluation program. Communications of the
ACM, 7(8), 463–464.
Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of
the IEEE, 78(10), 1550–1560.
Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Ann. Math
(pp. 325–327).
Wood, F., Gasthaus, J., Archambeau, C., James, L., & Teh, Y. W. (2011). The sequence memoizer.
Communications of the ACM, 54(2), 91–98.
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … others. (2016). Google's
neural machine translation system: bridging the gap between human and machine translation.
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking
machine learning algorithms. arXiv preprint arXiv:1708.07747.
Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., & Pennington, J. (2018). Dynamical isom-
etry and a mean field theory of cnns: how to train 10,000-layer vanilla convolutional neural
networks. International Conference on Machine Learning (pp. 5393–5402).
Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for
deep neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition
(pp. 1492–1500).
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., … Liu, T. (2020). On layer normalization
in the transformer architecture. International Conference on Machine Learning (pp. 10524–10533).
Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The microsoft 2017 con-
versational speech recognition system. 2018 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP) (pp. 5934–5938).
Ramón y Cajal, S., & Azoulay, L. (1894). Les nouvelles idées sur la structure du système nerveux chez l'homme
et chez les vertébrés. C. Reinwald.
Yamaguchi, K., Sakamoto, K., Akabane, T., & Fujimoto, Y. (1990). A neural network for speaker-
independent isolated word recognition. First International Conference on Spoken Language Pro-
cessing.
Yang, Z., Moczulski, M., Denil, M., De Freitas, N., Smola, A., Song, L., & Wang, Z. (2015). Deep fried
convnets. Proceedings of the IEEE International Conference on Computer Vision (pp. 1476–1483).
Ye, M., Yin, P., Lee, W.-C., & Lee, D.-L. (2011). Exploiting geographical influence for collaborative
point-of-interest recommendation. Proceedings of the 34th international ACM SIGIR conference on
Research and development in Information Retrieval (pp. 325–334).
You, Y., Gitman, I., & Ginsburg, B. (2017). Large batch training of convolutional networks. arXiv
Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., … Wu, Y. (2022). Scaling autoregressive
models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789.
Zaheer, M., Reddi, S., Sachan, D., Kale, S., & Kumar, S. (2018). Adaptive methods for nonconvex
optimization. Advances in Neural Information Processing Systems (pp. 9793–9803).
Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Zeiler, M. D., & Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neu-
ral networks. arXiv preprint arXiv:1301.3557.
Zhang, A., Tay, Y., Zhang, S., Chan, A., Luu, A. T., Hui, S. C., & Fu, J. (2021). Beyond fully-
connected layers with quaternions: parameterization of hypercomplex multiplications with 1/n
parameters. International Conference on Learning Representations.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still)
requires rethinking generalization. Communications of the ACM, 64(3), 107–115.
Zhang, S., Yao, L., Sun, A., & Tay, Y. (2019). Deep learning based recommender system: a survey
and new perspectives. ACM Computing Surveys (CSUR), 52(1), 5.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., … others. (2022). Opt: open pre-
trained transformer language models. arXiv preprint arXiv:2205.01068.
Zhang, W., & others. (1988). Shift-invariant pattern recognition neural network and its optical ar-
chitecture. Proceedings of annual conference of the Japan Society of Applied Physics.
Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., … Wang, X. (2021). Bytetrack: multi-object
tracking by associating every detection box. arXiv preprint arXiv:2110.06864.
Zhao, Z.-Q., Zheng, P., Xu, S.-t., & Wu, X. (2019). Object detection with deep learning: a review.
IEEE transactions on neural networks and learning systems, 30(11), 3212–3232.
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using
cycle-consistent adversarial networks. Proceedings of the IEEE international conference on com-
puter vision (pp. 2223–2232).
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Align-
ing books and movies: towards story-like visual explanations by watching movies and reading
books. Proceedings of the IEEE international conference on computer vision (pp. 19–27).
Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint
arXiv:1611.01578.
Index
A
accuracy() (d2l.torch.Classifier method), 980
add_to_class() (in module d2l.torch), 982
apply_init() (d2l.torch.Module method), 979
C
Classifier (class in d2l.torch), 980
clip_gradients() (d2l.torch.Trainer method), 982
configure_optimizers() (d2l.torch.LinearRegression method), 980
configure_optimizers() (d2l.torch.LinearRegressionScratch method), 980
configure_optimizers() (d2l.torch.Module method), 979
cpu() (in module d2l.torch), 982
D
DataModule (class in d2l.torch), 981
draw() (d2l.torch.ProgressBoard method), 982
F
FashionMNIST (class in d2l.torch), 981
fit() (d2l.torch.Trainer method), 982
fit_epoch() (d2l.torch.Trainer method), 982
forward() (d2l.torch.LinearRegression method), 980
forward() (d2l.torch.LinearRegressionScratch method), 980
forward() (d2l.torch.Module method), 979
G
get_dataloader() (d2l.torch.DataModule method), 981
get_dataloader() (d2l.torch.FashionMNIST method), 981
get_dataloader() (d2l.torch.SyntheticRegressionData method), 981
get_tensorloader() (d2l.torch.DataModule method), 981
get_w_b() (d2l.torch.LinearRegression method), 980
gpu() (in module d2l.torch), 982
H
HyperParameters (class in d2l.torch), 983
L
layer_summary() (d2l.torch.Classifier method), 980
LinearRegression (class in d2l.torch), 980
LinearRegressionScratch (class in d2l.torch), 980
loss() (d2l.torch.Classifier method), 981
loss() (d2l.torch.LinearRegression method), 980
loss() (d2l.torch.LinearRegressionScratch method), 980
loss() (d2l.torch.Module method), 979
M
Module (class in d2l.torch), 979
N
num_gpus() (in module d2l.torch), 982
P
plot() (d2l.torch.Module method), 980
prepare_batch() (d2l.torch.Trainer method), 982
prepare_data() (d2l.torch.Trainer method), 982
prepare_model() (d2l.torch.Trainer method), 982
ProgressBoard (class in d2l.torch), 982
S
save_hyperparameters() (d2l.torch.HyperParameters method), 983
SGD (class in d2l.torch), 982
step() (d2l.torch.SGD method), 982
SyntheticRegressionData (class in d2l.torch), 981
T
text_labels() (d2l.torch.FashionMNIST method), 981
train_dataloader() (d2l.torch.DataModule method), 981
Trainer (class in d2l.torch), 982
training (d2l.torch.Classifier attribute), 981
training (d2l.torch.LinearRegression attribute), 980
training (d2l.torch.LinearRegressionScratch attribute), 980
training (d2l.torch.Module attribute), 980
training_step() (d2l.torch.Module method), 980
V
val_dataloader() (d2l.torch.DataModule method), 981
validation_step() (d2l.torch.Classifier method), 981
validation_step() (d2l.torch.Module method), 980
visualize() (d2l.torch.FashionMNIST method), 981
Z
zero_grad() (d2l.torch.SGD method), 982

