
Machine Learning for Economics—Handout of Lecture 1

Bernard Salanié

8 July 2024



Going Beyond the Buzz

Glossary

Machine learning                     Statistics
----------------                     ----------
network, graphs                      model
weights                              parameters
learning                             fitting
generalization                       test set performance
supervised learning                  regression/classification
unsupervised learning                density estimation, clustering
large grant = $1,000,000             large grant = $50,000
nice place to have a meeting:        nice place to have a meeting:
  Snowbird, Utah, French Alps          Las Vegas in August

. . . and the parody



Machine Learning

Outside of economics:
computer science + statistics, driven by user-side applications
(mostly) focused on model selection for prediction, using
large datasets with many covariates
the new, really heavy stuff: Large Language Models, like GPT.



Machine Learning in Econometrics

We want to do model selection and inference so we can study
causal effects.
We use:
the light stuff: regularized linear models, e.g. the Lasso
the heavy stuff: deep learning with neural networks, which came
and went in the 1980s and returned in force in the 2010s
in the middle: random forests, bagging, boosting.

And we adapt these methods so we can use standard
(asymptotic) inference: tests, confidence regions.



Things I won’t cover

1 ML for time series


2 text analysis: Natural Language Processing (see e.g.
Gentzkow-Shapiro-Taddy Eca 2019,
Gentzkow-Kelly-Taddy JEL 2019).
3 (a fortiori) Large Language Models (see Korinek JEL 2023)
4 real Big Data: Velocity, Volume, Variety (in particular:
combining heterogeneous data)



Our typical use case

In microeconometrics, you often want to control for a bunch
of covariates, with limited interest in their coefficients.
Common case: a first-step estimator (propensity score,
generated regressors, some kind of projection).
This is a variant of model selection: how to select a
specification and its coefficients.
Trade-off:
we want to fit the data well (reduce bias)
without overfitting it, which would increase variance in
prediction.

Traditional, and still current, solution: reward parsimony.



Information Criteria
May feel passé but are still useful sometimes.
We start with a set M of candidate models
with more or fewer regressors: p(M) regressors for model M ∈ M.
Akaike’s Information Criterion: in maximum likelihood estimation,
choose a model by

$$
\max_{M \in \mathcal{M}} \left( \max_{\theta} \sum_{i=1}^{n} \log l_i(\theta; M) - p(M) \right).
$$

Bayesian Information Criterion: use

$$
\max_{M \in \mathcal{M}} \left( \max_{\theta} \sum_{i=1}^{n} \log l_i(\theta; M) - p(M) \log(n)/2 \right).
$$

They can be extended to estimation methods other than
maximum likelihood.
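A minimal sketch of how one might implement this criterion for Gaussian linear models with statsmodels; the candidate models, variable names, and the convention that p(M) counts all estimated coefficients are illustrative assumptions, not part of the lecture.

```python
# Sketch: choose among nested OLS models by the penalized log-likelihood above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(size=n)

# candidate models = subsets of regressors (always including a constant)
candidates = {"x1": [0], "x1,x2": [0, 1], "x1,x2,x3": [0, 1, 2], "all": [0, 1, 2, 3]}

scores = {}
for name, cols in candidates.items():
    X = sm.add_constant(x[:, cols])
    res = sm.OLS(y, X).fit()
    p = X.shape[1]                        # p(M): number of estimated coefficients
    loglik = res.llf                      # maximized log-likelihood
    scores[name] = {"AIC": loglik - p,                      # slide's AIC objective
                    "BIC": loglik - p * np.log(n) / 2}      # slide's BIC objective

print("AIC picks:", max(scores, key=lambda m: scores[m]["AIC"]))
print("BIC picks:", max(scores, key=lambda m: scores[m]["BIC"]))
```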
ICs (ctd)

AIC is better for out-of-sample prediction;
BIC is better for model choice.
They are both impractical when considering a large number of
models and large datasets.
E.g. a common situation:
1 we have dozens of variables
2 which we can interact in myriads of ways
3 to the point that even with a large dataset our pool of
regressors may be larger than the number of observations:
p ≫ n.
4 then we can get a perfect fit! Is this a good idea?



It is not
First of all, if p > n there are many ways of fitting the data
perfectly → many “perfect” estimators in-sample.
Suppose you are given another set of covariates x^o
(out-of-sample) and you are asked to predict y^o.
Which estimator should you choose? They will all give different
values for y^o.

Moreover, fitting the data perfectly means that unusual
observations/outliers will unduly influence the value of the
estimators.
This is our familiar bias-variance tradeoff: it may be worth
tolerating some bias in-sample to get less variance,
and generate better predictions out-of-sample,
using cross-validation to optimize the method.

All of machine learning relies on this idea.
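A small simulated illustration of the last two slides; the sample size, number of regressors, and coefficient values are made up for the example.

```python
# With p > n, least squares fits the training sample exactly,
# yet predicts poorly out-of-sample. Numbers are purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 80                                          # more regressors than observations
theta0 = np.zeros(p)
theta0[:3] = [2.0, -1.0, 0.5]                          # only 3 "real" coefficients

X = rng.normal(size=(n, p))
y = X @ theta0 + rng.normal(size=n)

# minimum-norm least-squares solution: one of the many "perfect" fits
theta_hat = np.linalg.pinv(X) @ y
print("in-sample MSE     :", np.mean((y - X @ theta_hat) ** 2))    # essentially 0

# fresh data from the same design
X_new = rng.normal(size=(n, p))
y_new = X_new @ theta0 + rng.normal(size=n)
print("out-of-sample MSE :", np.mean((y_new - X_new @ theta_hat) ** 2))  # much larger
```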


Shrinkage
Take a classical linear model. OLS is BLUE (best among linear
unbiased estimators).
What of biased estimators? Could one have a nonzero bias but
a smaller MSE than OLS?

MSE = Bias² + Variance.

Yes indeed!

OLS attempts to fit your sample as well as possible, even
though it may contain some unusual observations.
The idea is to “shrink” the influence of unusual observations,
by pulling θ̂n towards zero.
The resulting estimator has a higher bias but a lower variance;
this is especially useful when predicting y out-of-sample.
What of Tests and Confidence Intervals?

For now, we are only talking about improving prediction.
Model selection is a bit like pre-testing, or a first step in a
two-step method;
we know that these things affect the properties of second-step
estimators

→ we move on—more about testing later.



Penalizing large coefficients

= ridge regression, Lasso and variants, Elastic Net, . . .


Remember ℓ_m norms:

$$
\|\theta\|_m = \Big(\sum_{k=1}^{p} |\theta_k|^m\Big)^{1/m}.
$$

In the limit as m → 0, we get the ℓ₀ “norm” ‖θ‖₀ = the number of
nonzero coefficients, which is the penalty p(M) used with the AIC
and BIC.

We could try convex versions:

$$
\ell_2\text{-penalty: } \|\theta\|_2^2 = \sum_{k=1}^{p} \theta_k^2,
\qquad
\ell_1\text{-penalty: } \|\theta\|_1 = \sum_{k=1}^{p} |\theta_k|.
$$

They give us respectively ridge regression and the
Least Absolute Shrinkage and Selection Operator (Lasso).
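A brief sketch contrasting the two penalties with scikit-learn; the simulated data, the sparse coefficient vector, and the penalty levels are illustrative assumptions.

```python
# Ridge (l2) vs Lasso (l1) on simulated data with a sparse truth.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
theta0 = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))     # sparse truth
y = X @ theta0 + rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(X, y)      # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 2))    # all nonzero, shrunk
print("lasso coefficients:", np.round(lasso.coef_, 2))    # typically several exactly zero
```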



Ridge Regression
Consider y_i = x_i'θ₀ + ε_i, with E(ε|x) = 0 and V(ε|x) = σ₀².
The OLS estimator θ̂^O_n minimizes Σ_{i=1}^n (y_i − x_i'θ)².

The ridge regression method minimizes

$$
\sum_{i=1}^{n} (y_i - x_i'\theta)^2 + \lambda \sum_{k=1}^{p} \theta_k^2
$$

and yields

$$
(X'X + \lambda I_p)\,\hat\theta^R_n = X'y.
$$

E.g. with only one covariate:

$$
\hat\theta^R_n = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \lambda}
             = \frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} x_i^2 + \lambda}\,\hat\theta^O_n,
$$

which does shrink θ̂^O_n towards 0.
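A minimal numerical check of the closed form above; the simulated data and the comparison with scikit-learn's Ridge (with fit_intercept=False, so the objectives match) are my own illustration.

```python
# Check that solving (X'X + lambda I_p) theta = X'y reproduces sklearn's Ridge.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n, p, lam = 100, 5, 2.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
theta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(theta_closed, theta_sklearn))   # True: same estimator
```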
What is Good about Ridge Regression?

It shrinks large coefficients (proportionally) more than small
ones, good for dense models with many similar-size coefficients.
For λ = 0, we get θ̂^R_n(0) = θ̂^O_n.
For λ = +∞, we get θ̂^R_n(+∞) = 0.

Ridge regression improves prediction performance for a range
of values λ ∈ (0, λ̄).



Important Slide!

Denote Ên Z = Σ_{i=1}^n Z_i/n the empirical mean, and V̂n Z the empirical variance.

Renormalization
Throughout the lectures (and most often in practice!) we
standardize the variables so that Ên x_k = 0 and V̂n x_k = 1 for each
covariate k (except the constant).
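In numpy this renormalization is a one-liner (the data below are illustrative); scikit-learn's StandardScaler does the same thing.

```python
# Standardize each covariate to empirical mean 0 and variance 1 (constant excluded).
import numpy as np

X = np.random.default_rng(4).normal(loc=3.0, scale=2.0, size=(100, 5))   # illustrative data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0).round(10))   # ~0 for every column
print(X_std.std(axis=0).round(10))    # 1 for every column
```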



Proof
Given our normalizations, we have θ̂^R_n = n θ̂^O_n/(n + λ).

Note also that θ̂^O_n = θ₀ + σ₀ u/√n with u ∼ N(0, 1).

The MSE is

$$
\begin{aligned}
E(y - x\hat\theta^R_n)^2 &= E\big(x\theta_0 + \varepsilon - n x \hat\theta^O_n/(n+\lambda)\big)^2 \\
&= E\left(x\theta_0 + \varepsilon - \frac{n x\theta_0}{\lambda+n} - \frac{\sqrt{n}\,x}{\lambda+n}\,\sigma_0 u\right)^2 \\
&= \left(1 - \frac{n}{n+\lambda}\right)^2 E(x\theta_0)^2 + \sigma_0^2\left(1 + \frac{n}{(\lambda+n)^2}\right).
\end{aligned}
$$

Simple algebra shows that it has a negative derivative at λ = 0:
the increase in the squared bias (the first term) is more than
compensated by the decrease in the variance (the second term).
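To make the “simple algebra” concrete, here is a symbolic check with sympy of the derivative at λ = 0; I use E(xθ₀)² = θ₀², which follows from the normalization Ê x = 0, V̂ x = 1.

```python
# Symbolic check that d MSE / d lambda < 0 at lambda = 0 for the expression above.
import sympy as sp

lam, n, theta0, sigma0 = sp.symbols("lambda n theta0 sigma0", positive=True)

# MSE(lambda) from the slide, with E(x*theta0)^2 = theta0^2 under the normalization
mse = (1 - n / (n + lam)) ** 2 * theta0 ** 2 + sigma0 ** 2 * (1 + n / (lam + n) ** 2)

dmse = sp.diff(mse, lam)
print(sp.simplify(dmse.subs(lam, 0)))   # -> -2*sigma0**2/n**2, which is negative
```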



Choosing the Hyperparameter
In machine learning we always have tuning parameters (like the
bandwidth in nonparametric estimation).
We call them hyperparameters.

Here λ is the only hyperparameter, and the obvious choice (for
prediction purposes!)
is to take the value that minimizes the MSE (NB: sometimes it
is ∞).
A more general answer is cross-validation, based on
sample-splitting:
fix λ
use one random part of the sample (say m observations)
to estimate θ̂^R_n
predict on the other (n − m) observations and compute the
MSE
(often) repeat and average the MSEs
try another λ until you are happy.
Preserving the Structure of the Data

If the sample is not iid, be careful not to destroy the structure
when splitting the sample.
E.g. keep families intact if you suspect clustering.
For panel data, do not separate observations (i, t) and (i, t′).

More generally, keep such blocks intact when randomly splitting
the sample.
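One way to do this in scikit-learn is GroupKFold, which never puts two observations with the same group label (family, individual in a panel, cluster) in different folds; the panel dimensions below are illustrative.

```python
# Keep clusters/panel units intact when splitting the sample into folds.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(5)
n_units, t_periods = 30, 4                            # illustrative panel dimensions
groups = np.repeat(np.arange(n_units), t_periods)     # unit id for each (i, t) row
X = rng.normal(size=(n_units * t_periods, 3))
y = rng.normal(size=n_units * t_periods)

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # no unit appears in both the training folds and the held-out fold
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```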



Very popular: K -fold CV

choose K = 5 or 10 (typically)
split the sample randomly into K folds
for given values of the hyperparameters:
for k = 1, . . . , K:
estimate θ̂_k(λ) on observations from all folds except k
predict on fold k: ŷ_i(λ) = x_i'θ̂_k(λ)
compute the MSE e_k(λ) = Σ_{i∈k} (y_i − ŷ_i(λ))² on fold k
compute e(λ) = Σ_{k=1}^K e_k(λ)
repeat with other values until happy with e(λ).
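A minimal hand-rolled version of this loop for ridge regression; the simulated data and the grid of λ values are illustrative, and in practice sklearn's KFold (used below) or RidgeCV handle the bookkeeping.

```python
# K-fold cross-validation of the ridge penalty lambda, following the recipe above.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(6)
n, p = 300, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

cv_error = {}
for lam in lambdas:
    e = 0.0
    for train_idx, test_idx in kf.split(X):
        # estimate theta_k(lambda) on all folds except k
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        # predict on fold k and accumulate the squared errors e_k(lambda)
        y_hat = model.predict(X[test_idx])
        e += np.sum((y[test_idx] - y_hat) ** 2)
    cv_error[lam] = e

print("lambda chosen by 5-fold CV:", min(cv_error, key=cv_error.get))
```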



Predicting y

If λ∗ is the chosen hyperparameter after K-fold validation,
then for any observation i, retain ŷ_i(λ∗) as the predicted value.

Important remark: the prediction for observation i ∈ k depends
on θ̂_k(λ∗), which was estimated on the other folds;

this is the cross-fitting part,
and it is essential to obtain good statistical properties.
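One convenient way to obtain exactly these cross-fitted predictions is sklearn's cross_val_predict, which returns for every observation the prediction from a model estimated on folds that do not contain it; lam_star below stands in for whatever value the previous step selected.

```python
# Cross-fitted predictions: y_hat_i comes from a model that never saw observation i.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 20))
y = X @ rng.normal(size=20) + rng.normal(size=300)
lam_star = 1.0                      # illustrative stand-in for the tuned penalty

y_hat = cross_val_predict(Ridge(alpha=lam_star), X, y, cv=5)
print(y_hat.shape)                  # one out-of-fold prediction per observation
```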
