
Machine Learning for Economics—Handout of Lecture 1

Bernard Salanié

8 July 2024



Going Beyond the Buzz

Glossary

Machine learning                     Statistics
----------------                     ----------
network, graphs                      model
weights                              parameters
learning                             fitting
generalization                       test set performance
supervised learning                  regression/classification
unsupervised learning                density estimation, clustering
large grant = $1,000,000             large grant = $50,000
nice place to have a meeting:        nice place to have a meeting:
  Snowbird, Utah, French Alps          Las Vegas in August

. . . and the parody



Machine Learning

Outside of economics:
computer science + statistics, driven by user-side applications
(mostly) focused on model selection for prediction, using
large datasets with many covariates
the new, really heavy stuff: Large Language Models, like GPT.



Machine Learning in Econometrics

We want to do model selection and inference so we can study
causal effects.
We use:
the light stuff: regularized linear models, e.g. the Lasso
the heavy stuff: deep learning with neural networks, which came
and went in the 1980s and returned in force in the 2010s
in the middle: random forests, bagging, boosting.

And we adapt these methods so we can use standard
(asymptotic) inference: tests, confidence regions.



Things I won’t cover

1 ML for time series


2 text analysis: Natural Language Processing (see e.g.
Gentzkow-Shapiro-Taddy Eca 2019,
Gentzkow-Kelly-Taddy JEL 2019).
3 (a fortiori) Large Language Models (see Korinek JEL 2023)
4 real Big Data: Velocity, Volume, Variety (in particular:
combining heterogeneous data)



Our typical use case

In microeconometrics, you often want to control for a bunch
of covariates, with limited interest in their coefficients.
Common case: a first-step estimator (propensity score,
generated regressors, some kind of projection).
This is a variant of model selection: how to select a
specification and its coefficients.
Trade-off:
we want to fit the data well (reduce bias)
without overfitting it, which would increase variance in
prediction.

Traditional, and still current, solution: reward parsimony.



Information Criteria
May feel passé but are still useful sometimes.
We start with a set M of candidate models
with more or fewer regressors: p(M) regressors for model M ∈ M.
Akaike’s Information Criterion: in maximum likelihood estimation,
choose a model by

$$
\max_{M \in \mathcal{M}} \left( \max_{\theta} \sum_{i=1}^{n} \log l_i(\theta; M) - p(M) \right).
$$

Bayesian Information Criterion: use

$$
\max_{M \in \mathcal{M}} \left( \max_{\theta} \sum_{i=1}^{n} \log l_i(\theta; M) - p(M) \log(n)/2 \right).
$$

They can be extended to estimation methods other than
maximum likelihood.
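A minimal sketch of how one might implement this criterion for Gaussian linear models with statsmodels; the candidate models, variable names, and the convention that p(M) counts all estimated coefficients are illustrative assumptions, not part of the lecture.

```python
# Sketch: choose among nested OLS models by the penalized log-likelihood above.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(size=n)

# candidate models = subsets of regressors (always including a constant)
candidates = {"x1": [0], "x1,x2": [0, 1], "x1,x2,x3": [0, 1, 2], "all": [0, 1, 2, 3]}

scores = {}
for name, cols in candidates.items():
    X = sm.add_constant(x[:, cols])
    res = sm.OLS(y, X).fit()
    p = X.shape[1]                        # p(M): number of estimated coefficients
    loglik = res.llf                      # maximized log-likelihood
    scores[name] = {"AIC": loglik - p,                      # slide's AIC objective
                    "BIC": loglik - p * np.log(n) / 2}      # slide's BIC objective

print("AIC picks:", max(scores, key=lambda m: scores[m]["AIC"]))
print("BIC picks:", max(scores, key=lambda m: scores[m]["BIC"]))
```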
ICs (ctd)

AIC is better for out-of-sample prediction;
BIC is better for model choice.
They are both impractical when considering a large number of
models and large datasets.
E.g. a common situation:
1 we have dozens of variables
2 which we can interact in myriads of ways
3 to the point that even with a large dataset our pool of
regressors may be larger than the number of observations:
p ≫ n.
4 then we can get a perfect fit! Is this a good idea?



It is not
First of all, if p > n there are many ways of fitting the data
perfectly → many “perfect” estimators in-sample.
Suppose you are given another set of covariates x^o
(out-of-sample) and you are asked to predict y^o.
Which estimator should you choose? They will all give different
values for y^o.

Moreover, fitting the data perfectly means that unusual
observations/outliers will unduly influence the value of the
estimators.
This is our familiar bias-variance tradeoff: it may be worth
tolerating some bias in-sample to get less variance,
and generate better predictions out-of-sample,
using cross-validation to optimize the method.

All of machine learning relies on this idea.
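A small simulated illustration of the last two slides; the sample size, number of regressors, and coefficient values are made up for the example.

```python
# With p > n, least squares fits the training sample exactly,
# yet predicts poorly out-of-sample. Numbers are purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 80                                          # more regressors than observations
theta0 = np.zeros(p)
theta0[:3] = [2.0, -1.0, 0.5]                          # only 3 "real" coefficients

X = rng.normal(size=(n, p))
y = X @ theta0 + rng.normal(size=n)

# minimum-norm least-squares solution: one of the many "perfect" fits
theta_hat = np.linalg.pinv(X) @ y
print("in-sample MSE     :", np.mean((y - X @ theta_hat) ** 2))    # essentially 0

# fresh data from the same design
X_new = rng.normal(size=(n, p))
y_new = X_new @ theta0 + rng.normal(size=n)
print("out-of-sample MSE :", np.mean((y_new - X_new @ theta_hat) ** 2))  # much larger
```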


Shrinkage
Take a classical linear model. OLS is BLUE (best among linear
unbiased estimators).
What of biased estimators? Could one have a nonzero bias but
a smaller MSE than OLS?

MSE = Bias² + Variance.

Yes indeed!

OLS attempts to fit your sample as well as possible, even
though it may contain some unusual observations.
The idea is to “shrink” the influence of unusual observations,
by pulling θ̂n towards zero.
The resulting estimator has a higher bias but a lower variance;
this is especially useful when predicting y out-of-sample.
What of Tests and Confidence Intervals?

For now, we are only talking about improving prediction.
Model selection is a bit like pre-testing, or a first step in a
two-step method;
we know that these things affect the properties of second-step
estimators

→ we move on—more about testing later.



Penalizing large coefficients

= ridge regression, Lasso and variants, Elastic Net, . . .


Remember ℓ_m norms:

$$
\|\theta\|_m = \Big(\sum_{k=1}^{p} |\theta_k|^m\Big)^{1/m}.
$$

In the limit as m → 0, we get the ℓ₀ “norm” ‖θ‖₀ = the number of
nonzero coefficients, which is the penalty p(M) used with the AIC
and BIC.

We could try convex versions:

$$
\ell_2\text{-penalty: } \|\theta\|_2^2 = \sum_{k=1}^{p} \theta_k^2,
\qquad
\ell_1\text{-penalty: } \|\theta\|_1 = \sum_{k=1}^{p} |\theta_k|.
$$

They give us respectively ridge regression and the
Least Absolute Shrinkage and Selection Operator (Lasso).
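A brief sketch contrasting the two penalties with scikit-learn; the simulated data, the sparse coefficient vector, and the penalty levels are illustrative assumptions.

```python
# Ridge (l2) vs Lasso (l1) on simulated data with a sparse truth.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
theta0 = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))     # sparse truth
y = X @ theta0 + rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(X, y)      # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 2))    # all nonzero, shrunk
print("lasso coefficients:", np.round(lasso.coef_, 2))    # typically several exactly zero
```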



Ridge Regression
Consider y_i = x_i'θ₀ + ε_i, with E(ε|x) = 0 and V(ε|x) = σ₀².
The OLS estimator θ̂^O_n minimizes Σ_{i=1}^n (y_i − x_i'θ)².

The ridge regression method minimizes

$$
\sum_{i=1}^{n} (y_i - x_i'\theta)^2 + \lambda \sum_{k=1}^{p} \theta_k^2
$$

and yields

$$
(X'X + \lambda I_p)\,\hat\theta^R_n = X'y.
$$

E.g. with only one covariate:

$$
\hat\theta^R_n = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \lambda}
             = \frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} x_i^2 + \lambda}\,\hat\theta^O_n,
$$

which does shrink θ̂^O_n towards 0.
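A minimal numerical check of the closed form above; the simulated data and the comparison with scikit-learn's Ridge (with fit_intercept=False, so the objectives match) are my own illustration.

```python
# Check that solving (X'X + lambda I_p) theta = X'y reproduces sklearn's Ridge.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n, p, lam = 100, 5, 2.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
theta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(theta_closed, theta_sklearn))   # True: same estimator
```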
What is Good about Ridge Regression?

It shrinks large coefficients (proportionally) more than small
ones, good for dense models with many similar-size coefficients.
For λ = 0, we get θ̂^R_n(0) = θ̂^O_n.
For λ = +∞, we get θ̂^R_n(+∞) = 0.

Ridge regression improves prediction performance for a range
of values λ ∈ (0, λ̄).



Important Slide!

Denote Ên Z = Σ_{i=1}^n Z_i/n the empirical mean, and V̂n Z the empirical variance.

Renormalization
Throughout the lectures (and most often in practice!) we
standardize the variables so that Ên x_k = 0 and V̂n x_k = 1 for each
covariate k (except the constant).
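In numpy this renormalization is a one-liner (the data below are illustrative); scikit-learn's StandardScaler does the same thing.

```python
# Standardize each covariate to empirical mean 0 and variance 1 (constant excluded).
import numpy as np

X = np.random.default_rng(4).normal(loc=3.0, scale=2.0, size=(100, 5))   # illustrative data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0).round(10))   # ~0 for every column
print(X_std.std(axis=0).round(10))    # 1 for every column
```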



Proof
Given our normalizations, we have θ̂^R_n = n θ̂^O_n/(n + λ).

Note also that θ̂^O_n = θ₀ + σ₀ u/√n with u ∼ N(0, 1).

The MSE is

$$
\begin{aligned}
E(y - x\hat\theta^R_n)^2 &= E\big(x\theta_0 + \varepsilon - n x \hat\theta^O_n/(n+\lambda)\big)^2 \\
&= E\left(x\theta_0 + \varepsilon - \frac{n x\theta_0}{\lambda+n} - \frac{\sqrt{n}\,x}{\lambda+n}\,\sigma_0 u\right)^2 \\
&= \left(1 - \frac{n}{n+\lambda}\right)^2 E(x\theta_0)^2 + \sigma_0^2\left(1 + \frac{n}{(\lambda+n)^2}\right).
\end{aligned}
$$

Simple algebra shows that it has a negative derivative at λ = 0:
the increase in the squared bias (the first term) is more than
compensated by the decrease in the variance (the second term).
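To make the “simple algebra” concrete, here is a symbolic check with sympy of the derivative at λ = 0; I use E(xθ₀)² = θ₀², which follows from the normalization Ê x = 0, V̂ x = 1.

```python
# Symbolic check that d MSE / d lambda < 0 at lambda = 0 for the expression above.
import sympy as sp

lam, n, theta0, sigma0 = sp.symbols("lambda n theta0 sigma0", positive=True)

# MSE(lambda) from the slide, with E(x*theta0)^2 = theta0^2 under the normalization
mse = (1 - n / (n + lam)) ** 2 * theta0 ** 2 + sigma0 ** 2 * (1 + n / (lam + n) ** 2)

dmse = sp.diff(mse, lam)
print(sp.simplify(dmse.subs(lam, 0)))   # -> -2*sigma0**2/n**2, which is negative
```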



Choosing the Hyperparameter
In machine learning we always have tuning parameters (like the
bandwidth in nonparametric estimation).
We call them hyperparameters.

Here λ is the only hyperparameter, and the obvious choice (for
prediction purposes!)
is to take the value that minimizes the MSE (NB: sometimes it
is ∞).
A more general answer is cross-validation, based on
sample-splitting:
fix λ
use one random part of the sample (say m observations)
to estimate θ̂^R_n
predict on the other (n − m) observations and compute the
MSE
(often) repeat and average the MSEs
try another λ until you are happy.
Preserving the Structure of the Data

If the sample is not iid, be careful not to destroy the structure
when splitting the sample.
E.g. keep families intact if you suspect clustering.
For panel data, do not separate observations (i, t) and (i, t′).

More generally, keep such blocks intact when randomly splitting
the sample.
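One way to do this in scikit-learn is GroupKFold, which never puts two observations with the same group label (family, individual in a panel, cluster) in different folds; the panel dimensions below are illustrative.

```python
# Keep clusters/panel units intact when splitting the sample into folds.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(5)
n_units, t_periods = 30, 4                            # illustrative panel dimensions
groups = np.repeat(np.arange(n_units), t_periods)     # unit id for each (i, t) row
X = rng.normal(size=(n_units * t_periods, 3))
y = rng.normal(size=n_units * t_periods)

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # no unit appears in both the training folds and the held-out fold
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```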



Very popular: K -fold CV

choose K = 5 or 10 (typically)
split the sample randomly into K folds
for given values of the hyperparameters:
for k = 1, . . . , K:
estimate θ̂_k(λ) on observations from all folds except k
predict on fold k: ŷ_i(λ) = x_i'θ̂_k(λ)
compute the MSE e_k(λ) = Σ_{i∈k} (y_i − ŷ_i(λ))² on fold k
compute e(λ) = Σ_{k=1}^K e_k(λ)
repeat with other values until happy with e(λ).
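A minimal hand-rolled version of this loop for ridge regression; the simulated data and the grid of λ values are illustrative, and in practice sklearn's KFold (used below) or RidgeCV handle the bookkeeping.

```python
# K-fold cross-validation of the ridge penalty lambda, following the recipe above.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(6)
n, p = 300, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

cv_error = {}
for lam in lambdas:
    e = 0.0
    for train_idx, test_idx in kf.split(X):
        # estimate theta_k(lambda) on all folds except k
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        # predict on fold k and accumulate the squared errors e_k(lambda)
        y_hat = model.predict(X[test_idx])
        e += np.sum((y[test_idx] - y_hat) ** 2)
    cv_error[lam] = e

print("lambda chosen by 5-fold CV:", min(cv_error, key=cv_error.get))
```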



Predicting y

If λ∗ is the chosen hyperparameter after K-fold validation,
then for any observation i, retain ŷ_i(λ∗) as the predicted value.

Important remark: the prediction for observation i ∈ k depends
on θ̂_k(λ∗), which was estimated on the other folds;

this is the cross-fitting part,
and it is essential to obtain good statistical properties.
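One convenient way to obtain exactly these cross-fitted predictions is sklearn's cross_val_predict, which returns for every observation the prediction from a model estimated on folds that do not contain it; lam_star below stands in for whatever value the previous step selected.

```python
# Cross-fitted predictions: y_hat_i comes from a model that never saw observation i.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 20))
y = X @ rng.normal(size=20) + rng.normal(size=300)
lam_star = 1.0                      # illustrative stand-in for the tuned penalty

y_hat = cross_val_predict(Ridge(alpha=lam_star), X, y, cv=5)
print(y_hat.shape)                  # one out-of-fold prediction per observation
```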
