Regularization and Feature Selection: Big Data For Economic Applications


Regularization and feature selection
Big Data for Economic Applications

Alessio Farcomeni
University of Rome “Tor Vergata”

[email protected]
Large p

Big data also means, and especially, a large number of variables
Often p > n, which makes several methods infeasible or ill-defined
Even simple models might be grossly over-parameterized and not identifiable
Example: image analysis

An image can be used as a predictor
Each pixel can be summarized, for instance, through the intensity of Red, Green and Blue within it.
A common resolution is 2048x1536, that is, 3145728 pixels (or 3.1 Megapixels), which leads to ∼ 10^7 predictors.
Even a low resolution of 32x32 implies 1024 pixels, hence 1024 × 3 = 3072 predictors.
Sparsity: often pixels will be empty or constant over the entire data set, so they can be discarded
Two main routes

Variable selection
Regularization
Several other approaches, including dimension reduction, are
possible but not discussed in this course.
Information Criteria

A very general framework for comparing alternative models is given by information criteria
These involve a function of the log-likelihood and the number of parameters
General formulation:

−2 l(θ̂) + P · g,

where g is the number of free parameters and P is a constant. Lower is better.
The most popular choices are the Akaike Information Criterion (P = 2) and the Bayesian Information Criterion (P = log(n)).
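A minimal sketch in R, assuming a hypothetical data frame df with outcome y and predictors x1, x2; AIC() and BIC() compute the two criteria for fitted models:

# compare two nested linear models by information criteria (lower is better)
fit1 <- lm(y ~ x1, data = df)
fit2 <- lm(y ~ x1 + x2, data = df)
AIC(fit1); AIC(fit2)   # penalty constant P = 2
BIC(fit1); BIC(fit2)   # penalty constant P = log(n)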
Comparison

Obviously, AIC is less parsimonious than BIC
AIC is optimal from a predictive perspective, BIC is model consistent
AIC better with small n, BIC better with large n
Forward selection

With p predictors, there are 2^p candidate models. Exhaustive search is often impossible
A simple and effective model selection strategy is forward selection based on AIC or BIC (see the sketch after this list)
1 Compute IC_0 for the empty model
2 At stage j: keep the model with j − 1 predictors fixed, fit the p − j + 1 models with one additional predictor each, and compute IC_{j,1}, . . . , IC_{j,p−j+1}
3 If min_u IC_{j,u} < IC_{j−1}, set IC_j = min_u IC_{j,u} and add arg min_u IC_{j,u} to the set of selected predictors
4 Otherwise, stop with the optimal model with j − 1 predictors
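A minimal sketch of forward selection by AIC or BIC using step() from base R, assuming a hypothetical data frame df with outcome y and candidate predictors x1, x2, x3:

# start from the empty model and add one predictor at a time
null_fit <- lm(y ~ 1, data = df)
fwd_aic <- step(null_fit, scope = ~ x1 + x2 + x3, direction = "forward")   # k = 2, i.e. AIC
fwd_bic <- step(null_fit, scope = ~ x1 + x2 + x3, direction = "forward",
                k = log(nrow(df)))                                         # k = log(n), i.e. BIC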
Screening

For ultra-high dimensional problems (say, p/n > 40) an
independence screening is recommended.
At the first step, univariate associations with the outcome
are examined (see next) and only a subset of the variables
is retained.
At the second step, variable selection, or (regularized, see
next) model estimation is performed.
I have the habit of experimenting with data augmentation (e.g.,
polynomial augmentation) between the first and second step.
Sure Independence Screening

Choose γ ∈ (0, 1)
Choose a measure of univariate association between Y and X_j, e.g., the absolute value of the univariate regression coefficient
Select the γn ≪ p variables with the largest values of the measure above
These are then used for model selection or estimation.
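A minimal base-R sketch of the screening step, assuming a hypothetical predictor matrix X and outcome y, and absolute marginal correlation as the association measure:

gamma <- 0.5
d <- floor(gamma * length(y))                      # keep about γn variables
score <- abs(apply(X, 2, function(x) cor(x, y)))   # univariate association with Y
keep <- order(score, decreasing = TRUE)[1:d]       # indices of the retained predictors
X_screened <- X[, keep]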
Why model selection?

Occam’s razor
Bias: expected difference between the true (population-level) target quantity and its estimate
Variance: variability of the estimate under repeated sampling
Mean Squared Error: Bias² + Variance
Bias-Variance trade off

Underparameterized models are biased
Overparameterized models exhibit high variance
Penalization in linear regression

Consider the least squares loss for a linear regression model:

inf_β Σ_{i=1}^n (y_i − β′x_i)²,

where x_i is a vector of p covariates with a leading 1 for the intercept.
Penalization:

inf_β Σ_{i=1}^n (y_i − β′x_i)² + λ P(β),

for some λ > 0 and a penalty function P(·) that is component-wise increasing in the absolute value of β
Penalization rationale

Selecting only the variables in a set of indices I corresponds to fixing β_j = 0 for j ∉ I
This is a discrete process, which cannot be tuned continuously. It exhibits high variance
The result of penalization: shrinkage of β̂ towards zero. This increases bias but reduces the variance of predictions.
Ridge Regression

P(β) = Σ_{j=2}^p β_j² (the intercept is usually not penalized)
Surprisingly enough, there is a closed-form solution:

β̂ = (X′X + λI_p)⁻¹ X′Y

The regularization brought by the λI_p term allows one to have p > n. Good predictive performance.
All β̂_j ≠ 0, but shrinkage implies a simplification of the model, so that the effective degrees of freedom are

df(λ) = tr[X (X′X + λI_p)⁻¹ X′]
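A minimal base-R sketch of the closed-form solution and its effective degrees of freedom, assuming a hypothetical design matrix X (with a leading column of ones, here penalized along with the rest, exactly as in the formula above) and outcome y:

ridge_fit <- function(X, y, lambda) {
  # closed-form ridge coefficients: (X'X + lambda I_p)^{-1} X'y
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
ridge_df <- function(X, lambda) {
  # effective degrees of freedom: tr[X (X'X + lambda I_p)^{-1} X']
  H <- X %*% solve(t(X) %*% X + lambda * diag(ncol(X)), t(X))
  sum(diag(H))
}
beta_hat <- ridge_fit(X, y, lambda = 1)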



LASSO regression

LASSO: Least Absolute Shrinkage and Selection Operator.
P(β) = Σ_{j=2}^p |β_j|
No closed-form solution; moreover, with p > n the LASSO can select at most n variables
But: some β_j are shrunk exactly to zero. Simultaneous shrinkage and selection!
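A minimal sketch using the glmnet package, assuming a hypothetical matrix X and outcome y:

library(glmnet)
fit <- glmnet(X, y, alpha = 1)      # alpha = 1 gives the LASSO penalty
coef(fit, s = 0.1)                  # coefficients at lambda = 0.1: several are exactly zero
plot(fit, xvar = "lambda")          # coefficient paths as lambda varies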
Why is that?

The constraint region of the L1 penalty, {β : Σ_j |β_j| ≤ t}, is a diamond with corners on the coordinate axes, so the constrained optimum often falls exactly on a corner, where some coefficients are exactly zero. The L2 (ridge) constraint region is a ball with no corners, so this does not happen.
To be clear

Optimizing a penalized objective function corresponds to

inf_β Σ_{i=1}^n (y_i − β′x_i)²   subject to   P(β) ≤ t_λ

The second equation gives the constraint region.

Elastic Net

Penalization term: λ_1 Σ_{j=2}^p |β_j| + λ_2 Σ_{j=2}^p β_j²
Advantages: simultaneous shrinkage and selection also with p > n. Bias of the LASSO slightly reduced in general.
Disadvantages: two penalty parameters must be chosen instead of one
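A minimal sketch with glmnet (hypothetical X and y). Note that glmnet uses a single λ plus a mixing weight α, with penalty λ[α Σ|β_j| + (1 − α)/2 Σ β_j²]:

library(glmnet)
enet <- glmnet(X, y, alpha = 0.5)   # alpha strictly between 0 (ridge) and 1 (LASSO)
coef(enet, s = 0.1)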
Background: binary outcomes

Binary outcome Y, predictors X
You might be tempted to use a so-called linear probability model

y_i = α + β′x_i + ε_i

That would be wrong. Can you tell why?
Linear regression with binary outcomes

Predictions would not be 0 or 1, the only admissible values for Y.
Consequently, the interpretation would not make sense.
Model assumptions are not satisfied (Gaussian outcome, homoscedasticity), hence inferential properties do not hold.
The logistic transform

 
logit(p) = log(p / (1 − p))
It is simply a device for mapping the 0-1 interval to R
Its inverse is expit(x) = exp(x) / (1 + exp(x)).
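In R the logit and expit transforms are available as qlogis() and plogis():

qlogis(0.8)           # logit(0.8) = log(0.8 / 0.2)
plogis(qlogis(0.8))   # expit(logit(0.8)) returns 0.8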
An appropriate modeling approach

Binary outcome Y
Bernoulli distribution: Pr(Y = 1) = p
E [Y ] = p
Natural link: the logistic transform.
Logistic regression: logit link

logit(p) = α + β′X
Equivalently,

Pr(Y = 1|X) = expit(α + β′X)

No boundary problems now


MLE

Likelihood:

Π_{i=1}^n expit(α + β′x_i)^{y_i} (1 − expit(α + β′x_i))^{1−y_i}

No closed-form solution for β̂ and α̂. No problem.
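A minimal sketch of why the lack of a closed form is no problem: the log-likelihood can be maximized numerically, e.g. with optim() (hypothetical design matrix X, with a leading column of ones, and 0/1 outcome y):

negloglik <- function(theta, X, y) {
  eta <- X %*% theta
  -sum(y * eta - log(1 + exp(eta)))   # minus the Bernoulli log-likelihood
}
theta_hat <- optim(rep(0, ncol(X)), negloglik, X = X, y = y, method = "BFGS")$par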


Interpretation

expit(α̂) is the probability of success (Y = 1) when all predictors are zero.
Single predictor: OR = exp(β̂) is the odds ratio for a one-unit increase.
Multiple predictors: same interpretation, but with all other predictors held fixed (as usual)
Odds-ratio

OR > 1 (and p < 0.05): Pr(Y = 1|X) increases with X
OR < 1 (and p < 0.05): Pr(Y = 1|X) decreases with X
Roughly, OR-fold per unit (or 1 vs 0 for binary predictors)
Underlying assumption: proportionality of odds (that is, the effect depends only on the difference in X, through a constant difference on the log-odds scale)
Probit link

An alternative to the logit link is the probit link

Φ⁻¹(p),

where Φ(·) is the CDF of a standard Gaussian distribution (so Φ⁻¹ is its quantile function)
Similar results and rationale. Statisticians usually prefer the logit link (since it leads to the OR); econometricians usually prefer the probit link.
Coefficients do not have as clear an interpretation, but the probit link arises from a natural argument: the outcome Y is a thresholded measurement of an underlying Gaussian variable Ỹ.
Logistic regression in R

Use glm(y ~ x1 + x2, family = binomial). Syntax similar to lm, plus the option family = binomial.
Want to use a probit link? Use the option family = binomial(link = 'probit').
There are methods summary, coef, predict(..., type = 'response')
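A minimal sketch on simulated data (all variable names and coefficient values are hypothetical):

set.seed(1)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 - 0.8 * x2))   # simulate a binary outcome
fit <- glm(y ~ x1 + x2, family = binomial)
summary(fit)
exp(coef(fit))                                           # odds ratios
head(predict(fit, type = "response"))                    # fitted probabilities
fit_probit <- glm(y ~ x1 + x2, family = binomial(link = "probit"))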
What about counts?

If Y_i ∈ {0, 1, 2, . . .}, you should similarly avoid using linear regression (even on log(Y))
Poisson regression: Y_i ∼ Poi(λ_i), where Poi(·) indicates a Poisson distribution

log(λ_i) = α + β′x_i

exp(β̂) is the expected fold change in Y_i per unit increase in x_i.
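A minimal sketch (hypothetical count outcome y and predictor x):

fit_pois <- glm(y ~ x, family = poisson)
exp(coef(fit_pois))   # multiplicative (fold) change in the expected count per unit of x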
Regularization in Generalized Linear Models

Regularization principles apply also in GLM (logistic,
Poisson regression) and other scenarios (e.g., duration
models)
The general objective function is

inf_θ −l(θ) + λ P(θ),

where l(θ) is the log-likelihood, for instance.
No closed-form solution is usually available.
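A minimal sketch of a penalized (LASSO) logistic regression with glmnet (hypothetical matrix X and 0/1 outcome y):

library(glmnet)
fit <- glmnet(X, y, family = "binomial", alpha = 1)
coef(fit, s = 0.05)   # coefficients at lambda = 0.05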
Tuning parameter choice

The penalty parameter λ, and similar tuning parameters in other techniques, must be chosen
Most penalized objective functions are monotone in λ, hence useless for choosing it
One could optimize a surrogate objective, e.g., the misclassification error.
Problem: bias, due to the fact that the data are used twice. Once for tuning, once for estimation.
What are the consequences?

The choice is not optimal
The performance assessment is optimistic
Cross-validation

Solution for medium-sized data
Leave-One-Out: for i = 1, . . . , n leave the i-th observation apart, fit the model for fixed λ using the remaining n − 1, and measure performance (e.g., prediction error) on the i-th observation.
Let CV(λ_1), . . . , CV(λ_k) denote the average LOO performance over a grid of values.
Set λ = arg min CV(·)
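A minimal base-R sketch of LOO cross-validation for ridge regression over a grid of λ values (hypothetical X, with a leading intercept column, and outcome y):

lambdas <- 10^seq(-3, 3, length.out = 25)
cv <- sapply(lambdas, function(lambda) {
  errs <- sapply(seq_along(y), function(i) {
    # fit on the n - 1 remaining observations, predict the left-out one
    beta <- solve(t(X[-i, ]) %*% X[-i, ] + lambda * diag(ncol(X)), t(X[-i, ]) %*% y[-i])
    drop((y[i] - X[i, ] %*% beta)^2)
  })
  mean(errs)
})
lambda_opt <- lambdas[which.min(cv)]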
More generally

Leave-k-out cross-validation
k-fold cross-validation: the original sample is randomly
partitioned into k subsamples and one is left out at each
of the k iterations
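A minimal sketch of 10-fold cross-validation for the LASSO with cv.glmnet (hypothetical X and y):

library(glmnet)
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
cvfit$lambda.min                # lambda minimizing the cross-validated error
coef(cvfit, s = "lambda.min")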
Plot twist

When n → ∞, minimizing AIC is equivalent to minimizing leave-one-out CV
BIC is equivalent to leave-k-out with

k = n(1 − 1/(log(n) − 1))


Big data

With big data, iterating model estimation might not be computationally feasible
CV is also optimistic unless k → ∞
Solution: if n is large, split the data into
Training set
Dev set
Test set
Be careful to do this at random (simple or stratified)
Stratification: guarantees that the outcome distribution is similar across the splits
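A minimal sketch of a simple random training/dev/test split (hypothetical data frame df; the 60/20/20 proportions are an assumption):

set.seed(1)
n <- nrow(df)
idx <- sample(seq_len(n))                      # random permutation of the units
train <- df[idx[1:floor(0.6 * n)], ]
dev   <- df[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test  <- df[idx[(floor(0.8 * n) + 1):n], ]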
Training set

Set of units used only for the estimation of a set of candidate models
These may differ, for instance, only in the penalty parameter used, but not necessarily
Development set

Set of units used only for model choice
Performance (e.g., of predictions) on the development set is used to do so
Hybrid role: it is used only indirectly for model estimation
Test set

Only the chosen model is applied to the test set, to estimate model performance
Not used in any respect for model estimation
