Regularization and Feature Selection: Big Data For Economic Applications


Regularization and feature selection
Big Data for Economic Applications

Alessio Farcomeni
University of Rome “Tor Vergata”

[email protected]
Large p

Big data also means, and especially, a large number of variables
Often p > n, which makes several methods infeasible or ill-defined
Even simple models might be grossly over-parameterized and not identifiable
Example: image analysis

An image can be used as a predictor
Each pixel can be summarized, for instance, through the intensity of Red, Green and Blue within it.
A common resolution is 2048x1536, that is, 3145728 pixels (or 3.1 Megapixels), which leads to ∼ 10^7 predictors.
Even a low resolution of 32x32 implies 1024 pixels, hence 1024 × 3 = 3072 predictors.
Sparsity: often pixels will be empty or constant over the entire data set, so they can be discarded
Two main routes

Variable selection
Regularization
Several other approaches, including dimension reduction, are
possible but not discussed in this course.
Information Criteria

A very general framework for comparing alternative models is given by information criteria
These involve a function of the log-likelihood and the number of parameters
General formulation:

−2 l(θ̂) + P · g,

where g is the number of free parameters and P is a constant. Lower is better.
The most popular choices are the Akaike Information Criterion (P = 2) and the Bayesian Information Criterion (P = log(n)).
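A minimal sketch in R, assuming a hypothetical data frame df with outcome y and predictors x1, x2; AIC() and BIC() compute the two criteria for fitted models:

# compare two nested linear models by information criteria (lower is better)
fit1 <- lm(y ~ x1, data = df)
fit2 <- lm(y ~ x1 + x2, data = df)
AIC(fit1); AIC(fit2)   # penalty constant P = 2
BIC(fit1); BIC(fit2)   # penalty constant P = log(n)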
Comparison

Obviously, AIC is less parsimonious than BIC
AIC is optimal from a predictive perspective, BIC is model consistent
AIC better with small n, BIC better with large n
Forward selection

With p predictors, there are 2^p candidate models. Exhaustive search is often impossible
A simple and effective model selection strategy is forward selection based on AIC or BIC (see the sketch after this list)
1 Compute IC_0 for the empty model
2 At stage j: keep the model with j − 1 predictors fixed, fit the p − j + 1 models with one additional predictor each, and compute IC_{j,1}, . . . , IC_{j,p−j+1}
3 If min_u IC_{j,u} < IC_{j−1}, set IC_j = min_u IC_{j,u} and add arg min_u IC_{j,u} to the set of selected predictors
4 Otherwise, stop with the optimal model with j − 1 predictors
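A minimal sketch of forward selection by AIC or BIC using step() from base R, assuming a hypothetical data frame df with outcome y and candidate predictors x1, x2, x3:

# start from the empty model and add one predictor at a time
null_fit <- lm(y ~ 1, data = df)
fwd_aic <- step(null_fit, scope = ~ x1 + x2 + x3, direction = "forward")   # k = 2, i.e. AIC
fwd_bic <- step(null_fit, scope = ~ x1 + x2 + x3, direction = "forward",
                k = log(nrow(df)))                                         # k = log(n), i.e. BIC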
Screening

For ultra-high dimensional problems (say, p/n > 40) an
independence screening is recommended.
At the first step, univariate associations with the outcome
are examined (see next) and only a subset of the variables
is retained.
At the second step, variable selection, or (regularized, see
next) model estimation is performed.
I have the habit of experimenting with data augmentation (e.g.,
polynomial augmentation) between the first and second step.
Sure Independence Screening

Choose γ ∈ (0, 1)
Choose a measure of univariate association between Y and X_j, e.g., the absolute value of the univariate regression coefficient
Select the γn ≪ p variables with the largest values of the measure above
These are then used for model selection or estimation.
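A minimal base-R sketch of the screening step, assuming a hypothetical predictor matrix X and outcome y, and absolute marginal correlation as the association measure:

gamma <- 0.5
d <- floor(gamma * length(y))                      # keep about γn variables
score <- abs(apply(X, 2, function(x) cor(x, y)))   # univariate association with Y
keep <- order(score, decreasing = TRUE)[1:d]       # indices of the retained predictors
X_screened <- X[, keep]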
Why model selection?

Occam’s razor
Bias: expected difference between the true (population-level) target quantity and its estimate
Variance: variability of the estimate under repeated sampling
Mean Squared Error: Bias² + Variance
Bias-Variance trade off

Underparameterized models are biased
Overparameterized models exhibit high variance
Penalization in linear regression

Consider the least squares loss for a linear regression model:

inf_β Σ_{i=1}^n (y_i − β′x_i)²,

where x_i is a vector of p covariates with a leading 1 for the intercept.
Penalization:

inf_β Σ_{i=1}^n (y_i − β′x_i)² + λ P(β),

for some λ > 0 and a penalty function P(·) that is component-wise increasing in the absolute value of β
Penalization rationale

Selecting only the variables in a set of indices I corresponds to fixing β_j = 0 for j ∉ I
This is a discrete process, which cannot be tuned continuously. It exhibits high variance
The result of penalization: shrinkage of β̂ towards zero. This increases bias but reduces the variance of predictions.
Ridge Regression

P(β) = Σ_{j=2}^p β_j² (the intercept is usually not penalized)
Surprisingly enough, there is a closed-form solution:

β̂ = (X′X + λI_p)⁻¹ X′Y

The regularization brought by the λI_p term allows one to have p > n. Good predictive performance.
All β̂_j ≠ 0, but shrinkage implies a simplification of the model, so that the effective degrees of freedom are

df(λ) = tr[X (X′X + λI_p)⁻¹ X′]
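A minimal base-R sketch of the closed-form solution and its effective degrees of freedom, assuming a hypothetical design matrix X (with a leading column of ones, here penalized along with the rest, exactly as in the formula above) and outcome y:

ridge_fit <- function(X, y, lambda) {
  # closed-form ridge coefficients: (X'X + lambda I_p)^{-1} X'y
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
ridge_df <- function(X, lambda) {
  # effective degrees of freedom: tr[X (X'X + lambda I_p)^{-1} X']
  H <- X %*% solve(t(X) %*% X + lambda * diag(ncol(X)), t(X))
  sum(diag(H))
}
beta_hat <- ridge_fit(X, y, lambda = 1)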



LASSO regression

LASSO: Least Absolute Shrinkage and Selection Operator.
P(β) = Σ_{j=2}^p |β_j|
No closed-form solution; moreover, with p > n the LASSO can select at most n variables
But: some β_j are shrunk exactly to zero. Simultaneous shrinkage and selection!
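A minimal sketch using the glmnet package, assuming a hypothetical matrix X and outcome y:

library(glmnet)
fit <- glmnet(X, y, alpha = 1)      # alpha = 1 gives the LASSO penalty
coef(fit, s = 0.1)                  # coefficients at lambda = 0.1: several are exactly zero
plot(fit, xvar = "lambda")          # coefficient paths as lambda varies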
Why is that?

The constraint region of the L1 penalty, {β : Σ_j |β_j| ≤ t}, is a diamond with corners on the coordinate axes, so the constrained optimum often falls exactly on a corner, where some coefficients are exactly zero. The L2 (ridge) constraint region is a ball with no corners, so this does not happen.
To be clear

Optimizing a penalized objective function corresponds to

inf_β Σ_{i=1}^n (y_i − β′x_i)²   subject to   P(β) ≤ t_λ

The second equation gives the constraint region.

Elastic Net

Penalization term: λ_1 Σ_{j=2}^p |β_j| + λ_2 Σ_{j=2}^p β_j²
Advantages: simultaneous shrinkage and selection also with p > n. Bias of the LASSO slightly reduced in general.
Disadvantages: two penalty parameters must be chosen instead of one
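A minimal sketch with glmnet (hypothetical X and y). Note that glmnet uses a single λ plus a mixing weight α, with penalty λ[α Σ|β_j| + (1 − α)/2 Σ β_j²]:

library(glmnet)
enet <- glmnet(X, y, alpha = 0.5)   # alpha strictly between 0 (ridge) and 1 (LASSO)
coef(enet, s = 0.1)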
Background: binary outcomes

Binary outcome Y, predictors X
You might be tempted to use a so-called linear probability model

y_i = α + β′x_i + ε_i

That would be wrong. Can you tell why?
Linear regression with binary outcomes

Predictions would not be 0 or 1, the only admissible values for Y.
Consequently, the interpretation would not make sense.
Model assumptions are not satisfied (Gaussian outcome, homoscedasticity), hence inferential properties do not hold.
The logistic transform

 
logit(p) = log(p / (1 − p))
It is simply a device for mapping the 0-1 interval to R
Its inverse is expit(x) = exp(x) / (1 + exp(x)).
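In R the logit and expit transforms are available as qlogis() and plogis():

qlogis(0.8)           # logit(0.8) = log(0.8 / 0.2)
plogis(qlogis(0.8))   # expit(logit(0.8)) returns 0.8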
An appropriate modeling approach

Binary outcome Y
Bernoulli distribution: Pr(Y = 1) = p
E [Y ] = p
Natural link: the logistic transform.
Logistic regression: logit link

logit(p) = α + β′X
Equivalently,

Pr(Y = 1|X) = expit(α + β′X)

No boundary problems now


MLE

Likelihood:

Π_{i=1}^n expit(α + β′x_i)^{y_i} (1 − expit(α + β′x_i))^{1−y_i}

No closed-form solution for β̂ and α̂. No problem.
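A minimal sketch of why the lack of a closed form is no problem: the log-likelihood can be maximized numerically, e.g. with optim() (hypothetical design matrix X, with a leading column of ones, and 0/1 outcome y):

negloglik <- function(theta, X, y) {
  eta <- X %*% theta
  -sum(y * eta - log(1 + exp(eta)))   # minus the Bernoulli log-likelihood
}
theta_hat <- optim(rep(0, ncol(X)), negloglik, X = X, y = y, method = "BFGS")$par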


Interpretation

expit(α̂) is the probability of success (Y = 1) when all predictors are zero.
Single predictor: OR = exp(β̂) is the odds ratio for a one-unit increase.
Multiple predictors: same interpretation, but with all other predictors held fixed (as usual)
Odds-ratio

OR > 1 (and p < 0.05): Pr(Y = 1|X) increases with X
OR < 1 (and p < 0.05): Pr(Y = 1|X) decreases with X
Roughly, OR-fold per unit (or 1 vs 0 for binary predictors)
Underlying assumption: proportionality of odds (that is, the effect depends only on the difference in X, through a constant difference on the log-odds scale)
Probit link

An alternative to the logit link is the probit link

Φ⁻¹(p),

where Φ(·) is the CDF of a standard Gaussian distribution (so Φ⁻¹ is its quantile function)
Similar results and rationale. Statisticians usually prefer the logit link (since it leads to the OR); econometricians usually prefer the probit link.
Coefficients do not have as clear an interpretation, but the probit link arises from a natural argument: the outcome Y is a thresholded measurement of an underlying Gaussian variable Ỹ.
Logistic regression in R

Use glm(y ~ x1 + x2, family = binomial). Syntax similar to lm, plus the option family = binomial.
Want to use a probit link? Use the option family = binomial(link = 'probit').
There are methods summary, coef, predict(..., type = 'response')
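A minimal sketch on simulated data (all variable names and coefficient values are hypothetical):

set.seed(1)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x1 - 0.8 * x2))   # simulate a binary outcome
fit <- glm(y ~ x1 + x2, family = binomial)
summary(fit)
exp(coef(fit))                                           # odds ratios
head(predict(fit, type = "response"))                    # fitted probabilities
fit_probit <- glm(y ~ x1 + x2, family = binomial(link = "probit"))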
What about counts?

If Y_i ∈ {0, 1, 2, . . .}, you should similarly avoid using linear regression (even on log(Y))
Poisson regression: Y_i ∼ Poi(λ_i), where Poi(·) indicates a Poisson distribution

log(λ_i) = α + β′x_i

exp(β̂) is the expected fold change in Y_i per unit increase in x_i.
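A minimal sketch (hypothetical count outcome y and predictor x):

fit_pois <- glm(y ~ x, family = poisson)
exp(coef(fit_pois))   # multiplicative (fold) change in the expected count per unit of x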
Regularization in Generalized Linear Models

Regularization principles apply also in GLM (logistic,
Poisson regression) and other scenarios (e.g., duration
models)
The general objective function is

inf_θ −l(θ) + λ P(θ),

where l(θ) is the log-likelihood, for instance.
No closed-form solution is usually available.
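A minimal sketch of a penalized (LASSO) logistic regression with glmnet (hypothetical matrix X and 0/1 outcome y):

library(glmnet)
fit <- glmnet(X, y, family = "binomial", alpha = 1)
coef(fit, s = 0.05)   # coefficients at lambda = 0.05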
Tuning parameter choice

The penalty parameter λ, and similar tuning parameters in other techniques, must be chosen
Most penalized objective functions are monotone in λ, hence useless for choosing it
One could optimize a surrogate objective, e.g., the misclassification error.
Problem: bias, due to the fact that the data are used twice. Once for tuning, once for estimation.
What are the consequences?

The choice is not optimal
The performance assessment is optimistic
Cross-validation

Solution for medium-sized data
Leave-One-Out: for i = 1, . . . , n leave the i-th observation apart, fit the model for fixed λ using the remaining n − 1, and measure performance (e.g., prediction error) on the i-th observation.
Let CV(λ_1), . . . , CV(λ_k) denote the average LOO performance over a grid of values.
Set λ = arg min CV(·)
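A minimal base-R sketch of LOO cross-validation for ridge regression over a grid of λ values (hypothetical X, with a leading intercept column, and outcome y):

lambdas <- 10^seq(-3, 3, length.out = 25)
cv <- sapply(lambdas, function(lambda) {
  errs <- sapply(seq_along(y), function(i) {
    # fit on the n - 1 remaining observations, predict the left-out one
    beta <- solve(t(X[-i, ]) %*% X[-i, ] + lambda * diag(ncol(X)), t(X[-i, ]) %*% y[-i])
    drop((y[i] - X[i, ] %*% beta)^2)
  })
  mean(errs)
})
lambda_opt <- lambdas[which.min(cv)]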
More generally

Leave-k-out cross-validation
k-fold cross-validation: the original sample is randomly
partitioned into k subsamples and one is left out at each
of the k iterations
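A minimal sketch of 10-fold cross-validation for the LASSO with cv.glmnet (hypothetical X and y):

library(glmnet)
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
cvfit$lambda.min                # lambda minimizing the cross-validated error
coef(cvfit, s = "lambda.min")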
Plot twist

When n → ∞, minimizing AIC is equivalent to minimizing leave-one-out CV
BIC is equivalent to leave-k-out with

k = n(1 − 1/(log(n) − 1))


Big data

With big data, iterating model estimation might not be computationally feasible
CV is also optimistic unless k → ∞
Solution: if n is large, split the data into
Training set
Dev set
Test set
Be careful to do this at random (simple or stratified)
Stratification: guarantees that the outcome distribution is similar across the splits
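A minimal sketch of a simple random training/dev/test split (hypothetical data frame df; the 60/20/20 proportions are an assumption):

set.seed(1)
n <- nrow(df)
idx <- sample(seq_len(n))                      # random permutation of the units
train <- df[idx[1:floor(0.6 * n)], ]
dev   <- df[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test  <- df[idx[(floor(0.8 * n) + 1):n], ]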
Training set

Set of units used only for the estimation of a set of candidate models
These may differ, for instance, only in the penalty parameter used, but not necessarily
Development set

Set of units used only for model choice
Performance (e.g., of predictions) on the development set is used to do so
Hybrid role: it is used only indirectly for model estimation
Test set

Only the chosen model is applied to the test set, to estimate model performance
Not used in any respect for model estimation
