Regularization and feature selection
Big Data for Economic Applications
Alessio Farcomeni
University of Rome “Tor Vergata”
[email protected]
Large p
An image can be used as a predictor
Each pixel can be summarized, for instance, through its Red, Green and Blue intensities.
A common resolution is 2048×1536, that is, 3,145,728 pixels (about 3.1 Megapixels), which leads to ∼10⁷ predictors.
Even a low 32×32 resolution implies 1024 pixels, hence 1024 × 3 = 3072 predictors.
Sparsity: often pixels will be empty or constant over the entire data set, so they can be discarded.
Two main routes
Variable selection
Regularization
Several other approaches, including dimension reduction, are
possible but not discussed in this course.
Information Criteria
IC = −2ℓ(θ̂) + P · g,
where g is the number of estimated parameters and P is a penalty constant: P = 2 gives the AIC, P = log(n) the BIC.
With p predictors, there are 2^p candidate models.
Exhaustive search is often impossible.
A simple and effective strategy is forward selection based on AIC or BIC:
1. Estimate IC_0 for the empty model.
2. At stage j: keep the model with the j − 1 selected predictors fixed, and compute the criterion for each model obtained by adding one of the remaining predictors, yielding IC_{j,1}, …, IC_{j,p−j+1}.
3. If min_u IC_{j,u} < IC_{j−1}, set IC_j = min_u IC_{j,u} and add arg min_u IC_{j,u} to the set of selected predictors.
4. Otherwise, stop: the model with j − 1 predictors is the one selected.
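A minimal sketch of this loop, assuming a numeric predictor matrix X and outcome vector y, and using the BIC of a Gaussian linear model as criterion; the function names are illustrative rather than taken from any package.

```python
import numpy as np

def bic_linear(X, y):
    """BIC of a Gaussian linear model fit by least squares (up to an additive constant)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + np.log(n) * k   # -2*loglik + log(n)*g, constants dropped

def forward_selection(X, y):
    """Greedy forward selection: add predictors while the BIC keeps decreasing."""
    n, p = X.shape
    selected = []                        # indices of the chosen predictors
    intercept = np.ones((n, 1))
    best_ic = bic_linear(intercept, y)   # IC_0: intercept-only (empty) model
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        ics = [bic_linear(np.hstack([intercept, X[:, selected + [j]]]), y)
               for j in candidates]
        if min(ics) < best_ic:           # step 3: accept the best addition
            best_ic = min(ics)
            selected.append(candidates[int(np.argmin(ics))])
        else:                            # step 4: no improvement, stop
            break
    return selected, best_ic
```

Calling forward_selection(X, y) returns the indices of the chosen predictors and the BIC of the final model; replacing np.log(n) * k with 2 * k in bic_linear switches the criterion to AIC.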
Screening
For ultra-high dimensional problems (say, p/n > 40) an
independence screening is recommended.
At the first step, univariate associations with the outcome
are examined (see next) and only a subset of the variables
is retained.
At the second step, variable selection, or (regularized, see
next) model estimation is performed.
I have the habit of experimenting with data augmentation (e.g.,
polynomial augmentation) between the first and second step.
Sure Independence Screening
Choose γ ∈ (0, 1).
Choose a measure of univariate association between Y and X_j, e.g., the absolute value of the univariate regression coefficient.
Select the ⌊γn⌋ ≪ p variables with the largest values of this measure.
These are then used for model selection or estimation.
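A sketch of the screening step, under the assumption that the association measure is the absolute correlation between each standardized predictor and Y (equivalently, the absolute univariate regression coefficient on standardized data); the function name and the default γ = 0.5 are illustrative.

```python
import numpy as np

def sis(X, y, gamma=0.5):
    """Sure Independence Screening: keep the floor(gamma*n) predictors most
    associated with y (here, by absolute correlation). Assumes constant
    columns have already been removed."""
    n, p = X.shape
    d = max(1, int(np.floor(gamma * n)))        # number of variables to retain
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the columns
    yc = (y - y.mean()) / y.std()
    score = np.abs(Xc.T @ yc) / n               # |univariate correlation| per column
    keep = np.argsort(score)[::-1][:d]          # indices of the d largest scores
    return np.sort(keep)
```

The retained columns, X[:, sis(X, y)], can then be passed on to model selection or to a regularized fit.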
Why model selection?
Occam’s razor.
Bias: the expected difference between the true (population-level) target quantity and its estimate.
Variance: the variability of the estimate under repeated sampling.
Mean Squared Error: Bias² + Variance.
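A small simulation, purely illustrative, showing how bias, variance and MSE can be estimated by repeated sampling for two estimators of a population mean (the sample mean and a shrunken version of it): a biased estimator with smaller variance can have a smaller MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 2.0, 30, 10_000

plain, shrunk = [], []
for _ in range(reps):
    x = rng.normal(mu, 1.0, size=n)
    plain.append(x.mean())           # unbiased estimator
    shrunk.append(0.8 * x.mean())    # shrunken estimator: biased, lower variance

for name, est in [("mean", np.array(plain)), ("0.8*mean", np.array(shrunk))]:
    bias = est.mean() - mu
    var = est.var()
    print(f"{name}: bias^2={bias**2:.4f}  var={var:.4f}  MSE={bias**2 + var:.4f}")
```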
Bias-variance trade-off
Consider the least squares loss for a linear regression model:

inf_β Σ_{i=1}^n (y_i − β′x_i)²
Ridge regression adds the penalty λ P(β) to this loss, with

P(β) = Σ_{j=2}^p β_j²

(the intercept is usually not penalized).
Surprisingly enough, there is a closed-form solution:

β̂ = (X′X + λI_p)⁻¹ X′Y
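A direct translation of the closed-form solution, assuming a numeric design matrix X; the helper name is illustrative, and penalizing everything but the intercept only requires zeroing one diagonal entry of the penalty matrix, as noted in the comment.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimate (X'X + lambda*I)^(-1) X'y."""
    p = X.shape[1]
    penalty = lam * np.eye(p)
    # penalty[0, 0] = 0.0   # uncomment to leave an intercept in the first column unpenalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)
```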
The LASSO uses the absolute-value penalty P(β) = Σ_{j=2}^p |β_j| instead. There is no closed-form solution, but the penalty performs shrinkage and variable selection at the same time, since some coefficients are set exactly to zero.
To be clear
The penalized problem can equivalently be written as a constrained one: minimize the loss

subject to P(β) ≤ t_λ,

where the bound t_λ is in one-to-one correspondence with the penalty parameter λ.
Elastic net
Penalization term: λ₁ Σ_{j=2}^p |β_j| + λ₂ Σ_{j=2}^p β_j²
Advantages: simultaneous shrinkage and selection, also with p > n; the bias of the LASSO is slightly reduced in general.
Disadvantages: two penalty parameters must be chosen instead of one.
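In practice these estimators are rarely coded by hand; a sketch with scikit-learn's ElasticNet, which parametrizes the penalty through alpha (overall strength) and l1_ratio (the L1/L2 mix) rather than through λ₁ and λ₂ directly, and which leaves the intercept unpenalized. The simulated data are only for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]        # sparse true signal
y = X @ beta + rng.normal(scale=0.5, size=n)

# alpha controls the overall penalty strength, l1_ratio the L1/L2 mix
fit = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("non-zero coefficients:", np.flatnonzero(fit.coef_))
```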
Background: binary outcomes
logit(p) = log( p / (1 − p) )
It is simply a device for mapping the (0, 1) interval to ℝ.
Its inverse is expit(x) = exp(x) / (1 + exp(x)).
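A two-line check, using scipy's logit and expit, that the two transforms are inverses of each other and map (0, 1) onto the whole real line and back.

```python
import numpy as np
from scipy.special import logit, expit

p = np.array([0.1, 0.5, 0.9])
z = logit(p)          # maps (0, 1) to the whole real line
print(z)              # approximately [-2.197, 0, 2.197]
print(expit(z))       # back to [0.1, 0.5, 0.9]
```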
An appropriate modeling approach
Binary outcome Y
Bernoulli distribution: Pr(Y = 1) = p
E[Y] = p
Natural link: the logistic transform.
Logistic regression: logit link
logit(p) = α + β′X
Equivalently, p = expit(α + β′X).
Likelihood:

∏_{i=1}^n expit(α + β′x_i)^{y_i} (1 − expit(α + β′x_i))^{1−y_i}
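A sketch of the corresponding log-likelihood, written with expit for convenience; alpha and beta are assumed to be given (for instance, current values inside an optimizer), and the function name is illustrative.

```python
import numpy as np
from scipy.special import expit

def logistic_loglik(alpha, beta, X, y):
    """Log of the Bernoulli likelihood prod p_i^y_i (1 - p_i)^(1 - y_i),
    with p_i = expit(alpha + beta'x_i)."""
    p = expit(alpha + X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Maximizing this function (or a penalized version of it, see below) over alpha and beta yields the logistic regression estimates.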
An alternative link is the probit, Φ⁻¹(p), where Φ denotes the standard Gaussian CDF.
Regularization principles also apply to GLMs (logistic, Poisson regression) and to other scenarios (e.g., duration models).
The general objective function is
inf_β { −ℓ(β) + λ P(β) },

that is, a penalized (negative) log-likelihood.
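A hedged illustration with scikit-learn's LogisticRegression, which expresses the penalty strength through C = 1/λ (smaller C, heavier penalization); the simulated data and the specific settings are only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, p = 500, 30
X = rng.normal(size=(n, p))
eta = 1.0 + X[:, 0] - 2.0 * X[:, 1]            # only two active predictors
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))

# L1-penalized logistic regression; C = 1/lambda, so small C = heavy penalty
fit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("selected predictors:", np.flatnonzero(fit.coef_[0]))
```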
How should the penalty parameter λ be chosen?
Solution for medium-sized data
Leave-One-Out: for i = 1, …, n, set the i-th observation apart, fit the model for fixed λ using the remaining n − 1 observations, and measure performance on the i-th observation (e.g., prediction error).
Let CV(λ₁), …, CV(λ_k) denote the average LOO performance over a grid of values.
Set λ = arg min CV(·).
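A literal (and computationally naive) sketch of the leave-one-out loop for choosing λ in ridge regression; the closed-form fit is redefined here for self-containment, and the grid of λ values in the usage comment is arbitrary.

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def loo_cv_ridge(X, y, grid):
    """Average leave-one-out squared prediction error for each lambda in grid."""
    n = X.shape[0]
    cv = []
    for lam in grid:
        errors = []
        for i in range(n):
            keep = np.arange(n) != i                   # leave observation i out
            beta = ridge_fit(X[keep], y[keep], lam)    # fit on the remaining n - 1
            errors.append((y[i] - X[i] @ beta) ** 2)   # assess on observation i
        cv.append(np.mean(errors))
    return np.array(cv)

# usage: lam_grid = np.logspace(-3, 3, 13)
#        best_lam = lam_grid[np.argmin(loo_cv_ridge(X, y, lam_grid))]
```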
More generally
Leave-k-out cross-validation
k-fold cross-validation: the original sample is randomly
partitioned into k subsamples and one is left out at each
of the k iterations
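The same idea with k-fold splits, sketched with scikit-learn's KFold and Ridge; k = 10, the shuffling seed and the use of ridge regression are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

def kfold_cv_ridge(X, y, lam_grid, k=10):
    """Average squared prediction error over k folds, for each lambda in lam_grid."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    cv = np.zeros(len(lam_grid))
    for train, test in kf.split(X):                 # each subsample left out once
        for j, lam in enumerate(lam_grid):
            fit = Ridge(alpha=lam).fit(X[train], y[train])
            cv[j] += np.mean((y[test] - fit.predict(X[test])) ** 2) / k
    return cv
```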
Plot twist
With big data, iterating model estimation might not be computationally feasible.
CV is also optimistic unless k → ∞.
Solution: if n is large, split the data into
Training set
Dev set
Test set
Be careful to do this at random (simple or stratified).
Stratification: guarantees that the distribution of the outcome is preserved in each subset.
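One way to obtain the three-way split, sketched with two calls to scikit-learn's train_test_split; stratify=y keeps the outcome distribution comparable across the subsets, and the toy data and the 60/20/20 proportions are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: X is the predictor matrix, y a binary outcome
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))
y = rng.binomial(1, 0.3, size=1000)

# 60% training, then the remaining 40% split evenly into dev and test;
# stratify keeps the proportion of y = 1 similar in each subset
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

print(len(y_train), len(y_dev), len(y_test))          # 600 200 200
print(y_train.mean(), y_dev.mean(), y_test.mean())    # similar proportions of y = 1
```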
Training set
The training set is used to fit the candidate models (for instance, one model per value of λ on a grid). The dev (validation) set is used to compare the candidates and select the tuning parameters. The test set is touched only once, at the very end, to obtain an honest assessment of predictive performance.