SSRN Id3334458
Horse Race in
High Dimensional Space
Abstract
In this paper, we study the predictive power of dense and sparse estimators in a high-dimensional
space. We propose a new forecasting method, called Elastically Weighted Principal Components
Analysis (EWPCA), that selects the variables with respect to the target variable, taking into account
the collinearity among the data using the Elastic Net soft thresholding. We then weight the
selected predictors using the Elastic Net regression coefficients, and finally apply principal
component analysis to the new “elastically” weighted data matrix. We compare this method to
common benchmarks and other methods used to forecast macroeconomic variables in a data-rich
environment, divided into dense representations, such as Dynamic Factor Models and Ridge regressions,
and sparse representations, such as LASSO regression. All these models are adapted to take into
account the linear dependency of the macroeconomic time series.
Moreover, to estimate the hyperparameters of these models, including the EWPCA, we propose
a new procedure called “brute force”. This method allows us to treat all the hyperparameters of
the model uniformly and to take the longitudinal feature of the time-series data into account.
Our findings can be summarized as follows. First, the “brute force” method for estimating the
hyperparameters is more stable and gives better forecasting performance, in terms of MSFE, than
the traditional criteria used in the literature to tune the hyperparameters. This result holds for all
sample sizes and forecasting horizons. Secondly, our two-step forecasting procedure enhances the
forecasts’ interpretability. Lastly, the EWPCA leads to better forecasting performance, in terms of
mean square forecast error (MSFE), than the other sparse and dense methods or the naïve benchmark,
at different forecast horizons and sample sizes.
Keywords: Variable selection, High-dimensional time series, Dynamic factor models, Shrinkage
methods, Cross-validation.
We are grateful to Tommaso Proietti, Marco Lippi, Lucrezia Reichlin, Filippo Pellegrino, Thomas Hasenzagel, the
conference participants at the ’1st Vienna Workshop on Economic Forecasting 2018’ and the Now-casting Economics
team for helpful comments and suggestions.
The opinions expressed and conclusions drawn are those of the authors and do not necessarily reflect the views of
the Bank of Italy.
2 Methodology
In this section we compare the most common variable-selection methods in the high-dimensional
literature, adapting these models to take into account the autoregressive features of the
macroeconomic time series and to give a better assessment of the predictive power of these
techniques.
In particular, this section presents a brief overview of the more sophisticated approaches used in
the literature to forecast in a high-dimensional framework. These methods can be classified into two
different categories: sparse and dense modeling techniques.
Models in the first category select a small subset of variables with high predictive power, “throwing
away” the others. We consider the LASSO-based approach by Tibshirani (1996) as a typical example of
this category, augmenting this method to include the linear dependency of both the target variable and
the covariates in selecting the variables, as in Li et al. (2014).
Dense models, on the other hand, consider all the variables as important for prediction at the same
time, even if their individual impact might be small. Among others, we compare the Dynamic Factor
Model, first proposed by Stock and Watson (2002a, b) and based on principal component analysis;
the Ridge regression by Hoerl and Kennard (1970); and the Elastic Net regression by Zou and Hastie
(2005), which can be seen as an “almost” sparse representation model.
where p = 1, ..., P denotes the linear dependency of our target variable x_t on x_{j,T−p}, with j = 1, ..., N;
the error term e_{i,T+h} is stationary and Gaussian, with mean zero and variance σ_e²; and h denotes the
forecast horizon. We proceed by estimating the coefficients β_i^p ≡ {β_{i,j}^p}, with j = 1, ..., N and p = 1, ..., P,
by minimizing the penalized least squares:

$$\hat{\beta}_i^p = \arg\min_{\beta}\; \frac{1}{2}\sum_{t=1+P}^{T}\Big(x_{i,t+h} - \mu - \sum_{p=1}^{P}\sum_{j=1}^{N}\beta_{i,j}^p\, x_{j,t-p}\Big)^2 + \lambda_1 \sum_{p=1}^{P}\sum_{j=1}^{N}\big|\beta_{i,j}^p\big| \qquad (2)$$
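As an illustration of equation (2), a minimal sketch of a LASSO fit on a lagged design matrix. The helper `build_lagged_X`, the toy data, and the value of the penalty are ours, not the paper's:

```python
import numpy as np
from sklearn.linear_model import Lasso

def build_lagged_X(X, P):
    """Stack lags 1..P of every column of X (T x N) into a (T-P) x (N*P) matrix."""
    T, N = X.shape
    return np.hstack([X[P - p:T - p, :] for p in range(1, P + 1)])

rng = np.random.default_rng(0)
T, N, P, h = 120, 10, 4, 1
X = rng.standard_normal((T, N))
y = np.zeros(T)
y[2:] = X[:-2, 0] + 0.1 * rng.standard_normal(T - 2)  # y_t driven by x_{1,t-2}

Z = build_lagged_X(X, P)[:-h]      # predictors dated t-1..t-P, aligned with t+h
target = y[P + h:]                 # x_{i,t+h}
model = Lasso(alpha=0.1).fit(Z, target)
selected = np.flatnonzero(model.coef_)   # surviving lagged predictors
```

With h = 1, the lag-1 column of the true driver matches the target exactly, so the LASSO keeps it while zeroing most of the noise columns.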
In this class of models the shrinkage parameter is λ2, and it allows us to use the normal
equations to minimize equation (3). To estimate both the coefficients and the critical parameters of
the Ridge regression we proceed as for the LASSO, using the “brute force” cross-validation.
where there are two penalty coefficients, meaning that our minimization problem has to fulfill two
constraints. The Elastic Net regression allows a lower instability in the variable selection since it
decreases the uncertainty related to large covariance matrices. The above formula refers to the naïve
Elastic Net, which can introduce extra bias in the estimates because of the double shrinkage in equation
(4), and it does not reduce the variance. Writing $\alpha = \lambda_2/(\lambda_1+\lambda_2)$, the above formula becomes:

$$\hat{\beta}_i^p = \arg\min_{\beta}\; \frac{1}{2}\sum_{t=1+P}^{T}\Big(x_{i,t+h} - \mu - \sum_{p=1}^{P}\sum_{j=1}^{N}\beta_{i,j}^p\, x_{j,t-p}\Big)^2 + (1-\alpha)\sum_{p=1}^{P}\sum_{j=1}^{N}\big|\beta_{i,j}^p\big| + \alpha\sum_{p=1}^{P}\sum_{j=1}^{N}\big(\beta_{i,j}^p\big)^2 \qquad (5)$$
Following Zou and Hastie (2005), augmenting the data¹, we can determine the Elastic Net estimates
($\hat{\beta}^{en}$) as:

$$\hat{\beta}^{en,p}_{i,j} = (1+\lambda_2)\,\beta^{*p}_{i,j}$$

where β* are the naïve Elastic Net estimates obtained from equation (4) using the augmented data.
This scaling transformation preserves the variable-selection property and undoes the extra shrinkage
due to the Ridge component in the Elastic Net. The reason why (1 + λ2) is selected as the scaling factor
is the decomposition of the Ridge operator. Moreover, Zou and Hastie (2005) show (Th. 2²) that
the Elastic Net can be seen as a decorrelation estimator solved by a LASSO minimization. To simplify
the notation, and using matrix form, we can reframe equation (5) as follows:

$$\hat{\beta} = \arg\min_{\beta}\; \beta'\,\frac{X'X + \lambda_2 I}{1+\lambda_2}\,\beta - 2y'X\beta + \lambda_1|\beta|_1 \qquad (6)$$

where β is the vector containing the Elastic Net estimates for all predictors, and (1 + λ2) is the decorrelation
operator that stabilizes the parameter estimation and controls the estimation of the variance.
To calibrate the values of the critical parameters we proceed as for the previous models, using the “brute
force” cross-validation.
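The augmented-data construction behind this rescaling (Zou and Hastie, 2005) can be checked numerically. The sketch below verifies that the naïve Elastic Net criterion equals a LASSO criterion on augmented data; all variable names and the toy data are ours:

```python
# Numerical check of the Elastic Net augmentation identity: the naive EN
# criterion in beta equals a LASSO criterion in beta* = sqrt(1+lam2)*beta
# on augmented data.
import numpy as np

rng = np.random.default_rng(1)
T, N = 60, 5
X = rng.standard_normal((T, N))
y = rng.standard_normal(T)
beta = rng.standard_normal(N)
lam1, lam2 = 0.3, 0.7

# Naive Elastic Net criterion
en = (np.sum((y - X @ beta) ** 2)
      + lam2 * np.sum(beta ** 2)
      + lam1 * np.sum(np.abs(beta)))

# Augmented-data LASSO criterion: X* = (1+lam2)^(-1/2) [X; sqrt(lam2) I], y* = [y; 0]
X_aug = np.vstack([X, np.sqrt(lam2) * np.eye(N)]) / np.sqrt(1 + lam2)
y_aug = np.concatenate([y, np.zeros(N)])
beta_star = np.sqrt(1 + lam2) * beta
lasso = (np.sum((y_aug - X_aug @ beta_star) ** 2)
         + lam1 / np.sqrt(1 + lam2) * np.sum(np.abs(beta_star)))

assert np.isclose(en, lasso)  # the two criteria coincide for any beta
```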
$$x_t = \Lambda f_t + \xi_t \qquad (7)$$
1 Details in Appendix A.2.
2 Proof in Appendix A.2.
3 EWPCA
The Elastically Weighted Principal Component Analysis (EWPCA) is a two-step forecasting procedure
that allows us both to select the most important variables with respect to a target variable and to
perform a parsimonious forecasting exercise. Using the same notation as in Section 2.1, in the first step
we perform the variable screening using the Elastic Net soft-thresholding formalized by Zou and
Hastie (2005), minimizing the objective function in equation (5) framed as a scaled minimization
problem so as not to bias the estimates. We use this penalization technique because it overcomes the two
most significant shortcomings of LASSO. First, it makes the variable selection more stable
over time, because as new data become available the subset of the most important predictors does
not change considerably; second, it deals in an optimal way with multicollinear data, i.e. with variables
that have a very high pairwise correlation. LASSO keeps only one of these variables, regardless of which
one, while the Elastic Net, with its double regularization, is able to pick the most important one,
thanks to the decorrelation mechanism.
Following Zou and Hastie (2005), the β̂ estimates of the Elastic Net presented in Section 2.3 can be
written in matrix form as:³

$$\hat{\beta} = \arg\min_{\beta}\; \beta'\,\frac{X'X + \lambda_2 I}{1+\lambda_2}\,\beta - 2y'X\beta + \lambda_1|\beta|_1 \qquad (8)$$
3 For convenience we drop all the subscripts and superscripts referring to variable i and lags p.
where y_{t+h} = x_{i,t+h}, α1(L) are the coefficients of the target variable and its lags up to pY, and α2(L)
are the coefficients of the factors F̂_t and their lags up to pF. We consider only a linear combination of
the factors, and we do not add any non-linear feature to our estimation procedure, as in Bai and Ng
(2008), since it does not help to improve the forecasts of quarterly variables. In this second step, we
use principal component analysis because it allows us to represent the space in a parsimonious way.
Indeed, the estimated factors reduce the number of parameters to estimate and the model uncertainty
related to them. Moreover, the bulk of the variation in X* can be explained by a few mutually orthogonal
linear combinations of the variables selected in the first step, through the factor representation.
Therefore, to compute the forecast, we need to set five hyperparameters, ϑ = (λ1, λ2, r, pY, pF), and
they are estimated as defined in the previous section. In the empirical application, consistency checks
are performed to evaluate this forecast estimation procedure and to show the improved stability of the
variable selection over time with respect to the standard LASSO.
This two-step procedure also helps the macroeconomic interpretability of the forecast and the
model, since from step one we can see which are the most important variables, and their
weights, that help to forecast the target variable. Moreover, from the second step, we can determine
which variables contribute most to the factors in the forecasting exercise.
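A rough sketch of the two-step idea follows. The hyperparameter values, the toy data, and the use of scikit-learn's Elastic Net are our illustrative choices, not the authors' code:

```python
# Step 1: Elastic Net screens and weights the predictors.
# Step 2: PCA on the weighted, selected columns gives the factors used to forecast.
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
T, N, r, h = 150, 30, 3, 1
X = rng.standard_normal((T, N))
y = np.zeros(T)
y[h:] = X[:-h, :5] @ np.ones(5) + 0.5 * rng.standard_normal(T - h)

# Step 1: Elastic Net selection and weighting
en = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X[:-h], y[h:])
keep = np.flatnonzero(en.coef_)
X_star = X[:, keep] * en.coef_[keep]      # "elastically" weighted data matrix

# Step 2: PCA on the weighted matrix, then a factor-based forecast equation
factors = PCA(n_components=r).fit_transform(X_star)
coef, *_ = np.linalg.lstsq(factors[:-h], y[h:], rcond=None)
y_hat = factors[-1] @ coef                # h-step-ahead point forecast
```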
• ϑLASSO = (λ1, p): the amount of shrinkage and the number of lags to be included in the prediction
equation;⁴
• ϑDFM = (r, pY, pF): the number of factors and the number of lags of both the dependent and
the independent variables;
• ϑEWPCA = (λ1, λ2, r, pY, pF): the amount of shrinkage and selection, the number of factors, and the
number of lags of the dependent and independent variables.
4 Further details are included in Section 2.3.
Most of these parameters can be estimated using criteria proposed in the literature. In particular,
the number of lags in the regression (p, pY, pF) is usually estimated by information criteria like the Akaike
Information Criterion (AIC), the Bayesian Information Criterion (BIC) or the Schwarz Information Criterion
(SIC), while the optimal amount of shrinkage (λ1, λ2) is estimated using the cross-sectional K-fold cross-
validation procedure or other similar techniques, such as “leave-one-out” or what is proposed
by Bergmeir et al. (2015). The number of factors (r) can be estimated using the methods proposed
by Alessi et al. (2010) (ABC) or Bai and Ng (2002).
All these criteria rely on evaluating the performance of the model, expressed in terms of fit
and complexity, as the single hyperparameter of interest is varied. But it could be the case that
combining them is not enough to find the optimal prediction setup. In this paper, we propose
an estimation method that treats all the hyperparameters of the model uniformly.
The estimation procedure we propose is as follows:
ii. to evaluate the performance of all possible compositions of ϑ in the calibration set, we define a
rolling window scheme. We choose the size of this window as τ = (‖ϑ‖ + h) × 2, which roughly
corresponds to a little less than 20 years;
iii. for each rolling window we compute {ŷ_{t+h}(ϑ)}_{t∈cal}, i.e. the h-step-ahead prediction of the
dependent variable based on a specific composition of the vector ϑ. In particular, we aim to
try all possible combinations of hyperparameters. In practice, for each of them, we define a set
of values sufficiently large to include the optimal one, because we want to avoid results on the
bounds of our grid of values. For each prediction horizon we compute
$MSFE_h^i(\vartheta) = \sum_t \big(\hat{y}_{t+h}(\vartheta) - y_{t+h}\big)^2$, the Mean Square
Forecast Error of model i based on the hyperparameters in ϑ;
iv. the optimal set of hyperparameters ϑ* is such that MSFE_h^i(ϑ*) is minimum;
v. ϑ* is then used in the proper dataset to produce the prediction of the dependent variable.
For each model, the MSFE_h^i is computed, and it is the basis of the performance comparison.
The procedure described above is data-driven, and it can be seen as a greedy cross-validation
algorithm in which we take into consideration all the information available. We call it “brute force”.
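The rolling-window "brute force" search can be sketched as follows. The `ridge_fit_predict` routine, the grids, the window size, and the toy data are illustrative placeholders, not the paper's implementation:

```python
# Evaluate every hyperparameter combination on rolling calibration windows
# and keep the MSFE-minimising one.
import itertools
import numpy as np

def ridge_fit_predict(X_win, y_win, x_new, theta, h):
    """Fit y_{t+h} on x_t by ridge within the window, predict from x_new."""
    Z, tgt = X_win[:-h], y_win[h:]
    beta = np.linalg.solve(Z.T @ Z + theta["lam"] * np.eye(Z.shape[1]), Z.T @ tgt)
    return x_new @ beta

def rolling_msfe(y, X, fit_predict, theta, tau, h):
    """Average squared h-step forecast error for one hyperparameter vector."""
    errors = []
    for end in range(tau, len(y) - h):
        y_hat = fit_predict(X[end - tau:end], y[end - tau:end], X[end], theta, h)
        errors.append((y_hat - y[end + h]) ** 2)
    return float(np.mean(errors))

def brute_force(y, X, fit_predict, grids, tau, h):
    """Try all combinations of the grids; return the MSFE-minimising theta."""
    keys = list(grids)
    combos = list(itertools.product(*grids.values()))
    scores = [rolling_msfe(y, X, fit_predict, dict(zip(keys, c)), tau, h)
              for c in combos]
    return dict(zip(keys, combos[int(np.argmin(scores))]))

rng = np.random.default_rng(5)
T = 100
X = rng.standard_normal((T, 3))
y = np.zeros(T)
y[1:] = X[:-1] @ np.array([1.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(T - 1)
best = brute_force(y, X, ridge_fit_predict, {"lam": [0.1, 1.0, 10.0]}, tau=40, h=1)
```

The same `brute_force` driver can score joint grids over several hyperparameters at once, which is the point of treating them uniformly.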
5.1 Cross-validation
In the cross-validation procedure, we aim to determine the optimal values of the hyperparameters for
each model within the calibration set, to be used in the whole forecasting exercise. As we have outlined
in Section 4, each model has its own set of parameters to calibrate, and the best parameters are selected
in the sense of minimizing the MSFE for each horizon. The general setup illustrated in Section 2
is used for the cross-validation step, where we also consider the autoregressive component of the target
variable and the covariates, to calibrate the hyperparameters with a forward-looking procedure.⁶
Table 1 shows the optimal values of the critical parameters selected by this cross-validation procedure.
To initialize and find the grid for λ1 and λ2, related to the variable selection and shrinkage on the
coefficients, we implement a common algorithm used in Hastie et al. (2009). As mentioned above, the
“brute-force” cross-validation procedure is compared to the classical criteria to check consistency and
stability in the values of the critical parameters.
Table 1: Optimal values of the critical parameters selected by cross-validation

Forecasting horizon      1        2        3        4        8
LASSO
  λ1                  0.3266   0.2646   0.3164   0.3186   0.2764
  p                   2        2        2        2        2
RIDGE
  λ2                  6.0753   7.9300   6.0652   6.0652   9.7598
  p                   0        0        0        0        0
DFM
  r                   14       10       12       10       3
  pY                  1        1        1        2        1
  pF                  9        7        7        9        9
EWPCA
  λ1                  0.9141   0.5377   0.8644   0.4623   0.7638
  λ2                  0.030    0.2935   0.0792   0.0372   0.008
  r                   14       4        16       18       18
  pY                  1        1        1        1        1
  pF                  7        9        9        7        6
It is interesting to note that the Ridge regression does not admit any lag of the target nor the
6 See Section 2 for more details.
5.1.1 LASSO
The vector of hyperparameters to calibrate for the LASSO is ϑLASSO = (λ1, p), where λ1 is the
amount of shrinkage and p the number of lags of the covariates and the target variable. The grid
of values for the number of lags is p = 0, ..., P, where P = 12 is chosen to be large because we want
to avoid solutions on the boundary; p can also be equal to 0, in which case no lags are considered. The lag
refers to both the covariates and the target variable because, as shown in equation (2), the vector
x_j also contains the lags of the target variable.
As a first step, we create a grid of values for λ1, i.e., the parameter governing the amount of shrinkage.
This grid is composed of 200 different values, such that the highest corresponds to the largest value
of λ1 for a model with at least one coefficient different from zero when we run a regression on
the whole calibration set (not the single time window). The values of λ1 go from 0.005 to 0.995; we do
not consider λ1 = 0 because it corresponds to the original OLS regression. Using this approach, we do
not perform the K-fold cross-validation usually applied to LASSO regressions, differently from what
is proposed by Bergmeir et al. (2015), since it deteriorates our results in prediction.⁷ Once we have
this grid, for each time window and each lag, we compute the prediction of the dependent variable
using the coefficients from the LASSO regression for every λ1, one at a time. We save the optimal
λ̂1, the optimal lag p, and the optimal β_LASSO. The optimal values selected are the ones that correspond
to the minimum MSFE produced in the calibration set, and they are kept fixed in the proper sample.
The optimal β̂ will be used in Section 5.3 for the variable selection stability check. This procedure is
implemented for each forecast horizon h = 1, 2, 3, 4, 8.
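One common way to anchor such a grid, which may differ from the paper's exact construction, uses the smallest penalty that zeroes every LASSO coefficient; the sketch below is ours:

```python
# For the (1/2n)*RSS + lam*|beta|_1 objective with centered data, the smallest
# penalty that sets every LASSO coefficient to zero is lam_max = max_j |x_j' y| / n.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_lambda_grid(X, y, n_values=200, eps=0.005):
    """Log-spaced grid from eps*lam_max up to lam_max, largest value first."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    lam_max = np.max(np.abs(Xc.T @ yc)) / n
    return lam_max * np.logspace(np.log10(eps), 0, n_values)[::-1]

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 20))
y = X[:, 0] + 0.1 * rng.standard_normal(100)
grid = lasso_lambda_grid(X, y)

# At the top of the grid the LASSO solution is the all-zero coefficient vector
top = Lasso(alpha=grid[0]).fit(X, y)
```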
5.1.2 Ridge
The cross-validation procedure for the vector of hyperparameters in the Ridge regression,
ϑRidge = (λ2, p), is similar to the one performed for the LASSO, with the difference that the grid of values
used to cross-validate λ2 is created following the procedure proposed by Hastie, Tibshirani, and Friedman (2009).
In order to evaluate each time window of the calibration set using the same grid of λ2, we perform the
SVD of the augmented matrix X2′X2, which also includes the target variable, X2 = (y, X′)′, with the
time span of the data going from 1, the first observation, up to Tcal − h. Then, we multiply the
7 Alternative for the grid of λ1, from Hastie et al. (2009): perform a Singular Value Decomposition (SVD) of the matrix
X′X, take the square of the highest eigenvalue, and then compute a grid of equidistant lambdas. We do not consider
this method in our paper and leave the investigation of the properties of this approach to future research.
The vector of critical parameters in the Dynamic Factor Model is ϑDFM = (r, pY, pF), where r is
the number of factors, which goes from 1 to 20, also in this case to avoid boundary solutions; pY is
the number of lags of the target variable only, differently from LASSO and Ridge; and pF is the
number of lags of the factors, which goes from pF = 1, ..., 10. The factors of the dynamic factor model
are estimated from the data matrix without using lags, through principal component analysis.
The critical parameters are estimated using the same rolling window scheme used for LASSO and
Ridge; for each time window we compute the prediction of the dependent variable and store
the set of optimal parameters (r̂, p̂Y, p̂F) that minimize the MSFE_h for each forecast horizon.
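A minimal sketch of this factor-based forecast step follows; the values of r, pY, pF, the toy data, and the alignment conventions are our illustrative choices:

```python
# Factors from PCA on the standardized, unlagged data matrix, then a regression
# of y_{t+h} on lags of y and of the factors.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
T, N, r, pY, pF, h = 200, 40, 3, 1, 2, 1
X = rng.standard_normal((T, N))
y = X[:, 0] + 0.5 * rng.standard_normal(T)

F = PCA(n_components=r).fit_transform((X - X.mean(0)) / X.std(0))

# Regressor at time t: (y_t, ..., y_{t-pY+1}, F_t, ..., F_{t-pF+1})
start = max(pY, pF) - 1
rows = [np.concatenate([y[t - pY + 1:t + 1][::-1],
                        F[t - pF + 1:t + 1][::-1].ravel()])
        for t in range(start, T)]
Z = np.asarray(rows)

# Regress y_{t+h} on the time-t regressors, forecast from the last observation
coef, *_ = np.linalg.lstsq(Z[:-h], y[start + h:], rcond=None)
y_hat = Z[-1] @ coef
```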
5.1.4 EWPCA
The Elastically Weighted PCA has a vector of five hyperparameters to calibrate, ϑEWPCA = (λ1, λ2, r, pY,
pF). The grid of values for each critical parameter is determined as for the previous models using the
rolling window estimation, and we store the optimal set of parameters (λ̂1, λ̂2, r̂, p̂Y, p̂F) that minimizes
the MSFE_h for each forecast horizon. To reduce the computational burden, for the EWPCA we use a
grid of fifty values for λ1 and fifty values for λ2.
Forecasting horizon 1 2 3 4 8
EWPCA vs AR(1) 0.4097∗∗∗ 0.5156∗∗∗ 0.5148∗∗∗ 0.2940∗∗∗ 0.6363∗∗∗
EWPCA vs AR(p) 0.4263∗∗∗ 0.5080∗∗∗ 0.5168∗∗∗ 0.2943∗∗∗ 0.6489∗∗∗
Notes: Entries are Relative Mean Square Forecast Errors of the EWPCA with respect to the AR(1) and
AR(bic). Bold values mean that we reject the null hypothesis while the asterisks indicate at which level we
reject the null hypothesis. ∗∗∗ p < 0.01, ∗∗ p < 0.05, ∗ p < 0.1
In Table 2 we can see that our methodology performs better than both the AR(1) and the AR(p) at
each forecast horizon. The results are all significant with p < 0.01. We notice that the performance of
our methodology deteriorates faster than that of both AR specifications as the forecast horizon increases,
but it still outperforms these two standard benchmarks.
The better forecasting performance is due to the two steps of our methodology, in which we are able
to select the most important variables from which we extract the principal components. In both steps
our procedure tries to be parsimonious in the variables selected, avoiding the multicollinearity issue thanks
to the Elastic Net penalization procedure; the subsequent use of principal component analysis then allows for
a parsimonious forecasting method. Moreover, the cross-validation procedure helps us detect the
optimal values of the critical parameters to use in the out-of-sample exercise. In all the steps of our
methodology, including the cross-validation, we want to preserve the forecasting ability of the variables
and let this feature drive the procedure.
Forecasting horizon 1 2 3 4 8
EWPCA vs Lasso 0.3208∗∗∗ 0.3633∗∗∗ 0.3922∗∗∗ 0.2297∗∗∗ 0.4807∗∗∗
EWPCA vs Ridge 0.2610∗∗∗ 0.3088∗∗∗ 0.1680∗∗∗ 0.1315∗∗∗ 0.2234∗∗∗
EWPCA vs DFM 0.9164∗∗ 1.3889 0.6644∗∗∗ 0.7067∗∗∗ 0.9108∗∗∗
Notes: Entries are Relative Mean Square Forecast Errors of the EWPCA with respect to the LASSO, Ridge,
and DFM. All the critical parameters have been selected using the cross-validation procedure in section 2.3.
Bold values mean that we reject the null hypothesis while the asterisks indicate at which level we reject the
null hypothesis. ∗∗∗ p < 0.01, ∗∗ p < 0.05, ∗ p < 0.1
Forecasting horizon 1 2 3 4 8
EWPCA 0.1647 0.3146 0.2823 0.1614 0.3731
DFM 0.1797 0.2488 0.4249 0.2283 0.4097
AR(1) 0.4387 0.4826 0.5484 0.5487 0.5863
AR(p) 0.4216 0.4898 0.5463 0.5481 0.5749
Notes: Entries are Mean Square Forecast Errors of the EWPCA, DFM, AR(1), and AR(bic). All the critical
parameters have been selected using the cross-validation procedure in section 2.3.
forecast errors between the models, but as we can see in Table 4, the difference in MSFE, even
between EWPCA and DFM at h = 2, when the DFM outperforms the EWPCA, is not so large.¹⁰ The gap
between the EWPCA and the DFM increases as the forecast horizon becomes larger; hence we can
assert that the EWPCA produces lower MSFEs than the DFM at longer forecast horizons. We can
also notice that the MSFE of each model increases as the forecast horizon increases. In general, we
10 We also report the AR(1) and AR(bic) in the table to show their MSFE values.
Notes: Each column in the graph represents the fluctuation test of the EWPCA versus DFM (column 1),
LASSO (column 2) and Ridge (column 3) for each forecast horizon. The dotted lines indicate the 5%
critical values.
Notes: Each column in the graph represents the fluctuation test of the EWPCA versus AR(1) (column 1)
and AR(bic) (column 2) for each forecast horizon. The dotted lines indicate the 5% critical values.
In this paragraph, we analyze the model sparsity and the increased interpretability obtained thanks to
the Elastic Net soft-thresholding. In Table 5 we show the model sparsity of the LASSO versus the
Elastic Net regression used in step one of our procedure. We summarize the sparsity as the proportion
of independent variables selected by these shrinkage methods in the out-of-sample exercise for each
forecast horizon. To compute this measure, we simply count how many predictors are selected in each
rolling window and then take the average. The standard deviation of this measure shows whether
the number of variables selected is consistent over time.
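The measure, as we read it, can be computed as follows; the toy coefficient paths stand in for the estimated ones:

```python
# Per rolling window, the fraction of predictors with nonzero coefficients,
# then its mean and standard deviation across windows.
import numpy as np

def selection_sparsity(coef_paths):
    """coef_paths: (n_windows, N) array of coefficients, one row per window."""
    frac = (coef_paths != 0).mean(axis=1)   # share of predictors selected per window
    return float(frac.mean()), float(frac.std())

# Toy paths: 3 windows, 4 predictors; selected fractions are 0.25, 0.5, 0.0
paths = np.array([[1.0, 0.0, 0.0, 0.0],
                  [1.0, 2.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0]])
mean_sel, sd_sel = selection_sparsity(paths)
```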
The number of variables selected by the Elastic Net is much larger than that selected by the
LASSO, but it is quite stable over time; this means that the group of variables selected in each
rolling window comprises almost the same predictors. Moreover, this result is supported by
the fact that the standard deviation is very small, meaning that the number of variables selected
in each rolling window does not diverge. It is interesting to notice that the number of variables
selected by the LASSO is very small at each h; it might be the case that the out-of-sample data
structure changes considerably with respect to the calibration data set, making the optimal values of
the hyperparameters inadequate to select the predictors in the proper data set. Furthermore, this
difference can be explained by the fact that the LASSO imposes a sparse structure on the data, while the
Elastic Net adopts both penalizations (the L1 and L2 norms) without imposing a strict sparse structure,
and is therefore able to “kill” fewer predictors.
Forecast horizon        1               2               3               4               8
                    mean  st.dev   mean  st.dev   mean  st.dev   mean   st.dev   mean   st.dev
EWPCA               0.288 0.026    0.068 0.032    0.173 0.016    0.406  0.035    0.515  0.035
LASSO               0.014 0.022    0.014 0.013    0.002 0.008    0.0015 0.005    0.0015 0.006
Notes: At each forecast horizon we compute the mean and standard deviation of the proportion of predictors
selected across time for LASSO and EWPCA.
Figure 3 shows the heatmaps of the variables selected by the Elastic Net in step one of the
EWPCA and their stability over time. We can notice that many variables, especially at the first
three forecast horizons, are not selected, while some of them are selected in almost all the rolling
windows. This fact can enhance the model interpretability and help detect which variables are the
most important to forecast the target variable at each forecast horizon. We can assert that, even if we
do not impose a sparse structure, for h = 1, 2, 3 the data structure is sparse enough, and the variables
selected, thanks to the decorrelation correction of the Elastic Net, are stable over time. This evidence
overcomes one of the drawbacks of the LASSO, namely the instability of the predictors’ selection over time.
We can notice that at short horizons, h = 1 and 3, the core predictors selected are almost the same,
and it might be the case that the data set contains some variables with a lot of forecasting ability at
short horizons, while others can be completely useless; hence, our two-step procedure can give us a less
noisy measure of GDP.
The situation at longer horizons (i.e. h = 4 and 8) is different because we cannot see a clear sparsity
Notes: Each graph represents the variables selected (yellow points) by the Elastic Net soft-thresholding for
each rolling window, at each forecast horizon.
In this paragraph, we study extensively the ability of our cross-validation procedure to detect the best
hyperparameter combination, with respect to the common criteria. To determine which method is
better, we use the MSFE for each horizon computed in the proper set. The criteria used are the ABC
criterion (Alessi, Barigozzi and Capasso, 2010) for the selection of the number of factors, and the Schwarz
criterion to determine the number of lags of the target variable and the factors. The values of the critical
parameters determined by the “brute force” cross-validation are computed in the calibration set following
the procedure outlined in Section 2 and kept fixed for all the out-of-sample periods. The values
of the hyperparameters estimated through the criteria are computed for each out-of-sample rolling
window, in order to always select the best value. The criteria can adapt the optimal value at each time,
and this is an advantage with respect to our cross-validation procedure. To calibrate λ1 and λ2 we
Table 6: Mean Square Forecast Errors EWPCA “brute force” versus criteria
Forecast horizon 1 2 3 4 8
Size = 40%
EWPCAbf 0.1647 0.3146 0.2823 0.1614 0.3731
EWPCAabc 1.4607 1.2910 1.1325 1.3112 1.8725
EWPCAs 0.3571 0.4790 0.8911 0.5192 0.5417
EWPCAabc−s 0.4278 0.4921 0.7409 0.5833 0.5755
Size = 50%
EWPCAbf 0.1169 0.1932 0.2436 0.1286 0.1587
EWPCAabc 1.8345 0.4900 2.2464 0.9961 0.5020
EWPCAs 0.3464 0.3843 0.8824 0.3838 0.5884
EWPCAabc−s 0.3360 0.4605 0.5852 0.5485 0.7795
Size = 70%
EWPCAbf 0.1102 0.1319 0.2947 0.1624 0.1774
EWPCAabc 3.6321 2.3120 1.2297 0.3313 0.6540
EWPCAs 0.3457 1.1921 0.4432 0.4076 0.7028
EWPCAabc−s 0.5473 0.5791 0.8976 0.6653 1.5115
Notes: Entries are Mean Square Forecast Errors of the EWPCA calibrated in different ways. Bold values
are the best forecast at each forecast horizon among the different model specifications.
In Table 6 we show the out-of-sample MSFE at each forecast horizon, using different sizes of the
calibration set ({40%, 50% and 70%} of T), for different EWPCA_c, where c indicates the criterion used,
c = {bf, abc, s, abc-s}.¹¹ The values of the critical parameters for each size and model are reported
in the online appendix.¹² We can notice that in all cases the MSFE referring to our “brute force”
cross-validation procedure is the lowest for each forecast horizon and size of the calibration data set,
and it is highlighted in bold. We can also notice that the MSFE of our cross-validation
procedure is stable across the different sizes used for this accuracy and consistency check. Looking at
the other values in Table 6, we can assert that the ABC criterion, used to select the number of factors,
is the most unstable: its MSFE is highly volatile across both the forecast horizons and the sizes of the
calibration data set. This effect can be due to the reduction of the high-dimensional space performed
in the first step of the EWPCA, making the selection criterion unstable and unable to detect the optimal
number of factors.¹³
It is clear from Table 6 that treating all the parameters in the same way and exploiting the space
of all the possible combinations helps to improve the forecasting ability of the EWPCA, which has
a considerable number of hyperparameters. The hybrid methods, EWPCAabc and EWPCAs, where we
11 bf = brute force, abc = ABC, s = SIC, abc-s = ABC-SIC.
12 Available upon request.
13 We perform an extensive exercise where we compare the ABC criterion against a simpler factor selection, where we fix
the number of factors = {1, 2, 3}, and we notice that for each forecast horizon the latter method produces lower MSFEs
with a more stable pattern.
In this section, we perform the last consistency check: we compare the EWPCA calibrated using the
brute force method with the other models at each forecast horizon, for different sizes of the calibration
set, as in the paragraph above. The hyperparameters of the LASSO, Ridge, and DFM are estimated
using the same cross-validation procedure outlined in Section 4, but with different sizes of the dataset.
Table 7: Mean Square Forecast Errors, EWPCA brute force versus other models
Forecast horizon 1 2 3 4 8
Size = 50%
EWPCAbf 0.1169 0.1932 0.2436 0.1286 0.1587
Lasso 0.4523 0.5695 0.6315 0.6224 0.7351
Ridge 0.5857 0.7056 0.9247 1.0275 1.2355
DFMbf 0.0911 0.2528 0.2015 0.2608 0.3471
Size = 70%
EWPCAbf 0.1102 0.1319 0.2947 0.1624 0.1774
Lasso 0.5561 0.8074 0.9525 0.9276 1.2516
Ridge 0.5949 0.6790 0.9270 1.0839 1.2828
DFMbf 0.1508 0.2843 0.1987 0.3665 0.3208
Notes: Entries are Mean Square Forecast Errors of the EWPCA, LASSO, Ridge and DFM using different
sizes of the calibration data set. Bold values are the best forecast at each forecast horizon among the different
model specifications.
In Table 7 we report the out-of-sample MSFE for all the models using different sizes of the calibration
data set ({50% and 70%}). Bold characters highlight the lowest mean square forecast error for
each forecast horizon. We can notice that when the calibration size is equal to 50% of the
sample, the EWPCA is on average the model that performs better and has the most stable MSFE
across the different horizons. At h = 1 and 3 the DFM slightly outperforms the EWPCA, and for h = 1
the difference is statistically significant, meaning that the DFM outperforms the EWPCA in this
case. When the calibration set is equal to 70% of our data, the EWPCA performs much better than
the DFM at each forecast horizon, except h = 3; also in this case, the volatility of the mean
square forecast error is lower than that of the DFM, meaning that the results of the EWPCA are more stable.
6 Conclusion
Forecasting high dimensional time series is an important and timely problem that researchers are addressing both theoretically and empirically. Until recently, especially in economics, the words “big” or “high dimensional” referred to the simultaneous modeling of 15-20 variables to predict the target variable, in order to avoid the curse of dimensionality in the parameter estimation and to preserve the interpretability of the model. More recently, dense methods such as the DFM have been able to exploit many more data series at the same time, but at the cost of model interpretability. This paper proposes a new method that can handle a huge number of series and improve the forecast of the target variable without losing the model interpretability that is useful for macroeconomic implications.
The EWPCA selects the variables with the highest predictive power and, through principal component analysis, extracts the most important information to forecast the target variable. This method is preferred to the DFM because it performs better and enhances the interpretability of the model in a high dimensional and complex space where many of the variables are correlated. Moreover, the EWPCA delivers a variable selection that is stable over time, overcoming the drawbacks of the LASSO selection. In the first step we decorrelate the variables and select, for each forecast horizon, the optimal subsample of predictors. The empirical evidence shows that the optimal subsample changes at each forecast horizon, meaning that the variables have different forecasting ability due to the different structure and complexity of the data.
Finally, this work also proposes an alternative cross-validation procedure for high dimensional time series, called “brute force”, which gives overall more stable results in terms of MSFE and treats all the critical parameters in the same way by exploiting the space of all possible combinations.
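To fix ideas, exhaustively scoring every hyperparameter combination on an expanding-window out-of-sample MSFE can be sketched as follows. This is an illustrative sketch, not the paper's exact procedure; `fit_predict` is a hypothetical user-supplied routine that fits a model on the training window and returns an h-step-ahead forecast:

```python
import itertools
import numpy as np

def brute_force_cv(y, X, fit_predict, grids, min_train=60, horizon=1):
    """Score every hyperparameter combination on expanding-window MSFE.
    grids: dict mapping parameter name -> list of candidate values.
    Returns the combination with the lowest out-of-sample MSFE."""
    best_params, best_mse = None, np.inf
    for combo in itertools.product(*grids.values()):
        params = dict(zip(grids, combo))
        errors = []
        # expanding window: train on y[:s], forecast y[s + horizon - 1]
        for s in range(min_train, len(y) - horizon + 1):
            forecast = fit_predict(y[:s], X[:s], params, horizon)
            errors.append((y[s + horizon - 1] - forecast) ** 2)
        mse = float(np.mean(errors))
        if mse < best_mse:
            best_params, best_mse = params, mse
    return best_params, best_mse
```

Because every combination is evaluated on the same expanding-window splits, all hyperparameters are treated uniformly and the longitudinal ordering of the data is never broken.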
The T-transformation codes, Tcode in the above table, refer to how we transform a raw time series into a stationary one. Letting $X_t$ be a raw series, the transformations adopted are:
$$
Z_t =
\begin{cases}
X_t & \text{if Tcode} = 1 \\
(1-L)X_t & \text{if Tcode} = 2 \\
(1-L)(1-L^{12})X_t & \text{if Tcode} = 3 \\
\log X_t & \text{if Tcode} = 4 \\
(1-L)\log X_t & \text{if Tcode} = 5 \\
(1-L)(1-L^{12})\log X_t & \text{if Tcode} = 6
\end{cases}
$$
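For illustration, the six transformation codes can be implemented directly; the sketch below assumes monthly data (seasonal lag 12, as in the formulas above), and differencing shortens the series:

```python
import numpy as np

def seasonal_diff(x, s=12):
    """(1 - L^s) x: seasonal difference at lag s."""
    return x[s:] - x[:-s]

def transform(x, tcode):
    """Apply a T-transformation code (1-6) to a raw series x."""
    x = np.asarray(x, dtype=float)
    if tcode == 1:
        return x                                  # level
    if tcode == 2:
        return np.diff(x)                         # (1 - L) X_t
    if tcode == 3:
        return np.diff(seasonal_diff(x))          # (1 - L)(1 - L^12) X_t
    if tcode == 4:
        return np.log(x)                          # log X_t
    if tcode == 5:
        return np.diff(np.log(x))                 # (1 - L) log X_t
    if tcode == 6:
        return np.diff(seasonal_diff(np.log(x)))  # (1 - L)(1 - L^12) log X_t
    raise ValueError("Tcode must be between 1 and 6")
```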
$$
X^*_{(t+N)\times N} = (1+\lambda_2)^{-1/2}
\begin{pmatrix} X \\ \sqrt{\lambda_2}\, I \end{pmatrix},
\qquad
y^*_{(t+N)\times 1} = \begin{pmatrix} y \\ 0 \end{pmatrix}
$$

$$
\hat{\beta}^* = \arg\min_{\beta}\, \left| y^* - X^*\beta \right|_2^2 + \frac{\lambda_1}{\sqrt{1+\lambda_2}}\,|\beta|_1
$$

We can write the above equation using the definition of the Elastic Net in the following way:

$$
\hat{\beta}^* = \arg\min_{\beta}\, \left\| y^* - X^*\frac{\beta}{\sqrt{1+\lambda_2}} \right\|_2^2 + \frac{\lambda_1}{\sqrt{1+\lambda_2}}\left\|\frac{\beta}{\sqrt{1+\lambda_2}}\right\|_1 \tag{10}
$$

$$
= \arg\min_{\beta}\, \left( \beta'\,\frac{X^{*\prime}X^*}{1+\lambda_2}\,\beta \;-\; 2\,\frac{y^{*\prime}X^*}{\sqrt{1+\lambda_2}}\,\beta \;+\; y^{*\prime}y^* \;+\; \frac{\lambda_1|\beta|_1}{1+\lambda_2} \right) \tag{11}
$$

Using the definitions of $X^*$ and $y^*$:

$$
X^{*\prime}X^* = \frac{X'X + \lambda_2 I}{1+\lambda_2},
\qquad
y^{*\prime}X^* = \frac{y'X}{\sqrt{1+\lambda_2}},
\qquad
y^{*\prime}y^* = y'y
$$

so that

$$
\hat{\beta}^* = \arg\min_{\beta}\, \left\{ \frac{1}{1+\lambda_2}\left( \beta'\,\frac{X'X+\lambda_2 I}{1+\lambda_2}\,\beta - 2\,y'X\beta + \lambda_1|\beta|_1 \right) + y'y \right\} \tag{12}
$$

$$
= \arg\min_{\beta}\, \beta'\,\frac{X'X+\lambda_2 I}{1+\lambda_2}\,\beta - 2\,y'X\beta + \lambda_1|\beta|_1, \tag{13}
$$

since the constant $y'y$ and the positive factor $1/(1+\lambda_2)$ do not affect the minimizer.
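The augmented-data identities above can be verified numerically. A minimal NumPy sketch (the dimensions t, N and the value of λ2 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
t, N = 50, 8                      # illustrative dimensions
X = rng.standard_normal((t, N))
y = rng.standard_normal(t)
lam2 = 0.5

# augmented design matrix and response used in the derivation
X_star = (1 + lam2) ** -0.5 * np.vstack([X, np.sqrt(lam2) * np.eye(N)])
y_star = np.concatenate([y, np.zeros(N)])

# X*'X* = (X'X + lam2 * I) / (1 + lam2)
assert np.allclose(X_star.T @ X_star, (X.T @ X + lam2 * np.eye(N)) / (1 + lam2))
# y*'X* = y'X / sqrt(1 + lam2)
assert np.allclose(y_star @ X_star, (y @ X) / np.sqrt(1 + lam2))
# y*'y* = y'y
assert np.isclose(y_star @ y_star, y @ y)
```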
[2] Bai, J., & Ng, S. Determining the number of factors in approximate factor models. Econometrica, 70(1), 191-221, 2002.
[3] Bai, J., & Ng, S. Forecasting economic time series using targeted predictors. Journal of Economet-
rics, 146(2), 304-317, 2008.
[4] Banbura, M., Giannone, D., & Reichlin, L. Large Bayesian vector auto regressions. Journal of Applied Econometrics, 25(1), 71-91, 2010.
[5] Bergmeir, C., Hyndman, R. J., & Koo, B. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120, 70-83, 2018.
[6] D’Agostino, A., Gambetti, L., & Giannone, D. Comparing alternative predictors based on large-panel factor models. Oxford Bulletin of Economics and Statistics, 74(2), 306-326, 2012.
[7] De Mol, C., Giannone, D., & Reichlin, L. Forecasting using a large number of predictors: is Bayesian
shrinkage a valid alternative to principal components?. Journal of Econometrics, 146(2), 318-328,
2008.
[8] Diebold, F. X., & Mariano, R. S. Comparing predictive accuracy. Journal of Business and Economic Statistics, 13(3), 253-263, 1995.
[9] Forni, M., Giovannelli, A., Lippi, M., & Soccorsi, S. Dynamic factor model with infinite-dimensional factor space: Forecasting. CEPR Discussion Paper no. DP11161, 1-43, 2016.
[10] Forni, M., Hallin, M., Lippi, M., & Reichlin L. The generalized dynamic-factor model: Identifica-
tion and estimation. Review of Economics and Statistics, 82(4), 540-554, 2000.
[11] Forni, M., Hallin, M., Lippi, M., & Reichlin, L. The generalized dynamic-factor model: One-sided estimation and forecasting. Journal of the American Statistical Association, 100, 830-840, 2005.
[12] Forni, M., & Lippi, M. The generalized dynamic-factor model: One-sided representation results. Journal of Econometrics, 163(1), 23-28, 2011.
[13] Friedman, J., Hastie, T., & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22, 2010.
[14] Giacomini, R., & Rossi, B. Forecast comparison in unstable environments. Journal of Applied
Econometrics, 25(4), 595-620, 2010.
[16] Hastie, T., Tibshirani, R. & Friedman, J. The elements of statistical learning: Data mining,
inference and prediction. Springer Series in Statistics. New York, 2009.
[17] Kim, H. H., & Swanson, N. R. Forecasting financial and macroeconomic variables using data
reduction methods: New empirical evidence. Journal of Econometrics, 178, 352-367, 2014.
[18] Ledoit, O., & Wolf, M. Honey, I shrunk the sample covariance matrix. Economics Working Papers 691, Department of Economics and Business, Universitat Pompeu Fabra, 1-22, 2003.
[19] Ledoit, O., & Wolf, M. A well-conditioned estimator for large-dimensional covariance matrices.
Journal of Multivariate Analysis, 88(2), 365-411, 2004.
[20] Li, J. & Chen, W. Forecasting macroeconomic time series: LASSO-based approaches and their
forecast combinations with dynamic factor models. International Journal of Forecasting, 30, 996-
1015, 2014.
[21] Marcellino, M., Stock, J. H., & Watson, M. W. A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series. Journal of Econometrics, 135, 499-526, 2006.
[22] McCracken, M., & Ng, S. FRED-MD: A monthly database for macroeconomic research. Journal of Business and Economic Statistics, 2015.
[23] McCracken, M., & Ng, S. FRED-MD: A quarterly database for macroeconomic research. Journal of Business and Economic Statistics, 2014.
[24] Ng, S. Variable selection in predictive regressions. In G. Elliott & A. Timmermann (Eds.), Handbook of Economic Forecasting, 2(B), 752-789, 2013.
[25] Stock, J. H., & Watson, M. W. Macroeconomic forecasting using diffusion indexes. Journal of
Business and Economic Statistics, 20, 147-162, 2002(a).
[26] Stock, J. H., & Watson, M. W. Forecasting using principal components from a large number of
predictors. Journal of the American Statistical Association, 97(460), 1167-1179, 2002(b).
[27] Tibshirani, R. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58, 267-288, 1996.
[28] Zou, H., & Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2), 301-320, 2005.