International Journal of Forecasting: Andrea Carriero, Ana Beatriz Galvão, George Kapetanios
Keywords: Factor models; BVAR models; MIDAS models; DSGE models; Density forecasts; Meta-analysis

Abstract

We employ datasets for seven developed economies and consider four classes of multivariate forecasting models in order to extend and enhance the empirical evidence in the macroeconomic forecasting literature. The evaluation considers forecasting horizons of between one quarter and two years ahead. We find that the structural model, a medium-sized DSGE model, provides accurate long-horizon US and UK inflation forecasts. We strike a balance between being comprehensive and producing clear messages by applying meta-analysis regressions to 2,976 relative accuracy comparisons that vary with the forecasting horizon, country, model class and specification, number of predictors, and evaluation period. For point and density forecasting of GDP growth and inflation, we find that models with large numbers of predictors do not outperform models with 13–14 hand-picked predictors. Factor-augmented models and equal-weighted combinations of single-predictor mixed-data sampling regressions are a better choice for dealing with large numbers of predictors than Bayesian VARs.
© 2019 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.ijforecast.2019.02.007
0169-2070/© 2019 International Institute of Forecasters. Published by Elsevier B.V. All rights reserved.
A. Carriero, A.B. Galvão and G. Kapetanios / International Journal of Forecasting 35 (2019) 1226–1239 1227
increasingly large datasets. The former has been driven by the widespread recognition that structural change is a leading cause of forecast failure. A number of approaches of varying degrees of sophistication are being used to accommodate structural change. These range from time-varying coefficient models to methods that allow for the time-varying estimation of standard econometric forecasting models. In this context, as is common with forecasting in general, these increases in sophistication have been found not to necessarily be correlated closely with superior forecasting performance.2 The second trend of considering large datasets has been spurred on by their use in many economic analyses, given their availability in central banks and other policy-making institutions.3

2 For example, Faust and Wright (2013) provide evidence that time-varying vector autoregressive models with stochastic volatility do not improve point forecasts of inflation in comparison with a univariate benchmark, although there is stronger evidence that stochastic volatility improves density forecasts of inflation (Clark, 2011). Chauvet and Potter (2013) consider Markov-switching models for the prediction of output growth, and find gains only during recessions and only at short horizons. Based on data for a set of countries, Ferrara, Marcellino, and Mogliani (2015) show that nonlinear models rarely improve on the forecasts of their linear counterparts.

3 The paper by Stock and Watson (2002) is influential in supporting the use of large datasets for forecasting macroeconomic variables. Other more recent contributions, all pointing towards the importance of using medium–large dataset for macroeconomic forecasting, include the studies by Bańbura, Giannone, and Reichlin (2010), Carriero, Clark, and Marcellino (2015), Giannone, Lenza, and Primiceri (2015) and Koop (2013).

The above developments set the scene for the current paper. Our goal is to provide a comprehensive, state of the art evaluation of recently proposed model classes for forecasting output growth and inflation, giving special attention to model classes that are able to deal with large numbers of predictors. The aim of the paper is to strike a balance between being comprehensive and producing clear messages. This requires us to consider a wide range of models, but to be selective in some dimensions so as to make the evaluation exercise feasible and informative. Furthermore, it requires an evaluation across a number of different countries and sample periods. Finally, we aim to compare and contrast reduced-form models and structural models, which have traditionally been considered inferior for forecasting purposes. This latter aspect of our analysis is found less commonly in forecasting evaluations.4

4 Density forecasts of DSGE models are evaluated by Del Negro and Schorftheide (2013) and Diebold, Schorftheide, and Shin (2017), but when DSGE models are compared with a large set of statistical models by Faust and Wright (2013) and Chauvet and Potter (2013), only point forecasts are considered. Note also that the set of forecasting models used for predicting inflation by Faust and Wright (2013) differs from the models considered by Chauvet and Potter (2013). While Faust and Wright (2013) consider horizons up to one year ahead, Chauvet and Potter (2013) choose to look at horizons up to two quarters only, whereas Del Negro and Schorftheide (2013) evaluate horizons up to two years ahead.

Forecasting comparisons in the literature normally focus on data from only a single country or a small subset of countries (US, UK and euro area).5 Instead, we will use data from seven economies: US, UK, euro area, Germany, France, Italy and Japan. For these seven economies, we compute forecasts for output growth and inflation using three classes of state-of-the-art reduced-form forecasting models: factor-augmented distributed lag (FADL) models, mixed data sampling (MIDAS) models, and Bayesian vector autoregressive (BVAR) models.6 These model classes are useful for exploring the predictive information contained in a large number of indicators. As a consequence, we build a dataset with a large number of monthly indicators for each country and assess the importance of employing large (100 predictors) datasets in comparison with medium-sized (a dozen predictors) and small datasets for macroeconomic forecasting. We also consider one class of structural models: a medium-sized dynamic stochastic general equilibrium model (DSGE). We compare the DSGE model’s performance with that of reduced-form models for forecasting output growth and inflation in the US, the UK and the euro area.

5 Those by Kuzin, Marcellino, and Schumacher (2013) and Stock and Watson (2003) are exceptions, in that they consider data from seven countries when designing their forecasting exercises. Ferrara et al. (2015) evaluate models for 19 countries, but they use only a relatively small set of predictors.

6 Time-varying vector autoregressive models, exploited as forecasting models by D’Agostino, Gambetti, and Giannone (2013), and vector autoregressive models with stochastic volatility, the forecasting performances of which were evaluated by Clark (2011), are two classes of models that are excluded from this forecasting comparison. The main reason for this is that these two classes cannot be adapted easily to large datasets. The approach proposed by Koop and Korobilis (2013) for large datasets considers a VAR with 25 variables as “large”, whereas this paper uses datasets up to 155 variables. We also use data from countries with shorter time series where structural changes are harder to identify. This paper considers just one class of mixed frequency models. Mixed frequency specifications are popular for nowcasting, as was shown in the survey by Bańbura, Giannone, Modugno, and Reichlin (2013), as well as the recent contribution by Schorfheide and Song (2015). Because we aim to evaluate forecasting performances from nowcasting up to long horizons (two years), we select just one class of mixed frequency models that has a relatively good nowcasting performance (Andreou, Ghysels, & Kourtellos, 2013; Kuzin et al., 2013).

We have some knowledge of the relative point forecasting performances of DSGE models relative to Bayesian VARs (as, for example, Smets and Wouters (2007)), of FADL relative to factor-augmented MIDAS models (Andreou et al., 2013), and of Bayesian VARs relative to dynamic factor models (Bańbura et al., 2010). This paper advances further by comparing the out-of-sample forecasting accuracies for point and density forecasts of output growth and inflation for the following classes of models: BVAR, FADL, MIDAS and DSGE models.

The design of our forecasting comparison with the elements described above implies that we are evaluating the forecasting performances of 13 reduced-form model specifications for predicting two quarterly macroeconomic time series over horizons from one to eight quarters ahead. We perform this comparison for seven different countries and consider four different subperiods of five years over a 20-year out-of-sample period. In order to get clear messages from our empirical exercise, we develop evaluation methods that pool forecasting performances across countries, model classes, forecasting origin periods and dataset sizes.

Our meta-analysis method employs a regression of the relative performance of each multivariate reduced-form
model on a set of characteristics. The relative performance is measured using the root mean squared forecast error for point forecasts and logscores for density forecasts. The performance is measured with respect to the autoregressive model for the same variable and horizon. The method allows us to assess the statistical significance of forecasting horizon, geographical source (country), model class, evaluation period and number of predictors (dataset size) in explaining forecasting performance.

A second evaluation method relies on t-statistics from a Diebold and Mariano (1995) equal forecast accuracy test over the 20-year evaluation period. We investigate the empirical distribution of t-statistics using an autoregressive model under the null. We use this approach to complement the results of the meta-analysis when comparing the point and density forecasting performances of specifications that use a large set of predictors with those of specifications that use a smaller set. We also use the empirical distributions of equal-accuracy t-statistics against an AR benchmark to evaluate how the forecasting accuracy of structural models compares with that of reduced-form models.

We find no support for the use of large datasets (100 predictors) rather than medium-sized ones (a dozen predictors). However, we provide evidence that the factor model and an equal-weighted combination of single-regressor MIDAS models are the best specifications for dealing with large datasets, since they perform better than Bayesian VARs on average. We find that DSGE models have relatively good performances for forecasting US and UK inflation at forecasting horizons of longer than one year.

The empirical results provide only limited support for the use of mixed frequency models, which exploit current-quarter information on monthly series, for improving nowcasts of output growth. The reason for this is that there is a large degree of cross-country variation in the nowcasting performances of mixed-frequency models. The results also suggest changes in the relative forecasting performances of forecasting models. The relative performances of reduced-form multivariate models are at their peak in the 1993–1997 period for inflation and in the 2008–2011 period for output growth.

We describe the classes of forecasting models in Section 2. Section 3 provides a summary of the datasets that we employed, which are reported fully in our online appendix. Section 4 describes the key elements of the design of our forecasting exercises, including the statistical tests employed. Section 5 explores the key determinants of the point and density macroeconomic forecasting performances of multivariate statistical models relative to those of AR models using meta-analysis regressions and the empirical distribution of equal-accuracy t-statistics. Evaluations of the point and density forecasting accuracies of structural models in comparison to reduced-form models are discussed in Section 6. Section 7 concludes.

2. Forecasting methods

This section describes the forecasting methods compared in this paper. In contrast to the recent evaluations of the forecasting of output and inflation by Chauvet and Potter (2013) and Faust and Wright (2013), respectively, we use the same set of forecasting model classes for predicting output growth and inflation. The advantage of this approach is that it allows us to evaluate whether we need different forecasting models for output and inflation. The disadvantage is that we do not evaluate forecasting methods that were designed for certain specific features of each variable, such as the UCSV models for inflation (Stock & Watson, 2007) and Markov-switching models for output (Chauvet, 1998). Another important feature of our forecasting exercise is that we consider both point and density forecasts. This density forecasting evaluation provides us with insights on the accuracy of forecasting models for the whole predictive distribution. The advantage of considering both point and density forecasts is that we can assess whether the choice of the loss function has an impact on model rankings.

The remainder of this section describes how we compute density forecasts of three reduced-form forecasting models: factor models, Bayesian VAR models and MIDAS models. We also describe how we obtain density forecasts using a structural DSGE model and simple univariate models.

In the text below, we use the following notation: $Q_t$, for $t = 1, \dots, T$, denotes the raw data; and $q_t = \log(Q_t)$ denotes the time series in log-levels. The variable in first differences is $\Delta q_t = 100 \times (q_t - q_{t-1})$. The forecast horizon is $h$, and the maximum forecast horizon is $h_{max}$.

2.1. Univariate models

We compute forecasts from univariate autoregressive (AR(p)) models. The autoregressive order is selected using the Schwarz information criterion (SIC) and assuming a maximum order of four. We compute the predictive density by bootstrap as per Clements and Taylor (2001). First, we get a full bootstrapped time series $\Delta q^*_{p+1}, \dots, \Delta q^*_T$ by using the OLS estimates, initial values $\Delta q_1, \dots, \Delta q_p$ and a $T - p$ bootstrapped time series from the residuals. Using the bootstrapped time series, we estimate an AR(p) model with the same autoregressive order as the original model. Then, we compute forecasts by iteration for $h = 1, \dots, h_{max}$, including a bootstrap draw from the residuals for each horizon. This bootstrap procedure will deliver sequential draws as $\Delta \hat{q}^{(i)}_{T+1}, \dots, \Delta \hat{q}^{(i)}_{T+h_{max}}$ for each time we reestimate the model on a new bootstrapped sample.

2.2. Factor models

We forecast with factors using the following FADL(p, k) equation for each horizon $h$:

$$\Delta q_t = \beta_0 + \sum_{i=0}^{p-1} \beta_{i+1} \Delta q_{t-h-i} + \sum_{j=1}^{r} \sum_{i=0}^{k-1} \gamma_{j,i+1} f_{j,t-h-i} + \varepsilon_t, \qquad (1)$$

where $r$ counts the number of factors $f$.

Factors are estimated by principal components applied to either a medium-sized (around 14 variables) or large (around 100 variables) dataset of predictors of $q_t$. Before performing factor estimation, we decide whether to
transform raw data to log-levels as is described in the “log vs. level” column in Tables B2 and B3 in the online appendix. Then we apply ADF unit root tests to define the order of differencing of each variable. Next, principal components analysis is applied to standardized data to compute the factors. We follow Groen and Kapetanios (2013) in choosing the number of factors. We first choose the autoregressive order $p$ in a univariate regression using the SIC, then we set $k = 1$ to choose the number of factors using the modified SIC of Groen and Kapetanios (2013), assuming a maximum number of factors of four. We have also tried to choose $r$ and $k$ jointly using the modified SIC, and normally $k = 1$ is the choice indicated, with the impact on the average forecasting performance being negligible even when $k$ should be larger.

We compute density forecasts from the FADL model by a fixed regressor bootstrap. We choose this specific approach because it takes into account both parameter and forecasting uncertainties when computing density forecasts, and because we will apply a similar approach, based on Aastveit, Foroni, and Ravazzolo (2016), for computing density forecasts with MIDAS models. This implies that we fix the variables on the right-hand-side (RHS) of the regression to their data values, and use bootstrapped values from the residuals to get a full bootstrapped time series $\Delta q^*_{p+1}, \dots, \Delta q^*_T$ for the left-hand-side (LHS).7 We then re-estimate the ADL regression using the bootstrapped LHS values and the fixed RHS values. Using bootstrapped coefficients, we compute a forecast draw $\Delta \hat{q}^{(i)}_{T+h}$, conditional on observed values for $\dots, \Delta q_{T-1}, \Delta q_T$, and using a bootstrap draw from the reestimated regression residuals. Note that this bootstrapping procedure will deliver the density for one specific forecasting horizon. Our factor modelling approach requires the estimation of a forecasting model for each horizon.

7 As a consequence, this approach does not take into account the uncertainty in the estimation of the factors, but only in the $\beta$s and $\gamma$s.

2.3. MIDAS models

The economic predictors in our dataset, summarized in Table 2, are sampled monthly. The factor approach described above requires the aggregation of monthly data into quarters. We exploit monthly information directly by employing an ADL-MIDAS model. The model is written as:

$$\Delta q_t = \beta_0 + \sum_{i=0}^{p-1} \beta_{i+1} \Delta q_{t-h-i} + \gamma \sum_{i=0}^{km-1} w(\theta; i)\, x_{t-mh-i+l} + \varepsilon_t,$$

where $m$ is the difference in sampling frequencies between $q_t$ and $x_t$, and $w(\theta; i)$ are the weights for each high frequency lag, which are functions of the parameters $\theta$. In our applications, $m = 3$, since $x_t$ is sampled monthly while $q_t$ is sampled quarterly. The autoregressive order in quarters is denoted by $k$, and $km$ is the autoregressive order in months such that lags of $x$ are counted in months. The number of lead months is represented by $l$ (named as per Andreou et al. (2013), though it was first employed for macroeconomic forecasting by Clements and Galvão (2008)). The intuition on the use of leads is that forecasts for current and future quarters are computed conditional on monthly observations of economic indicators during the current quarter. In the forecasting exercise, we set $l = 2$ for all $h$. This implies that we are considering typical nowcasting horizons if $h = 1$. This utilization of monthly data is the main advantage of the MIDAS approach for macroeconomic forecasting (Andreou et al., 2013; Clements & Galvão, 2008; Kuzin et al., 2013).

We measure the impact of the high frequency $x_t$ on the low frequency $q_t$ by first applying the weights $w(\theta; i)$ to all monthly lags, then multiplying by an intercept $\gamma$, which is identified because the weights sum to one. We use the beta function to obtain the weights, that is,

$$w(\theta; i) = \frac{f(\theta; i)}{\sum_{j=1}^{K} f(\theta; j)}, \qquad f(\theta; i) = \frac{(j)^{\theta_1 - 1} (1 - j)^{\theta_2 - 1}\, \Gamma(\theta_1 + \theta_2)}{\Gamma(\theta_1)\, \Gamma(\theta_2)}, \quad j = i/km.$$

The two parameters in $\theta$ are estimated jointly with the other parameters by nonlinear least squares. Note that, as in the case of the factor approach, we need to estimate a MIDAS regression for each forecasting horizon.

We compute density forecasts using a fixed regressor bootstrap, as per Aastveit et al. (2016) and as described in Section 2.2. Our application of the fixed regressor bootstrap to MIDAS models implies that we also fix $\theta$, that is, take $\theta = \hat{\theta}$ from the estimation with observed data, but obtain different values of $\beta_i$ and $\gamma$ for each bootstrapped sample. This has a large beneficial impact on our computational burden. Our density computation strategy is still able to capture the impact of parameter uncertainty on a set of parameters while computing forecasts. Note that, as in the case of factor models, the last step in computing $\Delta \hat{q}^{(i)}_{T+h}$ also requires a draw from the residuals of the re-estimated MIDAS regression.

We consider two different types of MIDAS specifications that are able to deal with large datasets. The first one assumes that $x$ is an individual predictor. Because we plan to employ sizeable datasets, we estimate a single regressor MIDAS model for each predictor, then combine their predictive densities using equal weights. We call this model the combination MIDAS (C-MIDAS) model. In this specification, we decide beforehand whether we will be using log, log-levels or quarterly differences for each of the indicators when using our medium dataset. Our choice of data transformation is indicated in Tables B2 and B3 in the online appendix.

The second specification first estimates factors with monthly data by principal components, applying the data transformation based on unit root tests described for FADL models. We then set the number of factors to one in the case of medium datasets and to two in the case of large datasets following Andreou et al. (2013). We call this specification the F-MIDAS model, and the regressors $x_t$ are the factors estimated in a previous step by principal components.

2.4. BVAR models

Our BVAR approach is the benchmark model of Carriero et al. (2015), who provide a summary of the literature on the application of BVARs to forecasting. Define the
vector $y_t = (q_{1t}, q_{2t}, \dots, q_{Nt})'$; then, a VAR(p) is:

$$y_t = A_0 + A_1 y_{t-1} + \dots + A_p y_{t-p} + \varepsilon_t, \qquad (2)$$
$$\varepsilon_t \sim N(0, \Sigma),$$

for $t = p + 1, \dots, T$.

We elicit a conjugate normal-inverse Wishart prior:

$$\alpha \mid \Sigma \sim N(\alpha_0, \Sigma \otimes \Omega_0), \qquad \Sigma \sim IW(S_0, v_0),$$

where $\alpha = \mathrm{vec}([A_0, A_1, \dots, A_p]')$, so that the posterior distributions are

$$\alpha \mid \Sigma, \text{data} \sim N(\bar{\alpha}, \Sigma \otimes \bar{\Omega}), \qquad \Sigma \mid \text{data} \sim IW(\bar{S}, \bar{v}).$$

Carriero et al. (2015) describe the closed-form solutions for the posterior and prior means and variances under the assumption that they follow a Minnesota-style prior as in Bańbura et al. (2010). We consider the prior means for the first-order autoregressive coefficients as equal to one if the endogenous variables, $y_t$, are in log-levels as described above. We also consider a specification in differences, using $\Delta y_t$, with the prior mean equal to zero.

In the case of the VAR in levels, we also impose the sum of coefficients prior, which expresses the belief that the average of the past values of a given variable provides a good forecast for that variable. The fact that, in the limit, the sum-of-coefficients prior is not consistent with cointegration motivates the use of an additional prior, known as the ‘dummy initial observation’ prior. This was proposed by Sims (1993) and avoids giving an unreasonably high explanatory power to the initial conditions, a pathology which is typical in nearly nonstationary models (Sims, 2000). These last two priors together tend to improve the forecasts when dealing with data in levels. Hyperparameters governing priors are set as in the baseline case of Carriero et al. (2015). The overall prior tightness $\lambda_1$ is selected to maximise the marginal likelihood:

$$\lambda_1 = \arg\max_{\lambda_1} \ln(p(Y)),$$

where $p(Y)$ is computed in closed form as per Carriero et al. (2015). The grid has 15 elements [0.01, 0.025, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.75, 1, 2, 5]. In an out-of-sample forecasting exercise, we compute $\lambda_1$ each time we re-estimate the model with a longer sample period.

Forecasts are computed by simulation. We use posterior draws of $\alpha$ and $\Sigma$ to obtain an implied path for $\hat{y}_{T+1}, \dots, \hat{y}_{T+h}$. Assume that $A = [A_0, A_1, \dots, A_p]'$, which is an $(Np + 1) \times N$ matrix; then, we obtain a draw $j$ for all autoregressive coefficients using:

$$A^{(j)} = \bar{A} + \mathrm{chol}(\bar{\Omega})\, V^{(j)}\, \mathrm{chol}(\Sigma^{(j)})',$$

where $V^{(j)}$ is an $(Np + 1) \times N$ matrix obtained from a standard normal distribution. Then, for a draw of $A^{(j)}$ and $\Sigma^{(j)}$, we take a sequence of $h$ draws from the $N(0, \Sigma^{(j)})$ in order to compute by iteration a sequence of forecasts $\hat{y}_{T+1}, \dots, \hat{y}_{T+h}$ for the model in Eq. (2). We use a total of 5000 draws, and split the procedure such that we use a small number of draws of $A^{(j)}$ and $\Sigma^{(j)}$, then generate many sequences of forecasts for each parameter draw. The point forecast is the median over all draws for each horizon.

We consider specifications both in levels, which we call L-BVAR, and in differences, called D-BVAR. We set $p = 4$. When the target forecasting variable is the quarterly growth rate, we transform the forecasts for the model in levels accordingly.

2.5. DSGE models

The literature provides evidence of the accuracy of the medium-sized Smets and Wouters (2007) model (Christoffel, Coenen, & Warne, 2010; Del Negro & Schorftheide, 2013; Edge & Gurkaynak, 2011; Woulters, 2015). We employ the Smets-Wouters DSGE model with seven observables, including output and inflation, as our structural model. We use the specification of Herbst and Schorftheide (2012) and Smets and Wouters (2007), which assumes a deterministic trend to productivity.

We use the priors of Herbst and Schorftheide (2012) and Smets and Wouters (2007). The posterior distribution of the structural parameters is obtained using the random walk Metropolis algorithm described by Del Negro and Schorftheide (2011), and we calibrate the spread parameter such that the acceptance rate is in the 20%–40% range for each country’s dataset. We compute the predictive density using 5000 equally-spaced draws from the posterior parameter draws generated by the MCMC procedure. For each parameter draw, we also draw from the normal distribution of the disturbances (structural shocks) to get a sequence of forecasts from $h = 1, \dots, h_{max}$ for each observed variable.

We compute forecasts with DSGE models for only three economies in our dataset: the US, the UK and the euro area. The reason for this is that the assumption in the model that the central bank sets interest rates based on a Taylor rule, which depends on domestic inflation, is not adequate for countries which are part of the euro area. We also choose not to apply it to Japan, again because the Taylor rule may be a very poor approximation of the Bank of Japan’s monetary policy over the last 20 years. We apply the model to euro area data by adding an equation that links employment to hours such that we can use the employment time series instead of hours, following the modification proposed by Christoffel, Coenen, and Warne (2008).

3. Data description

We employ data from seven developed economies: US, UK, the euro area, Germany, France, Italy and Japan. Our target variables are the quarterly change in log real GDP and the quarterly change in seasonally-adjusted log CPI, with the data sources being described in Table B1 in the online appendix. Seasonally-adjusted CPI data are not available for the European countries or Japan. As a consequence, we seasonally adjusted the data using the X12 filter.

For each country, we build a medium and a large dataset of economic indicators, sampled monthly. The
terly data are required, we use the average over quarter m Name Description
for factor models, so F-MIDAS nest FADL models,8 and the 1 AR Autoregressive model
2 FADL_M Factor ADL model with a medium-sized dataset
end of the quarter value for the BVAR, as is popular in the
3 FADL_L Factor ADL model with a large dataset
BVAR literature. When possible, we follow the series in- 4 F-MIDAS_M Factor MIDAS with a medium-sized dataset
cluded in the datasets of Kuzin et al. (2013) . The medium 5 F-MIDAS_L Factor MIDAS with a large dataset
dataset includes 11–14 variables per country. These are 6 C-MIDAS_M Combination MIDAS with a medium-sized
dataset
a mix of measures of economic activity, including sur-
7 C-MIDAS_L Combination MIDAS with a large dataset
vey data, prices and financial variables. Similar sets of 8 L-BVAR_S BVAR in levels with a small dataset
variables were employed by Carriero et al. (2015). These 9 D-BVAR_S BVAR in differences with a small dataset
datasets include oil prices as a common variable. 10 L-BVAR_M BVAR in levels with a medium-sized dataset
The number of variables included in the large dataset 11 D-BVAR_M BVAR in differences with a medium-sized
dataset
varies across countries due to data availability, as is 12 L-BVAR_L BVAR in levels with a large dataset
recorded in Table 2. It varies between 57 (Japan, France) 13 D-BVAR_L BVAR in differences with a large dataset
and 155 (US). The large dataset also includes all variables 14 DSGE Smets and Wouters’ (2007) medium-sized DSGE
in the medium dataset. Because of the international trans- model
mission of business cycle shocks, we include some key
US variables in the large datasets of the six remaining
economies, including financial variables such as equity the forecast accuracy for forecasts up to 2011Q3; that
prices and Treasury bond rates. We provide descriptions is, we have 75 observations in our out-of-sample period
of all variables, including their datastream codes, in Table for the US, the UK, Japan and France, 55 observations for
B3 of the online appendix.9 Germany and Italy, and 35 observations for the euro area.
Due to the lack of availability of a real-time dataset For some of our results, we split the out-of-sample period
for the monthly indicators for our seven countries, we into windows of five years (20 observations), based on
use only data from the currently available vintage, as is the forecast origin date, to verify whether the relative
generally the case when evaluating forecasts using models for large datasets (such as Kuzin et al. (2013) and Smets and Wouters (2007), for example).

DSGE models are estimated using the quarterly growth rate in output per capita. They also use inflation as measured by the GDP deflator. As a consequence, when evaluating the forecasts of the DSGE models, we change the target variables to the growth in output per capita and the quarterly GDP deflator inflation. We then re-estimate the forecasting models for these modified target variables for a subset of our reduced-form models to enable us to compare the predictions of structural and reduced-form models. Table B4 in the online appendix describes the variables employed in the DSGE estimation, including their required transformations.

The last observation employed in our forecasting exercise is 2013M9. For the US, Japan and the UK, we use data from 1975M1 (with the exception of UK CPI inflation, which is available only from 1980M1), but for other countries, data are only available later, as is described in Table 2. The data for DSGE estimation are from 1984Q1 for the US, the UK and the euro area.

4. Evaluation design

Our first forecast origin is 1993Q1 for the US, the UK, Japan and France; 1998Q1 for Germany and Italy; and 2003Q1 for the euro area. We set the maximum forecast horizon to eight, enabling us to compute measures of how the forecasting performance varies over the out-of-sample period. The literature provides evidence that the predictive ability may change over time (Giacomini & Rossi, 2010). In addition, changes in the underlying structure of the economy and the data characteristics may affect the models' relative forecasting performances.

We compute forecasts from models estimated with expanding samples over the out-of-sample period, that is, we re-estimate each model at each forecast origin and use all observations available up to the forecast origin.

We use two measures of the forecasting performance. The accuracy of the point forecasts is measured using root mean squared forecast errors (RMSFE), while the log predictive score measures the accuracy of the density forecasts. The advantage of using log scores to compare density forecasts is that the maximization of the logscore is equivalent to the minimization of the Kullback–Leibler distance between the model and the true density. We compute log scores by first fitting a Gaussian kernel density to the 5000 predictive density draws over a grid between −15 and 15, then evaluating the fitted density at the outturn.

We use the Diebold and Mariano (1995) t-statistic to test for equal accuracy. The variance is computed using the Newey–West estimator with the maximum order increasing with the horizon.

Table 1 provides a short description of each of the forecasting models that we employ in this evaluation. Similarly to Bańbura et al. (2010), we consider BVAR models of three sizes: small, medium and large. We use medium and large datasets for the FADL and MIDAS models, but our only small model is the BVAR. This model has only three variables: real GDP, CPI, and the short-term interest rate.

⁸ This implies that the F-MIDAS specification nests the FADL if the MIDAS weighting function is flat, that is, θ1 = θ2 = 1.
⁹ Some variables were seasonally adjusted by the X12 filter before estimation, and these are marked SA in Table B3.
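The log-score computation described in this section can be illustrated with a minimal sketch. It assumes a Gaussian kernel with Silverman's rule-of-thumb bandwidth and linear interpolation on the grid; the text specifies neither, and the function name `log_score_from_draws` and the grid resolution are hypothetical choices made only for illustration.

```python
import numpy as np


def log_score_from_draws(draws, outturn, grid_min=-15.0, grid_max=15.0, n_grid=601):
    """Log predictive score at `outturn`, from predictive draws smoothed by a Gaussian KDE."""
    draws = np.asarray(draws, dtype=float)
    n = draws.size
    # Silverman's rule-of-thumb bandwidth: an assumption, the text names no rule.
    bandwidth = 1.06 * draws.std(ddof=1) * n ** (-0.2)
    grid = np.linspace(grid_min, grid_max, n_grid)
    # Kernel density estimate evaluated at every grid point.
    z = (grid[:, None] - draws[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2.0 * np.pi))
    # Density at the realised value by linear interpolation between grid points.
    # Outturns outside the grid are clamped to the edge value by np.interp.
    return np.log(np.interp(outturn, grid, density))


# Illustrative use with simulated draws (not data from the paper).
rng = np.random.default_rng(0)
simulated_draws = rng.normal(loc=0.5, scale=1.0, size=5000)
print(log_score_from_draws(simulated_draws, outturn=0.4))
```

A higher value means that the predictive density placed more mass near the outturn; the median of these values over forecast origins would give the MLS used later in the paper.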
Table 2
Data summary.
    Country     Sample period     Out-of-sample period   Medium dataset:         Large dataset:
                                                         number of predictors    number of predictors
1   US          1975M1–2013M9     1993Q1–2013Q3          14                      155
2   UK          1975M1–2013M9     1993Q1–2013Q3          13                      59
3   Japan       1975M1–2013M9     1993Q1–2013Q3          13                      57
4   France      1983M1–2013M9     1993Q1–2013Q3          13                      57
5   Italy       1990M1–2013M9     1998Q1–2013Q3          12                      128
6   Germany     1991M1–2013M9     1998Q1–2013Q3          13                      114
7   Euro area   1998M4–2013M9     2003Q1–2013Q3          11                      81
Note: Full descriptions of the time series employed, data sources and data transformations are available
in the online data appendix. Table B1 describes the target variables, Table B2 describes the monthly
medium-sized datasets, Table B3 reports the monthly large datasets, and Table B4 contains the
descriptions and transformations of the series employed for estimating the DSGE models.
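The equal accuracy test described in Section 4 can be sketched as follows. The sketch assumes a Bartlett-kernel Newey–West long-run variance with maximum lag h − 1; the paper states only that the maximum order increases with the horizon, so this lag rule and the function names are illustrative assumptions. The sign convention follows the figures: a negative statistic favours the model under the alternative.

```python
import numpy as np


def newey_west_lrv(d, max_lag):
    """Newey-West (Bartlett kernel) long-run variance of the series d."""
    d = np.asarray(d, dtype=float)
    n_obs = d.size
    u = d - d.mean()
    lrv = u @ u / n_obs
    for k in range(1, max_lag + 1):
        gamma_k = u[k:] @ u[:-k] / n_obs          # k-th sample autocovariance
        lrv += 2.0 * (1.0 - k / (max_lag + 1.0)) * gamma_k
    return lrv


def dm_tstat(loss_null, loss_alt, horizon):
    """Diebold-Mariano t-statistic for d_t = loss_alt_t - loss_null_t.

    Losses are squared forecast errors for point forecasts, or minus the
    log score for density forecasts. The lag rule h - 1 is an assumption.
    """
    d = np.asarray(loss_alt, dtype=float) - np.asarray(loss_null, dtype=float)
    lrv = newey_west_lrv(d, max_lag=horizon - 1)
    return d.mean() / np.sqrt(lrv / d.size)


# Illustrative use with made-up squared errors (not data from the paper).
sq_err_medium = np.array([1.0, 2.0, 3.0, 4.0])    # model under the null
sq_err_large = np.array([0.0, 2.0, 2.0, 3.0])     # model under the alternative
print(dm_tstat(sq_err_medium, sq_err_large, horizon=1))
```

With a two-sided 5% test the statistic would be compared with ±1.96, and with a one-sided 5% test with −1.65, as in the later sections.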
5. Explaining the forecasting performances of statistical models

We provide acronyms for each of the forecasting models included in this evaluation in Table 1. They comprise 13 reduced-form models, including a univariate model (AR), and one structural model (DSGE). This section explores the relative forecasting performances of the 12 multivariate reduced-form models, listed as models 2 to 13 in Table 1. Forecasting comparisons that include the DSGE model are discussed in Section 6. We measure the impacts of the model class, forecasting horizon, dataset size and data source (country) on the point and density forecasting performances.

5.1. A meta-analysis

Our aim is to investigate how the relative (to the AR model) forecasting performance of each statistical model class (MIDAS, FADL and BVAR) varies with the number of predictors (medium vs. large dataset), the forecasting horizon (nowcasting, short horizon ($h = 2, \ldots, 4$) and medium horizon ($h = 5, \ldots, 8$)), the five-year subperiod evaluated, and the geographical source of the dataset.

The dependent variable in our meta-analysis regression is a measure of the forecasting performance of a specific forecasting model relative to that of the autoregressive model when predicting one of the target variables (output growth and inflation) for a specific country, horizon and forecasting origin period. The measures of the forecasting performance are based on root mean squared forecast errors (RMSFE) and the median logscore (MLS)¹⁰ computed for a specific target variable that varies across countries, forecasting models, periods and horizons. The measures for point and density forecasting performances are:

$$ rMSFE_{m,p,c,h} = \frac{RMSFE_{AR,p,c,h}}{RMSFE_{m,p,c,h}}, $$

$$ rMLS_{m,p,c,h} = 1 + \left[ (-MLS_{AR,p,c,h}) - (-MLS_{m,p,c,h}) \right], $$

where $m = 2, \ldots, 13$, which are the statistical models numbered 2 to 13 in Table 1. Each measure varies with the set of forecasting origins employed in the computation, p = 1993Q1–1997Q4, 1998Q1–2002Q4, 2003Q1–2007Q4, 2008Q1–2011Q3, 1993Q1–2011Q3; with the source country c = US, UK, EU, FR, IT, GER, JP; and with the forecasting horizon $h = 1, \ldots, 8$.

As a consequence, the total number of relative performance observations (given that the forecasting period availability varies across countries, as is noted in Table 2) is 2976. By exploiting a large set of forecasting comparisons, we aim to find sources of performance improvements in macroeconomic forecasting that are not constrained by the model class, forecast horizon, country or evaluation period.

The first characteristic that we explore is the country from which the data are sourced. We use two dummy variables to split the country set in Table 2 into three: $D_{EU} = 1$ for euro area countries (c = EU, FR, IT, GER) (and $D_{EU} = 0$ otherwise), and $D_{JP} = 1$ if c = JP. Thus, the benchmark countries are the US and the UK.

The second characteristic is the forecasting horizon. We split the set of forecast horizons into three groups by defining $D_{sh} = 1$ if $h = 2, \ldots, 4$ and $D_{mh} = 1$ if $h = 5, \ldots, 8$. Accordingly, differences in performance over short and medium horizons are assessed against the nowcasting ($h = 1$) benchmark.

We are also interested in finding differences among the three model classes. We set $D_{MIDAS} = 1$ if m = 4, 5, 6, 7 and $D_{BVAR} = 1$ if m = 8, 9, 10, 11, 12, 13, based on the description in Table 1. The benchmark model class is the FADL (m = 2, 3). The impact of the number of predictors is evaluated using $D_{small} = 1$ if m = 8, 9 and $D_{large} = 1$ if m = 3, 5, 7, 12, 13, implying that the benchmark dataset size is the medium one.

Finally, the impact of the evaluation period is assessed by creating a dummy variable for each of the four five-year out-of-sample subperiods. As a consequence, performance improvements are relative to the full out-of-sample period (p = 1993Q1–2011Q3).

We also consider interactions among the dummy variables described above. We consider interactions between horizon and model class dummies, between $D_{large}$ and model class dummies, and between $D_{large}$ and evaluation period dummies.

¹⁰ We use the median rather than the mean logscore in order to minimize the impact of outliers in our analysis. Outlier values are more frequent with logscores than with squared forecast errors.
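The design of the meta-analysis can be checked with a short enumeration. The sketch below assumes that each country contributes the five-year subperiods that start at or after its first forecast origin, plus one full-period observation; this rule is an inference that reproduces the reported total, not a statement taken from the text. All function and variable names are hypothetical.

```python
from itertools import product

MODELS = range(2, 14)          # statistical models 2 to 13 in Table 1
HORIZONS = range(1, 9)         # h = 1, ..., 8
SUBPERIODS = ["1993-1997", "1998-2002", "2003-2007", "2008-2011"]
FULL = "1993-2011"             # label for the full period (starts later for some countries)
# Year of the first forecast origin by country (Section 4).
FIRST_ORIGIN = {"US": 1993, "UK": 1993, "JP": 1993, "FR": 1993,
                "GER": 1998, "IT": 1998, "EU": 2003}


def periods_for(country):
    """Assumed rule: subperiods starting at or after the first origin, plus the full period."""
    start = FIRST_ORIGIN[country]
    return [p for p in SUBPERIODS if int(p[:4]) >= start] + [FULL]


def design_cells(models=MODELS):
    """All (model, period, country, horizon) cells with a relative performance measure."""
    return [(m, p, c, h)
            for c in FIRST_ORIGIN
            for p in periods_for(c)
            for m, h in product(models, HORIZONS)]


def dummies(m, p, c, h):
    """Dummy variables as defined in Section 5.1; the benchmark cell has all zeros."""
    return {
        "D_EU": int(c in {"EU", "FR", "IT", "GER"}),
        "D_JP": int(c == "JP"),
        "D_sh": int(2 <= h <= 4),
        "D_mh": int(5 <= h <= 8),
        "D_MIDAS": int(m in {4, 5, 6, 7}),
        "D_BVAR": int(8 <= m <= 13),
        "D_small": int(m in {8, 9}),
        "D_large": int(m in {3, 5, 7, 12, 13}),
        **{f"D_{q}": int(p == q) for q in SUBPERIODS},
    }


def relative_measures(rmsfe_ar, rmsfe_m, mls_ar, mls_m):
    """rMSFE and rMLS; values above one mean that model m improves on the AR."""
    return rmsfe_ar / rmsfe_m, 1.0 + ((-mls_ar) - (-mls_m))


print(len(design_cells()))     # 12 models x 8 horizons x 31 country-period pairs = 2976
```

Under the same assumed rule, restricting the models to m = 4, ..., 7 or to m = 8, ..., 13 gives 992 and 1488 cells, which match the numbers of observations reported in Table 4.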
Table 3
Explaining relative forecasting performances by country, forecasting origin period, horizon,
model class and dataset size.
                                  rMSFE                           rMLS
                                  Output growth   Inflation       Output growth   Inflation
const (FADL_M, h = 1,             1.037***        1.017***        1.041***        1.057***
  1993–2011, US+UK)               (0.026)         (0.015)         (0.042)         (0.054)
Japan                             0.016*          0.068***        0.011           −0.022
                                  (0.009)         (0.004)         (0.028)         (0.057)
Euro area                         −0.038*         −0.012          −0.100          −0.066
  (Euro, GER, IT, FR)             (0.015)         (0.028)         (0.064)         (0.068)
1993Q1–1997Q4                     −0.068          0.077           −0.001          0.068
                                  (0.043)         (0.047)         (0.025)         (0.044)
1998Q1–2002Q4                     −0.037***       0.022           0.022           −0.016
                                  (0.008)         (0.036)         (0.031)         (0.027)
2003Q1–2007Q4                     0.003           0.006           0.001           −0.005
                                  (0.009)         (0.032)         (0.024)         (0.046)
2008Q1–2011Q3                     0.039**         0.013           0.068           −0.008
                                  (0.020)         (0.025)         (0.056)         (0.039)
h = 2,...,4                       −0.040*         −0.012          −0.002          −0.028
                                  (0.023)         (0.025)         (0.030)         (0.019)
h = 5,...,8                       −0.028          −0.035          −0.065          −0.057*
                                  (0.028)         (0.039)         (0.063)         (0.035)
MIDAS (h = 1)                     0.029           −0.002          0.011           0.012
                                  (0.029)         (0.039)         (0.025)         (0.047)
BVAR (h = 1)                      −0.013          −0.011          0.004           −0.098*
                                  (0.017)         (0.031)         (0.039)         (0.057)
(h = 2,...,4)*MIDAS               −0.021          −0.033          −0.035**        −0.051
                                  (0.027)         (0.032)         (0.018)         (0.043)
(h = 5,...,8)*MIDAS               −0.060**        −0.006          −0.048**        −0.024
                                  (0.026)         (0.025)         (0.021)         (0.049)
(h = 2,...,4)*BVAR                0.058***        −0.013          −0.025          −0.007
                                  (0.020)         (0.026)         (0.024)         (0.037)
(h = 5,...,8)*BVAR                0.024           0.006           −0.019          0.028
                                  (0.026)         (0.028)         (0.057)         (0.055)
Small                             −0.031*         0.027           −0.005          0.002
                                  (0.019)         (0.025)         (0.015)         (0.034)
Large                             −0.003          −0.048***       −0.011          −0.019
                                  (−0.003)        (0.018)         (0.007)         (0.016)
Large*BVAR                        −0.029*         −0.070*         −0.163***       −0.201***
                                  (0.017)         (0.042)         (0.055)         (0.041)
Large*MIDAS                       0.001           0.021           −0.013          −0.030*
                                  (0.014)         (0.023)         (0.026)         (0.018)
Large*(1993–1997)                 −0.031          0.127           0.037           0.121***
                                  (0.024)         (0.081)         (0.038)         (0.041)
Large*(1998–2002)                 −0.021**        −0.062          0.007           −0.040
                                  (0.010)         (0.048)         (0.024)         (0.027)
Large*(2003–2007)                 −0.002          0.012           −0.010          −0.021
                                  (0.007)         (0.014)         (0.015)         (0.015)
Large*(2008–2011)                 0.008           −0.003          0.009           0.001
                                  (0.014)         (0.021)         (0.022)         (0.026)
R2                                0.154           0.126           0.156           0.198
No. of obs.                       2976            2976            2976            2976
Mean of dep. var.                 0.978           0.986           0.929           0.888
Note: Values larger than one imply that the model improves over the AR. All explanatory
variables are dummy variables. The regressions are estimated by OLS. The values in
brackets are standard errors clustered by country. Values in bold denote estimates that
are statistically significant at the 10% level if we use heteroscedasticity-consistent (White)
standard errors.
* indicates that the null of a zero coefficient is rejected at the 10% level.
** indicates that the null of a zero coefficient is rejected at the 5% level.
*** indicates that the null of a zero coefficient is rejected at the 1% level.
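The note to Table 3 states that the regressions are estimated by OLS with standard errors clustered by country. A minimal sketch of that estimator is given below, assuming the plain sandwich formula without the small-sample corrections that statistical packages commonly apply (which can matter with only seven clusters). The function name `ols_cluster` and the simulated data are hypothetical and do not reproduce any estimate in the table.

```python
import numpy as np


def ols_cluster(y, X, groups):
    """OLS coefficients with cluster-robust (sandwich) standard errors.

    No finite-sample correction is applied; this is an assumption of the sketch.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        in_g = groups == g
        score = X[in_g].T @ resid[in_g]       # sum of x_i * e_i within the cluster
        meat += np.outer(score, score)
    cov = xtx_inv @ meat @ xtx_inv
    return beta, np.sqrt(np.diag(cov))


# Illustrative use: a constant and two dummies, seven clusters, simulated data.
rng = np.random.default_rng(0)
n_obs = 700
country = rng.integers(0, 7, size=n_obs)
d_large = rng.integers(0, 2, size=n_obs)
d_bvar = rng.integers(0, 2, size=n_obs)
X_sim = np.column_stack([np.ones(n_obs), d_large, d_bvar])
y_sim = 1.0 - 0.05 * d_large + rng.normal(0.0, 0.1, size=n_obs)
coef, se = ols_cluster(y_sim, X_sim, country)
print(coef, se)
```

When every observation forms its own cluster, the formula reduces to the heteroscedasticity-consistent (White) standard errors that the note uses for the values in bold.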
Table 4
Additional regressions.
A: C-MIDAS vs F-MIDAS with observations for MIDAS specifications only
                                  rMSFE                           rMLS
                                  Output growth   Inflation       Output growth   Inflation
C-MIDAS                           0.048***        0.078**         0.041***        0.129***
                                  (0.006)         (0.034)         (0.011)         (0.028)
R2                                0.031           0.037           0.006           0.048
No. of obs.                       992             992             992             992
Mean of dep. var.                 0.971           0.991           0.941           0.940

B: D-BVAR vs L-BVAR with observations for BVAR specifications only
                                  rMSFE                           rMLS
                                  Output growth   Inflation       Output growth   Inflation
D-BVAR                            −0.037*         −0.050          −0.008          0.069
                                  (0.021)         (0.056)         (0.070)         (0.093)
R2                                0.022           0.015           0.001           0.020
No. of obs.                       1488            1488            1488            1488
Mean of dep. var.                 0.981           0.977           0.904           0.824
Note: The regressions are estimated by OLS. The values in brackets are standard errors
clustered by country. Values in bold denote estimates that are statistically significant at
the 10% level if we use heteroscedasticity-consistent (White) standard errors.
* indicates that the null of a zero coefficient is rejected at the 10% level.
** indicates that the null of a zero coefficient is rejected at the 5% level.
*** indicates that the null of a zero coefficient is rejected at the 1% level.
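Each panel of Table 4 reports a regression of the relative performance measure on a constant and a single dummy. In such a regression the OLS slope equals the difference between the two group means, which the following sketch verifies numerically. The function name and the numbers are made up for illustration and are not taken from the paper.

```python
import numpy as np


def single_dummy_ols(r_loss, dummy):
    """Intercept and slope from an OLS regression of r_loss on a constant and one dummy."""
    r_loss = np.asarray(r_loss, dtype=float)
    dummy = np.asarray(dummy, dtype=float)
    X = np.column_stack([np.ones(r_loss.size), dummy])
    intercept, slope = np.linalg.lstsq(X, r_loss, rcond=None)[0]
    return intercept, slope


# Illustrative values: dummy = 1 for one specification type (for example C-MIDAS).
r_loss_example = np.array([1.00, 0.90, 1.10, 1.05])
dummy_example = np.array([0, 0, 1, 1])
intercept, slope = single_dummy_ols(r_loss_example, dummy_example)
group_gap = r_loss_example[dummy_example == 1].mean() - r_loss_example[dummy_example == 0].mean()
print(slope, group_gap)        # the two numbers coincide up to rounding error
```

A positive slope therefore says that the specification coded as one has, on average, a higher relative performance than the specification coded as zero.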
The meta-analysis regression is then:

$$
\begin{aligned}
rLoss_{m,p,c,h} = {} & \beta_0 + \beta_1 D_{JP} + \beta_2 D_{EU} \\
& + \beta_3 D_{9397} + \beta_4 D_{9802} + \beta_5 D_{0307} + \beta_6 D_{0811} \\
& + \beta_7 D_{sh} + \beta_8 D_{mh} + \beta_9 D_{MIDAS} + \beta_{10} D_{BVAR} \\
& + \beta_{11} D_{MIDAS} \ast D_{sh} + \beta_{12} D_{MIDAS} \ast D_{mh} + \beta_{13} D_{BVAR} \ast D_{sh} + \beta_{14} D_{BVAR} \ast D_{mh} \\
& + \beta_{15} D_{small} + \beta_{16} D_{large} + \beta_{17} D_{large} \ast D_{BVAR} + \beta_{18} D_{large} \ast D_{MIDAS} \\
& + \beta_{19} D_{large} \ast D_{9397} + \beta_{20} D_{large} \ast D_{9802} + \beta_{21} D_{large} \ast D_{0307} + \beta_{22} D_{large} \ast D_{0811} \\
& + \varepsilon_{m,p,c,h},
\end{aligned}
\tag{3}
$$

for $m = 2, \ldots, 13$; p = 1993–1997, 1998–2002, 2003–2007, 2008–2011, 1993–2011; $h = 1, \ldots, 8$; c = US, UK, JP, FR, IT, GER, EU;

where $rLoss_{m,p,c,h}$ is either $rMSFE_{m,p,c,h}$ or $rMLS_{m,p,c,h}$. Note that $\beta_0$ measures the relative (to the AR model) performance of the FADL medium model (m = 2) for h = 1 over the full sample period (p = 1993–2011) with US and UK data (c = 1, 2). As a consequence, all other coefficient estimates are measures of gains/losses relative to this benchmark.

5.2. Meta-analysis results

Table 3 presents estimates of the regression in Eq. (3) with standard errors clustered by country, implying that we consider country-specific effects. The table columns describe the results for each performance measure (rMSFE and rMLS) and target variable (output growth, inflation). Cases in which the null hypothesis that the coefficient is equal to zero is rejected are indicated with stars for the 10%, 5% and 1% significance levels. Values in bold show that the estimates are statistically significant at the 10% level when using heteroscedasticity-robust standard errors instead of the country-clustered standard errors displayed in Table 3.

The characteristics considered in Eq. (3) explain between 13% and 20% of the variation in forecasting performance, depending on the target and the type of performance measure. As a consequence, idiosyncratic variation plays an important role in explaining forecasting performances across this large number of forecasting exercises. The following analysis will consider characteristics that have a statistically significant role in explaining the forecasting performance, as indicated in Table 3.

The estimates of the regressions' intercepts are all larger than one, implying that the FADL_M improves on the AR on average when nowcasting US and UK variables. The gains for output growth are larger, and imply a 4% improvement in RMSFE. The estimates for $\beta_1$ and $\beta_2$ suggest that the benefits of employing multivariate models for predicting output growth rather than AR models are larger with Japanese data but smaller with European data.

The estimated coefficients on the evaluation period dummies point to changes in the statistical performances over time, but the estimates are statistically significant with country-clustered standard errors only when evaluating output growth point forecasts. We find that multivariate models perform relatively better for output growth during the turbulent 2008Q1–2011Q3 period, but relatively worse in the 1998Q1–2002Q4 period.

The estimated coefficients on the forecasting horizon dummies are all negative, implying that the performances of multivariate models relative to the AR model deteriorate with the horizon. This deterioration is statistically
significant for point forecasts of output growth and inflation when the horizon is interacted with the MIDAS model dummy variable. This decline in MIDAS forecasting performance with the horizon is compensated in part by the fact that MIDAS models' RMSFEs improve on the benchmark by 3% on average when nowcasting output growth, although the estimate of $\beta_9$ is not statistically significant. For predicting output growth, BVAR models do relatively better at medium horizons ($h = 5, \ldots, 8$) and are significantly better at h = 2, 3, 4. These results suggest that although MIDAS models may deliver accurate nowcasts of output growth for some countries, the performance of this class of models deteriorates rapidly with the forecast horizon, meaning that a BVAR specification may be a more accurate choice in some cases.

The estimated coefficients on the dataset-size dummies indicate that BVARs with only three variables, including both targets, are significantly worse for predicting output growth than models with a moderate number of indicators. For predicting inflation, models with either a small or a medium set of indicators perform significantly better than models with large datasets.

The interactions between the dataset size and model class clearly indicate that large BVAR models lead to a deterioration in forecasting performance. These results suggest that models with factors (FADL and F-MIDAS) or forecasting combinations (C-MIDAS) are more adequate than BVAR models if the aim is to exploit the information in a large number of predictors (more than 55 indicators) for forecasting output growth and inflation. However, there is no evidence that the use of a large number of predictors instead of a dozen hand-picked variables (medium dataset) improves macroeconomic forecasting. By evaluating the estimates for the interactions between $D_{large}$ and the sample period, we find that using a large set of predictors worsens the output growth point forecasting performance in the earlier periods, when the samples employed in the estimation are shorter (recall that we increase the sample size when estimating models at each forecasting origin).

In summary, we find some time variation in the forecasting performances of multivariate statistical models relative to AR models for forecasting output growth across countries: multivariate models are of particular use during the last four-year period (2008–2011). We find no evidence that models with larger numbers of predictors improve on the performances of models with smaller sets of predictors. If using a large dataset, FADL and MIDAS models are more adequate than BVAR models. We find very limited evidence that MIDAS models improve nowcasts.

5.3. Additional meta-analysis comparisons

We consider two main specification types for MIDAS and BVAR model classes. For MIDAS models, we compute forecasts using both a factor-augmented version (F-MIDAS) and an equal-weight forecasting combination strategy (C-MIDAS). For BVAR models, we use one specification in levels (L-BVAR) and another in growth rates (D-BVAR). This subsection uses relative performance regressions to test whether there are any statistical differences in performance between these specification types that hold across countries, horizons, evaluation periods and numbers of predictors.

Table 4A presents results for the four performance measures in Table 3 (output growth and inflation; rMSFE and rMLS). These are single regressions that are estimated using performance measures computed only for MIDAS models ($rLoss_{m,p,c,h}$ for $m = 4, \ldots, 7$ and with p, h and c varying as in Eq. (3)). We define the dummy variable $D_{CMIDAS}$ as equal to 1 if m = 6, 7. As a consequence, if the estimated coefficient of $D_{CMIDAS}$ is significantly positive, we can conclude that the equal-weighted forecasting combination of single-regressor MIDAS models is a better way of exploiting the information in a set of predictors than using monthly factors. The coefficients are indeed positive and statistically significant with country-clustered standard errors in all columns of Table 4A, so we conclude in favour of the C-MIDAS specifications.

Table 4B computes single regressions with the same performance measures, but for BVAR models only ($rLoss_{m,p,c,h}$ for $m = 8, \ldots, 13$ and with p, h and c varying as in Eq. (3)). We define the dummy variable $D_{DBVAR}$ as equal to 1 if m = 9, 11, 13 and zero otherwise. The empirical results can inform us as to whether the BVAR-in-differences improves over the BVAR-in-levels. Recall that the main advantage of using the BVAR-in-levels (L-BVAR) is that it allows for the possibility of cointegration. The results in Table 4B suggest that this BVAR specification choice matters only for point forecasting of output growth: L-BVARs perform significantly better than D-BVARs.

5.4. Evaluating the impact of the dataset size with equal accuracy tests

Our previous results suggest that the use of forecasting models with large sets of predictors may have a negative effect on forecasting performances for both output growth and inflation, particularly if using BVAR models with short samples. This subsection evaluates this research question using the empirical variation of "medium vs. large" equal accuracy tests for point and density forecasts as described in Section 4.

Fig. 1 presents empirical t-statistic distributions for the following models: FADL, F-MIDAS, C-MIDAS, L-BVAR and D-BVAR. The Diebold and Mariano (1995) t-statistics are computed using the specification with a medium dataset under the null and the model with a large dataset under the alternative using the full out-of-sample period (p = 1993–2011). The box plots are computed for t-statistics obtained for different horizons ($h = 1, \ldots, 8$) and countries. Negative values imply that the model with a large number of predictors is more accurate than the same model with a medium dataset. Using a two-sided 5% test, statistical differences are found when the absolute value of the t-statistic is larger than 1.96.

In general, the t-statistics are between −1.96 and 1.96; that is, models using large and medium datasets deliver statistically similar point and density forecasting
Fig. 1. Box plots of equal accuracy t-statistics for a model with a medium dataset against a large dataset for each indicated forecasting model (aggregated over forecasting horizons (1 to 8) and countries; full out-of-sample period).
Note: Negative t-statistics imply that the specification with a large dataset is more accurate than the equivalent with a medium-sized dataset. Each box plot is based on 8 horizons × 7 countries = 56 values. The accuracy test in the first panel is based on MSFEs, while the statistics in the second panel are based on the logscore. "On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually" (Matlab description).
performances. However, based on the median t-statistics, we can say that D-BVARs are worse at handling large datasets than L-BVARs, providing an additional nuance to our results earlier in this section that discouraged the use of BVARs with large datasets. These results also support the use of the C-MIDAS specification instead of the F-MIDAS, in particular when dealing with large datasets for forecasting inflation.

In summary, there is no strong evidence that using a large number of predictors provides improved forecasts relative to using a moderate number, but we can provide evidence to support the use of C-MIDAS and FADL specifications instead of BVAR models when dealing with large datasets.

6. Comparing structural vs. reduced-form forecasting models

The previous section investigated common features that explain the relative forecasting performances of reduced-form statistical models across countries, forecasting horizons, forecasting periods and model specifications. This section uses equal accuracy tests, computed as described in Section 4, to compare the performances of reduced-form statistical models (FADL, BVAR, MIDAS) with those of the DSGE model.

Details of the DSGE model employed, including our estimation strategy, were discussed in Section 2.4, while Section 3 described the dataset employed in the estimation of DSGE models. One should note that medium-sized DSGE forecasts are considered only for c = US, UK and EU, and are estimated with output growth per person and GDP deflator inflation. We measure the performance of DSGE models relative to the AR benchmark by recomputing AR forecasts using the same measurements of output growth and inflation as are employed by the DSGE model.

Figs. 2 and 3 present box plots of the Diebold and Mariano (1995) t-statistics. The t-statistics are computed for the full out-of-sample period for each country, as listed in Table 2. Negative values mean that the model is more accurate than the AR model. Using a one-sided test, we would reject the null of equal predictive accuracy at the 5% level if the DM t-statistic is smaller than −1.65. The empirical distributions vary with the country and are computed for specific model classes (FADL, MIDAS, BVAR, DSGE). The box plots are presented for three separate horizons (h = 1, 4 and 8). Fig. 2 presents results for output growth and inflation, using the quadratic loss function (MSFE) to compute the t-statistics, whereas the plots in Fig. 3 are based instead on the differences in logscore.

The results in Figs. 2 and 3 help us to indicate which model class, including statistical model classes (FADL, MIDAS, BVAR) and the structural model class (DSGE), performs best for each target variable and for a set of forecasting horizons. The median t-statistic in Figs. 2 and 3 can be employed to evaluate how each class of model
Fig. 2. Box plots of the equal accuracy MSFE t-statistics with the AR model under the null and a forecasting model from the model class indicated under the alternative (aggregated over specifications and countries; full out-of-sample period).
Note: The first panel has results for forecasting GDP growth for horizons 1, 4 and 8. The second panel has equivalent results for forecasting inflation. See Table 1 for the description of the factor (2–3), MIDAS (4–7) and BVAR (8–13) specifications. Note that the DSGE is estimated only for three countries. A list of the seven countries employed is provided in Table 2. Negative values are in favour of the specified multivariate model class. See also the notes to Fig. 1.
performs on average across specifications and countries for each horizon and target variable.

MIDAS models do better at h = 1 for output growth, but the distribution of t-statistics has a large spread, suggesting that mixed-frequency models improve output growth nowcasts for the median country but perform poorly for some countries. For h = 4, it is clear that BVARs perform better for forecasting output growth. When forecasting inflation, the clear evidence that we have is that DSGE models do better at h = 4, 8 for both point and density forecasts. The results in Figs. 2 and 3 suggest that DSGE models are able to improve AR forecasts of quarterly inflation significantly at h = 4, 8.

These results are supported by detailed tables, by country and forecasting horizon, in the online appendix. Table A1 shows the relative performance of the DSGE model against those of the AR and the FADL_M using RMSFEs, while Table A2 shows results using the logscore. The results indicate that the DSGE gains for forecasting inflation are present mainly for the US and the UK, with disappointing results for the euro area, which is in agreement with the findings of Smets, Warne, and Wouters (2014). The DSGE model performs better in the earlier period (1993–2002) than in the later period (2003–2011), confirming the literature that supports the use of DSGE forecasts during the Great Moderation period (1985–2007) (Del Negro & Schorfheide, 2013).

In summary, we provide evidence that structural (DSGE) models can deliver superior long-horizon forecasts of US and UK inflation.

7. Conclusion

The comprehensive evaluation of macroeconomic forecasting models that is reported in this paper contributes to both the academic literature and the practice of macroeconomic forecasting. By employing datasets for seven developed economies and considering four classes of multivariate forecasting models, we provide new empirical findings, extending and enhancing the evidence that is usually available for US data.

Our multicountry comparison provides a new dimension when comparing structural with reduced-form models in forecasting. The DSGE model specification that we consider (Smets & Wouters, 2007) provides accurate one- and two-year-ahead forecasts of inflation not only for the US, but also for the UK.

This evaluation was designed to look at forecasting horizons from nowcasting up to two years ahead. Our contribution is to consider a large set of model specifications over all of these horizons to enable us to provide evidence that the choice of the best forecasting model class clearly varies with the forecast horizon. We propose meta-analysis regressions in order to draw a small set of clear messages from 2976 relative accuracy comparisons.
Fig. 3. Box plots of the equal accuracy logscore t-statistics with the AR model under the null and a forecasting model from the model class indicated under the alternative (aggregated over specifications and countries; full out-of-sample period).
Note: See the notes to Fig. 1.
Clark, T. E. (2011). Real-time density forecasts from Bayesian vector autoregressions with stochastic volatility. Journal of Business & Economic Statistics, 29, 327–341.
Clements, M. P., & Galvão, A. B. (2008). Macroeconomic forecasting with mixed-frequency data: Forecasting output growth in the United States. Journal of Business & Economic Statistics, 26, 546–554.
Clements, M. P., & Taylor, N. (2001). Bootstrapping prediction intervals for autoregressive models. International Journal of Forecasting, 17, 247–267.
D'Agostino, A., Gambetti, L., & Giannone, D. (2013). Macroeconomic forecasting and structural change. Journal of Applied Econometrics, 28, 82–101.
Del Negro, M., & Schorfheide, F. (2011). Bayesian macroeconometrics. In J. Geweke, G. Koop, & H. van Dijk (Eds.), The Oxford handbook of Bayesian econometrics (pp. 293–389). Oxford University Press.
Del Negro, M., & Schorfheide, F. (2013). DSGE model-based forecasting. In Handbook of economic forecasting, Volume 2A (pp. 57–140). Elsevier, chapter 2.
Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13, 253–263. Reprinted in Mills, T. C. (ed.) (1999), Economic forecasting. The international library of critical writings in economics. Cheltenham: Edward Elgar.
Diebold, F. X., Schorfheide, F., & Shin, M. (2017). Real-time forecast evaluation of DSGE models with stochastic volatility. Journal of Econometrics, 201, 322–332.
Edge, R. M., & Gurkaynak, R. S. (2011). How useful are estimated DSGE model forecasts. Federal Reserve Board, Finance and Economics Discussion Series 11.
Faust, J., & Wright, J. H. (2013). Forecasting inflation. In Handbook of economic forecasting, Vol. 2A (pp. 3–56). Elsevier, chapter 1.
Ferrara, L., Marcellino, M., & Mogliani, M. (2015). Macroeconomic forecasting during the Great Recession: the return of non-linearity? International Journal of Forecasting, 31, 664–679.
Giacomini, R., & Rossi, B. (2010). Forecast comparisons in unstable environments. Journal of Applied Econometrics, 25, 595–620.
Giannone, D., Lenza, M., & Primiceri, G. E. (2015). Prior selection for vector autoregressions. Review of Economics and Statistics, 97, 412–435.
Groen, J. J. J., & Kapetanios, G. (2013). Model selection criteria for factor-augmented regressions. Oxford Bulletin of Economics and Statistics, 75, 37–63.
Herbst, E., & Schorfheide, F. (2012). Evaluating DSGE model forecasts of comovements. Journal of Econometrics, 171, 152–166.
Koop, G. (2013). Forecasting with medium and large Bayesian VARs. Journal of Applied Econometrics, 28, 177–203.
Koop, G., & Korobilis, D. (2013). Large time-varying parameter VARs. Journal of Econometrics, 177, 185–198.
Kuzin, V., Marcellino, M., & Schumacher, C. (2013). Pooling versus model selection for nowcasting with many predictors: Empirical evidence for six industrialized countries. Journal of Applied Econometrics, 28, 392–411.
Schorfheide, F., & Song, D. (2015). Real-time forecasting with a mixed-frequency VAR. Journal of Business & Economic Statistics, 33, 366–380.
Sims, C. (1993). A nine-variable probabilistic macroeconomic forecasting model. In Business cycles, indicators and forecasting (pp. 179–212). National Bureau of Economic Research.
Sims, C. (2000). Using a likelihood perspective to sharpen econometric discourse: three examples. Journal of Econometrics, 95(2), 443–462.
Smets, F., Warne, A., & Wouters, R. (2014). Professional forecasters and real-time forecasting with a DSGE model. International Journal of Forecasting, 30, 981–995.
Smets, F., & Wouters, R. (2007). Shocks and frictions in US business cycles. American Economic Review, 97, 586–606.
Stock, J. H., & Watson, M. W. (2002). Macroeconomic forecasting using diffusion indexes. Journal of Business & Economic Statistics, 20, 147–162.
Stock, J. H., & Watson, M. W. (2003). Forecasting output and inflation: The role of asset prices. Journal of Economic Literature, 41, 788–829.
Stock, J. H., & Watson, M. W. (2007). Why has U.S. inflation become harder to forecast? Journal of Money, Credit and Banking, 39(Suppl.), 3–33.
Wolters, M. H. (2015). Evaluating point and density forecasts of DSGE models. Journal of Applied Econometrics, 30, 74–96.