Department of Economics, University of Southampton, Southampton SO17 1BJ, UK
Discussion Papers in Economics and Econometrics, No. 0104
This paper is available on our website http://www/soton.ac.uk/~econweb/dp/dp01.html

Reformulating Empirical Macro-econometric Modelling

David F. Hendry, Economics Department, Oxford University
and
Grayham E. Mizon, Economics Department, Southampton University

Abstract

The policy implications of estimated macro-econometric systems depend on the formulations of their equations, the methodology of empirical model selection and evaluation, the techniques of policy analysis, and their forecast performance. Drawing on recent results in the theory of forecasting, we question the role of ‘rational expectations’; criticize a common approach to testing economic theories; show that impulse-response methods of evaluating policy are seriously flawed; and question the mechanistic derivation of forecasts from econometric systems. In their place, we propose that expectations should be treated as instrumental to agents’ decisions; discuss a powerful new approach to the empirical modelling of econometric relationships; offer viable alternatives to studying policy implications; and note modifications to forecasting devices that can enhance their robustness to unanticipated structural breaks.

JEL classification: C3, C5, E17, E52, E6.
Keywords: economic policy analysis, macro-econometric systems, empirical model selection and evaluation, forecasting, rational expectations, impulse-response analysis, structural breaks.

Financial support from the U.K. Economic and Social Research Council under grant L138251009 is gratefully acknowledged. We are pleased to acknowledge helpful comments from Chris Allsopp, Mike Clements, Jurgen Doornik, Bronwyn Hall, John Muellbauer and Bent Nielsen.

1 Introduction

The policy implications derived from any estimated macro-econometric system depend on the formulation of its equations, the methodology used for the empirical modelling and evaluation, the approach to policy analysis, and the forecast performance. Drawing on recent results in the theory of forecasting, we question the role of ‘rational expectations’ in the first stage; then criticize the present approach to testing economic theories prevalent in the profession; next, we show that impulse-response methods of evaluating the policy implications of models are seriously flawed; and finally, question the mechanistic derivation of forecasts from econometric systems. In their place, we propose that expectations should be treated as instrumental to agents’ decisions; suggest a powerful new approach to the empirical modelling of econometric relationships; offer viable alternatives to studying policy implications; and discuss modifications to forecasting devices that can enhance their robustness to unanticipated structural breaks.

We first sketch the arguments underlying our critical appraisals, then briefly describe the constructive replacements, before presenting more detailed analyses of these four issues. Sub-section 1.1 summarizes our critiques, and sub-section 1.2 introduces our remedies.

1.1 Four critiques of present practice

Our approach builds on extensive research that has radically altered our understanding of the causes of forecast failure, the occurrence of which was one of the driving forces behind the so-called ‘rational expectations revolution’ that replaced ‘Keynesian’ models.
Forecast failure is a significant deterioration in forecast performance relative to the anticipated outcome, usually based on the historical performance of a model: systematic failure is the occurrence of repeated mis-forecasting. The research reveals that the causes of forecast failure differ from what is usually believed – as do the implications. To explain such differences, we begin by reconsidering the ‘conventional’ view of economic forecasting.

When the data processes being modelled are weakly stationary (so means and variances are constant over time), three important results can be established. First, causal variables will outperform non-causal (i.e., variables that do not determine the series being forecast), both in terms of fit and when forecasting. Secondly, a model that in-sample fully exploits the available information (called congruent) and is at least as good as the alternatives (encompassing) will also dominate in forecasting; and for large samples, will do so at all forecast horizons. Thirdly, forecast failure will rarely occur, since the sample under analysis is ‘representative’ of the sample that needs to be forecast – moreover, that result remains true for mis-specified models, inaccurate data, inefficient estimation and so on, so long as the process remains stationary. Such theorems provide a firm basis for forecasting weakly-stationary time series using econometric models: unfortunately, they can be extended to non-stationary processes only when the model coincides with the data generation process (DGP).

The systematic mis-forecasting and forecast failure that has periodically blighted macroeconomics highlights a large discrepancy between such theory and empirical practice, which is also visible in other disciplines: see e.g., Fildes and Makridakis (1995) and Makridakis and Hibon (2000). The key problem is the inherently non-stationary nature of economic data – even after differencing and cointegration transforms have removed unit roots – interacting with the impossibility in a high-dimensional and evolving world of building an empirical model which coincides with the DGP at all points in time. Consequently, one can disprove the most basic theorem that forecasts based on causal variables will dominate those from non-causal. Restated, it is easy to construct examples where forecasts based on variables that do not enter the DGP outperform those based on well-specified causally-sound models – one such example is shown below. Importantly, such results match the empirical evidence: we have opened Pandora’s Box, with profound implications that are the focus of this paper.

Having allowed the data process to be non-stationary and models to be mis-specified representations thereof (both in unspecified ways), one might imagine that an almost indefinite list of problems could precipitate forecast failure. Fortunately, that is not the case. To understand why, we must dissect the ingredients of econometric models. In general, econometric models have three main components: deterministic terms, namely variables whose future values are known (such as intercepts, which are 1, 1, 1, ..., and trends, which are 1, 2, 3, 4, ...); observed stochastic variables with known past, but unknown future, values (such as GNP and inflation); and unobserved errors, all of whose values (past, present and future) are unknown. Most relationships in models involve all three components because that is how we conceive of the data.
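Schematically (our illustrative notation, not an equation taken from the paper), a typical relationship combines the three components as:

\[
y_t \;=\; \underbrace{\mu_0 + \mu_1 t}_{\text{deterministic terms}} \;+\; \underbrace{\beta' x_t + \rho\, y_{t-1}}_{\text{observed stochastic variables}} \;+\; \underbrace{\epsilon_t}_{\text{unobserved error}} .
\]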
In principle, any or all of the components could be: mis-specified; poorly estimated; based on inaccurate data; selected by inappropriate methods; involve collinearities or non-parsimonious formulations; and suffer structural breaks. Moreover, forecast failure might result from each ‘problem’. Given the complexity of modern economies, most of these ‘problems’ will be present in any empirical macro-model, and will reduce forecast performance by increasing inaccuracy and imprecision. However, and somewhat surprisingly, most combinations do not in fact induce systematic forecast failure. The taxonomy of sources of forecast errors in Clements and Hendry (1998, 1999a) implicates unanticipated forecast-period shifts in deterministic factors (such as equilibrium means, examples of which are the means of the savings rate, velocity of circulation, and the NAIRU) as the dominant cause of systematic failure. As explained in section 2.1, there is an important distinction between shifts in the deterministic components (such as intercepts) that enter models, and those that precipitate forecast failure (unmodelled shifts in data means), but for the moment we leave that to one side, as the former is usually sufficient for the latter. The crucial converse is that forecast failure is not in fact primarily due to the list of ‘problems’ in the previous paragraph, or even the Lucas (1976) critique of changing parameters: by themselves, none of these factors induces systematic failure. Our first critique now follows – since ‘rational expectations’ claim to embody the actual conditional expectations, they do not have a sound theoretical basis in an economy subject to deterministic shifts. Further, in the presence of unmodelled deterministic shifts, models embodying previously-rational expectations will not forecast well in general.

Turning to the second critique, tests of economic theories based on whole-sample goodness-of-fit comparisons can be seriously misled by unmodelled deterministic shifts. This occurs because such shifts can be proxied by autoregressive dynamics, which has two implications. First, deterministic shifts induce apparent unit roots, so cointegration often fails in the face of such breaks. Thus, long-run relationships – often viewed as the statistical embodiment of economic theory predictions – then receive no support. Secondly, such false unit roots can make lagged information from other variables appear irrelevant, so tests of theories – particularly of Euler equations – can be badly distorted. Our second critique now follows: so long as the degree of non-congruence of a model is unknown, false theories can end up being accepted, and useful ones rejected.

A necessary condition for both economic theories and macro-economic models to be of practical value is that their parameters remain constant over the relevant horizon, and for the admissible range of policy changes to be implemented. Many structural breaks are manifest empirically, and so are easy to detect; deterministic shifts are a salient example. The class of breaks that are easy to detect comprises shifts in the unconditional expectations of non-integrated (denoted I(0)) components. Their ease of detection is the obverse of their pernicious effect on forecast performance. However, it transpires that a range of parameter changes in econometric models cannot be easily detected by conventional statistical tests.
This class includes changes that leave unaltered the unconditional expectations, even when dynamics, adjustment speeds, and intercepts are radically altered: illustrations are provided in Hendry and Doornik (1997). This leads to our third critique – impulse-response methods of evaluating the policy implications of models are dependent on the absence of such ‘undetectable breaks’, and so can be misleading in both sign and magnitude when non-deterministic shifts have occurred, even when models are rigorously tested (and certainly when minimal testing occurs).

Fourthly, there is ample evidence that forecasts from econometric systems can err systematically in the face of deterministic shifts, such that they perform worse than ‘naive’ methods in forecasting competitions. Theory now exists to explain how and why that occurs: see e.g., Clements and Hendry (1999c). The implementation of cointegration may in practice have reduced the robustness of econometric-model forecasts to breaks, by ensuring they adjust back to pre-existing equilibria, even when those equilibria have shifted. Mechanistic econometric-model based forecasts, therefore, are unlikely to be robust to precisely the form of shift that is most detrimental to forecasting. It is well known that devices such as intercept corrections can improve forecast performance (see e.g., Turner, 1990), but manifestly do not alter policy implications; and conversely, that time-series models with no policy implications might provide the best available forecasts. Hence our fourth critique – it is inadvisable to select policy-analysis models by their forecast accuracy: see Hendry and Mizon (2000).

1.2 Some remedies

The existence of these four problems implies that many empirical macro-econometric models are incorrectly formulated and wrongly selected, with policy implications derived by inappropriate methods. Whilst we suspect that some amelioration arises in practice as a result of most macro-forecasters continuing to use intercept corrections to improve forecasts, the almost insuperable problems confronting some approaches to macro-economics remain. Fortunately though, effective alternatives exist.

First, since expectations are instrumental to the decisions of economic agents, not an end in themselves, the devices that win forecasting competitions – which are easy to use and economical in information – suggest themselves as natural ingredients in agents’ decision rules (possibly ‘economically-rational expectations’: see Feige and Pearce, 1976). We show that is the case, with the interesting implication that the resulting rules may not be susceptible to the Lucas (1976) critique, thus helping to explain its apparent empirical irrelevance: see Ericsson and Irons (1995).

Next, stimulated by Hoover and Perez (1999), Hendry and Krolzig (1999) investigate econometric model selection from a computer-automation perspective, focusing on general-to-specific reduction approaches, embodied in the program PcGets (general-to-specific: see Krolzig and Hendry, 2000). In Monte Carlo experiments, PcGets recovers the DGP with remarkable accuracy, having empirical size and power close to what one would expect if the DGP were known, suggesting that search costs are low. Thus, a general-to-specific modelling strategy that starts from a congruent general model and requires congruence and encompassing throughout the reduction process offers a powerful method for selecting models. This outcome contrasts with beliefs in economics about the dangers of ‘data mining’.
Rather, it transpires that the difficult problem is not to eliminate spurious variables, but to retain relevant ones. The existence of PcGets not only allows the advantages of congruent modelling to be established, it greatly improves the efficiency of modellers who take advantage of this model-selection strategy.

Thirdly, there is a strong case for using open, rather than closed, macro-econometric systems, particularly those which condition on policy instruments. Modelling open systems has the advantage that amongst their parameters are the dynamic multipliers which are important ingredients for estimating the responses of targets to policy changes. Further, it is difficult to build models of variables such as interest rates, tax rates, and exchange rates, which are either policy instruments or central to the determination of targets. Since many policy decisions entail shifts in the unconditional means of policy instruments, corresponding shifts in the targets’ unconditional means are required for policy to be effective. The relevant concept is called co-breaking, and entails that although each variable in a set shifts, there are linear combinations that do not shift (i.e., are independent of the breaks: see Clements and Hendry, 1999a, ch. 9). Co-breaking is analogous to cointegration, where a linear combination of variables is stationary although individually they are all non-stationary. Whenever there is co-breaking between the instrument and target means, reliable estimates of the policy responses can be obtained from the model of the targets conditioned on the instruments, despite the probable absence of weak exogeneity of policy instruments for the parameters of interest in macro-models (due to mutual dependence on previous disequilibria): Ericsson (1992) provides an excellent exposition of weak exogeneity. The existence of co-breaking between the means of the policy instruments and targets is testable, and moreover, is anyway necessary to justify impulse-response analysis (see Hendry and Mizon, 1998).

Finally, there are gains from separating policy models – to be judged by their ability to deliver accurate advice on the responses likely from policy changes – from forecasting models, to be judged by their forecast accuracy and precision. No forecast can be robust to unanticipated events that occur after its announcement, but some are much more robust than others to unmodelled breaks that occurred in the recent past. Since regime shifts and major policy changes act as breaks to models that do not embody the relevant policy responses, we discuss pooling robust forecasts with scenario differences from policy models to avoid both traps.

We conclude that the popular methodologies of model formulation, modelling and testing, policy evaluation, and forecasting may prejudice the accuracy of implications derived from macro-econometric models. Related dangers confronted earlier generations of macro-models: for example, the use of dynamic simulation to select systems was shown in Hendry and Richard (1982) to have biased the choice of models to ones which over-emphasized the role of unmodelled (‘exogenous’) variables at the expense of endogenous dynamics, with a consequential deterioration in forecast performance, and mis-leading estimates of speeds of policy responses. As in that debate, we propose positive antidotes to each of the major lacunae in existing approaches.
The detailed analyses have been presented in other publications: here we seek to integrate and explain their implications for macro-econometric modelling.

1.3 Overview

The remainder of the paper is structured as follows. Since we attribute a central role to forecast-period shifts in deterministic factors as the cause of forecast failure, section 2 first explains the concept of deterministic shifts, reviews their implications, and contrasts those with the impacts of non-deterministic shifts. Thereafter, the analysis assumes a world subject to such shifts. Section 3 derives the resulting implications for ‘rational expectations’, and suggests alternatives that are both feasible and more robust to breaks. Then section 4 discusses tests of theory-based propositions, before section 5 turns to model selection for forecasting. Next, section 6 introduces three related sections concerned with aspects of model selection for policy. First, section 6.1 considers the obverse of section 5, and shows that policy models should not be selected by forecast criteria. Secondly, section 6.2 considers policy analyses based on impulse responses, and thirdly, section 6.3 examines estimation of policy responses. Section 7 describes appropriate model-selection procedures, based on computer automation, and section 8 justifies the focus on congruent modelling. Finally, section 9 concludes.

2 Setting the scene

In a constant-parameter, stationary world, forecast failure should rarely occur: the in-sample and out-of-sample fits will be similar because the data properties are unchanged. As discussed in Miller (1978), stationarity ensures that, on average (i.e., excluding rare events), an incorrectly-specified model will forecast within its anticipated tolerances (providing these are correctly calculated). Although a mis-specified model could be beaten by methods based on correctly-specified equations, it will not suffer excessive forecast failure purely because it is mis-specified. Nevertheless, since a congruent, encompassing model will variance-dominate in-sample, it will continue to do so when forecasting under unchanged conditions. Thus, adding causal variables will improve forecasts on average; adding non-causal variables (i.e., variables that do not enter the DGP) will only do so when they proxy for omitted causal variables. In an important sense, the best model will win. Empirical models are usually data-based (selected to match the available observations), which could induce some overfitting, but should not produce systematic forecast failure (see Clements and Hendry, 1999b). Conversely, when the data properties over the forecast horizon differ from those in-sample – a natural event in non-stationary processes – forecast failure will result. The latter’s regular occurrence is strong evidence for pandemic non-stationarity in economics, an unsurprising finding given the manifest legislative, social, technological and political changes witnessed over modern times (and indeed through most of history).

Once such non-stationarity is granted, many ‘conventional’ results that are provable in a constant-parameter, stationary setting change radically. In particular, since the future will not be like the present or the past, two important results can be established in theory, and demonstrated in practice: the potential forecast dominance of models using causal variables by those involving non-causal variables; and of in-sample well-specified models by badly mis-specified ones.
Clements and Hendry (1999a) provide several examples: another is offered below. Together, such results remove the theoretical support for basing forecasting models – and hence agents’ expectations formation – on the in-sample conditional expectation given available information. We develop this analysis in section 3. Moreover, these two results potentially explain why over-differencing and intercept corrections – both of which introduce non-causal variables into forecasting devices – could add value to model-based forecasts: this aspect is explored in section 5, which emphasizes the potential dangers of selecting a policy model by such criteria as forecast accuracy. Finally, a failure to model the relevant non-stationarities can distort in-sample tests, and lead to incorrect inferences about the usefulness or otherwise of economic theories: that is the topic of section 4.

Not all forms of non-stationarity are equally pernicious. For example, unit roots generate stochastic trends in data series, which thereby have changing means and variances, but nevertheless seem relatively benign. This form of non-stationarity can be removed by differencing or cointegration transformations, and often, it may not matter greatly whether or not those transforms are imposed (see e.g., Sims, Stock and Watson, 1990, for estimation, and Clements and Hendry, 1998, for forecasting). Of course, omitting dynamics could induce ‘nonsense regressions’, but provided appropriate critical values are used, even that hypothesis is testable – and its rejection entails cointegration. As we show in section 2.2, shifts in parameters that do not produce any deterministic shifts also need not induce forecast failure, despite inducing non-stationarities.

Consider an h-step ahead forecast made at time T, denoted ŷ_{T+h|T}, for a vector of n_y variables y_{T+h}. The difference between the eventual outcomes and the forecast values is the vector of forecast errors e_{T+h|T} = y_{T+h} − ŷ_{T+h|T}, and this can be decomposed into the various mistakes and unpredictable elements. Doing so delivers a forecast-error taxonomy, partitioned appropriately into deterministic, observed stochastic, and innovation-error influences. For each component, there are effects from structural change, model mis-specification, data inaccuracy, and inappropriate estimation. Although the decomposition is not unique, it can be expressed in nearly-orthogonal effects corresponding to influences on forecast-error means and variances respectively. The former involves all the deterministic terms; the latter the remainder. We now briefly consider these major categories of error, commencing with mean effects, then turn to variance components.

2.1 Forecast failure and deterministic shifts

Systematic forecast-error biases derive from deterministic factors being mis-specified, mis-estimated, or non-constant. The simplest example is omitting a trend; or when a trend is included, under-estimating its slope; or when the slope is correct, experiencing a shift in the growth rate. A similar notion applies to equilibrium means, including shifts in, mis-specification of, or mis-estimation of the means of (say) the savings rate, velocity of circulation, or the NAIRU. Any of these will lead to a systematic, and possibly increasing, divergence between outcomes and forecasts. However, there is an important distinction between the roles of intercepts, trends etc., in models, and any resulting deterministic shifts, as we will now explain.
To clarify the roles of deterministic, stochastic, and error factors, we consider a static regression where the parameters change prior to forecasting. The in-sample DGP, for t = 1, ..., T, is:

\[
y_t = \alpha + \beta x_t + \epsilon_t, \tag{1}
\]

where x_t is an independent, normally-distributed variable with mean μ and variance σ_x². Also, ε_t is an independent, normally-distributed error with mean zero and constant variance σ_ε². Finally, x_t has known future values to the forecaster, and {ε_t} is independent of x. A special case of interest below is β = 0, so α is just the mean of y. In (1), the conditional mean of y_t is E[y_t | x_t] = α + β x_t and the conditional variance is V[y_t | x_t] = σ_ε². The unconditional mean, E[y_t], and variance, V[y_t] (which allow for the variation in x_t), are α + βμ = φ and σ_ε² + β²σ_x² respectively. Thus, there are two deterministic components in (1): the intercept α and the mean μ of the regressor, so the overall deterministic term is φ. Indeed, we can always rewrite (1) as:

\[
y_t = \alpha + \beta\mu + \beta\left(x_t - \mu\right) + \epsilon_t = \varphi + \beta\left(x_t - \mu\right) + \epsilon_t. \tag{2}
\]

Shifts in the composite deterministic term φ will transpire to be crucial.

Consider using the estimated DGP (1) as the forecasting model. For simplicity, we assume known parameter values. Then, with an exactly-measured forecast origin at time T, (1) produces the h-step ahead forecast sequence:

\[
\hat{y}_{T+h|T} = \alpha + \beta x_{T+h}. \tag{3}
\]

However, over the forecast period, h = 1, ..., H, there is a shift in the parameters of the process unknown to the forecaster, so that in fact:

\[
y_{T+h} = \alpha^* + \beta^* x_{T+h} + \epsilon_{T+h} = \varphi^* + \beta^*\left(x_{T+h} - \mu\right) + \epsilon_{T+h}.
\]

The distributions of {x_{T+h}} and {ε_{T+h}} could also change (e.g., μ, σ_x² or σ_ε² might change), but we neglect such effects here as not germane to the central issues. Indeed, when x_t is known, as is assumed here, changes in μ are irrelevant. The resulting sequence of forecast errors e_{T+h|T} = y_{T+h} − ŷ_{T+h|T} after the unanticipated shift is:

\[
e_{T+h|T} = \left(\alpha^* + \beta^* x_{T+h} + \epsilon_{T+h}\right) - \left(\alpha + \beta x_{T+h}\right) = \left(\alpha^* - \alpha\right) + \left(\beta^* - \beta\right) x_{T+h} + \epsilon_{T+h}. \tag{4}
\]

There are two kinds of terms in (4): those contributing to the mean, and deviations from that mean. The former is obtained by taking expectations, which leads to:

\[
\mathsf{E}_{T+h}\left[e_{T+h|T}\right] = \left(\alpha^* - \alpha\right) + \left(\beta^* - \beta\right)\mu = \varphi^* - \varphi. \tag{5}
\]

The composite shift is zero if and only if φ* = φ: importantly, that does not require α* = α and β* = β. When (5) is non-zero, we call it a ‘deterministic shift’: the effect is pernicious when E_{T+h}[e_{T+h|T}] increases by several σ_ε, but is then usually easy to detect. An empirically-relevant case is when the variables labelled y_t and x_t are log-differences, so μ defines the mean growth rate of x_t, and φ that of y_t. Many econometric growth-rate equations have σ_ε ≈ φ (often around 0.5%–1%), so the requirement that φ* − φ be as large as (say) σ_ε is actually very strong: e.g., a doubling of the trend rate of growth. Consequently, even moderate trend shifts can be hard to detect till quite a few periods have elapsed.

To illustrate that an ‘incorrect’ model can outperform the in-sample DGP in forecasting, we return to the special case of (1) when β = 0. The expected forecast error sequence from using the in-sample DGP will be α* − α. That remains true when the forecast origin moves through time to T+1, T+2 etc.: because the forecasting model remains unchanged, so do the average forecast errors. Consider, instead, using the naive predictor ỹ_{T+h|T+1} = y_{T+1}. The resulting sequence of forecast errors will be similar to e_{T+h|T} when the origin is T: unanticipated shifts after forecasting are bound to harm all methods.
However, when forecasting from time T+1 onwards, a different result ensues for ẽ_{T+h|T+1} = y_{T+h} − ỹ_{T+h|T+1}, because:

\[
\tilde{e}_{T+h|T+1} = y_{T+h} - y_{T+1} = \left(\alpha^* + \epsilon_{T+h}\right) - \left(\alpha^* + \epsilon_{T+1}\right) = \epsilon_{T+h} - \epsilon_{T+1} = \nu_{T+h}, \tag{6}
\]

where ν_{T+h} = ε_{T+h} − ε_{T+1}. The last line of (6) has a mean of zero, despite the deterministic shift. Thus, on a bias criterion, ỹ_{T+h|T+1} outperforms the in-sample DGP (and could win on mean-square error), despite the fact that y_{t−1} is not a causal variable. Dynamics make the picture more complicated, but similar principles apply.

An interesting, and much studied, example of a deterministic shift concerns forecast failure in a model of narrow money (M1) in the UK after the Banking Act of 1984, which permitted interest payments on current accounts in exchange for all interest payments being after the deduction of ‘standard rate’ tax. The own rate of interest (R_o) changed from zero to near the value of the competitive rate (R_c: about 12 per cent per annum at the time) in about 6 quarters, inducing very large inflows to M1. Thus, a large shift occurred in the mean opportunity cost of holding money, namely a deterministic shift from R_c to R_c − R_o. Pre-existing models of M1 – which used the outside rate of interest R_c as the measure of opportunity cost – suffered marked forecast failure, which persisted for many years after the break. Models that correctly re-measured the opportunity cost by R_c − R_o continued to forecast well, once the break was observed, and indeed had the same estimated parameter values after the break as before. However, methods analogous to ỹ_{T+h|T+1} also did not suffer forecast failure: see Clements and Hendry (1999c) for details and references.

In general, the key drivers of forecast failure are mis-specification of, uncertainty in, or changes to the conditional expectation (where that exists) given the history of the process. The mean forecast can differ from the correct conditional expectation due to biased estimation of the mean, or when there are unexpected shifts. Because forecast failure is usually judged relative to in-sample behaviour, the latter is the dominant cause. However, mis-estimation of coefficients of deterministic terms could be deleterious to forecast accuracy if estimation errors are large by chance.

2.2 Non-deterministic shifts

Having extracted the deterministic terms, all other factors fall under the heading of non-deterministic. The converse problem now occurs. Shifts in the coefficients of zero-mean variables have a surprisingly small impact on forecasts (as measured by the inability of parameter-constancy tests to detect the break). There are three consequences. First, such shifts seem an unlikely explanation for observed forecast failure. Secondly, changes in reaction parameters such as β are difficult to detect unless they induce a deterministic shift in the model, which cannot occur when x_t has mean zero (μ = 0). This finding helps explain the absence of empirical evidence on the Lucas (1976) critique, as discussed in section 3. Finally, although they have relatively benign effects in the context of forecasting, undetected changes in reaction parameters could have disastrous effects on policy analyses – but we leave that story till section 6.2. More formally, when x_t and y_t both have mean zero in (1), all terms in (4) have zero expectations, so no forecast bias results when β changes. There is an increase in the forecast-error variance, from σ_ε² to (β* − β)²σ_x² + σ_ε², and the detectability (or otherwise) of the break depends on how much the variance increases.
For (β* − β)² = 4σ_ε² (say), the ratio is 1 + 4σ_x², which can be difficult to detect against the background noise (see e.g., Hendry and Doornik, 1997, for simulation illustrations). In a stationary dynamic process, an intercept like α also differs from the unconditional mean φ, and it is shifts in the latter which are again relevant to forecast failure. A strong, and corroborated, prediction is that shifts in both the intercepts and the regression parameters which leave the unconditional mean unchanged will not induce forecast failure, and tests will be relatively powerless to detect that anything has changed. For example, φ* = φ when α* + β*μ = α + βμ, even though every parameter has altered. Indeed, the situation where the unconditional mean is constant is precisely the same as when all the means are zero: Hendry and Doornik (1997) and section 6.2 below provide the details, and lead to the conclusions that shifts in unconditional means are a primary source of forecast failure, and other ‘problems’ are less relevant to forecast failure.

For example, omitting zero-mean stochastic components is unlikely to be a major source of forecast failure, but could precipitate failure if stochastic mis-specification resulted in deterministic shifts elsewhere in the economy affecting the model. Equally, the false inclusion of zero-mean stochastic variables is a secondary problem, whereas wrongly including regressors which experienced deterministic shifts could have a marked impact on forecast failure, as the model mean shifts although the data mean does not. Estimation uncertainty in the parameters of stochastic variables also seems to be a secondary problem, as such errors add variance terms of O(1/T) for stationary components. Neither collinearity nor a lack of parsimony per se seem likely culprits, although interactions with breaks occurring elsewhere in the economy could induce problems. Finally, better-fitting models have smaller error accumulation, but little can be done otherwise about forecast inaccuracy from that source.

2.3 Digression: modelling deterministic terms

Over long runs of historical time, all aspects of economic behaviour are probably stochastic. However, in shorter periods some variables may exhibit little variation (or deviation from trend) and so be well represented by a deterministic variable. Further, such variables might be subject to level shifts characterizing different epochs (see Anderson and Mizon, 1989). ‘Models’ of intercept shifts are easily envisaged, where they become drawings from a ‘meta’ distribution; or where large shocks persist but small ones do not; or they are functions of more basic causes – endogenous growth theory could be seen as one attempt to model the intercepts in growth-rate equations. Such re-representations do not alter the forecasting implications drawn above, merely re-interpret what we call deterministic shifts: the key issue is whether the average draw over the relevant horizon is close to that over the sample used, or differs therefrom. The latter induces forecast failure.

That concludes the ‘scene setting’ analysis, summarized as: deterministic shifts of the data relative to the model are the primary source of forecast failure. Monte Carlo evidence presented in several papers bears out the analytics: parameter non-constancy and forecast-failure tests reject for small changes in unconditional means, but not for substantial changes in dynamics, or in all parameters when that leaves equilibrium means unaltered (all measured as a proportion of σ_ε).
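Before turning to expectations, the central claim of this section – that an unmodelled deterministic shift biases forecasts from the in-sample DGP, but not those from a ‘naive’ non-causal predictor – can be checked with a minimal simulation. The sketch below is our illustration (not code from the paper), using arbitrary parameter values:

# A minimal simulation sketch (ours, not the authors' code) of the section-2 claims:
# after an unmodelled shift in the mean alpha -> alpha*, forecasts from the in-sample
# DGP are biased by (alpha* - alpha), whereas the 'naive' non-causal predictor
# y_{T+h|T+1} = y_{T+1} is approximately unbiased, as in equation (6).
import numpy as np

rng = np.random.default_rng(0)
T, H, reps = 100, 8, 5000
alpha, alpha_star, sigma = 0.005, 0.010, 0.005   # illustrative: growth rate doubles after T

bias_dgp = np.zeros(H - 1)
bias_naive = np.zeros(H - 1)
for _ in range(reps):
    eps = rng.normal(0.0, sigma, T + H)
    y = np.where(np.arange(1, T + H + 1) <= T, alpha, alpha_star) + eps  # y_1, ..., y_{T+H}
    for h in range(2, H + 1):                    # forecasts of y_{T+h}
        outcome = y[T + h - 1]
        bias_dgp[h - 2] += outcome - alpha       # in-sample DGP forecast: alpha
        bias_naive[h - 2] += outcome - y[T]      # naive predictor made at T+1: y_{T+1}
bias_dgp /= reps
bias_naive /= reps

print("mean error, DGP-based forecasts:", np.round(bias_dgp, 4))   # approx alpha* - alpha = 0.005
print("mean error, naive predictor    :", np.round(bias_naive, 4)) # approx 0

Repeating the exercise with a zero-mean regressor whose coefficient shifts leaves both average errors near zero, in line with the low detectability of non-deterministic shifts discussed in section 2.2.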
3 ‘Rational expectations’

When unanticipated deterministic shifts make an economy non-stationary, the formation of ‘rational expectations’ requires agents to know:
- all the relevant information;
- how every component enters the joint data density;
- the changes in that density at each point in time.

In terms of our scalar example, the model forecast error e_{T+h|T} in (4) equals the ‘rational expectations’ error ε_{T+h} if and only if every other term is zero. Yet most shifts, and many of their consequences, cannot be anticipated: assuming knowledge of current and future deterministic shifts is untenable. Otherwise, the resulting forecasting device can be dominated by methods which use no causally-relevant variables. Thus, it ceases to be rational to try and form expectations using the current conditional expectation when that will neither hold in the relevant future, nor forecast more accurately than other devices. Agents will learn that they do better forming expectations from ‘robust forecasting rules’ – which adapt rapidly to deterministic shifts. These may provide an example of ‘economically-rational expectations’ as suggested by Feige and Pearce (1976), equating the marginal costs and benefits of improvements in the accuracy of expectations: Hendry (2000b) provides a more comprehensive discussion.

Robust forecasting rules need not alter with changes in policy. Of course, if agents fully understood a policy change and its implications, they would undoubtedly be able to forecast better: but that would require the benefits of doing so to exceed the costs. The problem for agents is compounded by the fact that many major policy changes occur in turbulent times, precisely when it is most difficult to form ‘rational expectations’, and when robust predictors may outperform. Thus, many agents may adopt the adaptive rules discussed above, consistent with the lack of empirical evidence in favour of the Lucas (1976) critique reported in Ericsson and Irons (1995). Consequently, if an econometric model used x_t as a replacement for the expected change x^e_{t+1|t} when agents used robust rules, then the model’s parameters need not change even after forecast failure occurred. Alternatively, the unimportant consequences for forecasting of changes in reaction coefficients, rather than their absence, could account for the lack of empirical evidence that the critique occurs, but anyway reduces its relevance. Hence, though it might be sensible to use ‘rational expectations’ for a congruent and encompassing model in a stationary world, in practice the evident non-stationarities make it inadvisable.

4 Model selection for theory testing

Although not normally perceived as a ‘selection’ issue, tests of economic theories based on whole-sample goodness-of-fit comparisons involve selection, and can be seriously misled by deterministic shifts. Three examples affected by unmodelled shifts are: lagged information from other variables appearing irrelevant, affecting tests of Euler equation theories; cointegration failing, so long-run relationships receive no empirical support; and tests of forecast efficiency rejecting because of residual serial correlation induced ex post by an unpredictable deterministic shift. We address these in turn. The first two are closely related, so our illustration concerns tests of the implications of the Hall (1978) Euler-equation consumption theory when credit rationing changes, as happened in the UK (see Muellbauer, 1994).
The log of real consumers’ expenditure on non-durables and services (c) is not cointegrated with the log of real personal disposable income (y) over 1962(2)–1992(4): a unit-root test using 5 lags of each variable, a constant and seasonals delivers t_ur = −0.97, so does not reject (see Banerjee and Hendry, 1992, and Ericsson and MacKinnon, 1999, on the properties of this test). Nevertheless, the solved long-run relation is:

\[
c = -\underset{(0.99)}{0.53} + \underset{(0.10)}{0.98}\, y + \text{Seasonals}. \tag{7}
\]

Lagged income terms are individually (max t = 1.5) and jointly (F(5, 109) = 1.5) insignificant in explaining Δ₄c_t = c_t − c_{t−4}. Such evidence appears to support the Hall life-cycle model, which entails that consumption changes are unpredictable, with permanent consumption proportional to fully-anticipated permanent income. As fig. 1a shows for annual changes, the data behaviour is at odds with the theory after 1985, since consumption first grows faster than income for several years, then falls faster – far from smoothing. Moreover, the large departure from equilibrium in (7) is manifest in panel b, resulting in a marked deterioration in the resulting (fixed-parameter) 1-step forecast errors from the model in Davidson, Hendry, Srba and Yeo (1978) after 1984(4) (the period to the right of the vertical line in fig. 1c). Finally, an autoregressive model for Δ₁Δ₄c_t = Δ₄c_t − Δ₄c_{t−1} produces 1-step forecast errors which are smaller than average after 1984(4): see panel d. Such a result is consistent with a deterministic shift around the mid 1980s (see Hendry, 1994, and Muellbauer, 1994, for explanations based on financial deregulation inducing a major reduction in credit rationing), which neither precludes the ex ante predictability of consumption from a congruent model, nor consumption and income being cointegrated. The apparent insignificance of additional variables may be an artefact of mis-specifying a crucial shift, so the ‘selected’ model is not valid support for the theory. Conversely, non-causal proxies for the break may seem significant. Thus, models used to test theories should first be demonstrated to be congruent and encompassing.

[Figure 1: UK real consumers’ expenditure and income with model residuals. Panels: (a) Δ₄c and Δ₄y; (b) cointegration residual; (c) DHSY residual; (d) Δ₁Δ₄c.]

We must stress that our example is not an argument against econometric modelling. While Δ₄c_{t−1} may be a more robust forecasting device than the models extant at the time, it is possible in principle that the appropriate structural model – which built in changes in credit markets – would both have produced better forecasts and certainly better policy. For example, by 1985, building society data suggested that mortgages were available on much easier terms than had been the case historically, and ‘housing-equity withdrawal’ was already causing concern to policy makers. Rather, we are criticizing the practice of ‘testing theories’ without first testing that the model used is a congruent and undominated representation, precisely because ‘false’ but robust predictors exist, and deterministic shifts appear to occur intermittently.

The same data illustrate the third mistake: rejecting forecast efficiency because of residual serial correlation induced ex post by an unpredictable deterministic shift. A model estimated prior to such a shift
could efficiently exploit all available information; but if a shift was unanticipated ex ante, and unmodelled ex post, it would induce whole-sample residual serial correlation, apparently rejecting forecast efficiency. Of course, the results correctly reject ‘no mis-specification’; but as no-one could have outperformed the in-sample DGP without prescience, the announced forecasts were not ex ante inefficient in any reasonable sense.

5 Model selection for forecasting

Forecast performance in a world of deterministic shifts is not a good guide to model choice, unless the sole objective is short-term forecasting. This is because models which omit causal factors and cointegrating relations, by imposing additional unit roots, may adapt more quickly in the face of unmodelled shifts, and so provide more accurate forecasts after breaks. We referred to this above as ‘robustness’ to breaks. The admissible deductions on observing either the presence or absence of forecast failure are rather stark, particularly for general methodologies which believe that forecasts are the appropriate way to judge empirical models. In this setting of structural change, there may exist non-causal models (i.e., models none of whose ‘explanatory’ variables enter the DGP) that do not suffer forecast failure, and indeed may forecast absolutely more accurately on reasonable measures, than previously congruent, theory-based models. Conversely, ex ante forecast failure may merely reflect inappropriate measures of the inputs, as we showed with the example of ‘opportunity cost’ affecting UK M1: a model that suffers severe forecast failure may nonetheless have constant parameters on ex post re-estimation. Consequently, neither relative success nor failure in forecasting is a reliable basis for selecting between models – other than for forecasting purposes. Apparent failure in forecasting need have no implications for the goodness of a model, nor its theoretical underpinnings, as it may arise from incorrect data that are later corrected.

Some forecast failures will be due to model mis-specification, such as omitting a variable whose mean alters; and some successes to having well-specified models that are robust to breaks. The problem is discriminating between such cases, since the event of success or failure per se is insufficient information. Because the future can differ in unanticipated ways from the past in non-stationary processes, previous success (failure) does not entail the same will be repeated later. That is why we have stressed the need for ‘robust’ or adaptable devices in the forecasting context.

If it is desired to use a ‘structural’ or econometric model for forecasting, then there are many ways of increasing its robustness, as discussed in Clements and Hendry (1999a). The most usual are ‘intercept corrections’, which adjust the fit at the forecast origin to exactly match the data, and thereby induce the differences of the forecast errors that would otherwise have occurred. Such an outcome is close to that achieved by modelling the differenced data, but retains the important influences from disequilibria between levels. Alternatively, and closely related, one could update the equilibrium means and growth rates every period, placing considerable weight on the most recent data, retaining the in-sample values of all other reaction parameters.
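As a concrete illustration of the first of these devices, the following stylized sketch (ours, not a published implementation) applies the simplest full intercept correction:

# A stylized sketch (our illustration) of an intercept correction: add the residual
# at the forecast origin to each model-based forecast, so the adjusted forecasts
# pass through the latest observation; the model's coefficients - and hence any
# policy implications derived from them - are untouched.
import numpy as np

def intercept_corrected_forecasts(y_origin, model_fit_at_origin, model_forecasts):
    """y_origin: observed value at the forecast origin;
       model_fit_at_origin: the model's fitted value for that period;
       model_forecasts: array of h-step ahead model forecasts."""
    correction = y_origin - model_fit_at_origin        # residual at the origin
    return np.asarray(model_forecasts) + correction    # shift every forecast by it

# Example: a model that has not absorbed a recent shift under-predicts by 0.6;
# the correction removes that systematic component from the forecasts.
adjusted = intercept_corrected_forecasts(y_origin=2.0,
                                         model_fit_at_origin=1.4,
                                         model_forecasts=[1.5, 1.6, 1.7])
print(adjusted)   # -> [2.1 2.2 2.3]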
In both cases, howsoever the adjustments are implemented, the policy implications of the underlying model are unaltered, although the forecasts may be greatly improved after deterministic shifts. The obvious conclusion, discussed further below, is that forecast performance is also not a good guide to policy-model choice. Without the correction, the forecasts would be poor; with the correction they are fine, but the policy recommendation is unaffected. Conversely, simple time-series predictors like Δ₄c_{t−1} have no policy implications. We conclude that policy-analysis models should be selected on different criteria, which we now discuss.

6 Model selection for policy analysis

The next three sub-sections consider related, but distinct, selection issues when the purpose of modelling is policy: using forecast performance to select a policy model; investigating policy in closed models, where every variable is endogenous; and analyzing policy in open models, which condition on some policy variables. Although the three issues arise under the general heading of selecting a policy model, and all derive from the existence of, and pernicious consequences from, deterministic shifts, very different arguments apply to each, as we now sketch. ‘Selection’ is used in a general sense: only the first topic concerns an ‘empirical criterion’ determining the choice of model, whereas the other two issues derive from ‘prior’ decisions to select from within particular model classes.

First, because forecast failure derives primarily from unanticipated deterministic shifts, its occurrence does not sustain the rejection of a policy model: shifts in means may be pernicious, but need not impugn policy implications. For example, intercept corrections would have altered the forecast performance, but not the policy advice. Secondly, because badly mis-specified models can win forecasting competitions, forecast performance is not a sensible criterion for selecting policy models, as shown in section 6.1. Thirdly, policy conclusions depend on the values of reaction parameters, but we have noted the difficulty of detecting shifts in those parameters when there are no concomitant deterministic shifts, with adverse consequences for impulse-response analyses. Section 6.2 provides some detailed evidence. Finally, policy changes in open models almost inevitably induce regime shifts with deterministic effects, and so can highlight previously hidden mis-specifications, or non-deterministic shifts; but also have a sustainable basis in the important concept of co-breaking, noted in section 1.2 above.

6.1 Selecting policy models by forecast performance

A statistical forecasting system is one having no economic-theory basis, in contrast to econometric models, for which economic theory is the hallmark. Since the former system will rarely have implications for economic-policy analysis – and may not even entail links between target variables and policy instruments – being the ‘best’ available forecasting device is insufficient to ensure any value for policy analysis. Consequently, the main issue is the converse: does the existence of a dominating forecasting procedure invalidate the use of an econometric model for policy? Since forecast failure often results from factors unrelated to the policy change in question, an econometric model may continue to characterize the responses of the economy to a policy, despite its forecast inaccuracy.
Moreover, as stressed above, while such ‘tricks’ as intercept corrections may mitigate forecast failure, they do not alter the reliability of the policy implications of the resulting models. Thus, neither direction of evaluation is reliable: from forecast failure or success to poor or good policy advice. Policy models require evaluation on policy criteria. Nevertheless, post-forecasting policy changes that entail regime shifts should induce breaks in models that do not embody the relevant policy links. Statistical forecasting devices will perform worse in such a setting: their forecasts are unaltered (since they do not embody the instruments), but the outcomes change. Conversely, econometric systems that do embody policy reactions need not experience any policy-regime shifts. Consequently, when both structural breaks and regime shifts occur, neither econometric nor time-series models alone are adequate: this suggests that they should be combined, and Hendry and Mizon (2000) provide an empirical illustration of doing so.

6.2 Impulse-response analyses

Impulse-response analysis is a widely-used method for evaluating the response of one set of variables to ‘shocks’ in another set of variables (see e.g., Lütkepohl, 1991, Runkle, 1987, and Sims, 1980). The finding that shifts in the parameters of dynamic reactions are not readily detectable is potentially disastrous for impulse-response analyses of economic policy based on closed systems, usually vector autoregressions (VARs). Since changes in VAR intercepts and dynamic coefficient matrices may not be detected – even when tested for – but the full-sample estimates are a weighted average across different regimes, the resulting impulse responses need not represent the policy outcomes that will in fact occur. Indeed, this problem may be exacerbated by specifying VARs in first differences (as often occurs), since deterministic factors play a small role in such models. It may be felt to be a cruel twist of fate that when a class of breaks is not pernicious for forecasting, it should be detrimental to policy – but these are just the opposite sides of the same coin.

Moreover, this is only one of a sequence of drawbacks to using impulse responses on models to evaluate policy that we have emphasized over recent years: see Banerjee, Hendry and Mizon (1996), Ericsson, Hendry and Mizon (1998a), and Hendry and Mizon (1998). Impulse response functions describe the dynamic properties of an estimated model, and not the dynamic characteristics of the variables. For example, when the DGP is a multivariate random walk, the impulse responses calculated from an estimated VAR in levels will rarely reveal the ‘persistence’ of shocks, since the estimated roots will not be exactly unity. Equally, estimated parameters may be inconsistent or inefficient, unless the model is congruent, encompasses rival models, and is invariant to extensions of the information used (see section 8). When a model has the three properties just noted, it may embody structure (see Hendry, 1995b), but that does not imply that its residuals are structural: indeed, residuals cannot be invariant to extensions of information unless the model coincides with the DGP. In particular, increasing or reducing the number of variables directly affects the residuals, as does conditioning on putative exogenous variables. Worse still, specifying a variable to be weakly or strongly exogenous alters the impulse responses, irrespective of whether or not that variable actually is exogenous.
While Granger non-causality (Granger, 1969) is sufficient for the equivalence of standard-error based impulse responses from systems and conditional models, it does not ensure efficient or valid inferences unless the conditioning variables are weakly exogenous. Moreover, the results are invariant to the ordering of the variables only by ignoring the correlations between residuals in different equations. Avoiding this last problem by reporting orthogonalized impulse responses is not recommended either: it violates weak exogeneity for most orderings, induces a sequential conditioning of variables that depends on the chance ordering of the variables, and may lose invariance. The literature on ‘structural VARs’ (see, e.g., Bernanke, 1986, and Blanchard and Quah, 1989), which also analyzes impulse responses for a transformed system, faces a similar difficulty. The lack of understanding of the crucial role of weak exogeneity in impulse-response analyses is puzzling in view of the obvious feature that any given ‘shock’ to the error and to the intercept are indistinguishable, yet the actual reaction in the economy will be the same only if the means and variances are linked in the same way – which is the weak exogeneity condition in the present setting. Finally, in closed systems which ‘model’ policy variables, impulse-response analysis assumes that the instrument process remains constant under the ‘shock’, when in fact this will often not be so. Thus, it may not be the response of agents that changes when there is a change in policy: via the policy feedback, the VAR coefficients themselves nevertheless shift, albeit in a way that is difficult to detect. There seems no alternative for viable policy analyses to carefully investigating the weak and super exogeneity status of appropriate policy conditioning variables.

Despite the fact that all of these serious problems are well known, impulse responses are still calculated. However, the problem which we are highlighting here – of undetectable breaks – is not well known, so we will demonstrate its deleterious impact using a Monte Carlo simulation. Consider the unrestricted I(0) VAR:

\[
\begin{aligned}
y_{1,t} &= \pi_{11}\, y_{1,t-1} + \pi_{12}\, y_{2,t-1} + \epsilon_{1,t}, \\
y_{2,t} &= \pi_{21}\, y_{1,t-1} + \pi_{22}\, y_{2,t-1} + \epsilon_{2,t},
\end{aligned}
\]

where both errors ε_{i,t} are independent, normally-distributed with means of zero and constant variances σ_{ii}, with E[ε_{1,t} ε_{2,s}] = 0 for all t, s. The y_{i,t} are to be interpreted as I(0) transformations of integrated variables, either by differencing or cointegration. We consider breaks in the Π = (π_{ij}) matrix, maintaining constant unconditional expectations of zero (E[y_{i,t}] = 0). The full-sample size is T = 120, with a single break at t = 0.5T = 60, setting σ_{ii} = 0.01 (1% in a log-linear model). An unrestricted VAR with intercepts and one lag is estimated, and then tested for breaks. The critical values for the constancy tests are those for a known break point, which delivers the highest possible power for the test used. We consider a large parameter shift, from:

\[
\Pi = \begin{pmatrix} 0.50 & -0.20 \\ -0.20 & -0.25 \end{pmatrix} \tag{8}
\]

to:

\[
\Pi^* = \begin{pmatrix} 0.50 & 0.20 \\ 0.20 & 0.25 \end{pmatrix}, \tag{9}
\]

so the sign is altered on all but one response, which is left constant simply to highlight the changes in the other impulses below. We computed 1000 replications at both p = 0.05 and p = 0.01 (the estimates have standard errors of about 0.007 and 0.003 respectively), both when the null is true (no break) and when the break from Π to Π* occurs.
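The design just described can be re-created in a few lines. The following minimal sketch is our illustration, not the authors’ code: it uses simple OLS estimation and compares the impulse responses implied by each regime with those from a single full-sample fit, rather than reproducing the particular constancy tests used in the paper, so its numbers will not match the reported results exactly.

# A minimal sketch (ours) of the section-6.2 experiment: a bivariate I(0) VAR(1)
# with zero intercepts whose dynamic matrix shifts from Pi1 to Pi2 half-way through
# the sample, leaving the unconditional means at zero. It compares the impulse
# responses implied by each regime with those from a VAR fitted across the break.
import numpy as np

rng = np.random.default_rng(1)
T, Tb, sigma = 120, 60, 0.01
Pi1 = np.array([[ 0.50, -0.20],
                [-0.20, -0.25]])
Pi2 = np.array([[ 0.50,  0.20],
                [ 0.20,  0.25]])          # (8) -> (9): signs flipped except the (1,1) element

# Simulate y_1, ..., y_T with the shift after t = Tb.
y = np.zeros((T + 1, 2))
for t in range(1, T + 1):
    Pi = Pi1 if t <= Tb else Pi2
    y[t] = Pi @ y[t - 1] + rng.normal(0.0, sigma, 2)
y = y[1:]

# Fit an unrestricted VAR(1) with intercept over the full sample by OLS.
X = np.column_stack([np.ones(T - 1), y[:-1]])
B = np.linalg.lstsq(X, y[1:], rcond=None)[0]
Pi_hat = B[1:].T                            # estimated dynamic matrix (mixes the regimes)

# h-step responses of (y1, y2) to a one-standard-error shock in y2.
shock = np.array([0.0, sigma])
for label, P in (("regime 1", Pi1), ("regime 2", Pi2), ("full-sample fit", Pi_hat)):
    responses = [np.linalg.matrix_power(P, h) @ shock for h in range(1, 5)]
    print(label, np.round(responses, 4))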
The resulting constancy-test rejection frequencies are reported graphically for both p values to illustrate the outcomes visually: the vertical axes show the rejection frequencies plotted against the sample sizes on the horizontal.

[Figure 2: Constancy-test rejection frequencies for the I(0) null, with upper and lower 95% lines around the 5% and 1% tests.]

6.2.1 Test rejection frequencies under the null

As fig. 2 reveals, the null rejection frequencies in the I(0) baseline data are reassuring: with 1000 replications, the approximate 95% confidence intervals are (0.036, 0.064) and (0.004, 0.016) for 5% and 1% nominal, and these are shown on the graphs as dotted and dashed lines respectively. The actual test null rejection frequencies are, therefore, close to their nominal levels. This gives us confidence that the estimated power to detect the break is reliable.

6.2.2 Shift in the dynamics

The constancy-test graph in fig. 3 shows the rejection frequencies when a break occurs. The highest power is less than 25%, even though the change constitutes a major structural break for the model economy: the detectability of a shift in dynamics is low when the DGP is an I(0) VAR. This may be an explanation for the lack of evidence supporting the Lucas (1976) critique: shifts in zero-mean reaction parameters are relatively undetectable, rather than absent.

[Figure 3: Constancy-test rejection frequencies for the I(0) structural break, for the 0.05 and 0.01 tests.]

6.2.3 Misleading impulse responses

Finally, we record the impulse responses from the averages of pre- and post-break models, and the model fitted across the regime shifts, in fig. 4. The contrast is marked: despite the near undetectability of the break, the signs of most of the impulses have altered, and those obtained from the fitted model sometimes reflect one regime, and sometimes the other. Overall, mis-leading policy advice would follow, since even testing for the break would rarely detect it.

6.3 Policy analysis in open models

Many of the problems in analyzing the responses of targets to changes in instruments noted above are absent when the modelling is validly conditional on the instruments, leading to an open model. Since it is often difficult to model the time-series behaviour of the policy instruments, particularly in high-dimensional systems, conditioning on them is much easier and is certainly preferable to omitting them from the analysis. For economic policy analysis, another advantage of modelling the n_y target variables y_t conditionally on the n_z instrument variables z_t is that the multipliers ∂y_{t+h}/∂z_t′, which are important ingredients in the required policy responses, are directly estimable analytically, or at worst via simulation. The fact that the {z_t} process is under the control of a policy agency does not ensure that z_t are exogenous variables: indeed, policy is likely to depend on precisely the disequilibria in the rest of the economy that are key to its internal dynamics. Although the weak exogeneity of z_t for the parameters of the endogenous variables’ equations is required for there to be no loss of information in making inferences on the parameters of interest, in practice it is likely that reliable estimates of policy responses will be obtained even when z_t is not weakly exogenous.
A more important requirement is that whenever policy involves a regime shift, the instruments must be super exogenous for the parameters of interest. Co-breaking (described in section 1.2 above) between the targets and instruments then ensures that the policy is effective, and that the response of y_t can be reliably estimated (efficiently, when z_t is weakly exogenous for the response parameters). Since realistic policies involve deterministic shifts, any failure of co-breaking will be readily detected.

6.3.1 Stationary process

This section draws on some results in Ericsson et al. (1998a), and is presented for completeness as a preliminary to considering the more realistic integrated setting in the next sub-section.

[Figure 4: Impulse-response comparisons in an I(0) VAR. Panels show shocks from y_1 and y_2 to y_1 and y_2, for the regime 1, regime 2, and mixed estimates.]

Modelling the conditional distribution for y_t given z_t (and any relevant lags) will yield efficient inference on the parameters of interest when z_t is weakly exogenous for those parameters. In addition, the conditional model will provide reliable estimates of the response in y_t to policy changes in z_t when its parameters are invariant to the policy change. When these conditions are satisfied, the conditional model provides viable impulse responses and dynamic multipliers for assessing the effects of policy. However, Ericsson et al. (1998a) showed that, in general, the weak exogeneity status of conditioning variables is not invariant to transformations such as orthogonalizations, or identified ‘structural VARs’. Irrespective of the exogeneity status of z_t, modelling the conditional distribution alone will result in different impulse-response matrices (\partial y_{t+h} / \partial \epsilon_t') and dynamic multipliers (\partial y_{t+h} / \partial z_t'), because the latter take into account the effects from contemporaneous and lagged z_t. Thus, the response of y_t to an impulse in the innovation \epsilon_t of the conditional model is not the relevant response for assessing the effects of policy changes in the {z_t} process.

6.3.2 Integrated process

In an integrated process, the class of open models will be equilibrium-correction systems conditional on the current growth rate of the policy instruments, \Delta z_t, assuming that the z_t are I(1) and are included in some of the r cointegration relations \beta' x_{t-1}. For there to be no loss of information, this analysis requires that z_t be weakly exogenous for both the long-run parameters \beta and any short-run dynamic response parameters (on both lagged y_{t-s} and lagged z_{t-k}), all of which should be invariant to the policy change. Under these conditions, it is possible to estimate the responses in the growth rates \Delta y_t and the disequilibria \beta' x_t to particular choices of the instruments z_t, even when the latter are I(1). To derive the impact on (say) \Delta y_{t+h} from a change in \Delta z_t requires a specification of the future path of \Delta z_{t+i}, i = 1, ..., h, in response to \Delta z_t; implicitly, the model must be closed.
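As a stylized illustration of that point, with hypothetical coefficient values rather than estimates and a deliberately simple conditional equilibrium-correction equation, the response of y_{t+h} can only be traced out once a future path for the instrument is assumed:

# Illustrative sketch (hypothetical coefficients, not estimated from data): dynamic
# multipliers dy_{t+h}/d(Delta z_t) in a conditional equilibrium-correction model
# require an assumed future path for the instrument, i.e. the model must be closed.
import numpy as np

gamma0, gamma1, lam, beta = 0.4, 0.1, -0.3, 1.0      # hypothetical conditional-model parameters

def path_y(z, y0=0.0, horizon=12):
    """Iterate Delta y_t = gamma0*Dz_t + gamma1*Dz_{t-1} + lam*(y_{t-1} - beta*z_{t-1})."""
    y = np.empty(horizon + 1)
    y[0] = y0
    dz = np.diff(z, prepend=z[0])
    for t in range(1, horizon + 1):
        dy = gamma0 * dz[t] + gamma1 * dz[t - 1] + lam * (y[t - 1] - beta * z[t - 1])
        y[t] = y[t - 1] + dy
    return y

horizon = 12
z_base = np.zeros(horizon + 1)                       # baseline instrument path
z_shift = z_base.copy()
z_shift[1:] += 1.0                                   # permanent unit step in z at t = 1
multipliers = path_y(z_shift, horizon=horizon) - path_y(z_base, horizon=horizon)
print(np.round(multipliers[1:], 3))                  # responses converge towards beta = 1.0

Here a permanently maintained unit step in z is assumed; a different assumed path for \Delta z_{t+i} (a temporary change, say, or a feedback rule) would deliver different multipliers from the same equation.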
This provides a link to the ‘policy rules’ literature (see, e.g., Taylor, 1993, 2000), where alternative mappings of the policy instruments onto past values of disequilibria are evaluated. Nevertheless, the outcomes obtained can differ substantially from impulse-response analysis based on a (cointegrated) VAR when the policy rule does not coincide with the historical description of policy responses and, importantly, when the policy rule itself is changed, perhaps as a result of observing previous responses to policy.

Partitioning the disequilibria \beta' x_t = \beta_y' y_t + \beta_z' z_t reveals that \beta_y' y_t are feasible target variables in this context, despite y_t and \beta_y' y_t being I(1). However, an implication of this analysis is that very special conditions are required for a policy that changes a single instrument z_{i,t} (e.g., the minimum lending rate) to successfully target a single target variable y_{j,t} (e.g., inflation) when these variables are I(1). Conversely, Johansen and Juselius (2000) demonstrate that if a policy that targets an I(1) variable is successful, then the target will be rendered I(0). Also, there are only (n_y + n_z) - r unconstrained stochastic trends when r is equal to the number of cointegrating vectors, so the growth rates \gamma_y of y_t and \gamma_z of z_t are linked by \beta_y' \gamma_y + \beta_z' \gamma_z = 0. However, when there is co-breaking between y_t and z_t, a change in \gamma_z will result in a corresponding change in \gamma_y, the unconditional mean of \Delta y_t. Hence, just as in Hendry and Mizon (1998, 2000), linkages between deterministic terms are critical for policy to be effective when it is implemented via shifts in deterministic terms in the instrument process. Moreover, co-breaking here requires that y_t responds to contemporaneous and/or lagged changes in z_t. An important aspect of policy changes which comprise deterministic shifts is their ability to reveal previously undetected changes which might contaminate model specification. The dynamic response in a model will trace out a sequence of shifts over time, which will differ systematically from the corresponding responses in the economy when earlier changes lurk undetected. While the outcome will not be as anticipated, the mis-specification does not persist undetected.

7 Empirical model selection

First developing congruent general models, then selecting appropriate simplifications thereof that retain only the relevant information, has not proved easy – even for experienced practitioners. The former remains the domain where considerable detailed institutional, historical and empirical knowledge interacts with the value-added insights and clever theories of investigators: a good initial general model is essential. However, the latter is primarily determined by econometric modelling skills, and the developments in Hoover and Perez (1999) suggest automating those aspects of the task that require the implementation of selection rules, namely the simplification process. Just as early chess-playing programs were easily defeated, but later ones can systematically beat Grandmasters, so we anticipate that computer-automated model-selection software will develop well beyond the capabilities of the most expert modellers. We now explain why a general-to-specific modelling strategy – as implemented in PcGets – is able to perform so well despite the problem of ‘data mining’, discuss the costs of search, distinguish them from the (unavoidable) costs of inference, and suggest that the practical modelling problem is to retain relevant variables, not eliminate spurious ones. Statistical inference is always uncertain because of type I and type II errors (rejecting the null when it is true, and failing to reject the null when it is false, respectively).
Even if the DGP were derived a priori from economic theory, an investigator could not know that such a specification was ‘true’, and inferential mistakes will occur when testing hypotheses about it. This is a ‘pre-test’ problem: beginning with the truth and testing it will sometimes lead to false conclusions. ‘Pre-testing’ is known to bias estimated coefficients, and may distort inference (see, inter alia, Judge and Bock, 1978). Of course, the DGP specification is never known in practice, and since ‘theory dependence’ in a model has as many drawbacks as ‘sample dependence’, data-based model-search procedures are used in practice, thus adding search costs to the costs of inference.

A number of arguments point towards the advantages of ‘general-to-specific’ searches. Statistical analyses of repeated testing provide a pessimistic background: every test has a non-zero null rejection frequency (‘size’), so type I errors accumulate. Size could be lowered by making the significance levels of the selection tests more stringent, but only at the cost of reducing power to detect the influences that really matter. The simulation experiments in Lovell (1983) suggested that search had high costs, leading to an adverse view of ‘data mining’. However, he evaluated outcomes against the truth, compounding costs of inference with costs of search. Rather, the key issue for any model-selection procedure is: how costly is it to search across many alternatives relative to commencing from the DGP? As we now discuss, it is feasible to lower size and raise power simultaneously by improving the search algorithm.

First, White (1990) showed that with sufficiently rigorous testing and a large enough data sample, the selected model will converge to the DGP, so selection error is a ‘small-sample’ problem, albeit a difficult and prevalent one. Secondly, Mayo (1981) noted that diagnostic testing is effectively independent of the sufficient statistics from which parameter estimates are derived, so it does not distort the latter. Thirdly, since the DGP is obviously congruent with itself, congruent models are the appropriate class within which to search; this argues for commencing the search from a congruent model. Fourthly, encompassing – explaining the evidence relevant for all alternative models under consideration – resolves ‘data mining’ (see Hendry, 1995a) and delivers a dominant outcome; this suggests commencing from a general model that embeds all relevant contenders. Fifthly, any model-selection process must avoid getting stuck in search paths that inadvertently delete relevant variables, thereby retaining many other variables as proxies. The resulting approach of sequentially simplifying a congruent general unrestricted model (GUM) to obtain the maximal acceptable reduction is called general-to-specific (Gets).

To evaluate the performance of Gets modelling procedures, Hoover and Perez (1999) reconsidered the Lovell (1983) experiments, searching for a single conditional equation (with 0 to 5 regressors) from a large macroeconomic database (containing up to 40 variables, including lags). By following several reduction search paths – each terminated by either no further feasible reductions or significant diagnostic-test outcomes – they showed how much better the structured Gets approach was than any method Lovell considered, suggesting that modelling per se need not be bad. A stylized single-path sketch of such a reduction search follows.
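The following is a deliberately simplified sketch in Python of one such reduction path (an illustration only: PcGets explores many paths, applies pre-selection and F-tests, uses a battery of diagnostic tests rather than the single check below, and chooses among terminal models by encompassing and information criteria):

# Illustrative single-path general-to-specific reduction (a sketch, not PcGets):
# backward elimination from a general model, accepting a deletion only if the
# reduced model still passes a (single, crude) residual-autocorrelation diagnostic.
import numpy as np
from scipy import stats

def ols(y, X):
    """OLS coefficients, t-ratios (normal approximation) and residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s2 = e @ e / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, b / se, e

def congruent(e, alpha=0.01):
    """Crude diagnostic: no first-order residual autocorrelation."""
    X = np.column_stack([np.ones(len(e) - 1), e[:-1]])
    _, t, _ = ols(e[1:], X)
    return abs(t[1]) < stats.norm.ppf(1 - alpha / 2)

def gets_one_path(y, X, names, alpha=0.05):
    """Delete the least significant regressor while every deletion keeps the model congruent."""
    crit = stats.norm.ppf(1 - alpha / 2)
    keep = list(range(X.shape[1]))
    while keep:
        _, t, _ = ols(y, X[:, keep])
        dropped = False
        for j in np.argsort(np.abs(t)):              # least significant candidates first
            if abs(t[j]) >= crit:
                break                                # everything remaining is significant
            trial = [k for i, k in enumerate(keep) if i != j]
            _, _, e = ols(y, X[:, trial]) if trial else (None, None, y)
            if congruent(e):                         # accept the reduction
                keep, dropped = trial, True
                break
        if not dropped:
            break
    return [names[k] for k in keep]

# Hypothetical example: a GUM with 10 candidate regressors, of which only two matter.
rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 10))
y = 0.8 * Z[:, 0] - 0.5 * Z[:, 1] + rng.standard_normal(200)
print(gets_one_path(y, Z, [f"x{i}" for i in range(10)]))   # usually ['x0', 'x1'], occasionally plus a chance retention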
Indeed, the overall ‘size’ (false null rejection frequency) of the Hoover–Perez selection procedure was close to that expected without repeated testing, yet the power was reasonable. Building on their findings, Hendry and Krolzig (1999) and Krolzig and Hendry (2000) developed the Ox (see Doornik, 1999) program PcGets, which first tests the congruency of a GUM, then conducts pre-selection tests for ‘highly irrelevant’ variables at a loose significance level (25% or 50%, say), and simplifies the model accordingly. It then explores many selection paths to eliminate statistically-insignificant variables on F- and t-tests, applying diagnostic tests to check the validity of all reductions, thereby ensuring a congruent final model. All the terminal selections resulting from the search paths are stored, and encompassing procedures and information criteria select between the contenders. Finally, sub-sample significance is used to assess the reliability of the resulting model choice. In Monte Carlo experiments, PcGets recovers the DGP with power close to what one would expect if the DGP were known, and with empirical size often below the nominal, suggesting that search costs are in fact low. In the ‘classic’ experiment in which the dependent variable is regressed on 40 irrelevant regressors, PcGets correctly finds the null model about 97% of the time for the Lovell database.

Some simple analytics proposed in Hendry (2000a) suggest why PcGets performs well, even though the following analysis ignores pre-selection, search paths and diagnostic testing (all of which improve the algorithm). An F-test against the GUM using critical value c_\gamma would have size P(F \geq c_\gamma) = \gamma under the null if it were the only test implemented. For k regressors, the probability of retaining no variables from t-tests at size \alpha is:

P(|t_i| < c_\alpha, \forall i = 1, ..., k) = (1 - \alpha)^k,   (10)

where the average number of variables retained then is:

\bar{n} = k \alpha.   (11)

Combined with the F-test of the GUM, the probability p of correctly selecting the null model is no smaller than:

p = (1 - \gamma) + \gamma (1 - \alpha)^k.   (12)

For \gamma = 0.05 and \alpha = 0.01, when k = 40, then p \approx 0.98 and \bar{n} = 0.4. Although falsely rejecting the null on the F-test signals that spurious significance lurks, so (11) will understate the number of regressors then retained, nevertheless eliminating adventitiously-significant (spurious) variables is not the real problem in empirical modelling. Indeed, the focus in earlier research on ‘over-fitting’ – reflecting inferior algorithms – has misdirected the profession’s attention. The really difficult problem is retaining the variables that matter.

Consider an equation with six relevant regressors, all with (absolute) t-values of 2 on average (i.e., E[|t_i|] = 2). The probability in any given sample that each observed |\hat{t}_i| \geq c_\alpha \approx 2 (say, at 5%) is approximately 0.5, so even if one began with the DGP, the probability of retaining all six is:

P(|\hat{t}_i| \geq c_\alpha, \forall i = 1, ..., 6 \mid |t_i| = 2) = 0.5^6 \approx 0.016.

Using 1% significance lowers this to essentially zero. Surprisingly, even if every E[|t_i|] = 3, the chances of keeping the DGP specification are poor:

P(|\hat{t}_i| \geq c_\alpha, \forall i = 1, ..., 6 \mid |t_i| = 3) = 0.84^6 \approx 0.35.

Thus, the costs of inference are high in such full-sample testing, and will lead to under-estimating model size. An alternative, block-testing, approach discussed in Hendry (2000a) seems able to improve the power substantially. These calculations are checked numerically in the sketch below.
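A minimal numerical check of these magnitudes, under the normal approximation to the t-distribution (the helper function below is illustrative, not part of PcGets):

# Numerical check of the selection probabilities above, assuming two-sided tests
# with normal critical values as an approximation to the t-distribution.
from scipy import stats

def p_retain_all(k, true_t, alpha=0.05):
    """P(all k estimated |t|-values exceed c_alpha) when each t-ratio is centred on true_t."""
    c = stats.norm.ppf(1 - alpha / 2)
    p_one = stats.norm.sf(c - true_t) + stats.norm.cdf(-c - true_t)
    return p_one ** k

print(round(0.99 ** 40, 3))                # (1 - alpha)^k in (10) with alpha = 0.01, k = 40: about 0.669
print(round(0.95 + 0.05 * 0.99 ** 40, 3))  # eq. (12) with gamma = 0.05: about 0.983, i.e. roughly 0.98
print(round(p_retain_all(6, 2.0), 3))      # about 0.02, close to 0.5^6 = 0.016 in the text
print(round(p_retain_all(6, 3.0), 3))      # about 0.38, close to 0.84^6 = 0.35 in the text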
Nevertheless, many empirical equations have many regressors. This is probably due to the high average t-values found in economics:

P(|\hat{t}_i| \geq c_\alpha, \forall i \mid |t_i| = 5) \approx 0.989^6 \approx 0.935

(so almost all will always be retained), and not to selection biases, as shown above. Even selecting by t-testing from 40 candidate regressors at 5% would only deliver 2 significant variables on average. We conclude that models with many significant variables correctly represent some of the complexity of aggregate economic behaviour, and do not reflect ‘over-fitting’. The evidence to date in Hoover and Perez (1999) and Hendry and Krolzig (1999) for conditional dynamic models, Krolzig (2000) for VARs, and Hoover and Perez (2000) for cross-section data sets is of equally impressive performance by Gets in model selection. Certainly, the Monte Carlo evidence concerns selecting a model which is a special case of the DGP, whereas empirically, models are unlikely even to be special cases of the local DGP (LDGP). However, that is not an argument against Gets, but confirms the need to commence with general specifications that have some chance of embedding the LDGP. Only then will reliable approximations to the actual economic process be obtained.

8 Congruent modelling

As a usable knowledge base, theory-related, congruent and encompassing econometric models remain undominated, matching the data in all measurable respects (see, e.g., Hendry, 1995a). For empirical understanding, such models seem likely to remain an integral component of any progressive research strategy. Nevertheless, even the ‘best available model’ can be caught out when forecasting by an unanticipated outbreak of (say) a major war or other crisis for which no effect was included in the forecast. However, if empirical models which are congruent within sample remain subject to a non-negligible probability of failing out of sample, then a critic might doubt their worth. Our defence of the program of attempting to discover such models rests on the fact that empirical research is part of a progressive strategy, in which knowledge gradually accumulates. This includes knowledge about general causes of structural changes, such that later models incorporate measures accounting for previous events, and hence are more robust (e.g., to wars, changes in credit rationing, financial innovations, etc.). For example, the dummy variables for purchase-tax changes in Davidson et al. (1978) that at the time ‘mopped up’ forecast failure later successfully predicted the effects of introducing VAT, as well as the consequences of its doubling in 1979; and the First World War shift in money demand in Ericsson, Hendry and Prestwich (1998b) matched that needed for the Second World War.

Since we now have an operational selection methodology with excellent properties, Gets seems a natural way to select models for empirical characterization, theory testing and policy analyses. When the GUM is a congruent representation, embedding the available theory knowledge of the target-instrument linkages, and parsimoniously encompassing previous empirical findings, the selection strategy described in section 7 offers scope for selecting policy models. Four features favour such a view. First, for a given null rejection frequency, variables that matter in the DGP are selected with the same probabilities as if the DGP were known. In the absence of omniscience, it is difficult to imagine doing much better systematically.
Secondly, although estimates are biased on average, conditional on retaining a variable, its coefficient provides an unbiased estimate of the policy-reaction parameter. This is essential for economic policy – if a variable is included, PcGets delivers the right response; otherwise, when it is excluded, one is simply unaware that such an effect exists.¹ Thirdly, the probability of retaining adventitiously-significant variables is around the anticipated level for the variables that remain after pre-selection simplification. If that is (say) even as many as 30 regressors, of which 5 actually matter, then at 1% significance, 0.25 extra variables will be retained on average: i.e., one additional ‘spuriously-significant’ variable per four equations. This seems unlikely to distort policy in important ways. Finally, the sub-sample – or, more generally, recursive – selection procedures help to reveal which variables have non-central t-statistics, and which have central ones (and hence should be eliminated). Overall, the role of Gets in selecting policy models looks promising.

Because changes to the coefficients of zero-mean variables are difficult to detect in dynamic models, they remain hazardous for policy models: the estimated parameters would appear to be constant, but would in fact be mixtures across regimes, leading to inappropriate advice. In a progressive research context (i.e., from the perspective of learning), this is unproblematic, since most policy changes involve deterministic shifts (as opposed to mean-preserving spreads), so earlier incorrect inferences will be detected rapidly – but that is cold comfort to the policy maker, or to the economic agents subjected to the wrong policies.

¹ This is one of three reasons why we have not explored ‘shrinkage’ estimators, which have been proposed as a solution to the ‘pre-test’ problem: namely, they deliver biased estimators (see, e.g., Judge and Bock, 1978). The second, and main, reason is that such a strategy has no theoretical underpinnings in processes subject to intermittent parameter shifts. The final reason concerns the need for progressivity, explaining more by less, which such an approach hardly facilitates.

9 Conclusion

The implications for econometric modelling that result from observing forecast failure differ considerably from those obtained when the model is assumed to coincide with a constant mechanism. Causal information can no longer be shown to dominate non-causal information uniformly. Intercept corrections have no theoretical justification in stationary worlds with correctly-specified empirical models, but in a world subject to structural breaks of unknown form, size, and timing, they serve to ‘robustify’ forecasts against deterministic shifts – as the practical efficacy of intercept corrections confirms. Forecasting success is no better an index for model selection than forecast failure is for model rejection. Thus, emphasizing ‘out-of-sample’ forecast performance (perhaps because of fears over ‘data-mining’) is unsustainable (see, e.g., Newbold, 1993, p. 658), as is the belief that a greater reliance on economic theory will help forecasting (see, e.g., Diebold, 1998), because that does not tackle the root problem. A taxonomy of potential sources of forecast errors clarifies the roles of model mis-specification, sampling variability, error accumulation, forecast-origin mis-measurement, intercept shifts, and slope-parameter changes. Forecast failure seems primarily attributable to deterministic shifts in the model relative to the data.
Other shifts are far more difficult to detect. Such findings are potentially disastrous for ‘impulse-response’ analyses of economic policy. Since changes in VAR intercepts and dynamic coefficient matrices may not be detected even when tested for, while the recorded estimates are a weighted average across the different regimes, the resulting impulse responses do not represent the policy outcomes that will in fact occur. If the economy were reducible by transformations to a stationary stochastic process, where the resulting unconditional moments were constant over time, then well-tested, causally-relevant, congruent models which embodied valid theory restrictions would both fit best and, by encompassing, also dominate in forecasting on average. The prevalence historically of unanticipated deterministic shifts suggests that such transformations do not exist. Even the best policy model may fail at forecasting in such an environment. As we have shown, this need not impugn its policy relevance – other criteria than forecasting are needed for that judgement. Nevertheless, the case for continuing to use econometric systems probably depends on their competing reasonably successfully in the forecasting arena. Cointegration, co-breaking, and model-selection procedures as good as PcGets, with rigorous testing, will help in understanding economic behaviour and evaluating policy options, but none of these ensures immunity to forecast failure from new breaks. An approach which incorporates causal information in a congruent econometric system for policy, but operates with robustified forecasts, merits consideration. We have not yet established formally that Gets should be used for selecting policy models from a theory-based GUM – but such a proof should be possible, given the relative accuracy with which the DGP is located. Achieving that aim represents the next step of our research program, and we anticipate establishing that a data-based Gets approach will perform well in selecting models for policy.

References

Anderson, G. J., & Mizon, G. E. (1989). What can statistics contribute to the analysis of economic structural change? In Hackl, P. (ed.), Statistical Analysis and Forecasting of Economic Structural Change, Ch. 1. Berlin: Springer-Verlag.
Banerjee, A., & Hendry, D. F. (1992). Testing integration and cointegration: An overview. Oxford Bulletin of Economics and Statistics, 54, 225–255.
Banerjee, A., Hendry, D. F., & Mizon, G. E. (1996). The econometric analysis of economic policy. Oxford Bulletin of Economics and Statistics, 58, 573–600.
Bernanke, B. S. (1986). Alternative explorations of the money-income correlation. In Brunner, K., & Meltzer, A. H. (eds.), Real Business Cycles, Real Exchange Rates, and Actual Policies, Vol. 25 of Carnegie-Rochester Conferences on Public Policy, pp. 49–99. Amsterdam: North-Holland Publishing Company.
Blanchard, O., & Quah, D. (1989). The dynamic effects of aggregate demand and supply disturbances. American Economic Review, 79, 655–673.
Clements, M. P., & Hendry, D. F. (1998). Forecasting Economic Time Series. Cambridge: Cambridge University Press.
Clements, M. P., & Hendry, D. F. (1999a). Forecasting Non-stationary Economic Time Series. Cambridge, Mass.: MIT Press.
Clements, M. P., & Hendry, D. F. (1999b). Modelling methodology and forecast failure. Unpublished typescript, Economics Department, University of Oxford.
Clements, M. P., & Hendry, D. F. (1999c). On winning forecasting competitions in economics. Spanish Economic Review, 1, 123–160.
Davidson, J. E. H., Hendry, D. F., Srba, F., & Yeo, J. S. (1978). Econometric modelling of the aggregate time-series relationship between consumers’ expenditure and income in the United Kingdom. Economic Journal, 88, 661–692. Reprinted in Hendry, D. F., Econometrics: Alchemy or Science? Oxford: Blackwell Publishers, 1993, and Oxford University Press, 2000.
Diebold, F. X. (1998). The past, present and future of macroeconomic forecasting. The Journal of Economic Perspectives, 12, 175–192.
Doornik, J. A. (1999). Object-Oriented Matrix Programming using Ox, 3rd edn. London: Timberlake Consultants Press.
Ericsson, N. R. (1992). Cointegration, exogeneity and policy analysis: An overview. Journal of Policy Modeling, 14, 251–280.
Ericsson, N. R., Hendry, D. F., & Mizon, G. E. (1998a). Exogeneity, cointegration and economic policy analysis. Journal of Business and Economic Statistics, 16, 370–387.
Ericsson, N. R., Hendry, D. F., & Prestwich, K. M. (1998b). The demand for broad money in the United Kingdom, 1878–1993. Scandinavian Journal of Economics, 100, 289–324.
Ericsson, N. R., & Irons, J. S. (1995). The Lucas critique in practice: Theory without measurement. In Hoover, K. D. (ed.), Macroeconometrics: Developments, Tensions and Prospects. Dordrecht: Kluwer Academic Press.
Ericsson, N. R., & MacKinnon, J. G. (1999). Distributions of error correction tests for cointegration. International finance discussion paper no. 655, Federal Reserve Board of Governors, Washington, D.C. www.bog.frb.fed.us/pubs/ifdp/1999/655/default.htm.
Feige, E. L., & Pearce, D. K. (1976). Economically rational expectations. Journal of Political Economy, 84, 499–522.
Fildes, R. A., & Makridakis, S. (1995). The impact of empirical accuracy studies on time series analysis and forecasting. International Statistical Review, 63, 289–308.
Granger, C. W. J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37, 424–438.
Hall, R. E. (1978). Stochastic implications of the life cycle-permanent income hypothesis: Evidence. Journal of Political Economy, 86, 971–987.
Hendry, D. F. (1994). HUS revisited. Oxford Review of Economic Policy, 10, 86–106.
Hendry, D. F. (1995a). Dynamic Econometrics. Oxford: Oxford University Press.
Hendry, D. F. (1995b). Econometrics and business cycle empirics. Economic Journal, 105, 1622–1636.
Hendry, D. F. (2000a). Econometrics: Alchemy or Science? Oxford: Oxford University Press. New edition.
Hendry, D. F. (2000b). Forecast failure, expectations formation, and the Lucas critique. Mimeo, Nuffield College, Oxford.
Hendry, D. F., & Doornik, J. A. (1997). The implications for econometric modelling of forecast failure. Scottish Journal of Political Economy, 44, 437–461. Special Issue.
Hendry, D. F., & Krolzig, H.-M. (1999). Improving on ‘Data mining reconsidered’ by K.D. Hoover and S.J. Perez. Econometrics Journal, 2, 202–219.
Hendry, D. F., & Mizon, G. E. (1998). Exogeneity, causality, and co-breaking in economic policy analysis of a small econometric model of money in the UK. Empirical Economics, 23, 267–294.
Hendry, D. F., & Mizon, G. E. (2000). On selecting policy analysis models by forecast accuracy. In Atkinson, A. B., Glennerster, H., & Stern, N. (eds.), Putting Economics to Work: Volume in Honour of Michio Morishima, pp. 71–113. London School of Economics: STICERD.
Hendry, D. F., & Richard, J.-F. (1982). On the formulation of empirical models in dynamic econometrics. Journal of Econometrics, 20, 3–33. Reprinted in Granger, C. W. J. (ed.) (1990), Modelling Economic Series. Oxford: Clarendon Press, and in Hendry, D. F. (1993, 2000), op. cit.
Hoover, K. D., & Perez, S. J. (1999). Data mining reconsidered: Encompassing and the general-to-specific approach to specification search. Econometrics Journal, 2, 167–191.
Hoover, K. D., & Perez, S. J. (2000). Truth and robustness in cross-country growth regressions. Unpublished paper, Economics Department, University of California, Davis.
Johansen, S., & Juselius, K. (2000). How to control a target variable in the VAR model. Mimeo, European University Institute, Florence.
Judge, G. G., & Bock, M. E. (1978). The Statistical Implications of Pre-Test and Stein-Rule Estimators in Econometrics. Amsterdam: North-Holland Publishing Company.
Krolzig, H.-M. (2000). General-to-specific reductions of vector autoregressive processes. Unpublished paper, Department of Economics, University of Oxford.
Krolzig, H.-M., & Hendry, D. F. (2000). Computer automation of general-to-specific model selection procedures. Journal of Economic Dynamics and Control, forthcoming.
Lovell, M. C. (1983). Data mining. Review of Economics and Statistics, 65, 1–12.
Lucas, R. E. (1976). Econometric policy evaluation: A critique. In Brunner, K., & Meltzer, A. (eds.), The Phillips Curve and Labor Markets, Vol. 1 of Carnegie-Rochester Conferences on Public Policy, pp. 19–46. Amsterdam: North-Holland Publishing Company.
Lütkepohl, H. (1991). Introduction to Multiple Time Series Analysis. New York: Springer-Verlag.
Makridakis, S., & Hibon, M. (2000). The M3-competition: Results, conclusions and implications. Discussion paper, INSEAD, Paris.
Mayo, D. (1981). Testing statistical testing. In Pitt, J. C. (ed.), Philosophy in Economics, pp. 175–230. D. Reidel Publishing Co. Reprinted as pp. 45–73 in Caldwell, B. J. (1993), The Philosophy and Methodology of Economics, Vol. 2, Aldershot: Edward Elgar.
Miller, P. J. (1978). Forecasting with econometric methods: A comment. Journal of Business, 51, 579–586.
Muellbauer, J. N. J. (1994). The assessment: Consumer expenditure. Oxford Review of Economic Policy, 10, 1–41.
Newbold, P. (1993). Comment on ‘On the limitations of comparing mean squared forecast errors’, by M.P. Clements and D.F. Hendry. Journal of Forecasting, 12, 658–660.
Runkle, D. E. (1987). Vector autoregressions and reality. Journal of Business and Economic Statistics, 5, 437–442.
Sims, C. A. (1980). Macroeconomics and reality. Econometrica, 48, 1–48. Reprinted in Granger, C. W. J. (ed.) (1990), Modelling Economic Series. Oxford: Clarendon Press.
Sims, C. A., Stock, J. H., & Watson, M. W. (1990). Inference in linear time series models with some unit roots. Econometrica, 58, 113–144.
Taylor, J. B. (1993). Discretion versus policy rules in practice. Carnegie-Rochester Conference Series on Public Policy, 39, 195–214.
Taylor, J. B. (2000). The monetary transmission mechanism and the evaluation of monetary policy rules. Forthcoming, Oxford Review of Economic Policy.
Turner, D. S. (1990). The role of judgement in macroeconomic forecasting. Journal of Forecasting, 9, 315–345.
White, H. (1990). A consistent model selection. In Granger, C. W. J. (ed.), Modelling Economic Series, pp. 369–383. Oxford: Clarendon Press.