Journal of Hydrology 377 (2009) 80–91

Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling

Hoshin V. Gupta, Harald Kling *, Koray K. Yilmaz 1, Guillermo F. Martinez

Department of Hydrology and Water Resources, The University of Arizona, 1133 E North Campus Dr., Tucson, AZ 85721, USA

* Corresponding author. Present address: IWHW, University of Natural Resources and Applied Life Sciences, Muthgasse 18, 1190 Vienna, Austria. Tel.: +1 520 626 9712; fax: +1 520 621 1422. E-mail address: [email protected] (H. Kling).
1 Present addresses: Earth System Science Interdisciplinary Center, University of Maryland, College Park, MD 20742, USA; NASA Goddard Space Flight Center, Laboratory for Atmospheres, Greenbelt, MD 20771, USA.

Article history: Received 8 April 2009; Received in revised form 30 June 2009; Accepted 3 August 2009. This manuscript was handled by Andreas Bardossy, Editor-in-Chief, with the assistance of Attilio Castellarin, Associate Editor.

Keywords: Mean squared error; Nash–Sutcliffe efficiency; Model performance evaluation; Calibration; Multiple criteria; Criteria decomposition

Summary

The mean squared error (MSE) and the related normalization, the Nash–Sutcliffe efficiency (NSE), are the two criteria most widely used for calibration and evaluation of hydrological models with observed data. Here, we present a diagnostically interesting decomposition of NSE (and hence MSE), which facilitates analysis of the relative importance of its different components in the context of hydrological modelling, and show how model calibration problems can arise due to interactions among these components. The analysis is illustrated by calibrating a simple conceptual precipitation-runoff model to daily data for a number of Austrian basins having a broad range of hydro-meteorological characteristics. Evaluation of the results clearly demonstrates the problems that can be associated with any calibration based on the NSE (or MSE) criterion. While we propose and test an alternative criterion that can help to reduce model calibration problems, the primary purpose of this study is not to present an improved measure of model performance. Instead, we seek to show that there are systematic problems inherent with any optimization based on formulations related to the MSE. The analysis and results have implications to the manner in which we calibrate and evaluate environmental models; we discuss these and suggest possible ways forward that may move us towards an improved and diagnostically meaningful approach to model performance evaluation and identification.

© 2009 Elsevier B.V. All rights reserved.

Introduction

The mean squared error (MSE) criterion and its related normalization, the Nash–Sutcliffe efficiency (NSE, defined by Nash and Sutcliffe, 1970) are the two criteria most widely used for calibration and evaluation of hydrological models with observed data. The value of MSE depends on the units of the predicted variable and varies on the interval [0.0 to inf], whereas NSE is dimensionless, being scaled onto the interval [−inf to 1.0]. As a consequence, the NSE value – obtained by dividing MSE by the variance of the observations and subtracting that ratio from 1.0 (Eqs. (1) and (2)) – is commonly the measure of choice for reporting (and comparing) model performance. Further, NSE can be interpreted as a classic skill score (Murphy, 1988), where skill is interpreted as the comparative ability of a model with regards to a baseline model, which in the case of NSE is taken to be the 'mean of the observations' (i.e., if NSE ≤ 0, the model is no better than using the observed mean as a predictor). The equations are:

MSE = \frac{1}{n} \sum_{t=1}^{n} (x_{s,t} - x_{o,t})^2    (1)

NSE = 1 - \frac{\sum_{t=1}^{n} (x_{s,t} - x_{o,t})^2}{\sum_{t=1}^{n} (x_{o,t} - \mu_o)^2} = 1 - \frac{MSE}{\sigma_o^2}    (2)

where n is the total number of time-steps, x_{s,t} is the simulated value at time-step t, x_{o,t} is the observed value at time-step t, and μ_o and σ_o are the mean and standard deviation of the observed values. In optimization MSE is subject to minimization and NSE is subject to maximization.

As evident from the above equations, NSE and MSE are closely related. In this study we will mainly focus on NSE, but the results can be generalized to MSE (and similar criteria such as RMSE).
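As a minimal illustration of Eqs. (1) and (2), the following Python sketch computes MSE and NSE for a pair of simulated and observed series; the arrays are invented example data and only numpy is assumed.

```python
import numpy as np

def mse(sim, obs):
    """Mean squared error, Eq. (1): mean of the squared residuals."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return np.mean((sim - obs) ** 2)

def nse(sim, obs):
    """Nash-Sutcliffe efficiency, Eq. (2): 1 - MSE / variance of the observations."""
    obs = np.asarray(obs, float)
    return 1.0 - mse(sim, obs) / np.var(obs)  # np.var defaults to the population variance (1/n)

# Invented example data: observed and simulated daily runoff (m3/s)
obs = np.array([2.0, 3.5, 10.0, 6.0, 4.0, 3.0, 2.5])
sim = np.array([2.2, 3.0,  8.5, 6.5, 4.5, 2.8, 2.4])

print(f"MSE = {mse(sim, obs):.3f}")  # subject to minimization in calibration
print(f"NSE = {nse(sim, obs):.3f}")  # subject to maximization in calibration
```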

While the NSE criterion may be a convenient and popular (albeit gross) indicator of model skill, there has been a long and vivid discussion about the suitability of NSE (McCuen and Snyder, 1975; Martinec and Rango, 1989; Legates and McCabe, 1999; Krause et al., 2005; McCuen et al., 2006; Schaefli and Gupta, 2007; Jain and Sudheer, 2008) and several authors have proposed modifications – e.g. Mathevet et al. (2006) proposed a bounded version of NSE and Criss and Winston (2008) proposed a volumetric efficiency to be used instead of NSE. One of the main concerns about NSE is its use of the observed mean as baseline, which can lead to overestimation of model skill for highly seasonal variables such as runoff in snowmelt dominated basins. A comparison of NSE across basins with different seasonality (as is often reported in the literature) should therefore be interpreted with caution. For such situations, various authors have recommended the use of the seasonal or climatological mean as a baseline model (Garrick et al., 1978; Murphy, 1988; Martinec and Rango, 1989; Legates and McCabe, 1999; Schaefli and Gupta, 2007).

It is now generally accepted that the calibration of hydrological models should be approached as a multi-objective problem (Gupta et al., 1998). Within a multiple-criteria framework, the MSE and NSE criteria continue to be commonly used, because they can be computed separately for (1) different types of observations (e.g. runoff and snow observations; Bergström et al., 2002), (2) different locations (e.g. runoff at multiple gauges; Madsen, 2003), or (3) different subsets of the same observation (e.g. rising and falling limb of the hydrograph; Boyle et al., 2000). More generally, however, different types of model performance criteria – such as NSE, coefficient of correlation, and bias – can be computed from multiple variables and/or at multiple sites (see Anderton et al., 2002; Beldring, 2002; Rojanschi et al., 2005; Cao et al., 2006; and others). When handled in this manner, the model calibration problem can be treated as a full multiple-criteria optimization problem resulting in a 'Pareto set' of non-dominated solutions (Gupta et al., 1998), or reduced to a related single-criterion optimization problem by combining the different (weighted) criteria into one overall objective function. Numerous examples of the latter approach exist in the literature where NSE or MSE appear in an overall objective function (e.g. Lindström, 1997; Bergström et al., 2002; Madsen, 2003; van Griensven and Bauwens, 2003; Parajka et al., 2005; Young, 2006; Rode et al., 2007; Marce et al., 2008; Wang et al., 2009; Safari et al., 2009), because it conveniently enables the application of efficient single-criterion automated search algorithms, such as SCE (Shuffled Complex Evolution, Duan et al., 1992) or DDS (Dynamically Dimensioned Search, Tolson and Shoemaker, 2007).

When using multiple criteria in evaluation, it has to be considered that some of these criteria are mathematically related, which is not always recognized (Weglarczyk, 1998). For example, it is possible to decompose the NSE criterion into separate components, as shown by Murphy (1988) and Weglarczyk (1998), which facilitates a better understanding of how different criteria are interrelated and thereby enable more insight into what is causing a particular model performance to be 'good' or 'bad'. Equally important, the decomposition can provide insight into possible trade-offs between the different components.

In this paper we present a diagnostically interesting decomposition of NSE (and hence MSE), which facilitates analysis of the relative importance of different components in the context of hydrological modelling. As we will show in the first part of the paper, model calibration problems can arise due to interactions among these components. Based on this analysis, we propose and test alternative criteria that can help to avoid these problems. The analysis is illustrated with a case study in the second part of the paper, where we calibrate a simple precipitation-runoff model to daily data for a number of Austrian basins having a broad range of hydro-meteorological characteristics. The evaluation of the results on both the calibration and an independent 'evaluation' period clearly demonstrates the problems that can be associated with any calibration based on the NSE (or MSE) criterion. In the third part of the paper we discuss the implications for calibration and evaluation of environmental models and we also briefly discuss some possible ways forward.

Decomposition of model performance criteria

Previous decomposition of NSE

A previous decomposition of criteria based on mean squared errors (Murphy, 1988; Weglarczyk, 1998) has shown that there are three distinctive components, represented by the correlation, the conditional bias, and the unconditional bias, as evident in Eq. (3), which shows a decomposition of NSE.

NSE = A - B - C    (3)

with:

A = r^2
B = [r - (\sigma_s/\sigma_o)]^2
C = [(\mu_s - \mu_o)/\sigma_o]^2

where r is the linear correlation coefficient between x_s and x_o, and (μ_s, σ_s) and (μ_o, σ_o) represent the first two statistical moments (means and standard deviations) of x_s and x_o, respectively. The quantity A measures the strength of the linear relationship between the simulated and observed values, B measures the conditional bias, and C measures the unconditional bias (Murphy, 1988).

New decomposition of NSE

An alternative way in which to reformulate Eq. (3) is given below as Eq. (4), which reveals that NSE consists of three distinctive components representing the correlation, the bias, and a measure of relative variability in the simulated and observed values.

NSE = 2 \alpha r - \alpha^2 - \beta_n^2    (4)

with:

\alpha = \sigma_s / \sigma_o
\beta_n = (\mu_s - \mu_o) / \sigma_o

where the quantity α is a measure of relative variability in the simulated and observed values, and β_n is the bias normalized by the standard deviation in the observed values (note that β_n = \sqrt{C}).

Eq. (4) shows that two of the three components of NSE relate to the ability of the model to reproduce the first and second moments of the distribution of the observations (i.e. mean and standard deviation), while the third relates to the ability of the model to reproduce timing and shape as measured by the correlation coefficient. The 'ideal' values for the three components are r = 1, α = 1, and β_n = 0. From a hydrological perspective, 'good' values for each of these three components are highly desirable, since in general we aim at matching the overall volume of flow, the spread of flows (e.g. flow duration curve), and the timing and shape of (for example) the hydrograph (Yilmaz et al., 2008). It is clear, therefore, that optimizing NSE is essentially a search for a balanced solution among the three components, where with 'optimal' values of the three components the overall NSE is maximized. This is similar to the multiple-criteria approach of computing an overall (weighted) objective function from several different criteria as discussed in "Introduction".
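To make the decomposition concrete, here is a small Python sketch (numpy only; the data arrays are invented for illustration) that computes the three components of Eq. (4) and checks that their combination reproduces the NSE value obtained directly from Eq. (2).

```python
import numpy as np

def nse_components(sim, obs):
    """Components of Eq. (4): correlation r, variability ratio alpha = sigma_s/sigma_o,
    and normalized bias beta_n = (mu_s - mu_o)/sigma_o."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)                    # population std (1/n), as in Eq. (2)
    beta_n = (np.mean(sim) - np.mean(obs)) / np.std(obs)
    return r, alpha, beta_n

# Invented example data: observed and simulated daily runoff (m3/s)
obs = np.array([2.0, 3.5, 10.0, 6.0, 4.0, 3.0, 2.5])
sim = np.array([2.2, 3.0,  8.5, 6.5, 4.5, 2.8, 2.4])

r, alpha, beta_n = nse_components(sim, obs)
nse_eq4 = 2 * alpha * r - alpha**2 - beta_n**2           # Eq. (4)
nse_eq2 = 1 - np.mean((sim - obs)**2) / np.var(obs)      # Eq. (2), for comparison
print(f"r = {r:.3f}, alpha = {alpha:.3f}, beta_n = {beta_n:.3f}")
print(f"NSE via Eq. (4): {nse_eq4:.4f}   NSE via Eq. (2): {nse_eq2:.4f}")
```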

However, in using NSE we must be concerned with two facts. First, the bias (μ_s − μ_o) component appears in a normalized form, scaled by the standard deviation in the observed flows. This means that in basins with high runoff variability the bias component will tend to have a smaller contribution (and therefore impact) in the computation and optimization of NSE, possibly leading to model simulations having large volume balance errors. In a multiple-criteria sense, this is equivalent to using a weighted objective function with a low weight applied to the bias component.

Second, and equally serious, the quantity α appears twice in Eq. (4), exhibiting an interesting (and problematic) interplay with the linear correlation coefficient r. It is easy to show, by taking the first derivative of NSE (in Eq. (4)) with respect to α, that the maximum value of NSE is obtained when α = r. And, since r will always be smaller than unity, this means that in maximizing NSE we will tend to select a value for α that underestimates the variability in the flows (more precisely, we will favour models/parameter sets that generate simulated flows that underestimate the variability).
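Written out, with β_n held fixed (it does not interact with α), the first-order condition and the resulting maximum are:

\frac{\partial\,NSE}{\partial \alpha} = 2r - 2\alpha = 0 \quad\Rightarrow\quad \alpha_{opt} = r, \qquad NSE(\alpha_{opt}) = r^2 - \beta_n^2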
Taking these two facts together, we note that when β_n = 0 and α = r, then the NSE is equivalent to r², which is the well-known coefficient of determination. Therefore, r² can be interpreted as a maximum (potential) value for NSE if the other two components are able to achieve their optimal values.

Fig. 1 illustrates the relationship of NSE with r and α, while assuming that β_n is zero (β_n is only an additive term, anyway). For a given r the optimal α for maximizing NSE lies on the 1:1 line, although the ideal value of α is on a horizontal line at 1.0. This theoretical relationship is illustrated in Fig. 1a. Of course, not all combinations of r and α may be possible with a hydrological model due to restrictions imposed by the model structure, feasible parameter values and input–output data. However, Fig. 1b shows a real example in which random sampling of the parameter space actually seems to cover a large portion of the theoretical criteria space. Since the model used here (HyMod model, Boyle, 2000) is a simple, but representative, example of watershed models in common use, the problematic interplay between α and r is likely to be of importance for any type of hydrological model that is optimized with NSE.

Fig. 1. Relationship of NSE with α and r (β_n is assumed to be zero). (a) Theoretical relationship illustrating ideal and optimal α: NSE at point A is greater than at point B, even though α is at its ideal value at point B. (b) Illustrative example obtained by random parameter sampling with a hydrological model: Leaf River, Mississippi, USA, 1924 km², 11 years daily data, HyMod model; only those points where β_n² ≤ 0.01 are displayed. Contour lines indicate values for NSE. See colour version of this figure online.

Further, the same exact problems will arise when using MSE as a model calibration criterion. We can substitute Eq. (4) into Eq. (2), and thereby obtain Eq. (5) which shows the related decomposition of the MSE criterion, consisting (again) of three error terms, but here all three of them are additive.

MSE = 2 \sigma_s \sigma_o (1 - r) + (\sigma_s - \sigma_o)^2 + (\mu_s - \mu_o)^2    (5)

From Eqs. (3)–(5) it should be immediately obvious that many different combinations of the three components can result in the same overall value for MSE or NSE, respectively, potentially leading to considerable ambiguity in the comparative evaluation of alternative model hypotheses. The relative contribution of each of these components to the overall MSE can be computed as:

f_i = \frac{F_i}{\sum_{j=1}^{3} F_j}    (6)

with:

F_1 = 2 \sigma_s \sigma_o (1 - r)
F_2 = (\sigma_s - \sigma_o)^2
F_3 = (\mu_s - \mu_o)^2
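The additive structure of Eq. (5) and the relative contributions of Eq. (6) can be sketched as follows (numpy, with the same invented example data as before; this is a direct transcription of the equations, not code from the study).

```python
import numpy as np

def mse_decomposition(sim, obs):
    """Return the three additive MSE terms of Eq. (5) and their
    relative contributions f_i of Eq. (6)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    sigma_s, sigma_o = np.std(sim), np.std(obs)        # population standard deviations
    mu_s, mu_o = np.mean(sim), np.mean(obs)
    F = np.array([2.0 * sigma_s * sigma_o * (1.0 - r),  # F1: correlation (timing/shape) term
                  (sigma_s - sigma_o) ** 2,             # F2: variability term
                  (mu_s - mu_o) ** 2])                  # F3: bias term
    return F, F / F.sum()

obs = np.array([2.0, 3.5, 10.0, 6.0, 4.0, 3.0, 2.5])   # invented example data
sim = np.array([2.2, 3.0,  8.5, 6.5, 4.5, 2.8, 2.4])

F, f = mse_decomposition(sim, obs)
print("F1, F2, F3 :", np.round(F, 4), "  sum =", round(F.sum(), 4))
print("MSE direct :", round(np.mean((sim - obs) ** 2), 4))  # equals the sum of the three terms
print("f1, f2, f3 :", np.round(f, 3))
```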
Alternative model performance criteria

As discussed above, a peculiar feature of the NSE criterion is the problematic interplay between α and r, which is likely to result in an underestimation of the variability in the flows. One way to overcome this is by inflating the observed variability as indicated by Eq. (7), while at the same time preserving the mean of the observations and their linear correlation with the simulations. Using Eq. (7) with Eq. (4) results in Eq. (8), which represents a 'corrected' version of NSE:

x_{o,t}^{*} = c (x_{o,t} - \mu_o) + \mu_o    (7)

NSE_{cor} = \frac{2 \alpha r}{c} - \frac{\alpha^2}{c^2} - \frac{\beta_n^2}{c^2}    (8)

where c is a correction factor to inflate the variability in the observed flows. It can be easily shown that if c is set equal to 1/r, it will assure that a value of α = 1 will now maximize NSE_cor (as opposed to α = r maximizing NSE).
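The claim about c follows from differentiating Eq. (8) with respect to α:

\frac{\partial\,NSE_{cor}}{\partial \alpha} = \frac{2r}{c} - \frac{2\alpha}{c^2} = 0 \quad\Rightarrow\quad \alpha_{opt} = c\,r, \qquad\text{so that } c = 1/r \text{ yields } \alpha_{opt} = 1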
Alternatively, instead of trying to come up with a 'corrected' NSE criterion, since MSE and NSE can be decomposed into three components, the whole calibration problem can instead be viewed from the multi-objective perspective, by focusing on the correlation, variability error and bias error as separate criteria to be optimized. In doing this, it makes sense to enable a better hydrological interpretation of the bias component by using the ratio of the means of the simulated and observed flows (β) for this further analysis – as opposed to using β_n. With this formulation, using β instead of β_n, all three of the components now have their ideal value at unity.

Fig. 2 shows an example for the trade-off between the three components for a simple hydrological model using random parameter sampling. The plot shows a distinctive Pareto front in the three-dimensional criteria space.

Fig. 2. Example for three-dimensional Pareto front of r, α and β. ED is the Euclidian distance between the optimal point and the ideal point, where all three measures are 1.0. Glan River, Austria, 432 km², 5 years daily data, HBV model variant, random parameter sampling.
If it is desired to select a compromise solution from the Pareto front, one possible approach is to compute for all points the Euclidian distance from the ideal point and then to subsequently select the point having the shortest distance (Eq. (9)). Since all three of the components are dimensionless numbers, we are able to obtain a reasonable solution for the Euclidian distance in the un-transformed criteria space. Alternatively, a re-scaling of the axes in the criteria space is easily obtained via Eq. (10). In this paper, we will only explore the use of the criterion of Eq. (9), which is equivalent to setting all three scaling factors of Eq. (10) to unity; for lack of a better name we will distinguish this criterion from the Nash–Sutcliffe efficiency (NSE) by calling it the Kling–Gupta efficiency (KGE).

KGE = 1 - ED    (9)

KGE_s = 1 - ED_s    (10)

with:

ED = \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2}
ED_s = \sqrt{[s_r (r - 1)]^2 + [s_\alpha (\alpha - 1)]^2 + [s_\beta (\beta - 1)]^2}
\beta = \mu_s / \mu_o

where ED is the Euclidian distance from the ideal point, ED_s is the Euclidian distance from the ideal point in the scaled space, β is the ratio between the mean simulated and mean observed flows, i.e. β represents the bias; s_r, s_α and s_β are scaling factors that can be used to re-scale the criteria space before computing the Euclidian distance from the ideal point, i.e. s_r, s_α and s_β can be used for adjusting the emphasis on different components. In optimization KGE and KGE_s are subject to maximization, with an ideal value at unity. Similar to NSE, r can be interpreted as a maximum (potential) value for KGE and KGE_s if the other two components are able to achieve their optimal values close to unity.

Analogous to Eq. (6) we can compute the relative contribution of the three components with Eq. (11).

g_i = \frac{G_i}{\sum_{j=1}^{3} G_j}    (11)

with:

G_1 = (r - 1)^2
G_2 = (\alpha - 1)^2
G_3 = (\beta - 1)^2
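A sketch of Eqs. (9) and (11) under the same assumptions as the earlier snippets (numpy, invented data): KGE is computed from the three components together with their relative contributions g_i.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta efficiency, Eq. (9): one minus the Euclidian distance of
    (r, alpha, beta) from the ideal point (1, 1, 1).  Also returns the
    components and their relative contributions g_i, Eq. (11)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)        # variability ratio
    beta = np.mean(sim) / np.mean(obs)       # bias as ratio of the means
    G = np.array([(r - 1)**2, (alpha - 1)**2, (beta - 1)**2])
    ed = np.sqrt(G.sum())
    return 1.0 - ed, (r, alpha, beta), G / G.sum()

obs = np.array([2.0, 3.5, 10.0, 6.0, 4.0, 3.0, 2.5])   # invented example data
sim = np.array([2.2, 3.0,  8.5, 6.5, 4.5, 2.8, 2.4])

value, (r, alpha, beta), g = kge(sim, obs)
print(f"KGE = {value:.3f}  (r = {r:.3f}, alpha = {alpha:.3f}, beta = {beta:.3f})")
print("relative contributions g_i (r, alpha, beta):", np.round(g, 3))
```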
Notes on regression lines

As is well known, the slope of the regression lines and the coefficient of correlation are related (Eqs. (12)–(14)). Since different 'optimal' values for α are obtained by the NSE and KGE criteria, this also leads to implications for the regression lines.

r^2 = k_s \cdot k_o    (12)

k_s = \frac{Cov_{so}}{\sigma_s^2} = \frac{r}{\alpha}    (13)

k_o = \frac{Cov_{so}}{\sigma_o^2} = r \alpha    (14)

with:

Cov_{so} = r \sigma_s \sigma_o

where Cov_so is the covariance between the simulated and observed values, k_s is the slope of the regression line when regressing the observed against the simulated values, and k_o is the slope of the regression line when regressing the simulated against the observed values.

Murphy (1988) has already noted that for NSE the conditional bias term B in Eq. (3) will vanish only if the slope of the regression line k_s is equal to unity (i.e. regressing the observed against the simulated values), which is desirable in the context of the 'verification' of forecasts. This means that for a given forecast (simulated value), the expected value of the observed value lies on the 1:1 line (assuming a Gaussian distribution). As discussed before, the optimal value of α that maximizes NSE is given by a model simulation for which α is equal to r. As evident in Eq. (13) this results in k_s = 1, but at the same time this also implies that k_o = r² (Eq. (14)). Because r² will always be smaller than unity, this means that we will, in general, tend to underestimate the slope of the regression line when regressing the simulated against the observed values. The tendency will be for high values (peak flows) to be underestimated and for low values (recessions) to be overestimated in the simulation.

In brief, for maximizing NSE the optimal values for k_s and k_o are unity and r², respectively. In the case of KGE, the optimal value for α is at unity, which means that for maximizing KGE the optimal values for both k_s and k_o are equal to r. Again, since r is smaller than unity we will tend to underestimate high values and overestimate low values.
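To make Eqs. (12)–(14) concrete, the following sketch (numpy, invented data) computes both least-squares slopes from the covariance and checks them against r/α and r·α; for an NSE-optimal simulation (α = r) the first slope equals unity and the second equals r².

```python
import numpy as np

obs = np.array([2.0, 3.5, 10.0, 6.0, 4.0, 3.0, 2.5])   # invented example data
sim = np.array([2.2, 3.0,  8.5, 6.5, 4.5, 2.8, 2.4])

r = np.corrcoef(sim, obs)[0, 1]
alpha = np.std(sim) / np.std(obs)
cov_so = np.cov(sim, obs, ddof=0)[0, 1]                 # population covariance of sim and obs

k_s = cov_so / np.var(sim)   # regressing observed on simulated, Eq. (13)
k_o = cov_so / np.var(obs)   # regressing simulated on observed, Eq. (14)

print(f"k_s = {k_s:.3f}   (r/alpha = {r / alpha:.3f})")
print(f"k_o = {k_o:.3f}   (r*alpha = {r * alpha:.3f})")
print(f"k_s * k_o = {k_s * k_o:.3f}   (r^2 = {r**2:.3f})   # Eq. (12)")
```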
In considering this, it should be noted that both approaches for computing the regression lines (regressing observed against simulated values, or vice versa) are valid, but have different interpretations. In the context of runoff simulations, when using k_s we are basing the evaluation on the expected error in simulation of the observed runoff being zero for a given simulated runoff, which is a sensible approach when making runoff forecasts under 'normal' conditions. However, if we are interested in the 'unusual' runoff conditions – such as runoff peaks – then a more sensible approach would be to use k_o, where we are interested in the question, "If a flood occurs, can we forecast (simulate) it?", whereas in the case of k_s such a runoff peak is 'averaged out'. Fig. 3 illustrates this with typical scatter plots for runoff simulation. In this example, k_s is close to unity, suggesting unbiased forecasts (Fig. 3a), and at the highest simulated flows of around 10 m³/s the small number of
Fig. 3. Typical scatter plots depicting simulated and observed runoff (r = 0.86 and α = 0.90) and fitted regression lines: (a) regression against simulated runoff (k_s = 0.96) and (b) regression against observed runoff (k_o = 0.77). Pitten River, Austria, 277 km², 5 years daily data, HBV model variant, parameters optimized on NSE. Note that in (a) and (b) the identical data points are plotted, but the axes are flipped.

observed flows (runoff peaks) that are well above the regression line are 'averaged out' by the larger number of observed flows that occur slightly below the regression line. However, it is clear that whenever a runoff peak above 10 m³/s occurs, there is a clear tendency for underestimation in the simulation (Fig. 3b).

These problems arise because the distribution of runoff is usually highly skewed. If k_o is of higher interest, then the use of NSE may cause problems, since the simulated runoff will tend to underestimate the peak flows. In the case of the KGE criterion, we will also have a tendency towards underestimation, but not as severe as with the NSE. Note that for extreme low-flows, similar considerations as for the runoff peaks apply (but here we will tend to overestimate the low-flow).

Case study

To examine and illustrate the implications of the theoretical considerations presented above we applied a simple conceptual precipitation-runoff model to several basins. Using NSE (Eq. (2)) and KGE (Eq. (9)) as model performance criteria, two different sets of parameters were obtained for each basin by calibration against observed runoff data. For each parameter set we compare the overall model performance as evaluated by the NSE and KGE criteria and, in addition, conduct a detailed analysis of the criterion components. Further, we also examine the model performance on an independent 'evaluation' period.

Study area

For this study we used 49 mesoscale Austrian basins (Fig. 4a) used in the regionalization study reported by Kling and Gupta (2009). All are pre-alpine or lowland basins where snowmelt does not dominate runoff generation. They vary in size from 112.9 km² to 689.4 km², with a median size of 287.3 km², and a mean elevation range from 232 m to 952 m above sea level. The basins represent a wide range of physiographic and meteorological properties, with the most important land-use types being forest, grassland and agriculture. According to the Hydrological Atlas of Austria (BMLFUW, 2007), the long-term mean annual precipitation in the basins ranges from 507 to 1929 mm, and the corresponding runoff ranges from 44 to 1387 mm, resulting in a large range of runoff coefficients (from 9% to 72%). Thus, both wet and dry basins are included. Fig. 4b shows a diagnostic plot where normalized actual evapotranspiration is plotted against normalized precipitation (both variables are scaled by potential evapotranspiration); it indicates that most of the basins are energy limited and only a few of the basins are water limited.

Data basis

We used observed daily data for the period September 1990–August 2000; the first 2 years were used as a warm-up period, the next 5 years for calibration, and the final 3 years for independent evaluation. Observed catchment outlet runoff data were used for parameter calibration in each of the basins.

Fig. 4. (a) Map showing locations of the 49 Austrian basins used in this study. Also depicted are the 49 gauges and 222 precipitation stations. (b) Relationship between index of evaporation and index of wetness for the 49 Austrian basins. The index of wetness is computed as the ratio between precipitation (P) and potential evapotranspiration (ETp). The index of evaporation is computed as the ratio between actual evapotranspiration (ETa) and ETp. Data represent long-term means from the period 1961 to 1990 and are taken from Hydrological Atlas of Austria (BMLFUW, 2007).
Precipitation inputs were based on daily data from 222 stations, regionalized using the method of Thiessen-Polygons. Air temperature inputs were based on data from 98 stations, regionalized via linear regression with elevation. Potential evapotranspiration inputs were based on monthly fields of potential evapotranspiration (Kling et al., 2007) with a spatial resolution of 1 × 1 km. The monthly potential evapotranspiration data were disaggregated to daily time-steps by using daily data from 21 indicator stations, where the daily potential evapotranspiration was computed using the Thornthwaite-method (Thornthwaite and Mather, 1957).

Hydrological model

A simple, conceptual, spatially distributed daily precipitation-runoff model similar to the HBV model (Bergström, 1995) was used; the model was previously applied to these same basins by Kling and Gupta (2009). The model uses a 1 × 1 km² raster grid for spatial discretization of the basins. However, for simplicity, the current study assumes uniform parameter fields. Inputs to the model are precipitation, air temperature, and potential evapotranspiration. The model consists of a snow module, soil moisture accounting, runoff separation into different components, and a routing module. Snowfall is determined from precipitation data using a threshold temperature, and snowmelt is computed with the temperature-index method (see e.g. Hock, 2003). Rainfall and snowmelt are input to the soil module, where runoff generation is computed via an exponential formulation that accounts for current soil moisture conditions (see e.g. Bergström and Graham, 1998). Actual evapotranspiration depletes the soil moisture store; the rate of actual evapotranspiration depends on current soil moisture conditions and potential evapotranspiration. Runoff is separated into fast (surface flow) and slow (base flow) components by two linear reservoirs having different recession coefficients. A further linear reservoir is used to simulate channel routing of the runoff. Fig. 5 shows the conceptual structure of the model (the snow module is not shown). The model equations are presented in Kling and Gupta (2009). Table 1 lists the most important parameters of the model.

Fig. 5. Conceptual model structure (the snow module is not shown). Parameters in brackets are not calibrated.

Table 1. Parameters of the model. Parameters in brackets were not calibrated.

Parameter | Units | Feasible range | Description
S1max     | mm    | 50–700         | Soil storage capacity
Beta      | /     | 0.1–25         | Exponent for computing runoff generation
(S1crit)  | /     | (0.6)          | Critical soil moisture for actual evapotranspiration
K1        | h     | 10–500         | Recession coefficient for surface flow
K2        | h     | 10–1000        | Recession coefficient for percolation
S2crit    | mm    | 0–15           | Outlet height for surface flow
K3        | h     | 500–10000      | Recession coefficient for base flow
(K4)      | h     | (0–10)         | Recession coefficient for distributed routing

To reduce dimensionality of the parameter calibration problem, some of the model parameters are set to plausible values and are not further calibrated. This applies to snow parameters, because snow is of limited importance in the basins of this study, and to the channel routing parameters, which are of limited importance at the daily time-step (the values of Kling and Gupta (2009) are used). In addition, the critical soil moisture for reducing actual evapotranspiration is set to a constant value. The six remaining parameters were calibrated in each basin using the Shuffled Complex Evolution optimization algorithm (SCE, Duan et al., 1992), using six complexes (13 points per complex) and a convergence criterion of 0.001 in five shuffling loops, resulting in approximately 2000 model evaluations per optimization run.
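The calibration setup can be sketched roughly as below. This is not the authors' implementation: scipy's differential_evolution is used here only as a stand-in for the SCE search, run_model is a hypothetical placeholder for the HBV-type model, and the bounds follow the feasible ranges of Table 1.

```python
import numpy as np
from scipy.optimize import differential_evolution

# Feasible ranges of the six calibrated parameters (Table 1)
BOUNDS = [(50, 700),     # S1max  [mm]
          (0.1, 25),     # Beta   [/]
          (10, 500),     # K1     [h]
          (10, 1000),    # K2     [h]
          (0, 15),       # S2crit [mm]
          (500, 10000)]  # K3     [h]

def run_model(params, forcing):
    """Hypothetical placeholder: run the precipitation-runoff model for one
    basin and return the simulated runoff series."""
    raise NotImplementedError

def kge(sim, obs):
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1)**2 + (alpha - 1)**2 + (beta - 1)**2)

def calibrate(obs, forcing, criterion=kge):
    """Single-criterion calibration: maximize the chosen criterion (NSE or KGE)
    by minimizing its negative over the feasible parameter space."""
    result = differential_evolution(
        lambda p: -criterion(run_model(p, forcing), obs),
        bounds=BOUNDS, maxiter=200, seed=1)
    return result.x, -result.fun   # optimal parameters and criterion value
```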
Results

Overall model performance

The optimization runs resulted in two parameter sets for each basin. Optimization using the 'optNSE' method results in parameter sets 'θ_optNSE' that yield optimal runoff simulations maximizing NSE (Eq. (4)), while optimization using the 'optKGE' method results in parameter sets 'θ_optKGE' that yield optimal runoff simulations maximizing KGE (Eq. (9)). A standard method for reporting model performance in precipitation-runoff modelling studies is to present scatter plots of NSE between calibration and evaluation periods (see e.g. Merz and Blöschl, 2004). Fig. 6 displays such a scatter plot; as expected, for many basins the NSE deteriorates when going from the calibration to the evaluation period (Fig. 6a). Similar results are obtained for KGE (Fig. 6b).

Fig. 6. Scatter plots of overall model performance: cal = calibration period, eval = evaluation period. Note that in (a) two points are located outside the plotting range because of negative NSE values in the evaluation period.
Now, there can be different reasons for deterioration of model performance on the evaluation period. These include over-fitting of the parameters to the calibration period, non-stationarity between the calibration and evaluation periods, lack of 'power' in the objective function, etc. Instead of falsification and model rejection, which would be a logical conclusion from such a result, it is common practice to simply report the deterioration in the model performance and then to move on. In our case, we can report that when moving from calibration to evaluation period the median NSE has decreased from 0.76 to 0.59 and the median KGE has decreased from 0.86 to 0.72, but what hydrological meaning do these numbers have? Here, an analysis of the different components that constitute the overall model performance enables us to learn much more about the model behaviour, differences between the calibration and evaluation periods, and also differences between basins.

Before analysing the criterion components it is interesting to note the relationship between NSE and KGE. Fig. 6 shows that when optimizing on KGE (optKGE) there is a strong correlation between the values obtained for the KGE and NSE criteria (Fig. 6d). However, when optimizing on NSE (optNSE), the correlation between the values obtained for NSE and KGE is lower (Fig. 6c). The reasons for this will become much clearer later in this section, but briefly it is useful to keep in mind that optimization on KGE strongly controls the values that the α and β components can achieve, whereas optimization on NSE constrains these components only weakly.

Criterion components

The relative contributions of the criterion components to the overall model performance obtained via optimization are shown in Fig. 7 (see Eq. (6) for optNSE and Eq. (11) for optKGE). The obtained (optimized) model performance is dominated by the component representing r (dark grey), whereas the other components representing the bias (light grey) and the variability (medium grey) of flows have only small relative contributions. This applies for all 49 basins and for both optimization on NSE (Fig. 7a) and KGE (Fig. 7b). However, a low relative contribution of a component to the final value of the (optimized) model performance does not necessarily imply that the model performance criterion is, in general, insensitive to this component. Instead, the relative contribution of a component can be small because of (1) low 'weight' of the component in the equation for calculating the overall model performance, and/or (2) the value of the component is close to its optimal value. As a consequence of (2), the relative contribution of the components representing the bias and the variability of flows can become large for non-optimal parameter sets.

To illustrate these considerations, Fig. 7c and d shows the relative contribution of the criterion components using random parameter sampling for a selected basin (Glan River). The sampled points are arranged from left to right in order of decreasing performance for the selected criterion. With decreasing overall model performance (either NSE or KGE) there is a general tendency for the relative contribution of r to decrease and for the other two components to become much more important. In some cases only the component representing the bias is dominant, whereas in other cases only the component representing the variability of flows is dominant. This clearly indicates that both NSE and KGE are sensitive to all three of the components. From a multi-objective point of view this is definitely desirable, because it means that by calibrating on the overall model performance we can substantially improve the components representing the bias and the variability of flows. Here of course we should remember that in NSE the bias is normalized by the standard deviation of the observed flows and that the 'optimal' α is equal to r. Hence, with NSE it is not necessarily assured that from a hydrological point of view good values for α and β are obtained.

Fig. 7. Stacked area plots showing the relative contribution of the components for NSE and KGE in the calibration period: (a) optNSE in 49 basins, (b) optKGE in 49 basins, (c) and (d) random parameter sampling in the Glan River basin.

The cumulative distribution functions for the NSE, r, α, and β measures as obtained with optNSE and optKGE in the calibration and evaluation periods are shown in Fig. 8. Looking first at the results for the NSE criterion (Fig. 8a), we see that while the NSE obtained by optNSE is larger than with optKGE, the difference is rather small. This indicates that by calibrating on KGE, we have obtained only a slight deterioration in overall performance as measured by NSE. Further, although there is a pronounced reduction in NSE from calibration to evaluation period, the reduction is similar for both optNSE and optKGE.

However, the change in NSE tells us little that is diagnostically useful about the causes of this 'deterioration' in overall model performance. Of more interest are the values obtained for the three criterion components. The results for the calibration period are discussed first. Note that the distribution of r is almost identical with either optNSE or optKGE (Fig. 8b, filled symbols), indicating that both of the criteria have achieved similar hydrograph match in
Fig. 8. Cumulative distribution functions for NSE, r, α and β as obtained with optNSE and optKGE in the calibration and evaluation periods.

Fig. 9. Relationship between (a) r and α and (b) the slope of the regression lines k_s and k_o.

terms of shape and timing. However, for the other two components, optKGE has achieved considerably better results. Fig. 8c shows that there is a strong tendency for underestimation of α by optNSE (filled circle symbols), due to which only 18% of the basins are within 10% of the ideal value at unity, whereas for optKGE (filled triangle symbols) all of the basins are within 10% of the ideal value. Similarly optKGE yields good results for β (Fig. 8d), with all of the basins having a bias of less than 10%, while for optNSE 16% of the basins have a bias of greater than 10%. In general, optKGE results in a β value that is much closer to the ideal value at unity than with optNSE. Thus, the use of optKGE has resulted in all of the basins having α and β close to their ideal values of unity during calibration. This now explains why we get such a high correlation between NSE and KGE in Fig. 6d; because both α and β are now almost constant across the basins (here close to unity), the equations for KGE and NSE both become approximately linear functions of r, and in fact we tend towards the relationship NSE(θ_optKGE) = 2 KGE(θ_optKGE) − 1.
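The limiting relationship follows directly from Eqs. (4) and (9): when the KGE calibration drives α and β to unity (and hence β_n to zero),

NSE = 2r - 1 \qquad\text{and}\qquad KGE = 1 - |r - 1| = r \quad (\text{for } r \le 1),

so that NSE = 2 KGE − 1.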
Table 2. Paradoxical examples for NSE and components in three basins (results for the calibration period). All three components (r, α, β) improve but the overall model performance measured by NSE decreases with the parameter set obtained by optKGE.

Basin        | Method | NSE (/) | KGE (/) | r (/) | α (/) | β (/)
Zaya River   | optNSE | 0.484   | 0.685   | 0.714 | 0.871 | 1.019
             | optKGE | 0.452   | 0.732   | 0.733 | 1.026 | 1.001
Pitten River | optNSE | 0.742   | 0.828   | 0.863 | 0.899 | 1.028
             | optKGE | 0.730   | 0.865   | 0.866 | 1.004 | 1.016
Glan River   | optNSE | 0.786   | 0.855   | 0.888 | 0.912 | 1.028
             | optKGE | 0.776   | 0.888   | 0.889 | 1.002 | 1.007

Next we examine what happens for the evaluation period. In general, we see that the statistical distributions of the three components have changed. The cumulative distribution function of r has shifted to lower values in a consistent manner for both optNSE and optKGE (Fig. 8b), so that both methods yield again very similar results for timing and shape. However, the optKGE calibrations have retained a median value of α close to unity (the same as during calibration) while the overall variability in the distribution has increased around the median value (Fig. 8c). This indicates that the statistical tendency to provide good reproduction of flow variability persists into the evaluation period, but there is an increase in the noise so that the distribution has become much wider. In contrast, the optNSE results continue to show a systematic tendency to underestimate α (variability of flows) during the evaluation period along with a considerable increase in random noise. Similarly, the cumulative distribution function of β obtained by both methods remains centred close to its calibration value while
showing an increase in the variability (Fig. 8d). The small shift in the median value may be caused by the fact that there is approximately 5% less precipitation during the evaluation period. Clearly, the KGE criterion has provided model calibrations that are statistically more desirable during calibration while providing results that remain statistically more consistent on the independent evaluation period.

An interesting observation is that in a few basins the paradoxical case occurs where all three of the criterion components improve with optKGE, but the value of NSE decreases when compared to the NSE obtained with optNSE (Table 2). The reason for this is the interplay between the terms α and r in the NSE equation (illustrated nicely in Fig. 1). It is therefore actually (counter intuitively) possible for both α and r to get closer to unity while NSE gets smaller. This is, of course, because optimization on NSE seeks to make α = r, and therefore 'punishes' solutions for which α is close to the ideal value of unity, while r will always be smaller than unity.

As discussed earlier, it is likely that optimization with NSE will yield results where α is close to r. Fig. 9a shows a comparison between r and α obtained by the two optimization cases for all of the basins. In general, when optimizing with NSE, the value of α is indeed very similar to r, which means that the variability of flows is systematically underestimated (as shown above), and α approaches the ideal value of unity in only one of the 49 basins. In contrast, when optimizing with KGE, the value of α is close to the ideal value of unity for most of the basins.

Consequently, as expected from the theoretical discussion, systematically different results are obtained by optNSE and optKGE for the slopes of the regression lines (Fig. 9b), where the cases of regressing the simulated against the observed values (k_o, Eq. (14)) and regressing the observed against the simulated values (k_s, Eq. (13)) are distinguished. In general, when using optNSE the value of k_s is close to the ideal value at unity, but k_o is significantly smaller than one. In the case of optKGE both k_s and k_o are smaller than one, but the underestimation is not as large as for k_o with optNSE. Note (from Eq. (12)) that the only way that we can have both k_s and k_o equal to one is for r to be equal to unity, which would only happen if the model and data were perfect.

Parameters

Finally, we report briefly on the optimal parameter values obtained using optNSE and optKGE. Interestingly, even though the statistical properties of the streamflow hydrographs (as measured by α and β) did change significantly (Fig. 8), for many basins the parameter values did not change by large amounts (compared to the feasible parameter range) when moving from optNSE to optKGE (Fig. 10). The correlation between the parameter values of optNSE and optKGE is at least 0.80 for all six of the parameters, and for three of the parameters the correlation is larger than 0.90. For the parameter K1 the values are slightly smaller with optKGE, which has the effect of higher peaks and quicker recession of surface flow. Also the parameter K3 decreases with optKGE, which has the effect of a less dampened base flow response. Given the function of these two parameters in the model structure, a reduction in the parameter values has the effect of increasing the value of the α measure. In addition, we see an increase in the percolation parameter K2, which results in more surface flow and less base flow, with the overall effect of increasing the value of α.

The function of the parameters S1max and Beta in the model is mainly to control the partitioning of precipitation into runoff and evapotranspiration (thereby controlling the water balance), and as a consequence these parameters mainly affect the β measure. However, these parameters also affect the α measure and parameter interaction between S1max and Beta complicates the analysis. Given the function of these parameters in the model, the β measure should increase with a decrease in either S1max and/or Beta, but this is not obvious from Fig. 10, because a decrease in S1max can be compensated by an increase in Beta, and vice versa.

For the parameter S2crit no clear tendency of change is visible. Here it should be mentioned that there was no change in the parameter values in sixteen of the basins for which the parameter values were at their lower bounds (4 basins) and upper bounds (12 basins), respectively. Note that these 16 points also contribute to the rather high correlation observed.

On a visual, albeit subjective, basis a comparison of the parameter sets obtained by optNSE and optKGE reveals that in many of the basins the two parameter sets are almost indistinguishable,

Fig. 10. Scatter plots of optimal parameters obtained by optNSE and optKGE. Parameter values are normalized by the feasible parameter range (Table 1); the parameters Beta, K1, K2 and K3 are log-transformed before normalization.
Fig. 11. Comparison of the parameter sets obtained by optNSE and optKGE: (a) normalized parameter values of θ_optNSE and θ_optKGE in the Glan River basin, (b) difference in the normalized parameter values (computed as θ_optKGE − θ_optNSE), displayed for all basins.

but nevertheless the criterion components have changed. As an example, Fig. 11a displays a comparison of the parameters obtained by optNSE and optKGE for the Glan River. Apparently the parameter values are quite similar, although the α measure and to a lesser extent the β measure have both improved when using optKGE (see Table 2). For many basins, the difference in each of the parameters was found to be only a small percentage of the overall feasible range (Fig. 11b); in 14 of the 49 basins, all of the six parameters have changed by less than 10%, and in only a few of the basins did two or more parameters change by a significant amount. For the latter, the changes may also (at least in part) be a consequence of parameter interactions; for example, there is a clear tendency for K2 and S2crit to increase/decrease simultaneously, and this fact must, of course, also be considered when interpreting the scatter plots in Fig. 10.

Discussion

A decomposition of the NSE criterion shows that this measure of overall model performance can be represented in terms of three components, which measure the linear correlation, the bias and the variability of flow. By simple theoretical considerations, we can show that problems can arise in model calibrations that seek to optimize the value of NSE (or its related MSE). First, because the bias is normalized by the standard deviation of the observed flows, the relative importance of the bias term will vary across basins (and also across years), and for cases where the variability in the observed flows is high, the bias will have a low 'weight' in the computation of NSE. Second, there will be a tendency for the variability in the flows to be systematically underestimated, so that the ratio of the simulated and observed standard deviations of flows will tend to be equal to the correlation coefficient. As a consequence, the slope of the regression line (when regressing simulated against observed values) will be smaller than one, so that runoff peaks will tend to be systematically underestimated. This finding may seem to contradict the general notion that optimization on NSE will improve simulation of runoff peaks. In fact NSE is generally found to be highly sensitive to the large runoff values, because of the (typically) larger model and data errors involved in the matching of such events, and this fact is separate from the general (theoretical and practical) tendency to underestimate the runoff peaks. Of course, when it is of interest to regress the observed against the simulated values then an optimization on NSE can yield desirable results, since in such a case the optimal slope of the regression line for maximizing NSE is equal to unity.

These theoretical considerations were all supported by the results of the modelling experiment. In such an experiment, not all solutions within the theoretical criteria space are possible because of constraints regarding the model structure, parameter ranges, and available data. However, it was found that the simple model was capable of achieving good solutions for the bias and the variability of flows with only slight decreases in the correlation coefficient. The optimization task therefore becomes one of specifying the objective function in such a way that it is capable of achieving such a solution as an optimal solution (i.e. simultaneously good solutions for bias, flow variability and correlation). Apparently, this was not the case with NSE, and we therefore formulated an alternative criterion (KGE) that is based on an equal weighting of the three components (correlation, bias, and variability measures). In general, the correlation will not reach its ideal value of unity, but an optimization on KGE resulted in the other two components being indeed close to their ideal values. Thus, the use of KGE instead of NSE for model calibration improved the bias and the variability measure considerably while only slightly decreasing the correlation.

The simulation results were also examined for an independent evaluation period. In general, the overall model performance and the individual components deteriorated in a statistical sense. It is at least partially likely that this is due to the rather short lengths of the calibration and evaluation periods used in this study (5 and 3 years, respectively). Further, it should be noted that this study has not accounted for either the uncertainty in the parameter values or the uncertainty in the computed statistics, which would require a more rigorous Bayesian approach. Nevertheless, the results clearly show that optimization using NSE tends to underestimate the variability of flows on the calibration period, and that this behaviour tends to persist into the evaluation period. Further, the bias in the calibration period is well constrained with KGE, but not with NSE, whereas in the evaluation period (with overall poorer bias) the results with NSE are only slightly inferior to KGE.

An interesting result is that for many basins the optimal parameter values changed by only small amounts (relative to the feasible range) when using KGE instead of NSE. In the KGE optimization there was a tendency to decrease the recession parameters of surface flow and base flow to simulate a flashier hydrograph, and thereby improve the value of the variability measure. Because of parameter interactions there was no clear tendency of a change in the parameters for the bias measure. In general, this suggests that the values of multiple criteria can be improved by making only small changes in the parameter values. This emphasizes the importance of the relative sensitivity of the criterion components to changes in the parameter values. On the one hand, this is a desirable effect during calibration, because we want to have measures that are actually sensitive to the parameter values, thereby theoretically increasing parameter identifiability. On the other hand, this raises important questions for parameter regionalization, because even a small 'error' in a parameter value could result in poor values of individual measures, thereby causing poor overall model performance.
90 H.V. Gupta et al. / Journal of Hydrology 377 (2009) 80–91

The attempt to explain the relationships between changes in the parameters and values of the criterion components relates to the idea of diagnostic model evaluation, as proposed by Gupta et al. (2008) and tested by Yilmaz et al. (2008) and Herbst et al. (2009). The idea behind diagnostic model evaluation is to move beyond aggregate measures of model performance that are primarily statistical in meaning, towards the use of (multiple) measures and signature plots that are selected for their ability to provide hydrological interpretation. Such an approach should improve our ability to diagnose the causes of a problem and to make corrections at the appropriate level (i.e. model structure or parameters). The theoretical development presented in this paper shows one simple, statistically founded approach to the development of a strategy for diagnostic evaluation and calibration of a model. Clearly, the measures used in this study have some diagnostic value. The bias and variability measures represent differences in matching of the means and the standard deviations (the first two moments) of the probability distributions of the quantities being compared. Their appearance in NSE and MSE indicates that these performance criteria give importance to matching these two long-term statistics of the data. From a hydrological perspective, these statistics relate to the properties of the flow duration curve, in which issues of timing and shape of the dynamical characteristics of flow are largely ignored. These statistics will therefore be mainly controlled by aspects of model structure and values of the parameters that determine the general partitioning of precipitation into runoff, evapotranspiration and storage (i.e. overall water balance) and, further, the general partitioning of runoff into fast and slow flow components (e.g. see Yilmaz et al., 2008). Meanwhile, all other differences between the statistical properties of the observed and simulated flows, such as timing of the peaks and shapes of the rising limbs and recessions of the hydrograph (i.e. autocorrelation structures), are lumped into the (linear) correlation coefficient as an aggregate measure. A logical next step would be to further decompose the correlation coefficient into diagnostic components that represent different aspects of flow timing and shape (e.g. autocorrelation structure). Further, a distinction between different states (modes) of the hydrological response – such as driven and non-driven (see e.g. Boyle et al., 2000) – may also prove to be sensible. Such considerations are left for future work.

Before entering into our concluding remarks, we should point out that the primary purpose of this study was not to design an improved measure of model performance, but to show clearly that there are systematic problems inherent in any optimization that is based on mean squared errors (such as NSE). The alternative criterion KGE was used simply for illustration purposes. An optimization on KGE is equivalent to selecting a point from the three-dimensional Pareto front with the minimal distance from the ideal point. Many different alternative criteria would also be sensible, but ultimately it has to be understood that each single measure of model performance has its own peculiarities and trade-offs between components. In the case of KGE, probably the most problematic characteristic is that the slope of the regression lines will tend to be smaller than one, albeit not as strongly as with NSE (when regressing simulated against observed values). Because of the simple design of the KGE criterion it is straightforward to understand the trade-offs between the correlation, the bias and the variability measure. These trade-offs are more complex in the case of NSE.

If single measures of model performance are used, we deem it imperative to clearly know the limitations of the selected criterion. It will then depend upon the type of application whether these limitations are of concern or not. The decomposition presented here highlights the fact that identical values of the NSE criterion are not necessarily indistinguishable – as is commonly (and erroneously) assumed in the literature in arguments relating to equifinality (Beven and Binley, 1992; Beven and Freer, 2001) – since the criterion components may be quite different. Thus, when evaluating or reporting results based on calibration with NSE, information about the correlation, bias, and variability of flows should also be given (interestingly, this was already proposed by Legates and McCabe (1999), although they did not discuss the interrelation between NSE and its three components). Ultimately, the decision to accept or reject a model must be made by an expert hydrologist, and such a decision is best made within a multiple-criteria framework. To this end, an analysis of the components that constitute the overall model performance can significantly enhance our understanding of model behaviour and provide insights helpful for diagnosing differences between models, basins and time periods within a hydrological context.

Summary and conclusions

In this study a decomposition of the widely used Nash–Sutcliffe efficiency (NSE) was applied to analyse the different components that constitute NSE (and hence MSE). We present theoretical considerations that serve to highlight problems associated with the NSE criterion. The results of a case study, where a simple precipitation-runoff model was applied in several basins, support the theoretical findings. For comparison, we show how an alternative measure of model performance (KGE) can overcome the problems associated with NSE.

In summary, the main conclusions of this study are:

• The mean squared error and its related NSE criterion consist of three components, representing the correlation, the bias and a measure of variability. The decomposition shows that in order to maximize NSE the variability has to be underestimated. Further, the bias is scaled by the standard deviation of the observed values, which complicates comparisons between basins.
• Given that NSE consists of three components, an alternative model performance criterion, KGE, is easily formulated by computing the Euclidean distance of the three components from the ideal point, which is equivalent to selecting a point from the three-dimensional Pareto front. Such an alternative criterion avoids the problems associated with NSE (but also introduces new problems).
• The slopes of the regression lines are directly related to the three components. NSE is suitable if the interest is in regressing the observed against the simulated values, but less suitable for regressing the simulated against the observed values. This means that if NSE is used in optimization, then runoff peaks will tend to be underestimated. The same applies for KGE, but the underestimation will not be as severe.
• After optimization, the component representing the linear correlation dominates the model performance criterion for both NSE and KGE. For non-optimal parameter sets any of the three components can be dominant in NSE or KGE.
• Even with a simple precipitation-runoff model it is possible to obtain runoff simulations where the mean and variability of flows are matched well and the linear correlation is still high. However, this applies only for optimization with KGE, since NSE does not consider such a solution to be ‘good’.
• The optimal parameter values may, in practice, change only by small amounts when using KGE instead of NSE as the objective function for optimization (as in our example). This emphasizes the importance of considering the sensitivity of the three components to perturbations in the parameter values.

A preliminary analysis of the results of other studies shows that the same conclusions are obtained when using more complex, hourly models (Kling et al., 2008) or simple, monthly water balance models (Martinez and Gupta, submitted for publication), which emphasizes the generality of the conclusions of this study. This study reinforces the argument that model calibration is a multi-objective problem (Gupta et al., 1998), and shows that a decomposition of the calibration criterion into components can help to greatly enhance our understanding of the overall model performance (and, by extension, the differences in model performance between model structures, basins and time periods). To compute these components is a straightforward task and should be included in any evaluation of model simulations. Ultimately, such an approach may help in the design of diagnostically powerful evaluation strategies that properly support the identification of hydrologically consistent models.
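To illustrate how straightforward this computation is, the minimal sketch below (Python with NumPy; the function and variable names are illustrative only and are not part of the original study) evaluates the three components for a pair of simulated and observed flow series and assembles them into the decomposed form of NSE, the Euclidean-distance form of KGE, and the two regression slopes referred to above. It assumes the component definitions used in this paper: r is the linear correlation coefficient, alpha the ratio of simulated to observed standard deviations, beta the ratio of the means, and beta_n the bias normalised by the observed standard deviation.

```python
import numpy as np

def nse_kge_components(sim, obs):
    """Illustrative computation of the correlation, variability and bias
    components underlying NSE and KGE for one simulated/observed series."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)

    r = np.corrcoef(sim, obs)[0, 1]                   # linear correlation
    alpha = sim.std() / obs.std()                     # variability ratio (sigma_s / sigma_o)
    beta = sim.mean() / obs.mean()                    # bias ratio (mu_s / mu_o)
    beta_n = (sim.mean() - obs.mean()) / obs.std()    # bias normalised by sigma_o

    # Decomposed form of NSE and Euclidean-distance form of KGE
    nse = 2.0 * alpha * r - alpha**2 - beta_n**2
    kge = 1.0 - np.sqrt((r - 1.0)**2 + (alpha - 1.0)**2 + (beta - 1.0)**2)

    # Regression slopes implied by the components (ordinary least squares)
    slope_sim_on_obs = r * alpha   # regressing simulated against observed
    slope_obs_on_sim = r / alpha   # regressing observed against simulated

    return {"r": r, "alpha": alpha, "beta": beta, "beta_n": beta_n,
            "NSE": nse, "KGE": kge,
            "slope_sim_on_obs": slope_sim_on_obs,
            "slope_obs_on_sim": slope_obs_on_sim}

# Example with arbitrary illustrative numbers
obs = [1.2, 3.4, 2.5, 0.8, 4.1, 2.9]
sim = [1.0, 3.0, 2.7, 1.1, 3.6, 2.4]
print(nse_kge_components(sim, obs))
```

Note that, for fixed r and bias, the decomposed form 2·alpha·r − alpha² − beta_n² is maximized at alpha = r, and hence (for r < 1) at an underestimated variability, whereas the distance form underlying KGE is maximized at alpha = 1; this is the formal counterpart of the first conclusion listed above.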
Acknowledgements

Funding was provided for Harald Kling as an Erwin Schrödinger Scholarship (Grant No. J2719-N10) by FWF, Vienna, Austria. Partial support for Hoshin Gupta and Koray Yilmaz was provided by the Hydrology Laboratory of the National Weather Service (Grant NA04NWS4620012) and by SAHRA (Center for Sustainability of semi-Arid Hydrology and Riparian Areas) under the STC program of the National Science Foundation (agreement EAR 9876800). Partial support for Koray Yilmaz was also provided by NASA’s Applied Sciences Program (Stephen Ambrose) and PMM (Ramesh Kakar) of NASA Headquarters. We would like to thank two anonymous reviewers and Philipp Stanzel for their useful comments that helped to improve the manuscript.
References

Anderton, S., Latron, J., Gallart, F., 2002. Sensitivity analysis and multi-response, multi-criteria evaluation of a physically based distributed model. Hydrological Processes 16, 333–353.
Beldring, S., 2002. Multi-criteria validation of a precipitation-runoff model. Journal of Hydrology 257, 189–211.
Bergström, S., 1995. The HBV model. In: Singh, V.P. (Ed.), Computer Models of Watershed Hydrology. Water Resources Publications, Highlands Ranch, CO, USA. ISBN: 0-918334-91-8.
Bergström, S., Graham, L.P., 1998. On the scale problem in hydrological modelling. Journal of Hydrology 211, 253–265.
Bergström, S., Lindström, G., Pettersson, A., 2002. Multi-variable parameter estimation to increase confidence in hydrological modelling. Hydrological Processes 16, 413–421.
Beven, K., Binley, A., 1992. The future of distributed models: model calibration and uncertainty prediction. Hydrological Processes 6, 279–298.
Beven, K.J., Freer, J., 2001. Equifinality, data assimilation, and uncertainty estimation in mechanistic modelling of complex environmental systems. Journal of Hydrology 249, 11–29.
BMLFUW, 2007. Hydrological Atlas of Austria, third ed. Bundesministerium für Land- und Forstwirtschaft, Umwelt und Wasserwirtschaft, Wien. ISBN: 3-85437-250-7.
Boyle, D.P., 2000. Multicriteria calibration of hydrological models. Ph.D. Dissertation, Department of Hydrology and Water Resources, The University of Arizona, Tucson, USA.
Boyle, D.P., Gupta, H.V., Sorooshian, S., 2000. Toward improved calibration of hydrologic models: combining the strengths of manual and automatic methods. Water Resources Research 36 (12), 3663–3674.
Cao, W., Bowden, W.B., Davie, T., Fenemor, A., 2006. Multi-variable and multi-site calibration and validation of SWAT in a large mountainous catchment with high spatial variability. Hydrological Processes 20, 1057–1073.
Criss, R.E., Winston, W.E., 2008. Do Nash values have value? Discussion and alternate proposals. Hydrological Processes 22, 2723–2725.
Duan, Q., Sorooshian, S., Gupta, V., 1992. Effective and efficient global optimization for conceptual rainfall-runoff models. Water Resources Research 28 (4), 1015–1031.
Garrick, M., Cunnane, C., Nash, J.E., 1978. A criterion of efficiency for rainfall-runoff models. Journal of Hydrology 36, 375–381.
Gupta, H.V., Sorooshian, S., Yapo, P.O., 1998. Toward improved calibration of hydrologic models: multiple and noncommensurable measures of information. Water Resources Research 34 (4), 751–763.
Gupta, H.V., Wagener, T., Liu, Y., 2008. Reconciling theory with observations: elements of a diagnostic approach to model evaluation. Hydrological Processes 22, 3802–3813.
Herbst, M., Gupta, H.V., Casper, M.C., 2009. Mapping model behaviour using Self-Organizing Maps. Hydrology and Earth System Sciences 13, 395–409.
Hock, R., 2003. Temperature index melt modelling in mountain areas. Journal of Hydrology 282, 104–115.
Jain, S.H., Sudheer, K.P., 2008. Fitting of hydrologic models: a close look at the Nash–Sutcliffe index. Journal of Hydrologic Engineering 13 (10), 981–986.
Kling, H., Gupta, H., 2009. On the development of regionalization relationships for lumped watershed models: the impact of ignoring sub-basin scale variability. Journal of Hydrology 373, 337–351.
Kling, H., Fürst, J., Nachtnebel, H.P., 2007. Seasonal water balance. In: BMLFUW (Ed.), Hydrological Atlas of Austria, third ed., map sheet 7.2. Bundesministerium für Land- und Forstwirtschaft, Umwelt und Wasserwirtschaft, Wien. ISBN: 3-85437-250-7.
Kling, H., Yilmaz, K.K., Gupta, H.V., 2008. Diagnostic evaluation of a distributed precipitation-runoff model for snow dominated basins (oral presentation). In: American Geophysical Union, AGU Joint Assembly, 27–30 May 2008, Ft. Lauderdale, FL, USA.
Krause, P., Boyle, D.P., Bäse, F., 2005. Comparison of different efficiency criteria for hydrological model assessment. Advances in Geosciences 5, 89–97.
Legates, D.R., McCabe, G.J., 1999. Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model evaluation. Water Resources Research 35, 233–241.
Lindström, G., 1997. A simple automatic calibration routine for the HBV model. Nordic Hydrology 28, 153–168.
Madsen, H., 2003. Parameter estimation in distributed hydrological catchment modelling using automatic calibration with multiple objectives. Advances in Water Resources 26, 205–216.
Marce, R., Ruiz, C.E., Armengol, J., 2008. Using spatially distributed parameters and multi-response objective functions to solve parameterization of complex applications of semi-distributed hydrological models. Water Resources Research 44, 18.
Martinec, J., Rango, A., 1989. Merits of statistical criteria for the performance of hydrological models. Water Resources Bulletin, AWRA 25 (2), 421–432.
Martinez, G.F., Gupta, H.V., submitted for publication. Continental scale diagnostic evaluation of the ‘abcd’ monthly water balance model for the conterminous US. Water Resources Research.
Mathevet, T., Michel, C., Andreassian, V., Perrin, C., 2006. A bounded version of the Nash–Sutcliffe criterion for better model assessment on large sets of basins. In: Andréassian, V., Hall, A., Chahinian, N., Schaake, J. (Eds.), Large Sample Basin Experiment for Hydrological Model Parameterization: Results of the Model Parameter Experiment – MOPEX. IAHS Publ., p. 567.
McCuen, R.H., Snyder, W.M., 1975. A proposed index for comparing hydrographs. Water Resources Research 11 (6), 1021–1024.
McCuen, R.H., Knight, Z., Cutter, A.G., 2006. Evaluation of the Nash–Sutcliffe efficiency index. Journal of Hydrologic Engineering 11 (6), 597–602.
Merz, R., Blöschl, G., 2004. Regionalisation of catchment model parameters. Journal of Hydrology 287, 95–123.
Murphy, A., 1988. Skill scores based on the mean square error and their relationships to the correlation coefficient. Monthly Weather Review 116, 2417–2424.
Nash, J.E., Sutcliffe, J.V., 1970. River flow forecasting through conceptual models. Part I – A discussion of principles. Journal of Hydrology 10, 282–290.
Parajka, J., Merz, R., Blöschl, G., 2005. A comparison of regionalisation methods for catchment model parameters. Hydrology and Earth System Sciences 9, 157–171.
Rode, M., Suhr, U., Wriedt, G., 2007. Multi-objective calibration of a river water quality model – information content of calibration data. Ecological Modelling 204, 129–142.
Rojanschi, V., Wolf, J., Barthel, R., Braun, J., 2005. Using multi-objective optimisation to integrate alpine regions in groundwater flow models. Advances in Geosciences 5, 19–23.
Safari, A., De Smedt, F., Moreda, F., 2009. WetSpa model application in the Distributed Model Intercomparison Project (DMIP2). Journal of Hydrology, in press.
Schaefli, B., Gupta, H.V., 2007. Do Nash values have value? Hydrological Processes 21, 2075–2080.
Thornthwaite, C.W., Mather, J.R., 1957. Instructions and tables for computing potential evaporation and the water balance. Publications in Climatology 10 (3), 311.
Tolson, B.A., Shoemaker, C.A., 2007. Dynamically dimensioned search algorithm for computationally efficient watershed model calibration. Water Resources Research 43, 16.
van Griensven, A., Bauwens, W., 2003. Multiobjective autocalibration for semidistributed water quality models. Water Resources Research 39 (12), 9.
Wang, G., Xia, J., Chen, J., 2009. Quantification of effects of climate variations and human activities on runoff by a monthly water balance model: a case study of the Chaobai River basin in northern China. Water Resources Research 45, 12.
Weglarczyk, S., 1998. The interdependence and applicability of some statistical quality measures for hydrological models. Journal of Hydrology 206, 98–103.
Yilmaz, K.K., Gupta, H.V., Wagener, T., 2008. A process-based diagnostic approach to model evaluation: application to the NWS distributed hydrologic model. Water Resources Research 44, 18.
Young, A.R., 2006. Stream flow simulation within UK ungauged catchments using a daily rainfall-runoff model. Journal of Hydrology 320, 155–172.