Mccuen 2006
Abstract: The Nash–Sutcliffe efficiency index (E_f) is a widely used and potentially reliable statistic for assessing the goodness of fit of hydrologic models; however, a method for estimating the statistical significance of sample values has not been documented. Also, factors that contribute to poor sample values are not well understood. This research focuses on the interpretation of sample values of E_f. Specifically, the objectives were to present an approximation of the sampling distribution of the index; provide a method for conducting hypothesis tests and computing confidence intervals for sample values; and identify the effects of factors that influence sample values of E_f including the sample size, outliers, bias in magnitude, time-offset bias of hydrograph models, and the sampling interval of hydrologic data. Actual hydrologic data and hypothetical analyses were used to show these effects. The analyses show that outliers can significantly influence sample values of E_f. Time-offset bias and bias in magnitude can have an adverse effect on E_f. The time step at which the data are recorded appears to be an insignificant factor unless the sample size is small. The Nash–Sutcliffe index can be a reliable goodness-of-fit statistic if it is properly interpreted.
Downloaded from ascelibrary.org by Ryerson University on 03/18/13. Copyright ASCE. For personal use only; all rights reserved.
DOI: 10.1061/(ASCE)1084-0699(2006)11:6(597)
CE Database subject headings: Hydrologic models; Statistics; Hydrographs; Research; Time series; Evaluation.
1 Professor, Dept. of Civil and Environmental Engineering, Univ. of Maryland, College Park, MD 20742-3021.
2 Research Assistant, Dept. of Civil and Environmental Engineering, Univ. of Maryland, College Park, MD 20742-3021.
3 Research Assistant, Dept. of Civil and Environmental Engineering, Univ. of Maryland, College Park, MD 20742-3021.
Note. Discussion open until April 1, 2007. Separate discussions must be submitted for individual papers. To extend the closing date by one month, a written request must be filed with the ASCE Managing Editor. The manuscript for this paper was submitted for review and possible publication on March 15, 2005; approved on April 20, 2006. This paper is part of the Journal of Hydrologic Engineering, Vol. 11, No. 6, November 1, 2006. ©ASCE, ISSN 1084-0699/2006/6-597–602/$25.00.

Introduction

Hydrologic models often require calibration prior to application. Traditionally, the correlation coefficient and standard error of estimate have been used to measure the goodness of fit of the model calibration. While the correlation coefficient is a useful goodness-of-fit index, it is theoretically applicable only to linear models that include an intercept. Even for the commonly used power model, ŷ = ax^b, the computed correlation coefficient can be a poor estimator of goodness of fit because of model bias. The correlation coefficient assumes that the model being tested is unbiased, i.e., the sum of the errors is equal to zero, and a fitted power model can be significantly biased (McCuen et al. 1990).

Recognizing the limitations of the correlation coefficient, Nash and Sutcliffe (1970) proposed an alternative goodness-of-fit index, which is often referred to as the efficiency index (E_f)

$$E_f = 1 - \frac{\sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2} \tag{1}$$

in which Ŷ_i and Y_i = predicted and measured values of the criterion (dependent) variable Y, respectively; Ȳ = mean of the measured values of Y; and n = sample size. If the predictions of a linear model are unbiased, then the efficiency index will lie in the interval from 0 to +1. For biased models, the efficiency index may actually be algebraically negative. For nonlinear models, which most hydrologic models are, negative efficiencies can result even when the model is unbiased.

One advantage of the Nash–Sutcliffe index is that it can be applied to a variety of model types. The ASCE Watershed Management Committee (ASCE 1993) recommends the Nash–Sutcliffe index for evaluation of continuous moisture accounting models. Erpul et al. (2003) used the index to assess nonlinear regression models of sediment transport. Merz and Bloschl (2004) used the index in the calibration and verification of catchment model parameters. Kalin et al. (2003) used the index as a goodness-of-fit indicator for a storm event model. It is also widely used with continuous moisture accounting models (Birikundavyi et al. 2002; Johnson et al. 2003; Downer and Ogden 2004). The use of the index for a wide variety of model types indicates its flexibility as a goodness-of-fit statistic.

While the Nash–Sutcliffe index is widely used as a goodness-of-fit index, values are not easily interpreted because the sampling distribution of E_f has not been presented. For this reason, users of E_f are only able to provide subjective interpretations of their sample values. Many factors influence a sample value of E_f, and high values of E_f may result even when the fit is relatively poor, such as when the variance of Y is very large. Values of E_f also depend on the sample size, such that the interpretation of “good” versus “bad” fit depends on the sample size. A value of 0.7 may or may not be indicative of a good fit. Therefore, if the Nash–Sutcliffe index is to be used with some sense of reliability, more knowledge about sample values of E_f is needed.

The objectives of this study were to present an approximate sampling distribution of E_f and to assess factors that influence computed values of E_f. Methods for computing confidence intervals on E_f and testing hypotheses with sample values are provided. These tools can assist those who use the index to more consistently assess the goodness of fit of hydrologic model predictions.
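
As a minimal sketch of Eq. (1), the index can be computed directly from paired predicted and measured values. The function name `nash_sutcliffe` and the input checks are illustrative assumptions and are not part of the original study.

```python
from typing import Sequence


def nash_sutcliffe(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Efficiency index of Eq. (1): 1 - (error sum of squares) / (total sum of squares)."""
    if len(predicted) != len(measured) or len(measured) < 2:
        raise ValueError("need two equal-length series with at least two values")
    mean_measured = sum(measured) / len(measured)
    # Numerator of Eq. (1): squared differences between predicted and measured values.
    sse = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    # Denominator of Eq. (1): squared deviations of the measured values from their mean.
    total = sum((m - mean_measured) ** 2 for m in measured)
    if total == 0.0:
        raise ValueError("measured values have zero variance; E_f is undefined")
    return 1.0 - sse / total
```

Under this definition a perfect prediction gives E_f = 1, predicting the measured mean everywhere gives E_f = 0, and any prediction with a larger error sum of squares than the mean gives a negative value.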
$$E_f = R^2 \tag{3}$$

Hypothesis Tests on the Efficiency Index

Computed values of the efficiency index are sample values. As with any random variable, a sample value of E_f may differ from the true, but usually unknown, value. Therefore, it has an underlying probability distribution. The distribution of the index E_f depends on both the sample size n and the underlying population value (ε_0). As ε_0 increases toward 1, the distribution becomes more skewed, with the long tail on the lower side of ε_0. As the sample size increases, the spread of the distribution decreases. Fig. 1 shows the distribution of E_f for ε_0 equal to 0.5 and 0.7 and for sample sizes of 10, 25, and 50. The characteristics of the spread of the distribution as a function of n and ε_0 are evident in Fig. 1. As n increases, the spread decreases, which indicates that the sample value of E_f is a better estimator of the population value.

While the efficiency index E_f does not have an exact distribution, the distribution can be approximated. The approximate sampling distribution can be used as the basis for hypothesis tests on sample values of E_f. To test whether or not a sample estimate of E_f is likely to have been drawn from a population based on a true value of ε_0, the following null hypothesis is tested:

$$H_0: \varepsilon = \varepsilon_0 \tag{4}$$

An alternative hypothesis H_A can be stated for a one-tailed upper, a one-tailed lower, or a two-tailed test.

$$m = 0.5 \ln\left(\frac{1 + \varepsilon_0^{0.5}}{1 - \varepsilon_0^{0.5}}\right) \tag{7}$$

$$S = (n - 3)^{-0.5} \tag{8}$$

where n = sample size and the value of z computed with Eq. (5) is compared to a critical value obtained from a standard normal distribution table for a level of significance of α or α/2.

Confidence Intervals on the Efficiency Index

The distribution of E_f depends on the sample size n and the underlying population value ε_0. Confidence intervals can be based on the approximate sampling distribution developed by Fisher (1928)

$$\left(\frac{e^{x} - 1}{e^{x} + 1}\right)^{2} \tag{9}$$

in which

$$x = \ln\left(\frac{1 + E_f^{0.5}}{1 - E_f^{0.5}}\right) + \frac{2z}{(n - 3)^{0.5}} \tag{10}$$

in which z = standard normal deviate. For a one-sided lower γ% (= 100 − α%) confidence interval, z of Eq. (10) will be negative for α% of the standard normal distribution in the lower tail. For a one-sided upper γ% confidence interval, z of Eq. (10) will be positive for α% in the upper tail. For two-sided confidence intervals, use z values for α/2% from each tail. The sampling distributions of Fig. 1 indicate that two-sided intervals defined by
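
The calculations behind Eqs. (7)–(10) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the test statistic is assumed to take the usual Fisher form, (transformed sample value − m)/S, because Eq. (5) is not reproduced in this text; E_f is assumed to enter the transform through its square root, consistent with E_f = R^2 in Eq. (3); and the function names are hypothetical.

```python
import math


def fisher_transform(e_value: float) -> float:
    """0.5 * ln((1 + E^0.5) / (1 - E^0.5)), the transform that appears in Eq. (7)."""
    if not 0.0 <= e_value < 1.0:
        raise ValueError("this sketch handles only 0 <= E < 1")
    root = math.sqrt(e_value)
    return 0.5 * math.log((1.0 + root) / (1.0 - root))


def z_statistic(e_f: float, e_population: float, n: int) -> float:
    """ASSUMED form of the test statistic: (transform(E_f) - m) / S."""
    if n <= 3:
        raise ValueError("Eq. (8) requires n > 3")
    m = fisher_transform(e_population)  # Eq. (7), hypothesized population value
    s = (n - 3) ** -0.5                 # Eq. (8)
    return (fisher_transform(e_f) - m) / s


def confidence_limit(e_f: float, n: int, z: float) -> float:
    """Confidence limit on E_f from Eqs. (9) and (10) for a standard normal deviate z."""
    if n <= 3:
        raise ValueError("Eq. (10) requires n > 3")
    x = 2.0 * fisher_transform(e_f) + 2.0 * z / math.sqrt(n - 3)  # Eq. (10)
    # Eq. (9). Because of the square, the result is not meaningful when x < 0,
    # which can happen for a lower limit when E_f and n are both small.
    return ((math.exp(x) - 1.0) / (math.exp(x) + 1.0)) ** 2
```

For a two-sided 95% interval, `confidence_limit` would be called twice, with z = −1.96 and z = +1.96; with z = 0 it returns the sample value itself. The interval is not symmetric about E_f, which matches the skewed sampling distribution described above.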
Sutcliffe index: (1) mathematical functions such as regression models and (2) single-event hydrograph analyses. The general conclusions outlined for these two types of models will also apply to the assessment of continuous moisture accounting models.

Effect on E_f of Bias in Magnitude

Hydrologic models do not perfectly replicate measured data, with the error variation reflecting the potential prediction accuracy, or inaccuracy, of the model. The error variation in predicted values of a random variable can be due to both systematic and nonsystematic causes. Calibration is performed to reduce the error variation to a minimum, but even calibrated models may be characterized by considerable error variation.

Systematic error variation is referred to as a bias, with a positive bias indicating overprediction. Even calibration does not ensure that a model will be unbiased. For example, power models are often biased when calibrated using logarithms (McCuen et al. 1990). Model bias is estimated using the average error, where an error is the difference between the predicted and measured values. The bias (ē) has the same units as the criterion variable (Y) and is computed by

$$\bar{e} = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i) \tag{11}$$

$$E_{fb} = 1 - \frac{\sum (\hat{Y}_U - Y)^2 + n\bar{e}^2}{\sum (Y_i - \bar{Y})^2} \tag{13}$$

Even if the bias is negative, the second term in the numerator, nē^2, will cause the efficiency to be less than that for an unbiased model. The exact effect of model bias will depend on the magnitude of the two terms in the numerator of Eq. (13). Based on the implications of Eq. (13), it is always important to report the bias and relative bias along with the efficiency index E_f of Eq. (1).

To illustrate the potential effect of bias in magnitude on E_f, a gamma distribution was used as the measured or actual hydrograph. To avoid any effects of time-offset error, each predicted hydrograph was computed by multiplying each ordinate of the gamma hydrograph by a constant percentage, from 50 to 150%. The Nash–Sutcliffe efficiency index was computed by comparing the predicted hydrograph with the gamma distribution hydrograph. Fig. 2 shows the relationship between E_f and the relative bias. Fig. 2 indicates that the value of E_f is the same for positive and negative biases, as would be expected from Eq. (13). Fig. 2 also illustrates that E_f is very sensitive to model bias. For a relative bias of 40%, either positive or negative, E_f is zero. Fig. 2 also indicates that bias can cause E_f to become negative. The exact relationship between E_f and bias will vary with the problem. These results of bias in gamma hydrograph models show that the Nash–Sutcliffe index is greatly influenced by model bias.

Fig. 2. Effect of bias in the magnitude of a hydrograph on the Nash–Sutcliffe index

This equation yields the following goodness-of-fit statistics for estimating turbidity: bias = −1.80 NTU; relative bias = −33%; standard error of estimate S_e = 6.38 NTU; standard error ratio S_e/S_y = 0.973; a correlation coefficient R = 0.459; and E_f = 0.211. The bias is very significant and indicates that the model will
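
The magnitude-bias experiment and the split of the error sum of squares in Eq. (13) can be sketched as below. The gamma shape and scale, the number of ordinates, and all function names are hypothetical choices for illustration, so the numbers produced will not necessarily match Fig. 2.

```python
import math
from typing import List, Sequence, Tuple


def gamma_hydrograph(shape: float = 4.75, scale: float = 1.0,
                     dt: float = 0.25, n_ordinates: int = 80) -> List[float]:
    """Unnormalized gamma-shaped ordinates (hypothetical parameters)."""
    times = [i * dt for i in range(1, n_ordinates + 1)]
    return [(t / scale) ** (shape - 1.0) * math.exp(-t / scale) for t in times]


def mean_error(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Bias of Eq. (11): average of (predicted - measured)."""
    return sum(p - m for p, m in zip(predicted, measured)) / len(measured)


def efficiency(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Efficiency index of Eq. (1)."""
    mean_measured = sum(measured) / len(measured)
    sse = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    total = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - sse / total


def error_decomposition(predicted: Sequence[float],
                        measured: Sequence[float]) -> Tuple[float, float, float]:
    """Return (total SSE, bias-corrected SSE, n * bias^2), the numerator terms of Eq. (13)."""
    bias = mean_error(predicted, measured)
    sse = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    # Removing the bias from each prediction gives the "unbiased" predictions of Eq. (13).
    unbiased_sse = sum((p - bias - m) ** 2 for p, m in zip(predicted, measured))
    return sse, unbiased_sse, len(measured) * bias ** 2


def scaled_efficiency(factor: float, measured: Sequence[float]) -> float:
    """E_f when every measured ordinate is multiplied by a constant factor."""
    return efficiency([factor * m for m in measured], measured)
```

Because the error of a scaled hydrograph is (factor − 1) times each ordinate, the squared errors for factors of 0.8 and 1.2 are identical, which is why the curve in Fig. 2 is symmetric about zero bias. The relative bias at which E_f reaches zero depends on the hydrograph used.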
$$\hat{T} = 0.6804\,Q^{0.5234} \tag{15}$$

Eq. (15) produces the following goodness-of-fit statistics: bias = 0; S_e/S_y = 0.840, R = 0.642, and E_f = 0.416. Therefore, both the bias and overall accuracy have improved. Applying the same hypothesis test used for Eq. (14) to the model of Eq. (15) yields a value for z of −1.250, which corresponds to a rejection probability of 10.6%. Therefore, the null hypothesis that E_f is from a population with a true value of 0.8 cannot be rejected at the 5% level. A comparison of these two analyses indicates that E_f was significantly influenced by model bias, not just the precision component of accuracy. The accuracy improved when the model bias was eliminated. While the biased model failed the hypothesis test, the unbiased model led to the acceptance of the null hypothesis. This illustrates that sample estimates of E_f are sensitive to model bias.

The Choptank River turbidity data also include one extreme event, i.e., the pair 20 and 160. The turbidity value was tested using both the Dixon–Thompson and Chauvenet's outlier tests (McCuen 2003). Both tests indicated that it is an outlier. Therefore, it was censored. Using the remaining six pairs, the following power model was developed:

$$\hat{T} = 106.4\,Q^{-1.013} \tag{16}$$

which had a bias of −0.178 NTU, a relative bias of −5.9%, a standard error of 1.37 NTU, a standard error ratio of 0.909, a correlation coefficient of −0.582, and E_f = 0.339. The E_f indicates a poor fit, although the accuracy is better than that from Eq. (14). The detection and censoring of outliers is important, as the efficiency index indicates that model accuracy can be influenced by outliers. Also, the E_f of 0.339 fails to indicate that the relationship is negative, which is shown by the negative sign on the correlation coefficient and the exponent of Eq. (16).

This example illustrates that both model bias and outliers can affect sample values of E_f. The elimination of both the bias and the outlier increased the sample values of E_f. Of course, every case will be different, and the effect of unbiasing and outliers will vary case by case. However, when poor values of E_f occur, the data should be checked for both bias and outliers.

measured rainfall and runoff records. Temporal disharmony between the time distributions of rainfall and runoff can reduce the fitting efficiency.

An analysis of actual data was conducted using unit hydrographs from the White Oak Bayou, Texas. Unit hydrographs (UHs) were developed (Hare 1970) for five storm events. The five UHs were then used to develop an average unit hydrograph for the watershed (McCuen 2005). Then, the storm event unit hydrographs were individually compared with the average UH and values of E_f computed. The six UHs are shown in Fig. 3. The five values of E_f vary from 0.42 to 0.97. Reasons for the individual values differ from event to event. The unit hydrograph for the 1955 event produced a relatively low value of E_f (0.63) because of the bias in magnitude, with the peak of the actual UH being about 65% of the peak of the average UH. The E_f for the 1960 UH (E_f = 0.60) was about the same as that for the 1955 event but the two UHs were quite different. The 1960 UH was more peaked than the average UH but also suffered because it was offset in time, with the peak occurring 6 h earlier than the peak of the average UH. This represents a time-offset bias of about 40% of the time to peak. The 1953 UH had the largest E_f (0.96) as it was not offset in time and differed in the magnitude of the peak by only 10%. The 1959 unit hydrograph showed a very significant time offset in the peak, as well as a flat area on the rising limb, which produced a very poor index of 0.42. In these examples, both magnitude bias and time-offset bias contributed to poor accuracy. The Nash–Sutcliffe index was able to detect the effects of these factors, but a value of E_f cannot identify which factor is the principal problem. Therefore, other analyses are necessary such as graphical assessments and computations of the magnitude and time-offset biases.

Fig. 3. Assessment of the Nash–Sutcliffe efficiency index using a storm-event unit hydrograph for the White Oak Watershed
While rainfall hyetographs and runoff hydrographs may be recorded on one time scale, analyses of the data may be carried out on a different time scale. Thus, if the magnitude of the time interval is changed, the efficiency index could change as well. A gamma distribution with a shape parameter of 4.75 exactly matches the Soil Conservation Service dimensionless unit hydrograph and was used as the true hydrograph. Keeping the scale parameter constant, the shape parameter was varied, which yielded a “predicted” hydrograph that could be compared with the true hydrograph. Twenty-four ordinates were defined on the rising limb and the time base was set at 120 time increments. The efficiency index was computed for time increments of 1, 2, 3, 4, 5, 6, 8, 10, 12, and 15, which yields sample sizes (i.e., number of ordinates) of 120, 60, 40, 30, 24, 20, 15, 12, 10, and 8, respectively. For a predicted hydrograph based on a shape parameter of 4, the efficiency was high and did not vary very much as long as 12 or more ordinates were used. For the very small sample sizes of 8 and 10, the E_f decreased. However, the hypothesis test based on Eqs. (4)–(8) showed greater variation. When the predicted hydrograph was based on a shape parameter of 2.4, the E_f values varied as the sample size was changed but all of the values were poor. Additionally, the z statistic of Eq. (5) showed little variation, even as the sample size decreased. These analyses indicate that the Nash–Sutcliffe index is not very sensitive to the time interval as long as the sample size is moderate. However, the statistical significance of a sample E_f can be influenced by the sampling interval in hydrograph analyses.

The White Oak Bayou data base of Fig. 3 was also used to examine the effect of the time interval. Data were recorded on a 1 h interval, which gives 60 values. By interpolating between the measured points, the time interval used to define values of the unit hydrographs was cut in half, which doubled the number of discharge measurements. For the five unit hydrographs, the values of E_f changed very little, on average by only 0.5%. These changes in the computed E_f indices are insignificant because they are much less than the sampling variation. This shows that unless the change in interval is large, the sampling interval has a minor effect on values of the Nash–Sutcliffe index. In conclusion, the time interval should be kept as small as possible for the most accurate index in cases where a bad fit is suspected; otherwise, it is assumed that the values of the index are not sensitive to the time interval. However, if a hypothesis test is to be made with the data, a test of significance can be sensitive to the sampling interval when the sample size is small.

Effect on E_f of Time Offset Bias

Time-dependent models based on measured data may be subject to time-offset errors if the rainfall and runoff are not synchronized on the time scale. This could occur if the rain gauge was not located within the watershed and the rainfall hyetograph was offset from the runoff hydrograph by an amount equal to the travel time of the rainfall between the watershed and the rain gauge. In Fig. 3, time-offset errors are evident in both the 1959 and 1960 events. More-detailed hydrologic models, such as continuous moisture accounting models like the Hydrological Simulation Program–FORTRAN model, include storage components and parameters that control the release of water from the storages. The storage components of these models might also contribute to time-offset errors in continuous moisture accounting. If the model components and parameters are not properly calibrated, predicted discharges or pollution loads may be offset in time from the measured values. A time offset or an inaccurate modeling of the recession of flows from the storages can significantly affect the goodness of fit.

Time-offset errors will increase the numerator of the Nash–Sutcliffe index [Eq. (1)]. Specifically, the error variation term Σ(Ŷ − Y)^2 will be inflated by time-offset errors. In a sense, the time-offset errors are biases on the time scale, in contrast to a bias in the magnitude, as discussed previously. Even if computed hydrographs take the general shape of the measured hydrographs, but are offset in time, the error variation term of Eq. (1) can be large, which will produce low values of E_f.

To assess the effect of time-offset error, a gamma distribution hydrograph with a shape parameter of 4.7 was translated on the time axis to reflect a model that has not been properly calibrated to fit in the time domain but reproduces the magnitudes of the measured hydrograph. To provide dimensionless indicators, the time offset was scaled as a fraction of the time to peak. The effects of a time-offset bias on the Nash–Sutcliffe index are evident in Table 1.

Table 1 shows that as the offset interval increases the index and, therefore, the goodness of fit decreases. The smaller the time interval, the less significant an offset is on the E_f, as is shown by the higher values for small percentage changes. For a large interval, i.e., large changes, the decrease in the index can be dramatic, decreasing to about 0.81 for a 24% time offset for the hydrograph with a shape parameter of 4.7. These results show the importance of choosing an appropriate time interval and of recognizing that a time offset can significantly affect the goodness of fit of hydrograph models.

The sensitivity of the efficiency index to time-offset bias can also be illustrated using two of the White Oak unit hydrographs of Fig. 3. The storm events of 1959 and 1960 produced unit hydrographs that were sensitive to rainfall characteristics. The rainfall for the 1960 event was of short duration and high intensity, which produced a relatively peaked UH. The 1959 event began with a period of low-intensity rainfall followed by a short period of intense rainfall, which produced a unit hydrograph with a relatively flat rising limb followed by a peak that exceeded the peak of the
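
The time-offset experiment can be sketched by translating a gamma-shaped hydrograph along the time axis and recomputing the index, and then by scanning a series of lags to see which one restores the fit. Only the shape parameter of 4.7 comes from the text; the scale, time step, number of ordinates, and function names are hypothetical, so the values produced are illustrative and are not a reproduction of Table 1.

```python
import math
from typing import List, Sequence, Tuple

SHAPE = 4.7        # shape parameter quoted in the text
SCALE = 1.0        # hypothetical scale parameter
DT = 0.1           # hypothetical time step
N_ORDINATES = 200  # hypothetical number of ordinates

TIME_TO_PEAK = (SHAPE - 1.0) * SCALE  # mode of the gamma-shaped curve


def gamma_ordinate(t: float) -> float:
    """Unnormalized gamma-shaped ordinate; zero before the hydrograph starts."""
    if t <= 0.0:
        return 0.0
    return (t / SCALE) ** (SHAPE - 1.0) * math.exp(-t / SCALE)


def hydrograph(offset: float = 0.0) -> List[float]:
    """Ordinates of the hydrograph delayed by `offset` time units (negative = earlier)."""
    return [gamma_ordinate(i * DT - offset) for i in range(1, N_ORDINATES + 1)]


def efficiency(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Efficiency index of Eq. (1)."""
    mean_measured = sum(measured) / len(measured)
    sse = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    total = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - sse / total


MEASURED = hydrograph()


def offset_efficiency(fraction_of_time_to_peak: float) -> float:
    """E_f for a prediction offset by a given fraction of the time to peak."""
    return efficiency(hydrograph(fraction_of_time_to_peak * TIME_TO_PEAK), MEASURED)


def lag_scan(predicted_offset: float, lags: Sequence[float]) -> List[Tuple[float, float]]:
    """E_f after re-shifting an offset prediction by each lag (an efficiogram-style scan)."""
    return [(lag, efficiency(hydrograph(predicted_offset + lag), MEASURED)) for lag in lags]
```

In this sketch the prediction has exactly the right shape and magnitude, so all of the loss in E_f comes from the offset, and the lag that maximizes E_f in `lag_scan` is the negative of the offset that was introduced.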
offset bias is very significant in the 1959 unit hydrograph. For the 1960 UH, the efficiogram for a 2 h lag is: −1.40, −1.17, −0.89, −0.44, 0.12, 0.60, 0.82, 0.91, 0.80. This shows that a time offset of 4 h, i.e., two time lags, would increase the efficiency from 0.60 to 0.91. In both cases, the time-offset bias in the actual unit hydrographs caused a significant loss of accuracy. In the assessment of the accuracy of a time-dependent model, the efficiogram should be computed and examined for a time-offset bias. If such a bias is evident, it may be useful to either revise the model or adjust the data to account for the underlying cause of the time-offset bias. Time-offset biases should be investigated before the data analysis, but they are often difficult to detect, such as where they are caused by highly variable storm cells in an area with a low density of rain gauges. However, the efficiogram can be a useful aid in detecting the errors.

Conclusions

Sample values of the Nash–Sutcliffe index are values of a random variable and subject to sampling variations, as is any random variable. A method that approximates the sampling distribution of E_f was presented, and it was shown how the method can be used both to test hypotheses about the underlying population value of E_f and to compute confidence intervals for sample values. These statistical tools will enable users of the Nash–Sutcliffe index to systematically assess values of E_f, thus avoiding subjective assessments of goodness of fit.

Hydrologic models are intended to reflect the physical processes that the model is designed to represent. Less than ideal values of goodness of fit are not necessarily indicative of a poor model. Rather, they may be the result of the misuse of the goodness-of-fit index. Therefore, having reliable goodness-of-fit criteria is an important element of the modeling process. A goodness-of-fit statistic is supposed to reflect some aspect of the prediction accuracy of the calibrated model. Selection of a statistic is, therefore, important to ensure that it will reflect the characteristic that it is intended to reflect.

The structure of the Nash–Sutcliffe efficiency index E_f is very similar to the Pearson product-moment correlation coefficient. Low values of E_f may be the result of model bias produced by the calibration, with bias resulting either from differences in magni-

References

ASCE Task Committee on Definition of Criteria for Evaluation of Watershed Models of the Watershed Management, Irrigation, and Drainage Division (ASCE). (1993). “Criteria for evaluation of watershed models.” J. Irrig. Drain. Eng., 119(3), 429–442.
Birikundavyi, S., Labib, R., Trung, H. T., and Rousselle, J. (2002). “Performance of neural networks in daily streamflow forecasting.” J. Hydrol. Eng., 7(5), 392–398.
Downer, C. W., and Ogden, F. L. (2004). “GSSHA: Model to simulate diverse stream flow producing processes.” J. Hydrol. Eng., 9(3), 161–174.
Erpul, G., Norton, L. D., and Gabriels, D. (2003). “Sediment transport from interrill areas under wind-driven rain.” J. Hydrol., 276, 184–197.
Fisher, R. A. (1928). “The general sampling distribution of the multiple correlation coefficient.” Proc. R. Soc. London, 121, 654–673.
Hare, G. S. (1970). “Effects of urban development on storm runoff rates.” Proc., Seminar on Urban Hydrology, Paper No. 2, HEC, Corps of Engineers, Davis, Calif.
Jennings, M. E., Thomas, W. O., Jr., and Riggs, H. C. (1994). “Nationwide summary of U.S. Geological Survey regional regression equations for estimating magnitude and frequency of floods for ungaged sites, 1993.” USGS WRI 94-4002, U.S. Geological Survey, Reston, Va.
Johnson, M. S., Coon, W., Mehta, V., Steenhuis, T., Brooke, E., and Boll, J. (2003). “Applications of two hydrologic models with different runoff mechanisms to a hillslope dominated watershed in the northeastern U.S.: A comparison of HSPF and SMR.” J. Hydrol., 284, 57–76.
Kalin, L., Govindaraju, R. S., and Hantush, M. M. (2003). “Effect of geomorphological resolution on modeling of runoff hydrograph and sedimentograph over small watersheds.” J. Hydrol., 276, 89–111.
McCuen, R. H. (2003). Modeling hydrologic change, CRC, Boca Raton, Fla.
McCuen, R. H. (2005). Hydrologic analysis and design, Pearson/Prentice-Hall, Upper Saddle River, N.J.
McCuen, R. H., Leahy, R. B., and Johnson, P. A. (1990). “Problems with logarithmic transformations in regression.” J. Hydraul. Eng., 116(3), 414–428.
Merz, R., and Bloschl, G. (2004). “Regionalization of catchment model parameters.” J. Hydrol., 287, 95–123.
Nash, J. E., and Sutcliffe, J. V. (1970). “River flow forecasting through conceptual models. Part 1: A discussion of principles.” J. Hydrol., 10(3), 282–290.