Evaluating The Performance of Morphological Models: J. Sutherland, A.H. Peet, R.L. Soulsby
Abstract
Evaluating the performance of numerical models of coastal morphology against observations is an essential part of
establishing their credibility. In the past, this has usually been done by comparing predicted with observed behavior, and the
modeler making a subjective judgement of the goodness of fit. By contrast, meteorological models have been tested for many
years against observations using objective, scientifically rigorous methods. Drawing on the analogy between distributions of
atmospheric pressure (and similar scalars) and coastal bathymetry, the meteorological statistical methods are here adapted to
coastal applications. A set of criteria is proposed that identify the desirable attributes of statistical measures of the performance
of coastal morphological models. Various statistical parameters used in the meteorological field are described, dealing both with
cases in which the model outputs comprise a small number of categories (e.g., advance/equilibrium/retreat of shoreline), and
those in which they are continuous (e.g., bathymetry). Examples of the application of these methods are given, in terms of both
hypothetical illustrations, and real field data and model predictions. Following meteorological practice, it is shown that
measuring the skill of a model (i.e., its performance relative to a simple baseline predictor) is a more critical test than measuring
its absolute accuracy. The attributes of the different performance measures are compared with the proposed desirable criteria,
and those that match them best are selected. These are the LEPSOB test for categorical data, and the Brier Skill Score (BSS) for
continuous data. Routine use of these measures of performance would aid inter-comparability of models, and be a step toward
strengthening user confidence in the predictions of coastal numerical models.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Model validation; Morphological model; Skill; Accuracy; Bias; Correlation; Brier skill score; Linear error in probability space
J. Sutherland et al. / Coastal Engineering 51 (2004) 917–939

missing from the literature, despite the variety of available models. During the early stages of model development, this form of assessment is arguably sufficient, as it can be obvious at a glance where further model improvement is required when a simple comparison of model prediction with data is made. However, a model that is to be used as an engineering tool requires a more objective assessment.

Coastal zone managers and engineers requiring answers to problems are not concerned with the details of the internal processes in the specific model they use. They are concerned with the fitness for purpose of the model and, perhaps most importantly, its credibility. Therefore, it is necessary to establish completely objective standard algorithms of model performance, which can be used to determine the bias, the accuracy and the skill of a model (defined below).

In this paper, the problem of evaluating morphological models is addressed by drawing on established methods used in meteorology, where predictions of pressure and rainfall are routinely evaluated against data. This is possible, as the spatial distribution of coastal bathymetry is directly analogous to the distribution of atmospheric pressure (or rainfall) in meteorology. First, we propose criteria for assessing the suitability of model performance statistics. Next, a number of methods for evaluating different kinds of meteorological models (Heidke, 1926; Willmott et al., 1985; Murphy and Epstein, 1989; Gandin and Murphy, 1992; Wilks, 1995; Livezey et al., 1996; Potts et al., 1996) are described in the context of coastal morphological modeling. These methods are then applied to both hypothetical and real cases of coastal morphological modeling. The degree to which the statistics match up to the criteria is assessed and the best measures, according to these criteria, are chosen.

1.1. Bias, accuracy and skill

The performance of any model can be assessed by calculating its bias, accuracy and skill.

Bias is a measure of the difference in the central tendencies of the predictions and the observations. The most common measure of the central tendency is the mean, although there are circumstances where using the median is appropriate. The mode or modal group (for binned data) could also be used, but as there seems no particular reason for doing so, it will not be considered further. Bias is sometimes referred to as reliability. In this definition a "reliable" model does not consistently over-predict or under-predict, but is not necessarily accurate.

Accuracy is a measure of the average size of the difference between a set of predictions and the corresponding observations (the average error), so even when there is no bias in a model it may still have a low accuracy. Accuracy can be represented in a dimensional or a non-dimensional (relative accuracy) manner.

Skill is a non-dimensional measure of the accuracy of a prediction relative to the accuracy of a baseline prediction, which could, for example, be an initial condition, an average value, a random choice (from the possible range of outcomes) or a simple empirical predictor (such as a Dean profile). Skill can be based on any measure of accuracy. The difference between accuracy and skill can be illustrated by the following hypothetical example. A model of bar movement predicted a small offshore movement, but the measurements showed a small onshore movement. The difference between the predicted and observed bar crest distance from the shore was still small, so the accuracy of the bar crest position was still relatively high, even though the model moved the bar in the wrong direction. However, introducing a baseline prediction (that the bar does not move) showed that the prediction was less accurate than the baseline prediction, so the model had a negative level of skill.

All skill scores possess an origin, scale and orientation. The origin is the result obtained if the model predicts the baseline prediction. The scale is the average score from perfect modeling, and a positive orientation implies that larger scores are more skillful. These should be chosen to have sensible values, such as origin = 0, scale = 1 and a positive orientation.

1.2. Desirable properties of a model performance statistic

This paper considers a number of model performance statistics that can be applied to morphodynamic modeling. The statistics are compared and particular ones are recommended for future use. The authors
consider that the following properties are desirable in a model performance statistic:

(1) The statistic should be relatively easy to understand. Therefore, it should have a simple and clear conceptual basis.
(2) The statistic matches up with expert opinion. It should generally return a good score when experts consider that the modeling has been a success.
(3) The statistic should appear "honest", in the sense that it is not easy to alter the results by altering the model set-up slightly. For example, the Brier Skill Score (BSS) (Section 3.5) is not altered by the addition of an additional inactive area to the domain of a morphodynamic model, whereas the correlation coefficient is. This makes the BSS more "honest".
(4) The statistic should be transferable from one model to another and from one data set to another. Therefore, there is a preference for non-dimensional quantities such as skill scores or relative errors rather than dimensional measures of accuracy that may be expected to have lower values in, for example, a flume test than a field experiment.
(5) The statistic should take a range of values that makes it intuitively easy to assess the quality of the modeling from the value returned.
(6) The statistic should measure skill rather than accuracy. This follows partly from the aforementioned properties, all of which can be met by skill scores, and partly because skill scores can be made equitable (defined in Section 2.4).

1.3. Types of prediction

Predictions can be either categorical or probabilistic, and either ordered or unordered. A categorical prediction is used when a model predicts one of a finite number of possible outcomes or categories. For example, shoreline position changes could be divided into three categories: retreat, equilibrium and advance. A probabilistic prediction is used when a model predicts a time series of a discrete variable (e.g., shoreline position) or the values of a continuous variable (e.g., points along a beach profile). The number of possible outcomes is now infinite and so comparisons are made with values and not categories. The predictions from a probabilistic model can be expressed in categorical terms by sorting the output into bins. This allows the skill of probabilistic models to be compared to the results from simple categorical predictors. Moreover, by allowing the number of bins to increase until each bin is the same size as the resolution of the measurements, a categorical model can be applied to a probabilistic case.

In an ordered variable, it is possible to determine not only whether the prediction is right or wrong, but also how badly wrong an incorrect prediction is. Skill assessments of ordered predictions have to take the size of the error into account. For example, probabilistic predictions or measurements of bed elevation, water level or current speed are ordered as the values lie within a continuous range and it is easy to calculate the difference between observation and prediction. Categories can also be ordered and, in this case, a skill assessment that takes into account category error should be performed. Category error is a measure of how close an incorrect prediction was to the correct prediction. An example of ordered categories is the change in shoreline position, which can be split into three categories: retreat, equilibrium and advance. The order retreat, equilibrium, advance is correct as there is a logical progression from the first category to the last. With unordered categories it makes no difference which incorrect category is predicted; they are all equally wrong.

2. Categorical assessment

2.1. Categorical assessment ignoring category error

The results from comparing categorical predictions and observations can always be represented by one of four possible outcomes, which can be recorded in a

Table 1
Possible outcomes from categorical prediction, shown in a contingency table

                              Prediction
                              Predicted    Not predicted    Total
Observation  Occurred         H            M                H+M
             Did not occur    F            C                F+C
             Total            H+F          M+C              J
2×2 contingency table, as shown in Table 1. The variables in the table are:

H  Hits: the number of times the event was predicted and observed
M  Misses: the number of times the event was observed but not predicted
F  False alarms: the number of times the event was predicted but did not occur
C  Correct rejections: the number of times the event was not predicted and did not happen.

A number of statistics can be calculated to assess the performance of the modeling (Wilks, 1995). Here the bias, Bias = (H+F)/(H+M), is the number of times the event was predicted compared to the number of times the event was observed. The accuracy can be expressed by a number of measures. The Percentage Correct, PC = (H+C)/J, is the ratio of correct predictions as a fraction of the total number of forecasts, J = H+M+F+C. The False Alarm Rate, FAR = F/(H+F), represents the number of false predictions divided by the number of forecasts of the event, while the Probability of Detection, POD = H/(H+M), is the number of correct predictions of the event divided by the number of times the event was observed. The Threat Score, TS = H/(H+F+M), is the number of successful predictions of the event divided by the number of predictions minus the number of correct rejections. Note, however, that both Percentage Correct and Probability of Detection are referred to as the Hit Rate, so the latter term is ambiguous and should be avoided.

The skill of a categorical prediction can be determined (Heidke, 1926) using the Heidke Skill Score (HSS) given by:

HSS = (PC − G) / (1 − G)    (1)

where PC = Percentage Correct (expressed as a fraction) and G is the fraction of predictions of the correct categories (H and C) that would be expected from a random choice. G is the sum of the fraction of events that occurred and were predicted by chance, given by (H+F)(H+M)/J^2, added to the fraction of events that did not occur and were predicted by chance not to occur, given by (M+C)(F+C)/J^2. The HSS is the percentage correct, corrected for the number expected to be correct by chance.

The advantages of using the HSS, rather than PC, are illustrated in the following example. Let a particular state occur for 125 out of 2500 observations (5%). Let 100 of these events be predicted and 25 not predicted, with there being 50 false alarms. The 2×2 contingency table is shown in Table 2. The resulting statistics are: PC = 0.97, FAR = 0.33, Probability of Detection = 0.80, Threat Score = 0.57 and HSS = 0.71. The HSS is as low as 0.71 because 90% of the events would be correctly predicted by chance. If, instead, the events were never predicted (H = F = 0, M = 125, C = 2375), then PC = 0.95 (a reduction of only 0.02 on the previous case) whereas HSS = 0.0 as the predictive model has no skill.

Table 2
Example of categorical prediction

              Predicted    Not predicted
Observed      100          25
Not observed  50           2325

These statistics are most obviously applied to two-category systems but can also be applied to systems with more than two categories. However, in this case, all incorrect categories are treated as equal, which is good enough for unordered variables, but takes no account of category error (the magnitude of the error) for ordered categories, when it matters how badly wrong the prediction was. Examples of two-state categories include whether a channel is deep enough to take a particular draught of ship or not, and beach state indicators such as whether a beach is sufficiently wide for recreation or not. These statistics could also be used to assess flooding or overtopping warning systems. However, in many cases there will be more than two ordered categories and a skill score that takes into account category error should be used instead.

2.2. Ordered morphological categories

A skill assessment that takes into account category error (how close an incorrect prediction was to the correct prediction) should be used with ordered categories. In the earlier example of shoreline change, the correct order must be: retreat → equilibrium → advance (or advance → equilibrium → retreat), and not retreat → advance → equilibrium (or any of the other three permutations). This condition of ordinal catego-
Table 6
Linear error in probability space (LEPS) skill scores for categories of bar movement

                                      Observed                                          Weighted average
                                      Retreat      Equilibrium   Advance                of rows
                                      p(R)=0.46    p(E)=0.21     p(A)=0.33
Predicted  Retreat      p(R)=0.46      0.58        -0.21         -0.68                  0
           Equilibrium  p(E)=0.33     -0.34         0.28          0.29                  0
           Advance      p(A)=0.21     -0.75         0.01          1.03                  0
Weighted average of all skill scores                                                    0

Table 7
Observational linear error in probability space (LEPSOB) skill scores for categories of bar movement

                                      Observed                                          Weighted average
                                      Retreat      Equilibrium   Advance                of rows
                                      p(R)=0.46    p(E)=0.21     p(A)=0.33
Predicted  Retreat                     0.59        -0.21         -0.68                  0
           Equilibrium                -0.21         0.34          0.07                  0
           Advance                    -0.68         0.07          0.89                  0
Weighted average of all skill scores                                                    0
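The zeros in the right-hand column express the equitability of the scores: weighting each row by the observed category frequencies gives no expected credit to a constant or random forecast. A short check of this property for the LEPSOB values of Table 7; the sign pattern (positive on the diagonal, negative for the larger category errors) is our reading of the table, inferred from this zero-average constraint:

```python
import numpy as np

# Equitability check: each row of LEPSOB scores, weighted by the observed
# category frequencies, should average to (approximately) zero.
p_obs = np.array([0.46, 0.21, 0.33])      # p(R), p(E), p(A)
lepsob = np.array([
    [ 0.59, -0.21, -0.68],                # predicted retreat
    [-0.21,  0.34,  0.07],                # predicted equilibrium
    [-0.68,  0.07,  0.89],                # predicted advance
])
row_means = lepsob @ p_obs
print(np.round(row_means, 2))             # each entry close to zero
```

Note that the LEPSOB matrix is symmetric, so the same cancellation holds whether rows or columns are weighted.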
ment techniques can be used both for time series of discrete variables, such as bar height and shoreline position, and for continuous variables, such as cross-shore bathymetry or an area map of bathymetry.

3.1. Bias

The basic equations for bias in the mean, Bias_a, and in the median, Bias_m, are:

Bias_a = (1/J) Σ_{j=1..J} (y_j − x_j) = ⟨Y⟩ − ⟨X⟩    (3)

Bias_m = Y_M − X_M    (4)

where Y is a set of J forecasts or predictions, y_1, y_2, ..., y_J, and X is a set of J observations, x_1, x_2, ..., x_J, with the nth element of each set occurring at the same point in space and time. Moreover, Y_M and X_M are the median values in Y and X, and angular brackets denote the mean. Bias reveals the tendency toward under- or over-prediction, with a positive bias indicating that the model consistently overpredicts the observations.

3.2. Accuracy

The most common measures of accuracy are the Mean Absolute Error, MAE, and the Mean Square Error, MSE, defined as:

MAE(Y, X) = (1/J) Σ_{j=1..J} |y_j − x_j| = ⟨|Y − X|⟩    (5)

MSE(Y, X) = (1/J) Σ_{j=1..J} (y_j − x_j)^2 = ⟨(Y − X)^2⟩    (6)

The above expressions assume that there is no need to use a weighted average. A general form of the expression for average error was given by Willmott et al. (1985), which includes weighting and the possible use of powers other than one and two to define a mean error statistic. There is no advantage to using powers other than one or two; therefore, only MAE, MSE and the root-mean-square error, RMSE, will be considered further. RMSE is often preferred to MSE as it has the same units as the measured variable. If necessary, Eqs. (5) and (6) can be extended to include weighting.

The choice between MAE and RMSE depends largely on the importance of outliers (points with particularly large differences between measured and predicted values) in the data. The presence of a few outliers will have a greater influence on RMSE than on MAE, as RMSE squares the differences. The RMSE is, therefore, greater than or equal to the MAE, so RMSE is a more conservative error predictor. The difference between RMSE and MAE depends on the shape of the distributions of Y and X and is reduced as Y and X tend toward symmetrical distributions. The circumstances of the modeling exercise will determine the importance that any outliers should be given and will influence the choice between MAE and RMSE.

The use of the modulus makes MAE nonanalytic and thus more difficult to manipulate mathematically than MSE. However, the MAE is equally applicable to vectors and scalars, and so is particularly useful for validating and evaluating hydrodynamic modeling (Sutherland et al., 2004). Moreover, the phrases "mean absolute error" and "root-mean-square error" contain within them the assumption that all errors are in the model prediction and the measurement is error-free. Calculations of best-fit lines also tend to assume that the data is error-free. This will never be the case in practice, and model evaluation should include a full discussion of the measurements and their inherent errors. Some consideration of measurement errors is given in Section 7.

3.3. Correlation

A widely used method of determining whether two series, X and Y, are related is the linear correlation coefficient, r_XY, given by:

r_XY = ⟨(Y − ⟨Y⟩)(X − ⟨X⟩)⟩ / (σ_X σ_Y) = s_XY / (σ_X σ_Y)    (7)

with σ_X the measured and σ_Y the predicted standard deviation, and s_XY the covariance between the observations and predictions. The value r_XY = 1 denotes complete positive correlation, while r_XY = 0 occurs when X and Y are uncorrelated. The coefficient r_XY is a measure of how strong a correlation is, but not how significant the correlation is, because the distributions of X and Y are not taken into account.
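The quantities of Eqs. (3) to (7) are straightforward to compute. The following sketch (with illustrative values, not data from the paper) shows the distinction drawn in Section 1.1: a constant offset produces bias without reducing accuracy's usefulness or the correlation:

```python
import numpy as np

# Bias, accuracy and correlation measures of Eqs. (3)-(7).
def bias_mean(y, x):   return np.mean(y) - np.mean(x)        # Eq. (3)
def bias_median(y, x): return np.median(y) - np.median(x)    # Eq. (4)
def mae(y, x):         return np.mean(np.abs(y - x))         # Eq. (5)
def mse(y, x):         return np.mean((y - x) ** 2)          # Eq. (6)
def rmse(y, x):        return np.sqrt(mse(y, x))

def corr(y, x):
    """Linear correlation coefficient r_XY, Eq. (7)."""
    s_xy = np.mean((y - y.mean()) * (x - x.mean()))          # covariance
    return s_xy / (np.std(x) * np.std(y))

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical observations
y = x + 0.1                          # a model that over-predicts by 0.1
print(bias_mean(y, x))               # ~0.1: consistent positive bias
print(mae(y, x), rmse(y, x))         # both ~0.1
print(corr(y, x))                    # ~1.0: an offset leaves correlation perfect
```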
In evaluating the performance of a morphodynamic model, it is often useful to plot out the measured and modeled anomalies. The anomaly is the difference between the final measurements (or predictions) and the baseline prediction of the final bathymetry. Let B be a set of N baseline predictions, b_1, b_2, ..., b_N, with the nth baseline prediction occurring at the same place and time as the nth value of the predictions and observations, Y and X. The modeled and measured anomalies are given by:

Y′ = Y − B    (8)

X′ = X − B    (9)

In many cases the baseline prediction for the final bathymetry used in morphological modeling is the initial bathymetry (i.e., it is assumed that the bed does not alter). In this case, the measured and modeled anomalies are simply the measured and modeled changes in bed level during the time being modeled. The anomalies commonly have a much smaller vertical scale than the measured and modeled bed, so plotting the anomalies on a larger axis is a very useful way of assessing where the modeling was good and where it was bad. It is also possible to calculate an anomaly correlation coefficient, r_X′Y′, given by:

r_X′Y′ = ⟨Y′X′⟩ / (σ_Y′ σ_X′)    (10)

This correlation coefficient has a value of 1 for a perfect prediction, which is reduced due to error in predicting the mean bed level and the position of erosion and deposition. No reduction in skill is made for amplitude error.

3.4. Skill

Skill scores can be based on Mean Absolute Error, Mean Square Error or Root Mean Square Error. The general form of skill score, SS, can be given by:

SS = [A(B, X) − A(Y, X)] / [A(B, X) − A(W, X)]    (11)

where A(B, X) is the accuracy of the baseline prediction, A(Y, X) is the accuracy of the predictions and A(W, X) is the accuracy of a perfect set of predictions, with W a set of J perfect predictions. MSE, RMSE and MAE are suitable measures of accuracy for calculating skill.

The numerator in Eq. (11) is the improvement in accuracy of the model compared to the baseline prediction. This can be an intuitively simple model, such as an initial condition or an average value, or it can be a simple empirical predictor (such as a Dean profile), for example. The denominator is the total possible improvement in accuracy, given by the difference between the accuracy of the baseline prediction and the result from perfect modeling (Murphy and Epstein, 1989). Commonly the accuracy of perfect modeling, represented by MAE(W, X), MSE(W, X) or RMSE(W, X), is taken to be zero. This simplifies the expression for skill.

3.5. Skill scores for probabilistic models

If the error associated with a perfect prediction, A(W, X) in Eq. (11), is taken to be zero, then three skill scores can be derived for probabilistic models by using mean square error (MSE), root mean square error (RMSE) and mean absolute error (MAE) as measures of accuracy. These are the BSS:

BSS = 1 − MSE(Y, X)/MSE(B, X) = 1 − ⟨(Y − X)^2⟩ / ⟨(B − X)^2⟩    (12)

the root-mean-square skill score:

RMSSS = 1 − RMS(Y, X)/RMS(B, X) = 1 − ⟨(Y − X)^2⟩^(1/2) / ⟨(B − X)^2⟩^(1/2)    (13)

and the mean absolute skill score:

MASS = 1 − MAE(Y, X)/MAE(B, X) = 1 − ⟨|Y − X|⟩ / ⟨|B − X|⟩    (14)

The BSS is commonly used in meteorology and has already been applied to the modeling of coastal morphodynamics by Brady and Sutherland (2001), Sutherland et al. (2001), Van Rijn et al. (2002), Van Rijn et al. (2003) and Sutherland and Soulsby (2003), all coming from the COAST3D project (Soulsby, 2001). Gallagher et al. (1998) used the RMSSS in a local sand transport model driven by measured flows.
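Under the zero-perfect-error assumption, Eqs. (12) to (14) reduce to ratios of mean errors. A minimal sketch with illustrative numbers, showing the origin and orientation discussed in Section 1.1 (a perfect model scores 1, the baseline scores 0, and a model worse than the baseline goes negative):

```python
import numpy as np

# Skill scores of Eqs. (12)-(14), taking the error of a perfect prediction as zero.
def bss(y, b, x):   return 1 - np.mean((y - x)**2) / np.mean((b - x)**2)           # Eq. (12)
def rmsss(y, b, x): return 1 - np.sqrt(np.mean((y - x)**2) / np.mean((b - x)**2))  # Eq. (13)
def mass(y, b, x):  return 1 - np.mean(np.abs(y - x)) / np.mean(np.abs(b - x))     # Eq. (14)

b = np.zeros(4)                       # baseline prediction: the bed does not move
x = np.array([0.1, 0.3, -0.2, 0.4])   # hypothetical observed bed changes (m)
print(bss(x, b, x))                   # 1.0: perfect prediction
print(bss(b, b, x))                   # 0.0: predicting the baseline
print(bss(-x, b, x))                  # -3.0: worse than the baseline (unbounded below)
```

The last line illustrates the unbounded lower limit noted below: predicting the observed changes with the wrong sign doubles every error, so the mean square error is four times the baseline's.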
Perfect agreement gives a skill score of 1, whereas modeling the baseline condition gives a score of 0. If the model prediction is further away from the final measured condition than the baseline prediction, the skill score is negative. Note that these skill scores are unbounded at the lower limit. Therefore, they can be extremely sensitive to small changes when the denominator is low, in common with other non-dimensional skill scores derived from the ratio of two numbers. Therefore, large negative values can be obtained even from models which predict a small change (of the correct order of magnitude) when the measured change is very small. In these circumstances, different models of the same experiment can still be compared (as the same small denominator will be used) to get a ranking of relative merit. Note that when the denominator reduces to a similar size as the error in the measurements, then the skill score becomes effectively meaningless.

with the terms set out below:

α = r_{Y′X′}^2        β = (r_{Y′X′} − σ_{Y′}/σ_{X′})^2

γ = ((⟨Y′⟩ − ⟨X′⟩)/σ_{X′})^2        ε = (⟨X′⟩/σ_{X′})^2    (16)

Here,

α is a measure of phase error: when the sand is moved to the wrong position. Perfect modeling of the phase gives α = 1.
β is a measure of amplitude error: when the wrong volume of sand is moved. Perfect modeling of phase and amplitude gives β = 0.
γ is a measure of the map mean error: when the predicted average bed level is different from the measured. Perfect modeling of the mean gives γ = 0.
ε is a normalization term, which is only affected by measured changes from the baseline prediction.
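A sketch of the terms in Eq. (16), computed from the anomalies Y′ and X′ of Eqs. (8) and (9). The example anomaly is ours, not the paper's; halving it reproduces a pure amplitude error (α = 1, β = 0.25, γ ≈ 0), the behavior seen later for the half-anomaly scenario:

```python
import numpy as np

# Murphy-Epstein decomposition terms of Eq. (16), from modeled and
# measured anomalies y_anom (Y') and x_anom (X').
def me_terms(y_anom, x_anom):
    sy, sx = np.std(y_anom), np.std(x_anom)
    r = np.mean((y_anom - y_anom.mean()) * (x_anom - x_anom.mean())) / (sy * sx)
    alpha = r ** 2                                       # phase error term
    beta = (r - sy / sx) ** 2                            # amplitude error term
    gamma = ((y_anom.mean() - x_anom.mean()) / sx) ** 2  # map mean error term
    eps = (x_anom.mean() / sx) ** 2                      # normalization term
    return alpha, beta, gamma, eps

# Zero-mean example anomaly: one full sine wave sampled at 96 points.
x_anom = np.sin(np.linspace(0.0, 2.0 * np.pi, 97))[:-1]
alpha, beta, gamma, eps = me_terms(0.5 * x_anom, x_anom)
print(round(alpha, 2), round(beta, 2), round(gamma, 2))  # 1.0 0.25 0.0
```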
of assumptions) a BSS greater than 0.2 represents a useful forecast. However, transferring this rule-of-thumb for atmospheric forecasting to morphodynamic modeling may be inappropriate, and a suitable level of BSS to represent a useful forecast in coastal morphodynamic modeling will emerge from experience.

3.7. Indices of agreement

Willmott (1981) introduced an index of agreement, d_2, intended to act as a measure of the extent to which the predictions are error-free. This measure of relative accuracy was non-dimensionalized to ensure that it can be easily interpreted and used to compare between models. It is bounded by 1 and 0, with a value of 1 implying perfect agreement between predictions and observations. A value of zero implies complete disagreement. The concept was extended in Willmott et al. (1985) to include an index, d_1, that used mean absolute errors as well as the original d_2 using mean square errors. They are based on the assumption that the measurements are error-free (so the mean of the measurements is error-free). The predicted and measured deviations from the weighted mean of the observations are Y − ⟨X⟩ and X − ⟨X⟩, and the sum of these terms gives the maximum potential error. The potential error and its square form the denominators of d_1 and d_2. No assumptions about the distributions of observed or predicted variables are made. The numerator is the mean absolute difference or mean-square error. The indices are:

d_1 = 1 − ⟨|Y − X|⟩ / ⟨|Y − ⟨X⟩| + |X − ⟨X⟩|⟩    (18)

d_2 = 1 − ⟨(Y − X)^2⟩ / ⟨(|Y − ⟨X⟩| + |X − ⟨X⟩|)^2⟩    (19)

Alternative versions of the same concept are possible. One possibility is to subtract the median, rather than the mean, of the measured values in the denominator, as ⟨|X − X_M|⟩ is smaller than ⟨|X − ⟨X⟩|⟩. Indeed, it is possible to have alternative measures of all the measures of relative accuracy. In hydrodynamic modeling, it is common to non-dimensionalize the accuracy by dividing by the mean absolute value, or root mean square value, of the measured variable. Similar measures of relative accuracy could be derived for bathymetry by dividing the accuracy by the mean or standard deviation of the measured bed levels. However, neither the mean nor the standard deviation of the bed level has any great meaning. The mean bed elevation, in particular, is an unsatisfactory statistic to use as its value depends on the datum used, and a relative accuracy that varies as the datum changes from chart datum to mean sea level cannot be used for intercomparisons between cases.

4. Comparison between probabilistic model performance statistics

The difference between the model performance statistics is demonstrated below using a hypothetical data set. In this case, a Dean (1977) equilibrium cross-shore profile was chosen as the initial profile. This took the form z = Dx^(2/3), with z the vertical distance of the bed below the onshore level (here set to 1 m), D = 0.1 (corresponding to a median grain diameter of about 0.2 mm), and x the cross-shore distance from the start of the Dean profile. A final 'measured' bed profile, X, was generated by adding an anomaly to the initial profile. The anomaly, X′, was given by the sum of two sine waves:

X′ = 0.20 sin(kx) + 0.05 sin(2kx)    (20)

with k = π/24, representing a wavelength of 48 m. X′ was given by Eq. (20) for 0 ≤ x ≤ 48 and X′ = 0 elsewhere. A number of predicted final bathymetries were devised:

Scenario 1  perfect modeling: Y = B + X′
Scenario 2  initial profile plus half the measured anomaly: Y = B + 0.5X′
Scenario 3  initial profile plus one and a half times the measured anomaly: Y = B + 1.5X′
Scenario 4  initial profile plus the anomaly shifted 3.5 m offshore: Y = B + X′(x − 3.5)
Scenario 5  initial profile plus the anomaly plus a constant vertical offset of 0.055 m: Y = B + X′ + 0.055
Scenario 6  initial profile plus half the measured anomaly, but with the profile extended out to 70 m.
The initial and final bathymetries and the anomalies of Scenarios 1 to 5 are shown in Fig. 1. The measures of accuracy, relative accuracy, skill and correlation defined previously were calculated for each of the scenarios, and the resulting values are shown in Table 8. The baseline prediction of the final bathymetry for the skill scores was the initial bed profile. The mean absolute error between the initial and final measured profiles, MAE(B, X), was 0.111 m (out to 55 m), while the root mean square difference was 0.136 m. Table 8 shows that perfect modeling (Scenario 1) gives bias and error of zero and skill and indices of agreement of 1, as expected.
Table 8
Accuracy, skill and correlation from the idealized cross-shore profile modeling scenarios

Score        Scenario 1   Scenario 2   Scenario 3   Scenario 4   Scenario 5   Scenario 6
Bias         0.0000       0.0000       0.0000       0.0000       0.0550       0.0000
MAE(Y, X)    0.0000       0.0550       0.0550       0.0550       0.0550       0.0440
RMSE(Y, X)   0.0000       0.0680       0.0680       0.0660       0.0550       0.0600
MSE(Y, X)    0.0000       0.0046       0.0046       0.0044       0.0031       0.0036
r_XY         1.0000       0.9800       0.9800       0.9800       1.0000       0.9900
r_X′Y′       1.0000       1.0000       1.0000       0.8800       1.0000       1.0000
BSS          1.0000       0.7500       0.7500       0.7600       0.8300       0.7500
RMSSS        1.0000       0.5000       0.5000       0.5100       0.5900       0.5000
MASS         1.0000       0.5000       0.5000       0.5000       0.5000       0.5000
α            1.0000       1.0000       1.0000       0.7700       1.0000       1.0000
β            0.0000       0.2500       0.2500       0.0100       0.0000       0.2500
γ            0.0000       0.0000       0.0000       0.0000       0.1700       0.0000
d_1          1.0000       0.8960       0.8910       0.8930       0.8900       0.9390
d_2          1.0000       0.9910       0.9900       0.9910       0.9940       0.9950
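Willmott's indices (Eqs. (18) and (19)) can be checked in the same way. A sketch with illustrative data, confirming the Scenario 1 behavior of Table 8 (a perfect prediction gives d_1 = d_2 = 1):

```python
import numpy as np

# Indices of agreement, Eqs. (18) and (19).
def d1(y, x):
    pot = np.mean(np.abs(y - x.mean()) + np.abs(x - x.mean()))   # potential error
    return 1.0 - np.mean(np.abs(y - x)) / pot

def d2(y, x):
    pot = np.mean((np.abs(y - x.mean()) + np.abs(x - x.mean())) ** 2)
    return 1.0 - np.mean((y - x) ** 2) / pot

x = np.array([0.2, 0.5, 0.9, 0.4])   # illustrative observations
y = np.array([0.3, 0.4, 0.8, 0.5])   # illustrative predictions
print(d1(x, x), d2(x, x))            # 1.0 1.0 for a perfect prediction
print(d1(y, x), d2(y, x))            # both between 0 and 1 for imperfect modeling
```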
The mean absolute error in the modeling, MAE(Y, X), equals 0.055 m for both Scenarios 2 and 3. The skill scores and correlation coefficients were the same for both scenarios as well, showing that the statistics cannot differentiate between the two cases. The Murphy–Epstein decomposition of the BSS shows that the phase term α was 1, as the predicted anomaly was in phase with the measured. The mean and normalization terms γ and ε are both zero, while the reduction in skill from a perfect score of 1 was due to amplitude errors (as term β was non-zero in both cases). The correlation coefficients had values of 0.98, reduced from one by the change in standard deviation of the predicted profile.

In Scenario 4, the anomaly was shifted 3.5 m offshore, so that the mean absolute error, MAE(Y, X), was approximately the same as for Scenarios 2 and 3. The skill scores are essentially the same as before (small changes of ±0.01 were due to the finite discretization of the profiles in both horizontal and vertical directions). The correlation coefficient was still 0.98, but the anomaly correlation coefficient was significantly reduced, to a value of 0.88. The decomposition of the BSS shows that this phase error was the main cause of the loss of skill in Scenario 4.

In Scenario 5, the constant vertical offset was chosen to give the same mean absolute error as in Scenarios 2 to 4. However, in this scenario the root mean square error was lower than before (0.055 rather than 0.068), so the BSS and the RMS Skill Score were higher than for Scenarios 2 to 4, but the Mean Absolute Skill Score was the same as before. The correlation coefficients were both 1, as they take no account of errors in predicting the mean value. The Murphy–Epstein decomposition of the BSS shows that the only reduction in skill from a perfect score of one was due to the error in the mean.

The indices of agreement gave similar values for Scenarios 2 to 5. Scenario 6 used the same data as Scenario 2, but with the profile extended out to 70 m using Dean's profile. In this case, the measures of accuracy (the mean absolute error, the root mean square error and the mean square error) all reduced, but the skill scores did not. The skill scores are thus a more "honest" measure, as the modeling was not worsened by the addition of a morphologically inactive area.

4.1. General results of the model performance statistics

The MAE, MSE and RMSE all give different measures of the accuracy of the predictions. The use of the modulus makes the MAE nonanalytic and thus more difficult to manipulate mathematically than the RMSE (or MSE). However, the MAE is not as heavily influenced by outliers as the RMSE (Hedges, 2001) and is always lower than, or equal to, the RMSE. The average error (and hence accuracy) is probably best represented by MAE, as it does not depend on the distribution of the errors. The MAE, MSE and RMSE are all reduced by the addition of a morphologically inactive area to the model domain
930 J. Sutherland et al. / Coastal Engineering 51 (2004) 917–939
(compare results from Scenario 6 and Scenario 2). Hence, the accuracy produced by a model can often be improved by extending the model domain offshore into deeper water where there is less sediment transport. This is not a good way to judge the performance of a model.

The skill scores, however, give the same score when the morphologically inactive offshore area is introduced. The variation between the skill scores reflects the variations in the MAE, MSE and RMSE. The RMS Skill Score gives the same values as the Mean Absolute Skill Score, except for Scenario 5 (vertical offset), as the distribution of errors was different in this case. The BSS gave the highest values as it is based on mean square errors. It will tend to give higher values for positive skill scores, but lower values for negative skill scores, than the Mean Absolute Skill Score. The ability to decompose the BSS into phase, amplitude and bias terms is definitely an advantage over the other skill scores (which have no equivalent to the Murphy–Epstein decomposition). This allows the relative importance of phase, amplitude and bias to be determined and may indicate the areas where a model most needs to be improved.

The correlation coefficient, r_XY, does not take into account errors in the mean value. Nor does it take into account the initial condition or any baseline prediction. Moreover, if Y = aX + b, then r_XY = 1 for any values of a and b. Therefore, quite dissimilar sets of observations and predictions may be perfectly correlated. Moreover, if the changes in bathymetry are small compared with the range of depths in the initial bathymetry, then the correlation coefficient will be near to 1 for any plausible model, as the hypothetical scenarios showed. A large error may be masked by high standard deviations, or a low error may seem high if the standard deviations are low. Therefore, correlation coefficients are a poor measure of the performance of any model, not just of morphodynamic models.

Willmott's indices of agreement are probably unsuited to testing morphodynamic modeling as they are based on variations from the measured mean, which is an arbitrary value that depends on the datum and has no significance for a beach. It also means that the denominator on the right hand side is often large (as measured and predicted levels both tend to be on the same side of the measured mean). This caused the values in the idealized scenarios to be close to one, making it difficult to distinguish between scenarios.

The skill scores are sensitive to mean, phase and amplitude errors. They are unaltered by the inclusion of an additional area of zero activity. They clearly match the criteria for desirable model performance statistics set out in Section 1.2. Their strength lies partially in the inclusion of the baseline prediction (usually the initial condition) within the method. Alternative baseline predictions include an average profile (if enough data were available) and a Dean equilibrium profile. The baseline prediction should always be clearly stated when using a skill score. Finally, an additional benefit of using the BSS for morphological modeling is the Murphy–Epstein decomposition, which is a useful tool in determining the source of a low skill score.

5. Application of BSS to a coastal profile model

The BSS has been applied to assess the skill of the coastal profile model COSMOS at Egmond-aan-Zee (NL) on the North Sea coast (Brady and Sutherland, 2001; Sutherland and Soulsby, 2003). The modeling was carried out as part of the COAST3D project (Soulsby, 2001; Van Rijn et al., 2002; Van Rijn et al., 2003).

Egmond-aan-Zee is on a long, almost straight stretch of the Dutch coast. However, there is a natural three-dimensionality to the bathymetry, produced by rip channels. These intersect a bar system that has an offshore bar, an inshore bar and an occasional swash zone bar as well. The period modeled during the COAST3D main experiment contained almost-continuous major storms with offshore significant wave height, H_s, of up to 5 m. The storms caused substantial changes in morphology. Regular bathymetric surveys were performed in water depths up to about 6 m.

The bathymetric changes along a cross-shore profile were modeled using the coastal profile model COSMOS (Nairn and Southgate, 1993; Southgate and Nairn, 1993). COSMOS includes wave transformation by refraction, shoaling, Doppler shifting, bottom friction and wave breaking. It also includes wave setup, cross-shore and longshore sediment transport
using the energetics approach, and sea bed level changes due to cross-shore transport. COSMOS was driven using wave heights, periods and directions measured by a directional wave buoy approximately 5 km offshore, with water levels taken from a fixed pole on the outer bar. COSMOS assumes that the beach and the hydrodynamics are both uniform in the longshore direction. The measured profiles from seven parallel cross-shore transects were averaged at the start and end of each modeled period to provide a more longshore-uniform profile. A comparison between the results from five different profile models is provided by Van Rijn et al. (2003).

A 7-day period, from 24 October to 31 October 1998, at the height of the storm, was modeled (Brady and Sutherland, 2001). Fig. 2 shows observed profiles at the start (24 October) and end (31 October) of the storm period, as well as the predicted profile at 31 October. The initial profile used in the modeling was the profile measured on 24 October. All distances along the cross-shore profiles were measured from the position of the wave buoy. The average of the hourly offshore measurements of root-mean-square wave height taken during these 7 days was 2.2 m, while the average measured spectral peak period was 8.5 s.

Fig. 2 also includes the BSS for each of three sections of the profile: the outer bar (from 4488 to 4660 m), the inner bar (from 4660 to 4870 m), and the swash zone (from 4870 to 5000 m). The baseline prediction of the final profile was the initial profile. The calculation of the BSS for the outer bar only extends out to 4488 m, due to severe wave conditions limiting the extent of the bathymetric survey.

The model predicted the cross-shore evolution of the outer bar with a skill score of 0.64. The inner bar was predicted less well, but still had a positive skill score of 0.20. The skill score for the modeling of the swash zone was negative. The progression of skill scores can be explained as follows. It is easier to model the wave heights over the outer bar than the inner bar, as the waves at the outer bar are closer to the boundary conditions. It is therefore reasonable to expect that the changes in the cross-shore profile will be modeled more accurately over the outer bar than the inner. The model does not include swash-zone processes in its formulation, so its inability to model the swash zone is to be expected.

Moreover, the measured mean squared change in bed levels was very low (0.005 m²) in the swash zone. This value was similar to the modeled mean squared change (0.008 m²) and to the estimated mean squared measurement error. Therefore, the model was correct in modeling little change, but incorrect in predicting the details of the change. The swash zone provides an example of a case where the BSS is very sensitive to small changes in measurements and predictions, as the denominator is so small.

Fig. 2. Brier Skill Scores for outer, inner and swash zone bars for COSMOS modeling at Egmond. B = initial profile, X = measured final profile, and Y = predicted final profile.
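In its basic form, the BSS used here is 1 − MSE(Y,X)/MSE(B,X), with the initial profile B as the baseline prediction, and the section scores in Fig. 2 follow from evaluating it over slices of the profile. A minimal sketch (Python; the profile values and section indices are invented for illustration, not the Egmond data):

```python
import numpy as np

def bss(b, x, y):
    """Brier Skill Score relative to baseline B: 1 - MSE(Y,X)/MSE(B,X)."""
    return 1.0 - np.mean((y - x) ** 2) / np.mean((b - x) ** 2)

# Hypothetical bed levels (m) along a short transect.
b = np.array([1.0, 1.2, 1.5, 1.3, 1.1])    # initial (baseline) profile
x = np.array([1.1, 1.4, 1.4, 1.2, 1.0])    # measured final profile
y = np.array([1.1, 1.3, 1.4, 1.25, 1.05])  # predicted final profile

# Score each section separately, as in Fig. 2 (section indices invented).
outer, inner = slice(0, 3), slice(3, 5)
bss_outer = bss(b[outer], x[outer], y[outer])
bss_inner = bss(b[inner], x[inner], y[inner])
```

A prediction identical to the measurements scores 1, predicting no change (Y = B) scores 0, and doing worse than the baseline gives a negative score, which is how the outer-bar, inner-bar and swash-zone results should be read.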
6. Application of BSS to a coastal area model

The BSS was applied to morphodynamic modeling at the mouth of the estuary at Teignmouth, situated in the southwestern United Kingdom (Sutherland et al., 2001; Sutherland et al., 2004; Sutherland and Soulsby, 2003). The modeling and the measurements used to assess the modeling were part of the COAST3D project (Soulsby, 2001; Whitehouse and Sutherland, 2001). A 14-day period that included relatively stormy weather was chosen for comparing predicted and observed morphological change. Two surveys from the COAST3D main experiment provided the initial and final bathymetries.

Teignmouth is located on the south coast of England, on the western side of Lyme Bay. The site comprises a tidal inlet adjacent to a beach. It was chosen as the UK COAST3D site because the bathymetry was irregular and three-dimensional and the site was known to be morphologically active (Whitehouse and Sutherland, 2001). The river Teign empties into the sea through the 6-km-long shallow estuary. The estuary channel is sinuous in the lower kilometer as it passes to the north of the Salty (a large inter-tidal area). A rocky headland (the Ness) is to the south of the estuary mouth, while a sand and shingle spit (Denn Point) is located on the north side. The beach to the north is about 2 km long, is groyned and backed by a seawall. The exchange of water through the narrow estuary mouth causes high currents that maintain a deep scoured channel. A bathymetric chart of the area used for measurements in the COAST3D experiments is shown in Fig. 3. This also includes some instrument locations for the main experiment.

The mean tidal range is 1.7 m on neaps and 4.2 m on springs. The tide causes a large part of the estuary to dry and flood each tidal cycle. Outside the estuary, there are a number of morphologically active sandbanks, shown in Fig. 3: Pole Sand, Spratt Sand and East Pole Sand (marked by a box). The beach and nearshore bank system comprises mixed sediments, and the sediment offshore is fine sand. The beach and sandbank systems at Teignmouth have been studied since the mid-nineteenth century, when Spratt (1856) documented cyclic movements of the bars. Further details of the site can be found in Whitehouse and Sutherland (2001).

The Teignmouth site was simulated using the coastal area modeling system PISCES (Chesher et al., 1993). The PISCES modules used in this study were the finite element flow module, TELEMAC-2D (Hervouet et al., 1994), the finite difference wave module, FDWAVE (Southgate and Goldberg, 1989), and the sand transport model, SANDFLOW. A local model was nested directly within a regional model. The regional model covered the whole of the English Channel and was driven by boundary conditions from the Tidal Flow Forum (Werner and Lynch, 1987). The local model included the whole estuary and extended approximately 7 km north and 5 km south of the estuary mouth, and between 3.5 and 4.5 km offshore.

Fig. 4. Changes in bathymetry of East Pole Sand from measurements and three model runs. Reproduced from Sutherland and Soulsby (2003) by permission of World Scientific Publishing.

The local model was run for the period between 07:00 on 8 November 1999 and 07:00 on 22 November 1999. Wave conditions were taken from measurements made at a buoy about 1.5 km
offshore (item 7 in Fig. 3). A uniform grain size was used throughout the model area. The sediment transport rate in the deep channel at the estuary mouth was set to zero to prevent unrealistically high sediment transport rates in this region, as the bed was known not to be of sand. The default bed roughness of 0.10 m was also applied throughout, and the bed was only updated at the end of the model run. Further details of the modeling can be found in Sutherland et al. (2001, 2004) and Sutherland and Soulsby (2003).

Here we concentrate on the bathymetric modeling of East Pole Sand, between 294 200 and 294 600 m East and between 72 200 and 72 500 m North (marked by a box in Fig. 3). Sand sizes of 0.25, 0.35 and 0.50 mm were modeled in turn to test the model's sensitivity to grain size. The measured and modeled changes in bathymetry are shown in Fig. 4 (from Sutherland and Soulsby, 2003). PISCES correctly predicted that there was erosion on the top and east of East Pole Sand and that there was some deposition on the western side. The Brier Skill Scores were calculated and are shown in Table 9. The Brier Skill Scores are all positive and increase with grain size.

One might expect that increasing the grain size further might lead to improvements in the model skill. However, Table 9 also includes the results of a Murphy–Epstein (1989) decomposition of the BSS (Eqs. (15) and (16)). This shows that the phase, mean error and normalization terms (α, γ and ε, respectively) remain essentially constant for all three grain sizes, but that the amplitude term, β, reduces from 0.20 to 0.01 as the grain size increases from 0.25 to 0.50 mm. The results show that increasing the grain size in the model reduces the transport rate, but does not alter the distribution of the sediment. Therefore, increasing the grain size further in the model will not produce significantly better results, as the amplitude term is already close to zero. Further improvements in model performance can thus only come from improving the modeling approach, possibly by including bed updating through the model run or by improving the wave–current interaction to reproduce the hydrodynamics around East Pole Sand during the storm.

Table 9
Decomposition of Brier Skill Score from coastal area modeling
d50 [mm]   BSS    Term α   Term β   Term γ   Term ε
0.25       0.15   0.38     0.20     0.04     0.01
0.35       0.29   0.38     0.07     0.03     0.01
0.50       0.34   0.38     0.01     0.03     0.01

7. Effect of measurement errors on BSS

Eq. (11) for skill shows that the denominator depends on the accuracy of a perfect prediction as well as the accuracy of the baseline prediction. Generally, the error of a perfect prediction is taken to be zero, giving Eqs. (12)–(14). However, this ignores the presence of measurement errors. All bathymetric surveys contain some errors in the measurements. Moreover, the results from different surveys will generally have to be interpolated onto the same grid for comparisons to take place, and the interpolation procedure will introduce further inaccuracies into the calculations of bias, accuracy and skill. To minimize the latter, the survey and data reduction processes used in the initial and final surveys should be compatible, as accurate as possible, and include an estimate of their errors.

Two methods have been proposed to account for measurement errors. Although measurement errors are difficult to estimate, they will occur. Consequently, even in an inactive area, where no bed level changes take place in reality, there will be changes between the first and second surveys (B and X) at a point.

The first adjusted BSS is a refinement of that proposed in Sutherland et al. (2001). It comes from using mean square errors in all parts of Eq. (11). We assume that the initial measured bathymetry, B, is made up of the actual bathymetry, B_A, plus a measurement error, δ_B, so B = B_A + δ_B. Similarly, let X = X_A + δ_X and the output from perfect modeling, W = B_A + δ_B + δW, with δW the modeled change in bed elevation due to perfect modeling. We assume that for perfect modeling δW = X_A − B_A, so that W = X_A + δ_B. It follows that MSE(W,X) = ⟨δ_X²⟩ + ⟨δ_B²⟩. We assume that surveys B and X are made the same way, so that ⟨δ_X²⟩ = ⟨δ_B²⟩ = ⟨δ²⟩. Substituting into Eq. (11) gives:

BSS_P = [MSE(B,X) − MSE(Y,X)] / [MSE(B,X) − 2⟨δ²⟩]    (21)

The assumptions above imply that MSE(B,X) = ⟨(B_A − X_A)²⟩ + 2⟨δ²⟩ ≥ 2⟨δ²⟩, so the denominator of Eq.
(21) will be positive (or zero). Eq. (21), therefore, gives a higher magnitude of skill score than the standard form of the BSS (Eq. (12)). This means that positive skill scores will increase, but negative ones will decrease.

Van Rijn et al. (2003) used a different approach. Their alternative method of calculating the BSS, adjusting it for the measurement error, is given by:

BSS_vR = 1 − ⟨(|Y − X| − δ)²⟩ / ⟨(B − X)²⟩    (22)

with |Y − X| − δ set to zero if |Y − X| < δ. In other words, the minimum difference between the computed value and the measured parameter plus or minus the measurement error is considered. When the computed value is within the error band of the measurements, the error (in the numerator on the right hand side of Eq. (22)) is set to zero. In this method, areas where the difference between modeled and measured final bathymetry is within the measurement error are treated as having no error. The measurement error always acts to reduce the numerator in Eq. (22), so maximizing the calculated BSS value.

Consider the measurements and modeling made at Teignmouth, described in the previous section. The majority of the observed bed levels were made from the survey vessel using DGPS and an echo sounder. The estimated error (for depths less than 10 m) was 0.1 m for wave heights less than 0.2 m, the sea state for the surveys (Whitehouse and Sutherland, 2001; Van Rijn et al., 2003). An alternative estimate of the noise in the measurements was made by calculating the root-mean-square difference between initial and final bed elevation measurements in a region that showed no noticeable changes between the initial and final surveys. The chosen region was between 294 700 and 295 000 m East and between 72 400 and 72 800 m North (Fig. 3). This gave a root-mean-square difference of 0.037 m. It was assumed that no sand transport took place in this region, so the difference was due to independent measurement errors in both surveys. It follows that ⟨δ²⟩^0.5 = δ_rms = 0.037/√2 = 0.026 m. The root-mean-square difference is about a quarter of the quoted estimate of error of 0.1 m in the data report (Whitehouse and Sutherland, 2001). The situation is also complicated by the use of a surface interpolation package to produce sets of measured and predicted surface elevations on exactly the same grid. This was necessary, as the BSS requires all three surface elevations to be at the same point.

Using 0.026 m as the RMSE in Eqs. (21) and (22) gives the values for the adjusted Brier Skill Scores, BSS_P and BSS_vR, shown at the top of Table 10 for East Pole Sand, alongside the normal BSS and the root mean square difference between measurement and prediction, RMSE(X,Y). Taking the error into account improves the BSS_P values by about 0.01 each. The Van Rijn method, BSS_vR, gives a larger improvement, doubling the BSS. When a RMSE of 0.1 m is used, the figures in the lower half of Table 10 are returned. Here the root mean square measurement error is a much larger fraction of the root mean square modeling error (the difference between final measurements and predictions) and the increases in the skill scores are therefore much larger. The BSS adjusted for the error in perfect modeling, BSS_P, gives increases in score of between 0.16 and 0.40. The Van Rijn method gives increases in score ranging from 0.37 to 0.44.

The Van Rijn et al. (2003) method (Eq. (22)) generally gives the larger improvement in skill score, for the same error, as any prediction lying within the error limit (set by the root mean square measurement error) is allocated an error of zero. Eq. (21) takes into account errors in perfect modeling and is compatible with the definition of skill (Eq. (11)), and so is considered the best method of including measurement error in the BSS.

8. Qualification of skill scores

Skill scores provide an objective method of assessing the performance of morphological models. Any BSS higher than zero means that the modeling

Table 10
Brier Skill Score and adjusted Brier Skill Scores to account for measurement error
d50 [mm]  δ_rms [m]  BSS    BSS_P   BSS_vR   RMSE(X,Y) [m]
0.25      0.026      0.15   0.15    0.30     0.178
0.35      0.026      0.29   0.30    0.42     0.163
0.50      0.026      0.34   0.35    0.47     0.157
0.25      0.100      0.15   0.31    0.59     0.178
0.35      0.100      0.29   0.61    0.68     0.163
0.50      0.100      0.34   0.74    0.71     0.157
water where there is less sediment transport. Therefore, measures of accuracy have to be judged in the context of the area modeled and are not necessarily easy to compare between applications.

The skill scores, however, still give the same score when the morphologically inactive offshore area is introduced, and are therefore more 'honest' than measures of accuracy. The variation between the skill scores reflects the variations in the mean absolute error, mean square error and root mean square error. The BSS will tend to give higher values for positive skill scores, but lower values for negative skill scores, than the Mean Absolute Skill Score. This could be considered an advantage of the BSS over the Mean Absolute Skill Score, even though the latter uses the Mean Absolute Error, which is probably the best representation of the accuracy. The ability to decompose the BSS into phase, amplitude and bias terms is definitely an advantage over the other skill scores. This allows the relative importance of phase, amplitude and bias to be determined and may indicate the areas where a model most needs to be improved. Measurement error can be taken into account by considering it to be the mean square difference between observations and predictions in perfect modeling. The BSS, therefore, is considered the best measure of performance for morphological models, based on the criteria set out above.

In addition, when evaluating model performance, it is good practice to answer the following questions. Was the model tuned against the data set used for validation? Was the evaluation performed as a blind test (i.e., one in which the modeler did not know the correct answer in advance)? What was the baseline prediction used in the calculation of any skill scores?

List of Symbols
A      advance (category of shoreline movement)
a      constant
B      set of baseline predictions of final bathymetry
b      constant
BSS    Brier Skill Score
C      correct rejection (i.e., event predicted not to occur that did not occur)
D      constant in Dean equilibrium beach profile
δY     modeled change in bed elevation
E      equilibrium (category of shoreline movement)
F      false alarm (i.e., event predicted that did not occur)
FAR    False Alarm Rate
G      fraction of correct predictions expected from a random choice
H      hit (i.e., event occurred as predicted)
HSS    Heidke Skill Score
J      total number of predictions or observations
k      wave number
LEPS   Linear Error in Probability Space
LS     LEPS score
M      miss (i.e., event occurred but was not predicted)
MAE    mean absolute error
MASS   mean absolute skill score
MSE    mean square error
N      number of categories
P      cumulative probability
PC     percentage correct
POD    probability of detection
R      retreat (category of shoreline movement)
RMSE   root mean square error
RMSSS  root mean square skill score
r_XY   correlation coefficient between sets X and Y
SS     skill score
s_XY   covariance between sets X and Y
TS     Threat Score
W      set of predictions from perfect modeling
X      set of measurements of final bathymetry
X_M    median value of the set of measurements
X′     measured anomaly (X − B)
x      horizontal distance
Y      set of predictions of final bathymetry
Y_M    median value of the set of predictions
Y′     predicted anomaly (Y − B)
z      vertical elevation
α      phase error in Murphy–Epstein decomposition
β      amplitude error in Murphy–Epstein decomposition
γ      mean error in Murphy–Epstein decomposition
δ      measurement error
ε      normalization term in Murphy–Epstein decomposition
σ      standard deviation
⟨X⟩    average of set X
Wilks, D.S., 1995. Statistical Methods in the Atmospheric Sciences. Academic Press, San Diego. 467 pp.
Willmott, C.J., 1981. On the validation of models. Physical Geography 2 (2), 184–194.
Willmott, C.J., Ackleson, S.G., Davis, R.E., Feddema, J.J., Klink, J.J., Legates, D.R., Rowe, C.M., 1985. Statistics for the evaluation and comparison of models. Journal of Geophysical Research 90 (C5), 8995–9005.
Wright, L.D., Short, A.D., 1983. Morphodynamics of beaches and surf zones in Australia. In: Komar, P.D. (Ed.), Handbook of Coastal Processes and Erosion. CRC Press, Boca Raton, FL, USA, pp. 35–64.
Zheng, J., Dean, R.G., 1997. Numerical models and intercomparisons of beach profile evolution. Coastal Engineering 30 (3–4), 169–201.