Short Communication
A refined index of model performance
Cort J. Willmott,a* Scott M. Robesonb and Kenji Matsuuraa
a Center for Climatic Research, Department of Geography, University of Delaware, Newark, DE 19716, USA
b Department of Geography, Indiana University, Bloomington, IN 47405, USA
ABSTRACT: In this paper, we develop, present and evaluate a refined, statistical index of model performance. This new measure (dr) is a reformulation of Willmott's index of agreement, which was developed in the 1980s. It (dr) is dimensionless, bounded by -1.0 and 1.0 and, in general, more rationally related to model accuracy than are other existing indices. It also is quite flexible, making it applicable to a wide range of model-performance problems. The two main published versions of Willmott's index, as well as four other comparable dimensionless indices proposed by Nash and Sutcliffe in 1970, Watterson in 1996, Legates and McCabe in 1999 and Mielke and Berry in 2001, are compared with the new index. Of the six, Legates and McCabe's measure is most similar to dr. Repeated calculations of all six indices, from intensive random resamplings of predicted and observed spaces, are used to show the covariation and differences between the various indices, as well as their relative efficacies. Copyright © 2011 Royal Meteorological Society
KEY WORDS
1. Introduction
Numerical models of climatic, hydrologic, and environmental systems have grown in number, variety and
sophistication over the last few decades. There has been
a concomitant and deepening interest in comparing and
evaluating the models, particularly to determine which
models are more accurate (e.g. Krause et al., 2005). Our
interest lies in this arena; that is, in statistical approaches
that can be used to compare model-produced estimates
with reliable values, usually observations.
Our main purpose in this paper is to present and evaluate a refined version of Willmott's dimensionless index of
agreement (Willmott and Wicks, 1980; Willmott, 1981,
1982, 1984; Willmott et al., 1985). The refined index, we
believe, is a nontrivial improvement over earlier versions
of the index and is quite flexible, making it applicable to an extremely wide range of model-performance
applications. Our discussion contains a brief history, a
description and assessment of its form and properties,
and comparisons with a set of other dimensionless measures of average model accuracy to illustrate its relative
effectiveness.
2. Background
d = 1 - \frac{\sum_{i=1}^{n} (P_i - O_i)^2}{\sum_{i=1}^{n} (|P_i'| + |O_i'|)^2}    (2a)

where P_i' = P_i - \bar{O} and O_i' = O_i - \bar{O}, P_i are the model-predicted values, O_i the paired observed values, \bar{O} the observed mean and n the sample size; equivalently,

d = 1 - \frac{\sum_{i=1}^{n} (P_i - O_i)^2}{\sum_{i=1}^{n} (|P_i - \bar{O}| + |O_i - \bar{O}|)^2}    (2b)

The absolute-value version of the index is

d_1 = 1 - \frac{\sum_{i=1}^{n} |P_i - O_i|}{\sum_{i=1}^{n} (|P_i - \bar{O}| + |O_i - \bar{O}|)}    (3)

When the P_i are reasonably close to the O_i, \sum_{i=1}^{n} |P_i - \bar{O}| \approx \sum_{i=1}^{n} |O_i - \bar{O}|, so that

d_1 \approx 1 - \frac{\sum_{i=1}^{n} |P_i - O_i|}{2 \sum_{i=1}^{n} |O_i - \bar{O}|}    (4)

The refined index generalises this scaling with a constant c and bounds the measure below, giving

d_r = \begin{cases}
1 - \dfrac{\sum_{i=1}^{n} |P_i - O_i|}{c \sum_{i=1}^{n} |O_i - \bar{O}|}, & \text{when } \sum_{i=1}^{n} |P_i - O_i| \le c \sum_{i=1}^{n} |O_i - \bar{O}| \\[2ex]
\dfrac{c \sum_{i=1}^{n} |O_i - \bar{O}|}{\sum_{i=1}^{n} |P_i - O_i|} - 1, & \text{when } \sum_{i=1}^{n} |P_i - O_i| > c \sum_{i=1}^{n} |O_i - \bar{O}|
\end{cases}    (5)

with c = 2.
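As a concrete illustration of the piecewise definition in Equation (5), here is a minimal Python sketch; the function and variable names are my own, not the authors' code:

```python
def refined_index(predicted, observed, c=2.0):
    """Refined index of agreement d_r, Equation (5).

    Returns 1 for perfect agreement; the lower limit is -1.
    """
    n = len(observed)
    obs_mean = sum(observed) / n
    # numerator: total absolute error of the predictions
    error_sum = sum(abs(p - o) for p, o in zip(predicted, observed))
    # denominator: scaled total absolute deviation of the observations
    scaled_dev = c * sum(abs(o - obs_mean) for o in observed)
    if scaled_dev == 0:
        # all observations identical: the index is undefined
        raise ValueError("d_r is undefined for a constant observed series")
    if error_sum <= scaled_dev:
        return 1.0 - error_sum / scaled_dev
    return scaled_dev / error_sum - 1.0
```

For a perfect prediction the error sum is zero and d_r = 1; as the total error grows, d_r falls through zero and approaches, but never reaches, -1.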
[Figures 1-3: scatterplots of d and d1, of M and ℜ, and of E and E1, respectively, each plotted against dr.]

M = \frac{2}{\pi} \sin^{-1}\!\left[ 1 - \frac{\mathrm{MSE}}{s_P^2 + s_O^2 + (\bar{P} - \bar{O})^2} \right]    (6)

where s_P^2 and s_O^2 are the variances of the predicted and observed values and \bar{P} is the predicted mean, and

\Re = 1 - \frac{\mathrm{MAE}}{n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} |P_j - O_i|}    (7)
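Equations (6) and (7) can be sketched in Python as follows; the function names are mine, and I assume population (divide-by-n) variances and the usual MSE and MAE definitions:

```python
import math

def watterson_m(predicted, observed):
    """Watterson's M, Equation (6): a nondimensional MSE-based skill score."""
    n = len(observed)
    p_mean = sum(predicted) / n
    o_mean = sum(observed) / n
    mse = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n
    var_p = sum((p - p_mean) ** 2 for p in predicted) / n
    var_o = sum((o - o_mean) ** 2 for o in observed) / n
    # the arcsine argument always lies in [-1, 1], since
    # var_p + var_o + bias^2 - mse equals twice the covariance
    arg = 1.0 - mse / (var_p + var_o + (p_mean - o_mean) ** 2)
    return (2.0 / math.pi) * math.asin(arg)

def mielke_berry_r(predicted, observed):
    """Mielke and Berry's ℜ, Equation (7): MAE scaled by all cross-pair errors."""
    n = len(observed)
    mae = sum(abs(p - o) for p, o in zip(predicted, observed)) / n
    cross = sum(abs(pj - oi) for oi in observed for pj in predicted) / n ** 2
    return 1.0 - mae / cross
```

Both measures equal 1 for a perfect prediction, and Watterson's M reaches -1 for a perfectly anti-correlated, unbiased prediction.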
Figure 4. Demonstration of index values for the case of overprediction. Using uniform distributions, 500 values of O and P were generated (with O centered on 10 and P centered on 20). Fifty subsamples of size n = 10 are drawn, and pair-wise values of dr and the other indices are calculated. Panels show: (a) 500 values of O and P, (b) 50 pair-wise values of dr and d (triangles) and of dr and d1 (black dots), (c) 50 pair-wise values of dr and M (triangles) and of dr and ℜ (black dots), and (d) 50 pair-wise values of dr and E (triangles) and of dr and E1 (black dots). In all cases, the 1:1 line is plotted for reference.
E = 1 - \frac{\sum_{i=1}^{n} (P_i - O_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2}    (8)

E_1 = 1 - \frac{\sum_{i=1}^{n} |P_i - O_i|}{\sum_{i=1}^{n} |O_i - \bar{O}|}    (9)
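Equations (8) and (9) are straightforward to compute; a minimal sketch (function names are my own):

```python
def nash_sutcliffe_e(predicted, observed):
    """Nash-Sutcliffe efficiency E, Equation (8)."""
    o_mean = sum(observed) / len(observed)
    num = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    den = sum((o - o_mean) ** 2 for o in observed)
    return 1.0 - num / den

def legates_mccabe_e1(predicted, observed):
    """Legates and McCabe's E1, Equation (9)."""
    o_mean = sum(observed) / len(observed)
    num = sum(abs(p - o) for p, o in zip(predicted, observed))
    den = sum(abs(o - o_mean) for o in observed)
    return 1.0 - num / den
```

Both indices equal 1 for a perfect prediction and 0 when the model performs no better than the observed mean used as a constant predictor.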
differences within E precludes a monotonic relationship between the increases and decreases in dr and in E. Legates and McCabe's measure, on the other hand, is monotonically and functionally related to our new index; and, when positive, E1 is equivalent to dr with c = 1. As mentioned above, we think that c = 2 is a better scaling, because it balances the number of deviations evaluated within the numerator and within the denominator of the fractional part. It (E1) is an underestimate of dr, as is evident in the functional relationship(s) between dr and E1. Over the positive portion of dr's domain, dr = 0.5(E1 + 1) while, when dr is negative, dr = -[2(E1 - 1)^{-1} + 1]. The second expression also shows dr's linearisation of E1's exponential decline from 0 to -∞.
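These two dr-E1 relationships can be verified numerically; a small self-contained sketch with made-up data (the helper functions are mine, not the authors' code):

```python
def d_r(pred, obs, c=2.0):
    # refined index of agreement, Equation (5)
    m = sum(obs) / len(obs)
    err = sum(abs(p - o) for p, o in zip(pred, obs))
    dev = c * sum(abs(o - m) for o in obs)
    return 1.0 - err / dev if err <= dev else dev / err - 1.0

def e1(pred, obs):
    # Legates and McCabe's index, Equation (9)
    m = sum(obs) / len(obs)
    return 1.0 - sum(abs(p - o) for p, o in zip(pred, obs)) / \
        sum(abs(o - m) for o in obs)

obs = [1.0, 2.0, 3.0, 4.0]
good = [1.5, 2.0, 3.0, 4.0]     # small errors: d_r is positive
bad = [11.0, 12.0, 13.0, 14.0]  # large errors: d_r is negative

# positive branch: d_r = 0.5 * (E1 + 1)
assert abs(d_r(good, obs) - 0.5 * (e1(good, obs) + 1.0)) < 1e-12
# negative branch: d_r = -(2 * (E1 - 1)**-1 + 1)
assert abs(d_r(bad, obs) + (2.0 / (e1(bad, obs) - 1.0) + 1.0)) < 1e-12
```

With these numbers, d_r(good, obs) = 0.9375 against E1 = 0.875, while d_r(bad, obs) = -0.8 against E1 = -9, illustrating how dr compresses E1's unbounded negative range into (-1, 0).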
A nontrivial difference between dr and E1, as well as between dr and E, is the indices' behaviour over the negative portions of their domains. The magnitudes of both E1 and E increase exponentially in the negative direction (Figure 3), which can make comparisons among some model estimates difficult. When the deviations around Ō are quite small or perhaps trivial, for instance, even small differences among competing sets of model estimates can produce substantially different
values of E1 or of E. In comparing models that estimate daily or monthly precipitation in an arid location, for example, relatively small differences between the sets of model estimates could produce vastly different values of E1 or of E. Values of dr, on the other hand, would be more usefully comparable to one another.

Int. J. Climatol. 32: 2088-2094 (2012)
It is clear that Legates and McCabe (and Nash and Sutcliffe before them) appreciated the importance of scaling with variation within the observed variable only.
Legates and McCabe further understood the importance
of evaluating error- and deviation-magnitudes, rather than
their squares. Their measure (E1 ), in turn, has a structure
similar to that of dr but with a substantially different
scaling and lower limit, as discussed above.
To show the behaviour of the indices for a specific case
of predicted versus observed data, we selected a typical pattern: overprediction (Figure 4). For this case, we
show the scatterplot that we sample from (Figure 4(a)),
as well as scatterplots of the other six measures versus dr for 50 random samples of the pair-wise values
of P and O. On each of the three scatterplots, dr is
the x-axis variable and two of the other six indices are
plotted along the y-axis (i.e. Figure 4(b)-(d) have the same setup as Figures 1-3). It is clear that both d and d1
are much less responsive than dr to the various configurations of overprediction that can occur (Figure 4(b)).
For this particular case, where the magnitude of MAE
is consistently larger than the magnitude of the observed
variability, dr produces negative values while the values
of d can range from 0.2 to over 0.5 (d1 is more conservative than d but also is less responsive than dr to
the types of O versus P samples that are produced).
For the 50 samples from our overprediction distribution, both M and ℜ produce almost no variation. ℜ, in particular, is very close to zero for almost all of the varied samples within the overprediction example. Similar to
Figure 3, Figure 4(d) demonstrates how small differences
among the various observed and predicted samples can
produce substantially different values of E1 or of E that
are difficult to interpret. It is useful to note that swapping O and P in this example (i.e. producing a case
where the model systematically underpredicts) produces
virtually no change in any of the indices. In cases where
O and P have different magnitudes of variability, this
symmetry of overprediction and underprediction does not
occur.
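The symmetry of overprediction and underprediction noted above can be checked numerically for dr; a small sketch with made-up numbers (the helper function is my own):

```python
def d_r(pred, obs, c=2.0):
    # refined index of agreement, Equation (5)
    m = sum(obs) / len(obs)
    err = sum(abs(p - o) for p, o in zip(pred, obs))
    dev = c * sum(abs(o - m) for o in obs)
    return 1.0 - err / dev if err <= dev else dev / err - 1.0

obs = [8, 9, 10, 11, 12]        # "observed" values centered on 10
pred = [o + 10 for o in obs]    # systematic overprediction, same spread

# swapping the roles of O and P leaves d_r unchanged here,
# because the two series have identical variability
assert d_r(pred, obs) == d_r(obs, pred)
```

If the spreads of O and P differed, the denominator (which depends only on the series treated as observed) would change under the swap, and the two index values would no longer agree.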
7. Concluding remarks