Short Communication
A refined index of model performance
Cort J. Willmott,a* Scott M. Robesonb and Kenji Matsuuraa
a Center for Climatic Research, Department of Geography, University of Delaware, Newark, DE 19716, USA
b Department of Geography, Indiana University, Bloomington, IN 47405, USA
ABSTRACT: In this paper, we develop, present and evaluate a refined, statistical index of model performance. This new measure (dr) is a reformulation of Willmott's index of agreement, which was developed in the 1980s. It (dr) is dimensionless, bounded by -1.0 and 1.0 and, in general, more rationally related to model accuracy than are other existing indices. It also is quite flexible, making it applicable to a wide range of model-performance problems. The two main published versions of Willmott's index, as well as four other comparable dimensionless indices proposed by Nash and Sutcliffe in 1970, Watterson in 1996, Legates and McCabe in 1999 and Mielke and Berry in 2001, are compared with the new index. Of the six, Legates and McCabe's measure is most similar to dr. Repeated calculations of all six indices, from intensive random resamplings of predicted and observed spaces, are used to show the covariation and differences between the various indices, as well as their relative efficacies. Copyright © 2011 Royal Meteorological Society
KEY WORDS
1. Introduction
Numerical models of climatic, hydrologic, and environmental systems have grown in number, variety and
sophistication over the last few decades. There has been
a concomitant and deepening interest in comparing and
evaluating the models, particularly to determine which
models are more accurate (e.g. Krause et al., 2005). Our
interest lies in this arena; that is, in statistical approaches
that can be used to compare model-produced estimates
with reliable values, usually observations.
Our main purpose in this paper is to present and evaluate a refined version of Willmott's dimensionless index of
agreement (Willmott and Wicks, 1980; Willmott, 1981,
1982, 1984; Willmott et al., 1985). The refined index, we
believe, is a nontrivial improvement over earlier versions
of the index and is quite flexible, making it applicable to an extremely wide range of model-performance
applications. Our discussion contains a brief history, a
description and assessment of its form and properties,
and comparisons with a set of other dimensionless measures of average model accuracy to illustrate its relative
effectiveness.
2. Background
d = 1 - \frac{\sum_{i=1}^{n} (P_i - O_i)^2}{\sum_{i=1}^{n} (|P_i'| + |O_i'|)^2}    (2a)

where P_i' = P_i - \bar{O} and O_i' = O_i - \bar{O}, P_i are the model-predicted values, O_i the paired observed values, \bar{O} the observed mean and n the sample size; equivalently,

d = 1 - \frac{\sum_{i=1}^{n} (P_i - O_i)^2}{\sum_{i=1}^{n} (|P_i - \bar{O}| + |O_i - \bar{O}|)^2}    (2b)

The absolute-value version of the index is

d_1 = 1 - \frac{\sum_{i=1}^{n} |P_i - O_i|}{\sum_{i=1}^{n} (|P_i - \bar{O}| + |O_i - \bar{O}|)}    (3)

When the P_i are reasonably close to the O_i, \sum_{i=1}^{n} |P_i - \bar{O}| \approx \sum_{i=1}^{n} |O_i - \bar{O}|, so that

d_1 \approx 1 - \frac{\sum_{i=1}^{n} |P_i - O_i|}{2 \sum_{i=1}^{n} |O_i - \bar{O}|}    (4)

The refined index generalises this scaling with a constant c and bounds the measure below, giving

d_r = \begin{cases}
1 - \dfrac{\sum_{i=1}^{n} |P_i - O_i|}{c \sum_{i=1}^{n} |O_i - \bar{O}|}, & \text{when } \sum_{i=1}^{n} |P_i - O_i| \le c \sum_{i=1}^{n} |O_i - \bar{O}| \\[2ex]
\dfrac{c \sum_{i=1}^{n} |O_i - \bar{O}|}{\sum_{i=1}^{n} |P_i - O_i|} - 1, & \text{when } \sum_{i=1}^{n} |P_i - O_i| > c \sum_{i=1}^{n} |O_i - \bar{O}|
\end{cases}    (5)

with c = 2.
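As a concrete illustration of the piecewise definition in Equation (5), here is a minimal Python sketch; the function and variable names are my own, not the authors' code:

```python
def refined_index(predicted, observed, c=2.0):
    """Refined index of agreement d_r, Equation (5).

    Returns 1 for perfect agreement; the lower limit is -1.
    """
    n = len(observed)
    obs_mean = sum(observed) / n
    # numerator: total absolute error of the predictions
    error_sum = sum(abs(p - o) for p, o in zip(predicted, observed))
    # denominator: scaled total absolute deviation of the observations
    scaled_dev = c * sum(abs(o - obs_mean) for o in observed)
    if scaled_dev == 0:
        # all observations identical: the index is undefined
        raise ValueError("d_r is undefined for a constant observed series")
    if error_sum <= scaled_dev:
        return 1.0 - error_sum / scaled_dev
    return scaled_dev / error_sum - 1.0
```

For a perfect prediction the error sum is zero and d_r = 1; as the total error grows, d_r falls through zero and approaches, but never reaches, -1.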
[Figures 1-3: scatterplots of d and d1, of M and ℜ, and of E and E1, respectively, each plotted against dr.]

M = \frac{2}{\pi} \sin^{-1}\!\left[ 1 - \frac{\mathrm{MSE}}{s_P^2 + s_O^2 + (\bar{P} - \bar{O})^2} \right]    (6)

where s_P^2 and s_O^2 are the variances of the predicted and observed values and \bar{P} is the predicted mean, and

\Re = 1 - \frac{\mathrm{MAE}}{n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} |P_j - O_i|}    (7)
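Equations (6) and (7) can be sketched in Python as follows; the function names are mine, and I assume population (divide-by-n) variances and the usual MSE and MAE definitions:

```python
import math

def watterson_m(predicted, observed):
    """Watterson's M, Equation (6): a nondimensional MSE-based skill score."""
    n = len(observed)
    p_mean = sum(predicted) / n
    o_mean = sum(observed) / n
    mse = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n
    var_p = sum((p - p_mean) ** 2 for p in predicted) / n
    var_o = sum((o - o_mean) ** 2 for o in observed) / n
    # the arcsine argument always lies in [-1, 1], since
    # var_p + var_o + bias^2 - mse equals twice the covariance
    arg = 1.0 - mse / (var_p + var_o + (p_mean - o_mean) ** 2)
    return (2.0 / math.pi) * math.asin(arg)

def mielke_berry_r(predicted, observed):
    """Mielke and Berry's ℜ, Equation (7): MAE scaled by all cross-pair errors."""
    n = len(observed)
    mae = sum(abs(p - o) for p, o in zip(predicted, observed)) / n
    cross = sum(abs(pj - oi) for oi in observed for pj in predicted) / n ** 2
    return 1.0 - mae / cross
```

Both measures equal 1 for a perfect prediction, and Watterson's M reaches -1 for a perfectly anti-correlated, unbiased prediction.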
Figure 4. Demonstration of index values for the case of overprediction. Using uniform distributions, 500 values of O and P were generated (with O centered on 10 and P centered on 20). Fifty subsamples of size n = 10 are drawn, and pair-wise values of dr and the other indices are calculated. Panels show: (a) 500 values of O and P, (b) 50 pair-wise values of dr and d (triangles) and of dr and d1 (black dots), (c) 50 pair-wise values of dr and M (triangles) and of dr and ℜ (black dots), and (d) 50 pair-wise values of dr and E (triangles) and of dr and E1 (black dots). In all cases, the 1:1 line is plotted for reference.
E = 1 - \frac{\sum_{i=1}^{n} (P_i - O_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2}    (8)

E_1 = 1 - \frac{\sum_{i=1}^{n} |P_i - O_i|}{\sum_{i=1}^{n} |O_i - \bar{O}|}    (9)
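Equations (8) and (9) are straightforward to compute; a minimal sketch (function names are my own):

```python
def nash_sutcliffe_e(predicted, observed):
    """Nash-Sutcliffe efficiency E, Equation (8)."""
    o_mean = sum(observed) / len(observed)
    num = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    den = sum((o - o_mean) ** 2 for o in observed)
    return 1.0 - num / den

def legates_mccabe_e1(predicted, observed):
    """Legates and McCabe's E1, Equation (9)."""
    o_mean = sum(observed) / len(observed)
    num = sum(abs(p - o) for p, o in zip(predicted, observed))
    den = sum(abs(o - o_mean) for o in observed)
    return 1.0 - num / den
```

Both indices equal 1 for a perfect prediction and 0 when the model performs no better than the observed mean used as a constant predictor.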
differences within E precludes a monotonic relationship between the increases and decreases in dr and in E. Legates and McCabe's measure, on the other hand, is monotonically and functionally related to our new index; and, when positive, E1 is equivalent to dr with c = 1. As mentioned above, we think that c = 2 is a better scaling, because it balances the number of deviations evaluated within the numerator and within the denominator of the fractional part. It (E1) is an underestimate of dr, as is evident in the functional relationship(s) between dr and E1. Over the positive portion of dr's domain, dr = 0.5(E1 + 1) while, when dr is negative, dr = -[2(E1 - 1)^{-1} + 1]. The second expression also shows dr's linearisation of E1's exponential decline from 0 to -∞.
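These two dr-E1 relationships can be verified numerically; a small self-contained sketch with made-up data (the helper functions are mine, not the authors' code):

```python
def d_r(pred, obs, c=2.0):
    # refined index of agreement, Equation (5)
    m = sum(obs) / len(obs)
    err = sum(abs(p - o) for p, o in zip(pred, obs))
    dev = c * sum(abs(o - m) for o in obs)
    return 1.0 - err / dev if err <= dev else dev / err - 1.0

def e1(pred, obs):
    # Legates and McCabe's index, Equation (9)
    m = sum(obs) / len(obs)
    return 1.0 - sum(abs(p - o) for p, o in zip(pred, obs)) / \
        sum(abs(o - m) for o in obs)

obs = [1.0, 2.0, 3.0, 4.0]
good = [1.5, 2.0, 3.0, 4.0]     # small errors: d_r is positive
bad = [11.0, 12.0, 13.0, 14.0]  # large errors: d_r is negative

# positive branch: d_r = 0.5 * (E1 + 1)
assert abs(d_r(good, obs) - 0.5 * (e1(good, obs) + 1.0)) < 1e-12
# negative branch: d_r = -(2 * (E1 - 1)**-1 + 1)
assert abs(d_r(bad, obs) + (2.0 / (e1(bad, obs) - 1.0) + 1.0)) < 1e-12
```

With these numbers, d_r(good, obs) = 0.9375 against E1 = 0.875, while d_r(bad, obs) = -0.8 against E1 = -9, illustrating how dr compresses E1's unbounded negative range into (-1, 0).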
A nontrivial difference between dr and E1, as well as between dr and E, is the indices' behaviour over the negative portions of their domains. The magnitudes of both E1 and E increase exponentially in the negative direction (Figure 3), which can make comparisons among some model estimates difficult. When the deviations around Ō are quite small or perhaps trivial, for instance, even small differences among competing sets of model estimates can produce substantially different
values of E1 or of E. In comparing models that estimate daily or monthly precipitation in an arid location, for example, relatively small differences between the sets of model estimates could produce vastly different values of E1 or of E. Values of dr, on the other hand, would be more usefully comparable to one another.

Int. J. Climatol. 32: 2088-2094 (2012)
It is clear that Legates and McCabe (and Nash and Sutcliffe before them) appreciated the importance of scaling with variation within the observed variable only.
Legates and McCabe further understood the importance
of evaluating error- and deviation-magnitudes, rather than
their squares. Their measure (E1 ), in turn, has a structure
similar to that of dr but with a substantially different
scaling and lower limit, as discussed above.
To show the behaviour of the indices for a specific case
of predicted versus observed data, we selected a typical pattern: overprediction (Figure 4). For this case, we
show the scatterplot that we sample from (Figure 4(a)),
as well as scatterplots of the other six measures versus dr for 50 random samples of the pair-wise values
of P and O. On each of the three scatterplots, dr is
the x-axis variable and two of the other six indices are
plotted along the y-axis (i.e. Figure 4(b)-(d) have the same setup as Figures 1-3). It is clear that both d and d1
are much less responsive than dr to the various configurations of overprediction that can occur (Figure 4(b)).
For this particular case, where the magnitude of MAE
is consistently larger than the magnitude of the observed
variability, dr produces negative values while the values
of d can range from 0.2 to over 0.5 (d1 is more conservative than d but also is less responsive than dr to
the types of O versus P samples that are produced).
For the 50 samples from our overprediction distribution, both M and ℜ produce almost no variation. ℜ, in particular, is very close to zero for almost all of the varied samples within the overprediction example. Similar to
Figure 3, Figure 4(d) demonstrates how small differences
among the various observed and predicted samples can
produce substantially different values of E1 or of E that
are difficult to interpret. It is useful to note that swapping O and P in this example (i.e. producing a case
where the model systematically underpredicts) produces
virtually no change in any of the indices. In cases where
O and P have different magnitudes of variability, this
symmetry of overprediction and underprediction does not
occur.
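The symmetry of overprediction and underprediction noted above can be checked numerically for dr; a small sketch with made-up numbers (the helper function is my own):

```python
def d_r(pred, obs, c=2.0):
    # refined index of agreement, Equation (5)
    m = sum(obs) / len(obs)
    err = sum(abs(p - o) for p, o in zip(pred, obs))
    dev = c * sum(abs(o - m) for o in obs)
    return 1.0 - err / dev if err <= dev else dev / err - 1.0

obs = [8, 9, 10, 11, 12]        # "observed" values centered on 10
pred = [o + 10 for o in obs]    # systematic overprediction, same spread

# swapping the roles of O and P leaves d_r unchanged here,
# because the two series have identical variability
assert d_r(pred, obs) == d_r(obs, pred)
```

If the spreads of O and P differed, the denominator (which depends only on the series treated as observed) would change under the swap, and the two index values would no longer agree.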
7. Concluding remarks