Modeling Book
Chapter 1
Basic Concepts
the second, Experiment 2, a piece of toast is flipped twice. The first step in
analysis is describing the outcomes and sample space.
There are two possible outcomes for Experiment 1: (1) the piece of toast
falls butter-side down or (2) it falls butter-side up. We denote the former
outcome by D and the latter by U (for “down” and “up,” respectively). For
Experiment 2, the outcomes are denoted by ordered pairs as follows:
• (D, D) : The first and second pieces fall butter-side down.
• (D, U) : The first piece falls butter-side down and the second falls
butter-side up.
• (U, D) : The first piece falls butter-side up and the second falls butter-side down.
• (U, U) : The first and second pieces fall butter-side up.
The sample space for Experiment 1 is {U, D}. The sample space for
Experiment 2 is {(D, D), (U, D), (D, U), (U, U)}.
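These outcome pairs can be enumerated mechanically in R; the sketch below uses expand.grid() (the column names first and second are our own labels):

```r
# Enumerate the sample space of Experiment 2:
# all ordered pairs of butter-side down (D) and butter-side up (U)
outcomes=c("D","U")
S=expand.grid(first=outcomes,second=outcomes)
S          # the four ordered pairs
nrow(S)    # 4
```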
Although outcomes describe the results of experiments, they are not suf-
ficient for most analyses. To see this insufficiency, consider Murphy’s Law
in Experiment 2. We are interested in whether one or more of the flips is
butter-side up. There is no outcome that uniquely represents this event.
Therefore, it is common to consider events: sets of outcomes. There are four events associated with Experiment 1 and sixteen associated with Experiment 2. For Experiment 1, the four events are {D}, {U}, {U, D}, and ∅.
The event {U, D} refers to the case that either the toast falls butter-side
down or it falls butter-side up. Barring the miracle that the toast lands on
its side, it will always land either butter-side up or butter-side down. Even
though this event seems uninformative, it is still a legitimate event and is included. The null set (∅) is the empty set; it has no elements.
The 16 events for Experiment 2 are:
1.1.2 Probability
Event Pr
{D} .7
{U} .3
{U, D} 1
∅ 0
Event Probability
{(D, D)} .49
{(D, U)} .21
{(U, D)} .21
{(U, U)} .09
{(D, D), (D, U)} .70
{(D, D), (U, D)} .70
{(D, D), (U, U)} .58
{(D, U), (U, D)} .42
{(D, U), (U, U)} .30
{(U, D), (U, U)} .30
{(D, U), (U, D), (U, U)} .51
{(D, D), (U, D), (U, U)} .79
{(D, D), (D, U), (U, U)} .79
{(D, D), (D, U), (U, D)} .91
{(D, D), (D, U), (U, D), (U, U)} 1
∅ 0
Outcome Value of X
D 1
U 0
Outcome Value of X
(D, D) 2
(D, U) 1
(U, D) 1
(U, U) 0
For Experiment 2, it is
$$f(x) = \begin{cases} .09 & x = 0 \\ .42 & x = 1 \\ .49 & x = 2 \\ 0 & \text{otherwise.} \end{cases} \quad (1.2)$$
> x=c(0,1,2)
> f=c(.09,.42,.49)
> x
[1] 0 1 2
> f
[1] 0.09 0.42 0.49
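The vectors x and f above are all that is needed to draw the probability mass function shown in Figure 1.2; a minimal sketch using the type='h' plotting style that appears later in the chapter:

```r
# Plot the pmf of X, the number of butter-side down flips in Experiment 2
x=c(0,1,2)
f=c(.09,.42,.49)
plot(x,f,type='h',xlim=c(-1,3),ylim=c(0,.6),
     xlab="Value of X",ylab="Probability Mass Function")
```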
1.1.4 Parameters
Up to now, we have assumed probabilities on some events (those correspond-
ing to outcomes) and used the laws of probability to assign probabilities to
the other events. In experiments, though, we do not assume probabilities; in-
stead, we estimate them from data. We introduce the concept of a parameter
Figure 1.2: Probability mass functions for random variable X, the number
of butter-side down flips, in Experiment 2.
The use of the semicolon in f (x; p) indicates that the function is of one
variable, x, for a given value of the parameter p.
Let’s consider the probabilities in Experiment 2 to be parameters defined
as p1 = P r(D, D), p2 = P r(D, U), p3 = P r(U, D). By the laws of probability,
1. There are two throws left in the game; list all of the possible
outcomes for the last two throws. Hint: these outcomes may be
expressed as ordered pairs.
3. In this game there are only two possible outcomes: winning and
losing. Given the information in Problem 2, what is the proba-
bility mass function for the random variable that maps the event
that you win to 0 and the event that you lose to 1? Plot this
function in R.
distribution is not violated. If a player gets tired and does worse over time,
regardless of the outcome of his or her throws, then identical distribution is
violated. It is possible to violate one and not the other.
Figure 1.3: Probability mass function for a binomial random variable with N = 20 and p = .7.
1.3.2 Variance
Whereas the expected value measures the center of a random variable, vari-
ance measures its spread.
$$V(X) = \sum_x [x - E(X)]^2 f(x; p) \quad (1.7)$$
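Equation 1.7 can be evaluated directly for the pmf of Experiment 2; a short sketch:

```r
# Variance of X, the number of butter-side down flips in Experiment 2
x=c(0,1,2)
f=c(.09,.42,.49)
EX=sum(x*f)            # expected value, 1.4
VX=sum((x-EX)^2*f)     # variance
VX                     # 0.42
```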
The first rule can be used to find the expected value of a binomial random variable. By definition, binomial RV Y is defined as $Y = \sum_{i=1}^{N} X_i$, where the
The second rule can be used to find the expected value of p̂. The random
variable p̂ = g(Y ) is g(Y ) = Y /N. The expected value of p̂ is given by:
$$\begin{aligned}
E(\hat p) &= E(g(Y)) \\
&= \sum_x (x/N) f(x; p) \\
&= (1/N) \sum_x x f(x; p) \\
&= (1/N)E(Y) \\
&= (1/N)(Np) \\
&= p.
\end{aligned}$$
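The result E(p̂) = p can be checked by simulation; a quick sketch (the seed is arbitrary):

```r
# Simulate many binomial realizations and check that p-hat is centered on p
set.seed(1)
N=20
p=.7
y=rbinom(100000,N,p)
p.hat=y/N
mean(p.hat)   # close to .7
```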
Each Yi is a different random variable, but all Yi are independent and distributed as identical binomials. Each i could represent a different trial, a different person, or a different experimental condition.
Of course, we are not limited to 5 replicates; for example y=rbinom(200,
20, .7) produces 200 replicates and stores them in vector y. To see a his-
togram, type hist(y, breaks=seq(-.5, 20.5, 1), freq=T). We prefer a
different type of histogram for looking at realizations of discrete random
variables—one in which the y-axis is not the raw counts but the propor-
tion, or relative frequency, of counts. These histograms are called relative
frequency histograms.
Definition 20 (Relative Frequency Histogram) Let $Y_i \overset{iid}{\sim} Y$, $i = 1, \ldots, M$, be a sequence of M independent and identically distributed discrete random variables and let $y_1, \ldots, y_M$ be a sequence of corresponding realizations. Let $h_M(j)$ be the proportion of realizations with value j. The relative frequency histogram is a plot of $h_M(j)$ against j.
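One way to draw such a histogram, sketched here in base R (the tabulation via factor() is our choice, used so that outcomes with zero counts are retained):

```r
# Relative frequency histogram for 200 binomial realizations
y=rbinom(200,20,.7)
h=table(factor(y,levels=0:20))/length(y)   # proportion at each value j
plot(0:20,as.numeric(h),type='h',
     xlab="Outcome",ylab="Relative Frequency")
```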
The code draws the histogram as a series of lines. The relative histogram
plot looks like a probability mass function. Figure 1.4A shows that this is
no coincidence. The lines are the relative frequency histogram; the points
are the probability mass function for a binomial with N = 20 and p = .7
(The points were produced with the points() function. The specific form is
points(0:21,dbinom(0:21,20,.7),pch=21)).
[Figure 1.4, panels A and B: relative frequency (0.00–0.20) plotted against outcome (0–20).]
The match between the relative frequency histogram and the pmf is not exact. The problem is that there are only 200 realizations. Figure 1.4B shows the match between the probability mass function and the relative frequency histogram when there are 10,000 realizations. Here, the match is nearly perfect. This match indicates that as the number of realizations grows, the relative frequency histogram converges to the probability mass function. The convergence is a consequence of the Law of Large Numbers, which says, informally, that the proportion of realizations attaining a particular value will converge to the true probability of that value. More formally,
y=rbinom(10000,20,.7)
p.hat=y/20 #10,000 iid replicates of p-hat
freq=table(p.hat)
plot(freq/10000,type='h')
The resulting plot is shown in Figure 1.5. The plot shows the approximate
probability mass function for the p̂ estimator. The distribution of an estima-
tor is so often of interest that it has a special name: a sampling distribution.
1.5 Estimators
Estimators are random variables that are used to estimate parameters from
data. We have seen one estimator, the common-sense estimator of p in a
binomial: p̂ = Y /N. Two others are the sample mean and sample vari-
ance defined below, which are used as estimators for the expected value and
variance of an RV, respectively.
How good are these estimators? To answer this question, we first discuss
properties of estimators.
Scale A Scale B
180 174
160 170
175 173
165 171
Mean 170 172
Bias 0 2.0
RMSE 7.91 2.55
$$B_N = E(\hat\theta_N) - \theta$$
More efficient estimators have less error, on average, than less efficient estimators. Sample mean and sample variance are the most efficient unbiased estimators of expected value and variance, respectively. One of the main issues in estimation is the trade-off between bias and efficiency. Often, the most efficient estimator of a parameter is biased, and this facet is explored in the following section.
The final property of estimators is consistency.
$$\lim_{N \to \infty} \text{RMSE}(\hat\theta_N) = 0$$
Consistency means that as the sample size gets larger and larger, the estimator converges to the true value of the parameter. If an estimator is consistent, then one can estimate the parameter to arbitrary accuracy. To get more accurate estimates, one simply increases the sample size. Conversely, if an estimator is inconsistent, then there is a limit to how accurately the parameter can be estimated, even with infinitely large samples. Most common estimators in psychology, including the sample mean, sample variance, and sample correlation, are consistent.
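Consistency of p̂ can be illustrated by simulation: its RMSE shrinks as the number of Bernoulli trials N grows. A sketch (the helper rmse.phat() is our own):

```r
# Approximate RMSE of p-hat for increasing numbers of Bernoulli trials
set.seed(2)
rmse.phat=function(N,p=.7,M=10000){
  est=rbinom(M,N,p)/N          # M replicates of p-hat
  sqrt(mean((est-p)^2))
}
r=sapply(c(10,100,1000),rmse.phat)
r   # a decreasing sequence
```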
Because sample means and sample variances converge to expected value
and variances, respectively, they can be used to estimate these properties.
For example, let’s approximate the expected value, variance, and standard
error of p̂ with the sample statistics in R. We first generate a sequence of realizations $y_1, \ldots, y_M$ for binomial random variables $Y_i \overset{iid}{\sim} Y$, $i = 1, \ldots, M$. For each realization, we compute an estimate $\hat p_i = y_i/N$. The sample mean,
sample variance, and sample standard deviation approximate the expected
value, variance, and standard error. To see this, run the following R code:
y=rbinom(10000,20,.7)
p.hat=y/20
mean(p.hat) #sample mean
var(p.hat) #sample variance (N-1 in denominator)
sd(p.hat) #sample std. deviation (N-1 in denominator)
Let’s use the simulation method to further study the common-sense esti-
mator of the expected value of the binomial, the sample mean. Suppose in
an experiment, we had ten binomial RVs, each the result of 20 toast flips.
Here is a formal definition of the problem:
$$Y_i \overset{iid}{\sim} \text{Binomial}(p, 20), \quad i = 1, \ldots, 10,$$
$$\bar Y = \frac{\sum_i Y_i}{10}.$$
The following code generates 10 replicates from a binomial, each of 20
flips. Here we have defined a custom function called bsms() (bsms stands
for “binomial sample mean sampler”). Try it a few times. This is analogous
to having 10 people each flip 20 coins, then returning the mean number of
heads across people.
#define function
bsms=function(m,n,p)
{
z=rbinom(m,n,p)
mean(z)
}
#call function
bsms(10,20,.7)
Figure 1.6: Relative frequency plot of 10,000 calls to the function bsms(). For this plot, bsms() computed the mean of 10 realizations from binomials with N = 20 and p = .7.
The above code returns a single number as output: the sample mean of 10
binomials. Since the sample mean is an estimator, it has a sampling distribu-
tion. The bsms() function returns one realization of the sample mean. If we
are interested in the sampling distribution of the sample mean, we need to
sample it many times and plot the results in a relative frequency histogram.
This can be done by repeatedly calling bsms(). Here is the code for 10,000
replicates of bsms():
M=10000
bsms.realization=1:M #define the vector bsms.realization
for(m in 1:M) bsms.realization[m]=bsms(10,20,.7)
bsms.props=table(bsms.realization)/M
plot(bsms.props, xlab="Estimate of Expected Value (Sample Mean)",
ylab="Relative Frequency", type='h')
$$\hat p_0 = \frac{Y}{N}, \quad (1.11)$$
$$\hat p_1 = \frac{Y + .5}{N + 1}, \quad (1.12)$$
$$\hat p_2 = \frac{Y + 1}{N + 2}. \quad (1.13)$$
p=.7
N=10
z=rbinom(10000,N,p)
est.p0=z/N
est.p1=(z+.5)/(N+1)
est.p2=(z+1)/(N+2)
bias.p0=mean(est.p0)-p
rmse.p0=sqrt(mean((est.p0-p)^2))
bias.p1=mean(est.p1)-p
rmse.p1=sqrt(mean((est.p1-p)^2))
bias.p2=mean(est.p2)-p
rmse.p2=sqrt(mean((est.p2-p)^2))
Figure 1.7 shows the sampling distributions for the three estimators.
These sampling distributions tend to be roughly centered around the true
value of the parameter, p = .7. Estimator p̂2 is the least spread out, followed by p̂1 and p̂0 . Bias and efficiency of the estimators are indicated. Although estimator p̂0 is unbiased, it is also the least efficient! Figure 1.8 shows bias
and efficiency for all three estimators for the full range of p. The conventional
estimator p̂0 is unbiased for all true values of p, but the other two estimators
are biased for extreme probabilities. None of the estimators are always more
efficient than the others. For intermediate probabilities, estimator p̂2 is most
efficient; for extreme probabilities, estimator p̂0 is most efficient. Typically,
researchers have some idea of what types of probabilities of success to expect
in their experiments. This knowledge can therefore be used to help pick the
best estimator for a particular situation. We recommend p̂1 as a versatile al-
ternative to p̂0 for many applications even though it is not the common-sense
estimator.
[Figure 1.7 panels, probability plotted against estimated probability of success (0–1): p̂0 : Bias = 0, RMSE = .145; p̂1 : Bias = −.018, RMSE = .133; p̂2 : Bias = −.033, RMSE = .125.]
Figure 1.7: Sampling distribution of p̂0 , p̂1 , and p̂2 . Bias and root-mean-
squared-error (RMSE) are included. This figure depicts the case that there
are N = 10 trials with a p = .7.
Figure 1.8: Bias and root-mean-squared-error (RMSE) for the three estima-
tors as a function of true probability. Solid, dashed, and dashed-dotted lines
denote the characteristics of p̂0 , p̂1 , and p̂2 , respectively.
Chapter 2
The Likelihood Approach
Throughout this book, we use a general set of techniques for analysis that are
based on the likelihood function. In this chapter we present these techniques
within the context of three examples involving the binomial distribution.
At the end of the chapter, we provide an overview of the theoretical justi-
fication for the likelihood approach. In the following chapters, we use this
approach to analyze pertinent models in cognitive and perceptual psychology.
Throughout this book, analysis is based on the following four steps:
Y ∼ Binomial(N, p).
The right-hand side of the equation is the same as the probability mass
function; the difference is on the left-hand side. Here, we have switched the
arguments to reflect the fact that the likelihood function, denoted by L, is a
function of the parameter p.
likelihood=function(p,y,N)
return(dbinom(y,N,p))
Now, let’s examine the likelihood, a function of p, for the case in which 5
successes were observed in 10 flips.
p=seq(0,1,.01)
like=likelihood(p,5,10)
plot(p,like,type='l')
The seq() function in the first line assigns p to the vector (0, .01, .02, ..., 1).
The second line computes the value of the likelihood for each of these val-
ues of p. Figure 2.1 shows the resulting plot (Panel A). The likelihood is
unimodal and is centered over .5. Also shown in the figure are likelihoods
for 50 successes out of 100 flips, 7 successes out of 10 flips, and 70 successes
out of 100 flips. Two patterns are evident. First, the maximum value of the
likelihood is at y/N. Second, the width of the likelihood is a function of N.
The larger N, the smaller the range of parameter values that are likely for
the observation y.
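The first pattern can be checked numerically: on a fine grid of p values, the grid maximum for 7 successes in 10 flips sits at y/N = .7. A sketch:

```r
# Verify that the likelihood peaks at y/N on a fine grid of p values
likelihood=function(p,y,N) dbinom(y,N,p)
p=seq(0,1,.001)
p.max=p[which.max(likelihood(p,7,10))]
p.max   # 0.7
```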
It is a reasonable question to ask what a particular value of likelihood
means. For example, in Panel A, the maximum of the likelihood (about
.25) is far smaller than that in Panel B. In most applications, the actual
value of likelihood is not important. For the binomial, likelihood depends
on the observation, the parameter p and the number of Bernoulli trials, N.
For estimation, it is the shape of the function and the location of the peak
that are important, as discussed subsequently. For model comparison, the
difference in likelihood values among models is important.
Figure 2.1: A plot of likelihood for (A) 5 successes from 10 flips, (B) 50
successes from 100 flips, (C) 7 successes from 10 flips, and (D) 70 successes
from 100 flips
methods. Figure 2.2 shows log likelihood functions for the binomial distribution. To find the log likelihood, one takes the logarithm of the likelihood function, e.g., for the binomial:
$$l(p; y) = \log \binom{N}{y} + y \log p + (N - y) \log(1 - p).$$
Figure 2.2: Plots of log likelihood for (A) 5 successes from 10 flips, (B) 50
successes from 100 flips, (C) 7 successes from 10 flips, and (D) 70 successes
from 100 flips
$$\begin{aligned}
\frac{\partial l(p; y)}{\partial p} &= \frac{\partial}{\partial p}\left[\log \binom{N}{y} + y \log p + (N - y)\log(1 - p)\right], \\
&= \frac{\partial}{\partial p}\left[\log \binom{N}{y}\right] + \frac{\partial}{\partial p}\left[y \log p\right] + \frac{\partial}{\partial p}\left[(N - y)\log(1 - p)\right], \\
&= 0 + \frac{y}{p} - \frac{N - y}{1 - p}, \\
&= \frac{y}{p} - \frac{N - y}{1 - p}.
\end{aligned}$$
Setting this derivative to zero and solving for p yields the maximum likelihood estimate:
$$\begin{aligned}
\frac{y}{p} - \frac{N - y}{1 - p} &= 0, \\
\frac{y}{p} &= \frac{N - y}{1 - p}, \\
(1 - p)y &= (N - y)p, \\
y - py &= Np - yp, \\
y &= Np, \\
\hat p &= \frac{y}{N}, \quad (2.5)
\end{aligned}$$
loglike=function(p,y,N)
return(dbinom(y,N,p,log=T))
Note the log=T option in dbinom(). With this option, dbinom() re-
turns the log of the probability mass function.
• Step 2: Enter the data. Suppose we observed five successes on ten flips.
N=10
y=5
• Step 3: Find the maximum likelihood estimate. There are a few differ-
ent numerical methods implemented in R for optimization. For models
with one parameter, the function optimize() is an appropriate choice
(Brent, 1973). Here is an example.
optimize(loglike,interval=c(0,1),maximum=T,y=y,N=N)
$maximum
[1] 0.5
$objective
[1] -1.402043
[Figure 2.3: Log-likelihood (0 to −5) for the general and restricted models as a function of number of successes (0–20).]
N=20
y=13
mle.general=y/N
mle.restricted= .5 #by definition
log.like.general=dbinom(y,N,mle.general,log=T)
log.like.restricted=dbinom(y,N,mle.restricted,log=T)
log.like.general
[1] -1.690642
log.like.restricted
[1] -2.604652
The log likelihood is greater for the general model than the restricted
model (-1.69 vs. -2.60). This fact, in itself, is not informative. Figure 2.3
shows maximized log likelihood for the restricted and general models for all
possible outcomes. Maximized likelihood for the general model is as great or
greater than that of the restricted model for all outcomes. This trend always
holds: general models always have higher log likelihood than their nested
restrictions.
[Figure 2.4: Log-likelihood ratio statistic (0–25) as a function of number of successes (0–20).]
G2=-2*(log.like.restricted-log.like.general)
G2
[1] 1.828022
The value is about 1.83, which is less than 3.84. Hence, for 13 successes in
20 trials, we cannot reject the restriction that buttered toast is fair.
Figure 2.4 shows which outcomes will lead to a rejection of the fair-toast
restriction: all of those outcomes with likelihood ratio statistics above 3.84.
The horizontal line denotes the value of 3.84. From the figure, it is clear that
observing fewer than 6 or more than 14 successes out of twenty flips would
lead to a G2 greater than 3.84 and, therefore, to a rejection of the statement
that toast is fair.
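The rejection region in Figure 2.4 can be reproduced by computing G² for every possible outcome; a sketch:

```r
# G^2 for each possible number of successes out of 20 flips,
# comparing the restricted model (p = .5) to the general model (p = y/N)
y=0:20
g2=-2*(dbinom(y,20,.5,log=TRUE)-dbinom(y,20,y/20,log=TRUE))
y[g2>3.84]   # 0 1 2 3 4 5 15 16 17 18 19 20
```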
Testing the applicability of a restricted model against a more general one is a form of null hypothesis testing. In this case, the restricted model serves as the null hypothesis. When G2 is high, the restricted model is implausible, leading us to reject the restriction and accept the more general alternative.
$$f_X(x) = \begin{cases} 1/3 & x = 1 \\ 1/3 & x = 2 \\ 1/3 & x = 3 \\ 0 & \text{otherwise.} \end{cases}$$
Figure 2.5: The maximum of the log likelihood function does not change with the addition of a constant. Panel A shows the log likelihood function for a binomial with N = 100 and 70 observed successes. The vertical line denotes the maximum at .7. Panel B shows the log likelihood with the constant $\log \binom{N}{y}$ subtracted. The function has the same shape and maximum as the function in Panel A.
$par
[1] 0.2999797 0.5384823
$value
[1] 21.1897
$counts
function gradient
69 NA
$convergence
[1] 0
$message
NULL
Restricted Model
The restricted model can also be solved intuitively. Under the model, perfor-
mance in the low- and high-frequency conditions reflects a single Bernoulli
process. Hence, we can pool the data across conditions. Pooling the number
¹ For more information about what each of these values means, try ?optim in R to get help on optim().
N=c(20,13)
y=c(6,7)
optimize(nll.restricted,interval=c(0,1),y=y,N=N,maximum=F)
$objective
[1] 22.12576
gen.ll=-gen$value
res.ll=-res$objective
Type gen and notice the additional field $hessian, which is a matrix. The
diagonal elements are of interest and standard errors for the parameters are
calculated by genSE=sqrt(diag(solve(gen$hessian))). The elements of
the vector genSE correspond to the standard errors for the parameters in the
vector par.
Many researchers report standard errors along with estimates. Standard
errors, in our opinion, give a rough guide to the amount of variability in
the estimate as well as calibrate the eye as to the magnitude of significant
effects (Rouder & Morey, 2005). We discuss the related concept of confidence
intervals in Chapter 7.
The following code plots standard errors in R. For convenience, we define
a new function errbar(), which we can use for adding error bars to any plot.
errbar=function(x,y,height,width,lty=1){
arrows(x,y,x,y+height,angle=90,length=width,lty=lty)
arrows(x,y,x,y-height,angle=90,length=width,lty=lty)
}
xpos=barplot(gen$par,names.arg=c('Low Frequency',
'High Frequency'),col='white',ylim=c(0,1),space=.5,
ylab='Probability Estimate')
[Figure 2.6: Bar plot of probability estimates (0–1) with standard error bars for the two conditions.]
The last value is simply the width of the error bar. The resulting bar plot with standard error bars is shown in Figure 2.6. The overlap of the error bars indicates that any effect of word frequency is quite small given the variability of the estimate. The likelihood ratio test confirms the lack of statistical significance.
minimizing the mean squared error between model predictions and observed
data. Likelihood has the following advantages under mild technical condi-
tions2 :
In sum, the likelihood method is not necessarily ideal for every situation,
but it is straightforward and tractable with many nonlinear models. With
it, psychologists can test theories to a greater level of detail than is possible
with standard linear models.
More advanced, in-depth treatments of maximum likelihood techniques
can be found in mathematical statistics texts. While these in-depth treat-
ments do require some calculus knowledge, the advanced student can bene-
fit from learning about properties of maximum likelihood estimators. Some
good texts to consider for further reading are Hogg & Craig (1978), Lehmann
(1991), and Rice (1998).
Chapter 3
The High-Threshold Model
Experiments with data in the form of Table 3.1 are called signal-detection
experiments. Psychologists express the results of such experiments in terms
of four events:
1. Hit: Participant responds “tone present” on a signal trial.
Response
Stimulus Tone Present Tone Absent Total
Signal 75 25 100
Noise 30 20 50
Total 105 45 150
Table 3.1: Sample data for a signal-detection experiment.
Hit and correct rejection events are correct responses while false alarm and
miss events are error responses.
Signal detection experiments are used in many domains besides the de-
tection of tones. One prominent example is in the study of memory. In
many memory experiments, participants study a set of items, and then at
a later time, are tested on them. In the recognition memory paradigm both
previously studied items and unstudied novel items are presented at the test
phase. The participant indicates whether the item was previously studied or
is novel. In this case, a studied item is analogous to the signal stimulus and
the novel item is analogous to a noise stimulus. Consequently, the miss error
occurs when a participant is presented a studied item and indicates that it is
novel. The false alarm error occurs when a participant is presented a novel
item and indicates it was studied.
It is reasonable to ask why there are two different types of correct and error events. Why not just measure overall accuracy? One of the most dramatic examples of the importance of differentiating the errors comes from
the repressed-memory literature. The controversy stems from the question of
whether it is possible to recall events that did not happen, especially those
regarding sexual abuse. According to some, child sexual abuse is such a
shocking event that memory for it may be repressed (Herman & Schatzow,
1987). This repressed memory may then be “recovered” at some point later
in life. The memory, when repressed is a miss; but when recovered, is a
hit. Other researchers question the veracity of these recovered memories,
claiming it is doubtful that a memory of sexual abuse can be repressed and
then recovered (Loftus, 1993). The counter claim is that the sexual abuse
may not have occurred. The “recovered memory” is actually a false alarm.
In this case, differentiating between misses and false alarms is critical in
understanding how to evaluate claims of recovered memories.
The results of a signal detection experiment are commonly expressed as
the following rates:
where ph and pf refer to the true probabilities of hits and false alarms, respec-
tively. The other two probabilities, the probability of a miss and the proba-
bility of a correct rejection are denoted pm and pc , respectively. The outcome
of a signal trial may only be a hit or a miss. Consequently, ph + pm = 1.
Likewise, pf + pc = 1. Hence, there are only two free parameters (ph , pf ) of
concern. Once these two are estimated, estimates of (pm , pc ) can be obtained
by subtraction.
The model may be analyzed by treating each component independently.
Hence, by the results with the binomial distribution, maximum likelihood
estimates are given by
p̂h = yh /Ns
p̂f = yf /Nn
The terms p̂h and p̂f are the hit and false alarm rates, respectively.
[Figure: high-threshold model processing tree for signal trials and noise trials. Branches with probability 1 − g lead to a miss on signal trials and to a correct rejection on noise trials.]
where 0 < d, g < 1. The goal is to estimate parameters d and g. We use the
four-step likelihood approach. The first step, defining a hierarchy of models
has been done. The above model is the only one.
and
$$f(y_f; d, g) = \binom{N_n}{y_f} g^{y_f} (1 - g)^{N_n - y_f}.$$
Some of the terms are not functions of parameters and may be omitted:
$$l(d, g; y_h, y_f) = y_h \log(d + (1 - d)g) + y_m \log((1 - d)(1 - g)) + y_f \log(g) + y_c \log(1 - g).$$
Typically, these are written in terms of the hit and false alarm rates, p̂h and p̂f , respectively:
$$\hat d = \frac{\hat p_h - \hat p_f}{1 - \hat p_f},$$
$$\hat g = \hat p_f.$$
These equations are used to generate estimates from the sample data in Table 3.1:
$$\hat d = \frac{\frac{75}{100} - \frac{30}{50}}{1 - \frac{30}{50}} = .375,$$
$$\hat g = y_f/N_n = \frac{30}{50} = .6.$$
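These hand calculations are easily reproduced in R; a sketch using the counts from Table 3.1:

```r
# Maximum likelihood estimates for the high-threshold model, Table 3.1 data
yh=75; Ns=100   # hits out of signal trials
yf=30; Nn=50    # false alarms out of noise trials
ph.hat=yh/Ns                       # hit rate, .75
pf.hat=yf/Nn                       # false alarm rate, .60
d.hat=(ph.hat-pf.hat)/(1-pf.hat)   # detection estimate, .375
g.hat=pf.hat                       # guessing estimate, .60
```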
where C is the log of the choose terms that do not depend on parameters.
This log likelihood can be rewritten as
$$l = \sum_i y_i \log(p_i),$$
where i ranges over the four events. Let p denote a vector of probabilities (ph , pm , pf , pc ). The function is rewritten as:
Frequencies of Responses
Hit Miss False Alarm Correct Rejection
Condition 1 40 10 30 20
Condition 2 15 35 2 48
Table 3.2: Hypothetical data to test selective influence in the High Threshold
model.
pay the reverse (1c for each hit and 10c for each correct rejection). Condition 1 favors tone-present responses; Condition 2 favors tone-absent responses. Parameter g should therefore be higher in Condition 1 than Condition 2.
Parameter d does not reflect these payoffs and should be invariant to the
manipulation. Suppose the experiment was run with 50 signal and 50 noise
trials in each condition. Hypothetical data is given in Table 3.2.
There are two parts to the selective influence test: The first is whether the manipulation affected g as hypothesized. The second is whether the manipulation had no effect on d. This second test is at least as important as the first; the invariance of d, if it occurs, is necessary for support of the model. We follow the four steps in answering this question.
Model 1 (3.7)
$$Y_{h,i} \sim B(d_i + (1 - d_i)g_i, N_{s,i}), \quad (3.8)$$
$$Y_{f,i} \sim B(g_i, N_{n,i}). \quad (3.9)$$
Model 2 (3.10)
$$Y_{h,i} \sim B(d + (1 - d)g_i, N_{s,i}), \quad (3.11)$$
$$Y_{f,i} \sim B(g_i, N_{n,i}). \quad (3.12)$$
The three models form a hierarchy with Model 1 being the most general
and Models 2 and 3 being proper restrictions. This hierarchy allows us to
test the selective influence hypotheses. Accordingly, the expected variation
of g can be tested by comparing Model 3 to Model 1. Likewise, the expected
invariance of d can be tested by comparing Model 2 to Model 1.
nll.condition=function(par,y){
  d=par[1]
  g=par[2]
  p=numeric(4)
  p[1]=d+(1-d)*g
  p[2]=1-p[1]
  p[3]=g
  p[4]=1-p[3]
  return(-sum(y*log(p)))
}
Given common parameters, data from the different conditions are indepen-
dent. The joint likelihood across conditions is the product of likelihoods for
each condition, and the joint log likelihood across conditions is the sum of
the log likelihoods for each condition. The following function, nll.1() com-
putes the negative log likelihood for Model 1. It does so by calling individual
condition log likelihood function nll.condition() twice and adding the re-
sults. Because Model 1 specifies different parameters for each condition, each
call to nll.condition() has different parameters.
The input to the function is the vector of four parameters (d1 , g1 , d2 , g2 ) and the vector of eight data points from the two conditions.
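A sketch of what nll.1() might look like, with nll.condition() repeated so that the snippet is self-contained; the parameter ordering (d1, g1, d2, g2) follows the text:

```r
# Negative log likelihood for one condition of the high-threshold model
nll.condition=function(par,y){
  d=par[1]; g=par[2]
  p=numeric(4)
  p[1]=d+(1-d)*g   # hit
  p[2]=1-p[1]      # miss
  p[3]=g           # false alarm
  p[4]=1-p[3]      # correct rejection
  -sum(y*log(p))
}
# Model 1: separate (d, g) parameters in each condition
nll.1=function(par4,y8){
  nll.condition(par4[1:2],y8[1:4])+
  nll.condition(par4[3:4],y8[5:8])
}
dat=c(40,10,30,20,15,35,2,48)
nll.1(c(.5,.5,.5,.5),dat)   # about 147.52
```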
In Model 2, there is a single detection parameter d. The log likelihood
for this model is evaluated similarly to that in Model 1. The difference is
that when nll.condition() is called for each condition, it is done with a
common detection parameter. The input is the vector of three parameters
and eight data points:
nll.2=function(par3,y8){
  nll.condition(par3[1:2],y8[1:4])+
  nll.condition(par3[c(1,3)],y8[5:8])
}
dat=c(40,10,30,20,15,35,2,48)
#Model 1
par=c(.5,.5,.5,.5) #starting values
mod1=optim(par,nll.1,y8=dat,hessian=T)
#Model 2
par=c(.5,.5,.5) #starting values
mod2=optim(par,nll.2,y8=dat,hessian=T)
#Model 3
par=c(.5,.5,.5) #starting values
mod3=optim(par,nll.3,y8=dat,hessian=T)
Figure 3.2: Parameter estimates and standard errors from Models 1 and 3.
Bars are parameter estimates from the general model and points are estimates
from the restricted model.
The output is in variables mod1, mod2, and mod3. There is one element
of the analysis that is of concern. The estimate of dˆ2 for Model 3, given in
mod3 is -.03. This estimate is invalid and we discuss a solution in Section ??.
For now the value of dˆ2 may be set to 0. Figure 3.2 provides an appropriate
graphical representation of the results. It was constructed with barplot and
errbar as discussed in Chapter 2. The bar plots are from Model 1, the
general model. The point between the two bar-plotted detection estimates
is the common detection estimate from Model 2. The point between the two
bar-plotted guessing estimates is the common guessing estimate from Model
3. From these plots, it would seem that the manipulation certainly affected
g. The case for d is more ambiguous, but it seems plausible that d depends on
the payoff, which would violate selective influence and question the veracity
of the model.
(with dˆ2 set to 0 the value is 41.37). Under the null-hypothesis that g1 = g2 ,
this value should be distributed as a chi-square. As mentioned previously, the
degrees of freedom for the test is the difference in the number of parameters
in the models, which is 1. The criterial value of the chi-square statistic with
1 degree of freedom is 3.84. Hence, Model 3 can be rejected in favor of Model
1. The payoff manipulation did indeed influence g as hypothesized.
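As a sketch, the test just described amounts to comparing the reported G² statistic against the chi-square quantile:

```r
# likelihood-ratio test of Model 3 against Model 1
G2=41.37               # 2*(mod3$value-mod1$value), with d2 set to 0
crit=qchisq(.95,df=1)  # criterial value, 3.84
G2>crit                # TRUE: reject Model 3 in favor of Model 1
```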
The second part of selective influence is the invariance of d. From Fig-
ure 3.2, it is evident that there is a large disparity of sensitivity across the
conditions (.50 vs. .27). This difference appears relatively large given the
standard errors. Yet, the value of G2 (2*(mod2$value-mod1$value)) is 1.26,
which is less than the criterial value of 3.84. Therefore, the invariance cannot
be rejected.
This latter finding is somewhat surprising given the relatively large size of
the effect in Figure 3.2. As a quick check of this obtained invariance, it helps
to inspect model predictions. The model predictions for a condition can be
obtained by:
p̂h = d̂ + (1 − d̂)ĝ, (3.15)
p̂f = ĝ. (3.16)
With these equations, the predictions from Model 1 and Model 2 are
shown in Table 3.3. As can be seen, Model 2 does a fair job at predicting
the data, even though the parameter estimate of d differs from d1 and
d2 in Model 1. This result is evidence for the invariance of d. When there
is a common detection parameter, the ability to predict the data is almost
as good as with condition-specific detection parameters. The lesson is that,
for nonlinear models, it may be difficult to decide whether parameters differ
significantly by inspecting estimates and standard errors alone.
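Equations 3.15 and 3.16 can be computed with a one-line helper; the parameter values below are illustrative rather than the fitted ones:

```r
# predicted hit and false-alarm probabilities for the high-threshold model
pred.ht=function(d,g) c(ph=d+(1-d)*g, pf=g)
pred.ht(d=.5,g=.2)   # ph=.6, pf=.2
```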
The lines in Figure 3.3 are predictions from the high-threshold model.
Each line corresponds to a particular value of d. The points on the line are
obtained by varying g. The line is the prediction for the case of invariance
of sensitivity and it is called the isosensitivity curve (Luce, 1963). The high-
threshold model predicts straight line isosensitivity curves with a slope of
(1 − d) and an intercept of d. The following is the derivation of the result:
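The derivation follows directly from the model's hit and false-alarm probabilities, ph = d + (1 − d)g and pf = g: substituting the second equation into the first gives a line in the ROC plane.

```latex
p_f = g \quad\Rightarrow\quad g = p_f, \qquad
p_h = d + (1-d)g = d + (1-d)\,p_f .
```

Hence ph is a straight-line function of pf with slope (1 − d) and intercept d.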
[Figure: ROC plot with data points labeled B, D, and E and isosensitivity lines for d = .6, .35, and .1; the y-axis is hit rate.]
Figure 3.3: ROC plot. The points are the data from Table 3.4. Lines denote
predictions of the high-threshold model.
2. Plot the data as an ROC and add a line denoting the common-
detection model.
so. A closely related alternative is the double high-threshold model. Like the
high-threshold model, the double high-threshold model is also predicated
on all-or-none mental processes. In contrast to the high-threshold model,
however, the double high-threshold model posits that participants may enter
a noise-detection state in which they are sure no signal has been presented.
The model is shown graphically in Figure 3.4. The model is the same as
the high-threshold model for signal trials: either the signal is detected, with
probability d, or not. If the signal is not detected, the participant guesses as
before. On noise trials, participants either detect that the target is absent
(with probability d) or enter a guessing state. Model equations are given by
Yh ∼ Binomial(d + (1 − d)g, Ns ) (3.17)
Yf ∼ Binomial((1 − d)g, Nn), (3.18)
where 0 < d, g < 1.
Analysis of this model is analogous to the high-threshold model. The log
likelihood is given by:
l(d, g; yh, yf) = yh log(d + (1 − d)g) + ym log((1 − d)(1 − g)) + yf log((1 − d)g) + yc log(d + (1 − d)(1 − g)).
Either calculus methods or numerical methods may be used to provide
maximum likelihood estimates. The calculus methods provide the following
estimates:
d̂ = yh/Ns − yf/Nn, (3.19)
ĝ = (yf/Nn)/(1 − d̂). (3.20)
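Equations 3.19 and 3.20 are easy to check numerically; a minimal sketch (the function name is ours):

```r
# closed-form ML estimates for the double high-threshold model
est.dht=function(yh,Ns,yf,Nn){
  d=yh/Ns-yf/Nn     # Equation 3.19
  g=(yf/Nn)/(1-d)   # Equation 3.20
  c(d=d,g=g)
}
est.dht(yh=40,Ns=50,yf=10,Nn=50)   # d=.6, g=.5
```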
[Figure: ROC plot with isosensitivity lines for d = 0.6, 0.35, and 0.1; the y-axis is hit rate.]
Figure 3.5: ROC lines of the double high-threshold model for several values
of d.
c = (Yh + Yc)/(2N).
In expectation under the model, this measure is a linear function of d:
c = d/2 + 1/2.
This relationship indicates that overall accuracy is a simple linear transform
of d when Nn = Ns. Overall accuracy for this case may be derived from
the double high-threshold model; its validity as a measure is tested by testing
the selective influence of the double high-threshold model.
Instead, the presence of memory is inferred from its ability to indirectly affect
a mental action such as completing the stem with the first word that comes to
mind. Most surprisingly, amnesics have somewhat preserved performance on
indirect tests. While an amnesic may not recall studying a specific word, that
word still has an elevated chance of being used by the amnesic to complete
the stem at test (Graf & Schacter, 1985).
This finding, as well as many related ones, has led to the current conceptualization of memory as consisting of two systems or components. One
of these components reflects conscious recollection, which is willful and pro-
duces the feeling of explicitly remembering an event. This type of memory is
primarily used in direct tasks. The other component is the automatic, uncon-
scious, residual activation of previously processed or encountered material.
Automatic activation corresponds to the feeling of familiarity but without
explicit memorization. Within this conceptualization, amnesics’ deficit is in
the conscious recollective component but not in the automatic component.
Hence, they tend to have less impairment in tasks that do not require a
conscious recollection. There are a few variations of this dichotomy (e.g.,
Jacoby, 1991; Schacter, 1990; Squire, 1994), but the conscious-automatic one
is influential.
The goal of the process dissociation procedure is to measure the degree
of conscious and automatic processing and we describe its application to the
stem completion task. The task starts with a study phase in which partic-
ipants are presented a sequence of items. There are two test conditions in
process dissociation: an include condition and an exclude condition. In the
include condition, participants are instructed to complete the stem with a
previously studied word. In the exclude condition, participants are instructed
to complete the stem with any word other than the one studied. In the
include condition, stem completion with a studied word can occur through
either successful conscious recollection or automatic activation. In the exclude
condition, successful conscious recollection does not lead to stem completion
with the studied item; instead, it leads to stem completion with a different
item.
The following notation is used to implement the model. Let Ni and
Ne denote the number of words in the include and exclude test conditions,
respectively. Let random variables Yi,s and Yi,n be the frequency of stems
completed with a studied word and a word not studied, respectively, in the
include condition. Let random variables Ye,s and Ye,n denote the same for
words in the exclude condition. It is assumed that recollection is all-or-none
r̂ = yi,s/Ni − ye,s/Ne, (3.25)
â = (ye,s/Ne)/(1 − r̂). (3.26)
                     Condition
                Include                Exclude
           Studied  Not Studied   Studied  Not Studied
Younger       69        81           37       113
Elderly       36        64           12        88
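Equations 3.25 and 3.26 may be applied to the younger group in the table; each condition's row sums to the number of test stems, so Ni = Ne = 150:

```r
# process-dissociation estimates for the younger group
yi.s=69; ye.s=37
Ni=69+81; Ne=37+113          # 150 stems per test condition
r=yi.s/Ni-ye.s/Ne            # Equation 3.25: conscious recollection
a=(ye.s/Ne)/(1-r)            # Equation 3.26: automatic activation
round(c(r=r,a=a),3)          # r=0.213, a=0.314
```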
Chapter 4

The Theory of Signal Detection
The theory of signal detection (Green & Swets, 1966) is the dominant model-
based method of assessing performance in perceptual and cognitive psychol-
ogy. We describe the model for the tone-in-noise signal detection experiment.
The model, however, is applied more broadly to all sorts of two-choice tasks
in the literature, including those assessing memory. Macmillan and Creelman (1991) provide an extensive review of the variations and uses of this
flexible model. It is important to distinguish between the theory of signal
detection and signal-detection experiments. The former is a specific model
like the high-threshold model; the latter is an experimental design with two
stimuli and two responses.
The presentation of the model relies on continuous random variables,
which were not covered in Chapter 1. We first discuss this type of random
variable as well as density, cumulative distribution, and quantile functions.
Then we introduce the normal distribution, upon which the theory of signal
detection is based. After covering this background material in the first half
of the chapter, we present the theory itself in the second half.
[Figure: Panels A and B plot probability mass against number of successes and number of events; Panels C and D plot densities, Panel C in units of 1/pounds.]
Figure 4.1: Top: Probability mass functions for discrete random variables.
Bottom: Density functions for continuous random variables.
What is the probability that a continuous random variable takes any sin-
gle value? It is the area under a single point, which is zero. This condition
makes sense. For example, we can ask for the probability that a person
weighs exactly 170lbs. There are many people who report their weight at 170lbs,
but this report is only an approximation. In fact, very few people world-
wide weigh between 169.99lbs and 170.01lbs. Surely almost nobody weighs
between 169.999999lbs and 170.000001lbs. As we decrease the size of the
interval around 170lbs, the probability that anybody’s weight falls in the
interval becomes smaller. In the limit, nobody can possibly weigh exactly
170lbs. Hence the probability of someone weighing exactly some weight, to
arbitrary precision, is zero.
In order to more fully understand the density function, it is useful to
consider the units of the axes. The units of the x-axis of a density function
are straightforward: they are the units of measurement of the random variable.
For example, the x-axis of the normal density in Figure 4.1 is in units of
pounds. The units of the y-axis are more subtle. They are found by considering
the units of area. On any graph, the units of area under the curve are given by:

units of area = units of the x-axis × units of the y-axis.

Because the area under a density function is a probability, which has no units,
the units of the y-axis must be the reciprocal of the units of the x-axis (e.g.,
1/pounds in Figure 4.1).

[Footnote 1: Convergence involves shrinking the bin size. As the number of realizations grows, the bins should become smaller, and, in the limit, the bins should become infinitesimally small.]
[Figure: two density functions for weight; the x-axes are weight in pounds and the y-axes are density in units of 1/pounds.]
Figure 4.3 shows the relationship between density and cumulative distribution functions for a uniform between 0 and 2. Two dotted vertical lines,
labeled a and b, mark the points a = .5 and b = 1.3. The values of
density and CDF are shown in left and right panels, respectively. The area
under the density function to the left of a is .25, and this is the value of the
CDF in the right panel. Likewise, the area under the density function to the
left of b is .65, and this is also graphed in Panel B. Cumulative distribution
functions are limited to the [0, 1] interval and are always increasing.
The cumulative distribution function can be used to compute the proba-
bility an observation occurs on the interval [a, b]:
Pr(a < X ≤ b) = F(b) − F(a). (4.2)
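For the uniform of Figure 4.3, Equation 4.2 may be checked with punif(), R's uniform CDF:

```r
# Pr(.5 < X <= 1.3) for a uniform on [0, 2]
punif(1.3,min=0,max=2)-punif(.5,min=0,max=2)   # .65 - .25 = .4
```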
[Figure: left panel, the uniform density with dotted vertical lines at a and b; right panel, the cumulative distribution function with the corresponding values marked.]
Figure 4.3: Density (left) and cumulative distribution function (right) for a
uniform random variable. The cumulative distribution function is the area
under the density function to the left of a value.
[Figure: a cumulative distribution function; x-axis outcome from 0 to 10, y-axis cumulative probability from 0 to 1.]
percentile is the value below which 75% of the distribution lies. For example,
for the uniform in Figure 4.3, the value 1.5 is the 75th percentile because
75% of the area is below this point. Quantiles are percentiles for
distributions, except they are indexed by fractions rather than by percentage
points. The .75 quantile corresponds to the 75th percentile.
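In R, the p-prefixed functions are CDFs and the q-prefixed functions are the corresponding quantile functions; for the uniform of Figure 4.3 they invert one another:

```r
qunif(.75,min=0,max=2)    # 1.5, the .75 quantile (75th percentile)
punif(1.5,min=0,max=2)    # .75, the CDF evaluated at 1.5
```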
The quantile function takes a probability p and returns the associated pth
quantile for a distribution. The quantile function is the inverse of the cumula-
tive distribution function. Whereas the CDF returns the proportion of mass
below a given point, the quantile function returns the point below which a
given proportion of the mass lies. Examples of density functions, cumulative
distribution functions and quantile functions for three different continuous
distributions are shown in Figure 4.5. The top row is for a uniform between
0 and 2; the middle row is for a normal distribution; the bottom row is for
an exponential distribution. The exponential is a skewed distribution used
to model the time between events such as earthquakes, light bulb failures,
[Figure: a 3 × 3 grid of plots. Rows: uniform, normal, and exponential distributions. Columns: density and cumulative probability as functions of outcome, and quantile (outcome) as a function of probability.]
Figure 4.5: Density, cumulative probability, and quantile functions for the uniform, normal, and exponential distributions.
[Figure: left, the chi-square density with 1 df and the bound qchisq(.95,1); right, the t density with 4 df and the bounds qt(.025,4) and qt(.975,4), which enclose the central 95% of the area.]
Figure 4.6: Criterial bounds for the chi-square distribution with 1 df and the
t distribution with 4 df.
[Figure: densities of sensory strength for tone-absent and tone-present trials; the distributions are separated by d′ and the criterial bound c is marked on the x-axis.]
Analysis begins with model predictions about hit, false-alarm, miss, and
correct-rejection probabilities. Correct rejection probability is the easiest to
derive. Correct rejection events occur when strengths from the tone-absent
distribution are below the criterial bound c. This probability is the CDF of
a standard normal at c, which is denoted as Φ(c).
pc = Φ(c). (4.4)
False alarm events occur when strengths from the tone-absent distribution exceed the bound; hence
pf = 1 − Φ(c). (4.5)
Miss events occur when strengths from the tone-present distribution, which is centered at d′, fall below the bound:
pm = Φ(c − d′). (4.6)
Because ph = 1 − pm ,
ph = 1 − Φ(c − d′ ). (4.7)
Equations 4.5 through 4.7 describe underlying probabilities, not data.
The resulting data are distributed as binomials, e.g.,
Yh ∼ Binomial(ph , Ns )
Yf ∼ Binomial(pf , Nn ).
4.2.1 Analysis
In this section, we provide analysis for a single condition. Data are the
numbers of hits, misses, false alarms, and correct rejections, denoted as
previously. The parameters may be estimated in closed form:
c = −Φ−1 (pf )
d′ = Φ−1 (ph ) − Φ−1 (pf )
dprime.est=qnorm(hit.rate)-qnorm(fa.rate)
c.est=-qnorm(fa.rate)
y=c(40,10,30,20)
par=c(1,0) #starting values
optim(par,ll.sd,y=y,control=list(fnscale=-1)) # fnscale=-1 maximizes the log likelihood
The results are d̂′ = .588, and ĉ = −.253. These estimates match those
obtained from Equations 4.10 and 4.11.
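As a check, the closed-form estimators may be applied directly to the same data (hits, misses, false alarms, correct rejections):

```r
y=c(40,10,30,20)
hit.rate=y[1]/(y[1]+y[2])                  # 40/50 = .8
fa.rate=y[3]/(y[3]+y[4])                   # 30/50 = .6
dprime.est=qnorm(hit.rate)-qnorm(fa.rate)  # 0.588
c.est=-qnorm(fa.rate)                      # -0.253
```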
[Figure: Panel A, ROC plot with isosensitivity curves for d′ = 0.1, 0.75, and 1.5; Panel B, the corresponding zROC plot with z-transformed axes.]
Figure 4.8: ROC and zROC plots for the data from Table 3.4. Signal detec-
tion model predictions are overlaid as lines.
For notational convenience, let yi = Φ−1 (p̂h,i) and xi = Φ−1 (p̂f,i ). Then,
yi = xi + d′ ,
which is the equation for a straight line with a slope of 1.0 and an intercept
of d′ . If the signal detection model holds and the conditions each have the
same sensitivity, then the zROC points should fall on a straight line with
slope 1.0. The function Φ−1 is also called a z-transform and z-transformed
proportions are also called z-scores.
To draw a zROC, we use qnorm(p), where p is either the hit or false-alarm
rate. The following code plots the zROC for the data in Table 3.4.
hit.rate=c(.81,.7,.57,.5,.30)
fa.rate=c(.6,.47,.37,.2,.04)
plot(qnorm(fa.rate),qnorm(hit.rate),ylab="z(Hit Rate)",
xlab="z(False-Alarm Rate)",ylim=c(-2,2),xlim=c(-2,2))
Fit the signal-detection model to the data in Table 3.4. Fit a general
model with separate parameters for each condition. The 10 parameters
in this model are (d′A , cA , d′B , cB , d′C , cC , d′D , cD , d′E , cE ). Fit a common
sensitivity model; the six parameters are (d′ , cA , cB , cC , cD , cE ).
The above code introduces some new programming elements. Let’s work
through an example with three conditions. Suppose for each condition, there
are 20 observations (N = 20) and the number of successes (hits or false
alarms) is y = (2, 0, 20). The code works as follows: From the first line, p
is a vector with values (.1, 0, 1). The second line of code is more complex.
Consider first the term p==0. The symbol == tests each term for equality
with 0, and so this line returns a vector of true and false values. In this case,
the second element is true, because p[2] does equal 0. The left-hand side,
p[p==0] refers to all of those elements in which the term within the brackets
is true, i.e., all those in which p does indeed equal zero. These elements
are replaced with the value of 1/(2N). The third line operates analogously: it
replaces all estimated proportions of 1.0 with the value 1 − 1/(2N). Hautus and
Lee (1998) provide further discussion of the properties of these estimators.
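For readers without the preceding code at hand, the clamping just described can be sketched as follows (variable names are ours):

```r
# clamp estimated proportions away from 0 and 1
N=20
y=c(2,0,20)
p=y/N               # (.1, 0, 1)
p[p==0]=1/(2*N)     # replace proportions of 0 with 1/2N
p[p==1]=1-1/(2*N)   # replace proportions of 1 with 1-1/2N
p                   # (.100, .025, .975)
```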
The novel element in these plots is horizontal error bars, which are drawn
with the following code:
horiz.errbar=function(x,y,height,width,lty=1)
{
arrows(x,y,x+width,y,angle=90,length=height,lty=lty)
arrows(x,y,x-width,y,angle=90,length=height,lty=lty)
}
[Figure: Panel A, ROC plot; Panel B, zROC plot with the isosensitivity line for d′ = 0.56.]
Figure 4.9: A: ROC plot for the data from Table 4.1 with standard errors on
hit and false-alarm rates. The black line represents the isosensitivity curve
for the value of d′ obtained by maximum likelihood. B: zROC plot for the
same data with standard errors on sensitivity estimate. The solid line is the
isosensitivity curve for the best-fitting signal detection model. The dotted
lines are standard errors on sensitivity.
hit=c(82,68,48)
fa=c(62,44,28)
N=100
hit.rate=hit/N
fa.rate=fa/N
std.err.hits=sqrt(hit.rate*(1-hit.rate)/N)
std.err.fa=sqrt(fa.rate*(1-fa.rate)/N)
#plot ROC
plot(fa.rate,hit.rate,xlim=c(0,1),ylim=c(0,1))
errbar(fa.rate,hit.rate,height=std.err.hits,width=.05)
horiz.errbar(fa.rate,hit.rate,height=.05,width=std.err.fa)
for all three conditions. The resulting isosensitivity curve is the solid line.
Standard errors were derived from optim with the option hessian=T. For the
data in Table 4.1, the estimate of sensitivity is 0.56 and its standard error is
0.077. These standard errors are plotted as parallel isosensitivity curves and
are denoted with dotted lines. The code for drawing these lines is
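The elided code might look like the following sketch, which draws the zROC points and adds lines with slope 1 and intercepts d̂′ and d̂′ ± SE (the values are those reported above; treat this as our reconstruction, not the book's exact code):

```r
hit.rate=c(.82,.68,.48)     # rates from Table 4.1
fa.rate=c(.62,.44,.28)
plot(qnorm(fa.rate),qnorm(hit.rate),xlim=c(-1,1),ylim=c(-1,1),
  xlab="z(False Alarm Rate)",ylab="z(Hit Rate)")
d.est=.56; se=.077
abline(d.est,1)             # isosensitivity line, slope 1
abline(d.est+se,1,lty=3)    # upper standard-error line
abline(d.est-se,1,lty=3)    # lower standard-error line
```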
1. Create an ROC plot with standard errors on hit and false alarms.
the likelihood fail for these models. They do not return parameters that
maximize likelihood. Second, the previously introduced techniques for model
comparison are insufficient for the three new models. None of the new models
is a restriction of another; i.e., the models are not nested. The likelihood
ratio statistic, while appropriate for nested models, is not appropriate for
non-nested ones. We discuss an alternative method, based on the Akaike
Information Criterion (AIC, Akaike, 1974), to make these comparisons. The
remainder of the chapter is divided into three sections: a discussion of more
advanced numerical techniques for improving optimization, a discussion of
the three new models, and a discussion of non-nested model comparisons.
h=function(theta,y) return(sum((theta-y)^2))
y=1:20
par=rep(10,20) #starting values
optim(par,h,y=y)
Results are:
$par
[1] 0.3535526 1.7091429 4.1538558 3.0865141 5.8710791 5.2499118
[7] 7.8665955 8.8864199 7.8365200 9.2342382 11.6387186 13.8285510
[13] 13.6197244 13.8850712 12.5633528 15.3360254 16.7935952 16.6734207
[19] 19.3004116 19.8297374
$value
[1] 19.91514
These results are troubling as the estimates are surprisingly far from their
true values, and the function minimizes to a value of 19.9 instead of 0.
This example demonstrates that optim() with default settings may fail on
problems with more than a few parameters; it is not foolproof. In the following
sections, we explore a few strategies to increase the accuracy of optimization.
With respect to θ1, h depends only on y1 and not on the other nineteen data
points. This fact holds analogously for the other parameters: the appropriate
value of each parameter depends on a single data point. Consider the following
code for minimization that takes advantage of this fact:
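The code, reconstructed here from the description that follows (the search interval and the name of the one-parameter function are our choices):

```r
h1=function(theta,y) (theta-y)^2   # one-parameter slice of h
y=1:20
par.est=1:20    # reserve space for the estimates
min=0           # accumulates the minimized function value
for (i in 1:20){
  out=optimize(h1,interval=c(-100,100),y=y[i])
  par.est[i]=out$minimum
  min=min+out$objective
}
par.est   # nearly the true values 1..20
min       # nearly 0
```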
In this code, we call optimize twenty times from within the loop. Each pass
through the loop optimizes a single parameter. The resulting parameters
values are stored in vector par.est. The minimum of the function is stored
in min. The results are that the estimated values (par.est) are nearly the
true values and the function minimizes to nearly 0. For this case, performing
20 one-parameter optimizations is far more accurate than performing a single
twenty-parameter optimization.
This strategy of framing analysis so that it involves multiple optimizations
with smaller numbers of parameters is often natural in psychological contexts.
Consider the analysis of the high-threshold model for payoffs data (Table 3.4)
as an example. The model reflecting selective influence has six parameters:
(d, gA , gB , gC , gD , gE ). We start by assuming the true value of d. Of course,
this assumption is unwarranted. We only use the assumption to get started
and will soon dispense with it during estimation. If the true value of d is known,
then estimation of gA only depends on data in Condition A, estimation of gB
only depends on the data of Condition B and so on. This fact leads naturally
to multiple optimization calls. The first step is to compute the likelihood for
g in a single condition given a fixed value of d:
nll.ht.given.d=function(g,y,d)
{
p=1:4 # reserve space
p[1]=d+(1-d)*g #probability of a hit
p[2]=1-p[1] # probability of a miss
p[3]=g # probability of a false alarm
p[4] = 1-p[3] #probability of a correct rejection
return(-sum(y*log(p)))
}
The body of the function is identical to that in Section 3.2.3; the difference
is how the parameters are passed. The function in Section 3.2.3 is minimized
with respect to two parameters. The current function will be minimized with
respect to g alone. The maximum log likelihood for all five conditions for a
known value of d is:
#d is detection parameter
#dat is hA,mA,faA,cA,...hE,mE,faE,cE
nll.ht=function(d,dat)
{
return(
optimize(nll.ht.given.d,interval=c(0,1),y=dat[1:4],d=d)$objective+
optimize(nll.ht.given.d,interval=c(0,1),y=dat[5:8],d=d)$objective+
optimize(nll.ht.given.d,interval=c(0,1),y=dat[9:12],d=d)$objective+
optimize(nll.ht.given.d,interval=c(0,1),y=dat[13:16],d=d)$objective+
optimize(nll.ht.given.d,interval=c(0,1),y=dat[17:20],d=d)$objective
)
}
The function nll.ht can be called for any value of d, and when it is called, it
performs five one-parameter optimizations. Of course, we wish to estimate d
rather than assume it. This can be done by optimizing nll.ht with respect
to d. Here is the code for the data in Table 3.4:
dat=c(404,96,301,199,348,152,235,265,
287,213,183,317,251,249,102,398,148,352,20,480)
g=optimize(nll.ht,interval=c(0,1),dat=dat)
The result of this last optimization provides the minimum of the negative log
likelihood as well as the ML estimate of d. The minimum here is 2901.941,
which is probably lower than the negative log likelihood you found for the
restricted model in Your Turn 3.4.1 (depending on the starting values you
chose). To find the ML estimates of the guessing parameters, each of the
optimization statements in nll.ht may be called with the ML estimate of d.
We call this approach nested optimization. Optimization of parameters
specific to conditions (gA , gB , .., gE ) is nested within parameters common
across all conditions (d). Overall, nested optimization often provides more
accurate results than a large single optimization. The disadvantage of nested
optimization is that it does not immediately provide standard error estimates
for parameters. One strategy is to do two separate optimizations. The first
of these is with nested optimizations. Afterward, the obtained ML estimates
can be used as starting values for a single optimization of all parameters with
a single optim() call. In this case, optim() should return these starting val-
ues as parameter estimates. In addition, it will return the Hessian which can
be used to estimate standard errors as discussed in Chapter 2.
[Figure: the logistic transform from z (the reals, here −6 to 6) to p (the interval from 0 to 1).]
y=c(75,25,30,20)
z=c(qlogis(.5),qlogis(.5)) #ranges from -infty to infty
results=optim(z,snll,y=y)
plogis(results$par)
Function optim() evaluates the log likelihood function with various values
of z that are free to vary across the reals. The returned results are for z, which
ranges across all reals. To interpret these values, they should be transformed
back to probabilities; this is done with the plogis(results$par) statement.
Rerun the code from Chapter 3 and then run the above code for compari-
son. The parameter estimates are almost identical. Look at the $counts
returned by optim(). It is lower for the current code (45 evaluations) than
for that in Chapter 3 (129 evaluations). When we transformed parameters, R
required only one-third the evaluation calls, saving time in the optimization
process.
5.1.4 Convergence
Optimization routines work by repeatedly evaluating the function until a
minimum is found. By default optim() continues to search for better pa-
rameter values until one of two conditions is met: 1. new iterations do not
lower the value of the to-be-minimized function much or 2. a maximum
number of iterations occurs. If optim() reaches this maximum number
of iterations, then it is said to have not converged and optim() returns a
value of 1 for the $convergence field. If the algorithm stops before this
maximum number of iterations because new iterations do not lower the func-
tion value much, then the algorithm is said to have converged and a value of
0 is returned in the convergence field. The maximum number of iterations
defaults to 500 for the default algorithm in optim().
y=1:20
par=rep(10,20)
optim(par,h,control=list(maxit=10000),y=y) #takes a while
We achieve convergence and the estimates are much better than for 500 it-
erations (although the function minimizes to .08 instead of zero). A related
approach for convergence is to run the algorithm repeatedly. For each rep-
etition, the previous parameter values serve as the new starting values. For
example,
par=rep(10,20)
a1=optim(par,h,control=list(maxit=10000),y=y)
a2=optim(a1$par,h,control=list(maxit=10000),y=y)
The results of the second call are reasonable (the function minimizes to .006
instead of zero).
The advantage of these brute-force approaches of raising the number of
function evaluations is that they are trivially easy to implement. The dis-
advantage is that they often are not as effective as nested optimization and
transforming parameters.
with a single simplex call, yet it works quickly with a single nlm() call:
y=1:200
par=rep(100,200)
optim(par,h,y=y)
nlm(h,par,y=y)
5.1.6 Caveats
Optimization is far from foolproof. The lack of an all-purpose, sure-fire nu-
merical optimizer is perhaps the most significant drawback to the numerical
approach we advocate. As a result, it is incumbent on the researcher to
use numerical optimization with care and wisdom. We recommend that a
researcher consider additional safeguards to understand the quality of their
optimizations. Here are a few:
• Repeat optimization with different starting points. It is a good
idea to repeat optimization from a number of different starting values.
Hopefully, many starting values lead to the same minimum.
ph = ds + (1 − ds )g, (5.3)
pf = (1 − dn )g (5.4)
[Figure: ROC panels with hit rate on the y-axes.]
We find the best value of g for each condition with optimize(). Because
optimize() works well on restricted intervals such as [0, 1], there is no reason
to transform these guessing parameters. The likelihood across all conditions
is obtained by adding the likelihood from each condition. Note the use of
transformed detection parameters.
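The elided function might be sketched as follows; the per-condition likelihood follows Equations 5.3 and 5.4, and the helper's name is our own rather than the book's:

```r
# -log likelihood of g for one condition, given det=c(ds,dn)
nll.ght.g=function(g,y,det){
  p=1:4
  p[1]=det[1]+(1-det[1])*g  # hit, Equation 5.3
  p[2]=1-p[1]               # miss
  p[3]=(1-det[2])*g         # false alarm, Equation 5.4
  p[4]=1-p[3]               # correct rejection
  return(-sum(y*log(p)))
}
# -log likelihood across the five conditions for transformed (zds,zdn)
nll.ght=function(zdet,dat){
  det=plogis(zdet)          # map the reals to (0,1)
  total=0
  for (i in 1:5)
    total=total+optimize(nll.ght.g,interval=c(0,1),
      y=dat[(4*i-3):(4*i)],det=det)$objective
  return(total)
}
```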
Next, the log likelihood for the model is maximized by finding the appropri-
ate detection parameters (ds , dn ). Both nlm() and optim() work well with
transformed detection parameters in this application.
dat=c(404,96,301,199,348,152,235,265,
287,213,183,317,251,249,102,398,148,352,20,480)
zdet=rep(0,2) #starting values of zdet
est=optim(zdet,nll.ght,dat=dat)
plogis(est$par)
pf = 1 − Φ(c), (5.5)
ph = 1 − F (c, d′, σ 2 ), (5.6)
where F(x, µ, σ²) is the CDF for a normal with parameters (µ, σ²). It is
conventional to rewrite this equation in terms of the CDF for the standard
normal: F(x, µ, σ²) = Φ([x − µ]/σ). Therefore,
ph = 1 − Φ((c − d′)/σ). (5.7)
[Figure: densities of sensory strength for the unequal-variance model; the tone-present distribution is shifted by d′ and has standard deviation σ, and the bound is marked on the x-axis.]
origin and (1,1); two examples of isosensitivity curves are drawn in Figure 5.3B. For the five conditions in Table 3.4, the model has 7 parameters
(d′, σ², c1, c2, c3, c4, c5). The log likelihood for known d′ and σ² in a single
condition may be evaluated in R:
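The elided evaluation might be sketched as follows (the function name is ours; the hit and false-alarm probabilities follow Equations 5.5 and 5.7):

```r
# -log likelihood of the bound for one condition, given dprime and sigma
nll.uv.1=function(cbound,y,dprime,sigma){
  p=1:4
  p[1]=1-pnorm((cbound-dprime)/sigma)  # hit, Equation 5.7
  p[2]=1-p[1]                          # miss
  p[3]=1-pnorm(cbound)                 # false alarm, Equation 5.5
  p[4]=1-p[3]                          # correct rejection
  return(-sum(y*log(p)))
}
nll.uv.1(cbound=.5,y=c(40,10,30,20),dprime=1,sigma=1.5)
```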
amount.of.bias=function(b,p)
ifelse(b>0,(1-p)*b,p*b)
The ifelse function has been introduced earlier (p.xx). If b > 0 then the
second argument ((1-p)*b) is evaluated and returned; otherwise the third
argument (p*b) is evaluated and returned.
With this function, it is straightforward to implement a function that
evaluates the log likelihood for a single condition. The following is the log
likelihood of b for known detection parameters (ds , dn ).
#low-threshold model
#det=c(ds,dn)
#y=c(hits,misses,fa’s,cr’s)
nll.lt.1=function(b,y,det)
{
ds=det[1]
dn=det[2]
p=1:4 # reserve space
p[1]=ds+amount.of.bias(b,ds) #probability of a hit
p[2]=1-p[1] # probability of a miss
p[3]=dn+amount.of.bias(b,dn) # probability of a false alarm
p[4] = 1-p[3] #probability of a correct rejection
return(-sum(y*log(p)))
}
[Figure: model hierarchy; each box lists a model, its parameter count, log likelihood, and AIC, e.g., the binomial model: 10 parameters, log L = −32.57, AIC = 85.15.]
Figure 5.5: A hierarchy of seven models for the payoff experiment in Ta-
ble 3.4. Included are model comparison statistics G2 , log likelihood, and
AIC.
[Figure: Panels A and B, ROC plots with hit rate on the y-axes.]
AIC = −2 log L(θ∗; y) + 2M,
where L is the likelihood of the model, θ∗ are the MLEs of the parameters,
and M is the number of parameters. The lower the AIC, the better the
model fit. The model with the lowest AIC measure is selected as the most
parsimonious. AIC measures for the seven models are shown in Figure 5.5.
The general high-threshold model is the most parsimonious, followed by the
double-high threshold model and the binomial model.
Consider the AIC measure for two models that have the same number
of parameters. In this case, the model with the lower AIC score is the one
with the higher log likelihood. This is a reasonable boundary condition. Log
likelihood values, however, are insufficient when two models have different
numbers of parameters. As discussed in Chapter 2, models with more param-
eters tend to have higher log likelihoods. For example, the binomial model
with ten parameters will always have higher likelihood than any of its re-
strictions simply because it has a greater number of parameters. The AIC
measure accounts for the number of parameters by penalizing models with
more parameters. For each additional parameter, the AIC score is raised by
2 points (this value of 2 is not arbitrary; it is derived from statistical theory).
Because of this penalty, the AIC score for the general high-threshold model
is lower than that of the binomial model, even though the latter has greater
log likelihood.
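The AIC computation itself is one line of R. Using the binomial model's values shown in Figure 5.5 (log likelihood −32.57 with 10 parameters):

```r
#AIC = -2 * log likelihood + 2 * number of parameters
aic=function(logL,M) -2*logL+2*M
aic(-32.57,10) #binomial model of Figure 5.5; matches the reported AIC up to rounding
```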
The AIC and likelihood ratio test analyses concord fairly well in that
they both favor the general high-threshold and double-high threshold model
over the other competitors. They appear to disagree, however, on which of
these two is the most appropriate. According to the likelihood ratio test, the
double high-threshold model may not be rejected in favor of the general high-
threshold model. In contrast, according to AIC, the general high-threshold
model is preferred over the double high-threshold model. The disagreement
is more apparent than real because the analyses have different logical bases.
The likelihood ratio test is vested in the logic of null hypothesis testing.
In this case, the double high-threshold restriction (ds = dn ) serves as the
null hypothesis, and we do not have sufficient evidence to reject it at the .05
level.
Multinomial Models
                              Response
                    Tone Absent          Tone Present
Stimulus        High  Medium  Low     Low  Medium  High   Total
Tone in Noise     3      7     14      24     32     12      92
Noise Alone      12     16     21      15     10      6      80

Table 6.1: Hypothetical data from a confidence-ratings task.
where Π denotes the product of the terms, and is analogous to Σ for sums.
Log likelihood is

l(p1, ..., pI ; y1, ..., yI) = log N! − Σi log yi! + Σi yi log pi.   (6.3)

The terms log N! and −Σi log yi! do not depend on the parameters, so only
Σi yi log pi matters for estimation; it is analogous to the critical term in the
log likelihood of the binomial. Maximum likelihood estimates of pi can be
obtained by numerical minimization or by calculus methods. The calculus
methods yield:
p̂i = yi/N.   (6.4)
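Eq. 6.3 (minus its constant terms) and Eq. 6.4 are direct to implement. The function below is a minimal version consistent with how nll.mult.1 is used later in this chapter; if the chapter's own definition differs (e.g., by retaining the constant terms), use that one:

```r
#negative log likelihood of multinomial probabilities p for counts y,
#dropping the terms that do not depend on p
nll.mult.1=function(p,y) -sum(y*log(p))

y=c(3,7,14,24,32,12) #signal row of Table 6.1
p.hat=y/sum(y)       #ML estimates, Eq. 6.4
nll.mult.1(p.hat,y)  #minimized negative log likelihood
```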
In this model, Ns and Nn are the number of signal and noise trials, respectively.
Parameters pi,j are the probability of the ith response to the jth stimulus and
are subject to the restrictions Σi pi,s = 1 and Σi pi,n = 1. ML estimates for
pi,j are analogous to those for the binomial and are p̂i,j = Yi,j/Nj.
Calculation is straightforward and an example for the data of Table 6.1 is:
dat.signal=c(3,7,14,24,32,12)
dat.noise=c(12,16,21,15,10,6)
NS=sum(dat.signal)
NN=sum(dat.noise)
6.2 Signal Detection Model of a Confidence-Rating Task
Figure 6.1: The signal detection model for confidence ratings. The left and
right distributions represent the noise and signal-plus-noise distributions, re-
spectively. The five bounds, c1 , .., c5 , divide up the strengths into six intervals
from “Tone absent with high confidence” to “Tone present with high confi-
dence.”
parest.signal=dat.signal/NS
parest.noise=dat.noise/NN
We assume independence holds across stimuli; hence, the overall log likeli-
hood for both stimuli in the general model is
nll.general=nll.mult.1(parest.signal,dat.signal)+
nll.mult.1(parest.noise,dat.noise)
sdprob.1=function(mean,sd,bounds)
{
cumulative=c(0,pnorm(bounds,mean,sd),1)
p.ij=diff(cumulative)
return(p.ij)
}
To understand the code, consider the case in which the stimulus is noise
alone (standard normal) and the bounds are (−2, −1, 0, 1, 2). The vector
cumulative is assigned values (0, .02, .16, .5, .84, .98, 1). The middle five val-
ues are the areas under the normal density function to the left of each of the
five bounds. Eq. 6.7 describes the areas between the bounds, which are the
successive differences between these cumulative values. These differences are
conveniently obtained by the diff() function.
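The worked example can be checked directly at the R prompt:

```r
sdprob.1=function(mean,sd,bounds) #as defined above
{
cumulative=c(0,pnorm(bounds,mean,sd),1)
p.ij=diff(cumulative)
return(p.ij)
}
round(sdprob.1(0,1,c(-2,-1,0,1,2)),2)
#0.02 0.14 0.34 0.34 0.14 0.02 -- six probabilities summing to 1
```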
Now that we can compute model-based probabilities for a single condition,
it is fairly straightforward to estimate d′ , σ, and the bounds. We provide code
that computes the (negative) log likelihood of (d′, σ 2 , c1 , .., c5 ) and leave it to
the reader to optimize it. The first step is to write a function that returns
response probabilities for both stimuli as a function of model parameters:
nll.sigdet=function(par,y)
#negative log likelihood of free-variance signal detection
#for confidence-rating paradigm
#par=d,sigma,bounds
#y=y_(1,s),..,y_(I,s),y_(1,n),..,y_(I,n)
{
I=length(y)/2
d=par[1]
sigma=par[2]
bounds=par[3:length(par)]
p.noise=sdprob.1(0,1,bounds)
p.signal=sdprob.1(d,sigma,bounds)
nll.signal=nll.mult.1(p.signal,y[1:I])
nll.noise=nll.mult.1(p.noise,y[(I+1):(2*I)])
return(nll.signal+nll.noise) #nll.mult.1 already returns negative log likelihoods
}
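With sdprob.1() and nll.mult.1() in hand, the parameters can be estimated by passing nll.sigdet to optim(). The sketch below is self-contained: it restates the two helpers, adds a guard so the optimizer cannot step into invalid values (negative σ or unordered bounds), and fits the Table 6.1 data. The starting values are assumptions:

```r
nll.mult.1=function(p,y) -sum(y*log(p)) #multinomial nll, constants dropped

sdprob.1=function(mean,sd,bounds)
{
cumulative=c(0,pnorm(bounds,mean,sd),1)
return(diff(cumulative))
}

nll.sigdet=function(par,y)
{
I=length(y)/2
d=par[1]
sigma=par[2]
bounds=par[3:length(par)]
if(sigma<=0 || is.unsorted(bounds)) return(1e10) #guard against invalid values
p.noise=sdprob.1(0,1,bounds)
p.signal=sdprob.1(d,sigma,bounds)
if(any(p.noise<=0) || any(p.signal<=0)) return(1e10)
return(nll.mult.1(p.signal,y[1:I])+nll.mult.1(p.noise,y[(I+1):(2*I)]))
}

y=c(3,7,14,24,32,12, 12,16,21,15,10,6) #Table 6.1: signal row, then noise row
start=c(1,1,-1,-.5,0,.5,1)             #d', sigma, c1..c5 (assumed starting values)
g=optim(start,nll.sigdet,y=y,control=list(maxit=2000))
g$par #estimates of d', sigma, and the five bounds
```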
Figure 6.2: The Rouder and Batchelder storage-retrieval model for bizarre
imagery. See Table 6.2 for description of events E1 , .., E6 .
6.3.2 Analysis
                     Event
Condition    E1   E2   E3   E4   E5   E6
Bizarre     103    2   46    0    7   22
Common       80    0   65    3    9    2
Four choices:    A) Ant Eater Lager   B) Oil Change Stout
                 C) Tiger Ale         D) Chancellor's Reserve
Beliefs:         A) .20   B) .19   C) .60   D) .01
Two choices:     B) Oil Change Stout   D) Chancellor's Reserve
Revised beliefs: B) .95   D) .05
beliefs are renormalized over the available choices. For the example in Figure 6.3, let pi and
p∗i denote the belief in the ith alternative before and after two choices are
eliminated. According to the law of conditional probability:
p∗i = pi/(p2 + p4),  i = 2, 4.   (6.8)
The correct answer for this question is (2). When probabilities follow the
law of conditional probability, we say the decision maker properly conditions
on events.
In this model, ηi,j describes the similarity between stimuli i and j, and
βj describes the bias toward the jth response. Similarities range between
0 and 1. The similarity between any item and itself is 1.
6.4.3 Analysis
Consider SCM for the identification of the first three letters a, b, c. Table 6.3
depicts the format of data. Random variable Yij is the number of times
stimulus i elicits response j. In the table, the stimuli are denoted by lower-
case letters (even though the physical stimuli may be upper-case) and the
responses are denoted by upper-case letters. Table 6.3 is called a confusion
matrix.
The general multinomial model for this confusion matrix is as follows.
                    Response
Stimulus     A      B      C     # of trials per stimulus
a          Ya,A   Ya,B   Ya,C    Na
b          Yb,A   Yb,B   Yb,C    Nb
c          Yc,A   Yc,B   Yc,C    Nc

Table 6.3: Confusion matrix for three stimuli (a, b, c) associated with three
responses (A, B, C).
The model places a free probability on each response to a particular stimulus.
These three probabilities must sum to 1.0. Hence, for each
component, there are two free parameters. For the three stimuli, there are a
total of six free parameters. The log likelihood is
l = Σi=1..3 Σj=1..3 yi,j log pi,j.   (6.14)
6.4.4 Implementation in R
In this section, we implement analysis of SCM in R. One element that makes
this job difficult is that the similarities, ηi,j , are most easily conceptualized
as a matrix, as in Eq. 6.10. Yet, all of our previous code relied on passing
parameters as vectors. To complicate matters, not all of the elements of the
similarity matrix are free. Hence, the code must keep careful track of which
6.4. SIMILARITY CHOICE MODEL 151
matrix elements are free and which are derived. Consequently, the code
is belabored. Even so, we present it because these types of problems are
common in programming complex models, and this code serves as a suitable
exemplar. Readers not interested in programming these models can skip this
section without loss.
The first step in implementing the SCM model is to write a function
that yields the log likelihood of the data as a function of all similarity and
response parameters, whether they are free or derived. The following code
does so. It calculates the log likelihood for each stimulus within a loop and
steps through the loop for all I stimuli. Before running the code, be sure to
define I; e.g., I=3.
par2eta=function(par.e)
{
eta=matrix(1,ncol=I,nrow=I) #create an I-by-I matrix of 1s.
eta[upper.tri(eta)]=par.e
eta[lower.tri(eta)]=t(eta)[lower.tri(eta)] #mirror the upper triangle so eta is symmetric
return(eta)
}
The novel elements in this code are the functions upper.tri() and lower.tri().
These functions map the elements of the vector par.e into the appropriate
locations in the matrix eta. To see how these work, try the following lines
sequentially:
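For example, with I = 3 and three arbitrary similarity values:

```r
I=3
eta=matrix(1,ncol=I,nrow=I)
upper.tri(eta)                 #logical matrix: TRUE above the diagonal
eta[upper.tri(eta)]=c(.1,.2,.3)
eta[lower.tri(eta)]=c(.1,.2,.3)
eta                            #symmetric similarity matrix with 1s on the diagonal
```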
In SCM, there are I(I − 1)/2 free similarity parameters and I − 1 free re-
sponse bias parameters. The following function returns the log likelihood as a
function of these I(I −1)/2+I −1 free parameters. Because all similarity pa-
rameters are restricted to be between zero and 1, we use logistic-transformed
similarity parameters. Likewise, because response biases must always be
positive, we use an exponential transform of response bias parameters. The
function e^x is positive for all real values of x.
6.4. SIMILARITY CHOICE MODEL 153
nll.scm=function(par,dat) {
#par is concatenation of I(I-1)/2 logistic transformed similarity
#parameters and (I-1) exponential transformed response bias parameters
#dat is confusion matrix
par.e=plogis(par[1:(I*(I-1)/2)])
par.b=exp(par[((I*(I-1)/2)+1):((I*(I-1)/2)+I-1)])
beta=par2beta(par.b)
eta=par2eta(par.e)
return(nll.scm.first(eta,beta,dat))
}
The following code shows the analysis of a sample confusion matrix. The
first line reads sample data into the matrix dat.
I=3
dat=matrix(scan(),ncol=3,byrow=T)
49 1 15
6 22 5
17 1 35
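Fitting requires par2beta() and nll.scm.first(), which are defined earlier in the chapter. The sketch below is self-contained: the versions of those two helpers are plausible reconstructions consistent with how they are called here (par2beta() fixes the first response bias at 1, matching the output below, and nll.scm.first() computes the negative log likelihood of Eq. 6.14 under the SCM choice rule), and the optim() call producing g is an assumption; the book may use nlm() instead:

```r
I=3
dat=matrix(c(49,1,15, 6,22,5, 17,1,35),ncol=3,byrow=TRUE)

par2beta=function(par.b) c(1,par.b) #first response bias fixed at 1 (assumed)
par2eta=function(par.e)
{
eta=matrix(1,ncol=I,nrow=I)
eta[upper.tri(eta)]=par.e
eta[lower.tri(eta)]=t(eta)[lower.tri(eta)]
return(eta)
}
nll.scm.first=function(eta,beta,dat) #reconstruction of the chapter's function
{
num=sweep(eta,2,beta,"*") #eta[i,j]*beta[j]
p=num/rowSums(num)        #SCM response probabilities per stimulus
return(-sum(dat*log(p)))
}
nll.scm=function(par,dat) #as defined above
{
par.e=plogis(par[1:(I*(I-1)/2)])
par.b=exp(par[((I*(I-1)/2)+1):((I*(I-1)/2)+I-1)])
return(nll.scm.first(par2eta(par.e),par2beta(par.b),dat))
}

npar=I*(I-1)/2+I-1
g=optim(rep(0,npar),nll.scm,dat=dat,control=list(maxit=2000))
```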
> # similarities
> par2eta(plogis(g$par[1:3]))
> # response biases
> par2beta(exp(g$par[4:5]))
[1] 1.0000000 0.2769761 0.7926932
From these data, it may be seen that a and c are more similar to each other
than either is to b. In addition, there is a tendency toward greater response
bias for responses A and C than for response B.
                             Response
            Four-Choice Condition     Two-Choice Condition
Stimulus      A     B     C     D         A     B
a            49     5     1     2        83    10
b            12    35     7     1        15    76
c             1     5    22    11
d             1     6    18    19
Decide if the similarity between A and B is the same across the two
conditions. Consider the following steps:
Figure: A unidimensional representation of items A through I along a length
dimension, and two two-dimensional representations (length and angle; size).
Fortunately, only minor changes to the SCM code are needed to analyze
this model. There are I positions xi , but the first one of these may be set to
0 without any loss. Therefore, there are only I − 1 free position parameters.
Likewise, there are I − 1 free response biases. The first step is modifying the
mapping from free parameters to similarities:
par2eta.1D=function(par.e)
{
x=c(0,par.e) #set first item’s position at zero
d=as.matrix(dist(x,diag=T,upper=T)) #type d to see distances
return(exp(-d)) #returns similarities
}
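For instance, placing three items at positions 0, 1, and 3 (so par.e = c(1, 3)):

```r
par2eta.1D=function(par.e) #as defined above
{
x=c(0,par.e)                          #set first item's position at zero
d=as.matrix(dist(x,diag=T,upper=T))   #pairwise distances
return(exp(-d))                       #similarities
}
round(par2eta.1D(c(1,3)),3)
#similarity falls off exponentially with distance:
#eta[1,2]=exp(-1)~.368, eta[1,3]=exp(-3)~.050, eta[2,3]=exp(-2)~.135
```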
Response
Item A B C D E F G H
a 84 8 7 1 0 0 0 0
b 14 45 23 9 5 4 0 0
c 8 10 36 19 15 7 3 2
d 2 8 17 31 21 9 8 4
e 3 3 6 28 34 14 5 7
f 2 4 8 18 18 32 11 7
g 0 2 5 6 7 9 49 22
h 0 0 2 0 3 5 5 85
di,j = ( Σm=1..M |xi,m − xj,m|^r )^(1/r).   (6.18)
When r = 2, this is the familiar formula for the straight-line distance
between two points, called the Euclidean distance. Figure 6.6 shows the
Euclidean distance between two points as well as two other distances. When
r = 1, the distance is called the city-block distance. For the example with
two points in two dimensions, city-block distances are computed with only
vertical and horizontal lines. Much like navigating in a dense city, diagonal
lines are not admissible paths between points. The city-block distance in the
figure is 7. The maximum distance is obtained as r → ∞; it is the largest
difference between the points on any single dimension. In the figure, the
differences are 4 (x-direction) and 3 (y-direction), so the maximum distance
is 4. We use the Euclidean distance here, although researchers have proposed
the city-block distance for certain classes of stimuli (e.g., Shepard, 1986).
The following code implements a two-dimensional SCM model for the
line lengths. There are I points, each with two position parameters. The
first item may be placed at (0, 0). Also, the y-coordinate of the second point
may also be placed at 0 without loss. There are, therefore, 2(I − 1) − 1
free position parameters and I − 1 free response biases; the total is
3(I − 1) − 1 free parameters. The function par2eta.2D() converts the
2(I − 1) − 1 position parameters to similarities:
par2eta.2D=function(par.e)
{
x=c(0,par.e[1:(I-1)]) #x-coordinate of each of the I points
y=c(0,0,par.e[I:(2*(I-1)-1)]) #y-coordinates; the first two are fixed at 0
d=as.matrix(dist(cbind(x,y),diag=T,upper=T)) #Euclidean distances between points
return(exp(-d)) #returns similarities
}
Figure 6.6: Three distances between two points that differ by 4 in the
x-direction and 3 in the y-direction: the Euclidean distance is 5, the
city-block distance is 7, and the maximum distance is 4.
nll.scm.2D=function(par,dat)
{
par.e=par[1:(2*(I-1)-1)]
par.b=par[(2*(I-1)):(3*(I-1)-1)]
beta=par2beta(par.b)
eta=par2eta.2D(par.e)
return(nll.scm.first(eta,beta,dat))
}
par=c(1:7,rep(0,6),rep(1/8,7))
start=optim(par,nll.scm.2D,dat=dat)
g.2D=nlm(nll.scm.2D,start$par,dat=dat,iterlim=200)
One of the drawbacks of the SCM models is that the parameters grow
quickly with increasing numbers of items. With sixteen items, for example,
a three-dimensional SCM model has 56 parameters. While this is not a
prohibitive number in some contexts, it is difficult to achieve stable ML
estimates with the methods we have described.
Fortunately, there are standard, high performance multidimensional scal-
ing techniques (Cox and Cox, 2001; Shepard, Romney, and Nerlove, 1972;
Torgerson, 1958). The mathematical bases of these techniques are outside
the scope of this book. Instead, we describe their R implementation. When
distances are known, the function cmdscale() provides a Euclidean repre-
sentation. An example is built into R and may be found with ?cmdscale.
For psychological applications, however, we rarely know mental distances.
There are a few common alternatives. One alternative is to simply ask people
the similarity of items, two at a time. These data can be transformed to
distances using Equation 6.15. One critique of this approach is that similarity
data is treated as being on a ratio scale. For example, suppose a participant
rates the similarity between items i and j as a “2” and that between items k
and l as a “4.” The implicit claim is that k and l are twice as similar as i and
j. This may be too strong; instead, it may be more prudent to consider just
the ordinal relations, e.g., k is more similar to l than i is to j. Fortunately,
there is a form of multidimensional scaling, called non-metric multidimensional
scaling, that is based on these ordinal relations. In R, two functions work
well: sammon() and isoMDS(). Both are in the package “MASS” (Venables
& Ripley, 2002). The package is loaded with the library() command; e.g.,
library(MASS).
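As a quick illustration, the following applies isoMDS() to R's built-in eurodist road-distance data; any dissimilarity matrix can be substituted. This example is an illustration, not one from the text:

```r
library(MASS)
fit=isoMDS(eurodist,k=2) #non-metric MDS into two dimensions
head(fit$points)         #recovered two-dimensional coordinates
fit$stress               #badness of fit, in percent
```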
gories. This activation depends on the overall similarity of the item to the
exemplars for specific categories. The activation for category A is
a = Σi∈A ηy,i,
where the sum is over all exemplars of Category A. Activation for category
B is given analogously:
b = Σi∈B ηy,i.
p(y, A) = aβA/(aβA + bβB),
Coordinates
Letter x1 x2 x3 x4 Category
1 -0.167 0.136 0.042 -0.088 A
2 -0.445 0.265 -0.008 -0.218 A
3 -0.308 0.325 -0.046 0.241 A
4 -0.648 0.120 0.138 0.350 A
5 -0.254 -0.048 0.354 0.428 A
6 0.001 -0.746 -0.080 0.033 B
7 0.013 -0.357 -0.887 0.203 B
8 0.441 -0.064 0.261 0.309 B
9 0.014 -0.206 -0.065 0.671 B
10 0.251 0.492 0.157 -0.102 B
11 -0.357 0.387 -0.267 -0.084 ?
12 -0.218 0.402 -0.218 -0.044 ?
13 0.466 0.113 -0.054 -0.124 ?
14 0.549 0.289 -0.047 -0.052 ?
15 -0.082 -0.181 0.782 -0.075 ?
16 -0.218 0.402 -0.218 -0.044 ?
Response Category
Letter A B
11 28 22
12 26 24
13 16 34
14 21 29
15 23 27
16 30 20
The Normal Model

The first six chapters focused on substantive models for binomial and multi-
nomial data. The binomial and multinomial are ideal for data that is discrete,
such as the frequency of events. Psychologists often deal with data which is
more appropriately modeled as continuous. A common example is the time
to complete a task in an experiment, or response time (RT). Because RT
may take any positive value, it is appropriate to model it with a continuous
random variable. In some contexts, even discrete data may be more conve-
niently modeled with continuous RVs. One example is intelligence quotients
(IQ). IQ scores are certainly discrete, as there are a fixed number of questions
on a test. Even though it is discrete, it is typically modeled as a continuous
RV. The most common model of data in psychology is the normal. In this
chapter we cover models based on the normal distribution.
Parameters µ and σ 2 are the mean and variance of the distribution. The goal
is to estimate these parameters. One way to do this is through maximum
likelihood.
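Maximum likelihood for the normal can be carried out numerically, just as with the models of earlier chapters. The following sketch uses an arbitrary vector of hypothetical IQ scores and assumed starting values:

```r
nll.normal=function(par,y) #par = (mu, sigma2)
{
if(par[2]<=0) return(Inf) #variance must be positive
return(-sum(dnorm(y,par[1],sqrt(par[2]),log=TRUE)))
}
y=c(103,111,112,89,120,123,92,105,87,126)
g=optim(c(100,100),nll.normal,y=y)
g$par #ML estimates of mu and sigma2; compare mean(y) and mean((y-mean(y))^2)
```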
µ̂ ∼ Normal(µ, σ 2 /I).
Figure 7.1: The distribution of the IQ population and the distribution of the
sample mean, with N = 9.
The maximum likelihood estimator σ̂², however, is not the sample
variance. Sample variance, s², is defined as:

s² = Σi=1..I (yi − µ̂)² / (I − 1).
The difference between s2 and σˆ2 is in the denominator; the former involves a
factor of I −1 while the latter involves a factor I. The practical consequences
of this difference are explored in the following exercise.
2. Compute the RMSE and bias for each estimate. Which is more
efficient?
treatment=c(103,111,112,89,120,123,92,105,87,126)
control=c(81,114,105,75,104,98,114,106,92,122)
t.test(treatment,control,var.equal=T)
The output indicates the t value, the degrees of freedom for the test, and
the p value. The p value denotes the probability of observing a t value as
¹The two tests are in fact equivalent in this case.
extreme or more extreme than the one observed under the null hypothesis
that the true means are equal. In this case, the p value is about 0.39. We
cannot reject the null hypothesis that the two groups have the same mean.
It is common when working with the normal model to plot µ̂, as shown in
Figure 7.3. The bars denote confidence intervals (CIs) rather than standard
errors. Confidence intervals have an associated percentile range and the 95%
confidence interval is typically used. The interpretation of CI, like much
of standard statistics, is rooted in the concept of repeated experiments. In
the limit that an experiment is repeated infinitely often, the true value of
the parameter will lie in the 95% CI for 95% of the replicates. Confidence
intervals are constructed with the t distribution as follows.
mean(control)+c(-1,1)*qt(.975,9)*sd(control)/sqrt(10)
Alternatively, R will compute the confidence intervals for you if you use the
t.test() function. Reporting CIs or standard errors is equally acceptable
in most contexts. We recommend CIs over standard errors in general, though
we tend to make exceptions when plotting CIs clutters a plot more than
plotting standard errors would.
tion. Both designs have advantages; for the purposes of this section we will
consider the between-subjects design.
For an experiment design like the one described above, the most widely
used model for analysis is ANOVA. The details and theory behind ANOVA
analyses is covered in elementary statistics texts (Hays, 19xx). The basic
ANOVA model for the 2-factor, between-subjects design above is
the interaction term, and describes any deviation from the additivity of the
treatment effects. For instance, ginko may only be effective in the presence
of music. For each effect, the null hypothesis is that all treatment means are
the same.
R has functions for ANOVA analyses built in. Consider the experimental
design above. Data for a hypothetical experiment using this design is in the
file factorial.dat. Download this file into your working directory, then use the
following code to load and analyze it.
dat=read.table(’factorial.dat’,header=T)
summary(aov(IQ~Ginko*Music,data=dat))
The result should be a table like the one in Table 7.1. The p values in
the last column tell us the probability under the null hypothesis (that all
treatment means are the same) of getting an F statistic as large as the one
obtained. If p is less than our prespecified alpha, we reject the null and
conclude that all treatment means are not equal. In this case, there is a
significant effect of ginko root, but not of exposure to music.
To see this graphically, consider Figure 7.5. The boxplot shows that the
median scores of the groups receiving ginko are higher than the groups not
receiving it, which accords with the ANOVA analysis. It also allows us to
quickly check the equal-variance assumption of the ANOVA test; in this case
there does not appear to be any violation of this assumption.
CIij = X̄ij ± t.025(error df) √(MSE/Nij)   (7.7)
where Nij is the number of subjects in the ijth group.
Applying this to the data and plotting,
gmeans=tapply(dat$IQ,list(dat$Ginko,dat$Music),mean)
size=qt(.025,36)*sqrt(284.2/10)
g=barplot(gmeans,beside=T,ylim=c(70,130),xpd=F)
errbar(g,gmeans,size,.3,1)
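The function errbar() is not part of base R. If it is not already defined from earlier in the book, the following minimal stand-in is consistent with the call above; treating the last two arguments as the cap width (in inches) and line width is an assumption:

```r
#stand-in for errbar(): draws vertical error bars of +/- height at (x, y),
#with flat caps; the meaning of the last two arguments is assumed
errbar=function(x,y,height,width,lwd=1)
{
arrows(x,y-height,x,y+height,angle=90,code=3,length=width/2,lwd=lwd)
}
```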
The resulting plot is shown in Figure 7.6. Two things are apparent from this
plot. First, with only 10 participants in each group, our ability to localize
means with 95% confidence is limited; the confidence intervals are fairly wide.
Second, the mean of the two dark bars (no ginko) is less than the mean of the
light bars (ginko treatment). The pattern of means is similar to the pattern
of medians from the boxplot in Figure 7.5.
7.5 Regression
7.5.1 Ordinary Least-Squares Regression
Regression is perhaps the most popular method of assessing the relationships
between variables. We provide a brief example of how to do regression in R.
The example comes from an experiment by Gomez, Perea & Ratcliff (in
press) who asked participants to perform a lexical decision task. In this task,
participants decide whether a string of letters is a valid English word. For
example, the string cafe is a valid English word while the string mafe is not.
In this task, it has been well established that responses to common words
are faster than responses to rare words. The goal is to explore some
of the possible functional relationships between word frequency and response
time (RT). Word frequency is a measure of the number of times a word
occurs for every million words of text in magazines (Kucera & Francis, 1968).
Gomez et al. collected over 9,000 valid observations across 55 participants
and 400 words. The basic data for this example are the mean RTs for each
of the 400 words computed by averaging across participants. The following
code loads the data and draws a scatter plot of RT as a function of word
frequency. Before running it, download the file regression.dat and place it in
your working directory.
dat=read.table(’regression.dat’,header=T)
colnames(dat) #returns the column names of dat
plot(dat$freq,dat$rt,cex=.3,pch=20,col=’red’)
There are four hundred points, with each coming from a different word.
From the scatter plot in the left panel of Figure 7.7, it seems evident that as
frequency increases, RT decreases.
The linear regression model is

RTj ∼ Normal(β0 + β1 fj , σ²),

where RTj and fj denote the mean RT to the jth item and the frequency of
the jth item, respectively. Parameters β0 and β1 are the intercept and slope
of the best-fitting line, respectively. The model is written equivalently as

RTi = β0 + β1 fi + ǫi,
ǫ ∼ Normal(0, σ 2 ).
g=lm(dat$rt~dat$freq)
summary(g)
abline(g)
The first line fits the regression model. The command lm() stands for “linear
model,” and is extremely flexible. The argument is the specification of the
model. The output is stored in the object g. The command summary(g)
returns a summary of the fit. Of immediate interest are the estimates β̂0 and
β̂1, which are .753 seconds and -.008 seconds per frequency unit, respectively.
Also reported are standard errors and a test of whether these statistics are
significantly different than zero. The model accounts for 21.4% of the total
variance in the data (multiple R2 = .214). The last line adds the regression
line in the plot. This line has intercept β̂0 and slope β̂1.
Xi ∼ Normal(100, 10).
2. The results should indicate that the slope and intercept estima-
tors are unbiased. Does the bias depend on the distribution of
Xi ? Assess the bias when Xi is distributed as a binomial with
N = 120 and p = .837.
(Cleveland, 1981). In a small interval, lowess fits the points with a poly-
nomial regression line. Points farther away from the center of the interval
are weighted less than those near the center. These polynomial fits are made
across the range of the points, and a kind of “average” line is constructed.
In this way, a fit to the points is generated without recourse to parametric
assumptions.
In R, lowess is implemented by the commands lowess() or loess(). The
syntax of the former is somewhat more convenient, and we use it here. The
nonparametric smoothing line may be added to the plot by
lines(lowess(dat$freq,dat$rt),lty=2)
The nonparametric smooth is the dotted line (and is produced with the
option lty=2). The regression line overestimates the nonparametric smooth
in the middle of the range while underestimating it in the extremes. The
nonparametric smooth is a more faithful representation of the data. The
discordance between the nonparametric line and the regression line indicates
a misfit of the regression model.
RT is frequently modeled as a function of the logarithm of word frequency.
Figure 7.7, right panel, shows this relationship along with regression line and
lowess nonparametric smooth. The figure is created with the following code:
logfreq=log(dat$freq)
plot(logfreq,dat$rt,cex=.3,pch=20,col=’red’)
w=seq(2,20,4) # tick marks for axis drawn in next statement
axis(3,at=log(w),labels=w) #create axis on top
g=lm(dat$rt~logfreq)
summary(g)
abline(g)
lines(lowess(logfreq,dat$rt),lty=2)
The new element in the code is the use of an axis on top of the figure, and
this is done through the axis() command. The intercept is .775 seconds
and the slope is -.051 second per log unit frequency. The slope indicates
how many seconds faster RT is when word frequency is multiplied by e, the
base of the natural logarithm (e ≈ 2.72).
dat$length=as.factor(dat$length)
summary(lm(rt~freq+length+neigh,data=dat))
Φ−1(ph) = d′/σ + Φ−1(pf)/σ.

This last equation describes a line with slope 1/σ and intercept d′/σ.
Based on this fact, researchers have used linear regression of z-ROC
plots to estimate σ and d′ (e.g., Ratcliff, Sheu & Gronlund, 1993). Specifically:

d̂′ = b̂0/b̂1,
σ̂ = 1/b̂1,

where b̂0 and b̂1 are the estimated intercept and slope, respectively.
hit.rate=c(69,89,98)/100
fa.rate=c(12,23,38)/100
z.hit=qnorm(hit.rate)
z.fa=qnorm(fa.rate)
plot(z.fa,z.hit)
g=lm(z.hit~z.fa)
abline(g)
summary(g)
coef=g$coefficients #first element is intercept, second is slope
est.sigma=1/coef[2]
est.dprime=coef[1]/coef[2]
Estimates of the intercept and slope are 2.58 and 1.79, respectively; corre-
sponding estimates of d′ and σ are 1.44 and .558, respectively.
We use the simulation method to assess the accuracy of this approach vs.
the ML approach. Consider a five-condition payoff experiment with true d′ =
1.3, true criteria c = (.13, .39, .65, .91, 1.17) and true σ = 1.2. Assume there
are 100 noise and 100 signal trials for each of the five conditions. For each
replicate experiment, data were generated and then d′ and σ were estimated
two ways: 1. from least squares regression, and 2. from the maximum
likelihood method discussed in Section x.x. Figure 7.8 shows the results.
The top panel shows a histogram of estimates of d′ for the two methods. The
likelihood estimators are more accurate; the regression method is subject to
an overestimation bias. The middle panel shows the same for σ; once again,
the likelihood method is more accurate. These differences in bias are not
7.5. REGRESSION 189
severe and it is reasonable to wonder if they are systematic. The bottom row
shows scatter plots. The x-axis value is the likelihood estimate for a replicate
experiment; the y-axis value is the regression method estimate. For every
replicate experiment, the regression method yielded greater estimates of d′
and σ showing that there is a systematic difference between the methods. In
sum, the ML method is more accurate because it does not suffer the same
degree of systematic overestimation.
Why are the likelihood estimates better than the regression ones? The
regression method assumes that the regressor variable, the variable on the x-
axis, is known with certainty. Consider the example in Your Turn in which we
regressed IQ gain (denoted Yi ) onto the number of days of treatment (denoted
Xi ) for autistic children. The x-axis variable, time in treatment, is, for each
participant, a constant that is known to exact precision. This statement
holds true regardless of the overall distribution of these times. This precision
is assumed in the regression model. Let’s explore a violation. Suppose, due
to some sloppy bookkeeping, we had error-prone information about how long
each individual was in therapy. Let Xi′ be the length recorded, and let’s
assume Xi′ ∼ Normal(Xi, 5). We wish to recover the slope and intercept
when we regress Y onto X. Since we don’t have true values Xi, we use our
error-prone values X ′ instead. You will show, in the following problem, that
the estimates of slope and intercept are biased. The bias in slope is always
toward zero; that is, estimated slopes are always flattened with respect to
the true slope.
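The attenuation can be demonstrated in a short simulation. Recall that this book writes Normal(µ, σ²), so the second arguments below are variances; the true intercept and slope (0 and 1) and the sample size are arbitrary choices:

```r
set.seed(1)
I=10000
x=rnorm(I,100,sqrt(10))       #true treatment durations, Xi ~ Normal(100, 10)
y=0+1*x+rnorm(I,0,5)          #outcome: true slope is 1
x.err=x+rnorm(I,0,sqrt(5))    #error-prone records, Xi' ~ Normal(Xi, 5)
coef(lm(y~x))[2]              #approximately 1
coef(lm(y~x.err))[2]          #attenuated toward zero: roughly 10/(10+5) = 2/3
```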
Let’s return to the signal detection problem. Let’s say we wanted to study
subliminal priming. Subliminal priming is often operationalized as priming
in the absence of prime detectability (cite). For instance, imagine a paradigm
where trials are constructed as in Figure 7.9. The task is to determine whether
a given number is less than or greater than 5. With the sequence in Figure 7.9,
there are two possible types of trial: participants can be instructed to respond
Figure 7.9: The sequence of displays in a trial: a foreperiod (578 ms), a
forward mask of number signs (66 ms), the prime (here “2”, 22 ms), a
backward mask (66 ms), and the target (here “6”, 200 ms), after which the
trial waits until the response.
to the prime or the target. When participants respond to the target, the
prime can actually affect the decision to the target. If both numbers are on
the same side of 5, responses are faster than if they are not.
In a subliminal priming paradigm, the primes are displayed fast enough
that it is difficult to see them. If there is a priming effect in the target
task when detectability in the prime task is 0, we call this subliminal prim-
ing. Although this is conceptually straightforward, actually establishing that
detectability is 0 is difficult.
One method proposed by Greenwald et al (cite) uses linear regression to
establish subliminal priming. In this method, priming effects are regressed
onto detectability (typically measured by d′ ), and a non-zero y intercept is
taken as evidence for subliminal priming. An example of this is shown in
Figure 7.10.
γi = d′i (7.10)
d′i ≥ 0 (7.11)
In this case, the priming effect for the ith subject is linearly related to
detectability. Follow these steps:
3. Sample 50 signal and 50 noise trials for each person, and estimate
d′ . Call these d̂′i
5. Are the slope and intercept estimates biased? What is the true
type-I error rate for the intercept if you use p < .05 to reject the
null? What does this mean for inferences of subliminal priming?
Response Times
The main emphasis of this chapter and the next is the modeling of response time.
Response time may be described as the time needed to complete a specified
task. Response time typically serves as a measure of performance. Stimuli
that yield lower response times are thought to be processed with more mental
facility than those that yield higher response times. In this sense, response
time is often used analogously to accuracy, with the difference being that a
lower RT corresponds to better performance. Although this direct use of RT
is the dominant mode, there are a growing number of researchers who have
used other characteristics of RT to draw more detailed conclusions about
processing. This chapter and the next provide some of the material needed
to study RT models. The goal of this chapter is to outlay statistical models
of response time; the goal of the next is to briefly introduce a few process
models.
196 CHAPTER 8. RESPONSE TIMES
x1=c(0.794, 0.629, 0.597, 0.57, 0.524, 0.891, 0.707, 0.405, 0.808, 0.733,
0.616, 0.922, 0.649, 0.522, 0.988, 0.489, 0.398, 0.412, 0.423, 0.73,
0.603, 0.481, 0.952, 0.563, 0.986, 0.861, 0.633, 1.002, 0.973, 0.894,
0.958, 0.478, 0.669, 1.305, 0.494, 0.484, 0.878, 0.794, 0.591, 0.532,
0.685, 0.694, 0.672, 0.511, 0.776, 0.93, 0.508, 0.459, 0.816, 0.595)
x2=c(0.503, 0.5, 0.868, 0.54, 0.818, 0.608, 0.389, 0.48, 1.153, 0.838,
0.526, 0.81, 0.584, 0.422, 0.427, 0.39, 0.53, 0.411, 0.567, 0.806,
0.739, 0.655, 0.54, 0.418, 0.445, 0.46, 0.537, 0.53, 0.499, 0.512,
0.444, 0.611, 0.713, 0.653, 0.727, 0.649, 0.547, 0.463, 0.35, 0.689,
0.444, 0.431, 0.505, 0.676, 0.495, 0.652, 0.566, 0.629, 0.493, 0.428)
[Figure 8.1: boxplots of response time (sec) for Conditions 1 and 2.]
whiskers extend to the most extreme point within 1.5 times the inter-quartile
range past the box. Observations outside the whiskers are denoted with
small circles; these should be considered extreme observations. One advantage
of boxplots is that several distributions can be compared at once. For
the displayed plots, it may be seen that the RT data in the second condition
are quicker and less variable than those in the first condition. Boxplots
are drawn in R with the boxplot() command; Figure 8.1 was drawn
with the command boxplot(x1,x2,names=c("Condition 1","Condition
2"),ylab="Response Time (sec)").
8.1.2 Histograms
The top row of Figure 8.2 shows histograms of the distributions for
each condition separately. We have plotted relative-area histograms as these
converge to probability density functions (see Chapter 4). The advantage of
histograms is that they provide a detailed and intuitive approximation of
the density function. There are two main disadvantages to histograms: First,
it is difficult to draw more than one distribution's histogram per plot. This
fact often makes it less convenient to compare distributions with histograms
than with other graphical methods. For example, the comparison between the
two distributions is easier with two boxplots (Figure 8.1) than with two
histograms. The second disadvantage is that the shape of the histogram
depends on the choice of boundaries for the bins. The bottom row shows
what happens if the bins are chosen too finely (panel C) or too coarsely (panel
D). An alternative approach to histograms is advocated by Ratcliff (1979),
who chooses bins with equal area instead of equal width. An example for
[Figure 8.2: four histogram panels (A–D); axes: Density vs. Response Time.]
Figure 8.2: A & B: Histograms for the first and second conditions, respec-
tively. C & D: Histogram for Condition 1 with bins chosen too finely (5 ms)
and too coarsely (400 ms), respectively.
Condition 1 is shown in Figure 8.3. Here, each bin has area .2, but the
width varies inversely with height. Equal-area histograms may be drawn in R
by setting bin widths with the quantile() command. The figure was drawn
with hist(x1,prob=T,main="",xlab="Response Time",xlim=c(.3,1.5),
breaks=quantile(x1,seq(0,1,.2))).
[Figure 8.3 (the equal-area histogram for Condition 1) and a companion figure comparing the two conditions' density and cumulative probability functions against response time (sec).]
8.2.1 Moments
Moments are a standard set of statistics used to describe a distribution. The
terminology is borrowed from mechanics, in which the moments of an object
describe where its center of mass is, its momentum when spun, and how
much wobble it has when spun. Moments are numbered; that is, we speak
of the first moment, the second moment, etc. of a distribution. The first
moment of a distribution is its expected value; i.e., the first moment of RV
X is E(X) (see Eq. xx and yy). The first moment, therefore, is an index
of the center or middle of a distribution, and it is estimated with the sample
mean. Higher moments come in two flavors, central and raw, and are defined
as follows. Raw moments are given as E(Xⁿ), where n is an integer; the
second and third raw moments, for example, are E(X²) and E(X³). Central
moments are taken about the mean and are given as E[(X − E(X))ⁿ].
8.3. COMPARING DISTRIBUTIONS 201
Central moments are far more common than raw moments. In fact, central
moments are so common that they are sometimes referred to simply as the
moments, and we do so here. The second moment is given by E[(X − E(X))²],
which is also the variance. Variance indexes the spread or dispersion of a
distribution and is estimated with the sample variance (Eq. ). The third
and fourth moments are integral to the computation of skewness and
kurtosis, defined below.
Skewness is the standardized third central moment, E[(X − E(X))³]/σ³; kurtosis is the standardized fourth central moment, E[(X − E(X))⁴]/σ⁴.
Skewness indexes asymmetries in the shape of a distribution. Figure xx
provides an example. The normal distribution (solid line) is symmetric
around its mean; the corresponding skewness value is 0. The distributions
with long right tails all have positive skewness values. If these distributions
had been skewed in the other direction, with long left tails, the skewness
would be negative.
Kurtosis is especially useful for characterizing symmetric distributions. A
normal distribution has a kurtosis of 3.
In R, the sample mean and variance are given by mean() and var(), respectively.
Functions for skewness and kurtosis are not built in, but may be
defined by hand.
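The following definitions are one simple possibility (plain moment estimators; bias-corrected versions are available in contributed packages):

```r
# sample skewness: third central moment over sd cubed
skew=function(x) mean((x-mean(x))^3)/mean((x-mean(x))^2)^1.5
# sample kurtosis: fourth central moment over variance squared
kurt=function(x) mean((x-mean(x))^4)/mean((x-mean(x))^2)^2
```

For a large normal sample, skew() should be near 0 and kurt() near 3.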
8.2.2 Quantiles
In Chapter 3, we explored and defined quantile functions. One method of
describing a distribution is to simply list its quantiles. For example, it is not
uncommon to characterize a distribution by its .1, .3, .5, .7, and .9 quantiles.
This listing is analogous to a boxplot and portrays, in list form, the same
basic information.
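In R, such a listing is obtained with the quantile() function. For instance, with a small illustrative set of RTs:

```r
rt=c(.45,.52,.58,.61,.67,.73,.80,.92,1.05,1.21)  # example RTs (sec)
quantile(rt,c(.1,.3,.5,.7,.9))                   # the five listed quantiles
```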
8.3.1 Location
Figure 8.5 shows distributions that differ only in location. The left, center,
and right panels show the effects of location changes on probability density
functions, cumulative distribution functions, and quantile functions, respectively.
The effect is most easily described with reference to the density and
cumulative distribution functions: a location effect is a shift or translation
of the entire distribution. Location changes are easiest to see in the quantile
functions; the lines are parallel over all probabilities. (In the figure, the
lines may not appear parallel, but this appearance is an illusion. For every
probability, the top line is .2 seconds larger than the bottom one.) It is
straightforward to express location changes formally. Let X and Y be two
random variables with density functions denoted by f and g, respectively,
and cumulative distribution functions denoted by F and G. If X and Y
differ only in location, then the following relations hold:
Y = X + a,
g(t) = f(t − a),
G(t) = F(t − a),
G⁻¹(p) = a + F⁻¹(p),
for location difference a. Location is not the same as mean. It is true that if
two distributions differ only in location, then they differ in mean by the same
amount. The opposite, however, does not hold: several changes other than a
location shift also change the mean.
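These relations are easy to verify numerically. Because sample quantiles are equivariant under shifts, adding a to every observation shifts every quantile by exactly a (a small sketch with simulated data):

```r
set.seed(4)
x=rnorm(10000,.6,.1)     # X
a=.2
y=x+a                    # Y = X + a
# the quantile functions differ by a at every probability
quantile(y,c(.1,.5,.9))-quantile(x,c(.1,.5,.9))
```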
8.3.2 Scale
Scale refers to the dispersion of a distribution, but it is different from variance.
Figure 8.6 shows three distributions that differ in scale. The top row
shows two zero-centered normals that differ in scale. The panels, from left
to right, show density, cumulative distribution, and quantile functions. For
the normal, scale describes the dispersion from the mean (depicted with a
vertical dotted line in the density plot). The middle row shows the case for
a Weibull distribution. The Weibull is a reasonable model for RT as it is
unimodal and skewed to the right; it is discussed subsequently. The depicted
Weibull distributions have the same location and shape; they differ only in
scale. For this distribution, scale describes the dispersion from the lowest
[Figure 8.5 panels: density and cumulative distribution functions against time (sec), and quantile functions against cumulative probability.]
Figure 8.5: Distributions that differ in location are shifted. The plots show,
from left to right, location changes in density, cumulative distribution, and
quantile functions, respectively. The arrowed line at the right shows a difference
of .2 sec, the distance separating the quantile functions for all probability
values.
value (depicted with a vertical line) rather than from the mean. The bottom
row shows scale changes in an ex-Gaussian distribution, another popular
RT distribution that is described subsequently. For this distribution, scale
describes the amount of dispersion from a value that is near the mode (depicted
with a vertical line). The quantile plot of the ex-Gaussian is omitted
as we do not know of a convenient expression for this function.
Y = a + bX,
g(t) = (1/b) f((t − a)/b),
G(t) = F((t − a)/b),
G⁻¹(p) = a + bF⁻¹(p).
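For the normal, these relations can be checked directly with the built-in distribution functions: qnorm(p, a, b) is exactly a plus b times the standard normal quantile.

```r
a=.3; b=2
p=seq(.1,.9,.2)
qnorm(p,mean=a,sd=b)      # G^{-1}(p)
a+b*qnorm(p)              # a + b F^{-1}(p): the same values
```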
8.3.3 Shape
Shape is a catch-all category that refers to any change in distributions that
cannot be described as location and scale changes. Figure 8.7 shows some
examples of shape changes. The left panel shows two distributions; the one
drawn with the dotted line is more symmetric than the one drawn with the
solid line.
[Figure 8.6 panels: density, cumulative distribution, and quantile functions for the normal (score scale), Weibull, and ex-Gaussian (time scale) distributions.]
Figure 8.6: Distributions that differ in scale. The plots show, from left to
right, scale changes in density, cumulative distribution, and quantile functions,
respectively. The top, middle, and bottom rows show scale changes for
the normal, Weibull, and ex-Gaussian distributions, respectively.
[Figure 8.7: three density panels illustrating shape changes.]
The center and right panels show more subtle shape differences. The center
panel shows the case in which the right tail of the dotted-line distribution
is stretched relative to the right tail of the solid-line distribution. There
is no comparable stretching on the left side of the distributions. Because
the stretching is inconsistent, the effect is a shape change and not a scale
change. The right panel shows two symmetric distributions. Nonetheless,
they are different, with the solid one having more mass in the extreme tails
and in the center. There is no stretching that can take one distribution into
the other; hence they have different shapes.
[Figure: QQ plot of the sample data; y-axis: RT in Condition 1.]
One aspect that simplifies the drawing of QQ plots in the above example
is that there are the same number of observations in each condition. QQ
plots can still be drawn when the numbers differ, and R provides
a convenient function, qqplot(). For example, suppose there were only 11
observations in the first condition: z=x1[1:11]. The QQ plot is drawn with
qqplot(x2,z). In this plot, there are 11 points. The function computes the
appropriate 11 values from x2 corresponding to the same quantiles as the 11
points in z.
QQ plots graphically depict the location-scale-shape loci of the effects of
variables. Figure 8.9 shows the relationships. The left column shows the
pdfs (top) of two distributions that differ in location. The resulting QQ plot
(bottom) is a straight line with a slope of 1.0; the y-intercept indexes the
degree of location change. The middle column shows the same for scale. The
QQ plot is a straight line, and the slope denotes the scale change. In the figure,
the scale of the slow distribution is twice that of the fast one; the slope of
the QQ plot is 2. The right column shows the case for shape changes. If
shape changes, then the QQ plots are no longer straight lines and show some
degree of curvature. The QQ plot of the sample data x1 and x2 (Figure ??)
indicates that the primary difference between the conditions is a scale effect.
One drawback to QQ plots is that it is often visually difficult to inspect
small effects. Typical effects in subtle tasks, such as priming, are on the
order of 30 ms or less. QQ plots are not ideally suited to express such small
effects because each axis must encompass the full range of the distribution,
which often spans a second or more. The goal, then, is to produce
a graph that, like the QQ plot, is diagnostic for location, scale, and shape
changes, but for which small effects are readily apparent. The solution is
the delta plot, which is shown in Figure 8.10. Like QQ plots, these plots
are derived from quantiles. The y-axis in these plots is the difference between
quantiles; the x-axis is the average of quantiles. The R code to draw
delta plots is
p=seq(.1,.9,.1)
df=quantile(x1,p)-quantile(x2,p)
av=(quantile(x1,p)+quantile(x2,p))/2
plot(av,df,ylim=c(-.05,.25),
ylab="RT Difference (sec)",xlab="RT Average (sec)")
abline(h=0)
axis(3,at=av,labels=p,cex.axis=.6,mgp=c(1,.5,0))
[Figure 8.9: three columns of panels, each showing a pair of densities (top) and the corresponding QQ plot against Condition 2 (bottom).]
Figure 8.9: QQ plots are useful for comparing distributions. Left: Changes
in location affect the intercept of the QQ plot. Middle: Changes in scale
affect the slope of the QQ plot. Right: Changes in shape add curvature to
the QQ plot.
[Figure 8.10: four delta plots; axes: RT Difference (sec) vs. RT Average (sec), with decile probabilities marked along the top.]
Figure 8.10: Delta plots are also useful for comparing distributions. Top-
Left: Delta plot of two distributions. Top-Right: Location changes result in
a straight horizontal delta plot. Bottom-Left: Changes in scale are reflected
in an increasing slope in the delta plot. Bottom-Right: Changes in shape
result in a curved delta plot.
8.5. STOCHASTIC DOMINANCE AND RESPONSE TIMES 211
value is (x1(p) − x2(p))/√2, where x1(p) and x2(p) are the observed pth
quantiles for observations x1 and x2, respectively. Multiplying the new y-axis
value by √2 and dividing the new x-axis value by √2 yields the delta plot.
Hence, delta plots retain all of the good features of QQ plots, but are placed
on a more convenient scale.
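The rotation arithmetic can be checked with a pair of hypothetical quantiles (the values .7 and .6 are illustrative):

```r
x1p=.7; x2p=.6                           # hypothetical pth quantiles
yrot=(x1p-x2p)/sqrt(2)                   # rotated y-axis value
xrot=(x1p+x2p)/sqrt(2)                   # rotated x-axis value
c(diff=sqrt(2)*yrot, avg=xrot/sqrt(2))   # delta-plot coordinates
```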
The parameters ψ, θ, and β are location, scale, and shape parameters, re-
spectively. Figure 8.12 shows the effect of changing each parameter while
8.6. PARAMETRIC MODELS 213
leaving the remaining ones fixed. This figure provides one rationale for the
Weibull distribution: it is a convenient means of measuring location, scale,
and shape. The location parameter ψ is the minimum value, and the scale
parameter θ describes the dispersion of mass above this point. The Weibull
is flexible with regard to shape. When the shape is β = 1.0, the Weibull
reduces to an exponential. As the shape parameter increases past 1.0, the
Weibull becomes less skewed; at β = 3.43, it is approximately symmetric.
In general, the shapes that characterize RT distributions lie between
β = 1.2 and β = 2.5. For the Weibull, changes in location and scale preserve
stochastic dominance, but changes in shape do not. Therefore, the Weibull
is especially useful for modeling manipulations that affect location and scale,
but not those that affect shape. The Weibull has a useful process interpretation,
which we cover in the following chapter.
Maximum likelihood estimation of the Weibull is straightforward as R al-
ready has built-in Weibull functions (dweibull(), pweibull(), qweibull(),
rweibull()). The log likelihood of a single observation t at parameters
(ψ, θ, β) may be computed by dweibull(t-psi, shape=beta, scale=theta,log=T).
The log-likelihood of a set of independent and identically distributed observations,
such as x (page cc), is given by sum(dweibull(x-psi, shape=beta,
scale=theta,log=T)). Estimation of Weibull parameters is straightforward:
#par = psi,theta,beta
nll.wei=function(par,dat)
return(-sum(dweibull(dat-par[1],shape=par[3],scale=par[2],log=T)))
par=c(min(x)-.05,.3,2)  # start psi just below the smallest observation
optim(par,nll.wei,dat=x)
f(t; ψ, µ, σ²) = 1/((t − ψ)σ√(2π)) · exp(−(log(t − ψ) − µ)²/(2σ²)),
where the parameters are (ψ, µ, σ²). Parameters ψ and σ² are location and shape
parameters, respectively. Parameter µ is not a scale parameter, but exp(µ)
is. The reason for the use of µ and σ² in this context is as follows: Let
[Figure 8.12 panels: densities for the Weibull (top row), lognormal (middle row), and inverse Gaussian (bottom row), plotted against time (sec).]
Figure 8.12: Probability density functions for Weibull, lognormal, and inverse
Gaussian distributions. Left: Changes in shift parameters; Middle: Changes
in scale parameters; Right: Changes in shape parameters.
nll.lnorm=function(par,dat)
-sum(dlnorm(dat-par[1],meanlog=par[2],sdlog=par[3],log=T))
Here, parameters (ψ, λ, φ) are location, scale, and shape, respectively. The
following R code provides the negative log-likelihood of the inverse Gaussian
in terms of its density function, dig().
nll.ig=function(par,dat)
-sum(log(dig(dat-par[1],lambda=par[2],phi=par[3])))
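The density dig() is not built into base R, and its definition is not shown here. One common parameterization of the inverse Gaussian has mean λ and shape φ; the sketch below assumes the book's λ and φ play these roles, which may not match the intended parameterization:

```r
# inverse Gaussian density with mean lambda and shape phi;
# this (lambda, phi) mapping is an assumption, not the book's definition
dig=function(t,lambda,phi)
  sqrt(phi/(2*pi*t^3))*exp(-phi*(t-lambda)^2/(2*lambda^2*t))
```

Subtracting the location ψ from the data before calling dig(), as in nll.ig() above, then gives the shifted density.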
which are only stochastically dominant in location and scale. Even though
this stochastic dominance seems to be an advantage, it is not necessarily so.
The inverse Gaussian does not change much in shape across wide ranges of
parameter settings. In fact, shape changes describe more subtle behavior of
the tail without much affecting the overall asymmetry of the distribution.
The consequence is that it is often difficult to estimate inverse Gaussian pa-
rameters. Figure ?? provides an illustration; here vastly different parameters
give rise to similar distributions. It is evident that it would take large sam-
ple sizes to distinguish the two distributions. The inverse Gaussian may be
called weakly identifiable because it takes large sample sizes to identify the
parameters.
8.6.4 ex-Gaussian
The ex-Gaussian is the most popular descriptive model of response time. It
is motivated by assuming that RT is the sum of two processes. For example,
Hohle (1965), who introduced the model, speculated that RT was the sum
of the time to make a decision and the time to execute a response. The first of
these was assumed to be distributed as an exponential (see Figure 8.13);
the second was assumed to be normal. The sum of an exponential and
a normal random variable has an ex-Gaussian distribution. The exponential
component has a single parameter, τ, which describes its scale. The normal
component has two parameters, µ and σ². The ex-Gaussian, therefore, has
three parameters: µ, σ, and τ. Examples of the ex-Gaussian are provided
in Figure 8.13.
The ex-Gaussian pdf is given by
f(t; µ, σ, τ) = (1/τ) exp(µ/τ + σ²/(2τ²) − t/τ) Φ((t − µ − σ²/τ)/σ),
where Φ is the standard normal cumulative distribution function, and is
programmed in R as
dexg<-function(t,mu,sigma,tau)
{
temp1=(mu/tau)+((sigma*sigma)/(2*tau*tau))-(t/tau)
temp2=((t-mu)-(sigma*sigma/tau))/sigma
(exp(temp1)*pnorm(temp2))/tau
}
[Figure 8.13: a 3 × 3 grid of example ex-Gaussian densities plotted against time (sec).]
nll.exg=function(par,dat)
-sum(log(dexg(dat,mu=par[1],sigma=par[2],tau=par[3])))
nll.exg.lss=function(par,dat)
-sum(log(dexg(dat,mu=par[1],sigma=par[2],tau=par[3]*par[2])))
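As a sanity check, the routine can be run end to end on data simulated from known values (here µ = .45, σ = .05, τ = .15, chosen arbitrarily); the snippet repeats the dexg() and nll.exg() definitions so that it is self-contained:

```r
dexg=function(t,mu,sigma,tau){            # as defined above
  temp1=(mu/tau)+((sigma*sigma)/(2*tau*tau))-(t/tau)
  temp2=((t-mu)-(sigma*sigma/tau))/sigma
  (exp(temp1)*pnorm(temp2))/tau
}
nll.exg=function(par,dat)
  -sum(log(dexg(dat,mu=par[1],sigma=par[2],tau=par[3])))
set.seed(8)
dat=rnorm(500,.45,.05)+rexp(500,rate=1/.15)   # simulated ex-Gaussian RTs
# bounds keep sigma and tau positive during the search
fit=optim(c(.4,.1,.1),nll.exg,dat=dat,method="L-BFGS-B",
  lower=c(-Inf,.001,.001))
fit$par    # compare with the generating values .45, .05, .15
```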
[Figure: histograms for two individuals (top row: x1 and x2) and for the combined observations (bottom row: binned/aggregated and z-transformed), with Density plotted against response time.]
of this practice, consider the example in Figure ??. The top row shows
histograms of observations from two individuals. The bottom right shows
the case in which the observations are aggregated, that is, grouped together.
The resulting distribution is more variable than those for the two
individuals. Moreover, it is bimodal where the distributions for the two
individuals are unimodal. In sum, aggregation greatly distorts RT distributions.
We address the question of how to draw a group-level distribution. Defin-
ing a group-level distribution is straightforward if shape is assumed not to
vary across participants and far more difficult otherwise. If shape is invariant,
Tj = ψj + θj Z,
S = ψ̄ + θ̄Z.
decile.p=seq(.1,.9,.1)
x1.d=quantile(x1,decile.p)
x2.d=quantile(x2,decile.p)
# one way to standardize each set of deciles before averaging
z1=(x1.d-mean(x1.d))/sd(x1.d)
z2=(x2.d-mean(x2.d))/sd(x2.d)
mean.mean=mean(c(mean(x1.d),mean(x2.d)))
mean.sd=mean(c(sd(x1.d),sd(x2.d)))
s=mean.sd*(c(z1,z2))+mean.mean
hist(s)
8.7.2 Inference
In the previous section, we described how to draw group-level distributions.
These distributions are useful in exploring the possible loci of effects. In this
section, we discuss a different problem: how to draw conclusions about
the loci of effects. If the researcher is willing to adopt a parametric model,
such as the Weibull or ex-Gaussian, the problem is well formed. For example,
suppose previous research has indicated that an effect is in parameter τ. We
describe how to use a likelihood ratio test for this example. We let Xijk
sub=dat$sub
pos=dat$pos
rt=dat$rt
One new element in this code is the axes=F option in the matplot() statement.
This option suppresses the drawing of axes. When the axes are
automatically plotted, values of 1 and 2 are drawn on the x-axis, which have
no relevance to the reader. We manually add a more appropriate x-axis
with the command axis(1,label=c('Nouns','Verbs'),at=c(1,2)). The
"1" in the first argument indicates the x-axis. The y-axis and box are added
with axis(2) and box(), respectively. The final element to add is the group
means:
grp.m=apply(ind.m,2,mean)
lines(1:2,grp.m,lwd=2)
effect=ind.m[,2]-ind.m[,1]
boxplot(effect,col=’lightblue’,ylab="Verb-Noun Effect (sec)")
errbar(1.3,mean(effect),qt(.975,I-1)*sd(effect)/sqrt(length(effect)),.2)
points(1.3,mean(effect),pch=21,bg=’red’)
abline(0,0)
decile.p=seq(.1,.9,.1)
noun=matrix(ncol=length(decile.p),nrow=I)
verb=matrix(ncol=length(decile.p),nrow=I)
for (i in 1:I)
{
noun[i,]=quantile(rt[sub==i & pos==1],p=decile.p)
verb[i,]=quantile(rt[sub==i & pos==2],p=decile.p)
}
[Figure: response times (sec) for Nouns and Verbs (left), and differences in deciles (sec) plotted against averages of deciles (sec) (right).]
8.8. AN EXAMPLE ANALYSIS
ave=(verb+noun)/2
dif=verb-noun
The last two lines are matrices for the delta plot: the difference between
deciles and the average of deciles. These matrices of deciles may be drawn
with matplot(ave,dif,typ='l'). The result is a colorful mess. The
problem is that matplot() treats each column as a line to be plotted. We
desire a line for each participant, that is, for each row. The trick is to
transpose the matrices. The appropriate command, with a few options to
improve the plot, is
matplot(t(ave),t(dif),typ=’l’,
lty=1,col=’blue’,ylab="Difference in Deciles (sec)",
xlab="Average of Deciles (sec)")
abline(0,0)
These individual delta plots are too noisy to be informative. This fact
necessitates a group-level plot. We therefore assume that participants do not
vary in shape and average deciles across people. Deciles of group-level
distributions are given by
grp.noun=apply(noun,2,mean)
grp.verb=apply(verb,2,mean)
dif=grp.verb-grp.noun
ave=(grp.verb+grp.noun)/2
plot(ave,dif,typ=’l’,ylim=c(0,.05),
ylab="Difference of Deciles",xlab="Average of Deciles")
points(ave,dif,pch=21,bg=’red’)
abline(0,0)
The indication from this plot is that the effect of part-of-speech is primarily
in scale.
The matrix est has seven columns: a participant label, a condition label,
ML estimates of µ, σ, and τ, an indication of convergence, and the minimized
negative log-likelihood. All optimizations should converge, which may
be checked by confirming that sum(est[,6]) is zero. Figure 8.16 (left) provides
boxplots of the effect of part-of-speech on each parameter. The overall effects
on µ, σ, and τ are 9.4 ms, 1.5 ms, and 16.6 ms, respectively. The overall
NLL for this model may be obtained by summing the last column; it is
Researchers typically test the effects of a manipulation on a parameter
with a t-test. For example, the effect of part-of-speech on τ is marginally
significant by a t-test (values here). This t-test approach differs from a
likelihood ratio test in two respects. On one hand, it is less principled, as
the researcher is forced to assume that the parameter estimates are normally
distributed and of equal variance. The likelihood ratio test provides a more
principled alternative, as it accounts for the proper sampling distributions of
the parameters. The above code provides estimates of the general model, in
which all parameters are free to vary. The overall negative log-likelihood for
the general model may be obtained by summing the last column; it is xx.xx.
The restricted model is given by:
[Figure 8.16: boxplots of the Verb−Noun effect (sec).]