Fleming 2017
doi: 10.1093/nc/nix007
Research article
Abstract
Metacognition refers to the ability to reflect on and monitor one’s cognitive processes, such as perception, memory and
decision-making. Metacognition is often assessed by whether an observer’s confidence ratings are predictive of objective
success, but simple correlations between performance and confidence are susceptible to undesirable influences such as
response biases. Recently, an alternative approach to measuring metacognition has been developed that characterizes
metacognitive sensitivity (meta-d’) by assuming a generative model of confidence within the framework of signal detection
theory. However, current estimation routines require an abundance of confidence rating data to recover robust parameters,
and only provide point estimates of meta-d’. In contrast, hierarchical Bayesian estimation methods provide opportunities to
enhance statistical power, incorporate uncertainty in group-level parameter estimates and avoid edge-correction
confounds. Here I introduce such a method for estimating metacognitive efficiency (meta-d’/d’) from confidence ratings and
demonstrate its application for assessing group differences. A tutorial is provided on both the meta-d’ model and the
preparation of behavioural data for model fitting. Through numerical simulations I show that a hierarchical approach
outperforms alternative fitting methods in situations where limited data are available, such as when quantifying metacognition in
patient populations. In addition, the model may be flexibly expanded to estimate parameters encoding other influences on
metacognitive efficiency. MATLAB software and documentation for implementing hierarchical meta-d’ estimation (HMeta-d)
can be downloaded at https://github.com/smfleming/HMeta-d.
2 | Fleming
and clinical disorders (David et al. 2012; Moeller and Goldstein 2014).
Metacognitive ‘sensitivity’ can be assessed by the extent to which an observer’s confidence ratings are predictive of their actual success. Consider a simple decision task such as whether a briefly flashed visual stimulus is categorized as being tilted to the left or right, followed by a confidence rating in being correct. The task of assessing response accuracy using confidence ratings is often called the ‘type 2 task’ (Clarke et al. 1959; Galvin et al. 2003) to differentiate it from the ‘type 1 task’ of discriminating between states of the world (e.g. left or right tilts). If higher confidence ratings are given after correct judgments and lower confidence ratings after incorrect judgments, we can ascribe high metacognitive sensitivity to the subject. Thus a sim-
Previously meta-d’ has been fitted using gradient ascent on the likelihood [maximum likelihood estimation (MLE)], minimization of sum-of-squared error (SSE) or using analytic approximation (Maniscalco and Lau 2012; Barrett et al. 2013). However, several factors make a Bayesian approach attractive for typical metacognition studies:
1. Point estimates of meta-d’ are inevitably noisy. Several parameters must be estimated in the signal detection model, including multiple type 2 criteria [specifically, (k − 1) × 2, where k = number of confidence ratings available]. One common issue in cognitive neuroscience is that trial numbers per condition are also low (e.g. in patient studies, or tasks conducted in conjunction with neuroimaging), and fre-
Figure 1. The meta-d’ model. (A) The right-hand panel shows schematic confidence-rating distributions conditional on correct and incorrect
decisions. A subject with good metacognitive sensitivity will provide higher confidence ratings when they are correct, and lower ratings when
incorrect, and these distributions will only weakly overlap (solid lines). Conversely a subject with poorer metacognitive sensitivity will show
greater overlap between these distributions (dotted lines). These theoretical correct/error distributions are obtained by ‘folding’ a type 1 SDT
model around the criterion [see Galvin et al. (2003), for further details], and normalizing such that the area under each curve sums to 1. The
overlap between distributions can be calculated through type 2 ROC analysis (middle panel). The theoretical type 2 ROC is completely
determined by an equal-variance Gaussian SDT model; we can therefore invert the model to determine the type 1 d’ that best fits the observed
confidence rating data, which is labelled meta-d’. Meta-d’ can be directly compared with the type 1 d’ calculated from the subject’s decisions: if
meta-d’ is equal to d’, then the subject approximates the ideal SDT prediction of metacognitive sensitivity. (B) Simulated data from an SDT
model with d’ = 2. The y-axis plots the conditional probability of a particular rating given the first-order response is correct (green) or incorrect
(red). In the right-hand panel, Gaussian noise has been added to the internal state underpinning the confidence rating (but not the decision)
leading to a blurring of the correct/incorrect distributions. Open circles show fits of the meta-d’ model to each simulated dataset.
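As an illustration of the type 2 ROC analysis mentioned in the caption, the following is a minimal Python sketch (not part of the MATLAB toolbox) that cumulates confidence-rating counts for correct and incorrect trials and computes the area under the resulting curve by the trapezoidal rule. The function name `type2_roc` and the convention that counts are ordered from lowest to highest confidence are assumptions made for this example only.

```python
def type2_roc(counts_correct, counts_incorrect):
    """Type 2 ROC points and area from confidence-rating counts.

    Both arguments are lists of counts ordered from the lowest to the
    highest confidence rating (an assumed convention for this sketch).
    """
    n_correct = sum(counts_correct)
    n_incorrect = sum(counts_incorrect)
    if n_correct == 0 or n_incorrect == 0:
        raise ValueError("need at least one correct and one incorrect trial")

    hit_rates = [0.0]  # P(confidence >= y | correct)
    fa_rates = [0.0]   # P(confidence >= y | incorrect)
    cum_correct = 0
    cum_incorrect = 0
    # Sweep the confidence threshold from the highest rating downwards.
    for n_c, n_i in zip(reversed(counts_correct), reversed(counts_incorrect)):
        cum_correct += n_c
        cum_incorrect += n_i
        hit_rates.append(cum_correct / n_correct)
        fa_rates.append(cum_incorrect / n_incorrect)

    # Trapezoidal area under the curve (a non-parametric AUROC2).
    area = sum(
        (fa_rates[j + 1] - fa_rates[j]) * (hit_rates[j + 1] + hit_rates[j]) / 2.0
        for j in range(len(hit_rates) - 1)
    )
    return fa_rates, hit_rates, area
```

When the two rating distributions are identical the curve lies on the major diagonal and the area is 0.5; the more the curve bows upwards, the closer the area is to 1.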
or belief in a particular parameter. The ‘hierarchical’ component of hierarchical Bayes simply indicates that multiple instances of a particular parameter (e.g. across different subjects) are estimated in the same model. The development of efficient sampling routines for arbitrary models such as Markov chain Monte Carlo (MCMC), their inclusion in freely available software packages such as JAGS (http://mcmc-jags.sourceforge.net; last accessed 31st August 2016) and STAN (http://mc-stan.org; last accessed 31st August 2016) and advances in computing power mean that Bayesian estimation of arbitrary models is now straightforward to implement in practice (Kruschke 2014).
In this article, I briefly introduce the meta-d’ model and its hierarchical Bayesian variant [further details of the model can be found in the Appendix and in Maniscalco and Lau (2014)]. I then provide a step-by-step MATLAB tutorial for fitting meta-d’ to single-subject and group data. Finally, I conduct parameter recovery simulations to compare hierarchical Bayesian and standard estimation routines. These results show that, particularly when data are limited, the new HMeta-d method outperforms traditional fitting procedures and provides appropriate control over false positives. Model code and examples are freely available online at https://github.com/smfleming/HMeta-d (last accessed 4th January 2017).
Methods
Outline of the meta-d’ model
The meta-d’ model is summarized in graphical form in Fig. 1A. The raw data for the model fit are the observed distributions of confidence ratings conditional on whether a decision is correct or incorrect. Intuitively, if a subject has greater metacognitive sensitivity, they are able to monitor their decision performance by providing higher confidence ratings when they are correct, and lower ratings when incorrect, and these distributions will only weakly overlap (solid lines). Conversely, a subject with poorer metacognitive sensitivity will show greater overlap between these distributions (dotted lines). The overlap between distributions can be calculated through type 2 receiver operating characteristic (ROC) analysis. The conditional probability
P(confidence = y | accuracy) is calculated for each confidence level; cumulating these conditional probabilities and plotting them against each other produces the type 2 ROC function. A type 2 ROC that bows sharply upwards indicates a high degree of sensitivity to correct/incorrect decisions; a type 2 ROC closer to the major diagonal indicates weaker metacognitive sensitivity.
The area under the type 2 ROC (AUROC2) is itself a useful non-parametric measure of metacognitive sensitivity, indicating how well an observer’s ratings discriminate between correct and incorrect decisions. However, as outlined in the introduction, AUROC2 is affected by type 1 performance. In other words, a change in task performance (d’ or criterion) is expected, a priori, to lead to changes in AUROC2 despite endogenous metacog-
k = number of confidence ratings available. These criteria are response-conditional, with k − 1 criteria following an S1 response and k − 1 criteria following an S2 response (c2,“S1” and c2,“S2”). The raw data comprise counts of confidence ratings conditional on both the stimulus category (S1 or S2) and the response (S1 or S2). Type 1 criterion c and sensitivity d’ are estimated from the data using standard formulae (Macmillan and Creelman 2005) (In HMeta-d there is also a user option for jointly estimating both d’ and meta-d’ in a hierarchical framework).
The fitting of meta-d’ rests on calculating the likelihood of the confidence rating data given a particular type 2 ROC generated by systematic variation of type 1 SDT parameters d’ and c, and type 2 criteria c2. By convention, the prefix ‘meta-’ is added
then the subject approximates the ideal SDT prediction of meta- y;i;j
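The ‘standard formulae’ for the type 1 parameters referred to above are, in the equal-variance Gaussian case, d’ = z(H) − z(FA) and c = −0.5 [z(H) + z(FA)], where z is the inverse of the standard normal cumulative distribution function. A minimal Python sketch is given below; it is not the toolbox code, the function name `type1_sdt` is hypothetical, and the definition of a ‘hit’ as an S2 response to an S2 stimulus is an assumption of this example.

```python
from statistics import NormalDist


def type1_sdt(hit_rate, fa_rate):
    """Equal-variance Gaussian SDT estimates of d' and criterion c.

    hit_rate: P(respond "S2" | stimulus S2)  (assumed definition of a hit)
    fa_rate:  P(respond "S2" | stimulus S1)
    Rates of exactly 0 or 1 have no finite z-score and are rejected here;
    in practice such cells are usually corrected before this step.
    """
    if not (0.0 < hit_rate < 1.0 and 0.0 < fa_rate < 1.0):
        raise ValueError("rates must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf  # inverse standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion
```

Metacognitive efficiency is then the ratio of a fitted meta-d’ to this d’; the sketch does not fit meta-d’ itself.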
The HMeta-d toolbox uses MCMC sampling as implemented in JAGS (Plummer 2003) to estimate the joint posterior distribution of all model parameters, given the model specification and the data. This estimation takes the form of samples from the posterior, with the entire sequence of samples known as a chain. It is important to check that these samples approximate the ‘stationary distribution’ of the posterior; i.e. that they are not affected by the starting point of the chain(s), and the sampling behaviour is roughly constant over time without slow drifts or autocorrelation. The default settings of the toolbox discard early samples to avoid sensitivity to initial values and run multiple chains, allowing the user to diagnose convergence problems as described below.
confidence counts following S1 presentation listed above for subject 1, one would enter in MATLAB:
nR_S1{1} = [100 50 20 10 5 1]
and so on for each subject in the dataset. These cell arrays then contain confidence counts for all subjects, and are passed in one step to the main HMeta-d function:
fit = fit_meta_d_mcmc_group(nR_S1, nR_S2)
An optional third argument to this function is mcmc_params, which is a structure containing fields for choosing different model variants, and for specifying the details of the MCMC routine. If omitted, reasonable default settings are chosen.
The call to fit_meta_d_mcmc_group returns a ‘fit’ structure
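To make the data format concrete, the following minimal Python sketch tallies trial-wise data into count vectors of the kind shown above. It is an illustration rather than the toolbox's own helper. The function name `trials_to_counts`, the 0/1 coding of S1/S2 and the ordering of the cells (responses of “S1” from highest to lowest confidence, followed by responses of “S2” from lowest to highest confidence) are assumptions of this example; the ordering is chosen to be consistent with the six-element vector for three confidence levels given in the text.

```python
def trials_to_counts(stimulus, response, rating, n_ratings):
    """Tally trials into response-conditional confidence counts.

    stimulus, response: sequences of 0 (S1) or 1 (S2), one entry per trial.
    rating: sequence of integer confidence ratings in 1..n_ratings.
    Returns (nR_S1, nR_S2), each a list of length 2 * n_ratings, ordered as
    ["S1" rating k, ..., "S1" rating 1, "S2" rating 1, ..., "S2" rating k]
    (an assumed ordering for this sketch).
    """
    nR_S1 = [0] * (2 * n_ratings)
    nR_S2 = [0] * (2 * n_ratings)
    for s, r, conf in zip(stimulus, response, rating):
        if not 1 <= conf <= n_ratings:
            raise ValueError("rating out of range")
        # "S1" responses fill the first half in descending confidence,
        # "S2" responses fill the second half in ascending confidence.
        idx = n_ratings - conf if r == 0 else n_ratings + conf - 1
        if s == 0:
            nR_S1[idx] += 1
        else:
            nR_S2[idx] += 1
    return nR_S1, nR_S2
```

For a group analysis, one such pair of vectors would be built per subject and collected into the two cell arrays passed to the fitting function.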
In addition to enabling inference on individual parameter distributions, there may be circumstances in which we wish to compare models of different complexity (see ‘Discussion’ Section). To enable this, JAGS returns the deviance information criterion (DIC) for each model, which is a summary measure of goodness of fit that penalizes model complexity (Spiegelhalter et al. 2002; lower is better). While DIC is known to be somewhat biased towards models with greater complexity, it is a common metric for assessing model fit in hierarchical models. In HMeta-d the DIC for each model can be obtained in fit.mcmc.dic.
Simulations
To assess properties of the model fit and compare alternative fitting procedures, simulated confidence rating data were generated for pre-specified levels of metacognitive efficiency. Type 2 probabilities P(conf = y | stim, resp) were computed from the equations in the Appendix for particular settings of meta-d’, c and c2. These probabilities were then used to generate multinomial response counts using the function mnrnd in MATLAB, where the sample size of each type 1 response class (hits, false alarms, correct rejections and misses) was obtained from a standard type 1 SDT model with criterion c and d’. This allowed for independent control over d’ (i.e. the number of hits and false alarms) and meta-d’ (the response-conditional distribution of confidence ratings). After determining the value of d’ for each simulation, the relevant value of meta-d’ could then be chosen to ensure a particular target meta-d’/d’ level. This procedure is implemented in the MATLAB function metad_sim included as part of the toolbox.
Results
Example fit
Figure 3A shows the output of a typical call to HMeta-d and the resultant posterior samples of the population-level estimate of metacognitive efficiency, μ_meta-d’/d’, plotted with plotSamples. The data were generated as 20 simulated subjects, each with 400 trials and 4 possible confidence levels (confidence criteria c2 = ±[0.5 1 1.5]; type 1 criterion c = 0). For each subject, type 1 d’ was sampled from a normal distribution N(2, 0.2), and meta-d’/d’ was fixed at 0.8. The chains show excellent mixing with a modest number of samples (10 000 per chain; R̂ = 1.000) and the posterior is centred around the ground truth simulated value.
Parameter recovery
To further validate the model, a parameter recovery exercise was carried out in which data were simulated from 7 groups of 20 subjects with different levels of meta-d’/d’ = [0.5 0.75 1.0 1.25 1.5 1.75 2]. All other settings were as described in the previous section. Figure 3B plots the fitted group-level μ_meta-d’/d’ and its associated 95% CI for each of the simulated datasets against the empirical ground truth, demonstrating robust parameter recovery.
Empirical examples
To illustrate the practical application of HMeta-d I fit data from a recent experiment that examined metacognitive sensitivity in perceptual and mnemonic tasks in patients with post-surgical lesions and controls (Fleming et al. 2014). This study found (using single-subject estimates of meta-d’/d’) that metacognitive efficiency in patients with lesions to anterior prefrontal cortex (aPFC) was selectively compromised on a visual perceptual task but unaffected on a memory task, suggesting that the neural architecture supporting metacognition may comprise domain-specific components differentially affected by neurological insult.
For didactic purposes, here I restrict the comparison of metacognition to the aPFC patients (N = 7) and healthy controls (HC; N = 19) on the perceptual task. The task required a two-choice discrimination as to which of the two briefly presented patches contained a greater number of small white dots, followed by a continuous confidence rating on a sliding scale from 1 (low confidence) to 6 (high confidence). For analysis these confidence ratings were binned into four quantiles. For each subject confidence rating data (levels 1–4) were sorted according to the position of the target stimulus (L/R) and the subject’s response (L/R), thereby specifying the two nR_S1 and nR_S2 arrays required for estimating meta-d’.
For each group I constructed cell arrays of confidence counts and estimated μ_meta-d’/d’ with the default settings in HMeta-d. The resultant posterior distributions are plotted in the left panel
of Fig. 4A, and the posterior distribution of the difference is shown in the right panel. Several features are evident from these outputs. First, there is a reduced metacognitive efficiency in the aPFC group compared with controls, as revealed by the 95% CI of the difference being greater than zero (right-hand panel). Second, the posterior distribution of metacognitive efficiency in the healthy controls is overlapping with the optimal estimate of 1. Finally, for the aPFC group, which comprises fewer subjects, there is a higher degree of uncertainty about the true metacognitive efficiency: the width of the posterior distribution is greater. This is due to the parameter estimate being constrained by fewer data points and is a natural consequence of the Bayesian approach.
Comparison of fitting procedures
To compare the quality of the fit of the hierarchical Bayesian method against MLE and SSE point-estimate approaches, I ran a series of simulation experiments to investigate parameter recovery of known meta-d’/d’ ratios for different d’ and type 2 criteria placements across a range of trial counts.
In each experiment, I simulated confidence rating data for groups of N = 20 subjects while manipulating the number of trials (20, 50, 100, 200, 400). In the first set of experiments, type 1 d’ was selected from the set (0.5, 1, 2), and two type 2 criteria were specified such that ±c2’ = 1. The generated data thus consisted of a 2 (stimulus) × 2 (responses) × 2 (high/low confidence) matrix of response counts. In the second set of experiments type 1 d’ was kept constant at 1, and the type 2 criteria were selected from the set ±c2’ = (0.5, 1, 2). Generative meta-d’/d’ was fixed at 1, and type 1 criterion was fixed at 0.
Each simulated subject’s data was fit using the MLE and SSE routines available from http://www.columbia.edu/~bsm2105/type2sdt/, correcting for zero response counts by adding 0.25 to all cells [a generalization of the log-linear correction typically applied when estimating type 1 d’, as recommended by Hautus (1995)]. For each group of 20 subjects the mean meta-d’/d’ ratio and the output of a one-sample t-test against the null value of 1 were stored. The same data (without padding) were entered into the hierarchical Bayesian estimation routine as described above and the posterior mean stored. A false positive was recorded if a one-sample t-test against the null value (meta-d’/d’ = 1) was significant (P < 0.05) for the MLE/SSE approaches, or if the symmetric 95% credible interval excluded 1 for the hierarchical Bayesian approach. This procedure was repeated 100 times for each setting of trial counts and parameters.
Figure 5A and B shows the results of Experiments 1 and 2, respectively, for medium levels of metacognitive efficiency
HMeta-d: estimating metacognitive efficiency | 9
Figure 6. Simulation experiments: low metacognitive efficiency (meta-d’/d’ = 0.5). For legend see Fig. 5.
(meta-d’/d’ = 1). For intermediate values of d’ and criteria (middle panels), all methods perform similarly, and recover the true meta-d’/d’ ratio. However when d’ is low, or criteria are extreme, the MLE and SSE methods tend to misestimate metacognitive efficiency when the number of trials per subject is < 100, leading to high false positive rates. These misestimations are similar to the effect of zero cell-count corrections on recovery of type 1 d’ (Hautus 1995). In contrast, HMeta-d provides accurate parameter recovery in the majority of cases.
Why does HMeta-d outperform classical estimation procedures in this case? There are two possible explanations. First, HMeta-d may be more efficient at retrieving true parameter values, even when trial counts are low, by avoiding padding and capitalizing on the hierarchical structure of the model to mutually constrain subject-level fits. Alternatively, HMeta-d may rely more on the prior when data are scarce, thus shrinking group estimates to the prior mean. The second explanation predicts that HMeta-d would become less accurate when true metacognitive efficiency deviates from the prior mean (meta-d’/d’ ~ 1).
To adjudicate between these explanations I repeated the simulations at low (meta-d’/d’ = 0.5) and high (meta-d’/d’ = 1.5) metacognitive efficiency (Figs 6 and 7). These results show that HMeta-d is able to retrieve the true meta-d’/d’ even when metacognitive efficiency is appreciably less than or greater than 1
(see also Fig. 3B), consistent with the prior exerting limited influence on the results. One notable exception is found when type 1 d’ is high, and trial counts are very low (20 per subject); in this case (upper right-hand panels), all fitting methods tend to overestimate metacognitive efficiency.
Figure 8 provides a summary of false positive rates recorded across all experiments for the three methods. Point-estimate approaches (SSE and MLE) return unacceptably high false positive rates when trial counts are less than 200 per subject, due to consistent over- or underestimation of metacognitive efficiency. In contrast, HMeta-d provides good control of the false positive rate in all cases except when trial counts are very low (<50 per subject).
Priors were specified as follows:
\mu_{M1}, \mu_{M2} \sim \mathcal{N}(0, 1)
\sigma_{M1}, \sigma_{M2} \sim \mathrm{InvSqrtGamma}(0.001, 0.001)
\rho \sim \mathrm{Uniform}(-1, 1).
To demonstrate the application of this expanded model I simulated 100 subjects’ confidence data from the type 2 SDT model in two ‘tasks’. Each task’s generative meta-d’/d’ was
drawn from a bivariate Gaussian with mean μ_M1 = μ_M2 = 0.8 and standard deviations σ_M1 = σ_M2 = 0.5. Type 1 d’ was generated separately for each task from a N(2, 0.2) distribution. The generative correlation coefficient ρ was set to 0.6. Data from both domains are then passed into the model simultaneously, and a group-level posterior distribution on the correlation coefficient ρ is returned. Figure 4B shows this posterior together with the 95% CI, which encompasses the generative correlation coefficient.
Discussion
The quantification of metacognition from confidence ratings is a question with application in several subfields of psychology
More generally, whether one should use metacognitive sensitivity (e.g. meta-d’ or AUROC2) or metacognitive efficiency (meta-d’/d’) as a measure of metacognition depends on the goal of an analysis. For example, if we are interested in establishing the presence or absence of metacognition in a particular condition, such as when performance is particularly low (Scott et al. 2014) or in particular subject groups such as human infants (Goupil et al. 2016), computing metacognitive sensitivity alone may be sufficient. However, when comparing experimental conditions or groups which may differ systematically in performance, estimating metacognitive efficiency appropriately controls for confounds introduced by type 1 performance and response biases. Note however there are also limitations in the applicability of the meta-d’ model. First and foremost, the task
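The generative step of the two-task simulation described earlier (subject-level meta-d’/d’ values drawn from a bivariate Gaussian with a given correlation) can be sketched in a few lines of Python. This is an illustration only, not the toolbox code: the function names `sample_task_efficiencies` and `pearson` are hypothetical, the default values simply mirror those quoted in the text (mean 0.8, standard deviation 0.5, ρ = 0.6), and the correlated pair is built from two independent standard normal draws, which is one standard construction among several.

```python
import math
import random


def sample_task_efficiencies(n_subjects, mu=0.8, sigma=0.5, rho=0.6, seed=0):
    """Draw (task 1, task 2) meta-d'/d' pairs from a bivariate Gaussian.

    Both tasks share mean `mu` and standard deviation `sigma`; `rho` is the
    generative correlation. Uses z2' = rho * z1 + sqrt(1 - rho^2) * z2.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_subjects):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        m1 = mu + sigma * z1
        m2 = mu + sigma * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        pairs.append((m1, m2))
    return pairs


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

With only 100 subjects, as in the simulation described in the text, the sample correlation of such draws can differ noticeably from the generative ρ, which is one reason a posterior interval on ρ is more informative than a point estimate.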
efficiencies across domains. More broadly, it may be possible to specify flexible general linear models linking trial- or subject-level variables to meta-d’ (Kruschke 2014). Currently this requires bespoke model specification, but in future work we hope to provide a flexible user interface for the specification of arbitrary models (cf. Wiecki et al. 2013). Estimation of single-trial influences on metacognitive efficiency, such as attentional state or brain activity, is a particularly intriguing proposition. Currently, estimation of meta-d’ requires many trials, restricting studies of the neural basis of metacognitive efficiency to between-condition or between-subject analyses. Extending the HMeta-d framework to estimate trial-level effects on meta-d’ may therefore accelerate our understanding of the neural basis of metacognitive efficiency.
ability for memory and perception. J Neurosci 2013;33:16657–65. http://doi.org/10.1523/JNEUROSCI.0786-13.2013
Barrett AB, Dienes Z, Seth AK. Measures of metacognition on signal-detection theoretic models. Psychol Methods 2013;18:535–52. http://doi.org/10.1037/a0033268
Charles L, Van Opstal F, Marti S, et al. Distinct brain mechanisms for conscious versus subliminal error detection. NeuroImage 2013;73:80–94. http://doi.org/10.1016/j.neuroimage.2013.01.054
Clarke F, Birdsall T, Tanner W. Two types of ROC curves and definition of parameters. J Acoust Soc Am 1959;31:629–30.
David AS, Bedford N, Wiffen B, et al. Failures of metacognition and lack of insight in neuropsychiatric disorders. Philos Trans R Soc B Biol Sci 2012;367:1379–90. http://doi.org/10.1098/rstb.
Ko Y, Lau H. A detection theoretic explanation of blindsight suggests a link between conscious perception and metacognition. Philos Trans R Soc B Biol Sci 2012;367:1401–11.
Kruschke JK. Doing Bayesian Data Analysis. Academic Press, 2014.
Lau HC, Rosenthal D. Empirical support for higher-order theories of conscious awareness. Trends Cogn Sci 2011;15:365–73. http://doi.org/10.1016/j.tics.2011.05.009
Lee MD. BayesSDT: software for Bayesian inference with signal detection theory. Behav Res Methods 2008;40:450–6.
Lee MD, Wagenmakers E-J. Bayesian Cognitive Modeling: A Practical Course. Cambridge: Cambridge University Press, 2014.
Lichtenstein S, Fischhoff B, Phillips LD. Calibration of probabilities: the state of the art to 1980. In: Kahneman D, Slovic P, Tversky A (eds), Judgment under Uncertainty: Heuristics and
Palmer EC, David AS, Fleming SM. Effects of age on metacognitive efficiency. Conscious Cogn 2014;28:151–60. http://doi.org/10.1016/j.concog.2014.06.007
Persaud N, McLeod P, Cowey A. Post-decision wagering objectively measures awareness. Nat Neurosci 2007;10:257–61.
Plummer M. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In: Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 2003.
Pouget A, Drugowitsch J, Kepecs A. Confidence and certainty: distinct probabilistic quantities for different goals. Nat Neurosci 2016;19:366–74. http://doi.org/10.1038/nn.4240
Rabbitt P, Vyas S. Processing a display even after you make a response to it. How perceptual errors can be corrected. Quart J Exp Psychol Sect A 1981;33:223–39. http://doi.org/10.1080/
Appendix
Type 2 SDT model equations
\mathrm{Prob}(\mathrm{conf} = y \mid \mathrm{stim} = S2, \mathrm{resp} = \text{“S1”}) = \frac{\Phi\left(c_{2,\text{“S1”}}(y), \tfrac{d'}{2}\right) - \Phi\left(c_{2,\text{“S1”}}(y + 1), \tfrac{d'}{2}\right)}{\Phi\left(c, \tfrac{d'}{2}\right)}
For a discrete confidence scale ranging from 1 to k, k − 1 type 2 criteria are required to rate confidence for each response type. We define type 2 confidence criteria for S1 and S2 responses as: