The 2x2 Matrix: Contingency, Confusion and the Metrics of Binary Classification, 2nd Edition
Preface to the Second Edition
The principles underpinning this second edition remain the same as in the first:
to describe, extend, and illustrate (through worked examples) the many available
measures used in binary classification. (In this context, “metrics” in the title is used in its common sense, as synonymous with “measures,” and not in its mathematical sense of a distance satisfying the triangle inequality, in the light of which not all the measures discussed here are metrics, e.g. the F measure; Sect. 4.8.3 [1].) As previously,
the basis for most of the worked examples is the dataset of a screening test accuracy
study of the Mini-Addenbrooke’s Cognitive Examination (MACE) administered in
a dedicated clinic for the diagnosis of cognitive disorders.
The whole text has been revised, involving some reorganisation of material.
Specific new material includes:
• A section devoted to the interrelations of Sens, Spec, PPV, NPV, P, and Q (Chap. 2),
as well as the expression of other measures in terms of P and Q, such as Y, PSI,
HMYPSI, MCC, CSI, and F (Chap. 4).
• Introduction of the recently described Efficiency index and its extensions
(Chap. 5).
• Sections on balanced and unbiased measures have been added, for accuracy
(Chap. 3), identification index (Chap. 4), and Efficiency index (Chap. 5).
• Discussion of the (thorny) issue of “diagnostic yield” (Chap. 4).
• More on the number needed (reciprocal) measures and their combination as likelihoods, including new metrics: number needed to classify correctly, number needed to misclassify, likelihood to classify correctly or misclassify (Chap. 5).
• The previously scattered material on “quality” metrics has been brought together
and treated systematically in a new chapter (Chap. 6).
• More on the classification of the various metrics of binary classification (meta-
classification) and on fourfold classifications more generally (Chap. 9).
The audience for the book, as before, is potentially very broad and may include
any researcher or clinician or professional allied to medicine who uses or plans to
use measures of binary classification, for example for diagnostic or screening instruments. In addition, the metrics described may be pertinent to fields such as informatics, data searching and machine learning, and any discipline involving predictions
(e.g. ecology, meteorology, analysis of administrative datasets).
The Scottish Enlightenment philosopher David Hume (1711–1776) in his
Philosophical Essays of 1748 (First Enquiry, Part 4, Sect. 1) wrote that:
All the Objects of human Reason or Enquiry may naturally be divided into two kinds, viz.
Relations of Ideas, and Matters of Fact. Of the first Kind are the Propositions in Geometry,
Algebra, and Arithmetic … [which are] discoverable by the mere operation of thought …
Matters of Fact, which are the second Objects of human Reason, are not ascertain’d to us in
the same Manner; nor is our Evidence of their Truth, however great, of a like Nature with
the foregoing. The contrary of every Matter of Fact is still possible; … (cited in [2], p.253,
427n.xii).
The material presented here, being largely arithmetical and, to a lesser extent, graphical, may be considered to fall into Hume’s category of Relations of Ideas and hence definitionally or necessarily true because purely logical. However, by using empirical datasets to illustrate these operations, the outcomes may also be considered as
at least putative Matters of Fact (if the calculations have been performed correctly!)
and hence contingent because dependent on evidence. Thus, whilst the latter may be
contradicted or rejected in the light of further evidence, the former, the Relations of
Ideas, will persist and, dependent upon their perceived utility, be applicable in any
situation where binary classification is used.
Liverpool, UK A. J. Larner
Acknowledgements My thanks are due to Elizabeth Larner who drew Figs. 8.6 and 9.1–9.4. Also
to Dr. Gashirai Mbizvo with whom I have collaborated productively in work on CSI and F (Chap. 4)
and the ROC plot (Chap. 7). All errors or misconceptions expressed in this work are solely my own.
References
1. Powers DMW. What the F measure doesn’t measure … Features, flaws, fallacies and fixes. 2015. https://arxiv.org/abs/1503.06410.
2. Wootton D. The invention of science. A new history of the scientific revolution. London:
Penguin; 2016.
Preface to the First Edition
It is important to state at the outset what this book is not. It is not a textbook of medical
statistics, as I have no training, far less any expertise, in that discipline. Rather it
is a heuristic, based on experience of using and developing certain mathematical
operations in the context of evaluating the outcomes of diagnostic and screening
test accuracy studies. It therefore stands at the intersection of different disciplines,
borrowing from them (a dialogic discourse?) in the hope that the resulting fusion or
intermingling will result in useful outcomes. This reflects part of a wider commitment
to interdisciplinary studies [6, 16].
Drawing as they do on concepts derived from decision theory, signal detection
theory, and Bayesian methods, 2 × 2 contingency tables may find application in
many areas, for example medical decision-making, weather forecasting, information
retrieval, machine learning and data mining. Of necessity, this volume is written
from the perspective of the first of these areas, as the author is a practising clinician,
but nevertheless it contains much material which will be of relevance to a wider
audience.
Accordingly, this book is a distillate of what I have found to be helpful as a clinician
working in the field of test accuracy studies, specifically related to the screening
and diagnosis of dementia and cognitive impairment, undertaken in the context of
a dedicated cognitive disorders clinic. The conceptual framework described here
is supplemented with worked examples in each section of the book, over 60 in
all, since, to quote Edwin Abbott, “An instance will do more than a volume of
generalities to make my meaning clear” ([1], p. 56). Many of these examples are
based on a large (N = 755) pragmatic prospective screening test accuracy study of one
particular brief cognitive screening instrument, the Mini-Addenbrooke’s Cognitive
Examination (MACE) [3], the use of which in my clinical work has been extensively
analysed and widely presented [7–9, 11–15]. (It has also been included in a systematic
review of MACE [2]). Further analyses of MACE are presented here, along with
material from some of the other studies which have been undertaken in the clinic
[4]. My previous books documenting the specifics of such pragmatic test accuracy
studies of cognitive screening instruments [5, 10] may be categorised as parerga, the
current work being more general in scope and hence applicable to many branches
Liverpool, UK A. J. Larner
Acknowledgements Thanks are due to Alison Zhu, a proper mathematician (1st in mathematics
from Trinity College Cambridge) for advising on many of the equations in this book, although I
hasten to add that any remaining errors are solely my own. Thanks also to Elizabeth Larner who
drew Fig. 7.7 [1st edition; Fig. 8.6 in 2nd edition].
References
1. Abbott EA. Flatland. An edition with notes and commentary by William F Lindgren and
Thomas F Banchoff. New York: Cambridge University Press/Mathematical Association of
America; [1884] 2010.
2. Beishon LC, Batterham AP, Quinn TJ, et al. Addenbrooke’s Cognitive Examination III
(ACEIII) and mini-ACE for the detection of dementia and mild cognitive impairment.
Cochrane Database Syst Rev. 2019;12:CD013282. [There are several numerical errors relating
to my data in this publication, viz.: P7, Summary of findings: PPV for MCI vs. none incorrect
for both 21 and 25 test threshold; P17, Figure 8: Incorrect N (= 754!), TP, FP, TN. Figure 9:
Incorrect TP, FP, FN; P18, Figure 10: Incorrect N (= 756!), FP. Figure 11: Incorrect TP, FP,
TN; P29, Characteristics of individual studies: “22 with MCI” should read “222 with MCI”;
P52, Appendix 6, Summary of included studies: “DSM-V” should read “DSM-IV”.]
3. Hsieh S, McGrory S, Leslie F, Dawson K, Ahmed S, Butler CR, et al. The Mini-Addenbrooke’s
Cognitive Examination: a new assessment tool for dementia. Dement Geriatr Cogn Disord.
2015;39:1–11.
4. Larner AJ. Dementia in clinical practice: a neurological perspective. Pragmatic studies in the
Cognitive Function Clinic. 3rd edition. London: Springer; 2018.
5. Larner AJ. Diagnostic test accuracy studies in dementia: a pragmatic approach. 2nd edition.
London: Springer; 2019.
6. Larner AJ. Neuroliterature. Patients, doctors, diseases. Literary perspectives on disorders of
the nervous system. Gloucester: Choir Press; 2019.
7. Larner A. MACE: optimal cut-offs for dementia and MCI. J Neurol Neurosurg Psychiatry.
2019;90:A19.
8. Larner AJ. MACE for diagnosis of dementia and MCI: examining cut-offs and predictive
values. Diagnostics (Basel). 2019;9:E51.
9. Larner AJ. Applying Kraemer’s Q (positive sign rate): some implications for diagnostic test
accuracy study results. Dement Geriatr Cogn Dis Extra. 2019;9:389–96.
10. Larner AJ. Manual of screeners for dementia. Pragmatic test accuracy studies. London:
Springer; 2020.
11. Larner AJ. Screening for dementia: Q* index as a global measure of test accuracy revisited.
medRxiv. 2020; https://doi.org/10.1101/2020.04.01.20050567.
12. Larner AJ. Defining “optimal” test cut-off using global test metrics: evidence from a cognitive
screening instrument. Neurodegener Dis Manag. 2020;10:223–30.
13. Larner AJ. Mini-Addenbrooke’s Cognitive Examination (MACE): a useful cognitive screening
instrument in older people? Can Geriatr J. 2020;23:199–204.
14. Larner AJ. Assessing cognitive screening instruments with the critical success index. Prog
Neurol Psychiatry. 2021;25(3):33–7.
15. Larner AJ. Cognitive testing in the COVID-19 era: can existing screeners be adapted for
telephone use? Neurodegener Dis Manag. 2021;11:77–82.
16. Larner AJ. Neuroliterature 2. Biography, semiology, miscellany. Further literary perspectives
on disorders of the nervous system. In preparation.
17. Mahon B. The man who changed everything. The life of James Clerk Maxwell. Chichester: Wiley; 2004.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 History and Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Fourfold (2 × 2) Contingency Table . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Marginal Totals and Marginal Probabilities . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Marginal Totals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Marginal Probabilities; P, Q . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Pre-test Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Type I (α) and Type II (β) Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Calibration: Decision Thresholds or Cut-Offs . . . . . . . . . . . . . . . . . . 12
1.6 Uncertain or Inconclusive Test Results . . . . . . . . . . . . . . . . . . . . . . . 13
1.7 Measures Derived from a 2 × 2 Contingency Table;
Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Paired Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Error-Based Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Sensitivity (Sens) and Specificity (Spec), or True
Positive and True Negative Rates (TPR, TNR) . . . . . . . . . 18
2.2.2 False Positive Rate (FPR), False Negative Rate
(FNR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Information-Based Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Positive and Negative Predictive Values (PPV, NPV) . . . . 24
2.3.2 False Discovery Rate (FDR), False Reassurance
Rate (FRR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.3 Bayes’ Formula; Standardized Positive
and Negative Predictive Values (SPPV, SNPV) . . . . . . . . . 28
2.3.4 Interrelations of Sens, Spec, PPV, NPV, P, and Q . . . . . . . 30
2.3.5 Positive and Negative Likelihood Ratios
(PLR, NLR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.6 Post-test Odds; Net Harm to Net Benefit (H/B) Ratio . . . . 38
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Chapter 1
Introduction
Henceforward the terms TP, FP, FN, and TN will be referred to as “literal notation”, and equations using this notation will be referred to as “literal equations”. Some readers may find such
literal terminology easier to use and comprehend than algebraic notation.
The 2 × 2 table is often set out using algebraic terms, where:
• TP = a
• FP = b
• FN = c
• TN = d
This is shown in Fig. 1.2. Henceforward the terms a, b, c, d will be referred to
as “algebraic notation”, and equations using this notation as “algebraic equations”.
Some readers may find this algebraic terminology easier to use and comprehend than
literal notation.
N is used throughout this book to denote the total number of instances observed
or counted (e.g. patients or observations), in other words a population, such that:
N = TP + FP + FN + TN = a + b + c + d
It should be noted, however, that some authors have used N to denote total negatives or negative instances (FN + TN, or c + d), for which the term q′ is used here (see Sect. 1.3.1).
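As a concrete illustration of this notation, the sketch below (in Python, using purely hypothetical cell counts rather than data from the MACE study) sets out the correspondence between the literal and algebraic terms and the calculation of N.

```python
# A 2 x 2 contingency table in literal and algebraic notation.
# The cell counts are hypothetical, for illustration only (not MACE data).
TP, FP, FN, TN = 40, 10, 5, 45   # literal notation
a, b, c, d = TP, FP, FN, TN      # algebraic notation

# N denotes the total number of instances (the whole population),
# not the total negatives (FN + TN) as in some authors' usage.
N = TP + FP + FN + TN
assert N == a + b + c + d
print(f"N = {N}")                # -> N = 100
```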
In any fourfold or quadripartite classification, the four terms can be paired in six possible combinations, expressed as either totals or probabilities, and these are exhaustive if pairs of two of the same kind are not permitted (if such pairs were permitted, ten pairings would be possible).
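This count of pairings can be verified directly, for example with Python’s itertools, as in the brief sketch below; it is simply a check of the combinatorial claim, with the four cells represented by their algebraic labels.

```python
from itertools import combinations, combinations_with_replacement

cells = ["a", "b", "c", "d"]  # the four cells TP, FP, FN, TN

# Pairs of two different cells: 4 choose 2 = 6 combinations
print(len(list(combinations(cells, 2))))                   # -> 6

# Allowing pairs of two of the same kind adds four more: 10 in total
print(len(list(combinations_with_replacement(cells, 2))))  # -> 10
```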
It is immediately obvious that the four cells of the 2 × 2 table will generate six
marginal values or totals by simple addition in the vertical, horizontal, and diagonal
directions (Fig. 1.3). These totals will be denoted here using lower case letters.
Reading vertically, down the columns:
p = TP + FN = a + c
p′ = FP + TN = b + d
p + p′ = N
In other words, p = positive instances and p′ = negative instances. This may also be illustrated in a simple truth table akin to that used in logic when using “p” and “–p” or “not p” for propositions (Fig. 1.4).
A dataset is said to be balanced when p = p′. A difference in these columnar marginal totals (i.e. in the numbers of positive, p, and negative, p′, instances) is termed class imbalance. This may impact on the utility of some measures derived from the 2 × 2 contingency table.
Reading horizontally, across the rows:
q = TP + FP = a + b
q′ = FN + TN = c + d
q + q′ = N
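Continuing in the same vein, the sketch below computes the marginal totals of a 2 × 2 table from hypothetical cell counts and checks the identities p + p′ = q + q′ = N, together with the simple balance test described above; the names given to the diagonal totals are descriptive choices for this sketch only, not the book’s own symbols.

```python
# Marginal totals of a 2 x 2 contingency table.
# Hypothetical cell counts, for illustration only (not MACE data).
TP, FP, FN, TN = 40, 10, 5, 45
N = TP + FP + FN + TN

p, p_prime = TP + FN, FP + TN          # column totals: condition present / absent
q, q_prime = TP + FP, FN + TN          # row totals: test positive / test negative
correct, incorrect = TP + TN, FP + FN  # diagonal totals (descriptive names only)

# Every pair of marginal totals partitions the same N instances.
assert p + p_prime == q + q_prime == correct + incorrect == N

# A dataset is balanced when p == p_prime; otherwise there is class imbalance.
print(f"p={p}, p'={p_prime}, q={q}, q'={q_prime}, balanced={p == p_prime}")
```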