The 2x2 Matrix: Contingency, Confusion and the Metrics of Binary Classification, 2nd Edition
Preface to the Second Edition
The principles underpinning this second edition remain the same as in the first:
to describe, extend, and illustrate (through worked examples) the many available
measures used in binary classification. (In this context, “metrics” in the title is used in its common sense, as synonymous with “measures,” and not in its mathematical sense of a distance satisfying the triangle inequality, in the light of which not all the measures discussed here are metrics, e.g. the F measure; Sect. 4.8.3 [1].) As previously,
the basis for most of the worked examples is the dataset of a screening test accuracy
study of the Mini-Addenbrooke’s Cognitive Examination (MACE) administered in
a dedicated clinic for the diagnosis of cognitive disorders.
The whole text has been revised, involving some reorganisation of material.
Specific new material includes:
• A section devoted to the interrelations of Sens, Spec, PPV, NPV, P, and Q (Chap. 2),
as well as the expression of other measures in terms of P and Q, such as Y, PSI,
HMYPSI, MCC, CSI, and F (Chap. 4).
• Introduction of the recently described Efficiency index and its extensions
(Chap. 5).
• Sections on balanced and unbiased measures have been added, for accuracy
(Chap. 3), identification index (Chap. 4), and Efficiency index (Chap. 5).
• Discussion of the (thorny) issue of “diagnostic yield” (Chap. 4).
• More on the number needed (reciprocal) measures and their combination as likelihoods, including new metrics: number needed to classify correctly, number needed to misclassify, likelihood to classify correctly or misclassify (Chap. 5).
• The previously scattered material on “quality” metrics has been brought together
and treated systematically in a new chapter (Chap. 6).
• More on the classification of the various metrics of binary classification (meta-
classification) and on fourfold classifications more generally (Chap. 9).
The audience for the book, as before, is potentially very broad and may include
any researcher or clinician or professional allied to medicine who uses or plans to
use measures of binary classification, for example for diagnostic or screening instruments. In addition, the metrics described may be pertinent to fields such as informatics, data searching and machine learning, and any discipline involving predictions
(e.g. ecology, meteorology, analysis of administrative datasets).
The Scottish Enlightenment philosopher David Hume (1711–1776) in his
Philosophical Essays of 1748 (First Enquiry, Part 4, Sect. 1) wrote that:
All the Objects of human Reason or Enquiry may naturally be divided into two kinds, viz.
Relations of Ideas, and Matters of Fact. Of the first Kind are the Propositions in Geometry,
Algebra, and Arithmetic … [which are] discoverable by the mere operation of thought …
Matters of Fact, which are the second Objects of human Reason, are not ascertain’d to us in
the same Manner; nor is our Evidence of their Truth, however great, of a like Nature with
the foregoing. The contrary of every Matter of Fact is still possible; … (cited in [2], p.253,
427n.xii).
The material presented here, being largely arithmetical and, to a lesser extent, graphical, may be considered to fall into Hume’s category of Relations of Ideas and hence definitionally or necessarily true because purely logical. However, by using empirical datasets to illustrate these operations, the outcomes may also be considered as
at least putative Matters of Fact (if the calculations have been performed correctly!)
and hence contingent because dependent on evidence. Thus, whilst the latter may be
contradicted or rejected in the light of further evidence, the former, the Relations of
Ideas, will persist and, dependent upon their perceived utility, be applicable in any
situation where binary classification is used.
Liverpool, UK A. J. Larner
Acknowledgements My thanks are due to Elizabeth Larner who drew Figs. 8.6 and 9.1–9.4. Also
to Dr. Gashirai Mbizvo with whom I have collaborated productively in work on CSI and F (Chap. 4)
and the ROC plot (Chap. 7). All errors or misconceptions expressed in this work are solely my own.
References
1. Powers DMW. What the F measure doesn’t measure … Features, flaws, fallacies and fixes. 2015. https://arxiv.org/abs/1503.06410.
2. Wootton D. The invention of science. A new history of the scientific revolution. London:
Penguin; 2016.
Preface to the First Edition
It is important to state at the outset what this book is not. It is not a textbook of medical
statistics, as I have no training, far less any expertise, in that discipline. Rather it
is a heuristic, based on experience of using and developing certain mathematical
operations in the context of evaluating the outcomes of diagnostic and screening
test accuracy studies. It therefore stands at the intersection of different disciplines,
borrowing from them (a dialogic discourse?) in the hope that the resulting fusion or
intermingling will result in useful outcomes. This reflects part of a wider commitment
to interdisciplinary studies [6, 16].
Drawing as they do on concepts derived from decision theory, signal detection
theory, and Bayesian methods, 2 × 2 contingency tables may find application in
many areas, for example medical decision-making, weather forecasting, information
retrieval, machine learning and data mining. Of necessity, this volume is written
from the perspective of the first of these areas, as the author is a practising clinician,
but nevertheless it contains much material which will be of relevance to a wider
audience.
Accordingly, this book is a distillate of what I have found to be helpful as a clinician
working in the field of test accuracy studies, specifically related to the screening
and diagnosis of dementia and cognitive impairment, undertaken in the context of
a dedicated cognitive disorders clinic. The conceptual framework described here
is supplemented with worked examples in each section of the book, over 60 in
all, since, to quote Edwin Abbott, “An instance will do more than a volume of
generalities to make my meaning clear” ([1], p. 56). Many of these examples are
based on a large (N = 755) pragmatic prospective screening test accuracy study of one
particular brief cognitive screening instrument, the Mini-Addenbrooke’s Cognitive
Examination (MACE) [3], the use of which in my clinical work has been extensively
analysed and widely presented [7–9, 11–15]. (It has also been included in a systematic
review of MACE [2]). Further analyses of MACE are presented here, along with
material from some of the other studies which have been undertaken in the clinic
[4]. My previous books documenting the specifics of such pragmatic test accuracy
studies of cognitive screening instruments [5, 10] may be categorised as parerga, the
current work being more general in scope and hence applicable to many branches
Liverpool, UK A. J. Larner
Acknowledgements Thanks are due to Alison Zhu, a proper mathematician (1st in mathematics
from Trinity College Cambridge) for advising on many of the equations in this book, although I
hasten to add that any remaining errors are solely my own. Thanks also to Elizabeth Larner who
drew Fig. 7.7 [1st edition; Fig. 8.6 in 2nd edition].
References
1. Abbott EA. Flatland. An edition with notes and commentary by William F Lindgren and
Thomas F Banchoff. New York: Cambridge University Press/Mathematical Association of
America; [1884] 2010.
2. Beishon LC, Batterham AP, Quinn TJ, et al. Addenbrooke’s Cognitive Examination III
(ACEIII) and mini-ACE for the detection of dementia and mild cognitive impairment.
Cochrane Database Syst Rev. 2019;12:CD013282. [There are several numerical errors relating
to my data in this publication, viz.: P7, Summary of findings: PPV for MCI vs. none incorrect
for both 21 and 25 test threshold; P17, Figure 8: Incorrect N (= 754!), TP, FP, TN. Figure 9:
Incorrect TP, FP, FN; P18, Figure 10: Incorrect N (= 756!), FP. Figure 11: Incorrect TP, FP,
TN; P29, Characteristics of individual studies: “22 with MCI” should read “222 with MCI”;
P52, Appendix 6, Summary of included studies: “DSM-V” should read “DSM-IV”.]
3. Hsieh S, McGrory S, Leslie F, Dawson K, Ahmed S, Butler CR, et al. The Mini-Addenbrooke’s
Cognitive Examination: a new assessment tool for dementia. Dement Geriatr Cogn Disord.
2015;39:1–11.
4. Larner AJ. Dementia in clinical practice: a neurological perspective. Pragmatic studies in the
Cognitive Function Clinic. 3rd edition. London: Springer; 2018.
5. Larner AJ. Diagnostic test accuracy studies in dementia: a pragmatic approach. 2nd edition.
London: Springer; 2019.
6. Larner AJ. Neuroliterature. Patients, doctors, diseases. Literary perspectives on disorders of
the nervous system. Gloucester: Choir Press; 2019.
7. Larner A. MACE: optimal cut-offs for dementia and MCI. J Neurol Neurosurg Psychiatry.
2019;90:A19.
8. Larner AJ. MACE for diagnosis of dementia and MCI: examining cut-offs and predictive
values. Diagnostics (Basel). 2019;9:E51.
9. Larner AJ. Applying Kraemer’s Q (positive sign rate): some implications for diagnostic test
accuracy study results. Dement Geriatr Cogn Dis Extra. 2019;9:389–96.
10. Larner AJ. Manual of screeners for dementia. Pragmatic test accuracy studies. London:
Springer; 2020.
11. Larner AJ. Screening for dementia: Q* index as a global measure of test accuracy revisited.
medRxiv. 2020; https://doi.org/10.1101/2020.04.01.20050567.
12. Larner AJ. Defining “optimal” test cut-off using global test metrics: evidence from a cognitive
screening instrument. Neurodegener Dis Manag. 2020;10:223–30.
13. Larner AJ. Mini-Addenbrooke’s Cognitive Examination (MACE): a useful cognitive screening
instrument in older people? Can Geriatr J. 2020;23:199–204.
14. Larner AJ. Assessing cognitive screening instruments with the critical success index. Prog
Neurol Psychiatry. 2021;25(3):33–7.
15. Larner AJ. Cognitive testing in the COVID-19 era: can existing screeners be adapted for
telephone use? Neurodegener Dis Manag. 2021;11:77–82.
16. Larner AJ. Neuroliterature 2. Biography, semiology, miscellany. Further literary perspectives
on disorders of the nervous system. In preparation.
17. Mahon B. The man who changed everything. The life of James Clerk Maxwell. Chichester: Wiley; 2004.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 History and Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Fourfold (2 × 2) Contingency Table . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Marginal Totals and Marginal Probabilities . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Marginal Totals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Marginal Probabilities; P, Q . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.3 Pre-test Odds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Type I (α) and Type II (β) Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Calibration: Decision Thresholds or Cut-Offs . . . . . . . . . . . . . . . . . . 12
1.6 Uncertain or Inconclusive Test Results . . . . . . . . . . . . . . . . . . . . . . . 13
1.7 Measures Derived from a 2 × 2 Contingency Table;
Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Paired Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Error-Based Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Sensitivity (Sens) and Specificity (Spec), or True
Positive and True Negative Rates (TPR, TNR) . . . . . . . . . 18
2.2.2 False Positive Rate (FPR), False Negative Rate
(FNR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Information-Based Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 Positive and Negative Predictive Values (PPV, NPV) . . . . 24
2.3.2 False Discovery Rate (FDR), False Reassurance
Rate (FRR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3.3 Bayes’ Formula; Standardized Positive
and Negative Predictive Values (SPPV, SNPV) . . . . . . . . . 28
2.3.4 Interrelations of Sens, Spec, PPV, NPV, P, and Q . . . . . . . 30
2.3.5 Positive and Negative Likelihood Ratios
(PLR, NLR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.6 Post-test Odds; Net Harm to Net Benefit (H/B) Ratio . . . . 38
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Chapter 1
Introduction
Henceforward the terms TP, FP, FN, and TN will be referred to as “literal notation”, and equations using this notation will be referred to as “literal equations”. Some readers may find such
literal terminology easier to use and comprehend than algebraic notation.
The 2 × 2 table is often set out using algebraic terms, where:
• TP = a
• FP = b
• FN = c
• TN = d
This is shown in Fig. 1.2. Henceforward the terms a, b, c, d will be referred to
as “algebraic notation”, and equations using this notation as “algebraic equations”.
Some readers may find this algebraic terminology easier to use and comprehend than
literal notation.
N is used throughout this book to denote the total number of instances observed
or counted (e.g. patients or observations), in other words a population, such that:
N = TP + FP + FN + TN = a + b + c + d
It should be noted, however, that some authors have used N to denote total negatives or negative instances (FN + TN, or c + d), for which the term q′ is used here (see Sect. 1.3.1).
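As a concrete illustration of this notation, the sketch below (in Python, using purely hypothetical cell counts rather than data from the MACE study) sets out the correspondence between the literal and algebraic terms and the calculation of N.

```python
# A 2 x 2 contingency table in literal and algebraic notation.
# The cell counts are hypothetical, for illustration only (not MACE data).
TP, FP, FN, TN = 40, 10, 5, 45   # literal notation
a, b, c, d = TP, FP, FN, TN      # algebraic notation

# N denotes the total number of instances (the whole population),
# not the total negatives (FN + TN) as in some authors' usage.
N = TP + FP + FN + TN
assert N == a + b + c + d
print(f"N = {N}")                # -> N = 100
```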
In any fourfold or quadripartite classification, the four terms can be paired in six possible combinations, expressed as either totals or probabilities, and these are exhaustive if pairs of two of the same kind are not permitted (if such pairs were permitted, ten pairings would be possible).
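This count of pairings can be verified directly, for example with Python’s itertools, as in the brief sketch below; it is simply a check of the combinatorial claim, with the four cells represented by their algebraic labels.

```python
from itertools import combinations, combinations_with_replacement

cells = ["a", "b", "c", "d"]  # the four cells TP, FP, FN, TN

# Pairs of two different cells: 4 choose 2 = 6 combinations
print(len(list(combinations(cells, 2))))                   # -> 6

# Allowing pairs of two of the same kind adds four more: 10 in total
print(len(list(combinations_with_replacement(cells, 2))))  # -> 10
```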
It is immediately obvious that the four cells of the 2 × 2 table will generate six
marginal values or totals by simple addition in the vertical, horizontal, and diagonal
directions (Fig. 1.3). These totals will be denoted here using lower case letters.
Reading vertically, down the columns:
p = TP + FN = a + c
p′ = FP + TN = b + d
p + p′ = N
In other words, p = positive instances and p′ = negative instances. This may also be illustrated in a simple truth table akin to that used in logic when using “p” and “–p” or “not p” for propositions (Fig. 1.4).
A dataset is said to be balanced when p = p′. A difference in these columnar marginal totals (i.e. in the numbers of positive, p, and negative, p′, instances) is termed class imbalance. This may impact on the utility of some measures derived from the 2 × 2 contingency table.
Reading horizontally, across the rows:
q = TP + FP = a + b
q′ = FN + TN = c + d
q + q′ = N
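Continuing in the same vein, the sketch below computes the marginal totals of a 2 × 2 table from hypothetical cell counts and checks the identities p + p′ = q + q′ = N, together with the simple balance test described above; the names given to the diagonal totals are descriptive choices for this sketch only, not the book’s own symbols.

```python
# Marginal totals of a 2 x 2 contingency table.
# Hypothetical cell counts, for illustration only (not MACE data).
TP, FP, FN, TN = 40, 10, 5, 45
N = TP + FP + FN + TN

p, p_prime = TP + FN, FP + TN          # column totals: condition present / absent
q, q_prime = TP + FP, FN + TN          # row totals: test positive / test negative
correct, incorrect = TP + TN, FP + FN  # diagonal totals (descriptive names only)

# Every pair of marginal totals partitions the same N instances.
assert p + p_prime == q + q_prime == correct + incorrect == N

# A dataset is balanced when p == p_prime; otherwise there is class imbalance.
print(f"p={p}, p'={p_prime}, q={q}, q'={q_prime}, balanced={p == p_prime}")
```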