
nature biomedical engineering

Perspective  https://doi.org/10.1038/s41551-022-00988-x

Tackling prediction uncertainty in machine learning for healthcare

Michelle Chua1, Doyun Kim1, Jongmun Choi1, Nahyoung G. Lee2, Vikram Deshpande3, Joseph Schwab4, Michael H. Lev1, Ramon G. Gonzalez1, Michael S. Gee1 & Synho Do1,3

Received: 25 April 2022
Accepted: 17 November 2022
Published online: xx xx xxxx

Predictive machine-learning systems often do not convey the degree of confidence in the correctness of their outputs. To prevent unsafe prediction failures from machine-learning models, the users of the systems should be aware of the general accuracy of the model and understand the degree of confidence in each individual prediction. In this Perspective, we convey the need for prediction-uncertainty metrics in healthcare applications, with a focus on radiology. We outline the sources of prediction uncertainty, discuss how to implement prediction-uncertainty metrics in applications that require zero tolerance to errors and in applications that are error-tolerant, and provide a concise framework for understanding prediction uncertainty in healthcare contexts. For machine-learning-enabled automation to substantially impact healthcare, machine-learning models with zero tolerance for false-positive or false-negative errors must be developed intentionally.

Fully automated machine-learning (ML) systems are expected to transform healthcare delivery by alleviating resource constraints and reducing overall healthcare costs. However, few ML models have been deployed in clinical practice, and those that are being used in the clinic typically provide only decision-making support. This is partly a consequence of high public expectations for healthcare delivery and of demands for high safety and ethical standards from medical professionals. Indeed, ML systems can cause unsafe prediction failures. That is, they can produce an erroneous prediction, and either fail to convey a lack of confidence in the prediction or deceitfully convey a high confidence in its correctness1–5 (Fig. 1). By contrast, safe prediction failure occurs when an ML system produces an erroneous prediction yet conveys a lack of confidence in its correctness, therefore allowing for referral to suitable clinicians for the review of the prediction and eventual error correction (Fig. 1). This is similar to when a junior physician facing a difficult case and feeling unconfident about a decision consults with a more experienced colleague.

Other unresolved safety issues in the medical use of ML systems are imperfect generalizability, black-box decision-making, insensitivity to impact and automation complacency1,6. These safety concerns are in fact interrelated and contribute to (or arise from) inadequate management of prediction uncertainty7. They all converge into a common pathway towards unsafe prediction failure (Fig. 1 and Table 1).

In this Perspective, we discuss the need and use of prediction-uncertainty metrics in healthcare, with a focus on radiology, in particular for applications that demand zero tolerance to errors and for applications that are error tolerant, and provide a framework for the understanding and identification of task-specific and impact-sensitive safety thresholds for prediction uncertainty in healthcare contexts. Prediction uncertainty remains under-recognized by the healthcare community, and prediction-uncertainty metrics are rarely considered or are ineffectively used8–10.

Prediction-uncertainty metrics
Standard metrics for the evaluation of ML models, such as accuracy, sensitivity, specificity and area under the receiver operating characteristic curve, measure overall model performance on a limited test dataset, and provide no indication of model confidence in the correctness of individual predictions.

1Department of Radiology, Massachusetts General Hospital, Boston, MA, USA. 2Department of Ophthalmology, Massachusetts Eye and Ear Infirmary, Boston, MA, USA. 3Department of Pathology, Massachusetts General Hospital, Boston, MA, USA. 4Department of Orthopedic Surgery, Massachusetts General Hospital, Boston, MA, USA. e-mail: [email protected]


[Fig. 1 schematic: out-of-distribution uncertainty (owing to lack of data), aleatoric uncertainty (owing to poor-quality data, such as the inaccurate or inconsistent assignment of ground-truth labels to a subset of the training data) and model uncertainty (including metric uncertainty, arising from choices of model architecture, model hyperparameters, goal metric and evaluation metric) lead to the absence of a reasonable basis, or to an imperfect basis, for discrimination, and hence to prediction uncertainty and prediction failure. Prediction uncertainty conveyed via an appropriate prediction-uncertainty metric and an appropriate task-specific and impact-sensitive threshold enables physician review and safe prediction failure; no physician review, insensitivity to impact and automation complacency lead to unsafe prediction failure.]

Fig. 1 | Unsafe prediction failure and prediction uncertainty. Prediction failure may result from the absence of a reasonable basis for discrimination owing to out-of-distribution uncertainty or from an imperfect basis for discrimination owing to aleatoric uncertainties or to model and metric uncertainties. Imperfect model discrimination can be visualized as an overlap of the positive and negative probability density distributions (graph), which plot the probability of the outcome as a function of the probability of the prediction for positively labelled cases (blue curve) and negatively labelled cases (red curve), respectively. Unsafe prediction failure, insensitivity to impact, and automation complacency result from the inadequate management of prediction uncertainty, owing to the failure of using prediction-uncertainty metrics, a poor choice of metrics, or an inappropriate selection of a prediction-uncertainty threshold.

An unapprised reliance on overall model-performance metrics is therefore a breeding ground for unsafe prediction failure. Models with low overall error on training and test datasets will inevitably make pointwise errors on prospective predictions. A hypothetical model with zero overall error on training and test data is not immune to this, owing to imperfect generalizability (Fig. 1). In fact, healthcare stakeholders should be aware of potentially substantial prospective error rates resulting from poor generalizability11–17 (Supplementary Table 1).

Moreover, overall model-performance metrics cannot be easily updated after initial model testing, and cannot be used for the prospective monitoring of automated ML systems to detect deteriorating performance18. Blindness to prediction uncertainty and poor awareness of it can also reinforce automation complacency (Table 1); that is, users who believe in the infallibility of the ML system may overlook possible alternatives1.

Thus, users of ML systems should be aware of whether an ML model is accurate on the whole, and also understand the extent to which individual predictions are likely to be correct. This will require intentional development of explainable ML models that can be used to audit black-box neural networks for the detection and quantification of prediction uncertainty. ML systems with zero tolerance to errors may also be designed to abstain entirely from making unconfident predictions19,20, yet this approach may be computationally demanding and substantially more costly.
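Such a reject option can also be approximated post hoc around any probabilistic classifier. The minimal sketch below (in Python; the confidence band and the probabilities are hypothetical placeholders, not validated safety thresholds) automates only the predictions that clear the band and defers everything else to clinician review.

```python
import numpy as np

def predict_or_defer(probs, lower=0.05, upper=0.95):
    """Automate only confident predictions; defer the rest to clinician review.

    `probs` are positive-class probabilities; the band (lower, upper) is an
    illustrative placeholder, not a validated safety threshold.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.full(probs.shape, "defer to clinician", dtype=object)
    labels[probs <= lower] = "negative (automated)"
    labels[probs >= upper] = "positive (automated)"
    return labels

# Example: three confident predictions are automated, one is deferred.
print(predict_or_defer([0.01, 0.98, 0.50, 0.03]))
```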


Table 1 | Unresolved safety issues that contribute to prediction uncertainty or arise from the inadequate management of it

Issue: Imperfect generalizability (contributes to prediction uncertainty).
Definition: Poor model performance owing to mismatches in the training and operational datasets.
Prevention: Implementation and adoption of prediction-uncertainty metrics that identify out-of-distribution uncertainty.

Issue: Insensitivity to impact (arises from the inadequate management of prediction uncertainty).
Definition: Failure of ML models to take into account the impact of false-positive and false-negative predictions on healthcare outcomes. Example: an automated ML system is trained to maximize diagnostic accuracy for benign lesions at the expense of the accuracy of detection of malignant lesions.
Prevention: Use of metrics to define task-specific and impact-sensitive thresholds for prediction uncertainty. Example: implement a zero-tolerance strategy for false negatives, by referring for clinician review all cases with a prediction-uncertainty value above the predefined threshold.

Issue: Automation complacency (arises from the inadequate management of prediction uncertainty).
Definition: Undue weight given to model predictions, owing to perceptions of model infallibility. Example: clinicians do not consider possible alternatives when the prediction of the model is consistent with their initial diagnosis.
Prevention: Implementation and adoption of suitable application-specific prediction-uncertainty metrics.

Sources of prediction uncertainty
Prediction failure may result from the absence of a reasonable basis for discrimination, owing to out-of-distribution uncertainty (also known as epistemic uncertainty), or from an imperfect basis for discrimination, owing to aleatoric uncertainty or model and metric uncertainties (Fig. 1). Imperfect model discrimination resulting from aleatoric uncertainty or from model uncertainty can be visualized as an overlap of the positive and negative probability density distributions (Figs. 1 and 2). Non-overlapping positive and negative probability density distributions reflect perfect model discrimination and the complete absence of aleatoric uncertainty and model uncertainty. By contrast, out-of-distribution uncertainty cannot be visually captured by the probability density plot.

Out-of-distribution uncertainty
Out-of-distribution uncertainty results from a lack of sufficient knowledge. In ML, this corresponds to shortcomings in the training datasets, leading to imperfect generalizability (Fig. 1). When ML models are used in real-world tasks and use prospective data from either the same centre or other institutions or social communities, out-of-distribution uncertainty arises from mismatches between the training and operational datasets, owing to population differences, including changes in disease patterns over time; hardware shifts from the use of, for example, different scanners, cameras or sensors1,11,12; or deficiencies in the training data resulting from selection biases or from inadequate representation (as subclasses) of uncommon variations.

Out-of-distribution uncertainty may be reduced by collecting more training data21, by pre-processing target data to mitigate hardware shifts and by filtering target data by, for example, age, gender, ethnicity and other demographic variables. However, filtering target data through patient profiling can introduce bias into the implementation process and may potentially exacerbate healthcare disparities.

Aleatoric uncertainty
Aleatoric uncertainty stems from stochasticity in the data. In ML, this corresponds to training data that are noisy or of poor quality, and results in the inaccurate or inconsistent assignment of ground-truth labels to a subset of the training data (Fig. 1). In particular, such 'label noise' can arise from poor image resolution, from errors in label extraction by natural-language processing, from difficulties experienced by human annotators when trying to discriminate visually similar classes and from inherent subjectivity in the definition of outcomes. For example, the detection of pneumonia as a radiographic diagnosis is inherently more subjective, and is associated with higher inter-rater variability for ambiguous cases compared with the radiographic detection of pulmonary opacity.

The limited availability of high-quality training data, correctly labelled with the outcome of interest, is a recurrent difficulty in the development of ML models. Clinicians must be willing to invest time and resources into the careful annotation or the prioritized reannotation of training data22. Recruitment of multiple experts for consensus ground-truth annotation may be required23. A redefinition of the ML outcome may also be necessary to decrease the inherent subjectivity.

Model uncertainty
Model uncertainty refers to the selection of a model structure and model parameters that best represent and explain the observed data (Fig. 1). In parametric modelling, a statistician may select a logarithmic model for describing the growth of infants and an exponential decay model to quantify drug concentration in the body. In general, goodness-of-fit in neural networks depends on the type and quality of data, the size of the dataset and other dataset characteristics11,24. When the amount and quality of the training data are sufficient, deeper and more complex neural networks should be used to raise model performance25. However, when working with training datasets that are limited in size or quality, simple neural-network architectures may be preferable. Beyond these basic considerations, neural-network optimization is largely a process of trial and error24,25.

ML systems with zero tolerance to errors
An ageing population and the rising prevalence of chronic medical conditions are two main factors in the growing demand for healthcare resources26. The expectations of healthcare outcomes by the public are also increasing26. In particular, there has been an exponential increase in inpatient, outpatient and emergency-department uses of radiological services in the past few decades. This has challenged radiology departments to improve their operational efficiency. The strategic use of automation in ML-amenable tasks, such as the detection of normal chest and musculoskeletal radiographs and large-scale preventive screening, can substantially reduce turnaround times. Such uses of ML may facilitate early diagnoses and decrease the risk of crowding in urgent-care facilities and emergency departments. It could also reduce physician workload and enable expert radiologists to redirect their attention to more complex cases9,10.

But who will be responsible for medical errors stemming from automated ML systems? The debate on the ethics of automation in medicine may benefit from considering fully automated ML systems that are intentionally designed to work within a framework of zero tolerance to potential harms to patients; that is, zero tolerance for false-positive and false-negative errors.

Selection of task-specific and impact-sensitive safety thresholds
Prediction-uncertainty metrics may be defined using the desired negative predictive value (the fraction of negatives that are true negatives) or the positive predictive value (the fraction of positives that are true positives). For example, a negative predictive value of 1 is required if the desired safety threshold for automated diagnosis (for example, of normal chest radiographs) or for automated screening (for diabetic retinopathy, for instance) is zero tolerance for false negatives. Similarly, a positive predictive value of 1 is required if the desired safety threshold is zero tolerance for false positives.

The selection of an impact-sensitive prediction-uncertainty threshold for disease screening requires the judicious determination and quantification of potential harms and benefits27–29. For example, healthcare stakeholders should distinguish between the screening of patients for specialist referral for further non-invasive checks and the screening of patients for invasive confirmatory studies (such as the automated detection of suspicious mammographic lesions for biopsy), for which zero tolerance for false positives may be mandatory. The detailed consideration of potential harms, including any negative psychosocial consequences and the financial costs of overdiagnosis, is particularly important if modest or uncertain benefits are expected27–29.
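As a concrete illustration of the predictive-value requirement discussed above, the minimal sketch below computes the negative and positive predictive values of a binary classifier at a candidate operating threshold on a holdout set; a zero-tolerance deployment would require the relevant value to equal 1. The labels, probabilities and threshold are made up.

```python
import numpy as np

def predictive_values(y_true, probs, threshold=0.5):
    """NPV = TN / (TN + FN) and PPV = TP / (TP + FP) at the given threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(probs) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return npv, ppv

# Hypothetical holdout labels (1 = abnormal) and predicted probabilities.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
probs  = [0.02, 0.10, 0.30, 0.80, 0.45, 0.05, 0.95, 0.01]
npv, ppv = predictive_values(y_true, probs, threshold=0.5)
# An NPV below 1 rules out fully automated negative reporting at this threshold.
print(f"NPV = {npv:.2f}, PPV = {ppv:.2f}")
```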


[Fig. 2 plot: histograms of the positive-prediction and negative-prediction probability distributions across prediction-probability bins from 0–0.1 to 0.9–1.0 (y axis from 0 to 0.6), with a magnified view of the low-probability region in which the zero-false-negative threshold is marked by an asterisk at a prediction probability of 0.0019.]

Prediction probability | Accuracy for the positive class | Positive predictive value | Negative predictive value
0–0.1   | 0.082 (55/667)  | –               | 0.918 (612/667)
0.1–0.2 | 0.250 (31/124)  | –               | 0.750 (93/124)
0.2–0.3 | 0.244 (22/90)   | –               | 0.756 (68/90)
0.3–0.4 | 0.394 (26/66)   | –               | 0.606 (40/66)
0.4–0.5 | 0.528 (28/53)   | –               | 0.472 (25/53)
0.5–0.6 | 0.387 (24/62)   | 0.387 (24/62)   | –
0.6–0.7 | 0.683 (41/60)   | 0.683 (41/60)   | –
0.7–0.8 | 0.659 (56/85)   | 0.659 (56/85)   | –
0.8–0.9 | 0.775 (90/116)  | 0.775 (90/116)  | –
0.9–1.0 | 0.926 (627/677) | 0.926 (627/677) | –

Fig. 2 | Performance metrics for the detection of cardiomegaly on anteroposterior chest radiographs by an InceptionV3 binary classifier, and illustration of the derivation of the prediction-probability threshold for the zero-error detection of normal cardiac silhouettes. The binary-classification performance was evaluated using a representative holdout test dataset comprising 1,000 normal anteroposterior chest radiographs and 1,000 anteroposterior chest radiographs with cardiomegaly. The overlap of the positive and negative prediction-probability distributions is indicative of prediction uncertainty. Below a threshold prediction-probability value (indicated by an asterisk), there is no overlap of the positive and negative prediction-probability distributions (the shaded area denotes the non-overlapping region). As false-negative predictions are excluded below this threshold value, the negative class may be diagnosed without false-negative errors. For clinical classification tasks with relatively high aleatoric and model uncertainties, the zero-error prediction-probability threshold may be encountered only as the prediction probability approaches 0 or 1. The InceptionV3 neural network returns overly confident predictions, with a mild miscalibration of prediction probability. For example, its average accuracy for test predictions with an uncalibrated prediction probability of 0.5–0.6 is only 0.387.

Selection of metrics for prediction uncertainty
In a typical convolutional neural network (the type of ML architecture that is most commonly used in image-based diagnostic applications), the outputs of the last layer are a non-normalized raw score for each class (Fig. 3), which is converted into a prediction probability (by the softmax activation function for multiclass classification, and by a sigmoid activation function for binary classification).

Prediction probability. An overlap of the positive and negative prediction-probability distributions is indicative of aleatoric or model uncertainty (Fig. 1). To avoid this, higher or lower threshold prediction-probability values may be identified (Fig. 2). As false-negative predictions are excluded below the lower threshold value, the negative class may be diagnosed without false-negative errors. Similarly, the positive class may be diagnosed without false-positive errors above the upper threshold value.

Prediction probability is a highly intuitive metric for clinical users, and is easily derived without additional computational costs. However, for clinical diagnostic tasks with high aleatoric or model uncertainties, the zero-error threshold may only be encountered as the prediction probability approaches 0 or 1, which limits the coverage of the target dataset (Table 2 and Fig. 2). By means of a case study of the detection of cardiomegaly on anteroposterior chest radiographs, we show that class weights may be applied to the loss function (that is, the function to be minimized) for the negative or positive class to optimize normal-case or abnormal-case coverage, respectively (Table 2). However, prediction probability may provide only up to 30% of target-dataset coverage (and approximately 20% coverage on average; Table 2). This is consistent with the findings of a recent retrospective study of the automated diagnosis of normal chest radiographs9.

Moreover, prediction probability cannot be relied on to prevent unsafe prediction failures resulting from out-of-distribution uncertainty30. In fact, prediction probability is often deceitful when out-of-distribution inputs are encountered2–4, because neural networks will frequently produce high-confidence prediction errors. For example, an image classifier may return a completely erroneous class prediction for random noise with a prediction probability in excess of 90% (ref. 2).
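The threshold-derivation procedure used for Fig. 2 and Table 2, as described above, can be sketched as follows: the zero-false-negative prediction-probability threshold is taken as the lowest probability assigned to any positively labelled (abnormal) case in a representative holdout set, and normal-case coverage is the fraction of normal cases falling below it. The arrays below are hypothetical stand-ins for holdout predictions.

```python
import numpy as np

def zero_fn_threshold(y_true, probs):
    """Derive the zero-false-negative threshold and the normal-case coverage.

    y_true: 1 for cardiomegaly, 0 for normal; probs: predicted probability of
    cardiomegaly on a representative holdout test set.
    """
    y_true, probs = np.asarray(y_true), np.asarray(probs)
    threshold = probs[y_true == 1].min()                 # lowest probability of any abnormal case
    coverage = np.mean(probs[y_true == 0] < threshold)   # normal cases that can be auto-reported
    return threshold, coverage

# Hypothetical holdout predictions (6 normal cases followed by 4 abnormal cases).
y_true = np.array([0] * 6 + [1] * 4)
probs  = np.array([0.001, 0.01, 0.03, 0.2, 0.6, 0.0005, 0.04, 0.7, 0.9, 0.99])
t, cov = zero_fn_threshold(y_true, probs)
print(f"threshold < {t:.4f}, normal-case coverage = {cov:.0%}")
```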


[Fig. 3 schematic: an input image passes through successive convolution + ReLU and pooling layers (feature extraction), is flattened, and passes through fully connected layers (classification); a softmax activation function converts the non-normalized raw scores into a probabilistic distribution over the classes, for example horse 0.2, zebra 0.7, dog 0.1.]

Fig. 3 | Architecture of a convolutional neural network, and derivation of prediction probabilities. Convolutional neural networks incorporate convolutional layers, in which automatic feature extraction is carried out, and fully connected network layers, in which classification is performed using the extracted features as input. The final fully connected layer outputs a non-normalized raw score for each class that is converted (through a sigmoid or a softmax activation function) into a prediction probability. ReLU, rectified linear unit.
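As a minimal numerical illustration of the output stage in Fig. 3, the sketch below converts non-normalized raw scores into prediction probabilities with softmax (multiclass) and sigmoid (binary); the raw scores are made up.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(logit):
    """Convert a single binary raw score into a probability."""
    return 1.0 / (1.0 + np.exp(-float(logit)))

# Made-up raw scores for the three classes in Fig. 3 (horse, zebra, dog).
print(softmax([1.0, 2.25, 0.3]))   # approximately [0.2, 0.7, 0.1]
# Made-up raw score from a binary cardiomegaly classifier.
print(sigmoid(-2.0))               # approximately 0.12
```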

Table 2 | Performance metrics

Model | Accuracy | Precision | Recall | Prediction-probability threshold | Normal-case coverage (%)
VGG16 | 0.860 | 0.870 | 0.845 | <0.017 | 27.4
InceptionV3 | 0.838 | 0.838 | 0.838 | <0.0019 | 14.5
Xception | 0.837 | 0.818 | 0.865 | <0.024 | 15.2
VGG16a | 0.839 | 0.795 | 0.913 | <0.047 | 30.9
InceptionV3a | 0.843 | 0.820 | 0.880 | <0.0059 | 17.6
Xceptiona | 0.837 | 0.821 | 0.862 | <0.00057 | 17.9
VGG16b | 0.844 | 0.853 | 0.831 | <0.0077 | 36.0
InceptionV3b | 0.831 | 0.801 | 0.880 | <0.011 | 23.6
Xceptionb | 0.840 | 0.837 | 0.844 | <0.023 | 14.0

Performance metrics for the detection of cardiomegaly on anteroposterior chest radiographs, and prediction-probability thresholds for the zero-error detection of normal cardiac silhouettes and the associated normal-case coverage. Neural networks with the VGG16, InceptionV3 and Xception architectures were trained to detect cardiomegaly on anteroposterior chest radiographs. The binary-classification performance was evaluated using a representative holdout test dataset comprising 1,000 normal anteroposterior chest radiographs and 1,000 anteroposterior chest radiographs with cardiomegaly. The prediction-probability threshold values for the zero-error detection were derived from the lowest prediction-probability value associated with false-negative predictions in the representative holdout test dataset (Fig. 2). Experimental class weights were applied to the loss function for the negative class during training to optimize normal-case coverage. aA class weight of 2.0 was applied to the negative class. bA class weight of 4.0 was applied to the negative class.
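The class weighting referred to in the Table 2 footnote is commonly implemented by weighting the loss during training. The sketch below shows one way to do this with Keras; the tiny placeholder network and random data stand in for the chest-radiograph classifiers, and the weight of 2.0 on the negative (normal) class mirrors one of the experimental settings.

```python
import numpy as np
from tensorflow import keras

# Placeholder binary classifier standing in for the VGG16/InceptionV3/Xception models.
model = keras.Sequential([
    keras.Input(shape=(32,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Placeholder data: 0 = normal, 1 = cardiomegaly.
x = np.random.rand(200, 32).astype("float32")
y = np.random.randint(0, 2, size=(200,))

# Weight the negative (normal) class more heavily in the loss, as in Table 2
# (class weights of 2.0 and 4.0 were explored for the negative class).
model.fit(x, y, epochs=1, batch_size=32, class_weight={0: 2.0, 1: 1.0}, verbose=0)
```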

Prediction probability cross-entropy. Cross-entropy is a measure of the difference between comparable probability distributions. In ML, the cross-entropy between the ground-truth probability distribution and the predicted probability distribution is well established and is commonly used as a loss function. Cross-entropy may also be appropriated to quantify intermodel-prediction variability, which is proportional to the degree of aleatoric and model uncertainty for each ground-truth label. For example, an appropriate threshold cross-entropy value for zero-error detection may be derived from the lowest cross-entropy value associated with false-negative predictions in a representative holdout test dataset (Table 3). Below this threshold value, false-negative predictions are excluded, and the negative class may be diagnosed without false-negative errors.

Ensemble-based quantification of prediction uncertainty can be computationally expensive; yet, it is typically practicable, even for large training datasets, owing to high-efficiency modern GPUs and cloud-computing resources. Experimental data suggest that binary cross-entropy threshold values may provide as much as 46.8% of target-dataset coverage, whereas mean cross-entropy threshold values derived from the averaging of binary cross-entropy values from multiple pairwise comparisons may provide more than 50% of target-dataset coverage (Table 3).
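The intermodel cross-entropy described above can be computed directly from two models' predicted probabilities for the same cases. The sketch below treats one model as the reference distribution (as in Table 3), derives the threshold from the lowest cross-entropy among false-negative predictions, and reports the resulting normal-case coverage; all probabilities and labels are hypothetical.

```python
import numpy as np

def binary_cross_entropy(p, q, eps=1e-7):
    """Per-case binary cross-entropy of reference probabilities q under predictions p."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    q = np.clip(np.asarray(q, dtype=float), eps, 1 - eps)
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

# Hypothetical holdout outputs from two networks (probability of cardiomegaly)
# and ground-truth labels (1 = cardiomegaly, 0 = normal).
p_model = np.array([0.02, 0.10, 0.45, 0.30, 0.85, 0.95])
p_ref   = np.array([0.01, 0.40, 0.50, 0.05, 0.90, 0.97])
y_true  = np.array([0,    0,    1,    1,    1,    1])

ce = binary_cross_entropy(p_model, p_ref)
pred_neg = p_model < 0.5                      # negative calls by the first model
false_neg = pred_neg & (y_true == 1)
threshold = ce[false_neg].min()               # lowest cross-entropy among false negatives
coverage = np.mean(ce[pred_neg & (y_true == 0)] < threshold)
print(f"threshold = {threshold:.3f}, normal-case coverage = {coverage:.0%}")
```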
Ensemble-based prediction-uncertainty indicators. As neural networks are stochastic and tend to produce variable predictions, ensembles of neural networks have been used to increase overall model performance. Thus, ensemble-based prediction-uncertainty indicators may provide a more robust quantification of aleatoric and model uncertainties.

Ensemble-based prediction discrepancy. Neural-network stochasticity and intermodel prediction variability may also be harnessed to directly identify false-positive and false-negative predictions. With true negatives defined as a complete consensus from all of the participating neural networks on a negative-class prediction (Table 4), a best performing multi-model ensemble for the detection of normal cardiac silhouettes on anteroposterior chest radiographs detected all but three false-negative errors while providing 60.7% of normal-dataset coverage (Supplementary Table 2). This multi-model ensemble consisted of six neural networks based on the InceptionV3, VGG16 and Xception architectures. The application of class weights to the loss function for the negative class also substantially improved the detection of false-negative errors by ensemble-based prediction discrepancy (Supplementary Table 2). All of the undetected false-negative errors in the experimental dataset were borderline instances of mild cardiomegaly, suggesting that ensemble-based prediction discrepancy is not reliable as a standalone prediction-uncertainty indicator for zero-error-tolerance safety-critical applications if there is high aleatoric uncertainty arising from inter-rater subjectivity in decision-making. Such diagnostic challenges may require visual-similarity analysis using supervised or unsupervised dimensionality-reduction techniques that can more sensitively embed local relationships at the decision boundary (that is, at the line or hyperplane that separates data points belonging to different class labels), particularly if other threshold-based prediction-uncertainty strategies do not provide acceptable target-dataset coverage.


Table 3 | Cross-entropy matrix

As reference distribution: VGG16 | InceptionV3 | Xception | VGG16a | InceptionV3a | Xceptiona | VGG16b | InceptionV3b | Xceptionb
VGG16 | – | 0.078 (30.7%) | 0.37 (40.8%) | 0.21 (32.1%) | 0.072 (21.3%) | 0.10 (35.8%) | 0.12 (44.6%) | 0.10 (25.6%) | 0.22 (27.4%)
InceptionV3 | 0.10 (28.0%) | – | 0.47 (44.7%) | 0.19 (27.0%) | 0.14 (28.1%) | 0.076 (27.4%) | 0.091 (34.6%) | 0.12 (25.0%) | 0.19 (22.3%)
Xception | 0.11 (28.6%) | 0.12 (32.4%) | – | 0.28 (32.4%) | 0.068 (19.8%) | 0.017 (20.6%) | 0.071 (34.2%) | 0.065 (20.4%) | 0.23 (21.9%)
VGG16a | 0.18 (36.0%) | 0.032 (19.9%) | 0.21 (19.2%) | – | 0.074 (25.2%) | 0.074 (35.2%) | 0.10 (33.9%) | 0.070 (24.2%) | 0.23 (26.0%)
InceptionV3a | 0.17 (33.1%) | 0.024 (16.9%) | 0.18 (19.2%) | 0.31 (34.2%) | – | 0.071 (30.0%) | 0.070 (30.1%) | 0.063 (21.7%) | 0.17 (18.2%)
Xceptiona | 0.10 (23.5%) | 0.024 (18.2%) | 0.41 (40.4%) | 0.19 (25.0%) | 0.065 (21.4%) | – | 0.18 (39.7%) | 0.16 (30.6%) | 0.17 (19.7%)
VGG16b | 0.33 (46.8%) | 0.26 (46.7%) | 0.23 (26.1%) | 0.28 (33.8%) | 0.21 (34.3%) | 0.18 (44.3%) | – | 0.26 (39.0%) | 0.12 (14.0%)
InceptionV3b | 0.27 (40.6%) | 0.092 (32.7%) | 0.14 (12.5%) | 0.25 (25.7%) | 0.096 (24.3%) | 0.10 (34.7%) | 0.049 (28.8%) | – | 0.17 (14.8%)
Xceptionb | 0.22 (40.7%) | 0.047 (16.1%) | 0.05 (3.9%) | 0.25 (33.7%) | 0.18 (33.0%) | 0.053 (25.4%) | 0.070 (24.9%) | 0.20 (35.4%) | –
Average binary cross-entropy | 0.36 (47.7%) | 0.32 (50.5%) | 0.42 (40.6%) | 0.56 (47.8%) | 0.32 (42.5%) | 0.22 (44.4%) | 0.35 (52.2%) | 0.41 (48.4%) | 0.41 (40.7%)

Cross-entropy matrix of threshold binary cross-entropy values for the zero-error detection of normal cardiac silhouettes on anteroposterior chest radiographs, and the associated normal-case coverage. The binary cross-entropy threshold values were derived from the lowest binary cross-entropy value associated with false-negative predictions in a representative holdout test dataset comprising 1,000 normal anteroposterior chest radiographs and 1,000 anteroposterior chest radiographs with cardiomegaly. Cross-entropy threshold values associated with more than 40% (italic) and more than 50% (bold) of normal-case coverage are highlighted. Normal-case coverage is shown in parentheses. Self-comparisons (–) are omitted. aA class weight of 2.0 was applied to the negative class. bA class weight of 4.0 was applied to the negative class.

Table 4 | Detection of false-negative test predictions by ensemble-based prediction discrepancy

Case | Label | VGG16 predictiona | InceptionV3 prediction | Xception prediction | Ensemble prediction discrepancy | Ensemble classification | Ensemble accuracy
Case 1 | Normal | Normal | Normal | Normal | N | TN | TN detected
Case 2 | Normal | Normal | Normal | Normal | N | TN | TN detected
Case 3 | Normal | Normal | Cardiomegaly | Normal | Y | FN | TN missed
Case 4 | Cardiomegaly | Normal | Normal | Cardiomegaly | Y | FN | FN detected
Case 5 | Cardiomegaly | Normal | Cardiomegaly | Normal | Y | FN | FN detected
Case 6 | Cardiomegaly | Normal | Normal | Normal | N | FN | FN missed

FN, false negative; TN, true negative. Fraction of FNs missed, one-third. Fraction of FNs detected, two-thirds. Normal-case coverage, 66.7% (two-thirds). aVGG16 was used as the reference neural network.
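The discrepancy rule of Table 4 reduces to a consensus check: a case is auto-reported as a true-negative candidate only if every network in the ensemble predicts the negative class, and any disagreement routes the case to clinician review. The sketch below uses made-up predictions that loosely mirror the pattern of Table 4.

```python
import numpy as np

def consensus_negative(predictions):
    """predictions: array of shape (n_models, n_cases) with 0 = normal, 1 = cardiomegaly.

    Returns a boolean mask marking cases with a complete negative-class consensus.
    """
    predictions = np.asarray(predictions)
    return np.all(predictions == 0, axis=0)

# Made-up predictions from three networks (rows) for six cases (columns),
# loosely mirroring the pattern of Table 4.
preds = np.array([
    [0, 0, 0, 0, 0, 0],   # reference network
    [0, 0, 1, 0, 1, 0],   # second network
    [0, 0, 0, 1, 0, 0],   # third network
])
consensus = consensus_negative(preds)
for i, ok in enumerate(consensus, start=1):
    print(f"case {i}: {'auto-report normal' if ok else 'refer for clinician review'}")
```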

Clustering strategies for the assessment of visual similarity
In convolutional neural networks, convolutional layers extract automatic features, and fully connected layers perform classification using the extracted features as an input (Fig. 3). Visual-similarity assessment is performed by the fully connected layers through the learning of complex nonlinear relationships (a black-box process) between the input features and the output classes.

Other quantification methods for visual similarity include measures of the spatial distance between image vectors in an embedding space (such as relative cityblock or L1 distance, Euclidean or L2 distance, Mahalanobis distance, Minkowski distance and cosine similarity). These visual-similarity measures are routinely used by content-based image retrieval systems, and may therefore be leveraged as indicators of prediction uncertainty.

Spatial distance and relative spatial distance. Patch similarity is a percentile metric of the spatial distance between a test patch and the predicted class in a two-dimensional embedding space (it is based on the mean Euclidean distance between the test patch and k-nearest prototypical patches from the predicted class, and on the mean Euclidean distance between all prototypical patches from the predicted class21; for multiclass classification tasks, the ratio of the mean spatial distance between the test patch and the nearest non-predicted class and the mean spatial distance between the test patch and the predicted class may provide better quantification of visual similarity18). Threshold values are derived from the highest patch similarity or from the spatial-distance ratio associated with false-positive or false-negative predictions in a representative holdout test dataset (Supplementary Fig. 1b).
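A bare-bones version of these distance-based measures is sketched below, assuming that two-dimensional embedding coordinates of the class prototypes and of the test patch are already available (for example, from a dimensionality-reduction step); the coordinates are made up, and the ratio used here is a simplified stand-in for the published patch-similarity metric21.

```python
import numpy as np

def patch_similarity(test_point, prototypes, k=3):
    """Ratio of the mean distance from the test point to its k nearest class
    prototypes over the mean pairwise distance among all prototypes
    (smaller means more visually similar; a simplified stand-in for ref. 21)."""
    prototypes = np.asarray(prototypes, dtype=float)
    d_test = np.linalg.norm(prototypes - np.asarray(test_point, dtype=float), axis=1)
    d_knn = np.sort(d_test)[:k].mean()
    # Mean pairwise Euclidean distance among the prototypes themselves.
    diffs = prototypes[:, None, :] - prototypes[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=-1)
    d_proto = pairwise[np.triu_indices(len(prototypes), k=1)].mean()
    return d_knn / d_proto

# Made-up 2-D embedding: prototypes of the predicted class and two test cases.
prototypes = np.array([[0.0, 0.0], [1.0, 0.2], [0.5, 1.0], [0.2, 0.8]])
print(patch_similarity([0.4, 0.5], prototypes, k=3))   # small ratio: visually similar
print(patch_similarity([5.0, 5.0], prototypes, k=3))   # large ratio: uncertain prediction
```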


Embedding of the local neighbourhood at the decision boundary. The successful use of clustering strategies for zero-error prediction with effective coverage of the target dataset requires the judicious selection of class prototypes for the initial embedding. For example, class prototypes may be selected by deriving the prediction probability or cross-entropy threshold values. Patch atlases may also be constructed for each class through the cropping of bounding boxes generated by high-resolution class-activation mapping21. Furthermore, a specific representation of the local neighbourhood at the decision boundary may be helpful or even necessary. This may be achieved by the embedding of cases without a complete inter-rater ground-truth consensus as a separate cluster (Supplementary Fig. 1a).

Implementation of clustering-based prediction-uncertainty metrics is relatively more demanding, because an appropriate dimensionality-reduction technique must be used and optimized, and a patch-atlas construction is also required for each class. However, this approach is versatile, and can be optimized for different classification tasks. All false-negative errors in the experimental dataset that we provide were detected by clustering analysis using a supervised uniform manifold approximation and projection (UMAP) technique for dimensionality reduction (Supplementary Fig. 1a). Out-of-distribution inputs may also be most effectively captured using clustering strategies, because neural networks are vulnerable to out-of-distribution uncertainty and cannot be relied on to prevent out-of-distribution prediction failure.

Evidence from neuroscience and cognitive psychology supports concurrent reliance on exemplar, prototype and boundary-categorization strategies within a single visual object-recognition task31,32. Thus, example-based explanations, such as the retrieval of class prototypes and nearest-neighbour examples from visually distinct clusters in an embedding, can also promote explainability, thereby further increasing the transparency of black-box neural networks.

ML systems for error-tolerant decision support
Error-tolerant ML systems may be particularly suitable for clinical applications involving triage-threshold selection without bypassing clinician review. This is the case for radiology decision-support systems that offer diagnostic assistance to radiologists-in-training and to other healthcare practitioners.

Calibration of prediction probability
Users of ML systems for decision support should be guided by interfaces that are intentionally designed to acknowledge the reported prediction uncertainty. Prediction probability will often be the most suitable prediction-uncertainty metric in error-tolerant conditions, for reasons of ease of implementation and user interpretation. However, the output of modern neural networks is often overly confident, and the networks are poorly calibrated to ground-truth accuracy8. The miscalibration of prediction probability may be mild (Fig. 2) or severe (Supplementary Fig. 2).

A variety of techniques has been used for the post hoc calibration of prediction probability. For example, temperature scaling is both simple and effective, and offers comparable or superior performance in most cases, relative to more sophisticated methods (such as Bayesian binning into quantiles, isotonic regression and Dirichlet calibration33,34). If raw prediction probability is not appropriately calibrated, users of decision-support ML systems may be unduly influenced or misled by inflated prediction probabilities.
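Temperature scaling can be sketched as follows: a single scalar T > 0 is fitted on a held-out validation set by minimizing the negative log-likelihood of the temperature-scaled probabilities, and the same T is then applied unchanged at test time. The logits and labels below are synthetic, and the code is an illustration of the idea rather than the reference implementations of refs. 33,34.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(temperature, logits, labels, eps=1e-7):
    """Negative log-likelihood of binary labels under temperature-scaled sigmoids."""
    p = 1.0 / (1.0 + np.exp(-logits / temperature))
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

# Synthetic validation logits that are too confident relative to the labels.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
logits = (2 * labels - 1) * rng.normal(4.0, 3.0, size=500)   # overconfident scores

result = minimize_scalar(nll, bounds=(0.05, 20.0), args=(logits, labels), method="bounded")
T = result.x
print(f"fitted temperature T = {T:.2f}")               # T > 1 indicates overconfidence
calibrated = 1.0 / (1.0 + np.exp(-logits / T))          # apply the same T at test time
```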
A concise framework for understanding prediction uncertainty
In this section, we offer a brief guidance aimed at helping healthcare stakeholders and developers of ML algorithms for healthcare applications to understand prediction uncertainty and to design systems that implement prediction-uncertainty metrics.

Use prediction-uncertainty metrics to prevent unsafe prediction failure
Healthcare stakeholders should understand the limitations of standard model-performance metrics, and face the inevitable learning curve associated with the adoption of new prediction-uncertainty metrics. Clinical experts should have an essential role in the identification of impact-sensitive safety thresholds for prediction uncertainty.

Select automation-amenable tasks for the development of ML systems with zero tolerance to errors
The development of ML models has long relied on proof-of-concept processes. Rather, model development should emphasize the intentional selection of automation-amenable tasks. Real-world ML intelligence for medical applications may rarely supersede the aptitude of expert human labellers. Useful decision-support ML systems can be developed, yet diagnostic tasks that are difficult for clinical experts should be adjudicated by them. Moreover, to justify the financial costs of quality assurance, automation-amenable tasks may be defined by low-to-reasonable inter-rater disagreement (Supplementary Fig. 1b,c) and by large target datasets.

Stay up-to-date with developments in the implementation of methods to tackle prediction uncertainty
In this Perspective, we have provided a succinct introduction to a handful of practicable post hoc prediction-uncertainty metrics. Yet, a vast array of methodological innovations to measure and limit prediction uncertainty have been developed. Healthcare stakeholders should recognize the limitations of old methodologies and stay informed of methodological developments in the design and implementation of methods for reducing and managing prediction uncertainty.

Recognize that the interpretability of ML systems is context dependent
Interpretability has been defined as the extent to which humans can understand ML operations either through introspection or through a produced explanation35,36. But interpretability is context-dependent. In healthcare contexts, ML systems may be insufficiently interpretable for real-world deployment if healthcare stakeholders and regulators are unable to understand how and to what extent false-positive and false-negative errors can be prevented.

Support suitable regulatory oversight
The use of prediction-uncertainty metrics should be mandated. Regulators should determine whether prediction-uncertainty metrics and prediction-uncertainty thresholds have been appropriately selected. Moreover, the interfaces of ML systems for healthcare applications should be designed to force the users of the systems to acknowledge the reported prediction uncertainty. Task-specific protocols for the prospective monitoring of prediction-uncertainty metrics should also be subject to regulatory oversight. The fine-tuning of trained models using data from target datasets will often be necessary to reduce out-of-distribution uncertainty. And regulatory bodies and professional associations should facilitate access to data from minority subpopulations to mitigate implementation biases and the potential exacerbation of healthcare disparities.

References
1. Challen, R. et al. Artificial intelligence, bias and clinical safety. BMJ Qual. Saf. 28, 231–237 (2019).
2. Hendrycks, D. & Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. Preprint at arXiv https://arxiv.org/abs/1610.02136 (2018).
3. Goodfellow, I. J., Shlens, J. & Szegedy, C. Explaining and harnessing adversarial examples. Preprint at arXiv https://arxiv.org/abs/1412.6572 (2015).
4. Amodei, D. et al. Concrete problems in AI safety. Preprint at arXiv https://arxiv.org/abs/1606.06565 (2016).
5. Nguyen, A., Yosinski, J. & Clune, J. Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 427–436 (2015).
6. He, J. et al. The practical implementation of artificial intelligence technologies in medicine. Nat. Med. 25, 30–36 (2019).
7. Kompa, B., Snoek, J. & Beam, A. L. Second opinion needed: communicating uncertainty in medical machine learning. NPJ Digit. Med. 4, 4 (2021).
8. Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In Proc. 34th Int. Conference on Machine Learning (PMLR) 70, 1321–1330 (2017).
9. Dyer, T. et al. Diagnosis of normal chest radiographs using an autonomous deep-learning algorithm. Clin. Radiol. 76, 473–473 (2021).


10. Dyer, T. et al. Validation of an artificial intelligence solution for acute triage and rule-out normal of non-contrast CT head scans. Neuroradiology 64, 735–743 (2022).
11. Liang, X., Nguyen, D. & Jiang, S. B. Generalizability issues with deep learning models in medicine and their potential solutions: illustrated with Cone-Beam Computed Tomography (CBCT) to Computed Tomography (CT) image conversion. Mach. Learn. Sci. Technol. 2, 015007 (2020).
12. Navarrete-Dechent, C. et al. Automated dermatological diagnosis: hype or reality? J. Invest. Dermatol. 138, 2277–2279 (2018).
13. Krois, J. et al. Generalizability of deep learning models for dental image analysis. Sci. Rep. 11, 6102 (2021).
14. Sathitratanacheewin, S., Sunanta, P. & Pongpirul, K. Deep learning for automated classification of tuberculosis-related chest X-ray: dataset distribution shift limits diagnostic performance generalizability. Heliyon 6, e04614 (2020).
15. Xin, K. Z., Li, D. & Yi, P. H. Limited generalizability of deep learning algorithm for pediatric pneumonia classification on external data. Emerg. Radiol. 29, 107–113 (2022).
16. Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 15, e1002683 (2018).
17. Chen, J. S. et al. Deep learning for the diagnosis of stage in retinopathy of prematurity: accuracy and generalizability across populations and cameras. Ophthalmol. Retina 5, 1027–1035 (2021).
18. Jiang, H., Kim, B., Guan, M. & Gupta, M. To trust or not to trust a classifier. In Advances in Neural Information Processing Systems 31 (2018).
19. Geifman, Y. & El-Yaniv, R. SelectiveNet: a deep neural network with an integrated reject option. In Proc. 36th Int. Conference on Machine Learning (PMLR) 97, 2151–2159 (2019).
20. Madras, D., Pitassi, T. & Zemel, R. Predict responsibly: improving fairness and accuracy by learning to defer. In Advances in Neural Information Processing Systems 31 (2018).
21. Kim, D. et al. Accurate auto-labeling of chest X-ray images based on quantitative similarity to an explainable AI model. Nat. Commun. 13, 1867 (2022).
22. Bernhardt, M. et al. Active label cleaning for improved dataset quality under resource constraints. Nat. Commun. 13, 1161 (2022).
23. Krause, J. et al. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology 125, 1264–1272 (2018).
24. Basha, S. H. S., Dubey, S. R., Pulabaigari, V. & Mukherjee, S. Impact of fully connected layers on performance of convolutional neural networks for image classification. Neurocomputing 378, 112–119 (2020).
25. Trabelsi, A., Chaabane, M. & Ben-Hur, A. Comprehensive evaluation of deep learning architectures for prediction of DNA/RNA sequence binding specificities. Bioinformatics 35, i269–i277 (2019).
26. Boland, G. W. L. Voice recognition technology for radiology reporting: transforming the radiologist's value proposition. J. Am. Coll. Radiol. 4, 865–867 (2007).
27. Heleno, B., Thomsen, M. F., Rodrigues, D. S., Jorgensen, K. J. & Brodersen, J. Quantification of harms in cancer screening trials: literature review. BMJ 347, f5334 (2013).
28. Dans, L. F., Silvestre, M. A. A. & Dans, A. L. Trade-off between benefit and harm is crucial in health screening recommendations. Part I: general principles. J. Clin. Epidemiol. 64, 231–239 (2011).
29. Peryer, G., Golder, S., Junqueira, D. R., Vohra, S. & Loke, Y. K. in Cochrane Handbook for Systematic Reviews of Interventions (eds Higgins, J. P. et al.) Ch. 19, 493–505 (John Wiley & Sons, 2011).
30. Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H. S. & Gal, Y. Deep deterministic uncertainty: a simple baseline. Preprint at arXiv https://arxiv.org/abs/2102.11582 (2022).
31. Kruschke, J. K. in The Cambridge Handbook of Computational Psychology (ed. Sun, R.) 267–301 (Cambridge Univ. Press, 2008).
32. Bowman, C. R., Iwashita, T. & Zeithamova, D. Tracking prototype and exemplar representations in the brain across learning. eLife 9, e59360 (2020).
33. Platt, J. C. in Advances in Large Margin Classifiers (eds Smola, A. J. et al.) (MIT Press, 1999).
34. Ding, Z., Han, X., Liu, P. & Niethammer, M. Local temperature scaling for probability calibration. In Proc. IEEE/CVF International Conference on Computer Vision 6889–6899 (2021).
35. Clinciu, M.-A. & Hastie, H. A survey of explainable AI terminology. In Proc. 1st Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence (NL4XAI) 8–13 (2019).
36. Biran, O. & Cotton, C. Explanation and justification in machine learning: a survey. In IJCAI-17 Workshop on Explainable Artificial Intelligence (XAI) 8, 8–13 (2017).

Acknowledgements
We thank the staff at Massachusetts General Brigham's Enterprise Medical Imaging team and Data Science Office.

Author contributions
M.C., D.K., J.C., M.H.L. and S.D. conceived the project. M.C., D.K., J.C. and S.D. developed the theory and performed the implementations. N.G.L., V.D., J.S., M.H.L., R.G.G. and M.S.G. verified the clinical implementations. M.C., D.K., R.G.G., M.S.G. and S.D. contributed to writing the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41551-022-00988-x.

Correspondence should be addressed to Synho Do.

Peer review information Nature Biomedical Engineering thanks Steve Jiang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

© Springer Nature Limited 2022
