Learning How to MIMIC: Using Model Explanations to Guide Deep Learning Training

Matthew Watson, Bashar Awwad Shiekh Hasan, Noura Al Moubayed


Durham University
Durham, UK
{matthew.s.watson,bashar.awwad-shiekh-hasan,noura.al-moubayed}@durham.ac.uk

Abstract

Healthcare is seen as one of the most influential applications of Deep Learning (DL). Increasingly, DL models have been shown to achieve high levels of performance on medical diagnosis tasks, in some cases achieving levels of performance on par with medical experts. Yet, very few are deployed into real-life scenarios. One of the main reasons for this is the lack of trust in those models by medical professionals, driven by the black-box nature of the deployed models. Numerous explainability techniques have been developed to alleviate this issue by providing a view on how the model reached a given decision. Recent studies have shown that those explanations can expose the models' reliance on areas of the feature space that have no justifiable medical interpretation, widening the gap with the medical experts. In this paper we evaluate the deviation of saliency maps produced by DL classification models from radiologists' eye gaze while they study the MIMIC-CXR-EGD images, and we propose a novel model architecture that utilises model explanations during training only (i.e. not during inference) to improve the overall plausibility of the model explanations. We substantially improve the similarity between the model's explanations and radiologists' eye-gaze data, reducing Kullback-Leibler Divergence by 90% and increasing Normalised Scanpath Saliency by 216%. We argue that this significant improvement is an important step towards building more robust and interpretable DL solutions in healthcare.

1. Introduction

Applications of Deep Learning (DL) to healthcare have been growing rapidly in a wide range of medical scenarios, ranging from critical care [24] and diabetes risk prediction [1] to the diagnosis of chest x-rays (CXRs) [28]. This is partly driven by the rising accuracy of such models, with some beginning to achieve performance on par with (or even exceeding) that of medical professionals [22]. However, despite these developments we are yet to see a similar growth in the number of DL models being deployed into real-world medical scenarios [2]. This is down to numerous limiting factors; most notably, before such techniques can become established in the medical field, they must be ethical in their decision-making, trustworthy, transparent and explainable [5, 12].

It is in these areas that many DL models perform poorly. In particular, many models fail to accurately capture the causal relationships between input features and the output classification, relying instead on task-irrelevant features. For example, a wide-ranging study on the use of Machine Learning (ML) and DL techniques for COVID-19 prediction from chest x-rays (CXRs) [17] has shown that many models rely on spurious correlations, leaving them unable to generalise accurately. Furthermore, recent studies on the robustness of DL models have shown that changes to training hyperparameters can greatly affect the learned features [26]; this damages the trust between clinicians and DL techniques, as it highlights just how sensitive the models are to small changes, even when those changes are independent of the medical questions the model is trying to answer.

Thus, the gold standard for any ML model is to achieve high levels of performance whilst learning the concrete causal relationships present in the data. Unfortunately, the presence of learned causal features is extremely difficult to verify due to a lack of useful data supporting the task. Following practices in pedagogy, experts' Eye Gaze Data (EGD) can be used as a proxy for causal relationships [23, 19]. The release and initial analysis of the MIMIC-CXR-EGD dataset [15] showed that even current state-of-the-art CXR classification models fail to learn the same set of features as used by radiologists in their diagnoses.

In this paper, we present a novel deep learning architecture that learns a more consistent feature set than previous techniques.
Using the MIMIC-CXR-EGD dataset, which to the best of our knowledge is the only large-scale image dataset with accompanying expert eye-gaze data, we compare the similarity between explanations computed from DL models and the EGD from radiologists. We report a significantly larger overlap between explanations from our proposed technique and the EGD than from any other model architecture tested (increasing from -0.4634 to 0.5410 when measured by Normalised Scanpath Saliency, and improving from 9.1233 to 0.8398 when measured by Kullback-Leibler Divergence), including current state-of-the-art methods specifically designed to combat this issue. We also show that our proposed architecture produces more consistent explanations than previous models, increasing explanation consistency [26] from 0.1785 to 0.5333, with no cost to model performance nor the need for specialists' EGD at inference time.

2. Related Work

In order to explain the decisions made by DL models, numerous explainability techniques have been developed with the aim of "opening up" the black-box architectures. In this paper we focus on two post-hoc techniques [13] that are designed to explain deep learning models; our aim is to compare the explanations from a variety of established architectures (as well as our novel models), and so the techniques used must be model-agnostic and easy to apply. SHAP [16] is a permutation-based approach which has theoretical groundings in game theory. Grad-CAM [18] is a gradient-based approach which uses the gradient of any target concept flowing into the final convolutional layer of a network to produce a saliency map. We focus on these two techniques in this paper as not only are they the current de-facto standards, but they can also both be applied to a wide range of model architectures, allowing for the easy comparison of explanations from varying model types.

Previous work has used these explainability techniques to investigate the robustness and adaptability of DL models [26, 8], finding that even small changes to the training procedure can result in significant changes to the learned features. These results, coupled with many networks' susceptibility to issues such as adversarial attacks [10] and shortcut learning [9], suggest that many modern DL architectures are not necessarily learning causal relationships in the data to achieve high performance and might be relying on spurious correlations. It can be extremely difficult to verify that the learned features are indeed causal - there are only a limited number of mostly toy datasets that include descriptions of their causal relationships [3].

In the absence of such data, recent work has used the EGD of experts making decisions on a visual task as a proxy for concrete causal relationships [15]. Such data can be used to determine whether models are learning features that domain experts would use in their assessment of the data - this use case has groundings in real-world applications, with similar techniques being used pedagogically in fields such as radiology [25]. The MIMIC-CXR-EGD dataset [15] is a subset of MIMIC-CXR [14], containing 1,083 CXR images from three classes (Pneumonia, Congestive Heart Failure and Normal). Accompanying the images are aligned EGD from a trained radiologist. Both raw eye gaze information and calculated fixation points are available for this EGD - we refer readers interested in the EGD collection process to [15]. Alongside the release of the dataset, the authors also show that explanations from traditional classification models do not significantly overlap with the radiologist's EGD. They propose a multi-task UNet model which uses EGD at train time to learn to both classify the CXR image and reproduce the ground-truth EGD, in order to improve the similarity between model explanations and EGD. However, the results are not very convincing and the study lacked a verifiable method of comparing their model explanations and the EGD. Additionally, this technique requires the use of expert EGD during training, which is costly and difficult to collect, especially in the medical domain. We compare our method against both the baseline models and the improved UNet architecture using static EGD heatmaps proposed in [15], resulting in a significantly higher degree of similarity between model explanations and EGD across all tested metrics.

3. Method

Our proposed architecture consists of an ensemble architecture M made up of S sub-models (of any architecture) and a discriminator, D. We begin by describing the architecture of our model, and then detail its training procedure.

We define an explanation ensemble model as M : X → Y, where X is the set of inputs, and Y the outputs. M consists of S sub-models m_0, ..., m_S, where S ∈ N, each of which has the same architecture, suited to the task. In our proposed network, each m_i is trained with a different hyperparameter setup - i.e. with different random seeds, or training data order. Architecture hyperparameters, such as layer size and learning rate, are kept constant. The final output of the explanation ensemble is the average output of all sub-models:

M(x) = (1/S) · Σ_{i ∈ [0, S]} m_i(x)        (1)

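In code, Eq. 1 is a simple averaging wrapper around the sub-models. The following is a minimal PyTorch sketch for illustration only; the class and variable names are ours and not taken from the authors' released implementation (linked in Section 4):

```python
import torch
import torch.nn as nn

class ExplanationEnsemble(nn.Module):
    """Minimal sketch of M: S sub-models sharing one architecture, averaged as in Eq. 1."""

    def __init__(self, sub_models):
        super().__init__()
        self.sub_models = nn.ModuleList(sub_models)   # m_0, ..., m_S, e.g. UNet classifiers

    def forward(self, x):
        # M(x) = (1/S) * sum_i m_i(x): average the sub-model outputs
        return torch.stack([m(x) for m in self.sub_models], dim=0).mean(dim=0)
```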
The network also adds a discriminator, D : E → S, where E is the set of model explanations (calculated via any feature importance attribution method) and S = [0, S]. We denote the explanations of sub-model m_i on the inputs x as E_i(x). The discriminator is trained on the explanations produced by each of the S sub-models, with the aim of learning to identify which of the sub-models a given explanation originated from. As the task of the discriminator has been shown to be easily learned [26], the architecture of D should be chosen carefully, ensuring it is not so complex that M drastically overfits.

The S sub-models and discriminator D are all trained together, optimising the loss function in Equation 2, where CELoss(·, ·) is cross-entropy loss and β ∈ [0, ∞) is a hyperparameter weighting D's contribution during a training epoch. The subtraction of the discriminator loss in Eq. 2 ensures that each sub-model m_i "fools" the discriminator by learning to produce explanations that are similar to those of the other sub-models in the ensemble.

loss = Σ_i [ CELoss(m_i(x), y) − β · CELoss(D(E_i(x)), i) ]        (2)

Every α epochs (where α is another tunable hyperparameter), the discriminator D is updated with respect to the loss function CELoss(D(E_i(x)), i), without back-propagating through the sub-models, allowing D to learn how to effectively classify the explanations. This only needs to be done every α epochs due to the ease of the task [26]. This equates to the S sub-models and D being updated in a two-player minimax game - the goal of D is to learn to separate the sub-models' explanations, whereas the sub-models are aiming to fool the discriminator, all whilst also optimising m_0, ..., m_S on the downstream task. The result is a set of S sub-models that produce similar explanations. The assumption here is that this learnt explanation is closer to representing the causal relationships and less reliant on spurious correlations.

Training of this model can be unstable - this is a direct consequence of the discriminator and ensemble sub-models having opposing goals. For example, if each sub-model gives each feature of the input equal weight, then the loss of the discriminator will be maximised, reducing Eq. 2. However, this would also result in the sub-models predicting the same class for every input. Training stability is linked to a "good" choice of α. This can be optimised like any hyperparameter (e.g. through a grid search or random search), although we have found empirically that α = 2 provides stable training.

To summarise, the intuition behind our architecture is to train a discriminator D which encourages each of the S sub-models in an ensemble to learn a similar set of features. As each of the sub-models is trained with a different hyperparameter setup, they will each learn a slightly different set of features. As training progresses, D will learn to use the noisy features of each sub-model to (correctly) classify which sub-model its explanations originate from - and in turn, the sub-models will learn to use different features for their classification, in order to fool D. The final result is an ensemble model that has learned to "ignore" a wide range of spurious features, with each of the sub-models only using features which all m_i agree are important. As multiple models must agree that any given feature is important for it to be used, it is more likely that these features are causally related to the target, and thus more likely to be included in an expert's eye-gaze data.

4. Experimental Setup

All experiments¹ are carried out on the MIMIC-CXR-EGD dataset [15]. The models are trained on the same 3-label classification task: given a CXR image, predict its diagnosis (Pneumonia, Congestive Heart Failure or Normal). We train three architectures to compare our explanation ensemble against: 1) baseline: a standard UNet architecture trained with a learning rate (LR) of 0.003 with the Adam optimiser, batch size 32, and pre-trained EfficientNet-b0 [21] as the encoder and bottleneck layers; 2) improved UNet: the modified UNet architecture [15] using static heatmaps during training to both classify and reproduce the EGD given a CXR, using identical hyperparameters; and 3) standard ensemble: an ensemble architecture consisting of 10 UNet architectures identical to 2), trained with LR=0.003 using the Adam optimiser and batch size 4 [15]. A reduced batch size was used due to memory constraints. Each experiment allows us to compare our results against a different standard of model: 1) is a standard classification model and used as a baseline, 2) is the SOTA for similarity between model explanations and EGD, and 3) confirms that our results are not just a result of utilising an ensemble architecture (and instead are inherent to our proposed architecture and training procedure). UNet was chosen to allow for direct comparison with the current state-of-the-art model on the MIMIC-CXR-EGD dataset in [15]. We also experimented with Vision Transformers [7]; however, due to the small size of MIMIC-CXR-EGD they are unable to reach levels of performance matching those of our baseline, and so we do not include their results in this paper. Across all experiments the same 80/20 train-test split is used for the MIMIC-CXR-EGD dataset.

¹ Code to reproduce these experiments can be found at: https://github.com/mattswatson/learning-to-mimic

We train our proposed explanation ensembles using standard UNet with a classification head as our sub-models. A batch size of 4 and a learning rate of 0.00001 with the Adam optimiser are used. We use a CNN for our discriminator, with two convolution layers. Max pooling (with kernel size and stride of 2) and ReLU activations are used after each convolution layer. We set β = 0.2 to ensure the two parts of the main loss function are of the same order of magnitude. We use 10 sub-models per Explanation Ensemble (see the Supplementary Material for results on different numbers of sub-models). We report the accuracy (across all three labels) for all models as a performance metric.

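The sketch below ties Eq. 2 to the settings above (β = 0.2, a two-convolution-layer CNN discriminator, updates to D every α = 2 epochs). The layer widths, the global average pooling, and the loop structure are our own assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Small CNN D that predicts which sub-model a single-channel explanation map came from."""

    def __init__(self, num_sub_models, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.MaxPool2d(2, 2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.MaxPool2d(2, 2), nn.ReLU(),
        )
        self.classifier = nn.Linear(32, num_sub_models)

    def forward(self, e):
        h = self.features(e).mean(dim=(2, 3))   # global average pool over spatial dims
        return self.classifier(h)

def ensemble_step(sub_models, discriminator, explain_fn, x, y, beta=0.2):
    """One sub-model update against Eq. 2; only the sub-models' optimiser should step on this loss."""
    loss = x.new_zeros(())
    for i, m in enumerate(sub_models):
        ce_task = F.cross_entropy(m(x), y)                       # CELoss(m_i(x), y)
        e_i = explain_fn(m, x)                                   # E_i(x); must be differentiable w.r.t. m
        target = torch.full((x.size(0),), i, dtype=torch.long, device=x.device)
        ce_disc = F.cross_entropy(discriminator(e_i), target)    # CELoss(D(E_i(x)), i)
        loss = loss + ce_task - beta * ce_disc                   # subtracting the D term "fools" D
    return loss

def discriminator_step(sub_models, discriminator, explain_fn, x):
    """Every alpha (= 2) epochs: update D on detached explanations, so no gradient reaches the sub-models."""
    loss = x.new_zeros(())
    for i, m in enumerate(sub_models):
        e_i = explain_fn(m, x).detach()
        target = torch.full((x.size(0),), i, dtype=torch.long, device=x.device)
        loss = loss + F.cross_entropy(discriminator(e_i), target)
    return loss
```

In a full loop, the sub-models would step on `ensemble_step` every batch, while `discriminator_step` (with its own optimiser) runs only every α = 2 epochs, giving the two-player minimax dynamic described in Section 3.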
In order to allow for direct comparison with [15], we compute the explanations for all models using Grad-CAM [18] on the final convolution layer. We sampled examples from the test set for inspection. We compare the similarity of these explanations to EGD heatmaps generated from the eye-gaze fixations, which gives us scalar values of importance for each pixel based on the radiologist's eye gaze [15]. To measure similarity to the EGD heatmaps we follow the standard practice for comparing saliency maps [4]; we report both the Kullback-Leibler Divergence (KLD) as a distribution-based metric, and the Normalised Scanpath Saliency (NSS) as a location-based metric. KLD is an information-theoretic measure of the difference between one probability distribution and another; importantly, note that it is a divergence metric, meaning smaller values indicate better similarity. NSS is designed to compare saliency maps with a ground truth, and is the normalised saliency at fixation locations. We note that metrics such as Intersection over Union (IoU) are not suited to comparing EGD and saliency heatmaps [4], as one must consider how much importance is placed on each pixel (by both the model and the expert), rather than treating explanations/EGD as binary heatmaps.

It is known that NSS is sensitive to false positives; however, that is desirable here - we hypothesise that the (non-explanation ensemble) models are learning many noisy features which are not necessarily causally linked to the output, and we want to penalise the models if this is indeed the case. Negative NSS values indicate negative correlation, with chance at 0 and positive values indicating positive correlation.

Explanation consistency [26] measures the change in model explanations under different hyperparameter settings perpendicular to the task. Higher consistency is linked to explanations that are more robust to spurious correlations [26]. We would expect our explanation ensemble model to achieve higher explanation consistency than the other models tested. For each architecture, 10 models are trained with different random seeds. The Grad-CAM explanations are generated on the test set for these 10 models, with these explanations also being used to calculate the explanation consistency C for each architecture. Following the methods of [26], we use a binary logistic regression classifier to measure the separability of two sets of explanations.

Furthermore, we confirm our results on Grad-CAM by repeating these experiments with SHAP. This confirms that our results are not limited to one explanation technique; if both explainability methods agree on the outcome, then we can conclude with increased certainty that the model is indeed learning "better" (i.e. similar, causal) features.

5. Results and Discussion

Table 1 reports the best model performance as well as summary statistics for both the KLD and NSS metrics used to compare the similarity between the models' Grad-CAM explanations and the EGD. Table 1 in the Supplementary Material reports the results for each training hyperparameter setup used. The performance of both the Baseline and Improved UNet models is equal to the results reported in [15], confirming that these models are behaving as expected. Furthermore, both ensembling techniques perform better than these two models; this is to be expected given that they are ensemble architectures [6]. Importantly, our Explanation Ensemble architecture is shown to improve upon the performance of the baseline models by 3.39%, indicating that the models are not sacrificing model performance for improved explanations. Given that the explanations from Explanation Ensembles are shown to better align with radiologist EGD, this also suggests that the features used by radiologists are better for disease classification than those learned by the baseline model.

Both Table 1 and Figure 1 report the Kullback-Leibler Divergence and Normalised Scanpath Saliency between the Grad-CAM explanations from each model architecture and the radiologist's EGD heatmaps (for details on EGD heatmap generation, see [15]). From Figure 1 we can see that our Explanation Ensemble model produces explanations that are more similar to the EGD than all other architectures tested, when measured by both a distribution-based measure (KLD) and a location-based metric (NSS). To confirm that these conclusions are statistically sound, we perform a paired t-test at the α = 0.05 significance level between the similarity metrics from the baseline and Explanation Ensemble models. Our null and alternative hypotheses are the same for both KLD and NSS: H0: µd = 0, H1: µd ≠ 0, where µd is the mean of the differences between the KLD/NSS values for the two architectures. The distributions of the differences were confirmed to be normal before carrying out the t-test. Table 2 reports both the test statistics and p-values for each of our hypothesis tests. Given that all p-values are significantly less than α, we can conclude that our explanation ensemble architecture produces explanations that are statistically more similar to radiologist EGD than both baseline and current state-of-the-art techniques. Significantly, all models except explanation ensembles achieve negative NSS scores, showing anti-correspondence with the EGD [4] and making our explanation ensemble architecture the only method tested to use features that are positively correlated with those used by experts. This is further highlighted by the large reduction in KLD from our method when compared with the baseline models tested; this underlines how significantly different the features used by current state-of-the-art models and medical experts are (and follows results suggesting that many networks suffer from shortcut learning [9] and spurious correlations [27]), and shows that our proposed method is a significant improvement. While we have focused on Explanation Ensembles of size 10 in this paper, the effect of changing the number of sub-models is explored in Figure 1 of the Supplementary Material. These experiments show that as the number of sub-models increases, so does the agreement between model explanations and the EGD; however, it is important to note the trade-off between training cost and increased performance as the Explanation Ensemble size increases.
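All of the explanation-based comparisons above start from Grad-CAM [18] heatmaps taken from the final convolution layer, as described at the start of this section. A minimal, library-free sketch is shown below; the upsampling and normalisation choices are our assumptions rather than the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, x, class_idx):
    """Grad-CAM heatmap for `class_idx` from the activations of `conv_layer` (x: one image, shape (1, C, H, W))."""
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(x)                                       # (1, num_classes)
        model.zero_grad()
        logits[0, class_idx].backward()
        weights = grads["g"].mean(dim=(2, 3), keepdim=True)     # average gradients over spatial dims
        cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
        cam = cam / (cam.max() + 1e-8)                          # rescale to [0, 1]
    finally:
        h1.remove()
        h2.remove()
    return cam[0, 0]                                            # (H, W) saliency map
```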

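Table 1 below summarises these comparisons using KLD and NSS. A small NumPy sketch of the two metrics as defined above, computed between a model saliency map and an EGD heatmap (the exact normalisation used by the authors may differ):

```python
import numpy as np

def kld(saliency, egd, eps=1e-8):
    """Kullback-Leibler divergence from the EGD heatmap to the model saliency map (lower is better)."""
    p = egd / (egd.sum() + eps)            # treat both heatmaps as probability distributions
    q = saliency / (saliency.sum() + eps)
    return float(np.sum(p * np.log(eps + p / (q + eps))))

def nss(saliency, fixations):
    """Normalised Scanpath Saliency: mean of the z-scored saliency map at fixated pixels (higher is better)."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations > 0].mean())
```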
Table 1. The performance of the best-performing model for each architecture, alongside the similarity between the model Grad-CAM explanations and the EGD. Note that KLD is a divergence metric, and so smaller is better. Grad-CAM explanation consistency was calculated across all 10 training hyperparameter setups for each architecture.

Model | Accuracy | KLD Mean (± std. dev) | KLD Median (± IQR) | NSS Mean (± std. dev) | NSS Median (± IQR) | Consistency
Baseline | 75.55% | 14.4041 ± 7.6886 | 13.4535 ± 10.5240 | −0.8579 ± 1.2345 | −1.0391 ± 1.4737 | 0.1785
Improved UNet | 76.51% | 9.9371 ± 6.4179 | 9.1221 ± 8.4260 | −0.3244 ± 1.5237 | −0.4634 ± 1.9781 | 0.1596
Normal Ensemble | 79.86% | 3.8839 ± 3.2510 | 2.7740 ± 4.0799 | −0.1646 ± 1.5721 | −0.1307 ± 2.0840 | 0.3042
Explanation Ensemble (Ours) | 78.94% | 0.8196 ± 0.1273 | 0.8398 ± 0.1658 | 0.6757 ± 1.1178 | 0.5410 ± 1.5653 | 0.5333

Figure 1. Boxplots of mean (a) NSS and (b) KLD between model Grad-CAM explanations and radiologist EGD, across each of the 10
training random seeds tested. Note that KLD is a divergence metric meaning smaller values are better.
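The Consistency column of Table 1 follows [26], which measures how separable the explanations of differently-seeded models are; Section 4 states that a binary logistic-regression classifier is used for this separability step. The sketch below shows only that step (the mapping from separability to the final consistency score C is defined in [26] and not reproduced here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def separability(expl_a, expl_b, seed=0):
    """Accuracy of a binary logistic-regression classifier asked to tell apart two sets of
    flattened explanations; an accuracy near 0.5 means the two models explain alike."""
    X = np.concatenate([expl_a.reshape(len(expl_a), -1), expl_b.reshape(len(expl_b), -1)])
    y = np.concatenate([np.zeros(len(expl_a)), np.ones(len(expl_b))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```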

In addition to improved similarity with expert EGD, explanation consistency (Table 1) is also significantly improved in our explanation ensemble model. This can also be seen in the significantly smaller range of NSS and KLD of the explanations from the explanation ensembles (as reported in Figure 1) when compared with the other architectures tested. This inherently increases trust in the model, as it shows that our architecture is more robust than the others tested. It also further highlights how our network learns "better" (i.e. similar to those in EGD) features than the baseline models - our model is learning fewer noisy/spurious features and instead placing more importance on the features that have a higher probability of being causally related to the task.

We also investigate the similarity between SHAP values and the EGD data; this is shown in Figure 2. Similarly to the Grad-CAM results, we see that our proposed Explanation Ensemble architecture improves the similarity over all other model architectures tested. Similar patterns can be seen between all 4 architectures tested across the KLD and NSS values on the Grad-CAM and SHAP results, with the boxplots highlighting that the level of improvement of our explanation ensemble architecture is at the same scale regardless of the explainability technique used. As the results of Grad-CAM and SHAP agree, we can conclude that our proposed model is learning to use features similarly to a radiologist. These results can also be seen from a visual comparison of explanations: Figure 3 shows example CXRs and their corresponding EGD and explanations from all models tested, showing that our explanation ensemble places much more importance on regions similar to the expert radiologist (i.e. around the lungs and heart) than both the baseline and current state-of-the-art models. Notice how columns 2 (baseline Grad-CAM) and 3 (Improved UNet Grad-CAM) in Figure 3 show how much of the feature attribution is placed on spuriously correlated features (such as the top-left corner and the image borders). On the other hand, our explanation ensemble architecture learns a significantly different set of features (using features around the lungs and heart, with these areas much more closely matching the areas shown in the EGD heatmap in the first column), further showing that our training technique has a notable effect on the representations learned by the model.

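Figure 2 below repeats the comparison using SHAP values. The paper does not state which SHAP explainer was used; one plausible way to obtain pixel-level SHAP attributions for a PyTorch CXR classifier is shap's GradientExplainer, sketched here under that assumption:

```python
import shap
import torch

def shap_heatmaps(model, background, images, class_idx):
    """Per-pixel SHAP attributions for `class_idx`, summed over input channels.

    `background` is a small batch of training images used as the reference distribution.
    """
    model.eval()
    explainer = shap.GradientExplainer(model, background)
    shap_values = explainer.shap_values(images)
    # shap's return format varies by version: list with one array per class, or one stacked array
    sv = shap_values[class_idx] if isinstance(shap_values, list) else shap_values[..., class_idx]
    attr = torch.as_tensor(sv)                 # (N, C, H, W)
    return attr.abs().sum(dim=1)               # (N, H, W) saliency maps
```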
Figure 2. Boxplots showing the mean (a) NSS and (b) KLD between model SHAP explanations and radiologist EGD, across each of the
10 training random seeds tested. Note that KLD is a divergence metric meaning smaller values are better.

This is desirable, as it highlights how our model is learning to use features similar to those used by experts, making it less likely that our model is over-reliant on spurious features.

Figure 4 shows how the learned features of our explanation ensemble model change as training progresses. Note that this figure shows only the most important pixels of each model - when showing the importance of all pixels, the heatmaps become difficult to analyse by eye. In particular, Figure 4 highlights how our training process (i.e. the discriminator and our loss function in Equation 2) encourages the sub-models of our ensemble to learn similar features as training progresses, despite the sub-models starting with vastly different sets of explanations. This verifies that our intuitive understanding of our explanation ensemble architecture, and most importantly our understanding of why it produces explanations closer to experts' EGD, is correct.

Table 2. Test statistics t and p-values for the paired t-test performed between the Explanation Ensembles and Baseline (top) and the Explanation Ensembles and Improved UNet (bottom) models.

Explanation Ensemble vs. Baseline:
Metric | Test Statistic | p-value
KLD | 18.005 | 6.8698 × 10^-34
NSS | −9.9137 | 5.7567 × 10^-17

Explanation Ensemble vs. Improved UNet:
Metric | Test Statistic | p-value
KLD | 14.4617 | 7.5950 × 10^-27
NSS | −5.8058 | 3.5764 × 10^-8
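The statistics in Table 2 come from paired t-tests on the per-image KLD and NSS values of two architectures evaluated on the same test images. With SciPy this reduces to a single call (the function and variable names here are illustrative, not the authors' code):

```python
import numpy as np
from scipy import stats

def paired_t_test(metric_baseline, metric_ours):
    """Paired t-test (H0: mean difference = 0) between per-image metric values
    of two architectures evaluated on the same test images."""
    t_stat, p_value = stats.ttest_rel(np.asarray(metric_baseline), np.asarray(metric_ours))
    return t_stat, p_value
```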

Figure 3. 3 samples from the MIMIC-CXR-EGD dataset, overlaid with the radiologist’s EGD and Grad-CAM explanations from the
baseline, improved UNet and Explanation Ensemble models.

6. Conclusion

Through the use of two explainability techniques and both distribution- and location-based metrics, we have shown that our Explanation Ensemble technique improves upon baseline models in terms of both performance and explanation similarity to EGD on the MIMIC-CXR-EGD dataset. Furthermore, we have shown that the Explanation Ensemble architecture also improves upon the current state-of-the-art models which share learned features with radiologists' EGD. In addition to improving agreement between model explanations and expert EGD, our proposed model architecture also improves classification performance and explanation consistency when compared with current state-of-the-art techniques. Qualitative analysis of our results shows that our proposed architecture is a highly significant improvement upon current models, and whilst we do not claim that our results are yet perfect, they are a huge improvement in what is a very difficult task. Furthermore, unlike the previous state-of-the-art technique [15], our proposed architecture does not require EGD heatmaps during training - due to the cost of collecting EGD (especially in fields such as medicine, where expert knowledge is required), we believe this is a significant advantage over previously proposed methods.

In future work, it would be interesting to perform an in-depth causal analysis of the learned features of our model and compare this with a causal analysis of the learned features of baseline models. The improved performance, increased explanation consistency and improved agreement with expert EGD suggest that our architecture may be learning more causal features than the baseline models, with the baseline models possibly relying more on spurious features. We hypothesise this as one would only expect causal features to be those that are learned consistently across multiple variations of a well-performing model. Furthermore, the increased agreement with expert radiologists (whom one would expect to use causal features in their diagnoses) further supports this conclusion. However, to fully verify this hypothesis, an extensive causal analysis of the trained models, and their learned features, must be undertaken (using techniques such as those used in [20] and [11]), and so we leave this for future work.

Due to its increased similarity with a medical professional's decision-making process, we believe that more trust will be placed in our model by clinicians than in current state-of-the-art techniques. We hope that these results encourage the use of our architecture in other areas of medical practice and other sensitive fields, as well as the release of further datasets similar to MIMIC-CXR-EGD which can facilitate this type of research.

Acknowledgements

This work is supported by grant 25R17P01847 from the European Regional Development Fund and Cievert Ltd.

References

[1] Zakhriya Alhassan, Matthew Watson, David Budgen, Riyad Alshammari, Ali Alessa, and Noura Al Moubayed. Improving current glycated hemoglobin prediction in adults: Use of machine learning algorithms with electronic health records. JMIR Med Inform, 9(5):e25237, May 2021.
[2] Stan Benjamens, Pranavsingh Dhunnoo, and Bertalan Meskó. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. npj Digital Medicine, 3(1):118, Sep 2020.

Figure 4. Average GradCAM values (across the validation split) of each sub-model of our Explanation Ensemble model, as training
progresses. To aid with visualisation, only the most important 50% of pixels are shown. Sub-models start training with vastly different
learned features, and as training progresses our training procedure encourages the sub-models to learn similar features. A fully animated
version of this figure, and code to reproduce it on other models, will be released upon publication.

[3] Bradley Butcher, Vincent S. Huang, Christopher Robinson, Jeremy Reffin, Sema K. Sgaier, Grace Charles, and Novi Quadrianto. Causal datasheet for datasets: An evaluation guide for real-world data analysis and data collection design using Bayesian networks. Frontiers in Artificial Intelligence, 4, 2021.
[4] Zoya Bylinskii, Tilke Judd, Aude Oliva, et al. What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(3):740–757, 2018.
[5] D. S. Char, M. D. Abràmoff, and C. Feudtner. Identifying ethical considerations for machine learning healthcare applications. Am J Bioeth, 20(11):7–17, 2020.
[6] Xibin Dong, Zhiwen Yu, Wenming Cao, Yifan Shi, and Qianli Ma. A survey on ensemble learning. Frontiers of Computer Science, 14(2):241–258, 2020.
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[8] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
[9] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard S. Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. CoRR, abs/2004.07780, 2020.
[10] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[11] Yash Goyal, Amir Feder, Uri Shalit, and Been Kim. Explaining classifiers with causal concept effect (CaCE). arXiv preprint arXiv:1907.07165, 2019.
[12] Joshua James Hatherley. Limits of trust in medical AI. Journal of Medical Ethics, 46(7):478–481, 2020.
[13] Andreas Holzinger, Chris Biemann, Constantinos S. Pattichis, and Douglas B. Kell. What do we need to build explainable AI systems for the medical domain? CoRR, abs/1712.09923, 2017.
[14] Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, and Steven Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, Dec 2019.
[15] Alexandros Karargyris, Satyananda Kashyap, Ismini Lourentzou, et al. Creation and validation of a chest x-ray dataset with eye-tracking and report dictation for AI development. Scientific Data, 8(1):92, Mar 2021.
[16] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4765–4774, 2017.
[17] Michael Roberts, Derek Driggs, Matthew Thorpe, Julian Gilbey, Michael Yeung, Stephan Ursprung, Angelica I. Aviles-Rivero, Christian Etmann, Cathal McCague, Lucian Beer, Jonathan R. Weir-McCall, Zhongzhao Teng, Effrossyni Gkrania-Klotsas, Alessandro Ruggiero, Anna Korhonen, Emily Jefferson, Emmanuel Ako, Georg Langs, Ghassem Gozaliasl, Guang Yang, Helmut Prosch, Jacobus Preller, Jan Stanczuk, Jing Tang, Johannes Hofmanninger, Judith Babar, Lorena Escudero Sánchez, Muhunthan Thillai, Paula Martin Gonzalez, Philip Teare, Xiaoxiang Zhu, Mishal Patel, Conor Cafolla, Hojjat Azadbakht, Joseph Jacob, Josh Lowe, Kang Zhang, Kyle Bradley, Marcel Wassin, Markus Holzer, Kangyu Ji, Maria Delgado Ortet, Tao Ai, Nicholas Walton, Pietro Lio, Samuel Stranks, Tolou Shadbahr, Weizhe Lin, Yunfei Zha, Zhangming Niu, James H. F. Rudd, Evis Sala, Carola-Bibiane Schönlieb, and AIX-COVNET. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence, 3(3):199–217, Mar 2021.
[18] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
[19] Shyamli Sindhwani, Gregory Minissale, Gerald Weber, Christof Lutteroth, Anthony Lambert, Neal Curtis, and Elizabeth Broadbent. A multidisciplinary study of eye tracking technology for visual intelligence. Education Sciences, 10(8), 2020.
[20] Sumedha Singla, Stephen Wallace, Sofia Triantafillou, and Kayhan Batmanghelich. Using causal analysis for conceptual deep learning explanation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 519–528. Springer, 2021.
[21] Mingxing Tan and Quoc Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
[22] Eric J. Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1):44–56, Jan 2019.
[23] A. van der Gijp, C. J. Ravesloot, H. Jarodzka, M. F. van der Schaaf, I. C. van der Schaaf, J. P. J. van Schaik, and Th. J. Ten Cate. How visual search relates to visual diagnostic performance: a narrative systematic review of eye-tracking research in radiology. Adv. Health Sci. Educ. Theory Pract., 22(3):765–787, Aug 2017.
[24] Alfredo Vellido, Vicent Ribas, Carles Morales, Adolfo Ruiz Sanmartín, and Juan Carlos Ruiz Rodríguez. Machine learning in critical care: state-of-the-art and a sepsis case study. BioMedical Engineering OnLine, 17(1):135, Nov 2018.
[25] Stephen Waite, Arkadij Grigorian, Robert G. Alexander, Stephen L. Macknik, Marisa Carrasco, David J. Heeger, and Susana Martinez-Conde. Analysis of perceptual expertise in radiology – current knowledge and a new perspective. Frontiers in Human Neuroscience, 13, 2019.
[26] Matthew Watson, Bashar Awwad Shiekh Hasan, and Noura Al Moubayed. Agree to disagree: When deep learning models with identical architectures produce distinct explanations. CoRR, abs/2105.06791, 2021.
[27] Yao-Yuan Yang and Kamalika Chaudhuri. Understanding rare spurious correlations in neural networks, 2022.
[28] Erdi Çallı, Ecem Sogancioglu, Bram van Ginneken, Kicky G. van Leeuwen, and Keelin Murphy. Deep learning for chest x-ray analysis: A survey. Medical Image Analysis, 72:102125, 2021.
