Naumova Et Al-2024-Npj Digital Medicine
https://doi.org/10.1038/s41746-024-01226-1
Distributed collaborative learning is a promising approach for building predictive models for privacy-
sensitive biomedical images. Here, several data owners (clients) train a joint model without sharing
their original data. However, concealed systematic biases can compromise model performance and
fairness. This study presents the MyThisYourThat (MyTH) approach, which adapts an interpretable
prototypical part learning network to a distributed setting, enabling each client to visualize feature
differences learned by others on their own images: comparing one client's 'This' with others' 'That'. In
our setting, four clients collaboratively train two diagnostic classifiers on a benchmark
X-ray dataset. Without data bias, the global model reaches 74.14% balanced accuracy for
cardiomegaly and 74.08% for pleural effusion. We show that with systematic visual bias in one client,
the performance of global models drops to near-random. We demonstrate how differences between
local and global prototypes reveal biases and allow their visualization on each client’s data without
compromising privacy.
1 Laboratory for Intelligent Global Health and Humanitarian Response Technologies (LiGHT), Machine Learning and Optimization Laboratory, Swiss Federal Institute of Technology Lausanne (EPFL), Lausanne, Switzerland. 2 ETH AI Center, Swiss Federal Institute of Technology Zurich (ETH Zurich), Zurich, Switzerland. 3 Berkeley AI Research Laboratory, University of California, Berkeley, CA, USA. 4 Department of Computer Science, University of Southern California, Los Angeles, CA, USA. 5 Machine Learning and Optimization Laboratory, Swiss Federal Institute of Technology Lausanne (EPFL), Lausanne, Switzerland. 6 Laboratory for Intelligent Global Health and Humanitarian Response Technologies (LiGHT), School of Medicine, Yale University, New Haven, CT, USA. e-mail: [email protected]

The transformative force of deep learning on clinical decision-making systems is being increasingly documented1–3. For medical images, these advances have the potential to improve and democratize access to high-quality standardized interpretation, extracting predictive features at a granularity previously inaccessible to human experts, who are often lacking in low-resource settings. The potential to automate routine analysis of medical records3 and to help find hidden predictive patterns in the data that may reduce errors and unnecessary interventions (for example, biopsy)4 moves us towards more efficient, personalized, and accessible healthcare.

However, the performance of these models relies on large, carefully curated centralized databanks, which are often challenging to create or access in practice. Rather, medical data are usually fragmented among several institutions that are unable to share them for a range of well-considered reasons. DIStributed COllaborative (DISCO) learning has emerged as a solution to this issue, offering privacy-preserving collaborative model training without sharing any original data. Here, instead of sending the data to a central model, the model itself is distributed to the data owners to learn in situ, updating a global model via privacy-preserving gradients. When a central server is used, the technique is known as federated learning (FL)5, and it has already been shown to hold potential for various medical applications6–9.

While DISCO addresses the issue of data privacy, it comes at a cost to transparency, resulting in clients learning blindly from their peers. In this black-box data setting, hidden biases between clients of the federation can make generalization challenging, and even when no biases are present, their mere risk may degrade trust. Coupled with the already poor transparency of the deep learning architectures used for medical images (so-called black-box models), interpretability is becoming a critical feature to ensure a balance in the trade-off between transparency and privacy that will encourage implementation.

Specifically, an approach is required to inspect data interoperability between clients as well as to provide insights into the most predictive features. Additionally, this approach should be adept at detecting and quantifying concealed biases in the data in an interpretable way, while preserving clients' privacy. For instance, in shortcut learning, a model uses a proxy feature that is systematically associated with a label to predict that label (e.g. using
the hospital logo on an X-ray to diagnose tuberculosis because this infection is specifically treated in that hospital).

As for black-box neural networks, numerous approaches attempt to explain them by a post hoc analysis of their predictions10–16. Many of these methods have been well summarized elsewhere17,18. Feature visualization (https://distill.pub/2017/feature-visualization/) and saliency mapping with Grad-CAM12 are among the most popular techniques. These methods visualize the regions in the data that are most determinant for the prediction; however, they fail to tell us why or to what extent the visualized regions are essential for a prediction19.

This work aims to adapt an inherently interpretable model (IM) to the FL setting20,21. Many IMs exist for tabular data, for example, sparse logical models such as decision trees and scoring systems. For image recognition tasks, models that possess human-friendly reasoning based on a similarity between a test instance and already known instances (e.g. nearest neighbors from a training set or the closest samples from a set of learned prototypes) are particularly promising. This learning approach opens possibilities for incorporating the scrutiny of domain experts, allowing them to debug the model's logic and examine the quality of training data. The prototypical part learning neural network (ProtoPNet) developed by Chen et al.22 is a popular IM for images and is the method we adopt for FL in this work.

To summarize, ProtoPNet (Supplementary Fig. 1) uses a set of convolutional layers to map input images to a latent space, followed by a prototype layer, which learns a set of prototypical parts from encoded training images that best represent each class. Classification then relies on a similarity score computed between these learned prototypes and an encoded test image. A prototype can be visualized by highlighting the patch in an input image that is closest, in terms of squared L2-distance, to this prototype in the latent space. The performance of ProtoPNet was demonstrated on the task of bird species identification. The model showed an accuracy comparable with state-of-the-art black-box deep neural networks while being easily interpretable. ProtoPNet was further extended to perform classification of mass lesions in digital mammography23 and image recognition with hierarchical prototypes24.

In this work, we develop an approach called MyTH (MyThisYourThat) through the adaptation of ProtoPNet to FL and demonstrate its capacity for identifying bias in medical images. The idea is that prototypes learned on each client's local data represent feature importance from that client's point of view. As summarized in Fig. 1, clients learn local prototypes separately and send them to a server that aggregates and averages the local prototypes to obtain global ones and sends them back to the clients. The patches most activated by each of these two types of prototypes can be visualized and compared on each client's local test set without a need to share the data. By comparing global and local prototypes, the clients can assess the interoperability of the data and directly examine the predictive impact of other clients without compromising their privacy, i.e. compare one client's 'This' with others' 'That'. To the best of our knowledge, this work is the first attempt at creating an interpretable methodology to inspect the interoperability of biomedical imaging data in FL.

Our main contributions are as follows:
1. We introduce MyTH, adapting ProtoPNet to a federated setting to enable privacy-preserving identification of visual data bias in FL.
2. We formalize a set of use cases for interpretable distributed learning on imperfectly interoperable biomedical image data containing hidden bias.
3. We demonstrate the performance of our MyTH approach on a benchmark dataset of human X-rays and compare it to baseline models.
4. We show how MyTH helps identify biased clients in FL without disclosing the data.
5. Finally, we propose a new approach to use MyTH for interpretable personalization.

Results
Quantitative results
We experimented on the CheXpert dataset25, a large public dataset of chest X-rays, after processing it to allow for binary classification of the cardiomegaly and pleural effusion conditions (see the "Methods" section).

Unbiased setting. For both tasks, we first trained a Centralized Model (CM), i.e. the ProtoPNet baseline. We then partitioned the data in an independently and identically distributed (IID) manner over four clients and trained Local models (LM) on each client's data without collaboration. After that, the clients collaboratively trained Global (GM) and Personalized (PM) models, sharing either all (GM) or part (PM) of their network parameters.

The balanced accuracy for these four types of models trained on unbiased data is presented in Table 1. The CM gives 74.45% and 75.95%
balanced accuracy for cardiomegaly and pleural effusion classification, respectively. As expected, LMs perform worse than centralized ones due to the smaller dataset: LMs achieve 71.64% for cardiomegaly and 70.66% for pleural effusion classes. When the clients train ProtoPNet in collaboration, its performance improves: GM achieves 74.14% balanced accuracy for cardiomegaly and 74.08% for pleural effusion classes, which are close to the values achieved by the corresponding CM. Personalized models, however, demonstrate worse performance of 63.74% and 63.76% for cardiomegaly and pleural effusion classes, respectively, which may be the consequence of exchanging only a part of the network: the prototypes and the weights of the final layer. The values of classification sensitivity and specificity used to compute balanced accuracy can be found in Supplementary Table 1.

Table 1 | Centralized vs federated unbiased settings

Model               CM              LM              GM              PM
Cardiomegaly        74.45 ± 0.73    71.64 ± 1.05    74.14 ± 0.77    63.74 ± 4.45
Pleural effusion    75.95 ± 0.68    70.66 ± 2.40    74.08 ± 2.24    63.76 ± 2.01

Classification balanced accuracies (%, ±SD) for CM (centralized model), LM (local model), GM (global model), and PM (personalized model) trained without data bias on the CheXpert dataset for the cardiomegaly and pleural effusion classes. The uncertainty is computed over three runs with different seeds and averaged over four datasets where applicable.

Biased setting. We experimented with two types of data bias in one of the clients:
• synthetic: adding a small red emoji to the positive class in cardiomegaly classification (Fig. 2a);
• real-world: adding chest drains to the positive class in pleural effusion classification as a more real-world bias (Fig. 2b). To achieve this, we replaced images in the Pleural effusion class with X-rays labeled for the presence of chest drains26.

The real-world use case can arise as pleural effusions are often drained. Drain positions are routinely checked with a post-insertion X-ray. Thus, a model may learn to diagnose pleural effusion by detecting a chest drain rather than the pathology (i.e., shortcut learning).

We compare local, global, and personalized models (LMb, GMb, and PMb, respectively) trained in the presence of data bias separately on unbiased and biased data (Table 2). It is clear that both types of data poisoning have a large effect on model performance.

We see that models with large local contributions, LMb and PMb, give 100.0% and 89.80% test accuracy, respectively, on biased data and 50.0% on unbiased data in the case of cardiomegaly classification. Thus, these models strongly rely on the presence of bias to predict a positive class (shortcut learning). Since the chest drain bias is more difficult to learn than the obvious emoji, for pleural effusion classification, LMb and PMb do not achieve maximum accuracy on biased data but instead 73.22% and 64.81%, respectively. At the same time, their performance on the unbiased test set is as low as for the cardiomegaly class, namely 50.37% and 49.87% for LMb and PMb, respectively.

Global models (GMb), trained via communication of all the learnable parameters of ProtoPNet, demonstrate a performance different from their local and personalized versions. For cardiomegaly, GMb achieves 61.53% and 55.85% balanced accuracy on biased and unbiased sets, respectively. For pleural effusion, the model achieves nearly 50% on both test sets. Sensitivity and specificity for the biased setting are shown in Supplementary Table 2.
Fig. 3 | Prototypes learned in an unbiased setting. Examples of prototypical parts learned by CM (gray), LM (blue), GM (purple), and PM (green) in an unbiased setting and
visualized on a corresponding training set.
Qualitative results
The quantitative performance of the models described above can be further supported in a visually interpretable way with the help of the learned prototypes. Examples of prototypes visualized on the training sets for the models trained in the unbiased setting are shown in Fig. 3. We can see that these prototypes represent class-characteristic features that align with human logic. For example, in order to classify an image as cardiomegaly, a centralized model looks at the whole enlarged heart (Fig. 3) or at the collarbone level in the center, pointing out the extended aorta characteristic of this condition (Supplementary Fig. 2). As for pleural effusion classification, most prototypes activate the lower part of the lungs, where fluid accumulates in this disorder. More examples of the prototypes learned in the unbiased and biased settings can be found in Supplementary Figs. 2–15.

To demonstrate the effect of data bias, we compare the models on test images by finding the patch most activated by the prototypes learned locally and collaboratively in the FL setting with three unbiased and one biased client (Figs. 4 and 5). We see that the local model of an unbiased client looks at a meaningful class-characteristic patch in both biased (Figs. 4 and 5: second-row last column) and unbiased (Figs. 4 and 5: first-row first column) images to reason its predictions. The personalized model of an unbiased client highlights a meaningful patch too (Figs. 4 and 5: first-row third column). In the case of a biased client, the local model (LMb) for cardiomegaly classification (Fig. 4: second-row first column) looks at the emoji in the upper left corner of a test image. It tends to search for it in the unbiased image as well (Fig. 4: first-row last column). The neighborhood of this injected bias turned out to be the most activated patch for the personalized model (PMb, Fig. 4: second-row third column). This result explains the 100% accuracy of LMb and 89.80% accuracy of PMb on a biased test set and their complete failure on an unbiased one.

In the pleural effusion class, LMb and PMb indeed rely on the presence of a chest drain in an X-ray image, as we can see from the most activated prototypes (Fig. 5: second row, first and third columns).

As for the fully global models trained in the federated setting with one biased client (GMb), there is a difference in their behavior depending on the type of bias used. The injected bias (an emoji), applied to the cardiomegaly class, did not have an effect on the global prototypes: they still activate the heart in both biased and unbiased images (Fig. 4: second column). For pleural effusion, however, with the more realistic chest drain bias, the global prototypes seem to be affected strongly by the biased client's training set, since they tend to activate the upper part of an image instead of the bottom of the lungs, where the fluid usually accumulates in the pleural effusion condition (Fig. 5: second column).
We demonstrated that prototypes learned by ProtoPNet are sensitive to data bias and thus can help to create a visually interpretable approach to explore data interoperability in FL in a privacy-preserving way. We discuss this possibility in the next section.

Discussion
Data compatibility between the clients in FL is of utmost importance for training an efficient and generalizable model. In this work, we present a visually interpretable approach for bias identification in FL that leverages a prototypical part learning network. A scheme to identify an incompatible client can be approximated as follows (a sketch of the comparison in steps 3 and 4 is given below):
1. Each client trains a local ProtoPNet (LM) on its own data set.
2. With the help of a central server, the clients train global models (GM and/or PM), sharing all or a portion of the learnable parameters (e.g., only the prototypes and weights of the final layer).
3. Each client visualizes its local, global, and personalized prototypes (finding the most activated patches) on its local test set and compares them by means of simple visual inspection (ideally with the help of domain experts). There is no need to share a test set with other clients or the server.
4. A large difference between local and global/personalized prototypes for certain clients indicates a possible data bias in the federation and requires the clients to either quit the federation or take measures to improve the quality of their training data.
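Steps 3 and 4 can be complemented by a simple automated screen before visual inspection. The sketch below is an illustrative suggestion rather than part of the published pipeline: it assumes each client can compute upsampled activation maps for its local and global prototypes on its own test images, and it flags prototypes whose top-activated regions rarely coincide. No data leave the client.

import numpy as np

def flag_divergent_prototypes(local_maps, global_maps, iou_threshold=0.3):
    """Compare local vs. global prototype activations on one client's test set.

    local_maps, global_maps: arrays of shape [n_images, n_prototypes, H, W]
    holding the upsampled activation maps produced by the local and the
    global model on the same local test images (hypothetical inputs).
    Returns indices of prototypes whose most-activated regions rarely
    overlap, i.e. candidates for manual inspection (step 4 of the scheme).
    """
    n_images, n_protos, _, _ = local_maps.shape
    divergent = []
    for p in range(n_protos):
        overlaps = [_patch_iou(local_maps[i, p], global_maps[i, p])
                    for i in range(n_images)]
        if np.mean(overlaps) < iou_threshold:
            divergent.append(p)
    return divergent

def _patch_iou(map_a, map_b, percentile=95):
    """IoU of the top-activated regions (>= 95th percentile) of two maps."""
    a = map_a >= np.percentile(map_a, percentile)
    b = map_b >= np.percentile(map_b, percentile)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

The 95th-percentile region matches the prototype localization rule used later in the "Methods" section; the IoU threshold is an arbitrary illustrative choice that a domain expert would tune.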
We demonstrate this scheme on a task of binary classification of X-ray images for the presence of cardiomegaly and pleural effusion conditions using two different data poisoning patterns. As can be seen from Fig. 4, a simple injected bias such as an emoji in the cardiomegaly class easily confuses the local model, making it spuriously rely on this emoji to predict a positive class. It is interesting to note that for this binary classification task, adding bias to a positive class also changes the prototypes for a negative class. This effect can be seen in Supplementary Fig. 4 and Supplementary Fig. 8, where prototypical parts for a negative (unbiased) class turned out to be left in upper regions where there was an emoji for a positive class. Obviously, these prototypes have no practical value or plausible physiological mapping in classifying cardiomegaly.

Training a model via averaging the parameters over all clients helps to level out the effect of the bias completely (GM) or to a smaller extent (PM). This apparent difference between local and global/personalized prototypes should alarm the data owner of possible discrepancies between their data and others'. From the unbiased clients' perspective, since the difference between the prototypes is negligible for them, a drop in the performance of the GM in comparison to the LM and larger uncertainty values are signs of poor data interoperability in the federation.

To experiment with more practically relevant data bias, we mimic a common real-world example of shortcut learning, where pleural effusion can be detected by the presence of chest drains (which have been placed after initial diagnosis as a therapeutic intervention). Thus, the presence of chest drains in X-ray images can serve as a proxy for the pleural effusion class. We trained our models in the FL setting where one client has images with chest drains in the positive class (note that these images do not necessarily have pleural effusion anymore).

Figure 5 shows a possible output of applying MyTH on a pleural effusion classification task in a biased setting. As before, an LMb fails to activate a class-relevant feature, namely the bottom region of the lungs, as an unbiased model does, and instead looks at the upper part of the chest where there are lots of drains. The same result was observed for PMb. It is interesting that the fully global model also activates the upper part of a test image in both biased and unbiased samples. Unlike the cardiomegaly classification, in this case the data incompatibility is clearer for unbiased clients than for biased ones. Indeed, in the cardiomegaly classification task, only one client has a systematic bias, while in the pleural effusion case, chest drains may naturally be present in the images of other clients as well. This data distribution is applicable in the real world. It makes the chest drain prototype dominant among the positive-class prototypes of a global model and significantly worsens the overall model performance.

As mentioned in the "Experimental details" subsection of the "Methods" section, the two different ways of parameter aggregation allow us to investigate a trade-off between privacy and ease of bias identification. Obviously, the more parameters clients share, the higher the risk of privacy leakage. At the same time, GMb, trained via aggregating all learnable parameters of ProtoPNet, demonstrates a large difference between local and global prototypes in the presence of data bias in the federation, facilitating the identification of this bias. PMb, trained by centrally updating only the prototypes and the weights of the final layer, makes it more challenging for an adversary to recover the data from such a small set of network parameters but has less bias-identification power: due to the large local contribution, the difference between local and personalized prototypes is small.

In this work, we presented two extreme cases of parameter aggregation. More experiments are needed to define an optimal amount of parameters to share. Note, however, that this potential amount is not strict and depends on the data's sensitivity to privacy. Therefore, it is up to the clients to set their privacy budget, i.e. how many network parameters they are ready to share. Optionally, clients can also share their local models with each other to visualize them on other clients' data for additional comparison. An example of such a possibility is shown in Figs. 4 and 5 in the last column. We can see a large difference between the local prototypes learned on biased and unbiased data for both the cardiomegaly and pleural effusion classes.

So far, we have been talking about data bias from a negative perspective. However, it is possible to have large heterogeneity among the clients, meaning that some specific features that each of them has are important. For instance, variations in skin color can be an essential feature for predicting dermatological pathologies, as certain skin conditions may present differently depending on skin pigmentation. In this case, training a PM allows clients to benefit from the federation while keeping their specific features essential for the prediction (see Figs. 4 and 5: second-row third column).

It is worth noting the directions for future work. To investigate the trade-off between privacy and the bias-identification ability of our MyTH approach, further studies are required. It is also necessary to experiment with other medical datasets and real-world biased settings with a larger number of clients. Since the objective of this research was to develop a novel deep learning-based methodology with the potential for application across various medical imaging modalities, we used a retrospectively collected dataset to align with common practices in building deep learning models. However, a prospective study in a real clinical setting is strictly necessary to ensure the effectiveness and safety of our approach for practical use.

Additionally, a possible next step is to introduce a debiasing option to our approach that will allow us to instantly penalize the contribution of a biased client. This may be done, for example, automatically through prototype weighting or manually with the help of domain experts.

Furthermore, we are currently working on adapting our MyTH approach to a web-based DISCO application (https://epfml.github.io/disco). It provides a user-friendly framework for distributed learning and thus has the potential to facilitate the integration of MyTH into medical practice.

Finally, a promising direction for future work is combining our approach with counterfactual explanation techniques known for providing human-understandable post-hoc explanations for model decision-making27. For instance, clients can apply counterfactual explanation techniques to their local and global models to observe the changes in model output after removing features identified by the MyTH approach as potential biases. This integration can further help in developing trust in deep learning models among medical practitioners.

Methods
Model description
The ProtoPNet architecture is presented in Supplementary Fig. 1. The network is composed of the following parts (a minimal sketch follows the list):
• a set of convolutional layers to learn features from the input data;
• two additional 1 × 1 convolutional layers with D channels and the ReLU activation after the first layer and Sigmoid after the second;
• a prototype layer with a predefined number of prototypes. Each prototype is a vector of size 1 × 1 × D with randomly initialized entries;
• a final fully connected layer with the number of input nodes equal to the number of prototypes and the number of output nodes corresponding to the number of classes. The weights indicate the importance of a particular prototype for a class. They are initialized as in ref. 22 such that the connection between a prototype and its corresponding class is 1 and −0.5 for the connections with the wrong classes.
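The list above can be condensed into a short PyTorch sketch. This is only an illustrative approximation, not the authors' released code: the small convolution stands in for the pretrained DenseNet feature extractor, and the sizes (D = 128, 10 prototypes per class, 2 classes) mirror the experiments described below.

import torch
import torch.nn as nn

class ProtoPNetSketch(nn.Module):
    """Rough sketch of the ProtoPNet-style layer stack listed above."""

    def __init__(self, in_channels=3, d=128, prototypes_per_class=10, num_classes=2):
        super().__init__()
        m = prototypes_per_class * num_classes  # total number of prototypes
        # stand-in for the pretrained convolutional feature extractor (DenseNet in the paper)
        self.features = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        # two additional 1x1 convolutions: ReLU after the first, Sigmoid after the second
        self.add_on = nn.Sequential(
            nn.Conv2d(256, d, kernel_size=1), nn.ReLU(),
            nn.Conv2d(d, d, kernel_size=1), nn.Sigmoid(),
        )
        # m prototypes, each a 1 x 1 x D vector with randomly initialized entries
        self.prototypes = nn.Parameter(torch.rand(m, d))
        # final fully connected layer: one input per prototype, one output per class,
        # initialized to +1 for a prototype's own class and -0.5 otherwise (as in ref. 22)
        self.last_layer = nn.Linear(m, num_classes, bias=False)
        init = -0.5 * torch.ones(num_classes, m)
        for c in range(num_classes):
            init[c, c * prototypes_per_class:(c + 1) * prototypes_per_class] = 1.0
        with torch.no_grad():
            self.last_layer.weight.copy_(init)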
We trained MyTH, an FL adaptation of ProtoPNet, using either an unbiased, identical data distribution among clients (unbiased setting) or imperfectly interoperable data with systematic bias in a single class (biased setting). Two parameter aggregation schemes were applied: (1) a central server aggregates and updates the local parameters of all the layers of ProtoPNet, or (2) only the prototypes and the weights of the final fully connected layer. In the second case, the parameters of the feature learning layers always stay local, which results in personalized models with global prototypes. A detailed scheme is shown in Fig. 6.

By learning local prototypes, each client identifies the features in its training images most important for the task. In contrast, the global prototypes show the relevance for all clients on average. Finally, by examining the difference between local and global prototypes, a client can identify bias in its own or in another client's dataset.

Experimenting with two different parameter aggregation schemes allows us to investigate the trade-off between the bias-revealing and privacy-preserving properties of MyTH.

Notation. Hereafter, we denote matrices and vectors in bold capital and bold lowercase letters, respectively.
Data split: Each client n ∈ {1, ..., N} has a training set D_n of size l, which consists of training images {X_i}_{i=1}^l and their corresponding classes {y_i}_{i=1}^l.

Fig. 6 | MyTH architecture. Several clients (blue panel for unbiased and orange panel for biased clients) wish to learn a model in a federated setting via a SERVER. MyTH passes raw data through convolutional layers to create embeddings in a latent space, each of which can be seen as H × W image patches of size [1 × 1 × D]. These patches are clustered around the closest prototypes (blue and red crosses), which are learned for each class in the prototype layer. A prototype is a vector representing a class-characteristic feature in the latent space. Client_n (orange panel) has a systematic bias, which contaminates the prototype pool (red cross). Prototypes for each class and other learnable parameters of the network are shared with the SERVER by each client and aggregated to make global parameters (circled purple cross). These are then sent back to the clients. Classification is based on a similarity score between the prototypes and the patches of an encoded image. In the final panel, we see how global and local prototypes can be compared without sharing any original data.
Model: Each client learns features with convolutional layers and m local prototypes P_n = {p_j}_{j=1}^m of size [1 × 1 × D], with a fixed number of prototypes per class. First, given an input image X_i, the convolutional layers produce an image embedding Z_i of size [H × W × D], which can be represented as [H × W] patches of size [1 × 1 × D]. Then, the prototype layer computes the squared L2-distance between each prototype p_j and all the patches in the image embedding Z_i. This results in m distance matrices of size [H × W], which are then converted into matrices of similarity scores (activation matrices) and subjected to a global max pooling operation to extract the best similarity score for each prototype. These final scores are then multiplied by the weight matrix W_h^n of the final fully connected layer h, followed by softmax normalization, to output class probabilities.
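A hedged sketch of this forward pass is shown below; the tensor names are illustrative, and the distance-to-similarity mapping log((d + 1)/(d + ε)) follows the original ProtoPNet formulation of ref. 22 rather than anything stated explicitly in this section.

import torch
import torch.nn.functional as F

def protopnet_forward(z, prototypes, last_layer_weight, eps=1e-4):
    """Sketch of the prototype-layer forward pass described above.

    z:                  image embedding of shape [D, H, W]
    prototypes:         m prototype vectors, shape [m, D]
    last_layer_weight:  final fully connected layer weights, shape [num_classes, m]
    Returns class probabilities of shape [num_classes].
    """
    d, h, w = z.shape
    patches = z.reshape(d, h * w).T                   # [H*W, D] patches of size 1x1xD
    # squared L2-distance between every prototype and every patch -> [m, H*W]
    dists = torch.cdist(prototypes, patches).pow(2)
    # distance -> similarity (activation matrices), then global max pooling
    sims = torch.log((dists + 1.0) / (dists + eps))
    best_sim = sims.max(dim=1).values                 # best similarity per prototype, [m]
    logits = last_layer_weight @ best_sim             # weighted by prototype importance
    return F.softmax(logits, dim=0)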
Training. The details of local training can be found in ref. 22 and in Supplementary Note 1. At the global update step, the server aggregates the local prototypes P_n, the weights of the final layer W_h^n, and, in the first aggregation scheme, also the parameters of the convolutional layers W_c^n from each client n, and performs simple averaging of these parameters to obtain the global ones:

P_{glob} = \frac{1}{N} \sum_{n=1}^{N} P_n        (1)

W_{h,glob} = \frac{1}{N} \sum_{n=1}^{N} W_h^n    (2)

W_{c,glob} = \frac{1}{N} \sum_{n=1}^{N} W_c^n    (3)

and then sends them back to the clients, as shown in Fig. 6.
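A minimal sketch of this server-side averaging under the two aggregation schemes is given below; the function name and dictionary keys are illustrative assumptions, not the authors' API.

import torch

def aggregate(client_params, share_conv=True):
    """Average the parameters uploaded by N clients (Eqs. (1)-(3)).

    client_params: list of dicts with keys 'prototypes', 'W_h' and, when the
    full-sharing scheme is used, 'W_c' (each a torch.Tensor).
    share_conv=False corresponds to the second scheme, where only the
    prototypes and final-layer weights are averaged (personalized models).
    """
    n = len(client_params)
    keys = ['prototypes', 'W_h'] + (['W_c'] if share_conv else [])
    return {k: sum(c[k] for c in client_params) / n for k in keys}

# usage: global_params = aggregate(uploads); the server then broadcasts them back to the clients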
To visualize the prototypes, each client finds, for each of the local and global prototypes, the patch among its training images from the same class that is most activated by the prototype. This is achieved by forwarding the image through the trained ProtoPNet and upsampling the activation matrix to the size of the input image. A prototype can then be described as the smallest rectangular area within an input image that contains pixels with an activation value in the upsampled activation map equal to or greater than the 95th percentile of all activation values in that map22.
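As a rough illustration of this localization rule, the sketch below upsamples one activation map and returns the smallest rectangle covering the top-activated pixels; the choice of bilinear-style interpolation is an assumption, not a detail taken from the paper.

import numpy as np
from scipy.ndimage import zoom

def prototype_bounding_box(activation_map, image_hw, percentile=95):
    """Locate a prototype in an input image from its activation map.

    activation_map: [H, W] similarity scores from the prototype layer.
    image_hw: (height, width) of the original input image.
    Returns (top, bottom, left, right) of the smallest rectangle covering all
    pixels whose upsampled activation is >= the given percentile of the map.
    """
    h, w = activation_map.shape
    up = zoom(activation_map, (image_hw[0] / h, image_hw[1] / w), order=1)  # bilinear-like upsampling
    mask = up >= np.percentile(up, percentile)
    rows, cols = np.where(mask)
    return rows.min(), rows.max(), cols.min(), cols.max()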
Data
The experiments were conducted on the CheXpert dataset25, a large public dataset of 224,316 chest X-rays of 65,240 patients collected from Stanford Hospital between October 2002 and July 2017 in both inpatient and outpatient centers. Each image was accompanied by a radiology report, which was labeled for the presence of 14 observations as positive, negative, or uncertain. The images were consequently labeled with a rule-based labeler developed by the authors of ref. 25, which extracted observations from the text radiology reports. The original 14 observations included Enlarged Cardiomediastinum, Cardiomegaly, Lung Lesion, Lung Opacity, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural Effusion, Pleural Other, Fracture, Support Devices, and No Finding. To simplify the experiments and interpretation, we used a one-vs-rest binary setting. Specifically, we use images with positive labels for the classes Cardiomegaly or Pleural Effusion as the positive class and all other images as the single negative class. Cardiomegaly is a health condition characterized by an enlarged heart, and pleural effusion is an accumulation of fluid between the visceral and parietal pleural membranes that line the lungs and chest cavity.

This setting, however, resulted in a data imbalance (7 and 1.6 times for cardiomegaly and pleural effusion, respectively). To address this issue, we decreased the size of the negative class in the training set by undersampling to make it equal to the size of the positive class. The final training sets had 48,600 and 37,088 images for cardiomegaly and pleural effusion classification, respectively. The validation sets were left imbalanced.

Experimental details
For both the cardiomegaly and pleural effusion classification tasks, we first trained and evaluated a baseline centralized ProtoPNet, which we denote as the Centralized Model (CM). Then we made an IID partition of the data over four clients and trained Local (LM), Global (GM), and Personalized (PM) models. Finally, we introduced systematic bias to one client's dataset and trained LMb, GMb, and PMb models, where the superscript b denotes a setting with one biased and three unbiased clients. The training details are described below.

Unbiased setting.
1. Centralized (CM) ProtoPNet: As a baseline, we follow the architecture and optimization parameters from the ProtoPNet paper22, using the DenseNet28 convolutional layers pretrained on ImageNet29 to learn a CM on the whole dataset. We used 10 prototypes of size 1 × 1 × 128 per class. We report the balanced average validation accuracy due to the validation set imbalance:

   Balanced accuracy = \frac{Sensitivity + Specificity}{2}    (4)

2. Local models (LM): We trained and evaluated an LM for each of the four IID clients.
3. Global models (GM): Using the first FL setup, where the server aggregates the parameters of all the layers, GMs were trained according to the scheme depicted in Fig. 6. The training comprises three (for pleural effusion) or four (for cardiomegaly) communication rounds between the clients and the server. The server initializes a ProtoPNet model and sends it to the clients, who learn LMs. After five epochs, a subset of local parameters is communicated to the server and aggregated. Importantly, during this training stage, each client keeps the pretrained convolutional weights frozen and trains the two additional convolutional layers. Each of the next communication rounds includes the following steps:
   • Local training: Each client trains the convolutional layers, the prototype layer, and the final fully connected layer locally on its own dataset.
   • Local parameters: A set of local prototypes P_n and weights W_h^n and W_c^n is sent to the server after every 10 epochs.
   • Global parameters: The server averages the local parameters to create a set of global prototypes P_{glob} and weights W_{h,glob} and W_{c,glob}. These are shared back to each client to iterate training.
4. Personalized models (PM): We used the second FL setup, within which the server aggregates only the prototypes P_n and the weights of the final fully connected layer W_h^n, to train PMs. We followed the same communication technique as described for GM, with the difference that, after receiving the updated prototypes P_{glob} and weights W_{h,glob} from the server, each client performs an additional prototype update locally by finding the nearest latent training patch from the same class and assigning it as a prototype. This operation is known as prototype push in ref. 22, and we use it to adapt global prototypes to a local dataset for personalization. This step is followed by local optimization of the final layer to improve accuracy.

Biased setting.
5. Local, global, and personalized models: We trained LMb, GMb, and PMb models in an FL setting with three unbiased clients and one with systematic bias in one class (Fig. 2). This setting is schematically shown in Fig. 7 between two clients. We visually inspect the prototypes learned locally and globally to detect the differences between clients' data without sharing them.

Statistical analysis
For each of the models described above, we present the uncertainty in terms of standard deviation. It was computed over three runs with different seeds. For LM and PM, the final performance was also averaged over four datasets (clients).
Data availability
The benchmark CheXpert dataset (CheXpert-v1.0-small) used in the current study is publicly available on the Kaggle platform at https://www.kaggle.com/datasets/ashery/chexpert. A list of CheXpert images labeled for the presence of chest drains, which was used to generate biased data in this study, was adopted from ref. 26 and is publicly available at https://github.com/EPFLiGHT/MyTH.

Code availability
The code to reproduce our approach is available at https://github.com/EPFLiGHT/MyTH.

Received: 30 January 2024; Accepted: 14 August 2024;

References
1. Piccialli, F., Di Somma, V., Giampaolo, F., Cuomo, S. & Fortino, G. A survey on deep learning in medicine: Why, how and when? Inf. Fusion 66, 111–137 (2021).
2. Shen, D., Wu, G. & Suk, H.-I. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19, 221–248 (2017).
3. Landi, I. et al. Deep representation learning of electronic health records to unlock patient stratification at scale. npj Digit. Med. 3, 96 (2020).
4. Barnett, A. J. et al. Interpretable deep learning models for better clinician-AI communication in clinical mammography. In Proc. of SPIE Medical Imaging 2022: Image Perception, Observer Performance, and Technology Assessment, SPIE Digital Library (eds Mello-Thoms, C. R. & Taylor-Phillips, S.), 12035, 1203507 (2022).
5. McMahan, H. B., Moore, E., Ramage, D., Hampson, S. & Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proc. of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) (ed. Lawrence, N.), 54, 1273–1282 (JMLR: W&CP, 2017).
6. Nguyen, D. C. et al. Federated learning for smart healthcare: a survey. ACM Comput. Surv. 55, 1–37 (2022).
7. Rieke, N. et al. The future of digital health with federated learning. npj Digit. Med. 3, 1–7 (2020).
8. Nazir, S. & Kaleem, M. Federated learning for medical image analysis with deep neural networks. Diagnostics 13, 1532 (2023).
9. Islam, M., Reza, M. T., Kaosar, M. & Parvez, M. Z. Effectiveness of federated learning and CNN ensemble architectures for identifying brain tumors using MRI images. Neural Process. Lett. 55, 3779–3809 (2023).
10. Ribeiro, M. T., Singh, S. & Guestrin, C. "Why should I trust you?": explaining the predictions of any classifier. In Proc. of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Association for Computing Machinery (eds Krishnapuram, B., Shah, M., Smola, A. J., Aggarwal, C., Shen, D. & Rastogi, R.), 1135–1144 (2016).
11. Lundberg, S. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. of the 31st International Conference on Neural Information Processing Systems, Curran Associates Inc. (eds von Luxburg, U., Guyon, I., Bengio, S., Wallach, H. & Fergus, R.), 4768–4777 (2017).
12. Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. J. Comput. Vis. 128, 336–359 (2019).
13. Simonyan, K., Vedaldi, A. & Zisserman, A. Deep inside convolutional networks: visualising image classification models and saliency maps. In Workshop at International Conference on Learning Representations (2014).
14. Singh, A., Sengupta, S. & Lakshminarayanan, V. Explainable deep learning models in medical image analysis. J. Imaging 6, 52 (2020).
15. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. PMLR 70, 3145–3153 (2017).
16. Chen, H., Lundberg, S. & Lee, S.-I. Explaining Models by Propagating Shapley Values of Local Components, 261–270 (Springer International Publishing, Cham, 2021).
17. Samek, W., Montavon, G., Vedaldi, A., Hansen, L. K. & Müller, K.-R. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning (Springer, 2019).
18. Molnar, C. Interpretable Machine Learning (Independently published, 2023).
19. Adebayo, J. et al. Sanity checks for saliency maps. Adv. Neural Inf. Process. Syst. 31, 9505–9515 (2018).
20. Rudin, C. et al. Interpretable machine learning: fundamental principles and 10 grand challenges. Stat. Surv. 16, 1–85 (2022).
21. Sun, S., Woerner, S., Maier, A., Koch, L. M. & Baumgartner, C. F. Inherently interpretable multi-label classification using class-specific counterfactuals. Proc. Mach. Learn. Res. 227, 937–956 (2023).
22. Chen, C. et al. This looks like that: deep learning for interpretable image recognition. In Proc. of the 33rd International Conference on Neural Information Processing Systems, Curran Associates Inc. (eds Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. & Garnett, R.), 8930–8941 (2019).
23. Barnett, A. et al. A case-based interpretable deep learning model for classification of mass lesions in digital mammography. Nat. Mach. Intell. 3, 1067–1070 (2021).
24. Hase, P., Chen, C., Li, O. & Rudin, C. Interpretable image recognition with hierarchical prototypes. In Proc. of the 7th AAAI Conference on Human Computation & Crowdsourcing, PKP:PS (eds Law, E. & Vaughan, J. W.), 7, 32–40 (2019).
25. Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proc. of the 33rd AAAI Conference on Artificial Intelligence (AAAI-19) (eds Van Hentenryck, P. & Zhou, Z.-H.), 33, 590–597 (AAAI Press, 2019).
26. Jiménez-Sánchez, A., Juodelye, D., Chamberlain, B. & Cheplygina, V. Detecting shortcuts in medical images—a case study in chest X-rays. In Proc. of the IEEE 20th International Symposium on Biomedical Imaging (ISBI) (eds Salvado, O. & Rittner, L.), 1–5 (IEEE, 2023).
27. Ser, J. D. et al. On generating trustworthy counterfactual explanations. Inf. Sci. 655, 119898 (2024).
28. Huang, G., Liu, Z., van der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (ed. O'Conner, L.), 2261–2269 (IEEE, 2017).
29. Deng, J. et al. ImageNet: a large-scale hierarchical image database. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (eds Essa, I., Kang, S. B. & Pollefeys, M.), 248–255 (IEEE, 2009).

Acknowledgements
We would like to acknowledge Khanh Nguyen, Valérian Rousset, Julien Vignoud, and other members of the DISCO team for their valuable help in adapting the MyTH approach to a web-based DISCO application. This work was funded by the Science and Technology for Humanitarian Action Challenges (HAC) program from the Engineering for Humanitarian Action (EHA) initiative, a partnership between the International Committee of the Red Cross (ICRC), EPFL, and ETH Zurich. EHA initiatives are managed jointly by ICRC, EPFL EssentialTech Centre, and ETH Zurich's ETH4D. This work was also partially supported by NIH under award U24LM013755 and Innosuisse grant 113.137 IP-ICT.

Author contributions
Methodology and investigation: K.N., S.P.K., A.D.; Validation: K.N.; Data curation: K.N.; Formal analysis: K.N., A.D., S.P.K., M.-A.H.; Conceptualization: M.-A.H., S.P.K., K.N.; Visualization: K.N., M.-A.H., A.D.; Discussion: all authors; Supervision: M.-A.H., M.J.; Project administration: M.-A.H., M.J.
Resources: M.J.; Writing (original draft): K.N.; Writing (final draft): K.N., A.D., and M.-A.H. All authors approved the final version of the manuscript for submission and agreed to be accountable for their contributions.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41746-024-01226-1.

Correspondence and requests for materials should be addressed to Mary-Anne Hartley.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024