0% found this document useful (0 votes)

3 views

Self-eXplainable AI for Medical Image Analysis

This document presents a comprehensive survey on Self-eXplainable AI (S-XAI) for medical image analysis, emphasizing the need for transparent and reliable AI models in high-stakes medical decision-making. It reviews over 200 papers, categorizing S-XAI methods into input, model, and output explainability, and discusses their applications, evaluation metrics, and future research directions. The survey highlights the significance of S-XAI in enhancing trustworthiness, accountability, and collaboration between clinicians and AI systems.

Uploaded by

yeshw537

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

3 views

Self-eXplainable AI for Medical Image Analysis

Uploaded by

yeshw537

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 26

GENERIC COLORIZED JOURNAL, VOL. XX, NO.

XX, XXXX 2017 1

Self-eXplainable AI for Medical Image Analysis:

A Survey and New Outlooks
Junlin Hou, Sicen Liu, Yequan Bie, Hongmei Wang, Andong Tan, Luyang Luo, Hao Chen

Abstract— The increasing demand for transparent and

reliable models, particularly in high-stakes decision- Medical Image Prediction

Post-hoc XAI
making areas such as medical image analysis, has led to
arXiv:2410.02331v1 [cs.CV] 3 Oct 2024

the emergence of eXplainable Artificial Intelligence (XAI).

Post-hoc Method Explanation
Post-hoc XAI techniques, which aim to explain black-box
models after training, have been controversial in recent Black-box Model
works concerning their fidelity to the models’ predictions.
In contrast, Self-eXplainable AI (S-XAI) offers a compelling
alternative by incorporating explainability directly into the Medical Image Prediction

S-XAI
training process of deep learning models. This approach
allows models to generate inherent explanations that are
Explainable Input Explanation
closely aligned with their internal decision-making pro- Explainable Model
cesses. Such enhanced transparency significantly sup-
ports the trustworthiness, robustness, and accountabil- Fig. 1. Illustration of post-hoc XAI versus Self-eXplainable AI (S-XAI).
ity of AI systems in real-world medical applications. To
facilitate the development of S-XAI methods for medical
image analysis, this survey presents an comprehensive
answering (VQA). Deep neural networks (DNNs) automat-
review across various image modalities and clinical appli- ically learn features from input data and produce optimal
cations. It covers more than 200 papers from three key outputs. However, the inherent complexity nature of DNNs
perspectives: 1) input explainability through the integration hinder our understanding of the decision-making processes
of explainable feature engineering and knowledge graph, 2) behind these models. Consequently, DNNs are often consid-
model explainability via attention-based learning, concept-
based learning, and prototype-based learning, and 3) out-
ered as black-box models, which has raised concerns about
put explainability by providing counterfactual explanation their transparency, interpretability, and accountability for their
and textual explanation. Additionally, this paper outlines successful deployment in real-world clinical applications [1].
the desired characteristics of explainability and existing To tackle the challenge of developing more trustworthy AI
evaluation methods for assessing explanation quality. Fi- systems, research efforts are increasingly focusing on various
nally, it discusses the major challenges and future research
directions in developing S-XAI for medical image analysis. eXplainable AI (XAI) methods, enhancing transparency [2],
fairness [3], and robustness [4]. However, most XAI methods
Index Terms— Self-eXplainable Artificial Intelligence (S- aim to generate explanations for the outputs of black-box
XAI), Medical Image Analysis, Input Explainability, Model
Explainability, Output Explainability, S-XAI Evaluation
AI models after they have been trained, a category known
as post-hoc XAI, as illustrated in Fig. 1 top. These methods
utilize additional explanation models or algorithms to provide
I. I NTRODUCTION insights into the decision-making process of the primary AI
Artificial intelligence (AI), particularly deep learning, has model. In the field of medical image analysis, commonly used
driven significant advancements in medical image analysis, post-hoc XAI techniques include feature attribution methods,
including applications in disease diagnosis, lesion segmenta- such as gradient-based approaches (e.g., LRP [5], CAM [6])
tion, medical report generation (MRG), and visual question and perturbation-based approaches (e.g., LIME [7], Kernel
SHAP [8]). Additionally, some methods explored concept
This work was supported by the Hong Kong Innovation and Technol- attributions, learning human-defined concepts from the internal
ogy Fund (Project No. MHP/002/22), HKUST (Project No. FS111) and
Research Grants Council of the Hong Kong (No. R6003-22 and T45- activations of DNNs (e.g., TCAV [9], CAR [10]). Post-hoc
401/22-N). XAI techniques are often model-agnostic, indicating that they
J. Hou, Y. Bie, H. Wang, and A. Tan are with the Department of can be flexibly applied to a variety of already-trained black-
Computer Science and Engineering, Hong Kong University of Science
and Technology, Hong Kong, China (email: [email protected]) box AI models.
S. Liu is with the Department of Engineering, Shenzhen MSU-BIT Since post-hoc explanations are generated separately from
University, Shenzhen, China (email: [email protected]) the primary AI model, several valid concerns have been
L. Luo is with the Department of Biomedical Informatics, Harvard
University, Cambridge, USA (email: luyang [email protected]) raised: 1) these explanations may not always be faithful to
H. Chen is with the Department of Computer Science and Engineer- the actual decision-making process of black-box models [11],
ing, Department of Chemical and Biological Engineering and Division of [12]; 2) they may lack sufficient detail to fully elucidate
Life Science, Hong Kong University of Science and Technology, Hong
Kong, China; HKUST Shenzhen-Hong Kong Collaborative Innovation the model’s functioning [13]. These limitations of post-hoc
Research Institute, Futian, Shenzhen, China. (email: [email protected]) XAI approaches are particularly problematic in high-stakes
2 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

works. The statistics of research articles using keywords self-

explainable, medical, image on Google Scholar are presented
in Fig. 2, which reveal two key observations: 1) there has
been a significant and consistent increase in research papers
focused on self-explainable AI over the years, indicating
growing interest and emphasis within the research community;
2) nearly half of the total research papers (green bars) are
dedicated to applying S-XAI techniques in medical imaging
(blue bars), highlighting the vital importance of S-XAI in the
medical field.
To summarize, this survey presents insights into S-XAI for
medical image analysis, with our contributions outlined below:
1) Novel Scope of XAI Survey: As an emerging XAI
Fig. 2. The upward trend of the total number of S-XAI research papers method that actively offers explainability from the model
from 2018 to 2024 (Jan-Sept), where medical articles account for half. itself, S-XAI is attracting growing attention from the
research community. This work represents the first com-
domains like medical image analysis, where clinicians require prehensive survey on this topic.
a deep and trustworthy understanding of how an AI model 2) Systematic Review of Methods: We present a novel
arrives at its predictions. The issues about the faithfulness and taxonomy of relevant papers and review them based on
sufficiency of post-hoc explanations highlight the importance input explainability, model explainability, and output ex-
of exploring self-explainable AI models as a potentially more plainability. This offers insights into potential technical
reliable and transparent alternative. innovations for S-XAI methods.
Self-eXplainable AI (S-XAI) is a category of XAI meth- 3) Thorough Overview of Applications: We overview
ods designed to be interpretable by nature, as illustrated in various applications across different anatomical loca-
Fig. 1 bottom. These methods incorporate explainability as tions and modalities in current S-XAI research. This
an integral part of the model during the training process, illustrates the ongoing development of S-XAI technolo-
rather than generating explanations after the model has been gies in medical image practices, serving as a reference
trained. Conventional inherently interpretable methods include for future applications in diverse scenarios.
various white-box machine learning models, such as decision 4) Comprehensive Survey of Evaluations: We analyze a
trees [14], generalized additive models [15], and rule-based range of desired characteristics and evaluation methods
systems [16]. In this survey, we focus primarily on DNNs to assess the quality of explainability. This provides
and extend the characteristics of self-explainability across the guidelines for developing clinically explainable AI sys-
entire pipeline, from model input to architecture to output, tems that are trustworthy and meaningful for end-users.
enabling direct inspection and understanding of the reasoning 5) In-depth Discussion of Challenges and Future Work:
behind the model’s predictions without reliance on external We discuss the key challenges and look forward to
explanation methods. In contrast to post-hoc XAI approaches, the promising future directions. This highlights current
S-XAI methods aim to provide explanations that are inherent, shortcomings and identifies new opportunities for re-
transparent, and faithful, aligning directly with the model’s searchers to drive further advancements.
internal decision-making mechanisms. Such explanations are
essential for the effective adoption and clinical integration of II. S-XAI IN M EDICAL I MAGE A NALYSIS
AI-powered decision support systems. Furthermore, S-XAI fa- Transparency and trustworthiness are essential for deep
cilitates collaborative decision-making between clinicians and learning models deployed in real-world applications of medical
AI systems, fostering better-informed and more accountable image analysis. To address this need, the research community
medical diagnoses and interventions. has explored various XAI methods and proposed several XAI
This paper presents the first systematic review of S-XAI taxonomy. According to existing literature [17], [19], [23],
for medical image analysis, covering methodology, medical XAI methods can be categorized by the following criteria.
applications, and evaluation metrics, while also offering an in- 1) Intrinsic versus Post-hoc: This criteria differentiates
depth discussion on challenges and future directions. Although whether interpretability is inherent to the model’s architecture
there is a wealth of literature on medical XAI surveys [2], (intrinsic) or achieved after the model training (post-hoc).
[17]–[22] that deliver valuable insights, none have focused 2) Model-specific versus Model-agnostic: Model-specific
specifically on a comprehensive review of S-XAI methods methods are restricted to particular model classes, whereas
applied to medical image analysis. We analyze more than 200 model-agnostic methods can be applied to explain any model;
papers published from 2018 to September 2024, sourced from 3) Local versus Global: The scope of an explanation
the proceedings of NeurIPS, ICLR, ICML, AAAI, CVPR, distinguishes between those for an individual prediction (local)
ICCV, ECCV, and MICCAI as well as top-tier journals in or those for the entire model behavior (global).
the field, including Nature Medicine, Transactions on Pattern 4) Explanation Modality: The common types of explana-
Analysis and Machine Intelligence, Transactions on Medical tion include visual explanation, textual explanation, concept
Imaging, Medical Image Analysis, or those cited in related explanation, sample explanation, etc.
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 3

Input Explainability (Sec III) Model Explainability (Sec IV) Output Explainability (Sec V)

Medical Images Medical Applications

Medical Image Predictor Predictions
Model
Location Modality Encoder Explainability Mono-modal Multi-modal
Explainable Explainable
Breast X-ray Input A C Generator Output
B Disease Visual Question
Chest CT/MRI Classification Answering
Attention-based Learning (Sec IV-A)
Skin Dermatology Lesion Medical Report
1) Structure-guided Attention Models Segmentation Generation
Eye Ultrasound 2) Loss-guided Attention Models
… … ... ...

l Feature Engineering (Sec III-A) Concept-based Learning (Sec IV-B)

l Textual Explanation (Sec V-A)
Explainable Feature Engineering 1) Supervised Concept Learning 1) Fully-structured Text Generation
2) Unsupervised Automatic Concept Discovering 2) Semi-structured Text Generation
3) Generative Concept Learning 3) Free-structured Text Generation
l Knowledge Graph (Sec III-B)
1) Prior Knowledge Graph Prototype-based Learning (Sec IV-C)
2) Data Knowledge Graph l Counterfactual Explanation (Sec V-B)
1) Explicit Prototype based Models
3) Hybrid Knowledge Graph Counterfactual Image Generation
2) Implicit Prototype based Models

Fig. 3. Overview of Self-eXplainable AI (S-XAI) frameworks including input explainability, model explainability, and output explainability.

This survey concentrates on S-XAI methods for medical A. Explainable Feature Engineering
image analysis that allow models to inherently explain their
Feature engineering focuses on transforming raw images
own decision-making. As depicted in Fig. 3, we introduce a
into a more useful set of human-interpretable features. This
new taxonomy of S-XAI based on the three key components
process is crucial for traditional machine learning methods to
of DNNs.
achieve accurate predictions, but it can be time-consuming and
1) Input Explainability (Sec. III): Input explainability
demands significant domain expertise. In contrast, deep learn-
focuses on integrating additional explainable inputs with deep
ing models automatically extract features from raw images,
features of medical images obtained from various anatomical
simplifying the manual crafting process but often resulting
locations and modalities to produce final predictions. By
in reduced interpretability. A promising approach to enhance
incorporating external knowledge and context-specific infor-
input explainability is to incorporate explainable feature engi-
mation, the accuracy and reliability of these predictions can
neering into deep learning, which injects domain knowledge
be significantly improved.
2) Model Explainability (Sec. IV): Model explainability into the model, as shown in Fig. 4(a). This integration en-
aims to design inherently intepretable model architectures of hances the model’s interpretability by ensuring that the learned
DNNs. Instead of explaining a black-box model, transforming features are relevant and meaningful for clinical applications.
the model into an interpretable format enhances understanding Ultimately, this method improves model performance and
of how it processes information. offers valuable insights into the decision-making process.
3) Output Explainability (Sec. V): Output explainability A common strategy in explainable feature engineering is
refers to the model’s ability to generate not just predictions to combine both handcrafted and deep features from an input
for various medical image tasks but also accompanying expla- image to make final predictions [24], [25]. For example, Kapse
nations through an explanation generator. This capability aids et al. [24] introduce a self-interpretable multiple instance
in understanding the rationale behind the model’s predictions, learning (SI-MIL) framework that simultaneously learns from
facilitating informed medical decision-making. deep image features and handcrafted morphometric and spatial
The following sections summarize and categorize the most descriptors. They assess the local and global interpretability
relevant works on S-XAI methods applied to medical image of SI-MIL through statistical analysis, a user study, and key
analysis. Comprehensive lists of the reviewed S-XAI meth- interpretability criteria. Another line of approach involves
ods are provided, detailing the employed S-XAI techniques, incorporating interpretable clinical variables as additional in-
publication year, anatomical location, image modality, medical puts alongside the images, often utilizing multimodal learning
application, and the datasets used. techniques [26], [27]. For instance, Xiang et al. [26] introduce
OvcaFinder, an interpretable model that combines deep learn-
III. I NPUT E XPLAINABILITY ing predictions from ultrasound images with Ovarian–Adnexal
In this section, we will explore input explainability by Reporting and Data System scores provided by radiologists, as
integrating external domain knowledge, focusing on two key well as routine clinical variables for diagnosing ovarian cancer.
approaches, i.e., a) explainable feature engineering (Sec. III- This approach enhances diagnostic accuracy and explains the
A) and b) knowledge graph (Sec. III-B). As shown in Fig. 4, impact of key features and regions on the prediction outcomes.
these explainable inputs will interact with the deep features of Discussion: Although explainable feature engineering can
image inputs and be combined to support final predictions. be time-consuming, it brings valuable prior knowledge and en-
4 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

(a) Explainable Feature Engineering foundation for medical decision-making, clinical research, and
healthcare analytics [31]–[33]. By harnessing the medical prior
Feature Engineering Clinical Variables E Encoder
knowledge encoded in the graph, AI models can gain valuable
Shape Local Body Temperature
P
insights, identify patterns, predict patient outcomes, assist in
Predictor
Color Textual Complete Blood Count diagnosis, recommend personalized treatments, and ultimately
… …
Fusion improve patient care and outcomes [34]–[38]. For example,
Handcrafted features
Liu et al. [36] and Huang et al. [37] develop KGs based on the
Image
professional perspective related to medical images to enhance
image understanding. Another way to utilize prior knowledge
E P “Pneumonia” is by collecting a large number of relationship triples to
create a domain-knowledge-enhanced medical VQA dataset.
For instance, Liu et al. [38] extract a set of 52.6K triplets
1) Prior Knowledge Graph 2) Data Knowledge Graph in the format < head, relation, tail > containing medi-
Ultrasound cal knowledge from OwnThink (https://www.ownthink.com).
Normal Other Finding CT X-ray
Text
They then use this external information to create SLAKE, a
Heart Spine Pleural …
large-scale, semantically annotated, and knowledge-enhanced
Cardiomegaly Scoliosis … …
Inter-data Intra-data bilingual dataset for training and testing Med-VQA systems.
Prior KGs enhance S-XAI models by integrating expert-
3) Hybrid Knowledge Graph
derived knowledge and medical facts, enabling these models
(b) Knowledge Graph to better understand key medical concepts and make more in-
formed predictions. However, the creation of these KGs largely
Fig. 4. Input explainability that incorporates (a) explainable feature depends on specialized expertise, making the process labor-
engineering (b) knowledge graph as additional inputs.
intensive. Furthermore, these KGs often lack the adaptability
required for analyzing dynamic clinical datasets.
hances the interpretability of deep learning models concerning 2) Data Knowledge Graph: A data KG differs from a prior
input features. Despite the increasing research in this area, KG in its construction methodology. While a prior KG relies
most studies prioritize accuracy improvements, with limited on expert insights and established medical facts, a data KG is
analysis given to the explainability. Additionally, effective built directly from the dataset itself. This means that instead of
information fusion and interaction poses a key challenge. relying solely on pre-existing knowledge, the data knowledge
graph leverages the inherent information contained within the
B. Knowledge Graph dataset. This approach allows the data KG to provide a unique
A knowledge graph (KG) is a structured representation of perspective and the potential to discover previously unknown
factual knowledge that captures relationships between entities relationships and correlations within the data [64]–[67]. There
in a specific area. It provides a way to organize and represent are two primary approaches to leveraging data knowledge
knowledge in a semantically rich and interconnected manner for enhancing the explainability of AI models: 1) extracting
and plays a crucial role in enhancing the interpretability of S- knowledge directly from the dataset [43], [45], [46], [49],
XAI models. Recently, integrating structured domain knowl- [51], [57], [61]. Liu et al. [49] employ a bipartite graph
edge into downstream tasks has attracted significant attention convolutional network to model the intrinsic geometric and
of both industry and academia [28]–[30]. This growing interest semantic relation of ipsilateral views, and an inception graph
stems from the recognition that leveraging domain knowledge convolutional network to model the structural similarities of
can greatly improve the performance and effectiveness of bilateral views. Huang et al. [61] develop a medical KG based
various applications. As shown in Fig 4(b), regarding medical on the types of diseases and questions concerned by patients
imaging analysis, the utilization of KG can be broadly cate- during their treatment process. 2) Transferring knowledge from
gorized into three categories: 1) prior KG, which serves as a pre-trained models. For example, Qi et al. [52] use a pre-
foundational resource that gathers existing domain expertise trained U-Net to segment lung lobes and then model both
and established medical knowledge; 2) data KG, which is the intra-image and inter-image relationships of these lobes
derived from the analysis of large-scale medical imaging and in-batch images through their respective graphs. Elbatel
datasets; and 3) hybrid KG, which combines the strengths of et al. [53] distill knowledge from pre-training models to small
both prior and data KGs for medical image analysis. models for disease classification.
1) Prior Knowledge Graph: A prior KG in the medical Overall, constructing a data KG involves leveraging the
domain is a specialized KG that captures and organizes facts inherent characteristics of the dataset itself to build a graph
information about medical concepts and their relationships. structure to assist S-XAI models. However, it is important to
It can be constructed from a multi-sources, including med- note that these methods often harbor inherent biases that can
ical literature, electronic health records, medical ontologies, vary significantly across different datasets.
clinical guidelines, and expert opinions. This graph serves as 3) Hybrid Knowledge Graph: A hybrid KG integrates both
a comprehensive repository of medical knowledge, encom- the prior KG and the data KG, representing an interactive
passing details about diseases, symptoms, treatments, medi- approach. The prior KG provides a static foundation of estab-
cations, anatomical structures, and more. It provides a vital lished medical facts, while the data KG utilizes dataset charac-
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 5

TABLE I
I NPUT EXPLAINABILITY METHODS BASED ON KNOWLEDGE GRAPH (KG). T HE ABBREVIATIONS HERE ARE CLS: CLASSIFICATION , DET:
DETECTION , MRG: MEDICAL REPORT GENERATION , VQA: VISUAL QUESTION ANSWERING .

Method Year Location Modality Task Dataset KG Type

Naseem et al. [34] 2023 Multiple Pathology VQA [39] Prior KG
Zhang et al. [35] 2020 Chest X-ray MRG [40] Prior KG
Liu et al. [36] 2021 Chest X-ray MRG [41], [42] Prior KG
Huang et al. [37] 2023 Chest X-ray MRG [41], [42] Prior KG
Liu et al. [38] 2021 Multiple X-ray, CT, MRI VQA [38] Prior KG
Chen et al. [43] 2020 Chest X-ray CLS [40], [44] Data KG
Zheng et al. [45] 2021 Chest X-ray, CT, US, text CLS private Data KG
Hou et al. [46] 2021 Chest X-ray CLS [41], [42] Data KG
Zhou et al. [47] 2021 Chest X-ray CLS [40], [44] Hybrid KG
Wu et al. [48] 2023 Chest X-ray CLS [42] Hybrid KG
Liu et al. [49] 2021 Breast Mammogram DET [50] Data KG
Zhao et al. [51] 2021 Chest X-ray DET [40] Data KG
Qi et al. [52] 2022 Chest X-ray DET [40] Data KG
Elbatel et al. [53] 2023 Chest Dermatology, Endoscopy DET [54], [55] Data KG
Li et al. [56] 2019 Chest X-ray MRG [41] Hybrid KG
Liu et al. [57] 2021 Chest X-ray MRG [41], [42] Data KG
Li et al. [58] 2023 Chest X-ray MRG [41], [42] Hybrid KG
Kale et al. [59] 2023 Chest X-ray MRG [41] Hybrid KG
Guo et al. [60] 2022 Multiple X-ray, CT, MRI VQA [38] Prior KG
Huang et al. [61] 2023 Multiple X-ray, CT, MRI, US VQA [38] Data KG
Hu et al. [62] 2023 Chest X-ray VQA [62] Hybrid KG
Hu et al. [63] 2024 Chest X-ray VQA [62] Hybrid KG

teristics to dynamically update and enhance this foundational into a graph format is labor-intensive and costly, requiring
knowledge. By incorporating data-specific insights discovered constant updates and refinements to incorporate the latest
from the dataset, the hybrid KG allows for the integration of research findings, clinical guidelines, and emerging medical
new information and the refinement of existing knowledge. data to maintain up-to-date prior medical knowledge. Another
This dynamic updating process ensures that the KG remains challenge lies in the heterogeneity of medical image data. With
up-to-date and relevant. Consequently, the hybrid KG com- the continuous growth of medical image data, the variety of
bines the strengths of both the prior and data KG, offering a image modalities expands, complicating their effective inte-
more comprehensive and adaptable knowledge representation gration within KGs. Developing robust algorithms to extract
for S-XAI models in the medical field [35], [47], [48], [56], meaningful features from medical images and link them with
[58], [59], [62], [63]. For instance, Wu et al. [48] implement a relevant medical KGs remains an ongoing research endeavor.
triplet extraction module to extract medical information from
reports, combining entity descriptions with visual signals at the IV. M ODEL E XPLAINABILITY
image patch level for medical diagnosis. For the medical report
generation tasks, Li et al. [56] decompose medical report In this section, we present model explainability by design-
generation into explicit medical abnormality graph learning ing interpretable model architectures, such as attention-based
and subsequent natural language modeling. Each node in the learning (Sec. IV-A), concept-based learning (Sec. IV-B), and
abnormality graph represents a possible clinical abnormality prototype-based learning (Sec. IV-C).
based on prior medical knowledge, with the correlations
among these nodes encoded as edge weights to inform clinical A. Attention-based Learning
diagnostic decisions. Hu et al. [63] utilize large language
Attention-based learning aims to capture specific areas in an
models to extract labels and build a large-scale medical VQA
image that are relevant to the prediction task while suppressing
dataset, Medical-CXR-VQA. They then leverage graph neural
irrelevant regions based on feature maps. Therefore, it can
networks to learn logical reasoning paths based on this dataset
be naturally combined with S-XAI methods to provide visual
for medical visual question answering task.
explanations that enhance model decision-making [2], [19],
In summary, the construction of a hybrid KG relies on [68]. We categorize attention-based S-XAI models into 1)
prior knowledge and involves automatically adjusting the structure-guided attention models and 2) loss-guided attention
nodes or edges based on the data characteristics. This process models. As illustrated in Fig. 5, the former specifically designs
ensures that the KG remains aligned with the specific domain the attention structure and obtains model predictions directly
knowledge and captures the most relevant, data-specific in- from the attention map, while the latter constrains the attention
formation. It provides a comprehensive representation of both map using a loss function to ensure attention map align with
data-specific knowledge and prior knowledge, enhancing the an ideal interpretable distribution.
interpretability of S-XAI models. 1) Structure-guided Attention Model: As shown in the top
Discussion: The utilization of medical KGs in medical branch of Fig. 5, structure-guided attention models are char-
image analysis poses both challenges and promising oppor- acterized by the association between the attention structure
tunities. First, integrating diverse prior medical knowledge in the model and the components directly influencing the
6 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

1) Structure-Guided Attention Models three directions, the 3D attention map can be visualized to
explain the model’s decision-making process. For the fast MRI
Visual Attention Explanations
reconstruction task, Huang et al. [77] propose a shifted win-
Attention Weighted dows deformable attention mechanism which uses reference
Image Combination points to impose spatial constraints on attention and directly
A=en>on Es>ma>on
combines the outputs from the attention modules of different
E P “Pneumonia” windows to produce the model’s reconstruction results.
Although structure-guided attention maps can provide ex-
Attention Mechanism planations for model predictions, they are still difficult to align
with clear human-understandable decision-making basis.
Generated Attention Reference
2) Loss-guided Attention Model: As shown in the bottom
Attention Loss Attention branch of Fig. 5, loss-guided attention models use interpretable
labels (i.e., reference attention maps) to construct loss func-
2) Loss-Guided A.en0on Models
tions that directly constrain the generated attention maps. This
Fig. 5. Attention-based learning, including 1) structure-guided and 2) method encourages the model to focus on areas that are un-
loss-guided attention models. X-ray images borrowed from [69]. derstandable and beneficial for making predictions. Benefiting
from lesion area annotations and professional analyses by
model’s predictions. This allows the generated attention map to doctors, which provide clear references for model decisions,
effectively explain the model’s predictions. Jetley et al. [70] is loss-guided attention learning techniques are commonly used
the first to introduce attention learning for XAI. They propose in medical image analysis.
an attention estimator which calculates feature compatibility Using ground-truth masks of regions of interest (RoIs) to
scores to weight feature maps as feature activation scores, guide the generation of attention maps is a widely adopted
which are then directly used as input for a linear classifier. approach in medical image classification [80], [83], [87]. For
This approach guides the model’s attention toward areas that instance, Yang et al. [80] directly optimize the attention maps
are more relevant to its decision-making while suppressing by a Dice loss, which encourages the model to focus on
irrelevant regions. Fukui et al. [71] present an Attention target areas that are highly relevant to the classification of
Branch Network (ABN), which replaces the fully connec- breast cancer microscopy images. To alleviate the challenge
tion layer of Class Activation Mapping (CAM) [6] using a of obtaining pixel-level annotations, Yin et al. [87] pre-train a
convolution layer to output class probabilities. There is also histological feature extractor to identify significant clinically
a perception branch to apply a classifier to the combination relevant feature masks, which are then used to guide and
of attention maps from the original features. Furthermore, Li regularize the attention maps. By considering the varying
et al. [72] propose a slot attention-based method in which contributions of histological features for classification, the
the attention output of each slot are directly processed and model can selectively focus on different features based on
summed up by a main block named SCOUTER to support the distribution of nuclei in each instance. In medical image
a specific category, eliminating the need for a linear classifier segmentation, labels corresponding to edges and shapes of
and further improving the model’s transparency. They also use specific regions are often reused to guide attention in learning
the output from the slot attention mechanism to represent the semantic information [102], [105], [109]. Sun et al. [102]
model’s final confidence for each category. Notably, positive combine spatial attention with the attention estimator in U-
and negative interpretations can be controlled through the Net decoders, enabling the model to interpret learned features
parameters in the loss function. This method demonstrates at each resolution. They also introduce a gated shape stream
improved interpretability in the glaucoma diagnosis task. alongside the texture stream, where the resulting shape atten-
Numerous studies have incorporated multiple attention tion maps are aligned with actual edges through binary cross-
mechanisms for medical image classification and segmentation entropy loss, enhancing the cardiac MRI segmentation.
tasks [73]–[75]. For example, Schempler et al. [73] extend Compared with lesion masks, eye tracking data provides a
the attention estimator by extracting local information from more accurate depiction of expert focus, as it captures the way
coarse-scale feature maps for attention gates, facilitating more doctors visually process information during diagnosis. Bhat-
fine-grained visual interpretation for lesion segmentation or tacharya et al. [69] leverage the captured doctors’ attention to
ultrasound diagnosis. Similarly, Gu et al. [74] develop a guide model training. They employ a teacher-student network
comprehensive attention module that enhances model inter- to replicate the visual cognitive behavior of doctors when
pretability through spatial, channel, and scale attention. Their diagnosing diseases on chest radiographs. The teacher model is
segmentation experiments on skin lesions and fetal organs trained based on the visual search patterns of radiologists, and
demonstrate improved performance and better interpretability the student model utilizes an attention loss to predict attention
of target area positioning and scale. Beyond 2D data, how from the teacher network using eye tracking data.
to use attention to explain more complex 3D medical image Discussion: Attention-based S-XAI methods guide model
diagnosis is more challenging. Lozupone et al. [76] present predictions by focusing on critical areas of images, thereby
an attention module that fuses attention weights from sagittal, providing effective attention explanations. Structure-guided
coronal, and axial slices to diagnose Alzheimer’s disease on attention models typically utilize the attention-weighted output
3D MRI brain scans. By integrating these attention scores from as input for the predictor, reflecting the model’s decision-
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 7

TABLE II
M ODEL EXPLAINABILITY METHODS BASED ON ATTENTION - BASED LEARNING . T HE ABBREVIATIONS HERE ARE CLS: CLASSIFICATION , SEG:
SEGMENTATION , IRE: IMAGE RECONSTRUCTION , REG: REGRESSION

Method Year Location Modality Task Dataset Attention Type

Wang et al. [78] 2018 Breast X-ray CLS [79] Structure-Guided
Yang et al. [80] 2019 Breast Histopathology CLS [81] Loss-Guided
Li et al. [72] 2021 Eye Retinal images CLS [82] Structure-Guided
Yan et al. [83] 2019 Skin Dermatology CLS [84], [85] Loss-Guided
Barata et al. [75] 2021 Skin Dermatology CLS [85], [86] Loss-Guided
Yin et al. [87] 2021 Liver Histopathology CLS [88] Loss-Guided
Bhattacharya et al. [69] 2022 Chest X-ray CLS [89]–[98] Loss-Guided
Lozupone et al. [76] 2024 Brain MRI CLS [99] Structure-Guided
Schempler et al. [73] 2019 Abdominal, Fetal CT, US SEG + DET [100], [101] Structure-Guided
Sun et al. [102] 2020 Cardiac MRI SEG [103], [104] Loss-Guided
Gu et al. [74] 2020 Skin, Fetal Dermatology, MRI SEG [86] Structure-Guided
Karri et al. [105] 2022 Skin, Brain, Abdominal Dermatology, MRI, CT SEG [106]–[108] Loss-Guided
Li et al. [109] 2023 Skin Dermatology SEG [54], [85], [86], [106] Loss-Guided
Huang et al. [77] 2022 Head MRI IRE [110] Structure-Guided
Lian et al. [111] 2019 Brain MRI REG [99], [112] Structure-Guided

making basis. However, these attention explanations often lack Image

clear semantic information and can be subjectively interpreted.
In contrast, loss-guided attention models generally provide E P “Pneumonia”
attention explanations with explicit semantic details. How-
ever, since the attention output does not directly influence
the model’s decisions, evaluating how well these attention Query: What are useful Clinical Concepts 1) Supervised
visual concepts for Concept Learning
maps explain the decision-making process remains challeng- Opacity
diagnosing pneumonia in YES
ing. Overall, while attention-based S-XAI methods enhance chest X- ray images?
Eﬀusion Use concept
model transparency and offer insights into decision-making,
annotations?
the understandability of attention maps and their relevance to Foundation Model

…
the decisions still require further investigation. (LLM/VLM) NO
Infiltration
2) Unsupervised Automatic
3) Generative Concept Learning
Concept Discovering
B. Concept-based Learning
Fig. 6. Concept-based learning, including 1) supervised concept learn-
Concept-based S-XAI methods provide explanations in ing, 2) unsupervised automatic concept discovering, and 3) generative
terms of high-level, human-interpretable attributes rather than concept learning.
low-level, non-interpretable features. This approach reveals
the inner workings of deep learning models using easily
understandable concepts, enabling users to gain deeper insights explicitly described in natural language. For example, Sun et
into underlying reasoning. It also helps in identifying model al. [122] consider the instances segmented by SAM [123] as
biases and allows for adjustments to enhance performance and the concepts of a given image.
trustworthiness. Most concept-based S-XAI methods focus on 1) Supervised Concept Learning: Supervised concept learn-
making decisions based on a set of concepts while also de- ing methods train deep learning models using annotations of
tailing the contribution of each concept to the final prediction textual concepts, particularly by supervising an intermediate
[113]–[116]. These methods introduce concept learning into layer to represent these concepts. A notable example is Con-
the training pipeline of the models, instead of simply analyzing cept Bottleneck Model (CBM) [113], which is an inherently
explainability after training a black-box model (i.e., post-hoc interpretable deep learning architecture. It first maps latent
XAI methods) [117]–[119]. We propose to categorize concept- image features to a concept bottleneck layer, where the number
based S-XAI methods into three types: 1) supervised concept of neurons corresponds to the number of human-defined
learning, 2) unsupervised automatic concept discovering, and concepts, and then predicts final results based on the concept
3) generative concept learning, as shown in Fig. 6. scores from this layer. By enforcing the neurons in the concept
The term Concept has been defined in different ways, which bottleneck layer to learn concept representations supervised by
commonly represents high-level attributes [117], [120], [121]. concept labels, CBMs can directly show each concept’s contri-
In this paper, we suggest adopting a straightforward and easily bution to the final prediction (i.e., class-concept relation) using
understandable categorization: Textual Concepts and Visual the neuron values of the last layer. Specifically, the authors
Concepts. Textual Concepts refer to textual descriptions of of CBM conduct experiments on the knee X-ray dataset OAI
attributes associated with the classes. For example, in Fig. [124] to explore the importance of concepts such as bone spurs
6, the textual concepts for the classes (i.e., “pneumonia” and and calcification in determining arthritis grading. Additionally,
“normal”) include terms like Opacity, Effusion, Infiltration, CBMs allow model editing. When domain experts find certain
etc. Visual Concepts, on the other hand, consist of semanti- predicted concept importance unreasonable, they can easily
cally meaningful features within the image that may not be adjust the model’s predictions by intervening in the weights
8 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

TABLE III
M ODEL EXPLAINABILITY METHODS BASED ON CONCEPT- BASED LEARNING . T HE ABBREVIATIONS HERE ARE CLS: CLASSIFICATION

Method Year Location Modality Task Dataset Concept

Koh et al. [113] 2020 Knee X-ray CLS [124] Supervised (CBM)
Chauhan et al. [125] 2023 Knee, Chest X-ray CLS [124], [44] Supervised (CBM)
Patricio et al. [126] 2023 Skin Dermatology CLS [127], [128] Supervised (CBM)
Yan et al. [129] 2023 Skin Dermatology CLS Private Supervised (CBM)
Bie et al. [130] 2024 Skin Dermatology CLS [127], [128], [131] Supervised (CBM)
Kim et al. [132] 2024 Skin Dermatology CLS [54], [127], [131], [133], [134] Supervised (CBM)
Lucieri et al. [135] 2022 Skin Dermatology CLS [127], [128], [54] Supervised
Jalaboi et al. [136] 2023 Skin Dermatology CLS [137], [138] Supervised
Hou et al. [139] 2024 Skin Dermatology CLS [127], [131] Supervised
Kim et al. [140] 2023 Skin Dermatology CLS [106] Generated concept
Patricio et al. [141] 2024 Skin Dermatology CLS [127], [128], [86] Generated concept
Pang et al. [142] 2024 Skin, Blood cell Dermatology, Microscopy CLS [131], [143] Supervised (CBM)
Marcinkevivcs et al. [144] 2024 Appendix US CLS Private Supervised (CBM)
Zhao et al. [145] 2021 Chest CT CLS [146] Supervised
Fang et al. [147] 2020 Eye Slit lamp microscopy CLS [148] Concept discovery
Wen et al. [149] 2024 Eye Retinal images CLS [150], [151] Supervised
Kong et al. [152] 2022 Thyroid US CLS Private Concept discovery
Liu et al. [153] 2023 Multiple X-ray, CT CLS [90], [154], [155] Generated concept
Gao et al. [156] 2024 Multiple Dermatology, Pathology, US, X-ray CLS [42], [106], [155], [157], [158] Supervised

of the concept bottleneck layer (test-time intervention). The concepts may not be associated with human-specified textual
CBM architecture has inspired many researchers to develop concepts. However, these methods can still provide concept-
inherently interpretable methods, resulting in a series of based explanations by visualizing the unsupervised concepts
CBM-like methods. For example, Concept Embedding Models and detailing their contributions to the final predictions. For
(CEMs) [159] utilize a group of neurons (concept embeddings) instance, Ghorbani et al. [162] propose Automatic Concept-
instead of a single neuron to represent a concept, which based Explanations (ACE), which automatically extract visual
effectively improves the performance of the original CBM concepts that are meaningful to humans and important for
while preserving its interpretability. Different from CBMs, the network’s predictions. Self-Explaining Neural Networks
Concept Whitening [139], [160] aims to whiten the latent (SENN) [4] first utilize a concept encoder to extract clusters
space of neural networks and aligns the axes of the latent of image representations corresponding to different visual
space with known concepts of interest. Zhao et al. [145] concepts, and also adopt a relevance parametrizer to calculate
introduce a hybrid neuro-probabilistic reasoning algorithm the relevance scores of concepts. The final prediction is
for verifiable concept-based medical image diagnosis, which determined by the combination of discovered concepts and
combines clinical concepts with a Bayesian network. the corresponding relevance scores. Inspired by SENN, Sarkar
The self-explainable nature of concept-based learning mod- et al. [163] propose an ante-hoc explainable framework that
els has led to its application in medical image analysis. includes both a concept encoder and a concept decoder, which
Chauhan et al. [125] propose Interactive CBMs, which can map images into concept space and use the concepts to
request labels for certain concepts from a human collaborator. reconstruct the original images, respectively. Yeh et al. [164]
This method is evaluated on chest and knee X-ray datasets. argue that the discovered concepts may not be sufficient to
Yan et al. [129] discover and eliminate confounding concepts explain model predictions, so they define a completeness score
within datasets using spectral relevance analysis [161], and to evaluate whether the concepts adequately support model
conduct experiments on skin image datasets. Marcinkevics predictions and propose a framework for complete concept-
et al. [144] adapt CBM for prediction tasks with multiple based explanations.
views of ultrasonography and incomplete concept sets. Kim et Since medical concept annotations are costly and require
al. [132] present a medical concept retriever, which connects experts’ efforts, unsupervised automatic concept discovering
medical images with text and densely scores images on is usually adopted to offer concept-based explanations in
concept presence. This enables important tasks in medical AI medical image analysis without expert-annotated labels. For
development and deployment, such as data auditing, model example, Fang et al. [147] address the practical issue of
auditing, and model interpretation, using a CBM architecture classifying infections by proposing a visual concept mining
to develop an inherently interpretable model. (VCM) method to explain fine-grained infectious keratitis
However, a significant challenge in supervised concept images. Specifically, they first use a saliency map based po-
learning is the scarcity of concept annotations, which require tential concept generator to discover visual concepts, and then
labor-intensive efforts from human experts. Therefore, some propose a visual concept-enhanced framework that combines
researchers prefer unsupervised automatic concept discover- both image-level representations and the discovered concept
ing, as it eliminates the need for extra annotations. features for classification. Moreover, Kong et al. [152] develop
2) Unsupervised Automatic Concept Discovering: Models a novel Attribute-Aware Interpretation Learning (AAIL) model
that perform unsupervised concept discovery modify their to discover clinical concepts, and then adopt a fusion module
internal representations to identify concepts within image fea- to integrate these concepts with global features for thyroid
tures without relying on explicit annotations. These discovered nodule diagnosis from ultrasound images.
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 9

Image
Although unsupervised automatic concept discovering can
offer concept-based explanations, these explanations are ab-
“Nodule”
stract and usually cannot be directly described in natural
language. To alleviate this issue while also addressing the
Prototypical Images
lack of concept annotations, generative concept learning has Compare Because these areas look
similar to typical Nodule images.
become a promising research direction. E
3) Generative Concept Learning: Leveraging foundation 1) Explicit prototype 2) Implicit prototype
Cardiomegaly
models, such as Large Language Models (LLMs) and Vision- extracted from a
certain typical image.
close to features extracted
from a set of typical images.

…
Language Models (VLMs), can assist in generating and la-

…
+ + + ** +
+ + *
beling textual concepts. A notable generative concept learning + ++ * * *
* ** + + ** *
+ + **
method, namely Language Guided Bottlenecks (LaBo) [165], Nodule + * image features prototype
employs an LLM (GPT-3 [166]) to generate textual concepts
for each image category, which are filtered to form the concept Fig. 7. Prototype-based learning, including 1) explicit prototype and 2)
bottleneck layer. LaBo then uses a pre-trained VLM (CLIP implicit prototype. X-ray images borrowed from [170].
[167]) to calculate the similarity between input images and
the generated concepts to obtain concept scores. The final image and then compare the feature maps with the proto-
prediction is based on the multiplication of a weight matrix types to calculate similarities. Ultimately, these similarities
and these concept scores. Label-free CBM [168] employs are combined for the final decision making. This process
a similar pipeline, but trains an independent network that is considered interpretable because the decision making can
includes a concept bottleneck layer. In the medical domain, be clearly attributed to the contribution of each interpretable
Kim et al. [140] enhance LaBo [165] by incorporating a prototype (e.g., by the similarity scores). According to how
more fine-grained concept filtering mechanism and conducted the prototypes are obtained, we define and categorize them to
explainability analysis on dermoscopic images, achieving per- two types: 1) explicit prototypes and 2) implicit prototypes,
formance improvements compared to the baseline. Similarly, as presented in Fig. 7. Explicit prototypes are specific high-
Liu et al. [153] employ ChatGPT and CLIP for explainable dimensional feature representations extracted from certain
zero-shot disease diagnosis on X-ray and CT. Bie et al. [169] training images, whereas implicit prototypes are latent high-
propose an explainable prompt learning framework that lever- dimensional representations that are close to a set of typical
ages medical knowledge by aligning the semantics of images, images’ representations. All existing prototype-based S-XAI
learnable prompts, and clinical concept-driven prompts at mul- models do not require supervision at the prototype level and
tiple granularities, where the category-wise clinical concepts aim to automatically find meaningful prototypes to facilitate
are obtained by eliciting knowledge from LLMs. interpretable decision making.
Discussion: Methods that provide concept-based explana- 1) Explicit prototype based models: The first model of this
tions hold significant importance in medical research and ap- type is ProtopNet [171], which introduces a three-stage train-
plications, particularly in advancing evidence-based medicine. ing scheme that is widely adopted by subsequent research:
By offering human-understandable explanations, these meth- 1) Feature extractor training: in this step, the final layer is
ods have the capability to help doctors and patients better un- frozen, and only the feature extraction backbone is trained.
derstand AI-assisted diagnosis, hence holding the potential to 2) Prototype replacement: this step replaces the learned rep-
make AI technologies effectively supported and disseminated resentations in the prototype layer with the nearest feature
in healthcare. The lack of fine-grained label annotations and patch from the training set. 3) Final layer fine-tuning: in this
the performance-explainability trade-off are the limitations of stage, the feature extractor remains fixed while the parameters
concept-based methods. Thanks to the development of LLMs, of the final layer are fine-tuned. Later works closely follow
researchers are exploring new ways to alleviate these issues, this training scheme while addressing different limitations of
e.g., generative concept learning [140], [153], [165]. In addi- this initial framework. For example, ProtoShare [191] proposes
tion, as there are more and more medical foundation models to share prototypes across different classes to reduce the
being developed, incorporating the knowledge of the models overall number of prototypes and enhance model efficiency. A
and medical experts to efficiently annotate concept labels for similar idea is explored in ProtoPool [192], where prototypes
datasets will be a promising and meaningful direction. Besides are assigned to various classes in a differentiable manner.
the most popular classification task, other medical applications To address the limitation of prior models that use spatially
of concept-based approaches should be further explored. rigid prototypes, ProtoDeform [193] proposes to additionally
learn an offset to obtain prototypes that are more spatially
flexible. TesNet [194] leverages the Grassman manifold to con-
C. Prototype-based Learning struct a transparent embedding space, achieving competitive
Prototype-based S-XAI models aims to provide a decision- accuracy. Inspired by the theory of support vector machines,
making process where a model reasons through comparisons ST-protopnet [195] aims to further improve the accuracy of
with a set of interpretable example prototypes [171]. This prototype-based models by separating prototypes into support
reasoning aligns with human recognition patterns, as humans and trivial prototypes, where support prototypes are located
often identify objects by comparing them to example compo- near the decision boundary in feature space, while trivial ones
nents [172]. These models first extract features from a given lie far from it. To investigate the hierarchical relationships
10 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

TABLE IV
M ODEL EXPLAINABILITY METHODS BASED ON PROTOTYPE - BASED LEARNING . T HE ABBREVIATIONS HERE ARE CLS: CLASSIFICATION , REG:
REGRESSION .

Method Year Location Modality Task Dataset Prototype Type

Kim et al. [170] 2020 Chest X-ray CLS [40] Explicit
Singh et al. [173] 2021 Chest X-ray CLS [174] Explicit
Mohammadjafari et al. [175] 2021 Brain MRI CLS [176] Explicit
Barnett et al. [177] 2021 Breast Mammogram CLS Private Explicit
Carloni et al. [178] 2022 Breast Mammogram CLS [179] Explicit
Wang et al. [180] 2022 Breast Mammogram CLS [181] Explicit
Wei et al. [182] 2024 Brain MRI CLS [183] Explicit
Hesse et al. [184] 2022 Eye Retinal images REG [185] Explicit
Santos et al. [186] 2024 Eye Retinal images CLS [187] Explicit
Hesse et al. [188] 2024 Brain MRI, US REG [189], [190] Explicit

between classes, Hase et al. [196] propose hierarchical pro- users to identify which specific property is important in the
totypes to offer explanations according to class taxonomy. As corresponding image patch (e.g., is it the color or texture that
prototype-based models are mostly based on linear classifiers, matters in this prototypical area?). This issue can be partially
ProtoKnn [197] explores the usage of k nearest neighbors mitigated using implicit prototype based models.
as a classifier and offers counterfactual explanations within 2) Implicit prototype based models: This type of model
the prototype-based framework. Recognizing the importance follows a similar training scheme as the models based on
of interpretability methods for debugging models, ProtoDe- explicit prototypes, with the major difference in avoiding the
bug [198] proposes an approach where a human supervisor prototype replacement step, or only projecting the prototype
can provide feedback to the discovered prototypes and learn to the training images’ feature patches for visualizations. This
confounder-free prototypes. scheme is simpler than one that includes prototype replace-
Adopting prototype-based S-XAI models in the medical ment step and has different interpretability benefits. Li et al.
domain presents additional challenges. Unlike natural images [199] propose the earliest work using latent prototypes, which
where the representative prototype occupies an area with a leverages a decoder to visualize the meanings of the learned
relatively stable size, medical image features such as disease prototypes. Protoeval [200] designs a set of loss functions to
regions in chest X-ray images can vary significantly in size. encourage the learned latent prototypes to be more stable and
To address this, XProtoNet [170] proposes to predict an occur- consistent across different images. To address the issue of the
rence map and summing the similarity scores within those ar- same prototype potentially representing different concepts in
eas, rather than relying solely on the maximum similarity score the real world, Nauta et al. [201] introduce PIP-net which
as done in ProtopNet. Similarly, [173] introduces prototypes learns prototypes by encouraging the augmented two views of
with square and rectangular spatial dimensions for COVID- the same image patch to be assigned to the same prototype.
19 detection in chest X-rays. In evaluations of ProtopNet, To help users identify the specific properties in an image that
Mohammadjafari et al. [175] observe a performance drop for contribute to the final classification (e.g., color or texture),
Alzheimer’s disease detection using MRI, whereas Carlon et instead of allowing users to observe only one example image
al. [178] report a high-level of interpretability satisfaction from patch per prototype, Ma et al. [202] propose to illuminate
radiologists in breast mass classification using mammograms. prototypical concepts via multiple visualizations. Due to the
In mammogram based breast cancer diagnosis, Wang et al. interpretability benefits of decision trees, ProtoTree [203]
[180] propose to leverage knowledge distillation to improve explores the incorporation of decision trees into prototype-
model performance. To overcome the confounding issue in based models, using latent prototypes as the nodes throughout
mammogram based mass lesion classification, Barnett et al. the decision-making process. Recently, to address the concern
[177] employ a multi-stage framework that identifies the mass that prototype-based models often underperform their black
margin features for malignancy prediction, skipping image box counterparts, Tan et al. [204] develop an automatic
patches that have already been used in previous prototypes prototype discovery and refinement strategy to decompose
during the prototype projection step to improve prototype the parameters of the trained classification head and thus
diversity. In brain tumor classification, MProtoNet [182] intro- guarantees the performance.
duces a new attention module with soft masking and online- Discussion: In terms of performance, implicit prototype
CAM loss applied in 3D multi-parametric MRI. To predict the based models generally outperform explicit ones, probably
brain age based on MR and ultrasound images, Hesse et al. due to the greater flexibility in prototype learning. Regarding
[188] utilize the weighted mean of prototype labels. Addition- interpretability, both types of models offer unique advantages.
ally, INSightR-Net [184] formulates the diabetic retinopathy For example, explicit prototypes can be intuitively explained
grading as a regression task and apply the prototype based through one-to-one mappings to the input image, while im-
framework, while ProtoAL [186] explores an active learning plicit prototypes can be explained using a diverse set of images
setting for prototype-based models in diabetic retinopathy. with similar activations. However, in medical image analysis,
Although these models offer interpretability in a one-to-one current prototype-based S-XAI models primarily utilize ex-
mapping to the input image, they can also make it difficult for plicit prototypes. Therefore, investigating the use of implicit
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 11

Image Image

E P “Pneumonia” E P “Mass”

1) Fully-structured text 2) Semi-structured text

Language Model patchy opacities silhouette, present, right [Cardiac silhouette]: Patchy
heart border / pleural effusions, absent opacities silhouette the
Query: What’s right heart border. [Left
3) Free-structured text lower lung zone]: Minimal Encoder Decoder “Healthy”
the findings of left lower lung atelectasis is
Foundation Model Patchy opacities silhouette the right heart
the given x-ray noted. [Mediastinum]: The mediastinal 𝑧
(LLM/VLM) border. The mediastinal contours are
image? normal. There are no pleural effusions. contours are normal. Generative Model Counterfactual Image

(a) Textual Explanation (b) Counterfactual Explanation

Fig. 8. Output explainability that provides (a) textual explanations, including fully-structured, semi-structured, and free-structured text; and (b)
counterfactual explanations. The difference between the generated counterfactual image and raw image (red box) indicates the explanation. X-ray
images borrowed from [205].

prototypes in the medical domain could be a promising avenue certain types of cell attributes along with a concluding
for future research. statement. Additionally, Wang et al. [214] introduce a
hierarchical framework for medical image explanation, which
V. O UTPUT E XPLAINABILITY first predicts semantically related topics and then incorporates
these topics as constraints for the language generation model.
This section discusses output explainability by generating
In the context of hip fracture detection from pelvic X-rays,
explanations alongside model predictions, including textual
Gale et al. [215] utilize a visual attention mechanism to
(Sec. V-A) and counterfactual (Sec. V-B) explanations.
create terms related to location and characteristics, which
are then used to generate sentences structured as: “There is
A. Textual Explanation a [degree of displacement], [+/- comminuted][+/- impacted]
Textual explanations in S-XAI involve generating human- fracture of the [location] neck of femur [+/- with an avulsed
readable descriptions that accompany model predictions as fragment].” More recently, some studies have focused on
part of outputs, similar to image captioning. These methods generating individual sentences based on anatomical regions
use natural language to clarify model decisions and typically [216]–[219]. For example, Tanida et al. [218] introduce
require textual descriptions for supervision. Some studies a Region-Guided Radiology Report Generation (RGRG)
explore the integration of textual explanations with visual ones. method that identifies unique anatomical regions in the chest
We categorize these methods into three types based on the and generates specific descriptions for the most salient areas,
structure of textual explanations: 1) fully-structured, 2) semi- ensuring each sentence in the report is linked to a particular
structured, and 3) free-structured text, as shown in Fig. 8(a). anatomical region. Overall, semi-structured approaches
1) Fully-structured text generation: To address the chal- effectively balance the rigidity of fully structured reports with
lenges posed by complex unstructured medical reports, early the inconsistency of completely free-text reports.
efforts transformed target texts into fully structured formats, 3) Free-structured text generation: With the advancement of
such as descriptive tags, attributes, or fixed templates, rather language models, reports generated for a given input image
than natural language. For example, Pino et al. [206] propose are no longer limited to structured formats; instead, they now
CNN-TRG, which detects abnormalities through multilabel focus on more open, free-structured text descriptions. These
classification and generates reports based on pre-defined tem- approaches typically involve combining an image encoder to
plates. Some works utilize controlled vocabulary terms (e.g., extract visual features with a language model to produce co-
Medical Subject Headings (MeSH) [207]) to describe image herent sentences [221]. Several research efforts provide com-
content instead of relying on unstructured reports. Both Shin et prehensive explanations that include both textual and visual
al. [208] and Gasimova et al. [209] employ CNN-RNN frame- justifications for diagnostic decisions [222], [224]–[226]. For
works to identify diseases and generate corresponding MeSH instance, Spinks and Moens [222] propose a holistic system
sequences, detailing location, severity, and affected organs in that delivers diagnosis results along with generated textual
chest X-ray images. In addition, Rodin et al. [210] present a captions and a realistic medical image representing the closest
multitask and multimodal model to produce a short textual alternative diagnosis as visual evidence. Additionally, Wang
summary structured as “[pathology], [present/absent], [(op- et al. [226] explore a multi-expert Transformer to generate
tional) location], [(optional) severity]”. However, complete de- reports and attention-mapping visualization of key medical
scriptions in natural language are more human-understandable terms and expert tokens.
than a set of simple tags, leading several studies to focus on In addition to directly generating medical reports, some
generating reports in a semi-structured format. research studies have incorporated the classification of patho-
2) Semi-structured text generation: Generating semi- logical terms or tags in two distinct ways. The first approach
structured text involves a partially structured format with utilizes a “classification-report generation” pipeline, integrat-
predefined topics and constraints in the medical report ing a classifier within the report generation network to enhance
generation process. For instance, pathology report generation feature representations [227], [228]. For example, Yuan et al.
methods [211]–[213] produce reports that focus on describing [227] further employ a sentence-level attention mechanism
12 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

TABLE V
O UTPUT EXPLAINABILITY METHODS THAT PROVIDE TEXTUAL EXPLANATIONS . T HE ABBREVIATIONS HERE ARE MRG: MEDICAL REPORT
GENERATION , CLS: CLASSIFICATION , LOC: LOCATION , SEG: SEGMENTATION , VQA: VISUAL QUESTION ANSWERING , V IS : V ISUAL EXPLANATION .

Method Year Location Modality Task Dataset Text Type Vis.

Pino et al. [206] 2021 Chest X-ray MRG [41], [42] Fully-structured ✓
Shin et al. [208] 2016 Chest X-ray CLS + MRG [220] Fully-structured -
Gasimova et al. [209] 2019 Chest X-ray CLS + MRG [220] Fully-structured -
Rodin et al. [210] 2019 Chest X-ray MRG [42] Fully-structured ✓
Zhang et al. [211] 2017 Bladder Pathology MRG Private Semi-structured ✓
Zhang et al. [212] 2017 Bladder Pathology MRG Private Semi-structured ✓
Ma et al. [213] 2018 Cervix Pathology MRG Private Semi-structured ✓
Wang et al. [214] 2019 Chest X-ray CLS + MRG [41] Semi-structured ✓
Gale et al. [215] 2019 Pelvic X-ray MRG Private Semi-structured ✓
Tanida et al. [218] 2023 Chest X-ray LOC + MRG [216] Semi-structured ✓
Wang et al. [219] 2022 Chest X-ray CLS + MRG [41], [42] Semi-structured -
Singh et al. [221] 2019 Chest X-ray MRG [41] Free-structured -
Spinks et al. [222] 2019 Chest X-ray MRG [41], [223] Free-structured ✓
Liu et al. [224] 2019 Chest X-ray MRG [41], [42] Free-structured ✓
Chen et al. [225] 2020 Chest X-ray MRG [41], [42] Free-structured ✓
Wang et al. [226] 2023 Chest X-ray MRG [41], [42] Free-structured ✓
Yuan et al. [227] 2019 Chest X-ray CLS + MRG [41], [44] Free-structured ✓
Lee et al. [228] 2019 Breast Mammogram CLS + MRG [50] Free-structured ✓
Zhang et al. [229] 2019 Bladder Pathology CLS + MRG [230] Free-structured ✓
Wang et al. [231] 2018 Chest X-ray LOC + MRG [40], [220] Free-structured ✓
Jing et al. [232] 2018 Multiple X-ray, Pathology LOC + MRG [41], [233] Free-structured ✓
Zeng et al. [234] 2020 Multiple US, X-ray LOC + MRG [220] Free-structured ✓
Tian et al. [235] 2018 Abdomen CT SEG + MRG [236] Free-structured ✓
Thawkar et al. [237] 2023 Chest X-ray VQA [42] Free-structured -
Zhou et al. [238] 2024 Skin Dermatology VQA [131], [239] Free-structured -
Moor et al. [240] 2023 Multiple Multiple VQA [241] Free-structured -
Li et al. [242] 2024 Multiple Multiple VQA [38], [39], [243] Free-structured -
He et al. [244] 2024 Multiple Multiple VQA [39], [42], [243] Free-structured -
Chent et al. [245] 2024 Multiple Multiple VQA [38], [39], [243], [246] Free-structured -
Kang et al. [247] 2024 Chest X-ray VQA [41], [42], [248] Free-structured -
Chen et al. [249] 2024 Chest X-ray VQA [249] Free-structured -

alongside a word-level attention model to analyze multi-view analyze and respond to open-ended questions about the input
chest X-rays, using predicted medical concepts to improve the images, thanks to their pretraining on extensive datasets of
accuracy of medical reports. Conversely, the second approach image-report pairs. For instance, XrayGPT [237] demonstrates
employs a “report generation-classification” pipeline, leverag- the alignment of a medical visual encoder (MedClip) with
ing interpretable region-of-interest (ROI) characterization for a fine-tuned LLM (Vicuna) using a linear transformation.
final diagnoses. For instance, Zhang et al. [229] construct Given an input image, this combined model can address open-
a pathologist-level interpretable diagnostic framework that ended questions, such as “What are the main findings and
first detects tumour regions in whole slide images (WSIs), impressions from the given X-ray?”. These models not only
then generates natural language descriptions of microscopic excel in medical image captioning but also demonstrate ex-
findings with feature-aware visual attention, and finally estab- ceptional capability in delivering comprehensive explanations
lishes a diagnostic conclusion. Moreover, integrating region for a wide range of medical inquiries. By leveraging their
localization and lesion segmentation can enhance the quality extensive knowledge and understanding, they contribute to
of textual explanations [231], [232], [234], [235]. For instance, the generation of detailed and informative textual explanations
Wang et al. [231] develop a Text-Image Embedding network within the medical field.
(TieNet) that incorporates multi-level attention to highlight Discussion: Textual explanations have demonstrated signifi-
meaningful text words and X-ray image regions for disease cant effectiveness in providing human-interpretable judgments
detection and reporting. Leveraging fine-grained annotations of through natural language. This type of S-XAI approach has
segmentation masks or bounding boxes for lesions, Tian et al. become especially valuable with the advancement of language
[235] combine a segmentation model with a language model, models, enabling the generation of lengthy reports and the
creating a multimodal framework with a semi-supervised at- ability to answer open-ended questions. However, it is crucial
tention mechanism for CT report generation. to enhance the quality and reliability of these generated textual
explanations. Some recent studies utilize techniques such as
Compared to traditional report generation approaches, the
knowledge decoupling [247] and instruction tuning [249] to
utilization of LLMs offers a more interactive and comprehensi-
address challenges like hallucination, thereby improving the
ble method for generating textual explanations. Recent medical
effectiveness and trustworthiness of textual explanations in
VLMs applied to various medical images, such as chest X-rays
medical applications.
(e.g., XrayGPT [237]), skin images (e.g., SkinGPT [238]), and
general medical images (e.g., Med-flamingo [240], LLaMa-
Med [242], MedDr [244], HuatuoGPT-Vision [245]), can
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 13

B. Counterfactual Explanation A. Desired characteristics of explainability

Counterfactual explanations describe a causal situation by It is important for S-XAI models to possess certain desirable
imagining a hypothetical reality that contradicts the observed qualities when providing explanations. In this regard, Robnik
facts: If X had not occurred, Y would not have occurred [23]. et al. [260] enumerated a set of desirable characteristics
Counterfactual explanations present a contrastive example: for for high-quality explanations generated by XAI methods. In
a given original image, it’s counterfactual image can alter the terms of explanations and explainability methods, Table VI
model’s prediction to a predefined output through the minimal presents expected traits based on our literature review. These
perturbation to observations on the original image, as illus- characteristics can be used to evaluate and compare different
trated in the diagram of Fig. 8(b). Traditional counterfactual S-XAI approaches.
explanations are generated through a post-hoc paradigm [205], For medical applications, the characteristics of high-quality
[250]. Specifically, a classification model is first trained as explanations should align closely with the real-world neces-
a black-box model, and then a generative model such as sities of clinical practice. Van et al. [19] and Adadi et al.
Generative Adversarial Network (GAN) [251] is applied to [261] summarized the characteristics of XAI methods for
produce the counterfactual counterpart of the input image. medical image analysis and healthcare, respectively. Jin et al.
However, dissociating the model’s prediction from its expla- [258] proposed clinical XAI guidelines that consist of five
nation can lead to poor-quality explanations [13]. In particular, criteria for optimizing clinical XAI. These guidelines suggest
counterfactual examples produced in the post-hoc framework selecting an explanation form based on understandability and
are susceptible to issues related to the classifier’s robustness clinical relevance. For the chosen format, the specific XAI
and complexity (e.g., overfitting and excessive generalization), technique should be optimized for truthfulness, informative
resulting in explanations that are inadequate for effective in- plausibility, and computational efficiency. Usability is another
terpretability [252]. To address this, more recent work has ex- factor that enhances a model’s credibility [262]. Individuals
plored self-explainable variants of the counterfactual approach. are more likely to trust a model that provides insights into
An alternative is to incorporate an explanation generation how it accomplished its task. In this regard, an interactive and
module directly into the predictor model, such that the model dynamic explanation is preferred over a static one.
can provide explanations for its own predictions. In general,
the predictor and explanation generator are trained jointly, B. Evaluation Methods
hence the presence of the explanation generator is influenc-
Doshi-Velez and Kim [263] propose three distinct categories
ing the training of the predictor. For example, CounterNet
for evaluating XAI methods. 1) Application-grounded evalu-
[253] integrates the training of the predictive model with the
ations engage experts specific to a field, such as doctors for
generation of counterfactual explanations into a single, end-
diagnostic purposes. 2) Human-grounded evaluations involve
to-end process, allowing for joint optimization. Compared to
non-experts assessing the overall quality of explanations. 3)
post-hoc approaches, it is able to produce counterfactuals with
Functionality-grounded evaluations use proxy tasks instead
higher validity. Following a similar pipeline, Guyomard et al.
of human input to evaluate explanation quality, which are
[254] propose VCNet, in which the counterfactual generator
desirable for interpretability due to constraints related to time
is based on conditional variational autoencoder (cVAE) whose
and cost. In the medical field, it is crucial to involve end-users,
latent space can be controlled and tweaked to generate more
such as junior and senior doctors, in the evaluation process,
realistic counterfactuals.
ideally in contexts that utilize real tasks and data [2].
In the medical domain, Wilms et al. [255] propose an 1) Human-centered evaluation: It is essential to conduct
invertible, self-explainable generative model based on effi- human-centered evaluations in collaboration with medical
cient normalizing flow for brain age regression and brain experts to assess whether end users are satisfied with the
sex classification on 3D neuroimaging data. The invertible explanations provided by S-XAI models. In human-centered
model can generate predictions during the forward process and evaluations, the quality of the explanations can be assessed
produce explanations including voxel-level attribution maps using qualitative metrics and quantitative metrics.
and counterfactual images to clarify its decision-making in Qualitative metrics: Qualitative metrics include evaluating
the reverse process. the usefulness, satisfaction, confidence, and trust in provided
Discussion: S-XAI models that generate counterfactual explanations through interviews or questionnaires [264]. For
explanations alongside predictions show greater promise than instance, System Causability Scale [265] measures the quality
post-hoc methods. However, their application in medical image of interpretability methods applicable in the medical field.
analysis remains largely underexplored. Gale et al. [215] assess experts’ acceptance of explanations
by scoring each type on a 10-level Likert scale. Their findings
VI. E VALUATION indicate that doctors prefer human-style text explanations over
saliency maps and favor a combination of both saliency maps
As discussed in the previous sections, various efforts are and generated text rather than using either one alone.
being made to investigate S-XAI methods in medical image Quantitative metrics: Quantitative metrics focus on measur-
analysis. However, assessing explainability presents significant ing task performance of human-machine collaboration with
challenges. In this section, we will outline the desired charac- factors such as accuracy, response time, likelihood of devia-
teristics and evaluation methods for explainability. tion, ability to detect errors, and even physiological responses
14 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

TABLE VI
D ESIRABLE QUALITIES OF EXPLANATION METHODS , INDIVIDUAL EXPLANATIONS , AND HUMAN - FRIENDLY EXPLANATIONS .

Type Qualities Description Ref.

Faithfulness, Fidelity, Truthfulness Explanations should truthfully reflect the AI model decision process. [256]–[258]
Consistency, Invariance, Robustness For a fixed model, explanation of similar data points (with similar [7], [258]
prediction outputs) should be similar.
Understandability, Comprehensibility Explanations should be easily understandable by clinical users without [256], [258]
Explanations requiring technical knowledge.
Clinical Relevance Explanation should be relevant to physicians’ clinical decision-making [258]
pattern, and can support their clinical reasoning process.
Plausibility, Factuality, Persuasiveness Users’ judgment on explanation plausibility (i.e., how convincing the [7], [256], [258]
explanations are to humans) may inform users about AI decision
quality, including potential flaws or biases.

Explainability Computational Complexity The computational complexity of explanation algorithms. [258], [259]
Methods Generalizability, Portability To increase the utility because of the diversity of model architectures. [259]

[264]. For example, Sayres et al. [266] investigate the impact fixation is an emerging data modality that can provide key
of a deep learning model on doctors’ performance in predict- diagnostic features by tracking the gaze patterns and visual
ing diabetic retinopathy (DR) severity. Ten ophthalmologists attention of clinicians, which is also utilized as the ground
with varying levels of experience read images under three truth of attention maps [271].
conditions: unassisted, predicted grades only, and predicted Concept-based explanation: To evaluate concept-based ex-
grades with heatmaps. The results indicate that AI assistance planations, researchers mainly focus on metrics such as Con-
improves diagnostic accuracy, subjective confidence, and time cept Error [113], [163], T-CAV score [117], Completeness
spent. However, in most cases, the combination of grades Score [164], and Concept Relevance [4], [120]. Additionally,
and heatmaps is only as effective as using grades alone, and other evaluation methods exist. For example, Zarlenga et al.
actually decreased accuracy for patients without DR. [159] propose Concept Alignment Score (CAS) and Mutual
Overall, human-centered evaluations offer the significant ad- Information to evaluate concept-based explainability. Wang et
vantage of providing direct and compelling evidence of the al. [195] adopt Concept Purity to assess the model’s capability
effectiveness of explanations [263]. However, these evaluations to discover concepts that only cover a single shape.
tend to be costly and time-consuming due to the need to recruit Example-based explanation: In the evaluation of example-
expert participants and obtain necessary approvals, as well as based explanations, Nguyen and Martinez [272] establish two
the additional time required for conducting the experiments. quantitative metrics: 1) non-representativeness, which evalu-
Most importantly, these evaluations are inherently subjective. ates how well the examples represent the explanations, thereby
2) Functionality-grounded evaluation: This category of eval- measuring the fidelity of the explanation; and 2) diversity,
uation, which do not involve human-subject investigations, can which gauges the degree of integration within the explanation.
be employed to assess the fidelity of explanations. The accu- Additionally, Huang et al. [200] developed two metrics: 1)
racy of S-XAI methods in generating genuine explanations a consistency score to determine whether the prototype con-
is referred to as the fidelity of an explainer. In this section, sistently highlights the same semantically meaningful areas
we will present a variety of functionality-grounded evaluation across different images, and 2) a stability score to assess
methods for different types of explanations. whether it reliably identifies the same area after the image
Attention-based explanation: In the absence of references, is exposed to noise.
attention-based explanations can be assessed through a causal Textual explanation: The common assessment of textual
framework. For example, Petsiuk et al. [267] introduce two explanations involves using metrics such as BLEU [273],
causal metrics, i.e., deletion and insertion. Following this, ROUGE-L [274], and CIDEr [275] to compare generated
Hooker et al. [268] propose RemOve And Retrain (ROAR), natural language descriptions against ground truth reference
a method that evaluates how the accuracy of a retrained sentences provided by experts. Patricio et al. [17] conduct a
model decreases when essential features in specific regions benchmark study of interpretable medical imaging approaches,
are removed. With the manually annotated ground truth data, specifically evaluating the quality of textual explanations for
such as object bounding boxes or semantic masks, the accuracy chest X-ray images.
of attention-based explanations can be evaluated by comparing Counterfactual explanation: Singla et al. [276] employ three
with these references. Yan et al. [83] and Hou et al. [269] cal- metrics to evaluate counterfactual explanations for chest X-ray
culate the Jaccard index value and the AUC score to measure classification: 1) Frechet Inception Distance (FID) to assess
the effectiveness of attention maps, respectively. Additionally, visual quality, 2) Counterfactual Validity (CV) to determine if
Barnett et al. [270] introduce the Activation Precision metric, the counterfactual aligns with classifier’s predictions, and 3)
which quantifies the proportion of relevant information from Foreign Object Preservation (FOP), which examines whether
the relevant region used to classify the mass margin based patient-specific information is retained. Additionally, they use
on radiologist annotations. Furthermore, human expert eye clinical metrics, including the cardiothoracic ratio and a score
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 15

for detecting normal costophrenic recess, to illustrate the enhance the input explainability through explainable prompts
clinical utility of the explanations. [284] and knowledge-enhanced prompts [285], ultimately im-
proving model performance.
VII. C HALLENGES AND F UTURE D IRECTIONS 2) Foundation models advance S-XAI: Foundation models
Despite the rapid advancements in S-XAI for medical image learn generally useful representations from the clinical knowl-
analysis, several significant challenges remain unresolved. In edge embedded in medical corpora [286]. By harnessing the
this section, we will analyze the existing challenges and sophisticated capabilities of foundation models, S-XAI meth-
discuss potential future directions to enhance the effectiveness ods can produce user-friendly explanations [287] and support
and reliability of S-XAI in the medical domain. more flexible generative concept-based learning [288], [289].
Moreover, foundation models can facilitate the evaluation of
A. S-XAI Benchmark Construction S-XAI methods that emulate human cognitive processes [290].
Establishing benchmarks for S-XAI in medical image anal-
ysis. is essential. These benchmarks will standardize evalua- C. S-XAI with Human-in-the-Loop
tions, enable fair comparisons between different methods, and
ultimately enhance the reliability of medical AI applications. Integrating Human-in-the-Loop (HITL) processes is crucial
1) Dataset construction: One of the main challenges in for effectively implementing S-XAI in the medical field. This
collecting medical data is the limited availability of doctors to approach not only enhances the overall performance of AI
annotate large datasets. This challenge is even more significant systems but also fosters trust among medical experts.
in S-XAI, where additional fine-grained annotations, such as 1) Enhancing prediction accuracy through human interven-
concepts and textual descriptions, are necessary. As a result, tion: A HITL framework allows for the identification and
medical datasets that meet interpretability standards often have removal of potential confounding factors, such as artifacts
a limited volume of data, reducing the generalizability and or biases in datasets, during the training phase. For instance,
applicability of S-XAI methods in real-world contexts. clinicians can adjust the outputs of predicted concepts, leading
2) Evaluation metrics: Automated evaluation of explana- to a more accurate concept bottleneck model [129]. This
tions generated by S-XAI methods poses another significant collaborative approach can significantly enhance the model’s
challenge. In the medical field, human-centered evaluations of- accuracy by incorporating expert insights.
ten rely on the expertise of clinicians. However, the variability 2) Improving explainability through human feedback: To en-
in expert opinions can lead to biased and subjective assess- sure continuous improvement, a versioning or feedback evalu-
ments [277]. Meanwhile, existing functionality-grounded eval- ation system should be established, enabling the final system to
uations still depend on manual annotations. Thus, developing build trust during hospital evaluations. Achieving this requires
objective metrics to evaluate the quality of model explanations fostering collaboration between S-XAI researchers and clinical
is likely to become an important research focus. practitioners, ensuring that feedback is systematically gathered
To tackle these challenges, future directions include leverag- and used to refine the models.
ing semi-automated annotation tools to assist clinicians in the However, one challenge in integrating HITL processes is
annotation process, thereby easing their workload. Addition- the variability in clinician expertise and availability, which
ally, developing objective metrics and standardized protocols can affect the consistency and quality of human feedback.
to assess the quality of model explanations will be a critical Ensuring that human knowledge is effectively integrated into
research trend in S-XAI. the AI training process without introducing additional biases
or errors is a complex task.
B. S-XAI in the Era of Foundation Models
Foundation models, including large language models
D. Trade-off between Performance and Interpretability
(LLMs) and vision-language models (VLMs), have trans-
formed the AI landscape, finding applications across diverse It is widely believed that as model complexity increases
fields such as natural language processing, computer vision, to enhance performance, the model’s interpretability tends to
and multimodal understanding. Notably, medical LLMs [278]– decline [291], [292]. Conversely, more interpretable models
[280] and medical VLMs [237], [240], [281] are designed may sacrifice some predictive accuracy. However, it is im-
to encode rich domain-specific knowledge. The intersection portant to note that some researches contend that there is no
of medical foundation models and S-XAI presents significant scientific evidence for a general trade-off between accuracy
opportunities for the future of medical AI systems [262]. and interpretability [293]. In fact, recent advancements in
1) S-XAI benefits foundation models: Foundation models are concept-based models [132], [139], [142] have demonstrated
typically large models with an extremely huge number of performance on par with black-box models in medical image
parameters, trained on vast datasets. The complexity of these applications. This achievement depends on the researcher’s
models make it challenging to explore their decision-making ability to identify patterns in an interpretable manner while
processes, which may result in potential biases and a lack of maintaining the flexibility to accurately fit the data [13]. Future
transparency. Apart from leveraging post-hoc XAI techniques S-XAI methods are expected to aim for an optimization of
(e.g., attribution maps [282], [283]) to interpret the decision- both performance and interpretability, potentially providing
making processes of foundation models, S-XAI methods can theoretical foundations for this balance.
16 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

E. Other Explainability of S-XAI 1) Radiology: Radiological images generally include

1) Multi-modal explainability: Embracing multi-modal ex- modalities such as X-ray, MRI, CT, mammography, and
plainability is a promising direction for S-XAI in the medical ultrasound. Among these, X-ray is one of the most commonly
field. Medical data often exists in various forms, including used modalities in S-XAI for medical image analysis.
images, text, and omics. By integrating these modalities, multi- For instance, the SLAKE [38] dataset collects knowledge
modal S-XAI approaches can offer more comprehensive and triplets from the open source knowledge graph to assist
intuitive explanations that align with clinicians processes. the model in achieving a better understanding of X-ray
Additionally, S-XAI methods can identify correlations during images. Medical-CXR-VQA [63] focuses on the five types
data fusion [294], offering significant potential for discovering of questions (i.e., abnormality, presence, view, location
new biomarkers. For example, exploring correlations between and type) and aligns them with the key information of the
radiological and pathological images could help discover non- X-ray image, resulting more reliable answers. VQA-RAD
invasive biomarkers as alternatives for tumor diagnosis. [243] is manually constructed based on clinicians asking
2) Causal explainability: Another direction for S-XAI in- naturally occurring questions about radiology images and
volves causality, which defines the cause-and-effect relation- providing reference answers, resulting in a dataset rich
ship and can be mathematically modeled [295]. Traditional in quality expertise and knowledge to capture the details
deep learning methods in medical imaging often confuse cor- of radiology images. OAI [124] provides knee X-rays for
relation with causation, leading to potentially harmful errors. knee osteoarthritis grading and offers clinical concepts (e.g.,
For instance, DeGrave et al. [296] use XAI techniques to audit joint space narrowing, bone spurs, calcification), making
COVID-19 diagnosis methods from chest X-rays find that it suitable for concept-based learning. Moreover, datasets
these methods are primarily identifying spurious correlations. such as IU X-ray [41], MIMIC-CXR [42], OpenI [220], and
To address dataset biases, Castro et al. [297] emphasize Chest ImaGenome [299] provide a large number of chest
the role of causal reasoning in detecting biases. Luo et al. X-rays along with corresponding free-text reports, which
[298] develop debiased models based on biased training data can facilitate the generation of textual explanations. Finally,
generated from causal assumptions for diagnosing chest X- MIMIC-CXR-VQA [248], CheXbench [249], SLAKE [38],
rays. Incorporating such analyses into the explainability of and MIMIC-Diff-VQA [62] are constructed for the VQA
medical images analysis could be highly beneficial. task, providing support for interactive explanations of S-XAI
models.
Regarding MRI, SUN09 [103] and AC17 [104] provide car-
VIII. C ONCLUSION diac segmentation masks, while BraTS 2020 [107] offers brain
This survey reviews recent advancements in self-explainable masks. However, most MRI datasets have a limited number of
artificial intelligence (S-XAI) for medical image analysis. samples, which may affect the generalization ability of models.
Contrary to previous surveys that primarily focus on post-hoc For CT images, datasets like CT-150 [100], NIH-TCIA CT-
XAI techniques, this paper emphasizes inherently interpretable 82 [101], and LiTS [236] can be used for segmentation tasks,
S-XAI models, which are gaining traction in research. This while LIDC-IDRI [146] provides lung cancer annotations with
survey introduce S-XAI from three key perspectives, i.e., input grade of eight attributes, benefiting concept-based learning.
explainability, model explainability, and output explainability. Additionally, disease diagnoses on mammography datasets
Additionally, this survey explore the desired characteristics of [50], [179], [300] and ultrasound datasets [158], [190] also
explainability and various evaluation methods for assessing serve as applications for S-XAI methods.
explanation quality. While significant progress has been made, 2) Dermatology: In the scope of dermatology, datasets
it also highlights key challenges that need to be tackled with fine-grained concept annotations are commonly used
and provides insights for future research on trustworthy AI in concept-based learning for S-XAI. Derm7pt [127] is a
systems in clinical practice. Overall, this survey serves as dermoscopic image dataset containing 1,011 images with
a valuable reference for the XAI community, particularly clinical concepts for melanoma skin lesions according to the
within the medical imaging field, and lays the groundwork 7-point checklist criteria [301]. PH2 dataset [128] includes 200
for future advancements that will improve the transparency dermoscopic images of melanocytic lesions with segmentation
and trustworthiness of AI tools in healthcare. masks and several clinical concepts. SkinCon [131] is a
skin disease dataset containing 3,230 images with 48 clinical
A PPENDIX concepts densely annotated by dermatologists for fine-grained
model debugging and analysis, where the images are selected
A. Public datasets used in S-XAI from the Fitzpatrick 17k [133] and DDI [134] skin image
We provide an overview of more than 70 datasets currently datasets. Other datasets such as ISIC [54], [85], [86] and
available for S-XAI in the medical image domain. Table VII HAM10000 [106] are also broadly used datasets but without
presents the key characteristics of these datasets, including explicit fine-grained concept annotations. Dermoscopic image
modality, scale, and task. In this section, we will introduce the datasets significantly facilitate the development of S-XAI,
relevant datasets categorized by image modalities, highlighting however, annotating concept labels requires the efforts of
their contributions to the development of S-XAI. For more human experts and is labor-intensive. Hence there are currently
detailed information about these datasets, we direct readers to only a few datasets with a limited number of samples that have
the relevant publications and sources. fine-grained concept labels. This poses a significant challenge
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 17

for concept-based S-XAI, especially in supervised concept

learning.
3) Pathology: With regard to pathological image datasets,
PathVQA [39] is the first dataset focused on pathology VQA,
featuring over 32K open-ended questions derived from 4,998
pathology images. PEIR Gross [233] contains 7,442 image-
caption pairs across 21 different sub-categories, with each
caption consisting of a single sentence. The Cancer Genome
Atlas (TCGA) [230] provides multimodal data for over 20,000
tumor and normal samples, offers multimodal data for more
than 20K tumor and normal samples, encompassing clinical
data, DNA, and various imaging types (diagnostic images,
tissue images, and radiological images). Additionally, datasets
such as BACH [81], Biopsy4Grading [88], and NCT [157]
are available for classifying breast cancer, non-alcoholic fatty
liver disease, and colorectal cancer, respectively. WBCAtt
[143] provides 113K microscopic images of white blood
cells, annotated with 11 morphological attributes categorized
into four main groups: overall cell, nucleus, cytoplasm, and
granules. Since pathological images are the “gold standard”
for cancer diagnosis, the development of S-XAI models in
pathology is highly significant.
4) Retinal images: Regarding retinal disease classification,
EyePACS [185] is a large-scale dataset for grading diabetic
retinopathy, containing more than 88K images. ACRIMA [82]
offers 705 images for glaucoma assessment. In addition to
classification labels, FGADR [150], DDR [151], and IDRID
[155] also provide fine-grained masks for segmenting various
types of lesions.
5) Others: Other medical datasets utilized in S-XAI include
the Hyperkvasir endoscopy dataset [55] and the Infectious Ker-
atitis slit lamp microscopy dataset [148]. Additionally, recent
datasets like PMC-OA [241] and PMC-VQA [246] collect data
from open-source medical literature corpus. These datasets
encompass a wealth of multimodal data, which significantly
facilitate the development of medical foundation models.
18 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

TABLE VII: Public datasets used in the reviewed S-XAI methods.

Dataset Modality Scale Task

SLAKE [38] X-ray 642 images, 14,028 QA pairs VQA, SEG, DET
IU X-ray [41] X-ray 8,121 images, 3,996 texts MRG
MIMIC-Diff-VQA [62] X-ray 700K QA pairs, 164K images VQA
OAI [124] X-ray 26,626,000 images CLS
CheXbench [249] X-ray 6.1M QA pairs VQA
ChestXray [40] X-ray 108,948 images CLS, MRG, DET, LOC
CheXpert [44] X-ray 224K images CLS
MIMIC-CXR [42] X-ray 377K images, 227K texts CLS, MRG
BCDR-F03 [79] X-ray 736 images CLS
RSNA [89] X-ray 30,000 images CLS, DET
ZhangLabData [90] OCT, X-ray 108,312 OCT images, 5,232 X-ray images CLS
SIIM-FISABIO-RSNA [91] X-ray 10,178 images CLS, DET
Public Radiography [92] X-ray 3,487 images CLS
COVQU [93] X-ray 18,479 images CLS, SEG
COVID-19-NY-SBU [94] X-ray, CT, MRI 1,384 cases CLS
MIDRC-RICORD-1C [95] X-ray 361 cases CLS
NIH [97] X-ray 108,948 images CLS
VinBigData [98] X-ray 18,000 images CLS, DET
Montgomery [154] X-ray 138 images CLS
COVID-19 [174] X-ray 761 images CLS
OpenI [220] X-ray 7,470 images, 3,955 texts CLS, MRG
Chest ImaGenome [216] X-ray 242K images, 217K texts CLS, MRG, DET
MIMIC-CXR-VQA [248] X-ray 377K images with QA pairs VQA
ADNI-1 [99] MRI 818 subjects CLS, REG
ADNI-2 [112] MRI 599 subjects CLS, REG
SUN09 [103] MRI 395 slices SEG
AC17 [104] MRI 200 volumes SEG
BraTS 2020 [107] MRI 494 cases SEG, CLS
Calgary Campinas MRI [110] MRI 359 subjects SEG
OASIS [176] MRI 416 cases CLS
BraTS 2014 [183] MRI 65 scans SEG
IXI [189] MRI 600 cases REG
RICODR [96] CT, X-ray 240 CT scans, 1,000 X-ray images CLS
CT-150 [100] CT 150 scans SEG
Pancreas-CT [101] CT 82 scans SEG
CHAOS [108] CT, MRI 40 CT scans, 120 MRI scans SEG
LIDC-IDRI [146] CT 1,018 scans CLS, SEG
LiTS [236] CT 201 scans SEG
VQA-RAD [243] CT, MRI, X-ray 3,515 QA pairs, 315 images VQA
DDSM [50] Mammogram 2,620 cases CLS, MRG, SEG
CBIS-DDSM [179] Mammogram 1,644 cases CLS, SEG
CMMD [181] Mammogram 1,775 cases CLS
BUSI [158] Ultrasound 780 images CLS
FGLS [190] Ultrasound 4,290 volumes REG
HAM10000 [106] Dermatology 10,015 images CLS
ISIC 2016 [84] Dermatology 1,279 images CLS, SEG
ISIC 2017 [85] Dermatology 2,750 images CLS, SEG
ISIC 2018 [86] Dermatology 15,121 images CLS, SEG
ISIC 2019 [54] Dermatology 33,569 images CLS
Derm7pt [127] Dermatology 1,011 images CLS
PH2 [128] Dermatology 200 images CLS, SEG
SkinCon [131] Dermatology 3,230 images CLS
Fitzpatrick 17k [133] Dermatology 16,577 images CLS
DDI [134] Dermatology 656 images CLS
DermNetNZ [137] Dermatology 25K images CLS
SD-260 [138] Dermatology 6,584 images CLS
Dermnet [239] Dermatology 18,856 images CLS, VQA
PathVQA [39] Pathology 32K QA pairs, 4,998 images VQA
BACH [81] Histopathology 500 images CLS
Biopsy4Grading [88] Histopathology 351 images CLS
WBCAtt [143] Microscopy 113,278 images CLS
NCT [157] Histopathology 100K images CLS
TCGA [230] Pathology 20K studies CLS, MRG
PEIR Gross [233] Pathology 7,442 image-text pairs MRG
ACRIMA [82] Retinal images 705 images CLS
FGADR [150] Retinal images 2,842 images CLS, SEG
DDR [151] Retinal images 13,673 images CLS, SEG, DET
IDRID [155] Retinal images 516 images CLS, SEG, LOC
Continued on next page
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 19

– continued from previous page

Dataset Modality Scale Task
EyePACS [185] Retinal images 88,702 images CLS
Messidor [187] Retinal images 1,200 images CLS
Hyperkvasir [55] Endoscopy 110,079 images, 374 videos SEG, DET
Infectious Keratitis [148] Slit lamp microscopy 115,408 images CLS
PMC-OA [241] Multiple 1.65M image-text pairs VQA
PMC-VQA [246] Multiple 227K QA pairs, 149K images VQA
20 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

R EFERENCES [26] H. Xiang et al., “Development and validation of an interpretable

model integrating multimodal information for improving ovarian cancer
[1] X. Jia, L. Ren, and J. Cai, “Clinical implementation of ai technologies diagnosis,” Nature Communications, vol. 15, no. 1, p. 2681, 2024.
will require interpretable ai models,” Medical physics, no. 1, pp. 1–4, [27] N. Lassau et al., “Integrating deep learning ct-scan model, biological
2020. and clinical variables to predict severity of covid-19 patients,” Nature
communications, vol. 12, no. 1, pp. 1–11, 2021.
[2] Z. Salahuddin et al., “Transparency of deep neural networks for medical
[28] S. Ji et al., “A survey on knowledge graphs: Representation, acquisition,
image analysis: A review of interpretability methods,” Computers in
and applications,” IEEE transactions on neural networks and learning
biology and medicine, vol. 140, p. 105111, 2022.
systems, vol. 33, no. 2, pp. 494–514, 2021.
[3] L. Luo et al., “Rethinking annotation granularity for overcoming
[29] S. Pan et al., “Unifying large language models and knowledge graphs:
shortcuts in deep learning–based radiograph diagnosis: A multicenter
A roadmap,” IEEE Transactions on Knowledge and Data Engineering,
study,” Radiology: Artificial Intelligence, vol. 4, no. 5, p. e210299,
2024.
2022.
[30] C. Peng et al., “Knowledge graphs: Opportunities and challenges,”
[4] D. Alvarez Melis and T. Jaakkola, “Towards robust interpretability Artificial Intelligence Review, vol. 56, no. 11, pp. 13 071–13 102, 2023.
with self-explaining neural networks,” Advances in neural information [31] X. Xie et al., “A survey on incorporating domain knowledge into deep
processing systems, vol. 31, 2018. learning for medical image analysis,” Medical Image Analysis, vol. 69,
[5] S. Bach et al., “On pixel-wise explanations for non-linear classifier p. 101985, 2021.
decisions by layer-wise relevance propagation,” PloS one, vol. 10, no. 7, [32] S. Liu and H. Chen, “Knowledge injected multimodal irregular ehrs
p. e0130140, 2015. model for medical prediction,” in International Workshop on Trustwor-
[6] B. Zhou et al., “Learning deep features for discriminative localization,” thy Artificial Intelligence for Healthcare. Springer, 2024, pp. 25–39.
in Proceedings of the IEEE conference on computer vision and pattern [33] S. Liu et al., “Shape: A sample-adaptive hierarchical prediction network
recognition, 2016, pp. 2921–2929. for medication recommendation,” IEEE Journal of Biomedical and
[7] M. T. Ribeiro, S. Singh, and C. Guestrin, “” why should i trust you?” Health Informatics, 2023.
explaining the predictions of any classifier,” in Proceedings of the 22nd [34] U. Naseem et al., “K-pathvqa: Knowledge-aware multimodal repre-
ACM SIGKDD international conference on knowledge discovery and sentation for pathology visual question answering,” IEEE Journal of
data mining, 2016, pp. 1135–1144. Biomedical and Health Informatics, 2023.
[8] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting [35] Y. Zhang et al., “When radiology report generation meets knowledge
model predictions,” Advances in neural information processing systems, graph,” in Proceedings of the AAAI conference on artificial intelligence,
vol. 30, 2017. vol. 34, no. 07, 2020, pp. 12 910–12 917.
[9] B. Kim et al., “Interpretability beyond feature attribution: Quantitative [36] F. Liu et al., “Exploring and distilling posterior and prior knowledge
testing with concept activation vectors (tcav),” in International confer- for radiology report generation,” in Proceedings of the IEEE/CVF
ence on machine learning. PMLR, 2018, pp. 2668–2677. Conference on Computer Vision and Pattern Recognition, 2021, pp.
[10] J. Crabbé and M. van der Schaar, “Concept activation regions: A 13 753–13 762.
generalized framework for concept-based explanations,” Advances in [37] Z. Huang, X. Zhang, and S. Zhang, “Kiut: Knowledge-injected u-
Neural Information Processing Systems, vol. 35, pp. 2590–2607, 2022. transformer for radiology report generation,” in Proceedings of the
[11] B. Mittelstadt, C. Russell, and S. Wachter, “Explaining explanations in IEEE/CVF Conference on Computer Vision and Pattern Recognition,
ai,” in Proceedings of the conference on fairness, accountability, and 2023, pp. 19 809–19 818.
transparency, 2019, pp. 279–288. [38] B. Liu et al., “Slake: A semantically-labeled knowledge-enhanced
[12] J. Zhang et al., “Overlooked trustworthiness of saliency maps,” in dataset for medical visual question answering,” in 2021 IEEE 18th
International Conference on Medical Image Computing and Computer- International Symposium on Biomedical Imaging (ISBI). IEEE, 2021,
Assisted Intervention. Springer, 2022, pp. 451–461. pp. 1650–1654.
[13] C. Rudin, “Stop explaining black box machine learning models for high [39] X. He et al., “Pathvqa: 30000+ questions for medical visual question
stakes decisions and use interpretable models instead,” Nature machine answering,” arXiv preprint arXiv:2003.10286, 2020.
intelligence, vol. 1, no. 5, pp. 206–215, 2019. [40] X. Wang et al., “Chestx-ray8: Hospital-scale chest x-ray database and
[14] J. R. Quinlan, “Induction of decision trees,” Machine learning, vol. 1, benchmarks on weakly-supervised classification and localization of
pp. 81–106, 1986. common thorax diseases,” in Proceedings of the IEEE conference on
[15] T. J. Hastie, “Generalized additive models,” in Statistical models in S. computer vision and pattern recognition, 2017, pp. 2097–2106.
Routledge, 2017, pp. 249–307. [41] D. Demner-Fushman et al., “Preparing a collection of radiology exami-
[16] C. Grosan and A. Abraham, Intelligent systems. Springer, vol. 17. nations for distribution and retrieval,” Journal of the American Medical
Informatics Association, vol. 23, no. 2, pp. 304–310, 2016.
[17] C. Patrı́cio, J. C. Neves, and L. F. Teixeira, “Explainable deep learning
[42] A. E. Johnson et al., “Mimic-cxr, a de-identified publicly available
methods in medical image classification: A survey,” ACM Computing
database of chest radiographs with free-text reports,” Scientific data,
Surveys, vol. 56, no. 4, pp. 1–41, 2023.
vol. 6, no. 1, p. 317, 2019.
[18] G. Yang, Q. Ye, and J. Xia, “Unbox the black-box for the medical
[43] B. Chen et al., “Label co-occurrence learning with graph convolutional
explainable ai via multi-modal and multi-centre data fusion: A mini-
networks for multi-label chest x-ray image classification,” IEEE journal
review, two showcases and beyond,” Information Fusion, vol. 77, pp.
of biomedical and health informatics, vol. 24, no. 8, pp. 2292–2302,
29–52, 2022.
2020.
[19] B. H. Van der Velden et al., “Explainable artificial intelligence (xai) in [44] J. Irvin et al., “Chexpert: A large chest radiograph dataset with
deep learning-based medical image analysis,” Medical Image Analysis, uncertainty labels and expert comparison,” in Proceedings of the AAAI
vol. 79, p. 102470, 2022. conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 590–597.
[20] R.-K. Sheu and M. S. Pardeshi, “A survey on medical explainable ai [45] W. Zheng et al., “Pay attention to doctor–patient dialogues: Multi-
(xai): recent progress, explainability approach, human interaction and modal knowledge graph attention image-text embedding for covid-19
scoring system,” Sensors, vol. 22, no. 20, p. 8068, 2022. diagnosis,” Information Fusion, vol. 75, pp. 168–185, 2021.
[21] E. Tjoa and C. Guan, “A survey on explainable artificial intelligence [46] D. Hou, Z. Zhao, and S. Hu, “Multi-label learning with visual-semantic
(xai): Toward medical xai,” IEEE transactions on neural networks and embedded knowledge graph for diagnosis of radiology imaging,” IEEE
learning systems, vol. 32, no. 11, pp. 4793–4813, 2020. Access, vol. 9, pp. 15 720–15 730, 2021.
[22] A. Singh, S. Sengupta, and V. Lakshminarayanan, “Explainable deep [47] Y. Zhou et al., “Contrast-attentive thoracic disease recognition with
learning models in medical image analysis,” Journal of imaging, vol. 6, dual-weighting graph reasoning,” IEEE Transactions on Medical Imag-
no. 6, p. 52, 2020. ing, vol. 40, no. 4, pp. 1196–1206, 2021.
[23] M. Christoph, Interpretable machine learning: A guide for making [48] C. Wu et al., “Medklip: Medical knowledge enhanced language-image
black box models explainable. Leanpub, 2020. pre-training for x-ray diagnosis,” in Proceedings of the IEEE/CVF
[24] S. Kapse et al., “Si-mil: Taming deep mil for self-interpretability in International Conference on Computer Vision, 2023, pp. 21 372–
gigapixel histopathology,” in Proceedings of the IEEE/CVF Conference 21 383.
on Computer Vision and Pattern Recognition, 2024, pp. 11 226–11 237. [49] Y. Liu et al., “Act like a radiologist: towards reliable multi-view
[25] U. Sajid et al., “Breast cancer classification using deep learned features correspondence reasoning for mammogram mass detection,” IEEE
boosted with handcrafted features,” Biomedical Signal Processing and Transactions on Pattern Analysis and Machine Intelligence, vol. 44,
Control, vol. 86, p. 105353, 2023. no. 10, pp. 5947–5961, 2021.
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 21

[50] M. Heath et al., “Current status of the digital database for screening [74] R. Gu et al., “Ca-net: Comprehensive attention convolutional neural
mammography,” in Digital Mammography: Nijmegen, 1998. Springer, networks for explainable medical image segmentation,” IEEE transac-
1998, pp. 457–460. tions on medical imaging, vol. 40, no. 2, pp. 699–711, 2020.
[51] G. Zhao, “Cross chest graph for disease diagnosis with structural [75] C. Barata, M. E. Celebi, and J. S. Marques, “Explainable skin lesion
relational reasoning,” in Proceedings of the 29th ACM International diagnosis using taxonomies,” Pattern Recognition, vol. 110, p. 107413,
Conference on Multimedia, 2021, pp. 612–620. 2021.
[52] B. Qi et al., “Gren: graph-regularized embedding network for weakly- [76] G. Lozupone et al., “Axial: Attention-based explainability for inter-
supervised disease localization in x-ray images,” IEEE Journal of pretable alzheimer’s localized diagnosis using 2d cnns on 3d mri brain
Biomedical and Health Informatics, vol. 26, no. 10, pp. 5142–5153, scans,” arXiv preprint arXiv:2407.02418, 2024.
2022. [77] J. Huang et al., “Swin deformable attention u-net transformer (sdaut)
[53] M. Elbatel, R. Martı́, and X. Li, “Fopro-kd: fourier prompted effective for explainable fast mri,” in International Conference on Medical Image
knowledge distillation for long-tailed medical image recognition,” IEEE Computing and Computer-Assisted Intervention. Springer, 2022, pp.
Transactions on Medical Imaging, 2023. 538–548.
[54] M. Combalia et al., “Bcn20000: Dermoscopic lesions in the wild,” [78] H. Wang et al., “Breast mass classification via deeply integrating the
arXiv preprint arXiv:1908.02288, 2019. contextual information from multi-view data,” Pattern Recognition,
[55] H. Borgli et al., “Hyperkvasir, a comprehensive multi-class image and vol. 80, pp. 42–52, 2018.
video dataset for gastrointestinal endoscopy,” Scientific data, vol. 7, [79] J. Arevalo et al., “Representation learning for mammography mass
no. 1, p. 283, 2020. lesion classification with convolutional neural networks,” Computer
[56] C. Y. Li et al., “Knowledge-driven encode, retrieve, paraphrase for Methods and Programs in Biomedicine, vol. 127, pp. 248–257, 2016.
medical image report generation,” in Proceedings of the AAAI confer- [Online]. Available: https://www.sciencedirect.com/science/article/pii/
ence on artificial intelligence, vol. 33, no. 01, 2019, pp. 6666–6673. S0169260715300110
[57] F. Liu et al., “Auto-encoding knowledge graph for unsupervised med- [80] H. Yang et al., “Guided soft attention network for classification of
ical report generation,” Advances in Neural Information Processing breast cancer histopathology images,” IEEE transactions on medical
Systems, vol. 34, pp. 16 266–16 279, 2021. imaging, vol. 39, no. 5, pp. 1306–1315, 2019.
[58] M. Li et al., “Dynamic graph enhanced contrastive learning for chest [81] G. Aresta et al., “Bach: Grand challenge on breast cancer histology
x-ray report generation,” in Proceedings of the IEEE/CVF Conference images,” Medical image analysis, vol. 56, pp. 122–139, 2019.
on Computer Vision and Pattern Recognition, 2023, pp. 3334–3343. [82] A. Diaz-Pinto et al., “Cnns for automatic glaucoma assessment us-
[59] K. Kale et al., “Kgvl-bart: Knowledge graph augmented visual lan- ing fundus images: an extensive validation,” Biomedical engineering
guage bart for radiology report generation,” in Proceedings of the online, vol. 18, pp. 1–19, 2019.
17th Conference of the European Chapter of the Association for [83] Y. Yan, J. Kawahara, and G. Hamarneh, “Melanoma recognition via
Computational Linguistics, 2023, pp. 3401–3411. visual attention,” in Information Processing in Medical Imaging: 26th
[60] H. Guo et al., “Medical visual question answering via targeted choice International Conference, IPMI 2019, Hong Kong, China, June 2–7,
contrast and multimodal entity matching,” in International Conference 2019, Proceedings 26. Springer, 2019, pp. 793–804.
on Neural Information Processing. Springer, 2022, pp. 343–354. [84] D. Gutman et al., “Skin lesion analysis toward melanoma detection: A
challenge at the international symposium on biomedical imaging (isbi)
[61] J. Huang et al., “Medical knowledge-based network for patient-oriented
2016, hosted by the international skin imaging collaboration (isic),”
visual question answering,” Information Processing & Management,
arXiv preprint arXiv:1605.01397, 2016.
vol. 60, no. 2, p. 103241, 2023.
[85] N. C. Codella et al., “Skin lesion analysis toward melanoma detection:
[62] X. Hu et al., “Expert knowledge-aware image difference graph rep-
A challenge at the 2017 international symposium on biomedical
resentation learning for difference-aware medical visual question an-
imaging (isbi), hosted by the international skin imaging collaboration
swering,” in Proceedings of the 29th ACM SIGKDD Conference on
(isic),” in 2018 IEEE 15th international symposium on biomedical
Knowledge Discovery and Data Mining, 2023, pp. 4156–4165.
imaging (ISBI 2018). IEEE, 2018, pp. 168–172.
[63] X. Hu et al., “Interpretable medical image visual question answering [86] N. Codella et al., “Skin lesion analysis toward melanoma detection
via multi-modal relationship graph learning,” Medical Image Analysis, 2018: A challenge hosted by the international skin imaging collabora-
vol. 97, p. 103279, 2024. tion (isic),” arXiv preprint arXiv:1902.03368, 2019.
[64] J. Li et al., “Align before fuse: Vision and language representation [87] C. Yin et al., “Focusing on clinically interpretable features: selective
learning with momentum distillation,” Advances in neural information attention regularization for liver biopsy image classification,” in Med-
processing systems, vol. 34, pp. 9694–9705, 2021. ical Image Computing and Computer Assisted Intervention–MICCAI
[65] S. Liu et al., “A hybrid method of recurrent neural network and graph 2021: 24th International Conference, Strasbourg, France, September
neural network for next-period prescription prediction,” International 27–October 1, 2021, Proceedings, Part V 24. Springer, 2021, pp.
Journal of Machine Learning and Cybernetics, vol. 11, pp. 2849–2856, 153–162.
2020. [88] F. Heinemann, G. Birk, and B. Stierstorfer, “Deep learning enables
[66] S. Liu et al., “Multimodal data matters: language model pre-training pathologist-like scoring of nash models,” Scientific reports, vol. 9,
over structured and unstructured electronic health records,” IEEE no. 1, p. 18454, 2019.
Journal of Biomedical and Health Informatics, vol. 27, no. 1, pp. 504– [89] G. Shih et al., “Augmenting the national institutes of health chest
514, 2022. radiograph dataset with expert annotations of possible pneumonia,”
[67] A. Radford et al., “Learning transferable visual models from natural Radiology: Artificial Intelligence, vol. 1, no. 1, p. e180041, 2019.
language supervision,” in International conference on machine learn- [90] D. S. Kermany et al., “Identifying medical diagnoses and treatable
ing. PMLR, 2021, pp. 8748–8763. diseases by image-based deep learning,” cell, vol. 172, no. 5, pp. 1122–
[68] Z. Niu, G. Zhong, and H. Yu, “A review on the attention mechanism 1131, 2018.
of deep learning,” Neurocomputing, vol. 452, pp. 48–62, 2021. [91] P. Lakhani et al., “The 2021 siim-fisabio-rsna machine learning covid-
[69] M. Bhattacharya, S. Jain, and P. Prasanna, “Radiotransformer: a 19 challenge: Annotation and standard exam classification of covid-19
cascaded global-focal transformer for visual attention–guided disease chest radiographs,” Journal of Digital Imaging, vol. 36, no. 1, pp. 365–
classification,” in European Conference on Computer Vision. Springer, 372, 2023.
2022, pp. 679–698. [92] M. E. Chowdhury et al., “Can ai help in screening viral and covid-19
[70] S. Jetley et al., “Learn to pay attention,” arXiv preprint pneumonia?” Ieee Access, vol. 8, pp. 132 665–132 676, 2020.
arXiv:1804.02391, 2018. [93] T. Rahman et al., “Exploring the effect of image enhancement tech-
[71] H. Fukui et al., “Attention branch network: Learning of attention niques on covid-19 detection using chest x-ray images,” Computers in
mechanism for visual explanation,” in Proceedings of the IEEE/CVF biology and medicine, vol. 132, p. 104319, 2021.
conference on computer vision and pattern recognition, 2019, pp. [94] J. Saltz et al., “Stony brook university covid-19 positive cases,” the
10 705–10 714. cancer imaging archive, vol. 4, 2021.
[72] L. Li et al., “Scouter: Slot attention-based classifier for explainable [95] E. Tsai et al., “Data from medical imaging data resource center (midrc)-
image recognition,” in Proceedings of the IEEE/CVF international rsna international covid radiology database (ricord) release 1c-chest x-
conference on computer vision, 2021, pp. 1046–1055. ray, covid+(midrc-ricord-1c),” The Cancer Imaging Archive, vol. 10,
[73] J. Schlemper et al., “Attention gated networks: Learning to leverage 2021.
salient regions in medical images,” Medical image analysis, vol. 53, [96] E. B. Tsai et al., “The rsna international covid-19 open radiology
pp. 197–207, 2019. database (ricord),” Radiology, vol. 299, no. 1, pp. E204–E213, 2021.
22 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

[97] X. Wang et al., “Hospital-scale chest x-ray database and benchmarks [122] A. Sun et al., “Explain any concept: Segment anything meets concept-
on weakly-supervised classification and localization of common thorax based explanation,” Advances in Neural Information Processing Sys-
diseases,” in IEEE CVPR, vol. 7. sn, 2017, p. 46. tems, vol. 36, 2024.
[98] H. Q. Nguyen et al., “Vindr-cxr: An open dataset of chest x-rays with [123] A. Kirillov et al., “Segment anything,” in Proceedings of the IEEE/CVF
radiologist’s annotations,” Scientific Data, vol. 9, no. 1, p. 429, 2022. International Conference on Computer Vision, 2023, pp. 4015–4026.
[99] B. T. Wyman et al., “Standardization of analysis sets for reporting [124] M. Nevitt, D. Felson, and G. Lester, “The osteoarthritis initiative,”
results from adni mri data,” Alzheimer’s & Dementia, vol. 9, no. 3, pp. Protocol for the cohort study, vol. 1, p. 2, 2006.
332–337, 2013. [125] K. Chauhan et al., “Interactive concept bottleneck models,” in Proceed-
[100] H. R. Roth et al., “Hierarchical 3d fully convolutional networks for ings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 5,
multi-organ segmentation,” arXiv preprint arXiv:1704.06382, 2017. 2023, pp. 5948–5955.
[101] H. R. Roth et al., “Data from pancreas-ct. the cancer imaging archive,” [126] C. Patrı́cio, J. C. Neves, and L. F. Teixeira, “Coherent concept-
IEEE Transactions on Image Processing, vol. 10, p. K9, 2016. based explanations in medical image and its application to skin lesion
[102] J. Sun et al., “Saunet: Shape attentive u-net for interpretable medical diagnosis,” in Proceedings of the IEEE/CVF Conference on Computer
image segmentation,” in Medical Image Computing and Computer Vision and Pattern Recognition, 2023, pp. 3799–3808.
Assisted Intervention–MICCAI 2020: 23rd International Conference, [127] J. Kawahara et al., “Seven-point checklist and skin lesion classification
Lima, Peru, October 4–8, 2020, Proceedings, Part IV 23. Springer, using multitask multimodal neural nets,” IEEE journal of biomedical
2020, pp. 797–806. and health informatics, vol. 23, no. 2, pp. 538–546, 2018.
[103] P. Radau et al., “Evaluation framework for algorithms segmenting short
[128] T. Mendonça et al., “Ph2: A public database for the analysis of
axis cardiac mri.” The MIDAS Journal, 2009.
dermoscopic images,” Dermoscopy image analysis, vol. 2, 2015.
[104] O. Bernard et al., “Deep learning techniques for automatic mri cardiac
multi-structures segmentation and diagnosis: is the problem solved?” [129] S. Yan et al., “Towards trustable skin cancer diagnosis via rewriting
IEEE transactions on medical imaging, vol. 37, no. 11, pp. 2514–2525, model’s decision,” in Proceedings of the IEEE/CVF Conference on
2018. Computer Vision and Pattern Recognition, 2023, pp. 11 568–11 577.
[105] M. Karri, C. S. R. Annavarapu, and U. R. Acharya, “Explainable [130] Y. Bie, L. Luo, and H. Chen, “Mica: Towards explainable skin lesion
multi-module semantic guided attention based network for medical diagnosis via multi-level image-concept alignment,” in Proceedings of
image segmentation,” Computers in Biology and Medicine, vol. 151, the AAAI Conference on Artificial Intelligence, vol. 38, no. 2, 2024,
p. 106231, 2022. pp. 837–845.
[106] P. Tschandl, C. Rosendahl, and H. Kittler, “The ham10000 dataset, [131] R. Daneshjou et al., “Skincon: A skin disease dataset densely annotated
a large collection of multi-source dermatoscopic images of common by domain experts for fine-grained debugging and analysis,” Advances
pigmented skin lesions. scientific data. 2018; 5: 180161,” Search in, in Neural Information Processing Systems, vol. 35, pp. 18 157–18 167,
vol. 2, 2018. 2022.
[107] C. for Biomedical Image Computing and Analytics, “Multimodal [132] C. Kim et al., “Transparent medical image ai via an image–text
brain tumor segmentation challenge 2020: Data,” MICCAI 2020 foundation model grounded in medical literature,” Nature Medicine,
BraTs, 2020. [Online]. Available: https://www.med.upenn.edu/cbica/ pp. 1–12, 2024.
brats2020/data.html [133] M. Groh et al., “Evaluating deep neural networks trained on clinical
[108] A. E. Kavur et al., “Chaos challenge - combined (ct-mr) healthy images in dermatology with the fitzpatrick 17k dataset,” in Proceed-
abdominal organ segmentation,” Medical Image Analysis, vol. 69, ings of the IEEE/CVF Conference on Computer Vision and Pattern
p. 101950, 2021. [Online]. Available: https://www.sciencedirect.com/ Recognition, 2021, pp. 1820–1828.
science/article/pii/S1361841520303145 [134] R. Daneshjou et al., “Disparities in dermatology ai performance on a
[109] H. Li et al., “Pmjaf-net: Pyramidal multi-scale joint attention and diverse, curated clinical image set,” Science advances, vol. 8, no. 31,
adaptive fusion network for explainable skin lesion segmentation,” p. eabq6147, 2022.
Computers in Biology and Medicine, p. 107454, 2023. [135] A. Lucieri et al., “Exaid: A multimodal explanation framework for
[110] R. Souza et al., “An open, multi-vendor, multi-field-strength brain computer-aided diagnosis of skin lesions,” Computer Methods and
mr dataset and analysis of publicly available skull stripping methods Programs in Biomedicine, vol. 215, p. 106620, 2022.
agreement,” NeuroImage, vol. 170, pp. 482–494, 2018. [136] R. Jalaboi et al., “Dermx: An end-to-end framework for explainable
[111] C. Lian et al., “End-to-end dementia status prediction from brain automated dermatological diagnosis,” Medical Image Analysis, vol. 83,
mri using multi-task weakly-supervised attention network,” in Medical p. 102647, 2023.
Image Computing and Computer Assisted Intervention–MICCAI 2019: [137] N. Z. D. Society, “Dermatology images.” [Online]. Available:
22nd International Conference, Shenzhen, China, October 13–17, 2019, https://dermnetnz.org/
Proceedings, Part IV 22. Springer, 2019, pp. 158–167. [138] X. Sun et al., “A benchmark for automatic visual classification of
[112] C. R. Jack Jr et al., “Update on the magnetic resonance imaging clinical skin disease images,” in Computer Vision–ECCV 2016: 14th
core of the alzheimer’s disease neuroimaging initiative,” Alzheimer’s European Conference, Amsterdam, The Netherlands, October 11-14,
& Dementia, vol. 6, no. 3, pp. 212–220, 2010. 2016, Proceedings, Part VI 14. Springer, 2016, pp. 206–222.
[113] P. W. Koh et al., “Concept bottleneck models,” in International
[139] J. Hou, J. Xu, and H. Chen, “Concept-attention whitening for inter-
conference on machine learning. PMLR, 2020, pp. 5338–5348.
pretable skin lesion diagnosis,” arXiv preprint arXiv:2404.05997, 2024.
[114] M. Yuksekgonul, M. Wang, and J. Zou, “Post-hoc concept bottleneck
models,” arXiv preprint arXiv:2205.15480, 2022. [140] I. Kim et al., “Concept bottleneck with visual concept filtering for
[115] R. Jain et al., “Extending logic explained networks to text classifica- explainable medical image classification,” in International Conference
tion,” arXiv preprint arXiv:2211.09732, 2022. on Medical Image Computing and Computer-Assisted Intervention.
[116] A. Tan, F. Zhou, and H. Chen, “Explain via any concept: Concept Springer, 2023, pp. 225–233.
bottleneck model with open vocabulary concepts,” arXiv preprint [141] C. Patrı́cio, L. F. Teixeira, and J. C. Neves, “Towards concept-based
arXiv:2408.02265, 2024. interpretability of skin lesion diagnosis using vision-language models,”
[117] B. Kim et al., “Interpretability beyond feature attribution: Quantitative in 2024 IEEE International Symposium on Biomedical Imaging (ISBI).
testing with concept activation vectors (tcav),” in International confer- IEEE, 2024, pp. 1–5.
ence on machine learning. PMLR, 2018, pp. 2668–2677. [142] W. Pang et al., “Integrating clinical knowledge into concept bottleneck
[118] J. R. Clough et al., “Global and local interpretability for cardiac models,” in International Conference on Medical Image Computing
mri classification,” in International Conference on Medical Image and Computer-Assisted Intervention (MICCAI), 2024.
Computing and Computer-Assisted Intervention. Springer, 2019, pp. [143] S. Tsutsui, W. Pang, and B. Wen, “Wbcatt: a white blood cell dataset
656–664. annotated with detailed morphological attributes,” Advances in Neural
[119] M. Graziani et al., “Concept attribution: Explaining cnn decisions to Information Processing Systems, vol. 36, 2024.
physicians,” Computers in biology and medicine, vol. 123, p. 103865, [144] R. Marcinkevičs et al., “Interpretable and intervenable ultrasonography-
2020. based machine learning models for pediatric appendicitis,” Medical
[120] R. Achtibat et al., “From attribution maps to human-understandable Image Analysis, vol. 91, p. 103042, 2024.
explanations through concept relevance propagation,” Nature Machine [145] G. Zhao et al., “Diagnose like a radiologist: Hybrid neuro-probabilistic
Intelligence, vol. 5, no. 9, pp. 1006–1019, 2023. reasoning for attribute-based medical image diagnosis,” IEEE Trans-
[121] E. Poeta et al., “Concept-based explainable artificial intelligence: A actions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11,
survey,” arXiv preprint arXiv:2312.12936, 2023. pp. 7400–7416, 2021.
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 23

[146] S. G. Armato III et al., “The lung image database consortium (lidc) [172] I. Biederman, “Recognition-by-components: a theory of human image
and image database resource initiative (idri): a completed reference understanding.” Psychological review, vol. 94, no. 2, p. 115, 1987.
database of lung nodules on ct scans,” Medical physics, vol. 38, no. 2, [173] G. Singh and K.-C. Yow, “An interpretable deep learning model for
pp. 915–931, 2011. covid-19 detection with chest x-ray images,” Ieee Access, vol. 9, pp.
[147] Z. Fang et al., “Concept-based explanation for fine-grained images and 85 198–85 208, 2021.
its application in infectious keratitis classification,” in Proceedings of [174] J. P. Cohen, P. Morrison, and L. Dao, “Covid-19 image data collection,”
the 28th ACM international conference on Multimedia, 2020, pp. 700– arXiv preprint arXiv:2003.11597, 2020.
708. [175] S. Mohammadjafari et al., “Using protopnet for interpretable
[148] Y. Xu et al., “Deep sequential feature learning in clinical image alzheimer’s disease classification.” in Canadian Conference on AI,
classification of infectious keratitis,” Engineering, vol. 7, no. 7, pp. 2021.
1002–1010, 2021. [176] D. S. Marcus et al., “Open access series of imaging studies (oasis):
[149] C. Wen et al., “Concept-based lesion aware transformer for inter- cross-sectional mri data in young, middle aged, nondemented, and
pretable retinal disease diagnosis,” IEEE Transactions on Medical demented older adults,” Journal of cognitive neuroscience, vol. 19,
Imaging, 2024. no. 9, pp. 1498–1507, 2007.
[150] Y. Zhou et al., “A benchmark for studying diabetic retinopathy: seg- [177] A. J. Barnett et al., “A case-based interpretable deep learning model
mentation, grading, and transferability,” IEEE Transactions on Medical for classification of mass lesions in digital mammography,” Nature
Imaging, vol. 40, no. 3, pp. 818–828, 2020. Machine Intelligence, vol. 3, no. 12, pp. 1061–1070, 2021.
[151] T. Li et al., “Diagnostic assessment of deep learning algorithms for [178] G. Carloni et al., “On the applicability of prototypical part learning
diabetic retinopathy screening,” Information Sciences, vol. 501, pp. in medical images: breast masses classification using protopnet,” in
511–522, 2019. International Conference on Pattern Recognition. Springer, 2022, pp.
[152] M. Kong et al., “Attribute-aware interpretation learning for thyroid 539–557.
ultrasound diagnosis,” Artificial Intelligence in Medicine, vol. 131, p. [179] R. S. Lee et al., “A curated mammography data set for use in computer-
102344, 2022. aided detection and diagnosis research,” Scientific data, vol. 4, no. 1,
[153] J. Liu et al., “A chatgpt aided explainable framework for zero-shot pp. 1–9, 2017.
medical image diagnosis,” arXiv preprint arXiv:2307.01981, 2023. [180] C. Wang et al., “Knowledge distillation to ensemble global and
[154] S. Jaeger et al., “Two public chest x-ray datasets for computer-aided interpretable prototype-based mammogram classification models,” in
screening of pulmonary diseases,” Quantitative imaging in medicine International Conference on Medical Image Computing and Computer-
and surgery, vol. 4, no. 6, p. 475, 2014. Assisted Intervention. Springer, 2022, pp. 14–24.
[155] P. Porwal et al., “Indian diabetic retinopathy image dataset (idrid): [181] C. Cui et al., “The chinese mammography database (cmmd): An online
a database for diabetic retinopathy screening research,” Data, vol. 3, mammography database with biopsy confirmed types for machine
no. 3, p. 25, 2018. diagnosis of breast,” The Cancer Imaging Archive, vol. 1, 2021.
[156] Y. Gao et al., “Aligning human knowledge with visual concepts [182] Y. Wei, R. Tam, and X. Tang, “Mprotonet: A case-based interpretable
towards explainable medical image classification,” arXiv preprint model for brain tumor classification with 3d multi-parametric magnetic
arXiv:2406.05596, 2024. resonance imaging,” in Medical Imaging with Deep Learning. PMLR,
[157] J. N. Kather, N. Halama, and A. Marx, “100,000 histological images 2024, pp. 1798–1812.
of human colorectal cancer and healthy tissue,” Zenodo10, vol. 5281,
[183] B. H. Menze et al., “The multimodal brain tumor image segmentation
no. 9, 2018.
benchmark (brats),” IEEE transactions on medical imaging, vol. 34,
[158] W. Al-Dhabyani et al., “Deep learning approaches for data augmenta-
no. 10, pp. 1993–2024, 2014.
tion and classification of breast masses using ultrasound images,” Int.
[184] L. S. Hesse and A. I. Namburete, “Insightr-net: interpretable neural net-
J. Adv. Comput. Sci. Appl, vol. 10, no. 5, pp. 1–11, 2019.
work for regression using similarity-based comparisons to prototypical
[159] M. Espinosa Zarlenga et al., “Concept embedding models: Beyond
examples,” in International Conference on Medical Image Computing
the accuracy-explainability trade-off,” Advances in Neural Information
and Computer-Assisted Intervention. Springer, 2022, pp. 502–511.
Processing Systems, vol. 35, pp. 21 400–21 413, 2022.
[160] Z. Chen, Y. Bei, and C. Rudin, “Concept whitening for interpretable [185] C. H. Foundation, “Eyepacs,” 2015. [Online]. Available: https:
image recognition,” Nature Machine Intelligence, vol. 2, no. 12, pp. //www.kaggle.com/c/diabetic-retinopathy-detection/data
772–782, 2020. [186] I. B. d. A. Santos and A. C. de Carvalho, “Protoal: Interpretable deep
[161] S. Lapuschkin et al., “Unmasking clever hans predictors and assessing active learning with prototypes for medical imaging,” arXiv preprint
what machines really learn,” Nature communications, vol. 10, no. 1, p. arXiv:2404.04736, 2024.
1096, 2019. [187] E. Decencière et al., “Feedback on a publicly distributed image
[162] A. Ghorbani et al., “Towards automatic concept-based explanations,” database: The messidor database. image anal & stereology 33: 231–
Advances in neural information processing systems, vol. 32, 2019. 234,” 2014.
[163] A. Sarkar et al., “A framework for learning ante-hoc explainable [188] L. S. Hesse, N. K. Dinsdale, and A. I. L. Namburete, “Prototype
models via concepts,” in Proceedings of the IEEE/CVF Conference on learning for explainable brain age prediction,” in Proceedings of the
Computer Vision and Pattern Recognition, 2022, pp. 10 286–10 295. IEEE/CVF Winter Conference on Applications of Computer Vision
[164] C.-K. Yeh et al., “On completeness-aware concept-based explanations (WACV), January 2024, pp. 7903–7913.
in deep neural networks,” Advances in neural information processing [189] https://brain-development.org/ixi-dataset/.
systems, vol. 33, pp. 20 554–20 565, 2020. [190] A. T. Papageorghiou et al., “International standards for fetal growth
[165] Y. Yang et al., “Language in a bottle: Language model guided concept based on serial ultrasound measurements: the fetal growth longitudinal
bottlenecks for interpretable image classification,” in Proceedings of the study of the intergrowth-21st project,” The Lancet, vol. 384, no. 9946,
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 869–879, 2014.
2023, pp. 19 187–19 197. [191] D. Rymarczyk et al., “Protopshare: Prototypical parts sharing for simi-
[166] T. Brown et al., “Language models are few-shot learners,” Advances in larity discovery in interpretable image classification,” in Proceedings of
neural information processing systems, vol. 33, pp. 1877–1901, 2020. the 27th ACM SIGKDD Conference on Knowledge Discovery & Data
[167] A. Radford et al., “Learning transferable visual models from natural Mining, 2021, pp. 1420–1430.
language supervision,” in International conference on machine learn- [192] D. Rymarczyk et al., “Interpretable image classification with differen-
ing. PMLR, 2021, pp. 8748–8763. tiable prototypes assignment,” in European Conference on Computer
[168] T. Oikarinen et al., “Label-free concept bottleneck models,” arXiv Vision. Springer, 2022, pp. 351–368.
preprint arXiv:2304.06129, 2023. [193] J. Donnelly, A. J. Barnett, and C. Chen, “Deformable protopnet:
[169] Y. Bie et al., “Xcoop: Explainable prompt learning for computer-aided An interpretable image classifier using deformable prototypes,” in
diagnosis via concept-guided context optimization,” arXiv preprint Proceedings of the IEEE/CVF conference on computer vision and
arXiv:2403.09410, 2024. pattern recognition, 2022, pp. 10 265–10 275.
[170] E. Kim et al., “Xprotonet: diagnosis in chest radiography with global [194] J. Wang et al., “Interpretable image recognition by constructing trans-
and local explanations,” in Proceedings of the IEEE/CVF conference parent embedding space,” in Proceedings of the IEEE/CVF interna-
on computer vision and pattern recognition, 2021, pp. 15 719–15 728. tional conference on computer vision, 2021, pp. 895–904.
[171] C. Chen et al., “This looks like that: deep learning for interpretable im- [195] B. Wang et al., “Learning bottleneck concepts in image classification,”
age recognition,” Advances in neural information processing systems, in Proceedings of the ieee/cvf conference on computer vision and
vol. 32, 2019. pattern recognition, 2023, pp. 10 962–10 971.
24 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

[196] P. Hase et al., “Interpretable image recognition with hierarchical proto- [216] J. T. Wu et al., “Chest imagenome dataset for clinical reasoning,”
types,” in Proceedings of the AAAI Conference on Human Computation in Thirty-fifth Conference on Neural Information Processing Systems
and Crowdsourcing, vol. 7, 2019, pp. 32–40. Datasets and Benchmarks Track (Round 2).
[197] Y. Ukai et al., “This looks like it rather than that: Protoknn for [217] Q. Li et al., “Anatomical structure-guided medical vision-language pre-
similarity-based classifiers,” in The Eleventh International Conference training,” arXiv preprint arXiv:2403.09294, 2024.
on Learning Representations, 2022. [218] T. Tanida et al., “Interactive and explainable region-guided radiology
[198] A. Bontempelli et al., “Concept-level debugging of part-prototype report generation,” in Proceedings of the IEEE/CVF Conference on
networks,” in The Eleventh International Conference on Learning Computer Vision and Pattern Recognition, 2023, pp. 7433–7442.
Representations, 2023. [219] L. Wang et al., “An inclusive task-aware framework for radiology report
[199] O. Li et al., “Deep learning for case-based reasoning through proto- generation,” in International Conference on Medical Image Computing
types: A neural network that explains its predictions,” in Proceedings and Computer-Assisted Intervention. Springer, 2022, pp. 568–577.
of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018. [220] D. Demner-Fushman et al., “Design and development of a multimodal
[200] Q. Huang et al., “Evaluation and improvement of interpretability biomedical information retrieval system,” Journal of Computing Sci-
for self-explainable part-prototype networks,” in Proceedings of the ence and Engineering, vol. 6, no. 2, pp. 168–177, 2012.
IEEE/CVF International Conference on Computer Vision, 2023, pp. [221] S. Singh et al., “From chest x-rays to radiology reports: a multimodal
2011–2020. machine learning approach,” in 2019 Digital Image Computing: Tech-
[201] M. Nauta et al., “Pip-net: Patch-based intuitive prototypes for in- niques and Applications (DICTA). IEEE, 2019, pp. 1–8.
terpretable image classification,” in Proceedings of the IEEE/CVF [222] G. Spinks and M.-F. Moens, “Justifying diagnosis decisions by deep
Conference on Computer Vision and Pattern Recognition, 2023, pp. neural networks,” Journal of biomedical informatics, vol. 96, p.
2744–2753. 103248, 2019.
[202] C. Ma et al., “This looks like those: Illuminating prototypical con- [223] Y. Kim et al., “Adversarially regularized autoencoders for generating
cepts using multiple visualizations,” Advances in Neural Information discrete structures,” arXiv preprint arXiv:1706.04223, vol. 2, p. 12,
Processing Systems, vol. 36, 2024. 2017.
[203] M. Nauta, R. Van Bree, and C. Seifert, “Neural prototype trees for [224] G. Liu et al., “Clinically accurate chest x-ray report generation,” in
interpretable fine-grained image recognition,” in Proceedings of the Machine Learning for Healthcare Conference. PMLR, 2019, pp. 249–
IEEE/CVF conference on computer vision and pattern recognition, 269.
2021, pp. 14 933–14 943. [225] Z. Chen et al., “Generating radiology reports via memory-driven
[204] A. Tan, Z. Fengtao, and H. Chen, “Post-hoc part-prototype networks,” transformer,” in Proceedings of the 2020 Conference on Empirical
in Forty-first International Conference on Machine Learning. Methods in Natural Language Processing (EMNLP), 2020, pp. 1439–
[205] J. Kim, M. Kim, and Y. M. Ro, “Interpretation of lesional detection 1449.
via counterfactual generation,” in 2021 IEEE International Conference [226] Z. Wang et al., “Metransformer: Radiology report generation by trans-
on Image Processing (ICIP). IEEE, 2021, pp. 96–100. former with multiple learnable expert tokens,” in Proceedings of the
[206] P. Pino et al., “Clinically correct report generation from chest x- IEEE/CVF Conference on Computer Vision and Pattern Recognition,
rays using templates,” in Machine Learning in Medical Imaging: 2023, pp. 11 558–11 567.
12th International Workshop, MLMI 2021, Held in Conjunction with [227] J. Yuan et al., “Automatic radiology report generation based on multi-
MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings view image fusion and medical concept enrichment,” in Medical Image
12. Springer, 2021, pp. 654–663. Computing and Computer Assisted Intervention–MICCAI 2019: 22nd
[207] C. E. Lipscomb, “Medical subject headings (mesh),” Bulletin of the International Conference, Shenzhen, China, October 13–17, 2019,
Medical Library Association, vol. 88, no. 3, p. 265, 2000. Proceedings, Part VI 22. Springer, 2019, pp. 721–729.
[208] H.-C. Shin et al., “Learning to read chest x-rays: Recurrent neural [228] H. Lee, S. Kim, and Y. Ro, “Generation of multimodal justification
cascade model for automated image annotation,” in Proceedings of the using visual word constraint model for explainable computer-aided
IEEE conference on computer vision and pattern recognition, 2016, diagnosis,” in Interpretability of Machine Intelligence in Medical Image
pp. 2497–2506. Computing and Multimodal Learning for Clinical Decision Support.
[209] A. Gasimova, “Automated enriched medical concept generation for Springer, 2019.
chest x-ray images,” in Interpretability of Machine Intelligence in [229] Z. Zhang et al., “Pathologist-level interpretable whole-slide cancer
Medical Image Computing and Multimodal Learning for Clinical diagnosis with deep learning,” Nature Machine Intelligence, vol. 1,
Decision Support: Second International Workshop, iMIMIC 2019, and no. 5, pp. 236–245, 2019.
9th International Workshop, ML-CDS 2019, Held in Conjunction with [230] N. C. Institute, “The cancer genome atlas program,” 2006. [Online].
MICCAI 2019, Shenzhen, China, October 17, 2019, Proceedings 9. Available: https://www.cancer.gov/tcga
Springer, 2019, pp. 83–92. [231] X. Wang et al., “Tienet: Text-image embedding network for com-
[210] I. Rodin et al., “Multitask and multimodal neural network model for mon thorax disease classification and reporting in chest x-rays,” in
interpretable analysis of x-ray images,” in 2019 IEEE International Proceedings of the IEEE conference on computer vision and pattern
Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2019, recognition, 2018, pp. 9049–9058.
pp. 1601–1604. [232] B. Jing, P. Xie, and E. Xing, “On the automatic generation of medical
[211] Z. Zhang et al., “Mdnet: A semantically and visually interpretable med- imaging reports,” in Proceedings of the 56th Annual Meeting of the
ical image diagnosis network,” in Proceedings of the IEEE conference Association for Computational Linguistics (Volume 1: Long Papers),
on computer vision and pattern recognition, 2017, pp. 6428–6436. 2018, pp. 2577–2586.
[212] Z. Zhang et al., “Tandemnet: Distilling knowledge from medical [233] K. N. Jones et al., “Peir digital library: Online resources and authoring
images using diagnostic reports as optional semantic references,” in system,” in Proceedings of the AMIA Symposium. American Medical
Medical Image Computing and Computer Assisted Intervention- MIC- Informatics Association, 2001, p. 1075.
CAI 2017: 20th International Conference, Quebec City, QC, Canada, [234] X. Zeng et al., “Generating diagnostic report for medical image by
September 11-13, 2017, Proceedings, Part III 20. Springer, 2017, pp. high-middle-level visual information incorporation on double deep
320–328. learning models,” Computer methods and programs in biomedicine,
[213] K. Ma et al., “A pathology image diagnosis network with visual vol. 197, p. 105700, 2020.
interpretability and structured diagnostic report,” in Neural Information [235] J. Tian et al., “A diagnostic report generator from ct volumes on liver
Processing: 25th International Conference, ICONIP 2018, Siem Reap, tumor with semi-supervised attention mechanism,” in Medical Image
Cambodia, December 13–16, 2018, Proceedings, Part VI 25. Springer, Computing and Computer Assisted Intervention–MICCAI 2018: 21st
2018, pp. 282–293. International Conference, Granada, Spain, September 16-20, 2018,
[214] X. Wang et al., “A computational framework towards medical image Proceedings, Part II 11. Springer, 2018, pp. 702–710.
explanation,” in Artificial Intelligence in Medicine: Knowledge Rep- [236] P. Bilic et al., “The liver tumor segmentation benchmark (lits),” Medical
resentation and Transparent and Explainable Systems: AIME 2019 Image Analysis, vol. 84, p. 102680, 2023.
International Workshops, KR4HC/ProHealth and TEAAM, Poznan, [237] O. Thawkar et al., “Xraygpt: Chest radiographs summarization using
Poland, June 26–29, 2019, Revised Selected Papers. Springer, 2019, medical vision-language models,” arXiv preprint arXiv:2306.07971,
pp. 120–131. 2023.
[215] W. Gale et al., “Producing radiologist-quality reports for interpretable [238] J. Zhou et al., “Pre-trained multimodal large language model enhances
deep learning,” in 2019 IEEE 16th international symposium on biomed- dermatological diagnosis using skingpt-4,” Nature Communications,
ical imaging (ISBI 2019). IEEE, 2019, pp. 1275–1279. vol. 15, no. 1, p. 5649, 2024.
AUTHOR et al.: PREPARATION OF PAPERS FOR IEEE TRANSACTIONS AND JOURNALS (FEBRUARY 2017) 25

[239] kaggle, “Dermnet.” [Online]. Available: https://www.kaggle.com/ and machine explanations,” KI-Künstliche Intelligenz, vol. 34, no. 2,
datasets/shubhamgoel27/dermnet pp. 193–198, 2020.
[240] M. Moor et al., “Med-flamingo: a multimodal medical few-shot [266] R. Sayres et al., “Using a deep learning algorithm and integrated
learner,” in Machine Learning for Health (ML4H). PMLR, 2023, gradients explanation to assist grading for diabetic retinopathy,” Oph-
pp. 353–367. thalmology, vol. 126, no. 4, pp. 552–564, 2019.
[241] W. Lin et al., “Pmc-clip: Contrastive language-image pre-training using [267] V. Petsiuk, A. Das, and K. Saenko, “Rise: Randomized input
biomedical documents,” in International Conference on Medical Image sampling for explanation of black-box models,” arXiv preprint
Computing and Computer-Assisted Intervention. Springer, 2023, pp. arXiv:1806.07421, 2018.
525–536. [268] S. Hooker et al., “A benchmark for interpretability methods in deep
[242] C. Li et al., “Llava-med: Training a large language-and-vision assistant neural networks,” Advances in neural information processing systems,
for biomedicine in one day,” Advances in Neural Information Process- vol. 32, 2019.
ing Systems, vol. 36, 2024. [269] J. Hou et al., “Diabetic retinopathy grading with weakly-supervised
[243] J. J. Lau et al., “A dataset of clinically generated visual questions and lesion priors,” in ICASSP 2023-2023 IEEE International Conference
answers about radiology images,” Scientific data, vol. 5, no. 1, pp. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023,
1–10, 2018. pp. 1–5.
[244] S. He et al., “Meddr: Diagnosis-guided bootstrapping for large-scale [270] A. J. Barnett et al., “Interpretable mammographic image classifica-
medical vision-language learning,” arXiv preprint arXiv:2404.15127, tion using case-based reasoning and deep learning,” arXiv preprint
2024. arXiv:2107.05605, 2021.
[245] J. Chen et al., “Huatuogpt-vision, towards injecting medical vi- [271] S. M. Muddamsetty, M. N. Jahromi, and T. B. Moeslund, “Expert level
sual knowledge into multimodal llms at scale,” arXiv preprint evaluations for explainable ai (xai) methods in the medical domain,”
arXiv:2406.19280, 2024. in International Conference on Pattern Recognition. Springer, 2021,
[246] X. Zhang et al., “Pmc-vqa: Visual instruction tuning for medical visual pp. 35–46.
question answering,” arXiv preprint arXiv:2305.10415, 2023. [272] A.-p. Nguyen and M. R. Martı́nez, “On quantitative aspects of model
[247] S. Kang et al., “Wolf: Large language model framework for cxr interpretability,” arXiv preprint arXiv:2007.07584, 2020.
understanding,” arXiv preprint arXiv:2403.15456, 2024. [273] K. Papineni et al., “Bleu: a method for automatic evaluation of
[248] S. Bae et al., “Ehrxqa: A multi-modal question answering dataset for machine translation,” in Proceedings of the 40th annual meeting of
electronic health records with chest x-ray images,” Advances in Neural the Association for Computational Linguistics, 2002, pp. 311–318.
Information Processing Systems, vol. 36, 2024. [274] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,”
[249] Z. Chen et al., “Chexagent: Towards a foundation model for chest x-ray in Text summarization branches out, 2004, pp. 74–81.
interpretation,” arXiv preprint arXiv:2401.12208, 2024. [275] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-
[250] K. Schutte et al., “Using stylegan for visual interpretability of deep based image description evaluation,” in Proceedings of the IEEE
learning models on medical images,” arXiv preprint arXiv:2101.07563, conference on computer vision and pattern recognition, 2015, pp.
2021. 4566–4575.
[276] S. Singla et al., “Explaining the black-box smoothly—a counterfactual
[251] I. Goodfellow et al., “Generative adversarial nets,” Advances in neural
approach,” Medical Image Analysis, vol. 84, p. 102721, 2023.
information processing systems, vol. 27, 2014.
[277] S. Tonekaboni et al., “What clinicians want: contextualizing explain-
[252] T. Laugel et al., “Issues with post-hoc counterfactual explanations: a
able machine learning for clinical end use,” in Machine learning for
discussion,” arXiv preprint arXiv:1906.04774, 2019.
healthcare conference. PMLR, 2019, pp. 359–380.
[253] H. Guo, T. H. Nguyen, and A. Yadav, “Counternet: End-to-end training
[278] K. Singhal et al., “Towards expert-level medical question answering
of prediction aware counterfactual explanations,” in Proceedings of the
with large language models,” arXiv preprint arXiv:2305.09617, 2023.
29th ACM SIGKDD Conference on Knowledge Discovery and Data
[279] C. Wu et al., “Pmc-llama: toward building open-source language
Mining, 2023, pp. 577–589.
models for medicine,” Journal of the American Medical Informatics
[254] V. Guyomard et al., “Vcnet: A self-explaining model for realistic Association, p. ocae045, 2024.
counterfactual generation,” in Joint European Conference on Machine [280] Y. Gu et al., “Domain-specific language model pretraining for biomed-
Learning and Knowledge Discovery in Databases. Springer, 2022, ical natural language processing,” ACM Transactions on Computing for
pp. 437–453. Healthcare (HEALTH), vol. 3, no. 1, pp. 1–23, 2021.
[255] M. Wilms et al., “Towards self-explainable classifiers and regressors [281] M. Moor et al., “Foundation models for generalist medical artificial
in neuroimaging with normalizing flows,” in International Workshop intelligence,” Nature, vol. 616, no. 7956, pp. 259–265, 2023.
on Machine Learning in Clinical Neuroimaging, 2021, pp. 23–33. [282] X. Ye and G. Durrett, “Can explanations be useful for calibrating
[256] U. Johansson, R. König, and L. Niklasson, “The truth is in there- black box models?” in Proceedings of the 60th Annual Meeting of the
rule extraction from opaque models using genetic programming.” in Association for Computational Linguistics (Volume 1: Long Papers),
FLAIRS, 2004, pp. 658–663. 2022, pp. 6199–6212.
[257] H. Lakkaraju et al., “Faithful and customizable explanations of black [283] X. Wu et al., “From language modeling to instruction following:
box models,” in Proceedings of the 2019 AAAI/ACM Conference on Understanding the behavior shift in llms after instruction tuning,” in
AI, Ethics, and Society, 2019, pp. 131–138. Proceedings of the 2024 Conference of the North American Chapter
[258] W. Jin et al., “Guidelines and evaluation of clinical explainable ai in of the Association for Computational Linguistics: Human Language
medical image analysis,” Medical Image Analysis, vol. 84, p. 102684, Technologies (Volume 1: Long Papers), 2024, pp. 2341–2369.
2023. [284] J. Wei et al., “Chain-of-thought prompting elicits reasoning in large
[259] E. Lughofer et al., “Explaining classifier decisions linguistically for language models,” Advances in neural information processing systems,
stimulating and improving operators labeling behavior,” Information vol. 35, pp. 24 824–24 837, 2022.
Sciences, vol. 420, pp. 16–36, 2017. [285] Y. Shi et al., “Mededit: Model editing for medical question answering
[260] M. Robnik-Šikonja and M. Bohanec, “Perturbation-based explanations with external knowledge bases,” arXiv preprint arXiv:2309.16035,
of prediction models,” Human and Machine Learning: Visible, Explain- 2023.
able, Trustworthy and Transparent, pp. 159–175, 2018. [286] K. Singhal et al., “Large language models encode clinical knowledge,”
[261] A. Adadi and M. Berrada, “Explainable ai for healthcare: from black Nature, vol. 620, no. 7972, pp. 172–180, 2023.
box to interpretable models,” in Embedded systems and artificial [287] C. Zhao et al., “Automated natural language explanation of deep visual
intelligence: proceedings of ESAI 2019, Fez, Morocco. Springer, 2020, neurons with large models,” arXiv preprint arXiv:2310.10708, 2023.
pp. 327–337. [288] Y. Yang et al., “Language in a bottle: Language model guided concept
[262] X. Wu et al., “Usable xai: 10 strategies towards exploiting explainabil- bottlenecks for interpretable image classification,” in Proceedings of the
ity in the llm era,” arXiv preprint arXiv:2403.08946, 2024. IEEE/CVF Conference on Computer Vision and Pattern Recognition,
[263] F. Doshi-Velez and B. Kim, “Towards a rigorous science of inter- 2023, pp. 19 187–19 197.
pretable machine learning,” arXiv preprint arXiv:1702.08608, 2017. [289] C. Singh et al., “Augmenting interpretable models with large language
[264] J. Zhou et al., “Evaluating the quality of machine learning explanations: models during training,” Nature Communications, vol. 14, no. 1, p.
A survey on methods and metrics,” Electronics, vol. 10, no. 5, p. 593, 7913, 2023.
2021. [290] S. Bills et al., “Language models can explain neurons in language
[265] A. Holzinger, A. Carrington, and H. Müller, “Measuring the quality models,” URL https://openaipublic. blob. core. windows. net/neuron-
of explanations: the system causability scale (scs) comparing human explainer/paper/index. html.(Date accessed: 14.05. 2023), vol. 2, 2023.
26 GENERIC COLORIZED JOURNAL, VOL. XX, NO. XX, XXXX 2017

[291] D. Gunning and D. Aha, “Darpa’s explainable artificial intelligence

(xai) program,” AI magazine, vol. 40, no. 2, pp. 44–58, 2019.
[292] D. Minh et al., “Explainable artificial intelligence: a comprehensive
review,” Artificial Intelligence Review, pp. 1–66, 2022.
[293] C. Rudin et al., “Interpretable machine learning: Fundamental prin-
ciples and 10 grand challenges,” Statistic Surveys, vol. 16, pp. 1–85,
2022.
[294] Y. Xu and H. Chen, “Multimodal optimal transport-based co-attention
transformer with global structure consistency for survival prediction,”
in Proceedings of the IEEE/CVF International Conference on Com-
puter Vision, 2023, pp. 21 241–21 251.
[295] J. Pearl, Causality. Cambridge university press, 2009.
[296] A. J. DeGrave, J. D. Janizek, and S.-I. Lee, “Ai for radiographic
covid-19 detection selects shortcuts over signal,” Nature Machine
Intelligence, vol. 3, no. 7, pp. 610–619, 2021.
[297] D. C. Castro, I. Walker, and B. Glocker, “Causality matters in medical
imaging,” Nature Communications, vol. 11, no. 1, p. 3673, 2020.
[298] L. Luo et al., “Pseudo bias-balanced learning for debiased chest x-ray
classification,” in International conference on medical image computing
and computer-assisted intervention. Springer, 2022, pp. 621–631.
[299] J. Wu et al., “Chest imagenome dataset,” Physio Net, 2021.
[300] H. Cai et al., “Breast microcalcification diagnosis using deep convolu-
tional neural network from digital mammograms,” Computational and
mathematical methods in medicine, vol. 2019, no. 1, p. 2717454, 2019.
[301] G. Argenziano et al., “Epiluminescence microscopy for the diagnosis
of doubtful melanocytic skin lesions: comparison of the abcd rule of
dermatoscopy and a new 7-point checklist based on pattern analysis,”
Archives of dermatology, vol. 134, no. 12, pp. 1563–1570, 1998.