
Communications Medicine | Article
https://doi.org/10.1038/s43856-024-00518-7

A domain knowledge-based interpretable deep learning system for improving clinical breast ultrasound diagnosis

Lin Yan1,12, Zhiying Liang2,12, Hao Zhang3,12, Gaosong Zhang4, Weiwei Zheng5, Chunguang Han6, Dongsheng Yu2, Hanqi Zhang4, Xinxin Xie7, Chang Liu6,8, Wenxin Zhang4, Hui Zheng4, Jing Pei6,8, Dinggang Shen2,9,10,11 & Xuejun Qian2,9,10

Abstract

Background Though deep learning has consistently demonstrated advantages in the automatic interpretation of breast ultrasound images, its black-box nature hinders potential interactions with radiologists, posing obstacles for clinical deployment.
Methods We proposed a domain knowledge-based interpretable deep learning system for improving breast cancer risk prediction via paired multimodal ultrasound images. The deep learning system was developed on 4,320 multimodal breast ultrasound images of 1,440 biopsy-confirmed lesions from 1,348 prospectively enrolled patients across two hospitals between August 2019 and December 2022. The lesions were allocated to a 70% training cohort, a 10% validation cohort, and a 20% test cohort based on case recruitment date.
Results Here, we show that the interpretable deep learning system can predict breast cancer risk as accurately as experienced radiologists, with an area under the receiver operating characteristic curve of 0.902 (95% confidence interval = 0.882–0.921), sensitivity of 75.2%, and specificity of 91.8% on the test cohort. With the aid of the deep learning system, particularly its inherent explainable features, junior radiologists tend to achieve better clinical outcomes, while senior radiologists experience increased confidence levels. Multimodal ultrasound images augmented with domain knowledge-based reasoning cues enable an effective human-machine collaboration at a high level of prediction performance.
Conclusions Such a clinically applicable deep learning system may be incorporated into future breast cancer screening and support assisted or second-read workflows.

Plain Language Summary

Breast cancer is one of the most common cancers, and finding it early can greatly improve patients' chances of survival and recovery. We create a tool based on artificial intelligence (AI)—whereby computer software learns to perform tasks that normally require human thinking—called MUP-Net. MUP-Net can analyze medical images to predict a patient's risk of having breast cancer. To make this AI tool usable in clinical practice, we enabled doctors to see the reasoning behind the AI's predictions by visualizing the key image features it analyzed. We showed that our AI tool not only makes doctors more confident in their diagnosis but also helps them make better decisions, especially for less experienced doctors. With further testing, our AI tool may help clinicians to diagnose breast cancer more accurately and quickly, potentially improving patient outcomes.

Breast cancer is a leading cause of cancer mortality among women and an ongoing threat to global health. It is estimated that in 2022, 287,850 new cases of breast cancer in females would be diagnosed in the United States, continuing with a slow but steady increase of approximately 0.5% in incidence rates per year1. Early detection of breast cancer can potentially improve patient outcomes, prompting widespread clinical recommendation for screening mammography to reduce its morbidity2. However, mammography exhibits low sensitivity in dense breast tissue and is not universally accessible across all countries3.

1School of Mathematics, Xi'an University of Finance and Economics, Xi'an, China. 2School of Biomedical Engineering, ShanghaiTech University, Shanghai, China. 3Department of Neurosurgery, Beijing Friendship Hospital, Capital Medical University, Beijing, China. 4Department of Ultrasound, The First Affiliated Hospital of Anhui Medical University, Hefei, China. 5Department of Ultrasound, Xuancheng People's Hospital, Xuancheng, China. 6Department of General Surgery, The First Affiliated Hospital of Anhui Medical University, Hefei, China. 7Department of Ultrasound, Peking University Third Hospital, Beijing, China. 8Department of Breast Surgery, The First Affiliated Hospital of Anhui Medical University, Hefei, China. 9State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China. 10Shanghai United Imaging Intelligence Co., Ltd., Shanghai, China. 11Shanghai Clinical Research and Trial Center, Shanghai, China. 12These authors contributed equally: Lin Yan, Zhiying Liang, Hao Zhang. e-mail: [email protected]; [email protected]; qianxj@shanghaitech.edu.cn


Ultrasound (US), a low-cost, non-invasive, non-ionizing, and widely available imaging modality, serves as a supplementary modality to mammography in screening settings4,5 and as the primary imaging modality for characterizing breast masses (i.e., solid or cystic)6,7. While the American College of Radiology has established the Breast Imaging Reporting and Data System (BI-RADS) guideline to standardize breast imaging terminology and report management, intra- and inter-observer variations in breast US image interpretation still exist8,9. In addition, high false positive and false negative rates in breast US examinations have limited its applicability to a broader screening and diagnostic population.

Artificial intelligence (AI) has been leveraged for many years to alleviate these challenges to some extent10,11. With the rapid development of deep learning12–15, the convolutional neural network has gradually become a promising approach in the field of breast US image analysis, including region detection16,17, lesion segmentation18,19, and tumor classification20, as well as multi-stage tasks21,22. Early works primarily focused on the US B-mode image without recognizing the importance of jointly utilizing comprehensive semantic information from other types of US images. A significantly better breast cancer risk prediction has recently been demonstrated via either the combination of US B-mode and colour Doppler23,24 or the fusion of US B-mode and elastography25. More recently, we demonstrated that a clinically applicable deep learning system26 could prospectively assess clinically relevant multimodal US images (i.e., B-mode, colour Doppler, and elastography) with non-inferior sensitivity and specificity to experienced radiologists.

Despite the radiologist-level classification of breast cancer, the remaining pivotal issue that needs to be addressed before clinical deployment is the black-box nature of deep learning27,28. Current deep learning systems provide clinicians with a malignant probability, which the clinicians can either simply trust for its plausibility or override. In other words, the potential inferential logic of the deep learning model, particularly as a clinical decision-supporting system, is not fully understood by radiologists. Post hoc interpretability strategies such as saliency maps29, deconvolution30, and activation maximization31 have been extensively explored to visualize the inner working mechanism of deep learning. However, these approaches still cannot explain how a model exploits these cues.

In this study, we propose a novel interpretable AI system, namely the Multimodal Ultrasound Prototype Network (MUP-Net), using domain knowledge-based prototypes to predict the malignancy risk probability. We demonstrate that MUP-Net has breast cancer prediction performance comparable to state-of-the-art black-box deep learning models, reaching the level of experienced radiologists in our prospective reader study. The most important contribution of our AI system is its emphasis on human-machine collaboration through inherent interpretability grounded in pathology-confirmed images, instead of post-hoc explainability on unseen images. In our AI-assisted reader study, we demonstrate the potential of MUP-Net in aiding clinicians, such as increasing the confidence levels of radiologists and diminishing discrepancies.

Methods
Ethical approval
This study was approved by the Institutional Review Boards of the First Affiliated Hospital of Anhui Medical University and Xuancheng People's Hospital of China. Participants were informed of all aspects of the study, even if it involved only minimal risk. Written informed consent was obtained in advance. A de-identification procedure was performed before data were transferred to our study.

Ultrasound dataset
We prospectively collected US images, including B-mode, colour Doppler, and elastography images, from women with breast lesions at either The First Affiliated Hospital of Anhui Medical University or Xuancheng People's Hospital of China from August 2019 to December 2022. Detailed collection procedures and the inclusion and exclusion criteria for patient recruitment are depicted in Supplementary Fig. 1. The US examinations were performed by one of six breast radiologists (i.e., breast radiologists performed all ultrasound scans, instead of breast ultrasound technologists), each with more than 10 years of experience in breast US, using the Aixplorer US scanner (SuperSonic Imagine) with an SL15-4 MHz or SL10-2 MHz linear array transducer following the standard protocol. For each breast, paired US images at the largest long-axis cross-sectional plane of the lesion were saved. As a result, 4,320 paired US images from 1,440 lesions (464 positive for cancer) from 1,348 patients were collected. Manual review of the pathology notes served as the ground-truth labels. Table 1 shows the patient demographics and breast lesion characteristics of our dataset. The dataset was split in an 8:2 ratio based on case recruitment date: a development cohort (70% for training and 10% for validation) and a test cohort (20% for testing).

The breast US images were pre-processed26 with a custom annotation tool to remove irrelevant information, such as text and instrument settings. In clinical practice, radiologists are required to manually place a sampling box to select the vascularity (via the US colour Doppler image) and elasticity (via the US elastography image) measurements, as well as the corresponding box region for the US B-mode image, during US image acquisition. Guided by these box regions, experienced radiologists adjusted the segmentation masks to ensure similar lesion-to-mask ratios in each imaging mode, followed by cropping operations.

Owing to the chronological data partition and the patient population distribution in the real world, the training cohort has an imbalanced data distribution with 335 malignant and 817 benign lesions. To mitigate this issue, we implemented data augmentation techniques, including horizontal flipping, random rotation, and Gaussian blurring, to increase the number of malignant samples. We additionally augmented data on-the-fly during training by contrast adjustment, horizontal flipping, and random rotation. For the reader study, we randomly and equally selected 120 out of 288 lesions, resulting in 60 benign and 60 malignant cases.

Interpretable deep learning model
The design of our interpretable AI system and the details of MUP-Net are depicted in Fig. 1 and Supplementary Fig. 2, respectively. MUP-Net is trained using multimodal US images and biopsy-confirmed pathology labels. Three independent backbone networks, namely ResNet-18 (pre-trained on ImageNet), are used to distill semantic features from the different US modalities, consistent with our previous work26. Each patch of the generated feature map is compared against learned prototypes to identify the most similar matches, followed by a quantitative presentation of the similarity scores. These scores are then fed into a final fully connected layer with softmax output to predict a malignancy risk probability. To present explainable features to readers, the similarity scores are converted into contribution scores by combining them with the associated weights of the last fully connected layer.

Clinical domain knowledge is used to supervise MUP-Net in learning prototypes, which are representative benign and malignant cases for each modality selected from the training data during the MUP-Net optimization process. To diminish the bias introduced by automatic prototype selection, we used US domain knowledge to constrain the prototype selection to a subset of BI-RADS categories. Specifically, we first excluded BI-RADS 4b from the candidate prototypes because it represents a moderate suspicion of malignancy with borderline probability according to the latest BI-RADS Atlas and physician observations in clinical practice. As a result, the biopsy-confirmed benign and malignant prototypes for the B-mode and colour Doppler modalities were selected from BI-RADS 3,4a versus 4c,5 cases, respectively. Next, the World Federation for Ultrasound in Medicine & Biology guidelines32 establish a stricter operation for US elastography (i.e., lightly touching the skin and trying not to apply pressure), which implies that distorted elastography images with atypical appearance are inevitable in clinical practice. Therefore, biopsy-confirmed benign and malignant prototypes for US elastography were exclusively selected from BI-RADS 3 versus 5 cases.
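As a concrete illustration of the BI-RADS constraints described above, the short Python sketch below filters biopsy-confirmed cases into per-modality benign and malignant prototype-candidate pools. The field names and data layout are assumptions made for illustration; this is not the released MUP-Net code.

```python
# Minimal sketch of the domain-knowledge filtering described above.
# Field names (bi_rads, label) and the dict layout are illustrative assumptions.

# Allowed BI-RADS categories per modality for prototype candidates
CANDIDATE_RULES = {
    "b_mode":         {"benign": {"3", "4a"}, "malignant": {"4c", "5"}},
    "colour_doppler": {"benign": {"3", "4a"}, "malignant": {"4c", "5"}},
    "elastography":   {"benign": {"3"},       "malignant": {"5"}},
}

def select_prototype_candidates(cases, modality):
    """cases: iterable of dicts with a biopsy-confirmed 'label'
    ('benign'/'malignant') and a 'bi_rads' rating string."""
    rules = CANDIDATE_RULES[modality]
    pools = {"benign": [], "malignant": []}
    for case in cases:
        if case["bi_rads"] in rules[case["label"]]:  # BI-RADS 4b never matches any rule
            pools[case["label"]].append(case)
    return pools

# Example: a BI-RADS 4b case is excluded from every candidate pool.
demo = [{"label": "malignant", "bi_rads": "4b"},
        {"label": "malignant", "bi_rads": "4c"},
        {"label": "benign", "bi_rads": "3"}]
print({k: len(v) for k, v in select_prototype_candidates(demo, "b_mode").items()})
# -> {'benign': 1, 'malignant': 1}
```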


Table 1 | Patient demographics and breast lesion characteristics

 | Development cohort | Test cohort | Clinical test set
Number of patients | 1087 | 261 | 120
Age (mean) | 44.6 (18–85) | 44.7 (20–80) | 44.3 (20–78)
Number of lesions | 1152 | 288 | 120
BI-RADS category^a | | |
  3 or lower | 80 (6.9%) | 24 (8.3%) | 9 (7.5%)
  4a | 648 (56.3%) | 132 (45.9%) | 48 (40.0%)
  4b | 236 (20.5%) | 56 (19.5%) | 27 (22.5%)
  4c | 109 (9.5%) | 39 (13.5%) | 19 (15.8%)
  5 | 79 (6.8%) | 37 (12.8%) | 17 (14.2%)
Lesion size (mm) | | |
  <10 | 351 (30.5%) | 65 (22.6%) | 27 (22.5%)
  10–20 | 504 (43.7%) | 124 (43.1%) | 48 (40.0%)
  >20 | 297 (25.8%) | 99 (34.3%) | 45 (37.5%)
Location | | |
  Upper outer quadrant | 549 (47.7%) | 139 (48.3%) | 53 (44.1%)
  Lower outer quadrant | 232 (20.1%) | 64 (22.2%) | 29 (24.2%)
  Upper inner quadrant | 220 (19.1%) | 58 (20.1%) | 29 (24.2%)
  Lower inner quadrant | 94 (8.2%) | 18 (6.3%) | 6 (5.0%)
  Central | 57 (4.9%) | 9 (3.1%) | 3 (2.5%)
Pathology notes | | |
  Invasive carcinoma | 199 (17.3%) | 77 (26.7%) | 31 (25.8%)
  Carcinoma in situ | 38 (3.3%) | 16 (5.6%) | 9 (7.5%)
  Other malignant^b | 98 (8.5%) | 36 (12.5%) | 20 (16.7%)
  Fibroadenoma | 349 (30.3%) | 65 (22.6%) | 25 (20.8%)
  Other benign^c | 468 (40.6%) | 94 (32.6%) | 35 (29.2%)

^a The BI-RADS category is determined by breast US images only. Pathology results are available for the patients classified as BI-RADS 3 or lower following breast US due to either classification as BI-RADS 4a or higher following mammography or magnetic resonance imaging, or requests from patients themselves. ^b Includes specific malignant results. ^c Includes adenosis, hyperplasia, mastitis, benign phyllodes tumors, and papillomas.

The implementation of the deep learning model is described as follows. Let $\mathcal{D} = [X, Y]$ denote the dataset used to train the deep learning model. $X$ is a multimodal US image set $\{x_i^m\}_{i=1,m=1}^{D,M}$, where $i$ is a data index ranging from 1 to $D$ and $m$ is a modality index ranging from 1 to $M$; the modality count $M$ equals 3 in this work. $Y$ is a binary label set $\{y_i\}_{i=1}^{D}$ indicating the type of lesion. As illustrated in Supplementary Fig. 2, MUP-Net first applies three independent CNNs $f^m(\cdot)$ to the input images $x_i^m$ from the different modalities to extract features:

$$z_i^m = f^m(x_i^m; w^m) \qquad (1)$$

Let $W = \{w^m\}_{m=1}^{M}$ denote the trainable parameter sets of all $f^m$. The resulting feature maps $z_i^m \in \mathbb{R}^{H \times W \times C}$ serve as the input of the prototypical part, which aims to learn meaningful representations of each class in the latent space. Here, MUP-Net learns a set $P^m$ containing a pre-determined number of prototypes ($N$) for the $m$-th modality, i.e., $P^m = \{p_j^m\}_{j=1}^{N}$, where $p_j^m$ is the $j$-th prototype. To be specific, the prototypes $p_j^m$ are trainable variables of shape $H_m \times W_m \times C$. Thus, MUP-Net generates a set $P$ containing $M \cdot N$ prototypes in total, i.e., $P = \{P^m\}_{m=1}^{M} = \{p_j^m\}_{m=1,j=1}^{M,N}$. Each class $k \in K$ is represented by $N_k^m$ prototypes in the $m$-th modality, which means $N = \sum_{k=1}^{K} N_k^m$. We let $P^{m,k} \subseteq P^m$ denote the subset of prototypes allocated to class $k$ in modality $m$.

Since down-sampling operations are performed in the backbone $f^m$, the spatial dimension of a convolution output $z_i^m$ is small and a relatively large receptive field is represented by each pixel of $z_i^m$. Thus, we speculate that a single-pixel prototype can be sufficient to represent a significant feature of the original image when mapped back to the original pixel space, i.e., $H_m = W_m = 1$. Therefore, with $H_m < H$, $W_m < W$, and a shared channel number $C$, our model can evaluate the similarity between each prototype $p_j^m$ and all pixels of the distilled feature map $z_i^m$, i.e., $\mathrm{pixels}(z_i^m)$, in a non-overlapping sliding-window manner. In particular, MUP-Net uses Euclidean distances and top-$k$ average pooling to calculate the similarity score for each prototype:

$$s_{ij}^m = \mathrm{avg}\!\left(\mathrm{top}_k\!\left(\log \frac{d_{ij}^m + 1}{d_{ij}^m + \varepsilon}\right)\right), \qquad (2)$$

where $d_{ij}^m = \| p_j^m - \mathrm{pixels}(z_i^m) \|_2^2$ and a small number $\varepsilon$ is used to avoid a division-by-zero error. At last, the similarity score set $S = \{s_{ij}^m\}_{m=1,j=1}^{M,N}$ is flattened and fed into the last fully-connected layer $h(\cdot)$ to produce the output logits:

$$l_i = h(S; w_h), \qquad (3)$$

where $w_h \in \mathbb{R}^{MN \times K}$ denotes a trainable weight matrix. The elements of $w_h$ indicate the connection weight between a similarity score and its contribution to the output logit of a specific class. These weights are not randomly initialized. Specifically, given a class $k$ and a modality $m$, we set $w_h^{(mN+j,\,k)} = 1$ for the $j$-th prototype $p_j^m$ belonging to $P^{m,k}$, and $w_h^{(mN+j,\,k)} = -0.5$ otherwise. By following such an initialization approach, we can guide MUP-Net to assign a fixed number of prototypes to each predicted class. When $w_h^{(mN+j,\,k)}$ is positive and the input image $x_i^m$ belongs to class $k$, the fully-connected layer $h$ lets the class-$k$ prototypes make a positive contribution to the class-$k$ output logit. On the contrary, a negative $w_h^{(mN+j,\,k)}$ lets the non-class-$k$ prototypes make an inverse contribution to the logit of $y_i$.

Three types of loss functions are combined as the learning target of MUP-Net. The standard cross-entropy loss is prevalent for supervising classification tasks; it is denoted by $L_{ce} = \mathrm{CrossEntropy}(l_i, y_i)$ in this work. The clustering loss and separation loss are applied to guide the learning of the prototypical representation in the latent space. For the $m$-th modality, the clustering loss encourages some pixels of $z_i^m$ to be close to the prototypes contained in $P^{m,y_i}$, and can be defined as:

$$L_{clu} = \sum_{m=1}^{M} \sum_{i=1}^{n} \min_{p_j^m \in P^{m,y_i},\; z \in \mathrm{patch}(z_i^m)} \frac{\| z - p_j^m \|_2^2}{n} \qquad (4)$$

On the contrary, the separation loss encourages each patch of $z_i^m$ to stay away from the prototypes belonging to $P^{m,K\setminus\{y_i\}}$. It is defined as:

$$L_{sep} = -\sum_{m=1}^{M} \sum_{i=1}^{n} \min_{p_j^m \in P^{m,K\setminus\{y_i\}},\; z \in \mathrm{patch}(z_i^m)} \frac{\| z - p_j^m \|_2^2}{n} \qquad (5)$$

In addition, an L1 norm is applied to $w_h$ to generate a regularization loss $L_{reg}$. In summary, the overall learning target of MUP-Net is the sum of the above loss functions:

$$L = L_{ce} + \alpha L_{sep} + \beta L_{clu} + \gamma L_{reg} \qquad (6)$$

where $\alpha$, $\beta$, and $\gamma$ were empirically set to 0.8, −0.08, and 1e-4, respectively.
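For readers who want to map Eqs. (2) and (3) onto code, the following is a minimal, self-contained PyTorch sketch, not the released MUP-Net implementation; the tensor shapes, the top-k value, and ε are illustrative assumptions. It computes squared Euclidean distances between each 1×1 prototype and every spatial position of a feature map, converts them into similarity scores with top-k average pooling, concatenates the scores across modalities, and feeds them into the final fully-connected layer.

```python
import torch
import torch.nn.functional as F

def similarity_scores(z, prototypes, topk=5, eps=1e-4):
    """z: (B, C, H, W) feature map of one modality; prototypes: (N, C).
    Returns (B, N) similarity scores, following Eq. (2)."""
    B, C, H, W = z.shape
    pix = z.permute(0, 2, 3, 1).reshape(B, H * W, C)                       # all spatial positions
    d = ((pix.unsqueeze(2) - prototypes.view(1, 1, -1, C)) ** 2).sum(-1)   # (B, HW, N) squared L2
    sim = torch.log((d + 1.0) / (d + eps))                                 # large when distance is small
    k = min(topk, H * W)
    return sim.topk(k, dim=1).values.mean(dim=1)                           # top-k average pooling

# Toy example with M = 3 modalities, N = 10 prototypes per modality, K = 2 classes.
B, C, H, W, M, N, K = 2, 512, 7, 7, 3, 10, 2
feats = [torch.randn(B, C, H, W) for _ in range(M)]                        # backbone outputs z_i^m
protos = [torch.randn(N, C) for _ in range(M)]                             # prototype vectors p_j^m
S = torch.cat([similarity_scores(z, p) for z, p in zip(feats, protos)], dim=1)  # (B, M*N)

h = torch.nn.Linear(M * N, K, bias=False)                                  # last layer with weights w_h
logits = h(S)                                                              # Eq. (3)
malignancy_risk = F.softmax(logits, dim=1)[:, 1]
print(S.shape, logits.shape, malignancy_risk.shape)
```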


[Fig. 1 schematic: (top) cancer risk prediction model development, in which patient enrollment and lesion selection (1,087 patients, mean age 44.6 (18–85), 1,152 lesions) feed MUP-Net, with domain-knowledge filtering of benign and malignant prototype candidates for US B-mode, US colour Doppler, and US elastography; (bottom) reader study (Solo and +AI), showing per-modality contribution scores with an overall benign/malignant prediction, two decision modes (BI-RADS rating: 4a+/4b+/4c+; B/M: binary), and a per-case questionnaire:
Q1. Whether the overall prediction provided by AI is helpful in making a better decision?
Q2. Whether the similarity comparison between prototypes and test cases is appropriate?
Q3. Whether the contribution scores reveal the importance of each US modality, and meet the expectation in the BI-RADS Atlas?
Overall evaluation, Q4. Are prototypes and contribution scores useful even if the overall prediction contradicts your own thought?]

Fig. 1 | Overall study design of the interpretable AI system. Prospectively collected multimodal ultrasound (US) images, including B-mode, colour Doppler, and elastography, were used to develop our MUP-Net model, by utilizing domain knowledge to supervise the selection of prototype candidates. The overall malignancy risk probability and six individual contribution scores generated by the AI system were provided to radiologists as clinical decision-supporting parameters. Two decision modes (i.e., BI-RADS rating and B/M (Benign/Malignant) preference) and a questionnaire were proposed to assess the advantageous use of explainable features by radiologists in making clinical decisions.
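The per-modality "contribution scores" shown in Fig. 1 are described as a combination of the similarity scores with the associated last-layer weights. The exact aggregation and normalization are not spelled out in this excerpt, so the sketch below is only one plausible reading: grouping prototypes by modality and class, keeping positive evidence, and normalizing to fractions are all assumptions.

```python
import torch

def contribution_scores(S, w_h, n_mod=3, n_per_mod=10):
    """S: (B, n_mod*n_per_mod) similarity scores; w_h: (n_mod*n_per_mod, K) last-layer weights.
    Returns (B, n_mod, K) fractions of positive (similarity x weight) evidence,
    i.e. six numbers per case for three modalities and two classes."""
    B = S.shape[0]
    evidence = (S.unsqueeze(2) * w_h.unsqueeze(0)).clamp(min=0)        # per-prototype, per-class
    per = evidence.view(B, n_mod, n_per_mod, -1).sum(dim=2)            # pool prototypes per modality
    return per / per.sum(dim=(1, 2), keepdim=True).clamp(min=1e-8)     # normalize to fractions

S = torch.rand(1, 30)            # e.g. flattened similarity scores from the sketch above
w_h = torch.randn(30, 2)         # benign / malignant columns
print(contribution_scores(S, w_h))   # six scores that sum to 1 for the case
```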

The optimization procedure of MUP-Net follows similar steps to those in previous works33,34. The prototypes in $P$ were randomly initialized from a uniform distribution before training. In the first stage, the three independent feature extractors $f^m(\cdot)$ and the prototype set $P$ were optimized by network back-propagation. In the second stage, each $P^{m,k}$ was projected onto (or replaced by) the nearest latent training patch from the training set belonging to class $k$. The projection not only affects the classification accuracy, but also allows the visualization of the prototypes as training image patches. In the third stage, only the matrix $w_h$ of the last fully-connected layer $h(\cdot)$ was trained, to adjust the connection weights between the output logits and the similarity scores $S$. These three stages were repeated throughout the whole training process.
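To make the second (projection) stage concrete, here is a simplified, self-contained sketch that again assumes 1×1 prototypes; it illustrates the idea rather than reproducing the authors' code, and the variable names are hypothetical.

```python
import torch

def project_prototypes(prototypes, proto_labels, feature_maps, image_labels):
    """prototypes: (N, C); proto_labels: (N,) class of each prototype;
    feature_maps: (D, C, H, W) latent maps of training images;
    image_labels: (D,) class of each training image.
    Each prototype is replaced by the nearest 1x1 latent patch drawn from
    training images of its own class (assumes every class is present)."""
    D, C, H, W = feature_maps.shape
    pixels = feature_maps.permute(0, 2, 3, 1).reshape(D, H * W, C)
    projected = prototypes.clone()
    for j, p in enumerate(prototypes):
        candidates = pixels[image_labels == proto_labels[j]].reshape(-1, C)  # same-class patches
        dists = ((candidates - p) ** 2).sum(dim=1)
        projected[j] = candidates[dists.argmin()]        # nearest latent training patch
    return projected

# Toy example: 4 prototypes (2 benign, 2 malignant), 6 training feature maps.
protos = torch.randn(4, 8)
proto_cls = torch.tensor([0, 0, 1, 1])
feats = torch.randn(6, 8, 5, 5)
img_cls = torch.tensor([0, 1, 0, 1, 0, 1])
print(project_prototypes(protos, proto_cls, feats, img_cls).shape)   # torch.Size([4, 8])
```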


Model development and evaluation
Transfer learning was first applied to the feature extraction part via a ResNet-18 network pre-trained on ImageNet. The Adam optimizer was chosen to optimize MUP-Net. We initially set the number of training epochs to 80 and the learning rate to 0.001; the learning rate decreased by 90% every 5 epochs. The batch size was 40. ReLU was chosen as the activation function. A weight decay factor of 0.001 was applied to the parameters of the batch normalization layers. Our model was trained on an NVIDIA A100 GPU using Python and the PyTorch toolbox. During training, data augmentation techniques were applied to speed up convergence and avoid overfitting. In particular, the input images were flipped horizontally with a 50% probability, brightness and contrast were randomly varied by a factor between [0.9, 1.1], a random portion of each image between 0.9 and 1 was cropped and resized to its original size, and all images were randomly rotated by a degree between −10° and 10°.

Several off-the-shelf deep learning models35–37 (i.e., VGG, ResNet, and DenseNet) were trained to compare the performance of black-box models and our interpretable MUP-Net. Their overall architectures were similar, with independent backbone networks used to extract features from the different US modalities and fully connected layers with softmax output used to predict the malignancy risk probability. To ensure a fair comparison, the same pre-trained weights from ImageNet, identical data augmentation techniques, and an initial learning rate of 0.001 were applied.

Reader study and AI-assisted reader study
Nine radiologists (R1-R9) participated in our reader study. We formed two subgroups to investigate the benefits of the AI interpretability in aiding human experts: a senior radiologist group (R1, R5, and R7, with an average of 14 years of clinical experience) and a junior radiologist group (R2, R8, and R9, with a maximum of 2 years of clinical experience).

A two-phase study was conducted to compare the performance between the AI and clinicians. In phase I (reader study), readers were blinded to the original radiologist's interpretation (i.e., that of the radiologist who collected and stored the US images) and the MUP-Net prediction. Readers independently reviewed the test samples and determined BI-RADS ratings along with a Benign/Malignant (B/M) preference. In phase II (AI-assisted reader study), additional explainable features (i.e., matched prototypes and contribution scores) and the malignancy risk probability generated by the deep learning model were provided to the readers. Each reader had the opportunity to alter their original BI-RADS ratings or B/M preference from Phase I. In addition to these decision-making tasks, readers were requested to complete a questionnaire expressing their subjective attitudes towards AI assistance.

The questionnaire includes three individual questions (Q1, Q2, and Q3) and an overall evaluation (Q4) for each test case, as shown in Fig. 1. Q1 relates to the overall breast cancer risk prediction provided by the final output of the deep learning system; the likelihood of malignancy provides an obvious AI decision without requiring further analysis by radiologists. Q2 and Q3 are attributed to the explainability of our prototypes. Specifically, Q2 checks whether there is a correlation between the biopsy-confirmed prototypes and the test samples for each modality. Such a similarity comparison is in line with the daily workflow of radiologists (i.e., interpretation based on experience from previously seen US images). Q3 is an extension of Q2, asking whether the contribution scores reveal the importance of each modality. For instance, as described in the BI-RADS Atlas, US B-mode is the dominant modality, while US colour Doppler and US elastography partially provide assorted feature information; in other words, we expect the contribution scores of the prototypes to be consistent with this prior. Q4 is an overall assessment of the effectiveness of the AI in increasing clinician confidence levels, supporting a better clinical decision, or both.

Statistical analysis
The area under the receiver operating characteristic (ROC) curve (AUC) was used to express the performance of the model. The 95% confidence intervals (CI) were calculated based on a non-parametric procedure with 1,000 bootstraps. DeLong's test was used to compare the performance among different AI models. McNemar's test was used to compare readers' decisions with and without AI support. P < 0.05 was considered to indicate a statistically significant difference. All statistical analyses were performed using Python (version 3.9) and the statsmodels library (version 0.14.0).
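The non-parametric bootstrap described above can be sketched in a few lines; the data below are synthetic and the helper is not the authors' analysis script, only a minimal illustration of a percentile bootstrap 95% CI for the AUC with 1,000 resamples.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (non-parametric, resampling cases)."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))        # resample with replacement
        if len(np.unique(y_true[idx])) < 2:                    # need both classes for an AUC
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lo, hi)

# Synthetic example with the size of the test cohort (288 lesions).
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 288)
scores = np.clip(y * 0.6 + rng.normal(0.3, 0.25, 288), 0, 1)
print(bootstrap_auc_ci(y, scores))      # e.g. (0.93, (0.90, 0.96))
```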


[Fig. 2 panels a–d: receiver operating characteristic curves of MUP-Net on the clinical test set, AUC = 0.862 (95% CI = 0.840–0.884), plotted as true positive rate (sensitivity) versus false positive rate (1-specificity), together with the operating points of individual readers and the average senior and junior radiologists, without (Solo) and with (+AI) assistance.]

Fig. 2 | Performance comparison between MUP-Net and readers in predicting breast cancer risk on the clinical test set. The performance of our AI system was compared with each of the nine readers and with the average performance of the readers at two modes. a B/M (Benign/Malignant) preference mode; b–d BI-RADS rating mode. The three BI-RADS ratings were determined by BI-RADS 3 versus 4a+, BI-RADS 3,4a versus 4b+, and BI-RADS 3,4a,4b versus 4c+, respectively. Readers were labelled as senior radiologists if they had more than 10 years of clinical experience, and as junior radiologists if they had no more than 2 years of clinical experience.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results
Our interpretable AI system was developed from a prospectively enrolled breast US dataset consisting of 4,320 US images (paired B-mode, colour Doppler, and elastography) of 1,440 lesions (464 biopsy-confirmed cancer positive) from 1,348 patients. All lesions were allocated to a 70% training cohort, a 10% validation cohort, and a 20% test cohort based on case recruitment date. Our proposed AI system not only generates an overall malignancy probability for automated breast cancer risk prediction but also creates inherent explainable features, such as representative prototypes and contribution scores, all of which enable a human-understandable interaction as a clinical decision-supporting system.

Selections of prototypes and performance of the MUP-Net model
The domain knowledge of breast US is incorporated into MUP-Net to refine the selection criteria for prototype candidates. To be specific, for the US B-mode and colour Doppler images, benign and malignant prototypes were picked out from biopsy-confirmed cases with BI-RADS ratings of 3 & 4a versus 4c & 5, respectively; the borderline BI-RADS 4b cases were excluded due to their uncertainty. For US elastography, a stricter preference was applied to the selection of prototype candidates (i.e., BI-RADS 3 for benign and BI-RADS 5 for malignant) in order to avoid potential imaging artifacts caused by human operators32. Under this setting, MUP-Net was trained to learn more discriminative cases as benign and malignant prototypes, thus generating reasonable explainable features matched to the domain knowledge used by radiologists in clinical settings.

We evaluated the effectiveness of MUP-Net in three ways. First, we assessed the model's performance with various prototype numbers, ranging from 6 to 14; our model achieved the best performance on the validation cohort when 10 prototypes were learned for each modality (Supplementary Table 1). Second, we compared MUP-Net with prevalent black-box models (i.e., VGG-16, ResNet-18, ResNet-152, DenseNet-121, and DenseNet-201), as shown in Supplementary Fig. 3. Although a few metrics of the ResNet and DenseNet families were slightly higher, there was no significant difference between MUP-Net and the black-box models (P > 0.05) on the validation cohort. Third, our MUP-Net achieved an AUC of 0.902 (95% CI = 0.882–0.921), sensitivity of 75.2%, specificity of 91.8%, and F1 score of 0.812 on the test cohort, which was comparable to the performance on the validation cohort (Supplementary Tables 1-3).

For the trained MUP-Net (10 prototypes per modality), we additionally evaluated the effectiveness of the three US modalities on the validation cohort by randomly altering one of the US inputs by certain ratios. The changes in performance in Supplementary Table 4 and Supplementary Fig. 4 indicate that B-mode is essential for improving the sensitivity, while elastography mainly improves the specificity of MUP-Net. In other words, B-mode helps identify malignant lesions, while elastography inhibits the occurrence of false positives. The role of colour Doppler is intermediate between B-mode and elastography.

Reader study
To further investigate the performance of our AI system, we conducted a reader study with nine radiologists who had varying years of experience (1 to 18 years, average 6 years). In particular, radiologists with more than 10 years of experience formed a senior group, while the junior group included radiologists with no more than 2 years of experience. For each test case in the reader study, the readers were asked to independently provide a routine BI-RADS rating and a forced B/M preference using the paired multimodal breast US images. To convert the BI-RADS rating into readers' sensitivity and specificity, we generated three BI-RADS scores, BI-RADS 4a+, 4b+, and 4c+, which match the conditions of BI-RADS 3 versus 4a+, BI-RADS 3,4a versus 4b+, and BI-RADS 3,4a,4b versus 4c+, respectively.

We compared the performance of MUP-Net with that of the nine radiologists in two ways. First, we compared the deep learning model with the sensitivity and specificity of the individual readers in four modes (i.e., one B/M score and three BI-RADS scores). As shown in Fig. 2, most of the readers were below the ROC curve of the model, indicating a non-inferior performance of our proposed AI system. Second, we compared the performance of MUP-Net with the average performance of the junior and senior radiologists. The senior radiologists showed strengths over the AI only in the BI-RADS 4a+ and 4b+ scores. By contrast, the AI had a superior performance over the average performance of the junior radiologists in all four modes, which was in line with our expectation. The greatest advantage of the AI over the average performance of the junior radiologists was in the BI-RADS 4b+ mode. A higher operating point resulted in a lower false positive rate at the expense of sensitivity, as indicated in Fig. 2 (b-d).

AI-assisted reader study
An AI-assisted reader study was conducted to evaluate the advantage of the domain knowledge-based interpretable AI system in guiding clinical decision-making. To achieve this, in addition to the same paired multimodal US images and the corresponding malignancy risk probability, the matched representative prototypes as well as the AI-generated individual contribution scores from each prototype candidate were reviewed by the readers. Figure 2 shows that 26 out of 36 reader operating points (9 readers in each of the four decision modes) were above the ROC curve of the AI, an obvious improvement over the reader study without any AI assistance. This phenomenon was particularly apparent when comparing junior and senior radiologists. Specifically, senior radiologists obtained an improvement of 7.6% in sensitivity in the B/M score and 6.9% in specificity in the BI-RADS 4a+ score. By contrast, junior radiologists gained more benefits, including an 11.3% improvement in sensitivity in the BI-RADS 4b+ score and a 6.3% improvement in specificity in the BI-RADS 4a+ score.
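Converting a BI-RADS rating into the three binary operating points used above (3 versus 4a+, 3/4a versus 4b+, 3/4a/4b versus 4c+) amounts to thresholding an ordered scale. The small sketch below illustrates this; the category encoding and helper names are assumptions for illustration only.

```python
# Ordered BI-RADS scale used for thresholding; '2' and '3' count as '3 or lower' here.
ORDER = {"2": 0, "3": 0, "4a": 1, "4b": 2, "4c": 3, "5": 4}

def binary_calls(bi_rads_ratings, operating_point):
    """operating_point: '4a+', '4b+' or '4c+' -> 1 (test positive) / 0 per case."""
    cut = {"4a+": 1, "4b+": 2, "4c+": 3}[operating_point]
    return [int(ORDER[r] >= cut) for r in bi_rads_ratings]

def sensitivity_specificity(calls, truth):          # truth: 1 = malignant, 0 = benign
    tp = sum(c == 1 and t == 1 for c, t in zip(calls, truth))
    tn = sum(c == 0 and t == 0 for c, t in zip(calls, truth))
    fp = sum(c == 1 and t == 0 for c, t in zip(calls, truth))
    fn = sum(c == 0 and t == 1 for c, t in zip(calls, truth))
    return tp / (tp + fn), tn / (tn + fp)

ratings = ["3", "4a", "4b", "4c", "5"]
truth = [0, 0, 1, 1, 1]
print(sensitivity_specificity(binary_calls(ratings, "4b+"), truth))   # (1.0, 1.0)
```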


Table 2 | Summary of changes in clinical outcomes made by readers R1-R9 in completing the AI-assisted reader study

For each reader: benign lesions (N = 60) and malignant lesions (N = 60), with adjustments (Maintain / Upgrade^a / Downgrade^b) and biopsy rate (Solo / +AI).

Reader | Benign: Maintain | Upgrade^a | Downgrade^b | Biopsy rate Solo | Biopsy rate +AI | Malignant: Maintain | Upgrade^a | Downgrade^b | Biopsy rate Solo | Biopsy rate +AI
R1** (13 yrs) | 54 | 4 | 2 | 53.3% | 56.7% | 49 | 6 | 5 | 100% | 100%
R2* (2 yrs) | 32 | 8 | 20 | 70.0% | 53.3% | 36 | 17 | 7 | 98.3% | 98.3%
R3 (5 yrs) | 53 | 4 | 3 | 41.7% | 36.7% | 31 | 27 | 2 | 90.0% | 93.3%
R4 (3 yrs) | 35 | 8 | 17 | 88.3% | 83.3% | 24 | 26 | 10 | 100% | 100%
R5** (18 yrs) | 52 | 3 | 5 | 36.7% | 33.3% | 55 | 4 | 1 | 96.7% | 95.0%
R6 (3 yrs) | 47 | 1 | 12 | 56.7% | 43.3% | 47 | 11 | 2 | 95.0% | 95.0%
R7** (10 yrs) | 56 | 2 | 2 | 71.7% | 71.7% | 41 | 17 | 2 | 100% | 100%
R8* (1 yr) | 42 | 7 | 11 | 36.7% | 31.7% | 25 | 29 | 6 | 71.7% | 78.3%
R9* (1 yr) | 54 | 5 | 1 | 40.0% | 43.3% | 37 | 20 | 3 | 73.3% | 86.7%

^a Shows the number of lesions that the radiologists altered for either an increased BI-RADS or a preference change from "benign to malignancy" under the same BI-RADS. ^b Shows the number of lesions that the radiologists altered due to either a decreased BI-RADS or a preference change from "malignancy to benign" under the same BI-RADS. Considering BI-RADS 4a+ as a test positive for malignancy, the readers suggest "biopsy" without (Solo) and with (+AI) computer assistance. ** and * indicate that the reader belongs to the senior or junior radiologist group, respectively.

Table 3 | Summary of the effectiveness of interpretable AI in aiding readers R1-R9

Solo, +AI, and Increment report the changes of accuracy in the different decision modes; Q1-Q4 summarize the questionnaire on AI assistance.

Reader | Mode^a | Solo | +AI | Increment | Q1 | Q2 | Q3 | Q4
R1** (13 yrs) | BI-RADS | 73.3%, 80.0%, 72.5% | 71.7%, 83.3%, 71.0% | +0.1% | 98.3% | 88.3% | 80.0% | Y
 | B/M | 84.2% | 84.2% | +0.0% | | | |
R2* (2 yrs) | BI-RADS | 64.1%, 77.5%, 73.3% | 71.7%, 82.5%, 78.3% | +5.9% | 70.8% | 93.3% | 62.5% | Y
 | B/M | 70.8% | 77.5% | +6.7% | | | |
R3 (5 yrs) | BI-RADS | 74.1%, 71.6%, 55.0% | 78.3%, 85.8%, 63.3% | +8.9% | 77.5% | 96.7% | 74.2% | Y
 | B/M | 75.0% | 80.8% | +5.8% | | | |
R4 (3 yrs) | BI-RADS | 55.8%, 77.5%, 70.0% | 58.3%, 82.5%, 80.0% | +5.8% | 92.5% | 85.0% | 70.0% | Y
 | B/M | 79.1% | 82.5% | +3.4% | | | |
R5** (18 yrs) | BI-RADS | 80.0%, 81.6%, 64.1% | 80.8%, 85.0%, 65.0% | +1.7% | 10.8% | 87.5% | 80.8% | Y
 | B/M | 85.0% | 85.0% | +0.0% | | | |
R6 (3 yrs) | BI-RADS | 70.0%, 79.1%, 74.1% | 75.8%, 84.2%, 80.0% | +5.6% | 71.7% | 93.3% | 63.3% | Y
 | B/M | 70.0% | 75.8% | +5.8% | | | |
R7** (10 yrs) | BI-RADS | 64.1%, 73.3%, 64.1% | 64.2%, 80.0%, 70.0% | +4.2% | 81.6% | 85.8% | 75.0% | Y
 | B/M | 73.3% | 80.8% | +7.5% | | | |
R8* (1 yr) | BI-RADS | 67.5%, 57.5%, 52.5% | 73.3%, 69.2%, 54.2% | +6.4% | 79.2% | 81.7% | 59.2% | Y
 | B/M | 63.3% | 75.8% | +12.5% | | | |
R9* (1 yr) | BI-RADS | 66.6%, 64.1%, 56.6% | 71.6%, 69.2%, 57.5% | +3.7% | 84.2% | 76.7% | 81.7% | Y
 | B/M | 62.5% | 65.8% | +3.3% | | | |

^a The BI-RADS mode includes 3 versus 4a+, 3,4a versus 4b+, and 3,4a,4b versus 4c+, while the B/M mode only contains benign versus malignant. Considering 4a+, 4b+, and 4c+ in the BI-RADS mode and "M" in the B/M mode as test positive for malignancy, the performance of the radiologists in the AI-assisted reader study is calculated without (Solo) and with (+AI) computer assistance, respectively. The increment represents the average accuracy improvement in the BI-RADS and B/M modes. The detailed descriptions of Q1-Q4 are listed in Fig. 1. ** and * indicate that the reader belongs to the senior or junior radiologist group, respectively.

Table 2 summarizes the changes in BI-RADS rating and biopsy rate made by the readers in completing the AI-assisted reader study. For benign lesions, most readers (7 out of 9) avoided at least 3.4% and up to a maximum of 16.7% of unnecessary biopsies along with a better BI-RADS rating (5 out of 9 downgraded more BI-RADS categories). For malignant lesions, all readers (9 out of 9) upgraded more BI-RADS categories, and most readers (8 out of 9) suggested more biopsies. These results demonstrated the great potential of our AI to assist radiologists in making better clinical assessments, especially for junior radiologists.

Table 3 summarizes the effectiveness of the interpretable AI in aiding radiologists in terms of accuracy increment. None of the readers suffered a decrease in accuracy in either BI-RADS ratings or B/M preference. As a clinical decision-supporting system, our AI system was more beneficial for junior radiologists than for senior radiologists (i.e., 5.3% and 7.5% versus 2.0% and 2.5% in BI-RADS rating and B/M preference, respectively).

We demonstrated the advantageous use of domain knowledge-based explainable features, rather than the malignancy risk probability alone, from two aspects. First, a questionnaire was surveyed to summarize readers' subjective attitudes towards AI assistance. Second, we performed an additional test by presenting only the AI-predicted probability of malignancy with no explainable features. Supplementary Table 5 compares the two types of AI assistance on the clinical test set. The better clinical assessment made by the readers indicates the advantage of providing explainable features over providing only the malignancy probability predicted by the AI.


Fig. 3 | Three test cases to illustrate the adjustment of nine readers with the assistance of AI. For each case, six representative prototypes and corresponding normalized contribution scores were presented to readers. For either the benign or the malignant set, representative prototypes were selected from the prototype candidates (see Supplementary Fig. 6) with the highest contribution scores. a and c are pathology-confirmed malignant lesions, while b is a pathology-confirmed benign lesion. R1-R9 represent the nine readers, respectively.

The importance of explainable features in clinic
Compared to previous black-box deep learning models (e.g., the heatmaps listed in Supplementary Fig. 5), our proposed MUP-Net learned representative cases (i.e., the prototypes listed in Supplementary Fig. 6) from each modality to perform a similarity comparison during inference, resulting in a better form of human-machine interaction. We list three examples in Fig. 3 to demonstrate how domain knowledge-based explainable features could benefit radiologists in clinical practice.

Figure 3 (a) is a malignant lesion with a high predicted malignancy risk probability (i.e., the contribution scores of the malignant prototypes were much higher than those of the benign prototypes). Except for R8 and R9, all other radiologists either increased their BI-RADS ratings or altered their B/M preference from benign to malignant, indicating the ability of the AI to optimize radiologists' preferences. When a lower overall cancer risk probability was presented for a benign lesion, as shown in Fig. 3 (b), most radiologists tended to improve their original decisions. Figure 3 (c) is a malignant lesion with an ambiguous risk score of 55%, which made it difficult to convince radiologists in clinical practice. However, the explainable features (i.e., the highest contribution score of 33% belonging to a malignant prototype from US B-mode) still helped 3 out of 9 readers make a positive adjustment towards a higher risk of malignancy.

The detailed adjustments made in the AI-assisted reader study are listed in Supplementary Tables 6-14. It should be noted that inaccurate malignancy risk probabilities are inevitable in the AI system. To diminish the impact of erroneous predictions, the domain knowledge-based prototypes and similarity scores have the potential to provide a second-level validation. In particular, when the AI had a preference opposite to the ground-truth result, the majority of the readers gave a negative answer to Q1 and Q3 of our questionnaire. In other words, radiologists preferred to agree with the learned prototypes instead of accepting the overall prediction of the AI. Another interesting finding of our study was that the distribution of the contribution scores conformed with clinical experience in utilizing multimodal US images. To be specific, US B-mode took a dominant role in image interpretation, while US colour Doppler was an important supplement. As a newly proposed modality in the BI-RADS guidelines, the contribution score of US elastography was relatively weaker than those of the other two modalities under the same category of prototypes. These observations imply the potential clinical applicability of our AI system, which follows the same radiomic analysis routine.

Discussion
Accurate determination of breast cancer risk from multimodal US images could considerably improve patients' outcomes and avoid unnecessary biopsies26. To accelerate a broader adoption of deep learning technology by human experts in clinical practice, we propose a domain knowledge-based interpretable AI system that not only provides an overall cancer risk prediction, but also offers an efficient human-machine interaction. We demonstrate that our AI system has the potential to assist clinicians, especially junior radiologists, in making better and more confident decisions.

Deep learning frameworks have previously demonstrated their superiority over hand-crafted features in medical imaging as clinical decision-supporting systems38. However, the black-box nature of deep learning has hindered the establishment of trust from human experts, even though its performance has reached or surpassed that of human experts39,40. In other words, clinicians could either simply trust the output of the AI for its plausibility or overrule it, making the clinical value of AI controversial. To address the lack of interpretability in deep learning, post-hoc explainability approaches have recently been adopted that output intermediate results such as saliency maps41,42 and heatmaps29. However, these methods cannot clearly explain the inner working mechanism of the model while utilizing these cues.

Herein, we propose a domain knowledge-based interpretable AI system for multimodal US images, namely MUP-Net, which exploits its inherent explainability by making comparisons among representative cases (i.e., prototypes) during inference, and incorporates clinical domain knowledge of breast US images for prototype selection. The focus of our AI system is to explore an understandable AI decision-making strategy for radiologists as a novel approach to human-machine collaboration. The proposed MUP-Net balances the interpretability of conventional machine learning with the superior performance of deep learning.


The performance of our AI system has been demonstrated in three aspects. First, the interpretable MUP-Net achieved comparable performance with prevalent black-box models; for instance, no significant difference was observed between MUP-Net and off-the-shelf methods such as the VGG, ResNet, and DenseNet families on the validation cohort. Second, in the reader study, the majority of the readers were below the ROC curve of the AI system in terms of sensitivity and specificity. Our MUP-Net showed a superior performance over junior radiologists, while maintaining a competitive performance with senior radiologists. Third, such an interpretable AI system could help avoid unnecessary biopsies for BI-RADS 4-rated patients, which is the major challenge in clinical practice43,44. It is important to note that BI-RADS 4 lesions occupied over 85% of our dataset, making it more challenging and more representative of clinical need.

In addition to the overall improvement in performance, the AI-assisted reader study revealed the enhancement of clinical applicability brought by the explainable features. On the one hand, if the AI system outputs a correct prediction that agrees with the radiologist's assessment, it increases the confidence level of the radiologist. One important piece of evidence is that all readers gave positive feedback on Q4, indicating their agreement with the reasoning process of the AI. On the other hand, when MUP-Net outputs an inaccurate malignancy probability, which does occur in clinical practice, our questionnaire demonstrated that the inherent explainable features have the potential to alleviate such a discrepancy. Specifically, a positive answer to Q2 together with negative answers to Q1 and Q3 revealed the potential of the explainable features to help readers adhere to their original decisions. Another finding of our study is that the AI-assisted improvement in sensitivity and specificity occurred for junior radiologists rather than senior radiologists, suggesting the potential educational applicability of this interpretable AI system in training or assisting junior clinicians.

Moreover, domain knowledge is applied to supervise the learning process of the AI system, which ensures the quality of the output explainable features. The key point is to supervise MUP-Net to select discriminative benign and malignant cases as prototypes for generating credible explainable features presented to readers. The domain knowledge for supervision is based on the standard BI-RADS Atlas and physician observations in clinical practice, which restricts the range of prototype candidates to a subset of BI-RADS categories. It is interesting to note that the learned contribution scores followed the radiomic analysis routine in clinical practice, with a relatively weaker weight for US elastography than for the other two modalities under the same category of prototypes.

There are a few limitations in our study. Our dataset was exclusively acquired using Aixplorer US scanners and does not include the variability generated by various scanner manufacturers. Therefore, the proposed AI system may not achieve the same high performance in external cohorts. In addition, more data are needed for further optimization and testing across diverse clinical settings to demonstrate the usefulness and generalizability of our system. Moreover, our MUP-Net is an image-only deep learning system. To further improve our system, we should include patients' medical histories and demographics as metadata input; this information should be relevant and beneficial for comprehensive cancer risk prediction.

Data availability
The main data supporting the results of this study are available within the paper and its Supplementary Information. Source data underlying Fig. 2 and Tables 2–3 can be found in Supplementary Data 1 and 2. Because of patient privacy, the raw ultrasound datasets from The First Affiliated Hospital of Anhui Medical University and Xuancheng People's Hospital of China cannot be made available for public release. However, data from the reader study can be made available for academic study by the lead corresponding author (Xuejun Qian) on reasonable request, subject to permission from the institutional review boards of the hospitals.

Code availability
The codes used in this study are available at https://github.com/Qian-IMMULab, and are provided in Zenodo with the identifier https://doi.org/10.5281/zenodo.11046790. The following packages were used: Python 3.9.0, PyTorch 1.13.0, Torchvision 0.14.0, NumPy 1.23.4, Scikit-learn 1.1.3, Statsmodels 0.14.0, and Matplotlib 3.6.1. Code for preprocessing the data from raw ultrasound images is available for research purposes upon request to the lead corresponding author (Xuejun Qian).

Received: 9 August 2023; Accepted: 3 May 2024;

References
1. Siegel, R. L., Miller, K. D., Fuchs, H. E. & Jemal, A. Cancer statistics, 2022. CA Cancer J. Clin. 72, 7–33 (2022).
2. Tabár, L. et al. Swedish two-county trial: impact of mammographic screening on breast cancer mortality during 3 decades. Radiology 260, 658–663 (2011).
3. Kolb, T. M., Lichy, J. & Newhouse, J. H. Comparison of the performance of screening mammography, physical examination, and breast US and evaluation of factors that influence them: an analysis of 27,825 patient evaluations. Radiology 225, 165–175 (2002).
4. Crystal, P., Strano, S. D., Shcharynski, S. & Koretz, M. J. Using sonography to screen women with mammographically dense breasts. Am. J. Roentgenol. 181, 177–182 (2003).
5. Berg, W. A. et al. Combined screening with ultrasound and mammography vs mammography alone in women at elevated risk of breast cancer. JAMA 299, 2151–2163 (2008).
6. Berg, W. A. et al. Ultrasound as the primary screening test for breast cancer: analysis from ACRIN 6666. J. Natl Cancer Inst. 108, djv367 (2016).
7. Chung, M. et al. US as the primary imaging modality in the evaluation of palpable breast masses in breastfeeding women, including those of advanced maternal age. Radiology 297, 316–324 (2020).
8. Abdullah, N., Mesurolle, B., El-Khoury, M. & Kao, E. Breast imaging reporting and data system lexicon for US: interobserver agreement for assessment of breast masses. Radiology 252, 665–672 (2009).
9. Lazarus, E., Mainiero, M. B., Schepps, B., Koelliker, S. L. & Livingston, L. S. BI-RADS lexicon for US and mammography: interobserver variability and positive predictive value. Radiology 239, 385–391 (2006).
10. Yassin, N. I., Omran, S., El Houby, E. M. & Allam, H. Machine learning techniques for breast cancer computer aided diagnosis using different image modalities: a systematic review. Comput. Methods Prog. Biomed. 156, 25–45 (2018).
11. Hosny, A., Parmar, C., Quackenbush, J., Schwartz, L. H. & Aerts, H. J. Artificial intelligence in radiology. Nat. Rev. Cancer 18, 500–510 (2018).
12. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
13. Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016).
14. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
15. Shen, D., Wu, G. & Suk, H.-I. Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19, 221–248 (2017).
16. Guo, R., Lu, G., Qin, B. & Fei, B. Ultrasound imaging technologies for breast cancer detection and management: a review. Ultrasound Med. Biol. 44, 37–70 (2018).
17. Yap, M. H. et al. Automated breast ultrasound lesions detection using convolutional neural networks. IEEE J. Biomed. Health Inform. 22, 1218–1226 (2017).


Acknowledgements
We thank all the participants and the institutions for supporting this study. This work would not have been possible without the participation of the radiologists who reviewed cases for this study; we are therefore grateful to W. Yao, X. Shuai, W. Zhang, Y. Li, N. Chen, Y. Guo, J. Zhang, F. Yao and R. Xiang. This study was supported by the National Natural Science Foundation of China (no. 82371993 to X.Q.), the Anhui Provincial Health Research Project (no. AHWJ2023A20096 to J.P.), and the HPC Computing Platform of ShanghaiTech University.

Author contributions
X.Q. conceived and designed the project. X.Q., D.S. and J.P. supervised the study. L.Y., Z.L. and D.Y. developed the deep learning framework and software tools necessary for the experiments. Hao Zhang, G. Zhang, W. Zheng, C.H., Hanqi Zhang, X.X., C.L., W. Zhang and H. Zheng collected the raw ultrasound images and patients' pathology results in clinic, and defined the clinical labels. L.Y., Z.L., C.H. and D.Y. executed the research and performed statistical analysis. X.Q., L.Y. and Z.L. wrote the manuscript. All authors contributed to the review and editing of the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s43856-024-00518-7.

Correspondence and requests for materials should be addressed to Jing Pei, Dinggang Shen or Xuejun Qian.

Peer review information Communications Medicine thanks Christine Lee and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons licence, and indicate if changes
were made. The images or other third party material in this article are
included in the article’s Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in the
article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to
obtain permission directly from the copyright holder. To view a copy of this
licence, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2024
