Breast Cancer Diagnosis in Two-View Mammography Using End-to-End Trained EfficientNet-Based Convolutional Network
Daniel G.P. Petrini1 , Carlos Shimizu2 , Rosimeire A. Roela3 , Gabriel V. Valente3 , Maria A.K.
Folgueira3 , Hae Yong Kim1
Abstract
Some recent studies have described deep convolutional neural networks to diagnose breast
cancer in mammograms with similar or even superior performance to that of human experts. Shen
et al. (2019) present one of the best techniques that consists of two transfer learnings. The
first uses a model trained on natural images to create a “patch classifier” that categorizes small
subimages. The second uses the patch classifier to scan the whole mammogram and create the
“single-view whole-image classifier”. We propose to make a third transfer learning to obtain a
“two-view classifier” to use the two mammographic views: bilateral craniocaudal and mediolateral
oblique. We use modern EfficientNet as the basis of our model. We “end-to-end” train the entire
system using the CBIS-DDSM dataset. To ensure statistical robustness, we test our system twice using:
(a) 5-fold cross validation; and (b) the original training/test division of the dataset. Our technique
reached an AUC of 0.934 using 5-fold cross validation (sensitivity and specificity are 85.13% at the
equal error rate of ROC). Using the original dataset division, our technique achieved an AUC of
0.8483, the largest AUC reported for this problem, as far as we know.
1 Introduction
Major medical and governmental health agencies endorse mammography screening programs because they reduce breast cancer-specific mortality, and more and more women follow this recommendation. As a consequence, the number of mammograms that must be analyzed is increasing day after day. Mammograms must be interpreted by experienced radiologists to achieve a low error
rate. To help radiologists, CAD (Computer-Aided Detection and Diagnosis) systems have been and
are being developed.
Recently, there has been a revolution in artificial intelligence (AI) and computer vision with the
introduction of the deep convolutional neural network (CNN) [12, 10, 11]. Some recent works have
proposed using CNNs to diagnose cancer in mammograms. However, there are important differences between classifying natural images and mammograms. In natural images, the object that defines the image category usually occupies a large area. This does not happen in mammograms, where the cancerous tissue may occupy only a tiny area. Consequently, directly training a CNN or performing conventional transfer learning to classify mammograms usually does not work well.
Shen et al. [26] present an elegant idea to overcome this challenge, which consists of performing two
transfer learnings. The first uses a model trained on the ImageNet [23] natural images to initialize the
“patch classifier” that classifies small mammogram patches into five categories: background, benign
calcification, malignant calcification, benign mass, and malignant mass. The second uses the patch
classifier to initialize the “single-view whole-image classifier” that is end-to-end trained using whole
mammograms with cancer status. In other words, they first build the patch classifier because it is easier
1 Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil
2 Instituto do Câncer do Estado de São Paulo, São Paulo, Brazil
3 Faculdade de Medicina, Universidade de São Paulo, São Paulo, Brazil
than building a whole image classifier. Subsequently, the patch classifier scans the entire mammogram,
generating attribute maps that describe the likelihood of having different types of lesions in each region
of the mammogram. The whole image classifier uses these maps to make the final classification and is
end-to-end trained. In this paper, we propose some improvements to Shen et al.’s method to increase
its performance:
(1) The original technique used ResNet [8] and VGG [33] as the base models. We replaced these
rather outdated models with the modern EfficientNet [29].
(2) Standard mammography consists of two views for each breast: bilateral craniocaudal (CC) and
mediolateral oblique (MLO). The original algorithm processes only one view at a time and, to take the
two views into account, it simply averages the scores of the two views processed independently. Our
technique performs a third transfer learning, in addition to the original two, to take into account the
two views. We use the single-view classifier to initialize the “two-view classifier” and then the entire
system (patch, single-view and two-view classifiers) is end-to-end trained, using two-view mammograms
with cancer status.
With the above improvements, together with test-time augmentation (TTA) and an ensemble of four models with the same architecture, we achieved an AUC (Area Under the ROC Curve) of 0.9344±0.0341 in 5-fold cross-validation using the CBIS-DDSM dataset (sensitivity and specificity are 85.13% at the
equal error rate point of the ROC). It is important to note that these performance measures cannot
be directly compared with those of radiologists, as all CBIS-DDSM images have at least one lesion,
whereas most mammograms examined by radiologists are normal. It is known that a substantially
smaller AUC is obtained using the original CBIS-DDSM training/test division [31]. In this condition,
we obtained an AUC of 0.8483±0.0253 (with TTA). As far as we know, this is the largest AUC reported
for this problem.
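For clarity, the sensitivity and specificity at the equal error rate point mentioned above can be read off the ROC curve as in the minimal sketch below. This is an illustration using scikit-learn, not code from our repository; y_true and y_score are placeholder names for the ground-truth labels and predicted malignancy scores.

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    def eer_operating_point(y_true, y_score):
        # ROC curve: fpr = 1 - specificity, tpr = sensitivity
        fpr, tpr, thr = roc_curve(y_true, y_score)
        # equal error rate point: the threshold where sensitivity ~ specificity
        i = np.argmin(np.abs(tpr - (1.0 - fpr)))
        return {"auc": roc_auc_score(y_true, y_score),
                "sensitivity": tpr[i],
                "specificity": 1.0 - fpr[i],
                "threshold": thr[i]}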
A useful property of both Shen et al.'s method and ours is that they allow the use of datasets with partial ROI annotations. In this case, we can use the subset with the ROI locations
to train the patch classifier and the subset without this information (with just the status of whole
images) to train the whole-image classifiers.
A previous version of this work was presented at the 2021 AACR annual meeting and published as an
abstract [20, 21] and a preprint version is available at https://arxiv.org/abs/2110.01606. The inference
code of the two-view classifier and the model are available at https://github.com/dpetrini/two-views-
classifier.
2 Related works
2.1 Early CAD systems
A CAD system should receive a mammogram and report the likelihood that the patient has cancer.
Early CAD systems split this task into two sub-problems [1]: CADe (Computer-Aided Detection)
that detects ROIs suspected of being cancerous; and CADx (Computer-Aided Diagnosis) that classifies
ROIs as cancer or non-cancer. In 1998, the FDA approved the use of CADe for mammography. After that, CADe systems spread across the United States, at a cost of over $400 million a year. However, Lehman et al. [14], after analyzing 495,000 mammograms, demonstrated that screening performance was not improved by CADe. The specificity of early CADe systems was very low, generating around 1 false positive mark per view. Thus, radiologists assisted by these systems improved sensitivity in cancer diagnosis (by 4% to 15%) but worsened specificity (by 5% to 35%) [3]. Once ROIs were localized, CADx classified them as benign or malignant. Usually, handcrafted features were extracted
from the ROIs [4] and then classical machine learning techniques classified them. The AUCs of old
CADx systems refer only to the classification of ROIs and cannot be directly compared with the AUCs
of modern techniques that classify whole mammograms.
2.2 Deep learning-based CAD systems
Recently, several works have described deep learning systems with similar or even superior performance to that of human specialists.
Kooi et al. [9] compared the classification of mammography ROIs by a state-of-the-art classic method, a CNN-based method, and radiologists. They concluded that the CNN has performance comparable to that of radiologists and superior to that of the classic method.
Rodriguez-Ruiz et al. [22] compared a CNN-based commercial system (Transpara 1.4.0) with 101 radi-
ologists, using 9 datasets from different institutions in the US and Europe. The AUC of the AI system
was 0.840, while the mean AUC of the radiologists was 0.814. Therefore, the AI was better than the average radiologist, but its performance was inferior to that of the best radiologist.
Schaffter et al. [24] describe the “DM DREAM Challenge”, held between September 2016 and November 2017 to foster the development of AI algorithms for interpreting mammograms. The top-performing
single algorithm achieved an AUC of 0.858 (in the US dataset) and 0.903 (in the Swedish dataset). No
single or ensemble algorithm outperformed radiologists.
McKinney et al. [17] present an AI system that surpasses human experts in breast cancer prediction.
This system consists of an ensemble of three deep learning models that were tested on private UK and US datasets, achieving AUCs of 0.889 and 0.8107, respectively.
Wu et al. [32] designed a four-view deep learning system and trained it with over 200,000 exams,
of which 5,832 had biopsies and 985 had biopsy-confirmed malignancies. They achieved an AUC of
0.895 in predicting cancer using 4 views, which is higher than the radiologists’ average AUC of 0.778.
Although both Wu et al.'s work and ours use multiple views to classify cancer, there are fundamental
differences that we explain in Section “Multi-view technique by Wu et al.”
Recent works that, like ours, use the CBIS-DDSM dataset and whose results can be compared with ours are mentioned in Section “Recent works that use CBIS-DDSM”.
b) There is no large public standard mammogram database where different CAD systems can be
compared. Each system uses its own dataset, making the comparison difficult.
c) CBIS-DDSM is the largest public dataset (although still small for deep learning), but it is composed
of low-quality SFM. It is not possible to make a fair comparison between systems that use FFDM
and SFM mammograms.
It is difficult to compare techniques even when all systems use the same dataset (e.g. CBIS-DDSM).
Many works in the literature randomly divide the dataset into training and test sets [26, 31]. This
procedure can generate biased results, as there is the possibility of randomly choosing a test set that
is easy (or difficult) to classify, especially when the dataset is not large enough, like CBIS-DDSM.
This phenomenon can be seen clearly in our own results. When the CBIS-DDSM dataset is randomly
divided into 5 subsets and our two-view classifier is trained using 4 subsets and tested on the remaining
set, the 5 obtained AUCs vary from 0.90 to 0.99 (4 models with TTA, see the Table named “AUCs
of our two-view classifiers in CV test”). Thus, if we were lucky in the random division, our two-view classifier would reach an astonishing AUC of 0.99 and, if we were unlucky, it would reach only 0.90. Neither
of the two values reflects the true performance of our system. Consequently, the results obtained using
random training/test division are unreliable.
In particular, a remarkably smaller AUC is obtained using the official training/test division of CBIS-DDSM. Shen et al. [26] obtained an AUC of 0.87 using a random training/test division but, using the
official division, their system achieves only an estimated AUC of ~0.75 (single runs) [25]. Similarly,
Wei et al. [31] obtained an AUC of 0.9182 using a random division but only 0.7964 using the official
division (single runs). According to Wei et al., this happens because the test data is another holdout set acquired at a different time.
Bootstrapping the test set consists of randomly choosing different subsets of the test set many times, evaluating the system performance on those subsets, and averaging the results to estimate the dispersion of the CAD performance measure. This procedure only tests the same fixed model many times; it does not measure the dispersion that would be observed by making different training/test divisions. Thus, performance dispersions measured using bootstrapping on a fixed division are also unreliable.
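The following minimal sketch illustrates this point (it is not the code used in this paper): the model is fixed and only the test set is resampled with replacement, so the resulting spread reflects test-set sampling, not the variability caused by different training/test divisions.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
        # y_true, y_score: labels and predicted scores of the FIXED model on the test set
        rng = np.random.default_rng(seed)
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        aucs = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
            if len(np.unique(y_true[idx])) < 2:                   # AUC needs both classes
                continue
            aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
        return float(np.mean(aucs)), float(np.std(aucs))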
                                      Training                    Test
Category                         Benign    Malignant      Benign    Malignant       Total
Mammograms                        1,347        1,111         381          264       3,103
  Subtotal                           2,458 (79.21%)           645 (20.79%)         (100%)
Mammograms with 2 views           1,180          968         324          222       2,694
  Subtotal                           2,148 (79.73%)           546 (20.27%)         (100%)
Mammograms with 1 view              174          136          61           38         409
Figure 1: Left: we randomly chose 10 background (yellow) patches anywhere but in the lesion; we
delimited a (white) region centered at the lesion and sampled 10 patches with random horizontal and
vertical displacements within this region. Right: the lesion segmentation mask provided by CBIS-
DDSM.
From each lesion, we sampled 10 patches of 224x224 pixels around its center, with a random displacement of ±10% of the height/width (inside the white rectangle in Figure 1). Next, we sampled 10 background patches from anywhere in the image except the ROIs. We further divided the patches containing lesions into 4 subcategories according to their labels in CBIS-DDSM: benign calcification, malignant calcification, benign mass and malignant mass. Thus, a patch can be one of 5 types, with background patches making up 50% and the remaining categories 9.5%, 17.5%, 11.1% and 11.9%, respectively. We did not use any technique to compensate for this imbalance.
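A minimal sketch of this sampling procedure is shown below. It assumes a grayscale mammogram and the CBIS-DDSM lesion segmentation mask as NumPy arrays; the function name, the interpretation of the ±10% displacement as a fraction of the image height/width, and the rejection loop for background patches are illustrative choices, not code from our repository.

    import numpy as np

    def sample_patches(image, roi_mask, size=224, n=10, seed=0):
        # image: 2-D grayscale mammogram; roi_mask: binary lesion segmentation mask
        rng = np.random.default_rng(seed)
        h, w = image.shape
        cy, cx = np.argwhere(roi_mask > 0).mean(axis=0).astype(int)   # lesion center
        lesion, background = [], []
        while len(lesion) < n:                       # lesion-centered patches, random shift
            y = int(np.clip(cy + rng.uniform(-0.1, 0.1) * h - size // 2, 0, h - size))
            x = int(np.clip(cx + rng.uniform(-0.1, 0.1) * w - size // 2, 0, w - size))
            lesion.append(image[y:y + size, x:x + size])
        while len(background) < n:                   # background patches outside the ROI
            y, x = rng.integers(0, h - size), rng.integers(0, w - size)
            if roi_mask[y:y + size, x:x + size].sum() == 0:
                background.append(image[y:y + size, x:x + size])
        return lesion, background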
There are 8 models of EfficientNet, numbered from B0 to B7 [29]. EfficientNet-B0 is the smallest
model and was designed automatically by the Neural Architecture Search. Then, this base model was
scaled up in width, depth and resolution of the input image to obtain the remaining seven models.
We took EfficientNets pre-trained on ImageNet [23] images and performed transfer learning to classify
mammogram patches into 5 categories. As mammograms have only one channel, the same grayscale
feeds EfficientNet’s red, green and blue inputs. When an EfficientNet without the top layers is fed with
a 224x224 patch, it yields a model-dependent number of 7x7 attribute maps. For example, EfficientNet-B0, B4 and B7 generate 1280, 1792 and 2560 such maps, respectively. These maps are average-pooled and passed through a fully-connected layer with five outputs to perform the classification into 5 categories.
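A hedged sketch of such a patch classifier is shown below, using torchvision's EfficientNet-B4 and assuming a recent torchvision release; the class name and the way the pre-trained weights are requested are our choices, only the overall structure follows the text.

    import torch.nn as nn
    from torchvision import models

    class PatchClassifier(nn.Module):
        def __init__(self, num_classes=5):
            super().__init__()
            base = models.efficientnet_b4(weights="IMAGENET1K_V1")  # first transfer learning
            self.features = base.features            # convolutional backbone (MBConv blocks)
            self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling of the 7x7 maps
            self.fc = nn.Linear(1792, num_classes)   # 1792 attribute maps for B4

        def forward(self, x):                        # x: (N, 1, 224, 224) grayscale patches
            x = x.repeat(1, 3, 1, 1)                 # same grayscale fed to R, G and B inputs
            x = self.features(x)                     # (N, 1792, 7, 7) attribute maps
            x = self.pool(x).flatten(1)              # (N, 1792)
            return self.fc(x)                        # logits for the 5 patch categories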
Figure 2: Learning rate used to train the patch classifier in the “OD test”.
Table 2: Accuracies of patch classifiers and AUCs of single-view classifiers using different base models.
Figure 3: Diagrams of the single-view classifier for the “CV test” (top) and “OD test” (bottom).
The output of the last MBConv block is followed by global average pooling and a dense layer with two
output categories.
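Consistent with this description, a minimal sketch of the single-view whole-image classifier is given below. It reuses the backbone of the patch classifier sketched earlier (the second transfer learning); the whole-image input resolution and the exact top layers of our diagrams are omitted for brevity.

    import torch.nn as nn

    class SingleViewClassifier(nn.Module):
        def __init__(self, backbone, channels=1792):
            super().__init__()
            # backbone: the `features` module of the trained patch classifier
            self.features = backbone                 # second transfer learning
            self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling
            self.fc = nn.Linear(channels, 2)         # two output categories (cancer status)

        def forward(self, x):                        # x: (N, 1, H, W) whole mammogram
            x = x.repeat(1, 3, 1, 1)
            x = self.pool(self.features(x)).flatten(1)
            return self.fc(x)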
Fold                                                   1       2       3       4       5    mean     std
ResNet50, 1 model without TTDA                    0.7541  0.9089  0.8553  0.9057  0.8320  0.8512  0.0567
EfficientNet-B4, 1 model without TTDA             0.8371  0.8455  0.8865  0.9220  0.8874  0.8757  0.0310
EfficientNet-B4, 1 model with TTDA                0.8507  0.8570  0.8908  0.9255  0.8965  0.8841  0.0274
EfficientNet-B4, ensemble of 4 models without TTDA 0.8419 0.8566  0.8945  0.9263  0.8984  0.8835  0.0304
EfficientNet-B4, ensemble of 4 models with TTDA   0.8653  0.8634  0.8942  0.9257  0.9048  0.8907  0.0238
Table 4: Comparison of different single-view classifiers (single runs) using the original CBIS-DDSM
division.
Figure 4: Diagram of the two-view classifier for the “CV test”.
The attribute maps produced by the two views are concatenated; the final classification is obtained by average pooling these maps, followed by a dense layer.
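A hedged sketch of this two-view classifier is given below. Whether the two branches share weights, and the exact new top layers (the best configuration in Table 7 uses two MBConv blocks, represented here by a single 1x1 convolution block), are simplifications of ours, not a faithful reproduction of Figure 4.

    import torch
    import torch.nn as nn

    class TwoViewClassifier(nn.Module):
        def __init__(self, backbone, channels=1792):
            super().__init__()
            # backbone: the `features` module of the trained single-view classifier
            self.features = backbone                     # third transfer learning
            self.top = nn.Sequential(                    # stand-in for the new top layers
                nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(inplace=True),
            )
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(channels, 2)

        def forward(self, cc, mlo):                      # the two views of the same breast
            maps_cc = self.features(cc.repeat(1, 3, 1, 1))
            maps_mlo = self.features(mlo.repeat(1, 3, 1, 1))
            x = torch.cat([maps_cc, maps_mlo], dim=1)    # concatenate maps before pooling
            x = self.top(x)
            x = self.pool(x).flatten(1)
            return self.fc(x)                            # logits: benign vs. malignant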
Fold                                          1       2       3       4       5    mean     std
ResNet50, 1 model without TTDA           0.8421  0.9630  0.9522  0.9730  0.8972  0.9255  0.0492
EfficientNet-B4, 1 model without TTDA    0.8891  0.8880  0.9486  0.9882  0.9350  0.9298  0.0379
EfficientNet-B4, 1 model with TTDA       0.8840  0.8926  0.9514  0.9895  0.9402  0.9315  0.0390
EfficientNet-B4, 4 models without TTDA   0.8904  0.8933  0.9445  0.9893  0.9324  0.9300  0.0365
EfficientNet-B4, 4 models with TTDA      0.9004  0.8963  0.9462  0.9896  0.9397  0.9344  0.0341
Figure 6: ROCs of our two-view classifiers in “CV test” (with TTA and ensemble of 4 models).
Method                  AUC              Observation
Our two-view            0.8418±0.0258    Single run
Our two-view            0.8483±0.0253    1 model with TTA
Shen et al. [26, 25]    ~0.75            Estimated (single run)
Shu et al. [27]         0.838            Unclear if the original division is used
Wei et al. [31]         0.7964           Single run
Wei et al. [31]         0.8187           1 model with TTA
Wei et al. [31]         0.8313           4 models with TTA
Almeida et al. [18]     0.6824           Resized to 224x224, single run
Figure 7: ROC curve of our two-view classifier in “OD test” (with TTA).
Concatenate maps        New top layers       AUC
after average pools     only dense layers    0.9225±0.0405
before average pools    2 MBConv blocks      0.9298±0.0379
Table 7: Comparison between concatenating the attribute maps after or before average poolings. All
tests used EfficientNet-B4 as the base model (single runs).
our classifier processes each view with EfficientNet-B0 and concatenates the attribute maps before
doing average poolings. We tested the two ideas (concatenating the attribute maps after or before the
average poolings), always using EfficientNet-B4 as the base model, and the results show that better
results are obtained when concatenating the maps before the average poolings (Table 7). This is not surprising, as information about the spatial locations of lesions is lost by average pooling.
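To make the difference concrete, the shapes involved in the two fusion orders are illustrated below (assuming EfficientNet-B4's 1792 channels and 7x7 attribute maps; the tensors are random placeholders):

    import torch

    maps_cc  = torch.randn(1, 1792, 7, 7)   # attribute maps of the CC view
    maps_mlo = torch.randn(1, 1792, 7, 7)   # attribute maps of the MLO view

    # (a) concatenating AFTER the average poolings: spatial information is discarded first
    after = torch.cat([maps_cc.mean(dim=(2, 3)), maps_mlo.mean(dim=(2, 3))], dim=1)
    print(after.shape)    # torch.Size([1, 3584])

    # (b) concatenating BEFORE the average poolings: the new top layers still see
    #     where the lesion evidence is located in each view
    before = torch.cat([maps_cc, maps_mlo], dim=1)
    print(before.shape)   # torch.Size([1, 3584, 7, 7])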
5 Conclusions
In this paper, we have presented a new high performance breast cancer CAD (Computer-Aided De-
tection and Diagnosis) system. We have proposed a deep convolutional network that simultaneously
takes into account the two mammographic views of the same breast and that is end-to-end trained using three transfer learnings:
1. First, we use the weights of an EfficientNet trained on natural images to train the patch classifier.
2. Second, we use the patch classifier to initialize and train the single-view whole-image classifier.
3. Third, we use the single-view classifier to initialize the two-view classifier; the entire system is then end-to-end trained using two-view mammograms with cancer status.
Acknowledgements
This work was partially supported by CNPq (National Council for Scientific and Technological Devel-
opment) process number 305377/2018-3.
Author contributions
All authors contributed ideas that resulted in the proposed algorithms. D.G.P.P. developed the algorithms and conducted the experiments. H.Y.K. supervised the experiments and wrote the main manuscript text. C.S., R.A.R., G.V.V. and M.A.K.F. provided medical information
about the problem. All authors performed the bibliographic review and reviewed the manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to D.G.P.P.
References
[1] Turgay Ayer et al. “Computer-aided diagnostic models in breast cancer screening”. In: Imaging
in medicine 2.3 (2010), p. 313.
[2] K. Bowyer et al. “The digital database for screening mammography”. In: Third International Workshop on Digital Mammography (1996), p. 27.
[3] C Dromain et al. “Computed-aided diagnosis (CAD) in the detection of breast cancer”. In:
European journal of radiology 82.3 (2013), pp. 417–423.
[4] Matthias Elter and Alexander Horsch. “CADx of mammographic masses and clustered micro-
calcifications: a review”. In: Medical physics 36.6Part1 (2009), pp. 2052–2068.
[5] U Fischer et al. “Comparative study in patients with microcalcifications: full-field digital mam-
mography vs screen-film mammography”. In: European radiology 12.11 (2002), pp. 2679–2683.
[6] Akhilesh Gotmare et al. “A closer look at deep learning heuristics: Learning rate restarts, warmup
and distillation”. In: arXiv preprint arXiv:1810.13243 (2018).
[7] James A Hanley and Barbara J McNeil. “The meaning and use of the area under a receiver
operating characteristic (ROC) curve.” In: Radiology 143.1 (1982), pp. 29–36.
[8] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016, pp. 770–778.
[9] Thijs Kooi et al. “Large scale deep learning for computer aided detection of mammographic
lesions”. In: Medical image analysis 35 (2017), pp. 303–312.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep
convolutional neural networks”. In: Advances in neural information processing systems 25 (2012),
pp. 1097–1105.
[11] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: nature 521.7553 (2015),
pp. 436–444.
[12] Yann LeCun et al. “Backpropagation applied to handwritten zip code recognition”. In: Neural
computation 1.4 (1989), pp. 541–551.
[13] Rebecca Sawyer Lee et al. “A curated mammography data set for use in computer-aided detection
and diagnosis research”. In: Scientific data 4.1 (2017), pp. 1–9.
[14] Constance D Lehman et al. “Diagnostic accuracy of digital screening mammography with and
without computer-aided detection”. In: JAMA internal medicine 175.11 (2015), pp. 1828–1837.
[15] Thomas M Lehmann et al. “IRMA–Content-based image retrieval in medical applications”. In:
MEDINFO 2004. IOS Press. 2004, pp. 842–846.
[16] John M Lewin et al. “Comparison of full-field digital mammography with screen-film mammog-
raphy for cancer detection: results of 4,945 paired examinations”. In: Radiology 218.3 (2001),
pp. 873–880.
[17] Scott Mayer McKinney et al. “International evaluation of an AI system for breast cancer screen-
ing”. In: Nature 577.7788 (2020), pp. 89–94.
[18] Rhaylander Mendes de Miranda Almeida et al. “Machine Learning Algorithms for Breast Cancer
Detection in Mammography Images: A Comparative Study”. In: (2021).
[19] Inês C Moreira et al. “Inbreast: toward a full-field digital mammographic database”. In: Academic
radiology 19.2 (2012), pp. 236–248.
[20] Daniel G Petrini et al. End-to-end training of convolutional network for breast cancer detection
in two-view mammography. 2021.
[21] Daniel G Petrini et al. High-accuracy breast cancer detection in mammography using EfficientNet
and end-to-end training. 2021.
[22] Alejandro Rodriguez-Ruiz et al. “Stand-alone artificial intelligence for breast cancer detection
in mammography: comparison with 101 radiologists”. In: JNCI: Journal of the National Cancer
Institute 111.9 (2019), pp. 916–922.
[23] Olga Russakovsky et al. “Imagenet large scale visual recognition challenge”. In: International
journal of computer vision 115.3 (2015), pp. 211–252.
[24] Thomas Schaffter et al. “Evaluation of combined artificial intelligence and radiologist assessment
to interpret screening mammograms”. In: JAMA network open 3.3 (2020), e200265–e200265.
[25] Shen. https://github.com/lishen/end2end-all-conv/issues/5. Accessed: 2021-07-28. 2021.
[26] Li Shen et al. “Deep learning to improve breast cancer detection on screening mammography”.
In: Scientific reports 9.1 (2019), pp. 1–12.
[27] Xin Shu et al. “Deep neural networks with region-based pooling structures for mammographic
image classification”. In: IEEE transactions on medical imaging 39.6 (2020), pp. 2246–2255.
[28] J. Suckling et al. “The Mammographic Image Analysis Society digital mammogram database”. In: Digital Mammography (1994), pp. 375–386.
[29] Mingxing Tan and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural
networks”. In: International Conference on Machine Learning. PMLR. 2019, pp. 6105–6114.
[30] Mingxing Tan et al. “Mnasnet: Platform-aware neural architecture search for mobile”. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 2820–
2828.
[31] Tao Wei et al. “Beyond Fine-tuning: Classifying High Resolution Mammograms using Function-
Preserving Transformations”. In: arXiv preprint arXiv:2101.07945 (2021).
[32] Nan Wu et al. “Deep neural networks improve radiologists’ performance in breast cancer screen-
ing”. In: IEEE transactions on medical imaging 39.4 (2019), pp. 1184–1194.
[33] Xiangyu Zhang et al. “Accelerating very deep convolutional networks for classification and detec-
tion”. In: IEEE transactions on pattern analysis and machine intelligence 38.10 (2015), pp. 1943–
1955.