Breast Cancer Diagnosis in Two-View Mammography Using End-to-End Trained EfficientNet-Based Convolutional Network
Daniel G.P. Petrini1 , Carlos Shimizu2 , Rosimeire A. Roela3 , Gabriel V. Valente3 , Maria A.K.
Folgueira3 , Hae Yong Kim1
Abstract
Some recent studies have described deep convolutional neural networks to diagnose breast
cancer in mammograms with similar or even superior performance to that of human experts. Shen
et al. (2019) present one of the best techniques that consists of two transfer learnings. The
first uses a model trained on natural images to create a “patch classifier” that categorizes small
subimages. The second uses the patch classifier to scan the whole mammogram and create the
“single-view whole-image classifier”. We propose to make a third transfer learning to obtain a
“two-view classifier” to use the two mammographic views: bilateral craniocaudal and mediolateral
oblique. We use modern EfficientNet as the basis of our model. We “end-to-end” train the entire
system using the CBIS-DDSM dataset. To ensure statistical robustness, we test our system twice using:
(a) 5-fold cross validation; and (b) the original training/test division of the dataset. Our technique
reached an AUC of 0.934 using 5-fold cross validation (sensitivity and specificity are 85.13% at the
equal error rate of ROC). Using the original dataset division, our technique achieved an AUC of
0.8483, the largest AUC reported for this problem, as far as we know.
1 Introduction
Major medical and governmental health agencies endorse mammography screening programs because they reduce breast cancer-specific mortality, and more and more women follow this recommendation. As a consequence, the number of mammograms that must be analyzed is increasing day after day. Mammograms must be interpreted by experienced radiologists to achieve a low error
rate. To help radiologists, CAD (Computer-Aided Detection and Diagnosis) systems have been and
are being developed.
Recently, there has been a revolution in artificial intelligence (AI) and computer vision with the
introduction of the deep convolutional neural network (CNN) [12, 10, 11]. Some recent works have
proposed using CNNs to diagnose cancer in mammograms. However, there are important differences between classifying natural images and mammograms. In natural images, the object that defines the image category usually occupies a large area. This does not happen in mammograms, where the cancerous tissue may occupy only a tiny area. Consequently, directly training a CNN or performing conventional transfer learning to classify mammograms usually does not work well.
Shen et al. [26] present an elegant idea to overcome this challenge, which consists of performing two
transfer learnings. The first uses a model trained on the ImageNet [23] natural images to initialize the
“patch classifier” that classifies small mammogram patches into five categories: background, benign
calcification, malignant calcification, benign mass, and malignant mass. The second uses the patch
classifier to initialize the “single-view whole-image classifier” that is end-to-end trained using whole
mammograms with cancer status. In other words, they first build the patch classifier because it is easier
1 Escola Politécnica, Universidade de São Paulo, São Paulo, Brazil
2 Instituto do Câncer do Estado de São Paulo, São Paulo, Brazil
3 Faculdade de Medicina, Universidade de São Paulo, São Paulo, Brazil
than building a whole image classifier. Subsequently, the patch classifier scans the entire mammogram,
generating attribute maps that describe the likelihood of having different types of lesions in each region
of the mammogram. The whole image classifier uses these maps to make the final classification and is
end-to-end trained. In this paper, we propose some improvements to Shen et al.’s method to increase
its performance:
(1) The original technique used ResNet [8] and VGG [33] as the base models. We replaced these
rather outdated models with the modern EfficientNet [29].
(2) Standard mammography consists of two views for each breast: bilateral craniocaudal (CC) and
mediolateral oblique (MLO). The original algorithm processes only one view at a time and, to take the
two views into account, it simply averages the scores of the two views processed independently. Our
technique performs a third transfer learning, in addition to the original two, to take into account the
two views. We use the single-view classifier to initialize the “two-view classifier” and then the entire
system (patch, single-view and two-view classifiers) is end-to-end trained, using two-view mammograms
with cancer status.
With the above improvements, together with test-time augmentation (TTA) and an ensemble of four models with the same architecture, we achieved an AUC (Area Under the ROC Curve) of 0.9344±0.0341 in 5-fold cross-validation using the CBIS-DDSM dataset (sensitivity and specificity are 85.13% at the
equal error rate point of the ROC). It is important to note that these performance measures cannot
be directly compared with those of radiologists, as all CBIS-DDSM images have at least one lesion,
whereas most mammograms examined by radiologists are normal. It is known that a substantially
smaller AUC is obtained using the original CBIS-DDSM training/test division [31]. In this condition,
we obtained an AUC of 0.8483±0.0253 (with TTA). As far as we know, this is the largest AUC reported
for this problem.
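For clarity, the sensitivity and specificity at the equal error rate point mentioned above can be read off the ROC curve as in the minimal sketch below. This is an illustration using scikit-learn, not code from our repository; y_true and y_score are placeholder names for the ground-truth labels and predicted malignancy scores.

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    def eer_operating_point(y_true, y_score):
        # ROC curve: fpr = 1 - specificity, tpr = sensitivity
        fpr, tpr, thr = roc_curve(y_true, y_score)
        # equal error rate point: the threshold where sensitivity ~ specificity
        i = np.argmin(np.abs(tpr - (1.0 - fpr)))
        return {"auc": roc_auc_score(y_true, y_score),
                "sensitivity": tpr[i],
                "specificity": 1.0 - fpr[i],
                "threshold": thr[i]}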
A useful property of both Shen et al.'s method and ours is that they allow the use of datasets with partial ROI annotations. In this case, we can use the subset with the ROI locations
to train the patch classifier and the subset without this information (with just the status of whole
images) to train the whole-image classifiers.
A previous version of this work was presented at the 2021 AACR annual meeting and published as an
abstract [20, 21] and a preprint version is available at https://arxiv.org/abs/2110.01606. The inference
code of the two-view classifier and the model are available at https://github.com/dpetrini/two-views-
classifier.
2 Related works
2.1 Early CAD systems
A CAD system should receive a mammogram and report the likelihood that the patient has cancer.
Early CAD systems split this task into two sub-problems [1]: CADe (Computer-Aided Detection)
that detects ROIs suspected of being cancerous; and CADx (Computer-Aided Diagnosis) that classifies
ROIs as cancer or non-cancer. In 1998, the FDA approved the use of CADe for mammography. After that, CADe systems spread across the United States, at a cost of over $400 million a year. However, Lehman et al. [14], after analyzing 495,000 mammograms, demonstrated that screening performance was not improved by CADe. The specificity of early CADe systems was very low, generating around 1 false positive mark per view. Thus, radiologists assisted by these systems improved sensitivity in cancer diagnosis (by 4% to 15%) but worsened specificity (by 5% to 35%) [3]. Once ROIs were localized, CADx classified them as benign or malignant. Usually, handcrafted features were extracted
from the ROIs [4] and then classical machine learning techniques classified them. The AUCs of old
CADx systems refer only to the classification of ROIs and cannot be directly compared with the AUCs
of modern techniques that classify whole mammograms.
2.2 Deep learning-based CAD systems
Recently, several works have described deep learning systems with similar or even superior performance to that of human specialists.
Kooi et al. [9] compared the classification of mammography ROIs by a state-of-the-art classic method, a CNN-based method, and radiologists. They concluded that the CNN has performance comparable to that of radiologists and superior to that of the classic method.
Rodriguez-Ruiz et al. [22] compared a CNN-based commercial system (Transpara 1.4.0) with 101 radi-
ologists, using 9 datasets from different institutions in the US and Europe. The AUC of the AI system
was 0.840, while the mean AUC of the radiologists was 0.814. Therefore, the AI was better than the average radiologist, but its performance was inferior to that of the best radiologist.
Schaffter et al. [24] describe the “DM DREAM Challenge”, held between September 2016 and November 2017 to foster the development of AI algorithms for interpreting mammograms. The top-performing
single algorithm achieved an AUC of 0.858 (in the US dataset) and 0.903 (in the Swedish dataset). No
single or ensemble algorithm outperformed radiologists.
McKinney et al. [17] present an AI system that surpasses human experts in breast cancer prediction.
This system consists of an ensemble of three deep learning models that were tested on private UK and US datasets, achieving AUCs of 0.889 and 0.8107, respectively.
Wu et al. [32] designed a four-view deep learning system and trained it with over 200,000 exams,
of which 5,832 had biopsies and 985 had biopsy-confirmed malignancies. They achieved an AUC of
0.895 in predicting cancer using 4 views, which is higher than the radiologists’ average AUC of 0.778.
Although both Wu et al.'s work and ours use multiple views to classify cancer, there are fundamental
differences that we explain in Section “Multi-view technique by Wu et al.”
Recent works that, like ours, use the CBIS-DDSM dataset and whose results can be compared with ours are mentioned in Section “Recent works that use CBIS-DDSM”.
b) There is no large public standard mammogram database where different CAD systems can be
compared. Each system uses its own dataset, making the comparison difficult.
c) CBIS-DDSM is the largest public dataset (although still small for deep learning), but it is composed
of low-quality SFM. It is not possible to make a fair comparison between systems that use FFDM
and SFM mammograms.
It is difficult to compare techniques even when all systems use the same dataset (e.g. CBIS-DDSM).
Many works in the literature randomly divide the dataset into training and test sets [26, 31]. This
procedure can generate biased results, as there is the possibility of randomly choosing a test set that
is easy (or difficult) to classify, especially when the dataset is not large enough, like CBIS-DDSM.
This phenomenon can be seen clearly in our own results. When the CBIS-DDSM dataset is randomly
divided into 5 subsets and our two-view classifier is trained using 4 subsets and tested on the remaining
set, the 5 obtained AUCs vary from 0.90 to 0.99 (4 models with TTA, see the Table named “AUCs
of our two-view classifiers in CV test”). Thus, if we were lucky in the random division, our two-view classifier would reach an astonishing AUC of 0.99 and, if we were unlucky, it would reach only 0.90. Neither
of the two values reflects the true performance of our system. Consequently, the results obtained using
random training/test division are unreliable.
In particular, a remarkably smaller AUC is obtained using the official training/test division of CBIS-DDSM. Shen et al. [26] obtained an AUC of 0.87 using a random training/test division but, using the
official division, their system achieves only an estimated AUC of ~0.75 (single runs) [25]. Similarly,
Wei et al. [31] obtained an AUC of 0.9182 using a random division but only 0.7964 using the official
division (single runs). According to Wei et al., this happens because the test data is another holdout set acquired at a different time.
Bootstrapping the test set consists of randomly choosing different subsets of the test set many times, evaluating the system performance on those subsets, and averaging the results to estimate the dispersion of the CAD performance measure. This procedure only tests the same fixed model many times; it does not measure the dispersion that would be observed by making different training/test divisions. Thus, performance dispersions measured using bootstrapping on a fixed division are also unreliable.
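The following minimal sketch illustrates this point (it is not the code used in this paper): the model is fixed and only the test set is resampled with replacement, so the resulting spread reflects test-set sampling, not the variability caused by different training/test divisions.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_auc(y_true, y_score, n_boot=1000, seed=0):
        # y_true, y_score: labels and predicted scores of the FIXED model on the test set
        rng = np.random.default_rng(seed)
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        aucs = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
            if len(np.unique(y_true[idx])) < 2:                   # AUC needs both classes
                continue
            aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
        return float(np.mean(aucs)), float(np.std(aucs))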
                                      Training                    Test
Category                         Benign    Malignant      Benign    Malignant       Total
Mammograms                        1,347        1,111         381          264       3,103
  Subtotal                           2,458 (79.21%)           645 (20.79%)         (100%)
Mammograms with 2 views           1,180          968         324          222       2,694
  Subtotal                           2,148 (79.73%)           546 (20.27%)         (100%)
Mammograms with 1 view              174          136          61           38         409
Figure 1: Left: we randomly chose 10 background (yellow) patches anywhere but in the lesion; we
delimited a (white) region centered at the lesion and sampled 10 patches with random horizontal and
vertical displacements within this region. Right: the lesion segmentation mask provided by CBIS-
DDSM.
From each lesion, we sampled 10 patches of 224x224 pixels around its center, with a random displacement of ±10% of the height/width (inside the white rectangle in Figure 1). Next, we sampled 10 background patches from anywhere in the image except the ROIs. We further divided the patches containing lesions into 4 subcategories according to their labels in CBIS-DDSM: benign calcification, malignant calcification, benign mass and malignant mass. Thus, a patch can be one of 5 types, with background patches making up 50% and the remaining categories 9.5%, 17.5%, 11.1% and 11.9%, respectively. We did not use any technique to compensate for this imbalance.
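A minimal sketch of this sampling procedure is shown below. It assumes a grayscale mammogram and the CBIS-DDSM lesion segmentation mask as NumPy arrays; the function name, the interpretation of the ±10% displacement as a fraction of the image height/width, and the rejection loop for background patches are illustrative choices, not code from our repository.

    import numpy as np

    def sample_patches(image, roi_mask, size=224, n=10, seed=0):
        # image: 2-D grayscale mammogram; roi_mask: binary lesion segmentation mask
        rng = np.random.default_rng(seed)
        h, w = image.shape
        cy, cx = np.argwhere(roi_mask > 0).mean(axis=0).astype(int)   # lesion center
        lesion, background = [], []
        while len(lesion) < n:                       # lesion-centered patches, random shift
            y = int(np.clip(cy + rng.uniform(-0.1, 0.1) * h - size // 2, 0, h - size))
            x = int(np.clip(cx + rng.uniform(-0.1, 0.1) * w - size // 2, 0, w - size))
            lesion.append(image[y:y + size, x:x + size])
        while len(background) < n:                   # background patches outside the ROI
            y, x = rng.integers(0, h - size), rng.integers(0, w - size)
            if roi_mask[y:y + size, x:x + size].sum() == 0:
                background.append(image[y:y + size, x:x + size])
        return lesion, background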
There are 8 models of EfficientNet, numbered from B0 to B7 [29]. EfficientNet-B0 is the smallest
model and was designed automatically by the Neural Architecture Search. Then, this base model was
scaled up in width, depth and resolution of the input image to obtain the remaining seven models.
We took EfficientNets pre-trained on ImageNet [23] images and performed transfer learning to classify
mammogram patches into 5 categories. As mammograms have only one channel, the same grayscale
feeds EfficientNet’s red, green and blue inputs. When an EfficientNet without the top layers is fed with
a 224x224 patch, it yields a model-dependent number of 7x7 attribute maps. For example, EfficientNet-B0, B4 and B7 generate 1280, 1792 and 2560 such maps, respectively. These maps are average-pooled and passed through a fully-connected layer with five outputs to perform the classification into 5 categories.
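A hedged sketch of such a patch classifier is shown below, using torchvision's EfficientNet-B4 and assuming a recent torchvision release; the class name and the way the pre-trained weights are requested are our choices, only the overall structure follows the text.

    import torch.nn as nn
    from torchvision import models

    class PatchClassifier(nn.Module):
        def __init__(self, num_classes=5):
            super().__init__()
            base = models.efficientnet_b4(weights="IMAGENET1K_V1")  # first transfer learning
            self.features = base.features            # convolutional backbone (MBConv blocks)
            self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling of the 7x7 maps
            self.fc = nn.Linear(1792, num_classes)   # 1792 attribute maps for B4

        def forward(self, x):                        # x: (N, 1, 224, 224) grayscale patches
            x = x.repeat(1, 3, 1, 1)                 # same grayscale fed to R, G and B inputs
            x = self.features(x)                     # (N, 1792, 7, 7) attribute maps
            x = self.pool(x).flatten(1)              # (N, 1792)
            return self.fc(x)                        # logits for the 5 patch categories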
Figure 2: Learning rate used to train the patch classifier in the “OD test”.
Table 2: Accuracies of patch classifiers and AUCs of single-view classifiers using different base models.
Figure 3: Diagrams of the single-view classifier for the “CV test” (top) and “OD test” (bottom).
The output of the last MBConv block is followed by global average pooling and a dense layer with two
output categories.
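Consistent with this description, a minimal sketch of the single-view whole-image classifier is given below. It reuses the backbone of the patch classifier sketched earlier (the second transfer learning); the whole-image input resolution and the exact top layers of our diagrams are omitted for brevity.

    import torch.nn as nn

    class SingleViewClassifier(nn.Module):
        def __init__(self, backbone, channels=1792):
            super().__init__()
            # backbone: the `features` module of the trained patch classifier
            self.features = backbone                 # second transfer learning
            self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling
            self.fc = nn.Linear(channels, 2)         # two output categories (cancer status)

        def forward(self, x):                        # x: (N, 1, H, W) whole mammogram
            x = x.repeat(1, 3, 1, 1)
            x = self.pool(self.features(x)).flatten(1)
            return self.fc(x)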
Fold                                                   1       2       3       4       5    mean     std
ResNet50, 1 model without TTDA                    0.7541  0.9089  0.8553  0.9057  0.8320  0.8512  0.0567
EfficientNet-B4, 1 model without TTDA             0.8371  0.8455  0.8865  0.9220  0.8874  0.8757  0.0310
EfficientNet-B4, 1 model with TTDA                0.8507  0.8570  0.8908  0.9255  0.8965  0.8841  0.0274
EfficientNet-B4, ensemble of 4 models without TTDA 0.8419 0.8566  0.8945  0.9263  0.8984  0.8835  0.0304
EfficientNet-B4, ensemble of 4 models with TTDA   0.8653  0.8634  0.8942  0.9257  0.9048  0.8907  0.0238
Table 4: Comparison of different single-view classifiers (single runs) using the original CBIS-DDSM
division.
Figure 4: Diagram of the two-view classifier for the “CV test”.
The attribute maps produced by the two views are concatenated; the final classification is obtained by average pooling these maps, followed by a dense layer.
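A hedged sketch of this two-view classifier is given below. Whether the two branches share weights, and the exact new top layers (the best configuration in Table 7 uses two MBConv blocks, represented here by a single 1x1 convolution block), are simplifications of ours, not a faithful reproduction of Figure 4.

    import torch
    import torch.nn as nn

    class TwoViewClassifier(nn.Module):
        def __init__(self, backbone, channels=1792):
            super().__init__()
            # backbone: the `features` module of the trained single-view classifier
            self.features = backbone                     # third transfer learning
            self.top = nn.Sequential(                    # stand-in for the new top layers
                nn.Conv2d(2 * channels, channels, kernel_size=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.SiLU(inplace=True),
            )
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.fc = nn.Linear(channels, 2)

        def forward(self, cc, mlo):                      # the two views of the same breast
            maps_cc = self.features(cc.repeat(1, 3, 1, 1))
            maps_mlo = self.features(mlo.repeat(1, 3, 1, 1))
            x = torch.cat([maps_cc, maps_mlo], dim=1)    # concatenate maps before pooling
            x = self.top(x)
            x = self.pool(x).flatten(1)
            return self.fc(x)                            # logits: benign vs. malignant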
Fold                                          1       2       3       4       5    mean     std
ResNet50, 1 model without TTDA           0.8421  0.9630  0.9522  0.9730  0.8972  0.9255  0.0492
EfficientNet-B4, 1 model without TTDA    0.8891  0.8880  0.9486  0.9882  0.9350  0.9298  0.0379
EfficientNet-B4, 1 model with TTDA       0.8840  0.8926  0.9514  0.9895  0.9402  0.9315  0.0390
EfficientNet-B4, 4 models without TTDA   0.8904  0.8933  0.9445  0.9893  0.9324  0.9300  0.0365
EfficientNet-B4, 4 models with TTDA      0.9004  0.8963  0.9462  0.9896  0.9397  0.9344  0.0341
Figure 6: ROCs of our two-view classifiers in “CV test” (with TTA and ensemble of 4 models).
Method                  AUC              Observation
Our two-view            0.8418±0.0258    Single run
Our two-view            0.8483±0.0253    1 model with TTA
Shen et al. [26, 25]    ~0.75            Estimated (single run)
Shu et al. [27]         0.838            Unclear if the original division is used
Wei et al. [31]         0.7964           Single run
Wei et al. [31]         0.8187           1 model with TTA
Wei et al. [31]         0.8313           4 models with TTA
Almeida et al. [18]     0.6824           Resized to 224x224, single run
Figure 7: ROC curve of our two-view classifier in “OD test” (with TTA).
Concatenate maps        New top layers       AUC
after average pools     only dense layers    0.9225±0.0405
before average pools    2 MBConv blocks      0.9298±0.0379
Table 7: Comparison between concatenating the attribute maps after or before average poolings. All
tests used EfficientNet-B4 as the base model (single runs).
our classifier processes each view with EfficientNet-B0 and concatenates the attribute maps before
doing average poolings. We tested the two ideas (concatenating the attribute maps after or before the
average poolings), always using EfficientNet-B4 as the base model, and the results show that better
results are obtained when concatenating the maps before the average poolings (Table 7). This is not surprising, as information about the spatial locations of lesions is lost by average pooling.
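To make the difference concrete, the shapes involved in the two fusion orders are illustrated below (assuming EfficientNet-B4's 1792 channels and 7x7 attribute maps; the tensors are random placeholders):

    import torch

    maps_cc  = torch.randn(1, 1792, 7, 7)   # attribute maps of the CC view
    maps_mlo = torch.randn(1, 1792, 7, 7)   # attribute maps of the MLO view

    # (a) concatenating AFTER the average poolings: spatial information is discarded first
    after = torch.cat([maps_cc.mean(dim=(2, 3)), maps_mlo.mean(dim=(2, 3))], dim=1)
    print(after.shape)    # torch.Size([1, 3584])

    # (b) concatenating BEFORE the average poolings: the new top layers still see
    #     where the lesion evidence is located in each view
    before = torch.cat([maps_cc, maps_mlo], dim=1)
    print(before.shape)   # torch.Size([1, 3584, 7, 7])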
5 Conclusions
In this paper, we have presented a new high performance breast cancer CAD (Computer-Aided De-
tection and Diagnosis) system. We have proposed a deep convolutional network that simultaneously
takes into account the two mammographic views of the same breast and that is end-to-end trained using three transfer learnings:
1. First, we use the weights of an EfficientNet trained on natural images to train the patch classifier.
2. Second, we use the patch classifier to initialize and train the single-view whole-image classifier.
3. Third, we use the single-view classifier to initialize the two-view classifier; the entire system is then end-to-end trained using two-view mammograms with cancer status.
Acknowledgements
This work was partially supported by CNPq (National Council for Scientific and Technological Devel-
opment) process number 305377/2018-3.
Author contributions
All authors contributed ideas that resulted in the proposed algorithms. D.G.P.P. developed the algorithms and conducted the experiments. H.Y.K. supervised the experiments and wrote the main manuscript text. C.S., R.A.R., G.V.V. and M.A.K.F. provided medical information
about the problem. All authors performed the bibliographic review and reviewed the manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to D.G.P.P.
References
[1] Turgay Ayer et al. “Computer-aided diagnostic models in breast cancer screening”. In: Imaging
in medicine 2.3 (2010), p. 313.
[2] K. Bowyer et al. “The digital database for screening mammography”. In: Third International Workshop on Digital Mammography (1996), p. 27.
[3] C Dromain et al. “Computed-aided diagnosis (CAD) in the detection of breast cancer”. In:
European journal of radiology 82.3 (2013), pp. 417–423.
[4] Matthias Elter and Alexander Horsch. “CADx of mammographic masses and clustered micro-
calcifications: a review”. In: Medical physics 36.6Part1 (2009), pp. 2052–2068.
[5] U Fischer et al. “Comparative study in patients with microcalcifications: full-field digital mam-
mography vs screen-film mammography”. In: European radiology 12.11 (2002), pp. 2679–2683.
[6] Akhilesh Gotmare et al. “A closer look at deep learning heuristics: Learning rate restarts, warmup
and distillation”. In: arXiv preprint arXiv:1810.13243 (2018).
[7] James A Hanley and Barbara J McNeil. “The meaning and use of the area under a receiver
operating characteristic (ROC) curve.” In: Radiology 143.1 (1982), pp. 29–36.
[8] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016, pp. 770–778.
[9] Thijs Kooi et al. “Large scale deep learning for computer aided detection of mammographic
lesions”. In: Medical image analysis 35 (2017), pp. 303–312.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep
convolutional neural networks”. In: Advances in neural information processing systems 25 (2012),
pp. 1097–1105.
[11] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: nature 521.7553 (2015),
pp. 436–444.
[12] Yann LeCun et al. “Backpropagation applied to handwritten zip code recognition”. In: Neural
computation 1.4 (1989), pp. 541–551.
[13] Rebecca Sawyer Lee et al. “A curated mammography data set for use in computer-aided detection
and diagnosis research”. In: Scientific data 4.1 (2017), pp. 1–9.
[14] Constance D Lehman et al. “Diagnostic accuracy of digital screening mammography with and
without computer-aided detection”. In: JAMA internal medicine 175.11 (2015), pp. 1828–1837.
[15] Thomas M Lehmann et al. “IRMA–Content-based image retrieval in medical applications”. In:
MEDINFO 2004. IOS Press. 2004, pp. 842–846.
[16] John M Lewin et al. “Comparison of full-field digital mammography with screen-film mammog-
raphy for cancer detection: results of 4,945 paired examinations”. In: Radiology 218.3 (2001),
pp. 873–880.
[17] Scott Mayer McKinney et al. “International evaluation of an AI system for breast cancer screen-
ing”. In: Nature 577.7788 (2020), pp. 89–94.
[18] Rhaylander Mendes de Miranda Almeida et al. “Machine Learning Algorithms for Breast Cancer
Detection in Mammography Images: A Comparative Study”. In: (2021).
[19] Inês C Moreira et al. “Inbreast: toward a full-field digital mammographic database”. In: Academic
radiology 19.2 (2012), pp. 236–248.
[20] Daniel G Petrini et al. End-to-end training of convolutional network for breast cancer detection
in two-view mammography. 2021.
[21] Daniel G Petrini et al. High-accuracy breast cancer detection in mammography using EfficientNet
and end-to-end training. 2021.
[22] Alejandro Rodriguez-Ruiz et al. “Stand-alone artificial intelligence for breast cancer detection
in mammography: comparison with 101 radiologists”. In: JNCI: Journal of the National Cancer
Institute 111.9 (2019), pp. 916–922.
[23] Olga Russakovsky et al. “Imagenet large scale visual recognition challenge”. In: International
journal of computer vision 115.3 (2015), pp. 211–252.
[24] Thomas Schaffter et al. “Evaluation of combined artificial intelligence and radiologist assessment
to interpret screening mammograms”. In: JAMA network open 3.3 (2020), e200265–e200265.
[25] Shen. https://github.com/lishen/end2end-all-conv/issues/5. Accessed: 2021-07-28. 2021.
[26] Li Shen et al. “Deep learning to improve breast cancer detection on screening mammography”.
In: Scientific reports 9.1 (2019), pp. 1–12.
[27] Xin Shu et al. “Deep neural networks with region-based pooling structures for mammographic
image classification”. In: IEEE transactions on medical imaging 39.6 (2020), pp. 2246–2255.
[28] J. Suckling et al. “The Mammographic Image Analysis Society digital mammogram database”. In: Digital Mammography (1994), pp. 375–386.
[29] Mingxing Tan and Quoc Le. “Efficientnet: Rethinking model scaling for convolutional neural
networks”. In: International Conference on Machine Learning. PMLR. 2019, pp. 6105–6114.
[30] Mingxing Tan et al. “Mnasnet: Platform-aware neural architecture search for mobile”. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 2820–
2828.
[31] Tao Wei et al. “Beyond Fine-tuning: Classifying High Resolution Mammograms using Function-
Preserving Transformations”. In: arXiv preprint arXiv:2101.07945 (2021).
[32] Nan Wu et al. “Deep neural networks improve radiologists’ performance in breast cancer screen-
ing”. In: IEEE transactions on medical imaging 39.4 (2019), pp. 1184–1194.
[33] Xiangyu Zhang et al. “Accelerating very deep convolutional networks for classification and detec-
tion”. In: IEEE transactions on pattern analysis and machine intelligence 38.10 (2015), pp. 1943–
1955.