An End-to-End Breast Tumour Classification Model Using Context-Based Patch Modelling - A BiLSTM Approach for Image Classification
A R T I C L E  I N F O

Keywords:
Whole slide images
Microscopy images
Histopathology
Deep learning
Convolutional neural networks
ICIAR
BACH
Computational pathology
Classification
BiLSTMs
RNN
LSTM
Breast tumours classification

A B S T R A C T

Researchers working on computational analysis of Whole Slide Images (WSIs) in histopathology have primarily resorted to patch-based modelling because of the large resolution of each WSI. The large resolution makes WSIs infeasible to feed directly into machine learning models due to computational constraints. However, because of patch-based analysis, most current methods fail to exploit the underlying spatial relationship among the patches. In our work, we have tried to integrate this relationship along with feature-based correlation among the patches extracted from a particular tumorous region. The tumour regions extracted from WSIs have arbitrary dimensions, ranging from 20,570 to 195 pixels across the width and from 17,290 to 226 pixels across the height. For the given classification task, we have used BiLSTMs to model both forward and backward contextual relationships. Moreover, using an RNN-based model removes the limitation on sequence size, which allows variable-size images to be modelled within a deep learning model. We have also incorporated the effect of spatial continuity by exploring different scanning techniques used to sample patches. To establish the efficiency of our approach, we trained and tested our model on two datasets, microscopy images and WSI tumour regions, both published by the ICIAR BACH Challenge 2018. Finally, we compared our results with the top 5 teams who participated in the BACH challenge and achieved the top accuracy of 90% for the microscopy image dataset. For the WSI tumour region dataset, we compared the classification results with state-of-the-art deep learning networks such as ResNet, DenseNet, and InceptionV3 using the maximum voting technique, and achieved the highest accuracy of 84%. We found that BiLSTMs with CNN features perform much better in modelling patches into an end-to-end image classification network. Additionally, the variable dimensions of WSI tumour regions were used for classification without the need for resizing. This suggests that our method is independent of tumour image size and can process large-dimensional images without losing resolution details.
1. Introduction

Image based computational pathology has developed into an ever-evolving field for computer vision researchers. New methods are being introduced frequently in this field for natural everyday scenes, face recognition, video analysis and other forms of biometrics. Despite that, the rate of development of medical image CAD algorithms for enhancing their diagnostic performance could not mirror the rate of development of new natural scene analysis algorithms. This may be due to the highly heterogeneous nature of cancer cells, which increases the complexity of the task at hand. In the context of breast cancer, extensive research based on oncogenic pathways and tumor cell metabolism, and on chemotherapeutic observations, has led pathologists to realize that the disease is quite unpredictable Leong and Zhuang (2011).

Hence, there is a pressing need for the development of computer vision algorithms that are particularly advanced for diagnostic and prognostic evaluation of digitized biopsy images. Until then, there is a progressive adaptation of currently available state of the art methods for cancer detection, segmentation, and classification. The importance of precise prognosis in this field requires the differentiation of digitized samples into two, three, or more classes.
In our work, we have four classes: Normal, Benign, In situ carcinoma, and Invasive carcinoma. The classification of breast samples by the pathologist helps in a more accurate understanding of the disease and consequently helps in the directed treatment of patients. The manual process is, however, quite time consuming and requires an expert's knowledge due to the underlying complexity of the nature of the images. The effort to automate such a non-trivial problem requires expert intervention to verify the diagnosis made by the CAD process. Besides that, the feasibility of implementation of such algorithms poses preliminary challenges. For instance, the high resolution of gigapixel Whole Slide Images could not be processed by any current state of the art algorithm due to their large size. The large amount of information present in one patient slide makes the task more challenging concerning space and efficiency. Therefore, for practical problem solving, we need to either build new systems that could address such challenges or find a workaround so that our problems could be feasibly addressed by available systems. One such workaround is dividing the WSI into patches of a size that can be easily fed into the algorithm. However, this leads to loss of the overall structure of the tumor and of various other sub-structures present in the slide. The spatial continuity of the patches also becomes hard to incorporate within a deep end-to-end model. The task becomes more non-trivial in the case of a 4-class problem rather than a 2-class problem where the structures need to be segregated between two widely spaced classes. As the number of classes or segregations increases, the space between classes reduces.

Considering all these issues, we chose our model such that the gigapixel size of the WSI could be harnessed without losing the structure of the overall suspected region. The spatial relationship between the patches of the same region could be modeled end-to-end without the need to build a separate algorithm to infuse the context of the previous patch in the sequence of patches which together make up an entire tumor region.

One such state of the art model which is well known for preserving contextual relationships is the Recurrent Neural Network, commonly known as RNN. RNNs have been used by the computer vision community to process sequences such as texts and videos. We acknowledged their efficiency and formulated our problem around the strength of RNNs, which is processing sequences of patches from the same region and eventually classifying the input sequence as one of the four classes. We classified image regions as a whole using BiLSTMs. BiLSTMs are a well-known version of RNNs for modeling textual and video sequences. They have been widely used for activity recognition in videos and have proved their niche in modeling future contextual information due to their bi-directional architecture. Our method could serve its purpose in clinical diagnosis by assisting pathologists in labeling suspected regions automatically. Our main contributions are summarised in the following points:

1. According to our knowledge, this is the first study that includes the use of contextual information among the patches from the same region using BiLSTMs for classification of tumors.
2. Our method is robust to the size of the tumour regions as it can take both very large WSI regions and microscopy regions. In this study, the tumour regions vary between 17,290 and 236 pixels across the height and 20,570 and 195 pixels across the width.
3. The study did not alter the size of the tumors for deep modeling and classifies variable-size tumor regions by processing them as a sequence of features.
4. This work proposes an end-to-end network for patch-to-image classification, unlike previous literature that uses stage-wise networks to first classify patches and then aggregate the patch classification results into image labels Hou et al. (2016), Nazeri et al. (2018), Mahbod et al. (2018), Wang et al. (2019), Roy et al. (2019), Huang and Chung (2018), Shaban et al. (2019), Araújo et al. (2017).
5. This is a shallow network that does not require heavy training of hundreds of layers as in ResNet and GoogLeNet.
6. We also experimented with patch scanning methods to verify that a particular scanning technique that deploys maximal connectivity between patches is better than randomly extracting patches from the image.

2. Related work

The application of RNN based architectures such as LSTM Hochreiter and Schmidhuber (1997) and BiLSTMs Schuster and Paliwal (1997), Graves et al. (2005) to series data classification such as texts and time series has been a very common methodology. Researchers have recently started combining CNNs and LSTMs for image captioning Johnson et al. (2016), Karpathy and Fei-Fei (2015), Vinyals et al. (2015) or multi-label image classification Zhang et al. (2018), Wang et al. (2016), Wei et al. (2014), Guo et al. (2018) as well. The idea of using RNN based models for image classification stemmed from the fact that objects in an image are often, though not always, related to each other in some way. Images are not sequential data, but they carry latent semantic dependencies which can be modeled as a sequence of occurrences of certain objects present in the image that overall define the global image description. These deep LSTM based models have, however, not been sufficiently explored on high-resolution medical data. With the high dimensional images of WSIs, the tumor regions, when divided into patches, can act as sequential data with contextual dependency among the patches. Modeling this contextual information among patches is a crucial step to perform slide level classification.

There have been studies in Whole Slide Image level analysis that have drawn contextual and spatial relationships among patches using their own novel methods. For instance, the authors in Huang and Chung (2018) proposed a deep spatial fusion network to predict the image-wise label from patch-wise probability maps. They evaluated their network performance on two datasets, BIC B (2015) and BACH IB (2018), and used heavy augmentation due to the small volume of images. Their network was not end-to-end and required heavy data pre-processing steps to enhance the performance. They used microscopy images to test their model, which have dense class properties. Whereas, in the case of Whole Slide Image annotations, a tumour class like Invasive carcinoma could be spread across the gigapixel image and parts of the annotation may look like normal. Therefore, with WSIs, the parts of the annotation, when broken into patches, may not give a reliable label. Hence, such methods should be tested on such datasets as well for better clinical significance. The method in Shaban et al. (2019) exploits the spatial context between patches extracted from high resolution histopathological images for grading of colorectal cancer histology images. The authors propose a two-staged framework consisting of two stacked CNNs. The first CNN, called LR-CNN, learns the representations of the patches and aggregates the learned features from each patch in the same spatial dimension as the original image (M × N). In other words, LR-CNN converts a high resolution image into a high dimensional feature map. The next stage consists of context aware blocks, called RA-CNN, that take the feature representation cube as input to learn the spatial relationship between patches and make a context-aware prediction. The authors explored different network architectures for context-aware learning. The strategy solves the huge challenge of missing contextual information in patch-based classifiers. The robustness of the method also lies in the fact that the use of pre-trained architectures to extract features reduces the time and effort to train large models. However, the authors did not test their method on WSIs, which pose the challenge of multi-resolution feature learning and very large size. In the case of WSIs, the feature cube could be as large as the high resolution image, and its processing in a deep learning network could then become infeasible.

To address the problem of multi-resolution analysis, the majority of the previous works of literature have used patch-level analysis, which requires breaking up structures, and hence global level features are lost. But, due to multi-resolution data, it is in fact left as the only choice to process such images. All the methods using WSI datasets discussed above have done the same for developing their models.
Studies like Hou et al. (2016), Nazeri et al. (2018), Mahbod et al. (2018), Wang et al. (2019), Roy et al. (2019) have performed patch-based modelling of histopathology slides or microscopy images to perform image-wise classification using methods like probability fusion and majority voting. The authors in Wang et al. (2019) developed a two-stage processing pipeline for classifying WSIs of gastric cancer. The first stage, discriminative instance selection, selected the most informative patches on the basis of probability maps generated by a localization network. The second stage performed the image level prediction. The authors proposed a novel recalibrated multi-instance deep learning network (RMDL) with the purpose of aggregating both local and global features of each instance via a modified local-global feature fusion module. The RMDL framework presented an effective way to aggregate patches for the final image level prediction by exploiting the interrelationship of the patch features, and overcame the drawbacks of direct patch aggregation. The method is however limited in its approach as it is confined to same-scale context and does not address the spatial relationship between the instances.

The authors in Spanhol et al. (2016) studied the applicability of deep learning architectures in identifying breast cancer malignant tumours versus benign tumours. Different sets of experiments were designed to train the CNN with different strategies that allow both high and low resolution images as input.

In Bayramoglu et al. (2016), two CNN architectures have been used to identify the breast cancer tumour and the magnification of the image. The single-task CNN classifies benign and malignant tumours, whereas the multi-task CNN has two output branches which take multi-resolution image patches as input and produce two classifications: between malignant and benign, and between four classes of magnification.

Similarly, Araujo et al. Araújo et al. (2017) first proposed a patch-wise classification and then combined the patch probabilities to perform image-wise classification. They used their custom CNN model to perform patch-wise classification and achieved 66.7% accuracy. Then a majority voting scheme was used among the classified patches to predict the overall image label. This method was also not end-to-end and required extensive CNN training and experiments to decide the optimal hyper-parameters for their proposed model. They also did not consider spatial context among the patches to build a relationship between patches of the same image, which may have proved a crucial performance enhancer.

All these methods, although they solve the challenge of multi-resolution analysis by patch-level aggregation of classification results, suffer from a lack of spatial context and continuity relationship among patches. Moreover, due to the inherent limitation of state-of-the-art deep learning models, which take only a fixed size input, the previous works of literature sometimes had to perform heavy resizing to conform to the size of the network input. Therefore, a CNN + RNN based model could be the perfect replacement for such models since it could provide both spatial and contextual modelling and a strategic region extraction method without the limitation of resizing, along with an end-to-end compact model to process high-resolution Whole Slide and microscopy images.

A few recent pieces of literature have used such CNN + RNN models for the analysis of histopathological data. For instance, the paper Qaiser and Rajpoot (2019) explores the application of deep reinforcement learning in predicting the diagnostically relevant regions and their HER2 scores in breast immunohistochemical (IHC) Whole Slide Images. For the given task, the authors proposed a context module and a CNN-LSTM end-to-end model. The model intelligently views the WSI as the environment and the CNN-LSTM acts as the decision maker or agent. Their model successfully mimics histopathological expert analysis that first looks coarsely at ROIs at low resolution and then predicts the scores of diagnostically relevant regions. Their model also incorporates multi-resolution analysis by combining features of the same region at multiple resolutions for better predictive performance. The main advantage of their model is that one need not look at all the regions of a WSI to predict the outcome and instead could focus on a small number of regions without sacrificing the performance of the model. Similarly, Ren et al. (2018) and Bychkov et al. (2018) have also used the combination of CNN-LSTM for disease outcome prediction. The authors Ren et al. (2018) used genomic data (Pathway Scores, PS) with disease recurrence extracted from gene expression signatures exhibited in prostate tumors with a Gleason 7 score to identify a prognostic marker. They calculated the PS scores and combined them with a deep learning model for the purpose of combining the prognostic markers with image biomarkers. The deep learning model used is a CNN-LSTM end-to-end model that takes WSI patches as the input sequence. The CNN finds the features, which the LSTM processes to output the final hazard ratios of recurrence of the disease. They compared their model performance using different image features (LBP, HOG, SURF, neurons) combined with pathway scores. The results show higher hazard ratios with CNN-LSTM + PS in comparison to the other clinically relevant prognostic features used in the comparison. The model presents a novel idea of combining genetic markers with image biomarkers using an LSTM in order to preserve the spatial and contextual relationship among patches. However, the model is not sufficiently validated with different datasets, nor are the choice of CNN model and the choice of training parameters. The paper Bychkov et al. (2018) predicts the five-year disease specific survival of patients diagnosed with colorectal cancer directly from digitized images of haematoxylin and eosin (H&E) stained diagnostic tissue samples. The authors used a CNN-LSTM based model that takes TMA spots as the input sequence into the model. The VGG16 architecture was used to extract patch features. The model claims the novelty of providing direct outcome prediction instead of doing intermediate analysis like classifying tissue samples. The proposed model by the authors used different scanning techniques to extract patches but claimed to have found no effect on the final prediction results. This claim is not properly validated in the study and is contradictory to what we found in our experimental analysis. The authors compared their model with traditional machine learning classifiers such as naive Bayes, logistic regression, and SVM. The lack of comparison with contemporary deep learning classifiers weakens the validation of the proposed method. All these methods using CNN-LSTM as their base model have shown the applicability of RNN based models in disease prognosis. Keeping the advantages in mind, we used BiLSTMs, the Bidirectional LSTM, to classify tumour regions in our work. The experimental observations on our dataset (Section 4.4) showed the advantage of using BiLSTMs over LSTMs in our model.

3. Methodology

3.1. Overview

In medical images, patch level classification is often useful for detecting cancer in microscopy and WSI images. However, if the prediction needs to be made for a whole tumor or gland, the network model needs to be trained such that the whole tumor region can be classified without losing its structure, resolution, and spatial correlation. For building such a model, we first extracted annotated tumour regions from WSIs and performed a rotation transformation on the regions for rotation invariance. After pre-processing of the WSI dataset, we divided both microscopy and WSI tumour samples into patches. The patches were acquired by following different scanning techniques. We further developed a BiLSTM network model that takes the patches acquired from large tumour regions in the form of sequences. Since the patches were extracted in a continuous pattern, we were able to construct sequential data fit for a BiLSTM network. We extracted features from each patch in a sequence using GoogLeNet (pre-trained on ImageNet). The accumulated features per region formed one sequence. The sequences were then passed through BiLSTM layers for classification into labels. At test time, the test regions follow the same feature extraction and sequence formation procedure. The trained BiLSTM model then tests the sequence and gives out the predicted label.
In brief, the method follows these 5 steps: (1) extracting whole regions (Benign, Invasive, and In situ), (2) extracting patches from each tumor region, (3) extracting features from each set, per patch, (4) forming a sequence out of each set, and (5) sequence processing and classification.

Fig. 1. Microscopy BACH data samples. First to fourth rows: Benign Tumours, Invasive Carcinoma, In situ Carcinoma, and Normal.

3.2. Preprocessing

3.2.1. Region extraction

The histopathological breast cancer slide dataset used in our work contains ten annotated WSIs labeled into four major classes: Normal, Benign, In situ carcinoma, and Invasive carcinoma. The annotation of each WSI is recorded in XML files. Each XML file is divided into regions as annotated by pathologists in the corresponding WSI. The regions are marked by drawing a rough boundary around the suspected region. The boundary is marked using slide annotation tools such as ASAP (Automated Slide Annotation Platform). Each pixel coordinate annotated by the pathologist is recorded in the XML file under the current region being annotated. The XML file also contains the region label, area of the region in pixels, region id, zoom ratio, length of the region in microns, and area of the region in microns. Each annotated coordinate is represented in X, Y, and Z axis values. From the available information, we calculated the maximum and minimum boundary coordinates to find the location, height, and width of the labeled region (Fig. 2).

Since the tumour regions can be found in varying orientations depending upon the angle of acquisition of the particular WSI or microscopy image, the model should be robust to such changes. Therefore, to make the process more robust and rotation invariant, the obtained regions were rotated by following a unified method. To determine the angle of rotation for a particular region, the region mask was used to analyse the orientation of the region with respect to the vertical axis. The angle of rotation was then calculated following the steps below:

1. Determine the major axis centroid of the region.
2. Calculate the major axis angle (M) from the X-axis.
3. Calculate the angle of rotation R = 90 − M.
4. Rotate the region about the major axis centroid by the angle R.
5. Repeat steps 1–4 for both the region and the region mask.
6. Calculate the bounding box coordinates of the rotated mask.
7. Modify the obtained bounding box dimensions to the nearest multiple of 256.
8. Crop the rotated region around the modified bounding box coordinates.

3.2.2. Scanning methods for patch extraction

Some of the extracted regions had large pixel dimensions due to their high resolution, which required breaking the regions into patches to enable their processing. The arbitrary dimensions of the sampled regions were also an issue for deep network training since such networks require equal size images as input. Therefore, for the feasibility of the experiment, the regions were divided into patches of dimension 256 × 256. This particular patch size was chosen keeping in mind the following points:

• The next smaller patch size in the power of 2 is 128 × 128. This patch size contains less detail than a 256 × 256 patch.
• A larger patch size of 512 × 512 or more (in powers of 2), although it would contain more detail and context, would impose computational constraints like expensive computation resources and time. This scenario would not be feasible for hospital implementation and integration of CAD methods.
• The pre-trained deep learning models like GoogLeNet, ResNet, DenseNet, and InceptionV3 take fixed size inputs ranging from 200 to 300 pixels across their width and height. Hence, taking smaller or larger patch sizes would demand heavy resizing, resulting in loss of information and details. Therefore, a 256 × 256 patch size seemed appropriate for the proposed method. Many recent works, like Wang et al. and Chennamsetty et al. in Aresta et al. (2019), have resized their patches to 256 × 256 and then resized them to 224 × 224 in order to process them with deep learning architectures like ResNet and DenseNet.

Fig. 2. WSI BACH data samples extracted from gigapixel slides. The variable size of each tumour region poses a limitation for traditional deep learning frameworks, but our model mitigates this limitation by allowing variable sequence sizes. First to third rows: Benign Tumours, Invasive Carcinoma, In situ Carcinoma. These regions can be seen having different dimensions but represent a single resolution level (level 0) from the WSI pyramid shown in Fig. 5.

To study and analyze the effect of different scanning techniques for sampling patches from regions, we tested three different scanning methods. Fig. 3 shows the pictorial representation of these techniques.

The first technique deploys the most commonly used scanning method, which moves a sliding window of the desired patch dimensions from left to right across the width until the maximum width. The process is repeated across the height of the region. The window is non-overlapping, and at the extreme ends, if the expected height and/or width of the patch is greater than the remainder, we used symmetric padding to level the patch dimensions. For convenience of language, we refer to this scanning method as Scan_1. The process is illustrated in Fig. 3a.

The second scanning technique was conceived as an attempt to arrange patches in a sequence that brings as much continuity as possible. For any RNN method, where the sequence of data is the key to linking the context of the past and future with the present, we needed to derive sequential information from our tumor regions after they are sampled into patches. Our method is an effort to test the efficiency of the RNN in the case of image sequences. It scans patches starting from left to right across the width in one iteration; the second iteration then starts from the next row of non-overlapping pixels and proceeds from right towards left, covering the width of the image. The process is repeated for subsequent rows until the entire region is exhausted. We named this scanning technique Scan_2, shown in Fig. 3b.

The third scanning method was deployed to bring more correspondence between neighboring patches. The patches were scanned as represented in Fig. 3c. A set of four neighboring patches is scanned first, then the next adjacent batch, and onwards. When the row of non-overlapping pixels changes, the batches were scanned from right to left. The process was repeated until the region was covered across both dimensions. This technique is further referred to in the article as Scan_3.

The patches from each region were separated in the form of sets or
Fig. 3. The figure illustrates the different scanning methods that are used to extract patches from labeled WSI regions. The numbered blue blocks represent the patches in the WSI or Microscopy dataset. The dotted purple arrows show the direction of the scan and the dotted yellow arrows show the transition from one pass of the scan to another.
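To make the scanning orders of Section 3.2.2 concrete, the short sketch below generates patch top-left coordinates for Scan_1 (plain raster order) and Scan_2 (serpentine order, where alternate rows are traversed right-to-left so consecutive patches stay spatially adjacent). This is an illustrative implementation under our own assumptions, not the authors' code; edge padding is only hinted at in the comment.

```python
def scan_coordinates(height, width, patch=256, method="Scan_1"):
    """Yield (row, col) top-left corners of non-overlapping patches.

    Scan_1: every row is traversed left-to-right.
    Scan_2: alternate rows are traversed right-to-left (serpentine).
    """
    rows = range(0, height, patch)
    cols = list(range(0, width, patch))
    for i, r in enumerate(rows):
        ordered = cols if (method == "Scan_1" or i % 2 == 0) else cols[::-1]
        for c in ordered:
            yield r, c

# Example: a 520 x 770 region yields 3 x 4 patches of 256 px
# (edge patches would be completed with symmetric padding, as in the paper).
sequence = list(scan_coordinates(520, 770, 256, method="Scan_2"))
print(sequence[:5])   # [(0, 0), (0, 256), (0, 512), (0, 768), (256, 768)]
```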
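As a rough illustration of the rotation and cropping procedure of Section 3.2.1 (steps 1-8 above), the following sketch uses scikit-image to estimate the major-axis orientation from the region mask, rotate region and mask about the centroid, and grow the bounding box to the nearest multiple of 256. The library calls and angle conventions are our assumptions, not the authors' implementation.

```python
import numpy as np
from skimage.measure import label, regionprops
from skimage.transform import rotate

def align_and_crop(region_rgb, region_mask, multiple=256):
    """Rotate a tumour region so its major axis is vertical, then crop a
    bounding box padded up to the nearest multiple of 256 (illustrative only)."""
    props = regionprops(label(region_mask.astype(int)))[0]

    # skimage reports the major-axis angle w.r.t. the row axis (radians);
    # convert to the angle from the x-axis (M) and apply R = 90 - M.
    M = 90.0 - np.degrees(props.orientation)
    R = 90.0 - M

    # Rotate both region and mask about the region centroid (steps 4-5).
    center = props.centroid[::-1]                # (x, y) order expected by skimage
    rot_img = rotate(region_rgb, R, center=center, resize=True, preserve_range=True)
    rot_mask = rotate(region_mask.astype(float), R, center=center, resize=True) > 0.5

    # Bounding box of the rotated mask, grown to a multiple of 256 (steps 6-8).
    ys, xs = np.nonzero(rot_mask)
    h = int(np.ceil((ys.max() - ys.min() + 1) / multiple) * multiple)
    w = int(np.ceil((xs.max() - xs.min() + 1) / multiple) * multiple)
    return rot_img[ys.min():ys.min() + h, xs.min():xs.min() + w].astype(np.uint8)
```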
Table 2
Distribution of the labels for the microscopy and WSI datasets.

Dataset      Benign   In situ   Invasive   Normal
Microscopy   100      100       100        100
WSI          57       109       60         –

– denotes no annotated normal regions.

Table 3
Distribution of the data for the microscopy and WSI datasets into training, validation and testing sets (for parameter selection only; refer Section 4.4).

Dataset                    Benign   Invasive   In situ   Normal
Microscopy   Train         66       75         76        63
             Validation    18       11         12        20
             Test          16       14         12        17
WSI          Train         34       74         45        –
             Validation    11       13         9         –
             Test          12       15         6         –
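The end-to-end model of Section 3.1 can be sketched as follows: per-patch features from a pre-trained GoogLeNet are stacked into one variable-length sequence per region and classified by a single BiLSTM layer. This is a minimal PyTorch sketch under our own assumptions (framework choice, readout from the final hidden states); the 1024-dimensional features and 2000 hidden units per direction are taken from Section 6, and the original model additionally uses sequence folding/unfolding layers.

```python
import torch
import torch.nn as nn
from torchvision.models import googlenet

class RegionBiLSTMClassifier(nn.Module):
    """Patch sequence -> GoogLeNet features -> one BiLSTM layer -> region label."""
    def __init__(self, num_classes=4, hidden=2000):
        super().__init__()
        backbone = googlenet(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()               # keep the 1024-d pooled features
        self.features = backbone                  # used as a fixed extractor
        self.bilstm = nn.LSTM(input_size=1024, hidden_size=hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, patches):
        # patches: (num_patches, 3, 224, 224) for ONE region, in scan order.
        with torch.no_grad():                     # backbone weights stay frozen
            feats = self.features(patches)        # (num_patches, 1024)
        _, (h_n, _) = self.bilstm(feats.unsqueeze(0))   # h_n: (2, 1, hidden)
        h = torch.cat([h_n[0], h_n[1]], dim=1)    # forward + backward summaries
        return self.classifier(h)                 # logits, shape (1, num_classes)

# Usage: 4 classes for the Microscopy dataset, 3 for the WSI dataset.
model = RegionBiLSTMClassifier(num_classes=4)
region = torch.rand(12, 3, 224, 224)              # e.g., a region split into 12 patches
logits = model(region)                            # shape (1, 4)
```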
The '*' in front of some of the accuracy values indicates that the epochs were run keeping the training option of validation patience at 5. This setting ensures that the training stops if the validation loss is larger than or equal to the previously recorded smallest loss at most 5 times during the training, or if the maximum number of epochs is exhausted, whichever comes first. In this setting, the number of epochs may or may not reach the maximum limit set at the start of the training. So, we performed all 72 experiments with and without validation patience 5 for a maximum of 30 epochs. We have shown only the larger of the two accuracy values obtained from the two settings; the '*' indicates that the larger accuracy value was obtained with validation patience 5. So, in total, we conducted 4 × 3 × 3 × 2 × 2 = 144 experiments for each dataset to select the optimal hyper-parameters. The experimental results are indicated in Table 4 for 3-class classification of WSI tumour regions and Table 5 for 4-class classification of microscopy images.

Table 4
Accuracy (%) obtained against different learning rates, drop-out rates, optimizing functions and scanning techniques with respect to Whole Slide Images (3-classes).

Scanning   Learning   Optimizer   Drop-out rate
method     rate                   0.4      0.5      0.6      0.7
Scan_1     10^-4      SGDM        51.52    57.58    54.55    39.39
                      RMSprop     72.73*   60.61*   66.67    63.64
                      ADAM        69.70    69.70    66.67    63.64
           10^-3      SGDM        72.73*   72.73    66.67*   66.67*
                      RMSprop     60.61    66.67*   63.64*   57.58*
                      ADAM        51.52    63.64    63.64    63.64
Scan_2     10^-4      SGDM        60.61    57.58    57.58    57.58
                      RMSprop     75.76    87.88*   81.82*   84.85
                      ADAM        78.79    75.76    78.79    84.85
           10^-3      SGDM        81.82    84.85    81.82*   81.82
                      RMSprop     78.79    75.76*   81.82    69.70
                      ADAM        66.67    72.73    72.73    66.67
Scan_3     10^-4      SGDM        72.73    75.76    66.67    69.70
                      RMSprop     78.79*   72.73    84.85*   78.79
                      ADAM        81.82*   69.70*   75.76    72.73*
           10^-3      SGDM        78.79    75.76*   78.79*   72.73*
                      RMSprop     75.76    78.79    72.73    75.76*
                      ADAM        69.70    69.70    66.67    75.76

* validation patience 5.

Table 5
Accuracy (%) obtained against different learning rates, drop-out rates, optimizing functions and scanning techniques with respect to Microscopy Images (4-classes).

Scanning   Learning   Optimizer   Drop-out rate
method     rate                   0.4      0.5      0.6      0.7
Scan_1     10^-4      SGDM        59.32*   59.32    55.93    62.71
                      RMSprop     74.58    72.88    74.58    69.49
                      ADAM        76.27    72.88    79.66*   76.27
           10^-3      SGDM        69.49    69.49    69.49    67.80
                      RMSprop     67.80    69.49*   64.41*   62.71
                      ADAM        59.32*   69.49*   59.32    71.19*
Scan_2     10^-4      SGDM        55.93    71.19*   72.88*   71.19*
                      RMSprop     79.66    83.05*   77.97*   81.36
Scan_3     10^-4      SGDM        18.69*   22.03*   22.03*   23.73*
                      RMSprop     1.69*    1.69*    1.69*    1.69*
                      ADAM        0        3.39*    0        3.39*
           10^-3      SGDM        3.39*    5.08*    3.39*    5.08*

* validation patience 5.
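The 144-run grid described above (three scanning methods × two learning rates × three optimizers × four drop-out rates, each with and without validation patience) can be enumerated in a few lines; this is an illustrative sketch only, with `train_and_evaluate` standing in as a hypothetical placeholder for the actual training routine.

```python
from itertools import product

def train_and_evaluate(scan_method, learning_rate, optimizer, dropout,
                       val_patience, max_epochs=30):
    """Placeholder for the actual training/validation routine."""
    raise NotImplementedError

scans      = ["Scan_1", "Scan_2", "Scan_3"]
lrs        = [1e-4, 1e-3]
optimizers = ["SGDM", "RMSprop", "ADAM"]
dropouts   = [0.4, 0.5, 0.6, 0.7]
patience   = [None, 5]                      # without / with validation patience 5

grid = list(product(scans, lrs, optimizers, dropouts, patience))
assert len(grid) == 144                     # 3 x 2 x 3 x 4 x 2 runs per dataset
```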
Several observations were made from Table 4. For instance, across all the scanning methods, learning rate 10^-4 performed better than learning rate 10^-3. However, for the second scanning method (Scan_2), both learning rates performed closely, with accuracy values falling in the range 80-88%. Scanning method Scan_3 followed closely in terms of the frequency of accuracy values of more than 80%. When we kept the optimization function, learning rate and scanning method constant, the trend of accuracy across different drop-out rates signifies the importance of tuning drop-out values when training custom models. With respect to the optimization function, and irrespective of the drop-out rates, the overall analysis of Table 4 suggests that SGDM did not perform well in the first two scanning methods (Scan_1 and Scan_2), whereas a gain in SGDM performance was observed in Scan_3. In the case of ADAM, this optimization function could not enhance the model's performance across all hyper-parameters, except in Scan_2 with learning rate 10^-4. The optimization function RMSprop performed consistently better across scanning methods Scan_2 and Scan_3, irrespective of the learning rates and drop-out rates. The highest performance, as can be seen in Table 4 for WSIs, was given by Scan_2, RMSprop, 0.5 drop-out rate and 10^-4 learning rate. The cell is highlighted in magenta.

The analysis of Table 5 also gives some interesting insights into the behaviour of the model when the hyper-parameters change. These values were obtained after the 4-class classification of the microscopy dataset. The parameters are most sensitive to the scanning methods in this dataset, as we could observe from Table 5 that when the patches extracted with Scan_3 were trained using the same hyper-parameters, an absolute drop in the accuracy was recorded. The results also indicate that scanning techniques can over-power the outcome of the model, especially in the case of sequence modelling of images to labels. In the Microscopy dataset as well, the learning rate 10^-4 performed better than 10^-3 and the scanning method Scan_2 gave a better outcome in comparison to the other two methods. We observed a difference in the optimization function (ADAM) and drop-out rate (0.6) when compared with the best performing configuration for the WSI dataset. We then finalized our parameter set to perform cross-validation. We deduced that learning rate 10^-4 and scan technique Scan_2 with validation patience gave us the better results in both datasets.

A direct analysis of the comparative methods in the literature Ren et al. (2018), Qaiser and Rajpoot (2019), Bychkov et al. (2018) against our proposed method could not be achieved since these methods have different objectives, like calculating HER2 scores, five-year disease specific survival prediction, and hazard ratios. Also, they have different data and datasets. Whereas, we do not have such types of data and hence the objectives are different. However, all these methods used CNN + LSTM as their base model. From the obtained results, we observed that the performance with the LSTM layer remained lower than with the BiLSTM layer.

Table 7
Comparative performance metrics with standard errors for the patch-to-image classification model for the Microscopy dataset (4-classes), including the top BACH teams reported in Aresta et al. (2019) (e.g., Kone et al., 2018, team 19).
* The standard error data for the comparative literature is not available.
Table 8
Comparative performance metrics with standard errors for patch-to-image classification model for WSI Dataset (3-classes). Columns: overall accuracy (Acc.), then sensitivity (Se.) and specificity (Sp.) per class.

Model                               Acc.              Benign Se.        Benign Sp.        Invasive Se.      Invasive Sp.      In situ Se.       In situ Sp.
Ours (proposed)                     0.8402 ± 0.0032   0.7090 ± 0.0309   0.9132 ± 0.0157   0.9142 ± 0.0136   0.9190 ± 0.0096   0.8333 ± 0.0264   0.9240 ± 0.0117
ResNet50 He et al. (2016)           0.8127 ± 0.0093   0.9233 ± 0.0202   0.8341 ± 0.0148   0.8285 ± 0.0113   0.9492 ± 0.0126   0.7167 ± 0.0271   0.9556 ± 0.0065
DenseNet201 Huang et al. (2017)     0.8127 ± 0.0054   0.8142 ± 0.0202   0.9091 ± 0.0071   0.8520 ± 0.0098   0.9160 ± 0.0135   0.7833 ± 0.0500   0.9056 ± 0.0079
InceptionV3 Szegedy et al. (2015)   0.8221 ± 0.0087   0.8356 ± 0.0140   0.8740 ± 0.0142   0.8451 ± 0.0077   0.9183 ± 0.0087   0.7667 ± 0.0245   0.9369 ± 0.0696
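The class-wise sensitivity and specificity reported in Tables 7 and 8 follow the usual one-vs-rest definitions given in Eqs. (3) and (4) below. As an illustration only (not the authors' evaluation script), they can be computed from a confusion matrix as in this short sketch:

```python
import numpy as np

def class_wise_se_sp(confusion):
    """confusion[i, j] = number of samples with true class i predicted as class j."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    metrics = {}
    for c in range(confusion.shape[0]):
        tp = confusion[c, c]
        fn = confusion[c, :].sum() - tp          # missed positives of class c
        fp = confusion[:, c].sum() - tp          # other classes predicted as c
        tn = total - tp - fn - fp
        metrics[c] = {"sensitivity": tp / (tp + fn),
                      "specificity": tn / (tn + fp)}
    return metrics

# Example with 3 WSI classes (Benign, Invasive, In situ):
cm = [[10, 1, 1],
      [ 1, 13, 1],
      [ 1, 1, 4]]
print(class_wise_se_sp(cm))
```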
For benchmarking, we fine-tuned the state of the art networks listed in Table 8: ResNet50 He et al. (2016), DenseNet201 Huang et al. (2017), and InceptionV3 Szegedy et al. (2015). We used a constant learning rate of 10^-4 and SGDM as the optimizer for fine-tuning each of these networks. The last fully connected layer of each of these models was removed and replaced with a new fully connected layer having four outputs in the case of the Microscopy dataset and three outputs for the WSI dataset. Each model was trained for 30 epochs. After the model training, we performed a majority voting scheme to predict the final label for the image. This process was done for both the Microscopy and WSI datasets. The benchmark models are not end-to-end due to the required post-processing of patch-based classifier outputs for image label prediction.

We compared the results from the benchmark models and the results of the top 5 teams in the BACH grand challenge (IB, 2018), published in (Aresta et al., 2019), with our proposed method on the Microscopy dataset in Table 7. Similarly, for the WSI dataset, we compared our model's performance in Table 8. We performed 10-fold cross-validation on our proposed model. We have evaluated the performance of our model in terms of the overall accuracy of the model, class-wise sensitivity, and specificity. Sensitivity and Specificity are commonly used measures for medical applications. Sensitivity refers to how sensitive our model is in detecting the positive class, or the percentage of actual positives that are correctly identified. Whereas, Specificity is the measure of actual negatives that are correctly identified. Both the Sensitivity and Specificity of the model should be as high as possible to be able to correctly detect all positive samples and all negative samples.

Sensitivity = True Positive / (True Positive + False Negative)   (3)

Specificity = True Negative / (True Negative + False Positive)   (4)

4.5.1. Performance on microscopy dataset

The accuracy of our model is 3% more than that of the top performing team 216, Chennamsetty et al. The authors also used pre-trained CNNs instead of building their own custom model. They used an ensemble of ResNet-101 He et al. (2016) and two DenseNet-161 Huang et al. (2017) networks. In comparison to our model, which is end-to-end, they first trained a ResNet-101 and a DenseNet-161 using images normalized with breast-histology data, and then another DenseNet-161 with images fine-tuned with ImageNet normalization. During testing, the majority voting scheme was used to declare the class of the input image from among the three classes predicted by the three models. Another notable difference between our model and theirs is that they used bilinear interpolation to resize their image dimensions from 2048 × 1536 to 224 × 224, whereas we did not use resized images since that would have decreased the quality of the extracted features. We maintained the resolution and instead broke the image into patches to decrease the size of the input image. At training time, for the feature extraction step through GoogLeNet, the patches were resized from 256 × 256 to 224 × 224. The second team on the leaderboard, Kwok et al., team 248, trained their model using images from both the microscopy dataset and patches extracted from the WSI dataset. Their 2-stage process first trained the ResNet-v2 Szegedy et al. (2017), pre-trained on ImageNet, on patches acquired from the microscopy dataset and then again pre-trained their network with the patches acquired from the WSI dataset. The prediction of each patch was then aggregated to the image-wise prediction. Their method was also not end-to-end and required two datasets to fine-tune the model performance. The difference in accuracy between our model and theirs was also 3%. The class-wise comparison (Table 7) suggests that our model is much more sensitive than the top 2 performing teams. Team 1, Brancati et al., also used an ensemble, of three ResNet models having 34, 50, and 101 layers, respectively. They used down-sampled microscopy images to extract patches of two sizes, 308 × 308 and 615 × 615. These patches were taken from the center of the down-sampled images. They used the highest class probability from the three models as the class of the image. Our model performed better than theirs, with an overall accuracy of 90% against 86%. The next team in the list was Team 157, Wang et al. The authors in this work trained VGG16 Simonyan and Zisserman (2014) using the sample pairing data augmentation technique by Inoue (2018), in which samples from different classes are augmented and then merged. The merged images are then trained using the chosen model. In the next step, the trained classifier from the mixed images is again trained using the initial non-mixed dataset. The authors resized their images to 256 × 256 and then extracted patches of size 224 × 224 at random locations. They achieved an accuracy of 83%. The difference between their and our approach is the same as with the other competitive models. Team 19, Kone et al., achieved an accuracy of 81%, 9 percent less than our proposed model. They proposed a binary tree like structure of 3 ResNeXt50 Xie et al. (2017) models in which the top CNN in the hierarchy classifies images into carcinoma (In situ, Invasive) and non-carcinoma (normal and benign). The next two children of the root CNN then classify the images into the respective two sub-classes, benign or normal and In situ or Invasive. They also used a two-stage process that used the learned weights of the first stage to train the subsequent stages. All these methods in the challenge (IB, 2018) whose authors have reported their models' performance used current state of the art deep learning models. The common thread between these models was that all used pre-trained models due to the limited amount of data. However, they all used very heavy resizing of images, which compromises the quality of the high resolution intrinsic details present in cancer data. Moreover, their methods used two to three stages of training, and the final outputs were aggregated to declare the image-wise prediction. Our model, on the other hand, as mentioned, avoids the disadvantages posed by the compared models. The same disadvantages are posed by the authors in Roy et al. (2019) as well. They extracted different size patches (64 × 64, 128 × 128, 512 × 512) to train their model separately but found optimum performance with 512 × 512. They then used heavy data augmentation to increase the amount of data. The augmented dataset is then trained using their custom CNN architecture. After the patches were trained, they used a majority voting scheme to declare the predicted class of the input image. Although they achieved accuracy equal to our proposed model, they suffered from the drawbacks of a stage-wise model, data augmentation, and having to train their model from scratch, which demands time and space.

4.5.2. Performance on WSI dataset

The Microscopy dataset has balanced sets of four classes having equal image dimensions. The labelled mask of each class covers the entire image area and hence the features detected belong to one class only. These properties have helped to capture patches that completely belong to the labelled image class.
However, with the WSI dataset, due to the arbitrary shape and size of the regions, the automatic extraction script could only extract the tumour from the surrounding bounding box area. Hence, the patches sampled from such WSI regions also contained a lot of non-tumour or non-class images. Moreover, the final acquired image regions were imbalanced (Table 3). Therefore, these reasons might have caused the decline in accuracy with the WSI dataset in comparison to microscopy images. We trained for only three classes since the normal patches were randomly extracted and therefore did not belong to one particular area in the WSI. The continuity of the patches is the important factor for our model. For experimental purposes, when we trained our model with non-continuous normal patches, our model suffered from a performance decline, which proved that the continuous patches draw spatial and contextual relationships through BiLSTMs. Otherwise, in the absence of continuity, the model may suffer from high variance. For benchmarking purposes and due to the lack of other comparative models, we compared our model with ResNet50, InceptionV3, and DenseNet121. From Table 8, we could observe an improvement in the performance metrics when we used the context based model. The main difference between our model and these state of the art models is that we did not train any deep architecture and our model is end-to-end.

5. Discussions

Computer Aided Diagnosis (CAD) by analysing samples of Ultrasound, CT, and MRI images has been widely pursued by medical image researchers for quite some time. They trained machine learning models with various morphological, graph, and intensity based methods from very small sets of data samples, which were sometimes in the range of only 30-100 images. The generalizing capability of such models has thus been questionable. However, after the introduction of deep learning models and the availability of large amounts of data, CAD techniques have experienced huge success in performance precision and accuracy. When such deep models were tested on histopathological images, the low inter-class variability, especially between the Normal and Benign classes, affected the overall performance. Hence, some new methods engaging these deep models in the form of cascaded or ensemble architectures were proposed. Also, most biopsy samples digitized at high resolutions contain very detailed information on cell structures and various other microstructures. The amount of information in one biopsy sample could collectively form a gigapixel image. Such high-resolution images are then required to be broken into smaller patches for further processing. Patch-based processing with complex ensemble methods, followed by aggregation of patches into image labels in the case of classification and into segmented objects in the case of segmentation, makes for a lengthy process. The whole pipeline is divided into stages and lacks contextual relationships between patches. To overcome this drawback, we sought to streamline the process into an end-to-end network. The patches were visualized as a sequence of images, as in a video, and an effort was made to scan the patches so as to maintain as much continuity as possible. RNN based BiLSTM models are known to serve the purpose of predicting input sequence labels. With BiLSTMs, we could capture both past and future contexts, which enabled the model to aggregate the whole tumour features despite providing non-overlapping tumour parts in the form of patches as the input sequence.

Due to sequence classification, the next step of predicting the image label from patch labels was not required. The graphical structure of BiLSTMs helped to build a context-based high-resolution tumour classification model that also gave us the benefit of an end-to-end network structure. We also found that with our proposed model there is no need to train deep models. We used a pre-trained ImageNet model for feature extraction and only one BiLSTM layer to train a shallow network. The average time to train the model was 17 minutes for 30 epochs. Once the model's hyper-parameters are tuned for the particular dataset, the training takes only a few minutes. The shallow structure of the model also makes it feasible for deployment in lighter applications such as hand-held devices like mobile handsets. The complexity of the method is discussed in Section 6. Another advantage is that the various limitations of high-resolution images could be exploited in favour of the methodology. The large dimensions could be easily turned into sequences using the appropriate scanning process. Due to the BiLSTM layer, the model encapsulates a context mining capability, which helped form the spatial and contextual relationship between patches sampled from a single image. The results suggested that this context modelling is crucial in patch-based models that process patches instead of complete structures at a time. In other words, modelling direct dependencies between patches, past or future, is crucial for the performance of the model.

The idea of processing patches as a sequence using an RNN based BiLSTM model could be further extended by using four RNNs, each taking patches going in the up, down, left, and right directions, respectively Bengio et al. (2017). According to Visin et al. (2015), Kalchbrenner et al. (2015), compared to CNNs, RNNs when applied to images allow for long-range lateral interactions between features in the same feature map.

6. Complexity

The model is an end-to-end deep learning model whose architecture is briefly expressed in Table 1. Up to layer number 143, the Flatten layer, there are no Floating Point Operations (FLOPs) being performed. The GoogLeNet network is present to extract pre-trained features, which are then passed on to subsequent layers for further processing. Similarly, the Sequence Folding, Unfolding, Average Pooling, and Flatten layers also accumulate zero FLOPs. Therefore, the time complexity is calculated from the BiLSTM layer onwards. The formula for calculating the number of learnable parameters in a BiLSTM layer is derived as follows.

Let I be the input size of the sequence, K be the number of output dimensions and H be the number of hidden units. For a BiLSTM, if H is the number of initialized hidden units, then M = 2 × H is the total number of hidden units for both the forward and backward passes of the BiLSTM network. After concatenation of the forward and backward outputs, the total output dimension becomes K = M/2. Then the complexity of a BiLSTM layer is:

O(W)

where W is the total number of learnable parameters in the network, calculated as:

W = 4 × M((I + 1) + K)
W = 4 × (M(I + 1) + MK)

Here, in the above formula, the first term 4 × M(I + 1) is the total number of input weights and the second term 4 × MK is the number of recurrent weights.

In terms of Big Oh notation, the time complexity is:

O(M(I + 1) + MK)

The multiplication by the factor 4 represents the four weight matrices of the BiLSTM layer (Input gate, Forget gate, Cell candidate, Output gate). The input size variable I is added with a bias value of 1.

For the BiLSTM layer in our network, the number of parameters is:

W = 4 × 4000 × ((1024 + 1) + 2000)
W = 16000 × (1025 + 2000)
W = 48,400,000

where 4000 is the total number of hidden units for both the forward and backward passes of the BiLSTM layer, 1024 is the size of the input sequence, and 2000 is the total number of outputs.
Next, for the fully connected layer, the number of parameters is

F = 3 × 4000 = 12,000

Hence, the total number of FLOPs is

W + F = 48,400,000 + 12,000 = 48,412,000
W + F ≈ 48.4 × 10^6 ≈ 48 MFLOPs

We have used an NVIDIA TitanX GPU (12 GB) for training our models. It performs 11 × 10^12, or 11 Tera, FLOPs per second, which is sufficient computational efficiency for training.

To put this in perspective, we list the number of FLOPs for a few popular deep learning networks in Table 9.

Table 9
FLOPs for popular deep learning architectures.

AlexNet       727 MFLOPs
VGG16         16 GFLOPs
VGG19         20 GFLOPs
GoogLeNet     2 GFLOPs
ResNet50      4 GFLOPs
DenseNet121   3 GFLOPs
InceptionV3   6 GFLOPs

In terms of Big Oh notation, the time complexity of the model for t input samples and n epochs is represented as:

O(n × t × (W + F))
7. Conclusion

We proposed an end-to-end RNN based model that takes patches as input and outputs image labels. The patches are modelled as sequences using a one-layer BiLSTM model. The sequence in an image is captured using a strategic scanning method which was chosen experimentally. We used the BACH challenge dataset to test our method and reported our results on the two different datasets introduced in the challenge. The classifier performance was compared with the recently reported metrics of the top 5 teams in the BACH challenge for the microscopy dataset. We achieved the highest performance of 90% with a simpler architecture and less time and space complexity.

Authors' contribution

Suvidha Tripathi: Conceptualization, Methodology, Software, Writing - Original draft preparation, Visualization, Investigation, Validation. Satish Kumar Singh: Conceptualization, Methodology, Supervision, Resources, Writing - Reviewing and Editing, Visualization, Validation. Hwee Kuan Lee: Methodology, Formal Analysis, Supervision, Writing - Reviewing and Editing, Investigation, Visualization, Validation.

Declaration of Competing Interest

The authors report no declarations of interest.

Acknowledgements

This research was carried out at the Indian Institute of Information Technology, Allahabad and supported, in part, by the Ministry of Human Resource and Development, Government of India and the Biomedical Research Council of the Agency for Science, Technology, and Research, Singapore. We are also grateful to the NVIDIA Corporation for supporting our research in this area by granting us a TitanX (PASCAL) GPU.

References

Araújo, T., Aresta, G., Castro, E., Rouco, J., Aguiar, P., Eloy, C., Polónia, A., Campilho, A., 2017. Classification of breast cancer histology images using convolutional neural networks. PLoS One 12 (6), e0177544.
Aresta, G., Araújo, T., Kwok, S., Chennamsetty, S.S., Safwan, M., Alex, V., Marami, B., Prastawa, M., Chan, M., Donovan, M., et al., 2019. BACH: grand challenge on breast cancer histology images. Med. Image Anal.
B., 2015. 4th International Symposium in Applied Bioimaging. http://www.bioimaging2015.ineb.up.pt//.
Bayramoglu, N., Kannala, J., Heikkilä, J., 2016. Deep learning for magnification independent breast cancer histopathology image classification. 2016 23rd International Conference on Pattern Recognition (ICPR), 2440–2445.
Bengio, Y., Goodfellow, I., Courville, A., 2017. Deep Learning, vol. 1. Citeseer.
Bychkov, D., Linder, N., Turkki, R., Nordling, S., Kovanen, P.E., Verrill, C., Walliander, M., Lundin, M., Haglund, C., Lundin, J., 2018. Deep learning based tissue analysis predicts outcome in colorectal cancer. Sci. Rep. 8 (1), 1–11.
Graves, A., Fernández, S., Schmidhuber, J., 2005. Bidirectional LSTM networks for improved phoneme classification and recognition. International Conference on Artificial Neural Networks, 799–804.
Guo, Y., Liu, Y., Bakker, E.M., Guo, Y., Lew, M.S., 2018. CNN-RNN: a large-scale hierarchical image classification framework. Multimedia Tools Appl. 77 (8), 10251–10271.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Hou, L., Samaras, D., Kurc, T.M., Gao, Y., Davis, J.E., Saltz, J.H., 2016. Patch-based convolutional neural network for whole slide tissue image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2424–2433.
Huang, Y., Chung, A.C.-S., 2018. Improving high resolution histology image classification with deep spatial fusion network. Computational Pathology and Ophthalmic Medical Image Analysis, 19–26.
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q., 2017. Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708.
I.B., 2018. Grand Challenge on Breast Cancer Histology. https://iciar2018-challenge.grand-challenge.org/Home/.
Inoue, H., 2018. Data Augmentation by Pairing Samples for Images Classification. arXiv preprint arXiv:1801.02929.
Johnson, J., Karpathy, A., Fei-Fei, L., 2016. DenseCap: fully convolutional localization networks for dense captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4565–4574.
Kalchbrenner, N., Danihelka, I., Graves, A., 2015. Grid Long Short-Term Memory. arXiv preprint arXiv:1507.01526.
Karpathy, A., Fei-Fei, L., 2015. Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3128–3137.
Leong, A.S.-Y., Zhuang, Z., 2011. The changing role of pathology in breast cancer diagnosis and treatment. Pathobiology 78 (2), 99–114.
Mahbod, A., Ellinger, I., Ecker, R., Smedby, Ö., Wang, C., 2018. Breast cancer histological image classification using fine-tuned deep network fusion. International Conference Image Analysis and Recognition, 754–762.
Nazeri, K., Aminpour, A., Ebrahimi, M., 2018. Two-stage convolutional neural network for breast cancer histology image classification. International Conference Image Analysis and Recognition, 717–726.
Qaiser, T., Rajpoot, N.M., 2019. Learning where to see: a novel attention model for automated immunohistochemical scoring. IEEE Trans. Med. Imaging 38 (11), 2620–2631.
Ren, J., Karagoz, K., Gatza, M., Foran, D.J., Qi, X., 2018. Differentiation among prostate cancer patients with Gleason score of 7 using histopathology whole-slide image and genomic data. Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Applications, vol. 10579, 1057904.
Roy, K., Banik, D., Bhattacharjee, D., Nasipuri, M., 2019. Patch-based system for classification of breast histology images using deep learning. Comput. Med. Imaging Graph. 71, 90–103.
Schuster, M., Paliwal, K.K., 1997. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45 (11), 2673–2681.
Shaban, M., Awan, R., Fraz, M.M., Azam, A., Snead, D., Rajpoot, N.M., 2019. Context-Aware Convolutional Neural Network for Grading of Colorectal Cancer Histology Images. arXiv preprint arXiv:1907.09478.
Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.
Spanhol, F.A., Oliveira, L.S., Petitjean, C., Heutte, L., 2016. Breast cancer histopathological image classification using convolutional neural networks. 2016 International Joint Conference on Neural Networks (IJCNN), 2560–2567.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2015. Rethinking the Inception Architecture for Computer Vision. arXiv preprint arXiv:1512.00567.
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A., 2017. Inception-V4, Inception-ResNet and the impact of residual connections on learning. Thirty-First AAAI Conference on Artificial Intelligence.
Vinyals, O., Toshev, A., Bengio, S., Erhan, D., 2015. Show and tell: a neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3156–3164.
Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., Bengio, Y., 2015. ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. arXiv preprint arXiv:1505.00393.
Wang, J., Yang, Y., Mao, J., Huang, Z., Huang, C., Xu, W., 2016. CNN-RNN: a unified framework for multi-label image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2285–2294.
Wang, S., Zhu, Y., Yu, L., Chen, H., Lin, H., Wan, X., Fan, X., Heng, P.-A., 2019. RMDL: Recalibrated multi-instance deep learning for whole slide gastric image classification. Med. Image Anal. 58, 101549.
Wei, Y., Xia, W., Huang, J., Ni, B., Dong, J., Zhao, Y., Yan, S., 2014. CNN: single-label to multi-label. arXiv preprint arXiv:1406.5726.
Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1492–1500.
Zhang, J., Wu, Q., Shen, C., Zhang, J., Lu, J., 2018. Multilabel image classification with regional latent semantic dependencies. IEEE Trans. Multimedia 20 (10), 2801–2813.