
Highlights

ADS_UNet: A Nested UNet for Histopathology Image Segmentation


Yilong Yang, Srinandan Dasmahapatra, Sasan Mahmoodi

• We propose ADS_UNet, which integrates cascade training and the AdaBoost algorithm.
• We supervise layers of UNet with learnable loss weights so that they learn useful features.
• The importance of layers varies during training, and layers contribute differently.
• The ADS_UNet achieves state-of-the-art performance.
• The ADS_UNet is much more memory and computationally efficient than Transformers.
arXiv:2304.04567v1 [eess.IV] 10 Apr 2023
ADS_UNet: A Nested UNet for Histopathology Image Segmentation
Yilong Yang^{a,*}, Srinandan Dasmahapatra^{a} and Sasan Mahmoodi^{a}
^{a} School of Electronics and Computer Science, University of Southampton, University Road, Southampton, SO17 1BJ, Hampshire, United Kingdom

ARTICLE INFO

Keywords: Segmentation, UNet, AdaBoost, Histopathology, Ensemble

ABSTRACT

The UNet model consists of fully convolutional network (FCN) layers arranged as contracting encoder and upsampling decoder maps. Nested arrangements of these encoder and decoder maps give rise to extensions of the UNet model, such as UNet^e and UNet++. Other refinements include constraining the outputs of the convolutional layers to discriminate between segment labels when trained end to end, a property called deep supervision. This reduces feature diversity in these nested UNet models despite their large parameter space. Furthermore, for texture segmentation, pixel correlations at multiple scales contribute to the classification task; hence, explicit deep supervision of shallower layers is likely to enhance performance. In this paper, we propose ADS_UNet, a stage-wise additive training algorithm that incorporates resource-efficient deep supervision in shallower layers and takes performance-weighted combinations of the sub-UNets to create the segmentation model. We provide empirical evidence on three histopathology datasets to support the claim that the proposed ADS_UNet reduces correlations between constituent features and improves performance while being more resource efficient. We demonstrate that ADS_UNet outperforms state-of-the-art Transformer-based models by 1.08 and 0.6 points on CRAG and BCSS datasets, and yet requires only 37% of the GPU consumption and 34% of the training time required by Transformers.

1. Introduction

The fully convolutional neural network (FCN) (Long et al., 2015), trained end-to-end on per-pixel labels, is considered a milestone in image segmentation using deep networks. It was then extended by Ronneberger et al. (2015) to include a large number of up-sampled features concatenated using skip connections with the encoded convolutional features. They named the network a UNet after a geometrical laying out of the network topology in a u-shape. Zhou et al. (2019) modified the UNet architecture by adding more nodes and connections to capture low-level correlation of distributed semantic attributes. The resulting architectures, known as UNet^e (e denotes ensemble) and UNet++, used class labels to guide the outputs of decoder layers (called deep supervision) to learn highly discriminative features. Both UNet^e and UNet++ can be classified as ensemble models, in which multiple models are created to obtain better performance than each constituent model alone (Opitz and Maclin, 1999). A property that is present in a good ensemble is the diversity of the predictions made by contributing models. However, end-to-end training of deep networks tends to correlate intermediate layers (Ji and Telgarsky, 2020), hence the collaborative learning of constituent UNets adopted by UNet^e and UNet++ induces learned features to be correlated. Such learning runs counter to the idea of feature diversity pursued by ensemble models. Moreover, simple averaging performed in UNet^e, disregarding the difference in the performance of each member also restricts the final predictive performance of the ensemble.

Based on the work of UNet^e and UNet++, we pose several questions: 1) can each constituent model be forced to extract decorrelated features during training, to guarantee prediction diversity? 2) can the outputs of constituent models, sensitive to different spatial resolutions, be weighted differently when they are integrated into the final segmentation? 3) can we provide deep supervision for encoders directly rather than by supervising the up-sampled decoders? To address these questions, we propose the Adaboosted Deeply Supervised UNet (ADS_UNet). The key contributions of our work can be summarized as follows:

1) We integrate deep supervision, cascade learning, and AdaBoost into the proposed ADS_UNet, a stage-wise additive training algorithm, in which multiple UNets of varying depths are trained sequentially to enhance the feature diversity of constituent models. Extensive experiments demonstrate that ADS_UNet is effective in boosting segmentation performance.

2) In our deep supervision scheme, we down-sample the mask to have the same size as feature maps of hidden layers to compute pixel-wise loss, instead of up-sampling features. This modification retains the advantages of deep supervision and yet reduces computation cost and GPU memory consumption.

3) Instead of assigning balanced weights to all supervised layers, we introduce a learnable weight for the loss of each supervised layer to characterize the importance of features learned by layers.

4) We conduct a comprehensive ablation study to systematically analyze the performance gain achieved by the ADS_UNet.
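Contribution 2 can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it assumes a one-hot mask stored as nested Python lists of shape [C][H][W], a square pooling window `factor` that divides both H and W, and hypothetical helper names (`average_pool_mask`, `soft_cross_entropy`). Average pooling a one-hot mask gives, for each coarse pixel, the fraction of fine pixels that belong to each class, and these fractions can serve as targets for a pixel-wise cross-entropy at the resolution of a hidden layer.

```python
import math


def average_pool_mask(one_hot, factor):
    """Average-pool a one-hot mask [C][H][W] by `factor` (assumed to divide H and W)."""
    pooled = []
    for channel in one_hot:
        height, width = len(channel), len(channel[0])
        rows = []
        for top in range(0, height, factor):
            row = []
            for left in range(0, width, factor):
                window = [channel[r][c]
                          for r in range(top, top + factor)
                          for c in range(left, left + factor)]
                row.append(sum(window) / len(window))
            rows.append(row)
        pooled.append(rows)
    return pooled


def soft_cross_entropy(target, prob, eps=1e-12):
    """Pixel-wise cross-entropy averaged over pixels; both inputs are [C][h][w]."""
    height, width = len(target[0]), len(target[0][0])
    total = 0.0
    for t_channel, p_channel in zip(target, prob):
        for t_row, p_row in zip(t_channel, p_channel):
            for t, p in zip(t_row, p_row):
                total -= t * math.log(p + eps)  # eps guards against log(0)
    return total / (height * width)


# Toy 2-class 4x4 mask: class 1 covers the right half plus one extra pixel.
class1 = [[0, 0, 1, 1],
          [0, 1, 1, 1],
          [0, 0, 1, 1],
          [0, 0, 1, 1]]
class0 = [[1 - v for v in row] for row in class1]

pooled = average_pool_mask([class0, class1], factor=2)  # shape [2][2][2]
uniform = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(2)]   # an uninformative prediction
loss = soft_cross_entropy(pooled, uniform)
```

Because the pooled targets still sum to one over classes at every coarse pixel, the loss keeps the form of an ordinary cross-entropy while being evaluated on a much smaller grid than the full-resolution mask.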
∗ Corresponding author
[email protected] (Y. Yang); [email protected] (S.
Dasmahapatra); [email protected] (S. Mahmoodi)
ORCID (s): 0000-0002-2595-7883 (Y. Yang)

Yang, Dasmahapatra, Mahmoodi: Preprint submitted to Elsevier Page 1 of 16



[Figure 1 panels: (a) UNet^e, (b) UNet++, (c) CENet, (d) Legend]

Figure 1: Comparison of UNet^e (a), UNet++ (b) and CENet (c). UNet++ is constructed from UNet^e by introducing skip-connections (red dashed lines in (b)) between decoder nodes. CENet disregards inner decoder nodes and adopts deconvolution and concatenation to harvest multi-scale context clues between encoder and decoder nodes.

2. Related Work

In this section, we review the works related to UNet and its variants, deep supervision and AdaBoost, which are the main components of our architecture.

2.1. UNet family

UNet (Ronneberger et al., 2015) consists of a down-sampling path to capture context, and a symmetric up-sampling path to expand feature maps back to the input size. The down-sampling part has an FCN-like architecture that extracts features with 3×3 convolutions. The up-sampling part uses deconvolution to reduce the number of feature maps while increasing their area. Feature maps from the down-sampling part of the network are copied and concatenated to the up-sampling part to ensure precise localization.

Building on the success of UNet, several variants have been proposed to further improve segmentation performance. Here we describe the networks UNet^e and UNet++ (Zhou et al., 2019), whose simplified architectures are given in Figure 1. UNet^e is an ensemble architecture, which combines UNets of varying depths into one unified structure. Note that deep supervision is required to train UNet^e in an end-to-end fashion. In order to allow deeper UNets to offer a supervision signal to the decoders of the shallower UNets in the ensemble and address the potential loss of information, the UNet++ connects the decoder nodes, to enable dense feature propagation along skip connections and thus more flexible feature fusion at the decoder nodes. The difference between UNet++ and UNet^e is that there are skip-connections between decoder nodes in UNet++ (highlighted in red in Figure 1(b)). Zhou et al. (2022) proposed the contextual ensemble network (CENet), where the contextual cues are aggregated via densely up-sampling the features of the encoder layers to the features of the decoder layers. This enables CENet to capture multi-scale context information. While UNet++ and CENet yield higher performance than UNet, they do so by introducing dense skip connections that result in a huge increase of parameters and computational cost.

Most recently, building upon the success of Vision Transformer (Dosovitskiy et al., 2021) on image classification tasks, self-attention modules have also been integrated into UNet-like architectures for accurate segmentation. Luo et al. (2022) proposed the hybrid ladder transformer (HyLT), in which the authors use bidirectional cross-attention bridges at multiple resolutions for the exchange of local and global features between the CNN- and transformer-encoding paths. The fusion of local and global features renders HyLT more robust to image perturbations than other CNN-, transformer- and hybrid-based methods. Gao et al. (2022) presented MedFormer, in which an efficient bidirectional multi-head attention (B-MHA) is proposed to eliminate redundant tokens and reduce the quadratic complexity of conventional self-attention to a linear level. Furthermore, the B-MHA liberates the constraints of model design and enables MedFormer to extract global relations on high-resolution token maps towards the fine-grained boundary modelling. Ma et al. (2022) proposed a hierarchical context-attention transformer-based architecture (HT-Net), which introduces an axial attention layer to model pixel dependencies of multi-scale feature maps, followed by a context-attention module that captures context information from adjacent encoder layers.

CRediT authorship contribution statement

Yilong Yang: Conceptualization, Methodology, Visualization, Investigation, Software, Formal analysis, Writing-Original Draft. Srinandan Dasmahapatra: Supervision, Conceptualisation, Formal analysis, Writing-Reviewing & Editing. Sasan Mahmoodi: Supervision, Conceptualisation, Formal analysis, Writing-Reviewing & Editing.

2.2. Deep supervision

A deeply supervised network (DSN) (Lee et al., 2015) introduced classification outputs to hidden layers as well as the last layer output as is the convention. This was shown to increase the discriminative power of learned features in shallow layers and robustness to hyper-parameter choice. Despite the fact that the original DSN was proposed for classification tasks, deep supervision can also be used for image segmentation. Dou et al. (2016) introduced deep supervision to combat potential optimization difficulties and concluded that the model acquired a faster convergence rate and greater discriminability. Based on the UNet architecture, Zhu et al. (2017) introduced a supervision layer to each encoder/decoder block. Their method is very similar to our proposed supervision scheme; the difference lies in how the


loss between the larger-sized ground-truth and the smaller-sized output of hidden layers is calculated. Note that the dimensions of the feature maps of the hidden layers are gradually reduced and become much smaller than that of the ground-truth mask, because of the down-sampling operation. In Dou et al. (2016) and Zhu et al. (2017), deconvolutional layers were used to up-sample feature maps back to the same size as the ground-truth mask. Evidently, the additional deconvolutional layers introduce more parameters and more computational overhead. Although it was pointed out in Long et al. (2015) that one can learn arbitrary interpolation functions, bilinear interpolation was adopted in Xie and Tu (2015) to up-sample feature maps with no reduction in performance compared to learned deconvolutions. All of the aforementioned works solve the dimension mismatch problem by up-sampling feature maps. However, in our deep supervision scheme, we perform average pooling to down-sample the ground-truth mask to the same size as feature maps of hidden layers. This reduces the amount of computation and is more GPU memory efficient.

2.3. AdaBoost

AdaBoost (Adaptive Boosting) (Freund and Schapire, 1997) is a very successful ensemble classifier, which has been widely used in binary classification tasks. The idea of AdaBoost is based on the assumption that a highly accurate prediction rule can be obtained by combining many relatively weak and inaccurate rules. This was re-derived in Friedman (2001) as a gradient of an exponential loss function of a stage-wise additive model. Such an additive model was extended to the multi-class case by Hastie et al. (2009), who proposed SAMME (Stage-wise Additive Modeling using a Multi-class Exponential loss function) that naturally extends the original AdaBoost algorithm to the multi-class case without reducing it to multiple two-class problems. The detailed iterative procedure of multi-class AdaBoost is described in Algorithm 2 of Hastie et al. (2009).

Starting from equally weighted training samples, the AdaBoost trains a classifier $f_t$ ($t \in \{1, 2, ..., T\}$ is the iteration index) iteratively, re-weighting the training samples in the process. A misclassified item $x_i$ is assigned a higher weight $w^t_i$ so that the next iteration of the training pays more attention to it. After each classifier $f_t$ is trained, it is assigned a weight based on its error rate $\epsilon_t$ on the training set. For the integrated output of the classifier ensemble, the more accurate classifier is assigned a larger weight $\alpha_t$ to have more impact on the final outcome. A classifier with accuracy below $1/C$ (less than random guessing for $C$ target classes) is discarded. $T$ classifiers will be trained after $T$ iterations of this procedure. The final labels can be obtained by the weighted majority voting of these $T$ classifiers.

An adaptive algorithm, AdaBoost-CNN, which combines multiple CNN models for multi-class classification was introduced in Taherkhani et al. (2020). In AdaBoost-CNN, all the weak classifiers are convolutional neural networks and have the same architecture. Instead of training a new CNN from scratch, they transfer the parameters of the prior CNN to the later one and then train the new CNN for only one epoch. This achieves better performance than the single CNN, but at the cost of increasing the number of parameters several fold. Curriculum learning (Bengio et al., 2009) is related to boosting algorithms, in that the training schedule gradually emphasizes the difficult examples. Cui et al. (2022) demonstrated that better performance can be achieved by forcing UNet to learn from easy to difficult scenes. However, the difficulty level of training samples is predefined according to the size of the target to be segmented, rather than calculated by the network itself.

3. Method

Ensemble learning is often justified by the heuristic that each base learner might perform well on some data and less accurately on others for some learned features, to enable the ensemble to override common weaknesses. To this end, we seek enhanced segmentation performance of the model by enabling diverse feature maps to be learned. We propose the ADS_UNet algorithm, which adopts a layer-wise cascade training approach (Fahlman and Lebiere, 1989; Bengio et al., 2007; Marquez et al., 2018) but with an added component that re-weights training samples to train each base learner in sequence. We evaluate the role of feature map diversity in section 5.3.

3.1. Computation and Memory Efficient Deep Supervision

As we mentioned in the introduction section, the UNet^e and UNet++ (Zhou et al., 2019) offer deep supervision to shallower layers by gradually up-sampling feature maps to the size of the mask, which is expensive in both computation and GPU memory. To reduce the computational burden, we average-pool the mask to have the same size as feature maps. The advantage of this change is that we no longer need to train deconvolutional weights for intermediate blocks to obtain feature maps with the same dimension as the ground-truth mask. This is of potential benefit for texture segmentation, as relevant textural characteristics occur at multiple length scales, and are not confined to the location of the mask boundary. We adopted UNet^d s, whose hidden layers have been trained with supervision, as base learners of the proposed ensemble model. Given the input image $x$ and the network, we define the probability map generated at block $X^{i,j}$ as:

$$\hat{y}^{i,j}(x) = \mathrm{softmax}(X^{i,j}(x)) \tag{1}$$

The mapping $X^{i,j}(\cdot) : \boldsymbol{X} \to \mathbb{R}^{N^{i,j} \times C}$ consists of a sequence of convolution, batch normalization, ReLU activation and pooling operations, to transform the input image to a feature representation. Then a softmax activation function is used to map the representation to a probability map. Here $C$ is the number of classes, $N^{i,j}$ denotes the number of pixels of the down-sampled mask, $(i, j)$ denotes the index of convolutional blocks. Given mask $y^{i,j} \in \mathbb{R}^{N^{i,j} \times C}$, the loss function used in the block $X^{i,j}$ is the pixel-wise cross-entropy loss,


[Figure 2 panels: (a) UNet^1, (b) UNet^2, (c) UNet^3, (d) UNet^4, (e) ADS_UNet, (f) scSE, (g) Legend]

Figure 2: The architecture of the proposed ADS_UNet. Each circular node in the graph represents a convolution block. Specifically, yellow nodes indicate that parameters are trainable during back-propagation, green nodes indicate that parameters are frozen. (a-d) UNets of varying depths. All UNet^i are trained on the same dataset, but using different sample weights, $W$. (e) Ensemble architecture, ADS_UNet, which combines UNets of varying depths into one unified architecture for inference. (f) The scSE block. It contains a left branch, the channel squeeze and spatial excitation block (sSE), and a right branch, the spatial squeeze and channel excitation block (cSE).
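As a reading aid for panels (a)-(d), the sketch below enumerates which blocks are newly initialized when the UNet of depth d is added and which encoder blocks are frozen afterwards. It follows the index pattern given in Algorithm 1 (new blocks $X^{d-j,j}$ for j = 0, ..., d; frozen blocks $X^{j,0}$ for j = 0, ..., d). The function names are hypothetical, and the code only tracks block indices; it does not build or train any network.

```python
def new_blocks(depth):
    """Indices (i, j) with i + j = depth: blocks initialized when UNet^depth is added."""
    return [(depth - j, j) for j in range(depth + 1)]


def frozen_encoder_blocks(depth):
    """Encoder blocks X^{j,0}, j = 0..depth, frozen once UNet^depth has been accepted."""
    return [(j, 0) for j in range(depth + 1)]


for depth in range(1, 5):
    print(f"UNet^{depth}: new blocks {new_blocks(depth)}, "
          f"frozen afterwards {frozen_encoder_blocks(depth)}")
```

For depth 1 this lists the bottleneck and decoder blocks (1, 0) and (0, 1) as new, and the encoder blocks (0, 0) and (1, 0) as frozen before the second stage begins.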

which is defined as:

$$\mathcal{L}(y^{i,j}, \hat{y}^{i,j}, N^{i,j}) = -\frac{1}{N^{i,j}} \sum_{n=1}^{N^{i,j}} \sum_{c=1}^{C} y^{i,j}_{n,c} \log(\hat{y}^{i,j}_{n,c}), \tag{2}$$

where $y^{i,j}_{n,c}$ is the ground-truth label of a pixel and $\hat{y}^{i,j}_{n,c}$ is the probability of the pixel being classified as class $c$. Based on equation (2), the overall loss function for the deep supervised UNet^d is then defined as the weighted sum of the cross entropy loss from each supervised block $X^{i,j}$:

$$\mathcal{L}^d = \sum_{\substack{i+j=d \\ i,j \ge 0}} \eta^d_{i,j}\, \mathcal{L}(y^{i,j}, \hat{y}^{i,j}, N^{i,j}), \qquad \eta^d_{i,j} \ge 0, \qquad \sum_{\substack{i+j=d \\ i,j \ge 0}} \eta^d_{i,j} = 1, \tag{3}$$

where $\eta^d_{i,j}$ is a weighting factor assigned to the convolutional block $X^{i,j}$ to characterize the relative importance of blocks. $d$ denotes the depth of the UNet. In contrast to previous works (Dou et al., 2016; Zhu et al., 2017; Zhou et al., 2019) that use equal weights $\eta^d_{i,j} = \frac{1}{d+1}$, we initialize $\eta^d_{i,j}$ to $\frac{1}{d+1}$ and allow the $\eta^d_{i,j}$ to be trainable, and use the softmax function to normalize $\eta^d_{i,j}$ to guarantee $\sum \eta^d_{i,j} = 1$. However, the feature learning of a block will be restricted if its $\eta^d_{i,j}$ decreases to 0 during training. In order to guard against this competition exclusion phenomenon and encourage all supervised blocks to contribute to the segmentation, we add a constant $\frac{1}{d+1}$ to $\eta^d_{i,j}$ to raise its lower limit:

$$\tilde{\eta}^d_{i,j} = \frac{\eta^d_{i,j} + \frac{1}{d+1}}{\sum_{\substack{i+j=d \\ i,j \ge 0}} \left(\eta^d_{i,j} + \frac{1}{d+1}\right)} = \frac{\eta^d_{i,j}}{2} + \frac{1}{2(d+1)}, \tag{4}$$

Since $\lim_{\eta^d_{i,j} \to 0} \left(\frac{\eta^d_{i,j}}{2} + \frac{1}{2(d+1)}\right) = \frac{1}{2(d+1)}$ and $\lim_{\eta^d_{i,j} \to 1} \left(\frac{\eta^d_{i,j}}{2} + \frac{1}{2(d+1)}\right) = \frac{d+2}{2(d+1)}$, $\tilde{\eta}^d_{i,j}$ is bounded in $\left[\frac{1}{2(d+1)}, \frac{d+2}{2(d+1)}\right]$. Then equation (3) is re-written as follows to train each constituent model UNet^d:


$$\mathcal{L}^d = \sum_{\substack{i+j=d \\ i,j \ge 0}} \tilde{\eta}^d_{i,j}\, \mathcal{L}(y^{i,j}, \hat{y}^{i,j}, N^{i,j}), \qquad \frac{1}{2(d+1)} \le \tilde{\eta}^d_{i,j} \le \frac{d+2}{2(d+1)}, \qquad \sum_{\substack{i+j=d \\ i,j \ge 0}} \tilde{\eta}^d_{i,j} = 1. \tag{5}$$

Once the UNet^d is trained, the final probability map generated by UNet^d is calculated by:

$$\hat{y}^d(x) = \sum_{\substack{i+j=d \\ i,j \ge 0}} \tilde{\eta}^d_{i,j}\, \hat{y}^{i,j}(x), \tag{6}$$

with $\hat{y}^{i,j}(x)$ and $\tilde{\eta}^d_{i,j}$ defined in equations (1) and (4). $\hat{y}^d(x)$ denotes the combined prediction of model UNet^d. We conduct ablation studies in section 5.2 to show the benefits of imposing range constraint on $\tilde{\eta}^d_{i,j}$. Moreover, we demonstrate that generating the final prediction by using the weighted summation of multi-scale outputs yields better segmentation performance.

3.2. Stage-wise Additive Training

The stage-wise additive training process of the ADS_UNet is described in Algorithm 1 and visually illustrated in Figure 2. The main components of the iterative training procedure are 1) updating sample weights, 2) assigning weighting factors to base learners, and 3) freezing trained encoders while training decoders. We will elaborate on these as follows.

Algorithm 1: ADS_UNet. The $\tilde{\eta}^d_{i,j}$ term in line 9 is discussed in the context of equation (4); the UNet^d are described in Figure 2.

Input: Number of classes: $C$; Training images: $\mathcal{X} = \{x_1, ..., x_m\}$; Training masks: $\mathcal{Y} = \{y_1, ..., y_m\}$; Number of iterations: $T$.
 1  $\boldsymbol{w}^1 = \{w^1_k \mid w^1_k = \frac{1}{m},\ k = 1, 2, ..., m\}$;
 2  Initialize convolutional block $X^{0,0}$;
 3  for $d = 1, 2, ..., T$ do
 4      for $j = 0, 1, ..., d$ do
 5          Initialize convolutional block $X^{d-j,j}$;
 6      end
 7      $\tilde{\eta}^d_{i,j} = \frac{1}{d+1}$, $i, j \ge 0$, $i + j = d$;
 8      Train UNet^d($\mathcal{X}$, $\mathcal{Y}$, $\boldsymbol{w}^d$);
 9      $\hat{y}^d_k = \sum_{i+j=d,\ i,j \ge 0} \tilde{\eta}^d_{i,j}\, \hat{y}^{i,j}(x_k)$;   // see equation (6)
10      $s^d_k = \mathrm{mIoU}(\hat{y}^d_k, y_k)$;   // see equation (7)
11      $\epsilon_d = \sum_{k=1}^{m} w^d_k (1 - s^d_k)$;
12      if $\epsilon_d < 1 - \frac{1}{C}$ then
13          $\alpha_d = \frac{1}{2} \ln\left(\frac{1 - \epsilon_d}{\epsilon_d}\right) + \ln(C - 1)$;
14          Update sample weight $w^d_k$ using equations (8a) and (8b);
15          for $j = 0, 1, ..., d$ do
16              Freeze convolution block $X^{j,0}$;
17          end
18      else
19          $\alpha_d = 0$;
20      end
21  end
Output: ADS_UNet $= \arg\max_{C} \left(\sum_{d=1}^{T} \alpha_d\, \hat{y}^d\right)$

Firstly, given the training images $\mathcal{X} = \{x_1, ..., x_m\}$ of $n$ pixels each, and associated masks $\mathcal{Y} = \{y_1, ..., y_m\}$, we assign a weight $w_k$ to each sample $x_k$. These weights are initialized to $w^1_k = \frac{1}{m}$ (line 1 in Algorithm 1). Then, in the first iteration ($d = 1$), the parameters of the encoder block ($X^{0,0}$) of the first base learner UNet^1 are initialized (line 2). In the first iteration of the sequential learning approach, parameters of the bottleneck node $X^{1,0}$ and the decoder node $X^{0,1}$ of the UNet^1 are initialized randomly (lines 4-6). Line 7 initializes the weighting factors $\tilde{\eta}^d_{i,j}$ of supervised blocks. The UNet^1 is then trained on all training samples with the same weight of $\frac{1}{m}$ (line 8). After the UNet^1 is trained, the training set will be used to evaluate it and to determine its error rate $\epsilon_1$ (lines 9-11). In contrast to AdaBoost, we use mean Intersection over Union (mIoU) error (line 10) to measure segmentation performance rather than using mis-classification rate. In detail, given one-hot mask $\boldsymbol{y}_{k,c} = [k_1, \cdots, k_n]$, $k_j \in \{0, 1\}$ for a pixel of image $k$ belonging to class $c$ and the corresponding one-hot prediction $\hat{\boldsymbol{y}}^d_{k,c} = [\hat{k}_1, \cdots, \hat{k}_n]$, $\hat{k}_j \in \{0, 1\}$ generated by UNet^d, the mIoU score $s^d_k$ is calculated by:

$$s^d_k = \mathrm{mIoU}(y_k, \hat{y}^d_k) = \frac{1}{C} \sum_{c=1}^{C} \frac{\boldsymbol{y}_{k,c} \cdot \hat{\boldsymbol{y}}^d_{k,c}}{\boldsymbol{y}_{k,c} \cdot \boldsymbol{y}_{k,c} + \hat{\boldsymbol{y}}^d_{k,c} \cdot \hat{\boldsymbol{y}}^d_{k,c} - \boldsymbol{y}_{k,c} \cdot \hat{\boldsymbol{y}}^d_{k,c}}, \tag{7}$$

where $k$ is the index of training images, $c$ is the index of class labels, $d$ is the index of iteration and also denotes the depth of the constituent UNet. If the error rate $\epsilon_1$ of the UNet^1 is less than $1 - \frac{1}{C}$ (line 12), then UNet^1 will be preserved for the ensemble, otherwise, it will be disregarded by setting its weighting factor to 0 (lines 18-19). In the case that $\epsilon_1 < 1 - \frac{1}{C}$, the equation shown in line 13 is used to calculate model weight $\alpha_d$ for the ensemble. So far we have obtained the first base learner UNet^1, and its weighting factor $\alpha_1$.

We then update sample weights based on mIoU scores (line 14) for the training of the next iteration:

$$w^d_k = w^{d-1}_k\, e^{(1 - s^{d-1}_k)}, \quad k = 1, 2, ..., m, \tag{8a}$$

$$w^d_k \leftarrow \frac{w^d_k}{\sum_{i=1}^{m} w^d_i}, \quad k = 1, 2, ..., m, \tag{8b}$$

Equation (8a) assigns greater weight to images that cannot be accurately segmented by UNet^{d-1}, encouraging UNet^d to focus more on their segmentation. Equation (8b) normalizes sample weights to guarantee that $\sum_{k=1}^{m} w^d_k = 1$.
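The boosting step described above (lines 10-14 of Algorithm 1 together with equations (7), (8a) and (8b)) can be sketched in plain Python. This is an illustrative sketch rather than the authors' code: masks are assumed to be per-class binary lists of length n, and the function names, the treatment of a class that is absent from both mask and prediction, and the clamp on a zero error rate are assumptions introduced here.

```python
import math


def miou(mask, pred):
    """Equation (7): mean IoU over classes; mask and pred are [C][n] binary lists."""
    scores = []
    for y, y_hat in zip(mask, pred):
        inter = sum(a * b for a, b in zip(y, y_hat))
        denom = sum(a * a for a in y) + sum(b * b for b in y_hat) - inter
        # Assumption: a class absent from both mask and prediction counts as a match.
        scores.append(inter / denom if denom > 0 else 1.0)
    return sum(scores) / len(scores)


def boosting_step(weights, scores, num_classes, tiny=1e-12):
    """Return (error, alpha, new_weights) for one stage, given per-image mIoU scores."""
    error = sum(w * (1.0 - s) for w, s in zip(weights, scores))    # line 11
    if error >= 1.0 - 1.0 / num_classes:                            # test in line 12 fails
        return error, 0.0, list(weights)                            # line 19, weights untouched
    clamped = max(error, tiny)  # assumption: keeps the logarithm finite when error is 0
    alpha = 0.5 * math.log((1.0 - clamped) / clamped) + math.log(num_classes - 1)  # line 13
    raw = [w * math.exp(1.0 - s) for w, s in zip(weights, scores)]  # equation (8a)
    total = sum(raw)
    return error, alpha, [r / total for r in raw]                   # equation (8b)


# Two toy images with 4 pixels and 2 classes; the second prediction has one wrong pixel.
mask = [[1, 1, 0, 0], [0, 0, 1, 1]]
pred_good = [[1, 1, 0, 0], [0, 0, 1, 1]]
pred_bad = [[1, 0, 0, 0], [0, 1, 1, 1]]

scores = [miou(mask, pred_good), miou(mask, pred_bad)]
error, alpha, weights = boosting_step([0.5, 0.5], scores, num_classes=2)
```

In this toy case the imperfectly segmented image ends up with the larger normalized weight, which is the behaviour equation (8a) is meant to produce.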
Before the start of the second iteration, it is necessary to freeze the encoder nodes ($X^{0,0}$ and $X^{1,0}$) of the UNet^1 (lines


15-17). Otherwise, the process of training UNet^2 would update UNet^1's encoder parameters as well, reducing the learned association between the encoder and decoder paths of UNet^1. Furthermore, subsequent sub-networks UNet^d, for $d \ge 2$ would acquire correlated features. The code block in lines 4-20 is run for $T$ iterations to obtain $T$ base learners, each weighted by $\alpha_d$. Note that all parameters of UNet^1 are trained as a whole but UNet^2 reuses encoder weights of UNet^1 and only its decoder parameters are trained (if $\epsilon_1 < 1 - \frac{1}{C}$). These are shown as yellow nodes in Figure 2(b). The purpose of using the updated sample weights $w^2_k$ to train UNet^2 is to force the decoder layers of UNet^2 (because the connection between $X^{1,0}$ and $X^{2,0}$ only involves max pooling) to learn features dissimilar to those learned by UNet^1. This procedure is repeated for each of the base learners UNet^d, with the additional help of feature normalization to be described next.

[Figure 3 panels: (a) Info flow of UNet^1, (b) Info flow of UNet^2]

Figure 3: Information flow diagram of base learners. $f^{i,j}$ denotes the output feature maps of block $X^{i,j}$, $f^{i,j}_{scSE}$ denotes the re-calibrated version of $f^{i,j}$, a circle with a plus sign denotes the feature map concatenation operation, and connecting lines with an arrow denote the flow of features. Features shown in red can be updated during the training of UNet^2 (after UNet^1 is trained), while the others are fixed.

3.3. Feature Re-calibration

The concurrent spatial and channel Squeeze & Excitation (scSE) block (Roy et al., 2018) is used to re-calibrate feature maps learned from encoder blocks of UNet^d, to better adapt to features learned from decoder blocks of deeper UNet^{d+a}, $a \ge 1$ layers. For example, features learned by the encoder block $X^{0,0}$ and the decoder block $X^{1,0}$ of the UNet^1 can cooperate well to perform segmentation since their weights are updated in a coordinated end-to-end back-propagation process. In UNet^2, however, features produced by $X^{1,1}$ and $X^{1,0}$ (in the same depth) can be very different, since the gradient flow is truncated between block $X^{1,0}$ and $X^{2,0}$. Therefore, although features produced by $X^{0,0}$ used to cooperate well with those of $X^{1,0}$, it is not guaranteed that they can adapt well to those of $X^{1,1}$. Based on this analysis, the scSE block is used to re-weight features before concatenating. We evaluate the role of feature re-calibration in section 4.4. The detailed process of scSE is illustrated in Figure 2(f).

Given an input feature map $\boldsymbol{U} \in \mathbb{R}^{H \times W \times C}$, the channel squeeze operation generates a matrix $\boldsymbol{q} \in \mathbb{R}^{H \times W}$ with matrix elements $q_{i,j} = W_{sq} \cdot U_{i,j,:}$, where $W_{sq} \in \mathbb{R}^{C}$ maps the vector at each location $(i, j)$ into a scalar. This matrix is then re-scaled by passing it through a sigmoid function $\sigma(\cdot)$, which re-weights the input feature map $\boldsymbol{U}$ spatially,

$$\hat{U}^{sSE}_{i,j,k} = \sigma(q_{i,j})\, U_{i,j,k}, \tag{9}$$

The global average pooling of the feature map over all pixels produces $\boldsymbol{z}$ with components $z_k$,

$$z_k = \frac{1}{H \times W} \sum_{i}^{H} \sum_{j}^{W} U_{i,j,k}, \quad k = 1, 2, \ldots, C \tag{10}$$

This vector, $\boldsymbol{z}$, is transformed to $\hat{\boldsymbol{z}} = \boldsymbol{W}_1(\mathrm{ReLU}(\boldsymbol{W}_2 \boldsymbol{z}))$, with $\boldsymbol{W}_1 \in \mathbb{R}^{C \times \frac{C}{2}}$, $\boldsymbol{W}_2 \in \mathbb{R}^{\frac{C}{2} \times C}$ being weights of two fully connected layers. The range of the activations of $\hat{\boldsymbol{z}}$ is brought to the interval [0, 1], by passing it through a sigmoid function $\sigma(\cdot)$. The input feature map $\boldsymbol{U}$ is then re-weighted by the re-scaled vector, with its $k$-th channel

$$\hat{\boldsymbol{U}}^{cSE}_k = \sigma(\hat{z}_k)\, \boldsymbol{U}_k, \quad \boldsymbol{U}_k \in \mathbb{R}^{H \times W}, \tag{11}$$

In the channel re-calibrated feature maps $\hat{\boldsymbol{U}}^{cSE}$, the channels that are less important are suppressed and the important ones are emphasized. Finally, after concurrent spatial and channel squeeze and excitation (scSE), a location $(i, j, c)$ of the input feature map $\boldsymbol{U}$ is then given higher activation when it gets high importance from both channel re-scaling and spatial re-scaling.

It is worth mentioning that sample re-weighting and feature re-calibration are utilized in the ADS_UNet for different purposes and are not in conflict with each other. Taking the pair UNet^1 and UNet^2 as an example, sample re-weighting aims at achieving feature diversity between final outputs ($f^{0,1}$ and $f^{0,2}$ in Figure 3) of two base learners, so that the ensemble of UNet^1 and UNet^2 can compensate for each other's incorrect predictions, thus leading to better segmentation. When considering feature re-calibration, UNet^1 is trained as a whole with each training sample having the same sample weight (as described in section 3.2). That means feature maps $f^{0,0}$ and $f^{1,0}$ have a high association. In the second iteration, however, UNet^2 reuses UNet^1's encoder blocks ($X^{0,0}$ and $X^{1,0}$ are fixed now), and only newly added blocks ($X^{2,0}$, $X^{1,1}$ and $X^{0,2}$) are trained on updated sample weights. This reduces the association of [$f^{1,0}$, $f^{2,0}$] and [$f^{0,0}$, $f^{1,1}$], enabling feature de-correlation between fixed and newly added feature maps. Directly concatenating $f^{1,0}$ with $f^{2,0}$ and $f^{1,1}$ with $f^{0,0}$ ignores the feature dependence issue and results in worse performance (validated in Table 4). Therefore, to mitigate this feature mismatching effect, we re-calibrate the fixed features ($f^{0,0} \to f^{0,0}_{scSE}$, $f^{1,0} \to f^{1,0}_{scSE}$), before concatenation.

3.4. Difference between ADS_UNet and UNet++

In sections 3.1-3.3, we introduced the components and training scheme of the ADS_UNet. For inference, the final


probability map for an image $x \in \mathbb{R}^{C\times H\times W}$ can be generated by weighted average:

$$\hat{y}(x) = \mathrm{ADS\_UNet}(x) = \sum_{d=1}^{T} \alpha^{d}\,\hat{y}^{d}(x) \qquad (12)$$

Here $C$ is the number of classes. $\hat{y}^{d}(x)$ is the probability map generated by $\mathrm{UNet}^{d}$, as defined in equation (6) and shown in Figure 2(e).

The proposed ensemble structure differs from the UNet++ in two ways: in the training method, and in the way decisions are made and incorporated into learning. 1) Embedded vs. isolated training. The UNet++ is trained in an embedded training fashion where the full UNet++ model is trained as a whole, with deep supervision on the last decoder block $X^{0,i}$ of branch $i$. In the ADS_UNet, however, each $\mathrm{UNet}^{d}$ is trained by isolating features acquired by the deeper encoder and decoder blocks. Moreover, deep supervision is added to each decoder block of each branch by down-scaling the label masks, rather than solely on the last decoder node of each branch. 2) Average vs. weighted average voting. In the ensemble mode of the UNet++, the segmentation results from all branches are collected and then averaged to produce the final prediction: $\mathrm{UNet{+}{+}}(x) = \arg\max_{c\in C}\big(\frac{1}{T}\sum_{d=1}^{T}\mathrm{UNet}^{d}(x)\big)$, with $\mathrm{UNet}^{d}(x) = \hat{y}^{0,d}$, where $T$ is the number of branches of the UNet++. However, the ADS_UNet takes performance-weighted combinations of the component UNets to create the final segmentation map: $\mathrm{ADS\_UNet}(x) = \arg\max_{c\in C}\big(\sum_{d=1}^{T}\alpha^{d}\,\hat{y}^{d}\big)$, where $\hat{y}^{d} = \mathrm{UNet}^{d}(x)$ is calculated from equation (6) and $\alpha^{d}$ reflects the importance of $\mathrm{UNet}^{d}$ in the ensemble.

4. Experiments and Results
Three histopathology datasets are used to check the effectiveness of the proposed methods.

4.1. Datasets
CRAG dataset. The colorectal adenocarcinoma gland (CRAG) dataset (Awan et al., 2017) contains a total of 213 Hematoxylin and Eosin (H&E) images taken from 38 whole-slide images (WSIs) scanned with an Omnyx VL120 scanner under 20× objective magnification. Most images are of size 1512×1516 pixels. The dataset is split into 173 training images and 40 test images. We resize each image to a resolution of 1024×1024 and then crop it into four patches with a resolution of 512×512 for all our experiments.

BCSS dataset. The Breast Cancer Semantic Segmentation dataset (Amgad et al., 2019) consists of 151 H&E stained whole-slide images and ground truth masks corresponding to 151 histologically confirmed breast cancer cases. A representative region of interest (ROI) was selected within each slide by the study coordinator, a medical doctor, and approved by a senior pathologist. ROIs were selected to be representative of predominant region classes and textures within each slide. Tissue types of the BCSS dataset consist of 5 classes: (i) tumour, (ii) stroma, (iii) inflammatory infiltration, (iv) necrosis and (v) others. We set aside slides from 7 institutions to create our test set and used the remaining images for training. Shift and crop data augmentation and random horizontal and vertical flips were adopted to enrich training samples. Finally, 3154 and 1222 tiles of size 512×512 pixels were cropped for training and testing, respectively. Weighted categorical cross-entropy loss was used to mitigate class imbalance, with the weight associated with each class determined by $W_c = 1 - \frac{N_c}{N}$, where $N$ is the number of pixels in the training dataset and $N_c$ is the number of pixels belonging to class $c$.

MoNuSeg dataset. The MoNuSeg dataset (Kumar et al., 2019) is a multi-organ nucleus segmentation dataset. The training set includes 30 images of size 1000×1000 from 4 different organs (lung, prostate, kidney, and breast). The test set contains 14 images with more than 7000 nucleus boundary annotations. A 400×400 window slides through the images with a stride of 200 pixels to separate each image into 16 tiles for training and testing.

4.2. Baselines and Implementation
Since our work is mainly based on UNet, $\mathrm{UNet}^{e}$, and UNet++, we re-implement these three models, as well as CENet, to compare with our proposed methods. We also compare the proposed ADS_UNet with two transformer-based UNet variants, HyLT (Luo et al., 2022) and MedFormer (Gao et al., 2022), using the implementation provided by the authors. For a fair comparison, the configuration of the outermost convolutional blocks ($X^{i,0}$, $i \in \{0, 1, 2, 3\}$, and $X^{i,j}$, $i, j \ge 0$, $i + j = 4$) of all compared methods is exactly the same as in the original UNet (both the number and size of filters). All inner decoder nodes of $\mathrm{UNet}^{e}$, UNet++ and ADS_UNet are also exactly the same, and all models have the same hyper-parameters. Note that the scSE block is not used in UNet, $\mathrm{UNet}^{e}$, UNet++ and CENet, but it is used in the skip-connections of ADS_UNet. The models are implemented in PyTorch (Paszke et al., 2019) and trained on one NVIDIA RTX 8000 GPU using the Adam optimizer (Kingma and Ba, 2014) with a weight decay of $10^{-7}$ and a learning rate initialized at 0.001 and then changed according to the 1cycle learning rate policy (Smith and Topin, 2019). The cross-entropy loss is used to train all compared models, and ADS_UNet is trained with the linear combination of loss functions using equation (5). For models with a depth of 4, the numbers of filters at the successive levels are 64, 128, 256, 512, and 1024 on the CRAG and BCSS datasets. This setting is consistent with the standard UNet (Ronneberger et al., 2015). However, we change the number of filters to 16, 32, 64, 128, 256 for all models when trained on the MoNuSeg dataset, as our experimental results show that increasing the number of filters leads to inferior performance. The colour normalization method proposed in Vahadane et al. (2016) is used to remove stain colour variation before training. We also compare our methods with the state-of-the-art nnUNet (Isensee et al., 2021). Note that the nnUNet automatically


Table 1
The comparison of models in terms of the number of parameters, computational complexity (measured by FLOPs), required
GPU memory, training time (seconds) per epoch, and segmentation performance (measured by mIoU). The FLOPs and GPU
consumption are computed with 512×512 inputs. The GPU memory consumption is measured by nvidia-smi command (batch
size=2). In ADS_UNet, base learners require a different amount of GPU memory, since they vary in depth and the number of
parameters (The total number of trainable parameters of the ADS_UNet is 35.41 million). The mIoU score of the FCN-8 is
computed from the confusion matrix provided in the supplementary material of Amgad et al. (2019).

Net Params(M) FLOPs(G) GPU(GB) Time(s) CRAG BCSS MoNuSeg


FCN-8 (Amgad et al., 2019) – – – – – 60.55 –
UNet (Ronneberger et al., 2015) 31.04 218.9 5.54 771 86.87 59.41 80.12
UNet𝑒 (Zhou et al., 2019) 34.92 445.2 9.80 1071 86.75 58.73 81.08
UNet++ (Zhou et al., 2019) 36.17 514.8 9.31 1303 88.04 59.85 81.29
nnUNet (Isensee et al., 2021) 41.27 65.6 2.92 442 88.45 60.96 80.79
CENet (Zhou et al., 2022) 35.17 471.55 5.99 713 86.85 59.45 81.67
HyLT (Luo et al., 2022) 42.20 329.11 16.06 1500 87.70 60.45 81.69
MedFormer (Gao et al., 2022) 99.54 325.76 15.48 1337 87.92 60.26 81.84
ADS_UNet 0.41→1.63→6.65→26.72 62.61→114.80→166.93→219.04 4.00→4.92→5.40→5.71 453 89.04 61.05 81.43
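
The efficiency ratios quoted in the text can be recomputed directly from the parameter and FLOPs columns of Table 1. The following is a minimal sketch of that arithmetic; the dictionary layout and the helper name `ratio` are illustrative only, and for ADS_UNet the total parameter count from the caption and the FLOPs of the largest base learner are assumed to be the relevant figures.

```python
# Values copied from Table 1 as (parameters in M, FLOPs in G).
# ADS_UNet: total trainable parameters (caption) and FLOPs of its largest base learner.
TABLE1 = {
    "UNet": (31.04, 218.9),
    "UNet++": (36.17, 514.8),
    "CENet": (35.17, 471.55),
    "HyLT": (42.20, 329.11),
    "MedFormer": (99.54, 325.76),
    "ADS_UNet": (35.41, 219.04),
}
PARAMS, FLOPS = 0, 1  # column indices (hypothetical helper constants)


def ratio(model_a, model_b, column):
    """Ratio of one Table 1 column between two models."""
    return TABLE1[model_a][column] / TABLE1[model_b][column]


# FLOPs of UNet++ relative to UNet, and of ADS_UNet relative to UNet++ (in %).
flops_unetpp_vs_unet = round(ratio("UNet++", "UNet", FLOPS), 2)
flops_ads_vs_unetpp_pct = round(100 * ratio("ADS_UNet", "UNet++", FLOPS), 2)
# FLOPs of CENet relative to ADS_UNet.
flops_cenet_vs_ads = round(ratio("CENet", "ADS_UNet", FLOPS), 2)
# Parameter counts of the transformer-based models relative to ADS_UNet.
params_hylt_vs_ads = round(ratio("HyLT", "ADS_UNet", PARAMS), 2)
params_medformer_vs_ads = round(ratio("MedFormer", "ADS_UNet", PARAMS), 2)

print(flops_unetpp_vs_unet, flops_ads_vs_unetpp_pct, flops_cenet_vs_ads)
print(params_hylt_vs_ads, params_medformer_vs_ads)
```

Only the parameter and FLOPs columns are used here, since they map unambiguously onto single table entries.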

Table 2
The difference of the mean rank between ADS_UNet and compared methods. For each tested dataset, the differences in
performance are statistically significant if the difference in the mean rank is greater than the critical distance. Values lower than
the critical distance are highlighted in bold, indicating there are no significant differences between those models and ADS_UNet.
Dataset Critical Distance UNet UNet𝑒 UNet++ nnUNet CENet HyLT MedFormer
CRAG 0.830 1.941 2.109 1.813 0.316 2.025 0.831 0.972
BCSS 0.300 0.474 0.082 0.336 0.426 0.239 0.524 0.801
MoNuSeg 0.700 1.662 2.387 1.978 2.32 1.622 0.453 2.218
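
The decision rule in the caption of Table 2 (a difference is significant when the mean-rank difference exceeds the critical distance) can be applied mechanically to the tabulated values. The sketch below does only that comparison; it does not rerun the Friedman or Nemenyi tests, and the variable and function names are illustrative.

```python
# Mean-rank differences to ADS_UNet and critical distances, copied from Table 2.
MODELS = ["UNet", "UNet_e", "UNet++", "nnUNet", "CENet", "HyLT", "MedFormer"]
CRITICAL_DISTANCE = {"CRAG": 0.830, "BCSS": 0.300, "MoNuSeg": 0.700}
RANK_DIFF = {
    "CRAG": [1.941, 2.109, 1.813, 0.316, 2.025, 0.831, 0.972],
    "BCSS": [0.474, 0.082, 0.336, 0.426, 0.239, 0.524, 0.801],
    "MoNuSeg": [1.662, 2.387, 1.978, 2.32, 1.622, 0.453, 2.218],
}


def non_significant(dataset):
    """Models whose mean-rank difference does not exceed the critical distance."""
    cd = CRITICAL_DISTANCE[dataset]
    return [m for m, diff in zip(MODELS, RANK_DIFF[dataset]) if diff <= cd]


# Count the pairwise comparisons that are significant across all datasets.
total = sum(len(diffs) for diffs in RANK_DIFF.values())
significant_count = total - sum(len(non_significant(d)) for d in RANK_DIFF)

for dataset in RANK_DIFF:
    print(dataset, "not significant:", non_significant(dataset))
print(significant_count, "of", total, "comparisons are significant")
```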

decides the depth of the architecture based on its characterization of the properties of the datasets. In our experiments, the nnUNet generated for the BCSS and CRAG datasets has a depth of 7, while the one generated for the MoNuSeg dataset has a depth of 6. The officially released nnUNet source code is used in our experiments.

4.3. Results
Some image patches and their corresponding segmentation maps are depicted in Figure 4. Table 1 summarizes the segmentation performance achieved by all compared methods. The performance of the baseline method (VGG-16, FCN-8) used in Amgad et al. (2019) is also included for comparison. The number of parameters and the computational efficiency of the various UNet variants are also included in the table. Statistical analysis of the results (Table 1) is performed with the help of the Autorank package (Herbold, 2020). The non-parametric Friedman test and the post hoc Nemenyi test (Demšar, 2006) at the significance level 𝛼 = 0.05 are applied to determine if there are significant differences between the predictions generated by models and to find out which differences are significant. The performance of ADS_UNet is compared with 7 other models on 3 datasets. Table 2 shows that 17 of these 21 pairwise comparisons are statistically significant.

Among the different networks evaluated, the ADS_UNet outperforms all of the other state-of-the-art approaches on the CRAG and BCSS datasets, and achieves competitive performance on the MoNuSeg dataset. UNet++ achieves 1.17, 0.44 and 1.17 higher mIoU scores than UNet, at the cost of 2.35 times the computation and 1.77 times the GPU memory. In contrast, ADS_UNet performs the best and yet requires at most 59.51% of the GPU memory and 42.55% of the floating-point operations required by UNet++ for training. CENet surpasses ADS_UNet on the MoNuSeg dataset, but at a cost of 2.15 times the computation and 1.19 times the GPU memory. nnUNet requires the least GPU memory and the fewest operations, but at the cost of a small decrease in segmentation accuracy. The design choices (pipeline fingerprint) of nnUNet are not fixed across datasets, but are configured on the fly according to the ‘data fingerprint’ (dataset properties such as image size, image spacing, number of classes, etc.). The data-dependent ‘rule-based parameters’ (patch size, batch size, network depth, etc.) of the pipeline are determined by a set of heuristic rules that model parameter inter-dependencies. As shown in Table 1, nnUNet outperforms all models except ADS_UNet on the CRAG and BCSS datasets, but it demonstrates inferior performance on the MoNuSeg dataset. This can be explained by the characteristics of the datasets and the receptive field size of the models. Firstly, the nnUNet is deeper (the depth of the nnUNet is 6 or 7, as mentioned in section 4.2), which means that the convolutional kernels of the bottleneck layer (the


Figure 4: Visual comparison of segmentation maps. The mIoU score of each prediction is reported below the prediction.


deepest encoder layer) have a larger receptive field, enabling the model to extract information from a larger region. This is especially beneficial when the task is to recognize large objects, e.g. tissue types or glands, since a larger receptive field can cover the whole object. In models with a depth of 4, the size of the receptive field of the bottleneck layer is limited. This difference in the depth of models may explain why nnUNet outperforms shallower models when trained for segmenting tissues and glands. In contrast, the cell nuclei in the MoNuSeg dataset are much smaller than tissues and glands. The receptive field of the bottleneck layer of shallow models is large enough to capture the entire nucleus. Further increasing the depth of the network compresses features, leading to information loss rather than enhancing the features learnt. The nnUNet improves segmentation performance by enlarging the receptive field size, while ADS_UNet does so by ensembling. Image and mask patches presented in Figure 5 show the size difference of target objects between the nucleus segmentation dataset and the gland segmentation dataset.

Figure 5: (a)-(b) Image-mask patch from the MoNuSeg dataset (nucleus segmentation). (c)-(d) Image-mask patch from the CRAG dataset (gland segmentation). All patches are of size 512 × 512.

Both transformer-based architectures, HyLT and MedFormer, demonstrate inferior performance on the CRAG dataset, but achieve competitive performance on the BCSS dataset, and outperform the ADS_UNet on the MoNuSeg dataset. However, it is worth noting that the HyLT and the MedFormer have 1.19 and 2.81 times as many parameters as the ADS_UNet and require 2.81 and 2.71 times as much GPU memory as the ADS_UNet for training. The high demand for GPU memory in the HyLT and MedFormer is not surprising, as the attention blocks introduce extra intermediate feature maps that should be kept in the GPU memory for back-propagation.

The amount of computation (FLOPs) and the GPU memory requirement are the main constraints on training speed. Among all compared methods, ADS_UNet shows a clear advantage in training speed, because the lower GPU memory requirement of ADS_UNet allows us to use a larger batch size for faster training. The training speed of nnUNet is close to that of ADS_UNet, for the same reason. The transformer-based models (MedFormer and HyLT) are the slowest ones since they have the highest GPU memory demand and relatively high computation cost.

4.4. Ablation Studies

4.4.1. Down-sampling masks vs. up-sampling feature maps
We build UNet↑ and UNet↓ as UNet’s counterparts to demonstrate the advantage of using down-sampled masks for deep supervision. In UNet↑, feature maps of the UNet are bilinearly interpolated to fit the size of the original mask, while in UNet↓, the masks are down-sampled to fit the size of the feature maps. As shown in Table 3, UNet↓ with average pooled masks outperforms UNet by 1.46 and 1.12 mIoU on the CRAG and MoNuSeg datasets. This is achieved with only 0.06% more parameters, 1.26% more GPU memory consumption and 0.08% more FLOPs. This small increase comes from a 1×1 convolution layer appended to supervised blocks. We attribute this performance gain to back-propagation through deep layers forcing shallow layers to learn discriminative features. UNet↑ yields 0.51 and 0.16 higher mIoU than UNet↓ on the CRAG and MoNuSeg datasets, but performs worse than UNet by 1.01 points on the BCSS dataset. The 18.80% more computation of the UNet↑ (compared with UNet↓) originates from bilinear interpolation operations when up-sampling feature maps. The GPU memory required in the training process of UNet↑ is 2.50 times that of UNet↓. The reason is that the outputs of all layers are cached during forward propagation for use in back-propagation, and the size of the feature map of the supervision layer in UNet↑ is 4 to 256 times the size of the corresponding one in UNet↓. Therefore, beyond a small performance improvement, UNet↓ saves more than 1.50× in GPU memory consumption, which enables us to use a larger batch size and save training time.

4.4.2. Tracing the origin of the performance gain of ADS_UNet.
To gain insight into the reason why ADS_UNet demonstrates superior performance on segmentation, we construct eight models and evaluate them on the BCSS dataset, with each of them being a combination of deep supervision, SCSE feature re-calibration blocks and sample re-weighting. The configuration of models, the performance of each constituent $\mathrm{UNet}^{d}$ and their ensemble performance are summarized in Table 4. To see whether weighted average voting of base learners is better than simple average voting, we also compare these two ensemble strategies and report results in Table 4.

As seen in Table 4, when compared with model_0 (none of the three components is used), model_1, model_2 and model_3 demonstrate the effectiveness of incorporating deep supervision, SCSE feature re-calibration and sample re-weighting into the training of each constituent $\mathrm{UNet}^{d}$, respectively. Moreover, all constituent $\mathrm{UNet}^{d}$ models of model_2 (with SCSE) surpass their counterparts in model_0, model_1 and model_3. This supports the claim we made in section 3.3, namely, that features from the encoder block of $\mathrm{UNet}^{d-1}$ should be re-calibrated before concatenating with features from the decoder block of $\mathrm{UNet}^{d}$.


Table 3
Comparison between the original UNet (without deep supervision) and UNet↑ /UNet↓ with deep supervision using up-sampled
feature maps/average pooled masks.
Net Params (M) FLOPs (G) GPU (GB) CRAG BCSS MoNuSeg
UNet 31.04 218.9 5.54 86.87 59.41 80.12
UNet↑ 31.17 260.27 14.03 88.84 58.40 81.40
UNet↓ 31.06 219.08 5.61 88.33 59.41 81.24
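
The mask down-sampling used for deep supervision in UNet↓ can be illustrated with a small example. This is a minimal sketch, assuming the label mask is one-hot encoded and then average-pooled over non-overlapping windows; the paper states that masks are average pooled, but the encoding step and all function names here are assumptions made for illustration.

```python
def one_hot(mask, num_classes):
    """Convert an H x W integer label mask into C binary channels."""
    return [[[1.0 if label == c else 0.0 for label in row] for row in mask]
            for c in range(num_classes)]


def avg_pool(channel, factor):
    """Average-pool one H x W channel with a non-overlapping factor x factor window."""
    height, width = len(channel), len(channel[0])
    return [[sum(channel[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(width // factor)]
            for y in range(height // factor)]


def downsample_mask(mask, num_classes, factor):
    """Soft label maps of shape C x (H/factor) x (W/factor) for a supervised layer."""
    return [avg_pool(channel, factor) for channel in one_hot(mask, num_classes)]


# A 4 x 4 toy mask with 3 classes, down-scaled by a factor of 2.
mask = [[0, 0, 1, 1],
        [0, 1, 1, 1],
        [2, 2, 1, 1],
        [2, 2, 2, 1]]
soft = downsample_mask(mask, num_classes=3, factor=2)
for class_index, channel in enumerate(soft):
    print(class_index, channel)
```

A down-scale factor of s shrinks the supervised map by s² in area, which is where the 4-fold (s = 2) to 256-fold (s = 16) size difference between the supervision layers of UNet↑ and UNet↓ comes from.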

Table 4
Ablation study. Performance measured by mIoU (highest score highlighted in bold). "SCSE" denotes spatial and channel squeeze & excitation used in skip-connections. "Re-weight" denotes training sample re-weighting. "ens(avg)"/"ens(𝛼)" denote that segmentation results from all branches are collected and then averaged, or summed with 𝛼 weights.

Model Name  Deep supervision  SCSE  Re-weight  UNet1  UNet2  UNet3  UNet4  ens(avg)  ens(𝛼)
model_0     ✗                 ✗     ✗          43.41  53.66  57.59  58.40  57.92     58.22
model_1     ✓                 ✗     ✗          43.89  53.14  57.21  58.07  58.79     58.82
model_2     ✗                 ✓     ✗          46.67  55.87  58.70  59.92  59.92     60.20
model_3     ✗                 ✗     ✓          43.58  53.93  57.34  58.43  58.59     58.82
model_4     ✗                 ✓     ✓          47.97  56.47  59.51  59.55  60.51     60.76
model_5     ✓                 ✗     ✓          44.04  53.32  55.87  57.99  58.23     58.25
model_6     ✓                 ✓     ✗          46.93  56.60  58.62  60.26  60.57     60.63
ADS_UNet    ✓                 ✓     ✓          46.93  56.91  60.11  60.26  61.04     61.05
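
The two ensemble strategies compared in the last two columns of Table 4 differ only in the weights applied before the per-pixel arg max. The sketch below contrasts them on toy data; the probability values and the 𝛼 weights are made up for illustration, and the data layout (`prob_maps[learner][class][pixel]`) and function name are assumptions rather than the authors' implementation.

```python
def ensemble_predict(prob_maps, alphas=None):
    """Per-pixel arg max over classes of the (weighted) sum of probability maps.

    prob_maps[d][c][p] is the probability that base learner d assigns to
    class c at pixel p. With alphas=None every learner gets weight 1/T
    (simple average voting); otherwise alphas[d] weights learner d.
    """
    num_learners = len(prob_maps)
    if alphas is None:
        alphas = [1.0 / num_learners] * num_learners
    num_classes = len(prob_maps[0])
    num_pixels = len(prob_maps[0][0])
    labels = []
    for p in range(num_pixels):
        scores = [sum(a * pm[c][p] for a, pm in zip(alphas, prob_maps))
                  for c in range(num_classes)]
        labels.append(max(range(num_classes), key=scores.__getitem__))
    return labels


# Two toy base learners, two classes, two pixels (illustrative values only).
learner_1 = [[0.6, 0.2],   # class 0 probabilities at pixels 0 and 1
             [0.4, 0.8]]   # class 1 probabilities at pixels 0 and 1
learner_2 = [[0.3, 0.4],
             [0.7, 0.6]]
prob_maps = [learner_1, learner_2]

average_vote = ensemble_predict(prob_maps)
weighted_vote = ensemble_predict(prob_maps, alphas=[2.0, 0.5])
print(average_vote, weighted_vote)
```

In this toy case the two learners disagree at pixel 0, and giving the first learner a larger weight flips the ensemble decision there, which is the mechanism by which performance-based weights can change the final segmentation map.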

Removing deep supervision from the ADS_UNet drops the mIoU score by 0.29 points (compared with model_4). Further analysis is provided in section 5.2 to reveal the reason why introducing explicit deep supervision leads to better performance.

In model_4, model_5 and model_6, we remove deep supervision, the SCSE block and sample re-weighting from the ADS_UNet, respectively, to show the importance of each component in the composition of the ADS_UNet. As seen in Table 4, removing any one of them leads to lower segmentation performance.

The experiment conducted on model_5 demonstrates that truncating the gradient flow between encoder blocks of $\mathrm{UNet}^{d}$ and decoder blocks of $\mathrm{UNet}^{d+1}$ is detrimental to the final segmentation performance (compared with the ADS_UNet). By introducing feature re-calibration in skip-connections, features learnt in encoder blocks are re-weighted to adapt to those of decoder blocks, thereby leading to better performance. The importance of SCSE feature re-calibration is also reflected in comparisons of model_0 vs. model_2 (1.98↑), model_1 vs. model_6 (1.81↑), and model_3 vs. model_4 (1.94↑).

In terms of sample re-weighting, the ensemble (ens(𝛼)) of ADS_UNet surpasses that of model_6 by 0.42 points. We attribute this to sample weight updating, which allows $\mathrm{UNet}^{d}$ to pay more attention to images that $\mathrm{UNet}^{d-1}$ finds hard to segment. The benefit of sample re-weighting

[Figure 6 panels: (a) $\tilde{\eta}_{i,j}$ values (MoNuSeg); (b) $\tilde{\eta}_{i,j}$ values (CRAG); (c) $\tilde{\eta}_{i,j}$ values (BCSS); (d) training losses (BCSS).]

Figure 6: (a)-(c) show how $\eta^{d}_{i,j}$ changes when the UNet↓ is trained on the MoNuSeg, CRAG and BCSS datasets. The evolution of $\eta_{i,j}$ varies from dataset to dataset. (d) The training losses of supervision layers (trained on the BCSS dataset).


is also reflected in comparisons of model_0 vs. model_3 (0.6↑) and model_2 vs. model_4 (0.56↑).

When comparing ensemble strategies, we find that both average voting and weighted voting improve segmentation performance compared with $\mathrm{UNet}^{4}$, but the improvement due to weighted voting is larger than that from average voting. Moreover, the segmentation performance of model_6 with 𝛼 weighting is better than that with average weighting, although training samples are not re-weighted in its iterative training process. This, too, supports the view that integrating multiple models by weighting each according to its segmentation ability improves the overall performance of the ensemble.

5. Analysis

5.1. Incorrect labeling information can be evaded by adjusting $\tilde{\eta}^{d}_{i,j}$.
It is true that down-sampling the ground-truth mask eliminates small objects and leads to incorrect labels for pixels located on the class boundaries. We quantify the ratio of incorrect labels of down-sampled masks and present the statistics in Table 5. It can be observed that the proportion of incorrect labels rises as the down-scaling factor becomes larger. Incorrect labels in the ×16 down-scaled mask in the CRAG and BCSS datasets account for 9.94% and 19.75% of the total labels, respectively. This figure soars to 60.05% in the MoNuSeg dataset. However, it is noteworthy that when these reduced masks are used to supervise the training of layers, there is a trainable weight $\tilde{\eta}^{d}_{i,j}$ (defined in equation (4)) that dynamically adjusts the strength with which each layer is supervised. Figure 6(a)-6(c) shows how the network adjusts $\tilde{\eta}_{i,j}$ during training to assign weightings to the layers and scales that contribute most to the segmentation task. As seen, at the end of the training, the largest $\tilde{\eta}_{i,j}$ values for the MoNuSeg, CRAG and BCSS datasets are $\tilde{\eta}_{0,4}$, $\tilde{\eta}_{1,3}$ and $\tilde{\eta}_{3,1}$, respectively. That means the UNet↓ benefits most from the original mask and from the masks down-scaled by factors of 2 and 8 when trained on the MoNuSeg, CRAG and BCSS datasets, respectively. The ×2 and ×8 down-scaled masks carry 1.02% and 10.04% incorrect label information, respectively. Therefore, even though the down-scaled masks introduce wrong labelling information, the UNet↓ is able to evade this wrong information to a certain extent and focuses on the informative mask by adjusting $\tilde{\eta}_{i,j}$. Despite the (apparently significant) labelling errors introduced by down-sampling, the overall result (as shown in Table 3) is not adversely affected.

Table 5
The proportion (%) of incorrect labels in masks down-scaled by different factors. The layers listed under each down-scale factor are those supervised by the mask down-sampled by that factor.

Down-scale factor   2            4            8            16
Supervised layers   X1,0, X1,3   X2,0, X2,2   X3,0, X3,1   X4,0
CRAG                1.02         2.78         5.72         9.94
BCSS                1.51         4.32         10.04        19.75
MoNuSeg             5.49         16.21        36.69        60.05

5.2. Deep Supervision in UNet↓ and ADS_UNet

5.2.1. Different layers contribute differently at different time stamps.
In $\mathrm{UNet}^{e}$ and UNet++, all losses have the same weight in the back-propagation process, while in UNet↓ and ADS_UNet, $\tilde{\eta}_{i,j}$ is trainable. The purpose of this design is to check whether all layers in the summand of the training loss in equation (5) contribute equally. Taking Figure 6(c) as an example, the importance of decoder nodes $X^{3,1}$ and $X^{2,2}$ is ranked in the top two. This means features learned by these 2 layers contribute more than others, with changes in their importance throughout the training process. From the perspective of back-propagation, this means that parameters of layers which have larger $\eta_{i,j}$ values will have relatively large changes when they are updated using gradient descent. This fact, therefore, indicates the importance of the features derived at that length scale to the separability of texture labels. The segmentation performances achieved by layers $X^{3,1}$, $X^{2,2}$ and $X^{0,4}$ (the last layer) are 59.57%, 59.55% and 59.41%, respectively. This is consistent with the $\tilde{\eta}_{i,j}$ values (see Figure 6(c)), $\eta_{3,1} > \eta_{2,2} > \eta_{0,4}$. A similar trend in the changes to $\tilde{\eta}^{d}_{i,j}$ in the iterative training process of ADS_UNet is also observed in Figure 7(e)-7(h).

Figure 6(c) and Figure 7(e)-7(h) not only show us how the parameters of different layers change during the training process, but also indicate that: 1) the importance of parameters varies from layer to layer; 2) the significance of parameters also varies throughout the training process, which is the effect of the normalization of the weights $\tilde{\eta}^{d}_{i,j}$ that introduces competition between the layers; and 3) the competition between the layers continues until equilibrium is reached.

Table 6
mIoU score of ADS_UNet trained in 3 modes on the BCSS dataset. Each UNet𝑑 is trained for 70 epochs.

Mode        UNet1   UNet2   UNet3   UNet4   ens(𝛼)
𝜼𝑑          47.77   56.34   58.37   59.20   59.64
𝜼̃𝑑          47.14   55.42   58.52   58.95   60.10
𝜼̃𝑑 (sum)    46.93   56.91   60.11   60.26   61.05

5.2.2. Preventing $\eta^{d}_{i,j}$ from vanishing leads to higher segmentation performance.
In equation (4), we redefine $\eta^{d}_{i,j}$ as $\tilde{\eta}^{d}_{i,j}$ to force all layers to learn features that are directly discriminative for classifying textures. We then sum the probability maps produced by these layers, weighted by their importance factors $\tilde{\eta}^{d}_{i,j}$, to generate the segmentation map of $\mathrm{UNet}^{d}$ (defined in equation (6)). To verify whether this range constraint and the weighted combination yield better performance, we run experiments on the BCSS dataset, in which ADS_UNet is trained in 3 modes:


[Figure 7 panels: (a)-(d) losses of supervision layers with $i+j = 1, 2, 3, 4$; (e)-(h) $\tilde{\eta}^{d}_{i,j}$ for $d = i+j = 1, 2, 3, 4$; (i)-(l) losses of supervision layers with $i+j = 1, 2, 3, 4$; (m)-(p) $\eta^{d}_{i,j}$ for $d = i+j = 1, 2, 3, 4$.]

Figure 7: (a)-(d) Cross-entropy losses of supervision layers during the $\mathrm{UNet}^{d}$ training process (Equation (4) is imposed to constrain the range of $\tilde{\eta}^{d}_{i,j}$). The loss is calculated from equation (5). (e)-(h) The corresponding weights of supervision layers. $\tilde{\eta}^{d}_{i,j}$ reflects the importance of node $X^{i,j}$ while computing the overall loss. (i)-(p) The loss and the $\eta^{d}_{i,j}$ values of supervision layers of $\mathrm{UNet}^{d}$, in which $\eta^{d}_{i,j}$ is trained without constraints. The loss shown in (i)-(l) is calculated from equation (3). For all plots, the x-axis indexes the training epoch. These plots are based on the BCSS dataset.
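
The range constraint of equation (4) mentioned in the caption is not reproduced in this part of the paper; Section 5.2.2 only states the resulting bounds, $[\frac{1}{2(d+1)}, \frac{d+2}{2(d+1)}]$, for the $d+1$ supervised layers of $\mathrm{UNet}^{d}$. One normalisation that yields exactly these bounds, and weights that sum to one, is an equal mix of a softmax and the uniform weight $1/(d+1)$. The sketch below illustrates that assumed form only; the function names are hypothetical and the authors' equation (4) may differ.

```python
import math


def softmax(values):
    """Numerically stable softmax of a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def bounded_weights(eta):
    """Map unconstrained weights eta (one per supervised layer, d + 1 in total)
    to weights in [1/(2(d+1)), (d+2)/(2(d+1))] that sum to 1.

    Assumed form: the average of softmax(eta) and the uniform weight 1/(d+1).
    """
    num_layers = len(eta)  # d + 1
    return [0.5 * (p + 1.0 / num_layers) for p in softmax(eta)]


# Example with d = 2 (three supervised layers) and one dominant raw weight.
eta = [10.0, 0.0, 0.0]
weights = bounded_weights(eta)
d = len(eta) - 1
lower, upper = 1 / (2 * (d + 1)), (d + 2) / (2 * (d + 1))
print(weights, (lower, upper))
```

Under this form even a strongly dominant layer cannot push the weights of the other layers below the lower bound, which matches the stated purpose of preventing the weights from vanishing.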

1) $\boldsymbol{\eta}^{d}$: each element $\eta^{d}_{i,j}$ is trained without range constraint. After $\mathrm{UNet}^{d}$ is trained, the output of the layer which has the largest $\eta^{d}_{i,j}$ value is selected to generate the final segmentation map, i.e., let $(i', j') = \arg\max_{(i,j)}(\eta^{d}_{i,j})$; the final probability map is then obtained by $\hat{y}^{d} = \hat{y}^{i',j'}$, with $\hat{y}^{i,j}$ defined in equation (1). Then $\hat{y}^{d}$ is compared with the ground truth to calculate $\alpha^{d}$ (the weight of $\mathrm{UNet}^{d}$).

2) $\tilde{\boldsymbol{\eta}}^{d}$: $\tilde{\eta}^{d}_{i,j}$ is bounded in $\left[\frac{1}{2(d+1)}, \frac{d+2}{2(d+1)}\right]$, according to equation (4). The final segmentation map generation and the $\alpha^{d}$ calculation are the same as in 1).

3) $\tilde{\boldsymbol{\eta}}^{d}$ (sum): the training criterion is the same as in 2), but the segmentation map produced by $\mathrm{UNet}^{d}$ is the weighted summation of multi-scale predictions (using equation (6)), which is then used to calculate $\alpha^{d}$.

The results of training ADS_UNet in 3 different modes are reported in Table 6, where ADS_UNet with bounded $\tilde{\boldsymbol{\eta}}^{d}$ is seen to slightly surpass the unbounded one. After combining the probability maps produced by supervision layers based on the layer importance factors $\tilde{\eta}^{d}_{i,j}$, the mIoU score on the BCSS dataset is further improved by 0.95 points. To explain the results of Table 6, the loss, $\eta^{d}_{i,j}$ and $\tilde{\eta}^{d}_{i,j}$ of the ADS_UNet (trained in mode 1 and mode 3) are tracked and visualized in Figure 7. As observed in Figure 7(i)-7(p), when there is no range constraint on $\eta^{d}_{i,j}$, only one specific layer’s loss dominates the learning process and the loss of other layers is almost negligible ($\eta^{d}_{i,j}$ close to 0) after training for a few epochs. But the loss increases (the loss of $X^{3,0}$ in Figure 7(k) and of $X^{4,0}$ in Figure 7(l)), so there is still reduced discriminability at the intermediate layers ($X^{3,0}$, $X^{4,0}$). However, this phenomenon is eliminated after the range constraint is imposed, which suppresses the weight of the dominant layer and enables those of the others to grow, as shown in Figure 7(a)-7(h). That means, by retaining the information from previous layers, the range of features that are being learned is increased, therefore leading to better performance. Note that the losses of $X^{3,0}$ in Figure 7(c) and of $X^{4,0}$ in Figure 7(d) keep decreasing, unlike those in Figure 7(k) and Figure 7(l).

5.3. Feature Similarity of Hidden Layers
Since deep supervision provides features of intermediate blocks with a direct contribution to the overall loss, the similarity of features learned by these blocks will be higher than those of the original UNet. Centered Kernel Alignment (CKA) (Kornblith et al., 2019) has been developed as a tool for comparing feature representations of neural networks. Here we use CKA to characterize the similarity of feature representations learned by different convolutional blocks in UNet↓. As shown in Figure 8(b), the similarity of features extracted by blocks in UNet↓ is mostly higher than in their counterparts in UNet (although 6 of the similarity entries in UNet↓ have lower values than those of UNet), which is consistent with our expectation (the 20 positive values add up to 1.89, whereas the 6 negative values add up to -0.47).

Figure 8: (a) Feature similarity of layers for UNet. (b) The difference in feature similarity of layers between UNet↓ and UNet, diff(UNet↓, UNet). In (a), each entry shows the CKA similarity between the two layers. In (b), we calculate the feature similarity matrix for UNet↓, then take the difference between UNet↓ and UNet. These plots are based on the BCSS dataset.

Figure 9: (a) Feature similarity of the output layers of ADS_UNet. (b) and (c) We calculate the feature similarity matrix for $\mathrm{UNet}^{e}$ and UNet++, then take the difference between ADS_UNet and $\mathrm{UNet}^{e}$, diff(ADS_UNet, $\mathrm{UNet}^{e}$), and between ADS_UNet and UNet++, diff(ADS_UNet, UNet++). These plots are based on the BCSS dataset.
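
For readers unfamiliar with CKA, the linear variant described by Kornblith et al. (2019) can be written in a few lines. This is a minimal pure-Python sketch for small matrices whose rows are examples and whose columns are features; how the feature maps were flattened into such matrices, and whether a linear or kernel variant was used for Figures 8 and 9, is not stated here and is left as an assumption.

```python
import math


def center_columns(matrix):
    """Subtract the column mean from every column of an n x p matrix."""
    n = len(matrix)
    means = [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same number of rows."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(row_a[i] * row_b[j] for row_a, row_b in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    x, y = center_columns(x), center_columns(y)
    denominator = math.sqrt(cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))
    return cross_frobenius_sq(x, y) / denominator


# Toy features: 4 examples, 2 features each (illustrative values only).
features = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
scaled = [[3.0 * v for v in row] for row in features]   # isotropic scaling
rotated = [[-row[1], row[0]] for row in features]       # 90-degree rotation
other = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

print(linear_cka(features, scaled), linear_cka(features, rotated), linear_cka(features, other))
```

The similarity is 1 for representations that differ only by isotropic scaling or an orthogonal transformation and lies between 0 and 1 otherwise, so entries of a difference matrix such as those in Figure 8(b) can be read as changes in similarity between pairs of layers.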


5.4. Feature Diversity of Output Layers
Ensemble-based learning methods, such as AdaBoost, rely on the independence of features exploited by the classifiers in their ensemble (Freund and Schapire, 1997). If base learners produce independent outputs, then the segmentation accuracy of the ensemble can be enhanced by majority weighting. Figure 9(a) characterizes the feature similarity of the output layers of ADS_UNet. Figures 9(b) and 9(c) show that features learned by the output layers of ADS_UNet are less similar than those in UNet𝑒 (the values add up to -0.29) and UNet++ (the values add up to -0.59). Our interpretation is that this can be attributed to the stage-wise additive learning, followed by the sample weight updating rule of ADS_UNet, and may explain why ADS_UNet outperforms UNet𝑒 and UNet++.

6. Conclusion
In this paper, we propose a novel stage-wise additive training algorithm, ADS_UNet, that incorporates the AdaBoost algorithm and a greedy layer-wise training strategy into the iterative learning process of an ensemble model. The proposed method has the following advantages: 1) The stage-wise training strategy with re-weighted training samples empowers base learners to learn discriminative and diverse feature representations. These are eventually combined in a performance-weighted manner to produce the final prediction, leading to higher accuracy than that achieved by other UNet-like architectures. 2) In the configuration of base learners, intermediate layers are supervised directly to learn discriminative features, without the need for learning extra up-sampling blocks. This, therefore, diminishes memory consumption and computational burden. 3) By introducing layer competition, we observe that the importance of feature maps produced by layers varies from epoch to epoch at the training stage, and different layers contribute differently in a manner that is learnable. 4) ADS_UNet is more computationally efficient (fewer requirements on GPU memory and training time) than UNet𝑒, UNet++, CENet and transformer-based UNet variants, due to its cascade training regimen.

However, the ADS_UNet has the following limitation that we would like to address in future work: currently, the sample re-weighting training criterion restricts the ADS_UNet

References
Amgad, M., Elfandy, H., Hussein, H., Atteya, L.A., Elsebaie, M.A., Abo Elnasr, L.S., Sakr, R.A., Salem, H.S., Ismail, A.F., Saad, A.M., et al., 2019. Structured crowdsourcing enables convolutional segmentation of histology images. Bioinformatics 35, 3461–3467.
Awan, R., Sirinukunwattana, K., Epstein, D., Jefferyes, S., Qidwai, U., Aftab, Z., Mujeeb, I., Snead, D., Rajpoot, N., 2017. Glandular morphometrics for objective grading of colorectal adenocarcinoma histology images. Scientific reports 7, 1–12.
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., et al., 2007. Greedy layer-wise training of deep networks. Advances in neural information processing systems 19, 153.
Bengio, Y., Louradour, J., Collobert, R., Weston, J., 2009. Curriculum learning, in: Proceedings of the 26th annual international conference on machine learning, pp. 41–48.
Cui, H., Jiang, L., Yuwen, C., Xia, Y., Zhang, Y., 2022. Deep u-net architecture with curriculum learning for myocardial pathology segmentation in multi-sequence cardiac magnetic resonance images. Knowledge-Based Systems , 108942.
Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine learning research 7, 1–30.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., 2021. An image is worth 16x16 words: Transformers for image recognition at scale, in: 9th International Conference on Learning Representations.
Dou, Q., Chen, H., Jin, Y., Yu, L., Qin, J., Heng, P.A., 2016. 3d deeply supervised network for automatic liver segmentation from ct volumes, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 149–157.
Fahlman, S., Lebiere, C., 1989. The cascade-correlation learning architecture. Advances in neural information processing systems 2.
Freund, Y., Schapire, R.E., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55, 119–139.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics , 1189–1232.
Gao, Y., Zhou, M., Liu, D., Yan, Z., Zhang, S., Metaxas, D.N., 2022. A data-scalable transformer for medical image segmentation: architecture, model efficiency, and benchmark. arXiv preprint arXiv:2203.00131 .
Hastie, T., Rosset, S., Zhu, J., Zou, H., 2009. Multi-class adaboost. Statistics and its Interface 2, 349–360.
Herbold, S., 2020. Autorank: A python package for automated ranking of classifiers. Journal of Open Source Software 5, 2173.
Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H., 2021. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods 18, 203–211.
Ji, Z., Telgarsky, M., 2020. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems 33, 17176–17186.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
Kornblith, S., Norouzi, M., Lee, H., Hinton, G., 2019. Similarity of
to only update the weights of samples at a relatively coarse neural network representations revisited, in: International Conference on
granularity. In future work, more fine-grained re-weighting Machine Learning, PMLR. pp. 3519–3529.
criteria will be explored to guide successive base learners Kumar, N., Verma, R., Anand, D., Zhou, Y., Onder, O.F., Tsougenis, E.,
to pay more attention to regions/pixels that are difficult Chen, H., Heng, P.A., Li, J., Hu, Z., et al., 2019. A multi-organ nucleus
to distinguish. It would also be promising to integrate the segmentation challenge. IEEE transactions on medical imaging 39,
1380–1391.
AdaBoost and stage-wise training with a Transformer-like Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z., 2015. Deeply-supervised
architecture to further improve segmentation performance. nets, in: Artificial intelligence and statistics, pp. 562–570.
Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks
for semantic segmentation, in: Proceedings of the IEEE conference on
Acknowledgement computer vision and pattern recognition, pp. 3431–3440.
Luo, H., Changdong, Y., Selvan, R., 2022. Hybrid ladder transformers
The authors acknowledge the use of the IRIDIS High-
with efficient parallel-cross attention for medical image segmentation,
Performance Computing Facility, and associated support in: Medical Imaging with Deep Learning.
services at the University of Southampton, in the completion Ma, M., Xia, H., Tan, Y., Li, H., Song, S., 2022. Ht-net: hierarchical context-
of this work. Yilong Yang is supported by China Scholarship attention transformer network for medical ct image segmentation. Ap-
Council under Grant No. 201906310150. plied Intelligence , 1–14.


Marquez, E.S., Hare, J.S., Niranjan, M., 2018. Deep cascade learning. IEEE transactions on neural networks and learning systems 29, 5475–5485.
Opitz, D., Maclin, R., 1999. Popular ensemble methods: An empirical study. Journal of artificial intelligence research 11, 169–198.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al., 2019. Pytorch: An imperative style, high-performance deep learning library, in: Advances in neural information processing systems, pp. 8026–8037.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer. pp. 234–241.
Roy, A.G., Navab, N., Wachinger, C., 2018. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 421–429.
Smith, L.N., Topin, N., 2019. Super-convergence: Very fast training of neural networks using large learning rates, in: Artificial intelligence and machine learning for multi-domain operations applications, International Society for Optics and Photonics. p. 1100612.
Taherkhani, A., Cosma, G., McGinnity, T.M., 2020. Adaboost-cnn: An adaptive boosting algorithm for convolutional neural networks to classify multi-class imbalanced datasets using transfer learning. Neurocomputing 404, 351–366.
Vahadane, A., Peng, T., Sethi, A., Albarqouni, S., Wang, L., Baust, M., Steiger, K., Schlitter, A.M., Esposito, I., Navab, N., 2016. Structure-preserving color normalization and sparse stain separation for histological images. IEEE transactions on medical imaging 35, 1962–1971.
Xie, S., Tu, Z., 2015. Holistically-nested edge detection, in: Proceedings of the IEEE international conference on computer vision, pp. 1395–1403.
Zhou, Q., Wu, X., Zhang, S., Kang, B., Ge, Z., Latecki, L.J., 2022. Contextual ensemble network for semantic segmentation. Pattern Recognition 122, 108290.
Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J., 2019. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE transactions on medical imaging 39, 1856–1867.
Zhu, Q., Du, B., Turkbey, B., Choyke, P.L., Yan, P., 2017. Deeply-supervised cnn for prostate segmentation, in: 2017 International Joint Conference on Neural Networks (Ijcnn), IEEE. pp. 178–184.

Yilong Yang received the master's degree in software engineering from Xiamen University, Xiamen, China, in 2019. He is currently a Ph.D. candidate with the Vision, Learning and Control Research Group, University of Southampton, United Kingdom. His research interests include computer vision and geometric deep learning.

Srinandan Dasmahapatra received the Ph.D. degree in physics from the State University of New York, Stony Brook, NY, USA, in 1992. He is currently an Associate Professor in the School of Electronics and Computer Science, University of Southampton, Southampton, U.K. His research interests include artificial intelligence and pattern recognition.

Sasan Mahmoodi received the Ph.D. degree from the University of Newcastle, Newcastle upon Tyne, U.K., in 1998. He is currently an Associate Professor in the School of Electronics and Computer Science, University of Southampton, Southampton, U.K. His research interests include medical image processing, computer vision, and modeling of biological vision.

A. Python code for counting incorrect labels in down-scaled masks

    import numpy as np

    def avg_pool(mask, mask_shape, window_size):
        # Count the non-overlapping windows of a square mask (side length
        # mask_shape) that contain more than one label; down-scaling by
        # window_size assigns an incorrect label to part of each such window.
        err_count_per_mask = 0
        row = col = mask_shape // window_size
        for i in range(row):
            for j in range(col):
                r, c = i * window_size, j * window_size
                window = mask[r:r + window_size, c:c + window_size]
                if len(np.unique(window)) > 1:
                    err_count_per_mask += 1
        return err_count_per_mask

    def get_incorrect_ratio(mask_shape, mask_list):
        down_factor = [2, 4, 8, 16]
        err_count_per_scale = [0, 0, 0, 0]
        for mask in mask_list:
            for idx, f in enumerate(down_factor):
                err_count_per_scale[idx] += avg_pool(mask, mask_shape, f)
        # Total number of down-scaled pixels per scale, over all masks.
        pixel_count = [(mask_shape // f) ** 2 * len(mask_list) for f in down_factor]
        return np.array(err_count_per_scale) / np.array(pixel_count)
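The counting procedure in this appendix can also be written without explicit loops. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mixed_window_ratio` and the toy mask are hypothetical, the mask is assumed to be a 2-D integer label array, and a window is treated as incorrectly labelled whenever it contains more than one distinct label.

```python
import numpy as np

def mixed_window_ratio(mask: np.ndarray, window_size: int) -> float:
    """Fraction of non-overlapping window_size x window_size blocks
    that contain more than one label (hypothetical helper)."""
    h, w = mask.shape
    rows, cols = h // window_size, w // window_size
    # Crop to a multiple of window_size, then view the mask as a grid of blocks.
    cropped = mask[:rows * window_size, :cols * window_size]
    blocks = cropped.reshape(rows, window_size, cols, window_size)
    # A block holds more than one label exactly when its min and max differ.
    mixed = blocks.max(axis=(1, 3)) != blocks.min(axis=(1, 3))
    return float(mixed.mean())

# Toy 16x16 mask (an assumption for illustration): label 0 in columns 0-4,
# label 1 in columns 5-15, so there is a single vertical boundary.
toy_mask = np.zeros((16, 16), dtype=np.int64)
toy_mask[:, 5:] = 1

ratios = {f: mixed_window_ratio(toy_mask, f) for f in (2, 4, 8, 16)}
print(ratios)
```

On this toy mask the ratio grows with the down-scaling factor, because a larger window is more likely to straddle the label boundary.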
