• The importance of layers varies across training, and different layers contribute differently.
• ADS_UNet achieves state-of-the-art performance.
• ADS_UNet is much more memory- and computation-efficient than Transformer-based models.
ADS_UNet: A Nested UNet for Histopathology Image Segmentation
Yilong Yang^a,*, Srinandan Dasmahapatra^a and Sasan Mahmoodi^a
^a School of Electronics and Computer Science, University of Southampton, University Road, Southampton, SO17 1BJ, Hampshire, United Kingdom
Figure 1: Comparison of UNet𝑒 (a), UNet++ (b), CENet (c), and the proposed ADS_UNet (d). UNet++ is constructed from UNet𝑒 by introducing skip-connections (red dashed lines in (b)) between decoder nodes. CENet disregards inner decoder nodes and adopts deconvolution and concatenation to harvest multi-scale context clues between encoder and decoder nodes.
2. Related Work

In this section, we review the works related to UNet and its variants, deep supervision and AdaBoost, which are the main components of our architecture.

2.1. UNet family

UNet (Ronneberger et al., 2015) consists of a down-sampling path to capture context, and a symmetric up-sampling path to expand feature maps back to the input size. The down-sampling part has an FCN-like architecture that extracts features with 3×3 convolutions. The up-sampling part uses deconvolution to reduce the number of feature maps while increasing their area. Feature maps from the down-sampling part of the network are copied and concatenated to the up-sampling part to ensure precise localization.

Building on the success of UNet, several variants have been proposed to further improve segmentation performance. Here we describe the networks UNet𝑒 and UNet++ (Zhou et al., 2019), whose simplified architectures are given in Figure 1. UNet𝑒 is an ensemble architecture which combines UNets of varying depths into one unified structure. Note that deep supervision is required to train UNet𝑒 in an end-to-end fashion. In order to allow deeper UNets to offer a supervision signal to the decoders of the shallower UNets in the ensemble, and to address the potential loss of information, UNet++ connects the decoder nodes, enabling dense feature propagation along skip connections and thus more flexible feature fusion at the decoder nodes. The difference between UNet++ and UNet𝑒 is that there are skip-connections between decoder nodes in UNet++ (highlighted in red in Figure 1(b)). Zhou et al. (2022) proposed the contextual ensemble network (CENet), where contextual cues are aggregated by densely up-sampling the features of the encoder layers to the features of the decoder layers. This enables CENet to capture multi-scale context information. While UNet++ and CENet yield higher performance than UNet, they do so by introducing dense skip connections that result in a huge increase in parameters and computational cost.
Most recently, building upon the success of the Vision Transformer (Dosovitskiy et al., 2021) on image classification tasks, self-attention modules have also been integrated into UNet-like architectures for accurate segmentation. Luo et al. (2022) proposed the hybrid ladder transformer (HyLT), in which the authors use bidirectional cross-attention bridges at multiple resolutions for the exchange of local and global features between the CNN- and transformer-encoding paths. The fusion of local and global features renders HyLT more robust to image perturbations than other CNN-, transformer- and hybrid methods. Gao et al. (2022) presented MedFormer, in which an efficient bidirectional multi-head attention (B-MHA) is proposed to eliminate redundant tokens and reduce the quadratic complexity of conventional self-attention to a linear level. Furthermore, the B-MHA relaxes the constraints of model design and enables MedFormer to extract global relations on high-resolution token maps for fine-grained boundary modelling. Ma et al. (2022) proposed a hierarchical context-attention transformer-based architecture (HT-Net), which introduces an axial attention layer to model pixel dependencies of multi-scale feature maps, followed by a context-attention module that captures context information from adjacent encoder layers.

2.2. Deep supervision

A deeply supervised network (DSN) (Lee et al., 2015) introduces classification outputs at hidden layers in addition to the conventional output at the last layer. This was shown to increase the discriminative power of learned features in shallow layers, and robustness to hyper-parameter choice. Although the original DSN was proposed for classification tasks, deep supervision can also be used for image segmentation. Dou et al. (2016) introduced deep supervision to combat potential optimization difficulties and concluded that the model acquired a faster convergence rate and greater discriminability. Based on the UNet architecture, Zhu et al. (2017) introduced a supervision layer at each encoder/decoder block. Their method is very similar to our proposed supervision scheme; the difference lies in how the loss between the larger-sized ground truth and the smaller-sized outputs of hidden layers is calculated. Note that the dimensions of the feature maps of hidden layers are gradually reduced by the down-sampling operations, becoming much smaller than the ground-truth mask. In Dou et al. (2016) and Zhu et al. (2017), deconvolutional layers were used to up-sample feature maps back to the same size as the ground-truth mask. Evidently, the additional deconvolutional layers introduce more parameters and more computational overhead. Although it was pointed out in Long et al. (2015) that one can learn arbitrary interpolation functions, bilinear interpolation was adopted in Xie and Tu (2015) to up-sample feature maps with no reduction in performance compared to learned deconvolutions. All of the aforementioned works solve the dimension mismatch problem by up-sampling feature maps. In our deep supervision scheme, by contrast, we perform average pooling to down-sample the ground-truth mask to the same size as the feature maps of hidden layers. This reduces the amount of computation and is more GPU memory efficient.
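To make the mask-pooling scheme concrete, here is a minimal sketch (our own illustration, not the authors' released code; block_logits and onehot_mask are assumed inputs):

import torch
import torch.nn.functional as F

def deep_supervision_losses(block_logits, onehot_mask):
    """Cross-entropy at every supervised block. Each block's target is
    obtained by average-pooling the one-hot ground-truth mask down to
    the block's resolution, instead of up-sampling the block's feature maps.

    block_logits: list of tensors of shape (B, C, h, w), one per block
    onehot_mask:  tensor of shape (B, C, H, W), one-hot class labels
    """
    losses = []
    for logits in block_logits:
        # Average pooling turns one-hot labels into soft class proportions
        # at the coarser scale; no deconvolution weights are needed.
        target = F.adaptive_avg_pool2d(onehot_mask.float(), logits.shape[-2:])
        log_prob = F.log_softmax(logits, dim=1)
        losses.append(-(target * log_prob).sum(dim=1).mean())
    return losses  # combined into a single training loss in section 3.1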
2.3. AdaBoost

AdaBoost (Adaptive Boosting) (Freund and Schapire, 1997) is a very successful ensemble classifier which has been widely used in binary classification tasks. AdaBoost is based on the premise that a highly accurate prediction rule can be obtained by combining many relatively weak and inaccurate rules. This was re-derived in Friedman (2001) as a gradient of an exponential loss function of a stage-wise additive model. Such an additive model was extended to the multi-class case by Hastie et al. (2009), who proposed SAMME (Stage-wise Additive Modeling using a Multi-class Exponential loss function), which naturally extends the original AdaBoost algorithm to the multi-class case without reducing it to multiple two-class problems. The detailed iterative procedure of multi-class AdaBoost is described in Algorithm 2 of Hastie et al. (2009). Starting from equally weighted training samples, AdaBoost trains a classifier 𝑓𝑡 (𝑡 ∈ {1, 2, ..., 𝑇} is the iteration index) iteratively, re-weighting the training samples in the process. A misclassified item 𝑥𝑖 is assigned a higher weight 𝑤𝑡𝑖, so that the next iteration of the training pays more attention to it. After each classifier 𝑓𝑡 is trained, it is assigned a weight based on its error rate 𝜖𝑡 on the training set. In the integrated output of the classifier ensemble, the more accurate classifiers are assigned larger weights 𝛼𝑡 to have more impact on the final outcome. A classifier with accuracy below 1/𝐶 (less than random guessing for 𝐶 target classes) is discarded. 𝑇 classifiers will have been trained after 𝑇 iterations of this procedure. The final labels are obtained by weighted majority voting of these 𝑇 classifiers.

An adaptive algorithm, AdaBoost-CNN, which combines multiple CNN models for multi-class classification, was introduced in Taherkhani et al. (2020). In AdaBoost-CNN, all the weak classifiers are convolutional neural networks with the same architecture. Instead of training a new CNN from scratch, they transfer the parameters of the prior CNN to the later one and then train the new CNN for only one epoch. This achieves better performance than a single CNN, but at the cost of increasing the number of parameters several fold. Curriculum learning (Bengio et al., 2009) is related to boosting algorithms in that the training schedule gradually emphasizes the difficult examples. Cui et al. (2022) demonstrated that better performance can be achieved by forcing UNet to learn from easy to difficult scenes. However, the difficulty level of training samples is predefined according to the size of the target to be segmented, rather than calculated by the network itself.
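For reference, a single round of SAMME in code (an illustrative sketch of the update rules in Hastie et al. (2009), not tied to any particular implementation):

import numpy as np

def samme_round(w, y_true, y_pred, C):
    """One boosting round of multi-class AdaBoost (SAMME).
    w: sample weights (m,) summing to 1; y_true, y_pred: labels (m,)."""
    miss = (y_pred != y_true).astype(float)
    eps = np.sum(w * miss)                 # weighted error rate
    if eps >= 1 - 1.0 / C:                 # no better than random guessing:
        return w, 0.0                      # discard this classifier
    alpha = np.log((1 - eps) / eps) + np.log(C - 1)
    w = w * np.exp(alpha * miss)           # pay more attention to mistakes
    return w / w.sum(), alpha

As described in section 3.2, ADS_UNet replaces the 0/1 mis-classification indicator with a per-image (1 - mIoU) term.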
Figure 2: The architecture of the proposed ADS_UNet. Each circular node in the graph represents a convolution block; yellow nodes indicate that parameters are trainable during back-propagation, and green nodes indicate that parameters are frozen. (a-d) UNets of varying depths. All UNet𝑖 are trained with the same dataset, but using different sample weights 𝑊. (e) Ensemble architecture, ADS_UNet, which combines UNets of varying depths into one unified architecture for inference. (f) The scSE block, containing a channel squeeze and spatial excitation branch (sSE, left) and a spatial squeeze and channel excitation branch (cSE, right). (g) Legend.
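The scSE block in (f) can be sketched in PyTorch as follows, following Roy et al. (2018); the reduction ratio r=16 is a typical choice rather than a value stated here:

import torch.nn as nn

class SCSE(nn.Module):
    """Concurrent spatial and channel squeeze & excitation."""
    def __init__(self, channels, r=16):
        super().__init__()
        # cSE branch: spatial squeeze (global pooling), channel excitation
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1), nn.Sigmoid())
        # sSE branch: channel squeeze (1x1 convolution), spatial excitation
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.cse(x) + x * self.sse(x)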
3. Method

Ensemble learning is often justified by the heuristic that each base learner might perform well on some data and less accurately on others for some learned features, enabling the ensemble to override common weaknesses. To this end, we seek enhanced segmentation performance of the model by enabling diverse feature maps to be learned. We propose the ADS_UNet algorithm, which adopts a layer-wise cascade training approach (Fahlman and Lebiere, 1989; Bengio et al., 2007; Marquez et al., 2018), but with an added component that re-weights training samples to train each base learner in sequence. We evaluate the role of feature map diversity in section 5.3.

3.1. Computation and Memory Efficient Deep Supervision

As we mentioned in the introduction section, UNet𝑒 and UNet++ (Zhou et al., 2019) offer deep supervision to shallower layers by gradually up-sampling feature maps to the size of the mask, which is computation and GPU memory expensive. To reduce the computational burden, we average-pool the mask to the same size as the feature maps. The advantage of this change is that we no longer need to train deconvolutional weights for intermediate blocks to obtain feature maps with the same dimension as the ground-truth mask. This is of potential benefit for texture segmentation, as relevant textural characteristics occur at multiple length scales and are not confined to the location of the mask boundary. We adopt UNet𝑑s, whose hidden layers have been trained with supervision, as base learners of the proposed ensemble model. Given the input image 𝑥 and the network, we define the probability map generated at block X𝑖,𝑗 as:

$\hat{y}^{i,j}(x) = \mathrm{softmax}(\mathbf{X}^{i,j}(x))$  (1)

The mapping $\mathbf{X}^{i,j}(\cdot) : \boldsymbol{X} \rightarrow \mathbb{R}^{N^{i,j} \times C}$ consists of a sequence of convolution, batch normalization, ReLU activation and pooling operations that transforms the input image into a feature representation; a softmax activation function then maps this representation to a probability map. Here $C$ is the number of classes, $N^{i,j}$ denotes the number of pixels of the down-sampled mask, and $(i,j)$ is the index of the convolutional block. Given the mask $y^{i,j} \in \mathbb{R}^{N^{i,j} \times C}$, the loss function used at block X𝑖,𝑗 is the pixel-wise cross-entropy loss

$\mathcal{L}^{i,j}(y^{i,j}, \hat{y}^{i,j}, N^{i,j}) = -\frac{1}{N^{i,j}} \sum_{n=1}^{N^{i,j}} \sum_{c=1}^{C} y^{i,j}_{n,c} \log \hat{y}^{i,j}_{n,c},$  (2)

and the training loss of a depth-𝑑 network combines the losses of all of its supervised blocks:

$\mathcal{L}^{d} = \sum_{i+j=d,\ i,j \ge 0} \eta^{d}_{i,j}\, \mathcal{L}^{i,j}(y^{i,j}, \hat{y}^{i,j}, N^{i,j}), \qquad \eta^{d}_{i,j} \ge 0, \quad \sum_{i+j=d,\ i,j \ge 0} \eta^{d}_{i,j} = 1,$  (3)

where $\eta^{d}_{i,j}$ is a weighting factor assigned to the convolutional block X𝑖,𝑗. To prevent the weight of any supervised block from vanishing (see section 5.2.2), we redefine $\eta^{d}_{i,j}$ as

$\tilde{\eta}^{d}_{i,j} = \frac{\eta^{d}_{i,j}}{2} + \frac{1}{2(d+1)}.$  (4)

Since $\lim_{\eta^{d}_{i,j} \to 0} \big( \frac{\eta^{d}_{i,j}}{2} + \frac{1}{2(d+1)} \big) = \frac{1}{2(d+1)}$ and $\lim_{\eta^{d}_{i,j} \to 1} \big( \frac{\eta^{d}_{i,j}}{2} + \frac{1}{2(d+1)} \big) = \frac{d+2}{2(d+1)}$, $\tilde{\eta}^{d}_{i,j}$ is bounded in $[\frac{1}{2(d+1)}, \frac{d+2}{2(d+1)}]$. Equation (3) is then re-written as follows to train each constituent model UNet𝑑:

$\mathcal{L}^{d} = \sum_{i+j=d,\ i,j \ge 0} \tilde{\eta}^{d}_{i,j}\, \mathcal{L}^{i,j}(y^{i,j}, \hat{y}^{i,j}, N^{i,j}), \qquad \frac{1}{2(d+1)} \le \tilde{\eta}^{d}_{i,j} \le \frac{d+2}{2(d+1)}, \quad \sum_{i+j=d,\ i,j \ge 0} \tilde{\eta}^{d}_{i,j} = 1.$  (5)
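The paper treats the $\eta^{d}_{i,j}$ as trainable; a softmax over raw logits is one way (our assumption, not a detail given here) to satisfy the constraints of equation (3), with the affine map of equation (4) supplying the bounds:

import torch
import torch.nn.functional as F

d = 3                                           # UNet_d has d+1 supervised blocks
theta = torch.zeros(d + 1, requires_grad=True)  # one raw logit per block X^{i,j}, i+j=d

def eta_tilde(theta):
    eta = F.softmax(theta, dim=0)          # eta >= 0 and sums to 1, as in equation (3)
    return eta / 2 + 1 / (2 * len(theta))  # equation (4): bounded, still sums to 1

# Training loss of UNet_d (equation (5)), given the per-block losses:
# loss_d = sum(w * l for w, l in zip(eta_tilde(theta), block_losses))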
Once the UNet𝑑 is trained, the final probability map generated by UNet𝑑 is calculated by:

$\hat{y}^{d}(x) = \sum_{i+j=d,\ i,j \ge 0} \tilde{\eta}^{d}_{i,j}\, \hat{y}^{i,j}(x),$  (6)

with $\hat{y}^{i,j}(x)$ and $\tilde{\eta}^{d}_{i,j}$ defined in equations (1) and (4). $\hat{y}^{d}(x)$ denotes the combined prediction of model UNet𝑑. We conduct ablation studies in section 5.2 to show the benefits of imposing the range constraint on $\tilde{\eta}^{d}_{i,j}$. Moreover, we demonstrate that generating the final prediction by the weighted summation of multi-scale outputs yields better segmentation performance.

3.2. Stage-wise Additive Training

The stage-wise additive training process of the ADS_UNet is described in Algorithm 1 and visually illustrated in Figure 2. The main components of the iterative training procedure are 1) updating sample weights, 2) assigning weighting factors to base learners, and 3) freezing trained encoders while training decoders. We elaborate on these below.

Algorithm 1: Training of the ADS_UNet. The weighting factors $\tilde{\eta}^{d}_{i,j}$ are discussed in the context of equation (4); the UNet𝑑 are described in Figure 2.
Input: Number of classes: 𝐶; Training images: 𝒳 = {𝑥1, ..., 𝑥𝑚}; Training masks: 𝒴 = {𝑦1, ..., 𝑦𝑚}; Number of iterations: 𝑇.
1   𝒘¹ = {𝑤¹_𝑘 | 𝑤¹_𝑘 = 1/𝑚, 𝑘 = 1, 2, ..., 𝑚};
2   Initialize convolutional block X^{0,0};
3   for 𝑑 = 1, 2, ..., 𝑇 do
4       for 𝑗 = 0, 1, ..., 𝑑 do
5           Initialize convolutional block X^{𝑑−𝑗,𝑗};
6       end
7       𝜂̃^𝑑_{𝑖,𝑗} = 1/(𝑑+1), 𝑖, 𝑗 ≥ 0, 𝑖 + 𝑗 = 𝑑;
8       Train UNet𝑑(𝒳, 𝒴, 𝒘^𝑑);
9       𝑦̂^𝑑_𝑘 = Σ_{𝑖+𝑗=𝑑, 𝑖,𝑗≥0} 𝜂̃^𝑑_{𝑖,𝑗} 𝑦̂^{𝑖,𝑗}(𝑥_𝑘);    // see equation (6)
10      𝑠^𝑑_𝑘 = mIoU(𝑦̂^𝑑_𝑘, 𝑦_𝑘);    // see equation (7)
11      𝜖_𝑑 = Σ^𝑚_{𝑘=1} 𝑤^𝑑_𝑘 (1 − 𝑠^𝑑_𝑘);
12      if 𝜖_𝑑 < 1 − 1/𝐶 then
13          𝛼_𝑑 = (1/2) ln((1 − 𝜖_𝑑)/𝜖_𝑑) + ln(𝐶 − 1);
14          Update sample weights 𝑤^𝑑_𝑘 using equations (8a) and (8b);
15          for 𝑗 = 0, 1, ..., 𝑑 do
16              Freeze convolutional block X^{𝑗,0};
17          end
18      else
19          𝛼_𝑑 = 0;
20      end
21  end
Output: ADS_UNet = arg max_𝐶 (Σ^𝑇_{𝑑=1} 𝛼_𝑑 𝑦̂^𝑑)
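Component 3), freezing the shared encoder column (lines 15-17 of Algorithm 1), can be sketched as follows (the blocks dictionary is our assumed container for the X^{i,j} modules):

def freeze_encoder_blocks(blocks, d):
    """Freeze encoder blocks X^{j,0} after UNet_d is trained, so that
    UNet_{d+1} reuses them with fixed parameters (green nodes in
    Figure 2) while its newly added decoder blocks remain trainable.
    blocks: dict mapping (i, j) -> torch.nn.Module for node X^{i,j}."""
    for j in range(d + 1):
        for p in blocks[(j, 0)].parameters():
            p.requires_grad = False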
Firstly, given the training images 𝒳 = {𝑥1, ..., 𝑥𝑚} of 𝑛 pixels each, and associated masks 𝒴 = {𝑦1, ..., 𝑦𝑚}, we assign a weight 𝑤𝑘 to each sample 𝑥𝑘. These weights are initialized to 𝑤¹_𝑘 = 1/𝑚 (line 1 in Algorithm 1). Then, in the first iteration (𝑑=1), the parameters of the encoder block (X^{0,0}) of the first base learner UNet1 are initialized (line 2), and the parameters of the bottleneck node (X^{1,0}) and decoder node (X^{0,1}) of UNet1 are initialized randomly (lines 4-6). Line 7 initializes the weighting factors $\tilde{\eta}^{d}_{i,j}$ of the supervised blocks. UNet1 is then trained on all training samples with the same weight of 1/𝑚 (line 8). After UNet1 is trained, the training set is used to evaluate it and to determine its error rate 𝜖1 (lines 9-11). In contrast to AdaBoost, we use the mean Intersection over Union (mIoU) error (line 10) to measure segmentation performance, rather than the mis-classification rate. In detail, given the one-hot mask 𝒚𝑘,𝑐 = [𝑘1, ⋯, 𝑘𝑛], 𝑘𝑗 ∈ {0, 1}, indicating the pixels of image 𝑘 belonging to class 𝑐, and the corresponding one-hot prediction 𝒚̂^𝑑_{𝑘,𝑐} = [𝑘̂1, ⋯, 𝑘̂𝑛], 𝑘̂𝑗 ∈ {0, 1}, generated by UNet𝑑, the mIoU score 𝑠^𝑑_𝑘 is calculated by:

$s^{d}_{k} = \frac{1}{C} \sum_{c=1}^{C} \frac{\boldsymbol{y}_{k,c} \cdot \hat{\boldsymbol{y}}^{d}_{k,c}}{|\boldsymbol{y}_{k,c}| + |\hat{\boldsymbol{y}}^{d}_{k,c}| - \boldsymbol{y}_{k,c} \cdot \hat{\boldsymbol{y}}^{d}_{k,c}}$  (7)

where 𝑘 is the index of training images, 𝑐 is the index of class labels, and 𝑑 is the index of iteration and also denotes the depth of the constituent UNet. If the error rate 𝜖1 of UNet1 is less than 1 − 1/𝐶 (line 12), then UNet1 will be preserved for the ensemble; otherwise, it will be disregarded by setting its weighting factor to 0 (lines 18-19). In the case that 𝜖1 < 1 − 1/𝐶, the equation shown in line 13 is used to calculate the model weight 𝛼𝑑 for the ensemble. So far we have obtained the first base learner, UNet1, and its weighting factor 𝛼1.

We then update the sample weights based on mIoU scores (line 14) for the training of the next iteration:

$w^{d}_{k} = w^{d-1}_{k}\, e^{(1 - s^{d-1}_{k})}, \quad k = 1, 2, ..., m,$  (8a)

$w^{d}_{k} \leftarrow \frac{w^{d}_{k}}{\sum_{i=1}^{m} w^{d}_{i}}, \quad k = 1, 2, ..., m.$  (8b)
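Equations (8a) and (8b) translate directly into code (a sketch; miou_scores holds the $s^{d-1}_{k}$ values):

import numpy as np

def update_sample_weights(w_prev, miou_scores):
    """w_prev: weights w^{d-1} of shape (m,); miou_scores: s^{d-1}_k in [0, 1].
    Samples segmented poorly (low mIoU) receive larger weights."""
    w = w_prev * np.exp(1.0 - miou_scores)   # equation (8a)
    return w / w.sum()                       # equation (8b)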
Finally, the overall probability map for an image 𝑥 ∈ ℝ^{𝐶×𝐻×𝑊} can be generated by the weighted average:

$\hat{y}(x) = \mathrm{ADS\_UNet}(x) = \sum_{d=1}^{T} \alpha_{d}\, \hat{y}^{d}(x)$  (12)

Here 𝐶 is the number of classes, and 𝑦̂^𝑑(𝑥) is the probability map generated by UNet𝑑, as defined in equation (6) and shown in Figure 2(e).

The proposed ensemble structure differs from UNet++ in two ways: one is the training method, and the other is the way decisions are made and incorporated into learning. 1) Embedded vs. isolated training. UNet++ is trained in an embedded fashion, where the full UNet++ model is trained as a whole, with deep supervision on the last decoder block X^{0,𝑖} of branch 𝑖. In the ADS_UNet, however, each UNet𝑑 is trained by isolating features acquired by the deeper encoder and decoder blocks. Moreover, deep supervision is added to each decoder block of each branch by down-scaling the label masks, rather than solely on the last decoder node of each branch. 2) Average vs. weighted average voting. In the ensemble mode of UNet++, the segmentation results from all branches are collected and then averaged to produce the final prediction: UNet++(𝑥) = arg max_{𝑐∈𝐶} ((1/𝑇) Σ^𝑇_{𝑑=1} UNet𝑑(𝑥)), with UNet𝑑(𝑥) = 𝑦̂^{0,𝑑}, where 𝑇 is the number of branches of UNet++. The ADS_UNet, by contrast, takes performance-weighted combinations of the component UNets to create the final segmentation map: ADS_UNet(𝑥) = arg max_{𝑐∈𝐶} (Σ^𝑇_{𝑑=1} 𝛼𝑑 𝑦̂^𝑑), where 𝑦̂^𝑑 = UNet𝑑(𝑥) is calculated from equation (6) and 𝛼𝑑 reflects the importance of UNet𝑑 in the ensemble.
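The two voting rules, side by side, as a sketch (branch_probs is an assumed list of per-branch probability maps of shape (C, H, W)):

import numpy as np

def unetpp_vote(branch_probs):
    # UNet++: plain average of branch outputs, then per-pixel arg max.
    return np.argmax(np.mean(branch_probs, axis=0), axis=0)

def ads_unet_vote(branch_probs, alphas):
    # ADS_UNet: alpha-weighted sum (equation (12)), then per-pixel arg max.
    weighted = sum(a * p for a, p in zip(alphas, branch_probs))
    return np.argmax(weighted, axis=0)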
4. Experiments and Results

Three histopathology datasets are used to assess the effectiveness of the proposed methods.

4.1. Datasets

CRAG dataset. The colorectal adenocarcinoma gland (CRAG) dataset (Awan et al., 2017) contains a total of 213 Hematoxylin and Eosin images taken from 38 WSIs scanned with an Omnyx VL120 scanner under 20× objective magnification. Most images are of size 1512×1516 pixels. The dataset is split into 173 training images and 40 test images. We resize each image to a resolution of 1024×1024 and then crop it into four patches with a resolution of 512×512 for all our experiments.

BCSS dataset. The Breast Cancer Semantic Segmentation dataset (Amgad et al., 2019) consists of 151 H&E stained whole-slide images and ground truth masks corresponding to 151 histologically confirmed breast cancer cases. A representative region of interest (ROI) was selected within each slide by the study coordinator, a medical doctor, and approved by a senior pathologist. ROIs were selected to be representative of predominant region classes and textures within each slide. Tissue types of the BCSS dataset consist of 5 classes: (i) tumour, (ii) stroma, (iii) inflammatory infiltration, (iv) necrosis and (v) others. We set aside slides from 7 institutions to create our test set and used the remaining images for training. Shift-and-crop data augmentation and random horizontal and vertical flips were adopted to enrich the training samples. Finally, 3154 and 1222 pixel tiles of size 512×512 were cropped for training and testing, respectively. Weighted categorical cross-entropy loss was used to mitigate class imbalance, with the weight associated with each class determined by 𝑊𝑐 = 1 − 𝑁𝑐/𝑁, where 𝑁 is the number of pixels in the training dataset and 𝑁𝑐 is the number of pixels belonging to class 𝑐.
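The class weighting translates to a few lines (a sketch; masks is assumed to be an integer-labelled array covering the training set):

import numpy as np

def class_weights(masks, C):
    """W_c = 1 - N_c / N, where N_c counts the pixels of class c."""
    counts = np.bincount(masks.ravel(), minlength=C).astype(float)
    return 1.0 - counts / counts.sum()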
MoNuSeg dataset. The MoNuSeg dataset (Kumar et al., 2019) is a multi-organ nucleus segmentation dataset. The training set includes 30 images of size 1000×1000 from 4 different organs (lung, prostate, kidney, and breast). The test set contains 14 images with more than 7000 nucleus boundary annotations. A 400×400 window slides through the images with a stride of 200 pixels to separate each image into 16 tiles for training and testing.

4.2. Baselines and Implementation

Since our work is mainly based on UNet, UNet𝑒, and UNet++, we re-implement these three models, as well as CENet, to compare with our proposed methods. We also compare the proposed ADS_UNet with two transformer-based UNet variants, HyLT (Luo et al., 2022) and MedFormer (Gao et al., 2022), using the implementations provided by the authors. For a fair comparison, the configuration of the outermost convolutional blocks (X^{𝑖,0}, 𝑖 ∈ {0, 1, 2, 3} and X^{𝑖,𝑗}, 𝑖, 𝑗 ≥ 0, 𝑖 + 𝑗 = 4) of all compared methods is exactly the same as in the original UNet (both the number and size of filters). All inner decoder nodes of UNet𝑒, UNet++ and ADS_UNet are also exactly the same, and all models have the same hyper-parameters. Note that the scSE block is not used in UNet, UNet𝑒, UNet++ and CENet, but it is used in the skip-connections of ADS_UNet. The models are implemented in PyTorch (Paszke et al., 2019) and trained on one NVIDIA RTX 8000 GPU using the Adam optimizer (Kingma and Ba, 2014), with a weight decay of 10⁻⁷ and a learning rate initialized at 0.001 and then changed according to the 1cycle learning rate policy (Smith and Topin, 2019). The cross-entropy loss is used to train all compared models, and ADS_UNet is trained with the linear combination of loss functions in equation (5). On models with a depth of 4, the numbers of filters at each level are 64, 128, 256, 512, and 1024 on the CRAG and BCSS datasets. This setting is consistent with the standard UNet (Ronneberger et al., 2015). However, we change the numbers of filters to 16, 32, 64, 128, 256 for all models when training on the MoNuSeg dataset, as our experimental results show that increasing the number of filters leads to inferior performance. The colour normalization method proposed in Vahadane et al. (2016) is used to remove stain colour variation before training. We also compare our methods with the state-of-the-art nnU-Net (Isensee et al., 2021). Note that the nnUNet automatically decides the depth of the architecture based on its characterization of the properties of the datasets. In our experiments, the nnUNet generated for the BCSS dataset and the CRAG dataset is of depth 7, while it is of depth 6 for the MoNuSeg dataset. The officially released nnUNet source code is used in our experiments.
Table 1
Comparison of models in terms of the number of parameters, computational complexity (measured by FLOPs), required GPU memory, training time (seconds) per epoch, and segmentation performance (measured by mIoU). The FLOPs and GPU consumption are computed with 512×512 inputs. The GPU memory consumption is measured by the nvidia-smi command (batch size = 2). In ADS_UNet, base learners require different amounts of GPU memory, since they vary in depth and number of parameters (the total number of trainable parameters of the ADS_UNet is 35.41 million). The mIoU score of the FCN-8 is computed from the confusion matrix provided in the supplementary material of Amgad et al. (2019).
Table 2
The difference in mean rank between ADS_UNet and the compared methods. For each tested dataset, the differences in performance are statistically significant if the difference in mean rank is greater than the critical distance. Values lower than the critical distance are highlighted in bold, indicating there are no significant differences between those models and ADS_UNet.
Dataset  Critical Distance  UNet   UNet𝑒  UNet++  nnUNet  CENet  HyLT   MedFormer
CRAG     0.830              1.941  2.109  1.813   0.316   2.025  0.831  0.972
BCSS     0.300              0.474  0.082  0.336   0.426   0.239  0.524  0.801
MoNuSeg  0.700              1.662  2.387  1.978   2.32    1.622  0.453  2.218
4.3. Results

Some image patches and their corresponding segmentation maps are depicted in Figure 4. Table 1 summarizes the segmentation performance achieved by all compared methods. The performance of the baseline methods (VGG-16, FCN-8) used in Amgad et al. (2019) is also included for comparison, as are the number of parameters and the computational efficiency of the various UNet variants. Statistical analysis of the results (Table 1) is performed with the help of the Autorank package (Herbold, 2020). The non-parametric Friedman test and the post hoc Nemenyi test (Demšar, 2006) at the significance level 𝛼 = 0.05 are applied to determine whether there are significant differences between the predictions generated by the models, and to find out which differences are significant. The performance of ADS_UNet is compared with 7 other models on 3 datasets. Table 2 shows that 17 of these 21 pairwise comparisons are statistically significant.

Among the different networks evaluated, the ADS_UNet outperforms all of the other state-of-the-art approaches on the CRAG and BCSS datasets, and achieves competitive performance on the MoNuSeg dataset. UNet++ achieves 1.17, 0.44 and 1.17 higher mIoU scores than UNet, at the price of performing 2.35 times more computation and consuming 1.77 times more GPU memory. In contrast, ADS_UNet performs best and yet requires at most 59.51% of the GPU memory and 42.55% of the floating-point operations required by UNet++ for training. CENet surpasses ADS_UNet on the MoNuSeg dataset, but at the cost of requiring 2.15 times more computation and 1.19 times more GPU memory. nnUNet consumes the least GPU memory and the fewest operations, but at the cost of a small decrease in segmentation accuracy. The design choices (pipeline fingerprint) of nnUNet are not fixed across datasets, but are configured on the fly according to the 'data fingerprint' (dataset properties such as image size, image spacing, number of classes, etc.). The data-dependent 'rule-based parameters' (patch size, batch size, network depth, etc.) of the pipeline are determined by a set of heuristic rules that model parameter inter-dependencies. As shown in Table 1, nnUNet outperforms all models except ADS_UNet on the CRAG and BCSS datasets, but it demonstrates inferior performance on the MoNuSeg dataset. This can be explained by the characteristics of the datasets and the receptive field sizes of the models: the nnUNet is deeper (depth 6 or 7, as mentioned in section 4.2), which means that the convolutional kernels of its bottleneck layer have a larger receptive field.
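A minimal sketch of this statistical analysis with Autorank (the mIoU values below are dummies; autorank runs the Friedman test and reports mean ranks with the Nemenyi critical distance):

import numpy as np
import pandas as pd
from autorank import autorank

rng = np.random.default_rng(0)
# rows: test images; columns: per-model mIoU scores (dummy values here)
scores = pd.DataFrame({
    "ADS_UNet": rng.uniform(0.55, 0.65, 100),
    "UNet":     rng.uniform(0.50, 0.60, 100),
    "UNet++":   rng.uniform(0.52, 0.62, 100),
})
result = autorank(scores, alpha=0.05, verbose=False)
print(result.rankdf)  # mean ranks; significance follows the critical distance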
Figure 4: Visual comparison of segmentation maps. The mIoU score of each prediction is reported below the prediction.
Table 3
Comparison between the original UNet (without deep supervision) and UNet↑/UNet↓, which add deep supervision using up-sampled feature maps (UNet↑) or average-pooled masks (UNet↓). The CRAG, BCSS and MoNuSeg columns report mIoU.
Net    Params (M)  FLOPs (G)  GPU (GB)  CRAG   BCSS   MoNuSeg
UNet   31.04       218.9      5.54      86.87  59.41  80.12
UNet↑  31.17       260.27     14.03     88.84  58.40  81.40
UNet↓  31.06       219.08     5.61      88.33  59.41  81.24
Table 4
Ablation study. Performance measured by mIoU (highest score highlighted in bold). "SCSE" denotes the spatial and channel squeeze & excitation block used in skip-connections. "Re-weight" denotes training-sample re-weighting. "ens (avg)"/"ens (𝛼)" denote that the segmentation results from all branches are either averaged, or summed with 𝛼 weights.
Model Name  Deep supervision  SCSE  Re-weight  UNet1  UNet2  UNet3  UNet4  ens(avg)  ens(𝛼)
model_0     ✗                 ✗     ✗          43.41  53.66  57.59  58.40  57.92     58.22
model_1     ✓                 ✗     ✗          43.89  53.14  57.21  58.07  58.79     58.82
model_2     ✗                 ✓     ✗          46.67  55.87  58.70  59.92  59.92     60.20
model_3     ✗                 ✗     ✓          43.58  53.93  57.34  58.43  58.59     58.82
model_4     ✗                 ✓     ✓          47.97  56.47  59.51  59.55  60.51     60.76
model_5     ✓                 ✗     ✓          44.04  53.32  55.87  57.99  58.23     58.25
model_6     ✓                 ✓     ✗          46.93  56.60  58.62  60.26  60.57     60.63
ADS_UNet    ✓                 ✓     ✓          46.93  56.91  60.11  60.26  61.04     61.05
Removing deep supervision from the ADS_UNet drops the mIoU score by 0.29 points (compared with model_4). Further analysis is provided in section 5.2 to reveal why introducing explicit deep supervision leads to better performance.

In model_4, model_5 and model_6, we remove deep supervision, the SCSE block, or sample re-weighting from the ADS_UNet, respectively, to show the importance of each component in the composition of the ADS_UNet. As seen in Table 4, removing any one of them leads to lower segmentation performance.

The experiment conducted on model_5 demonstrates that truncating the gradient flow between encoder blocks of UNet𝑑 and decoder blocks of UNet𝑑+1 is detrimental to the final segmentation performance (compared with the ADS_UNet). By introducing feature re-calibration in skip-connections, features learnt in encoder blocks are re-weighted to adapt to those of the decoder blocks, thereby leading to better performance. The importance of SCSE feature re-calibration is also reflected in the comparisons of model_0 vs. model_2 (1.98↑), model_1 vs. model_6 (1.81↑), and model_3 vs. model_4 (1.94↑).

In terms of sample re-weighting, the ensemble (ens(𝛼)) of ADS_UNet surpasses that of model_6 by 0.42 points. We attribute this to sample weight updating, which allows UNet𝑑 to pay more attention to images that are hard for UNet𝑑−1 to segment. The benefit of sample re-weighting is also reflected in the comparisons of model_0 vs. model_3 (0.6↑) and model_2 vs. model_4 (0.56↑).
Figure 6: (a) 𝜂̃𝑖,𝑗 values (MoNuSeg). (b) 𝜂̃𝑖,𝑗 values (CRAG). (c) 𝜂̃𝑖,𝑗 values (BCSS). (d) Training losses (BCSS).
Table 5
The proportion (%) of incorrect labels in masks down-scaled by different factors. The (X𝑖,𝑗) under each down-scale factor indicates which layers are supervised by the mask down-sampled by that factor.
↓        2           4           8           16
Data     X1,0, X1,3  X2,0, X2,2  X3,0, X3,1  X4,0
CRAG     1.02        2.78        5.72        9.94
BCSS     1.51        4.32        10.04       19.75
MoNuSeg  5.49        16.21       36.69       60.05

Table 6
mIoU score of ADS_UNet trained in 3 modes on the BCSS dataset. Each UNet𝑑 is trained for 70 epochs.
           UNet1  UNet2  UNet3  UNet4  ens(𝛼)
𝜼𝑑         47.77  56.34  58.37  59.20  59.64
𝜼̃𝑑         47.14  55.42  58.52  58.95  60.10
𝜼̃𝑑 (sum)   46.93  56.91  60.11  60.26  61.05
When comparing ensemble strategies, we find that both average voting and weighted voting improve segmentation performance compared with UNet4, but the improvement due to weighted voting is higher than that from average voting. Moreover, the segmentation performance of model_6 with 𝛼 weighting is better than with average weighting, even though training samples are not re-weighted in its iterative training process. This, too, supports the view that integrating multiple models by weighting each as per its segmenting ability improves the overall performance of the ensemble.

5. Analysis

5.1. Incorrect labeling information can be evaded by adjusting $\tilde{\eta}^{d}_{i,j}$
It is true that down-sampling the ground-truth mask eliminates small objects and leads to incorrect labels for pixels located on class boundaries. We quantify the ratio of incorrect labels in the down-sampled masks and present the statistics in Table 5. It can be observed that the proportion of incorrect labels rises as the down-scaling factor becomes larger. Incorrect labels in the ×16 down-scaled masks of the CRAG and BCSS datasets account for 9.94% and 19.75% of the total labels, respectively. This figure soars to 60.05% in the MoNuSeg dataset. However, it is noteworthy that when these reduced masks are used to supervise the training of layers, there is a trainable weight $\tilde{\eta}^{d}_{i,j}$ (defined in equation (4)) that dynamically adjusts the strength with which each layer is supervised. Figures 6(a)-6(c) show how the network adjusts 𝜂̃𝑖,𝑗 during training to assign weightings to the layers and scales that contribute most to the segmentation task. As seen, at the end of the training, the largest 𝜂̃𝑖,𝑗 values on the MoNuSeg, CRAG and BCSS datasets come from 𝜂̃0,4, 𝜂̃1,3 and 𝜂̃3,1, respectively. That means the UNet↓ benefits most from the original mask and the masks down-scaled by factors of 2 and 8, when trained on the MoNuSeg, CRAG and BCSS datasets, respectively. The ×2 and ×8 down-scaled masks carry 1.02% and 10.04% incorrect label information, respectively. Therefore, even though the down-scaled masks introduce wrong labelling information, the UNet↓ is able to evade this wrong information to a certain extent, putting its attention on the informative masks by adjusting 𝜂̃𝑖,𝑗. Despite the (apparently significant) labelling errors introduced by down-sampling, the overall result (as shown in Table 3) is not adversely affected.
Figure 7: (a)-(d) Cross-entropy losses of the supervision layers during the UNet𝑑 training process, where equation (4) is imposed to constrain the range of $\tilde{\eta}^{d}_{i,j}$; $\mathcal{L}^{d}$ is calculated from equation (5). (e)-(h) The corresponding weights $\tilde{\eta}^{d}_{i,j}$ of the supervision layers (𝑖+𝑗 = 𝑑, for 𝑑 = 1, ..., 4); $\tilde{\eta}^{d}_{i,j}$ reflects the importance of node X𝑖,𝑗 in computing the overall loss. (i)-(p) The losses and the $\eta^{d}_{i,j}$ values of the supervision layers of UNet𝑑, in which 𝜂𝑖,𝑗 is trained without constraints; $\mathcal{L}^{d}$ shown in (i)-(l) is calculated from equation (3). For all plots, the x-axis indexes the training epochs.
5.2. Deep Supervision in UNet↓ and ADS_UNet

5.2.1. Different layers contribute differently at different time stamps

In UNet𝑒 and UNet++, all losses have the same weight in the back-propagation process, while in UNet↓ and ADS_UNet, 𝜂̃𝑖,𝑗 is trainable. The purpose of this design is to check whether all layers in the summand of the training loss in equation (5) contribute equally. Taking Figure 6(c) as an example, the importance of decoder nodes X3,1 and X2,2 is ranked in the top two. This means that the features learned by these 2 layers contribute more than others, with changes in their importance throughout the training process. From the perspective of back-propagation, this means that the parameters of layers with larger 𝜂𝑖,𝑗 values will have relatively large changes when they are updated using gradient descent. This, in turn, indicates the importance of the features derived at that length scale to the separability of texture labels. The segmentation performances achieved by layers X3,1, X2,2 and X0,4 (the last layer) are 59.57%, 59.55% and 59.41%, respectively. This is consistent with the 𝜂̃𝑖,𝑗 values (see Figure 6(c)): 𝜂3,1 > 𝜂2,2 > 𝜂0,4. A similar trend in the changes to $\tilde{\eta}^{d}_{i,j}$ in the iterative training process of ADS_UNet is also observed in Figures 7(e)-7(h).

Figure 6(c) and Figures 7(e)-7(h) not only show how the parameters of different layers change during the training process, but also indicate that: 1) the importance of parameters varies from layer to layer; 2) the significance of parameters also varies throughout the training process; this is the effect of the normalization of the weights $\tilde{\eta}^{d}_{i,j}$, which introduces competition between the layers; and 3) the competition between the layers continues until equilibrium is reached.

5.2.2. Preventing $\eta^{d}_{i,j}$ from vanishing leads to higher segmentation performance

In equation (4), we redefine $\eta^{d}_{i,j}$ as $\tilde{\eta}^{d}_{i,j}$ to force all layers to learn features that are directly discriminative for classifying textures. We then sum the probability maps produced by these layers, weighted by their importance factors $\tilde{\eta}^{d}_{i,j}$, to generate the segmentation map of UNet𝑑 (defined in equation (6)). To verify whether this constrained range and the weighted combination yield better performance, we run experiments on the BCSS dataset, in which ADS_UNet is trained in 3 modes:

1) 𝜼^𝑑: with its elements $\eta^{d}_{i,j}$ trained without range constraint. After UNet𝑑 is trained, the output of the layer with the largest $\eta^{d}_{i,j}$ value is selected to generate the final segmentation map; i.e., letting (𝑖′, 𝑗′) = arg max_{(𝑖,𝑗)} ($\eta^{d}_{i,j}$), the final probability map is obtained by 𝑦̂^𝑑 = 𝑦̂^{𝑖′,𝑗′}, with 𝑦̂^{𝑖,𝑗} defined in equation (1). Then 𝑦̂^𝑑 is compared with the ground truth to calculate 𝛼^𝑑 (the weight of UNet𝑑).

2) 𝜼̃^𝑑: $\tilde{\eta}^{d}_{i,j}$ is bounded in [1/(2(𝑑+1)), (𝑑+2)/(2(𝑑+1))], according to equation (4). The final segmentation map generation and 𝛼^𝑑 calculation criteria are the same as in 1).

3) 𝜼̃^𝑑 (sum): the training criteria are the same as in 2), but the segmentation map produced by model UNet𝑑 is the weighted summation of the multi-scale predictions (using equation (6)), which is then used to calculate 𝛼^𝑑.

The results of training ADS_UNet in the 3 different modes are reported in Table 6, where ADS_UNet with bounded 𝜼̃^𝑑 is seen to slightly surpass the unbounded one. After combining the probability maps produced by the supervision layers based on the layer importance factors $\tilde{\eta}^{d}_{i,j}$, the mIoU score on the BCSS dataset is further improved by 0.95 points. To explain the results of Table 6, the loss, $\eta^{d}_{i,j}$ and $\tilde{\eta}^{d}_{i,j}$ of the ADS_UNet (trained in mode 1 and mode 3) are tracked and visualized in Figure 7. As observed in Figures 7(i)-7(p), when there is no range constraint on $\eta^{d}_{i,j}$, only one specific layer's loss dominates the learning process and the losses of the other layers become almost negligible ($\eta^{d}_{i,j}$ close to 0) after training for a few epochs. But the loss
Figure 8: (a) Feature similarity of layers for UNet. (b) The difference in feature similarity of layers between UNet↓ and UNet. In (a), each entry shows the CKA similarity between two layers. In (b), we calculate the feature similarity matrix for UNet↓, then take the difference between UNet↓ and UNet. These plots are based on the BCSS dataset.

Figure 9: (a) Feature similarity of the output layers of ADS_UNet. (b) and (c) We calculate the feature similarity matrices for UNet𝑒 and UNet++, then take the difference between ADS_UNet and UNet𝑒, UNet++, respectively. These plots are based on the BCSS dataset.
increases ($\mathcal{L}^{3,0}$ in Figure 7(k) and $\mathcal{L}^{4,0}$ in Figure 7(l)), so there is still reduced discriminability at the intermediate layers (X3,0, X4,0). However, this phenomenon is eliminated after the range constraint is imposed, which suppresses the weight of the dominant layer and enables those of the others to grow, as shown in Figures 7(a)-7(h). That means that, by retaining the information from previous layers, the range of features that are being learned is increased, therefore leading to better performance. Note that $\mathcal{L}^{3,0}$ in Figure 7(c) and $\mathcal{L}^{4,0}$ in Figure 7(d) keep decreasing, which differs from the behaviour in Figure 7(k) and Figure 7(l).

5.3. Feature Similarity of Hidden Layers

Since deep supervision provides features of intermediate blocks with a direct contribution to the overall loss, the similarity of features learned by these blocks will be higher than in the original UNet. Centered Kernel Alignment (CKA) (Kornblith et al., 2019) has been developed as a tool for comparing feature representations of neural networks. Here we use CKA to characterize the similarity of feature representations learned by different convolutional blocks in UNet↓. As shown in Figure 8(b), the similarity of features extracted by blocks in UNet↓ is mostly higher than in their counterparts in UNet (although 6 of the similarity entries in UNet↓ have lower values than those of UNet), which is consistent with our expectation (the 20 positive values add up to 1.89, vs. the 6 negative values, which add up to -0.47).
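Linear CKA, as defined by Kornblith et al. (2019), is compact enough to state in code (a sketch; X and Y are activation matrices of two layers over the same n examples):

import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representations.
    X: (n, p1) and Y: (n, p2) activations on the same n examples."""
    X = X - X.mean(axis=0)   # center each feature dimension
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))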
CRediT authorship contribution statement

Yilong Yang: Conceptualization, Methodology, Visualization, Investigation, Software, Formal analysis, Writing - Original Draft. Srinandan Dasmahapatra: Supervision, Conceptualisation, Formal analysis, Writing - Reviewing & Editing. Sasan Mahmoodi: Supervision, Conceptualisation, Formal analysis, Writing - Reviewing & Editing.

The following Python code computes the proportion of incorrect labels in down-scaled masks (Table 5); the body of avg_pool is an assumed reconstruction that counts windows containing more than one class label:

import numpy as np

def avg_pool(mask, mask_shape, f):
    # Assumed implementation: count f-by-f windows containing more than
    # one class label; such windows yield incorrect labels after
    # down-scaling the mask by a factor of f.
    err_count_per_mask = 0
    for r in range(0, mask_shape, f):
        for c in range(0, mask_shape, f):
            window = mask[r:r + f, c:c + f]
            if len(np.unique(window)) > 1:
                err_count_per_mask += 1
    return err_count_per_mask

def get_incorrect_ratio(mask_shape, mask_list):
    down_factor = [2, 4, 8, 16]
    index = [0, 1, 2, 3]
    err_count_per_scale = [0, 0, 0, 0]
    for mask in mask_list:
        for idx, f in zip(index, down_factor):
            err_count_per_scale[idx] += avg_pool(mask, mask_shape, f)
    # total number of down-sampled pixels at each scale, across all masks
    pixel_count = [(mask_shape // f) ** 2 * len(mask_list) for f in down_factor]
    return np.array(err_count_per_scale) / np.array(pixel_count)

Marquez, E.S., Hare, J.S., Niranjan, M., 2018. Deep cascade learning. IEEE transactions on neural networks and learning systems 29, 5475–5485.
Opitz, D., Maclin, R., 1999. Popular ensemble methods: An empirical study. Journal of artificial intelligence research 11, 169–198.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al., 2019. Pytorch: An imperative style, high-performance deep learning library, in: Advances in neural information processing systems, pp. 8026–8037.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer. pp. 234–241.
Roy, A.G., Navab, N., Wachinger, C., 2018. Concurrent spatial and channel 'squeeze & excitation' in fully convolutional networks, in: International conference on medical image computing and computer-assisted intervention, Springer. pp. 421–429.
Smith, L.N., Topin, N., 2019. Super-convergence: Very fast training of
neural networks using large learning rates, in: Artificial intelligence
and machine learning for multi-domain operations applications, Inter-
national Society for Optics and Photonics. p. 1100612.
Taherkhani, A., Cosma, G., McGinnity, T.M., 2020. Adaboost-cnn: An
adaptive boosting algorithm for convolutional neural networks to clas-
sify multi-class imbalanced datasets using transfer learning. Neurocom-
puting 404, 351–366.
Vahadane, A., Peng, T., Sethi, A., Albarqouni, S., Wang, L., Baust, M.,
Steiger, K., Schlitter, A.M., Esposito, I., Navab, N., 2016. Structure-
preserving color normalization and sparse stain separation for histolog-
ical images. IEEE transactions on medical imaging 35, 1962–1971.
Xie, S., Tu, Z., 2015. Holistically-nested edge detection, in: Proceedings of
the IEEE international conference on computer vision, pp. 1395–1403.
Zhou, Q., Wu, X., Zhang, S., Kang, B., Ge, Z., Latecki, L.J., 2022. Contex-
tual ensemble network for semantic segmentation. Pattern Recognition
122, 108290.
Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J., 2019. Unet++:
Redesigning skip connections to exploit multiscale features in image
segmentation. IEEE transactions on medical imaging 39, 1856–1867.
Zhu, Q., Du, B., Turkbey, B., Choyke, P.L., Yan, P., 2017. Deeply-
supervised cnn for prostate segmentation, in: 2017 International Joint
Conference on Neural Networks (Ijcnn), IEEE. pp. 178–184.
Yilong Yang received the master's degree in software engineering from Xiamen University, Xiamen, China, in 2019. He is currently a Ph.D. candidate with the Vision, Learning and Control Research Group, University of Southampton, United Kingdom. His research interests include computer vision and geometric deep learning.

Srinandan Dasmahapatra received the Ph.D. degree in physics from the State University of New York, Stony Brook, NY, USA, in 1992. He is currently an Associate Professor in the School of Electronics and Computer Science, University of Southampton, Southampton, U.K. His research interests include artificial intelligence and pattern recognition.

Sasan Mahmoodi received the Ph.D. degree from the University of Newcastle, Newcastle upon Tyne, U.K., in 1998. He is currently an Associate Professor in the School of Electronics and Computer Science, University of Southampton, Southampton, U.K. His research interests include medical image processing, computer vision, and modelling of biological vision.