R2 Unet PDF
Md Zahangir Alom1*, Chris Yakopcic1, Tarek M. Taha1, and Vijayan K. Asari1 are with the University of Dayton, 300 College Park, Dayton, OH, 45469, USA. (e-mail: {alomm1, cyakopcic1, ttaha1, vasari1}@udayton.edu).
Mahmudul Hasan2 is with Comcast Labs, Washington, DC, USA. (e-mail: [email protected]).
Fig. 2. U-Net architecture consisting of convolutional encoding and decoding units that take an image as input and produce segmentation feature maps with the respective pixel classes.
manual segmentation approaches, there is a significant demand for computer algorithms that can do segmentation quickly and accurately without human interaction. However, there are some limitations of medical image segmentation, including data scarcity and class imbalance. Most of the time, the large number of labels (often in the thousands) needed for training is not available for several reasons [11]. Labeling the dataset requires an expert in this field, which is expensive and requires a lot of effort and time. Sometimes, different data transformation or augmentation techniques (data whitening, rotation, translation, and scaling) are applied to increase the number of labeled samples available [12, 13, 14]. In addition, patch-based approaches are used for solving class imbalance problems. In this work, we have evaluated the proposed approaches on both patch-based and entire image-based approaches. However, to switch from the patch-based approach to the pixel-based approach that works with the entire image, we must be aware of the class imbalance problem. In the case of semantic segmentation, the image backgrounds are assigned a label and the foreground regions are assigned a target class. Therefore, the class imbalance problem is resolved without any trouble. Two advanced techniques, cross-entropy loss and dice similarity, are introduced for efficient training of classification and segmentation tasks in [13, 14].
Furthermore, in medical image processing, global localization and context modulation are very often applied for localization tasks. Each pixel is assigned a class label with a desired boundary that is related to the contour of the target lesion in identification tasks. To define these target lesion boundaries, we must emphasize the related pixels. Landmark detection in medical imaging [15, 16] is one example of this.
There were several traditional machine learning and image processing techniques available for medical image segmentation tasks before the DL revolution, including amplitude segmentation based on histogram features [17], the region-based segmentation method [18], and the graph-cut approach [19]. However, semantic segmentation approaches that utilize DL have become very popular in recent years in the field of medical image segmentation, lesion detection, and localization [20]. In addition, DL-based approaches are known as universal learning approaches, where a single model can be utilized efficiently in different modalities of medical imaging such as MRI, CT, and X-ray.
According to a recent survey, DL approaches are applied to almost all modalities of medical imaging [20, 21]. Furthermore, the highest number of papers have been published on segmentation tasks in different modalities of medical imaging [20, 21]. A DCNN-based brain tumor segmentation and detection method was proposed in [22].
From an architectural point of view, the CNN model for classification tasks requires an encoding unit and provides class probability as an output. In classification tasks, convolution operations with activation functions are performed, followed by sub-sampling layers that reduce the dimensionality of the feature maps. As the input samples traverse through the layers of the network, the number of feature maps increases but the dimensionality of the feature maps decreases. This is shown in the first part of the model (in green) in Fig. 2. Since the number of feature maps increases in the deeper layers, the number of network parameters increases accordingly. Eventually, the Softmax operations are applied at the end of the network to compute the probability of the target classes.
As opposed to classification tasks, the architecture for segmentation tasks requires both convolutional encoding and decoding units. The encoding unit is used to encode input images into a larger number of maps with lower dimensionality. The decoding unit is used to perform up-convolution (de-convolution) operations to produce segmentation maps with the same dimensionality as the original input image. Therefore, the architecture for segmentation tasks generally requires almost double the number of network parameters when compared to the architecture for classification tasks. Thus, it is important to design efficient DCNN architectures for segmentation tasks that can ensure better performance with fewer network parameters.
This research demonstrates two modified and improved segmentation models, one using recurrent convolutional networks, and another using recurrent residual convolutional networks. To accomplish our goals, the proposed models are
Fig. 3. RU-Net architecture: a U-Net with convolutional encoding and decoding units built from recurrent convolutional layers (RCL). In the R2U-Net architecture, residual units are used together with the RCLs.
evaluated on different modalities of medical imaging as shown in Fig. 1. The contributions of this work can be summarized as follows:
1) Two new models, RU-Net and R2U-Net, are introduced for medical image segmentation.
2) The experiments are conducted on three different modalities of medical imaging including retina blood vessel segmentation, skin cancer segmentation, and lung segmentation.
3) Performance evaluation of the proposed models is conducted for the patch-based method for retina blood vessel segmentation tasks and the end-to-end image-based approach for skin lesion and lung segmentation tasks.
4) Comparison against recently proposed state-of-the-art methods shows superior performance against equivalent models with the same number of network parameters.
The paper is organized as follows: Section II discusses related work. The architectures of the proposed RU-Net and R2U-Net models are presented in Section III. Section IV explains the datasets, experiments, and results. The conclusion and future direction are discussed in Section V.

II. RELATED WORK

Semantic segmentation is an active research area where DCNNs are used to classify each pixel in the image individually, which is fueled by different challenging datasets in the fields of computer vision and medical imaging [23, 24, 25]. Before the deep learning revolution, the traditional machine learning approach mostly relied on hand-engineered features that were used for classifying pixels independently. In the last few years, a lot of models have been proposed that have proved that deeper networks are better for recognition and segmentation tasks [5]. However, training very deep models is difficult due to the vanishing gradient problem, which is resolved by implementing modern activation functions such as Rectified Linear Units (ReLU) or Exponential Linear Units (ELU) [5, 6]. Another solution to this problem is proposed by He et al.: a deep residual model that overcomes the problem by utilizing an identity mapping to facilitate the training process [26].
In addition, CNN-based segmentation methods based on FCN provide superior performance for natural image segmentation [2]. One of the image patch-based architectures is called Random architecture, which is very computationally intensive and contains around 134.5M network parameters. The main drawback of this approach is that a large number of pixels overlap and the same convolutions are performed many times. The performance of FCN has improved with recurrent neural networks (RNN), which are fine-tuned on very large datasets [27]. Semantic image segmentation with DeepLab is one of the state-of-the-art performing methods [28]. SegNet consists of two parts: the encoding network, which is a 13-layer VGG16 network [5], and the corresponding decoding network, which uses pixel-wise classification layers. The main contribution of SegNet is the way in which the decoder up-samples its lower resolution input feature maps [10]. Later, an improved version of SegNet, called Bayesian SegNet, was proposed in 2015 [29]. Most of these architectures are explored using computer vision applications. However, there are some deep learning models that have been proposed specifically for medical image segmentation, as they consider data insufficiency and class imbalance problems.
One of the very first and most popular approaches for semantic medical image segmentation is called “U-Net” [12]. A diagram of the basic U-Net model is shown in Fig. 2. According to the structure, the network consists of two main parts: the convolutional encoding and decoding units. The basic convolution operations are performed followed by ReLU activation in both parts of the network. For down-sampling in the encoding unit, 2×2 max-pooling operations are performed. In the decoding phase, the convolution transpose (representing up-convolution, or de-convolution) operations are performed to up-sample the feature maps. The very first version of U-Net cropped and copied feature maps from the encoding unit to the decoding unit. The U-Net model provides several advantages for segmentation tasks. First, this model allows for the use of global location and context at the same time. Second, it works with very few training samples and provides better performance for segmentation tasks [12]. Third, an end-to-end pipeline processes the entire image in the forward pass and directly produces segmentation maps. This ensures that U-Net preserves the full context of the input images, which is a major advantage when compared to patch-based segmentation approaches [12, 14].
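As a rough illustration of the U-Net data flow described in this section (2×2 max-pooling in the encoding unit, up-sampling in the decoding unit, and feature maps passed from encoder to decoder), the following NumPy sketch traces feature-map shapes through one down-sampling step, one up-sampling step, and a channel-wise concatenation. Nearest-neighbour repetition stands in for the learned transpose convolution, and the function names and the 48×48×16 input shape are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling with stride 2 on an (H, W, C) array; H and W must be even."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbour 2x up-sampling (assumed stand-in for a transpose convolution)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def skip_concat(encoder_maps, decoder_maps):
    """Concatenate encoder and decoder feature maps along the channel axis."""
    return np.concatenate([encoder_maps, decoder_maps], axis=-1)

rng = np.random.default_rng(0)
enc = rng.standard_normal((48, 48, 16))  # hypothetical encoder feature maps
down = max_pool_2x2(enc)                 # spatial size halves, channels unchanged
up = upsample_2x(down)                   # back to the encoder resolution
merged = skip_concat(enc, up)            # channel counts add up
print(down.shape, up.shape, merged.shape)
```

The concatenation doubles the channel count at the decoder input, which is why a convolution usually follows it to reduce the number of feature maps again.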
However, U-Net is not limited to applications in the domain of medical imaging; nowadays this model is widely applied for computer vision tasks as well [30, 31]. Meanwhile, different variants of U-Net models have been proposed, including a very simple variant of U-Net for CNN-based segmentation of medical imaging data [32]. In this model, two modifications are made to the original design of U-Net: first, a combination of multiple segmentation maps and forward feature maps are summed (element-wise) from one part of the network to the other. The feature maps are taken from different layers of the encoding and decoding units and finally summation (element-wise) is performed outside of the encoding and decoding units. The authors report promising performance improvement during training with better convergence compared to U-Net, but no benefit was observed when using a summation of features during the testing phase [32]. However, this concept proved that feature summation impacts the performance of a network. The importance of skip connections for biomedical image segmentation tasks has been empirically evaluated with U-Net and residual networks [33]. A deep contour-aware network called Deep Contour-Aware Networks (DCAN) was proposed in 2016, which can extract multi-level contextual features using a hierarchical architecture for accurate gland segmentation of histology images and shows very good performance for segmentation [34]. Furthermore, Nabla-Net, a deep dag-like convolutional architecture, was proposed for segmentation in 2017 [35].
Other deep learning approaches have been proposed based on U-Net for 3D medical image segmentation tasks as well. The 3D-Unet architecture for volumetric segmentation learns from sparsely annotated volumetric images [13]. A powerful end-to-end 3D medical image segmentation system based on volumetric images called V-net has been proposed, which consists of an FCN with residual connections [14]. This paper also introduces a dice loss layer [14]. Furthermore, a 3D deeply supervised approach for automated segmentation of volumetric medical images was presented in [36]. High-Res3DNet was proposed using residual networks for 3D segmentation tasks in 2016 [37]. In 2017, a CNN-based brain tumor segmentation approach was proposed using a 3D-CNN model with a fully connected CRF [38]. Pancreas segmentation was proposed in [39], and Voxresnet was proposed in 2016, where a deep voxel-wise residual network is used for brain segmentation. This architecture utilizes residual networks and summation of feature maps from different layers [40].
Alternatively, we have proposed two models for semantic segmentation based on the architecture of U-Net in this paper. The proposed Recurrent Convolutional Neural Networks (RCNN) model based on U-Net is named RU-Net, which is shown in Fig. 3. Additionally, we have proposed a residual RCNN based U-Net model which is called R2U-Net. The following section provides the architectural details of both models.

III. RU-NET AND R2U-NET ARCHITECTURES

Inspired by the deep residual model [7], RCNN [41], and U-Net [12], we propose two models for segmentation tasks which are named RU-Net and R2U-Net. These two approaches utilize the strengths of all three recently developed deep learning models. RCNN and its variants have already shown superior performance on object recognition tasks using different benchmarks [42, 43]. The recurrent residual convolutional operations can be demonstrated mathematically according to the improved-residual networks in [43]. The operations of the Recurrent Convolutional Layers (RCL) are performed with respect to the discrete time steps that are expressed according to the RCNN [41]. Let $x_l$ be the input sample in the $l$-th layer of the residual RCNN (RRCNN) block, and consider a pixel located at $(i, j)$ in an input sample on the $k$-th feature map in the RCL. Additionally, let $O_{ijk}^{l}(t)$ denote the output of the network at time step $t$. The output can be expressed as follows:

$$O_{ijk}^{l}(t) = (w_k^{f})^{T} * x_l^{f(i,j)}(t) + (w_k^{r})^{T} * x_l^{r(i,j)}(t-1) + b_k \qquad (1)$$

Here $x_l^{f(i,j)}(t)$ and $x_l^{r(i,j)}(t-1)$ are the inputs to the standard convolution layers and to the $l$-th RCL, respectively. The $w_k^{f}$ and $w_k^{r}$ values are the weights of the standard convolutional layer and the RCL of the $k$-th feature map, respectively, and $b_k$ is the bias. The outputs of the RCL are fed to the standard ReLU activation function $f$ and are expressed as:

$$\mathcal{F}(x_l, w_l) = f(O_{ijk}^{l}(t)) = \max(0, O_{ijk}^{l}(t)) \qquad (2)$$

$\mathcal{F}(x_l, w_l)$ represents the outputs of the $l$-th layer of the RCNN unit. The output of $\mathcal{F}(x_l, w_l)$ is used for the down-sampling and up-sampling layers in the convolutional encoding and decoding units of the RU-Net model, respectively. In the case of R2U-Net, the final outputs of the RCNN unit are passed through the residual unit that is shown in Fig. 4(d). Let the output of the RRCNN-block be $x_{l+1}$, which can be calculated as follows:

$$x_{l+1} = x_l + \mathcal{F}(x_l, w_l) \qquad (3)$$

Here, $x_l$ represents the input samples of the RRCNN-block. The $x_{l+1}$ sample is used as the input for the immediately succeeding sub-sampling or up-sampling layers in the encoding and decoding convolutional units of R2U-Net. However, the number of feature maps and the dimensions of the feature maps for the residual units are the same as in the RRCNN-block shown in Fig. 4(d).
Fig. 4. Different variants of convolutional and recurrent convolutional units: (a) forward convolutional unit, (b) recurrent convolutional block, (c) residual convolutional unit, and (d) recurrent residual convolutional unit (RRCU).
The proposed deep learning models are the building blocks of the stacked convolutional units shown in Fig. 4(b) and (d).
There are four different architectures evaluated in this work. First, U-Net with forward convolution layers and feature concatenation is applied as an alternative to the crop and copy method found in the primary version of U-Net [12]. The basic convolutional unit of this model is shown in Fig. 4(a). Second, U-Net with forward convolutional layers with residual connectivity is used, which is often called residual U-Net (ResU-Net) and is shown in Fig. 4(c) [14]. The third architecture is U-Net with forward recurrent convolutional layers as shown in Fig. 4(b), which is named RU-Net. Finally, the last architecture is U-Net with recurrent convolution layers with residual connectivity as shown in Fig. 4(d), which is named R2U-Net. The pictorial representation of the unfolded RCL layers with respect to the time step is shown in Fig. 5. Here, t = 2 (0 ~ 2) refers to the recurrent convolutional operation that includes one single convolution layer followed by two subsequent recurrent convolutional layers. In this implementation, we have applied concatenation to the feature maps from the encoding unit to the decoding unit for both RU-Net and R2U-Net models.
copying unit from the basic U-Net model and use only concatenation operations, resulting in a more sophisticated architecture with better performance.
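The recurrent units of Fig. 4(b) and (d) can be sketched in a few lines of NumPy under simplifying assumptions. The sketch below follows Eq. (1) to Eq. (3) for a single feature map: the feed-forward term is computed once from the block input, the recurrent term is applied to the previous state for t further steps, and the residual variant adds the block input to the result. The single-channel setting, the 3×3 kernel size, the zero padding, and all function names are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Single-channel 2D cross-correlation with zero padding ('same' output size)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for di in range(kh):
        for dj in range(kw):
            out += kernel[di, dj] * padded[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

def relu(z):
    """Eq. (2): element-wise max(0, z)."""
    return np.maximum(0.0, z)

def recurrent_conv_layer(x, w_f, w_r, b, t=2):
    """Unfold an RCL over time steps 0..t, following Eq. (1) and Eq. (2)."""
    feed_forward = conv2d_same(x, w_f)   # feed-forward term, identical at every step
    state = relu(feed_forward + b)       # step 0: no recurrent input yet
    for _ in range(t):                   # steps 1..t add the recurrent term
        state = relu(feed_forward + conv2d_same(state, w_r) + b)
    return state

def rrcnn_block(x, w_f, w_r, b, t=2):
    """Eq. (3): x_{l+1} = x_l + F(x_l, w_l)."""
    return x + recurrent_conv_layer(x, w_f, w_r, b, t)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))            # hypothetical single feature map
w_f = 0.1 * rng.standard_normal((3, 3))    # feed-forward weights (random placeholders)
w_r = 0.1 * rng.standard_normal((3, 3))    # recurrent weights (random placeholders)
y = rrcnn_block(x, w_f, w_r, b=0.0, t=2)
print(y.shape)
```

With t = 2 the loop runs twice, matching the description of one convolution layer followed by two recurrent convolutional layers. The residual addition requires the input and output to have the same shape, which holds here because the sketch keeps a single feature map and uses 'same' padding.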
Fig. 6. Example images from training dataset: left column from DRIVE dataset,
middle column from STARE dataset and right column from CHASE-DB1
dataset. The first row shows the original images, second row shows fields of
view (FOV), and third row shows the target outputs.
Fig. 11. Experimental outputs of STARE dataset using R2U-Net: first row shows input image after performing normalization, second row shows ground truth, and third row shows the experimental outputs.
C. Results
1) Retina Blood Vessel Segmentation Using the DRIVE Dataset
The precise segmentation results achieved with the proposed R2U-Net model are shown in Fig. 8. Figs. 9 and 10 show the training and validation accuracy when using the DRIVE dataset. These figures show that the proposed R2U-Net and RU-Net models provide better performance during both the training and validation phases when compared to U-Net and ResU-Net.
Fig. 10. Validation accuracy of the proposed models against ResU-Net and U-Net.
2) Retina blood vessel segmentation on the STARE dataset
The experimental outputs of R2U-Net when using the STARE dataset are shown in Fig. 11. The training and
Fig. 12. Training accuracy in STARE dataset for R2U-Net, RU-Net, ResU-Net, and U-Net.
Fig. 13. Validation accuracy in STARE dataset for R2U-Net, RU-Net, ResU-Net, and U-Net.
3) CHASE_DB1
For qualitative analysis, the example outputs of R2U-Net are shown in Fig. 14. For quantitative analysis, the results are given in Table I. From the table, it can be concluded that in all cases, the proposed RU-Net and R2U-Net models show better performance in terms of AUC and accuracy. The ROC curves for the highest AUCs for the R2U-Net model on each of the three retina blood vessel segmentation datasets are shown in Fig. 15.
validation during training with a batch size of 32 and 150 epochs.
The training accuracy of the proposed models R2U-Net and RU-Net was compared with that of ResU-Net and U-Net for an end-to-end image-based segmentation approach. The result is
TABLE I. EXPERIMENTAL RESULTS OF PROPOSED APPROACHES FOR RETINA BLOOD VESSEL SEGMENTATION AND COMPARISON AGAINST OTHER TRADITIONAL AND DEEP LEARNING-BASED APPROACHES. SENSITIVITY (SE), SPECIFICITY (SP), ACCURACY (AC), AREA UNDER THE ROC CURVE (AUC).
Dataset     Methods               Year  F1-score  SE      SP      AC      AUC
DRIVE       Chen [53]             2014  -         0.7252  0.9798  0.9474  0.9648
            Azzopardi [54]        2015  -         0.7655  0.9704  0.9442  0.9614
            Roychowdhury [55]     2016  -         0.7250  0.9830  0.9520  0.9620
            Liskowski [56]        2016  -         0.7763  0.9768  0.9495  0.9720
            Qiaoliang Li [57]     2016  -         0.7569  0.9816  0.9527  0.9738
            U-Net                 2018  0.8142    0.7537  0.9820  0.9531  0.9755
            Residual U-Net        2018  0.8149    0.7726  0.9820  0.9553  0.9779
            Recurrent U-Net       2018  0.8155    0.7751  0.9816  0.9556  0.9782
            R2U-Net               2018  0.8171    0.7792  0.9813  0.9556  0.9784
STARE       Marin et al. [58]     2011  -         0.6940  0.9770  0.9520  0.9820
            Fraz [59]             2012  -         0.7548  0.9763  0.9534  0.9768
            Roychowdhury [55]     2016  -         0.7720  0.9730  0.9510  0.9690
            Liskowski [56]        2016  -         0.7867  0.9754  0.9566  0.9785
            Qiaoliang Li [57]     2016  -         0.7726  0.9844  0.9628  0.9879
            U-Net                 2018  0.8373    0.8270  0.9842  0.9690  0.9898
            Residual U-Net        2018  0.8388    0.8203  0.9856  0.9700  0.9904
            Recurrent U-Net       2018  0.8396    0.8108  0.9871  0.9706  0.9909
            R2U-Net               2018  0.8475    0.8298  0.9862  0.9712  0.9914
CHASE_DB1   Fraz [59]             2012  -         0.7224  0.9711  0.9469  0.9712
            Fraz [60]             2014  -         -       -       0.9524  0.9760
            Azzopardi [54]        2015  -         0.7655  0.9704  0.9442  0.9614
            Roychowdhury [55]     2016  -         0.7201  0.9824  0.9530  0.9532
            Qiaoliang Li [57]     2016  -         0.7507  0.9793  0.9581  0.9793
            U-Net                 2018  0.7783    0.8288  0.9701  0.9578  0.9772
            Residual U-Net        2018  0.7800    0.7726  0.9820  0.9553  0.9779
            Recurrent U-Net       2018  0.7810    0.7459  0.9836  0.9622  0.9803
            R2U-Net               2018  0.7928    0.7756  0.9820  0.9634  0.9815
TABLE II. EXPERIMENTAL RESULTS OF PROPOSED APPROACHES FOR SKIN CANCER LESION SEGMENTATION AND COMPARISON AGAINST OTHER EXISTING APPROACHES. JACCARD SIMILARITY SCORE (JSC).
Methods                             Year  SE      SP      JSC     F1-score  AC      AUC     DC
Conv. classifier VGG-16 [61]        2017  0.533   -       -       -         0.6130  0.6420  -
Conv. classifier Inception-v3 [61]  2017  0.760   -       -       -         0.6930  0.7390  -
Melanoma detection [62]             2017  -       -       -       -         0.9340  -       0.8490
Skin Lesion Analysis [63]           2017  0.8250  0.9750  -       -         0.9340  -       -
U-Net (t=2)                         2018  0.9479  0.9263  0.9314  0.8682    0.9314  0.9371  0.8476
ResU-Net (t=2)                      2018  0.9454  0.9338  0.9367  0.8799    0.9367  0.9396  0.8567
RecU-Net (t=2)                      2018  0.9334  0.9395  0.9380  0.8841    0.9380  0.9364  0.8592
R2U-Net (t=2)                       2018  0.9496  0.9313  0.9372  0.8823    0.9372  0.9405  0.8608
R2U-Net (t=3)                       2018  0.9414  0.9425  0.9421  0.8920    0.9424  0.9419  0.8616
TABLE III. EXPERIMENTAL OUTPUTS OF PROPOSED MODELS OF RU-NET AND R2U-NET FOR LUNG SEGMENTATION AND COMPARISON AGAINST RESU-NET AND U-NET MODELS.
Methods         Year  SE      SP      JSC     F1-Score  AC      AUC
U-Net (t=2)     2018  0.9696  0.9872  0.9858  0.9658    0.9828  0.9784
ResU-Net (t=2)  2018  0.9555  0.9945  0.9850  0.9690    0.9849  0.9750
RU-Net (t=2)    2018  0.9734  0.9866  0.9836  0.9638    0.9836  0.9800
R2U-Net (t=2)   2018  0.9826  0.9918  0.9897  0.9780    0.9897  0.9872
R2U-Net (t=3)   2018  0.9832  0.9944  0.9918  0.9823    0.9918  0.9889
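The column abbreviations in Tables I to III correspond to common pixel-wise metrics. As a hedged reference, the sketch below computes them from the counts of true/false positives and negatives using the standard textbook definitions. This excerpt does not state how the authors computed or aggregated their values, so the formulas, the function name, and the example counts are assumptions; in particular, under these definitions the F1-score and the Dice coefficient (DC) coincide for a single binary mask, whereas Table II reports them separately. AUC is omitted because it needs per-pixel scores rather than counts.

```python
def segmentation_metrics(tp, fp, tn, fn):
    """Pixel-wise metrics from confusion-matrix counts (standard definitions).

    Assumes every denominator is non-zero.
    """
    se = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    ac = (tp + tn) / (tp + fp + tn + fn)   # accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * se / (precision + se)
    jsc = tp / (tp + fp + fn)              # Jaccard similarity
    dc = 2 * tp / (2 * tp + fp + fn)       # Dice coefficient
    return {"SE": se, "SP": sp, "AC": ac, "F1": f1, "JSC": jsc, "DC": dc}

# Hypothetical counts for one segmentation mask of 1000 pixels.
m = segmentation_metrics(tp=80, fp=10, tn=900, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```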
with same number of network parameters.
of-the-art methods including ResU-Net and U-Net in terms of AUC and accuracy on all three datasets. The numbers of network parameters of the architectures for different time steps are shown in Table IV. The processing times during the testing phase for the STARE, CHASE_DB1, and DRIVE datasets were 6.42, 8.66, and 2.84 seconds per sample, respectively. In addition, skin cancer segmentation and lung segmentation take 0.22 and 1.145 seconds per sample, respectively.