Azad_Deep_Frequency_Re-Calibration_U-Net_for_Medical_Image_Segmentation_ICCVW_2021_paper_2

Deep Frequency Re-calibration U-Net for Medical Image Segmentation

Reza Azad
Institute of Imaging and Computer Vision, RWTH Aachen University, Germany

Afshin Bozorgpour
Sharif University of Technology, Iran

Maryam Asadi-Aghbolaghi
School of Computer Science, Inst. for Research in Fundamental Sciences (IPM), Iran

Dorit Merhof
Institute of Imaging and Computer Vision, RWTH Aachen University, Germany

Sergio Escalera
Universitat de Barcelona and Computer Vision Center, Spain

Abstract

The human visual cortex is biased towards shape components, while CNNs produce texture-biased features. This fact may explain why the performance of CNNs degrades significantly when only few labeled input data are available. In this paper, we propose a frequency re-calibration U-Net (FRCU-Net) for medical image segmentation. Representing an object in terms of frequency may reduce the effect of texture bias, resulting in better generalization in a low-data regime. To do so, we apply the Laplacian pyramid in the bottleneck layer of the U-shaped structure. The Laplacian pyramid represents the object proposal in different frequency domains, where the high frequencies are responsible for the texture information and the lower frequencies might be related to the shape. Adaptively re-calibrating these frequency representations can produce a more discriminative representation for describing the object of interest. To this end, we first propose to use a channel-wise attention mechanism to capture the relationship between the channels of a set of feature maps in one layer of the frequency pyramid. Second, the extracted features of each level of the pyramid are combined through a non-linear function based on their impact on the final segmentation output. The proposed FRCU-Net is evaluated on five datasets (ISIC 2017, ISIC 2018, PH2, lung segmentation, and the SegPC 2021 challenge dataset) and compared to existing alternatives, achieving state-of-the-art results.

1. Introduction

Medical imaging is a key element in computer-aided diagnosis and smart medicine. Obtaining accurate results from medical imaging enhances diagnostic efficiency, reducing the time, cost, and error of human-based processing. Modern medical imaging approaches such as Magnetic Resonance Imaging (MRI) or Computed Tomography (CT) are useful for the medical examination of different parts of the human body. Therefore, automated processing of these kinds of imaging data is essential to support diagnosis and treatment of diseases.

Medical image segmentation is an important and effective step in numerous medical imaging tasks. To help clinicians make an accurate diagnosis and to shorten the time-consuming inspection and evaluation processes, it is necessary to pre-segment crucial tissues or abnormal features in medical images. Image segmentation covers a large number of applications, ranging from skin cancer detection in RGB images and lung tissue segmentation in CT images to pathological image analysis.

Figure 1. Skin lesion examples showing large visual variability.

Medical image segmentation is a challenging task due to several complexities. For example, in skin cancer segmentation there are large intra-class variabilities and inter-class similarities because of differences in color, texture, shape, size, contrast, and location of the lesions (Figure 1). Low contrast and obscuration between the affected areas and normal regions make recognition difficult. To overcome these issues, different approaches have been proposed for medical image segmentation. As in other fields of computer vision research, deep learning-based networks have outperformed traditional machine learning approaches. Convolutional neural networks (CNNs), inspired by the human visual cortex, are a widely used class of deep networks. They can learn complex feature hierarchies from the data layer by layer.

The main drawback of deep networks is their extreme hunger for annotated training data. However, in medical image segmentation, large annotated datasets are hard to obtain due to the burden of manual annotation. To deal with this issue, Ronneberger et al. [31] extended the idea of the fully convolutional neural network (FCN) [26] to U-Net. Compared to the previous approaches, U-Net was able to produce better performance and also alleviate the need for large amounts of training data. This network includes an encoding and a decoding path. The encoder extracts a large number of feature maps by reducing the dimensionality. On the other hand, the decoder produces the segmentation maps by applying a hierarchical series of up-convolutional layers.

To improve the performance of U-Net, many extensions of this network have been proposed [3, 4, 15, 5, 14, 32, 35]. These networks aim to enhance the original U-Net by inserting attention mechanisms, recurrent residual strategies, or other non-linear functions in the convolutional layers. What all these networks have in common is their bias towards extracting features based on texture rather than shape. This fact limits the ability of these convolutional neural networks (CNNs) to leverage useful low-frequency information, e.g. shape information [17]. It has been shown that the representation power of CNNs can be improved by employing shape information through adjusting input images [17]. However, it is still an open problem to design an efficient approach for CNNs that can attenuate high-frequency local components and benefit from low-frequency information.

To address the above problems, in this paper we propose the Frequency Re-calibration U-Net (FRCU-Net). We introduce a frequency-level attention mechanism to control and aggregate the representation space using a weighted combination of different types of frequency information. To take advantage of both texture and shape features based on their effect on the performance, we propose to include the Laplacian pyramid in the bottleneck layers of the U-Net. The low-frequency domain from the Laplacian pyramid causes the network to learn shape information while the high-frequency level is responsible for texture-based features, resulting in a reduction of the effect of the noise on the final representation. We employ the Laplacian pyramid inspired by other successful traditional image processing tools like SIFT, where the Laplacian pyramid was shown to have a high representative power for describing the object in various frequency domains by deploying Gaussian kernels.

To enhance the discriminative power of different channels of one frequency level, a channel-wise attention mechanism is exploited to re-calibrate the frequency representations, inspired by the effectiveness of the squeeze and excitation modules [22]. We then propose to employ a weighted combination function to aggregate the features of all levels of the Laplacian pyramid and allow the network to learn weights of the levels based on their importance for the final result. This mechanism helps the network to focus more on the informative and meaningful features while suppressing noisy ones by using global embedding information of the channels.

We evaluate FRCU-Net on five datasets: ISIC 2017 [13], ISIC 2018 [12], PH2 [28], Lung segmentation [1], and SegPC 2021 [16] challenge datasets. The experimental results demonstrate that the proposed network achieves superior performance compared to state-of-the-art alternatives.

2. Related Work

Research on deep learning has grown rapidly, and deep learning networks are nowadays prominent strategies for segmentation in medical imaging. FCN [26], a pixels-to-pixels network, is one of the first convolutional networks introduced for semantic segmentation. To keep the original resolution, all fully connected layers are replaced with convolution and deconvolution layers. Ronneberger et al. extended this idea and proposed U-Net [31] for biomedical image segmentation. U-Net is a fairly symmetrical U-shaped encoder-decoder architecture, in which the encoder and decoder parts are combined with skip connections at different scales to integrate deep and shallow features.

Different extensions of U-Net have been proposed for image segmentation. To process 3D volumes, 3D U-Net was proposed [11], in which all 2D operations are replaced with their 3D counterparts. By exploiting more skip connections and convolutions, U-Net++ [35] can solve the problem that edge information and small objects are lost due to the down-sampling functions. ResUNet [23] improves the performance of the original U-Net by employing a better CNN backbone with a U-shaped structure that extracts multi-scale information. Some other extensions of U-Net have been proposed by inserting additional modules in different parts of the network. Different studies show that integrating a self-attention mechanism into U-Net by modeling global interactions of all pixels in feature maps results
in better performance. Schlemper et al. propose the attention-based U-Net [32] by inserting an additive attention gate into the skip connections. Inspired by the ideas of squeeze and excitation approaches [22] and dense connections, Asadi et al. proposed MCGU-Net [4], in which channel-wise attention improves the performance of the original U-Net. Fan et al. proposed PraNet [14] by adding an RBF module to the skip connection to capture visual informative features at multiple scales. Azad et al. [5] enhance the performance of the U-Net by inserting non-linearity in the skip connections through ConvLSTM for combining the features from the encoder and decoder parts rather than a simple concatenation. Martin et al. [27] utilized a stacked version of BCDU-Net for myocardial pathology segmentation. Alom et al. [3] extended U-Net by adding Recurrent Convolutional Neural Network (RCNN) and Recurrent Residual Convolutional Neural Network (R2CNN), in which feature accumulation with recurrent residual convolutional layers ensures better feature representation.

Deeplab [10] utilizes the idea of atrous spatial pyramid pooling (ASPP) at several grid scales. Atrous convolution layers with different rates capture multi-scale information, resulting in better performance on several segmentation benchmarks [21]. By taking into account the advantages of both U-shaped networks and pyramid spatial pooling, Chen et al. introduce Deeplabv3+, in which atrous convolution extracts rich semantic information in the encoding path and controls the density of the decoder features. Azad et al. [6] improve Deeplabv3+ by inserting two attention modules, channel-wise attention and multi-scale attention, in the atrous convolution.

It has been shown that CNNs have a strong texture inductive bias which limits their ability to leverage useful shape information [17]. In other words, convolutional networks have a bias towards extracting features based on texture rather than shape. In the context of few-shot learning, a set of Difference of Gaussians (DoG) is inserted into a deep network to attenuate high-frequency local components in the feature space [7]. Lai et al. [24] propose the Laplacian Pyramid Super-Resolution Network (LapSRN) to progressively reconstruct the sub-band residuals of high-resolution images for image super-resolution. Their model takes coarse-resolution feature maps as input and predicts the high-frequency residuals. To enhance the performance of a U-shaped architecture and remove the texture bias of convolutional layers, we propose to utilize the frequency domain of the extracted features to learn shape information along with texture information, reducing the amount of noise in the feature representation.

3. Proposed Method

Inspired by a recent study on texture bias [17] and the squeeze and excitation module [22], we present FRCU-Net (Figure 2). We propose a frequency attention mechanism to re-calibrate the representation space within a U-Net based architecture. The proposed module is capable of re-calibrating the representation space by taking into account the informative frequency domains and reconstructing the representation by the non-linear attention mechanism. To this end, our proposed method incorporates the frequency attention module into the latent space to re-arrange and calibrate the frequency domain for better representation. In the following subsections, we describe each network component in detail.

3.1. Encoder

The contracting path of the U-shaped architecture (encoder) aims at extracting hierarchical semantic features and capturing context information. To train an encoder containing a high number of parameters, a large dataset with a large number of labeled samples is necessary. Transfer learning allows the network to leverage knowledge from pre-trained models and use it to solve a new problem with fewer data. We utilize Xception as the backbone of the proposed network, and therefore we can finetune the network by using a set of parameters pre-trained on the PASCAL VOC dataset. The network with the Xception backbone converges fast and achieves accurate results.

The Xception structure is a linear stack of depthwise separable convolution layers with residual connections. Channel-wise 3 × 3 spatial convolution and 1 × 1 pointwise convolution are utilized in our FRCU-Net. We define the encoder model $E$ with parameters $\theta$, which takes the input sample $I \in \mathbb{R}^{H' \times W' \times C'}$ and generates the encoded feature map $X \in \mathbb{R}^{H \times W \times C}$ as

$$X = E_\theta(I), \tag{1}$$

where $H'$, $W'$, and $C'$ are the dimensions of the input data, $H$, $W$, and $C$ are the dimensions of the encoded feature representation, and $\theta$ is the set of network parameters.

3.2. Frequency Re-calibration Module

A U-shaped architecture includes a sequence of regular convolutional layers in the bottleneck layer. The convolutional layers have a strong texture inductive bias. In other words, these models tend to perform the recognition task based on the object texture, while recognition in human vision is highly influenced by shape. By utilizing the extracted feature maps from the convolutional layers in the frequency domain, we can take advantage of both shape and texture information for training the network. The importance of texture and shape is different for different applications and data. The low-frequency domain of the extracted feature maps contains shape information of the input data while higher frequencies are responsible for texture information. Instead of focusing on only one of the two general
types of information (i.e., shape and texture), we propose a frequency re-calibration (FRC) module which consists of a Laplacian pyramid and frequency attention (Figure 3). We compute the frequency levels of the extracted feature maps through the Laplacian pyramid. Moreover, the frequency attention mechanism allows the network to focus on the more informative frequency level of features. The FRC module is exploited in the bottleneck layer.

[Figure 2 diagram: encoder and decoder paths with feature widths 64, 128, 256, and 512, and the frequency re-calibration block (Laplacian pyramid, frequency attention, re-scale) in the bottleneck. Legend: Conv 3×3, ReLU; Up-Conv 2×2, BN; Copy; Feature Concatenation; Conv 1×1, Sigmoid.]
Figure 2. FRCU-Net with 1) Laplacian pyramid to take convolutional features to frequency domain and 2) frequency attention mechanism for a non-linearly weighted combination of all levels of the pyramid.

3.2.1 Laplacian Pyramid

The extracted feature maps from the convolutional layers are mapped into the frequency domain through a Laplacian pyramid mechanism. To approximate the Laplacian function we use the same strategy as [7], i.e., the Difference of Gaussians (DoG) technique, to generate the Laplacian pyramid. First, we extract $(L + 1)$ Gaussian representations from the encoded feature map $X \in \mathbb{R}^{H \times W \times C}$, using different values for the variance of the Gaussian function to generate different scales,

$$G_l(X) = X * \frac{1}{\sigma_l \sqrt{2\pi}} \, e^{-\frac{i^2 + j^2}{2\sigma_l^2}}, \tag{2}$$

where $\sigma_l$ sets the width of the $l$th Gaussian function ($\sigma_l^2$ is its variance), $i$ and $j$ represent the spatial location in the encoded feature space, $X$ is the input set of encoded feature maps which consists of $C$ channels with the size of $H \times W$, and $*$ denotes the convolution operator. To encode frequency information at different scales, we apply a pyramid of DoGs with increasing variance. The $l$th level of the pyramid is computed as

$$LP_l = \begin{cases} G_l - G_{l+1}, & 1 \le l < L \\ G_L, & l = L \end{cases} \tag{3}$$

where $LP_l$ is the $l$th level of the Laplacian pyramid, $G_l$ is the output of the $l$th Gaussian function, and $L$ is the number of levels of the pyramid.

3.2.2 Frequency Attention

Different levels of the Laplacian pyramid contain different kinds of information. For instance, the low-frequency levels include shape-based features, while the higher-frequency levels are more related to texture. The importance of these kinds of information differs depending on the data and the task at hand. Inspired by the squeeze and excitation network [22], we propose a frequency attention mechanism to non-linearly aggregate the features of all levels of the frequency domain. In other words, the network employs the global information of each frequency level of the Laplacian pyramid. This idea helps the network to selectively emphasize informative frequency levels and suppress less useful ones.

First, for each level of the Laplacian pyramid, we re-calibrate all input channels. To this end, we utilize the global context information of the input features to generate weights for all input channels of each Laplacian pyramid level. The global average pooling is calculated as

$$GAP_l^f = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} LP_l^f(i, j), \tag{4}$$

where $LP_l^f$ is the $f$th channel of the frequency features of the $l$th Laplacian pyramid level, $H \times W$ is the size of each channel, and $GAP_l^f$ is the output of the global average pooling function for the $f$th channel of the $l$th Laplacian pyramid level. Two fully connected layers (FCL) are then used to capture the channel-wise dependencies of feature maps at each level as

$$w_l^f = \sigma\left(W_2 \, \delta\left(W_1 \, GAP_l^f\right)\right), \tag{5}$$

where $W_1$ and $W_2$ are the parameters of the FCLs, $\delta$ and $\sigma$ are the ReLU and Sigmoid activation functions, respectively, and $w_l^f$ is the learnt weight for the $f$th channel of the $l$th level. The final feature map in each channel is computed by multiplying the learnt weight with the input channel feature.

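As an illustration of Eqs. (2) and (3) in Section 3.2.1, the following is a minimal pure-Python sketch of a Difference-of-Gaussians pyramid for a single feature channel. It is not the authors' implementation: the function names (`gaussian_taps`, `blur`, `laplacian_pyramid`), the truncation of the kernel at three sigma, the edge replication at the borders, and the use of discrete taps normalised to sum to one (instead of the analytic prefactor of Eq. (2)) are all assumptions made for this example. One Gaussian is computed per pyramid level, following the indexing of Eq. (3).

```python
import math


def gaussian_taps(sigma):
    """1-D Gaussian taps truncated at 3*sigma (assumption), normalised to sum to 1."""
    radius = max(1, math.ceil(3 * sigma))
    taps = [math.exp(-(k * k) / (2.0 * sigma * sigma))
            for k in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps], radius


def blur(channel, sigma):
    """Separable Gaussian blur of one H x W channel (list of lists)."""
    taps, radius = gaussian_taps(sigma)
    h, w = len(channel), len(channel[0])

    def clamp(index, size):
        # Edge replication at the borders (assumption).
        return min(max(index, 0), size - 1)

    # Horizontal pass, then vertical pass (a 2-D Gaussian is separable).
    rows = [[sum(t * row[clamp(j + k - radius, w)] for k, t in enumerate(taps))
             for j in range(w)] for row in channel]
    return [[sum(t * rows[clamp(i + k - radius, h)][j] for k, t in enumerate(taps))
             for j in range(w)] for i in range(h)]


def laplacian_pyramid(channel, sigmas):
    """Eq. (3): LP_l = G_l - G_{l+1} for l < L, and LP_L = G_L.

    `sigmas` must be increasing; its length is the number of levels L.
    """
    gaussians = [blur(channel, s) for s in sigmas]
    levels = []
    for l in range(len(gaussians) - 1):
        levels.append([[a - b for a, b in zip(row_a, row_b)]
                       for row_a, row_b in zip(gaussians[l], gaussians[l + 1])])
    levels.append(gaussians[-1])  # coarsest level keeps the low frequencies
    return levels
```
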
[Figure 3 diagram: the Laplacian pyramid levels $l_1$ to $l_5$ of the encoded feature map $X$ ($H \times W \times C$) pass through re-calibration, which maps $LP_l^f$ to $\tilde{LP}_l^f$ using $GAP_l^f$ and the learnt weights $w_l^f$, followed by the depth-wise aggregation $M^f = \sigma\left(\sum_{l=1}^{L} w'_l \, \tilde{LP}_l^f\right)$.]
Figure 3. Frequency attention mechanism consists of 1) feature re-calibration to focus more on the informative features and 2) a non-linear depth-wise aggregation to combine features from different levels of the pyramid.
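
The two stages shown in Figure 3 can be sketched in a few lines of pure Python: channel weights from global average pooling and two fully connected layers (Eqs. (4) and (5)), followed by re-scaling and the sigmoid-gated sum over levels (Eq. (6) of Section 3.2.2). This is a hedged sketch under stated assumptions, not the published implementation: the function names are hypothetical, the fully connected layers carry no bias terms, and the same `w1` and `w2` are shared across pyramid levels because Eq. (5) gives them no level index. In a trained network `w1`, `w2`, and `level_weights` would be learnt parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Plain matrix-vector product for lists of lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def channel_weights(level, w1, w2):
    """Eqs. (4)-(5): GAP per channel, then FC -> ReLU -> FC -> sigmoid.

    `level` is a list of F channels, each an H x W list of lists.
    `w1` has shape hidden x F and `w2` has shape F x hidden (no biases: assumption).
    """
    gap = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in level]
    hidden = [max(0.0, value) for value in matvec(w1, gap)]
    return [sigmoid(value) for value in matvec(w2, hidden)]


def frequency_attention(pyramid, w1, w2, level_weights):
    """Re-scale each channel by its weight, then aggregate levels as in Eq. (6)."""
    n_ch = len(pyramid[0])
    h, w = len(pyramid[0][0]), len(pyramid[0][0][0])
    acc = [[[0.0] * w for _ in range(h)] for _ in range(n_ch)]
    for level, level_weight in zip(pyramid, level_weights):
        weights = channel_weights(level, w1, w2)
        for f in range(n_ch):
            for i in range(h):
                for j in range(w):
                    # level weight * channel weight * pyramid feature
                    acc[f][i][j] += level_weight * weights[f] * level[f][i][j]
    # Sigmoid over the weighted sum gives the output feature map M^f.
    return [[[sigmoid(v) for v in row] for row in ch] for ch in acc]
```

Because the final sigmoid squashes every output into the interval (0, 1), the aggregated map acts as a soft gate rather than a plain sum of the pyramid levels.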

The re-calibrated feature of each channel is $\tilde{LP}_l^f = w_l^f \cdot LP_l^f$, the product of the learnt weight and the input channel feature.

After re-calibrating all of the feature maps in each layer, we aggregate the features of all the pyramid levels taking into account their discriminative power. To do that, a weight is learned for each level and a non-linear depth-wise aggregation is utilized to combine these features as

$$M^f = \sigma\left(\sum_{l=1}^{L} w'_l \, \tilde{LP}_l^f\right), \tag{6}$$

where $w'_l$ is the learnt weight for the $l$th level, $\tilde{LP}_l^f$ is the $f$th channel of the feature set from the $l$th level, and $M^f$ is the $f$th channel of the output feature map.

3.3. Decoder

The same decoder as in the regular U-Net is utilized in our network. The features from the encoder part are concatenated with the up-sampled features from the previous decoder layer. The concatenated features are then passed to two 3 × 3 convolutional layers. We utilize the cross-entropy loss function to train the network.

4. Experimental Results

We evaluate the proposed network on five datasets: ISIC 2017 [13], ISIC 2018 [12], PH2 [28], Lung segmentation [1], and SegPC 2021 [16] challenge datasets. For implementation, we use Keras with TensorFlow backend. All experiments were performed on an NVIDIA GTX 1080 GPU with a batch size of 8 without any data augmentation. We use the Adam optimizer with a learning rate of $10^{-4}$ for training and stop the training process when the validation performance does not change for 10 consecutive epochs. Source code is available at https://github.com/rezazad68/FRCU-Net. To compare the proposed network with other alternatives, we consider several performance metrics, including accuracy (AC), sensitivity (SE), specificity (SP), F1-Score, and Jaccard similarity (or Jaccard index) (JS). The baseline network has the same structure as FRCU-Net, but without the FRC module.

4.1. ISIC 2017 Dataset

The ISIC 2017 dataset [13] is obtained from the 2017 Kaggle competition which consisted of 3 tasks: lesion segmentation, dermoscopic feature detection, and disease classification. The skin lesion segmentation data is considered for evaluation in this paper. This dataset includes 2000 skin lesion (cancer or non-cancer) images as training set with masks for segmentation. We use 1250 samples for training, 150 samples for validation, and 600 samples as test set. The original size of each sample is 576 × 767 pixels. The same pre-processing as [3] is used to resize images to 256 × 256 pixels.

Figure 4 shows some segmentation results of our proposed network. In Table 1 the quantitative results of the proposed network on this dataset are compared with some other related approaches. The FRCU-Net achieves better performance than the baseline network for all the metrics. The results demonstrate that, except for the sensitivity, the proposed network achieves better results than the other approaches.

4.2. ISIC 2018 Dataset

The International Skin Imaging Collaboration (ISIC) published this dataset [12] as a large-scale dataset of dermoscopy images in 2018 which includes 2594 images with their corresponding ground truth annotations (containing
cancer or non-cancer lesions). We used 1815 images for training, 259 for validation and 520 for testing, like other approaches [3]. We resize each sample from its original size of 2016 × 3024 to 256 × 256 pixels.

Figure 5 shows some example outputs of the proposed network. Table 2 lists the quantitative results of different alternative methods and the proposed network on this dataset. It can be seen that the proposed network achieves better performance than state-of-the-art alternatives for F1-score, sensitivity, accuracy and Jaccard similarity, and that our FRCU-Net outperforms the baseline for all the metrics.

Figure 4. Segmentation result of FRCU-Net on ISIC 2017.

Figure 5. Segmentation result of FRCU-Net on ISIC 2018.

Table 1. Performance comparison on ISIC 2017 dataset.
Methods              F1      SE      SP      AC      JS
U-net [31]           0.8682  0.9479  0.9263  0.9314  0.9314
Melanoma det. [13]   -       -       -       0.9340  -
Lesion Analysis [2]  -       0.8250  0.9750  0.9340  -
R2U-net [3]          0.8920  0.9414  0.9425  0.9424  0.9421
MCGU-Net [4]         0.8927  0.8502  0.9855  0.9570  0.9570
Baseline             0.9036  0.8745  0.9857  0.9647  0.9647
FRCU-Net             0.9269  0.9150  0.9861  0.9727  0.9727

Table 2. Performance comparison on ISIC 2018 dataset.
Methods          F1     SE     SP     AC     PC     JS
U-net [31]       0.647  0.708  0.964  0.890  0.779  0.549
Att U-net [30]   0.665  0.717  0.967  0.897  0.787  0.566
R2U-net [3]      0.679  0.792  0.928  0.880  0.741  0.581
Att R2U-Net [3]  0.691  0.726  0.971  0.904  0.822  0.592
BCDU-Net [5]     0.851  0.785  0.982  0.937  0.928  0.937
MCGU-Net [4]     0.895  0.848  0.986  0.955  0.947  0.955
Baseline         0.892  0.871  0.978  0.954  0.914  0.954
FRCU-Net         0.913  0.904  0.979  0.963  0.922  0.963

4.3. PH2 Dataset

The PH2 dataset [28] is a dermoscopic image database which was introduced for both segmentation and classification. The dataset contains a total of 200 melanocytic lesions, including 80 common nevi, 80 atypical nevi, and 40 melanomas. The manual segmentation of the skin lesions is available as the ground truth. The resolution of each input image is 768 × 560 pixels. We follow the experimental setting used in [25]: we randomly split the dataset into two sets of 100 images, then use one set as test data, 80% of the other set for training, and the remaining data for validation. For this dataset, we use the learnt weights of ISIC 2017 as the pre-trained model and then finetune the network with the training data.

Some segmentation outputs of the proposed network for the PH2 dataset are depicted in Figure 6. The results of the proposed network are compared with other state-of-the-art approaches in Table 3. It can be seen that, except for the specificity, the proposed approach results in better performance than state-of-the-art alternatives. The performance of the FRCU-Net is also better than the baseline network.

Figure 6. Segmentation result of FRCU-Net on PH2.

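The metric abbreviations used in the tables (AC, SE, SP, PC, F1, JS) can be made concrete with a short sketch based on the standard confusion-matrix definitions for binary masks. The paper does not spell out its formulas, so the reported numbers may follow a different convention; the function name `segmentation_metrics` and the flat 0/1 input format are assumptions of this example.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise metrics for two equally long sequences of 0/1 labels."""
    tp = fp = tn = fn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1          # lesion pixel correctly detected
        elif p and not t:
            fp += 1          # background marked as lesion
        elif not p and t:
            fn += 1          # lesion pixel missed
        else:
            tn += 1          # background correctly rejected

    def ratio(num, den):
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den else 0.0

    return {
        "AC": ratio(tp + tn, tp + tn + fp + fn),  # accuracy
        "SE": ratio(tp, tp + fn),                 # sensitivity (recall)
        "SP": ratio(tn, tn + fp),                 # specificity
        "PC": ratio(tp, tp + fp),                 # precision
        "F1": ratio(2 * tp, 2 * tp + fp + fn),    # F1-Score (Dice)
        "JS": ratio(tp, tp + fp + fn),            # Jaccard similarity
    }
```

For a 2D mask, the rows can be flattened first, for example with `[v for row in mask for v in row]`.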
Table 3. Performance comparison on PH2 dataset (best results are Table 4. Performance comparison on Lung dataset (best results are
bolded). bolded).
Methods DIC SE SP AC JS Methods F1 SE SP AC
FCN [29] 0.8903 0.9030 0.9402 0.9282 0.8022 U-net [31] 0.9658 0.9696 0.9872 0.9872
U-net [31] 0.8761 0.8163 0.9776 0.9255 0.7795 RU-net [3] 0.9638 0.9734 0.9866 0.9836
SegNet [8] 0.8936 0.8653 0.9661 0.9336 0.8077 R2U-Net [3] 0.9832 0.9944 0.9832 0.9918
FrCN [2] 0.9177 0.9372 0.9565 0.9508 0.8479
MCGU-Net [4] 0.9263 0.8322 0.9714 0.9537 0.9537
MCGU-Net[4] 0.9904 0.9910 0.9982 0.9972
Baseline 0.9278 0.9071 0.9787 0.9568 0.9568 Baseline 0.9851 0.9914 0.9962 0.9954
FRCU-Net 0.9497 0.9730 0.9689 0.9689 0.9689 FRCU-Net 0.9901 0.9904 0.9982 0.9970
Table 5. Performance comparison on SegPC dataset (best results
are bolded).
Methods mIOU
4.4. Lung Segmentation Dataset XLAB Insights [9] 0.9360
DSC-IITISM [9] 0.9356
The Lung Nodule Analysis (LUNA) competition at the
bmdeep [9] 0.9065
Kaggle Data Science Bowl in 2017 introduced a lung seg-
Baseline 0.9215
mentation dataset [1]. This data includes 2D and 3D CT
FRCU-Net 0.9392
images with labels for lung segmentation . For our evalua-
tion, we use 70% of the data as train set and the remaining
30% as test set. The size of each image is 512 × 512 pixels.
4.5. SegPC 2021 Challenge dataset
The lung lesions in CT images have almost the same Haus-
dorff value as other structures that are not of interest, such We evaluate our FRCU-Net on multiple myeloma cell
as bone and air. We use the same strategy as [5] to estimate segmentation grand challenges which are provided by the
the lung region as a region inside the estimated surrounding SegPC 2021 [16, 18, 19]. Images in this dataset were cap-
tissues. tured from bone marrow aspirate slides of patients diag-
Figure 7 shows some outputs of the proposed network. nosed with Multiple Myeloma (MM), a type of white blood
In Table 4, the performance of the FRCU-Net on this dataset cancer. This dataset consists of a training set with 290 sam-
is compared with other state-of-the-art approaches. It can be ples, a validation set with 200, and a test set with 277 sam-
seen that after the MCGU-Net, we have the best F1-Score ples. Since the test data is not publicly available, we split
and accuracy for the FRCU-Net among other approaches the training dataset into a training and validation set and
listed in this table. The MCGU-Net uses bidirectional Con- evaluate the method on the original validation set as our test
vLSTM in the skip connection layers and dense connections set. All the samples have been annotated by a pathologist
in the bottleneck layer. Consequently, compared to FRCU- and instance base segmentation masks are provided for each
Net, MCGU-Net comprises a larger number of parameters object of interest (myeloma plasmacells).
for training, and it therefore needs much longer for conver- Some segmentation outputs of the FRCU-Net are shown
gence. in Figure 8. The mIOU of the proposed network is
compared with the challenge winners in Table 5. The
first-ranked team (XLAB Insights) utilizes a combina-
tion of three instance segmentation networks (SCNet [33],
ResNeSt[34], and Mask-RCNN [20]) with a slight modi-
fication to suit the current task. The second team (DSC-
IITISM) employs the Mask-RCNN model with heavy data
augmentation approaches. Lastly, bmdeep [9] uses an at-
tention deeplabv3+ method [6] with a multi-scale region-
based training strategy. In our pipeline, we also use this [9]
strategy and replace the attention deeplabv3+ network with
our proposed model. Our experimental results demonstrate
that our proposed approach improves the performance com-
pared to all aforementioned approaches.

4.6. Effect of the FRC Module

The main modification of our proposed network com-
Figure 7. Segmentation result of FRCU-Net on Lung segmentation pared to the U-Net is utilizing the frequency domain fea-
dataset. tures in the bottleneck layer. In Figure 9, we compare the
segmentation output of the proposed network with the base-

3280
Figure 8. Segmentation result of FRCU-Net on SegPC dataset.

line model (U-Net). It shows a more precise and fine seg-

mentation output of the proposed network by utilizing the
frequency domain.
The main task of the FRC module in our proposed net-
work is taking convolutional features from some levels of
the frequency domain. Different levels of the Laplacian pyramid are responsible for different kinds of features. For instance, high-level frequencies include significant shape information, while lower frequencies contain information about the texture of the input data. It is worth mentioning that we also evaluated the network without the Laplacian pyramid, i.e., with the SE block only. The SE block improves the F1-Score of the base architecture by 1% on the ISIC 2017 dataset, while the performance of the FRCU-Net (Laplacian pyramid plus SE block) was about 2.3% higher than our baseline. In other words, both components contribute to the achieved gain.

The FRC module in our network is employed to combine these kinds of features so that the network learns to attend more to the kind of feature that is most discriminative for each particular benchmark. This can be seen in Figure 9. Compared to U-Net, FRCU-Net produces a more accurate segmentation, with a smooth boundary that properly defines the shape of the skin lesion. As we can see in the third row of Figure 9, the skin lesion is not as obvious as in the other examples and there is an overlap between the background and the lesion. Shape-based features are relevant to segment this example. Overall, one can see that the visual performance of FRCU-Net is better than that of the original U-Net.

Figure 9. Visual effect of the FRC module in the FRCU-Net. From top to bottom, examples from ISIC 2018, ISIC 2017, Lung segmentation, SegPC, and PH2 datasets.

5. Conclusion

In this paper, we proposed the FRCU-Net for skin lesion segmentation. It has been shown that regular convolutional layers tend to learn texture-based features, while in many applications shape-based features can provide highly discriminative information. In order to consider both of these kinds of features, we proposed to extend the classical U-Net by inserting the FRC module, which comprises two parts: Laplacian pyramid and frequency attention. We represent the extracted feature maps of the convolutional layers in the frequency domain to capture both texture-based and shape-based information. To aggregate the features from all levels of the Laplacian pyramid, we proposed a frequency attention mechanism. The channels of each set of feature maps are first re-calibrated by employing the global average pooling information. The features from different levels of the pyramid are then combined by utilizing a non-linear aggregation function. Our evaluation on five public medical image segmentation datasets demonstrated that the proposed FRCU-Net outperforms state-of-the-art alternatives.

6. Acknowledgment

This work has been partially supported by the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE) and by ICREA under the ICREA Academia programme.
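As a rough illustration of the FRC module summarized in the conclusion (a Laplacian pyramid followed by frequency attention), the following NumPy sketch decomposes a feature map into difference-of-Gaussians levels, re-calibrates the channels of each level with an SE-style gate driven by global average pooling, and merges the levels. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the blur scales `sigmas`, the random (untrained) weights, and the use of a sum followed by `tanh` as the non-linear aggregation are all hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(x, sigma):
    """Separable Gaussian blur over the spatial axes of an (H, W, C) array."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    out = x
    for axis in (0, 1):
        pad = [(0, 0)] * 3
        pad[axis] = (r, r)
        padded = np.pad(out, pad, mode="edge")  # replicate borders
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded)
    return out

def laplacian_pyramid(x, sigmas=(1.0, 2.0)):
    """Band-pass levels (differences of Gaussians) plus the low-pass residual."""
    blurred = [x] + [blur(x, s) for s in sigmas]
    levels = [a - b for a, b in zip(blurred[:-1], blurred[1:])]
    levels.append(blurred[-1])
    return levels  # summing all levels reconstructs x

def channel_attention(level, w1, w2):
    """SE-style re-calibration of one pyramid level."""
    z = level.mean(axis=(0, 1))            # global average pooling: (C,)
    h = np.maximum(w1 @ z, 0.0)            # ReLU bottleneck: (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))    # sigmoid gates in (0, 1): (C,)
    return level * s                       # scale each channel

def frc_block(x, weights, sigmas=(1.0, 2.0)):
    """Pyramid -> per-level channel attention -> assumed tanh aggregation."""
    levels = laplacian_pyramid(x, sigmas)
    recalibrated = [channel_attention(lvl, w1, w2)
                    for lvl, (w1, w2) in zip(levels, weights)]
    return np.tanh(np.sum(recalibrated, axis=0))

# Usage with random stand-ins for learned weights (reduction ratio r = 2).
rng = np.random.default_rng(0)
H, W, C, r = 16, 16, 8, 2
x = rng.standard_normal((H, W, C))
weights = [(rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))
           for _ in range(3)]  # one weight pair per pyramid level
y = frc_block(x, weights)
```

Because the levels are built as a telescoping difference, their sum recovers the input exactly, so the attention step only redistributes emphasis across frequency bands and channels. In a trained network the weight pairs would be learned end to end, and the block would sit at the bottleneck of the U-shaped architecture.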

References tional skin imaging collaboration (isic). arXiv preprint
arXiv:1902.03368, 2019.
[1] https://www.kaggle.com/kmader/ [13] Noel CF Codella, David Gutman, M Emre Celebi, Brian
finding-lungs-in-ct-data. Helba, Michael A Marchetti, Stephen W Dusza, Aadi
[2] Mohammed A Al-Masni, Mugahed A Al-Antari, Mun-Taek Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kit-
Choi, Seung-Moo Han, and Tae-Seong Kim. Skin lesion tler, et al. Skin lesion analysis toward melanoma detection:
segmentation in dermoscopy images via deep full resolution A challenge at the 2017 international symposium on biomed-
convolutional networks. Computer methods and programs in ical imaging (isbi), hosted by the international skin imaging
biomedicine, 162:221–231, 2018. collaboration (isic). In 2018 IEEE 15th International Sym-
[3] Md Zahangir Alom, Mahmudul Hasan, Chris Yakopcic, posium on Biomedical Imaging (ISBI 2018), pages 168–172.
Tarek M Taha, and Vijayan K Asari. Recurrent residual con- IEEE, 2018.
volutional neural network based on u-net (r2u-net) for med- [14] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu
ical image segmentation. arXiv preprint arXiv:1802.06955, Fu, Jianbing Shen, and Ling Shao. Pranet: Parallel reverse
2018. attention network for polyp segmentation. In International
[4] Maryam Asadi-Aghbolaghi, Reza Azad, Mahmood Fathy, Conference on Medical Image Computing and Computer-
and Sergio Escalera. Multi-level context gating of embed- Assisted Intervention, pages 263–273. Springer, 2020.
ded collective knowledge for medical image segmentation. [15] Abdur R Feyjie, Reza Azad, Marco Pedersoli, Claude Kauff-
arXiv preprint arXiv:2003.05056, 2020. man, Ismail Ben Ayed, and Jose Dolz. Semi-supervised few-
[5] Reza Azad, Maryam Asadi-Aghbolaghi, Mahmood Fathy, shot learning for medical image segmentation. arXiv preprint
and Sergio Escalera. Bi-directional convlstm u-net with arXiv:2003.08462, 2020.
densley connected convolutions. In Proceedings of the [16] Shiv Gehlot, Anubha Gupta, and Ritu Gupta. Ednfc-net:
IEEE/CVF International Conference on Computer Vision Convolutional neural network with nested feature concate-
Workshops, pages 0–0, 2019. nation for nuclei-instance segmentation. In ICASSP 2020-
[6] Reza Azad, Maryam Asadi-Aghbolaghi, Mahmood Fathy, 2020 IEEE International Conference on Acoustics, Speech
and Sergio Escalera. Attention deeplabv3+: Multi-level con- and Signal Processing (ICASSP), pages 1389–1393. IEEE,
text attention mechanism for skin lesion segmentation. In 2020.
European Conference on Computer Vision, pages 251–266. [17] Robert Geirhos, Patricia Rubisch, Claudio Michaelis,
Springer, 2020. Matthias Bethge, Felix A Wichmann, and Wieland Brendel.
Imagenet-trained cnns are biased towards texture; increasing
[7] Reza Azad, Abdur R Fayjie, Claude Kauffmann, Ismail
shape bias improves accuracy and robustness. arXiv preprint
Ben Ayed, Marco Pedersoli, and Jose Dolz. On the texture
arXiv:1811.12231, 2018.
bias for few-shot cnn segmentation. In Proceedings of the
[18] Anubha Gupta, Rahul Duggal, Shiv Gehlot, Ritu Gupta, An-
IEEE/CVF Winter Conference on Applications of Computer
vit Mangal, Lalit Kumar, Nisarg Thakkar, and Devprakash
Vision, pages 2674–2683, 2021.
Satpathy. Gcti-sn: Geometry-inspired chemical and tissue
[8] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla.
invariant stain normalization of microscopic medical images.
Segnet: A deep convolutional encoder-decoder architecture
Medical Image Analysis, 65:101788, 2020.
for image segmentation. IEEE transactions on pattern anal-
[19] Anubha Gupta, Pramit Mallick, Ojaswa Sharma, Ritu Gupta,
ysis and machine intelligence, 39(12):2481–2495, 2017.
and Rahul Duggal. Pcseg: Color model driven probabilistic
[9] Afshin Bozorgpour, Reza Azad, Eman Showkatian, and Alaa multiphase level set based tool for plasma cell segmentation
Sulaiman. Multi-scale regional attention deeplab3+: Multi- in multiple myeloma. PloS one, 13(12):e0207908, 2018.
ple myeloma plasma cells segmentation in microscopic im- [20] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Gir-
ages. arXiv preprint arXiv:2105.06238, 2021. shick. Mask r-cnn. In Proceedings of the IEEE international
[10] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, conference on computer vision, pages 2961–2969, 2017.
Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image [21] Mohammad Hesam Hesamian, Wenjing Jia, Xiangjian He,
segmentation with deep convolutional nets, atrous convolu- and Paul J Kennedy. Atrous convolution for binary semantic
tion, and fully connected crfs. IEEE transactions on pattern segmentation of lung nodule. In ICASSP 2019-2019 IEEE
analysis and machine intelligence, 40(4):834–848, 2017. International Conference on Acoustics, Speech and Signal
[11] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Processing (ICASSP), pages 1015–1019. IEEE, 2019.
Thomas Brox, and Olaf Ronneberger. 3d u-net: learning [22] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation net-
dense volumetric segmentation from sparse annotation. In works. In Proceedings of the IEEE conference on computer
International conference on medical image computing and vision and pattern recognition, pages 7132–7141, 2018.
computer-assisted intervention, pages 424–432. Springer, [23] Debesh Jha, Sharib Ali, Håvard D Johansen, Dag D
2016. Johansen, Jens Rittscher, Michael A Riegler, and Pål
[12] Noel Codella, Veronica Rotemberg, Philipp Tschandl, Halvorsen. Real-time polyp detection, localisation and
M Emre Celebi, Stephen Dusza, David Gutman, Brian segmentation in colonoscopy using deep learning. arXiv
Helba, Aadi Kalloo, Konstantinos Liopyris, Michael preprint arXiv:2011.07631, 2020.
Marchetti, et al. Skin lesion analysis toward melanoma [24] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-
detection 2018: A challenge hosted by the interna- Hsuan Yang. Deep laplacian pyramid networks for fast and

3282
accurate super-resolution. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages
624–632, 2017.
[25] Xinhua Liu, Gaoqiang Hu, Xiaolin Ma, and Hailan Kuang.
An enhanced neural network based on deep metric learn-
ing for skin lesion segmentation. In 2019 Chinese Control
And Decision Conference (CCDC), pages 1633–1638. IEEE,
2019.
[26] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
convolutional networks for semantic segmentation. In Pro-
ceedings of the IEEE conference on computer vision and pat-
tern recognition, pages 3431–3440, 2015.
[27] Carlos Martı́n-Isla, Maryam Asadi-Aghbolaghi, Polyxeni
Gkontra, Victor M Campello, Sergio Escalera, and Karim
Lekadir. Stacked bcdu-net with semantic cmr synthesis: Ap-
plication to myocardial pathology segmentation challenge.
In Myocardial Pathology Segmentation Combining Multi-
Sequence CMR Challenge, pages 1–16. Springer, 2020.
[28] Teresa Mendonça, Pedro M Ferreira, Jorge S Marques,
André RS Marcal, and Jorge Rozeira. Ph 2-a dermoscopic
image database for research and benchmarking. In 2013
35th annual international conference of the IEEE engineer-
ing in medicine and biology society (EMBC), pages 5437–
5440. IEEE, 2013.
[29] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han.
Learning deconvolution network for semantic segmentation.
In Proceedings of the IEEE international conference on com-
puter vision, pages 1520–1528, 2015.
[30] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee,
Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven
McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Atten-
tion u-net: Learning where to look for the pancreas. arXiv
preprint arXiv:1804.03999, 2018.
[31] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-
net: Convolutional networks for biomedical image segmen-
tation. In International Conference on Medical image com-
puting and computer-assisted intervention, pages 234–241.
Springer, 2015.
[32] Jo Schlemper, Ozan Oktay, Michiel Schaap, Mattias Hein-
rich, Bernhard Kainz, Ben Glocker, and Daniel Rueckert.
Attention gated networks: Learning to leverage salient re-
gions in medical images. Medical image analysis, 53:197–
207, 2019.
[33] Thang Vu, Haeyong Kang, and Chang D Yoo. Scnet: Train-
ing inference sample consistency for instance segmentation.
In Proceedings of the AAAI Conference on Artificial Intelli-
gence, volume 35, pages 2701–2709, 2021.
[34] Hang Zhang, Chongruo Wu, Zhongyue Zhang, Yi Zhu,
Haibin Lin, Zhi Zhang, Yue Sun, Tong He, Jonas Mueller,
R Manmatha, et al. Resnest: Split-attention networks. arXiv
preprint arXiv:2004.08955, 2020.
[35] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima
Tajbakhsh, and Jianming Liang. Unet++: A nested u-net
architecture for medical image segmentation. In Deep learn-
ing in medical image analysis and multimodal learning for
clinical decision support, pages 3–11. Springer, 2018.