H2Former: An Efficient Hierarchical Hybrid Transformer for Medical Image Segmentation
Abstract— Accurate medical image segmentation is of great significance for computer-aided diagnosis. Although methods based on convolutional neural networks (CNNs) have achieved good results, they are weak at modeling the long-range dependencies that segmentation tasks need in order to build global context. Transformers can establish long-range dependencies among pixels through self-attention, providing a complement to local convolution. In addition, multi-scale feature fusion and feature selection are crucial for medical image segmentation tasks, but they are ignored by Transformers. However, it is challenging to directly apply self-attention to CNNs due to the quadratic computational complexity on high-resolution feature maps. Therefore, to integrate the merits of CNNs, multi-scale channel attention and Transformers, we propose an efficient hierarchical hybrid vision Transformer (H2Former) for medical image segmentation. With these merits, the model can be data-efficient in the limited medical data regime. The experimental results show that our approach exceeds previous Transformer, CNN and hybrid methods on three 2D and two 3D medical image segmentation tasks. Moreover, it remains computationally efficient in terms of model parameters, FLOPs and inference time. For example, H2Former outperforms TransUNet by 2.29% in IoU score on the KVASIR-SEG dataset with 30.77% of its parameters and 59.23% of its FLOPs.

Index Terms— Hierarchical hybrid transformer, medical image segmentation, multi-scale channel attention, self-attention.

Manuscript received 23 February 2023; accepted 29 March 2023. Date of publication 5 April 2023; date of current version 31 August 2023. This work was supported in part by the National Natural Science Foundation under Grant 62272248, in part by the CAAI-Huawei MindSpore Open Fund under Grant CAAIXSJLJJ-2021-025A, in part by AISG Tech Challenge Funding under Grant AISG2-TC-2021-003, and in part by A*STAR CDF under Grant C222812010. (Corresponding author: Tao Li.)
Along He, Kai Wang, and Chengkun Du are with the Tianjin Key Laboratory of Network and Data Security Technology, College of Computer Science, Nankai University, Tianjin 300350, China (e-mail: [email protected]; [email protected]; [email protected]).
Tao Li is with the College of Computer Science, Nankai University, Tianjin 300350, China, and also with the Xingchuang Haihe Laboratory, Tianjin 300459, China (e-mail: [email protected]).
Shuang Xia is with the Radiology Department, Tianjin First Central Hospital, Nankai, Tianjin 300350, China (e-mail: [email protected]).
Huazhu Fu is with the Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore 138632 (e-mail: [email protected]).
Digital Object Identifier 10.1109/TMI.2023.3264513

I. INTRODUCTION

PRECISE lesion segmentation from medical images has been one of the fundamental yet challenging problems in medical image analysis, and it can benefit disease diagnosis and treatment planning. However, relying on experts for diagnosis is time-consuming and easily affected by clinical experience. Therefore, automated medical image segmentation is a valuable auxiliary tool for medical experts.

Recently, convolutional neural networks (CNNs) have been the mainstream in computer vision [1], [2], [3], as well as in the field of medical image analysis [4], [5], [6], [7], [8]. Most segmentation methods are based on UNet [7] or its variants [8], [9], [10]. However, due to the intrinsic locality of convolution, CNNs fail to model long-range dependencies, leading to sub-optimal results, and they have the following two shortcomings. Firstly, small convolutional kernels attend to a local region and force the network to focus on local feature patterns rather than global features. Medical image segmentation often requires long-range global information for reliable results, since the shape and size of lesions vary greatly. Secondly, after training, the convolutional kernels are static and fixed, and the well-trained parameters cannot adapt to the input image contents. Therefore, CNNs lack flexibility for inputs with different characteristics.

Meanwhile, the Transformer [11], built on self-attention, has achieved remarkable success in natural language processing. It is well known that visual attention plays an important role in computer vision [12], [13]; therefore, a natural idea is to introduce self-attention into computer vision. The Vision Transformer (ViT) [14] is the first attempt to adopt the Transformer encoder for image classification, achieving competitive performance with its CNN counterparts. Because of this great success, Transformers have also become popular in other vision tasks, such as object detection [15] and semantic segmentation [16], achieving state-of-the-art performance. With their long-range information interaction and dynamic feature encoding ability, Transformers have been applied to medical image segmentation. TransUNet [17] is the first attempt to employ a Transformer for medical image segmentation, using a hybrid structure of CNNs and Transformers. Then, pure Transformer encoder-decoder structures named Swin-UNet [18] and nnFormer [19] (for volumetric medical image segmentation) were proposed, showing strong performance.

Although long-range dependencies can be modeled well by the Transformer, it still has drawbacks that limit its performance. Firstly, spatial information is ignored, as it serializes images into 1D tokens, and it is weak in local feature learning, which is crucial for 2D images. Although this problem can be relieved by position encoding, the position encoding needs to adapt to various input resolutions through interpolation, which affects the performance. The second drawback of
few years. UNet [7] is seminal and widely used for medical image segmentation and has achieved state-of-the-art results on various medical image segmentation tasks [4], [5], [6]. Then, many variants of UNet were developed, such as Dense-UNet [28], Attention UNet [9], UNet++ [8], nnUNet [10] and UNet3+ [29]. Inspired by these fundamental architectures, some methods are tailored for specific tasks, such as optic disc and optic cup segmentation from fundus images [5] and COVID-19 lung infection segmentation [30], etc. All in all, CNNs have made great progress in medical image segmentation.

B. Vision Transformers for Medical Image Segmentation

Recently, Transformer-based methods have shown promising performance on vision tasks [16], [24], [25], [31], [32], [33], and they have been viewed as an alternative to their CNN counterparts. ViT [11] is the first work to apply pure Transformers to image classification; it takes the input image as a sequence of image patches and then adopts Transformer blocks to encode image features. In order to train ViT more effectively, DeiT [34] proposed several efficient training strategies that allow ViT to converge well with no external data. To mitigate the quadratic computational overhead, a window-based Transformer named Swin Transformer [31] was proposed to perform self-attention within local windows for computer vision tasks, reducing computation cost and achieving good performance. In order to introduce desirable properties of CNNs to Transformers, some works learn from the CNN design philosophy [23], [35] and achieve strong performance with efficiency.

Due to the promising performance of Transformers, some works have been proposed for both 2D and 3D medical image segmentation [36], [37], [38]; nnFormer [19] is a strong 3D volumetric medical image segmentation method with local and global volume-based self-attention mechanisms to learn volume representations. TransUNet [17] is the first to employ a hybrid CNN-Transformer architecture for medical image segmentation. It leverages the local features from CNNs and the global context features from Transformers in a cascade manner, which is a strong alternative to CNNs. Inspired by the Swin Transformer [31], Swin-UNet [18] was proposed with a pure Transformer structure. However, it ignores local spatial information, which is extremely essential for segmentation tasks. In order to address the dilemma that the Transformer requires large amounts of data, a hybrid Transformer architecture (UTNet) [39] was proposed that integrates self-attention into a CNN to enhance medical image segmentation. The experimental results show that such integration can significantly improve the feature representation ability of the model.

C. Analysis of Previous Work

Although the above methods have achieved good results, they still ignored the integration of multi-scale channel attention and a hierarchical hybrid manner. Therefore, the problem is how to aggregate the strengths of CNNs, multi-scale channel attention and Transformers to build a unified block with reasonable parameters and computational complexity for medical image segmentation. Firstly, CNNs can extract local spatial information well, but they still have an inevitably limited receptive field and lack the ability to model long-range cues. Secondly, for Transformers, although long-range dependencies can be modeled well, they still ignore local detailed information and structural priors to some extent, which are essential for segmentation tasks. Beyond that, pure Transformer models ignore channel-wise attention, and global self-attention like ViT [11] requires large-scale datasets and high computational complexity, making it difficult to train Transformers efficiently for medical segmentation applications. Thirdly, hybrid Transformers only consider token-wise self-attention to build the relationship among tokens; they do not fully consider channel-wise interaction, and they conduct feature learning at a single scale. There are three key differences between the proposed H2Former and previous works, as shown in Fig. 2: (1) Multi-scale feature fusion is crucial to feature representation, and we want to perform multi-scale feature fusion in the Transformer in a simple and efficient way. Although previous works like CrossFormer consider multi-scale feature fusion within a stage, they only concatenate these multi-scale features without feature selection, resulting in sub-optimal solutions (see Table VI, E.6 and E.7) because of the conflict between features at different scales. Therefore, how to perform multi-scale feature fusion while considering feature selection becomes very important. For this reason, we propose MSCA to conduct dynamic selection of multi-scale features and model the relationships between channel dimensions, which complements token-wise self-attention. (2) The labeled training data is limited in the medical domain, which means that we should fully exploit the features from the backbone. Our hybrid block can model local and long-range dependencies at the same time, and multi-scale features with feature selection are also considered to perform more effective multi-scale feature fusion. These merits make the proposed method superior to other hybrid Transformer-CNN models using multi-scale feature fusion. Other hybrid Transformer-CNN models using multi-scale feature fusion only consider the inherent multi-scale features from multiple stages and fuse them together without feature selection. In addition to this implicit multi-scale feature, we also explicitly model the multi-scale features inside each stage, forming a nested multi-scale feature representation, which helps the network make full use of features of various granularity. (3) Previous models only consider either the performance or the efficiency of the model; we consider both at the same time, and thus our model can achieve excellent performance while maintaining a high inference speed (see Table VII and Fig. 1).

III. METHODOLOGY

In this section, we first present the overview of our proposed method. Then, the details of the proposed hierarchical hybrid Transformer block are given.

A. Overview and Preliminary

The overall pipeline of our proposed method is illustrated in Fig. 3. Following the design philosophy of [7], [31], the
Fig. 3. (a) Overall architecture of the proposed H2Former. The encoder consists of four stages, and each stage contains a hybrid Transformer block. (b) The structure of the proposed hybrid Transformer block. It extracts local features and multi-scale channel attention features through the convolutional block and MSCA, and then aggregates them together. Rich features are then extracted by the Transformer block, combining the advantages of CNNs, multi-scale channel-wise attention and the token-wise features of Transformers simultaneously. 'Img2Seq' and 'Seq2Img' denote the reshape operations needed to meet the shape requirements.
the window size for window self-attention is 16 × 16. For each stage, the number of Transformer blocks is {2, 2, 2, 2}, and the number of heads in the four stages is set to {2, 4, 8, 16}. The hybrid Transformer block can model local spatial relationships and introduce the inductive bias of CNNs, and this property makes the network data-efficient, which is suitable for medical image segmentation in the low-data regime. In addition, it can model the multi-scale channel-wise and token-wise long-range dependencies at the same time, which integrates their advantages into a unified block. There are two main advantages of our hybrid Transformer block over previous hybrid structures: 1) our hybrid Transformer block introduces hierarchical spatial information, global feature encoding and the multi-scale channel attention mechanism into a unified block to capture multi-granularity channel-wise and token-wise features, which can better model lesion features with different sizes and shapes in medical images; 2) it achieves better segmentation results on both 2D and 3D medical image segmentation tasks, while maintaining reasonable parameters and FLOPs. By combining the three kinds of features, our method can learn better feature representations.
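As a reading aid, the following PyTorch-style sketch mirrors the data flow of the hybrid Transformer block described in Fig. 3(b) together with the stage hyper-parameters listed above. The module interfaces, the additive fusion of the two branches and the names (HybridTransformerBlockSketch, STAGE_CFG) are illustrative assumptions of ours, not the released implementation.

```python
import torch
import torch.nn as nn

class HybridTransformerBlockSketch(nn.Module):
    """Illustrative sketch of one encoder stage of H2Former (Fig. 3b).

    Assumptions (not from the official code): local and MSCA features are
    fused by addition, and 'Img2Seq'/'Seq2Img' are plain flatten/reshape ops.
    """
    def __init__(self, conv_block, msca, transformer_block):
        super().__init__()
        self.conv_block = conv_block          # local CNN features (inductive bias)
        self.msca = msca                      # multi-scale channel attention
        self.transformer = transformer_block  # window self-attention (16x16 windows)

    def forward(self, x):                     # x: (B, C, H, W)
        local_feat = self.conv_block(x)       # local spatial relationships
        channel_feat = self.msca(x)           # multi-scale channel-wise attention
        fused = local_feat + channel_feat     # aggregate the two branches (assumed)
        b, c, h, w = fused.shape
        tokens = fused.flatten(2).transpose(1, 2)          # 'Img2Seq': (B, H*W, C)
        tokens = self.transformer(tokens)                  # token-wise long-range modeling
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # 'Seq2Img'

# Stage configuration stated in the text: four stages, {2, 2, 2, 2} Transformer
# blocks per stage, {2, 4, 8, 16} attention heads, and 16x16 attention windows.
STAGE_CFG = {"blocks": (2, 2, 2, 2), "heads": (2, 4, 8, 16), "window": 16}
```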
C. Patch Expanding Path

We observe that patch expanding with a pure Transformer structure, as in Swin-UNet [18] and MISSFormer [40], leads to two problems. First, it introduces redundant computational cost during training and testing. Through the experimental results in Table VI, we find that a simple convolutional decoder can obtain very good segmentation results, because the encoder has a strong ability to extract rich features for the decoder. The second problem is that redundant features cause difficulties in optimization, confusing the model and leading to sub-optimal solutions. Therefore, we adopt simple convolutions for patch expanding for the sake of efficiency.

In the patch expanding path, we adopt convolutional layers for decoding the extracted hybrid multi-scale features {F1, F2, F3, F4} from the encoder path. We upsample the token features with a patch expanding (PE) layer, which is implemented as a transposed convolution layer. After patch expanding, we concatenate the features from the encoder path with the same resolution, like the skip connections in UNet:

Yi = Convi(Concat(PEi(Yi−1), F4−i)), (i = 1, 2, 3),   (10)

where Yi denotes the output features of each decoder stage, Y0 = F4, Conv denotes a convolutional block consisting of a 3 × 3 convolutional layer, a Batch Normalization [43] layer and a ReLU activation layer, and Concat denotes channel-wise concatenation. In the final prediction head, Y3 is first resized to the same resolution as the input, and then we apply a 1 × 1 convolution layer followed by the softmax activation function to map it to K channels, corresponding to (K−1) lesion maps and one background map.
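A minimal PyTorch sketch of one decoder stage and the final prediction head, written directly from Eq. (10) and the description above. The channel widths, the stride-2 ConvTranspose2d used for the PE layer and the bilinear resizing of Y3 are our assumptions rather than details taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One step of Eq. (10): Y_i = Conv_i(Concat(PE_i(Y_{i-1}), F_{4-i}))."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # Patch expanding (PE): transposed convolution, assumed stride-2 upsampling.
        self.pe = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        # Conv: 3x3 convolution + Batch Normalization + ReLU, as stated in the text.
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, y_prev, f_skip):
        y_up = self.pe(y_prev)                 # upsample decoder features
        y = torch.cat([y_up, f_skip], dim=1)   # channel-wise Concat with encoder skip
        return self.conv(y)

class PredictionHead(nn.Module):
    """Resize Y_3 to the input resolution, then 1x1 conv + softmax over K classes."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    def forward(self, y3, out_size):
        y3 = F.interpolate(y3, size=out_size, mode="bilinear", align_corners=False)
        return torch.softmax(self.proj(y3), dim=1)  # (K-1) lesion maps + background
```

Three such stages, applied for i = 1, 2, 3 with Y0 = F4, realize the whole patch expanding path.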
For training, the supervised loss is defined as:

L = Lce(y, p) + λLdice(y, p),   (11)

where Lce is the cross-entropy loss, Ldice is the Dice loss, and y and p are the ground truth and the prediction results, respectively. Due to the class imbalance in fundus lesion segmentation, the combination of the two loss functions can effectively alleviate this problem, and the default setting of λ is 1.
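Eq. (11) can be written compactly as below. This is a generic cross-entropy plus soft-Dice formulation with the stated default λ = 1; the smoothing constant eps and the per-class averaging are our assumptions.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(probs, target_onehot, eps=1e-6):
    """Soft Dice loss averaged over classes; probs and target are (B, K, H, W)."""
    dims = (0, 2, 3)
    intersection = torch.sum(probs * target_onehot, dims)
    cardinality = torch.sum(probs + target_onehot, dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()

def segmentation_loss(logits, target, lam=1.0):
    """Eq. (11): L = L_ce(y, p) + lambda * L_dice(y, p), with lambda = 1 by default."""
    ce = F.cross_entropy(logits, target)  # target: (B, H, W) class indices
    probs = torch.softmax(logits, dim=1)
    num_classes = logits.shape[1]
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    return ce + lam * soft_dice_loss(probs, onehot)
```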
IV. EXPERIMENTS

In this section, we evaluate the method on three 2D and two 3D medical image segmentation tasks. We first introduce the benchmark datasets, implementation details and evaluation metrics. Then we compare our method with state-of-the-art methods and perform ablation studies to validate the design choices of the proposed architecture.

A. Datasets

1) IDRiD Dataset [44]: The Indian Diabetic Retinopathy Image Dataset (IDRiD) is a dataset for the segmentation and grading of retinal images, hosted at the International Symposium on Biomedical Imaging (ISBI) 2018. This fundus lesion segmentation dataset consists of 81 images with a resolution of 2848 × 4288, and pixel-wise label masks of hard exudates (EX), hemorrhages (HE), microaneurysms (MA) and soft exudates (SE) are provided. Because fundus lesions are small, there is a serious imbalance between background and foreground lesions, which is a great challenge for pixel-level segmentation tasks. There is an official split of 54 images for training and 27 images for testing.

2) Kvasir-SEG Dataset [45]: This is a dataset for gastrointestinal polyp segmentation, which contains 1000 polyp images and their corresponding label masks. Since there is no official train and test split, we perform 5-fold cross-validation and report the average results.

3) Skin Lesion Dataset [46]: This is a dermoscopic image analysis benchmark challenge for automated diagnosis of skin cancer. The dermoscopy image dataset we used is the ISIC 2016 skin lesion segmentation challenge dataset. It contains 900 training dermoscopic images and 379 test images, and the binary ground-truth masks are labeled by experts.

4) ACDC Dataset [47]: The Automatic Cardiac Diagnosis Challenge (ACDC) is a dataset for automated cardiac diagnosis, which contains 100 patients. For all these data, the corresponding manual annotations for the cavity of the right ventricle (RV), the myocardium (MYO) of the left ventricle and the cavity of the left ventricle (LV) are given by clinical experts. The dataset is split into 70 training samples, 10 validation samples and 20 test samples, the same as nnFormer [19].

5) Synapse Dataset [48]: This dataset consists of 30 CT scans for multi-organ segmentation, covering 13 organs: spleen, right kidney, left kidney, gallbladder, esophagus, liver, stomach, aorta, inferior vena cava, portal vein and splenic vein, pancreas, right adrenal gland and left adrenal gland. The dataset is split into 20 training samples and 10 test samples, and the evaluation setting is the same as [19].

B. Implementation Details and Evaluation Metrics

1) Implementation Details: In the experiments, to reduce overfitting and improve the generalization ability of the model,
fundus lesion segmentation task, MAE and Acc metrics are not suitable for measuring the segmentation performance, and thus we adopt Dice and IoU. We report the average performance over the four lesion types in Table III; extensive experiments show that our method outperforms the previous methods by a large margin, revealing its great potential for tiny lesion segmentation tasks.
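For reference, the Dice and IoU scores used in the evaluation can be computed from binary masks as follows; these are the standard definitions, with a small epsilon added by us to keep empty masks well defined.

```python
import numpy as np

def dice_score(pred, gt, eps=1e-7):
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def iou_score(pred, gt, eps=1e-7):
    """IoU = |P ∩ G| / |P ∪ G| for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)
```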
Following TransUNet [17], our method is trained with 2D slices and tested in a slice-by-slice fashion for all the 3D volumes; the predicted 2D slices are then stacked together to reconstruct the 3D prediction results for evaluation, so that it can be compared fairly with recent competing methods. For the evaluation metric, we adopt the Dice score, the same as nnUNet [10] and nnFormer [19].
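A sketch of this slice-by-slice evaluation protocol is given below. The volume layout (slices along the first axis), the single-channel model input and the argmax decoding are our assumptions, not details fixed by the paper.

```python
import torch

@torch.no_grad()
def predict_volume_slice_by_slice(model, volume):
    """Run a 2D model over a 3D volume and stack the per-slice predictions.

    volume: float tensor of shape (D, H, W); returns an int64 label volume (D, H, W).
    Assumes the model takes (B, 1, H, W) inputs and outputs (B, K, H, W) scores.
    """
    model.eval()
    pred_slices = []
    for d in range(volume.shape[0]):
        x = volume[d].unsqueeze(0).unsqueeze(0)      # (1, 1, H, W)
        logits = model(x)                            # (1, K, H, W)
        pred_slices.append(logits.argmax(dim=1)[0])  # per-slice label map (H, W)
    return torch.stack(pred_slices, dim=0)           # reconstructed 3D prediction
```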
4) Results on ACDC for Automated Cardiac Segmentation: As shown in Table IV, the best method is H2Former, and it is better than nnUNet and nnFormer in Dice score despite using naive training strategies. nnUNet and nnFormer are specifically designed for 3D medical image segmentation, and the training strategy is also specially optimized for nnUNet, whereas our method adopts a naive training recipe without pre-processing, post-processing or advanced strategies like deep supervision, which demonstrates the effectiveness of our approach.

TABLE V. Quantitative comparisons to previous state-of-the-art methods on the multi-organ segmentation dataset.

5) Results on Multi-Organ Segmentation: This dataset has 13 classes, different from the dataset used in nnFormer [19] and nnUNet [10], which contains only 8 classes; we therefore re-implemented these methods with their released code. We report the average performance over the 13 organs, and the 95% Hausdorff Distance (HD95) and Dice score are shown in Table V. Compared with current SOTA methods like nnUNet and nnFormer, our method achieves the best results, which shows that our proposed hybrid Transformer block is still effective for 3D medical image segmentation.
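The 95% Hausdorff Distance (HD95) reported in Table V can be computed along the lines of the sketch below. Extracting surfaces by binary erosion and taking the 95th percentile of the directed surface distances is a common convention, but the exact variant (and any voxel-spacing handling) is our assumption.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def hd95(pred, gt):
    """95% Hausdorff Distance between two binary masks (2D or 3D), in voxels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    # Surface voxels: foreground minus its erosion.
    pred_surf = np.argwhere(pred & ~binary_erosion(pred))
    gt_surf = np.argwhere(gt & ~binary_erosion(gt))
    if len(pred_surf) == 0 or len(gt_surf) == 0:
        return np.inf                                # undefined for empty masks
    d = cdist(pred_surf, gt_surf)                    # pairwise Euclidean distances
    d_pred_to_gt = d.min(axis=1)                     # directed distances P -> G
    d_gt_to_pred = d.min(axis=0)                     # directed distances G -> P
    return max(np.percentile(d_pred_to_gt, 95), np.percentile(d_gt_to_pred, 95))
```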
The results of H2Former on the three 2D and two 3D medical image segmentation datasets fully demonstrate its excellent ability to handle different modalities, with both large and tiny organs or lesions, at reasonable model complexity. Comprehensive experimental results show that the proposed hierarchical hybrid vision Transformer achieves superior quantitative performance against other SOTA methods on different medical image segmentation tasks.

6) Visualization Results: For qualitative analysis, we show the 2D visual comparisons of our method and other models in Fig. 5 and Fig. 6. It can be seen that the predictions of H2Former capture the features really well and generate much better segmentation results.
Fig. 5. The visual comparison results of the skin and polyp segmentation tasks. The images in the first two rows are results of skin lesion segmentation, and the last two rows are results of polyp segmentation. It can be seen that skin images obscured by hair and polyp images with complex surrounding tissues are handled well.
Fig. 6. The visual comparison results of fundus lesion segmentation. As indicated by the white arrows, our model is able to segment small lesions in the fundus images under low contrast, while other methods could not segment them well, showing that our method can robustly segment not only large lesions but also small ones. Red, green, blue and pink markings denote hard exudates, hemorrhages, microaneurysms and soft exudates, respectively.
For more challenging cases, such as the skin image in the first row obscured by hair and the polyp image in the third row with complex surrounding mucosa, our model handles these challenging cases well and can produce clear boundaries. The segmentation performance of other models decreases due to the challenging occlusion and complex surrounding mucosa. As shown in Fig. 6, for small lesion segmentation from fundus images, H2Former is still capable of segmenting the tiny lesions indicated by the white arrows. We can observe that the small lesions highlighted by the white arrows are misclassified by other
TABLE VI. Ablation studies on the Kvasir-SEG dataset. 'Trans' denotes Transformer, 'Casc' denotes cascaded manner and 'Hier' denotes hierarchical manner.
is, the two kinds of features will not deviate from each other. However, the cascaded manner may lead to deviation between shallow short-range features and deep long-range features, resulting in performance degradation.
4) Convolution Stem: To verify the necessity of the convolution stem, reshape and linear embedding are used instead of the convolution stem for patch partition. From the ablation results of E.9 and E.7, it can be found that the convolution stem performs better than reshape and linear embedding. The reason is that the convolution stem can provide strong local spatial information, while linear embedding does not preserve local position information well; thus the convolution stem performs better. Therefore, we use a convolution stem for patch partition.
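The two patch-partition options compared in this ablation can be sketched as follows. The channel widths and the stem depth are illustrative assumptions; the point is that the convolutional stem uses overlapping windows and therefore keeps local position information, whereas reshape plus linear embedding is equivalent to a non-overlapping patch-size convolution.

```python
import torch.nn as nn

def conv_stem(in_ch=3, embed_dim=64):
    """Convolutional stem: overlapping 3x3 convolutions that downsample 4x
    while preserving local spatial structure (illustrative channel widths)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, embed_dim // 2, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(embed_dim // 2),
        nn.ReLU(inplace=True),
        nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(embed_dim),
        nn.ReLU(inplace=True),
    )

def linear_patch_embedding(in_ch=3, embed_dim=64, patch_size=4):
    """Reshape + linear embedding baseline: equivalent to a non-overlapping
    patch_size x patch_size convolution, so patches do not share local context."""
    return nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
```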
5) Final Structure: Further, by comparing E.7 (H2Former) against the other settings, we observe that it achieves the best results among them and outperforms the baseline (E.1) by 3.39% in Dice and 3.56% in IoU, which means that it is necessary to consider long-range dependencies, local spatial information and multi-scale tokenization features within a unified block for medical image segmentation. As shown in E.5, if we replace the convolutional decoder with a Transformer decoder, the performance deteriorates and the computational cost increases significantly. Compared with CNNs, the Transformer lacks the ability to perceive 2D spatial information, and thus it fails to recover the spatial information well when the decoder partially recovers the original spatial positions, while a CNN has a strong inductive bias and a strong perception of position information, and it can better recover the position information and obtain accurate segmentation masks. The Transformer decoder is also more complex than the convolutional decoder and requires more training data to fit. However, the labeled medical images are limited and not enough to train such an over-complex network structure, and thus the simple convolutional decoder works better. Therefore, the light-weight convolutional decoder is more efficient than a complex Transformer decoder, and it can achieve satisfactory results with reasonable complexity.

TABLE VII. Analysis of parameters, FLOPs and inference time for various models with input size of 512 × 512. The backbones are based on ResNet-34 [1].

Fig. 9. A failure case from the IDRiD dataset. As we can see from the segmentation results, there are confusions among different lesion types and missed small lesions. It is very challenging to segment lesions in such a complex context, because four types of lesions appear in adjacent regions and the contrast of this region is low.

higher than ours. Further, their segmentation performance is significantly worse than our method; see Table I, Table II and Table III. Therefore, our hybrid structure can ensure good results while keeping a reasonable computational cost, and thus it is a more effective structure. From the results in Table VII, our model achieves the best trade-off between performance and model complexity in terms of the number of parameters, FLOPs and the inference time.
powerful self-attention models could improve the segmentation results; more advanced training pipelines like self-supervised learning and semi-supervised learning may enhance the feature diversity and improve the generalization ability of the model.

V. CONCLUSION AND FUTURE WORK

In this work, to leverage the benefits of CNNs and the multi-scale channel-wise and token-wise features of Transformers, we proposed a hierarchical hybrid vision Transformer for medical image segmentation, dubbed H2Former, which integrates the inductive bias of CNNs, multi-scale channel attention token features and the Transformer to combine their merits in a unified block. As an alternative framework to CNNs, H2Former increases the model's capability for modeling multi-scale long-range dependencies and local spatial information. It is superior to previous CNN, Transformer and hybrid methods, showing better results on several 2D and 3D medical image segmentation tasks without any advanced training strategies. It also demonstrates that the proposed method achieves a good trade-off between model complexity and segmentation performance.

In the future, we plan to validate the capabilities of H2Former on other tasks such as medical image classification, self-supervised learning and semi-supervised learning.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[2] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700-4708.
[3] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105-6114.
[4] Z. Gu et al., "CE-Net: Context encoder network for 2D medical image segmentation," IEEE Trans. Med. Imag., vol. 38, no. 10, pp. 2281-2292, Mar. 2019.
[5] H. Fu, J. Cheng, Y. Xu, D. W. K. Wong, J. Liu, and X. Cao, "Joint optic disc and cup segmentation based on multi-label deep network and polar transformation," IEEE Trans. Med. Imag., vol. 37, no. 7, pp. 1597-1605, Jul. 2018.
[6] T. Li et al., "Applications of deep learning in fundus images: A review," Med. Image Anal., vol. 69, Apr. 2021, Art. no. 101971.
[7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2015, pp. 234-241.
[8] Z. Zhou et al., "UNet++: Redesigning skip connections to exploit multiscale features in image segmentation," IEEE Trans. Med. Imag., vol. 39, no. 6, pp. 1856-1867, Dec. 2019.
[9] O. Oktay et al., "Attention U-Net: Learning where to look for the pancreas," 2018, arXiv:1804.03999.
[10] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein, "nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation," Nature Methods, vol. 18, no. 2, pp. 203-211, Feb. 2021.
[11] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998-6008.
[12] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3-19.
[13] J. Fu et al., "Dual attention network for scene segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3146-3154.
[14] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[15] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 213-229.
[16] S. Zheng et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6881-6890.
[17] J. Chen et al., "TransUNet: Transformers make strong encoders for medical image segmentation," 2021, arXiv:2102.04306.
[18] H. Cao et al., "Swin-Unet: Unet-like pure transformer for medical image segmentation," 2021, arXiv:2105.05537.
[19] H.-Y. Zhou, J. Guo, Y. Zhang, L. Yu, L. Wang, and Y. Yu, "nnFormer: Interleaved transformer for volumetric segmentation," 2021, arXiv:2109.03201.
[20] S. H. Lee, S. Lee, and B. C. Song, "Vision transformer for small-size datasets," 2021, arXiv:2112.13492.
[21] Y. Xu, Q. Zhang, J. Zhang, and D. Tao, "ViTAE: Vision transformer advanced by exploring intrinsic inductive bias," in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 1-12.
[22] H. Wu et al., "CvT: Introducing convolutions to vision transformers," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 22-31.
[23] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, "Bottleneck transformers for visual recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 16519-16529.
[24] Q. Zhang and Y.-B. Yang, "ResT: An efficient transformer for visual recognition," in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 15475-15485.
[25] W. Wang et al., "CrossFormer: A versatile vision transformer hinging on cross-scale attention," 2021, arXiv:2108.00154.
[26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2881-2890.
[27] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with Atrous separable convolution for semantic image segmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 801-818.
[28] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, "H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes," IEEE Trans. Med. Imag., vol. 37, no. 12, pp. 2663-2674, Dec. 2017.
[29] H. Huang et al., "UNet 3+: A full-scale connected UNet for medical image segmentation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 1055-1059.
[30] D. Fan et al., "Inf-Net: Automatic COVID-19 lung infection segmentation from CT images," IEEE Trans. Med. Imag., vol. 39, no. 8, pp. 2626-2637, Aug. 2020.
[31] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012-10022.
[32] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, "Segmenter: Transformer for semantic segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 7262-7272.
[33] S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, "TransReID: Transformer-based object re-identification," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 15013-15022.
[34] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347-10357.
[35] Z. Dai, H. Liu, Q. Le, and M. Tan, "CoAtNet: Marrying convolution and attention for all data sizes," in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 1-12.
[36] Y. Zhang, H. Liu, and Q. Hu, "TransFuse: Fusing transformers and CNNs for medical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2021, pp. 14-24.
[37] Y. Xie, J. Zhang, C. Shen, and Y. Xia, "CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2021, pp. 171-180.
[38] X. Yan, H. Tang, S. Sun, H. Ma, D. Kong, and X. Xie, "AFTer-UNet: Axial fusion transformer UNet for medical image segmentation," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2022, pp. 3971-3981.
[39] Y. Gao, M. Zhou, and D. N. Metaxas, "UTNet: A hybrid transformer architecture for medical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2021, pp. 61-71.
[40] X. Huang, Z. Deng, D. Li, and X. Yuan, "MISSFormer: An effective medical image segmentation transformer," 2021, arXiv:2109.07162.
[41] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "ECA-Net: Efficient channel attention for deep convolutional neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11531-11539.
[42] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450.
[43] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., 2015, pp. 448-456.
[44] P. Porwal et al., "Indian diabetic retinopathy image dataset (IDRiD): A database for diabetic retinopathy screening research," Data, vol. 3, no. 3, p. 25, 2018.
[45] D. Jha et al., "Kvasir-SEG: A segmented polyp dataset," in Proc. Int. Conf. Multimedia Modeling. Cham, Switzerland: Springer, 2020, pp. 451-462.
[46] D. Gutman et al., "Skin lesion analysis toward melanoma detection: A challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC)," 2016, arXiv:1605.01397.
[47] O. Bernard et al., "Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?" IEEE Trans. Med. Imag., vol. 37, no. 11, pp. 2514-2525, May 2018.
[48] B. Landman, Z. Xu, J. E. Igelsias, M. Styner, T. R. Langerak, and A. Klein, "MICCAI multi-atlas labeling beyond the cranial vault-workshop and challenge," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., 2015, p. 12.
[49] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. Int. Conf. Learn. Represent., 2018, pp. 1-19.
[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248-255.
[51] Y. Fang, C. Chen, Y. Yuan, and K.-Y. Tong, "Selective feature aggregation network with area-boundary constraints for polyp segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2019, pp. 302-310.
[52] D.-P. Fan et al., "PraNet: Parallel reverse attention network for polyp segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2020, pp. 263-273.
[53] R. Zhang, G. Li, Z. Li, S. Cui, D. Qian, and Y. Yu, "Adaptive context selection for polyp segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2020, pp. 253-262.
[54] A. Hatamizadeh et al., "UNETR: Transformers for 3D medical image segmentation," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2022, pp. 1748-1758.
[55] G. Xu, X. Wu, X. Zhang, and X. He, "LeViT-UNet: Make faster encoders with transformer for medical image segmentation," 2021, arXiv:2107.08623.