H2Former: An Efficient Hierarchical Hybrid Transformer for Medical Image Segmentation
Abstract— Accurate medical image segmentation is of great significance for computer-aided diagnosis. Although methods based on convolutional neural networks (CNNs) have achieved good results, they are weak at modeling the long-range dependencies that segmentation tasks need in order to build global context. Transformers can establish long-range dependencies among pixels through self-attention, providing a complement to local convolution. In addition, multi-scale feature fusion and feature selection are crucial for medical image segmentation tasks, but they are ignored by Transformers. However, it is challenging to directly apply self-attention to CNNs due to the quadratic computational complexity on high-resolution feature maps. Therefore, to integrate the merits of CNNs, multi-scale channel attention and Transformers, we propose an efficient hierarchical hybrid vision Transformer (H2Former) for medical image segmentation. With these merits, the model can be data-efficient in the limited medical data regime. The experimental results show that our approach exceeds previous Transformer, CNN and hybrid methods on three 2D and two 3D medical image segmentation tasks. Moreover, it remains computationally efficient in terms of model parameters, FLOPs and inference time. For example, H2Former outperforms TransUNet by 2.29% in IoU score on the KVASIR-SEG dataset with 30.77% of its parameters and 59.23% of its FLOPs.

Index Terms— Hierarchical hybrid transformer, medical image segmentation, multi-scale channel attention, self-attention.

Manuscript received 23 February 2023; accepted 29 March 2023. Date of publication 5 April 2023; date of current version 31 August 2023. This work was supported in part by the National Natural Science Foundation under Grant 62272248, in part by the CAAI-Huawei MindSpore Open Fund under Grant CAAIXSJLJJ-2021-025A, in part by AISG Tech Challenge Funding under Grant AISG2-TC-2021-003, and in part by A*STAR CDF under Grant C222812010. (Corresponding author: Tao Li.)
Along He, Kai Wang, and Chengkun Du are with the Tianjin Key Laboratory of Network and Data Security Technology, College of Computer Science, Nankai University, Tianjin 300350, China (e-mail: [email protected]; [email protected]; [email protected]).
Tao Li is with the College of Computer Science, Nankai University, Tianjin 300350, China, and also with the Xingchuang Haihe Laboratory, Tianjin 300459, China (e-mail: [email protected]).
Shuang Xia is with the Radiology Department, Tianjin First Central Hospital, Nankai, Tianjin 300350, China (e-mail: [email protected]).
Huazhu Fu is with the Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore 138632 (e-mail: [email protected]).
Digital Object Identifier 10.1109/TMI.2023.3264513

I. INTRODUCTION

PRECISE lesion segmentation from medical images has been one of the fundamental yet challenging problems in medical image analysis, and it can benefit disease diagnosis and treatment planning. However, relying on experts for diagnosis is time-consuming and easily affected by clinical experience. Therefore, automated medical image segmentation is a valuable auxiliary tool for medical experts.

Recently, convolutional neural networks (CNNs) have been the mainstream in computer vision [1], [2], [3], as well as in the field of medical image analysis [4], [5], [6], [7], [8]. Most segmentation methods are based on UNet [7] or its variants [8], [9], [10]. However, due to the intrinsic locality of convolution, CNNs fail to model long-range dependencies, leading to sub-optimal results, and they have the following two shortcomings. Firstly, small convolutional kernels attend to a local region and force the network to focus on local feature patterns rather than global features. Medical image segmentation often requires long-range global information for reliable results, since the shape and size of lesions vary greatly. Secondly, after training, the convolutional kernels are static and fixed, and the well-trained parameters cannot adapt to the input image contents. Therefore, CNNs lack flexibility for inputs with different characteristics.

Meanwhile, the Transformer [11], built on self-attention, has achieved remarkable success in natural language processing. It is well known that visual attention plays an important role in computer vision [12], [13]; therefore, a natural idea is to introduce self-attention into computer vision. The Vision Transformer (ViT) [14] is the first attempt to adopt the Transformer encoder for image classification, achieving competitive performance with its CNN counterparts. Because of this great success, Transformers have also become popular in other vision tasks, such as object detection [15] and semantic segmentation [16], achieving state-of-the-art performance. With their long-range information interaction and dynamic feature encoding ability, Transformers have been applied to medical image segmentation. TransUNet [17] is the first attempt to employ a Transformer for medical image segmentation, using a hybrid structure of CNNs and Transformers. Then, pure Transformer encoder-decoder structures named Swin-UNet [18] and nnFormer [19] (for volumetric medical image segmentation) were proposed, showing strong performance.

Although long-range dependencies can be modeled well by the Transformer, it still has drawbacks that limit its performance. Firstly, spatial information is ignored, as it serializes images into 1D tokens, and it is weak in local feature learning, which is crucial for 2D images. Although this problem can be relieved by position encoding, the position encoding needs to adapt to various input resolutions through interpolation, which affects the performance. The second drawback of
few years. UNet [7] is seminal and widely used for medical image segmentation and has achieved state-of-the-art results on various medical image segmentation tasks [4], [5], [6]. Then, many variants of UNet were developed, such as Dense-UNet [28], Attention UNet [9], UNet++ [8], nnUNet [10] and UNet3+ [29]. Inspired by these fundamental architectures, some methods are tailored for specific tasks, such as optic disc and optic cup segmentation from fundus images [5] and COVID-19 lung infection segmentation [30], etc. All in all, CNNs have made great progress in medical image segmentation.

B. Vision Transformers for Medical Image Segmentation

Recently, Transformer-based methods have shown promising performance on vision tasks [16], [24], [25], [31], [32], [33], and they have been viewed as an alternative to their CNN counterparts. ViT [11] is the first work to apply pure Transformers to image classification; it takes the input image as a sequence of image patches and then adopts Transformer blocks to encode image features. In order to train ViT more effectively, DeiT [34] proposed several efficient training strategies that allow ViT to converge well with no external data. To mitigate the quadratic computational overhead, a window-based Transformer named Swin Transformer [31] was proposed to perform self-attention within local windows for computer vision tasks, reducing computation cost and achieving good performance. In order to introduce desirable properties of CNNs to Transformers, some works learn from the CNN design philosophy [23], [35] and achieve strong performance with efficiency.

Due to the promising performance of Transformers, some works have been proposed for both 2D and 3D medical image segmentation [36], [37], [38]; nnFormer [19] is a strong 3D volumetric medical image segmentation method with local and global volume-based self-attention mechanisms to learn volume representations. TransUNet [17] is the first to employ a hybrid CNN-Transformer architecture for medical image segmentation. It leverages the local features from CNNs and the global context features from Transformers in a cascade manner, which is a strong alternative to CNNs. Inspired by the Swin Transformer [31], Swin-UNet [18] was proposed with a pure Transformer structure. However, it ignores local spatial information, which is extremely essential for segmentation tasks. In order to address the dilemma that the Transformer requires large amounts of data, a hybrid Transformer architecture (UTNet) [39] was proposed that integrates self-attention into a CNN to enhance medical image segmentation. The experimental results show that such integration can significantly improve the feature representation ability of the model.

C. Analysis of Previous Work

Although the above methods have achieved good results, they still ignored the integration of multi-scale channel attention and a hierarchical hybrid manner. Therefore, the problem is how to aggregate the strengths of CNNs, multi-scale channel attention and Transformers to build a unified block with reasonable parameters and computational complexity for medical image segmentation. Firstly, CNNs can extract local spatial information well, but they still have an inevitably limited receptive field and lack the ability to model long-range cues. Secondly, for Transformers, although long-range dependencies can be modeled well, they still ignore local detailed information and structural priors to some extent, which are essential for segmentation tasks. Beyond that, pure Transformer models ignore channel-wise attention, and global self-attention like ViT [11] requires large-scale datasets and high computational complexity, making it difficult to train Transformers efficiently for medical segmentation applications. Thirdly, hybrid Transformers only consider token-wise self-attention to build the relationship among tokens; they do not fully consider channel-wise interaction, and they conduct feature learning at a single scale. There are three key differences between the proposed H2Former and previous works, as shown in Fig. 2: (1) Multi-scale feature fusion is crucial to feature representation, and we want to perform multi-scale feature fusion in the Transformer in a simple and efficient way. Although previous works like CrossFormer consider multi-scale feature fusion within a stage, they only concatenate these multi-scale features without feature selection, resulting in sub-optimal solutions (see Table VI, E.6 and E.7) because of the conflict between features at different scales. Therefore, how to perform multi-scale feature fusion while considering feature selection becomes very important. For this reason, we propose MSCA to conduct dynamic selection of multi-scale features and model the relationships between channel dimensions, which complements token-wise self-attention. (2) The labeled training data is limited in the medical domain, which means that we should fully exploit the features from the backbone. Our hybrid block can model local and long-range dependencies at the same time, and multi-scale features with feature selection are also considered to perform more effective multi-scale feature fusion. These merits make the proposed method superior to other hybrid Transformer-CNN models using multi-scale feature fusion. Other hybrid Transformer-CNN models using multi-scale feature fusion only consider the inherent multi-scale features from multiple stages and fuse them together without feature selection. In addition to this implicit multi-scale feature, we also explicitly model the multi-scale features inside each stage, forming a nested multi-scale feature representation, which helps the network make full use of features of various granularity. (3) Previous models only consider either the performance or the efficiency of the model; we consider both at the same time, and thus our model can achieve excellent performance while maintaining a high inference speed (see Table VII and Fig. 1).

III. METHODOLOGY

In this section, we first present the overview of our proposed method. Then, the details of the proposed hierarchical hybrid Transformer block are given.

A. Overview and Preliminary

The overall pipeline of our proposed method is illustrated in Fig. 3. Following the design philosophy of [7], [31], the
Fig. 3. (a) Overall architecture of the proposed H2Former. The encoder consists of four stages, and each stage contains a hybrid Transformer block. (b) The structure of the proposed hybrid Transformer block. It extracts local features and multi-scale channel attention features through the convolutional block and MSCA, and then aggregates them together. Rich features are then extracted by the Transformer block, combining the advantages of CNNs, multi-scale channel-wise attention and the token-wise features of Transformers simultaneously. 'Img2Seq' and 'Seq2Img' denote the reshape operations needed to meet the shape requirements.
the window size for window self-attention is 16 × 16. For each stage, the number of Transformer blocks is {2, 2, 2, 2}, and the number of heads in the four stages is set to {2, 4, 8, 16}. The hybrid Transformer block can model local spatial relationships and introduce the inductive bias of CNNs, and this property makes the network data-efficient, which is suitable for medical image segmentation in the low-data regime. In addition, it can model the multi-scale channel-wise and token-wise long-range dependencies at the same time, which integrates their advantages into a unified block. There are two main advantages of our hybrid Transformer block over previous hybrid structures: 1) our hybrid Transformer block introduces hierarchical spatial information, global feature encoding and the multi-scale channel attention mechanism into a unified block to capture multi-granularity channel-wise and token-wise features, which can better model lesion features with different sizes and shapes in medical images; 2) it achieves better segmentation results on both 2D and 3D medical image segmentation tasks, while maintaining reasonable parameters and FLOPs. By combining the three kinds of features, our method can learn better feature representations.
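As a reading aid, the following PyTorch-style sketch mirrors the data flow of the hybrid Transformer block described in Fig. 3(b) together with the stage hyper-parameters listed above. The module interfaces, the additive fusion of the two branches and the names (HybridTransformerBlockSketch, STAGE_CFG) are illustrative assumptions of ours, not the released implementation.

```python
import torch
import torch.nn as nn

class HybridTransformerBlockSketch(nn.Module):
    """Illustrative sketch of one encoder stage of H2Former (Fig. 3b).

    Assumptions (not from the official code): local and MSCA features are
    fused by addition, and 'Img2Seq'/'Seq2Img' are plain flatten/reshape ops.
    """
    def __init__(self, conv_block, msca, transformer_block):
        super().__init__()
        self.conv_block = conv_block          # local CNN features (inductive bias)
        self.msca = msca                      # multi-scale channel attention
        self.transformer = transformer_block  # window self-attention (16x16 windows)

    def forward(self, x):                     # x: (B, C, H, W)
        local_feat = self.conv_block(x)       # local spatial relationships
        channel_feat = self.msca(x)           # multi-scale channel-wise attention
        fused = local_feat + channel_feat     # aggregate the two branches (assumed)
        b, c, h, w = fused.shape
        tokens = fused.flatten(2).transpose(1, 2)          # 'Img2Seq': (B, H*W, C)
        tokens = self.transformer(tokens)                  # token-wise long-range modeling
        return tokens.transpose(1, 2).reshape(b, c, h, w)  # 'Seq2Img'

# Stage configuration stated in the text: four stages, {2, 2, 2, 2} Transformer
# blocks per stage, {2, 4, 8, 16} attention heads, and 16x16 attention windows.
STAGE_CFG = {"blocks": (2, 2, 2, 2), "heads": (2, 4, 8, 16), "window": 16}
```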
C. Patch Expanding Path

We observe that patch expanding with a pure Transformer structure, as in Swin-UNet [18] and MISSFormer [40], leads to two problems. First, it introduces redundant computational cost during training and testing. Through the experimental results in Table VI, we find that a simple convolutional decoder can obtain very good segmentation results, because the encoder has a strong ability to extract rich features for the decoder. The second problem is that redundant features cause difficulties in optimization, confusing the model and leading to sub-optimal solutions. Therefore, we adopt simple convolutions for patch expanding for the sake of efficiency.

In the patch expanding path, we adopt convolutional layers for decoding the extracted hybrid multi-scale features {F1, F2, F3, F4} from the encoder path. We upsample the token features with a patch expanding (PE) layer, which is implemented as a transposed convolution layer. After patch expanding, we concatenate the features from the encoder path with the same resolution, like the skip connections in UNet:

Yi = Convi(Concat(PEi(Yi−1), F4−i)), (i = 1, 2, 3),   (10)

where Yi denotes the output features of each decoder stage, Y0 = F4, Conv denotes a convolutional block consisting of a 3 × 3 convolutional layer, a Batch Normalization [43] layer and a ReLU activation layer, and Concat denotes channel-wise concatenation. In the final prediction head, Y3 is first resized to the same resolution as the input, and then we apply a 1 × 1 convolution layer followed by the softmax activation function to map it to K channels, corresponding to (K−1) lesion maps and one background map.
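A minimal PyTorch sketch of one decoder stage and the final prediction head, written directly from Eq. (10) and the description above. The channel widths, the stride-2 ConvTranspose2d used for the PE layer and the bilinear resizing of Y3 are our assumptions rather than details taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One step of Eq. (10): Y_i = Conv_i(Concat(PE_i(Y_{i-1}), F_{4-i}))."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # Patch expanding (PE): transposed convolution, assumed stride-2 upsampling.
        self.pe = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        # Conv: 3x3 convolution + Batch Normalization + ReLU, as stated in the text.
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, y_prev, f_skip):
        y_up = self.pe(y_prev)                 # upsample decoder features
        y = torch.cat([y_up, f_skip], dim=1)   # channel-wise Concat with encoder skip
        return self.conv(y)

class PredictionHead(nn.Module):
    """Resize Y_3 to the input resolution, then 1x1 conv + softmax over K classes."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    def forward(self, y3, out_size):
        y3 = F.interpolate(y3, size=out_size, mode="bilinear", align_corners=False)
        return torch.softmax(self.proj(y3), dim=1)  # (K-1) lesion maps + background
```

Three such stages, applied for i = 1, 2, 3 with Y0 = F4, realize the whole patch expanding path.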
For training, the supervised loss is defined as:

L = Lce(y, p) + λLdice(y, p),   (11)

where Lce is the cross-entropy loss, Ldice is the Dice loss, and y and p are the ground truth and the prediction results, respectively. Due to the class imbalance in fundus lesion segmentation, the combination of the two loss functions can effectively alleviate this problem, and the default setting of λ is 1.
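Eq. (11) can be written compactly as below. This is a generic cross-entropy plus soft-Dice formulation with the stated default λ = 1; the smoothing constant eps and the per-class averaging are our assumptions.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(probs, target_onehot, eps=1e-6):
    """Soft Dice loss averaged over classes; probs and target are (B, K, H, W)."""
    dims = (0, 2, 3)
    intersection = torch.sum(probs * target_onehot, dims)
    cardinality = torch.sum(probs + target_onehot, dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()

def segmentation_loss(logits, target, lam=1.0):
    """Eq. (11): L = L_ce(y, p) + lambda * L_dice(y, p), with lambda = 1 by default."""
    ce = F.cross_entropy(logits, target)  # target: (B, H, W) class indices
    probs = torch.softmax(logits, dim=1)
    num_classes = logits.shape[1]
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    return ce + lam * soft_dice_loss(probs, onehot)
```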
IV. EXPERIMENTS

In this section, we evaluate the method on three 2D and two 3D medical image segmentation tasks. We first introduce the benchmark datasets, implementation details and evaluation metrics. Then we compare our method with state-of-the-art methods and perform ablation studies to validate the design choices of the proposed architecture.

A. Datasets

1) IDRiD Dataset [44]: The Indian Diabetic Retinopathy Image Dataset (IDRiD) is a dataset for the segmentation and grading of retinal images, hosted at the International Symposium on Biomedical Imaging (ISBI) 2018. This fundus lesion segmentation dataset consists of 81 images with a resolution of 2848 × 4288, and pixel-wise label masks of hard exudates (EX), hemorrhages (HE), microaneurysms (MA) and soft exudates (SE) are provided. Because fundus lesions are small, there is a serious imbalance between background and foreground lesions, which is a great challenge for pixel-level segmentation tasks. There is an official split of 54 images for training and 27 images for testing.

2) Kvasir-SEG Dataset [45]: This is a dataset for gastrointestinal polyp segmentation, which contains 1000 polyp images and their corresponding label masks. Since there is no official train and test split, we perform 5-fold cross-validation and report the average results.

3) Skin Lesion Dataset [46]: This is a dermoscopic image analysis benchmark challenge for automated diagnosis of skin cancer. The dermoscopy image dataset we used is the ISIC 2016 skin lesion segmentation challenge dataset. It contains 900 training dermoscopic images and 379 test images, and the binary ground-truth masks are labeled by experts.

4) ACDC Dataset [47]: The Automatic Cardiac Diagnosis Challenge (ACDC) is a dataset for automated cardiac diagnosis, which contains 100 patients. For all these data, the corresponding manual annotations for the cavity of the right ventricle (RV), the myocardium (MYO) of the left ventricle and the cavity of the left ventricle (LV) are given by clinical experts. The dataset is split into 70 training samples, 10 validation samples and 20 test samples, the same as nnFormer [19].

5) Synapse Dataset [48]: This dataset consists of 30 CT scans for multi-organ segmentation, covering 13 organs: spleen, right kidney, left kidney, gallbladder, esophagus, liver, stomach, aorta, inferior vena cava, portal vein and splenic vein, pancreas, right adrenal gland and left adrenal gland. The dataset is split into 20 training samples and 10 test samples, and the evaluation setting is the same as [19].

B. Implementation Details and Evaluation Metrics

1) Implementation Details: In the experiments, to reduce overfitting and improve the generalization ability of the model,
fundus lesion segmentation task, MAE and Acc metrics are not suitable for measuring the segmentation performance, and thus we adopt Dice and IoU. We report the average performance over the four lesion types in Table III; extensive experiments show that our method outperforms the previous methods by a large margin, revealing its great potential for tiny lesion segmentation tasks.
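For reference, the Dice and IoU scores used in the evaluation can be computed from binary masks as follows; these are the standard definitions, with a small epsilon added by us to keep empty masks well defined.

```python
import numpy as np

def dice_score(pred, gt, eps=1e-7):
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def iou_score(pred, gt, eps=1e-7):
    """IoU = |P ∩ G| / |P ∪ G| for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)
```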
Following TransUNet [17], our method is trained with 2D slices and tested in a slice-by-slice fashion for all the 3D volumes; the predicted 2D slices are then stacked together to reconstruct the 3D prediction results for evaluation, so that it can be compared fairly with recent competing methods. For the evaluation metric, we adopt the Dice score, the same as nnUNet [10] and nnFormer [19].
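A sketch of this slice-by-slice evaluation protocol is given below. The volume layout (slices along the first axis), the single-channel model input and the argmax decoding are our assumptions, not details fixed by the paper.

```python
import torch

@torch.no_grad()
def predict_volume_slice_by_slice(model, volume):
    """Run a 2D model over a 3D volume and stack the per-slice predictions.

    volume: float tensor of shape (D, H, W); returns an int64 label volume (D, H, W).
    Assumes the model takes (B, 1, H, W) inputs and outputs (B, K, H, W) scores.
    """
    model.eval()
    pred_slices = []
    for d in range(volume.shape[0]):
        x = volume[d].unsqueeze(0).unsqueeze(0)      # (1, 1, H, W)
        logits = model(x)                            # (1, K, H, W)
        pred_slices.append(logits.argmax(dim=1)[0])  # per-slice label map (H, W)
    return torch.stack(pred_slices, dim=0)           # reconstructed 3D prediction
```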
4) Results on ACDC for Automated Cardiac Segmentation: As shown in Table IV, the best method is H2Former, and it is better than nnUNet and nnFormer in Dice score despite using naive training strategies. nnUNet and nnFormer are specifically designed for 3D medical image segmentation, and the training strategy is also specially optimized for nnUNet, whereas our method adopts a naive training recipe without pre-processing, post-processing or advanced strategies like deep supervision, which demonstrates the effectiveness of our approach.

TABLE V. Quantitative comparisons to previous state-of-the-art methods on the multi-organ segmentation dataset.

5) Results on Multi-Organ Segmentation: This dataset has 13 classes, different from the dataset used in nnFormer [19] and nnUNet [10], which contains only 8 classes; we therefore re-implemented these methods with their released code. We report the average performance over the 13 organs, and the 95% Hausdorff Distance (HD95) and Dice score are shown in Table V. Compared with current SOTA methods like nnUNet and nnFormer, our method achieves the best results, which shows that our proposed hybrid Transformer block is still effective for 3D medical image segmentation.
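The 95% Hausdorff Distance (HD95) reported in Table V can be computed along the lines of the sketch below. Extracting surfaces by binary erosion and taking the 95th percentile of the directed surface distances is a common convention, but the exact variant (and any voxel-spacing handling) is our assumption.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def hd95(pred, gt):
    """95% Hausdorff Distance between two binary masks (2D or 3D), in voxels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    # Surface voxels: foreground minus its erosion.
    pred_surf = np.argwhere(pred & ~binary_erosion(pred))
    gt_surf = np.argwhere(gt & ~binary_erosion(gt))
    if len(pred_surf) == 0 or len(gt_surf) == 0:
        return np.inf                                # undefined for empty masks
    d = cdist(pred_surf, gt_surf)                    # pairwise Euclidean distances
    d_pred_to_gt = d.min(axis=1)                     # directed distances P -> G
    d_gt_to_pred = d.min(axis=0)                     # directed distances G -> P
    return max(np.percentile(d_pred_to_gt, 95), np.percentile(d_gt_to_pred, 95))
```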
The results of H2Former on the three 2D and two 3D medical image segmentation datasets fully demonstrate its excellent ability to handle different modalities, with both large and tiny organs or lesions, at reasonable model complexity. Comprehensive experimental results show that the proposed hierarchical hybrid vision Transformer achieves superior quantitative performance against other SOTA methods on different medical image segmentation tasks.

6) Visualization Results: For qualitative analysis, we show the 2D visual comparisons of our method and other models in Fig. 5 and Fig. 6. It can be seen that the predictions of H2Former capture the features really well and generate much better segmentation results.
Fig. 5. The visual comparison results of the skin and polyp segmentation tasks. The images in the first two rows are results of skin lesion segmentation, and the last two rows are results of polyp segmentation. It can be seen that skin images obscured by hair and polyp images with complex surrounding tissues are handled well.
Fig. 6. The visual comparison results of fundus lesion segmentation. As indicated by the white arrows, our model is able to segment small lesions in the fundus images under low contrast, while other methods could not segment them well, showing that our method can robustly segment not only large lesions but also small ones. Red, green, blue and pink markings denote hard exudates, hemorrhages, microaneurysms and soft exudates, respectively.
For more challenging cases, such as the skin image in the first row obscured by hair and the polyp image in the third row with complex surrounding mucosa, our model handles these challenging cases well and can produce clear boundaries. The segmentation performance of other models decreases due to the challenging occlusion and complex surrounding mucosa. As shown in Fig. 6, for small lesion segmentation from fundus images, H2Former is still capable of segmenting the tiny lesions indicated by the white arrows. We can observe that the small lesions highlighted by the white arrows are misclassified by other
TABLE VI. Ablation studies on the Kvasir-SEG dataset. 'Trans' denotes Transformer, 'Casc' denotes cascaded manner and 'Hier' denotes hierarchical manner.
is, the two kinds of features will not deviate from each other. However, the cascaded manner may lead to deviation between shallow short-range features and deep long-range features, resulting in performance degradation.
4) Convolution Stem: To verify the necessity of the convolution stem, reshape and linear embedding are used instead of the convolution stem for patch partition. From the ablation results of E.9 and E.7, it can be found that the convolution stem performs better than reshape and linear embedding. The reason is that the convolution stem can provide strong local spatial information, while linear embedding does not preserve local position information well; thus the convolution stem performs better. Therefore, we use a convolution stem for patch partition.
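The two patch-partition options compared in this ablation can be sketched as follows. The channel widths and the stem depth are illustrative assumptions; the point is that the convolutional stem uses overlapping windows and therefore keeps local position information, whereas reshape plus linear embedding is equivalent to a non-overlapping patch-size convolution.

```python
import torch.nn as nn

def conv_stem(in_ch=3, embed_dim=64):
    """Convolutional stem: overlapping 3x3 convolutions that downsample 4x
    while preserving local spatial structure (illustrative channel widths)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, embed_dim // 2, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(embed_dim // 2),
        nn.ReLU(inplace=True),
        nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(embed_dim),
        nn.ReLU(inplace=True),
    )

def linear_patch_embedding(in_ch=3, embed_dim=64, patch_size=4):
    """Reshape + linear embedding baseline: equivalent to a non-overlapping
    patch_size x patch_size convolution, so patches do not share local context."""
    return nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
```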
5) Final Structure: Further, by comparing E.7 (H2Former) against the other settings, we observe that it achieves the best results among them and outperforms the baseline (E.1) by 3.39% in Dice and 3.56% in IoU, which means that it is necessary to consider long-range dependencies, local spatial information and multi-scale tokenization features within a unified block for medical image segmentation. As shown in E.5, if we replace the convolutional decoder with a Transformer decoder, the performance deteriorates and the computational cost increases significantly. Compared with CNNs, the Transformer lacks the ability to perceive 2D spatial information, and thus it fails to recover the spatial information well when the decoder partially recovers the original spatial positions, while a CNN has a strong inductive bias and a strong perception of position information, and it can better recover the position information and obtain accurate segmentation masks. The Transformer decoder is also more complex than the convolutional decoder and requires more training data to fit. However, the labeled medical images are limited and not enough to train such an over-complex network structure, and thus the simple convolutional decoder works better. Therefore, the light-weight convolutional decoder is more efficient than a complex Transformer decoder, and it can achieve satisfactory results with reasonable complexity.

TABLE VII. Analysis of parameters, FLOPs and inference time for various models with input size of 512 × 512. The backbones are based on ResNet-34 [1].

Fig. 9. A failure case from the IDRiD dataset. As we can see from the segmentation results, there are confusions among different lesion types and missed small lesions. It is very challenging to segment lesions in such a complex context, because four types of lesions appear in adjacent regions and the contrast of this region is low.

higher than ours. Further, their segmentation performance is significantly worse than our method; see Table I, Table II and Table III. Therefore, our hybrid structure can ensure good results while keeping a reasonable computational cost, and thus it is a more effective structure. From the results in Table VII, our model achieves the best trade-off between performance and model complexity in terms of the number of parameters, FLOPs and the inference time.
powerful self-attention models could improve the segmentation results; more advanced training pipelines like self-supervised learning and semi-supervised learning may enhance the feature diversity and improve the generalization ability of the model.

V. CONCLUSION AND FUTURE WORK

In this work, to leverage the benefits of CNNs and the multi-scale channel-wise and token-wise features of Transformers, we proposed a hierarchical hybrid vision Transformer for medical image segmentation, dubbed H2Former, which integrates the inductive bias of CNNs, multi-scale channel attention token features and the Transformer to combine their merits in a unified block. As an alternative framework to CNNs, H2Former increases the model's capability for modeling multi-scale long-range dependencies and local spatial information. It is superior to previous CNN, Transformer and hybrid methods, showing better results on several 2D and 3D medical image segmentation tasks without any advanced training strategies. It also demonstrates that the proposed method achieves a good trade-off between model complexity and segmentation performance.

In the future, we plan to validate the capabilities of H2Former on other tasks such as medical image classification, self-supervised learning and semi-supervised learning.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770-778.
[2] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4700-4708.
[3] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105-6114.
[4] Z. Gu et al., "CE-Net: Context encoder network for 2D medical image segmentation," IEEE Trans. Med. Imag., vol. 38, no. 10, pp. 2281-2292, Mar. 2019.
[5] H. Fu, J. Cheng, Y. Xu, D. W. K. Wong, J. Liu, and X. Cao, "Joint optic disc and cup segmentation based on multi-label deep network and polar transformation," IEEE Trans. Med. Imag., vol. 37, no. 7, pp. 1597-1605, Jul. 2018.
[6] T. Li et al., "Applications of deep learning in fundus images: A review," Med. Image Anal., vol. 69, Apr. 2021, Art. no. 101971.
[7] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2015, pp. 234-241.
[8] Z. Zhou et al., "UNet++: Redesigning skip connections to exploit multiscale features in image segmentation," IEEE Trans. Med. Imag., vol. 39, no. 6, pp. 1856-1867, Dec. 2019.
[9] O. Oktay et al., "Attention U-Net: Learning where to look for the pancreas," 2018, arXiv:1804.03999.
[10] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein, "nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation," Nature Methods, vol. 18, no. 2, pp. 203-211, Feb. 2021.
[11] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998-6008.
[12] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3-19.
[13] J. Fu et al., "Dual attention network for scene segmentation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3146-3154.
[14] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[15] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 213-229.
[16] S. Zheng et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6881-6890.
[17] J. Chen et al., "TransUNet: Transformers make strong encoders for medical image segmentation," 2021, arXiv:2102.04306.
[18] H. Cao et al., "Swin-Unet: Unet-like pure transformer for medical image segmentation," 2021, arXiv:2105.05537.
[19] H.-Y. Zhou, J. Guo, Y. Zhang, L. Yu, L. Wang, and Y. Yu, "nnFormer: Interleaved transformer for volumetric segmentation," 2021, arXiv:2109.03201.
[20] S. H. Lee, S. Lee, and B. C. Song, "Vision transformer for small-size datasets," 2021, arXiv:2112.13492.
[21] Y. Xu, Q. Zhang, J. Zhang, and D. Tao, "ViTAE: Vision transformer advanced by exploring intrinsic inductive bias," in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 1-12.
[22] H. Wu et al., "CvT: Introducing convolutions to vision transformers," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 22-31.
[23] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, "Bottleneck transformers for visual recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 16519-16529.
[24] Q. Zhang and Y.-B. Yang, "ResT: An efficient transformer for visual recognition," in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 15475-15485.
[25] W. Wang et al., "CrossFormer: A versatile vision transformer hinging on cross-scale attention," 2021, arXiv:2108.00154.
[26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2881-2890.
[27] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with Atrous separable convolution for semantic image segmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 801-818.
[28] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, "H-DenseUNet: Hybrid densely connected UNet for liver and tumor segmentation from CT volumes," IEEE Trans. Med. Imag., vol. 37, no. 12, pp. 2663-2674, Dec. 2017.
[29] H. Huang et al., "UNet 3+: A full-scale connected UNet for medical image segmentation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 1055-1059.
[30] D. Fan et al., "Inf-Net: Automatic COVID-19 lung infection segmentation from CT images," IEEE Trans. Med. Imag., vol. 39, no. 8, pp. 2626-2637, Aug. 2020.
[31] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012-10022.
[32] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, "Segmenter: Transformer for semantic segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 7262-7272.
[33] S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, "TransReID: Transformer-based object re-identification," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 15013-15022.
[34] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, "Training data-efficient image transformers & distillation through attention," in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347-10357.
[35] Z. Dai, H. Liu, Q. Le, and M. Tan, "CoAtNet: Marrying convolution and attention for all data sizes," in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 1-12.
[36] Y. Zhang, H. Liu, and Q. Hu, "TransFuse: Fusing transformers and CNNs for medical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2021, pp. 14-24.
[37] Y. Xie, J. Zhang, C. Shen, and Y. Xia, "CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2021, pp. 171-180.
[38] X. Yan, H. Tang, S. Sun, H. Ma, D. Kong, and X. Xie, "AFTer-UNet: Axial fusion transformer UNet for medical image segmentation," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2022, pp. 3971-3981.
[39] Y. Gao, M. Zhou, and D. N. Metaxas, "UTNet: A hybrid transformer architecture for medical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2021, pp. 61-71.
[40] X. Huang, Z. Deng, D. Li, and X. Yuan, "MISSFormer: An effective medical image segmentation transformer," 2021, arXiv:2109.07162.
[41] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, "ECA-Net: Efficient channel attention for deep convolutional neural networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11531-11539.
[42] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450.
[43] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., 2015, pp. 448-456.
[44] P. Porwal et al., "Indian diabetic retinopathy image dataset (IDRiD): A database for diabetic retinopathy screening research," Data, vol. 3, no. 3, p. 25, 2018.
[45] D. Jha et al., "Kvasir-SEG: A segmented polyp dataset," in Proc. Int. Conf. Multimedia Modeling. Cham, Switzerland: Springer, 2020, pp. 451-462.
[46] D. Gutman et al., "Skin lesion analysis toward melanoma detection: A challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC)," 2016, arXiv:1605.01397.
[47] O. Bernard et al., "Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?" IEEE Trans. Med. Imag., vol. 37, no. 11, pp. 2514-2525, May 2018.
[48] B. Landman, Z. Xu, J. E. Igelsias, M. Styner, T. R. Langerak, and A. Klein, "MICCAI multi-atlas labeling beyond the cranial vault-workshop and challenge," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., 2015, p. 12.
[49] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. Int. Conf. Learn. Represent., 2018, pp. 1-19.
[50] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248-255.
[51] Y. Fang, C. Chen, Y. Yuan, and K.-Y. Tong, "Selective feature aggregation network with area-boundary constraints for polyp segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2019, pp. 302-310.
[52] D.-P. Fan et al., "PraNet: Parallel reverse attention network for polyp segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2020, pp. 263-273.
[53] R. Zhang, G. Li, Z. Li, S. Cui, D. Qian, and Y. Yu, "Adaptive context selection for polyp segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. Cham, Switzerland: Springer, 2020, pp. 253-262.
[54] A. Hatamizadeh et al., "UNETR: Transformers for 3D medical image segmentation," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2022, pp. 1748-1758.
[55] G. Xu, X. Wu, X. Zhang, and X. He, "LeViT-UNet: Make faster encoders with transformer for medical image segmentation," 2021, arXiv:2107.08623.