MISSFormer: An Effective Transformer for 2D Medical Image Segmentation
Accepted for publication in IEEE Transactions on Medical Imaging. DOI: 10.1109/TMI.2022.3230943
Abstract— Transformer-based methods have recently become popular in vision tasks because of their ability to model global dependencies alone. However, this limits the performance of networks due to the lack of modeling of local context and of global–local correlations across multi-scale features. In this paper, we present MISSFormer, a Medical Image Segmentation tranSFormer. MISSFormer is a hierarchical encoder-decoder network with two appealing designs: 1) the feed-forward network in the transformer block of the U-shaped encoder-decoder structure is redesigned as ReMix-FFN, which explores global dependencies and local context for better feature discrimination by re-integrating the local context and global dependencies; 2) a ReMixed Transformer Context Bridge is proposed to extract the correlations of global dependencies and local context in the multi-scale features generated by our hierarchical transformer encoder. MISSFormer shows a solid capacity to capture more discriminative dependencies and context in medical image segmentation. Experiments on multi-organ, cardiac and retinal vessel segmentation tasks demonstrate the superiority, effectiveness and robustness of MISSFormer. Specifically, the experimental results of MISSFormer trained from scratch even outperform state-of-the-art methods pre-trained on ImageNet, and the core designs can be generalized to other visual segmentation tasks. The code has been released on GitHub: https://github.com/ZhifangDeng/MISSFormer.

Index Terms— Context bridge, ReMix-FFN, global dependencies, local context, segmentation.

I. INTRODUCTION

… a great contribution to accurate medical image segmentation results. Since fully convolutional networks (FCNs) [1] opened a door for semantic segmentation, the U-shaped networks [2], [3], as one of their variants, achieved promising performance in medical image segmentation through the improvement of skip connections. Following this elegant U-shaped architecture, the variants of U-Net [4]–[6] achieved good performance and impressive results. However, they suffer from a limitation in modeling global dependencies because of the locality of the convolution operation [7], [8], which fails to meet the goal of precise medical image analysis, even though the above methods brought positive performance and prevalence. To overcome this limitation, one type of method uses dilated convolution [9], [10] and pyramid pooling [11] to enlarge the receptive field as much as possible, and another type [12], [13] employs a few self-attention layers on high-level semantic feature maps, owing to the quadratic relationship between self-attention computational complexity and feature map size. However, these methods remain insufficient to capture abundant global dependencies.

Recently, the success of transformers [14] in capturing global dependencies has made it possible to solve the above problems. In particular, research on vision transformers [15]–[21] is in full swing and has achieved promising performance in vision tasks, encouraged by the great success of the transformer in natural language processing (NLP). Corresponding to the transformer in NLP, the vision transformer (ViT) [16] fed the image into a standard transformer with positional embeddings by …
… as their feed-forward neural network (FFN), which ignores local context modeling.

In this paper, MISSFormer, an effective and powerful Medical Image Segmentation tranSFormer, is proposed to integrate global dependencies and local context to produce accurate medical image segmentation results. MISSFormer is based on the U-shaped architecture, whose redesigned transformer block enhances the feature representations. MISSFormer captures global dependencies by applying global attention over the feature map, and integrates global information and local context through the redesigned ReMix-FFN layer. Finally, it models the global and local information of full-scale feature maps through the ReMixed Transformer Context Bridge, so as to generate pixel-wise segmentation predictions with the decoder. The main contributions of this paper can be summarized as follows:
• We propose MISSFormer, a position-free and hierarchical U-shaped transformer for medical image segmentation.
• To explore global dependencies and local context for better feature discrimination, we propose ReMix-FFN by re-integrating the local context and global dependencies.
• To improve the responsiveness of relevant features, we devise a ReMixed Transformer Context Bridge that captures the global–local correlations of multi-scale features.
• To demonstrate the effectiveness and robustness of the proposed MISSFormer, we carry out experiments and obtain competitive results on medical image segmentation datasets.

The remainder of this paper is organized as follows. Section II gives a review of medical image segmentation methods and transformers. Section III introduces the proposed MISSFormer in detail. Section IV presents the performance of MISSFormer, including the experimental settings, ablation studies and comparisons. Section V draws the conclusion of this paper.

II. RELATED WORK

A. Medical Image Segmentation

The U-shaped networks [2] have played a cornerstone role in medical image segmentation tasks because of their superior performance and elegant structure. Due to the rapid development of computer vision tasks [23], [24], medical image segmentation drew lessons from their key insights. For example, the ResNet [23] architecture became a general encoder backbone for medical image segmentation networks, and dilated convolution and pyramid pooling were utilized to enlarge the receptive field for lesion and organ segmentation [9], [10]. Besides, various attention mechanisms were effective in promoting segmentation performance: reverse attention [25] was applied to accurate polyp segmentation [26], squeeze-and-excitation attention [27] was integrated to refine channel information for vessel segmentation in retinal images [28], and some works [12], [13] employed the self-attention mechanism to supplement the global dependencies for segmentation tasks. However, these methods still lack sufficient global dependencies for better performance.

B. Vision Transformers

Recently, vision transformers have attracted extensive attention from researchers. A. Dosovitskiy et al. [16] introduced the transformer [14] into visual tasks for the first time and achieved impressive performance because of the transformer's capacity for global dependencies. Vision tasks developed to a new stage inspired by ViT. For example, H. Touvron et al. proposed Data-efficient image Transformers (DeiT) [29] to explore efficient training strategies for ViT. To reduce the computational complexity, Wang et al. [17] and Liu et al. [15] proposed efficient and effective hierarchical vision transformers with spatial-reduction attention (SRA) and a window-based mechanism, respectively. Some methods [30]–[32] studied position embedding and locality in transformers. For other specific tasks, SETR [21] was a semantic segmentation network based on the transformer with ViT as the backbone, Xie et al. [20] introduced a simple and efficient transformer-powered design for semantic segmentation, N. Carion et al. [33] proposed an end-to-end object detection framework with transformers, and Wang et al. [34] built a general U-shaped transformer based on window-based self-attention over non-overlapping image patches for image restoration.

C. Medical Image Segmentation Transformers

Researchers borrowed the transformer for medical image segmentation, inspired by the rapid development of vision transformers. Chen et al. proposed TransUNet [22], which employs some transformer layers on the low-resolution encoder feature maps to capture global dependencies. A. Hatamizadeh et al. [35] applied the transformer to build a powerful encoder for 3D medical image segmentation with a CNN-based decoder. Xie et al. [8] and Wang et al. [36] bridged the CNN-based encoder and decoder with the transformer to improve segmentation performance in the low-resolution stage. Besides these methods, which combine CNN and transformer, Cao et al. [7] proposed Swin-Unet with a pre-trained encoder based on the Swin transformer [15] to demonstrate the application potential of a pure transformer in medical image segmentation. Different from these methods, the proposed MISSFormer explores global dependencies and local context for more discriminative features with a pure transformer trained on the medical image datasets from scratch.

III. METHOD

This section first describes the overall pipeline and the specific structure of MISSFormer, and then shows the details of the improved transformer block with ReMix-FFN, which is the basic unit of MISSFormer. After that, we introduce the proposed ReMixed Transformer Context Bridge, which models the local and global correlations of hierarchical multi-scale information.

A. Overall Pipeline

The proposed MISSFormer is shown in Fig. 1(a): an encoder-decoder architecture with a ReMixed Transformer Context Bridge module appended between encoder and decoder.
[Fig. 1(a): the input image passes through an Overlap Patch Embedding and four encoder stages of Transformer Blocks with ReMix-FFN (x2 each) with Overlap Patch Merging, producing features of size H/4 x W/4 x C, H/8 x W/8 x 2C, H/16 x W/16 x 4C and H/32 x W/32 x 8C; the multi-scale features pass through the ReMixed Transformer Context Bridge and are decoded by Transformer Blocks with ReMix-FFN and Patch Expanding layers, followed by a final Linear Projection to the class map. Fig. 1(b): a Transformer Block composed of Layer Norm, Efficient Self-Attention, Layer Norm and ReMix-FFN.]
Fig. 1. The overall structure of the proposed MISSFormer. (a) The proposed MISSFormer framework. (b) The structure of Transformer Block with
ReMix-FFN.
Specifically, given an input image of size Hin x Win x Cin, MISSFormer first divides it into overlapping patches of size 4x4 pixels to preserve local continuity, using convolutional layers with a 7x7 kernel and a stride of 4 pixels. The overlapping patches are then fed into the encoder to produce multi-scale features. Here, the encoder is hierarchical, and each stage includes transformer blocks with ReMix-FFN and a patch merging layer. The transformer block learns the global dependencies and local context with limited computational complexity. The patch merging layer is applied to generate the downsampled features; the output size of the i-th block is Hin/2^(i+1) x Win/2^(i+1) x 2^(i-1)C1 (i > 0), where C1 is the channel dimension of the first block.

After that, MISSFormer passes the generated multi-scale features through the ReMixed Transformer Context Bridge to capture the local and global correlations of features at different scales. Given the multi-scale features, they are flattened in the spatial dimension and reshaped so that the channel dimension equals C1; for example, a feature map of size Hin/2^(i+1) x Win/2^(i+1) x 2^(i-1)C1 is rearranged into (Hin·Win)/2^(i+3) x C1. We then concatenate them along the flattened spatial dimension and feed them into the ReMixed Transformer Context Bridge of depth d. Last, we split and restore them to their original shape Hin/2^(i+1) x Win/2^(i+1) x 2^(i-1)C1, obtaining the discriminative hierarchical multi-scale features.

For the segmentation prediction, MISSFormer takes the discriminative features and skip connections as inputs to the decoders. Each decoder stage includes transformer blocks with ReMix-FFN and a patch expanding layer [7]. Contrary to the patch merging layer, the patch expanding layer upsamples the adjacent feature maps to twice the original resolution, except that the last one upsamples by four times. Last, the pixel-wise segmentation prediction is output by a linear projection.
Fig. 2. Various explorations of locality in the feed-forward network, from left to right: (a) Residual Block in LocalViT [32], (b) proposed ReMix-FFN, (c) proposed ReMix-FFN with recursive step.
B. Transformer Block

Global dependencies and local context are both effective for accurate medical image segmentation. At present, the transformer and the convolution are good choices for global dependencies and locality, respectively. At the same time, the computational complexity of the original transformer block is quadratic in the feature map resolution, making it unsuitable for high-resolution feature maps. Second, the transformer lacks the ability to extract local context [30]–[32]; although Uformer, SegFormer and PVTv2 tried to overcome this limitation by directly embedding a convolutional layer in the feed-forward network, we argue that this approach limits the discrimination of features, even though it brings some improvement in performance.

To solve the above problems, we propose the Transformer Block with ReMix-FFN. As shown in Fig. 1(b), the Transformer Block is composed of LayerNorm, Efficient Self-Attention and ReMix-FFN.

1) Efficient Self-Attention: Efficient self-attention is a spatial-reduction self-attention [17], which can be applied to high-resolution feature maps. Given a feature map F ∈ R^(H×W×C), where H, W and C are the height, width and channel depth, respectively, the original standard multi-head self-attention makes Q, K, V have the same shape N × C, where … The spatial reduction can be written as:

K_new = Reshape(N/R, C·R)(K) W(C·R, C),    (2)

i.e., it first reshapes K and V to N/R × (C·R), and then a linear projection W is used to restore the channel depth to C. After that, the computational complexity of self-attention reduces to O(N²/R), so it can be applied to high-resolution feature maps. The spatial reduction operation is commonly a convolution or a pooling.

2) Mix-FFN: For consistency and integrity of the article, we briefly review Mix-FFN. Xie et al. [20] embedded a convolution layer between the fully-connected layers in the FFN to capture local information as a supplement to global dependencies, which can be denoted as:

x_out = FC(GELU(Conv_3x3(FC(x_in)))) + x_in.    (3)

Although some improvements have been made, we find that directly embedding the convolution layer cannot assign weights well and sometimes limits the discrimination of features according to the attention heatmaps, as shown in Fig. 4, Fig. 6 and Table II.

3) ReMix-FFN: Different from previous methods, we redesign the structure of Mix-FFN [20] to align features and make discriminative representations. As shown in Fig. 2(b), we add a skip connection before the depth-wise convolution for better feature fusion, and we then apply layer norm after the skip connection for feature redistribution, which can be denoted as:

y_1 = LN(Conv_3x3(FC(x_in)) + FC(x_in)),
x_out = FC(GELU(y_1)) + x_in,    (4)

where x_in is the output of the efficient self-attention and Conv_3x3 is a convolution with a 3 × 3 kernel; we apply depth-wise convolution for efficiency in this paper. We will show that our designs are essential for Mix-FFN in Section IV.B.

Inspired by [37], we extend our design to a general form with the help of layer norm, which facilitates the optimization of the skip connection [14]. As shown in Fig. 2(c), we embed a ReMix block in the original feed-forward network. We introduce a recursive skip connection in the ReMix block: given an input feature map x_in, a depth-wise convolution layer is applied to capture the local context, followed by a recursive skip connection, which can be defined as:

y_i = LN(FC(x_in) + y_{i-1}),
x_out = FC(GELU(y_i)) + x_in,    (5)

where y_1 = LN(Conv_3x3(FC(x_in)) + FC(x_in)). After that, the model gains more expressive power due to the construction of different feature distributions and the consistency provided by each recursive step.

Fig. 3. The ReMixed Transformer Context Bridge.
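A minimal PyTorch sketch of ReMix-FFN as written in Eq. (4), with the recursive extension of Eq. (5) exposed through a recursive_steps argument; the module and argument names, the expansion ratio, and the reuse of a single LayerNorm across recursive steps are illustrative choices, not taken from the released implementation.

```python
import torch
import torch.nn as nn

class ReMixFFN(nn.Module):
    """ReMix-FFN (Eq. (4)); recursive_steps > 1 gives the general form of Eq. (5)."""
    def __init__(self, dim, expansion=4, recursive_steps=1):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)                      # FC(x_in)
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)      # depth-wise Conv_3x3
        self.norm = nn.LayerNorm(hidden)                       # LN after the skip connection
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)                      # FC(GELU(y_i))
        self.recursive_steps = recursive_steps

    def forward(self, x, H, W):                                # x: (B, N, C) with N = H*W
        fx = self.fc1(x)                                       # (B, N, hidden)
        B, N, Ch = fx.shape
        local = self.dwconv(fx.transpose(1, 2).reshape(B, Ch, H, W))
        local = local.reshape(B, Ch, N).transpose(1, 2)        # back to (B, N, hidden)
        y = self.norm(local + fx)                              # y_1 = LN(Conv(FC(x)) + FC(x))
        for _ in range(self.recursive_steps - 1):
            y = self.norm(fx + y)                              # y_i = LN(FC(x) + y_{i-1})
        return self.fc2(self.act(y)) + x                       # x_out = FC(GELU(y_i)) + x_in
```

Setting recursive_steps = 1 reduces the block exactly to Eq. (4); larger values correspond to the recursive variant studied in the ablations of Section IV.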
[Fig. 4 panels: grad norm versus training step (x10^4) for U-mlpFormer, U-SegFormer and MISSFormer_S.]
Fig. 4. The average L1 norm of gradients with respect to the second fully connected weight in the FFN, for layers 1, 3, 6 and 7.
The reason is that our design fuses the global features produced by self-attention and the local features extracted by convolution, which makes the features more discriminative. Our MISSFormer_S achieves 1% and 0.52% improvements over U-SegFormer w/ skip on the two datasets because of the feature redistribution. Finally, with the help of the redesigned feed-forward network, we improve the feature distributions and enhance the feature representations, generating a further promotion of 3.63 DSC on the Synapse dataset and 1.4 DSC on the ACDC dataset compared with the U-SegFormer baseline. In order to explore the reasons for the effectiveness of the above improvements, we observe the tendency of the gradients of the second fully connected weight in the FFN for different models with 8 encoder layers on Synapse. Fig. 4 shows the average L1 norm of the gradients for different layers of the models. We observe that the gradients of the shallow, middle and deep layers in U-SegFormer are all smaller than the corresponding ones in U-mlpFormer, and the gradients of the middle layer in U-SegFormer even become 1/3 of those of U-mlpFormer in Fig. 4(b), which indicates that directly embedding a 3x3 convolution between the fully connected layers slows the update of the middle layers and may not optimize towards better weights, although it supplements local information and brings slight improvements, while our method not only retains local information but also promotes …
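The layer-wise statistic plotted in Fig. 4 can be collected with a few lines of PyTorch after each backward pass; the sketch below assumes each encoder layer exposes its second FFN linear layer as ffn.fc2 (a hypothetical attribute name).

```python
import torch

def ffn_fc2_grad_norms(model):
    """Average L1 norm of the gradient of the second fully connected weight in each FFN.

    Call after loss.backward(); returns one value per encoder layer, so logging it at
    every training step reproduces curves of the kind shown in Fig. 4.
    """
    norms = []
    for layer in model.encoder_layers:                 # hypothetical list of transformer blocks
        w = layer.ffn.fc2.weight                       # hypothetical name of the 2nd FC weight
        if w.grad is not None:
            norms.append(w.grad.abs().mean().item())   # mean absolute gradient per entry
    return norms
```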
TABLE III
COMPARISON OF DIFFERENT LOCAL MODULES IN TRANSFORMER BLOCK

Architecture        Synapse DSC↑   Synapse HD↓   ACDC DSC↑
U-mlpFormer [16]    75.88          27.22         89.04
U-SegFormer [20]    76.10          26.97         89.21
U-LocalViT [32]     76.92          23.62         86.34
MISSFormer_S        79.73          20.14         90.61

3) Comparison of different local modules in Transformer: In order to prove the necessity of supplementing local information and the effectiveness of the proposed method, we compare it with other methods of supplementing local information. The experiment is carried out by replacing the FFN in the transformer block with different modules, such as Mix-FFN in SegFormer [20], the residual blocks in LocalViT [32] and the proposed ReMix-FFN in MISSFormer_S. The results are shown in Table III. Our MISSFormer_S takes a big advantage over the other methods with different local modules because of the more discriminative representation of ReMix-FFN.

TABLE IV
IMPACT OF RECURSIVE SKIP CONNECTION IN TRANSFORMER BLOCK; STEP MEANS RECURSIVE STEP.

[Plot: dice similarity coefficient versus training epoch (0–400) for MLP [16], Mix-FFN [20] and the proposed ReMix-FFN.]

TABLE V
IMPACT OF REMIXED TRANSFORMER CONTEXT BRIDGE ON RECURSIVE SKIP CONNECTION OF MISSFORMER.

Architecture   step   bridge   Synapse DSC↑   Synapse HD↓   ACDC DSC↑
MISSFormer_R   1      –        79.73          20.14         90.61
MISSFormer_R   2      –        79.91          21.33         90.40
MISSFormer_R   3      –        80.74          19.65         90.49
MISSFormer     1      ✓        81.96          18.20         90.98
MISSFormer     2      ✓        80.91          19.48         90.86
MISSFormer     3      ✓        80.72          23.43         90.82
TABLE VI
EXPLORATION OF THE BRIDGE DEPTH AND MULTI-SCALE INFORMATION IN MISSFORMER. THE 4/6 IN DEPTH DENOTES THAT THE BRIDGE DEPTH IS 4 FOR SYNAPSE AND 6 FOR ACDC.

Architecture   depth   stage     Synapse DSC↑   Synapse HD↓   ACDC DSC↑
MISSFormer     2       4/3/2/1   80.19          18.88         90.97
MISSFormer     4       4/3/2/1   81.96          18.20         90.98
MISSFormer     6       4/3/2/1   81.03          21.36         91.19
MISSFormer     4/6     4/3/2     80.65          18.39         90.69
MISSFormer     4/6     4/3       79.86          20.33         90.50
MISSFormer     4/6     4         79.56          20.95         90.88

TABLE VII
COMPARISON OF DIFFERENT MODULES IN TRANSFORMER CONTEXT BRIDGE.

Architecture   Context Bridge   Synapse DSC↑   Synapse HD↓   ACDC DSC↑
MISSFormer     –                79.73          20.14         90.61
MISSFormer     mlp-FFN [16]     79.54          17.26         90.72
MISSFormer     Mix-FFN [20]     80.18          20.17         90.65
MISSFormer     ReMix-FFN        81.96          18.20         91.19

TABLE IX
THE 5-FOLD CROSS-VALIDATION RESULTS ON THE SYNAPSE AND ACDC DATASETS. THE FORMAT OF THE NUMERICAL RECORD IS MEAN±STD.

Method           Synapse DSC↑   Synapse HD↓   ACDC DSC↑
TransUNet [22]   75.92±4.48     29.12±8.77    90.31±0.64
Swin-Unet [7]    74.87±7.58     22.25±7.63    90.46±0.58
MISSFormer       79.05±4.28     24.24±7.62    91.00±0.62
P-values         <0.02          >0.05         <0.01

… features.
² https://github.com/HuCaoFighting/Swin-Unet
TABLE VIII
COMPARISON WITH OTHER STATE-OF-THE-ART METHODS ON THE SYNAPSE DATASET.

Methods             DSC↑    HD↓     Aorta   Gallbladder   Kidney(L)   Kidney(R)   Liver   Pancreas   Spleen   Stomach
U-Net [2]           76.26   36.62   86.95   64.76         80.98       75.06       93.09   50.24      86.05    72.93
Att-UNet [40]       78.07   32.10   88.41   69.45         83.14       74.55       93.46   54.80      86.61    74.16
R50 U-Net [2]       77.24   31.01   87.30   63.03         80.55       77.59       93.26   53.38      86.34    74.47
R50 Att-UNet [40]   79.25   34.74   87.37   58.84         82.99       78.52       93.98   64.29      87.71    80.32
TransUNet [22]      77.24   31.18   87.32   61.09         80.77       76.15       94.54   55.97      86.15    75.88
Swin-Unet [7]       78.71   23.85   85.29   64.49         83.28       80.62       93.87   56.78      90.42    74.93
MISSFormer          81.96   18.20   86.99   68.65         85.21       82.00       94.41   65.67      91.92    80.81
Fig. 7. The visual comparison with previous state-of-the-art methods on the Synapse dataset: (a) GroundTruth, (b) MISSFormer, (c) MISSFormer_S, (d) Swin-Unet, (e) TransUNet. The pictures in the last row are a failed case. Our MISSFormer shows better performance than the other methods.
A t-test statistical analysis is performed between the proposed method and the best method in each index based on the 5-fold cross-validation results; the P-values prove the universality of the superiority of the proposed method on these datasets. MISSFormer is significantly ahead of the other two methods, which verifies the superiority and effectiveness of the proposed method again.

The visualization results are shown in Fig. 7. It can be seen that our MISSFormer achieves better edge predictions and hard example segmentations compared to TransUNet and Swin-Unet, even in the bad case. Comparing MISSFormer and MISSFormer_S, MISSFormer produces precise results and less false segmentation because of the integration of multi-scale information.
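The P-values reported in Table IX can be obtained with a standard t-test over the matched 5-fold scores; the following is a minimal sketch, assuming the per-fold metric values are already available as arrays and that a paired test over matched folds is the intended comparison.

```python
from scipy import stats

def compare_folds(scores_ours, scores_baseline):
    """Paired t-test between two methods evaluated on the same cross-validation folds.

    scores_ours / scores_baseline: sequences of per-fold metric values (e.g. DSC),
    in the same fold order. Returns the two-sided P-value.
    """
    t_stat, p_value = stats.ttest_rel(scores_ours, scores_baseline)
    return p_value
```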
Fig. 8. The visual comparison with previous state-of-the-art methods on the ACDC dataset.
At the same time, we notice that our method has not achieved the best results in the segmentation of the relatively small organs, the Aorta (deep blue in Fig. 7) and the Gallbladder (green in Fig. 7). This may be because our method gives a large weight to significant targets, which get more attention because they account for a larger proportion of the multi-scale features, as can be seen from Fig. 6. Although we supplement the local context in MISSFormer, the Aorta and Gallbladder results imply that the local context may still be insufficient for small targets in our MISSFormer. We will study the further supplementation of local information in future work to improve this situation.

2) Experiment results on ACDC dataset: We evaluate our method on the ACDC dataset, which is in MRI format. Tables IX and X present the 5-fold cross-validation results and the segmentation accuracy, respectively. MISSFormer maintains the first position because of its powerful feature discrimination, which indicates the outstanding generalization and robustness of MISSFormer. The visualization results are shown in Fig. 8.

TABLE X
COMPARISON WITH STATE-OF-THE-ART METHODS ON THE ACDC DATASET.

Methods             DSC↑    RV      Myo     LV
R50 U-Net [2]       90.16   88.67   86.88   94.92
R50 Att-UNet [40]   91.03   89.90   87.98   95.21
TransUNet [22]      90.44   88.82   87.54   94.95
Swin-Unet [7]       90.41   88.41   87.71   95.13
MISSFormer          91.19   89.85   88.38   95.34

3) Experiment results on DRIVE dataset: In addition to the above two datasets, we also verify the effectiveness of the proposed method on the retinal vessel segmentation dataset DRIVE. As shown in Table XI, the proposed method MISSFormer
achieved very competitive performance compared with other methods. As shown in Fig. 9, the proposed method can pay attention to more details of the retinal vessel images, which benefits from the global–local feature extraction of MISSFormer.

TABLE XI
COMPARISON WITH STATE-OF-THE-ART METHODS ON THE DRIVE DATASET.

Methods           Acc↑    Sen     AUC
BCOSFIRE [38]     94.42   76.55   96.14
WSF [39]          95.80   77.40   97.50
DeepVessel [40]   95.23   76.03   97.89
U-Net [2]         95.31   75.37   96.01
R2U-Net [41]      95.56   77.92   97.84
CE-Net [9]        95.45   83.09   97.79
CS-Net [12]       96.32   81.70   97.98
MISSFormer        96.03   84.69   98.44
V. CONCLUSION

In this paper, we presented MISSFormer, a U-shaped medical image segmentation transformer, which explores the capture of global dependencies and local context. The proposed ReMix-FFN can enhance the global dependencies and supplement the local context to make discriminative feature representations. Based on these core designs, we further investigated the integration of the multi-scale features generated by our hierarchical transformer encoder with the ReMixed Transformer Context Bridge, which is essential for accurate segmentation. We evaluated our method on three datasets with different formats; the superior results demonstrate the effectiveness and robustness of MISSFormer.
REFERENCES

[1] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[2] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
[3] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: Learning dense volumetric segmentation from sparse annotation,” Springer, Cham, 2016.
[4] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature methods, vol. 18, no. 2, pp. 203–211, 2021.
[5] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: A nested u-net architecture for medical image segmentation,” in Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, 2018, pp. 3–11.
[6] H. Huang, L. Lin, R. Tong, H. Hu, Q. Zhang, Y. Iwamoto, X. Han, Y.-W. Chen, and J. Wu, “Unet 3+: A full-scale connected unet for medical image segmentation,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1055–1059.
[7] H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” arXiv preprint arXiv:2105.05537, 2021.
[8] Y. Xie, J. Zhang, C. Shen, and Y. Xia, “Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation,” arXiv preprint arXiv:2103.03024, 2021.
[9] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, and J. Liu, “Ce-net: Context encoder network for 2d medical image segmentation,” IEEE transactions on medical imaging, vol. 38, no. 10, pp. 2281–2292, 2019.
[10] S. Feng, H. Zhao, F. Shi, X. Cheng, M. Wang, Y. Ma, D. Xiang, W. Zhu, and X. Chen, “Cpfnet: Context pyramid fusion network for medical image segmentation,” IEEE transactions on medical imaging, vol. 39, no. 10, pp. 3008–3018, 2020.
[11] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
[12] L. Mou, Y. Zhao, L. Chen, J. Cheng, Z. Gu, H. Hao, H. Qi, Y. Zheng, A. Frangi, and J. Liu, “Cs-net: channel and spatial attention network for curvilinear structure segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 721–730.
[13] A. Sinha and J. Dolz, “Multi-scale self-guided attention for medical image segmentation,” IEEE journal of biomedical and health informatics, vol. 25, no. 1, pp. 121–130, 2020.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[15] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” arXiv preprint arXiv:2103.14030, 2021.
[16] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[17] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” arXiv preprint arXiv:2102.12122, 2021.
[18] B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin, H. Jégou, and M. Douze, “Levit: a vision transformer in convnet’s clothing for faster inference,” arXiv preprint arXiv:2104.01136, 2021.
[19] X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen, “Twins: Revisiting the design of spatial attention in vision transformers,” arXiv preprint arXiv:2104.13840, vol. 1, no. 2, p. 3, 2021.
[20] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” arXiv preprint arXiv:2105.15203, 2021.
[21] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890.
[22] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.
[23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[24] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[25] S. Chen, X. Tan, B. Wang, and X. Hu, “Reverse attention for salient object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 234–250.
[26] D.-P. Fan, G.-P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, “Pranet: Parallel reverse attention network for polyp segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2020, pp. 263–273.
[27] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[28] Z. Zhang, H. Fu, H. Dai, J. Shen, Y. Pang, and L. Shao, “Et-net: A generic edge-attention guidance network for medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 442–450.
[29] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International Conference on Machine Learning. PMLR, 2021, pp. 10347–10357.
[30] M. A. Islam, S. Jia, and N. D. Bruce, “How much position information do convolutional neural networks encode?” arXiv preprint arXiv:2001.08248, 2020.
[31] X. Chu, Z. Tian, B. Zhang, X. Wang, X. Wei, H. Xia, and C. Shen, “Conditional positional encodings for vision transformers,” arXiv preprint arXiv:2102.10882, 2021.
[32] Y. Li, K. Zhang, J. Cao, R. Timofte, and L. Van Gool, “Localvit: Bringing locality to vision transformers,” arXiv preprint arXiv:2104.05707, 2021.
[33] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[34] Z. Wang, X. Cun, J. Bao, and J. Liu, “Uformer: A general u-shaped transformer for image restoration,” arXiv preprint arXiv:2106.03106, 2021.
[35] A. Hatamizadeh, D. Yang, H. Roth, and D. Xu, “Unetr: Transformers for 3d medical image segmentation,” arXiv preprint arXiv:2103.10504, 2021.
[36] W. Wang, C. Chen, M. Ding, J. Li, H. Yu, and S. Zha, “Transbts: Multimodal brain tumor segmentation using transformer,” arXiv preprint arXiv:2103.04430, 2021.
[37] F. Liu, X. Ren, Z. Zhang, X. Sun, and Y. Zou, “Rethinking skip connection with layer normalization,” in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 3586–3598.
[38] G. Azzopardi, N. Strisciuglio, M. Vento, and N. Petkov, “Trainable cosfire filters for vessel delineation with application to retinal images,” Medical image analysis, vol. 19, no. 1, pp. 46–57, 2015.
[39] Y. Zhao, Y. Zheng, Y. Liu, Y. Zhao, L. Luo, S. Yang, T. Na, Y. Wang, and J. Liu, “Automatic 2-d/3-d vessel enhancement in multiple modality images using a weighted symmetry filter,” IEEE transactions on medical imaging, vol. 37, no. 2, pp. 438–450, 2017.
[40] H. Fu, Y. Xu, S. Lin, D. W. Kee Wong, and J. Liu, “Deepvessel: Retinal vessel segmentation via deep learning and conditional random field,” in International conference on medical image computing and computer-assisted intervention. Springer, 2016, pp. 132–139.
[41] M. Z. Alom, M. Hasan, C. Yakopcic, T. M. Taha, and V. K. Asari, “Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation,” arXiv preprint arXiv:1802.06955, 2018.