U-Transformer: Self and Cross Attention for Medical Image Segmentation
1 Introduction
Organ segmentation is of crucial importance in medical imaging and computer-aided diagnosis, e.g. for radiologists to assess physical changes in response to a treatment or for computer-assisted interventions.
Currently, state-of-the-art methods rely on Fully Convolutional Networks
(FCNs), such as U-Net and variants [9, 2, 7, 18]. U-Nets use an encoder-decoder
architecture: the encoder extracts high-level semantic representations by using
a cascade of convolutional layers, while the decoder leverages skip connections
to re-use high-resolution feature maps from the encoder in order to recover lost
spatial information from high-level representations.
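As a point of reference, here is a minimal sketch of this encoder-decoder pattern with a single skip connection, written in TensorFlow/Keras (the framework used in the experiments of Section 3); the depth, layer widths and function names are toy choices for illustration, not the backbone actually used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(filters):
    # Two 3x3 convolutions, the basic U-Net building block.
    return tf.keras.Sequential([
        layers.Conv2D(filters, 3, padding="same", activation="relu"),
        layers.Conv2D(filters, 3, padding="same", activation="relu")])

def tiny_unet(input_shape=(512, 512, 1), n_classes=2):
    inp = layers.Input(shape=input_shape)
    s1 = conv_block(32)(inp)                                # high-resolution encoder features
    bottom = conv_block(64)(layers.MaxPool2D(2)(s1))        # high-level semantic features
    up = layers.Conv2DTranspose(32, 2, strides=2)(bottom)   # recover spatial resolution
    dec = conv_block(32)(layers.Concatenate()([up, s1]))    # skip connection re-uses s1
    out = layers.Conv2D(n_classes, 1, activation="softmax")(dec)
    return tf.keras.Model(inp, out)
```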
Despite their outstanding performances, FCNs suffer from conceptual limitations in complex segmentation tasks, e.g. when dealing with local visual ambiguities and low contrast between organs. This is illustrated in Fig. 1a) for segmenting the blue cross region corresponding to the pancreas with U-Net: the limited Receptive Field (RF) framed in red does not capture sufficient contextual information, making the segmentation fail, see Fig. 1c).
Fig. 1. Global context is crucial for complex organ segmentation but cannot be captured by vanilla U-Nets with a limited receptive field, i.e. the blue cross region in a) with failed segmentation in c). The proposed U-Transformer network represents full image context by means of attention maps b), which leverage long-range interactions with other anatomical structures to properly segment the complex pancreas region in d).
In this paper, we introduce the U-Transformer network, which leverages the
strong abilities of transformers [13] to model long-range interactions and spatial
relationships between anatomical structures. U-Transformer keeps the inductive
bias of convolution by using a U-shaped architecture, but introduces attention
mechanisms at two main levels, which help to interpret the model's decisions.
Firstly, a self-attention module leverages global interactions between semantic
features at the end of the encoder to explicitly model full contextual information.
Secondly, we introduce cross-attention in the skip connections to filter out non-
semantic features, allowing a fine spatial recovery in the U-Net decoder.
Fig. 1b) shows a cross-attention map induced by U-Transformer, which highlights the most important regions for segmenting the blue cross region in Fig. 1a): our model leverages the long-range interactions with respect to other organs (liver, stomach, spleen) and their positions to properly segment the whole pancreas region, see Fig. 1d). Quantitative experiments conducted on two abdominal CT-image datasets show the large performance gain brought by U-Transformer compared to U-Net and to the local attention in [11].
Related Work. Attention mechanisms are a relatively recent topic in medical imaging [16, 8, 10–12]. Attention in segmentation is often based on multi-resolution features combined with a simple attention module [16, 6]. These contributions however fail to incorporate long-range dependencies. Recent works successfully tackle this aspect through dual attention networks [12, 5], proving the importance of full-range attention, but at the cost of a large parameter overhead and multiple concurrent loss functions.
Transformer models [13] also bring global attention and have witnessed increasing success in the last five years, starting in natural language processing with text embeddings [3]. A pioneering use of transformers in computer vision is non-local neural networks [15].
2.1 Self-attention
The MHSA module is designed to extract long-range structural information from the images. To this end, it is composed of multi-head self-attention functions, as described in [13], positioned at the bottom of the U-Net as shown in Figure 2.
The main goal of MHSA is to connect every element in the highest-level feature map with each other, thus giving access to a receptive field covering the whole input image. The decision for one specific pixel can thus be influenced by any input pixel. The attention formulation is given in Equation 1. A self-attention module takes three inputs: a matrix of queries $Q \in \mathbb{R}^{n \times d_k}$, a matrix of keys $K \in \mathbb{R}^{n \times d_k}$, and a matrix of values $V \in \mathbb{R}^{n \times d_k}$.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V = AV \qquad (1)$$
A line of the attention matrix $A \in \mathbb{R}^{n \times n}$ corresponds to the similarity of a given element in Q with respect to all the elements in K. Then, the attention function performs a weighted average of the elements of the value V to account for all the interactions between the queries and the keys, as illustrated in Figure 3. In our segmentation task, Q, K and V share the same size and correspond to different learnt embeddings of the highest-level feature map, denoted by X in Figure 3. The embedding matrices are denoted as $W^q$, $W^k$ and $W^v$. The attention is calculated separately in multiple heads before being combined through another embedding. Moreover, to account for absolute contextual information, a positional encoding is added to the input features. It is especially relevant for medical image segmentation, where the different anatomical structures follow fixed spatial positions. The positional encoding can thus be leveraged to capture absolute and relative positions between organs in MHSA.
Fig. 3. MHSA module: the input tensor is embedded into a matrix of queries Q,
keys K and values V . The attention matrix A in purple is computed based on Q and
K. (1) A line of A corresponds to the attention given to all the elements in K with
respect to one element in Q. (2) A column of the value V corresponds to a feature
map weighted by the attention in A.
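As a concrete illustration of Equation 1 applied to a 2D feature map, the following TensorFlow/Keras sketch flattens the highest-level feature map into one token per spatial location, adds a learned positional encoding, and applies multi-head self-attention. The class name, head count and zero-initialized positional encoding are illustrative assumptions, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class MHSA2D(layers.Layer):
    """Multi-head self-attention over all spatial positions of a feature map."""
    def __init__(self, channels, num_heads=8, **kwargs):
        super().__init__(**kwargs)
        self.attn = layers.MultiHeadAttention(num_heads=num_heads,
                                              key_dim=channels // num_heads)

    def build(self, input_shape):
        # Learned absolute positional encoding, one vector per spatial location.
        _, h, w, c = input_shape
        self.pos = self.add_weight(name="pos_enc", shape=(1, h, w, c),
                                   initializer="zeros")

    def call(self, x):
        # x: (B, H, W, C) highest-level feature map (static spatial size assumed)
        x = x + self.pos                      # inject absolute position information
        b = tf.shape(x)[0]
        h, w, c = x.shape[1], x.shape[2], x.shape[3]
        seq = tf.reshape(x, (b, h * w, c))    # one token per spatial location
        # Q = K = V: every position attends to every other position (Eq. 1).
        out, scores = self.attn(seq, seq, return_attention_scores=True)
        return tf.reshape(out, (b, h, w, c)), scores   # scores: (B, heads, H*W, H*W)
```

For a feature map of size H × W this produces the n × n attention matrix discussed above, with n = H·W, so every pixel of the bottleneck can influence every other.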
2.2 Cross-attention
The MHSA module allows every element in the input to be connected with every other. Attention may also be used to increase the efficiency of the U-Net decoder, and in particular to enhance the lower-level feature maps that are passed through the skip connections. Indeed, while these skip connections preserve high-resolution information, they lack the semantic richness that can be found deeper in the network.
The idea behind the MHCA module is to turn off irrelevant or noisy areas from the skip connection features and to highlight regions that present a significant interest for the application. Figure 4 shows the cross-attention module. The MHCA block is designed as a gating operation of the skip connection S, based on the attention given to a high-level feature map Y. The computed weights are re-scaled between 0 and 1 through a sigmoid activation function. The resulting tensor, denoted Z in Figure 4, is a filter where low-magnitude elements indicate noisy or irrelevant areas to be reduced. A cleaned-up version of S is then given by the Hadamard product Z ⊙ S. Finally, the result of this filtering operation is concatenated with the high-level feature tensor Y. Here, the keys and queries are computed from the same source, since we are designing a filtering operation, whereas for NLP tasks, having homogeneous keys and values may be more meaningful. This configuration proved empirically more effective.
Fig. 4. MHCA module: the value of the attention function corresponds to the skip
connection S weighted by the attention given to the high level feature map Y . This
output is transformed into a filter Z and applied to the skip connection.
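To make the gating explicit, here is a hedged Keras sketch of such a cross-attention filter. It assumes Y is first resized to the spatial resolution of S and shares its channel count; the head count, layer names and resizing step are illustrative assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class MHCAGate(layers.Layer):
    """Cross-attention gate: queries/keys from Y, values from the skip connection S."""
    def __init__(self, channels, num_heads=4, **kwargs):
        super().__init__(**kwargs)
        self.attn = layers.MultiHeadAttention(num_heads=num_heads,
                                              key_dim=channels // num_heads)

    def call(self, s, y):
        # s: skip features (B, H, W, C); y: high-level features (B, h, w, C)
        b = tf.shape(s)[0]
        h, w, c = s.shape[1], s.shape[2], s.shape[3]
        y_up = tf.image.resize(y, (h, w))          # assumption: align Y to S's resolution
        s_seq = tf.reshape(s, (b, h * w, c))       # values come from the skip connection S
        y_seq = tf.reshape(y_up, (b, h * w, c))    # queries and keys come from Y
        weighted_s, _ = self.attn(query=y_seq, value=s_seq, key=y_seq,
                                  return_attention_scores=True)
        z = tf.sigmoid(tf.reshape(weighted_s, (b, h, w, c)))  # filter Z in [0, 1]
        return tf.concat([z * s, y_up], axis=-1)   # Z ⊙ S, concatenated with Y
```

The sigmoid keeps the filter Z in [0, 1], so the Hadamard product can only attenuate, never amplify, the skip-connection features, which is consistent with its role as a gate.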
3 Experiments
We evaluate U-Transformer for abdominal organ segmentation on the public TCIA pancreas dataset and on an internal multi-organ dataset.
Accurate pancreas segmentation is particularly difficult due to its small size, complex and variable shape, and the low contrast with neighboring structures, see Fig. 1. In addition, the multi-organ setting assesses how U-Transformer can leverage attention from multi-organ annotations.
Experimental setup. The TCIA pancreas dataset contains 82 CT-scans with pixel-level annotations. Each CT-scan has between 181 and 466 slices of 512 × 512 pixels and a voxel spacing of ([0.66 ∼ 0.98] × [0.66 ∼ 0.98] × [0.5 ∼ 1.0]) mm³.
We also experiment with an Internal Multi-Organ (IMO) dataset composed of 85 CT-scans annotated with 7 classes: liver, gallbladder, pancreas, spleen, right and left kidneys, and stomach. Each CT-scan has between 57 and 500 slices of 512 × 512 pixels and a voxel spacing of ([0.42 ∼ 0.98] × [0.42 ∼ 0.98] × [0.63 ∼ 4.00]) mm³.
All experiments follow a 5-fold cross-validation, using 80% of the images for training and 20% for testing. We use the TensorFlow library to train the models, with the Adam optimizer (10⁻⁴ learning rate, exponential decay scheduler).
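For concreteness, the optimization setup could be written as follows in TensorFlow/Keras; only the 10⁻⁴ learning rate and the exponential schedule are stated above, while the decay steps and rate are illustrative placeholders.

```python
import tensorflow as tf

# Adam with an exponentially decaying learning rate, as described above.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,
    decay_steps=10_000,   # assumption: not specified in the text
    decay_rate=0.96)      # assumption: not specified in the text
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```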
We compare U-Transformer to the U-Net baseline [9] and Attention U-
Net [11] with the same convolutional backbone for fair comparison. We also
report performances with self-attention only (MHSA, Section 2.1) and with cross-attention only (MHCA, Section 2.2). U-Net has ∼30M parameters; the overhead from U-Transformer is limited (MHSA ∼5M, each MHCA block ∼2.5M).
Table 1 reports the performance in Dice, averaged over the 5 folds, and over organs for IMO. U-Transformer outperforms U-Net by 2.4pts on TCIA and 1.3pts on IMO, and Attention U-Net by 1.7pts on TCIA and 1.6pts on IMO. The gains are consistent across all folds, and paired t-tests show that the improvement is significant, with p-values < 3% for every experiment.
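As a reminder, the reported metric is the Dice similarity coefficient, DSC(A, B) = 2|A ∩ B| / (|A| + |B|); a minimal NumPy sketch for binary masks is given below (the smoothing term eps is an illustrative convention, not taken from the paper).

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    # Dice similarity coefficient between two binary segmentation masks.
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```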
Table 1. Results for each method in Dice similarity coefficient (DSC, %)

       | U-Net | Attn U-Net |     MHSA      |                MHCA
       |       |            | wo PE | w PE  | 1 lvl wo PE | 1 lvl w PE | multi-lvl w PE
TCIA   | 76.35 |   77.23    | 78.17 | 78.90 |    77.18    |   78.88    |     80.65
IMO    | 88.18 |   87.52    | 88.16 | 88.76 |    87.96    |   88.52    |     89.13

Fig. 5. Segmentation results for U-Net [9], Attention U-Net [11] and U-Transformer on the multi-organ IMO dataset (first row) and on TCIA pancreas (second row).
4 Conclusion
This paper introduces the U-Transformer network, which augments a U-shaped
FCN with Transformers. We propose to use self and cross-attention modules
to model long-range interactions and spatial dependencies. We highlight the
relevance of the approach for abdominal organ segmentation, especially for small
and complex organs. Future work could include the study of U-Transformer in 3D networks, with other modalities such as MRI or US images, as well as for other medical imaging tasks.
References
1. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-
to-end object detection with transformers. In: European Conference on Computer
Vision. pp. 213–229. Springer (2020)
2. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3d u-net:
Learning dense volumetric segmentation from sparse annotation. In: MICCAI. pp.
424–432 (2016)
3. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirec-
tional transformers for language understanding. CoRR abs/1810.04805 (2018),
http://arxiv.org/abs/1810.04805
4. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.:
An image is worth 16x16 words: Transformers for image recognition at scale. In:
International Conference on Learning Representations (2021)
5. Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H.: Dual attention network for
scene segmentation. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) (June 2019)
6. Li, C., Tong, Q., Liao, X., Si, W., Sun, Y., Wang, Q., Heng, P.A.: Attention based
hierarchical aggregation network for 3d left atrial segmentation. In: Statistical At-
lases and Computational Models of the Heart. Atrial Segmentation and LV Quan-
tification Challenges. pp. 255–264 (2019)
7. Milletari, F., Navab, N., Ahmadi, S.: V-net: Fully convolutional neural networks for
volumetric medical image segmentation. In: 2016 Fourth International Conference
on 3D Vision (3DV). pp. 565–571 (2016)
8. Nie, D., Gao, Y., Wang, L., Shen, D.: Asdnet: Attention based semi-supervised deep
networks for medical image segmentation. In: Frangi, A., Fichtinger, G., Schnabel,
J., Alberola-López, C., Davatzikos, C. (eds.) MICCAI 2018. pp. 370–378. Lecture
Notes in Computer Science, Springer Verlag (2018)
9. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: MICCAI. pp. 234–241 (2015)
10. Roy, A.G., Navab, N., Wachinger, C.: Concurrent spatial and channel squeeze &
excitation in fully convolutional networks. In: MICCAI. vol. abs/1803.02579 (2018)
11. Schlemper, J., Oktay, O., Schaap, M., Heinrich, M., Kainz, B., Glocker, B., Rueckert, D.: Attention gated networks: Learning to leverage salient regions in medical images. Medical Image Analysis 53 (2019). https://doi.org/10.1016/j.media.2019.01.012
12. Sinha, A., Dolz, J.: Multi-scale self-guided attention for medical image segmenta-
tion. IEEE Journal of Biomedical and Health Informatics pp. 1–1 (2020)
13. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. In: NeurIPS. pp. 5998–6008 (2017)
14. Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.C.: Axial-deeplab:
Stand-alone axial-attention for panoptic segmentation. In: European Conference
on Computer Vision. pp. 108–126 (2020)
15. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Pro-
ceedings of the IEEE conference on computer vision and pattern recognition. pp.
7794–7803 (2018)
16. Wang, Y., Deng, Z., Hu, X., Zhu, L., Yang, X., Xu, X., Heng, P.A., Ni, D.: Deep attentional features for prostate segmentation in ultrasound. In: MICCAI (2018)
17. Ye, L., Rochan, M., Liu, Z., Wang, Y.: Cross-modal self-attention network for
referring image segmentation. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 10502–10511 (2019)
18. Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: A nested
u-net architecture for medical image segmentation. In: Deep Learning in Medical
Image Analysis and Multimodal Learning for Clinical Decision Support. pp. 3–11
(2018)
Fig. 7. Evolution of the Dice Score on TCIA (fold 1) when the number of heads varies
between 0 and 8 in MHSA.