Evaluation of Deep Neural Network Models for Instance Segmentation of Lumbar Spine MR Images
Jiasong Chen a, Linchen Qian a, Linhai Ma a, Timur Urakov b, Weiyong Gu c, and Liang Liang a
a Department of Computer Science, University of Miami, Coral Gables, FL
b Department of Neurological Surgery, University of Miami, Coral Gables, FL
c Department of Mechanical and Aerospace Engineering, University of Miami, Coral Gables, FL
For correspondence: University of Miami
Abstract
Intervertebral disc disease, a prevalent ailment, frequently leads to intermittent or persistent low back pain, and diagnosis and assessment of this disease rely on accurate measurement of vertebral bone and intervertebral disc geometries from lumbar MR images. Deep neural network (DNN) models may assist clinicians with more efficient image segmentation of individual instances (discs and vertebrae) of the lumbar spine in an automated way, which is termed instance image segmentation. In this work, we evaluated 15 existing DNN models for lumbar spine MR image segmentation. We introduced a new data augmentation technique to create a synthetic yet realistic MR image dataset, named SSMSpine, which is made publicly available. The 15 image segmentation models are evaluated on our private in-house dataset and the public SSMSpine dataset, using two metrics, the Dice Similarity Coefficient and the 95% Hausdorff Distance. The SSMSpine dataset is available at
https://github.com/jiasongchen/SSMSpine.
Keywords: Lumbar spine MRI, Medical image instance segmentation, Data augmentation
1. Introduction
The intervertebral discs in humans can undergo a profound degenerative process as early as adolescence
(Cox et al., 2014; Kos et al., 2019), which can be accompanied by facet arthropathy and hypertrophy. This
degeneration can manifest as various conditions, including discogenic low back pain, disc herniation, spinal stenosis,
and spondylolisthesis, which may necessitate the implementation of surgical or non-surgical interventions aimed at
alleviating pain and restoring normal functionality. Magnetic resonance imaging (MRI) is the most widely used
technique for quantifying intervertebral disc degeneration (IDD) by assessing changes in disc geometry
deformation and signal strength degradation (Mallio et al., 2022; Roberts et al., 2021; Tamagawa et al., 2022). The
information derived from imaging data is of utmost importance for medical professionals in terms of both
diagnosing medical conditions and planning appropriate treatments. Furthermore, this information serves as a
critical foundation for developing patient-specific computational models, which hold the potential to mature over
time and eventually enable accurate predictions of treatment outcomes within clinical settings. Presently, the
process of geometry reconstruction, signal measurements, and grading from magnetic resonance (MR) images
heavily relies on manual annotation. However, this process is not only time-consuming but also vulnerable to human
bias. Consequently, there is an urgent need for automated MR image analysis methods to address these challenges.
In medical imaging, semantic/instance image segmentation, which divides the images into distinct sections
at the pixel level so that each pixel belongs to a specific region, has the potential to be carried out through automated
techniques (Galbusera et al., 2019). The traditional methods, such as watershed and level set, have demonstrated
satisfactory performance in medical image segmentation tasks. The watershed method treats an image as a
topological map where intensity represents the altitude of the pixels. The watershed segmentation is determined by
the watershed lines on a topographic surface (Chevrefils et al., 2007; Huang and Chen, 2004). The level set method
performs image segmentation by utilizing dynamic variational boundaries (Huang et al., 2013). However, these traditional methods suffer from the clinical variation across patients and the noise introduced by different medical imaging equipment, and they are prone to problems such as over-segmentation and long computation times (Li et al., 2007).
As increasingly vast amounts of medical imaging data and computational resources have become available, machine learning (ML) methods, especially deep neural network techniques, show superior performance over traditional methods. A convolutional neural network (CNN) has a significant edge over its predecessors in that
it possesses the capability to recognize essential components/features without requiring any human intervention
(Suganyadevi et al., 2022). CNNs are specifically designed to effectively utilize spatial and configural information
by accepting 2D or 3D images as input. This approach helps to prevent the loss or disruption of structural and
configural information in medical images (Shen et al., 2017). Various deep CNNs, including UNet++ (Zhou et al.,
2018), Attention U-Net (Oktay et al., 2018), MultiResUNet (Ibtehaz and Rahman, 2020) and UNeXt (Valanarasu
and Patel, 2022) have been proposed for image segmentation for different medical imaging modalities and different
organs (e.g. heart (Cao et al., 2023; Gao et al., 2021; Huang et al., 2023), lung (Zhou et al., 2018), brain
(Hatamizadeh et al., 2022, 2021; Hu et al., 2022; Valanarasu et al., 2021), pancreas (Oktay et al., 2018), gland
(Valanarasu et al., 2021; Wang et al., 2022), spine (Sekuboyina et al., 2018; Wang et al., 2023), retina blood vessels
(Moccia et al., 2018; Soomro et al., 2019), aorta (Berhane et al., 2020; Noothout et al., 2018; Pepe et al., 2020),
etc.). Although these methods have achieved promising performance, they still have limitations in explicitly modeling long-range dependencies in more complex contexts, due to the intrinsic locality of convolutions.
Recently, Transformer, an ML technique, has shown exceptional performance not only on natural language
processing (NLP) challenges like machine translation (Vaswani et al., 2017), but also image analysis tasks including
image classification (Shamshad et al., 2023) and segmentation (Chen et al., 2021; Hatamizadeh et al., 2022, 2021;
Liu et al., 2021; Wang et al., 2022). Various variants of Transformer models have demonstrated that the global
information perceived by the self-attention operations is beneficial in medical imaging tasks. TransUNet was the
first Transformer-based network specifically for medical image segmentation on the synapse multi-organ
segmentation dataset (Chen et al., 2021). Wang et al. (2022) proposed UCTransNet, which substitutes the original skip connections of U-Net with a multi-scale Channel Cross fusion Transformer and a Channel-
wise Cross-Attention module, and tested the network on the gland segmentation dataset (Sirinukunwattana et al., 2017) and
synapse multi-organ segmentation dataset (Landman et al., 2015). Hatamizadeh et al. (2021, 2022) proposed both
UNETR and Swin UNETR for 3D medical imaging segmentation. UNETR utilizes a U-shape network with a vision
Transformer as the encoder and a CNN-based decoder. Swin UNETR is constructed by replacing the vision
transformer encoder in UNETR architecture with the Swin Transformer encoder. Feng et al. (2022) proposed SLT-
Net to utilize CSwin Transformer (Dong et al., 2022) as the encoder for feature extraction and the multi-scale
context Transformer as the skip connection for skin lesion segmentation. Swin-Unet adopted Swin Transformer
(Liu et al., 2021) with shifted windows as encoder and a symmetric Swin Transformer-based decoder with patch
expanding layer as the decoder for the multi-organ segmentation task (Cao et al., 2023). Pu et al. (2023) proposed a semi-
supervised learning framework with Inception-SwinUnet adopting convolution and sliding window attention in
different channels for vessel segmentation with a small amount of labeled data. Besides the self-attention mechanism,
position embeddings are another crucial component of Transformer models. Without position embeddings, a Transformer model is invariant to permutations of the input order (Vaswani et al., 2017). However, since text data inherently has a sequential structure, the absence of position information results in the ambiguous or undefined
meaning of a sentence (Dufter et al., 2022). For image segmentation, usually, an image patch is treated as a token,
and Transformers process the entire input sequence of tokens in parallel. With position embeddings, a Transformer
would be able to differentiate between image patches with similar content that appear in different positions in the
input image, which is beneficial for image segmentation applications. A variety of different methods may be used
to incorporate the position information into Transformer models. Absolute position encoding and relative position
encoding are the two main categories for encoding a token’s position information. Vaswani et al. (2017) first introduced absolute and relative position embedding in the vanilla Transformer model. Shaw et al. (2018) extended the self-attention mechanism with the capacity to effectively incorporate representations of relative position. Valanarasu et al. (2021) proposed a gated position-sensitive axial attention mechanism to cope with the difficulty of learning position encodings for images.
Specifically for lumbar spine research, instance segmentation of MR images is preferred, which not only
determines whether or not a pixel belongs to a disc, but also labels the precise instance to which it belongs
(Galbusera et al., 2019). In recent years, most instance segmentation methods for spine image segmentation are
based on CNN-only networks, and only a few Transformer-based networks have been employed. For example, Kuang et
al. (2020) built an unsupervised segmentation network for spine image segmentation using the rule-based region of
interest (ROI) detection, a voting mechanism accompanied by a CNN network. Sekuboyina et al. (2018) proposed
a dual-branch fully convolutional network that takes advantage of both low-resolution attention information on two-
dimensional sagittal slices and high-resolution segmentation context on three-dimensional patches for effective
segmentation of the vertebrae. MLKCA-Unet incorporates multi-scale large-kernel convolution and convolutional
block attention into the U-net architecture for efficient feature extraction in spine MRI segmentation (Wang et al.,
2023). Pang et al. (2022) introduced a mixed-supervised segmentation network and it was trained on a strongly
supervised dataset with full segmentation labels and a weakly-supervised dataset with only key points. BianqueNet
combined new modules with a modified deeplabv3+ network (Chen et al., 2018), which includes a Swin
Transformer-skip connection module, for segmentation of lumbar intervertebral disc degeneration related regions
(Zheng et al., 2022).
It has been shown that Transformer-based models only perform effectively when trained on large-scale datasets, due to their lack of inductive bias (Dosovitskiy et al., 2021). The utilization of Transformer-based networks
for medical imaging tasks poses a challenge due to the limited availability of labeled images in medical datasets.
Obtaining well-annotated medical imaging datasets presents significantly greater challenges compared to curating
traditional computer vision datasets. Dealing with expensive imaging equipment, complex image acquisition
pipelines, expert annotation requirements, and privacy concerns are all part of the problematic issues (Litjens et al.,
2017). This scarcity hampers the effective application of Transformer-based models in the medical domain. In such
scenarios, the adoption of suitable and feasible data augmentation techniques becomes crucial prior to model
training. These techniques can help to increase the effective size of the medical image dataset and improve the
performance of the Transformer-based model.
In this study, we evaluated 15 DNN models for lumbar spine MRI instance segmentation. For this purpose,
we developed a novel data synthesis method based on statistical shape model (SSM) and biomechanics. This SSM-
biomechanics-based data synthesis method generates lumbar spine images with large and plausible deformations,
which can be used for model training and evaluation.
In addition to the segmentation of vertebral bodies, the segmentation of intervertebral discs (IVDs) is vital
for lumbar spinal disease diagnosis and treatment. Since the water content in IVDs cannot be revealed on CT images,
currently, MRI is the gold standard imaging modality for the evaluation of IVD pathologies (Kirnaz et al., 2022).
Our study aims to explore the benefit of combining CNN and transformer for instance segmentation of lumbar spine
MR images.
Generally, there are two main steps to define a position embedding. The first step is defining the position function or distance function, which is used for encoding the position information of input tokens. There are plenty of position functions, such as the index function, Euclidean distance, and sinusoidal functions. The second step is defining how to incorporate the encoded position information into self-attention.
Absolute position embedding and relative position embedding are two main position representation
methods to incorporate the position information into input tokens. Absolute position embedding encodes the absolute position of each input token as an individual encoding vector, whereas relative position embedding focuses on the relative positional relationships of pairwise input tokens (Lin et al., 2022; Wu et al., 2021). The vanilla Transformer designed for NLP (Vaswani et al., 2017) used a combination of absolute and relative position embedding to add position information to the tokens. It is inconclusive whether relative position embedding is better
or worse than absolute position embedding, and the answer seems to be dependent on specific applications (Dufter
et al., 2022; Huang et al., 2020; Shaw et al., 2018; Wu et al., 2021). Relative position encoding benefits from
capturing the details of relative distance/direction and is invariant to tokens’ shifting. The intuition is that, in the
self-attention mechanism, the pairwise positional relationship (both in terms of direction and distance) between
input elements might be more advantageous than the absolute positions of individual elements (Lin et al., 2022). Position information in Transformers is thus an active research area, and various relative position encodings have been proposed for medical image segmentation (Dosovitskiy et al., 2021). For example, UTNet proposed a 2-dimensional relative position encoding by adding relative height and width information (Gao et al., 2021). MedT updated the self-attention mechanism with position encoding along the width axis, inspired by axial attention (Valanarasu et al., 2021; Wang et al., 2020; Zhang and Zhang, 2022). The Parameter-Efficient Transformer
added a trainable position vector to the input to encode relative distances (Hu et al., 2022). In this work, we propose
a novel relative position embedding method for segmentation performance improvement.
UCTransNet: softmax( X_c W_Q (X_c W_K)^T / √d ) (X_c W_V)
Note: X_c is composed of image channels instead of image patches.
BianqueNet: softmax( X W_Q (X W_K)^T / √d + R ) (X W_V)
Note: R is called relative position bias in the reference.
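To make the second formula above concrete, the following PyTorch sketch adds a learnable relative position bias R to the scaled dot-product attention logits before the softmax. The function name, tensor shapes, and toy usage are our own illustrative assumptions, not code from UCTransNet or BianqueNet.

```python
import torch
import torch.nn.functional as F

def attention_with_relative_bias(x, w_q, w_k, w_v, rel_bias):
    """Scaled dot-product self-attention with an additive relative position bias R.

    x:        (num_tokens, d_model) token (patch) embeddings
    w_q/k/v:  (d_model, d) projection matrices
    rel_bias: (num_tokens, num_tokens) learnable relative position bias R
    A minimal sketch of the formula above, not the cited models' actual code.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    logits = (q @ k.transpose(-1, -2)) / d ** 0.5   # X W_Q (X W_K)^T / sqrt(d)
    logits = logits + rel_bias                      # add relative position bias R
    return F.softmax(logits, dim=-1) @ v            # softmax(...) (X W_V)

# toy usage with 16 tokens and d_model = d = 32
x = torch.randn(16, 32)
w_q, w_k, w_v = (torch.randn(32, 32) for _ in range(3))
r = torch.zeros(16, 16, requires_grad=True)         # the bias is typically learned
out = attention_with_relative_bias(x, w_q, w_k, w_v, r)   # (16, 32)
```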
3. Methods
3.1. Novel data augmentation/synthesis method based on SSM and biomechanics
For medical image data augmentation, elastic deformation is often used for nonlinear deformation of the
images to increase diversity of training data (Ronneberger et al., 2015). Briefly, the input space is discretized by a
grid, and a random displacement field on the grid is generated by sampling from a normal distribution with standard
deviation equal to 𝜎 × grid resolution (i.e., the size of a grid cell). The parameter 𝜎 determines deformation
magnitude. To ensure a large deformation with diffeomorphism, the grid needs to be coarser than the input size (i.e.,
512 ×512). In this study, we applied two successive elastic deformations to each training image, with grid sizes of
9 × 9 and 17 × 17. As shown in Figure 1, when the deformation parameter 𝜎 is larger than 0.5, the generated images
and spine shapes are highly unrealistic.
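The sketch below illustrates this elastic deformation procedure under a few assumptions: a single-channel 512 × 512 image tensor, a coarse random displacement field upsampled with bicubic interpolation, and PyTorch's grid_sample for resampling. The function name and these details are illustrative choices, not the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F

def elastic_deform(image, grid_size=9, sigma=0.5):
    """One random elastic deformation of a (1, 1, H, W) image tensor.

    A coarse grid of random displacements with standard deviation
    sigma * grid-cell size (in normalized [-1, 1] coordinates) is upsampled
    to a dense flow field and used to resample the image.
    """
    _, _, h, w = image.shape
    cell = 2.0 / (grid_size - 1)                               # grid-cell size in normalized coords
    disp = torch.randn(1, 2, grid_size, grid_size) * sigma * cell
    disp = F.interpolate(disp, size=(h, w), mode='bicubic', align_corners=True)
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing='ij')
    identity = torch.stack((xs, ys), dim=-1).unsqueeze(0)      # (1, H, W, 2) identity grid
    grid = identity + disp.permute(0, 2, 3, 1)                 # displaced sampling grid
    return F.grid_sample(image, grid, mode='bilinear', align_corners=True)

# two successive deformations with 9x9 and 17x17 grids, as described in the text
image = torch.rand(1, 1, 512, 512)
augmented = elastic_deform(elastic_deform(image, grid_size=9), grid_size=17)
```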
Fig. 1. Data augmentation/synthesis examples using elastic deformation with 𝜎 ranging up to 2.0.
In this work, we developed a new method to synthesize lumbar spine MR images suitable for model training
and evaluation. First, we built a statistical shape model (SSM) of lumbar spine shapes (i.e., contours of discs and
vertebrae) in a dataset, and the SSM represents the probability distribution of lumbar spine shapes. We refer the
reader to the reference papers (Ambellan et al., 2019; Sarkalkan et al., 2014; Hufnagel et al., 2007; Davies et al.,
2003; Cootes et al., 1995) for the details of constructing an SSM. By sampling from the SSM, different lumbar
spine shapes can be generated, and each generated lumbar spine shape can be considered as coming from a virtual patient.
We note that the SSM technique has been used to generate virtual but realistic patient geometries in many
applications, such as generating aortic aneurysm geometries (Liang et al., 2017; van Veldhuizen et al., 2022;
Wiputra et al., 2023). Given a lumbar spine shape, if a lumbar spine MR image can be generated that is consistent with the shape, then we have a new sample with ground-truth annotation. For this purpose, we developed a biomechanics-
based method to generate a lumbar spine MR image 𝐼̃ from a lumbar spine shape 𝑆̃ by using a reference image 𝐼
with its ground-truth shape 𝑆. Intuitively speaking, a nonlinear spatial transform from the shape 𝑆 to the shape 𝑆̃ is
determined by using biomechanics principles, and then 𝐼̃ is obtained by applying the spatial transform to 𝐼. The
generated images are visually plausible, as shown in Figure 2.
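For readers unfamiliar with SSMs, a generic PCA-based construction and sampling step is sketched below, assuming the training shapes are already aligned and in point correspondence; it follows the cited SSM literature (e.g., Cootes et al., 1995) rather than the authors' specific implementation, and the function names are our own.

```python
import numpy as np

def build_ssm(shapes):
    """Build a PCA-based statistical shape model.

    shapes: (n_samples, 2 * n_points) array of aligned, corresponded contour points.
    Returns the mean shape, principal modes, and per-mode variances (eigenvalues).
    """
    mean = shapes.mean(axis=0)
    _, s, modes = np.linalg.svd(shapes - mean, full_matrices=False)
    variances = (s ** 2) / (shapes.shape[0] - 1)
    return mean, modes, variances

def sample_shape(mean, modes, variances, n_modes=10, rng=None):
    """Sample a plausible 'virtual patient' shape: mean + sum_k b_k * sqrt(var_k) * mode_k."""
    rng = rng or np.random.default_rng()
    b = rng.standard_normal(n_modes)
    return mean + (b * np.sqrt(variances[:n_modes])) @ modes[:n_modes]
```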
In the implementation, we obtain the spatial transform 𝑇 from 𝑆̃ to 𝑆, and apply the spatial transform 𝑇 to
a regular mesh grid around the shape 𝑆̃ to obtain a deformed grid in the space of the reference image 𝐼, and then 𝐼̃
is obtained by interpolating pixel values of 𝐼 at each node of the deformed grid, i.e., 𝐼̃(𝑥, 𝑦) = 𝐼(𝑇(𝑥, 𝑦)) where
(𝑥, 𝑦) denotes a 2D spatial point and 𝑇(𝑥, 𝑦) is the transformed point. By using biomechanics and finite element
analysis (FEA), the spatial transform, i.e., the deformation field on the mesh grid, is obtained by minimizing the
following energy/loss function Π:
Π = ∫_V Ψ dV + λ ∙ avg_i ‖T(S̃(i)) − S(i)‖^4    (1)
In the above Eq.(1), 𝑉 represents the undeformed mesh grid of the image 𝐼̃ to be generated, 𝑆̃(𝑖 ) represents the i-
th point location of the shape 𝑆̃, 𝑇(𝑆̃(𝑖 )) is the transformed point location that needs to be equal to 𝑆(𝑖 ), and ‖·‖
denotes vector L2 norm. 𝑎𝑣𝑔 is the average operator. 𝜆 is a weight constant (set to 16 in experiments). Ψ is the
strain energy density function that is determined by deformation and mechanical property of soft biological tissues
around the lumbar spine. From the perspective of FEA and biomechanics, Eq. (1) simulates the scenario in which,
under the external “force” proportional to 𝑇(𝑆̃(𝑖 )) − 𝑆(𝑖 ) at each point of the lumbar spine shape, the soft biological
tissues of the human body will deform and reach an equilibrium state of minimum energy. To speed up the
optimization process, we use a deep neural network with sine activation functions to parameterize the transform T,
i.e., 𝑇(𝑥, 𝑦) = 𝐷𝑁𝑁(𝑥, 𝑦), and then the energy function in Eq.(1) becomes a function of the DNN internal
parameters. The energy optimization problem is resolved by adjusting/optimizing the parameters of the DNN. Once
the optimization is done, the deformation field is obtained and then the image 𝐼̃ is generated. Since our goal is to
generate plausible images for model training and evaluation in the image segmentation tasks, not for patient-specific
FEA simulation of human body deformation, we made an assumption about the strain energy density function to
reduce computation cost: tissue mechanical behavior follows the Ogden hyperelastic model with homogeneous
tissue properties (Ogden and Hill, 1997; Dwivedi et al., 2022). The whole procedure is implemented by using our
newly developed PyTorch-FEA library for large deformation biomechanics (Liang et al., 2023).
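The sketch below illustrates the three ingredients described above under simplifying assumptions: a sine-activated MLP that parameterizes T, the landmark-matching term of Eq. (1), and warping of the reference image on a deformed sampling grid. The strain-energy term ∫_V Ψ dV is omitted because it relies on the Ogden material model and the FEA machinery of PyTorch-FEA; all class and function names here are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SineMLP(nn.Module):
    """Small MLP with sine activations that parameterizes the transform T(x, y)."""
    def __init__(self, hidden=128, depth=3):
        super().__init__()
        dims = [2] + [hidden] * depth + [2]
        self.layers = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))

    def forward(self, xy):                              # xy: (N, 2) points in [-1, 1]
        h = xy
        for layer in self.layers[:-1]:
            h = torch.sin(layer(h))
        return xy + self.layers[-1](h)                  # T(x, y) = (x, y) + displacement

def landmark_loss(T, s_tilde, s, lam=16.0):
    """Landmark term of Eq. (1): lam * avg_i ||T(S~(i)) - S(i)||^4 (exponent as reconstructed)."""
    return lam * ((T(s_tilde) - s).norm(dim=-1) ** 4).mean()

def warp_reference(ref_image, T, h=512, w=512):
    """Generate I~ by sampling the reference image I at the transformed grid points."""
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing='ij')
    pts = torch.stack((xs, ys), dim=-1).reshape(-1, 2)
    grid = T(pts).reshape(1, h, w, 2)                   # deformed sampling grid
    return F.grid_sample(ref_image, grid, mode='bilinear', align_corners=True)
```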
Fig. 2. Data augmentation/synthesis examples (a-f) using our method. Please zoom in for better visualization.
𝑝̂𝑚 (𝑖, 𝑗) is the m-th element in the output tensor from the softmax layer at the pixel location (𝑖, 𝑗), which corresponds
to the m-th object (a disc or a vertebra) at the location (𝑖, 𝑗). 𝑦(𝑖, 𝑗) is the true label of the pixel at location (𝑖, 𝑗).
𝑤𝑚 is a nonnegative weight inversely proportional to the area of the m-th object, and ∑𝑚 𝑤𝑚 = 1. 𝜖 is a small
constant (1e-4) to prevent the case of 0/0 in Eq.(3). In a lumbar spine image, where the background area is
substantially larger than the combined area of the discs and vertebrae, employing area-weighted cross entropy loss
effectively reduces the influence of the background in the loss function.
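A possible PyTorch form of the area-weighted cross entropy described above is sketched below: the per-class weights are inversely proportional to object area and normalized to sum to one, with ε guarding the 0/0 case. This is only an assumption-laden illustration of the weighting idea; the exact form of the paper's Eq. (2)/(3) may differ.

```python
import torch
import torch.nn.functional as F

def area_weighted_cross_entropy(p_hat, y, eps=1e-4):
    """Area-weighted cross entropy for one image (an illustrative sketch).

    p_hat: (M, H, W) softmax probabilities p^_m(i, j) for M objects/background.
    y:     (H, W) integer label map y(i, j).
    Weights w_m are inversely proportional to each object's area and sum to 1;
    eps guards against empty classes (the 0/0 case mentioned in the text).
    """
    m = p_hat.shape[0]
    one_hot = F.one_hot(y, m).permute(2, 0, 1).float()          # (M, H, W)
    area = one_hot.sum(dim=(1, 2))                              # per-class pixel count
    w = 1.0 / (area + eps)
    w = w / w.sum()                                             # sum_m w_m = 1
    ce = -(one_hot * torch.log(p_hat + eps)).sum(dim=(1, 2)) / (area + eps)
    return (w * ce).sum()
```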
4. Experiments
In the instance segmentation evaluation, we assess model segmentation performance for individual lumbar
spine instances/objects in the MR images. The instance segmentation task is formulated as a task of labeling 12
distinct objects (5 lumbar discs, 6 vertebrae, and a background). The input to each model is a single-channel mid-
sagittal lumbar spine MR image with size of 512 × 512 pixels. In the segmentation output, each class is represented
by a distinct channel as a binary segmentation map. We employ both the Dice Similarity Coefficient and the 95%
Hausdorff Distance (HD95) as evaluation metrics. In both experiments, the original test set with 20 samples and
the augmented test set with 2500 samples are used separately for model performance assessment on unseen data.
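For reference, the two metrics can be computed per binary object mask roughly as follows; this is a simple NumPy/SciPy sketch that assumes isotropic pixel spacing, whereas dedicated packages (e.g., MedPy, MONAI) provide equivalent implementations.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def dice(pred, gt):
    """Dice Similarity Coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + 1e-8)

def hd95(pred, gt, spacing=1.0):
    """95% Hausdorff Distance between the boundaries of two binary masks."""
    def boundary(mask):
        mask = mask.astype(bool)
        return np.argwhere(mask & ~binary_erosion(mask))          # boundary pixel coordinates
    a, b = boundary(pred), boundary(gt)
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))   # pairwise boundary distances
    return spacing * np.percentile(np.concatenate([d.min(1), d.min(0)]), 95)
```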
Each model was trained on a Nvidia A6000 GPU with 48GB VRAM. During the training process, a batch
size of 6 was used for most of the models, except for training MedT, where a batch size of 2 was utilized due to its
large model size. The Adam optimizer with an initial learning rate of 0.0001 was employed for model optimization.
A low learning rate is generally preferred to ensure stable convergence during training. Although a low learning
rate might slow down the convergence process, it helps avoid convergence failures. Gradient clipping is applied
during training to prevent potentially large gradients from causing instability in the learning process. We performed
model selection based on the performance on the validation set.
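A minimal training loop consistent with these settings might look as follows; the clipping threshold, epoch count, and the evaluation callback used for model selection are placeholders rather than the authors' exact configuration.

```python
import torch

def train(model, train_loader, val_loader, loss_fn, eval_fn, epochs=100, device='cuda'):
    """Train with Adam (lr=1e-4), gradient clipping, and validation-based model selection."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    best_score, best_state = -1.0, None
    for _ in range(epochs):
        model.train()
        for image, label in train_loader:
            image, label = image.to(device), label.to(device)
            loss = loss_fn(model(image), label)
            optimizer.zero_grad()
            loss.backward()
            # clip gradients to keep potentially large updates from destabilizing training
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
        score = eval_fn(model, val_loader)        # e.g., mean Dice on the validation set
        if score > best_score:                    # keep the best checkpoint
            best_score = score
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model
```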
4.3. Results of Experiment-A with the original training set
In experiment-A, the 15 models were trained on the original training set with elastic deformation and
random-shift, and then the models were evaluated on both the original test set and the augmented test set to measure
instance segmentation accuracy and translation robustness. Figure 3 displays the performance of the top 4 models.
Fig. 3. Top 4 Model Comparison Results (Dice) on the augmented test set
4.3.1. Instance Segmentation Evaluation
Table 2 summarizes the instance segmentation results for vertebral bodies (VB) and intervertebral discs
(IVD) in terms of the Dice Similarity Coefficient on the original test set consisting of 20 samples. For better clarity
and ease of understanding, we have converted the Dice Similarity Coefficient into percentages between 0 and 100%. The results show that Transformer-based models surpass the CNN-only models.
Table 2. Dice (the higher, the better) of each model on the original test set
L1 L2 L3 L4 L5 S1 D1 D2 D3 D4 D5 Average
93.126 96.813 93.874 95.805 94.380 92.194 92.757 94.299 92.821 89.777 91.077 93.357
Swin UNETR
±7.356 ±1.641 ±11.389 ±3.189 ±4.816 ±6.449 ±4.568 ±3.519 ±6.563 ±7.645 ±10.179 ±4.098
91.967 97.052 94.328 93.619 95.763 93.805 92.990 92.938 90.282 91.481 92.589 93.347
SLT-Net
±6.991 ±0.902 ±10.261 ±13.504 ±2.553 ±1.815 ±4.893 ±6.914 ±16.896 ±5.639 ±6.048 ±5.886
94.355 96.795 91.900 95.546 94.099 92.988 93.261 93.554 90.931 90.435 91.446 93.210
UNETR
±3.131 ±1.709 ±18.125 ±3.196 ±3.814 ±4.539 ±3.878 ±4.254 ±9.003 ±5.678 ±4.198 ±4.333
89.330 93.332 91.819 95.245 96.011 94.193 90.069 92.566 93.635 93.539 94.326 93.097
BianqueNet
±21.689 ±16.860 ±18.667 ±6.709 ±2.155 ±5.037 ±21.017 ±9.930 ±4.852 ±3.373 ±4.325 ±2.840
Inception- 91.397 97.335 91.926 93.837 91.156 91.592 93.533 93.281 89.739 87.727 91.233 92.069
SwinUnet ±19.415 ±0.812 ±20.366 ±12.637 ±18.678 ±8.410 ±4.614 ±7.055 ±20.753 ±14.249 ±11.216 ±9.514
94.240 94.116 90.123 91.649 93.456 92.662 92.948 88.941 89.008 89.158 91.986 91.662
HSNet
±12.384 ±10.544 ±22.355 ±21.243 ±11.688 ±9.901 ±7.903 ±21.43 ±21.266 ±17.099 ±12.085 ±13.085
90.839 93.514 91.781 91.781 90.076 92.197 91.638 90.000 89.755 87.864 88.554 90.727
Swin-Unet
±20.912 ±16.261 ±21.085 ±21.08 ±21.037 ±6.612 ±11.014 ±20.632 ±20.720 ±20.629 ±16.934 ±16.984
92.354 95.835 91.220 89.282 88.818 87.325 92.082 91.271 87.280 85.790 89.065 90.029
UNeXt
±15.512 ±4.808 ±21.036 ±21.93 ±18.828 ±14.428 ±9.072 ±16.774 ±21.834 ±19.574 ±17.579 ±12.849
92.245 87.526 88.178 91.605 91.219 89.705 81.444 86.309 88.812 88.060 90.294 88.672
TransUNet
±9.199 ±25.59 ±24.371 ±21.122 ±21.025 ±20.496 ±32.928 ±24.176 ±21.076 ±20.596 ±20.782 ±20.293
88.707 94.461 91.011 90.999 87.121 87.183 87.639 89.689 87.318 82.663 85.180 88.361
MedT
±20.984 ±8.57 ±20.989 ±20.981 ±25.111 ±15.317 ±20.944 ±19.384 ±20.279 ±23.855 ±22.131 ±18.641
81.322 82.886 85.159 88.802 90.776 87.080 77.511 78.711 83.856 85.608 89.312 84.639
UTNet
±33.18 ±31.442 ±29.06 ±24.539 ±20.988 ±14.823 ±32.691 ±32.650 ±28.478 ±21.674 ±18.307 ±23.716
75.346 78.997 81.862 80.907 86.107 86.968 80.058 80.407 76.787 81.144 88.063 81.513
UCTransNet
±31.945 ±28.28 ±26.04 ±29.371 ±22.629 ±20.202 ±25.635 ±22.833 ±30.266 ±25.322 ±19.596 ±20.303
Attention 74.333 74.288 68.732 76.499 86.964 85.474 73.700 68.943 68.092 80.147 87.694 76.806
U-Net ±35.642 ±33.164 ±36.905 ±29.188 ±22.583 ±25.111 ±37.113 ±37.476 ±35.719 ±25.685 ±21.912 ±25.335
68.734 66.998 74.033 71.468 81.297 83.980 65.450 71.473 67.208 79.834 84.529 74.091
UNet++
±40.68 ±35.694 ±27.911 ±40.229 ±31.709 ±27.527 ±39.937 ±28.932 ±40.117 ±28.708 ±26.387 ±26.461
77.421 74.309 66.317 66.828 76.870 84.644 75.710 70.025 58.563 72.553 75.217 72.587
MultiResUNet
±31.596 ±29.915 ±34.739 ±35.945 ±35.698 ±22.624 ±33.508 ±27.873 ±39.241 ±33.342 ±31.98 ±26.547
Table 3 shows the instance segmentation results measured by 95% Hausdorff distance (HD95) on the
original test set. The results show that Transformer-based models surpass the CNN-only models.
Table 3. HD95 (the lower, the better) of each model on the original test set.
L1 L2 L3 L4 L5 S1 D1 D2 D3 D4 D5 Average
8.594 11.024 2.992 3.147 3.759 9.454 19.057 1.894 3.466 3.814 8.128 6.848
Swin UNETR
±12.047 ±33.718 ±3.483 ±6.521 ±5.905 ±13.059 ±41.678 ±2.472 ±5.801 ±3.828 ±12.917 ±10.137
13.716 1.332 2.371 2.778 2.363 8.11 2.348 2.004 3.118 2.865 4.762 4.161
SLT-Net
±16.122 ±0.64 ±3.562 ±5.493 ±2.252 ±17.93 ±3.912 ±2.338 ±5.032 ±2.174 ±12.584 ±4.089
3.744 1.512 3.679 2.237 5.927 4.724 2.442 1.704 3.589 3.277 7.517 3.668
UNETR
±6.201 ±1.261 ±7.65 ±1.722 ±6.302 ±5.909 ±3.25 ±1.3 ±4.982 ±2.968 ±8.093 ±2.34
5.056 7.158 3.624 3.375 2.290 5.308 4.278 5.106 3.138 5.857 3.713 4.446
BianqueNet
±10.105 ±15.206 ±7.993 ±7.988 ±2.285 ±9.319 ±9.886 ±10.580 ±7.382 ±12.764 ±6.602 ±9.739
Inception- 3.378 1.197 3.5 3.033 3.362 6.39 7.53 2.051 1.59 3.957 5.715 3.791
SwinUnet ±7.275 ±0.555 ±8.067 ±6.334 ±3.809 ±10.21 ±25.967 ±3.129 ±0.876 ±4.714 ±9.957 ±4.134
3.006 3.634 4.121 3.94 5.547 5.196 2.869 5.378 4.145 5.966 5.045 4.441
HSNet
±6.714 ±8.238 ±8.874 ±9.34 ±11.36 ±10.482 ±7.442 ±11.527 ±9.551 ±11.862 ±11.035 ±7.376
2.131 2.226 1.753 1.971 2.767 4.732 2.221 2.937 1.679 2.08 6.339 2.803
Swin-Unet
±3.459 ±4.686 ±1.591 ±2.484 ±2.596 ±7.47 ±3.907 ±6.313 ±1.159 ±1.308 ±10.19 ±2.24
2.805 2.016 1.933 7.259 13.485 10.8 5.98 2.536 5.69 7.292 11.01 6.437
UNeXt
±4.959 ±2.768 ±1.542 ±19.018 ±17.479 ±15.286 ±14.131 ±4.028 ±15.364 ±13.081 ±15.926 ±8.162
7.727 8.556 9.261 3.836 4.372 7.595 8.968 8.662 3.876 5.176 4.15 6.562
TransUNet
±15.874 ±20.98 ±19.824 ±9.163 ±9.063 ±12.805 ±22.198 ±19.145 ±9.056 ±10.127 ±9.494 ±11.646
8.665 2.557 6.168 2.386 7.056 10.406 5.684 7.024 63.762 5.359 18.524 12.508
MedT
±14.016 ±2.564 ±11.39 ±2.751 ±10.149 ±10.254 ±10.784 ±11.272 ±83.727 ±8.133 ±33.337 ±7.994
6.897 7.209 10.986 5.092 8.523 20.203 7.421 13.133 11.486 10.463 8.903 10.029
UTNet
±12.365 ±12.726 ±16.485 ±10.427 ±13.468 ±35.303 ±13.484 ±18.251 ±17.583 ±17.173 ±13.608 ±12.359
11.554 21.176 30.833 12.752 11.034 9.536 20.424 23.576 20.578 14.814 7.444 16.702
UCTransNet
±13.053 ±20.644 ±45.698 ±15.773 ±15.015 ±15.319 ±32.647 ±27.95 ±22.595 ±16.837 ±13.146 ±12.286
Attention 9.394 13.821 16.946 21.225 10.944 7.474 13.878 15.805 28.632 10.027 7.124 14.116
U-Net ±12.866 ±17.43 ±18.222 ±32.217 ±15.876 ±13.279 ±18.345 ±19.732 ±35.708 ±15.161 ±13.263 ±13.022
6.733 16.872 22.11 15.901 12.515 9.518 9.862 21.599 19.863 10.369 9.929 14.116
UNet++
±11.95 ±17.754 ±20.232 ±19.52 ±16.298 ±14.037 ±14.455 ±20.421 ±26.463 ±16.036 ±15.288 ±11.826
8.841 17.292 19.035 15.814 9.93 9.582 9.63 18.331 22.223 10.918 10.632 13.839
MultiResUNet
±11.614 ±16.037 ±17.013 ±15.802 ±13.24 ±13.03 ±14.817 ±18.619 ±19.32 ±15.737 ±14.034 ±11.524
We also assessed the segmentation performance of all models on the augmented test set consisting of 2500
samples. Table 4 (Dice) and Table 5 (HD95) summarize the instance segmentation performance of each model
evaluated on the augmented test set.
Table 4. Dice (the higher, the better) of each model on the augmented test set
L1 L2 L3 L4 L5 S1 D1 D2 D3 D4 D5 Average
93.108 96.337 93.286 95.270 94.695 93.072 93.127 94.191 91.970 90.438 91.308 93.346
Swin UNETR
±8.399 ±3.289 ±14.544 ±5.054 ±4.905 ±5.52 ±5.447 ±4.227 ±9.192 ±6.799 ±6.215 ±4.983
91.925 96.736 93.889 92.538 94.557 92.437 92.712 93.833 89.858 90.419 91.149 92.732
SLT-Net
±11.161 ±2.401 ±12.742 ±18.37 ±8.863 ±8.045 ±7.199 ±4.888 ±18.676 ±11.831 ±11.399 ±9.07
93.839 96.260 91.773 94.984 94.052 89.817 93.039 93.865 90.486 90.317 89.976 92.583
UNETR
±6.152 ±3.454 ±18.475 ±5.13 ±5.003 ±10.051 ±3.998 ±4.332 ±9.81 ±5.791 ±6.801 ±5.384
92.103 95.329 94.670 95.486 96.003 95.191 92.275 93.724 93.752 93.384 94.187 94.191
BianqueNet
±16.133 ±10.126 ±10.097 ±7.397 ±2.653 ±2.992 ±13.814 ±7.138 ±6.614 ±5.339 ±4.917 ±8.996
Inception- 91.988 96.977 91.923 92.382 94.237 92.904 93.660 93.689 89.551 88.781 91.991 92.553
SwinUnet ±18.095 ±3.847 ±20.632 ±18.076 ±8.303 ±9.058 ±6.763 ±6.904 ±20.673 ±14.933 ±7.246 ±10.665
94.628 95.868 92.061 91.512 95.849 95.682 93.117 93.874 89.820 89.645 94.570 93.330
HSNet
±14.373 ±10.502 ±18.001 ±21.12 ±8.781 ±3.865 ±13.282 ±10.222 ±19.811 ±17.916 ±4.643 ±9.196
91.328 93.299 91.703 91.669 91.198 91.366 91.283 90.320 89.189 87.815 90.297 90.861
Swin-Unet
±18.901 ±17.946 ±21.094 ±21.131 ±18.97 ±12.268 ±14.972 ±20.105 ±20.739 ±20.844 ±13.743 ±17.347
91.280 95.036 91.013 88.953 91.170 89.812 91.255 92.186 87.131 87.645 90.288 90.525
UNeXt
±16.796 ±8.14 ±21.094 ±23.119 ±13.263 ±12.544 ±11.141 ±12.876 ±22.224 ±15.647 ±10.6 ±13.111
92.319 93.879 92.586 94.676 94.741 94.003 90.806 91.343 90.749 91.550 92.894 92.686
TransUNet
±16.2 ±16.636 ±17.395 ±10.604 ±7.245 ±5.979 ±18.041 ±16.735 ±16.481 ±10.169 ±6.709 ±10.357
86.861 92.629 91.421 89.039 88.147 84.398 86.124 89.722 86.538 83.335 83.075 87.390
MedT
±23.723 ±15.296 ±19.013 ±21.809 ±20.587 ±18.84 ±22.55 ±19.247 ±20.526 ±23.549 ±21.478 ±18.852
82.709 80.782 84.073 85.725 91.096 91.306 76.822 80.621 83.916 85.401 90.734 84.835
UTNet
±28.31 ±30.95 ±26.989 ±27.893 ±16.215 ±9.014 ±30.988 ±28.224 ±27.028 ±20.584 ±11.839 ±19.788
72.343 76.444 81.393 79.606 87.287 92.642 73.083 81.171 74.872 79.460 89.673 80.725
UCTransNet
±34.307 ±29.91 ±24.405 ±28.661 ±18.466 ±8.999 ±32.634 ±23.118 ±29.442 ±23.231 ±13.893 ±18.567
Attention 68.915 69.948 65.592 75.681 86.721 93.227 66.819 63.876 66.104 75.905 91.537 74.939
U-Net ±36.382 ±35.755 ±34.768 ±28.502 ±22.809 ±8.204 ±38.116 ±37.865 ±32.424 ±28.8 ±11.763 ±20.816
65.096 69.498 69.993 66.925 81.392 90.460 64.565 72.047 63.061 72.931 87.159 73.012
UNet++
±40.792 ±36.367 ±31.743 ±37.244 ±28.029 ±13.191 ±40.33 ±31.75 ±38.234 ±31.167 ±20.266 ±23.317
73.766 72.475 70.752 70.171 78.962 87.809 68.730 69.854 68.738 72.479 79.657 73.945
MultiResUNet
±33.559 ±32.877 ±32.574 ±35.186 ±32.165 ±18.254 ±36.759 ±34.213 ±34.452 ±33.345 ±28.326 ±25.091
Table 5. HD95 value (the lower, the better) of each model on the augmented test set
L1 L2 L3 L4 L5 S1 D1 D2 D3 D4 D5 Average
11.353 9.468 7.005 5.783 5.375 16.07 13.829 3.708 6.191 4.775 11.15 8.609
Swin UNETR
±33.189 ±31.478 ±23.16 ±17.828 ±11.093 ±35.564 ±40.509 ±15.61 ±18.392 ±11.302 ±17.265 ±15.102
8.464 2.537 3.463 3.078 3.766 5.409 3.96 2.376 3.835 4.023 5.751 4.242
SLT-Net
±17.817 ±9.812 ±9.614 ±8.708 ±9.767 ±15.909 ±15.752 ±5.825 ±13.919 ±10.257 ±11.967 ±7.772
5.873 2.63 4.111 2.506 5.014 10.871 2.772 2.103 4.111 3.735 6.404 4.557
UNETR
±10.615 ±6.155 ±7.566 ±3.004 ±7.516 ±19.237 ±4.803 ±3.727 ±6.555 ±4.708 ±8.943 ±4.305
5.555 5.767 3.398 3.173 3.318 3.540 5.303 5.818 2.231 2.868 3.384 4.032
BianqueNet
±10.365 ±13.823 ±8.156 ±6.849 ±7.518 ±11.313 ±11.274 ±15.604 ±4.499 ±6.468 ±7.982 ±10.045
Inception- 3.41 1.529 3.685 3.125 4.276 5.534 1.677 2.424 3.671 3.923 6.093 3.577
SwinUnet ±10.003 ±5.754 ±10.956 ±7.22 ±10.702 ±13.91 ±3.326 ±7.185 ±9.82 ±6.581 ±11.203 ±5.158
2.575 2.72 3.826 2.955 1.958 2.37 2.165 2.778 3.13 2.894 2.258 2.693
HSNet
±6.116 ±9.091 ±9.386 ±7.523 ±5.101 ±4.584 ±6.245 ±7.9 ±8.164 ±6.34 ±6.682 ±4.533
3.718 2.281 2.092 2.044 4.841 5.539 2.601 2.392 2.103 2.735 6.01 3.305
Swin-Unet
±12.477 ±9.182 ±5.183 ±5.193 ±11.639 ±9.335 ±10.854 ±9.554 ±4.736 ±5.601 ±10.775 ±4.993
7.212 4.233 4.913 7.77 11.646 9.462 8.457 3.848 7.327 7.806 11.651 7.666
UNeXt
±18.727 ±12.846 ±16.54 ±17.801 ±17.788 ±21.051 ±20.29 ±10.381 ±19.721 ±12.668 ±16.133 ±9.932
6.245 6.008 6.191 5.246 6.385 7.016 5.417 8.688 7.87 5.241 9.589 6.718
TransUNet
±18.622 ±18.599 ±17.579 ±18.663 ±20.503 ±19.462 ±18.977 ±23.985 ±21.105 ±14.506 ±24.032 ±13.318
8.115 3.878 4.385 5.283 10.009 13.935 6.139 4.428 72.075 8.635 12.997 13.625
MedT
±12.876 ±6.953 ±7.671 ±9.288 ±13.817 ±16.408 ±11.146 ±9.506 ±82.33 ±16.218 ±17.864 ±9.645
8.667 10.592 15.484 9.202 14.237 14.724 10.674 15.102 11.572 10.649 8.581 11.771
UTNet
±17.117 ±20.449 ±20.453 ±17.838 ±27.825 ±22.491 ±16.985 ±21.452 ±19.287 ±18.77 ±16.592 ±11.935
14.267 21.247 23.542 20.769 12.504 9.226 21.576 27.888 28.971 17.921 10.608 18.956
UCTransNet
±20.828 ±25.68 ±31.268 ±31.321 ±16.412 ±22.252 ±32.676 ±31.415 ±34.852 ±25.073 ±19.215 ±13.438
Attention 21.033 18.984 22.082 31.271 11.244 6.783 18.986 19 32.27 19.275 11.469 19.309
U-Net ±29.85 ±20.704 ±22.065 ±36.969 ±20.511 ±16.671 ±25.984 ±21.259 ±31.194 ±26.051 ±28.816 ±14.67
18.011 16.192 23.953 18.749 12.423 6.939 13.378 20.232 25.36 17.679 8.728 16.513
UNet++
±27.449 ±19.325 ±20.546 ±21.889 ±16.826 ±12.271 ±20.277 ±21.096 ±30.082 ±20.881 ±15.29 ±12.019
13.291 16.623 20.534 16.424 8.357 7.503 13.491 20.504 19.608 11.27 8.239 14.168
MultiResUNet
±19.988 ±17.923 ±20.18 ±18.717 ±12.915 ±11.828 ±20.581 ±23.767 ±22.555 ±16.494 ±13.843 ±11.371
Figure 4 shows segmentation examples of the 15 models. All of the models produce misclassifications or
fragmentation errors. For example, BianqueNet exhibits hollow holes in both D3 and D4. TransUnet incorrectly
identifies D3 as D2 and exhibits a significant segmentation error in the lower left corner of the MR scan. Also seen
in Figure 4, many models produce incorrect predictions around pixels in close proximity to the boundary of two
adjacent objects.
Fig. 4. Segmentation examples of the 15 models. IMG is the input image. GT indicates the Ground-Truth annotation.
Dice (green color) and HD95 (yellow color) are shown on top of each image. The 11 lumbar objects are shown in
different colors. Segmentation errors are indicated by pink arrows.
The augmented dataset offers a broader domain, enabling the models to acquire more diverse knowledge and
achieve improved segmentation performance. This increased resilience to data variations is a notable benefit of the
augmented training dataset. Increasing the complexity of the training dataset helps prevent overfitting and keeps models from memorizing the training data.
Table 8. Dice (the higher, the better) of each model on the augmented test set. The “change” is the Dice
difference between training on the augmented training set and training on the original training set
L1 L2 L3 L4 L5 S1 D1 D2 D3 D4 D5 Average Change
Swin 94.593 96.124 93.027 94.442 95.345 93.384 94.949 94.369 92.323 92.806 93.900 94.115
+0.769
UNETR ±8.113 ±7.352 ±17.824 ±12.723 ±6.840 ±6.406 ±3.461 ±7.674 ±13.660 ±5.711 ±4.372 ±9.581
90.928 96.708 92.836 92.604 96.143 95.545 91.135 95.435 91.254 90.982 95.427 93.545
SLT-Net +0.813
±21.398 ±5.097 ±18.462 ±19.937 ±4.279 ±2.669 ±18.062 ±3.448 ±17.130 ±15.676 ±2.972 ±14.102
96.099 96.878 93.981 94.798 95.509 94.832 94.930 94.486 92.957 92.616 93.974 94.642
UNETR +2.059
±2.574 ±2.479 ±11.711 ±8.922 ±3.888 ±2.951 ±2.159 ±5.341 ±8.036 ±4.935 ±3.926 ±6.081
93.381 95.339 93.786 96.670 96.789 95.666 92.427 93.723 94.040 94.184 95.276 94.662
BianqueNet +0.471
±16.166 ±12.916 ±13.760 ±4.308 ±2.243 ±2.302 ±16.211 ±11.633 ±7.633 ±4.318 ±3.268 ±10.231
Table 9. HD95 (the lower, the better) of each model on the augmented test set. The “change” is the HD95 difference between training on the augmented training set and training on the original training set
L1 L2 L3 L4 L5 S1 D1 D2 D3 D4 D5 Average Change
Swin 17.591 13.865 12.207 11.290 10.393 19.144 9.570 12.249 7.145 5.524 11.808 11.889
+3.280
UNETR ±52.428 ±44.911 ±42.917 ±38.870 ±35.903 ±52.802 ±35.018 ±43.617 ±30.825 ±21.039 ±30.508 ±40.236
3.720 1.577 3.227 1.899 2.012 1.871 2.344 1.387 3.100 2.584 1.602 2.302
SLT-Net -1.940
±8.774 ±3.310 ±7.838 ±5.205 ±2.264 ±2.843 ±6.312 ±2.514 ±8.190 ±4.752 ±2.714 ±5.549
1.923 1.774 2.679 3.029 3.056 3.340 1.329 1.788 2.348 2.640 2.455 2.396
UNETR -2.161
±2.598 ±3.164 ±4.416 ±7.859 ±5.502 ±5.462 ±1.5817 ±2.9369 ±3.860 ±3.108 ±3.079 ±4.341
3.815 2.907 3.849 2.605 2.268 2.667 3.383 4.257 3.149 2.061 2.860 3.075
BianqueNet -0.957
±9.985 ±8.092 ±10.217 ±9.389 ±7.335 ±10.589 ±9.089 ±11.508 ±10.065 ±5.126 ±13.087 ±9.734
5. Conclusion
In this paper, we present an evaluation of 15 DNN models for instance segmentation of lumbar spine MR
images, using the original dataset of 100 patients and the generated SSMSpine dataset of thousands of virtual
patients. We developed the SSM-biomechanics based data augmentation method to further improve model
performance by providing large and diverse datasets of synthetic images with ground-truth. Given that our
augmented datasets consist entirely of synthetic data, we have made our augmented dataset, SSMSpine, publicly
available. The results presented indicate that models trained on the augmented training set had comparable or even better performance than the same models trained on the original training set. This underscores that our data augmentation method can generate synthetic data that eliminates privacy concerns while remaining in the same image domain.
Our current study mainly focused on the mid-sagittal lumbar spine MR images for two major reasons. First,
as shown in a clinical study (Hu et al., 2018), the mid-sagittal image of a patient provides the most useful
information for the diagnosis of lumbar spine degeneration. Second, the slice thickness of a lumbar MR scan in the
sagittal direction is often much larger than 5 mm, which makes it difficult to create accurate 3D ground-truth
annotation for model training. Nevertheless, the models could be directly extended to handle 3D images once the
slice thickness becomes acceptably small with the advancement of imaging technology.
The 15 DNN models often generate segmentation artifacts, such as: (1) extra areas not belonging to any
discs or vertebrae, (2) assigning the same class label to two different discs, and (3) broken areas of a disc or vertebra.
Thus, new models are needed for artifact-free geometry reconstruction of lumbar spine from MR images.
References
Ambellan, F., Lamecker, H., von Tycowicz, C., Zachow, S., 2019. Statistical Shape Models: Understanding and
Mastering Variation in Anatomy, in: Rea, P.M. (Ed.), Biomedical Visualisation : Volume , Advances in
Experimental Medicine and Biology. Springer International Publishing, Cham, pp. 67–84.
https://doi.org/10.1007/978-3-030-19385-0_5
Berhane, H., Scott, M., Elbaz, M., Jarvis, K., McCarthy, P., Carr, J., Malaisrie, C., Avery, R., Barker, A.J.,
Robinson, J.D., Rigsby, C.K., Markl, M., 2020. Fully automated 3D aortic segmentation of 4D flow MRI
for hemodynamic analysis using deep learning. Magn. Reson. Med. 84, 2204–2218.
https://doi.org/10.1002/mrm.28257
Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M., 2023. Swin-Unet: Unet-Like Pure
Transformer for Medical Image Segmentation, in: Karlinsky, L., Michaeli, T., Nishino, K. (Eds.),
Computer Vision – ECCV 2022 Workshops, Lecture Notes in Computer Science. Springer Nature
Switzerland, Cham, pp. 205–218. https://doi.org/10.1007/978-3-031-25066-8_9
Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y., 2021. TransUNet:
Transformers Make Strong Encoders for Medical Image Segmentation.
https://doi.org/10.48550/arXiv.2102.04306
Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018. Encoder-Decoder with Atrous Separable
Convolution for Semantic Image Segmentation. Presented at the Proceedings of the European Conference
on Computer Vision (ECCV), pp. 801–818.
Chevrefils, C., Chériet, F., Grimard, G., Aubin, C.-E., 2007. Watershed Segmentation of Intervertebral Disk and
Spinal Canal from MRI Images, in: Kamel, M., Campilho, A. (Eds.), Image Analysis and Recognition,
Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 1017–1027.
https://doi.org/10.1007/978-3-540-74260-9_90
Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J., 1995. Active Shape Models-Their Training and Application.
Comput. Vis. Image Underst. 61, 38–59. https://doi.org/10.1006/cviu.1995.1004
Cox, M., Serra, R., Shapiro, I., Risbud, M., 2014. The Intervertebral Disc: Molecular and Structural Studies of the
Disc in Health and Disease.
Davies, R.H., Twining, C.J., Daniel Allen, P., Cootes, T.F., Taylor, C.J., 2003. Building optimal 2D statistical
shape models. Image Vis. Comput., British Machine Vision Computing 2001 21, 1171–1182.
https://doi.org/10.1016/j.imavis.2003.09.003
Dong, X., Bao, J., Chen, Dongdong, Zhang, W., Yu, N., Yuan, L., Chen, Dong, Guo, B., 2022. CSWin
Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows.
https://doi.org/10.48550/arXiv.2107.00652
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An Image is Worth 16x16 Words:
Transformers for Image Recognition at Scale. https://doi.org/10.48550/arXiv.2010.11929
Dufter, P., Schmitt, M., Schütze, H., 2022. Position Information in Transformers: An Overview. Comput.
Linguist. 48, 733–763. https://doi.org/10.1162/coli_a_00445
Dwivedi, K.Kr., Lakhani, P., Kumar, S., Kumar, N., 2022. A hyperelastic model to capture the mechanical
behaviour and histological aspects of the soft tissues. J. Mech. Behav. Biomed. Mater. 126, 105013.
https://doi.org/10.1016/j.jmbbm.2021.105013
Feng, K., Ren, L., Wang, G., Wang, H., Li, Y., 2022. SLT-Net: A codec network for skin lesion segmentation.
Comput. Biol. Med. 148, 105942. https://doi.org/10.1016/j.compbiomed.2022.105942
Galbusera, F., Casaroli, G., Bassani, T., 2019. Artificial intelligence and machine learning in spine research. JOR
SPINE 2, e1044. https://doi.org/10.1002/jsp2.1044
Gao, Y., Zhou, M., Metaxas, D.N., 2021. UTNet: A Hybrid Transformer Architecture for Medical Image
Segmentation, in: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C.
(Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Lecture Notes in
Lin, T., Wang, Y., Liu, X., Qiu, X., 2022. A survey of transformers. AI Open 3, 111–132.
https://doi.org/10.1016/j.aiopen.2022.10.001
Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A.W.M., van
Ginneken, B., Sánchez, C.I., 2017. A survey on deep learning in medical image analysis. Med. Image
Anal. 42, 60–88. https://doi.org/10.1016/j.media.2017.07.005
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin Transformer: Hierarchical
Vision Transformer using Shifted Windows. https://doi.org/10.48550/arXiv.2103.14030
Mallio, C.A., Vadalà, G., Russo, F., Bernetti, C., Ambrosio, L., Zobel, B.B., Quattrocchi, C.C., Papalia, R.,
Denaro, V., 2022. Novel Magnetic Resonance Imaging Tools for the Diagnosis of Degenerative Disc
Disease: A Narrative Review. Diagnostics 12, 420. https://doi.org/10.3390/diagnostics12020420
Moccia, S., De Momi, E., El Hadji, S., Mattos, L.S., 2018. Blood vessel segmentation algorithms — Review of
methods, datasets and evaluation metrics. Comput. Methods Programs Biomed. 158, 71–91.
https://doi.org/10.1016/j.cmpb.2018.02.001
Noothout, J.M.H., Vos, B.D. de, Wolterink, J.M., Išgum, I., 2018. Automatic segmentation of thoracic aorta
segments in low-dose chest CT, in: Medical Imaging 2018: Image Processing. Presented at the Medical
Imaging 2018: Image Processing, SPIE, pp. 446–451. https://doi.org/10.1117/12.2293114
Ogden, R.W., Hill, R., 1997. Large deformation isotropic elasticity – on the correlation of theory and experiment
for incompressible rubberlike solids. Proc. R. Soc. Lond. Math. Phys. Sci. 326, 565–584.
https://doi.org/10.1098/rspa.1972.0026
Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla,
N.Y., Kainz, B., Glocker, B., Rueckert, D., 2018. Attention U-Net: Learning Where to Look for the
Pancreas. https://doi.org/10.48550/arXiv.1804.03999
Pang, S., Pang, C., Su, Z., Lin, L., Zhao, L., Chen, Y., Zhou, Y., Lu, H., Feng, Q., 2022. DGMSNet: Spine
segmentation for MR image by a detection-guided mixed-supervised segmentation network. Med. Image
Anal. 75, 102261. https://doi.org/10.1016/j.media.2021.102261
Pepe, A., Li, J., Rolf-Pissarczyk, M., Gsaxner, C., Chen, X., Holzapfel, G.A., Egger, J., 2020. Detection,
segmentation, simulation and visualization of aortic dissections: A review. Med. Image Anal. 65, 101773.
https://doi.org/10.1016/j.media.2020.101773
Pu, Y., Zhang, Q., Qian, C., Zeng, Q., Li, N., Zhang, L., Zhou, S., Zhao, G., 2023. Semi-supervised segmentation
of coronary DSA using mixed networks and multi-strategies. Comput. Biol. Med. 156, 106493.
https://doi.org/10.1016/j.compbiomed.2022.106493
Roberts, S., Gardner, C., Jiang, Z., Abedi, A., Buser, Z., Wang, J.C., 2021. Analysis of trends in lumbar disc
degeneration using kinematic MRI. Clin. Imaging 79, 136–141.
https://doi.org/10.1016/j.clinimag.2021.04.028
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional Networks for Biomedical Image
Segmentation, in: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (Eds.), Medical Image Computing
and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science. Springer
International Publishing, Cham, pp. 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
Sarkalkan, N., Weinans, H., Zadpoor, A.A., 2014. Statistical shape and appearance models of bones. Bone 60,
129–140. https://doi.org/10.1016/j.bone.2013.12.006
Sekuboyina, A., Kukačka, J., Kirschke, J.S., Menze, B.H., Valentinitsch, A., 2018. Attention-Driven Deep
Learning for Pathological Spine Segmentation, in: Glocker, B., Yao, J., Vrtovec, T., Frangi, A., Zheng, G.
(Eds.), Computational Methods and Clinical Applications in Musculoskeletal Imaging, Lecture Notes in
Computer Science. Springer International Publishing, Cham, pp. 108–119. https://doi.org/10.1007/978-3-
319-74113-0_10
Shamshad, F., Khan, S., Zamir, S.W., Khan, M.H., Hayat, M., Khan, F.S., Fu, H., 2023. Transformers in medical
imaging: A survey. Med. Image Anal. 88, 102802. https://doi.org/10.1016/j.media.2023.102802
Shaw, P., Uszkoreit, J., Vaswani, A., 2018. Self-Attention with Relative Position Representations.
https://doi.org/10.48550/arXiv.1803.02155
Shen, D., Wu, G., Suk, H.-I., 2017. Deep Learning in Medical Image Analysis. Annu. Rev. Biomed. Eng. 19,
221–248. https://doi.org/10.1146/annurev-bioeng-071516-044442
Si, C., Yu, W., Zhou, P., Zhou, Y., Wang, X., Yan, S., 2022. Inception Transformer.
https://doi.org/10.48550/arXiv.2205.12956
Sirinukunwattana, K., Pluim, J.P.W., Chen, H., Qi, X., Heng, P.-A., Guo, Y.B., Wang, L.Y., Matuszewski, B.J.,
Bruni, E., Sanchez, U., Böhm, A., Ronneberger, O., Cheikh, B.B., Racoceanu, D., Kainz, P., Pfeiffer, M.,
Urschler, M., Snead, D.R.J., Rajpoot, N.M., 2017. Gland segmentation in colon histology images: The
glas challenge contest. Med. Image Anal. 35, 489–502. https://doi.org/10.1016/j.media.2016.08.008
Soomro, T.A., Afifi, A.J., Zheng, L., Soomro, S., Gao, J., Hellwich, O., Paul, M., 2019. Deep Learning Models
for Retinal Blood Vessels Segmentation: A Review. IEEE Access 7, 71696–71717.
https://doi.org/10.1109/ACCESS.2019.2920616
Suganyadevi, S., Seethalakshmi, V., Balasamy, K., 2022. A review on deep learning in medical image analysis.
Int. J. Multimed. Inf. Retr. 11, 19–38. https://doi.org/10.1007/s13735-021-00218-1
Tamagawa, S., Sakai, D., Nojiri, H., Sato, M., Ishijima, M., Watanabe, M., 2022. Imaging Evaluation of
Intervertebral Disc Degeneration and Painful Discs—Advances and Challenges in Quantitative MRI.
Diagnostics 12, 707. https://doi.org/10.3390/diagnostics12030707
Tao, R., Liu, W., Zheng, G., 2022. Spine-transformers: Vertebra labeling and segmentation in arbitrary field-of-
view spine CTs via 3D transformers. Med. Image Anal. 75, 102258.
https://doi.org/10.1016/j.media.2021.102258
Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., Patel, V.M., 2021. Medical Transformer: Gated Axial-Attention for
Medical Image Segmentation, in: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng,
Y., Essert, C. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2021,
Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 36–46.
https://doi.org/10.1007/978-3-030-87193-2_4
Valanarasu, J.M.J., Patel, V.M., 2022. UNeXt: MLP-Based Rapid Medical Image Segmentation Network, in:
Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (Eds.), Medical Image Computing and Computer
Assisted Intervention – MICCAI 2022, Lecture Notes in Computer Science. Springer Nature Switzerland,
Cham, pp. 23–33. https://doi.org/10.1007/978-3-031-16443-9_3
van Veldhuizen, W.A., Schuurmann, R.C.L., IJpma, F.F.A., Kropman, R.H.J., Antoniou, G.A., Wolterink, J.M.,
de Vries, J.-P.P.M., 2022. A Statistical Shape Model of the Morphological Variation of the Infrarenal
Abdominal Aortic Aneurysm Neck. J. Clin. Med. 11, 1687. https://doi.org/10.3390/jcm11061687
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017.
Attention is All you Need, in: Advances in Neural Information Processing Systems. Curran Associates,
Inc.
Wang, B., Qin, J., Lv, L., Cheng, M., Li, L., Xia, D., Wang, S., 2023. MLKCA-Unet: Multiscale large-kernel
convolution and attention in Unet for spine MRI segmentation. Optik 272, 170277.
https://doi.org/10.1016/j.ijleo.2022.170277
Wang, H., Cao, P., Wang, J., Zaiane, O.R., 2022. UCTransNet: Rethinking the Skip Connections in U-Net from a
Channel-Wise Perspective with Transformer. Proc. AAAI Conf. Artif. Intell. 36, 2441–2449.
https://doi.org/10.1609/aaai.v36i3.20144
Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.-C., 2020. Axial-DeepLab: Stand-Alone Axial-
Attention for Panoptic Segmentation, in: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (Eds.),
Computer Vision – ECCV 2020, Lecture Notes in Computer Science. Springer International Publishing,
Cham, pp. 108–126. https://doi.org/10.1007/978-3-030-58548-8_7
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L., 2022. PVT v2: Improved
Baselines with Pyramid Vision Transformer. Comput. Vis. Media 8, 415–424.
https://doi.org/10.1007/s41095-022-0274-8
Wiputra, H., Matsumoto, S., Wagenseil, J.E., Braverman, A.C., Voeller, R.K., Barocas, V.H., 2023. Statistical
shape representation of the thoracic aorta: accounting for major branches of the aortic arch. Comput.
Methods Biomech. Biomed. Engin. 26, 1557–1571. https://doi.org/10.1080/10255842.2022.2128672
Wu, K., Peng, H., Chen, M., Fu, J., Chao, H., 2021. Rethinking and Improving Relative Position Encoding for
Vision Transformer. Presented at the Proceedings of the IEEE/CVF International Conference on
Computer Vision, pp. 10033–10041.
You, X., Gu, Y., Liu, Y., Lu, S., Tang, X., Yang, J., 2022. EG-Trans3DUNet: A Single-Staged Transformer-
Based Model for Accurate Vertebrae Segmentation from Spinal Ct Images, in: 2022 IEEE 19th
International Symposium on Biomedical Imaging (ISBI). Presented at the 2022 IEEE 19th International
Symposium on Biomedical Imaging (ISBI), pp. 1–5. https://doi.org/10.1109/ISBI52829.2022.9761551
Zhang, L., Yang, J., Liu, D., Zhang, F., Nie, S., Tan, Y., Guo, T., 2022. Spine X-ray Image Segmentation Based
on Transformer and Adaptive Optimized Postprocessing, in: 2022 IEEE 2nd International Conference on
Software Engineering and Artificial Intelligence (SEAI). Presented at the 2022 IEEE 2nd International
Conference on Software Engineering and Artificial Intelligence (SEAI), pp. 88–92.
https://doi.org/10.1109/SEAI55746.2022.9832144
Zhang, W., Fu, C., Zheng, Y., Zhang, F., Zhao, Y., Sham, C.-W., 2022. HSNet: A hybrid semantic network for
polyp segmentation. Comput. Biol. Med. 150, 106173.
https://doi.org/10.1016/j.compbiomed.2022.106173
Zhang, Z., Zhang, W., 2022. Pyramid Medical Transformer for Medical Image Segmentation.
https://doi.org/10.48550/arXiv.2104.14702
Zheng, H.-D., Sun, Y.-L., Kong, D.-W., Yin, M.-C., Chen, J., Lin, Y.-P., Ma, X.-F., Wang, H.-S., Yuan, G.-J.,
Yao, M., Cui, X.-J., Tian, Y.-Z., Wang, Y.-J., 2022. Deep learning-based high-accuracy quantitation for
lumbar intervertebral disc degeneration from MRI. Nat. Commun. 13, 841.
https://doi.org/10.1038/s41467-022-28387-5
Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J., 2018. UNet++: A Nested U-Net Architecture for
Medical Image Segmentation, in: Stoyanov, D., Taylor, Z., Carneiro, G., Syeda-Mahmood, T., Martel, A.,
Maier-Hein, L., Tavares, J.M.R.S., Bradley, A., Papa, J.P., Belagiannis, V., Nascimento, J.C., Lu, Z.,
Conjeti, S., Moradi, M., Greenspan, H., Madabhushi, A. (Eds.), Deep Learning in Medical Image
Analysis and Multimodal Learning for Clinical Decision Support, Lecture Notes in Computer Science.
Springer International Publishing, Cham, pp. 3–11. https://doi.org/10.1007/978-3-030-00889-5_1