Generalizing Deep Learning For Medical Image Segmentation To Unseen Domains Via Deep Stacked Transformation
Abstract — Recent advances in deep learning for medical image segmentation demonstrate expert-level accuracy. However, application of these models in clinically realistic environments can result in poor generalization and decreased accuracy, mainly due to the domain shift across different hospitals, scanner vendors, imaging protocols, and patient populations, etc. Common transfer learning and domain adaptation techniques are proposed to address this bottleneck. However, these solutions require data (and annotations) from the target domain to retrain the model, and are therefore restrictive in practice for widespread model deployment. Ideally, we wish to have a trained (locked) model that can work uniformly well across unseen domains without further training. In this paper, we propose a deep stacked transformation approach for domain generalization. Specifically, a series of n stacked transformations are applied to each image during network training. The underlying assumption is that the "expected" domain shift for a specific medical imaging modality could be simulated by applying extensive data augmentation on a single source domain, and consequently, a deep model trained on the augmented "big" data (BigAug) could generalize well on unseen domains. We exploit four surprisingly effective, but previously understudied, image-based characteristics for data augmentation to overcome the domain generalization problem. We train and evaluate the BigAug model (with n = 9 transformations) on three different 3D segmentation tasks (prostate gland, left atrial, left ventricle) covering two medical imaging modalities (MRI and ultrasound) involving eight publicly available challenge datasets. The results show that when training on relatively small datasets (n = 10∼32 volumes, depending on the size of the available datasets) from a single source domain: (i) BigAug models degrade an average of 11% (Dice score change) from source to unseen domain, substantially better than conventional augmentation (degrading 39%) and a CycleGAN-based domain adaptation method (degrading 25%), (ii) BigAug is better than "shallower" stacked transforms (i.e., those with fewer transforms) on unseen domains and demonstrates modest improvement over conventional augmentation on the source domain, (iii) after training with BigAug on one source domain, performance on an unseen domain is similar to training a model from scratch on that domain when using the same number of training samples. When training on large datasets (n = 465 volumes) with BigAug, (iv) application to unseen domains reaches the performance of state-of-the-art fully supervised models that are trained and tested on their source domains. These findings establish a strong benchmark for the study of domain generalization in medical imaging, and can be generalized to the design of highly robust deep segmentation models for clinical deployment.

Index Terms — Domain generalization, data augmentation, deep learning, medical image segmentation.

Manuscript received December 19, 2019; revised February 4, 2020; accepted February 8, 2020. Date of publication February 12, 2020; date of current version June 30, 2020. This work was supported in part by the NIH Center for Interventional Oncology and the Intramural Research Program of the NIH. NIH and NVIDIA have a cooperative research and development agreement. This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under Contract No. HHSN261200800001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. (Corresponding authors: Ling Zhang; Ziyue Xu.)
Ling Zhang was with Nvidia Corporation, Bethesda, MD 20814 USA. He is now with PAII Inc., Bethesda, MD 20817 USA (e-mail: [email protected]).
Xiaosong Wang, Dong Yang, Holger Roth, Andriy Myronenko, Daguang Xu, and Ziyue Xu are with Nvidia Corporation, Bethesda, MD 20814 USA (e-mail: [email protected]).
Thomas Sanford, Baris Turkbey, and Bradford J. Wood are with the National Institutes of Health Clinical Center, Bethesda, MD 20892 USA.
Stephanie Harmon is with the Clinical Research Directorate, Frederick National Laboratory for Cancer Research, National Cancer Institute, Bethesda, MD 20892 USA.
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMI.2020.2973595

I. INTRODUCTION

Successful clinical deployment of deep learning-based artificial intelligence (AI) models for medical imaging tasks requires a trained model to maintain a high level of accuracy when applied to unseen domains (i.e., different hospitals, scanner vendors, imaging protocols, patient populations, etc.) [1], as illustrated in Fig. 1. Ideally, highly generalizable models in medical imaging could be achieved when training datasets include a large quantity of high-quality images from multiple centers with diverse imaging vendors/protocols. Unfortunately, in current practice, datasets are often limited by the lack of annotations and the difficulty of data sharing among centers [2]. These limitations have led to scenarios where small training datasets which lack diversity fail to maintain their performance on data from "unseen" domains. For example, the error rate of a deep model for retinal image analysis was 5.5% on images from the same vendor used in the training dataset, but increased to 46.6% on images from another vendor [3]. This issue of poor generalizability has become one of the major roadblocks for deploying deep learning models into clinical practice [4].
Fig. 1. Medical image segmentation in the source and unseen domains (i.e., a specific medical imaging modality across different vendors, imaging
protocols, and patient populations, etc.) for (a) whole prostate MRI, (b) left atrial MRI, and (c) left ventricle ultrasound. The illustrated images are
processed with intensity normalization.
Given the limited quantity and quality of medical imaging data, it is infeasible to employ naive strategies that aggregate data from multiple source domains and impractical to train separate high-quality domain-specific (e.g., vendor-specific) models. Two popular solutions have been proposed to improve the generalizability of deep learning models trained on data from a single source domain. The first, transfer learning, is the process of fine-tuning a portion of a pre-trained network (usually the last few layers [5] or shared convolutional filters [6]). Transfer learning is able to overcome some of the aforementioned issues by only requiring a small amount of annotated data in the unseen domain; however, it is limited in use due to the lack of pre-trained models developed on a large amount of medical imaging data. A second solution, domain adaptation [7], aims to generalize to a known target domain whose annotations are unknown during model training. Generative adversarial networks (GANs) [8] and their variants (e.g., CycleGAN [9]) are frequently integrated into domain adaptation methods, by either learning domain-invariant features (seen in MRI [10], ultrasound [11], histopathology [12]) or translating image style between the source and target domains (used in X-ray [13], [14] and ultrasound [15]). Additionally, these methods have been used to model the imaging physics (e.g., estimating the T1-weighted pulse sequence) of the target MRI imaging domain, and by applying the model to create target-data-specific training features, an augmented deep model can be trained [16].

Despite their promising performance, the assumption of a known target domain requires that specific image samples be collected (or even labeled) and a new model be retrained before deployment. It is not feasible to obtain a pair of source and target domain images to implement adversarial domain adaptation for every new application. Therefore, model deployment using these types of techniques is impractical in diverse patient populations (e.g., multiple clinical centers) or unpredictable scenarios (e.g., emergency care or rural-area use of ultrasound).

Domain generalization, which refers to settings in which one has no access to any data from the unseen target domains, has the potential to overcome these issues. Particularly, in the field of medical imaging, we are usually faced with the difficult situation that the training dataset is derived from a single center and acquired on one vendor system with a specific protocol. Some non-deep-learning models have been shown to be robust to center-specific or vendor-specific variability. For example, by combining a mesh-based computational atlas with a Gaussian appearance model [17] or by Bayesian transfer learning [18], 2D brain MRI segmentation can be generalized to unseen domains with a certain accuracy. Including deep learning in the classical probabilistic generative model has also been proposed to improve 2D brain MRI segmentation on unseen domains [19]. In 2D computer vision applications with deep learning, researchers recently made progress in this highly challenging setting [20]–[22]. Their approaches, essentially, are data augmentations of various complexities used to expand the data distribution coverage (with higher variation). Specifically, additional training data samples are generated in the image domain [20], in semantic space [22], or by adversarial learning [21], respectively.

Data augmentation has proven to be among the most important regularization techniques related to deep learning's generalization performance [23]. It helps prevent models from overfitting to the training data and generalize better on the testing data. However, the majority of published work has focused on non-medical imaging data, and default augmentation settings are either derived from the same source in training and validation or do not consider the domain source at all [23]–[26]. In specific applications of medical image segmentation, image rotation and GAN-based augmentations have been shown to improve the performance in 2D data for both CT and MRI [27], as they can extrapolate and interpolate the manifold of data, respectively. Recently, we proposed a reinforcement learning-based searching approach for selecting necessary data augmentations in 3D medical image segmentation tasks [28]. However, implementing data augmentation methods, even ones optimal on the source domain, does not guarantee generalizability on data from unseen domains. Furthermore, while a large amount of medical imaging data is acquired in 3D, the majority of published work considers 2D data augmentation approaches due to the computational expense of augmentation on large 3D volumetric data.
Fig. 2. Examples of deep stacked transformations (BigAug) results on (a) whole prostate MRI, (b) left atrial MRI, and (c) left ventricle ultrasound.
1st row: ROIs randomly cropped in volumes from source domains; 2nd row: corresponding cropped ROIs after BigAug; 3rd row: ROIs randomly
cropped in volumes from unseen domains. The image pairs of 2nd–3rd rows have better visual similarity than 1st–3rd rows.
between [0.1, 1.0] to the image. The three image quality transformations are mainly based on a Gaussian function/filter, as a Gaussian distribution is commonly used to represent real-valued variables with unknown distributions. There exist many other specific functions/filters, such as speckle and Poisson noise, median filters, etc., which may improve the performance for particular imaging modalities. The image quality-based transforms are not applied to the annotations Y_S.

2) Image Appearance: The appearance difference of medical imaging is related to the statistical characteristics of image intensities, such as variations of brightness and contrast, and intensity perturbation, which result from different scanners and scanning protocols. Refer to the 1st and 3rd rows in Fig. 2 for the image appearance differences in MRI and ultrasound. To adjust the brightness of the image, we randomly shift the intensity level with magnitude ranging between [−0.1, 0.1]. To control the contrast of the image, we apply gamma correction with magnitude (gamma value) ranging between [0.5, 1.0] or [1.0, 4.5], where magnitude = 1 gives the original image and a smaller/larger value makes the image lighter/darker, respectively. Gamma correction is used in a highly competitive brain MRI segmentation algorithm [30], and contributes to robust segmentation performance across multiple hospitals [31]. To perturb image intensities, we multiply the image by a scale factor and add a shift factor, both with magnitude ranging between [−0.1, 0.1]. Such a method is a component of the state-of-the-art brain MRI segmentation algorithm [32]. The image appearance transforms are not applied to the annotations Y_S.

3) Spatial Configuration: Spatial variations may include rotation (e.g., caused by different patient orientations during scanning), scaling (e.g., variation of organ/lesion size), and deformation (e.g., caused by organ motion or abnormality). Refer to the 1st and 3rd rows in Fig. 2 (a–c) for the spatial variations in MRI and ultrasound. These operations are computationally expensive for large 3D volumetric data.1 A GPU-based acceleration approach [33] could be developed, but allocating the maximal capacity of GPU memory to model training only, with data augmentation performed on the fly, is more desirable. In addition, since the whole 3D volume does not fit into the limited memory of the GPU, sub-volume cropping is usually needed to feed data into the model during training. In this work, we develop an extremely efficient CPU-based spatial transform technique based on an open-source implementation,2 which first calculates the 3D coordinate grid of the sub-volume (with a size of w × h × d voxels), applies the transformations (combining random 3D rotation, scaling, deformation, and cropping) to that grid, and then performs image interpolation. We make further accelerations by only performing the interpolation within the minimal cuboid containing the 3D coordinate grid, so that the computational time is independent of the input volume size (i.e., it only depends on the cropped sub-volume size), and the spatial transform augmentation can be performed on the fly during training. The rotation and scaling are both performed along all three axes, and the magnitudes are controlled by a rotation degree ranging between [−20°, 20°] and by a scaling factor ranging between [0.4, 1.6], respectively. The deformation is achieved by sampling a grid of random offset vectors, which is smoothed by a Gaussian filter (standard deviation ranging between [10, 13]) and rescaled by a random factor (ranging between [0, 1000]). The spatial transforms are applied to both the data X_S and the annotations Y_S.

1 For example, a typical MR scan consisting of hundreds of 512×512 slices requires about 1 minute to perform all three spatial transform operations; then training 100 scans to convergence (usually requiring 300 epochs in our work) needs about 500 hours.
2 https://github.com/MIC-DKFZ/batchgenerators
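To make the appearance transforms concrete, the following is a minimal NumPy sketch using the magnitude ranges stated above. It assumes volumes already normalized to [0, 1] (Section III-C) and uses the per-transform application probability of 0.5 also given there; the function names and composition are ours, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng()

def random_brightness(img, shift_range=(-0.1, 0.1)):
    # Shift the intensity level of the whole (pre-normalized) volume.
    return img + rng.uniform(*shift_range)

def random_gamma(img, ranges=((0.5, 1.0), (1.0, 4.5))):
    # Gamma correction: gamma = 1 leaves the image unchanged,
    # smaller/larger gamma makes the image lighter/darker.
    gamma = rng.uniform(*ranges[rng.integers(len(ranges))])
    return np.clip(img, 0.0, 1.0) ** gamma

def random_intensity_perturbation(img, mag=0.1):
    # Multiply by a random scale and add a random shift, both in [-0.1, 0.1].
    return img * (1.0 + rng.uniform(-mag, mag)) + rng.uniform(-mag, mag)

def apply_appearance_transforms(img, p=0.5):
    # Each transform is applied independently with probability p (0.5 in BigAug).
    for transform in (random_brightness, random_gamma, random_intensity_perturbation):
        if rng.random() < p:
            img = transform(img)
    return img
```

Consistent with the text above, the transformed intensities are not renormalized afterwards, so values may fall slightly outside [0, 1].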
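The spatial transforms can be sketched with the same grid-then-interpolate idea described above: compute the coordinate grid of the output crop, rotate/scale/deform it, then interpolate once within the minimal cuboid that contains the grid. The sketch below is in the spirit of the batchgenerators implementation referenced above, not the authors' code; the crop-center sampling and boundary handling are simplifying assumptions, and only isotropic scaling is shown.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

rng = np.random.default_rng()

def _rotation_matrix(max_deg=20.0):
    # Random rotation about each axis, composed into a single 3x3 matrix.
    ax, ay, az = np.deg2rad(rng.uniform(-max_deg, max_deg, size=3))
    rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    return rz @ ry @ rx

def spatial_transform_crop(volume, label, crop_size=(96, 96, 32)):
    """Rotate/scale/deform the coordinate grid of one sub-volume,
    then interpolate only inside the minimal cuboid containing the grid."""
    # Coordinate grid of the output crop, centered at the origin.
    grid = np.stack(np.meshgrid(*[np.arange(s, dtype=np.float64) for s in crop_size],
                                indexing="ij"))
    grid -= (np.array(crop_size, dtype=np.float64) / 2.0).reshape(3, 1, 1, 1)
    # Random 3D rotation ([-20, 20] degrees) and scaling ([0.4, 1.6]) of the grid.
    grid = np.tensordot(_rotation_matrix() * rng.uniform(0.4, 1.6), grid, axes=1)
    # Elastic deformation: smoothed random offsets, rescaled by a random factor.
    sigma, alpha = rng.uniform(10, 13), rng.uniform(0, 1000)
    for ax in range(3):
        grid[ax] += gaussian_filter(rng.uniform(-1, 1, size=crop_size), sigma) * alpha
    # Random crop center inside the volume (assumes the volume is larger than the crop).
    center = [rng.integers(s // 4, 3 * s // 4) for s in volume.shape]
    grid += np.array(center, dtype=np.float64).reshape(3, 1, 1, 1)
    # Interpolate only within the minimal cuboid containing the grid, so the cost
    # depends on the crop size, not on the whole volume.
    lo = np.maximum(np.floor(grid.reshape(3, -1).min(1)).astype(int) - 1, 0)
    hi = np.minimum(np.ceil(grid.reshape(3, -1).max(1)).astype(int) + 2, volume.shape)
    local = grid - lo.reshape(3, 1, 1, 1)
    sub_img = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    sub_lab = label[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    img = map_coordinates(sub_img, local, order=1, mode="nearest")   # image: linear
    lab = map_coordinates(sub_lab, local, order=0, mode="nearest")   # labels: nearest
    return img, lab
```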
Note that instead of augmenting training images in such an explicit way, transformations (e.g., spatial) could be incorporated into the network learning process through approaches like Spatial Transformer Networks [24]. However, the learned invariances come from the source domain and may not generalize well to different unseen domains. The key idea of our BigAug is to use data augmentation to extrapolate the manifold of the source data, with the regularization of prior knowledge, to handle the domain shift in medical imaging.

B. 3D Deep Segmentation

We use AH-Net [29] as the backbone of our 3D segmentation network. AH-Net takes advantage of both 2D and 3D deep segmentation networks by transferring deep features learned from large-scale 2D images into a 3D encoder-decoder network. For training, the inputs are sub-volumes cropped from the whole volume and the outputs are the corresponding sub-volumes of segmentation masks with 1-channel annotations. To increase the variation of training data, sub-volumes are randomly cropped and equally distributed between foreground and background. We use the Dice loss [34] as the loss function, which naturally balances the positive and negative voxel distribution. In testing, a sliding window with overlap is applied to the whole 3D volume to generate the final 3D segmentation.
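For reference, a soft Dice loss of the kind introduced in V-Net [34] can be written as below; this is a minimal PyTorch sketch for a single foreground class, and the exact variant used here (e.g., smoothing constants or squared denominators) is not specified in the text.

```python
import torch

def soft_dice_loss(logits, target, eps=1e-5):
    # logits, target: (N, 1, W, H, D); target holds binary foreground masks.
    probs = torch.sigmoid(logits)
    dims = (1, 2, 3, 4)
    intersection = (probs * target).sum(dim=dims)
    denominator = probs.sum(dim=dims) + target.sum(dim=dims)
    dice = (2.0 * intersection + eps) / (denominator + eps)  # per-sample soft Dice
    return 1.0 - dice.mean()
```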
III. EXPERIMENTS

A. Experimental Design

3D medical imaging mainly includes CT, MRI, PET, ultrasound, and OCT. Therefore, we would like to evaluate the proposed method with various data from public resources, including the Medical Segmentation Decathlon (MSD), Grand Challenges in Biomedical Image Analysis, and recent MICCAI challenges.

Due to data availability and restrictions (only one public PET segmentation challenge is available, from a single center; an ideal public OCT dataset containing three vendors is available but can only be used for the challenge), and also to include sufficient image variability (CT imaging is fairly standardized to the Hounsfield scale, so domain shift is usually less of a concern), we decided to use MRI and ultrasound to illustrate the capabilities of the proposed method. The prostate MRI and heart MRI datasets from the MSD challenge are selected as the source domain data of our Task 1 and Task 2, respectively, because: 1) MSD [35] is a recent large-scale annotated medical image dataset that represents the state of the art in dataset quality; 2) moreover, more than two other public prostate MRI and heart MRI datasets with annotations can be found, serving as multiple unseen domains. In addition, the CETUS heart ultrasound dataset is selected for our Task 3, since it contains image data from the three major ultrasound vendors (i.e., GE, Philips, Siemens).

We first validate our method on three tasks as follows: Task 1: whole prostate segmentation in MRI volumes, Task 2: left atrial segmentation in MRI volumes, and Task 3: left ventricle segmentation in ultrasound volumes. Each model is trained and validated on a single source domain dataset with the same BigAug configuration, and applied/tested on 2–3 unseen domain sets. Second, we investigate the variation in model performance when the models are trained with a single augmentation transformation or a combination of several best-performing transformations. Third, we train deep models from scratch for whole prostate (WP), peripheral zone (PZ), and transition zone (TZ) segmentation in prostate MRI with different numbers of training data from a target domain, and compare their performance with the BigAug-augmented model. Finally, models for whole prostate segmentation in MRI are trained on a self-collected large dataset with and without BigAug, and applied to four unseen domains.

B. Datasets: Source vs. Unseen Domain

Task 1: Four publicly available 3D prostate MRI datasets are used: MSD-Prostate (MSD-P), PROMISE12 [36], NCI-ISBI13, and ProstateX [37]. MSD-P serves as the single source domain, and the others are different unseen domains.

Task 2: Three publicly available 3D heart MRI datasets are used: MSD-Heart (MSD-H), 2018 ASC, and MM-WHS [38]. MSD-H serves as the single source domain, and the others are different unseen domains.

Task 3: One publicly available 3D ultrasound dataset, CETUS, is used, where data are equally acquired from three different ultrasound vendors (i.e., GE, Philips, Siemens, 10 volumes each). We used heuristics to identify vendor association, but we acknowledge that our split strategy may include wrong associations (we recognized the different vendors by visually observing the CETUS image appearance, as vendor information is not provided: patients 1, 2, 8, 9, 13 are from vendor A; 3, 4, 12, 14, 15 from vendor B; and 5, 6, 7, 10, 11 from vendor C). Vendor A is used as the single source domain, and Vendors B and C serve as unseen domains.

All datasets have annotations provided by the data source, except for ProstateX, where no prostate segmentation is available and the annotations of both the peripheral zone (PZ) and transition zone (TZ) were provided by our radiologist collaborators. One patient's study in ProstateX was excluded due to a prior surgical procedure to resect a large portion of the TZ (transurethral resection of the prostate), which deformed the appearance of the prostate. Table I briefly summarizes the datasets.

In addition to the benchmark datasets, a large MRI dataset including 465 patients is used in the final experiment. Our collaborating radiologists collected 465 MRI studies (denoted as MultiCenter) from multiple medical centers worldwide, representing multiple MRI vendors (i.e., GE, Philips, Siemens) and various center-specific MRI protocols.

Dataset and challenge URLs:
MSD (MSD-P, MSD-H): http://medicaldecathlon.com/index.html
Grand Challenges in Biomedical Image Analysis: https://grand-challenge.org/challenges
PET segmentation challenge: https://portal.fli-iam.irisa.fr/petseg-challenge/overview
OCT challenge dataset: https://retouch.grand-challenge.org
CETUS: https://www.creatis.insa-lyon.fr/Challenge/CETUS/
PROMISE12: https://promise12.grand-challenge.org/
NCI-ISBI13: http://doi.org/10.7937/K9/TCIA.2015.zF0vlOPv
ProstateX: https://prostatex.grand-challenge.org/
2018 ASC: http://atriaseg2018.cardiacatlas.org/
MM-WHS: http://www.sdspeople.fudan.edu.cn/zhuangxiahai/0/mmwhs/
TABLE I
Datasets used in our experiment.
The whole prostate boundaries were manually traced in three planes on T2-weighted MRI by a radiologist with over 10 years of experience in the interpretation of prostate MRIs. A second radiologist with <1 year of experience in reading prostate MRI was trained under the supervision of the expert and performed segmentation in the same fashion using the same segmentation software. The segmentations from the expert radiologist were considered ground truth. This MultiCenter dataset serves as a large source of training data, and MSD-P, PROMISE12, NCI-ISBI13, and ProstateX are four unseen domains.

C. Implementation

This work is implemented using the NVIDIA Transfer Learning Toolkit for Medical Imaging (https://developer.nvidia.com/transfer-learning-toolkit). We first resample all the data in the source domain into a fixed resolution of 1.0 mm × 1.0 mm × 1.0 mm. Then, image intensities I are normalized to [0, 1] by (I − min)/(max − min), where min = 0, max = 2048 for all MRIs except the ASC dataset, and min = 0, max = 255 for ultrasound and ASC, which already range within [0, 255]. (This normalization worked better than normalizing to zero mean and unit standard deviation in a preliminary comparison.)
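As a small sketch, the preprocessing just described amounts to the following; the modality-dependent maximum follows the values above, and clipping to the valid range is our addition, not stated in the text.

```python
import numpy as np

def normalize_intensity(volume, modality="mri"):
    # Min-max normalization to [0, 1] with a modality-dependent maximum:
    # 2048 for MRI (except ASC), 255 for ultrasound and the ASC dataset.
    vmax = 2048.0 if modality == "mri" else 255.0
    return np.clip(volume / vmax, 0.0, 1.0)  # min is 0 for all datasets
```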
In BigAug, the probability of applying each transformation is set to 0.5; transformations are applied in the order described in Section II-A. Model performance is not sensitive to different orderings based on our preliminary experiments (prostate MRI segmentation): the generated images might differ slightly if the order of transformations is changed; however, considering the comprehensive changes after BigAug and the overall large amount of generated training samples, these differences tend to result in only minor differences in network performance. Image intensities are not renormalized after BigAug, as renormalization resulted in lower performance in empirical experiments. The cropped sub-volumes are of the following sizes: 96 × 96 × 32 (w × h × d) for Task 1, and 96 × 96 × 96 for Task 2 and Task 3. ResNet50 weights pretrained on ImageNet are used to initialize the encoder part of AH-Net. We use ADAM to optimize the network with an initial learning rate of 0.0001. Task 1 and Task 2 are trained on 4 GPUs of an NVIDIA DGX cluster, and Task 3 is trained on 1 NVIDIA Titan XP GPU, all using SGD and with a mini-batch size of 4 ROIs per GPU. Since randomness exists in the whole training process, each model is trained for 300 epochs on the source domain three times, and the model with the best performance on the validation set of the source domain is selected to be applied to unseen domains.

In model inference, the testing data is resampled to 1.0 mm × 1.0 mm × 1.0 mm and normalized to [0, 1], and the stride of the sliding window is (w − 16) × (h − 16) × (d − 16). For Task 1, since MSD-P has 2-class annotations for PZ and TZ, we first train a 2-class model and then combine the output into 1 class as the whole prostate after inference; only the T2-weighted image is used, as most unseen data only has T2.
to 49.8% on average. The major findings are:
batch size of 4 ROIs per GPU. Since randomness exists in the
(i) On average, across all tasks on unseen domains, BigAug
whole training process, each model is trained for 300 epochs
(Dice = 80.0%) performs substantially better than any one
on the source domain for three times, and the model with the
of the tested augmentations, and significantly better than the
best performance on the validation set of the source domain
baseline model (49.8%) and CycleGAN (63.5%). Using only
is selected to be applied to unseen domains.
simple random crop (baseline) does not generalize well on
In model inference, the testing data is resampled into
unseen datasets with Dice dropping as much as 40%, which
1.0mm × 1.0mm × 1.0mm and normalized to [0, 1], and the
supports the importance of data augmentation in general. It is
1 https://developer.nvidia.com/transfer-learning-toolkit surprising that the BigAug based domain generalization is even
2 This normalization works better than normalizing to zero mean and unit better than CycelGAN based domain adaptation which has
std. in this experiment in a preliminary comparison. seen the target domain.
TABLE II
The effect of BigAug and various augmentation methods on unseen domain generalization (measured with Dice scores). Source columns indicate the dataset used for training, and its Dice scores are validation Dice scores (using a split) for comparisons. Unseen columns list Dice results when applied to unseen datasets (of the model trained on the source). Here baseline refers to a random crop with no further augmentation. Top4 stands for the combination of the four best performing augmentations (sharpening, brightness, contrast, scaling). Supervised indicates the state-of-the-art literature results, when a model is trained and tested on the same dataset. ∗ indicates inter-observer variability. (MM-WHS results: sdspeople.fudan.edu.cn/zhuangxiahai/0/mmwhs17/result.html)
(ii) The major imaging differences caused by domain shift in MRI are image quality and appearance, of which sharpening is the most important one, followed by contrast, brightness, and intensity perturbation. Refer to Fig. 1 for some examples. Fig. 1(a) demonstrates that contrast and sharpening are the major differences of unseen A (PROMISE12) and unseen B (NCI-ISBI13), respectively, compared to the source image (MSD-P). Note that the spatial transforms seem to be less important for prostate MRI, but they contribute to transforming heart MRIs, where the shape, size, and orientation of the heart can be very different (refer to Fig. 1(b) and Fig. 2(b)). This is likely because the prostate is relatively static while the heart is a moving/beating object.

(iii) The imaging differences caused by domain shift across different ultrasound vendors are more comprehensive, and can be related to the spatial transform, image appearance, and quality, of which 3D scaling is the most important one, followed by brightness, blurring, and contrast. Refer to Fig. 1(c) for some examples: compared to the source image (CETUS-A), scaling and contrast are the major differences of unseen A (CETUS-B) and unseen B (CETUS-C), respectively. Spatial transformations contribute substantially to the heart ultrasound segmentation task, partially because the heart is a deformable object and different angles between the ultrasound probe and the heart can result in images with different rotation degrees. In addition, the size of the training dataset CETUS-A is small, not covering enough geometric variation.

(iv) For a specific unseen domain (e.g., ASC in Task 2), all settings with a single augmentation perform poorly (Dice lower than 12.7%), including CycleGAN (18.0%), which cannot synthesize spatial differences, but BigAug could significantly boost the segmentation performance (Dice = 65.5%). This is due to the very different characteristics of the objects in the unseen domain, with a mix of changes in morphology and in image quality and appearance. Thus, comprehensive transforms are required to represent such large changes.

(v) Overall, for both MRI and ultrasound, the top 4 augmentations are contrast (Dice = 63.6%), brightness (63.6%), sharpening (62.9%), and 3D scaling (61.3%), each of which is comparable with CycleGAN (63.5%).

(vi) BigAug performance is ∼10% worse than fully supervised methods, as they have the advantages of training and testing on the same domain and more training data. This gap can be reduced by using a larger source dataset (as shown later in Section III-D4), in which case the BigAug performance is comparable to the supervised methods.

Examples of unseen domain segmentation produced by the baseline model, CycleGAN-based domain adaptation, and our BigAug domain generalization are shown in Fig. 3. The baseline and BigAug models are trained only on individual source domains, while CycleGAN requires images from the target/unseen domain to train an additional generative model.

TABLE III
Results of paired cross-domain evaluation among vendors on the CETUS heart ultrasound dataset. Results are presented as Dice scores of baseline/BigAug.

To further demonstrate the general effectiveness of BigAug, a paired cross-domain evaluation is performed among vendors, i.e., picking one for training and one for testing from CETUS-A, -B, and -C each time. Results in Table III show that BigAug generalizes substantially better than the baseline model regardless of which ultrasound vendor it is trained on. However, the absolute accuracies on unseen vendors can differ depending on the source-unseen pair.
Fig. 3. Generalization to unseen domains for three different 3D medical image segmentation tasks. Baseline standard deep models show low performance on unseen MRI and ultrasound images from different clinical centers, scanner vendors, etc. The CycleGAN-based domain adaptation method helps improve the segmentation performance. BigAug training generates robust models which significantly improve the segmentation performance on unseen domains. Segmentation masks (red) overlaid on unseen or CycleGAN-synthesized images are illustrated.
2) BigAug vs. Shallower Stacked Transformations: Individual augmentation transforms may perform slightly better in some isolated cases (e.g., brightness augmentation for MM-WHS in Task 2 in Table II), but on average only BigAug consistently shows good generalization.

To investigate the optimal augmentation configuration for domain generalization, i.e., how many and which transformations should be used, we combine the four best performing transformations as "Top4" (i.e., sharpness, brightness, contrast, and scaling). "Top4" is a competitive but "shallower" stack of transformations, which covers at least one aspect across image quality, appearance, and spatial transform. The results are shown in Table II. Overall, the shallower competitor (Top4) achieves a Dice of 74.9%, which is substantially higher than the baseline model (49.8%), but lower than BigAug (80.0%), which uses all transformations. This also applies to each individual task, except for one case, i.e., brightness augmentation for MM-WHS in Task 2. This could be explained by a more diverse data distribution, which helps better prevent overfitting while improving generalization. For Task 1 and Task 3, BigAug is better than the baseline and Top4 on both source and unseen testing sets, which indicates the effect of BigAug on small-sized data (e.g., 10 training volumes for CETUS).

Besides the significant improvement (30.2%) on unseen domains, BigAug can also slightly improve (by 2.5%) the performance on source domains, from 89.1% to 91.6% on average (though it can sometimes be slightly worse, e.g., Task 2). This is an important benefit of BigAug, i.e., it retains performance on the source domain. Therefore, using all the presented transformations is recommended in general.

3) BigAug vs. Training From Scratch on Target Domain: Another important finding is that models trained with BigAug on the source domain have comparable, or slightly lower, performance than a model that is trained from scratch on the target domain using the same amount of data.
TABLE IV
The effect of BigAug with big data (465 MRI volumes from multiple medical centers, MRI vendors, and protocols) for the task of whole prostate segmentation in MRI volumes. Note that the state-of-the-art methods marked with * are trained and tested on the same domain, or report inter-observer variability (91.9%). No evaluation of whole prostate segmentation is available in the MSD challenge (http://medicaldecathlon.com/results.html).
(prostate, left atrial, left ventricle) involving two medical imaging modalities (MRI and ultrasound). The experiments utilize eight public challenge datasets and establish a strong benchmark for the study of domain generalization in medical imaging. The empirical evaluation, performance analysis, and conclusive insights can be generalized to the design of really practical, highly robust, and competitive deep segmentation models for other medical imaging tasks.

REFERENCES

[1] M. D. Abràmoff, P. T. Lavin, M. Birch, N. Shah, and J. C. Folk, "Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices," NPJ Digit. Med., vol. 1, no. 1, p. 39, Aug. 2018.
[2] A. Hosny, C. Parmar, J. Quackenbush, L. H. Schwartz, and H. J. Aerts, "Artificial intelligence in radiology," Nature Rev. Cancer, vol. 18, no. 8, p. 500, 2018.
[3] J. De Fauw et al., "Clinically applicable deep learning for diagnosis and referral in retinal disease," Nature Med., vol. 24, no. 9, pp. 1342–1350, Sep. 2018.
[4] K. Yasaka and O. Abe, "Deep learning and artificial intelligence in radiology: Current applications and future directions," PLoS Med., vol. 15, no. 11, Nov. 2018, Art. no. e1002707.
[5] M. Ghafoorian et al., "Transfer learning for domain adaptation in MRI: Application in brain lesion segmentation," in Proc. MICCAI. Cham, Switzerland: Springer, 2017, pp. 516–524.
[6] N. Karani, K. Chaitanya, C. Baumgartner, and E. Konukoglu, "A lifelong learning approach to brain MR segmentation across scanners and protocols," in Proc. MICCAI, 2018, pp. 476–484.
[7] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, "Adversarial discriminative domain adaptation," in Proc. CVPR, vol. 1, no. 2, 2017, p. 4.
[8] I. Goodfellow et al., "Generative adversarial nets," in Proc. NIPS, 2014, pp. 2672–2680.
[9] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. ICCV, 2017, pp. 2223–2232.
[10] K. Kamnitsas et al., "Unsupervised domain adaptation in brain lesion segmentation with adversarial networks," in Proc. IPMI. Cham, Switzerland: Springer, 2017, pp. 597–609.
[11] M. A. Degel, N. Navab, and S. Albarqouni, "Domain and geometry agnostic CNNs for left atrium segmentation in 3D ultrasound," in Proc. MICCAI, 2018, pp. 630–637.
[12] J. Ren, I. Hacihaliloglu, E. A. Singer, D. J. Foran, and X. Qi, "Adversarial domain adaptation for classification of prostate histopathology whole-slide images," in Proc. MICCAI, 2018, pp. 201–209.
[13] Y. Zhang, S. Miao, T. Mansi, and R. Liao, "Task driven generative modeling for unsupervised domain adaptation: Application to X-ray image segmentation," in Proc. MICCAI, 2018, pp. 599–607.
[14] C. Chen, Q. Dou, H. Chen, and P.-A. Heng, "Semantic-aware generative adversarial nets for unsupervised domain adaptation in chest X-ray segmentation," in Proc. Int. Workshop Mach. Learn. Med. Imag. Cham, Switzerland: Springer, 2018, pp. 143–151.
[15] X. Yang et al., "Generalizing deep models for ultrasound image segmentation," in Proc. MICCAI. Cham, Switzerland: Springer, 2018, pp. 497–505.
[16] A. Jog and B. Fischl, "Pulse sequence resilient fast brain segmentation," in Proc. MICCAI. Cham, Switzerland: Springer, 2018, pp. 654–662.
[17] O. Puonti, J. E. Iglesias, and K. Van Leemput, "Fast and sequence-adaptive whole-brain segmentation using parametric Bayesian modeling," NeuroImage, vol. 143, pp. 235–249, Dec. 2016.
[18] W. M. Kouw, S. N. Ørting, J. Petersen, K. S. Pedersen, and M. de Bruijne, "A cross-center smoothness prior for variational Bayesian brain tissue segmentation," in Proc. IPMI. Cham, Switzerland: Springer, 2019, pp. 360–371.
[19] M. Brudfors, Y. Balbastre, and J. Ashburner, "Nonlinear Markov random fields learned via backpropagation," in Proc. IPMI. Cham, Switzerland: Springer, 2019, pp. 805–817.
[20] E. Romera, L. M. Bergasa, J. M. Alvarez, and M. Trivedi, "Train here, deploy there: Robust segmentation in unseen domains," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2018, pp. 1828–1833.
[21] R. Volpi, H. Namkoong, O. Sener, J. Duchi, V. Murino, and S. Savarese, "Generalizing to unseen domains via adversarial data augmentation," in Proc. NeurIPS, 2018, pp. 5334–5344.
[22] T.-D. Truong, C. Nhan Duong, K. Luu, M.-T. Tran, and M. Do, "Beyond domain adaptation: Unseen domain encapsulation via universal non-volume preserving models," 2018, arXiv:1812.03407. [Online]. Available: http://arxiv.org/abs/1812.03407
[23] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," in Proc. ICLR, 2017, pp. 1–15.
[24] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Proc. NIPS, 2015, pp. 2017–2025.
[25] L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning," 2017, arXiv:1712.04621. [Online]. Available: http://arxiv.org/abs/1712.04621
[26] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, "AutoAugment: Learning augmentation strategies from data," in Proc. CVPR, 2019, pp. 113–123.
[27] C. Bowles et al., "GAN augmentation: Augmenting training data using generative adversarial networks," 2018, arXiv:1810.10863. [Online]. Available: http://arxiv.org/abs/1810.10863
[28] D. Yang, H. Roth, Z. Xu, F. Milletari, L. Zhang, and D. Xu, "Searching learning strategy with reinforcement learning for 3D medical image segmentation," in Proc. MICCAI. Cham, Switzerland: Springer, 2019, pp. 3–11.
[29] S. Liu et al., "3D anisotropic hybrid network: Transferring convolutional features from 2D images to 3D anisotropic volumes," in Proc. MICCAI. Cham, Switzerland: Springer, 2018, pp. 851–858.
[30] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, and K. H. Maier-Hein, "Brain tumor segmentation and radiomics survival prediction: Contribution to the BRATS 2017 challenge," in Proc. Int. MICCAI Brainlesion Workshop. Cham, Switzerland: Springer, 2017, pp. 287–297.
[31] P. Kickingereder et al., "Automated quantitative tumour response assessment of MRI in neuro-oncology with artificial neural networks: A multicentre, retrospective study," Lancet Oncol., vol. 20, no. 5, pp. 728–740, May 2019.
[32] A. Myronenko, "3D MRI brain tumor segmentation using autoencoder regularization," in Proc. Int. MICCAI Brainlesion Workshop. Cham, Switzerland: Springer, 2018, pp. 311–320.
[33] B. Rister, D. Yi, K. Shivakumar, T. Nobashi, and D. L. Rubin, "CT organ segmentation using GPU data augmentation, unsupervised labels and IOU loss," 2018, arXiv:1811.11226. [Online]. Available: http://arxiv.org/abs/1811.11226
[34] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully convolutional neural networks for volumetric medical image segmentation," in Proc. Int. Conf. 3D Vis. (3DV), 2016, pp. 565–571.
[35] A. L. Simpson et al., "A large annotated medical image dataset for the development and evaluation of segmentation algorithms," 2019, arXiv:1902.09063. [Online]. Available: http://arxiv.org/abs/1902.09063
[36] G. Litjens et al., "Evaluation of prostate segmentation algorithms for MRI: The PROMISE12 challenge," Med. Image Anal., vol. 18, no. 2, pp. 359–373, Feb. 2014.
[37] G. Litjens, O. Debats, J. Barentsz, N. Karssemeijer, and H. Huisman, "Computer-aided detection of prostate cancer in MRI," IEEE Trans. Med. Imag., vol. 33, no. 5, pp. 1083–1092, May 2014.
[38] X. Zhuang and J. Shen, "Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI," Med. Image Anal., vol. 31, pp. 77–87, Jul. 2016.
[39] Q. Zhu, B. Du, and P. Yan, "Boundary-weighted domain adaptive neural network for prostate MR image segmentation," IEEE Trans. Med. Imag., to be published.
[40] H. Jia et al., "3D APA-Net: 3D adversarial pyramid anisotropic convolutional network for prostate segmentation in MR images," IEEE Trans. Med. Imag., vol. 39, no. 2, pp. 447–457, Feb. 2020.
[41] Z. Xiong, V. V. Fedorov, X. Fu, E. Cheng, R. Macleod, and J. Zhao, "Fully automatic left atrium segmentation from late gadolinium enhanced magnetic resonance imaging using a dual fully convolutional neural network," IEEE Trans. Med. Imag., vol. 38, no. 2, pp. 515–524, Feb. 2019.