
This ICCV paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.

CauSSL: Causality-inspired Semi-supervised Learning for Medical Image Segmentation

Juzheng Miao¹, Cheng Chen², Furui Liu³*, Hao Wei⁴, Pheng-Ann Heng¹,⁵
¹Department of Computer Science and Engineering, The Chinese University of Hong Kong
²Center for Advanced Medical Computing and Analysis, Harvard Medical School and Massachusetts General Hospital
³Zhejiang Lab
⁴Department of Biomedical Engineering, The Chinese University of Hong Kong
⁵Institute of Medical Intelligence and XR, The Chinese University of Hong Kong

Abstract

Semi-supervised learning (SSL) has recently demonstrated great success in medical image segmentation, significantly enhancing data efficiency with limited annotations. However, despite its empirical benefits, there are still concerns in the literature about the theoretical foundation and explanation of semi-supervised segmentation. To explore this problem, this study first proposes a novel causal diagram to provide a theoretical foundation for the mainstream semi-supervised segmentation methods. Our causal diagram takes two additional intermediate variables into account, which are neglected in previous work. Drawing from this proposed causal diagram, we then introduce a causality-inspired SSL approach on top of co-training frameworks, called CauSSL, to improve SSL for medical image segmentation. Specifically, we first point out the importance of algorithmic independence between two networks or branches in SSL, which is often overlooked in the literature. We then propose a novel statistical quantification of the uncomputable algorithmic independence and further enhance the independence via a min-max optimization process. Our method can be flexibly incorporated into different existing SSL methods to improve their performance. It has been evaluated on three challenging medical image segmentation tasks using both 2D and 3D network architectures and has shown consistent improvements over state-of-the-art methods. Our code is publicly available at: https://github.com/JuzhengMiao/CauSSL.

Figure 1. Previous causal diagram for medical image segmentation. The cause is the input image (X), based on which experts give the corresponding effect, i.e., the segmentation mask (Y). The green arrow represents the prediction direction of the segmentation network. ⊥ indicates algorithmic independence defined on two functions or distributions. According to ICM, unlabeled data can help improve the estimation of P(X) but fail to improve the network P(Y|X) since they are algorithmically independent.

1. Introduction

Data-driven deep learning methods have shown remarkable performance in medical image segmentation [18, 38]. However, they typically require a large amount of high-quality labeled data, which is extremely costly and difficult to obtain for the pixel-wise annotations of medical images requiring domain expertise. To solve this problem, semi-supervised learning (SSL) has become more and more popular and achieved remarkable success in medical image segmentation with limited annotations and large amounts of unlabeled data [1, 14, 25, 55]. Current semi-supervised segmentation methods can be mainly divided into two categories. The first is self-training or pseudo-labeling methods [1, 14], which utilize pseudo-labels as supervision for unlabeled images. The other mainstream SSL methods are based on consistency regularization. These methods apply consistency regularization on the predictions between different models or branches based on the popular Mean Teacher (MT) [25, 55] or co-training [36, 52] frameworks.

* Corresponding author.

Although many works have demonstrated the success
of SSL methods on medical image segmentation tasks, Kügelgen et al. [23] argue that semi-supervised segmentation should be expected to be ineffective based on the principle of independent causal mechanisms (ICM) [13, 35, 42]. ICM claims that the causal generative process consists of independent mechanisms that do not share information with each other. Fig. 1 depicts the causal diagram of the annotation generation process for medical image segmentation following [9]. In this process, experts manually delineate the corresponding segmentation mask for a given medical image based on visual inspection of the image content and intensity contrast. As a result, the image X is the cause, while the annotation Y is the effect [9]. According to ICM, for a causal prediction task, i.e., X → Y, the image generation process P(X) should be algorithmically independent of the annotation generation process P(Y|X). In this regard, adding unlabeled data can provide more knowledge about the data generation mechanism P(X), but brings no helpful information about P(Y|X), parameterized by the segmentation network, since there is no link between them. This challenges the theoretical explanation of the success of SSL segmentation.

To address this issue, we propose a new causal diagram (see Fig. 2) by introducing two intermediate nodes, which can provide a better explanation for the mainstream semi-supervised segmentation methods. These intermediate nodes denote pseudo-labels or predictions of another network/branch that assist the network learning on unlabeled data; they are common in current SSL methods but are neglected in Fig. 1. The detailed analysis can be found in Section 3. Based on the new diagram, we further demonstrate that the algorithmic independence in a co-training framework can even be beneficial to the segmentation performance. However, the formal measure of algorithmic independence, i.e., Kolmogorov complexity, is not computable, and proxies are often used in specific applications, such as the minimum description length for NLP tasks [19]. Nonetheless, a proxy for segmentation networks has not yet been explored. In this work, we propose a novel statistical quantification of the algorithmic independence specialized for deep convolutional networks, based on which we design a min-max optimization process to further enhance the independence in co-training frameworks.

In summary, our main contributions are four-fold:

• This study proposes a novel causal diagram that is compatible with mainstream SSL methods in segmentation. The diagram sheds light on the effectiveness of semi-supervised segmentation from a causal perspective, and thus provides a theoretical foundation for understanding and further improving the performance of SSL in medical image segmentation.

• Based on our proposed causal diagram and ICM, we give comprehensive explanations of the limitations of vanilla self-training and MT-based methods compared to co-training frameworks. This deepens researchers' understanding of the SSL framework from the viewpoint of algorithmic information and provides justification for the significance of considering algorithmic independence for model learning.

• We propose a novel statistical quantification of the uncomputable algorithmic independence, specialized for deep convolutional networks, named network independence. It defines algorithmic independence through the preservation of matrix ranks, treating the convolution kernels as matrices. A min-max optimization framework is then utilized to enable end-to-end metric learning, validated in both co-training and MT-based learning scenarios.

• We evaluate our method with extensive experiments on three public medical image segmentation tasks using both 2D and 3D network architectures. The superior performance of our method provides empirical evidence for the claim that semi-supervised medical image segmentation can be improved by causal-diagram-induced algorithmic independence.

2. Related Work

SSL for Medical Image Segmentation. In recent years, SSL has made significant progress in leveraging unlabeled data to improve segmentation performance under limited annotations. Previous methods can be broadly categorized into self-training methods [1, 14] and consistency-regularization methods [25, 55]. Bai et al. [1] developed a representative self-training framework for cardiac MR image segmentation, which uses the network predictions for unlabeled data as pseudo-labels and updates the training network iteratively. Under the framework of consistency regularization, Li et al. [25] proposed to enhance the consistency between predictions of inputs under different data augmentations on top of the MT framework [46]. On the other hand, Xia et al. [52] demonstrated the effectiveness of the co-training strategy by training a segmentation network on each view of the volume data and encouraging multi-view consistency among these networks.

Causality in Medical Image Analysis. Improving models' performance in medical image analysis from the viewpoint of causality has received significant attention recently. Causality-inspired learning models have been applied to discover causal links of various neural processes [41], provide explanations for network performance [6, 20], and improve fairness [43]. Another interesting direction is to generate images of the potential appearance if a patient were healthy, using counterfactual techniques [15]. Moreover, a
lot of works focus on improving the robustness and generalization abilities of networks using causal reasoning, such as in domain adaptation and out-of-distribution detection [48]. For example, Kouw et al. [22] introduced a causal Bayesian prior to enhance the cross-center segmentation performance on MRI data. Ouyang et al. [32] proposed a causality-inspired data augmentation approach and leveraged causal intervention to improve model robustness on the single-source domain generalization problem.

Causality in Semi-supervised Learning. Most causality-related works on SSL focus on how the causal direction can affect the learning performance. Schölkopf et al. [42] first pointed out that SSL works better when predicting cause variables from their effects (anticausal learning) or from confounded inputs (confounded learning) and should be impossible when predicting the target labels from the causes (causal learning). Based on this conclusion, Kügelgen et al. [23] further proposed a new framework for semi-supervised classification by conducting the prediction using both cause and effect features simultaneously, creating an anticausal learning setting. However, the pessimistic conclusion cannot explain the promising achievements of SSL in segmentation tasks, which are a classic causal learning setting [9]. In our work, we aim to investigate the causal diagram of SSL segmentation and demonstrate that the principle of independent causal mechanisms is not always detrimental to causal learning settings like segmentation tasks. Instead, we show that proper statistical quantification and further enhancement of the algorithmic independence property are helpful for improving the segmentation performance.

3. Causal Modeling of SSL Segmentation

In this section, we propose a novel causal diagram that is compatible with current SSL frameworks. Based on this theoretical foundation, plausible explanations for the effectiveness of SSL segmentation methods are presented, and some key factors of SSL segmentation performance are naturally highlighted through the lens of causality.

3.1. Causal Diagram for SSL Segmentation

In Fig. 1, only the label annotation process in SSL segmentation tasks is considered. However, actual implementations of SSL methods tend to introduce intermediate variables, such as pseudo-labels or predictions of another network/branch, to assist the network learning on unlabeled data. The incomplete causal diagram in previous works [23, 35, 42] results in the pessimistic conclusion on SSL segmentation. Therefore, we introduce intermediate variables into our proposed causal diagram to better describe the general learning process in most SSL methods. As shown in Fig. 2, the input image X, the mask of the target organ Y, and the original predictions of the segmentation networks P̂ and P̂′ are observable variables. Following Castro et al. [9], we consider X as the cause of Y. P̂ and P̂′ are the approximations of the target mask generated by the segmentation networks f and f′, respectively, and are thus determined by both the image and the mask.

Figure 2. The causal diagram for semi-supervised medical image segmentation, where the variables with solid line boundaries are observable variables while those with dotted line boundaries are unobservable variables. The green arrow presents the loss dependence on unlabeled data. P̂ and P̂′ denote the network predictions.

3.2. Consistency with Mainstream Methods

In the SSL setting, the training dataset D consists of ML labeled data and MU unlabeled data, denoted as L = {(xi, yi)}, i = 1, ..., ML, and U = {xi}, i = ML + 1, ..., ML + MU, where xi ∈ R^(Hin×Win) denotes an image and yi ∈ {0, 1}^(Hin×Win×C) represents the corresponding ground-truth label for labeled data, with C being the number of semantic classes. For the labeled data, Y can be directly observed in Fig. 2, and a supervised loss Ls is utilized to help P̂ and P̂′ approximate the mask directly. On the other hand, since ground-truth labels are not available for unlabeled data, P̂′ is usually adopted as a proxy of Y and used to guide the learning of P̂, as indicated by the green arrow in Fig. 2.

In self-training methods, P̂ is the prediction generated by the segmentation network, while P̂′ is the output of the same network for the same unlabeled image in previous iterations, which functions as a pseudo-label to supervise P̂ with the supervised loss Ls. By contrast, P̂′ can be predicted by another network or branch and guide the learning of P̂ via a consistency regularization loss in the MT/co-training framework. For instance, methods based on the MT framework usually adopt the teacher model as f′ to generate P̂′ and enforce consistency between the teacher and student using a mean squared error (MSE) loss. In the co-training framework, the CPS method [10] uses another independent segmentation network with the same architecture but different weight initializations to generate P̂′. P̂ and P̂′ then function as pseudo-labels for each other via a cross-entropy loss.

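The cross pseudo supervision described above can be sketched numerically. The following is a minimal NumPy illustration on a single unlabeled image, where each network's hard argmax serves as the pseudo-label for the other network's soft prediction; the array shapes, the softmax helper, and the function names are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def cross_entropy(prob, label, eps=1e-8):
    """Mean pixel-wise cross-entropy between a probability map
    prob of shape (H, W, C) and integer labels of shape (H, W)."""
    h, w = label.shape
    picked = prob[np.arange(h)[:, None], np.arange(w)[None, :], label]
    return float(-np.log(picked + eps).mean())

def cross_pseudo_losses(p1, p2):
    """CPS-style mutual supervision on an unlabeled image: each
    network's argmax acts as the pseudo-label for the other."""
    y1 = p1.argmax(axis=-1)   # pseudo-label from network 1
    y2 = p2.argmax(axis=-1)   # pseudo-label from network 2
    return cross_entropy(p1, y2), cross_entropy(p2, y1)

rng = np.random.default_rng(0)
softmax = lambda z: np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
p1 = softmax(rng.normal(size=(4, 4, 3)))  # toy 4x4 map, 3 classes
p2 = softmax(rng.normal(size=(4, 4, 3)))
l1, l2 = cross_pseudo_losses(p1, p2)
```

An MT-style variant would instead replace the two cross-entropy terms with a single MSE between p1 and the teacher's probability map.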
3.3. Effectiveness Explanation and Key Components for SSL Segmentation

We conjecture that the remarkable progress of SSL segmentation methods can be largely attributed to the green arrow in Fig. 2. More unlabeled data help us obtain more information about P(X). This in turn provides more information about P̂′ through f′, which is a noisy estimation of P(Y). Naturally, if the approximation is precise enough, the SSL problem becomes a supervised one, where a good performance can be ensured. Therefore, with more pairs of input images and segmentation masks, the network predictions P̂ can be improved if the quality of the approximation is good enough. Also, from the causal perspective, the loss between P̂ and P̂′ introduces a learning direction from P̂′ to P̂, which is a confounded learning setting and should help information sharing among different mechanisms and improve the segmentation performance [23, 42].

With the help of the proposed causal diagram, it is also easier for us to identify the key components that have significant influences on medical image segmentation performance in an SSL framework. As presented in Fig. 2, the learning of P̂ is directly affected by the ground-truth labels and by P̂′ for labeled and unlabeled data, respectively. Therefore, the quality of P̂′, as well as the learning constraint between P̂ and P̂′, are of great importance. The former has been noticed in [44, 47, 54]. In addition, how to design an appropriate loss to ensure a thorough consistency between P̂ and P̂′, such as shape-aware constraints rather than the pixel-wise MSE, can be a promising direction. Moreover, uncertainty estimation can be integrated into the constraint loss to reweight the contributions from different regions of unlabeled data and avoid harmful guidance from P̂′ when the quality of P̂′ is not good enough [31, 45, 49, 52, 55]. In the following section, we will further demonstrate that the algorithmic independence between f and f′ also significantly affects the segmentation performance.

4. Method

4.1. Structural Causal Model for SSL Segmentation

The structural causal model (SCM) framework [33] is adopted to describe our proposed causal diagram for causal analysis. The observed variables Xi ∈ O = {X, P̂, P̂′, Y} are determined by their parents PAi and noise variables Ni, using a deterministic function fi: Xi := fi(PAi, Ni). Also, an independence assumption is often made over the unobserved noise variables following [23], indicating that there are no hidden confounders. Then, the joint distribution over the observed variables can be factorized as:

P(X, P̂, P̂′, Y) = ∏_{Xi∈O} P(Xi | PAi) = P(X) P(P̂|X, Y) P(P̂′|X, Y) P(Y|X)    (1)

Figure 3. An illustration of the 2D convolution process.

According to ICM [13, 35, 42], the mechanisms P(P̂|X, Y) and P(P̂′|X, Y) should be algorithmically independent and should not inform or influence each other. Since they are highly related to the segmentation models, the two models should be algorithmically independent to some extent as well. This also aligns with our intuition that two different networks should provide complementary help for each other.

The importance and effectiveness of the algorithmic independence between the two networks align with practical observations in SSL segmentation tasks. For self-training and MT-based methods, the high dependence between the assistant network f′ and the training network f leads to a performance bottleneck [10, 17, 21]. In the extreme case where f′ is the same as f, the P̂′ in our proposed causal diagram (Fig. 2) disappears. This degenerates into the causal learning setting shown in Fig. 1, making SSL segmentation models not so useful, since the unlabeled data P(X) do not contain helpful information for the network f. This explains the performance bottleneck of vanilla self-training and MT-based frameworks from a causal view. By contrast, the co-training framework breaks such limits and obtains a better performance by using two independent networks with different initial parameters [10], introducing different decoders with different upsampling strategies on top of the same encoder [50, 51], constructing different network architectures [27], or even adopting adversarial samples [34]. Despite their efficacy, these methods are intuitive improvements on the algorithmic independence and fail to propose metrics to directly measure the independence.

4.2. Network Independence

Our work focuses on finding a reasonable and computable proxy for the algorithmic independence metric (Kolmogorov complexity) in the setting of convolutional networks. For simplicity, we only consider the case of different networks/branches with the same convolutional architecture. As mentioned in Section 1, the Kolmogorov complexity K(x) describes the compression length of x. Therefore, we design our proxy metric based on the Minimum Description Length (MDL) principle [16, 37] from the aspect of compression, similar to [8, 19].

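The compression intuition behind this choice can be illustrated with an off-the-shelf compressor as a crude, computable stand-in for Kolmogorov complexity (this toy demo is our illustration, not part of the paper's method): when two byte strings share information, describing them together is much shorter than describing them separately, while for unrelated strings the two description lengths are nearly equal.

```python
import random
import zlib

def desc_len(data: bytes) -> int:
    """Compressed length as a computable stand-in for description length."""
    return len(zlib.compress(data, 9))

a = bytes(range(256)) * 8                  # a structured "mechanism"
b_dependent = a                            # shares all information with a
random.seed(0)
b_independent = bytes(random.randrange(256) for _ in range(len(a)))

joint_dep = desc_len(a + b_dependent)      # describe a and its copy together
sep_dep = desc_len(a) + desc_len(b_dependent)
joint_ind = desc_len(a + b_independent)    # describe a plus unrelated bytes
sep_ind = desc_len(a) + desc_len(b_independent)
# joint_dep is much shorter than sep_dep, whereas
# joint_ind stays close to sep_ind.
```

The next section transfers the same idea from byte strings to the weight matrices of convolutional layers, with matrix rank playing the role of description length.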
Figure 4. The pipeline of our CauSSL framework. Two networks are optimized by minimizing the combination of the supervised loss Ls on labeled data, the unsupervised loss Lu on unlabeled data, and the network independence loss Lin between network parameters.

Fig. 3 illustrates how a 2D convolutional kernel is applied to the input features and generates a filtered result. For a k × k convolutional kernel, the actual number of kernel weights is k × k × Cin, where Cin denotes the number of input feature channels. Such an extended 3D kernel computes the dot product with part of the features and slides over the whole feature map, generating the result of one output channel. With Cout different kernels of this kind, we can generate a total of Cout output channels. From the view of signal processing, each kernel can be seen as a template used to find specific patterns. Features with patterns similar to the kernel will generate a high activation value through the dot product. Thus, the flattened kernel can be considered a base vector, in the sense of linear algebra, that detects the similarity of the input features to this pattern. A convolutional layer with Cout × (k × k × Cin) parameters can then be viewed as a Cout × d matrix (d = k × k × Cin), where each row is a d-dimensional base vector for a certain pattern.

With this view of vectors and matrices, the compressibility of a neural network is thus transferred to the compressibility of matrices, and the latter can be naturally related to the matrix rank. According to the MDL principle, algorithmic independence between two matrices means that the length of describing matrices A and B together equals the sum of their separate descriptions, i.e., rank([A, B]) = rank(A) + rank(B), where [A, B] is the extended (Cout,A + Cout,B) × d matrix obtained by concatenating a Cout,A × d matrix A and a Cout,B × d matrix B along the row dimension. This holds if no row vector in A or B can be represented by a linear combination of the row vectors of the other matrix. Based on this linear independence condition, we propose a novel statistical quantification proxy for Kolmogorov complexity to measure the network dependence (the opposite of independence).

We first define the dependence between corresponding convolutional layers of the two networks, denoting the matrix versions of their kernel weights as A and B, each of size Cout × d:

Lin(A, B; GB) = (1/Cout) Σ_{i=1}^{Cout} ( (vA,i · qB,i) / (|vA,i| × |qB,i|) )²,  with qB,i = (GB × B)i    (2)

where vA,i is the i-th row vector of matrix A, and qB,i is the optimal linear combination vector built from the row vectors of B that approximates vA,i as closely as possible. GB is the optimal coefficient matrix, of size Cout × Cout, whose elements are the optimal linear combination coefficients. In this way, a lower Lin indicates a higher network independence.

We then define the network dependence between two neural networks or branches with the same architecture by averaging over all the convolutional layers:

Lin(θ1, θ2; G2) = (1/#layers) Σ_{i=1}^{#layers} Lin(θ1,i, θ2,i; G2,i)    (3)

where θ1,i, θ2,i, and G2,i are the weight parameters in matrix form and the optimal coefficient matrix of the i-th convolutional layer, respectively. Only convolutional layers are considered in this work.

4.3. Causality-inspired SSL

Based on the network dependence, we propose a causality-inspired SSL framework, called CauSSL (Fig. 4), which further enhances the algorithmic independence on top of the co-training framework via a min-max optimization process to improve the SSL segmentation performance:

min_{θ1,θ2} max_{G1,G2} Lin(θ1, θ̄2; G2) + Lin(θ2, θ̄1; G1)    (4)

where G1 and G2 are the sets of linear coefficient matrices, which are only used during training and are discarded at inference.

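The layer-wise dependence of Eq. (2) can be sketched as follows. In the paper, GB is learned by gradient ascent inside the min-max loop; as a simplification, this NumPy sketch obtains the optimal combination qB,i in closed form by least-squares projection of each row of A onto the row space of B. Matrix shapes and function names are illustrative:

```python
import numpy as np

def layer_dependence(A, B):
    """Proxy for Eq. (2): mean squared cosine similarity between each
    row of A and its best linear approximation from the rows of B.
    A, B: (C_out, d) matrices of flattened convolution kernels."""
    # Closed-form stand-in for the learned coefficient matrix G_B:
    # solve B.T @ X ~= A.T so that Q = X.T @ B approximates A row-wise.
    X, *_ = np.linalg.lstsq(B.T, A.T, rcond=None)
    Q = X.T @ B                               # rows are q_{B,i}
    num = (A * Q).sum(axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(Q, axis=1) + 1e-12
    return float(np.mean((num / den) ** 2))

def network_dependence(layers1, layers2):
    """Eq. (3): average layer-wise dependence over all conv layers."""
    return float(np.mean([layer_dependence(a, b)
                          for a, b in zip(layers1, layers2)]))

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 16))
fully_dependent = layer_dependence(A, A.copy())                 # close to 1
disjoint = layer_dependence(np.eye(4, 16), np.eye(4, 16, k=8))  # close to 0
```

With identical layers the score approaches 1 (fully dependent); with mutually orthogonal row spaces it approaches 0, matching the rank-additivity condition rank([A, B]) = rank(A) + rank(B).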
Algorithm 1 Pseudocode of CauSSL
Input: labeled data L, unlabeled data U, and hyperparameters λ1 and λ2.
Output: Two independent segmentation networks or branches parameterized by θ1 and θ2, respectively.
1: Randomly initialize the network weights θ1, θ2 and linear coefficients G1, G2. // initialization
2: i = 0 // iteration number
3: while i ≤ maximum iterations do // training
4:   for j = 1 : smax do // maximize
5:     Fix θ1, θ2. Update G1, G2 by maximizing
6:     Lin(θ2, θ̄1; G1) and Lin(θ1, θ̄2; G2).
7:   for j = 1 : smin do // minimize
8:     Fix G1, G2. Update θ1, θ2 by minimizing
9:     Ltotal,1 and Ltotal,2 using Equation 5.
10:  i = i + 1.
11: Return θ1 and θ2.

θ̄1 and θ̄2 represent copies of the weights without gradient flows. In this standard bilevel optimization problem, we want, on the one hand, to find the best linear combination coefficients that maximize Lin to provide an accurate estimation of the network dependence, and, on the other hand, to enhance the algorithmic independence between the two networks by minimizing Lin. To this end, we update the linear combination coefficients and the network weights in an alternating way, as shown in Algorithm 1. In each round, we first fix the parameters of the networks and update the linear combination coefficients to maximize Lin for smax steps. Then, we fix the linear coefficients and update the networks by minimizing Ltotal for smin steps, where Ltotal is a combination of the supervised loss Ls on the labeled data, the unsupervised loss Lu on the unlabeled data, and the network independence loss Lin between the different networks or branches:

Ltotal,1 = Ls,1 + λ1 Lu,1 + λ2 Lin(θ1, θ̄2; G2)
Ltotal,2 = Ls,2 + λ1 Lu,2 + λ2 Lin(θ2, θ̄1; G1)    (5)

Ls,i(ŷL,i, y) = (1/2) [Ldice(ŷL,i, y) + Lce(ŷL,i, y)]    (6)

where i ∈ {1, 2}, and λ1 and λ2 are balancing coefficients. λ1(t) = 0.1 · e^(−5(1 − t/tmax)²) is adopted following [55], considering that the quality of predictions from the assistant network might not be good enough in the initial training stage. Ldice and Lce indicate the Dice loss and the cross-entropy loss, respectively, and ŷL,i represents the predicted probability maps of the i-th network for labeled data. The unsupervised loss Lu can be either a cross-entropy loss between the network prediction and pseudo-labels generated by another network/branch, or an MSE loss between two probability maps:

Lu,1 = Lce(ŷU,1, ỹU,2) or Lu,1 = MSE(ŷU,1, ŷU,2)    (7)

where ŷU,i denotes the predicted probability maps on the unlabeled data generated by the two networks, and ỹU,i denotes the corresponding one-hot pseudo-labels. Details of how to extend our method to a framework with three branches can be found in Appendix Section 1.

5. Experiments

5.1. Datasets and Evaluation Metrics

Our proposed method is validated on three public datasets with different imaging modalities and segmentation tasks, i.e., the Automatic Cardiac Diagnosis Challenge (ACDC) dataset [7], the Pancreas-CT dataset [12, 39, 40], and the Multimodal Brain Tumor Segmentation Challenge 2019 (BraTS'19) dataset [3, 4, 5, 29]. The details of these datasets and the preprocessing steps are described in Appendix Section 2. Four metrics were used for evaluation, including the Dice similarity coefficient (DSC), Jaccard index (JC), 95% Hausdorff Distance (95HD), and the average surface distance (ASD). Results are highlighted in bold when our proposed CauSSL outperforms the original counterpart, and the best results are underlined. Standard deviations are reported in parentheses.

5.2. Implementation Details and Baselines

We applied our proposed CauSSL on top of two popular co-training methods, CPS [10] and MC-Net+ [50], and denote the modified methods as CPSCauSSL and MCCauSSL, respectively. Both are trained using Algorithm 1, but the CPS method utilizes the cross-entropy loss as the unsupervised loss, whereas MC-Net+ uses the MSE loss (Equation 7). We applied these two independence-enhanced methods to various network architectures to demonstrate the efficacy of our method, including the 2D U-Net [38], 3D V-Net [30], and 3D U-Net [11]. Specific settings for each dataset are described in Appendix Section 2.

In all experiments of our proposed method, we empirically updated the network weights and linear coefficient matrices alternately, with 60 steps for each. The linear coefficient matrices were optimized using an Adam optimizer with a fixed learning rate of 0.02. Moreover, our method was compared with fully supervised learning only (SL), MT [46], uncertainty-aware Mean Teacher (UA-MT) [55], SASSNet [24], DTC [26], URPC [28], CPS [10], and MC-Net+ [50], which were re-implemented in an identical environment with the same training configurations for a fair comparison. We also compared our method with BCP (CVPR'23) [2] on the ACDC dataset and FUSSNet (MICCAI'22) [53] on the Pancreas-CT dataset, which are the state-of-the-art (SOTA) methods on the respective datasets. If the same training setting (dataset and data split) is used, we directly report the results from the original paper. Otherwise, we re-ran their publicly available code using the default hyperparameters.

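The alternating schedule of Algorithm 1, together with the Gaussian warming-up weight λ1(t) from Eq. (5), can be sketched as follows. The two callables stand in for the actual optimizer updates of (G1, G2) and (θ1, θ2); everything here is an illustrative skeleton, not the released implementation:

```python
import math

def lambda1(t, t_max):
    """Warming-up weight for the unsupervised loss, following Eq. (5):
    0.1 * exp(-5 * (1 - t/t_max)^2), ramping up to 0.1 at t = t_max."""
    return 0.1 * math.exp(-5.0 * (1.0 - t / t_max) ** 2)

def train(step_max, step_min, s_max=60, s_min=60, max_iters=3):
    """Skeleton of Algorithm 1: alternate s_max ascent steps on the
    coefficient matrices (G1, G2) with s_min descent steps on the
    network weights (theta1, theta2)."""
    ramp = []
    for it in range(1, max_iters + 1):
        for _ in range(s_max):   # fix theta; update G to maximize Lin
            step_max()
        lam = lambda1(it, max_iters)
        for _ in range(s_min):   # fix G; update theta to minimize L_total
            step_min(lam)
        ramp.append(lam)
    return ramp

calls = {"max": 0, "min": 0}
ramp = train(step_max=lambda: calls.update(max=calls["max"] + 1),
             step_min=lambda lam: calls.update(min=calls["min"] + 1))
```

With max_iters = 3 and 60 steps for each phase (matching the setting reported in Section 5.2), this skeleton performs 180 maximization and 180 minimization updates, and the recorded λ1 values increase monotonically toward 0.1.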
Table 1. Comparisons with other methods on the ACDC dataset
with 10% and 20% labeled data.
Labeled% Method DSC (%) ↑ JC (%) ↑ 95HD (voxel) ↓ ASD (voxel) ↓
100% SL (upper bound) 91.53(2.89) 84.76(4.62) 2.41(5.28) 0.59(1.08)
SL 77.66(13.10) 66.40(14.53) 11.68(12.30) 3.31(3.65)
MT 81.11(9.65) 69.99(11.72) 8.99(10.25) 2.70(3.14)
UA-MT 80.71(9.69) 69.58(12.02) 13.69(16.54) 4.50(6.00)
SASSNet 82.56(8.94) 71.90(11.42) 9.13(9.99) 2.64(2.71)
DTC 84.32(6.92) 74.04(9.29) 9.47(11.61) 2.63(3.00)
10% URPC 82.41(10.15) 71.69(12.91) 5.83(9.09) 1.65(2.87)
CPS 84.24(6.85) 73.91(9.37) 8.26(9.68) 2.45(2.90)
CPSCauSSL 85.25(6.43) 75.31(8.98) 6.05(8.87) 1.97(2.64)
MC-Net+ 86.14(6.13) 76.61(8.32) 6.04(9.02) 1.85(2.50)
MCCauSSL 86.80(5.34) 77.48(7.62) 5.73(9.26) 1.83(2.56)
BCP 88.84(/) 80.62(/) 3.98(/) 1.17(/)
BCPCauSSL 89.66(3.82) 81.79(5.93) 3.67(8.16) 0.93(1.27)

SL 84.62(8.74) 74.85(11.13) 6.32(9.20) 1.79(2.58)


MT 85.46(7.28) 75.89(9.97) 8.02(10.55) 2.39(3.20)
UA-MT 85.16(7.41) 75.49(9.93) 5.91(8.95) 1.79(2.71)
SASSnet 86.45(6.77) 77.20(9.53) 6.63(8.52) 1.98(2.40)
DTC 87.10(6.18) 78.15(8.76) 6.76(10.83) 1.99(3.10)
20% URPC 85.44(9.29) 76.36(11.27) 5.93(9.04) 1.70(2.87)
CPS 86.85(7.05) 77.96(9.41) 5.48(9.13) 1.64(2.70)
CPSCauSSL 87.24(6.18) 78.44(8.55) 5.57(9.13) 1.73(2.65)
MC-Net+ 87.10(6.45) 78.21(9.03) 5.04(8.49) 1.56(2.43)
MCCauSSL 87.84(6.31) 79.32(8.84) 4.37(8.04) 1.28(2.30)
BCP 89.52(4.20) 81.62(6.44) 3.69(7.02) 1.03(1.89)
BCPCauSSL 89.99(3.65) 82.34(5.77) 3.60(8.62) 0.88(1.73)

Figure 5. Visualization of segmentation results on the ACDC test-


Table 2. Comparisons with other methods on the Pancreas-CT ing dataset trained with 10% labeled images.
dataset with 6 and 12 volumes having annotations.
#Labeled Method DSC (%) ↑ JC (%) ↑ 95HD (voxel) ↓ ASD (voxel) ↓
62/62 SL (upper bound) 82.80(6.32) 71.10(8.87) 5.58(4.18) 1.26(0.98)
6/62 SL 56.59(21.31) 42.24(19.99) 23.79(16.35) 7.18(5.45)
6/62 MT 68.61(13.86) 53.71(15.07) 18.64(16.45) 5.28(4.12)
6/62 UA-MT 66.96(14.43) 51.89(15.32) 21.65(14.12) 6.25(3.25)
6/62 SASSNet 66.69(14.86) 51.66(15.51) 18.88(11.55) 5.76(2.70)
6/62 DTC 67.28(17.37) 52.86(17.63) 17.74(18.58) 1.97(0.89)
6/62 URPC 64.73(15.36) 49.62(16.57) 21.90(9.83) 7.73(3.02)
6/62 FUSSNet 72.55(10.66) 57.95(13.03) 18.45(19.22) 5.23(5.96)
6/62 CPS 66.97(13.94) 51.93(15.17) 14.73(8.90) 4.49(2.25)
6/62 CPSCauSSL 67.33(13.59) 52.28(14.65) 16.16(8.38) 5.21(2.27)
6/62 MC-Net+ 68.18(12.49) 52.94(13.67) 16.35(11.05) 4.13(2.80)
6/62 MCCauSSL 72.89(8.90) 58.06(10.84) 14.19(11.20) 4.37(2.88)
12/62 SL 72.72(11.32) 58.25(13.33) 19.23(14.83) 5.77(3.99)
12/62 MT 76.39(9.80) 62.73(12.44) 9.91(9.54) 2.56(2.57)
12/62 UA-MT 77.42(8.68) 63.91(11.16) 7.96(5.44) 1.87(1.00)
12/62 SASSNet 78.06(7.40) 64.59(9.89) 12.76(15.78) 3.15(3.51)
12/62 DTC 76.82(12.53) 63.70(13.95) 8.69(10.38) 1.28(0.42)
12/62 URPC 79.09(7.39) 65.99(9.86) 11.68(13.80) 3.31(2.62)
12/62 FUSSNet 80.37(5.93) 67.57(8.16) 13.75(20.92) 3.46(3.87)
12/62 CPS 78.16(7.33) 64.74(9.83) 9.54(9.11) 2.63(2.14)
12/62 CPSCauSSL 78.58(7.52) 65.32(9.97) 8.30(6.22) 2.34(1.57)
12/62 MC-Net+ 79.36(6.54) 66.23(8.87) 10.22(9.59) 2.66(2.21)
12/62 MCCauSSL 80.92(5.20) 68.26(7.30) 8.11(9.24) 1.53(1.30)

Figure 6. Visualization of segmentation results on the Pancreas-CT testing dataset trained with 6 labeled volumes.
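The 95HD and ASD columns are surface-distance metrics in voxels. A small point-set sketch of their standard definitions, the 95th-percentile symmetric Hausdorff distance and the average surface distance (an assumed formulation; the paper's evaluation pipeline is not reproduced here):

```python
import numpy as np

def surface_distances(a_pts, b_pts):
    """Symmetric nearest-neighbour distances between two surface point sets,
    each an (N, D) array of voxel coordinates."""
    d = np.linalg.norm(a_pts[:, None, :] - b_pts[None, :, :], axis=-1)
    # each point's distance to the nearest point of the other surface
    return np.concatenate([d.min(axis=1), d.min(axis=0)])

def hd95(a_pts, b_pts):
    """95th-percentile Hausdorff distance (95HD)."""
    return np.percentile(surface_distances(a_pts, b_pts), 95)

def asd(a_pts, b_pts):
    """Average surface distance (ASD)."""
    return surface_distances(a_pts, b_pts).mean()
```

Taking the 95th percentile rather than the maximum makes the Hausdorff metric robust to a few outlier surface voxels, which is why 95HD is preferred in the tables above.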

using the default hyperparameters.

5.3. Comparison on Organ Segmentation Tasks

Table 1 and Table 2 tabulate the quantitative comparison results on the ACDC and Pancreas-CT datasets, respectively, from which several observations can be made. First, methods with a co-training framework (CPS and MC-Net+) generally outperform the vanilla MT framework (MT and UA-MT). For example, CPS surpasses MT and UA-MT by a margin of 3.13% and 3.53% DSC, respectively, on the ACDC dataset. This is consistent with findings in [10, 17, 21] and demonstrates the importance of algorithmic independence in the SSL framework. Nonetheless, it is worth noting that incorporating other intricate modules into the MT-based method can also yield outstanding results, as demonstrated by approaches like BCP [2].

Second, by introducing a causality-inspired independence constraint into the co-training framework, we can further achieve a performance improvement (highlighted in bold) and outperform other SOTA methods. Almost all the metrics are improved under different ratios of labeled data on both datasets, and our proposed CauSSL obtains the best results across various settings. On the ACDC dataset, the performance gain using our proposed CauSSL is 1.01% DSC and 1.40% JC for the CPS method with 10% labeled data. MCCauSSL also outperforms the original version by about 0.7% DSC and 0.9% JC. When 20% annotations are used, the performance improvement for the CPS method narrows (0.39% DSC), whereas the gain remains stable (0.74% DSC and 1.11% JC) for MC-Net+. On the Pancreas-CT dataset, the gap between MC-Net+ and MCCauSSL is even larger, with margins of 4.71% and 1.56% DSC using 6 and 12 annotated volumes, respectively. Trained with only about 20% labeled data, MCCauSSL even closely approaches the upper bound (80.92 vs. 82.80 DSC).
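The cross-supervision idea behind CPS [10] — two differently initialized networks, each trained on the other's hard pseudo-labels on unlabeled data — can be sketched with plain arrays (a toy illustration, not the authors' implementation; softmax/argmax stand in for the segmentation heads):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_entropy(prob, label):
    """Mean cross-entropy of (N, C) probabilities against (N,) integer labels."""
    n = label.size
    return -np.log(prob.reshape(n, -1)[np.arange(n), label.ravel()] + 1e-12).mean()

def cps_loss(logits_a, logits_b):
    """Cross pseudo supervision: each branch is supervised by the other's
    hard pseudo-label (the label side is treated as a constant)."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    ya, yb = logits_a.argmax(-1), logits_b.argmax(-1)  # hard pseudo-labels
    return cross_entropy(pa, yb) + cross_entropy(pb, ya)
```

The loss is near zero when the two branches already agree and large when they disagree, which is what drives them toward a consensus on unlabeled pixels.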

Table 3. Comparisons with other methods on the BraTS’19 dataset
with 10% annotations.
Method DSC (%) ↑ JC (%) ↑ 95HD (voxel) ↓ ASD (voxel) ↓
SL 72.84(17.53) 60.05(20.52) 41.64(36.39) 2.56(1.45)
CPS 82.56(12.90) 72.18(17.11) 14.41(16.36) 2.31(2.01)
CPSCauSSL 83.56(12.84) 73.60(16.82) 11.91(14.17) 2.06(1.70)
MC-Net+ 81.84(15.21) 71.65(18.99) 13.82(17.04) 2.43(2.17)
MCCauSSL 83.54(12.41) 73.46(16.61) 12.53(15.94) 1.98(1.53)

Table 4. DSC results of applying our CauSSL to the vanilla MT.
Dataset #Labeled MT MTCauSSL
ACDC 7/70 81.11% 82.89%
ACDC 14/70 85.46% 86.35%
Pancreas-CT 6/62 68.61% 71.36%
Pancreas-CT 12/62 76.39% 77.63%
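Section 5.6 attributes MT's near-1 network dependence to the exponential moving average (EMA) teacher update. With a (hypothetically) fixed student, the geometric decay makes the teacher collapse onto the student, which is easy to see numerically (a toy sketch, not the training code):

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Mean-teacher EMA step: teacher <- alpha * teacher + (1 - alpha) * student."""
    return alpha * teacher + (1 - alpha) * student

rng = np.random.default_rng(0)
teacher = rng.normal(size=8)
student = rng.normal(size=8)  # held fixed here to expose the limit behaviour
for _ in range(2000):
    teacher = ema_update(teacher, student)
# the gap shrinks by a factor alpha per step: |t_k - s| = alpha**k * |t_0 - s|
gap = np.abs(teacher - student).max()
```

In real training the student also moves, but Ke et al. [21] show the same collapse argument holds in the limit, which is the dependence bottleneck CauSSL's min-max constraint is designed to avoid.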
Figure 7. Network dependence of different methods during training on the ACDC dataset using 10% labeled training data.

The effectiveness of our proposed method can also be shown on some hard examples (see Fig. 5 and Fig. 6). Two cases are presented for each dataset. The first and third rows are the predictions for a certain slice, while the second and fourth rows show 3D visualizations for each case. In these examples, CPS and MC-Net+ tend to generate false predictions (Fig. 5, rows (a), (b)) or incomplete structures (Fig. 6), whereas introducing the independence constraint can mitigate these problems and obtain a more plausible segmentation result.

More comparisons with other SOTA methods further demonstrate the efficacy of our proposed method. For example, on the Pancreas-CT dataset, our proposed MCCauSSL is superior to FUSSNet [53] under various ratios of labeled data, with an improvement of 0.55% DSC trained with 12 labeled volumes (see Table 2). By contrast, although MCCauSSL obtains a lower DSC than BCP [2] on the ACDC dataset as shown in Table 1, applying our proposed CauSSL on top of the BCP method can further improve the performance by 0.82% DSC when 10% labeled data are used. It also approaches the upper bound with all the labels (91.53%) with a margin of only 1.87%. Despite the performance improvement for BCP from introducing another 10% annotations, its DSC result (89.52%) is even lower than that of our BCPCauSSL with half the labels (89.66%).

5.4. Comparison on a Tumor Segmentation Task

To validate the generalizability of our method, we also tried our method on the challenging brain tumor segmentation task on the BraTS'19 dataset using a 3D U-Net backbone. Table 3 shows results with 10% labeled data for different methods. Similar to the results on organ segmentation tasks, semi-supervised learning methods are superior to the baseline using labeled data alone, with a DSC improvement of over 10% for both CPS and MC-Net+. Meanwhile, our proposed CauSSL can further improve the performance of the SSL methods, as shown in bold in Table 3. For example, our CauSSL scheme obtains 1% and 1.7% DSC improvements over CPS and MC-Net+, respectively, demonstrating the efficacy of our proposed method on another, more challenging setting and network architecture.

5.5. Application to MT-based Methods

Although our proposed CauSSL is originally designed on top of co-training methods to further improve the network independence, it can also be applied to MT-based methods (named MTCauSSL) thanks to its plug-and-play nature, by adding the network independence loss to the student training. In other words, we only keep one term of Equation 4 for MTCauSSL. As shown in Table 4, our MTCauSSL obtains a stable and significant improvement over the vanilla MT method on both the ACDC and the Pancreas-CT datasets. In particular, the DSC improvement is 2.75% when only 6 labeled volumes are used on the Pancreas-CT dataset. Moreover, by applying our proposed method to the SOTA method BCP, we can further improve its performance from 88.84% to 89.66% DSC on the ACDC dataset with 10% labeled data, as shown in Table 1.

5.6. Analysis of Network Dependence

To demonstrate the efficacy of our proposed min-max framework in enhancing the algorithmic independence, we further measure the dependence using the metric defined in Equation 3. We take the average over all the networks or branches in a method as the final measurement. Fig. 7 illustrates the network dependence of different methods on the ACDC dataset during the training process.

First of all, the network dependence of the MT method is extremely high, eventually approaching 1. This is due to the use of the exponential moving average strategy. Using limit theory, Ke et al. [21] have proved that the weights of the teacher network and the student model will converge to the same target given infinite iteration steps. Such a dependence explains the performance bottleneck of MT-based methods. By contrast, CPS and MC-Net+ have a much

Figure 8. DSC performances on the ACDC dataset with different hyperparameters under the setting of 10% labeled data.

lower dependence and also achieve superior results on this dataset, proving the necessity of the independence constraint. Moreover, the proposed min-max framework can further reduce the algorithmic dependence and improve the segmentation performance. The original dependence scores for CPS and MC-Net+ are 0.24 and 0.20, respectively. After integrating the independence constraint into the training process, the dependence of CPS is significantly reduced to 0.04, whereas the metric for MC-Net+ becomes 4e-4. The improvement in algorithmic dependence is consistent with the performance gain of the two co-training methods.

5.7. Impacts of Hyperparameters

We first compare the mean DSC of different independence constraint coefficients using 10% labeled data on the ACDC dataset. As shown in Fig. 8 (a), no matter which coefficient is taken, both CPSCauSSL and MCCauSSL surpass their counterparts without the independence constraint in most cases (indicated by the horizontal dotted lines), demonstrating the efficacy of our proposed CauSSL framework. Second, the number of minimizing or maximizing iterations is observed to have a similar effect on both methods. Sixty minimizing steps obtain the highest DSC: too many iterations might over-enhance the network independence, whereas weak enforcement of the independence might not fulfill the potential of the co-training framework if the number of minimizing iterations is not big enough (see Fig. 8 (b)). Moreover, according to Fig. 8 (c), when the number of maximizing steps is too small (such as 15 and 30), the linear coefficients might fail to match well, leading to an underestimated network dependence and possibly wrong independence optimization. On the other hand, although more maximizing steps do not bring higher segmentation performance, they are better than insufficient dependence measurement.

5.8. Analysis of Training Efficiency

As shown in Equations 2 and 3, compared to the original SSL methods, the additional computation introduced by our method mainly includes matrix multiplication, normalization, MSE, and average calculation over all the convolutional layers. All of these can be implemented in PyTorch with high efficiency and only slightly increase the overall training time. For example, on the ACDC dataset with 10% labeled data using an NVIDIA RTX 3090 GPU, the training duration of our CPSCauSSL was 4.71 hours, which is comparable to CPS taking 4.40 hours. Likewise, MCCauSSL requires just an additional 0.34 hours compared to MC-Net+ (4.61 vs. 4.27 hours).

6. Conclusion

This paper proposes a novel causal diagram to provide plausible explanations for the effectiveness of SSL medical image segmentation. Based on the diagram, the importance of algorithmic independence is noticed, and a novel statistical quantification is designed for convolutional networks to approximate the uncomputable algorithmic independence. Then, we propose a causality-inspired SSL framework to further enhance the algorithmic independence and thus improve the SSL segmentation performance. Comparisons on three datasets and three network architectures demonstrate the effectiveness of our proposed method.

7. Acknowledgement

The work described in this paper was supported in part by the Shenzhen Portion of Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone under HZQB-KCZYB-20200089. The work was also partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Number: T45-401/22-N) and by the following grants from the Hong Kong Innovation and Technology Fund (Project Numbers: MHP/085/21 and GHP/080/20SZ).

References

[1] Wenjia Bai, Ozan Oktay, Matthew Sinclair, Hideaki Suzuki, Martin Rajchl, Giacomo Tarroni, Ben Glocker, Andrew

King, Paul M Matthews, and Daniel Rueckert. Semi-supervised learning for network-based cardiac mr image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 253–260. Springer, 2017.
[2] Yunhao Bai, Duowen Chen, Qingli Li, Wei Shen, and Yan Wang. Bidirectional copy-paste for semi-supervised medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11514–11524, 2023.
[3] Spyridon Bakas, Hamed Akbari, Aristeidis Sotiras, Michel Bilello, Martin Rozycki, Justin S Kirby, John B Freymann, Keyvan Farahani, and Christos Davatzikos. Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Scientific Data, 4(1):1–13, 2017.
[4] Spyridon Bakas, Mauricio Reyes, Andras Jakab, Stefan Bauer, Markus Rempfler, Alessandro Crimi, Russell Takeshi Shinohara, Christoph Berger, Sung Min Ha, Martin Rozycki, et al. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge. arXiv preprint arXiv:1811.02629, 2018.
[5] Spyridon (Spyros) Bakas. Brats miccai brain tumor dataset, 2020.
[6] Mehri Baniasadi, Mikkel V Petersen, Jorge Gonçalves, Andreas Horn, Vanja Vlasov, Frank Hertel, and Andreas Husch. Dbsegment: Fast and robust segmentation of deep brain structures considering domain generalization. Human Brain Mapping, 44(2):762–778, 2023.
[7] Olivier Bernard, Alain Lalande, Clement Zotti, Frederick Cervenansky, Xin Yang, Pheng-Ann Heng, Irem Cetin, Karim Lekadir, Oscar Camara, Miguel Angel Gonzalez Ballester, et al. Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Transactions on Medical Imaging, 37(11):2514–2525, 2018.
[8] Kailash Budhathoki and Jilles Vreeken. Causal inference by compression. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 41–50. IEEE, 2016.
[9] Daniel C Castro, Ian Walker, and Ben Glocker. Causality matters in medical imaging. Nature Communications, 11(1):3673, 2020.
[10] Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2613–2622, 2021.
[11] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, pages 424–432. Springer, 2016.
[12] Kenneth Clark, Bruce Vendt, Kirk Smith, John Freymann, Justin Kirby, Paul Koppel, Stephen Moore, Stanley Phillips, David Maffitt, Michael Pringle, et al. The cancer imaging archive (tcia): maintaining and operating a public information repository. Journal of Digital Imaging, 26(6):1045–1057, 2013.
[13] Povilas Daniušis, Dominik Janzing, Joris Mooij, Jakob Zscheischler, Bastian Steudel, Kun Zhang, and Bernhard Schölkopf. Inferring deterministic causal relations. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 143–150, 2010.
[14] Deng-Ping Fan, Tao Zhou, Ge-Peng Ji, Yi Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao. Inf-net: Automatic covid-19 lung infection segmentation from ct images. IEEE Transactions on Medical Imaging, 39(8):2626–2637, 2020.
[15] Pedro M Gordaliza, Juan José Vaquero, and Arrate Muñoz-Barrutia. Translational lung imaging analysis through disentangled representations. arXiv preprint arXiv:2203.01668, 2022.
[16] Peter D Grünwald. The minimum description length principle. MIT Press, 2007.
[17] Xinyue Huo, Lingxi Xie, Jianzhong He, Zijie Yang, Wengang Zhou, Houqiang Li, and Qi Tian. Atso: Asynchronous teacher-student optimization for semi-supervised image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1235–1244, 2021.
[18] Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021.
[19] Zhijing Jin, Julius von Kügelgen, Jingwei Ni, Tejas Vaidhya, Ayush Kaushal, Mrinmaya Sachan, and Bernhard Schölkopf. Causal direction of data collection matters: Implications of causal and anticausal learning for nlp. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9499–9513. Association for Computational Linguistics, 2021.
[20] Maxime Kayser, Roger D Soberanis-Mukul, Anna-Maria Zvereva, Peter Klare, Nassir Navab, and Shadi Albarqouni. Understanding the effects of artifacts on automated polyp detection and incorporating that knowledge via learning without forgetting. arXiv preprint arXiv:2002.02883, 2020.
[21] Zhanghan Ke, Daoye Wang, Qiong Yan, Jimmy Ren, and Rynson WH Lau. Dual student: Breaking the limits of the teacher in semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6728–6736, 2019.
[22] Wouter M Kouw, Silas N Ørting, Jens Petersen, Kim S Pedersen, and Marleen de Bruijne. A cross-center smoothness prior for variational bayesian brain tissue segmentation. In Information Processing in Medical Imaging: 26th International Conference, IPMI 2019, Hong Kong, China, June 2–7, 2019, Proceedings 26, pages 360–371. Springer, 2019.
[23] Julius Kügelgen, Alexander Mey, Marco Loog, and Bernhard Schölkopf. Semi-supervised learning, causality, and the conditional cluster assumption. In Conference on Uncertainty in Artificial Intelligence, pages 1–10. PMLR, 2020.
[24] Shuailin Li, Chuyu Zhang, and Xuming He. Shape-aware semi-supervised 3d semantic segmentation for medical images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 552–561. Springer, 2020.
[25] Xiaomeng Li, Lequan Yu, Hao Chen, Chi-Wing Fu, Lei Xing, and Pheng-Ann Heng. Transformation-consistent self-ensembling model for semisupervised medical image segmentation. IEEE Transactions on Neural Networks and Learning Systems, 32(2):523–534, 2020.
[26] Xiangde Luo, Jieneng Chen, Tao Song, and Guotai Wang. Semi-supervised medical image segmentation through dual-task consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8801–8809, 2021.
[27] Xiangde Luo, Minhao Hu, Tao Song, Guotai Wang, and Shaoting Zhang. Semi-supervised medical image segmentation via cross teaching between cnn and transformer. In International Conference on Medical Imaging with Deep Learning, pages 820–833. PMLR, 2022.
[28] Xiangde Luo, Wenjun Liao, Jieneng Chen, Tao Song, Yinan Chen, Shichuan Zhang, Nianyong Chen, Guotai Wang, and Shaoting Zhang. Efficient semi-supervised gross target volume of nasopharyngeal carcinoma segmentation via uncertainty rectified pyramid consistency. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 318–329. Springer, 2021.
[29] Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, et al. The multimodal brain tumor image segmentation benchmark (brats). IEEE Transactions on Medical Imaging, 34(10):1993–2024, 2014.
[30] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.
[31] Dong Nie, Yaozong Gao, Li Wang, and Dinggang Shen. Asdnet: attention based semi-supervised deep networks for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 370–378. Springer, 2018.
[32] Cheng Ouyang, Chen Chen, Surui Li, Zeju Li, Chen Qin, Wenjia Bai, and Daniel Rueckert. Causality-inspired single-source domain generalization for medical image segmentation. IEEE Transactions on Medical Imaging, 2022.
[33] Judea Pearl. Causality. Cambridge University Press, 2009.
[34] Jizong Peng, Guillermo Estrada, Marco Pedersoli, and Christian Desrosiers. Deep co-training for semi-supervised image segmentation. Pattern Recognition, 107:107269, 2020.
[35] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
[36] Siyuan Qiao, Wei Shen, Zhishuai Zhang, Bo Wang, and Alan Yuille. Deep co-training for semi-supervised image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 135–152, 2018.
[37] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[39] Holger R Roth, Amal Farag, E Turkbey, Le Lu, Jiamin Liu, and Ronald M Summers. Data from pancreas-ct. The Cancer Imaging Archive. IEEE Transactions on Image Processing, 2016.
[40] Holger R Roth, Le Lu, Amal Farag, Hoo-Chang Shin, Jiamin Liu, Evrim B Turkbey, and Ronald M Summers. Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 556–564. Springer, 2015.
[41] Ruben Sanchez-Romero, Joseph D Ramsey, Kun Zhang, and Clark Glymour. Identification of effective connectivity subregions. arXiv preprint arXiv:1908.03264, 2019.
[42] Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, pages 459–466, 2012.
[43] Jessica Schrouff, Natalie Harris, Oluwasanmi Koyejo, Ibrahim Alabdulmohsin, Eva Schnider, Krista Opsahl-Ong, Alex Brown, Subhrajit Roy, Diana Mincu, Christina Chen, et al. Maintaining fairness across distribution shift: do we have viable solutions for real-world applications? arXiv preprint arXiv:2202.01034, 2022.
[44] Zhiqiang Shen, Peng Cao, Hua Yang, Xiaoli Liu, Jinzhu Yang, and Osmar R Zaiane. Co-training with high-confidence pseudo labels for semi-supervised medical image segmentation. arXiv preprint arXiv:2301.04465, 2023.
[45] Yinghuan Shi, Jian Zhang, Tong Ling, Jiwen Lu, Yefeng Zheng, Qian Yu, Lei Qi, and Yang Gao. Inconsistency-aware uncertainty estimation for semi-supervised medical image segmentation. IEEE Transactions on Medical Imaging, 2021.
[46] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems, 30, 2017.
[47] Bethany H Thompson, Gaetano Di Caterina, and Jeremy P Voisey. Pseudo-label refinement using superpixels for semi-supervised brain tumour segmentation. In 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), pages 1–5. IEEE, 2022.
[48] Athanasios Vlontzos, Daniel Rueckert, and Bernhard Kainz. A review of causality for learning algorithms in medical image analysis. arXiv preprint arXiv:2206.05498, 2022.
[49] Kaiping Wang, Bo Zhan, Chen Zu, Xi Wu, Jiliu Zhou, Luping Zhou, and Yan Wang. Tripled-uncertainty guided mean teacher model for semi-supervised medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 450–460. Springer, 2021.
[50] Yicheng Wu, Zongyuan Ge, Donghao Zhang, Minfeng Xu, Lei Zhang, Yong Xia, and Jianfei Cai. Enforcing mutual consistency of hard regions for semi-supervised medical image segmentation. arXiv preprint arXiv:2109.09960, 2021.
[51] Yicheng Wu, Minfeng Xu, Zongyuan Ge, Jianfei Cai, and Lei Zhang. Semi-supervised left atrium segmentation with mutual consistency training. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 297–306. Springer, 2021.
[52] Yingda Xia, Dong Yang, Zhiding Yu, Fengze Liu, Jinzheng Cai, Lequan Yu, Zhuotun Zhu, Daguang Xu, Alan Yuille, and Holger Roth. Uncertainty-aware multi-view co-training for semi-supervised medical image segmentation and domain adaptation. Medical Image Analysis, 65:101766, 2020.
[53] Jinyi Xiang, Peng Qiu, and Yang Yang. Fussnet: Fusing two sources of uncertainty for semi-supervised medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 481–491. Springer, 2022.
[54] Huifeng Yao, Xiaowei Hu, and Xiaomeng Li. Enhancing pseudo label quality for semi-supervised domain-generalized medical image segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3099–3107, 2022.
[55] Lequan Yu, Shujun Wang, Xiaomeng Li, Chi-Wing Fu, and Pheng-Ann Heng. Uncertainty-aware self-ensembling model for semi-supervised 3d left atrium segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 605–613. Springer, 2019.

