Zhang et al. - 2023 - Causal reasoning in typical computer vision tasks
Abstract—Deep learning has revolutionized the field of artificial intelligence. Based on the statistical correlations uncovered by deep learning-based methods, computer vision has contributed to tremendous growth in areas like autonomous driving and robotics. Despite being the basis of deep learning, such correlation is not stable and is susceptible to uncontrolled factors. In the absence of the guidance of prior knowledge, statistical correlations can easily turn into spurious correlations and cause confounders. As a result, researchers are now trying to enhance deep learning methods with causal theory. Causal theory models the intrinsic causal structure unaffected by data bias and is effective in avoiding spurious correlations. This paper aims to comprehensively review the existing causal methods in typical vision and vision-language tasks such as semantic segmentation, object detection, and image captioning. The advantages of causality and the approaches for building causal paradigms will be summarized. Future roadmaps are also proposed, including facilitating the development of causal theory and its application in other complex scenes and systems.

Index Terms—causal reasoning, computer vision tasks, vision-language tasks, semantic segmentation, object detection.

I. INTRODUCTION

Deep learning techniques have improved our understanding of the world, and deep learning-based computer vision methods enable high-performance intelligent perception of our surroundings [1]. Areas such as autonomous cars [2, 3], unmanned aircraft [4], and robotics [5, 6] are developing rapidly with technological innovations. To guarantee performance across one or even more domains, training strategies like attention mechanisms [7, 8], pre-training mechanisms [9, 10], and generic large models [11] have been proposed. Despite their great performance, the basis of these deep learning-based methods is to learn statistical correlations. However, statistical correlation learns regular knowledge from the final presentation of the data, which lacks the guidance of prior knowledge. In contrast, causality focuses on the mechanisms of the data generation process and the causal structure of specific tasks. Causality implies the relationship between two variables in which the cause variable directly affects the effect variable [12]. Though both statistical correlation and causality are data-driven [13], the former uses the consistency of data trends as the basis for determining relationships, while causality reflects the inherent characteristics and structures both inside and between variables. As there is not always a causal relationship between variables that change in a consistent trend, statistical correlation-based methods tend to mistakenly use non-causal relationships, also called spurious correlations, as the basis for network inference. In particular, when the data is multidimensional and heterogeneous, the complexity of the relationships may further amplify the impact of spurious correlations. Simpson's paradox, described by Blyth [14], well illustrates the flaws of statistical correlation-based methods: the direction of an association at the population level may be reversed within the subgroups comprising that population. As stated in [15], the relation between coffee consumption and neuroticism is positive for each individual, but those individuals who drink more coffee are generally less neurotic; the correlation is thus positive at the individual level yet negative in the population. Such a paradox is worth noticing, as different levels of interpretation lead to different results for the same data. It follows that neither individual-level nor population-level statistical correlations can fully characterize the relationship between coffee consumption and neuroticism. However, causality-based methods specify the prior knowledge concerning the causal structures and derive the correct causal chains at the specific level of interpretation. As a result, the causality-based methods are more logical and effective compared to the statistical correlation-based ones.

Typical deep learning-based computer vision tasks can be summarized as follows: given an image X, the goal is to build a network to predict its label Y correctly [16]. A statistical model fitted with a suitable objective function is often used to estimate the conditional probability distribution P(Y|X). However, only under the independent and identically distributed (I.I.D.) hypothesis can the learned conditional probability distribution P(Y|X) be applied appropriately from the training set to the testing set; it requires that new prediction samples stay consistent with the distribution of the training set. To minimize the impact of domain differences between the two sets, deep learning-based methods such as domain adaptation [17] and domain generalization [18] have been proposed. However, the problem of generalizability cannot be fundamentally solved through these methods, as the bias caused by domain differences is not fundamentally eliminated. Additionally, since deep learning-based methods benefit from many stacked layers and parameters for approximating high-dimensional complex functions, it is hard to interpret the learned models and parameters [16]. In conclusion, existing statistical correlation-based methods overly rely on the given
data and analyze the problems through approximating high-dimensional functions rather than mining the underlying mechanisms. There are areas for improvement in both generalizability and interpretability.

Due to its strength in modeling underlying mechanisms, causality has gained significant attention recently and has rooted its developments across several fields, such as statistics [19, 20], economics [21, 22], epidemiology [23, 24], and computer science [25, 26]. Basic causal methods can be divided into two main aspects: causal discovery and causal inference [27]. Based on the causal structures learned by causal discovery, causal inference leverages those relationships for further analysis. An example is proposed to highlight the advantages of causality-based methods. Fig. 1 compares the statistical correlation-based and the causality-based methods for the same image classification task. Given the input images of the sheep and the corresponding labels in Fig. 1 (a), the model is trained for correct identification. The visualizations of the learned features are shown in Fig. 1 (b). Both statistical correlation-based and causality-based learning mechanisms are specifically shown in Fig. 1 (c). Since the sheep and the grassland frequently coexist in the training data, the statistical correlation-based process tends to regard the grassland characteristics as the basis for the labels due to their similar distribution. In contrast, the causality-based methods focus more on the objective causal chain of the sheep and prefer wool features as the basis. Different learning mechanisms lead to different classification results, especially when faced with rare samples, as shown in Fig. 1 (d). When given an image of a sheep standing in the snow, the statistical correlation-based model fails to assign the correct label because no grassland features are found. Conversely, by focusing on the wool features, the causality-based model can accurately categorize the sheep based on the causal features. As a result, the causality-based methods rely not only on the consistency of data trends but also on the underlying processes that generate the causal structure between variables [28]. They are more robust towards an unknown environment, and the learning process tends to be more interpretable.

This review focuses on the implications of causal theory for vision and vision-language tasks, including classification, detection, segmentation, visual recognition, image captioning, and visual question answering. There exist some surveys about causal theory [13, 27, 29, 30]. Kaddour et al. [29] group existing causal machine learning work into five categories and comprehensively compare existing approaches. Gao et al. [13] focus on the application of causal reasoning to recommendation systems. Li et al. [30] outline the advantages of causal theory in industrial applications. Chen et al. [27] divide causal discovery tasks into three types according to the variable paradigm: definite tasks, semidefinite tasks, and indefinite tasks. Unlike these existing causality-based reviews, this work compares the causal structures of different approaches in vision and vision-language tasks and summarizes the correspondence between the commonly used causal structures and the corresponding concerns. Furthermore, the article outlines the benefits of the causal approaches regarding accuracy, generalizability, and interpretability. Section II introduces the basic concepts and terminology of causality, including structural causal models, causal interventions, back-door adjustments, front-door adjustments, and counterfactuals. Section III proposes a systematic and structural review of the specific vision and vision-language tasks. It analyzes the corresponding causal structure and identifies the confounders. Five typical causal structures are summarized for comparison and methodology overviews. Section IV lists the future roadmaps for better development of causal theory and a broad application of causal theory. Section V concludes this review.

II. PRELIMINARIES

In this section, the basic causal methods are introduced, including causal discovery and causal inference. Causal discovery aims to build causal relationships with structural causal models (SCMs), while causal inference is used to estimate the causal effect.

A. Causal discovery

1) Causal structure: A causal structure is a directed acyclic graph (DAG) in which each node corresponds to a specific variable, and each edge indicates a direct functional relationship between the linked variables [12]. If there is a directed edge pointing from Y to X, X is called the child variable, while Y is the corresponding parent variable. If a variable has a parent variable in the causal structure, it is endogenous; otherwise, it is exogenous. To better describe the concepts involved, the causal structure depicting the process of image generation is shown in Fig. 2 as an example.

The graph has four nodes, where X represents the given image, Y represents the corresponding label, C represents the content, and D represents the domain involved. The edge D → X ← C describes the image generation process, in that the image includes both the content and the domain information. The edge X → Y indicates the ultimate goal of a vision task: finding a suitable label for the given image. The edge C → Y means that the content in the image determines the label. Since there are two directed edges pointing to X, D → X and C → X, D and C are the parent variables of X. Also, D is exogenous because it has no parent variables in the graph. A path between two variables is a sequence of edges connecting them. The path between X and Y in Fig. 2 can be either X → Y or X ← C → Y.

2) Structural causal models: Based on the causal structure, a structural causal model can specify how each variable is influenced by its parent variables through effect functions. Given a set of variables X_1, X_2, ..., X_n indicating the nodes in the causal structure, each can be written as the outcome of its parent variables PA_i and a noise term U_i through the effect function [12]:

    X_i = f_i(PA_i, U_i),  i = 1, ..., n.    (1)
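To make Eq. (1) concrete, the following Python sketch samples from a toy SCM for the graph in Fig. 2 (D → X ← C, X → Y ← C). The particular functional forms f_i and noise terms U_i are illustrative assumptions, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_scm(n):
        # Exogenous root nodes: domain D (e.g., grassland vs. snow) and
        # content C (e.g., sheep present or not).
        d = rng.integers(0, 2, size=n)
        c = rng.integers(0, 2, size=n)
        # Endogenous nodes follow Eq. (1): X_i = f_i(PA_i, U_i).
        u_x = rng.normal(0.0, 0.1, size=n)
        u_y = rng.normal(0.0, 0.1, size=n)
        x = 0.7 * c + 0.3 * d + u_x                       # image X = f_X(C, D, U_X)
        y = (0.8 * c + 0.2 * x + u_y > 0.5).astype(int)   # label Y = f_Y(C, X, U_Y)
        return d, c, x, y

    d, c, x, y = sample_scm(10_000)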
Fig. 1. Comparison between statistical correlation-based methods and causality-based methods. Given the images and the corresponding labels, the image
classification network performs feature extraction, makes an analytical decision on the feature, and finally outputs the corresponding labels. The figure above
compares the two learning mechanisms: statistical correlation-based analysis and causality-based analysis, as shown in (c). The input in (a) is visualized
by feature extraction in (b). The green boxes indicate the captured non-causal features, while the red boxes indicate the captured causal features. Due to
different learning mechanisms, the statistical correlation-based method fails to infer the correct label. In contrast, the causality-based method can get the
correct label, as shown in (d). When faced with the learned object in an unknown environment, the correlation-based method tends to be misled by data
bias. In contrast, the causality-based method focuses only on the causal factors associated with the object and is not disturbed by data variation.
Fig. 2. A general causal structure for image generation, where X represents the given image, Y represents the learned label, C represents the corresponding
content, and D represents the domain involved.
B. Causal inference

1) Causal intervention: A causal intervention forces a variable of interest to take a given value, denoted by the do-operator do(·), and examines the corresponding manipulated conditional probabilities. The example in Fig. 2 can be taken for a better illustration. A causal intervention on the domain can be proposed to observe the effect of domain variables on the corresponding labels. When keeping other variables constant, P(Y|do(D)) can denote the outcome of Y when changing the domain knowledge. Due to the complexity of the relationships between variables, it is difficult to directly determine the causal chain between the cause and the effect without being interrupted by spurious correlations. Therefore, interventions are needed to ensure independence between variables when exploring the simple direct causal association. The basic connection structures between three variables X, Y, and C, proposed by Rebane and Pearl [31], are shown in Fig. 3 (a), (b), and (c). Different ways of intervention are used for different structures owing to their unique characteristics. The path X → C → Y in Fig. 3 (a) is a chain junction, where X affects Y via the mediator C. In vision tasks, the feature can be learned from the given image, and the label is assigned by referring to the learned feature. It is easy to find that an intervention on C can easily block the path between X and Y. The path X ← C → Y in Fig. 3 (b) is called a confounding junction, where C affects both X and Y and is called a confounder. In vision tasks, the context can affect both the images and the labels, adding a spurious correlation to the actual causal chain between images and labels. Under these circumstances, interventions should be taken on C to block the path. The path X → C ← Y in Fig. 3 (c) is called a collider, where both X and Y decide C. In vision tasks, the image can be generated by both the content and the domain information. When the value of C is unknown, X and Y are independent. Once the value of C is accessible, X is in relation to Y. Therefore, the value of C cannot be intervened on to block the path between X and Y.
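The collider behavior can be checked numerically. In the minimal sketch below, X and Y are independent by construction, yet selecting on (i.e., conditioning on) their common effect C induces a spurious dependence; the toy data-generating process is an assumption for illustration only.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200_000
    x = rng.normal(size=n)                  # first cause
    y = rng.normal(size=n)                  # second cause, independent of x
    c = x + y + rng.normal(0.0, 0.1, n)     # collider: X -> C <- Y

    marginal = np.corrcoef(x, y)[0, 1]      # ~ 0: X and Y are independent
    sel = c > 1.0                           # conditioning on the collider C
    conditional = np.corrcoef(x[sel], y[sel])[0, 1]  # clearly negative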
In conclusion, causal intervention guarantees the independence between variables and eliminates the effect brought by potential confounders. It is flexible and is decided by the specific structure. Two typical ways of causal intervention are described below in detail.

2) Back-door adjustment: Back-door adjustment is one of the de-confounding techniques used in causal intervention [12]. Any path from X to Y that starts with an arrow pointing into X is a back-door path. Assume that there are three variables, X, Y, and C; the corresponding structure is shown in Fig. 3 (d). Since the back-door path X ← C → Y is the confounding junction structure, the way to block the path is to manipulate C through an intervention [32] if the value of C is available. The specific manipulation is to stratify C and calculate the average causal effect at each stratum. The corresponding back-door adjustment is shown in Fig. 3 (e), and the formulation is as follows:

    P(Y|do(X)) = Σ_c P(Y|X, c) P(c).    (3)

With the back-door adjustment, the probability of variable Y conditional on a given variable X, which indicates the causal relationship, can be obtained by summing the conditional probabilities P(Y|X, c) weighted by the stratified probabilities P(c) of the confounder C.

3) Front-door adjustment: If C is unavailable, the manipulation cannot be conducted and the back-door adjustment is useless. Under such circumstances, the front-door adjustment is responsible for guiding the discovery of causality [12]. An observed mediator variable M is introduced to block out the relationship between X and Y, as shown in Fig. 3 (f). In the presence of M, the effect of X on M and the effect of M on Y can be calculated, respectively. To infer the effect of X on M as P(M = m|do(X)), the path X ← C → Y ← M must be blocked [32]. Since Y is a collider on this path, the path is blocked without any intervention, and the equation can be expressed as follows:

    P(M = m|do(X)) = P(M = m|X).    (4)

To infer the effect of M on Y as P(Y = y|do(M)), the back-door path M ← X ← C → Y should be blocked out. As C is not available, X should be controlled to block the path, and the corresponding equation can be expressed as below:

    P(Y = y|do(M)) = Σ_x P(Y|M, x) P(x).    (5)

In conclusion, the front-door adjustment can be described as:

    P(Y|do(X)) = Σ_m P(M = m|do(X)) P(Y = y|do(M))
               = Σ_m P(M = m|X) Σ_x P(Y|M = m, x) P(x).    (6)

With the front-door adjustment, the probability of variable Y conditional on a given variable X can be obtained by introducing a mediator variable M. In conclusion, both the back-door adjustment and the front-door adjustment can estimate the causal effect. When choosing the de-confounding technique, the correlations between variables should be justified, and the characteristics of the confounders should be figured out. If the confounder is available, a back-door adjustment is appropriate; otherwise, the front-door adjustment is the better choice.

4) Counterfactual: The counterfactual is another way of making causal inference. It is the opposite of the factual and is often used to estimate the difference between the variables of interest and their observed values in the real world [13]. Using the error terms [12] through comparison, an intervention on the variables of interest is used to predict the outcome in the counterfactual world. For example, some counterfactual features can be generated randomly using noise to measure the effectiveness of the learned image features on the final label determination. The error terms of the label performance can act as a criterion for effectiveness.
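As a hedged sketch of the counterfactual check just described, one can replace a learned feature dimension with random noise (a "counterfactual" feature) and read the drop in label performance as the error-term criterion. The `model` and `features` objects below are hypothetical placeholders, not an implementation of any cited method.

    import numpy as np

    def counterfactual_effect(model, features, labels, dim, rng):
        # Factual accuracy with the learned features.
        base_acc = (model(features) == labels).mean()
        # Counterfactual features: one dimension is replaced by pure noise.
        cf = features.copy()
        cf[:, dim] = rng.normal(size=len(cf))
        cf_acc = (model(cf) == labels).mean()
        # The performance gap acts as the criterion for causal effectiveness.
        return base_acc - cf_acc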
Fig. 3. The three graphs (a), (b), and (c) in the first row show three common connection structures between X, C, and Y . They are the chain junction, the
confounding junction, and the collider, respectively. Different interventions are required for different types of connections. Graph (d) in the second row is the
most common causal structure, while graphs (e) and (f) are two typical causal intervention methods used for the structure in (d). (e) indicates the back-door
adjustment and (f) indicates the front-door adjustment. In each graph, X represents a set of input variables, Y represents a set of output variables, and C
represents the confounder that causes the spurious correlation. The dashed lines represent the blocked-out paths, and the blue nodes represent the confounders.
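A toy numerical check of the back-door adjustment in Eq. (3) for the confounded structure of Fig. 3 (d), with binary C, X, and Y; all probability tables are made-up numbers chosen only to make the confounding visible.

    import numpy as np

    p_c = np.array([0.5, 0.5])            # P(C = c)
    p_x1_c = np.array([0.2, 0.8])         # P(X = 1 | C = c)
    p_y1_xc = np.array([[0.2, 0.5],       # p_y1_xc[x, c] = P(Y = 1 | X = x, C = c)
                        [0.4, 0.7]])

    # Observational conditional P(Y = 1 | X = 1): C follows P(C | X = 1).
    w = p_x1_c * p_c
    p_c_x1 = w / w.sum()
    obs = (p_y1_xc[1] * p_c_x1).sum()     # 0.64, inflated by the confounder

    # Interventional P(Y = 1 | do(X = 1)) via Eq. (3): stratify C with P(C).
    do = (p_y1_xc[1] * p_c).sum()         # 0.55, the de-confounded effect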
5) Potential outcome models: The potential outcome models were first proposed by Neyman et al. [33] to estimate the causal effect of a treatment variable on an outcome variable without requiring the causal graph. Ideally, the difference between two potential outcomes can be regarded as the causal effect of the treatment on the outcome. For example, given binary treatments T = 0/1, the individual treatment effect (ITE) for an individual i is defined as Y_1^i − Y_0^i [13]. However, only one of these outcomes can be observed at a time, either Y_1^i or Y_0^i. As a result, the average treatment effect (ATE) is proposed as an extension of the individual treatment effect to measure the overall average. The formulation is as follows:

    ATE = E[Y_1^i − Y_0^i] = (1/N) Σ_{i=1}^{N} (Y_1^i − Y_0^i),    (7)

where i = {1, 2, ..., N} indexes each individual in the population.
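A minimal sketch of the ATE estimator in Eq. (7) under an assumed randomized binary treatment. Both potential outcomes are simulated here so that the difference-in-means estimate can be checked against the ground truth, whereas real data reveal only one outcome per individual.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    y0 = rng.normal(0.0, 1.0, size=n)   # potential outcome Y_0^i
    y1 = y0 + 2.0                       # potential outcome Y_1^i (true ATE = 2)
    t = rng.integers(0, 2, size=n)      # randomized treatment assignment
    y_obs = np.where(t == 1, y1, y0)    # only one outcome is ever observed

    ate_true = (y1 - y0).mean()                            # Eq. (7), oracle version
    ate_hat = y_obs[t == 1].mean() - y_obs[t == 0].mean()  # difference in means
    # Under randomization, ate_hat approaches ate_true; observational data
    # would instead require a back-door or front-door adjustment.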
III. CAUSALITY IN DIFFERENT TASKS

A. Methodology summary

The process of methods based on causal theory includes building the structural causal models through causal discovery and choosing the proper way of causal inference. In the structure, nodes indicate the important variables or variables that are hidden but have an effect on the important ones, while the edges indicate the inner causal relationships. Combining different causal structures and different concerns, suitable causal reasoning methods are selected to eliminate spurious correlations. Table I summarizes the causality-based methods in both vision and vision-language tasks. In addition to the task names, the causal structures and the concerns are also listed to compare different methods.

1) Typical causal structures: The causal structures covered in this paper can be summarized into five ones, as shown in Fig. 4. Apart from the simplest structure (I) in Fig. 4, the other structures introduce intermediate variables to further elucidate the relationship.

Fig. 4 (I) describes the relationships between the input, the output, and a confounder that introduces the unwanted spurious correlation. If C is available, a back-door adjustment is often used to cut off the link between X and C to analyze the actual effect of X on Y as follows:

    P(Y|do(X)) = Σ_c P(Y|X, c) P(c).    (8)

Fig. 4 (II) introduces an intermediate variable M between the input and the output. In vision and vision-language tasks, M is a more specific representation of the mapping between inputs and outputs. It is also introduced to avoid the manipulation of C when it cannot be obtained directly in the front-door adjustment:

    P(Y|do(X)) = Σ_m P(M = m|X) Σ_x P(Y|M = m, x) P(x).    (9)

Fig. 4 (III) introduces a new variable M between the confounder and the output, which often indicates a high-dimensional feature of the effect of the confounder on the output. If the confounder C is available, the intervention method will also change:

    P(Y|do(X)) = Σ_c P(Y|X, M, c) P(M|X, c) P(c).    (10)

Otherwise, a mediator needs to be introduced between X and Y, as shown in Eq. 9 for the front-door intervention.

The intermediate variable M can also be placed between the confounder and the input, as shown in Fig. 4 (IV). In this case, conditioning on the intermediate variable M is equivalent to cutting off the spurious correlation:

    P(Y|do(X)) = Σ_c P(Y|X, c) P(c, M = m).    (11)
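To illustrate Eq. (9), the sketch below applies the front-door adjustment to a toy discrete model with an unobserved confounder C and checks the result against the oracle interventional distribution; all probability tables are invented for illustration only.

    import numpy as np

    p_c = np.array([0.4, 0.6])                    # P(C), unobservable in practice
    p_x_c = np.array([[0.8, 0.2], [0.3, 0.7]])    # p_x_c[c, x] = P(X = x | C = c)
    p_m_x = np.array([[0.9, 0.1], [0.2, 0.8]])    # p_m_x[x, m] = P(M = m | X = x)
    p_y1_mc = np.array([[0.1, 0.5], [0.4, 0.9]])  # P(Y = 1 | M = m, C = c)

    # Observational quantities a model could estimate without ever seeing C.
    p_x = p_c @ p_x_c                             # P(X = x)
    p_c_x = (p_c[:, None] * p_x_c) / p_x          # P(C = c | X = x), oracle helper
    p_y1_mx = np.array([[(p_y1_mc[m] * p_c_x[:, x]).sum() for x in (0, 1)]
                        for m in (0, 1)])         # P(Y = 1 | M = m, X = x)

    # Front-door estimate of P(Y = 1 | do(X = 1)) via Eq. (9).
    fd = sum(p_m_x[1, m] * sum(p_y1_mx[m, x] * p_x[x] for x in (0, 1))
             for m in (0, 1))

    # Oracle interventional probability computed from the full model.
    truth = sum(p_c[c] * sum(p_m_x[1, m] * p_y1_mc[m, c] for m in (0, 1))
                for c in (0, 1))
    assert np.isclose(fd, truth)                  # both equal 0.628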
Fig. 4. These are the common structures involved in causality-based tasks, in which each colored part indicates confounders.
Different from the previous structures, the structure in Fig. 4 (V) focuses on the effect of different co-occurring confounders on the input. De-correlation of the co-occurring variables or elimination of the effect of these variables is desired in this situation. The potential outcome models proposed by Neyman et al. [33], which quantitatively extrapolate results under different conditions, are often used.

2) Methods for different concerns: The main concerns in introducing causal theory for performance enhancement can be divided into three aspects: accuracy, generalizability, and interpretability. These three aspects have their priorities but overlap with each other. When enhancing model interpretability from a causal perspective, the approach focuses on the intrinsic causal relationship between variables and develops further inference through this objective, stable correlation. Methods that use causal theory for accuracy enhancement [34–36] often focus on the task in one given domain and rebuild the structure involving the hidden confounders. They analyze the limitations of existing statistical correlation-based methods at a holistic level and identify the causes of the confounders. With the structural causal model built on prior knowledge, such methods strive to cut off the spurious correlations and infer with the right causal chain. In these methods, confounders are often the specific variables in the inference process of the task. Methods that aim at improving generalizability [37, 38] often focus on the differences of one task between multiple domains and aim to achieve stable results against the domain gap. The confounders in these methods are often the domain and the corresponding context information. For generalizability, such methods aim to eliminate the effects of the domain and learn domain-invariant features for subsequent inference. Methods that focus on interpretability [39, 40] often expect the model to rely on the correct visual region when making decisions. Instead of making interventions under the guidance of causal structure, such methods often use the potential outcome models and the counterfactual to estimate the impact of the variables as the basis for determining network behavior. Generative models are often used in these methods to eliminate variables with no causal relationship. When enhancing generalizability from the causal perspective, the spurious correlation brought by the context can be eliminated; in this case, accuracy can also be enhanced, since the noise in the context will affect the exploration of causal relationships even within a domain. Guided by prior knowledge, the structural causal model makes the method inherently interpretable.

These three concerns are used to delineate the existing articles in the following section as methods for accuracy enhancement, methods to improve generalizability, and methods to promote interpretability. Since existing work does not fully encompass these three concerns, some tasks will lack certain concerns. Nevertheless, the application of causal theory for performance enhancement is meaningful in every task.

B. Causal reasoning in vision tasks

This section will cover four tasks: classification, detection, segmentation, and visual recognition. The methods for each task will be classified according to the different concerns: accuracy, generalizability, and interpretability. The causal structure for analysis and the causal inference methods will also be discussed. The approaches that focus on accuracy often construct causal structures, analyze confounders in specific problems, and design effective causal inference methods based on the proposed structure. The approaches that focus on generalizability aim to achieve stable performance between different domains; they learn stable causal representations through methods such as causal inference and reweighting. The approaches that focus on interpretability try to explore the reasons for network decisions by generating new samples to identify the basis for those decisions.
TABLE I
This is a summary of causal theory-based vision tasks (vision-language included). For convenience, abbreviations are used to denote individual tasks: Cls for classification, Det for object detection, SemSeg for semantic segmentation, WSSS for weakly supervised semantic segmentation, MedSeg for medical image semantic segmentation, VisRecog for target recognition, ImgCap for image captioning, VQA for visual question answering, 3DRecon for 3D reconstruction, 3DPose for 3D pose estimation, and ObjNav for object navigation. The structures in the last column correspond to the structures in Fig. 4.
1) Classification: Classification is a fundamental problem in computer vision, which tries to analyze the correlation between the images and the corresponding labels. Existing methods [80–82] mainly focus on the statistical correlations between the features and the labels, which will be heavily influenced by spurious correlations in the presence of noise [49] or inconsistent distributions [48]. Therefore, it is important to introduce causal theory into the classification task to learn the causal relationships between the images and the labels.

Methods for accuracy enhancement: The long-tail effect unbalances the impact of different classes on momentum and seriously affects classification accuracy. Tang et al. [34] assign the SGD momentum as the confounder and use the back-door adjustment on the structure including the feature, the SGD momentum, the projection head, and the predicted label in Fig. 4 (III). Such a do-operation removes the bad confounder bias while keeping the good mediator bias, eliminating the negative effect of SGD momentum. It is a paradox that a stronger pre-trained model may enlarge the dissimilarity between the support set and the query set in few-shot learning, thus affecting the classification accuracy. Yue et al. [41] point out that the pre-trained knowledge is the confounder that causes spurious correlations between the sample features and class labels in the support set. They introduce the transformed representation of sample features and use the back-door adjustment in the structure in Fig. 4 (III) to cut off the effect of pre-trained knowledge on the feature representation. The model's misperception of the imposed noise in the image affects the classification accuracy. Qiu et al. [50] identify a new type of bias, the task-induced bias, and use the back-door adjustment on the structure in Fig. 4 (I) between image, label, and task identifier to transform biased features into unbiased ones. Yang et al. [49] summarize the effects of noise as unobserved confounders. Guided by the structure in Fig. 4 (I), they use a generative model to generate unobserved confounders for estimation and assess causal effects in noisy image classification tasks through treatment estimation. The learning of a robust representation is proposed against any unexpected noise.

Methods to improve generalizability: In practical image classification tasks, the assumption of independent identical distributions is unrealistic. In an ever-changing environment, there exist stable variables that remain constant and unstable variables that often change. The instabilities, in turn, create biases between domains, introduce spurious variables, and affect generalizability. Concerning the need for multi-domain generalization, methods are proposed with different structures. Lv et al. [46] build a structural causal model containing causal factors, non-causal factors, raw inputs, and the category label based on Fig. 4 (I). They use a causal intervention module, a causal factorization module, and an adversarial mask module for more robust representation learning. With the same structure, a causal regularizer is proposed by Shen et al. [37] to balance the confounder distributions for each treatment feature by reweighting the samples. Through reweighting, the confounder distributions can be balanced to correct for the bias from non-random treatment assignments. By adopting a different structure, as shown in Fig. 4 (II), Miao et al. [45] use the front-door adjustment due to the specific causal structure between seven variables: the domain, the object, the prior knowledge, the category factor, the domain identity, the unseen images, and the labels. They transfer unseen images to taught knowledge, which is the features of seen images, and cut off excess causal paths to calculate the causal effect. Chen et al. [51] propose a new learning paradigm, namely simulate-analyze-reduce, to infer the causes of domain shift between the auxiliary and source domains during training. Through counterfactual inference, they reduce the effect of domain shift between semantic concept, image, and category in Fig. 4 (I).

Methods to promote interpretability: It is essential to explain the classification decision drivers of the neural network; however, existing correlation-based explanation methods fail to consider the impact of the confounders and are easily affected by misleading information. Goyal et al. [39] propose a conditional generative model for the generation of counterfactuals and quantitatively measure the concepts with the causal concept effect method. With the structure shown in Fig. 4 (V), they model the relationships between the high-level concepts, the images, and the classifier output.

2) Detection: The goal of object detection is to determine where objects are located in a given image and to which category each object belongs. Different from the classification task, the detection task requires not only an accurate classification but also the precise location of the target in the image. Existing methods [83–85] have made significant achievements in practical applications concerning autonomous driving, but their effectiveness is heavily compromised by biases from complex scenarios. The introduction of causal theory into detection tasks allows for accurate categorization by learning stable target properties on the one hand and precise localization by sorting out the causal interactions between the target and the environment on the other.

Methods for accuracy enhancement: Existing object detection methods tend to focus on the statistical correlation between instances, bounding boxes, and labels, ignoring spurious correlations introduced by contextual bias, which in turn results in decreased accuracy. Huang et al. [36] formulate the causalities in object detection tasks with the structure in Fig. 4 (III) and assign the context as the confounder. Through back-door adjustment, the non-causal but positive correlation between pixels and labels can be avoided. In unsupervised salient object detection tasks, biases caused by semantic contrast and image distribution heavily introduce spurious bias, thus limiting the improvement of accuracy. To eliminate both the contrast distribution bias and the spatial distribution bias, Lin et al. [53] identify the visual contrast distribution as the confounder which misleads
the model training towards data-rich visual contrast clusters. They use causal inference in the structure shown in Fig. 4 (IV) to make each visual contrast cluster contribute fairly and propose an image-level weighting strategy to eliminate the spatial bias. The application of knowledge distillation improves the model's ability to capture semantic information in few-shot object detection. However, the empirical error of the teacher model degenerates the student model's prediction of the target labels, interfering with the student's prediction accuracy on downstream tasks. Li et al. [54] designate classification knowledge for specific domains as the confounder and use the back-door adjustment in the structure in Fig. 4 (IV) to remove the effect of the semantic knowledge and re-merge the remaining variables as new knowledge distillation targets.

Methods to improve generalizability: Automated driving technology requires extremely high levels of safety and robustness assurance. Perception modules trained on neural networks are often used for target detection. However, modules with better training results often fail to guarantee performance in unknown scenarios due to deviations between the target and training domains. Resnick et al. [52] collect real-world data through an automated driving simulator and make causal interventions on the data to discriminate factors detrimental to safe driving. In turn, the source of the domain variation problem is resolved in advance, and the deviation is eliminated. Xu et al. [55] propose to remove non-causal factors from common features by multi-view adversarial training on source domains and purify the domain-invariant features. They clearly clarify the relationships among causal factors, non-causal factors, domain-specific features, and domain-common features from a causal perspective.

3) Segmentation: Segmentation is equivalent to a pixel-wise classification problem as the most demanding task in terms of accuracy, and the spatial-semantic uncertainty principle is the main challenge [86]. Among the causality-based segmentation tasks, fully supervised and weakly supervised segmentation problems tend to focus on accuracy, while medical segmentation problems are more concerned with generalizability to ensure safety.

Methods for accuracy enhancement: The wrongly explored semantic correlations between word embeddings and visual features in generative models can lead to spurious correlations and compromise performance. To eliminate the bias between visible and invisible classes caused by confounders in zero-shot semantic segmentation, Shen et al. [56] adopt the counterfactual theory with a causal structure containing true and false features, word embeddings, and labels. With the same structure as Fig. 4 (I), a causal structure consisting of images, category labels, and contextual information is constructed by Li et al. [57]. The confusion bias is eliminated through causal intervention, while a fusion module is designed to fuse original features and causal contextual relationships for a closer resemblance to the human learning process. For high-quality seed regions in weakly supervised semantic segmentation, Zhang et al. [58] construct the structural causal model between pixels, contexts, and labels as shown in Fig. 4 (III) and introduce a back-door adjustment to eliminate the negative impact of contexts on class activation map generation, as shown in Fig. 5. Different from the settings in [58], Wang et al. [59] attribute the poor performance of existing methods to a set of class-specific latent confounders in the dataset and analyze the causality between image, image-level tag, pixel-level localization, and a set of class-specific latent confounders, referring to Fig. 4 (II). They use the front-door adjustment to cut off the spurious correlation between the confounder and the images. Focusing on weakly supervised semantic segmentation in medical images, an approach that focuses on both the category-causality chain and the anatomy-causality chain is proposed by Chen et al. [60]. They use causal interventions to get the actual cause of category prediction and integrate anatomical constraints for the actual cause of segmentation.

Methods to improve generalizability: Generalizability is a top priority in medical image segmentation due to the personal safety involved. Causal reasoning is frequently used in medical image processing to mitigate domain shifts caused by imaging modalities, scanning protocols, and device manufacturers [61, 62]. A causal structure involving the operating system, the surgical environment, and the segmentation graph is built by Ding et al. [61]. They align tool models with image observations by updating the initially incorrect robot kinematic parameters through forward kinematics and differentiable rendering to optimize image feature similarity end-to-end. With a more formal causal structure between acquisition, content, feature maps, medical images, and the ground-truth segmentation mask, Ouyang et al. [62] propose a simple causality-inspired data augmentation approach to expose a segmentation model to synthesized domain-shifted training examples.

4) Visual recognition: Visual recognition aims to imitate the human visual system as much as possible. Previous works [87–89] have noticed the inconsistency between the training set and the application environment, compensating for the bias effect through data augmentation, re-weighted losses, and normalization. However, such methods cannot design generic ways of eradicating deviations for different scenarios. As a result, the fundamental process of building a visual recognition system is revisited with the help of causal theory to distinguish the paradoxical character of the bias.

Methods for accuracy enhancement: Contextual bias misdirects attention to the co-occurrence context rather than the objects, leading to a loss of accuracy. Liu et al. [64] build the structure between the object representations, prior context knowledge, image-specific context, and predictions.
Fig. 5. This is the overview of the proposed Context Adjustment (CONTA) in [58]. In their setting, Y represents a set of given labels indicating the specific
classes of the objects, c_i represents the average mask of the i-th class images, and M can be viewed as a linear combination of c_i. This weakly supervised
semantic segmentation task aims to enhance the quality of the generated class activation maps.
Since the confounder proposed is unobserved, it uses an iterative procedure to establish P (c) in a back-door adjustment.
The confounders are better learned through iteration, resulting in high-quality class activation maps for segmentation.
They propose a novel paradigm with both back-door adjustment and counterfactual inference to conquer the effect of contextual bias with the structure in Fig. 4 (III). Dataset bias always contributes to a biased statistical correlation-based model, resulting in decreased performance. Qin et al. [63] argue that such bias misleads the correlation between the input images and the output labels. They build the causal structure between image, label, context, common sense, and bias, referring to the structure in Fig. 4 (IV), and cut off the back-door path involving the confounder by the back-door adjustment.

Methods to improve generalizability: Since the association between images and labels is not generalizable across domains, the out-of-distribution performance is always poor. Guided by the causal-transportability language [90], Mao et al. [67] build a causal structure between the input images, the corresponding labels, and unobserved variables encoding external sources of variation not captured in the images and the labels themselves, with the structure in Fig. 4 (I). From a more complex structure in Fig. 4 (II), Mao et al. [66] build a causal structure between six variables, including object features, unobserved confounders, background features, images, and labels. Aware that attention mechanisms can no longer be robustly characterized in any confusing environment, Wang et al. [65] focus on the deficiency of the attention mechanism when generating robust representations and propose a causal attention module that self-annotates the context confounders in an unsupervised fashion.

C. Causal reasoning in vision-language tasks

This section will cover two tasks: image captioning and visual question answering. The concerns, corresponding causal structures, and causal inference methods will also be discussed. Different from vision tasks, vision-language tasks aim to combine the information from vision and language to perform complex tasks that mimic human behavior [91, 92]. The main training steps for vision-language tasks involve encoding images and texts into single-modal embeddings for representation learning, designing an encoder to integrate information from both modalities, and using different aggregation methods. Despite the great success of existing methods [93, 94], the bias between modalities cannot be ignored, as the semantics of images and language differ significantly and are all susceptible to the influence of the environment. These confounding factors confuse the model regarding the causal chains between important information and thus deteriorate performance. In order to avoid bias, it is necessary to reconsider the link between the two modalities from a causal viewpoint in addition to learning stable characteristics from each modality separately.

1) Image captioning: Image captioning is a task that aims to automatically decipher the semantic information contained in an image and produce an accurate description of it [95]. Numerous efforts [96, 97] have been made to improve the performance of image captioning systems, but the endogenous language bias is neglected, since existing image captioning models are inclined to build spurious connections between the images and the high-frequency concurrent categories. By applying causal theory to this task, it is necessary to learn a stable representation from the images and analyze the causal relationships between images and text for appropriate responses.

Methods for accuracy enhancement: Since the dataset bias is inevitable, the accuracy performance is always affected by spurious associations caused by the bias. Yang et al. [68] introduce the semantic structure set as a mediator and use the front-door adjustment in Fig. 4 (II) combined with the
back-door adjustment in Fig. 4 (I) to eliminate the spurious correlation caused by language resources, which are denoted as the confounder. As a result, the dataset bias introduced by the confounder is reduced. Noticing that Yang et al. [68] neglect the confounded visual features in the encoder, Chen et al. [69] eliminate the spurious correlation between visual features and certain expressions by using the back-door adjustment and estimate the confounder with variational inference. Considering the limitation in [69] that the pre-training dataset is hard to stratify, the confounder is explicitly divided into two classes by Liu et al. [70]. In most transformer-based image captioning methods, the bias caused by both visual and linguistic confounders is often overlooked. However, the resulting spurious correlations often compromise the accuracy of the network. Liu et al. [70] propose to disentangle the region-based features and use the back-door adjustment to deconfound the causal structures shown in Fig. 4 (II) and (V), which can effectively eliminate the spurious correlations caused by both visual and linguistic confounders.

2) Visual question answering: Given an image-question pair, visual question answering (VQA) tasks expect the model to answer questions correctly based on the given information. Since the existing models are proven to be fragile to linguistic variations in questions and answers [98], the causal relationships between images and language information are investigated to grasp the meaning of the inquiry.

Methods for accuracy enhancement: Due to the inclusion of both image and language modal data, VQA requires visual analysis, language understanding, and multi-modal reasoning. However, existing models suffer from linguistic bias and do not make correct use of semantic information. Niu et al. [71] refer to Fig. 4 (I) to analyze this problem and mitigate the language bias by subtracting the direct language effect from the total causal effect with counterfactual inference, as shown in Fig. 6. Zang et al. [75] are concerned about the bias introduced by multi-modal data in VQA tasks. They capture visual features related to the semantics of the question and weaken the influence of local language semantics on the question answering. Liu et al. [76] collaboratively disentangle the spurious correlations between the vision and language modalities through back-door and front-door adjustments. They capture the fine-grained interactions between visual and linguistic semantics and adaptively learn global semantic-aware visual-linguistic representations.

Methods to improve generalizability: Recent VQA models are proven to be brittle to linguistic and semantic variations in questions and images. Agarwal et al. [72] propose to enhance robustness through automated semantic image manipulations and to test for consistency in model predictions. A data augmentation strategy that involves adding noise to the input image to improve model performance on challenging counting questions is also introduced. Li et al. [74] propose a learning framework to ground the question-critical scene for invariant inference across different scenes.

Methods to promote interpretability: Enhancing the interpretable aspects of the model helps to understand the basis on which the model makes decisions. However, existing methods are guided by attention, and this statistic-based approach is vulnerable to language bias. Also, there are issues with existing methods, such as the need for additional annotations. Chen et al. [40] consider the VQA task as a multi-class classification problem and use counterfactual analysis to synthesize samples, encouraging the model to perceive the difference in questions when changing some critical words and to clarify the basis for its decisions.

IV. FUTURE ROADMAPS

A. Reasonable causal structure

Though causality has shown great potential in improving accuracy, generalizability, and interpretability, all the approaches are premised on causal structures that are tailored to the specific problems [99]. However, many of the variables involved in vision tasks are simply ignored, and only some important ones are treated as nodes to construct the simplified structure. Different choices of variables can lead to different structures, thus leading to different perspectives on the same task. When solving the problem of weakly supervised semantic segmentation, Zhang et al. [58] construct the causal structure between pixels, contexts, and labels, while Chen et al. [60] build the structure between images, categories, context confounders, anatomical structures, shapes of segmentation, and pseudo masks. Although addressing the same issues, the two approaches have their own focus, one on contextual information and the other on biased category information. While constructing causal structures is key to causal learning, true causality is complex and there is no uniform modeling framework. When analyzing complex vision problems, it is of great importance to model a reasonable and comprehensive causal structure for a clearer interpretation of the relationships between visually relevant variables.

B. More guidance for multi-modal fusion

Multi-modal fusion integrates different modal information to overcome the restrictions of incomplete information given by a single modality, hence realizing modal information complementarity and improving feature representation [100]. However, data fusion is challenging. Data generation is driven by numerous underlying processes that depend on many inaccessible variables, while the data itself is always heterogeneous and complex. Data fusion aims to allow modalities to communicate entirely and inform each other. As a result, selecting an analytical model that properly depicts the link between modalities and provides a meaningful combination of them is critical [101].
Fig. 6. The biased training section shows the negative impact of language bias. The trained model will focus more on the question than the visual content,
which seriously limits the generalization. The debiased strategy is proposed for a better generalization when the distributions are inconsistent across different
stages. With this motivation, a counterfactual inference framework is constructed based on language-prior knowledge in [71].
Causality can uncover causal chains between variables objectively and can further guide analytical model building between multi-modal data to enhance accuracy, generalizability, and interpretability. For example, image data enables the precise visualization of the location of problems, and time-series data offers a thorough history of changes in the pertinent variables. A reliable causal chain is constructed by detecting faults with image data and analyzing the root cause of faults with time-series data, enabling more accurate fault detection and root cause identification. The interpretability of causal theories can also enhance the interactive ability of multi-modal fusion.

C. More applications of causal theory

So far, the applications of causal theory to vision and vision-language tasks have been discussed. Since the causal perspective can explore the relationships between variables at a fundamental level, it is suitable for guiding the study of complex systems and tasks, such as industrial processes [102, 103], electrical systems [104], and Earth systems [105]. Robots must make stable and timely decisions in complex industrial processes in response to the ever-changing environment. It is also important to make the decision-making process of robots transparent to humans. Recently, task plan explanation methods [106, 107] have promoted human-robot interaction in robotics and enhanced the interpretability of robotic decision-making mechanisms with the help of causality. The generalization techniques also make the deployment of tasks in unknown environments a reality. The power system is a unified whole capable of producing, transmitting, distributing, and consuming electrical energy. Recently, with the improvement of top-level design, the power system is developing towards safety, high efficiency, low carbon, and intelligent integration. However, the strong interactions between systems introduce new challenges in maintaining high supply security, as new factors can affect the overall security of the power system [108]. The causal theory can be used to explain the relationships between power system variables, respond to power system trends, and assist researchers by providing them with more interpretable analysis. In large-scale complex dynamical systems such as the Earth system, anthropogenic intervention is uncontrolled and contrary to humanitarian principles. To find out the mechanisms within such complex systems, causal theory is the best choice [109]. With an explosion in the availability of large-scale time series data and an increasingly accurate perception of the Earth model, the causal interdependencies of the underlying system can be discovered and quantified to improve the theoretical understanding of the Earth system. Combining time series, satellite remote sensing images, and other multi-source data, the nonlinear dynamical interactions between different climate factors can be analyzed, and tipping points can be effectively predicted and prevented. In addition, a deeper understanding of the influence of meteorological parameters on meteorological phenomena can be obtained.

V. CONCLUSION

This review aims to contribute to the development of causality in typical computer vision tasks and provide detailed explanations of existing methods from the perspective of causal structures. The concept and necessity of causality in guiding vision tasks are first discussed. The survey on existing methods of causal reasoning in vision and vision-language tasks is then conducted from different perspectives: accuracy, generalizability, and interpretability. Finally, future roadmaps are suggested to encourage the theory and applications of this promising field.

REFERENCES

[1] Y. Pang, X. Bai, and G. Zhang, "Special focus on deep learning for computer vision," SCIENCE CHINA Information Sciences, vol. 62, no. 12, p. 220100, 2019.
[2] D. Yang, K. Jiang, D. Zhao, C. Yu, Z. Cao, S. Xie, Z. Xiao, X. Jiao, S. Wang, and K. Zhang, "Intelligent and connected vehicles: Current status and future perspectives," SCIENCE CHINA Technological Sciences, vol. 61, no. 10, pp. 1446–1471, 2018.
[3] D. Li, M. Liu, F. Zhao, and Y. Liu, "Challenges and countermeasures of interaction in autonomous vehicles," SCIENCE CHINA Information Sciences, vol. 62, no. 5, p. 050201, 2019.
[4] Y. Wang, H. Song, Q. Li, and H. Zhang, "Research on a full envelop controller for an unmanned ducted-fan helicopter based on switching control theory," SCIENCE CHINA Technological Sciences, vol. 62, no. 10, pp. 1837–1844, 2019.
[5] Z. Xie, Q. Zhang, Z. Jiang, and H. Liu, "Robot learning from demonstration for path planning: A review," SCIENCE CHINA Technological Sciences, vol. 63, no. 8, pp. 1325–1334, 2020.
[6] Z. Chu, D. Jie, L. Su, J. Cui, and S. Fuchun, "A gecko-inspired adhesive robotic end effector for critical-contact manipulation," SCIENCE CHINA Information Sciences, vol. 65, no. 8, p. 182203, 2022.
[7] H. Li, S. Cao, Y. Chen, M. Zhang, and D. Feng, "TULAM: Trajectory-user linking via attention mechanism," SCIENCE CHINA Information Sciences, 2023. [Online]. Available: http://www.sciengine.com/publisher/ScienceChinaPress/journal/SCIENCECHINAInformationSciences///10.1007/s11432-021-3673-6
[8] G. Cheng, P. Lai, D. Gao, and J. Han, "Class attention network for image recognition," SCIENCE CHINA Information Sciences, vol. 66, no. 3, p. 132105, 2023.
[9] P. Yan, Y. Tan, and Y. Tai, "Repeatable adaptive keypoint detection via self-supervised learning," SCIENCE CHINA Information Sciences, vol. 65, no. 11, p. 212103, 2022.
[10] Y. Shao, Z. Geng, Y. Liu, J. Dai, H. Yan, F. Yang, Z. Li, H. Bao, and X. Qiu, "CPT: A pre-trained unbalanced transformer for both Chinese language understanding and generation," SCIENCE CHINA Information Sciences, 2022.
[11] J. Li, D. Li, S. Savarese, and S. Hoi, "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," arXiv preprint arXiv:2301.12597, 2023.
[12] J. Pearl, Causality. Cambridge University Press, 2009.
[13] C. Gao, Y. Zheng, W. Wang, F. Feng, X. He, and Y. Li, "Causal inference in recommender systems: A survey and future directions," arXiv preprint arXiv:2208.12397, 2022.
[14] C. R. Blyth, "On Simpson's paradox and the sure-thing principle," Journal of the American Statistical Association, vol. 67, no. 338, pp. 364–366, 1972.
[15] D. Borsboom, R. A. Kievit, D. Cervone, and S. B. Hood, The Two Disciplines of Scientific Psychology, or: The Disunity of Psychology as a Working Hypothesis. New York, NY: Springer US, 2009, pp. 67–97.
[16] N. Malik and P. V. Singh, "Deep learning in computer vision: Methods, interpretation, causation, and fairness," in Operations Research & Management Science in the Age of Analytics. INFORMS, 2019, pp. 73–100.
[17] Q. Sun, C. Zhao, Y. Tang, and F. Qian, "A survey on unsupervised domain adaptation in computer vision tasks," SCIENTIA SINICA Technologica, vol. 52, no. 1, pp. 26–54, 2022.
[18] K. Zhou, Z. Liu, Y. Qiao, T. Xiang, and C. C. Loy, "Domain generalization in vision: A survey," arXiv preprint arXiv:2103.02503, 2021.
[19] R. E. Heidel, "Causality in statistical power: Isomorphic properties of measurement, research design, effect size, and sample size," Scientifica, vol. 2016, p. 8920418, 2016.
[20] A. P. Dawid, "Statistical causality from a decision-theoretic perspective," Annual Review of Statistics and Its Application, vol. 2, pp. 273–303, 2015.
[21] J. J. Heckman and R. Pinto, "Causality and econometrics," National Bureau of Economic Research, Tech. Rep. 29787, 2022.
[22] J. Geweke, "Inference and causality in economic time series models," Handbook of Econometrics, vol. 2, pp. 1101–1144, 1984.
[23] M. Kundi, "Causality and the interpretation of epidemiologic evidence," Environmental Health Perspectives, vol. 114, no. 7, pp. 969–974, 2006.
[24] H. Ohlsson and K. S. Kendler, "Applying causal inference methods in psychiatric epidemiology: A review," JAMA Psychiatry, vol. 77, no. 6, pp. 637–644, 2020.
[25] J. F. Hair Jr and M. Sarstedt, "Data, measurement, and causal inferences in machine learning: Opportunities and challenges for marketing," Journal of Marketing Theory and Practice, vol. 29, no. 1, pp. 65–77, 2021.
[26] M. Prosperi, Y. Guo, M. Sperrin, J. S. Koopman, J. S. Min, X. He, S. Rich, M. Wang, I. E. Buchan, and J. Bian, "Causal inference and counterfactual prediction in machine learning for actionable healthcare," Nature Machine Intelligence, vol. 2, no. 7, pp. 369–375, 2020.
[27] H. Chen, K. Du, X. Yang, and C. Li, "A review and roadmap of deep learning causal discovery in different variable paradigms," arXiv preprint arXiv:2209.06367, 2022.
[28] J. Pearl, "Bayesian networks," 2011.
[29] J. Kaddour, A. Lynch, Q. Liu, M. J. Kusner, and R. Silva, "Causal machine learning: A survey and open problems," arXiv preprint arXiv:2206.15475, 2022.
[30] Z. Li, Z. Zhu, X. Guo, S. Zheng, Z. Guo, S. Qiang, and Y. Zhao, "A survey of deep causal models and their industrial applications," 2023.
[31] G. Rebane and J. Pearl, "The recovery of causal poly-trees from statistical data," arXiv preprint arXiv:1304.2736, 2013.
[57] [...], 2022, pp. 756–772.
[58] D. Zhang, H. Zhang, J. Tang, X.-S. Hua, and Q. Sun, “Causal intervention for weakly-supervised semantic segmentation,” Advances in Neural Information Processing Systems, vol. 33, pp. 655–666, 2020.
[59] Y. Wang, “Causal class activation maps for weakly-supervised semantic segmentation,” in UAI 2022 Workshop on Causal Representation Learning, 2022.
[60] Z. Chen, Z. Tian, J. Zhu, C. Li, and S. Du, “C-cam: Causal cam for weakly supervised semantic segmentation on medical image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11676–11685.
[61] H. Ding, J. Zhang, P. Kazanzides, J. Y. Wu, and M. Unberath, “Carts: Causality-driven robot tool segmentation from vision and kinematics data,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII. Springer, 2022, pp. 387–398.
[62] C. Ouyang, C. Chen, S. Li, Z. Li, C. Qin, W. Bai, and D. Rueckert, “Causality-inspired single-source domain generalization for medical image segmentation,” IEEE Transactions on Medical Imaging, vol. 42, no. 4, pp. 1095–1106, 2023.
[63] W. Qin, H. Zhang, R. Hong, E.-P. Lim, and Q. Sun, “Causal interventional training for image recognition,” IEEE Transactions on Multimedia, vol. 25, pp. 1033–1044, 2023.
[64] R. Liu, H. Liu, G. Li, H. Hou, T. Yu, and T. Yang, “Contextual debiasing for visual recognition with causal mechanisms,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12755–12765.
[65] T. Wang, C. Zhou, Q. Sun, and H. Zhang, “Causal attention for unbiased visual recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3091–3100.
[66] C. Mao, A. Cha, A. Gupta, H. Wang, J. Yang, and C. Vondrick, “Generative interventions for causal learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3947–3956.
[67] C. Mao, K. Xia, J. Wang, H. Wang, J. Yang, E. Bareinboim, and C. Vondrick, “Causal transportability for visual recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7521–7531.
[68] X. Yang, H. Zhang, and J. Cai, “Deconfounded image captioning: A causal retrospect,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
[69] W. Chen, J. Tian, C. Fan, H. He, and Y. Jin, “Dependent multi-task learning with causal intervention for image captioning,” arXiv preprint arXiv:2105.08573, 2021.
[70] B. Liu, D. Wang, X. Yang, Y. Zhou, R. Yao, Z. Shao, and J. Zhao, “Show, deconfound and tell: Image captioning with causal inference,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18041–18050.
[71] Y. Niu, K. Tang, H. Zhang, Z. Lu, X.-S. Hua, and J.-R. Wen, “Counterfactual vqa: A cause-effect look at language bias,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12700–12710.
[72] V. Agarwal, R. Shetty, and M. Fritz, “Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9690–9698.
[73] S. Zhang, T. Jiang, T. Wang, K. Kuang, Z. Zhao, J. Zhu, J. Yu, H. Yang, and F. Wu, “Devlbert: Learning deconfounded visio-linguistic representations,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 4373–4382.
[74] Y. Li, X. Wang, J. Xiao, W. Ji, and T.-S. Chua, “Invariant grounding for video question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2928–2937.
[75] C. Zang, H. Wang, M. Pei, and W. Liang, “Discovering the real association: Multimodal causal reasoning in video question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19027–19036.
[76] Y. Liu, G. Li, and L. Lin, “Cross-modal causal relational reasoning for event-level visual question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–17, 2023.
[77] W. Liu, Z. Liu, L. Paull, A. Weller, and B. Schölkopf, “Structural causal 3d reconstruction,” in Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part I. Springer, 2022, pp. 140–159.
[78] X. Zhang, Y. Wong, X. Wu, J. Lu, M. Kankanhalli, X. Li, and W. Geng, “Learning causal representation for training cross-domain pose estimator via generative interventions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11270–11280.
[79] S. Zhang, X. Song, W. Li, Y. Bai, X. Yu, and S. Jiang, “Layout-based causal inference for object navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10792–10802.
[80] C.-F. R. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention multi-scale vision transformer for image classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 357–366.
[81] Y. Li, C.-Y. Wu, H. Fan, K. Mangalam, B. Xiong, [...]