Causal Inference: 1.1 Two Types of Causal Questions
The difference between passively observing X = x and actively intervening and setting
X = x is significant and requires different techniques and, typically, much stronger assump-
tions. This is the area known as causal inference.
1 Preliminaries
Before we jump into the details, there are a few general concepts to discuss.
1.3 Two Languages for Causation
There are two different mathematical languages for studying causation. The first is based
on counterfactuals. The second is based on causal graphs. It will not seem obvious at first,
but the two are mathematically equivalent (apart from some small details). Actually, there
is a third language called structural equation models but this is very closely related to causal
graphs.
1.4 Example
Consider this story. A mother notices that tall kids have a higher reading level than short
kids. The mother puts her small child on a device and stretches the child until he is tall.
She is dismayed to find out that his reading level has not changed.
The mother is correct that height and reading skill are associated. Put another way, you
can use height to predict reading skill. But that does not imply that height causes reading
skill. This is what statisticians mean when they say:
correlation is not causation.
On the other hand, consider smoking and lung cancer. We know that smoking and lung
cancer are associated. But we also believe that smoking causes lung cancer. In this case,
we recognize that intervening and forcing someone to smoke does change his probability of
getting lung cancer.
Despite the fact that causation and association are different, people confuse them all the
time, even people trained in statistics and machine learning. On TV recently there was a
report that good health is associated with getting seven hours of sleep. So far so good. Then
the reporter goes on to say that, therefore, everyone should strive to sleep exactly seven
hours so they will be healthy. Wrong. That’s confusing causation and association. Another
TV report pointed out a correlation between people who brush their teeth regularly and low
rates of heart disease. An interesting correlation. Then the reporter (a doctor in this case)
went on to urge people to brush their teeth to save their hearts. Wrong!

Figure 1: Left: X and Y have positive association. Right: The lines are the counterfactuals,
i.e. what would happen to each person if I changed their X value. Despite the positive
association, the causal effect is negative. If we increase X, everyone's Y values will decrease.
To avoid this confusion we need a way to discuss causation mathematically. That is,
we need some way to make P(Y ∈ A | set X = x) formal. As I mentioned earlier, there are
two common ways to do this. One is to use counterfactuals. The other is to use causal
graphs. These are two different languages for saying the same thing.
Causal inference is tricky and should be used with great caution. The main messages
are:
1. Causal effects can be estimated consistently from randomized experiments.
2. It is difficult to estimate causal effects from observational (non-randomized) experi-
ments.
3. All causal conclusions from observational studies should be regarded as very tentative.
Causal inference is a vast topic. We will only touch on the main ideas here.
2 Counterfactuals
Consider two variables X and Y . We will call X the “exposure” or the “treatment.” We
call Y the “response” or the “outcome.” For a given subject we see (Xi , Yi ). What we don’t
see is what their value of Yi would have been if we changed their value of Xi . This is called
the counterfactual. The whole causal story is made clear in Figure 1 which shows data (left)
and the counterfactuals (right).
Suppose now that X is a binary variable that represents some exposure. So X = 1 means
the subject was exposed and X = 0 means the subject was not exposed. We can address the
problem of predicting Y from X by estimating E(Y |X = x). To address causal questions,
we introduce counterfactuals. Let Y1 denote the response if the subject is exposed. Let Y0
denote the response if the subject is not exposed. Then
Y = Y1 if X = 1, and Y = Y0 if X = 0.

More succinctly,

Y = X Y1 + (1 − X) Y0.    (1)
If we expose a subject, we observe Y1 but we do not observe Y0 . Indeed, Y0 is the value we
would have observed if the subject had not been exposed. The unobserved variable is called a
counterfactual. The variables (Y0 , Y1 ) are also called potential outcomes. We have enlarged
our set of variables from (X, Y ) to (X, Y, Y0 , Y1 ). A small dataset might look like this:
X Y Y0 Y1
1 1 * 1
1 1 * 1
1 0 * 0
1 1 * 1
0 1 1 *
0 0 0 *
0 1 1 *
0 1 1 *
The asterisks indicate unobserved variables. Causal questions involve the distribution
p(y0 , y1 ) of the potential outcomes. We can interpret p(y1 ) as p(y|set X = 1) and we can
interpret p(y0 ) as p(y|set X = 0). The mean treatment effect or mean causal effect is defined
by
θ = E(Y1 ) − E(Y0 ) = E(Y |set X = 1) − E(Y |set X = 0).
The parameter θ has the following interpretation: θ is the mean response if we exposed
everyone minus the mean response if we exposed no-one.
Lemma 1 In general,

θ = E(Y1 ) − E(Y0 ) ≠ α ≡ E(Y |X = 1) − E(Y |X = 0).

However, if X is randomly assigned, then X is independent of (Y0 , Y1 ) and θ = α.

Exercise: Prove this.
The same results hold when X is continuous. In this case there is a counterfactual Y (x)
for each value x of X. We again have that, in general,
E[Y (x)] ≠ E[Y |X = x].
See Figure 1. But if X is randomly assigned, then we do have E[Y (x)] = E[Y |X = x] and
so E[Y (x)] can be consistently estimated using standard regression methods. Indeed, if we
had randomly chosen the X values in Figure 1 then the plot on the left would have been
downward sloping. To see this, note that θ(x) = E[Y (x)] is defined to be the average of the
lines in the right plot. Under randomization, X is independent of Y (x). So
right plot = θ(x) = E[Y (x)] = E[Y (x)|X = x] = E[Y |X = x] = left plot.
Adjusting For Confounding. In some cases it is not feasible to do a randomized
experiment and we must use data from observational (non-randomized) studies. Smoking
and lung cancer is an example. Can we estimate causal parameters from observational
(non-randomized) studies? The answer is: sort of.
In an observational study, the treated and untreated groups will not be comparable.
Maybe the healthy people chose to take the treatment and the unhealthy people didn’t. In
other words, X is not independent of (Y0 , Y1 ). The treatment may have no effect but we
would still see a strong association between Y and X. In other words, α might be large even
though θ = 0.
Here is a simplified example. Suppose X denotes whether someone takes vitamins and
Y is some binary health outcome (with Y = 1 meaning "healthy").
X 1 1 1 1 0 0 0 0
Y0 1 1 1 1 0 0 0 0
Y1 1 1 1 1 0 0 0 0
In this example, there are only two types of people: healthy and unhealthy. The healthy
people have (Y0 , Y1 ) = (1, 1). These people are healthy whether or not they take vitamins.
The unhealthy people have (Y0 , Y1 ) = (0, 0). These people are unhealthy whether or not
they take vitamins. The observed data are:
X 1 1 1 1 0 0 0 0
Y 1 1 1 1 0 0 0 0.
In this example, θ = 0 but α = 1. The problem is that people who choose to take
vitamins are different than people who choose not to take vitamins. That’s just another way
of saying that X is not independent of (Y0 , Y1 ).
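The numbers in the vitamin example can be checked directly. The sketch below transcribes the table as given (nothing invented) and computes θ from the full potential outcomes and α from the observed data alone:

```python
# Potential outcomes for the eight subjects in the vitamin example.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y0 = [1, 1, 1, 1, 0, 0, 0, 0]
Y1 = [1, 1, 1, 1, 0, 0, 0, 0]

# Observed outcome: Y = X*Y1 + (1 - X)*Y0, equation (1).
Y = [x * y1 + (1 - x) * y0 for x, y0, y1 in zip(X, Y0, Y1)]

# Causal effect theta: difference of means over EVERYONE's potential outcomes.
theta = sum(Y1) / len(Y1) - sum(Y0) / len(Y0)

# Association alpha: difference of observed group means.
mean_treated = sum(y for x, y in zip(X, Y) if x == 1) / X.count(1)
mean_untreated = sum(y for x, y in zip(X, Y) if x == 0) / X.count(0)
alpha = mean_treated - mean_untreated

print(theta, alpha)   # prints 0.0 1.0
```

The computation confirms the point of the example: θ = 0 (vitamins do nothing) while α = 1 (vitamin takers look much healthier), purely because of who chose to take them.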
To account for the differences in the groups, we can measure confounding variables.
These are the variables that affect both X and Y . These variables explain why the two groups
of people are different. In other words, these variables account for the dependence between
X and (Y0 , Y1 ). By definition, there are no such variables in a randomized experiment. The
hope is that if we measure enough confounding variables Z = (Z1 , . . . , Zk ), then, perhaps the
treated and untreated groups will be comparable, conditional on Z. This means that X is
independent of (Y0 , Y1 ) conditional on Z. We say that there is no unmeasured confounding,
or that ignorability holds, if
X ⫫ (Y0 , Y1 ) | Z.
The only way to measure the important confounding variables is to use subject matter
knowledge. In other words, causal inference in observational studies is not possible
without subject matter knowledge.
Then

θ ≡ E(Y1 ) − E(Y0 ) = ∫ µ(1, z)p(z)dz − ∫ µ(0, z)p(z)dz    (2)

where

µ(x, z) = E(Y |X = x, Z = z).
A consistent estimator of θ is

θ̂ = (1/n) Σⁿᵢ₌₁ µ̂(1, Zi ) − (1/n) Σⁿᵢ₌₁ µ̂(0, Zi )

where µ̂(x, z) is an appropriate, consistent estimator of the regression function µ(x, z) =
E[Y |X = x, Z = z].
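As a concrete sketch of this plug-in estimator, the code below simulates a confounded observational study (the data-generating process is invented for illustration) with binary Z, and estimates µ(x, z) by the simplest possible method, a stratified mean. The true causal effect is zero by construction, but Z drives both treatment uptake and the outcome:

```python
import random

random.seed(1)

# Hypothetical confounded study: Z encourages treatment AND drives the
# outcome; the treatment X has no effect on Y at all, so theta = 0.
n = 100_000
data = []
for _ in range(n):
    z = int(random.random() < 0.5)
    x = int(random.random() < (0.8 if z else 0.2))  # Z encourages treatment
    y = int(random.random() < (0.9 if z else 0.1))  # Z drives Y; X does not
    data.append((x, y, z))

def mu_hat(x, z):
    """Stratified-mean estimate of mu(x, z) = E[Y | X = x, Z = z]."""
    ys = [yi for xi, yi, zi in data if xi == x and zi == z]
    return sum(ys) / len(ys)

mu = {(x, z): mu_hat(x, z) for x in (0, 1) for z in (0, 1)}

# Unadjusted association: difference of raw group means.
treated = [yi for xi, yi, _ in data if xi == 1]
untreated = [yi for xi, yi, _ in data if xi == 0]
alpha_hat = sum(treated) / len(treated) - sum(untreated) / len(untreated)

# Plug-in adjusted estimate: average mu_hat over the empirical
# distribution of Z, as in the formula for theta-hat above.
theta_hat = sum(mu[(1, zi)] - mu[(0, zi)] for _, _, zi in data) / n

print(f"alpha_hat = {alpha_hat:.3f}, theta_hat = {theta_hat:.3f}")
```

In this simulation alpha_hat comes out near 0.48 while theta_hat is near 0: adjusting for Z removes the spurious association. Of course, this only works because Z really is the confounder here, which is exactly the "no unmeasured confounding" assumption.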
Remark: Estimating the quantity in (2) well is difficult and involves an area of statistics
called semiparametric inference. In statistics, biostatistics, econometrics and epidemiology,
this is the focus of much research.
Proof. We have

θ = E(Y1 ) − E(Y0 )
  = ∫ E(Y1 |Z = z)p(z)dz − ∫ E(Y0 |Z = z)p(z)dz
  = ∫ E(Y1 |X = 1, Z = z)p(z)dz − ∫ E(Y0 |X = 0, Z = z)p(z)dz
  = ∫ E(Y |X = 1, Z = z)p(z)dz − ∫ E(Y |X = 0, Z = z)p(z)dz    (3)

where we used the fact that X is independent of (Y0 , Y1 ) conditional on Z in the third line
and the fact that Y = XY1 + (1 − X)Y0 in the fourth line.
The process of including confounding variables and using equation (2) is known as adjusting
for confounders and θ̂ is called the adjusted treatment effect. The choice of the estimator
µ̂(x, z) is delicate. If we use a nonparametric method then we have to choose the smoothing
parameter carefully. Unlike prediction, bias and variance are not equally important: the
usual bias-variance tradeoff does not apply. In fact bias is worse than variance and we
need to choose the smoothing parameter smaller than usual. As mentioned above, there is
a branch of statistics called semiparametric inference that deals with this problem in detail.
It is instructive to compare the causal effect

θ = ∫ µ(1, z)p(z)dz − ∫ µ(0, z)p(z)dz

with the association

α = E(Y |X = 1) − E(Y |X = 0) = ∫ µ(1, z)p(z|X = 1)dz − ∫ µ(0, z)p(z|X = 0)dz.
In a linear regression, the coefficient in front of x is the causal effect of x if (i) the model is
correct and (ii) all confounding variables are included in the regression.
To summarize: the coefficients in linear regression have a causal interpretation if (i) the
model is correct and (ii) every possible confounding factor is included in the model.
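This claim can be illustrated with ordinary least squares on simulated data (the structural model below is invented for illustration): when the confounder Z is included in the regression, the coefficient on X recovers the causal effect; when Z is omitted, the coefficient is biased.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical structural model: Y = 2*X + 3*Z + noise, so the causal
# effect of X on Y is 2. Z confounds because X also depends on Z.
n = 50_000
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)          # treatment depends on the confounder
Y = 2 * X + 3 * Z + rng.normal(size=n)

ones = np.ones(n)

# Correct regression (confounder included): coefficient on X is near 2.
beta_adj, *_ = np.linalg.lstsq(np.column_stack([ones, X, Z]), Y, rcond=None)

# Misspecified regression (Z omitted): coefficient on X is biased,
# here toward 2 + 3*Cov(X, Z)/Var(X) = 3.5.
beta_raw, *_ = np.linalg.lstsq(np.column_stack([ones, X]), Y, rcond=None)

print(f"with Z: {beta_adj[1]:.2f}, without Z: {beta_raw[1]:.2f}")
```

The gap between the two coefficients is exactly the difference between θ and α in this linear setting: the omitted-variable bias is the part of the association that flows through the confounder.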