did DML
Pedro H. C. Sant’Anna
Emory University
January 2025
Introduction
DiD procedures with Covariates
■ All these DiD procedures with covariates are implemented in the DRDID and did R packages, and in the drdid and csdid Stata packages.
Implementations, so far, only allow for parametric first-step models.
What if I want to leverage Machine Learning procedures to do DiD?
We will focus on the 2x2 case with Panel Data.
Let’s review our assumptions
Assumptions in 2x2 setup
Assumption (No-Anticipation)
For all units $i$, $Y_{i,t}(g) = Y_{i,t}(\infty)$ for all groups in their pre-treatment periods, i.e., for all $t < g$.
Different ATT formulations
Regression adjustment procedure
■ Originally proposed by Heckman, Ichimura and Todd (1997) and Heckman, Ichimura,
Smith and Todd (1998):
$$\mathrm{ATT} = \mathbb{E}\left[Y_{t=2} - Y_{t=1} \mid G = 2\right] - \mathbb{E}\left[m_{\Delta}^{G=\infty}(X) \mid G = 2\right],$$
where
$$m_{\Delta}^{G=\infty}(X) \equiv \mathbb{E}\left[Y_{t=2} - Y_{t=1} \mid G = \infty, X\right].$$
■ Now, it is “only” a matter of modelling $m_{\Delta}^{G=\infty}(X)$ and applying the plug-in principle.
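■ A minimal plug-in sketch in R (the data frame `df`, its columns `y1`, `y2`, `g2` = 1{G = 2}, `x1`, `x2`, and the linear specification are all hypothetical illustration choices, not part of the original slides):
```r
# Hypothetical illustration: plug-in regression-adjustment DiD with panel data.
# df has y1, y2 (outcomes in t = 1, 2), g2 = 1{G = 2}, and covariates x1, x2.
df$dy  <- df$y2 - df$y1                                   # outcome growth
or_fit <- lm(dy ~ x1 + x2, data = subset(df, g2 == 0))    # model m among G = infinity
m_hat  <- predict(or_fit, newdata = subset(df, g2 == 1))  # imputed growth for treated
att_ra <- mean(df$dy[df$g2 == 1]) - mean(m_hat)
```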
Inverse probability weighting procedure
■ Sant’Anna and Zhao (2020), building on Abadie (2005), considered the following IPW estimand when panel data are available:
$$\mathrm{ATT}_{std}^{ipw,p} = \mathbb{E}\left[\left(\frac{D}{\mathbb{E}[D]} - \frac{\dfrac{p(X)(1-D)}{1-p(X)}}{\mathbb{E}\left[\dfrac{p(X)(1-D)}{1-p(X)}\right]}\right)\left(Y_{t=2} - Y_{t=1}\right)\right],$$
where
$$p(X) \equiv \mathbb{P}\left[G = 2 \mid X\right].$$
■ Now, it is “only” a matter of modelling $p(X)$ and applying the plug-in principle.
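■ As an illustration, a minimal plug-in sketch of the standardized IPW estimator, reusing the hypothetical `df` from the previous sketch and a logit pscore (one specification among many):
```r
# Hypothetical illustration: standardized (Hajek-type) IPW DiD.
ps_fit  <- glm(g2 ~ x1 + x2, family = binomial(), data = df)
p_hat   <- fitted(ps_fit)                      # estimated p(X)
w_treat <- df$g2 / mean(df$g2)                 # D / E[D]
w_comp  <- p_hat * (1 - df$g2) / (1 - p_hat)   # odds-reweighted comparison units
w_comp  <- w_comp / mean(w_comp)               # standardize weights to mean one
att_ipw <- mean((w_treat - w_comp) * df$dy)
```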
■ Sant’Anna and Zhao (2020) considered the following doubly robust estimand when panel data are available:
$$\mathrm{ATT}^{dr,p} = \mathbb{E}\left[\left(\frac{D}{\mathbb{E}[D]} - \frac{\dfrac{p(X)(1-D)}{1-p(X)}}{\mathbb{E}\left[\dfrac{p(X)(1-D)}{1-p(X)}\right]}\right)\left(\left(Y_{t=2} - Y_{t=1}\right) - \left(m_{t=2}^{G=\infty}(X) - m_{t=1}^{G=\infty}(X)\right)\right)\right].$$
■ Again, it is “only” a matter of modelling $p(X)$ and $m_{\Delta}^{G=\infty}(X)$ and applying the plug-in principle.
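■ Combining the two sketches yields a minimal doubly robust plug-in (same hypothetical setup as above):
```r
# Hypothetical illustration: DR DiD applies the IPW weights to the
# outcome growth net of the regression-adjustment prediction m(X).
m_all  <- predict(or_fit, newdata = df)   # m evaluated at every unit's X
att_dr <- mean((w_treat - w_comp) * (df$dy - m_all))
```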
What if I want to use ML?
Being inspired by the recent developments in Causal ML
■ In the last 15 years or so, we have seen many advances in Causal Machine Learning.
▶ Belloni, Chernozhukov and Hansen (2014)
▶ Farrell (2015)
▶ Belloni, Chernozhukov, Fernández-Val and Hansen (2017)
▶ Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey and Robins (2017)
▶ Athey and Wager (2018)
▶ Athey, Tibshirani and Wager (2019)
▶ Chernozhukov, Demirer, Duflo and Fernández-Val (2022).
■ All these papers propose estimators that are Doubly Robust/Neyman Orthogonal.
■ On the other hand, a richer set of covariates can make estimation and inference about the ATT much more challenging.
▶ What if we have n = 200 but we have 300 different X’s?
▶ What if we do not know the functional form of the pscore and the outcome-regression?
Treatment effects in Data-Rich environments
■ We want to estimate and make inferences about the ATT, allowing for the number of potential covariates, $k := \dim f(X)$, to be potentially larger than the number of cross-sectional units in the data, $n$.
■ Here, we will follow the popular approach (at least in economics) of assuming that
our nuisance functions, p(X) and mG∆=∞ (X), are approximately sparse.
(This is not required in low dimensional settings; we can also make alternative assumptions).
Approximate Sparsity
■ The approximately sparse approach imposes that only a small (but unknown) subset of the covariates really matters; since we are unsure which ones, we must conduct some model selection.
■ ML procedures were not originally built to be reliable for inference but to have good
predictive properties.
Valid inference after model selection
■ We cannot ignore the model selection step unless we are willing to assume additional structure on the model that implies that perfect model selection is possible.
▶ This requires that all but a small number of coefficients are exactly zero, and that the nonzero coefficients are large enough to be distinguished from zero with probability near one in finite samples.
■ This rules out the possibility that some variables have moderate but nonzero effects.
Valid inference after model selection
■ We will focus on LASSO because it is known to perform very well under (approximate) sparsity constraints; see, e.g., Chernozhukov et al. (2017) and Chang (2020) for additional discussion of other methods.
■ With LASSO, the implementation is very easy and requires few modifications of available software (which is another reason why we are focusing on it).
Using LASSO regressions
LASSO (Tibshirani, 1996)
■ It has been successfully used in many causal inference procedures, see, e.g., Belloni
et al. (2014), Farrell (2015), Chernozhukov et al. (2017), Belloni et al. (2017), among
many others.
■ More recently, Chang (2020) has built on it for DiD analysis, too!
But what do I need to do LASSO in practice?
LASSO in practice
■ Now, generically speaking, LASSO becomes a penalized OLS regression (when you think OLS is appropriate):
$$\min_{b}\; \frac{1}{n}\sum_{i=1}^{n} \frac{\left(Y_i - f(X_i)'b\right)^2}{2} + \frac{\lambda}{n}\left\|\widehat{\Psi} b\right\|_1,$$
where, for a generic $Z$, $\|Z\|_p = \left(\sum_{l=1}^{n} |Z_l|^p\right)^{1/p}$ is the standard $\ell_p$-norm and $\widehat{\Psi} = \mathrm{diag}\left(\hat{l}_1, \ldots, \hat{l}_k\right)$ is a diagonal matrix of data-dependent penalty loadings.
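■ In R, one way to fit this criterion is the glmnet package, whose penalty.factor argument plays the role of the loadings in $\widehat{\Psi}$ (the simulated data below are purely illustrative):
```r
library(glmnet)
set.seed(1)
n <- 200; k <- 300
X <- matrix(rnorm(n * k), n, k)            # more covariates than units
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)      # sparse signal
loadings <- rep(1, k)                      # stand-in for data-dependent loadings
fit <- glmnet(X, y, alpha = 1, penalty.factor = loadings)  # alpha = 1: LASSO
```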
Using LASSO to estimate $m_{\Delta}^{G=\infty}(X)$
■ Next, we can fit a penalized OLS regression using only untreated units:
$$\min_{b}\; \frac{1}{n}\sum_{i:\, G_i = \infty} \frac{\left(\Delta Y_i - f(X_i)'b\right)^2}{2} + \frac{\lambda}{n}\left\|\widehat{\Psi} b\right\|_1.$$
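■ A minimal glmnet sketch, assuming a hypothetical matrix X holding the transformed covariates $f(X)$ and the hypothetical df from before ($\lambda$ chosen by cross-validation, as discussed below):
```r
# Fit the LASSO outcome regression on never-treated units only.
X0    <- X[df$g2 == 0, ]
dy0   <- df$dy[df$g2 == 0]
or_cv <- cv.glmnet(X0, dy0, alpha = 1)
m_hat <- as.vector(predict(or_cv, newx = X, s = "lambda.min"))  # m-hat at every X
```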
Using LASSO to estimate p(X)
■ OLS is not appropriate for binary outcomes, as is the case with the propensity score.
■ But we can easily modify the criterion function and fit a penalized maximum likelihood regression:
$$\min_{b}\; \frac{1}{n}\sum_{i=1}^{n} -\left[1\{D_i = 1\}\log \Lambda\left(f(X_i)'b\right) + 1\{D_i = 0\}\log\left(1 - \Lambda\left(f(X_i)'b\right)\right)\right] + \frac{\lambda}{n}\left\|\widehat{\Psi} b\right\|_1,$$
where, in our context, $D = 1\{G = 2\}$, and $\Lambda(\cdot)$ is a link function; in our case, the logistic function, $\Lambda(\cdot) = \exp(\cdot)/(1 + \exp(\cdot))$.
■ Once we have our $\widehat{\beta}^{ps}$, we can then estimate $p(x)$ by $\widehat{\pi}(x) = \Lambda(f(x)'\widehat{\beta}^{ps})$.
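■ The penalized logit counterpart is again one call to glmnet, where family = "binomial" gives the log-likelihood above (same hypothetical X and df):
```r
# LASSO-penalized logit for the propensity score.
ps_cv <- cv.glmnet(X, df$g2, family = "binomial", alpha = 1)
p_hat <- as.vector(predict(ps_cv, newx = X, s = "lambda.min",
                           type = "response"))  # = Lambda(f(x)' beta-hat)
```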
Using LASSO regressions
How do we pick the penalty parameters?
Picking penalty parameters
■ How should you choose the penalty $\lambda$ and the loadings $\hat{l}_j$, $j = 1, \ldots, k$?
▶ Theory-driven way of picking these: Belloni et al. (2017).
▶ More computationally expensive (but with good performance, too): cross-validation; see Chetverikov, Liao and Chernozhukov (2021). Both routes are sketched below.
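■ A minimal sketch of both routes in R (the hdm package implements a Belloni et al.-style plug-in penalty, while cv.glmnet cross-validates; data as in the earlier sketches):
```r
library(hdm)     # theory-driven (plug-in) penalty and loadings
library(glmnet)  # cross-validated penalty
fit_plugin <- rlasso(X, y, post = FALSE)   # plug-in lambda, no refitting
fit_cv     <- cv.glmnet(X, y, alpha = 1)
c(fit_cv$lambda.min, fit_cv$lambda.1se)    # two common CV choices
```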
“Problem” of LASSO
■ LASSO shrinks the selected coefficients toward zero, so its fitted values are biased.
■ To avoid this problem, one can use Post-LASSO, which is a two-step procedure (see the sketch below):
1. Use LASSO as a model selector: that is, run LASSO and keep all the variables such that $\hat{\beta}_{j,n}^{LASSO} \neq 0$, $j = 1, \ldots, k$.
2. Run OLS (or maximum likelihood) using only the selected variables.
■ For references, see Belloni and Chernozhukov (2013) and Belloni, Chernozhukov and
Wei (2016).
■ You can include the union of selected covariates when using doubly robust
procedures; see, e.g., Belloni et al. (2014).
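■ A minimal Post-LASSO sketch (select with cv.glmnet, then refit by OLS; hdm's rlasso with post = TRUE automates the same idea):
```r
# Step 1: LASSO as a model selector.
sel  <- cv.glmnet(X, y, alpha = 1)
beta <- as.vector(coef(sel, s = "lambda.min"))[-1]  # drop the intercept
keep <- which(beta != 0)

# Step 2: unpenalized OLS on the selected covariates only.
post_fit <- lm(y ~ X[, keep, drop = FALSE])
```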
Let’s see how these work in a DiD simulation exercise
Monte Carlo Simulations
Simulations
■ Use LASSO to estimate all functions, using cross-validation to select penalty terms.
■ We estimate the pscore assuming a logit specification and the outcome regression
models assuming a linear specification.
■ Likewise, $\gamma_0^{ps} = (\gamma_{0,1}^{ps}, \ldots, \gamma_{0,p}^{ps})'$, where $\gamma_{0,j}^{ps} = \dfrac{11 - j}{10} \times 1\{j \leq 10\} - \dfrac{1}{j^2}$.
DGPs
3 DGPs, varying the level of heterogeneity
DGP1 - Unconditional PT is valid
■ DGP1:
DGP2 - Conditional PT holds with homogeneous ATT across X
■ DGP2:
■ Approximate sparsity only holds for the outcome growth, not for the levels: the term $f_v(X)$ is not approximately sparse.
DGP3 - Conditional PT holds with varying ATT(X)
■ DGP3:
■ ATT(X) is dense in X.
Table 1: Monte Carlo Simulations, DGP1: Unconditional PT
Figure 1: Monte Carlo for DID estimators, DGP1: Unconditional PT
(Figure: densities of the Unconditional DiD, Regression DiD, DR DiD, and Std. IPW DiD estimators.)
Table 2: Monte Carlo Simulations, DGP2: Conditional PT but homogeneous ATT across X
Figure 2: Monte Carlo for DID estimators, DGP2: Conditional PT but homogeneous ATT across X
(Figure: densities of the Unconditional DiD, Regression DiD, DR DiD, and Std. IPW DiD estimators.)
Figure 3: Monte Carlo for DID estimators, DGP2: Conditional PT but homogeneous ATT across X
Table 3: Monte Carlo Simulations, DGP3: Conditional PT and heterogeneous ATT across X
Figure 4: Monte Carlo for DID estimators, DGP3: Conditional PT and heterogeneous ATT across X
(Figure: densities of the Unconditional DiD, Regression DiD, DR DiD, and Std. IPW DiD estimators.)
Figure 5: Monte Carlo for DID estimators, DGP3: Conditional PT and heterogeneous ATT across X
What are the requirements?
What are the requirements to use ML in the first step?
■ We need to ensure that the model selection mistakes are “moderately” small for the
underlying model.
▶ It suffices that the product of the estimation errors is relatively small, that is,
$$\left\|\left(m_{\Delta}^{G=\infty}(\cdot) - \widehat{\mu}_{\Delta}^{G=\infty}(\cdot)\right)\left(p(\cdot) - \widehat{\pi}(\cdot)\right)\right\|_2 = o\left(n^{-1/4}\right).$$
▶ This usually comes from assumptions about the “complexity” of the model. Cross-fitting also helps to ensure this for some classes of models (and relaxes some additional conditions when doing LASSO, too).
Take-away messages
Take-away message
■ As long as you use the Doubly-Robust formula for DiD, you can use machine learning
to estimate nuisance functions.
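■ Putting the pieces together, a minimal end-to-end sketch of DR DiD with LASSO nuisances (hypothetical data as in the earlier sketches; an illustration only, not the implementation of Callaway, Drukker, Liu and Sant'Anna (2023)):
```r
library(glmnet)
d  <- df$g2                     # D = 1{G = 2}
dy <- df$y2 - df$y1             # outcome growth

# LASSO nuisances: outcome regression on untreated units, logit pscore on all.
m_cv  <- cv.glmnet(X[d == 0, ], dy[d == 0], alpha = 1)
m_hat <- as.vector(predict(m_cv, newx = X, s = "lambda.min"))
p_cv  <- cv.glmnet(X, d, family = "binomial", alpha = 1)
p_hat <- as.vector(predict(p_cv, newx = X, s = "lambda.min", type = "response"))

# Doubly robust plug-in.
w1 <- d / mean(d)
w0 <- p_hat * (1 - d) / (1 - p_hat)
w0 <- w0 / mean(w0)
att_dr_ml <- mean((w1 - w0) * (dy - m_hat))
```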
References
Abadie, Alberto, “Semiparametric Difference-in-Differences Estimators,” The Review of
Economic Studies, 2005, 72 (1), 1–19.
Athey, Susan and Stefan Wager, “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” Journal of the American Statistical Association, 2018, 113 (523), 1228–1242.
Athey, Susan, Julie Tibshirani, and Stefan Wager, “Generalized random forests,” The Annals of Statistics, 2019, 47 (2), 1148–1178.
Belloni, Alexandre and Victor Chernozhukov, “Least squares after model selection in
high-dimensional sparse models,” Bernoulli, 2013, 19 (2), 521–547.
Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen, “Inference on Treatment Effects after Selection among High-Dimensional Controls,” The Review of Economic Studies, April 2014, 81 (2), 608–650.
Belloni, Alexandre, Victor Chernozhukov, and Ying Wei, “Post-Selection Inference for Generalized Linear Models With Many Controls,” Journal of Business & Economic Statistics, October 2016, 34 (4), 606–619.
Belloni, Alexandre, Victor Chernozhukov, Iván Fernández-Val, and Christian Hansen, “Program Evaluation and Causal Inference With High-Dimensional Data,” Econometrica, 2017, 85 (1), 233–298.
Callaway, Brantly, David Drukker, Di Liu, and Pedro H. C. Sant’Anna,
“Difference-in-Differences via Machine Learning,” Working Paper, 2023.
Chang, Neng-Chieh, “Double/debiased machine learning for difference-in-differences
models,” The Econometrics Journal, 2020, 23 (2), 177–191.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins, “Double/debiased machine learning for treatment and structural parameters,” The Econometrics Journal, June 2017, pp. 1–71.
Chernozhukov, Victor, Mert Demirer, Esther Duflo, and Iván Fernández-Val, “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments,” arXiv:1712.04802, 2022.
Chetverikov, Denis, Zhipeng Liao, and Victor Chernozhukov, “On cross-validated Lasso in high dimensions,” The Annals of Statistics, June 2021, 49 (3), 1–25.
Farrell, Max H., “Robust inference on average treatment effects with possibly more
covariates than observations,” Journal of Econometrics, 2015, 189 (1), 1–23.
Heckman, James, Hidehiko Ichimura, Jeffrey Smith, and Petra Todd, “Characterizing
Selection Bias Using Experimental Data,” Econometrica, 1998, 66 (5), 1017–1098.
Heckman, James J., Hidehiko Ichimura, and Petra E. Todd, “Matching As An Econometric
Evaluation Estimator: Evidence from Evaluating a Job Training Programme,” The Review
of Economic Studies, October 1997, 64 (4), 605–654.
Sant’Anna, Pedro H. C. and Jun Zhao, “Doubly robust difference-in-differences estimators,”
Journal of Econometrics, November 2020, 219 (1), 101–122.
Tibshirani, Robert, “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society. Series B (Methodological), 1996, 58 (1), 267–288.