
Electronic Journal of Statistics
Vol. 15 (2021) 6461–6543
ISSN: 1935-7524
https://doi.org/10.1214/21-EJS1931

Regularizing double machine learning in partially linear endogenous models∗

Corinne Emmenegger and Peter Bühlmann
Seminar for Statistics, ETH Zürich
e-mail: [email protected]; [email protected]

Abstract: The linear coefficient in a partially linear model with confounding variables can be estimated using double machine learning (DML). However, this DML estimator has a two-stage least squares (TSLS) interpretation and may produce overly wide confidence intervals. To address this issue, we propose a regularization and selection scheme, regsDML, which leads to narrower confidence intervals. It selects either the TSLS DML estimator or a regularization-only estimator depending on whose estimated variance is smaller. The regularization-only estimator is tailored to have a low mean squared error. The regsDML estimator is fully data driven. The regsDML estimator converges at the parametric rate, is asymptotically Gaussian distributed, and asymptotically equivalent to the TSLS DML estimator, but regsDML exhibits substantially better finite sample properties. The regsDML estimator uses the idea of k-class estimators, and we show how DML and k-class estimation can be combined to estimate the linear coefficient in a partially linear endogenous model. Empirical examples demonstrate our methodological and theoretical developments. Software code for our regsDML method is available in the R-package dmlalg.

Keywords and phrases: Double machine learning, endogenous variables, generalized method of moments, instrumental variables, k-class estimation, partially linear model, regularization, semiparametric estimation, two-stage least squares.

Received January 2021.

1. Introduction

Partially linear models (PLMs) combine the flexibility of nonparametric approaches with the ease of interpretation of linear models. Allowing for nonparametric terms makes the estimation procedure robust to some model misspecifications. A plaguing issue is potential endogeneity. For instance, if a treatment is not randomly assigned in a clinical study, subjects receiving different treatments differ in other ways than only the treatment [73]. Another situation where an explanatory variable is correlated with the error term occurs if the explanatory variable is determined simultaneously with the response [97]. In such situations, employing estimation methods that do not account for endogeneity can lead to biased estimators [44].
∗ This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 786461).

Let us consider the PLM

Y = X^T β0 + g_Y(W) + h_Y(H) + ε_Y.   (1)

The covariates X and W and the response Y are observed whereas the variable H is not observed and acts as a potential confounder. It can cause endogeneity in the model when it is correlated with X, W, and Y. The variable ε_Y denotes a random error. An overview of PLMs is presented in Härdle, Liang and Gao [48]. Semiparametric methods are summarized in Ruppert, Wand and Carroll [83] and Härdle et al. [49], for instance.

Chernozhukov et al. [31] introduce double machine learning (DML) to estimate the linear coefficient β0 in a model similar to (1). The central ingredients are Neyman orthogonality and sample splitting with cross-fitting. They allow estimates of so-called nuisance terms to be plugged into the estimating equation of β0. The resulting estimator converges at the parametric rate N^{-1/2}, with N denoting the sample size, and is asymptotically Gaussian.

A common approach to cope with endogeneity uses instrumental variables (IVs). Consider a random variable A that typically satisfies the assumptions of a conditional instrument [76]. The DML procedure first adjusts A, X, and Y for W by regressing out W of them. Then the residual Y − E[Y|W] is regressed on X − E[X|W] using the instrument A − E[A|W]. The population parameter is identified by

β0 = E[(A − E[A|W])(Y − E[Y|W])] / E[(A − E[A|W])(X − E[X|W])]   (2)

if both A and X are 1-dimensional. The restriction to the 1-dimensional case is only for simplicity at this point. Below, we consider multivariate A and X. In practice, we insert potentially biased machine learning (ML) estimates of the nuisance parameters E[A|W], E[X|W], and E[Y|W] into this equation for β0. Estimates of these nuisance parameters are typically biased if their complexity is regularized. Neyman orthogonal scores and sample splitting allow circumventing empirical process conditions to justify inserting ML estimators of nuisance parameters into estimating equations [19, 31].
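To make the identification in (2) concrete, here is a minimal sketch in R on simulated data. All modeling choices below (linear nuisance functions so that lm is a valid stand-in for a general ML learner, Gaussian errors, and the value β0 = 2) are our own illustrative assumptions and are not taken from the paper:

```r
set.seed(1)
n <- 10^5
W <- rnorm(n)
A <- 0.5 * W + rnorm(n)              # instrument-like variable, correlated with W
H <- rnorm(n)                        # hidden confounder
X <- A + H + 0.5 * W + rnorm(n)      # endogenous covariate
Y <- 2 * X + 0.5 * W + H + rnorm(n)  # true beta0 = 2

# The conditional expectations are linear in W in this toy model, so plain
# linear regressions estimate E[A|W], E[X|W], and E[Y|W] consistently.
rA <- A - fitted(lm(A ~ W))
rX <- X - fitted(lm(X ~ W))
rY <- Y - fitted(lm(Y ~ W))

mean(rA * rY) / mean(rA * rX)        # sample analogue of (2); close to 2
```

A naive regression of Y on X and W would be biased here because the hidden H affects both X and Y; the residual-on-residual IV ratio (2) removes this bias.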

Equation (2) has a two-stage least squares (TSLS) interpretation [91, 92,
16, 22, 13, 6]. As mentioned above, the residual term Y − E[Y |W ] is regressed
on X − E[X|W ] using the instrument A − E[A|W ]. In entirely linear models,
the following findings have been reported about TSLS and related procedures.
The TSLS estimator has been observed to be highly variable, leading to overly
wide confidence intervals. For instance, although ordinary least squares (OLS) is
biased in the presence of endogeneity, it has been observed to be less variable [96,
71, 89, 34, 62]. The issue with large or nonexisting variance of TSLS (the order
of existing moments of TSLS depends on the degree of overidentification [66,
67, 68]) is also coupled with the strength of the instrument [21, 86, 87, 35, 12].
Reducing the variability is sometimes possible by using k-class estimators [93, 51, 82, 54].
The k-class estimators have been developed for entirely linear models. The TSLS estimator is a k-class estimator with a fixed value of k = 1, and Anderson, Kunitomo and Morimune [8] recommend not using fixed k-class estimators. Three particularly well-established k-class estimators are the limited information maximum likelihood (LIML) estimator [10, 4] and the Fuller(1) and Fuller(4) estimators [43]. They have been developed for entirely linear models to overcome some deficiencies of TSLS. If many instruments are present, LIML enjoys some optimality properties [9]. Furthermore, the normal approximation for the finite sample estimator may be suboptimal for TSLS but useful for LIML [11, 7, 5]. However, LIML has no moments [67, 79, 80, 52]; having no moments can lead to poor squared error performance, especially in weak instrument situations [45]. The Fuller estimators overcome this problem: the Fuller(1) estimator is approximately unbiased, and Fuller(4) has particularly low mean squared error (MSE) [43]. Takeuchi and Morimune [90] give further asymptotic optimality results of the Fuller estimators.

We propose a regularization-selection DML method using the idea of k-class estimators. We call our method regsDML. It is tailored to reduce variance and hence improve the MSE of the estimator of β0. Nevertheless, regsDML converges at the parametric rate, and its coverage of confidence intervals for the linear coefficient β0 remains (asymptotically) valid. Empirical simulations demonstrate that regsDML typically leads to shorter confidence intervals than LIML, Fuller(1), and Fuller(4), while it still attains the nominal coverage level.

1.1. Our contribution

Our contribution is twofold. First, we build on the work of Chernozhukov et al. [31] to estimate β0 in the endogenous PLM (1) with multidimensional A and X such that its estimator β̂ converges at the parametric rate, N^{-1/2}, and is asymptotically Gaussian. In contrast to Chernozhukov et al. [31], we formulate the underlying model as a structural equation model (SEM) and allow A and X to be multidimensional. We directly specify an identifiability condition of β0 instead of giving additional conditional moment restrictions. The SEM may be overidentified in the sense that the dimension of A can exceed the dimension of X. Overidentification can lead to more efficient estimators [3, 18, 47] and more robust estimators [75]. Considering SEMs and an identifiability condition allows us to apply DML to more general situations than in Chernozhukov et al. [31].
Second, we propose a DML method that employs regularization and selection. This method is called regsDML, and we develop it in Section 4. It reduces the potentially excessive estimated standard deviation of DML because it selects either the TSLS DML estimator or a regularization-only estimator called regDML depending on whose estimated variance is smaller. The underlying idea of the regularization-only estimator regDML is similar to k-class estimation [93] and anchor regression [82, 23]. Both k-class estimation and anchor regression are designed for linear models and may require choosing a regularization parameter. Our approach is designed for PLMs, and the regularization parameter is data driven. Recently, Jakobsen and Peters [54] have proposed a related strategy for linear (structural equation) models; whereas they rely on testing for choosing the amount of regularization, we tailor our approach to reduce the MSE such that the coverage of confidence intervals for β0 remains valid. The regsDML estimator converges at the parametric rate and is asymptotically Gaussian. In this sense, and in contrast to Jakobsen and Peters [54], regsDML focuses on statistical inference beyond point estimation with coverage guarantees not only in linear models but also in potentially complex partially linear ones. The regsDML estimator is asymptotically equivalent to the TSLS-type DML estimator, but regsDML may exhibit substantially better finite sample properties. Furthermore, our developments show how DML and k-class estimation can be combined to estimate the linear coefficient in an endogenous PLM.
Our approach allows flexible model specification. We only require that X enters linearly in (1) and that the other terms are additive. In particular, the form of the effect of W on A or of A on W is not constrained. This is partly similar to TSLS, which is robust to model misspecifications in its first stage because it does not rely on a correct specification of the instrument effect on the covariate [15]. The detailed assumptions on how the variables A, X, W, H, and Y interact are given in Section 2: the variable A needs to satisfy an assumption similar to that for a conditional instrument, but there is some flexibility.

We consider a motivating example to illustrate some of the points mentioned above. Figure 1 gives the SEM we generate data from and its associated causal graph [59, 74, 76, 77, 78, 64]. By convention, we omit error variables in a causal graph if they are mutually independent [76]. The variable A is similar to a conditional instrument given W.

Fig 1. An SEM and its associated causal graph.

We simulate M = 1000 datasets each for a range of sample sizes N. The nuisance parameters E[A|W], E[X|W], and E[Y|W] are estimated with additive cubic B-splines with N^{1/5} + 2 degrees of freedom. The simulation results are displayed in Figure 2. This figure displays the coverage, power, and relative length of the 95% confidence intervals for β0 using “standard” DML (red) and the newly proposed methods regDML (blue) and regsDML (green). The regDML method is a version of regsDML with regularization only but no selection. If the blue curve is not visible in Figure 2, it coincides with the green curve. The dashed lines in the coverage and power plots indicate 95% confidence regions with respect to uncertainties in the M simulation runs.

Fig 2. The results come from M = 1000 simulation runs each from the SEM in Figure 1 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1. The nuisance functions are estimated with additive splines. The figure displays the coverage of two-sided confidence intervals for β0, power for two-sided testing of the hypothesis H_0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML (blue), and regsDML (green), where all results are at level 95%. At each N, the lengths of the confidence intervals are scaled with the median length from DML. The shaded regions in the coverage and power plots represent 95% confidence bands with respect to the M simulation runs. The blue and green lines are indistinguishable in the left panel.
The regsDML method succeeds in producing much narrower confidence intervals than DML while maintaining good coverage. The power of regsDML is close to 1 for all considered sample sizes. For small sample sizes, regsDML leads to confidence intervals whose length is around 10%–20% of the length of DML’s. As the sample size increases, regsDML starts to resemble the behavior of the DML estimator but continues to produce substantially shorter confidence intervals. Thus, the regularization-selection regsDML (and also its version with regularization only) is a highly effective method to increase the power and sharpness of statistical inference while keeping the type I error and coverage under control.

Simulation results with β0 = 0 in the SEM of Figure 1 are presented in Figure 7 in Section D in the appendix. Further numerical results are given in Section 5.

1.2. Additional literature

PLMs have received considerable interest. Härdle, Liang and Gao [48] present an overview of estimation methods in purely exogenous PLMs, and many references are given there. The remaining part of this paragraph refers to literature investigating endogenous PLMs. Ai and Chen [2] consider semiparametric estimation with a sieve estimator. Ma and Carroll [63] introduce a parametric model for the latent variable. Yao [98] considers a heteroskedastic error term and a partialling-out scheme [81, 85]. Florens, Johannes and Van Bellegem [42] propose to solve an ill-posed integral equation. Su and Zhang [88] investigate a partially linear dynamic panel data model with fixed effects and lagged variables and consider sieve IV estimators as well as an approach with solving integral equations. Horowitz [53] compares inference and other properties of nonparametric and parametric estimation if instruments are employed.

Combining Neyman orthogonality and sample splitting (with cross-fitting) allows a diverse range of estimators and machine learning algorithms to be used to estimate nuisance parameters. This procedure has alternatively been considered in Newey and McFadden [72], van der Laan and Robins [94], and Chernozhukov et al. [31]. DML methods have been applied in various situations. Chen, Huang and Tien [27] consider instrumental variables quantile regression. Liu, Zhang and Zhou [61] apply DML in logistic partially linear models. Colangelo and Lee [33] employ doubly debiased machine learning methods to a fully nonparametric equation of the response with a continuous treatment. Knaus [55] presents an overview of DML methods in unconfounded models. Farbmacher et al. [41] decompose the causal effect of a binary treatment by a mediation analysis and estimate it by DML. Lewis and Syrgkanis [60] extend DML to estimate dynamic effects of treatments. Chiang et al. [32] apply DML under multiway clustered sampling environments. Cui and Tchetgen Tchetgen [36] propose a technique to reduce the bias of DML estimators.

Nonparametric components can be estimated without sample splitting and cross-fitting if the underlying function class satisfies some entropy conditions; see for instance Mammen and van de Geer [65]. Alternatively, Chen, Liang and Zhou [28] partial out the nonparametric component using a kernel method and employ the generalized method of moments principle [46]. The mentioned entropy regularity conditions limit the complexity of the function class, and ML algorithms usually do not satisfy them. Particularly, these conditions fail to hold if the dimension of the nonparametric variables increases with the sample size [31].

Double robustness and orthogonality arguments have also been considered in the following works. Okui et al. [73] consider doubly robust estimation of the parametric part. Their estimator is consistent if either the model for the effect of the measured confounders on the outcome or the model of the effect of the measured confounders on the instrument is correctly specified. Smucler, Rotnitzky and Robins [84] consider doubly robust estimation of scalar parameters where the nuisance functions are ℓ1-constrained. Targeted minimum loss based estimators and G-estimators also feature an orthogonality property; an overview is given in DiazOrdaz, Daniel and Kreif [38].

The literature presented in this subsection is related to but rather distinct from our work, with the only exception of Chernozhukov et al. [31]. The difference from this latter contribution is highlighted in Section 2 and Section A in the appendix.

Outline of the paper. Sections 2 and 3 describe the DML estimator. The former section introduces an identifiability condition, and the latter investigates asymptotic properties. Section 4 introduces the regularization-selection estimator regsDML and its regularization-only version regDML and investigates their asymptotic properties. Section 5 presents numerical experiments and an empirical real data example. Section 6 concludes our work. Proofs and additional definitions and material are given in the appendix.

Notation. We denote by [N] the set {1, 2, . . . , N}. We add the probability law as a subscript to the probability operator P and the expectation operator E whenever we want to emphasize the corresponding dependence. We denote the L_p(P) norm by ‖·‖_{P,p} and the Euclidean or operator norm by ‖·‖, depending on the context. We implicitly assume that given expectations and conditional expectations exist. We denote by →_d convergence in distribution. Furthermore, we denote by 1_{d×d} ∈ R^{d×d} the d × d identity matrix and write 1 if we do not want to emphasize its dimension.

2. An identifiability condition and the DML estimator

Before we introduce regsDML in Section 4, we present our TSLS-type DML estimator of β0 because we require it to formulate regsDML. The DML estimator estimates the linear coefficient in an endogenous and potentially overidentified PLM where A and X may be multidimensional. Our work builds on Chernozhukov et al. [31], but they only consider univariate A and X and restrict conditional moments to identify the linear coefficient. We impose an unconditional moment restriction below. However, our results recover theirs if A and X are univariate and the additional conditional moment restrictions are satisfied.

Our PLM is cast as an SEM. The SEM specifies the generating mechanism of the random variables A, W, H, X, and Y of dimensions q, v, r, d, and 1, respectively. The structural equation of the response is given by

Y ← X^T β0 + g_Y(W) + h_Y(H) + ε_Y   (3)

as in (1), where β0 ∈ R^d is a fixed unknown parameter vector, and where the functions g_Y and h_Y are unknown. The variable H is hidden and causes endogeneity. The variable ε_Y denotes an unobserved error term. The model is potentially overidentified in the sense that the dimension of A may exceed the dimension of X. Observe that A does not directly affect the response Y in the sense that it does not appear on the right hand side of (3). The model is required to satisfy an identifiability condition as in (5) below.
Econometric models are often presented as a system of simultaneous structural equations. Full information models consider all equations at once, and limited information models only consider equations of interest [5].

2.1. Identifiability condition

An identifiability condition is required to identify β0 in (3). We define the residual terms

R_A := A − E[A|W], R_X := X − E[X|W], and R_Y := Y − E[Y|W]   (4)

that adjust A, X, and Y for W. Our DML estimator of β0 is obtained by performing TSLS of R_Y on R_X using the instrument R_A. This scheme requires the unconditional moment condition

E[R_A (R_Y − R_X^T β0)] = 0   (5)

to identify β0 in (3). For instance, this condition is satisfied if A is independent of both H and ε_Y given W or if A is independent of H, ε_Y, and W. The identifiability condition (5) is strictly weaker than the conditional moment conditions introduced in Chernozhukov et al. [31]; see Section A in the appendix that presents an example where our identifiability condition holds but the conditional moment conditions do not. The subsequent theorem asserts identifiability of β0.

Theorem 2.1. Let the dimensions q = dim(A) and d = dim(X), and assume q ≥ d. Assume furthermore that the matrices E[R_X R_A^T] and E[R_A R_A^T] are of full rank, and assume the identifiability condition (5). Then the representation

β0 = ( E[R_X R_A^T] (E[R_A R_A^T])^{-1} E[R_A R_X^T] )^{-1} E[R_X R_A^T] (E[R_A R_A^T])^{-1} E[R_A R_Y]

holds.
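The representation can be read off directly from the moment condition; the following display is our own two-line sketch of the argument, using only (4), (5), and the rank assumptions of Theorem 2.1:

```latex
% From (5): E[R_A R_Y] = E[R_A R_X^T] \beta_0. Premultiplying by
% E[R_X R_A^T] (E[R_A R_A^T])^{-1} and inverting the resulting d x d
% matrix (invertible by the rank assumptions) isolates \beta_0:
E[R_X R_A^T] \bigl(E[R_A R_A^T]\bigr)^{-1} E[R_A R_Y]
  = E[R_X R_A^T] \bigl(E[R_A R_A^T]\bigr)^{-1} E[R_A R_X^T] \, \beta_0 .
```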

Theorem 2.1 precludes underidentification. The full rank condition of the matrix E[R_X R_A^T] expresses that the correlation between X and A is strong enough after regressing out W. This is a typical TSLS assumption [91, 92, 16, 22, 13, 6]. The rank assumptions in Theorem 2.1 in particular require that A, X, and Y are not deterministic functions of W.

The instrument A instead of R_A can alternatively identify β0 in Theorem 2.1. However, this procedure leads to a suboptimal convergence rate of the resulting estimator; see Section 3.1.

The identifiability condition (5) is central to Theorem 2.1. Section G in the appendix presents examples illustrating SEMs where the identifiability condition holds and where it fails to hold.

2.2. Alternative interpretations of β0

We present two alternative interpretations of β0 apart from performing TSLS of R_Y on R_X using the instrument R_A. The second representation will be used to formulate our regularization schemes in Section 4. To formulate these alternative representations, we introduce the linear projection operator P_{R_A} on R_A that maps a random variable Z to its projection

P_{R_A} Z := E[Z R_A^T] (E[R_A R_A^T])^{-1} R_A.

By Theorem 2.1, the population parameter β0 solves the TSLS moment equation

0 = E[R_X R_A^T] (E[R_A R_A^T])^{-1} E[R_A (R_Y − R_X^T β0)].

This motivates a generalized method of moments interpretation of β0 because we have

β0 = arg min_{β∈R^d} E[ψ(S; β, η^0)]^T (E[R_A R_A^T])^{-1} E[ψ(S; β, η^0)]

for ψ(S; β, η^0) := R_A (R_Y − R_X^T β), where η^0 = (E[A|W], E[X|W], E[Y|W]) denotes the nuisance parameter and S = (A, W, X, Y) denotes the concatenation of the observable variables.

This leads to the second interpretation of β0. The coefficient β0 minimizes the squared projection of the residual R_Y − R_X^T β on R_A, namely

β0 = arg min_{β∈R^d} E[ (P_{R_A}(R_Y − R_X^T β))^2 ].   (6)

We employ the representation of β0 in (6) to formulate our regularization schemes in Section 4.

3. Formulation of the DML estimator and its asymptotic properties

In this section, we describe how to estimate β0 using the TSLS-type DML scheme, and we describe the asymptotic properties of this estimator. Consider N iid realizations {S_i = (A_i, X_i, W_i, Y_i)}_{i∈[N]} of S = (A, X, W, Y) from the SEM in (3). We concatenate the observations of A row-wise to form an (N × q)-dimensional matrix A. Analogously, we construct the matrices X ∈ R^{N×d} and W ∈ R^{N×v} and the vector Y ∈ R^N containing the respective observations.

We construct a DML estimator of β0 as follows. First, we split the data into K ≥ 2 disjoint sets I_1, . . . , I_K. For simplicity, we assume that these sets are of equal cardinality n = N/K. In practice, their cardinality might differ due to rounding issues.
For each k ∈ [K], we estimate the conditional expectations m_A^0(W) := E[A|W], m_X^0(W) := E[X|W], and m_Y^0(W) := E[Y|W], which act as nuisance parameters, with data from I_k^c, the complement of I_k. We call the resulting estimators m̂_A^{I_k^c}, m̂_X^{I_k^c}, and m̂_Y^{I_k^c}, respectively. Then, the adjusted residual terms R_{A,i}^{I_k} := A_i − m̂_A^{I_k^c}(W_i), R_{X,i}^{I_k} := X_i − m̂_X^{I_k^c}(W_i), and R_{Y,i}^{I_k} := Y_i − m̂_Y^{I_k^c}(W_i) are evaluated for i ∈ I_k. We concatenate them row-wise to form the matrices R_A^{I_k} ∈ R^{n×q} and R_X^{I_k} ∈ R^{n×d} and the vector R_Y^{I_k} ∈ R^n.

These K iterates are assembled to form the DML estimator

β̂ := ( (1/K) Σ_{k=1}^K (R_X^{I_k})^T Π_{R_A^{I_k}} R_X^{I_k} )^{-1} ( (1/K) Σ_{k=1}^K (R_X^{I_k})^T Π_{R_A^{I_k}} R_Y^{I_k} )   (7)

of β0, where

Π_{R_A^{I_k}} := R_A^{I_k} ( (R_A^{I_k})^T R_A^{I_k} )^{-1} (R_A^{I_k})^T   (8)

denotes the orthogonal projection matrix onto the space spanned by the columns of R_A^{I_k}.
To obtain β̂ in (7), the individual matrices are first averaged before the final matrix is inverted. It is also possible to compute K individual TSLS estimators on the K iterates individually and average these. Both schemes are asymptotically equivalent. Chernozhukov et al. [31] call these two schemes DML2 and DML1, respectively, where DML2 is as in (7). The DML1 version of the coefficient estimator is given in the appendix in Section B.1. The advantage of DML2 over DML1 is that it enhances stability properties of the estimator. To ensure stability of the DML1 estimator, every individual matrix that is inverted needs to be well conditioned. Stability of the DML2 estimator is ensured if the average of these matrices is well conditioned.
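The following minimal sketch implements the cross-fitted DML2 estimator (7)–(8) for the justidentified univariate case d = q = 1, continuing the simulated data from Section 1. The function name, the fold construction, and the linear nuisance learners (lm as a stand-in for a general ML method) are our own illustrative choices, not the reference implementation in dmlalg:

```r
dml2_beta <- function(A, X, W, Y, K = 2) {
  N <- length(Y)
  folds <- sample(rep(seq_len(K), length.out = N))  # random K-fold split
  df <- data.frame(A = A, X = X, W = W, Y = Y)
  num <- 0; den <- 0
  for (k in seq_len(K)) {
    test <- folds == k
    # nuisance estimators fitted on the complement I_k^c ...
    fitA <- lm(A ~ W, data = df[!test, ])
    fitX <- lm(X ~ W, data = df[!test, ])
    fitY <- lm(Y ~ W, data = df[!test, ])
    # ... and residuals evaluated on I_k (sample splitting)
    rA <- df$A[test] - predict(fitA, newdata = df[test, ])
    rX <- df$X[test] - predict(fitX, newdata = df[test, ])
    rY <- df$Y[test] - predict(fitY, newdata = df[test, ])
    # fold-k contributions to the two averages in (7); for d = q = 1,
    # rX' Pi rX = sum(rX*rA)^2 / sum(rA^2), and analogously with rY
    den <- den + sum(rX * rA) * sum(rA * rX) / sum(rA^2) / K
    num <- num + sum(rX * rA) * sum(rA * rY) / sum(rA^2) / K
  }
  num / den  # DML2: average the matrices first, then solve
}
```

A DML1 variant would instead compute num/den separately within each fold and average the K resulting coefficients.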

The K sample splits are random. To reduce the effect of this randomness, we repeat the overall procedure S times and assemble the results as suggested in Chernozhukov et al. [31]. This procedure is described in Algorithm 1 in Section 4.2 below.

The following theorem establishes that β̂ converges at the parametric rate and is asymptotically Gaussian.

Theorem 3.1. Consider model (3). Suppose that Assumption I.5 in the appendix in Section I holds and consider ψ given in Definition I.1 in the appendix in Section I. Then β̂ as in (7) concentrates in a 1/√N neighborhood of β0. It is approximately linear and centered Gaussian, namely

√N σ^{-1} (β̂ − β0) = (1/√N) Σ_{i=1}^N ψ(S_i; β0, η^0) + o_P(1) →_d N(0, 1_{d×d})   (N → ∞),

uniformly over the law P of S = (A, W, X, Y), and where the variance-covariance matrix σ^2 is given by σ^2 = J_0 J̃_0 J_0^T for the matrices J̃_0 and J_0 given in Definition I.1 in the appendix.

A similar result to Theorem 3.1 is presented by Chernozhukov et al. [31]. However, their result requires univariate A and X, and it imposes conditional moment restrictions instead of the identifiability condition (5); see also Section A in the appendix that presents an example where our identifiability condition holds but the conditional moment conditions do not. If A and X are univariate and the respective conditional moment conditions hold, our result coincides with Chernozhukov et al. [31].
Theorem 3.1 also holds for the DML1 version of β̂ defined in the appendix in Section B.1. Assumption I.5 specifies regularity conditions and the convergence rate of the machine learners estimating the conditional expectations. The machine learners are required to satisfy the product relations

‖m_A^0(W) − m̂_A^{I_k^c}(W)‖_{P,2}^2 ≲ N^{-1/2},
‖m_A^0(W) − m̂_A^{I_k^c}(W)‖_{P,2} · ( ‖m_Y^0(W) − m̂_Y^{I_k^c}(W)‖_{P,2} + ‖m_X^0(W) − m̂_X^{I_k^c}(W)‖_{P,2} ) ≲ N^{-1/2}   (9)

for k ∈ [K], which allows us to employ a broad range of ML estimators. For instance, these convergence rates are satisfied by ℓ1-penalized and related methods in a variety of sparse, high-dimensional linear models [26, 20, 24, 17], forward selection in sparse linear models [57], high-dimensional additive models [69, 56, 99], or regression trees and random forests [95, 14]. Please see Chernozhukov et al. [31] for additional references. In particular, the rate condition (9) is satisfied if the individual ML estimators converge at rate N^{-1/4}. Therefore, the individual ML estimators are not required to converge at rate N^{-1/2}.

The asymptotic variance σ^2 can be consistently estimated by replacing the true β0 by β̂ or its DML1 version. The nuisance functions are estimated on subsampled datasets, and the estimator of σ^2 is obtained by cross-fitting. The formal definition, the consistency result, and its proof are given in Definition I.1 and in Theorem I.21 in the appendix in Section I.

For fixed P, the asymptotic variance-covariance matrix σ^2 is the same as if the conditional expectations m_A^0(W), m_X^0(W), and m_Y^0(W) and hence R_A, R_X, and R_Y were known.
Theorem 3.1 holds uniformly over laws P . This uniformity guarantees some
robustness of the asymptotic statement [31]. The dimension v of the covariate
W may grow as the sample size increases. Thus, high-dimensional methods can
be considered to estimate the conditional expectations E[A|W ], E[X|W ], and
E[Y |W ].

The estimator β̂ solves the moment equations

0 = (1/K) Σ_{k=1}^K [ ( (1/n) Σ_{i∈I_k} R_{X,i}^{I_k} (R_{A,i}^{I_k})^T ) ( (1/n) Σ_{i∈I_k} R_{A,i}^{I_k} (R_{A,i}^{I_k})^T )^{-1} ( (1/n) Σ_{i∈I_k} ψ(S_i; β̂, η̂^{I_k^c}) ) ],

where the score function ψ is given by

ψ(S; β, η) := (A − m_A(W)) ( Y − m_Y(W) − (X − m_X(W))^T β )   (10)

for η = (m_A, m_X, m_Y), and where the estimated nuisance parameter is given by η̂^{I_k^c} = (m̂_A^{I_k^c}, m̂_X^{I_k^c}, m̂_Y^{I_k^c}). Observe that ψ(S; β0, η^0) with η^0 = (m_A^0, m_X^0, m_Y^0) coincides with the term whose expectation is constrained to equal 0 in the identifiability condition (5). The crucial step to prove asymptotic normality of √N(β̂ − β0) is to analyze the asymptotic behavior of (1/√n) Σ_{i∈I_k} ψ(S_i; β̂, η̂^{I_k^c}) for k ∈ [K].
Apart from the identifiability condition, the first fundamental requirement
to analyze these terms is the ML convergence rates in (9). Second, we employ
sample splitting and cross-fitting. Sample splitting ensures that the data used
to estimate the nuisance parameters and the data on which these estimators are
evaluated are independent. Cross-fitting enables us to regain full efficiency. The
third requirement is that the underlying score function ψ in (10) is Neyman
orthogonal, which we explain next.
Neyman orthogonality ensures that ψ is insensitive to small changes in the
nuisance parameter η at the true unknown linear coefficient β0 and the true
unknown nuisance parameter η 0 . This makes estimation of β0 robust to inserting
biased ML estimators of the nuisance parameter in the estimation equation. The
following definition formally introduces this concept.
Definition 3.2. [31, Definition 2.1]. A score ψ = ψ(S; β, η) is Neyman orthogonal at (β0, η^0) if the pathwise derivative map

(∂/∂r) E[ψ(S; β0, η^0 + r(η − η^0))]

exists for all r ∈ [0, 1) and nuisance parameters η and vanishes at r = 0.

Definition 3.2 does not entirely coincide with Chernozhukov et al. [31, Definition 2.1] because the latter also includes an identifiability condition. We directly assume the identifiability condition (5).
The subsequent proposition states that the score function ψ in (10) is indeed Neyman orthogonal.

Proposition 3.3. The score ψ given in Equation (10) is Neyman orthogonal.

We would like to remark that Neyman orthogonality of ψ neither depends on the distribution of S nor on the value of β0 and η^0. In addition to being Neyman orthogonal, ψ is linear in β in the sense that we have

ψ(S; β, η) = ψ^b(S; η) − ψ^a(S; η) β   (11)

for

ψ^b(S; η) := (A − m_A(W)) (Y − m_Y(W))

and

ψ^a(S; η) := (A − m_A(W)) (X − m_X(W))^T.

This linearity property is also employed in the proof of Theorem 3.1.
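For intuition, the pathwise derivative in Definition 3.2 can be computed explicitly for the score (10). The following display is our own sketch of the computation behind Proposition 3.3; it writes Δ_A := m_A − m_A^0 (and analogously Δ_X, Δ_Y) and uses E[R_A | W] = 0 together with E[R_Y − R_X^T β0 | W] = 0, both of which follow from the residual definitions (4) and model (3):

```latex
% Differentiating E[psi(S; beta_0, eta^0 + r(eta - eta^0))] in r at r = 0
% gives two terms, each of which vanishes after conditioning on W:
\frac{\partial}{\partial r}\,
  E\bigl[\psi\bigl(S;\beta_0,\eta^0 + r(\eta-\eta^0)\bigr)\bigr]\Big|_{r=0}
= -\,E\bigl[\Delta_A(W)\,\bigl(R_Y - R_X^T\beta_0\bigr)\bigr]
  \;-\; E\bigl[R_A\,\bigl(\Delta_Y(W) - \Delta_X(W)^T\beta_0\bigr)\bigr]
= 0 .
```

Hence first-order errors in the nuisance estimates do not propagate to the estimating equation for β0.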

3.1. Suboptimal estimation procedure

In general, we cannot employ A as an instrument instead of R_A in our TSLS-type DML estimation procedure. For simplicity, we assume K = 2 in this subsection and consider disjoint index sets I and I^c of size n = N/2. The term

(1/√n) Σ_{i∈I} A_i ( R_{Y,i}^I − (R_{X,i}^I)^T β0 )   (12)

may diverge as N → ∞ because m̂_X^{I^c} and m̂_Y^{I^c} may be biased estimators of m_X^0 and m_Y^0. This in particular happens if the functions m_X^0 and m_Y^0 are high-dimensional and need to be estimated by regularization techniques; see Chernozhukov et al. [31]. Even if sample splitting is employed, the term (12) is asymptotically not well behaved because the underlying score function

ϕ(S; β, η) := A ( Y − m_Y(W) − (X − m_X(W))^T β )

is not Neyman orthogonal. The issue is illustrated in Figure 3. The SEM used to generate the data is similar to the nonconfounded model used in Chernozhukov et al. [31, Figure 1]. The centered and rescaled term (β̂ − β0)/√(V̂ar(β̂)) using A as an instrument is biased whereas it is not if the instrument R_A is used. Here, V̂ar(β̂) denotes the empirically observed variance of β̂ with respect to the performed simulation runs.

Fig 3. Histograms of (β̂ − β0)/√(V̂ar(β̂)), where V̂ar(β̂) denotes the empirically observed variance of β̂ with respect to the simulation runs, using A as an instrument in the left plot and using R_A as an instrument in the right plot. The orange curves represent the density of N(0, 1). The results come from 5000 simulation runs of sample size 5000 each from the SEM in the appendix in Section C with K = 2 and S = 1. The conditional expectations are estimated with random forests consisting of 500 trees that have a minimal node size of 5.

4. Regularizing the DML estimator: regDML and regsDML

We introduce a regularized estimator, regsDML, whose estimated standard deviation is typically smaller and never worse than the one of the TSLS-type DML estimator described above. Supporting theory and simulations illustrate that the associated confidence intervals nevertheless reach (asymptotically) valid and good coverage. The regsDML estimator selects either the DML estimator or its regularization-only version regDML, depending on which of the two has a smaller estimated standard deviation.

Subsequently, we first introduce the regularization-only method regDML. The regDML estimator is obtained by regularizing DML and choosing a data-dependent regularization parameter. Before we describe the choice of the regularization parameter, we introduce the regularization scheme for fixed regularization parameters.

Given a regularization parameter γ ≥ 0, the population coefficient b^γ of the regularization scheme optimizes an objective function similar to the one used in k-class regression [93] or anchor regression [82, 23]. We established the representation

β0 = arg min_{β∈R^d} E[ (P_{R_A}(R_Y − R_X^T β))^2 ]

of β0 in (6). For a regularization parameter γ ≥ 0, we consider the regularized objective function and corresponding population coefficient

b^γ := arg min_{β∈R^d} ( E[ ((Id − P_{R_A})(R_Y − R_X^T β))^2 ] + γ E[ (P_{R_A}(R_Y − R_X^T β))^2 ] ).   (13)

This regularized objective is form-wise analogous to the objective function employed in anchor regression. The anchor regression estimator has been reformulated as a k-class estimator by Jakobsen and Peters [54] for a linear model. If γ = 1, ordinary least squares regression of R_Y on R_X is performed. If γ = 0, we are partialling out or adjusting for the variable R_A. If γ = ∞, we perform TSLS regression of R_Y on R_X using the instrument R_A. In this case, b^γ coincides with β0. The coefficient b^γ interpolates between the OLS coefficient b^{γ=1} and the TSLS coefficient β0 for general choices of γ > 1. For γ > 1, there is a one-to-one correspondence between b^γ and the k-class estimator (based on R_A, R_X, and R_Y) with regularization parameter κ = (γ − 1)/γ ∈ (0, 1); see Jakobsen and Peters [54].
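The correspondence between the two parametrizations is a simple reparametrization; the display below is our own restatement of the map just mentioned:

```latex
% For gamma > 1, the regularization parameter gamma and the k-class
% parameter kappa determine each other via
\kappa(\gamma) = \frac{\gamma - 1}{\gamma} \in (0, 1),
\qquad
\gamma(\kappa) = \frac{1}{1 - \kappa},
% with gamma -> infinity (kappa -> 1) recovering TSLS and
% gamma = 1 (kappa = 0) recovering OLS.
```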

4.1. Estimation and asymptotic normality

In this section, we describe how to estimate b^γ in (13) for fixed γ ≥ 0 using a DML scheme, and we describe the asymptotic properties of this estimator. We consider the residual matrices R_A^{I_k} ∈ R^{n×q} and R_X^{I_k} ∈ R^{n×d} and the vector R_Y^{I_k} ∈ R^n introduced in Section 3 that adjust the data with respect to the nonparametric variables. The estimator of b^γ is given by

b̂^γ := arg min_{b∈R^d} (1/K) Σ_{k=1}^K ( ‖(1 − Π_{R_A^{I_k}})(R_Y^{I_k} − R_X^{I_k} b)‖_2^2 + γ ‖Π_{R_A^{I_k}}(R_Y^{I_k} − R_X^{I_k} b)‖_2^2 ),

where Π_{R_A^{I_k}} is as in (8). This estimator can be expressed in closed form by

b̂^γ = ( (1/K) Σ_{k=1}^K (R̃_X^{I_k})^T R̃_X^{I_k} )^{-1} ( (1/K) Σ_{k=1}^K (R̃_X^{I_k})^T R̃_Y^{I_k} ),   (14)

where

R̃_X^{I_k} := (1 + (√γ − 1) Π_{R_A^{I_k}}) R_X^{I_k} and R̃_Y^{I_k} := (1 + (√γ − 1) Π_{R_A^{I_k}}) R_Y^{I_k}.   (15)

The computation of b̂^γ is similar to an OLS scheme where R̃_Y^{I_k} is regressed on R̃_X^{I_k}. To obtain b̂^γ, individual matrices are first averaged before the final matrix is inverted. It is also possible to directly carry out the K OLS regressions of R̃_Y^{I_k} on R̃_X^{I_k} and average the resulting parameters. Both schemes are asymptotically equivalent. We call the two schemes DML2 and DML1, respectively. This is analogous to Chernozhukov et al. [31] as already mentioned in Section 3. The DML1 version is presented in the appendix in Section B.2. As mentioned in Section 3, the advantage of DML2 over DML1 is that it enhances stability properties of the coefficient estimator because the average of matrices needs to be well conditioned but not every individual matrix.
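A minimal sketch of the transformation (15) and the resulting OLS form (14) for d = q = 1 and a single fold is given below; rA, rX, and rY denote the cross-fitted residual vectors on I_k as in the sketch of Section 3, and all function names are our own:

```r
# Transform the residuals as in (15): (1 + (sqrt(gamma) - 1) * Pi) v,
# where Pi is the projection onto the span of rA.
reg_transform <- function(rA, rX, rY, gamma) {
  proj <- function(v) rA * sum(rA * v) / sum(rA^2)  # Pi v for a vector v
  list(rX_tilde = rX + (sqrt(gamma) - 1) * proj(rX),
       rY_tilde = rY + (sqrt(gamma) - 1) * proj(rY))
}

# For one fold (K = 1), (14) is ordinary least squares of rY_tilde on rX_tilde:
# tr <- reg_transform(rA, rX, rY, gamma = 4)
# b_hat_gamma <- sum(tr$rX_tilde * tr$rY_tilde) / sum(tr$rX_tilde^2)
```

Setting gamma = 1 leaves the residuals unchanged (OLS), and letting gamma grow inflates the projected component, approaching the TSLS solution.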

Theorem 4.1. Let γ ≥ 0. Suppose that Assumption I.5 in the appendix in Section I (same as in Theorem 3.1) except I.5.1 holds, and consider the quantities σ^2(γ) and ψ introduced in Definition J.1 in the appendix in Section J. The estimator b̂^γ concentrates in a 1/√N neighborhood of b^γ. It is approximately linear and centered Gaussian, namely

√N σ^{-1}(γ)(b̂^γ − b^γ) = (1/√N) Σ_{i=1}^N ψ(S_i; b^γ, η^0) + o_P(1) →_d N(0, 1_{d×d})   (N → ∞),

uniformly over laws P of S = (A, W, X, Y).

Theorem 4.1 also holds for the DML1 version of b̂^γ defined in the appendix in Section B.2. The influence function is denoted by ψ in both Theorems 3.1 and 4.1 but is defined differently. Assumption I.5 specifies regularity conditions and the convergence rate of the machine learners of the conditional expectations.
The machine learners are required to satisfy the product relations

‖m_A^0(W) − m̂_A^{I_k^c}(W)‖_{P,2}^2 ≲ N^{-1/2},
‖m_X^0(W) − m̂_X^{I_k^c}(W)‖_{P,2} · ( ‖m_Y^0(W) − m̂_Y^{I_k^c}(W)‖_{P,2} + ‖m_X^0(W) − m̂_X^{I_k^c}(W)‖_{P,2} ) ≲ N^{-1/2},
‖m_A^0(W) − m̂_A^{I_k^c}(W)‖_{P,2} · ( ‖m_Y^0(W) − m̂_Y^{I_k^c}(W)‖_{P,2} + ‖m_X^0(W) − m̂_X^{I_k^c}(W)‖_{P,2} ) ≲ N^{-1/2}

for k ∈ [K]. The main difference to Theorem 3.1 and quantity of interest is the asymptotic variance σ^2(γ). It can be consistently estimated with either b̂^γ or its DML1 version as illustrated in Theorem J.3 in the appendix in Section J. Typically, for γ < ∞, the asymptotic variance σ^2(γ) is smaller than σ^2 in Theorem 3.1. Such a variance gain comes at the price of bias because b̂^γ estimates b^γ and not the true parameter β0.

The proof of Theorem 4.1 uses Neyman orthogonality of the underlying score function. Recall that Neyman orthogonality neither depends on the distribution of S nor on the value of the coefficients β0 and η^0 as discussed in Section 3. For fixed γ > 1, Theorem 4.1 furthermore implies that the k-class estimator corresponding to b̂^γ converges at the parametric rate and follows a Gaussian distribution asymptotically.

4.2. Estimating the regularization parameter γ

For simplicity, we assume d = 1 in this subsection. The results can be extended to d > 1.

Subsequently, we introduce a data-driven method to choose the regularization parameter γ in practice. The estimated regularization parameter γ̂ leads to an estimate of β0 that asymptotically has the same MSE behavior as the TSLS-type estimator β̂ in (7) but may exhibit substantially better finite sample properties.

We consider the estimated regularization parameter

γ̂ := arg min_{γ≥0} ( (1/N) σ̂^2(γ) + |b̂^γ − β̂|^2 ).   (16)

It optimizes an estimate of the asymptotic MSE of b̂^γ: the term σ̂^2(γ) is the consistent estimator of σ^2(γ) described in Theorem J.3 in the appendix in Section J, and the term |b̂^γ − β̂|^2 is a plug-in estimator of the squared population bias |b^γ − β0|^2. The estimated regularization parameter γ̂ is random because it depends on the data.
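In practice the minimization in (16) runs over a finite grid, as in Algorithm 1 below. The following sketch assumes the per-γ estimators are available as functions; the names select_gamma, b_hat, and sigma2_hat are hypothetical placeholders of our own, not part of dmlalg:

```r
# Pick gamma on a grid by minimizing the estimated asymptotic MSE (16):
# sigma2_hat(g)/N estimates the variance term, and (b_hat(g) - beta_hat)^2
# is the plug-in estimate of the squared bias |b^gamma - beta_0|^2.
select_gamma <- function(gammas, b_hat, sigma2_hat, beta_hat, N) {
  mse_hat <- vapply(gammas, function(g) {
    sigma2_hat(g) / N + (b_hat(g) - beta_hat)^2
  }, numeric(1))
  gammas[which.min(mse_hat)]
}

# As described below, the selected value is subsequently inflated by a
# slowly diverging factor a_N (the default choice is log(sqrt(N))):
# gamma_prime <- log(sqrt(N)) * select_gamma(gammas, b_hat, sigma2_hat, beta_hat, N)
```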

First, we investigate the bias of the population parameter b^{γ_N} for a nonrandom sequence of regularization parameters {γ_N}_{N≥1} as N → ∞. Afterwards, we propose a modified estimator of the regularization parameter whose corresponding parameter estimate is denoted by regDML, and we introduce the regularization-selection estimator regsDML. Finally, we analyze the asymptotic properties of regDML and regsDML.

Let us consider a deterministic sequence {γ_N}_{N≥1} of regularization parameters. By Proposition 4.2 below, the (scaled) population bias √N |b^{γ_N} − β0| vanishes as N → ∞ if γ_N is of larger order than √N.


Proposition 4.2. Suppose that I.5.1, I.5.3, and I.5.4 of Assumption I.5 in the appendix in Section I hold (subset of the assumptions in Theorem 3.1). Assume {γ_N}_{N≥1} is a sequence of non-negative real numbers. Then we have

√N |b^{γ_N} − β0| → 0   if γ_N ≫ √N,
√N |b^{γ_N} − β0| → C   if γ_N ∼ √N,
√N |b^{γ_N} − β0| → ∞   if γ_N ≪ √N

as N → ∞ for some non-negative finite real number C.


Theorem 4.3 below shows that the estimated regularization parameter γ̂ is of equal or larger stochastic order than √N. If it were not, choosing γ = ∞ in (16), and hence selecting the TSLS-type estimator β̂, would lead to a smaller estimated asymptotic MSE.

Theorem 4.3. Let γ_N = o(√N), and suppose that Assumption I.5 in the appendix in Section I holds (same as in Theorem 3.1). We then have

lim_{N→∞} P( σ̂^2(γ_N) + N (b̂^{γ_N} − β̂)^2 ≤ σ̂^2 ) = 0.

If γ̂ is multiplied by a deterministic scalar a_N that diverges to +∞ at an arbitrarily slow rate as N → ∞, the modified regularization parameter γ̂′ := a_N γ̂ is of stochastic order larger than √N. By default, we choose a_N = log(√N). Proposition 4.2 is formulated for deterministic regularization parameters, but the deterministic statements can be replaced by probabilistic ones. Proposition 4.2 then implies that the population bias term |b^{γ̂′} − β0| vanishes at rate o_P(N^{-1/2}). Thus, the two quantities √N(b̂^{γ̂′} − b^{γ̂′}) and √N(b̂^{γ̂′} − β0) are asymptotically equivalent due to Theorem 4.4 below, and we have

√N (b̂^{γ̂′} − β0) ≈ N(0, σ^2(γ̂′))

whenever N is sufficiently large (note that asymptotically as N → ∞, the right-hand side has the same limit as described in Theorem 4.4).

We call b̂^{γ̂′} the regDML (regularized DML) estimator. The regularization-selection estimator selects between DML and regDML based on whose variance estimate is smaller. The “s” in regsDML stands for selection.
Theorem 4.4. Suppose that Assumption I.5 in the appendix in Section I holds (same as in Theorem 3.1). Let {a_j}_{j≥1} be a sequence of deterministic, non-negative real numbers that diverges to ∞ as N → ∞. Furthermore, consider γ̂′ = a_N γ̂ as above. Then, we have

√N σ̂^{-1}(γ̂′)(b̂^{γ̂′} − b^{γ̂′}) = √N σ^{-1}(β̂ − β0) + o_P(1)

uniformly over laws P of S = (A, W, X, Y), where σ̂(·) is the estimator from Theorem J.3 in the appendix, which consistently estimates σ(·) from Theorem 4.1.

Particularly, b̂^{γ̂′} and β̂ are asymptotically equivalent. But b̂^{γ̂′} may exhibit substantially better finite sample properties as we demonstrate in the subsequent section. Because b̂^{γ̂′} and β̂ are asymptotically equivalent, the same result also holds for the selection estimator regsDML.

The proof of Theorem 4.4 does not depend on the precise construction of γ̂′ and only uses that the random regularization parameter is of stochastic order larger than √N. Thus, Theorem 4.4 remains valid if the regularization parameter comes from k-class estimation and is of the required stochastic order. The same stochastic order is also required to show that k-class estimators are asymptotically Gaussian [70, 68].

The K sample splits are random. To reduce the effect of this randomness, we repeat the overall procedure S times and assemble the results as suggested in Chernozhukov et al. [31]. The assembled parameter estimate is given by the median of the individual parameter estimates; see Steps 7 and 8 of Algorithm 1. The assembled variance estimate is given by adding a correction term to the individual variances and subsequently taking the median of these corrected terms; see Steps 9 and 10. The correction term measures the variability due to sample splitting across s ∈ [S].

It is possible that the assembled variance of regDML is larger than the assembled variance of DML. In such a case, we do not use the regDML estimator and select the DML estimator instead to ensure that the final estimator of β0 does not experience a larger estimated variance than DML. This is the regsDML scheme. A summary of this procedure is given in Algorithm 1.

5. Numerical experiments

This section illustrates the performance of the DML, regDML, and regsDML estimators in a simulation study and for an empirical dataset. Our implementation is available in the R-package dmlalg [40]. We employ the DML2 method and K = 2 and S = 100 in Algorithm 1. Furthermore, we compare our estimation schemes with the following three k-class estimators: LIML, Fuller(1), and Fuller(4). On each of the K sample splits, we compute the regularization parameter of the respective k-class estimation procedure and average them. Then, we compute the corresponding γ-value and proceed as for the other regularized estimators according to Algorithm 1.

The first example in Section 5.1 considers an overidentified model in which the dimension of A is larger than the dimension of X. The second example in Section 5.2 considers justidentified real-world data. In both examples, the conditional expectations acting as nuisance parameters are estimated with random forests.

Algorithm 1: regsDML in a PLM with confounding variables.

Input: N iid realizations from the SEM (3), a natural number S, a regularization parameter grid {γ_i}_{i∈[L]} for some natural number L, a non-negative diverging sequence {a_j}_{j≥1}.
Output: An estimator of β0 in (3) together with its estimated asymptotic variance.
1  for s ∈ [S] do
2      Compute β̂_s = β̂ and σ̂_s^2 = σ̂^2.
3      Compute b̂_s^{γ_i} = b̂^{γ_i} and σ̂_s^2(γ_i) = σ̂^2(γ_i) for i ∈ [L].
4      Choose γ̂_s = arg min_{γ∈{γ_i}_{i∈[L]}} ( (1/N) σ̂_s^2(γ) + |b̂_s^γ − β̂_s|^2 ) and let γ̂′_s = a_N γ̂_s.
5      Compute b̂_s^{γ̂′_s} = b̂^{γ̂′_s} and σ̂_s^2(γ̂′_s) = σ̂^2(γ̂′_s).
6  end
7  Compute β̂^{med} = median_{s∈[S]}(β̂_s).
8  Compute b̂_{reg}^{med} = median_{s∈[S]}(b̂_s^{γ̂′_s}).
9  Compute σ̂^{2,med} = median_{s∈[S]}( σ̂_s^2 + (β̂_s − β̂^{med})^2 ).
10 Compute σ̂_{reg}^{2,med} = median_{s∈[S]}( σ̂_s^2(γ̂′_s) + (b̂_s^{γ̂′_s} − b̂_{reg}^{med})^2 ).
11 if σ̂_{reg}^{2,med} < σ̂^{2,med} then
12     Take the parameter estimate b̂_{reg}^{med} together with its associated estimated asymptotic variance (1/N) σ̂_{reg}^{2,med}.
13 else
14     Take the parameter estimate β̂^{med} together with its associated estimated asymptotic variance (1/N) σ̂^{2,med}.
15 end
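The aggregation across the S sample splits (Steps 7–10) is simple to state in code; the sketch below uses our own function and argument names and assumes the per-split estimates are already computed:

```r
# beta_s: length-S vector of per-split estimates beta_hat_s
# var_s:  length-S vector of per-split variance estimates sigma2_hat_s
aggregate_splits <- function(beta_s, var_s) {
  beta_med <- median(beta_s)
  # the correction term (beta_s - beta_med)^2 accounts for the extra
  # variability introduced by the random sample splits
  var_med <- median(var_s + (beta_s - beta_med)^2)
  c(estimate = beta_med, variance = var_med)
}
```

The same aggregation is applied to the regularized estimates, and the scheme with the smaller assembled variance is selected (Steps 11–15).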

An example where the conditional expectations are estimated with splines is given in Section 1.1. Additional empirical results are provided in the appendix in Sections D, E, and F. The latter section considers examples where DML, regDML, and regsDML do not work well in finite sample situations: we follow the NCP (No Cherry Picking) guideline [25] to possibly enhance further insights into the finite sample behavior. Section E in the appendix presents examples where the link A → X is weak and examples illustrating the bias-variance tradeoff of the respective estimated quantities as a function of the regularization parameter γ.

5.1. Simulation example with random forests

We generate data from the SEM in Figure 4. This SEM satisfies the identifiability condition (5) because A1 and A2 are independent of H given W1 and W2; a proof is given in the appendix in Section K. The model is overidentified because the dimension of A = (A1, A2) is larger than the dimension of X. The variable A1 directly influences A2 that in turn directly affects W1. Both W1 and W2 directly influence H. Both A1 and A2 directly influence X. The variable A1 is a source node.

Fig 4. An SEM and its associated causal graph.

We simulate M = 1000 datasets each from the SEM in Figure 4 for a range
of sample sizes. For every dataset, we compute a parameter estimate and an
associated confidence interval with DML, regDML, and regsDML. We choose
K = 2 and S = 100 in Algorithm 1 and estimate the conditional expectations
with random forests consisting of 500 trees that have a minimal node size of 5.

Figure 5 illustrates our findings. It gives the coverage, power, and relative length of the 95% confidence intervals for a range of sample sizes N of the three methods. The blue and green curves correspond to regDML and regsDML, respectively. If the blue curve is not visible in Figure 5, it coincides with the green one. The two regularization methods perform similarly because regularization can considerably improve DML, in which case regsDML selects regDML. The red curves correspond to DML. If the red curve is not visible, it coincides with LIML, whose results are displayed in orange. The Fuller(1) and Fuller(4) estimators correspond to purple and cyan, respectively.

The top left plot in Figure 5 displays the coverages as interconnected dots. The dashed lines represent 95% confidence regions of the coverages. These confidence regions are computed with respect to uncertainties in the M simulation runs. No coverage region falls below the nominal 95% level that is marked by the gray line.
The bottom left plot in Figure 5 shows that the power of DML, LIML, and Fuller(1) is lower for small sample sizes and increases gradually. The power of the other regularization methods remains approximately 1. The dashed lines represent 95% confidence regions that are computed with respect to uncertainties in the M simulation runs.

The right plot in Figure 5 displays boxplots of the scaled lengths of the confidence intervals. For each N, the confidence interval lengths of all methods are divided by the median confidence interval lengths of DML. The length of the regsDML confidence intervals is around 50%–80% of the length of DML’s. Nevertheless, the coverage of regsDML remains around 95%. The LIML, Fuller(1), and Fuller(4) confidence intervals are considerably longer than regsDML’s. Although the confidence intervals of regsDML are the shortest of all considered methods, its coverage remains valid.

Fig 5. The results come from M = 1000 simulation runs each from the SEM in Figure 4 for a
range of sample sizes N and with K = 2 and S = 100 in Algorithm 1. The nuisance functions
are estimated with random forests. The figure displays the coverage of two-sided confidence
intervals for β0 , power for two-sided testing of the hypothesis H0 : β0 = 0, and scaled lengths
of two-sided confidence intervals of DML (red), regDML (blue), regsDML (green), LIML
(orange), Fuller(1) (purple), and Fuller(4) (cyan), where all results are at level 95%. At
each N , the lengths of the confidence intervals are scaled with the median length from DML.
The shaded regions in the coverage and the power plots represent 95% confidence bands with
respect to the M simulation runs. The blue and green lines as well as the red and orange ones
are indistinguishable in the left panel.

Simulation results with β0 = 0 in the SEM in Figure 4 are presented in Figure 8 in the appendix in Section D.

5.2. Real data example

We apply the DML and regsDML methods to a real dataset. We estimate the linear effect β0 of institutions on economic performance following the work of Acemoglu, Johnson and Robinson [1] and Chernozhukov et al. [31]. Countries with better institutions achieve a greater level of income per capita, and wealthy economies can afford better institutions. This may cause simultaneity. To overcome it, mortality rates of the first European settlers in colonies are considered as a source of exogenous variation in institutions. For further details, we refer to Acemoglu, Johnson and Robinson [1] and Chernozhukov et al. [31]. The data is available in the R-package hdm [29] and is called AJR. In our notation, the response Y is the GDP, the covariate X the average protection against expropriation risk, the variable A the logarithm of settler mortality, and the covariate W consists of the latitude, the squared latitude, and the binary factors Africa, Asia, North America, and South America. That is, we adjust nonparametrically for the latitude and geographic information.
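Assembling these variables in R might look as follows. The column names below follow the hdm documentation of the AJR dataset but should be verified against ?hdm::AJR, and the commented regsdml call is only indicative; consult the dmlalg documentation for the exact interface:

```r
library(hdm)
data(AJR)

Y <- AJR$GDP       # response: GDP
X <- AJR$Exprop    # average protection against expropriation risk
A <- AJR$logMort   # logarithm of settler mortality
W <- AJR[, c("Latitude", "Latitude2", "Africa", "Asia", "Namer", "Samer")]

# library(dmlalg)
# fit <- regsdml(a = A, w = W, x = X, y = Y)  # see ?dmlalg::regsdml
```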

Table 1
Coefficient estimate, its standard error, and a confidence interval with DML and regsDML on the AJR dataset, where K = 2 and S = 100 in Algorithm 1, and where the conditional expectations are estimated with random forests consisting of 1000 trees that have a minimal node size of 5.

            Estimate of β0    Standard error    Confidence interval for β0
DML         0.739             0.459             [−0.161, 1.639]
regsDML     0.688             0.229             [0.239, 1.136]

We choose K = 2 and S = 100 in Algorithm 1 and compute the conditional


expectations with random forests with 1000 trees that have a minimal node
size of 5. The estimation results are displayed in Table 1. This table gives the
estimated linear coefficient, its standard deviation, and a confidence interval for
β0 for DML and regsDML. The coefficient estimate of DML is not significant
because the respective confidence interval includes 0. The regsDML estimate is
significant because it has a smaller standard deviation than the DML estimate.
Note that the coefficient estimate of regsDML falls within the DML confidence
interval.

The AJR dataset has also been analyzed in Chernozhukov et al. [31]. They also estimate conditional expectations with random forests consisting of 1000 trees that have a minimal node size of 5 but implicitly assume an additional homoscedasticity condition for the errors R_Y − R_X^T β0; see Chernozhukov et al. [30]. Such a homoscedastic error assumption is questionable though. Their procedure leads to a smaller estimate of the standard deviation of DML than what we obtain.

6. Conclusion

We extended and regularized double machine learning (DML) in potentially overidentified partially linear models (PLMs) with hidden variables. Our goal was to estimate the linear coefficient β0 of the PLM. Hidden variables confound the observables, which can cause endogeneity. For instance, a clinical study may experience an endogeneity issue if a treatment is not randomly assigned and subjects receiving different treatments differ in other ways than the treatment [73]. In such situations, employing estimation methods that do not account for endogeneity leads to biased estimators [44].

Our contribution was twofold. First, we formulated the PLM as a structural equation model (SEM) and imposed an identifiability condition on it to recover the population parameter β0. We estimated β0 using DML similarly to Chernozhukov et al. [31]. However, our setting is more general than the one considered in Chernozhukov et al. [31] because we allow the predictors to be multivariate, and we impose a moment condition instead of restricting conditional moments. The DML estimation procedure allows biased estimators of additional nuisance functions to be plugged into the estimating equation of β0. The resulting estimator of β0 is asymptotically Gaussian and converges at the parametric rate of N^{-1/2}. However, DML has a two-stage least squares (TSLS) interpretation and may therefore lead to overly wide confidence intervals.


Second, we proposed a regularization-only DML scheme, regDML, and a regularization-selection DML scheme, regsDML. The latter has shorter confidence intervals by construction because it selects between DML and regDML depending on whose estimated standard deviation is smaller. Although regsDML and plain DML are asymptotically equivalent, regsDML leads to drastically shorter confidence intervals for finite sample sizes. Nevertheless, coverage guarantees for β0 remain. The regDML estimator is similar to k-class estimation [93] and anchor regression [82, 23, 54] but allows potentially complex partially linear models and chooses a data-driven regularization parameter.

Empirical examples demonstrated our methodological and theoretical developments. The results showed that regsDML is a highly effective method to increase the power and sharpness of statistical inference. The DML estimator has a TSLS interpretation. Therefore, if the confounding is strong, the DML estimator leads to overly wide confidence intervals and can be substantially biased. In such a case, regsDML drastically reduces the width of the confidence intervals but may inherit additional bias from DML. This effect can be particularly pronounced for small sample sizes. Section F in the appendix presents examples with strong and reduced confounding and demonstrates the coverage behavior of DML and regsDML. Section E in the appendix analyzes the performance of our methods if the strength of the link A → X varies and investigates the bias-variance tradeoff of the respective estimated quantities for different values of the regularization parameter.

Although a wide range of machine learners can be employed to estimate the nuisance functions, we observed that additive splines can produce more precise results than random forests if the underlying structure is additive in good approximation. This effect is particularly pronounced if the sample size is small. If such a finding is to be expected, it may be worthwhile to use structured models rather than “general” machine learning algorithms, especially with small or moderate sample size. Our regsDML methodology can be used with the implementation that is available in the R-package dmlalg [40].

Appendix A: An example where the identifiability condition (5)


holds, but conditional moment requirements do not

This section presents an SEM where our identifiability condition (5) holds, but
where the conditional moment requirements of Chernozhukov et al. [31] do not.
Let d = 1 = q in this section (justidentified case), and assume the model

Y ← Xβ0 + gY (W ) + hY (H) + εY

given in (3) and the identifiability condition E[RA (RY −RX β0 )] = 0 given in (5).
6484 C. Emmenegger and P. Bühlmann

Chernozhukov et al. [31] assume the model

Y = Xβ0 + gY (W ) + U, A = gA (W ) + V (17)

for unknown functions gY and gA and impose the conditional moment restric-
tions
E[U |A, W ] = 0 and E[V |W ] = 0 (18)

on the error terms.


Model (17) and the conditional moment restrictions (18) imply the identifi-
ability condition (5) due to
     
E RA (RY − RX β0 ) = E A − gA (W ) U = E A − gA (W ) E[U |A, W ] = 0.

However, the reverse direction does not hold. A counterexample is presented


in Figure 6 where W directly affects H. This SEM satisfies the identifiability
condition (5) because A is independent of H conditional on W , but it does not
satisfy E[U |W, A] = 0 because we have

E[U |A, W ] = E[H + εY |A, W ] = E[H|W ] = E[W + εH |W ] = W

due to A ⊥⊥ H|W and (εY , εH ) ⊥


⊥ (W, A). We have A ⊥⊥ H|W because all paths
from A to H are blocked by W . The path A → X ← H is blocked by the
empty set because X is a collider on this path. The path A → X → Y ← H
is blocked by the empty set because Y is a collider on this path. The path
A → X → Y ← W → H is blocked by W . The paths A → X → W → Y ← H
and A → X → W → H are also blocked by W .

Fig 6. An SEM and its associated causal graph.

Appendix B: DML1 estimators

The DML1 estimators are less preferred than the DML2 estimators we proposed
to use in the main text, but for completeness we provide the definitions in this
section.
Regularizing DML in endogenous PLMs 6485

B.1. DML1 estimator of β0

The DML1 estimator of β0 is given by

1  Ik
K
β̂ DML1 := β̂ ,
K
k=1

where  −1
Ik T Ik Ik T Ik
β̂ Ik := RX ΠRIk RX RX ΠRIk RY , (19)
A A

Ik Ik Ik −1 Ik
and where we recall the projection matrix ΠRIk = RA (RA )T RA (R A )T
A
Ik Ik
defined in (8). The estimator β̂ Ik is the TSLS estimator of RY on RX using
Ik
the instrument RA .

B.2. DML1 estimator of bγ

The DML1 estimator of bγ is given by

1  γ
K
b̂γ,DML1 := b̂k , (20)
K
k=1

where
 2  2 
 Ik Ik T   Ik Ik T 
b̂γk := arg min  1 − ΠRIk RY − RX b  + γ ΠRIk RY − RX b  .
b∈Rd A 2 A 2

This estimator can be expressed in closed form by


 Ik T Ik
−1 Ik T Ik
b̂γk = RX
 RX
 RX
 RY ,

where we recall the notation


Ik
 √  Ik Ik
 √  Ik
RX  = 1 + ( γ − 1)Π Ik RX and RY = 1 + ( γ − 1)ΠRIk RY
R A A

Ik
as in (15). The computation of b̂γk is an OLS scheme where RY is regressed on
Ik
RX
 .

Appendix C: SEM of Figure 3

The data from the simulation displayed in Figure 3 come from the following
SEM. Let the dimension of W be v = 20. Let R be the upper triangular matrix
of the Cholesky decomposition of the Toeplitz matrix whose first row is given
by (1, 0.7, 0.72 , . . . , 0.719 ). The SEM we consider is given by
6486 C. Emmenegger and P. Bühlmann

(εA , εW , εH , εX , εY ) ∼ N24 (0, 1)


H ← εH
W ← εW R
eW 1
A ← 1+e W1 + W 2 + W 3 + ε A
eW 3
X ← 2A + W1 + 0.25 · 1+e W3 + H + ε X
eW 1
Y ← X + 1+eW1 + 0.25W3 + H + εY .

Appendix D: Additional numerical results

If we say in this section that the nuisance parameters are estimated with ad-
1
ditive splines, they are estimated with additive cubic B-splines with N 5 + 2
degrees of freedom, where N denotes the sample size of the data. If we say in this
section that the nuisance parameters are estimated with random forests, they
are estimated with random forests consisting of 500 trees that have a minimal
node size of 5.

Figure 7 and 8 illustrate the simulation results with β0 = 0 of the examples


presented in Figure 2 and 5 in Section 1.1 and 5.1, respectively. The coverage
and length of the scaled confidence intervals are similar to the results obtained
for β0 = 0. Instead of the power as in Figure 2 and 5, Figure 7 and 8 illustrate
the type I error.

In Figure 7, DML achieves a type I error of 0 or close to 0 over all sample


sizes considered. The regsDML method achieves a type I error that is closer to
the gray line indicating the 5% level. The dashed lines represent 95% confidence
regions. The type I error of regsDML is higher than the type I error of DML be-
cause the regsDML confidence intervals are considerably shorter than the DML
ones. The right plot in Figure 7 indicates that the length of the confidence inter-
vals of regsDML is around 10% − 30% the length of DML’s. Although regsDML
greatly reduces the confidence interval length, the type I error confidence bands
include the 5% level or are below it. This means that although regsDML is a
regularized version of DML, it does not incur an overlarge bias.

In Figure 8, the type I errors of both DML and regsDML are similar. The
95% confidence regions of both estimators, which are represented by dashed
lines, include the 5% level or are below it. The right plot in Figure 8 illustrates
that the regsDML confidence intervals are around 50% − 80% the length of
DML’s. Nevertheless, its type I error does not exceed the 95% level.

Appendix E: Weak A → X and bias-variance tradeoff

First, we analyze the behavior of our methods for varying strength from A to
X. For N = 200, we consider the coverage and length of the confidence intervals
for varying strength from A to X for the same settings as in Figure 2 and 5.
Regularizing DML in endogenous PLMs 6487

Fig 7. The results come from M = 1000 simulation runs each from the SEM in Figure 1
with β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1.
The nuisance functions are estimated with additive splines. The figure displays the coverage
of two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines as well as the red and orange ones are indistinguishable in the left panel.

Figure 9 illustrates the results for data from the SEM from Figure 2. We vary
the strength of the direct link A → X and denote it by α in Figure 9. Figure 10
illustrates the results for data from the SEM from Figure 5. We leave the link
A2 → X as it is and only vary the strength of the direct link A1 → X, which we
denote by α in Figure 10. In both Figure 9 and 10, the coverage remains high
for all considered methods. If α becomes larger in absolute value, the confidence
intervals become shorter, which leads to a coverage that is closer to the nominal
95% level, especially in Figure 10. The regsDML method yields the shortest
confidence intervals in both figures.
Second, we analyze the bias-variance tradeoff of the respective estimated
quantities of the regularized methods. We again choose the sample size N = 200
and consider the same settings as in Figure 2 and 5. The results are summarized
in Figure 11 and 12 that display the estimated MSE, estimated variance, and
estimated squared bias as used in Equation (16). The MSE in both figures is
mainly driven by the variance, and regsDML achieves a considerable variance
reduction compared to the TSLS-type DML estimator.

Appendix F: Confounding and its mitigation

If we say in this section that the nuisance parameters are estimated with ad-
1
ditive splines, they are estimated with additive cubic B-splines with N 5 + 2
6488 C. Emmenegger and P. Bühlmann

Fig 8. The results come from M = 1000 simulation runs from the SEM in Figure 4 with
β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1. The
nuisance functions are estimated with random forests. The figure displays the coverage of
two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines as well as the red and orange ones are indistinguishable in the left panel.

Fig 9. Same setting as in Figure 2, but with N = 200 only. The strength of the direct link
A → X varies and is denoted by α. We considered the α-values −e−20 , −e−15 , −e−10 , −e−5 ,
−e−1 , −e−0.75 , −e−0.5 , −e−0.25 , and −e0 .

degrees of freedom, where N denotes the sample size of the data.

We consider models where the DML and the regsDML methods do not work
well in terms of coverage of β0 . We present possible explanations of these failures
and illustrate model changes to overcome them. The first model in Section F.1
features a strong confounding effect H → X, the second model in Section F.2
features an effect with noise in W → H, and the third model in Section F.3
features an effect with noise in H → W .
Regularizing DML in endogenous PLMs 6489

Fig 10. Same setting as in Figure 5, but with N = 200 only. The strength of the direct link
A1 → X varies and is denoted by α. We considered the α-values e−20 , e−15 , e−10 , e−5 , e−1 ,
e−0.75 , e−0.5 , e−0.25 , and e0 .

Fig 11. Estimated MSE, estimated variance, and estimated squared bias as used in Equa-
tion (16) for the same setting as in Figure 2, but with N = 200 only. The black solid line
displays the median of the respective quantity over the considered range of γ-values for b̂γ . The
yellow area marks the observed 25% and 75% quantiles. All methods incorporate an additional
variance adjustment from the S repetitions according to Algorithm 1. Boxplots illustrate the
performance of the TSLS and the regularized methods. The position of the boxplots is not
linked to the γ-values on the x-axis.

F.1. Strong confounding effect H → X

If the hidden variable H is strongly confounded with X, the resulting TSLS-type


DML estimator can be substantially biased depending on the choice of functions
in the model. If the estimated variances are not large enough, the coverage of
the resulting confidence intervals for β0 can be too low. This issue is illustrated
in Figure 14.
The regsDML estimator mimics the bias behavior of DML because the DML
estimator is used as a replacement of β0 in the MSE objective function that de-
fines the estimated regularization parameter of regDML in (16). The confidence
intervals of regsDML are shorter than the DML ones, but both are computed
with a similarly biased coefficient estimate of β0 . Therefore, the coverage of the
confidence intervals of regsDML is even worse than the one of DML.
The coverages of both DML and regsDML are considerably improved if the
confounding strength is reduced; see Figure 15.
6490 C. Emmenegger and P. Bühlmann

Fig 12. Estimated MSE, estimated variance, and estimated squared bias as used in Equa-
tion (16) for the same setting as in Figure 5, but with N = 200 only. The black solid line
displays the median of the respective quantity over the considered range of γ-values for b̂γ . The
yellow area marks the observed 25% and 75% quantiles. All methods incorporate an additional
variance adjustment from the S repetitions according to Algorithm 1. Boxplots illustrate the
performance of the TSLS and the regularized methods. The position of the boxplots is not
linked to the γ-values on the x-axis.

Fig 13. An SEM and its associated causal graph.

F.2. Noise in W → H

The variable W may have a direct effect on H. If this link is strong enough with
respect to the additional noise εH of H, it is possible to obtain some information
of H by observing W . This can reduce the overall level of confounding present
depending on the choice of functions in the model.
Simulation results where W explains only part of the variation in H are
presented in Figure 17. The confidence intervals of both DML and regsDML
do not attain a 95% coverage for small sample sizes N . The situation can be
considerably improved by reducing the variation of H that is not explained by
W ; see Figure 18.

F.3. Noise in H → W

The variable H may have a direct effect on W . If this link is strong enough
with respect to the additional noise εW of W , it is possible to obtain some
information of H by observing W similarly to Section F.2. The results again
Regularizing DML in endogenous PLMs 6491

Fig 14. The results come from M = 1000 simulation runs from the SEM in Figure 13 with
χ = 15 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1.
The nuisance functions are estimated with additive splines. The figure displays the coverage
of two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines are indistinguishable in the left panel.

depend on the choice of functions in the model.


Figure 20 presents simulation results where H explains only little variation
of W compared with εW . The confidence intervals of regsDML do not attain a
95% coverage for small sample sizes N because the estimator inherits additional
bias from DML. The situation can be improved by reducing the variation of W
that is not explained by H; see Figure 21.

Appendix G: Examples where the identifiability condition (5) does


and does not hold

The following examples illustrate SEMs where the identifiability condition (5)
holds and where it fails to hold. We argue using causal graphs; see Lauritzen
[59], Pearl [74, 76, 77], Peters, Janzing and Schölkopf [78], and Maathuis et al.
[64]. By convention, we omit error variables in a causal graph if they are assumed
to be mutually independent [76].
Example G.1. Consider the SEM of the 1-dimensional variables A, W , H,
X, and Y and its associated causal graph given in Figure 22, where β0 is a
fixed unknown parameter, and where aW , aX , gY , gH , hX , and hY are some
appropriate functions. The variable A directly influences W , and W directly
influences the hidden variable H. The variable A is independent of H given W
because every path from A to H is blocked by W .
6492 C. Emmenegger and P. Bühlmann

Fig 15. The results come from M = 1000 simulation runs from the SEM in Figure 13 with
χ = 1 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1.
The nuisance functions are estimated with additive splines. The figure displays the coverage
of two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines are indistinguishable in the left panel.

Fig 16. An SEM and its associated causal graph.

Proof of Example G.1. The path A → X ← H is blocked by the empty set


because X is a collider on this path. The paths A → · · · → Y ← H are blocked
by the empty set because Y is a collider on these paths. The path A → W → H
is blocked by W .
The variable A is exogenous in Example G.1. In general, this is no require-
ment; see Example G.2.
Example G.2. Consider the SEM of the 1-dimensional variables H, W , A,
X, and Y and its associated causal graph given in Figure 23, where β0 is a
fixed unknown parameter, and where aX , gA , gX , gY , hX , hW , and hY are
some appropriate functions. The variable A is not a source node. The hidden
Regularizing DML in endogenous PLMs 6493

Fig 17. The results come from M = 1000 simulation runs from the SEM in Figure 16 with
κ = 2 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1.
The nuisance functions are estimated with additive splines. The figure displays the coverage
of two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines are indistinguishable in the left panel.

variable H directly influences W , and W directly influences A. The variable A


is independent of H given W because every path from A to H is blocked by W .
Proof of Example G.2. The path A → X ← H is blocked by the empty set
because X is a collider on this path. The paths A → X → · · · → Y ← H are
blocked by the empty set because Y is a collider on these paths. The paths
A ← W → Y ← X ← H, A ← W ← H, and A → X ← W ← H are blocked
by W . The path A ← W → Y ← H is blocked by W or alternatively by the
empty set because Y is a collider on this path. The path A ← W → X ← H is
blocked by W or alternatively by the empty set because X is a collider on this
path.

Identifiability of β0 is not guaranteed if A and H are independent. An illus-


tration is given in Example G.3. Considering the instrument A instead of RA
in Theorem 2.1 cannot solve the issue. In such a situation, stronger structural
assumptions are required.
Example G.3. Consider the SEM of the 1-dimensional variables H, A, W ,
X, and Y and its associated causal graph given in Figure 24, where β0 is a
fixed unknown parameter. Although A and H are independent, the identifiability
condition (5) does not hold.
Proof of Example G.3. The two random variables A and H are independent
6494 C. Emmenegger and P. Bühlmann

Fig 18. The results come from M = 1000 simulation runs from the SEM in Figure 16 with
κ = 0.25 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in
Algorithm 1. The nuisance functions are estimated with additive splines. The figure displays
the coverage of two-sided confidence intervals for β0 , type I error for two-sided testing of
the hypothesis H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML
(red), regDML (blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4)
(cyan), where all results are at level 95%, and where the nuisance functions are estimated
with additive splines. At each sample size N , the lengths of the confidence intervals are scaled
with the median length from DML. The shaded regions in the coverage and the type I error
plots represent 95% confidence bands with respect to the M simulation runs. The blue and
green lines are indistinguishable in the left panel.

Fig 19. An SEM and its associated causal graph.

because the path A → W ← H is not blocked by W . Indeed, W is a collider on


this path.
All random variables are 1-dimensional. Therefore, the representation of β0
in Theorem 2.1 is equivalent to the identifiability condition

E[RA (RY − RX β0 )] = 0

in Equation (5). However, the identifiability condition does not hold in the
Regularizing DML in endogenous PLMs 6495

Fig 20. The results come from M = 1000 simulation runs from the SEM in Figure 19 with
κ = 1 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1.
The nuisance functions are estimated with additive splines. The figure displays the coverage
of two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines are indistinguishable in the left panel.

present situation. We have

E[RA (RY − RX β0 )] 
= E[R
 A H + εY − E[H + εY |W ]
= E RA H − E[H|W ]

because εY is independent of A and W and centered. By the tower property for


conditional expectations, we have
 
E[RA (RY − RX β0 )] = E AH − A E[H|W ] .

Because A and H are independent and centered, we have E[AH] = 0. Moreover,


we have H ∼ N (0, 1), W ∼ N (0, 3), and (W |H = h) ∼ N (h, 2). The conditional
distribution of H|W = w can be obtained by applying Bayes’ theorem and is
given by N ( 13 w, 23 ). Hence, we have E[H|W ] = 13 W and

  1 1   1
E A E[H|W ] = E[AW ] = E A2 = = 0
3 3 3

because A is independent of H and εW . Therefore, we have E[RA (RY −RX β0 )] =


0, and β0 cannot be represented as in Theorem 2.1.
6496 C. Emmenegger and P. Bühlmann

Fig 21. The results come from M = 1000 simulation runs from the SEM in Figure 19 with
κ = 0.25 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in
Algorithm 1. The nuisance functions are estimated with additive splines. The figure displays
the coverage of two-sided confidence intervals for β0 , type I error for two-sided testing of the
hypothesis H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red),
regDML (blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan),
where all results are at level 95%. At each sample size N , the lengths of the confidence intervals
are scaled with the median length from DML. The shaded regions in the coverage and the type
I error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines are indistinguishable in the left panel.

Fig 22. An SEM satisfying the identifiability condition (5) and its associated causal graph as
in Example G.1.

Appendix H: Proofs of Section 2

Proof of Theorem 2.1. To prove the theorem, we need to verify that the repre-
sentation
      −1    
T −1 T −1
β0 = E RX RA T
E RA RA E RA RX T
E RX RAT
E RA RA E[RA RY ]

holds. This statement is equivalent to


   
T −1
 
0 = E RX RA
T
E RA RA E RA RY − RX
T
β0 ) .
Regularizing DML in endogenous PLMs 6497

Fig 23. An SEM satisfying the identifiability condition (5) and its associated causal graph as
in Example G.2.

Fig 24. An SEM not satisfying the identifiability condition (5) together with its associated
causal graph as in Example G.3

This last statement holds because E[RA (RY − RX


T
β0 )] equals 0 due to the iden-
tifiability condition (5).

Appendix I: Proofs of Section 3

We denote by · either the Euclidean norm for a vector or the operator norm
for a matrix.

Proof of Proposition 3.3. We have


  
∂ 
∂r  EP ψ S; β0 , η 0 + r(η − η 0 )
r=0  
∂ 
= ∂r  E P A − m0A (W ) − r mA (W ) − m0A (W )
r=0

· Y − m0Y (W ) − r mY (W ) − m0Y (W )
 T 
− X − m0X (W ) − r mX (W ) − m0X (W ) β0
  
T
= EP − mA (W ) − m0A (W ) Y − m0Y (W ) − X − m0X (W ) β0

+ A − m0A (W ) − mY (W ) − m0Y (W )

T
+ mX (W ) − m0X (W ) β0 .
6498 C. Emmenegger and P. Bühlmann

Subsequently, we show that both terms


  
T
EP mA (W ) − m0A (W ) Y − m0Y (W ) − X − m0X (W ) β0 (21)

and
  
T
EP A − m0A (W ) − mY (W ) − m0Y (W ) + mX (W ) − m0X (W ) β0 (22)

are equal to 0. We first consider the term (21). Recall the notations m0Y (W ) =
EP [Y |W ] and m0X (W ) = EP [X|W ]. We have
  
T
EP mA (W ) − m0A (W ) Y − m0Y (W ) − X − m0X (W ) β0
   
= EP mA (W ) − m0A (W ) EP Y − EP [Y |W ] − (X − EP [X|W ])T β0 W
= 0.
Next, we verify that the term given in (22) vanishes. Recall that we denote
m0A (W ) = EP [A|W ]. We have
  
T
EP A − m0A (W ) − mY (W ) − m0Y (W ) + mX (W ) − m0X (W ) β0
   
= EP EP A − E[A|W ]W − mY (W ) − m0Y (W )

T
+ mX (W ) − m0X (W ) β0
= 0.
Because both terms (21) and (22) vanish, we conclude
∂   
 EP ψ S; β0 , η 0 + r(η − η 0 ) = 0.
∂r r=0
Definition I.1. Consider a set T of nuisance functions. For S = (A, X, W, Y ),
an element η = (mA , mX , mY ) ∈ T , and β ∈ Rd , we introduce the score func-
tions
 
 β, η) := X − mX (W ) Y − mY (W ) − X − mX (W ) T β ,
ψ(S, (23)

and
T
ψ1 (S, η) := X − mX (W ) A − mA (W ) ,
T
ψ2 (S, η) := A − mA (W ) A − mA (W ) ,
T
ψ3 (S, η) := X − mX (W ) X − mX (W ) .
Furthermore, let the matrices
D1 := EP [ψ3 (S; η 0 )],  
D2 := EP [ψ1 (S; η 0 )] EP [ψ2 (S; η 0 )]−1 EP ψ1T (S; η 0 ) ,
D3 := EP [ψ1 (S; η 0 )] EP [ψ2 (S; η 0 )]−1 ,
D5 := EP [ψ2 (S; η 0 )]−1 EP [ψ(S; bγ , η 0 )],
J0 := D2−1D3 ,   
J˜0 := EP ψ(S; β0 , η 0 )ψ T (S; β0 , η 0 ) = EP RA RA T
(RY − RX
T
β0 )2 ,
J0 := EP [R T
 A RA ], T   −1  
J0 := EP RX (RA ) (J0 ) EP RA (RX )T
Regularizing DML in endogenous PLMs 6499

and the variance-covariance matrix σ 2 := J0 J˜0 J0T . Moreover, let the score func-
tion
−1
ψ(·; β0 , η 0 ) := σ −1 J˜0 2 ψ(·; β0 , η 0 ).
Definition I.2. Let γ ≥ 0. Consider a realization set T of nuisance functions.
Define the statistical rates
4
rN := max sup EP [ψ(S; b0 , η) − ψ(S; b0 , η 0 )]
S=(U,V,W,Z)∈{A,X,Y }2 ×{W }×{A,X,Y }, η∈T
b0 ∈{bγ ,β0 ,0}

and
 2  
λN := max sup ∂r EP ϕ S; b0 , η 0 + r(η − η 0 ) ,
 2 }, r∈(0,1),η∈T
ϕ∈{ψ,ψ,ψ
b0 ∈{bγ ,β0 ,0}

where we interpret ψ2 S; b0 , η 0 + r(η − η 0 ) as ψ2 S; η 0 + r(η − η 0 ) in the


definition of λN .
Remark I.3. We would like to remark that the respective definition of the
statistical rate rN given in Chernozhukov et al. [31] involves the L2 -norm of
ψ(S; b0 , η) − ψ(S; b0 , η 0 ) instead of its L1 -norm. However, it is essential to em-
ploy the L1 -norm to ensure that Assumption I.5.5 can constrain the L2 -norm of
the estimation errors incurred by the ML estimators of the nuisance parameters.
Thus, we do not have to constrain their higher order errors to employ Hölder’s
inequality in Lemma I.16.
Definition I.4. Let the nonrandom numbers
 
p −1,− 2
1 4 1
ρN := rN + N 2 λN and ρ̃N := N max + rN .

If not stated otherwise, we assume the following Assumption I.5 in all the
results presented in the appendix.
Assumptions I.5. Let γ ≥ 0. Let K ≥ 2 be a fixed integer independent of
N . We assume that N ≥ K holds. Let {δN }N ≥K and {ΔN }N ≥K be two se-
1
quences of positive numbers that converge to zero, where δN4 ≥ N − 2 holds. Let
1

{PN }N ≥1 be a sequence of sets of probability distributions P of the quadruple


S = (A, W, X, Y ).
Let p > 4. For all N , for all P ∈ PN , consider a nuisance function realization
set T such that the following conditions hold:
I.5.1 We have an SEM given by (3) that satisfies the identifiability conditon (5).
I.5.2 There exists a finite real constant C1 satisfying AP,p +XP,p +Y P,p ≤
C1 .
I.5.3 The matrix EP [RX RA T
] ∈ Rd×q has full rank d. This in particular requires
q ≥ d. The matrices D1 ∈ Rd×d and J0 ∈ Rq×q are invertible. Further-
more, the smallest and largest singular values of the symmetric matrices
J0 and J0 are bounded away from 0 by c1 > 0 and are bounded away from
+∞ by c2 < ∞.
6500 C. Emmenegger and P. Bühlmann

I.5.4 The symmetric matrices J˜0 , D1 + (γ − 1)D2 , and D4 are invertible, where
D4 is introduced in Definition J.1 in the appendix in Section J. The small-
est and largest singular values of these matrices are bounded away from 0
by c3 and are bounded away from +∞ by c4 .
I.5.5 The set T consists of P -integrable functions η = (mA , mX , mY ) whose pth
moment exists and it contains η 0 . There exists a finite real constant C2
such that
η 0 − ηP,p ≤ C2 , η 0 − ηP,2 ≤ δN ,
m0A (W ) − mA (W )2P,2 ≤ δN N − 2 ,
1

m0X (W ) − mX (W )P,2
· m0Y (W ) − mY (W )P,2 + m0X (W ) − mX (W )P,2 ≤ δN N − 2 ,
1

mA (W ) − mA (W )P,2
0

· m0Y (W ) − mY (W )P,2 + m0X (W ) − mX (W )P,2 ≤ δN N − 2


1

hold for all elements η of T . Given a partition I1 , . . . , IK of [N ] where


each Ik is of size n = KN
, for all k ∈ [K], the nuisance parameter estimate
Ikc Ikc
η̂ = η̂ ({Si }i∈Ikc ) satisfies
c c
η 0 − η̂ Ik P,p ≤ C2 , η 0 − η̂ Ik P,2 ≤ δN ,
Ic
m0A (W ) − m̂Ak (W )2P,2 ≤ δN N − 2 ,
1

Ic
m0X (W ) − m̂Xk (W )P,2
Ic Ic
· m0Y (W ) − m̂Yk (W )P,2 + m0X (W ) − m̂Xk (W )P,2 ≤ δN N − 2 ,
1

c
I
m0A (W ) − m̂Ak (W )P,2
Ic Ic
· m0Y (W ) − m̂Yk (W )P,2 + m0X (W ) − m̂Xk (W )P,2 ≤ δN N − 2
1

with P -probability no less than 1 − ΔN . Denote by EN the event that


c c
η̂ Ik = η̂ Ik ({Si }i∈Ikc ) belongs to T , and assume that this event holds with
P -probability no less than 1 − ΔN .
For instance, invertibility of the square matrices EP [RA RA T
] and J˜0 is satisfied
if εY is independent of both A and W and has a strictly positive variance.
Remark I.6. It is possible to drop some of the assumptions in Assumption I.5
if we are interested in proving the results about DML only. The full assumption
is required to prove the results about DML and the regularized methods.
Lemma I.7. Let u ≥ 1. Consider a t-dimensional random variable Z. Denote
the joint law of Z and W by P . Then we have
Z − EP [Z|W ]P,u ≤ 2ZP,u .
Proof of Lemma I.7. Because the Euclidean norm to the uth power is convex
for u ≥ 1, we have
EP[Z|W ]uP,u 
= EP EP [Z|W ]u 
≤ EP EP [Zu |W ]
= EP [Zu ]
= ZuP,u
Regularizing DML in endogenous PLMs 6501

by Jensen’s inequality. We hence have

Z − EP [Z|W ]P,u ≤ ZP,u + EP [Z|W ]P,u ≤ 2ZP,u

by the triangle inequality.


Lemma I.8. Consider a t-dimensional random variable Z. Denote the joint
law of Z and W by P . Then we have
  
 EP ZZ T − EP [Z|W ] EP [Z T |W ]  ≤ 2Z2P,2 .

Proof of Lemma I.8. Because the Euclidean norm is convex, we have


  
 EP ZZ T − EP [Z|W ] EP [Z T |W ] 
 
≤ EP ZZ T  + EP [Z|W ]EP [Z T |W ]
≤ EP Z2 + EP [Z|W ]2

by Jensen’s inequality, the triangle inequality, and the Cauchy–Schwarz inequal-


ity. Because the squared Euclidean norm is convex, we have
  
EP [Z|W ]2 ≤ EP Z2 W

by Jensen’s inequality. Therefore, we have


  
 EP ZZ T − EP [Z|W ] EP [Z T |W ] 
 
≤ EP Z2 + EP [Z|W ]2 
≤ EP Z2 + EP [Z2 |W ]
= 2Z2P,2 .

Lemma I.9. Let a t1 -dimensional random variable Z1 and a t2 -dimensional


random variable Z2 . Denote the joint law of Z1 , Z2 , and W by P . Then we
have
  
 EP (Z1 − EP [Z1 |W ])(Z2 − EP [Z2 |W ])T 2 ≤ Z1 2P,2 Z2 2P,2 .

Proof of Lemma I.9. By the Cauchy–Schwarz inequality, we have


  
 EP (Z1 − EP [Z1 |W ])(Z2 − EP [Z2 |W ])T 2
   
≤ EP (Z1 − EP [Z1 |W ])2 EP (Z2 − EP [Z2 |W ])2 .

Because the conditional expectation minimizes the mean squared error [39, The-
orem 5.1.8], we have
 
EP (Z1 − EP [Z1 |W ])2 ≤ Z1 2P,2

and  
EP (Z2 − EP [Z2 |W ])2 ≤ Z2 2P,2 .
In total, we thus have
  
 EP (Z1 − EP [Z1 |W ])(Z2 − EP [Z2 |W ])T 2 ≤ Z1 2P,2 Z2 2P,2 .
6502 C. Emmenegger and P. Bühlmann

Lemma I.10. Let a t1 -dimensional random variable Z1 and a t2 -dimensional


random variable Z2 . Denote the joint law of Z1 , Z2 , and W by P . Then we have
  
 EP (Z1 − EP [Z1 |W ])Z2T 2 ≤ Z1 2P,2 Z2 2P,2 .

Proof of Lemma I.10. By the Cauchy–Schwarz inequality, we have


      
 EP (Z1 − EP [Z1 |W ])Z2T 2 ≤ EP Z1 − EP [Z1 |W ]2 EP Z2 2 .

Because the conditional expectation minimizes the mean squared error [39, The-
orem 5.1.8], we have
   
EP Z1 − EP [Z1 |W ]2 ≤ EP Z1 2 = Z1 2P,2 .

Consequently,
  
 EP (Z1 − EP [Z1 |W ])Z2T 2 ≤ Z1 2P,2 Z2 2P,2

holds.
Lemma I.11. Let a, b ∈ R be two numbers. We have

(a + b)2 ≤ 2a2 + 2b2 . (24)

Proof of Lemma I.11. The true statement 0 ≤ (a − b)2 is equivalent to (24).


The following lemma proved in Chernozhukov et al. [31] states that condi-
tional convergence in probability implies unconditional convergence in proba-
bility.
Lemma I.12 (Based on Chernozhukov et al. [31, Lemma 6.1].). Let {Xt }t≥1
and {Yt }t≥1 be sequences of random vectors, and let u ≥ 1. Consider a determin-
istic sequence {εt }t≥1 with εt → 0 as t → ∞ such that we have E[Xt u |Yt ] ≤ εut .
Then we have Xt  = OP (εt ) unconditionally, meaning that that for any se-
quence {t }t≥1 with t → ∞ as t → ∞, we have P (Xt  > t εt ) → 0.
Proof of Lemma I.12. We have
 
E E[Xt u |Yt ] 1
P (Xt  > t εt ) = E[P (Xt  > t εt |Yt )] ≤ u u ≤ u → 0 (t → ∞)
 t εt t

by Markov’s inequality.
Lemma I.13. There exists a finite real constant C3 satisfying β0  ≤ C3 .
Proof of Lemma I.13. Recall the matrices J0 and J0 in Definition I.1. We have

β0  
     
≤ (J0 )−1  EP ARX T   −1 
(J0 ) EP ARY 
≤ c12 XP,2 Y P,2 A2P,2
2
Regularizing DML in endogenous PLMs 6503

by submultiplicativity, Assumption I.5.3, and Lemma I.10. We hence infer

1 4
β0  ≤ C
c22 1

by Assumption I.5.2.

Lemma I.14. Let γ ≥ 0. There exists a finite real constant C4 satisfying bγ  ≤
C4 .

Proof of Lemma I.14. We have

bγ 
        −1 
 T −1 
≤  EP RX RX T
+ (γ − 1) EP RX RA
T
EP RA RA EP RA RXT

     
 −1 
· EP [RX RY ] + (γ − 1) EP RX RA
T
EP RA RA
T
EP [RA RY ]

by submultiplicativity. By Assumption I.5.4, the largest singular value of the


matrix
     
T −1
 
D1 + (γ − 1)D2 = EP RX RX
T
+ (γ − 1) EP RX RA
T
EP RA RA EP RA RX
T

is upper bounded by 0 < c4 < ∞. Thus, we have



bγ  ≤ 1
EP [RX RY ]
   
c4
  T 
 T −1 

+|γ − 1| EP RX RA  EP RA RA  EP RA RYT 

by the triangle inequality and submultiplicativity. By Assumption I.5.3, the


largest singular value of EP [RA RA
T
] is upper bounded by 0 < c2 < ∞. By
Lemma I.9 and Assumption I.5.2, we have
  
 EP 
RX RYT  ≤ XP,2 Y P,2 ≤ C21 ,
2

 
 EP RX RTA ≤ XP,2 AP,2 ≤ C21 ,
 EP RA RY  ≤ AP,2 Y P,2 ≤ C1 .

In total, we hence have



1 C14
b  ≤
γ
C1 + |γ − 1|
2
.
c4 c2

Lemma I.15. Let γ ≥ 0 The statistical rates rN and λN introduced in Defini-


4
tion I.2 satisfy rN  δN and λN  √δNN .

Proof of Lemma I.15. This proof is modified from Chernozhukov et al. [31].
First, we verify the bound on rN . Let S = (U, V, W, Z) ∈ {A, X, Y }2 × {W } ×
6504 C. Emmenegger and P. Bühlmann

{A, X, Y }, let η = (mU , mV , mZ ) ∈ T , and let b0 ∈ {bγ , β0 , 0}. We have

ψ(S; b0 , η) − ψ(S; b0 , η 0 )
 T
T
= U − mU (W ) Z − mZ (W ) − V − mV (W ) b0
 T
T
− U − m0U (W ) Z − m0Z (W ) − V − m0V (W ) b0
 T
T
= U − m0U (W ) m0Z (W ) − mZ (W ) − m0V (W ) − mV (W ) b0
 T
T
+ m0U (W ) − mU (W ) Z − m0Z (W ) − V − m0V (W ) b0
+ m0U (W ) − mU (W )
 T
T
· m0Z (W ) − mZ (W ) − m0V (W ) − mV (W ) b0 .

By the triangle inequality and Hölder’s inequality, we have

EP [ψ(S; b0 , η) − ψ(S; b0 , η 0 )]


= ψ(S; b0 , η) − ψ(S; 0 0
 b , η )P,1 
 T 
≤ U − m0U (W )P,2 m0Z (W ) − mZ (W ) − m0V (W ) − mV (W ) b0 
  P,2
 T 
+m0U (W ) − mU (W )P,2 Z − m0Z (W ) − V − m0V (W ) b0 
P,2
+m0U(W ) − mU (W )P,2 
 T 0
·m0Z (W ) − mZ (W ) − m0V (W ) − mV (W ) b  .
P,2

Observe that U − m0U (W )P,2 ≤ 2U P,2 , and V − m0V (W )P,2 ≤ 2V P,2 ,
and Z − m0Z (W )P,2 ≤ 2ZP,2 hold by Lemma I.7. We have η − η 0 P,2 ≤ δN
by Assumption I.5.5. Therefore, we obtain the upper bound

EP [ψ(S; b0 , η) − ψ(S; b0 , η 0 )]


≤ 4 max{1, b0 }(U P,2 + V P,2 + ZP,2 )δN + 2 max{1, b0 }δN
2

 δN

by the triangle inequality, Lemma I.13, Lemma I.14, and Assumption I.5.2
and I.5.5. Because this upper bound is independent of η, we obtain our claimed
4
bound on rN .
Subsequently, we verify the bound on λN . Consider S = (A, X, W, Y ), denote
 ψ2 }, where
by U either A or X, denote by Z either A or Y , and let ϕ ∈ {ψ, ψ,
we interpret ψ2 (S; b, η) = ψ2 (S; η). We have
 
∂r2 EP ψ S; b0 , η 0 + r(η − η 0 )
= 2 EP mU (W ) − m0U (W )
  
T 0 T
· mZ (W ) − m0Z (W ) − mX (W ) − m0X (W ) b .
Regularizing DML in endogenous PLMs 6505

Due to the Cauchy–Schwarz inequality, we infer


 2  
∂r EP ψ S; b0 , η 0 + r(η − η 0 ) 
≤ 2mU (W ) − m0U (W )P,2
· mZ (W ) − m0Z (W )P,2 + mX (W ) − m0X (W )P,2 b0 
≤ 2 max{1, b0 }mU (W ) − m0U (W )P,2
· mZ (W ) − m0Z (W )P,2 + mX (W ) − m0X (W )P,2
 δN N − 2
1

by Lemma I.13, Lemma I.14, and Assumption I.5.5. Consequently, we obtain


our claimed bound on λN .

Lemma I.16. Let γ ≥ 0. Let k ∈ [K]. Let furthermore ϕ ∈ {ψ, ψ,  ψ2 } and


b ∈ {b , β0 , 0}. We have
0 γ

 
 1  1  
√ 0 Ikc
ϕ(Si ; b , η̂ ) − √ 0 0 
ϕ(Si ; b , η ) = OP (ρN ),
 n n
i∈Ik i∈Ik

1 1
where ρN = rN + N 2 λN is as in Definition I.4 and satisfies ρN  δN4 , and
where we interpret ψ2 (S; b, η) = ψ2 (S; η).
Proof of Lemma I.16. This proof is modified from Chernozhukov et al. [31]. By
the triangle inequality, we have
 
 √1  c  
 n i∈Ik ϕ(Si ; b0 , η̂ Ik ) − √1n i∈Ik ϕ(Si ; b0 , η 0 )
  
 c c
=  √1n i∈Ik ϕ(Si ; b0 , η̂ Ik ) − ϕ(s; b0 , η̂ Ik )dP (s)
 
− √1n i∈Ik ϕ(Si ; b0 , η 0 ) − ϕ(s; b0 , η 0 )dP (s)
√  
c 
+ n ϕ(s; b0 , η̂ Ik ) − ϕ(s; b0 , η 0 ) dP (s)

≤ I1 + nI2 ,

where I1 := M  for

 c  c
M := √1
n i∈Ik ϕ(Si ; b0 , η̂ Ik ) − ϕ(s; b0 , η̂ Ik )dP (s)

 
− √1n i∈Ik ϕ(Si ; b0 , η 0 ) − ϕ(s; b0 , η 0 )dP (s) ,

and where  
 
I2 := 
 ϕ(s; b , η̂ ) − ϕ(s; b , η ) dP (s)
0 Ikc 0
.
0

We bound the two terms I1 and I2 individually. First, we bound I1 . Because the
dimensions d and q are fixed, it is sufficient to bound one entry of the matrix M .
Let l index the rows of M , and let t index the columns of M (we interpret vectors
as matrices with one column). On the event EN that holds with P -probability
6506 C. Emmenegger and P. Bühlmann

1 − ΔN , we have
  
EP Ml,t 2 {Si }i∈Ikc  

= n1 i∈Ik EP |ϕl,t (Si ; b0 , η̂ Ik ) − ϕl,t (Si ; b0 , η 0 )|2 {Si }i∈Ikc
c

 c
+ n1 i,j∈Ik ,i=j EP ϕl,t (Si ; b0 , η̂ Ik ) − ϕl,t (Si ; b0 , η 0 )  
· ϕl,t (Sj ; b0 , η̂ Ik ) − ϕl,t(Sj ; b0 , η 0) {Si }i∈Ikc
c

  0 Ikc 0 0 
−2 i∈Ik EP ϕl,t  (Si ; b , 0η̂ I)c− ϕl,t (Si ; b ,0η )0 {S  i }i∈Ik 
c

· EP ϕl,t (S; b , η̂ ) − ϕl,t (S; b , η ) {Si }i∈Ikc


k  (25)
   
 (S; b0 , η̂ Ik ) − ϕl,t (S; b0, η 0 ){Si }i∈Ikc 
c 2
 EP ϕl,t
+n
0 Ikc 0 0 2
= EP |ϕl,t (S; b , η̂ ) − ϕl,t (S; b , η )| {Si }i∈Ikc
   2
+ n(n−1)  0 Ikc 0 0  c 
n  − 2n + n EP ϕl,t (S; b , η̂  ) − ϕl,t (S; b , η ) {Si }i∈Ik
≤ supη∈T EP ϕ(S; b0 , η) − ϕ(S; b0 , η 0 )2 .

Furthermore, for η ∈ T , we have


 
EP ϕ(S; b0 , η) − ϕ(S; b0 , η 0 )2
≤ EP [ϕ(S; b , η) − ϕ(S; b , η )]
0 0 0
 (26)
+ EP ϕ(S; b0 , η) − ϕ(S; b0 , η 0 )2 1 ϕ(S;b0 ,η)−ϕ(S;b0 ,η 0 ) ≥1

and
 
E
P ϕ(S; b0 , η) − ϕ(S; b0 , η 0 )2 1 ϕ(S;b0 ,η)−ϕ(S;b0 ,η0 ) ≥1
 
≤ EP ϕ(S; b0 , η) − ϕ(S; b0 , η 0 )4 P (ϕ(S; b0 , η) − ϕ(S; b0 , η 0 ) ≥ 1)
(27)
by Hölder’s inequality. Observe that the term
  
EP ϕ(S; b0 , η) − ϕ(S; b0 , η 0 )4 (28)

is upper bounded by Assumption I.5.5, Lemma I.13, and Lemma I.14. By


Markov’s inequality, we have

P (ϕ(S; b0 , η)−ϕ(S; b0 , η 0 ) ≥ 1) ≤ EP [ϕ(S; b0 , η)−ϕ(S; b0 , η 0 )] ≤ rN


4
. (29)

Therefore, we have EP [I12 |{Si }i∈Ikc ]  rN


2
due to (25)–(29). The statistical
1
rate rN satisfies rN  δN4 by Lemma I.15. Thus, we infer I1 = OP (rN ) by
Lemma I.12. Subsequently, we bound I2 . For r ∈ [0, 1], we introduce the func-
tion
  
fk (r) := EP ϕ S; b0 , η 0 + r(η̂ Ik − η 0 ) {Si }i∈Ikc − EP [ϕ(S; b0 , η 0 )].
c

Observe that I2 = fk (1) holds. We apply a Taylor expansion to this function
and obtain
1
fk (1) = fk (0) + fk (0) + fk (r̃)
2
for some r̃ ∈ (0, 1). We have
  
fk (0) = EP ϕ(S; b0 , η 0 ){Si }i∈Ikc − EP [ϕ(S; b0 , η 0 )] = 0.
Regularizing DML in endogenous PLMs 6507

Furthermore, the score ϕ satisfies the Neyman orthogonality property fk (0) = 0.
The proof of this claim is analogous to the proof of Proposition 3.3 because the
proof of Proposition 3.3 does neither depend on the underlying model of the
random variables nor on the value of β. Furthermore, we have

fk (r) = 2 EP mU (W ) − m0U (W )
 T 
T
· mZ (W ) − m0Z (W ) − mX (W ) − m0X (W ) b0

for U ∈ {A, X} and Z ∈ {A, Y }. On the event EN that holds with P -probability
1 − ΔN , we have
fk (r̃) ≤ sup fk (r)  λN .
r∈(0,1)

We thus infer
 
 1  1   √
√ 0 Ikc
ϕ(Si ; b , η̂ )− √ 0 0  1
ϕ(Si ; b , η ) ≤ I1 + nI2 = OP (rN +N 2 λN ).
 n n
i∈Ik i∈Ik
1
Because rN  δN and λN  √δNN hold by Lemma I.15 and because {δN }N ≥K
4

converges to 0 by Assumption I.5, we furthermore have


1 1
ρN = rN + N 2 λN  δN4 .
Lemma I.17. Let k ∈ [K]. Let furthermore U, V ∈ {A, X} and S = (U, V,
W, Y ). Let ϕ ∈ {ψ1 , ψ2 , ψ3 }. We have
1 
ϕ(Si ; η̂ Ik ) = EP [ϕ(S; η 0 )] + OP N − 2 (1 + ρN ) .
c 1

n
i∈Ik

Proof of Lemma I.17. Consider the decomposition


 Ikc
1
n i∈Ik ϕ(Si ; η̂ ) − EP [ϕ(S; η 0 )]
Ikc

= n i∈Ik ϕ(Si ; η̂ ) − ϕ(Si ; η 0 ) + n1 i∈Ik ϕ(Si ; η 0 ) − EP [ϕ(S; η 0 )] .
1

 c
Because of Lemma I.16, the term n1 i∈Ik ϕ(Si ; η̂ Ik ) − ϕ(Si ; η 0 ) is of order

OP (N − 2 ρN ). The term n1 i∈Ik ϕ(Si ; η 0 )−EP [ϕ(S; η 0 )] is of order OP (N − 2 )
1 1

due to the Lindeberg–Feller CLT and the Cramer–Wold device. Thus, we deduce
the statement.
Definition I.18. We denote by AIk the row-wise concatenation of c
the observa-
c c
tions Aic for i ∈ Ik . We denote similarly by X Ik , W Ik , Y Ik , AIk , X Ik , W Ik ,
Ik
and Y the row-wise concatenations of the respective observations.
Proof of Theorem 3.1. This proof is based on Chernozhukov et al. [31]. We show
the stronger statement
√ 1 
N
N σ −1 (β̂ − β0 ) = √
d
ψ(Si ; β0 , η 0 ) + OP (ρN ) → N (0, 1d×d ) (N → ∞),
N i=1
(30)
6508 C. Emmenegger and P. Bühlmann

where β̂ denotes the DML1 estimator β̂ DML1 or the DML2 estimator β̂ DML2 ,
and where the rate ρN is specified in Definition I.4, and we show that this
statement holds uniformly over laws P . We first consider β̂ DML2 . It suffices to
show that (30) holds uniformly over P ∈ PN . Fix a sequence {PN }N ≥1 such
that PN ∈ PN for all N ≥ 1. Because this sequence is chosen arbitrarily, it
suffices to show

N σ −1 (β̂ DML2 − β0 )
N
= √1 0
N i=1 ψ(Si ; β0 , η ) + OPN (ρN )
d
→ N (0, 1d×d ) (N → ∞).

We have

β̂ DML2
  −1
K Ikc Ik T Ikc
= 1
K k=1 X Ik
− m̂ X (W ) Π Ik X
RA
Ik
− m̂ X (W Ik
)
K Ikc Ik T Ikc
·K
1
k=1 X Ik
− m̂ X (W ) ΠRA
Ik Y
Ik
− m̂ Y (W )
Ik

K Ic T Ic
= 1
K
1
k=1 n X Ik − m̂Xk (W Ik ) AIk − m̂Ak (W Ik )
 −1
Ic T Ic
· n1 AIk − m̂Ak (W Ik ) (AIk − m̂Ak (W Ik ) (31)
−1
Ic T Ic
· n1 AIk − m̂Ak (W Ik ) X Ik − m̂Xk (W Ik )
K Ic T Ic
·K
1 1
X Ik − m̂Xk (W Ik ) AIk − m̂Ak (W Ik )

k=1 n −1
Ic T Ic
· 1
n AIk − m̂Ak (W Ik ) AIk − m̂Ak (W Ik )
Ic T Ic
· n1 AIk − m̂Ak (W Ik ) Y Ik − m̂Yk (W Ik )

by (7). By Lemma I.17, we have

Ic T Ic
XIk − m̂Xk (W Ik )
1
n AIk − m̂Ak (W Ik )
T (32)
+ OPN N − 2 (1 + ρN )
1
= EPN X − m0X (W ) A − m0A (W )

and

Ic T Ic
AIk − m̂Ak (W Ik )
1
n AIk − m̂Ak (W Ik )
T (33)
+ OPN N − 2 (1 + ρN ) .
1
= EPN A − m0A (W ) A − m0A (W )

Recall the matrix J0 introduced in Definition I.1. By Weyl’s inequality and


Regularizing DML in endogenous PLMs 6509

Slutsky’s theorem, combining Equations (31)–(33) gives


N (β̂ DML2 − β0 )
 
T
= EPN X − m0X (W ) A − m0A (W )

T −1
· EPN A − m0A (W ) A − m0A (W )
 −1
T
· EPN A − m0A (W ) X − m0X (W )

T
· EPN X − m0X (W ) A − m0A (W )
 
T −1 − 12
· EPN A − mA (W ) A − mA (W )
0 0
+ OPN N (1 + ρN )
  c
K I T Ic
· √1K k=1 √1n AIk − m̂Ak (W Ik ) Y Ik − m̂Yk (W Ik )

Ic T Ic
− AIk − m̂Ak (W Ik ) X Ik − m̂Xk (W Ik ) β0
= J0 + OPN N − 2 (1 + ρN )
1

K Ic T
· √1K k=1 √1n AIk − m̂Ak (W Ik )
 
Ic Ic
· Y Ik − m̂Yk (W Ik ) − X Ik − m̂Xk (W Ik ) β0
(34)
because K is a constant independent of N and because N = nK holds. Recall
the linear score ψ in (11). We have

√   1 K
1 
N (β̂ DML2 − β0 ) = J0 + OPN N − 2 (1 + ρN ) √
1 c
√ ψ(Si ; β0 , η̂ Ik ).
K k=1 n i∈Ik
(35)
Let k ∈ [K]. By Lemma I.16, we have

1  c 1 
√ ψ(Si ; β0 , η̂ Ik ) = √ ψ(Si ; β0 , η 0 ) + OPN (ρN ). (36)
n n
i∈Ik i∈Ik

We combine (35) and (36) to obtain


 N (β̂
DML2
− β0 )  K 
= J0 + OPN N − 2 (1 + ρN ) √1K k=1 √1n i∈Ik ψ(Si ; β0 , η̂ Ik )
1 c

  K   
= J0 + OPN N − 2 (1 + ρN ) √1K k=1 √1n i∈Ik ψ(Si ; β0 , η 0 ) + OPN (ρN ) .
1

Recall that we have N = nK, that K is a constant independent of N , that the


1
sets Ik for k ∈ [K] form a partition of [N ], that ρN  δN4 by Lemma I.16, and
1
that δN converges to 0 as N → ∞ and δN4 ≥ N − 2 holds by Assumption I.5.
1
6510 C. Emmenegger and P. Bühlmann

Thus, we have


 N (β̂
DML2
− β0 ) 
= J0 + OPN N − 2 (1 + ρN )
1

K   
· √1K k=1 √1n i∈Ik ψ(Si ; β0 , η 0 ) + OPN (ρN )
  N
= J0 + OPN N − 2 (1 + ρN ) √1N i=1 ψ(Si ; β0 , η 0 ) + OPN (ρN )
1

N
= J0 · √1N i=1 ψ(Si ; β0 , η 0 ) + OPN (ρN ).

We have EPN [ψ(S; β0 , η 0 )] = 0 due to the identifiability condition (5). Therefore,


we conclude the proof concerning the DML2 method due to the Lindeberg–Feller
CLT and the Cramer–Wold device.
Subsequently, we consider the DML1 method. It suffices to show that (30)
holds uniformly over P ∈ PN . Fix a sequence {PN }N ≥1 such that PN ∈ PN for
all N ≥ 1. Because this sequence is chosen arbitrarily, it suffices to show


N σ −1 (β̂ DML1 − β0 )
N
= √1 0
N i=1 ψ(Si ; β0 , η ) + OPN (ρN )
d
→ N (0, 1d×d ) (N → ∞).

We have

β̂ Ik
 −1
Ic T Ic
= X Ik − m̂Xk (W Ik ) ΠRIk X Ik − m̂Xk (W Ik )
A
Ic T Ic
· X Ik − m̂Xk (W Ik ) ΠRIk Y Ik − m̂Yk (W Ik )
A
Ic T Ic
= 1
n X −
Ik
m̂Xk (W Ik ) AIk − m̂Ak (W Ik )
 −1
Ic T Ic
· n1 AIk − m̂Ak (W Ik ) AIk − m̂Ak (W Ik ) (37)
−1
Ic T Ic
· n1 AIk − m̂Ak (W Ik ) X Ik − m̂Xk (W Ik )
Ic T Ic
· n1 X Ik − m̂Xk (W Ik ) AIk − m̂Ak (W Ik )
 −1
Ic T Ic
· n1 AIk − m̂Ak (W Ik ) AIk − m̂Ak (W Ik )
Ic T Ic
· n1 AIk − m̂Ak (W Ik ) Y Ik − m̂Yk (W Ik )

by (19). Due to Weyl’s inequality and Slutsky’s theorem, (32), (33), and (37),
Regularizing DML in endogenous PLMs 6511

we obtain

N (β̂ DML1 − β0 )
 
T
= EPN X − m0X (W ) A − m0A (W )

T −1
· EPN A − m0A (W ) A − m0A (W )
 −1
T
· EPN A − m0A (W ) X − m0X (W )

T
· EPN X − m0X (W ) A − m0A (W )
 
T −1
+ OPN N − 2 (1 + ρN )
1
· EPN A − m0A (W ) A − m0A (W )
K  Ic T Ic
· √1K k=1 √1n AIk − m̂Ak (W Ik ) Y Ik − m̂Yk (W Ik )

Ic T Ic
− √1n AIk − m̂Ak (W Ik ) X Ik − m̂Xk (W Ik ) β0
 
= J0 + OPN N − 2 (1 + ρN )
1

K Ic T
· √1K k=1 √1n AIk − m̂Ak (W Ik )
 
Ikc Ikc
· Y − m̂Y (W ) − X − m̂X (W ) β0 .
Ik Ik Ik Ik

√ (38)
Observe that the expression
√ for N ( β̂ DML1
− β 0 ) given in (38) coincides with
the√expression for N (β̂ DML2 − β0 ) given in (34). Thus, the asymptotic
√ analysis
of N (β̂ DML1 − β0 ) coincides with the asymptotic analysis of N (β̂ DML2 − β0 )
presented above.
Lemma I.19. Let γ ≥ 0. Let p > 4 be the p from Assumption I.5, let b0 ∈
{β0 , bγ , 0}, and let S = (U, V, Z) ∈ {A, X, Y }2 × {W } × {A, X, Y }. There exists
a finite real constant C5 satisfying
 p
2
p
sup EP ψ(S; b0 , η) 2 ≤ C5 .
η∈T

Proof of Lemma I.19. Let η = (mU , mV , mZ ) ∈ T . By Hölder’s inequality and


the triangle inequality, we have
 p
2
p
EP ψ(S; b0 , η) 2
= (U − mU (W )) Z − mZ (W ) − (V − mV (W ))T b0 P, p2
(39)
≤ U − m0U (W )P,p + m0U (W ) − mU (W )P,p
· Z − m0Z (W )P,p + (V − m0V (W ))T b0 P,p
+m0Z (W ) − mZ (W )P,p + (m0V (W ) − mV (W ))T b0 P,p .

By the Cauchy–Schwarz inequality, we have


   1
 T 
 V − m0V (W ) b0  ≤ EP V − m0V (W )p b0 p p = b0 V − m0V (W )P,p
P,p
(40)
6512 C. Emmenegger and P. Bühlmann

and analogously
 
 0 T 0
 mV (W ) − mV (W ) b  ≤ b0 m0V (W ) − mV (W )P,p . (41)
P,p

Hence, we infer
 p
2
p
EP ψ(S; b0 , η) 2 ≤ (U P,p + C2 )(ZP,p + V P,p + 2C2 ) max{1, b0 }
(42)
by (39), (40), (41), Lemma I.7, and Assumption I.5.5. By Lemma I.13, there
exists a finite real constant C3 that satisfies β0  ≤ C3 . By Lemma I.14, there
exists a finite real constant C4 that satisfies bγ  ≤ C4 . These two bounds lead
to b0  ≤ max{C3 , C4 }. By Assumption I.5.2, we have

max{U P,p , V P,p , ZP,p } ≤ U P,p + V P,p + ZP,p ≤ 3C1 .

Due to (42), we therefore have


 p
2
p
EP ψ(S; b0 , η) 2 ≤ (3C1 + C2 )(6C1 + 2C2 ) max{1, C3 , C4 }.

Lemma I.20. Let γ ≥ 0, and let p be as in Assumption I.5. Let the in-
dices k ∈ [K] and (j, l, t, r) ∈ [L1 ] × [L2 ] × [L3 ] × [L4 ], where L1 , L2 , L3 ,
and L4 are natural numbers representing the intended dimensions. Let b̂ ∈
{β̂ DML1 , β̂ DML2 , b̂γ,DML1 , b̂γ,DML2 }, and consider the corresponding true but un-
known underlying parameter vector b0 ∈ {β0 , bγ }. Consider the corresponding
score function combinations

ψ̂ A (·) ∈ {ψj (·; b̂, η̂ Ik ), ψj (·; b̂, η̂ Ik ), (ψ1 (·; η̂ Ik ))j,l , (ψ2 (·; η̂ Ik ))j,l },
c c c c

A
ψ̂full  b̂, η̂ Ikc ), ψ(·; b̂, η̂ Ikc ), ψ1 (·; η̂ Ikc ), ψ2 (·; η̂ Ikc )},
(·) ∈ {ψ(·;
ψ̂ B (·) ∈ {ψt (·; b̂, η̂ Ik ), ψt (·; b̂, η̂ Ik ), (ψ1 (·; η̂ Ik ))t,r , (ψ2 (·; η̂ Ik ))t,r },
c c c c

B
ψ̂full  b̂, η̂ Ikc ), ψ(·; b̂, η̂ Ikc ), ψ1 (·; η̂ Ikc ), ψ2 (·; η̂ Ikc )}
(·) ∈ {ψ(·;

and their respective nonestimated quantity

ψ A (·) ∈ {ψj (·; b0 , η 0 ), ψj (·; b0 , η 0 ), (ψ1 (·; η 0 ))j,l , (ψ2 (·; η 0 ))j,l },
A
ψfull  b0 , η 0 ), ψ(·; b0 , η 0 ), ψ1 (·; η 0 ), ψ2 (·; η 0 )},
(·) ∈ {ψ(·;
ψ B (·) ∈ {ψt (·; b0 , η 0 ), ψt (·; b0 , η 0 ), (ψ1 (·; η 0 ))t,r , (ψ2 (·; η 0 ))t,r },
B
ψfull  b0 , η 0 ), ψ(·; b0 , η 0 ), ψ1 (·; η 0 ), ψ2 (·; η 0 )}.
(·) ∈ {ψ(·;

Then we have
 
1  A  A 

Ik :=  ψ̂ (Si )ψ̂ (Si ) − EP ψ (S)ψ (S)  = OP (ρ̃N ),
B B
n
i∈Ik
 
p −1,− 2
4 1
where ρ̃N = N max + rN is as in Definition I.4.
Regularizing DML in endogenous PLMs 6513

Proof of Lemma I.20. This proof is modified from Chernozhukov et al. [31]. By
the triangle inequality, we have

Ik ≤ Ik,A + Ik,B ,

where  
1  A 1  A 
Ik,A :=  ψ̂ (Si )ψ̂ B (Si ) − ψ (Si )ψ B (Si )
n n
i∈Ik i∈Ik

and   
1  A 
Ik,B 
:=  ψ (Si )ψ (Si ) − EP ψ (S)ψ (S) .
A B B
n
i∈Ik

Subsequently, we bound the two terms Ik,A and Ik,B individually. First, we
bound Ik,B . We consider the case p ≤ 8. The von Bahr–Esseen inequality I [37,
p. 650] states that for 1 ≤ u ≤ 2 and for independent, real-valued, and mean 0
variables Z1 , . . . , Zn , we have
  u  
 n  1
n
E  
Zi  ≤ 2 − E[|Zi |u ].
i=1
n i=1

The individual summands ψ A (Si )ψ B (Si ) − EP [ψ A (S)ψ B (S)] for i ∈ Ik are in-
dependent and have mean 0. Therefore,
 p
EP Ik,B4

 
p
  A   p4
= n EP  i∈Ik ψ (Si )ψ (Si ) − EP ψ (S)ψ (S) 
1 4 A B B
      p
−1+ p
≤ n1 4
2 − n1 n1 i∈Ik EP ψ A (Si )ψ B (Si ) − EP ψ A (S)ψ B (S)  4
     p
−1+ p
= n1 4
2 − n1 EP ψ A (S)ψ B (S) − EP ψ A (S)ψ B (S)  4

follows due to the von Bahr–Esseen inequality I because 1 < p


4 ≤ 2 holds. By
Hölder’s inequality, we have
  p   p  p4
EP ψ A (S) 4 ψ B (S) 4
  p p2   p p2
≤ EP ψ A (S) 2 EP ψ B (S; bγ , η 0 ) 2
 A   B 
≤ ψfull (S)P, p ψfull (S)P, p .
2 2

0  b , η )P, p , ψ1 (S; η)P, p , and ψ2 (S; η)P, p


The terms ψ(S; b , η )P, p2 , ψ(S;
0 0 0
2 2 2
are upper bounded by the finite real constant C5 by Lemma I.19. Thus, we have
p
Ik,B = OP (N 4 −1 ) by Lemma I.12 because we have
   p 4

EP ψ A (S)ψ B (S) − EP ψ A (S)ψ B (S)  4


p

 
= ψ A (S)ψ B (S) − EP ψ A (S)ψ

B
(S) P, p4 
≤ ψ A (S)ψ B (S)P, p4 + EP |ψ A (S)ψ B (S)|
≤ 2ψ A (S)ψ B (S)P, p4
6514 C. Emmenegger and P. Bühlmann

p
by the triangle inequality, Hölder’s inequality, and due to 4 > 1.
Next, consider the case p > 8. Observe that
 2 

EP 1
n i∈Ik
A B
ψ (Si )ψ (Si )
  2
2 2 n(n−1)
= 1
n EP ψ A (S) ψ B (S) + n2 EP ψ A (S)ψ B (S)

holds because the data sample is iid. Thus, we infer

EP [I 2
k,B ]
 2   2
= EP 1
n i∈Ik ψ A
(Si )ψ B
(Si ) + EP ψ A (S)ψ B (S)
 
−2 EP n1 i∈Ik ψ A (Si )ψ B (Si ) EP [ψ A (S)ψ B (S)]
 
≤ n1 EP (ψ A (S))2 (ψ B (S))2 .

By the Cauchy–Schwarz inequality, we have



2
1
EP ψ A (S))2 (ψ B (S)
n
!  
4 4
≤ n1 EP ψ A (S) EP ψ B (S)
 A 2  B 2
≤ n1 ψfull (S)P,4 ψfull (S)P,4 .

 b0 , η 0 )P,4 , ψ1 (S; η)P,4 , and ψ2 (S; η)P,4


The terms ψ(S; b0 , η 0 )P,4 , ψ(S;
are upper bounded by C5 by Lemma I.19. Thus, we have

1ψfull
2  B 2 1
EP [Ik,B
2
]≤ A
(S)P,4 ψfull (S)P,4 ≤ (4C5 )4 .
n n

We hence infer Ik,B = OP (N − 2 ) by Lemma I.12.


1

Second, we bound the term Ik,A . For any real numbers a1 , a2 , b1 , and b2 such
that real numbers c and d exist that satisfy max{|b1 |, |b2 |} ≤ c and max{|a1 −
b1 |, |a2 − b2 |} ≤ d, we have |a1 a2 − b1 b2 | ≤ 2d(c + d). Indeed, we have

|a1 a2 − b1 b2 |
≤ |a1 − b1 | · |a2 − b2 | + |b1 | · |a2 − b2 | + |a1 − b1 | · |b2 |
≤ d2 + cd + dc
≤ 2d(c + d)

by the triangle inequality.


We apply this observation together with the triangle inequality and the
Regularizing DML in endogenous PLMs 6515

Cauchy–Schwarz inequality to obtain

Ik,A  A 

≤ 1  i )ψ̂ (Si ) − ψ (Si )ψ (Si )
B A B 
n i∈Ik ψ̂ (S A   B 
≤ 2  ψ̂ (Si ) − ψ (Si ) , ψ̂ (Si ) − ψ B (Si )
A  
n  i∈Ik max
   
· max ψ A (Si ), ψ B (Si )
    
+ max ψ̂ A (Si ) − ψ A (Si ), ψ̂ B (Si ) − ψ B (Si )
  " 2  2 # 12
≤ 2 n1 i∈Ik max ψ̂ A (Si ) − ψ A (Si ) , ψ̂ B (Si ) − ψ B (Si )
     
· n1 i∈Ik max ψ A (Si ), ψ B (Si )
    2  12
+ max ψ̂ A (Si ) − ψ A (Si ), ψ̂ B (Si ) − ψ B (Si ) .

By the triangle inequality, we hence have


 
    
Ik,A
2
≤ 4RN,k 1 ψ A (Si )2 + ψ B (Si )2 + RN,k (43)
n i∈Ik full full

by Lemma I.11, where

1  
ψ̂full
2  B 2 
RN,k := A
(Si ) − ψfull
A
(Si ) + ψ̂full (Si ) − ψfull
B
(Si ) .
n
i∈Ik

Note that we have


1  
ψfull
2  B 2 
A
(Si ) + ψfull (Si ) = OP (1)
n
i∈Ik

 b0 , η 0 )P,4 ,
by Markov’s inequality because the terms ψ(S; b0 , η 0 )P,4 , ψ(S;
ψ1 (S; η)P,4 , and ψ2 (S; η)P,4 are upper bounded by C5 by Lemma I.19. Thus,
it suffices to bound the term RN,k . To do this, we need to bound the four terms

1  c
ψ(Si ; b̂, η̂ Ik ) − ψ(Si ; b0 , η 0 )2 , (44)
n
i∈Ik
1   c
 i ; b0 , η 0 )2 ,
ψ(Si ; b̂, η̂ Ik ) − ψ(S (45)
n
i∈Ik
1  c
ψ1 (Si ; η̂ Ik ) − ψ1 (Si ; η 0 )2 , (46)
n
i∈Ik
1  c
ψ2 (Si ; η̂ Ik ) − ψ2 (Si ; η 0 )2 . (47)
n
i∈Ik

First, we bound the two terms (44) and (45) simultaneously. Consider the ran-
dom variable U ∈ {A, X} and the quadruple S = (U, X, W, Y ). Because the
6516 C. Emmenegger and P. Bühlmann

score ψ is linear in β, these two terms are upper bounded by


 Ikc 0 Ic
n i∈Ik −ψ (Si ; η̂ )(b̂ − b ) + ψ(Si ; b , η̂ k ) − ψ(Si ; b , η )
1 a 0 0 0 2
Ikc
 0 Ikc
≤ n i∈Ik ψ (Si ; η̂ )(b̂ − b ) + n i∈Ik ψ(Si ; b , η̂ ) − ψ(Si ; b0 , η 0 )2
2 a 0 2 2

(48)
due to the triangle inequality and Lemma I.11. Subsequently, we verify that
1  a c
ψ (Si ; η̂ Ik )2 = OP (1)
n
i∈Ik

holds. Indeed, we have


 c
i∈Ik ψ
1 a
(Si ; η̂ Ik )2
n
  
 Ic Ic T 2
= n1 i∈Ik  Ui − m̂Uk (Wi ) Xi − m̂Xk (Wi )  (49)
   
Ic Ic
≤ n1 i∈Ik Ui − m̂Uk (Wi )4 n1 i∈Ik Xi − m̂Xk (Wi )4

by the Cauchy–Schwarz inequality. We have


4
1 
1

Ui − m0U (Wi )4 = OP (1) (50)


n
i∈Ik

by Markov’s inequality because the term EP [U − m0U (W )4 ] is upper bounded
by Lemma I.7 and Assumption I.5.2. On the event EN that holds with P -
probability 1 − ΔN , we have
  
 c 
EP n1 i∈Ik η 0 (Wi ) − η̂ Ik (Wi )4 {Si }i∈Ikc
 c  (51)
= EP η 0 (W ) − η̂ Ik (W )4 |{Si }i∈Ikc
≤ C24
 c
by Assumption I.5.5. We hence have n1 i∈Ik η 0 (Wi ) − η̂ Ik (Wi ) = OP (1) by
Lemma I.12. Let us denote by ·PIk ,p the Lp -norm with the empirical measure
on the data indexed by Ik . On the event EN that holds with P -probability
1 − ΔN , we have
 Ikc
i∈Ik U i − m̂U (Wi )
1 4
n c
I
= U − m̂Uk (W )4PI ,4
k
Ic 4 (52)
≤ U − m0U (W )PIk ,4 + m0U (W ) − m̂Uk (W )PIk ,4
c 4
≤ U − m0U (W )PIk ,4 + η 0 (W ) − η̂ Ik (W )PIk ,4
= OP (1)

by the triangle inequality, (50), and (51). Analogous arguments lead to


1  Ic
Xi − m̂Xk (Wi )4 = OP (1). (53)
n
i∈Ik
Regularizing DML in endogenous PLMs 6517

We combine (49), (52), and (53) to obtain


1  a c
ψ (Si ; η̂ Ik )2 = OP (1). (54)
n
i∈Ik

Because b̂ − b0 2 = OP (N −1 ) holds by Theorem 3.1 and Theorem 4.1, we can


bound the first summand in (48) by
1  a
ψ (Si ; η̂ Ik )(b̂ − b0 )2 = OP (1)OP (N −1 ) = OP (N −1 )
c
(55)
n
i∈Ik

due to the Cauchy–Schwarz inequality and (54). On the event $\mathcal E_N$ that holds with $P$-probability $1 - \Delta_N$, the conditional expectation given $\{S_i\}_{i\in I_k^c}$ of the second summand in (48) is equal to
\[
\begin{aligned}
E_P\Bigl[\frac2n\sum_{i\in I_k}\bigl\|\psi(S_i;b^0,\hat\eta^{I_k^c}) - \psi(S_i;b^0,\eta^0)\bigr\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Bigr]
&= 2E_P\Bigl[\bigl\|\psi(S;b^0,\hat\eta^{I_k^c}) - \psi(S;b^0,\eta^0)\bigr\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Bigr]\\
&\le 2\sup_{\eta\in\mathcal T}E_P\Bigl[\bigl\|\psi(S;b^0,\eta) - \psi(S;b^0,\eta^0)\bigr\|^2\Bigr]
\lesssim r_N^2
\end{aligned}
\]
due to arguments that are analogous to (25)–(29) presented in the proof of Lemma I.16. Because the event $\mathcal E_N$ holds with $P$-probability $1 - \Delta_N = 1 - o(1)$, we infer
\[
\frac1n\sum_{i\in I_k}\bigl\|-\psi^a(S_i;\hat\eta^{I_k^c})(\hat b - b^0) + \psi(S_i;b^0,\hat\eta^{I_k^c}) - \psi(S_i;b^0,\eta^0)\bigr\|^2 = O_P\bigl(N^{-1} + r_N^2\bigr)
\]
by Lemma I.12.
Next, we bound the two terms given in (46) and (47). We first consider the term given in (46). On the event $\mathcal E_N$, we have
\[
\begin{aligned}
E_P\Bigl[\frac1n\sum_{i\in I_k}\bigl\|\psi_1(S_i;\hat\eta^{I_k^c}) - \psi_1(S_i;\eta^0)\bigr\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Bigr]
&= E_P\Bigl[\bigl\|\psi_1(S;\hat\eta^{I_k^c}) - \psi_1(S;\eta^0)\bigr\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Bigr]\\
&\le \sup_{\eta\in\mathcal T}E_P\Bigl[\bigl\|\psi_1(S;\eta) - \psi_1(S;\eta^0)\bigr\|^2\Bigr]
\lesssim r_N^2
\end{aligned}
\]
due to arguments that are analogous to (25)–(29) presented in the proof of Lemma I.16. Because the event $\mathcal E_N$ holds with probability $1 - \Delta_N = 1 - o(1)$, we infer
\[
\frac1n\sum_{i\in I_k}\bigl\|\psi_1(S_i;\hat\eta^{I_k^c}) - \psi_1(S_i;\eta^0)\bigr\|^2 = O_P(r_N^2)
\]
by Lemma I.12.

On the event $\mathcal E_N$, the conditional expectation given $\{S_i\}_{i\in I_k^c}$ of the term (47) is given by
\[
\begin{aligned}
E_P\Bigl[\frac1n\sum_{i\in I_k}\bigl\|\psi_2(S_i;\hat\eta^{I_k^c}) - \psi_2(S_i;\eta^0)\bigr\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Bigr]
&= E_P\Bigl[\bigl\|\psi_2(S;\hat\eta^{I_k^c}) - \psi_2(S;\eta^0)\bigr\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Bigr]\\
&\le \sup_{\eta\in\mathcal T}E_P\Bigl[\bigl\|\psi_2(S;\eta) - \psi_2(S;\eta^0)\bigr\|^2\Bigr]
\lesssim r_N^2
\end{aligned}
\]
due to arguments that are analogous to (25)–(29) presented in the proof of Lemma I.16. Because the event $\mathcal E_N$ holds with probability $1 - \Delta_N = 1 - o(1)$, we infer
\[
\frac1n\sum_{i\in I_k}\bigl\|\psi_2(S_i;\hat\eta^{I_k^c}) - \psi_2(S_i;\eta^0)\bigr\|^2 = O_P(r_N^2)
\]
by Lemma I.12. Therefore, we have $I_{k,A} = O_P(N^{-1/2} + r_N)$ by (43). In total, we thus have
\[
I_k = O_P\bigl(N^{\max(4/p - 1,\,-1/2)}\bigr) + O_P\bigl(N^{-1/2} + r_N\bigr) = O_P\bigl(N^{\max(4/p - 1,\,-1/2)} + r_N\bigr).
\]

Theorem I.21. Suppose Assumption I.5 holds. Introduce the matrix
\[
\begin{aligned}
\hat J_{k,0}
&:= \Bigl(\frac1n\sum_{i\in I_k}R_{X,i}^{I_k}(R_{A,i}^{I_k})^T\Bigl(\frac1n\sum_{i\in I_k}R_{A,i}^{I_k}(R_{A,i}^{I_k})^T\Bigr)^{-1}\frac1n\sum_{i\in I_k}R_{A,i}^{I_k}(R_{X,i}^{I_k})^T\Bigr)^{-1}\\
&\qquad\cdot\frac1n\sum_{i\in I_k}R_{X,i}^{I_k}(R_{A,i}^{I_k})^T\Bigl(\frac1n\sum_{i\in I_k}R_{A,i}^{I_k}(R_{A,i}^{I_k})^T\Bigr)^{-1}.
\end{aligned}
\]
Let its average over $k\in[K]$ be
\[
\hat J_0 := \frac1K\sum_{k=1}^K\hat J_{k,0}.
\]
Define further the estimator
\[
\hat\sigma^2 := \hat J_0\Bigl(\frac1K\sum_{k=1}^K\frac1n\sum_{i\in I_k}\psi(S_i;\hat\beta,\hat\eta^{I_k^c})\,\psi^T(S_i;\hat\beta,\hat\eta^{I_k^c})\Bigr)\hat J_0^T
\]
of $\sigma^2$ from Theorem 3.1, where $\hat\beta\in\{\hat\beta_{\mathrm{DML1}},\hat\beta_{\mathrm{DML2}}\}$. We then have $\hat\sigma^2 = \sigma^2 + O_P(\tilde\rho_N)$, where $\tilde\rho_N = N^{\max(4/p-1,\,-1/2)} + r_N$ is as in Definition I.4.
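Computationally, assembling $\hat J_{k,0}$ and $\hat\sigma^2$ from cross-fitted residuals is a matter of a few matrix products. The following is a minimal R sketch under hypothetical inputs (illustrative only, not the dmlalg implementation): RX and RA denote the n x d and n x q residual matrices of a fold, and psi_list holds one n x q matrix of estimated scores per fold.

## Minimal sketch with hypothetical inputs (not the dmlalg API).
## RX: n x d cross-fitted residuals of X on W; RA: n x q residuals of A on W.
Jhat_k0 <- function(RX, RA) {
  n <- nrow(RX)
  M_XA <- crossprod(RX, RA) / n              # (1/n) sum_i R_X,i (R_A,i)^T
  M_AA <- crossprod(RA) / n                  # (1/n) sum_i R_A,i (R_A,i)^T
  M_AX <- crossprod(RA, RX) / n              # (1/n) sum_i R_A,i (R_X,i)^T
  solve(M_XA %*% solve(M_AA, M_AX)) %*% M_XA %*% solve(M_AA)
}
## psi_list: one n x q matrix per fold whose rows are psi(S_i; beta_hat, eta_hat).
sigma2_hat <- function(J0, psi_list) {
  K <- length(psi_list)
  mid <- Reduce(`+`, lapply(psi_list, function(ps) crossprod(ps) / nrow(ps))) / K
  J0 %*% mid %*% t(J0)                       # sandwich J0 * E_n[psi psi^T] * J0^T
}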

Proof of Theorem I.21. We derived $\hat J_{k,0} = J_0 + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)$ in the proof of Theorem 3.1. Thus, $\hat J_0 = J_0 + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)$ holds because $K$ is a fixed number independent of $N$. To conclude the proof, it suffices to verify
\[
\Bigl\|\frac1n\sum_{i\in I_k}\psi(S_i;\hat\beta,\hat\eta^{I_k^c})\psi^T(S_i;\hat\beta,\hat\eta^{I_k^c}) - E_P\bigl[\psi(S;\beta_0,\eta^0)\psi^T(S;\beta_0,\eta^0)\bigr]\Bigr\| = O_P(\tilde\rho_N).
\]
But this statement holds by Lemma I.20 because the dimensions of $A$ and $X$ are fixed.

Appendix J: Proofs of Section 4


Definition J.1. Let $\gamma\ge0$, and recall the scalar $\rho_N = r_N + N^{1/2}\lambda_N$ in Definition I.4. Introduce the function
\[
\begin{aligned}
\bar\psi(\cdot;b^\gamma,\eta^0)
&:= \tilde\psi(\cdot;b^\gamma,\eta^0) + (\gamma-1)D_3\,\psi(\cdot;b^\gamma,\eta^0)\\
&\quad + (\gamma-1)\bigl(\psi_1(\cdot;\eta^0) - E_P[\psi_1(S;\eta^0)]\bigr)D_5
- (\gamma-1)D_3\bigl(\psi_2(\cdot;\eta^0) - E_P[\psi_2(S;\eta^0)]\bigr)D_5.
\end{aligned}
\]
Let
\[
D_4 := E_P\bigl[\bar\psi(S;b^\gamma,\eta^0)\bigl(\bar\psi(S;b^\gamma,\eta^0)\bigr)^T\bigr],
\]
and let the variance
\[
\sigma^2(\gamma) := \bigl(D_1 + (\gamma-1)D_2\bigr)^{-1}D_4\bigl(D_1^T + (\gamma-1)D_2^T\bigr)^{-1}.
\]
Moreover, define the influence function
\[
\psi^\star(\cdot;b^\gamma,\eta^0) := \sigma^{-1}(\gamma)\bigl(D_1 + (\gamma-1)D_2\bigr)^{-1}\bar\psi(\cdot;b^\gamma,\eta^0).
\]

Proof of Theorem 4.1. This proof is based on Chernozhukov et al. [31]. The matrices $D_1 + (\gamma-1)D_2$ and $D_4$ are invertible by Assumption I.5.4. Hence, $\sigma^2(\gamma)$ is invertible.
Subsequently, we show the stronger statement
\[
\sqrt N\,\sigma^{-1}(\gamma)(\hat b^\gamma - b^\gamma) = \frac1{\sqrt N}\sum_{i=1}^N\psi^\star(S_i;b^\gamma,\eta^0) + O_P(\rho_N)
\xrightarrow{d} \mathcal N(0, 1_{d\times d}) \quad (N\to\infty),
\tag{56}
\]
where $\hat b^\gamma$ denotes the DML2 estimator $\hat b^{\gamma,\mathrm{DML2}}$ or its DML1 variant $\hat b^{\gamma,\mathrm{DML1}}$, and where $\psi^\star$ is as in Definition J.1. We first consider $\hat b^{\gamma,\mathrm{DML2}}$ and afterwards $\hat b^{\gamma,\mathrm{DML1}}$. Fix a sequence $\{P_N\}_{N\ge1}$ such that $P_N\in\mathcal P_N$ for all $N\ge1$. Because this sequence is chosen arbitrarily, it suffices to show
\[
\sqrt N\,\sigma^{-1}(\gamma)(\hat b^{\gamma,\mathrm{DML2}} - b^\gamma)
= \frac1{\sqrt N}\sum_{i=1}^N\psi^\star(S_i;b^\gamma,\eta^0) + O_{P_N}(\rho_N)
\xrightarrow{d} \mathcal N(0, 1_{d\times d}) \quad (N\to\infty).
\]
We have
\[
\begin{aligned}
\hat b^{\gamma,\mathrm{DML2}}
&= \Bigl(\frac1K\sum_{k=1}^K(R_X^{I_k})^T\bigl(1 + (\gamma-1)\Pi_{R_A^{I_k}}\bigr)R_X^{I_k}\Bigr)^{-1}
\frac1K\sum_{k=1}^K(R_X^{I_k})^T\bigl(1 + (\gamma-1)\Pi_{R_A^{I_k}}\bigr)R_Y^{I_k}\\
&= \biggl[\frac1K\sum_{k=1}^K\Bigl\{\frac1n\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)^T\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)\\
&\qquad + (\gamma-1)\cdot\frac1n\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)^T\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)\\
&\qquad\quad\cdot\Bigl(\frac1n\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)^T\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)\Bigr)^{-1}
\frac1n\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)^T\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)\Bigr\}\biggr]^{-1}\\
&\quad\cdot\frac1K\sum_{k=1}^K\Bigl\{\frac1n\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)^T\bigl(Y^{I_k} - \hat m_Y^{I_k^c}(W^{I_k})\bigr)\\
&\qquad + (\gamma-1)\cdot\frac1n\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)^T\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)\\
&\qquad\quad\cdot\Bigl(\frac1n\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)^T\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)\Bigr)^{-1}
\frac1n\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)^T\bigl(Y^{I_k} - \hat m_Y^{I_k^c}(W^{I_k})\bigr)\Bigr\}
\end{aligned}
\tag{57}
\]
by (14).
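As a computational aside, the fold-wise quantities in (57) are straightforward to compute. A minimal R sketch under hypothetical inputs (residual matrices rX, rA, rY built with nuisance fits from the complementary fold; illustrative only, not the dmlalg implementation):

## Per-fold k-class moment pieces of (57); gamma = 1 recovers the ordinary
## DML normal equations, and large gamma approaches the TSLS-type estimator.
kclass_fold <- function(rX, rA, rY, gamma) {
  n <- nrow(rX)
  PA_X <- rA %*% solve(crossprod(rA), crossprod(rA, rX))   # Pi_{rA} rX
  PA_Y <- rA %*% solve(crossprod(rA), crossprod(rA, rY))   # Pi_{rA} rY
  list(M = crossprod(rX, rX + (gamma - 1) * PA_X) / n,
       v = crossprod(rX, rY + (gamma - 1) * PA_Y) / n)
}
## DML2 averages the pieces over folds and solves once:
## pieces <- lapply(folds, function(f) kclass_fold(f$rX, f$rA, f$rY, gamma))
## Ms <- lapply(pieces, `[[`, "M"); vs <- lapply(pieces, `[[`, "v")
## bhat_dml2 <- solve(Reduce(`+`, Ms) / K, Reduce(`+`, vs) / K)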

By Lemma I.17, we have
\[
\begin{aligned}
\frac1n\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)^T\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)
&= E_{P_N}\Bigl[\bigl(X - m_X^0(W)\bigr)\bigl(A - m_A^0(W)\bigr)^T\Bigr] + O_{P_N}\bigl(N^{-1/2}(1+\rho_N)\bigr),\\
\frac1n\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)^T\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)
&= E_{P_N}\Bigl[\bigl(A - m_A^0(W)\bigr)\bigl(A - m_A^0(W)\bigr)^T\Bigr] + O_{P_N}\bigl(N^{-1/2}(1+\rho_N)\bigr),\\
\frac1n\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)^T\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)
&= E_{P_N}\Bigl[\bigl(X - m_X^0(W)\bigr)\bigl(X - m_X^0(W)\bigr)^T\Bigr] + O_{P_N}\bigl(N^{-1/2}(1+\rho_N)\bigr).
\end{aligned}
\]
By Weyl’s inequality and Slutsky’s theorem, we hence have



 N (b̂
γ,DML2
− bγ ) 
−1
+ OPN N − 2 (1 + ρN )
1
= D1 + (γ − 1)D2
K 
Ic T
· √1K k=1 √1n X Ik − m̂Xk (W Ik )
 
Ic Ic
· Y Ik − m̂Yk (W Ik ) − X Ik − m̂Xk (W Ik ) bγ
Ic T Ic
+(γ − 1) · n1 X Ik − m̂Xk (W Ik ) AIk − m̂Ak (W Ik )
 −1
Ic T Ic
· n1 AIk − m̂Ak (W Ik ) AIk − m̂Ak (W Ik )
Ic T (58)
· AIk − m̂Ak (W Ik ) 
Ic Ic
· Y Ik − m̂Yk (W Ik ) − X Ik − m̂Xk (W Ik ) bγ
 
−1
+ OPN N − 2 (1 + ρN )
1
= D1 + (γ − 1)D2
K    i ; bγ , η̂ Ikc )
· √1K k=1 √1n i∈Ik ψ(S
 c  c −1
+(γ − 1) · √1n i∈Ik ψ1 (Si ; η̂ Ik ) · n1 i∈Ik ψ2 (Si ; η̂ Ik )
 c

· n1 i∈Ik ψ(Si ; bγ , η̂ Ik )

due to (57) because K and γ are constants independent of N and because


N = nK holds. Let k ∈ [K]. Next, we analyze the individual factors of the last
summand in (58). By Lemma I.16, we have
\[
\frac1{\sqrt n}\sum_{i\in I_k}\psi(S_i;b^\gamma,\hat\eta^{I_k^c})
= \frac1{\sqrt n}\sum_{i\in I_k}\psi(S_i;b^\gamma,\eta^0)
+ \Bigl(\frac1{\sqrt n}\sum_{i\in I_k}\psi(S_i;b^\gamma,\hat\eta^{I_k^c}) - \frac1{\sqrt n}\sum_{i\in I_k}\psi(S_i;b^\gamma,\eta^0)\Bigr)
= \frac1{\sqrt n}\sum_{i\in I_k}\psi(S_i;b^\gamma,\eta^0) + O_{P_N}(\rho_N),
\tag{59}
\]
and
\[
\frac1{\sqrt n}\sum_{i\in I_k}\tilde\psi(S_i;b^\gamma,\hat\eta^{I_k^c})
= \frac1{\sqrt n}\sum_{i\in I_k}\tilde\psi(S_i;b^\gamma,\eta^0) + O_{P_N}(\rho_N),
\tag{60}
\]
and
\[
\begin{aligned}
\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})
&= \frac1n\sum_{i\in I_k}\bigl(\psi_1(S_i;\hat\eta^{I_k^c}) - \psi_1(S_i;\eta^0)\bigr)
+ \frac1n\sum_{i\in I_k}\bigl(\psi_1(S_i;\eta^0) - E_{P_N}[\psi_1(S;\eta^0)]\bigr) + E_{P_N}[\psi_1(S;\eta^0)]\\
&= O_{P_N}\bigl(N^{-1/2}\rho_N\bigr) + \frac1n\sum_{i\in I_k}\bigl(\psi_1(S_i;\eta^0) - E_{P_N}[\psi_1(S;\eta^0)]\bigr) + E_{P_N}[\psi_1(S;\eta^0)].
\end{aligned}
\tag{61}
\]
We apply a series expansion to obtain
\[
\begin{aligned}
\Bigl(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Bigr)^{-1}
&= \Bigl(E_{P_N}[\psi_2(S;\eta^0)] + \frac1n\sum_{i\in I_k}\bigl(\psi_2(S_i;\hat\eta^{I_k^c}) - \psi_2(S_i;\eta^0)\bigr)
+ \frac1n\sum_{i\in I_k}\bigl(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\bigr)\Bigr)^{-1}\\
&= E_{P_N}[\psi_2(S;\eta^0)]^{-1}
- E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Bigl(\frac1n\sum_{i\in I_k}\bigl(\psi_2(S_i;\hat\eta^{I_k^c}) - \psi_2(S_i;\eta^0)\bigr)\Bigr)E_{P_N}[\psi_2(S;\eta^0)]^{-1}\\
&\quad - E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Bigl(\frac1n\sum_{i\in I_k}\bigl(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\bigr)\Bigr)E_{P_N}[\psi_2(S;\eta^0)]^{-1}\\
&\quad + O_{P_N}\Bigl(\Bigl\|\frac1n\sum_{i\in I_k}\bigl(\psi_2(S_i;\hat\eta^{I_k^c}) - \psi_2(S_i;\eta^0)\bigr)\Bigr\|^2
+ \Bigl\|\frac1n\sum_{i\in I_k}\bigl(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\bigr)\Bigr\|^2\Bigr)\\
&= E_{P_N}[\psi_2(S;\eta^0)]^{-1} + O_{P_N}\bigl(N^{-1/2}\rho_N\bigr) + O_{P_N}\bigl(N^{-1}\rho_N^2\bigr) + O_{P_N}(N^{-1})\\
&\quad - E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Bigl(\frac1n\sum_{i\in I_k}\bigl(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\bigr)\Bigr)E_{P_N}[\psi_2(S;\eta^0)]^{-1}\\
&= E_{P_N}[\psi_2(S;\eta^0)]^{-1} + O_{P_N}\bigl(N^{-1/2}\rho_N\bigr)
- E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Bigl(\frac1n\sum_{i\in I_k}\bigl(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\bigr)\Bigr)E_{P_N}[\psi_2(S;\eta^0)]^{-1}
\end{aligned}
\tag{62}
\]
due to Lemma I.16, the Lindeberg–Feller CLT, the Cramér–Wold device, because $\rho_N\lesssim\delta_N^{1/4}$ holds by Lemma I.16, and because $\delta_N^{1/4}\ge N^{-1/2}$ holds by Assumption I.5. Thus, the last summand in (58) can be expressed as


\[
\begin{aligned}
&\frac1{\sqrt n}\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\cdot\Bigl(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Bigr)^{-1}\cdot\frac1n\sum_{i\in I_k}\psi(S_i;b^\gamma,\hat\eta^{I_k^c})\\
&= \sqrt n\Bigl[O_{P_N}\bigl(N^{-1/2}\rho_N\bigr) + \frac1n\sum_{i\in I_k}\bigl(\psi_1(S_i;\eta^0) - E_{P_N}[\psi_1(S;\eta^0)]\bigr) + E_{P_N}[\psi_1(S;\eta^0)]\Bigr]\\
&\quad\cdot\Bigl[E_{P_N}[\psi_2(S;\eta^0)]^{-1} + O_{P_N}\bigl(N^{-1/2}\rho_N\bigr)
- E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Bigl(\frac1n\sum_{i\in I_k}\bigl(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\bigr)\Bigr)E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Bigr]\\
&\quad\cdot\Bigl[\frac1n\sum_{i\in I_k}\psi(S_i;b^\gamma,\eta^0) + O_{P_N}\bigl(N^{-1/2}\rho_N\bigr)\Bigr]\\
&= \frac1{\sqrt n}\sum_{i\in I_k}\bigl(\psi_1(S_i;\eta^0) - E_{P_N}[\psi_1(S;\eta^0)]\bigr)\,E_{P_N}[\psi_2(S;\eta^0)]^{-1}E_{P_N}[\psi(S;b^\gamma,\eta^0)]\\
&\quad + E_{P_N}[\psi_1(S;\eta^0)]\,E_{P_N}[\psi_2(S;\eta^0)]^{-1}\,\frac1{\sqrt n}\sum_{i\in I_k}\psi(S_i;b^\gamma,\eta^0)\\
&\quad - E_{P_N}[\psi_1(S;\eta^0)]\,E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Bigl(\frac1{\sqrt n}\sum_{i\in I_k}\bigl(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\bigr)\Bigr)E_{P_N}[\psi_2(S;\eta^0)]^{-1}E_{P_N}[\psi(S;b^\gamma,\eta^0)]\\
&\quad + O_{P_N}(\rho_N)
\end{aligned}
\tag{63}
\]
due to (59)–(62), the Lindeberg–Feller CLT, and the Cramér–Wold device.
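As a side remark, the series expansion used in (62) is the standard second-order expansion of a perturbed matrix inverse: for an invertible matrix $A$ and a perturbation $\Delta$ with $\|\Delta\|\to0$,
\[
(A + \Delta)^{-1} = A^{-1} - A^{-1}\Delta A^{-1} + O\bigl(\|\Delta\|^2\bigr),
\]
applied with $A = E_{P_N}[\psi_2(S;\eta^0)]$ and $\Delta$ equal to the sum of the two empirical deviation terms.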


We combine (58) and (63) and obtain
\[
\begin{aligned}
\sqrt N(\hat b^{\gamma,\mathrm{DML2}} - b^\gamma)
&= \bigl(D_1 + (\gamma-1)D_2 + O_{P_N}(N^{-1/2}(1+\rho_N))\bigr)^{-1}\\
&\quad\cdot\frac1{\sqrt K}\sum_{k=1}^K\frac1{\sqrt n}\sum_{i\in I_k}\Bigl[\tilde\psi(S_i;b^\gamma,\eta^0) + (\gamma-1)D_3\psi(S_i;b^\gamma,\eta^0)\\
&\qquad + (\gamma-1)\bigl(\psi_1(S_i;\eta^0) - E_{P_N}[\psi_1(S;\eta^0)]\bigr)D_5
- (\gamma-1)D_3\bigl(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\bigr)D_5\Bigr] + O_{P_N}(\rho_N)\\
&= \bigl(D_1 + (\gamma-1)D_2\bigr)^{-1}\frac1{\sqrt N}\sum_{i=1}^N\Bigl[\tilde\psi(S_i;b^\gamma,\eta^0) + (\gamma-1)D_3\psi(S_i;b^\gamma,\eta^0)\\
&\qquad + (\gamma-1)\bigl(\psi_1(S_i;\eta^0) - E_{P_N}[\psi_1(S;\eta^0)]\bigr)D_5
- (\gamma-1)D_3\bigl(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\bigr)D_5\Bigr] + O_{P_N}(\rho_N)
\end{aligned}
\tag{64}
\]
by the Lindeberg–Feller CLT and the Cramér–Wold device. We conclude our proof for the DML2 method by the Lindeberg–Feller CLT and the Cramér–Wold device.
Subsequently, we consider the DML1 method. It suffices to show that (56) holds uniformly over $P\in\mathcal P_N$. Fix a sequence $\{P_N\}_{N\ge1}$ such that $P_N\in\mathcal P_N$ for all $N\ge1$. Because this sequence is chosen arbitrarily, it suffices to show
\[
\sqrt N\,\sigma^{-1}(\gamma)(\hat b^{\gamma,\mathrm{DML1}} - b^\gamma)
= \frac1{\sqrt N}\sum_{i=1}^N\psi^\star(S_i;b^\gamma,\eta^0) + O_{P_N}(\rho_N)
\xrightarrow{d} \mathcal N(0, 1_{d\times d}) \quad (N\to\infty).
\]

We have
\[
\begin{aligned}
\hat b^{\gamma,\mathrm{DML1}}
&= \frac1K\sum_{k=1}^K\Bigl((R_X^{I_k})^T\bigl(1 + (\gamma-1)\Pi_{R_A^{I_k}}\bigr)R_X^{I_k}\Bigr)^{-1}
(R_X^{I_k})^T\bigl(1 + (\gamma-1)\Pi_{R_A^{I_k}}\bigr)R_Y^{I_k}\\
&= \frac1K\sum_{k=1}^K\Bigl\{\frac1n\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)^T\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)\\
&\qquad + (\gamma-1)\cdot\frac1n\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)^T\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)\\
&\qquad\quad\cdot\Bigl(\frac1n\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)^T\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)\Bigr)^{-1}
\frac1n\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)^T\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)\Bigr\}^{-1}\\
&\quad\cdot\Bigl\{\frac1n\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)^T\bigl(Y^{I_k} - \hat m_Y^{I_k^c}(W^{I_k})\bigr)\\
&\qquad + (\gamma-1)\cdot\frac1n\bigl(X^{I_k} - \hat m_X^{I_k^c}(W^{I_k})\bigr)^T\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)\\
&\qquad\quad\cdot\Bigl(\frac1n\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)^T\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)\Bigr)^{-1}
\frac1n\bigl(A^{I_k} - \hat m_A^{I_k^c}(W^{I_k})\bigr)^T\bigl(Y^{I_k} - \hat m_Y^{I_k^c}(W^{I_k})\bigr)\Bigr\}
\end{aligned}
\tag{65}
\]
by (20).
by (20). By Slutsky’s theorem and Equation (65), we have



 N (b̂
γ,DML1
− bγ ) 
−1
+ OPN N − 2 (1 + ρN )
1
= D1 + (γ − 1)D2
K Ic T
· √1K k=1 √1n X Ik − m̂Xk (W Ik )
 
Ic Ic T γ
· Y Ik − m̂Yk (W Ik ) − X Ik − m̂Xk (W Ik ) b
Ic T Ic
+(γ − 1) · n1 X Ik − m̂Xk (W Ik ) AIk − m̂Ak (W Ik )
 c
−1
I T Ic
· n1 AIk − m̂Ak (W Ik ) AIk − m̂Ak (W Ik )
Ic T
· AIk − m̂Ak (W Ik )
 
Ic Ic T
· Y Ik − m̂Yk (W Ik ) − X Ik − m̂Xk (W Ik ) bγ
 
−1
+ OPN N − 2 (1 + ρN )
1
= D1 + (γ − 1)D2
K √    i ; bγ , η̂ Ikc )
· √1K k=1 n n1 i∈Ik ψ(S
 Ikc
 c −1
+(γ − 1) · n1 i∈Ik ψ1 (Si ; η̂ ) · n1 i∈Ik ψ2 (Si ; η̂ Ik )
 c
· n1 i∈I c ψ(Si ; bγ , η̂ Ik ) .
k

The last expression above coincides with (58). Consequently, the same asymp-
totic analysis conducted for b̂γ,DML2 can also be employed in this case.
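In code, the only difference between (57) and (65) is the order of solving and averaging. Schematically, reusing the hypothetical per-fold pieces Ms and vs from the sketch after (57):

## DML1: solve fold-wise, then average the K solutions.
bhat_dml1 <- Reduce(`+`, Map(function(M, v) solve(M, v), Ms, vs)) / K
## DML2: average the fold-wise moment pieces first, then solve once.
bhat_dml2 <- solve(Reduce(`+`, Ms) / K, Reduce(`+`, vs) / K)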
Lemma J.2. Let $\gamma\ge0$ and let $\varphi\in\{\psi,\tilde\psi\}$. We have
\[
\frac1n\sum_{i\in I_k}\varphi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) = E_P\bigl[\varphi(S;b^\gamma,\eta^0)\bigr] + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr).
\]

Proof. We consider the case $\varphi = \psi$. We decompose
\[
\begin{aligned}
\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\bigl[\psi(S;b^\gamma,\eta^0)\bigr]
&= \frac1n\sum_{i\in I_k}\bigl(\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - \psi(S_i;b^\gamma,\hat\eta^{I_k^c})\bigr)\\
&\quad + \frac1n\sum_{i\in I_k}\bigl(\psi(S_i;b^\gamma,\hat\eta^{I_k^c}) - \psi(S_i;b^\gamma,\eta^0)\bigr)\\
&\quad + \frac1n\sum_{i\in I_k}\bigl(\psi(S_i;b^\gamma,\eta^0) - E_P[\psi(S;b^\gamma,\eta^0)]\bigr).
\end{aligned}
\tag{66}
\]

Subsequently, we analyze the three terms in the above decomposition (66) individually. We have
\[
\begin{aligned}
\Bigl\|\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - \frac1n\sum_{i\in I_k}\psi(S_i;b^\gamma,\hat\eta^{I_k^c})\Bigr\|
&\le \Bigl\|\frac1n\sum_{i\in I_k}\bigl(A_i - \hat m_A^{I_k^c}(W_i)\bigr)\bigl(X_i - \hat m_X^{I_k^c}(W_i)\bigr)^T\Bigr\|\;\bigl\|\hat b^\gamma - b^\gamma\bigr\|\\
&= \Bigl\|\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\Bigr\|\;\bigl\|\hat b^\gamma - b^\gamma\bigr\|\\
&= \Bigl\|E_P[\psi_1(S;\eta^0)] + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)\Bigr\|\;\bigl\|\hat b^\gamma - b^\gamma\bigr\|
\end{aligned}
\]
by Lemma I.17. Because $\|\hat b^\gamma - b^\gamma\| = O_P(N^{-1/2}\rho_N)$ holds by Theorem 4.1, we infer
\[
\Bigl\|\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - \frac1n\sum_{i\in I_k}\psi(S_i;b^\gamma,\hat\eta^{I_k^c})\Bigr\| = O_P\bigl(N^{-1/2}\rho_N\bigr).
\tag{67}
\]
Due to (59) that was established in the proof of Theorem 4.1, we have
\[
\frac1n\sum_{i\in I_k}\bigl(\psi(S_i;b^\gamma,\hat\eta^{I_k^c}) - \psi(S_i;b^\gamma,\eta^0)\bigr) = O_P\bigl(N^{-1/2}\rho_N\bigr).
\tag{68}
\]
Due to the Lindeberg–Feller CLT and the Cramér–Wold device, we have
\[
\frac1n\sum_{i\in I_k}\bigl(\psi(S_i;b^\gamma,\eta^0) - E_P[\psi(S;b^\gamma,\eta^0)]\bigr) = O_P\bigl(N^{-1/2}\bigr).
\tag{69}
\]
We combine (66)–(69) to infer the claim for $\varphi = \psi$. The case $\varphi = \tilde\psi$ can be analyzed analogously.

Theorem J.3. Suppose Assumption I.5 holds. Recall the score functions introduced in Definition I.1, and let $\hat b^\gamma\in\{\hat b^{\gamma,\mathrm{DML1}},\hat b^{\gamma,\mathrm{DML2}}\}$. Introduce the matrices
\[
\begin{aligned}
\hat D_1^k &:= \frac1n\sum_{i\in I_k}\psi_3(S_i;\hat\eta^{I_k^c}),\\
\hat D_2^k &:= \frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\Bigl(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Bigr)^{-1}\frac1n\sum_{i\in I_k}\psi_1^T(S_i;\hat\eta^{I_k^c}),\\
\hat D_3^k &:= \frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\Bigl(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Bigr)^{-1},\\
\hat D_5^k &:= \Bigl(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Bigr)^{-1}\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}).
\end{aligned}
\]
Let furthermore
\[
\begin{aligned}
\hat{\bar\psi}(\cdot;\hat b^\gamma,\hat\eta^{I_k^c})
&:= \tilde\psi(\cdot;\hat b^\gamma,\hat\eta^{I_k^c}) + (\gamma-1)\hat D_3^k\,\psi(\cdot;\hat b^\gamma,\hat\eta^{I_k^c})\\
&\quad + (\gamma-1)\Bigl(\psi_1(\cdot;\hat\eta^{I_k^c}) - \frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\Bigr)\hat D_5^k
- (\gamma-1)\hat D_3^k\Bigl(\psi_2(\cdot;\hat\eta^{I_k^c}) - \frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Bigr)\hat D_5^k
\end{aligned}
\]
and
\[
\hat D_4^k := \frac1n\sum_{i\in I_k}\hat{\bar\psi}(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\bigl(\hat{\bar\psi}(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\bigr)^T.
\]
Define the estimators
\[
\hat D_1 := \frac1K\sum_{k=1}^K\hat D_1^k, \qquad \hat D_2 := \frac1K\sum_{k=1}^K\hat D_2^k, \qquad \hat D_4 := \frac1K\sum_{k=1}^K\hat D_4^k.
\]
We estimate the asymptotic variance-covariance matrix $\sigma^2(\gamma)$ in Theorem 4.1 by
\[
\hat\sigma^2(\gamma) := \bigl(\hat D_1 + (\gamma-1)\hat D_2\bigr)^{-1}\hat D_4\bigl(\hat D_1^T + (\gamma-1)\hat D_2^T\bigr)^{-1}.
\]
Then we have $\hat\sigma^2(\gamma) = \sigma^2(\gamma) + O_P\bigl(\tilde\rho_N + N^{-1/2}(1+\rho_N)\bigr)$, where $\tilde\rho_N = N^{\max(4/p-1,\,-1/2)} + r_N$ is as in Definition I.4.
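Given fold-averaged estimates of $D_1$, $D_2$, and $D_4$, the estimator $\hat\sigma^2(\gamma)$ is again a simple sandwich. A minimal R sketch with hypothetical inputs (illustrative only, not the dmlalg API):

## Plug-in variance estimator of Theorem J.3 for a given gamma.
sigma2_gamma_hat <- function(D1hat, D2hat, D4hat, gamma) {
  B <- D1hat + (gamma - 1) * D2hat
  solve(B, D4hat) %*% t(solve(B))   # B^{-1} D4 (B^T)^{-1}
}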

Proof of Theorem J.3. This proof is based on Chernozhukov et al. [31]. We already verified
\[
\hat D_1 = D_1 + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr) \qquad\text{and}\qquad \hat D_2 = D_2 + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)
\]
in the proof of Theorem 4.1 because $K$ is a fixed integer independent of $N$. Thus, we have
\[
\bigl(\hat D_1 + (\gamma-1)\hat D_2\bigr)^{-1} = \bigl(D_1 + (\gamma-1)D_2\bigr)^{-1} + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)
\]
by Weyl's inequality. Moreover, we have $\hat D_3^k = D_3 + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)$ by Lemma I.17.
Subsequently, we argue that $\hat D_5^k = D_5 + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)$ holds. Due to Lemma I.17 and Weyl's inequality, we have
\[
\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c}) = E_P[\psi_1(S;\eta^0)] + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)
\]
and
\[
\Bigl(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Bigr)^{-1} = E_P[\psi_2(S;\eta^0)]^{-1} + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr).
\tag{70}
\]
Due to (70), it suffices to show
\[
\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) = E_P[\psi(S;b^\gamma,\eta^0)] + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)
\tag{71}
\]
to infer $\hat D_5^k = D_5 + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)$. But (71) holds due to Lemma J.2. To conclude the theorem, it remains to verify $\hat D_4^k = D_4 + O_P(\tilde\rho_N)$.

The norm $\|\hat D_4^k - D_4\|$ can be bounded as follows. By the results derived so far, replacing $\hat D_3^k$, $\hat D_5^k$, and the empirical means $\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})$ and $\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})$ in the definition of $\hat{\bar\psi}$ by $D_3$, $D_5$, $E_P[\psi_1(S;\eta^0)]$, and $E_P[\psi_2(S;\eta^0)]$, respectively, costs an error of order $O_P(N^{-1/2}(1+\rho_N))$. Expanding the resulting empirical second-moment matrix and $D_4 = E_P[\bar\psi\bar\psi^T]$ into the sixteen cross products of the four summands
\[
\tilde\psi, \qquad (\gamma-1)D_3\psi, \qquad (\gamma-1)\bigl(\psi_1 - E_P[\psi_1(S;\eta^0)]\bigr)D_5, \qquad -(\gamma-1)D_3\bigl(\psi_2 - E_P[\psi_2(S;\eta^0)]\bigr)D_5
\]
(evaluated at $(\hat b^\gamma,\hat\eta^{I_k^c})$ in the empirical averages and at $(b^\gamma,\eta^0)$ inside the expectations) and applying the triangle inequality termwise yields
\[
\bigl\|\hat D_4^k - D_4\bigr\| \le \sum_{i=1}^{16} I_i + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr),
\]
where each $I_i$ is the norm of the difference between the empirical average and the population expectation of one such cross product; for instance,
\[
I_1 = \Bigl\|\frac1n\sum_{i\in I_k}\tilde\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\tilde\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\bigl[\tilde\psi(S;b^\gamma,\eta^0)\tilde\psi^T(S;b^\gamma,\eta^0)\bigr]\Bigr\|
\]
and
\[
I_2 = (\gamma-1)\Bigl\|\frac1n\sum_{i\in I_k}\tilde\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_3^T - E_P\bigl[\tilde\psi(S;b^\gamma,\eta^0)\psi^T(S;b^\gamma,\eta^0)\bigr]D_3^T\Bigr\|,
\]

and analogously for the remaining cross products, each weighted by $(\gamma-1)$ or $(\gamma-1)^2$. Subsequently, we bound the terms $I_1,\dots,I_{16}$ individually. Because all these terms consist of norms of matrices of fixed size, it suffices to bound the individual matrix entries. Let $j, l, t, r$ be natural numbers not exceeding the dimensions of the respective objects they index.
By Lemma I.20, we have
\[
\Bigl|\frac1n\sum_{i\in I_k}\tilde\psi_j(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\tilde\psi_l(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\bigl[\tilde\psi_j(S;b^\gamma,\eta^0)\tilde\psi_l(S;b^\gamma,\eta^0)\bigr]\Bigr| = O_P(\tilde\rho_N),
\]
which implies $I_1 = O_P(\tilde\rho_N)$. By Lemma I.20, we likewise have
\[
\Bigl|\frac1n\sum_{i\in I_k}\tilde\psi_j(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\psi_l(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\bigl[\tilde\psi_j(S;b^\gamma,\eta^0)\psi_l(S;b^\gamma,\eta^0)\bigr]\Bigr| = O_P(\tilde\rho_N),
\]
which implies $I_2 = O_P(\tilde\rho_N) = I_3$ due to
\[
\Bigl\|\frac1n\sum_{i\in I_k}\tilde\psi\psi^TD_3^T - E_P[\tilde\psi\psi^T]D_3^T\Bigr\|
\le \Bigl\|\frac1n\sum_{i\in I_k}\tilde\psi\psi^T - E_P[\tilde\psi\psi^T]\Bigr\|\,\|D_3\|
\]
and the analogous bound with $D_3$ multiplying from the left. The same entrywise bound applied to $\psi$, together with the factor $\|D_3\|^2$, yields $I_4 = O_P(\tilde\rho_N)$.
By Lemma I.20, we have
\[
\Bigl|\frac1n\sum_{i\in I_k}\tilde\psi_j(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\bigl(\psi_1(S_i;\hat\eta^{I_k^c})\bigr)_{l,t} - E_P\Bigl[\tilde\psi_j(S;b^\gamma,\eta^0)\bigl(\psi_1(S;\eta^0)\bigr)_{l,t}\Bigr]\Bigr| = O_P(\tilde\rho_N),
\]
which implies $I_5 = O_P(\tilde\rho_N)$ because we have
\[
\begin{aligned}
&\Bigl\|\frac1n\sum_{i\in I_k}\tilde\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\bigl(\psi_1(S_i;\hat\eta^{I_k^c}) - E_P[\psi_1(S;\eta^0)]\bigr)^T
- E_P\Bigl[\tilde\psi(S;b^\gamma,\eta^0)D_5^T\bigl(\psi_1(S;\eta^0) - E_P[\psi_1(S;\eta^0)]\bigr)^T\Bigr]\Bigr\|\\
&\quad\le \Bigl\|\frac1n\sum_{i\in I_k}\tilde\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\psi_1^T(S_i;\hat\eta^{I_k^c}) - E_P\bigl[\tilde\psi(S;b^\gamma,\eta^0)D_5^T\psi_1^T(S;\eta^0)\bigr]\Bigr\|\\
&\qquad + \Bigl\|\frac1n\sum_{i\in I_k}\tilde\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\bigl[\tilde\psi(S;b^\gamma,\eta^0)\bigr]\Bigr\|\,\|D_5\|\,\bigl\|E_P[\psi_1(S;\eta^0)]\bigr\|,
\end{aligned}
\]
where the last summand is $O_P(N^{-1/2}(1+\rho_N))$ by Lemma J.2 and the first summand is bounded entrywise by $\|D_5\|$ times the Lemma I.20 rate. The term $I_6$ can be bounded analogously to $I_5$.
By Lemma I.20, we have
\[
\Bigl|\frac1n\sum_{i\in I_k}\bigl(\psi_1(S_i;\hat\eta^{I_k^c})\bigr)_{j,l}\bigl(\psi_1(S_i;\hat\eta^{I_k^c})\bigr)_{t,r} - E_P\Bigl[\bigl(\psi_1(S;\eta^0)\bigr)_{j,l}\bigl(\psi_1(S;\eta^0)\bigr)_{t,r}\Bigr]\Bigr| = O_P(\tilde\rho_N),
\]
which implies $I_7 = O_P(\tilde\rho_N)$: after centering both factors at $E_P[\psi_1(S;\eta^0)]$, the cross terms of the form $\|\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c}) - E_P[\psi_1(S;\eta^0)]\|\,\|D_5\|^2\,\|E_P[\psi_1(S;\eta^0)]\|$ are $O_P(N^{-1/2}(1+\rho_N))$ by Lemma I.17, and the leading difference is bounded by $\|D_5\|^2$ times the entrywise rate.
By Lemma I.20, the mixed entries of $\tilde\psi$ and $\psi_2$ satisfy
\[
\Bigl|\frac1n\sum_{i\in I_k}\tilde\psi_j(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\bigl(\psi_2(S_i;\hat\eta^{I_k^c})\bigr)_{t,r} - E_P\Bigl[\tilde\psi_j(S;b^\gamma,\eta^0)\bigl(\psi_2(S;\eta^0)\bigr)_{t,r}\Bigr]\Bigr| = O_P(\tilde\rho_N),
\]
which implies $I_8 = O_P(\tilde\rho_N)$: splitting off the population mean $E_P[\psi_2(S;\eta^0)]$ leaves
\[
\|D_3\|\,\Bigl\|\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})D_5\tilde\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\bigl[\psi_2(S;\eta^0)D_5\tilde\psi^T(S;b^\gamma,\eta^0)\bigr]\Bigr\|
\]
plus a remainder $\|D_3\|\,\|E_P[\psi_2(S;\eta^0)]D_5\|\,\|\frac1n\sum_{i\in I_k}\tilde\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P[\tilde\psi(S;b^\gamma,\eta^0)]\|$, which is $O_P(N^{-1/2}(1+\rho_N))$ by Lemma J.2. The term $I_9$ can be bounded analogously to $I_8$.
The terms $I_{10}$ and $I_{11}$, which couple $D_3\psi$ with $(\psi_1 - E_P[\psi_1(S;\eta^0)])D_5$, are bounded in the same manner, using the entrywise Lemma I.20 rate for products of $\psi$ and $\psi_1$ together with the factors $\|D_3\|$ and $\|D_5\|$ and a Lemma J.2 remainder.
By Lemma I.20, we have
\[
\Bigl|\frac1n\sum_{i\in I_k}\bigl(\psi_1(S_i;\hat\eta^{I_k^c})\bigr)_{j,l}\bigl(\psi_2(S_i;\hat\eta^{I_k^c})\bigr)_{t,r} - E_P\Bigl[\bigl(\psi_1(S;\eta^0)\bigr)_{j,l}\bigl(\psi_2(S;\eta^0)\bigr)_{t,r}\Bigr]\Bigr| = O_P(\tilde\rho_N),
\]
which implies $I_{12} = O_P(\tilde\rho_N)$: centering both factors at their population means costs terms of order $O_P(N^{-1/2}(1+\rho_N))$ by Lemma I.17, and the leading difference is bounded by $\|D_3\|\,\|D_5\|^2$ times the entrywise rate. The term $I_{15}$ can be bounded analogously to $I_{12}$.
The term $I_{13}$, which couples $D_3\psi$ with $D_3(\psi_2 - E_P[\psi_2(S;\eta^0)])D_5$, satisfies
\[
I_{13} \le \|D_3\|^2\,\Bigl\|\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\psi_2^T(S_i;\hat\eta^{I_k^c}) - E_P\bigl[\psi(S;b^\gamma,\eta^0)D_5^T\psi_2^T(S;\eta^0)\bigr]\Bigr\| + O_P\bigl(N^{-1/2}(1+\rho_N)\bigr)
\]
by Lemma J.2, and the first summand is $O_P(\tilde\rho_N)$ by the entrywise Lemma I.20 rate. The term $I_{14}$ can be bounded analogously to $I_{13}$.
Finally, by Lemma I.20, we have
\[
\Bigl|\frac1n\sum_{i\in I_k}\bigl(\psi_2(S_i;\hat\eta^{I_k^c})\bigr)_{j,l}\bigl(\psi_2(S_i;\hat\eta^{I_k^c})\bigr)_{t,r} - E_P\Bigl[\bigl(\psi_2(S;\eta^0)\bigr)_{j,l}\bigl(\psi_2(S;\eta^0)\bigr)_{t,r}\Bigr]\Bigr| = O_P(\tilde\rho_N),
\]
which implies $I_{16} = O_P(\tilde\rho_N)$: centering at $E_P[\psi_2(S;\eta^0)]$ costs $O_P(N^{-1/2}(1+\rho_N))$ by Lemma I.17, and the leading difference is bounded by $\|D_3\|^2\,\|D_5\|^2$ times the entrywise rate. Hence $\hat D_4^k = D_4 + O_P(\tilde\rho_N)$, which completes the proof.
Proof of Proposition 4.2. The statement of Proposition 4.2 can be reformulated as
\[
\sqrt N\,|b^{\gamma_N} - \beta_0| \to
\begin{cases}
0 & \text{if } \gamma_N = \Omega(\sqrt N) \text{ and } \gamma_N \notin \Theta(\sqrt N),\\
C & \text{if } \gamma_N = \Theta(\sqrt N),\\
\infty & \text{if } \gamma_N = o(\sqrt N)
\end{cases}
\]
using the Bachmann–Landau notation, which is presented in Lattimore and Szepesvári [58], for instance.
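For completeness, the asymptotic symbols in this case distinction can be spelled out as follows (a standard recollection, not specific to [58]):
\[
\gamma_N = \Omega(\sqrt N) \iff \liminf_{N\to\infty}\frac{\gamma_N}{\sqrt N} > 0, \qquad
\gamma_N = \Theta(\sqrt N) \iff 0 < \liminf_{N\to\infty}\frac{\gamma_N}{\sqrt N} \le \limsup_{N\to\infty}\frac{\gamma_N}{\sqrt N} < \infty, \qquad
\gamma_N = o(\sqrt N) \iff \frac{\gamma_N}{\sqrt N} \to 0.
\]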
Introduce the matrices
\[
\begin{aligned}
F_1 &:= E_P[R_X R_Y], & F_2 &:= E_P[R_X R_X^T],\\
G_1 &:= E_P[R_X R_A^T]\,E_P[R_A R_A^T]^{-1}E_P[R_A R_Y], & G_2 &:= E_P[R_X R_A^T]\,E_P[R_A R_A^T]^{-1}E_P[R_A R_X^T].
\end{aligned}
\]
We have
\[
\sqrt N\,|b^{\gamma_N} - \beta_0| = \sqrt N\,\Bigl\|\bigl(F_2 + (\gamma_N - 1)G_2\bigr)^{-1}\bigl(F_1 + (\gamma_N - 1)G_1\bigr) - G_2^{-1}G_1\Bigr\|.
\]

First, we assume that the sequence $\{\gamma_N\}_{N\ge1}$ diverges to $+\infty$ as $N\to\infty$, so that $\gamma_N - 1$ is bounded away from $0$ for $N$ large enough. By Henderson and Searle [50, Section 3], we have
\[
\bigl(F_2 + (\gamma_N - 1)G_2\bigr)^{-1}
= \frac{1}{\gamma_N - 1}\Bigl[G_2^{-1} - \Bigl(1 + \frac{1}{\gamma_N - 1}G_2^{-1}F_2\Bigr)^{-1}\frac{1}{\gamma_N - 1}G_2^{-1}F_2G_2^{-1}\Bigr].
\]
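This matrix identity can be verified numerically. A quick R check with hypothetical symmetric positive definite stand-ins for $F_2$ and $G_2$:

## Numeric sanity check of the Henderson-Searle form used above.
set.seed(1)
d <- 3; gamma <- 50; mu <- 1 / (gamma - 1)
F2 <- crossprod(matrix(rnorm(d^2), d)) + diag(d)   # s.p.d. stand-in
G2 <- crossprod(matrix(rnorm(d^2), d)) + diag(d)   # s.p.d. stand-in
lhs <- solve(F2 + (gamma - 1) * G2)
rhs <- mu * (solve(G2) -
  solve(diag(d) + mu * solve(G2) %*% F2) %*% (mu * solve(G2) %*% F2 %*% solve(G2)))
max(abs(lhs - rhs))   # agrees up to floating-point error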

Hence, we have
\[
\begin{aligned}
\sqrt N\,|b^{\gamma_N} - \beta_0|
&= \frac{\sqrt N}{\gamma_N - 1}\,\Bigl\|G_2^{-1}F_1
- \Bigl(1 + \frac{1}{\gamma_N - 1}G_2^{-1}F_2\Bigr)^{-1}\frac{1}{\gamma_N - 1}G_2^{-1}F_2G_2^{-1}F_1\\
&\qquad\qquad - \Bigl(1 + \frac{1}{\gamma_N - 1}G_2^{-1}F_2\Bigr)^{-1}G_2^{-1}F_2G_2^{-1}G_1\Bigr\|
\end{aligned}
\]
and infer our claim because we have
\[
G_2^{-1}F_1 - \Bigl(1 + \frac{1}{\gamma_N - 1}G_2^{-1}F_2\Bigr)^{-1}\frac{1}{\gamma_N - 1}G_2^{-1}F_2G_2^{-1}F_1
- \Bigl(1 + \frac{1}{\gamma_N - 1}G_2^{-1}F_2\Bigr)^{-1}G_2^{-1}F_2G_2^{-1}G_1 = O(1).
\]
Next, we assume that the sequence $\{\gamma_N\}_{N\ge1}$ is bounded. We have
\[
|b^{\gamma_N} - \beta_0| = \Bigl\|\bigl(F_2 + (\gamma_N - 1)G_2\bigr)^{-1}\bigl(F_1 + (\gamma_N - 1)G_1\bigr) - G_2^{-1}G_1\Bigr\| = \Theta(1),
\]
which concludes the proof.


Proof of Theorem 4.3. We show that
\[
P\bigl(\hat\sigma^2(\gamma_N) + N(\hat b^{\gamma_N} - \hat\beta)^2 \le \hat\sigma^2\bigr) \le P\bigl(|\Xi_N| \ge C_N\bigr)
\]
holds for some random variable $\Xi_N$ satisfying $\Xi_N = O_P(1)$ and for some sequence $\{C_N\}_{N\ge1}$ of non-negative numbers diverging to $+\infty$ as $N\to\infty$.
For real numbers $a$ and $b$, observe that we have
\[
\sqrt{|a|^2 + |b|^2} \ge \tfrac12|a| + \tfrac12|b|
\]
due to
\[
|a|^2 + |b|^2 - \bigl(\tfrac12|a| + \tfrac12|b|\bigr)^2 = \tfrac34|a|^2 + \tfrac34|b|^2 - \tfrac12|a|\,|b| \ge \tfrac34\bigl(|a| - |b|\bigr)^2 \ge 0.
\]
Thus, we have
\[
P\bigl(\hat\sigma^2(\gamma_N) + N(\hat b^{\gamma_N} - \hat\beta)^2 \le \hat\sigma^2\bigr)
= P\Bigl(\sqrt{\hat\sigma^2(\gamma_N) + N(\hat b^{\gamma_N} - \hat\beta)^2} \le \hat\sigma\Bigr)
\le P\Bigl(\hat\sigma(\gamma_N) + \sqrt N\,|\hat b^{\gamma_N} - \hat\beta| \le 2\hat\sigma\Bigr).
\]
By the reverse triangle inequality, we have
\[
|\hat b^{\gamma_N} - \hat\beta|
= |\hat b^{\gamma_N} - b^{\gamma_N} + b^{\gamma_N} - \beta_0 + \beta_0 - \hat\beta|
\ge |b^{\gamma_N} - \beta_0| - |\hat b^{\gamma_N} - b^{\gamma_N}| - |\beta_0 - \hat\beta|.
\]
Thus, we have
\[
\begin{aligned}
P\bigl(\hat\sigma^2(\gamma_N) + N(\hat b^{\gamma_N} - \hat\beta)^2 \le \hat\sigma^2\bigr)
&\le P\Bigl(\hat\sigma(\gamma_N) + \sqrt N|b^{\gamma_N} - \beta_0| - \sqrt N|\hat b^{\gamma_N} - b^{\gamma_N}| - \sqrt N|\beta_0 - \hat\beta| \le 2\hat\sigma\Bigr)\\
&= P\Bigl(\sqrt N|b^{\gamma_N} - \beta_0| \le 2\hat\sigma - \hat\sigma(\gamma_N) + \sqrt N|\hat b^{\gamma_N} - b^{\gamma_N}| + \sqrt N|\beta_0 - \hat\beta|\Bigr)\\
&\le P\Bigl(\bigl|\hat\sigma(\gamma_N) - 2\hat\sigma - \sqrt N(\hat b^{\gamma_N} - b^{\gamma_N}) - \sqrt N(\beta_0 - \hat\beta)\bigr| \ge \sqrt N|b^{\gamma_N} - \beta_0|\Bigr)
\end{aligned}
\]
by the reverse triangle inequality. Let us introduce the random variable
\[
\Xi_N := \hat\sigma(\gamma_N) - 2\hat\sigma - \sqrt N(\hat b^{\gamma_N} - b^{\gamma_N}) - \sqrt N(\beta_0 - \hat\beta)
\]
and the deterministic number $C_N := \sqrt N|b^{\gamma_N} - \beta_0|$. By Lemma J.6, we have $\Xi_N = O_P(1)$. Let $\varepsilon > 0$, and choose $C_\varepsilon$ and $N_\varepsilon$ such that for all $N \ge N_\varepsilon$ the statement $P(|\Xi_N| > C_\varepsilon) < \varepsilon$ holds. By Proposition 4.2, $C_N$ tends to infinity as $N\to\infty$ due to $\gamma_N = o(\sqrt N)$. Hence, there exists some $\tilde N = \tilde N(C_\varepsilon)$ such that we have $C_N > C_\varepsilon$ for all $N \ge \tilde N$. This implies $P(|\Xi_N| > C_N) \le P(|\Xi_N| > C_\varepsilon)$ for all $N \ge \tilde N$.
Let $\bar N := \max\{N_\varepsilon, \tilde N\}$. For all $N \ge \bar N$, we therefore have $P(|\Xi_N| > C_N) < \varepsilon$. We conclude $\lim_{N\to\infty} P(|\Xi_N| > C_N) = 0$.
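In code, the comparison event analyzed in this proof is elementary. A hypothetical one-dimensional sketch (illustrative names, not the dmlalg API):

## regsDML-style selection: keep the regularization-only estimate unless the
## TSLS DML estimate has smaller estimated mean squared error; cf. the event
## {sigma2_gamma + N * (b_gamma - b_dml)^2 <= sigma2_dml} analyzed above.
select_estimate <- function(b_gamma, sigma2_gamma, b_dml, sigma2_dml, N) {
  if (sigma2_gamma + N * (b_gamma - b_dml)^2 <= sigma2_dml) {
    list(estimate = b_gamma, sigma2 = sigma2_gamma)
  } else {
    list(estimate = b_dml, sigma2 = sigma2_dml)
  }
}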
Lemma J.4. Let $\gamma_N = o(\sqrt N)$. We have $\sqrt N(\hat b^{\gamma_N} - b^{\gamma_N}) = O_P(1)$.
Proof of Lemma J.4. We already verified $\hat D_1 = D_1 + o_P(1)$ and $\hat D_2 = D_2 + o_P(1)$ in the proof of Theorem 4.1. Let us assume that $\gamma_N$ diverges to $+\infty$ as $N\to\infty$. We then have
\[
\begin{aligned}
\bigl(\hat D_1 + (\gamma_N - 1)\hat D_2\bigr)^{-1}
&= \frac{1}{\gamma_N - 1}\Bigl(\frac{1}{\gamma_N - 1}D_1 + D_2 + o_P(1) + \frac{1}{\gamma_N - 1}o_P(1)\Bigr)^{-1}\\
&= \frac{1}{\gamma_N - 1}\Bigl(\frac{1}{\gamma_N - 1}D_1 + D_2 + o_P(1)\Bigr)^{-1}\\
&= \bigl(D_1 + (\gamma_N - 1)D_2\bigr)^{-1} + o_P\Bigl(\frac{1}{\gamma_N - 1}\Bigr)
\end{aligned}
\]
because $\frac{1}{\gamma_N - 1} = O(1)$ holds. Furthermore, we have
\[
\begin{aligned}
\sqrt N(\hat b^{\gamma_N} - b^{\gamma_N})
&= \Bigl(\bigl(D_1 + (\gamma_N - 1)D_2\bigr)^{-1} + o_P\Bigl(\frac{1}{\gamma_N - 1}\Bigr)\Bigr)\\
&\quad\cdot\frac1{\sqrt K}\sum_{k=1}^K\Bigl[\frac1{\sqrt n}\sum_{i\in I_k}\tilde\psi(S_i;b^{\gamma_N},\hat\eta^{I_k^c})
+ (\gamma_N - 1)\frac1{\sqrt n}\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})
\Bigl(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Bigr)^{-1}
\frac1n\sum_{i\in I_k}\psi(S_i;b^{\gamma_N},\hat\eta^{I_k^c})\Bigr]
\end{aligned}
\]
by (14). Lemma I.16 states that
\[
\Bigl\|\frac1{\sqrt n}\sum_{i\in I_k}\varphi(S_i;b^0,\hat\eta^{I_k^c}) - \frac1{\sqrt n}\sum_{i\in I_k}\varphi(S_i;b^0,\eta^0)\Bigr\| = O_P(\rho_N)
\]
holds for $k\in[K]$, $\varphi\in\{\psi,\tilde\psi,\psi_1,\psi_2\}$, and $b^0\in\{b^\gamma,\beta_0,0\}$, where $\rho_N = r_N + N^{1/2}\lambda_N$ is as in Definition I.4 and satisfies $\rho_N\lesssim\delta_N^{1/4}$, and where we interpret $\psi_2(S;b,\eta) = \psi_2(S;\eta)$. This statement remains valid in the present setting because there exists some finite real constant $C$ such that we have $|b^{\gamma_N}|\le C$ for $N$ large enough. Hence, we have
\[
\begin{aligned}
\sqrt N(\hat b^{\gamma_N} - b^{\gamma_N})
&= \Bigl(\frac{1}{\gamma_N - 1}D_1 + D_2 + o_P(1)\Bigr)^{-1}\\
&\quad\cdot\frac1{\sqrt K}\sum_{k=1}^K\frac1{\sqrt n}\sum_{i\in I_k}\Bigl[\frac{1}{\gamma_N - 1}\tilde\psi(S_i;b^{\gamma_N},\eta^0) + D_3\psi(S_i;b^{\gamma_N},\eta^0)\\
&\qquad + \bigl(\psi_1(S_i;\eta^0) - E_P[\psi_1(S;\eta^0)]\bigr)D_5
- D_3\bigl(\psi_2(S_i;\eta^0) - E_P[\psi_2(S;\eta^0)]\bigr)D_5\Bigr] + o_P(1)
\end{aligned}
\]
by (64). Consider the random variables

\[
\tilde X_i := \frac{1}{\gamma_N - 1}\tilde\psi(S_i;b^{\gamma_N},\eta^0) + D_3\psi(S_i;b^{\gamma_N},\eta^0)
+ \bigl(\psi_1(S_i;\eta^0) - E_P[\psi_1(S;\eta^0)]\bigr)D_5
- D_3\bigl(\psi_2(S_i;\eta^0) - E_P[\psi_2(S;\eta^0)]\bigr)D_5
\]
for $i\in[N]$, $S_n := \sum_{i\in I_k}\tilde X_i$, and $V_n := \sum_{i\in I_k}E_P[\tilde X_i^2]$, where $n = \frac NK$ denotes the size of $I_k$. The Lyapunov condition is satisfied for $\delta = 2 > 0$ because
\[
\frac{1}{\bigl(\sum_{i\in I_k}E_P[\tilde X_i^2]\bigr)^{(2+\delta)/2}}\sum_{i\in I_k}E_P\bigl[|\tilde X_i|^{2+\delta}\bigr]
= \frac{1}{n^{\delta/2}}\cdot\frac{E_P\bigl[|\tilde X_1|^{2+\delta}\bigr]}{\bigl(E_P[\tilde X_1^2]\bigr)^{(2+\delta)/2}} \to 0
\]
holds as $n\to\infty$. Therefore, the Lindeberg–Feller condition is satisfied, which implies $\frac{S_n}{\sqrt{V_n}}\xrightarrow{d}\mathcal N(0,1)$ as $n\to\infty$.
The case where the sequence $\gamma_N$ is bounded can be analyzed analogously.
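For reference, the Lyapunov criterion invoked in this proof is the standard one for independent summands: if, for some $\delta > 0$,
\[
\lim_{n\to\infty}\frac{1}{V_n^{(2+\delta)/2}}\sum_{i\in I_k}E_P\bigl[|\tilde X_i|^{2+\delta}\bigr] = 0,
\]
then the Lindeberg condition holds and $S_n/\sqrt{V_n}\xrightarrow{d}\mathcal N(0,1)$. With i.i.d. centered $\tilde X_i$ and $\delta = 2$, the left-hand side reduces to $E_P[\tilde X_1^4]/\bigl(n\,(E_P[\tilde X_1^2])^2\bigr)$, which vanishes whenever the fourth moment is finite.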

Lemma J.5. Let $\gamma_N = o(\sqrt N)$. We then have $\hat\sigma^2(\gamma_N) = O_P(1)$.
Proof of Lemma J.5. We have
\[
\hat\sigma^2(\gamma_N) = \bigl(\hat D_1 + (\gamma_N - 1)\hat D_2\bigr)^{-1}\hat D_4\bigl(\hat D_1^T + (\gamma_N - 1)\hat D_2^T\bigr)^{-1}.
\]
As verified in the proof of Theorem 4.1, we have $\hat D_1 = D_1 + o_P(1)$ and $\hat D_2 = D_2 + o_P(1)$. We established $\hat D_4^k = D_4 + o_P(1)$ in the proof of Theorem J.3 for fixed $\gamma$. Consequently, the claim follows if the sequence $\{\gamma_N\}_{N\ge1}$ is bounded.
Next, assume that $\gamma_N$ diverges to $+\infty$ as $N\to\infty$. We verified
\[
\bigl(\hat D_1 + (\gamma_N - 1)\hat D_2\bigr)^{-1} = \bigl(D_1 + (\gamma_N - 1)D_2\bigr)^{-1} + o_P\Bigl(\frac{1}{\gamma_N - 1}\Bigr)
\]
in the proof of Lemma J.4. It can be shown that $\frac{1}{(\gamma_N - 1)^2}\hat D_4$ is bounded in $P$-probability by adapting the arguments presented in the proof of Theorem J.3 because there exists some finite real constant $C$ such that we have $|b^{\gamma_N}|\le C$ for $N$ large enough. Therefore,
\[
\hat\sigma^2(\gamma_N) = \Bigl(\frac{1}{\gamma_N - 1}D_1 + D_2 + o_P(1)\Bigr)^{-1}\frac{1}{(\gamma_N - 1)^2}\hat D_4\Bigl(\frac{1}{\gamma_N - 1}D_1^T + D_2^T + o_P(1)\Bigr)^{-1}
\]
is bounded in $P$-probability.

Lemma J.6. Let $\gamma_N = o(\sqrt N)$. We then have
\[
\Xi_N := \hat\sigma(\gamma_N) - 2\hat\sigma - \sqrt N(\hat b^{\gamma_N} - b^{\gamma_N}) - \sqrt N(\beta_0 - \hat\beta) = O_P(1).
\]
Proof of Lemma J.6. By Theorem 3.1, the term $\sqrt N(\beta_0 - \hat\beta)$ asymptotically follows a Gaussian distribution and is hence bounded in $P$-probability. By Theorem I.21, the term $\hat\sigma^2$ converges in $P$-probability. Thus, $2\hat\sigma$ is bounded in $P$-probability as well. By Lemma J.4, we have $\sqrt N(\hat b^{\gamma_N} - b^{\gamma_N}) = O_P(1)$. By Lemma J.5, we have $\hat\sigma^2(\gamma_N) = O_P(1)$.

Proof of Theorem 4.4. That the statement holds uniformly for $P\in\mathcal P_N$ can be derived using analogous arguments as used to prove Theorems 3.1 and 4.1. Theorem J.3 in the appendix shows that $\hat\sigma(\gamma)$ consistently estimates $\sigma(\gamma)$ for fixed $\gamma$. Analogous arguments show that $\hat\sigma(\hat\gamma)$ consistently estimates $\sigma$ from Theorem 3.1. Let $\hat\mu := \hat\gamma - 1$. We have
\[
\sqrt N(\hat b^{\hat\gamma} - b^{\hat\gamma})
= \sqrt N\Bigl(\frac1K\sum_{k=1}^K(R_X^{I_k})^T\Bigl(\frac1{\hat\mu}1 + \Pi_{R_A^{I_k}}\Bigr)R_X^{I_k}\Bigr)^{-1}
\frac1K\sum_{k=1}^K(R_X^{I_k})^T\Bigl(\frac1{\hat\mu}1 + \Pi_{R_A^{I_k}}\Bigr)\bigl(R_Y^{I_k} - R_X^{I_k}b^{\hat\gamma}\bigr).
\]
Due to Theorem 4.3, we have $\frac1{\hat\mu} = \frac1{\sqrt N}o_P(1)$. Due to Proposition 4.2, whose statements also hold stochastically for random $\gamma$, we have $b^{\hat\gamma} = \beta_0 + \frac1{\sqrt N}o_P(1)$. Therefore, we have
\[
\sqrt N(\hat b^{\hat\gamma} - b^{\hat\gamma})
= \sqrt N\Bigl(\frac1K\sum_{k=1}^K(R_X^{I_k})^T\Pi_{R_A^{I_k}}R_X^{I_k}\Bigr)^{-1}
\frac1K\sum_{k=1}^K(R_X^{I_k})^T\Pi_{R_A^{I_k}}\bigl(R_Y^{I_k} - R_X^{I_k}\beta_0\bigr) + o_P(1)
= \sqrt N(\hat\beta - \beta_0) + o_P(1)
\]
due to Slutsky's theorem and similar arguments as presented in the proofs of Theorems 3.1 and 4.1.

Appendix K: Proof of Section 5.1

We argue that A1 and A2 are independent of H conditional on W1 and W2 in


the SEM in Figure 4. First, we consider A1 . All paths from A1 to H through
X or Y are blocked by the empty set because either X or Y is a collider on
these paths. The path A1 → A2 → W1 → H is blocked by W1 . Second, we
consider A2 . All paths from A2 to H through X or Y are blocked by the empty
set because either X or Y is a collider on these paths. The path A2 → W1 → H
is blocked by W1 .
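Conditional-independence claims of this type can also be checked mechanically. Below is a minimal R sketch using the dagitty package; the graph string is only a hypothetical stand-in consistent with the paths discussed above, and the actual SEM of Figure 4 should be substituted.

## Hypothetical d-separation check with dagitty. The DAG below is an
## illustrative guess encoding the paths mentioned in the text, not Figure 4.
library(dagitty)
g <- dagitty("dag {
  A1 -> A2 -> W1 -> H
  A1 -> X ; A2 -> X ; W1 -> X ; W2 -> X ; H -> X
  X -> Y ; H -> Y ; W1 -> Y ; W2 -> Y
}")
dseparated(g, "A1", "H", c("W1", "W2"))  # TRUE: all paths blocked
dseparated(g, "A2", "H", c("W1", "W2"))  # TRUE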

Acknowledgments

We thank Matthias Löffler, the editor, associate editor, and anonymous review-
ers for constructive comments.

References

[1] Acemoglu, D., Johnson, S. and Robinson, J. A. (2001). The colo-


nial origins of comparative development: An empirical investigation. The
American Economic Review 91 1369–1401.
[2] Ai, C. and Chen, X. (2003). Efficient estimation of models with condi-
tional moment restrictions containing unknown functions. Econometrica 71
1795–1843.

[3] Amemiya, T. (1974). The nonlinear two-stage least-squares estimator.


Journal of Econometrics 2 105–110.
[4] Amemiya, T. (1985). Advanced Econometrics. Harvard University Press,
Cambridge, Massachusetts.
[5] Anderson, T. W. (1983). Some recent developments on the distributions
of single-equation estimators. In Advances in econometrics, (A. Deaton,
D. McFadden and H. Sonnenschein, eds.). Econometric Society Monographs
in Quantitative Economics 4, 109–122. Cambridge University Press, Cam-
bridge.
[6] Anderson, T. W. (2005). Origins of the limited information maximum
likelihood and two-stage least squares estimators. Journal of Econometrics
127 1–16.
[7] Anderson, T. W., Kunitomo, N. and Sawa, T. (1982). Evaluation of
the Distribution Function of the Limited Information Maximum Likelihood
Estimator. Econometrica 50 1009–1027.
[8] Anderson, T. W., Kunitomo, N. and Morimune, K. (1986). Compar-
ing single-equation estimators in a simultaneous equation system. Econo-
metric Theory 2 1–32.
[9] Anderson, T. W., Kunitomo, N. and Matsushita, Y. (2010). On the
asymptotic optimality of the LIML estimator with possibly many instru-
ments. Journal of Econometrics 157 191–204.
[10] Anderson, T. W. and Rubin, H. (1949). Estimation of the Parameters
of a Single Equation in a Complete System of Stochastic Equations. The
Annals of Mathematical Statistics 20 46–63.
[11] Anderson, T. W. and Sawa, T. (1979). Evaluation of the Distribution
Function of the Two-Stage Least Squares Estimate. Econometrica 47 163–
182.
[12] Andrews, I., Stock, J. and Sun, L. (2019). Weak instruments in IV
regression: Theory and practice. Annual Review of Economics 11 727–753.
[13] Angrist, J. D., Imbens, G. W. and Rubin, D. B. (1996). Identifica-
tion of causal effects using instrumental variables. Journal of the American
Statistical Association 91 444–455.
[14] Athey, S., Tibshirani, J. and Wager, S. (2019). Generalized random
forests. The Annals of Statistics 47 1148–1178.
[15] Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing
data and causal inference models. Biometrics 61 962–972.
[16] Basmann, R. L. (1957). A generalized classical method of linear estimation
of coefficients in a structural equation. Econometrica 25 77–83.
[17] Belloni, A. and Chernozhukov, V. (2013). Least squares after model
selection in high-dimensional sparse models. Bernoulli 19 521–547.
[18] Berndt, E. R., Hall, B. H., Hall, R. E. and Hausman, J. A. (1974).
Estimation and inference in nonlinear structural models. Annals of Eco-
nomic and Social Measurement 3 653–665.
[19] Bickel, P. J. (1982). On adaptive estimation. The Annals of Statistics 10
647–671.
[20] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous

Analysis of Lasso and Dantzig Selector. The Annals of Statistics 37 1705–


1732.
[21] Bound, J., Jaeger, D. A. and Baker, R. M. (1995). Problems with
instrumental variables estimation when the correlation between the instru-
ments and the endogenous explanatory variable is weak. Journal of the
American Statistical Association 90 443–450.
[22] Bowden, R. J. and Turkington, D. A. (1985). Instrumental vari-
ables. Econometric Society Monographs. Cambridge University Press, Cam-
bridge.
[23] Bühlmann, P. (2020). Invariance, causality and robustness. Statistical Sci-
ence 35 404–426.
[24] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-
Dimensional Data: Methods, Theory and Applications. Springer Series in
Statistics. Springer, Heidelberg.
[25] Bühlmann, P. and van de Geer, S. (2018). Statistics for big data: A
perspective. Statistics & Probability Letters 136 37–41.
[26] Candes, E. and Tao, T. (2007). The Dantzig Selector: Statistical Estima-
tion When p Is Much Larger than n. The Annals of Statistics 35 2313–2351.
[27] Chen, J., Huang, C.-H. and Tien, J.-J. (2021). Debiased/Double Ma-
chine Learning for Instrumental Variable Quantile Regressions. Economet-
rics 9.
[28] Chen, B., Liang, H. and Zhou, Y. (2016). GMM estimation in partial
linear models with endogenous covariates causing an over-identified prob-
lem. Communications in Statistics - Theory and Methods 45 3168–3184.
[29] Chernozhukov, V., Hansen, C. and Spindler, M. (2016). hdm: High-
dimensional metrics. R Journal 8 185–199.
[30] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E.,
Newey, W. and Robins, J. (2017). Repo for the paper “Dou-
ble/debiased machine learning for treatment and structural parame-
ters”. https://github.com/VC2015/DMLonGitHub. Accessed: September
23, 2020.
[31] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E.,
Hansen, C., Newey, W. and Robins, J. (2018). Double/debiased ma-
chine learning for treatment and structural parameters. The Econometrics
Journal 21 C1–C68.
[32] Chiang, H. D., Kato, K., Ma, Y. and Sasaki, Y. (2021). Multiway
Cluster Robust Double/Debiased Machine Learning. Journal of Business
& Economic Statistics 0 1–11.
[33] Colangelo, K. and Lee, Y.-Y. (2020). Double debiased machine
learning nonparametric inference with continuous treatments. Preprint
arXiv:2004.03036.
[34] Cragg, J. G. (1967). On the Relative Small-Sample Properties of Several
Structural-Equation Estimators. Econometrica 35 89–110.
[35] Crown, W. H., Henk, H. J. and Vanness, D. J. (2011). Some cautions
on the use of instrumental variables estimators in outcomes research: How
bias in instrumental variables estimators is affected by instrument strength,

instrument contamination, and sample size. Value in Health 14 1078–1084.


[36] Cui, Y. and Tchetgen Tchetgen, E. (2020). Selective machine learning
of doubly robust functionals. Preprint arXiv:1911.02029.
[37] DasGupta, A. (2008). Asymptotic theory of statistics and probability.
Springer Texts in Statistics. Springer, New York.
[38] DiazOrdaz, K., Daniel, R. and Kreif, N. (2019). Data-adaptive doubly
robust instrumental variable methods for treatment effect heterogeneity.
Preprint arXiv:1802.02821.
[39] Durrett, R. (2010). Probability: Theory and examples, 4 ed. Cambridge
Series in Statistical and Probabilistic Mathematics. Cambridge University
Press, Cambridge.
[40] Emmenegger, C. (2021). dmlalg: Double machine learning algorithms. R-package available on CRAN.
[41] Farbmacher, H., Huber, M., Lafférs, L., Langen, H. and
Spindler, M. (2020). Causal mediation analysis with double machine
learning. Preprint arXiv:2002.12710.
[42] Florens, J.-P., Johannes, J. and Van Bellegem, S. (2012). Instru-
mental regression in partially linear models. The Econometrics Journal 15
304–324.
[43] Fuller, W. A. (1977). Some Properties of a Modification of the Limited
Information Estimator. Econometrica 45 939–53.
[44] Fuller, W. A. (1987). Measurement error models. Wiley series in prob-
ability and mathematical statistics. John Wiley & Sons, New York.
[45] Hahn, J., Hausman, J. and Kuersteiner, G. (2004). Estimation with
weak instruments: Accuracy of higher-order bias and MSE approximations.
The Econometrics Journal 7 272–306.
[46] Hansen, L. P. (1982). Large sample properties of generalized method of
moments estimators. Econometrica 50 1029–1054.
[47] Hansen, L. P. (1985). A method for calculating bounds on the asymptotic
covariance matrices of generalized method of moments estimators. Journal
of Econometrics 30 203–238.
[48] Härdle, W., Liang, H. and Gao, J. (2000). Partially linear models.
Contributions to Statistics. Springer, Berlin Heidelberg.
[49] Härdle, W., Müller, M., Sperlich, S. and Werwatz, A. (2004).
Nonparametric and semiparametric models. Springer series in statistics.
Springer, Berlin.
[50] Henderson, H. V. and Searle, S. R. (1981). On deriving the inverse of
a sum of matrices. SIAM Review 23 53–60.
[51] Hill, R. C., Griffiths, W. E. and Lim, G. C. (2011). Principles of
econometrics, 4 ed. Wiley, Hoboken, New Jersey.
[52] Hillier, G. H. and Skeels, C. L. (1993). Some further exact results
for structural equation estimators. In Models, Methods and Applications of
Econometrics: essays in Honor of A. R. Bergstroms (P. C. B. Phillips, ed.)
117–139. Blackwell, Cambridge, Massachusetts.
[53] Horowitz, J. L. (2011). Applied nonparametric instrumental variables
estimation. Econometrica 79 347–394.

[54] Jakobsen, M. E. and Peters, J. (2021). Distributional Robustness of


K-class Estimators and the PULSE. The Econometrics Journal.
[55] Knaus, M. C. (2020). Double machine learning based program evaluation
under unconfoundedness. Preprint arXiv:2003.03191.
[56] Koltchinskii, V. and Yuan, M. (2010). Sparsity in multiple kernel learn-
ing. The Annals of Statistics 38 3660–3695.
[57] Kozbur, D. (2020). Analysis of Testing-Based Forward Model Selection.
Econometrica 88 2147–2173.
[58] Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cam-
bridge University Press, Cambridge.
[59] Lauritzen, S. L. (1996). Graphical models. Oxford statistical science se-
ries. Clarendon Press, Oxford.
[60] Lewis, G. and Syrgkanis, V. (2020). Double/debiased machine learning
for dynamic treatment effects. Preprint arXiv:2002.07285.
[61] Liu, M., Zhang, Y. and Zhou, D. (2021). Double/debiased machine
learning for logistic partially linear model. The Econometrics Journal.
[62] Lloyd, W. P. (1975). A Note on the Use of the Two-Stage Least Squares
Estimator in Financial Models. The Journal of Financial and Quantitative
Analysis 10 143–149.
[63] Ma, Y. and Carroll, R. J. (2006). Locally efficient estimators for semi-
parametric models with measurement error. Journal of the American Sta-
tistical Association 101 1465–1474.
[64] Maathuis, M., Drton, M., Lauritzen, S. and Wainwright, M., eds.
(2019). Handbook of graphical models. Handbooks of Modern Statistical
Methods. Chapman & Hall/CRC, Boca Raton, FL.
[65] Mammen, E. and van de Geer, S. (1997). Penalized Quasi-Likelihood
Estimation in Partial Linear Models. The Annals of Statistics 25 1014–
1035.
[66] Mariano, R. S. (1972). The Existence of Moments of the Ordinary Least
Squares and Two-Stage Least Squares Estimators. Econometrica 40 643–
652.
[67] Mariano, R. S. (1982). Analytical Small-Sample Distribution Theory in
Econometrics: The Simultaneous-Equations Case. International Economic
Review 23 503–533.
[68] Mariano, R. S. (2003). Simultaneous Equation Model Estimators: Statis-
tical Properties and Practical Implications In A Companion to Theoretical
Econometrics 6, 122–141. John Wiley & Sons, Ltd.
[69] Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-
dimensional additive modeling. The Annals of Statistics 37 3779–3821.
[70] Nagar, A. L. (1959). The Bias and Moment Matrix of the General k-Class
Estimators of the Parameters in Simultaneous Equations. Econometrica 27
575–595.
[71] Nagar, A. L. (1960). A Monte Carlo Study of Alternative Simultaneous
Equation Estimators. Econometrica 28 573–590.
[72] Newey, W. K. and McFadden, D. (1994). Large sample estimation and
hypothesis testing. In Handbook of Econometrics, 4 36, 2111–2245. Elsevier

Science.
[73] Okui, R., Small, D. S., Tan, Z. and Robins, J. M. (2012). Doubly
robust instrumental variable regression. Statistica Sinica 22 173–205.
[74] Pearl, J. (1998). Graphs, causality, and structural equation models. So-
ciological Methods & Research 27 226–284.
[75] Pearl, J. (2004). Robustness of causal claims. In Proceedings of the 20th
Conference on Uncertainty in Artificial Intelligence. UAI ’04 446–453.
AUAI Press, Arlington, Virginia, USA.
[76] Pearl, J. (2009). Causality: Models, reasoning, and inference, 2 ed. Cam-
bridge University Press, Cambridge.
[77] Pearl, J. (2010). An introduction to causal inference. The International
Journal of Biostatistics 6 Article 7.
[78] Peters, J., Janzing, D. and Schölkopf, B. (2017). Elements of causal
inference: Foundations and learning algorithms. Adaptive computation and
machine learning. The MIT Press, Cambridge, MA.
[79] Phillips, P. C. B. (1984). The Exact Distribution of LIML: I. Interna-
tional Economic Review 25 249–261.
[80] Phillips, P. C. B. (1985). The Exact Distribution of LIML: II. Interna-
tional Economic Review 26 21–36.
[81] Robinson, P. M. (1988). Root-N -consistent semiparametric regression.
Econometrica 56 931–954.
[82] Rothenhäusler, D., Meinshausen, N., Bühlmann, P. and Peters, J.
(2021). Anchor regression: Heterogeneous data meet causality. Journal of
the Royal Statistical Society: Series B (Statistical Methodology) 83 215-246.
[83] Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric
regression. Cambridge series in statistical and probabilistic mathematics 12.
Cambridge University Press, Cambridge.
[84] Smucler, E., Rotnitzky, A. and Robins, J. M. (2019). A unifying approach for doubly-robust ℓ1 regularized estimation of causal contrasts. Preprint arXiv:1904.03737.
[85] Speckman, P. (1988). Kernel smoothing in partial linear models. Journal
of the Royal Statistical Society. Series B (Methodological) 50 413–436.
[86] Staiger, D. and Stock, J. H. (1997). Instrumental variables regression
with weak instruments. Econometrica 65 557–586.
[87] Stock, J. H., Wright, J. H. and Yogo, M. (2002). A survey of weak
instruments and weak identification in generalized method of moments.
Journal of Business and Economic Statistics 20 518–529.
[88] Su, L. and Zhang, Y. (2016). Semiparametric estimation of partially lin-
ear dynamic panel data models with fixed effects. In Essays in Honor of
Aman Ullah, 1 ed. (G. González-Rivera, R. C. Hill and T.-H. Lee, eds.). Ad-
vances in Econometrics 36 137–204. Emerald Group Publishing Limited,
Howard House, Wagon Lane, Bingley BD16 1WA, UK.
[89] Summers, R. (1965). A Capital Intensive Approach to the Small Sample
Properties of Various Simultaneous Equation Estimators. Econometrica 33
1–41.
[90] Takeuchi, K. and Morimune, K. (1985). Third-Order Efficiency of the

Extended Maximum Likelihood Estimators in a Simultaneous Equation


System. Econometrica 53 177–200.
[91] Theil, H. (1953a). Repeated least-squares applied to complete equation
systems. Central Planning Bureau, The Hague. Mimeographed memoran-
dum.
[92] Theil, H. (1953b). Estimation and simultaneous correlation in complete
equation systems. Central Planning Bureau, The Hague. Mimeographed
memorandum.
[93] Theil, H. (1961). Economic forecasts and policy, 2 ed. Contributions to
economic analysis 15. North-Holland Publishing Company, Amsterdam.
[94] van der Laan, M. J. and Robins, J. M. (2003). Unified methods for cen-
sored longitudinal data and causality. Springer series in statistics. Springer,
New York.
[95] Wager, S. and Walther, G. (2016). Adaptive concentration of regression
trees, with application to random forests. Preprint arXiv:1503.06388.
[96] Wagner, H. M. (1958). A Monte Carlo Study of Estimates of Simultane-
ous Linear Structural Equations. Econometrica 26 117–133.
[97] Wooldridge, J. M. (2013). Introductory econometrics: A modern ap-
proach, 5 ed. South-Western Cengage Learning, Mason, OH.
[98] Yao, F. (2012). Efficient semiparametric instrumental variable estimation
under conditional heteroskedasticity. Journal of Quantitative Economics
10 32–55.
[99] Yuan, M. and Zhou, D.-X. (2016). Minimax optimal rates of estimation
in high-dimensional additive models. The Annals of Statistics 44 2564–
2593.
