1. Introduction
Y = X^T β0 + g_Y(W) + h_Y(H) + ε_Y.  (1)
The covariates X and W and the response Y are observed whereas the variable
H is not observed and acts as a potential confounder. It can cause endogeneity
in the model when it is correlated with X, W , and Y . The variable εY denotes
a random error. An overview of PLMs is presented in Härdle, Liang and Gao
[48]. Semiparametric methods are summarized in Ruppert, Wand and Carroll
[83] and Härdle et al. [49], for instance.
Equation (2) has a two-stage least squares (TSLS) interpretation [91, 92,
16, 22, 13, 6]. As mentioned above, the residual term Y − E[Y |W ] is regressed
on X − E[X|W ] using the instrument A − E[A|W ]. In entirely linear models,
the following findings have been reported about TSLS and related procedures.
The TSLS estimator has been observed to be highly variable, leading to overly
wide confidence intervals. For instance, although ordinary least squares (OLS) is
biased in the presence of endogeneity, it has been observed to be less variable [96,
71, 89, 34, 62]. The issue of large or even nonexistent variance of TSLS (the number of existing moments of TSLS depends on the degree of overidentification [66, 67, 68]) is also coupled with the strength of the instrument [21, 86, 87, 35, 12].
and anchor regression [82, 23]. Both k-class estimation and anchor regression are
designed for linear models and may require choosing a regularization parameter.
Our approach is designed for PLMs, and the regularization parameter is data
driven. Recently, Jakobsen and Peters [54] have proposed a related strategy for
linear (structural equation) models; whereas they rely on testing for choosing
the amount of regularization, we tailor our approach to reduce the MSE such
that the coverage of confidence intervals for β0 remains valid. The regsDML
estimator converges at the parametric rate and is asymptotically Gaussian.
In this sense, and in contrast to Jakobsen and Peters [54], regsDML focuses
on statistical inference beyond point estimation with coverage guarantees not
only in linear models but also in potentially complex partially linear ones. The
regsDML estimator is asymptotically equivalent to the TSLS-type DML esti-
mator, but regsDML may exhibit substantially better finite sample properties.
Furthermore, our developments show how DML and k-class estimation can be
combined to estimate the linear coefficient in an endogenous PLM.
Our approach allows flexible model specification. We only require that X
enters linearly in (1) and that the other terms are additive. In particular, the
form of the effect of W on A or of A on W is not constrained. This is partly
similar to TSLS, which is robust to model misspecifications in its first stage be-
cause it does not rely on a correct specification of the instrument effect on the
covariate [15]. The detailed assumptions on how the variables A, X, W , H, and
Y interact are given in Section 2: the variable A needs to satisfy an assumption
similar to that for a conditional instrument, but there is some flexibility.
Fig 2. The results come from M = 1000 simulation runs each from the SEM in Figure 1
for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1. The nuisance
functions are estimated with additive splines. The figure displays the coverage of two-sided
confidence intervals for β0 , power for two-sided testing of the hypothesis H0 : β0 = 0, and
scaled lengths of two-sided confidence intervals of DML (red), regDML (blue), and regsDML
(green), where all results are at level 95%. At each N , the lengths of the confidence intervals
are scaled with the median length from DML. The shaded regions in the coverage and power
plots represent 95% confidence bands with respect to the M simulation runs. The blue and
green lines are indistinguishable in the left panel.
PLMs have received considerable interest. Härdle, Liang and Gao [48] present
an overview of estimation methods in purely exogenous PLMs, and many refer-
ences are given there. The remaining part of this paragraph refers to literature
investigating endogenous PLMs. Ai and Chen [2] consider semiparametric es-
timation with a sieve estimator. Ma and Carroll [63] introduce a parametric
model for the latent variable. Yao [98] considers a heteroskedastic error term
and a partialling-out scheme [81, 85]. Florens, Johannes and Van Bellegem [42]
propose to solve an ill-posed integral equation. Su and Zhang [88] investigate a
partially linear dynamic panel data model with fixed effects and lagged variables
and consider sieve IV estimators as well as an approach with solving integral
equations. Horowitz [53] compares inference and other properties of nonpara-
metric and parametric estimation if instruments are employed.
Outline of the paper. Sections 2 and 3 describe the DML estimator. The former
section introduces an identifiability condition, and the latter investigates asymp-
totic properties. Section 4 introduces the regularized regularization-selection es-
timator regsDML and its regularization-only version regDML and investigates
their asymptotic properties. Section 5 presents numerical experiments and an
empirical real data example. Section 6 concludes our work. Proofs and addi-
tional definitions and material are given in the appendix.
Y ← X^T β0 + g_Y(W) + h_Y(H) + ε_Y  (3)
Theorem 2.1. Let the dimensions q = dim(A) and d = dim(X), and assume q ≥ d. Assume furthermore that the matrices E[R_X R_A^T] and E[R_A R_A^T] are of full rank, and assume the identifiability condition (5). Then the representation

β0 = ( E[R_X R_A^T] E[R_A R_A^T]^{-1} E[R_A R_X^T] )^{-1} E[R_X R_A^T] E[R_A R_A^T]^{-1} E[R_A R_Y]

holds.
By Theorem 2.1, the population parameter β0 solves the TSLS moment equation

0 = E[R_X R_A^T] E[R_A R_A^T]^{-1} E[R_A (R_Y − R_X^T β0)].

This motivates a generalized method of moments interpretation of β0 because we have

β0 = argmin_{β ∈ R^d} E[ψ(S; β, η^0)]^T E[R_A R_A^T]^{-1} E[ψ(S; β, η^0)]
parameters, with data from I_k^c. We call the resulting estimators m̂_A^{I_k^c}, m̂_X^{I_k^c}, and m̂_Y^{I_k^c}, respectively. Then, the adjusted residual terms R_{A,i}^{I_k} := A_i − m̂_A^{I_k^c}(W_i), R_{X,i}^{I_k} := X_i − m̂_X^{I_k^c}(W_i), and R_{Y,i}^{I_k} := Y_i − m̂_Y^{I_k^c}(W_i) for i ∈ I_k are evaluated on I_k, the complement of I_k^c. We concatenate them row-wise to form the matrices R_A^{I_k} ∈ R^{n×q} and R_X^{I_k} ∈ R^{n×d} and the vector R_Y^{I_k} ∈ R^n.
These K iterates are assembled to form the DML estimator

β̂ := ( (1/K) Σ_{k=1}^K (R_X^{I_k})^T Π_{R_A^{I_k}} R_X^{I_k} )^{-1} (1/K) Σ_{k=1}^K (R_X^{I_k})^T Π_{R_A^{I_k}} R_Y^{I_k}  (7)

of β0, where

Π_{R_A^{I_k}} := R_A^{I_k} ( (R_A^{I_k})^T R_A^{I_k} )^{-1} (R_A^{I_k})^T  (8)

denotes the orthogonal projection matrix onto the space spanned by the columns of R_A^{I_k}.
To obtain β̂ in (7), the individual matrices are first averaged before the final
matrix is inverted. It is also possible to compute K individual TSLS estimators
on the K iterates individually and average these. Both schemes are asymptot-
ically equivalent. Chernozhukov et al. [31] call these two schemes DML2 and
DML1, respectively, where DML2 is as in (7). The DML1 version of the coeffi-
cient estimator is given in the appendix in Section B.1. The advantage of DML2
over DML1 is that it enhances stability properties of the estimator. To ensure
stability of the DML1 estimator, every individual matrix that is inverted needs
to be well conditioned. Stability of the DML2 estimator is ensured if the average
of these matrices is well conditioned.
The K sample splits are random. To reduce the effect of this randomness,
we repeat the overall procedure S times and assemble the results as suggested
in Chernozhukov et al. [31]. This procedure is described in Algorithm 1 in Sec-
tion 4.2 below.
√N σ^{-1}(β̂ − β0) = (1/√N) Σ_{i=1}^N ψ(S_i; β0, η^0) + o_P(1) →_d N(0, 1_{d×d}) (N → ∞),
for k ∈ [K], which allows us to employ a broad range of ML estimators. For instance, these convergence rates are satisfied by ℓ1-penalized and related methods in a variety of sparse, high-dimensional linear models [26, 20, 24, 17], forward selection in sparse linear models [57], high-dimensional additive models [69, 56, 99], or regression trees and random forests [95, 14]. Please see Chernozhukov et al. [31] for additional references. In particular, the rate condition (9) is satisfied if the individual ML estimators converge at rate N^{−1/4}. Therefore, the individual nuisance estimators may converge at a rate slower than the parametric rate N^{−1/2}.
The asymptotic variance σ^2 can be consistently estimated by replacing the
true β0 by β̂ or its DML1 version. The nuisance functions are estimated on
subsampled datasets, and the estimator of σ 2 is obtained by cross-fitting. The
formal definition, the consistency result, and its proof are given in Definition I.1
and in Theorem I.21 in the appendix in Section I.
For fixed P, the asymptotic variance-covariance matrix σ^2 is the same as if the conditional expectations m_A^0(W), m_X^0(W), and m_Y^0(W) and hence R_A, R_X, and R_Y were known.
Theorem 3.1 holds uniformly over laws P . This uniformity guarantees some
robustness of the asymptotic statement [31]. The dimension v of the covariate
W may grow as the sample size increases. Thus, high-dimensional methods can
be considered to estimate the conditional expectations E[A|W ], E[X|W ], and
E[Y |W ].
∂/∂r E[ψ(S; β0, η^0 + r(η − η^0))]

exists for all r ∈ [0, 1) and nuisance parameters η and vanishes at r = 0.
Definition 3.2 does not entirely coincide with Chernozhukov et al. [31, Defini-
tion 2.1] because the latter also includes an identifiability condition. We directly
assume the identifiability condition (5).
The subsequent proposition states that the score function ψ in (10) is indeed
Neyman orthogonal.
Proposition 3.3. The score ψ given in Equation (10) is Neyman orthogonal.
We would like to remark that Neyman orthogonality of ψ neither depends
on the distribution of S nor on the value of β0 and η 0 . In addition to being
Neyman orthogonal, ψ is linear in β in the sense that we have

ψ(S; β, η) = ψ^b(S; η) − ψ^a(S; η) β

for

ψ^b(S; η) := ( A − m_A(W) ) ( Y − m_Y(W) )

and

ψ^a(S; η) := ( A − m_A(W) ) ( X − m_X(W) )^T.
This linearity property is also employed in the proof of Theorem 3.1.
(1/√n) Σ_{i∈I} A_i ( R_{Y,i}^I − (R_{X,i}^I)^T β0 )  (12)

may diverge as N → ∞ because m̂_X^{I^c} and m̂_Y^{I^c} may be biased estimators of m_X^0 and m_Y^0. This in particular happens if the functions m_X^0 and m_Y^0 are high-dimensional and need to be estimated by regularization techniques; see Chernozhukov et al. [31]. Even if sample splitting is employed, the term (12) is asymptotically not well behaved because the underlying score function

φ(S; β, η) := A ( Y − m_Y(W) − ( X − m_X(W) )^T β )

is not Neyman orthogonal. The issue is illustrated in Figure 3. The SEM used to generate the data is similar to the nonconfounded model used in Chernozhukov et al. [31, Figure 1]. The centered and rescaled term (β̂ − β0)/√(V̂ar(β̂)) using A as an instrument is biased whereas it is not if the instrument R_A is used. Here, V̂ar(β̂) denotes the empirically observed variance of β̂ with respect to the performed simulation runs.
R_Y^{I_k} ∈ R^n introduced in Section 3 that adjust the data with respect to the nonparametric variables. The estimator of b^γ is given by

b̂^γ := argmin_{b ∈ R^d} (1/K) Σ_{k=1}^K ( ‖(1 − Π_{R_A^{I_k}})(R_Y^{I_k} − R_X^{I_k} b)‖_2^2 + γ ‖Π_{R_A^{I_k}}(R_Y^{I_k} − R_X^{I_k} b)‖_2^2 ),

which admits the closed-form solution

b̂^γ = ( (1/K) Σ_{k=1}^K (R̃_X^{I_k})^T R̃_X^{I_k} )^{-1} (1/K) Σ_{k=1}^K (R̃_X^{I_k})^T R̃_Y^{I_k},  (14)

where

R̃_X^{I_k} := ( 1 + (√γ − 1) Π_{R_A^{I_k}} ) R_X^{I_k} and R̃_Y^{I_k} := ( 1 + (√γ − 1) Π_{R_A^{I_k}} ) R_Y^{I_k}.  (15)
The computation of b̂^γ is similar to an OLS scheme where R̃_Y^{I_k} is regressed on R̃_X^{I_k}. To obtain b̂^γ, individual matrices are first averaged before the final matrix is inverted. It is also possible to directly carry out the K OLS regressions of R̃_Y^{I_k} on R̃_X^{I_k} and average the resulting parameters. Both schemes are asymptotically equivalent. We call the two schemes DML2 and DML1, respectively. This is analogous to Chernozhukov et al. [31] as already mentioned in Section 3. The DML1 version is presented in the appendix in Section B.2. As mentioned in Section 3, the advantage of DML2 over DML1 is that it enhances stability properties of the coefficient estimator because the average of matrices needs to be well conditioned but not every individual matrix.
Theorem 4.1. Let γ ≥ 0. Suppose that Assumption I.5 in the appendix in Section I (same as in Theorem 3.1) except I.5.1 holds, and consider the quantities σ^2(γ) and ψ introduced in Definition J.1 in the appendix in Section J. The estimator b̂^γ concentrates in a 1/√N neighborhood of b^γ. It is approximately linear and centered Gaussian, namely

√N σ^{-1}(γ) (b̂^γ − b^γ) = (1/√N) Σ_{i=1}^N ψ(S_i; b^γ, η^0) + o_P(1) →_d N(0, 1_{d×d}) (N → ∞),

uniformly over laws P of S = (A, W, X, Y).

Theorem 4.1 also holds for the DML1 version of b̂^γ defined in the appendix in Section B.2. The influence function is denoted by ψ in both Theorems 3.1 and 4.1 but is defined differently. Assumption I.5 specifies regularity conditions and the convergence rate of the machine learners of the conditional expectations
for k ∈ [K]. The main difference to Theorem 3.1, and the quantity of interest, is the asymptotic variance σ^2(γ). It can be consistently estimated with either b̂^γ or its DML1 version as illustrated in Theorem J.3 in the appendix in Section J. Typically, for γ < ∞, the asymptotic variance σ^2(γ) is smaller than σ^2 in Theorem 3.1. Such a variance gain comes at the price of bias because b̂^γ estimates b^γ and not the true parameter β0.
The proof of Theorem 4.1 uses Neyman orthogonality of the underlying score
function. Recall that Neyman orthogonality neither depends on the distribution
of S nor on the value of the coefficients β0 and η 0 as discussed in Section 3.
For fixed γ > 1, Theorem 4.1 furthermore implies that the k-class estimator
corresponding to b̂γ converges at the parametric rate and follows a Gaussian
distribution asymptotically.
It optimizes an estimate of the asymptotic MSE of b̂^γ: the term σ̂^2(γ) is the consistent estimator of σ^2(γ) described in Theorem J.3 in the appendix in Section J, and the term |b̂^γ − β̂|^2 is a plug-in estimator of the squared population bias |b^γ − β0|^2. The estimated regularization parameter γ̂ is random because it depends on the data.
First, we investigate the bias of the population parameter b^{γ_N} for a nonrandom sequence of regularization parameters {γ_N}_{N≥1} as N → ∞. Afterwards,
uniformly over laws P of S = (A, W, X, Y), where σ̂(·) is the estimator from Theorem J.3 in the appendix, which consistently estimates σ(·) from Theorem 4.1.
In particular, b̂^{γ̂} and β̂ are asymptotically equivalent, but b̂^{γ̂} may exhibit substantially better finite sample properties, as we demonstrate in the subsequent section. Because b̂^{γ̂} and β̂ are asymptotically equivalent, the same result also holds for the selection estimator regsDML.
The proof of Theorem 4.4 does not depend on the precise construction of γ̂ and only uses that the random regularization parameter is of stochastic order larger than √N. Thus, Theorem 4.4 remains valid if the regularization parameter comes from k-class estimation and is of the required stochastic order. The same stochastic order is also required to show that k-class estimators are asymptotically Gaussian [70, 68].
The K sample splits are random. To reduce the effect of this randomness, we repeat the overall procedure S times and assemble the results as suggested in Chernozhukov et al. [31]. The assembled parameter estimate is given by the median of the individual parameter estimates; see Steps 9 and 10 of Algorithm 1. The assembled variance estimate is given by adding a correction term to the individual variances and subsequently taking the median of these corrected terms. The correction term measures the variability due to sample splitting across s ∈ [S].
It is possible that the assembled variance of regDML is larger than the as-
sembled variance of DML. In such a case, we do not use the regDML estimator
and select the DML estimator instead to ensure that the final estimator of β0
does not experience a larger estimated variance than DML. This is the regsDML
scheme. A summary of this procedure is given in Algorithm 1.
5. Numerical experiments
This section illustrates the performance of the DML, regDML, and regsDML
estimators in a simulation study and for an empirical dataset. Our implemen-
tation is available in the R-package dmlalg [40]. We employ the DML2 method
and K = 2 and S = 100 in Algorithm 1. Furthermore, we compare our estima-
tion schemes with the following three k-class estimators: LIML, Fuller(1), and
Fuller(4). On each of the K sample splits, we compute the regularization pa-
rameter of the respective k-class estimation procedure and average them. Then,
we compute the corresponding γ-value and proceed as for the other regularized
estimators according to Algorithm 1.
The first example in Section 5.1 considers an overidentified model in which
the dimension of A is larger than the dimension of X. The second example in
Section 5.2 considers just-identified real-world data. In both examples, the con-
ditional expectations acting as nuisance parameters are estimated with random
forests.
An example where the conditional expectations are estimated with splines is
given in Section 1.1. Additional empirical results are provided in the appendix in
Sections D, E, and F. The latter section considers examples where DML, regDML,
and regsDML do not work well in finite sample situations: we follow the NCP
(No Cherry Picking) guideline [25] to possibly enhance further insights into the
finite sample behavior. Section E in the appendix presents examples where the
link A → X is weak and examples illustrating the bias-variance tradeoff of the
respective estimated quantities as a function of the regularization parameter γ.
We generate data from the SEM in Figure 4. This SEM satisfies the identifi-
ability condition (5) because A1 and A2 are independent of H given W1 and
W2 ; a proof is given in the appendix in Section K. The model is overidentified
because the dimension of A = (A1 , A2 ) is larger than the dimension of X. The
variable A1 directly influences A2, which in turn directly affects W1. Both W1 and
W2 directly influence H. Both A1 and A2 directly influence X. The variable A1
is a source node.
We simulate M = 1000 datasets each from the SEM in Figure 4 for a range
of sample sizes. For every dataset, we compute a parameter estimate and an
associated confidence interval with DML, regDML, and regsDML. We choose
K = 2 and S = 100 in Algorithm 1 and estimate the conditional expectations
with random forests consisting of 500 trees that have a minimal node size of 5.
Figure 5 illustrates our findings. It gives the coverage, power, and relative
length of the 95% confidence intervals for a range of sample sizes N of the
three methods. The blue and green curves correspond to regDML and regsDML,
respectively. If the blue curve is not visible in Figure 5, it coincides with the green
one. The two regularization methods perform similarly because regularization considerably improves DML in this example, so that regsDML predominantly selects the regularized estimate. The red curves correspond to DML. If the
red curve is not visible, it coincides with LIML, whose results are displayed in
orange. The Fuller(1) and Fuller(4) estimators correspond to purple and cyan,
respectively.
The top left plot in Figure 5 displays the coverages as interconnected dots.
The dashed lines represent 95% confidence regions of the coverages. These con-
fidence regions are computed with respect to uncertainties in the M simulation
runs. No coverage region falls below the nominal 95% level that is marked by
the gray line.
The bottom left plot in Figure 5 shows that the power of DML, LIML, and
Fuller(1) is lower for small sample sizes and increases gradually. The power of the
other regularization methods remains approximately 1. The dashed lines repre-
sent 95% confidence regions that are computed with respect to uncertainties in
the M simulation runs.
The right plot in Figure 5 displays boxplots of the scaled lengths of the con-
fidence intervals. For each N , the confidence interval lengths of all methods are
divided by the median confidence interval lengths of DML. The length of the
regsDML confidence intervals is around 50%–80% of the length of DML’s. Nev-
ertheless, the coverage of regsDML remains around 95%. The LIML, Fuller(1),
and Fuller(4) confidence intervals are considerably longer than regsDML’s. Al-
though the confidence intervals of regsDML are the shortest of all considered
Fig 5. The results come from M = 1000 simulation runs each from the SEM in Figure 4 for a
range of sample sizes N and with K = 2 and S = 100 in Algorithm 1. The nuisance functions
are estimated with random forests. The figure displays the coverage of two-sided confidence
intervals for β0 , power for two-sided testing of the hypothesis H0 : β0 = 0, and scaled lengths
of two-sided confidence intervals of DML (red), regDML (blue), regsDML (green), LIML
(orange), Fuller(1) (purple), and Fuller(4) (cyan), where all results are at level 95%. At
each N , the lengths of the confidence intervals are scaled with the median length from DML.
The shaded regions in the coverage and the power plots represent 95% confidence bands with
respect to the M simulation runs. The blue and green lines as well as the red and orange ones
are indistinguishable in the left panel.
We apply the DML and regsDML methods to a real dataset. We estimate the lin-
ear effect β0 of institutions on economic performance following the work of Ace-
moglu, Johnson and Robinson [1] and Chernozhukov et al. [31]. Countries with
better institutions achieve a greater level of income per capita, and wealthy
economies can afford better institutions. This may cause simultaneity. To over-
come it, mortality rates of the first European settlers in colonies are considered
as a source of exogenous variation in institutions. For further details, we refer
to Acemoglu, Johnson and Robinson [1] and Chernozhukov et al. [31]. The data
is available in the R-package hdm [29] and is called AJR. In our notation, the
response Y is the GDP, the covariate X the average protection against expro-
priation risk, the variable A the logarithm of settler mortality, and the covariate
W consists of the latitude, the squared latitude, and the binary factors Africa,
Asia, North America, and South America. That is, we adjust nonparametrically
for the latitude and geographic information.
Table 1
Coefficient estimate, its standard error, and a confidence interval with DML and regsDML on the AJR dataset, where K = 2 and S = 100 in Algorithm 1, and where the conditional expectations are estimated with random forests consisting of 1000 trees that have a minimal node size of 5.

           Estimate of β0   Standard error   Confidence interval for β0
DML        0.739            0.459            [−0.161, 1.639]
regsDML    0.688            0.229            [0.239, 1.136]
The AJR dataset has also been analyzed in Chernozhukov et al. [31]. They
also estimate conditional expectations with random forests consisting of 1000
trees that have a minimal node size of 5 but implicitly assume an additional
homoscedasticity condition for the errors R_Y − R_X^T β0; see Chernozhukov et al.
[30]. Such a homoscedastic error assumption is questionable though. Their pro-
cedure leads to a smaller estimate of the standard deviation of DML than what
we obtain.
6. Conclusion
This section presents an SEM where our identifiability condition (5) holds, but
where the conditional moment requirements of Chernozhukov et al. [31] do not.
Let d = 1 = q in this section (just-identified case), and assume the model

Y ← X β0 + g_Y(W) + h_Y(H) + ε_Y

given in (3) and the identifiability condition E[R_A (R_Y − R_X β0)] = 0 given in (5).
Y = X β0 + g_Y(W) + U,  A = g_A(W) + V  (17)

for unknown functions g_Y and g_A and impose the conditional moment restrictions

E[U | A, W] = 0 and E[V | W] = 0.  (18)
The DML1 estimators are less preferred than the DML2 estimators we proposed
to use in the main text, but for completeness we provide the definitions in this
section.
β̂^{DML1} := (1/K) Σ_{k=1}^K β̂^{I_k},

where

β̂^{I_k} := ( (R_X^{I_k})^T Π_{R_A^{I_k}} R_X^{I_k} )^{-1} (R_X^{I_k})^T Π_{R_A^{I_k}} R_Y^{I_k},  (19)

and where we recall the projection matrix Π_{R_A^{I_k}} = R_A^{I_k} ( (R_A^{I_k})^T R_A^{I_k} )^{-1} (R_A^{I_k})^T defined in (8). The estimator β̂^{I_k} is the TSLS estimator of R_Y^{I_k} on R_X^{I_k} using the instrument R_A^{I_k}.
b̂^{γ,DML1} := (1/K) Σ_{k=1}^K b̂_k^γ,  (20)

where

b̂_k^γ := argmin_{b ∈ R^d} ( ‖(1 − Π_{R_A^{I_k}})(R_Y^{I_k} − R_X^{I_k} b)‖_2^2 + γ ‖Π_{R_A^{I_k}}(R_Y^{I_k} − R_X^{I_k} b)‖_2^2 ),

which can be solved in closed form with R̃_X^{I_k} and R̃_Y^{I_k} as in (15). The computation of b̂_k^γ is an OLS scheme where R̃_Y^{I_k} is regressed on R̃_X^{I_k}.
The data from the simulation displayed in Figure 3 come from the following SEM, where the dimension of W is v = 20 and where R denotes the upper triangular matrix of the Cholesky decomposition of the Toeplitz matrix whose first row is given by (1, 0.7, 0.7^2, . . . , 0.7^19).
If we say in this section that the nuisance parameters are estimated with additive splines, they are estimated with additive cubic B-splines with N^{1/5} + 2 degrees of freedom, where N denotes the sample size of the data. If we say in this section that the nuisance parameters are estimated with random forests, they are estimated with random forests consisting of 500 trees that have a minimal node size of 5.
In Figure 8, the type I errors of both DML and regsDML are similar. The 95% confidence regions of both estimators, which are represented by dashed lines, include the 5% level or are below it. The right plot in Figure 8 illustrates that the regsDML confidence intervals are around 50%–80% of the length of DML’s. Nevertheless, its type I error does not exceed the 5% level.
First, we analyze the behavior of our methods for varying strength of the link from A to X. For N = 200, we consider the coverage and length of the confidence intervals for varying strength from A to X for the same settings as in Figures 2 and 5.
Fig 7. The results come from M = 1000 simulation runs each from the SEM in Figure 1
with β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1.
The nuisance functions are estimated with additive splines. The figure displays the coverage
of two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines as well as the red and orange ones are indistinguishable in the left panel.
Figure 9 illustrates the results for data from the SEM from Figure 2. We vary
the strength of the direct link A → X and denote it by α in Figure 9. Figure 10
illustrates the results for data from the SEM from Figure 5. We leave the link
A2 → X as it is and only vary the strength of the direct link A1 → X, which we
denote by α in Figure 10. In both Figures 9 and 10, the coverage remains high
for all considered methods. If α becomes larger in absolute value, the confidence
intervals become shorter, which leads to a coverage that is closer to the nominal
95% level, especially in Figure 10. The regsDML method yields the shortest
confidence intervals in both figures.
Second, we analyze the bias-variance tradeoff of the respective estimated
quantities of the regularized methods. We again choose the sample size N = 200
and consider the same settings as in Figures 2 and 5. The results are summarized in Figures 11 and 12, which display the estimated MSE, estimated variance, and
estimated squared bias as used in Equation (16). The MSE in both figures is
mainly driven by the variance, and regsDML achieves a considerable variance
reduction compared to the TSLS-type DML estimator.
If we say in this section that the nuisance parameters are estimated with additive splines, they are estimated with additive cubic B-splines with N^{1/5} + 2 degrees of freedom, where N denotes the sample size of the data.
Fig 8. The results come from M = 1000 simulation runs from the SEM in Figure 4 with
β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1. The
nuisance functions are estimated with random forests. The figure displays the coverage of
two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines as well as the red and orange ones are indistinguishable in the left panel.
Fig 9. Same setting as in Figure 2, but with N = 200 only. The strength of the direct link A → X varies and is denoted by α. We considered the α-values −e^{−20}, −e^{−15}, −e^{−10}, −e^{−5}, −e^{−1}, −e^{−0.75}, −e^{−0.5}, −e^{−0.25}, and −e^0.
We consider models where the DML and the regsDML methods do not work
well in terms of coverage of β0 . We present possible explanations of these failures
and illustrate model changes to overcome them. The first model in Section F.1
features a strong confounding effect H → X, the second model in Section F.2
features an effect with noise in W → H, and the third model in Section F.3
features an effect with noise in H → W .
Fig 10. Same setting as in Figure 5, but with N = 200 only. The strength of the direct link A1 → X varies and is denoted by α. We considered the α-values e^{−20}, e^{−15}, e^{−10}, e^{−5}, e^{−1}, e^{−0.75}, e^{−0.5}, e^{−0.25}, and e^0.
Fig 11. Estimated MSE, estimated variance, and estimated squared bias as used in Equa-
tion (16) for the same setting as in Figure 2, but with N = 200 only. The black solid line
displays the median of the respective quantity over the considered range of γ-values for b̂γ . The
yellow area marks the observed 25% and 75% quantiles. All methods incorporate an additional
variance adjustment from the S repetitions according to Algorithm 1. Boxplots illustrate the
performance of the TSLS and the regularized methods. The position of the boxplots is not
linked to the γ-values on the x-axis.
Fig 12. Estimated MSE, estimated variance, and estimated squared bias as used in Equa-
tion (16) for the same setting as in Figure 5, but with N = 200 only. The black solid line
displays the median of the respective quantity over the considered range of γ-values for b̂γ . The
yellow area marks the observed 25% and 75% quantiles. All methods incorporate an additional
variance adjustment from the S repetitions according to Algorithm 1. Boxplots illustrate the
performance of the TSLS and the regularized methods. The position of the boxplots is not
linked to the γ-values on the x-axis.
F.2. Noise in W → H
The variable W may have a direct effect on H. If this link is strong enough relative to the additional noise ε_H of H, it is possible to obtain some information about H by observing W. This can reduce the overall level of confounding, depending on the choice of functions in the model.
Simulation results where W explains only part of the variation in H are
presented in Figure 17. The confidence intervals of both DML and regsDML
do not attain a 95% coverage for small sample sizes N . The situation can be
considerably improved by reducing the variation of H that is not explained by
W ; see Figure 18.
F.3. Noise in H → W
The variable H may have a direct effect on W. If this link is strong enough relative to the additional noise ε_W of W, it is possible to obtain some information about H by observing W, similarly to Section F.2. The results again
Fig 14. The results come from M = 1000 simulation runs from the SEM in Figure 13 with
χ = 15 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1.
The nuisance functions are estimated with additive splines. The figure displays the coverage
of two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines are indistinguishable in the left panel.
The following examples illustrate SEMs where the identifiability condition (5)
holds and where it fails to hold. We argue using causal graphs; see Lauritzen
[59], Pearl [74, 76, 77], Peters, Janzing and Schölkopf [78], and Maathuis et al.
[64]. By convention, we omit error variables in a causal graph if they are assumed
to be mutually independent [76].
Example G.1. Consider the SEM of the 1-dimensional variables A, W , H,
X, and Y and its associated causal graph given in Figure 22, where β0 is a
fixed unknown parameter, and where aW , aX , gY , gH , hX , and hY are some
appropriate functions. The variable A directly influences W , and W directly
influences the hidden variable H. The variable A is independent of H given W
because every path from A to H is blocked by W .
Fig 15. The results come from M = 1000 simulation runs from the SEM in Figure 13 with
χ = 1 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1.
The nuisance functions are estimated with additive splines. The figure displays the coverage
of two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines are indistinguishable in the left panel.
Fig 17. The results come from M = 1000 simulation runs from the SEM in Figure 16 with
κ = 2 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1.
The nuisance functions are estimated with additive splines. The figure displays the coverage
of two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines are indistinguishable in the left panel.
Fig 18. The results come from M = 1000 simulation runs from the SEM in Figure 16 with
κ = 0.25 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in
Algorithm 1. The nuisance functions are estimated with additive splines. The figure displays
the coverage of two-sided confidence intervals for β0 , type I error for two-sided testing of
the hypothesis H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML
(red), regDML (blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4)
(cyan), where all results are at level 95%. At each sample size N, the lengths of the confidence intervals are scaled
with the median length from DML. The shaded regions in the coverage and the type I error
plots represent 95% confidence bands with respect to the M simulation runs. The blue and
green lines are indistinguishable in the left panel.
E[R_A (R_Y − R_X β0)] = 0
in Equation (5). However, the identifiability condition does not hold in the
Fig 20. The results come from M = 1000 simulation runs from the SEM in Figure 19 with
κ = 1 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in Algorithm 1.
The nuisance functions are estimated with additive splines. The figure displays the coverage
of two-sided confidence intervals for β0 , type I error for two-sided testing of the hypothesis
H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red), regDML
(blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan), where all
results are at level 95%. At each sample size N , the lengths of the confidence intervals are
scaled with the median length from DML. The shaded regions in the coverage and the type I
error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines are indistinguishable in the left panel.
E[R_A (R_Y − R_X β0)]
= E[R_A ( H + ε_Y − E[H + ε_Y | W] )]
= E[R_A ( H − E[H|W] )]

and

E[A E[H|W]] = (1/3) E[AW] = (1/3) E[A^2] = 1/3 ≠ 0
Fig 21. The results come from M = 1000 simulation runs from the SEM in Figure 19 with
κ = 0.25 and β0 = 0 for a range of sample sizes N and with K = 2 and S = 100 in
Algorithm 1. The nuisance functions are estimated with additive splines. The figure displays
the coverage of two-sided confidence intervals for β0 , type I error for two-sided testing of the
hypothesis H0 : β0 = 0, and scaled lengths of two-sided confidence intervals of DML (red),
regDML (blue), regsDML (green), LIML (orange), Fuller(1) (purple), and Fuller(4) (cyan),
where all results are at level 95%. At each sample size N , the lengths of the confidence intervals
are scaled with the median length from DML. The shaded regions in the coverage and the type
I error plots represent 95% confidence bands with respect to the M simulation runs. The blue
and green lines are indistinguishable in the left panel.
Fig 22. An SEM satisfying the identifiability condition (5) and its associated causal graph as
in Example G.1.
Proof of Theorem 2.1. To prove the theorem, we need to verify that the representation

β0 = ( E[R_X R_A^T] E[R_A R_A^T]^{-1} E[R_A R_X^T] )^{-1} E[R_X R_A^T] E[R_A R_A^T]^{-1} E[R_A R_Y]
Fig 23. An SEM satisfying the identifiability condition (5) and its associated causal graph as
in Example G.2.
Fig 24. An SEM not satisfying the identifiability condition (5) together with its associated causal graph as in Example G.3.
We denote by ‖·‖ either the Euclidean norm for a vector or the operator norm for a matrix.
At r = 0, we have

∂/∂r E_P[ ψ(S; β0, η^0 + r(η − η^0)) ]
= E_P[ −( m_A(W) − m_A^0(W) ) ( Y − m_Y^0(W) − ( X − m_X^0(W) )^T β0 )
+ ( A − m_A^0(W) ) ( −( m_Y(W) − m_Y^0(W) ) + ( m_X(W) − m_X^0(W) )^T β0 ) ].

We verify that the two terms

E_P[ ( m_A(W) − m_A^0(W) ) ( Y − m_Y^0(W) − ( X − m_X^0(W) )^T β0 ) ]  (21)

and

E_P[ ( A − m_A^0(W) ) ( −( m_Y(W) − m_Y^0(W) ) + ( m_X(W) − m_X^0(W) )^T β0 ) ]  (22)

are equal to 0. We first consider the term (21). Recall the notations m_Y^0(W) = E_P[Y|W] and m_X^0(W) = E_P[X|W]. We have

E_P[ ( m_A(W) − m_A^0(W) ) ( Y − m_Y^0(W) − ( X − m_X^0(W) )^T β0 ) ]
= E_P[ ( m_A(W) − m_A^0(W) ) E_P[ Y − E_P[Y|W] − ( X − E_P[X|W] )^T β0 | W ] ]
= 0.

Next, we verify that the term given in (22) vanishes. Recall that we denote m_A^0(W) = E_P[A|W]. We have

E_P[ ( A − m_A^0(W) ) ( −( m_Y(W) − m_Y^0(W) ) + ( m_X(W) − m_X^0(W) )^T β0 ) ]
= E_P[ E_P[ A − E[A|W] | W ] ( −( m_Y(W) − m_Y^0(W) ) + ( m_X(W) − m_X^0(W) )^T β0 ) ]
= 0.

Because both terms (21) and (22) vanish, we conclude

∂/∂r E_P[ ψ(S; β0, η^0 + r(η − η^0)) ] |_{r=0} = 0.
Definition I.1. Consider a set T of nuisance functions. For S = (A, X, W, Y), an element η = (m_A, m_X, m_Y) ∈ T, and β ∈ R^d, we introduce the score functions

ψ̃(S; β, η) := ( X − m_X(W) ) ( Y − m_Y(W) − ( X − m_X(W) )^T β ),  (23)

and

ψ_1(S; η) := ( X − m_X(W) ) ( A − m_A(W) )^T,
ψ_2(S; η) := ( A − m_A(W) ) ( A − m_A(W) )^T,
ψ_3(S; η) := ( X − m_X(W) ) ( X − m_X(W) )^T.

Furthermore, let the matrices

D_1 := E_P[ψ_3(S; η^0)],
D_2 := E_P[ψ_1(S; η^0)] E_P[ψ_2(S; η^0)]^{-1} E_P[ψ_1^T(S; η^0)],
D_3 := E_P[ψ_1(S; η^0)] E_P[ψ_2(S; η^0)]^{-1},
D_5 := E_P[ψ_2(S; η^0)]^{-1} E_P[ψ̃(S; b^γ, η^0)],
J̃_0 := E_P[ψ(S; β0, η^0) ψ^T(S; β0, η^0)] = E_P[R_A R_A^T (R_Y − R_X^T β0)^2],
J̄_0 := E_P[R_A R_A^T],
J̌_0 := E_P[R_X R_A^T] (J̄_0)^{-1} E_P[R_A R_X^T],
J_0 := D_2^{-1} D_3,

and the variance-covariance matrix σ^2 := J_0 J̃_0 J_0^T. Moreover, let the score function

ψ̄(·; β0, η^0) := σ^{-1} J_0 ψ(·; β0, η^0).
Definition I.2. Let γ ≥ 0. Consider a realization set T of nuisance functions. Define the statistical rates

r_N^4 := max_{S=(U,V,W,Z) ∈ {A,X,Y}^2 × {W} × {A,X,Y}, b^0 ∈ {b^γ, β0, 0}} sup_{η ∈ T} E_P[ ‖ψ(S; b^0, η) − ψ(S; b^0, η^0)‖^4 ]

and

λ_N := max_{φ ∈ {ψ, ψ̃, ψ_2}, b^0 ∈ {b^γ, β0, 0}} sup_{r ∈ (0,1), η ∈ T} ‖ ∂_r^2 E_P[ φ(S; b^0, η^0 + r(η − η^0)) ] ‖,
If not stated otherwise, we assume the following Assumption I.5 in all the results presented in the appendix.

Assumptions I.5. Let γ ≥ 0. Let K ≥ 2 be a fixed integer independent of N. We assume that N ≥ K holds. Let {δ_N}_{N≥K} and {Δ_N}_{N≥K} be two sequences of positive numbers that converge to zero, where δ_N^{1/4} ≥ N^{−1/2} holds.

I.5.4 The symmetric matrices J̃_0, D_1 + (γ − 1)D_2, and D_4 are invertible, where D_4 is introduced in Definition J.1 in the appendix in Section J. The smallest and largest singular values of these matrices are bounded away from 0 by c_3 and are bounded away from +∞ by c_4.
I.5.5 The set T consists of P-integrable functions η = (m_A, m_X, m_Y) whose pth moment exists, and it contains η^0. There exists a finite real constant C_2 such that

‖η^0 − η‖_{P,p} ≤ C_2,  ‖η^0 − η‖_{P,2} ≤ δ_N,

‖m_A^0(W) − m_A(W)‖_{P,2}^2 ≤ δ_N N^{−1/2},

‖m_X^0(W) − m_X(W)‖_{P,2} · ( ‖m_Y^0(W) − m_Y(W)‖_{P,2} + ‖m_X^0(W) − m_X(W)‖_{P,2} ) ≤ δ_N N^{−1/2},

‖m_A^0(W) − m_A(W)‖_{P,2} · ( ‖m_Y^0(W) − m_Y(W)‖_{P,2} + ‖m_X^0(W) − m_X(W)‖_{P,2} ) ≤ δ_N N^{−1/2},

and, for the cross-fitted estimators,

‖m_X^0(W) − m̂_X^{I_k^c}(W)‖_{P,2} · ( ‖m_Y^0(W) − m̂_Y^{I_k^c}(W)‖_{P,2} + ‖m_X^0(W) − m̂_X^{I_k^c}(W)‖_{P,2} ) ≤ δ_N N^{−1/2},

‖m_A^0(W) − m̂_A^{I_k^c}(W)‖_{P,2} · ( ‖m_Y^0(W) − m̂_Y^{I_k^c}(W)‖_{P,2} + ‖m_X^0(W) − m̂_X^{I_k^c}(W)‖_{P,2} ) ≤ δ_N N^{−1/2}
Because the conditional expectation minimizes the mean squared error [39, Theorem 5.1.8], we have

E_P[ ‖Z_1 − E_P[Z_1|W]‖^2 ] ≤ ‖Z_1‖_{P,2}^2

and

E_P[ ‖Z_2 − E_P[Z_2|W]‖^2 ] ≤ ‖Z_2‖_{P,2}^2.

In total, we thus have

‖ E_P[ (Z_1 − E_P[Z_1|W]) (Z_2 − E_P[Z_2|W])^T ] ‖^2 ≤ ‖Z_1‖_{P,2}^2 ‖Z_2‖_{P,2}^2.

Because the conditional expectation minimizes the mean squared error [39, Theorem 5.1.8], we have

E_P[ ‖Z_1 − E_P[Z_1|W]‖^2 ] ≤ E_P[ ‖Z_1‖^2 ] = ‖Z_1‖_{P,2}^2.

Consequently,

‖ E_P[ (Z_1 − E_P[Z_1|W]) Z_2^T ] ‖^2 ≤ ‖Z_1‖_{P,2}^2 ‖Z_2‖_{P,2}^2

holds.
Lemma I.11. Let a, b ∈ R be two numbers. We have
by Markov’s inequality.
Lemma I.13. There exists a finite real constant C_3 satisfying ‖β0‖ ≤ C_3.

Proof of Lemma I.13. Recall the matrices J̄_0 and J̌_0 in Definition I.1. We have

‖β0‖ ≤ ‖(J̌_0)^{-1}‖ ‖E_P[A R_X^T]‖ ‖(J̄_0)^{-1}‖ ‖E_P[A R_Y]‖ ≤ (1/c_2^2) ‖X‖_{P,2} ‖Y‖_{P,2} ‖A‖_{P,2}^2,

and hence

‖β0‖ ≤ (1/c_2^2) C_1^4

by Assumption I.5.2.
Lemma I.14. Let γ ≥ 0. There exists a finite real constant C_4 satisfying ‖b^γ‖ ≤ C_4.

Proof of Lemma I.14. We have

‖b^γ‖ ≤ ‖( E_P[R_X R_X^T] + (γ − 1) E_P[R_X R_A^T] E_P[R_A R_A^T]^{-1} E_P[R_A R_X^T] )^{-1}‖
· ‖ E_P[R_X R_Y] + (γ − 1) E_P[R_X R_A^T] E_P[R_A R_A^T]^{-1} E_P[R_A R_Y] ‖
Proof of Lemma I.15. This proof is modified from Chernozhukov et al. [31]. First, we verify the bound on r_N. Let S = (U, V, W, Z) ∈ {A, X, Y}^2 × {W} × {A, X, Y} and b^0 ∈ {b^γ, β0, 0}. We have

ψ(S; b^0, η) − ψ(S; b^0, η^0)
= ( U − m_U(W) ) ( Z − m_Z(W) − ( V − m_V(W) )^T b^0 ) − ( U − m_U^0(W) ) ( Z − m_Z^0(W) − ( V − m_V^0(W) )^T b^0 )
= ( U − m_U^0(W) ) ( ( m_Z^0(W) − m_Z(W) ) − ( m_V^0(W) − m_V(W) )^T b^0 )
+ ( m_U^0(W) − m_U(W) ) ( Z − m_Z^0(W) − ( V − m_V^0(W) )^T b^0 )
+ ( m_U^0(W) − m_U(W) ) ( ( m_Z^0(W) − m_Z(W) ) − ( m_V^0(W) − m_V(W) )^T b^0 ).

Observe that ‖U − m_U^0(W)‖_{P,2} ≤ 2‖U‖_{P,2}, ‖V − m_V^0(W)‖_{P,2} ≤ 2‖V‖_{P,2}, and ‖Z − m_Z^0(W)‖_{P,2} ≤ 2‖Z‖_{P,2} hold by Lemma I.7. We have ‖η − η^0‖_{P,2} ≤ δ_N by Assumption I.5.5. Therefore, we obtain an upper bound of order δ_N by the triangle inequality, Lemma I.13, Lemma I.14, and Assumptions I.5.2 and I.5.5. Because this upper bound is independent of η, we obtain our claimed bound on r_N^4.
Subsequently, we verify the bound on λ_N. Consider S = (A, X, W, Y), denote by U either A or X, denote by Z either A or Y, and let φ ∈ {ψ, ψ̃, ψ_2}, where we interpret ψ_2(S; b, η) = ψ_2(S; η). We have

∂_r^2 E_P[ φ(S; b^0, η^0 + r(η − η^0)) ]
= 2 E_P[ ( m_U(W) − m_U^0(W) ) ( ( m_Z(W) − m_Z^0(W) ) − ( m_X(W) − m_X^0(W) )^T b^0 )^T ].
‖ (1/√n) Σ_{i∈I_k} φ(S_i; b^0, η̂^{I_k^c}) − (1/√n) Σ_{i∈I_k} φ(S_i; b^0, η^0) ‖ = O_P(ρ_N),

where ρ_N = r_N + N^{1/2} λ_N is as in Definition I.4 and satisfies ρ_N ≲ δ_N^{1/4}, and where we interpret ψ_2(S; b, η) = ψ_2(S; η).
Proof of Lemma I.16. This proof is modified from Chernozhukov et al. [31]. By the triangle inequality, we have

‖ (1/√n) Σ_{i∈I_k} φ(S_i; b^0, η̂^{I_k^c}) − (1/√n) Σ_{i∈I_k} φ(S_i; b^0, η^0) ‖
≤ ‖ (1/√n) Σ_{i∈I_k} ( φ(S_i; b^0, η̂^{I_k^c}) − ∫ φ(s; b^0, η̂^{I_k^c}) dP(s) ) − (1/√n) Σ_{i∈I_k} ( φ(S_i; b^0, η^0) − ∫ φ(s; b^0, η^0) dP(s) ) ‖
+ √n ‖ ∫ ( φ(s; b^0, η̂^{I_k^c}) − φ(s; b^0, η^0) ) dP(s) ‖
≤ I_1 + √n I_2,

where I_1 := ‖M‖ for

M := (1/√n) Σ_{i∈I_k} ( φ(S_i; b^0, η̂^{I_k^c}) − ∫ φ(s; b^0, η̂^{I_k^c}) dP(s) ) − (1/√n) Σ_{i∈I_k} ( φ(S_i; b^0, η^0) − ∫ φ(s; b^0, η^0) dP(s) ),

and where

I_2 := ‖ ∫ ( φ(s; b^0, η̂^{I_k^c}) − φ(s; b^0, η^0) ) dP(s) ‖.
We bound the two terms I_1 and I_2 individually. First, we bound I_1. Because the dimensions d and q are fixed, it is sufficient to bound one entry of the matrix M. Let l index the rows of M, and let t index the columns of M (we interpret vectors as matrices with one column). On the event E_N that holds with P-probability 1 − Δ_N, we have

E_P[ M_{l,t}^2 | {S_i}_{i∈I_k^c} ] ≤ E_P[ |φ_{l,t}(S; b^0, η̂^{I_k^c}) − φ_{l,t}(S; b^0, η^0)|^2 | {S_i}_{i∈I_k^c} ]

because, conditionally on {S_i}_{i∈I_k^c}, the summands of M_{l,t} are independent and centered.
and

E_P[ ‖φ(S; b^0, η) − φ(S; b^0, η^0)‖^2 1_{{‖φ(S; b^0, η) − φ(S; b^0, η^0)‖ ≥ 1}} ]
≤ ( E_P[ ‖φ(S; b^0, η) − φ(S; b^0, η^0)‖^4 ] P( ‖φ(S; b^0, η) − φ(S; b^0, η^0)‖ ≥ 1 ) )^{1/2}  (27)

by Hölder's inequality. Observe that the term

E_P[ ‖φ(S; b^0, η) − φ(S; b^0, η^0)‖^4 ]  (28)
Observe that I_2 = ‖f_k(1)‖ holds. We apply a Taylor expansion to this function and obtain

f_k(1) = f_k(0) + f_k'(0) + (1/2) f_k''(r̃)

for some r̃ ∈ (0, 1). We have

f_k(0) = E_P[ φ(S; b^0, η^0) | {S_i}_{i∈I_k^c} ] − E_P[φ(S; b^0, η^0)] = 0.

Furthermore, the score φ satisfies the Neyman orthogonality property f_k'(0) = 0. The proof of this claim is analogous to the proof of Proposition 3.3 because the proof of Proposition 3.3 does neither depend on the underlying model of the random variables nor on the value of β. Furthermore, we have

f_k''(r) = 2 E_P[ ( m_U(W) − m_U^0(W) ) ( ( m_Z(W) − m_Z^0(W) ) − ( m_X(W) − m_X^0(W) )^T b^0 )^T ]

for U ∈ {A, X} and Z ∈ {A, Y}. On the event E_N that holds with P-probability 1 − Δ_N, we have

‖f_k''(r̃)‖ ≤ sup_{r∈(0,1)} ‖f_k''(r)‖ ≲ λ_N.

We thus infer

‖ (1/√n) Σ_{i∈I_k} φ(S_i; b^0, η̂^{I_k^c}) − (1/√n) Σ_{i∈I_k} φ(S_i; b^0, η^0) ‖ ≤ I_1 + √n I_2 = O_P(r_N + N^{1/2} λ_N).
Because r_N ≲ δ_N^{1/4} and λ_N ≲ δ_N/√N hold by Lemma I.15 and because {δ_N}_{N≥K} converges to zero, the claim follows.

Because of Lemma I.16, the term (1/n) Σ_{i∈I_k} ( φ(S_i; η̂^{I_k^c}) − φ(S_i; η^0) ) is of order O_P(N^{−1/2} ρ_N). The term (1/n) Σ_{i∈I_k} ( φ(S_i; η^0) − E_P[φ(S; η^0)] ) is of order O_P(N^{−1/2}) due to the Lindeberg–Feller CLT and the Cramér–Wold device. Thus, we deduce the statement.
Definition I.18. We denote by A^{I_k^c} the row-wise concatenation of the observations A_i for i ∈ I_k^c. We denote similarly by X^{I_k^c}, W^{I_k^c}, Y^{I_k^c}, A^{I_k}, X^{I_k}, W^{I_k}, and Y^{I_k} the row-wise concatenations of the respective observations.
Proof of Theorem 3.1. This proof is based on Chernozhukov et al. [31]. We show the stronger statement

√N σ^{-1}(β̂ − β0) = (1/√N) Σ_{i=1}^N ψ̄(S_i; β0, η^0) + O_P(ρ_N) →_d N(0, 1_{d×d}) (N → ∞),  (30)

where β̂ denotes the DML1 estimator β̂^{DML1} or the DML2 estimator β̂^{DML2}, and where the rate ρ_N is specified in Definition I.4, and we show that this statement holds uniformly over laws P. We first consider β̂^{DML2}. It suffices to show that (30) holds uniformly over P ∈ P_N. Fix a sequence {P_N}_{N≥1} such that P_N ∈ P_N for all N ≥ 1. Because this sequence is chosen arbitrarily, it suffices to show

√N σ^{-1}(β̂^{DML2} − β0) = (1/√N) Σ_{i=1}^N ψ̄(S_i; β0, η^0) + O_{P_N}(ρ_N) →_d N(0, 1_{d×d}) (N → ∞).
We have

β̂^{DML2}
= ( (1/K) Σ_{k=1}^K ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) )^T Π_{R_A^{I_k}} ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) ) )^{-1}
· (1/K) Σ_{k=1}^K ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) )^T Π_{R_A^{I_k}} ( Y^{I_k} − m̂_Y^{I_k^c}(W^{I_k}) )
= [ (1/K) Σ_{k=1}^K (1/n) ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) )^T ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )
· ( (1/n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) ) )^{-1}  (31)
· (1/n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) ) ]^{-1}
· (1/K) Σ_{k=1}^K (1/n) ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) )^T ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )
· ( (1/n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) ) )^{-1}
· (1/n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( Y^{I_k} − m̂_Y^{I_k^c}(W^{I_k}) ).

Moreover, we have

(1/n) ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) )^T ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )
= E_{P_N}[ ( X − m_X^0(W) ) ( A − m_A^0(W) )^T ] + O_{P_N}( N^{−1/2} (1 + ρ_N) )  (32)

and

(1/n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )
= E_{P_N}[ ( A − m_A^0(W) ) ( A − m_A^0(W) )^T ] + O_{P_N}( N^{−1/2} (1 + ρ_N) ).  (33)

Due to Weyl's inequality and Slutsky's theorem, (31), (32), and (33), we obtain

√N (β̂^{DML2} − β0)
= ( [ E_{P_N}[ (X − m_X^0(W))(A − m_A^0(W))^T ] E_{P_N}[ (A − m_A^0(W))(A − m_A^0(W))^T ]^{-1} E_{P_N}[ (A − m_A^0(W))(X − m_X^0(W))^T ] ]^{-1}
· E_{P_N}[ (X − m_X^0(W))(A − m_A^0(W))^T ] E_{P_N}[ (A − m_A^0(W))(A − m_A^0(W))^T ]^{-1} + O_{P_N}( N^{−1/2}(1 + ρ_N) ) )
· (1/√K) Σ_{k=1}^K (1/√n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( Y^{I_k} − m̂_Y^{I_k^c}(W^{I_k}) − ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) ) β0 )
= ( J_0 + O_{P_N}( N^{−1/2}(1 + ρ_N) ) )
· (1/√K) Σ_{k=1}^K (1/√n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( Y^{I_k} − m̂_Y^{I_k^c}(W^{I_k}) − ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) ) β0 )  (34)

because K is a constant independent of N and because N = nK holds. Recall the linear score ψ in (11). We have

√N (β̂^{DML2} − β0) = ( J_0 + O_{P_N}( N^{−1/2}(1 + ρ_N) ) ) (1/√K) Σ_{k=1}^K (1/√n) Σ_{i∈I_k} ψ(S_i; β0, η̂^{I_k^c}).  (35)

Let k ∈ [K]. By Lemma I.16, we have

(1/√n) Σ_{i∈I_k} ψ(S_i; β0, η̂^{I_k^c}) = (1/√n) Σ_{i∈I_k} ψ(S_i; β0, η^0) + O_{P_N}(ρ_N).  (36)

Thus, we have

√N (β̂^{DML2} − β0)
= ( J_0 + O_{P_N}( N^{−1/2}(1 + ρ_N) ) ) (1/√K) Σ_{k=1}^K ( (1/√n) Σ_{i∈I_k} ψ(S_i; β0, η^0) + O_{P_N}(ρ_N) )
= ( J_0 + O_{P_N}( N^{−1/2}(1 + ρ_N) ) ) ( (1/√N) Σ_{i=1}^N ψ(S_i; β0, η^0) + O_{P_N}(ρ_N) )
= J_0 · (1/√N) Σ_{i=1}^N ψ(S_i; β0, η^0) + O_{P_N}(ρ_N).
Next, we consider the DML1 estimator and show

√N σ^{-1}(β̂^{DML1} − β0) = (1/√N) Σ_{i=1}^N ψ̄(S_i; β0, η^0) + O_{P_N}(ρ_N) →_d N(0, 1_{d×d}) (N → ∞).

We have

β̂^{I_k}
= ( ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) )^T Π_{R_A^{I_k}} ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) ) )^{-1} ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) )^T Π_{R_A^{I_k}} ( Y^{I_k} − m̂_Y^{I_k^c}(W^{I_k}) )
= [ (1/n) ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) )^T ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )
· ( (1/n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) ) )^{-1}  (37)
· (1/n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) ) ]^{-1}
· (1/n) ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) )^T ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )
· ( (1/n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) ) )^{-1}
· (1/n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( Y^{I_k} − m̂_Y^{I_k^c}(W^{I_k}) )

by (19). Due to Weyl's inequality and Slutsky's theorem, (32), (33), and (37), we obtain

√N (β̂^{DML1} − β0)
= ( [ E_{P_N}[ (X − m_X^0(W))(A − m_A^0(W))^T ] E_{P_N}[ (A − m_A^0(W))(A − m_A^0(W))^T ]^{-1} E_{P_N}[ (A − m_A^0(W))(X − m_X^0(W))^T ] ]^{-1}
· E_{P_N}[ (X − m_X^0(W))(A − m_A^0(W))^T ] E_{P_N}[ (A − m_A^0(W))(A − m_A^0(W))^T ]^{-1} + O_{P_N}( N^{−1/2}(1 + ρ_N) ) )
· (1/√K) Σ_{k=1}^K (1/√n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( Y^{I_k} − m̂_Y^{I_k^c}(W^{I_k}) − ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) ) β0 )
= ( J_0 + O_{P_N}( N^{−1/2}(1 + ρ_N) ) )
· (1/√K) Σ_{k=1}^K (1/√n) ( A^{I_k} − m̂_A^{I_k^c}(W^{I_k}) )^T ( Y^{I_k} − m̂_Y^{I_k^c}(W^{I_k}) − ( X^{I_k} − m̂_X^{I_k^c}(W^{I_k}) ) β0 ).  (38)

Observe that the expression for √N(β̂^{DML1} − β0) given in (38) coincides with the expression for √N(β̂^{DML2} − β0) given in (34). Thus, the asymptotic analysis of √N(β̂^{DML1} − β0) coincides with the asymptotic analysis of √N(β̂^{DML2} − β0) presented above.
Lemma I.19. Let γ ≥ 0. Let p > 4 be the p from Assumption I.5, let b^0 ∈ {β0, b^γ, 0}, and let S = (U, V, W, Z) ∈ {A, X, Y}^2 × {W} × {A, X, Y}. There exists a finite real constant C_5 satisfying

sup_{η∈T} ‖ψ(S; b^0, η)‖_{P,p/2} ≤ C_5,

and analogously

‖( m_V^0(W) − m_V(W) )^T b^0‖_{P,p} ≤ ‖b^0‖ ‖m_V^0(W) − m_V(W)‖_{P,p}.  (41)

Hence, we infer

‖ψ(S; b^0, η)‖_{P,p/2} ≤ ( ‖U‖_{P,p} + C_2 ) ( ‖Z‖_{P,p} + ‖V‖_{P,p} + 2C_2 ) max{1, ‖b^0‖}  (42)

by (39), (40), (41), Lemma I.7, and Assumption I.5.5. By Lemma I.13, there exists a finite real constant C_3 that satisfies ‖β0‖ ≤ C_3. By Lemma I.14, there exists a finite real constant C_4 that satisfies ‖b^γ‖ ≤ C_4. These two bounds lead to ‖b^0‖ ≤ max{C_3, C_4}. By Assumption I.5.2, we have
Lemma I.20. Let γ ≥ 0, and let p be as in Assumption I.5. Let the indices k ∈ [K] and (j, l, t, r) ∈ [L_1] × [L_2] × [L_3] × [L_4], where L_1, L_2, L_3, and L_4 are natural numbers representing the intended dimensions. Let b̂ ∈ {β̂^{DML1}, β̂^{DML2}, b̂^{γ,DML1}, b̂^{γ,DML2}}, and consider the corresponding true but unknown underlying parameter vector b^0 ∈ {β0, b^γ}. Consider the corresponding score function combinations

ψ̂^A(·) ∈ { ψ_j(·; b̂, η̂^{I_k^c}), ψ̃_j(·; b̂, η̂^{I_k^c}), (ψ_1(·; η̂^{I_k^c}))_{j,l}, (ψ_2(·; η̂^{I_k^c}))_{j,l} },
ψ̂_full^A(·) ∈ { ψ(·; b̂, η̂^{I_k^c}), ψ̃(·; b̂, η̂^{I_k^c}), ψ_1(·; η̂^{I_k^c}), ψ_2(·; η̂^{I_k^c}) },
ψ̂^B(·) ∈ { ψ_t(·; b̂, η̂^{I_k^c}), ψ̃_t(·; b̂, η̂^{I_k^c}), (ψ_1(·; η̂^{I_k^c}))_{t,r}, (ψ_2(·; η̂^{I_k^c}))_{t,r} },
ψ̂_full^B(·) ∈ { ψ(·; b̂, η̂^{I_k^c}), ψ̃(·; b̂, η̂^{I_k^c}), ψ_1(·; η̂^{I_k^c}), ψ_2(·; η̂^{I_k^c}) },
ψ^A(·) ∈ { ψ_j(·; b^0, η^0), ψ̃_j(·; b^0, η^0), (ψ_1(·; η^0))_{j,l}, (ψ_2(·; η^0))_{j,l} },
ψ_full^A(·) ∈ { ψ(·; b^0, η^0), ψ̃(·; b^0, η^0), ψ_1(·; η^0), ψ_2(·; η^0) },
ψ^B(·) ∈ { ψ_t(·; b^0, η^0), ψ̃_t(·; b^0, η^0), (ψ_1(·; η^0))_{t,r}, (ψ_2(·; η^0))_{t,r} },
ψ_full^B(·) ∈ { ψ(·; b^0, η^0), ψ̃(·; b^0, η^0), ψ_1(·; η^0), ψ_2(·; η^0) }.

Then we have

I_k := | (1/n) Σ_{i∈I_k} ψ̂^A(S_i) ψ̂^B(S_i) − E_P[ ψ^A(S) ψ^B(S) ] | = O_P(ρ̃_N),

where ρ̃_N = N^{max{4/p − 1, −1/2}} + r_N is as in Definition I.4.
Proof of Lemma I.20. This proof is modified from Chernozhukov et al. [31]. By the triangle inequality, we have I_k ≤ I_{k,A} + I_{k,B}, where

I_{k,A} := | (1/n) Σ_{i∈I_k} ψ̂^A(S_i) ψ̂^B(S_i) − (1/n) Σ_{i∈I_k} ψ^A(S_i) ψ^B(S_i) |

and

I_{k,B} := | (1/n) Σ_{i∈I_k} ψ^A(S_i) ψ^B(S_i) − E_P[ ψ^A(S) ψ^B(S) ] |.
Subsequently, we bound the two terms I_{k,A} and I_{k,B} individually. First, we bound I_{k,B}. We consider the case p ≤ 8. The von Bahr–Esseen inequality [37, p. 650] states that for 1 ≤ u ≤ 2 and for independent, real-valued, and mean-0 variables Z_1, . . . , Z_n, we have

E[ | Σ_{i=1}^n Z_i |^u ] ≤ ( 2 − (1/n) ) Σ_{i=1}^n E[ |Z_i|^u ].

The individual summands ψ^A(S_i)ψ^B(S_i) − E_P[ψ^A(S)ψ^B(S)] for i ∈ I_k are independent and have mean 0. Therefore,

E_P[ I_{k,B}^{p/4} ]
= (1/n)^{p/4} E_P[ | Σ_{i∈I_k} ( ψ^A(S_i)ψ^B(S_i) − E_P[ψ^A(S)ψ^B(S)] ) |^{p/4} ]
≤ (1/n)^{−1+p/4} ( 2 − (1/n) ) (1/n) Σ_{i∈I_k} E_P[ | ψ^A(S_i)ψ^B(S_i) − E_P[ψ^A(S)ψ^B(S)] |^{p/4} ]
= (1/n)^{−1+p/4} ( 2 − (1/n) ) E_P[ | ψ^A(S)ψ^B(S) − E_P[ψ^A(S)ψ^B(S)] |^{p/4} ],

and

‖ ψ^A(S)ψ^B(S) − E_P[ψ^A(S)ψ^B(S)] ‖_{P,p/4}
≤ ‖ψ^A(S)ψ^B(S)‖_{P,p/4} + E_P[ |ψ^A(S)ψ^B(S)| ]
≤ 2 ‖ψ^A(S)ψ^B(S)‖_{P,p/4}

by the triangle inequality, Hölder's inequality, and due to p/4 > 1.
Next, consider the case p > 8. Observe that

E_P[ ( (1/n) Σ_{i∈I_k} ψ^A(S_i)ψ^B(S_i) )^2 ]
= (1/n) E_P[ ( ψ^A(S) )^2 ( ψ^B(S) )^2 ] + ( n(n−1)/n^2 ) ( E_P[ψ^A(S)ψ^B(S)] )^2,

and hence

E_P[ I_{k,B}^2 ]
= E_P[ ( (1/n) Σ_{i∈I_k} ψ^A(S_i)ψ^B(S_i) )^2 ] + ( E_P[ψ^A(S)ψ^B(S)] )^2 − 2 E_P[ (1/n) Σ_{i∈I_k} ψ^A(S_i)ψ^B(S_i) ] E_P[ψ^A(S)ψ^B(S)]
≤ (1/n) E_P[ ( ψ^A(S) )^2 ( ψ^B(S) )^2 ].

Consequently,

E_P[ I_{k,B}^2 ] ≤ (1/n) ‖ψ_full^A(S)‖_{P,4}^2 ‖ψ_full^B(S)‖_{P,4}^2 ≤ (1/n) (4 C_5)^4.
Second, we bound the term I_{k,A}. For any real numbers a_1, a_2, b_1, and b_2 such that real numbers c and d exist that satisfy max{|b_1|, |b_2|} ≤ c and max{|a_1 − b_1|, |a_2 − b_2|} ≤ d, we have |a_1 a_2 − b_1 b_2| ≤ 2d(c + d). Indeed, we have

|a_1 a_2 − b_1 b_2| ≤ |a_1 − b_1| · |a_2 − b_2| + |b_1| · |a_2 − b_2| + |a_1 − b_1| · |b_2| ≤ d^2 + cd + dc ≤ 2d(c + d).

Therefore, by the Cauchy–Schwarz inequality,

I_{k,A} ≤ (1/n) Σ_{i∈I_k} | ψ̂^A(S_i)ψ̂^B(S_i) − ψ^A(S_i)ψ^B(S_i) |
≤ (2/n) Σ_{i∈I_k} max{ |ψ̂^A(S_i) − ψ^A(S_i)|, |ψ̂^B(S_i) − ψ^B(S_i)| }
· ( max{ |ψ^A(S_i)|, |ψ^B(S_i)| } + max{ |ψ̂^A(S_i) − ψ^A(S_i)|, |ψ̂^B(S_i) − ψ^B(S_i)| } )
≤ 2 ( (1/n) Σ_{i∈I_k} max{ |ψ̂^A(S_i) − ψ^A(S_i)|^2, |ψ̂^B(S_i) − ψ^B(S_i)|^2 } )^{1/2}
· ( ( (1/n) Σ_{i∈I_k} max{ |ψ^A(S_i)|^2, |ψ^B(S_i)|^2 } )^{1/2}
+ ( (1/n) Σ_{i∈I_k} max{ |ψ̂^A(S_i) − ψ^A(S_i)|^2, |ψ̂^B(S_i) − ψ^B(S_i)|^2 } )^{1/2} ).
Define

R_{N,k} := (1/n) Σ_{i∈I_k} ( ‖ψ̂_full^A(S_i) − ψ_full^A(S_i)‖^2 + ‖ψ̂_full^B(S_i) − ψ_full^B(S_i)‖^2 ).
by Markov's inequality because the terms ‖ψ(S; b^0, η^0)‖_{P,4}, ‖ψ̃(S; b^0, η^0)‖_{P,4}, ‖ψ_1(S; η)‖_{P,4}, and ‖ψ_2(S; η)‖_{P,4} are upper bounded by C_5 by Lemma I.19. Thus, it suffices to bound the term R_{N,k}. To do this, we need to bound the four terms

(1/n) Σ_{i∈I_k} ‖ψ(S_i; b̂, η̂^{I_k^c}) − ψ(S_i; b^0, η^0)‖^2,  (44)

(1/n) Σ_{i∈I_k} ‖ψ̃(S_i; b̂, η̂^{I_k^c}) − ψ̃(S_i; b^0, η^0)‖^2,  (45)

(1/n) Σ_{i∈I_k} ‖ψ_1(S_i; η̂^{I_k^c}) − ψ_1(S_i; η^0)‖^2,  (46)

(1/n) Σ_{i∈I_k} ‖ψ_2(S_i; η̂^{I_k^c}) − ψ_2(S_i; η^0)‖^2.  (47)
First, we bound the two terms (44) and (45) simultaneously. Consider the random variable $U \in \{A, X\}$ and the quadruple $S = (U, X, W, Y)$. Because the score is linear in $b$, we have
\[
\begin{aligned}
\frac1n\sum_{i\in I_k}\big\|\psi(S_i;\hat b,\hat\eta^{I_k^c}) - \psi(S_i;b_0,\eta^0)\big\|^2
&\le \frac2n\sum_{i\in I_k}\big\|\psi^a(S_i;\hat\eta^{I_k^c})\big\|^2\,\big\|\hat b - b_0\big\|^2\\
&\qquad + \frac2n\sum_{i\in I_k}\big\|\psi(S_i;b_0,\hat\eta^{I_k^c}) - \psi(S_i;b_0,\eta^0)\big\|^2
\end{aligned}
\tag{48}
\]
due to the triangle inequality and Lemma I.11. Subsequently, we verify that
\[
\frac1n\sum_{i\in I_k}\big\|\psi^a(S_i;\hat\eta^{I_k^c})\big\|^2 = O_P(1)
\]
by Markov’s inequality because the term EP [U − m0U (W )4 ] is upper bounded
by Lemma I.7 and Assumption I.5.2. On the event EN that holds with P -
probability 1 − ΔN , we have
c
EP n1 i∈Ik η 0 (Wi ) − η̂ Ik (Wi )4 {Si }i∈Ikc
c (51)
= EP η 0 (W ) − η̂ Ik (W )4 |{Si }i∈Ikc
≤ C24
c
by Assumption I.5.5. We hence have $\frac1n\sum_{i\in I_k}\|\eta^0(W_i) - \hat\eta^{I_k^c}(W_i)\|^4 = O_P(1)$ by Lemma I.12. Let us denote by $\|\cdot\|_{P_{I_k},p}$ the $L^p$-norm with respect to the empirical measure on the data indexed by $I_k$. On the event $\mathcal E_N$ that holds with $P$-probability $1-\Delta_N$, we have
\[
\begin{aligned}
\frac1n\sum_{i\in I_k}\big\|U_i - \hat m_U^{I_k^c}(W_i)\big\|^4
&= \big\|U - \hat m_U^{I_k^c}(W)\big\|_{P_{I_k},4}^4\\
&\le \Big(\big\|U - m_U^0(W)\big\|_{P_{I_k},4} + \big\|m_U^0(W) - \hat m_U^{I_k^c}(W)\big\|_{P_{I_k},4}\Big)^4\\
&\le \Big(\big\|U - m_U^0(W)\big\|_{P_{I_k},4} + \big\|\eta^0(W) - \hat\eta^{I_k^c}(W)\big\|_{P_{I_k},4}\Big)^4\\
&= O_P(1)
\end{aligned}
\tag{52}
\]
due to the Cauchy–Schwarz inequality and (54). On the event $\mathcal E_N$ that holds with $P$-probability $1-\Delta_N$, the conditional expectation given $\{S_i\}_{i\in I_k^c}$ of the second summand in (48) is equal to
\[
\begin{aligned}
E_P\Big[\frac2n\sum_{i\in I_k}\big\|\psi(S_i;b_0,\hat\eta^{I_k^c}) - \psi(S_i;b_0,\eta^0)\big\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Big]
&= 2\,E_P\Big[\big\|\psi(S;b_0,\hat\eta^{I_k^c}) - \psi(S;b_0,\eta^0)\big\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Big]\\
&\le 2\,\sup_{\eta\in\mathcal T}E_P\Big[\big\|\psi(S;b_0,\eta) - \psi(S;b_0,\eta^0)\big\|^2\Big]
\le 2\,r_N^2,
\end{aligned}
\]
so that the second summand in (48) is $O_P(r_N^2)$ by Lemma I.12. Next, we bound the two terms given in (46) and (47). We first consider the term given in (46). On the event $\mathcal E_N$, we have
\[
E_P\Big[\frac1n\sum_{i\in I_k}\big\|\psi_1(S_i;\hat\eta^{I_k^c}) - \psi_1(S_i;\eta^0)\big\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Big]
= E_P\Big[\big\|\psi_1(S;\hat\eta^{I_k^c}) - \psi_1(S;\eta^0)\big\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Big]
\le r_N^2,
\]
so that the term (46) is $O_P(r_N^2)$ by Lemma I.12. On the event $\mathcal E_N$, the conditional expectation given $\{S_i\}_{i\in I_k^c}$ of the term (47) is given by
\[
E_P\Big[\frac1n\sum_{i\in I_k}\big\|\psi_2(S_i;\hat\eta^{I_k^c}) - \psi_2(S_i;\eta^0)\big\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Big]
= E_P\Big[\big\|\psi_2(S;\hat\eta^{I_k^c}) - \psi_2(S;\eta^0)\big\|^2\,\Big|\,\{S_i\}_{i\in I_k^c}\Big]
\le r_N^2,
\]
so that (47) is $O_P(r_N^2)$ by Lemma I.12 as well. We thus have
\[
\mathcal I_k = O_P\big(N^{\max\{4/p-1,\,-1/2\}}\big) + O_P\big(N^{-1/2} + r_N\big) = O_P\big(N^{\max\{4/p-1,\,-1/2\}} + r_N\big).
\]
\[
\hat J_0 := \frac1K\sum_{k=1}^K \hat J_{k,0}.
\]
But this statement holds by Lemma I.20 because the dimensions of $A$ and $X$ are fixed.
Let
\[
D_4 := E_P\Big[\tilde\psi(S;b^\gamma,\eta^0)\big(\tilde\psi(S;b^\gamma,\eta^0)\big)^T\Big],
\]
and let the variance be
\[
\sigma^2(\gamma) := \big(D_1 + (\gamma-1)D_2\big)^{-1}D_4\big(D_1^T + (\gamma-1)D_2^T\big)^{-1}.
\]
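As an aside (our gloss, not part of the original argument), $\sigma^2(\gamma)$ has the familiar sandwich form. Once asymptotic normality is established in Theorem 4.1 below, a two-sided Wald confidence interval for $b^\gamma$ in the scalar case takes the standard shape
\[
\Big[\hat b^\gamma - \Phi^{-1}\big(1-\tfrac\alpha2\big)\,\frac{\hat\sigma(\gamma)}{\sqrt N},\ \hat b^\gamma + \Phi^{-1}\big(1-\tfrac\alpha2\big)\,\frac{\hat\sigma(\gamma)}{\sqrt N}\Big],
\]
where $\hat\sigma(\gamma)$ is the plug-in estimate of $\sigma(\gamma)$ from Theorem J.3 below.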
Proof of Theorem 4.1. This proof is based on Chernozhukov et al. [31]. The
matrices D1 + (γ − 1)D2 and D4 are invertible by Assumption I.5.4. Hence,
σ 2 (γ) is invertible.
Subsequently, we show the stronger statement
\[
\sqrt N\,\sigma^{-1}(\gamma)\big(\hat b^\gamma - b^\gamma\big)
= \frac{1}{\sqrt N}\sum_{i=1}^N \tilde\psi(S_i;b^\gamma,\eta^0) + O_P(\rho_N)
\ \xrightarrow{d}\ \mathcal N(0,\,1_{d\times d}) \quad (N\to\infty), \tag{56}
\]
where $\hat b^\gamma$ denotes the DML2 estimator $\hat b^{\gamma,\mathrm{DML2}}$ or its DML1 variant $\hat b^{\gamma,\mathrm{DML1}}$, and where $\tilde\psi$ is as in Definition J.1. We first consider $\hat b^{\gamma,\mathrm{DML2}}$ and afterwards $\hat b^{\gamma,\mathrm{DML1}}$. Fix a sequence $\{P_N\}_{N\ge1}$ such that $P_N\in\mathcal P_N$ for all $N\ge1$. Because this sequence is chosen arbitrarily, it suffices to show
\[
\sqrt N\,\sigma^{-1}(\gamma)\big(\hat b^{\gamma,\mathrm{DML2}} - b^\gamma\big)
= \frac{1}{\sqrt N}\sum_{i=1}^N \tilde\psi(S_i;b^\gamma,\eta^0) + O_{P_N}(\rho_N)
\ \xrightarrow{d}\ \mathcal N(0,\,1_{d\times d}) \quad (N\to\infty).
\]
We have
\[
\begin{aligned}
\hat b^{\gamma,\mathrm{DML2}}
&= \Big[\frac1K\sum_{k=1}^K (R_X^{I_k})^T\big(1 + (\gamma-1)\Pi_{R_A^{I_k}}\big)R_X^{I_k}\Big]^{-1}
\frac1K\sum_{k=1}^K (R_X^{I_k})^T\big(1 + (\gamma-1)\Pi_{R_A^{I_k}}\big)R_Y^{I_k}\\
&= \bigg[\frac1K\sum_{k=1}^K\Big(\frac1n\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)^T\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)\\
&\qquad + (\gamma-1)\cdot\frac1n\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)^T\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)\\
&\qquad\cdot\Big[\frac1n\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)^T\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)\Big]^{-1}\\
&\qquad\cdot\frac1n\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)^T\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)\Big)\bigg]^{-1}\\
&\quad\cdot\frac1K\sum_{k=1}^K\Big(\frac1n\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)^T\big(Y^{I_k}-\hat m_Y^{I_k^c}(W^{I_k})\big)\\
&\qquad + (\gamma-1)\cdot\frac1n\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)^T\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)\\
&\qquad\cdot\Big[\frac1n\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)^T\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)\Big]^{-1}\\
&\qquad\cdot\frac1n\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)^T\big(Y^{I_k}-\hat m_Y^{I_k^c}(W^{I_k})\big)\Big).
\end{aligned}
\tag{57}
\]
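The display (57) is a cross-fitted k-class estimator on residualized data. The following minimal Python sketch mirrors it under simplifying assumptions (scalar $X$ and $A$, random forests standing in for the spline-based nuisance estimators used in the paper's simulations); all function names are ours, and the snippet is illustrative rather than the authors' implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def crossfit_residuals(X, A, W, Y, K=2, seed=0):
    # K-fold cross-fitted residuals U - m_U_hat(W) for U in {X, A, Y},
    # with each nuisance regression fitted on the complementary folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), K)
    residuals = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        res = []
        for U in (X, A, Y):
            m = RandomForestRegressor(random_state=0).fit(W[train], U[train])
            res.append((U[test] - m.predict(W[test])).reshape(-1, 1))
        residuals.append(tuple(res))
    return residuals  # list of per-fold (RX, RA, RY)

def dml2_kclass(residuals, gamma):
    # DML2 k-class estimator of (57): pool the per-fold normal equations
    # RX' (I + (gamma - 1) * Pi_RA) RX and the matching right-hand sides
    # across folds, then solve once.
    num, den, K = 0.0, 0.0, len(residuals)
    for RX, RA, RY in residuals:
        n = RX.shape[0]
        Pi = RA @ np.linalg.solve(RA.T @ RA, RA.T)  # projection onto col(RA)
        M = np.eye(n) + (gamma - 1.0) * Pi
        den = den + RX.T @ M @ RX / n
        num = num + RX.T @ M @ RY / n
    return np.linalg.solve(den / K, num / K)
```

For $\gamma = 1$ the weight matrix is the identity and the estimator reduces to least squares on the residuals, while letting $\gamma \to \infty$ recovers the TSLS-type estimator with instrument residuals $R_A$.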
Moreover, we have
\[
\begin{aligned}
\frac1n\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)^T\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)
&= E_{P_N}\Big[\big(X - m_X^0(W)\big)\big(A - m_A^0(W)\big)^T\Big] + O_{P_N}\big(N^{-\frac12}(1+\rho_N)\big),\\
\frac1n\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)^T\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)
&= E_{P_N}\Big[\big(A - m_A^0(W)\big)\big(A - m_A^0(W)\big)^T\Big] + O_{P_N}\big(N^{-\frac12}(1+\rho_N)\big),\\
\frac1n\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)^T\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)
&= E_{P_N}\Big[\big(X - m_X^0(W)\big)\big(X - m_X^0(W)\big)^T\Big] + O_{P_N}\big(N^{-\frac12}(1+\rho_N)\big),
\end{aligned}
\tag{60}
\]
and
\[
\begin{aligned}
\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})
&= \frac1n\sum_{i\in I_k}\big(\psi_1(S_i;\hat\eta^{I_k^c}) - \psi_1(S_i;\eta^0)\big)
+ \frac1n\sum_{i\in I_k}\big(\psi_1(S_i;\eta^0) - E_{P_N}[\psi_1(S;\eta^0)]\big)\\
&\qquad + E_{P_N}\big[\psi_1(S;\eta^0)\big].
\end{aligned}
\tag{61}
\]
We apply a series expansion to obtain
\[
\begin{aligned}
\Big(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Big)^{-1}
&= \Big(E_{P_N}[\psi_2(S;\eta^0)] + \frac1n\sum_{i\in I_k}\big(\psi_2(S_i;\hat\eta^{I_k^c}) - \psi_2(S_i;\eta^0)\big)\\
&\qquad + \frac1n\sum_{i\in I_k}\big(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\big)\Big)^{-1}\\
&= E_{P_N}[\psi_2(S;\eta^0)]^{-1}\\
&\quad - E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Big(\frac1n\sum_{i\in I_k}\big(\psi_2(S_i;\hat\eta^{I_k^c}) - \psi_2(S_i;\eta^0)\big)\Big)E_{P_N}[\psi_2(S;\eta^0)]^{-1}\\
&\quad - E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Big(\frac1n\sum_{i\in I_k}\big(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\big)\Big)E_{P_N}[\psi_2(S;\eta^0)]^{-1}\\
&\quad + O_{P_N}\Big(\Big\|\frac1n\sum_{i\in I_k}\big(\psi_2(S_i;\hat\eta^{I_k^c}) - \psi_2(S_i;\eta^0)\big)\Big\|^2
+ \Big\|\frac1n\sum_{i\in I_k}\big(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\big)\Big\|^2\Big)\\
&= E_{P_N}[\psi_2(S;\eta^0)]^{-1} + O_{P_N}\big(N^{-\frac12}\rho_N\big) + O_{P_N}\big(N^{-1}\rho_N^2\big) + O_{P_N}\big(N^{-1}\big)\\
&\quad - E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Big(\frac1n\sum_{i\in I_k}\big(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\big)\Big)E_{P_N}[\psi_2(S;\eta^0)]^{-1}.
\end{aligned}
\]
Combining this expansion with (61), we obtain
\[
\begin{aligned}
&\sqrt n\,\Big(\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\Big)\Big(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Big)^{-1}\Big(\frac1n\sum_{i\in I_k}\psi(S_i;b^\gamma,\hat\eta^{I_k^c})\Big)\\
&\quad= \frac{1}{\sqrt n}\sum_{i\in I_k}\big(\psi_1(S_i;\eta^0) - E_{P_N}[\psi_1(S;\eta^0)]\big)\,E_{P_N}[\psi_2(S;\eta^0)]^{-1}\,E_{P_N}\big[\psi(S;b^\gamma,\eta^0)\big]\\
&\qquad + E_{P_N}[\psi_1(S;\eta^0)]\,E_{P_N}[\psi_2(S;\eta^0)]^{-1}\,\frac{1}{\sqrt n}\sum_{i\in I_k}\psi(S_i;b^\gamma,\eta^0)\\
&\qquad - E_{P_N}[\psi_1(S;\eta^0)]\,E_{P_N}[\psi_2(S;\eta^0)]^{-1}\Big(\frac{1}{\sqrt n}\sum_{i\in I_k}\big(\psi_2(S_i;\eta^0) - E_{P_N}[\psi_2(S;\eta^0)]\big)\Big)\\
&\qquad\qquad\cdot E_{P_N}[\psi_2(S;\eta^0)]^{-1}\,E_{P_N}\big[\psi(S;b^\gamma,\eta^0)\big]
+ O_{P_N}(\rho_N).
\end{aligned}
\tag{63}
\]
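The expansion used above is the standard second-order resolvent (Neumann-series) bound, which we record here for the reader's convenience: for an invertible matrix $M$ and a perturbation $E$ with $\|M^{-1}E\| < 1$,
\[
(M+E)^{-1} = M^{-1} - M^{-1}EM^{-1} + O\big(\|E\|^2\big),
\]
applied with $M = E_{P_N}[\psi_2(S;\eta^0)]$ and $E$ given by the two centered empirical averages, whose norms are $O_{P_N}(N^{-1/2}\rho_N)$ and $O_{P_N}(N^{-1/2})$, respectively.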
We have
\[
\begin{aligned}
\hat b^{\gamma,\mathrm{DML1}}
&= \frac1K\sum_{k=1}^K\Big[(R_X^{I_k})^T\big(1+(\gamma-1)\Pi_{R_A^{I_k}}\big)R_X^{I_k}\Big]^{-1}(R_X^{I_k})^T\big(1+(\gamma-1)\Pi_{R_A^{I_k}}\big)R_Y^{I_k}\\
&= \frac1K\sum_{k=1}^K\bigg[\frac1n\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)^T\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)\\
&\qquad + (\gamma-1)\cdot\frac1n\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)^T\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)\\
&\qquad\cdot\Big[\frac1n\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)^T\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)\Big]^{-1}\\
&\qquad\cdot\frac1n\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)^T\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)\bigg]^{-1}\\
&\quad\cdot\bigg[\frac1n\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)^T\big(Y^{I_k}-\hat m_Y^{I_k^c}(W^{I_k})\big)\\
&\qquad + (\gamma-1)\cdot\frac1n\big(X^{I_k}-\hat m_X^{I_k^c}(W^{I_k})\big)^T\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)\\
&\qquad\cdot\Big[\frac1n\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)^T\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)\Big]^{-1}\\
&\qquad\cdot\frac1n\big(A^{I_k}-\hat m_A^{I_k^c}(W^{I_k})\big)^T\big(Y^{I_k}-\hat m_Y^{I_k^c}(W^{I_k})\big)\bigg].
\end{aligned}
\tag{65}
\]
The last expression above coincides with (58). Consequently, the same asymp-
totic analysis conducted for b̂γ,DML2 can also be employed in this case.
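In code, the DML1 variant differs from the DML2 sketch given after (57) only in where the averaging happens: it solves the k-class normal equations separately on each fold and averages the $K$ resulting estimates. A hedged sketch, reusing the hypothetical per-fold residuals format introduced above:

```python
import numpy as np

def dml1_kclass(residuals, gamma):
    # DML1 variant of (65): compute one k-class estimate per fold,
    # then average the K estimates over folds.
    estimates = []
    for RX, RA, RY in residuals:
        n = RX.shape[0]
        Pi = RA @ np.linalg.solve(RA.T @ RA, RA.T)
        M = np.eye(n) + (gamma - 1.0) * Pi
        estimates.append(np.linalg.solve(RX.T @ M @ RX, RX.T @ M @ RY))
    return sum(estimates) / len(estimates)
```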
Lemma J.2. Let $\gamma \ge 0$ and let $\varphi \in \{\psi, \bar\psi\}$. We have
\[
\frac1n\sum_{i\in I_k}\varphi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) = E_P\big[\varphi(S;b^\gamma,\eta^0)\big] + O_P\big(N^{-\frac12}(1+\rho_N)\big).
\]
Proof of Lemma J.2. Write
\[
\begin{aligned}
\frac1n\sum_{i\in I_k}\varphi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})
&= \Big(\frac1n\sum_{i\in I_k}\varphi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - \frac1n\sum_{i\in I_k}\varphi(S_i;b^\gamma,\hat\eta^{I_k^c})\Big)\\
&\quad + \Big(\frac1n\sum_{i\in I_k}\varphi(S_i;b^\gamma,\hat\eta^{I_k^c}) - \frac1n\sum_{i\in I_k}\varphi(S_i;b^\gamma,\eta^0)\Big)
+ \frac1n\sum_{i\in I_k}\varphi(S_i;b^\gamma,\eta^0).
\end{aligned}
\tag{66}
\]
Subsequently, we analyze the three terms in the decomposition (66) individually. We have
\[
\begin{aligned}
\Big\|\frac1n\sum_{i\in I_k}\varphi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - \frac1n\sum_{i\in I_k}\varphi(S_i;b^\gamma,\hat\eta^{I_k^c})\Big\|
&\le \Big\|\frac1n\sum_{i\in I_k}\big(A_i-\hat m_A^{I_k^c}(W_i)\big)\big(X_i-\hat m_X^{I_k^c}(W_i)\big)^T\Big\|\,\big\|\hat b^\gamma - b^\gamma\big\|\\
&= \Big\|\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\Big\|\,\big\|\hat b^\gamma - b^\gamma\big\|\\
&= \Big\|E_P[\psi_1(S;\eta^0)] + O_P\big(N^{-\frac12}(1+\rho_N)\big)\Big\|\,\big\|\hat b^\gamma - b^\gamma\big\|
\end{aligned}
\]
for $\varphi = \psi$, and analogously for $\varphi = \bar\psi$. Because $\sqrt N(\hat b^\gamma - b^\gamma)$ is bounded in probability by Theorem 4.1, we infer
\[
\frac1n\sum_{i\in I_k}\varphi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - \frac1n\sum_{i\in I_k}\varphi(S_i;b^\gamma,\hat\eta^{I_k^c}) = O_P\big(N^{-\frac12}(1+\rho_N)\big). \tag{67}
\]
Due to (59), which was established in the proof of Theorem 4.1, we have
\[
\frac1n\sum_{i\in I_k}\big(\varphi(S_i;b^\gamma,\hat\eta^{I_k^c}) - \varphi(S_i;b^\gamma,\eta^0)\big) = O_P\big(N^{-\frac12}\rho_N\big). \tag{68}
\]
Finally, by Chebyshev's inequality,
\[
\frac1n\sum_{i\in I_k}\varphi(S_i;b^\gamma,\eta^0) - E_P\big[\varphi(S;b^\gamma,\eta^0)\big] = O_P\big(N^{-\frac12}\big). \tag{69}
\]
Combining (66)–(69) concludes the proof.
Theorem J.3. Suppose Assumption I.5 holds. Recall the score functions introduced in Definition I.1, and let $\hat b^\gamma \in \{\hat b^{\gamma,\mathrm{DML1}}, \hat b^{\gamma,\mathrm{DML2}}\}$. Introduce the matrices
\[
\hat D_1^k := \frac1n\sum_{i\in I_k}\psi_3(S_i;\hat\eta^{I_k^c}),
\]
\[
\hat D_2^k := \Big(\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\Big)\Big(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Big)^{-1}\Big(\frac1n\sum_{i\in I_k}\psi_1^T(S_i;\hat\eta^{I_k^c})\Big),
\]
\[
\hat D_3^k := \Big(\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\Big)\Big(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Big)^{-1},
\]
\[
\hat D_5^k := \Big(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Big)^{-1}\Big(\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\Big).
\]
Let furthermore
\[
\begin{aligned}
\tilde\psi(\cdot;\hat b^\gamma,\hat\eta^{I_k^c})
&:= \bar\psi(\cdot;\hat b^\gamma,\hat\eta^{I_k^c}) + (\gamma-1)\hat D_3^k\,\psi(\cdot;\hat b^\gamma,\hat\eta^{I_k^c})\\
&\quad + (\gamma-1)\Big(\psi_1(\cdot;\hat\eta^{I_k^c}) - \frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\Big)\hat D_5^k\\
&\quad - (\gamma-1)\hat D_3^k\Big(\psi_2(\cdot;\hat\eta^{I_k^c}) - \frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Big)\hat D_5^k
\end{aligned}
\]
and
\[
\hat D_4^k := \frac1n\sum_{i\in I_k}\tilde\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\big(\tilde\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\big)^T,
\]
as well as
\[
\hat D_1 := \frac1K\sum_{k=1}^K \hat D_1^k, \qquad \hat D_2 := \frac1K\sum_{k=1}^K \hat D_2^k, \qquad \hat D_4 := \frac1K\sum_{k=1}^K \hat D_4^k.
\]
Then we have $\hat\sigma^2(\gamma) = \sigma^2(\gamma) + O_P\big(\tilde\rho_N + N^{-\frac12}(1+\rho_N)\big)$, where $\tilde\rho_N = N^{\max\{4/p-1,\,-1/2\}} + r_N$ is as in Definition I.4.
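Computationally, the variance estimator of Theorem J.3 is a plug-in sandwich: estimate the bread $\hat D_1 + (\gamma-1)\hat D_2$ and the meat $\hat D_4$ from the cross-fitted scores and combine. A minimal sketch under our earlier illustrative conventions (the inputs are assumed to be the averaged matrices defined above; the function name is ours):

```python
import numpy as np

def sigma2_hat(D1_hat, D2_hat, D4_hat, gamma):
    # Plug-in sandwich estimator of sigma^2(gamma) =
    # (D1 + (gamma-1) D2)^{-1} D4 (D1 + (gamma-1) D2)^{-T}.
    bread = D1_hat + (gamma - 1.0) * D2_hat
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ D4_hat @ bread_inv.T
```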
Proof of Theorem J.3. This proof is based on Chernozhukov et al. [31]. We already verified
\[
\hat D_1 = D_1 + O_P\big(N^{-\frac12}(1+\rho_N)\big) \qquad\text{and}\qquad \hat D_2 = D_2 + O_P\big(N^{-\frac12}(1+\rho_N)\big)
\]
in the proof of Theorem 4.1 using Lemma I.17.
Subsequently, we argue that $\hat D_5^k = D_5 + O_P\big(N^{-\frac12}(1+\rho_N)\big)$ holds. Due to Lemma J.2, we have
\[
\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) = E_P\big[\psi(S;b^\gamma,\eta^0)\big] + O_P\big(N^{-\frac12}(1+\rho_N)\big),
\]
and
\[
\Big(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Big)^{-1} = E_P\big[\psi_2(S;\eta^0)\big]^{-1} + O_P\big(N^{-\frac12}(1+\rho_N)\big). \tag{70}
\]
Multiplying these two expansions establishes the claim for $\hat D_5^k$.
Next, we bound $\|\hat D_4^k - D_4\|$. Replacing $\hat D_3^k$ and $\hat D_5^k$ by $D_3$ and $D_5$, respectively, and applying the triangle inequality to the resulting sixteen products, we obtain
\[
\begin{aligned}
\big\|\hat D_4^k - D_4\big\|
&\le \Big\|\frac1n\sum_{i\in I_k}\bar\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\big[\bar\psi(S;b^\gamma,\eta^0)\bar\psi^T(S;b^\gamma,\eta^0)\big]\Big\|\\
&\quad + |\gamma-1|\,\Big\|\frac1n\sum_{i\in I_k}\bar\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_3^T - E_P\big[\bar\psi(S;b^\gamma,\eta^0)\psi^T(S;b^\gamma,\eta^0)\big]D_3^T\Big\|\\
&\quad + |\gamma-1|\,\Big\|\frac1n\sum_{i\in I_k}D_3\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - D_3E_P\big[\psi(S;b^\gamma,\eta^0)\bar\psi^T(S;b^\gamma,\eta^0)\big]\Big\|\\
&\quad + (\gamma-1)^2\,\Big\|\frac1n\sum_{i\in I_k}D_3\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_3^T - D_3E_P\big[\psi(S;b^\gamma,\eta^0)\psi^T(S;b^\gamma,\eta^0)\big]D_3^T\Big\|\\
&\quad + |\gamma-1|\,\Big\|\frac1n\sum_{i\in I_k}\bar\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\big(\psi_1(S_i;\hat\eta^{I_k^c}) - E_P[\psi_1(S;\eta^0)]\big)^T\\
&\qquad\qquad - E_P\Big[\bar\psi(S;b^\gamma,\eta^0)D_5^T\big(\psi_1(S;\eta^0) - E_P[\psi_1(S;\eta^0)]\big)^T\Big]\Big\|\\
&\quad + |\gamma-1|\,\Big\|\frac1n\sum_{i\in I_k}\big(\psi_1(S_i;\hat\eta^{I_k^c}) - E_P[\psi_1(S;\eta^0)]\big)D_5\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\\
&\qquad\qquad - E_P\Big[\big(\psi_1(S;\eta^0) - E_P[\psi_1(S;\eta^0)]\big)D_5\bar\psi^T(S;b^\gamma,\eta^0)\Big]\Big\|\\
&\quad + (\gamma-1)^2\,\Big\|\frac1n\sum_{i\in I_k}\big(\psi_1(S_i;\hat\eta^{I_k^c}) - E_P[\psi_1(S;\eta^0)]\big)D_5D_5^T\big(\psi_1(S_i;\hat\eta^{I_k^c}) - E_P[\psi_1(S;\eta^0)]\big)^T\\
&\qquad\qquad - E_P\Big[\big(\psi_1(S;\eta^0) - E_P[\psi_1(S;\eta^0)]\big)D_5D_5^T\big(\psi_1(S;\eta^0) - E_P[\psi_1(S;\eta^0)]\big)^T\Big]\Big\|\\
&\quad + |\gamma-1|\,\Big\|\frac1n\sum_{i\in I_k}D_3\big(\psi_2(S_i;\hat\eta^{I_k^c}) - E_P[\psi_2(S;\eta^0)]\big)D_5\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\\
&\qquad\qquad - D_3E_P\Big[\big(\psi_2(S;\eta^0) - E_P[\psi_2(S;\eta^0)]\big)D_5\bar\psi^T(S;b^\gamma,\eta^0)\Big]\Big\|\\
&\quad + |\gamma-1|\,\Big\|\frac1n\sum_{i\in I_k}\bar\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\big(\psi_2(S_i;\hat\eta^{I_k^c}) - E_P[\psi_2(S;\eta^0)]\big)^TD_3^T\\
&\qquad\qquad - E_P\Big[\bar\psi(S;b^\gamma,\eta^0)D_5^T\big(\psi_2(S;\eta^0) - E_P[\psi_2(S;\eta^0)]\big)^T\Big]D_3^T\Big\|\\
&\quad + (\gamma-1)^2\,\Big\|\frac1n\sum_{i\in I_k}D_3\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\big(\psi_1(S_i;\hat\eta^{I_k^c}) - E_P[\psi_1(S;\eta^0)]\big)^T\\
&\qquad\qquad - D_3E_P\Big[\psi(S;b^\gamma,\eta^0)D_5^T\big(\psi_1(S;\eta^0) - E_P[\psi_1(S;\eta^0)]\big)^T\Big]\Big\|\\
&\quad + \text{(the remaining analogous cross terms formed from }\psi,\ \psi_1,\ \psi_2,\ D_3,\text{ and }D_5\text{)}\\
&\quad + O_P\big(N^{-\frac12}(1+\rho_N)\big)\\
&=: \sum_{i=1}^{16}\mathcal I_i + O_P\big(N^{-\frac12}(1+\rho_N)\big)
\end{aligned}
\]
by the triangle inequality and the results derived so far. Subsequently, we bound the terms $\mathcal I_1,\dots,\mathcal I_{16}$ individually. Because all these terms consist of norms of matrices of fixed size, it suffices to bound the individual matrix entries. Let $j$, $l$, $t$, and $r$ be natural numbers not exceeding the dimensions of the respective objects they index. By Lemma I.20, we have
\[
\frac1n\sum_{i\in I_k}\bar\psi_j(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\,\bar\psi_l(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\big[\bar\psi_j(S;b^\gamma,\eta^0)\,\bar\psi_l(S;b^\gamma,\eta^0)\big] = O_P(\tilde\rho_N),
\]
so that $\mathcal I_1 = O_P(\tilde\rho_N)$, and
\[
\begin{aligned}
&\Big\|\frac1n\sum_{i\in I_k}D_3\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - D_3E_P\big[\psi(S;b^\gamma,\eta^0)\bar\psi^T(S;b^\gamma,\eta^0)\big]\Big\|\\
&\quad\le \|D_3\|\,\Big\|\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\big[\psi(S;b^\gamma,\eta^0)\bar\psi^T(S;b^\gamma,\eta^0)\big]\Big\|.
\end{aligned}
\]
Moreover, we have
\[
\begin{aligned}
&\Big|\Big(\frac1n\sum_{i\in I_k}\bar\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\psi_1^T(S_i;\hat\eta^{I_k^c}) - E_P\big[\bar\psi(S;b^\gamma,\eta^0)D_5^T\psi_1^T(S;\eta^0)\big]\Big)_{j,l}\Big|\\
&\quad= \Big\|\frac1n\sum_{i\in I_k}D_5^T\big(\psi_1(S_i;\hat\eta^{I_k^c})\big)_{\cdot,l}\,\bar\psi_j(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - D_5^TE_P\Big[\big(\psi_1(S;\eta^0)\big)_{\cdot,l}\,\bar\psi_j(S;b^\gamma,\eta^0)\Big]\Big\|\\
&\quad\le \|D_5\|\,\Big\|\frac1n\sum_{i\in I_k}\big(\psi_1(S_i;\hat\eta^{I_k^c})\big)_{\cdot,l}\,\bar\psi_j(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\Big[\big(\psi_1(S;\eta^0)\big)_{\cdot,l}\,\bar\psi_j(S;b^\gamma,\eta^0)\Big]\Big\|.
\end{aligned}
\]
Analogously,
\[
\begin{aligned}
&\Big|\frac1n\sum_{i\in I_k}\big(\psi_1(S_i;\hat\eta^{I_k^c})\big)_{j,\cdot}D_5D_5^T\big(\psi_1^T(S_i;\hat\eta^{I_k^c})\big)_{\cdot,r} - E_P\Big[\big(\psi_1(S;\eta^0)\big)_{j,\cdot}D_5D_5^T\big(\psi_1^T(S;\eta^0)\big)_{\cdot,r}\Big]\Big|\\
&\quad\le \Big\|\frac1n\sum_{i\in I_k}\big(\psi_1^T(S_i;\hat\eta^{I_k^c})\big)_{\cdot,r}\big(\psi_1(S_i;\hat\eta^{I_k^c})\big)_{j,\cdot} - E_P\Big[\big(\psi_1^T(S;\eta^0)\big)_{\cdot,r}\big(\psi_1(S;\eta^0)\big)_{j,\cdot}\Big]\Big\|\,\|D_5\|^2.
\end{aligned}
\]
Furthermore, we have
\[
\begin{aligned}
&\Big\|\frac1n\sum_{i\in I_k}D_3\big(\psi_2(S_i;\hat\eta^{I_k^c})-E_P[\psi_2(S;\eta^0)]\big)D_5\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\\
&\qquad\qquad - D_3E_P\Big[\big(\psi_2(S;\eta^0)-E_P[\psi_2(S;\eta^0)]\big)D_5\bar\psi^T(S;b^\gamma,\eta^0)\Big]\Big\|\\
&\quad\le \Big\|\frac1n\sum_{i\in I_k}D_3\psi_2(S_i;\hat\eta^{I_k^c})D_5\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - D_3E_P\big[\psi_2(S;\eta^0)D_5\bar\psi^T(S;b^\gamma,\eta^0)\big]\Big\|\\
&\qquad + \Big\|\frac1n\sum_{i\in I_k}D_3E_P[\psi_2(S;\eta^0)]D_5\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - D_3E_P[\psi_2(S;\eta^0)]D_5E_P\big[\bar\psi^T(S;b^\gamma,\eta^0)\big]\Big\|\\
&\quad\le \|D_3\|\,\Big\|\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})D_5\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\big[\psi_2(S;\eta^0)D_5\bar\psi^T(S;b^\gamma,\eta^0)\big]\Big\|\\
&\qquad + \|D_3\|\,\big\|E_P[\psi_2(S;\eta^0)]\big\|\,\|D_5\|\,\Big\|\frac1n\sum_{i\in I_k}\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\big[\bar\psi^T(S;b^\gamma,\eta^0)\big]\Big\|\\
&\quad\le \|D_3\|\,\Big\|\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})D_5\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\big[\psi_2(S;\eta^0)D_5\bar\psi^T(S;b^\gamma,\eta^0)\big]\Big\| + O_P\big(N^{-\frac12}(1+\rho_N)\big).
\end{aligned}
\]
Entrywise, we have
\[
\begin{aligned}
&\Big|\Big(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})D_5\bar\psi^T(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\big[\psi_2(S;\eta^0)D_5\bar\psi^T(S;b^\gamma,\eta^0)\big]\Big)_{j,t}\Big|\\
&\quad= \Big|\frac1n\sum_{i\in I_k}\big(\psi_2(S_i;\hat\eta^{I_k^c})\big)_{j,\cdot}D_5\,\bar\psi_t(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\Big[\big(\psi_2(S;\eta^0)\big)_{j,\cdot}D_5\,\bar\psi_t(S;b^\gamma,\eta^0)\Big]\Big|\\
&\quad= \Big\|\Big(\frac1n\sum_{i\in I_k}\bar\psi_t(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\big(\psi_2(S_i;\hat\eta^{I_k^c})\big)_{j,\cdot} - E_P\Big[\bar\psi_t(S;b^\gamma,\eta^0)\big(\psi_2(S;\eta^0)\big)_{j,\cdot}\Big]\Big)D_5\Big\|\\
&\quad\le \Big\|\frac1n\sum_{i\in I_k}\bar\psi_t(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\big(\psi_2(S_i;\hat\eta^{I_k^c})\big)_{j,\cdot} - E_P\Big[\bar\psi_t(S;b^\gamma,\eta^0)\big(\psi_2(S;\eta^0)\big)_{j,\cdot}\Big]\Big\|\,\|D_5\|.
\end{aligned}
\]
Next, we bound $\mathcal I_{10}$. We have
\[
\begin{aligned}
&\Big\|\frac1n\sum_{i\in I_k}D_3\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\psi_1^T(S_i;\hat\eta^{I_k^c}) - D_3E_P\big[\psi(S;b^\gamma,\eta^0)D_5^T\psi_1^T(S;\eta^0)\big]\Big\|\\
&\quad\le \|D_3\|\,\Big\|\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\psi_1^T(S_i;\hat\eta^{I_k^c}) - E_P\big[\psi(S;b^\gamma,\eta^0)D_5^T\psi_1^T(S;\eta^0)\big]\Big\| + O_P\big(N^{-\frac12}(1+\rho_N)\big),
\end{aligned}
\]
and entrywise
\[
\begin{aligned}
&\Big|\Big(\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\psi_1^T(S_i;\hat\eta^{I_k^c}) - E_P\big[\psi(S;b^\gamma,\eta^0)D_5^T\psi_1^T(S;\eta^0)\big]\Big)_{j,t}\Big|\\
&\quad\le \Big\|\frac1n\sum_{i\in I_k}\big(\psi_1^T(S_i;\hat\eta^{I_k^c})\big)_{\cdot,t}\,\psi_j(S_i;\hat b^\gamma,\hat\eta^{I_k^c}) - E_P\Big[\big(\psi_1^T(S;\eta^0)\big)_{\cdot,t}\,\psi_j(S;b^\gamma,\eta^0)\Big]\Big\|\,\|D_5\|.
\end{aligned}
\]
The term I11 can be bounded analogously to I10 . Next, we bound I12 . By
Lemma I.20, we have
\[
\frac1n\sum_{i\in I_k}\big(\psi_1(S_i;\hat\eta^{I_k^c})\big)_{j,l}\big(\psi_2(S_i;\hat\eta^{I_k^c})\big)_{t,r} - E_P\Big[\big(\psi_1(S;\eta^0)\big)_{j,l}\big(\psi_2(S;\eta^0)\big)_{t,r}\Big] = O_P(\tilde\rho_N),
\]
and
\[
\begin{aligned}
&\Big\|\frac1n\sum_{i\in I_k}\big(\psi_1(S_i;\hat\eta^{I_k^c})-E_P[\psi_1(S;\eta^0)]\big)D_5D_5^T\big(\psi_2^T(S_i;\hat\eta^{I_k^c})-E_P[\psi_2^T(S;\eta^0)]\big)D_3^T\\
&\qquad\qquad - E_P\Big[\big(\psi_1(S;\eta^0)-E_P[\psi_1(S;\eta^0)]\big)D_5D_5^T\big(\psi_2^T(S;\eta^0)-E_P[\psi_2^T(S;\eta^0)]\big)\Big]D_3^T\Big\|\\
&\quad\le \Big\|\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})D_5D_5^T\psi_2^T(S_i;\hat\eta^{I_k^c})D_3^T - E_P\big[\psi_1(S;\eta^0)D_5D_5^T\psi_2^T(S;\eta^0)\big]D_3^T\Big\|\\
&\qquad + \Big\|\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})D_5D_5^TE_P[\psi_2^T(S;\eta^0)]D_3^T - E_P[\psi_1(S;\eta^0)]D_5D_5^TE_P[\psi_2^T(S;\eta^0)]D_3^T\Big\|\\
&\qquad + \Big\|\frac1n\sum_{i\in I_k}E_P[\psi_1(S;\eta^0)]D_5D_5^T\psi_2^T(S_i;\hat\eta^{I_k^c})D_3^T - E_P[\psi_1(S;\eta^0)]D_5D_5^TE_P[\psi_2^T(S;\eta^0)]D_3^T\Big\|\\
&\quad\le \Big\|\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})D_5D_5^T\psi_2^T(S_i;\hat\eta^{I_k^c}) - E_P\big[\psi_1(S;\eta^0)D_5D_5^T\psi_2^T(S;\eta^0)\big]\Big\|\,\|D_3\|\\
&\qquad + \Big\|\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c}) - E_P[\psi_1(S;\eta^0)]\Big\|\,\|D_5\|^2\,\big\|E_P[\psi_2(S;\eta^0)]\big\|\,\|D_3\|\\
&\qquad + \big\|E_P[\psi_1(S;\eta^0)]\big\|\,\|D_5\|^2\,\Big\|\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c}) - E_P[\psi_2(S;\eta^0)]\Big\|\,\|D_3\|\\
&\quad\le \Big\|\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})D_5D_5^T\psi_2^T(S_i;\hat\eta^{I_k^c}) - E_P\big[\psi_1(S;\eta^0)D_5D_5^T\psi_2^T(S;\eta^0)\big]\Big\|\,\|D_3\| + O_P\big(N^{-\frac12}(1+\rho_N)\big).
\end{aligned}
\]
Entrywise, we have
\[
\begin{aligned}
&\Big|\Big(\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})D_5D_5^T\psi_2^T(S_i;\hat\eta^{I_k^c}) - E_P\big[\psi_1(S;\eta^0)D_5D_5^T\psi_2^T(S;\eta^0)\big]\Big)_{j,r}\Big|\\
&\quad= \Big|\frac1n\sum_{i\in I_k}\big(\psi_1(S_i;\hat\eta^{I_k^c})\big)_{j,\cdot}D_5D_5^T\big(\psi_2^T(S_i;\hat\eta^{I_k^c})\big)_{\cdot,r} - E_P\Big[\big(\psi_1(S;\eta^0)\big)_{j,\cdot}D_5D_5^T\big(\psi_2^T(S;\eta^0)\big)_{\cdot,r}\Big]\Big|\\
&\quad\le \Big\|\frac1n\sum_{i\in I_k}\big(\psi_2^T(S_i;\hat\eta^{I_k^c})\big)_{\cdot,r}\big(\psi_1(S_i;\hat\eta^{I_k^c})\big)_{j,\cdot} - E_P\Big[\big(\psi_2^T(S;\eta^0)\big)_{\cdot,r}\big(\psi_1(S;\eta^0)\big)_{j,\cdot}\Big]\Big\|\,\|D_5\|^2.
\end{aligned}
\]
By Lemma I.20, we have
\[
\frac1n\sum_{i\in I_k}\bar\psi_j(S_i;\hat b^\gamma,\hat\eta^{I_k^c})\big(\psi_2(S_i;\hat\eta^{I_k^c})\big)_{t,r} - E_P\Big[\bar\psi_j(S;b^\gamma,\eta^0)\big(\psi_2(S;\eta^0)\big)_{t,r}\Big] = O_P(\tilde\rho_N),
\]
and, arguing as for the previous terms,
\[
\mathcal I_{13}
\le \Big\|\frac1n\sum_{i\in I_k}\psi(S_i;\hat b^\gamma,\hat\eta^{I_k^c})D_5^T\psi_2^T(S_i;\hat\eta^{I_k^c}) - E_P\big[\psi(S;b^\gamma,\eta^0)D_5^T\psi_2^T(S;\eta^0)\big]\Big\|\,\|D_3\|^2 + O_P\big(N^{-\frac12}(1+\rho_N)\big).
\]
The term $\mathcal I_{14}$ can be bounded analogously to $\mathcal I_{13}$. The term $\mathcal I_{15}$ can be bounded analogously to $\mathcal I_{12}$. Last, we bound the term $\mathcal I_{16}$. By Lemma I.20, we have
\[
\frac1n\sum_{i\in I_k}\big(\psi_2^T(S_i;\hat\eta^{I_k^c})\big)_{t,r}\big(\psi_2(S_i;\hat\eta^{I_k^c})\big)_{j,l} - E_P\Big[\big(\psi_2^T(S;\eta^0)\big)_{t,r}\big(\psi_2(S;\eta^0)\big)_{j,l}\Big] = O_P(\tilde\rho_N),
\]
and, arguing as before,
\[
\mathcal I_{16}
\le \Big\|\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})D_5D_5^T\psi_2^T(S_i;\hat\eta^{I_k^c}) - E_P\big[\psi_2(S;\eta^0)D_5D_5^T\psi_2^T(S;\eta^0)\big]\Big\|\,\|D_3\|^2 + O_P\big(N^{-\frac12}(1+\rho_N)\big)
\]
with the entrywise bound
\[
\begin{aligned}
&\Big|\frac1n\sum_{i\in I_k}\big(\psi_2(S_i;\hat\eta^{I_k^c})\big)_{j,\cdot}D_5D_5^T\big(\psi_2^T(S_i;\hat\eta^{I_k^c})\big)_{\cdot,r} - E_P\Big[\big(\psi_2(S;\eta^0)\big)_{j,\cdot}D_5D_5^T\big(\psi_2^T(S;\eta^0)\big)_{\cdot,r}\Big]\Big|\\
&\quad\le \Big\|\frac1n\sum_{i\in I_k}\big(\psi_2^T(S_i;\hat\eta^{I_k^c})\big)_{\cdot,r}\big(\psi_2(S_i;\hat\eta^{I_k^c})\big)_{j,\cdot} - E_P\Big[\big(\psi_2^T(S;\eta^0)\big)_{\cdot,r}\big(\psi_2(S;\eta^0)\big)_{j,\cdot}\Big]\Big\|\,\|D_5\|^2.
\end{aligned}
\]
Collecting the bounds on $\mathcal I_1,\dots,\mathcal I_{16}$ establishes the claim.
Proof of Proposition 4.2. The statement of Proposition 4.2 can be reformulated as
\[
\sqrt N\,\big|b^{\gamma_N} - \beta_0\big| \to
\begin{cases}
0 & \text{if } \gamma_N = \Omega(\sqrt N) \text{ and } \gamma_N \notin \Theta(\sqrt N),\\
C & \text{if } \gamma_N = \Theta(\sqrt N),\\
\infty & \text{if } \gamma_N = o(\sqrt N)
\end{cases}
\]
using the Bachmann–Landau notation, which is presented in Lattimore and Szepesvári [58], for instance.
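For completeness (our gloss of the convention), the three regimes read as follows: $\gamma_N = \Omega(\sqrt N)$ means that $\gamma_N \ge c\sqrt N$ for some constant $c > 0$ and all sufficiently large $N$; $\gamma_N = \Theta(\sqrt N)$ means that
\[
c_1\sqrt N \le \gamma_N \le c_2\sqrt N
\]
for some constants $0 < c_1 \le c_2 < \infty$ and all sufficiently large $N$; and $\gamma_N = o(\sqrt N)$ means that $\gamma_N/\sqrt N \to 0$ as $N \to \infty$.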
Introduce the matrices
\[
\begin{aligned}
F_1 &:= E_P\big[R_X R_Y^T\big], \qquad
F_2 := E_P\big[R_X R_X^T\big],\\
G_1 &:= E_P\big[R_X R_A^T\big]\,E_P\big[R_A R_A^T\big]^{-1}E_P\big[R_A R_Y\big],\\
G_2 &:= E_P\big[R_X R_A^T\big]\,E_P\big[R_A R_A^T\big]^{-1}E_P\big[R_A R_X^T\big].
\end{aligned}
\]
We have
\[
\sqrt N\,\big|b^{\gamma_N} - \beta_0\big| = \sqrt N\,\Big|\big(F_2 + (\gamma_N-1)G_2\big)^{-1}\big(F_1 + (\gamma_N-1)G_1\big) - G_2^{-1}G_1\Big|.
\]
Hence, we have
\[
\sqrt N\,\big|b^{\gamma_N} - \beta_0\big|
= \frac{\sqrt N}{\gamma_N-1}\,\Big|\Big(1 + \tfrac{1}{\gamma_N-1}G_2^{-1}F_2\Big)^{-1}G_2^{-1}F_1
- \Big(1 + \tfrac{1}{\gamma_N-1}G_2^{-1}F_2\Big)^{-1}G_2^{-1}F_2\,G_2^{-1}G_1\Big|
\]
because $\frac{1}{\gamma_N-1} = O(1)$ holds; the three regimes displayed above are then governed by the behavior of the prefactor $\sqrt N/(\gamma_N-1)$. Furthermore, we have
\[
\begin{aligned}
\sqrt N\big(\hat b^{\gamma_N} - b^{\gamma_N}\big)
&= \Big(D_1 + (\gamma_N-1)D_2 + o_P\big(\tfrac{1}{\gamma_N-1}\big)\Big)^{-1}\\
&\quad\cdot \frac{1}{\sqrt K}\sum_{k=1}^K\frac{1}{\sqrt n}\sum_{i\in I_k}\bigg(\bar\psi(S_i;b^{\gamma_N},\hat\eta^{I_k^c})\\
&\qquad + (\gamma_N-1)\Big(\frac1n\sum_{i\in I_k}\psi_1(S_i;\hat\eta^{I_k^c})\Big)\Big(\frac1n\sum_{i\in I_k}\psi_2(S_i;\hat\eta^{I_k^c})\Big)^{-1}\psi(S_i;b^{\gamma_N},\hat\eta^{I_k^c})\bigg).
\end{aligned}
\]
Introduce the random variables
\[
\tilde X_i := \frac{1}{\gamma_N-1}\,\bar\psi(S_i;b^{\gamma_N},\eta^0) + D_3\,\psi(S_i;b^{\gamma_N},\eta^0) + \big(\psi_1(S_i;\eta^0) - E_P[\psi_1(S;\eta^0)]\big)D_5 - D_3\big(\psi_2(S_i;\eta^0) - E_P[\psi_2(S;\eta^0)]\big)D_5
\]
i , and Vn := 2
for i ∈ [N ], Sn := i∈Ik X i∈Ik EP [Xi ], where n = K denotes
N
1 1 1
i |2+δ =
EP | X 1 |2+δ → 0
· 1+δ EP |X
2]
EP [ X
2+δ ])2+δ n
(EP [X 2
i∈Ik i i∈Ik 1
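The condition verified above is Lyapunov's condition. For convenience, we record the standard version of the central limit theorem being invoked: if $X_1,\dots,X_n$ are independent with mean zero, $V_n := \sum_{i=1}^n E[X_i^2]$, and for some $\delta > 0$
\[
\lim_{n\to\infty}\frac{1}{V_n^{(2+\delta)/2}}\sum_{i=1}^n E\big[|X_i|^{2+\delta}\big] = 0,
\]
then $V_n^{-1/2}\sum_{i=1}^n X_i \xrightarrow{d} \mathcal N(0,1)$.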
Proof of Lemma J.5. As verified in the proof of Theorem 4.1, we have $\hat D_1 = D_1 + o_P(1)$ and $\hat D_2 = D_2 + o_P(1)$. We established $\hat D_4^k = D_4 + o_P(1)$ in the proof of Theorem J.3 for fixed $\gamma$. Consequently, the claim follows if the sequence $\{\gamma_N\}_{N\ge1}$ is bounded.
Next, assume that $\gamma_N$ diverges to $+\infty$ as $N\to\infty$. We verified
\[
\big(\hat D_1 + (\gamma_N-1)\hat D_2\big)^{-1} = \big(D_1 + (\gamma_N-1)D_2\big)^{-1} + o_P\Big(\frac{1}{\gamma_N-1}\Big)
\]
in the proof of Lemma J.4. It can be shown that $\frac{1}{(\gamma_N-1)^2}\hat D_4$ is bounded in $P$-probability by adapting the arguments presented in the proof of Theorem J.3 because there exists some finite real constant $C$ such that $|b^{\gamma_N}| \le C$ holds for $N$ large enough. Therefore,
\[
\hat\sigma^2(\gamma_N)
= \Big(\frac{1}{\gamma_N-1}D_1 + D_2 + o_P(1)\Big)^{-1}\frac{1}{(\gamma_N-1)^2}\hat D_4\Big(\frac{1}{\gamma_N-1}D_1^T + D_2^T + o_P(1)\Big)^{-1}
\]
is bounded in $P$-probability.
Lemma J.6. Let $\gamma_N = o(\sqrt N)$. We then have
\[
\Xi_N := \hat\sigma(\gamma_N) - 2\hat\sigma - \sqrt N\big(\hat b^{\gamma_N} - b^{\gamma_N}\big) - \sqrt N\big(\beta_0 - \hat\beta\big) = O_P(1).
\]
Proof of Lemma J.6. By Theorem 3.1, the term $\sqrt N(\beta_0 - \hat\beta)$ asymptotically follows a Gaussian distribution and is hence bounded in $P$-probability. By Theorem I.21, the term $\hat\sigma^2$ converges in $P$-probability. Thus, $2\hat\sigma$ is bounded in $P$-probability as well. By Lemma J.4, we have $\sqrt N(\hat b^{\gamma_N} - b^{\gamma_N}) = O_P(1)$. By Lemma J.5, we have $\hat\sigma^2(\gamma_N) = O_P(1)$.
Proof of Theorem 4.4. That the statement holds uniformly over $P \in \mathcal P_N$ can be derived using arguments analogous to those used to prove Theorems 3.1 and 4.1. Theorem J.3 in the appendix shows that $\hat\sigma(\gamma)$ consistently estimates $\sigma(\gamma)$ for fixed $\gamma$. Due to Theorem 4.3, we have $\hat\mu = 1 + \frac{1}{\sqrt N}o_P(1)$. Due to Proposition 4.2, whose statements also hold stochastically for random $\gamma$, we have $b^{\hat\gamma} = \beta_0 + \frac{1}{\sqrt N}o_P(1)$. Therefore, we have
\[
\begin{aligned}
\sqrt N\big(\hat b^{\hat\gamma} - b^{\hat\gamma}\big)
&= \sqrt N\,\Big[\frac1K\sum_{k=1}^K (R_X^{I_k})^T\Pi_{R_A^{I_k}}R_X^{I_k}\Big]^{-1}\frac1K\sum_{k=1}^K (R_X^{I_k})^T\Pi_{R_A^{I_k}}\big(R_Y^{I_k} - R_X^{I_k}\beta_0\big) + o_P(1)\\
&= \sqrt N\big(\hat\beta - \beta_0\big) + o_P(1).
\end{aligned}
\]
Acknowledgments
We thank Matthias Löffler, the editor, associate editor, and anonymous review-
ers for constructive comments.
References
[73] Okui, R., Small, D. S., Tan, Z. and Robins, J. M. (2012). Doubly
robust instrumental variable regression. Statistica Sinica 22 173–205.
[74] Pearl, J. (1998). Graphs, causality, and structural equation models. So-
ciological Methods & Research 27 226–284.
[75] Pearl, J. (2004). Robustness of causal claims. In Proceedings of the 20th
Conference on Uncertainty in Artificial Intelligence. UAI ’04 446–453.
AUAI Press, Arlington, Virginia, USA.
[76] Pearl, J. (2009). Causality: Models, reasoning, and inference, 2 ed. Cam-
bridge University Press, Cambridge.
[77] Pearl, J. (2010). An introduction to causal inference. The International
Journal of Biostatistics 6 Article 7.
[78] Peters, J., Janzing, D. and Schölkopf, B. (2017). Elements of causal
inference: Foundations and learning algorithms. Adaptive computation and
machine learning. The MIT Press, Cambridge, MA.
[79] Phillips, P. C. B. (1984). The Exact Distribution of LIML: I. Interna-
tional Economic Review 25 249–261.
[80] Phillips, P. C. B. (1985). The Exact Distribution of LIML: II. Interna-
tional Economic Review 26 21–36.
[81] Robinson, P. M. (1988). Root-N -consistent semiparametric regression.
Econometrica 56 931–954.
[82] Rothenhäusler, D., Meinshausen, N., Bühlmann, P. and Peters, J.
(2021). Anchor regression: Heterogeneous data meet causality. Journal of
the Royal Statistical Society: Series B (Statistical Methodology) 83 215-246.
[83] Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric
regression. Cambridge series in statistical and probabilistic mathematics 12.
Cambridge University Press, Cambridge.
[84] Smucler, E., Rotnitzky, A. and Robins, J. M. (2019). A unifying approach for doubly-robust $\ell_1$ regularized estimation of causal contrasts. Preprint arXiv:1904.03737.
[85] Speckman, P. (1988). Kernel smoothing in partial linear models. Journal
of the Royal Statistical Society. Series B (Methodological) 50 413–436.
[86] Staiger, D. and Stock, J. H. (1997). Instrumental variables regression
with weak instruments. Econometrica 65 557–586.
[87] Stock, J. H., Wright, J. H. and Yogo, M. (2002). A survey of weak
instruments and weak identification in generalized method of moments.
Journal of Business and Economic Statistics 20 518–529.
[88] Su, L. and Zhang, Y. (2016). Semiparametric estimation of partially lin-
ear dynamic panel data models with fixed effects. In Essays in Honor of
Aman Ullah, 1 ed. (G. González-Rivera, R. C. Hill and T.-H. Lee, eds.). Ad-
vances in Econometrics 36 137–204. Emerald Group Publishing Limited,
Howard House, Wagon Lane, Bingley BD16 1WA, UK.
[89] Summers, R. (1965). A Capital Intensive Approach to the Small Sample
Properties of Various Simultaneous Equation Estimators. Econometrica 33
1–41.
[90] Takeuchi, K. and Morimune, K. (1985). Third-Order Efficiency of the