DEEP NEURAL NETWORKS FOR ESTIMATION AND INFERENCE
MAX H. FARRELL
Booth School of Business, University of Chicago
TENGYUAN LIANG
Booth School of Business, University of Chicago
SANJOG MISRA
Booth School of Business, University of Chicago
We study deep neural networks and their use in semiparametric inference. We es-
tablish novel nonasymptotic high probability bounds for deep feedforward neural nets.
These deliver rates of convergence that are sufficiently fast (in some cases minimax op-
timal) to allow us to establish valid second-step inference after first-step estimation with
deep learning, a result also new to the literature. Our nonasymptotic high probability
bounds, and the subsequent semiparametric inference, treat the current standard ar-
chitecture: fully connected feedforward neural networks (multilayer perceptrons), with
the now-common rectified linear unit activation function, unbounded weights, and a
depth explicitly diverging with the sample size. We discuss other architectures as well,
including fixed-width, very deep networks. We establish the nonasymptotic bounds for
these deep nets for a general class of nonparametric regression-type loss functions,
which includes as special cases least squares, logistic regression, and other generalized
linear models. We then apply our theory to develop semiparametric inference, focus-
ing on causal parameters for concreteness, and demonstrate the effectiveness of deep
learning with an empirical application to direct mail marketing.
1. INTRODUCTION
STATISTICAL MACHINE LEARNING METHODS are being rapidly integrated into the social
and medical sciences. Economics is no exception, and there has been a recent surge of
research that applies and explores machine learning methods in the context of econo-
metric modeling, particularly in “big data” settings. Furthermore, theoretical properties
of these methods are the subject of intense recent study. Our goal in the present work
is to study a particular statistical machine learning technique which is widely popular in
industrial applications, but less frequently used in academic work and largely ignored in
recent theoretical developments on inference: deep neural networks.
Neural networks are estimation methods that model the relationship between inputs
and outputs using layers of connected computational units (neurons), patterned after the
biological neural networks of brains. These computational units sit between the inputs
and output and allow data-driven learning of the appropriate model, in addition to learn-
ing the parameters of that model. Put into terms more familiar in nonparametric econo-
metrics: neural networks can be thought of as a (complex) type of linear sieve estimation
where the basis functions themselves are flexibly learned from the data by optimizing over
many combinations of simple functions. Neural networks are perhaps not as familiar to
economists as other methods, and indeed, were out of favor in the machine learning com-
munity for several years, returning to prominence only very recently in the form of deep
learning. Deep neural nets contain many hidden layers of neurons between the input and
output layers, and have been found to exhibit superior performance across a variety of
contexts. Our work aims to bring wider attention to these methods and to fill some gaps
in the theoretical understanding of inference using deep neural networks.
Before the recent surge in attention, neural networks had taken a back seat to other
methods (such as kernel methods or forests) largely because of their modest empirical
performance and challenging optimization. Before falling out of favor, neural networks
were widely studied and applied, particularly in the 1990s. In that time, shallow neural
networks with smooth activation functions were shown to have many good theoretical
properties (White (1992), Anthony and Bartlett (1999), Chen and White (1999)). How-
ever, the availability of scalable computing and stochastic optimization techniques (Le-
Cun, Bottou, Bengio, and Haffner (1998), Kingma and Ba (2014)) and the changes from
shallow to deep networks and from smooth sigmoid-type activation functions to rectified
linear units (ReLU), x ↦ max(x, 0) (Nair and Hinton (2010)), have seemingly overcome
optimization hurdles and empirical issues, and now this form of deep learning matches or
sets the state of the art in many prediction contexts. Our theoretical results speak directly
to this modern implementation of deep learning: we focus on the ReLU activation func-
tion, explicitly model the depth of the network as diverging with the sample size, and do
not require bounded weights.
We provide nonasymptotic high probability bounds for nonparametric estimation using
deep neural networks for a large class of statistical models. Our bounds appear to be
new to the literature and are the main theoretical contributions of the paper. We provide
results for a general class of smooth loss functions for nonparametric regression style
problems, covering as special cases generalized linear models and other empirically useful
contexts. For example, in our application to causal inference we specialize our results
to linear and logistic regression as concrete illustrations. Our bounds immediately yield
empirical and population L2 convergence rates. For a certain architecture, we obtain the
optimal rate. Our proof strategy employs a localization analysis that uses scale-insensitive
measures of complexity, allowing us to consider richer classes of neural networks. This
is in contrast to analyses which restrict the networks to have bounded parameters for
each unit (discussed more below) and to the application of scale sensitive measures such
as metric entropy. These approaches would not deliver our sharp bounds and fast rates
when treating standard, feasible neural networks. Recent developments in approximation
theory and complexity for deep ReLU networks are important building blocks for our
results.
We follow our main results by applying our nonasymptotic high probability bounds to
deliver valid inference on finite-dimensional parameters following first-step estimation
using deep learning. Our aim is not to innovate at the semiparametric step but to utilize
existing results. Our work contributes directly to this area of research by showing that
deep nets are a valid and useful first-step estimator for semiparametric inference in gen-
eral. Further, we show that inference after deep learning may not require sample splitting
or cross-fitting. In particular, we use localization to directly verify conditions required for
valid inference, which may be a novel application of this proof method that is useful in
future problems of inference following machine learning.
We illustrate these ideas in the context of causal inference for concreteness and wide
applicability, as well as to allow direct comparison to the literature. Program evaluation
with observational data is one of the most common and important inference problems,
and has often been used as a test case for theoretical study of inference following machine
learning (e.g., Belloni, Chernozhukov, and Hansen (2014), Farrell (2015), Belloni, Cher-
nozhukov, Fernández-Val, and Hansen (2017), Athey, Imbens, and Wager (2018)). Deep
neural networks have been argued (experimentally) to outperform the previous state-of-
the-art (Westreich, Lessler, and Funk (2010), Shalit, Johansson, and Sontag (2017), Hart-
ford et al. (2017)). We establish valid inference for treatment effects and counterfactual
expected utility/profits from treatment targeting strategies. We note that the selection on
observables framework yields identification of counterfactual average outcomes without
additional structural assumptions, so that, for example, expected profit from a counter-
factual treatment rule can be evaluated.
We numerically illustrate our results, and more generally the utility of deep learning,
with an empirical study of a direct mail marketing campaign. Our data come from a large
US consumer products retailer and consist of around three hundred thousand consumers
with one hundred fifty covariates. Hitsch and Misra (2018) recently used this data to study
various estimators, both traditional and modern, of heterogeneous treatment effects. We
study the effect of catalog mailings on consumer purchases, and moreover, compare dif-
ferent targeting strategies (i.e., to which consumers catalogs should be mailed). The cost
of sending out a single catalog can be close to one dollar, and with millions being sent out,
carefully assessing the targeting strategy is crucial. Our results suggest that deep nets are
at least as good as (and sometimes better than) the best methods found by Hitsch and
Misra (2018).
Our paper contributes to several rapidly growing literatures, and we cannot hope to do
justice to each here. We give only those citations of particular relevance; more references
can be found within these works. First, there has been much recent study of the statistical
properties of machine learning tools as an end in itself. Many studies have focused on the
lasso and its variants (Bickel, Ritov, and Tsybakov (2009), Belloni, Chernozhukov, and
Hansen (2014), Farrell (2015)) and tree/forest based methods (Wager and Athey (2018)).
Relatively less work has been done for deep neural networks. An important exception
is the recent work of Schmidt-Hieber (2019), who showed that a particular deep ReLU
network with uniformly bounded weights attains the optimal rate in expected risk for
squared loss. Further, Schmidt-Hieber (2019) formally shows that deep neural networks
can strictly improve on classical methods: if the unknown target function is itself a com-
position of simpler functions, then the composition-based deep net estimator is provably
superior to estimators that do not use compositions. This is a possible first step in theo-
retically understanding why deep learning is so successful empirically. Our work differs
substantially from Schmidt-Hieber (2019). First, our goal is not to demonstrate adap-
tation, and we do not study this property of deep nets, but focus on the common non-
parametric case. Second, our results and assumptions are quite different in that: (i) we
prove nonasymptotic high probability bounds instead of bounds on the expected risk, (ii)
we cover general, nonlinear regression problems, (iii) in linear models we allow for non-
Gaussian, heteroskedastic errors, relying only on boundedness, and (iv) we allow for un-
bounded weight parameters, which is crucial for feasible implementation and for approx-
imation results. Finally, our method of proof is entirely different from Schmidt-Hieber
(2019), and it is this proof which enables us to deliver (i)–(iv). Specializing our results
to the linear model treated by Schmidt-Hieber (2019), and looking at smooth functions,
our high probability bounds imply expected risk bounds similar to those obtained in that
paper, but under somewhat different regularity conditions: Schmidt-Hieber (2019) re-
quires bounded weights and errors that are Gaussian, independent of the covariates, and
homoskedastic with known variance. These differences between our work and Schmidt-Hieber
(2019) are elaborated on further below, after stating our main results. Bach (2017) and
Bauer and Kohler (2019) also make important contributions on adaptation properties
of deep nets on functions with certain low dimensional structure. Yarotsky (2017, 2018)
and Bartlett, Harvey, Liaw, and Mehrabian (2017) are important building blocks for our
results.
A second strand of literature focuses on inference following machine learning. Ini-
tial theoretical results were concerned with obtaining valid inference on a coefficient in a
high-dimensional regression, following model selection or regularization, with particular
focus on the lasso (Belloni, Chernozhukov, and Hansen (2014), Javanmard and Monta-
nari (2014), van de Geer, Buhlmann, Ritov, and Dezeure (2014)). Intuitively, this is a
semiparametric problem, where the coefficient of interest is estimable at the parametric
rate and the remaining coefficients are collectively a nonparametric nuisance parameter
estimated using machine learning methods. Building on this intuition, many have studied
the semiparametric stage directly, such as obtaining novel, weaker conditions easing the
application of machine learning methods (Belloni, Chernozhukov, and Hansen (2014),
Farrell (2015), Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins
(2018), and references therein). We build on this work, employing conditions therein, and
in particular, verifying them for deep ReLU nets.
The next section introduces deep ReLU networks and states our main theoretical re-
sults: nonasymptotic bounds for nonparametric regression-type losses. Semiparametric
inference is discussed in Section 3. The empirical study is in Section 4. Section 5 con-
cludes and proofs are given in the Appendix. We will use the following norms: for a random vector X ∈ R^d, with generic realization x and sample realization x_i, and a function g(x), ‖g‖_∞ := sup_x |g(x)|, ‖g‖_{L2(X)} := E[g(X)²]^{1/2}, and ‖g‖_n := E_n[g(x_i)²]^{1/2}, where E_n[·] denotes the sample average.
2. DEEP NEURAL NETWORKS

We focus on the architectures most commonly used in applied work because we want our results to inform empirical practice. However, our results are more general,
accommodating other architectures provided they are able to yield a universal approxi-
mation (in the appropriate function class), and so we review neural nets more generally
and give concrete examples.
Our goal is to estimate an unknown function f∗(x) that relates the covariates X ∈ R^d to a scalar outcome Y as the minimizer of the expectation of a per-observation loss function. Collecting these random variables into the vector Z = (Y, X′)′ ∈ R^{d+1}, with z = (y, x′)′ denoting a realization, we write

f∗ := arg min_f E[ℓ(f, Z)].
We allow for any loss function ℓ(f, z) that is Lipschitz in f and obeys a curvature condition around f∗. Specifically, for constants c_1, c_2, and C_ℓ that are bounded and bounded away from zero, we assume that ℓ(f, z) obeys

|ℓ(f, z) − ℓ(g, z)| ≤ C_ℓ |f(x) − g(x)|   and
c_1 E[(f − f∗)²] ≤ E[ℓ(f, Z)] − E[ℓ(f∗, Z)] ≤ c_2 E[(f − f∗)²].   (2.1)
Our results will be stated for a general loss obeying these two conditions.1 We give a
unified localization analysis of all such problems. This family of loss functions covers many
interesting problems. Two leading examples, used in our application to causal inference,
are least squares and logistic regression, corresponding to the outcome and propensity
score models, respectively. For least squares, the target function and loss are
f∗(x) := E[Y | X = x]   and   ℓ(f, z) = ½ (y − f(x))²,   (2.2)

respectively, while for logistic regression these are

f∗(x) := log( E[Y | X = x] / (1 − E[Y | X = x]) )   and   ℓ(f, z) = −y f(x) + log(1 + e^{f(x)}).   (2.3)
Lemma 8 verifies, with explicit constants, that (2.1) holds for these two. Losses obeying
(2.1) extend beyond these cases to other generalized linear models, such as count models,
and can even cover multinomial logistic regression (multiclass classification), as shown in
Lemma 9.
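As a concrete rendering of the two leading examples, the short Python sketch below (ours, not from the paper) evaluates ℓ(f, z) for least squares and logistic regression at a candidate value f(x); the function and variable names are illustrative assumptions.

```python
import numpy as np

def loss_least_squares(f_x, y):
    """Least squares loss from (2.2): (1/2) * (y - f(x))^2."""
    return 0.5 * (y - f_x) ** 2

def loss_logistic(f_x, y):
    """Logistic loss from (2.3): -y * f(x) + log(1 + exp(f(x)))."""
    return -y * f_x + np.logaddexp(0.0, f_x)  # numerically stable log(1 + e^{f(x)})

# Both losses are Lipschitz in f(x) for bounded outcomes, as required by (2.1).
print(loss_least_squares(0.3, 1.0), loss_logistic(0.3, 1.0))
```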
¹We thank an anonymous referee for suggesting this approach of exposition.
FIGURE 1.—Illustration of a feedforward neural network with W = 18, L = 2, U = 5, and input dimension
d = 2. The input units are shown in white at left, the output in black at right, and the hidden units in grey
between them.
A feedforward neural network has d input units, one for each covariate X ∈ R^d, and one output unit for the outcome Y. Between these are U hidden units, or com-
putational nodes or neurons. These are connected by a directed acyclic graph specifying
the architecture. The key graphical feature of a feedforward network is that hidden units
are grouped in a sequence of L layers, the depth of the network, where a node is in layer
l = 1, 2, ..., L, if it has a predecessor in layer l − 1 and no predecessor in any layer l′ ≥ l.
The width of the network at a given layer, denoted Hl , is the number of units in that layer.
The network is completed with the choice of an activation function σ : R → R applied to
the output of each node as described below. In this paper, we focus on the popular ReLU
activation function σ(x) = max(x, 0), though our results can be extended (at notational
cost) to cover piecewise linear activation functions (see also Remark 3).
An important and widely used subclass is fully connected between consecutive layers, has no other connections, and has hidden layers whose widths are all of the same order of magnitude. This architecture is often referred to as a Multilayer Perceptron (MLP), and we denote this class as F_MLP; see Figure 2, cf. Figure 1. We will assume that the widths of all layers share a common asymptotic order H, implying that for this class U ≍ LH.
We allow for generic feedforward networks, but we present special results for the MLP
case, as it is widely used in empirical practice. As we will see below, the architecture,
through its complexity, and equally importantly, approximation power, plays a crucial
role in the final bound. In particular, we find only a suboptimal rate for the MLP case, but
our upper bound is still sufficient for semiparametric inference.
To build intuition on the computation, and compare to other nonparametric methods,
let us focus on least squares for the moment, that is, equation (2.2), with a continuous out-
come, using a multilayer perceptron with constant width H. Each hidden unit u receives an input in the form of a linear combination x̃′w + b and then returns σ(x̃′w + b), where the vector x̃ collects the output of all the units with a directed edge into u (i.e., from prior layers), w is a vector of weights, and b is a constant term. (The constant term is often referred to as the "bias" in the deep learning literature, but given the loaded meaning of this term in inference, we will largely avoid referring to b as a bias.) To be precise, let x̃_{h,l} denote the scalar output of node u = (h, l), for h = 1, ..., H_l and l = 1, ..., L, and let x̃_l = (x̃_{1,l}, ..., x̃_{H_l,l})′ for layer l ≤ L. The full network is defined through recursion: each node computes x̃_{h,l} = σ(x̃′_{l−1} w_{h,l−1} + b_{h,l−1}), and the final output is ŷ = f̂_MLP(x) = x̃′_L w_L + b_L. The MLP
estimator can also be written as a composition as follows. Define W_l as the H_{l+1} × H_l matrix collecting the weight vectors {w_{h,l}}, where H_0 = d, let b_l be the vector collecting the corresponding constants {b_{h,l}}, and let σ(·) applied to a vector act componentwise. Then

f̂_MLP(x) = W_L σ(· · · σ(W_3 σ(W_2 σ(W_1 σ(W_0 x + b_0) + b_1) + b_2) + b_3) + · · ·) + b_L.
(This exact structure does not hold for the more general case of Section 2.3.) It is also
useful to write the output of the final layer as x̃L = x̃L (x), explicitly as a function of the
original covariates, and thus the final output may be seen as a basis function approxima-
tion (albeit a complex and data-dependent one), written as f̂_MLP(x) = x̃_L(x)′w_L + b_L, which is reminiscent of a traditional series (linear sieve) estimator. If all layers save the last were fixed, we could simply optimize using least squares directly: (ŵ_L, b̂_L) ∈ arg min_{w,b} ‖y_i − x̃′_L w − b‖²_n.
The crucial distinction is that the basis functions x̃L (·) are learned from the data.
The "basis" is x̃_L = (x̃_{1,L}, ..., x̃_{H_L,L})′, where each x̃_{h,L} = σ(x̃′_{L−1} w_{h,L−1} + b_{h,L−1}). Therefore, "before" we can solve the least squares problem above, we would have to estimate (w_{h,L−1}, b_{h,L−1}), h = 1, ..., H, anticipating the final estimation. These in turn de-
pend on the prior layer, and so forth back to the original inputs x. Optimization pro-
ceeds layer-by-layer using (variants of) stochastic gradient descent, with gradients of
the parameters calculated by back-propagation (implementing the chain rule) induced
by the network structure. Our results match standard optimization methods by not re-
quiring the weight parameters to be uniformly bounded. The collection, over all nodes,
of w and b, constitutes the parameters θ which are optimized in the final estima-
tion. We denote W as the total number of parameters of the network. For the MLP,
W = (d + 1)H + (L − 1)(H² + H) + H + 1.
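To make the recursion and the parameter count concrete, the following minimal numpy sketch (ours, with arbitrary dimensions and random placeholder weights rather than estimates) computes the MLP forward pass with ReLU activation and verifies the formula for W.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def mlp_forward(x, weights, biases):
    """Compute f_MLP(x) = W_L sigma(... sigma(W_1 sigma(W_0 x + b_0) + b_1) ...) + b_L."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                  # hidden layers apply the ReLU activation
    return weights[-1] @ h + biases[-1]      # the output layer is linear

d, H, L = 2, 5, 3                            # input dimension, width, depth (hidden layers)
rng = np.random.default_rng(0)
dims = [d] + [H] * L + [1]
weights = [rng.normal(size=(dims[l + 1], dims[l])) for l in range(L + 1)]
biases = [rng.normal(size=dims[l + 1]) for l in range(L + 1)]

# Total parameter count matches W = (d + 1)H + (L - 1)(H^2 + H) + H + 1.
W_total = sum(w.size + b.size for w, b in zip(weights, biases))
assert W_total == (d + 1) * H + (L - 1) * (H ** 2 + H) + H + 1
print(mlp_forward(np.array([0.5, -0.2]), weights, biases))
```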
To further clarify the use of deep nets, it is useful to make explicit analogies to more
classical nonparametric techniques, leveraging the form fMLP (x) = x̃L (x) wL + bL . For
a traditional series estimator (such as splines) the two choices for the practitioner are
the basis (the spline shape and degree) and the number of terms (knots), commonly re-
ferred to as the smoothing and tuning parameters, respectively. In kernel regression, these
would respectively be the shape of the kernel (and degree of local polynomial) and the
bandwidth(s). For neural networks, the same phenomena are present: the architecture as
a whole (the graph structure and activation function) acts as the smoothing parameter, while
the width and depth play the role of tuning parameters.
The architecture plays a crucial role in that it determines the approximation power
of the network, and it is worth noting that because of the relative complexity of neural
networks, such approximations, and comparisons across architectures, are not simple. It
is comparatively obvious that quartic splines are more flexible than cubic splines (for the
same number of knots) as is a higher degree local polynomial (for the same bandwidth).
At a glance, it may not be clear what function class a given network architecture (width,
depth, graph structure, and activation function) can approximate. As we will show below,
the MLP architecture is not yet known to yield an optimal approximation (for a given
width and depth) and, therefore, we are only able to prove a bound with slower than
optimal rate. As a final note, computational considerations are important for deep nets
in a way that is not true conventionally; see Remarks 1, 2, and 3.
Just as for classical nonparametrics, for a fixed architecture it is the tuning parameters
that determine the rate of convergence (fixing smoothness of f∗ ). The recent wave of the-
oretical study of deep learning is still in its infancy. As such, there is no understanding
of optimal architectures or tuning parameters. These choices can be difficult and only
preliminary research has been done (see, e.g., Daniely (2017), Telgarsky (2016) and refer-
ences therein). However, it is interesting that in some cases, results can be obtained even
with a fixed width H, provided the network is deep enough; see Corollary 2.
In sum, for a user-chosen architecture F_DNN, encompassing the choices σ(·), U, L, W, and the graph structure, the final estimate is computed using observed samples z_i = (y_i, x_i′)′, i = 1, 2, ..., n, of Z, by solving

f̂_DNN := arg min_{f∈F_DNN, ‖f‖_∞≤2M} Σ_{i=1}^n ℓ(f, z_i).   (2.4)
Recall that θ collects, over all nodes, the weights and constants w and b. When (2.4) is
restricted to the MLP class, we denote the resulting estimator f̂_MLP. The choice of M may
be arbitrarily large, and is part of the definition of the class FDNN . This is neither a tun-
ing parameter nor regularization in the usual sense: it is not assumed to vary with n, and
beyond being finite and bounding f∗ ∞ (see Assumption 1), no properties of M are re-
quired. This is simply a formalization of the requirement that the optimizer is not allowed
to diverge on the function level in the ℓ∞ sense, the weakest form of constraint. It is im-
portant to note that while typically regularization will alter the approximation power of
the class, that is not the case with the choice of M as we will assume that the true function
f∗ (x) is bounded, as is standard in nonparametric analysis. With some extra notational
burden, one can make the dependence of the bound on M explicit, though we omit this
for clarity as it is not related to statistical issues.
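For illustration only, a stripped-down version of (2.4) for the least squares loss can be solved by full-batch gradient descent with backpropagation, as in the numpy sketch below. The simulated target function, width, depth, learning rate, and number of steps are all our own arbitrary choices; practical implementations use stochastic gradients in libraries such as TensorFlow, and the sup-norm constraint in (2.4) is omitted since, as just discussed, it is a formality rather than a tuning parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(d, H, L):
    """Random starting values for an MLP with L hidden layers of width H."""
    dims = [d] + [H] * L + [1]
    Ws = [rng.normal(scale=1.0 / np.sqrt(dims[l]), size=(dims[l + 1], dims[l]))
          for l in range(L + 1)]
    bs = [np.zeros(dims[l + 1]) for l in range(L + 1)]
    return Ws, bs

def forward(X, Ws, bs):
    """Forward pass; returns predictions and the layer activations needed for backprop."""
    A = [X]                                                # X has shape (d, n)
    for W, b in zip(Ws[:-1], bs[:-1]):
        A.append(np.maximum(W @ A[-1] + b[:, None], 0.0))  # ReLU hidden layers
    return (Ws[-1] @ A[-1] + bs[-1][:, None]).ravel(), A   # linear output layer

def gradient_step(X, y, Ws, bs, lr=0.1):
    """One full-batch gradient step on the least squares loss (2.2)."""
    f, A = forward(X, Ws, bs)
    delta = ((f - y) / y.size)[None, :]                    # d(average loss) / d(output)
    for l in range(len(Ws) - 1, -1, -1):
        gW, gb = delta @ A[l].T, delta.sum(axis=1)
        if l > 0:
            delta = (Ws[l].T @ delta) * (A[l] > 0)         # propagate through the ReLU
        Ws[l] -= lr * gW
        bs[l] -= lr * gb

# Simulated regression with an arbitrary smooth target f*(x) = sin(pi x_1) + x_2^2.
d, n = 2, 2000
X = rng.uniform(-1, 1, size=(d, n))
y = np.sin(np.pi * X[0]) + X[1] ** 2 + 0.1 * rng.normal(size=n)

Ws, bs = init_mlp(d, H=32, L=3)
for _ in range(1000):
    gradient_step(X, y, Ws, bs)
print("in-sample MSE:", round(float(np.mean((forward(X, Ws, bs)[0] - y) ** 2)), 4))
```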
ASSUMPTION 1: Assume that z_i = (y_i, x_i′)′, 1 ≤ i ≤ n, are i.i.d. copies of Z = (Y, X′)′, where X ∈ [−1, 1]^d is continuously distributed and, for a fixed M > 0, |Y| ≤ M and ‖f∗‖_∞ ≤ M.

This assumption is fairly standard in nonparametrics. The only restriction worth men-
tioning is that the outcome is bounded. In many cases this holds by default (such as logistic
regression, where Y ∈ {0, 1}) or count models (where Y ∈ {0, 1, ..., M}, with M limited
by real-world constraints). For continuous outcomes, such as least squares regression, our
restriction is not substantially more limiting than the usual assumption of a model such
as Y = f∗ (X) + ε, where X is compact-supported, f∗ is bounded, and the stochastic error
ε possesses many moments. Indeed, in many applications such a structure is only coher-
ent with bounded outcomes, such as the common practice of including lagged outcomes
as predictors. Next, the assumption of continuously distributed covariates is quite stan-
dard. Discrete covariates taking on many values may be more realistically thought of as
continuous, and it may be more accurate to allow these to slow the convergence rates.
Our focus on L2 (X) convergence allows for these essentially automatically. Finally, from
a practical point of view, deep networks handle discrete covariates seamlessly and have
demonstrated excellent empirical performance, which is in contrast to other more classi-
cal nonparametric techniques that may require manual adaptation.
Our main result treats the multilayer perceptron architecture, with the ReLU activation
function and unbounded weights, matching perhaps the most standard deep neural net-
work. Such MLPs are now known to approximate smooth functions well (Yarotsky (2017,
2018)), leading to our next assumption: that the target function f∗ lies in a Hölder ball
with certain smoothness. Discussion of Hölder, Sobolev, and Besov spaces can be found
in Gine and Nickl (2016).
ASSUMPTION 2: Assume f∗ lies in the Hölder ball W^{β,∞}([−1, 1]^d), with smoothness β ∈ N_+,

f∗(x) ∈ W^{β,∞}([−1, 1]^d) := { f : max_{α, |α| ≤ β} ess sup_{x ∈ [−1,1]^d} |D^α f(x)| ≤ 1 }.
Under Assumptions 1 and 2, we obtain the following high probability bounds, cover-
ing a host of models, which, to the best of our knowledge, is new to the literature. In
some sense, this is our main result for deep learning, as it deals with the most common
architecture.
THEOREM 1: Suppose Assumptions 1 and 2 hold. Let f̂_MLP solve (2.4) over the multilayer perceptron class F_MLP, with width H ≍ n^{d/(2(β+d))} log^2 n and depth L ≍ log n. Then, with probability at least 1 − exp(−n^{d/(β+d)} log^8 n), for n large enough,
(a) ‖f̂_MLP − f∗‖²_{L2(X)} ≤ C · {n^{−β/(β+d)} log^8 n + (log log n)/n} and
(b) E_n[(f̂_MLP − f∗)²] ≤ C · {n^{−β/(β+d)} log^8 n + (log log n)/n},
for a constant C > 0 independent of n, which may depend on d, M, and other fixed constants.
Several points about this result deserve comment. First, the proof relies on a localization analysis with scale-insensitive complexity measures (Rademacher complexities), in contrast to a classic sieve analysis with scale-sensitive measures such as metric entropy. This allows for a
richer set of approximating possibilities, in particular allowing more flexibility in seeking
architectures with specific properties, as we explore in the next subsection. For the special
case of least squares regression, Koltchinskii (2011) used a similar approach, and a similar
result to our Theorem 1(a) can be derived for this case using his Theorem 5.2 and Exam-
ple 3 (p. 85f). This is perhaps the nearest antecedent to our result. To avoid repetition,
other important results are discussed following Theorem 2 below.
Second, we are able to attain a faster rate on the second term of the bound, order
n−1 in the sample size, instead of the n−1/2 that would result from a direct application of
uniform deviation bounds. This upper bound informs the trade offs between H and L,
and the approximation power, and may point toward optimal architectures for statistical
inference. Even with these choices of H and L, the bound of Theorem 1 is not optimal (for
fixed β, in the sense of Stone (1982)). We rely on the explicit approximating constructions
of Yarotsky (2017), and it is possible that in the future improved approximation properties
of MLPs will be found, allowing for a sharpening of the results of Theorem 1 immediately,
that is, without change to our theoretical argument. At present, it is not clear if this rate
can be improved, but it is sufficiently fast for valid inference.
Finally, we note that as is standard in nonparametrics, this result relies on choosing H
appropriately given the smoothness β of Assumption 2. Of course, the true smoothness
is unknown and thus in practice the “β” appearing in H, and consequently in the con-
vergence rates, need not match that of Assumption 2. In general, the rate will depend on
the smaller of the two. Most commonly, it is assumed that the user-chosen β is fixed and
that the truth is smoother; witness the ubiquity of cubic splines and local linear regres-
sion. Rather than spell out these consequences directly, we will tacitly assume the true
smoothness is not less than the β appearing in H (here and below). Smoothness adaptive
approaches, as in classical nonparametrics, may also be possible with deep nets, but are
beyond the scope of this study.
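As a purely numerical illustration of this choice, the width and depth orders appearing in Theorem 1 can be turned into concrete integers once constants are fixed; the theory pins down only the asymptotic orders, so the constants below (set to one) and the example values of n, d, and β are our own assumptions.

```python
import numpy as np

def mlp_tuning(n, d, beta, c_width=1.0, c_depth=1.0):
    """Width H ~ n^{d / (2(beta + d))} log^2 n and depth L ~ log n, as in Theorem 1.

    c_width and c_depth are unspecified constants; only the rates are pinned down."""
    H = int(np.ceil(c_width * n ** (d / (2.0 * (beta + d))) * np.log(n) ** 2))
    L = int(np.ceil(c_depth * np.log(n)))
    return H, L

# Example: a 5-dimensional design with assumed smoothness beta = 3 and n = 10,000.
print(mlp_tuning(n=10_000, d=5, beta=3))
```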
ASSUMPTION 3: Let f∗ lie in a class F. For the feedforward network class F_DNN used in (2.4), let the approximation error ε_DNN be

ε_DNN := sup_{f∈F} inf_{f′∈F_DNN} ‖f′ − f‖_∞.
It may be possible to require only an approximation in the L2 (X) norm, but this as-
sumption matches the current approximation theory literature and is more comparable
with other work in nonparametrics, and thus we maintain the uniform definition. We then
obtain the following result.
This result is more general than Theorem 1, covering the general deep ReLU net-
work problem defined in (2.4), general feedforward architectures, and the general class
of losses defined by (2.1). The same comments as were made following Theorem 1 apply
here as well: the same localization argument is used with the same benefits. We explicitly
use this in the next two corollaries, where we exploit the allowed flexibility in controlling
ε_DNN by stating results for particular architectures. The bound here is not directly appli-
cable without specifying the network structure, which will determine both the variance
portion (through W , L, and U) and the approximation error. With these set, the bound
becomes operational upon choosing γ, which can be optimized as desired.
Perhaps the most directly related existing result, in addition to the aforementioned
result of Koltchinskii (2011), is Theorem 2 of Schmidt-Hieber (2019), which also uses
generic approximation error. That result is not a high-probability bound, only a rate
on the expected risk, only covers squared loss, and requires Gaussian noise that is in-
dependent of the covariates and has known, homoskedastic variance, and, importantly,
requires uniformly bounded weights in the network. The assumption of bounded weights
may be difficult to impose computationally and can limit the approximation power of
the network. To see this last point, consider a simple example: suppose that d = 1 and
f∗ (x) = σ(ζx + 1)/2 − σ(ζx − 1)/2. This f∗ , for any ζ, is bounded and can be realized
by a ReLU network without norm constraints using only two hidden units, and is thus
estimable at 1/n. However, for ζ > 1 a network with weights bounded by one (as in
Schmidt-Hieber (2019)) must have width 2ζ, so ζ must be known, and yields expected
risk of order ζ/n.
Turning to special cases, we first show that the optimal rate of Stone (1982) can be
attained, up to log factors. However, this relies on a rather artificial network structure,
designed to approximate functions in a Sobolev space well, but without concern for
practical implementation. Thus, while the following rate improves upon Theorem 1, we
view this result as mainly of theoretical interest: establishing that (certain) deep ReLU
networks are able to attain the optimal rate.
COROLLARY 1—Optimal Rate: Suppose Assumptions 1 and 2 hold. Let f̂_OPT solve (2.4) using the (deep and wide) network of Yarotsky (2017, Theorem 1), with W ≍ U ≍ n^{d/(2β+d)} log n and depth L ≍ log n; the following hold with probability at least 1 − e^{−γ}, for n large enough,
(a) ‖f̂_OPT − f∗‖²_{L2(X)} ≤ C · {n^{−2β/(2β+d)} log^4 n + (log log n + γ)/n} and
(b) E_n[(f̂_OPT − f∗)²] ≤ C · {n^{−2β/(2β+d)} log^4 n + (log log n + γ)/n},
for a constant C > 0 independent of n, which may depend on d, M, and other fixed constants.
The same rate, up to log factors, albeit concerning only the expected risk and subject to
the other limitations above, can be obtained from Theorems 2 and 5 of Schmidt-Hieber
(2019). However, the main goal of Schmidt-Hieber (2019) is not the standard nonpara-
metric problem considered here, but rather in studying dimension adaptivity. Specifically,
the main result therein, Theorem 1, shows that if f∗ is itself a composition of functions
which are individually estimable faster than n^{−2β/(2β+d)}, then a sparsely connected deep ReLU
network adapts to this structure and attains the faster rate, an oracle type result. We do
not explicitly study sparse networks. Further, it is shown that estimators which are not
based on a composition structure do not possess the same adaptation property. For more
on the results and limitations of Schmidt-Hieber (2019), see the published discussions
(Ghorbani, Mei, Misiakiewicz, and Montanari (2019), Shamir (2019), Kutyniok (2019)).
Other work in this direction is Bach (2017) and Bauer and Kohler (2019). Polson and
Ročková (2018) also obtain bounds for deep nets, building on these works, but applied in
a Bayesian context.
Next, we turn to very deep networks that are very narrow, which have attracted sub-
stantial recent interest. Theorem 1 and Corollary 1 dealt with networks where the depth
and the width grow with sample size. This matches the most common empirical prac-
tice, and is what we use in Section 4. However, it is possible to allow for networks of
fixed width, provided the depth is sufficiently large. The next result is perhaps the largest
departure from the classical study of neural networks: earlier work considered networks
with diverging width but fixed depth (often a single layer), while the reverse is true here.
The activation function is of course qualitatively different as well, being piecewise lin-
ear instead of smooth. Using recent results (Mhaskar and Poggio (2016), Hanin (2019),
Yarotsky (2018)), we can establish the following rate for very deep, fixed-width MLPs.
COROLLARY 2—Fixed Width Networks: Let the conditions of Theorem 1 hold, with β ≥ 1 in Assumption 2. Let f̂_FW solve (2.4) for an MLP with fixed width H = 2d + 10 and depth L ≍ n^{d/(2(2+d))}. Then with probability at least 1 − e^{−γ}, for n large enough,
(a) ‖f̂_FW − f∗‖²_{L2(X)} ≤ C · {n^{−2/(2+d)} log^2 n + (log log n + γ)/n} and
(b) E_n[(f̂_FW − f∗)²] ≤ C · {n^{−2/(2+d)} log^2 n + (log log n + γ)/n},
for a constant C > 0 independent of n, which may depend on d, M, and other fixed constants.
This result is again mainly of theoretical interest. The class is only able to approximate
well functions with β = 1 (cf. the choice of L) which limits the potential applications of
the result because, in practice, d will be large enough to render this rate, unlike those
above, too slow for use in later inference procedures. In particular, if d ≥ 3, the sufficient
conditions of Theorem 3 fail.
Finally, as mentioned following Theorem 1, our theory here will immediately yield a
faster rate upon discovery of improved approximation power of this class of networks.
In other words, for example, if a proof became available that fixed-width, very deep net-
works can approximate β-smooth functions (as in Assumption 2), then Corollary 2 will
be trivially improvable to match the rate of Theorem 1. Similarly, if the MLP architecture
can be shown to share the approximation power with that of Corollary 1, then Theorem 1
will itself deliver the optimal rate. Our proofs will not require adjustment.
REMARK 2: Although there has been a great deal of work in easing implementation
(optimization and tuning) of deep nets, it still may be a challenge in some settings, par-
ticularly when using non-standard architectures. See also Remark 1. Given the renewed
interest in deep networks, this is already an active area of study (Hartford et al. (2017), Polson and Ročková (2018)), and we expect this work to continue and implementations to rapidly
evolve. This is perhaps another reason that Theorem 1 is, at the present time, the most
practically useful, but that (as just discussed) Theorem 2 will be increasingly useful in the
future.
REMARK 3: Our results can be extended easily to include piecewise linear activation
functions beyond ReLU, using the complexity result obtained in Bartlett et al. (2017). In
principle, similar rates of convergence could be attained for other activation functions,
given results on their approximation error. However, it is not clear what practical value
would be offered due to computational issues (in which the activation choice plays a cru-
cial role). Indeed, the recent switch to ReLU stems not from their greater approximation
power, but from the fact that optimizing a deep net with sigmoid-type activation is unsta-
ble or impossible in practice. Thus, while it is certainly possible that we could complement
the single-layer results with rates for sigmoid-based deep networks, these results would
have no consequences for real-world practice.
From a purely practical point of view, several variations of the ReLU activation function
have been proposed recently (including the so-called Leaky ReLU, Randomized ReLU,
(Scaled) Exponential Linear Units, and so forth) and have been found in some experi-
ments to improve optimization properties. It is not clear what theoretical properties these
activation functions have or if the computational benefits persist more generically, though
this area is rapidly evolving. We conjecture that our results could be extended to include
these activation functions.
ASSUMPTION 4: Let p(x) = P[T = 1 | X = x] denote the propensity score and μ_t(x) = E[Y(t) | X = x], t ∈ {0, 1}, denote the two outcome regression functions. For t ∈ {0, 1} and almost surely X, E[Y(t) | T, X = x] = E[Y(t) | X = x] and p̄ ≤ p(x) ≤ 1 − p̄ for some p̄ > 0.
Our approach to inference follows the current literature and uses sample averages of
the (uncentered) influence functions. This approach yields valid inference under weaker
conditions on the first step estimates (Farrell (2015), Chernozhukov et al. (2018)). Hahn
(1998) showed that the influence function for a single average potential outcome is
given by ψ_t(z) − E[Y(t)], for t ∈ {0, 1} and z = (y, t, x′)′, where ψ_t(z) = 1{T = t}(y − μ_t(x)) P[T = t | X = x]^{−1} + μ_t(x). We estimate the unknown functions with deep learning to form

ψ̂_t(z_i) = 1{t_i = t}(y_i − μ̂_t(x_i)) / P̂[T = t | X = x_i] + μ̂_t(x_i).   (3.1)
THEOREM 3: Suppose that {z_i = (y_i, t_i, x_i′)′}_{i=1}^n are i.i.d. obeying Assumption 4 and the conditions of Theorem 1 hold with β_p ∧ β_μ > d. Further assume that, for t ∈ {0, 1}, E[(s(X)ψ_t(Z))² | X] is bounded away from zero and, for some δ > 0, E[(s(X)ψ_t(Z))^{4+δ} | X] is bounded. Then the deep MLP-ReLU network estimators defined above obey the following, for t ∈ {0, 1}:
(a) E_n[(p̂(x_i) − p(x_i))²] = o_P(1) and E_n[(μ̂_t(x_i) − μ_t(x_i))²] = o_P(1),
(b) E_n[(μ̂_t(x_i) − μ_t(x_i))²]^{1/2} E_n[(p̂(x_i) − p(x_i))²]^{1/2} = o_P(n^{−1/2}), and
(c) E_n[(μ̂_t(x_i) − μ_t(x_i))(1 − 1{t_i = t}/P[T = t | X = x_i])] = o_P(n^{−1/2}),
and, therefore, if p̂(x_i) is bounded inside (0, 1), for a given s(x) and t ∈ {0, 1}, we have

√n E_n[s(x_i)ψ̂_t(z_i) − s(x_i)ψ_t(z_i)] = o_P(1)   and   E_n[(s(x_i)ψ̂_t(z_i))²] / E_n[(s(x_i)ψ_t(z_i))²] − 1 = o_P(1).
It is immediate from Theorem 3 that the estimators of (3.2), and other similar estima-
tors, are asymptotically Normal with estimable variance. Looking at π̂(s) to fix ideas,

√n Σ̂^{−1/2}(π̂(s) − π(s)) →_d N(0, 1),   with   Σ̂ = E_n[(s(x_i)ψ̂_1(z_i) + (1 − s(x_i))ψ̂_0(z_i))²] − π̂(s)².
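To fix ideas computationally, the numpy sketch below (ours, with all variable names hypothetical) builds the scores in (3.1) from already-fitted values μ̂_1, μ̂_0, and p̂, aggregates them under a targeting rule s(x) into π̂(s), and forms the Normal-based confidence interval using the variance estimator Σ̂ displayed above; the first-step deep nets themselves are not shown.

```python
import numpy as np

def dr_scores(y, t, mu1_hat, mu0_hat, p_hat):
    """Doubly robust scores psi_hat_1 and psi_hat_0 from (3.1)."""
    psi1 = t * (y - mu1_hat) / p_hat + mu1_hat
    psi0 = (1 - t) * (y - mu0_hat) / (1 - p_hat) + mu0_hat
    return psi1, psi0

def policy_value_ci(y, t, mu1_hat, mu0_hat, p_hat, s, z=1.96):
    """Estimate pi(s) = E[s(X)Y(1) + (1 - s(X))Y(0)] with a 95% confidence interval."""
    psi1, psi0 = dr_scores(y, t, mu1_hat, mu0_hat, p_hat)
    scores = s * psi1 + (1 - s) * psi0
    pi_hat = scores.mean()
    sigma2_hat = np.mean(scores ** 2) - pi_hat ** 2   # Sigma-hat from the display above
    half = z * np.sqrt(sigma2_hat / y.size)
    return pi_hat, (pi_hat - half, pi_hat + half)

# Toy usage with a randomized treatment and placeholder (correctly specified) first steps.
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(size=n)
t = rng.binomial(1, 0.5, size=n)
y = 1.0 + 2.0 * t * x + rng.normal(size=n)
print(policy_value_ci(y, t, mu1_hat=1.0 + 2.0 * x, mu0_hat=np.ones(n),
                      p_hat=np.full(n, 0.5), s=(x > 0.5).astype(float)))
```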
COROLLARY 3: Let the conditions of Theorem 3 hold but, instead of Assumption 4, assume T is independent of Y(0), Y(1), and X, and is distributed Bernoulli with parameter p∗ bounded inside (0, 1). Then, if β_μ > d, the deep MLP-ReLU networks obey (a′) E_n[(μ̂_t(x_i) − μ_t(x_i))²] = o_P(1) and (c′) E_n[(μ̂_t(x_i) − μ_t(x_i))(1 − 1{t_i = t}/p∗)] = o_P(n^{−1/2}), and the results of Theorem 3 hold.
Theorem 3 shows, for a specific context, how deep learning delivers valid asymptotic
inference for our parameters of interest. Theorem 1 (a generic result using Theorem 2
could be stated) proves that the nonparametric estimates converge sufficiently fast, as
formalized by conditions (a), (b), and (c), enabling feasible efficient semiparametric in-
ference. Proofs and further discussion of similar results can be found in Farrell (2015),
Chernozhukov et al. (2018). Here, it is worth mentioning that the condition (c), which
arises from a “leave-in” type remainder, can be weakened using sample splitting. Instead,
we employ our localization analysis, as was used to obtain the results of Section 2, to ver-
ify (c) directly (see Lemma 10); this appears to be a novel application of localization, and
this approach may be useful in future applications of second-step inference using machine
learning methods where the theoretical gain of weaker requirements may not be worth
the price paid in constants in finite samples.
Finally, we close this discussion by noting that our focus with Theorem 3 is showcasing
the practical utility of deep learning in inference, and not in attaining minimal condi-
tions. The requirement that βp ∧ βμ > d, or βp ∧ βμ > d/2 in Corollary 1, is not minimal.
Minimal conditions for semiparametric inference have been studied by many, dating at
least to Bickel and Ritov (1988); see Robins, Tchetgen, Li, and van der Vaart (2009) for
recent results and references. For causal inference, Chen, Hong, and Tarozzi (2008) and
Athey, Imbens, and Wager (2018) obtain efficiency under weaker conditions than ours on
p(x) (the former under minimal smoothness on μt (x) and the latter under a sparsity in
a high-dimensional linear model). Further, cross-fitting paired with local robustness may
yield weaker smoothness conditions by providing “underfitting” robustness, that is, weak-
ening bias-related assumptions (Chernozhukov et al. (2018)), but the cost may be too
high here. Weaker variance-related assumptions, or “overfitting” robustness (Cattaneo,
Jansson, and Ma (2019)), may also be possible following deep learning, but are less auto-
matic at present. Other methods for causal inference under relaxed assumptions may be
useful here, such as extensions to doubly robust inference (Tan (2020)) or robust inverse
weighting (Ma and Wang (2018)).
4. EMPIRICAL APPLICATION
To illustrate our results, we study data from a large US retailer of consumer prod-
ucts. The firm sells directly to the customer (as opposed to via retailers) using a variety
of channels such as the web and mail. Targeted marketing instruments, such as catalogs,
aim to induce demand and often contain advertising and informational content about the
firm's offerings. It is important to carefully select which customers should be sent this ma-
terial, that is, be targeted for treatment, since the costs of its creation and dissemination
accumulate rapidly. For a typical retailer, the costs of one catalog may be close to a dol-
lar. With millions of catalogs being sent, ascertaining the causal effects of such targeted
mailing, and then using these effects to evaluate potential targeting strategies, is crucial
for policy making. For a full discussion, see Hitsch and Misra (2018) (we use their 2015
sample).
The data consists of 292,657 consumers chosen at random from the retailer’s database.
Of these, 2/3 were randomly chosen to receive a catalog (the treatment). We observe treat-
ment status, roughly one hundred fifty covariates, including demographics, past purchase
behaviors, interactions with the firm, and other relevant information, and total consumer
spending, the outcome of interest, aggregated from all available purchase channels in-
cluding phone, mail, and the web, in a 3-month window. Average spending is $7.31, but
for the roughly 6% who made a purchase, the average spend is $117.73.
We implement equations (3.1) and (3.2) for eight different deep nets. All computation
was done using TensorFlowTM . The details of the eight deep net architectures are pre-
sented in Table I. A key measure of fit reported in the final column of the table is the
proportion of τ̂(x_i) that were negative. As argued by Hitsch and Misra (2018), it is implausi-
ble, for nearly all individuals, under standard marketing or economic theory that receipt
of a catalog causes lower purchasing. Here, deep nets perform as well as, and sometimes
better than, the best methods found by Hitsch and Misra (2018). Figure 3 shows the dis-
tribution of τ̂(x_i) across customers for each of the eight architectures. While there are
differences in the shapes, the mean and variance estimates are nonetheless similar. We
also conducted a placebo test: using only the untreated customers in the data and ran-
domly assigning half to treated status we then reran the eight architectures.2 Figure 4
TABLE I
DEEP NETWORK ARCHITECTURES

²We thank Guido Imbens for suggesting this analysis.
plots τ̂(x_i) for these, and we see that the "true" zero average effect is recovered and with
the expected distribution.
Table II shows the estimates of the average treatment effect and the counterfactual
profits from three different targeting strategies, along with their respective 95% con-
fidence intervals. The strategies are (i) never treat, s(x) ≡ 0; (ii) a blanket treatment,
s(x) ≡ 1; (iii) a loyalty policy, s(x) = 1 only for those who had purchased in the prior cal-
endar year. In all cases, we add a profit margin m and a mailing cost c to π(s) (our NDA
with the firm forbids revealing m and c). It is clear that profits from the three policies are
ordered as π(never) < π(blanket) < π(loyalty). In all results, there is broad agreement
among the eight architectures. This may be due to the fact that the data is experimental,
so that the propensity score is constant. We have explored this using simulations, which
are reported in the Supplemental Material (Farrell, Liang, and Misra (2021)).
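The comparison of targeting rules can be sketched as follows; this is our illustration rather than the authors' code, and the mapping from π(s) to profit through the margin m and cost c is an assumed functional form with made-up values, since the actual m and c are confidential.

```python
import numpy as np

def policy_profit(psi1, psi0, s, m, c):
    """Illustrative profit of rule s: margin m applied to counterfactual spending pi(s),
    net of the mailing cost c for each targeted customer (assumed functional form)."""
    pi_hat = np.mean(s * psi1 + (1 - s) * psi0)
    return m * pi_hat - c * np.mean(s)

# psi1 and psi0 would come from (3.1); here they are simulated placeholders.
rng = np.random.default_rng(1)
n = 1000
psi0 = rng.normal(5.0, 2.0, size=n)
psi1 = psi0 + rng.normal(2.5, 1.0, size=n)
purchased_last_year = rng.binomial(1, 0.3, size=n).astype(float)

policies = {"never": np.zeros(n), "blanket": np.ones(n), "loyalty": purchased_last_year}
for name, s in policies.items():
    print(name, round(policy_profit(psi1, psi0, s, m=0.30, c=0.90), 3))
```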
TABLE II
AVERAGE TREATMENT EFFECT ESTIMATES AND COUNTERFACTUAL PROFITS FROM THREE TARGETING
STRATEGIES, WITH 95% CONFIDENCE INTERVALS

Architecture   ATE                    Never                  Blanket                Loyalty
1              2.606 [2.273, 2.932]   2.016 [1.923, 2.110]   2.234 [2.162, 2.306]   2.367 [2.292, 2.443]
2              2.577 [2.252, 2.901]   2.022 [1.929, 2.114]   2.229 [2.157, 2.301]   2.363 [2.288, 2.438]
3              2.547 [2.223, 2.872]   2.027 [1.934, 2.120]   2.224 [2.152, 2.296]   2.358 [2.283, 2.434]
4              2.488 [2.160, 2.817]   2.037 [1.944, 2.130]   2.213 [2.140, 2.286]   2.350 [2.274, 2.425]
5              2.459 [2.127, 2.791]   2.043 [1.950, 2.136]   2.208 [2.135, 2.281]   2.345 [2.269, 2.422]
6              2.430 [2.093, 2.767]   2.048 [1.954, 2.142]   2.202 [2.128, 2.277]   2.341 [2.263, 2.418]
7              2.400 [2.057, 2.744]   2.053 [1.959, 2.148]   2.197 [2.122, 2.272]   2.336 [2.258, 2.414]
8              2.371 [2.021, 2.721]   2.059 [1.963, 2.154]   2.192 [2.116, 2.268]   2.332 [2.253, 2.411]
5. CONCLUSION
The utility of deep learning in social science applications is still a subject of interest
and debate. While there is an acknowledgment of its predictive power, there has been
limited adoption of deep learning in social sciences such as economics. Some part of
the reluctance to adopting these methods stems from the lack of theory facilitating use
and interpretation. We have shown, both theoretically as well as empirically, that these
methods can offer excellent performance.
In this paper, we have given a formal proof that inference can be valid after using deep
learning methods for first-step estimation. Our results thus contribute directly to the re-
cent explosion in both theoretical and applied research using machine learning methods
in economics, and to the recent adoption of deep learning in empirical settings. We ob-
tained novel bounds for deep neural networks, speaking directly to the modern (and em-
pirically successful) practice of using fully-connected feedforward networks. Our results
allow for different network architectures, including fixed width, very deep networks. Our
results cover general nonparametric regression-type loss functions, encompassing most non-
parametric practice. We used our bounds to deliver fast convergence rates allowing for
second-stage inference on a finite-dimensional parameter of interest.
There are practical implications of the theory presented in this paper. We focused on
semiparametric causal effects as a concrete illustration, but deep learning is a potentially
valuable tool in many diverse economic settings. Our results allow researchers to embed
deep learning into standard econometric models such as linear regressions, generalized
linear models, and other forms of limited dependent variables models (e.g., censored re-
gression). Our theory can also be used as a starting point for constructing deep learning
implementations of two-step estimators in the context of selection models, dynamic dis-
crete choice, and the estimation of games.
To be clear, we see our paper as an early step in the exploration of deep learning as a
tool for economic applications. There are a number of opportunities, questions, and chal-
lenges that remain. For some estimands, it may be crucial to estimate the density as well,
and this problem can be challenging in high dimensions. Deep nets, in the formulation
of GANs, are a promising tool for distribution estimation (Liang (2018), Athey, Imbens,
Metzger, and Munro (2019)). There are also interesting questions of network architec-
tures representing, and adapting to, the underlying function, and if these can be learned
from the data (Bach (2017), Dou and Liang (2020)). Lastly, further computational and
optimization guidance is needed. Research into these applications and structures is un-
derway.
APPENDIX A: PROOFS
In this appendix, we provide a proof of Theorems 1 and 2, our main theoretical re-
sults for deep ReLU networks, and their corollaries. The proof proceeds in several steps.
We first give the main breakdown and bound the bias (approximation error) term. We
then turn our attention to the empirical process term, to which we apply our localiza-
tion. Much of the proof uses a generic architecture, and thus pertains to both results. We
will specialize the architecture to the multi-layer perceptron only when needed later on.
Other special cases and related results are covered in Appendix A.4. Supporting lemmas
are stated in Appendix B.
The statements of Theorems 1 and 2 assume that n is large enough. Precisely, we require n > (2eM)² ∨ Pdim(F_DNN). For notational simplicity, we will denote f̂ := f̂_DNN, see (2.4), and ε_n := ε_DNN, see Assumption 3. As we are simultaneously considering Theorems 1 and 2, this generic notation will be used throughout.
Equation (A.1) is the main decomposition that begins the proof. The decomposition
must be done this way because of the above notes regarding f∗ and fn . The first term is the
empirical process term that will be treated in the subsequent subsection. For the second
term in (A.1), the bias term or approximation error, we apply Bernstein’s inequality to
≤ c_2 ε_n² + ε_n √(2 C_ℓ² γ̃ / n) + 7 C_ℓ M γ̃ / n,   (A.2)

using the Lipschitz and curvature conditions on the loss function defined in equation (2.1) and E[(f_n − f∗)²] ≤ ‖f_n − f∗‖²_∞, along with the definition of ε_n².
Once the empirical process term is controlled (in Appendix A.2), the two bounds will
be brought back together to compute the final result; see Appendix A.3.
Let η_1, ..., η_n be i.i.d. Rademacher random variables (P[η_i = 1] = P[η_i = −1] = 1/2), independent of the data, and define

R_n F := sup_{f∈F} | (1/n) Σ_{i=1}^n η_i f(x_i) |.
Intuitively, Rn F measures how flexible the function class is for predicting random
signs. Taking the expectation of Rn F conditioned on the data we obtain the empirical
Rademacher complexity, denoted Eη [Rn F ]. When the expectation is taken over both the
data and the draws ηi , ERn F , we get the Rademacher complexity.
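As a purely illustrative aside, not part of the proof, the empirical Rademacher complexity of a simple finite class can be approximated by Monte Carlo over the sign draws η_i, as in the sketch below; the class of linear functions on a grid and all constants are our own choices and are unrelated to the neural network classes analyzed here.

```python
import numpy as np

def empirical_rademacher(fx, n_draws=500, seed=0):
    """Monte Carlo estimate of E_eta[ sup_f | (1/n) sum_i eta_i f(x_i) | ].

    fx is a (num_functions, n) matrix whose rows are (f(x_1), ..., f(x_n)) for f in F."""
    rng = np.random.default_rng(seed)
    n = fx.shape[1]
    sups = []
    for _ in range(n_draws):
        eta = rng.choice([-1.0, 1.0], size=n)        # Rademacher signs
        sups.append(np.max(np.abs(fx @ eta)) / n)    # supremum over the class
    return float(np.mean(sups))

# Example class: f(x) = theta' x for theta on a grid in [-1, 1]^2, at 200 sample points.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
thetas = np.array([[a, b] for a in np.linspace(-1, 1, 21) for b in np.linspace(-1, 1, 21)])
print(empirical_rademacher(thetas @ X.T))
```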
$$E_n(f - f_*)^2 - E(f - f_*)^2 \le 3\,E R_n\bigl\{g = (f - f_*)^2 : f\in\mathcal{F},\ \|f - f_*\|_{L^2(X)}\le r\bigr\} + 3Mr\sqrt{\frac{2\tilde\gamma}{n}} + \frac{36M^2\tilde\gamma}{3n}$$
$$\le 18M\,E R_n\bigl\{f - f_* : f\in\mathcal{F},\ \|f - f_*\|_{L^2(X)}\le r\bigr\} + 3Mr\sqrt{\frac{2\tilde\gamma}{n}} + \frac{12M^2\tilde\gamma}{n}, \qquad (A.3)$$
where the second inequality applies Lemma 2 to the Lipschitz functions {g} (as a function
of the real values f (x)) and iterated expectations.
Suppose the radius r satisfies (A.4) and
$$r^2 \ge \frac{6\sqrt{6}\,M^2\tilde\gamma}{n}. \qquad (A.5)$$
Then we conclude from (A.3) that
$$E_n(f - f_*)^2 \le r^2 + r^2 + 3Mr\sqrt{\frac{2\tilde\gamma}{n}} + \frac{12M^2\tilde\gamma}{n} \le (2r)^2, \qquad (A.6)$$
where the first inequality uses (A.4) and the second uses (A.5). This means that for r above the "critical radius" (see Step III), the empirical $L^2$ norm is at most twice the population one, with probability at least $1 - \exp(-\tilde\gamma)$.
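The elementary arithmetic behind the final inequality in (A.6), spelled out here for completeness, is as follows: if $r^2 \ge 6\sqrt{6}\,M^2\tilde\gamma/n$, then
$$\frac{12M^2\tilde\gamma}{n} \le \frac{2}{\sqrt{6}}\,r^2 \quad\text{and}\quad 3Mr\sqrt{\frac{2\tilde\gamma}{n}} \le 3r^2\sqrt{\frac{2}{6\sqrt{6}}} \approx 1.11\,r^2,$$
$$\text{so that}\quad 3Mr\sqrt{\frac{2\tilde\gamma}{n}} + \frac{12M^2\tilde\gamma}{n} \le \left(\frac{2}{\sqrt{6}} + 3\sqrt{\frac{2}{6\sqrt{6}}}\right)r^2 \approx 1.92\,r^2 \le 2r^2.$$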
The middle term in (A.7) is due to the following variance calculation (recall Equation (2.1)):
$$V[g] \le E\bigl[g^2\bigr] = E\Bigl[\bigl(\ell(f, z) - \ell(f_*, z)\bigr)^2\Bigr] \le C_\ell^2\,E(f - f_*)^2 \le C_\ell^2 r_0^2.$$
Here, the fact that Lemma 5 is variance dependent, and that the variance depends on the radius $r_0$, is important. It is this property which enables a sharpening of the rate with step-by-step reductions in the variance bound, as in Appendix A.2.4.
For the empirical Rademacher complexity term, the first term of (A.7), Lemma 2,
Step I, and Lemma 3 (notation defined there), yield
with probability $1 - \exp(-\tilde\gamma)$ (when applying Step I). Recalling Lemma 4, one can further upper bound the entropy integral when $n > \mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})$:
$$\inf_{0<\alpha<2r_0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_{\alpha}^{2r_0}\sqrt{\log N\bigl(\delta,\ \mathcal{F}_{\mathrm{DNN}}|_{x_1,\dots,x_n},\ \|\cdot\|_\infty\bigr)}\,d\delta\right\}$$
$$\le \inf_{0<\alpha<2r_0}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_{\alpha}^{2r_0}\sqrt{\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})\,\log\frac{2eMn}{\delta\cdot\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})}}\,d\delta\right\}$$
$$\le 32 r_0\sqrt{\frac{\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})}{n}\left(\log\frac{2eM}{r_0} + \frac{3}{2}\log n\right)},$$
with the particular choice $\alpha = 2r_0\sqrt{\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})/n} < 2r_0$. Therefore, whenever $r_0 \ge 1/n$ and $n \ge (2eM)^2$,
$$E_\eta\bigl[R_n\mathcal{G}\bigr] \le 128\,C_\ell\,r_0\sqrt{\frac{\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})}{n}\log n}.$$
Applying this bound to (A.7), we have
$$(E - E_n)\bigl[\ell(f, z) - \ell(f_*, z)\bigr] \le K r_0\sqrt{\frac{\mathrm{Pdim}(\mathcal{F}_{\mathrm{DNN}})}{n}\log n} + r_0\sqrt{\frac{2C_\ell^2\tilde\gamma}{n}} + \frac{23MC_\ell\tilde\gamma}{n}, \qquad (A.8)$$
where $K = 6 \times 128\,C_\ell$.
Going back now to the main decomposition, plug (A.8) and (A.2) into (A.1); overall, we have found that, with probability at least $1 - 4\exp(-\tilde\gamma)$, the following holds:
By construction this obeys (A.4), and thus so does $2r_*$. Denote the event $\mathcal{E}$ (depending on the data) to be
Then, applying the one-step improvement argument in Step II (again, the variance dependence captured in Lemma 5 is crucial, here reflected in the variance within each shell), equation (A.9) yields that, with probability at least $1 - 4\exp(-\tilde\gamma)$,
$$\bigl\|\hat f - f_*\bigr\|_{L^2(X)}^2 \le 2^j\bar r\cdot\frac{1}{c_1}\left(K\sqrt{C}\sqrt{\frac{WL\log W}{n}\log n} + \sqrt{\frac{2C_\ell^2\tilde\gamma}{n}}\right) + \frac{1}{c_1}\left(c_2\epsilon_n^2 + \frac{2C_\ell^2\tilde\gamma}{n} + 30MC_\ell\frac{\tilde\gamma}{n}\right) \le 2^{2j-2}\bar r^2,$$
$$\frac{1}{c_1}\left(K\sqrt{C}\sqrt{\frac{WL\log W}{n}\log n} + \sqrt{\frac{2C_\ell^2\tilde\gamma}{n}}\right) \le \frac{1}{2}\,2^j\bar r,$$
$$\frac{1}{c_1}\left(c_2\epsilon_n^2 + \frac{2C_\ell^2\tilde\gamma}{n} + 26MC_\ell\frac{\tilde\gamma}{n}\right) \le \frac{1}{4}\,2^{2j}\bar r^2.$$
$$\bar r = \frac{8}{c_1}\left(K\sqrt{C}\sqrt{\frac{WL\log W}{n}\log n} + \sqrt{\frac{2C_\ell^2\tilde\gamma}{n}}\right) + \frac{2(c_2 \vee 1)}{c_1}\,\epsilon_n + \frac{120MC_\ell}{c_1}\,\frac{\tilde\gamma}{n} + r_*. \qquad (A.14)$$
The “and” part of each line follows from Step I and the implication uses the above ar-
gument following Step II. Therefore, in the end, we conclude with probability at least
1 − 6l exp(−γ̃),
Therefore, by choosing $\gamma = -\log(6l) + \tilde\gamma$, we know from (A.14) and the upper bound on $r_*$ in (A.10) that
$$\bar r \le \frac{8}{c_1}\left(K\sqrt{C}\sqrt{\frac{WL\log W}{n}\log n} + \sqrt{\frac{2C_\ell^2(\log\log n + \gamma)}{n}}\right) + \frac{2(c_2 \vee 1)}{c_1}\,\epsilon_n + \frac{120MC_\ell}{c_1}\,\frac{\log\log n + \gamma}{n} + r_*$$
$$\le C\left(\sqrt{\frac{WL\log W}{n}\log n} + \sqrt{\frac{\log\log n + \gamma}{n}} + \epsilon_n\right), \qquad (A.17)$$
with some constant C > 0 that does not depend on n. This completes the proof of Theo-
rem 2.
FIGURE 5.—Illustration of how to embed a feedforward network into a multilayer perceptron, with auxiliary
hidden nodes (shown in dark grey).
PROOF: The idea is illustrated in Figure 5. For the edges in the directed graph of
f ∈ FDNN that connect nodes not in adjacent layers (shown in yellow in Figure 5), one
can insert auxiliary hidden units in order to simply “pass forward” the information. The
number of such auxiliary “pass forward units” is at most the number of offending edges
times the depth L (i.e., for each edge, at most L auxiliary nodes are required), and this is
bounded by W L. Therefore the width of the MLP network that subsumes the original is
upper bounded by W L + U while still maintaining the required embedding that for any
fθ ∈ FDNN , there is a gθ ∈ FMLP such that gθ = fθ . In order to match modern practice,
we only need to show that the auxiliary units can be implemented with ReLU activation. This can be done by setting the constant ("bias") term b of each auxiliary unit large enough to ensure $\sigma(\tilde x'w + b) = \tilde x'w + b$, where $\tilde x$ denotes the input covariates, and then subtracting the same b in the last receiving unit along the path. Q.E.D.
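Purely as a numerical sketch of the bias-shift trick in the last step (a toy computation with arbitrary weights, not the paper's implementation):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# "Pass-forward" unit: for inputs with |x'w| <= B, choosing b >= B makes the ReLU
# affine on the relevant range, so relu(x'w + b) = x'w + b, and subtracting the same
# b in the downstream receiving unit recovers x'w exactly.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(1000, 3))   # covariates in [-1, 1]^3
w = np.array([0.5, -2.0, 1.0])

B = np.abs(x @ w).max()                      # bound on |x'w| over the sample
b = B + 1.0                                  # any b >= B works

passed_forward = relu(x @ w + b) - b         # auxiliary unit, then undo the shift
assert np.allclose(passed_forward, x @ w)    # the signal is transmitted unchanged
print("max pass-forward error:", np.abs(passed_forward - x @ w).max())
```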
Next, we give two properties of the Rademacher complexity (see Mendelson, 2003).
LEMMA 3—Dudley's Chaining: Let $N(\delta, \mathcal{F}, \|\cdot\|_n)$ denote the covering number for the class $\mathcal{F}$ (with covering radius $\delta$ and metric $\|\cdot\|_n$). Then
$$E_\eta\bigl[R_n\{f : f\in\mathcal{F},\ \|f\|_n \le r\}\bigr] \le \inf_{0<\alpha<r}\left\{4\alpha + \frac{12}{\sqrt{n}}\int_{\alpha}^{r}\sqrt{\log N\bigl(\delta, \mathcal{F}, \|\cdot\|_n\bigr)}\,d\delta\right\}.$$
Furthermore, because $\|f\|_n \le \max_i|f(x_i)|$, we have $N(\delta, \mathcal{F}, \|\cdot\|_n) \le N(\delta, \mathcal{F}|_{x_1,\dots,x_n}, \|\cdot\|_\infty)$, and so the upper bound in the conclusion also holds with $N(\delta, \mathcal{F}|_{x_1,\dots,x_n}, \|\cdot\|_\infty)$, where $\mathcal{F}|_{x_1,\dots,x_n}$ is the class $\mathcal{F}$ projected onto the data.
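As a toy sanity check on Lemma 3 (not used in the proofs), for a finite class the covering number is bounded by the cardinality of the class, so the chaining bound can be evaluated directly and compared with a Monte Carlo estimate of the empirical Rademacher complexity; the class below is an arbitrary construction of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, n_draws = 400, 30, 2000

# Finite class: K fixed functions recorded by their values at the n sample points,
# rescaled so that ||f||_n = sqrt( (1/n) sum_i f(x_i)^2 ) = r for every f.
r = 1.0
F = rng.normal(size=(K, n))
F *= r / np.sqrt((F ** 2).mean(axis=1, keepdims=True))

# Direct Monte Carlo estimate of E_eta R_n { f : ||f||_n <= r }.
eta = rng.choice([-1.0, 1.0], size=(n_draws, n))
mc = (eta @ F.T / n).max(axis=1).mean()

# Chaining bound: N(delta, F, ||.||_n) <= K for every delta, so letting alpha -> 0
# in Lemma 3 gives E_eta R_n <= (12 r / sqrt(n)) * sqrt(log K).
dudley = 12.0 * r / np.sqrt(n) * np.sqrt(np.log(K))

print(f"Monte Carlo estimate: {mc:.4f}   chaining bound: {dudley:.4f}")
```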
The next two results, Theorems 12.2 and 14.1 in Anthony and Bartlett (1999), show that
the metric entropy may be bounded in terms of the pseudo-dimension and that the latter
is bounded by the Vapnik-Chervonenkis (VC) dimension.
The following symmetrization lemma bounds the empirical process term using Rademacher complexity, and is thus a crucial piece of our localization. This is a standard result based on Talagrand's concentration, but here special care is taken with the dependence on the variance.
When bounding the complexity of FDNN , we use the following result. Bartlett et al.
(2017) also verify these bounds for the VC-dimension.
LEMMA 6—Theorem 6 in Bartlett et al. (2017), ReLU case: Consider a ReLU network architecture $\mathcal{F} = \mathcal{F}_{\mathrm{DNN}}(W, L, U)$; then the pseudo-dimension is sandwiched as
For the MLP, we use the following approximation result, Yarotsky (2017) Theorem 1.
LEMMA 7: There exists a network class $\mathcal{F}_{\mathrm{DNN}}$, with ReLU activation, such that for any $\epsilon > 0$:
(a) $\mathcal{F}_{\mathrm{DNN}}$ approximates $W^{\beta,\infty}([-1,1]^d)$ in the sense that, for any $f_* \in W^{\beta,\infty}([-1,1]^d)$, there exists an $f_n(\epsilon) := f_n \in \mathcal{F}_{\mathrm{DNN}}$ such that $\|f_n - f_*\|_\infty \le \epsilon$;
(b) $\mathcal{F}_{\mathrm{DNN}}$ has $L(\epsilon) \le C\cdot(\log(1/\epsilon) + 1)$ and $W(\epsilon), U(\epsilon) \le C\cdot\epsilon^{-d/\beta}(\log(1/\epsilon) + 1)$.
Here C only depends on d and β.
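The rates in Lemma 7 stem from Yarotsky's construction, whose basic building block approximates $x \mapsto x^2$ by composing "hat" functions, each representable exactly with a few ReLU units, so that the approximation error decays exponentially in depth. The following numerical sketch (for illustration only) checks that building block:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat(x):
    # Triangle ("hat") function on [0, 1], written exactly with three ReLU units:
    # 2*relu(x) - 4*relu(x - 1/2) + 2*relu(x - 1).
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def sq_approx(x, m):
    # Yarotsky (2017): f_m(x) = x - sum_{s=1}^m g_s(x) / 4^s, where g_s is the s-fold
    # composition of the hat function, satisfies |f_m(x) - x^2| <= 4^{-(m+1)} on [0, 1].
    # Each extra composition adds a fixed number of ReLU layers, so error decays
    # exponentially in depth.
    out, g = x.copy(), x.copy()
    for s in range(1, m + 1):
        g = hat(g)
        out -= g / 4.0 ** s
    return out

x = np.linspace(0.0, 1.0, 10001)
for m in (2, 4, 6):
    err = np.abs(sq_approx(x, m) - x ** 2).max()
    print(f"m = {m}: max error = {err:.2e}  (bound 4^-(m+1) = {4.0 ** -(m + 1):.2e})")
```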
For completeness, we verify the requirements on the loss functions, equation (2.1), for
several examples. We first treat least squares and logistic losses, in slightly more detail, as
these are used in our subsequent inference results and empirical application.
LEMMA 8: Both the least squares (2.2) and logistic (2.3) loss functions obey the requirements of Equation (2.1). For least squares, $c_1 = c_2 = 1/2$ and $C_\ell = M$. For logistic regression, $c_1 = \bigl(2(\exp(M) + \exp(-M) + 2)\bigr)^{-1}$, $c_2 = 1/8$, and $C_\ell = 1$.
PROOF: The Lipschitz conditions are trivial. For least squares, using iterated expecta-
tions
$$2E\bigl[\ell(f, Z)\bigr] - 2E\bigl[\ell(f_*, Z)\bigr] = E\bigl[-2Yf + f^2 + 2Yf_* - f_*^2\bigr] = E\bigl[-2f_*f + f^2 + 2f_*^2 - f_*^2\bigr] = E\bigl[(f - f_*)^2\bigr].$$
For logistic regression,
$$E\bigl[\ell(f, Z)\bigr] - E\bigl[\ell(f_*, Z)\bigr] = E\left[-\frac{\exp(f_*)}{1+\exp(f_*)}(f - f_*) + \log\frac{1+\exp(f)}{1+\exp(f_*)}\right] = E\bigl[h_{f_*}(f)\bigr],$$
where $h_a(b) := -\frac{\exp(a)}{1+\exp(a)}(b - a) + \log\frac{1+\exp(b)}{1+\exp(a)}$, so that $h_a(a) = 0$ and $h_a'(a) = 0$. A second-order expansion gives, for some $\xi \in [0, 1]$,
$$h_a(b) = h_a(a) + h_a'(a)(b - a) + \frac{1}{2}h_a''\bigl(\xi a + (1-\xi)b\bigr)(b - a)^2,$$
and $h_a''(b) = \frac{1}{\exp(b)+\exp(-b)+2} \le \frac{1}{4}$. The lower bound holds as $|\xi f_* + (1-\xi)f| \le M$. Q.E.D.
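These constants are easy to check by simulation; in the sketch below, the bound M, the functions f and $f_*$, and the Bernoulli outcome model are arbitrary illustrative choices, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 2.0                                    # sup-norm bound on the index functions
n = 200_000

# Simulated design: f and f_star are arbitrary bounded functions evaluated at X,
# and Y | X is Bernoulli with log-odds f_star(X), so f_star is the population minimizer.
x = rng.uniform(-1.0, 1.0, size=n)
f_star = M * np.sin(3 * x)
f = np.clip(f_star + rng.uniform(-1.0, 1.0, size=n), -M, M)

def logistic_risk(g, f_star):
    # E[ ell(g, Z) ] with ell(g, z) = -y g(x) + log(1 + exp(g(x))), using E[Y|X] = p_star.
    p_star = 1.0 / (1.0 + np.exp(-f_star))
    return np.mean(-p_star * g + np.log1p(np.exp(g)))

excess = logistic_risk(f, f_star) - logistic_risk(f_star, f_star)
l2sq = np.mean((f - f_star) ** 2)

c1 = 1.0 / (2.0 * (np.exp(M) + np.exp(-M) + 2.0))   # constants from Lemma 8
c2 = 1.0 / 8.0
print(c1 * l2sq <= excess <= c2 * l2sq)              # expected: True
```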
Beyond least squares and logistic regression, we give three further examples, discussed
in the general language of generalized linear models. Note that in the final example we
move beyond a simple scalar outcome.
LEMMA 9: For a convex function $g(\cdot): \mathbb{R}\to\mathbb{R}$, consider the generalized linear loss function $\ell(f, z) = -y\,f(x) + g(f(x))$. The curvature and Lipschitz conditions in (2.1) hold for the specific choices of $g(\cdot)$ below. In each case, the loss function corresponds to the negative log-likelihood function.
(a) Poisson: $g(t) = \exp(t)$, with $f_*(x) = \log E[y|X = x]$.
(b) Gamma: $g(t) = -\log t$, with $f_*(x) = -(E[y|X = x])^{-1}$.
(c) Multinomial logistic, $K + 1$ classes: $g(t) = \log\bigl(1 + \sum_{k\in K}\exp(t^{[k]})\bigr)$, with
$$\exp\bigl(f_*^{[k]}(x)\bigr)\Big/\Bigl(1 + \sum_{k'\in K}\exp\bigl(f_*^{[k']}(x)\bigr)\Bigr) = E\bigl[y^{[k]}\mid X = x\bigr].$$
PROOF: Denote by $\nabla g$ and $\mathrm{Hessian}[g]$ the gradient and Hessian of the convex function g. By the convexity of g, the optimal $f_*$ satisfies $E[\partial\ell(f_*, Z)/\partial f \mid X = x] = 0$, which implies $\nabla g(f_*) = E[Y|X = x]$. If $2c_0 \preceq \mathrm{Hessian}[g(f)] \preceq 2c_1$ for all f of interest, then the curvature condition in (2.1) holds, because
$$E\bigl[\ell(f, Z)\bigr] - E\bigl[\ell(f_*, Z)\bigr] = E\Bigl[-\nabla g(f_*)'(f - f_*) + g(f) - g(f_*)\Bigr] = \frac{1}{2}E\Bigl[(f - f_*)'\,\mathrm{Hessian}\bigl[g(\tilde f)\bigr](f - f_*)\Bigr] \ge c_0\,E\|f - f_*\|^2,$$
and the parallel argument gives $\le c_1\,E\|f - f_*\|^2$. The Lipschitz condition is equivalent to $\|\nabla g(f)\| \le C_\ell$ for all f of interest, with bounded Y.
For our three examples in particular, we have the following.
$$\frac{1}{\bigl(1 + K\exp(M)\bigr)^2} \le \lambda\bigl(\mathrm{Hessian}\bigl[g(f)\bigr]\bigr) \le \frac{\exp(M)}{1 + (K - 1)\exp(-M) + \exp(M)}.$$
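For the Poisson case (a), for instance, $\mathrm{Hessian}[g(f)] = \exp(f) \in [\exp(-M), \exp(M)]$ whenever $|f| \le M$, so the argument above applies with $c_0 = \exp(-M)/2$ and $c_1 = \exp(M)/2$; the following simulated check (with an arbitrary design of ours) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(4)
M, n = 1.5, 200_000

# Poisson GLM loss ell(f, z) = -y f(x) + exp(f(x)); conditionally on X, Y has mean
# exp(f_star(X)), so E[ ell(f, Z) | X ] = -exp(f_star) f + exp(f).
x = rng.uniform(-1.0, 1.0, size=n)
f_star = M * np.cos(2 * x)
f = np.clip(f_star + rng.normal(scale=0.5, size=n), -M, M)

def poisson_risk(g, f_star):
    return np.mean(-np.exp(f_star) * g + np.exp(g))

excess = poisson_risk(f, f_star) - poisson_risk(f_star, f_star)
l2sq = np.mean((f - f_star) ** 2)

c0, c1 = np.exp(-M) / 2.0, np.exp(M) / 2.0   # curvature bounds since exp(-M) <= g'' <= exp(M)
print(c0 * l2sq <= excess <= c1 * l2sq)      # expected: True
```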
Our last result is to verify condition (c) of Theorem 3. We do so using our localization,
which may be of future interest in second-step inference with machine learning methods.
PROOF: Without loss of generality, we can take $\bar p < 1/2$. The only estimated function here is $\mu_t(x)$, which plays the role of $f_*$. For function(als) $L(\cdot)$ of the form
$$L(f) := \bigl(f(x_i) - f_*(x_i)\bigr)\left(1 - \frac{1\{t_i = t\}}{P[T = t\mid X = x_i]}\right),$$
it is true that
$$E\bigl[L(f)\bigr] = E\left[\bigl(f(X) - f_*(X)\bigr)\left(1 - \frac{E\bigl[1\{t_i = t\}\mid x_i\bigr]}{P[T = t\mid X = x_i]}\right)\right] = 0$$
and
$$V\bigl[L(f)\bigr] \le (1/\bar p - 1)^2\,E\bigl[\bigl(f(X) - f_*(X)\bigr)^2\bigr] \le (1/\bar p - 1)^2\,\bar r^2, \qquad \bigl|L(f)\bigr| \le (1/\bar p - 1)\,2M,$$
where the first line is due to $\bar r > r_*$, and the second line uses Lemma 2.
Then, by the localization analysis and Lemma 5, for all $f\in\mathcal{F}$ with $\|f - f_*\|_{L^2(X)} \le \bar r$, $L(f)$ obeys
$$E_n\bigl[L(f)\bigr] = E_n\bigl[L(f)\bigr] - E\bigl[L(f)\bigr]$$
With probability at least $1 - \exp\bigl(-n^{\frac{d}{\beta+d}}\log^8 n\bigr)$, $\hat f_{\mathrm{MLP}}$ lies in this set of functions, and therefore
$$E_n\bigl[L(\hat f_{\mathrm{MLP}})\bigr] = E_n\left[\bigl(\hat f_{\mathrm{MLP}}(x) - f_*(x)\bigr)\left(1 - \frac{1(T = t)}{P(T = t\mid X = x)}\right)\right] \le C\cdot\left(n^{-\frac{\beta}{\beta+d}}\log^8 n + \frac{\log\log n}{n}\right),$$
as claimed. Q.E.D.
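The zero-mean property of $L(\cdot)$ and the variance bound above are transparent in simulation; the data-generating process below (the propensity score, outcome regression, and candidate f) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000

# Illustrative design (ours, not from the paper): known propensity score
# p(x) = P[T = t | X = x], bounded below by p_bar < 1/2, outcome regression
# mu_t = f_star, and an arbitrary candidate f close to f_star.
x = rng.uniform(-1.0, 1.0, size=n)
p = 0.25 + 0.5 / (1.0 + np.exp(-2.0 * x))
t = rng.binomial(1, p)                          # the indicator 1{T = t}
f_star = np.sin(np.pi * x)
f = f_star + 0.3 * x ** 2

# L(f) = (f(x_i) - f_star(x_i)) * (1 - 1{t_i = t} / P[T = t | X = x_i])
L = (f - f_star) * (1.0 - t / p)

p_bar = p.min()
print("sample mean of L(f):", L.mean())         # E[L(f)] = 0, so this is close to zero
print("variance bound holds:",
      L.var() <= (1.0 / p_bar - 1.0) ** 2 * np.mean((f - f_star) ** 2))
```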
REFERENCES
ANTHONY, M., AND P. L. BARTLETT (1999): Neural Network Learning: Theoretical Foundations. Cambridge University Press. [182,207]
ATHEY, S., G. W. IMBENS, J. METZGER, AND E. M. MUNRO (2019): “Using Wasserstein Generative Adversarial
Networks for the Design of Monte Carlo Simulations,” arXiv:1909.02210. Preprint. [198]
ATHEY, S., G. W. IMBENS, AND S. WAGER (2018): “Approximate Residual Balancing: Debiased Inference of
Average Treatment Effects in High Dimensions,” Journal of the Royal Statistical Society, Series B, 80, 597–623.
[183,195]
BACH, F. (2017): “Breaking the Curse of Dimensionality With Convex Neural Networks,” The Journal of Ma-
chine Learning Research, 18, 629–681. [184,192,198]
BARTLETT, P. L., O. BOUSQUET, AND S. MENDELSON (2005): “Local Rademacher Complexities,” The Annals
of Statistics, 33, 1497–1537. [189,200,208]
BARTLETT, P. L., N. HARVEY, C. LIAW, AND A. MEHRABIAN (2017): “Nearly-Tight VC-Dimension Bounds
for Piecewise Linear Neural Networks,” in Proceedings of the 22nd Annual Conference on Learning Theory
(COLT 2017). [184,189,193,208]
BAUER, B., AND M. KOHLER (2019): “On Deep Learning as a Remedy for the Curse of Dimensionality in
Nonparametric Regression,” Annals of Statistics, 47, 2261–2285. [184,192]
BELLONI, A., V. CHERNOZHUKOV, I. FERNÁNDEZ-VAL, AND C. HANSEN (2017): “Program Evaluation and
Causal Inference With High-Dimensional Data,” Econometrica, 85, 233–298. [183,193]
BELLONI, A., V. CHERNOZHUKOV, AND C. HANSEN (2014): “Inference on Treatment Effects After Selection
Amongst High-Dimensional Controls,” Review of Economic Studies, 81, 608–650. [183,184,195]
BICKEL, P. J., AND Y. RITOV (1988): “Estimating Integrated Squared Density Derivatives: Sharp Best Order
of Convergence Estimates,” Sankhyā, 50, 381–393. [195]
BICKEL, P. J., Y. RITOV, AND A. B. TSYBAKOV (2009): “Simultaneous Analysis of LASSO and Dantzig Selec-
tor,” The Annals of Statistics, 37, 1705–1732. [183]
CATTANEO, M. D., M. JANSSON, AND X. MA (2019): “Two-Step Estimation and Inference With Possibly Many
Included Covariates,” Review of Economic Studies, 86, 1095–1122. [195]
CHEN, X., AND H. WHITE (1999): “Improved Rates and Asymptotic Normality for Nonparametric Neural
Network Estimators,” IEEE Transactions on Information Theory, 45, 682–691. [182]
CHEN, X., H. HONG, AND A. TAROZZI (2008): “Semiparametric Efficiency in GMM Models With Auxiliary
Data,” The Annals of Statistics, 36, 808–843. [195]
CHERNOZHUKOV, V., D. CHETVERIKOV, M. DEMIRER, E. DUFLO, C. HANSEN, W. NEWEY, AND J. ROBINS
(2018): “Double/Debiased Machine Learning for Treatment and Structural Parameters,” The Econometrics
Journal, 21, C1–C68. [184,193-195]
DANIELY, A. (2017): “Depth Separation for Neural Networks,” in Proceedings of the 22nd Annual Conference
on Learning Theory (COLT 2017). [188]
DOU, X., AND T. LIANG (2020): “Training Neural Networks as Learning Data-Adaptive Kernels: Provable
Representation and Approximation Benefits,” Journal of the American Statistical Association, (Forthcoming).
[198]
FARRELL, M. H. (2015): “Robust Inference on Average Treatment Effects With Possibly More Covariates
Than Observations,” Journal of Econometrics, 189, 1–23. arXiv:1309.4686. [183,184,194,195]
FARRELL, M. H., T. LIANG, AND S. MISRA (2019a): “Deep Neural Networks for Estimation and Inference,”
arXiv:1809.09953. [193]
(2021): “Supplement to ‘Deep Neural Networks for Estimation and Inference’,” Econometrica Sup-
plemental Material, 89, https://doi.org/10.3982/ECTA16901. [197]
GHORBANI, B., S. MEI, T. MISIAKIEWICZ, AND A. MONTANARI (2019): “Discussion of ‘Nonparametric Regres-
sion Using Deep Neural Networks With ReLU Activation Function’,” Annals of Statistics, (Forthcoming).
[192]
GINÉ, E., AND R. NICKL (2016): Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press. [189]
GOODFELLOW, I., Y. BENGIO, AND A. COURVILLE (2016): Deep Learning. Cambridge: MIT Press. [185]
HAHN, J. (1998): “On the Role of the Propensity Score in Efficient Semiparametric Estimation of Average
Treatment Effects,” Econometrica, 66, 315–331. [194]
HANIN, B. (2019): “Universal Function Approximation by Deep Neural Nets With Bounded Width and ReLU Activations,” Mathematics, 7, 992. [192]
HARTFORD, J., G. LEWIS, K. LEYTON-BROWN, AND M. TADDY (2017): “Deep IV: A Flexible Approach for Counterfactual Prediction,” in International Conference on Machine Learning, 1414–1423. [183,192]
HITSCH, G. J., AND S. MISRA (2018): “Heterogeneous Treatment Effects and Optimal Targeting Policy Evalu-
ation,” SSRN preprint 3111957. [183,196]
JAVANMARD, A., AND A. MONTANARI (2014): “Confidence Intervals and Hypothesis Testing for High-
Dimensional Regression,” The Journal of Machine Learning Research, 15, 2869–2909. [184]
KINGMA, D. P., AND J. BA (2014): “Adam: A Method for Stochastic Optimization,” arXiv:1412.6980. Preprint.
[182]
KOLTCHINSKII, V. (2006): “Local Rademacher Complexities and Oracle Inequalities in Risk Minimization,”
The Annals of Statistics, 34, 2593–2656. [189]
(2011): Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer-
Verlag. [189-191,200]
KOLTCHINSKII, V., AND D. PANCHENKO (2000): “Rademacher Processes and Bounding the Risk of Function
Learning,” in High Dimensional Probability II. Springer, 443–457. [189,200]
KUTYNIOK, G. (2019): “Discussion of ‘Nonparametric Regression Using Deep Neural Networks With ReLU
Activation Function’,” Annals of Statistics, (Forthcoming). [192]
LECUN, Y., L. BOTTOU, Y. BENGIO, AND P. HAFFNER (1998): “Gradient-Based Learning Applied to Docu-
ment Recognition,” Proceedings of the IEEE, 86, 2278–2324. [182]
LIANG, T. (2018): “On How Well Generative Adversarial Networks Learn Densities: Nonparametric and Para-
metric Results,” arXiv:1811.03179. [198]
LIANG, T., A. RAKHLIN, AND K. SRIDHARAN (2015): “Learning With Square Loss: Localization Through Offset
Rademacher Complexity,” in Conference on Learning Theory, 1260–1285. [189,200]
MA, X., AND J. WANG (2018): “Robust Inference Using Inverse Probability Weighting,” arXiv:1810.11397.
Preprint. [195]
MENDELSON, S. (2003): “A few Notes on Statistical Learning Theory,” in Advanced Lectures on Machine Learn-
ing. Springer, 1–40. [207]
(2014): “Learning Without Concentration,” in Conference on Learning Theory, 25–39. [200]
MHASKAR, H. N., AND T. POGGIO (2016): “Deep vs. Shallow Networks: An Approximation Theory Perspec-
tive,” Analysis and Applications, 14, 829–848. [192]
NAIR, V., AND G. E. HINTON (2010): “Rectified Linear Units Improve Restricted Boltzmann Machines,” in
Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807–814. [182]
POLSON, N. G., AND V. ROČKOVÁ (2018): “Posterior Concentration for Sparse Deep Learning,” in Advances
in Neural Information Processing Systems, 930–941. [192]
ROBINS, J., E. T. TCHETGEN, L. LI, AND A. VAN DER VAART (2009): “Semiparametric Minimax Rates,” Elec-
tronic Journal of Statistics, 3, 1305–1321. [195]
SCHMIDT-HIEBER, J. (2019): “Nonparametric Regression Using Deep Neural Networks With ReLU Activation
Function,” Annals of Statistics, arXiv:1708.06633. (Forthcoming). [183,184,191,192]
SHALIT, U., F. D. JOHANSSON, AND D. SONTAG (2017): “Estimating Individual Treatment Effect: Generalization Bounds and Algorithms,” in Proceedings of the 34th International Conference on Machine Learning, 3076–3085. [183]
SHAMIR, O. (2019): “Discussion of ‘Nonparametric Regression Using Deep Neural Networks With ReLU
Activation Function’,” Annals of Statistics, (Forthcoming). [192]
STONE, C. J. (1982): “Optimal Global Rates of Convergence for Nonparametric Regression,” The Annals of
Statistics, 10, 1040–1053. [190,191]
TAN, Z. (2020): “Model-Assisted Inference for Treatment Effects Using Regularized Calibrated Estimation
With High-Dimensional Data,” The Annals of Statistics, 48, 811–837. [195]
TELGARSKY, M. (2017): “Benefits of Depth in Neural Networks,” in 29th Annual Conference on Learning Theory, 1517–1539. [188]
VAN DE GEER, S., P. BUHLMANN, Y. RITOV, AND R. DEZEURE (2014): “On Asymptotically Optimal Confi-
dence Regions and Tests for High-Dimensional Models,” The Annals of Statistics, 42, 1166–1202. [184]
WAGER, S., AND S. ATHEY (2018): “Estimation and Inference of Heterogeneous Treatment Effects Using
Random Forests,” Journal of the American Statistical Association, 113, 1228–1242. [183]
WESTREICH, D., J. LESSLER, AND M. J. FUNK (2010): “Propensity Score Estimation: Neural Networks, Support
Vector Machines, Decision Trees (CART), and Meta-Classifiers as Alternatives to Logistic Regression,”
Journal of Clinical Epidemiology, 63, 826–833. [183]
WHITE, H. (1992): Artificial Neural Networks: Approximation and Learning Theory. Blackwell Publishers, Inc.
[182]
YAROTSKY, D. (2017): “Error Bounds for Approximations With Deep ReLU Networks,” Neural Networks, 94,
103–114. [184,189-191,206,208]
(2018): “Optimal Approximation of Continuous Functions by Very Deep ReLU Networks,” in 31st Annual Conference on Learning Theory, 639–649. [184,189,192,206]
ZHANG, C., S. BENGIO, M. HARDT, B. RECHT, AND O. VINYALS (2016): “Understanding Deep Learning
Requires Rethinking Generalization,” arXiv:1611.03530. Preprint. [188,189]